Writing a crawler - thinking about the whole internet

Can I crawl the whole internet using a home internet connection and a few PCs? Can it be done in a reasonable amount of time, without buying a million hard drives and computers? How big is this problem? Let's assume that we want to do the simplest type of analysis: pull every page on the internet, grab the URLs, then delete the file and crawl some more. The HTML only needs temporary storage, so the only thing we have to keep is the URLs.
Recently, Google celebrated the indexing of its trillionth web page (2008, I think). So, let's say that there are 1T URLs that we need to store. Are those unique pages, or 1T pages including dupes? For each URL, we'd need to store its return code (200, 404, …). Normally we'd also store the date/time last fetched, Last-Modified, MIME type, etc., but for this estimate we're just going to store the URL.
If we store the URLs in flat files, one file per domain, we can cut off the domain part of each URL (e.g. http://mydomain.com/) and store just the path and filename. Making a wild guess at the average URL length across the entire internet, let's say it's somewhere between 40 and 100 characters. We'll choose 80 chars as a stab in the dark. 1T URLs * 80 chars = 80TB worth of URLs. If we instead store all of the URLs in a key/value database that supports compression (Tokyo Cabinet) and use the highest compression possible, we might be able to get the storage down to below 10TB. That's possible with a single PC for a couple of grand. We'd then have a machine capable of storing every URL on the internet (or close to it).
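
Here's a rough Python sketch of the storage guess above, just to make the arithmetic explicit. The helper names (split_url, url_storage_tb) are made up for illustration, and it assumes one byte per character and decimal terabytes; the 1T and 80-char figures are the same guesses as in the text.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into (domain, rest) so a per-domain file only stores the rest."""
    parts = urlsplit(url)
    rest = parts.path or "/"
    if parts.query:
        rest += "?" + parts.query
    return parts.netloc, rest


def url_storage_tb(num_urls, avg_chars):
    """Raw storage in decimal terabytes, assuming one byte per character."""
    return num_urls * avg_chars / 1e12


raw_tb = url_storage_tb(10**12, 80)   # 1T URLs at a guessed 80 chars each
ratio_needed = raw_tb / 10            # compression ratio needed to fit in 10 TB

print(split_url("http://mydomain.com/some/page.html?id=1"))
print(f"raw: {raw_tb:.0f} TB, compression needed for 10 TB: {ratio_needed:.0f}x")
```

So getting under 10TB means squeezing the URLs by about 8x. Whether any real compressor gets that ratio on URL paths is something I haven't measured.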
Downloading it all? Can I actually download the entire internet on a home internet connection? Assume the same 1 trillion web pages, and that the average web page size has grown to about 25k (see: http://www.optimizationweek.com/reviews/average-web-page/). Doing the math, that's 25 petabytes over a 6Mbit connection. Wolfram Alpha says that if I could max out my home internet connection, it would take 1057 years. Turning on compression would help, but I doubt that many servers actually use it, so let's say 1000 years. That's not really doable until we get gigabit fiber to the door (about 6.3 years at full speed) or something much faster.
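
The download-time math can be checked with a few lines of Python. This is only a sketch: download_years is a made-up helper, and it assumes "25k" means 25,000 bytes, a 365-day year, a link that runs flat out the whole time, and no protocol overhead or compression.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def download_years(num_pages, avg_page_bytes, link_bits_per_sec):
    """Years needed to pull num_pages over a fully saturated link."""
    total_bits = num_pages * avg_page_bytes * 8
    return total_bits / link_bits_per_sec / SECONDS_PER_YEAR


pages = 10**12        # 1 trillion pages
page_bytes = 25_000   # assumption: "25k" taken as 25,000 bytes

home = download_years(pages, page_bytes, 6e6)    # 6 Mbit home connection
fiber = download_years(pages, page_bytes, 1e9)   # gigabit fiber

print(f"6 Mbit: {home:.0f} years, 1 Gbit: {fiber:.1f} years")
```

Under those assumptions it comes out at roughly 1057 years on the 6Mbit line and a little over 6 years on gigabit.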
So, I guess that crawling the whole internet is not really possible for me at this point. I can find a niche, however, and crawl a subset of the internet. Perhaps I'll do that. :)
- Steve's blog