How the Web Is Crawled

Tuesday, March 25, 2008

Running a web crawler is a challenging task. There are tricky performance and reliability issues, and more importantly, there are social issues. Crawling is the most fragile part of a search engine, since it involves interacting with hundreds of thousands of web servers and various name servers, all of which are beyond the control of the system.

In order to scale to hundreds of millions of web pages, Google built a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers; both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 250 connections open at once, which is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers, which amounts to roughly 600 KB of data per second. A major performance stress is DNS lookup, so each crawler maintains its own DNS cache and does not need to do a fresh lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to the host, sending the request, and receiving the response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.
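The paragraph above names the moving parts without showing them together. The sketch below is not the original crawler; it is a minimal illustration of the same pattern in modern Python, assuming a plain asyncio event loop, hypothetical fetch_worker and crawl helpers, a shared dictionary standing in for the per-crawler DNS cache, and a single queue standing in for the several queues the real system used.

# A minimal sketch (not the original Google code) of the pattern described above:
# one URL source feeding many workers, a shared DNS cache so each host is
# resolved only once, and asynchronous I/O so one process can keep many
# connections in flight. Standard library only; all names are illustrative.
import asyncio
from urllib.parse import urlparse

MAX_WORKERS = 250          # roughly the per-crawler connection count cited above
dns_cache = {}             # hostname -> resolved IP, filled on first lookup

async def resolve(host):
    """Look up a host once and reuse the answer for later fetches."""
    if host not in dns_cache:
        loop = asyncio.get_running_loop()
        infos = await loop.getaddrinfo(host, 80)
        dns_cache[host] = infos[0][4][0]     # first address returned
    return dns_cache[host]

async def fetch(url):
    """Walk one fetch through its states: DNS, connect, send, receive."""
    parsed = urlparse(url)
    ip = await resolve(parsed.hostname)                      # looking up DNS
    reader, writer = await asyncio.open_connection(ip, 80)   # connecting to host
    path = parsed.path or "/"
    writer.write(f"GET {path} HTTP/1.0\r\nHost: {parsed.hostname}\r\n\r\n".encode())
    await writer.drain()                                      # sending request
    body = await reader.read()                                # receiving response
    writer.close()
    await writer.wait_closed()
    return body

async def fetch_worker(queue, results):
    """Pull URLs from the shared queue until a stop signal arrives."""
    while True:
        url = await queue.get()
        if url is None:
            queue.task_done()
            break
        try:
            results.append((url, len(await fetch(url))))
        except OSError:
            results.append((url, None))       # unreachable hosts are expected
        queue.task_done()

async def crawl(urls):
    """Play the role of the URLserver: hand URLs to a pool of workers."""
    queue, results = asyncio.Queue(), []
    workers = [asyncio.create_task(fetch_worker(queue, results))
               for _ in range(min(MAX_WORKERS, len(urls)))]
    for url in urls:
        queue.put_nowait(url)
    for _ in workers:
        queue.put_nowait(None)                # one stop signal per worker
    await asyncio.gather(*workers)
    return results

if __name__ == "__main__":
    print(asyncio.run(crawl(["http://example.com/"])))

A real crawler would also need per-host politeness delays, robots.txt checks, and persistent queues, which is exactly where the complexity described above comes from.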

It turns out that running a crawler which connects to more than half a million servers and generates tens of millions of log entries also generates a fair amount of email. Because of the vast number of people coming online, there are always those who do not know what a crawler is, because this is the first one they have seen. There are also some people who do not know about the robots exclusion protocol and think their page should be protected from indexing by a statement like, "This page is copyrighted and should not be indexed", which, needless to say, is difficult for web crawlers to understand. Because of the vast amount of data involved, unexpected things will happen. For example, the system once tried to crawl an online game, which resulted in plenty of junk messages in the middle of the game. It turned out to be an easy problem to fix, but it had not come up until tens of millions of pages had been downloaded. Because of the vast variation in web pages and servers, it is virtually impossible to test a crawler without running it on a large part of the net. Invariably, there are hundreds of obscure problems which may occur on only one page out of the whole web and cause the crawler to crash, or worse, cause incorrect behavior. Systems which access large parts of the net need to be designed to be very robust and carefully tested. Since large complex systems such as crawlers will regularly cause problems, significant resources need to be devoted to reading the email and solving these problems as they come up.
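For completeness, the robots exclusion protocol mentioned above is the machine-readable way to keep a page out of an index; a copyright sentence inside the page is invisible to a crawler. A minimal check, assuming Python's standard urllib.robotparser, a placeholder site, and a hypothetical user agent name, might look like this:

# Hedged illustration of a robots.txt check; the URL and agent are placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()                                        # fetch and parse robots.txt
print(rp.can_fetch("MyCrawler", "http://example.com/private/page.html"))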

posted by Horshan @ 5:02 PM
