How Web Links Are Crawled, Tuesday, March 25, 2008
   
 

Running a web crawler is a challenging task. There are tricky performance and reliability issues and, even more importantly, there are social issues. Crawling is the most fragile application, since it involves interacting with hundreds of thousands of web servers and various name servers, all of which are beyond the control of the system.

To scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers. Both the URLserver and the crawlers are implemented in Python. Each crawler keeps more than 250 connections open at once, which is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600 KB of data per second. A major performance bottleneck is DNS lookup. Each crawler maintains its own DNS cache, so it does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.
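The design described above can be sketched in a few lines of modern Python. This is only an illustration, not Google's code: the names `crawl`, `system_resolve`, `State` and the `trace` list are made up for this example, `asyncio` coroutines stand in for the explicit per-state queues, and the request is a bare HTTP/1.0 GET with no redirects, HTTPS or error handling. It shows a per-crawler DNS cache, a cap on open connections, and the four connection states each fetch passes through.

```python
import asyncio
from enum import Enum
from urllib.parse import urlsplit


class State(Enum):
    """The four connection states listed in the text."""
    DNS = "looking up DNS"
    CONNECT = "connecting to host"
    SEND = "sending request"
    RECEIVE = "receiving response"


async def system_resolve(host):
    """Resolve a hostname with the event loop's non-blocking resolver."""
    loop = asyncio.get_running_loop()
    infos = await loop.getaddrinfo(host, 80)
    return infos[0][4][0]  # address part of the first result


async def crawl(urls, resolve=system_resolve, connect=asyncio.open_connection,
                max_connections=250):
    """Fetch every URL over plain HTTP; returns (pages, trace, dns_cache)."""
    dns_cache = {}  # hostname -> lookup task, private to this crawler
    trace = []      # (url, State) pairs, appended as each fetch changes state
    slots = asyncio.Semaphore(max_connections)  # cap on open connections

    def cached_lookup(host):
        # Cache the lookup task itself, so concurrent fetches for the same
        # host share one DNS query instead of each issuing their own.
        if host not in dns_cache:
            dns_cache[host] = asyncio.ensure_future(resolve(host))
        return dns_cache[host]

    async def fetch(url):
        parts = urlsplit(url)
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        async with slots:
            trace.append((url, State.DNS))
            ip = await cached_lookup(parts.hostname)

            trace.append((url, State.CONNECT))
            reader, writer = await connect(ip, parts.port or 80)
            try:
                trace.append((url, State.SEND))
                request = (f"GET {target} HTTP/1.0\r\n"
                           f"Host: {parts.hostname}\r\n\r\n")
                writer.write(request.encode("ascii"))
                await writer.drain()

                trace.append((url, State.RECEIVE))
                # HTTP/1.0: the server closes the connection when done.
                return url, await reader.read()
            finally:
                writer.close()

    pages = dict(await asyncio.gather(*(fetch(u) for u in urls)))
    return pages, trace, dns_cache
```

Calling `asyncio.run(crawl(["http://example.com/"]))` would fetch one page over the real network. The `resolve` and `connect` parameters exist only so the sketch can be run with stand-ins instead of live servers.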

It turns out that running a crawler which connects to more than half a million servers and generates tens of millions of log entries also generates a fair amount of email. Because of the vast number of people coming online, there are always those who do not know what a crawler is, because this is the first one they have seen. There are also some people who do not know about the robots exclusion protocol and think their page should be protected from indexing by a statement like, "This page is copyrighted and should not be indexed", which, needless to say, is difficult for web crawlers to understand. Also, because of the vast amount of data involved, unexpected things will happen. For example, our system tried to crawl an online game. This resulted in plenty of junk messages in the middle of their game. It turns out this was an easy problem to fix, but it had not come up until we had downloaded tens of millions of pages. Because of the vast variation in web pages and servers, it is virtually impossible to test a crawler without running it on a large part of the net. Regularly, there are hundreds of obscure problems which may occur on only one page out of the whole web and cause the crawler to crash or, worse, cause incorrect behavior. Systems which access large parts of the net need to be designed to be very robust and tested carefully. Since large complex systems such as crawlers will regularly cause problems, significant resources need to be devoted to reading the email and solving these problems as they come up.
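To make the point about the robots exclusion protocol concrete, here is a small sketch using `urllib.robotparser` from the Python standard library. The helper name `allowed`, the user agent `ExampleBot` and the `example.com` URLs are made up for illustration. It shows that a crawler acts on `User-agent` and `Disallow` rules, while a plain-language notice gives the parser nothing to act on.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that a crawler can actually interpret.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# A plain-language notice, like the one quoted above. It contains no rules.
PROSE_NOTICE = "This page is copyrighted and should not be indexed"


def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Blocked by the Disallow rule.
    print(allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/private/page.html"))
    # Not covered by any rule, so it may be fetched.
    print(allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/index.html"))
    # The prose notice yields no rules, so nothing is blocked.
    print(allowed(PROSE_NOTICE, "ExampleBot", "http://example.com/private/page.html"))
```

A site owner who wants a section left alone therefore needs a `Disallow` line in `robots.txt` at the root of the site, since a sentence on the page itself is invisible to this mechanism.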

posted by Horshan @ 5:02 PM

 
