
Want to know how search engines work?

Webber (2000) defines a search engine as a ‘searchable catalogue, database or directory of websites’ of a reasonably large size.

There is a myth that when you type in some keywords, the search engine goes out onto the web and looks for matching pages to return. This is not the case. Search engines continuously crawl the web, downloading millions of pages every day to place in their own custom databases. So when a person types certain keywords into the search box, the search engine returns pages that have already been indexed in that database. This is why the same query can return different results at different search engines. Search engines fall into two distinct categories:

  1. Crawler based search engines
  2. Human powered directories

For example, Yahoo! started as a human-powered directory, but after working with partners like Inktomi, AltaVista and Google, Yahoo! launched its own crawler-based search engine to compete with Google directly. Google, by contrast, is not a human-powered directory but a crawler-based engine that relies on technology to build its database. The term ‘search engine’ is used generically for everything that searches the web, but although they all use spiders/robots to build their indexes, each collects different information in different ways. The algorithm (the computer program that ranks search results) used by each major search engine is different for each specific service.

While the home page is considered important by search engine spiders, content and relevancy matter just as much. If the content on an internal site page is more relevant for a specific term than a competitor’s home page, that content-rich page is the one the search engine will return for the keyword search. Internal pages of a website can therefore attract more traffic than the home page, as they carry more relevant content, whereas a home page is normally very general. However, some search engines do not crawl deeply into a big site with many pages, so it is important to make the most relevant site pages easily reachable by search engines.

Researchers and search engines use different terms for particular components, but in basic terms a crawler-based search engine can be broken down into the following components:

➢ Crawler

➢ Indexer

➢ Query Handler
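To make that division a little more concrete, here is a minimal sketch in Python of how those three components might fit together. The class and method names are invented for illustration; a real engine is vastly more complex and distributed. Note that answering a query only ever touches the index, never the live web.

    # Illustrative skeleton only: names are invented, bodies are left abstract.

    class Crawler:
        def fetch(self, url: str) -> str:
            """Download the textual content of one page."""
            raise NotImplementedError

    class Indexer:
        def index(self, url: str, html: str) -> int:
            """Parse the page, store its terms, links and metadata, return a page ID."""
            raise NotImplementedError

    class QueryHandler:
        def search(self, query: str) -> list[str]:
            """Look the query terms up in the index and return ranked URLs."""
            raise NotImplementedError

    def run(crawler: Crawler, indexer: Indexer, handler: QueryHandler,
            seed_url: str, query: str) -> list[str]:
        # Crawling and indexing happen continuously in the background;
        # the query handler only reads from the index that results.
        html = crawler.fetch(seed_url)
        indexer.index(seed_url, html)
        return handler.search(query)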

A browser such as Internet Explorer or Firefox sends HTTP requests to retrieve web pages, then downloads and displays them on your monitor. A crawler works in much the same way, but downloads the data to a client of the search engine instead:

• The crawler retrieves a URL

• The crawler connects to the remote server where the page is hosted

• The crawler issues a request to retrieve the page and its textual content

• The crawler scans the links the page contains for further crawling

A crawler downloads only textual data; it is able to jump from one page to another through the links it has scanned, at rapid speed. Major search engines can now download millions of pages every day. The crawler’s work may sound simple, but in essence this is all it is doing. If you have looked at the log files of a site, you will have seen names like Yahoo Slurp or Googlebot; these are the names of the spiders for Yahoo and Google.
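To give a feel for what a crawler actually does, here is a toy sketch in Python using only the standard library. It is not how Googlebot or Yahoo Slurp really work; it simply fetches a page, pulls out the links, and queues them for further crawling, with none of the politeness rules, deduplication or distributed queues a real crawler needs.

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collect the href of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=10):
        queue, seen, pages = [start_url], set(), {}
        while queue and len(pages) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except Exception:
                continue  # skip pages that fail to download
            pages[url] = html          # store the textual data
            parser = LinkParser()
            parser.feed(html)
            # resolve relative links and queue them for future crawling
            queue.extend(urljoin(url, link) for link in parser.links)
        return pages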

So what happens when Googlebot visits a website?

  1. First the <title> tag is read; make sure every webpage has one, as it is a very important piece of information fed to the search engine spider.
  2. Next the actual text on the page is stripped out of the HTML code, and a note is made of where it appears on the page.
  3. For search engines that use the information in <meta> tags, the keywords and description are extracted.
  4. The spider then pulls out hyperlinks and divides them into two categories: those that belong to the site (internal links) and those that do not (external links).
  5. External links are generally passed to “crawl control”, where they wait in a queue for future crawling.
  6. Each page downloaded from the site is added to the page repository and given an ID number.

All this information is now stored in the search engine’s database. The search engine now knows what the title and keywords are, and how many internal and external links the site contains.
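Here is a rough sketch, again in Python with invented names, of the indexing steps just described: reading the <title> tag, stripping the text out of the HTML, picking up the <meta> keywords and description, splitting links into internal and external, and giving each downloaded page an ID in a simple page repository.

    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    class PageExtractor(HTMLParser):
        """Collect title, meta tags, visible text and links from one page."""
        def __init__(self):
            super().__init__()
            self.title, self.meta, self.text, self.links = "", {}, [], []
            self._in_title = False

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "title":
                self._in_title = True
            elif tag == "meta" and "name" in attrs:
                self.meta[attrs["name"].lower()] = attrs.get("content", "")
            elif tag == "a" and attrs.get("href"):
                self.links.append(attrs["href"])

        def handle_endtag(self, tag):
            if tag == "title":
                self._in_title = False

        def handle_data(self, data):
            if self._in_title:
                self.title += data
            elif data.strip():
                self.text.append(data.strip())   # text stripped of the HTML code

    repository = {}   # page ID -> extracted record

    def process_page(url, html):
        parser = PageExtractor()
        parser.feed(html)
        site = urlparse(url).netloc
        internal = [l for l in parser.links if urlparse(urljoin(url, l)).netloc == site]
        external = [l for l in parser.links if urlparse(urljoin(url, l)).netloc != site]
        page_id = len(repository) + 1            # each page gets an ID number
        repository[page_id] = {
            "url": url,
            "title": parser.title,
            "keywords": parser.meta.get("keywords", ""),
            "description": parser.meta.get("description", ""),
            "text": " ".join(parser.text),
            "internal_links": internal,
            "external_links": external,          # candidates for crawl control
        }
        return page_id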

Link analysis is considered an important factor taken into account when ranking pages following a query. It also makes sense to identify sites offering complementary services that are likely to interest your site’s users, since reciprocal linking can increase traffic and revenue generation.

But reciprocal arrangements have to be entered into with caution, since many search engines will downgrade the value of such pages. Search engines like Google can also easily detect link exchanges between whole networks of sites, so if you are developing various websites and interlinking them simply because you have heard that Google likes links and they will increase your rankings, chances are you will be doing more harm than good. Not only will rankings suffer, you might end up being banned from Google’s index altogether, and no site can afford that.

The link analysis module looks at the surrounding connectivity of a page, i.e. which pages link to you and which pages you link to. This is very important to search engines and works like citation analysis. For example, if many pages point back to an organization’s site, then it is likely that its page is an authority on a given subject. This idea is referred to as ‘hubs’ and ‘authorities’: hubs are pages that point to others, and authorities are the pages they point to.
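The hubs-and-authorities idea is usually associated with the HITS algorithm. The condensed sketch below shows its basic scoring loop over a small, made-up link graph; real link analysis runs over billions of pages with many refinements.

    # `graph` maps each page to the list of pages it links to.
    def hits(graph, iterations=20):
        hubs = {page: 1.0 for page in graph}
        auths = {page: 1.0 for page in graph}
        for _ in range(iterations):
            # a page's authority grows when good hubs point to it
            for page in auths:
                auths[page] = sum(hubs[p] for p in graph if page in graph[p])
            # a page's hub score grows when it points to good authorities
            for page in hubs:
                hubs[page] = sum(auths[q] for q in graph[page] if q in auths)
            # normalise so the scores stay comparable between iterations
            for scores in (hubs, auths):
                total = sum(scores.values()) or 1.0
                for page in scores:
                    scores[page] /= total
        return hubs, auths

    # A made-up three-page graph: one hub pointing at two authorities.
    example_graph = {
        "hub-page.example": ["authority-a.example", "authority-b.example"],
        "authority-a.example": [],
        "authority-b.example": ["authority-a.example"],
    }
    hub_scores, authority_scores = hits(example_graph)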

Here are some of the factors that determine ranking:

• Link popularity

• Keyword weight

• Age of the domain

• How long it has been in the database

• How fresh the content on the website is

But these factors have shifted as algorithms have become more sophisticated. Search engines now assess the relevancy between a search query and the keywords within a phrase rather than simply considering keyword weight. Google now considers more than 100 factors in its algorithm to determine rankings. Its ranking algorithm also gives more weight to the age of websites and the pages within them, as it wants to give searchers fresh and up-to-date content, and it favours websites or blogs where page content is updated, new pages are added and new links keep pointing to the site. But remember that Google keeps changing its ranking algorithm from time to time, so what scores a top-10 result today may not do so tomorrow. That is why you should monitor for changes.
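Nobody outside Google knows the real algorithm, but a toy scoring function makes the idea of combining many factors clearer. The weights, field names and signals below are invented purely for illustration.

    from datetime import date

    def toy_score(page, query_terms, today=date.today()):
        """Combine a few illustrative ranking signals; the weights are made up."""
        text = page["text"].lower()
        # keyword relevance: how often the query terms appear in the page text
        relevance = sum(text.count(term.lower()) for term in query_terms)
        # link popularity: count of known inbound links
        popularity = len(page.get("inbound_links", []))
        # domain age in days, and a freshness bonus that decays over a year
        age = (today - page["domain_registered"]).days
        freshness = max(0, 365 - (today - page["last_updated"]).days)
        return 2.0 * relevance + 1.5 * popularity + 0.01 * age + 0.5 * freshness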

The query interface is where the whole thing comes together. The average query at a search engine interface is two to three words long, and from those two to three words the search engine has to decide which of the billions of pages in its database to return. This is where organizations have to think and optimize their web pages. If a website is to appear in the top 20, it needs pages that target the handful of words a searcher issues on a given subject, and it needs the link popularity to go with them. Success does not come simply from increasing the number of links to a site; the quality of the links received matters more. Google also looks at the words contained within the text of each link: if they match the query in a certain way, it will consider the link more favourable, which in turn gives an added boost to the rankings.
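As a final sketch, here is a toy query handler. It builds an inverted index over the page repository from the earlier indexing example (an assumption carried over from that sketch), splits a short query into terms, intersects the matching pages and orders them by a crude relevance score, standing in for the far richer signals real engines combine.

    from collections import defaultdict

    def build_inverted_index(repository):
        """Map each term to the set of page IDs whose text contains it."""
        index = defaultdict(set)
        for page_id, record in repository.items():
            for term in record["text"].lower().split():
                index[term].add(page_id)
        return index

    def handle_query(query, index, repository):
        terms = query.lower().split()          # typical queries are 2-3 words
        # candidate pages: those containing every query term
        candidates = (set.intersection(*(index.get(t, set()) for t in terms))
                      if terms else set())

        def score(page_id):
            # rank by how often the terms occur in the page text
            text = repository[page_id]["text"].lower()
            return sum(text.count(t) for t in terms)

        return sorted(candidates, key=score, reverse=True)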


