All About Robots.txt

Robots.txt is a simple text file containing just a few lines of text, yet it can determine whether your website appears on Google at all, and which parts of it are shown to search engines such as Google, Yahoo and MSN.

For example:

Allow all spiders to index everything

User-agent: *
Disallow:
OR
Leave the robots.txt file blank, without any commands.

 

Allow no spiders to index any part of your site

User-agent: *
Disallow: /
This ensures that no spider will index anything at all on your site.

To better understand robots.txt files, one must first understand what a web robot is. A robot is a program sent out onto the internet by search engines such as Google, Yahoo, MSN, AltaVista, Ask.com and others to discover new websites, index them and gather relevant information about them. Robots are sometimes called spiders, crawlers or bots.

Robots.txt, simply stated, is a text file on a site that tells search robots which pages they should not visit. By defining a few rules in this little text file, you can instruct robots not to crawl and index certain files and directories within your site. Well-behaved search engines generally respect these instructions.
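To illustrate how a well-behaved robot applies these rules, here is a minimal sketch using Python's standard-library robots.txt parser. The site name and rules are hypothetical examples, not part of any real site:

```python
# Sketch of how a polite robot checks robots.txt rules before fetching
# a page, using Python's standard-library parser.
from urllib.robotparser import RobotFileParser

# Rules a site might serve at /robots.txt (hypothetical example).
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# A polite crawler asks before fetching each URL.
print(parser.can_fetch("*", "https://example.com/index.html"))      # allowed
print(parser.can_fetch("*", "https://example.com/private/a.html"))  # blocked
```

The key point is that enforcement happens on the crawler's side: the parser only reports what the file asks for, and nothing stops a badly behaved bot from ignoring it.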

Importance of Robots.txt file to the webmaster:

From a webmaster’s point of view, robots.txt is important because it allows better-controlled indexing of a website, passing more useful information to search engines and helping the site gain better ranks. With a well-written robots.txt file, the webmaster can decide how the website should be crawled, indexed and ranked by the search engines. The function of the robots.txt file is to give commands to visiting robots, helping them index the site and collect relevant information about it. Note that the commands in the robots.txt file are entirely configurable by the webmaster.

Functions of Robots

  • Site indexing – taking a copy of each new website the robot identifies and storing it on the search engine's servers.

  • Validating the site code – comparing the website's code against W3C standards and grading it according to its accuracy.

  • Link checks – tracing all possible links, both incoming and outgoing, from indexed websites, and calculating the site's ranking factors.

  • Some advanced search engine robots reportedly perform more complex tasks, such as categorizing websites and analyzing their search engine metrics, popularity and so on.

At the same time, it is important to know that robots.txt is not a foolproof way to bar search engines from crawling your site; it is not a firewall or any kind of password protection.

It is welcome when search engines frequently visit your site and index your content, but at times they may index parts of your online content that you do not want them to. In other words, robots.txt is a text file that tells web robots to access your website only in the areas you approve.

Professionals advise that if you have truly sensitive data, it is best not to rely on robots.txt to keep it from being indexed or displayed in search results. Further, if you want to save some bandwidth by excluding images, style sheets and JavaScript from indexing, you must explicitly tell spiders to keep away from these items.
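For example, a site that keeps its static assets in conventionally named folders might exclude them like this (the directory names here are assumptions; substitute your site's actual paths):

```
User-agent: *
Disallow: /images/
Disallow: /css/
Disallow: /js/
```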

Spotting the Robots.txt file:

One common way to tell search engines which files and folders on your website they should avoid is the robots meta tag. The problem is that not all search engines are capable of reading meta tags, hence the robots.txt file.

Robots.txt must be placed in the main (root) directory, or search engines will not be able to find it. It is not the job of a search engine to search the whole site for robots.txt; at best it will look in the main directory, and if the file is not there, it will simply conclude that the site does not have a robots.txt file.
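The paragraph above can be sketched in code: crawlers derive the robots.txt location from the site root only, never from subdirectories. This is a minimal illustration using Python's standard URL utilities; the example URL is hypothetical:

```python
# Sketch: the only place a crawler looks for robots.txt is the site root.
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the single URL a crawler will check for robots.txt."""
    parts = urlsplit(page_url)
    # Discard the page's path entirely; keep only scheme and host.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com/blog/post.html"))
# Always the root: https://example.com/robots.txt
```

This is why uploading the file to a subdirectory has no effect: no crawler will ever request it from there.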

First, create a plain text file and make sure it is named exactly robots.txt, all lowercase. It is also important that this file is uploaded to the root directory of your site, not to a subdirectory. Both steps are necessary for search engines to find and obey the instructions contained in the file.

How to set up a Robots.txt file?

The next obvious question is - how to set up a Robots.txt file? It is better to study the basics well before setting up a Robots.txt file.

  • Open a new text document on your machine.

  • In it, type this text exactly, on two lines:
User-agent: *
Disallow:

  • Save it as "robots.txt" (all lowercase).

  • Go to your server via the file manager or FTP, and navigate to the root folder.

  • Upload the "robots.txt" file to the root folder.

The robots.txt file is now set up. Note, however, that the commands above allow all search engine robots to crawl the entire site without restriction. If you wish to selectively block certain files or folders from being crawled, use the commands shown below:

Exclude a file from an individual search engine

User-agent: Googlebot
Disallow: /thepathtoyourfile.html
Replace "Googlebot" with the user-agent of your preferred search engine's crawler (note that Google's crawler identifies itself as "Googlebot", not "Google") and replace "thepathtoyourfile.html" with the actual path to your file. To block more than one file, repeat the second line with the specific file names.
Ex: Disallow: /file1.html
Disallow: /file2.html
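You can check that such per-engine rules behave as intended before uploading them. This sketch uses Python's standard-library parser; the file names and the second bot's name are hypothetical:

```python
# Sketch: verify that a per-engine rule blocks only the named crawler.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: Googlebot",
    "Disallow: /file1.html",
    "Disallow: /file2.html",
]

parser = RobotFileParser()
parser.parse(rules)

# Googlebot is blocked from the listed files...
print(parser.can_fetch("Googlebot", "https://example.com/file1.html"))     # False
# ...but a crawler not named in the file is unaffected.
print(parser.can_fetch("SomeOtherBot", "https://example.com/file1.html"))  # True
```

Because there is no `User-agent: *` block in this example, crawlers other than Googlebot fall through to the default, which is to allow everything.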

 

Exclude a section of your site from all spiders and bots

User-agent: *
Disallow: /dir-to-be-blocked/
Replace "/dir-to-be-blocked/" with the actual path to the directory that is to be blocked.

 


 

NOTE: Some crawlers, notably Google's, now support an additional field called ‘Allow:’. As the name suggests, ‘Allow:’ lets you specifically state which files and folders may be crawled. But a word of caution: this field is not part of the original "robots.txt" protocol, so it is best used only when absolutely needed, as it is likely to confuse some less sophisticated crawlers.
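A typical use of ‘Allow:’ is to open up a single page inside an otherwise blocked directory. The paths below are hypothetical examples; note that the Allow line is placed before the Disallow line, because some parsers stop at the first rule that matches:

```
User-agent: *
Allow: /private/public-page.html
Disallow: /private/
```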

In conclusion, robots.txt is an extremely useful tool for controlling how search engines scan your website and what information they gather from it. The more carefully you plan it, the better your search engine positions can be. If you want to keep a folder's contents from passing unnecessary information to search engines, the robots.txt file is the way to do it.


Links to More Information and Resources

robotstxt - The Web Robots Pages

 


You may contact us for further details by e-mail at support@searchenginegenie.com
