Googlebot now digs deeper into your forms - a great new feature from the smart guys at Google

Google's crawling team has taken a major step forward, a step everyone thought a search engine crawler would never take. According to the official Webmaster Central Blog, Googlebot now has the capability to crawl through HTML forms and find the information behind them. This is a huge step forward.



Remember, forms have always been a user-only feature: when we see a form, we type in a query and search for products, catalogs, or other relevant information. For example, some product sites offer nothing but a search form to reach the products on their website. There is no other way to access the inner product pages, which might hold valuable information for the crawlers. Good product descriptions, which might be unique and useful for users, stay hidden from the search engines. Similarly, imagine an .edu website. I personally know a lot of .edu websites which don't provide proper access to their huge inventory of research papers, PowerPoint presentations, etc.
If you scan through the Stanford website you will not find this useful information linked from anywhere; it is rendered directly from a database. Now, thanks to Googlebot's advanced capability to crawl forms, Google can use queries like "Google research" in the search forms of sites like Stanford and crawl all the PDFs, PowerPoint files, and other resources listed there. This is just amazing and a valuable addition.



I am personally enjoying this addition by Google. Some great websites I visit are nowhere near optimized; most of their high-quality product pages or research papers are hidden from regular crawlers. I always thought, why don't I just email them asking to include a search-engine-friendly sitemap, or some pages with a hierarchical structure leading to the inner pages? Most of these sites don't do this, nor do they care that they don't have it. At last Google has a way to crawl the great hidden web that is out there. When they roll out this option, I am sure it will be a huge hit and will add a few billion more useful pages to the Google index.

Also, the Webmaster Central blog reports that Googlebot has the capability to toggle between radio buttons, drop-down menus, checkboxes, etc. Wow, that is so cool. I wish I was part of the Google team that did this research; it is so interesting to make an automated crawler do all this magic on your website, something that has always been a user-only option.
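Nobody outside Google knows exactly how they do this, but as an illustration (purely my own sketch, with made-up field names), a crawler could enumerate the choices offered by a form's drop-downs and radio buttons and build one GET URL per combination:

```python
from itertools import product
from urllib.parse import urlencode

def candidate_urls(action, fields):
    """Build one GET URL for every combination of form field values."""
    names = list(fields)
    urls = []
    for combo in product(*(fields[n] for n in names)):
        # Each combination becomes an ordinary query string the crawler can fetch
        urls.append(action + "?" + urlencode(dict(zip(names, combo))))
    return urls

# Hypothetical product-search form with a drop-down and a set of radio buttons
urls = candidate_urls("http://example.com/search.asp", {
    "category": ["books", "music"],
    "sort": ["price", "rating"],
})
print(urls)
```

With two fields of two values each, this yields four candidate pages to crawl; you can see why Google would restrict such a crawl to a select few sites, since the combinations multiply quickly.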

One good thing I noticed is that they mention they do this only on a select few quality sites. Though there is some high-quality information out there, we can also find a lot of junk. I am sure the sites they crawl using this feature are mostly hand-picked, or if the selection is automated, it is subject to rigorous quality and authority scoring.

Another thing that is news to me is Googlebot's capability to scan JavaScript and Flash for inner links. I was aware that Googlebot can crawl Flash, but I wasn't sure how far they had gotten with JavaScript. A couple of years ago, search engines stayed away from JavaScript to make sure they didn't get caught in some sort of loop that might end up crashing the server they were trying to crawl. Now it's great to hear they are scanning and crawling links in JavaScript and Flash without disturbing the well-being of the site in any way.
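Google hasn't published how their JavaScript link discovery works, but a naive version is easy to imagine: scan script text for URL-like strings. A toy sketch of that idea (the regex and the sample script are my own, not anything Google has described):

```python
import re

# Very rough heuristic: pull quoted strings that look like absolute URLs
# or relative .htm/.html paths out of inline JavaScript. A real crawler
# would be far more sophisticated, but this shows the basic idea.
URL_PATTERN = re.compile(r"""["'](https?://[^"']+|/[^"'\s]+\.html?)["']""")

script = """
function go() { window.location = '/products/widgets.html'; }
var home = "http://example.com/index.html";
"""

links = URL_PATTERN.findall(script)
print(links)
```

Both the relative link assigned to `window.location` and the absolute URL in the variable are picked up, which is roughly the kind of link a string-scanning crawler could discover without ever executing the script.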

Alongside the positive side there is a negative side too: some people don't want the pages hidden inside their forms to be crawled by search engines. For that, of course, Google's crawling and indexing team has a solution. Googlebot obeys robots.txt, nofollow, and noindex directives, so if you don't want your pages crawled, you can block Googlebot from accessing your forms.

A simple syntax like

User-agent: Googlebot
Disallow: /search.asp
will stop Googlebot from crawling your search form if the form submits to search.asp. (robots.txt rules match by prefix, so this also blocks URLs like /search.asp?query=widgets.)

Also, Googlebot crawls only GET-method forms, not POST-method forms. This is very good, since many POST-method forms ask users to enter sensitive information. For example, many sites ask for users' email IDs, usernames, passwords, etc. It's great that Googlebot is designed to stay away from sensitive areas like this. If it started crawling all these forms, and a vulnerable form were out there, hackers and password thieves would start using Google to find unprotected sites. Nice to know that Google is already aware of this and is staying away from sensitive areas.
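The GET/POST distinction is easy to see in code: a GET form submission is just an ordinary URL with a query string, so a crawler can fetch it like any other page, while a POST puts the data in the request body with no URL to represent it. A rough sketch (the form action and field names are made up for illustration):

```python
from urllib.parse import urlencode

# A GET form submission is just an ordinary URL with a query string,
# so a crawler can fetch and index it like any other page.
action = "http://example.com/search.asp"      # hypothetical form action
get_url = action + "?" + urlencode({"query": "research papers"})
print(get_url)  # http://example.com/search.asp?query=research+papers

# A POST form sends the same data in the request body instead: there is
# no URL that represents the submission, which is one reason Googlebot
# leaves POST forms (logins, signups, and the like) alone.
post_body = urlencode({"username": "alice", "password": "secret"})
print(post_body)
```

Since the POST body never appears in a link, nothing tempts the crawler to submit it, and sensitive forms stay untouched.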

I like this particular statement where they say none of the currently indexed pages will be affected thus not disturbing current Pagerank distribution:

"The web pages we discover in our enhanced crawl do not come at the expense of
regular web pages that are already part of the crawl, so this change doesn't
reduce PageRank for your other pages. As such it should only increase the
exposure of your site in Google. This change also does not affect the crawling,
ranking, or selection of other web pages in any significant way."


So what's next from Google? They are already reaching new heights with their search algorithms, indexing capabilities, etc. I am sure for the next 25 years there won't be any stiff competition for Google. I sincerely thank Jayant Madhavan and Alon Halevy of the Crawling and Indexing Team for this wonderful news.

What I expect next from Googlebot:

1. Currently I don't see them crawl large PDFs; in future I expect to see great crawling of the huge but useful PDFs out there. I would also expect them to provide a cache for those PDFs.

2. The capability to crawl Zip or Rar archives and find the information in them. I know some great sites which provide downloadable research papers in .zip format. Search engines could probably read what is inside a zipped file and, if it is useful for users, provide a snippet and make it available in the search index.

3. Special capabilities to crawl through complicated DHTML menus and Flash menus. I am sure search engines are nowhere near doing that yet. I have seen plenty of sites using DHTML menus to access their inner pages, and there are also plenty of sites that use Flash menus. I am sure Google will overcome these hurdles, understand the DHTML, and crawl the quality pages on these sites.

Good luck to Google! - From the Search Engine Genie Team



1 Comment:

Anonymous 

Wow, good spot. That's certainly a change of tune from Google's point of view, when they declared that they'd stop pursuing dynamic parameters in URLs to avoid placing too much load on crawled sites! It'll be interesting to see some examples of this in the field. I suppose select boxes leave a fairly finite set of choices; browser autofill also gets a pretty good idea of which fields need what data, and it wouldn't need a huge amount of intelligence to program something based on that knowledge into a crawler. Interesting stuff, thanks for the heads-up!

7:53 AM  
