How can I optimize for “deep web” crawling?

We have a question from Brighton, Danny asks “What are Google’s plans for indexing the deep web? Are there best practices for form construction to optimize for this?”

Great question! We recently published a paper in VLDB, which I believe stands for Very Large Databases, that talks about exactly our criteria, though we tried to do it safely, so if there are people who don’t want their forms to be crawled, we won’t crawl them. There are various simple things that you can do. Rather than having text that has to be filled out, like a zip code, if you could make it a drop-down, for example, that’s much more helpful. If you could make it so that it’s not a huge form with 20 things to fill out but more like one or two drop-downs, that’s going to be a lot easier as well. I definitely encourage you to go read the paper; there’s nothing super-duper confidential in it. And of course, if you can make it so that you are not part of the deep web, you can take those pages that are in your database and have an HTML site map, so that people can reach all the different pages on your site by crawling through categories or geographic areas; then we don’t have to fill out forms. Google is pretty good about being able to index the deep web through forms, but not every search engine does that. So if you can expose that database somewhere where people can get to all the pages on your site just by clicking, not by submitting a form, then you are going to open yourself up to an even wider audience. If you can do that, that’s what I recommend. But if you can’t, then I’d say check out this paper from the VLDB conference, where the team talked about it in more detail.
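
As a rough illustration of the HTML site map idea, here is a minimal Python sketch that turns database-style records into a page of plain links grouped by category, so every page is reachable by clicking rather than by submitting a form. The record layout, the sample paths, and the function name `build_sitemap_html` are all hypothetical; a real site would read the rows from its own database.

```python
from collections import defaultdict
from html import escape


def build_sitemap_html(pages):
    """Return an HTML fragment linking to every page, grouped by category.

    `pages` is an iterable of (category, title, url) tuples.
    This tuple layout is an assumption made for the example.
    """
    by_category = defaultdict(list)
    for category, title, url in pages:
        by_category[category].append((title, url))

    parts = []
    for category in sorted(by_category):
        parts.append(f"<h2>{escape(category)}</h2>")
        parts.append("<ul>")
        # Sort links by title so the output is stable between runs.
        for title, url in sorted(by_category[category]):
            href = escape(url, quote=True)
            parts.append(f'  <li><a href="{href}">{escape(title)}</a></li>')
        parts.append("</ul>")
    return "\n".join(parts)


if __name__ == "__main__":
    sample = [
        ("Ohio", "Columbus stores", "/stores/oh/columbus"),
        ("Ohio", "Akron stores", "/stores/oh/akron"),
        ("Indiana", "Indianapolis stores", "/stores/in/indianapolis"),
    ]
    print(build_sitemap_html(sample))
```

Grouping by category or geographic area mirrors the suggestion in the answer; the same records that sit behind the form are simply exposed as ordinary links.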

Star Wars or Star Trek?

Barbeta from Buenos Aires asks “Star Wars or Star Trek?”

Star Wars! Sorry, Star Trek folks, but there are like 50,000 movies. Admittedly, cold wars, I can see both sides. I’ve always been a Star Wars fan. I don’t even know all the Star Trek criteria, but there are good points to both sides.

Can I tell Google not to use the posting date in my snippet?

Here’s an interesting question from Brazil, Fabio Ricotta asks “In some queries I can see the date of the post/article in the description snippet (at Google search). Why? Can I tell Google not to use it? If yes, how?”

Right now I don’t think there’s a way to say “please don’t do this.” Our snippets team is always trying to show really helpful descriptions, or what we call snippets, in our search results. If you are on a forum, maybe we can show “oh, there have been 4 replies,” or if you are on a blog, maybe there have been 30 comments on this blog post. So we are always trying to think about new ways to have helpful descriptions or helpful snippets. One of those is to highlight the date on which your blog post or forum thread appeared, because if you know that something was recent, that might be really useful to you as a user. So we do reserve the right to show the snippet that we think is best for users. Sometimes we provide a way to turn that off; NOODP is a meta tag that says not to use the Open Directory Project’s descriptions. But in general we reserve the right to decide whether we show part of a page or highlight the date a particular post went live. Those sorts of things we do reserve the right on, because we want to return the best results for users.

Can rel="canonical" index my hostname and not my IP address?

A smart question from Sweden, Anders has asked “Will the new canonical tag help with issues where you by accident (stupid editors linking to wrong addresses) have indexed sites by IP address rather than hostname?”

I’ll have to double check, but that’s the sort of thing that you’d like to be able to do. You’d like to take that IP address and point it over to the hostname. Now that I’m thinking aloud, we might consider the IP address different from the hostname, so we’ll have to confirm that. But I don’t think it would hurt to go ahead and have that. Ideally, that is the sort of thing where you don’t want your IP address to show up; you want your hostname or domain name to show up instead. So I think that would be a nice thing to do. I’m not sure whether we support it for IP addresses yet, but I’ll ask Yalcom, the guy who wrote and did the heavy lifting on this code, and see what happens.
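
To make the idea concrete, here is a minimal Python sketch of what “point the IP address over to the hostname” could look like when generating the tag. It is only an illustration: the answer leaves open whether the tag is honored for IP addresses, and the function name `canonical_link_tag`, the address `192.0.2.10`, and the host `www.example.com` are all hypothetical.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_link_tag(requested_url, preferred_host):
    """Build a rel="canonical" link element that points at the preferred host.

    Keeps the scheme, path and query of the requested URL and swaps only
    the host part, so an IP-address URL maps to its hostname equivalent.
    """
    parts = urlsplit(requested_url)
    canonical = urlunsplit(
        (parts.scheme, preferred_host, parts.path or "/", parts.query, "")
    )
    return f'<link rel="canonical" href="{escape(canonical, quote=True)}">'


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only address used as a stand-in.
    print(canonical_link_tag("http://192.0.2.10/products/widget", "www.example.com"))
```

The resulting element would go in the `<head>` of the page served at the IP-address URL.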

What do I do after being hacked?

Question from Laura Thieme from Columbus, OH. “I have a client who was hacked. The SEO consultant said the things were cleaned up, but they weren’t correctly. All 30,000 Viagra/cialis types and paid links have been removed but no improvement in SERPs. We sent reconsideration. What do we do now?”

I would send another reconsideration request. I would also do a site: search and look for site:example.com plus Viagra, cialis, porn, free sex, any nasty spammy terms you can think of, just to make sure all the pages are gone. I would also look at the keywords in the Webmaster Tools console to see which keywords you are showing up for, and whether any of them look like spam or porn or anything like that. Take a fresh look. You might also invite someone to take a look on the webmaster help forum and say, “hey, is anything wrong with my site?” because sometimes people can spot things there. And make sure you have the current patched version of your software. If you are running WordPress, make sure you update your WordPress installation, because sometimes you clean it up and you just get hacked again. If you do a search for Google Webmaster blog hacked, there are two or three posts that we’ve done, and you can read more about that. And if you really think it’s all completely cleaned up, do another reconsideration request and we’ll hopefully get that back in.
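
The site: check can be scripted so that no term is forgotten. Below is a minimal Python sketch that builds one query per spammy term and a search URL to open by hand. The term list, the domain `example.com`, and the helper names are assumptions for the example; it does not fetch or parse any results.

```python
from urllib.parse import urlencode

# Hypothetical starting list; extend it with whatever the hack injected.
SPAM_TERMS = ["viagra", "cialis", "porn"]


def site_queries(domain, terms=SPAM_TERMS):
    """Return one 'site:<domain> <term>' query string per term."""
    return [f"site:{domain} {term}" for term in terms]


def search_url(query):
    """Return a Google search URL for the query, to be opened manually."""
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    for query in site_queries("example.com"):
        print(query, "->", search_url(query))
```

Any query that still returns pages points at leftovers to clean up before sending the next reconsideration request.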

Is eating the same sandwich every day duplicate content?

Here’s a question from Canada. Quentin from Vancouver says “Hi Matt, I have the same sandwich for lunch every day. Will I be punished by Google for duplicate content?” No! “Can the canonical tag help me here?” Not really! It doesn’t work in meat space yet, or sandwich space; it only works in web space. “I just can’t get enough Reuben sandwiches!”

Power to you! Although Reubens are a little bad for you; you might consider turkey, ham, you know, a little thin BLT. So tasty. Anyway, don’t worry about that, but the canonical tag is helpful to splat canonicalness so that you can clean up the architecture of your site. Don’t worry about having the same sandwich for lunch.

Should I use underscores or hyphens in URLs?

A question from Ontario, Canada. Tripstar says “Underscores vs hyphens in URLs, does it make a difference? my-page vs. my_page?”

It does make a difference; I would go with dashes or hyphens if you can. If you have underscores and things are working fine for you, I wouldn’t worry about changing your architecture. A while ago I said we were looking at using underscores as separators, and the reason we typically never talk about stuff in the future is that it gives us the freedom to change our mind. In fact, the people who were working on that project worked on something slightly different for scoring in the URL that was actually higher impact and a much bigger win. We might still get around to that. Thanks for the ping; I’ll try to ask some folks on our quality triage team, “hey, can we take a fresh look at this?” But for the time being, dashes or hyphens are treated as separators and underscores are not. That might change in the future, but that’s the way it stands right now.
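
A toy Python model of the rule as stated here: hyphens split a URL slug into separate words, underscores do not. This is only a sketch of the behavior described in the answer, not Google’s actual tokenizer, and the helper names `url_words` and `hyphenate` are made up for the example.

```python
import re


def url_words(slug):
    """Split a slug on hyphens only, treating underscores as part of a word."""
    return [word for word in slug.split("-") if word]


def hyphenate(slug):
    """Replace each run of underscores with a single hyphen (for new URLs)."""
    return re.sub(r"_+", "-", slug)


if __name__ == "__main__":
    for slug in ("my-page", "my_page"):
        print(slug, "->", url_words(slug))
    print(hyphenate("my_page"))
```

Under this model `my-page` reads as two words and `my_page` as one. As the answer notes, existing underscore URLs that work fine do not need to be changed; `hyphenate` is meant for naming new pages.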

Does rel="canonical" make it safe to use tracking parameters?

Here’s a perfect question from Nick in Chicago. “Does the new canonicalization tag make it safe to add tracking arguments to some of my internal links without fear that Google will split the quality signals between the two addresses?”

So I believe you can do this. I would try it out on just one directory or a small set of URLs at first to make sure it’s completely safe. If you can fix it upstream, say by doing something with your cookies or analytics package where you can say “oh, I’m getting to this point of my page, so I’ll track that event,” that’s just a little bit better, because then there’s no risk from the URL itself: suppose someone copies and pastes a URL and they copy and paste it differently, or maybe that URL goes away, or the tracking code changes. So if you can make the URLs unified, that’s still better, but I believe that this sort of thing can work totally fine with the new canonicalization tag. Again, just start out cautiously, make sure it works for you, and make sure that there are no problems. But this is the sort of thing that you can do: you have two conceptually identical pages, where maybe one is “I came in from the front page” and maybe one is “I came in from the help pages,” so you have a slightly different breadcrumb parameter or something like that. You can use the canonicalization tag and say that these two are really the same page, and the same as the page without this breadcrumb parameter.
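
As a sketch of the breadcrumb case, the Python below strips tracking arguments from a URL to get the address that the canonical tag would point at. The parameter names in `TRACKING_PARAMS`, the function name `canonical_url`, and the sample URLs are assumptions made for the example; a real site would list its own tracking arguments.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical tracking arguments; replace with the ones your site adds.
TRACKING_PARAMS = frozenset({"breadcrumb", "ref"})


def canonical_url(url, tracking_params=TRACKING_PARAMS):
    """Return the URL with tracking arguments and any fragment removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in tracking_params
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    for url in (
        "http://www.example.com/widgets?id=7&breadcrumb=home",
        "http://www.example.com/widgets?id=7&breadcrumb=help",
    ):
        print(url, "->", canonical_url(url))
```

Both sample URLs reduce to the same address, which is the value each variant would declare in its rel="canonical" link element. Non-tracking arguments such as `id` are kept because they change what the page shows.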

Is SearchWiki or Analytics data used for ranking?

We’ve got a question from Boston. Micahn asks “Is Google aggregating Search Wiki data with Analytics for ranking?”

No. I’ve said before that my team, the web spam team, will not go over and ask the Analytics team for data, so we don’t get any feed from the Analytics team, and we’ve said that we are currently not using SearchWiki data. Is it possible that in the future we might? Maybe, but at least right now, if you are spending all day, every day, voting up or submitting your URLs in SearchWiki, you are basically wasting your time. We are not using that data right now. And if we ever did use that data, we’d be very cautious about how we used it, so that we could try to prevent any sort of abuse. So instead, it’s better to make great sites, get lots of visitors, and build lots of buzz. That’s a great way forward, and you don’t have to worry about SearchWiki data or anything like that.

Does Google Analytics work with Web 2.0 and social media?

We have a question from Indianapolis, Indiana. Benji says “Does Google Analytics have plans to start adding specific tools around web 2.0 or social media websites?”

Well, Analytics is a tool for your website, right? So the question is, if I have a web 2.0 site, can Google Analytics help me? And I think the answer is yes. As I recall (I’m not the expert in Google Analytics, and I love those guys, but I don’t talk to them that often), I think that we do provide analytics solutions for Flash these days and also for AJAX. So there are ways to track internal events on your page; there are hooks where you can say, “I’ve got to this part of the HTML, so fire off some sort of event that Analytics can use.” So double check and do a little research to verify that, but if you do have a rich website with all sorts of interesting things, I think you can still use Google Analytics for different events and track that. You have to do a little more work than if it’s just static HTML, but you’re already doing more work to make a really rich web 2.0 experience anyway. So I think it is possible; you just have to give some thought to which events you want to track, and then whether you can insert those hooks to track visitors as they go through the funnel and how they convert.
