A great post on WebmasterWorld by Brett Tabke explains how search engines treat duplicate content. It is worth a read by everyone.

What is dupe content?

a) Strip duplicate headers, menus, footers (e.g. the template). This is quite easy to do mathematically. You just look for string patterns that match on more than a few pages.
b) Content is what is left after the template is removed.

Comparing content is done the same way, with pattern matching. The core is the same type of routines that make up compression algos like Lempel-Ziv (LZ). This type of pattern matching is sometimes referred to as a sliding dictionary lookup. You build an index of a page (a dictionary) based on (most probably) words. You then start with the lowest denominator and try to match it against other words in other pages.

How close is duplicate content?

A few years ago, an intern (*not* Pugh) who helped work on the dupe content routines (2000?) wrote a paper (now removed). The figure 12% was used. Even after studying it, we are left to ask how that 12% was arrived at.

Cause for concern with some sites?

Absolutely. People that should worry:
a) those with repetitive content for language purposes.
b) those that do auto generated content with slightly different pages (such as weather sites, news sites, travel sites).
c) those with geo targeted pages on different domains.
d) those with multiple top level domains.

Can I get around it with random text within my template?

Debatable. I have heard some say that if a site of any size (more than 20 pages) does not have a detectable template, you are subject to another quasi penalty.

When is dupe content checked?

I feel it is checked as a background routine. It is a routine that could easily run 24x7 on hundreds of machines if they wanted to crank it up that high. I am almost certain there is a granularity setting to it where they can dial up or dial down how closely they check for dupe content.
When you think about it, this is not a routine that would actually have to be run all the time, because once they flag a page as a dupe, that would take care of it for a few months until they came back to check again. So I agree with those that say it isn't a set pattern. Additionally, we agree that G's indexing isn't as static as it used to be. We are into the "update all the time" era, where the days of GG pressing the button are done because it is pressed all the time. The tweaks are on-the-fly now; it's pot luck.

What does Google do if it detects duplicate content?

It penalizes the second one found, with caveats. (As with almost every Google penalty, there are exceptions we will get to in a minute.) What generally happens is that the first page found is considered to be the original prime page. The second page will get buried deep in the results.

The exception (as always), we believe, is high PageRank. It is generally believed by some that mid-PR7 is considered the "white list", where penalties are dropped on a page or, quite possibly, an entire site. This is why it is confusing to SEOs when someone says they absolutely know the truth about a penalty or algo nuance. The PR7/whitelist exception takes those arguments and washes them away.

Who is best at detecting dupe content?

Inktomi used to be the undisputed king, but since G changed their routines (late 2003/Florida?), G has detected everything from the tiny page to the large duplicate page without fail. On the other hand, I think we have all seen some classic dupe content that has slipped by the filters with no apparent explanation. For example, these two pages:

The original: http://www.webmasterworld.com/forum3/2010.htm
The duplicate: http://www.searchengineworld.com/misc/guide.htm
The 10,000 unauthorized rips (10k is the best count, but it is probably higher): Successful Site in 12 Months with Google Alone

All in all, I think the dupe content issue is far overrated and easy to avoid with quality original content. If anything, it is a good way to watch a competitor get penalized.
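To make the two steps under "What is dupe content?" more concrete, here is a minimal Python sketch. It is only an illustration, not how any search engine actually does it. Instead of an LZ-style sliding dictionary it swaps in a simpler technique: overlapping word windows ("shingles") compared with a Jaccard overlap score. The function names and the min_share, k and threshold parameters are all made up for this example; threshold loosely plays the role of the "granularity setting" speculated about above, and its default value is arbitrary.

```python
from collections import Counter


def strip_template(pages, min_share=0.8):
    """Drop lines that recur on at least min_share of the pages (the 'template')."""
    line_counts = Counter()
    for text in pages.values():
        # Count each distinct non-empty line once per page.
        line_counts.update({ln.strip() for ln in text.splitlines() if ln.strip()})
    cutoff = min_share * len(pages)
    return {
        url: [ln.strip() for ln in text.splitlines()
              if ln.strip() and line_counts[ln.strip()] < cutoff]
        for url, text in pages.items()
    }


def shingles(lines, k=3):
    """Return the set of k-word windows over what is left of a page."""
    words = " ".join(lines).lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}


def similarity(a, b):
    """Jaccard overlap of two shingle sets, from 0.0 to 1.0."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_duplicates(pages, threshold=0.5):
    """Return (earlier, later) URL pairs whose content overlap meets the threshold."""
    content = strip_template(pages)
    sets = {url: shingles(lines) for url, lines in content.items()}
    urls = list(pages)  # assumption: dict order stands in for "first page found"
    flagged = []
    for i, first in enumerate(urls):
        for later in urls[i + 1:]:
            if similarity(sets[first], sets[later]) >= threshold:
                flagged.append((first, later))
    return flagged
```

Calling find_duplicates with a dict that maps each URL to its page text returns pairs in the order the pages were supplied, so the second URL in each pair is the one that would be treated as the copy. Note that with only a handful of pages a line-frequency rule like this can mistake genuinely duplicated content for template, which is one reason the cutoffs are left as knobs rather than fixed values.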
2 Comments:
At 1:05 PM,
NY Party Shuttle said…
Interesting subject. Our site has several good quality incoming links and has good content on the home page, a blog, and internal pages. However, we are not in the top 1000 Google listings for our main keywords. On Yahoo, by comparison, we are in the top 100 listings for each major keyword. I'm trying to figure out why we are being penalized in Google. At first, I thought it was the sandbox, but this has been going on way too long (we launched the site early last summer). Here's my question:
We have two sites:
www.atlanticcitypartyshuttle.com
and
www.newyorkpartyshuttle.com
The one I'm worried about is NYPS. You mention duplicate pages; there are none between the two sites, but there are several common links back and forth. Also, ACPS is the older site (by about 3 months last year). Any ideas on why we're not showing up higher in Google?
At 7:39 PM,
Ricjie said…
My site http://www.oasisoflove.com is being promoted via articles that I write. The articles are distributed around the net on article directories.
Since I have the same articles on my site, is it best not to have them on the site?