Another excellent post by Brett Tabke of webmasterworld.com on duplicate content issues with search engines.

In this WebmasterWorld post, Brett Tabke explains how search engines treat duplicate content. It is worth a read for everyone:

What is dupe content?
a) Strip duplicate headers, menus, footers (e.g. the template).
This is quite easy to do mathematically. You just look for string patterns that match on more than a few pages.
b) Content is what is left after the template is removed.
Comparing content is done the same way, with pattern matching. The core is the same type of routine that makes up compression algos like Lempel-Ziv (LZ). This type of pattern matching is sometimes referred to as a sliding dictionary lookup. You build an index of a page (a dictionary) based on (most probably) words. You then start with the lowest denominator and try to match it against other words in other pages.
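As a rough illustration of the two steps above, here is a minimal Python sketch. It is a toy built on assumptions, not how any search engine actually does it: lines that recur on several pages are treated as template, and what remains is compared using overlapping word windows (shingles) scored with Jaccard overlap, which stands in for the sliding dictionary lookup described in the post. The function names, `min_pages`, the window size `k` and the sample pages are all made up for the example.

```python
from collections import Counter


def strip_template(pages, min_pages=3):
    """Drop lines that appear on at least `min_pages` pages (assumed to be template)."""
    line_counts = Counter()
    for text in pages.values():
        # Count each distinct non-empty line once per page.
        line_counts.update({line.strip() for line in text.splitlines() if line.strip()})
    stripped = {}
    for url, text in pages.items():
        kept = [line.strip() for line in text.splitlines()
                if line.strip() and line_counts[line.strip()] < min_pages]
        stripped[url] = "\n".join(kept)
    return stripped


def shingles(text, k=3):
    """Return the set of overlapping k-word windows in `text`."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard overlap of k-word shingles: 0.0 (nothing shared) to 1.0 (identical)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Hypothetical pages: same header and footer, two near-identical bodies, one different.
pages = {
    "/a": "Home | About | Contact\nCheap flights to Paris from London every day\nCopyright Example Inc",
    "/b": "Home | About | Contact\nCheap flights to Paris from London every week\nCopyright Example Inc",
    "/c": "Home | About | Contact\nHow to bake sourdough bread at home\nCopyright Example Inc",
}

content = strip_template(pages, min_pages=3)
score_ab = similarity(content["/a"], content["/b"])
score_ac = similarity(content["/a"], content["/c"])
print(content["/a"])
print(round(score_ab, 2), round(score_ac, 2))
```

With the template lines gone, the two flight pages share most of their word windows and score high, while the unrelated page scores zero. A real system would need something far more scalable than comparing every pair of pages.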
How close is duplicate content?
A few years ago, an intern (*not* Pugh) who helped work on the dupe content routines (2000?) wrote a paper (now removed). The figure 12% was used. Even after studying it, we are left to ask how that 12% was arrived at.
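The post does not say what the 12% actually measures. To show why that matters, here is a small hypothetical Python sketch that scores the same two texts using word windows of different sizes. The measure (Jaccard overlap), the sample sentences and the helper names are assumptions for illustration, not the method from the removed paper.

```python
def word_windows(text, k):
    """Set of overlapping k-word windows (k=1 is just the set of words)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def percent_shared(a, b, k):
    """Percent of all distinct k-word windows that both texts have in common."""
    wa, wb = word_windows(a, k), word_windows(b, k)
    union = wa | wb
    return 100.0 * len(wa & wb) / len(union) if union else 0.0


# Two made-up auto-generated weather blurbs that differ in three words.
page_one = "sunny in boston today with a high of 75 degrees"
page_two = "rainy in denver today with a high of 60 degrees"

for k in (1, 2, 4):
    print(k, round(percent_shared(page_one, page_two, k), 1))
```

The same pair of pages comes out anywhere from roughly 17% to 54% alike depending on the window size alone, so a bare figure like 12% says little without knowing the measure behind it.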
Cause for concern with some sites?
Absolutely. People that should worry: a) those with repetitive content for language purposes; b) those that do auto-generated content with slightly different pages (such as weather sites, news sites, travel sites); c) those with geo-targeted pages on different domains; d) those with multiple top-level domains.
Can I get around it with random text within my template?
Debatable. I have heard some say that if a site of any size (more than 20 pages) does not have a detectable template, you are subject to another quasi penalty.
When is dupe content checked?
I feel it is checked as a background routine. It is a routine that could easily run 24x7 on hundreds of machines if they wanted to crank it up that high. I am almost certain there is a granularity setting to it where they can dial up or dial down how closely they check for dupe content. When you think about it, this is not a routine that would actually have to be run all the time, because once they flag a page as a dupe, that would take care of it for a few months until they came back to check again. So I agree with those that say it isn't a set pattern.
Additionally, we also agree that G's indexing isn't as static as it used to be. We are into the "update all the time" era, where the days of GG pressing the button are done because it is pressed all the time. The tweaks are on-the-fly now - it's pot luck.
What does Google do if it detects duplicate content?
Penalizes the second one found (with caveats). (As with almost every Google penalty, there are exceptions we will get to in a minute.)
What generally happens is the first page found is considered to be the original prime page. The second page will get buried deep in the results.
The exception (as always) - we believe - is high PageRank. It is generally believed by some that mid-PR7 is considered the "white list" where penalties are dropped for a page or, quite possibly, an entire site. This is why it is confusing to SEOs when someone says they absolutely know the truth about a penalty or algo nuance. The PR7/whitelist exception takes the arguments and washes them.
Who is best at detecting dupe content?
Inktomi used to be the undisputed king, but since G changed their routines (late 2003/Florida?), G has detected everything from the tiny page to the large duplicate page without fail.
On the other hand, I think we have all seen some classic dupe content that has slipped by the filters with no explanation apparent.
For example, these two pages:
The original: http://www.webmasterworld.com/forum3/2010.htm
The duplicate: http://www.searchengineworld.com/misc/guide.htm
The 10,000 unauthorized rips (10k is the best count, but probably higher): Successful Site in 12 Months with Google Alone
All-in-all, I think the dupe content issue is far overrated and easy to avoid with quality original content. If anything, it is a good way to watch a competitor get penalized.

2 Comments:

NY Party Shuttle said...: Interesting subject. Our site has several good quality incoming links and has good content on the home page, a blog, and internal pages. However, we are not in the top 1000 Google listings for our main keywords. On Yahoo, by comparison, we are in the top 100 listings for each major keyword. I'm trying to figure out why we are being penalized in Google. At first, I thought it was the sandbox, but this has been going on way too long (we launched the site early last summer). Here's my question:

We have two sites:
www.atlanticcitypartyshuttle.com
and
www.newyorkpartyshuttle.com

The one I'm worried about is NYPS. You mention duplicate pages, which there are none between the two sites, but there are several common links back and forth. Also, ACPS is the older site (by about 3 months last year). Any ideas on why we're not showing up higher in Google?; 1:05 PM
Ricjie said...: My site http://www.oasisoflove.com is being promoted via articles that I write. The articles are distributed around the net on article directories.

Since I have the same articles on my site, is it best not to have them on the site?; 7:39 PM
