<p>Another excellent post by Brett Tabke of WebmasterWorld.com on duplicate content issues with search engines<br/>
March 11, 2005</p>
<p>A great post on WebmasterWorld by Brett Tabke explains how search engines treat duplicate content. It is worth a read by everyone.</p>
<blockquote>
<p>What is dupe content?<br/>
a) Strip duplicate headers, menus, and footers (e.g. the template). This is quite easy to do mathematically: you just look for string patterns that match on more than a few pages.<br/>
b) Content is what is left after the template is removed. Comparing content is done the same way, with pattern matching. The core is the same type of routines that make up compression algos like Lempel-Ziv (LZ). This type of pattern matching is sometimes referred to as a sliding dictionary lookup: you build an index of a page (a dictionary), most probably based on words, then start with the lowest denominator and try to match it against words on other pages.</p>
<p>How close is duplicate content?<br/>
A few years ago, an intern (*not* Pugh) who helped work on the dupe content routines (2000?) wrote a paper (now removed). The figure 12% was used. Even after studying it, we are left to ask how that 12% is arrived at.</p>
<p>Cause for concern with some sites?<br/>
Absolutely. People that should worry: a) repetitive content for language purposes; b) those that do auto-generated content with slightly different pages (such as weather sites, news sites, travel sites); c) geo-targeted pages on different domains; d) multiple top-level domains.</p>
<p>Can I get around it with random text within my template?<br/>
Debatable. I have heard some say that if a site of any size (more than 20 pages) does not have a detectable template, you are subject to another quasi-penalty.</p>
<p>When is dupe content checked?<br/>
I feel it is checked as a background routine, one that could easily run 24x7 on hundreds of machines if they wanted to crank it up that high. I am almost certain there is a granularity setting where they can dial up or dial down how closely they check for dupe content. When you think about it, this is not a routine that would actually have to run all the time, because once they flag a page as a dupe, that takes care of it for a few months until they come back to check again. So I agree with those who say it isn't a set pattern. Additionally, we also agree that G's indexing isn't as static as it used to be. We are in the "update all the time" era, where the days of GG pressing the button are done because it is pressed all the time. The tweaks are on the fly now; it's pot luck.</p>
<p>What does Google do if it detects duplicate content?<br/>
It penalizes the second one found, with caveats. (As with almost every Google penalty, there are exceptions, which we will get to in a minute.) What generally happens is that the first page found is considered the original, prime page; the second page gets buried deep in the results. The exception, we believe, as always, is high PageRank. It is generally believed by some that mid-PR7 is the "white list" where penalties are dropped on a page, quite possibly on an entire site. This is why it is confusing to SEOs when someone says they absolutely know the truth about a penalty or algo nuance: the PR7/whitelist exception takes the arguments and washes them.</p>
<p>Who is best at detecting dupe content?<br/>
Inktomi used to be the undisputed king, but since G changed their routines (late 2003/Florida?), G has detected everything from the tiny page to the large duplicate page without fail. On the other hand, I think we have all seen some classic dupe content that has slipped by the filters with no apparent explanation. For example, these two pages:<br/>
The original: http://www.webmasterworld.com/forum3/2010.htm<br/>
The duplicate: http://www.searchengineworld.com/misc/guide.htm<br/>
The 10,000 unauthorized rips (10k is the best count, but probably higher): Successful Site in 12 Months with Google Alone.</p>
<p>All in all, I think the dupe content issue is far overrated and easy to avoid with quality original content. If anything, it is a good way to watch a competitor get penalized.</p>
</blockquote>
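<p>The "sliding dictionary lookup" described in the quote can be sketched as word-shingle comparison: break each page into overlapping runs of words, then measure how much of one page's shingle set reappears in another's. This is a minimal illustrative sketch, not the engines' actual routines; the shingle size (k=4) and the use of Jaccard similarity are assumptions chosen for clarity, and any real threshold (the post's unexplained 12% figure, for instance) would be tuned very differently in practice.</p>

```python
# Minimal sketch of shingle-based near-duplicate detection.
# Assumptions: 4-word shingles and Jaccard similarity; the post does not
# reveal the actual parameters any search engine uses.

def shingles(text: str, k: int = 4) -> set:
    """Return the set of k-word shingles (a crude 'sliding dictionary')."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a: str, b: str) -> float:
    """Jaccard similarity between two pages' shingle sets, in [0, 1]."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Two pages that share most of their text but diverge at the end:
page1 = "the quick brown fox jumps over the lazy dog near the river bank"
page2 = "the quick brown fox jumps over the lazy dog near the old mill"

score = similarity(page1, page2)  # high score flags a likely duplicate
```

<p>Note that template stripping, as the quote describes, would run first on the same machinery: shingles that recur across many pages of a site are the template, and only the shingles left over count as content for the cross-page comparison.</p>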