Duplicate Content: The Problems and the Solutions
Synopsis: Duplicate content is a huge problem on the Internet. While the motives behind duplicating content are often genuinely innocent, most people don’t realize the impact it can have on the site that generated the content originally, and the harm it can do to that site’s rankings. This article has been repurposed from content found on one of my personal favorite sites, https://moz.com/learn/seo/duplicate-content. Hopefully it will help you better understand the issues with duplicate content.
What exactly is duplicate content?
Any time you have identical content on two or more sites on the Internet, you have a duplicate content issue. Although duplicate content will not instantly earn you a penalty with the search engines, it can certainly affect your search engine rankings. The problem with duplicate content is that it makes it very difficult for Google to determine which version of the content is the most relevant to the end user’s search query. As a search engine optimization (SEO) best practice, it is always better to write unique content.
Does it really matter if I have duplicate content?
For search engines, there are three distinct issues with duplicate content:
- First, the search engines can’t discern which version(s) to include in or exclude from their index.
- Second, the search engines don’t know which page to credit inbound links to.
- Finally, the search engines don’t know which version(s) to rank for query results.
Two main problems for individual site owners who produced the original content
Duplicate content can cause the original content owner to lose rankings and, with them, traffic to their website.
- Search engines like Google don’t normally show multiple versions of the same content, as this would muddy up the search engine results pages. This forces the search engine to decide which version will be the most relevant to the end user who types in the query, thus limiting the visibility of the other variants.
- Duplicate content can also cost the content originator link equity that should be passed to the original owner of the content but is instead spread amongst all the variants of the same content. Ideally, the link equity would pass directly to the original content developer. Since inbound links are a major contributor to ranking, duplicate content can dilute the real value of the original content, causing the content owner to lose valuable placement on the search engines.
The net result? A piece of original content loses its value and doesn’t achieve the search visibility it otherwise would.
What is the culprit behind duplicate content?
In most cases, duplicate content isn’t some Machiavellian scheme for world dominance. In fact, the cause of duplicate content is usually innocuous in intent. Site owners can inadvertently duplicate content over time by using the same language across multiple pages of their site. Other times it can be the innocent act of the old “copy and paste bug.” Duplicate content is not an isolated problem, either: it is believed that upwards of 25% of the content on the Internet is actually duplicate content!
But what is the main way that duplicate content is produced?
Scraped or copied content is the primary offender.
Many times content is copied from one site to another with no changes (no wordsmithing), no citations or links back to the original content, and no attribution given to the original author. This is huge in the world of eCommerce, where many of the sellers pass around product descriptions like a common cold. After a while, everyone that sells widget X has the same product description, and the search engines can’t tell which is the original.
Franchisees will often copy and paste content from their corporate sites without thinking that this could ultimately damage the brand’s rankings, which has a trickle-down effect on the franchisee’s own online presence.
So… how do I fix the problem of duplicate content?
It all boils down to this: specifying which of the duplicates is the “correct” one.
Content that lives on multiple URLs should be canonicalized for search engine consumption. The first approach is to use a simple 301 redirect. Another approach is the “rel=canonical” attribute. A third option is the parameter handling tool in Google Search Console.
What is a 301 Redirect?
By creating a 301 redirect from the duplicate page to the page holding the original content, you stop the pages from competing and let them work together. When several pages, each with the potential to rank well, are combined into a single page, they create a higher level of relevancy and send a stronger popularity signal to the search engines, giving the original page the juice it deserves.
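How the redirect is set up depends on your web server. As a minimal sketch, assuming an Apache server with mod_alias enabled and hypothetical paths, an .htaccess rule might look like this:

# Permanently (301) redirect the duplicate URL to the original page
Redirect 301 /duplicate-page https://www.example.com/original-page

Visitors and search engine crawlers that request /duplicate-page are sent on to the original URL, and over time the original page collects the ranking credit.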
What is the rel="canonical" attribute?
The rel="canonical" attribute tells the search engines that a particular page is a copy of an original piece, and that the original should get the credit for the content, including all of the links, content metrics, and ranking authority it deserves.
The rel=”canonical” attribute is part of the HTML head of a web page and looks like this:
General format:
<head>
...[other code that might be in your document's HTML head]...
<link href="URL OF ORIGINAL PAGE" rel="canonical" />
...[other code that might be in your document's HTML head]...
</head>
By adding the rel=canonical attribute, with the URL of the original content, to the head of your HTML code, ranking credit passes through to the original content much like it does with a 301 redirect, while allowing the owner of the duplicate page to benefit from having the same content on their site for their viewers.
Below is an example of what a canonical attribute looks like in action:
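For instance, assuming a hypothetical original post at https://www.example.com/blog/duplicate-content-guide, the head of the duplicate page would include:

<head>
...[other code that might be in your document's HTML head]...
<link href="https://www.example.com/blog/duplicate-content-guide" rel="canonical" />
...[other code that might be in your document's HTML head]...
</head>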
Another Method Is Using the Meta Robots Noindex Tag
Meta tags are very helpful, especially when handling duplicate content. When you add the values “noindex, follow” to the robots meta tag in the head of a page, it will, in most cases, prevent the page from being indexed by the search engines.
Here is a sample:
<head>
...[other code that might be in your document's HTML head]...
<meta name="robots" content="noindex,follow">
...[other code that might be in your document's HTML head]...
</head>
The robots meta tag, in this case, allows the search engines to crawl the page and follow its links but prevents them from including the page in their index. You want the duplicate page to be crawled, but you’re telling the search engines not to index it. This is important, as search engines like Google caution against restricting crawl access. You are essentially telling Google, “Come on in and take a look at the site, we are safe, but please don’t index this particular page.”
Preferred domain and parameter handling in Google Search Console
Another duplicate content issue arises when you have both an http and an https version of your site, or perhaps a URL with the www. prefix and a copy of the site without the www in front of the domain name. Google has a really cool tool in Google Search Console that allows you to choose the preferred domain of your website. When you choose the preferred domain, Google should crawl only the version you choose, based on your parameters.
You may want to use caution with this method, as it only pertains to Google and not to the other search engines, such as Bing and Yahoo. Those search engines offer webmaster tools as well; check with each one to get a better understanding of their tools and protocols.
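The same http/https and www/non-www consolidation can also be handled with the 301 redirects discussed earlier, which works for every search engine rather than just Google. As a minimal sketch, assuming an Apache server with mod_rewrite enabled and a preferred domain of https://www.example.com, the .htaccess rules might look like this:

# Assumes Apache with mod_rewrite; send http and non-www traffic to https://www.example.com
RewriteEngine On
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [L,R=301]

All four variants of a URL then resolve to a single canonical address, so the search engines only ever see one version of the site.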
Other methods for dealing with duplicate content
- Be consistent in the canonical settings for links on your site. For example, if you determine that the canonical prefix of a domain is www.example.com/, then all internal links should go to http://www.example.com/page rather than http://example.com/page (notice the absence of www).
- When syndicating content (copying content directly from one site to another for the purpose of reproducing content relevant to your website readership), make sure the syndicating website adds a link back to the original content itself, not to a variation of the URL.
- To add an extra safeguard against content scrapers, who are notorious for stealing SEO credit for your content, it is a smart move to add a self-referential rel=canonical link to your existing pages; a sketch follows this list. This is a canonical attribute that points to the URL the page is already on, the point being to thwart the efforts of some scrapers. Keep in mind that not every scraper will port over the full HTML code of its source material, but some will. For those that do, the self-referential rel=canonical tag will ensure your site’s version gets credit as the “original” piece of content. It is kind of like LoJack for your content, to some degree.
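As a minimal sketch, assuming a hypothetical page at https://www.example.com/blog/widget-guide, the self-referential canonical simply points back to the page’s own URL:

<head>
...[other code that might be in your document's HTML head]...
<link href="https://www.example.com/blog/widget-guide" rel="canonical" />
...[other code that might be in your document's HTML head]...
</head>

If a scraper copies the full head along with the body, the tag travels with the stolen copy and continues to point back to your original URL.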