Duplicate web pages can be a problem for many websites, whether they come from re-using manufacturers' descriptions, CMS systems that accidentally publish duplicate versions of site content, or even people stealing your content. Duplication can cause problems and confuse Google.
At this year’s London SES, Kristjan Mar Hauksson of Nordic eMarketing and Eric Enge of Stone Temple Consulting took to the stage to share ways to fix duplicate content.
Kristjan Mar Hauksson
1. Duplicate content is preventable. Many problems can be prevented from the beginning. Often duplication arises from the CMS or from content writers taking shortcuts. Make sure you are familiar with how content is displayed on your website so that you can identify errors quickly.
2. Read the Google guidelines. Duplicate content can confuse Google and if Google is confused that can be bad news for you. Read the Google guidelines and understand what Google expects from you.
3. Use tools to troubleshoot. There are times when CMS systems leak content into web pages, serving up content that shouldn't be there. Tools like Xenu Link Sleuth and Screaming Frog can crawl a website and show you potential duplication through the data they collate.
4. Use Search Console. Log in to Search Console to troubleshoot errors. You can also use Google to do manual testing. A simple way is to type site:yoursite.com into the Google search bar. This will show which of your pages Google currently has in its index. Also check for duplicate content by taking random sentences or snippets of text from your product categories. Paste them, in quotes, into Google search and see if duplicates appear.
5. Password protect under-development URLs. Some unscrupulous people out there scrape websites to get content and then publish it - and this can include websites that are still in development. It’s important to keep dev sites under lock and key until they are ready to be released. Use passwords to keep the public and search engines out of these development URLs so they are not crawled and indexed. Kristjan pointed out that Iceland Air has development URLs that Google was still ranking for keywords five years later - irrelevant dev-side URLs that remain in the index.
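One common way to lock a dev site down is HTTP basic authentication at the web server. As a minimal sketch, assuming an Apache server (the password file path and realm name below are illustrative assumptions):

```apache
# .htaccess on the development site - require a login before serving anything.
# The AuthUserFile path is a placeholder; point it at your own htpasswd file.
AuthType Basic
AuthName "Development site - authorized users only"
AuthUserFile /home/example/.htpasswd
Require valid-user
```

The password file can be created with Apache's htpasswd utility. Because the server answers every anonymous request with a 401, search engine crawlers cannot fetch the pages, so nothing from the dev site ends up indexed.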
6. Monitor your social content. There are websites taking content from large social websites like Flickr and hosting the content. Some may optimize it (even linking to you), but are effectively getting traffic from your assets. Kristjan’s advice is to take care and protect your valuable content. Make sure the ownership of content is clear.
7. Report plagiarism to Google immediately. Tell Google and submit an ownership claim:
“Google wants to know if somebody is stealing content from you. If for some reason you are a victim of content being stolen, report it to Google. Google has a form that you can fill out. In many cases that I know of personally, Google has actually dealt with it quite well”.
8. Make a checklist and define a process. Take everything that is likely to be a cause of duplication and create a checklist. Go through the items one by one and create a process. For example, aim to clear all 404s within six months. There is a good chance you will see an increase in organic visits.
9. Think about it from an ROI (return on investment) point of view. Kristjan believes that when it comes to SEO you should always think about how to deal with it from an ROI point of view. If you have a problem, evaluate whether the benefit of fixing it outweighs the cost of doing so.
“When you have a page that is a duplicate, not only does it have no chance of ranking and you’re wasting crawl budget on it, but you have links pointing to it … and that PageRank (PR) is wasted”.
10. 301 redirect for the win! If you discover two duplicate pages that are under your control, the first solution is to 301 redirect the duplicate to your primary source page. A 301 redirect is an excellent solution: simply redirect to the canonical page, i.e. the one you want published.
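On an Apache server, a 301 can often be set up with a single directive. A sketch, with placeholder URLs standing in for your duplicate and canonical pages:

```apache
# .htaccess - permanently redirect the duplicate URL to the canonical one.
# Both paths below are illustrative; substitute your own.
Redirect 301 /old-duplicate-page/ https://yoursite.com/canonical-page/
```

The 301 status code tells search engines the move is permanent, so they drop the old URL from the index and pass its link value to the target.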
11. Use rel=canonical. A second solution, if it is tricky for developers to fix a duplicate page problem with a 301, is to add <link rel="canonical" href="http://yoursite.com"/> to the duplicate page, i.e. link to the target page. This method can pass link value from one page to another. The page is not prevented from being crawled by Google, so it’s technically not as good as eliminating the page and doing a 301. It does however conserve most of the PageRank, and in turn solves the duplicate content problem.
12. Use a robots noindex meta tag. With this option, pages do still get crawled, but the noindex tag keeps them out of Google's index. The noindex pages can still accumulate PageRank, and some of that PageRank gets passed back into your site, so whilst it is inefficient, you technically don’t lose all the PageRank.
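The tag goes in the head of each duplicate page. A minimal example:

```html
<!-- In the <head> of the duplicate page: keep it out of the index
     while still letting crawlers follow its links -->
<meta name="robots" content="noindex, follow"/>
```

Note that crawlers must be able to fetch the page to see this tag, so it should not be combined with a robots.txt Disallow for the same URL.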
13. Use Disallow in robots.txt. With this option the web page isn't crawled at all, which conserves crawl budget. However, any PageRank flowing into the blocked page is stranded there - it cannot be passed back into your site, so this option does not conserve PR. It does however solve a duplicate problem.
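A Disallow rule lives in a robots.txt file at the site root; the blocked path below is an illustrative assumption:

```
# robots.txt, served at https://yoursite.com/robots.txt
User-agent: *
Disallow: /duplicate-section/
```

`User-agent: *` applies the rule to all crawlers; any URL whose path begins with /duplicate-section/ will not be fetched.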
14. Tell Search Console to ignore a URL parameter. According to Eric, the jury is out on this one. Technically it conserves crawl budget and PR, and in theory it should solve the duplicate problem, but webmasters question how effective it is.
15. Addressing the problem of filters and duplicates. Many websites, including shopping sites, use filters to break down results based on price, size, color and so on. This filtered subset of content is seen as duplicate content. The preferred fix is link rel=canonical as the basic solution. However, there are occasions when Google will see the code and ignore it, if Google thinks it's been used incorrectly. Fortunately there are some solutions developers can opt to use, such as JSON data and jQuery or an Ajax implementation, any of which can address the issue.
“The gold standard is to get rid of the problem. So whatever is causing the duplicate content - do the things it takes to not have those duplicate pages any more.”
Read more about Duplicate Content on the Wordtracker Academy
You can optimize your WordPress blog - it's easy! (Learn how to prevent duplicate content on WordPress.)