Any webmaster who spends time ensuring content is unique, well written, and useful feels the pain of finding that content scraped and displayed on another website. Scrapers are just a part of doing business on the web, and there isn’t much a webmaster can do to stop them entirely. You can, however, take some clever steps to fight back and preserve your site’s unique value in search engines.
There are several ways to block scrapers, but some of them also block legitimate search engine crawlers. The challenge for webmasters is to make sites scraper-unfriendly but still remain search engine friendly. This is no easy task, because what blocks scrapers generally blocks search engines as well.
For instance, one way to completely block scrapers is to transform your content into images. While this is great for fighting scrapers, it makes your site completely SEO-unfriendly. Search engines are still largely text based; they can’t properly read and understand content locked inside images, so your rankings will likely drop.
Because scrapers and search engine bots work similarly, it’s difficult to block scrapers without harming your SEO and rankings. When you choose a method, choose wisely. Even testing a method can have negative effects if it touches search engine bots. Don’t make any major structural changes unless you know they won’t block legitimate bots.
Here are three ways you can fight content scrapers but keep your site search engine crawler friendly.
Set a Canonical in Your Pages
A canonical link gives Google’s algorithms a strong suggestion for indexing duplicate content. It essentially says, “This is duplicate content. Index this URL instead,” where “this URL” is a page on your site.
When a scraper steals your content, it typically copies everything inside your HTML, including the link tags in the head. The result is that your canonical ends up on the scraper’s pages. When Google crawls the scraper site, it reads the canonical, drops the scraper’s page from the index, and preserves your own. A canonical link that points to the current page (a self-referencing canonical) doesn’t affect how Google indexes that page, so you don’t need to worry about it causing issues on your own site.
This technique usually works well, but it has a few weaknesses. First, once the scraper’s owner figures out that a canonical is included, he can strip it out. Second, a canonical is only a suggestion for Google. The algorithm usually accepts the canonical and uses it for indexing, but it’s not a guarantee: if Google sees strong signals pointing to the scraper’s pages, such as links, high-volume traffic, and general popularity, it might keep them indexed. This is rare, however.
The following is an example of a canonical link tag:
<link rel="canonical" href="https://yoursite.com/yourpage.html" />
Notice that you need the absolute URL, which means you include the protocol (https://), the domain name (yoursite.com), and the page name. Include this tag in the head of each of your content pages.
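Before relying on this technique, it can help to audit your pages and confirm that each one declares a self-referencing canonical. Here is a minimal sketch using Python’s standard library; the `find_canonical` function name is my own, not part of any tool:

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag seen."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            attr_map = dict(attrs)
            if attr_map.get("rel") == "canonical" and self.canonical is None:
                self.canonical = attr_map.get("href")

def find_canonical(html):
    """Return the canonical URL declared in an HTML document, or None."""
    parser = CanonicalFinder()
    parser.feed(html)
    return parser.canonical

page = '<head><link rel="canonical" href="https://yoursite.com/yourpage.html" /></head>'
print(find_canonical(page))  # https://yoursite.com/yourpage.html
```

You could run a function like this over every page on your site and flag any page whose canonical is missing or points somewhere unexpected.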
Use Absolute URLs in Your Links
There are two types of link URLs: absolute and relative. An absolute URL looks like the one in the previous section: it includes the protocol, the domain, and the page name.
A relative URL just uses the directory and page name. Here’s an example of each as a link in your content:
- Absolute URL
<a href="https://yoursite.com/yourpage.html">link text</a>
- Relative URL
<a href="/yourpage.html">link text</a>
When a scraper steals your content, it copies your site structure along with it. If your internal links use relative URLs, they work just as well on the scraper’s domain. If they use absolute URLs, every copied link points back to your own domain, which can actually benefit your link graph. Unless the scraper’s owner can write code to strip your domain from every link, the scraped content keeps sending visitors and link equity back to your site.
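If your templates already use relative links, converting them by hand is tedious. The following sketch rewrites root-relative `href`/`src` values to absolute URLs; the `SITE` constant and function name are assumptions for illustration, and a regex like this is a rough tool, not a full HTML rewriter:

```python
import re

SITE = "https://yoursite.com"  # assumption: your own domain

def absolutize_links(html, base=SITE):
    """Rewrite root-relative href/src values to absolute URLs.

    Matches href="/page" or src="/img.png" but leaves
    protocol-relative URLs like "//cdn.example.com" alone.
    """
    pattern = re.compile(r'\b(href|src)="(/(?!/)[^"]*)"')
    return pattern.sub(lambda m: f'{m.group(1)}="{base}{m.group(2)}"', html)

print(absolutize_links('<a href="/yourpage.html">link</a>'))
# <a href="https://yoursite.com/yourpage.html">link</a>
```

For anything beyond a one-off pass, a real HTML parser is safer than a regex, but this shows the idea.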
Create a Honeypot
Honeypots are decoys that companies use to attract hackers. They mimic a real server or system, let the attacker probe for vulnerabilities, and log every event as the attack unfolds. They also lure hackers away from critical systems.
You can create a similar trap on your web server, and all it takes is one file. Create a blank HTML file, name it something like “honey.html”, and upload it to your web server. Then add the file to your robots.txt to stop legitimate robots from crawling it. Well-behaved crawlers honor robots.txt directives, so they will not touch the page once it’s blocked there.
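Assuming the honeypot file sits at the site root and is named honey.html, the robots.txt entry could look like this:

```
User-agent: *
Disallow: /honey.html
```

The `User-agent: *` line applies the rule to all crawlers, and legitimate bots will skip the page while scrapers that ignore robots.txt will not.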
Next, place a hidden link to the honey.html page on one of your site’s active pages. You can hide the link with a “display: none” CSS div. The following code is an example:
<div style="display: none;"><a href="honey.html">link name</a></div>
The link is present in the HTML, so crawlers and scrapers see it, but normal visitors never do.
This trick funnels bot traffic to one file. Since legitimate bots honor the robots.txt but scrapers won’t, any IP that requests the page is suspect. You should already be logging traffic on your website, so review the IP addresses that crawl honey.html. Legitimate bots such as Google’s and Bing’s won’t crawl the page, but scrapers will. Find scraper IPs and block them on your web server or firewall, but verify each IP before blocking it in case legitimate traffic stumbled onto the page.
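Reviewing the logs can be scripted. The sketch below assumes an Apache/Nginx “combined”-style access log and a honeypot at /honey.html; both the log format and the `honeypot_hits` function name are assumptions you would adapt to your own server:

```python
import re
from collections import Counter

# Assumption: combined log format, e.g.
# 203.0.113.7 - - [01/Jan/2024:00:00:01 +0000] "GET /honey.html HTTP/1.1" 200 0
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (\S+)')

def honeypot_hits(log_lines, trap="/honey.html"):
    """Count requests to the honeypot page per client IP."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match.group(2) == trap:
            hits[match.group(1)] += 1
    return hits

sample = [
    '203.0.113.7 - - [01/Jan/2024:00:00:01 +0000] "GET /honey.html HTTP/1.1" 200 0',
    '198.51.100.2 - - [01/Jan/2024:00:00:02 +0000] "GET /index.html HTTP/1.1" 200 512',
]
print(honeypot_hits(sample))  # Counter({'203.0.113.7': 1})
```

Any IP that shows up in this count requested a page no legitimate crawler should touch, which makes it a candidate for manual verification and blocking.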
Scrapers Should Never Outrank Your Website
You can’t completely stop other sites from taking your content; an unscrupulous site owner can always copy it manually. However, a scraper site should never outrank yours. If one does, the most likely cause is a problem with your own SEO.
Google has hundreds of factors that rank websites, so it’s difficult to know which factor could be affecting your site. Here is a breakdown of what you can review.
- Is your content unique, useful and written for users?
- Have you or a consultant performed any link building?
- Is your content authoritative?
- Are low quality pages set to noindex?
- Is your navigation easy for users to find content and products?
These are a few issues you can review, but you might need a professional to audit the site more thoroughly.
The good news is that scrapers usually die off quickly from Google penalties and complaints to the scraper site’s host. If you see a scraper ranking ahead of you, take these steps to stop them and take the time to review your site for quality.