The online world is dominated by search engines and internet portals. The three biggest, Google, Yahoo and Microsoft Live, determine what content a site has by using software robots to surf the web and “crawl” (like a spider) for content. The spider reads the content of the website (the words or copy that you write) and returns it to the search engine for processing. When a web user comes along, they type in keywords and get a list of search results based on the data the software robots have retrieved for the search engine.
What happens if an unscrupulous third party, unable or unwilling to generate meaningful content of their own, decides to scrape (literally word for word and image by image) the content of your site and put it on theirs? I’m not talking about fair use here. I’m talking about websites set up by script kiddies with the sole intent of searching out your content, stripping it and republishing it as their own without your permission. They do it to dupe search engines into sending traffic to their site, which increases their potential advertising click-through revenue.
The search engine then has to decide which site is the legitimate one and which one holds the pirated content. It is widely understood that Google in particular despises duplicated content and devalues both sites that carry it, so the original author suffers as a result of being scraped.
Copy scraping isn’t so much an issue with older websites served as static HTML, but modern content management systems and blogging platforms such as WordPress, Blogger and TypePad are all susceptible because they freely expose their content via automated electronic “feeds”. The purpose of the “feed” is to let legitimate users consume content from multiple websites easily, without having to visit each one in turn. On the other hand, it gives script kiddies an easy mechanism to rip off people’s work. Since they won’t be hosting your images, any images in your posts will still be served from your own servers, so they’re stealing your bandwidth too.
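To show why a feed is such an easy target, here is a minimal sketch in Python of what any feed reader does with it. The sample feed, the example.com URLs and the `read_items` helper are all made up for illustration, and I’m assuming a plain RSS 2.0 layout; real feeds carry more fields, but the principle is the same: every post arrives already split into a title, a link and a body.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-post feed, assumed to follow the basic RSS 2.0 layout.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first-post</link>
      <description>Full text of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second-post</link>
      <description>Full text of the second post.</description>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml):
    """Return the title, link and body of every <item> in an RSS document."""
    root = ET.fromstring(feed_xml)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "body": item.findtext("description", default=""),
        }
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    for post in read_items(SAMPLE_FEED):
        print(post["title"], "->", post["link"])
```

A legitimate reader displays those fields to one person. A scraper runs the same few lines and republishes the result.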
It saddens me to learn today that someone who should know better, a computer studies student from America, has set up a site and added all the 30day challenge blogs (presumably via the master feed) to it. So as each author contributes their hard work, this kid gets a copy for free.
edit: The owner of the offending website has promised to remove the copyrighted content and place his RSS reader behind an authenticated system to make it private for personal use only. As a gesture of goodwill, I have removed the screenshots of his website identity from this post.
Irritatingly, the article which I’ve just laboured over has been copied without my permission, which is more than annoying.
Depending on what content platform you use (WordPress, a CMS, Blogger, etc.), you will need to take action in a different way. If you’re a WordPress user, you’re in luck. There’s a relatively simple first line of defence that you can establish in order to neutralise this problem. A plugin called “AntiLeech” by Owen Winkler allows you to serve “fake” content when your feed is read from certain IP addresses (such as 22.214.171.124) or via certain RSS readers. One tool commonly used for blog scraping is “RSS Bandit”, which identifies itself as “Bandit” or “RSS Bandit” via HTTP when connecting to your blog. The plugin does all the work of removing the links from your post, generating some text that explains splogging (the stealing of content from blogs) and optionally adding a link back to the original website.
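If you’re curious what that kind of filtering looks like, the idea can be sketched in a few lines of Python. To be clear, this is not AntiLeech’s code (the plugin is written for WordPress and I haven’t reproduced it here); it is only my rough illustration of the approach described above. The block lists, the function names and the wording of the notice are all assumptions of mine, and the IP address is a reserved documentation address rather than a real scraper.

```python
import re

# Hypothetical block lists; in practice you would fill these from your logs.
BLOCKED_IPS = {"203.0.113.7"}
BLOCKED_AGENT_TOKENS = ("bandit",)

# Assumed wording; the real plugin generates its own explanatory text.
NOTICE = (
    "This post was taken from another blog without permission. "
    "Republishing other people's posts like this is known as splogging."
)


def is_scraper(ip, user_agent):
    """True if the request comes from a blocked IP or a blocked feed reader."""
    if ip in BLOCKED_IPS:
        return True
    agent = user_agent.lower()
    return any(token in agent for token in BLOCKED_AGENT_TOKENS)


def strip_links(html):
    """Replace every <a ...>text</a> with just its text."""
    return re.sub(r"<a\b[^>]*>(.*?)</a>", r"\1", html,
                  flags=re.IGNORECASE | re.DOTALL)


def feed_body(post_html, post_url, ip, user_agent, link_back=True):
    """Return the normal post for readers and a decoy version for scrapers."""
    if not is_scraper(ip, user_agent):
        return post_html
    decoy = "<p>" + NOTICE + "</p>" + strip_links(post_html)
    if link_back:
        decoy += '<p>Original: <a href="{0}">{0}</a></p>'.format(post_url)
    return decoy
```

Matching on the user agent is only a first line of defence, since a scraper can claim to be any reader it likes, which is why blocking by IP address is offered alongside it.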
I’ve configured this plugin to block this guy from ripping my posts, and since he’s reading the 30day challenge aggregated feed, I’m posting this post to that feed. If you’re using WordPress, you need this plugin. If you’re using something else, use Google to search for the preferred anti-splogging or anti-theft copyright tool for your own site. If you don’t, you’ll suffer in the search engine rankings. People looking for your content will find it and you’ll pay for serving it, but they’ll be reading it a million miles away from you and you’ll never know about it (other than from your higher bandwidth bill).