
Screen Capers with Copy Scrapers (sploggers)

The online world is dominated by search engines and internet portals. The three biggest, Google, Yahoo and Microsoft Live, determine what content a site has by using software robots to surf the web and “crawl” (like a spider) for content. The spider reads the content of the website (the words or copy that you write) and returns it to the search engine for processing. When a web user comes along, they type in keywords and get a list of search results based on the data the software robots have retrieved for the search engine.

What happens if an unscrupulous third party, unable or unwilling to generate meaningful content of their own, decides to scrape the content of your site (literally word for word and image by image) and put it on theirs? I’m not talking about fair use here. I’m talking about websites set up by script kiddies with the sole intent of searching out your content, stripping it and republishing it as their own without your permission, all for the purpose of duping search engines into sending traffic to their site (and so increasing their potential advertising click-through revenue).

The search engine then has to decide which site is the legitimate one and which one holds the pirated content. It is widely understood that Google in particular despises duplicated content and devalues both sites carrying the ripped content. The original author suffers as a result of being copy scraped.

Copy scraping isn’t so much an issue with older websites served as static HTML, but modern content management systems and blog platforms such as WordPress, Blogger and TypePad are all susceptible because they freely expose their content via automated electronic “feeds”. The purpose of the “feed” is to let legitimate users consume content from multiple websites easily, without having to visit each one in turn. On the other hand, it gives script kiddies an easy mechanism to rip off people’s work. Since they won’t be hosting your images themselves, every image in a scraped post is still served from your own server, so they’re stealing your bandwidth too.
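To make that concrete, here is a minimal Python sketch of how little work it takes to read a full-content feed. The feed text, the `blog.example.com` addresses and the `read_items` helper are all made up for illustration; this is not the code of any particular reader or scraper. It also shows that the image address inside a copied post still points at the original server.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical full-content RSS 2.0 feed (all names and URLs are placeholders).
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://blog.example.com/first-post</link>
      <description>&lt;p&gt;Full text.&lt;/p&gt;&lt;img src="http://blog.example.com/img/pirate.png"&gt;</description>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml):
    """Return title, link, body and image URLs for every item in the feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        body = item.findtext("description", default="")
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "body": body,
            # Image URLs are copied as-is, so they still point at the author's server.
            "images": re.findall(r'<img[^>]+src="([^"]+)"', body),
        })
    return items


if __name__ == "__main__":
    for entry in read_items(SAMPLE_FEED):
        print(entry["title"], entry["images"])
```

A legitimate feed reader does exactly this and shows the result to one person. A splog does the same and republishes the body on a public page, images and all.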

It saddens me to learn today that someone who should know better, a computer studies student from America, has set up a site and added all the 30day challenge blogs to it (presumably via the master feed). So as each author contributes their hard work, this kid gets a copy for free.

edit: The owner of the offending website has promised to remove the copyrighted content and place his RSS reader behind an authenticated system to make it private for personal use only. As a gesture of goodwill, I have removed the screenshots of his website identity from this post.

Irritatingly, the article I’ve just laboured over has itself been taken without my permission, which is more than annoying.

Depending on what content platform you use (WordPress, another CMS, Blogger, etc.), you will need to take action in a different way. If you’re a WordPress user, you’re in luck: there’s a relatively simple first line of defence against this problem. A plugin called “AntiLeech” by Owen Winkler lets you serve “fake” content when your feed is read from certain IP addresses (such as 74.86.144.164) or by certain RSS readers. One common blog scraping tool, “RSS Bandit”, identifies itself as “Bandit” or “RSS Bandit” via HTTP when connecting to your blog. The plugin does all the work of removing the links from your post and generating text that explains splogging (the stealing of content from blogs), optionally with a link back to the original website.
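The general idea, choosing what to serve based on who is asking, can be sketched in a few lines of Python. This is not AntiLeech’s actual code (the plugin is a WordPress plugin and I haven’t reproduced its internals). The function names, the notice text and the assumption that the reader announces itself in the HTTP `User-Agent` header are mine; only the IP address and the “Bandit” name come from the paragraph above.

```python
import html
import re

# Values taken from the article; extend these lists for your own site.
BLOCKED_IPS = {"74.86.144.164"}
BLOCKED_AGENTS = [re.compile(r"bandit", re.IGNORECASE)]

# Hypothetical replacement text served to blocked requesters.
NOTICE = ("<p>This post was republished from a feed without the "
          "author's permission (splogging).</p>")


def is_blocked(remote_ip, user_agent):
    """True if the request comes from a listed IP or a listed reader."""
    if remote_ip in BLOCKED_IPS:
        return True
    return any(p.search(user_agent or "") for p in BLOCKED_AGENTS)


def feed_body(post_html, post_url, remote_ip, user_agent, link_back=True):
    """Return the real post for normal readers and a notice for blocked ones."""
    if not is_blocked(remote_ip, user_agent):
        return post_html
    body = NOTICE
    if link_back:
        # Escape the URL before placing it in an attribute.
        safe_url = html.escape(post_url, quote=True)
        body += '<p>Read the original at <a href="%s">%s</a>.</p>' % (safe_url, safe_url)
    return body


if __name__ == "__main__":
    print(feed_body("<p>real post</p>", "http://blog.example.com/p",
                    "10.0.0.1", "RSS Bandit/1.5"))
```

Bear in mind that both the IP address and the reader’s self-reported name are easy for a determined scraper to change, which is why this is only a first line of defence.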

I’ve configured this plugin to block this guy from ripping my posts, and since he’s reading the 30day challenge aggregated feed, I’m posting this post to that feed. If you’re using WordPress, you need this plugin. If you’re using something else, use Google to search for the preferred anti-splogging or anti-copyright-theft tool for your own platform. If you don’t, you’ll suffer in the search engine rankings: people looking for your content will find it on someone else’s site, a million miles away from you, and you’ll pay for serving it without ever knowing about it (other than through your higher bandwidth bill).


15 Comments

  1. Sohail

    I’m not sure I see anything wrong happening here. Depending on how you look at it, I’m doing the same thing with the 30 day master feed. I think you might be reading more into it than is there. Usually when you pirate something, you hope to make some money from it. I see no ads on that site (well, maybe AdBlockPlus is doing its work!). Anyway, from the looks of it, it’s just this one guy’s blogroll. This is free advertising for you!

  2. mike

    There are a couple of issues.

    Issue #1 – It’s not yet free advertising. There are no links to my blog, site or services (well, unless someone else blogs and gets caught up in this spam blog with a link to me).

    Issue #2 – It’s not an excerpt with a link to the full article on my site. This would be acceptable.

    Issue #3 – It devalues the pagerank of both sites, as Google sees multiple copies of the same page.

    Issue #4 – A site caught splogging (or a master site) can accidentally or on purpose end up on a blacklist, which again will devalue one or more of the sites (see Issue #3 above).

    Issue #5 – Hotlinking / Leeching of bandwidth. Bandwidth is expensive. If the images are img src’ed to my server, I’m paying this guy’s bandwidth bill.

    Issue #6 – It’s illegal.

  3. mike

    I should say, I don’t mind you (as another 30 dayer) using some or any of the content of my 30 day posts on your site, as long as you keep a link or credit to me in the process (and don’t hotlink my images – I’m going to implement some anti-hotlink scripts shortly). I’ll do the same for you.

    It’s these no-good spammers that I really can’t stand.

  4. Sohail

    Ok. Well, I’m not doing it anyway. I just aggregated the feed, so whatever your RSS feed shows is what my feed shows.

  5. Excuse me? You’re way off base buddy.

    That’s my private RSS feed reader. It just so happens it’s web based. Until today, I’ve never given the url out to anybody. Why didn’t you contact me to clear it up? It gets maybe 3 uniques per month.

    Feel free to email me, and I don’t think your block worked, as I read this post on my feed. I can add it to my robots.txt so Google doesn’t flag you for duplicate content.

    It’s not illegal, in any case. So email me, and I’d prefer if you took the url down and we’ll work something out.

  6. mike

    Andrew,

    Very quickly (as I am on a very tight schedule at the moment), thanks for getting in touch.

    I’m not as “off base” on this as you appear to be.

    Your definition of “private”, for someone studying a computing course, is way off. What you think of as “private” is in fact rather “PUBLIC”. Otherwise, how was I so easily notified of it via search engines and other RSS services picking up on the content (within moments, no less, of making a post)?

    Copying content without permission IS ILLEGAL IN EVERY CASE. I don’t have time to spend explaining this to you.

    As for putting this right:

    1. Removing the content that you are hosting without permission is the first step.

    2. An apology is usually due in these circumstances, don’t you think?

  7. mike

    As the splog is based in California, it falls under the jurisdiction of the Government of California and their copyright legislation, which can be found here:

    http://www.universityofcalifornia.edu/copyright/usingcopyrightedworks.html

    It also has some very interesting articles on how to go about obtaining permission to use copyrighted works and general educational materials on Intellectual Property.

  8. mike

    I think in this instance the author of the offending site was not malicious, but just wasn’t well educated on copyright infringement.

    Your article also reminds me that I have an article half-written on why intellectual property laws are important and why the “software should be free” hippies (not to say all open source is hippy) have it totally wrong…

  9. mike

    I did just receive a rather rude email from the copyright offender, which I will not publish here. However, the misunderstanding seems to be on the part of the website author, who believes (mistakenly) that the mere existence of a mechanism confers all rights, i.e. that because an RSS feed exists, it is legal to pull from it and publish the works on a website that is accessible to the public, whether by intention or not.

    Permission to consume an RSS feed is given for consumption only. Those that wish to follow the 30 day challenge or any other RSS feed are welcome to do so. Those that republish an RSS feed (and this goes for ANY website) or ANY content without permission, inside a country which has subscribed to the well documented Copyright and Intellectual Property laws, are committing a criminal offense. No debate.

    I have already listed above in my article and in the comments the reasons why this is a practical problem that genuinely hurts small businesses and individuals.

    Although I can prepare a detailed article explaining why intellectual property laws exist, the heart of the issue is simpler: it is rude and obnoxious to steal and republish (i.e. make publicly available) content. Even when this is done through negligence rather than malice, it still causes damage.

    I am putting a little of my energy today into doing my bit towards providing some education and debate on why copyright infringement is a real problem, especially for Micro ISVs.

    And for the record, I do contribute to the open source community and even provide free closed-source software, and these movements have their merits too.

  10. Oskar

    A lot of people seem to think that copyright is about advertising and making money, so if they aren’t making money from it they somehow delude themselves that it’s okay to rip off someone else’s hard work. But copyright is just what it says, the right to copy. If it’s not yours, you don’t have the right to copy it unless you get permission from the person who owns that right. It’s simple enough.

  11. Hi Mike,

    This is one of the sad things about the internet and the “modern era”. IP is so misunderstood and in many cases blithely violated. People will put up any justification they can think of when they do it, like:

    “I can’t afford it.”
    “If it’s on the web it’s free”
    “If you publish it on the internet it’s public domain”
    “Everybody is doing it”

    And so on ad nauseam. Simply put, it’s very sad. The media and ISPs have done *nothing* to help this state of affairs and in many quarters a good deal to encourage these kinds of myths.

    Good for you on cutting the splogger out and for notifying the rest of us. I wonder if said splogger would feel as “free” with IP if it was his term paper being bandied about as “freely” or his “thesis” if he gets to that level of education.
