“Scraping” By

A friend of mine suspected that some of their blog content was being “scraped.”* Scraping generally involves taking other people’s content and re-purposing it on another website without proper attribution or links, which makes it the online version of old-school plagiarism and copyright violation. While many blogs and websites have Creative Commons licenses that allow you to use, cite, remix and otherwise share their content, these licenses often come with limitations, usually that the content be properly attributed to the original source and that it not be used for commercial purposes. This means, in a nutshell, that you can’t go to my blog, take all my content, put it in a book and sell it without express permission from me. However, you can quote a post, preferably with a link and attribution, and use it in creating your own content.

The rules of what constitutes plagiarism and copyright violation on the web have been at issue for a while now. Because so many people neither understand nor respect copyright laws, the US enacted the Digital Millennium Copyright Act (DMCA) back in 1998. (For detailed information about court cases under the DMCA to date, see the Electronic Frontier Foundation’s great website.) Wholesale use of your favorite blog’s RSS feed to fill your own site with content makes your site what is commonly referred to as a “Splog” or “Spam Blog.” While these sites are set up to try to get more web traffic and money via AdSense and other link-bait techniques, in the end, this strategy has plenty of problems.

Reasons Why Scraping is a Bad Idea

First of all, the problem with these Splogs, or Content Farms as they are also known, is that Google and other search engines are more than aware of their existence. Recent changes in Google’s search algorithm, known as the Panda Update, have been aimed at lowering the relevance and rankings of these sites compared with original content. This means if you decide to host your own splog, it may not generate nearly the traffic and cash it once did. Moreover, the incentives to really take the time to create your own relevant content just went up.

Second of all, the splog and RSS scraping strategy divides the traffic between the original content and your “fake,” i.e. non-original, content. While this may provide links to the original content, it likely serves to diminish the SEO value of both the original site and yours. This means the more you splog, the less value the content has, and on it goes.

Lastly, a scraping strategy is likely to violate the DMCA as mentioned before, opening you to a whole host of inconveniences, ranging from a takedown notice from the original creator to fines starting at $750 per occurrence and running up to $150,000. Takedown notices can also be directed to your host and ISP, which means you risk losing your website hosting account as well. Well-publicized cases of music piracy, for example, have led to people being ordered to pay more than $22,000 per song for 30 “shared” songs.
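To get a feel for how quickly those per-occurrence numbers add up, here is a rough back-of-the-envelope sketch in Python. It uses only the $750 and $150,000 figures quoted above and assumes, purely for illustration, that every copied post counts as one occurrence. The function name and the 40-post splog are made up for the example, and none of this is legal advice or a prediction of what a court would actually award.

```python
# Rough exposure range, using only the per-occurrence figures quoted in the post.
# Assumption: each copied post counts as one "occurrence" (an illustration only).
MIN_PER_OCCURRENCE = 750        # low end quoted in the post, in USD
MAX_PER_OCCURRENCE = 150_000    # high end quoted in the post, in USD


def exposure_range(occurrences: int) -> tuple[int, int]:
    """Return (lowest, highest) total in USD for a number of copied posts."""
    if occurrences < 0:
        raise ValueError("occurrences must be non-negative")
    return occurrences * MIN_PER_OCCURRENCE, occurrences * MAX_PER_OCCURRENCE


# Hypothetical splog that republished 40 posts from someone else's feed.
low, high = exposure_range(40)
print(f"40 posts: ${low:,} to ${high:,}")  # 40 posts: $30,000 to $6,000,000
```

Even at the low end, a modest splog is looking at tens of thousands of dollars, which is a lot more than the ad revenue it is likely to bring in.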

In the end, it seems to be a no-brainer that content scraping is a fairly risky and potentially expensive route to populating your website. If you are having problems coming up with useful content for your own site, I strongly suggest you pick up C.C. Chapman and Ann Handley’s fantastic book, Content Rules, and simply start creating your own content. You’ll produce better and more original material, and become an internet guru and resource in your own right. You might even sleep better at night, knowing that the lawyers aren’t out looking for you.

Like anything in life, the real answer is there are no true shortcuts. We all have to work hard. Understanding a bit about copyright rules, SEO and the like will make you a better web citizen. Creating content of your own that you can be proud of will make you a more valuable citizen to us all as well.

Resources:

SEO Chat has a great post walking you through the process of filing a DMCA complaint.

Creative Commons licensing resource

*Definition: RSS scraping is the use of full-feed RSS feeds to populate another website without providing attribution or links back to the content originator. Another form of RSS scraper may or may not provide attribution, but it also makes changes to the content: removing links, changing words, or otherwise modifying the content that was syndicated.

Definition of Scraping per Wikipedia: Web scraping (also called Web harvesting or Web data extraction) is a computer software technique of extracting information from websites.

Also Note:

“Web scraping may be against the terms of use of some websites. The enforceability of these terms is unclear. While outright duplication of original expression will in many cases be illegal, in the United States the courts ruled in Feist Publications v. Rural Telephone Service that duplication of facts is allowable. U.S. courts have acknowledged that users of “scrapers” or “robots” may be held liable for committing trespass to chattels, which involves a computer system itself being considered personal property upon which the user of a scraper is trespassing.”
