Some Thoughts on Theft and Unlawful Use of Copyrighted Content
by William Lise
(June 8, 2025)
Executive Summary: The problem with data scraping is not limited to well-known works.
The Recent Data-scraping Kerfuffle
There has been a great deal of discussion recently about AI developers scraping copyrighted material from websites and the like and using it to train their AI. The scraped material is not limited to textual content. One recent kerfuffle was caused by AI being used to convert user-submitted photos to images in the style of the well-known Ghibli anime studio. Other complaints involve scraping of the voices of performers or narrators.
Some people in the AI business have said that prohibiting AI from learning from copyrighted material would be a death sentence for AI. I would not mourn that death, but that is my personal view.
Perhaps more important, however, is that all the focus on well-known copyrighted works obscures the reality that just about everything uploaded to cyberspace has a copyright that is held by someone. It's not just the works of famous authors or images and movies that are commercially produced. It's almost everything written or created and uploaded into cyberspace, including the huge amount of material that has been uploaded unlawfully by someone neither holding the copyright nor having permission to do so.
Copyright happens instantly.
Yes, in some jurisdictions a work needs to be registered before its owner can bring civil litigation, but almost everything in cyberspace has a copyright holder, and a copyright notice is generally considered sufficient to at least indicate the intent of the rightsholder. Copyright itself does not generally require registration. It comes into being the moment a work is fixed in a tangible form, and that includes online content.
People appear to have become accustomed to, and by their silence tolerant of, theft and unlawful publishing of copyrighted material in cyberspace, and in particular on social media platforms. Perhaps this has made the general public more willing to accept, or be resigned to, the next step in the enterprise of stealing content: the scraping of content for AI learning.
An underlying and deliberately exploited aspect of this theft is the confusion between "public" and "public domain." Uploading something to a website or a social media platform makes it public, but does not render it public domain, which is something quite different. Having twice had a considerable amount of my company's webpage content stolen and unlawfully published by thieves in China, I find this a matter of personal concern.
Is there a way to protect your content?
So, what about protecting content from scraping? When I recently searched for ways to do that, I found some protective strategies. However, I also found a lot of strategies for defeating those protective strategies. Essentially, there were people telling others how to scrape websites without being detected or blocked.
It could be that the only real solution is not to publish anything in cyberspace that you don't want stolen and unlawfully used. If you want to learn what someone has written, thinks, or creates, you might ultimately need to ask them for the related content. That essentially defeats the spirit of the Internet as originally envisioned, and the proliferation of unlawfully published or used material looks like it is taking the Internet in a direction not envisioned by its creators.