(June 8, 2025) Thoughts on Theft and Unlawful Use of Copyrighted Content

Executive Summary: The problem with data scraping is not limited to well-known commercial works.

The Recent Data-scraping Kerfuffle

There has been a great deal of discussion recently about AI developers scraping copyrighted material from websites and the like and using the material to train their AI. The scraped material is not limited to textual content. One recent kerfuffle was caused by AI being used to convert user-submitted photos to images in the style of the well-known Ghibli anime studio in Japan. Other complaints involve scraping of the voices of performers or narrators.

Some people in the AI business have said that prohibiting AI from learning from copyrighted material would be a death sentence for AI. I would not mourn that death, but that is my personal view.

Perhaps more important, however, is that all the focus on well-known copyrighted works tends to obscure the reality that just about everything uploaded to cyberspace has a copyright that is held by someone. It's not just the works of famous authors or images and movies that are commercially produced. It's almost everything written or created and uploaded into cyberspace, including the huge amounts of material that have been uploaded unlawfully by someone neither holding the copyright nor having permission to do so.

Copyright happens instantly.

Although in some jurisdictions works need to be registered in order to bring civil litigation for infringement, most everything in cyberspace has a copyright holder, and a copyright notice is generally considered sufficient to at least indicate the intent of the rightsholder to assert their rights. The establishment of copyright does not generally require registration. It comes into being the moment a work is committed to a tangible form, and that includes online content.

People appear to have become accustomed to, and by their silence permissive of, theft and unlawful publishing of copyrighted material in cyberspace, and in particular on social media platforms. Perhaps this has made the general public more willing to accept or be resigned to the next step, which is theft of content for use in AI training.

An underlying aspect of this theft is the purposeful confusion between public and public domain. Uploading something to a website or a social media platform makes it public, but does not place it in the public domain, which is something quite different. As someone who has twice had a considerable amount of their company webpage content stolen and unlawfully published by thieves in China, I find this of personal concern.

Is there a way to protect your content?

So, what about protecting content from scraping? When I recently searched for methods to do that, I found some protective strategies. However, I also found many tutorials, perhaps even more of them, about strategies for defeating those protective strategies. Essentially, there were people telling others how to scrape websites without being detected or blocked.

It could be that the only ultimate solution is not to publish anything publicly in cyberspace that you don't want stolen and unlawfully used. I am considering doing that with some of the things I write.

If you want to learn what someone has written, thinks, or creates, you might ultimately need to ask them for the related content or obtain a password that allows access. That arguably defeats the spirit of the Internet as originally envisioned, and the increasingly serious proliferation of unlawfully published or used material looks like it is taking the Internet in a direction surely not envisioned by its creators.

Notice

Because we have our innate ability to think and reason and our acquired ability to write, the above was written by a sentient, carbon-based entity, without the participation or involvement of AI in any manner.