There has been a great deal of discussion recently about AI developers scraping copyrighted material from websites and using it to train AI. Some have said that prohibiting AI from learning from copyrighted material would be a death sentence for AI. One recent kerfuffle was caused by AI being able to convert user-submitted photos to images in the style of the well-known Ghibli anime studio.
One thing that appears to be slipping through the cracks, however, is that just about everything uploaded to cyberspace has a copyright that is held by someone. It’s not just famous authors’ works, or images and movies that are commercially produced. It’s just about everything written or created and uploaded into cyberspace, including material that has been uploaded unlawfully.
Yes, in some jurisdictions the content needs to be registered in order to bring civil litigation, but most everything in cyberspace has a copyright holder, including huge amounts of material that has been stolen and unlawfully published. And a copyright notice is generally considered sufficient to at least indicate the intent of the rightholder. The actual establishment of copyright does not generally require registration. It comes into being the moment a work is committed to a tangible form, and that includes online content.
People appear to have become accustomed to (and, by their silence, permissive of) theft and unlawful publishing of copyrighted material in cyberspace, and in particular on social media platforms. Perhaps this has made the general public more willing to accept, or be resigned to, the next step in the enterprise of stealing content: scraping of content for AI learning.
As someone who has twice had a considerable amount of my company’s webpage content stolen and unlawfully published by thieves in China, I find this a matter of personal concern.
So, what about protecting content from scraping? When I searched around for methods to do that recently, I found some strategies. However, I also found a lot of strategies for defeating those protective strategies. Essentially, there were people telling others how to scrape websites without being detected or blocked.
It could be that the only ultimate solution is not to publish anything in cyberspace that you don’t want stolen and unlawfully used. If you want to learn what someone has written, thinks, or creates, you might ultimately need to ask them for the related content. The proliferation of unlawfully published or otherwise used material looks like it is taking the Internet in a direction not envisioned by its creators.