Opting Out of AI Training Datasets

It is now clear that AI models are trained on data, text, and images scraped from the web by programs called crawlers.

I won’t go into the details here, but the key point is that, if you have a public online presence like a website, a blog, or a public social media profile, chances are that the content you publish there is being used to train AI models.

These AIs generate “new” content using your content as building blocks.

Without going into the countless ethical and copyright nuances of this topic, I decided to at least try to make it clear that I don’t want my photos included in AI training datasets.

First off, I included this disclaimer in the footer of every page of my website.

No image on this website can be used to train AI models without explicit consent from the author.

Then I added these two meta tags to the HTML head of every page (I found these on DeviantArt).

<meta name="robots" content="noimageai">
<meta name="robots" content="noai">
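To check that a page really carries both tags, a small script can parse the HTML and list the robots directives it finds. This is a minimal sketch using only the Python standard library. The names `RobotsMetaParser` and `robots_meta_directives` are made up for this example, and it assumes the directives appear in `<meta name="robots">` tags, either one per tag or comma-separated.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        # A single tag may hold several comma-separated directives.
        for part in attr_map.get("content", "").split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_meta_directives(html):
    """Return the set of robots directives declared in the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = (
    '<html><head>'
    '<meta name="robots" content="noimageai">'
    '<meta name="robots" content="noai">'
    '</head><body></body></html>'
)
print(sorted(robots_meta_directives(page)))  # ['noai', 'noimageai']
```

The script only confirms that the tags are present in the markup. It says nothing about whether any crawler reads or respects them.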

Because my web hosting setup easily allows for it, I also made sure that the web server returns these response headers on each and every resource returned to the browser: HTML pages, images, everything, just in case.

X-Robots-Tag: noai
X-Robots-Tag: noimageai
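A similar check works for the response headers. The sketch below, again standard-library Python, collects the values of every `X-Robots-Tag` header and reports which of the two opt-out directives are missing. The function names and the `AI_OPT_OUT` constant are assumptions made for this example. It also ignores the optional user-agent prefix that `X-Robots-Tag` values may carry, since the headers above don't use one.

```python
import urllib.request

# The two opt-out directives used on this site (non-standard, as noted above).
AI_OPT_OUT = {"noai", "noimageai"}


def x_robots_directives(header_values):
    """Flatten a list of X-Robots-Tag header values into a set of directives."""
    directives = set()
    for value in header_values:
        # One header line may carry several comma-separated directives.
        for part in value.split(","):
            part = part.strip().lower()
            if part:
                directives.add(part)
    return directives


def missing_opt_out(header_values):
    """Return the opt-out directives that are NOT present in the headers."""
    return AI_OPT_OUT - x_robots_directives(header_values)


def fetch_x_robots_values(url):
    """Send a HEAD request and return all X-Robots-Tag header values."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers.get_all("X-Robots-Tag") or []


# No network needed to try the parsing logic:
print(missing_opt_out(["noai", "noimageai"]))  # set()
```

To test a live resource, pass its URL to `fetch_x_robots_values` and feed the result to `missing_opt_out`. Images and other non-HTML files are worth checking too, since the meta tags can't cover them.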

What about social media profiles?

Well, there’s no solution there. Everything is in the hands of each social media platform: until they decide to add this kind of opt-out flag, we’re out of luck.

Will crawlers honor these tags anyway? Probably not.

To be clear, there’s no stopping this trend, unless you have the money to sue AI companies for copyright infringement. Also, from a technical perspective, it’s impossible to prevent web content scraping in a cost-effective way.

But we can take a stance.

Adding those tags is just that: taking a stance. It’s a sort of hidden manifesto, a quiet sit-in protest. It means:

“I told you not to do it. If you do it anyway, well, shame on you.”