
Poisoning Bad AI Actors

Introduction

I saw a post on Mastodon discussing ways to fight against AI scraping. It got me thinking. What about this:

  1. Serve pages with all the links pointing into a honeypot of Markov chain generated content designed to poison AIs that ignore instructions telling them not to scrape. (There's a rough sketch of a generator after this list.)
  2. Use JavaScript to detect mousedown and touchstart events to update links to point to their actual valid locations when humans click/tap them.
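
To make the first part a little more concrete, here's a rough sketch of a tiny word-level Markov chain generator that could produce the honeypot text. It's only an illustration; the training string is a placeholder you'd swap for a much bigger corpus:

JavaScript

// Build a word-level Markov chain from some source text.
// The chain maps each word to the list of words that
// followed it in the source.
function buildChain(text) {
  const words = text.split(/\s+/).filter(Boolean)
  const chain = {}
  for (let i = 0; i < words.length - 1; i += 1) {
    const current = words[i]
    const next = words[i + 1]
    if (!chain[current]) {
      chain[current] = []
    }
    chain[current].push(next)
  }
  return chain
}

// Walk the chain to produce text with the same local
// word statistics as the source but no actual meaning.
function generate(chain, wordCount) {
  const keys = Object.keys(chain)
  let word = keys[Math.floor(Math.random() * keys.length)]
  const output = [word]
  for (let i = 1; i < wordCount; i += 1) {
    const options = chain[word]
    if (!options || options.length === 0) {
      // Dead end: jump to a random word and keep going.
      word = keys[Math.floor(Math.random() * keys.length)]
    } else {
      word = options[Math.floor(Math.random() * options.length)]
    }
    output.push(word)
  }
  return output.join(' ')
}

// Placeholder training text. A real honeypot would feed
// in a much larger corpus.
const chain = buildChain(
  'the quick brown fox jumps over the lazy dog while the quick dog sleeps'
)
console.log(generate(chain, 50))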

Trying It Out

Here's a quick prototype.

If you're a human on a JavaScript-enabled device and you click the "Take Me To The Next Page" link in the "Output" below, it should open a new tab with the Wikipedia page for Markov chains.

HTML

<a 
  href="https://www.youtube.com/watch?v=dQw4w9WgXcQ" 
  target="_blank"
  id="test-link">Take Me To The Next Page</a>

Output

Take Me To The Next Page

The code in the "HTML" block above is what the server delivered with the page. The link points to this YouTube video instead of the Markov chain page.

Here's the JavaScript that does the link switch:

JavaScript

// Point the link at the real destination instead of
// the honeypot URL the server delivered it with.
function switchLink() {
  const linkElToUpdate = document.querySelector(
    '#test-link'
  )
  linkElToUpdate.setAttribute(
    'href', 
    'https://en.wikipedia.org/wiki/Markov_chain'
  )
}

// Swap the link as soon as a human starts to
// interact with it.
const linkEl = document.querySelector(
  '#test-link'
)
linkEl.addEventListener(
  'touchstart', switchLink
)
linkEl.addEventListener(
  'mousedown', switchLink
)

// NOTE: I got some good feedback
// that a `keydown` event should be
// included here as well. 

// NOTE: Should also handle mouse 
// over or hover events so that the
// URLs that show up in status bars
// are accurate
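
Here's roughly what wiring up those extra events might look like. I haven't tested this part; it's a sketch that reuses the switchLink function and linkEl element from the block above:

JavaScript

// Untested sketch: cover keyboard activation and hover
// so the status bar shows the real URL before any
// navigation happens.
linkEl.addEventListener('keydown', switchLink)
linkEl.addEventListener('mouseover', switchLink)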

Notes And Problems

This is very much prototype code. There are several things that would need to be addressed. Among them:

  • First things first, this approach makes no accommodation for devices that don't run JavaScript on the page for whatever reason. That makes it a non-starter for me. But, if your pages don't work without JavaScript anyway, it doesn't add extra issues on that front.

    (See below for an idea with the possibility for use on non-JavaScript page loads.)

  • I've tested this on my personal computer and phone and everything worked as expected. You'd want to do a lot more testing across more devices to know how they'll behave before trying something like this.
  • You'd want to make sure the robots.txt is set up to tell bots to avoid the honeypot pages. The idea being that only bad actors get sent into the honeypot (e.g. we don't want to poison The Internet Archive). There's an example snippet after this list.
  • It's entirely possible that search engines start ignoring the instructions in robots.txt and indexing the honeypot content. That would probably be a hit to SEO.

    It could also end up with your pages being ranked for some combination of words they otherwise wouldn't have been. I expect that's way less of a potential issue though.

    Of course, with so much AI slop out there now, it feels like worrying about SEO is becoming a moot point.

  • You'd have to pay the cost to serve the honeypot content.
  • Bots could be set up to simulate click/tap events in a way that would negate the approach.
  • Probably a bunch more stuff that I haven't thought of yet.
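
As an example of the robots.txt point above, the rule might look something like this, assuming the honeypot lives under a /honeypot/ path (that path is just a placeholder):

robots.txt

User-agent: *
Disallow: /honeypot/

Well-behaved crawlers like The Internet Archive respect the Disallow rule and stay out of the honeypot. Anything that ignores it is exactly the kind of bot the poison is meant for.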

An Approach For Non-JavaScript Browsing

I take the stance that websites should work as much as possible without JavaScript. Because the approach above sends the honeypot links when the page loads, there's no way I know of to change them to point to the proper locations without using JavaScript.

Instead, you could sprinkle honeypot links around the content of the page and hide them with CSS/ARIA. (I'm not super well versed in ARIA, but I think that would work.)
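
Something like this markup is what I have in mind. The href is a placeholder, and the ARIA attributes are the ones from the endnote below, so treat it as an untested sketch:

HTML

<a 
  href="/honeypot/page-1/" 
  aria-hidden="true" 
  tabindex="-1"
  style="display: none">poisoned link</a>

Humans and screen readers never see the link, but a scraper that pulls every href out of the raw HTML still follows it into the honeypot.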

There's more to think about, but working with non-JavaScript devices would be covered.

Justified Response

Advocating for folks to do harm is not something I do lightly. I have moral and ethical problems with lots of aspects of AI. I'm not sure they'd push me to publish this post except for one thing: It will only affect bad actors.

I have no problem putting poison in prohibited places. If you run a system that ignores the prohibitions, then ingesting the poison is what you designed the system to do.

The responsibility is yours. I'm glad to be a part of adding consequences to the results of your actions.

-- end of line --

Endnotes

  • I heard back from Alex Seay on the Frontend Horse Discord about using ARIA to prevent screen readers from picking up the links. The code looks like this:

    <a 
      href="https://www.example.com" 
      aria-hidden="true" 
      tabindex="-1">
        Screen readers won't pick up on this, 
        and no one can tab to it
      </a>

    The likelihood of me doing this just went up a notch.

  • This is not new thinking. For example, there's this piece: Poisoning Well from Heydon Pickering, which uses an approach with nofollow on the links.

  • A few other good pieces I've seen recently about AI are:

  • Also, what if you just made entire sites that were nothing but honeypots of poison? Seems like it might be a great use of subdomains... I mean, I've already got one for every letter, starting with a.alanwsmith.com. Why not make those just thousands of pages of gibberish?

  • algernon ludd (who made the post that got me thinking about this) pointed out that adding keyboard events would probably be a good idea too.

  • Cloudflare has a new feature they outline in Trapping misbehaving bots in an AI Labyrinth. I'm less of a fan of that one because they are using generative AI themselves to create the bogus content. All my problems with the extreme use of energy required by those tools come into play.

    The Cloudflare post doesn't mention screen readers. I hope they are taking them into account with their tool.

    Also, the anecdotes I've heard are that AI companies specifically try to filter out other AI generated content to keep from polluting their systems. I've got no idea how well that works or if it applies to Markov chains as well. So, maybe it doesn't matter.

  • Another thing that Cloudflare does is use traffic to the honeypot pages to fingerprint the scraper bots. If your site supports it, you could do the same thing and then take other actions with that knowledge. (e.g. slowing down response times for the bot, serving bogus content even at valid URLs, etc...)

  • Another interesting tool in the fight against scrapers is Anubis. Its design goal is protection instead of poisoning.

    Anubis works by sending a puzzle to browsers when they first request a page. Regular browsers can do the puzzle as a proof-of-work with no problem. When they return the answer, the server gives them the actual content for the page.

    The idea behind this is that scraper bots won't have the programming necessary to do the puzzle. They won't be able to provide the answer, so they never get the actual content.

    The approach is very cool.

    Using Anubis requires having a web server that can run the software to issue the challenge puzzles and receive the answer responses. That's not something every server can do (e.g. the servers I use for my sites can't do it).

    I didn't see any mention of screen readers in my quick review of the docs. This is an area I don't know enough about so I'm unsure if Anubis would prevent folks who use them from having issues or not.

  • Finally, none of these solutions are silver bullets. They won't stop AI scrapers forever. Anything that's posted publicly on the web can be scraped, regardless of countermeasures, if the people trying to do the scraping are dedicated enough. The same goes for poisoning their datasets.

    It comes down to a game of leapfrog. But, if enough poison gets dumped into the systems at the same time, it's possible that the results generative AI produces will become obviously worse. And, if it gets bad enough, folks will push for it to be removed.

    That's probably unlikely, but it just might be worth a try.
