Discussion
splitbrainhack: -1 for the name
Imustaskforhelp: I wish there were some regulation forcing companies that scrape for profit to reveal who they are to the websites they hit. Many new AI companies don't seem to respect any decision made by the person who owns the website and shares their knowledge for other humans, only for it to get distilled for a few cents.
GaggiX: These projects are the new "To-Do List" app.
QuantumNomad_: https://en.wikipedia.org/wiki/Miasma_theory

Seems a clever and fitting name to me. A poison pit would probably smell bad. And at the same time, the theory that this tool would actually cause “illness” (bad training data) in AI is not proven.
madeofpalk: Is there any evidence or hints that these actually work? It seems pretty reasonable that any scraper would already have mitigations for things like this as a function of just being on the internet.
sd9: Even if it did work, I just can't bring myself to care enough. It doesn't feel like anything I could do on my site would make any material difference. I'm tired.
meta-level: Isn't posting projects like this the most visible way to report a bug and get it fixed as soon as possible?
suprfsat: "disobeys robots.txt" is more of a feature
20k: I definitely get this. The thing that gives me hope is that you only need to poison a very small % of content to damage AI models pretty significantly. It helps combat the mass scraping, because a significant chunk of the data they get will be useless, and it's very difficult to filter it by hand.
rvz: > Be sure to protect friendly bots and search engines from Miasma in your robots.txt!

Can't the LLM scrapers just ignore robots.txt or spoof their user agents anyway?
nubg: What kind of mitigations? How would you detect the poison fountain?
imdsm: Applied model collapse
snehesht: Why not simply blacklist or rate-limit those bot IPs?
avereveard: style="display: none;" aria-hidden="true" tabindex="1"

Many scrapers already know not to follow these, as it's how sites used to "cheat" PageRank by serving keyword soups.
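For concreteness, the technique being described is a honeypot link along these lines (the `/miasma/` path is made up for illustration; a real deployment would use whatever path the trap serves):

```html
<!-- Hypothetical honeypot link: hidden from sighted users via CSS,
     flagged as irrelevant to assistive tech, and kept out of keyboard
     tab order, but still present in the raw HTML that naive scrapers
     parse and follow. -->
<a href="/miasma/entry" style="display: none;" aria-hidden="true" tabindex="-1">archive</a>
```

As the comment notes, scrapers that check these attributes before following links will skip it, which is exactly why the detection question above matters.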
phoronixrly: It does work, on two levels:

1. Simple, cheap, easy-to-detect, badly-behaved bots will scrape the poison and feed links to expensive-to-run browser-based bots that you can't detect in any other way, thus poisoning them too.

2. Once you see a browser visit a bullshit link, you insta-ban it: you now know it is a bot, because it could only have found that link through the poisoned data.

My personal preference is using iocaine for this purpose though, in order to protect the entire server as opposed to a single site.
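The "insta-ban" idea in point 2 can be sketched roughly as follows. This is a hypothetical illustration, not how Miasma or iocaine actually implement it; the `/miasma/` bait prefix is an assumed path:

```python
# Sketch of honeypot-based banning: any client that requests a bait URL
# (reachable only via hidden/poisoned links) is treated as a bot and
# has its IP banned for all future requests.

BAIT_PREFIX = "/miasma/"  # hypothetical trap path, never linked visibly
banned_ips = set()

def handle_request(ip: str, path: str) -> bool:
    """Return True if the request should be served, False if blocked."""
    if ip in banned_ips:
        return False
    if path.startswith(BAIT_PREFIX):
        # A human following normal navigation never lands here, so this
        # client must have been following the poisoned links. Ban it.
        banned_ips.add(ip)
        return False
    return True
```

A normal visitor is served as usual; the first request to a bait path gets the client blocked, including for legitimate-looking pages afterwards.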
GaggiX: Because the internet is noisy and not up to date, all recent LLMs are trained using Reinforcement Learning with Verifiable Rewards. If a model has learned the wrong signature for a function, for example, that becomes apparent when the code is executed.
phoronixrly: Well-behaved agents will obey robots.txt and not fall into the trap.
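For example, a robots.txt along these lines would steer compliant crawlers away from the trap while misbehaving bots, which ignore the file, still fall in (the `/miasma/` path is hypothetical; the real path depends on where the tool is mounted):

```
# Keep well-behaved crawlers out of the poison pit
User-agent: *
Disallow: /miasma/
```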
aduwah: There are way too many to do that
obsidianbases1: Why do this though? It's like if someone was trying to "trap" search crawlers back in the early 2000s. Seems counterproductive.
xprnio: If you have real traffic and bot traffic, you still need to identify which is which. On top of that, bots very likely don't reuse the same IPs over and over again. If we knew ahead of time all the IPs used only by bots, then yes, blacklisting would be simple. It's simple in theory, but identifying what to blacklist in the first place is the part that isn't.
Forgeties79: Web crawlers didn't routinely take down public resources or use the scraped info to generate facsimiles that people are still having ethical debates over. Their presence barely even registered, and the indexing actually helped sites. It isn't remotely the same thing.

https://www.libraryjournal.com/story/ai-bots-swarm-library-c...
phyzome: Because punishment for breaking the robots.txt rules is a social good.
bilekas: Because of bots that don't respect robots.txt. If you want an AI bot to crawl your website while you pay for that bandwidth, then you won't use the tool.
tasuki: > If you have a public website, they are already stealing your work.I have a public website, and web scrapers are stealing my work. I just stole this article, and you are stealing my comment. Thieves, thieves, and nothing but thieves!