this post was submitted on 12 Jul 2025

523 points (98.0% liked)

Selfhosted

53286 readers

759 users here now

A place to share alternatives to popular online services that can be self-hosted without giving up privacy or locking you into a service you don't control.

Rules:

Be civil: we're here to support and learn from one another. Insults won't be tolerated. Flame wars are frowned upon.
No spam posting.
Posts have to be centered around self-hosting. There are other communities for discussing hardware or home computing. If it's not obvious why your post topic revolves around selfhosting, please include details to make it clear.
Don't duplicate the full text of your blog or github here. Just post the link for folks to click.
Submission headline should match the article title (don’t cherry-pick information from the title to fit your agenda).
No trolling.
No low-effort posts. This is subjective and will largely be determined by the community member reports.

Resources:

selfh.st Newsletter and index of selfhosted software and apps
awesome-selfhosted software
awesome-sysadmin resources
Self-Hosted Podcast from Jupiter Broadcasting

Any issues on the community? Report it using the report flag.

Questions? DM the mods!

founded 2 years ago

MODERATORS

HybridSarcasm@lemmy.world

HybridSarcasm@lemmy.hybridsarcasm.xyz

523

Anubis is awesome! Stopping (AI)crawlbots (infosec.pub)

submitted 4 months ago* (last edited 4 months ago) by sailorzoop@lemmy.librebun.com to c/selfhosted@lemmy.world

53 comments fedilink hide all child comments

Incoherent rant.

I've, once again, noticed Amazon and Anthropic absolutely hammering my Lemmy instance to the point of the lemmy-ui container crashing. Multiple IPs all over the US.

So I've decided to do some restructuring of how I run things. Ditched Fedora on my VPS in favour of Alpine, just to start with a clean slate. And started looking into different options on how to combat things better.

Behold, Anubis.

"Weighs the soul of incoming HTTP requests to stop AI crawlers"

From how I understand it, it works like a reverse proxy per each service. It took me a while to actually understand how it's supposed to integrate, but once I figured it out all bot activity instantly stopped. Not a single one got through yet.

My setup is basically just a home server -> tailscale tunnel (not funnel) -> VPS -> caddy reverse proxy, now with anubis integrated.

I'm not really sure why I'm posting this, but I hope at least one other goober trying to find a possible solution to these things finds this post.

Anubis Github, Anubis Website

Edit: Further elaboration for those who care, since I realized that might be important.

You don't have to use caddy/nginx/whatever as your reverse proxy in the first place, it's just how my setup works.
My Anubis sits between my local server and inside Caddy reverse proxy docker compose stack. So when a request is made, Caddy redirects to Anubis from its Caddyfile and Anubis decides whether or not to forward the request to the service or stop it in its tracks.
There are some minor issues, like it requiring javascript enabled, which might get a bit annoying for NoScript/Librewolf/whatever users, but considering most crawlbots don't do js at all, I believe this is a great tradeoff.
The most confusing part were the docs and understanding what it's supposed to do in the first place.
There's an option to apply your own rules via json/yaml, but I haven't figured out how to do that properly in docker yet. As in, there's a main configuration file you can override, but there's apparently also a way to add additional bots to block in separate files in a subdirectory. I'm sure I'll figure that out eventually.

Edit 2 for those who care: Well crap, turns out lemmy-ui crashing wasn't due to crawlbots, but something else entirely.
I've just spent maybe 14 hours troubleshooting this thing, since after a couple of minutes of running, lemmy-ui container healthcheck would show "unhealthy" and my instance couldn't be accessed from anywhere (lemmy-ui, photon, jerboa, probably the api as well).
After some digging, I've disabled anubis to check if that had anything to do with it, it didn't. But, I've also noticed my host ulimit -n was set to like 1000.... (I've been on the same install for years and swear an update must have changed it)
After changing ulimit -n (nofile) and shm_size to 2G in docker compose, it hasn't crashed yet. fingerscrossed
Boss, I'm tired and I want to get off Mr. Bones' wild ride.
I'm very sorry for not being able to reply to you all, but it's been hectic.

Cheers and I really hope someone finds this as useful as I did.

top 50 comments

sorted by: hot top controversial new old

[–] ikidd@lemmy.world 111 points 4 months ago

Something that's less annoying than Anubis is fail2ban tarpitting the scrapers by putting in a hidden honeypot page link that they follow, and adding the followers to fail2ban.

https://petermolnar.net/article/anti-ai-nepenthes-fail2ban/

[–] Mora@pawb.social 87 points 4 months ago* (last edited 4 months ago) (4 children)

Besides that point: why tf do they even crawl lemmy. They could just as well create a "read only" instance with an account that subscribes to all communities ... and the other instances would send their data. Oh, right, AI has to be as unethical as possible for most companies for some reason.

[–] wizardbeard@lemmy.dbzer0.com 92 points 4 months ago

They crawl wikipedia too, and are adding significant extra load on their servers, even though Wikipedia has a regularly updated torrent to download all its content.

[–] ZombiFrancis@sh.itjust.works 72 points 4 months ago

See your brain went immediately to a solution based on knowing how something works. That's not in the AI wheelhouse.

[–] dan@upvote.au 35 points 4 months ago

They're likely not intentionally crawling Lemmy. They're probably just crawling all sites they can find.

[–] AmbitiousProcess@piefed.social 28 points 4 months ago

Because the easiest solution for them is a simple web scraper. If they don't give a shit about ethics, then something that just crawls every page it can find is loads easier for them to set up than a custom implementation to get torrent downloads for wikipedia, making lemmy/mastodon/pixelfed instances for the fediverse, using rss feeds and checking if they have full or only partial articles, implementing proper checks to prevent double (or more) downloading of the same content, etc.

[–] e0qdk@reddthat.com 42 points 4 months ago (1 children)

I don't like Anubis because it requires me to enable JS -- making me less secure. reddthat started using go-away recently as an alternative that doesn't require JS when we were getting hammered by scrapers.

[–] Jumuta@sh.itjust.works 3 points 4 months ago

iirc there's instructions on completing the anubis challenge manually

[–] reddeadhead@awful.systems 34 points 4 months ago* (last edited 4 months ago)

Anubis just released the no-JS challenge in a update. Page loads for me with JS disabled. https://anubis.techaro.lol/blog/release/v1.20.0/

[–] fireshell@kbin.earth 20 points 4 months ago

The development of Anubis remains a matter of enthusiasm: Zee is funding the project through Patreon and sponsorship on GitHub, but cannot yet afford to pursue it on a full-time basis. He would also like to hire a key community member, budget permitting.

[–] possiblylinux127@lemmy.zip 20 points 4 months ago (1 children)

It doesn't stop bots

All it does is make clients do as much or more work than the server which makes it less temping to hammer the web.

[–] sailorzoop@lemmy.librebun.com 32 points 4 months ago (1 children)

Yeah, from what I understand it's nothing crazy for any regular client, but really messes with the bots.
I don't know, I'm just so glad and happy it works, it doesn't mess with federation and it's barely visible when accessing the sites.

[–] possiblylinux127@lemmy.zip 8 points 4 months ago

Personally my only real complaint is the lack of wasm. Outside if that it works fairly well.

[–] SorteKanin@feddit.dk 15 points 4 months ago

I’ve, once again, noticed Amazon and Anthropic absolutely hammering my Lemmy instance to the point of the lemmy-ui container crashing.

I'm just curious, how did you notice this in the first place? What are you monitoring to know and how do you present that information?

[–] rtxn@lemmy.world 15 points 4 months ago (2 children)

But don't you know that Anubis is MALWARE?

...according to some of the clowns at the FSF, which is definitely one of the opinions to have. https://www.fsf.org/blogs/sysadmin/our-small-team-vs-millions-of-bots

[–] dan@upvote.au 27 points 4 months ago (2 children)

tbh I kinda understand their viewpoint. Not saying I agree with it.

The Anubis JavaScript program's calculations are the same kind of calculations done by crypto-currency mining programs. A program which does calculations that a user does not want done is a form of malware.

[–] Natanox@discuss.tchncs.de 31 points 4 months ago* (last edited 4 months ago) (2 children)

That's guilt by association. Their viewpoint is awful.

I also wished there was no security at the gate of concerts, but I happily accept it if that means actual security (if done reasonably of course). And quite frankly, cute anime girl doing some math is so, so much better than those god damn freaking captchas. Or the service literally dying due to AI DDoS.

Edit: Forgot to mention, proof of work wasn't invented by or for crypto currency or blockchain. The concept exists since the 90's (as an idea for Email Spam prevention), making their argument completely nonsensical.

[–] Arghblarg@lemmy.ca 4 points 4 months ago (1 children)

Ah, hashcash. Wish that had taken off, it was a good idea ...

[–] xavier666@lemmy.umucat.day 2 points 4 months ago

TIL of hashcash

[–] xavier666@lemmy.umucat.day 2 points 4 months ago (1 children)

And quite frankly, cute anime girl doing some math is so, so much better than those god damn freaking captchas

One user complained that a random anime girl popping up is making his gf think he's watching hentai. So the mascot should be changed to something "normal".

[–] Natanox@discuss.tchncs.de 6 points 4 months ago

Lol.

"My relationship is fragile and it's the internets fault."

load more comments (1 replies)

[–] chihuamaranian@lemmy.ca 17 points 4 months ago (2 children)

The FSF explanation of why they dislike Anubis could just as easily apply to the process of decrypting TLS/HTTPS. You know, something uncontroversial that every computer is expected to do when they want to communicate securely.

I don't fundamentally see the difference between "The computer does math to ensure end-to-end privacy" and "The computer does math to mitigate DDoS attempts on the server". Either way, without such protections the client/server relationship is lacking crucial fundamentals that many interactions depend on.

[–] rtxn@lemmy.world 8 points 4 months ago

I've made that exact comparison before. TLS uses encryption; ransomware also uses encryption; by their logic, serving web content through HTTPS with no way to bypass it is a form of malware. The same goes for injecting their donation banner using an iframe.

[–] SheeEttin@lemmy.zip 5 points 4 months ago* (last edited 4 months ago)

Right. One of the facets of cryptography is rounds: if you apply the same algorithm 10,000 times instead of just one, it might make it slightly slower each time you need to run it, but it makes it vastly slower for someone trying to brute-force your password.

[–] TomAwezome@lemmy.world 9 points 4 months ago

Thanks for the "incoherent rant", I'm setting some stuff up with Anubis and Caddy so hearing your story was very welcome :)

[–] NotSteve_@piefed.ca 9 points 4 months ago* (last edited 4 months ago) (1 children)

I love Anubis just because the dev is from my city that's never talked about (Ottawa)

[–] SheeEttin@lemmy.zip 5 points 4 months ago (1 children)

Well not never, you've got the Senators.

Which will never not be funny to me since it's Latin for "old men".

[–] NotSteve_@piefed.ca 2 points 4 months ago* (last edited 4 months ago)

Hahaha I didn't know that but that is funny. Admittedly I'm not too big into hockey so I've got no gauge on how popular (edit: or unpopular 😅) the Sens are

[–] dan@upvote.au 7 points 4 months ago (1 children)

The Anubis site thinks my phone is a bot :/

tbh I would have just configured a reasonable rate limit in Nginx and left it at that.

Won't the bots just hammer the API instead now?

[–] Flipper@feddit.org 12 points 4 months ago

No. The rate limit doesn't work as they use huge IP Spaces to crawl. Each IP alone is not bad they just use several thousand of them.

Using the API would assume some basic changes. We don't do that here. If they wanted that, they could run their own instance and would even get notified about changes. No crawling required at all.

[–] danielquinn@lemmy.ca 7 points 4 months ago (2 children)

I've been thinking about setting up Anubis to protect my blog from AI scrapers, but I'm not clear on whether this would also block search engines. It would, wouldn't it?

[–] sailorzoop@lemmy.librebun.com 7 points 4 months ago

I'm not entirely sure, but if you look here https://github.com/TecharoHQ/anubis/tree/main/data/bots
They have separate configs for each bot. https://github.com/TecharoHQ/anubis/blob/main/data/botPolicies.json

[–] RedBauble@sh.itjust.works 7 points 4 months ago (1 children)

You can setup the policies to allow search engines through, the default policy linked in the docs does that

[–] danielquinn@lemmy.ca 6 points 4 months ago (2 children)

This all appears to be based on the user agent, so wouldn't that mean that bad-faith scrapers could just declare themselves to be typical search engine user agent?

[–] SorteKanin@feddit.dk 5 points 4 months ago

Most search engine bots publish a list of verified IP addresses where they crawl from, so you could check the IP of a search bot against that to know.

[–] SheeEttin@lemmy.zip 4 points 4 months ago (1 children)

Yes. There's no real way to differentiate.

[–] SorteKanin@feddit.dk 5 points 4 months ago

Actually I think most search engine bots publish a list of verified IP addresses where they crawl from, so you could check the IP of a search bot against that to know.

[–] sic_semper_tyrannis@lemmy.today 6 points 4 months ago

Futo gave them a micro-grant this month

[–] MITM0@lemmy.world 6 points 4 months ago

Go_Away is another alternative

[–] TheHobbyist@lemmy.zip 4 points 4 months ago* (last edited 4 months ago)

@demigodrick@lemmy.zip

Perhaps of interest? I don't know how many bots you're facing.

[–] fossilesque@mander.xyz 4 points 4 months ago

I've been planning on seeing this up for ages. Love the creators vibe. Thanks for this.

[–] Charlxmagne@lemmy.world 4 points 4 months ago

Been seeing this on people's invidious instances

[–] paraphrand@lemmy.world 3 points 4 months ago

I’ve seen some people reject this solution due to the anime.

[–] SorteKanin@feddit.dk 3 points 4 months ago (1 children)

Also your avatar and the image posted here (not the thumbnail) seem broken - I wonder if that's due to Anubis?

[–] sailorzoop@lemmy.librebun.com 4 points 4 months ago

Just updated the post again, yeah. But I think that was due to me changing nameservers for my domain at the time. Cheers.

[–] MichaelMuse@programming.dev 1 points 4 months ago

I think AI can provide an interface to let user submit the site for crawling, such as some website scanner doing, like urlscan. Otherwise the site can reject the AI crawler.

load more comments