Woof.group's been getting slower and slower lately, and I think it's cuz we're getting hammered by (maybe LLM?) scrapers. Hard to say, really, but I don't buy that there are *that* many Windows/Chrome users clicking around through every single tag page.
Gonna add a bunch of the LLM bots to robots.txt--I know many of the big players just ignore robots.txt and fudge their UAs, but maybe it'll make a little dent. Fully 2% of our requests are ByteDance, and 5% are Ahrefs--both of those should be blockable.
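For reference, a minimal robots.txt sketch along those lines; GPTBot, Bytespider, and AhrefsBot are the published crawler names for OpenAI, ByteDance, and Ahrefs, and which other bots to list is a judgment call:

User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: AhrefsBot
Disallow: /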
No idea what to do about what I suspect is residential proxy traffic, which makes up the vast majority of our load. I assume throwing Anubis in front of a Mastodon instance is going to break a ton of legitimate use cases.
@aphyr so I actually have been having some issues with it here; among other things, I seem to be occasionally getting data-limited (while not at keyboard, like I walk away and come back to it).
I feel like it's just weirdness, but I've been meaning to figure out whether it's something I'm doing. I think at least one of the culprits (on my end) is uBlock Origin Lite on Chrome, since I haven't jumped through the hoops to make uBlock Origin play nice again on my new Windows install (and turning it off fixes the problem?)
@aphyr we built out a trap for scanners: LLM bots will crawl the CT logs, so anyone making HTTP requests to the hostnames on certs requested by our mail servers is fuckin' around and can go directly into the firewall.
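For illustration, a minimal sketch of that kind of trap as an nginx server block, assuming a decoy hostname that appears only on the cert (and therefore only in the CT logs); the hostname and paths here are made up, and the resulting log would be fed into blocklistd, nftables, or whatever firewall you run:

server {
    listen 80;
    listen 443 ssl;
    server_name decoy.mail.example.com;        # never published anywhere except the cert
    ssl_certificate     /etc/ssl/decoy.pem;    # placeholder paths
    ssl_certificate_key /etc/ssl/decoy.key;
    access_log /var/log/nginx/ct-trap.log;     # every client IP in here found the name via CT
    return 403;
}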
@aphyr as someone with a pretty high-traffic site: adding a proof-of-work challenge for requests claiming to be from Chrome, Safari, or Firefox reduced traffic by 6x without breaking our APIs.
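A rough sketch of how that kind of gating might look at the nginx layer, assuming an Anubis-style proof-of-work challenge proxy already running on 127.0.0.1:8923 and the app on 127.0.0.1:3000 (both addresses and the hostname are placeholders):

map $http_user_agent $upstream {
    default                        http://127.0.0.1:3000;   # non-browser UAs go straight to the app
    "~*(Chrome|Safari|Firefox)"    http://127.0.0.1:8923;   # claimed browsers must pass the PoW challenge first
}

server {
    listen 80;                     # TLS termination elided for brevity
    server_name example.social;    # placeholder
    location / {
        proxy_set_header Host $host;
        proxy_pass $upstream;      # variable proxy_pass is fine with literal IP upstreams
    }
}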
Have you tried preventing this in your proxy?
I mean something like this:
Nginx:
if ($http_user_agent ~* (GPTBot|ChatGPT)) {
return 403;
}
Apache .htaccess:
SetEnvIfNoCase User-Agent "GPTBot" bad_bot
SetEnvIfNoCase User-Agent "ChatGPT" bad_bot
Deny from env=bad_bot
You just have to find the other bots.
@aphyr is there a way to add “nofollow” to every link? We’ve started doing that (professional service) and removing features like sorts, because they’re always followed by these fucking bots. A 70k-item list, previously sortable by 8 parameters, both ascending and descending.
I hate it here. You have my sympathy. (I would zip bomb them professionally if I could.)
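For what it's worth, nofollow can also be applied site-wide as a response header rather than per-link markup, which avoids touching templates; a one-line nginx sketch (well-behaved crawlers treat X-Robots-Tag the same way they treat the robots meta tag):

add_header X-Robots-Tag "nofollow" always;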
@af Possibly, but I try to maintain minimal code changes to Mastodon--even our small changes have put me in merge hell on a few security updates.
@bearleathermen Oh we've actually had GPTBot in robots.txt for ages, and they seem to respect it. The problem is the high-volume scrapers impersonating Chrome/Firefox/Edge/etc, usually just one or two requests per IP.
@aphyr our version is for FreeBSD blocklistd, but the idea ought to be a weekend's worth of implementation with other firewall rule engines https://fossil.se30.xyz/ratrap
@rexxdeane That hasn't been feasible in years, sadly. Scraper infra is mostly fanned out to zillions of residential devices via services like https://oxylabs.io/products/web-unblocker.
@rexxdeane @aphyr I wonder what it's like to work at a company that's so self-evidently scummy
@rexxdeane Yeah :(