Woof.group's been getting slower and slower lately, and I think it's cuz we're getting hammered by (maybe LLM?) scrapers. Hard to say, really, but I don't buy that there are *that* many Windows/Chrome users clicking around through every single tag page.
Gonna add a bunch of the LLM bots to robots.txt--I know many of the big players just ignore robots.txt and fudge their UAs, but maybe it'll make a little dent. Fully 2% of our requests are ByteDance, and 5% are Ahrefs--both of those should be blockable.
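For reference, a minimal robots.txt sketch along those lines; GPTBot, Bytespider, and AhrefsBot are the published crawler names for OpenAI, ByteDance, and Ahrefs, and which other bots to list is a judgment call:

User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: AhrefsBot
Disallow: /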
No idea what to do about what I suspect is residential proxy traffic, which makes up the vast majority of our load. I assume throwing Anubis in front of a Mastodon instance is going to break a ton of legitimate use cases.
@aphyr so I actually have been having some issues with it here; among other things, I seem to be occasionally getting data-limited (while not at keyboard, like I walk away and come back to it).
I feel like it's just weirdness, but I've been meaning to figure out whether it's something I'm doing. I think at least one of the culprits (on my end) is uBlock Origin Lite on Chrome, since I haven't jumped through the hoops to make uBlock Origin play nice again on my new Windows install (and turning it off fixes the problem?)
@aphyr we built out a trap for scanners: LLM bots will crawl the CT logs, so anyone making HTTP requests to the hostnames on certs requested by our mail servers is fuckin' around and can go directly into the firewall.
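For illustration, a minimal sketch of that kind of trap as an nginx server block, assuming a decoy hostname that appears only on the cert (and therefore only in the CT logs); the hostname and paths here are made up, and the resulting log would be fed into blocklistd, nftables, or whatever firewall you run:

server {
    listen 80;
    listen 443 ssl;
    server_name decoy.mail.example.com;        # never published anywhere except the cert
    ssl_certificate     /etc/ssl/decoy.pem;    # placeholder paths
    ssl_certificate_key /etc/ssl/decoy.key;
    access_log /var/log/nginx/ct-trap.log;     # every client IP in here found the name via CT
    return 403;
}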
@aphyr as someone with a pretty high-traffic site: adding a proof-of-work challenge for requests claiming to be from Chrome, Safari, or Firefox reduced traffic by 6x without breaking our APIs.
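A rough sketch of how that kind of gating might look at the nginx layer, assuming an Anubis-style proof-of-work challenge proxy already running on 127.0.0.1:8923 and the app on 127.0.0.1:3000 (both addresses and the hostname are placeholders):

map $http_user_agent $upstream {
    default                        http://127.0.0.1:3000;   # non-browser UAs go straight to the app
    "~*(Chrome|Safari|Firefox)"    http://127.0.0.1:8923;   # claimed browsers must pass the PoW challenge first
}

server {
    listen 80;                     # TLS termination elided for brevity
    server_name example.social;    # placeholder
    location / {
        proxy_set_header Host $host;
        proxy_pass $upstream;      # variable proxy_pass is fine with literal IP upstreams
    }
}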
Have you tried preventing this in your proxy?
I mean something like this:
Nginx:
if ($http_user_agent ~* (GPTBot|ChatGPT)) {
return 403;
}
Apache .htaccess:
SetEnvIfNoCase User-Agent "GPTBot" bad_bot
SetEnvIfNoCase User-Agent "ChatGPT" bad_bot
Deny from env=bad_bot
You just have to find the other bots.
@aphyr is there a way to add “nofollow” to every link? We’ve started doing that (professional service) and removing features like sorts, because they’re always followed by these fucking bots. A 70k-item list, previously sortable by 8 parameters, both ascending and descending.
I hate it here. You have my sympathy. (I would zip bomb them professionally if I could.)
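For what it's worth, nofollow can also be applied site-wide as a response header rather than per-link markup, which avoids touching templates; a one-line nginx sketch (well-behaved crawlers treat X-Robots-Tag the same way they treat the robots meta tag):

add_header X-Robots-Tag "nofollow" always;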
@af Possibly, but I try to maintain minimal code changes to Mastodon--even our small changes have put me in merge hell on a few security updates.
@bearleathermen Oh we've actually had GPTBot in robots.txt for ages, and they seem to respect it. The problem is the high-volume scrapers impersonating Chrome/Firefox/Edge/etc, usually just one or two requests per IP.
@aphyr our version is for FreeBSD blocklistd, but the idea ought to be a weekend's worth of implementation with other firewall rule engines https://fossil.se30.xyz/ratrap
@rexxdeane That hasn't been feasible in years, sadly. Scraper infra is mostly fanned out to zillions of residential devices via services like https://oxylabs.io/products/web-unblocker.
@rexxdeane @aphyr I wonder what it's like to work at a company that's so self-evidently scummy
@rexxdeane Yeah :(