Facebook's Fascination with My Robots.txt

blog.nytsoi.net

56 points by Ndymium 3 hours ago


Nextgrid - an hour ago

> Perhaps someone at their end screwed up a loop conditional, but you'd think some monitoring dashboard somewhere would have a warning pop up because of this.

If you've been in any big company you'll know things perpetually run in a degraded, somewhat broken mode. They've even made up the term "error budget" because they can't be bothered to fix the broken shit so now there's an acceptable level of brokenness.

xg15 - 2 hours ago

Facebook just decided that instead of loading the robots.txt for every host they intend to crawl, they'll just ignore all the other robots.txt files and then access this one a million times to restore the average.

Ndymium - 3 hours ago

For some reason, Facebook has been requesting my Forgejo instance's robots.txt in a loop for the past few days, currently at a speed of 7700 requests per hour. The resource usage is negligible, but I'm wondering why it's happening in the first place and how many other robot files they're also requesting repeatedly. Perhaps someone at Meta broke a loop condition.
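For scale, 7700 requests per hour is a little over two requests per second. Anyone wanting to check their own logs for the same pattern could start from a rough sketch like the one below. It assumes the access log is in the common "combined" format and that the crawler identifies itself with a user agent containing `facebookexternalhit`. Both are assumptions, not details from the post, and `hits_per_hour` is a made-up helper name.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed "combined" log format, e.g.:
# 192.0.2.1 - - [01/Jan/2025:10:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "some-agent/1.0"
LINE_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def hits_per_hour(lines, path="/robots.txt", ua_substring="facebookexternalhit"):
    """Count requests for `path` from matching user agents, bucketed by hour.

    `ua_substring` is an assumed identifier; adjust it to whatever
    actually shows up in your own logs.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        if match["path"] != path or ua_substring not in match["ua"]:
            continue
        ts = datetime.strptime(match["ts"], "%d/%b/%Y:%H:%M:%S %z")
        counts[ts.strftime("%Y-%m-%d %H:00")] += 1
    return counts
```

Passing an open log file (any iterable of lines) to `hits_per_hour` returns a `Counter` keyed by hour, which makes a sustained loop easy to spot.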

tananaev - an hour ago

Maybe they’re trying to DDoS it, and once an error is returned, they assume that no robots.txt file exists and then crawl everything else on the site?

petee - 21 minutes ago

Do crawlers follow/cache 301 permanent redirects? I wonder if you could point the firehose back at Facebook, but it would mean they wouldn't get your robots.txt anymore (though I'd just blackhole that whole subnet anyway)

dormento - an hour ago

Has anyone done research on the topic of trying to block these bots by claiming to host illegal material or talking about certain topics? I mean having a few entries in your robots like "/kill-president", "/illegal-music-downloads", "/casino-lucky-tiger-777" etc.

evv - an hour ago

Have you considered serving a zip bomb to this user agent?

mghackerlady - 31 minutes ago

>my extreme LibreOffice Calc skillz

How does one learn these skills? I can see them being useful in the future.

matja - 2 hours ago

Did you try adding a Cache-Control response header?
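For anyone unsure what that would look like, here is a minimal sketch of a WSGI app that serves robots.txt with a `Cache-Control` header, using only the Python standard library. The one-day `max-age`, the placeholder file contents, and the `app`/`fetch` names are assumptions for illustration. A real Forgejo setup would more likely set this header in the reverse proxy, and nothing guarantees a misbehaving crawler will honor it.

```python
from wsgiref.util import setup_testing_defaults

ROBOTS_BODY = b"User-agent: *\nDisallow:\n"  # placeholder robots.txt content
MAX_AGE_SECONDS = 86400  # assumed value: let clients cache for one day


def app(environ, start_response):
    """Serve /robots.txt with a Cache-Control header; 404 for anything else."""
    if environ.get("PATH_INFO") == "/robots.txt":
        headers = [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(ROBOTS_BODY))),
            ("Cache-Control", f"public, max-age={MAX_AGE_SECONDS}"),
        ]
        start_response("200 OK", headers)
        return [ROBOTS_BODY]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found\n"]


def fetch(path):
    """Call the app in-process and return (status, headers dict, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body


if __name__ == "__main__":
    status, headers, _ = fetch("/robots.txt")
    print(status, headers["Cache-Control"])
```

To try it over real HTTP, `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` will serve it locally.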

lloydatkinson - 16 minutes ago

I recently started maintaining a MediaWiki instance for a niche hobbyist community and we'd been struggling with poor server performance. I didn't set the server up, so came into it assuming that the tiny amount of RAM the previous maintainer had given it was the problem.

Turns out all of the major AI slop companies had been hounding our wiki constantly for months, and this had resulted in Apache spawning hundreds of instances, bringing the whole machine to a halt.

Millions upon millions of requests, hundreds of GBs of bandwidth. Thankfully we're using Cloudflare, so we could block all of them except real search engine crawlers, and now we don't have any problems at all. I also tightened Apache's limits a bit.

From what I've read, forums, wikis, and git repos are the primary targets of harassment by these companies for some reason. The worst part is these bots could just download a git repo or a wiki dump and do whatever they want with it, but instead they are designed to push maximum load onto their victims.

Our wiki, in total, is a few gigabytes. They crawled it thousands of times over.