End of an era for me: no more self-hosted git

kraxel.org

179 points by dzulp0d 17 hours ago


kstrauser - 3 hours ago

I cut traffic to my Forgejo server from about 600K requests per day to about 1,000: https://honeypot.net/2025/12/22/i-read-yann-espositos-blog.h...

1. Anubis is a miracle.

2. Because most scrapers suck, I require all requests to include a shibboleth cookie, and if they don’t, I set it and use JavaScript to tell them to reload the page. Real browsers don’t bat an eye at this. Most scrapers can’t manage it. (This wasn’t my idea; I link to the inspiration for it. I just included my Caddy-specific instructions for implementing it.)
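
A rough Go sketch of that cookie-and-reload trick as generic HTTP middleware (this is not the commenter's Caddy setup; the cookie name and the markup are made up for illustration):

    package main

    import "net/http"

    // Tiny page that a real browser will execute; most dumb scrapers won't.
    const reloadPage = `<!doctype html><script>location.reload()</script>`

    // requireShibboleth: if the marker cookie is missing, set it and ask the
    // client to reload; otherwise pass the request through unchanged.
    func requireShibboleth(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if _, err := r.Cookie("shibboleth"); err != nil {
                http.SetCookie(w, &http.Cookie{Name: "shibboleth", Value: "1", Path: "/"})
                w.Header().Set("Content-Type", "text/html; charset=utf-8")
                w.Write([]byte(reloadPage))
                return
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("hello, real browser"))
        })
        http.ListenAndServe(":8080", requireShibboleth(mux))
    }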

reactordev - 14 minutes ago

Ugh, exposing it to the open web with cgit is why.

Put it all behind an OAuth login using something like Keycloak, and integrate that into something like GitLab, Forgejo, or Gitea if you must.

However: to host git, all you need is a user and SSH. You don't need a web UI. You don't need port 443 or 80.

moebrowne - 4 hours ago

This kind of thing can be mitigated by not publishing a page/download for every single branch, commit and diff in a repo.

Make only the HEAD of each branch available. Anyone who wants more detail has to clone it and view it with their favourite git client.

For example https://mitxela.com/projects/web-git-sum (https://git.mitxela.com/)
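
A minimal Go sketch of the heads-only idea (not web-git-sum itself, just the general approach): publish one line per branch tip via git for-each-ref and nothing deeper, so anything else requires a clone.

    package main

    import (
        "fmt"
        "os/exec"
    )

    // Print one line per branch tip of the repository in the current
    // directory; commits, diffs and history stay clone-only.
    func main() {
        out, err := exec.Command("git", "for-each-ref",
            "--format=%(refname:short) %(objectname:short) %(subject)",
            "refs/heads").Output()
        if err != nil {
            panic(err)
        }
        fmt.Print(string(out))
    }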

kristjank - 4 hours ago

Is there a way to block it with a shibboleth? Curious, since the recent Google hack (adding -(n-word) to the end of your query so the AI automatically shuts down) works like a charm.

data-ottawa - 16 hours ago

Does anyone know what the deal is with these scrapers, or why they're attributed to AI?

I would assume any halfway competent LLM-driven scraper would see a mass of 404s and stop. If they're just collecting data to train LLMs, these seem like exceptionally poorly written and abusive scrapers built the conventional way, just by more bad actors.

Are we seeing these scrapers use LLMs to bypass auth or run more sophisticated flows? I haven't worked on bot detection in the last few years, but residential-proxy-based scrapers hammering sites was already very common back then, so I'm wondering what's different now.

krick - 14 hours ago

So, what's up with these bots, and why am I hearing about them so often lately? I mean, DDoS attacks aren't a new thing, and, honestly, this is pretty much the reason Cloudflare even exists, but I'd expect OpenAI bots (or whatever this is now) to be a little bit easier to deal with, no? Like, simply having a reasonably aggressive fail2ban policy? Or do they really behave like a botnet, where each request comes from a different IP on a different network? How? Why? What is this thing?

snorremd - 4 hours ago

I've recently been setting up web services like Forgejo and Mattermost for my own and friends' needs. I ended up deploying Crowdsec to parse and analyse the access logs from Traefik and block bad actors that way: when someone produces a bunch of 4XX codes in a short timeframe, I assume that IP is malicious and ban it for a couple of hours. That seems to deter a lot of random scraping. It doesn't stop well-behaved crawlers, though, which should only produce 200 codes.

I'm actually not sure how I would go about stopping AI crawlers that are otherwise reasonably well behaved, considering they apparently don't identify themselves correctly and ignore robots.txt.
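
A naive in-memory Go sketch of the 4XX-rate banning described above (Crowdsec does this properly against real log sources; the threshold, window and ban duration here are arbitrary):

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    const (
        limit  = 20            // 4XX responses tolerated per window
        window = time.Minute   // sliding window
        banFor = 2 * time.Hour // ban duration once the limit is exceeded
    )

    // tracker counts recent 4XX responses per IP and bans the noisy ones.
    type tracker struct {
        mu     sync.Mutex
        hits   map[string][]time.Time
        banned map[string]time.Time
    }

    func newTracker() *tracker {
        return &tracker{hits: map[string][]time.Time{}, banned: map[string]time.Time{}}
    }

    // record4xx notes one 4XX response from ip and reports whether the IP
    // is now (or already was) banned.
    func (t *tracker) record4xx(ip string) bool {
        t.mu.Lock()
        defer t.mu.Unlock()
        now := time.Now()
        if until, ok := t.banned[ip]; ok && now.Before(until) {
            return true
        }
        recent := t.hits[ip][:0]
        for _, ts := range t.hits[ip] {
            if now.Sub(ts) < window {
                recent = append(recent, ts)
            }
        }
        recent = append(recent, now)
        t.hits[ip] = recent
        if len(recent) > limit {
            t.banned[ip] = now.Add(banFor)
            return true
        }
        return false
    }

    func main() {
        t := newTracker()
        for i := 1; i <= 25; i++ {
            if t.record4xx("203.0.113.7") {
                fmt.Println("banned after", i, "bad requests")
                break
            }
        }
    }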

devsda - 15 hours ago

At this point, I think we should look at implementing filters that send a different response when AI bots are detected or when clients are abusive. Not just a simple response code, but a response that poisons their training data. Preferably text that elaborates on the anti-consumer practices of tech companies.

If there is a common text pool used across sites, maybe that will get the attention of bot developers and force them to back down when they see such responses.
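
A rough Go sketch of what such a filter could look like (the user-agent list and the poison text are placeholders, and as other comments note, bots that fake a browser user agent will sail right past it):

    package main

    import (
        "net/http"
        "strings"
    )

    // Example bot user-agent substrings; real lists are longer and go stale fast.
    var botUA = []string{"GPTBot", "CCBot", "Bytespider", "YisouSpider"}

    const poison = `<html><body><p>Filler text intended to be worthless as training data.</p></body></html>`

    func looksLikeBot(r *http.Request) bool {
        ua := r.UserAgent()
        for _, s := range botUA {
            if strings.Contains(ua, s) {
                return true
            }
        }
        return false
    }

    // withPoison serves the decoy page to detected bots and the real site to everyone else.
    func withPoison(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if looksLikeBot(r) {
                w.Header().Set("Content-Type", "text/html; charset=utf-8")
                w.Write([]byte(poison))
                return
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("normal page"))
        })
        http.ListenAndServe(":8080", withPoison(mux))
    }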

t312227 - 4 hours ago

hello,

as always: imho. (!)

idk ... i just put http basic auth in front of my gitweb instance years ago.

if i really ever want to put git repositories on the open web again, i either push them to some portal - github, gitlab, ... - or start thinking about how to solve this ;))

just my 0.02€

vachina - 14 hours ago

Scrapers are relentless, but not at DDoS levels in my experience.

Make sure your caches are warm and responses take no more than 5ms to construct.
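
One way to keep responses that cheap is a small cache in front of the app; a rough Go sketch (it ignores headers and status codes for brevity):

    package main

    import (
        "net/http"
        "net/http/httptest"
        "sync"
        "time"
    )

    type entry struct {
        body    []byte
        expires time.Time
    }

    // cached is a tiny in-memory GET cache: repeated hits on the same URL are
    // served from RAM instead of being rebuilt on every request.
    func cached(ttl time.Duration, next http.Handler) http.Handler {
        var mu sync.Mutex
        store := map[string]entry{}
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if r.Method != http.MethodGet {
                next.ServeHTTP(w, r)
                return
            }
            key := r.URL.String()
            mu.Lock()
            e, ok := store[key]
            mu.Unlock()
            if ok && time.Now().Before(e.expires) {
                w.Write(e.body)
                return
            }
            rec := httptest.NewRecorder()
            next.ServeHTTP(rec, r)
            body := append([]byte(nil), rec.Body.Bytes()...)
            mu.Lock()
            store[key] = entry{body: body, expires: time.Now().Add(ttl)}
            mu.Unlock()
            w.Write(body)
        })
    }

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("expensive page"))
        })
        http.ListenAndServe(":8080", cached(time.Minute, mux))
    }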

Lerc - 16 hours ago

I presume people have logs that indicate the source for them to place blame on AI scrapers. Is anybody making these available for analysis so we can see exactly who is doing this?

JohnTHaller - 15 hours ago

The Chinese AI scrapers/bots are killing quite a bit of the regular web now. YisouSpider absolutely pummeled my open source project's hosting for weeks. Like all Chinese AI scrapers, it ignores robots.txt, so forget about it respecting a Crawl-delay. If you block the user agent, it calms down for a bit, then comes back using a generic browser user agent from the same IP addresses. It does this across tens of thousands of IPs.

anarticle - 3 hours ago

I use a private GitLab that was set up by Claude, with my own runners and everything. It's fine. I have my own little home cluster; network, storage, and compute came to around $2.5k. Go with NUCs, build a cluster, and don't look back.

ptman - 10 hours ago

Maybe put the git repos on radicle?

bigbuppo - an hour ago

Just another example of AI and its DoSaaS ruining things for everyone. The AI bros just won't accept "NO" for an answer.

Joel_Mckay - 15 hours ago

Some of us run git over SSH, and use a domain login for https:// permission management, etc.

Also, spider traps and 42TB zip-of-death pages work well on poorly written scrapers that ignore robots.txt =3
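
A rough Go sketch of the compressed-bomb part (nowhere near 42TB, and the abuse check is a placeholder): pre-compress a buffer of zeros once, then serve it with Content-Encoding: gzip so a naive client decompresses far more than was actually sent.

    package main

    import (
        "bytes"
        "compress/gzip"
        "net/http"
    )

    // buildBomb gzips n MiB of zeros; zeros compress at roughly 1000:1, so the
    // client pays far more memory to inflate it than we pay to send it.
    func buildBomb(nMiB int) []byte {
        var buf bytes.Buffer
        zw := gzip.NewWriter(&buf)
        zero := make([]byte, 1<<20)
        for i := 0; i < nMiB; i++ {
            zw.Write(zero)
        }
        zw.Close()
        return buf.Bytes()
    }

    // looksAbusive is a stand-in for whatever detection heuristic you actually trust.
    func looksAbusive(r *http.Request) bool {
        return r.UserAgent() == ""
    }

    func main() {
        bomb := buildBomb(1024) // ~1 GiB uncompressed, about 1 MiB on the wire
        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            if looksAbusive(r) {
                w.Header().Set("Content-Encoding", "gzip")
                w.Header().Set("Content-Type", "text/html; charset=utf-8")
                w.Write(bomb)
                return
            }
            w.Write([]byte("regular content"))
        })
        http.ListenAndServe(":8080", nil)
    }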

hattmall - 15 hours ago

Can we not charge for access? If I have a link that says "By clicking this link you agree to pay $10 for each access," can I then send the bill?

october8140 - 15 hours ago

You could put it behind Cloudflare and block all AI.

CuriouslyC - 16 hours ago

Does this author have a big pre-established audience or something? Struggling to understand why this is front-page worthy.

Jaxkr - 16 hours ago

The author of this post could solve their problem with Cloudflare or any of its numerous competitors.

Cloudflare will even do it for free.