Did Claude increase bugs in rsync?
alexispurslane.github.io
303 points by logicprog 13 hours ago
Was just looking at commits and came across a commit and its revert
original commit: https://github.com/RsyncProject/rsync/commit/d046525de39315d...
```
- if (!ptr)
-     ptr = malloc(num * size);
- else if (ptr == do_calloc)
+ if (!ptr || ptr == do_calloc)
      ptr = calloc(num, size);
```
Written with Claude. This is a good example of what slips through LLM attention. It forces all allocations to be calloc as if it is a strict upgrade. For large and recursive allocations, this becomes a significant cost.
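To make the diff concrete, here is a minimal sketch of an allocation wrapper with the same shape. The names `sketch_alloc`, `SKETCH_CALLOC`, `all_zero` and the `zero_all` switch are hypothetical and chosen for illustration; this is not rsync's actual code, only an approximation of the before and after behaviour the diff describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sentinel: callers pass this as `ptr` to request zeroed memory. */
static char sketch_sentinel;
#define SKETCH_CALLOC ((void *)&sketch_sentinel)

/*
 * Hypothetical wrapper mirroring the shape of the diff above.
 * zero_all == 0: NULL -> malloc, sentinel -> calloc (before the commit).
 * zero_all != 0: NULL or sentinel -> calloc (after the commit).
 * Any other ptr is resized with realloc in both modes.
 */
static void *sketch_alloc(void *ptr, size_t num, size_t size, int zero_all)
{
    assert(size != 0);
    if (num > SIZE_MAX / size)
        return NULL; /* num * size would overflow */

    if (ptr == SKETCH_CALLOC || (ptr == NULL && zero_all))
        return calloc(num, size);
    if (ptr == NULL)
        return malloc(num * size);
    return realloc(ptr, num * size);
}

/* Returns 1 if the first n bytes of buf are zero, 0 otherwise. */
static int all_zero(const unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (buf[i] != 0)
            return 0;
    return 1;
}
```

With `zero_all` set, every fresh allocation is guaranteed to read as zero, which is the hardening the commit was after. Whether that costs extra resident memory depends on the allocator and the OS, so the regression discussed here should be measured rather than assumed from this sketch.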
reverted in https://github.com/RsyncProject/rsync/commit/7db73ad9a1b8721...
If you read the description of the revert even half carefully, it's easy to tell that even that was written by an LLM.
I can understand the sentiment of whoever posted the original thread.
Also the number of commits is suspicious. In the last two months, rsync had about as many commits as in the two years before that. Most of them were written with Claude. And then stuff like this is in there.
That's exactly what I'd expect when someone is excited about AI usage and becomes... well, sloppy.
Tridge already explains this:
"Like many developers of open source packages I’ve been hit by a flood of security reports lately in my role as the rsync maintainer. Many of those reports are AI generated (not all though, there are some notable ones with very careful and high quality manual analysis).
As this flood started to get more intense I realised I needed to raise the defences on rsync a lot — we needed much more thorough test suites, code coverage analysis, CI testing on a lot more platforms, deliberate and thorough scanning for possible security issues (so I find at least some of them before other people!) and the addition of a whole lot of defence-in-depth hardening techniques. This is all a huge amount of work. "
> Also the number of commits is suspicious. In the last two months, rsync had about as many commits as in the two years before that.
I wonder if the data looks worse or better when computed per commit instead of per 10 commits.
Seems like someone could use Claude to port rsync to Rust and the whole enterprise would be safer from things like this.
Start with unsafe then gradually convert into idiomatic Rust.
Your "let's redo this in Rust" made me wonder if generative AI will also be susceptible to software fads. One LLM writes a few blog posts extolling a new framework/language. Other agents read these and get 'influenced'. Then they start clamoring for 'let's redo this in X!'. Can't wait to see it. /g
Prompt: automate writing commits to increase safety in these software projects so that my profile increases and I can snag a high-paying Rust job.
LLM: this commit changes whole codebase to Rust!
We will need rigorous agnostic statistical experiments to know what stuff is better
AI multiplied by Linux overcommit. What times we live in!
(My own view: 10.8 GB is nothing these days. Your sprintf buffers are probably larger than that. (And if they aren't: they should be. That, or you should start using snprintf...))
> This is a good example of what slips through LLM attention. It forces all allocations to be calloc as if it is a strict upgrade.
I wouldn't assume Claude made that decision; it's not as if that was some incidental thing that it snuck into a large commit. The commit message starts with "zero all new memory from allocations", and that's exactly what the commit does. What do you imagine the prompt was?
It seems totally plausible to me that a human initially thought this was an improvement, then rethought after discovering the RSS regression. And it's not a law of nature anyway that this change has to increase RSS; calloc could special-case the case in which memory was freshly returned from the OS, knowing fresh memory mappings are zeroed anyway.
I blame AI for these regressions mostly in the sense that it caused a flurry of vulnerability reports. Those led to a flurry of quick fixes. Sometimes quick fixes cause other problems.
You don't really have to guess. The guy told us the AI didn't suggest this specific change:
> The change to zero memory was my idea and my change. It was a reaction to a security report I got which caused use of an element past the end of an array. By zeroing the allocation I could ensure that misuse of that memory if a similar bug came up in the future could only cause a null ptr deref, which is better than the chance of a valid pointer. It got a claude co-authored tag on it as I got it to do some tidy ups of a series of commits, and that is just what it does when it makes any modification. It doesn't mean the change was written by claude. It was written by me.
https://github.com/RsyncProject/rsync/issues/959#issuecommen...
> … By zeroing the allocation …
How does that prevent reading past the end of the buffer? Or change how bytes outside the buffer are used? Are these arrays of pointers so that the “null ptr deref” comment makes sense?
Or am I the bozo and don’t know what’s happening here?
It doesn’t. It’s just that dereferencing a zeroed pointer reliably crashes the program (unless you specifically do funky things with mmap) but dereferencing garbage memory as a pointer could do a lot more insidious damage.
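A minimal sketch of that point, assuming the stray read lands on a spare slot inside the same allocation (the thread does not say exactly where the original bug read from). The `ptr_table` type and its functions are hypothetical, not rsync code. The C standard does not strictly promise that all-zero bits form a null pointer, but this holds on mainstream platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical table of string pointers; `used` slots are filled, the rest are spare. */
struct ptr_table {
    const char **slots;
    size_t used;
    size_t capacity;
};

/* Allocate the table with calloc so every spare slot starts as all-zero bits. */
static int ptr_table_init(struct ptr_table *t, size_t capacity)
{
    assert(t != NULL && capacity > 0);
    t->slots = calloc(capacity, sizeof *t->slots);
    t->used = 0;
    t->capacity = capacity;
    return t->slots != NULL ? 0 : -1;
}

/* Append a pointer; returns 0 on success, -1 if the table is full. */
static int ptr_table_push(struct ptr_table *t, const char *s)
{
    if (t->used == t->capacity)
        return -1;
    t->slots[t->used++] = s;
    return 0;
}

/*
 * Simulates the bug class from the quote: the index is past `used` but
 * still inside the allocation. With calloc the stale slot reads as NULL
 * on mainstream platforms, so a later dereference faults instead of
 * following a leftover pointer.
 */
static const char *ptr_table_get_unchecked(const struct ptr_table *t, size_t i)
{
    assert(i < t->capacity); /* zeroing does nothing for reads beyond capacity */
    return t->slots[i];
}
```

Had the table come from `malloc`, the spare slots could hold whatever bytes were left in that memory, including something that looks like a valid pointer. Reads beyond `capacity` are out of bounds either way, which is why zeroing counts as defence in depth and not a fix.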
Okay, I had not read this or any discussions there (except the one linked in the post), but this looks weirder. The comment you linked is a dev responding to what is very clearly a bot comment. I am sure they have good intentions, and I have no reason to believe otherwise as I have no connection to the project whatsoever, but the original commit being 4-5 lines long (what did Claude do then?) and the revert description almost certainly being written by an LLM make the slop argument stronger in my mind.
I hope this doesn't come across as unkind towards the dev who gives their time and energy to the project. Grateful for that.
For those commenting, I suggest you read the post linked by the rsync author:
https://medium.com/@tridge60/rsync-and-outrage-d9849599e5a0
(Disclosure: while I haven't talked with him in years, Tridge was my colleague and mentor for many years. I feel it is worth considering his view before joining a crusade)
> I thought it would be a good idea to do the core structure for the new test suite in public on master first though given all the rage that has generated maybe that was a bad idea.
I don't entirely understand what this is saying. People wouldn't have been outraged if only the tests had been updated and/or he pushed solely on master - but he pushed breaking changes onto the release branch(es) too. Breaking workflows that have worked for years is a prime way to get people irate, and then seeing "Claude" in the commits just pours gasoline onto the fire.
This should be the top comment.
I think it's pretty sad that he even had to write it. Quite a lot of judgement from people who aren't paying his bills.
I don't have a dog in this fight, but a few points that look a little suspicious:
- The release with the highest number of attributed bugs is the release _right before_ the first release with Claude-coauthored commits, released in January; is there a chance that unattributed LLM-authored commits made it into this release?
- The release attribution methodology is not great, since it will tend to attribute bugs introduced in a minor version update to the longest-lived patch release of that minor version. I doubt that 3.4.1 actually introduced a lot of bugs, but since it was released a day after 3.4.0, bugs that were introduced in that release get attributed to 3.4.1.
- Relatedly, more recent releases have had less time to have bugs filed against them, so there may be a bit of a bias toward evaluating recent releases as less buggy.
Agree. From the article:
> Here's my favorite part, though. Digging into the data, one of the first things that jumped out at me with blinding clarity was that the worst release, by far, in rsync history was entirely prior to the introduction of Claude ... And yet nobody noticed.
Language really does suggest the article's author does have a dog in this fight and is cloaking opinion in fancy statistics jargon. "Blinding clarity"? All you have to do is draw a plot. And anyway, v3.4.1 was 2025-01-16, technically well within the AI assisted coding era and before attribution was becoming standard practice.
Also from the article:
> "Claude clearly made things worse" &emdash; the main claim
This article was clearly generated by AI, yet I found no mention/attribution of that by author.
How likely is it that someone who vibe codes articles would also vibe code the underlying analysis and be eager to accept an outcome that is highly validating of that person’s workflow? I’d say very.
Are the numbers wrong? That's the only relevant thing here.
Also, humans do use em dashes, just FYI.
You can use LLMs in multiple ways, from very hands on to make local changes to completely hands-off.
I've seen plenty of code that was LLM generated but the commit message itself did not have the co-author attached to it. This only seems to happen when someone's interface to the codebase is completely through Claude/Codex/..., and those are usually the most verbose commits, and yet they say the least, because they just summarize the code changes, not the why.
On the other hand I've seen developers using Claude as a tool. They have VSCode open and a terminal window with Claude and go back and forth, ensuring they write correct code, and leave the plumbing to Claude.
So maybe the author of the code started off small and it grew over time?
I would expect a mature code base like rsync to have a lot of unit tests and integration tests, and frankly, if there aren't enough to catch such bugs, adding them should be your first use of LLMs, so that you have some deterministic guidelines in place when you do start making changes to your actual code.
I have been experimenting with both aforementioned styles with interesting results.
I've had a local LLM spending weeks trying to write tests, then debug those tests, then write antipatterns and patterns for those tests.
It's amusing. It's not terrible, but tests aren't going to save you from a malicious tester.
Your first and second points seem to contradict each other because if all of the bugs for 3.4.1 should be attributed to 3.4.0, that pushes the timetable back even further that unattributed LLM commits would have to have been being committed to the project, which just makes your point even more absurd.
Which brings me to my overall response, which is that there is absolutely no evidence, and nothing even intimating this hypothesis, that LLM commits were secretly being added to earlier releases before they were attributed, and that's why the rate of bugs is higher. There's no reason to think so, and there's no evidence for it whatsoever unless you beg the question and assume that higher bug counts must automatically indicate AI involvement, which is just circular reasoning. You're essentially just making up a hypothesis out of thin air to preserve your point.
Regarding your third point, that one's fair, but I've done the analysis and I can put it up if you want, as to how long it usually takes to find bugs and how far through the release cycle we are for each version.
Sorry, I should have said this explicitly in the original comment: I think you're likely _correct_ that there isn't a clear increase in the rate of bugs attributable to LLM-authored code in rsync. Your analysis provides evidence in this direction; these are just the things that made me go "hmm". They're not accusations or claims that the conclusion is invalid. But they're definitely things to be curious about.
Regarding unlabeled LLM-authored commits, I don't think it's unreasonable in general to think that an open-source project might have had unlabeled LLM-authored commits at some point before 2026. Looking more closely at rsync's recent commit history, I think it's less likely in this case. There's just a low number of commits in general, _until_ large batches of Claude-authored commits start showing up early this year. But this then raises some questions about the bugs-per-commit metric; it does correct for something like "size of release", but also obscures a significant shift in commit velocity that may be downstream of adding LLM development tools to the workflow.
Like I said, I don't have a dog in this fight, and I try not to approach these sorts of questions from a position of explicit advocacy. I do think it's an interesting question, though, and we should try to understand what the data is actually telling us.
Isn't the metric that you've used "bugs per commit ~ per new line of code" going to miss the issue?
All code is technical debt.
If rsync releases used to have 500 lines changed and 5 bugs in and AI-powered rsync releases have 50000 lines and 500 bugs, it's the same bugs/line but much worse experience for the user?
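The arithmetic in that hypothetical can be checked with a short sketch. The figures are the made-up ones from the comment above, not rsync data, and the type and function names are chosen here for illustration.

```c
#include <assert.h>

/* Figures come from the hypothetical in the comment above, not from rsync data. */
struct release_stats {
    long lines_changed;
    long bugs;
};

/* Bugs per 1000 changed lines. */
static double bugs_per_kloc(struct release_stats r)
{
    assert(r.lines_changed > 0);
    return 1000.0 * (double)r.bugs / (double)r.lines_changed;
}

/* How many times more bugs release b ships than release a. */
static double absolute_bug_ratio(struct release_stats a, struct release_stats b)
{
    assert(a.bugs > 0);
    return (double)b.bugs / (double)a.bugs;
}
```

For `{500, 5}` and `{50000, 500}` the rate is 10 bugs per 1000 lines in both cases, while the second release ships 100 times as many bugs. A rate-normalised metric treats the two as identical, which is the concern being raised.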
I've not looked into the details of this case and I do use AI assistance coding at work but in my experience, the problem is that it's too easy to write lots of code and therefore hard to review the huge volumes of code and this analysis will ignore that?
edit: actually your table shows there weren't unusually large numbers of commits in this release, so perhaps my initial skepticism shows a bias I have?
Let's start with the most outright alarming error: the Claude statistics are drawn from a grand total of 2 data points.
That's sort of the point. There isn't enough data to extrapolate, and yet that's exactly what those outraged about AI were doing, and when you do do the very minimal types of analyses (permutation tests, and looking at distributions, mostly) that are actually valid, safe, standard, and useful to do on such low amounts of data, again, no evidence for the outrage shows up, and the two releases look so normal that it sort of shows no one would've cared if they hadn't known or found out that Claude was involved.
I really think this is a much better standard of evidence, limited though it is, than outrage-fueled cherry-picked anecdotes, which is what has been driving this whole thing. If you disagree, and think the outrage should go on when I've shown there's an absence of evidence entirely for it (although of course, that's not evidence of absence; maybe I'll have to eat my words 5 releases down the line, but appealing to that now feels like a Russell's Teapot), would you care to explain why?
I know you’re defending your work here but this behavior does absolutely nothing to help your point.
The interpretation of the p-values is also alarming. One of the first things they teach you in statistics class is: “an absence of evidence is not evidence of absence”.
This analysis showed that there is indeed an absence of evidence, but it concludes there is evidence of absence.
Traditional p-hacking is done by oversampling and overtesting. If you do 20 analyses, on average one will show p < 0.05 by random chance. This analysis is doing the inverse of that. Under-sampling, and concluding with p > 0.05
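A quick sketch of the "20 analyses" arithmetic, assuming the tests are independent and every null hypothesis is true, so each one falls below the threshold with probability alpha. The function names are chosen here for illustration.

```c
#include <assert.h>

/*
 * Probability that at least one of `tests` independent tests comes out
 * below the threshold `alpha` when every null hypothesis is true.
 */
static double prob_any_false_positive(int tests, double alpha)
{
    assert(tests >= 0 && alpha >= 0.0 && alpha <= 1.0);
    double none = 1.0;
    for (int i = 0; i < tests; i++)
        none *= 1.0 - alpha;
    return 1.0 - none;
}

/* Expected number of false positives among `tests` such tests. */
static double expected_false_positives(int tests, double alpha)
{
    return tests * alpha;
}
```

With 20 tests at alpha = 0.05, the expected number of false positives is 1, and the chance of seeing at least one is roughly 64%.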
> This analysis showed that there is indeed an absence of evidence, but it concludes there is evidence of absence.
I tried pretty hard to avoid saying that, can you point me at how to rephrase? The point I'm trying to make is just that there is absolutely no evidence at all for what people are saying with such absolutism and claimed objectivity (that Claude made rsync worse), and thus it doesn't justify the outrage.
> Under-sampling, and concluding with p > 0.05
How would I avoid under-sampling here? And if you're going to say it's because I only have 2 data points, well, the side making the positive claim — that Claude made rsync worse — only had two as well, and unremarkable ones at that, as I've tried very hard to show.
You are interpreting the p-values on their own merit rather than using them to test a null hypothesis. Quotes like:
> With a p-value of 74%, the answer is a decisive no. The odds ratio is 1.06 — essentially 1:1. Claude releases are no more likely to be above the median than any other releases.
are problematic in this context, as the correct conclusion here is that you just don't have enough data to conclude whether or not you are more likely to encounter a bug after a Claude commit.
> How would I avoid under-sampling here?
You don't. You admit that you don't have enough data and move on. What you are trying to do here is prove a negative, which is extremely hard to do. In your discussion you claim that the users complaining had no right to, however nothing in your analysis showed they were wrong. We simply don't have enough data (yet) to say either way. When we have enough data they may be proven right or wrong, but until then, we cannot conclude either way.
If you still insist, I recommend looking into Bayesian analysis. Theoretically, at least, the posterior distribution from a Bayesian analysis can be interpreted directly and analyzed on its own merits. However, I suspect your posterior will have way too much uncertainty to reach any conclusions.
Edited that claim, and made several clarifications elsewhere. The whole point of this analysis is that outrage is unjustified on the basis of two totally statistically unremarkable releases that no one would have remarked on pre-AI (my further proof of this is that there was a pre-AI remarkably broken release, and no one did comment!) and zero positive evidence outside cherry-picked anecdotes for any negative impact. We should wait for outrage and version pinning and cancelation until there is evidence, no? I'm just trying to say that these specific releases are unremarkable, and there's no evidence at all of harm currently; I'm not trying to build any kind of predictive model for future Claude releases to say anything grander than "these specific releases are fine, what are we freaking out about?", not some claim about what Claude-exposed releases will look like or trend like in the future or in general.
The concept you need here is "Statistical Power".
The ELI5 version is that there are two mistakes you can make when looking at a P value:
Type I error, where your P value is falsely low. In the experiment being discussed here, it would lead one to conclude that AI code is worse. Otherwise known as a false positive.
Type II error, where your P value is falsely high, leading you to conclude that AI code is no different. Otherwise known as a false negative.
https://en.wikipedia.org/wiki/Power_(statistics)
One can calculate statistical power for a given experimental protocol.
My hunch is that if you did this, you would find this experiment is grossly under-powered.
This means you can't make the "absence of evidence" claim.
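One rough way to see why so few labelled releases limit what a test can say is to look at the smallest p-value an exact permutation test can report. This is a sketch only: the data in the usage note is hypothetical, the function names are chosen here, and this is not the article's actual analysis.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Exact one-sided permutation p-value for two labelled observations:
 * the fraction of all unordered pairs (i, j) whose summed value is at
 * least as large as the sum of the two labelled ones.
 */
static double perm_pvalue_two(const double *v, size_t n, size_t a, size_t b)
{
    assert(v != NULL && n >= 2 && a < n && b < n && a != b);
    double observed = v[a] + v[b];
    size_t extreme = 0, total = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            total++;
            if (v[i] + v[j] >= observed)
                extreme++;
        }
    }
    return (double)extreme / (double)total;
}

/* Smallest p-value this test can report: 1 / C(n, 2), reached only without ties. */
static double min_attainable_p(size_t n)
{
    assert(n >= 2);
    return 2.0 / ((double)n * (double)(n - 1));
}
```

With five hypothetical releases scored `{1, 2, 3, 4, 5}` and the two worst ones labelled, the p-value is 1/10, which is also the floor for n = 5. Such a test can never drop below 0.05, so it has zero power at that threshold. The floor falls as n grows (about 0.005 at n = 20), but a lower floor does not by itself mean the test has adequate power against a modest effect.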
He can't make the evidence of absence claim, but he can absolutely make the absence of evidence claim.
Perhaps in an “everyday language” way, but not in the technical, statistical sense.
In an underpowered statistical study, a claim that two experimental conditions did not differ is not persuasive.
No. It's a description of the result of the maybe underpowered study. The underpowered study did not find evidence. Evidence is absent. Because it is underpowered, it's not evidence that the effect is absent.
The claim is not "two experimental conditions did not differ". The claim is "The data do not show evidence that the experimental conditions did differ".
Unfortunately for the people mad about this, I predict the only thing they will accomplish by pressuring the rsync maintainers is to discourage everyone else from responsibly disclosing their use of AI. You’re just going to make people disable Claude attribution on their commits to avoid drama.
I never care about AI usage disclosure, because I don't believe that human produced code is necessarily better than AI produced code, unless it's someone I personally know.
People need to be responsible for code they commit and push anyways. This has never changed. Whether the code is written by hand, by their cat walking over keyboard, or by AI, is not my concern.
A project's code quality can decline for all kinds of reasons. I don't think it's productive to laser-focus on whether it's produced by AI or not. That's a distraction. If a person just wants to find an excuse to criticize AI, and another person wants to fight back and defend AI, sure, go for it. But that's not how you would want to assess a project's code quality.
Something as simple as requiring sign-offs like the DCO may be relevant to people who care. I do think the drive-by stuff may get smaller. People don't need to get stuff upstream. I have lots of patches I am keeping downstream, and instead have a trigger system: when new package updates drop into Debian, I rebuild the package with my patches on top using quill. Other systems like Gentoo basically always supported this flow.
So why bother forking or going upstream? Maybe it's selfish. I think publishing the patches is cool, but I feel less of a need to force other people into doing what I want, or even to write every possible configuration or solution. I just hack it for me.
> People need to be responsible for code they commit and push anyways.
Well the GPL (which rsync is licensed under) says: "This program comes with ABSOLUTELY NO WARRANTY" so actually nobody is responsible for anything.
Nobody is suing the maintainer for support here so this is completely irrelevant.
> You’re just going to make people disable Claude attribution on their commits to avoid drama.
People should be doing this regardless of drama. No reason to provide free advertising for trillion dollar corporations. Generated-by trailers are only relevant when contributing to third party projects, in that case disclosure is polite.
The value of the Claude attribution is that you can tell at a glance who used AI.
I don't care about the advertising angle. We all know Claude by now. I want some indicator that AI was used.
At my employer, if AI is not used, it shows up on your performance report and you’ll be told if you don’t start using it, you will be dismissed. I work at a medium sized successful YC-backed SaaS. So here, the attribution is meaningless - they look at your Bedrock and LLM API calls as well as Claude Code history.
If the company policy is to have everyone using it then everyone is going to assume you're using it.
I don't see a need for an attribution line in this case.
Do you fellow ICs have access to those reports and can correlate commits from you to the prompts used to create them easily?
Not currently. Each IC's report is kept private unless they voluntarily share it, and ICs don't have visibility into other ICs' Claude Code or Cursor logs. I think we're moving toward a model where it will be easier to correlate commits with chats, but the timeline is not clear.
And why do you want to know that? So you can call our projects slop? Ostracize us?
Because LLMs are not humans, and the code they produce will have a different distribution of failure modes than human written code, so attribution is useful info while reviewing?
> while reviewing
As I said, disclosure is polite when contributing code to third party projects which will undergo human review.
No need for such things in one's own projects.
>which will undergo human review
This can be largely assumed to be true for any open source code. It's kinda the point of open source.
Nope. It cannot be assumed at all. Maintainer could just as easily tell Claude to review the hand written code you sent instead of spending any effort on it. Maintainer could sit on the patch for months on end only to swoop in later and rewrite it instead of engaging with you, thereby erasing your contribution and attribution. Maintainer could just ignore you entirely despite the pervasive "patches welcome" attitude.
If there's one thing I learned not to do in open source, it's to assume nonsense like that.
I'm referring to the fact that "open source" quite literally means "readable by humans [and machines]", and anything beyond that is a subject of debate. There are more users than readers in nearly all cases, but being able to read the code as a user is a significant benefit at times, and it's one of the reasons it's such a large ecosystem in terms of both users and contributors. (it usually being free is another big reason, of course)
Even with coding agents gaining popularity, many humans still look at the code at some point.