The last six months in LLMs in five minutes
simonwillison.net
474 points by yakkomajuri 10 hours ago
> The coding agents got really good
Ever since November 2025, the so-called "inflection point", I've been wondering for whom coding agents became "really good".
All I observe is that they got better at tool calls and at answering questions about big codebases, especially when the question involves a vague pattern to search for, and they're super useful for that! But for generating production code, even with a lot of steering and babysitting?
Absolutely not. Not quite there, not even close, in my experience.
But we should stop talking in 1s and 0s, especially with the marketing hype trains. There is a gradient of capabilities that agents have, and it really depends on the intricacies of the codebase you're working on. I think everyone has yet to discover how to better apply these tools in their day-to-day work.
But that totally collides with the current narrative, which flattens our work into something that is always the same and can be easily automated in every case. It's not!
That's why the debate is so polarized, imo: there isn't a shared experience.
I believe by now we know exactly what it's good at and what it's terrible at.
The problem is our CEOs' fear of the future, which pushes them to peculiar decisions that objectively make no sense (cf. the infamous discussion of the Microsoft employee on GitHub who couldn't force their agent to do the proper thing).
It's not the first time I've witnessed this kind of discrepancy, and probably not the last; I've just learned to adapt to it.
The polarization comes from the very disparate coding experiences and output quality that different people find when using these tools.
For example, I've had the opposite experience to yours, generating very high quality work using Claude (such as https://github.com/kstenerud/yoloai). Just in dealing with all the bugs and idiosyncrasies in the technologies I'm using, the agent has been a godsend in discovering and cataloguing them so that the implementation phase doesn't keep tripping over them: https://github.com/kstenerud/yoloai/blob/main/docs/dev/backe...
And the agents keep getting better all the time. Even in the past month I've noticed a considerable jump in their ability to anticipate issues and correctly infer implications as we build out research, design, architecture and planning docs. By the time it comes to coding, it's mostly a mechanical process that can be passed off to Sonnet with a negligible defect rate.
I don't want to offend (it's AI coded anyway :)) but that does not scream "high quality" to me. The headline gif on that repo just paints a terrible picture. It can't draw a box correctly, and there are random underscores all over the screen. The UI itself is just incredibly incoherent. I don't even know what I'm looking at.
Like, no it doesn't seem like very high quality work... It just seems like a vibe coded tool.
Edit: yes it's wrapping Claude. It's BREAKING the TUI. Not sure what people aren't getting here...
Take it up with Anthropic. It's actually their billion-dollar TUI product you're commenting on.
The problem with being such a naysayer is that you're entirely disconnected from what's going on. You haven't tried an agent like Claude Code and experienced it for yourself, so you don't recognise what it looks like when it's in front of you.
> Take it up with Anthropic. It's actually their billion-dollar TUI product you're commenting on.
That's like blaming the company making hammers because you're unable to build a lasting house with the hammer. It really isn't up to Anthropic; it's all about how you use the tool you're holding.
I have tried Claude Code. It doesn't look like that!
I don't know what the project is. All I see is a TUI that looks completely broken.
Go and use Claude Code right now. Does it look like that? Random underscores all over the page. No it doesn't.
It can look like that in certain conditions. The question is why are you so eager to give critique on unrelated work, appearing in a demo screencap, to someone who didn't produce it?
I don't know what you're talking about.
His tool wraps Claude and breaks the TUI. What's so hard to understand?
That's valid critique. What world have I woken up in today?
To be honest I assumed it was the screencap software running a basic terminal env without the bells and whistles that CC needs, which I've seen before. If the actual tool functions like that too, that's not great. That said, if it works for them, it works for them.
You do realize that you're complaining about the Claude Code TUI, right?
That's not what this product is; it's merely a tool it uses.
You claim "very high quality" but can't even get the basic UI working properly. You wrap tmux and a container in 2k lines of code and claim quality, I think the comment above was aimed at this claim.
The UI is working properly. Interfering with Anthropic's UI, or any of the other agent harnesses' UIs it supports, would be madness incarnate.
I also strongly suspect that you'd only taken a cursory glance at the top of the readme prior to passing judgment.
I didn't do much more than a cursory glance either, but found "./sandbox/create.go", a ~1300-line file with so much duplication, even within just itself, that I stopped counting.
Now, it's been a long time since I did Go professionally, but I'm also in the camp of "That doesn't really count as high-quality". I know for a fact you can get quality code out of LLMs, but I don't think this is a good showcase of that.
So why has your tool completely broken the Claude Code UI then?
Can't you see in the gif? It's completely broken. My Claude doesn't look like that. Neither does anyone else's.
Claude Code will automatically "dumb" the TUI down a bit when it can't properly detect certain terminal capabilities, to avoid potential font rendering issues.
Likely there are some terminal caps that aren't being properly preserved inside of the sandbox. It's never bothered me since the agent itself works fine.
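For anyone who wants to check that theory, here is a rough sketch of comparing terminal-related environment variables between the host and a sandbox. The variable list and the function names (`missing_terminal_vars`, `passthrough_env`) are assumptions made up for illustration; this is not a documented list of what Claude Code inspects, nor how yoloai builds its sandbox.

```python
# Variables that TUIs commonly consult to decide how to render.
# This list is an assumption for illustration only.
TERMINAL_VARS = ("TERM", "COLORTERM", "LANG", "LC_ALL")


def missing_terminal_vars(host_env, sandbox_env, names=TERMINAL_VARS):
    """Return the names that are set on the host but absent or different in the sandbox."""
    return [
        name
        for name in names
        if name in host_env and sandbox_env.get(name) != host_env[name]
    ]


def passthrough_env(host_env, names=TERMINAL_VARS):
    """Build the subset of the host environment to forward into the sandbox."""
    return {name: host_env[name] for name in names if name in host_env}


if __name__ == "__main__":
    import os

    # Example: compare the real host environment against a bare sandbox
    # that only sets TERM=dumb (a hypothetical degraded case).
    host = dict(os.environ)
    sandbox = {"TERM": "dumb"}
    print("not preserved:", missing_terminal_vars(host, sandbox))
    print("would forward:", passthrough_env(host))
```

If the first list is non-empty inside the sandbox, forwarding those variables would be the first thing to try; whether that actually restores the full TUI is untested here.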
Yeah, so whatever you're doing to wrap Claude is broken. Because it's breaking the UI.
"It's never bothered me". Cool. But your tool is bugged.
Feel free to open a bug report if it bothers you. Or a PR.
Or feel free to avoid the tool entirely if this UI issue shakes your faith in its overall quality down to its very foundations.
This is hardly a hill to die on.
You’re missing the point.
You claimed high quality and provided a repo.
Did you not expect someone to actually look and critique it?
Whether the visual bugs are a deal breaker or not isn’t the point.
The point is that it’s not high-quality code. It may work, but it’s not code I would ship at my job, and therefore it’s not high enough quality for anyone serious.