$500 GPU outperforms Claude Sonnet on coding benchmarks

454 points by yogthos a day ago

Generating big chunks of code is rarely what I want from an agent. They really shine for stuff like combing through logs or scanning dozens of source files to explain a test failure. Which benchmark covers that? I want the debugging benchmark that tests mastery of build systems, CLIs, etc.

bartread - 9 hours ago

I agree. Also good for small changes that need to be applied consistently across an entire codebase.
I recently refactored our whole app from hard deletes to soft deletes. There are obviously various ways to skin this particular cat, but the way I chose needed all our deletions updated and also needed queries updating to exclude soft deleted rows, except in specific circumstances (e.g., admins restoring accidentally deleted data).
Of course, this is not hard to do manually, but it is a bloody chore and tends to be error-prone. But the agent made short work of it, for which I was very grateful.
- CraigJPerry - 9 hours ago
  
  Do you not end up breaking half the value of referential integrity doing it that way? E.g. you had to update all the queries, but now you have a sharp edge in that all future queries need to remember to be soft-delete aware. Not a blocker for sure, just a sharp edge.
  You know your system better than me for sure, a random commenter on a website :-D Your comment just shocked me out of my daze enough for my brain to say "but I always move the record to another table rather than soft delete", and I felt compelled to give an unsolicited and likely wrong opinion.
  - bartread - 5 hours ago
    
    Yeah, I did consider moving records to shadow tables, but - because of the nature of our data - it requires moving a lot of child records as well, so it's quite a lot of additional churn in WAL, and the same for restore. And this approach has its own challenges with referential integrity.
    More than that, though: lots of queries for reporting, and the like, suddenly need to use JOINs. Same for admin use cases where we want them to be able to see archived and live data in a unified view. The conclusion I came to is it doesn't really eliminate complexity for us: just moves it elsewhere.
    Totally valid approach though. I'd also considered different views for live versus archived (or live+archived) data. Again, it solves some issues, but moves complexity elsewhere.
    The other key point: it's a Ruby on Rails system so the moment you start doing funky stuff with separate tables or views, whilst it is doable, you lose a lot of the benefits of Active Record and end up having to do a lot more manual lifting. So, again, this sort of played against the alternatives.
    As I say, not to diss other approaches: in a different situation I might have chosen one of them.
    My conclusion - not for the first time - is that soft delete obviously adds some level of irreducible complexity to an application or system versus hard delete no matter how you do it. Whether or not that extra complexity is worth it very much depends on the application and your user/customer base.
    For some people, just the ability to restore deleted rows from backup would be enough - and in other cases it's been enough for me - but that is always a bit of a faff so not a great fit if you're optimising for minimal support overhead and rapid turnaround of any issues that do arise.
  - andyferris - 9 hours ago
    
    I move the record to another _index_, generally.
    It depends whether you reliably control all the DB client code, of course.
    
    NortySpock - 10 minutes ago
    
    This, make sure the 'active' flag (or deleted_at timestamp) is part of most indexes and you're probably going to see very small impacts on reads.
    It then turns into a slowly-growing problem if you never ever clean up the soft-deleted records, but just being able to gain auditability nearly immediately is usually well worth kicking the can down the road.
- dakolli - 5 hours ago
  
  Must be something incredibly simple that you're making out to be more complicated than it actually is. I've never seen an LLM do these things well.
  - bartread - 4 hours ago
    
    This is what gives me the warm fuzzies about the HN community: people jumping to wild conclusions about your domain and systems based on a 4 sentence comment. /s
sigmoid10 - 12 hours ago

Probably want to look at SWE-bench Pro or Terminal-Bench 2. They cover these longer-horizon tasks that need more than just writing a bit of code in one file. And SWE-bench Pro in particular is not yet saturated like many other common benchmarks. Normal SWE-bench and LCB are not really useful anymore because they are already being gamed hard so the developers can quote high numbers in a repo readme or press release.
jakozaur - 9 hours ago

Build systems are tested by CompileBench (Quesma's benchmark).
Disclaimer: I'm the founder.
slashdev - 7 hours ago

Generating big chunks of code is all I do, all day.
I don't write code by hand any more, neither at work, nor for side projects.
I work mostly in Rust and TypeScript at a developer tools company.
- imiric - 7 hours ago
  
  [flagged]
  - serf - 6 hours ago
    
    I have never read a snide comment on this site that I've been more repulsed by.
    I think because it's so specifically sharpened to stab at the software developer, my compatriot, one of the foremost primary populations here, rather than just an overall shitty human insult -- and timed to do so when the person opens up in an honest dialogue about what they're doing.
    But good news: every large software house I've talked to in the past two years is touching AI. As tragic as that is for a multitude of good reasons surrounding the workforce/copyright/ip/human-laziness/loss-of-skill/etc, that means imiric is going to be outside of software, by their own rules, in totality in just a few short years!
    Happy days!
    
    imiric - 3 hours ago
    
    [flagged]
    
    slashdev - 2 hours ago
    
    You only hurt yourself with that attitude. AI might take your job.
    
    imiric - 2 hours ago
    
    > You only hurt yourself with that attitude.
    Funny, others seem more hurt by it.
    > AI might take your job.
    I'm not the one "grieving the loss of his career". :)
  - slashdev - 7 hours ago
    
    We have the quietest on-call rotation of any company I've ever worked at.
    We have a high standard for code review, static verification, and tests.
    The fact that the code isn't hand-rolled artisanal code, and is generated by AI now, has so far turned out to have no impact on product quality or bugs reported.
    
    dlahoda - 6 hours ago
    
    What company or tools are you working on?
    
    imiric - 3 hours ago
    
    Ah, that's great, sounds like the ideal working environment.
    So, which company is it again?
  - aditmag - 7 hours ago
    
    Tbf, as long as you really know what you're doing and have the sense to avoid falling into a spaghetti code trap, generating bigger chunks of code absolutely works and should be done. The pitfall happens when (a) the dev has no idea what the agent is doing or (b) the dev gives overly broad instructions.
    If you give it specific enough tasks (not to the point where it's writing single functions, but something like a general class description), you're on a good track.
  - yohannparis - 7 hours ago
    
    Why? Because writing code is the only measure of quality when producing tools? What about unit and integration tests, UX research, and performance tests?
    
    adrian_b - 6 hours ago
    
    I agree that for many applications the code written by an LLM can be good enough, as proven by the many commercial applications that contain even worse code.
    However, anyone who uses an LLM must remain aware of the limitations of this method.
    There are many features of a program that cannot be tested exhaustively and which must be guaranteed by its design. When you do not understand very well the structure of a program it may be difficult to decide what must be tested.
    With performance, the confidence in what an LLM produces is even lower, because it is unlikely to know if you have really reached a performance limited by hardware. Obtaining a performance better than a previously existing program does not prove anything, because most existing programs are likely to have a performance much lower than possible.
    In many cases you just want a performance good enough, not the best attainable, so you can be content with your LLM-generated program. But you must not fool yourself by believing that this is really the best that can be done.
Bombthecat - 11 hours ago

Oh yes! I let my environments now be built by agents via kubectl / helm and let them debug issues.
It's amazing! Saves hours of work!
I create the basic helm config, settings, etc., and when there is a conflict or something isn't working, I let an agent fix it!
seunosewa - 7 hours ago

Create it!

mmaunder - 19 hours ago

I’d encourage devs to use MiniMax, Kimi, etc. for real-world tasks that require intelligence. The downsides emerge pretty fast: much higher reasoning token use, slower outputs, and degradation that is palpable. Sadly, you do get what you pay for right now. However, that doesn’t prevent you from saving tons through smart model routing, being smart about reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens.
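To make "smart model routing" concrete, here is a rough sketch of one way it could look. Everything in it is an assumption for illustration: the tier names, the placeholder model names, the token budgets, and the keyword heuristic are all made up, and a real router would likely use something better than keyword matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str              # placeholder name, not a real model ID
    reasoning_budget: int   # cap on reasoning tokens for this tier
    max_output_tokens: int  # hard cap on generated tokens


# Hypothetical tiers: cheap model for most work, expensive one for hard tasks.
ROUTES = {
    "easy": Route("cheap-model", 0, 512),
    "medium": Route("cheap-model", 2048, 2048),
    "hard": Route("frontier-model", 8192, 4096),
}

# Made-up hints that a task probably needs the stronger model.
HARD_HINTS = ("debug", "refactor", "race condition", "architecture")


def classify(task: str) -> str:
    """Crude difficulty guess from the task description."""
    text = task.lower()
    if any(hint in text for hint in HARD_HINTS):
        return "hard"
    if len(text.split()) > 40:
        return "medium"
    return "easy"


def route(task: str) -> Route:
    """Pick a model, reasoning budget, and output cap for a task."""
    return ROUTES[classify(task)]
```

For example, `route("Rename this variable")` lands on the cheap tier with no reasoning budget, while `route("Debug the flaky test in CI")` goes to the expensive tier. The savings come from how rarely the last tier is needed.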

vidarh - 9 hours ago

I get decent results with Kimi, but I agree with your overall premise. You do need to realise that while you can save money on a lot of tasks with those models, for the hardest tasks the "sticker price" of cost per million tokens isn't what matters.
It's also worth noting that the approach given in the link also benefits Sonnet and Opus. Just not as much, since they are more forgiving, but put them in a harness that allows for various verification and repair steps and they too end up producing much better results than the "raw" model. And it's not clear that a harness around MiniMax, Kimi, or Qwen can measure up then.
I use those models a lot, and hope to use them more as my harnesses get better at discriminating which tasks they are cost effective for, but it's not straightforward to cost optimize this.
If I cared about running everything locally, then sure, it's amazing you can get to those kinds of results at all.
thefourthchime - 17 hours ago

I won’t use anything less than the SOTA. I tried using Opus 4.6 medium and immediately regretted it. High messes up enough.
- overfeed - 14 hours ago
  
  What were you using 6 months ago?
  - withinboredom - 14 hours ago
    
    Opus 4.5 ~= Opus 4.6 high. Opus 4.5 was nerfed just before or after the release of 4.6.
    
    hhh - 12 hours ago
    
    The models don’t change.
    
    tornikeo - 11 hours ago
    
    On paper. There's huge financial incentive to quantize the crap out of a good model to save cash after you've hooked in subscriptions.
    
    armchairhacker - 10 hours ago
    
    And there’s an incentive to publish evidence of this to discourage it. Do you have any?
    
    TeMPOraL - 9 hours ago
    
    Models aren't just the big bags of floats you imagine them to be. Those bags are there, but there's a whole layer of runtimes, caches, timers, load balancers, classifiers/sanitizers, etc. around them, all of which have tunable parameters that affect the user-perceptible output.
    
    natebc - 9 hours ago
    
    There really always is a man behind the curtain eh?