Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers

senior-swe-bench.snorkel.ai

144 points by matt_d 15 hours ago


jfim - 8 minutes ago

I wonder how they're planning for the benchmark to stay relevant over time.

If the benchmark is to implement features that are part of an open source project, and LLMs have those changes as part of their training dataset, it seems that they could just give a verbatim or slightly modified version of the change in their training data.

And if one updates the benchmark to only incorporate code changes that are past the models' knowledge cutoff, then the benchmark is less comparable over time, since the changes in the benchmark at time T and T+1 aren't the same.
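
A rough sketch of the cutoff-filtering idea, assuming each task records the date its change was merged. The task records, the `merged_on` field, and `tasks_after_cutoff` are all made up for illustration and are not taken from the benchmark itself:

```python
from datetime import date

# Hypothetical task records; the ids and the "merged_on" field are assumptions.
tasks = [
    {"id": "task-a", "merged_on": date(2024, 3, 1)},
    {"id": "task-b", "merged_on": date(2025, 9, 15)},
]

def tasks_after_cutoff(tasks, cutoff):
    """Keep only tasks whose change was merged strictly after the cutoff."""
    return [t for t in tasks if t["merged_on"] > cutoff]

# A model with an assumed knowledge cutoff of 2025-01-01 only sees task-b.
fresh = tasks_after_cutoff(tasks, date(2025, 1, 1))
print([t["id"] for t in fresh])
```

Since each model would get a different task subset depending on its cutoff, this also shows why scores stop being directly comparable.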

magnio - 13 hours ago

I saw on Twitter that in an ML course at Tsinghua University, one of the tests asks students to write quizzes that fail as many LLMs as possible.

What if we create a benchmark that works like this and assigns Elo scores? Models fight head-to-head by writing a question, a bug, or an incomplete implementation, which the opponent has to answer, fix, or finish.
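
For what it's worth, the rating side could be a plain Elo update. This is only a sketch: the K-factor of 32, the starting rating of 1500, and the rule that the challenger "wins" when the opponent fails the challenge are all assumptions, not anything a real benchmark specifies:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one match.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a draw.
    k=32 is an assumed K-factor.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Hypothetical match: model A writes a bug, model B fails to fix it, so A wins.
ratings = {"model_a": 1500.0, "model_b": 1500.0}
ratings["model_a"], ratings["model_b"] = update_elo(
    ratings["model_a"], ratings["model_b"], score_a=1.0
)
print(ratings)
```

The hard part is not the rating math but deciding who won, since someone still has to verify that the challenge was solvable and that the answer was actually wrong.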

jeffbee - 2 hours ago

Staff SWE Bench: LLM doubts whether we should do any of this, calls the entire project into question, refuses to merge code, but is happy to delete it.

_345 - 13 hours ago

This explains why I've always felt that Opus 4.8 was leagues ahead of GPT 5.5. It's so good at taking underspecified requirements and filling in the gaps with sensible approaches for your project.

21asdffdsa12 - 11 hours ago

The value of a senior position is in applying known solutions and strategies to novel problems. I cannot see how any benchmark that never changes can provide a novel challenge for long.

Any decent benchmark would use the whole of TRIZ to generate a giant ball of a problem first and watch an AI deduce an optimal solution.

jonathanleane - 15 hours ago

Top solve rate is currently 24% with Opus 4.8... What's a competent human supposed to score?

facorreia - 13 hours ago

It's nice to see a new public benchmark from Snorkel. They're doing some pretty sophisticated stuff over there.

bloody-crow - an hour ago

Doesn't being open source create incentives for the AI companies to optimize their LLMs for the specific benchmark? I thought all those benchmarks were deliberately closed source primarily for this reason.

LiamPowell - 14 hours ago

> You are a senior SWE-Bench reviewer, make no mistakes.

I don't know what a better approach would look like while still remaining feasible; however, this approach of telling an LLM to make a subjective judgement seems fundamentally flawed.

piterrro - 7 hours ago

> Senior engineers build features without over-specified requirements

To me this already disqualifies the benchmark. That statement is missing the most critical piece about senior engineers: senior engineers know how to obtain input for their work on their own, whether that's talking to customers or using metrics. They never come up with stuff on their own - that’s junior behaviour.

Until a coding agent is able to *gather* the input on its own, it's never going to be "senior".

purple-leafy - 14 hours ago

Benchmarks are great, but I feel like there’s a better way; this seems quite subjective.

What you really need is an objective benchmark.

monster_truck - 13 hours ago

Once again I am asking: who are these people and what makes them more qualified than any of you to assess anyone or anything "as a senior engineer" (with the subtext being that none of you are, either)?

guilhermecgs - 13 hours ago

fable 5?

danpalmer - 15 hours ago

Why didn't they just make it "Staff SWE-Bench", would be much better smh. /s

But seriously, as an industry we're terrible at assessing engineering levels. I've worked with "senior engineers" who can't code and I've worked with "junior engineers" who could run rings around them.

Benchmarks like this should be much more precise about what they're actually testing, and what axes they're hard on. We also need to rise above prompts like "you are a senior engineer", it's woo, and it's far better to ask for precise outcomes.

0xbadcafebee - 13 hours ago

The "tasteful solves" is codified cargo culting. The software industry has a tendency to anthropomorphize software while playing to the ego of the programmer. The programmer imagines they are creating a "beautiful" artistic expression. Good code becomes "tasteful", as a software artist must have "good taste" to tell the good software from the bad software. Good quality lacks "bad smells", because a good artist has fine senses (and everybody must like the same smells). "Fine craftsmanship", in code as in woodworking, means your finely-crafted work is "technically superior", so you can charge more money for something that could've been made cheaper and faster and done the same thing.

But it's a lie. Nobody's paying you to make paintings. They're paying you to build machines. Comparing "making working software" with "taste" always devolves into bikeshedding and subjective opinionism, uses subjective human feelings to describe what should be objective and functional, isn't rooted in scientific rigor, and detracts from the real purpose of the thing. The work doesn't actually get better by trying to apply artistic principles to engineering. It just feels better for the people making it.

Once you make the machine work, then you can go about gilding the lily. But this is unromantic, unsatisfying, boring. Since the inmates run this particular asylum, we end up with a benchmark that tries to accurately mimic the human ego as applied to software design. Thus the new Gods create their digital Adams and Eves in their image.

HarHarVeryFunny - 5 hours ago

Sounds more like vibe-bench.

For any professional work you care about the details.

Even for hobby work, if you are using LLMs then presumably it is to do the drudge work of coding, not to make the decisions, and that goes doubly if you are a senior developer. Sure, the LLM can "fill in the details" and vibe code (or attempt to) you a compiler or whatever, but the whole reason you are doing a hobby project is presumably because you want to bring your experience to bear and build a GOOD compiler, not a generic one.

fiso64 - 11 hours ago

I think benchmarks like this are too subjective and narrow to be useful. For example, whether a patch "bloats" the codebase really depends on the situation: if it's building a feature that will grow in the future, or refactoring code that has a long history of bugs, then a larger patch might in fact be good. It's not clear from the blog just how much context the LLM judge receives about the long-term project goals and history. Benchmarks should be focused on evaluating the final result only. Maybe ask the coder to build a full app, or implement many new large features for an existing app in sequence, with a larger set of requirements, or have another LLM roleplay as the human to make the instructions a little more underspecified. When done, ask a reviewer harness to test the product, not the code, for 5 hours. Count the number of bugs and weigh them by severity. "Taste" would then become an automatic consequence of correctness.

(Full disclosure, I'm not a software engineer.)
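
The "count the bugs and weigh them by severity" part could be scored roughly like this. It is a sketch only: the severity labels and their weights are assumptions picked for illustration, and lower scores would mean a better result:

```python
# Hypothetical severity weights; a real harness would need to justify these.
SEVERITY_WEIGHTS = {"critical": 10, "major": 5, "minor": 1}

def weighted_bug_score(bug_severities):
    """Sum the weight of every bug the reviewer harness found (lower is better)."""
    return sum(SEVERITY_WEIGHTS[severity] for severity in bug_severities)

# Example: one critical bug and two minor ones found during the test session.
found = ["critical", "minor", "minor"]
print(weighted_bug_score(found))  # 12
```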