FrontierCode

cognition.ai

97 points by streamer45 5 hours ago


swyx - 4 hours ago

:wave: i was on the team! AMA.

some headlines

- 3000 rubrics on code quality. First benchmark to measure: "would this code actually get merged?"

- 20+ expert open-source maintainers created tasks on their own repos to capture their opinions & taste.

- total 1000+ hours of real-life software maintainer work captured in the dataset. ON TOP of that, 40+ hours of real human work to turn that real-life work into well-validated, structured tasks with rubrics (plus even more work to turn tasks/prompts from devin-infra-specific to pluggable for any coding agent)

- results in an 81% lower false positive rate than SWE-Bench Pro

- High quality bar: many QA stages & each task manually reviewed by Cognition researchers (examples in post)

Opus 4.8 scores 13% on FrontierCode Diamond.

one of my goals was also to datamine interesting stuff even on the easy tasks. for example, if you squint you can see the answer to "WTF Happened in late 2025" with coding models: https://x.com/swyx/status/2064081945567580323

Topfi - 3 hours ago

Great effort and a bit closer to my private evals than DeepSWE. I greatly appreciate the focus on false negatives and false positives, along with the emphasis on actual, mergeable-quality output over simply passing. I could see a lot of others adopting your list of metrics as a basis; they are very well defined and give solid coverage of everything one should want from code, rather than focusing on one or two narrow targets. I will incorporate a lot of these ideas into my own tests and polish some other parts where I had somewhat unintentionally already gone in a roughly similar direction.

vessenes - 3 hours ago

This looks great. Well reasoned, tons of work put into eval, thanks for building it.

It strikes me as kind of wild that good evals can drive tens to hundreds of millions of dollars of compute deployment in the wild. There’s something new and collaborative and competitive about the eval / frontier model race that’s quite interesting.

In this case “shorter, actually mergeable patches that open source maintainers would accept” feels like a great thing to deliver to the world.

I didn’t deep dive into good and bad patches, but I wonder if swyx or others on the team have predictions on saturation: both when it will happen and how useful it will be. That is, do you guys think this test is broad enough as written to get better behavior out of models, and if this test does saturate, will we see generally better patch / coding behavior?

singpolyma3 - 4 hours ago

Since no one knows or can agree on what "code quality" is and we can't measure it for human output, I'm dubious about measuring it for LLMs

einpoklum - 4 hours ago

> Today’s coding benchmarks have established that models can write correct code.

I wouldn't say that.

> But as AI-generated code becomes the dominant path to production

I really hope that's not the case.

fHr - 4 hours ago

babe wake up another eval dropped