Show HN: State of the Art of Coding Models, According to Hacker News Commenters

hnup.date

111 points by yunusabd 13 hours ago


Hello HN,

I was away from my computer for two weeks, and after coming back and reading the latest discussions on HN about coding assistants (models, harnesses), I felt very out of the loop. My normal process would have been to keep reading and figure out the latest and greatest from people's comments, but I wanted to try and automate this process.

Basically the goal is to get a quick overview of which coding models are popular on HN. A next iteration could also scan for harnesses that people use, or info on self-hosting or hardware setups.

I wrote a short intro on the page about the pipeline that collects and analyzes the data, but feel free to ask for more details or check the Google Sheet for more info.

https://hnup.date/hn-sota

jdw64 - 13 hours ago

Interpreting these metrics is quite interesting.

One thing for sure is that while Claude is currently taking the #1 spot in mentions, it carries a lot of negative sentiment due to API pricing policies and frequent server downtime. On the other hand, the runner-up, GPT-5.5, actually seems to have more positive feedback.

Personally, my experience with Codex wasn't as good as with Claude Code (Codex freezes on Windows more often than you'd expect), so this is a bit surprising. That said, the more defensive GPT is definitely better in terms of sheer code-writing capability. However, GPT actually has quite a few issues with text corruption when generating in Korean or Chinese—something English-speaking users probably don't notice. In terms of model capabilities, when given the same agent.md (CLAUDE.md) file, I think GPT is better at writing code, while Claude is better at writing text during code reviews.

Looking at the bottom right, Qwen and DeepSeek are open-source, so they are largely mentioned in the context of guarding against vendor lock-in, which drives positive sentiment. Considering that Hacker News occasionally shows negative sentiment toward China, the fact that they are viewed this positively—unlike US models—shows that being open-source is a massive advantage in itself.

Anyway, one thing for sure is that Gemini is pretty much unusable.

swyx - an hour ago

> Quota exceeded for quota metric 'Read requests' and limit 'Read requests per minute per user' of service 'sheets.googleapis.com' for consumer 'project_number:849324395320'.

maybe cache this thing my guy you're just doing a bunch of reads
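A minimal sketch of the caching suggested here, assuming the page reads the sheet through one function. `fetch_sheet_rows` is a hypothetical stand-in for the real Google Sheets read, and the 300-second lifetime is an arbitrary choice, not something taken from the site.

```python
import time
from typing import Callable, List, Optional


class TTLCache:
    """Keep the result of an expensive read for a fixed number of seconds."""

    def __init__(
        self,
        loader: Callable[[], List[list]],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[List[list]] = None
        self._expires_at = 0.0

    def get(self) -> List[list]:
        now = self._clock()
        # Reload only on the first call or once the cached copy has expired.
        if self._value is None or now >= self._expires_at:
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value


def fetch_sheet_rows() -> List[list]:
    # Hypothetical placeholder: the real version would call the Sheets API.
    return [["model", "mentions"], ["example-model", 1]]


sheet_cache = TTLCache(fetch_sheet_rows, ttl_seconds=300.0)
rows = sheet_cache.get()  # first call hits the loader, later calls reuse it
```

With this shape, every page view inside the lifetime window shares one upstream read instead of issuing its own.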

---

constructive suggestions

- you have a pretty cheap process here, and HN exposes historical posts by date. perhaps worth running this back the last 2 years to reconstruct a history of sentiment?

- introduce alternative sorts around the net positive/negative sentiments and absolute positive sentiments, similar to State of JS (https://stateofjs.com) - you'll see the gpt outperformance more

- matching of Opus 4.7 and Opus Latest seems sus?
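The alternative sorts suggested above could look roughly like this. The field names are hypothetical and the counts are made-up placeholders, not numbers from the sheet.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelSentiment:
    name: str
    positive: int
    neutral: int
    negative: int

    @property
    def mentions(self) -> int:
        return self.positive + self.neutral + self.negative

    @property
    def net(self) -> int:
        # Net sentiment: positive mentions minus negative mentions.
        return self.positive - self.negative


def sort_by_mentions(rows: List[ModelSentiment]) -> List[ModelSentiment]:
    return sorted(rows, key=lambda r: r.mentions, reverse=True)


def sort_by_net(rows: List[ModelSentiment]) -> List[ModelSentiment]:
    return sorted(rows, key=lambda r: r.net, reverse=True)


def sort_by_positive(rows: List[ModelSentiment]) -> List[ModelSentiment]:
    return sorted(rows, key=lambda r: r.positive, reverse=True)


# Illustrative placeholder data only.
ROWS = [
    ModelSentiment("model-a", positive=10, neutral=5, negative=12),
    ModelSentiment("model-b", positive=8, neutral=4, negative=2),
    ModelSentiment("model-c", positive=3, neutral=1, negative=0),
]

print([r.name for r in sort_by_mentions(ROWS)])  # ['model-a', 'model-b', 'model-c']
print([r.name for r in sort_by_net(ROWS)])       # ['model-b', 'model-c', 'model-a']
```

In this toy data the most-mentioned model drops to last place once the sort key is net sentiment, which is the effect the suggestion is after.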

2ndorderthought - 10 hours ago

Interesting to see the positive sentiment around kimi2.6, qwen3.6, and deepseek relative to the negative. I hope the trend of people appreciating open models continues. They aren't household names yet, but it's a higher percentage than I thought it would be. Especially on HN where we are all talking about businesses.

I am upset because now anthropic, openai, meta, etc will continue their smear campaigns here. But I am also happy because it will make HN less useful when they do.

Everything is a give and take I guess. Excited to see where the equilibrium sits

gertlabs - 5 hours ago

This is awesome data! I've been wanting to measure how closely hype aligns to our results at https://gertlabs.com/rankings

Subjectively, it seemed like DeepSeek V4 Pro had the highest hype/performance ratio (meaning high hype for lower performance). Whereas MiMo V2.5 Pro didn't get much attention despite being the top dog in the open weights world, not even an honorable mention in your chart :( ...

misbau - an hour ago

I find using both very helpful and in most cases I have used Claude to build 70-80% of what I need and finish it off with Codex.

cheesecakegood - 7 hours ago

It's extra interesting because I think the model people should be talking about is actually not DeepSeek V4 Pro, but the Flash version. When accounting for cache hits, the input price (per OpenRouter) is effectively only 6 cents per million tokens (3 vs 14 cents hit/miss), and 28 cents on output. That's really good efficiency, and it's not a sale price like they are doing with V4 Pro, it's the normal price.
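The effective input price quoted above is a weighted average of the cache-hit and cache-miss prices. A small sketch of that arithmetic, using only the 3, 14, and 6 cent figures from the comment (the function names are my own):

```python
def blended_input_price(hit_price: float, miss_price: float, hit_rate: float) -> float:
    """Average input price when a fraction `hit_rate` of tokens are cache hits."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_price + (1.0 - hit_rate) * miss_price


def implied_hit_rate(hit_price: float, miss_price: float, blended_price: float) -> float:
    """Cache hit rate that would produce `blended_price`."""
    return (miss_price - blended_price) / (miss_price - hit_price)


# Prices in cents per million input tokens, as quoted in the comment.
rate = implied_hit_rate(hit_price=3.0, miss_price=14.0, blended_price=6.0)
print(f"{rate:.1%}")  # 72.7%
```

So the 6 cent figure corresponds to roughly 73% of input tokens being served from cache; a workload with a lower hit rate would sit closer to the 14 cent miss price.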

It's actually pretty difficult to find a good comparison model because there isn't one. Again, this is a 14/28 cent in/out model, ignoring cache, and it scores just below GPT 5.4 Mini-xhigh (75/450) and Gemini 3 Flash (50/300) in intelligence. It's similar to Gemma 4 31B in some metrics (13/38) including cost, so it's not completely unheard of, but it's pretty notable that virtually everything else in the same region in most benchmarks is going to cost at least 5 times more (much, much more in very output-heavy contexts).
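How large the price gap is depends on the input/output token mix. A rough sketch using the in/out prices quoted above (cents per million tokens, cache ignored); the workload sizes are hypothetical:

```python
from typing import Dict, Tuple

# (input, output) prices in cents per million tokens, as quoted in the comment.
PRICES: Dict[str, Tuple[float, float]] = {
    "DeepSeek V4 Flash": (14.0, 28.0),
    "GPT 5.4 Mini-xhigh": (75.0, 450.0),
    "Gemini 3 Flash": (50.0, 300.0),
    "Gemma 4 31B": (13.0, 38.0),
}


def workload_cost_cents(prices: Tuple[float, float], input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = prices
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def cost_ratios(baseline: str, input_tokens: int, output_tokens: int) -> Dict[str, float]:
    """Cost of each model relative to `baseline` for the same workload."""
    base = workload_cost_cents(PRICES[baseline], input_tokens, output_tokens)
    return {
        name: workload_cost_cents(p, input_tokens, output_tokens) / base
        for name, p in PRICES.items()
    }


# Hypothetical workload: 1M input tokens and 1M output tokens.
for name, ratio in cost_ratios("DeepSeek V4 Flash", 1_000_000, 1_000_000).items():
    print(f"{name}: {ratio:.2f}x")
```

Shifting the hypothetical workload toward output tokens widens the gap, since the output prices differ by a larger factor than the input prices.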

Jabbles - 12 hours ago

Please fix your graph so the names of the models are readable

chillfox - 9 hours ago

Surely "Claude Opus 4.7" and "Claude Opus Latest" should be the same, right?

idivett - 10 hours ago

Thanks for doing the hard work. I've bookmarked this, hoping it'll come in handy when new models are released. If you're taking feature requests, I have a few:

- Show combined measurements by model maker, e.g. all Claude models vs. OpenAI, DeepSeek, and so on.

- Another toggle to remove the neutral section?
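Both requests amount to a small aggregation step. A sketch, assuming per-model counts keyed by sentiment label; the prefix table and the sample counts are hypothetical and would need to match the real model names in the sheet.

```python
from collections import Counter, defaultdict
from typing import Dict

# Hypothetical mapping from a model-name prefix to the group it is counted under.
GROUP_PREFIXES = {
    "claude": "Claude",
    "gpt": "OpenAI",
    "deepseek": "DeepSeek",
}


def group_of(model_name: str) -> str:
    lowered = model_name.lower()
    for prefix, group in GROUP_PREFIXES.items():
        if lowered.startswith(prefix):
            return group
    return "Other"


def combine_by_group(
    rows: Dict[str, Dict[str, int]], include_neutral: bool = True
) -> Dict[str, Dict[str, int]]:
    """Sum sentiment counts per group, optionally dropping the neutral bucket."""
    totals: Dict[str, Counter] = defaultdict(Counter)
    for model, counts in rows.items():
        for label, n in counts.items():
            if label == "neutral" and not include_neutral:
                continue
            totals[group_of(model)][label] += n
    return {group: dict(c) for group, c in totals.items()}


# Illustrative placeholder counts only.
SAMPLE = {
    "Claude Opus 4.7": {"positive": 2, "neutral": 1, "negative": 3},
    "Claude Opus 4.6": {"positive": 1, "neutral": 0, "negative": 0},
    "GPT-5.5": {"positive": 4, "neutral": 2, "negative": 1},
}

print(combine_by_group(SAMPLE))
print(combine_by_group(SAMPLE, include_neutral=False))
```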

brooksc - 11 hours ago

It'd be interesting to also graph this over time to see how sentiment changes from when a model is released to today.

gobdovan - 10 hours ago

Before harnesses, I'd fix the methodology/claims. A saner methodology would be to see comments that compare two models, say 'gpt5.5>opus4.7' and infer context ('ctx:frontend', for example). For your current methodology, 'opus 4.6 was very smart, opus4.7 is a disappointing upgrade to 4.6' would make normal aspect-based sentiment analysis consider 4.6 is smarter than 4.6. But considering you have <300 mentions total, probably you'd be better off scraping some other websites as well. I'd also take out completely the SotA claim and downgrade the mentions to measuring something like visibility rather than performance.
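One way to picture the pairwise approach described above: once comparisons have been extracted (by an LLM or by hand), they reduce to a win/loss tally per context. This sketch covers only the tally, not the extraction, and the example records are hand-written for illustration rather than taken from real comments.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Comparison(NamedTuple):
    winner: str
    loser: str
    context: str


def tally(comparisons: Iterable[Comparison]) -> Dict[Tuple[str, str], List[int]]:
    """Return {(context, model): [wins, losses]}."""
    record: Dict[Tuple[str, str], List[int]] = defaultdict(lambda: [0, 0])
    for c in comparisons:
        record[(c.context, c.winner)][0] += 1
        record[(c.context, c.loser)][1] += 1
    return dict(record)


def win_rate(record: Dict[Tuple[str, str], List[int]], context: str, model: str) -> float:
    wins, losses = record.get((context, model), [0, 0])
    total = wins + losses
    return wins / total if total else float("nan")


# Hand-written examples: 'gpt5.5>opus4.7' in a frontend context, and the
# "disappointing upgrade" sentence read as opus4.6 > opus4.7.
EXAMPLES = [
    Comparison("gpt5.5", "opus4.7", "frontend"),
    Comparison("opus4.7", "gpt5.5", "frontend"),
    Comparison("opus4.6", "opus4.7", "general"),
]

record = tally(EXAMPLES)
print(win_rate(record, "frontend", "gpt5.5"))  # 0.5
print(win_rate(record, "general", "opus4.6"))  # 1.0
```

With only a handful of comparisons per pair the rates are very noisy, which fits the point about the small number of total mentions.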

skeptrune - 5 hours ago

What a win it is for open source that qwen and kimi show up on this at all.

yakkomajuri - 12 hours ago

"Prompts an LLM" -> which LLM?

I saw you're using Gemini for the sentiment rating (which I guess you picked because it's not often mentioned and thus "neutral"? lol)

But would be interesting to get more details overall

nonameiguess - an hour ago

Something that has been interesting to me for my entire life was the geek/jock cultural split in the US that seemed to crescendo in the 80s with the rise of popular nerd films and then the 90s when software started taking over the world. Being a pretty athletic kid who lettered in four sports, won a state championship, but also won math tournaments and spelling bees, it felt artificial to me. Plenty of high-level athletes have always been into video games, anime, and comic books, and are just as smart as people who can't run without tripping themselves and never learned to throw or catch any kind of ball.

Now it seems like it's come full circle from the other direction, too. We always had fandom elements in computing nerd culture. Editor wars. Language wars. Framework wars. Now that software tooling has become nearly human-like, mercurial, unpredictable, inconsistent in performance and experience from week to week, software developers have turned into sports scouts and ESPN talking heads, going so far as to make continually updating live power rankings the way commentators try to predict in season which team is looking most like they'll win the championship that year. You're in the position talent evaluators were in roughly the late 90s, relying mostly on eye test and rough proxy measures of raw potential. Simon Willison applies the pelican test the way draft combines put athletes through shuttle drills and test vertical leap to try and predict how well they'll do in real gameplay.

It leaves me wondering when we'll have the Bill James style analytics breakthrough in software talent evaluation or if such a thing is even possible. At least with athletes, practice can make them better and injury and age can make them worse, but you can't just silently swap out an entirely different mind and body under the same name and face. You guys are trying to assess the performance of constantly moving targets that can and do change capabilities and characteristics on a daily basis.

jesse_dot_id - 7 hours ago

I suspect companies are deploying bots to shift sentiment around their products. I find metrics like this to be largely useless vs. actually just trying stuff out.

jatins - 3 hours ago

So no one's using Gemini on HN?

input_sh - an hour ago

Terrible metric that tells absolutely nothing about what's state-of-the-art. You might as well call this list the most astroturfed models on HN.

pbgcp2026 - 11 hours ago

So, it's a webpage with 3 paragraphs and a simple chart. It has: 1) terrible color scheme – fine, I switch to reader mode 2) shitloads of JS - fine, NoScript works, page breaks 3) Fancy "design" with simple graph but unreadable X axis labels - fine, I can use screen zoom for that ... to see 3x "Claude O..." LOL are we playing guess-me-over game? 4) ... "LxxxLxxx - Learn languages with YouTube!"

Hari2028 - 9 hours ago

How noisy is the sentiment classification? Feels like that could skew results a lot

julianlam - 6 hours ago

Interesting that Gemma 4 didn't crack the top 10.

I've been experimenting with the 26B-A4B model with some surprisingly good results (both in inference speed and code quality — 15 tok/s, flying along!), vs my last few experiments with Devstral 24B. Not sure whether I can fit that 35B Qwen model everybody's so keen on, on my 32GB unified RAM.

However I think I may be in the minority of HN commenters exploring models for local inference.
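For the question of whether a 35B model fits in 32 GB, a back-of-the-envelope estimate of the weight size is parameters times bits per weight divided by 8. This sketch counts weights only; the quantization levels and the 8 GB reserve for the OS, KV cache, and runtime are assumptions, not measurements.

```python
def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(
    params_billions: float,
    bits_per_weight: float,
    total_ram_gb: float,
    reserve_gb: float = 8.0,  # assumed headroom for OS, KV cache, and runtime
) -> bool:
    return weight_size_gb(params_billions, bits_per_weight) <= total_ram_gb - reserve_gb


for bits in (4, 8, 16):
    size = weight_size_gb(35, bits)
    print(f"35B at {bits}-bit: {size:.1f} GB, fits in 32 GB: {fits(35, bits, 32)}")
```

Under these assumptions a 4-bit quantization of a 35B model comes to about 17.5 GB of weights, while 8-bit already exceeds 32 GB; real usage also depends on context length and the inference runtime.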

ranger_danger - 12 hours ago

Just FYI this article seems to define "state of the art" as "popular", as measured by "total mentions and user sentiment", without any bearing on the technical abilities or actual usage of the model.

tokkkie - 6 hours ago

more users = more complaints. negativity just means popularity.

kimi...?

Frannky - 8 hours ago

I am looking for a good alternative to Claude Code + Opus that is not Codex. I tried switching back to Opus 4.6. The attitude of 4.7 is what is more problematic: it is difficult to make it check things before answering, and it assumes it knows better than me and reality. Plus all the latest shenanigans they did. Pretty disgusted that I am still using them.
