GLM-5: From Vibe Coding to Agentic Engineering

z.ai

188 points by meetpateltech 2 hours ago


Aurornis - 2 hours ago

The benchmarks are impressive, but it's comparing against last-generation models (Opus 4.5 and GPT-5.2). The newer competitor models are recent releases, but there would easily have been enough time to re-run the benchmarks and update the press release by now.

Although it doesn't really matter much. All of the open weights models lately come with impressive benchmarks but then don't perform as well as expected in actual use. There's clearly some benchmaxxing going on.

simonw - an hour ago

Pelican generated via OpenRouter: https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f07...

Solid bird, not a great bicycle frame.

pcwelder - 2 hours ago

It's live on openrouter now.

In my personal benchmark it's bad. So far that benchmark has been a really good indicator of instruction following and agentic behaviour in general.

For those who are curious, the benchmark is just the model's ability to follow a custom tool-calling format: I ask it to do coding tasks using chat.md [1] + MCPs. And so far it's just not able to follow it at all.

[1] https://github.com/rusiaaman/chat.md

mnicky - 17 minutes ago

What I haven't seen discussed anywhere so far is how big a lead Anthropic seems to have in intelligence per output token, e.g. if you look at [1].

We already know that intelligence scales with the log of tokens used for reasoning, but Anthropic seems to have much more powerful non-reasoning models than its competitors.

I read somewhere that they have a policy of not advancing capabilities too much, so could it be that they are sandbagging and releasing models with artificially capped reasoning to be at a similar level to their competitors?

How do you read this?

[1] https://imgur.com/a/EwW9H6q

justinparus - 2 hours ago

Been using GLM-4.7 for a couple of weeks now. Anecdotally, it's comparable to Sonnet, but requires a bit more instruction and clarity to get things right. For bigger, complex changes I still use Anthropic's family, but for concise, well-defined smaller tasks the price of GLM-4.7 is hard to beat.

woeirua - 2 hours ago

It might be impressive on benchmarks, but there's just no way for them to break through the noise from the frontier models. At these prices they're just hemorrhaging money. I can't see a path forward for the smaller companies in this space.

cherryteastain - an hour ago

What is truly amazing here is that they trained this entirely on Huawei Ascend chips, per reporting [1]. Hence we can conclude the Chinese semiconductor-to-model tech stack is only 3 months behind the US, considering Opus 4.5 released in November. (Excluding the lithography equipment here, as SMIC still uses older ASML DUV machines.) This is huge, especially since just a few months ago it was reported that DeepSeek was not using Huawei chips due to technical issues [2].

US attempts to contain Chinese AI tech have totally failed. Not only that, they may have cost Nvidia trillions of dollars of exports over the next decade: the Chinese government called the American bluff and now actively disallows imports of Nvidia chips as a direct result of past sanctions [3], at a time when the Trump admin is trying to do whatever it can to reduce the US trade imbalance with China.

[1] https://tech.yahoo.com/ai/articles/chinas-ai-startup-zhipu-r...

[2] https://www.techradar.com/pro/chaos-at-deepseek-as-r2-launch...

[3] https://www.reuters.com/world/china/chinas-customs-agents-to...

seydor - 9 minutes ago

I wish China would start copying Demis' biotech models soon as well.

esafak - 2 hours ago

I got fed up with GLM-4.7 after using it for a few weeks; it was slow through z.ai and not as good as the benchmarks led me to believe (especially with regard to instruction following), but I'm willing to give it another try.

goldenarm - an hour ago

If you're tired of cross-referencing the cherry-picked benchmarks, here's the geometric mean of SWE-bench Verified & HLE-tools:

Claude Opus 4.6: 65.5%

GLM-5: 62.6%

GPT-5.2: 60.3%

Gemini 3 Pro: 59.1%
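(For anyone who wants to reproduce a combined figure like the ones above: the geometric mean of two benchmark scores is just the square root of their product. A minimal sketch, with made-up per-benchmark scores since the comment only gives the combined numbers:)

```python
import math

def geomean(*scores):
    """Geometric mean of benchmark scores given as fractions in [0, 1]."""
    return math.prod(scores) ** (1 / len(scores))

# Hypothetical pair of per-benchmark scores for illustration only:
# e.g. 70.0% on SWE-bench Verified and 61.3% on HLE-tools.
combined = geomean(0.700, 0.613)
print(f"{combined:.1%}")  # prints 65.5%
```

The geometric mean penalizes a model that is lopsided across the two benchmarks more than an arithmetic mean would, which is presumably why it was chosen here.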

karolist - an hour ago

The number of times in the past year that a competitor's benchmarks claimed something was close to Claude and it turned out even remotely close in practice: 0

pu_pe - 2 hours ago

Really impressive benchmarks. It was commonly stated that open-source models were lagging 6 months behind the state of the art, but they are likely even closer now.

jnd0 - 2 hours ago

Probably related: https://news.ycombinator.com/item?id=46974853

dana321 - 5 minutes ago

Just tried it; it's practically the same as GLM-4.7. It isn't as "wide" as Claude or Codex, so even on a simple prompt it misses an important detail: instead of investigating fully before starting a project, it ploughs ahead with the next-best thing it thinks you asked for.

nullbyte - an hour ago

GLM 5 beats Kimi on SWE bench and Terminal bench. If it's anywhere near Kimi in price, this looks great.

Edit: Input tokens are twice as expensive. That might be a deal breaker.

mohas - an hour ago

I kinda feel this benchmarking thing with Chinese models is like university Olympiads: they specifically study for those, but when the time comes for real-world work they seriously lag behind.

algorithm314 - 2 hours ago

Here is the pricing per M tokens. https://docs.z.ai/guides/overview/pricing

Why is GLM 5 more expensive than GLM 4.7 even when using sparse attention?

There is also a GLM-5-Code model.

beAroundHere - 2 hours ago

I'd say that they're super confident about the GLM-5 release, since they're directly comparing it with Opus 4.5 and don't mention Sonnet 4.5 at all.

I'm still waiting to see if they'll launch a GLM-5 Air series, which would run on consumer hardware.

meffmadd - 2 hours ago

It will be tough to run on our 4x H200 node… I wish they had stayed around the 350B range. MLA will reduce KV cache usage, but I don't think the reduction will be significant enough.
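(For context on the KV cache point: with standard attention, the cache for one sequence is roughly 2 (K and V) × layers × KV heads × head dim × sequence length × bytes per element. A back-of-the-envelope sketch with hypothetical dimensions, not GLM-5's actual architecture, which isn't stated here:)

```python
def kv_cache_gib(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Rough KV-cache size for one sequence: K and V tensors per layer,
    fp16/bf16 elements by default."""
    total_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
    return total_bytes / 2**30

# Hypothetical dimensions for illustration only:
# 92 layers, 8 KV heads (GQA), head_dim 128, a 128k-token context.
print(f"{kv_cache_gib(92, 8, 128, 128_000):.1f} GiB per sequence")  # 44.9 GiB
```

MLA shrinks this by caching a compressed latent per token instead of full K/V heads, which is why the reduction matters at long context, even if, as the commenter suspects, it may not be enough at this parameter count.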

ExpertAdvisor01 - 2 hours ago

They increased their prices substantially

eugene3306 - 2 hours ago

Why don't they publish ARC-AGI results? Too expensive?

surrTurr - an hour ago

we're seeing so many LLM releases that they can't even keep their benchmark comparisons updated

woah - 2 hours ago

Is this a lot cheaper to run (on their service or rented GPUs) than Claude or ChatGPT?

ChrisArchitect - an hour ago

Earlier: https://news.ycombinator.com/item?id=46974853

petetnt - 2 hours ago

Whoa, I think GPT-5.3-Codex was a disappointment, but GLM-5 is definitely the future!
