GLM-4.7: Advancing the Coding Capability
z.ai
414 points by pretext a day ago
My quickie: MoE model heavily optimized for coding agents, complex reasoning, and tool use. 358B total/32B active. vLLM/SGLang support is only on the main branches of those engines, not the stable releases. Supports tool calling in OpenAI-style format. Multilingual, English/Chinese primary. Context window: 200k. Claims Claude 3.5 Sonnet/GPT-5 level performance. 716GB in FP16, probably ca. 220GB for Q4_K_M.
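A rough back-of-the-envelope check on those sizes (a sketch only; real GGUF files vary because some tensors are kept at higher precision):

    # Back-of-the-envelope sizing for a 358B-parameter model.
    # Q4_K_M is not a flat 4 bits/param; ~4.8 bits/param is a common rule of thumb.
    params = 358e9
    fp16_gb = params * 2 / 1e9         # 2 bytes per parameter   -> ~716 GB
    q4km_gb = params * 4.8 / 8 / 1e9   # ~4.8 bits per parameter -> ~215 GB
    print(f"FP16: ~{fp16_gb:.0f} GB, Q4_K_M: ~{q4km_gb:.0f} GB")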
My most important takeaway is that, in theory, I could get a "relatively" cheap Mac Studio and run this locally, and get usable coding assistance without being dependent on any of the large LLM providers. Maybe utilizing Kimi K2 in addition. I like that open-weight models are nipping at the heels of the proprietary models.
I bought a second‑hand Mac Studio Ultra M1 with 128 GB of RAM, intending to run an LLM locally for coding. Unfortunately, it's just way too slow.
For instance, a 4-bit quantized model of GLM 4.6 runs very slowly on my Mac. It's not only the tokens-per-second speed but also input processing, tokenization, and prompt loading; it takes so much time that it tests my patience. People often mention the TPS numbers, but they neglect to mention the input loading times.
At 4 bits that model won't fit into 128GB, so you're spilling over into swap, which kills performance. I've gotten great results out of GLM-4.5-Air, which is 4.5 distilled down to 110B params and fits nicely at 8 bits, or maybe 6 if you want a little more RAM left over.
Correction: my GLM-4.6 models are not Q4; I can only run lower quants, e.g.:
- https://huggingface.co/unsloth/GLM-4.6-GGUF/blob/main/GLM-4.... - 84GB, Q1
- https://huggingface.co/unsloth/GLM-4.6-REAP-268B-A32B-GGUF/t... - 92GB, Q2
I make sure there is enough RAM left over (i.e. a limited context window setting), so there's no swapping.
As for GLM-4.5-Air, I run that daily, switching between noctrex/GLM-4.5-Air-REAP-82B-A12B-MXFP4_MOE-GGUF and kldzj/gpt-oss-120b-heretic
Are you getting any agentic use out of gpt-oss-120b?
I can't tell if it's some bug regarding message formats or if it's just genuinely giving up, but it failed to complete most tasks I gave it.
GPT-oss-120B was also completely failing for me, until someone on Reddit pointed out that you need to pass the reasoning tokens back in when generating a response. One way to do this is described here:
https://openrouter.ai/docs/guides/best-practices/reasoning-t...
Once I did that it started functioning extremely well, and it's the main model I use for my homemade agents.
Many LLM libraries/services/frontends don't pass these reasoning tokens back to the model correctly, which is why people complain about this model so much. It also highlights the importance of rolling these things yourself and understanding what's going on under the hood, because there are so many broken implementations floating around.
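For anyone hitting the same wall, here's a minimal sketch of what passing the reasoning tokens back looks like. It assumes an OpenAI-compatible endpoint that returns the chain of thought as a "reasoning" field on the assistant message (OpenRouter does; other servers and frontends may name it differently), and TOOLS/run_tool stand in for your own tool schemas and dispatcher:

    # Sketch: a tool-calling loop that keeps reasoning tokens in the history.
    # Endpoint URL, model name, TOOLS and run_tool are placeholders.
    import requests

    URL = "http://localhost:8000/v1/chat/completions"
    TOOLS = []                               # your OpenAI-style tool schemas
    def run_tool(tc): return "tool output"   # your tool dispatcher goes here

    messages = [{"role": "user", "content": "Summarise the failing tests."}]
    while True:
        msg = requests.post(URL, json={
            "model": "gpt-oss-120b", "messages": messages, "tools": TOOLS,
        }).json()["choices"][0]["message"]

        # The fix: keep the reasoning in the assistant turn instead of dropping it.
        turn = {"role": "assistant", "content": msg.get("content")}
        if msg.get("reasoning"):
            turn["reasoning"] = msg["reasoning"]
        if msg.get("tool_calls"):
            turn["tool_calls"] = msg["tool_calls"]
        messages.append(turn)

        if not msg.get("tool_calls"):
            break            # final answer reached
        for tc in msg["tool_calls"]:
            messages.append({"role": "tool", "tool_call_id": tc["id"],
                             "content": run_tool(tc)})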
I've been running the 'frontier' open-weight LLMs (mainly deepseek r1/v3) at home, and I find that they're best for asynchronous interactions. Give it a prompt and come back in 30-45 minutes to read the response. I've been running on a dual-socket 36-core Xeon with 768GB of RAM and it typically gets 1-2 tokens/sec. Great for research questions or coding prompts, not great for text auto-complete while programming.
Given the cost of the system, how long would it take to be less expensive than, for example, a $200/mo Claude Max subscription with Opus running?
It's not really an apples-to-apples comparison - I enjoy playing around with LLMs, running different models, etc., and I place a relatively high premium on privacy. The computer itself was $2k about two years ago (and my employer reimbursed me for it), and 99% of my usage is research questions with relatively high output per input token. Using one as a coding assistant seems to burn through a very high number of tokens, with relatively few of them actually being used for anything. If I wanted a real-time coding assistant, I would probably be using something that fits in the 24GB of VRAM and would have very different cost/performance tradeoffs.
For what it is worth, I do the same thing you do with local models: I have a few scripts that build prompts from my directions and the contents of one or more local source files. I start a local run and get some exercise, then return later for the results.
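A minimal sketch of that kind of script, in case it's useful (the paths, port, and model name are just placeholders, not my actual setup):

    # Sketch: build a prompt from directions + local source files, send it to a
    # local OpenAI-compatible server, and save the answer to read later.
    import sys, pathlib, requests

    instructions = sys.argv[1]                       # e.g. "refactor the parser"
    files = [pathlib.Path(p) for p in sys.argv[2:]]  # source files to include

    prompt = instructions + "\n\n"
    for f in files:
        prompt += f"--- {f} ---\n{f.read_text()}\n\n"

    resp = requests.post("http://localhost:8080/v1/chat/completions", json={
        "model": "glm-4.5-air",
        "messages": [{"role": "user", "content": prompt}],
    }, timeout=None).json()

    pathlib.Path("answer.md").write_text(resp["choices"][0]["message"]["content"])
    print("done - read answer.md when you get back")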
I own my computer, it is energy efficient Apple Silicon, and it is fun and feels good to do practical work in a local environment and be able to switch to commercial APIs for more capable models and much faster inference when I am in a hurry or need better models.
Off topic, but: I cringe when I see social media posts of people running many simultaneous agentic coding systems and spending a fortune in money and environmental energy costs. Maybe I just have ancient memories from using assembler language 50 years ago to get maximum value from hardware, but I still believe in getting maximum utilization from hardware and in wanting to be at least the ‘majority partner’ in AI-agentic-enhanced coding sessions: save tokens by thinking more on my own and being more precise in what I ask for.
Tokens will cost the same on a Mac as on an API because electricity is not free.
And you can only generate like $20 of tokens a month.
Cloud tokens made on TPUs will always be cheaper and waaay faster than anything you can make at home.
This generally isn't true. Cloud vendors have to make back the cost of electricity and the cost of the GPUs. If you already bought the Mac for other purposes, also using it for LLM generation means your marginal cost is just the electricity.
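Rough numbers to illustrate the marginal-cost point (every figure below is an assumption, just to show the arithmetic):

    # Illustrative electricity cost per million output tokens on a Mac.
    watts       = 150    # assumed draw under load
    tok_per_sec = 15     # assumed decode speed for a large MoE model
    kwh_price   = 0.30   # assumed $/kWh

    tokens_per_kwh = tok_per_sec * 3600 / (watts / 1000)
    cost_per_mtok  = 1e6 / tokens_per_kwh * kwh_price
    print(f"~${cost_per_mtok:.2f} of electricity per 1M output tokens")  # ~$0.83 here

With numbers in that ballpark, electricity alone is far below what APIs charge per million output tokens; the real costs are the hardware itself and the speed penalty.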
Also, vendors need to make a profit! So tack a little extra on as well.
However, you're right that it will be much slower. Even just an 8xH100 can do 100+ tps for GLM-4.7 at FP8; no Mac can get anywhere close to that decode speed. And for long prompts (which are compute constrained) the difference will be even more stark.
Never; local models are for hobby use and (extreme) privacy concerns.
A less paranoid and much more economically efficient approach would be to just lease a server and run the models on that.
I was able to run a batch job that amounted to ~2 weeks of inference time on my M4 Max by running it overnight against a large dataset I wanted to mine. It cost me pennies in electricity plus a simple Python script as a scheduler.
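Something in the spirit of that scheduler, for anyone who wants to try it (the dataset format, hours, and model name are assumptions):

    # Sketch: overnight batch job against a local server, with checkpointing so
    # it can be resumed the next night.
    import json, time, pathlib, datetime, requests

    DATASET = pathlib.Path("dataset.jsonl")   # one {"id": ..., "prompt": ...} per line
    OUT = pathlib.Path("results.jsonl")
    done = {json.loads(l)["id"] for l in OUT.read_text().splitlines()} if OUT.exists() else set()

    def night():
        h = datetime.datetime.now().hour
        return h >= 23 or h < 7               # only run between 23:00 and 07:00

    with OUT.open("a") as out:
        for line in DATASET.read_text().splitlines():
            row = json.loads(line)
            if row["id"] in done:
                continue
            while not night():
                time.sleep(300)
            resp = requests.post("http://localhost:8080/v1/chat/completions", json={
                "model": "glm-4.5-air",
                "messages": [{"role": "user", "content": row["prompt"]}],
            }).json()
            out.write(json.dumps({"id": row["id"],
                                  "answer": resp["choices"][0]["message"]["content"]}) + "\n")
            out.flush()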
This.
I spent quite some time on r/LocalLLaMA and have yet to see a convincing "success story" of productively using local models to replace GPT/Claude etc.
I have several little success stories of my own:
- For polishing Whisper speech-to-text output, so I can dictate things to my computer and get coherent sentences, or for shaping the dictation into a specific format, e.g. "generate ffmpeg to convert mp4 video to flac with fade in and out, input file is myvideo.mp4 output is myaudio flac with pascal case" -> Whisper -> "generate ff mpeg to convert mp4 video to flak with fade in and out input file is my video mp4 output is my audio flak with pascal case" -> Local LLM -> "ffmpeg ..." (a sketch of this pipeline is below)
- Doing classification / selection type work, e.g. classifying business leads based on their profile
Basically the win for local LLMs is that the running cost (in my case, a second-hand M1 Ultra) is so low that I can run a large quantity of calls that don't need frontier models.
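To make that first bullet concrete, here's a sketch of the dictation pipeline (model names and the server URL are assumptions; it uses the openai-whisper package for transcription):

    # Sketch: Whisper transcribes the dictation, a local LLM turns the rough
    # transcript into the command or sentence that was actually intended.
    import whisper, requests

    transcript = whisper.load_model("base").transcribe("dictation.wav")["text"]
    # e.g. "generate ff mpeg to convert mp4 video to flak with fade in and out ..."

    resp = requests.post("http://localhost:8080/v1/chat/completions", json={
        "model": "glm-4.5-air",
        "messages": [
            {"role": "system", "content": "Rewrite the dictated text as the exact "
                                          "command or sentence the speaker intended. "
                                          "Output only the result."},
            {"role": "user", "content": transcript},
        ],
    }).json()
    print(resp["choices"][0]["message"]["content"])  # e.g. "ffmpeg -i MyVideo.mp4 ... MyAudio.flac"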
My comment was not very clear. I specifically meant Claude Code/Codex-like workflows where the agent generates/runs code interactively with user feedback. My impression is that consumer-grade hardware is still too slow for these things to work.
You are right, consumer-grade hardware is mostly too slow... although it's a relative thing, right? For instance, you can get a Mac Studio Mx Ultra with 512GB RAM, run GLM-4.5-Air, and have a bit of patience. It could work.
It doesn't matter if you spend $200, $20,000, or $200,000 a month on an Anthropic Subscription.
None of them will keep your data truly private and offline.
Yes, they conveniently forget to disclose prompt processing time. There is an affordable answer to this; the design and software will be open-sourced soon.
Need the M5 (Max/Ultra next year) with its MATMUL instruction set that massively speeds up prompt processing.
> Supports tool calling in OpenAI-style format
So Harmony? Or something older? Since Z.ai also claims the thinking mode does tool calling and reasoning interwoven, it would make sense if it was straight-up OpenAI's Harmony.
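For reference, "OpenAI-style" usually means the classic function/tool-calling schema, where tools are declared as JSON Schema and calls come back as structured tool_calls on the assistant message; whether GLM-4.7 expects exactly this or Harmony's interleaved format is the open question. An illustrative request shape:

    # The classic OpenAI-style tool-calling request shape (illustrative only).
    request = {
        "model": "glm-4.7",
        "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }
    # Typical reply: {"role": "assistant", "tool_calls": [{"id": "call_1",
    #   "type": "function", "function": {"name": "get_weather",
    #   "arguments": "{\"city\": \"Berlin\"}"}}]}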
> in theory, I could get a "relatively" cheap Mac Studio and run this locally
In practice, it'll be incredibly slow and you'll quickly regret spending that much money on it instead of just using paid APIs until proper hardware gets cheaper / models get smaller.
> In practice, it'll be incredibly slow and you'll quickly regret spending that much money on it instead of just using paid APIs until proper hardware gets cheaper / models get smaller.
Yes, as someone who spent several thousand $ on a multi-GPU setup, the only reason to run local codegen inference right now is privacy or deep integration with the model itself.
It’s decidedly more cost efficient to use frontier model APIs. Frontier models trained to work with their tightly-coupled harnesses are worlds ahead of quantized models with generic harnesses.
Yeah, I think without a setup that costs 10k+ you can't even get remotely close in performance to something like claude code with opus 4.5.
10k wouldn't even get you 1/4 of the way there. You couldn't even run this or DeepSeek 3.2 etc for that.
Esp with RAM prices now spiking.
$10k gets you a Mac Studio with 512GB of RAM, which definitely can run GLM-4.7 with normal, production-grade levels of quantization (in contrast to the extreme quantization that some people talk about).
The point in this thread is that it would likely be too slow due to prompt processing. (M5 Ultra might fix this with the GPU's new neural accelerators.)
> $10k gets you a Mac Studio with 512GB of RAM, which definitely can run GLM-4.7 with normal, production-grade levels of quantization (in contrast to the extreme quantization that some people talk about).
Please do give that a try and report back the prefill and decode speed. Unfortunately, I think again that what I wrote earlier will apply:
> In practice, it'll be incredibly slow and you'll quickly regret spending that much money on it
I'd rather place that 10K on an RTX Pro 6000 if I were choosing between them.
> Please do give that a try and report back the prefill and decode speed.
M4 Max here w/ 128GB RAM. Can confirm this is the bottleneck.
I weighed up a DGX Spark but thought the M4 would be competitive with equal RAM. Not so much.
I think the DGX Spark will likely underperform the M4 from what I've read.
However, it will be better for training / fine-tuning type workflows.