Qwen3.6-Max-Preview: Smarter, Sharper, Still Evolving
qwen.ai
606 points by mfiguiere 17 hours ago
Ok I find it funny that people compare models and say Opus 4.7 is SOTA and much better etc., but I have used GLM 5.1 (which I assume comes from them training on both Opus and Codex) for things Opus couldn't do and have seen it write better code. I haven't tried the Qwen Max series, but I have seen the local 122B model do smarter, more correct things based on docs than Opus. So yes, benchmarks are one thing, but reality is what the models actually do, and you should learn the real strengths that models possess. It is a tool in the end; you shouldn't say a hammer is better than a wrench even though both could drive a nail into a piece of wood.
GLM 5.1 was the model that made me feel like the Chinese models had truly caught up. I cancelled my Claude Max subscription and genuinely have not missed it at all.
Some people seem to agree and some don't, but I think that indicates we're just down to your specific domain and usage patterns rather than the SOTA models being objectively better like they clearly used to be.
It seems like people can't even agree which SOTA model is best at any given moment anymore, so yeah I think it's just subjective at this point.
Perhaps not even necessarily subjective, just performance is highly task-dependent and even variable within tasks. People get objectively different experiences, and assume one or another is better, but it's basically random.
Unless you're looking at something like a pass@100 benchmark, the benchmarks are heavily confounded by the likelihood of a "golden path" retrieval within the model's capabilities. This is on top of uncertainties like how well your task within a domain maps to the relevant test sets, as well as factors like context fullness and context complexity (a heavy list of relevant, complex instructions can weigh on capabilities in different ways than, e.g., a history where prior unrelated tasks are still in context).
The best tests are your own custom, personal-task-relevant standardized tests (ones the best models can't saturate, so aim for less than a 70% pass rate even in the best case).
All this is to say that most people are not doing the latter and their vibes are heavily confounded to the point of being mostly meaningless.
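A "personal standardized test" like the one described above can be as small as a script: a handful of your own prompts, each with a pass/fail check, run several times per model. A minimal sketch, where all the names and the fake model stub are hypothetical stand-ins for your own client and tasks:

```python
# Tiny personal eval harness sketch. "model_call" would be your real API
# client; here a deterministic fake model keeps the example runnable.

def run_eval(model_call, tasks, passes=5):
    """Run each task several times and report a per-task pass rate.

    model_call: function taking a prompt string, returning an answer string.
    tasks: dict of name -> (prompt, check), where check(answer) -> bool.
    """
    results = {}
    for name, (prompt, check) in tasks.items():
        wins = sum(1 for _ in range(passes) if check(model_call(prompt)))
        results[name] = wins / passes
    return results

# Toy usage with a fake "model" so the sketch runs end to end:
fake_model = lambda prompt: "4" if "2+2" in prompt else "unsure"
tasks = {
    "arith": ("What is 2+2? Answer with the number only.",
              lambda a: a.strip() == "4"),
    "capital": ("Capital of France?", lambda a: "paris" in a.lower()),
}
scores = run_eval(fake_model, tasks, passes=3)
# scores -> {"arith": 1.0, "capital": 0.0}
```

With a real model client the checks stay the same; the per-task pass rates are what let you compare providers on your own work instead of vibes.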
>just performance is highly task-dependent and even variable within tasks. People get objectively different experiences, and assume one or another is better, but it's basically random.
You are right that this is not exactly subjectivity, but I think for most people it feels like it. We don't have good benchmarks (imo), we read a lot about other people's experiences, and we have our own. I think certain models are going to be objectively better at certain tasks, it's just our ability to know which currently is impaired.
And the subjectivity is bidirectional.
People judge models on their outputs, but how you like to prompt has a tremendous impact on those outputs and explains why people have wildly different experiences with the same model.
AI is a complete commodity
One model can replace another at any given moment in time.
It's NOT a winner-takes-all industry
and hence none of the lofty valuations make sense.
the AI bubble burst will be epic and make us all poorer. Yay
The value in Claude Code is its harness. I've tried the desktop app and found it was absolutely terrible in comparison. Like, the very nature of it being a separate codebase is already enough to completely throw off its performance compared to the CLI. Nuts.
> The value in Claude Code is its harness
If this was the case then Anthropic would be in a very bad spot.
It's not, which is why people got so mad about being forced to use it rather than better third party harnesses.
Pi is better than CC as a harness in almost every respect.
Anthropic limiting Claude subs to Claude code is what pushed me away in the end because I wanted to keep using Pi.
Just sign up for an AWS account and use the Anthropic models through Bedrock which Pi can use.
Can you enumerate why?
- Claude Code has repeatedly had enormous token wastage bugs. Its agent interactions are also inefficient. These are the cause of many of the reports of "single prompt blew through 5-hour quota" even though it's a reasonable prompt.
- It still lacks support for industry standards such as AGENTS.md
- Extremely limited customization
- Lots of bugs including often making it impossible to view pre-compaction messages inside Claude Code.
- Obvious one: can't easily switch between Claude and non-Claude models
- Resource usage
More than anything, I haven't found a single thing that Pi does worse. All of it is just straight up better or the same.
What is your workflow? Do you use Cursor or another tool for code Gen?
I use Opencode, both directly and through Discord via a little bridge called Kimaki.
I have been using GLM-5.1 with pi.dev through Ollama Cloud for my personal projects and I am very happy with this setup. I use pi.dev with Claude Sonnet/Opus 4.6 at work. Claude Code is great, but the latest update has me compacting so much more frequently that I could not stand it. I don't miss MCP tool calling when I am using pi.dev; it uses APIs just fine. I actually think GLM-5.1 builds better websites than Claude Opus. For my personal projects I am building a full-stack development platform and GLM-5.1 is doing a fantastic job.
I'm using pi the same as you. However, I have an MCP I need to use, and the popular extension for MCP support works fine for me.
Really liking pi and glm 5.1!
The only reason I'm stuck with Claude and ChatGPT is their tool calling. They do have some pretty useful features like skills etc. I've tried using Qwen and DeepSeek but they can't even output documents. How are you guys handling documents and Excel files with these tools? I'd love to switch tbh.
> I've tried using qwen and deepseek but they can't even output documents
What agent harness did you use? Usually, "write_file", "shell_exec" or similar are two of the first tools you add to an agent harness, after read_file/list_files. If it doesn't have those tools, I'm not sure you could even call it an agent harness in the first place.
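For illustration, here is a minimal sketch of what those file/shell tools look like inside a harness. The names (write_file, shell_exec, etc.) follow the comment above; real harnesses differ in schema details and sandboxing:

```python
# Minimal file/shell tool set a harness typically exposes to the model.
# The agent loop just dispatches the model's tool calls by name.
import os
import subprocess

def list_files(path="."):
    return sorted(os.listdir(path))

def read_file(path):
    with open(path) as f:
        return f.read()

def write_file(path, content):
    with open(path, "w") as f:
        f.write(content)
    return f"wrote {len(content)} bytes to {path}"

def shell_exec(cmd, timeout=30):
    # Runs a shell command and returns combined stdout/stderr.
    out = subprocess.run(cmd, shell=True, capture_output=True,
                         text=True, timeout=timeout)
    return out.stdout + out.stderr

TOOLS = {"list_files": list_files, "read_file": read_file,
         "write_file": write_file, "shell_exec": shell_exec}

def dispatch(tool_name, **kwargs):
    """What the harness calls when the model emits a tool invocation."""
    return TOOLS[tool_name](**kwargs)
```

Everything else in a harness (the model loop, the tool-call JSON schema, permission prompts) is layered on top of a dispatch like this.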
Sorry for the confusion, I was actually talking about their web-based chat. Since most of my work is governance and docs, I just use their web chats, and they just refuse to output proper documents like Claude or ChatGPT do.
Aha... Well, I let Codex (Claude Code would work too) manage/troubleshoot .xlsx files, and it seems to handle them just fine (it tends to un-archive them and browse the resulting XML files without issues). I've seen it do similar stuff for .app and .docx files too, so maybe give that a try with other harnesses/models as well; they might get it :)
Yeah, it's just way easier to do via the web/mobile app but I'll give using it via the CLI a try. Thanks :)
You're not giving an AI command line access to your work computer? How do you expect to keep up? /s
You give it command line access in a VM...
I give it real Ubuntu, no VM, no Docker. As long as I don't ask it to organize files, it behaves. It has not screwed me over so far.
You mean a VM like the one that contains a 0day that can escape the sandbox that gets found every year at pwn2own?
Presumably you’re also using a browser to view this web page. There have also been vulnerabilities in that. You have to draw a line somewhere.
Maybe, but the sort of 0days you're talking about aren't exploited in any meaningful way for almost all developers.
"Seatbelts don't save the life of everyone who gets into an accident, so why bother wearing one?"
You can make a harness fully functional with just the "shell_exec" tool if you give it access to a linux/unix environment + playwright cli.
When was the last time you used Qwen models? Their 3.5 and 3.6 models are excellent with tool calling.
I gave it a try a few weeks ago tbh, I'll give it another shot tho. I mainly use their Web chats since that's easier to use and previously, qwen, deepseek, kimi, all were unable to output proper docx files or use skills.
Try loading the models up in a coding harness like Claude Code. There's a few docx skills listed on Vercel's skill index.
Outputting docx files does not have much to do with model capability. It is about whether tool calling has been configured.
You can use GLM-5.1 with Claude Code directly. I use ccs, with GLM-5.1 set up as the plan model, but it goes via API key.
I've been using qwen-code (the software, not to be confused with Qwen Code the service or Qwen Coder the model) which is a fork of gemini-cli and the tool use with Qwen models at least has been great.
You can use both codex and Claude CLI with local models. I used codex with Gemma4 and it did pretty well. I did get one weird session where the model got confused and couldn't decide which tools actually existed in its inventory, but usually it could use tools just fine.
I wonder why glm is viewed so positively.
Every time I try to build something with it, the output is worse than other models I use (Gemini, Claude), it takes longer to reach an answer and plenty of times it gets stuck in a loop.
I've been running Opus and GLM side-by side for a couple weeks now, and I've been impressed with GLM. I will absolutely agree that it's slow, but if you let it cook, it can be really impressive and absolutely on the level of Opus. Keep in mind, I don't really use AI to build entire services, I'm mostly using it to make small changes or help me find bugs, so the slowness doesn't bother me. Maybe if I set it to make a whole web app and it took 2 days, that would be different.
The big kicker for GLM for me is I can use it in Pi, or whatever harness I like. Even if it was _slightly_ below Opus, and even though it's slower, I prefer it. Maybe Mythos will change everything, but who knows.
> The big kicker for GLM for me is I can use it in Pi, or whatever harness I like.
Yes, but... isn't the same true for Opus and all the other models too?
Opus is about 7 times more expensive than GLM with API pricing. And since you can only use the Opus subscription plan in CC, you're essentially locked into API pricing for Pi and any other harness.
So you're either paying $1000's for Opus in Pi, or $30/month for GLM in Pi. If the results are mostly equivalent that's an easy choice for most of us.
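Back-of-envelope, with hypothetical numbers: the per-token prices and usage figure below are made up for illustration; only the "~7x" ratio and the $30 plan come from the thread.

```python
# HYPOTHETICAL prices, for the shape of the argument only.
opus_per_mtok = 75.0              # assumed $/1M tokens for Opus via API
glm_per_mtok = opus_per_mtok / 7  # "about 7 times" cheaper, per the comment

monthly_mtok = 50                 # assumed heavy agentic usage: 50M tokens/mo

opus_api_cost = opus_per_mtok * monthly_mtok   # 3750.0, i.e. "paying $1000's"
glm_plan_cost = 30.0                           # flat plan from the comment

# API pricing scales with usage while a flat plan does not, which is how a
# ~7x per-token gap turns into "thousands vs. $30" for heavy users.
```

The answer to the "why $1000s vs $30" confusion is the pricing model, not the per-token ratio: API spend grows linearly with tokens, a subscription doesn't.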
Perhaps I'm being extremely daft: If the API is 7 times more expensive, then why is it $1000 vs $30? Or is there a GLM subscription one can use with Pi? Certainly not available in my (arguably outdated) Pi.
I'm not the OP, but it's the latter. I'm currently using the "Lite" GLM subscription with OpenCode, for example. I'm not using it very heavily, but I haven't come close to hitting the limits, whereas I burned through my weekly limits with Claude very regularly.
I am using GLM-5.1 in pi.dev through Ollama Cloud. I am able to get by on the $20 plan. I use it a lot and the reset is hourly for sessions and weekly overall. This is the first week I got close to the limit before reset at about 85% used. I am probably using it about 4 hours a day on average 6 or 7 days per week.
You can use GLM's coding plan in Pi; just use the Anthropic-compatible API instead of the OpenAI-compatible one they give you.
Or tell pi to add support for the coding plan directly. That gave me GLM-5.1 support in no time along with support for showing the remaining quota, etc, too.
It also compresses the context at around 100k tokens.
In case anyone is interested: https://github.com/sebastian/pi-extensions/tree/main/.pi/ext...
I have used GLM 4.7, 5 and 5.1 now for about 3 months via the OpenCode harness and I don't remember it ever being stuck in a loop.
You have to keep it below ~100,000 tokens, else it gets funny in the head.
I only use it for hobby projects though. I paid 3 EUR per month, but that plan is no longer available :( Not sure what I will choose at the end of the month. Maybe OpenCode Go.
That's unfortunate. 70-80k tokens is roughly the point where I finish giving the agent the required context, even on small to medium-sized requests.
That would leave almost no tokens for actual work
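The ~100k ceiling people mention above is easy to enforce by hand: estimate tokens and drop the oldest turns before each request. A crude sketch; the 4-characters-per-token estimate is a rough heuristic rather than a real tokenizer, and real harness compaction summarizes old turns instead of dropping them:

```python
# Keep an agent's message history under a token budget, oldest-first.

def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for English text/code.
    return max(1, len(text) // 4)

def trim_history(messages, budget=100_000, keep_system=True):
    """messages: list of (role, text) tuples. Drops oldest non-system turns."""
    system = [m for m in messages if keep_system and m[0] == "system"]
    rest = [m for m in messages if m not in system]
    while rest and sum(estimate_tokens(t) for _, t in system + rest) > budget:
        rest.pop(0)  # drop the oldest turn first
    return system + rest

msgs = [("system", "You are a coding agent."),
        ("user", "x" * 800),   # ~200 estimated tokens
        ("user", "y" * 400)]   # ~100 estimated tokens
trimmed = trim_history(msgs, budget=150)
# The oldest user turn is dropped; system prompt and newest turn survive.
```

The tension the two comments above describe is exactly this budget: if required context alone eats 70-80k tokens, a model that degrades past ~100k has very little headroom left for the actual work.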
IDK about GLM but GPT 5.4 Extra High has been great when I've used it in the VS Code Copilot extension, I see no actual reason Opus should consume 3x more quota than it the way it does
I think it offers a very good tradeoff of cost vs competency
4.7 is better, but it's also wildly expensive.
Opus 4.6 was incredible but Opus 4.7 is genuinely frustrating to me so far. It's really sharp but can be so lazy. It's constantly telling me that we should save this for tomorrow, that it's time for bed (in the middle of the day), and very often quite sloppy and bold in its action. These adjustments are getting old. The next crop of open models seems ready to practically replace the big ones as sharp orchestrator agents.
The models test roughly equal on benchmarks, with generally small differences in their scores. So, it’s reasonable to choose the model based on other criteria. In my case, I’d switch to any vendor that had a decent plugin for JetBrains.
Qwen3-Coder produced much better rust code (that utilized rust's x86-64 vectorized extensions) a few months ago than Claude Opus or Google Gemini could. I was calling it from harnesses such as the Zed editor and trae CLI.
I was very impressed.
I think Claude in general writes very lazy, poor-quality code, but it writes code that works in fewer iterations. This could be one of the reasons behind its popularity - it pushes towards the end faster at all costs.
Every time codex reviews claude written rust, I can't explain it, but it almost feels like codex wants to scream at whoever wrote it.
Their latest, Qwen3.6 35B-A3B is quite capable, and fast and small enough I don't really feel constrained running it locally. Some of the others that I've run that seem reasonably good, like Gemma 4 31B and Qwen3.5 122B-A10B just feel a bit too slow, or OOM my system too often, or run up on cache limits so spend a lot of time re-processing history. But the latest Qwen3.6 is both quite strong, and lightweight enough that it feels usable on consumer hardware.
Codex is pretty good at Rust with x86 and arm intrinsics too, it replaced a bunch of hand written C/assembly code I was using. I will try Qwen and Kimi on this kind of task too.
Not to mention that Opus costs orders of magnitude more money. These models are VERY impressive given their price and usage limits.
FAANGs love to give away money to get people addicted to their platforms, and even they, the richest companies in the world, are throttling or reducing Opus usage for paying members, because even the money we pay them doesn't cover it.
Meanwhile, these are usable on local deployments! (and that's with the limited allowance our AI overlords afford us when it comes to choices for graphics cards too!)
Consider that SWE benchmarking is mainly done with Python code. That tells you something.
I tried GLM and Qwen last week for a day. Some issues they could solve, while some tasks, on the surface relatively easy, they just could not solve after a few tries - tasks that Opus one-shotted this morning with the same prompt. It's a single example of course, but I really wanted to give them a fair try. All it had to do was create a sortable list in Magento admin. But on the other hand, GLM did one-shot a PhpStorm plugin.
Do you use Opus through the API or with subscription? Did you use OpenCode or Code?
Opus through Claude Code, the Chinese models through OpenCode Go, which seems like a great package to test them out.
If you showed me code from GLM 5.1, Opus 4.6, and Kimi K2.6, my ranking for best model would be highly random.
I tried GLM5.1 last week after reading about it here. It was slow as molasses for routine tasks and I had to switch back to Claude. It also ran out of 5H credit limit faster than Claude.
If you view the "thinking" traces you can see why; it will go back and forth on potential solutions, writing full implementations in the thinking block then debating them, constantly circling back to points it raised earlier, and starting every other paragraph with "Actually…" or "But wait!"
I see this with Opus too.
Indeed. And that's with Anthropic hiding reasoning traces, unlike these other comparisons.
> "Actually…" or "But wait!"
You’re absolutely right!
Jokes apart, I did notice GLM doing these back and forth loops.
I was watching Qwen3.6-35B-A3B (locally) doing the same dance yesterday. It eventually finished and had a reasonable answer, but it sure went back and forth on a bunch of things I had explicitly said not to do before coming to a conclusion. At least said conclusion was not any of the things I'd said not to do.
That is essentially what the reasoning reinforcement training does. It is getting the model to say things that are more likely to result in the correct final answer. Everything it does in between doesn't necessarily need to be a valid argument for the answer. You can think of it as filling the context with whatever is needed to make the right answer come out next. Valid arguments obviously help, but so might expressions of incorrect things that are not obviously untrue to the model until it sees them written out. The "What's the Magic Word" paper shows how far that could go. If the policy model managed to learn enough magic words, it would be theoretically possible to end up with an LLM that spouts utter gibberish until delivering the correct answer seemingly out of the blue.
That's pretty cool, thanks for the extra context! (pardon the... not even pun I guess)
Also, thanks for pointing me at that specific paper; I spend a lot more of my life closer to classical control theory than ML theory, so it's always neat to see the intersection of them. My unsubstantiated hypothesis is that controls and ML are going to start getting looked at more holistically, and not in the way I normally see it (which is "why worry about classical control theory, just solve the problem with RL"). Control theory is largely about steering dynamic systems along stable trajectories through state space... which is largely what iterative "fill in the next word" LLM models are doing. The intersection, I hope, will be interesting and add significant efficiency.
Benchmarking is grossly misleading. Claude's subscription with Code would not score this high on the benchmarks because of how they lobotomized agentic coding.
>but I have seen the local 122b model do smarter more correct things based on docs than opus
Could you please share more about this
Maybe a bit misleading. I have used it in two places.
One is for local OpenCode coding and config work; the other is for agent browser use, and for both it did better than Opus 4.6 on the thing I was testing at the time. The problem with Opus when I tried it was overthinking and sometimes moving itself in the wrong direction (not that Qwen doesn't overthink sometimes). However, sometimes less is more - maybe turning thinking down on Opus would have helped me. Some people said it is better to turn it off entirely when you start to implement code, as the model already knows what it needs to do and doesn't need more distraction.
Another example is my Ghostty config: I learned from Qwen that it has theme support - Opus would always just make the theme in the main file.
Many people have turned away from religion (which I can get behind), but have never removed the dogmatic thinking that lies at its root.
As with so many things these days: it's a cult.
I've used Claude for many months now. Since February I see a stark decline in the work I do with it.
I've also tried to use it for GPU programming, which it absolutely sucks at, with Sonnet, Opus 4.5 and 4.6.
But if you share that sentiment, it's always a "You're just holding it wrong" or "The next model will surely solve this"
For me it's just a tool, so I shrug.
> I've used Claude for many months now. Since February I see a stark decline in the work I do with it.
I find myself repeating the following pattern: I use an AI model to assist me with work, and after some time, I notice the quality doesn't justify the time investment. I decide to try a similar task with another provider. I try a few more tests, then decide to switch over for full time work, and it feels like it's awesome and doing a good job. A few months later, it feels like the model got worse.
I wonder about this. I see two obvious possibilities (if we ignore bias):
1. The models are purposefully nerfed, before the release of the next model, similar to how Apple allegedly nerfed their older phones when the next model was out.
2. You are relying more and more on the models and are using your talent less and less. What you are observing is the ratio of your vs. the model's work leaning more and more to the model's. When a new model is released, it produces better quality code than before, so the work improves with it, but your talent keeps deteriorating at a constant rate.
I definitely find your last point is true for me. The more work I am doing with AI the more I am expecting it to do, similar to how you can expect more over time from a junior you are delegating to and training. However the model isn't learning or improving the same way, so your trust is quickly broken.
As you note, the developer's input is still driving the model quite a bit so if the developer is contributing less and less as they trust more, the results would get worse.
> However the model isn't learning or improving the same way, so your trust is quickly broken.
One other failure mode that I've seen in my own work while I've been learning: the things that you put into AGENTS.md/CLAUDE.md/local "memories" can improve performance or degrade performance, depending on the instructions. And unless you're actively quantitatively reviewing and considering when performance is improving or degrading, you probably won't pick up that two sentences that you added to CLAUDE.md two weeks ago are why things seem to have suddenly gotten worse.
> similar to how you can expect more over time from a junior you are delegating to and training
That's the really interesting bit. Both Claude and Codex have learned some of my preferences by me explicitly saying things like "Do not use emojis to indicate task completion in our plan files, stick to ASCII text only". But when you accidentally "teach" them something that has a negative impact on performance, they're not very likely to push back, unlike a junior engineer who will either ignore your dumb instruction or hopefully bring it up.
> As you note, the developer's input is still driving the model quite a bit so if the developer is contributing less and less as they trust more, the results would get worse.
That is definitely a thing too. There have been a few times that I have "let my guard down" so to speak and haven't deeply considered the implications of every commit. Usually this hasn't been a big deal, but there have been a few really ugly architectural decisions that have made it through the gate and had to get cleaned up later. It's largely complacency, like you point out, as well as burnout trying to keep up with reviewing and really contemplating/grokking the large volume of code output that's possible with these tools.
Your version of the last point is a bit softer I think — parent was putting it down to “loss of talent” but yours captures the gaps vs natural human interaction patterns which seems more likely, especially on such short timescales.
I confusingly say both. First I say that the ratio of work coming from the model is increasing, and when clarifying I say "your talent keeps deteriorating". You correctly point out these are distinct, and maybe this distinction is important, although I personally don't think so. The resulting code would be the same either way.
Personally I can see the case for both interpretation to be true at the same time, and maybe that is precisely why I confused them so eagerly in my initial post.
I don’t think the providers intentionally nerf the models to make the new one look better. It’s a matter of them being stingy with infrastructure, either by choice to increase profit and/or sheer lack of resources to keep n+1 models deployed in parallel without deprecating older ones when a new one is released.
I’d prefer providers to simply deprecate stuff faster, but then that would break other people’s existing workflows.
Point 2 is so true, I definitely find myself spending more time reading code vs writing it. LLMs can teach you a lot, but it's never the same as actually sitting down and doing it yourself.
I think it might have to do with how models work, and fundamental limits with them (yes, they're stochastic parrots, yes they confabulate).
Newer (past two years?) models have improved "in detail" - or as pragmatic tools - but they still don't deserve the anthropomorphism we subject them to because they appear to communicate like us (and therefore appear to think and reason, like us).
But the "holes" are painted over in contemporary models - via training, system prompts and various clever (useful!) techniques.
But I think this leads us to have great difficulty spotting the weak spots in a new, or slightly different model - but as we get to know each particular tool - each model - we get better at spotting the holes on that model.
Maybe it's poorly chosen variable names. A tendency to write plausible-looking, plausibly named e2e tests that turn out to not quite test what they appear to test at first glance. Maybe there's missing locking of resources, or missing use of transactions, in sequential code that appears sound but ends up storing invalid data when one or several steps fail...
In happy cases, current LLMs function like well-intentioned junior coders enthusiastically delivering features and fixing bugs.
But in the other cases, they are like pathologically lying sociopaths telling you anything you want to hear, just so you keep paying them money.
When you catch them lying, it feels a bit like a betrayal. But the parrot is just tapping the bell, so you'll keep feeding it peanuts.
I agree - the problem is it’s hard to see how people who say they’re using it effectively actually are using it, what they’re outputting, and making any sort of comparison on quality or maintainability or coherence.
In the same way, it’s hard to see how people who say they’re struggling are actually using it.
There’s truth somewhere in between “it’s the answer to everything” and “skill issue”. We know it’s overhyped. We know that it’s still useful to some extent, in many domains.
Well summarized.
We're also seeing that the people up top are using this to cull the herd.
What is it that is dogma-free? If one goes hardcore Pyrrhonist, doubting everything except that some doubting is currently happening as this statement is processed, that is perfectly sound.
At some point there is a need to have faith in some stable-enough ground to walk on.
All people think dogmatically. The only difference is what the ontological commitments and metaphysical foundations are. Take out God and people will fit politics, sports teams, tools, whatever in there. It's inescapable.
All people think dogmatically, but religion does not prevent people from acting dogmatically in politics, sports, etc. It just doesn't. It never did.
Under normal circumstances I'd consider this a nit and decline to pick it, but the number of evangelists out there arguing the equivalent of "cure your alcohol addiction with crystal meth!" is too damn high.
Allow me to introduce you to Buddhism
Elaborate. Buddhism is going to have the same epistemological issues as anything else, since it's a human-consciousness issue.
> since its a human consciousness issue
I'd encourage you to check it out for yourself. It's certainly possible to be a dogmatic Buddhist, but one of the foundational beliefs of Buddhism is that the type of dogmatic attachment you're describing is avoidable. It's not easy, but that's why you meditate.
Which one?
Zen
The Western Zen? In my experience it is downgraded from being a religion to being a system of practice which relieves it of the broader Mahayana cosmology. But I would suggest the dogma is less obvious but still there, often just somewhere else, such as in its own limitations, or in a philosophical container at a higher level such as scientism.
All Zen is about releasing those attachments. Granted it's pretty hard, because if you succeed, you're enlightened.
East, West, Religion, Practice… From a Zen perspective, you're just troubling your mind with binaries and conflict.
Ah and there is the dogma -- the otherness of the enlightened.
The binaries still functionally exist. I see a lot of value in reflective practices. At the same time it seems unlikely to me that the point of existing is to not trouble your mind.