Qwen3.5 122B and 35B models offer Sonnet 4.5 performance on local computers

venturebeat.com

322 points by lostmsu 11 hours ago


Aurornis - 8 hours ago

If you're new to this: All of the open source models are playing benchmark optimization games. Every new open weight model comes with promises of being as good as something SOTA from a few months ago, and then they always disappoint in actual use.

I've been playing with Qwen3-Coder-Next and the Qwen3.5 models since they were each released.

They are impressive, but they are not performing at Sonnet 4.5 level in my experience.

I have observed that they're configured to be very tenacious. If you can carefully constrain the goal with some tests they need to pass and frame it in a way to keep them on track, they will just keep trying things over and over. They'll "solve" a lot of these problems in the way that a broken clock is right twice a day, but there's a lot of fumbling to get there.

That said, they are impressive for open source models. It's amazing what you can do with self-hosted now. Just don't believe the hype that these are Sonnet 4.5 level models because you're going to be very disappointed once you get into anything complex.

mstaoru - 9 hours ago

I periodically try to run these models on my MBP M3 Max 128GB (which I bought with a mind to running local AI). I have a certain deep research question (in a field that is deeply familiar to me) that I ask when I want to gauge a model's knowledge.

So far Opus 4.6 and Gemini Pro are very satisfactory, producing great answers fairly fast. Gemini is very fast at 30-50 sec, Opus is very detailed and comes at about 2-3 minutes.

Today I ran the question against local qwen3.5:35b-a3b - it puffed for 45 (!) minutes, produced a very generic answer with errors, and made my laptop sound like it's going to take off any moment.

I wonder what I'm doing wrong... How am I supposed to use this for any agentic coding on a large enough codebase? It will take days (and a 3M Peltor X5A) to produce anything useful.

jackcosgrove - 6 hours ago

I am a total neophyte when it comes to LLMs, and only recently started poking around into the internals of them. The first thing that struck me was that float32 dimensions seemed very generous.

I then discovered what quantization is by reading a blog post about binary quantization. That seemed too good to be true. I asked Claude to design an analysis assessing the fidelity of 1, 2, 4, and 8 bit quantization. Claude did a good job, downloading 10,000 embeddings from a public source and computing a similarity score and correlation coefficient for each level of quantization against the float32 SoT. 1 and 2 bit quantizations were about 90% similar and 8 bit quantization was lossless given the precision Claude used to display the results. 4 bit was interesting as it was 99% similar (almost lossless) yet half the size of 8 bit. It seemed like the sweet spot.

This analysis took me all of an hour so I thought, "That's cool but is it real?" It's gratifying to see that 4 bit quantization is actually being used by professionals in this field.
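For anyone curious, the shape of that experiment is easy to reproduce. Here's a minimal sketch (my own, not the commenter's script) using random vectors in place of real embeddings, with simple per-vector affine quantization; real schemes like GGUF's K-quants are blockwise and more sophisticated:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for real embeddings: 1,000 random 384-dim float32 vectors.
emb = rng.standard_normal((1000, 384)).astype(np.float32)

def quantize(x, bits):
    """Per-vector affine quantization to 2**bits levels, then dequantize."""
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2 ** bits - 1)
    q = np.round((x - lo) / scale)
    return (q * scale + lo).astype(np.float32)

def mean_cosine(a, b):
    """Average cosine similarity between corresponding rows of a and b."""
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float((num / den).mean())

for bits in (1, 2, 4, 8):
    sim = mean_cosine(emb, quantize(emb, bits))
    print(f"{bits}-bit: mean cosine similarity {sim:.4f}")
```

On Gaussian random data this shows the same qualitative pattern: 8-bit is effectively lossless, 4-bit is nearly so, and 1-2 bit loses noticeably more.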

alexpotato - 9 hours ago

I recently wrote a guide on getting:

- llama.cpp

- OpenCode

- Qwen3-Coder-30B-A3B-Instruct in GGUF format (Q4_K_M quantization)

working on a M1 MacBook Pro (e.g. using brew).

It was a bit finicky to get all of the pieces together so hopefully this can be used with these newer models.

https://gist.github.com/alexpotato/5b76989c24593962898294038...
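For reference, the core of such a setup is short, assuming Homebrew's llama.cpp formula and a GGUF you've already downloaded (path, port, and context size below are illustrative):

```shell
# Install llama.cpp (ships llama-cli and llama-server)
brew install llama.cpp

# Serve the model on an OpenAI-compatible endpoint for a coding agent to point at.
# -ngl 99 offloads all layers to the GPU (Metal on Apple Silicon).
llama-server \
  -m ~/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  --port 8080 -ngl 99 -c 32768
```

OpenCode can then be pointed at http://localhost:8080/v1 as a custom OpenAI-compatible provider (see the gist for the details and gotchas).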

OkWing99 - 36 minutes ago

Can someone who has done this simplify and say what specs we need on a `local computer` to run and test this at a reasonable speed?

Excluding MBP M5 128GB.

jjcm - 7 hours ago

Getting better, but definitely not there yet, nor near Sonnet 4.5 performance.

What these open models are great for is narrow, constrained domains with good input/output examples. I typically use them for things like prompt expansion, sentiment analysis, and reformatting or re-arranging the flow of code.

What I found they have trouble with is going from ambiguous description -> solved problem. Qwen 3.5 is certainly the best of the OSS models I've found (beating out GPT 120b OSS which was the previous king), and it's just starting to demonstrate true intelligence in unbound situations, but it isn't quite there yet. I have a RTX 6000 pro, so Qwen 3.5 is free for me to run, but I tend to default to Composer 1.5 if I want to be cheap.

The trend however is super encouraging. I bought my vid card with the full expectation that we'll have a locally running GPT 5.2 equiv by EoY, and I think we're on track.

solarkraft - 10 hours ago

Smells like hyperbole. A lot of people making such claims don’t seem to have continued real world experience with these models or seem to have very weird standards for what they consider usable.

Up until relatively recently, while people had already long been making these claims, it came with the asterisk of "oh, but you can't practically use more than a few K tokens of context".

sunkeeh - 9 hours ago

Qwen3.5-122B-A10B BF16 GGUF = 224GB. The "80Gb VRAM" mentioned here will barely fit Q4_K_S (70GB), which will NOT perform as shown on benchmarks.

Quite misleading, really.
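The back-of-envelope math here is simple: file size is roughly parameters times bits-per-weight. A sketch, using approximate bits-per-weight figures for common GGUF quants (actual files vary with the per-tensor quant mix and metadata):

```python
def gguf_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Rough GGUF file size in GiB: parameters times bits, ignoring metadata."""
    return n_params * bits_per_weight / 8 / 2**30

# Approximate bits-per-weight; real quants mix precisions across tensors.
for name, bpw in [("BF16", 16.0), ("Q8_0", 8.5), ("Q4_K_S", 4.5)]:
    print(f"{name}: ~{gguf_size_gib(122e9, bpw):.0f} GiB")
```

For a 122B-parameter model this lands in the same ballpark as the figures above: BF16 well over 200 GiB, a 4-bit quant in the 60-70 GiB range.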

syntaxing - 6 hours ago

A big thing a lot of local users forget is that inference is hard. Maybe you have the wrong temperature. Maybe you have the wrong min-p. Maybe you have the wrong template. Maybe the implementation in llama.cpp has a bug. Maybe Q4 or even Q8 just won't compare to BF16. The reality is there are so many knobs to LLM inference, and any one of them can make the experience worse. It's not always the model's fault.
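Two of those knobs, temperature and min-p, are easy to see in isolation. A toy sketch of the filtering that llama.cpp's `--temp` and `--min-p` flags apply (real sampler chains add more steps, e.g. top-k and repetition penalties):

```python
import numpy as np

def sample_filter(logits, temperature=0.7, min_p=0.05):
    """Apply temperature scaling, then min-p filtering, to a logits vector.

    min-p keeps only tokens whose probability is at least min_p times the
    probability of the most likely token (the scheme llama.cpp uses).
    Returns renormalized probabilities over the surviving tokens.
    """
    z = np.asarray(logits, dtype=np.float64) / temperature
    p = np.exp(z - z.max())          # softmax, numerically stable
    p /= p.sum()
    keep = p >= min_p * p.max()      # min-p cutoff relative to the top token
    out = np.where(keep, p, 0.0)
    return out / out.sum()

# Toy 4-token vocabulary: the two unlikely tokens get filtered out entirely.
probs = sample_filter(np.array([5.0, 4.0, 1.0, 0.5]))
print(probs)
```

Lowering the temperature sharpens the distribution before the cutoff, so the same min-p value prunes more tokens; getting either value wrong for a given model visibly changes its behavior.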

xmddmx - 7 hours ago

Ollama users: there are notable bugs with ollama and Qwen3.5 so don't let your first impression be the last.

Theory is that some of the model parameters aren't set properly and this encourages endless looping behavior when run under ollama:

https://github.com/ollama/ollama/issues?q=is%3Aissue%20state... (a bunch of them)

nu11ptr - 9 hours ago

Thinking about getting a new MBP M5 Max 128GB (assuming they are released next week). I know "future proofing" at this stage is near impossible, but for writing Rust code locally (likely using Qwen 3.5 on MLX for now), the AIs have convinced me this is probably my best choice for the immediate term with some level of longevity, while retaining portability (not strictly needed, but nice to have). Alternatively I was considering RTX options or a Mac Studio, but was leaning towards Apple for the unified memory. What does HN think?

solarkraft - 10 hours ago

What are the recommended 4 bit quants for the 35B model? I don’t see official ones: https://huggingface.co/models?other=base_model:quantized:Qwe...

Edit: The unsloth quants seem to have been fixed, so they are probably the go-to again: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks

oscord - 8 hours ago

The SWE chart on the front page is missing Claude, an interesting way to present your data. Mix and match at will. Grown-up people showing public school level sneakiness. That fact alone disqualifies your LLM. Business/marketing leaders usually are brighter than average developers... so there.

mark_l_watson - 10 hours ago

The new 35B model is great. That said, it has slight incompatibilities with Claude Code. It is very good for tool use.

shell0x - 5 hours ago

Can't wait to try that out locally. Keen to reduce my dependence on American products and services.

lubitelpospat - 3 hours ago

All right guys, this is your time - what consumer device do you use for local LLM inference? GPU poor answers only

erelong - 10 hours ago

What kind of hardware does HN recommend or like to run these models?

car - 8 hours ago

Can it do FizzBuzz in Brainfuck? Thus far all local models have tripped over their feet or looped out.

kristianpaul - 9 hours ago

https://unsloth.ai/docs/models/qwen3.5#qwen3.5-27b: "Qwen3.5-27B: For this guide we will be utilizing Dynamic 4-bit which works great on 18GB RAM"

gunalx - 9 hours ago

Qwen 3.5 is really decent, outside of some weird failures on some scaffolding with seemingly differently trained tools.

Strong vision and reasoning performance, and the 35B-A3B model runs pretty OK on a 16GB GPU with some CPU layers.

aliljet - 10 hours ago

Is this actually true? I want to see actual evals that match this up with Sonnet 4.5.

karmasimida - 6 hours ago

Raw parameter scale is POWER; you can't get the performance of a much larger model out of a small one.

hsaliak - 6 hours ago

No it does not. None of these models have the “depth” that the frontier models have across a variety of conversations, tasks and situations. Working with them is like playing snakes and ladders, you never know when it’s going to do something crazy and set you back.

piyh - 5 hours ago

Unsloth is working magic with the qwen quants

jbellis - 8 hours ago

this is bullshit with a kernel of truth.

none of the qwen 3.5 models are anywhere near sonnet 4.5 class, not even the largest 397b.

BUT 27b is the smartest local-sized model in the world by a wide wide margin. (35b is shit. fast shit, but shit.)

benchmarks are complete, publishing on Monday.

renewiltord - 3 hours ago

In practice I have not seen this. Sonnet is incredible performance. No open model is close. Hosted open models are so much worse that I end up spending more because of inferior intelligence.

kristianpaul - 9 hours ago

They work great with kagi and pi

pstuart - 4 hours ago

One highly annoying facet of the hardware is that AMD's support for the NPU under Linux is currently non-existent, which abandons 50 of the stated 126 TOPS of AI capability. They seem to think that Windows support is good enough. Grrrrrr.

PunchyHamster - 9 hours ago

I asked it to recite "potato" 100 times coz I wanted to benchmark CPU vs GPU speed. It's on line 150 of its planning. It has recited the requested thing 4 times already and started drafting the 5th response.

...yeah I doubt it

xenospn - 11 hours ago

Are there any non-Chinese open models that offer comparable performance?

Paddyz - 5 hours ago

[dead]

aplomb1026 - 7 hours ago

[dead]

u1hcw9nx - 11 hours ago

[flagged]