Two Qwen3 models on one DGX Spark: the residency math

devashish.me

41 points by devashish86 3 days ago


wolttam - 2 hours ago

I started with antirez' DwarfStar[1] on one spark and that (~11-14tok/s generation, ~300-400 tok/s prompt processing) was enough of a taste for me to jump into 2 sparks, running the native quant of DSv4 Flash.

Now at 40-50tok/s generation and ~2000 tok/s prefill with a model that I've seen reason through race conditions and be able to trivially pull off any straight-forward coding task, and remain coherent at 500k context. With a preview checkpoint of the weights!

I'm excited for the future of local LLMs. There is some buy-in but apparently not an extreme amount to get access to models that can stand in the for the giants on all but the most challenging and/or hands-off coding tasks.

[1]: https://github.com/antirez/ds4

shireboy - 3 hours ago

I’ve been considering a move to local llm setup, having been underwhelmed coat vs value of various online offerings. But at the same time worried anything I get will be obsolete in a couple months. And I don’t want to have to babysit it. I really want some agents managing and creating side hustles for me and have some other things. I’m technical-have written my own harness and use gh copilot and grok daily and have a hosted openwebui+openrouter thing. I’m also torn between a 128g MacBook Pro or a framework, or spark or similar and lightweight laptop to access. Would love advice anyone has for (or against) going local. I have asked ai but have analysis paralysis as 5k would be a big investment for me so I want to make right choices

dzink - 3 hours ago

Have you tried llama.cpp with unsloth and models suited to it? GLM flash? It seemed to allow more models to be tried soon after they are released. Haven’t tried for long term deployment though, that’s the next step.

roger_ - 2 hours ago

How about Qwen3.6? What sort of prefill/decode rates?

Edit: 3.6 not 3.7!

cws_ai_buddy - 9 minutes ago

[flagged]

softwareseko - 37 minutes ago

[flagged]

devashish86 - 3 days ago

Author here. Quick context the post doesn't quite spell out:

The tool_choice="auto" failure on Qwen3-Next isn't a parser issue — the model reasons inside <think>, decides, and never emits the tool call. No error, just empty tool_calls. The fix was swapping the backbone from Thinking to Instruct, not tuning any parser flag.

The "load the bigger model first, size the smaller against actual residency" playbook generalizes to anything with shared CUDA framework overhead. The ~5 GiB framework floor shows up even at small gpu_memory_utilization values — plan against actuals, not targets.