Orthrus-Qwen3: up to 7.8× tokens/forward on Qwen3, identical output distribution
github.com | 207 points by FranckDernoncou 21 hours ago
So will this help openai/anthropic have lower congestion in the afternoons if they implement something similar?
No, it would make it worse.
This adds more computation and sacrifices throughput to improve latency of a serial single-user generation.
Large scale providers run inference in batches, sacrificing latency to gain throughput.
The most interesting part of this idea for me is that it wasn't tried/implemented before, because it makes so much sense.
I haven't read the paper but of course DTree tricks work here as well
Does this translate into a similar reduction in compute?
What's the catch?
It is all about moving the bottleneck. During prompt processing everything can be calculated in parallel, while during token generation you create a single token at a time. For example, using an RTX 4000 Ada, I'm getting 2700 t/s for prompt processing, and 48 t/s for token generation using an 8B class model.
Their approach is essentially a speculative decoding approach where multiple tokens are predicted at once and then verified. Therefore getting more tokens to be created at a speed that is closer to the prompt processing speed.
It seems to be special because their approach yields exactly the same output distribution as the base model while requiring only a negligible amount of additional memory.
The main catch is that if your prompt processing speed is already bad, it will not help you all that much.
For example, the M-series Macs (up to the M4) have a relatively high generation speed compared to their prompt processing speed. That means they will not benefit as much (if at all). With the M5, prompt processing speed has increased 4x, so those can expect to see a good uplift.
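Back-of-the-envelope with my RTX numbers above: any draft-and-verify scheme is ultimately bounded by the prefill/decode throughput ratio, since verification runs at roughly prompt-processing speed. A quick sketch:

    # Rough ceiling for draft-and-verify schemes on the setup above:
    # verification runs near prompt-processing speed, so the achievable
    # speedup is bounded by the prefill/decode throughput ratio.
    prefill_tps = 2700.0  # tokens/s, prompt processing (RTX 4000 Ada, 8B model)
    decode_tps = 48.0     # tokens/s, one token per forward pass
    print(f"headroom: ~{prefill_tps / decode_tps:.0f}x")  # ~56x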
> multiple tokens are predicted at once and then verified
Reminds me a little of a carry lookahead adder.
> Does this translate into a similar reduction in compute?
No, quite the opposite actually. Like with speculative decoding this model will compute more tokens and discard the invalid ones.
> What's the catch?
LLMs[1] are limited by memory bandwidth, not by compute[2]: because they process tokens one at a time, you spend more time streaming the weights from VRAM into the GPU's compute units than actually computing. Techniques like this one let you process multiple tokens in parallel instead of one by one, and thus make better use of your graphics card's compute. They do so by predicting which tokens are likely to come next and then verifying that the guess was correct.
For instance, suppose the previous token is “hello”.
A regular autoregressive LLM will compute:
“hello” => “! ”,
then “hello! ” => “how ”,
“hello! how ” => “are ”,
“hello! how are ” => “you”.
and finally “hello! how are you” => “?<end>”
One at a time, streaming every weight from GPU memory to the compute units 5 times.
With speculative decoding (I'd say this one isn't strictly speculative decoding, but it's a variant of the same principle), you have something that guesses that the whole sentence is going to be “how are you today?”, so the LLM can generate
“hello” => “! ”,
“hello! ” => “how ”,
“hello! how ” => “are ”,
“hello! how are ” => “you”.
“hello! how are you” => “?<end>”
“hello! how are you today” => “?<end>”
In parallel. So each weight is loaded from VRAM only once instead of 5 times.
The last token will be discarded though, as the prefix “how are you today” doesn't match what was actually generated. So in that particular example, you'd get your 5 tokens 5 times faster than with pure autoregressive inference, at the expense of a 6th token being computed and immediately discarded. So 5 times more token throughput, but a 20% compute cost increase per token.
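In code, that accept/discard step looks roughly like this (a toy sketch; target() and draft() are stand-ins, not any real inference API):

    # Toy sketch of greedy speculative decoding. target() plays the big
    # model, draft() the cheap guesser; both are stand-ins, not a real API.
    def target(prefix):
        sentence = ["!", "how", "are", "you", "?", "<end>"]
        return sentence[len(prefix) - 1]  # "true" next token after prefix

    def draft(prefix, k):
        return ["!", "how", "are", "you", "today"][:k]  # guesses k tokens at once

    prefix = ["hello"]
    guesses = draft(prefix, k=5)
    # One batched forward pass verifies every position in parallel:
    verified = [target(prefix + guesses[:i]) for i in range(len(guesses))]
    accepted = []
    for guess, truth in zip(guesses, verified):
        accepted.append(truth)  # the verified token is always kept
        if guess != truth:
            break               # first mismatch: discard the rest
    print(prefix + accepted)    # ['hello', '!', 'how', 'are', 'you', '?']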
[1]: autoregressive LLMs, that is. Which are the ones everybody uses because they are the most performant.
[2]: at least when run at low batch size, on your own computer for your personal use. In a datacenter, with many concurrent users, GPUs are actually compute-bound.
Minor nit re [2]: for agentic workloads that are actually worth money (i.e., Claude Code and similar), things are either prefill-bound, which this does not help with, or, more importantly, bound by tps/user (at 150k+ context windows): you want your big magic model to emit 200 tps per user. This is why Nvidia bought Groq (now LPU) and what Cerebras is trying to do, etc. So for the stuff that makes money in the field, GPUs are not really compute-bound once context lengths are large; they're still memory-transfer-bound (maybe KV-cache transfer, maybe HBM to on-chip SRAM, etc.).
> i.e., claude code and similar, things are either prefill-bound
When you account for prefix caching, which greatly accelerates each turn, prefill still isn't the bottleneck versus decoding reasoning tokens, barring large file reads. Same for script-writing.
This is especially true during exploration phases: when traversing directory trees and grepping files, you're talking about a few hundred tokens per turn.
Fantastic results. Well done. So this is built into the way the model works, if I'm understanding it correctly.
I was wondering what would be involved in getting it to work with GGUF files, rather than safetensor files...
Just to get it into a GGUF file would be fairly trivial. But using that GGUF file would need a bunch of additional things. One would need to create a new architecture derived from Qwen3, and then probably adapt the speculative decoding functionality.
At the moment not even MTP is merged into llama.cpp, so I wouldn't quite hold my breath for it.
MTP merged today, a couple of hours after your post by the looks of things.
I thought that might be the case. I naively wondered. I'll see if I can understand the paper :-)
Hope the paper gets lots of references and the technique gets a lot of use to save power and time.
There have been several potentially big changes for LLM inference efficiency over the last few months: Attention Sequencing (I think it's called?), Turbo Quant, and now this one.
Interesting times.
If someone can make this work with GGUF and a quantized Qwen 3.6 or Deepseek 4, it would greatly help with running local models.
Multi-token prediction is available now; I'm still getting it set up, but it sounds like it'll be 1.5x or 2x on the bigger models.
I've tried MTP, and that got me about 1.5x on average with a very spec-friendly benchmark.
I didn't run the full benchmark with the demo code, just picked a single prompt from it. The prompt is about 1,300 tokens and the response about 3,200 tokens.
Baseline: 44.8 t/s. With Orthrus: 164.6 t/s (roughly 3.7x).
Note: Don't use the `use_diffusion_mode=` config flag in their example to collect a baseline. Something about how the fallback to "normal" makes it grind to a crawl.
I wonder what our man @antirez will make of this
I don't understand. This distills a diffusion transformer out of Qwen3. And while the provably identical output is nice, a full diffusion transformer would be a lot faster still.
A full diffusion transformer would need more forward passes (thus being slower) or produce worse output (because it can't properly account for dependencies between tokens when generating them independently in parallel), or both. Keeping the output identical to the autoregressive baseline ensures the speedup doesn't come at the cost of quality degradation.
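A toy illustration of the dependency problem (hypothetical two-token example): if two continuations are equally likely, sampling each position independently from its marginal can stitch together a sequence neither continuation contains.

    import random
    random.seed(0)
    # Two equally likely continuations: "new york" / "los angeles".
    first = ["new", "los"]        # marginal distribution over position 1
    second = ["york", "angeles"]  # marginal distribution over position 2
    # Independent per-position sampling ignores the dependency and can mix:
    print(random.choice(first), random.choice(second))  # e.g. "new angeles"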
Paper: https://arxiv.org/abs/2605.12825 ; Code+models: https://github.com/chiennv2000/orthrus ; Disclosure: co-author.
Idea: Inject a trainable diffusion attention module into each layer of a frozen AR Transformer. Both heads share one KV cache. The diffusion head predicts K=32 tokens in parallel; the AR head verifies them in a second pass and accepts the longest matching prefix. The output distribution is provably identical to the base model's.
Results:
- Up to 7.8x TPF, ~6x wall-clock on MATH-500.
- 16% of params trained, <1B tokens, 24h on 8xH200.
- vs. diffusion LMs (Dream, Fast-dLLM-v2, SDAR, Mercury, Gemini Diffusion): they modify base weights and lose accuracy (Fast-dLLM-v2: -11 pts on MATH-500). Orthrus freezes the backbone; accuracy matches Qwen3-8B exactly.
- vs. Speculative Decoding (EAGLE-3, DFlash): no external drafter, no separate cache, zero TTFT penalty (no drafter to init/sync). KV overhead is O(1) (~4.5 MiB flat). Acceptance length on MATH-500: 11.7 vs. 7.9 (DFlash) vs. 3.5 (EAGLE-3).
- Single-step denoising beats multi-step (6.35 vs. 3.53 TPF). KL distillation beats CE on acceptance rate.
Limitations: strictly bounded by the frozen base model (inherits its biases, hallucinations, knowledge gaps); Qwen3-only evaluation; greedy + rejection sampling only.
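If it helps, the decode loop heavily simplified (illustrative names, not our actual API; see the repo for the real implementation):

    # Heavily simplified decode loop; names are illustrative, not the real API.
    def longest_matching_prefix(draft, verified):
        n = 0
        while n < len(draft) and draft[n] == verified[n]:
            n += 1
        return n

    def generate(model, prompt_ids, max_new=512, K=32):
        ids = list(prompt_ids)
        kv = model.prefill(ids)                     # single shared KV cache
        while len(ids) < len(prompt_ids) + max_new:
            draft = model.diffusion_draft(kv, k=K)  # K tokens, one forward pass
            verified = model.ar_verify(kv, draft)   # frozen AR head, batched;
                                                    # K+1 next-token predictions
            n = longest_matching_prefix(draft, verified)
            accepted = verified[: n + 1]            # matched prefix + 1 corrected
            ids += accepted
            model.append_kv(kv, accepted)           # both heads reuse this cache
            if model.eos_id in accepted:
                break
        return ids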
On the limitation side:
Do you think this would scale to larger transformer models with more parameters per layer?
How would this work with MOE models or sparse models?
Amazing. Is it possible to do this with Qwen 3.6 27B? Will it work with quants (I assume so)?
From a quick and shallow read of the paper, it looks very feasible (with a little tinkering) to adapt to Qwen 3.6 27B. The process looks somewhat similar to training a LoRA, or in a way distilling your own model so that a mini model learns to imitate it, and then gluing them together. I might bite the bullet and rent a GPU to do it for 3.6 27B, as this would solve a lot of my problems.
Scratch that, I don't have that kind of money, and 3.5's architecture is a little more divergent from 3's, so it will be a bit less trivial. It does look possible, just not on a student's paycheck.
There are websites that let you rent GPUs for cheap, such as QuickPod. Have you checked those P2P GPU rentals out?
My plan is to first validate whether it even works using Qwen 3.5 0.8B (which has the same architecture as Qwen 3.6 27B, just scaled down a bit) on my 3090. If it does, I'll put up a git repo about the process in case anyone wants to use my approach, while I try to convince my uni to lend me H100s for a day.
If anyone is interested in watching my 0.8B experiments: https://orthrus.kokoham.com/ . The current code is here: https://git.kokoham.com/sleepy/qwen_orthrus .
The hard part was that the original Orthrus works with standard (attention-only) transformers, but 3.5 (and 3.6) is hybrid: 75% GatedDeltaNet + 25% GatedAttention. I am testing a trick that might make it work with the GatedDeltaNet, and dry runs are promising, but only a full training run will reveal if it works. More information in the repo and on the site under the "What is this all about?" button.
Note: I may restart it or try different configs at different points; if the site is down, there is probably some sort of result/conclusion in the repo.
And it also looks like the original authors are working on qwen 3.5 too: https://github.com/chiennv2000/orthrus/issues/1#issuecomment...
I would probably treat the (3 GatedDeltaNet + 1 GatedAttention) blocks as one transformer block; when generating next steps, one would then use the KV cache for the gated attention and skip the delta nets entirely.
3.6 already supports multi-token generation AFAIK
Yes, but not diffusion-based; it's still doing token-at-a-time speculation.
I thought it could do multiple tokens at a time.
There was a chart from the Unsloth folks posted to Reddit in the last couple of days which showed that the draft sweet spot for MTP was 2-3 tokens ahead, depending on the quant. That's not much, and I think this might do a lot better. The whole "provably identical distribution" thing is doing a lot of work in my head, and I don't think that's true of the MTP model in Qwen's architecture.
Think of this as another way of achieving that. It theoretically has a higher ceiling on how much it can predict at a time, and, more importantly, it is a lot more memory-efficient during actual inference.
Really cool work!
Does the training data budget scale with model size?
How would you compare it to the Gemma 4 draft model, which is also integrated with the base KV cache?
So it's D-Flash, but at each transformer layer, and sharing the KV cache of the original model? Very smart!
Kind of, yeah. Predictivity is a question for larger layers, though, when trying to scale this up. But yeah, this is a "95% predictor in latent space is a 7x speed improvement if done right" approach.
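The arithmetic roughly checks out, too. A back-of-the-envelope sketch, assuming i.i.d. per-token acceptance (a big simplification) and charging both the draft and the verify forward pass:

    # Expected tokens/forward for a 95% per-token predictor drafting K=32,
    # assuming i.i.d. acceptance and 2 forward passes per draft-verify cycle.
    p, K = 0.95, 32
    matched = p * (1 - p**K) / (1 - p)  # expected matched prefix, ~15.3
    tokens_per_cycle = matched + 1      # +1 corrected token from the verifier
    print(f"{tokens_per_cycle / 2:.1f} tokens/forward")  # ~8.2, near the 7.8x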
BTW the paper says
> Since only (Qdiff,Kdiff,Vdiff) are updated during training, the total number of trainable parameters is approximately 16% of the full model.
But the code defines q_proj_diff, k_proj_diff, v_proj_diff, and o_proj_diff, and the count only matches 16% when you include the O term.
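This is easy to sanity-check against the base model, assuming (as the code suggests) that each *_diff projection mirrors the shape of its frozen counterpart:

    # Count Q/K/V vs. Q/K/V/O projection params in the base model; assumes
    # each *_diff projection has the same shape as its frozen counterpart.
    from transformers import AutoModelForCausalLM

    m = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
    total = sum(p.numel() for p in m.parameters())
    qkv = sum(p.numel() for n, p in m.named_parameters()
              if any(s in n for s in ("q_proj", "k_proj", "v_proj")))
    o = sum(p.numel() for n, p in m.named_parameters() if "o_proj" in n)
    print(f"QKV: {qkv / total:.1%}   QKVO: {(qkv + o) / total:.1%}")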