AdapTive-LeArning Speculator System (ATLAS): Faster LLM inference

together.ai

182 points by alecco 13 hours ago


wishawa - 10 hours ago

Inference is impressively fast. But what about quality? In the Kimi vendor verifier (https://github.com/MoonshotAI/K2-Vendor-Verifier/), Together has one of the highest tool call failure rates (>300 failures over the benchmark, compared to 0-2 for the official API, groq, SiliconFlow, and Infinigence).

hazrmard - 2 hours ago

Do I understand this right?

A lightweight speculator (draft) model adapts to usage, keeping the acceptance rate against the static, heavyweight target model within acceptable bounds.

Do they adapt with LoRAs?
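
If I had to guess at the mechanism (made-up sketch below, not Together's code), the serving loop mainly needs to track how often the target accepts the drafter's tokens and kick off a lightweight update, LoRA-style or otherwise, when the rate sags:

```python
# Hypothetical sketch: monitor draft-token acceptance and trigger an update
# when it drops. Names and thresholds are invented, not Together's.
from collections import deque

class AcceptanceMonitor:
    def __init__(self, window=10_000, threshold=0.7):
        self.window = deque(maxlen=window)   # 1 = accepted, 0 = rejected
        self.threshold = threshold

    def record(self, accepted: int, proposed: int) -> None:
        self.window.extend([1] * accepted + [0] * (proposed - accepted))

    def acceptance_rate(self) -> float:
        return sum(self.window) / max(len(self.window), 1)

    def needs_adaptation(self) -> bool:
        return (len(self.window) == self.window.maxlen
                and self.acceptance_rate() < self.threshold)

# Inside the serving loop (pseudo-usage):
#   monitor.record(accepted=n_accepted, proposed=k)
#   if monitor.needs_adaptation():
#       update_drafter(recent_traffic)   # e.g. a small LoRA-style fine-tune
```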

Havoc - 10 hours ago

>a faster speculator (also known as the draft model) proposes multiple tokens ahead, and the target model verifies them in parallel in a single forward pass

TIL. Bit of an aha moment - never understood till now how a big model can verify faster than it can generate
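
The key is that the target scores the prompt plus all k drafted tokens in one batched forward pass (one big matmul-bound call instead of k sequential ones) and keeps the longest prefix it agrees with. Greedy acceptance below for simplicity; real systems use rejection sampling to keep the output distribution exact. Rough sketch, not ATLAS itself:

```python
# Minimal greedy speculative decoding sketch (illustrative only).
# draft_model / target_model are assumed to expose simple next-token helpers.

def speculative_step(prompt_ids, draft_model, target_model, k=4):
    # 1. Draft model proposes k tokens autoregressively (cheap, small model).
    draft_ids = list(prompt_ids)
    proposed = []
    for _ in range(k):
        tok = draft_model.argmax_next(draft_ids)       # assumed helper
        proposed.append(tok)
        draft_ids.append(tok)

    # 2. Target model scores prompt + all k proposals in ONE forward pass;
    #    position i of the output is its pick for the token after position i.
    target_preds = target_model.argmax_all(draft_ids)  # assumed helper

    # 3. Accept the longest prefix where the target agrees with the draft.
    accepted = []
    for i, tok in enumerate(proposed):
        if target_preds[len(prompt_ids) + i - 1] == tok:
            accepted.append(tok)
        else:
            # First disagreement: take the target's token instead and stop.
            accepted.append(target_preds[len(prompt_ids) + i - 1])
            break
    else:
        # All k accepted: the same pass also yields one bonus token for free.
        accepted.append(target_preds[-1])
    return accepted
```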

necovek - 6 hours ago

So with a 4x speed-up, Together will give us at least 2x lower price for top-end models, right? :)

andblac - 9 hours ago

At first glance, this reminds me of how branch prediction is used in CPUs to speed up execution. As I understand it, this development is a form of soft branch prediction over language trajectories: a small model predicts what the main model will do, runs a few steps ahead, and then the results are verified (and this can be done in parallel). If it checks out, you just jump forward; if not, you take a miss, but that's rare. I find it funny how small-but-big ideas like this come up in different contexts again and again in the history of our technological development. Of course, ideas, as always, are cheap. The hard part is actually using them and cashing in on them.
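
The analogy even extends to the arithmetic: assuming each drafted token is accepted independently with probability alpha (a simplification), a draft length of k yields (1 - alpha^(k+1)) / (1 - alpha) target tokens per verification pass on average, which is where headline speedups in the 2-4x range come from (illustrative numbers, not from the post):

```python
# Expected target tokens produced per verification pass, assuming each drafted
# token is accepted independently with probability alpha (a simplification).
def expected_tokens(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.6, 0.8, 0.9):
    print(alpha, round(expected_tokens(alpha, k=4), 2))
# 0.6 -> ~2.31, 0.8 -> ~3.36, 0.9 -> ~4.10 tokens per big-model pass
```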

LogicFailsMe - 8 hours ago

No barrier to entry whatsoever? Backprop on the speculative-decoding weights during inference to improve their accuracy on a per-application basis?
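
Something along those lines, presumably (a made-up sketch, not their code): every verification pass hands you the target model's own token choices over the drafted span, which can serve as labels for a plain cross-entropy step on the drafter, or on just its LoRA parameters. Assuming a HuggingFace-style causal LM interface:

```python
# Hypothetical online adaptation step for a draft model (not Together's code).
# The target's verified tokens act as labels for a standard LM loss.
import torch
import torch.nn.functional as F

def adapt_drafter(drafter, optimizer, context_ids, target_token_ids):
    """One gradient step pushing the drafter toward the target's choices.

    context_ids:      LongTensor [1, T]  tokens fed to both models
    target_token_ids: LongTensor [1, T]  the target model's next-token picks
    """
    logits = drafter(context_ids).logits            # [1, T, vocab]
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),           # [T, vocab]
        target_token_ids.view(-1),                  # [T]
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                # e.g. only LoRA params
    return loss.item()
```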

Cool hack though, kudos. Wonder if they can make Groq or Cerebras do the same thing?

ashvardanian - 10 hours ago

Will need some time to go through the details, but it’s increasingly rare to see teams consistently delivering meaningful improvements in the open. Impressive work!

jsutton97 - 4 hours ago

I can't help but wonder how much longer we'll see this work shared openly.

petesergeant - 10 hours ago

> Built on top of Together Turbo Speculator, ATLAS reaches up to 500 TPS on DeepSeek-V3.1 and up to 460 TPS on Kimi-K2 in a fully adapted scenario — 2.65x faster than standard decoding, outperforming even specialized hardware like Groq

and yet, if you click on: https://openrouter.ai/moonshotai/kimi-k2-0905

You'll see Groq averaging 1,086tps vs Together doing 59tps. Groq and Cerebras often feel like the only games in town. I'd love that to be different (because I'd like more models!), but nobody else is coming close right now.

Comparing how quickly gpt-oss-120b runs gives a broader picture: https://openrouter.ai/openai/gpt-oss-120b -- Vertex (Google) and SambaNova do pretty well on it too, but still, the difference between a top provider and an also-ran is giant.

God I love OpenRouter.

diamond559 - 4 hours ago

Great, my slop memes can come out much faster now. This is the future of the world economy!