Cerebras launches Qwen3-235B, achieving 1.5k tokens per second

cerebras.ai

363 points by mihau 3 days ago


mehdibl - 3 days ago

It seems this news is "outdated": it's from Jul 8 and may have been picked up in confusion with yesterday's Qwen3 Coder 405B release, which has different specs.

aurareturn - 3 days ago

If this is the full fp16 quant, you'd need 2TB of memory to use with the full 131k context.

With 44GB of SRAM per Cerebras chip, you'd need 45 chips chained together. $3m per chip. $135m total to run this.

For comparison, you can buy a DGX B200 with 8x B200 Blackwell chips and 1.4TB of memory for around $500k. Two systems would give you 2.8TB memory which is enough for this. So $1m vs $135m to run this model.
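A minimal sketch of that back-of-the-envelope comparison, treating the 2TB, 44GB/chip, and price figures above as given assumptions:

```python
# Back-of-the-envelope check of the figures above; every input here is an
# assumption from the comment, not a measured number.
model_memory_gb = 2000          # assumed fp16 weights + full 131k-context KV cache
sram_per_chip_gb = 44           # on-wafer SRAM per Cerebras chip
chip_price_usd = 3_000_000      # assumed price per chip

chips_needed = model_memory_gb / sram_per_chip_gb            # ~45 chips
cerebras_cost = chips_needed * chip_price_usd                # ~$136M

dgx_price_usd = 500_000         # DGX B200 with 8x B200, 1.4 TB memory
dgx_systems = 2                 # 2.8 TB total, enough headroom for 2 TB
dgx_cost = dgx_systems * dgx_price_usd                       # $1M

print(f"{chips_needed:.0f} chips, ~${cerebras_cost/1e6:.0f}M vs ${dgx_cost/1e6:.1f}M")
```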

It's not very scalable unless you have some ultra-high-value task that needs super fast inference. Maybe hedge funds or some sort of financial-markets application?

PS. The reason I think we're only at the beginning of the AI boom is that I can't imagine what we could build if we could run models as good as Claude Opus 4 (or even better) at 1500 tokens/s, very cheaply, with tens of millions of context tokens. We're still a few generations of hardware away, I'm guessing.

0vermorrow - 3 days ago

I'm eagerly awaiting Qwen3 Coder becoming available on Cerebras.

I run plenty of agent loops and the speed makes a somewhat interesting difference in time "compression". Having a Claude 4 Sonnet-level model running at 1000-1500 tok/s would be extremely impressive.

To FEEL THE SPEED, you can try it yourself on the Cerebras Inference page, through their API, or, for example, on Mistral's Le Chat with their "Flash Answers" (powered by Cerebras). Iterating on code at 1000 tok/s makes it feel even more magical.

sneilan1 - 3 days ago

So I installed the litellm proxy, pointed it at the new Cerebras API with Qwen3-235B, and hooked Aider up to litellm. This is not as good as Claude Code yet, but it's so much faster. I even tried feeding the leaked Claude Code prompt into Aider, but it doesn't do what I expect. Still worth trying, but I learned that Claude Code's prompt is very specific to Claude. I think this is very promising, however! Aider basically spat out a bunch of text, installed some stuff, made some web calls & exited. WAS REALLY FAST LOL.

You can repeat my experiment quickly with the following:

config.yaml for litellm:

```
model_list:
  - model_name: qwen3-235b
    litellm_params:
      model: cerebras/qwen-3-235b-a22b
      api_key: os.environ/CEREBRAS_API_KEY
      api_base: https://api.cerebras.ai/v1
```

Run litellm with (you may need to install litellm[proxy]):

```
litellm --config config.yaml --port 4000 --debug
```

Start Aider with:

```
aider --model cerebras/qwen-3-235b-a22b --openai-api-base http://localhost:4000 --openai-api-key fake-key --no-show-model-warnings --auto-commits --system-file ./prompt.txt --yes
```

Install whatever you need with pip, etc. prompt.txt contains the leaked Claude Code prompt, which you can find on the internet.
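If you want to sanity-check that the proxy is wired up before (or instead of) pointing Aider at it, here's a minimal smoke test against litellm's OpenAI-compatible endpoint. The model name must match the model_name in config.yaml; the key is a placeholder since litellm holds the real Cerebras key:

```python
# Minimal smoke test against the local litellm proxy (OpenAI-compatible API).
# Assumes `litellm --config config.yaml --port 4000` is already running.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="fake-key")

resp = client.chat.completions.create(
    model="qwen3-235b",  # must match model_name in config.yaml
    messages=[{"role": "user", "content": "Reply with the word: pong"}],
)
print(resp.choices[0].message.content)
```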

doctoboggan - 3 days ago

Has anyone with a lot of experience with Claude Code and Sonnet 4 tried Claude Code with Qwen3-Coder? The speeds Cerebras enables here are enticing, but I wouldn't trade a speedup for a worse-quality model.

vadepaysa - 3 days ago

While the speeds are great, in my experience with Cerebras it's really hard to get any actual production-level rate limits or token allocations. We can't design systems around them, so we use other vendors.

We've spoken to their sales teams, and we've been told no.

nisten - 3 days ago

"Full 131k" context , actually the full context is double that at 262144 context and with 8x yarn mutiplier it can go up to 2million. It looks like even full chip scale Cerebras has trouble with context length, well, this is a limitation of the transformer architechture itself where memory requirements scale ~linearly and compute requirements roughly quadratically with the increase in kv cache.

Anyway, YOU'RE NOT SERVING FULL CONTEXT, CEREBRAS, YOU'RE SERVING HALF. Also, what quantization exactly is this? Can customers know?
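For reference, a rough KV-cache sizing sketch showing the linear-in-context memory growth described above (the layer/head dimensions are illustrative placeholders, not confirmed Qwen3-235B values):

```python
# Illustrative KV-cache estimate: memory grows linearly with context length.
# The model dimensions below are placeholder assumptions, not official specs.
def kv_cache_gib(context_len, layers=94, kv_heads=4, head_dim=128, bytes_per_val=2):
    # 2x for keys and values; bytes_per_val=2 assumes an fp16/bf16 cache
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_val / 2**30

for ctx in (131_072, 262_144, 2_000_000):
    print(f"{ctx:>9} tokens -> ~{kv_cache_gib(ctx):.0f} GiB per sequence")
```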

pjs_ - 3 days ago

Cerebras is truly one of the maddest technical accomplishments that Silicon Valley has produced in the last decade or so. I met Andy seven or eight years ago and I thought they must have been smoking something - a dinner-plate-sized chip with six tons of clamping force? They made it real, and in retrospect what they did was incredibly prescient.

doubtfuluser - 3 days ago

Very impressive speed. A bit OT: what is the current verdict on Qwen, Kimi, et al. when it comes to censorship/bias concerning narratives not allowed in their country of origin?

mehdibl - 3 days ago

Would be great if they supported the latest Qwen 3 405B, launched yesterday and aimed more at agentic work/coding.

westurner - 3 days ago

> Qwen3-235B uses an efficient mixture-of-experts architecture that delivers exceptional compute efficiency, enabling Cerebras to offer the model at $0.60 per million input tokens and $1.20 per million output tokens—less than one-tenth the cost of comparable closed-source models.

  $0.60 / million input tokens
  $1.20 / million output tokens

How many minutes of 4K YouTube HDR video is that equivalent to in kWh of energy usage?

> Concurrent with this launch, Cerebras has quadrupled its context length support from 32K to 131K tokens—the maximum supported by Qwen3-235B.
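For scale, a quick sketch of what a single request costs at the quoted rates (the request sizes below are made up for illustration):

```python
# Per-request cost at the quoted Cerebras pricing; token counts are illustrative.
input_price_per_tok = 0.60 / 1_000_000    # $ per input token
output_price_per_tok = 1.20 / 1_000_000   # $ per output token

input_tokens, output_tokens = 10_000, 1_500   # hypothetical request size
cost = input_tokens * input_price_per_tok + output_tokens * output_price_per_tok
print(f"${cost:.4f} per request")             # $0.0078
```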

bluelightning2k - 3 days ago

This is (slightly) old news from July 8, resurfaced due to the Qwen3 Coder release.

I think the gist of this thread is entirely: "please do the same for Qwen 3 coder", with us all hoping for:

a) A viable alternative to Sonnet 3
b) Specifically, a faster and cheaper alternative

poly2it - 3 days ago

Very impressive speed. With a context window of 40K, however, usability is limited.

p0w3n3d - 3 days ago

I'm looking for a setup for local development with a local Qwen on my MacBook. I tried localforge with mlx_lm.server, but they failed to communicate (I saw a proof of concept on their page, but now it seems to fail with an "empty response" that in reality is not empty).

Could anyone recommend a solution?

rbanffy - 3 days ago

Who remembers “wafer scale integration” from the 1980s?

Insane that Cerebras succeeded where everyone else failed for 5 decades.

Inviz - 3 days ago

I contacted their sales team before; Cerebras started at $1500 a month at that time, and the limits were soooooo small. Did it get better?

Edit: Looks like it did. They both introduced pay-as-you-go and have prepaid limits too at $1500. I wonder if they have any limitations on parallel execution for pay-as-you-go...

rafaelero - 3 days ago

If they do the same for the coding model they will have a killer product.

cedws - 3 days ago

With this kind of speed you could build a large thinking stage into every response. What kind of improvement could you expect in benchmarks from having, say, 1,000 tokens of thinking for every response?
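The latency side of that is easy to estimate at the headline speed; the benchmark gain is the open question. A quick sketch with the hypothetical numbers from the question:

```python
# Wall-clock cost of a fixed thinking budget at a given decode speed.
tokens_per_second = 1_500   # headline Cerebras figure for Qwen3-235B
thinking_tokens = 1_000     # hypothetical per-response thinking budget
answer_tokens = 300         # illustrative visible answer length

extra_latency = thinking_tokens / tokens_per_second               # ~0.67 s
total_latency = (thinking_tokens + answer_tokens) / tokens_per_second
print(f"+{extra_latency:.2f}s for thinking, {total_latency:.2f}s total")
```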

mikewarot - 3 days ago

It's quite possible this is getting near the upper limit of what's possible with current compute architectures. Let's say the limit is 10k tokens/second with Qwen3-235B.

There's always going to be some latency in any compute architecture. Assume some insane billionaire cast the entire Qwen3-235B model into silicon, so it all ran in parallel: tokens going in one end and the next token coming out the other. This wafer (or, more likely, stack of interconnected wafers) would add up to an end-to-end latency of 10 to 100 milliseconds.

If you then added pipelining, the latency might actually increase by a millisecond or two, but the aggregate throughput would be multiplied by N, the number of pipeline stages.

If you could increase the number of stages to the point that the clock cycle was a nanosecond... what would the economic value of this thing be? 100,000 separate streams at 10,000 tokens per second, multiplexed through it.

If you change it from being cast in silicon to a program that configures the silicon (like an FPGA, but far less clunky), I believe you get the future of LLM compute. Ever faster and wider lanes between compute and RAM are a dead end, a premature optimization.
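A toy model of that pipelining argument, using only the hypothetical numbers from the comment (not real hardware figures):

```python
# Toy throughput model for a fully pipelined "model cast in silicon" design.
# All figures are the comment's hypotheticals, not real hardware numbers.
end_to_end_latency_s = 0.010   # assumed 10 ms from token in to token out
clock_cycle_s = 1e-9           # assumed 1 ns per pipeline stage

stages_in_flight = end_to_end_latency_s / clock_cycle_s    # 10,000,000 stages
aggregate_tok_per_s = 1 / clock_cycle_s                    # one token retires per cycle

streams = 100_000                                          # comment's multiplexing example
per_stream_tok_per_s = aggregate_tok_per_s / streams       # 10,000 tok/s per stream
print(f"{stages_in_flight:.0e} tokens in flight, "
      f"{per_stream_tok_per_s:,.0f} tok/s each across {streams:,} streams")
```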

cubefox - 3 days ago

It sounds like Cerebras would be perfect for models with Mamba architecture, as those don't need a large KV cache for long contexts.

jug - 3 days ago

They better not cheat me with a quantized version!

the_arun - 3 days ago

IMHO innovations are waiting to happen. Unless we get similar speeds using commodity hardware / pricing, we are not there yet.

pr337h4m - 3 days ago

Quantization?

rsolva - 3 days ago

What would the energy use be for an average query when using large models at this speed?

adamtaylor_13 - 3 days ago

I’m a guy that simply runs Claude Code. How can I start toying around with this?

avnathan - 3 days ago

They pull these off pretty consistently!

mohsen1 - 3 days ago

K2 is also now available on Groq

https://console.groq.com/docs/model/moonshotai/kimi-k2-instr...

Very fun to see agents using those backends.

OxfordCommand - 3 days ago

Isn't Qwen Alibaba's family of models? What does Cerebras have to do with this? I'm lost.

skeezyboy - 3 days ago

But does it still produce potentially unreliable and hallucinatory output? I'd hate to see that feature go.

tonyhart7 - 3 days ago

Yeah but the price is uhh

iyerbalaji - 3 days ago

Amazing, this is blazing fast

poupou127 - 3 days ago

WOW

iyerbalaji - 3 days ago

[dead]