Cerebras launches Qwen3-235B, achieving 1.5k tokens per second

cerebras.ai

363 points by mihau 3 days ago


mehdibl - 3 days ago

It seems this news is "outdated": it's from Jul 8 and may have been picked up in confusion with yesterday's Qwen3 Coder 405B release, which has different specs.

aurareturn - 3 days ago

If this is the full fp16 quant, you'd need 2TB of memory to use with the full 131k context.

With 44GB of SRAM per Cerebras chip, you'd need 45 chips chained together. $3m per chip. $135m total to run this.

For comparison, you can buy a DGX B200 with 8x B200 Blackwell chips and 1.4TB of memory for around $500k. Two systems would give you 2.8TB memory which is enough for this. So $1m vs $135m to run this model.
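A minimal sketch of that back-of-the-envelope comparison, treating the 2TB, 44GB/chip, and price figures above as given assumptions:

```python
# Back-of-the-envelope check of the figures above; every input here is an
# assumption from the comment, not a measured number.
model_memory_gb = 2000          # assumed fp16 weights + full 131k-context KV cache
sram_per_chip_gb = 44           # on-wafer SRAM per Cerebras chip
chip_price_usd = 3_000_000      # assumed price per chip

chips_needed = model_memory_gb / sram_per_chip_gb            # ~45 chips
cerebras_cost = chips_needed * chip_price_usd                # ~$136M

dgx_price_usd = 500_000         # DGX B200 with 8x B200, 1.4 TB memory
dgx_systems = 2                 # 2.8 TB total, enough headroom for 2 TB
dgx_cost = dgx_systems * dgx_price_usd                       # $1M

print(f"{chips_needed:.0f} chips, ~${cerebras_cost/1e6:.0f}M vs ${dgx_cost/1e6:.1f}M")
```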

It's not very scalable unless you have some ultra-high-value task that needs super fast inference. Maybe hedge funds or some sort of financial-markets application?

PS. The reason I think we're only at the beginning of the AI boom is that I can't imagine what we could build if we could run models as good as Claude Opus 4 (or even better) at 1500 tokens/s, very cheaply, with tens of millions of context tokens. We're still a few generations of hardware away, I'm guessing.

0vermorrow - 3 days ago

I'm eagerly awaiting Qwen3 Coder becoming available on Cerebras.

I run plenty of agent loops and the speed makes a somewhat interesting difference in time "compression". Having a Claude 4 Sonnet-level model running at 1000-1500 tok/s would be extremely impressive.

To FEEL THE SPEED, you can try it yourself on the Cerebras Inference page, through their API, or, for example, on Mistral's Le Chat with their "Flash Answers" (powered by Cerebras). Iterating on code at 1000 tok/s makes it feel even more magical.

sneilan1 - 3 days ago

So I installed the litellm proxy, pointed it at the new Cerebras API with Qwen3-235B, and hooked Aider up to litellm. This is not as good as Claude Code yet, but it's so much faster. I even tried feeding the leaked Claude Code prompt into Aider, but it doesn't do what I expect. Still worth trying, but I learned that Claude Code's prompt is very specific to Claude. I think this is very promising, however! Aider basically spat out a bunch of text, installed some stuff, made some web calls & exited. WAS REALLY FAST LOL.

You can repeat my experiment quickly with the following:

config.yaml for litellm:

```
model_list:
  - model_name: qwen3-235b
    litellm_params:
      model: cerebras/qwen-3-235b-a22b
      api_key: os.environ/CEREBRAS_API_KEY
      api_base: https://api.cerebras.ai/v1
```

Run litellm with (you may need to install litellm[proxy]):

```
litellm --config config.yaml --port 4000 --debug
```

Start Aider with:

```
aider --model cerebras/qwen-3-235b-a22b --openai-api-base http://localhost:4000 --openai-api-key fake-key --no-show-model-warnings --auto-commits --system-file ./prompt.txt --yes
```

Install whatever you need with pip, etc. prompt.txt contains the leaked Claude Code prompt, which you can find on the internet.
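If you want to sanity-check that the proxy is wired up before (or instead of) pointing Aider at it, here's a minimal smoke test against litellm's OpenAI-compatible endpoint. The model name must match the model_name in config.yaml; the key is a placeholder since litellm holds the real Cerebras key:

```python
# Minimal smoke test against the local litellm proxy (OpenAI-compatible API).
# Assumes `litellm --config config.yaml --port 4000` is already running.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="fake-key")

resp = client.chat.completions.create(
    model="qwen3-235b",  # must match model_name in config.yaml
    messages=[{"role": "user", "content": "Reply with the word: pong"}],
)
print(resp.choices[0].message.content)
```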

doctoboggan - 3 days ago

Has anyone with a lot of experience with Claude Code and Sonnet 4 tried Claude Code with Qwen3-Coder? The speeds Cerebras enables here are enticing, but I wouldn't trade a speedup for a worse-quality model.

vadepaysa - 3 days ago

While the speeds are great, in my experience with Cerebras it's really hard to get any actual production-level rate limits or token allocations. We can't design systems around them, so we use other vendors.

We've spoken to their sales teams, and we've been told no.

nisten - 3 days ago

"Full 131k" context , actually the full context is double that at 262144 context and with 8x yarn mutiplier it can go up to 2million. It looks like even full chip scale Cerebras has trouble with context length, well, this is a limitation of the transformer architechture itself where memory requirements scale ~linearly and compute requirements roughly quadratically with the increase in kv cache.

Anyway, YOU'RE NOT SERVING FULL CONTEXT, CEREBRAS, YOU'RE SERVING HALF. Also, what quantization exactly is this? Can customers know?
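For reference, a rough KV-cache sizing sketch showing the linear-in-context memory growth described above (the layer/head dimensions are illustrative placeholders, not confirmed Qwen3-235B values):

```python
# Illustrative KV-cache estimate: memory grows linearly with context length.
# The model dimensions below are placeholder assumptions, not official specs.
def kv_cache_gib(context_len, layers=94, kv_heads=4, head_dim=128, bytes_per_val=2):
    # 2x for keys and values; bytes_per_val=2 assumes an fp16/bf16 cache
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_val / 2**30

for ctx in (131_072, 262_144, 2_000_000):
    print(f"{ctx:>9} tokens -> ~{kv_cache_gib(ctx):.0f} GiB per sequence")
```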

pjs_ - 3 days ago

Cerebras is truly one of the maddest technical accomplishments that Silicon Valley has produced in the last decade or so. I met Andy seven or eight years ago and I thought they must have been smoking something - a dinner-plate-sized chip with six tons of clamping force? They made it real, and in retrospect what they did was incredibly prescient.

doubtfuluser - 3 days ago

Very impressive speed. A bit OT: what is the current verdict on Qwen, Kimi, et al. when it comes to censorship/bias concerning narratives not allowed in their country of origin?

mehdibl - 3 days ago

Would be great if they supported the latest Qwen 3 405B, launched yesterday and aimed more at agentic work/coding.

westurner - 3 days ago

> Qwen3-235B uses an efficient mixture-of-experts architecture that delivers exceptional compute efficiency, enabling Cerebras to offer the model at $0.60 per million input tokens and $1.20 per million output tokens—less than one-tenth the cost of comparable closed-source models.

  $0.60 / million input tokens
  $1.20 / million output tokens

How many minutes of 4K YouTube HDR video is that equivalent to in kWh of energy usage?

> Concurrent with this launch, Cerebras has quadrupled its context length support from 32K to 131K tokens—the maximum supported by Qwen3-235B.
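For scale, a quick sketch of what a single request costs at the quoted rates (the request sizes below are made up for illustration):

```python
# Per-request cost at the quoted Cerebras pricing; token counts are illustrative.
input_price_per_tok = 0.60 / 1_000_000    # $ per input token
output_price_per_tok = 1.20 / 1_000_000   # $ per output token

input_tokens, output_tokens = 10_000, 1_500   # hypothetical request size
cost = input_tokens * input_price_per_tok + output_tokens * output_price_per_tok
print(f"${cost:.4f} per request")             # $0.0078
```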

bluelightning2k - 3 days ago

This is (slightly) old news from July 8, resurfaced due to the Qwen3 Coder release.

I think the gist of this thread is entirely: "please do the same for Qwen 3 coder", with us all hoping for:

a) A viable alternative to Sonnet 3
b) Specifically, a faster and cheaper alternative

poly2it - 3 days ago

Very impressive speed. With a context window of 40K, however, usability is limited.

p0w3n3d - 3 days ago

I'm looking for a setup for local development with a local Qwen on my MacBook. I tried localforge with mlx_lm.server, but they failed to communicate (I saw a proof of concept on their page, but now it seems to fail with an "empty response" that in reality is not empty).

Could anyone recommend a solution?

rbanffy - 3 days ago

Who remembers “wafer scale integration” from the 1980s?

Insane that Cerebras succeeded where everyone else failed for 5 decades.

Inviz - 3 days ago

I contacted their sales team before; Cerebras started at $1500 a month at that time, and the limits were soooooo small. Did it get better?

Edit: Looks like it did. They both introduced pay-as-you-go and have prepaid limits too at $1500. I wonder if they have any limitations on parallel execution for pay-as-you-go...

rafaelero - 3 days ago

If they do the same for the coding model they will have a killer product.

cedws - 3 days ago

With this kind of speed you could build a large thinking stage into every response. What kind of improvement could you expect in benchmarks from having, say, 1,000 tokens of thinking for every response?
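The latency side of that is easy to estimate at the headline speed; the benchmark gain is the open question. A quick sketch with the hypothetical numbers from the question:

```python
# Wall-clock cost of a fixed thinking budget at a given decode speed.
tokens_per_second = 1_500   # headline Cerebras figure for Qwen3-235B
thinking_tokens = 1_000     # hypothetical per-response thinking budget
answer_tokens = 300         # illustrative visible answer length

extra_latency = thinking_tokens / tokens_per_second               # ~0.67 s
total_latency = (thinking_tokens + answer_tokens) / tokens_per_second
print(f"+{extra_latency:.2f}s for thinking, {total_latency:.2f}s total")
```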

mikewarot - 3 days ago

It's quite possible this is getting near the upper limit of what's possible with current compute architectures. Let's say the limit is 10k tokens/second with Qwen3-235B.

There's always going to be some latency in any compute architecture. Assume some insane billionaire cast the entire Qwen3-235B model into silicon, so it all ran in parallel: tokens going in one end and the next token coming out the other. This wafer (or, more likely, stack of interconnected wafers) would add up to an end-to-end latency of 10 to 100 milliseconds.

If you then added pipelining, the latency might actually increase by a millisecond or two, but the aggregate throughput would be multiplied by N, the number of pipeline stages.

If you could increase the number of stages to the point that the clock cycle was a nanosecond... what would the economic value of this thing be? 100,000 separate streams at 10,000 tokens per second, multiplexed through it.

If you change it from being cast in silicon to a program that configures the silicon (like an FPGA, but far less clunky), I believe you get the future of LLM compute. Ever faster and wider lanes between compute and RAM are a dead end, a premature optimization.
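A toy model of that pipelining argument, using only the hypothetical numbers from the comment (not real hardware figures):

```python
# Toy throughput model for a fully pipelined "model cast in silicon" design.
# All figures are the comment's hypotheticals, not real hardware numbers.
end_to_end_latency_s = 0.010   # assumed 10 ms from token in to token out
clock_cycle_s = 1e-9           # assumed 1 ns per pipeline stage

stages_in_flight = end_to_end_latency_s / clock_cycle_s    # 10,000,000 stages
aggregate_tok_per_s = 1 / clock_cycle_s                    # one token retires per cycle

streams = 100_000                                          # comment's multiplexing example
per_stream_tok_per_s = aggregate_tok_per_s / streams       # 10,000 tok/s per stream
print(f"{stages_in_flight:.0e} tokens in flight, "
      f"{per_stream_tok_per_s:,.0f} tok/s each across {streams:,} streams")
```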

cubefox - 3 days ago

It sounds like Cerebras would be perfect for models with Mamba architecture, as those don't need a large KV cache for long contexts.

jug - 3 days ago

They better not cheat me with a quantized version!

the_arun - 3 days ago

IMHO innovations are waiting to happen. Unless we get similar speeds using commodity hardware / pricing, we are not there yet.

pr337h4m - 3 days ago

Quantization?

rsolva - 3 days ago

What would the energy use be for an average query when using large models at this speed?

adamtaylor_13 - 3 days ago

I’m a guy that simply runs Claude Code. How can I start toying around with this?

avnathan - 3 days ago

They pull these off pretty consistently!

mohsen1 - 3 days ago

K2 is also now available on Groq

https://console.groq.com/docs/model/moonshotai/kimi-k2-instr...

Very fun to see agents using those backends.

OxfordCommand - 3 days ago

Isn't Qwen Alibaba's family of models? What does Cerebras have to do with this? I'm lost.

skeezyboy - 3 days ago

But does it still produce potentially unreliable and hallucinatory output? I'd hate to see that feature go.

tonyhart7 - 3 days ago

Yeah but the price is uhh

iyerbalaji - 3 days ago

Amazing, this is blazing fast

poupou127 - 3 days ago

WOW

iyerbalaji - 3 days ago

[dead]