Previewing GPT‑5.6 Sol: a next-generation model

947 points by minimaxir 14 hours ago

System card: https://deploymentsafety.openai.com/gpt-5-6-preview

All: for comments on the policy side please go to this related thread:

U.S. government will decide who gets to use GPT-5.6 - https://news.ycombinator.com/item?id=48690101

Easily the most interesting part of this announcement is buried in the second to last paragraph:

"We're also launching GPT‑5.6 Sol on Cerebras at up to 750 tokens per second in July, bringing frontier intelligence to customers at unprecedented speed. Access will initially be limited to select customers as we expand capacity."

750 tokens/s on a frontier model is going to be extremely interesting. I doubt this new version is anything but a version bump in terms of capabilities but if we can start getting these answers back faster, they end up being more useful.

Just off the top of my head, I can think of the tedious task of finding certain functionality within a codebase. I usually can't beat an AI agent harness at this task today. If the AI model is 3x faster, I have even less of a chance.

qznc - 9 hours ago

https://mikeveerman.github.io/tokenspeed/?rate=750&mode=thin...
This is what 750tps looks like, I guess.
- amluto - 4 hours ago
  
  That’s an awful visualization. I can skim code quite quickly, but not when it shows up one character at a time in a small window, modem style.
  At least that site should draw out a full page then start replacing that page with the next, starting from the top and working downwards, repeating each time it hits the bottom.
- me-vs-cat - 2 hours ago
  
  You get used to it. I don't even see the code. All I see is blonde.. brunette.. redhead.
- buddhistdude - 8 hours ago
  
  Just to think what this will look like in a couple of years.
  - OGWhales - 8 hours ago
    
    Hopefully like this (but smarter): https://chatjimmy.ai/
    
    kkotak - 3 hours ago
    
    Why is the insane speed of 13KTPS on this site not closer to the top of the AI conversations?
    
    Ey7NFZ3P0nzAe - 2 hours ago
    
    It's pretty well known by now.
    
    nomel - 7 hours ago
    
    This is genuinely confusing to my senses. The future is going to be so strange/neat/me unemployed.
    
    matheusmoreira - 9 minutes ago
    
    The future is totally illegible to me. I love these AI models, but I feel like I'm going to be jobless within 10 years.
    Anomie is at an all time high right now.
    
    falcor84 - 7 hours ago
    
    > strange/neat/me unemployed
    I'm not sure if that's what you were going for, but I read it as if it were written by The Board in the game Control, and found myself with the appropriate level of existential dread.
    
    mh- - 5 hours ago
    
    and I haven't played that game, so I read it in Ralph Wiggum's voice.. which also feels appropriate.
    I'm in danger.
    
    niyazpk - 7 hours ago
    
    Wow.. what?! How is this so fast?! Where can I read more?
    
    fcsp - 7 hours ago
    
    Funnily enough, pasting your comment straight into Jimmy leads to a... Funnily suboptimal answer that does not answer the question.
    As someone else already contributed, this is driven by a Canadian startup, Taalas, that basically makes chips that are LLMs, so everything is very fast but also baked into the chip. Once this kind of stuff is a commodity in like 10 years, our world will be very, very different.
    
    hajile - 2 hours ago
    
    Taalas HC1 AI uses Llama 3.1 8B, but takes up a massive 53B transistors and 815mm2 on TSMC N6 (nearly at the reticle limit of 858mm2). N2 is a little less than 3x as dense (110MTr/mm2 vs 313MTr/mm2).
    This chip would still be 272mm2 on N2 which is an eye-watering $30k/wafer and bigger than a 9950x or Nvidia 5070.
    This just isn't feasible. Some of the latest-gen LLMs seem to have 5-10T parameters or about 1000x more. I don't know that taping out just one chip makes economic sense let alone the 300-1000 chips required for a cutting-edge model. Things like continuing education so your model knows about the latest NPM packages or world news is super important, but seems like it would require new chips.
    There are a TON of uses for 8B parameter models on the edge, but this is WAY too big to put on the edge of anything. Something like a 10mm2 100M parameter voice model might be feasible on the edge, but only for expensive devices, and most of those are on TSMC 28nm (up to 29MTr/mm2) or GF 22FDX (up to 40MTr/mm2), which would increase the AI chip to the point where it would absolutely dominate the BOM.
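The die-shrink arithmetic in the comment above can be checked with a short sketch. It uses only the figures quoted there and assumes, optimistically, that area scales with the ratio of peak logic densities (real designs usually shrink less, since SRAM and analog blocks scale worse). The function names are made up for illustration.

```python
# Figures quoted in the comment above; nothing here is measured independently.
HC1_TRANSISTORS = 53e9        # transistor count
HC1_AREA_N6_MM2 = 815.0       # die area on TSMC N6, in mm2
N6_PEAK_MTR_PER_MM2 = 110.0   # quoted peak density for N6
N2_PEAK_MTR_PER_MM2 = 313.0   # quoted peak density for N2


def achieved_density_mtr_per_mm2(transistors: float, area_mm2: float) -> float:
    """Density the chip actually reaches, in millions of transistors per mm2."""
    return transistors / area_mm2 / 1e6


def scaled_area_mm2(area_mm2: float, old_density: float, new_density: float) -> float:
    """Area after a node change, ASSUMING it shrinks by the peak-density ratio."""
    return area_mm2 * old_density / new_density


if __name__ == "__main__":
    density = achieved_density_mtr_per_mm2(HC1_TRANSISTORS, HC1_AREA_N6_MM2)
    ratio = N2_PEAK_MTR_PER_MM2 / N6_PEAK_MTR_PER_MM2
    area_n2 = scaled_area_mm2(HC1_AREA_N6_MM2, N6_PEAK_MTR_PER_MM2, N2_PEAK_MTR_PER_MM2)
    print(f"achieved density on N6: {density:.0f} MTr/mm2")
    print(f"N2/N6 peak density ratio: {ratio:.2f}x")
    print(f"estimated area on N2: {area_n2:.0f} mm2")
```

With the quoted densities the ratio is about 2.85x, which gives roughly 286 mm2; the 272 mm2 figure corresponds to a flat 3x shrink. The conclusion is the same either way: the die stays large even on N2.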
    
    lelanthran - 39 minutes ago
    
    > Things like continuing education so your model knows about the latest NPM packages or world news is super important, but seems like it would require new chips.
    They probably have a few ideas around that. Me, personally, I'd have one main expensive chip (replaced every 10 years, or whatever), with a secondary cheap chip in front of it that gets replaced every year or so.
    The secondary chip could act the way RAG does, or perhaps both chips together can act as LoRA.
    Either way, 99.999% of the knowledge is static, you just need to fine-tune the weights with that remaining 0.001% knowledge, which can be done using RAG or LoRA on a much smaller (thus cheaper) disposable chip.
    
    HaloZero - an hour ago
    
    the flash models have fallen in size at least between deep seek models. Is there a limit to the shrinking capacity of the models?
    
    ayewo - 29 minutes ago
    
    Taalas https://taalas.com/the-path-to-ubiquitous-ai/
    Previous HN discussion: https://news.ycombinator.com/item?id=47103661
    
    dmd - 7 hours ago
    
    https://taalas.com/
    
    vitorgrs - an hour ago
    
    Not opening here... HN killed?
    
  - alienbaby - 6 hours ago
    
    I started with a 2400baud modem, I've seen how this goes
  - accrual - 6 hours ago
    
    Sometimes I visualize a setup like this [0], based on 2D art by Simon Stålenhag. Someone has their home robot sitting on a desk connected to their old PC with thick cabling, dumping endless lines of each subsystem's <think> logs to diagnose why it did something weird earlier in the day. Systems pushing 750+ tokens per second per subsystem might even be considered on the slow side for realtime tasks by then.
    [0] https://www.therookies.co/entries/39513
  - bredren - 4 hours ago
    
    Probably will not be looking at text like this in a few years.
  - cactusplant7374 - 4 hours ago
    
    Probably not. Everyone will still need a lot of reasoning tokens and tool calls. Running the tests for every round is tiring but must be done.
  - refulgentis - 8 hours ago
    
    Imagine a Beowulf cluster of these…
    
    noisy_boy - 5 hours ago
    
    That's a name I haven't heard in a while.
  - senectus1 - 6 hours ago
    
    probably something like this https://sb0xw.csb.app/
sberens - 13 hours ago

For comparison, openrouter says opus 4.8 is ~55 tokens/s and fast mode is ~102.
750 tokens/s for their largest model is going to be nuts
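To put those rates side by side, here is a minimal sketch that converts tokens per second into wall-clock time for one response. The rates are the ones quoted in this thread; the 2,000-token response length is a hypothetical value chosen for illustration, and the calculation ignores time-to-first-token, queueing, and network latency.

```python
# Rates quoted in this thread, in tokens per second (approximate).
RATES_TPS = {
    "opus 4.8 (openrouter)": 55,
    "opus 4.8 fast mode": 102,
    "GPT-5.6 Sol on Cerebras (up to)": 750,
}


def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Pure decode time for num_tokens at a steady generation rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    response_tokens = 2000  # hypothetical response length
    for name, rate in RATES_TPS.items():
        print(f"{name}: {seconds_to_generate(response_tokens, rate):.1f} s")
```

For a response of that length, the three rates work out to roughly 36 s, 20 s, and under 3 s.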
- windexh8er - 11 hours ago
  
  What about 15k tokens per second? [0] I remember looking at this earlier in the year and it being so fast that it feels fake. And, yes, this model is old - but still awesome for what it is.
  [0] https://chatjimmy.ai/
  - ehsankia - an hour ago
    
    I just tried it, and the answer is nonsense.
    I asked it something simple, list some good indie puzzle games, and half the answers are games that don't exist. Imo quality > speed.
  - Kirby64 - 10 hours ago
    
    It’s not just old, it’s also tiny and quantized. It’s llama 3.1 8b at 3/6-bit quant. This is the type of thing you can run on almost any device…
    
    windexh8er - 9 hours ago
    
    I get that, but not at 15k tokens/s.
    
    Kirby64 - 9 hours ago
    
    But it’s irrelevant. 750 tokens/s on a full frontier model is useful. 15000 poor quality tokens is much less useful no matter how much scaffolding you put around it.
    
    Legend2440 - 8 hours ago
    
    You are missing the point. This is a technology demonstration on prototype hardware, and no one intends it to be seriously useful.
    Their architecture has fundamental speed and efficiency advantages over GPUs or Cerebras. They expect to scale up to real LLMs by splitting a model layer-wise across several chips, which they can do without incurring any throughput penalty.
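For readers unfamiliar with layer-wise splitting, here is a toy model of the general idea (pipeline parallelism). It is not a model of Taalas hardware: the stage latencies are invented, each chip is assumed to handle one token at a time, and chip-to-chip transfer time is assumed to be folded into each stage's latency.

```python
def pipeline_rates(stage_latencies_s, sequences_in_flight):
    """Idealized (per_sequence_tps, aggregate_tps) for a layer-wise pipeline.

    Each stage is one chip holding a contiguous group of layers. In
    autoregressive decoding, a sequence cannot start token t+1 until token t
    has left the last stage, so one sequence alone is limited by the SUM of
    the stage latencies. With several independent sequences in flight, the
    stages overlap and the aggregate rate is capped by the SLOWEST stage.
    """
    if sequences_in_flight < 1:
        raise ValueError("need at least one sequence in flight")
    token_latency = sum(stage_latencies_s)   # one token through every stage
    bottleneck = max(stage_latencies_s)      # slowest single stage
    aggregate = min(sequences_in_flight / token_latency, 1.0 / bottleneck)
    return aggregate / sequences_in_flight, aggregate


if __name__ == "__main__":
    stages = [0.00025] * 4  # hypothetical: 4 chips, 0.25 ms each
    for k in (1, 4, 8):
        per_seq, total = pipeline_rates(stages, k)
        print(f"{k} in flight: {per_seq:.0f} tok/s each, {total:.0f} tok/s total")
```

In this toy setup, adding chips does not reduce aggregate throughput once enough sequences are in flight, but a single sequence still pays the full latency of every stage per token. Whether real hardware can keep that per-stage latency small is what the rest of this subthread argues about.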
    
    Kirby64 - 8 hours ago
    
    > They expect to scale up to real LLMs by splitting a model layer-wise across several chips, which they can do without incurring any throughput penalty.
    I’ll patiently wait to see this in reality. Their demonstration hardware is a 250W chip that is enormous in die area for the model size. They’re making a lot of claims, but until they can deliver then it’s nearly vaporware in my view.
    I’d be happy to be proven wrong, but I think they’re going to run into hardware realities quite soon if they think they can just chain a bunch of chips together to achieve the same performance on larger sizes.
    
    windexh8er - 7 hours ago
    
    Why can't they do it? Jim Keller's company is also taking a different approach [0].
    The simple fact that we think what we have now is scalable is basically what you are saying can't be done: "just chain a bunch of chips together to achieve the same performance on larger sizes". How do you think current architectures work? And what is being used today is all proprietary to one company!
    [0] https://tenstorrent.com/solutions/llm-inference
    
    windexh8er - 9 hours ago
    
    I think you missed the point and don't understand / aren't considerate of SLM utility.
    
    Kirby64 - 9 hours ago
    
    But I’m not missing the point. If you can run one frontier model at 750t/s, then you can probably run many many instances of an SLM in parallel at a rate that exceeds 15k/s. That’s kinda the point of the flash or ultrafast variants. And they’re on something much more modern than llama3.1.
    
    windexh8er - 9 hours ago
    
    Yes, you are missing the point. 1) It's a demo. [0] 2) It hasn't been updated for 4+ months.
    You don't need LLMs for everything. That is 100% the point. You can burn down the world with all of your frontier LLMs that are being used for simple queries OR we can do something faster and more efficient like this. Just because you can run a SotA model at "fast" speeds, again, severely misses the point.
    And no, you can't run anything from Anthropic or OAI on-prem, so until you can there's really no comparison. If people want to continue down the path of gate-kept models with no other options then we'll all follow you off the cliff.
    [0] https://taalas.com/products/