Previewing GPT‑5.6 Sol: a next-generation model
openai.com | 947 points by minimaxir 14 hours ago
System card: https://deploymentsafety.openai.com/gpt-5-6-preview
All: for comments on the policy side please go to this related thread: U.S. government will decide who gets to use GPT-5.6 - https://news.ycombinator.com/item?id=48690101

Easily the most interesting part of this announcement is buried in the second to last paragraph: "We're also launching GPT‑5.6 Sol on Cerebras at up to 750 tokens per second in July, bringing frontier intelligence to customers at unprecedented speed. Access will initially be limited to select customers as we expand capacity." 750 tokens/s on a frontier model is going to be extremely interesting. I doubt this new version is anything but a version bump in terms of capabilities, but if we can start getting these answers back faster, they end up being more useful. Just off the top of my head, I can think of the tedious task of finding certain functionality within a codebase. I usually can't beat an AI agent harness at this task today. If the AI model is 3x faster I have less of a chance.

https://mikeveerman.github.io/tokenspeed/?rate=750&mode=thin... This is what 750tps looks like, I guess.

That’s an awful visualization. I can skim code quite quickly, but not when it shows up one character at a time in a small window, modem style. At the very least, that site should draw out a full page, then start replacing that page with the next, starting from the top and working downwards, repeating each time it hits the bottom.

You get used to it. I don't even see the code. All I see is blonde.. brunette.. redhead.

Just to think what this will look like in a couple of years.

Hopefully like this (but smarter): https://chatjimmy.ai/

Why isn't the insane 13K TPS speed of this site closer to the top of the AI conversations? This is genuinely confusing to my senses.

The future is going to be so strange/neat/me unemployed.

The future is totally illegible to me. I love these AI models, but I feel like I'm going to be jobless within 10 years. Anomie is at an all time high right now.
> strange/neat/me unemployed
I'm not sure if that's what you were going for, but I read it as if it were written by The Board in the game Control, and found myself with the appropriate level of existential dread.

And I haven't played that game, so I read it in Ralph Wiggum's voice.. which also feels appropriate. I'm in danger.

Wow.. what?! How is this so fast?! Where can I read more?

Funnily enough, pasting your comment straight into Jimmy leads to a... funnily suboptimal answer that does not answer the question. As someone else already contributed, this is driven by a Canadian startup, Taalas, that basically makes chips that are LLMs, so everything is very fast but also baked into the chip. Once this kind of stuff is a commodity in like 10 years, our world will be very, very different.

Taalas HC1 AI uses Llama 3.1 8B, but takes up a massive 53B transistors and 815mm2 on TSMC N6 (nearly at the reticle limit of 858mm2). N2 is a little less than 3x as dense (110MTr/mm2 vs 313MTr/mm2). This chip would still be 272mm2 on N2, which is an eye-watering $30k/wafer, and bigger than a 9950x or Nvidia 5070. This just isn't feasible. Some of the latest-gen LLMs seem to have 5-10T parameters, or about 1000x more. I don't know that taping out just one chip makes economic sense, let alone the 300-1000 chips required for a cutting-edge model. Things like continuing education so your model knows about the latest NPM packages or world news is super important, but seems like it would require new chips. There are a TON of uses for 8B parameter models on the edge, but this is WAY too big to put on the edge of anything. Something like a 10mm2 100M parameter voice model might be feasible on the edge, but only for expensive devices, and most of those are on TSMC 28nm (up to 29MTr/mm2) or GF FDX22 (up to 40MTr/mm2), which would increase the AI chip to the point where it would absolutely dominate the BOM.
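The die-area arithmetic in the Taalas HC1 comment can be checked with a short sketch. It uses only the figures quoted in that comment and assumes, as a simplification, that die area scales inversely with the quoted peak logic density (real designs rarely hit peak density, and SRAM and analog shrink less). The function and constant names are illustrative only.

```python
# Figures as quoted in the comment; names are illustrative, not from any vendor spec.
TRANSISTORS = 53e9        # HC1 transistor count
AREA_N6_MM2 = 815.0       # HC1 die area on TSMC N6, in mm^2
PEAK_DENSITY_N6 = 110e6   # quoted peak density on N6, transistors per mm^2
PEAK_DENSITY_N2 = 313e6   # quoted peak density on N2, transistors per mm^2


def scaled_area(area_mm2: float, density_from: float, density_to: float) -> float:
    """Estimate die area after a node shrink.

    Assumption: area scales inversely with peak density, which is optimistic.
    """
    return area_mm2 * density_from / density_to


# Density the chip actually achieves on N6 (well below the quoted peak).
achieved_density = TRANSISTORS / AREA_N6_MM2

# Shrink using the quoted density ratio (313 / 110, a little under 3x).
area_n2 = scaled_area(AREA_N6_MM2, PEAK_DENSITY_N6, PEAK_DENSITY_N2)

# Shrink using a flat 3x, which is what reproduces the comment's 272mm2 figure.
area_flat_3x = AREA_N6_MM2 / 3

print(f"achieved N6 density: {achieved_density / 1e6:.1f} MTr/mm2")
print(f"N2 area at quoted density ratio: {area_n2:.1f} mm2")
print(f"N2 area at a flat 3x shrink: {area_flat_3x:.1f} mm2")
```

Under these assumptions the quoted density ratio gives roughly 286mm2 rather than 272mm2; the comment's number corresponds to rounding the shrink up to a clean 3x. Either way the conclusion about die size is unchanged.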
> Things like continuing education so your model knows about the latest NPM packages or world news is super important, but seems like it would require new chips.
They probably have a few ideas around that. Me, personally, I'd have one main expensive chip (replaced every 10 years, or whatever), with a secondary cheap chip in front of it that gets replaced every year or so. The secondary chip could act the way RAG does, or perhaps both chips together can act as LoRA. Either way, 99.999% of the knowledge is static; you just need to fine-tune the weights with that remaining 0.001% knowledge, which can be done using RAG or LoRA on a much smaller (thus cheaper) disposable chip.

The flash models have fallen in size, at least between DeepSeek models. Is there a limit to the shrinking capacity of the models?

Taalas https://taalas.com/the-path-to-ubiquitous-ai/
Previous HN discussion: https://news.ycombinator.com/item?id=47103661

Sometimes I visualize a setup like this [0], based on 2D art by Simon Stålenhag. Someone has their home robot sitting on a desk connected to their old PC with thick cabling, dumping endless lines of each subsystem's <think> logs to diagnose why it did something weird earlier in the day. Systems pushing 750+ tokens per second per subsystem might even be considered on the slow side for realtime tasks by then.

Probably not. Everyone will still need a lot of reasoning tokens and tool calls. Running the tests for every round is tiring but must be done.

For comparison, openrouter says opus 4.8 is ~55 tokens/s and fast mode is ~102. 750 tokens/s for their largest model is going to be nuts.

What about 15k tokens per second? [0] I remember looking at this earlier in the year and it being so fast that it feels fake. And, yes, this model is old - but still awesome for what it is.

I just tried it, and the answer is nonsense. I asked it something simple, list some good indie puzzle games, and half the answers are games that don't exist. Imo quality > speed.
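To put the token rates quoted in this thread side by side, here is a minimal sketch that converts each rate into wall-clock time for one response. The 3,000-token response length is a hypothetical value chosen for illustration, and the calculation ignores time to first token, tool calls, and network latency.

```python
def seconds_to_generate(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock seconds to stream `tokens` at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


# Rates as quoted by commenters in this thread (not independently measured).
QUOTED_RATES = {
    "~55 tok/s": 55,
    "~102 tok/s (fast mode)": 102,
    "750 tok/s": 750,
    "15k tok/s": 15_000,
}

RESPONSE_TOKENS = 3_000  # hypothetical response length, for illustration only

for label, rate in QUOTED_RATES.items():
    elapsed = seconds_to_generate(RESPONSE_TOKENS, rate)
    print(f"{label:>24}: {elapsed:6.1f} s")
```

At these quoted rates the same response drops from close to a minute to a few seconds, which is the gap the agent-harness comments are reacting to. It says nothing about answer quality, which is the other half of the argument here.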
It’s not just old, it’s also tiny and quantized. It’s llama 3.1 8b at 3/6-bit quant. This is the type of thing you can run on almost any device…

I get that, but not at 15k tokens/s.

But it’s irrelevant. 750 tokens/s on a full frontier model is useful. 15000 poor quality tokens is much less useful no matter how much scaffolding you put around it.

You are missing the point. This is a technology demonstration on prototype hardware, and no one intends it to be seriously useful. Their architecture has fundamental speed and efficiency advantages over GPUs or Cerebras. They expect to scale up to real LLMs by splitting a model layer-wise across several chips, which they can do without incurring any throughput penalty.

> They expect to scale up to real LLMs by splitting a model layer-wise across several chips, which they can do without incurring any throughput penalty.
I’ll patiently wait to see this in reality. Their demonstration hardware is a 250W chip that is enormous in die area for the model size. They’re making a lot of claims, but until they can deliver, it’s nearly vaporware in my view. I’d be happy to be proven wrong, but I think they’re going to run into hardware realities quite soon if they think they can just chain a bunch of chips together to achieve the same performance on larger sizes.

Why can't they do it? Jim Keller's company is also taking a different approach [0]. The simple fact that we think what we have now is scalable is basically what you are saying can't be done: "just chain a bunch of chips together to achieve the same performance on larger sizes". How do you think current architectures work? And what is being used today is all proprietary to one company!

I think you missed the point and don't understand / aren't considerate of SLM utility.

But I’m not missing the point. If you can run one frontier model at 750t/s, then you can probably run many many instances of an SLM in parallel at a rate that exceeds 15k/s. That’s kinda the point of the flash or ultrafast variants. And they’re on something much more modern than llama3.1.

Yes, you are missing the point.
1) It's a demo. [0]
2) It hasn't been updated for 4+ months.
You don't need LLMs for everything. That is 100% the point. You can burn down the world with all of your frontier LLMs that are being used for simple queries OR we can do something faster and more efficient like this. Just because you can run a SotA model at "fast" speeds, again, severely misses the point. And no, you can't run anything from Anthropic or OAI on-prem, so until you can there's really no comparison. If people want to continue down the path of gate-kept models with no other options then we'll all follow you off the cliff.
dang - 12 hours ago
gandreani - 13 hours ago
qznc - 9 hours ago
amluto - 4 hours ago
me-vs-cat - 2 hours ago
buddhistdude - 8 hours ago
OGWhales - 8 hours ago
kkotak - 3 hours ago
nomel - 7 hours ago
matheusmoreira - 9 minutes ago
falcor84 - 7 hours ago
mh- - 5 hours ago
niyazpk - 7 hours ago
fcsp - 7 hours ago
hajile - 2 hours ago
lelanthran - 39 minutes ago
HaloZero - an hour ago
ayewo - 29 minutes ago
accrual - 6 hours ago
cactusplant7374 - 4 hours ago
sberens - 13 hours ago
windexh8er - 11 hours ago
ehsankia - an hour ago
Kirby64 - 10 hours ago
windexh8er - 9 hours ago
Kirby64 - 9 hours ago
Legend2440 - 8 hours ago
Kirby64 - 8 hours ago
windexh8er - 7 hours ago
windexh8er - 9 hours ago
Kirby64 - 9 hours ago
windexh8er - 9 hours ago