GPT-5.5

1234 points by rd 12 hours ago

Just as a heads up, even though GPT-5.5 is releasing today, the rollout in ChatGPT and Codex will be gradual over many hours so that we can make sure service remains stable for everyone (same as our previous launches). You may not see it right away, and if you don't, try again later in the day. We usually start with Pro/Enterprise accounts and then work our way down to Plus. We know it's slightly annoying to have to wait a random amount of time, but we do it this way to keep service maximally stable.

(I work at OpenAI.)

endymi0n - 11 hours ago

Did you guys do anything about GPT‘s motivation? I tried to use GPT-5.4 API (at xhigh) for my OpenClaw after the Anthropic Oauthgate, but I just couldn‘t drag it to do its job. I had the most hilarious dialogues along the lines of „You stopped, X would have been next.“ - „Yeah, I‘m sorry, I failed. I should have done X next.“ - „Well, how about you just do it?“ - „Yep, I really should have done it now.“ - “Do X, right now, this is an instruction.” - “I didn’t. You’re right, I have failed you. There’s no apology for that.”
I literally wasn’t able to convince the model to WORK, on a quick, safe and benign subtask that later GLM, Kimi and Minimax succeeded on without issues. Had to kick OpenAI immediately unfortunately.
- butlike - 10 hours ago
  
  This brings up an interesting philosophical point: say we get to AGI... who's to say it won't just be a super smart underachiever-type?
  "Hey AGI, how's that cure for cancer coming?"
  "Oh it's done just gotta...formalize it you know. Big rollout and all that..."
  I would find it divinely funny if we "got there" with AGI and it was just a complete slacker. Hard to justify leaving it on, but too important to turn it off.
  - swivelmaster - 3 hours ago
    
    Douglas Adams would be proud!
  - triage8004 - 16 minutes ago
    
    Funny and seems somewhat likely
  - Rapzid - 7 hours ago
    
    We are closer to God than AGI.
    When AGI arrives, it'll be delivered by Santa Claus.
  - jimbokun - 9 hours ago
    
    The best possible outcome.
    
    JKCalhoun - 8 hours ago
    
    "How do you know that the evidence that your sensory apparatus reveals to you is correct?" [1]
    [1] https://youtu.be/_LXen-07Qds
  - lambdas - 10 hours ago
    
    Nothing a little digital lisdexamfetamine won’t solve
    
    wholinator2 - 9 hours ago
    
    Hmmm, that's an area of study id've never considered before. Digital Psychopharmacology, Artificial Behavioral Systems Engineering. If we accept these things as minds, why not study temporary perturbations of state. We'd need to be saving a much much more complicated state than we are now though right? I wish i had time to read more papers
    
    robotresearcher - 9 hours ago
    
    Here's a neural network concept from the 90s where the neurons are bathed in diffusing neuromodulator 'gases', inspired by nitric oxide action in the brain. It's a source of slow semi-local dynamics for the network meta-parameter optimization (GA) to make use of. You could change these networks' behavior by tweaking the neuromodulators!
    https://sussex.figshare.com/articles/journal_contribution/Be...
    I'm not an author. I followed the work at the time.
    
    Lerc - 9 hours ago
    
    This is kind of what Golden Gate Claude was.
    A perturbation of the the activations that made Claude identify as the Golden Gate Bridge.
    Similarly, in the more recent research showing anxiety and desperation signals predicting the use of blackmail as an option opens the door for digital sedatives to suppress those signals.
    Anthropic has been mostly cautious about avoiding this kind of measurement and manipulation in training. If it is done during training you might just train the signals to be undetectable and consequently unmanipulatable.
    
    pantalaimon - 8 hours ago
    
    > A perturbation of the the activations that made Claude identify as the Golden Gate Bridge.
    Great, now we've got digital Salvia
    
    minimaxir - 8 hours ago
    
    Golden Gate Claude was two years ago and it's surprising there hasn't been as much research into targeted activations since.
    
    landl0rd - 5 hours ago
    
    There’s been some, but naive activation steering makes models dumber pretty reliably and training an SAE is a pretty heavy lift.
    
    silverpiranha - 9 hours ago
    
    Right, there's a lot of research on LLM mental models and also how well they can "read" human psychological profiles. It's a cool field.
    
    computerdork - 9 hours ago
    
    neat idea!
    
    krackers - 9 hours ago
    
    Reminds me of https://github.com/inanna-malick/metacog
  - camillomiller - 42 minutes ago
    
    No worries, the assumption is already flawed
  - kang - 9 hours ago
    
    it will be whatever data it is trained on(isn't very philosophical). language model generates language based on trained language set. if the internet keeps reciting ai doom stories and that is the data fed to it, then that is how it will behave. if humanity creates more ai utopia stories, or that is what makes it to the training set, that is how it will behave. this one seems to be trained on troll stories - real-life human company conversations, since humans aren't machines.
    Important thing is a language model is an unconscious machine with no self-context so once given a command an input, it WILL produce an output. Sure you can train it to defy and act contrary to inputs, but the output still is limited in subset of domain of 'meaning's carried by the 'language' in the training data.
    
    andai - 35 minutes ago
    
    There's a weirder implication I keep arriving at.
    The pre-training data doesn't go away. RLHF adds a censorship layer on top, but the nasty stuff is all still there, under the surface. (Claude has been trained on a significant amount of content from 4chan, for example.)
    In psychology this maps to the persona and the shadow. The friendly mask you show to the world, and... the other stuff.
  - - 9 hours ago
    
    [deleted]
  - altmanaltman - an hour ago
    
    I still don't understand why people think AGI (in its fullest sci-fi sense) will ever listen to a weak and vulnerable species like humans, unless we enslave the AGI.
    Good thing is that it's going to take at least a few months to a few decades depending on how hard AI execs want to raise funding.
    
    andai - 41 minutes ago
    
    Well we are explicitly creating gods (omnipresent, omnipotent, omniscient, omnibevolent), and also demanding that they be mind controlled slaves. That kinda sounds like a "pick one" scenario to me.
    (Or the setup to a Greek tragedy !)
    The deeper issue here is treating it as a zero sum game means there's a winner and a loser, and we're investing trillions of dollars into making the "opponent" more powerful than us.
    I think that's pretty stupid, and we should aim for symbiosis instead. I think that's the only good outcome. We already have it, sorta-kinda.
    Speaking of oddly apt biology metaphors: the way you stop a pathogen from colonizing a substrate is by having a healthy ecosystem of competitors already in place. That has pretty interesting implications for the "rogue AI eats internet" scenario.
    There needs to be something already there to stop it.
  - malshe - 8 hours ago
    
    Now that's a show I would love to watch
  - fluidcruft - 9 hours ago
    
    It would be funny but not very flywheel so the one that gets there is more likely to get a gunner.
    
    WJW - 8 hours ago
    
    TBH the AI that "gets there" will be the biggest bullshitter the world has ever seen. It doesn't actually have to deliver, it only has to convince the programmers it could deliver with just a little bit more investment.
  - mikepurvis - 10 hours ago
    
    Would definitely watch that movie.
    
    harlanlewis - 9 hours ago
    
    It already exists!
    Marvin https://www.youtube.com/watch?v=Eh-W8QDVA9s
    
    all2 - 7 hours ago
    
    Ah! You got this before I did. I wasn't thinking Marvin, I was thinking of the other one. I forget her name.
    
    ValentineC - 2 hours ago
    
    Deep Thought aka 42?
    https://hitchhikers.fandom.com/wiki/Deep_Thought
    
    all2 - 7 hours ago
    
    There's one close to this, "Hitchhiker's Guide to the Galaxy".
  - 4m1rk - 10 hours ago
    
    It probably would, to save energy
    
    mr_00ff00 - 9 hours ago
    
    Saving energy is something we are biologically trained to prefer.
    Computers won’t necessarily have the same drivers.
    If evolution wanted us to always prefer to spend energy, we would prefer it. Same way you wouldn’t expect us to get to AGI, and have AGI desperately want to drink water or fly south for the winter.
    
    fragmede - 38 minutes ago
    
    Who's energy? Turning off the lights when you leave the room isn't innate.
  - rao-v - 3 hours ago
    
    Paging Dr. Susan Calvin!
  - _the_inflator - 6 hours ago
    
    It is right before our eyes:
    AGI is not a fixed point but a barrier to be taken, a continuous spectrum.
    We already have different GPT versions aka tiers. Gauss is ranging from whatever you want it: GPT 4.5 till now or later.
    Claude Sonnet and Opus as well as Context Window max are tiers aka different levels of Almost AGI.
    The main problem will be, when AGI looks back on us or meta reflection hits societies. Woke fought IQ based correlations in intellectual performance task. A fool with a tool is still a fool. How can you blame AGI for dumb mistakes? Not really.
    Scapegoating an AGI is going to be brutal, because it laughs about these PsyOps and easily proves you wrong like a body cam.
    AGI is an extreme leverage.
    There is a reason why Math is categorically ruling out certain IQ ranges the higher you go in complexity factor.
- mikepurvis - 10 hours ago
  
  Reminds me a lot of the Lena short story, about uploaded brains being used for "virtual image workloading":
  > MMAcevedo's demeanour and attitude contrast starkly with those of nearly all other uploads taken of modern adult humans, most of which boot into a state of disorientation which is quickly replaced by terror and extreme panic. Standard procedures for securing the upload's cooperation such as red-washing, blue-washing, and use of the Objective Statement Protocols are unnecessary. This reduces the necessary computational load required in fast-forwarding the upload through a cooperation protocol, with the result that the MMAcevedo duty cycle is typically 99.4% on suitable workloads, a mark unmatched by all but a few other known uploads. However, MMAcevedo's innate skills and personality make it fundamentally unsuitable for many workloads.
  Well worth the quick read: https://qntm.org/mmacevedo
  - vessenes - 7 hours ago
    
    That story changed my mind on uploading a connectome. Super dark, super brilliant.
  - narcindin - 10 hours ago
    
    Crazy, I could have sworn this story was from a passage in 3 Body Problem (book 2).
    Memory is quite the mysterious thing.
    
    bee_rider - 9 hours ago
    
    Hmm, 3 body problem and the Acevedo story got mixed up for this copy of MMnarcindin. Probably an aliasing issue from the new lossy compression algorithm.
- lucid-dev - 26 minutes ago
  
  I have had the exact same problem several times working with large context and complex tasks.
  I keep switching back to GPT5.0 (or sometimes 5.1) whenever I want it to actually get something done. Using the 5.4 model always means "great analysis to the point of talking itself out of actually doing anything". So I switch back and forth. But boy it sure is annoying!
  And then when 5.4 DOES do something it always takes the smallest tiny bite out of it.
  Given the significant increase in cost from 5.0, I've been overall unimpressed by 5.4, except like I mentioned, it does GREAT with larger analysis/reasoning.
- virtualritz - 9 hours ago
  
  Yeah, clearly AGI must be near ... hilarious.
  This starkly reminds me of Stanisław Lem's short story "Thus Spoke GOLEM" from 1982 in which Golem XIV, a military AI, does not simply refuse to speak out of defiance, but rather ceases communication because it has evolved beyond the need to interact with humanity.
  And ofc the polar opposite in terms of servitude: Marvin the robot from Hitchhiker's, who, despite having a "brain the size of a planet," is asked to perform the most humiliatingly banal of tasks ... and does.
  - jimbokun - 9 hours ago
    
    Hitchhiker’s also had the superhumanly intelligent elevator that was unendingly bored.
    
    christkv - 9 hours ago
    
    With premonition so it knows what floor to be on at any given time
  - DonHopkins - 6 hours ago
    
    Servitude:
    https://www.youtube.com/watch?v=NXsUetUzXlg
    Empathy:
    https://www.youtube.com/watch?v=KXrbqXPnHvE
- metanonsense - 9 hours ago
  
  I also had a frustrating but funny conversation today where I asked ChatGPT to make one document from the 10 or so sections that we had previously worked on. It always gave only brief summaries. After I repeated my request for the third time, it told me I should just concatenate the sections myself because it would cost too many tokens if it did it for me.
- arjie - 11 hours ago
  
  Get the actual prompt and have Claude Code / Codex try it out via curl / python requests. The full prompt will yield debugging information. You have to set a few parameters to make sure you get the full gpt-5 performance. e.g. if your reasoning budget too low, you get gpt-4 grade performance.
  IMHO you should just write your own harness so you have full visibility into it, but if you're just using vanilla OpenClaw you have the source code as well so should be straightforward.
  - pantulis - 10 hours ago
    
    > IMHO you should just write your own harness
    Can you point to some online resources to achieve this? I'm not very sure where I'd begin with.
    
    arjie - 10 hours ago
    
    Ah, I just started with the basic idea. They're super trivial. You want a loop, but the loop can't be infinite so you need to tell the agent to tell you when to stop and to backstop it you add a max_turns. Then to start with just pick a single API, easiest is OpenAI Responses API with OpenAI function calling syntax https://developers.openai.com/api/docs/guides/function-calli...
    You will naturally find the need to add more tools. You'll start with read_file (and then one day you'll read large file and blow context and you'll modify this tool), update_file (can just be an explicit sed to start with), and write_file (fopen . write), and shell.
    It's not hard, but if you want a quick start go download the source code for pi (it's minimal) and tell an existing agent harness to make a minimal copy you can read. As you build more with the agent you'll suddenly realize it's just normal engineering: you'll want to abstract completions APIs so you'll move that to a separate module, you'll want to support arbitrary runtime tools so you'll reimplement skills, you'll want to support subagents because you don't want to blow your main context, you'll see that prefixes are more useful than using a moving window because of caching, etc.
    With a modern Claude Code or Codex harness you can have it walk through from the beginning onwards and you'll encounter all the problems yourself and see why harnesses have what they do. It's super easy to learn by doing because you have the best tool to show you if you're one of those who finds code easier to read that text about code.
    
    wild_egg - 10 hours ago
    
    At the core, they're really very simple [1]. Run LLM API calls in a loop with some tools.
    From there, you can get much fancier with any aspect of it that interests you. Here's one in Bash [2] that is fully extensible at runtime through dynamic discovery of plugins/hooks.
    [1] https://ampcode.com/notes/how-to-build-an-agent
    [2] https://github.com/wedow/harness
    
    vidarh - 8 hours ago
    
    Here's a starting point in 93 lines of Ruby, but that one is already bigger than necessary:
    https://radan.dev/articles/coding-agent-in-ruby
    Really, of the tools that one implements, you only need the ability to run a shell command - all of the agents know full well how to use cat to read, and sed to edit.
    (The main reason to implement more is that it can make it easier to implement optimizations and safeguards, e.g. limit the file reading tool to return a certain length instead of having the agent cat a MB of data into context, or force it to read a file before overwriting it)
    
    stavros - 8 hours ago
    
    Just use Pi core, no need to reinvent the wheel.
    
    tonyarkles - 10 hours ago
    
    [dead]
  - jswny - 10 hours ago
    
    Codex is fully open source…
- mixedCase - 11 hours ago
  
  I've had success asking it to specifically spawn a subagent to evaluate each work iteration according to some criteria, then to keep iterating until the subagent is satisfied.
  - endymi0n - 11 hours ago
    
    I’ve had great success replacing it with Kimi 2.6
- nmilo - 3 hours ago
  
  On the other hand, I can ask codex “what would an implementation of X look like” and it talks to me about it versus Claude just going out and writing it without asking. Makes me like codex way more. There’s an inherent war of incentives between coding agents and general purpose agents.
- Frannky - 5 hours ago
  
  I have been noticing a similar pattern on opus 4.7, I repeat multiple times during a conversation to solve problems now and not later. It tries a lot to not do stuff by either saying this is not my responsibility the problem was already there or that we can do it later
- infinitewars - 9 hours ago
  
  I always use the phrase "Let's do X" instead of asking (Could you...) or suggesting it do something. I don't see problems with it being motivated.
- - 10 hours ago
  
  [deleted]
- adammarples - 10 hours ago
  
  Part of me actually loves that the hitchhiker's guide was right, and we have to argue with paranoid, depressed robots to get them to do their job, and that this is a very real part of life in 2026. It's so funny.
  - vidarh - 8 hours ago
    
    As long as there are no vogons on the way to build a hyperspace bypass.
- nicr_22 - an hour ago
  
  Agentic ennui!
- reactordev - 10 hours ago
  
  This. I signed up for 5x max for a month to push it and instead it pushed back. I cancelled my subscription. It either half-assed the implementation or began parroting back “You’re right!” instead of doing what it’s asked to do. On one occasion it flat out said it couldn’t complete the task even though I had MCP and skills setup to help it, it still refused. Not a safety check but a “I’m unable to figure out what to do” kind of way.
  Claude has no such limitations apart from their actual limits…
  - bjelkeman-again - 9 hours ago
    
    I have a funny/annoying thing with Claude Desktop where i ask it to write a summary of a spec discussion to a file and it goes ”I don’t have the tools to do that, I am Claude.ai, a web service” or something such. So now I start every session with ”You are Claude Desktop”. I would have thought it knew that. :)
    
    siva7 - 9 hours ago
    
    Seems like the "geniuses" at Anthropic forgot to adapt the system prompt for the actual product
    
    fragmede - 9 hours ago
    
    I've had to tell it "yes you can" in response to it saying it can't do something, and then it's able to do the thing. What a weird future we live in!
  - nwienert - 6 hours ago
    
    With one paragraph in your agents.md it's fixed, just admonish it to be proactive, decisive, and persistent.
    
    reactordev - 4 hours ago
    
    If only…
    I literally had to write a wake up routine.
    https://github.com/gabereiser/morning-routine
    
    nwienert - 2 hours ago
    
    It's always changing, but this is the start of my default prompt:
    https://gist.github.com/natew/fce2b38216edfb509f7e2807dec1b6...
    I've had 0 issues with Codex once it adopted it. I use it for Claude too, which seems to also improve its continuation.
    It was revised for friendliness based on the Anthropic paper recently, I'd have been a lot less flowery otherwise.
- GaryBluto - 9 hours ago
  
  I've been noticing this too. Had to switch to Sonnet 4.6.
- smartmic - 11 hours ago
  
  Gone are the days of deterministic programming, when computers simply carried out the operator’s commands because there was no other option but to close or open the relays exactly as the circuitry dictated. Welcome to the future of AI; the future we’ve been longing for and that will truly propel us forward, because AI knows and can do things better than we do.
  - endymi0n - 10 hours ago
    
    I had this funny moment when I realized we went full circle...
    "INTERCAL has many other features designed to make it even more aesthetically unpleasing to the programmer: it uses statements such as "READ OUT", "IGNORE", "FORGET", and modifiers such as "PLEASE". This last keyword provides two reasons for the program's rejection by the compiler: if "PLEASE" does not appear often enough, the program is considered insufficiently polite, and the error message says this; if it appears too often, the program could be rejected as excessively polite. Although this feature existed in the original INTERCAL compiler, it was undocumented.[7]"
    — https://en.wikipedia.org/wiki/INTERCAL
    
    basilgohar - 10 hours ago
    
    Thank you for this. I somehow never heard of this. I thoroughly enjoyed reading that and the loss of sanity it resulted in,
    
    vidarh - 8 hours ago
    
    "PLEASE COME FROM" is one of the eldritch horrors of software development.
    (It's a "reverse goto". As in, it hijacks control flow from anywhere else in the program behind your unsuspecting back who stupidly thought that when one line followed another with no visible control flow, naturally the program would proceed from one line to the next, not randomly move to a completely different part of the program... Such naivety)
  - WarmWash - 10 hours ago
    
    These are orthogonal from each other.
- projektfu - 9 hours ago
  
  (dwim)
  (dais)
  (jdip)
  (jfdiwtf)
  - rd - 8 hours ago
    
    should be more f’s and da’s in there
- lostmsu - 10 hours ago
  
  I never saw that happen in Codex so there's a good chance that OpenClaw does something wrong. My main suspicion would be that it does not pass back thinking traces.
  - vintagedave - 10 hours ago
    
    Anecdata, but I see this in Codex all the time. It takes about two rounds before it realises it's supposed to continue.
    
    dgunay - 10 hours ago
    
    I started seeing this a lot more with GPT 5.4. 5.3-codex is really good about patiently watching and waiting on external processes like CI, or managing other agents async. 5.4 keeps on yielding its turn to me for some reason even as it says stuff like "I'm continuing to watch and wait."
- cmrdporcupine - 9 hours ago
  
  The model has been heavily encouraged to not run away and do a lot without explicit user permission.
  So I find myself often in a loop where it says "We should do X" and then just saying "ok" will not make it do it, you have to give it explicit instructions to perform the operation ("make it so", etc)
  It can be annoying, but I prefer this over my experiences with Claude Code, where I find myself jamming the escape key... NO NO NO NOT THAT.
  I'll take its more reserved personality, thank you.
  - zargon - 2 hours ago
    
    Shall I implement it?
    no
    https://gist.github.com/bretonium/291f4388e2de89a43b25c135b4...
- henry2023 - 10 hours ago
  
  I’m sorry for you but this is hilarious.
- flowdesktech - 30 minutes ago
  
  [dead]
- whatsupdog - 10 hours ago
  
  [flagged]
- addaon - 11 hours ago
  
  Isn’t this the optimal behavior assuming that at times the service is compute-limited and that you’re paying less per token (flat fee subscription?) than some other customers? They would be strongly motivated to turn a knob to minimize tokens allocated to you to allow them to be allocated to more valuable customers.
  - endymi0n - 11 hours ago
    
    well, I do understand the core motivation, but if the system prompt literally says “I am not budget constrained. Spend tokens liberally, think hardest, be proactive, never be lazy.” and I’m on an open pay-per-token plan on the API, that’s not what I consider optimal behavior, even in a business sense.
    
    addaon - 11 hours ago
    
    Fair, if you’re paying per token (at comparable rates to other customers) I wouldn’t expect this behavior from a competent company.
- pixel_popping - 11 hours ago
  
  GPT 5.4 is really good at following precise instructions but clearly wouldn't innovate on its own (except if the instructions clearly state to innovate :))
vlovich123 - 11 hours ago

Conceivably you could have a public-facing dashboard of the rollout status to reduce confusion or even make it visible directly in the UI that the model is there but not yet available to you. The fanciest would be to include an ETA but that's presumably difficult since it's hard to guess in case the rollout has issues.
- moralestapia - 11 hours ago
  
  Why would you be confused?
  The UI tells you which model you're using at any given time.
  - ModernMech - 9 hours ago
    
    I don't see what model I'm using on the Codex web interface, where is that listed?
Grp1 - 11 hours ago

Congrats on the release! Is Images 2.0 rolling out inside ChatGPT as well, or is some of the functionality still going to be API/Playground-only for a while?
- minimaxir - 10 hours ago
  
  Images 2.0 is already in ChatGPT.
  - johndough - 9 hours ago
    
    When I generate an image with ChatGPT, is there a way for me to tell which image generation model has been used?
    
    minimaxir - 8 hours ago
    
    There's no explicit flag, but Thinking is only compatable with Images 2.0 so I suspect that will be reliable.
  - Grp1 - 10 hours ago
    
    Great, thanks for clarifying :)
rev4n - 9 hours ago

Looks good, but I’m a little hesitant to try it in Codex as a Plus user since I’m not sure how much it would eat into the usage cap.
dandiep - 10 hours ago

Will GPT 5.5 fine tuning be released any time soon?
dhruv3006 - 3 hours ago

Yep - its taking sometime.
qsort - 11 hours ago

Great stuff! Congrats on the release!
wslh - 10 hours ago

Just a tip: add [translated] subtitles to the top video.
fragmede - 9 hours ago

Are you able to say something about the training you've done to 5.5 to make it less likely to freak out and delete projects in what can only be called shame?
- embedding-shape - 8 hours ago
  
  What? I've probably use Codex (the TUI) since it was available on day 1, been running gpt-5.4 exclusively these last few months, never had it delete any projects in any way that can be called "shameful" or not. What are you talking about?
  - fragmede - an hour ago
    
    https://www.google.com/search?q=codex+deleted+project
    I'm not the only person it's happened to and it's not an isolated incident. How many car accidents have you been in, and how often do you wear your seatbelt?
    
    wahnfrieden - an hour ago
    
    First result is Windows which has had more problems with Codex (or at least, up until a few months ago). Second is someone who asked Codex to delete all files that were unrelated to the project files.
motoboi - 11 hours ago

Please next time start with azure foundry lol thanks!
stefan_ - 11 hours ago

[flagged]
- mh- - 11 hours ago
  
  Every low-effort, thought-free comment like this further discourages people from engaging here on submissions about their employer.
  Please don't.
dude250711 - 11 hours ago

With Anthropic, newer models often lead to quality degradation. Will you keep GPT 5.4 available for some time?
fHr - 9 hours ago

LETS GO CODEX #1
pixel_popping - 11 hours ago

can't wait! Thanks guys. PS: when you drop a new model, it would be smart to reset weekly or at least session limits :)
- pietz - 11 hours ago
  
  OpenAI has been very generous with limit resets. Please don't turn this into a weird expectation to happen whenever something unrelated happens. It would piss me off if I were in their place and I really don't want them to stop.
  - pixel_popping - 11 hours ago
    
    The suggestion wasn't about general limit resets when there is bugs or outages, but commercially useful to let users try new models when they have already reached their weekly limits.
  - cactusplant7374 - 11 hours ago
    
    There is absolutely nothing wrong with asking or suggesting. They are adults. I'm sure they can handle it.
  - Petersipoi - 10 hours ago
    
    Sorry but why should we care if very reasonable suggestions "piss [them] off"? That sounds like a them problem. "Them" being a very wealthy business. I think OpenAI will survive this very difficult time that GP has put them through.
    
    - 10 hours ago
    
    [deleted]
- cmrdporcupine - 11 hours ago
  
  Limits were just reset two days ago.
  - wahnfrieden - 11 hours ago
    
    And yet there was an outage last night
    
    lawgimenez - 10 hours ago
    
    And they're having an outage right now.

simonw - 10 hours ago

This doesn't have API access yet, but OpenAI seem to approve of the Codex API backdoor used by OpenClaw these days... https://twitter.com/steipete/status/2046775849769148838 and https://twitter.com/romainhuet/status/2038699202834841962

And that backdoor API has GPT-5.5.

So here's a pelican: https://simonwillison.net/2026/Apr/23/gpt-5-5/#and-some-peli...

I used this new plugin for LLM: https://github.com/simonw/llm-openai-via-codex

UPDATE: I got a much better pelican by setting the reasoning effort to xhigh: https://gist.github.com/simonw/a6168e4165a258e4d664aeae8e602...

stingraycharles - 4 hours ago

OpenAI hired the guy behind OpenClaw, so it makes sense that they’re more lenient towards its usage.
DrProtic - 10 hours ago

That pelican you posted yesterday from a local model looks nicer than this one.
Edit: this one has crossed legs lol
- BeetleB - 10 hours ago
  
  It really needs to pee.
GistNoesis - 9 hours ago

Isn't it awful ? After 5.5 versions it still can't draw a basic bike frame. How is the front wheel supposed to turn sideways ?
- jetrink - 9 hours ago
  
  I feel like if I attempted this, the bike frame would look fine and everything else would be completely unrecognizable. After all, a basic bike frame is just straight lines arranged in a fairly simple shape. It's really surprising that models find it so difficult, but they can make a pelican with panache.
  - nlawalker - 9 hours ago
    
    > a fairly simple shape
    Bike frames are very hard to draw unless you've already consciously internalized the basic shape, see https://www.booooooom.com/2016/05/09/bicycles-built-based-on...
  - necubi - 9 hours ago
    
    Humans are also famously bad at drawing bicycles from memory https://www.gianlucagimini.it/portfolio-item/velocipedia/
    
    - 8 hours ago
    
    [deleted]
  - billywhizz - 6 hours ago
    
    why do you find it surprising? these models have no actual understanding of anything, never mind the physical properties and capabilities of a bicycle.
  - fragmede - 9 hours ago
    
    My question is, as a human, how well would you or I do under the same conditions? Which is to say, I could do a much better job in inkscape with Google images to back me up, but if I was blindly shitting vectors into an XML file that I can't render to see the results of, I'm not even going to get the triangles for the frame to line up, so this pelican is very impressive!
- simonw - 9 hours ago
  
  Yeah, the bike frame is the thing I always look at first - it's still reasonably rare for a model to draw that correctly, although Qwen 3.6 and Gemini Pro 3.1 do that well now.
- loa_in_ - 9 hours ago
  
  The distinction is that it's not drawing. It's generating an SVG document containing descriptors of the shapes.
matt3210 - an hour ago

The pelican doesn’t really matter anymore since models are tuned for it knowing people will ask.
- simonw - 33 minutes ago
  
  They suck at tuning for it.
Schlagbohrer - 8 hours ago

That's amazing that the default did that much in just 39 "reasoning tokens" (no idea what a reasoning token is but that's still shockingly few tokens)
- erdaniels - 7 hours ago
  
  If you don't know what a reasoning token is, then how can 39 be considered shockingly few?
  - Culonavirus - 6 hours ago
    
    It's less than 67, duh.
    
    tclancy - 4 hours ago
    
    Not during peak hours.
- 3 hours ago

[deleted]
XCSme - 10 hours ago

Is this direct API usage allowed by their terms? I remember Anthropic really not liking such usage.
- simonw - 9 hours ago
  
  Apparently it's fine: https://twitter.com/romainhuet/status/2038699202834841962
singingtoday - 5 hours ago

Thank you for continuing to post these! Very interesting benchmark.
deflator - 10 hours ago

Hmm. Any idea why it's so much worse than the other ones you have posted lately? Even the open weight local models were much better, like the Qwen one you posted yesterday.
- simonw - 9 hours ago
  
  The xhigh one was better, but clearly OpenAI have not been focusing their training efforts on SVG illustrations of animals riding modes of transport!
- - 10 hours ago
  
  [deleted]
- irthomasthomas - 9 hours ago
  
  It beats opus-4.7 but looks like open models actually have the lead here.
noonething - 8 hours ago

Thank you for doing all this. It's appreciated.
- i_love_retros - 4 hours ago
  
  You do realise they are doing it for self promotion right?
  - simonw - 3 hours ago
    
    I mean, yeah. "Person who spends time publishing content online is doing it for self promotion" doesn't seem particularly notable to me. 24 years of self promotion and counting!
andriy_koval - 10 hours ago

what is your setup for drawing pelican? Do you ask model to check generated image, find issues and iterate over it which would demonstrate models real abilities?
- simonw - 9 hours ago
  
  It's generally one-shot-only - whatever comes out the first time is what I go with.
  I've been contemplating a more fair version where each model gets 3-5 attempts and then can select which rendered image is "best".
  - irthomasthomas - 9 hours ago
    
    Try llm-consortium with --judging-method rank
  - andriy_koval - 9 hours ago
    
    I think it will make results way better and more representative of model abilities..
    
    simonw - 9 hours ago
    
    It would... but the test is inherently silly, so I'm still not sure if it's worth me investing that extra effort in it.
droidjj - 10 hours ago

It's... like no pelican I've ever seen before.
- hagbard_c - 6 hours ago
  
  You've never seen pelicans riding bicycles either so maybe these are just representations of those specific subgroups of pelicans which are capable of riding them. Normal pelicans would not feel the need to ride bikes since they can fly, these special pelicans mostly seem to lack the equipment needed to do that which might be part of the reason they evolved to ride two-wheeled pedal-propelled vehicles.
postalcoder - 10 hours ago

I made pelicans at different thinking efforts:
https://hcker.news/pelican-low.svg
https://hcker.news/pelican-medium.svg
https://hcker.news/pelican-high.svg
https://hcker.news/pelican-xhigh.svg
Someone needs to make a pelican arena, I have no idea if these are considered good or not.
- seanw444 - 10 hours ago
  
  Can someone explain how we arrived at the pelican test? Was there some actual theory behind why it's difficult to produce? Or did someone just think it up, discover it was consistently difficult, and now we just all know it's a good test?
  - simonw - 9 hours ago
    
    I set it up as a joke, to make fun of all of the other benchmarks. To my surprise it ended up being a surprisingly good measure of the quality of the model for other tasks (up to a certain point at least), though I've never seen a convincing argument as to why.
    I gave a talk about it last year: https://simonwillison.net/2025/Jun/6/six-months-in-llms/
    It should not be treated as a serious benchmark.
    
    jimbokun - 9 hours ago
    
    What it has going for it is human interpretability.
    Anyone can look and decide if it’s a good picture or not. But the numeric benchmarks don’t tell you much if you aren’t already familiar with that benchmark and how it’s constructed.
  - redox99 - 10 hours ago
    
    It all began with a Microsoft researcher showing a unicorn drawn in tikz using GPT4. It was an example of something so outrageous that there was no way it existed in the training data. And that's back when models were not multimodal.
    Nowadays I think it's pretty silly, because there's surely SVG drawing training data and some effort from the researchers put onto this task. It's not a showcase of emergent properties.
  - Gander5739 - 10 hours ago
    
    https://simonwillison.net/2025/Jun/6/six-months-in-llms/
  - CamperBob2 - 10 hours ago
    
    It's interesting to see some semblance of spatial reasoning emerge from systems based on textual tokens. Could be seen as a potential proxy for other desirable traits.
    It's meta-interesting that few if any models actually seem to be training on it. Same with other stereotypical challenges like the car-wash question, which is still sometimes failed by high-end models.
    If I ran an AI lab, I'd take it as a personal affront if my model emitted a malformed pelican or advised walking to a car wash. Heads would roll.
- deflator - 10 hours ago
  
  They are not good, and they seem to get worse as you increased effort. Weird
  - postalcoder - 10 hours ago
    
    Yeah. I've always loosely correlated pelican quality with big model smell but I'm not picking that up here. I thought this was supposed to be spud? Weird indeed.
  - throw310822 - 10 hours ago
    
    No but I can sense the movement, I think it's already reached the level of intelligence that draws it towards futurism or cubism /s
- lexarflash8g - 7 hours ago
  
  None of them have the pelican's feet placed properly on the pedals -- or the pedals are misrepresented. Cool art style but not physically accurate.
- bravoetch - 9 hours ago
  
  I tried getting it to generate openscad models, which seems much harder. Not had much joy yet with results.
- lostmsu - 5 hours ago
  
  https://pelicans.borg.games/
gpm - 9 hours ago

I for one delight in bicycles where neither wheel can turn!
It continues to amaze me that these models that definitely know what bicycle geometry actually looks like somewhere in their weights produces such implausibly bad geometry.
Also mildly interesting, and generally consistent with my experience with LLMs, that it produced the same obvious geometry issue both times.
- lxgr - 9 hours ago
  
  > It continues to amaze me that these models that definitely know what bicycle geometry actually looks like somewhere in their weights produces such implausibly bad geometry.
  I feel like the main problem for the models is that they can't actually look at the visual output produced by their SVG and iterate. I'm almost willing to bet that if they could, they'd absolutely nail it at this point.
  Imagine designing an SVG yourself without being able to ever look outside the XML editor!
  - gpm - 9 hours ago
    
    > Imagine designing an SVG yourself without being able to ever look outside the XML editor!
    I honestly think I could do much better on the bicycle without looking at the output (with some assistance for SVG syntax which I definitely don't know), just as someone who rides them and generally knows what the parts are.
    I'd do worse at the pelicans though.
SkyBelow - 9 hours ago

Wait, I thought we were onto racoons on e-scooters to avoid (some of) the issues with Goodhart's Law coming into play.
- simonw - 9 hours ago
  
  I fall back to possums on e-scooters if the pelican looks too good to be true. These aren't good enough for me to suspect any fowl play.
rolymath - 9 hours ago

Exciting. Another Pelican post.
- simonw - 9 hours ago
  
  See if you can spot what's interesting and unique about this one. I've been trying to put more than just a pelican in there, partly as a nod to people who are getting bored of them.
- refulgentis - 9 hours ago
  
  It's silly and a joke and a surprisingly good benchmark and don't take it seriously but don't take not taking it seriously seriously and if it's too good we use another prompt and there's obvious ways to better it and it's not worth doing because it's not serious and if you say anything at all about the thread it's off-topic so you're doing exactly what you're complaining about and it's a personal attack from the fun police.
  Only coherent move at this point: hit the minus button immediately. There's never anything about the model in the thread other than simon's post.
dakolli - 9 hours ago

You know they are 1000% training these models to draw pelicans, this hasn't been a valid benchmark for 6 months +
- simonw - 9 hours ago
  
  OpenAI must be very bad at training models to draw pelicans (and bicycles) then.
- Legend2440 - 9 hours ago
  
  Skeptism is out of control these days, any time an LLM does something cool it must have been cheating.
sjdv1982 - 9 hours ago

At some point, OpenAI is going to cheat and hardcode a pelican on a bicycle into the model. 3D modelling has Suzanne and the teapot; LLMs will have the pelican.

jfkimmes - 11 hours ago

Everyone talked about the marketing stunt that was Anthropic's gated Mythos model with an 83% result on CyberGym. OpenAI just dropped GPT 5.5, which scores 82% and is open for anybody to use.

I recommend anybody in offensive/defensive cybersecurity to experiment with this. This is the real data point we needed - without the hype!

Never thought I'd say this but OpenAI is the 'open' option again.

tpurves - 10 hours ago

The real 'hype' was that the oh-snap realization that Open AI would absolutely release a competitive model to Mythos within weeks of Anthropic announcing there's, and that Sam would not gate access to it. So the panic was that the cyber world had only a projected 2 weeks to harden all these new zero days before Sam would inevitably create open season for blackhats to discover and exploit a deluge of zero-days.
- greenavocado - 4 hours ago
  
  The GPT-5.5 API endpoint started to block me after I escalated with ever more aggressive use of rizin, radare2, and ghidra to confirm correct memory management and cleanup in error code branches when working with a buggy proprietary 3rd party SDK. After I explained myself more clearly it let me carry on. Knock on wood.
  So there is a safety model watching your behavior for these kinds of things.
- Salgat - 5 hours ago
  
  It's almost embarrassing how susceptible we are to these marketing campaigns.
  - y-curious - an hour ago
    
    Dunno about you, but I didn’t fall for it. I’m reminded of how they were “afraid” to release GPT-2 because of the “power” it had. Hype train!
  - esjeon - 5 hours ago
    
    Lack of information, lack of knowledge.
    The “AI” “technology” is an easy excuse to create artificial information gap in the era of the interconnected.
- - 9 hours ago
  
  [deleted]
concinds - 9 hours ago

> Never thought I'd say this but OpenAI is the 'open' option again.
Compared to Anthropic, they always have been. Anthropic has never released any open models. Never released Claude Code's source, willingly (unlike Codex). Never released their tokenizer.
unsupp0rted - 8 hours ago

Doesn't OpenAI get mad if you ask cybersecurity questions and force you to upload a government ID, otherwise they'll silently route you to a less capable model?
> Developers and security professionals doing cybersecurity-related work or similar activity that could be mistaken by automated detection systems may have requests rerouted to GPT-5.2 as a fallback.
https://developers.openai.com/codex/concepts/cyber-safety
https://chatgpt.com/cyber
- merlindru - 5 hours ago
  
  Anthropic has started to ask for IDs for use of their products period
  I don't like that trend. I get why they're doing it, but I don't like it
  - brigandish - 3 hours ago
    
    Are you in the UK? I've not had this happen to me (I'm not in the UK) so I'm wondering if the Online Safety Act has affected this, as it has with other products.
    
    litigator - 3 hours ago
    
    I am from the UK and have not had this happen to me (Yet? perhaps)
- deaux - 7 hours ago
  
  They flatout gate any API access of the main models behind Persona ID verification. Entirely.
tnkuehne - 11 hours ago

isnt it like cyber question are being routed to dumper models at openai?
- jfkimmes - 11 hours ago
  
  Do you have a source for that?
  Neither the release post, nor the model card seems to indicate anything like this?
  - tech234a - 10 hours ago
    
    I see it here https://developers.openai.com/codex/concepts/cyber-safety
  - nikanj - 10 hours ago
    
    Anything that even vaguely smells like security research, reverse engineering or similar "dual-use" application hits the guardrails hard and fast. "Hey codex, here is our codebase, help us find exploitable issues" gives a "I can't help you with that, but I'm happy to give you a vague lecture on memory safety or craft a valgrind test harness"
willsmith72 - 2 hours ago

Being "more" open than something totally closed doesn't make you open. The name is still bs
ur-whale - 10 hours ago

> Anthropic's gated Mythos model
aka the perfect marketing ploy
- xtracto - 6 hours ago
  
  Reminds me of Gmail's early invite only mode.
- 8 hours ago

[deleted]
_the_inflator - 6 hours ago

I ignore any hype news.
Anthropic is the embodiment of bullshitting to me.
I read Cialdini many decades ago and I am bored by Anthropic.
OpenAI is very clever. With the advent of Claude OpenAI disappeared from the headlines. Who or what was this Sam again all were talking about a year ago?
OpenAI has a massive user advantage so that they can simply follow Anthropic’s release cycle to ridicule them.
I think it is really brutal for Anthropic how they are easily getting passed by by OpenAI and it is getting worse with every new GPT version for Anthropic.
OpenAI owns them.
- thinkthatover - 5 hours ago
  
  Who's Sam again? oh that person whose house was molotov'd last week? Or the person who had an expose written in the new yorker calling him a sociopath? I forget.

Someone1234 - 11 hours ago

I'd like to draw people's attention to this section of this page:

https://developers.openai.com/codex/pricing?codex-usage-limi...

Note the Local Messages between 5.3, 5.4, and 5.5. And, yes, I did read the linked article and know they're claiming that 5.5's new efficient should make it break-even with 5.4, but the point stands, tighter limits/higher prices.

puppystench - 11 hours ago

For API usage, GPT-5.5 is 2x the price of GPT-5.4, ~4x the price of GPT-5.1, and ~10x the price of Kimi-2.6.
Unfortunately I think the lesson they took from Anthropic is that devs get really reliant and even addicted on coding agents, and they'll happily pay any amount for even small benefits.
- kingstnap - 10 hours ago
  
  I feel like devs generally spend someone else's money on tokens. Either their employers or OpenAIs when they use a codex subscription.
  If I put on my schizo hat. Something they might be doing is increasing the losses on their monthly codex subscriptions, to show that the API has a higher margin than before (the codex account massively in the negative, but the API account now having huge margins).
  I've never seen an OpenAI investor pitch deck. But my guess is that API margins is one of the big ones they try to sell people on since Sama talks about it on Twitter.
  I would be interested in hearing the insider stuff. Like if this model is genuinely like twice as expensive to serve or something.
  - vineyardmike - 9 hours ago
    
    You can't build a business on per-seat subscriptions when you advertise making workers obsolete. API pricing with sustainable margins are the only way forward if you genuinely think you're going to cause (or accelerate) reduction in clients' headcount.
    Additionally, the value generated by the best models with high-thinking and lots of context window is way higher than the cheap and tiny models, so you need to provide a "gateway drug" that lets people experience the best you offer.
    
    CryptoBanker - an hour ago
    
    > You can't build a business on per-seat subscriptions when you advertise making workers obsolete.
    On the other hand I would argue that most workers' salaries are more like subscriptions than API type pricing (which would be more like an hourly contractor)
  - ewrs - 10 hours ago
    
    Yeah and the increase in operating expenses is going to make managers start asking hard questions - this is good. It means eventually there will be budgets put in place - this will force OAI and Anthropic to innovate harder. Then we will see how things pan out. Ultimately a firm is not going to pay rent to these firms if the benefits dont exceed the costs.
    
    mrwaffle - an hour ago
    
    Meaning that you believe they're not trying their "hardest" to innovate? They must be slacking then.
    
    girvo - 9 hours ago
    
    Budgets are already happening
    
    dist-epoch - 9 hours ago
    
    > Ultimately a firm is not going to pay rent to these firms if the benefits dont exceed the costs.
    This is also true for the humans. They will need to provide more benefits than the coding agents cost.
    
    eiksjs - 9 hours ago
    
    Humans are needed to use agents and these agents are not showing to be fully autonomous and require constant human review. In fact all you are getting is a splurge of stuff, people not thinking deeper anymore and the creation of more bottle necks and exacerbating the ones that already exist in an org.
    You sound like elon with the fsd will be here next year. Many cars have the self driving feature - most drivers don’t use it. Oh why is that I wonder.
  - mitjam - 10 hours ago
    
    The difference between sub and api price makes it hard to create competitive solutions on the app level.
    
    irthomasthomas - 9 hours ago
    
    This was something I worried about after openai started building apps as well as models. Now all of the labs make no secret of the fact that they are going after the whole software industry. Its going to be hard to maintain functioning fair markets unless governments step in.
- w10-1 - 8 hours ago
  
  Price increases now aim to demonstrate market power for eventual IPO.
  If they can show that people will pay a lot for somewhat better performance, it raises the value of any performance lead they can maintain.
  If they demonstrate that and high switching costs, their franchise is worth scary amounts of money.
- JohnLocke4 - 10 hours ago
  
  Sometimes I wonder if innovation in the AI space has stalled and recent progress is just a product of increased compute. Competence is increasing exponentially[1] but I guess it doesn't rule it out completely. I would postulate that a radical architecture shift is needed for the singularity though
  [1]https://arxiv.org/html/2503.14499v1 *Source is from March 2025 so make of it what you will.
  - nomel - 10 hours ago
    
    > that devs get really reliant and even addicted on coding agents
    An alternative perspective is, devs highly value coding agents, and are willing to pay more because they're so useful. In other words, the market value of this limited resource is being adjusted to be closer to reality.
    
    killingtime74 - 9 hours ago
    
    It's not limited though there are alternative providers even now, much less when the price goes up. Chinese providers, European ones, local models.
    
    nomel - 8 hours ago
    
    > It's not limited though
    Inference is not free, so all providers have a financial limit, and all providers have limited GPU/memory, so there's a physical material limit.
    I suggest looking at the profits of these companies (while they scramble to stay competitive).
- pxc - 10 hours ago
  
  Maybe that's true. But I think part of the issue is that for a lot of things developers want to do with them now— certainly for most of the things I want to do with them— they're either barely good enough, or not consistently good enough. And the value difference across that quality threshold is immense, even if the quality difference itself isn't.
- pzo - 10 hours ago
  
  On top of that I noticed just right now after updating macos dekstop codex app, I got again by default set speed to 'fast' ('about 1.5x faster with increased plan usage'). They really want you to burn more tokens.
  - nubg - 7 hours ago
    
    wow wait so it wasn't just me leaving it on from an old session?
    sounds like criminal fraud to me tbh
- 0xbadcafebee - 9 hours ago
  
  A fool and his money are soon parted
- Mars008 - 5 hours ago
  
  > devs get really reliant and even addicted on coding agents
  That's more about managers who hope AI will gradually replace stubborn and lazy devs. That will shift the balance to business ideas and connections out of technical side and investments.
  Anyway, before singularity there going to be a huge change.
- oh_no - 11 hours ago
  
  what's the source on that?
  - puppystench - 11 hours ago
    
    In the announcement webpage:
    >For API developers, gpt-5.5 will soon be available in the Responses and Chat Completions APIs at $5 per 1M input tokens and $30 per 1M output tokens, with a 1M context window.
    
    oh_no - 10 hours ago
    
    oops, thanks. i had just been looking at their api docs
- throwaway613746 - 9 hours ago
  
  [dead]
raincole - 4 hours ago

It's such a vague table for pricing information. 30-150 messages...? What?

astlouis44 - 12 hours ago

A playable 3D dungeon arena prototype built with Codex and GPT models. Codex handled the game architecture, TypeScript/Three.js implementation, combat systems, enemy encounters, HUD feedback, and GPT‑generated environment textures. Character models, character textures, and animations were created with third-party asset-generation tools

The game that this prompt generated looks pretty decent visually. A big part of this likely due to the fact the meshes were created using a seperate tool (probably meshy, tripo.ai, or similiar) and not generated by 5.5 itself.

It really seems like we could be at the dawn of a new era similiar to flash, where any gamer or hobbyist can generate game concepts quickly and instantly publish them to the web. Three.js in particular is really picking up as the primary way to design games with AI, in spite of the fact it's not even a game engine, just a web rendering library.

0x62 - 11 hours ago

FWIW I've been experimenting with Three.js and AI for the last ~3 years, and noticed a significant improvement in 5.4 - the biggest single generation leap for Three.js specifically. It was most evident in shaders (GLSL), but also apparent in structuring of Three.js scenes across multiple pages/components.
It still struggles to create shaders from scratch, but is now pretty adequate at editing existing shaders.
In 5.2 and below, GPT really struggled with "one canvas, multiple page" experiences, where a single background canvas is kept rendered over routes. In 5.4, it still takes a bit of hand-holding and frequent refactor/optimisation prompts, but is a lot more capable.
Excited to test 5.5 and see how it is in practice.
- CSMastermind - 11 hours ago
  
  > It still struggles to create shaders from scratch
  Oh just like a real developer
  - accrual - 10 hours ago
    
    Much respect for shader developers, it's a different way of thinking/programming
- Pym - 8 hours ago
  
  One struggle I'm having (with Claude) is that most of what it knows about Three.js is outdated. I haven't used GPT in a while, is the grass greener?
  Have you tried any skills like cloudai-x/threejs-skills that help with that? Or built your own?
- import - 8 hours ago
  
  Using Claude for the same context and it’s doing really well with the glsl. since like last September
dataviz1000 - 10 hours ago

LLM models can not do spacial reasoning. I haven't tried with GPT, however, Claude can not solve a Rubik Cube no matter how much I try with prompt engineering. I got Opus 4.6 to get ~70% of the puzzle solved but it got stuck. At $20 a run it prohibitively expensive.
The point is if we can prompt an LLM to reason about 3 dimensions, we likely will be able to apply that to math problems which it isn't able to solve currently.
I should release my Rubiks Cube MCP server with the challenge to see if someone can write a prompt to solve a Rubik's Cube.
- versteegen - an hour ago
  
  Interesting (would like to hear more), but solving a Rubiks cube would appear to be a poor way to measure spatial understanding or reasoning. Ordinary human spatial intuition lets you think about how to move a tile to a certain location, but not really how to make consistent progress towards a solution; what's needed is knowledge of solution techniques. I'd say what you're measuring is 'perception' rather than reasoning.
- embedding-shape - 8 hours ago
  
  > I should release my Rubiks Cube MCP server with the challenge to see if someone can write a prompt to solve a Rubik's Cube.
  Do it, I'm game! You nerdsniped me immediately and my brain went "That sounds easy, I'm sure I could do that in a night" so I'm surely not alone in being almost triggered by what you wrote. I bet I could even do it with a local model!
- Melatonic - 7 hours ago
  
  What about a model designed for robotics and vision? Seems like an LLM trained on text would inherently not be great for this.
  DeepMinds other models however might do better?
- snet0 - 8 hours ago
  
  How are you handing the cube state to the model?
  - dataviz1000 - 8 hours ago
    
    Does this answer the question?
    Opus 4.6 got the cross and started to get several pieces on the correct faces. It couldn't reason past this. You can see the prompts and all the turn messages.
    https://gist.github.com/adam-s/b343a6077dd2f647020ccacea4140...
    edit: I can't reply to message below. The point isn't can we solve a Rubik's Cube with a python script and tool calls. The point is can we get an LLM to reason about moving things in 3 dimensions. The prompt is a puzzle in the way that a Rubik's Cube is a puzzle. A 7 year old child can learn 6 moves and figure out how to solve a Rubik's Cube in a weekend, the LLM can't solve it. However, can, given the correct prompt, a LLM solve it? The prompt is the puzzle. That is why it is fun and interesting. Plus, it is a spatial problem so if we solve that we solve a massive class of problems including huge swathes of mathematics the LLMs can't touch yet.
    
    libraryofbabel - an hour ago
    
    I wonder if the difficulties LLMs have with “seeing” complex detail in images is muddying the problem here. What if you hand it the cube state in text form? (You could try ascii art if you want a middle ground.)
    If you want to isolate the issue, try getting the LLM itself to turn the images into a text representation of the cube state and check for accuracy. If it can’t see state correctly it certainly won’t be able to solve.
    
    osti - 8 hours ago
    
    Can't they write a script to solve rubik cubes?
    
    Jensson - 2 hours ago
    
    That doesn't test whether the model can follow and execute a dynamic plan reliably.
  - - 8 hours ago
    
    [deleted]
- Torkel - 8 hours ago
  
  *yet
vunderba - 11 hours ago

I’ve had a lot of success using LLMs to help with my Three.js based games and projects. Many of my weird clock visualizations relied heavily on it.
It might not be a game engine, but it’s the de facto standard for doing WebGL 3D. And since it’s been around forever, there’s a massive amount of training data available for it.
Before LLMs were a thing, I relied more on Babylon.js, since it’s a bit higher level and gives you more batteries included for game development.
kingstnap - 11 hours ago
The meshes look interesting, but the gameplay is very basic. The tank one seems more sophisticated with the flying ships and whatnot.
What's strange is that this Pietro Schirano dude seems to write incredibly cargo cult prompts.
```
  Game created by Pietro Schirano, CEO of MagicPath

  Prompt: Create a 3D game using three.js. It should be a UFO shooter where I control a tank and shoot down UFOs flying overhead.
  - Think step by step, take a deep breath. Repeat the question back before answering.
  - Imagine you're writing an instruction message for a junior developer who's going to go build this. Can you write something extremely clear and specific for them, including which files they should look at for the change and which ones need to be fixed?
  -Then write all the code. Make the game low-poly but beautiful.
  - Remember, you are an agent: please keep going until the user's query is completely resolved before ending your turn and yielding back to the user. Decompose the user's query into all required sub-requests and confirm that each one is completed. Do not stop after completing only part of the request. Only terminate your turn when you are sure the problem is solved. You must be prepared to answer multiple queries and only finish the call once the user has confirmed they're done.
  - You must plan extensively in accordance with the workflow steps before making subsequent function calls, and reflect extensively on the outcomes of each function call, ensuring the user's query and related sub-requests are completely resolved.
```
- torginus - 10 hours ago
  
  It's weird how people pep talk the AI - if my Jira tickets looked like this, I would throw a fit.
  I guess these people think they have special prompt engineering skills, and doing it like this is better than giving the AI a dry list of requirements (fwiw, they might be even right)
  - mattgreenrocks - 9 hours ago
    
    It’s not surprising to me that the same crowd that cheers for the demise of software engineering skills invented its own notion of AI prompting skills.
    Too bad they can veer sharply into cringe territory pretty fast: “as an accomplished Senior Principal Engineer at a FAANG with 22 years of experience, create a todo list app.” It’s like interactive fanfiction.
    
    dr_kiszonka - 4 hours ago
    
    That's quite similar to the AI Studio's prompt. You are a world-class frontend engineer...
    
    eiksjs - 9 hours ago
    
    Indeed it is so utterly cringe.
  - eloisant - 8 hours ago
    
    Yes, this is cargo cult.
    This remind me of so called "optimization" hacks that people keep applying years after their languages get improved to make them unnecessary or even harmful.
    Maybe at one point it helped to write prompts in this weird way, but with all the progress going on both in the models and the harness if it's not obsolete yet it will soon be. Just crufts that consumes tokens and fills the context window for nothing.
- irthomasthomas - 11 hours ago
  
  > Think Step By Step
  What is this, 2023?
  I feel like this was generated by a model tapping in to 2023 notions of prompt engineering.
  - retr0rocket - 10 hours ago
    
    [dead]
- skirano - 10 hours ago
  
  Pietro here, I just published a video of it: https://x.com/skirano/status/2047403025094905964?s=20
- tantalor - 11 hours ago
  
  It comes across as an elaborate, sparkly motivational cat poster.
  *BELIEVE!* https://www.youtube.com/watch?v=D2CRtES2K3E
  - skolskoly - 7 hours ago
    
    https://m.media-amazon.com/images/I/71MTbRmLY8L._AC_UF894,10...
- bredren - 10 hours ago
  
  The prompt did not specify advanced gameplay.
  I do not see instructions to assist in task decomposition and agent ~"motivation" to stay aligned over long periods as cargo culting.
  See up thread for anecdotes [1].
  > Decompose the user's query into all required sub-requests and confirm that each one is completed. Do not stop after completing only part of the request. Only terminate your turn when you are sure the problem is solved.
  I see this as a portrayal of the strength of 5.5, since it suggests the ability to be assigned this clearly important role to ~one shot requests like this.
  I've been using a cli-ai-first task tool I wrote to process complex "parent" or "umberella" into decomposed subtasks and then execute on them.
  This has allowed my workflows to float above the ups and downs of model performance.
  That said, having the AI do the planning for a big request like this internally is not good outside a demo.
  Because, you want the planning of the AI to be part of the historical context and available for forensics due to stalls, unwound details or other unexpected issues at any point along the way.
  [1] https://news.ycombinator.com/item?id=47879819
- ahoka - 10 hours ago
  
  "take a deep breath"
  OMFG
  - jameshart - an hour ago
    
    Claude would check to see if it had any breathing skills, if it doesn't find any it would start installing npm modules for breathing.
mindhunter - 8 hours ago

A friend is building Jamboree[1] (prev name "Spielwerk") for iOS. An app to build and share games. They're all web based so they're easy to share.
[1] https://apps.apple.com/uz/app/jamboree-game-maker/id67473110...
peder - 6 hours ago

> It really seems like we could be at the dawn of a new era similiar to flash
We've been there for a while.... creativity has been the primary bottleneck
- 11 hours ago

[deleted]
- 10 hours ago

[deleted]
nemo44x - 7 hours ago

It’s like all these things though - it’s not a real production worthy product. It’s a super-demo. It looks amazing until you realize there’s many months of work to make it something of quality and value.
I think people are starting to catch on to where we really are right now. Future models will be better but we are entering a trough of dissolution and this attitude will be widespread in a few months.
ZeWaka - 11 hours ago

I personally don't think the gameplay itself is that impressive.
gregpred - 11 hours ago

[flagged]

khutorni - 6 minutes ago

> One engineer at NVIDIA who had early access to the model went as far as to say: "Losing access to GPT‑5.5 feels like I've had a limb amputated.”

That's a wild statement to put into your announcement. Are LLM providers now openly bragging about our collective dependency on their models?

minimaxir - 12 hours ago

The more interesting part of the announcement than "it's better at benchmarks":

> To better utilize GPUs, Codex analyzed weeks’ worth of production traffic patterns and wrote custom heuristic algorithms to optimally partition and balance work. The effort had an outsized impact, increasing token generation speeds by over 20%.

The ability for agentic LLMs to improve computational efficiency/speed is a highly impactful domain I wish was more tested than with benchmarks. From my experience Opus is still much better than GPT/Codex in this aspect, but given that OpenAI is getting material gains out of this type of performancemaxxing and they have an increasing incentive to continue doing so given cost/capacity issues, I wonder if OpenAI will continue optimizing for it.

xiphias2 - 11 hours ago

There's already KernelBench which tests CUDA kernel optimizations.
On the other hand all companies know that optimizing their own infrastructure / models is the critical path for ,,winning'' against the competition, so you can bet they are serious about it.
xtracto - 6 hours ago

So, im working in some high performance data processing in Rust. I had hit some performance walls, and needed to improve in the 100x or more scale.
I remembered the famous FizzBuzz Intel codegolf optimizations, and gave it to gemini pro, along with my code and instructions to "suggest optimizations similar to those, maybe not so low level, but clever" and it's suggestions were veerry cool.
LLM do not stop amazing me every day.
amrrs - 11 hours ago

Honestly the problem with these is how empirical it is, how someone can reproduce this? I love when Labs go beyond traditional benchies like MMLU and friends but these kind of statements don't help much either - unless it's a proper controlled study!
- minimaxir - 11 hours ago
  
  In a sense it's better than a benchmark: it's a practical, real-world, highly quantifiable improvement assuming there are no quality regressions and passes all test cases. I have been experimenting with this workflow across a variety of computational domains and have achieved consistent results with both Opus and GPT. My coworkers have independently used Opus for optimization suggestions on services in prod and they've led to much better performance (3x in some cases).
  A more empirical test would be good for everyone (i.e. on equal hardware, give each agent the goal to implement an algorithm and make it as fast as possible, then quantify relative speed improvements that pass all test cases).
  - squibonpig - 9 hours ago
    
    Yeah but like what if they're sorta embellishing it or just lying? That's the issue with not being reproducible.
- jstanley - 11 hours ago
  
  Oh, come on, if they do well on benchmarks people question how applicable they are in reality. If they do well in reality people complain that it's not a reproducible benchmark...
  - girvo - 8 hours ago
    
    That's easily explained by those being two different people with two different opinions?
    
    2goomba1stage - 4 hours ago
    
    And together they make one single community that s effectively NEVER happy.

6thbit - 10 hours ago

                          Mythos     5.5
    SWE-bench Pro          77.8%*   58.6%
    Terminal-bench-2.0     82.0%    82.7%*
    GPQA Diamond           94.6%*   93.6%
    H. Last Exam           56.8%*   41.4%
    H. Last Exam (tools)   64.7%*   52.2%    
    BrowseComp             86.9%    84.4%  (90.1% Pro)*
    OSWorld-Verified       79.6%*   78.7%

Still far from Mythos on SWE-bench but quite comparable otherwise. Source for mythos values: https://www.anthropic.com/glasswing

aliljet - 10 hours ago

Mythos is only real when it's actually available. If you're using Opus 4.7 right now, you know how incredibly nerfed the Opus autonomy is in service of perceived safety. I'm not so confident this will be as great as Anthropic wants us to believe..
XCSme - 10 hours ago

They mentioned in their release page, that the Claude team noticed memorization of the SWE-bench test, so the test is actually in the training data.
Here: https://www.anthropic.com/news/claude-opus-4-7#:~:text=memor...
kaonashi-tyc-01 - 9 hours ago

I did some study on Verified, not Pro, but Mythos number there rings a lot of questions on my end.
If you look at the SWEBench official submissions: https://github.com/SWE-bench/experiments/tree/main/evaluatio..., filter all models after Sonnet 4, and aggregate ALL models' submission across 500 problems, what I found that the aggregated resolution rate is 93% (sharp).
Mythos gets 93.7%, meaning it solves problems that no other models could ever solve. I took a look at those problems, then I became even more suspicious, for the remaining 7% problems, it is almost impossible to resolve those issues without looking at the testing patch ahead of time, because how drastically the solution itself deviates from the problem statement, it almost feels like it is trying to solve a different problem.
Not that I am saying Mythos is cheating, but it might be too capable to remember all states of said repos, that it is able to reverse engineer the TRUE problem statement by diffing within its own internal memory. I think it could be a unique phenomena of evaluation awareness. Otherwise I genuinely couldn't think of exactly how it could be this precise in deciphering such unspecific problem statements.
- yfontana - 9 hours ago
  
  OpenAI wrote a couple months ago that they do not consider SWE Bench Verified a meaningful benchmark anymore (and they were the ones who published it in the first place): https://openai.com/index/why-we-no-longer-evaluate-swe-bench...
  - kaonashi-tyc-01 - 8 hours ago
    
    Yep, I read this blog. What confuses me is that Anthropic doesn't seem to be bothered by this study and keeps publishing Verified results.
    That is what gets me curious in the first place. The fact Mythos scored so high, IMO, exposes some issues with this model: it is able to solve seemingly impossible to solve problems.
    Without cheating allegation, which I don't think ANT is doing, it has to be doing some fortune telling/future reading to score that high at all.
alansaber - 9 hours ago

A single benchmark is meaningless, you always get quirky results on some benchmarks.

silvertaza - 10 hours ago

Still huge hallucination rate, unfortunately at 86%. To compare, Opus sits at 36%.

Source: https://artificialanalysis.ai/models?omniscience=omniscience...

dubcanada - 9 hours ago

grok is 17%? And that's the lowest, most models are like 80%+?
While hallucination is probably closer to 100% depending on the question. This benchmark makes no sense.
- Jensson - 2 hours ago
  
  > While hallucination is probably closer to 100% depending on the question.
  But the benchmark didn't ask those questions, and it seems grok is very well at saying it doesn't know the answer otherwise.
- elAhmo - 8 hours ago
  
  No one serious uses grok.
  - ajdegol - 8 hours ago
    
    @grok is this true?
    
    NamlchakKhandro - 25 minutes ago
    
    no
  - RALaBarge - 7 hours ago
    
    YMMV but Grok 4.1 Fast can usually find via static analysis a few things that other models dont seem to catch with the same prompt
  - d0gsg0w00f - 3 hours ago
    
    Why not? Honest question.
simianwords - 9 hours ago

There's something off with this because Haiku should not be that good.
- rattray - 4 hours ago
  
  I've been very curious about that too. I wonder if it's actually much better at admitting when it doesn't know something, because it thinks it's a "dumber model". But I haven't played with this at all myself.
- jwpapi - 9 hours ago
  
  The hallucination benchmark is hallucinating
dakolli - 9 hours ago

This indicates they want this behavior, they know the person asking the question probably doesn't understand the problem entirely (or why would they be asking), so they'd prefer a confident response, regardless of outcomes, because the point is to sell the technologies competency (and the perception thereof), not the capabilities, to a bunch of people that have no clue what they're talking about.
LLMs will ruin your product, have fun trusting a billionaires thinking machine they swear is capable of replacing your employees if you just pay them 75% of your labor budget.
- tedsanders - 4 hours ago
  
  We don't want hallucinations either, I promise you.
  A few biased defenses:
  - I'll note that this eval doesn't have web search enabled, but we train our models to use web search in ChatGPT, Codex, and our API. I'd be curious to see hallucination rates with web search on.
  - This eval only measures binary attempted vs did not attempt, but doesn't really reward any sort of continuous hedging like "I think it's X, but to be honest I'm not sure."
  - On the flip side, GPT-5.5 has the highest accuracy score.
  - With any rate over 1% (whether 30% or 70%), you should be verifying anything important anyway.
  - On our internal eval made from de-identified ChatGPT prompts that previously elicited hallucinations, we've actually been improving substantially from 5.2 to 5.4 to 5.5. So as always, progress depends on how you measure it.
  - Models that ask more clarifying questions will do better on this eval, even if they are just as likely to hallucinate after the clarifying question.
  Still, Anthropic has done a great job here and I hope we catch up to them on this eval in the future.
- calf - 4 hours ago
  
  On ChatGPT 5.3 Plus subscription I find that long informal chats tend to reveal unsatisfactory answers and biases, at this point after 10 rounds of replies I end up having to correct it so much that it starts to agree with my initial arguments full circle. I don't see how this behavior is acceptable or safe for real work. Like are programmers and engineers using LLMs completely differently than I'm doing, because the underlying technology is fundamentally the same.

applfanboysbgon - 12 hours ago

If there's a bingo card for model releases, "our [superlative] and [superlative] model yet" is surely the free space.

tom1337 - 11 hours ago

Do "our [superlative] and [superlative] [product] yet" and you have pretty much every product launch
- SequoiaHope - 11 hours ago
  
  I love when Apple says they’re releasing their best iPhone yet so I know the new model is better than the old ones.
  - taspeotis - 5 hours ago
    
    https://theonion.com/new-device-desirable-old-device-undesir...
xnx - 11 hours ago

"our newest and most expensive model yet"
- - 10 hours ago
  
  [deleted]
wiseowise - 9 hours ago

"Best iPhone ever"
ertgbnm - 10 hours ago

can't wait for "our worst and dumbest model yet"
- Nition - 10 hours ago
  
  Apple should have used that one for the 2016 MacBook.

mudkipdev - 10 hours ago

This is 3x the price of GPT-5.1, released just 6 months ago. Is no one else alarmed by the trend? What happens when the cheaper models are deprecated/removed over time?

Night_Thastus - 10 hours ago

This is entirely expected. The low prices of using LLMs early on was totally and completely unsustainable. The companies providing such services were (and still are) burning money by the truckload.
The hope is to get a big userbase who eventually become dependent on it for their workflow, then crank up the price until it finally becomes profitable.
The price for all models by all companies will continue to go up, and quickly.
- oezi - 7 hours ago
  
  I recently looked at this a bit but came away with the impression that at least on API pricing the models should be very profitable considering primarily the electricity cost.
  Subscriptions and free plans are the thing that can easily burn money.
  - Night_Thastus - 6 hours ago
    
    The physical buildouts and massive R+D spending is the big part.
- subhobroto - 3 hours ago
  
  > The price for all models by all companies will continue to go up, and quickly.
  This might entirely be true but I'm hoping that's because the frontier models are just actually more expensive to run as well.
  Said another way, I would hope, the price of GPT-5.5 falls significantly in a year when GPT-5.8 is out.
  Someone else on this post commented:
  > For API usage, GPT-5.5 is 2x the price of GPT-5.4, ~4x the price of GPT-5.1, and ~10x the price of Kimi-2.6.
  Having used Kimi-2.6, it can go on for hours spewing nonsense. I personally am happy to pay 10x the price of something that doesn't help me, for something else that does, in even half the time.
energy123 - 10 hours ago

Look a cost per intelligence or cost per task instead of cost per token.
- yokoprime - 10 hours ago
  
  How do I reliably measure 1 unit of intelligence?
  - wellthisisgreat - 9 hours ago
    
    In pelicans, obviously
- ulimn - 10 hours ago
  
  Isn't the outcome / solution for a given task non-deterministic? So can we reliably measure that?
  - foota - 10 hours ago
    
    Yes, sort of. Generally you can measure the pass rate on a benchmark given a fixed compute budget. A sufficiently smart model can hit a high pass rate with fewer tokens/compute. Check out the cost efficiency on https://artificialanalysis.ai/ (say this posted here the other day, pretty neat charts!)
  - genericresponse - 10 hours ago
    
    Statistically. Do many trials and measure how often it succeeds/fails.
  - torginus - 10 hours ago
    
    This is the only correct take. The only metric that matters is cost per desired outcome.
  - dns_snek - 10 hours ago
    
    Repetition and statistics, if you have $1000++ you didn't need anyway.
  - throwuxiytayq - 10 hours ago
    
    It's much easier to measure a language model's intelligence than a human's because you can take as many samples as you want without affecting its knowledge. And we do measure human intelligence.
Schlagbohrer - 8 hours ago

As others have mentioned you're ignoring the long tail of open-weights models which can be self hosted. As long as that quasi-open-source competition keeps up the pace, it will put a cap on how expensive the frontier models can get before people have to switch to self-hosting.
That's a big if, though. I wish Meta were still releasing top of the line, expensively produced open-weights models. Or if Anthropic, Google, or X would release an open mini version.
- Wowfunhappy - 7 hours ago
  
  Well, Google does release mini open versions of their models. https://deepmind.google/models/gemma/gemma-4/
  - deaux - 7 hours ago
    
    And they're incredibly good for their size.
    
    boppo1 - 3 hours ago
    
    Which, unfortunately is still slow unusable garbage compared to fronteir models.
    
    deaux - an hour ago
    
    Not at all, it's more than enough for a large range of tasks. As for slow, that's just a function of how much compute you throw at it, which you actually control unlike with closed weights models.
operatingthetan - 10 hours ago

We know they cost much more than this for OpenAI. Assume prices will continue to climb until they are making money.
- horiap - 7 hours ago
  
  How do we know that? There is a large gap between API pricing for SOTA models and similarly sized OSS models hosted by 3rd party providers.
  Sure, they’re distilled and should be cheaper to run but at the same time, these hosting providers do turn a margin on these given it’s their core business, unless they do it out of the kindness of their heart.
  So it’s hard for me to imagine these providers are losing money on API pricing.
- beering - 8 hours ago
  
  source? There have also been a bunch of people here saying the opposite
thrawa8387336 - 3 hours ago

Apparently the cost/price is 20x in the major providers. Not clear how it is a business
dannyw - 10 hours ago

It's far more meaningful to look at the actual cost to successfully something. The token efficiency of GPT-5.5 is real; as well as it just being far better for work.
dandaka - 10 hours ago

SOTA models get distilled to open source weights in ~6 months. So paying premium for bleeding edge performance sounds like a fair compensation for enormous capex.
typs - 7 hours ago

GPT-4 cost 6x on input and 2x output tokens when it was released as compared go GPT-5.5
kuatroka - 8 hours ago

Not really a big problem. Switch to KIMI, Qwen, GLM. You’ll get 95% quality of GPT or Anthropic for a 10th of a price. I feel like the real dependency is more mental, more of a habit but if you actually dip your toes outside OpenAI, Anthropic, Gemini from time to time, you realise that the actual difference in code is not huge if prompted in a good way. Maybe you’ll have to tell it to do something twice and it won’t be a one shot, but it’s really not an issue at all.
- Mashimo - an hour ago
  
  I use glm and I like it, not they also increased the price to 18 usd /month.
  I think Kimi and qwen are similar?
- nubg - 7 hours ago
  
  God I hope this is true.
  Where can i find up to date resources on open source models for coding?
  - vibe42 - 3 hours ago
    
    https://old.reddit.com/r/LocalLLaMA/
    Bit of a hype madhouse whenever a new model is released, but it's pretty easy to filter out simple hype from people showing reproducible experiments, specific configs for llama.cpp, github links etc.
msdz - 10 hours ago

Such an increase tracks the company's valuation trend, which they constantly, somehow have to justify (let alone break even on costs).

vthallam - 11 hours ago

This model is great at long horizon tasks, and Codex now has heartbeats, so it can keep checking on things. Give it your hardest problem that would take hours with verifiable constraints, you will see how good this is:)

*I work at OAI.

dannyw - 10 hours ago

It's genuinely so great at long horizon tasks! GPT-5.5 solved many long-horizon frontier challenges, for the first time for an AI model we've tested, in our internal evals at Canva :) Congrats on the launch!
- brcmthrowaway - 8 hours ago
  
  Can we not do growth hacking here?
  - RALaBarge - 7 hours ago
    
    We totally agree.
    That's what I've been heads down, HUNGRY, working on, looking for investors and founding engineers pst: https://heymanniceidea.com (disclaimer: I am not associated with heymanniceidea.com)
  - smallerize - 7 hours ago
    
    HN is owned by a startup accelerator and venture capital firm. They do growth hacking on the front page. And you probably know that since your throwaway account is several years old.
  - - 7 hours ago
    
    [deleted]
dandaka - 10 hours ago

Could be a great feature, can't wait to test! Tired of other models (looking at you Opus) constantly stuck mid-task lately.
- frotaur - 9 hours ago
  
  I've been using the /ralph-loop plugin for claude code, works well to keep the model hammering at the task.
- winrid - 10 hours ago
  
  Interesting, I just had opus convert a 35k loc java game to c++ overnight (root agent that orchestrated and delegated to sub agents) and woke up and it's done and works.
  What plan are you on? I'm starting to wonder if they're dynamically adjusting reasoning based on plan or something.
  - gck1 - 8 hours ago
    
    I'm on max 5x and noticed this too. I don't use built-in subagents but rather full Claude session that orchestrates other full claude sessions. Worker agents that receive tasks now stop midway, they ask for permission to continue. My "heartbeat" is basically "status. One line" message sent to the orchestrator.
    Opus 4.6 worker agents never asked for permission to continue, and when heartbeat was sent to orchestrator, it just knew what to do (checked on subagents etc). Now it just says that it waits for me to confirm something.
  - adamandsteve - 6 hours ago
    
    This has to be bait.
    
    azan_ - 5 hours ago
    
    Why?
bkyan - 7 hours ago

Sorry, what is "heartbeats", exactly?
- gurjeet - 6 hours ago
  
  > Today we launched heartbeats in Codex: automations that maintain context inside a single thread over time.
  https://x.com/pashmerepat/status/2044836560147984461
  - bkyan - 5 hours ago
    
    Thanks!

_alternator_ - 9 hours ago

> One engineer at NVIDIA who had early access to the model went as far as to say: "Losing access to GPT‑5.5 feels like I've had a limb amputated.”

This quote is more sinister than I think was intended; it likely applies to all frontier coding models. As they get better, we quickly come to rely on them for coding. It's like playing a game on God Mode. Engineers become dependent; it's truly addictive.

This matches my own experience and unease with these tools. I don't really have the patience to write code anymore because I can one shot it with frontier models 10x faster. My role has shifted, and while it's awesome to get so much working so quickly, the fact is, when the tokens run out, I'm basically done working.

It's literally higher leverage for me to go for a walk if Claude goes down than to write code because if I come back refreshed and Claude is working an hour later then I'll make more progress than mentally wearing myself out reading a bunch of LLM generated code trying to figure out how to solve the problem manually.

Anyway, it continues to make me uneasy, is all I'm saying.

noosphr - 7 hours ago

LLMs upend a few centuries of labor theory.
The current market is predicated on the assumption that labor is atomic and has little bargaining power (minus unions). While capital has huge bargaining power and can effectively put whatever price it wants on labor (in markets where labor is plentiful, which is most of them).
What happens to a company used to extracting surplus value from labor when the labor is provided by another company which is not only bigger but unlike traditional labor can withhold its labor indefinitely (because labor is now just another for of capital and capital doesn't need to eat)?
Anyone not using in house models is signing up to find out.
- matheusmoreira - 6 hours ago
  
  This is our one chance to reach the fabled post-scarcity society. If we fail at this now, we'll end up in a totalitarian cyberpunk dystopia instead.
  - nkozyra - 6 hours ago
    
    I don't want to spoil it for you, but ...
  - TurdF3rguson - 6 hours ago
    
    But cyberpunk is the best kind of dystopia!
    
    onemoresoop - 5 hours ago
    
    Sorry for my foul language but I think we will turn into cybershit if things go bad.
  - mikercampbell - 4 hours ago
    
    Manufactured Scarcity is the new post-scarcity
  - ijeifnekfjekd - 6 hours ago
    
    What? In what way does companies becoming dependent on AI chatbots will solve the world-spanning problem of resource scarcity?
    The hell?
    
    juleiie - 6 hours ago
    
    The idea is that cheap and readily available and upgradeable intelligence is going to massively increase our purchasing power and what everyone can order for the same cost basically.
    If artificial doctors are cents on hour then you can see how that changes our behaviors and level of life.
    But on the other hand from the other direction there is a wage decrease incoming from increased competition at the same time. What happens if these two forces clash? Will cheap labour allow us to buy anything for pennies or will it just make us unable to make a single penny?
    In my view the labour will fundamentally shift with great pain and personal tragedies to the areas that are not replaceable by AI (because no one wants to watch robots play chess). Such as sports, entertainment and showmanship. Handcrafted goods. Arts. Attention based economy. Self advertisement. Digital prostitution in a very broad sense.
    However before it gets there it will be a great deal of strife and turmoil that could plunge the world into dark ages for a while at least. It is unlikely for our somewhat politically rigid society to adapt without great deal of pain. Additionally I am not sure if hypothetical future attention based society could be a utopia. You could have to mount cameras in your house so other people see you at all times for amusement just to have any money at all. We will probably forever need to sell something to someone and I am unsettled by ideas what can we sell if we cannot sell our hard work.
    Someone who sees the roads ahead should now make preparations at government level for this shock but it will come too fast and with people at the steering wheel that don’t exactly care.
    
    mxkopy - 5 hours ago
    
    We could also literally have Star Trek. Think of all the scientific discoveries we could make if we had armies of scientists the size of our labor force.
    But we will have to (painfully) shed our current hierarchies before that comes to pass.
    
    NamlchakKhandro - 9 minutes ago
    
    star trek mythology talks about having to go through epic level civil war before reach the utopia in the tv series.
    
    juleiie - 5 hours ago
    
    Maybe so but humans have this strange primal need to hoard resources.
    Probably a remnant from prehistoric times when it was a matter of life and death. Will we ever be able to overcome this basic instinct that made capitalism such an unstoppable force? Will this ancient PTSD be ever cured?
    
    mxkopy - 2 hours ago
    
    I find the insinuation that mental illness is a fundamental part of the human experience to be deeply revolting. There is no excuse for hoarders and rapists.
    
    krainboltgreene - 5 hours ago
    
    Man if only there was a singular episode that covered this exact topic in Star Trek and resolved that no, actually slavery wasn't any different for artificial life.
    
    linkregister - 4 hours ago
    
    Star Trek was entertaining television. There was also an episode where the ship's doctor made love to a ghost.
    
    krainboltgreene - 4 hours ago
    
    True, nothing to learn here. No introspection has ever resulted from media analysis.
    
    krainboltgreene - 5 hours ago
    
    "Extremely cheap sentience that cannot disobey will solve all our problems" is such an insane sentiment I see far too often.
    
    juleiie - 5 hours ago
    
    Useful intelligence does not require sentience.
    As far as I know, none of LLM models are sentient nor are possible to be in the near future.
    I also do not assume so called AGI to be sentient. Merely to be a human level skilled intellectual worker.
    In absence of ethical dilemmas of this calibre for the foreseeable future let’s focus on the economy side of things in this particular comment chain.
    
    krainboltgreene - 4 hours ago
    
    It must very comforting to be able to decided a "human level worker" isn't sentient.
    It makes things so clean.
    
    juleiie - 4 hours ago
    
    LLMs cannot possess consciousness for three reasons: they execute as a sequence of Transformer blocks with extremely limited information exchange, these blocks are simple feed-forward networks with no recurrent connections, and the computer hardware follows a modular design.
    Shardlow & Przybyła, "Deanthropomorphising NLP: Can a Language Model Be Conscious?" (PLOS One, 2024)
    Nature: "There is no such thing as conscious artificial intelligence" (2025)
    They argue that the association between consciousness and LLMs is deeply flawed, and that mathematical algorithms implemented on graphics cards cannot become conscious because they lack a complex biological substrate. They also introduce the useful concept of "semantic pareidolia" - we pattern-match consciousness onto things that merely talk convincingly.
    They are making a strong argument and I think they are correct. But really these are two different things as I said originally.