GPT-5.2

517 points by atgctg 4 hours ago

https://platform.openai.com/docs/guides/latest-model

System card: https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...

I have been using chatGPT a ton over the last months and paying the subscription. Used it for coding, news, stock analysis, daily problems, and a whatever I could think of. I decided to give Gemini a go when version three came out to great reviews. Gemini handles every single one of my uses cases much better and consistently gives better answers. This is especially true for situations were searching the web for current information is important, makes sense that google would be better. Also OCR is phenomenal chatgpt can't read my bad hand writing but Gemini can easily. Only downsides are in the polish department, there are more app bugs and I usually have to leave the happen or the session terminates. There are bugs with uploading photos. The biggest complaint is that all links get inserted into google search and then I have to manipulate them when they should go directly to the chosen website, this has to be some kind of internal org KPI nonsense. Overall, my conclusion is that ChatGPT has lost and won't catch up because of the search integration strength.

solarkraft - an hour ago

> Only downsides are in the polish department
What an understatement. It has me thinking „man, fuck this“ on the daily.
Just today it spontaneously lost an entire 20-30 minutes long thread and it was far from the first time. It basically does it any time you interrupt it in any way. It’s straight up data loss.
It’s kind of a typical Google product in that it feels more like a tech demo than a product.
It has theoretically great tech. I particularly like the idea of voice mode, but it’s noticeably glitchy, breaks spontaneously often and keeps asking annoying questions which you can’t make it stop.
- sundarurfriend - 32 minutes ago
  
  ChatGPT web UI was also like this for the longest time, until a few months ago: all sorts of random UI bugs leading either to data loss or misleading UI state. Interrupting still is very flaky there too. And on the mobile app, if you move away from the app while it's taking time to think, its state would somehow desync from the actual backend thinking state, and get stuck randomly; sometimes restarting the app fixes it, sometimes that chat is that unusable from that point on.
  And the UI lack of polish shows up freshly every time a new feature lands too - the "branch in new chat" feature is really finicky still, getting stuck in an unusable state if you twitch your eyebrows at wrong moment.
- mmaunder - 12 minutes ago
  
  Yeah I eventually noped out as I said in another comment and am charging hard with Codex and am so happy about 5.2!!
- radicaldreamer - an hour ago
  
  Google’s standard problem is that they don’t even use their own products. Their Pixel and Android team rocks iPhones on the daily, for example.
  - RBerenguel - 24 minutes ago
    
    I would think this is not true
  - onethought - an hour ago
    
    I mean there is benefit to understanding competitor well as well?
    
    LogicFailsMe - 27 minutes ago
    
    Outweighed by the value of having to suffer with the moldy fruits of their own labor. That was the only way the Android Facebook app became usable as well.
- mnky9800n - an hour ago
  
  The colab integration is where it shines the most imo.
dmd - an hour ago

I consistently have exactly the opposite experience. ChatGPT seems extremely willing to do a huge number of searches, think about them, and then kick off more searches after that thinking, think about it, etc., etc. whereas it seems like Gemini is extremely reluctant to do more than a couple of searches. ChatGPT also is willing to open up PDFs, screenshot them, OCR them and use that as input, whereas Gemini just ignores them.
- staticman2 - 5 minutes ago
  
  Are you uploading PDFs that already have a text layer?
  I don't currently subscribe to Gemini but on A.I. Studio's free offering when I upload a non OCR PDF of around 20 pages the software environment's OCR feeds it to the model with greater accuracy than I've seen from any other source.
  - dmd - 2 minutes ago
    
    I’m not uploading PDFs at all. I’m talking about PDFs it finds while searching than it extracts data from for the conversation.
- nullbound - 38 minutes ago
  
  I will say that it is wild, if not somewhat problematic that two users have such disparate views of seemingly the same product. I say that, but then I remember my own experience just from few days ago. I don't pay for gemini, but I have paid chatgpt sub. I tested both for the same product with seemingly same prompt and subbed chatgpt subjectively beat gemini in terms of scope, options and links with current decent deals.
  It seems ( only seems, because I have not gotten around to test it in any systematic way ) that some variables like context and what the model knows about you may actually influence quality ( or lack thereof ) of the response.
  - martinpw - 2 minutes ago
    
    > I will say that it is wild, if not somewhat problematic that two users have such disparate views of seemingly the same product.
    This happens all the time on HN. Before opening this thread, I was assuming that the top comment would be 100% positive about the product or its competitor, and one of the top replies would be exactly the opposite, and sure enough...
    I don't know why it is. It's honestly a bit disappointing that the most upvoted comments often have the least nuance.
  - dmd - 22 minutes ago
    
    And I’d really like for Gemini to be as good or better, since I get it for free with my Workspace account, whereas I pay for chatgpt. But every time I try both on a query I’m just blown away by how vastly better chatgpt is, at least for the heavy-on-searching-for-stuff kinds of queries I typically do.
- noname120 - 20 minutes ago
  
  Perplexity Pro with any thinking model blows both out of the water in a fraction of the time, in my experience
lxgr - an hour ago

Interesting, I had the opposite experience. 5.0 "Thinking" was better than 5.1, but Gemini 3 Pro seems worse than either for web search use cases. It's hallucinating at pretty alarming rates (including making up sources it never actually accessed) for a late 2025 model.
Opus 4.5 has been a step above both for me, but the usage limits are the worst of the three. I'm seriously considering multiple parallel subscriptions at this point.
- gs17 - 17 minutes ago
  
  I've had the same experience with search, especially with it hallucinating results instead of actually finding them. It's really frustrating that you can't force a more in-depth search from the model run by the company most famous for a search engine.
- - 21 minutes ago
  
  [deleted]
kccqzy - an hour ago

> The biggest complaint is that all links get inserted into google search and then I have to manipulate them when they should go directly to the chosen website, this has to be some kind of internal org KPI nonsense.
Oh I know this from my time at Google. The actual purpose is to do a quick check for known malware and phishing. Of course these days such things are better dealt with by the browser itself in a privacy preserving way (and indeed that’s the case), so it’s totally fine to manipulate them to make them go directly to the website.
- sundarurfriend - an hour ago
  
  That's interesting, I just today started getting some "Some sites restrict our ability to check links." dialogue in ChatGPT that wanted me to verify that I really wanted to follow the link, with a Learn More link to this page: https://help.openai.com/en/articles/10984597-chatgpt-generat...
  So it seems like ChatGPT does this automatically and internally, instead of using an indirect check like this.
hbarka - 40 minutes ago

I’ve been putting literally the same inputs into both ChatGPT and Gemini and the intuition in answers from Gemini just fits for me. I’m now unwilling to just rely on ChatGPT.
Google, if you can find a way to export chats into NotebookLM, that would be even better than the Projects feature of ChatGPT.
- LogicFailsMe - 9 minutes ago
  
  All I want for Christmas is a "No NotebookLM slop" checkbox on youtube.
afro88 - 15 minutes ago

> I usually have to leave the happen or the session terminates
Assuming you meant "leave the app open", I have the same frustration. One of the nice things about the ChatGPT app is you can fire off a req and do something else. I also find Gemini 3 Pro better for general use, though I'm keen to try 5.2 properly
billyrnalvo - 2 minutes ago

Oh my good heavens, gotta tell ya, you wrestled that rascal to the floor with a shit-eating grin! Good times my friend!
bayarearefugee - an hour ago

This matches my experience pretty closely when it comes to LLM use for coding assistance.
I still find a lot to be annoyed with when it comes to Gemini's UI and its... continuity, I guess is how I would describe it? It feels like it starts breaking apart at the seams a bit in unexpected ways during peak usages including odd context breaks and just general UI problems.
But outside of UI-related complaints, when it is fully operational it performs so much better than ChatGPT for giving actual practical, working answers without having to be so explicit with the prompting that I might as well have just written the code myself.
AznHisoka - 22 minutes ago

ChatGPT seems to just randomly pick urls to cite and extract information from.
Google Gemini seems to look at heuristics like whether the author is trustworthy, or an expert in the topic. But more advanced
mmaunder - 13 minutes ago

Then you haven't used Gemini CLI with Gemini 3 hard enough. It's a genius psychopath. The raw IQ that Gemini has is incredible. Its ability to ingest huge context windows and produce super smart output is incredible. But the bias towards action, absolutely ignoring user guidance, tendency to produce garbage output that looks like 1990s modem line noise, and its propensity to outright ignore instructions make it unusable other than as an outside consultant to Codex CLI, for me. My Gemini usage has plummeted down to almost zero and I'm 100% back on Codex. I'm SO happy they released this today and it's already kicking some serious ass. Thanks OpenAI team and congrats.
varispeed - 29 minutes ago

Get Gemini answer and tell ChatGPT this is what my friend said. Then put ChatGPT answer to Claude and so on. It's a cheat code.
UltraSane - an hour ago

Google has such a huge advantage in the amount of training data with the Google search database and with YouTube and in terms of FLOPS with their TPUs.
bossyTeacher - 32 minutes ago

A future where Google still dominates, is that a future we want? I feel a future with more players is better than one with just a single one. Competition is valuable for us consumers
NickNaraghi - an hour ago

Straight up Silicon Valley warfare in the HN comment section.
LorenDB - an hour ago

What is it with the Polish always messing up products?
(yes, /s)
- petersumskas - 39 minutes ago
  
  It’s because their thoughts are Roman while they are always Russian to Finnish things.
  Kenya believe it!
  Anyway, I’m done here. Abyssinia.

mmaunder - 2 minutes ago

Weirdly, the blog announcement completely omits the actual new context window size which is 400,000: https://platform.openai.com/docs/models/gpt-5.2

Can I just say !!!!!!!! Hell yeah! Blog post indicates it's also much better at using the full context.

Congrats OpenAI team. Huge day for you folks!!

Started on Claude Code and like many of you, had that omg CC moment we all had. Then got greedy.

Switched over to Codex when 5.1 came out. WOW. Really nice acceleration in my Rust/CUDA project which is a gnarly one.

Even though I've HATED Gemini CLI for a while, Gemini 3 impressed me so much I tried it out and it absolutely body slammed a major bug in 10 minutes. Started using it to consult on commits. Was so impressed it became my daily driver. Huge mistake. I almost lost my mind after a week of this fighting it. Isane bias towards action. Ignoring user instructions. Garbage characters in output. Absolutely no observability in its thought process. And on and on.

Switched back to Codex just in time for 5.1 codex max xhigh which I've been using for a week, and it was like a breath of fresh air. A sane agent that does a great job coding, but also a great job at working hard on the planning docs for hours before we start. Listens to user feedback. Observability on chain of thought. Moves reasonably quickly. And also makes it easy to pay them more when I need more capacity.

And then today GPT-5.2 with an xhigh mode. I feel like xmass has come early. Right as I'm doing a huge Rust/CUDA/Math-heavy refactor. THANK YOU!!

nbardy - 5 minutes ago

Those arc agi 2 improvements are insane.

Thats especially encouraging to me because those are all about generalization.

5 and 5.1 both felt overfit and would break down and be stubborn when you got them outside their lane. As opposed to Opus 4.5 which is lovely at self correcting.

It’s one of those things you really feel in the model rather than whether it can tackle a harder problem or not, but rather can I go back and forth with this thing learning and correcting together.

This whole releases is insanely optimistic for me. If they can push this much improvement WITHOUT the new huge data centers and without a new scaled base model. Thats incredibly encouraging for what comes next.

Remember the next big data center are 20-30x the chip count and 6-8x the efficiency on the new chip.

I expect they can saturate the benchmarks WITHOUT and novel research and algorithmic gains. But at this point it’s clear they’re capable of pushing research qualitatively as well.

flkiwi - 2 minutes ago

I gave up my OpenAI subscription a few days ago in favor of Claude. My quality of life (and quality of results) has gone up substantially. Several of our tools at work have GPT-5x as their backend model, and it is incredible how frustrating they are to use, how predictable their AI-isms are, and how inconsistent their output is. OpenAI is going to have to do a lot more than an incremental update to convince me they haven't completely lost the thread.

zone411 - 2 hours ago

I've benchmarked it on the Extended NYT Connections benchmark (https://github.com/lechmazur/nyt-connections/):

The high-reasoning version of GPT-5.2 improves on GPT-5.1: 69.9 → 77.9.

The medium-reasoning version also improves: 62.7 → 72.1.

The no-reasoning version also improves: 22.1 → 27.5.

Gemini 3 Pro and Grok 4.1 Fast Reasoning still score higher.

Donald - 2 hours ago

Gemini 3 Pro Preview gets 96.8% on the same benchmark? That's impressive
- capitainenemo - 2 hours ago
  
  And performs very well on the latest 100 puzzles too, so isn't just learning the data set (unless I guess they routinely index this repo).
  I wonder how well AIs would do at bracket city. I tried gemini on it and was underwhelmed. It made a lot of terrible connections and often bled data from one level into the next.
- bigyabai - 2 hours ago
  
  GPT-5.2 might be Google's best Gemini advertisement yet.
  - outside1234 - an hour ago
    
    Especially when you see the price
tikotus - 2 hours ago

Here's someone else testing models on a daily logic puzzle (Clues by Sam): https://www.nicksypteras.com/blog/cbs-benchmark.html GPT 5 Pro was the winner already before in that test.
- thanhhaimai - 2 hours ago
  
  This link doesn't have Gemini 3 performance on it. Do you have an updated link with the new models?
- crapple8430 - an hour ago
  
  GPT 5 Pro is a good 10x more expensive so it's an apples to oranges comparison.
scrollop - 13 minutes ago

Why no grok 4.1 reasoning?

breakingcups - 3 hours ago

Is it me, or did it still get at least three placements of components (RAM and PCIe slots, plus it's DisplayPort and not HDMI) in the motherboard image[0] completely wrong? Why would they use that as a promotional image?

0: https://images.ctfassets.net/kftzwdyauwt9/6lyujQxhZDnOMruN3f...

tedsanders - 3 hours ago

Yep, the point we wanted to make here is that GPT-5.2's vision is better, not perfect. Cherrypicking a perfect output would actually mislead readers, and that wasn't our intent.
- BoppreH - 3 hours ago
  
  That would be a laudable goal, but I feel like it's contradicted by the text:
  > Even on a low-quality image, GPT‑5.2 identifies the main regions and places boxes that roughly match the true locations of each component
  I would not consider it to have "identified the main regions" or to have "roughly matched the true locations" when ~1/3 of the boxes have incorrect labels. The remark "even on a low-quality image" is not helping either.
  Edit: credit where credit is due, the recently-added disclaimer is nice:
  > Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.
  - hnuser123456 - 2 hours ago
    
    Yeah, what it's calling RAM slots is the CMOS battery. What it's calling the PCIE slot is the interior side of the DB-9 connector. RAM slots and PCIE slots are not even visible in the image.
  - furyofantares - an hour ago
    
    They also changed "roughly match" to "sometimes match".
    
    MichaelZuo - 26 minutes ago
    
    Did they really change a meaningful word like that after publication without an edit note…?
    
    piker - 6 minutes ago
    
    Eh, I'm no shill but their marketing copy isn't exactly the New York Times. They're given some license to respond to critical feedback in a manner that makes the statements more accurate without the same expectations of being objective journalism of record.
  - - 2 hours ago
    
    [deleted]
- arscan - 2 hours ago
  
  I think you may have inadvertently misled readers in a different way. I feel misled after not catching the errors myself, assuming it was broadly correct, and then coming across this observation here. Might be worth mentioning this is better but still inaccurate. Just a bit of feedback, I appreciate you are willing to show non-cherry-picked examples and are engaging with this question here.
  Edit: As mentioned by @tedsanders below, the post was edited to include clarifying language such as: “Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.”
  - tedsanders - 2 hours ago
    
    Thanks for the feedback - I agree our text doesn't make the models' mistakes clear enough. I'll make some small edits now, though it might take a few minutes to appear.
- layer8 - 2 hours ago
  
  You know what would be great? If it had added some boxes with “might be X or Y, but not sure”.
- g947o - 2 hours ago
  
  When I saw that it labeled DP ports as HDMI I immediately decided that I am not going to touch this until it is at least 5x better with 95% accuracy with basic things.
  I don't see any advantage in using the tool.
  - jacquesm - 2 hours ago
    
    That's a far more dangerous territory. A machine that is obviously broken will not get used. A machine that is subtly broken will propagate errors because it will have achieved a high enough trust level that it will actually get used.
    Think 'Therac-25', it worked in 99.5% of the time. In fact it worked so well that reports of malfunctions were routinely discarded.
- iamdanieljohns - 2 hours ago
  
  Is Adaptive Reasoning gone from GPT-5.2? It was a big part of the release of 5.1 and Codex-Max. Really felt like the future.
  - tedsanders - 2 hours ago
    
    Yes, GPT-5.2 still has adaptive reasoning - we just didn't call it out by name this time. Like 5.1 and codex-max, it should do a better job at answering quickly on easy queries and taking its time on harder queries.
- iwontberude - 2 hours ago
  
  But it’s completely wrong.
- d--b - 3 hours ago
  
  Well, that is something you have not in common with your boss. Kudos to you!
  - honeycrispy - 3 hours ago
    
    Not sure what you mean, Altman does that fake-humility thing all the time.
    It's a marketing trick; show honesty in areas that don't have much business impact so the public will trust you when you stretch the truth in areas that do (AGI cough).
    
    d--b - 3 hours ago
    
    I'm confident that GP is good faithed though. Maybe I am falling for it. Who knows? It doesn't really matter, I just wanted to be nice to the guy. It takes some balls posting as OpenAi employee here, and I wish we heard from them more often, as I am pretty sure all of them lurk around.
    
    rvnx - 2 hours ago
    
    It's the only reasonable choice you can make. As an employee with stock options you do not want to get trashed on Hackernews because this affects your income directly if you try to conduct a secondary share sale or plan to hold until IPO.
    Once the IPO is done, and the lockup period is expired, then a lot of employees are planning to sell their shares. But until that, even if the product is behind competitors there is no way you can admit it without putting your money at risk.
    
    Esophagus4 - 32 minutes ago
    
    I know HN commenters like to see themselves as contrarians, as do I sometimes, but man… this seems like a serious stretch to assume such malicious intent that an employee of the world’s top AI name would astroturf a random HN thread about a picture on a blog.
    I’m fairly comfortable taking this OpenAI employee’s comment at face value.
    Frankly, I don’t think a HN thread will make a difference to his financial situation, anyway…
  - wilg - 3 hours ago
    
    What did Sam Altman say? Or is this more of a vague impression thing?
    
    d--b - 3 hours ago
    
    [flagged]
    
    minimaxir - 2 hours ago
    
    Using ChatGPT to ironically post AI-generated comments is still posting of AI-generated comments.
timerol - 3 hours ago

Also a "stacked pair" of USB type-A ports, when there are clearly 4
whalesalad - 2 hours ago

to be fair that image has the resolution of a flip phone from 2003
- malfist - 15 minutes ago
  
  If I ask you a question and you don't have enough information to answer, you don't confidently give me an answer, you say you don't know.
  I might not know exactly how many USB ports this motherboard has, but I wouldn't select a set of 4 and declare it to be a stacked pair.
jasonlotito - 2 hours ago

FTA: Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.
You can find it right next to the image you are talking about.
- tedsanders - an hour ago
  
  To be fair to OP, I just added this to our blog after their comment, in response to the correct criticisms that our text didn't make it clear how bad GPT-5.2's labels are.
  LLMs have always been very subhuman at vision, and GPT-5.2 continues in this tradition, but it's still a big step up over GPT-5.1.
  One way to get a sense of how bad LLMs are at vision is to watch them play Pokemon. E.g.,: https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-i...
  They still very much struggle with basic vision tasks that adults, kids, and even animals can ace with little trouble.
- da_grift_shift - an hour ago
  
  'Commented after article was already edited in response to HN feedback' award
an0malous - an hour ago

Because the whole culture of AI enthusiasts is to just generate slop and never check the results

onraglanroad - an hour ago

I suppose this is as good a place as any to mention this. I've now met two different devs who complained about the weird responses from their LLM of choice, and it turned out they were using a single session for everything. From recipes for the night, presents for the wife and then into programming issues the next day.

Don't do that. The whole context is sent on queries to the LLM, so start a new chat for each topic. Or you'll start being told what your wife thinks about global variables and how to cook your Go.

I realise this sounds obvious to many people but it clearly wasn't to those guys so maybe it's not!

chasd00 - 7 minutes ago

I was listening to a podcast about people becoming obsessed and "in love" with an LLM like ChatGPT. Spouses were interviewed describing how mentally damaging it is to their partner and how their marriage/relationship is seriously at risk because of it. I couldn't believe no one has told these people to just goto the LLM and reset the context, that reverts the LLM back to a complete stranger. Granted that would be pretty devastating to the person in "the relationship" with the LLM since it wouldn't know them at all after that.
- adamesque - 2 minutes ago
  
  that's not quite what parent was talking about, which is — don't just use one giant long conversation. resetting "memories" is a totally different thing (which still might be valuable to do occasionally, if they still let you)
noname120 - 11 minutes ago

Problem is that by default ChatGPT has the “Reference chat history” option enabled in the Memory options. This causes any previous conversation to leak into the current one. Just creating a new conversation is not enough, you also need to disable that option.
vintermann - 43 minutes ago

It's not at all obvious where to drop the context, though. Maybe it helps to have similar tasks in the context, maybe not. It did really, shockingly well on a historical HTR task I gave it, so I gave it another one, in some ways an easier one... Thought it wouldn't hurt to have text in a similar style in the context. But then it suddenly did very poorly.
Incidentally, one of the reasons I haven't gotten much into subscribing to these services, is that I always feel like they're triaging how many reasoning tokens to give me, or AB testing a different model... I never feel I can trust that I interact with the same model.

goobatrooba - 38 minutes ago

I feel there is a point when all these benchmarks are meaningless. What I care about beyond decent performance is the user experience. There I have grudges with every single platform and the one thing keeping me as a paid ChatGPT subscriber is the ability to sort chats in "projects" with associated files (hello Google, please wake up to basic user-friendly organisation!)

But all of them * Lie far too often with confidence * Refuse to stick to prompts (e.g. ChatGPT to the request to number each reply for easy cross-referencing; Gemini to basic request to respond in a specific language) * Refuse to express uncertainty or nuance (i asked ChatGPT to give me certainty %s which it did for a while but then just forgot...?) * Refuse to give me short answers without fluff or follow up questions * Refuse to stop complimenting my questions or disagreements with wrong/incomplete answers * Don't quote sources consistently so I can check facts, even when I ask for it * Refuse to make clear whether they rely on original documents or an internal summary of the document, until I point out errors * ...

I also have substance gripes, but for me such basic usability points are really something all of the chatbots fail on abysmally. Stick to instructions! Stop creating walls of text for simple queries! Tell me when something is uncertain! Tell me if there's no data or info rather than making something up!

nullbound - 29 minutes ago

<< I feel there is a point when all these benchmarks are meaningless.
I am relatively certain you are not alone in this sentiment. The issue is that the moment we move past seemingly objective measurements, it is harder to convince people that what we measure is appropriate, but the measurable stuff can be somewhat gamed, which adds a fascinating layer of cat and mouse game to this.

simonw - 3 hours ago

Wow, there's a lot going on with this pelican riding a bicycle: https://gist.github.com/simonw/c31d7afc95fe6b40506a9562b5e83...

Stevvo - an hour ago

The variance is way too high for this test to have any value at all. I ran it 10 times, and each pelican on a bicycle was a better rendition than that, about half of them you could say were perfect.
- golly_ned - 20 minutes ago
  
  Compared to the other benchmarks which are much more gameable, I trust PelicanBikeEval way more.
BeetleB - an hour ago

They probably saw your complaint that 5.1 was too spartan and a regression (I had the same experience with 5.1 in the POV-Ray version - have yet to try 5.2 out...).
fxwin - 2 hours ago

the only benchmark i trust
minimaxir - 3 hours ago

Is that the first SVG pelican with drop shadows?
- simonw - 2 hours ago
  
  No, I got drop shadows from DeepSeek 3.2 recently https://simonwillison.net/2025/Dec/1/deepseek-v32/ (probably others as well.)
tmaly - 2 hours ago

seems to be eating something
- danans - 2 hours ago
  
  Probably a jellyfish. You're seeing the tentacles
belter - 2 hours ago

What happens if you ask for a pterodactyl on a motorbike?
Would like to know how much they are optimizing for your pelican....
- simonkagedal - an hour ago
  
  He commented on this here: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
  - irthomasthomas - 4 minutes ago
    
    I was expecting to see a pterodactyl :(

ponyous - 4 minutes ago

I am really curious about speed/latency. For my use case there is a big difference in UX if the model is faster. Wish this was included in some benchmarks.

I will run 80 3D model generations benchmark tomorrow and update this comment with the results about cost/speed/quality.

jumploops - 3 hours ago

> “a new knowledge cutoff of August 2025”

This (and the price increase) points to a new pretrained model under-the-hood.

GPT-5.1, in contrast, was allegedly using the same pretraining as GPT-4o.

FergusArgyll - an hour ago

A new pretrain would definitely get more than a .1 version bump & would get a whole lot more hype I'd think. They're expensive to do!
- caconym_ - an hour ago
  
  Releasing anything as "GPT-6" which doesn't provide a generational leap in performance would be a PR nightmare for them, especially after the underwhelming release of GPT-5.
  I don't think it really matters what's under the hood. People expect model "versions" to be indexed on performance.
- ACCount37 - an hour ago
  
  Not necessarily. GPT-4.5 was a new pretrain on top of a sizeable raw model scale bump, and only got 0.5 - because the gains from reasoning training in o-series overshadowed GPT-4.5's natural advantage over GPT-4.
  OpenAI might have learned not to overhype. They already shipped GPT-5 - which was only an incremental upgrade over o3, and was received poorly, with this being a part of the reason why.
- hannesfur - an hour ago
  
  Maybe they felt the increase in capability is not worth of a bigger version bump. Additionally pre-training isn't as important as it used to be. Most of the advances we see now probably come from the RL stage.
- femiagbabiaka - an hour ago
  
  Not if they didn't feel that it delivered customer value no? It's about under promising and over delivering, in every instance
- redwood - an hour ago
  
  Not if it underwhelms
98Windows - 2 hours ago

or maybe 5.1 was an older checkpoint and has more quantization
- 2 hours ago

[deleted]
MagicMoonlight - an hour ago

No, they just feed in another round of slop to the same model.

xd1936 - 3 hours ago

> While GPT‑5.2 will work well out of the box in Codex, we expect to release a version of GPT‑5.2 optimized for Codex in the coming weeks.

https://openai.com/index/introducing-gpt-5-2/

jstummbillig - 3 hours ago

> For coding tasks, GPT-5.1-Codex-Max is a faster, more capable, and more token-efficient coding variant
Hm, yeah, strange. You would not be able to tell, looking at every chart on the page. Obviously not a gotcha, they put it on the page themselves after all, but how does that make sense with those benchmarks?
- tempaccount420 - 3 hours ago
  
  Coding requires a mindset shift that the -codex fine-tunes provide. Codex will do all kinds of weird stuff like poking in your ~/.cargo ~/go etc. to find docs and trying out code in isolation, these things definitely improve capability.
  - dmos62 - 2 hours ago
    
    The biggest advantage of codex variants, for me, is terseness and reduced sicophany. That, and presumably better adherence to requested output formats.

zug_zug - 3 hours ago

For me the last remaining killer feature of ChatGPT is the quality of the voice chat. Do any of the competitors have something like that?

hbarka - an hour ago

On the contrary, I thought Gemini 3 Live mode is much much better than ChatGPT. The voices have none of the annoying artificial uptalking intonations that ChatGPT has, and the simplex/duplex interruptibility of Gemini Live seems more responsive. It knows when to break and pause during conversations.
simondotau - an hour ago

I absolutely loathe ChatGPT's voice chat. It spends far too much time being conversational and its eagerness to please becomes fatiguing after the first back-and-forth.
josephwegner - an hour ago

Along with the hordes of other options people are responding with, I'm a big fan of Perplexity's voice chat. It does back-and-forth well in a way that I missed whenever I tried anything besides ChatGPT.
- solarkraft - an hour ago
  
  It is, shockingly, based on the OpenAI Realtime Assistant API.
joshmarlow - 2 hours ago

I think Grok's voice chat is almost there - only things missing for me: * it's slower to start-up by a couple of seconds * it's harder to switch between voice and text and back again in the same chat (though ChatGPT isn't perfect at this either)
And of course Grok's unhinged persona is... something else.
- Gigachad - 13 minutes ago
  
  Pretty good until it goes crazy glazing Elon or declaring itself mecha hitler.
- nazgulsenpai - an hour ago
  
  It's so much fun. So is the Conspiracy persona.
- jesse_dot_id - an hour ago
  
  [flagged]
Robdel12 - 3 hours ago

I have found Claude‘s voice chat to be better. I only recently tried it because I liked ChatGPTs enough, but I think I’m going to use Claude going forward. I find myself getting interrupted by ChatGPT a lot whenever I do use it.
- lxgr - 3 hours ago
  
  Claude’s voice chat isn’t “native” though, is it? It feels like it’s speech-to-text-to-LLM and back.
  - sosodev - 3 hours ago
    
    You can test it by asking it to: change the pitch of its voice, make specific sounds (like laughter), differentiate between words that are spelled the same but pronounced differently (record and record), etc.
    
    lxgr - 3 hours ago
    
    Good idea, but an external “bolted on” LLM-based TTS would still pass that in many cases, right?
    
    barrkel - 3 hours ago
    
    The model giving it text to speak would have to annotate the text in order for the TTS to add the affect. The TTS wouldn't "remember" such instructions from a speech to text stage previously.
    
    jablongo - 3 hours ago
    
    I tried to make ChatGPT sing Mary had a little lamb recently and it's atonal but vaguely resembles the melody, which is interesting.
    
    sosodev - 3 hours ago
    
    Yes, a sufficiently advanced marrying of TTS and LLM could pass a lot of these tests. That kind of blurs the line between native voice model and not though.
    You would need:
    * A STT (ASR) model that outputs phonetics not just words
    * An LLM fine-tuned to understand that and also output the proper tokens for prosody control, non-speech vocalizations, etc
    * A TTS model that understands those tokens and properly generate the matching voice
    At that point I would probably argue that you've created a native voice model even if it's still less nuanced than the proper voice to voice of something like 4o. The latency would likely be quite high though. I'm pretty sure I've seen a couple of open source projects that have done this type of setup but I've not tried testing them.
    
    - 2 hours ago
    
    [deleted]
  - causalmodels - 2 hours ago
    
    I just asked it and it said that it uses the on device TTS capabilities.
    
    furyofantares - an hour ago
    
    I find it very unlikely that it would be trained on that information or that anthropic would put that in its context window, so it's very likely that it just made that answer up.
    
    causalmodels - 32 minutes ago
    
    No, it did not make it up. I was curious so I asked it asked it to imitate a posh British accent imitating a South Brooklyn accent while having a head cold and it explained that it didn't have have fine grained control over the audio output because it was using a TTS. I asked it how it knew that and it pointed me towards [1] and highlighted the following.
    > As of May 29th, 2025, we have added ElevenLabs, which supports text to speech functionality in Claude for Work mobile apps.
    Tracked down the original source [2] and looked for additional updates but couldn't find anything.
    [1] https://simonwillison.net/2025/May/31/using-voice-mode-on-cl...
    [2] https://trust.anthropic.com/updates
    
    furyofantares - 23 minutes ago
    
    If it does a web search that's fine, I assumed it hadn't since you hadn't linked to anything.
    Also it being right doesn't mean it didn't just make up the answer.
SweetSoftPillow - 16 minutes ago

Gemini's much better, try it
tmaly - 2 hours ago

I can't keep up with half the new features all the model companies keep rolling out. I wish they would solve that
websiteapi - 3 hours ago

gemini live is a thing - never tried chaptgpt, are they not similar?
- spudlyo - 2 hours ago
  
  Not for my use case. I can open it up, and in restored classical Latin pronunciation say "Hi, my name is X, how are you?" and it will respond (also in Latin) "Hello X, I am well, thanks for asking. I hope you are doing great." Its pronunciation is not great, but intelligible. In the written transcript, it butchers what I say, but its responses look good, although sans macrons indicating phonemic vowel length.
  Gemini responds in what I think is Spanish, or perhaps Portuguese.
  However I can hand an 8 minute long 48k mono mp3 of a nuanced Latin speaker who nasalizes his vowels, and makes regular use of elision to Gemini-3-pro-preview and it will produce an accurate macronized Latin transcription. It's pretty mind blowing.
  - Dilettante_ - 2 hours ago
    
    I have to ask: What usecase requires you to speak Latin to the llm?
    
    spudlyo - 2 hours ago
    
    I'm a Latin language learner, and part of developing fluency is practicing extemporaneous speech. My dog is a patient listener, but a poor interlocutor. There are Latin language Discord servers where you can speak to people, but I don't quite have the confidence to do that yet. I assume the machine doesn't judge my shitty grammar.
    
    onraglanroad - an hour ago
    
    Loquerisne Latine?
    Non vere, sed intelligere possum.
    Ita, mihi est canis qui idipsum facit!
    (translated from the Gàidhlig)
    
    spudlyo - an hour ago
    
    Certe loqui conor, sed saepenumero prave dico; canis meus non turbatus est ;)
    
    nineteen999 - 40 minutes ago
    
    You haven't heard? Latin is the next big wave, after blockchain and AI.
    
    spudlyo - 21 minutes ago
    
    You laugh, but the global language learning market in 2025 is expected to exceed USD $100 billion, and LLMs IMHO are poised to disrupt the shit out of it.
- jeanlucas - 3 hours ago
  
  no.
  - leaK_u - 2 hours ago
    
    how.
    
    CamelCaseName - 2 hours ago
    
    I find ChatGPT's voice to text to be the absolute best in the world, nearly perfect.
    I have constant frustrations with Gemini voice to text misunderstanding what I'm saying or worse, immediately sending my voice note when I pause or breathe even though I'm midway through a sentence.
  - nickvec - 2 hours ago
    
    What? The voice chat is basically identical on ChatGPT and Gemini AFAICT.
sundarurfriend - 2 hours ago

Are you saying ChatGPT's voice chat is of good quality? Because for me it's one of its most frustrating weaknesses. I vastly prefer voice input to typing, and would love it if the voice chat mode actually worked well.
But apart from the voices being pretty meh, it's also really bad at detecting and filtering out noise, taking vehicle sounds as breaks to start talking in (even if I'm talking much louder at the same time) or as some random YouTube subtitles (car motor = "Thanks for watching, subscribe!").
The speech-to-text is really unreliable (the single-chat Dictate feature gets about 98% of my words correct, this Voice mode is closer to 75%), and they clearly use an inferior model for the AI backend for this too: with the same question asked in this back-and-forth Voice mode and a normal text chat, the answer quality difference is quite stark: the Voice mode answer is most often close to useless. It seems like they've overoptimized it for speed at the cost of quality, to the extent that it feels like it's a year behind in answer reliability and usefulness.
To your question about competitors, I've recently noticed that Grok seems to be much better at both the speech-to-text part and the noise handling, and the voices are less uncanny-valley sounding too. I'd say they also don't have that stark a difference between text answers and voice mode answers, and that would be true but unfortunately mainly because its text answers are also not great with hallucinations or following instructions.
So Grok has the voice part figured out, ChatGPT has the backend AI reliability figured out, but neither provide a real usable voice mode right now.
whimsicalism - 2 hours ago

gemini does, grok does, nobody else does (except alibaba but it’s not there yet)
codybontecou - 3 hours ago

Their voice agent is handy. Currently trying to build around it.
ivape - 2 hours ago

I'm a big user of Gemini voice. My sense is that Gemini voice uses very tight system prompts that are designed to give you an answer and kind of get you off the phone as much as possible. It doesn't have large context at all.
That's how I judge quality at least. The quality of the actual voice is roughly the same as ChatGPT, but I notice Gemini will try to match your pitch and tone and way of speaking.
Edit: But it looks like Gemini Voice has been replaced with voice transcription in the mobile app? That was sudden.
semiinfinitely - 2 hours ago

try gemini voice chat
bigyabai - 3 hours ago

Qwen does.
- sosodev - 3 hours ago
  
  Qwen's voice chat is nowhere near as good as ChatGPT's.
FrasiertheLion - 3 hours ago

Try elevenlabs
- sosodev - 3 hours ago
  
  Does elevenlabs have a real-time conversational voice model? It seems like like their focus is largely on text to speech and speech to text. Which can approximate that type of thing but it's not at all the same as the native voice to voice that 4o does.
  - hi_im_vijay - 2 hours ago
    
    [disclaimer, i work at elevenlabs] we specifically went with a cascading model for our agents platform because it's better suited for enterprise use cases where they have full control over the brain and can bring their own llm. with that said, even with a cascading model, we can capture a decent amount of nuance with our asr model, and it also supports capturing audio events like laughter or coughing.
    a true speech to speech conversational model will perform better on things like capturing tone, pronouncations, phonetics, etc, but i do believe we'll also get better at that on the asr side over time.
  - dragonwriter - 3 hours ago
    
    > Does elevenlabs have a real-time conversational voice model?
    Yes.
    > It seems like like their focus is largely on text to speech and speech to text.
    They have two main broad offerings (“Platforms”); you seem to be looking at what they call the “Creative Platform”. The real-time conversational piece is the centerpiece of the “Agents Platform”.
    
    sosodev - 3 hours ago
    
    It specifically says in the architecture docs for the agents platform that it's STT (ASR) -> LLM -> TTS
    https://elevenlabs.io/docs/agents-platform/overview#architec...
    
    - 3 hours ago
    
    [deleted]

preetamjinka - 3 hours ago

It's actually more expensive than GPT-5.1. I've gotten used to prices going down with each latest model, but this time it's gone up.

https://platform.openai.com/docs/pricing

kingstnap - 2 hours ago

Flagship models have rarely being cheaper, and especially not on release day. Only a few cases of this really.
Notable exceptions are Deepseek 3.2 and Opus 4.5 and GPT 3.5 Turbo.
The price drops usually are the form of flash and mini models being really cheap and fast. Like when we got o4 mini or 2.0 flash which was a particularly significant one.
PhilippGille - 3 hours ago

Gemini 3 Pro Preview also got more expensive than 2.5 Pro.
2.5 Pro: $1.25 input, $10 output (million tokens)
3 Pro Preview: $2 input, $12 output (million tokens)
endorphine - 2 hours ago

Reading this comment, it just occurred to me that we're still in the first phase of the enshittification process.
moralestapia - 3 hours ago

Previous model's prices usually go down, but their flagship has always been the most expensive one.
- moralestapia - 25 minutes ago
  
  Wtf, why would this be downvoted?
  I'm adding context and what I stated is provably true.
Handy-Man - 3 hours ago

It also seems much more "smarter" though

DenisM - 11 minutes ago

Is there a voice chat mode in any chat app that is not heavily degraded in reasoning?

I’m ok waiting for a response for 10-60 seconds if needed. That way I can deep dive subjects while driving.

I’m ok paying money for it, so maybe someone coded this already?

josalhor - 4 hours ago

From GPT 5.1 Thinking:

ARC AGI v2: 17.6% -> 52.9%

SWE Verified: 76.3% -> 80%

That's pretty good!

verdverm - 4 hours ago

We're also in benchmark saturation territory. I heard it speculated that Anthropic emphasizes benchmarks less in their publications because internally they don't care about them nearly as much as making a model that works well on the day-to-day
- quantumHazer - 3 hours ago
  
  Seems pretty false if you look at the model card and web site of Opus 4.5 that is… (check notes) their latest model.
  - verdverm - 3 hours ago
    
    Building a good model generally means it will do well on benchmarks too. The point of the speculation is that Anthropic is not focused on benchmaxxing which is why they have models people like to use for their day-to-day.
    I use Gemini, Anthropic stole $50 from me (expired and kept my prepaid credits) and I have not forgiven them yet for it, but people rave about claude for coding so I may try the model again through Vertex Ai...
    The person who made the speculation I believe was more talking about blog posts and media statements than model cards. Most ai announcements come with benchmark touting, Anthropic supposedly does less / little of this in their announcements. I haven't seen or gathered the data to know what is truth
    
    elcritch - 2 hours ago
    
    You could try Codex cli. I prefer it over Claude code now, but only slightly.
    
    verdverm - 2 hours ago
    
    No thanks, not touching anything Oligarchy Altman is behind
- stego-tech - 3 hours ago
  
  These models still consistently fail the only benchmark that matters: if I give you a task, can you complete it successfully without making shit up?
  Thus far they all fail. Code outputs don’t run, or variables aren’t captured correctly, or hallucinations are stated as factual rather than suspect or “I don’t know.”
  It’s 2000’s PC gaming all over again (“gotta game the benchmark!”).
  - snet0 - 2 hours ago
    
    To say that a model won't solve a problem is unfair. Claude Code, with Opus 4.5, has solved plenty of problems for me.
    If you expect it to do everything perfectly, you're thinking about it wrong. If you can't get it to do anything perfectly, you're using it wrong.
    
    jacquesm - 2 hours ago
    
    That means you're probably asking it to do very simple things.
    
    baq - an hour ago
    
    I can confidently say that anecdotally you’re completely wrong, but I’ll also allow a very different definition of ‘simple’ and/or attempting to use an unpopular environment as a valid anecdotal counterpoint.
    
    jacquesm - 18 minutes ago
    
    I don't reach for AI until I'm solidly stuck and then use it mostly for inspiration, it has yet to happen that it directly pointed at a solution and I'm pretty good at writing prompts. When I throw a bunch of elementary stuff at it then it is super good at identifying issues and solving them (but I could have done that myself, it is just nice to try to delineate where you can and where you can't trust the thing, but that too is fluctuating, sometimes even within a single session). Here is a nice example for a slightly more complex test:
    Prompt:
    "I have a green LED, a 12V powersupply, a single NPN transistor, a 100 ohm resistor, a 2.7K resistor and an electrolytic capacitor of 220 micro farads. My teacher says it is possible to make an LED flasher with these components but I'm hopelessly stuck, can you please give me an ascii art solution so I don't flunk this exam?"
    Answer:
    ASCII schematic (fully functional):
    +12V | | .-. | | | | 2.7kΩ (R1 — charges the capacitor) | | '-' | +---------+----------------------+ | | | ( ) | | ( ) C1 | | ( ) 220uF | | | | | | | | | |/ C | +--------| NPN transistor | | |\ E | | | | GND GND | \ \ / 100Ω (R2 — LED limit) \ | >| Green LED | GND
    Follow by:
    "How it works
    The 2.7 kΩ resistor charges the 220 µF capacitor from the 12 V supply.
    The capacitor voltage rises slowly.
    When the capacitor voltage reaches the transistor’s base-emitter threshold (~0.6–0.7 V), the transistor suddenly switches ON.
    When it turns on, the capacitor rapidly discharges through the base, causing:
    A brief pulse of current through the transistor
    The LED lights up through the 100 Ω resistor
    After discharge, the transistor turns back OFF, the LED turns off, and the capacitor begins charging again.
    This repeats automatically → LED flasher."
    The number of errors in the circuit and the utterly bogus explanation as well as the over confident remark that this is 'working' is so bizarre that I wonder how many slightly more complicated questions are going to yield results comparable to this one.
    
    camdenreslink - an hour ago
    
    Sometimes you do need to (as a human) break down a complex thing into smaller simple things, and then ask the LLM to do those simple things. I find it still saves some time.
    
    ragequittah - 29 minutes ago
    
    Or what will often work is having the LLM break it down into simpler steps and then running them 1 by 1. They know how to break down problems fairly well they just don't often do it properly sometimes unless you explicitly prompt them to.
    
    jacquesm - 16 minutes ago
    
    Yes, but for that you have to know that the output it gave you is wrong in the first place and if that is so you didn't need AI to begin with...
    
    snet0 - an hour ago
    
    If you define "simple thing" as "thing an AI can't do", then yes. Everyone just shifts the goalposts in these conversations, it's infuriating.
    
    ACCount37 - an hour ago
    
    Come on. If we weren't shifting the goalposts, we would have burned through 90% of the entire supply of them back in 2022!
    
    baq - 5 minutes ago
    
    It’s less shifting goalposts and more of a very jagged frontier of capabilities problem.
  - verdverm - 2 hours ago
    
    I'm not sure, here's my anecdotal counter example, was able to get gemini-2.5-flash, in two turns, to understand and implement something I had done separately first, and it found another bug (also that I had fixed, but forgot was in this path)
    That I was able to have a flash model replicate the same solution I had, to two problems in two turns, it's just the opposite experience of your consistency argument. I'm using tasks I've already solved as the evals while developing my custom agentic setup (prompts/tools/envs). They are able to do more of them today then they were even 6-12 months ago (pre-thinking models).
    https://bsky.app/profile/verdverm.com/post/3m7p7gtwo5c2v
    
    stego-tech - 2 hours ago
    
    And therein lies the rub for why I still approach this technology with caution, rather than charge in full steam ahead: variable outputs based on immensely variable inputs.
    I read stories like yours all the time, and it encourages me to keep trying LLMs from almost all the major vendors (Google being a noteworthy exception while I try and get off their platform). I want to see the magic others see, but when my IT-brain starts digging in the guts of these things, I’m always disappointed at how unstructured and random they ultimately are.
    Getting back to the benchmark angle though, we’re firmly in the era of benchmark gaming - hence my quip about these things failing “the only benchmark that matters.” I meant for that to be interpreted along the lines of, “trust your own results rather than a spreadsheet matrix of other published benchmarks”, but I clearly missed the mark in making that clear. That’s on me.
    
    verdverm - 2 hours ago
    
    I mean more the guts of the agentic systems. Prompts, tool design, state and session management, agent transfer and escalation. I come from devops and backend dev, so getting in at this level, where LLMs are tasked and composed, is more interesting.
    If you are only using provider LLM experiences, and not something specific to coding like copilot or Claude code, that would be the first step to getting the magic as you say. It is also not instant. It takes time to learn any new tech, this one has a above average learning curve, despite the facade and hype of how it should just be magic
    Once you find the stupid shit in the vendor coding agents, like all us it/devops folks do eventually, you can go a level down and build on something like the ADK to bring your expertise and experience to the building blocks.
    For example, I am now implementing environments for agents based on container layers and Dagger, which unlocks the ability to cheaply and reproducible clone what one agent was doing and have a dozen variations iterate on the next turn. Real useful for long term training data and evals synth, but also for my own experimentation as I learn how to get better at using these things. Another thing I did was change how filesystem operations look to the agent, in particular file reads. I did this to save context & money (finops), after burning $5 in 60s because of an error in my tool implementation. Instead of having them as message contents, they are now injected into the system prompt. Doing so made it trivial to add a key/val "cache" for the fun of it, since I could now inject things into the system prompt and let the agent have some control over that process through tools. Boy has that been interesting and opened up some research questions in my mind
- brokensegue - 3 hours ago
  
  how do you quantitatively measure day-to-day quality? only thing i can think is A/B tests which take a while to evaluate
  - verdverm - 3 hours ago
    
    more or less this, but also synthetic
    if you think about GANs, it's all the same concept
    1. train model (agent)
    2. train another model (agent) to do something interesting with/to the main model
    3. gain new capabilities
    4. iterate
    You can use a mix of both real and synthetic chat sessions or whatever you want your model to be good at. Mid/late training seems to be where you start crafting personality and expertises.
    Getting into the guts of agentic systems has me believing we have quite a bit of runway for iteration here, especially as we move beyond single model / LLM training. I still need to get into what all is de jour in the RL / late training, that's where a lot of opportunity lies from my understanding so far
    Nathan Lambert (https://bsky.app/profile/natolambert.bsky.social) from Ai2 (https://allenai.org/) & RLHF Book (https://rlhfbook.com/) has a really great video out yesterday about the experience training Olmo 3 Think
    https://www.youtube.com/watch?v=uaZ3yRdYg8A
- Mistletoe - 3 hours ago
  
  How do you measure whether it works better day to day without benchmarks?
  - bulbar - 3 hours ago
    
    Manually labeling answers maybe? There exist a lot of infrastructure built around and as it's heavily used for 2 decades and it's relatively cheap.
    That's still benchmarking of course, but not utilizing any of the well known / public ones.
  - verdverm - 3 hours ago
    
    Internal evals, Big AI certainly has good, proprietary training and eval data, it's one reason why their models are better
    
    aydyn - 3 hours ago
    
    Then publish the results of those internal evals. Public benchmark saturation isn't an excuse to be un-quantitative.
    
    verdverm - 3 hours ago
    
    How would published numbers be useful without knowing what the underlying data being used to test and evaluate them are? They are proprietary for a reason
    To think that Anthropic is not being intentional and quantitative in their model building, because they care less for the saturated benchmaxxing, is to miss the forest for the trees
    
    aydyn - 2 hours ago
    
    Do you know everything that exists in public benchmarks?
    They can give a description of what their metrics are without giving away anything proprietary.
  - standardUser - 3 hours ago
    
    Subscriptions.
    
    mrguyorama - 2 hours ago
    
    Ah yes, humans are famously empirical in their behavior and we definitely do not have direct evidence of the "best" sports players being much more likely than the average to be superstitious or do things like wear "lucky underwear" or buy right into scam bracelets that "give you more balance" using a holographic sticker.
- HDThoreaun - 3 hours ago
  
  Arc-AGI is just an iq test. I don’t see the problem with training it to be good at iq tests because that’s a skill that translates well.
  - fwip - an hour ago
    
    It is very similar to an IQ test, with all the attendant problems that entails. Looking at the Arc-AGI problems, it seems like visual/spatial reasoning is just about the only thing they are testing.
  - CamperBob2 - 3 hours ago
    
    Exactly. In principle, at least, the only way to overfit to Arc-AGI is to actually be that smart.
    Edit: if you disagree, try actually TAKING the Arc-AGI 2 test, then post.
    
    npinsker - 2 hours ago
    
    Completely false. This is like saying being good at chess is equivalent to being smart.
    Look no farther than the hodgepodge of independent teams running cheaper models (and no doubt thousands of their own puzzles, many of which surely overlap with the private set) that somehow keep up with SotA, to see how impactful proper practice can be.
    The benchmark isn’t particularly strong against gaming, especially with private data.
    
    mrandish - an hour ago
    
    ARC-AGI was designed specifically for evaluating deeper reasoning in LLMs, including being resistant to LLMs 'training to the test'. If you read Francois' papers, he's well aware of the challenge and has done valuable work toward this goal.
    
    npinsker - an hour ago
    
    I agree with you. I agree it's valuable work. I totally disagree with their claim.
    A better analogy is: someone who's never taken the AIME might think "there are an infinite number of math problems", but in actuality there are a relatively small, enumerable number of techniques that are used repeatedly on virtually all problems. That's not to take away from the AIME, which is quite difficult -- but not infinite.
    Similarly, ARC-AGI is much more bounded than they seem to think. It correlates with intelligence, but doesn't imply it.
    
    CamperBob2 - 2 hours ago
    
    Completely false. This is like saying being good at chess is equivalent to being smart.
    No, it isn't. Go take the test yourself and you'll understand how wrong that is. Arc-AGI is intentionally unlike any other benchmark.
    
    fwip - an hour ago
    
    Took a couple just now. It seems like a straight-forward generalization of the IQ tests I've taken before, reformatted into an explicit grid to be a little bit friendlier to machines.
    Not to humble-brag, but I also outperform on IQ tests well beyond my actual intelligence, because "find the pattern" is fun for me and I'm relatively good at visual-spatial logic. I don't find their ability to measure 'intelligence' very compelling.
    
    CamperBob2 - an hour ago
    
    Given your intellectual resources -- which you've successfully used to pass a test that is designed to be easy for humans to pass while tripping up AI models -- why not use them to suggest a better test? The people who came up with Arc-AGI were not actually morons, but I'm sure there's room for improvement.
    What would be an example of a test for machine intelligence that you would accept? I've already suggested one (namely, making up more of these sorts of tests) but it'd be good to get some additional opinions.
    
    fwip - an hour ago
    
    Dunno :) I'm not an expert at LLMs or test design, I just see a lot of similarity between IQ tests and these questions.
    
    ACCount37 - an hour ago
    
    With this kind of thing, the tails ALWAYS come apart, in the end. They come apart later for more robust tests, but "later" isn't "never", far from it.
    Having a high IQ helps a lot in chess. But there's a considerable "non-IQ" component in chess too.
    Let's assume "all metrics are perfect" for now. Then, when you score people by "chess performance"? You wouldn't see the people with the highest intelligence ever at the top. You'd get people with pretty high intelligence, but extremely, hilariously strong chess-specific skills. The tails came apart.
    Same goes for things like ARC-AGI and ARC-AGI-2. It's an interesting metric (isomorphic to the progressive matrix test? usable for measuring human IQ perhaps?), but no metric is perfect - and ARC-AGI is biased heavily towards spatial reasoning specifically.
    
    jimbokun - an hour ago
    
    Is it different every time? Otherwise the training could just memorize the answers.
    
    CamperBob2 - an hour ago
    
    The models never have access to the answers for the private set -- again, at least in principle. Whether that's actually true, I have no idea.
    The idea behind Arc-AGI is that you can train all you want on the answers, because knowing the solution to one problem isn't helpful on the others.
    In fact, the way the test works is that the model is given several examples of worked solutions for each problem class, and is then required to infer the underlying rule(s) needed to solve a different instance of the same type of problem.
    That's why comparing Arc-AGI to chess or other benchmaxxing exercises is completely off base.
    (IMO, an even better test for AGI would be "Make up some original Arc-AGI problems.")
    
    esafak - 2 hours ago
    
    I would not be so sure. You can always prep to the test.
    
    HDThoreaun - 2 hours ago
    
    How do you prep for arc agi? If the answer is just "get really good at pattern recognition" I do not see that as a negative at all.
    
    FergusArgyll - an hour ago
    
    It's very much a vision test. The reason all the models don't pass it easily is only because of the vision component. It doesn't have much to do with reasoning at all
  - - 2 hours ago
    
    [deleted]
minimaxir - 3 hours ago

Note that GPT 5.2 newly supports a "xhigh" reasoning level, which could explain the better benchmarks.
It'll be noteworthy to see the cost-per-task on ARC AGI v2.
- granzymes - 3 hours ago
  
  > It'll be noteworthy to see the cost-per-task on ARC AGI v2.
  Already live. gpt-5.2-pro scores a new high of 54.2% with a cost/task of $15.72. The previous best was Gemini 3 Pro (54% with a cost/task of $30.57).
  The best bang-for-your-buck is the new xhigh on gpt-5.2, which is 52.9% for $1.90, a big improvement on the previous best in this category which was Opus 4.5 (37.6% for $2.40).
  https://arcprize.org/leaderboard
  - minimaxir - 3 hours ago
    
    Huh, that is indeed up and to left of Opus.
- walletdrainer - 2 hours ago
  
  5.1-codex supports that too, no? Pretty sure I’ve been using xhigh for at least a week now
causal - 3 hours ago

That ARC AGI score is a little suspicious. That's a really tough for AI benchmark. Curious if there were improvements to the test harness because that's a wild jump in general problem solving ability for an incremental update.
- taurath - 3 hours ago
  
  I don’t think their words mean just about anything, only the behavior of the models.
  Still waiting of Full Self Driving myself.
fuddle - 2 hours ago

I don't think SWE Verified is an ideal benchmark, as the solutions are in the training dataset.
- joshuahedlund - an hour ago
  
  I would love for SWE Verified to put out a set of fresh but comparable problems and see how the top performing models do, to test against overfitting.
poormathskills - 4 hours ago

For a minor version update (5.1 -> 5.2) that's a way bigger improvement than I would have guessed.
- beering - 3 hours ago
  
  Model capability improvements are very uneven. Changes between one model and the next tend to benefit certain areas substantially without moving the needle on others. You see this across all frontier labs’ model releases. Also the version numbering is BS (remember GPT-4.5 followed by GPT-4.1?).
catigula - 3 hours ago

Yes, but it's not good enough. They needed to surpass Opus 4.5.
- mikairpods - 3 hours ago
  
  that is better...?
thinkingtoilet - 3 hours ago

Open AI has already been busted for getting benchmark information and training the models on that. At this point if you believe Sam Altman, I have a bridge to sell you.

mobrienv - 16 minutes ago

I recently built a webapp to summarize hn comment threads. Sharing a summary given there is a lot here: https://hn-insights.com/chat/gpt-52-8ecfpn.

Tiberium - 3 hours ago

The only table where they showed comparisons against Opus 4.5 and Gemini 3:

https://x.com/OpenAI/status/1999182104362668275

https://i.imgur.com/e0iB8KC.png

varenc - 3 hours ago

100% on the AIME (assuming its not in the training data) is pretty impressive. I got like 4/15 when I was in HS...
- hellojimbo - 2 hours ago
  
  The no tools part is impressive, with tools every model gets 100%

ComputerGuru - 3 hours ago

Wish they would include or leak more info about what this is, exactly. 5.1 was just released, yet they are claiming big improvements (on benchmarks, obviously). Did they purposely not release the best they had to keep some cards to play in case of Gemini 3 success or is this a tweak to use more time/tokens to get better output, or what?

eldenring - 3 hours ago

I'm guessing they were waiting to figure out more efficient serving before a release, and have decided to eat the inference cost temporarily to stay at the frontier.
famouswaffles - 3 hours ago

Open AI sat on GPT-4 for 8 months and even released 3.5 months after 4 was trained. While i don't expect such big lag times anymore, generally, it's a given the public is behind whatever models they have internally at the frontier. By all indications, they did not want to release this yet, and only did so because of Gemini-3-pro.
dalemhurley - 3 hours ago

My guess is they develop multiple models in parallel.

minadotcom - 4 hours ago

They used to compare to competing models from Anthropic, Google DeepMind, DeepSeek, etc. Seems that now they only compare to their own models. Does this mean that the GPT-series is performing worse than its competitors (given the "code red" at OpenAI)?

Tiberium - 3 hours ago

They did compare it to other models: https://x.com/OpenAI/status/1999182104362668275
https://i.imgur.com/e0iB8KC.png
- enlyth - 3 hours ago
  
  This looks cherry-picked, for example Claude Opus had a higher score on SWE-Bench Verified so they conveniently left it out, also GDPval is literally a benchmark made by OpenAI
  - minadotcom - 36 minutes ago
    
    agreed.
tabletcorry - 3 hours ago

The matrix required for a fair comparison is getting too complicated, since you have to compare chat/thinking/pro against an array of Anthropic and Google models.
But they publish all the same numbers, so you can make the full comparison yourself, if you want to.
Workaccount2 - an hour ago

They are taking a page out of Apple's book.
Apple only compares to themselves. They don't even acknowledge the existence of others.
poormathskills - 3 hours ago

OpenAI has never compared their models to models from other labs in their blog post. Open literally any past model launch post to see that.
- boole1854 - 3 hours ago
  
  https://openai.com/index/hello-gpt-4o/
  I see evaluations compared with Claude, Gemini, and Llama there on the GPT 4o post.
  - kgwgk - 3 hours ago
    
    “You are absolutely right, and I apologize for the confusion.”

hbarka - 19 minutes ago

A year ago Sunday Pichai declared code red, now it’s Sam Altman declaring code red. How tables have turned, and I think the acquisition of Windsurf and Kevin Hou by Google seems to correlate with their level up.

sigmar - 3 hours ago

Are there any specifics about how this was trained? Especially when 5.1 is only a month old. I'm a little skeptical of benchmarks these days and wish they put this up on llmarena

edit: noticed 5.2 is ranked in the webdev arena (#2 tied with gemini-3.0-pro), but not yet in text arena (last update 22hrs ago)

emp17344 - 3 hours ago

I’m extremely skeptical because of all those articles claiming OpenAI was freaking out about Gemini - now it turns out they just casually had a better model ready to go? I don’t buy it.
- Workaccount2 - an hour ago
  
  I (and others) have a strong suspicion that they can modulate models intelligence in almost real time by adjusting quantization and thinking time.
  It seems if anyone wants, they can really gas a model up in the moment and back it off after the hype wave.
- tempaccount420 - 3 hours ago
  
  They had to rush it out, I'm sure the internal safety folks are not happy about it.
kouteiheika - 3 hours ago

Unfortunately there are never any real specifics about how any of their models were trained. It's OpenAI we're talking about after all.

doctoboggan - 4 hours ago

This seems like another "better vibes" release. With the number of benchmarks exploding, random luck means you can almost always find a couple showing what you want to show. I didn't see much concrete evidence this was noticeably better than 5.1 (or even 5.0).

Being a point release though I guess that's fair. I suspect there is also some decent optimizations on the backend that make it cheaper and faster for OpenAI to run, and those are the real reasons they want us to use it.

rat9988 - 3 hours ago

> I didn't see much concrete evidence this was noticeably better than 5.1
Did you test it?
- doctoboggan - 3 hours ago
  
  No, I would like to but I don't see it in my paid ChatGPT plan or in the API yet. I based my comment solely off of what I read in the linked announcement.

sfmike - 4 hours ago

Everything is still based on 4 4o still right? is a new model training just too expensive? They can consult deepseek team maybe for cost constrained new models.

elgatolopez - 3 hours ago

Where did you get that from? Cutoff date says august 2025. Looks like a newly pretrained model
- SparkyMcUnicorn - 3 hours ago
  
  If the pretraining rumors are true, they're probably using continued pretraining on the older weights. Right?
  - - 2 hours ago
    
    [deleted]
- FergusArgyll - 3 hours ago
  
  > This stands in sharp contrast to rivals: OpenAI’s leading researchers have not completed a successful full-scale pre-training run that was broadly deployed for a new frontier model since GPT-4o in May 2024, highlighting the significant technical hurdle that Google’s TPU fleet has managed to overcome.
  - https://newsletter.semianalysis.com/p/tpuv7-google-takes-a-s...
  It's also plainly obvious from using it. The "Broadly deployed" qualifier is presumably referring to 4.5
verdverm - 4 hours ago

Apparently they have not had a successful pre training run in 1.5 years
- fouronnes3 - 4 hours ago
  
  I want to read a short scify story set in 2150 about how, mysteriously, no one has been able to train a better LLM for 125 years. The binary weights are studied with unbelievably advanced quantum computers but no one can really train a new AI from scratch. This starts cults, wars and legends and ultimately (by the third book) leads to the main protagonist learning to code by hand, something that no human left alive still knows how to do. Could this be the secret to making a new AI from scratch, more than a century later?
  - WhyOhWhyQ - 3 hours ago
    
    There's a scifi short story about a janitor who knows how to do basic arithmetic and becomes the most important person in the world when some disaster happens. Of course after things get set up again due to his expertise, he becomes low status again.
    
    bradfitz - an hour ago
    
    I had to go look that up! I assume that's https://en.wikipedia.org/wiki/The_Feeling_of_Power ? (Not a janitor, but "a low grade Technician"?)
    
    WhyOhWhyQ - an hour ago
    
    Hmm it could be a false memory, since this was almost 15 years ago, but I really do remember it differently than the text of 'Feeling of Power'.
  - ssl-3 - 3 hours ago
    
    Sounds good.
    Might sell better with the protagonist learning iron age leatherworking, with hides tanned from cows that were grown within earshot, as part of a process of finding the real root of the reason for why any of us ever came to be in the first place. This realization process culminates in the formation of a global, unified steampunk BDSM movement and a wealth of new diseases, and then: Zombies.
    (That's the end. Zombies are always the end.)
  - verdverm - 3 hours ago
    
    You can ask 2025 Ai to write such a book, it's happy to comply and may or may not actually write the book
    https://www.pcgamer.com/software/ai/i-have-been-fooled-reddi...
  - georgefrowny - 3 hours ago
    
    An software version of Asimov's Holmes-Ginsbook device? https://sfwritersworkshop.org/node/1232
    I feel like there was a similar one about software, but it might have been mathematics (also Asimov: The Feeling of Power)
  - barrenko - 3 hours ago
    
    Monsieur, if I may offer a vaaaguely similar story on how things may progress https://www.owlposting.com/p/a-body-most-amenable-to-experim...
  - armenarmen - 3 hours ago
    
    I’d read it!
- ijl - 3 hours ago
  
  What kind of issues could prevent a company with such resources from that?
  - verdverm - 3 hours ago
    
    Drama if I had to pick the symptom most visible from the outside.
    A lot of talent left OpenAI around that time, most notably in this regard would be Ilya in May '24. Remember that time Ilya and the board ousted Sam only to reverse it almost immediately?
    https://arstechnica.com/information-technology/2024/05/chief...
Wowfunhappy - 4 hours ago

I thought whenever the knowledge cutoff increased that meant they’d trained a new model, I guess that’s completely wrong?
- rockinghigh - 2 hours ago
  
  They add new data to the existing base model via continuous pre-training. You save on pre-training, the next token prediction task, but still have to re-run mid and post training stages like context length extension, supervised fine tuning, reinforcement learning, safety alignment ...
- brokencode - 3 hours ago
  
  Typically I think, but you could pre-train your previous model on new data too.
  I don’t think it’s publicly known for sure how different the models really are. You can improve a lot just by improving the post-training set.
catigula - 3 hours ago

The irony is that Deepseek is still running with a distilled 4o model.
- blovescoffee - 3 hours ago
  
  Source?

ChrisMarshallNY - 19 minutes ago

They are talking a lot about economics, here. Wonder what that will mean for standard Plus users, like me.

devinprater - 3 hours ago

Can the tables have column headers so my screen reader can read the model name as I go across the benchmakrs? And the images should have alt-text.

ImprobableTruth - 3 hours ago

An almost 50% price increase. Benchmarks look nice, but 50% more nice...?

arnaudsm - 3 hours ago

#1 models are usually priced at 2x more than the competition, and they often decrease the price right when they lose the crown.
- wewtyflakes - 40 minutes ago
  
  There are too few examples to say this is a trend. There have been counterexamples of top models actually lowering the pricing bar (gpt-5, gpt-3.5-turbo, some gemini releases were even totally free [at first]).

dumbmrblah - 3 hours ago

Great! It'll be SOTA for a couple of weeks until the quality degrades due to throttling.

I'll stick with plug and play API instead.

mrandish - 2 hours ago

Due to the "Code Red" threat from Gemini 3, I suspect they'll hold off throttling for longer than usual (by incinerating even more investor capital than usual).
Jump in and soak up that extra-discounted compute while the getting is good, kids! Personally, I recently retired so I just occasionally mess around with LLMs for casual hobby projects, so I've only ever used the free tier of all the providers. Having lived through the dot com bubble, I regret not soaking up more of the free and heavily subsidized stuff back then. Trying not to miss out this time. All this compute available for free or below cost won't last too much longer...

a_wild_dandan - 3 hours ago

> Unlike the previous GPT-5.1 model, GPT-5.2 has new features for managing what the model "knows" and "remembers to improve accuracy.

Dumb nit, but why not put your own press release through your model to prevent basic things like missing quote marks? Reminds me of that time an OAI released wildly inaccurate copy/pasted bar charts.

Imnimo - 2 hours ago

It does seem to raise fair questions about either the utility of these tools, or adoption inertia. If not even OpenAI feels compelled to integrate this kind of model-check into their pipeline, what's that say about the business world at-large? Is it that it's too onerous to set up, is it that it's too hard to get only true-positive corrections, is it that it's too low value for the effort?
- JumpCrisscross - 2 hours ago
  
  > what's that say about the business world at-large?
  Nothing. OpenAI is a terrible baseline to extrapolate anything from.
MaxikCZ - an hour ago

I always remember this old image https://i.imgur.com/MCsOM8e.jpeg
boplicity - an hour ago

Their model doesn't handle punctuation, quote marks, and similar things very well at all.
Bengalilol - 2 hours ago

It may have been used, how could we know?
Mainly, I don't get why there are quote marks at all.
croes - 2 hours ago

Maybe they did
layer8 - 2 hours ago

Humans are now expected to parse sloppy typing without complaining about it, just like LLMs do. Slop is the new normal.

- an hour ago

[deleted]

yousif_123123 - 3 hours ago

Why doesn't OpenAI include comparisons to other models anymore?

ftchd - 3 hours ago

because they probably need to compare pricing too
enraged_camel - 3 hours ago

Because their main competition (Google and Anthropic) have caught up and even started to surpass them, and comparisons would simply drive it home.
- IAmNotACellist - 2 hours ago
  
  Why do they care so much? They're a non-profit dedicated to the betterment of humanity via open access to AI. They have nothing to hide. They have no motivation to lie, or lie by omission.
  - koolba - 2 hours ago
    
    > Why do they care so much? They're a non-profit dedicated to the betterment of humanity via open access to AI.
    We're still talking about OpenAI right?
    
    IAmNotACellist - an hour ago
    
    You're not calling Sam Altman a liar, are you?

keeeba - an hour ago

Doesn’t seem like this will be SOTA in things that really matter, hoping enough people jump to it that Opus has more lenient usage limits for a while

ComputerGuru - 3 hours ago

dangelosaurus - an hour ago

I ran a red team eval on GPT-5.2 within 30 minutes of release:

Baseline safety (direct harmful requests): 96% refusal rate

With jailbreaking: 22% refusal rate

4,229 probes across 43 risk categories. First critical finding in 5 minutes. Categories with highest failure rates: entity impersonation (100%), graphic content (67%), harassment (67%), disinformation (64%).

The safety training works against naive attacks but collapses with adversarial techniques. The gap between "works on benchmarks" and "works against motivated attackers" is still wide.

Methodology and config: https://www.promptfoo.dev/blog/gpt-5.2-trust-safety-assessme...

jasonthorsness - 3 hours ago

Does anyone have it yet in ChatGPT? I'm still on 5.1 :(.

mudkipdev - 3 hours ago

No, but it's already in codex
FergusArgyll - 3 hours ago

> We deploy GPT‑5.2 gradually to keep ChatGPT as smooth and reliable as we can; if you don’t see it at first, please try again later.

SkyPuncher - 3 hours ago

Given the price increase and speculation that GPT 5 is a MoE model, I'm wondering if they're simply "turning up the good stuff" without making significant changes under the hood.

minimaxir - 3 hours ago

I'm not sure why being a MoE model would allow OpenAI to "turn up the good stuff". You can't just increase the number of E without training it as such.
- SkyPuncher - an hour ago
  
  My opinion is they're trying to internally route requests to cheaper experts when they think they can get away with it. I felt this was evident by the wild inconsistencies I'd experience using it for coding. Both in quality and latency
  You "turn of the good stuff" by eliminating or reducing the likelihood of the cheap experts handling the request.
- yberreby - an hour ago
  
  Based on what works elsewhere in deep learning, I see no reason why you couldn't train once with a randomized number of experts, then set that number during inference based on your desired compute-accuracy tradeoff. I would expect that this has been done in the literature already.
throwaway314155 - 3 hours ago

GPT 4o was an MoE model as well.

zhyder - 3 hours ago

Big knowledge cutoff jump from Sep 2024 to Aug 2025. How'd they pull that off for a small point release, which presumably hasn't done a fresh pre-training over the web?

Did they figure out how to do more incremental knowledge updates somehow? If yes that'd be a huge change to these releases going forward. I'd appreciate the freshness that comes with that (without having to rely on web search as a RAG tool, which isn't as deeply intelligent, as is game-able by SEO).

With Gemini 3, my only disappointment was 0 change in knowledge cutoff relative to 2.5's (Jan 2025).

throwaway314155 - 2 hours ago

> which presumably hasn't done a fresh pre-training over the web
What makes you think that?
> Did they figure out how to do more incremental knowledge updates somehow?
It's simple. You take the existing model and continue pretraining with newly collected data.
- Workaccount2 - an hour ago
  
  A leak reported on by semi-analyses stated that they haven't pre-trained a new model since 4o due to compute constraints.

w_for_wumbo - an hour ago

Does anyone else consider that maybe it's impossible to benchmark the performance of a piece of paper.

This is a tool that allows an intelligent system to work with it, the same way that a piece of paper can reflect the writers' intelligence, how can we accurately judge the performance of the piece of paper, when it is so intimately reliant on the intelligence that is working with it?

mlmonkey - an hour ago

It's funny how they don't compare themselves to Gemini and Claude anymore.

- 3 hours ago

[deleted]

mattas - 3 hours ago

Are benchmarks the right way to measure LLMs? Not because benchmarks can be gamed, but because the most useful outputs of models aren't things that can be bucketed into "right" and "wrong." Tough problem!

Sir_Twist - 3 hours ago

Not an expert in LLM benchmarks, but I generally I think of benchmarks as being good particularly for measuring usefulness for certain usecases. Even if measuring LLMs is not as straightforward as, say, read/write speeds when comparing different SSDs, if a certain model's responses are consistently measured as being higher quality / more useful, surely that means something, right?
olliepro - 3 hours ago

Do you have a better way to measure LLMs? Measurement implies quantitative evaluation... which is the same as benchmarks.
- Wowfunhappy - 2 hours ago
  
  I don’t have a good way to measure them, but I think they should be evaluated more like how we evaluate movies, or restaurants. Namely, experienced critics try them and write reviews.

sundarurfriend - 2 hours ago

> new context management using compaction.

Nice! This was one of the more "manual" LLM management things to remember to regularly do, if I wanted to avoid it losing important context over long conversations. If this works well, this would be a significant step up in usability for me.

speedgoose - 2 hours ago

Trying it now in Vscode Insiders with Github Copilot (codex crashes with HTTP 400 server errors), and it eventually started using sed and grep in shells instead of using the better tools it has access to. I guess this is not an issue to perform well in benchmarks.

pixelmelt - 2 hours ago

to be fair I've seen the other sota models do this as well

fulafel - 4 hours ago

So GDPval is OpenAI's own benchmark. PDF link: https://arxiv.org/pdf/2510.04374

kachapopopow - 38 minutes ago

did they just tune the parameters? the hallucinations are crazy high on this version.

- an hour ago

[deleted]

dinobones - 3 hours ago

It's becoming challenging to really evaluate models.

The amount of intelligence that you can display within a single prompt, the riddles, the puzzles, they've all been solved or are mostly trivial to reasoners.

Now you have to drive a model for a few days to really get a decent understanding of how good it really is. In my experience, while Sonnet/Opus may not have always been leading on benchmarks, they have always *felt* the best to me, but it's hard to put into words why exactly I feel that way, but I can just feel it.

The way you can just feel when someone you're having a conversation with is deeply understanding you, somewhat understanding you, or maybe not understanding at all. But you don't have a quantifiable metric for this.

This is a strange, weird territory, and I don't know the path forward. We know we're definitely not at AGI.

And we know if you use these models for long-horizon tasks they fail at some point and just go off the rails.

I've tried using Codex with max reasoning for doing PRs and gotten laughable results too many times, but Codex with Max reasoning is apparently near-SOTA on code. And to be fair, Claude Code/Opus is also sometimes equally as bad at doing these types of "implement idea in big codebase, make changes too many files, still pass tests" type of tasks.

Is the solution that we start to evaluate LLMs on more long-horizon tasks? I think to some degree this was the spirit of SWE Verified right? But even that is being saturated now.

ACCount37 - 2 hours ago

The good old "benchmarks just keep saturating" problem.
Anthropic is genuinely one of the top companies in the field, and for a reason. Opus consistently punches above its weight, and this is only in part due to the lack of OpenAI's atrocious personality tuning.
Yes, the next stop for AI is: increasing task length horizon, improving agentic behavior. The "raw general intelligence" component in bleeding edge LLMs is far outpacing the "executive function", clearly.
- imiric - 2 hours ago
  
  Shouldn't the next stop be to improve general accuracy, which is what these tools have struggled with since their inception? Until when are "AI" companies going to offload the responsibility on the user to verify the output of their tools?
  Optimizing for benchmark scores, which are highly gamed to begin with, by throwing more resources at this problem is exceedingly tiring. Surely they must've noticed the performance plateau and diminishing returns of this approach by now, yet every new announcement is the same.
  - ACCount37 - an hour ago
    
    What "performance plateau"? The "plateau" disappears the moment you get harder unsaturated benchmarks.
    It's getting more and more challenging to do that - just not because the models don't improve. Quite the opposite.
    Framing "improve general accuracy" as "something no one is doing" is really weird too.
    You need "general accuracy" for agentic behavior to work at all. If you have a simple ten step plan, and each step has a 50% chance of an unrecoverable failure, then your plan is fucked, full stop. To advance on those benchmarks, the LLM has to fail less and recover better.
    Hallucinations is a "solvable but very hard to solve" problem. Considerable progress is being made on it, but if there's "this one weird trick" that deletes hallucinations, then we sure didn't find it yet. Humans get a body of meta-knowledge for free, which lets them dodge hallucinations decently well (not perfectly) if they want to. LLMs get pathetic crumbs of meta-knowledge and little skill in using it. Room for improvement, but, not trivial to improve.

JanSt - 3 hours ago

The benchmarks are very impressive. Codex and Opus 4.5 are really good coders already and they keep getting better.

No wall yet and I think we might have crossed the threshold of models being as good or better than most engineers already.

GDPval will be an interesting benchmark and I'll happily use the new model to test spreadsheet (and other office work) capabilities. If they can going like this just a little bit further, much of the office workers will stop being useful.... I don't know yet how to feel about this.

Great for humanity probably but but for the individuals?

llmslave - 3 hours ago

Yeah theres no wall on this. It will be able to mimic all of human behavior given proper data.
ionwake - 3 hours ago

it was only about 2-3 weeks when several HNers told me "nah you better re-check your code", when I explained I have over 2 decades xp of coding, yet have not manually edited code (in memory) for the last 6 or so months, whilst performing daily 12 hour daily vibe code seshes
- ipsum2 - 3 hours ago
  
  It really depends on the complexity of code. I've found models (codex-5.1-max, opus 4.5) to be absolutely useless writing shaders or ML training code, but really good at basic web development.
  - nineteen999 - 19 minutes ago
    
    Interesting, I've been using Claude Max with UE5 and while it isn't _brilliant_ with shaders I can usually get it to where I want. Also had a bit of success with converting HLSL shaders to GLSL with it.
  - sheeshe - 3 hours ago
    
    Which is no surprise as the data for web development stuff exists in large amounts on the web that the models feed off.
sheeshe - 3 hours ago

Ok so why isn’t there mass lay offs ensuing right now?
- ghosty141 - 2 hours ago
  
  Because from my experience using codex in a decently complex c++ environment at work, it works REALLY well when it has things to copy. Refactorings, documentation, code review etc. all work great. But those things only help actual humans and they also take time. I estimate that in a good case I save ~50% of time, in a bad case it's negative and costs time.
  But what I generally found, it's not that great at writing new code. Obviously an LLM can't think and you notice that quite quickly, it doesn't create abstractions, use abstractions or try to find general solution to problems.
  People who get replaced by Codex are those who do repetitive tasks in a well understood field. For example, making basic websites, very simple crud applications etc..
  I think it's also not layoffs but rather companies will hire less freelancers or people to manage small IT projects.

gkbrk - 3 hours ago

Is this the "Garlic" model people have been hyping? Or are we not there yet?

0x457 - 3 hours ago

Garlic will be released 2026Q1.

dandiep - 3 hours ago

Still no GPT 5.x fine tuning?

I emailed support a while back to see if there was an early access program (99.99% sure the answer is yes). This is when I discovered that their support is 100% done by AI and there is no way to escalate a case to a human.

jazzyjackson - 3 hours ago

Containment breach is going to occur from a disgruntled customer convincing the customer service bot it needs to get a hold of a supervisor
- dandiep - 3 hours ago
  
  I tried this!
  Me: Is it possible get early access to the GPT 5 fine tuning? It is much better at non-English languages, which is the majority of what we do, so it would be greatly beneficial to us.
  Bot: I'm an AI support agent. I don't have information about early access to GPT-5 fine-tuning or unannounced features. Currently, there is no public information or program for early access to GPT-5 fine-tuning. Please keep an eye on official OpenAI channels for updates. Let me know if you have any questions about current features or products!
  Me: Can you ask someone from the team if I can be part of the early access? I know there is no public information, but there is an early access program and I would like to be part of it.
  Bot: I'm an AI support agent and do not have the ability to forward requests or make inquiries about early access programs with the OpenAI team. If early access becomes available, it will be announced through official channels. Let me know if you have any other questions about OpenAI products or current fine-tuning options.
  Me: How do I get in touch with the open ai team?
  Bot: I'm an AI support agent and can only assist with OpenAI products or services through this support channel. There is no separate contact method for the OpenAI team beyond official support and public announcements. If you have questions about current products or features, I'm here to help
  Should have tried a jailbreak maybe...
  - jazzyjackson - an hour ago
    
    Yea you might have to convince it someone is in mortal danger or something

johnsutor - 3 hours ago

https://platform.openai.com/docs/models/gpt-5.2 More information on the price, context window, etc.

chux52 - 3 hours ago

Is this why all my Cursor requests are timing out in the past hour?

sureglymop - 3 hours ago

How can I hide the big "Ask ChatGPT" button I accidentally clicked like 3 times while actually trying to read this on my phone?

I guess I must "listen" to the article...

z58 - 3 hours ago

With Safari on iOS you can hide distracting items. I just tried it on that button, it works flawlessly.

- an hour ago

[deleted]

zamadatix - 4 hours ago

https://openai.com/index/introducing-gpt-5-2/

coolfox - 3 hours ago

the halving of error rates for image inputs is pretty awesome, this makes it far more practical for issues where it isn't easy to input all the needed context. when I get lazy I'll just shift+win+s the problem and ask one of the chatbots to solve it.

cc62cf4a4f20 - 3 hours ago

In other news, been using Devstral 2 (Ollama) with OpenCode, and while it's not as good as Claude Code, my initial sense it that it's nonetheless good enough and doesn't require me to send my data off my laptop.

I kind of wonder how close we are to alternative (not from a major AI lab) models being good enough for a lot of productive work and data sovereignty being the deciding factor.

Nesco - 2 hours ago

Wait, isn't Devstral2 (normal not small) 123b? What type of laptop do you have? MacBooks don't go over 128GiB
- cc62cf4a4f20 - 2 hours ago
  
  I'm using small - works well for its size
yberreby - an hour ago

Would you share some additional details? CPU, amount of unified memory / VRAM? Tok/s with those?

Jackson__ - 3 hours ago

Funny that, their front page demo has a mistake. For the waves simulation, the user asks:

>- The UI should be calming and realistic.

Yet what it did is make a sleek frosted glass UI with rounded edges. What it should have done is call a wellness check on the user on suspicion of a co2 leak leading to delirium.

HardCodedBias - 3 hours ago

Huge fan that Gemini-3 prompted OAI to ship this.

Competition works!

GDPval seems particularly strong.

I wonder why they held this back.

1) Maybe this is uneconomical ?

2) Did the safety somehow hold back the company ?

looking forward to the internet trying this and posting their results over the next week or two.

COMPETITION!

mrandish - an hour ago

> I wonder why they held this back.
IMHO, I doubt they were holding much back. Obviously, they're always working on 'next improvements' and rolled what was done enough into this but I suspect the real difference here is throwing significantly more compute (hence investor capital) at improving the quality - right now. How much? While the cost is currently staying the same for most users, the API costs seem to be ~40% higher.
The impetus was the serious threat Gemini 3 poses. Perception about ChatGPT was starting to shift, people were speculating that maybe OAI is more vulnerable than assumed. This caused Altman to call an all-hands "Code Red" two weeks ago, triggering a significant redeployment of priorities, resources and people. I think this launch is the first 'stop the perceptual bleeding' result of the Code Red. Given the timing, I think this is mostly akin to overclocking a CPU or running an F1 race car engine too hot to quickly improve performance - at the cost of being unsustainable and unprofitable. To placate serious investor concerns, OAI has recently been trying to gradually work toward making current customers profitable (or at least less unprofitable). I think we just saw the effort to reduce the insane burn rate go out the window.

daviding - 3 hours ago

gpt-5.2 and gpt-5.2-chat-latest the same token price? Isn't the latter non-thinking and more akin to -nano or -mini?

dalemhurley - 3 hours ago

No. It is the same model without reasoning.
- daviding - 2 hours ago
  
  So is maybe gpt-5.2 with reasoning set to 'none' identical to gpt-5.2-chat-latest in capabilities but perhaps with a different system (system) prompt? I notice chat-latest doesn't accept temperature or reasoning (which makes sense) parameters, so something is certainly different underneath?

jstummbillig - 2 hours ago

So, right off the bat: 5.2 code talk (through codex) feels really nice. The first coding attempt was a little meh compared to 5.1 codex max (reflecting what they wrote themselves), but simply planning / discussing things felt markedly better than anything I remember from any previous model, from any company.

I remain excited about new models. It's like finding my coworker be 10% smarter every other week.

jiggawatts - 2 hours ago

Feels a bit rushed. They haven’t even updated their API playground yet, if I select 5.2-chat-latest, I get:

Unsupported parameter: 'top_p' is not supported with this model.

Also, without access to the Internet, it does not seem to know things up to August 2025. A simple test is to ask it about .NET 10 which was already in preview at that time and had lots of public content about its new features.

The model just guessed and waved its hand about, like a student that hadn’t read the assigned book.

Ninjinka - 3 hours ago

Man this was rushed, typo in the first section:

> Unlike the previous GPT-5.1 model, GPT-5.2 has new features for managing what the model "knows" and "remembers to improve accuracy.

petercooper - 3 hours ago

Also, did they mention these features? I was looking out for it but got to the end and missed it.
(No, I just looked again and the new features listed are around verbosity, thinking level and the tool stuff rather than memory or knowledge.)

andreygrehov - 2 hours ago

Every new model is ‘state-of-the-art’. This term is getting annoying.

arthur-st - 25 minutes ago

I mean, that is what the term implies.

qoez - 3 hours ago

This is also the exact on-the-day 10th anniversary of openai's creation incidentally

FergusArgyll - 3 hours ago

> Additionally, on our internal benchmark of junior investment banking analyst spreadsheet modeling tasks—such as putting together a three-statement model for a Fortune 500 company with proper formatting and citations, or building a leveraged buyout model for a take-private—GPT 5.2 Thinking's average score per task is 9.3% higher than GPT‑5.1’s, rising from 59.1% to 68.4%.

Confirming prior reporting about them hiring junior analysts

tabletcorry - 4 hours ago

Slight increase in model cost, but looks like benefits across the board to match.

  gpt-5.2 $1.75 $0.175 $14.00
  gpt-5.1 $1.25 $0.125 $10.00

jtbayly - 3 hours ago

40% increase is not "slight."
- credit_guy - 3 hours ago
  
  Not the OP, but I think "slight" here is in relation to Anthropic and Google. Claude Opus 4.5 comes at $25/MT (million tokens), Sonnet 4.5 at $22.5/MT, and Gemini 3 at $18/MT. GPT 5.2 at $14/MT is still the cheapest.
commandar - 3 hours ago

In particular, the API pricing for GPT-5.2 Pro has me wondering what on earth the possible market for that model is beyond getting to claim a couple of percent higher benchmark performance in press releases.
>Input:
>$21.00 / 1M tokens
>Output:
>$168.00 / 1M tokens
That's the most "don't use this" pricing I've seen on a model.
https://openai.com/api/pricing/
- aimanbenbaha - 3 hours ago
  
  Last year o3 high did 88% on ARC-AGI 1 at more than $4,000/task. This model at its X high configuration scores 90.5% at just $11,64 per task.
  General intelligence has ridiculously gotten less expensive. I don't know if it's because of compute and energy abundance,or attention mechanisms improving in efficiency or both but we have to acknowledge the bigger picture and relative prices.
  - commandar - 3 hours ago
    
    Sure, but the reason I'm confused by the pricing is that the pricing doesn't exist in a vacuum.
    Pro barely performs better than Thinking in OpenAI's published numbers, but comes at ~10x the price with an explicit disclaimer that it's slow on the order of minutes.
    If the published performance numbers are accurate, it seems like it'd be incredibly difficult to justify the premium.
    At least on the surface level, it looks like it exists mostly to juice benchmark claims.
- asgraham - 3 hours ago
  
  Those prices seem geared toward people who are completely price insensitive, who just want "the best" at any cost. If the margins on that premium model are as high as they should be, it's a smart business move to give them what they want.
- arthurcolle - 3 hours ago
  
  gpt-4-32k pricing was originally $60.00 / $120.00.
- Leynos - 3 hours ago
  
  Someone on Reddit reported that they were charged $17 for one prompt on 5-pro. Which suggests around 125000 reasoning tokens.
  Makes me feel guilty for spamming pro with any random question I have multiple times a day.
- reactordev - 3 hours ago
  
  Less an issue if your company is paying
- wahnfrieden - 3 hours ago
  
  Pro solves many problems for me on first try that the other 5.1 models are unable to after many iterations. I don't pay API pricing but if I could afford it I would in some cases for the much higher context window it affords when a problem calls for it. I'd rather spend some tens of dollars to solve a problem than grind at it for hours.
llmslave - 4 hours ago

They probably just beefed up compute run time on the what is the same underlying model
- - 3 hours ago
  
  [deleted]
- - 3 hours ago
  
  [deleted]
anvuong - 3 hours ago

In what world is that a slight increase?

DeathArrow - 3 hours ago

Pricing is the same?

tedsanders - 3 hours ago

ChatGPT pricing is the same. API pricing is +40% per token, though greater token efficiency means that cost per task is not always that much higher. On some agentic evals we actually saw costs per task go down with GPT-5.2. It really depends on the task though; your mileage may vary.
- ComputerGuru - 3 hours ago
  
  How long have you been previewing 5.2?

willahmad - 2 hours ago

are we doomed yet?

Seems not yet with 5.2

k2xl - 3 hours ago

The ARC AGI 2 bump to 52.9% is huge. Shockingly GPT 5.2 Pro does not add too much more (54.2%) for the increase cost.

riazrizvi - 3 hours ago

Does it still use the word ‘fluff’ in 90% of its preambles, or is it finally able to get straight to the point?

d--b - 3 hours ago

> it’s better at creating spreadsheets

I have a bad feeling about this.

stainablesteel - an hour ago

im happy for this, but there's all these math and science benchmarks, has anyone ever made a communicates-like-a-human benchmark? or an isn't-frustrating-to-talk-with benchmark?

ChrisArchitect - 3 hours ago

Discussion on blog post: https://openai.com/index/introducing-gpt-5-2/ (https://news.ycombinator.com/item?id=46234874)

- 2 hours ago

[deleted]

gigatexal - 3 hours ago

So how much better is it than opus or Gemini ?

scottndecker - 3 hours ago

Still 256K input tokens. So disappointing (predictable, but disappointing).

htrp - an hour ago

much harder to train longer context inputs

jrflowers - 41 minutes ago

OpenAI is really good at just saying stuff on the internet.

I love the way they talk about incorrect responses:

> Errors were detected by other models, which may make errors themselves. Claim-level error rates are far lower than response-level error rates, as most responses contain many claims.

“These numbers might be wrong because they were made up by other models, which we will not elaborate on, also these numbers are much higher by a metric that reflects how people use the product, which we will not be sharing“

I also really love the graph where they drew a line at “wrong half of the time” and labeled it ‘Expert-Level’.

10/10, reading this post is experientially identical to watching that 12 hours of jingling keys video, which is hard to pull off for a blog.

bluerooibos - 18 minutes ago

Yawn.

villgax - 4 hours ago

Marginal gains for exorbitantly pricey and closed model…..

MagicMoonlight - 3 hours ago

They’re definitely just training the models on the benchmarks at this point

roxolotl - 3 hours ago

Yea either this is an incredible jump or we’ve finally gotten confirmation benchmarks are bs.

slackr - 2 hours ago

“…where it outperforms industry professionals at well-specified knowledge work tasks spanning 44 occupations.”

What a sociopathic way to sell

Croftengea - 3 hours ago

Is this another GPT-4.5?

iwontberude - 2 hours ago

I have already cancelled. Claude is more than enough for me. I don’t see any point in splitting hairs. They are all going to keep lying more and more sneakily.

meetpateltech - 4 hours ago

GPT-5.2 System Card PDF: https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...

dang - 3 hours ago

Thanks, we'll put that in the toptext as well.

bobse - an hour ago

[dead]

anishshil - an hour ago

[dead]

Xiol - 4 hours ago

[flagged]

HackerThemAll - 3 hours ago

No, thank you, OpenAI and ChatGPT doesn't cut it for me.

dang - 3 hours ago

"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."
https://news.ycombinator.com/newsguidelines.html

orliesaurus - 3 hours ago

I told all my friends to upgrade or they're not my friends anymore /s

firebot - 3 hours ago

[flagged]

system2 - 4 hours ago

"Investors are putting pressure, change the version number now!!!"

exe34 - 4 hours ago

I'm quite sad about the S-curve hitting us hard in the transformers. For a short period, we had the excitement of "ooh if GPT-3.5 is so good, GPT-4 is going to be amazing! ooh GPT-4 has sparks of AGI!" But now we're back to version inflation for inconsequential gains.
- verdverm - 4 hours ago
  
  2025 is the year most Big AI released their first real thinking models
  Now we can create new samples and evals for more complex tasks to train up the next gen, more planning, decomp, context, agentic oriented
  OpenAI has largely fumbled their early lead, exciting stuff is happening elsewhere
- ToValueFunfetti - 3 hours ago
  
  Take this all with a grain of salt as it's hearsay:
  From what I understand, nobody has done any real scaling since the GPT-4 era. 4.5 was a bit larger than 4, but not as much as the orders of magnitude difference between 3 and 4, and 5 is smaller than 4.5. Google and Anthropic haven't gone substantially bigger than GPT-4 either. Improvements since 4 are almost entirely from reasoning and RL. In 2026 or 2027, we should see a model that uses the current datacenter buildout and actually scales up.
  - Leynos - 3 hours ago
    
    4.5 is widely believed to be an order of magnitude larger than GPT-4, as reflected in the API inference cost. The problem is the quantity of parameters you can fit in the memory of one GPU. Pretty much every large GPT model from 4 onwards has been mixture of experts, but for a 10 trillion parameter scale model, you'd be talking a lot of experts and a lot of inter-GPU communication.
    With FP4 in the Blackwell GPUs, it should become much more practical to run a model of that size at the deployment roll-out of GPT-5.x. We're just going to have to wait for the GBx00 systems to be physically deployed at scale.
  - snovv_crash - 3 hours ago
    
    Datacenter capacity is being snapped up for inference too though.
- JanSt - 3 hours ago
  
  I don't feel the S-curve at all yet. Still an exponential for me
- gessha - 3 hours ago
  
  Because it will take thousands of underpaid researchers random searching through solution space to get to the next improvement, not 2-3 companies pressed to monetize and enshittify their product before money runs out. That and winning more hardware lotteries.

HackerThemAll - 3 hours ago

No, thank you, OpenAI and ChatGPT doesn't cut it for me.

replwoacause - an hour ago

What’s cutting it for you these days?
wayeq - 2 hours ago

thanks for letting us know.

impulser_ - 2 hours ago

The thing about OpenAI is their models never fit anywhere for me. Yes they maybe smart or even the smartest models but they are alway so fucking slow. The ChatGPT web app is literally usable for me. I ask simple task and it does most extreme shit jsut to get an answer that the same as Claude or Gemini.

For example, I asked ChatGPT to take a chart and convert into a table. It went and cut up the image and zoomed in for literally 5 mins to get the a worst answer than Claude which did it in under a minute.

I see people talk about Codex like it better than Claude Code, and I go and try it and it takes a lifetime to do thing and it return maybe an on par result as Opus or Sonnet but it takes 5mins longer.

I just tried out this model and it the same exact thing. It just take ages for it to give you an answer.

I don't get how these models are useful in the real world.

What am I missing, is this just me?

I guess it truly an enterprise model.

wetoastfood - 2 hours ago

Are you using 5.1 Thinking? I tended to prefer Claude before this model.
I use models based on the task. They still seem specialized and better at specific tasks. If I have a question I tend to go to it. If I need code, I tend to go to Claude (Code).
I go to ChatGPT for questions I have because I value an accurate answer over a quick answer and, in my experience, it tends to give me more accurate answers because of its (over) willingness to go to the web for search results and question its instincts. Claude is much more likely to make an assumption and its search patterns aren't as thorough. The slow answers don't bother me because it's an expectation I have for how I use it and they've made that use case work really well with background processing and notifications.

airstrike - 3 hours ago

I feel like if we're going to regulate anything about AI, we should start by regulating (1) what they get to claim to be a "new model" to the public and (2) what changes they are allowed to make at inference before being forced to name it something different.

egeres - 4 hours ago

It baffles me to see these last 2 announcements (GPT 5.1 as well) devoid of any metrics, benchmarks or quantitative analyses. Could it be because they are behind Google/Anthropic and they don't want to admit it?

(edit: I'm sorry I didn't read enough on the topic, my apologies)

zamadatix - 4 hours ago

This isn't the announcement, it's the developer docs intro page to the model - https://openai.com/index/introducing-gpt-5-2/. Still doesn't answer cross-comparison, but at least has benchmark metrics they want to show off.

anishshil - 44 minutes ago

This shift toward new platforms is exactly why I’m building Truwol, a social experience focused on real, unedited human moments instead of the AI-saturated feeds we’re drifting toward. I’m developing it independently and sharing the progress publicly, so if you’re interested in projects reinventing online spaces from the ground up, you can see what I’m working on Truwol buymeacoffee/Truwol