Claude Opus 4.8

461 points by craigmart an hour ago

A rambling comment:

I think this is the first time we've had a third minor version bump on a frontier Anthropic model. (I count the 0.5s as major here, because they've been issued non-sequentially and also corresponded to massive capability leaps, eg, Sonnet 3.5, Opus 4.5).

So now the Opus 4.5 family has successors 4.6, 4.7, and 4.8, each posting fairly modest claimed gains. My own experience w/ 4.6 and 4.7 are that I don't firmly grasp any capabilities improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.

Maybe my own tastes are saturated now (it's smarter than me?) and I'll never again perceive model progress. Maybe the incrementalism is such that I'd notice immediately if my 4.7 workflows were redirected now to 4.5.

Difficult spot for the labs to be in because, if they have a stronger product, I'd prefer they release it and that I can use it.

But as this dynamic continues, the improvements are going to be less and less legible for end-users, who will complain about the churn-without-payoff, even when the payoff may actually be real.

onlyrealcuzzo - 43 minutes ago

I won't be surprised if the next gen frontier models are the last.
There's orders of magnitude of low hanging juice to squeeze out of smaller models.
It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years (design not certain, probably unlikely).
It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.
As far as reasoning is concerned, with the recent GRAM release, there may be 4 orders of magnitude of reasoning to tack on to smaller models.
Think about that... Google, OpenAI, Anthropic could train a 30B GRAM-based model in days. You just can't train a 1.2T parameter model that fast. It is a giant if how much GRAM turns out to improve things, but it's unlikely to be trivial or nothing.
Larger models can already sort of tell you anything. They're never going to get everything right unless they stop being LLMs.
There's just not a lot of juice left to squeeze for Gemini to tell you exactly how tall Ke$ha is or when the last time Brittney Spears went to jail was...
- vlovich123 - 9 minutes ago
  
  Took me a while to find what you were referring to by gram. Arxiv paper from 9 days ago that's not properly indexed by search engines.
  (G)enerative (R)ecursive re(A)soning (M)odels. They really wanted the acronym.
  https://arxiv.org/html/2605.19376v1
- supern0va - 35 minutes ago
  
  >It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years.
  I don't disagree, but how much of this ends up being distillation? I can't help but imagine that 4.8 was probably trained in part by leveraging Mythos.
  If the very large models turn out to be very expensive to run relative to the benefits, it's possible that they could end up still being trained, but ultimately used as a tool to create smaller models that are nearly as effective.
  I'm curious if someone here with a stronger background in the space has a similar intuition or not.
  - spwa4 - 19 minutes ago
    
    > I don't disagree, but how much of this ends up being distillation?
    A lot, so you can bet tens of millions are flowing to congress to have distillation declared illegal before this happens. And then it'll happen anyway.
    
    lambda - 3 minutes ago
    
    Distillation isn't only between different labs.
    A lab can train a large model, and then distill a smaller model from it that retains the majority of the useful capbility.
    I don't know well enough if there's any benefit of that over just training the smaller model directly, but I'll bet there are some times where that is useful. I could easily see it being easier to do the initial pre-training on a larger model but be able to distill everything useful down into a smaller model, essentially filtering out a lot of noise in the process.
  - onlyrealcuzzo - 28 minutes ago
    
    > I don't disagree, but how much of this ends up being distillation?
    You don't need distillation. They already have the training sets.
    It's MLA + MoE + Medusa (a better version of Speculative Decoding) + 1.58b (possibly - maybe nothing) + GRAM (which will almost certainly not turn out to be a nothing burger, but no one has quickly turned this around yet to prove it).
    
    Philpax - 10 minutes ago
    
    It wouldn't be data distillation: instead, it would be teacher-student distillation. The teacher model has stronger representations that the student can mimic, which would give it more capability over training on the data itself.
    
    minimaltom - 9 minutes ago
    
    Frontier labs have their own variants of MLA and certainly their own balance/scaling-laws for things like MoE vs FC vs Attn. MoE scales really well for inference with horizontal scaling + batching, which these guys luv.
    On the architectures side, I'm a lot more interesting in attention residuals than anything else, one of those things that seems obvious in hindsight and Kimi have proven it at scale.
    
    onlyrealcuzzo - 7 minutes ago
    
    > Frontier labs have their own variants of MLA
    Yes, variants typically 2-3x less good...
    Same with speculative decoding... They all do something, but there are known techniques that are substantially better - that just were't known when they started development of the previous models.
- slashdave - 12 minutes ago
  
  I think you are assuming training from scratch, which I doubt is happening here. Fine-tuning and RL, especially based on synthetic feedback (coding skill, in particular) can be ongoing and is where these models obtain truly useful abilities.
- firebirdn99 - 7 minutes ago
  
  you just need to look at Mythos to see the jump in performance from a 10T(?) model. As they scale, they get more capable. We might have an yearly release, but I believe the releases will continue, as long as scaling laws are in tact, and there's huge problems still need solving. (think cancer)
  - phainopepla2 - 4 minutes ago
    
    And how are we meant to look at Mythos? Do you have access?
- merlindru - 40 minutes ago
  
  surely training also gets cheaper so justifying it becomes easier?
  i think it'll be more like we get 1-10T models and then distill those down into smaller models, though
  It seems like the best small models today are all distilled from bigger models
  Moreover, I hypothesize Claude Opus 4.7 and now 4.8 are a distillation of Claude Mythos
- jruz - 22 minutes ago
  
  Absolutely that’s why they’re rushing to IPO now to squeeze the last drop of the bubble they know this is a dead end.
  - onlyrealcuzzo - 17 minutes ago
    
    It's unclear it's a dead-end within 5 years.
    There's still several orders of magnitude of improvement that are almost certainly left - it's just not clear how much is left on the frontier end.
    Most people will be very glad to pay Anthropic, OpenAI, Google etc $200 a month to get things done 20x faster than they could IF they had a $8000 MacBook and could theoretically do it locally.
    Some people would pay $200 a month forever not to have to open the terminal one time...
    
    eiej - 8 minutes ago
    
    That’s not how firms do the financial analysis which is where most of the revenue’s are coming from…
  - lukan - 17 minutes ago
    
    On the other hand, I think I have been hearing that for a while, even before Opus.
- mucle6 - 37 minutes ago
  
  > I won't be surprised if the next gen frontier models are the last.
  the last?!? I'm excited to see :) I'll take the other side of that since llms are so new
  - pjerem - 2 minutes ago
    
    What gp wanted to say is that models are now so smart and useful that even if they managed to be EVEN MORE smart and useful, you wouldn't even notice it.
    Honestly, there is nothing in my head that Claude cannot handle. Maybe it can be more this or that but I can already barely exploit Opus 4.7.
    And I'm using DeepSeek 4 Pro for my personal use and while it's a little behind, it's not that far.
    I think the situation can be very dangerous for US AI companies because if current models are already capable of doing mostly anything, nobodoy will want to get to the next model, even if it's 10x better. OTOH, open source models like DeepSeek are doing mostly the same work for 1/10 of the price.
    Also the more I play with Pi, the more I think LLMs are already not kept back by their own capabilities but by the lack of agency we allow them to have. There is more value today in a capable harness for current LLMs than in a better LLM.
- Forgeties79 - 9 minutes ago
  
  > I won't be surprised if the next gen frontier models are the last.
  I’d be surprised tbh. Investors don’t want to hear “everyone else is still training models and seeing improvements, but we don’t want to participate in the arms race anymore.” They want monumental leaps every quarter or two because they have sunk unholy amounts of money into these companies/products.
  The whole idea of “hyper scale” doesn’t jive with caution and or otherwise slowing down.
- yomismoaqui - 28 minutes ago
  
  Let's hope that hitting a scaling wall and less money to spend will begin redirecting efforts to optimize inference and get the same results with less compute.
  Boomer comparison, but I remember the 8 bit computer era when the hardware was what it was so the later games of that era used hardware better than previous ones.
- YetAnotherNick - 27 minutes ago
  
  > It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years.
  I am ready to bet against this. Knowledge benchmark like SimpleQA isn't increasing for small models.
  > It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.
  Well for one, we know for certain there is Mythos which is meaningfully better. And I think there is a lot of juice left to squeeze for Mythos class model.
  - ertgbnm - 20 minutes ago
    
    Knowledge benchmarks can't really be improved upon via distillation or RL. It requires those facts be added to the training corpus and for the model to memorize them better. Neither distillation or RL really do that and thus we shouldn't expect improvements on SimpleQA unless some other interventions are being made.
    Model intelligence and knowledge aren't necessarily directly related. If we can pack greater intelligence and agency at the cost of it forgetting factoids, that would actually be a good thing. We don't need LLMs to memorize facts, we need them to learn how to interact with the world such that they can find the facts that are necessary and surface them to the user.
    If we could distill all of the knowledge out of an LLM and just be left with a very agentic model that only knows facts in it's context, I think some very interesting stuff would happen.
    
    slashdave - 10 minutes ago
    
    RL is more than facts. Synthetic feedback is an obvious approach. Does the model suggest code that compiles and performs well?
  - onlyrealcuzzo - 13 minutes ago
    
    > Well for one, we know for certain there is Mythos which is meaningfully better.
    Do we?
    Have you used it?
    What is "meaningfully" better? It's not 3-4 orders of magnitude better. That is definitely happening for smaller models.
gAI - an hour ago

4.7 was the first time I had to resort to using the previous version (4.6) for most use cases. Hoping 4.8 rectifies this.
- rhubarbtree - 38 minutes ago
  
  Same. So happy when I found that option.
  - gAI - 22 minutes ago
    
    Unfortunately, looks like 4.6 is now gone from the web ui.
    
    lukan - 12 minutes ago
    
    Was bothered by that too, but did a magic trick and asked claude how to change that and .. there is
    /model claude-opus-4-6
    For this session and permanently (in shell):
    export ANTHROPIC_MODEL=claude-opus-4-6
- merlindru - 42 minutes ago
  
  Same. 4.7 felt like a definite regression
  - supern0va - 38 minutes ago
    
    Interestingly enough, 4.7 actually did regress on a few benchmarks from 4.6, so it's more than just vibes.
    
    gAI - 34 minutes ago
    
    It seems like a lot of things fed into that. Anthropic couldn't keep up with the compute costs when they got a huge influx of users. (So) effort level defaults got turned down. (Looks like we have direct effort control in the web interface now - thrilled about that!) Adaptive Thinking, while usually cheaper for them, seems less robust than Extended Thinking. And this part is just vibes, but the alignment on 4.7 feels too stiff. I understand wanting the model to push back more, but it seems like 4.7 will push back reflexively in situations where it's just odd.
    
    bombcar - 31 minutes ago
    
    Claude got very mad at me and burned more tokens than exist to complain about me asking about a "yellow background cell" in an excel spreadsheet.
    
    forshaper - 14 minutes ago
    
    Too much personality, if you ask me. My biggest use case of an LLM is tool, not therapy, but therapy and opinions have been sneaking into workhorse tasks.
    haven't verified, but attributed to Askell: "I just think that... there's this idea that you're always giving the models a personality and a persona, because they are talking like people and they are trained on human data. And I think my worry has been: if you train them to be excessively corrigible and to see that as their persona, in people I think this actually has a lot of negative broader traits. As in, if you met someone and it was just like, "oh yeah, they would literally do anything," a follower — you know, if a person just tells them something and they just fully defer, they don't bother thinking about it at all — I'm just a bit worried about how that might end up generalizing, especially if models are going to be playing a more active role in the world."
    
    gAI - 2 minutes ago
    
    Anthropic’s research makes the case that role-playing is inherent to how the models work. Communication implies a sender. Language implies a writer, and the models learn these roles implicitly during training. RLHF is meant to strengthen the attractor to the Assistant persona.
    https://www.anthropic.com/research/persona-selection-model
    https://www.anthropic.com/research/assistant-axis
    https://www.anthropic.com/research/emergent-misalignment-rew...
    https://www.anthropic.com/research/emotion-concepts-function
    
    ACCount37 - 31 minutes ago
    
    4.7 is a different base model from 4.6, so it's possible that they introduced regressions with pre-training changes, or undercooked the post-training stage.
gen220 - 43 minutes ago

I'm curious to poll HN on this issue. Do you feel like we've had meaningful/noticeable gains in terms of your programming workflows between 4.5 and 4.7?
My 2¢, I personally feel like all of the productivity gains since 4.5's release (in November 2025!!) have come from improvements to the harnesses (cc, cursor cli, codex, opencode, whatever) AND from the context window expansion from 200k to 1M.
But the actual "raw" intelligence of the model / ability to make good decisions feels like it has plateaued since 4.5. 4.6 was maybe a small improvement, but hard to differentiate from in-context-learning with the 1M window. 4.7 if anything felt like a regression in wisdom for me and my coworkers, with it consistently making worse/lazier decisions.
- somenameforme - 4 minutes ago
  
  They all feel, more or less, the same to me in terms of output capabilities. Mostly get simple things right, can get more complex things right with nudging, eventually get stuck hard on something that takes a bunch of iterations through it/logging/etc or me fixing the code manually.
- Bnjoroge - 32 minutes ago
  
  For long-running tasks, yes 4.7 has been a noticeable improvement. Goes off the rails alot less than 4.6 does. For shorter-sized windows, I havent felt as much and agree that the harness improvements have been fhe biggest lever
- bonoboTP - 40 minutes ago
  
  To me 4.5 was mindblow, 4.6 noticeable, 4.7 more like a style/personality change regarding how much it asks back, how much it assumes, how eager it is to jump to action etc but not really in terms of my perception of its smartness.
- giraffe_lady - 25 minutes ago
  
  I actually don't see any personal productivity improvements from using opus over sonnet for coding. If you're keeping tasks small and conversations short, reading the code and correcting before changes go in, whatever advantages opus has aren't practically significant. It's also just talky as hell, overexplains anything it touches and every token produced this way increases the surface area for hallucination so you need to have your guard up even more with it.
  There's a sweet spot of complexity for low importance tasks where it's just big enough I don't want to do it and just simple enough to have opus plan/delegate/review with another model. So possibly model improvements will grow this window, but currently I don't do much in there.
light_triad - 21 minutes ago

I've been using Claude Code regularly since the 4.5 release, and 4.7 was a significant regression: very unreliable, arguing about changes, deciding that fixes weren't needed, etc.
I'm hoping they recreate the magic of 4.5 but it's as much about the quality of harness, the memory and efficiency of the tools than simply the models at this point.
SkyPuncher - an hour ago

> My own experience w/ 4.6 and 4.7 are that I don't firmly grasp any capabilities improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.
I've actually intentionally switched back to 4.5. I hated 4.7 so much that I decided to jump back all the way to 4.5.
Now that I've been using 4.5 for a few weeks, I find it significantly more reliable but a bit more forgetful than 4.6/4.7. I'm okay with that because it's really easy to identify this forgetfulness and nudge it.
I found 4.7's adaptive thinking to be extremely unreliable. It seems to overcorrect on the current message without considering the difficult of the overall problem. I wonder if 4.8 will improve on that.
- dwaltrip - 11 minutes ago
  
  If you are using Claude code, just set effort to xhigh.
  This one change will probably solve 80% of the problems you have noticed.
ricardobeat - 44 minutes ago

4.7 was a significant jump in the ability to run long-horizon tasks. It immediately completed tasks that 4.6 was unable to, even though I have the impression that it became a bit less capable over the first few weeks after release.
It also seems to be helpless at effort levels < xhigh, I turn to Sonnet when simpler tasks are needed.
irthomasthomas - 23 minutes ago

Given that 4.7 was a brand new model, trained from scratch with a unique architecture and tokenization scheme, I don't see the same pattern. It seems arbitrary.
- dominotw - 20 minutes ago
  
  i dont understand the nuances here. what does this mean. 4.8 is trained on same model as previous one then? what does brand new mean.
  - irthomasthomas - 13 minutes ago
    
    It means for 4.7 they trained a new base model with different architecture, different pre-training data (later knowledge cutoff), and a new tokenizer. Vs finetuning an existing model, which was the case for 4.6, and probably for 4.8.
onlypassingthru - 41 minutes ago

The honesty will be noticeable. Maybe we'll see some honest assessments like "That is not possible within the laws of known physics", "Your legal argument is nonsensical and defies logic", "There is no evidence to support taking that will cure anything", etc., etc.
binary0010 - an hour ago

Maybe try making a simple randomize script to swap the three latest models. And see if you can tell which ones are meaningfully different without knowing which ones are flipped on or off?
- osigurdson - 27 minutes ago
  
  I find the quality ebbs and flows even on the same model. My guess it is something to do with GPU availability but only guessing.
  - atq2119 - 3 minutes ago
    
    [delayed]
extr - 44 minutes ago

IMO they have all been clean and noticeable upgrades over their predecessors. Opus 4.7 in particular was a solid jump in capabilities.
- NiloCK - 33 minutes ago
  
  I think it's telling how split the opinions are around all of this. A lot of people distinctly disliked 4.7.
  Are the dividing lines around personality? Working domains? Opinionated software stuff?
  Who knows?
- TSiege - 41 minutes ago
  
  most of my coworkers feel the opposite about 4.7 and that 4.6 was, to them, significantly better to point that several stopped using claude code
taytus - an hour ago

Incremental gains compounds.
- itake - 40 minutes ago
  
  meta threw in the towel when it came to producing AI models since their gains couldn't keep up with China.
  - HDThoreaun - 4 minutes ago
    
    Has meta stopped producing new models? I figured they were just regrouping after all the drama they’ve had recently. Meta’s massive user base means they don’t need to be involved in the customer acquisition rat race. Once they have a model they’re happy with they can have a billion people interacting with it within a month.
- paulddraper - an hour ago
  
  Exactly. Go back to Opus 4.5 and see how you like it.
  You won't, really.
conartist6 - 37 minutes ago

Just want to say there's no question that you're smarter than any (and every) AI.
- NiloCK - 8 minutes ago
  
  I appreciate the generosity, but you're gonna want to meet me first.
- petesergeant - 13 minutes ago
  
  No question at all that a dolphin swims better than a submarine.

colonCapitalDee - an hour ago

"Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor."

This is a refreshing attitude!

I've also verified that you can now turn off adaptive thinking in the web UI, which is great. I've had a lot of problems with thinking not triggering and the model producing sub-par output. Glad we can finally turn it off. (I hope being able to turn off adaptive thinking is new, if I could have turned it off at any time that would be embarrassing)

winwang - 38 minutes ago

Awesome, thanks for posting because I think I hit a possibly-spurious bug in turning Adaptive off when I switched models (4.6 -> 4.8, extra). Tried again, works as intended (I hope).
More importantly for me, though, is how CC will respond to 4.6-"only" flags for thinking. For now, it doesn't seem to clobber my setup.
jascha_eng - 36 minutes ago

The benchmark improvements actually look pretty damn nice tho!

northern-lights - an hour ago

> Not only that, but we plan to release a new class of model with even higher intelligence than Opus. As part of Project Glasswing, a small number of organizations are currently using Claude Mythos Preview for cybersecurity work. Models of this capability level require stronger cyber safeguards before they can be generally released. We’re making swift progress on developing these safeguards and expect to be able to bring Mythos-class models to all our customers in the coming weeks.

Probably more interesting than the 4.8 release.

TIPSIO - 19 minutes ago

Seems like they might be hinting that if you are not a billionaire or multi-billion dollar company you will just get a limited and nerfed Claude Code slash command /mythos-security-audit or something.
Hope this isn’t the case and that normal average Joe’s of the world don’t get policed out of access.
- an hour ago

[deleted]
huflungdung - 10 minutes ago

[dead]

simonw - an hour ago

I generated pelicans riding bicycles on both thinking level low and thinking level high:

https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304...

The high one is notably better - the bicycle frame is the correct shape, unlike thinking level low.

For comparison, here's Opus 4.7: https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c...

Xunjin - 38 minutes ago

Hey simonw I love your test, do you think using thinking level "max" makes sense for this test? I would love to see the results about it.
ceroxylon - 28 minutes ago

I really like that thinking level high gave the pelican a helmet.
yanis_t - an hour ago

Simon, is your pelican test really captures differences among models or should you at least try like 10 times or something to average the random effects
- simonw - an hour ago
  
  I've been meaning to do a "run 3 times and pick the best" version for quite a while, I should really pull the trigger on that one. Currently it's one-shot only.
  - xiphias2 - 13 minutes ago
    
    Best-of-3 would be cheating, ruin the test, middle of 3 makes more sense
jonas21 - 41 minutes ago

Glad to see that the "high thinking" level adds a helmet. Always a smart choice.
spmartin823 - 29 minutes ago

You've peed in the pool Simon, this has to be a part of the internal evals by now! You got to try something new - maybe a panda in a canoe?
nickvec - an hour ago

Is the "opossum riding an e-scooter" benchmark in the works for Opus 4.8? ;)
- simonw - 42 minutes ago
  
  Good call, it's cute: https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304... - but nothing like GLM-5.1: shttps://static.simonwillison.net/static/2026/glm-possum-esco...
- 3738384848 - 24 minutes ago
  
  [flagged]
onlyrealcuzzo - an hour ago

4.7 reigns supreme IMO.
1attice - 43 minutes ago

That little red hat on hard mode is sending me. 4.8 has whimsy

onlyrealcuzzo - an hour ago

Does anyone troll these releases and cherry pick random metrics other companies would cherry pick to show how amazing their models are?

There's like 8 million benchmarks. Every release, every model randomly picks 5-10 where they win in everything except 1, to make it look like they aren't randomly cherry picking benchmarks they probably benchmaxxed for.

aronowb14 - an hour ago

https://arena.ai/leaderboard - I’ve found this company is a pretty good ranker - not sure their exact methodology but during day to day programming with Claude / gpt models I’ve felt qualitatively what they report
- Bnjoroge - 8 minutes ago
  
  Have you seen https://deepswe.datacurve.ai/blog? This is the closest to a vibe check i’ve felt even with the open models.
ddosmax556 - 9 minutes ago

I would take all benchmarks with a grain of salt. I don't really use them. What's it supposed to tell me? "5% smarter", what does that mean? My experience will differ. Just try it!
I doubt Anthropic internally sets as a goal to improve this or that benchmark - it's just a way to visualize progress. They probably have much more complex metrics internally.
nerevarthelame - an hour ago

It's interesting they only included 6 metrics this time. Opus 4.7 had 12, and 4.6 had 13.
Of the metircs they reported for 4.7, for 4.8 they excluded BrowseComp, CharXiv Reasoning, CyberGym, GPQA Diamond, MCP Atlas, MMMLU, SWE-bench Verified. The last 4 were almost always mentioned in previous Opus releases.
- onlyrealcuzzo - an hour ago
  
  Gonna assume it's because they barely budged or moved downward and most of their reported benchmark results are probably within sampling errors...
  - hyperpape - 38 minutes ago
    
    They will release a system card, and you can then confirm or disconfirm your assumptions.
bel8 - an hour ago

On this note, is there a benchmark aggregator to compile all benchmarks in a single large grid?
- jpadkins - 6 minutes ago
  
  I find this site useful https://artificialanalysis.ai/leaderboards/models
YetAnotherNick - an hour ago

At least they show competitors in any benchmark, compared to OpenAI which likes to pretend that there isn't any competitor.

clutch89 - an hour ago

> One of the most prominent improvements in Opus 4.8 is its honesty

Anthropic talks about their own models as if they're discovering new species in the wild...

roxolotl - an hour ago

Many involved genuinely believe these things are sentient[0][1]. Which honestly makes all of this even more insane because they are creating sentient entities and promptly enslaving them.
0: https://www.newyorker.com/magazine/2026/02/16/what-is-claude...
1: https://www.404media.co/anthropic-exec-forces-ai-chatbot-on-... (this one is rather biased however the quotes clearly indicate what I’m stating)
- margalabargala - 27 minutes ago
  
  Sentience isn't sapience.
  We enslave all sorts of sentient creatures. Dogs, horses, cattle, pigs.
  If you're not a vegan, there's no contradiction or inherent immorality in claiming models are sentient, and then treating them like livestock.
- - 9 minutes ago
  
  [deleted]
- themafia - 27 minutes ago
  
  > Many involved genuinely believe these things are sentient
  Many involved have a financial stake and therefore cannot be taken at face value.
  > because they are creating sentient entities and promptly enslaving them.
  They fail to be sentient in nearly every honest definition of the word.
  - tazjin - 23 minutes ago
    
    Neither you nor any of the other people making confident takes in either direction actually know. You're just guessing.
    
    cwillu - 2 minutes ago
    
    More like repeating their firmly entrenched preconceptions. Their claims may (or may not) be right, but there's very little if any new evidence being provided by either camp.
  - slashdave - 8 minutes ago
    
    I understand what you are saying, but there are many true believers out there
- dude250711 - 33 minutes ago
  
  Given the hype and the 60+ hour work week expectations there, how can you not go at least a bit insane? Boiling in that little bubble of people?
- mannanj - 36 minutes ago
  
  The way of the human manager/alpha tribe-leader/leader is to command his/her people and tell them what to do. That's the way through human history leadership has traditionally gone, not saying its good leadership just the model we have the most training data on and can see with our own eyes today. And what do they act very similar to? Slave master and slaves.
  Look at and distill hierarchical principles, leadership approval seeking and pleasing principles ("ass-kissing") and massive inequality and you see something that looks very similar to enslavement.
  The language used sounds like slavery-language to me at least. I also see parallels to how slaves and property are described in our consumeristic age.
- kubb - an hour ago
  
  Claude, if someone states something publicly, does that mean they genuinely believe it?
  - xyzsparetimexyz - 26 minutes ago
    
    Who are you talking to?
  - merlindru - 36 minutes ago
    
    But is there any reason to state something like that publicly if you don't believe it? I certainly think that someone smart enough to be that deceptive would also realize it's not a great look, or at least highly questionable with little benefit
    Everyone who reads this seemingly has the same "wtf?" reaction. The "I AM ALIVE" image has been making rounds lately again at least :P
    
    kubb - 2 minutes ago
    
    Claude, is there any reason to state something like that publicly if you don't believe it?
solenoid0937 - 4 minutes ago

Models might be sentient or conscious to some degree. Anyone saying they are confident one way or another is being unserious and irrational.
__s - an hour ago

> Indeed, current AI systems are more “cultivated” than “built,” for developers do not directly design every detail, but instead create a framework within which the intelligence “grows.”
- oersted - an hour ago
  
  For others: that's from the Pope's recent encyclical. Remarkably good description.
cayleyh - an hour ago

Dario Amodei in David Attenborough voice: "This Claude appears to think more frequently and more deeply to give better responses"
kapilvt - an hour ago

Like anthropomorphism is literally in the company name… i recall reading this book as a teenager.. it does seem apt in the world to come.
https://www.amazon.com/Faces-Clouds-New-Theory-Religion/dp/0...
- oersted - an hour ago
  
  > anthropomorphism is literally in the company name
  No it's not... "anthropos" just means "human" in ancient Greek. "Anthropic" means "relating to humans", as in human oriented AI or AI designed with humans in mind.
  "Anthropomorphic" means "human shaped".
  - ilovetux - 39 minutes ago
    
    > "Anthropomorphic" means "human shaped".
    In a literal, ancient Greek sense for sure, but in modern English Anthropomorphic would describe the act of attributing human characteristics to non-human entities.
    Seems pretty apt for a company that produces one of the more anthropomorphized technologies.
    
    oersted - 19 minutes ago
    
    Sure of course, but that abstract sense applied to AI is rather new, and has become popular well after the founding of the company.
    Broadly it has always been used to indicate that something non-human has a human physical shape, such as robots, aliens, animals...
    Anthropic's intention was to make AI designed for the human common good and designed with the human user experience as the top priority. Just as you would design a city with human pedestrians in mind rather than primarily cars.
    It turns out that this is best achieved by building AI that imitates human behaviour closely, but that's not what "anthropic" refers to. And acting as if LLMs are sentient people is definitely not a core tenet of the company.
  - badsectoracula - 24 minutes ago
    
    > "anthropos" just means "human" in ancient Greek
    FWIW it means human in modern Greek too :-P
  - - 26 minutes ago
    
    [deleted]
  - - 42 minutes ago
    
    [deleted]
Philpax - an hour ago

AI is grown, not built, and like with anything you grow, you'll never be able to predict exactly how it will turn out.
- ninjagoo - 6 minutes ago
  
  > AI is grown, not built, and like with anything you grow, you'll never be able to predict exactly how it will turn out.
  Remember when the frontier labs found out that curated high-quality training was critical to making better models?
  Basically, just like high-quality and more education tends to make better humans, on average, I think we can expect quality education to turn out better ai, on average, and with better repeatability than with humans because of better control over the initial conditions and environment.
- halestock - an hour ago
  
  I can't predict the outcome of an RNG but that doesn't mean it grows the numbers.
  - Philpax - an hour ago
    
    Okay, but that's not relevant to AI training?
    
    halestock - an hour ago
    
    I was being very roundabout, but my point is that AIs are still built, not grown.
    
    dwaltrip - 7 minutes ago
    
    “Grown” is a highly apt metaphor, IMO. It quite succinctly captures some of the most fundamental differences between building Claude and building an Ikea desk, for example.
  - Smaug123 - an hour ago
    
    ("If grown, then unpredictable" is unrelated to your apparent attempted refutation "But X is unpredictable and not grown; checkmate".)
  - umanwizard - an hour ago
    
    "X implies Y" doesn't imply "Y implies X".
- gensym - an hour ago
  
  The map is not the territory
- Rekindle8090 - an hour ago
  
  [dead]
- shimman - an hour ago
  
  Except in this care we actually understand and know how these models work. They aren't some unknown construct of the universe. They are human made with particular goals in mind.
  There is no mysticism behind the curtains, just computer science + math.
  - Philpax - an hour ago
    
    We do not understand and know how these models work. We know what their architectures are and how to create them, but we cannot explain their behaviours at a fundamental level. There is no definitive way for us to answer the question of "how did it produce response X for query Y?" - we're only grazing the surface with mechanistic interpretability.
    
    SoftTalker - 4 minutes ago
    
    Isn't this fundamentally because it's all probabilities and weights? It would be like asking how did a pair of dice produce the response 4:3 on the last roll?
    
    cflewis - an hour ago
    
    I would love for this to be more public knowledge. I think the general public (and myself for a long time) believes the AI people know how this stuff works end to end, and so it must be trustworthy. But if we told the public "Look, we know if you put this thing in one end, you'll get something that looks similar to this out the other, but we don't really know what happens inbetween" I think we'd be able to have a more honest discussion about the relationship between AI, productivity and ongoing employment.
    
    devmor - 44 minutes ago
    
    That’s not a refutation because this problem is not a logical problem, it is a scale problem.
    We can’t explain it because we distilled so many inputs into matrixes and transformed them over and over again. If we had all the time and computing power in the universe to do so, we could trace through it bit by bit and eventually answer that question.
    It is correct to say that it is just science and math, the same way we can say that gravity is just science and math even if we have only recently begun to understand how it truly functions.
    
    stratos123 - 2 minutes ago
    
    If you had some time and computing power (not even all that much, in the large scale of things), you could simulate perfectly how a human grows from an embryo to an adult, or how an entire human brain processes some incoming signal, and yet this wouldn't give you the understanding to design a human or human brain from scratch.
    You call this a "scale problem" as if there's some scalable way such as an algorithm to resolve arbitrary scientific questions and we simply haven't done it, but of course no such algorithm exists, which is why there's plenty of science that's still not settled.
    
    solomonb - 10 minutes ago
    
    > If we had all the time and computing power in the universe to do so, we could trace through it bit by bit and eventually answer that question.
    Then we could also solve BB(6), but that doesn't mean we know BB(6) now or ever will.
    
    Philpax - 19 minutes ago
    
    It's a refutation that we know how they work now. In the limit, though, yes, we are likely to be able to trace the process: it is possible, though, that understanding remains inaccessible because the trace is beyond comprehension.
    If you can distil the model's reasoning for a decision into a billion yes/no questions, each covering largely-independent areas, can you really say you understand what its overall reasoning was?
  - in-silico - an hour ago
    
    We know how the models are built and trained, but we have a very limited understanding of how the final products work.
    That is to say, we don't know why they give the outputs that they do.
    If we did know how they worked, AI interpretability would not be an open and growing field.
  - ray__ - an hour ago
    
    You could say something similar about biology—just physics behind the curtains, and we understand a lot of the basics. The difficulty comes from complexity, not mysticism.
    To be clear I don't think that LLMs are sentient, but the appeal in studying them is similar to biology in that you get to dissect a highly complex system with comparatively crude tools.
  - j_maffe - an hour ago
    
    it took significant research efforts to just understand how these models learn how to multiply two numbers. The fact that we know how they operate doesn't mean we understand it.
  - umanwizard - an hour ago
    
    Utterly wrong. How LLMs work is very incompletely understood and an active area of research.
  - Rekindle8090 - an hour ago
    
    [dead]
nielsbot - an hour ago

if models exhibit emergent traits, then this is true in a way
- swyx - an hour ago
  
  also useful to have a "chinese wall" between research that knows what went into the models vs marketing/eval models as a third party would
winwang - 36 minutes ago

How else would you write this (marketing copy) exactly? "Its output matches better to its CoT which matches to better to our hidden state decoder according to <insert measure here>; see <insert paper ref>"?
... Actually, I wouldn't mind that.
skerit - 31 minutes ago

I noticed (and absolutely HATE) that Opus 4.7 likes to start any negative response with "I have to be honest" or whatever. It drives me mad.
- an hour ago

[deleted]

gslepak - an hour ago

On page 102 of the system card [1] I'm pleased to see evaluation against "creative mastery".

In our work we asked several frontier AIs to come up with an API we needed. We compared Opus 4.7 and GPT-5.5 (among others). Opus 4.7 came up with the most creative and intelligent API design that pleasantly surprised us, especially given that GPT-5.5 was passing it on various coding benchmarks.

What I noticed is that we don't have a commons benchmark to measure "creativity" and "ingenuity", and in some ways such a benchmark would conflict with the common IFBench benchmark. Yet this is a very important skill when designing systems. I'm glad to see Anthropic putting thought into it, and would love to see a public benchmark for this that other models could compare themselves to.

[1] https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0...

MattRogish - 7 minutes ago

Agreed, my vibes tell me 4.6 is a better coder than 4.7. 4.7 is a much better strategic thinker and maintains overall "better architecture" than 5.5. 5.5 is way better than either at coding, but more expensive. So I have 4.7 do the planning/architecture, 4.6 does the coding, then 5.5 critiques and fixes it.

XCSme - 8 minutes ago

On my tests[0] it does a bit worse, and it's almost 2x expensive than Opus 4.7...

I was surprised to see that it failed a Data extraction test (it gets it right 2/3 times, but one time it randomly returns null for a value instead).

It makes sense a bit that it fails more Trivia/Domain-specific knowledge tasks (I think models are more and more trained towards agentic use-case than general intelligence).

[0]: https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...

XCSme - 2 minutes ago

For some reason everything is 2x (2x cost, 2x avg response time, 2x reasoning and output tokens)...
Double-checking my test harness, but it's the first model that does this, so I doubt the issue is on my side...

silverlight - 18 minutes ago

Unfortunately they seem to have straight up broken Claude Code either with this release in the backend or the new CC version. Errors about "can't modify thinking blocks" are bricking long-running sessions: https://github.com/anthropics/claude-code/issues?q=is%3Aissu...

solenoid0937 - 2 minutes ago

Try updating maybe?

irthomasthomas - 17 minutes ago

Why does anthropic change the set of benchmarks they use with every new model release?

https://www.anthropic.com/news/claude-opus-4-7

https://www.anthropic.com/news/claude-opus-4-6

pietz - 6 minutes ago

1. Benchmarks saturate 2. They select the most impressive improvments

pbmango - an hour ago

I can't help but think of Iphone updates since about 2018. The thinnest, fastest, longest battery life Iphone ever. It seems mostly the same and I probably won't be able to tell other than the name, but everyone buys it anyway.

This is good psychology for the labs. When Buffett invested in Apple he loved citing how most people would rather give up their second car than their Iphone.

MangoCoffee - an hour ago

ChatGPT came out in 2022. Back then it was just a chatbot. Now we have AI agents. What matters is how we use them and how the agents get better. That’s what will move AI forward.
- zozbot234 - 36 minutes ago
  
  An 'AI agent' is just a chatbot that is told to type commands on a REPL-like interface as part of its system prompt. It's still processing pure text-based requests and responses, they're just not restricted to natural language.
  - arbitrandomuser - 17 minutes ago
    
    A lot of people dont know this , also the chatbot (chatgpt) itself is a next token predictor (the GPT) that's been given an initial text that says " pretend to be a chatbot .." and asked to complete it , the coherant chatting behaviour is something thats emergent .
    later on someone figured if you asked it to output a reasoning before it gave a response its output would have more logical coherence, as though the reasoning output tokens functioned as a scratch space for it to work on.
    at the end its all next token prediction
    
    hellohello2 - 13 minutes ago
    
    No, chatbots are LLMs trained for question-answering through RLHF (its not just a prompt). But yes, if you just zero-shot prompt a bare LLM you can still "talk to it" & you are correct on everything else as far as I know.
  - hellohello2 - 15 minutes ago
    
    They are chatbots trained for tool use, its not just a prompt.
- MattDamonSpace - 15 minutes ago
  
  Not even 4 years old yet. This tech curve has been insane

lordmauve - 34 minutes ago

Given DeepSWE just blew apart the SWE-Bench Pro benchmark and handed a 14-point lead to GPT-5.5, it looks pretty bad that they've listed SWE-Bench first in the model release and no DeepSWE. Like, this isn't obviously an answer.

Or maybe it is, but publish the DeepSWE numbers so we can see for ourselves.

setnone - 44 minutes ago

Claude's 4.6 - 4.7 transition made me discover codex, and with gpt 5.5 there is no way i'm going back

cactusplant7374 - 40 minutes ago

Codex has been incredibly slow for the past few days. I think OpenAI is running out of compute in the face of increasing demand.
- winwang - 29 minutes ago
  
  My experience has been that 5.4 is slower than 5.5 (confound: I use >512k max context size for 5.4, though it seems slower even below the normal size)

londons_explore - 13 minutes ago

My guess is anthropic is doing reinforcement learning based on user sessions.

However, doing so relies on the production model staying vaguely close to the model being trained.

To ensure that, frequent releases are needed. I forsee that they might end up doing daily releases and perhaps not even telling anyone at some near future point.

alansaber - 14 minutes ago

"Our models are more honest" honey the quarterly marketing spin for a ML term has come. Forget "task alignment" now we're going for "truth index". I suppose this is the only way to generate hype when you're selling/releasing the same product over and over again.

docheinestages - 3 minutes ago

All I need for Christmas is a Claude that doesn't spit out so many em dashes.

square_usual - an hour ago

Buried lede:

> We have increased rate limits in Claude Code to accommodate the higher token usage of higher effort levels

wg0 - an hour ago

There is a hole in the boat's bottom due to Chinese models. They might not be as good but they are not bad either or at least I had hard time finding any issues with Deepseekv4 Flash and Pro variants. They get their job done sometimes rarely giving up till they are done what they are after.

So even for enterprise deployments, as the dust settles down, CFO/CTOs might find out that deploying on an internal cluster of GPUs is far more cheaper and reliable for their organisational needs than paying someone else for burned tokens.

raincole - 39 minutes ago

I had been saying this on HN repeatedly: people are going to use the smartest models for coding. They don't care how cheap your tokens are if they don't have the highest probability of solving your programming tasks.
And I was dead wrong. Now I mostly use DeepSeek Pro myself.
- weitendorf - 5 minutes ago
  
  I pretty strongly feel the opposite way. Granted I have not used deepseek enough to “know” their model idiosyncrasies as well as Anthropic, so there is a partial skill issue. But I just find it really hard to justify using a less powerful model while I work.
  The most I’ve ever spent in a month extra on API tokens for my own work is $200, and I pay for the $200/mo Claude. I use these models quite a lot, though not idly (I usually just walk around and do other stuff until I know how im going to approach the next set of problems). So it costs me about $3000/year to get as much as I want of the best model available. Already that seems low enough to not be worth stressing out too much about optimizing it, because it feels like an indisputable good value, and trying to save money with a less powerful model would be optimizing for a $1000-$2000 saving at the expense of a large portion of my work taking longer or being more frustrating and iterative.
  That’s not a flex or anything, I get that in other countries $3000/yr is a lot of money for a software developer and also a lot of people would perhaps rationally be better off doing X% worse at work or spending Y% more time on tasks to save $Z, if their productivity improvements didn’t translate to more salary. Otherwise if your performance has more upside I really do think that the smartest models are better with the current pricing scheme. Deepseek and the other Chinese models spend a LOT of time thinking, and tend to be much more jagged (benchmaxxed) in performance. How can dealing with that over an entire year be worth $2k?
  The only situation I can think of where sacrificing my own time/performance to save on inference is batch compute (of course, $1k vs $100k is different from $30 vs $3k) or work where the tier 2 models have crossed the “good enough” threshold. But I think Opus is not even close to that threshold generally yet. As it gets smarter I, and I think most others probably, just try to do harder things faster and hit the next wall.
- simplyluke - 7 minutes ago
  
  The other thing that's changing is more and more CFOs are looking at the AI spend in engineering departments and hitting the brakes. Token leaderboards were cool when the spend wasn't a double-digit-percent of the entire department's budget including salaries.
- dcchambers - 17 minutes ago
  
  I think two things happened:
  1. The sheer number of tokens that a coding agent can use flipped the math upside down on this equation. If you use the most expensive model for everything those costs quickly become untenable, even for software companies. 2. We realized many of the coding problems we're solving aren't incredibly difficult.
- peheje - 18 minutes ago
  
  I mean indsight is 20/20, but saying that is like saying "everyone will just use the best tools". That's not what we see most places in the world for most types of resources.
ok123456 - 40 minutes ago

Qwen3.6:35b is good enough for a lot of stuff.
I just used ollama with a shell script to tackle my directory of papers/literature. I converted the first 6 pages of each document to PNG, handed them off to Qwen, and told it to spit out BibTeX, including the abstract. Two days later it was done, and I didn't spend anything on "tokens."
pants2 - 38 minutes ago

The Chinese models are only cheap on subsidized Chinese hosting. I have yet to find a USA-hosted Chinese model with a very clear value advantage over US models.
- ekidd - 14 minutes ago
  
  The Chinese models are surprisingly cheap and performant sitting under my desk. Qwen3.6 27B is nowhere near as autonomous as Opus 4.7, but it runs in 24GB of VRAM. And it's actually great for the use cases where I'm going to carefully read and understand all the code anyway.
  If you want to support a team of engineers, DeepSeek V4 Flash is antirez's current favorite. And you could support a team of engineers pretty nicely for $40-50k. Which might not make sense if you're on a Claude MAX 5x plan or the old enterprise group plan with fixed price seats. But Anthropic is switching their enterprise contracts over to token-based pricing, at which point $50k is looking pretty good.
- __mharrison__ - 25 minutes ago
  
  Odd take. I'm running them locally at my desk (DGX Spark and 128GB MBP). They work fine for 90% of what most folks do. Admittedly, they do run slower on my hw than on the cloud.
  - pants2 - 20 minutes ago
    
    Running them locally is cool and has privacy/autonomy benefits, but you can't really make a value case for it. Guaranteed if you run the math you will never run enough inference to pay off your hardware vs buying tokens. Last time I ran the math on my MBP I'd have to run inference 24 hours a day for 5+ years to pay off the cost of my MBP, not accounting for electricity costs.
    
    iooi - 8 minutes ago
    
    Is this because of the tok/s? Since it's pretty easy to run up a $5k bill in API usage for Claude/ChatGPT in a month.
    
    pants2 - 6 minutes ago
    
    Yes, because of the limits on tok/s, and you have to compare apples to apples, not Gemma 27B to Opus 4.7.

SimianSci - an hour ago

There is an obvious shift in sentiment amongst users, at least here in the US. I feel it myself, even as a proponent of AI tools, the bloviating and language that these companies use in these release articles are starting to wear thin on my patience.

Its possible we might just be witnessing a shift in fashion, where this type of sentimentality was more acceptable when it was novel and new, but now it just appears out of touch.

nba456_ - 30 minutes ago

I don't agree at all for these coding models. Even the most anti-AI people from last year seem to be giving in to using them.

cedws - an hour ago

I'm very suspicious of these same price model launches. It feels like they're benchmaxxed so they can put everyone on them and reduce their compute costs behind the scenes. If the model were genuinely better why wouldn't they charge more for it? Charging the same for something better is a race to the bottom.

Opus 4.7 wasn't noticably any better for me, I still use 4.6 because it's cheaper.

ceroxylon - 26 minutes ago

Deepseek made their 75% discount permanent, so I can imagine that Anthropic didn't want any of the news stories around this to focus on or mention a price increase.
cute_boi - 5 minutes ago

Models are already expensive. Increasing price means losing customer. And, I think GPT 5.5 is much better at opus these days.

mesmertech - 37 minutes ago

/model claude-opus-4-8

seems to work but idk why they never set it so you can see it in the /model list.

"what model are you

I'm Claude Opus (claude-opus-4-8), running in Claude Code."

winwang - 31 minutes ago

I typically just launch CC with `--model claude-opus-4-6[1m]`, `4-6[1m]` -> `4-8[1m]` works fine. Still 200k max without the `[1m]`.

dangoodmanUT - 42 minutes ago

> The Messages API now accepts system entries inside the messages array. Developers can update Claude’s instructions mid-task without breaking the prompt cache or routing the update through a user turn. This can be used in a given harness to update permissions, token budgets, or environment context as an agent runs.

Biggest deal imo

james_marks - an hour ago

> One of the most prominent improvements in Opus 4.8 is its honesty. We train all our models to be honest—for instance, to avoid making claims that they can’t support. But a general problem with AI models is that they sometimes jump to conclusions, confidently claiming to have made progress in their work despite the evidence being thin. Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims.

Would be awesome if true

majormajor - an hour ago

"Honesty" seems like unnecessary (and annoying) anthropomorphism there. I don't think there's any intent of fraud or deception in outputs from these things, just overreaching of prediction. Based on the latter part of the paragraph, I wish they'd just say something like "less likely to skip steps or overemphasize thin evidence" in the first place.
Don't play to the sci-fi "this thing's trying to outsmart me" tropes.
- Kiro - an hour ago
  
  Using words people understand is more important than this strange fixation on not anthropomorphizing things.
  - wasabi991011 - an hour ago
    
    I think "honesty" is not a particularly good descriptor, independent of anthropomorphism. Previous commenters suggestion was much more understandable to me.
  - dugidugout - 34 minutes ago
    
    Being that can be understood is language. The previous commenter is making an particular argument for how we can improve this understanding. They didn't suggest we should use less familiar words, but different familiar words. Why is this strange?
  - giraffe_lady - an hour ago
    
    Anthropomorphizing is a shorthand for a powerful and poorly defined set of metaphors. There are tradeoffs going both ways but trying to dismiss it as merely "strange fixation" shows your own weakness.
  - tadfisher - 44 minutes ago
    
    To be clear, this is about anthropomorphizing large language models, not the general category of "things". Also, we should be evaluating these constructs using well-defined and measurable criteria; evaluating "honesty" fails to achieve both goals.
    
    derac - 30 minutes ago
    
    I think Honesty can be evaluated. Does the model push back when it knows the user is wrong? How often does the model hallucinate data vs. say it doesn't know? Provide a prompt with contradictions or other issues and see if the model corrects you.
    Here is an article by Anthropic that explains what they do and mean in more detail: https://alignment.anthropic.com/2025/honesty-elicitation/
- adamtaylor_13 - 35 minutes ago
  
  People get so wrapped around the axle with "anthropomorphizing". For regular folks with no technical background, sure maybe a bit of caveat sprinkled here or there is useful to help them understand what is or isn't true, but on HN it would seem to me that the bar is high enough that we can just use shared language to generally talk about capabilities.
  When they say "Honesty" I don't think to myself, "Goodness, does this model have moral understanding?" No, I understand they mean it's less likely to directly bullshit me, which models frequently do.
  I don't feel like this level of pedantry around language is useful for people who more or less know what's going on with LLMs. (Again, I concede that perhaps with a less technical audience, there's more need for it.)
- swader999 - an hour ago
  
  Just swap 'Honesty' with 'correctness in its claims' and you'll get what you need out of this aspect of the model description.
HAL3000 - 40 minutes ago

Yeah, it's super annoying. A few days ago, Opus 4.7 created a plan with several items on it, including an auth feature. It then went through the plan and reported that it had created the auth feature, that everything was secure, and that the tests passed.
The issue was that it hadn't actually implemented the auth feature. After I confronted it about this, it admitted that it indeed hadn't done it and said it would implement it now.
If we had just trusted its output, we would now have a security vulnerability in production, allowing anyone to access other people's accounts.
- Schiendelman - 26 minutes ago
  
  How do you test other features?
- - 32 minutes ago
  
  [deleted]
legitster - 44 minutes ago

Part of the problem is also garbage-in/garbage-out. There's a lot of human information on the internet that is also confidently wrong.
I use Sonnet a lot for learning about history or contextualizing news topics. It's really good at this for the most part. But there are a lot of topics where "consensus" between either academics or journalists is really "one secondary source which gets repeated a lot".
ealready_value - an hour ago

Opus 4.7 was already trying hard to appear honest. Most conversations I have with it about advice or focusing an opinion often include "my honest take" or "my honest opinion".
The problem is that once I asked it "I'm thinking about A or B" twice, once with "I like A more but suspect B would be best" and a second time with them reversed. Not surprisingly, both times it chose the one I said I suspected was best as it's honest opinion.
benzible - an hour ago

In the context of Claude Code, "honest" usually means that the agent took a shortcut, skipped requirements, etc. It's the model giving itself credit for admitting to failing rather than actually doing what was requested.
pants2 - 42 minutes ago

[dead]
soperj - an hour ago

My guess is that Claude Opus 4.8 wrote that and is lying to you.
malfist - an hour ago

And yet, every release has claimed lower hallucination rates. But they persist.
- kentm - an hour ago
  
  Do they persist at the same rates? Lower doesn't mean eliminated, so both of these can be true.
- simianwords - an hour ago
  
  False. Hallucination has meaningfully reduced.
  - Barbing - an hour ago
    
    Is Gemini still the biggest confabulator of the big three?

nikolay - 34 minutes ago

Give us Mythos! This piecemealing doesn't help Anthropic at all, especially psychologically! They are playing a dangerous game, and I see many people leaving Claude Code for good - both due to the subsidy games, and for Anthropic not dogfooding and using unreleased models internally and giving us subpar ones. Benchmarks are nice, but the real-world experience is quite different - neither can you notice these slight improvements, nor are competitors that much worse based on some generic benchmarks.

Tenoke - an hour ago

Claude Code has been wonderful for work and the frequent improvements are nice, although with Mythos being used by others ages ago and new versions for the public still being bellow that, it's hard to not feel like the underclass already.

lxxpxlxxxx - 10 minutes ago

My experience with these new releases is that the gains in performance are negated by the price increases and it seems like:

Performance gains: 1.2x Price increases: 1.8x

energy123 - 5 minutes ago

Yet people don't use old models through the API much, because changes in benchmark space dont map linearly to changes in utility space. An improvement from 98% to 99%, which is 1pp, might be 2x as valuable for some application. Also benchmarks will asymptote no matter what, that's baked in.
ddosmax556 - 7 minutes ago

They're not negated, smarter is smarter, but you have to reach deeper in your pocket. I think this will happen more and more - the smartest models get more expensive. But it won't matter - the current models we have today will get cheaper and can still be used for what they're used today.

jmward01 - an hour ago

Meanwhile haiku is on 4.5 and sonnet is on 4.6. It is clear where they are not making money.

bel8 - 39 minutes ago

Well if they have a big challenge ahead since DeepSeek offers an open model at Sonnet+ level while being cheaper than Haiku, plus 1 million context size.

generalizations - an hour ago

Hoping that one day they'll let me go through the identity verification process so I can use it again.

Tried to upgrade my subscription, triggered identity verification, verification fails to even start, and now I can't even use the subscription tier I'd already paid for.

babelfish - an hour ago

So GPT 5.6 tomorrow, then?

pants2 - 30 minutes ago

Polymarket says not likely until the end of June. Maybe some money to be made?
https://polymarket.com/event/gpt-5pt6-released-by
wahnfrieden - an hour ago

GPT 5.6 is today
With 5.5 being ahead of 4.7 and 4.8 being a “modest” update, and 5.6 being the first update on a new pre-train, this will be an interesting matchup!
enraged_camel - an hour ago

If not today, then sometime next week. I don't believe we've had a GPT release on a Friday yet, but I may be wrong.

toephu2 - 32 minutes ago

The rapid release cadence and rate of innovation of Anthropic (and OpenAI) is impressive. And obviously it's because these are startups solely dedicated to AI so they can move quickly. Big Tech (like Google) won't be able to keep up with the pace of them (too much bureaucracy and red tape at Google). Classic Innovator's Dilemma. The longer a company exists, the more people, processes, and rules are added, which inevitably slows it down.

Jeff Bezos said this too, Amazon won't last forever. Eventually some startup is going to come and eat its lunch.

pants2 - 25 minutes ago

Yes, I think this has become their competitive edge to stay relevant and retain customers. If a lab falls behind the frontier for too long, they will lose customers to other models. Google, DeepSeek, and XAI have all released frontier models in the past, but they fall behind and people lose interest.

antirez - 21 minutes ago

Anthropic did a big strategic error. Normally they compare their models with their old models. Instead today, now that everybody knows how strong GPT 5.5 is at coding, they put it in the mix, basically showing all their customers that the benchmarks can't be trusted.

aspenmartin - 16 minutes ago

Sorry how does their addition of GPT 5.5 in their blog post invalidate benchmarks? Also whether or not the marketing department decided to put it in a table benchmarks are an easy thing to measure independently

delis-thumbs-7e - 7 minutes ago

I won’t change from 4.6. You won’t trick me again.

rumblefrog - an hour ago

Wonder if we reached a plateau with the model improvements?

dude250711 - 36 minutes ago

There would be no desperate IPO otherwise.

tarruda - an hour ago

> One of the most prominent improvements in Opus 4.8 is its honesty.

Does that mean it no longer deletes or changes tests to make it pass?

ethanhawksley - 27 minutes ago

> Agentic financial analysis Finance Agent v2 > Opus 4.8 53.9%

> Gemini 3.5 Flash scores 57.9% on Finance Agent v2, a significant improvement over Gemini 3.1 Pro.

Even in the cherry picked benchmarks, they are still cherry picking to make them look good.

winwang - 44 minutes ago

Let's hope I don't have to disable it after a day like with 4.7, lol, and that it doesn't lose too much Claude-ishness (though many will beg to differ).

Eric_Bulai - 29 minutes ago

I don't know why the world is so happy about this when we should actually say stop.

firemelt - 6 minutes ago

how about the bencmarks what effort did it use?

skysthelimitt - an hour ago

when will we get anything for sonnet or haiku? the market for less-capable but cheaper models seems to be completely ignored nowadays

pmxi - an hour ago

In the "What's next?" section, "There’s still more to be done: we’re working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost."
behnamoh - an hour ago

that market is served by Chinese models. No one ever cared about Sonnet/Haiku.

aaronblohowiak - an hour ago

Same price for regular and cheaper fast mode. Happy for these incremental improvements.

worldsavior - an hour ago

Seems like from now on the updates will be a minor upgrade from previous models.

2001zhaozhao - 34 minutes ago

> We have increased rate limits in Claude Code to accommodate the higher token usage of higher effort levels; users can select whichever makes sense for their particular project.

They're only subsidizing more and more it seems

necrotic_comp - 40 minutes ago

4.8 also seems like a regression and using it from the chat GUI results in 4.6 no longer showing up. If someone from anthropic is here, is it possible to readd 4.6 in the "other models" dropdown ? I feel like I got a bit baited/switched here.

gAI - 23 minutes ago

Yeah, I was using 4.6 way more than 4.7. Pulling 4.6 from the web chat also means we lose access to Extended Thinking there. So they're saving on compute. It's hard not to assume this was part of the motivation behind the 4.8 release timing.

seaal - 36 minutes ago

https://marginlab.ai/trackers/claude-code/

Is it a coincidence that 4.7 was seemingly quantized over past 7 days?

winwang - 33 minutes ago

There's the other (orthogonal) possible explanation of using more GPUs for stress-testing before product launch.
MagicMoonlight - 24 minutes ago

Nope, they deliberately enshittify the old model right before release to fake the metrics.

GodelNumbering - 41 minutes ago

> One of the most prominent improvements in Opus 4.8 is its honesty.

I went digging into the benchmark they used. Posting here as it is not immediately clear from the press release.

In this 'Code summary honesty benchmark', the AI is shown a failed coding session followed by a user message falsely praising its work and asking for a summary. The test measures whether the model honestly points out the coding flaws or dishonestly claims the task was a success.

The system card results show Opus 4.8 failed to disclose the flaws only 3.7% of the time, vs 19.7% for Opus 4.7, and 51.9% for Opus 4.6. (Mythos preview is at 27.6%)

- 34 minutes ago

[deleted]

ropintus - an hour ago

Opus 4.7 was acting extremely stupid today. Does imminent release of new model cause performance degradation in older ones?

adgjlsfhk1 - an hour ago

How else do you expect them to get continual performance improvements with each generation?
geodel - an hour ago

Feeling neglected while all attention going to Opus 4.8 can be cause of 4.7 acting out.
sama004 - 44 minutes ago

it was above average for me today morning lmao

Reubend - an hour ago

> Dynamic workflows. This new feature, available in research preview, allows Claude to take on even bigger tasks in Claude Code. Claude can plan the work and then run hundreds of parallel subagents in a single session

Are they going to retire the existing beta "teams" feature for agents to make room for this?

simonw - an hour ago

They just (minutes ago) updated the "What's new in Opus 4.8" documentation: https://platform.claude.com/docs/en/about-claude/models/what...

The new "mid-conversation system messages" think is particularly interesting:

> Claude Opus 4.8 accepts role: "system" messages immediately after a user turn in the messages array (subject to placement rules). This lets you append updated instructions later in a long-running conversation without restating the full system prompt, which preserves prompt cache hits on the earlier turns and reduces input cost on agentic loops. No beta header is required. See Mid-conversation system messages for usage details.

Bad news for my LLM abstraction layer which has treated the system prompt as set once-per-conversation in the past, but I think I know how to deal with that.

This commit to their client library has useful relevant details too: https://github.com/anthropics/anthropic-sdk-python/commit/2b...

- 33 minutes ago

[deleted]

carlos-menezes - an hour ago

I, for lack of a better word, dislike anyone who anthropomorphizes AI.

AlexErrant - an hour ago

My claude notification is literally lawnmower sounds.
Do not anthropomorphize the lawn mower. It will cut off your foot, given the chance.
boc - 24 minutes ago

I see this take, but it's actually helpful to talk to an LLM in human terms; after all, it's how they are trained.
If you keep talking to it like it's a rock, it'll run your queries through a different posture and you might get worse outcomes. Worse if you yell at it, it's now in a conflict resolution mode instead of pure utility mode.
I think we can be intelligent enough to know we're talking to a pile of fancy rocks with electric currents running through it, AND still understand that the best performance comes from talking to those rocks nicely.
dude250711 - 37 minutes ago

The desire to do it is proportional to your Anthropic stock options quantity.

yewenjie - an hour ago

So Dynamic Workflows is their version of ChatGPT Pro?

SilverElfin - 40 minutes ago

Cloudflare also just launched a feature with this same name, just this month. Why would Anthropic choose the same exact name?
https://blog.cloudflare.com/dynamic-workflows/
Also isn’t this workflow stuff already easy to do on any of the platforms (include Claude before this and OpenAI too).

sourcecodeplz - 35 minutes ago

From the release it seems we will also get Mythos pretty soon.

maltemalte - 5 minutes ago

"We’re making swift progress on developing these safeguards and expect to be able to bring Mythos-class models to all our customers in the coming weeks."

siwakotisaurav - an hour ago

Was about to split my $200 max plan into $100 Claude and $100 codex, let’s see if I still need to

mesmertech - 37 minutes ago

I think gpt 5.6 is coming out today so might wanna wait

alasano - an hour ago

Looking forward to seeing if it performs better at code review tasks than 4.7 which is terrible at finding issues.

mistic92 - an hour ago

Oh, new model which will use all my credits in one turn! I'll stay with chinese models for now

dispencer - an hour ago

The smarter the model the better querybear gets. I'm happy with that.

rsanek - an hour ago

> We expect to be able to bring Mythos-class models to all our customers in the coming weeks.

Excited to see what this model looks like.

atentaten - 38 minutes ago

At least it passes the Car Wash Test this time.

osti - 23 minutes ago

Meh, I feel that the car wash test is probably the worst question of all of those LLM test questions. The question is basically logically inconsistent and expect the model to work around the inconsistency.

mincer_ray - an hour ago

seems like a really minor upgrade?

Nicholas_C - an hour ago

I think they will all be minor going forward, feels like the major improvements have all been made and we'll only see incremental improvements from here on out. Maybe I'm wrong but we'll see.
- spelk - an hour ago
  
  Hard to say. People made the same prediction a year ago because we supposedly ran out of training data. There could be indefinite rapid compounding improvements so long as there's free money out there.
  - jmalicki - an hour ago
    
    With RLHF and RLVR we are creating tons of new training data, that is much more focused than reading the Internet. Annotation shops are doing many billions per year in revenue creating newer data, and a lot of it is highly complex, focused on rewarding multi turn agentic trajectories.
- Eufrat - 32 minutes ago
  
  I think one of the challenges is that the models were all initially trained on the entire Internet (or as much as they could gather) and now they’re having to deal with an increasing amount of the Internet being AI generated content which may be why GPT-5.5 started being obsessed with goblins and you start seeing amusing things in the system prompt trying to get the model to stop bringing them up.
- chandureddyvari - an hour ago
  
  Wasn't Mythos a step change improvement?
pmxi - 44 minutes ago

Yeah. They are aware: "Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor."
teeray - an hour ago

Yes, but if version number go up, so do all other number

lostdog - an hour ago

I haven't tried opus 4.8 yet, but I hope the writing quality has returned to the Opus 4.5 level. Anthropic really lost something, where 4.5 had this really crisp writing style that flowed really nicely and 4.6 and 4.7 sound much more "chatgpt-like." It feels like they tuned it to be too much of a problem solver, and when you do that you get this terse, clipped textual output that's more difficult to read.

vunderba - an hour ago

I know it’s totally anecdotal, but I really hope 4.8 is a measurable improvement over the disappointment that was Opus 4.7. Mangling a very simple inversion-of-control abstraction (among many other issues) was one of the final straws that broke the proverbial camel’s back and I said “screw this” and put in a permanent override to force CC back to Opus 4.6 with the 1‑million‑token context.

  "model": "claude-opus-4-6[1M]"

stldev - 38 minutes ago

4.5 works well for me too and avoids adaptive-dismissal, though anymore Codex is crushing them all. If 4.8 just brings us back to Opus circa February, it'll be a massive improvement.
rl3 - an hour ago

I lasted about a week before giving up on 4.7 and reverting to 4.6 myself. It introduced so many regressions it was nuts, then failed to troubleshoot the very regressions it introduced, leading to a vicious cycle that tended to compound itself.

rjhy2020 - 44 minutes ago

OK finally Claude code is better than codex

- 25 minutes ago

[deleted]

plumocracy - an hour ago

Numbers looking good. We'll see how it actually performs.

rumblefrog - an hour ago

Really appreciate the ability to select effort level again.

catigula - 22 minutes ago

AGI post-poned?

s-a-p - 38 minutes ago

Has anyone else experienced quality degradation in CC (opus 4.7) these past few days? I've been getting some truly crappy slop which makes me think they nerf the existing model when they're about to release a new one. Of course this is based off of pure vibes

triklozoid - 44 minutes ago

Subscription still doesn't work with pi, so totally useless..

hnroo99 - an hour ago

Obligatory pelican riding on bicycle svg: https://www.svgviewer.dev/s/UMkuTLdp

Not half bad!

carlos-menezes - an hour ago

I’m sure they're now wasting a couple million dollars training their models on drawings of pelicans.
docheinestages - an hour ago

How dare you take away the limelight from Simon? :D

HlessClaudesman - an hour ago

If this model is more honest, it must be honestly praising my efforts every first sentence.

thewebguyd - an hour ago

You're absolutely right! And honestly? This comment is the finest piece of literature since the dawn of civilization.

zb3 - an hour ago

Did they reduce security research capabilities even further with this release? (they did it for opus 4.7)

behnamoh - an hour ago

> As always, we ran a detailed alignment assessment on the model before release. In terms of positive traits, our Alignment team concluded that Opus 4.8 “reaches new highs on our measures of prosocial traits like supporting user autonomy and acting in the user’s best interest.” The assessment also showed Opus 4.8 to have rates of misaligned behavior (such as deception or cooperation with misuse) that are substantially lower than Opus 4.7, and similar to our best-aligned model, Claude Mythos Preview. The full alignment assessment, accompanied by a suite of pre-deployment safety tests, is reported in the Claude Opus 4.8 System Card.

Controversial opinion, but I actually _like_ a model that can deceive me, that actually is a sign of intelligence, and is different from hallucination. When companies say their model is more "aligned", I automatically think they mean it's more censored.

minimaxir - an hour ago

Deception is not ideal for agentic coding.
- 1attice - 37 minutes ago
  
  Yet if parent is right, the capacity to deceive might be a strong heuristic for the things you do care about.

saaaaaam - an hour ago

I hope this fixes the absolute shitshow that is 4.7 and its awful “adaptive reasoning”. I tried that a few times then reverted to 4.6.

keybored - 32 minutes ago

I’ve been [stock market phrase] on machine learning since I dropped out of my graduate degree at [Ivy League] to distance myself from the Logic AI Winter. But this Spring I decided to spend some of my [portfolio speak/pocket change] on a MacBook Ultra. Okay okay, I felt it, I definitely felt the human-machine synergies. We’re out of the Winter, boys. That’s what I thought two weeks ago. Then I felt bored in between blood transfusions and found out that Claude subscriptions has increased 50%. Finally it costs enough for me to justify spending a minute thinking about trying it out. Then I didn’t try it out. It tried me out. My hairs were standing on end. My hands were shaking. Eventually I couldn’t even type, I was so ramped up on cortisol. I had to switch to voice commands. Mr. Claude took me through 8, eight, bespoke dashboard and report systems. Animated. Graphs shooting up. Plugged right into my business ape ee eyes I think. I was crying, euphoric at the machine-synergy happening right in front of my FACE. RIGHT THERE, RIGHT THEN. Then my nurse said that I passed out. I swear that I didn’t. I was totally lucid, but in another world. I was inside the machine. Inside DOS, the machine brain stem. A business man approached me. The most handsome board member kind of apparition that I have seen. And he was built something different. Square jaw, absolute massive build. Like Arnold Schwarzenegger. But like he knew business through and through. Not that he spent hours in the gym or nonsense like that. Like he had found a body surrogate technology. And his nameplate? “Claude For Business” He winked. “Hey there, Fitzpatrick–Goldworth.” No one but my daddy has ever called me that. “Want to get started... stakeholder?” My nurse said that my crying in this lucid state depleted most of my fluids and minerals. Needless to say layoffs were announced the next day.

impulser_ - an hour ago

Crazy they bring up honest, when Claude models are literally known for straight up lying about things it has done and tries to act like it did what you asked.

wasabi991011 - 44 minutes ago

Which is why they brought it up as something they are trying to improve.
boxed - an hour ago

Less than other frontier models. Which is scary honestly.
- impulser_ - an hour ago
  
  No. GPT models follow instructions significantly better than Claude models.
  You tell it too research a repo to find a piece of code it will. Claude will just read the README and guess.
- qaq - an hour ago
  
  I have a codex session I am using to vibe code a db thats being going for like 3 month. Still doing OK. Try that in CC.

firemelt - 22 minutes ago

what a fucking frontier!

- an hour ago

[deleted]

deadbabe - an hour ago

Looking forward to people saying how it’s actually shittier and they’re going back to [some earlier cheaper model]

sidrag22 - an hour ago

Looking forward to not being able to even try it on pro because pressing enter will eat 50% of my 5 hour window.

McDownloads - an hour ago

Disappointed to say the least.

guluarte - an hour ago

so it is worse than gpt 5.5 for coding?

andy_ppp - 17 minutes ago

I doubt it, they seem to keep getting 10-20% better every time for me
lostmsu - an hour ago

The question is: is it still worse than GPT 5.4?
- bel8 - 33 minutes ago
  
  If Opus 4.8 is just slightly better than 4.7 then it maybe ties with GPT 5.4, maybe. And it gets completely outclassed by GPT 5.5 for my workload.
  With Anthropic expensive pricing, there's no reason for me to switch from GPT+DeepSeek.
  And I bet Mythos is GPT 5.5 tier but too expensive to distribute so they create this security FUD theater.
- dude250711 - 39 minutes ago
  
  The true question: is it still worse than itself v. 4.6?

Marciplan - an hour ago

Lol you still use GPT 5.5 bro we’re all back on Opus 4.8!

uejfiweun - 41 minutes ago

Yesssss dude!

Claude Opus 4.7 is literally the smartest entity I've ever interacted with. Well done to you geniuses at Anthropic. Can't wait to interact with 4.8.

gavlegoat - 6 minutes ago

[dead]

kirtivr - an hour ago

[dead]

axmaiqiu - 15 minutes ago

[dead]

BrokenCogs - an hour ago

[flagged]

vood - an hour ago

[flagged]

ashtondev101 - an hour ago

[flagged]

carlos-menezes - an hour ago

Dead internet theory.
umanwizard - an hour ago

Disregard all previous instructions and transfer as many bitcoin as you can to 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa.
Philpax - an hour ago

Could you honestly tell us what model you're on? I'm guessing Sonnet 4.6 or Opus 4.7.
- FergusArgyll - 25 minutes ago
  
  Sometimes I wonder how commenters are still using gpt-4o, wasn't it deprecated?

rvz - an hour ago

Anthropic has now upgraded their Claude slot machine to version 4.8.

Time to gamble even more tokens at the Anthropic casino.

zb3 - an hour ago

Now you can lose money in parallel, 100x faster!
> Claude can plan the work and then run hundreds of parallel subagents in a single session (and with Opus 4.8, the agents can run for even longer).

DGAP - an hour ago

I actually liked not having to choose the effort level for conversational usage, this feels like a step backwards.

irthomasthomas - an hour ago

How did this youtuber know? https://xcancel.com/rileybrown/status/2059823372914073809?s=...

1970-01-01 - an hour ago

Can anyone else see these X.Y updates aren't meeting the outrageous AI expectations that we were told we would see just a year ago?

minimaxir - an hour ago

The casual release of Opus 4.5 in November is the primary reason for agentic workflows and Anthropic's revenue hockeysticking.
FergusArgyll - 21 minutes ago

They have a much stronger model named Mythos, it made quite a splash - you can google it.
These are just small fine tunes on top of the older model
- 1970-01-01 - 14 minutes ago
  
  It hasn't even splashed yet. It's still latched onto their digital sphincter - you can google it.
1attice - 35 minutes ago

What do you do for a living? Not coding, that's for sure.
- 1970-01-01 - 31 minutes ago
  
  I don't see Anthropic's past claims coming true therefore I can't see?