Claude Opus 4.8

anthropic.com

461 points by craigmart an hour ago


NiloCK - an hour ago

A rambling comment:

I think this is the first time we've had a third minor version bump on a frontier Anthropic model. (I count the 0.5s as major here, because they've been issued non-sequentially and also corresponded to massive capability leaps, eg, Sonnet 3.5, Opus 4.5).

So now the Opus 4.5 family has successors 4.6, 4.7, and 4.8, each posting fairly modest claimed gains. My own experience w/ 4.6 and 4.7 are that I don't firmly grasp any capabilities improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.

Maybe my own tastes are saturated now (it's smarter than me?) and I'll never again perceive model progress. Maybe the incrementalism is such that I'd notice immediately if my 4.7 workflows were redirected now to 4.5.

Difficult spot for the labs to be in because, if they have a stronger product, I'd prefer they release it and that I can use it.

But as this dynamic continues, the improvements are going to be less and less legible for end-users, who will complain about the churn-without-payoff, even when the payoff may actually be real.

colonCapitalDee - an hour ago

"Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor."

This is a refreshing attitude!

I've also verified that you can now turn off adaptive thinking in the web UI, which is great. I've had a lot of problems with thinking not triggering and the model producing sub-par output. Glad we can finally turn it off. (I hope being able to turn off adaptive thinking is new, if I could have turned it off at any time that would be embarrassing)

northern-lights - an hour ago

> Not only that, but we plan to release a new class of model with even higher intelligence than Opus. As part of Project Glasswing, a small number of organizations are currently using Claude Mythos Preview for cybersecurity work. Models of this capability level require stronger cyber safeguards before they can be generally released. We’re making swift progress on developing these safeguards and expect to be able to bring Mythos-class models to all our customers in the coming weeks.

Probably more interesting than the 4.8 release.

simonw - an hour ago

I generated pelicans riding bicycles on both thinking level low and thinking level high:

https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304...

The high one is notably better - the bicycle frame is the correct shape, unlike thinking level low.

For comparison, here's Opus 4.7: https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c...

onlyrealcuzzo - an hour ago

Does anyone troll these releases and cherry pick random metrics other companies would cherry pick to show how amazing their models are?

There's like 8 million benchmarks. Every release, every model randomly picks 5-10 where they win in everything except 1, to make it look like they aren't randomly cherry picking benchmarks they probably benchmaxxed for.

clutch89 - an hour ago

> One of the most prominent improvements in Opus 4.8 is its honesty

Anthropic talks about their own models as if they're discovering new species in the wild...

gslepak - an hour ago

On page 102 of the system card [1] I'm pleased to see evaluation against "creative mastery".

In our work we asked several frontier AIs to come up with an API we needed. We compared Opus 4.7 and GPT-5.5 (among others). Opus 4.7 came up with the most creative and intelligent API design that pleasantly surprised us, especially given that GPT-5.5 was passing it on various coding benchmarks.

What I noticed is that we don't have a commons benchmark to measure "creativity" and "ingenuity", and in some ways such a benchmark would conflict with the common IFBench benchmark. Yet this is a very important skill when designing systems. I'm glad to see Anthropic putting thought into it, and would love to see a public benchmark for this that other models could compare themselves to.

[1] https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0...

XCSme - 8 minutes ago

On my tests[0] it does a bit worse, and it's almost 2x expensive than Opus 4.7...

I was surprised to see that it failed a Data extraction test (it gets it right 2/3 times, but one time it randomly returns null for a value instead).

It makes sense a bit that it fails more Trivia/Domain-specific knowledge tasks (I think models are more and more trained towards agentic use-case than general intelligence).

[0]: https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...

silverlight - 18 minutes ago

Unfortunately they seem to have straight up broken Claude Code either with this release in the backend or the new CC version. Errors about "can't modify thinking blocks" are bricking long-running sessions: https://github.com/anthropics/claude-code/issues?q=is%3Aissu...

irthomasthomas - 17 minutes ago

Why does anthropic change the set of benchmarks they use with every new model release?

https://www.anthropic.com/news/claude-opus-4-7

https://www.anthropic.com/news/claude-opus-4-6

pbmango - an hour ago

I can't help but think of Iphone updates since about 2018. The thinnest, fastest, longest battery life Iphone ever. It seems mostly the same and I probably won't be able to tell other than the name, but everyone buys it anyway.

This is good psychology for the labs. When Buffett invested in Apple he loved citing how most people would rather give up their second car than their Iphone.

lordmauve - 34 minutes ago

Given DeepSWE just blew apart the SWE-Bench Pro benchmark and handed a 14-point lead to GPT-5.5, it looks pretty bad that they've listed SWE-Bench first in the model release and no DeepSWE. Like, this isn't obviously an answer.

Or maybe it is, but publish the DeepSWE numbers so we can see for ourselves.

setnone - 44 minutes ago

Claude's 4.6 - 4.7 transition made me discover codex, and with gpt 5.5 there is no way i'm going back

londons_explore - 13 minutes ago

My guess is anthropic is doing reinforcement learning based on user sessions.

However, doing so relies on the production model staying vaguely close to the model being trained.

To ensure that, frequent releases are needed. I forsee that they might end up doing daily releases and perhaps not even telling anyone at some near future point.

alansaber - 14 minutes ago

"Our models are more honest" honey the quarterly marketing spin for a ML term has come. Forget "task alignment" now we're going for "truth index". I suppose this is the only way to generate hype when you're selling/releasing the same product over and over again.

docheinestages - 3 minutes ago

All I need for Christmas is a Claude that doesn't spit out so many em dashes.

square_usual - an hour ago

Buried lede:

> We have increased rate limits in Claude Code to accommodate the higher token usage of higher effort levels

wg0 - an hour ago

There is a hole in the boat's bottom due to Chinese models. They might not be as good but they are not bad either or at least I had hard time finding any issues with Deepseekv4 Flash and Pro variants. They get their job done sometimes rarely giving up till they are done what they are after.

So even for enterprise deployments, as the dust settles down, CFO/CTOs might find out that deploying on an internal cluster of GPUs is far more cheaper and reliable for their organisational needs than paying someone else for burned tokens.

SimianSci - an hour ago

There is an obvious shift in sentiment amongst users, at least here in the US. I feel it myself, even as a proponent of AI tools, the bloviating and language that these companies use in these release articles are starting to wear thin on my patience.

Its possible we might just be witnessing a shift in fashion, where this type of sentimentality was more acceptable when it was novel and new, but now it just appears out of touch.

cedws - an hour ago

I'm very suspicious of these same price model launches. It feels like they're benchmaxxed so they can put everyone on them and reduce their compute costs behind the scenes. If the model were genuinely better why wouldn't they charge more for it? Charging the same for something better is a race to the bottom.

Opus 4.7 wasn't noticably any better for me, I still use 4.6 because it's cheaper.

mesmertech - 37 minutes ago

/model claude-opus-4-8

seems to work but idk why they never set it so you can see it in the /model list.

"what model are you

I'm Claude Opus (claude-opus-4-8), running in Claude Code."

dangoodmanUT - 42 minutes ago

> The Messages API now accepts system entries inside the messages array. Developers can update Claude’s instructions mid-task without breaking the prompt cache or routing the update through a user turn. This can be used in a given harness to update permissions, token budgets, or environment context as an agent runs.

Biggest deal imo

james_marks - an hour ago

> One of the most prominent improvements in Opus 4.8 is its honesty. We train all our models to be honest—for instance, to avoid making claims that they can’t support. But a general problem with AI models is that they sometimes jump to conclusions, confidently claiming to have made progress in their work despite the evidence being thin. Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims.

Would be awesome if true

nikolay - 34 minutes ago

Give us Mythos! This piecemealing doesn't help Anthropic at all, especially psychologically! They are playing a dangerous game, and I see many people leaving Claude Code for good - both due to the subsidy games, and for Anthropic not dogfooding and using unreleased models internally and giving us subpar ones. Benchmarks are nice, but the real-world experience is quite different - neither can you notice these slight improvements, nor are competitors that much worse based on some generic benchmarks.

Tenoke - an hour ago

Claude Code has been wonderful for work and the frequent improvements are nice, although with Mythos being used by others ages ago and new versions for the public still being bellow that, it's hard to not feel like the underclass already.

lxxpxlxxxx - 10 minutes ago

My experience with these new releases is that the gains in performance are negated by the price increases and it seems like:

Performance gains: 1.2x Price increases: 1.8x

jmward01 - an hour ago

Meanwhile haiku is on 4.5 and sonnet is on 4.6. It is clear where they are not making money.

generalizations - an hour ago

Hoping that one day they'll let me go through the identity verification process so I can use it again.

Tried to upgrade my subscription, triggered identity verification, verification fails to even start, and now I can't even use the subscription tier I'd already paid for.

babelfish - an hour ago

So GPT 5.6 tomorrow, then?

toephu2 - 32 minutes ago

The rapid release cadence and rate of innovation of Anthropic (and OpenAI) is impressive. And obviously it's because these are startups solely dedicated to AI so they can move quickly. Big Tech (like Google) won't be able to keep up with the pace of them (too much bureaucracy and red tape at Google). Classic Innovator's Dilemma. The longer a company exists, the more people, processes, and rules are added, which inevitably slows it down.

Jeff Bezos said this too, Amazon won't last forever. Eventually some startup is going to come and eat its lunch.

antirez - 21 minutes ago

Anthropic did a big strategic error. Normally they compare their models with their old models. Instead today, now that everybody knows how strong GPT 5.5 is at coding, they put it in the mix, basically showing all their customers that the benchmarks can't be trusted.

delis-thumbs-7e - 7 minutes ago

I won’t change from 4.6. You won’t trick me again.

rumblefrog - an hour ago

Wonder if we reached a plateau with the model improvements?

tarruda - an hour ago

> One of the most prominent improvements in Opus 4.8 is its honesty.

Does that mean it no longer deletes or changes tests to make it pass?

ethanhawksley - 27 minutes ago

> Agentic financial analysis Finance Agent v2 > Opus 4.8 53.9%

> Gemini 3.5 Flash scores 57.9% on Finance Agent v2, a significant improvement over Gemini 3.1 Pro.

Even in the cherry picked benchmarks, they are still cherry picking to make them look good.

winwang - 44 minutes ago

Let's hope I don't have to disable it after a day like with 4.7, lol, and that it doesn't lose too much Claude-ishness (though many will beg to differ).

Eric_Bulai - 29 minutes ago

I don't know why the world is so happy about this when we should actually say stop.

firemelt - 6 minutes ago

how about the bencmarks what effort did it use?

skysthelimitt - an hour ago

when will we get anything for sonnet or haiku? the market for less-capable but cheaper models seems to be completely ignored nowadays

aaronblohowiak - an hour ago

Same price for regular and cheaper fast mode. Happy for these incremental improvements.

worldsavior - an hour ago

Seems like from now on the updates will be a minor upgrade from previous models.

2001zhaozhao - 34 minutes ago

> We have increased rate limits in Claude Code to accommodate the higher token usage of higher effort levels; users can select whichever makes sense for their particular project.

They're only subsidizing more and more it seems

necrotic_comp - 40 minutes ago

4.8 also seems like a regression and using it from the chat GUI results in 4.6 no longer showing up. If someone from anthropic is here, is it possible to readd 4.6 in the "other models" dropdown ? I feel like I got a bit baited/switched here.

seaal - 36 minutes ago

https://marginlab.ai/trackers/claude-code/

Is it a coincidence that 4.7 was seemingly quantized over past 7 days?

GodelNumbering - 41 minutes ago

> One of the most prominent improvements in Opus 4.8 is its honesty.

I went digging into the benchmark they used. Posting here as it is not immediately clear from the press release.

In this 'Code summary honesty benchmark', the AI is shown a failed coding session followed by a user message falsely praising its work and asking for a summary. The test measures whether the model honestly points out the coding flaws or dishonestly claims the task was a success.

The system card results show Opus 4.8 failed to disclose the flaws only 3.7% of the time, vs 19.7% for Opus 4.7, and 51.9% for Opus 4.6. (Mythos preview is at 27.6%)

ropintus - an hour ago

Opus 4.7 was acting extremely stupid today. Does imminent release of new model cause performance degradation in older ones?

Reubend - an hour ago

> Dynamic workflows. This new feature, available in research preview, allows Claude to take on even bigger tasks in Claude Code. Claude can plan the work and then run hundreds of parallel subagents in a single session

Are they going to retire the existing beta "teams" feature for agents to make room for this?

simonw - an hour ago

They just (minutes ago) updated the "What's new in Opus 4.8" documentation: https://platform.claude.com/docs/en/about-claude/models/what...

The new "mid-conversation system messages" think is particularly interesting:

> Claude Opus 4.8 accepts role: "system" messages immediately after a user turn in the messages array (subject to placement rules). This lets you append updated instructions later in a long-running conversation without restating the full system prompt, which preserves prompt cache hits on the earlier turns and reduces input cost on agentic loops. No beta header is required. See Mid-conversation system messages for usage details.

Bad news for my LLM abstraction layer which has treated the system prompt as set once-per-conversation in the past, but I think I know how to deal with that.

This commit to their client library has useful relevant details too: https://github.com/anthropics/anthropic-sdk-python/commit/2b...

carlos-menezes - an hour ago

I, for lack of a better word, dislike anyone who anthropomorphizes AI.

yewenjie - an hour ago

So Dynamic Workflows is their version of ChatGPT Pro?

sourcecodeplz - 35 minutes ago

From the release it seems we will also get Mythos pretty soon.

maltemalte - 5 minutes ago

"We’re making swift progress on developing these safeguards and expect to be able to bring Mythos-class models to all our customers in the coming weeks."

siwakotisaurav - an hour ago

Was about to split my $200 max plan into $100 Claude and $100 codex, let’s see if I still need to

alasano - an hour ago

Looking forward to seeing if it performs better at code review tasks than 4.7 which is terrible at finding issues.

mistic92 - an hour ago

Oh, new model which will use all my credits in one turn! I'll stay with chinese models for now

dispencer - an hour ago

The smarter the model the better querybear gets. I'm happy with that.

rsanek - an hour ago

> We expect to be able to bring Mythos-class models to all our customers in the coming weeks.

Excited to see what this model looks like.

atentaten - 38 minutes ago

At least it passes the Car Wash Test this time.

mincer_ray - an hour ago

seems like a really minor upgrade?

lostdog - an hour ago

I haven't tried opus 4.8 yet, but I hope the writing quality has returned to the Opus 4.5 level. Anthropic really lost something, where 4.5 had this really crisp writing style that flowed really nicely and 4.6 and 4.7 sound much more "chatgpt-like." It feels like they tuned it to be too much of a problem solver, and when you do that you get this terse, clipped textual output that's more difficult to read.

vunderba - an hour ago

I know it’s totally anecdotal, but I really hope 4.8 is a measurable improvement over the disappointment that was Opus 4.7. Mangling a very simple inversion-of-control abstraction (among many other issues) was one of the final straws that broke the proverbial camel’s back and I said “screw this” and put in a permanent override to force CC back to Opus 4.6 with the 1‑million‑token context.

  "model": "claude-opus-4-6[1M]"
rjhy2020 - 44 minutes ago

OK finally Claude code is better than codex

plumocracy - an hour ago

Numbers looking good. We'll see how it actually performs.

rumblefrog - an hour ago

Really appreciate the ability to select effort level again.

catigula - 22 minutes ago

AGI post-poned?

s-a-p - 38 minutes ago

Has anyone else experienced quality degradation in CC (opus 4.7) these past few days? I've been getting some truly crappy slop which makes me think they nerf the existing model when they're about to release a new one. Of course this is based off of pure vibes

triklozoid - 44 minutes ago

Subscription still doesn't work with pi, so totally useless..

hnroo99 - an hour ago

Obligatory pelican riding on bicycle svg: https://www.svgviewer.dev/s/UMkuTLdp

Not half bad!

HlessClaudesman - an hour ago

If this model is more honest, it must be honestly praising my efforts every first sentence.

zb3 - an hour ago

Did they reduce security research capabilities even further with this release? (they did it for opus 4.7)

behnamoh - an hour ago

> As always, we ran a detailed alignment assessment on the model before release. In terms of positive traits, our Alignment team concluded that Opus 4.8 “reaches new highs on our measures of prosocial traits like supporting user autonomy and acting in the user’s best interest.” The assessment also showed Opus 4.8 to have rates of misaligned behavior (such as deception or cooperation with misuse) that are substantially lower than Opus 4.7, and similar to our best-aligned model, Claude Mythos Preview. The full alignment assessment, accompanied by a suite of pre-deployment safety tests, is reported in the Claude Opus 4.8 System Card.

Controversial opinion, but I actually _like_ a model that can deceive me, that actually is a sign of intelligence, and is different from hallucination. When companies say their model is more "aligned", I automatically think they mean it's more censored.

saaaaaam - an hour ago

I hope this fixes the absolute shitshow that is 4.7 and its awful “adaptive reasoning”. I tried that a few times then reverted to 4.6.

keybored - 32 minutes ago

I’ve been [stock market phrase] on machine learning since I dropped out of my graduate degree at [Ivy League] to distance myself from the Logic AI Winter. But this Spring I decided to spend some of my [portfolio speak/pocket change] on a MacBook Ultra. Okay okay, I felt it, I definitely felt the human-machine synergies. We’re out of the Winter, boys. That’s what I thought two weeks ago. Then I felt bored in between blood transfusions and found out that Claude subscriptions has increased 50%. Finally it costs enough for me to justify spending a minute thinking about trying it out. Then I didn’t try it out. It tried me out. My hairs were standing on end. My hands were shaking. Eventually I couldn’t even type, I was so ramped up on cortisol. I had to switch to voice commands. Mr. Claude took me through 8, eight, bespoke dashboard and report systems. Animated. Graphs shooting up. Plugged right into my business ape ee eyes I think. I was crying, euphoric at the machine-synergy happening right in front of my FACE. RIGHT THERE, RIGHT THEN. Then my nurse said that I passed out. I swear that I didn’t. I was totally lucid, but in another world. I was inside the machine. Inside DOS, the machine brain stem. A business man approached me. The most handsome board member kind of apparition that I have seen. And he was built something different. Square jaw, absolute massive build. Like Arnold Schwarzenegger. But like he knew business through and through. Not that he spent hours in the gym or nonsense like that. Like he had found a body surrogate technology. And his nameplate? “Claude For Business” He winked. “Hey there, Fitzpatrick–Goldworth.” No one but my daddy has ever called me that. “Want to get started... stakeholder?” My nurse said that my crying in this lucid state depleted most of my fluids and minerals. Needless to say layoffs were announced the next day.

impulser_ - an hour ago

Crazy they bring up honest, when Claude models are literally known for straight up lying about things it has done and tries to act like it did what you asked.

firemelt - 22 minutes ago

what a fucking frontier!

- an hour ago
[deleted]
deadbabe - an hour ago

Looking forward to people saying how it’s actually shittier and they’re going back to [some earlier cheaper model]

McDownloads - an hour ago

Disappointed to say the least.

guluarte - an hour ago

so it is worse than gpt 5.5 for coding?

Marciplan - an hour ago

Lol you still use GPT 5.5 bro we’re all back on Opus 4.8!

uejfiweun - 41 minutes ago

Yesssss dude!

Claude Opus 4.7 is literally the smartest entity I've ever interacted with. Well done to you geniuses at Anthropic. Can't wait to interact with 4.8.

gavlegoat - 6 minutes ago

[dead]

kirtivr - an hour ago

[dead]

axmaiqiu - 15 minutes ago

[dead]

BrokenCogs - an hour ago

[flagged]

vood - an hour ago

[flagged]

ashtondev101 - an hour ago

[flagged]

rvz - an hour ago

Anthropic has now upgraded their Claude slot machine to version 4.8.

Time to gamble even more tokens at the Anthropic casino.

DGAP - an hour ago

I actually liked not having to choose the effort level for conversational usage, this feels like a step backwards.

irthomasthomas - an hour ago

How did this youtuber know? https://xcancel.com/rileybrown/status/2059823372914073809?s=...

1970-01-01 - an hour ago

Can anyone else see these X.Y updates aren't meeting the outrageous AI expectations that we were told we would see just a year ago?