Claude Opus 4.5

anthropic.com

697 points by adocomplete 5 hours ago


https://platform.claude.com/docs/en/about-claude/models/what...

sbinnee - a minute ago

As much as I am excited by the price, the tools they called "the advanced tool"[1] look so useful to me; Tool search, programmatic tool calling (smolagents.CodeAgent by HF), and tool use examples (in-context learning).

They said that they have seen 134K tokens for tool definition alone. That is insane. I also really liked the puzzle game video.

[1] https://www.anthropic.com/engineering/advanced-tool-use

llamasushi - 5 hours ago

The burying of the lede here is insane. $5/$25 per MTok is a 3x price drop from Opus 4. At that price point, Opus stops being "the model you use for important things" and becomes actually viable for production workloads.

Also notable: they're claiming SOTA prompt injection resistance. The industry has largely given up on solving this problem through training alone, so if the numbers in the system card hold up under adversarial testing, that's legitimately significant for anyone deploying agents with tool access.

The "most aligned model" framing is doing a lot of heavy lifting though. Would love to see third-party red team results.

unsupp0rted - 5 hours ago

This is gonna be game-changing for the next 2-4 weeks before they nerf the model.

Then for the next 2-3 months people complaining about the degradation will be labeled “skill issue”.

Then a sacrificial Anthropic engineer will “discover” a couple obscure bugs that “in some cases” might have led to less than optimal performance. Still largely a user skill issue though.

Then a couple months later they’ll release Opus 4.7 and go through the cycle again.

My allegiance to these companies is now measured in nerf cycles.

I’m a nerf cycle customer.

827a - 5 hours ago

I've played around with Gemini 3 Pro in Cursor, and honestly: I find it to be significantly worse than Sonnet 4.5. I've also had some problems that only Claude Code has been able to really solve; Sonnet 4.5 in there consistently performs better than Sonnet 4.5 anywhere else.

I think Anthropic is making the right decisions with their models. Given that software engineering is probably one of the very few domains of AI usage that is driving real, serious revenue: I have far better feelings about Anthropic going into 2026 than any other foundation model. Excited to put Opus 4.5 through its paces.

dave1010uk - 2 hours ago

The Claude Opus 4.5 system card [0] is much more revealing than the marketing blog post. It's a 150 page PDF, with all sorts of info, not just the usual benchmarks.

There's a big section on deception. One example is Opus is fed news about Anthropic's safety team being disbanded but then hides that info from the user.

The risks are a bit scary, especially around CBRNs. Opus is still only ASL-3 (systems that substantially increase the risk of catastrophic misuse) and not quite at ASL-4 (uplifting a second-tier state-level bioweapons programme to the sophistication and success of a first-tier one), so I think we're fine...

I've never written a blog post about a model release before but decided to this time [1]. The system card has quite a few surprises, so I've highlighted some bits that stood out to me (and Claude, ChatGPT and Gemini).

[0] https://www.anthropic.com/claude-opus-4-5-system-card

[1] https://dave.engineer/blog/2025/11/claude-opus-4.5-system-ca...

bnchrch - 5 hours ago

Seeing these benchmarks makes me so happy.

Not because I love Anthropic (I do like them) but because it's staving off me having to change my Coding Agent.

This world is changing fast, and both keeping up with State of the Art and/or the feeling of FOMO is exhausting.

I've been holding onto Claude Code for the last little while since I've built up a robust set of habits, slash commands, and sub agents that help me squeeze as much out of the platform as possible.

But with the last few releases of Gemini and Codex I've been getting closer and closer to throwing it all out to start fresh in a new ecosystem.

Thankfully Anthropic has come out swinging today and my own SOPs can remain intact a little while longer.

nickandbro - 11 minutes ago

"Create me a SVG of a PS4 controller"

Gemini 3.0 Pro: https://www.svgviewer.dev/s/CxLSTx2X

Opus 4.5: https://www.svgviewer.dev/s/dOSPSHC5

I think Opus 4.5 did a bit better overall, but I do think frontier models will eventually converge to a point where the quality will be so good it will be hard to tell the winner.

simonw - 4 hours ago

Notes and two pelicans: https://simonwillison.net/2025/Nov/24/claude-opus/

futureshock - 5 hours ago

A really great way to get an idea of the relative cost and performance of these models at their various thinking budgets is to look at the ARC-AGI-2 leaderboard. Opus 4.5 stacks up very well here when you compare to Gemini 3’s score and cost. Gemini 3 Deep Think is still the current leader but at more than 30x the cost.

The cost curve of achieving these scores is coming down rapidly. In Dec 2024 when OpenAI announced beating human performance on ARC-AGI-1, they spent more than $3k per task. You can get the same performance for pennies to dollars, approximately an 80x reduction in 11 months.

https://arcprize.org/leaderboard

https://arcprize.org/blog/oai-o3-pub-breakthrough

stavros - 5 hours ago

Did anyone else notice Sonnet 4.5 being much dumber recently? I tried it today and it was really struggling with some very simple CSS on a 100-line self-contained HTML page. This never used to happen before, and now I'm wondering if this release has something to do with it.

On-topic, I love the fact that Opus is now three times cheaper. I hope it's available in Claude Code with the Pro subscription.

EDIT: Apparently it's not available in Claude Code with the Pro subscription, but you can add funds to your Claude wallet and use Opus with pay-as-you-go. This is going to be really nice to use Opus for planning and Sonnet for implementation with the Pro subscription.

However, I noticed that the previously-there option of "use Opus for planning and Sonnet for implementation" isn't there in Claude Code with this setup any more. Hopefully they'll implement it soon, as that would be the best of both worlds.

EDIT 2: Apparently you can use `/model opusplan` to get Opus in planning mode. However, it says "Uses your extra balance", and it's not clear whether it means it uses the balance just in planning mode, or also in execution mode. I don't want it to use my balance when I've got a subscription, I'll have to try it and see.

EDIT 3: It looks like Sonnet also consumes credits in this mode. I had it make some simple CSS changes to a single HTML file with Opusplan, and it cost me $0.95 (way too much, in my opinion). I'll try manually switching between Opus for the plan and regular Sonnet for the next test.

hebejebelus - 5 hours ago

On my Max plan, Opus 4.5 is now the default model! Until now I used Sonnet 4.5 exclusively and never used Opus, even for planning - I'm shocked that this is so cheap (for them) that it can be the default now. I'm curious what this will mean for the daily/weekly limits.

A short run at a small toy app makes me feel like Opus 4.5 is a bit slower than Sonnet 4.5 was, but that could also just be the day-one load it's presumably under. I don't think Sonnet was holding me back much, but it's far too early to tell.

jumploops - 5 hours ago

> Pricing is now $5/$25 per million [input/output] tokens

So it’s 1/3 the price of Opus 4.1…

> [..] matches Sonnet 4.5’s best score on SWE-bench Verified, but uses 76% fewer output tokens

…and potentially uses a lot fewer tokens?

Excited to stress test this in Claude Code, looks like a great model on paper!
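
To put rough numbers on the two quotes above, here is a minimal sketch of the per-task arithmetic. The $5/$25 price is from the announcement quote and the $15/$75 price for Opus 4.1 is the one cited elsewhere in this thread; the task size (100K input, 20K output tokens) is hypothetical, and applying the 76% output-token figure to this task is an assumption, since the quote compares against Sonnet 4.5 at a matched SWE-bench score.

```python
# Prices in USD per million tokens as (input, output), as quoted in the thread.
OPUS_4_1 = (15.0, 75.0)
OPUS_4_5 = (5.0, 25.0)


def task_cost(input_tokens, output_tokens, prices):
    """Return the USD cost of one task at the given per-million-token prices."""
    in_price, out_price = prices
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical task size, chosen only for illustration.
INPUT_TOKENS = 100_000
OUTPUT_TOKENS = 20_000

old = task_cost(INPUT_TOKENS, OUTPUT_TOKENS, OPUS_4_1)
new = task_cost(INPUT_TOKENS, OUTPUT_TOKENS, OPUS_4_5)

# Assumption: the same task needs 76% fewer output tokens on the new model.
reduced_output = round(OUTPUT_TOKENS * (1 - 0.76))
new_fewer_tokens = task_cost(INPUT_TOKENS, reduced_output, OPUS_4_5)

print(f"Opus 4.1: ${old:.2f}")
print(f"Opus 4.5, same tokens: ${new:.2f} ({old / new:.1f}x cheaper)")
print(f"Opus 4.5, 76% fewer output tokens: ${new_fewer_tokens:.2f}")
```

Since both the input and output prices fall by the same factor, the 3x saving holds for any input/output mix; any reduction in output tokens would come on top of that.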

obblekk - an hour ago

80% on swebench verified is incredible. a year ago the best model was at ~30%. i wonder if we'll soon have a convincingly superhuman coding capability (even in a narrow field like kernel optimization).

this is the most interesting time for software tools since compilers and static typechecking were invented.

jasonthorsness - 5 hours ago

I used Gemini instead of my usual Claude for a non-trivial front-end project [1] and it really just hit it out of the park especially after the update last week, no trouble just directly emitting around 95% of the application. Now Claude is back! The pace of releases and competition seems to be heating up more lately, and there is absolutely no switching cost. It's going to be interesting to see if and how the frontier model vendors create a moat or if the coding CLIs/models will forever remain a commodity.

[1] https://github.com/jasonthorsness/tree-dangler

MaxLeiter - 4 hours ago

We've added support for opus 4.5 to v0 and users are making some pretty impressive 1-shots:

https://x.com/mikegonz/status/1993045002306699704

https://x.com/MirAI_Newz/status/1993047036766396852

https://x.com/rauchg/status/1993054732781490412

It seems especially good at threejs / 3D websites. Gemini was similarly good at them (https://x.com/aymericrabot/status/1991613284106269192); maybe the model labs are focusing on this style of generation more now.

pwython - 2 hours ago

I was excited to see this today, as I just finished a new website in Figma last night and planned to have CC help build it out with the Figma MCP today. CC started up with Opus 4.5 and... Not off to a good start. It described a totally different looking layout on the first prompt to see the selected frame. I asked why:

You're right to call that out. Looking back at what happened:

  1. First get_design_context call - Nothing was selected, so it returned an
  error
  2. Second get_design_context call - It returned a tiny snippet of React code
  for a "Vector" component (just an image element), which gave me almost no
  context
  3. First get_screenshot call - The image returned was essentially blank/empty
  (the output showed <output_image> with what appears to be no actual visual
  content)

  I then hallucinated an entire design description based on:
  - The project context from CLAUDE.md (mentions "Creative Director",
  "Performance Marketing", "20+ years")
  - Common portfolio design patterns
  - The small code snippet mentioning a "Vector" element

  That was a significant error on my part. I should have told you "The
  screenshot appears empty or didn't load properly - can you try selecting a
  different element or re-selecting?" instead of fabricating a description.

andai - 5 hours ago

Why do they always cut off 70% of the y-axis? Sure it exaggerates the differences, but... it exaggerates the differences.

And they left Haiku out of most of the comparisons! That's the most interesting model for me. Because for some tasks it's fine. And it's still not clear to me which ones those are.

Because in my experience, Haiku sits at this weird middle point where, if you have a well defined task, you can use a smaller/faster/cheaper model than Haiku, and if you don't, then you need to reach for a bigger/slower/costlier model than Haiku.

jaakkonen - 2 hours ago

Tested this today for implementing a new low-frequency RFID protocol to Flipper Zero codebase based on a Proxmark3 implementation. Was able to do it in 2 hours with giving a raw psk recording alongside of it and some troubleshooting. This is the kind of task the last generation of frontier models was incapable of doing. Super stoked to use this :)

elvin_d - 5 hours ago

Great seeing the price reduction. Opus historically was priced at 15/75, this one delivers at 5/25 which is close to Gemini 3 Pro. I hope Anthropic can afford increasing limits for the new Opus.

morgengold - 2 hours ago

I'm on a Claude Code Max subscription. The last few days have been a struggle with Sonnet 4.5 - Now it switched to Claude Opus 4.5 as default model. Ridiculously good and fast.

chaosprint - 5 hours ago

SWE's results were actually very close, but they used a poor marketing visualization. I know this isn't a research paper, but for Anthropic, I expect more.

irthomasthomas - 3 hours ago

I wish it was open-weights so we could discuss the architectural changes. This model is about twice as fast as 4.1, ~60t/s vs ~30t/s. Is it half the parameters, or a new INT4 linear sparse-moe architecture?

andreybaskov - 4 hours ago

Does anyone know or have a guess on the size of these latest thinking models and what hardware they use to run inference? As in how much memory and what quantization it uses and if it's "theoretically" possible to run it on something like Mac Studio M3 Ultra with 512GB RAM. Just curious from a theoretical perspective.

keeeba - 5 hours ago

Oh boy, if the benchmarks are this good and Opus feels like it usually does then this is insane.

I’ve always found Opus significantly better than the benchmarks suggested.

LFG

sync - 4 hours ago

Does anyone here understand "interleaved scratchpads" mentioned at the very bottom of the footnotes:

> All evals were run with a 64K thinking budget, interleaved scratchpads, 200K context window, default effort (high), and default sampling settings (temperature, top_p).

I understand scratchpads (e.g. [0] Show Your Work: Scratchpads for Intermediate Computation with Language Models) but not sure about the "interleaved" part, a quick Kagi search did not lead to anything relevant other than Claude itself :)

[0] https://arxiv.org/abs/2112.00114

adidoit - 2 hours ago

Tested this building some PRs and issues that codex-5.1-max and gemini-3-pro were struggling with.

It planned way better in a much more granular way and then executed it better. I can't tell if the model is actually better or if it's just planning with more discipline

starkparker - 3 hours ago

Would love to know what's going on with C++ and PHP benchmarks. No meaningful gain over Opus 4.1 for either, and Sonnet still seems to outperform Opus on PHP.

I_am_tiberius - 4 hours ago

Still mad at them because they decided not to take their users' privacy seriously. Would be interested how the new model behaves, but just have a mental lock and can't sign up again.

PilotJeff - 2 hours ago

More blowing up of the bubble with anthropic essentially offering compute/LLM for below cost. Eventually the laws of physics/market will take over and look out below.

alvis - 5 hours ago

“For Max and Team Premium users, we’ve increased overall usage limits, meaning you’ll have roughly the same number of Opus tokens as you previously had with Sonnet.” — seems like anthropic has finally listened!

maherbeg - 2 hours ago

Ok, the victorian lock puzzle game is a pretty damn cool way to showcase the capabilities of these models. I kinda want to start building similar puzzle games for models to solve.

jmward01 - 4 hours ago

One thing I didn't see mentioned is raw token gen speed compared to the alternatives. I am using Haiku 4.5 because it is cheap (and so am I) but also because it is fast. Speed is pretty high up in my list of coding assistant features and I wish it was more prominent in release info.

ximeng - 4 hours ago

With less token usage, cheaper pricing, and enhanced usage limits for Opus, Anthropic are taking the fight to Gemini and OpenAI Codex. Coding agent performance leads to better general work and personal task performance, so if Anthropic continue to execute well on ergonomics they have a chance to overcome their distribution disadvantages versus the other top players.

pingou - 4 hours ago

What causes the improvements in new AI models recently? Is it just more training, or is it new, innovative techniques?

GenerWork - 5 hours ago

I wonder what this means for UX designers like myself who would love to take a screen from Figma and turn it into code with just a single call to the MCP. I've found that Gemini 3 in Figma Make works very well at one-shotting a page when it actually works (there's a lot of issues with it actually working, sadly), so hopefully Opus 4.5 is even better.

aliljet - 5 hours ago

The real question I have after seeing the usage rug being pulled is what this costs and how usable this ACTUALLY is with a Claude Max 20x subscription. In practice, Opus is basically unusable by anyone paying enterprise-prices. And the modification of "usage" quotas has made the platform fundamentally unstable, and honestly, it left me personally feeling like I was cheated by Anthropic...

saaaaaam - 4 hours ago

Anecdotally, I’ve been using opus 4.5 today via the chat interface to review several large and complex interdependent documents, fillet bits out of them and build a report. It’s very very good at this, and much better than opus 4.1. I actually didn’t realise that I was using opus 4.5 until I saw this thread.

mutewinter - 3 hours ago

Some early visual evaluations: https://x.com/mutewinter/status/1993037630209192276

agentifysh - 4 hours ago

again the question of concern as codex user is usage

its hard to get any meaningful use out of claude pro

after you ship a few features you are pretty much out of weekly usage

compared to what codex-5.1-max offers on a plan that is 5x cheaper

the 4~5% improvement is welcome but honestly i question whether its possible to get meaningful usage out of it the way codex allows it

for most use cases medium or 4.5 handles things well but anthropic seems to have way less usage limits than what openai is subsidizing

until they can match what i can get out of codex it won't be enough to win me back

edit: I upgraded to claude max! read the blog carefully and seems like opus 4.5 is lifted in usage as well as sonnet 4.5!

adastra22 - 4 hours ago

Does it follow directions? I’ve found Sonnet 4.5 to be useless for automated workflows because it refuses to follow directions. I hope they didn’t take the same RLHF approach they did with that model.

ramon156 - 3 hours ago

I've almost run out of Claude on the Web credits. If they announce that they're going to support Opus then I'm going to be sad :'(

viraptor - 5 hours ago

Has there been any announcement of a new programming benchmark? SWE looks like it's close to saturation already. At this point for SWE it may be more interesting to start looking at which types of issues consistently fail/work between model families.

adt - 4 hours ago

https://lifearchitect.ai/models-table/

jedberg - 5 hours ago

Up until today, the general advice was use Opus for deep research, use Haiku for everything else. Given the reduction in cost here, does that rule of thumb no longer apply?

thot_experiment - 4 hours ago

It's really hard for me to take these benchmarks seriously at all, especially that first one where Sonnet 4.5 is better at software engineering than Opus 4.1.

It is emphatically not, it has never been, I have used both models extensively and I have never encountered a single situation where Sonnet did a better job than Opus. Any coding benchmark that has Sonnet above Opus is broken, or at the very least measuring things that are totally irrelevant to my usecases.

This in particular isn't my "oh the teachers lie to you moment" that makes you distrust everything they say, but it really hammers the point home. I'm glad there's a cost drop, but at this point my assumption is that there's also going to be a quality drop until I can prove otherwise in real world testing.

gigatexal - 2 hours ago

Love the competition. Gemini 3 pro blew me away after being spoiled by Claude for coding things. Considered canceling my Anthropic sub but now I’m gonna hold on to it.

The bigger thing is Google has been investing in TPUs even before the craze. They’re on what gen 5 now ? Gen 7? Anyway I hope they keep investing tens of billions into it because Nvidia needs to have some competition and maybe if they do they’ll stop this AI silliness and go back to making GPUs for gamers. (Hahaha of course they won’t. No gamer is paying 40k for a GPU.)

alvis - 5 hours ago

What surprises me is that Opus 4.5 lost all reasoning scores to Gemini and GPT. I thought that was the area where the model would shine the most

whitepoplar - 4 hours ago

Does the reduced price mean increased usage limits on Claude Code (with a Max subscription)?

synergy20 - 3 hours ago

great, paying $100/m for claude code, this stops me from switching to gemini 3.0 for now.

throwaway2027 - 3 hours ago

Oh that's why there were only 2 usage bars.

rishabhaiover - 5 hours ago

Is this available on claude-code?

kachapopopow - 3 hours ago

slightly better at react and spatial logic than gemini 3 pro, but slower and way more expensive.

xkbarkar - 3 hours ago

This is great. Sonnet 4.5 has degraded terribly.

I can get some useful stuff from a clean context in the web ui but the cli is just useless.

Opus is far superior.

Today sonnet 4.5 suggested to verify remote state file presence by creating an empty one locally and copy it to the remote backend. Da fuq? University level programmer my a$$.

And it seems like it has degraded this last month.

I keep getting braindead suggestions and code that looks like it came from a random word generator.

I swear it was not that awful a couple of months ago.

Opus cap has been an issue, happy to change and I really hope the nerf rumours are just that. Unfounded rumours and the degradation has a valid root cause

But honestly sonnet 4.5 has started to act like a smoking pile of sh**t

tschellenbach - 4 hours ago

Ok, but can it play Factorio?

cyrusradfar - 5 hours ago

I'm curious if others are finding that there's a comfort in staying within the Claude ecosystem because when it makes a mistake, we get used to spotting the pattern. I'm finding that when I try new models, their "stupid" moments are more surprising and infuriating.

Given this tech is new, the experience of how we relate to their mistakes is something I think a bit about.

Am I alone here, are others finding themselves more forgiving of "their preferred" model provider?

GodelNumbering - 5 hours ago

The fact that the post singled out SWE-bench at the top makes the opposite impression that they probably intended.

CuriouslyC - 4 hours ago

I hate on Anthropic a fair bit, but the cost reduction, quota increases and solid "focused" model approach are real wins. If they can get their infrastructure game solid, improve claude code performance consistency and maintain high levels of transparency I will officially have to start saying nice things about them.

gsibble - 3 hours ago

They lowered the price because this is a massive land grab and is basically winner take all.

I love that Anthropic is focused on coding. I've found their models to be significantly better at producing code similar to what I would write, meaning it's easy to debug and grok.

Gemini does weird stuff and while Codex is good, I prefer Sonnet 4.5 and Claude Code.

AJRF - 4 hours ago

that chart at the start is egregious

lerp-io - an hour ago

80% and 77% is not that much lol
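
For anyone wondering how much the chart's axis changes the picture, here is a rough sketch using the two scores mentioned here. It assumes the y-axis starts at 70, going by the "cut off 70% of the y-axis" remark elsewhere in the thread; the function name is made up for illustration.

```python
def bar_height_ratio(a, b, axis_start=0.0):
    """Ratio of the drawn bar heights for scores a and b when the axis starts at axis_start."""
    return (a - axis_start) / (b - axis_start)


# Scores quoted in this comment (percent).
HIGH, LOW = 80.0, 77.0

full_axis = bar_height_ratio(HIGH, LOW)                  # axis starts at 0
cut_axis = bar_height_ratio(HIGH, LOW, axis_start=70.0)  # assumed axis start

print(f"Full axis: taller bar is {full_axis:.2f}x the shorter one")
print(f"Axis from 70: taller bar is {cut_axis:.2f}x the shorter one")
```

With those assumptions the real gap is about 4%, while the drawn bars differ by about 43%.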

0x79de - 5 hours ago

this is quite a good

zb3 - 5 hours ago

The first chart is straight from "how to lie in charts"..

fragmede - 4 hours ago

Got the river crossing one:

https://claude.ai/chat/0c583303-6d3e-47ae-97c9-085cefe14c21

Still fucked up one about the boy and the surgeon though:

https://claude.ai/chat/d2c63190-059f-43ef-af3d-67e7ca1707a4
