We decreased our LLM costs with Opus

88 points by shad42 8 hours ago

> We switched to the "triager" pattern: a Haiku agent with a very specific and narrow job. Is this issue already tracked or not? If it is, stop right there. If not, escalate to Opus.

> 4 out of 5 failures never reach Opus. A triager match costs around 25x less than a full investigation.

The title feels misleading. Why clickbait on that when you can just be genuine about the architecture?

idorosen - 7 hours ago

The title does not match the article title: “We Upgraded to a Frontier Model and Our Costs Went Down”.
- stingraycharles - 6 hours ago
  
  It’s still misleading, though.
shad42 - 5 hours ago

I am one of Mendral co-founder (my co-founder wrote the article), I am the one to blame for changing the title when posting. I thought our original one was too clickbait and I wanted to better summarize with this title.
Despite the original title, a lot of what we learned comes to how Opus evolved and the ability to reason. And also the fact that Haiku is quite capable if scoped properly, that's the whole purpose of the article.
- locknitpicker - 3 hours ago
  
  > Despite the original title, a lot of what we learned comes to how Opus evolved and the ability to reason. And also the fact that Haiku is quite capable if scoped properly, that's the whole purpose of the article.
  I think you're misrepresenting the whole thing. The blog post boils down to introducing a specialized triage step which is then offloaded to a cheap model. The cost savings come from skipping the expensive model. It has absolutely nothing to do with what choice of expensive model is being used. You could write the same blog post by completely ignoring and omitting the expensive model.
  - kovek - 2 hours ago
    
    Does thinking about how to offload matter?
    
    locknitpicker - an hour ago
    
    A discussion on how to avoid paying the price of running an expensive model is not about the expensive model. You can triage things running a cheap model with Ollama. Heck, throw in gpt4.1 which is free.

cadamsdotcom - 6 hours ago

I have rewritten the article to be slightly shorter:

“Let a cheap agent decide if the expensive one is needed.”

baxtr - 4 hours ago

And this was probably also the prompt someone used to generate the article.
a_t48 - 6 hours ago

Sounds like L1 vs L2 support :)
- shad42 - 5 hours ago
  
  It's the same as an escalation. Something we omitted from the post is that we often use Sonnet to write SQL queries.
  We wrote another post that was on HN some time ago that goes into the details of SQL queries (linked at the top of this article). Sonnet is perfect for this.
dmazhukov - 6 hours ago

[dead]

vanviegen - 3 hours ago

> It's the same reason you don't want to lead a debugging session by saying "I think the problem is in this file": you've biased the investigation before it started.

Unless you're evaluating the agent/person doing the debug session, why would you not provide them with some relevant insight about the problem you have? Given that you're pretty sure about it, of course.

alex1sa - 2 hours ago

[dead]

neya - 6 hours ago

The whole clickbait article can be summarized in one line:

    Let a cheap agent decide if the expensive one is needed

albert_e - 5 hours ago

I want to create a "harness" that does this with Claude Code and other expensive agents.

Buffer user prompts, use conversation history and repo state as context -- and run a local model or a cheap and fast cloud model like Haiku to determine the optimal way to address the user's ask, reframe the query with better context (user reviews and approves if needed) and THEN let expensive models like Opus have a go at it.

If we are operating within Anthropic ecosystem with Haiku and Opus -- this sort of logic should ideally be doable within Claude Code as harness. Currently skills cannot be tagged to different models. Ideally we should be able to say -- for trivial tasks, the skill should always use Haiku even if invoked from a session with Opus xhigh.

koenvdb - 3 hours ago

> Currently skills cannot be tagged to different models. Ideally we should be able to say -- for trivial tasks, the skill should always use Haiku even if invoked from a session with Opus xhigh.
You can set the model for a skill. You just set model: haiku at the top and it will use haiku! You can even set the effort level, look for “Frontmatter reference” in this doc article: https://code.claude.com/docs/en/skills
- dmitry_dv - an hour ago
  
  Same works for subagents — .claude/agents/triager.md with model: haiku plus a Task call from the main loop. The reason to roll your own was the sandbox, not the routing.
- albert_e - 2 hours ago
  
  Interesting - thanks!
  Not sure when this was added.
  I found "open" feature requests in GitHub asking for this exact thing
shad42 - 5 hours ago

We considered wrapping Claude Code when we started building Mendral (this agent in the article). We ended up building our own agent, it's lot more work because we followed all the right patterns as the models evolved (sub-agents, proper token caching, redo basic tools like read,write,edit,bash, etc...). But it paid off over time when you build an agent that is focused on a specific task (not a general coding agent).
The main driver for writing our own agent was to leave it out of the sandbox (the agent loop runs on our backend, we call the sandbox only when needed). We wrote another post about that (it's the latest post on the blog).
However, I am curious how would you implement the triager pattern by only using Claude Code as harness.
- 4 hours ago

[deleted]

- 4 hours ago

[deleted]

syntaxing - 5 hours ago

Is RAG dead? I would be very surprised a local small SOTA embedded model like llama-embed-nemotron-8b doesnt outperform the Haiku layer for this application. Should be pretty cheap and easy to prove out. With 32K context size, you can literally one shot the whole ticket.

preommr - 5 hours ago

Yea, but RAG takes effort. At the very least some kind of system to organize the documents and do the retrieval.
My theory is that the AI frenzy has reached new levels of insane, where it's literally just throw anything and everything at the model, and just burn tokens to let the AI figure everything out. Why bother paying the upfront cost for a RAG, when the models/agents are constantly evolving, so just slap in a markdown file telling it to check a folder, and call it a day.
Like in design world, people are doing minor tweaks like changing the spacing by typing in prompts instead of just changing a number in an input field. We are legitimately approaching just using llms instead of calculators, or memes like that endpoint that calls an llm to generate the code to do some business logic, rather than directly code the logic.
shad42 - 5 hours ago

IMO RAG is mostly dead. The game changer with newer models like Opus is the reasoning. So instead of pushing all the context up front (RAG style), it's better to give strong primitives (eg. bash, SQL) and let the agent figure it out.
It's what Claude Code is doing now and the principles we applied for Mendral as well.
That said, you're right that some smaller models can outperform Haiku and we're thinking supporting oss models at some point. But it does not change the core design principles IMO.

iammrpayments - an hour ago

I’m afraid claude code will start doing this on the background without telling you

2001zhaozhao - 5 hours ago

> We switched to the "triager" pattern: a Haiku agent with a very specific and narrow job. Is this issue already tracked or not? If it is, stop right there. If not, escalate to Opus.

I'm planning to self host qwen3.6 27b basically for this purpose

shad42 - 5 hours ago

Nice, it's on our todo list to use oss models too. What are you building?
- 2001zhaozhao - 3 hours ago
  
  The basic idea is to run a 24/7 SWE agent loop on local hardware to maximize the cost effectiveness. The agents just keep running and refining development tasks in a project board. When the human is in the loop they do just enough to complete the tasks with semi-frequent human review. However, whenever human is absent or human feedback becomes the bottleneck they start autonomously debating, delegating to cloud LLMs, etc. to try to clear bottlenecks autonomously instead. So essentially the system will try to do useful things to fully utilize the hardware, which is a specific optimization for local models (you'd never need to do this if you use cloud models).
  The local models would also be queryable on-demand (which overrules the 24/7 tasks in terms of priority) as cheap inference. The idea is that in user-queried interactive tasks, the main Claude agent primarily only gets summaries from other agents and makes decisions based on it, thus saving a ton of tokens compared to giving it access to the codebase. These small-model calls would preferentially route to my local model to save costs but overflow to a cloud provider if demand is momentarily too high.

whalesalad - 6 hours ago

Looking at the diagram, is this seriously a case of replacing basic functional concepts like "write to clickhouse" or "have we seen this before" to a model? could those be actual function calls in some language?

just seems wasteful all around. having an agent in the critical path when a regular expression (or similar) could do just seems odd. yeah haiku is cheap but re.match() is cheaper.

shad42 - 5 hours ago

We're dealing with CI logs, produced by a variety of frameworks, languages, etc... And the tough ones to look into are e2e tests, with outputs from infrastructure. I wish a re.match() would be enough, but we often don't even know what to match in the first place.
We started to add deterministic matching on the patterns that the agent sees the most so we don't have to go through the whole thing (for example a flake on PostHog can occurs 100+ times during a day, you don't need to reinvestigate every time). But for new errors, it's tricky.
prashant3210 - 2 hours ago

[dead]

saltyoldman - 6 hours ago

I do a similar thing with a "planner agent" that uses the cheapest (I think it's using openai-gpt-5.2-mini or something at like 20 cents for 1M.) that more or less emits a plan name, task list and the task list has a recommended model in each task. It's not perfect, but many of our tasks are accomplished with lighter weight models. When doing code generation or fixing we upgrade to a more expensive model, planning and decisions are done more cheaply. Keep in mind the tasks are relatively constrained, so planning done with a cheap agent makes sense here. An open-ended agent would likely use a more expensive call for planning.

shad42 - 5 hours ago

Curious, what steps did you follow to end up with this design (what did you try before)? And what's your use case for this agent?

marlburrow - 4 hours ago

[dead]

EverMemory - 5 hours ago

[dead]

Rekindle8090 - 5 hours ago

[dead]