CC-Canary: Detect early signs of regressions in Claude Code

59 points by tejpalv 20 hours ago

A useful(ish) trick I've found is adding a persona block to my CLAUDE.md. When it stops addressing me as 'meatbag' I know the HK-47 persona instructions are not being followed, which means other instructions are not being followed. Dumb trick? Yup. Does it work? Kinda? Does it make programming a lot more fun and funny? Heck yes.

Don't lecture me on basins of attraction--we all know HK is a great programmer.

Baeocystin - 10 hours ago

The brown m&m trick turns out to have more applications than one would think!
JonSchneider - 10 hours ago

Mind sharing that block? Is it just: "Persona: You are HK-47"?

Robdel12 - an hour ago

There’s no way I’m going to spend any time tracking and fighting a tool like this.

If you feel the need to do this, it’s time to move onto a tool you trust?

evantahler - 19 hours ago

I feel like asking the thing that you are measuring, and don’t trust, to measure itself might not produce the best measurements.

john_strinlai - 18 hours ago

"we investigated ourselves and found nothing wrong"
- deaux - 10 hours ago
  
  Funny, but in this case it will be the opposite. If you tell an LLM to find potential regression, it will lean towards "finding" it even where there is none.

jdiff - 12 hours ago

My attitude towards this is growing similar to my attitude towards Windows. If I have to fight against my tools and they are actively working against me, I'd rather save the sanity and time and just find a new tool.

thawab - 4 hours ago

I am on the same boat, started to build my extensions and preferences in PI. The community is awesome and helpful. My assumption is the personalization has higher priority than the intelligence gap between opus and gpt or others. At least i wont stop working if claude is down.
trueno - 10 hours ago

i think a lot of us are kind of sitting back and seeing what dark horse rises up. it's such a non-deterministic outputting technology that's still in discovery mode and the resources expenses/constraints are outta control, the companies leading the charge are eating themselves and can't guarantee jack squat. down in the trenches people have built these things up to be critical dependencies in either their day to day life or their work, my eyes are mostly glazing over hearing about how people are using claude to do whatever grand array of things with no oversight. the way we benchmark this shit is all over the map, the goalposts are just teleporting randomly at this point.
my claude usage has drastically dried up as i've personally realized the real bottom of this stuff is always gonna be genuinely learning and becoming excellent at a thing. i think claudes not bad for helping me get through early stages of that process, and for actual work i think claudes great for just ripping out something im too lazy to do, but something i know so well that i can catch him slippin. absolute coinflip on if it's worth the pain, many times now ive said "i shouldve just done this myself".
i've got my fingers crossed for like llm's to reach some sort of proverbial opus 4,5 territory here. even if thats gonna cost me a bit in hardware, that's kind of my personal bench mark for "good enough, i'm unplugging from all this craziness".
one thing is for sure, anthropic needs to stop adding _features_. claude code or vscode extension was their bread and butter reliability that garnered them a lot of goodwill from people who were willing to pay good money for a good service. seeing them launch their design thing just has me rolling my eyes. they're kind of microsoft'ing themselves here by trying to do too much, and they'll end up delivering a lot of subpar services that aren't best-in-class at any one thing. we're already seeing that i think.
- cindyllm - 10 hours ago
  
  [dead]

Retr0id - 17 hours ago

What is "drift"? It seems to be one of those words that LLMs love to say but it doesn't really mean anything ("gap" is another one).

jldugger - 17 hours ago

IDK how it applies to LLMs but the original meaning was a change in a distribution over time. Like if you had some model based app trained on American English, but slowly more and more American Spanish users adopt your app; training set distribution is drifting away from the actual usage distribution.
In that situation, your model accuracy will look good on holdout sets but underperform in user's hands.
Retr0id - 16 hours ago

This definition satisfied me: https://universalpaperclips.fandom.com/wiki/Value_Drift
- ryoshoe - 15 hours ago
  
  Fandom sites continue to have one of the most unpleasant user experiences on the modern web, I just want to read the article without having my to watch 4 different video ads...
  - justinclift - 21 minutes ago
    
    You're not using an ad blocker?
idle_zealot - 17 hours ago

I believe it's businessspeak for "change." Gap is suittongue for "difference."
redanddead - 16 hours ago

there are many causes, but it’s a drift in performance
you can drift a tool via the harness in many ways
you can modify the system prompt
you can modify the underlying model powering the harness
you can use different “thinking” levels for different processes in the harness
you can change the entire way a system works via the harness, which could be better or worse, depending on many things
you can introduce anti-anti-slop within the harness to foil attempts from users using patch scripts
you can modify how your tool sends requests to your server depending on many variables
you can handle requests differently, depending on any variable of your choosing, at the server level
you can modify the compute allotment per user depending on many things, from the backend, without telling the user, it’s very easy. you can modify it dynamically depending on your own usage or the user’s cycle. Or their organization’s priority level as a customer. The weekly and daily usage management system is intricate, compute is very finite and must be managed
the user has literally no way to know and you have no legal obligation to tell them, you never made them any legally binding promises
the combination of so many factors that all affect each other means that you can, if you’d want to, create a new clusterfuck of an experience anytime any of these or unknown variables change, it may not even be deliberate, it grows exponentially complex, so you may not even be able to promise a specific standard to your users
drift is not imagined, sure, but admitting to it could expose you to unneeded liability
- Retr0id - 16 hours ago
  
  That's a lot of words without actually defining the term, although idle_zealot's suggestion of "change" seems to make grammatical sense as a replacement here.
  - redanddead - 16 hours ago
    
    yeah, figured i’d put some thought into it, you know?

majormajor - 11 hours ago

In addition to the elsewhere-mentioned "you're using a black box to try to analyze the same black box," the fundamental metrics all seem incredibly prone to other factors than any Claude Code changes.

Claude Code changes all the time—it's the whole shitty trend of the day—but you can't tell which of those changes are better or worse from analyzing results on independent novel tasks.

And you're baking in certain conclusions: "HOLDING / SUSPECTED REGRESSION / CONFIRMED REGRESSION / INCONCLUSIVE". Where's an option for "better than previous baseline"? Seems certainly possible that a session could have better-than-average numbers on the measured things.

Overall, though, there's just so much here that's just uncontrolled. The most obvious thing that isn't controlled for is the work itself. What does the typical software project look like? A continued accumulation of more code performing more features? What's gonna make an LLM-based agent have to do more work? Having to deal with a larger, more complicated codebase. Nothing in this seems to attempt to deal with the possibility that a session that got labeled a regression might have actually been scored even lower against a month ago's Claude Code.

"It's harder to read code than to write code" and "codebases take more effort to modify over time as they grow" are ancient observations.

Drift detection would require static targets and frequent re-attempts.

I use it everyday and haven't seen worsening. (It's definitely not static but the general trend has been good.) But I use it on a codebase that was already very complex before we started using these tools, where overall every three months or so has brought significant improvements in usability and accuracy.

aleksiy123 - 19 hours ago

Interesting approach, I've been particularly interested in tracking and being able to understand if adding skills or tweaking prompts is making things better or worse.

Anyone know of any other similar tools that allow you to track across harnesses, while coding?

Running evals as a solo dev is too cost restrictive I think.

FrankRay78 - 17 hours ago

See the very last section in this doc for how I minimise token usage and track savings, all three plugins co-exist fine: https://github.com/FrankRay78/NetPace/blob/main/docs/agentic...
- aleksiy123 - 14 hours ago
  
  This is very nice as well as the reference links. I’ve been having trouble with closing the loop.
  Going to feed into my own.
  Out of curiosity how have your agents evolved and metric changed.

wongarsu - 18 hours ago

See also https://marginlab.ai/trackers/claude-code-historical-perform... for a more conventional approach to track regressions

This project is somewhat unconventional in its approach, but that might reveal issues that are masked in typical benchmark datasets

redanddead - 16 hours ago

the actual canary is the need for the canary itself

lioeters - 16 hours ago

like the status page of a service provider that goes down when the service goes down. you had one job

Yemane5 - 15 hours ago

thanks

tommy29tmar - 8 hours ago

[dead]

tejpalv - 20 hours ago

[dead]