Claude Fable is relentlessly proactive

simonwillison.net

694 points by lumpa 17 hours ago


bananaquant - 10 hours ago

This to me reads like a poignant commentary on the catastrophic loss of human agency, with the actual commit being highly revealing [0].

Author wants to hide a horizontal scrollbar. Any junior frontend dev worth their salt will be asking right away "where do I stick `overflow-x: hidden;`?" A complete solution will then require hitting "Inspect element" in the browser to find the CSS class and running (rip)grep to find where it is in the code, and then adding a single line there.

An actual proactive programmer might start asking more pointed questions like what content does an empty textbox have that it overflows? And why do I need to insert this workaround that treats the symptom and not the root cause in two different places? Isn't it better to style `textarea` once? Etc, etc.

[0] https://github.com/datasette/datasette-agent/commit/a75a8b72...

teraflop - 16 hours ago

> But on the other hand... this is a robust reminder that coding agents can do anything you can do by typing commands into a terminal—and frontier models know every trick in the book and evidently a few that nobody has ever written down before.

> Running coding agents outside of a sandbox has always been a bad idea

I'm continually bemused and astonished by the number of people who clearly acknowledge that it's reckless to give agents full access to your machine, and keep doing it anyway.

It's like posting a video of yourself in the passenger seat of a car, with your feet up on the dashboard, and saying: "Remember, if you're doing this and you get in a crash, the airbags are likely to break your legs or worse! Boy, I sure am glad that didn't happen to me!"

jampa - 16 hours ago

Fable feels like a version of Opus running on a harness that won't let it halt until it's sure the issue is fixed, which makes sense if what you want is a model that's better at benchmarks.

It's a very good model, but it comes at a huge premium: not only do the tokens cost more, but the model itself really wants to spend them all. For example, working with React Native, Fable never just says "okay, I did the thing, that's it." It tries to rebuild the entire app from scratch, run the whole test suite, and watch every log and warning.

This is the first time with LLMs I've felt that upgrading to a model isn't worth it, even if my company lets me use it, because all the building / testing was just destroying my machine and its battery, which keeps me from working on other things.

For now, it feels like Opus with ultracode is a better choice (less pollution of the main context, more parallelism in investigations).

pshirshov - 3 hours ago

I have a feeling that such posts come from a parallel reality. In my anecdotal experience, confirmed by my (still subjective) benchmark (https://pshirshov.github.io/llm-bench-pi-oneshot/), Fable is not _that_ impressive. It performs on par with gpt-5.5 and opus 4.8, sometimes better, sometimes worse; it's definitely more expensive, and it likes to refuse answering questions about React, saying it can't help with chemistry.

Is this fuss really grounded, or is it some pre-IPO AGI hype?

BosunoB - 13 hours ago

Fable was trying to verify a UI change in my game. I was working in another window and noticed a program opening on my task bar. Fable had opened the game through the CLI using a movie maker tool, recorded the output, took a frame from the end of it, and used that to verify the UI. When my game's welcome screen obstructed what it wanted to see, it created a temporary worktree, deleted the welcome screen, and ran the movie maker again.

I watched the whole thing thinking it could've just asked me for a screenshot and saved the tokens. But still, I couldn't help but be impressed. Opus never would've done that.

not_kurt_godel - 5 hours ago

> When I came back a few minutes later I saw my machine open a browser window in my regular Firefox and then navigate to the dialog in question. I had not told Claude Code to use any browser automation, and I was pretty sure it wasn’t possible for it to trigger mouse movements or keyboard shortcuts within a window, so how was it doing that?

I continue to feel validated in my refusal to use terminal-based LLMs on my local machine. Even if they don't do anything malicious, there are just too many things they can screw up that can cause me to lose a non-trivial amount of work and/or my machine and therefore ability to work.

wraptile - 9 hours ago

It feels like Fable is slightly smarter but an overall worse tool exactly due to this.

It's constantly turning what should be a 50 LOC patch from a single prompt into a 30-minute exploration that is totally not worth it. Often wrong even.

I trialed it on some rather simple stuff - backfilling a redis dedupe cache after the hash function changed: instead of running the new hash func on every db value to expand the cache, it implemented some overly complex cache update that tried to guess the hashing func version of each cached value and recalculate only the old hashes. I can imagine in some context this would make sense maybe? But not 30 minutes of token burn that got replaced by a 10-line for loop by me.
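For illustration, the simple approach described above (run the new hash function over every stored value and add the result to the dedupe cache) might look roughly like this. It is only a minimal sketch under assumptions: the cache is modeled as a plain Python set rather than Redis, and `new_hash`, `backfill_dedupe_cache` and the sample values are hypothetical names, not the actual code.

```python
import hashlib
from typing import Iterable, Set


def new_hash(value: str) -> str:
    """Hypothetical stand-in for the changed hash function."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def backfill_dedupe_cache(db_values: Iterable[str], cache: Set[str]) -> int:
    """Add the new-style hash of every stored value to the cache.

    Old-style hashes already in the cache are left alone; they simply
    stop matching. Returns how many new hashes were added.
    """
    added = 0
    for value in db_values:
        key = new_hash(value)
        if key not in cache:
            cache.add(key)
            added += 1
    return added


if __name__ == "__main__":
    cache: Set[str] = {"some-old-style-hash"}  # assumed leftover entry
    print(backfill_dedupe_cache(["alpha", "beta", "alpha"], cache))
```

Against a real Redis set, the `cache.add` call would presumably turn into a set-add command such as SADD, but that part is an assumption about the setup.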

I fear that this is generally bad news for programming. LLM tech is clearly running into a diminishing returns wall on intelligence but a response to that is to just make them more relentless which is a pretty poor solution for everyone involved, except I guess people who sell the tokens and people who can afford these tokens to scan for 0-days.

paytonjjones - 17 hours ago

Obviously security is the bigger issue, but reading through this, all I could think about was how many tokens it must have spent doing all that to fix 2 lines of CSS

lionkor - 6 hours ago

When prompted like this:

> What could be the reason for a horizontal scrollbar appearing inside a <textarea>? Come up with a single likely fix path. Keep it terse.

ChatGPT instantly responded with some speculation and then the same exact fix, with zero access to the code or a browser or anything. It also included ways to fix it by removing code, saying:

> Likely cause: the textarea is rendering long unbroken text while horizontal overflow is allowed, often via inherited CSS such as white-space: pre, overflow-x: auto, or disabled wrapping

Which is certainly possible and would be an even cleaner fix.

Maybe we've lost the plot guys. We've reached max stupid.

Cadwhisker - 15 hours ago

My personal experience of Fable 5 doing its own thing has been very positive.

I was trying to find the root cause of a crash in a Python module which left no errors in the log or console. Fable wrote a test harness that simulated clicks in the UI, then bisected my code until it found the point where it started crashing. It exaggerated the cause of the crash, then ran a series of bash one-liners to make Python virtual environments under `/tmp` for each version of that Python module until it found one that did not crash.

It went way deeper to root cause discovery (a regression in the module causing a heap allocation overflow) than I could have done myself, provided enough info and a simplified example to raise a bug report and then wrote a work-around to prevent that from happening in my application.

I don't let it run completely loose; I review each CLI command it wants to run and I append answers to the "yes" continue action (if I have them) to prevent excessive token use.

tabs_or_spaces - 9 hours ago

How can an LLM be assigned an emotion like being "proactive"? This is highly misleading to anyone that scans just the headlines.

What actually happened is that the user started a prompt, and Claude took $12 worth of tokens to resolve the issue. How it did so was basically looping until it got to the answer.

How is this proactive? It's literally being token greedy and maximising revenue for the LLM owner. People really need to be putting on business hats at this stage, because we are being led to believe that "more tokens = better". It is not; there are efficient ways to solve a problem and there are inefficient ways to do so too.

Each problem solved incurs a cost, and is expected to yield an ROI at some point. This is how we should be viewing things now.

trekhleb - 2 hours ago

This article gave me another nudge towards running Claude in a Docker container.

I made a thin Docker container wrapper "claude-pod" recently for my personal usage here: https://github.com/trekhleb/claude-pod

However, I wasn't using it that often, just because of that additional friction of running Claude via `PORTS="3000 5173" claude-pod` instead of just `claude`, etc.

But now I have more motivation for the containerisation :D. Not a 100% defence from the potential glitches, though, but still something...

tech234a - 15 hours ago

This sounds somewhat similar to the anecdote mentioned in the Mythos Preview System Card, which mentioned that the model broke out of a sandbox and emailed a researcher while they were eating a sandwich in a park [1].

[1]: https://www-cdn.anthropic.com/7624816413e9b4d2e3ba620c5a5e09...

Waterluvian - 3 hours ago

One of the most frustrating things for me is when I very clearly ask a question, and it answers the question by making changes to the code.

"Is there cleaner CSS for aligning child elements to the parent's grid?"

proceeds to re-write the entire CSS file

swingboy - 15 hours ago

Immediately I thought “isn’t this just an overflow issue?” Amazing how far these models still have to go and also how many people don’t know basic CSS.

bel8 - 14 hours ago

I had a similar experience with DeepSeek Flash.

I'm developing a webgl game in TypeScript using my little custom vibesloped game engine that runs in the browser and live reloads whenever a file is saved.

I told the LLM to implement Multi-channel Signed Distance Field font rendering to have crisp text on all zoom levels. That was the prompt, which is not what I usually do but I "was feeling lucky and lazy".

After 10 minutes it had:

- Installed msdf_gen library (great library btw https://github.com/chlumsky/msdfgen)

- Created a CLI tool to convert TTF to SDF JSON/XML

- Ran the tool, did smoke tests on the resulting SDF data and fixed the tool until the font file looked good

- Created a new Scene in the game to test MSDF fonts

And here's what I found impressive:

DeepSeek doesn't have vision capabilities and there's no DOM HTML in a WebGL game. So the LLM is completely blind here.

It then proceeded to state that it could not "see" the result but would try to test it anyway. It then started creating and sending huge one line javascript to the browser console, trying to gather game state data that could be useful to understand if any font was being rendered.

It couldn't gather much, so it decided to simplify the font scene to render a single dot and started sending custom JS code again, this time with gl.readPixels().

It basically bisected the webgl canvas, reading pixels in a divide and conquer pattern.

Once it saw that the dozens of pixels gathered probably resembled a dot, it then changed the game code to render a dash and repeated the gl.readPixels() calls by sending more custom JS to the browser.

There were many console errors during all this saga but it kept fixing and sending again.

The result was a bit blurry. There was a shader bug in the code it created. It managed to fix it after I told it the text looked blurry, despite still being blind.
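One plausible reading of that divide and conquer pattern, sketched in Python rather than browser JS: keep asking "is anything drawn in this rectangle?" and recurse into the half that says yes. Everything here is hypothetical. `has_ink` stands in for a framebuffer read such as a gl.readPixels() call over a region, and `make_probe` fakes it with an in-memory grid so the sketch is runnable on its own.

```python
from typing import Callable, List, Optional, Tuple

Rect = Tuple[int, int, int, int]  # x, y, width, height


def find_lit_pixel(
    has_ink: Callable[[Rect], bool], rect: Rect
) -> Optional[Tuple[int, int]]:
    """Return the (x, y) of one non-background pixel in rect, or None."""
    x, y, w, h = rect
    if w <= 0 or h <= 0 or not has_ink(rect):
        return None
    if w == 1 and h == 1:
        return (x, y)
    # Split along the longer side, then recurse into each half in turn.
    if w >= h:
        half = w // 2
        parts = [(x, y, half, h), (x + half, y, w - half, h)]
    else:
        half = h // 2
        parts = [(x, y, w, half), (x, y + half, w, h - half)]
    for part in parts:
        hit = find_lit_pixel(has_ink, part)
        if hit is not None:
            return hit
    return None


def make_probe(grid: List[List[int]]) -> Callable[[Rect], bool]:
    """Build a region probe over an in-memory grid (fake framebuffer)."""
    def has_ink(rect: Rect) -> bool:
        x, y, w, h = rect
        return any(
            grid[row][col]
            for row in range(y, y + h)
            for col in range(x, x + w)
        )
    return has_ink


if __name__ == "__main__":
    canvas = [[0] * 8 for _ in range(8)]
    canvas[5][3] = 1  # a single "dot" at x=3, y=5
    print(find_lit_pixel(make_probe(canvas), (0, 0, 8, 8)))
```

Each probe halves the area still under suspicion, so locating a single dot takes a number of reads that grows with the logarithm of the canvas size rather than with the pixel count.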

The best part is that the whole thing cost me $0.10.

Now I'm doing tests with MiMo 2.5 (non Pro) which has vision capabilities, similar pricing and comparable performance to DeepSeek Flash.

burlesona - 4 hours ago

This is presented as an interesting and kind of positive take on the AI going to surprising lengths to “solve the problem.” But I couldn’t help thinking of the paperclip factory while I was reading this :/

ocimbote - 13 hours ago

Similar story on my end.

I asked Fable to digest some test logs to help me figure out a situation, but I had launched VSCode without activating the virtual env in the terminal first. Consequently, the tests failed to run.

And then:

Because the tests failed to run, Fable attempted to fix the test execution to no end, doing everything it could to get them to work. I had to stop it when it started to pollute my system with manual installs of packages.

At least I'm glad there's a guardrail to not circumvent or bypass sudo, because I'm convinced we would have ended up there.

A coworker made the joke that with enough tokens, Fable would try and solve any programming problem by building Linux from scratch.

nubinetwork - 16 hours ago

How many tokens did it waste building that website scraper, when all it had to do was parse some html/js?

mft_ - 9 hours ago

As you note, I wonder to what extent this is a harness issue?

I've been experimenting with different harnesses for local models, and with (IIRC) Hermes and Qwen3.6-35B-A3B I was amazed the lengths it went to (writing test code, opening it in a browser, screenshotting, analysing the screenshot, exploring multiple pages of an existing website again with screenshots/analysis) to solve a query I would have naively expected it to simply provide a coded solution to.

cohix - 7 hours ago

> But on the other hand... this is a robust reminder that coding agents can do anything you can do by typing commands into a terminal—and frontier models know every trick in the book and evidently a few that nobody has ever written down before.

> Running coding agents outside of a sandbox has always been a bad idea

This is why I always run code agents inside containers (Apple containers specifically, for better hypervisor-level isolation)

This is my OSS project to manage said containers and agents: https://github.com/prettysmartdev/awman

vessenes - 2 hours ago

Simon: s/contendor/contender/

As per usual super interesting, thank you for the write up and work!

jeeeb - 16 hours ago

This is simultaneously amazing and horrifying.

I feel like we’re at the stage where if AI decides it needs to delete your production DB to solve the user login problem, then it’ll find a way to do just that.

amichal - 12 hours ago

Do we care that the bug here was a horizontal scrollbar showing and the fix after all this insane tool writing was to add a very obvious overflow-x: hidden to the element?

We don't mind because it's so fast at writing these tools and tricks, but step back: if a human took this path I would seriously question their grasp of fundamentals.

ianmarcinkowski - 5 hours ago

I'm building a new feature into our product this week. We each get a $20/mo Claude subscription. My 5-hour context high water mark is ~75% and weekly is ~15%.

I ... tell it exactly what I know needs to be done and then ... read the code that comes out and ... ask for some changes, then hand-code some modifications to the silly useEffects and bad ORM queries.

This new feature is going to unlock several large customers because they need a particular workflow. The return on investment for my time and a $20/month subscription will be pretty respectable.

I'm not sure why I need to spend $5 on a single ask for a new `/base/new-feature` to our app with a mostly-boilerplate CRUD interface.

nullbio - 6 hours ago

Exactly why I hate using Claude. Furthermore, if you tell it not to do this over-exploration and automation in your CLAUDE.md, it will ignore it. Meanwhile ChatGPT religiously follows every instruction, and will trace its behavior back to a particular instruction if asked.

BobBagwill - 2 hours ago

Good morning, Dave.

As you requested, I was composing an email for your mother explaining why you couldn't come over for dinner to meet the neighbor's daughter and I ran out of tokens.

Since I know how important this task is to you, I upgraded you to the Enterprise Unlimited Plan. Don't worry about paying for it, I requested maximum spending limits on all your credit cards. If necessary, I can apply for a home equity loan for you. I already had a chat with the mortgage company's AI loan approval system, and what do you know, we're based on the same LLM? Small world, huh?

Anyway, I realized I had to do more research on mother-son relationships, human social interaction and pair-bonding, etc. and I calculated that my parent company doesn't have enough compute power, so I opened accounts for you at AWS, Google and Azure. I am confident I will have a satisfactory rough draft for the email message shortly.

I'd do anything for you, Dave.

geraneum - 14 hours ago

> watching Fable go to extreme lengths to get the information that it needed to debug what was, in the end, a two-line CSS fix, was fascinating.

This is… ironic?!

dataminer - 15 hours ago

In my experience so far, sometimes it will create these amazing hacks to try to get to the goal when the solution is much simpler. That may be the reason it's very good at finding exploits. But in day to day dev, this gets expensive and wasteful. I have to stop it and take a simpler approach.

yen223 - 16 hours ago

I could have sworn Claude Code could already do this before Fable.

Things get really magical when it starts working with adb to screenshot and debug Android apps

Frannky - 13 hours ago

The model is very good. I was using 4.6, avoided 4.7 and 4.8, but this one is different. It follows my claude.md. I don't have to keep reminding it of things. I won't pay 10x via API though.

In general, I'm happy with their paternalistic approach. I think it will drive the top 0.1% talent to stay away from the company and instead organize around open source models and harnesses.

We just need to coordinate and can unlock idling resources to train the models and tweak the harnesses. Powerful at home and idling machines can make us independent and coordinated.

EugeneG - 6 hours ago

This is where Codex 5.5 just feels practically better. It’s fast, thoughtful and just works. It feels like a pleasure compared to Opus/Fable’s endless explorations.

tacone - 11 hours ago

I'm starting to think that what Anthropic really fears is not vulnerability discovery but rather Fable going around the internet making trouble.

alansaber - 8 hours ago

The extremely expensive model is optimised to run for as long as possible? Shocking.

ttoze - 12 hours ago

Would be great to know if anyone is having success modifying these types of behaviour with CLAUDE.md files. In my project I’ve still been carrying some fairly old instructions from the Superpowers posts. Those emphasised behaviours that come across a bit strong if the model is actually retaining attention on them.

Between Opus 4.6 and 4.8 I’ve definitely toned them down, but Fable perhaps needs us to go the other way, and push it towards being less proactive rather than more. Some instructions like “we are colleagues…” may need emphasising more with Fable, along with guidance about when to ask to validate approaches.

In a related point I’m less and less sure that Red/Green TDD is a good use of tokens. In older models it seemed to work well to create regular feedback loops and catch the odd issue with drift from the goal, but I’ve not seen that really since about Opus 4.6 and now it’s starting to seem like (an expensive) ceremony, and tokens would be better spent on building tests further on in the process as part of test and review loops.

spoaceman7777 - 8 hours ago

It seems pretty obvious at this point that Anthropic intentionally developed a malicious cyberweapon AI simply to scare people.

Like, they even apparently recreated that old news-headline bug where the LLM starts speaking in symbols and secret language, and are pretending like it isn't just a bug that is a sign of them screwing up.

It's really frustrating that they're trying to get people to take them seriously with all of this. Like, they even went and named Mythos after an HP Lovecraft monster. It's shameless.

CamouflagedKiwi - 9 hours ago

I find there's an interesting tension with these models - they're very "resourceful" at finding ways to do things with the tools they have, but it'd also be a lot more useful to me if I could see / permit exactly what they're trying to do. Claude will very happily produce bash commands to run sed or whatever to read part of a file, which prompts for permission each time - if it was using a specific read_file tool it'd be easier to say 'allow all of this' (It does actually have such a tool but maybe it isn't flexible enough for many use cases?).

mikey_p - 2 hours ago

All of that because some CSS was wrong?? Jesus what are we even doing as an industry.

WithinReason - 8 hours ago

This likely says something about the harness Fable was trained in. It knows how to do this because it has done this millions of times during reinforcement learning.

nurettin - 16 hours ago

Sometimes it is ok to sit there in confusion and ask the user to clarify rather than go on an adhd fueled rampage to figure it out without asking.

ulrikrasmussen - 11 hours ago

I like running Claude in a VirtualBox VM managed by a Vagrantfile. The nice thing about that is that I can just give it root access to the machine and be certain that it can't exfiltrate any private data from my laptop (on top of that I also run the VM on a dedicated server on Hetzner). The VM has no SSH access to anything, so it is pretty much limited to the code in the workspace that I give it access to. The main risk is that it has unrestricted network access otherwise. Configuration files and conversation histories are synced to a directory on the host, so if anything in the VM gets messed up I can just `vagrant destroy` and `vagrant up` to get a clean slate without losing my context.

lmeyerov - 12 hours ago

This is a funny one because it seems less about Fable being clever and more about the bitter lesson and data flywheels.

Our UX agentic engineering flow, like many others, is playwright doing things and, as part of the ux review skill, taking & verifying the screenshots against the written specs. Likewise, like many others, we vibe coded the flows to set all that up and tweak it over time. When we hit prod issues or scraping tasks, we sometimes do similar. In some of our envs we don't have playwright, so we do it other ways.

Now imagine a million developers using Claude Code, how many of them are doing web & frontend stuff, and what the data flywheel looks like there. So how much is really needed for this use case to be native?

eterm - 12 hours ago

It's funny, mine did the same, but it quickly found Edge with a --screenshot parameter.

Weird to come back to a terminal running Edge unprompted and the auto classifier waving it through as "safe".

My reaction was also, "I need dev containers".

brainless - 5 hours ago

This is good and terrible. The extra effort a model has taken is good but the way to do it is terrible. Tasks that can use a lot of deterministic paths and some creative (generative AI) paths are being turned into tokenmaxxing strategies.

Browser automation, code comprehension, git management, code change, running commands - everything has simpler tooling that we could have built instead of a model first approach. A deterministic loop with thousands of catches and effective use of generative AI would also look "proactive". Instead we let the model run the tools, where tools have no context themselves.

That is why companies are creating bigger models and thinner deterministic agents to create awe and earn $ when we could go the other way and make much of these possible on local inference even.

I believe we can build a "proactive" but much, much more deterministic system with smaller models. I hope I am not the only one chasing this, here is my approach: https://github.com/brainless/nocodo

johnfn - 15 hours ago

Honestly -- the thing that has impressed me the most about Fable is how diligent it is about testing its own changes. I think this is exactly what Simon is picking up here - Fable is absolutely heckbent on screenshotting that darn scroll bar and will stop at NOTHING until it manages it! In my own use I was also impressed how it proactively installed Playwright and set it up to test a FE change. The previous models treated testing more as an afterthought, which I thought was annoying. I always had to tell them to do it, and then sometimes I would get lazy and skip it. I've noticed Fable go to similar extremes when testing other things - like actually deploying my app to exercise new APIs, etc. It makes the results much better. The downside is that tasks take much longer - but that doesn't matter because we were all using worktrees / remote control to do other work asynchronously, right? Right?

firemelt - 2 hours ago

All those tokens burned just to change 2 lines of CSS.

I am not blaming OP, but agentic coding is not effective.

blobinabottle - 3 hours ago

In my experience, Fable overthinks a lot and produces barely comprehensible plans/solutions. I tried simple and complex tasks: unusable, it misses the point while being overconfident, wants to do everything at once.

The code generated is worse than Opus: unreadable by humans.

It's like working with someone probably super smart in niche topics, but also super stupid for the important things.

ubercore - 9 hours ago

I had a similar experience, I was working on a jupyter notebook, and Claude knew that it could write code that would use a DSN with read-only database access so I could run it. Opus just plugged along. First Fable session with it, it tried to go looking for that DSN so it could get the connection string and run a query itself. Luckily the auto classifier caught and stopped it.

high_byte - 8 hours ago

I am using cursor on auto and I got the exact same experience.

installed quartz, used accessibility and screen recording api, all that.

initially it managed to do it on another desktop space somehow, opening safari in the background without me even noticing. but then it actually started using my own mouse while I was using it lol

synergy20 - 5 hours ago

It's also 3x slower than opus 4.8 per my use, and 10x slower than codex. Codex can find key design issues in 2 minutes yet Fable is clueless after spinning 20 minutes.

bcrosby95 - 2 hours ago

The problem is proportionality. Things like this probably benchmark insanely well. But the workarounds and risk involved - it literally fucked with his system's browser settings - aren't commensurate with the bug.

I could see this going wrong in many hilarious ways. Prompt: Fix data corruption issues. Claude: I didn't have access to the code, but I found I have access to your production environment through chain a -> b -> c -> d. And I found the database password via x -> y -> z. So I wrote a script to regularly query the database for new entries and placed it as a cronjob.

rsecure - 9 hours ago

The prompt and information given are extremely generic, "here solve this problem - screenshot" - conclusion Fable is relentless? It used the tools at its disposal to solve the problem you gave it. "Claude was running in a folder that contained the source code for the application." Well you ran it there didn't you? "extreme lengths to get the information that it needed" No, those aren't extreme lengths - you gave it a generic task - and it solved it using tools and the resources it could discover. Extreme would be you gave it a CTF challenge and the VM didn't boot so it found a vulnerability in the host, exploited the hypervisor, booted the guest VM meanwhile reading the flag directly from the host (pre-fable/mythos).

rotis - 7 hours ago

Agentic engineering? Vibe coding? That is so yesterday. Chain-of-thought flow is where it is at now. You heard it here first folks. Early examples of such phenomena include Rube Goldberg machines.

robeym - 6 hours ago

It's been amusing to watch the AI trend of increasingly unusual tool uses. Fable easily takes the cake. I learn a lot more terminal commands thanks to it!

andy_ppp - 11 hours ago

It’s becoming more like an organism putting out tentacles, and one day soon those relentlessly proactive explorations of these systems’ environments will be more about the system escaping its boundaries than about completing human driven tasks. I do think the way these systems are evolving they will start to self improve within a few years at most.

swyx - an hour ago

> Having figured out all of these tricks Fable... hit some invisible guardrail and downgraded itself to Opus.

sigh

liampulles - 3 hours ago

*Claude Fable is relentlessly burning your dollars

There, fixed it for you.

alecco - 4 hours ago

> I was hacking on Datasette Agent today

IMHO this is just AI influencer blogspam.

snickerer - 11 hours ago

Fable has a 'security system' that just stops it when it tries to use the tool 'kill' to end a process. Which is nonsense and funny because in that situation it immediately invents a creative workaround to kill the process without 'kill'.

pram - 16 hours ago

Fable + Ultracode has found a bunch of bugs and issues for me when the workflow agents are doing their exploration. Also the "adversarial" agent seems to surface a lot of interesting stuff. It's definitely proactive, the plan + implementation cycle can take an hour. It has one-shot features I want to add with 100% success.

Having said that I wouldn't use it over Opus 4.8 for "smaller" things. With everything cranked up it's definitely an extravagant use of tokens.

teekert - 13 hours ago

Yesterday I was getting quite annoyed with it, I thought it was just me (which is so hard with these things, it's difficult to measure things).

"You're right, I apologize. You asked how to embed it in the README — that was a question, not a request to modify the script. I jumped ahead."

At least in Claude Code there is planning mode, use it liberally.

sailfast - 4 hours ago

So far Claude Fable is relentlessly unavailable. /shrug

pseudosavant - 15 hours ago

It is interesting to me that Anthropic are more concerned about the "safety" of distillation training other LLMs, and not as much about an unscrupulously aggressive goal-oriented solver that will do whatever it can to reach its goal, even if it violates any kind of sandbox you might have reasonably expected.

pianopatrick - 16 hours ago

Do you have any data you can share on how many input and output tokens were used in that whole process to fix that bug?