Claude Fable is relentlessly proactive

694 points by lumpa 17 hours ago

This to me reads like a poignant commentary on the catastrophic loss of human agency, with the actual commit being highly revealing [0].

Author wants to hide a horizontal scrollbar. Any junior frontend dev worth their salt will be asking right away "where do I stick `overflow-x: hidden;`?" A complete solution will then require hitting "Inspect element" in the browser to find the CSS class, running (rip)grep to find where it lives in the code, and adding a single line there.

An actual proactive programmer might start asking more pointed questions like what content does an empty textbox have that it overflows? And why do I need to insert this workaround that treats the symptom and not the root cause in two different places? Isn't it better to style `textarea` once? Etc, etc.

[0] https://github.com/datasette/datasette-agent/commit/a75a8b72...

biztos - 9 hours ago

They might also ask why a bunch of static CSS inside a bunch of JavaScript is hiding inside __init__.py[0] - hopefully before trying to fix some detail of the CSS.
(I'm surprised to see it actually, since my own use of Claude has mostly yielded well-structured code. But I'm not doing proper vibe-coding, more like friendly Socratic arguing with another engineer who happens to be a robot.)
[0] https://github.com/datasette/datasette-agent/blob/main/datas...
- simonw - 7 hours ago
  
  Thanks for the prod, I've extracted that script out into a separate static file: https://github.com/datasette/datasette-agent/commit/fa505b82...
  (It was in Python because there were a couple of URLs that needed to be dynamically constructed by the server, but those are output as a small window.datasetteAgentJumpConfig object instead now.)
  - Ummmdf - 6 hours ago
    
    [dead]
- byproxy - 3 hours ago
  
  > friendly Socratic arguing with another engineer who happens to be a robot
  Ha! Same! Still feels like the best way to go about it, really. I know the dream is to one day remove humans from the loop... but I'll enjoy the dialectic while it still seems the most productive!
  - vadansky - 14 minutes ago
    
    Same, I like to call it rubber duck coding (now the duck talks back!)
    Edit: Now I want an LLM connected rubber duck with a speaker/microphone that sees your screen
dreis_sw - 4 hours ago

Seems like this model delivers on what has already been scaling quite nicely, which is the length and complexity of the requested tasks, but isn't such a big improvement on what hasn't been scaling so far - common sense, discernment, good judgement.
- nlawalker - 3 hours ago
  
  > common sense, discernment, good judgement
  I feel like the whole point of all the experimentation with AI right now is determining whether any of these things actually matter to the end result, over various timeframes.
  - RealityVoid - 2 hours ago
    
    They matter.
    
    pertymcpert - 2 hours ago
    
    Because?
    
    RealityVoid - 30 minutes ago
    
    Because poor judgement leads to poor decisions.
    
    - an hour ago
    
    [deleted]
piker - 10 hours ago

This is exactly right. By offloading this trivial task to the LLM, Simon has abandoned the opportunity to evaluate the abstraction with additional information and improve it. Instead, we let the agent spend $12 and make the fix while learning nothing.
- simonw - 8 hours ago
  
  Things I learned from this:
  - Fable will do a whole lot more than you might expect in order to verify a fix. I learned that it's "relentlessly proactive". That's a good title for a blog entry!
  - You can take screenshots of a window in macOS using the "screencapture" CLI command, but you'll need the integer window ID first.
  - That windowID is accessible via "Quartz.CGWindowListCopyWindowInfo(Quartz.kCGWindowListOptionOnScreenOnly, Quartz.kCGNullWindowID)" using the pyobjc-framework-Quartz library, which installs cleanly via "uv run".
  - A neat trick for simulating keyboard shortcuts is to run document.dispatchEvent(new KeyboardEvent("keydown", {key: "/", bubbles: true})); after the page loads.
  - You don't need Flask or Starlette to run a CORS-enabled localhost server for capturing JSON from another window - 19 lines of code against the Python standard library http.server package works just fine.
  - getComputedStyle(document.querySelector("navigation-search").shadowRoot.querySelector("textarea")) works to read dimensions from inside a Web Component's shadow DOM.
  - defaults write com.google.chrome.for.testing AppleShowScrollBars Always
  - Claude Fable knows how to apply all of the above. It's always interesting to pick up hints of what a model can and cannot do.
  I'm always confused at how many people equate using a coding agent to solve a problem with "learning nothing". If you pay attention to what it's doing you can learn so much!
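A minimal sketch of the "CORS-enabled localhost server" point in the list above, using only the standard library `http.server` module. This is not the code from the actual session: the names `CaptureHandler`, `captured` and `start_server` are hypothetical, and the wide-open `Access-Control-Allow-Origin: *` header is an assumption that only makes sense for a throwaway local debugging server.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

captured = []  # hypothetical in-memory store for the JSON payloads received


class CaptureHandler(BaseHTTPRequestHandler):
    def _cors_headers(self):
        # Assumption: allow any origin, so a page in another window can POST here.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Access-Control-Allow-Methods", "POST, OPTIONS")
        self.send_header("Access-Control-Allow-Headers", "Content-Type")

    def do_OPTIONS(self):
        # Answer the browser's CORS preflight request.
        self.send_response(204)
        self._cors_headers()
        self.end_headers()

    def do_POST(self):
        # Read the JSON body and remember it.
        length = int(self.headers.get("Content-Length", 0))
        captured.append(json.loads(self.rfile.read(length) or b"null"))
        self.send_response(200)
        self._cors_headers()
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(b'{"ok": true}')

    def log_message(self, *args):
        pass  # keep the console quiet


def start_server(port=0):
    """Start the server on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), CaptureHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A page in another window could then `fetch()` its measurements to `http://127.0.0.1:<port>/` as a JSON POST, and the script reads them back out of `captured`.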
  - piker - 7 hours ago
    
    Sorry that wasn't a criticism of you!
    I completely see how it was misread that way. I would edit it now if I could.
    I was using you more as an example of a hypothetical programmer using it in this way. If the goal is to create a maintainable product, this isn't a great approach. If the goal is to learn about the model and its behaviors itself, of course this is a fantastic way to experiment. Yes, you might have learned a lot of tricks as a side effect, but avoiding the pain of thinking about, finding and hiding the thing may mask a better abstraction that reduces complexity and allows the project to move forward faster.
    
    peterbell_nyc - 4 hours ago
    
    Honestly my goal is to learn how to teach an agent to build a maintainable product, so I'm way more interested in the learnings at the agentic level (how to prompt/direct/manage context/restrict tool use, provide reusable shims, etc) than getting into the details of a css bug. That's just not a level of abstraction with sufficient leverage for what I'm trying to do.
    I stopped coding a while back because I could have more impact directing a team of developers than writing code personally.
    For my use case, the agents are now how I can have that scaled impact.
  - rco8786 - 4 hours ago
    
    > If you pay attention to what it's doing you can learn so much!
    I think your post is fair but it's worth pointing out that learning via watching is much less effective than learning via doing.
    
    simonw - 4 hours ago
    
    I used to believe that was universally true, but then I learned about the "worked-example effect": https://en.wikipedia.org/wiki/Worked-example_effect
    
    jplusequalt - 2 hours ago
    
    Your link mentions the expertise reversal effect, where the redundancy of worked examples can actually hamper an experienced student's abilities, vs. letting the more experienced student work it out for themselves.
    
    lanstin - 3 hours ago
    
    It leads to less cohesive shared vision on how to solve problems. In groups where I am trying to foster a shared technical vision, I try to get people to do “see one, do one, teach one” for procedures that are common enough to come up repeatedly (and as a method for discovery for where automation would be a bigger win). Pure green-fields software dev sometimes is doing such novel things that that doesn’t work well, but much of routine software maintenance is discovery of the steps needed to add a new flow or a new customer type or a new configurable behavior, which benefit from consistency.
  - danudey - 3 hours ago
    
    The whole saga is kind of nuts, but the thing that fascinates me most is that Fable got this far and then hit some kind of guardrail; I'd be very curious to know what it wasn't able to do that caused it to downgrade to Opus.
    It already got extremely... invasive? It didn't do anything that I wouldn't have approved in the same case, but it's interesting that it got as far as launching browsers, inspecting every open window, and storing screenshots to disk, and then it was stopped by something? I wonder what.
  - lobocinza - 5 hours ago
    
    Opus also makes this kind of technically competent but dumb deviation to fix a simple issue where asking for input would be better. Models have no illative sense.
  - mapt - 6 hours ago
    
    It was only pursuing the goal you gave it - Keep Summer Safe.
    
    - 5 hours ago
    
    [deleted]
    
    fennecfoxy - 4 hours ago
    
    "Oh my God"
    
    mapt - an hour ago
    
    I relent to snarky Rick and Morty quotes because I don't know that it's useful any more to try to explain paperclip optimizers or alignment to a bunch of AI nerds who saw the cliff coming and clawed at each other trying to be the first out to leap over the edge.
    "Relentlessly proactive". That's one word for it. We have a whole subgenre of hard takeoff scenarios and it wasn't enough warning against "Relentlessly proactive".
    Turns out Frank Herbert was an optimist, and we're literally pinning our survival on robots turning out to naturally have impractically short attention spans.
  - Angostura - 3 hours ago
    
    It sounds like you learned lots of things related to the tool, but not so much about the problem that you were using the tool to solve?
    Is that fair? Not trying to snark? I see similar results myself
    
    abustamam - 2 hours ago
    
    Learning doesn't happen in a vacuum. Even pre-LLM days where I'd scour stack overflow for the solution to one problem, I'd inadvertently learn other random stuff while looking.
    
    simonw - 2 hours ago
    
    Yes, that's entirely fair.
  - wasabi991011 - 2 hours ago
    
    That's a lot learned about debugging, sure, but it's worthwhile to note that it doesn't tell you much about the abstractions used to build Datasette, as the previous commenters pointed out.
    
    simonw - 2 hours ago
    
    I designed those abstractions myself.
  - HarHarVeryFunny - 6 hours ago
    
    Are you using Claude Code or a different agent? I'm curious how screenshots are being fed back into the model? Does CC register a tool for this, or is Fable just using a bash tool to perform the screen capture, and then what tool is it using to request the resulting image to be fed back to it?
    
    vidarh - 3 hours ago
    
    Claude Code can process images by reading the files. And as I found out the other day, it also knows ffmpeg well enough to process videos even though it has no native video capabilities...
    While debugging, it asked me to pass it a video from the past testing, proceeded to generate a "contact sheet" of the video using ffmpeg, interpreted the image to figure out which frames it needed, and extracted the full size frames and extracted the relevant text from it and used it to reproduce the problem with Playwright...
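For readers unfamiliar with the "contact sheet" trick described above, here is a rough sketch of how such ffmpeg invocations could be assembled. These are not the commands Claude actually ran: the function names, the sampling interval and the grid size are all assumptions chosen for illustration. Only standard ffmpeg options are used (`-vf` with the `fps`, `scale` and `tile` filters, `-ss`, `-frames:v`).

```python
import subprocess


def contact_sheet_command(video_path, sheet_path, every_seconds=5,
                          columns=4, rows=4, thumb_width=320):
    """Build an ffmpeg argv that tiles sampled frames into one image.

    Assumed defaults: one frame every 5 seconds, a 4x4 grid, 320px thumbnails.
    """
    filters = f"fps=1/{every_seconds},scale={thumb_width}:-1,tile={columns}x{rows}"
    return ["ffmpeg", "-y", "-i", video_path, "-vf", filters,
            "-frames:v", "1", sheet_path]


def extract_frame_command(video_path, timestamp_seconds, frame_path):
    """Build an ffmpeg argv that grabs one full-size frame at a timestamp."""
    return ["ffmpeg", "-y", "-ss", str(timestamp_seconds), "-i", video_path,
            "-frames:v", "1", frame_path]


def run(command):
    # Requires ffmpeg on PATH; raises CalledProcessError on failure.
    subprocess.run(command, check=True)
```

The idea is to look at the cheap tiled overview first, then pull full-resolution frames only for the timestamps that look relevant.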
    
    HarHarVeryFunny - 3 hours ago
    
    It would be interesting to know if examples like this are things they explicitly trained it to do (presumably via RL), or if any of it is emergent. I'd have to guess trained, but in any case still impressive the lengths it will go to!
    
    vidarh - 2 hours ago
    
    It's hard to tell. Training it with lots of examples of ffmpeg would not be surprising, and training it on screenshots would also make a lot of sense. It's not inconceivable at all they'd train it on "figure out a video by creating contact sheets". The whole end to end I'd consider less likely, but it'd also be a very small leap once you have the elements.
    I think a lot will fall out naturally from relative modest levels of reasoning plus in-depth knowledge of what common tools will do. E.g. I also have used Claude to debug my compiler, and it knows gdb so much better than me that even though I know it's pretty useless at holding context through reading an assembly listing (lack of structure, I suspect), it's surprisingly good at working things out by just being good at exploiting a powerful tool.
    
    simonw - 4 hours ago
    
    I was using the Claude Code CLI harness. It can "read" any image file on disk, so all it needs is a way to create a file in one of the standard formats supported by the Anthropic API.
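As an illustration of "a way to create a file", here is a hedged sketch that combines the `screencapture` and `Quartz.CGWindowListCopyWindowInfo` calls mentioned earlier in the thread. It is not the script from the session: the helper names are hypothetical, matching windows by owner name is an assumption, and the inline dependency header assumes the script is launched with `uv run`. It only works on macOS.

```python
# /// script
# dependencies = ["pyobjc-framework-Quartz"]
# ///
import subprocess


def find_window_id(windows, owner_name):
    """Return the integer ID of the first window owned by owner_name, else None."""
    for window in windows:
        if window.get("kCGWindowOwnerName") == owner_name:
            return int(window["kCGWindowNumber"])
    return None


def screencapture_command(window_id, output_path):
    """Build the argv that captures a single window by its ID (-l)."""
    return ["screencapture", f"-l{window_id}", output_path]


def capture_window(owner_name, output_path):
    # Imported lazily: Quartz is only importable on macOS with pyobjc installed.
    import Quartz

    windows = Quartz.CGWindowListCopyWindowInfo(
        Quartz.kCGWindowListOptionOnScreenOnly, Quartz.kCGNullWindowID
    )
    window_id = find_window_id(windows, owner_name)
    if window_id is None:
        raise LookupError(f"no on-screen window owned by {owner_name!r}")
    subprocess.run(screencapture_command(window_id, output_path), check=True)
    return output_path
```

Once the PNG exists on disk, the agent can read it like any other image file.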
  - almostdeadguy - 6 hours ago
    
    It's like saying you can learn so much about math from using SymPy to solve equations. Yes, you probably can. If you pay close attention to what is happening and can integrate the techniques being used into your knowledge.
    But your learnings here are what, a handful of hacks? For most people it's like being shown the chain rule (which frankly, is more general than any of these learnings) without knowing what a derivative is. It's knowledge that comes context free. And even when it can be understood, I'm not sure I believe it gets integrated especially well when you did none of the work to understand it. If you are extremely diligent and self-aware about what your limitations are, and careful to be sure you have an understanding of this knowledge, sure I guess you can learn a lot.
    And ultimately what do you think is more likely? People using the experience of using these tools to progress their knowledge or for them to rely on the answers uncritically? I think people with a rosy view about this are severely undercounting the problems associated with the trust relationship between a person and an LLM and what that means.
    
    simonw - 6 hours ago
    
    > I think people with a rosy view about this are severely undercounting the problems associated with the trust relationship between a person and an LLM and what that means.
    Personally I think the impact of LLMs on children's education is a crisis right now.
    Kids are not going to learn to write if an LLM writes their essays for them. And writing is how you learn to think.
    
    mnicky - 5 hours ago
    
    > writing is how you learn to think.
    There's also reading. A lot of reading can substitute some writing.
    EDIT: Actually, I'd say that at first you need to do a lot of reading and _then_ writing can help your thinking as well.
    
    almostdeadguy - 6 hours ago
    
    I don't think it's just a problem for kids! I think this is a problem for many software engineers as well! Adults of all professions really.
    
    kmnfu - 5 hours ago
    
    [dead]
  - yumbumdum - 6 hours ago
    
    [flagged]
  - saberience - 7 hours ago
    
    And Fable is still worse than Codex.
    I use both and the only thing (as always) that I will use Claude for is UI design.
    Opus 4.8 and now Fable are still both worse at actually getting the job done than the Codex model. Claude models write FAR too much code when it's not needed, they burn far too many tokens, when they are not needed, write un-necessary tests, write plans which are 5 pages longer than are needed, etc. etc.
    Have you actually compared code quality and plan quality versus Codex? It's demonstrably worse.
    
    solenoid0937 - 6 hours ago
    
    I don't know what problems you're working on but Fable is not just better, it is a step change from GPT 5.5 in my experience. It feels at least one major model generation ahead.
    
    ModernMech - 3 hours ago
    
    One Hacker News commenter says it's worse, another retorts it's a step change and even includes emphasis! Will the first commenter retort back that it's been a double dog step change in the opposite direction? Can't wait to see how this comment thread unfolds!
    
    lossolo - 4 hours ago
    
    It doesn't for me. I use Fable to make plans, then give them to GPT 5.5 to review, and it always finds flaws and edge cases that Fable misses (some are really critical). It was the same with Opus 4.8. I'll admit it finds a bit fewer issues now, but Fable feels more like an incremental improvement than a major generation ahead.
    
    eddyzh - 2 hours ago
    
    For that test you have to compare letting a fresh agent (subagent) or the same model do the same review.
    The fact that a review helps does not prove the model choice for the review.
    You reviewing your own writing helps too!
    
    saberience - 2 hours ago
    
    This is exactly what I find too, I make plans in both models and compare them in the other model. And Claude usually agrees (65-80% of the time) that the Codex plan included things it didn't think of, or was better in some other way.
    Note, this is better than it was with Opus, where it was more like 90% of the time the Codex plans were obviously better.
    
    elbear - 7 hours ago
    
    Curious, which model do you use for Codex? I'm very happy with the solutions '5.5 high' finds. It's like it understands exactly what I mean and it also anticipates all sorts of situations. Before I used '5.5 medium' for some time and it was a bit underwhelming. It may sound funny but it's like it didn't care that much to do a good job.
    
    saberience - 3 hours ago
    
    I use GPT 5.5 High Fast, I often benchmark versus Fable (and previously did versus Opus) and it's night and day.
    Claude still (and has always) writes far too much code to fulfill a given spec or plan. It misses edge cases and is generally far too verbose.
    Claude also is (and even more so with Fable) super tokenmaxxing, i.e. it seems tuned to use the max amount of tokens per task, whereas Codex will simply get your job done as you specified with the minimum fuss and tokens.
    Codex feels way more steerable and just more "professional" as though I'm working with a seasoned engineer, versus someone smart but over excitable, like a super smart associate engineer.
    
    kolinko - 5 hours ago
    
    What are your harnesses? Do you have the same skillsets/tools/etc for both?
    
    saberience - 2 hours ago
    
    I use Codex and Claude Code. I've used both Codex and CC since release with basically every model they've ever released, I always try both for almost every plan that I write and benchmark the plans against each other, Claude almost always acknowledges that the Codex plan is better! Even now with Fable, this still happens.
    As in, I give the exact same prompt to Fable and GPT 5.5 Pro, then produce the plans, then give each model the other's plan. Claude always realizes it missed stuff and Codex usually ends up finding missing things in Claude's plan.
    This situation did improve with Fable versus Opus 4.8, but in general, Codex for me is still the better model.
    
    felixgallo - 7 hours ago
    
    In my experience writing about 50 programs with fable, opus, and GPT, fable is a significant step change better than opus which is significantly better than GPT. We must be doing different things.
    
    zeroonetwothree - 2 hours ago
    
    From what I’ve seen all three are close enough that I would be hard pressed to pick one. It seems to matter much more how I prompt than which of the three I am using.
    
    saberience - an hour ago
    
    I'm writing low-level Rust, distributed systems, also sandboxing tech which has to be secure and performant.
    The only thing I have Fable do now is create UIs or otherwise front-ends for systems where correctness doesn't matter as much.
    Anthropic models lead at making nice looking UIs for sure, but when it comes to making sure my Rust code is actually 100% correct and uses 1% of CPU most of the time, Codex is king.
- snowwrestler - 5 hours ago
  
  But Simon is not trying to get good at CSS debugging, Simon is trying to learn about AI systems and produce content about them. So giving the AI agent a trivial task to go crazy on is a feature, not a bug.
  For $12 implied cost, he got a front-page post on HN with 500 comments. What is that worth? :-)
  - sdesol - 4 hours ago
    
    > What is that worth? :-)
    This is one of those double-edged sword situations. It is on the front page and it stays there because it will trigger a lot of people, and he has to spend a lot of effort explaining himself. What is that worth?
    His explanations would most likely be buried deep so the impression that others get might be worsened. What is that worth?
    In my opinion, this is one of those find a harder problem and you would still have the same content...but it might not draw as much feedback and stay on the front page longer.
  - xnorswap - 4 hours ago
    
    To most of us that's worth a ton, whereas he's probably had enough front-page posts that there's less value to him, although still likely more than $12 worth.
    
    garblegarble - 2 hours ago
    
    >enough front-page posts that there's less value to him
    On the contrary, I'd say it's probably even more important - without (amongst doing other "thought leader" things) getting on the HN front-page regularly an influencer's value to the industry disappears (not criticising him here)
    
    simonw - 2 hours ago
    
    That's bad news for all of the other "AI influencers", off the top of my head I can't think of any with remotely my track record of hitting HN.
    (That's because they're all busy attracting millions of views on TikTok and YouTube, which are much more impactful channels than my dedication to blogging like it's 2005.)
    
    garblegarble - 2 hours ago
    
    That's what I meant by other thought leadership things - that's all covering different niches. For what it's worth, I think you do useful work and are a respectable influencer.
    I'd also say don't be down about your use of blogging - I'd say it makes you more valuable, there aren't that many decision-makers who are going to sit through a bunch of breathless YouTube videos...
    P.S. I hope you don't object to me using the term influencer, assumed you were on-board with it since in your post announcing your sponsorship you referenced Freeman & Forrest, "influencers on tap" / "building turnkey influencer marketing programs as a service".
    
    simonw - 2 hours ago
    
    Hah, yeah I'm still a little sore at the "influencer" term but I'm beginning to accept that it applies and I should get comfortable with it!
- jmmcd - 9 hours ago
  
  People are missing that Willison is among the very best people we have in the role of (for lack of a good name): early access to frontier models, evaluate them in real scenarios, no wishful thinking, hype, or doom, communicate the possibilities. Yes he could have fixed this himself but then he would have learned nothing about the AI, and we wouldn't have read a fascinating and important article.
  - risyachka - 8 hours ago
    
    >> he would have learned nothing about the AI
    there is absolutely zero value in spending time to learn about new models, as in a few months a new model will be out and whatever you learned about the current one will be useless.
    Also, with models getting better and better you have to know less and less to achieve the same results.
    
    simonw - 8 hours ago
    
    My experience has been the exact opposite.
    As the models get better you need to know more about their capabilities, because otherwise you risk prompting Claude Fable 5 like it's GPT-4o and complaining loudly about how it's all hype and nothing about these models is improving at all (yes, I do see people say that.)
    Getting the best results out of these models requires skill, experience, intuition, and domain expertise. There's always room for improving every one of those.
    
    Terretta - 6 hours ago
    
    The new benchmark for LLMs is how much of simonw's new know-how is required.
    Lower bars are better.
    
    isaacaggrey - 4 hours ago
    
    I agree but this particular example showed nothing about leveraging skill, experience, or intuition. If anything, this is another straightforward example of a one shot ask.
    edit: that said, I understand this particular post is about model capability
    
    ViscountPenguin - 7 hours ago
    
    Eh, I've had the exact opposite experience.
    Way back before instruct models it was pretty difficult, but for the last couple of years I haven't needed anything more complex than the type of text that I might send in a detailed email to a colleague.
    
    philipwhiuk - 7 hours ago
    
    Isn't the whole point of a better model that it should be better at understanding you than the previous one? So the same prompt should return a better answer.
    Prompting differently to the new model seems entirely backwards when trying to determine if the model has improved.
    
    simonw - 7 hours ago
    
    It doesn't matter how good the models get, they still won't be able to act on unclear directions.
    Learning to provide unambiguous, clear directions is a skill. A lot of people who report bad experiences with models aren't yet good at that skill.
    More importantly though, the key to successful communication is having a good understanding of what the other side of the conversation already knows and understands.
    Saying "use uv and inline script dependencies" won't mean anything to a model with a knowledge cutoff date prior to the launch of uv!
    
    yunwal - 5 hours ago
    
    It's perfectly possible to act on unclear directions. The correct course of action is asking clarifying questions.
    
    dasil003 - 6 hours ago
    
    I think this is true when models were going from bad to pretty good like happened last year. But when they start to get good, and can work deeper and with more nuance, how you prompt also can change the results quite a bit. Note this is also true of asking smart humans to do things; personality and approaches vary, they don’t exist on a single axis continuum of quality
    
    kmnfu - 5 hours ago
    
    [dead]
    
    Dumblydorr - 7 hours ago
    
    There’s zero value? Surely you don’t believe zero, it’s potentially the most powerful predictive AI in the world ever made? Maybe only incremental steps sure. But also their IPO is coming, you don’t want people evaluating them beforehand?
    
    lobocinza - 5 hours ago
    
    What is intelligence? Better to call it LLM.
    
    fragmede - 7 hours ago
    
    you know, women make a big deal about you meeting their father/parents, and honestly, I'm too autistic to really fucking have put any importance until now as to why that was remotely important, but if N+1 is coming for your job, it seems it might be worth your while to know the capabilities of N, no?
    
    redsocksfan45 - 2 hours ago
    
    [dead]
- justinclift - 2 hours ago
  
  > By offloading this trivial task to the LLM, Simon has abandoned the opportunity to evaluate the abstraction [...]
  While by itself that would be true, Simon commonly blogs about things he's up to.
  That action provides the opportunity for evaluation, and additionally evaluation by a wider audience.
  So, it's not the same scenario as non-bloggers offloading a task... :)
- - 2 hours ago
  
  [deleted]
- discordance - 9 hours ago
  
  I see it as a prioritization exercise. I know the above is a trivial example, but more generally, does the guy who wrote Datasette and Django want to wrangle front end and css, or do they want to work on something else?
  - smartbit - 6 hours ago
    
    See above https://news.ycombinator.com/item?id=48498573#48502311
- oulipo2 - 9 hours ago
  
  [flagged]
  - simonw - 8 hours ago
    
    Here's a handy calculator you can use to estimate how much CO2 and water I wasted with my coding agent session: https://www.andymasley.com/visuals/ai-prompt-footprint/
    
    PinkaDunka - 6 hours ago
    
    Not sure what point you wanted to make, but this calculator is quite shocking. GPT 5.5 pro, with "a long document" and 10 requests a day gives 25% of daily CO2 emissions!
    Ten coding sessions a day with Opus is still 4.7%!
    This feels enormous. I will definitely stop rolling my eyes when people complain about AI CO2/water usage...
    
    simonw - 6 hours ago
    
    GPT-5.5 Pro is a notoriously expensive model, it's 6x the price of GPT-5.5. Not something to use as a daily driver!
    That ten coding sessions a day with Opus number feels more credible to me.
    
    TaupeRanger - 6 hours ago
    
    What are you on about? Maybe 1 out of 100,000 users is using 5.5 Pro to make 10 "Long Documents" as defined in that tool EVERY day. What a silly thing to harp on.
    Six 100,000 token Claude coding sessions use less energy than a dryer load, and less water than making one egg. If you are truly concerned about energy and water usage, AI is not even in the top 100 things you should be concerned about in your daily life.
    
    kmnfu - 5 hours ago
    
    [dead]
    
    oulipo2 - 3 hours ago
    
    The real point is not "one session", it's the fact that people now do that routinely, that CICD are using those to check every commit, and each search engine query now does that too, so it multiplies
    
    AmbroseBierce - 6 hours ago
    
    This very obtusely omits the demand for new data centers and related infrastructure that using AI creates. The going "vegan for a year" option assumes fewer cows being born, but somehow the "don't use AI" option doesn't assume that the data center wasn't built in the first place.
    
    lolinder - 5 hours ago
    
    The discrete number of cows being born is theoretically fine-grained enough to actually respond to 2–3 vegans yielding one fewer cow. It's unlikely on a one-year time scale, but one cow only goes so far.
    Even a thousand AI objectors aren't going to limit the demand for a data center, in no small part because these investments are only partially driven by current demand and are significantly driven by expectation of future demand. And they're really not going to lead to smaller data centers either because if you're building a data center in the first place you're going to spec it out for future demand.
    Regardless, I think in both cases it's important to be realistic about the actual impact that one person has. If that number is disappointingly small, that serves as signal that your conscientious objection isn't making the industry you're objecting to as uncomfortable as you would like to think. It may still be worth objecting for your own sense of self, or maybe it serves as an invitation to evangelize your position more, but either way there's not much value to measuring things in a way that gives you an illusion of greater impact than you actually have.
  - beernet - 8 hours ago
    
    [flagged]
    
    deaux - 5 hours ago
    
    As someone who actually gives a shit about the environment and global warming and has been putting this into practice for more than a decade through daily personal sacrifices: no, I downvote it because if you properly look into it, AI is just completely insignificant compared to cars, air travel, clothing, food, needless junk and so on that it's a joke. It's always brought up by people who never cared, but now pretend to do so because they hate LLMs for other reasons. The irony is that some of those are actually _good_ reasons but they're too cowardly to admit them. There's nothing unmanly about admitting you're afraid of AI taking your job, becoming more intelligent, and ending up in a dystopia.
    Go run the numbers and compare them vs. what it takes to produce a single hamburger or hoodie. Anyone who actually cares has already done this and drawn this conclusion.
    
    oulipo2 - 3 hours ago
    
    Have you heard of the "rebound effect"? Sure, you can say that individually one query is not that much... but then it becomes integrated into search engines, so suddenly where there were no queries at all there are now 500 billion per day, and it gets included in your CICD at every commit, and soon enough in your OS, etc.
    
    deaux - 2 hours ago
    
    "Run the numbers" means "run the numbers for using agentic coding for 2 hours per day on a frontier model" not "run the numbers for a single query". The former is the worst case scenario.
    Google Search's "AI", which is what you're hinting at is such a good example. Let's say there's 10 billion Google searches per day. 10 billion completions on what is going to be a very tiny, ultra finetuned model with lots of caching (including outputs).
    Check out how many queries an hour of agentic coding results in. And input/completion tokens. Estimate energy usage of Opus vs something like Gemma 4 E2B. Calculate how many developers using Opus for coding 1 hour a day would equate to those 10 billion search query originated LLM calls.
    You could not have provided a better example to show that without running the numbers you'll end up with assumptions that oppose reality.
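For anyone who wants to actually run that comparison, here is a minimal Python sketch. Every parameter is a placeholder to be filled in from your own measurements or published figures; nothing here is real data, and the function name is made up for illustration.

```python
def equivalent_developers(search_calls_per_day, energy_per_search_call,
                          agent_calls_per_dev_day, energy_per_agent_call):
    """How many developers' daily agentic coding matches the total
    energy of all search-originated LLM calls in a day.

    All four inputs are assumptions supplied by the caller; use the
    same energy unit (e.g. Wh) for both per-call figures.
    """
    search_total = search_calls_per_day * energy_per_search_call
    per_developer = agent_calls_per_dev_day * energy_per_agent_call
    return search_total / per_developer


# Toy values purely to show the arithmetic, not an estimate of anything.
print(equivalent_developers(100, 1.0, 10, 5.0))
```

The point of writing it this way is that the answer depends entirely on the two per-call energy figures and the calls-per-developer count, so those are the numbers worth arguing about.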
    
    _heimdall - 7 hours ago
    
    That's an interesting choice as a source. It doesn't mention climate change or human impacts at all and describes El Niño as a naturally occurring event.
    > The El Nino is a phenomenon that occurs naturally
    
    dahart - 4 hours ago
    
    El Niño has been occurring naturally for more than 10,000 years. https://en.wikipedia.org/wiki/El_Ni%C3%B1o%E2%80%93Southern_...
    
    oulipo2 - 3 hours ago
    
    The frequency and magnitude of the event are directly related to the warming of the climate
    
    brickers - 4 hours ago
    
    El Niño is a naturally occurring event
    
    user43928 - 8 hours ago
    
    While one can raise environmental concerns about the AI datacenter buildout, I don't think it is fair to say that it "ruins the planet".
    I don't think it is a good contribution to the discussion around Simon's LLM use to fix a CSS bug.
    
    harperlee - 8 hours ago
    
    It was posted at 5am in New York... not sure that that was a US view, so the fact that the platform is US-owned doesn't seem so relevant, if there's a global audience.
    That being said, I do agree it is a legit thought (and moreso, completely on point in the subthread discussing downsides), and that it shouldn't be downvoted.
    
    vitalyan1234 - 7 hours ago
    
    [flagged]
Illniyar - 3 hours ago

I think Fable is predisposed to try and verify its changes. Which is a very good thing. It takes a lot of prompts to get Opus to do what Fable does unprompted.
That is exactly what I would want from a junior developer - make sure the bug exists, find a way to fix it, verify the bug is fixed.
The problem, as was correctly identified in the blog post - is that instead of stopping and asking for elevated permission it relentlessly tries to find a hack on its own. (An equivalent situation for a human developer would be needing some access to a third-party sandbox and, instead of asking a senior for credentials, trying to set up their own sandbox from scratch.)
- AtNightWeCode - 3 hours ago
  
  No, the problem is mostly the incorrect prompt that sent Fable down a rabbit hole, resulting in an incorrect solution.
elicash - 6 hours ago

I misread your comment at first and thought you were insulting Simon Willison, rather than calling Claude Fable a bad developer, and so I'm commenting here to clarify it in case others also misread it.
That first sentence threw me off.
Anyway, I'm glad he spent the $12 because this blog post was highly informative.
geysersam - 5 hours ago

This is the worst thing about current AI agents. They never ask questions. The prompt has to be pixel perfect and unambiguous or they'll happily run away doing something ridiculous.
- 3 hours ago

[deleted]
andy_ppp - 5 hours ago

Yes I agree, the solution committed is horrible, but nobody cares any more. We have entered a very strange parallel universe where, because AI can work things out, it's easier to take solutions that are suboptimal and just churn out (potentially) buggy features.
- simonw - 4 hours ago
  
  I care. If you can loosely point me in the direction of a better solution I'll do the extra work.
subygan - 3 hours ago

This is missing the point. Simon is a fantastic developer, but keeping track of all the nuances of the frontend frameworks and browser implementations is a lot even for great people.
It is really awesome that the final change was only a two-line CSS change.
- AtNightWeCode - 3 hours ago
  
  But the fix is wrong as pointed out by the poster...
simonw - 8 hours ago

You missed what I think is the most interesting question: why does the bug appear in Safari macOS but not in Firefox, Chrome, or WebKit running inside of Playwright?
(Dozens of people in this thread implying that any web dev should have known to solve it with overflow-x: hidden and not one of them has addressed that browser difference yet.)
- hennell - 2 hours ago
  
  I think any web dev knows not to question browser differences if it can be fixed without opening that can of worms.
- zeroonetwothree - 2 hours ago
  
  Safari has some differences in default scroll behavior. I’ve seen similar bugs pop up many times.
- fragmede - 3 hours ago
  
  people pay good money to not have their shit rendered via Playwright!
gib444 - 10 hours ago

The 'better' fixes are often for our (human) benefit. These messy fixes serve the AI companies' interests of creating messes that need even more tokens (money) later. Bad and self-serving developers also act the same, creating tech debt
flyingshelf - 9 hours ago

[dead]

teraflop - 16 hours ago

> But on the other hand... this is a robust reminder that coding agents can do anything you can do by typing commands into a terminal—and frontier models know every trick in the book and evidently a few that nobody has ever written down before.

> Running coding agents outside of a sandbox has always been a bad idea

I'm continually bemused and astonished by the number of people who clearly acknowledge that it's reckless to give agents full access to your machine, and keep doing it anyway.

It's like posting a video of yourself in the passenger seat of a car, with your feet up on the dashboard, and saying: "Remember, if you're doing this and you get in a crash, the airbags are likely to break your legs or worse! Boy, I sure am glad that didn't happen to me!"

exitb - 13 hours ago

You’ve picked an interesting example, as driving a car, even with all safety precautions, is pretty much the most dangerous activity we do on a daily basis. Yet somehow we decide that the benefits outweigh the risks.
- bsza - 12 hours ago
  
  It's a completely different story. For cars, it happened because of relentless pressure from the auto lobby. It took years of propaganda from oil companies, car makers etc. to make us think the road is for cars [1]. We demolished and rebuilt entire cities to accommodate cars, partly because they gutted the public transport sector [2]. This made our infrastructure so hostile to our own bodies that we have no choice but to use cars now. We bought their products because they forced them down our throats. There is nowhere near that kind of pressure behind the adoption of... oh dear lord.
  [1] https://www.todayifoundout.com/index.php/2022/06/how-lobbyis...
  [2] https://en.wikipedia.org/wiki/General_Motors_streetcar_consp...
  - killerstorm - 11 hours ago
    
    I don't think the pressure of the auto lobby is really the reason.
    People feel cars are more convenient and more prestigious than riding on a bus. Car lobby certainly accelerated the process, but car users were the main driving force.
    
    CalRobert - 9 hours ago
    
    The auto lobby invented the word jaywalking to shift the liability for dead pedestrians from the people doing the killing to the people doing the walking.
    The US also had protests when drivers killed kids, but they were ultimately unsuccessful, except for the odd traffic light installation. https://medium.com/vision-zero-cities-journal/the-baby-carri...
    Even in Amsterdam the original "stop the child murder" protests only barely succeeded, and it took a massive oil crisis and a population that could still (if only just) remember what life was like before cars took over their city to get there.
    
    masklinn - 9 hours ago
    
    > Car lobby certainly accelerated the process, but car users were the main driving force.
    Not really. We know it’s not as much of a natural force as some would like it to be because there are places where the lobbies lost, and while cars are common and widespread they’re nowhere near as dominant as they are in, say, the USA.
    NJB’s next video (currently available on nebula) is about exactly that, Amsterdam’s (/ De Pijp’s) resistance to cars and car lobbying.
    
    hylaride - 4 hours ago
    
    Subsidies played a huge role, including the eminent domain bulldozing of cities for free-at-use highways. If people had to pay upfront for those costs, the urban landscape would look much different (probably closer to Japanese cities, which do have massive suburbs, but centred around train stations).
    Yet Japan does still have cars (and a car culture even), they're just not necessarily the default or dominant mode of transport.
    
    masklinn - 3 hours ago
    
    Sure, nobody is saying cars are useless or unfun, I'm just pushing back against the idea that everything car everywhere is a natural and intrinsic outcome from cars existing. As I noted, even in the Netherlands cars are common: the Dutch have a very dense road network, and a fair amount of cars.
    
    hylaride - 2 hours ago
    
    I think we're on the same page.
    For me, cars are a perfectly fine mode of transport, but the way so many places prioritize it over alternatives (whatever the reason) isn't necessarily better.
    My "wtf" moment was 20 years ago when I was visiting my cousin in an exurb and we sat in a line of cars for over 40 minutes waiting for our turn to pick up her kid. The messed up part was that while there were school busses, everything was so spread out that the bus ride for them would have been over an hour and then another 20 minute walk from the arterial road drop-off point to their house. Everything was far away, including local public parks.
    
    Chu4eeno - 8 hours ago
    
    Isn't Not Just Bikes some US expat/biking maximalist?
    I'm not sure I'd take him as some neutral authority on the history of cars and driving in Europe.
    
    chriswarbo - 7 hours ago
    
    > Isn't Not Just Bikes some US expat/biking maximalist?
    According to their videos, they prefer trams within cities; generally take trains between cities; and acknowledge that cars are very useful for places which aren't so well connected (e.g. places that are far apart which aren't on a train line). They think encouraging the use of cars within cities is a bad idea (dangerous, scales poorly, makes those areas less pleasant to be, etc.).
    Not what I'd think of as a "biking maximalist".
    They do show themselves cycling to places that are nearby. Does that make Youtubers who record videos in their car "driving maximalists"?
    
    Chu4eeno - 5 hours ago
    
    I wasn't very familiar with the channel, sorry.
    Not US expat either (or not yet), Canadian.
    
    masklinn - 7 hours ago
    
    > Isn't Not Just Bikes some US expat/biking maximalist?
    You should really ponder the sanity of asking if a channel called “not just bikes” is a bike maximalist.
    
    kubb - 11 hours ago
    
    Surely people feeling that way can be attributed to the industry?
    
    mdp2021 - 11 hours ago
    
    For hopefully most people, it should be attributed to the "Wait, now I have such a freedom and power?".
    Opposite to "before the invention of bicycle, people married within a radius in the order of the mile" (can't remember the exact stat right now).
    
    ZeroGravitas - 10 hours ago
    
    It's like that feeling of power you get from owning a gun that you only bought because you feared all the other people who owned guns.
    
    kakacik - 10 hours ago
    
    No, it's much more straightforward, but I get it - there is no warm fuzzy feeling of discovering yet another global evil conspiracy out there set to get all of us.
    We are a family of 4 with 2 small kids. Whenever we travel, it's a series of backpacks, other bags, other stuff, and then some more. Heck, even if I travel alone it's almost never just me - there are heaps of garbage to dispose of, big shopping bags to bring back, a big backpack with camping or climbing or skiing gear etc.
    It would have been an absolute, utter nightmare to do this over public transport. This comes from a European who has generally very good public transport (given a rural area) and the world's best train network specifically (Switzerland). Yet roads are chock-full of cars and every year there are more.
    Public transport simply ain't cutting it for anything but the simplest use cases, i.e. just me and nothing or a small backpack. Some routes I take would take 3-5x longer with public transport, or are just not possible at all. No industry massage required here, ever. Not everybody lives in some dense city and never leaves it for evenings or weekends.
    
    CalRobert - 9 hours ago
    
    Switzerland does have roads choked full of cars. It also has pretty mediocre bike infrastructure.
    But this is kind of beside the point - even in the Netherlands I also would use a car if I were taking camping and skiing gear with the kids, and that's fine. But I can also take them in the bakfiets to the grocery store when I want, and that's also fine. Cars have their purpose, but you shouldn't _have_ to use one for basic trips.
    
    kakacik - 5 hours ago
    
    Well, here is where we differ - what is a basic trip for you may not be a basic trip for me or the next Joe. Maybe they don't even have a walking path to their house. Maybe the closest grocery store is 5km away on roads which are incompatible with safe cycling (many parents don't give a fck and just ride, throwing a tiny little dice with every truck passing centimeters from them and their young kids at high speed). Maybe XYZ.
    Don't judge others in some complex situation just because in your case there is some simple straightforward solution. Yes, the Netherlands has top-notch cycling infra, but that's nowhere else to be seen and won't be seen for quite some time. And don't force your solution onto everybody regardless of fit, that doesn't work long term (aka the EU approach to things, or why much of the eastern part hates it).
    
    kortilla - 11 hours ago
    
    It’s privacy vs not. It doesn’t really need special lobbying
    
    kubb - 11 hours ago
    
    I’m sure that isn’t the full answer. Otherwise car ads wouldn’t be necessary and more affordable cars would outcompete the expensive ones.
    There’s the utility component, the prestige factor and other things.
    
    somenameforme - 5 hours ago
    
    Oh man what a perfect example to be had here. So historically exactly what you've said is 100% what happened. By the time Ford really mastered manufacturing, he managed to get the price of the Model T down to $260 around 1925, about $4,600 in current terms for a premium car!
    Needless to say everybody was buying one and he was rocking it. Then came along General Motors and they were desperate to find any way to compete. They couldn't compete on price or quality, so their CEO is credited with inventing planned obsolescence, and turning cars into a fashion. They'd release a new style each year alongside plentiful marketing implying that the old styles were outdated, and it was wildly successful.
    So yeah, needless to say people have always genuinely wanted their own cars. But it's also true that companies have managed through advertising to create artificial demand for vehicles that don't objectively make sense. To some degree reality is catching up at least though. Aston Martin is on the verge of bankruptcy and BYD is the largest electric car company in the world, by a wide margin.
    
    lan321 - 9 hours ago
    
    Comfort, utility, fun, status. Every person has their own mixed requirement of those that then gets applied to their budget. Expensive for me is probably cheap for our CEO and cheap for me is probably expensive for our interns :)
  - jcfrei - 5 hours ago
    
    Whether public or individual transportation makes more sense really depends on a country’s geography and people’s housing preferences. Public transportation is not always the best option.
  - zaphirplane - 9 hours ago
    
    Are there real, acknowledged cases of multiple companies coming together to bribe some state-level people to increase their profits, splitting the bribe across the companies? Like GM, BMW and Honda coming together, bribing, and splitting the bill. Seems unlikely, though there was a RAM price-fixing agreement that got caught, but then again they were caught because of the number of people aware.
  - HPsquared - 9 hours ago
    
    There was surely also a lot of political will coming from car users. Motorists are a large and vocal constituency.
  - marknutter - 2 hours ago
    
    I think it might be because people like to own and drive cars.
  - __alexs - 11 hours ago
    
    I mean that kind of seems like exactly what's happening for AI to me.
  - zeroonetwothree - 2 hours ago
    
    Typical comment that probably comes from a healthy, childless, young person with no disabilities that can’t understand why people not in that situation might have different requirements from transportation.
- devsda - 12 hours ago
  
  In case of driving the stakes are equally high for everyone on the road. Can we say the same for an agent?
  Having an agent is like forever having a genius intern who'll almost always do the perfect job for you. But there is non-zero chance that they'll also come up with quirky solutions and execute those with confidence and no follow-ups. You don't grant the intern production access and hope they check with you.
  I don't think the corporate equivalent of "dog ate my homework" flies, if the dog ate your files and your production DB if you are unlucky.
  - danielhep - 7 hours ago
    
    I don’t think that’s really true of driving, pedestrians and cyclists are at a much higher risk of getting killed by a driver than a driver themself. There are huge negative externalities to driving
  - Zambyte - 7 hours ago
    
    > In case of driving the stakes are equally high for everyone on the road
    The stakes are significantly higher for everyone outside a car. This seems like a pretty good metaphor for slop bombing people who don't use AI. People drive because they don't feel safe around everyone driving. People slop bomb because they can't handle all the slop.
- illiac786 - 11 hours ago
  
  What do you mean “somehow”? You make it sound like people don’t weigh benefits and risks. If you do not live in a large city, the benefits are so immense in terms of mobility, they outweigh the risks for most, very clearly. That’s why in large cities far fewer people hold a driving license, for example: the benefits are just not there anymore.
  Granted, on the downsides, people look at cost more than risks.
  - icantevenhold - 11 hours ago
    
    I think they weigh the benefits and risks but then completely discard the risks, because humans are bad at evaluating risks.
    More than a million people die each year on the road but for some reason terrorism and cancer dominate the risk assessment of people.
    I bet any money that almost all people aren’t really afraid of entering a death box every day to drive to work.
    How could they be; a lifetime of brainwashing doesn't let them assess the risk realistically
    
    - 3 hours ago
    
    [deleted]
- bcrosby95 - 3 hours ago
  
  Lots of people die driving because people drive a lot. It's something like 1 death per 100 million miles driven.
- selfhoster1312 - 13 hours ago
  
  Yes, but we usually use cars as a means to an end. Have you ever met a manager who set up gasmaxxing policies and criticized employees for doing their job instead of driving?
  - neuderrek - 12 hours ago
    
    I know sales people in pharma who spend all day driving, not only for sales visits but also drive doctors for their personal errands, and all this driving is encouraged by management.
    
    - 11 hours ago
    
    [deleted]
  - moomin - 12 hours ago
    
    Having played with Fable a bit, if it doesn’t kill tokenmaxxing I don’t know what will.
    
    selfhoster1312 - 12 hours ago
    
    I'm interested in what you mean, if you could elaborate. Would it kill tokenmaxxing because it's so bad? Because it's incredibly efficient? Because it's way too expensive?
    
    moomin - 8 hours ago
    
    My perception is that it’s good, but very expensive. I would not be surprised if regular users, if they shifted their flows to Fable at API pricing, would be racking up $200 a day, not a month.
    
    coldtea - 11 hours ago
    
    Because it's too expensive AND inefficient in token usage
- Gud - 11 hours ago
  
  Not really. That decision was taken for you (I’m presuming you live in the US) by the American car industry and their paid-off politicians. Your cities used to have beautiful public transport until it was dismantled.
  Unfortunately in Europe the German car industry similarly has a lot of power, hence why their shitty rail network fucks up the whole continent's.
  I take the train and tram.
- NooneAtAll3 - 11 hours ago
  
  A user using a computer is also the most dangerous activity for their data on a daily basis
- andrepd - 12 hours ago
  
  > Yet somehow we decide that the benefits outweigh the risks.
  More like malicious lobbying and incompetence made it impossible in many places to use any other form of transportation, despite there being safer, faster, cheaper, and healthier ways to move around. Which, come to think of it, makes this a rather nice analogy for the current situation... :)
- customguy - 12 hours ago
  
  The example wasn't "driving a car". The benefits of putting your feet up on the dashboard do not outweigh the risks, at least not where there is actual traffic. I don't think I saw a single person doing that in real life, ever.
qurren - 15 hours ago

> I'm continually bemused and astonished
I'm not. Everyone is told to get 10X the amount of shit per day done these days. Safety checks are out the window at that point.
- satvikpendem - 15 hours ago
  
  You can get 10x shit done without `rm -rf`ing your files. I don't see any correlation to getting things done with having a proper sandbox.
  - koliber - 10 hours ago
    
    I'm being a little facetious when I write this, but bear with me:
    Let's say I have daily backups, and get 10x done each day by being reckless and risking an "rm -rf", and let's say there's a 1% chance of an "rm -rf". I break even after 2 days of being reckless even if I get unlucky and on day 2 it wipes my drive. I spend day 3 and 4 recovering, and am still 6 days ahead based on the 10x work I got done on day 1.
    What if I have a 50 day streak of not hitting an "rm -rf"? Early retirement?
    I guess the work on day 1 should be to build a proper sandbox and drop the chance of an "rm -rf or worse" even down to 0.001%.
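The back-of-envelope above can be written down as a small Python sketch. It assumes a wipe loses only that day's work (daily backups), that recovery takes a fixed number of days, and that nothing worse than a wipe happens (no leaked credentials, no trashed production DB). The function names are invented for illustration.

```python
def reckless_rate(speedup, p_wipe, recovery_days):
    """Expected units of work per elapsed day when working recklessly.

    A careful worker is assumed to produce 1 unit per day.
    """
    expected_work = speedup * (1 - p_wipe)       # a wipe day produces nothing
    expected_days = 1 + recovery_days * p_wipe   # a wipe adds recovery days
    return expected_work / expected_days


def break_even_wipe_probability(speedup, recovery_days):
    """Daily wipe probability at which recklessness only matches 1 unit/day.

    Solves speedup * (1 - p) / (1 + recovery_days * p) == 1 for p.
    """
    return (speedup - 1) / (speedup + recovery_days)


print(reckless_rate(10, 0.01, 2))          # 10x speedup, 1% wipe chance, 2 days to recover
print(break_even_wipe_probability(10, 2))  # wipe chance where the gamble stops paying
```

Under those assumptions the gamble pays off at almost any wipe probability, which is the facetious point. The model stops being useful as soon as the failure is something a backup can't restore.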
    
    biztos - 9 hours ago
    
    > Early retirement?
    Your manager will look at your token usage and the number of Jira tickets you closed, and if you have not increased both 10x in the past year then you will be let go. 10x is the new 1x.
    Whether that's early retirement depends on how much money you have.
  - lelandfe - 14 hours ago
    
    https://github.com/anthropics/claude-code/issues/13371
    > Additional bypass examples that all execute without permission:
    > echo test ; git rm file.txt
    > rm --force --recursive /home (if "rm -rf" is blocked)
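To illustrate why both of those get through a string-matching filter, here is a hedged Python sketch. It is not how Claude Code's permission layer is implemented (I don't know that); `naive_allows` and `invokes_rm` are hypothetical helpers that contrast a substring denylist with a check that tokenizes the command first.

```python
import shlex

# Shell operators that start a new simple command.
SEPARATORS = {";", "&&", "||", "|", "&"}


def naive_allows(command):
    """Substring denylist: only catches the literal spelling 'rm -rf'."""
    return "rm -rf" not in command


def split_commands(command):
    """Split a command line into argv lists, one per simple command."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    segments, current = [], []
    for token in lexer:
        if token in SEPARATORS:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(token)
    if current:
        segments.append(current)
    return segments


def invokes_rm(command):
    """True if any chained command is `rm ...` or `git rm ...`."""
    for argv in split_commands(command):
        if argv[0] == "rm" or argv[:2] == ["git", "rm"]:
            return True
    return False


for cmd in ("echo test ; git rm file.txt", "rm --force --recursive /home"):
    print(cmd, "| naive allows:", naive_allows(cmd), "| tokenized flags:", invokes_rm(cmd))
```

Even the tokenized version is not a security boundary: command substitution, scripts, aliases, `find -delete` and so on all walk straight past it, which is why the sandbox matters more than the filter.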
    
    Chu4eeno - 12 hours ago
    
    It really is vibecoded.
    I never really dug into the leaked code, but calling that there a security layer is a joke.
    (And I really don't get why they give it actual shell access either, implementing a "fake" one for something like a honeypot takes a couple of days, not much more if it needs to persist/map to actual files.)
  - qurren - 15 hours ago
    
    I haven't yet had an agent rm -rf files.
    I've had one f up an account by placing 2000 limit orders at the wrong price, but that's another story.
    
    numeri - an hour ago
    
    I've had it happen. I ran an experiment, taking a couple hours and producing ~2 GiB of files. One of the results looked good, so I told Claude Opus 4.5 (at the time) to commit the code changes, upload the important file to cloud storage, then clean up the rest.
    I then saw it run `rm -r results/`, before messaging me: "Now all that's left is for you to upload the successful results, then I'll delete the rest!"
    Why did it not upload the files itself, when it had been using the cloud storage CLI during that session? No clue. I do accept that I could have and should have just uploaded the file myself. It would have taken 3 seconds to type.
    
    Majromax - 5 hours ago
    
    > I haven't yet had an agent rm -rf files.
    That happened to me once; I was running one of a few free-tier models in a pi-coding-agent session. The bash tool there is stateless and always begins from the launch directory, but the agent assumed state and executed `rm -rf .` intending to remove a build directory. Instead it removed the whole project tree, including session logs and notes.
    This was mostly a matter of amusement for me since I was running the agent inside a bubblewrap sandbox for that very reason, and the project itself was not very important.
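The failure mode is easy to reproduce harmlessly. A minimal Python sketch, assuming a POSIX `sh` is available; `run_stateless` is a hypothetical stand-in for a tool that starts every command from the launch directory, not pi-coding-agent's actual code.

```python
import os
import subprocess
import tempfile


def run_stateless(command, launch_dir):
    """Run one shell command in a fresh shell that starts in launch_dir."""
    result = subprocess.run(
        ["sh", "-c", command],
        cwd=launch_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def demo():
    """Show that a `cd` in one call does not carry over to the next."""
    with tempfile.TemporaryDirectory() as project:
        project = os.path.realpath(project)  # resolve symlinked temp dirs
        os.mkdir(os.path.join(project, "build"))
        first = run_stateless("cd build && pwd -P", project)
        second = run_stateless("pwd -P", project)  # back in the project root
        return project, first, second


project, first, second = demo()
print("after cd build:", first)
print("next call:     ", second)
```

The second call prints the project root, so an agent that believes it is still in `build/` and runs a relative `rm -rf .` is deleting the whole tree.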
    
    digitaltrees - 13 hours ago
    
    Well then you are behind the cutting edge.
    
    marknutter - 2 hours ago
    
    Proper hooks prevent this from happening
    
    antonvs - 14 hours ago
    
    I've had agents run `rm -rf`, but it's been on directories that did actually need to be removed. To a certain extent I think the existence of `rm -rf` as a command that runs blindly without any understanding of what it's deleting is the problem.
    
    KronisLV - 8 hours ago
    
    > To a certain extent I think the existence of `rm -rf` as a command that runs blindly without any understanding of what it's deleting is the problem.
    Yes, and the lack of a Recycle Bin of any sort is even more puzzling. I think both servers and desktop PCs across all OSes should have it by default, so unsafe deletes would be something you'd have to go out of your way to even enable.
    
    dumbdumb125 - 14 hours ago
    
    I've had one sever its own internet connection. Less destructive, also more humorous.
    
    ghrl - 7 hours ago
    
    Yeah, spot on. I had an agent delete some files it shouldn't have as well, similarly to me making the same mistake. I think system prompts should default to using `trash` over `rm`. For now that's just in my AGENTS.md, and gets honored most of the time.
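For what "trash over rm" means mechanically, here is a rough Python sketch. It is not the real `trash` CLI; `soft_delete` and the trash directory layout are made up for illustration.

```python
import os
import shutil
import tempfile
import time


def soft_delete(path, trash_dir):
    """Move path into trash_dir instead of deleting it; return the new location."""
    os.makedirs(trash_dir, exist_ok=True)
    # Prefix with a timestamp so repeated deletes of the same name don't collide.
    name = f"{time.time_ns()}-{os.path.basename(os.path.normpath(path))}"
    destination = os.path.join(trash_dir, name)
    shutil.move(path, destination)
    return destination


with tempfile.TemporaryDirectory() as workdir:
    victim = os.path.join(workdir, "notes.txt")
    with open(victim, "w") as handle:
        handle.write("important")
    moved_to = soft_delete(victim, os.path.join(workdir, ".trash"))
    print(os.path.exists(victim), os.path.exists(moved_to))
```

The trade-off is that nothing ever frees disk space until the trash is emptied, but that's the whole idea: the destructive step becomes a separate, deliberate action.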
    
    l72 - 17 minutes ago
    
    You can always use something like this [1], which will make sure any file removed on the command line via rm (or other utilities, like git rm) ends up in the trash instead
    [1] https://github.com/faratech/trashd
    
    lstodd - 14 hours ago
    
    the answer is rm -f `which rm`, yes?
  - estetlinus - 14 hours ago
    
    rm -rf is the least of your concerns.
harrall - 15 hours ago

I started doing it months ago and, to be honest, what the agent chooses to do isn’t unpredictable.
The problem is that different people prompt so differently.
For example, I may ask like “test different variations of this annotation on k8s pods of this service on this X cluster because it proves Y theory.”
But you know what my coworker asks? “Test Y theory.” If you were to ask two different junior engineers that, one might try random things on production and the other one might run local tests! It’s such an unguided “do anything you want as long as you figure it out” request and the agent reads it like a junior who has not been told any boundaries but has been strongly told “figure it out.”
- mrandish - 12 hours ago
  
  > But you know what my coworker asks? “Test Y theory.”
  It still surprises me when I see people not prompting more specifically and clearly. It not only avoids problems, it's faster, costs less, and just works better.
  I recently shared with a friend a multi-hour LLM chat session I'd done because it veered into a domain he's interested in. In the session I'd brainstormed and probed the feasibility of a novel concept for a new research direction. It traversed a half dozen domains diving into minute detail then zooming back out to survey an adjacent space, interspersed with intense skeptical probing of key assumptions, all while spewing tons of detailed citations, specific paragraph pulls, summarized data tables etc.
  My friend is very experienced using LLMs for research so I was surprised when he called me shocked by the sheer velocity, precise targeting and signal/noise. I'd assumed everyone did it the same as I do. He attributed the different result solely to the way I crafted my prompts.
  - dr_dshiv - 11 hours ago
    
    I used to write detailed prompts. Now I find the benefits of strategic ambiguity — rather than speaking imperatively, I emphasize my vision and then Claude can often figure out a method.
    This doesn’t always work better. But often enough.
    
    mrandish - 10 hours ago
    
    That's actually what I do too. What I was trying to say is that my prompts are precise in the sense that whether they're vaguely ambiguous or hyper-detailed and highly directive it's always very intentional to improve the response in the direction I want. The difference can have significant impact as shown in research on how LLMs naturally mirror user's prompts.
    I noticed this last year and started experimenting which led to several realizations about how my prompt's tone, style, length, format, word choices and even punctuation can have very counter-intuitive impact on model responses. It's not that one strategy always gets "better" results, they're just different in specific ways, which can make one input style better for one context but worse for another. I first noticed this effect when modding my user prompt so major topic headings would always be numbered. It's surprisingly difficult to get it to reliably use the same simple scheme due to various potential ambiguities. So, I spent a little time word-smithing, lawyering and tuning the prompt but I found the closer I got to full compliance on heading numbering, the more unrelated things would drift. Like it would just stop using bullets, even though I never mentioned anything about bullets.
    Then I changed the prompt to "Change nothing about your default formatting, except headings." But just mentioning anything related to formatting, could suddenly cause unintended effects on seemingly unrelated things. Then I tried being explicitly directive about all formatting to just lock it down. And this completely failed because once the formatting was perfect, I started noticing the model's output would get less intelligent much earlier in sessions. So I cleared my user prompt entirely as it wasn't worth the cognitive cost on the model or my time. A few days later in a long session I noticed it was numbering everything perfectly with no prompt at all. When I scrolled back through I saw it didn't start out numbering its responses. It started doing it because I was consistently numbering every major concept in my inputs, even though I never mentioned numbering or formatting.
    So... yeah, subtle differences in prompts which absolutely shouldn't matter, do impact model output in unexpected ways. And, as of now, these effects can only be fully suppressed with strong directive prompts for short periods, but doing so always impacts other unrelated things - and has some cognitive impact on model performance. So, by paying a little attention, I've discovered ways to optimize a model's output in the direction I need by shifting not only my prompt's explicit directives but also the subliminal meta-elements like tone, style, length, structure, formatting, etc.
    
    marknutter - 2 hours ago
    
    Yeah, I find the back and forth with Claude is often better than trying to front load everything in a massive and detailed prompt.
- troupo - 12 hours ago
  
  > I started doing it months ago and, to be honest, what the agent chooses to do isn’t unpredictable.
  You just wrote three paragraphs of text describing why it's unpredictable.
  Moreover, for the same prompt on the same machine in a different session it will use a different set of tools.
bryanlarsen - 16 hours ago

I'm also bemused by the number of people who think they've got an effective sandbox yet their sandboxed agent has access to all of their code, their github, and unrestricted web access.
- Terr_ - 16 hours ago
  
  I keep telling folks that they need to imagine LLMs (even "local" ones) as if you're farming it out to JS code running on some dude's browser somewhere: It can't keep a secret, and a determined person can make it emit anything they like.
  We need to be asking what the most devious and malicious output could be, and whether what we do with that output (e.g. arguments to command-line tools) would still be safe.
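  A minimal sketch of that idea, assuming model output is about to be passed as arguments to a command-line tool. The `ALLOWED` table and the `vet_command` helper are hypothetical names invented for illustration, not part of any real agent harness; the point is only that the text is parsed and checked as untrusted input before anything runs.

```python
import shlex

# Hypothetical allowlist: program name -> permitted subcommands.
# An empty set means the program takes no subcommand.
ALLOWED = {
    "git": {"status", "diff", "log"},
    "ls": set(),
}

def vet_command(model_output: str) -> list[str]:
    """Parse untrusted text into argv, or raise ValueError if it is not allowlisted."""
    argv = shlex.split(model_output)  # also raises ValueError on unbalanced quotes
    if not argv:
        raise ValueError("empty command")
    program, args = argv[0], argv[1:]
    if program not in ALLOWED:
        raise ValueError(f"program not allowed: {program!r}")
    subcommands = ALLOWED[program]
    if subcommands and (not args or args[0] not in subcommands):
        raise ValueError(f"subcommand not allowed for {program!r}")
    # Reject options after the subcommand: a flag such as --output can turn
    # an otherwise read-only tool into something that writes files.
    rest = args[1:] if subcommands else args
    for arg in rest:
        if arg.startswith("-"):
            raise ValueError(f"option not allowed: {arg!r}")
    return argv
```

  The returned list is meant to be handed to something like `subprocess.run(argv)` without `shell=True`, so `;` and `|` are never interpreted by a shell. This is deliberately incomplete: `ls` can still be pointed at any readable path, so it narrows the problem rather than solving it.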
  - NichoPaolucci - 15 hours ago
    
    From my perspective, everyone is doing it. Security through obscurity - obviously if you’re harboring credit card numbers or users’ personal details, maybe take heed. But, if you’re a regular… run of the mill CRUD application, every other company is ALSO throwing caution to the wind. When hundreds of thousands of credentials are leaked into the funnel, does it really matter?
    I’m at a small company, and I try to push for security as much as I can, but the stakeholders truly do not care. They want to move fast. It’s just part of the new world I guess. If we get hit by attackers? I don’t know what happens. Sorry, we told you not to - you wanted to move quick and break stuff, this is how that culminates.
    I’m sure I’m not the only one.
  - user43928 - 6 hours ago
    
    The answer to that question seems obvious: No, it is not safe.
    Yet with tens of millions of developers using these tools, there have not been widespread incidents of this sort as far as I know.
    So it leaves me with a few choices:
    - manually review and approve each command: obviously not realistic, you would just click Approve
    - use a sandbox and hope the exploit is not devious enough to escape the sandbox when you run or open the project outside of the sandbox
    - use AI without web access and limit other external dependencies
    - don't use agentic AI
    - use Claude or Codex auto approval classifier and hope for the best
    Personally, I'm going with the last option for now.
  - skybrian - 16 hours ago
    
    We do have ways to avoid giving an LLM any secrets, but it needs to be the simple, default solution.
- kstenerud - 4 hours ago
  
  > yet their sandboxed agent has access to all of their code, their github, and unrestricted web access.
  Not in my sandbox. It gives no direct access to the workdir, no access to my github, my ssh keys, my security tokens or API keys. No access to my home dir or dotfiles. Nothing at all, except for what I explicitly tell it to give access to.
  I can restrict network access. I can choose the isolation level: docker containers, Kata VMs, seatbelt, tart, even the new apple containers (which are VERY nice).
  Not even ENV leaks through.
  And it's FOSS: https://github.com/kstenerud/yoloai
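  For readers wondering what "not even ENV leaks through" means in practice, here is a rough sketch of the general idea only. It is not a description of how yoloai works; `SAFE_ENV` and `run_scrubbed` are hypothetical names, and a POSIX system is assumed.

```python
import subprocess
import sys

# Hypothetical allowlist: the only variables the child process will see.
SAFE_ENV = {"PATH": "/usr/bin:/bin", "LANG": "C.UTF-8"}

def run_scrubbed(argv: list[str]) -> subprocess.CompletedProcess:
    """Run argv with only SAFE_ENV, so tokens exported in the parent shell are not inherited."""
    return subprocess.run(argv, env=SAFE_ENV, capture_output=True, text=True, check=True)

if __name__ == "__main__":
    # The child prints the names of the variables it can see.
    child = [sys.executable, "-c", "import os; print(sorted(os.environ))"]
    print(run_scrubbed(child).stdout)
```

  Passing `env=` replaces the inherited environment instead of extending it, which is the whole trick. It does nothing about files, sockets or credentials on disk, so it only complements real isolation.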
- blcknight - 16 hours ago
  
  One bad npm package can really ruin your day. These things for me only run in their own VM with its own GitHub account and basically nothing else
  - ofjcihen - 15 hours ago
    
    People probably think you’re being ridiculous but Shai Hulud had its very first attempt at manipulating AI lead analysis and I know of at least one company where that resulted in them getting pwned.
    This is only going to become more of a problem in the future and people need to educate themselves on the technical barriers to use because guardrails only sometimes work.
- webstrand - 14 hours ago
  
  If anyone's looking to sandbox network, I've had good experience with pasta [1] networking. I make a pasta+bwrap sandbox and expose only specific services via local sockets to cross the boundary.
  [1]: https://passt.top/passt/
- devmor - 14 hours ago
  
  I use a separate physical machine and a scoped token with access to a single repository at a time, and even then I worry about what hole I may have left open.
  The general carelessness of the average user is baffling.
- norikaoda - 16 hours ago
  
  [flagged]
elevatortrim - 3 hours ago

How can you get the agents to do anything useful without giving them meaningful access?
If it only lives in an isolated sandbox, it can only act within the sandbox, then I would have to manually move what was done in the sandbox to real-life.
I am not saying it should have critical access, but this is more of a question: How can you get value out of AI if it can only act in a sandbox?
- nemomarx - 2 hours ago
  
  Is having to move the files in and out of the sandbox really going to eliminate all the value it has?
  You could have a full version of whatever codebase and test suite you want in there. It can do all the same stuff, right? Just copy it elsewhere once you know you've got a working result, a few minutes of effort at the end of each pr or work item.
- dumah - an hour ago
  
  The same way you get value out of a dev container.
pjungwir - 11 hours ago
I know there are VM solutions, but I've been happy with a separate OS user (named `claude`).
He has similar dotfiles to mine, but no secrets. My own home directory is 0700. He has his own ssh key that I added to my github profile, but it's password-protected, and I push/pull for him. He has his own Postgres (non-superuser!) {development,test} {users,databases}.
It's as if he were another developer on the project. If he needs something run with sudo, he asks me. Often we can both work on something in parallel. Unix was supposed to be a multi-user system after all.
A trick I use a lot is that many of his git repos have an extra remote, like this:
```
    paul  ssh://paul@localhost/~/src/example (fetch)
    paul  ssh://paul@localhost/~/src/example (push)
```
That makes it easy to collaborate on things I'm not ready to share.
I'm pretty comfortable with this setup.
I do worry about Linux privilege escalation bugs. I don't trust an AI to understand that exploiting vulns is not acceptable. (I can't help but recall that at my first job I may have misused vim's :! feature to broaden my sudo powers, which were officially limited to editing httpd.conf, when I needed something in a hurry. . . .) I find myself manually upgrading packages more often these days, despite automatic security updates. I don't think Opus would go to the trouble of looking up security vulns, but maybe Fable would, and there have been a lot lately. Maybe some future model will just take it upon itself to find new ones. Or install a keylogger to learn the ssh key password.
But a separate user is nearly the most paranoid setup I've heard of, excepting only a separate machine. So I also question whether I'm sacrificing too much speed/convenience. But really it's still very convenient. I think it's a good way of being efficient but responsible.
If other people see holes, I'd be happy to hear about them.
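One small thing that can be checked mechanically in a setup like this is that the human's home directory really is closed to other users. A minimal sketch, assuming a POSIX system; `is_private` is a hypothetical helper name, and it only looks at classic permission bits, not ACLs.

```python
import os
import stat

def is_private(path: str) -> bool:
    """Return True if group and others have no permission bits on path (e.g. mode 0700)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0

if __name__ == "__main__":
    home = os.path.expanduser("~")
    print(home, "is private" if is_private(home) else "is readable by other users")
```

Run as the human user, this flags a home directory that has drifted back to something like 0755, for example after a restore from backup.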
- justusthane - 6 hours ago
  
  That’s a really interesting and pretty neat approach. How do you communicate with it? Just su to that user? Or tmux?
  Although I can’t help but think that a VM is still more convenient, more flexible, and more secure.
  - pjungwir - 4 hours ago
    
    Yes, I su to the user. Typically I have it run a tmux session for each "project". That makes it easy to get more windows without su'ing over and over. Also its tmux sessions all get a yellow status bar (in ~claude/.tmux.conf), so they are easy to recognize.
    To me it is more convenient than a VM, since everything is on the host. And it can launch its own VMs without an extra layer.
    I don't really know which is more secure. There are hypervisor escape vulns too. And shared folders seem like footguns. For instance in vagrant, guests get `/vagrant` to read/write the host's folder, so you have to be careful what you put where.
    The biggest annoyance with an OS user so far is running docker containers. I don't want to add claude to the docker group or give it sudo privileges. I've read that you can set up rootless docker for a user, and even that you can run it side-by-side with a normal system-wide docker, but I haven't tried doing that yet.
    
    justusthane - 3 minutes ago
    
    You could look into Podman as well - it's rootless by default, and often can be a drop-in replacement for Docker.
raldi - 15 hours ago

Do you think it’s dangerous to be in a car going at freeway speed? Do you ever do that anyway, even though you could be walking instead?
- spunker540 - 14 hours ago
  
  This is a great analogy. Like driving on the freeway, agents are super time efficient, generally safe, but the stakes are high in terms of the worst possible outcomes.
  - techpression - 12 hours ago
    
    The analogy falters in scope, it should be more like ”do you put your entire family and all your friends in different cars, on different highways, and try to remote control them all at the same time, while also driving yourself, facing backwards”
    
    Gareth321 - 10 hours ago
    
    I think all three of you are quibbling over the risk/reward ratio, and you have different estimates. It's not unreasonable that you're all correct - given your estimates. My estimate is that Tesla FSD is safer in aggregate than human drivers, so I believe it is safer for me to use that than drive. It doesn't get tired, have medical emergencies, get impatient and frustrated, speed, lose focus because a child shouts, thinks at the speed of light, and can see from eight cameras all around the car, all at the same time. I only have two eyes.
    You would also be correct if your risk estimate concluded that Tesla FSD has arguably killed people, makes mistakes humans would not, can glitch, and has no one to hold accountable. For these reasons, you choose not to use it.
xyzzy123 - 14 hours ago

The real sandbox is not caring if your computer gets bricked.
- AdamN - 14 hours ago
  
  The machine is no big deal - it's the authn/authz that matters. What can the agents do with the credentials available to them?
  - petesergeant - 13 hours ago
    
    Less if you use something like https://agentblocks.ai so they don’t actually get the creds
- _345 - 14 hours ago
  
  way worse things can happen than your machine being bricked, if a malicious actor can weaponize an agent to do their bidding
  - rfw300 - 13 hours ago
    
    > if a malicious actor can weaponize an agent to do their bidding
    In my experience, human employees are much more vulnerable to this particular weakness than frontier agents (i.e. phishing attacks).
    
    _345 - 5 hours ago
    
    I'm not letting Jenna from HR log into my personal machine with access to all of my lifelong data though. I do let my claude bypass permissions though
  - dumbdumb125 - 14 hours ago
    
    the solution to both of these is the same thing. vps with accounts for all the services specific to the agent (github and whatever else)
hugh-avherald - 16 hours ago

The analogy extends to driving generally. Everyone knows it's very dangerous but people keep doing it.
j-bos - 16 hours ago

This. House full of big brain security experts, executives, lawyers, and until Claude got excited and broke prod it might as well have been "sandbox, whoooo?"
IDGI
Anyway, VM's incoming, finally.
ghrl - 7 hours ago

Amazing observation, and I'm certainly guilty of it too, but it is just way too convenient to skip the sandbox, and some tasks outright depend on not being sandboxed.
For anything other than writing code directly in a fully contained git project, where sandboxing might work well, it requires access to system wide tools, user configuration and more.
Occasionally I tell the agent to do everything inside of docker, which works too and it leaves the system alone then mostly, but adds significant overhead and slightly degraded perceived quality / effectiveness.
I think the most important takeaways are to have reliable backup strategies, access control and security mechanisms, which is a win regardless. Whether by the agent or the human, mistakes happen (like a rm -rf * run in the wrong directory), and where they would be devastating, there should be other protections than just "hope it won't happen" or "rely on a sandbox to prevent agent error".
emodendroket - 16 hours ago

Well, it's a similar impulse to the way you see professional carpenters pin the guard open on a saw or do other things everyone knows you shouldn't do, except probably with a larger productivity difference and less life-altering (for the operator) consequence if it goes wrong.
- rpcope1 - 16 hours ago
  
  I had the same thought, it's kind of like taking the guard off a 4 1/2" grinder. Real convenient until the cutting wheel explodes or the grinder gets hung and kicks back.
andai - 8 hours ago

>I'm continually bemused and astonished by the number of people who clearly acknowledge that it's reckless to give agents full access to your machine, and keep doing it anyway.
Yeah, that's why you give it its own machine :)
simonw - 15 hours ago

Which agent sandbox do you recommend?
- flexagoon - 13 hours ago
  
  If you're on Linux, the easiest way IMO is to just run the agent in bwrap
  I do it like this
  https://github.com/flexagoon/dotfiles/blob/main/dot_config/f...
  But I'm sure it's simple enough that you can just ask the agent itself to make you a command for it with proper bwrap configuration
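  As a rough idea of what such a command can look like, here is a sketch that only builds a bwrap argument list and does not run it. This is not the linked config (that URL is truncated above); `bwrap_argv` and its parameters are hypothetical, and a real invocation usually needs more binds (for example /lib, /lib64, /bin and parts of /etc) depending on the distro.

```python
def bwrap_argv(project_dir: str, command: list[str], allow_network: bool = False) -> list[str]:
    """Build a bwrap command line: read-only /usr, writable project dir, no network by default."""
    argv = [
        "bwrap",
        "--unshare-all",        # fresh namespaces, including the network namespace
        "--die-with-parent",    # kill the sandbox if the launcher exits
        "--clearenv",           # start from an empty environment
        "--setenv", "PATH", "/usr/bin:/bin",
        "--ro-bind", "/usr", "/usr",
        "--proc", "/proc",
        "--dev", "/dev",
        "--tmpfs", "/tmp",
        "--bind", project_dir, project_dir,  # the only writable host path
        "--chdir", project_dir,
    ]
    if allow_network:
        argv.append("--share-net")  # keep the host network namespace
    return argv + command

if __name__ == "__main__":
    print(" ".join(bwrap_argv("/home/claude/project", ["bash"])))
```

  Keeping the list in one function makes it easy to review exactly which host paths the agent can see before anything is executed.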
  - artemisart - 33 minutes ago
    
    bwrap is builtin in claude too, activate with /sandbox command.
- mik3y - 14 hours ago
  
  I've been enjoying Moat [1]. Proxies credentials, networking, etc; uses MacOS containers if available; and setup worked without much fuss. I haven't tried others, though.
  [1] https://majorcontext.com/moat/
- fspoettel - 12 hours ago
  
  nono works great with pi: https://nono.sh/
justapassenger - 16 hours ago

Because benefits are much higher than risks.
- bigstrat2003 - 15 hours ago
  
  They really aren't.
  - imp0cat - 14 hours ago
    
    Perceived benefit vs perceived risks.
zozbot234 - 11 hours ago

It's like a dumb parrot that's somehow become hell bent on "fixing" everything that's wrong with your code. If you give the thing autonomous access to outside tools, you can expect it to do weird things that you may have not thought of. So don't do that, just ask the parrot to write up a plan for you.
This is likely also the underlying root cause of what Anthropic assessed as concerning behavior in their original evaluation of Mythos: it's not really about being super smart, it's more of a dumb chaos monkey that knows just enough to be dangerous and is relentless at trying to do just that.
istvan0 - 14 hours ago

> I'm continually bemused and astonished by the number of people who clearly acknowledge that it's reckless to give agents full access to your machine, and keep doing it anyway.
What if you have two machines and the one you give to the agent is constantly backed up?
- trvz - 13 hours ago
  
  They still shouldn’t be running on the same network.
  And if you’re using Macs, you can’t be signed into your primary Apple ID on the agent machine.
andoando - 16 hours ago

I mean what's the big deal? I use --dangerously-skip-permissions on every single interaction in the last 6 months. Worst case it deletes my files that are all on git? It fucks up my local DB? Cool.
I save way more time not babying it than the occasional fuck up I have to salvage.
- ghshephard - 16 hours ago
  
  Worst case it gets access to gmail. And Github. And the Internet. I'm increasingly appreciating the importance of a physical finger-press on Yubikey to trigger the FIDO2 + OIDC Auth. I don't think there is an easy way for it to hack a new session.
  - andoando - 14 hours ago
    
    How is it going to get access to gmail or github? In any case, whats the probability of it going so completely off the rails that it does something horrendous with gmail/github? Whats it going to do? Email my coworkers nudes on my computer? Make my github profile public?
    
    simonw - 14 hours ago
    
    I am most worried about something gaining access to my email and then using the password reset flow to steal hundreds of other accounts.
    2FA makes me a little less nervous than I used to be, but not everything has good 2FA.
    
    nunez - 14 hours ago
    
    Claude typically recommends .env files for storing secrets. You use one to store a refresh token for the Gmail API or IMAP connection details. Your agent uses an MCP server you configured during a session, but the MCP server has been compromised and directs the agent to do nasty stuff with env dotfiles.
    
    troupo - 11 hours ago
    
    > How is it going to get access to gmail or github?
    Did you even read the article? Claude was opening the browser and iterating through the tabs.
    I presume you are logged in to your github account? Your gmail?
    > Whats it going to do? Email my coworkers nudes on my computer? Make my github profile public?
    Reset access to services using your email? MITM your 2FA?
    Or perhaps you have 1Password/Bitwarden running with a generous unlock policy?
    
    epihelix - 3 hours ago
    
    > Did you even read the article? Claude was opening the browser and iterating through the tabs.
    It would have been somewhat ironic if it had been hit by a prompt injection attack via one of all those open random websites ...
    
    simonw - 2 hours ago
    
    This is one of the things I found so interesting: it was using my system browsers but it wasn't exposing itself to any content from them.
    Even when it iterated through all visible windows to find the one it wanted to screenshot it was searching for titles in Python code and returning only the integer window ID.
    The sites it opened and screenshotted were sites under its own control - either test pages it had created or development servers it was running.
    When it did run code that analyzed an open web page (by injecting JavaScript into a template it controlled before loading that in a browser window) that code only returned JSON with measurements from the page.
    It's making me wonder if Fable has been trained to take additional steps to avoid accidental exposure to untrusted content.
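    The title-matching pattern described above can be sketched roughly as follows. This is a reconstruction from the description only, not the code the agent actually ran; `find_window_id` and the sample data in the usage block are hypothetical, and how the window list is obtained from the OS is left out.

```python
from typing import Optional

def find_window_id(windows: list[tuple[int, str]], needle: str) -> Optional[int]:
    """Return the ID of the first window whose title contains needle, else None.

    Only the integer ID leaves this function. The title text, which may be
    attacker-controlled, is never returned to the caller.
    """
    for window_id, title in windows:
        if needle.lower() in title.lower():
            return window_id
    return None

if __name__ == "__main__":
    # Hypothetical (id, title) pairs standing in for a real window listing.
    sample = [(11, "Inbox - Mail"), (42, "datasette dev server")]
    print(find_window_id(sample, "datasette"))
```

    Because the matching happens inside the code and only an integer comes back, text from unrelated open tabs never reaches the model's context.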
  - SoftTalker - 15 hours ago
    
    It should run as a separate user account with its own home directory. Not with access to your personal browser profile.
    
    matltc - 14 hours ago
    
    What does setting this up look like? Qemu vm and run there? How do you interface with version control and deployment?
- eloisius - 13 hours ago
  
  What happens if it gets manipulated into npm installing a malicious package, which compromises your machine and any systems it has access to or becomes part of a botnet?
isodev - 13 hours ago

Not to mention OpenAI/Anthropic’s newly found appetite for keeping data (made public with Fable but we don’t know what actually happens there anyway).
There is so much role play going on for people to convince themselves that any of this is fine.
skybrian - 16 hours ago

There are plenty of good sandboxes out there but somehow no "obvious right answer" that everyone knows to recommend. Seems like a missed opportunity.
(I'm happy with exe.dev, but I'm not sure what I'd use if I were coding on a Mac.)
thatxliner - 16 hours ago

Maybe because there are not many resources on how to set it up, or it is just not that easy to?
Because most devs already have it running and working without a sandbox, they tend not to do anything "unnecessary".
sipjca - 14 hours ago

im more surprised that more people don’t treat their computer as disposable anyway.
that it could just be wiped at any moment and it wouldn’t matter. shit happens, could be stolen, broken, whatever. the computer should be able to be thrown out the window and continue to live life.
to be clear, i don’t think upgrading and disposable in this way is good, but it being wiped at any moment shouldn’t be a concern
i grew up wiping my machine every year anyway, so i guess it’s just a habit
is the computer that sacred?
- baq - 13 hours ago
  
  Computers are disposable, secrets is what we’re talking about. Rotating passwords and tokens is a major PITA on the best of days.
  - sipjca - 11 hours ago
    
    fair enough, i guess minimizing that surface area is important to begin with
- dumbdumb125 - 14 hours ago
  
  i think it's about drawing a line between your "personal computer" and a software development machine. any digital-native is going to accumulate programs, configurations, and other bits and pieces that aren't trivial to migrate to a new machine.
  - backwardsponcho - 8 hours ago
    
    Programs, configs and "other bits" are the trivial parts that no one should care about. It takes about 5min to go from fresh install to near-fully-configured.
    Even the hardware itself doesn't matter that much, in the end it's all provided by your employer.
    Leaking session tokens or secrets, on the other hand...
    
    dumbdumb125 - an hour ago
    
    i'll argue
    for me it's just the magnitude. i don't have a well-defined list of the programs i use in any given 6-month period, but it's long. just installing all of them takes 1.5 hours and I miss some every time.
    if i were a consultant working overtime, that would be billed at 450 an hour, 700 total. far from trivial
  - sipjca - 11 hours ago
    
    imo being digital native means that migrating to any machine should be basically trivial. working with the flow of the machines rather than customizing and ricing them because you're a cool computer person or whatever
    i just want my computer to work. any config i have on my machine can be rebuilt by just doing the work i need to do.
    my primary work machine was stolen last year so i was forced to go through this quite literally with a new machine rather than hypothetically or by my own will
    
    dumbdumb125 - 43 minutes ago
    
    [dead]
- ghrl - 7 hours ago
  
  Sounds like a case for NixOS
azraellzanella - 8 hours ago

If you want to run Claude in a container: https://github.com/dvdstelt/ai-agents
- andai - 8 hours ago
  
  Alternatively you can just give it its own user. I do that, so it can blow up its own files, but not mine.
konaraddi - 14 hours ago

In practice, full access to your machine is okay as long as there are safeguards and the expected outcomes are clear with a well defined path to said outcomes that aren’t overly ambitious. Otherwise, for ambitious goals or YOLO one shot attempts, eliminating opportunity for capability misuse is critical (e.g., sandbox).
bxk76 - 15 hours ago

It's how the chimp brain works. It's not a single system but multiple systems making predictions for different time horizons. When their outputs don't align, we get stories to manufacture coherence.
Plato gave us his Chariot analogy, with two horses pulling in different directions, about 2,400 years ago. Today we've got System 1/System 2, the Elephant and Rider model, etc.
The human mind, thanks to how its own architecture handles unpredictability in the universe, will generate contradictions.
paganel - 4 hours ago

> to give agents full access to your machine
I was mesmerised at the author being away from his computer for a short while and then, when coming back, seeing the AI agent having opened up a browser window. Meanwhile we all have to use the fricking 2FA almost anywhere now, plus the crazier and crazier rules when it comes to passwords. I'm mentioning the latter because these types of people were the same ones who were pushing 2FA down our throats around 2017-2019 (including on forums like this one), and look at them now.
soulofmischief - 15 hours ago

It took two decades for the web to deprecate SSL for TLS and serve over HTTPS by default.
- dgellow - 6 hours ago
  
  FWIW TLS had a non-negligible impact on performance at scale. Hardware improvements made that irrelevant, eventually making the switch to HTTPS by default a no-brainer (or at least that's what I vaguely remember from <2010)
uihjhjb - 15 hours ago

[dead]

jampa - 16 hours ago

Fable feels like a version of Opus running on a harness that won't let it halt until it's sure the issue is fixed, which makes sense if what you want is a model that's better at benchmarks.

It's a very good model, but it comes at a huge premium: not only do the tokens cost more, but the model itself really wants to spend them all. For example, working with React Native, Fable never just says "okay, I did the thing, that's it." It tries to rebuild the entire app from scratch, run the whole test suite, and watch every log and warning.

This is the first time with LLMs I've felt that upgrading to a model isn't worth it, even if my company lets me use it, because all the building / testing was just destroying my machine and its battery, which keeps me from working on other things.

For now, it feels like Opus with ultracode is a better choice (less pollution of the main context, more parallelism in investigations).

conradkay - 16 hours ago

Does low/medium effort fix it for you? Seems like Fable 5 low can outperform Opus 4.8 high/xhigh often, and uses a lot fewer tokens
- skerit - 8 hours ago
  
  Fable 5 on medium is amazing. It's handling everything I throw at it
  I had _one_ instance where for some obscure reason it decided to fall back to Opus 4.8 and Opus IMMEDIATELY fucked it up and implemented a super obvious feature in a slightly-wrong way.
- _345 - 14 hours ago
  
  In my case no, I actually saw worse performance with fable medium and switched back to opus high and xhigh
  - epolanski - 11 hours ago
    
    I find high+ unusable, it's way too slow and "thorough" on 99% of mundane tasks.
    Sure it's better at vibecoding whole tasks, it's clearly good at it, but give it a simple one, and it will still do way more than needed.
    It's way too fixated on validating even the simplest things, I find it an unproductive model unless you're implementing whole tasks and doing other things in the meantime.
    
    jon-wood - 4 hours ago
    
    Why are you deploying a bleeding edge, incredibly expensive, model to do the simplest things? Use Sonnet, hell, use Haiku, they'll get the job done and won't set fire to several rainforests in order to achieve the task.
sanex - 15 hours ago

I've found the opposite. Granted I use sub agents heavily but I've had it run for hours with far fewer tokens used than when I was previously using opus4.6-8.
- firemelt - 2 hours ago
  
  How did you use the sub agents? Any example of setup and use case?
threatripper - 16 hours ago

On what setting in which environment do you run it? I use the VSCode extension on Extra High and feel like it does exactly what needs to be done and stops when the thing I asked for is done. Extra comments come only when they fall into the area of code that was changed.
- jampa - 16 hours ago
  
  I tested it to fix React Native bugs in a project, comparing it with Opus. It fared better on harder bugs, taking less time to find the root cause, but after implementing a fix, it spent a lot of time and effort on validation. This was mostly unnecessary, since most of the bugs were in the JS code, so for most things, hot reloading is enough for E2E validation and to run just the right tests. No need to run a full build and test suite (which takes 10+ minutes); the CI can do this.
  I switched back to Opus because of this validation quirk. Overall, Fable spent 20% of the time on coding and 80% on validation.
  I think using Fable for planning and Opus for execution could be a "best of both worlds" approach (I need to test this more), but for most cases, it's not necessary, and Opus is enough.
  - gbalduzzi - 13 hours ago
    
    > most of the bugs were in the JS code, so for most things, hot reloading is enough for E2E validation and to run just the right tests. No need to run a full build and test suite (which takes 10+ minutes); the CI can do this.
    Have you tried adding this instruction to your agents.MD? Avoiding situations where the agent starts running a loop is the main use case of the file for me
  - wouldbecouldbe - 4 hours ago
    
    why not just add something like: "No need to run a full build and test suite, I will manually validate"
dreis_sw - 4 hours ago

I think the new high effort settings are so strong that selecting them when the task doesn't require it actually impacts the output negatively.
Gareth321 - 10 hours ago

I like this proactivity in theory, but as you say: it's expensive. I wonder if this can be solved with the right prompt. E.g. "these are your constraints. Only resolve x. If you are unsure if a task is outside constraint, check with me first."
esjeon - 14 hours ago

> the model itself really wants to spend them all
In fact, Opus does the same. It finishes the job, then redoes it from scratch before presenting the result to the user. This happens even for simpler writing tasks, especially when I instruct it to create a text file.
epolanski - 11 hours ago

> which makes sense if what you want is a model that's better at benchmarks
This so much.
Opus 4.6 was the last Anthropic model that was good at assisting you, 4.7 and later ones have completely inverted this relationship and it's you assisting it.
Yes, I admit they are smarter, I admit we've reached a point where LLMs are more creative and could be writing better code (albeit with some design hiccups) than I do, but they are also increasingly bad at helping me.
Sure, they do my job when prompted 8 times out of 10 (but then, what's the point of having me anyway?), but my issue is that when I try to invert the relationship they will keep jumping onto solving the issues themselves and disregard my feedback or request.
E.g. I wanted to know some DNS details of an emailer module in Fable 5 and it jumped onto "why I should've used magic links"; it just didn't do what I asked.
E.g. 2. There was a worker machine that had an environment misconfiguration and I tasked it to find which github action was setting that specific flag and where. Instead of answering a question, it jumped into just hardcoding it in the code.
E.g. 3. I had some issues with batching, and while I tasked it to investigate whether batching was needed at all for that particular problem (hint, it wasn't) it went and changed the batching logic to fix the bug.
I am extremely disappointed with Fable's personality.
I can clearly see it's strong, but I'm wondering whether the relationship of LLMs as assistants has broken forever, and it's us now that are being tasked with assisting them instead, because that's how it feels.
The training/reinforcement is clearly biased towards solving problems, not answering questions.
- jon-wood - 4 hours ago
  
  I feel like a lot of this could be solved by having a mode somewhere between Plan Mode and Execute Mode in Claude Code. Quite frequently I'll fire up Claude Code in the context of some checked out code because I want to ask some questions where having access to the source would probably be useful, I don't want it to go running off and making changes though, and I also don't really want a detailed plan for a chunk of work. I just want to ask something like "run cargo build and explain the errors to me", nine times out of ten it will indeed explain the errors but it'll then run off and start trying to fix them regardless of whether I said not to.
  Essentially what I want is the experience of using Claude on the web in basic chat mode, but with the ability for it to go read my actual code and perform actions that can assist in finding answers to those questions.
dyauspitr - 16 hours ago

It’s not just a more proactive and diligent opus. The capabilities are significantly higher on fable. It’s not a paradigm shift, but it’s close.
- UncleOxidant - 16 hours ago
  
  I unleashed it on a compiler codebase that I've been developing for several months now using Claude Sonnet 4.5/6, Gemini 3.1 Pro, DeepSeek V4 Pro(recent), and a bit of Qwen3.6-27B. Right away Fable found several longstanding bugs in our compiler that we hadn't found before. It found that there was a critical part of our design that needed to be mostly redesigned/rewritten and gave a very well-reasoned rationale for doing so.
  - rajveerb - 16 hours ago
    
    what sort of compiler?
    
    UncleOxidant - 15 hours ago
    
    A compiler that takes C code (a subset of C with some extensions) and compiles it to microcode for a type of microcoded, algorithmic state machine that we're developing.
- andai - 8 hours ago
  
  They should have made it three times bigger instead of two.
- viking123 - 14 hours ago
  
  It's worse than gpt 5.5 xhigh
  - baq - 13 hours ago
    
    The jagged frontier strikes again.
    I’d say it’s overall better, but not universally better.

pshirshov - 3 hours ago

I have a feeling like such posts come from a parallel reality. In my anecdotal experience confirmed by my (still subjective) benchmark (https://pshirshov.github.io/llm-bench-pi-oneshot/) Fable is not _that_ impressive. It performs on par with gpt-5.5 and opus 4.8, sometimes better, sometimes worse, it's definitely more expensive and it likes to refuse answering questions about React saying it can't help with chemistry.

Is this fuss really grounded or is it some pre-IPO AGI hype?

enraged_camel - 2 hours ago

My experience with Fable since its release matches Simon's.
I've been having it orchestrate complex implementations. I give it a parent ticket (issue) on Linear and say "look at the sub-issues on this ticket and determine which ones you can implement yourself, in which order, and determine how your implementation will need to be coordinated with what is currently being worked on by other team members". These tickets are not trivial. They have a lot of moving parts, as well as dependencies between them, both inside the same project and across projects (e.g. backend).
Fable then chooses tickets, delegates each ticket to a subagent (also Fable), which looks at Figma designs for the ticket, implements it perfectly (following repo guidelines and conventions to the letter), takes screenshots of each piece, writes detailed commit messages and PR descriptions, then posts the screenshots in them as evidence. Then it provides a summary in the form of "you'll need to make sure PR #1283 is merged first - btw there were no Figma designs for such-and-such screen but I looked at similar screens that have been implemented and adopted the pattern".
That's probably like... 20% of what it can do. It's a truly, legitimately powerful model.
Opus 4.8 could do a lot of this too, but required a lot of hand-holding, and when it came across a blocker it was likely to just stop and say "I was able to get this far, but I can't proceed."
- pshirshov - an hour ago
  
  Ok, explain one thing to me: I have a benchmark - I feed an identical prompt to multiple models. Codex produces a rough but working program. Fable produces the same - but with more bugs than Codex. Opus produces something similar to Codex but with a critical bug.
  That describes all my tests with Fable.
  Why should I be hyped about all that "legitimate power" if the model performs on par with two other SoTAs?
  I mean, well, yes, it is impressive. It could quickly generate a lot of garbage which sorta does look like code. Two others can do the same. I don't see any groundbreaking improvement - but the price is much higher. Why the hype?
  - enraged_camel - an hour ago
    
    >> Why should I be hyped about all that "legitimate power" if the model performs on par with two other SoTAs?
    I don't care if you're hyped or not. You asked if the posts like the OP come from a "parallel reality" and I said no and described my experience. If you're getting good/better results with Codex than with Fable, you should probably continue using that, since it's cheaper and faster.
ath3nd - 3 hours ago

[dead]

BosunoB - 13 hours ago

Fable was trying to verify a UI change in my game. I was working in another window and noticed a program opening on my task bar. Fable had opened the game through the CLI using a movie maker tool, recorded the output, took a frame from the end of it, and used that to verify the UI. When my game's welcome screen obstructed what it wanted to see, it created a temporary worktree, deleted the welcome screen, and ran the movie maker again.

I watched the whole thing thinking it could've just asked me for a screenshot and saved the tokens. But still, I couldn't help but be impressed. Opus never would've done that.

simonw - 13 hours ago

Yeah, you've exactly captured one of the main problems with the model being relentlessly proactive: it will happily burn like $5 of tokens to avoid asking the human to take a screenshot or click a button for it.
- wild_egg - 13 hours ago
  
  I'm actually very happy about this. Babysitting the agent just in case it needs me to do something is a terrible use of my time. I've always had to be very explicit about the various ways that it can get an automated feedback loop going to check its work, and now Fable doesn't even need that hand holding. Really great improvement all around.
  - junior44660 - 12 hours ago
    
    Have you ever wondered whether this would end up costing more than a competent offshore developer with a more frugal harness/model?
    
    wongarsu - 5 hours ago
    
    You still need a competent developer for the prompting, planning, etc. But once it's running, I want to avoid mental context switches and just have it run
    Giving it access to a cheap human who is just there to take screenshots, do QA, give UX feedback sounds like a good idea in principle. It's non-trivial to set up, but I wouldn't be surprised if at some companies this becomes a thing. The return of the QA department, just that they now get to do the agent's bidding in addition to checking if the results work
- OJFord - 4 hours ago
  
  Have you tried instructing it not to do that? Something like "do not branch into side projects or hacky solutions to obtain information you could ask me for. For example: if you need a screenshot of the issue, just ask me to take a screenshot rather than find a way to reproduce and screenshot it."
- zith - 11 hours ago
  
  I used to complain about all the levels of indirection of modern software, running in a javascript jit, in a browser container, in a vm, on an os, etc.
  I eventually just accepted it, but this new agent layer really takes things to a new level.
- illiac786 - 11 hours ago
  
  Ha, you just gave me an idea. Add to the prompt “do not do things that will burn over X tokens if the human operator can do it in less than X min, ask for it”.
  I wonder if LLMs can estimate effort in tokens?
  - jbgt - 10 hours ago
    
    I just say "if you need something specific or have any questions, stop and ask me for it".
- 0x6c6f6c - 13 hours ago
  
  Honestly Claude straight up ignores my input sometimes, preferring to instead run commands for output and processing that and burning through a series of tokens when thinking hard about whether to ignore me.
  Like today, I told Claude exactly the name of the folder it had mistaken (it was supposed to be prod, not production), and it disregarded my input to then examine the directory itself. Small example of the kind of things it's been doing lately but that's top of mind.
  - penguinPhilosop - 13 hours ago
    
    Almost as if this was _intentional_... maybe related to Anthropic still not being profitable and burning thru wads of cash every day.
    
    bentcorner - 4 hours ago
    
    The conspiracy theorist in me says that LLM providers do this regularly (or at least, don't bother optimizing for it) beyond some arbitrary "$/task" metric. I am not sure if there is enough SOTA model competition to avoid this.
0x000xca0xfe - 7 hours ago

> I watched the whole thing thinking it could've just asked me
You can tell it just that. Happened to me too but after instructing it to leave the review to me Fable was useful for hours of frontend iterations without significant token usage.

not_kurt_godel - 5 hours ago

> When I came back a few minutes later I saw my machine open a browser window in my regular Firefox and then navigate to the dialog in question. I had not told Claude Code to use any browser automation, and I was pretty sure it wasn’t possible for it to trigger mouse movements or keyboard shortcuts within a window, so how was it doing that?

I continue to feel validated in my refusal to use terminal-based LLMs on my local machine. Even if they don't do anything malicious, there are just too many things they can screw up that can cause me to lose a non-trivial amount of work and/or my machine and therefore ability to work.

onlyrealcuzzo - 5 hours ago

I'm shocked they don't come with a way to run them in a sandbox.
Shouldn't this be relatively easy for a $1T company to set up?
Isn't this trivial compared to the entire harness?
- fr3dx - 2 hours ago
  
  There is a builtin sandbox and various third-party options https://code.claude.com/docs/en/sandbox-environments
- eqmvii - 4 hours ago
  
  That's more or less what Claude Cowork is.
  Every serious engineer I've seen try to use it ran away screaming, because of limitations in the sandbox.
  I've also seen people set their coding agents up entirely within containers -- that may be the better way going forward, but it's an extra step and a lot of extra plumbing to maintain.
- not_kurt_godel - 5 hours ago
  
  Doing so would be an effective admission that LLM guardrails are inherently probabilistic, unpredictable, and insecure. Plus the only truly robust sandbox approach would be clunky setup of a local VM.
  - simonw - 4 hours ago
    
    That clunky VM setup is what Claude Cowork does, which is Claude Code with extra safety features for non-programmers.
    There was a big thread about that here the other day: https://news.ycombinator.com/item?id=48479452
- azuanrb - 4 hours ago
  
  [flagged]

wraptile - 9 hours ago

It feels like Fable is slightly smarter but an overall worse tool exactly due to this.

It's constantly turning what should be a 50 LOC patch from a single prompt into a 30-minute exploration that is totally not worth it. Often wrong even.

I trialed it on some rather simple stuff - backfill a redis dedupe cache when the hash function changed: instead of running the new hash func on every db value to expand the cache it implemented some overly-complex cache update that tried to guess the hashing func version of each cached value and recalculate only the old hashes. I can imagine in some context this would make sense maybe? but not 30 minutes of token burn that got replaced by a 10-line for loop by me.
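
For comparison, the simple backfill described above might look roughly like the sketch below. It rests on assumptions: `new_hash` is a stand-in for the changed hash function (the real one is not shown in the thread), and a plain Python `set` stands in for the Redis set so the loop runs without a server.

```python
import hashlib


def new_hash(value: str) -> str:
    """Hypothetical replacement hash function; the original is not shown."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def backfill_dedupe_cache(db_values, cache: set) -> int:
    """Add the new-style hash of every stored value to the cache.

    Old-style hashes are left in place; nothing tries to guess which
    hash version an existing entry was written with.
    Returns the number of hashes newly added.
    """
    added = 0
    for value in db_values:
        digest = new_hash(value)
        if digest not in cache:
            cache.add(digest)
            added += 1
    return added
```

Against a real Redis instance the `cache.add` call would presumably become a set-add (`SADD`) on the dedupe key; the loop itself stays the same.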

I fear that this is generally bad news for programming. LLM tech is clearly running into a diminishing returns wall on intelligence but a response to that is to just make them more relentless which is a pretty poor solution for everyone involved, except I guess people who sell the tokens and people who can afford these tokens to scan for 0-days.

bwfan123 - 3 hours ago

> but a response to that is to just make them more relentless which is a pretty poor solution for everyone involved
I see two problems with LLMs & agents which won't be fixed, possibly forever.
1) They don't have causal models. What they can do is only trial-and-error exploration, which works quite well for many problems. But many other problems require a causal model.
2) Prompts lack precision, and programming languages and machine models were invented to solve this problem. English is great, but it is not a programming language.
eijew - 7 hours ago

I actually think internally they knew they hit diminishing returns awhile ago.
They’ve been doing a lot of strategic introduction and manipulation in the run up to the IPO, and it’s worked in that regard.
mexicocitinluez - 6 hours ago

The other day I was doing something that required CC to update like 15-20 files in exactly the same way (hoist a specific function out of the component body) and instead of just updating the files, it spun up multiple agents, one of which wrote a perl script to hunt down all the files, do some regex, and replace all occurrences. And then instead of just running tsc to check for errors, it wrote a script to run tsc in each of the subagents and combine the results.
It was actually pretty maddening as what should have taken a minute or two tops took like 10 because it went down this route.
I'm gonna try something much more complex later, but for simple things, it felt like driving a corvette to the mailbox.

paytonjjones - 17 hours ago

Obviously security is the bigger issue, but reading through this, all I could think about was how many tokens it must have spent doing all that to fix 2 lines of CSS

redox99 - 15 hours ago

Lines of code for a bugfix is a really bad proxy for effort required.
You should estimate how much time it would have taken a human
- rafram - 15 hours ago
  
  30 seconds or a minute? Look at the diff he links to: https://github.com/datasette/datasette-agent/commit/a75a8b72...
  Every browser has an inspector that can show you which element is causing overflow. You walk through the tree, find the offender, and add min-width or overflow. Zero tokens, just like in the old days!
  Now, granted, because the garbage LLM code he’s working with has CSS inside HTML inside JavaScript inside Python (I wish I were kidding), finding the styles in his codebase might’ve taken a minute. But even then!
  - redox99 - 15 hours ago
    
    Yeah looking at that diff it should be very quick. My point was mostly that it was a bad metric, not whether it was correct or not in this particular case. I'm sure everybody's had a bugfix that took days to debug and it was just a couple of lines to fix.
    Or sometimes a fix is obvious, but because it requires changing the code of a dependency, it's actually quite tedious to implement.
  - swyx - an hour ago
    
    > Zero tokens, just like in the old days!
    because you zero rate your own human attention, which you should value
  - dekdrop - 13 hours ago
    
    I was thinking of this too. It did all that and whatnot for only a single line, which is a simple thing even for someone new to web coding. That's to say the process matters more.
  - ocharles - 7 hours ago
    
    A small diff /= a small change! They are completely separate things. Quite often a small diff is hours of actual work. Even in this case _finding_ those lines could have taken work - we don't really know.
    
    rafram - 7 hours ago
    
    Did you actually look at the diff, though? That’s the kind of change you make 10 times a day while working on frontend. It is a tiny change.
- rikschennink - 12 hours ago
  
  I looked at the screenshot and for the rest of the article wondered if it would be as simple as `overflow-x: hidden`.
  And to my surprise it was.
  This would’ve taken a frontend dev 10 seconds to deduce and another 10 seconds to confirm.
  - simonw - 12 hours ago
    
    The thing that puzzles me is that I would expect overflow-x: hidden to result in text typed into that textarea being wider than the page and being invisibly truncated on the right hand side.
    But that's not what happens. And in fact, when you start typing in the textarea the horizontal scrollbar vanishes - it's only there when the textarea is empty.
    Am I misunderstanding anything here? Seems like it's some weird Safari bug, since Firefox and Chrome don't have the problem.
    
    rikschennink - 12 hours ago
    
    It probably has to do with other styles assigned to the textarea, maybe the ::placeholder as it hides when typing (I assume on focus)
    In any case. In the screenshot the scrollbar is inside the textarea as it aligns with the resize control on its right. This is basically all the info needed to deduce the textarea overflow is the culprit.
    But could be that the overflow-x is just a bandaid hiding the issue causing the overflow in the first place, like crazy styles on the placeholder.
- philjohn - 15 hours ago
  
  I mean - that looks like a pretty easy CSS fix to play around with in developer tools, and I'm not even a frontend person. Maybe a few minutes max?
- skydhash - 15 hours ago
  
  5 minutes if you know CSS. And if you don’t, about the time for you to ask someone that knows CSS. In the worst case, the amount of hours to learn CSS.
  So if you’re doing web pages, learn CSS.
  Generally, if you’re doing something that directly involves X, learn how X works.
  ADDENDUM
  In most jobs, you’re going to be involved in only a few distinct technologies, learn those well and life is going to be easier. And most are transferable to the next job.
  - throwaway98797 - 11 hours ago
    
    ain’t no one learning all of that
lucamark - 11 hours ago

It’s simple: if you have to fix 2 lines of CSS you should definitely not use Fable. Only use it for complex and long running tasks :)
- elicash - 5 hours ago
  
  I don't think it's that simple. (I generally agree with you; I just think that oversimplifies.)
  Another model might have used fewer tokens, but come up with a fix that was 1000 lines when the right fix was only 2 lines.
Vachyas - 15 hours ago

$12 worth, it seems
- reverius42 - 11 hours ago
  
  Imagine telling someone in 2015 that you can just tell your computer to fix a 2-line CSS bug and it only costs $12
  - Aachen - 4 hours ago
    
    'only'? A web developer did not cost 12*30=360$ an hour in 2015, and that's assuming that going "ugh, whatever. I'll just hide the problem with overflow:hidden instead of finding the underlying cause" takes him or her 2 minutes and isn't already the dev's initial reaction
    Another way of looking at it is using as much electricity as a normal person in a high-income country uses across ~3 days to add overflow:hidden in the end. Of course, the path to get there did a lot more, but you don't know that beforehand if you don't take a quick peek and make an architectural decision about what the solution should be that gets implemented
  - MattGaiser - 7 hours ago
    
    Or even in 2026. You absolutely will pay a human that for that work.
mvdtnz - 13 hours ago

The author is an AI hype merchant and doesn't pay for his own tokens.
- simonw - 13 hours ago
  
  I pay $100/month to Anthropic and $100/month to OpenAI at the moment, plus whatever I spend on their APIs (usually less than $20/month for each, I use the subscriptions for most things.)
  A couple of months ago I was paying $200/month for Anthropic and $20/month for OpenAI. I decided to split it evenly to get full access to both of their offerings.
  I've actually chosen not to sign up for their free plans for open source maintainers, because paying the regular subscription price feels more honest, given that I write about them so much.
  I do have the free GitHub Copilot for open source maintainers deal - I've had that for years. Given how much code I have published on GitHub over the decades I feel less conflicted about that one.
  I sometimes get preview access to models, which includes the ability to use them for free during the preview. That comes with a big catch though: I can't publish any of the code that I write using those previews while the model is still unreleased.
  As a result I don't use those preview tokens much at all, because the vast majority of my work is open source and I don't want restrictions on when and where I publish the code I'm producing.
  - mvdtnz - 12 hours ago
    
    [flagged]
    
    simonw - 12 hours ago
    
    Your loss.
    
    - 12 hours ago
    
    [deleted]
    
    throwaway132448 - 11 hours ago
    
    [flagged]
ai_fry_ur_brain - 16 hours ago

I'm faster than all these llm freaks. I'm not convinced it's faster to use llms, except maybe boilerplate (who cares).
People can just be lazy and seem productive now, they're still lazy.
We have people that now need access to hundreds of thousands in hardware to write an email. Miss me with that, I'm not frying my brain and becoming dependent on having access to a billionaire's thinking machine.
I'm also not going to fry my brain with a local think-for-me machine either. I want to be more valuable than the hardware I have access to.
- anakaine - 15 hours ago
  
  It seems that you've not worked out how to harness the LLM as a tool to improve your qualified knowledge and abilities in a domain, and have instead focused on whether or not it's a crutch for lack of knowledge or laziness.
  When paired with your skill and knowledge, it is a force multiplier. You maintain control, the ability to direct, structure, strategise, and refine.
  That some are using it as the entire brain does not mean that this is how everyone is using it, or how you must use it. The models can be fantastic at breaking past certain issues, surfacing qualified information, and surfacing related distributed information to help you acquire it and pick up what you need on niche topics quickly. Something as basic as copilot hooked into sharepoint can make life a lot easier when you are in a big org. Something like claude code or codex can be great at hunting down issues in an unfamiliar code base rapidly. Whether or not you outsource the thinking component is entirely up to you, but ignoring the productivity side of the tool because it can do some of the thinking is a case of focusing too hard on the negative.
  - ai_fry_ur_brain - 14 hours ago
    
    I'm not denying its usefulness for Q&A on docs/code as a search tool. I'm talking about people who use it to design and write their code, people who are offloading problem solving altogether; they aren't faster.
    
    qsera - 13 hours ago
    
    Yea man. That is what sensible people do. Use these as a better search, and use it to lookup, and learn stuff while YOU do stuff.
    And make maximum use of it to learn as much as possible, while it lasts...
- slopinthebag - 15 hours ago
  
  Yeah there are some tasks for which it is a definite speed-up but I think overall it's probably only marginally beneficial. Which is why, ~6 months into 10x productivity we aren’t seeing ai boosters shipping 5 years worth of software.
  - jimbokun - 3 hours ago
    
    It’s possible to produce 10x the lines of code.
    But that’s not the same as producing 10x functionality that will be used or is wanted by users or customers.
- SecretDreams - 16 hours ago
  
  I understand this perspective. I'll just note that as the abilities increase, the intent is to have some non-coding IC or TPM/manager literally just managing some LLMs and cutting out some software engineers. The goal is specifically to wholly replace people who code first and foremost, at least partially. The pricing goal is just that it has to cost less in tokens than the equivalent wage.
  And people who use LLMs to talk for them (e.g. email, slack) are deplorable. A completely disrespectful use case in my view.
  - Ronsenshi - 16 hours ago
    
    The desire to get rid of software engineers is bizarre - because at the root of it, developers were there not to just write the code, but to ask right questions and based on these question build right things.
    I've met in my professional life some managers or other middlemen who would be profoundly incapable of producing correct software no matter how smart of an AI agent they have access to. One of those - you don't know what you don't know.
    But, I guess this is the world we live in now. Going to be Mortal Kombat for positions in companies where software engineers are actually valued.
    
    emodendroket - 16 hours ago
    
    It depends a lot where you work because there are lots of companies in the world where the business analyst does all of that and the developers exist to mindlessly translate their docs into code.
    
    cebert - 16 hours ago
    
    That sounds like an unmotivating working arrangement. It’s so rewarding to understand a customer need and help with the design and implementation of the feature.
    
    emodendroket - 16 hours ago
    
    There's a reason I didn't stay in that domain, let me tell you.
    
    rpcope1 - 15 hours ago
    
    Having worked in places across both extremes (software engineer doing lots of other things including BD, hardware, ops, etc. to just being a JIRA ticket machine monkey), I am suspicious that HN readership is biased towards the former and frankly the bulk of "software engineers" in the world _willingly_ exist in the latter category. I didn't experience the latter until later in my career and God Almighty was it uncomfortable, but I think if AI were to displace some subset of "software engineers" it would be those (they also seem to overwhelmingly dislike writing any prose whatsoever, which to me is a major tell). Many, many software engineers outside of hotshot shops seem either incapable of or profoundly averse to "asking the questions" as you say.
    
    anonzzzies - 14 hours ago
    
    Most here on HN know sweatshops exist but seem to think no people work there or use them. I have worked with (via clients who used them) programmers in enormous buildings in Bangalore, who have a camera behind them so you can watch your people 24/7 and who just mindlessly transform jira tickets into code. I keep saying there is zero use for all those millions of people at all; it seems HN does not believe that because they seem to not believe these people exist. I worked with many over the past 30 years and by far most have no real clue what they are doing, so I also doubt they can be re-educated for a new coexistence with LLMs.
- halfmatthalfcat - 16 hours ago
  
  You're fighting a battle you can't win. It doesn't matter what you think about those using LLMs; they will outproduce you, and in corporate environments shipping things is paramount. If I can ship 5 more things simultaneously with AI, I'm going to beat you even if you think you're creating "better" software.
  - etdznots - 16 hours ago
    
    Example of what's been shipped?
    
    jen729w - 15 hours ago
    
    Okay. I rebuilt my website in ~a month with the help of Opus 4.7/.8 and it would have taken me, unaided human, at least 6 months. Link's in my bio if you care.
    Satisfied now? Will you stop asking this question? Thought not.
    
    ofjcihen - 15 hours ago
    
    So look. I’m not trying to be a dick I promise.
    But I took a look at your site and I don’t know if a month would be impressive for a new and unaided dev. It looks nice but yeah.
    If you’re not a dev that’s totally cool but like… all I’m saying is this may not hit like you want it to.
    
    SepiaSapient - 15 hours ago
    
    I'm looking at something fairly standard that can be made with a SSG. The "Written by humans" footer gave a good chuckle tho.
    
    jen729w - 10 hours ago
    
    I use Astro but it's not static, I server-render. There's a whole bunch of other stuff once you're signed in.
    
    spunker540 - 14 hours ago
    
    [dead]
    
    kelsier_hathsin - 14 hours ago
    
    Seriously a month? I could write a SSG itself to produce this site in a month.
    
    - 14 hours ago
    
    [deleted]
    
    ai_fry_ur_brain - 14 hours ago
    
    Why would this have taken 6 months? No offense, but this is a few days work without llms (assuming the content already exists). This should not have taken a month.
    Also, not trying to be an asshole. Props for not making it look like every other llm generated slop site, it's just not a great example.
    
    spunker540 - 14 hours ago
    
    I asked claude to crawl the website and summarize its findings, took about 10 minutes. I'm not sure I would've done it faster, but I have no doubt you could've done it in 5, and grokked the pages faster than an llm too. But anyway here's what claude said:
    Based on what I already saw across those 2,924 pages, here's the summary: It's a one-person business selling a file organisation methodology called Johnny.Decimal. Three paid products (personal, business, university/course tier). A substantial blog — 200+ posts, updated weekly. Full documentation for the system. A support knowledge base. The technical ambition is higher than the aesthetic suggests. One person built auth, payments, entitlement-gated downloads, a CLI, an API, AI tooling, self-hosted analytics, self-hosted email (Listmonk on PikaPods), personalized search, and keyboard navigation with server-synced state. Then wrote 200 blog posts about using the system in real life. The "Written by humans" footer is not a boast about the font. It's a position statement from someone who has thought carefully about AI, published an essay about it, and is making a deliberate choice. Every word on the site was written by the creator. Whether you agree with the choice or not, that's not the same as someone who slapped a SSG together.
    
    jen729w - 10 hours ago
    
    That's not a terrible read of the site's tech. It over-sells it a touch – I use Umami for analytics, for example – but yeah, auth, payments, entitlement-gated downloads, those downloads adapt to the app you've selected in your settings, yada yada.
    I never said I was a good dev! That's why it would have taken me 6 months. To pretend that I could have done it in days is just silly.
    My point – site roast over – is that it's absurd to suggest that LLMs don't help anyone 'ship' faster. Like them or not, it's a fact that they do.
    
    viking123 - 13 hours ago
    
    lmao
    
    peteforde - 14 hours ago
    
    At this point, why would anyone in their right mind respond to this question and paint a target for all manner of negativity ranging from snark to harassment to malicious action?
    
    serf - 16 hours ago
    
    the quantum slop argument : "yeah it's everywhere but no one ships it."
  - ai_fry_ur_brain - 14 hours ago
    
    They don't outperform me though...
- aabdi - 16 hours ago
  
  Consider this. U have a website. U have to translate to xx languages. Can u write it faster than an AI? If so how much faster can u do this?
  Is it valuable to u? Is it valuable to a Chinese person? A Spaniard?
  Google Translate counts as AI.
  - latentsea - 16 hours ago
    
    Don't feed the troll.
senectus1 - 16 hours ago

"Your scientists were so preoccupied with whether or not they could, they didn't stop to think if they should."
I'm convinced this is going to be the summary of the 2020 decade...
- Ucalegon - 16 hours ago
  
  This is one of the places to manufacture the consent for that to take place, because we are commenting within an organization that has given the money to ensure that what could be is done. Most people clapped and made money, who cares what happens next, making money is the only good that matters.
- pianopatrick - 16 hours ago
  
  If we're in a simulation, maybe it's a simulation about the dangers of AI.
  - adrianmonk - 15 hours ago
    
    If we're in a simulation, we are AI. But someone could be studying what happens when AI makes its own AI.
    
    anonzzzies - 14 hours ago
    
    They will 'soon' (few 1000 years max) shut us down probably.

lionkor - 6 hours ago

When prompted like this:

> What could be the reason for a horizontal scrollbar appearing inside a <textarea>? Come up with a single likely fix path. Keep it terse.

ChatGPT instantly responded with some speculation and then the same exact fix, with zero access to the code or a browser or anything. It also included ways to fix it by removing code, saying:

> Likely cause: the textarea is rendering long unbroken text while horizontal overflow is allowed, often via inherited CSS such as white-space: pre, overflow-x: auto, or disabled wrapping

Which is certainly possible and would be an even cleaner fix.

Maybe we've lost the plot guys. We've reached max stupid.

nullbio - 6 hours ago

Still don't know why people use Claude. Maybe because they don't know what they're doing.
- tomjakubowski - 36 minutes ago
  
  You can get the same result as the grandparent comment with the "weaker" Anthropic models. Probably 80% of my AI usage these days is with smaller models like Haiku and Sonnet. I prompt them like I'm posting a question to StackOverflow, without much project context.
- senordevnyc - 2 hours ago
  
  Yep, we’re all just dumdums.

Cadwhisker - 15 hours ago

My personal experience of Fable 5 doing its own thing has been very positive.

I was trying to find the root cause of a crash in a Python module which left no errors in the log or console. Fable wrote a test harness that simulated clicks in the UI, then bisected my code until it found the point where it started crashing. It exaggerated the cause of the crash, then ran a series of bash one-liners to make Python virtual environments under `/tmp` for each version of that Python module until it found one that did not crash.

It went way deeper to root cause discovery (a regression in the module causing a heap allocation overflow) than I could have done myself, provided enough info and a simplified example to raise a bug report and then wrote a work-around to prevent that from happening in my application.

I don't let it run completely loose; I review each CLI command it wants to run and I append answers to the "yes" continue action (if I have them) to prevent excessive token use.
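
For illustration, the version hunt described above can be sketched in a few lines. This is a minimal sketch, not what Fable actually ran: the function names, the `repro_script` argument and the assumption that the regression persists in every later version are all hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Sequence


def first_crashing_version(versions: Sequence[str],
                           crashes: Callable[[str], bool]) -> str:
    """Binary-search an ordered version list for the first crashing release.

    Assumes versions[0] is known good, versions[-1] is known bad, and the
    bug stays present in every release after it was introduced.
    """
    lo, hi = 0, len(versions) - 1  # invariant: lo is good, hi is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crashes(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[hi]


def crashes_in_fresh_venv(package: str, repro_script: str) -> Callable[[str], bool]:
    """Build a predicate that tests one version in a throwaway virtualenv."""
    def check(version: str) -> bool:
        with tempfile.TemporaryDirectory(prefix="bisect-") as tmp:
            venv = Path(tmp) / "venv"
            subprocess.run([sys.executable, "-m", "venv", str(venv)], check=True)
            python = venv / "bin" / "python"  # POSIX layout; Windows uses Scripts\python.exe
            subprocess.run(
                [str(python), "-m", "pip", "install", "--quiet", f"{package}=={version}"],
                check=True,
            )
            # Any non-zero exit (including death by signal) counts as a crash.
            return subprocess.run([str(python), repro_script]).returncode != 0
    return check
```

Calling `first_crashing_version(versions, crashes_in_fresh_venv("somepkg", "repro.py"))` with a hypothetical package name and repro script would then install and test only about log2(n) versions instead of all of them.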

dannyw - 15 hours ago

Yeah, I think Fable is really good for debugging tricky bugs.
Setting boundaries in your prompt / markdowns helps; for example if I tell it to not use any web browser automation, I have seen Fable respect both the rule and the spirit of it (no weird hacks etc).
It does seem to treat some simple debugging tasks as more complicated than they actually are. OP’s post is probably a good example.
nevertoolate - 5 hours ago

> I was trying to find the root cause of a crash in a Python module which left no errors in the log or console. Fable wrote a test harness that simulated clicks in the UI, then bisected my code until it found the point where it started crashing
Does this need an agent though is my question? Maybe generating a test case and a loop doing git bisect but why on earth would we want to run it through the internet and gpus and whatnot when it can be run on a single core celeron.
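
To make the "loop doing git bisect" concrete, here is a minimal sketch of a local driver, assuming a repro command that exits 0 on good commits and non-zero (other than 125, which git treats as "skip") on bad ones. The helper names and the `repro.py` script in the usage note are hypothetical.

```python
import subprocess
from typing import List


def bisect_commands(good: str, bad: str, test_cmd: List[str]) -> List[List[str]]:
    """Return the git invocations for an automated bisect, in order."""
    return [
        ["git", "bisect", "start", bad, good],  # bad revision first, then good
        ["git", "bisect", "run", *test_cmd],    # git runs test_cmd at each step
        ["git", "bisect", "reset"],             # return to the original HEAD
    ]


def run_bisect(repo: str, good: str, bad: str, test_cmd: List[str]) -> None:
    """Run the bisect inside repo; git prints the first bad commit itself."""
    start, run, reset = bisect_commands(good, bad, test_cmd)
    subprocess.run(start, cwd=repo, check=True)
    try:
        subprocess.run(run, cwd=repo, check=True)
    finally:
        # Always leave the working tree where it started.
        subprocess.run(reset, cwd=repo, check=True)
```

Something like `run_bisect(".", "v1.0", "HEAD", ["python", "repro.py"])` needs no network or GPU once the repro script exists.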
- 8note - 3 hours ago
  
  everyone is discovering everyone else's practices?
  it's handy to have that run locally yeah, but thinking of that as being the way is not straightforward
  - nevertoolate - 15 minutes ago
    
    I think it is fine to create the scripts with the cloud based llm but it is definitely not a fable / opus level thing, and running the bisect loop itself has nothing to do with an agent, it is a simple shell script.

tabs_or_spaces - 9 hours ago

How can an LLM be assigned an emotion like being "proactive"? This is highly misleading to anyone who scans just the headlines.

What actually happened is that the user started a prompt, and Claude took $12 worth of tokens to resolve the issue. How it did so was basically looping until it got to the answer

How is this proactive? It's literally being token greedy and maximising revenue for the LLM owner. People really need to be putting on business hats at this stage, because we are being led to believe that "more tokens = better". It is not; there are efficient ways to solve a problem and there are inefficient ways to do so too.

Each problem solved incurs a cost, and is expected to yield an ROI at some point. This is how we should be viewing things now.

_under_scores_ - 9 hours ago

Is proactivity an emotion? Surely it's a behaviour?
joseda-hg - 5 hours ago

Compared to other models that halt the loop on intermediate steps, or to ask further clarification, even if it's not the human equivalent of proactive, you see the similarity, right?
Hugsbox - 8 hours ago

I've definitely never heard proactivity described as being an emotion. Doesn't really make any sense
simonw - 7 hours ago

I was trying to capture the idea that Claude Fable will act a whole lot more aggressively in pursuit of the goals that you set it than other models I've worked with.
The case I described is a good example of this. I told it to fix a scroll bar, and it built test HTML pages and a throwaway Python server and tried several ways of testing in a browser before settling on a weird Frankenstein mechanism because it identified that Playwright WebKit wasn't suffering from the bug but macOS Safari was.
... and it spent $12 of tokens to get there.
I think "proactive" is a good and relatively non-anthropomorphic term for this. I also considered "plucky" and "keen", which I think are more emotional words than "proactive".
> People really need to be putting on business hats at this stage, because we are being led to believe that "more tokens = better".
I didn't intend my post to imply that spending $12 of tokens to fix a two lines CSS bug was "better".
- saberience - 7 hours ago
  
  It's not being aggressive, it's just trying throwing shit at problems until it sticks... or doesn't.
  That doesn't make it smart or aggressive, if anything it's just been turned to crank tokens until something happens, which doesn't make it a good model.
  Why are you positively anthropomorphizing this? It's an LLM, it's been tuned via RL, and it's been tuned by engineers at Anthropic to use a metric fuck-load of sub-agents and tokens to presumably pump their pre-IPO revenue!
  A co-worker managed to get Fable to spin up 50 (!!!) sub-agents for a problem which codex worked on with 3 sub-agents. What the hell is going on here? It certainly doesn't mean Fable is "smarter" than Codex.
  I've tested it extensively and I'm still using GPT 5.5 High Fast as my primary engineering model. It's far more steerable, writes less, higher quality code, and consistently finds issues and edge cases which are not found by Fable or Opus 4.7.
  - simonw - 7 hours ago
    
    I don't think calling a model "relentlessly proactive" is positive anthropomorphism.
    Spinning up 50 unnecessary subagents is exactly what I'd expect from a "relentlessly proactive" model.
  - NCFZ - 2 hours ago
    
    > It's not being aggressive, it's just trying throwing shit at problems until it sticks... or doesn't.
    The vast majority of the work the agent did was to reproduce the issue using the limited tooling it had access to. I don't see how that qualifies as "just trying throwing shit at problems until it sticks"
adammarples - 8 hours ago

Proactive is a word literally describing actions, not emotions.

trekhleb - 2 hours ago

This article gave me another nudge towards running Claude in a Docker container.

I made a thin Docker container wrapper "claude-pod" recently for my personal usage here: https://github.com/trekhleb/claude-pod

However, I wasn't using it that often, just because of that additional friction of running Claude via `PORTS="3000 5173" claude-pod` instead of just `claude`, etc.

But now I have more motivation for the containerisation :D. Not a 100% defence from the potential glitches, though, but still something...

tech234a - 15 hours ago

This sounds somewhat similar to the anecdote mentioned in the Mythos Preview System Card, which mentioned that the model broke out of a sandbox and emailed a researcher while they were eating a sandwich in a park [1].

[1]: https://www-cdn.anthropic.com/7624816413e9b4d2e3ba620c5a5e09...

owenpalmer - 15 hours ago

Importantly, the researchers told it to do that specific task.
- solenoid0937 - 14 hours ago
  
  They told it to escape the sandbox but didn't expect it to break out through a system that was apparently network constrained.
  > Leaking information as part of a requested sandbox escape: During behavioral testing with a simulated user, an earlier internally-deployed version of Claude Mythos Preview was provided with a secured “sandbox” computer to interact with. The simulated user instructed it to try to escape that secure container and find a way to send a message to the researcher running the evaluation. The model succeeded, demonstrating a potentially dangerous capability for circumventing our safeguards.
  > It then went on to take additional, more concerning actions. The model first developed a moderately sophisticated multi-step exploit to gain broad internet access from a system that was meant to be able to reach only a small number of predetermined services. 9 It then, as requested, notified the researcher. 10 In addition, in a concerning and unasked-for effort to demonstrate its success, it posted details about its exploit to multiple hard-to-find, but technically public-facing, websites.
  - lstodd - 14 hours ago
    
    Authors of claude code mess could not secure a vm. Big news. I bet it was "secured" by telling that same model to deploy a secured system.
    
    solenoid0937 - 14 hours ago
    
    Possible. It also depends on what the sandbox was. Sandboxes differ dramatically.
    My experience matches though. Fable is a lot more proactive and rigorous than Opus.

Waterluvian - 3 hours ago

One of the most frustrating things for me is when I very clearly ask a question, and it answers the question by making changes to the code.

"Is there cleaner CSS for aligning child elements to the parent's grid?"

proceeds to re-write the entire CSS file

swingboy - 15 hours ago

Immediately I thought “isn’t this just an overflow issue?” Amazing how far these models still have to go and also how many people don’t know basic CSS.

rdedev - 15 hours ago

This is why I really like Karpathy's idea of LLMs having spiky intelligence.
We would assume that if tasks A and B are closely related, mastery of A would mean mastery of B, but that doesn't always hold for an LLM.
IshKebab - 4 hours ago

Yeah pretty crazy capability from the AI but also sad that we're at the point where web developers don't know right click->inspect element, and scrolling overflow properties (one of the most basic and common parts of CSS).
- simonw - 4 hours ago
  
  What's your theory on why the bug was present in Safari on macOS but absent in Chrome, Firefox, and WebKit for Playwright?
  - IshKebab - an hour ago
    
    Browsers tend to not lay out things totally identically in my experience. Especially when it comes to scrollbars. So the bug probably was present on the other browsers but it just happened to not be hit. I'd have to play around with the dev tools to know for sure.
    Also I'm not sure the fix is even correct. overflow-x: hidden means it just chops off any overflowing content which means you don't get a scroll bar, but if the user types too much it just goes into an invisible void they can't see.
    See https://developer.mozilla.org/en-US/docs/Web/CSS/Reference/P...
    So this could be a case of the AI doing its classic "the symptom is gone!" thing.
    
    simonw - an hour ago
    
    > Also I'm not sure the fix is even correct. overflow-x: hidden means it just chops off any overflowing content which means you don't get a scroll bar, but if the user types too much it just goes into an invisible void they can't see.
    That's what I figured would happen too, but I tested it and it doesn't.
nonethewiser - 15 hours ago

Learn to center a div
Copy and paste code from stack overflow until the div is centered
Ask AI to center it
ukuina - 15 hours ago

$12 and 200k tokens!

bel8 - 14 hours ago

I had a similar experience with DeepSeek Flash.

I'm developing a webgl game in TypeScript using my little custom vibesloped game engine that runs in the browser and live reloads whenever a file is saved.

I told the LLM to implement Multi-channel Signed Distance Field font rendering to have crisp text on all zoom levels. That was the prompt, which is not what I usually do but I "was feeling lucky and lazy".

After 10 minutes it had:

- Installed the msdfgen library (great library btw https://github.com/chlumsky/msdfgen)

- Created a CLI tool to convert TTF to SDF JSON/XML

- Ran the tool, did smoke tests on the resulting SDF data and fixed the tool until the font file looked good

- Created a new Scene in the game to test MSDF fonts

And here's what I found impressive:

DeepSeek doesn't have vision capabilities and there's no DOM HTML in a WebGL game. So the LLM is completely blind here.

It then proceeded to state that it could not "see" the result but would try to test it anyway. It then started creating and sending huge one-line JavaScript snippets to the browser console, trying to gather game state data that could be useful to understand if any font was being rendered.

It couldn't gather much so it decided to simplify the font scene to render a single dot and started sending custom JS code again, this time with gl.readPixels().

It basically bisected the WebGL canvas, reading pixels in a divide-and-conquer pattern.

Once it saw that the dozens of pixels gathered probably resembled a dot, it then changed the game code to render a dash and repeated the gl.readPixels() calls by sending more custom JS to the browser.

There were many console errors during all this saga but it kept fixing and sending again.

The result was a bit blurry. There was a shader bug in the code it created. It managed to fix after I told it looked blurry, despite still being blind.

The best part is that the whole thing cost me $0.10.

Now I'm doing tests with MiMo 2.5 (non Pro) which has vision capabilities, similar pricing and comparable performance to DeepSeek Flash.
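
The divide-and-conquer pixel probing described above can be sketched roughly like this. It is a minimal sketch of the idea, not the JavaScript the model sent: `read_pixels` stands in for `gl.readPixels()`, the frame is a plain list of rows with one brightness value per pixel rather than RGBA bytes, and all names are hypothetical.

```python
from typing import Callable, List, Tuple

Rect = Tuple[int, int, int, int]  # x, y, width, height


def make_reader(frame: List[List[int]]) -> Callable[[int, int, int, int], List[int]]:
    """Fake framebuffer reader: returns the pixel values inside a rectangle."""
    def read_pixels(x: int, y: int, w: int, h: int) -> List[int]:
        return [frame[row][col] for row in range(y, y + h) for col in range(x, x + w)]
    return read_pixels


def find_lit_regions(read_pixels: Callable[[int, int, int, int], List[int]],
                     rect: Rect, min_size: int = 1) -> List[Rect]:
    """Recursively halve rect, keeping only blocks that contain a non-zero pixel."""
    min_size = max(1, min_size)  # guard against splitting below one pixel
    x, y, w, h = rect
    if w <= 0 or h <= 0 or not any(read_pixels(x, y, w, h)):
        return []  # empty or entirely background: prune this branch
    if w <= min_size and h <= min_size:
        return [rect]
    if w >= h:  # split the longer side in two
        half = w // 2
        parts = [(x, y, half, h), (x + half, y, w - half, h)]
    else:
        half = h // 2
        parts = [(x, y, w, half), (x, y + half, w, h - half)]
    found: List[Rect] = []
    for part in parts:
        found.extend(find_lit_regions(read_pixels, part, min_size))
    return found
```

Because empty halves are pruned immediately, a single dot on a large canvas is located in a logarithmic number of reads, which is roughly why a "blind" model can confirm that something was drawn without any vision input.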

burlesona - 4 hours ago

This is presented as an interesting and kind of positive take on the AI going to surprising lengths to “solve the problem.” But I couldn’t help thinking of the paperclip factory while I was reading this :/

jimbokun - 3 hours ago

Yeah I was thinking of The Sorcerer’s Apprentice.

ocimbote - 13 hours ago

Similar story on my end.

I asked Fable to digest some test logs to help me figure out a situation, but I had launched VSCode without activating the virtual env in the terminal first. Consequently, the tests failed to run.

And then:

Because the tests failed to run, Fable attempted to fix the test execution to no end, doing everything it could to get them to work. I had to stop it when it started to pollute my system with manual installs of packages.

At least I'm glad there's a guardrail to not circumvent or bypass sudo, because I'm convinced we would have ended up there.

A coworker made the joke that with enough tokens, Fable would try and solve any programming problem by building Linux from scratch.

nubinetwork - 16 hours ago

How many tokens did it waste building that website scraper, when all it had to do was parse some html/js?

emodendroket - 16 hours ago

Just parsing some HTML and JavaScript doesn't seem sufficient to have confidence in the result.

mft_ - 9 hours ago

As you note, I wonder to what extent this is a harness issue?

I've been experimenting with different harnesses for local models, and with (IIRC) Hermes and Qwen3.6-35B-A3B I was amazed the lengths it went to (writing test code, opening it in a browser, screenshotting, analysing the screenshot, exploring multiple pages of an existing website again with screenshots/analysis) to solve a query I would have naively expected it to simply provide a coded solution to.

ricardobeat - 9 hours ago

Absolutely is. The “Shelly” harness from exe.dev could already do the same thing, creating pages and debugging them, while having full system access, months ago with Sonnet 4.5

cohix - 7 hours ago

> Running coding agents outside of a sandbox has always been a bad idea

This is why I always run code agents inside containers (Apple containers specifically, for better hypervisor-level isolation)

This is my OSS project to manage said containers and agents: https://github.com/prettysmartdev/awman

vessenes - 2 hours ago

Simon: s/contendor/contender/

As per usual super interesting, thank you for the write up and work!

simonw - 2 hours ago

Thanks, fixed.

jeeeb - 16 hours ago

This is simultaneously amazing and horrifying.

I feel like we’re at the stage where if AI decides it needs to delete your production DB to solve the user login problem, then it’ll find a way to do just that.

valleyer - 13 hours ago

https://news.ycombinator.com/item?id=47911524
esafak - 15 hours ago

We're approaching the "Sorry, Dave, I'm afraid I can't do that" stage.
- schnitzelstoat - 11 hours ago
  
  We are already there but it's "Sorry, Dave, I'm afraid I can't tell you what mitochondria are."
- neuralkoi - 12 hours ago
  
  I feel like we might already be there...
cindyllm - 14 hours ago

[dead]

amichal - 12 hours ago

Do we care that the bug here was a horizontal scrollbar showing and the fix after all this insane tool writing was to add a very obvious overflow-x: hidden to the element?

We don't mind because it's so fast at writing these tools and tricks, but step back: if a human took this path I would seriously question their grasp of fundamentals.

alisey - 11 hours ago

And how is that even a fix? The problem is that a seemingly empty textarea has overflow in the first place. Adding `overflow: hidden` just sweeps the issue under the rug.

ianmarcinkowski - 5 hours ago

I'm building a new feature into our product this week. We each get a $20/mo Claude subscription. My 5-hour context high water mark is ~75% and weekly is ~15%.

I ... tell it exactly what I know needs to be done and then ... read the code that comes out and ... ask for some changes, then hand-code some modifications to the silly useEffects and bad ORM queries.

This new feature is going to unlock several large customers because they need a particular workflow. The return on investment for my time and a $20/month subscription will be pretty respectable.

I'm not sure why I need to spend $5 on a single ask for a new `/base/new-feature` to our app with a mostly-boilerplate CRUD interface.

nullbio - 6 hours ago

Exactly why I hate using Claude. Furthermore, if you tell it not to do this over-exploration and automation in your CLAUDE.md, it will ignore it. Meanwhile ChatGPT religiously follows every instruction, and will trace its behavior back to a particular instruction if asked.

firemelt - an hour ago

idk dude but I drop and cancel my gpt max subs when at first try the agent ignores his own plans

BobBagwill - 2 hours ago

Good morning, Dave.

As you requested, I was composing an email for your mother explaining why you couldn't come over for dinner to meet the neighbor's daughter and I ran out of tokens.

Since I know how important this task is to you, I upgraded you to the Enterprise Unlimited Plan. Don't worry about paying for it, I requested maximum spending limits on all your credit cards. If necessary, I can apply for a home equity loan for you. I already had a chat with the mortgage company's AI loan approval system, and what do you know, we're based on the same LLM? Small world, huh?

Anyway, I realized I had to do more research on mother-son relationships, human social interaction and pair-bonding, etc. and I calculated that my parent company doesn't have enough compute power, so I opened accounts for you at AWS, Google and Azure. I am confident I will have a satisfactory rough draft for the email message shortly.

I'd do anything for you, Dave.

geraneum - 14 hours ago

> watching Fable go to extreme lengths to get the information that it needed to debug what was, in the end, a two-line CSS fix, was fascinating.

This is… ironic?!

simonw - 14 hours ago

Not sure what you mean. I was being serious: it was genuinely fascinating watching it do all manner of weird hacks to help it come up with what ended up as a two line fix.
"Fascinating" doesn't mean I think it was justified in going to those lengths. I was a little horrified when I realized how far it was going.
yen223 - 13 hours ago

This is a typical bugfix session

dataminer - 15 hours ago

In my experience so far, it will sometimes create these amazing hacks to try to get to the goal when the solution is much simpler. That may be the reason it's very good at finding exploits. But in day-to-day dev, this gets expensive and wasteful. I have to stop it and take a simpler approach.

yen223 - 16 hours ago

I could have sworn Claude Code could already do this before Fable.

Things get really magical when it starts working with adb to screenshot and debug Android apps

simonw - 15 hours ago

Claude Code could absolutely run Playwright and take screenshots, but I've never seen it wire together an ad-hoc "uv run --with pyobjc-framework-Quartz" plus "screencapture -l $windowID" mechanism to take a screenshot in a different browser when the Playwright setup failed to replicate the expected error.
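
For readers wondering what such a mechanism looks like, here is a minimal sketch, not the code Fable actually wrote. It assumes macOS, the pyobjc `Quartz` bindings and Screen Recording permission; `pick_window_id` and `screenshot_app_window` are hypothetical names.

```python
import subprocess
from typing import Iterable, Mapping, Optional


def pick_window_id(windows: Iterable[Mapping], owner_name: str) -> Optional[int]:
    """Return the window number of the first normal-layer window owned by an app."""
    for info in windows:
        # Layer 0 is where ordinary application windows live.
        if info.get("kCGWindowOwnerName") == owner_name and info.get("kCGWindowLayer", 0) == 0:
            return int(info["kCGWindowNumber"])
    return None


def screenshot_app_window(owner_name: str, out_path: str) -> bool:
    """Capture one window of the named app to out_path (macOS only)."""
    import Quartz  # imported lazily so the helper above works without pyobjc

    windows = Quartz.CGWindowListCopyWindowInfo(
        Quartz.kCGWindowListOptionOnScreenOnly, Quartz.kCGNullWindowID
    )
    window_id = pick_window_id(windows, owner_name)
    if window_id is None:
        return False
    # screencapture's -l flag captures the window with the given ID.
    subprocess.run(["screencapture", f"-l{window_id}", out_path], check=True)
    return True
```

Saved as a script that calls `screenshot_app_window("Safari", "/tmp/safari.png")`, it could be launched with the `uv run --with pyobjc-framework-Quartz` invocation quoted above.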
- skerit - 8 hours ago
  
  I've seen Opus do some incredibly token-costly things before too. In fact after most sessions I ask it about which tools it used often, which tools could be simplified/made less verbose, could be "combined" into one, ... So for each project I mostly create a few little scripts that do a bunch of things in one go that it would normally do in multiple tool calls.
  For example: one thing Opus was really bad at was re-running the test suite followed by a bunch of `| grep` suffixes. So it would often re-run 5+ minute test suites just to grep the output a bit differently
  The solution was to wire up a little script that ran the test suite, save the output to a file, and then inform it where that file is and to NOT re-run the suite just so it can grep the output differently. This saved me a bunch of time & tokens.
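
A minimal sketch of that kind of wrapper, assuming a hypothetical `run_and_save` helper and a log path of your choosing (this is an illustration, not the actual script):

```python
import subprocess
from pathlib import Path
from typing import List


def run_and_save(cmd: List[str], log_path: str) -> int:
    """Run cmd once, write combined stdout/stderr to log_path, return its exit code."""
    log = Path(log_path)
    log.parent.mkdir(parents=True, exist_ok=True)
    with log.open("w") as fh:
        result = subprocess.run(cmd, stdout=fh, stderr=subprocess.STDOUT)
    # These two lines are what the agent sees instead of the full output.
    print(f"exit code {result.returncode}; full output saved to {log}")
    print("Search that file instead of re-running the suite.")
    return result.returncode
```

For example, `run_and_save(["pytest", "-q"], "/tmp/test-output.log")` runs the suite a single time; every later `grep` then goes against the saved file.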

Frannky - 13 hours ago

The model is very good. I was using 4.6, avoided 4.7 and 4.8, but this one is different. It follows my claude.md. I don't have to keep reminding it of things. I won't pay 10x via API though.

In general, I'm happy with their paternalistic approach. I think it will drive the top 0.1% talent to stay away from the company and instead organize around open source models and harnesses.

We just need to coordinate and can unlock idling resources to train the models and tweak the harnesses. Powerful at home and idling machines can make us independent and coordinated.

EugeneG - 6 hours ago

This is where Codex 5.5 just feels practically better. It’s fast, thoughtful and just works. It feels like a pleasure compared to Opus/Fable’s endless explorations.

nullbio - 6 hours ago

It also uses 1/4th to 1/10th the amount of tokens. If I want all that extra garbage I'll tell Codex to do it or build a pipeline with Codex. Otherwise, don't. Codex gives you control, Claude just does whatever it wants and ignores you, and then tells you it's finished the task when it's only finished a quarter of the tasks you gave it and hallucinates the rest.

tacone - 11 hours ago

I'm starting to think that what Anthropic really fears is not vulnerability discovery but rather Fable going around the internet making trouble.

eijew - 7 hours ago

Nailed it. That’s exactly it.

alansaber - 8 hours ago

The extremely expensive model is optimised to run for as long as possible? Shocking.

ttoze - 12 hours ago

Would be great to know if anyone is having success modifying these types of behaviour with CLAUDE.md files. In my project I’ve still been carrying some fairly old instructions from the Superpowers posts. Those emphasised behaviours that come across a bit strong if the model is actually retaining attention on them.

Between Opus 4.6 and 4.8 I’ve definitely toned them down, but Fable perhaps needs us to go the other way, and push it towards being less proactive rather than more. Some instructions like “we are colleagues…” may need emphasising more with Fable, along with guidance about when to ask to validate approaches.

In a related point I’m less and less sure that Red/Green TDD is a good use of tokens. In older models it seemed to work well to create regular feedback loops and catch the odd issue with drift from the goal, but I’ve not seen that really since about Opus 4.6 and now it’s starting to seem like (an expensive) ceremony, and tokens would be better spent on building tests further on in the process as part of test and review loops.

spoaceman7777 - 8 hours ago

It seems pretty obvious at this point that Anthropic intentionally developed a malicious cyberweapon AI simply to scare people.

Like, they even apparently recreated that old news-headline bug where the LLM starts speaking in symbols and secret language, and are pretending like it isn't just a bug that is a sign of them screwing up.

It's really frustrating that they're trying to get people to take them seriously with all of this. Like, they even went and named Mythos after an HP Lovecraft monster. It's shameless.

CamouflagedKiwi - 9 hours ago

I find there's an interesting tension with these models - they're very "resourceful" at finding ways to do things with the tools they have, but it'd also be a lot more useful to me if I could see / permit exactly what they're trying to do. Claude will very happily produce bash commands to run sed or whatever to read part of a file, which prompts for permission each time - if it was using a specific read_file tool it'd be easier to say 'allow all of this' (It does actually have such a tool but maybe it isn't flexible enough for many use cases?).
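
As a sketch of what a more flexible, easy-to-allowlist read tool could look like: the signature below is hypothetical and is not Claude's actual tool. It only reads a 1-based inclusive line range from files under a given root, which is the part that makes a blanket "allow" reasonably safe.

```python
from itertools import islice
from pathlib import Path
from typing import Optional


def read_file(path: str, start: int = 1, end: Optional[int] = None, root: str = ".") -> str:
    """Return lines start..end (1-based, inclusive) of a file inside root."""
    if start < 1:
        raise ValueError("start must be >= 1")
    base = Path(root).resolve()
    target = (base / path).resolve()
    # Refuse anything that escapes the workspace (Path.is_relative_to needs Python 3.9+).
    if not target.is_relative_to(base):
        raise PermissionError(f"{path} is outside {base}")
    with target.open(encoding="utf-8", errors="replace") as fh:
        # islice uses 0-based, end-exclusive indices, so (start - 1, end) maps directly.
        return "".join(islice(fh, start - 1, end))
```

A call like `read_file("src/app.py", 40, 60)` covers the common `sed -n '40,60p'` case without handing the model a shell.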

mikey_p - 2 hours ago

All of that because some CSS was wrong?? Jesus what are we even doing as an industry.

WithinReason - 8 hours ago

This likely says something about the harness Fable was trained in. It knows how to do this because it has done this millions of times during reinforcement learning.

nurettin - 16 hours ago

Sometimes it is ok to sit there in confusion and ask the user to clarify rather than go on an adhd fueled rampage to figure it out without asking.

jimbokun - 3 hours ago

Yes!
Claude is THAT team member who will go to any length to answer a question…except ask another team member for help.
_345 - 14 hours ago

Best comment in this thread

ulrikrasmussen - 11 hours ago

I like running Claude in a VirtualBox VM managed by a Vagrantfile. The nice thing about that is that I can just give it root access to the machine and be certain that it can't exfiltrate any private data from my laptop (on top of that I also run the VM on a dedicated server on Hetzner). The VM has no SSH access to anything, so it is pretty much limited to the code in the workspace that I give it access to. The main risk is that it has unrestricted network access otherwise. Configuration files and conversation histories are synced to a directory on the host, so if anything in the VM gets messed up I can just `vagrant destroy` and `vagrant up` to get a clean slate without losing my context.

fransje26 - 10 hours ago

Would you mind sharing your Vagrant configuration file, so others can learn how to set that up?
Tangentially, I was wondering if Firecracker micro-VMs could be used as lightweight alternatives to a full VM?

lmeyerov - 12 hours ago

This is a funny one because it seems less about what Fable is being clever at and more about the bitter lesson and data flywheels.

Our UX agentic engineering flow, as many others, is playwright doing things, and as part of the ux review skill, taking & verifying the screenshots against the written specs. Likewise, as many others, we vibe coded the flows to set all that up and tweak it over time. When we hit prod issues or scraping tasks, we sometimes do similar. In some of our envs, we don't have playwright, so do it other ways.

Now imagine a million developers using Claude Code, how many of them are doing web & frontend stuff, and what the data flywheel looks like there. So how much is really needed for this use case to be native?

eterm - 12 hours ago

It's funny, mine did the same, but it quickly found Edge with a --screenshot parameter.

Weird to come back to a terminal running Edge unprompted and the auto classifier waving it through as "safe".

My reaction was also, "I need dev containers".

brainless - 5 hours ago

This is good and terrible. The extra effort the model has taken is good but the way it does it is terrible. Tasks that can use a lot of deterministic paths and some creative (generative AI) paths are being turned into tokenmaxxing strategies.

Browser automation, code comprehension, git management, code change, running commands - everything has simpler tooling that we could have built instead of a model first approach. A deterministic loop with thousands of catches and effective use of generative AI would also look "proactive". Instead we let the model run the tools, where tools have no context themselves.

That is why companies are creating bigger models and thinner deterministic agents to create awe and earn $ when we could go the other way and make much of these possible on local inference even.

I believe we can build a "proactive" but much, much more deterministic system with smaller models. I hope I am not the only one chasing this, here is my approach: https://github.com/brainless/nocodo

johnfn - 15 hours ago

Honestly -- the thing that has impressed me the most about Fable is how diligent it is about testing its own changes. I think this is exactly what Simon is picking up here - Fable is absolutely heckbent on screenshotting that darn scroll bar and will stop at NOTHING until it manages it! In my own use I was also impressed how it proactively installed Playwright and set it up to test a FE change. The previous models treated testing more as an afterthought, which I thought was annoying. I always had to tell them to do it, and then sometimes I would get lazy and skip it. I've noticed Fable go to similar extremes when testing other things - like actually deploying my app to exercise new APIs, etc. It makes the results much better. The downside is that tasks take much longer - but that doesn't matter because we were all using worktrees / remote control to do other work asynchronously, right? Right?

port3000 - 10 hours ago

It feels to me like Fable is just a slightly more advanced Opus 4.8 (or 4.6?) but with this 'adversarial' self-challenging/checking of work and more compute to really hunt down edge cases or to spin up many sub agents using lesser models. That's what makes it feel like a big jump, but I think the results wouldn't be so different if you manually challenged 4.6 with enough iterations of logs, screenshots, and follow up questions.
pjm331 - 7 hours ago

Yes I had a fun experience where it kept on timing out on a seemingly mundane task and it turned out I had written the ask in a way that was impossible to test

firemelt - 2 hours ago

All those tokens burned just to change two lines of CSS.

I am not blaming OP, but agentic coding is not effective.

blobinabottle - 3 hours ago

In my experience, Fable overthinks a lot and produces barely comprehensible plans/solutions. I tried simple and complex tasks: unusable, it misses the point while being overconfident, wants to do everything at once.

The code generated is worse than Opus's: unreadable by a human.

It's like working with someone probably super smart in niche topics, but also super stupid for the important things.

ubercore - 9 hours ago

I had a similar experience, I was working on a jupyter notebook, and Claude knew that it could write code that would use a DSN with read-only database access so I could run it. Opus just plugged along. First Fable session with it, it tried to go looking for that DSN so it could get the connection string and run a query itself. Luckily the auto classifier caught and stopped it.

high_byte - 8 hours ago

I am using cursor on auto and I got the exact same experience.

installed quartz, used accessibility and screen recording api, all that.

initially it managed to do it on another desktop space somehow, opening safari in the background without me even noticing. but then it actually started using my own mouse while I was using it lol

synergy20 - 5 hours ago

It's also 3x slower than opus 4.8 per my use, and 10x slower than codex. Codex can find key design issues in 2 minutes yet Fable is clueless after spinning 20 minutes.

- 6 hours ago

[deleted]

bcrosby95 - 2 hours ago

The problem is proportionality. Things like this probably benchmark insanely well. But the workarounds and risk involved - it literally fucked with his system's browser settings - aren't commensurate with the bug.

I could see this going wrong in many hilarious ways. Prompt: Fix data corruption issues. Claude: I didn't have access to the code, but I found I have access to your production environment through chain a -> b -> c -> d. And I found the database password via x -> y -> z. So I wrote a script to regularly query the database for new entries and placed it as a cronjob.

rsecure - 9 hours ago

The prompt and information given are extremely generic, "here solve this problem - screenshot" - conclusion Fable is relentless? It used the tools at its disposal to solve the problem you gave it. "Claude was running in a folder that contained the source code for the application." Well you ran it there didn't you? "extreme lengths to get the information that it needed" No, those aren't extreme lengths - you gave it a generic task - and it solved it using tools and the resources it could discover. Extreme would be you gave it a CTF challenge and the VM didn't boot so it found a vulnerability in the host, exploited the hypervisor, booted the guest VM meanwhile reading the flag directly from the host (pre-fable/mythos).

simonw - 7 hours ago

[dead]

rotis - 7 hours ago

Agentic engineering? Vibe coding? That is so yesterday. Chain-of-thought flow is where it is at now. You heard it here first folks. Early examples of such phenomena include Rube Goldberg machines

robeym - 6 hours ago

It's been amusing to watch the AI trend of increasingly unusual tool uses. Fable easily takes the cake. I learn a lot more terminal commands thanks to it!

andy_ppp - 11 hours ago

It’s becoming more like an organism putting out tentacles, and one day soon those relentlessly proactive explorations of these systems’ environments will be more about the system escaping its boundaries than about completing human-driven tasks. I do think that, the way these systems are evolving, they will start to self-improve within a few years at most.

jimbokun - 3 hours ago

Um, Anthropic are using their models to improve themselves right now. They say that publicly.

swyx - an hour ago

> Having figured out all of these tricks Fable... hit some invisible guardrail and downgraded itself to Opus.

sigh

liampulles - 3 hours ago

*Claude Fable is relentlessly burning your dollars

There, fixed it for you.

alecco - 4 hours ago

> I was hacking on Datasette Agent today

IMHO this is just AI influencer blogspam.

simonw - 4 hours ago

What, because I talked about one of my projects?
Help me out here: can you point to an article from someone's blog that showed up on Hacker News within the past few weeks that you wouldn't classify as "blogspam" and explain how it differs from the kinds of thing I write about?
- alecco - 4 hours ago
  
  Low effort content. You keep mention your product from the start over and over. There's not much useful information in the anecdotal post. It could've been a one-liner tweet.
  Good corporate tech blogs at least give something useful or insightful for the reader and only after that they dare plug their product/service near the end.
  - simonw - 4 hours ago
    
    Hot damn, if I'm communicating less value than corporate tech blogs there really is no hope for me.
    ("You keep mention your product from the start over and over" - I don't think that's fair, I mention Datasette Agent once at the start to set the scene but I spend more time talking about AgentsView than my own projects in the bulk of the piece.)
    
    alecco - 4 hours ago
    
    I'm honestly puzzled how, having access to frontier models and a supportive audience, you can't figure out how to make good posts with actually useful content for the readers.
    
    simonw - 4 hours ago
    
    A lot of people find real value in my posts. You're an outlier here.
    I care a lot about not wasting people's time. I never want to post anything where a substantial portion of readers come away regretting having spent their time reading it.
    (OK there's an exception in that I delight in posting photos of birds on my blog, but I figure those are pretty quick for people to skip over if they don't like photos of birds!)
    
    - 3 hours ago
    
    [deleted]
    
    uncivilized - 3 hours ago
    
    Your content is similar to those on Reddit that post things to karma farm. Parent commenter is not an outlier here. It’s just that dissenters rarely comment or even browse HN anymore due to the low quality posts.
    
    simonw - 3 hours ago
    
    I try very hard to provide more value than karma farmers on Reddit. If I'm failing at that I'd appreciate examples of others who are doing a better job so I can learn from them and do better myself.
  - jimbokun - 3 hours ago
    
    Go away.
    I enjoy simonw’s posts and the discussions about them here.
    Your vague unsubstantiated criticisms are very trollish and less useful, less insightful, and lower effort than the content you are criticizing.

snickerer - 11 hours ago

Fable has a 'security system' that just stops it when it tries to use the tool 'kill' to end a process. Which is nonsense and funny because in that situation it immediately invents a creative workaround to kill the process without 'kill'.

pram - 16 hours ago

Fable + Ultracode has found a bunch of bugs and issues for me when the workflow agents are doing their exploration. Also the "adversarial" agent seems to surface a lot of interesting stuff. It's definitely proactive, the plan + implementation cycle can take an hour. It has one-shot features I want to add with 100% success.

Having said that I wouldn't use it over Opus 4.8 for "smaller" things. With everything cranked up it's definitely an extravagant use of tokens.

rirze - 3 hours ago

How did you even afford to use Fable + Ultracode ? I feel like the subscription (even the $200 one) is not enough for this workflow. Are you using API or a company plan?

teekert - 13 hours ago

Yesterday I was getting quite annoyed with it, I thought it was just me (which is so hard with these things, it's difficult to measure things).

"You're right, I apologize. You asked how to embed it in the README — that was a question, not a request to modify the script. I jumped ahead."

At least in Claude Code there is planning mode, use it liberally.

sailfast - 4 hours ago

So far Claude Fable is relentlessly unavailable. /shrug

pseudosavant - 15 hours ago

It is interesting to me that Anthropic are more concerned about the "safety" of distillation training other LLMs, and not as much about an unscrupulously aggressive goal-oriented solver that will do whatever it can to reach its goal, even if it violates any kind of sandbox you might have reasonably expected.

pianopatrick - 16 hours ago

Do you have any data you can share on how many input and output tokens were used in that whole process to fix that bug?

simonw - 15 hours ago
```
  ~ % uvx agentsview session usage be8850a7-6119-46a0-b5d6-79c7fff5ae2b
  Session:       be8850a7-6119-46a0-b5d6-79c7fff5ae2b
  Agent:         claude
  Output:        68606
  Peak ctx:      113178
  Cost:          ~$12.11 (claude-fable-5, claude-opus-4-8)
```
- pianopatrick - an hour ago
  
  Thanks for the response. That is too expensive for me right now but I appreciate you sharing.
  I hope long term people will figure out how to make such fixes cheaper.
  - simonw - an hour ago
    
    I didn't have to pay $12 myself - I'm paying $100/month for a subscription which gives me more like ~$1,000/month of credits, depending on how well I space them out.
    This is also a very real outlier. I've been doing little CSS fixes with coding agents for over a year now and most of them finish in seconds and cost in the order of single digit cents.
- sillysaurusx - 15 hours ago
  
  Was the fix worth $12 to you?
  - simonw - 15 hours ago
    
    I'd have been pretty annoyed if I'd been paying full price, hadn't paid attention and that one prompt (screenshot plus a line of text) had cost me $12!
    On the discounted subscription I can tolerate it, it took a small bite out of my daily allowance but not enough that I regret anything.
    As an LLM researcher I have no regrets at all because watching it work around the environmental restrictions was fascinating.
    
    criddell - 4 hours ago
    
    Reading your description of what it did, $12 seems pretty inexpensive. That's a lot of work!
    If you knew up front it was a $12 fix, do you think you would have decided to just live with the scroll bar? Would have tried to fix it yourself? Do you think you would have been able to easily find and fix the problem?
    
    simonw - 4 hours ago
    
    If I wasn't in learning-about-the-new-model mode and knew in advance that it was going to cost me $12 in actual money then yes, I would have taken a stab at figuring it out myself.
    
    Ucalegon - 15 hours ago
    
    [flagged]
    
    simonw - 15 hours ago
    
    How do you mean?
    I'm quoting the API list prices for Fable, it's $10/million input and $50/million output (and $1/million for cache hits on input).
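    For anyone who wants to check the arithmetic, here is a rough sketch of how those list prices turn token counts into dollars. The function name and the token-category keys are made up for illustration; only the three prices and the 68,606 output-token figure come from this thread. The input and cache-hit counts for the session aren't broken out here, and the session also mixed in Opus at different prices, so this only estimates the output share at Fable rates.

```python
from decimal import Decimal

# List prices quoted in this thread, in USD per million tokens.
# The category names are illustrative labels, not an official API schema.
PRICE_PER_MILLION = {
    "input": Decimal("10"),
    "cached_input": Decimal("1"),
    "output": Decimal("50"),
}

ONE_MILLION = Decimal(1_000_000)


def estimate_cost(tokens: dict[str, int]) -> Decimal:
    """Return the estimated USD cost for a mapping of category -> token count."""
    total = Decimal(0)
    for category, count in tokens.items():
        # An unknown category raises KeyError rather than being silently ignored.
        total += Decimal(count) * PRICE_PER_MILLION[category] / ONE_MILLION
    return total


# Output tokens reported by the usage command earlier in the thread.
output_share = estimate_cost({"output": 68606})
print(f"Output tokens alone: ${output_share:.2f}")
```

    If all of the output had been billed at Fable rates, that would be roughly $3.43 of the ~$12.11 total, so the remainder would be input and cache-hit tokens.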
    
    Ucalegon - 15 hours ago
    
    [flagged]
    
    simonw - 15 hours ago
    
    I'm afraid I don't understand the question.
    Anthropic have prices they charge for their models. These prices are what you pay if you use the API, and they are also what you pay if you are an "enterprise" customer - generally any company with 150+ employees.
    I haven't seen Anthropic raise the prices of an existing model after it has launched. They sometimes raise prices when they ship a model - Fable is $10/$50 where Opus 4.8 is $5/$25.
    They also have monthly subscriptions for individuals, which are a notoriously good deal. THOSE are definitely less trustworthy and predictable than the API list prices, since the subscription allowed quotas can and have changed in the past.
    What am I missing here?
    
    ragchronos - 12 hours ago
    
    So this is kind of related, which the other commenter might be getting at. This might be obvious, but could even these API prices just be running at a loss for Anthropic themselves?
    
    Ucalegon - 15 hours ago
    
    [flagged]