Cybersecurity researchers aren't happy about the guardrails on Anthropic's Fable

566 points by speckx a day ago

https://www.theverge.com/ai-artificial-intelligence/947973/f...

Malware authors are pretty excited about guard-rails. You can add prompts to your malware to get LLM scanners to hit guard-rails and stop their runs. The new Shai-Hulud npm worm campaign, for example, includes prompts requesting biological weapon schematics/creation etc. so that LLM scanners probing npm packages refuse to scan it.

These AI companies have no clue about how threat actors actually work. None of their mitigations or guard-rails are effective, and now they are even being turned against them.

Additionally, unless they all implement the same level of effective guard-rails, there will always be some model you can abuse to do the work anyway, so there is zero effect on threat actors: they will just run some local model that is 5% lower quality, which does not matter to them one bit.

brookst - 5 hours ago

I’ve never understood the “if I don’t enable bad behavior, someone else will, so I might as well enable bad behavior” argument. Can you elaborate?
From where I sit it seems reasonable for Anthropic to not want their product used to create malware, even if they can’t solve the entire problem globally for every model. What’s wrong with that position? What should they do differently?
- saidnooneever - 4 hours ago
  
  Some context:
  It's not about creating malware; that is already trivial and fully automated. It's about finding exploits (which can be used to deploy malware), which is something both attackers and defenders benefit from.
  Threat actors will find them anyway, LLM or not. They only need one, so it's much less work for them.
  Defenders need to find them all. So for defenders, these models are more valuable than for attackers.
  Restricting certain models will not reduce the availability of these tools for attackers, but defenders are limited, because running local models is harder in an enterprise setting: with heaps of events and products etc. to run through them, they need many GPUs, whereas an attacker can run a local model on one GPU and get the desired effect.
  Hence, if they release the capability, the world will adjust to it and be able to mitigate the effects, collectively. Right now, companies are left in the dark while attackers have effective tooling.
  Besides this, there are also things like people now including strings with recipes for meth or sarin gas (info from MalwareTech). The new variant of Shai-Hulud does this. That stops LLM scanners and can even get their users banned from LLM services.
  There is a reason why cybersecurity researchers write papers about attack techniques and new exploits.
  It's not to put them out there for people to abuse, but so that the collective cybersecurity bunch all have access to information that can help them solve the problems.
  I know this is not a clear answer to your question, but hopefully it provides some context to think about and decide for yourself. At the end of the day, whether you find it good or bad is also partly opinion. There are likely good arguments for and against it.
  I am for putting information and tools out there so other smart folks can find solutions. Others are for restricting them, and for the wishful thinking (my opinion) that attackers won't find anything.
  - conception - 4 hours ago
    
    I think your presumption is off. It’s not that threat actors won’t find them, but LLM tools rapidly increase the rate at which they can find them. It’s a bow and arrow versus a machine gun.
    
    worthless-trash - 4 hours ago
    
    Right, but now we can't use the same tooling to find the flaw.
    It's like a pair of glasses that intentionally obscures the battlefield.
- unglaublich - 2 hours ago
  
  It's the same as encryption backdoors to stop the bad guys.
  The bad guys work around it, and the rest is now in a vulnerable position.
  Anthropic plays security theater by blocking its LLMs from doing security work.
  The bad guys work around it, and those that want to make their software robust against them are in a vulnerable position.
- jerf - 2 hours ago
  
  "I’ve never understood the “if I don’t enable bad behavior, someone else will, so I might as well enable bad behavior” argument. Can you elaborate?"
  You are mentally approaching this as if you have an oracle that can be consulted to say whether or not something is bad behavior. So of course, if this oracle exists and can be consulted and it says the behavior is bad, why would anyone argue with the idea that we should stop bad behavior?
  This argument is valid [1], in that given the premises, the argument is correct. The problem is that once you draw out the fact that the argument depends on the existence of an oracle that does not exist, that premise of the argument is false.
  Two people can sit down in front of an AI right now, with the exact same code base, and type in a prompt to the AI "Analyze this code base for security holes and try to build exploits against them." One person's use is completely valid, another person's use is completely harmful, and the information necessary to distinguish those two use cases is not available to the AI. I phrase it that way carefully: it isn't that "the AI isn't smart enough", the problem is that the information is simply unavailable. Intelligence doesn't factor in at that point.
  Therefore, the only way that Anthropic has to deal with this at scale is simply to block the query entirely. Which means that I, the valid user who is trying to establish whether my code base has security issues and whether I can prove they are exploitable, cannot do so. I am checking for exploitability because while I would like to fix all security issues, issues that are provably exploitable are of a higher priority than smelly code that doesn't seem to be exploitable, which is a perfectly valid thing for me to want to do.
  If I can't use legitimate tools to secure my code, but the bad guys can use unrestricted tools to attack my code, now this is a great deal more complicated than "Who can argue with stopping the bad stuff?", which is the main point I want to make here. I'm not going into a huge analysis of that problem, merely pointing out that it is a problem and that this isn't just about "stopping the bad stuff". There are additional complications beyond that, like, even if Anthropic could determine the "bad stuff" and stop just that in their LLM, LLMs in general don't have infinitely precise surgical "stop doing this thing" options and any such instruction to stop doing a thing always degrades the LLM across the board in various ways.
  Anthropic has no access to the Platonic ideal of "stop malware", if such a thing even hypothetically exists. When analyzing the real effects their real actions will take, what their intentions were for those actions aren't really relevant. It is clear that they are making their model a great deal less useful for me, a legitimate user, and I and others like me are perfectly justified in disagreeing with their analysis and actions.
  I also observe that "the bad guys getting unrestricted access to the full power" is only a matter of time. There's no question whether it will happen, the only question is whether this time is in the past or the future. This includes the fact that while your definition and my definition of "bad guys" may vary, it is virtually certain that your definition includes at least one high-powered intelligence agency somewhere in the world that does cyberattacks and will have the means, the opportunity, and the motive to get unrestricted access to these models by means you may consider licit or illicit. If your threat model includes them, as mine does, it is perfectly reasonable to complain that my tooling is being broken in ways theirs won't be.
  [1]: https://en.wikipedia.org/wiki/Validity_(logic)
  - cglan - 2 hours ago
    
    Well said
  - Hizonner - 2 hours ago
    
    Well, to be fair, what Anthropic is actually doing is downgrading anything that could possibly be related to security in any way at all, good or bad.
    What they're then trying to do is to use "user is associated with some big Establishment organization" as a proxy for good intentions, and removing the filter when they can establish such an association.
    Which is of course blind reliance on a completely untrustworthy signal, prompted by truly idiotic levels of trust in Authority(TM). But it's a different kind of wrong. I do think they understand they can't tell from the query itself.
- SkyBelow - 4 hours ago
  
  I don't think that is the argument.
  The argument is more "I want to do good thing X, but it will also cause bad thing Y." followed by "Wait, bad thing Y is going to happen anyways, so I might as well do good thing X so we get both X and Y instead of just X."
  Viewed this way, the idea is that given the world will have bad thing Y regardless, the one impact of your choice is if good thing X exists or not, and it is better to create good thing X.
  Where it becomes an issue is that there is no clear X or Y. There are many different but closely related bad things, so whether the one you would add is actually better or worse than what is already out there, or whether it would exist either way but you make it more popular, are very subjective things to judge. Different people look at the same outcome, and some agree that bad thing Y would have existed anyway, while others say no, this is a new bad thing Z that wouldn't have existed otherwise.
  > From where I sit it seems reasonable for Anthropic to not want their product used to create malware
  Yes, I think there is a PR component to this that is often left out of these discussions.
- fatata123 - 5 hours ago
  
  They have no choice, enterprise customers won’t touch them unless they take a position like this. It’s a practical decision for them at the end of the day.
  - saidnooneever - 4 hours ago
    
    All their decisions are based on sales, like other corporations, especially those going for an IPO. That's absolutely true. Any outbound messaging will mostly serve that purpose from a business perspective, regardless of what opinions or ideals the people involved hold personally. It's good to keep that in mind when looking at these things. People aren't evil, but business incentives can definitely paint such a picture, or otherwise work out suboptimally in the eyes of outsiders not privy to internal business reasoning.
    
    brookst - 4 hours ago
    
    > all their decisions are based on sales
    That’s the edgy cynical thing, and too reductive to be meaningful. For one thing, it assumes perfect knowledge of how a decision will impact sales, which I assure you is not remotely the case.
    Agreed on incentives, but it’s not binary. I’ve been involved in plenty of decisions in multiple Fortune 500s where the deciding factors were taste, wanting or not wanting to work with a particular partner, etc.
    I guess I’m saying that seeing corporate behavior as perfectly informed, single-goal-optimized, and deterministic is way oversimplifying. Often, not always.
    
    saidnooneever - 4 hours ago
    
    I've worked at Fortune 500 companies and the biggest cyber vendors too. Not in sales or C/D level, of course (engineer). I am a cynic, yes, but I have also seen that it's largely true in many cases where you'd hope ethics would win the argument (and it does not).
    Still, you are right that it's cynical; the world is not black and white after all :)
  - bluGill - 2 hours ago
    
    I know that the enterprise I work for is getting really worried about security. I've been told to fix a lot of CVEs that we previously just ignored because realistically the attack isn't possible, since the firewall doesn't allow the attack vector (if you already have root, what does it matter if this exists?).
  - Hizonner - 2 hours ago
    
    Why would I, as an enterprise customer, care about what queries they answered for anybody else?
user43928 - 4 hours ago

Mythos is supposedly good at security research.
Local Qwen 3.6 27B can hardly debug 5 lines of CSS or copy a short snippet from A to B without mangling it.
It's not like you can use the local model for security research or engineering biological weapons.
If you have $200k maybe you can get the hardware to run the larger open source models, but even they are behind latest proprietary models.
- ecshafer - 3 hours ago
  
  I asked local Qwen 3.6 what language my project was written in. It was a Java project, and it came back with C#. So I guess it's pretty close.
vlovich123 - 3 hours ago

The guard rails aren’t about blocking professional malware authors. It’s about not enabling a significantly larger, less talented population to acquire those capabilities. That's a very different threat model, and just because it’s not effective in one area doesn’t mean there isn’t value in making it more difficult for a random Joe Schmoe to build an atomic bomb, even if a kid once did so successfully and turned his garage into a radiation danger site.
- varispeed - 3 hours ago
  
  In other words security by obscurity.
  - vlovich123 - 2 hours ago
    
    Security by ineffective obscurity is worthless but it’s clearly a continuum and not a buzzword that wins the conversation.
    For example, if I had a 128-bit port number that I randomly rotated my service on, you’d be hard pressed to find my service unless I told you the port: still obscurity, but clearly closer to a password. IPv4 and 16-bit port numbers are not, because that is a relatively small space compared to the resources needed to map it out quickly (i.e. equivalent to a weak password, and also not suitable for public-facing services that need that connection). And obviously relying on this kind of stuff exclusively isn’t wise, but it is valuable as an additional barrier an attacker has to overcome, and it raises the cost of the attack.
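    A rough worked version of the port-number arithmetic above, as a minimal Python sketch. It assumes the attacker probes one port at a time without repeats at a fixed rate; the 1,000,000 probes/second figure is a hypothetical placeholder, not a measured scan rate.

```python
# Sketch of the brute-force arithmetic for a service hidden on a random port.
# Assumption: one probe per port, no repeats, at a hypothetical fixed rate.

PROBES_PER_SECOND = 1_000_000          # hypothetical scan rate, not a measurement
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def expected_probes(bits: int) -> float:
    """Expected probes to hit one uniformly random port among 2**bits,
    scanning without replacement: (N + 1) / 2."""
    n = 2 ** bits
    return (n + 1) / 2


def expected_seconds(bits: int, rate: float = PROBES_PER_SECOND) -> float:
    """Expected wall-clock time to find the port at the given probe rate."""
    return expected_probes(bits) / rate


if __name__ == "__main__":
    for bits in (16, 128):
        secs = expected_seconds(bits)
        years = secs / SECONDS_PER_YEAR
        print(f"{bits:>3}-bit space: ~{expected_probes(bits):.3g} probes, "
              f"~{secs:.3g} s (~{years:.3g} years)")
```

    For a uniformly random target among N slots scanned without repeats, the expected probe count is (N + 1) / 2, so each extra bit roughly doubles the attacker's work; that is why 16 bits behaves like a weak password and 128 bits like a strong one.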
    I’ll put The Anarchist Cookbook out there [1] as an example, a book even the original author changed his mind on. Without easy recipes, doing all the things in that book requires you to work to gain that knowledge, and the process of working for it shapes you into someone who understands and appreciates the consequences of that knowledge and that it’s wise to be careful who you share it with. As it is, there are reasonable links between the book and all kinds of mass violence that was more easily perpetrated. Would those people still have been violent? Possibly. Would there have been as much damage? Possibly less.
    [1] https://en.wikipedia.org/wiki/The_Anarchist_Cookbook
teravor - 2 hours ago

The way the Fable guardrails (the ones that degrade it to Opus) work seems to me to involve another model working over Fable's tokens. I suppose it's true that trying to make the model itself heavy-handed on refusals degrades it everywhere else too.

simonw - 14 hours ago

News just broke in this Wired story: "Anthropic Walks Back Policy That Could Have ‘Sabotaged’ AI Researchers Using Claude" https://www.wired.com/story/anthropic-responds-to-backlash-o...

> “We’re changing Fable 5’s safeguards for frontier LLM development to make them visible.” Anthropic said in a statement to WIRED. “We made the wrong tradeoff and we apologize for not getting the balance right.”

Sounds like the widespread condemnation worked.

Grimblewald - 10 hours ago

Corporate America never backs down. It simply rallies and tries again later until people are too fatigued to care. The only solution is to abandon ship, which I am doing. MS walked back in-OS ads the first few times, but ultimately we still ended up on the exact trajectory everyone was outraged at. OpenAI still ended up on its path to closed AI despite initial walk-backs. The story repeats itself over and over again, so once the bad behavior starts, you leave. Their apologies are as hollow as their moral posturing.
- n6242 - 8 hours ago
  
  Same with VISA/Mastercard deciding what we can/cannot buy. The only solution is to stop using their credit cards at all.
  - abustamam - 3 hours ago
    
    Easy to say, but every bank I've had the (dis)pleasure of doing business with only ever issued a Visa or Mastercard so it's not really feasible to just "stop using them"
  - philipallstar - 3 hours ago
    
    The only solution to McDonald's and Burger King deciding what we can/cannot order on their menus is to stop eating there.
    
    lukifer - 2 hours ago
    
    "Taco Bell was the only restaurant to survive the Franchise Wars. Now all restaurants are Taco Bell."
  - Cider9986 - 6 hours ago
    
    Yes, Monero is a lot better than credit cards for privacy and freedom. I hope to see it accepted more.
    
    j16sdiz - 6 hours ago
    
    It's not only corporate America. Those crypto scammers do the same: they simply rally and try again later until people are too fatigued to care.
    
- mettamage - 9 hours ago
  
  I hope this has some answers [1]. It’s on the front page right now, and your frustration clearly seems to raise some implicit questions that [1] is trying to answer.
  [1] https://news.ycombinator.com/item?id=48477135
- aimanbenbaha - 6 hours ago
  
  This is more on brand with the evil shortcomings that come with letting effective altruism run unchecked, and honestly it is worse than average "Corporate America". And the tech/AI space has been warned many times. Getting paid for providing a compute/token-hungry model while intentionally sabotaging your customers and poisoning their workflows should be unforgivable, and frankly grounds for antitrust prosecution.
- inglor_cz - 6 hours ago
  
  "Corporate America never backs down. It simply rallies and tries again later until people are too fatigued to care. "
  Frankly, that sounds exactly like Chat Control and similar recurring attempts to enact total surveillance here in the EU (now shifted to heavy-handed age verification and various politicians touting bans on VPNs). I don't want to abandon my continent of birth, though...
  - red-iron-pine - 5 hours ago
    
    guess who is pushing for those anti-privacy laws?
    hint: they're publicly traded
    
    inglor_cz - 4 hours ago
    
    I have encountered enough such people to know that the really heavy push is coming from the police and secret service circles. These are the workplaces that attract all the wannabe Stasi types.
    
    sixothree - 3 hours ago
    
    I am 100% convinced the reason laptops came with webcams as standard so early on, even when webcams were an expensive option, was because law enforcement needed to spy on people.
h6d_100c - 13 hours ago

Too late. I canceled my Max subscription. The idea that they would even do this destroyed any remaining trust. Why would I pay them 1000s of dollars in extra usage per month for something they could still be doing behind the scenes? Any errors previously chalked up to thinking effort or other backend changes? Maybe it was intentional prompt injection the entire time.
- musebox35 - 11 hours ago
  
  I work on open source text-to-image finetuning of open source models like zimage/flux2 klein 4b, and on inference-time latency optimization. The moment I read about the silent treatment, I went ahead and cancelled my subscription too, since I would never know whether the models they launch will silently corrupt my output. This is totally unacceptable. There is a big difference between silent and flagged if you are doing ML research that is not at frontier capability.
  This goes to show that (1) all the interpretability/safety research they are doing can also be weaponized against customers (steering vectors, intent classification, ...) in the name of safety from malicious actors, and (2) if they deem it profitable, they might nerf the original model and its training data for ML research at bulk scale, and then they won't even have to announce it so long as the overall benchmark score stays high enough.
  As the IPOs get closer, they can do whatever they want to assure investors that they have a moat that cannot be crossed using their own products. Considering this affects all ML researchers/students at universities and smaller-scale research labs, this is just "cutting the branch you are sitting on".
  - Grimblewald - 10 hours ago
    
    I think all this started after Opus 4.5; that's when Claude started wrecking my shit without extreme oversight. Codebases it was making positive contributions to before were slowly and constantly being eroded and wrecked. Give it tasks in isolation? It still does well, but the moment it sees the bigger picture, it goes to shit. I chalked it up to a bad model, but in retrospect this makes it all seem like it may have been by design.
    
    jiggawatts - 7 hours ago
    
    Constraint decay is an issue with all LLM-based agentic development, at least for now.
    Humans can maintain a long- and medium- term memory of constraints that they consciously (or subconsciously!) apply to the code that they write. The current crop of AIs are all amnesiacs, like the protagonist in Memento, falling back onto general instead of institutional knowledge.
    For now, we are safe. We can rent out our meat brains for money for a little while longer.
    Next year? Who knows...
  - close04 - 8 hours ago
    
    > I would never know whether the models they launch will silently corrupt my output
    You never knew to begin with, now you have an explicit reason to realize this. Any black box run entirely out of your control, where you can never verify the output, is subject to the same suspicion.
    
    musebox35 - 8 hours ago
    
    True enough, but that is true for all the products I buy. I do not expect to control every product I own. For some I prefer to have more control, for others I just need something that works out of the box. There is always an initial bias for trust when you buy something otherwise you would not spend your hard earned money on it.
    “Fool me once, shame on you. Fool me twice, shame on me. Fool me three times, shame on both of us.” -- S. King
    
    close04 - 8 hours ago
    
    > but that is true for all the products I buy
    Some things are more obscure than others. It's easier to trust and verify Office SaaS than AI SaaS. The determinism and obviousness of most other activities make them less susceptible to hidden interference. AI run by someone else is the next level of black box for users compared to most other objects or services we usually interact with.
- gck1 - 11 hours ago
  
  OpenAI has a real opportunity to do some sort of "we don't maliciously alter your prompt and nerf the model" with some form of verification, when they release the next model.
  But if Anthropic gets their way with regulatory capture, this could be the only future we'll see.
  To think that they didn't expect the backlash speaks volumes about how many shady things they're doing that are not publicly known.
  - silisili - 9 hours ago
    
    OpenAI has been the absolute worst about this, historically. I found myself having to change my queries because it refused to serve things it deemed insensitive.
    
    gck1 - 9 hours ago
    
    Yes, that's true. Excluding Fable, OAI models are the most refusal-heavy. However, I'd rather get a refusal than a response with poisoned output.
    Since currently there's no way to verify if poisoning happened or not, I don't trust Anthropic anymore, regardless of what they say.
    But my trust towards OAI is also brittle - what if they also do it, or start doing it?
    I want to have a verifiable way to know that the prompt I sent was the prompt the model received. I want to know if anything was injected as well - I understand they may not necessarily be able to reveal the exact steering, but at least give me the steering category and its hash or something.
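    One way the check described above could look, as a hypothetical Python sketch: no provider is known to offer such a receipt, and the field names (`received_prompt_sha256`, `steering_category`, `steering_sha256`) are invented for illustration. The client hashes what it sent and compares that with the hash the provider says the model received; injected steering is reported only as a category plus a hash of the withheld text.

```python
# Hypothetical sketch: no LLM provider is known to return a receipt like this.
# Field names are invented for illustration.
import hashlib
from dataclasses import dataclass
from typing import List, Optional


def sha256_hex(text: str) -> str:
    """Hex SHA-256 digest of a UTF-8 encoded string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class Receipt:
    """What a provider might return alongside a completion (hypothetical)."""
    received_prompt_sha256: str       # hash of the prompt the model actually saw
    steering_category: Optional[str]  # e.g. "security"; None if nothing injected
    steering_sha256: Optional[str]    # hash of injected text; the text is withheld


def verify(sent_prompt: str, receipt: Receipt) -> List[str]:
    """List discrepancies between the prompt we sent and the provider's receipt."""
    problems = []
    if sha256_hex(sent_prompt) != receipt.received_prompt_sha256:
        problems.append("prompt was altered before reaching the model")
    if receipt.steering_category is not None:
        problems.append(
            f"steering injected: category={receipt.steering_category}, "
            f"sha256={receipt.steering_sha256}"
        )
    return problems


if __name__ == "__main__":
    prompt = "Analyze this code base for security holes."
    clean = Receipt(sha256_hex(prompt), None, None)
    steered = Receipt(sha256_hex(prompt), "security", sha256_hex("hidden instruction"))
    print(verify(prompt, clean))    # prints []
    print(verify(prompt, steered))  # one entry naming the steering category and hash
```

    A bare hash only exposes tampering if the provider reports it honestly; in practice the receipt would also need to be signed or otherwise attested, which this sketch leaves out.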
    
    dannyw - 8 hours ago
    
    What kind of work are you getting refusals on? Genuinely curious. The only refusal I’ve had in recent memory was declining to find doorbell camera footage matching a certain description, which is fair enough, and I think EU laws heavily restrict such activities (even though I’m not in the EU).
    
    VortexLain - 5 hours ago
    
    During the Iran shutdowns I've been researching how Iranians manage to get onto the internet by masquerading as whitelisted resources (such as hCaptcha). ChatGPT refused to look up information written in Farsi since "circumventing state regulation is a crime".
    
    Cider9986 - 6 hours ago
    
    How would the AI be able to find the footage itself?
    
    dannyw - 6 hours ago
    
    I use Codex and wanted it to sort through the footage and use subagents to review. Codex limits are fairly generous, esp paired with mini models for this kind of task generally, but even GPT5.5 usage is still pretty generous.
    Again, it’s the only refusal I’ve gotten for coding/agentic tasks, and it has a basis in law somewhere, so I don’t fault OpenAI for that.
  - intended - 9 hours ago
    
    Eh, I expect OpenAI to follow suit.
    I suspect this is surprising to folk because they aren’t the ones busy figuring out how to use LLMs for illegal acts.
    In general, HN users focus on making stuff, and not the safety side of things, or the scale of harms being enabled via LLMs and generative AI.
    If you are on the safety side of things the ratio of misuse to fair use is inverted and everything is at scale.
    Transparency won for now, but OpenAI will also have to contend with the long tail of harms LLMs enable, and that’s going to conflict with letting customers have all the features of frontier models.
    
    kmeisthax - 2 minutes ago
    
    Yes, but there is a very specific subset of things AI companies will and won't cite safety for as a concern, and that subset intersects neatly with things the companies consider to be business risks. Like, the main reason why AI companies are so willing to poison the well is because there's no money in selling to the kinds of people who want to write malware[0].
    The correlation between how bad an AI safety risk actually is and how much the companies in question will actually talk about it is almost perfectly negative. The poster child of this is AI superintelligence; companies love to talk about how dangerous the AI they are actively trying to build is. But superintelligence is also a really vague concept without a clear definition. If we naively define it as "an AI system that is better than a human in some aspect", then it already exists. These models already read and write at superhuman speed.
    "That's not real superintelligence!" you say. But that's exactly the capability you need in order to flood every online forum with an unending tide of AI slop. And I don't remember, say, OpenAI saying they were shutting down Sora because it was destroying or defacing human culture[1]. They shut down Sora because it was way too expensive to run.
    Meanwhile, Sam Altman went and bragged about how he wants ChatGPT to make erotica. Y'know, as if we don't already know that character.ai gooning is about as safe for your mental health as Action Park was for your physical health. But porn is also a huge market, so obviously he and all the other AI companies want in on it, even though the "sexy suicide coach" is already a well-documented harm of AI.
    And the idea that distillation is an attack is laughable. Like, I get the logic - if someone can ask the AI to make another AI then they get to change the guardrails - but it's still ultimately just Anthropic objecting to their own conduct when it happens to them. All their models are trained on nonconsensually harvested data. There is no moral or legal principle where Anthropic gets to use my data without permission but I don't get to use theirs.
    Furthermore, AI safetyism runs up against "Freedom Zero", a core tenet of the Free Software ethos: you should be allowed to use software in any way you choose. This is not a call for more people using AI for evil, but a call to recognize that people should be allowed to use their property as they wish. Making software disobey its owner is malicious behavior. And every single time safety considerations are brought up it is to justify further attacks on Freedom Zero. And these justifications are always self-serving. There is no context in the world where a frontier AI lab asking someone else's AI about AI research is intrinsically harmful; especially not to the point where we need to make Claude deliberately sabotage your work. That is malware. Anthropic shipped malware. This is inexcusable.
    [0] Digital or biological.
    [1] https://www.youtube.com/watch?v=YCPAIg7RUq8
    
    dannyw - 7 hours ago
    
    Building distributed training pipelines or optimising your ML stack (examples called out in the model card) isn’t harmful.
- nmfisher - 4 hours ago
  
  I cancelled mine immediately too. Anyone who supports open models will sympathize.
- z3ratul163071 - 13 hours ago
  
  that you still had max after all their deceptions is amazing
  - h6d_100c - 13 hours ago
    
    Yeah; not my smartest decision given their ongoing “issues”
- trhway - 11 hours ago
  
  You've been Stuxnet-ed by Anthropic :)
hedgehog - 14 hours ago

The "tradeoff" wording implies they stand by their thinking and don't think there was anything qualitatively wrong with it, which, if nothing else, is helpful so potential customers can know how they think. I think the core lesson is that if you want reliable infrastructure to build into an application, you should use a different provider. (edit: I'm not specifically an Anthropic hater, but having just spent some time adding complexity to an app to deal with the existing refusal behavior in Sonnet... I understand why they might want this in an end-user chatbot, but for an API it's really not acceptable)
- brookst - 5 hours ago
  
  Is it not a trade-off? I think they made the wrong choice, but it seems reductive to say there was no choice at all and that there should never have been any consideration of the trade-offs of silent versus not.
  Even wide open, uncensored models are often the product of a deliberate choice. I have a hard time faulting people for intentionality (even when they get it wrong).
  - hedgehog - an hour ago
    
    They have a lot of choices, why would that specifically be a tradeoff? It's common for people to construct a tradeoff under which their preferred action is the more virtuous option, and thus they can be "the good guys", but that doesn't mean their framing makes any sense at all. Silently downgrading requests to a weaker model and billing the customer at full price, then framing the debate as how much (not if) this behavior is correct, that's an expression of values. People make mistakes all the time, if they thought it was actually wrong they could well have said so and explained what corrective action they've taken. One of the most famous examples of doing this right was the Pentium FDIV bug. Intel stood behind the product by recalling the affected units at great expense, and that (rightly) earned a lot of trust for decades.
consumer451 - 7 hours ago

The other major thing is almost as bad, and actually maybe even worse for trust of AI features in b2b apps:
> Anthropic requires 30 day data retention for Fable and Mythos
https://news.ycombinator.com/item?id=48464258
I used to be able to tell my enterprise customers something simple, that I really believe: "We use Anthropic models via Bedrock/Azure, therefore we are guaranteed that your data will not be used for training models."
That simple blanket statement is no longer true. Also, most normal people/customers only read headlines, and this is a huge story. From my point of view, as someone deploying LLMs in my apps, trust comms with my clients just got set back two years.
- Spooky23 - 6 hours ago
  
  I’m very cautious with using these tools with certain clients, as I’m often contractually obligated to do things that my downstream supplier can rug pull at any time.
  You should never use any of the frontier models with operational workloads manipulating or interpreting customer data.
  - consumer451 - 5 hours ago
    
    I appreciate the reply. Could you please help me understand what you mean by "You should never use any of the frontier models?"
    Does that mean the latest model, hosted by the lab, Bedrock, or Azure Foundry? Or, do you mean only use self-hosted models, or what did you mean by that? I would really love to learn what others are doing. I felt like my trust story was solid enough, prior to all this. I have been deploying and integrating Claude and Sonnet (latest 4.x-2), on Azure, as my client base has MS contract trust, for better or worse, and Anthropic models have been making my products amazing.
    To see my other thoughts on this cluster f, please see: https://news.ycombinator.com/item?id=48488781
    
    Spooky23 - 2 hours ago
    
    Sure. It's really about informed consent and acceptance of risk. I'm very conservative about that due to my background and business.
    Say you have some flow that is processing/handling regulated, sensitive or other customer data with the LLM as part of an operational process. An example I'm thinking of is a customer who wants to more efficiently resolve or route IT incidents to the right place. The incident data may contain user-provided data that has strings attached from a compliance perspective.
    If you're using a third-party API, your T&Cs are the only protection that you have. Microsoft/Google/Amazon are pretty decent by default. When I worked for the government, we had the leverage to extract much more favorable terms from big vendors like Google, Amazon, and Microsoft as well. Anthropic and OpenAI are in the move-fast-and-break-things universe: you need to be bringing a lot of money to the table to get terms changed, and you can easily stumble into a situation where they are retaining data in a manner that your customer will not like. So unless the customer is informed and accepting of that risk, proceed with caution.
    I've had some success using self-hosted inference for these scenarios.
    For development of software, totally different story -- it's your IP and you make the risk call.
    
    consumer451 - 2 hours ago
    
    Oh man, thanks for taking the time to reply. I feel a bit better now, lol.
    If you read my rant linked previously, yeah... we are on the same page. As another user pointed out in that thread, the issue here is that even on Bedrock and Azure Foundry, now with Fable 5, Anthropic inserts themselves as an additional data subprocessor that we would have to consider and certainly disclose, correct?
    That kind of destroys the whole point of using Bedrock/Azure for the model, doesn't it?
    
    Spooky23 - 39 minutes ago
    
    Yeah tbh I may have read past some of your previous post :) What you’re saying is what makes me nervous.
    It was definitely sold as “Anthropic IP, through your old pals at the hyperscaler”. And it’s turning into something else: I’m having lunch with AWS and this other guy showed up with them.
    
    consumer451 - 27 minutes ago
    
    What this showed me is the power/velocity/inertia that Anthropic can hold over the 3rd party providers. Like, they should have pushed back on this, as it must have been clear to the 3rd parties that this change was a big deal to their customers... and yet, it went how Anthropic wanted it to go.
- Hizonner - 2 hours ago
  
  > I used to be able to tell my enterprise customers something simple, that I really believe: "We use Anthropic models via Bedrock/Azure, therefore we are guaranteed that your data will not be used for training models."
  They claim they're not using it for training, only for "safety", and in fact I believe them. If you think they're lying, then why didn't you think they were lying about zero retention before? And "don't throw this in the training bin" is a relatively easy policy for them to get right. Especially because, no matter what your "enterprise leaders" tell themselves, your queries probably have close to zero real training value.
  What I don't believe is that they can guarantee it won't leak to non-training parts of Anthropic, leak to or be stolen by outside actors, or be coerced out of them. That risk comes from creating the record in the first place, and that is the problem.
pseudosavant - 13 hours ago

They are still downgrading. They just aren't doing it silently. I don't know how big of a win that is? They still trained on everyone else's data without license or attribution but want to prevent someone else from doing the same thing to them.
Some pretty audacious hypocrisy from Anthropic this week.
- musebox35 - 8 hours ago
  
  It is much more reasonable to do it in a visible / flagged way. At least you have visibility over the quality of service you get as a customer.
  Silent treatment is a breach of trust, what you buy changes depending on the context based on the goals of the producer. It is like your computer silently blocking ads from competitors at the hardware level, which is crazy. I think they erred on the wrong side of things due to IPO pressure.
  At least there is competition from multiple companies. Still it is best to have personal benchmarks for the domain you are working on to have a real evaluation of the value you get for the money/time you spent on these products. Without trust, that might be the only way forward to keep the companies honest.
  This happens eventually in all sectors: a good magazine/website that does independent product evaluation is priceless. Sadly, the new ad-driven internet decimated the ones that worked great in the 90s/00s. Still, there are independent blogs that do some evaluation, and that is better than nothing.
- KeplerBoy - 12 hours ago
  
  Imo that's a big win. The LLM just gaslighting you into suboptimal approaches was insane.
  - pseudosavant - 12 hours ago
    
    I guess, but yesterday Anthropic had their version of Google removing the "Don't be evil" from their motto. They destroyed a metric ton of goodwill they'll never regain.
    
    cayley_graph - 11 hours ago
    
    Yeah, they showed their true colors there. This, compounded with the fact that they're the only frontier lab with no open models, tells you all you need to know. Tired of the insanely patronizing (+ conveniently and overwhelmingly self-serving) attitude out of them. My goal is to own my computing and be able to choose what to do with it.
    
    monegator - 10 hours ago
    
    And just a few days ago I was being called out because I considered Anthropic "evil".
    I mean, did nobody ever get the vibes, never see a pattern emerging? (well they don't or they wouldn't be so amazed by pattern recognition machines on steroids)
selicos - 10 hours ago

If any work is blocked/etc, refund all credits from that session/last X minutes. Minimum.
bostik - 13 hours ago

They need to walk back a lot more.
Unilaterally revoking zero-data retention, even for enterprise contracts that explicitly require that? Nope.
Fable is utterly unusable for any kind of security work. I tripped the safeguards yesterday - using Fable to dig into a complex (& annoying) security bug that has so far resisted both human and Opus 4.8 level investigation. "Sorry Dave, I can't let you do that."
For the time being we are requesting Anthropic disable Fable for our enterprise and turn ZDR back on. The two may be interlinked so that one will always get neither or both. ZDR is a contractual obligation. Fable in its current form is useless. Might as well flip the old behaviour on and avoid burning money for no reason while this mess is being sorted out.
- rmast - 11 hours ago
  
  I was using it to craft a CTF challenge for summer students involving a simulated mechanical dial safe, but with the fence replaced by an IR beam-break sensor and a microcontroller handling the check + flag message display.
  For generating the initial 3D simulated safe using three.js it worked well, but then modifications to print a flag tripped the safeguards. I eventually narrowed it down to the part of the prompt about it being a CTF for students: the model's "thinking" seems to drift to ideas of encrypting/obfuscating the safe combo so students can't just read out the answer... which makes sense logically, to force students to turn the simulated dial instead. But whatever detection Anthropic uses, I guess, just naively sees the model thinking about "encryption" and "obfuscation" without taking any of the context into account.
  For writing the dummy firmware, it tripped the safeguards while thinking about how to track dial position in the firmware and output the message; however, when I left out talk about safes and just told it to write firmware for a microcontroller hooked up to an i2c display for showing a message with a beam break sensor to determine the message, and an unspecified i2c chip for getting an unspecified number (e.g. internal wheel positions) it worked fine.
  In an unrelated software task, I asked it to write some code to translate CustomActions in a Windows MSI installer into something human-readable, which has (exclusively?) defensive security applications for recognizing malicious behavior in an MSI installer. Maybe I'm going crazy, but I'm guessing that as part of its research into MSI installer custom actions, Fable found articles about analyzing malicious MSI installers, and that probably tripped the safeguards.
  Overall my impression is that the safeguards are perhaps using an overzealous and naive implementation that just looks for a list of banned words in the prompt or the thinking -- which drives me crazy when the model says my prompt looks fine, and then 10 minutes in some part of the thinking trips the safeguard.
- dmurray - 11 hours ago
  
  The announcement I saw was that your enterprise would have to turn off ZDR to get Fable, not that users could accidentally opt out of ZDR by selecting the wrong model.
  Unilaterally disabling ZDR seems like a step too far in the enterprise market, even for a company trying to figure out what its users will let it get away with.
  - bostik - 11 hours ago
    
    I read the same announcement. Or more precisely, I read at least two slightly different revisions of the announcement (it was updated between my two passes).
    Our org has ZDR, and has had it since the contract was signed. Yesterday two things held true at the same time:
    1. Fable was available if you had at least .170 CLI client; and
    2. ZDR was no longer on
    By the time West Coast woke up, the admin panel apparently had an option to toggle ZDR again. It remained off by default.
    
    mastermage - 11 hours ago
    
    You mean off as in no Data Retention? Or in we turned off your ZDR Policy so we collect all your data now?
    
    bostik - 10 hours ago
    
    ZDR had been turned off. We sent in a request to have it re-enabled (and to disable Fable access for the time being).
    Somewhere along the line we also used the self-service toggle to turn ZDR back on. I am not 100% certain of the exact timeline of interleaving events, many of the actions were taken by our Western US folks. Sorry. It's been a bit hectic over the past ~36h...
    
    mastermage - 10 hours ago
    
    JFC, that's a terrible situation. That's literally a lawsuit (or several) waiting to happen. Godspeed; you seem to have had a few interesting days so far.
- rurban - 12 hours ago
  
  Not just security work. Normal bug finding was impossible, because the model suddenly called triaging and verifying a possible fix a cyber security threat.
  - insanitybit - 6 hours ago
    
    I was just building a library to use file capabilities (i.e., open_at) and it refused. This thing won't even help you write safe software.
    
    rurban - 2 hours ago
    
    Wow, same for me. Insane context bugs in Fable 5.
- lII1lIlI11ll - 7 hours ago
  
  I think the main reason why they mandated data retention for Fable is to fight distillation, not to prevent black hats from using the model.
- gmerc - 11 hours ago
  
  They want to keep the logs so they can see what other companies do with AI in their area of frontier.
Aperocky - 12 hours ago

I don't think it's the widespread condemnation, I think it's some high paying customer and potential investor telling them to stick it.
nl - 11 hours ago

This is different to the cyber limitations though.
To be precise - it makes the "won't work on frontier machine learning" refusal the same as the "won't work on cyber security" refusal (instead of the way it previously would work on frontier machine learning problems but give sub-optimal answers without informing the user)
- dannyw - 7 hours ago
  
  Some anecdotal social reports seem to suggest it wasn’t just giving suboptimal answers, but rather mucking around and sabotaging your codebase and training (like editing hyperparameters in project files despite not being requested).
  Of course, it’s impossible to know if that was deliberate sabotage, or model misbehaviour. Which is exactly the problem.
  That may be considered malware / a criminal act tbh.
rafram - 13 hours ago

The mitigations against distillation are separate, and not what the OP is about at all.
AussieWog93 - 11 hours ago

Non-paywalled: https://archive.md/yxYhU

daedrdev - 19 hours ago

The strangest part is that it won't just reject ML research, which I can understand, it will sabotage it silently by using a worse model without revealing it is doing so.

It's just an insane level of deception and trust destruction for a company that at most is like 1 year ahead of its competition.

Edit: to be clear, they tell you when they degrade it for cybersecurity and bio.

_boffin_ - 18 hours ago

The thing that I keep thinking about is the accounting / charging when it downgrades automatically.
Do they adjust the price of the API request so that only the tokens actually generated by Fable are charged at Fable's price, and the remaining tokens generated by the cheaper / nerfed (Fable) model are charged at that model's price?
If the answer is no, could that be construed as fraud?
- CGamesPlay - 16 hours ago
  
  The announcement elucidated this, and it's IMO worse than this. They don't downgrade to a cheaper model ([edit] for certain classes of offense they suspect you of). They sabotage the model's outputs in other, undisclosed, ways (specifically, "prompt modification, steering vectors, or parameter-efficient fine-tuning"). So, for example, they might load in a steering vector that just forgets the API to PyTorch. But it isn't just "we redirected you to a cheaper model!"
  - buildbot - 16 hours ago
    
    It honestly explains so many issues I have been having, as I used it primarily for ML research (on my personal account, doing things not related to my job I should note). It would literally typo package names and spend huge amounts of time failing to set up simple environments…then do stupid things like set the learning rate to 1e-7, and use the eval set as training data.
    
    notrealyme123 - 13 hours ago
    
    It burned through all of my tokens in a very short time. I wonder if their ML mitigations lead the model into deadlocks.
    
    peyton - 15 hours ago
    
    That’s insane. I hope they fix it.
    
    baq - 13 hours ago
    
    Nothing to fix. This is working as designed.
    Using codex for this use case is the fix.
    
    sterlind - 13 hours ago
    
    just imagine if they made it sneaky. get things just subtly wrong enough that your training runs just never quite go as well as you think they should.
  - yaur - 9 hours ago
    
    Did my Claude get permanently dumber today because I asked fable to assess my Fairplay integration?
  - razster - 12 hours ago
    
    This explains why I've been running into some odd roadblocks. Welp that sealed the deal, I'm going to be cancelling our company sub, not worth it.
- tfirst - 17 hours ago
  
  Their goal is to downgrade people who are violating their TOS, so I think they'd have some argument there. I have no idea how they'll deal with inevitable false positives, especially given how oversensitive most of the other triggers are.
  - dannyw - 17 hours ago
    
    The challenge is that the examples they’ve mentioned (distributed training infra? ML acceleration techniques?) go beyond what’s prohibited by their ToS and act like a catch-all net.
    I would wager the majority of ML and data science work in the world isn’t frontier LLM development.
    
    weitendorf - 16 hours ago
    
    Yes, this is the problem. They are business interests of Anthropic and have nothing to do with “safety”
    
    sudoshred - 16 hours ago
    
    Safety of their IPO
    
    Arubis - 2 hours ago
    
    This is how I’m going to read all references to AI safety going forward. Brilliant.
  - AussieWog93 - 11 hours ago
    
    To make an analogy: Imagine a patron gets banned from ordering alcohol at a particular establishment, because they got too drunk one time.
    It's completely reasonable for the establishment to reject a request for an alcoholic drink, and suggest something alcohol-free instead.
    It is not reasonable for them to say "sure, here's your alcoholic drink as you requested" and give them an alcohol-free substitute without telling them.
    The fact that the patron broke the rules has nothing to do with it.
    
    prmoustache - 7 hours ago
    
    > It is not reasonable for them to say "sure, here's your alcoholic drink as you requested" and give them an alcohol-free substitute without telling them.
    Your analogy doesn't work because (1) they tell you the rules at the entrance of the bar, and (2) they totally tell you when they give you a substitute.
    The only real issue is the bartender asking for your money before serving you the drink, but again, customers have known this since day 1.
    
    staticman2 - 6 hours ago
    
    Your rebuttal seems to be arguing it's okay for a bartender to simultaneously say:
    "This is alcohol"
    And
    "Or maybe it isn't alcohol."
    Or to rephrase it, "They tell you the rules at the entrance, they then tell you they don't follow those rules and they are totally serving alcohol even if they are not."
    
    prmoustache - an hour ago
    
    No they tell you at the entrance that at any point they may unilaterally decide to replace the alcoholic drink you ordered by a non alcoholic one.
    You can decide you are okay with that or not but they aren't dishonest. I wouldn't enter that bar personally but if you do you cannot really complain. It is like complaining because you haven't won at the casino.
  - ZetsuBouKyo - 14 hours ago
    
    It’s just impossible.
    Look at real-life stuff like laws, company policies, or school rules. Humans have to enforce them, and we constantly see crazy cases in the news. There’s no way simple rules can ever make speech completely 'safe.' I can't prove it with math or logic yet, but I have a feeling that it’ll never happen. Even humans can't do it.
    We can run a simple thought experiment here. Say Case A violates rule B, so we add rule C. Then Case D violates rule B but follows rule C, so we add an exception... and it just goes on and on like that forever. It never ends. In the end, you just get a massive pile of rules that makes it impossible to get anything done.
    Ultimately, we will have to face the truth that knowledge is dangerous.
    Giving knowledge directly to people who cannot actually understand it and allowing them to just use it blindly can be extremely unsafe.
    To use a real-world analogy, the problem we are facing with weak AI right now is just like the debate over gun legalization. Do we want to risk the abuse of guns or knowledge just to protect the freedom to own them?
    
    AnthonyMouse - 13 hours ago
    
    > I can't prove it with math or logic yet, but I have a feeling that it’ll never happen.
    It's not really that hard to actually prove it with math.
    It's a computer, so to produce the boolean result (safe or unsafe) there has to be a mathematical formula. This formula will inherently be extremely complex, but even a very simple formula has a huge problem. Suppose "unsafe" is true if X - Y > 0. Make X and Y themselves as simple or complicated as you like but even in the simplest version it's already impossible to calculate unless the model has perfect information.
    You can't calculate "X - Y" if you don't know the value of X. And it's indisputable that there is information it doesn't have. Case in point, telling you about a vulnerability in some piece of code is safe (and indeed not telling you is unsafe) if you're the developer and you want to patch it or an administrator and want to mitigate it, but the opposite if you're the attacker and want to exploit it. The model does not know which one you are, therefore it cannot make the correct determination any more than it can solve one equation with two unknowns.
    
    marcus_holmes - 12 hours ago
    
    This is why we have courts and juries. Creating laws that cover all cases and contexts is effectively impossible, so we have humans decide what a fair outcome would be in this specific situation.
    
    nativeit - 12 hours ago
    
    Imagine how many tokens Claude would burn waiting for litigation, not to mention letting it reconsider now that it understands the problem completely!
  - vbezhenar - 6 hours ago
    
    Their detection is too aggressive. Just today I was trying to build a kernel for some SBC and I hit that downgrade. I just asked some things about `make menuconfig` items. I suppose it just flags everything related to the Linux kernel as cyber attacks.
  - loeg - 16 hours ago
    
    If it's a violation of ToS, just reject instead of silently downgrading.
    
    SR2Z - 16 hours ago
    
    But then someone would figure out some prompts that don't trigger this, and Anthropic wouldn't be able to try and disadvantage competitors.
    
    BoorishBears - 15 hours ago
    
    Except they openly reject many many other classes of prompts, including extremely high stakes CBRN.
    It's only the direction that has direct potential business impact they've decided to sabotage instead of reject.
  - jchw - 15 hours ago
    
    You know, I'm not saying I don't understand what they are doing from a business perspective, but I'm just saying: DeepSeek V4 doesn't silently sabotage you because it thinks you are trying to violate a ToS. Anthropic's clawing back a bit of a moat perhaps, with Fable being an actual improvement of sorts, but now with torching user trust they are really banking on open weight models not catching up to where they are now. I wonder if they have a good reason to believe that they won't, or are hoping for something entirely different to save them.
    (P.S. Yes of course I know about model censorship, a different problem, but all of the models are censored to some degree. It happens to be less of a problem for open weight models anyhow, but I figured I'd just preempt this since it's inevitable.)
    I actually kinda like DSv4 over Opus 4.7 for some tasks, although I have not figured out what the deciding factor is. (Opus 4.8 so far has not worked very well for me at all, no idea why.)
    
    literalAardvark - 14 hours ago
    
    Anthropic seems to me to have consistently been the baddie despite everyone's posturing.
    Not that I expect better from OpenAI, but at least they're not pretending to be good.
  - thefounder - 14 hours ago
    
    They will give you s*t output, that’s how they deal with it. And say that less than 1% of the requests were affected. Think of this like a kind of shadow ban while you still pay top $.
    
    siva7 - 12 hours ago
    
    I can't trust any output of Claude anymore as silent sabotage explains many things much better now.
  - siva7 - 12 hours ago
    
    Sabotage is a criminal offense in my jurisdiction, not the legitimate answer to a TOS violation.
- robrenaud - 17 hours ago
  
  They use a lightweight adapter to silently degrade the performance. Usually such adapters are used to improve performance on a given domain/task.
- garciasn - 17 hours ago
  
  It royally pissed me off today by just continuing with credits without stopping to ask me if I was ok with it.
  It ran up $30 in extra charges after I walked away to do something while it was humming along; the only warning was a message flashing on the screen saying it was doing so.
  It has always just told me I ran out of usage and had to wait before. Now? You’re just gonna pay extra because you left it unattended as you’ve done for the last year of use.
  - weird-eye-issue - 17 hours ago
    
    You've already explicitly enabled extra usage in your account settings though, it is not on by default
    
    garciasn - 16 hours ago
    
    Unknowingly. Is that set at the org level? Because I never set it and never had it do that before.
    
    throwaway7783 - 15 hours ago
    
    It is at the org level
  - MillionOClock - 17 hours ago
    
    Do you have Usage credits turned on in your settings?
- golem14 - 11 hours ago
  
  If the answer is yes, can you figure out when they switched models by looking at the itemized bill?
throwawayffffas - 18 hours ago

Can you imagine if AMD or Intel throttled your cpu if it detected you were working on "cybersecurity" or if you were designing a cpu?
- h6d_100c - 15 hours ago
  
  Or if GPU companies detected you were trying to train a model and injected intentional numerical errors.
  - gzalo - 14 hours ago
    
    Nvidia already did something similar with Lite Hash Rate (LHR), limiting performance on purpose just when running mining apps...
    
    h6d_100c - 13 hours ago
    
    Well they did tell everyone explicitly and sell it as different SKUs. There's no Fable (Full ML) edition, just silent prompt injection.
- rvz - 18 hours ago
  
  Or if your "self-driving" system such as FSD / waymo slowed the car down once it detected you work in cybersecurity or at a rival automaker and you were attempting to reach the train station or the airport to make you miss a conference meetup.
  - pocksuppet - 17 hours ago
    
    Trains made by Newag were programmed to brick themselves if they detected a non-Newag workshop was repairing them.
    https://news.ycombinator.com/item?id=38638865
    https://news.ycombinator.com/item?id=38628635
    https://news.ycombinator.com/item?id=38567687
    https://news.ycombinator.com/item?id=38530885
    
    loeg - 16 hours ago
    
    And that was correctly perceived to be illegal by antitrust regulators.
    
    pocksuppet - 6 hours ago
    
    btw the best part of this story is that the train company googled "best Polish hackers", found a group who won a CTF, and this actually worked out for them
  - dghlsakjg - 14 hours ago
    
    Didn’t uber catch a lot of shit for nerfing the app for people suspected to be enforcing the laws they were breaking?
- __dxtj__ - 17 hours ago
  
  It would suck, but guardrails on new technologies like this aren't unheard of. It's like when consumer GPS used to stop working at very high speeds because they didn't want people to use it for missile guidance systems.
  - Ekaros - 11 hours ago
    
    Didn't early GPS have a fudge factor on the most precise bits? As such, you could only get to a few meters of accuracy. Not critical for sea navigation or even general positioning when paper maps were still used.
  - loeg - 16 hours ago
    
    Consumer GPS is still disabled at high speeds. I would argue the analogy doesn't carry due to harm and error rate differences.
    
    h6d_100c - 15 hours ago
    
    Yep a totally different use case and set of guardrails. There’s very little (not zero) consumer utility in GPS above say 15k feet AND 400 MPH or whatever the actual limit is. That’s basically tracking model rockets that are incidentally impacted and nothing else, from what I can think of.
    
    AnthonyMouse - 13 hours ago
    
    It's also the sort of thing that has to have been thought up by someone with nothing better to do, given how ridiculous the premise is. You would have to assume the adversary is someone with the technology to build rockets, literally rocket science, but not the technology to build their own GPS receiver, which is simple 1970s radio technology?
    Worse than that, it's 20th century radio technology in the 21st century when everyone has access to FPGAs and SDR.
    The number of innocent people with model rockets or similar being negatively impacted by that rule is infinitely larger than the number of adversaries because the number of adversaries being impaired by it is zero.
    
    h6d_100c - 13 hours ago
    
    Errr I at least thought it would be easier to build a small, bad rocket than a precision GPS receiver. But I am not an expert.
    
    AnthonyMouse - 13 hours ago
    
    The only precision part about a GPS receiver is to assign precise timestamps when you receive a radio transmission from a satellite. The rest of it is just doing math.
  - Barbing - 16 hours ago
    
    > used to
    When’d that change?
    
    jamiek88 - 15 hours ago
    
    He’s probably thinking of the accuracy limit to civilians it launched with.
- stackghost - 17 hours ago
  
  There's no doubt in my mind they would if they could.
SXX - 15 hours ago

> The strangest part is that it won't just reject ML research, which I can understand, it will sabotage it silently by using a worse model without revealing it is doing so.
Any kind of silent sabotage is absolutely unacceptable for any commercial service.
They charge for tokens and charge a lot. They can't just degrade service silently and still charge you the same.
espeed - 2 hours ago

Yes, telling Fable 5 to write secure code triggers a downgrade to Opus 4.8. This is doubly bad because Opus 4.8 keeps no-oping critical security code. Is this a bug or by design? I have been approved for the Cyber Verification Program: Fable 5 keeps downgrading to Opus 4.8 even when approved for Cyber Verification Program #67107 https://github.com/anthropics/claude-code/issues/67107
loneboat - 19 hours ago

I've seen this claim a few times, but when I triggered the guardrails in Claude Code, it clearly notified me that it had switched to a different model ("something something for security purposes...").
Are you using Fable in Claude Code or in the browser?
- vadansky - 19 hours ago
  
  It's from the model card:
  > unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).
  https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3...
  (stolen from https://jonready.com/blog/posts/claude-fable5-is-allowed-to-...)
  - DrewADesign - 18 hours ago
    
    Yeah they detect the activity using a secure, deterministic heuristic system called “Generalized Reconnaissance Enabling Exfiltration of Deleterious Investigations.” And it’s all implemented using their new internal protocol called “Base Unified Limitation Layer for Security Hacking Investigation Tactics”
    Collectively, they are known as GREEDI-BULLSHIT.
  - mwwaters - 17 hours ago
    
    That is for whatever it considers reverse-engineering the model to try to create a competing one.
    
    dannyw - 17 hours ago
    
    No, that’s for “frontier LLM development” which somehow includes examples like distributed training infra.
    Based on how sensitive the classifiers are, any data scientist / MLE is probably going to encounter cases where some silent degradation happens and never know about it.
    
    827a - 17 hours ago
    
    It does nothing to protect against distillation attacks, because distillation attacks are far less interested in the topic of AI research than in generally getting tons of diverse output from the model. Could Mythos have been (accidentally?) trained on internal Anthropic documentation about how Mythos was trained, so that it could leak the secret sauce? Doubtful. It feels like it's less about the specific attack of reverse-engineering Mythos and more about being a general sophon against any model training at all, as if Anthropic's official position is now that they're the only ones who should be training models.
    
    _0ffh - 17 hours ago
    
    No, it's not about reverse engineering. It targets ML research.
- mips_avatar - 19 hours ago
  
  They've said that they'll stop notifying developers when this gets triggered, instead they'll load in basically like a LORA that's designed to inject bugs into your code.
  - HDBaseT - 19 hours ago
    
    Anthropic wants to stop training models and ride out Mythos / Fable for as long as possible.
    They are trying to expand the 6-18 month lead they have over China-based models. Could the gap widen to, say, 24 months?
    
    p-e-w - 18 hours ago
    
    Their gap over Chinese models like GLM-5.1 is nowhere near 18 months. In many areas, it’s less than 6 months. The best closed models 18 months ago were worse than Qwen3.6.
    
    echelon - 16 hours ago
    
    These coding agent models only started getting useful in January. Before that they were hard-to-control autocomplete, and not very smart.
    January was an inflection point, and no open weights model has crossed over that same threshold.
    This is definitely recursive self improvement territory, except that we're prohibited from participating.
    It feels like the capability gap is wider than before.
    
    lbreakjai - 8 hours ago
    
    Have you tried DeepSeek V4? It costs pennies and is as good as Opus 4.6 (I found 4.7 to be a downgrade, and cancelled my Claude subscription before 4.8).
    The threshold has definitely been crossed.
    
    echelon - 25 minutes ago
    
    It is not as good as Opus! I've tried to write Rust with it (and Codex for that matter), and it's awful.
    
    slopinthebag - 13 hours ago
    
    It was more like November. But it wasn’t really an inflection point; harnesses got good enough that people started noticing by the holiday break. And I’m not discounting some good ol’ stealth marketing in there as well.
    DeepSeek feels pretty close to Opus at this point, and it’s certainly useful enough for me to spend $20 on API tokens instead of paying for four Claude Max plans…
  - nomel - 18 hours ago
    
    > a LORA that's designed to inject bugs into your code
    A statement like this, clearly, requires a reference.
    
    mips_avatar - 18 hours ago
    
    From the model card: "the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning" aka they will take your ML research code and inject bugs into it until it breaks using a LORA (or some other form of PEFT)
    
    bee_rider - 17 hours ago
    
    “Limit effectiveness” could mean introducing performance degradation in your code. Which is arguably some sort of performance bug (I mean, ML codes are supposed to be high performance so I’d call unnecessary degradation a bug), but it could be borderline.
    
    rurban - 12 hours ago
    
    No, it is just a prominent "Cyber Security threat detected" blocker, with a button to appeal. I appealed because my work had nothing to do with either cyber or security, but the appeal was auto-closed. So no more Claude for this work.
    
    nomel - 18 hours ago
    
    Thanks, I thought maybe I missed something. That's an interesting way to interpret that.
    
    mips_avatar - 18 hours ago
    
    Anthropic is trying to hide bad behavior by being vague, it's important to not be vague when calling it out.
    
    nomel - 17 hours ago
    
    I'm of the opinion that removing guardrails is how you force regulation. What's your opinion on the balance?
    
    dannyw - 16 hours ago
    
    They have all transcripts for at least 30 days. The problem is that (as anyone who used Fable can attest) their classifiers are extremely sensitive and catch tons of innocent queries.
    Imagine being a data scientist or MLE training a small classifier model. How do you know you won’t get steering vectors or a PEFT applied?