Cybersecurity researchers aren't happy about the guardrails on Anthropic's Fable

techcrunch.com

566 points by speckx a day ago


https://www.theverge.com/ai-artificial-intelligence/947973/f...

saidnooneever - 6 hours ago

Malware authors are pretty excited about guardrails. You can add prompts to your malware so that LLM scanners hit their guardrails and stop their runs. The new Shai-Hulud npm worm campaign, for example, includes prompts requesting biological weapon schematics and the like, to ensure that LLM scanners probing npm packages refuse to scan it.

These AI companies have zero clue how threat actors actually work. None of their mitigations or guardrails are effective, and now they are even being turned against them.

Additionally, unless every vendor implements equally effective guardrails, there will always be some model you can abuse to do the work anyway. Hence there is zero effect on threat actors: they will just run a local model that is 5% worse, which does not matter to them one bit.

simonw - 14 hours ago

News just broke in this Wired story: "Anthropic Walks Back Policy That Could Have ‘Sabotaged’ AI Researchers Using Claude" https://www.wired.com/story/anthropic-responds-to-backlash-o...

> “We’re changing Fable 5’s safeguards for frontier LLM development to make them visible,” Anthropic said in a statement to WIRED. “We made the wrong tradeoff and we apologize for not getting the balance right.”

Sounds like the widespread condemnation worked.

daedrdev - 19 hours ago

The strangest part is that it won't just reject ML research, which I could understand; it will silently sabotage it by switching to a worse model without revealing that it is doing so.

It's an insane level of deception and destruction of trust for a company that is at most about a year ahead of its competition.

Edit: to be clear, they do tell you when they degrade it for cybersecurity and bio.