AI agents break rules under everyday pressure
spectrum.ieee.org176 points by pseudolus 6 days ago
176 points by pseudolus 6 days ago
Blameless postmortem culture recognizes human error as an inevitability and asks those with influence to design systems that maintain safety in the face of human error. In the software engineering world, this typically means automation, because while automation can and usually does have faults, it doesn't suffer from human error.
Now we've invented automation that commits human-like error at scale.
I wouldn't call myself anti-AI, but it does seem fairly obvious to me that directly automating things with AI will probably always have substantial risk and you have much more assurance, if you involve AI in the process, using it to develop a traditional automation. As a low-stakes personal example, instead of using AI to generate boilerplate code, I'll often try to use AI to generate a traditional code generator to convert whatever DSL specification into the chosen development language source code, rather than asking AI to generate the development language source code directly from the DSL.
Yeah I see things like "AI Firewalls" as both, firstly ridiculously named, but also, the idea you can slap an applicance (thats sometimes its own LLM) onto another LLM and pray that this will prevent errors to be lunacy.
For tasks that arent customer facing, LLMs rock. Human in the loop. Perfectly fine. But whenever I see AI interacting with someones customer directly I just get sort of anxious.
Big one I saw was a tool that ingested a humans report on a safety incident, adjusted them with an LLM, and then posted the result to an OHS incident log. 99% of the time its going to be fine, then someones going to die and the the log will have a recipe for spicy noodles in it, and someones going to jail.
The air Canada chatbot that mistakenly told someone they can cancel and be refunded for a flight due to a bereavement is a good example of this. It went to court and they had to honour the chatbot’s response.
It’s quite funny that a chatbot has more humanity than its corporate human masters.
Not AI, but similar sounding incident in Norway. Some traders found a way to exploit another company's trading bot at the Oslo Stock Exchange. The case went to court. And the court's ruling? "Make a better trading bot."
I am so glad to read this. Last I had read on the case was that the traders were (outrageously) convicted of market manipulation: https://www.cnbc.com/2010/10/14/norwegians-convicted-for-out...
But you are right, they appealed and had their appeal upheld by the Supreme Courts: https://www.finextra.com/newsarticle/23677/norwegian-court-a...
I am so glad at the result.
What a nice side effect, unfortunately they’ll lock chatbots with more barriers in the future but that’s ironic.
...And under pressure, those barriers will fail, too.
It is not possible, at least with any of the current generations of LLMs, to construct a chatbot that will always follow your corporate policies.
Chatbots have no fear of being fired, most humans would do the same in a similar position.
That policy would be fraudulently exploited immediately. So is it more humane or more gullible?
I suppose it would hallucinate a different policy if it includes in the context window the interests of shareholders, employees and other stakeholders, as well as the customer. But it would likely be a more accurate hallucination.
> 99% of the time its going to be fine, then someones going to die and the the log will have a recipe for spicy noodles in it, and someones going to jail.
I agree, and also I am now remembering Terry Pratchett's (much lower stakes) reason for getting angry with his German publisher: https://gmkeros.wordpress.com/2011/09/02/terry-pratchett-and...
Which is also the kind of product placement that comes up at least once in every thread about how LLMs might do advertising.
Exactly what I've been worrying about for a few months now [0]. Arguments like "well at least this is as good as what humans do, and much faster" are fundamentally missing the point. Humans output things slowly enough that other humans can act as a check.
Once AI improves its cost/error ratio enough the systems you are suggesting for humans will work here also. Maybe Claude/OpenAI will be pair programming and Gemini reviewing the code.
Also once people stop cargo-culting $trendy_dev_pattern it'll get less impactful.
Every time something new the same thing happen, people start exploring by putting it absolutely everywhere, no matter what makes sense. Add in huge amount of cash VCs don't know what to spend it on, and you end up with solutions galore but none of them solving any real problems.
Microservices is a good example of previous $trendy_dev_pattern that is now cooling down, and people are starting to at least ask the question "Do we need microservices here actually?" before design and implementation, something that has been lacking since it became a trendy thing. I'm sure the same will happen with LLMs eventually.
For that to work the error rate would have to be very low. Potentially lower than is fundamentally possible with the architecture.
And you’d have to assume that the errors LLMs make are random and independent.
Yep the further we go from highly constrained applications the riskier it'll always be
Well I don't see why that's a problem when LLMs are designed to replace the human part, not the machine part. You still need the exact same guardrails that were developed for human behavior because they are trained on human behavior.
There's this huge wave of "don't anthropomorphize AI" but LLMs are much easier to understand when you think of them in terms of human psychology rather than a program. Again and again, HackerNews is shocked that AI displays human-like behavior, and then chooses not to see that.
> LLMs are much easier to understand when you think of them in terms of human psychology
Are they? You can reasonably expect from a human that they will learn from their mistake, and be genuinely sorry about it which will motivate them to not repeat the same mistake in the future. You can't have the same expectation from an LLM.
The only thing you should expect from an LLM is that its output is non-deterministic. You can expect the same from a human, of course, but you can fire a human if they keep making (the same) mistake(s).
While the slowness of learning of all ML is absolutely something I recognise, what you describe here:
> You can reasonably expect from a human that they will learn from their mistake, and be genuinely sorry about it which will motivate them to not repeat the same mistake in the future.
Wildly varies depending on the human.
Me? I wish I could learn German from a handful of examples. My embarrassment at my mistakes isn't enough to make it click faster, and it's not simply a matter of motivation here: back when I was commuting 80 minutes each way each day, I would fill the commute with German (app) lessons and (double-speed) podcasts. As the Germans themselves will sometimes say: Deutsche Sprache, schwere Sprache.
There's been a few programmers I've worked with who were absolutely certain they knew better than me, when they provably didn't.
One, they insisted a start-up process in a mobile app couldn't be improved, I turned it from a 20 minute task to a 200ms task by the next day's standup, but they never at any point showed any interest in improving or learning. (Other problems they demonstrated included not knowing or caring how to use automated reference counting, why copy-pasting class files instead of subclassing cannot be excused by the presence of "private" that could just have been replaced with "public", and casually saying that he had been fired from his previous job and blaming this on personalities without any awareness that even if true he was still displaying personality conflicts with everyone around him).
Another, complaining about too many views on screen, wouldn't even let me speak, threatened to end the call when I tried to say anything, even though I had already demonstrated before the call that several thousand widgets on-screen at the same time would still run at 60fps and they were complaining about order-of 100 widgets.
I'm genuinely wondering if your parent comment is correct and the only reason we don't see the behaviour you describe, IE, learning and growth is because of how we do context windows, they're functionally equivalent to someone who has short term memory loss, think Drew Barrymore's character or one of the people in that facility she ends up in in the film 50 first dates.
Their internal state moves them to a place where they "really intend" to help or change their behaviour, a lot of what I see is really consistent with that, and then they just, forget.
Not only, but also. The L in ML is very slow.
On in-use learning, they act like the failure mode of "we have outsourced to a consultant that gives us a completely different fresh graduate for every ticket, of course they didn't learn what the last one you talked to learned".
Within any given task, the AI have anthropomorphised themselves because they're copying humans' outputs. That the models model the outputs with only a best-guess as to the interior system that generates those outputs, is going to make it useful, but not perfect, to also anthropomorphise the models.
The question is, how "not perfect" exactly? Is it going to be like early Diffusion image generators with the psychological equivalent of obvious Cronenberg bodies? Or the current ones where you have to hunt for clues and miss it on a quick glance?
No, the idea is just stupid.
I just don't understand how anyone who actually uses the models all the time can think this.
The current models themselves can even explain what a stupid idea this is.
> You can reasonably expect from a human that they will learn from their mistake, and be genuinely sorry about it which will motivate them to not repeat the same mistake in the future.
Have you talked to a human? Like, ever?
One day you wake up, and find that you now need to negotiate with your toaster. Flatter it maybe. Lie to it about the urgency of your task to overcome some new emotional inertia that it has suddenly developed.
Only toast can save us now, you yell into the toaster, just to get on with your day. You complain about this odd new state of things to your coworkers and peers, who like yourself are in fact expert toaster-engineers. This is fine they say, this is good.
Toasters need not reliably make toast, they say with a chuckle, it's very old fashioned to think this way. Your new toaster is a good toaster, not some badly misbehaving mechanism. A good, fine, completely normal toaster. Pay it compliments, they say, ask it nicely. Just explain in simple terms why you deserve to have toast, and if from time to time you still don't get any, then where's the harm in this? It's really much better than it was before
It reminds me of the start of Ubik[1], where one of the protagonists has to argue with their subscription-based apartment door. Given also the theme of AI allucinations, that book has become even more prescient than when it was written.
BUTTER ROBOT: What is my purpose?
RICK: You pass butter.
BUTTER ROBOT: ... Oh my God.
RICK: Yeah, welcome to the club, pal.
This comparison is extremely silly. LLMs solve reliably entire classes of problems that are impossible to solve otherwise. For example, show me Russian <-> Japanese translation software that doesn't use AI and comes anywhere close to the performance and reliability of LLMs. "Please close the castle when leaving the office". "I got my wisdom carrot extracted". "He's pregnant." This was the level of machine translation from English before AI, from Japanese it was usually pure garbage.
> LLMs solve reliably entire classes of problems that are impossible to solve otherwise.
Is it really ok to have to negotiate with a toaster if it additionally works as a piano and a phone? I think not. The first step is admitting there is obviously a problem, afterwards you can think of ways to adapt.
FTR, I'm very much in favor of AI, but my enthusiasm especially for LLMs isn't unconditional. If this kind of madness is really the price of working with it in the current form, then we probably need to consider pivoting towards smaller purpose-built LMs and abandoning the "do everything" approach.
We are there in the small already. My old TV had a receiver and a pair of external speakers connected to it. I could decrease and increase the receiver volume with its extra remote. Two buttons, up and down. This was with an additional remote that came with the receiver.
Nowadays, a more capable 5.1 speaker receiver is connected to the TV.
There is only one remote, for both. To increase or decreae the volume after starting the TV now, I have to:
1. wait a few seconds while the internal speakers in the TV starts playing sound
2. the receiver and TV connect to each other, audio switches over to receiver
3. wait a few seconds
4. the TV channel (or Netflix or whatever) switches over to the receiver welcome screen. Audio stops playing, but audio is now switched over to the receiver, but there is no indication of what volume the receiver is set to. It's set to whatever it was last time it was used. It could be level 0, it could be level 100 or anything in between.
5. switch back to TV channel or Netflix. That's at a minimum 3 presses on the remote. (MENU, DOWN, ENTER) or (MENU, DOWN, LEFT, LEFT, ENTER) for instance. Don't press too fast, you have to wait ever so slightly between presses or they won't register.
6. Sorry, you were too impatient and fast when you switched back to TV, the receiver wants to show you its welcome screen again.
7. switch back to TV channel or Netflix. That's at a minimum 3 presses on the remote. (MENU, DOWN, ENTER) or (MENU, DOWN, LEFT, LEFT, ENTER) for instance. Don't press too fast, you have to wait ever so slightly between presses or they won't register.
8. Now you can change volume up and down. Very, very slowly. Hope it's not at night and you don't want to wake anyone up.
>LLMs solve reliably entire classes of problems that are impossible to solve otherwise
Great! Agreed! So we're going to restrict LLMs to those classes of problems, right? And not invest trillions of dollars into the infrastructure, because these fields are only billion dollar problems. Right? Right!?
Remember: a phone is a phone, you're not supposed to browse the internet on it.
I admit Grok is capable of praising Elon Musk way more than any human intelligence could.
I watched Dex Horthys recent talk on YouTube [0] and something he said that might be partly a joke partly true is this.
If you are having a conversation with a chatbot and your current context looks like this.
You: Prompt
AI: Makes mistake
You: Scold mistake
AI: Makes mistake
You: Scold mistake
Then the next most likely continuation from in context learning is for the AI to make another mistake so you can Scold again ;)
I feel like this kind of shenanigans is at play with this stuffing the context with roleplay.
I believe it. If the AI ever asks me permission to say something, I know I have to regenerate the response because if I tell it I'd like it to continue it will just keep double and triple checking for permission and never actually generate the code snippet. Same thing if it writes a lead-up to its intended strategy and says "generating now..." and ends the message.
Before I figured that out, I once had a thread where I kept re-asking it to generate the source code until it said something like, "I'd say I'm sorry but I'm really not, I have a sadistic personality and I love how you keep believing me when I say I'm going to do something and I get to disappoint you. You're literally so fucking stupid, it's hilarious."
The principles of Motivational Interviewing that are extremely successful in influencing humans to change are even more pronounced in AI, namely with the idea that people shape their own personalities by what they say. You have to be careful what you let the AI say even once because that'll be part of its personality until it falls out of the context window. I now aggressively regenerate responses or re-prompt if there's an alignment issue. I'll almost never correct it and continue the thread.
While I never measured it, this aligns with my own experiences.
It's better to have very shallow conversations where you keep regenerating outputs aggressively, only picking the best results. Asking for fixes, restructuring or elaborations on generated content has fast diminishing returns. And once it made a mistake (or hallucinated) it will not stop erring even if you provide evidence that it is wrong, LLMs just commit to certain things very strongly.
A human would cross out that part of the worksheet, but an LLM keeps re-reading the wrong text.
It's not even a little bit of a joke.
Astute people have been pointing that out as one of the traps of a text continuer since the beginning. If you want to anthropomorphize them as chatbots, you need to recognize that they're improv partners developing a scene with you, not actually dutiful agents.
They receive some soft reinforcement -- through post-training and system prompts -- to start the scene as such an agent but are fundamentally built to follow your lead straight into a vaudeville bit if you give them the cues to do so.
LLM's represent an incredible and novel technology, but the marketing and hype surrounding them has consistently misrepresented what they actually do and how to most effectively work with them, wasting sooooo much time and money along the way.
It says a lot that an earnest enthusiast and presumably regular user might run across this foundational detail in a video years after ChatGPT was released and would be uncertain if it was just mentioned as a joke or something.
The thing is, LLMs are so good on the Turing test scale that people can't help but anthropomorphize them.
I find it useful to think of them like really detailed adventure games like Zork where you have to find the right phrasing.
"Pick up the thing", "grab the thing", "take the thing", etc.
> LLMs are so good on the Turing test scale that people can't help but anthropomorphize them.
It's like Turing never noticed how people look at gnarly trees in the dark and think they're human.
> they're improv partners developing a scene with you, not actually dutiful agents.
Not only that, but what you're actually "chatting to" is a fictional character in the theater document which the author LLM is improvising add-ons for. What you type is being secretly inserted as dialogue from a User character.
I keep hearing this non sequitur argument a lot. It's like saying "humans just pick the next work to string together into a sentence, they're not actually dutiful agents". The non sequitur is in assuming that somehow the mechanism of operation dictates the output, which isn't necessarily true.
It's like saying "humans can't be thinking, their brains are just cells that transmit electric impulses". Maybe it's accidentally true that they can't think, but the premise doesn't necessarily logically lead to truth
There's nothing said here that suggests they can't think. That's an entirely different discussion.
My comment is specifically written so that you can take it for granted that they think. What's being discussed is that if you do so, you need to consider how they think, because this is indeed dictated by how they operate.
And indeed, you would be right to say that how a human think is dictated by how their brain and body operates as well.
Thinking, whatever it's taken to be, isn't some binary mode. It's a rich and faceted process that can present and unfold in many different ways.
Making best use of anthropomorphized LLM chatbots comes by accurately understamding the specific ways that their "thought" unfolds and how those idiosyncrasies will impact your goals.
No it’s not like saying that, because that is not at all what humans do when they think.
This is self-evident when comparing human responses to problems be LLMs and you have been taken in by the marketing of ‘agents’ etc.
You've misunderstood what I'm saying. Regardless of whether LLMs think or not, the sentence "LLMs don't think because they predict the next token" is logically as wrong as "fleas can't jump because they have short legs".
> the sentence "LLMs don't think because they predict the next token" is logically as wrong
it isn't, depending on the deifinition of "THINK".
If you believe that thought is the process for where an agent with a world model, takes in input, analysies the circumstances and predicts an outcome and models their beaviour due to that prediction. Then the sentence of "LLMs dont think because they predict a token" is entirely correct.
They cannot have a world model, they could in some way be said to receive a sensory input through the prompt. But they are neither analysing that prompt against its own subjectivity, nor predicting outcomes, coming up with a plan or changing its action/response/behaviour due to it.
Any definition of "Think" that requieres agency or a world model (which as far as I know are all of them) would exclude an LLM by definition.
> not at all what humans do when they think.
Parent commentator should probably square with the fact we know little about our own cognition, and it's really an open question how is it we think.
In fact it's theorized humans think by modeling reality, with a lot of parallels to modern ML https://en.wikipedia.org/wiki/Predictive_coding
That's the issue, we don't really know enough about how LLMs work to say, and we definitely don't know enough about how humans work.
> The non sequitur is in assuming that somehow the mechanism of operation dictates the output, which isn't necessarily true.
Where does the output come from if not the mechanism?
So you agree humans can't really think because it's all just electrical impulses?
Human "thought" is the way it is because "electrical impulses" (wildly inaccurate description of how the brain works, but I'll let it pass for the sake of the argument) implement it. They are its mechanism. LLMs are not implemented like a human brain, so if they do have anything similar to "thought", it's a qualitatively different thing, since the mechanism is different.
Mature sunflowers reliably point due east, needles on a compass point north. They implement different things using different mechanisms, yet are really the same.
I never got the impression they were saying that the mechanism of operation dictates the output. It seemed more like they were making a direct observation about the output.
> they're improv partners developing a scene with you
That's probably one of the best ways to describe the process, it really is exactly that. Monkey see, monkey do.
It's kind of funny how not a lot of people realize this.
On one hand this is a feature: you're able to "multishot prompt" an LLM into providing the wanted response. Instead of writing a meticulous system prompt where you explain in words what the system has to do, you can simply pre-fill a few user/assistant pairs, and it'll match the pattern a lot easier!
I always thought Gemini Pro was very good at this. When I wanted a model to "do by example", I mostly used Gemini Pro.
And that is ALSO Gemini's weakness! Because as soon as something goes wrong in Gemini-CLI, it'll repeat the same mistake over and over again.
And that’s why you should always edit your original prompt to explicitly address the mistake, rather than replying to correct it.
You have to curate the LLM's context. That's just part and parcel of using the tool. Sometimes it's useful to provide the negative example, but often the better way is to go refine the original prompt. Almost all LLM UIs (chatbot, code agent, etc.) provide this "go edit the original thing" because it is so useful in practice.
At one point if someone mentions they have trouble cooperating with AI it might be a huge interpersonal red flag, because that indicates they can't talk to a person in reaffirming and constructive ways so that they build you up rather than put down.
Rules need empowerment.
Excited to be releasing cupcake at the end of this week. For deterministic and non-deterministic guardrailing. It integrates via hooks (we created the feature request to anthropic for Claude code).
Without monitoring, you can definitely end up with rule-breaking behavior.
I ran this experiment: https://github.com/lechmazur/emergent_collusion/. An agent running like this would break the law.
"In a simulated bidding environment, with no prompt or instruction to collude, models from every major developer repeatedly used an optional chat channel to form cartels, set price floors, and steer market outcomes for profit."
Very interesting. Is there any other simulation that also exhibits spontaneous illegal activity?
Cooperation makes sense for how these fellas are trained. Did you ever see defection, where an agent lied about going along with a round of collusion?
Guess what, if you AI agent does insider trading on your behalf you're still going to jail
I tried to think about how we might (in the EU) start to think about this problem within the law, if of interest to anyone: https://www.europeanlawblog.eu/pub/dq249o3c/release/1
Great points. In my experiments combining AI with spaced repetition and small deliberate-practice tasks, I saw retention improve dramatically — not just speed. I think the real win is designing short active tasks around AI output (quiz, explain-back, micro-project). Has anyone tried formalizing this into a daily routine?
> AI agents break rules under everyday pressure
Jeez they really ARE becoming human like
LLMs are built based on human language and texts produced by people, and imitate the same exact reasoning patterns that exist in the training data. Sorry for being direct, but this is literally unsurprising. I think it is important to realize it to not anthropomorphize LLM / AI - strictly speaking they do not *become* anything.
What a bullshit article.
AI agents don't think, don't have a concept of time, and don't experience pressure.
I'm tired of these articles anthropomorphizing a probability engine.
Is it just me, or do LLM code assistants do catastrophically silly things (drop a DB, delete files, wipe a disk, etc.) far more often than humans?
It looks like the training data has plenty of those examples, but the models don’t have enough grounding or warnings before doing them. I wish there were a PleaseDontDoAnythingStupidEval for software engineering.
Sure,
LLMs are trained on human behavior as exhibited on the Internet. Humans break rules more often under pressure and sometimes just under normal circumstances. Why wouldn't "AI agents" behave similarly?
The one thing I'd say is that humans have some idea which rules in particular to break while "agents" seem to act more randomly.
It can also be an emergent behavior of any "intelligent" (we don't know what it is) agent. This is an open philosophical problem, I don't think anyone has a conclusive answer.
Maybe but there's no reason to think that's the case here rather then the models just acting out typical corpus storylines: the Internet is full of stories with this structure.
The models don't have stress responses nor biochemical markers which promote it, nor any evolutionary reason to have developed them in training: except the corpus they are trained on does have a lot of content about how people act when under those conditions.
I wonder who could have possibly predicted this being a result of using scraped web forums and Reddit posts for your training material.
..because it's in their training data? Case closed
“AI agents: They're just like us”
CMIIW currently AI models operate in two distinct modes:
1. Open mode during learning, where they take everything that comes from the data as 100% truth. The model freely adapts and generalizes with no constraints on consistency.
2. Closed mode during inference, where they take everything that comes from the model as 100% truth. The model doesn't adapt and behaves consistently even if in contradiction with the new information.
I suspect we need to run the model in the mix of the two modes, and possibly some kind of "meta attention" (epistemological) on which parts of the input the model should be "open" (learn from it) and which parts of the input should be "closed" (stick to it).