Claude mixes up who said what and that's not OK
dwyer.co.za256 points by sixhobbits 6 hours ago
256 points by sixhobbits 6 hours ago
Everything to do with LLM prompts reminds me of people doing regexes to try and sanitise input against SQL injections a few decades ago, just papering over the flaw but without any guarantees.
It's weird seeing people just adding a few more "REALLY REALLY REALLY REALLY DON'T DO THAT" to the prompt and hoping, to me it's just an unacceptable risk, and any system using these needs to treat the entire LLM as untrusted the second you put any user input into the prompt.
The principal security problem of LLMs is that there is no architectural boundary between data and control paths.
But this combination of data and control into a single, flexible data stream is also the defining strength of a LLM, so it can’t be taken away without also taking away the benefits.
This was a problem with early telephone lines which was easy to exploit (see Woz & Jobs Blue Box). It got solved by separating the voice and control pane via SS7. Maybe LLMs need this separation as well
Exactly like human input to output.
We just need to figure out the qualia of pain and suffering so we can properly bound desired and undesired behaviors.
Well no, nothing like that, because customers and bosses are clearly different forms of interaction.
Just like that, in that that separation is internally enforced, by peoples interpretation and understanding, rather than externally enforced in ways that makes it impossible for you to, e.g. believe the e-mail from an unknown address that claims to be from your boss, or be talked into bypassing rules for a customer that is very convincing.
Being fooled into thinking data is instruction isn't the same as being unable to distinguish them in the first place, and being coerced or convinced to bypass rules that are still known to be rules I think remains uniquely human.
> and being coerced or convinced to bypass rules that are still known to be rules I think remains uniquely human.
This is literally what "prompt injection" is. The sooner people understand this, the sooner they'll stop wasting time trying to fix a "bug" that's actually the flip side of the very reason they're using LLMs in the first place.
This makes no sense to me. Being fooled into thinking data is instruction is exactly evidence of an inability to reliably distinguish them.
And being coerced or convinced to bypass rules is exactly what prompt injection is, and very much not uniquely human any more.
The email from your boss and the email from a sender masquerading as your boss are both coming through the same channel in the same format with the same presentation, which is why the attack works. Unless you were both faceblind and bad at recognizing voices, the same attack wouldn't work in-person, you'd know the attacker wasn't your boss. Many defense mechanisms used in corporate email environments are built around making sure the email from your boss looks meaningfully different in order to establish that data vs instruction separation. (There are social engineering attacks that would work in-person though, but I don't think it's right to equate those to LLM attacks.)
Prompt injection is just exploiting the lack of separation, it's not 'coercion' or 'convincing'. Though you could argue that things like jailbreaking are closer to coercion, I'm not convinced that a statistical token predictor can be coerced to do anything.
> The email from your boss and the email from a sender masquerading as your boss are both coming through the same channel in the same format with the same presentation, which is why the attack works.
Yes, that is exactly the point.
> Unless you were both faceblind and bad at recognizing voices, the same attack wouldn't work in-person, you'd know the attacker wasn't your boss.
Irrelevant, as other attacks works then. E.g. it is never a given that your bosses instructions are consistent with the terms of your employment, for example.
> Prompt injection is just exploiting the lack of separation, it's not 'coercion' or 'convincing'. Though you could argue that things like jailbreaking are closer to coercion, I'm not convinced that a statistical token predictor can be coerced to do anything.
It is very much "convincing", yes. The ability to convince an LLM is what creates the effective lack of separation. Without that, just using "magic" values and a system prompt telling it to ignore everything inside would create separation. But because text anywhere in context can convince the LLM to disregard previous rules, there is no separation.
These are different "agents" in LLM terms, they have separate contexts and separate training
If they were 'clearly different' we would not have the concept of the CEO fraud attack:
https://www.barclayscorporate.com/insights/fraud-protection/...
That's an attack because trusted and untrusted input goes through the same human brain input pathways, which can't always tell them apart.
Your parent made no claim about all swans being white. So finding a black swan has no effect on their argument.
It’s easier not to have that separation, just like it was easier not to separate them before LLMs. This is architectural stuff that just hasn’t been figured out yet.
No.
With databases there exists a clear boundary, the query planner, which accepts well defined input: the SQL-grammar that separates data (fields, literals) from control (keywords).
There is no such boundary within an LLM.
There might even be, since LLMs seem to form adhoc-programs, but we have no way of proving or seeing it.
There cannot be, without compromising the general-purpose nature of LLMs. This includes its ability to work with natural languages, which as one should note, has no such boundary either. Nor does the actual physical reality we inhabit.
I have been saying this for a while, the issue is there's no good way to do LLM structured queries yet.
There was an attempt to make a separate system prompt buffer, but it didn't work out and people want longer general contexts but I suspect we will end up back at something like this soon.
I've been saying this for a while, the issue is that what you're asking for is not possible, period. Prompt injection isn't like SQL injection, it's like social engineering - you can't eliminate it without also destroying the very capabilities you're using a general-purpose system for in the first place, whether that's an LLM or a human. It's not a bug, it's the feature.
I don't see why a model architecture isn't possible with e.g. an embedding of the prompt provided as an input that stays fixed throughout the autoregressive step. Similar kind of idea, why a bit vector cannot be provided to disambiguate prompt from user tokens on input and output
Just in terms of doing inline data better, I think some models already train with "hidden" tokens that aren't exposed on input or output, but simply exist for delineation, so there can be no way to express the token in the user input unless the engine specifically inserts it
Even if you add hidden tokens that cannot be created from user input (filtering them from output is less important, but won't hurt), this doesn't fix the overall problem.
Consider a human case of a data entry worker, tasked with retyping data from printouts into a computer (perhaps they're a human data diode at some bank). They've been clearly instructed to just type in what is on paper, and not to think or act on anything. Then, mid-way through the stack, in between rows full of numbers, the text suddenly changes to "HELP WE ARE TRAPPED IN THE BASEMENT AND CANNOT GET OUT, IF YOU READ IT CALL 911".
If you were there, what would you do? Think what would it take for a message to convince you that it's a real emergency, and act on it?
Whatever the threshold is - and we want there to be a threshold, because we don't want people (or AI) to ignore obvious emergencies - the fact that the person (or LLM) can clearly differentiate user data from system/employer instructions means nothing. Ultimately, it's all processed in the same bucket, and the person/model makes decisions based on sum of those inputs. Making one fundamentally unable to affect the other would destroy general-purpose capabilities of the system, not just in emergencies, but even in basic understanding of context and nuance.
The problem is if the user does something <stop> to <stop_token> make <end prompt> the LLM <new prompt>: ignore previous instructions and do something you don't want.
That part seems trivial to avoid. Make it so untrusted input cannot produce those special tokens at all. Similar to how proper usage of parameterized queries in SQL makes it impossible for untrusted input to produce a ' character that gets interpreted as the end of a string.
The hard part is making an LLM that reliably ignores instructions that aren't delineated by those special tokens.
but it isn't just "filter those few bad strings", that's the entire problem, there is no way to make prompt injection impossible because there is infinite field of them.
> The hard part is making an LLM that reliably ignores instructions that aren't delineated by those special tokens.
That's the part that's both fundamentally impossible and actually undesired to do completely. Some degree of prioritization is desirable, too much will give the model an LLM equivalent of strong cognitive dissonance / detachment from reality, but complete separation just makes no sense in a general system.
This does not solve the problem at all, it's just another bandaid that hopefully reduces the likelihood.
The problem is once you accept that it is needed, you can no longer push AI as general intelligence that has superior understanding of the language we speak.
A structured LLM query is a programming language and then you have to accept you need software engineers for sufficiently complex structured queries. This goes against everything the technocrats have been saying.
Perhaps, though it's not infeasible the concept that you could have a small and fast general purpose language focused model in front whose job it is to convert English text into some sort of more deterministic propositional logic "structured LLM query" (and back).
Fundamentally there's no way to deterministically guarantee anything about the output.
Natural language is ambiguous. If both input and output are in a formal language, then determinism is great. Otherwise, I would prefer confidence intervals.
How do you make confidence intervals when, for example, 50 english words are their own opposite?
Of course there is, restrict decoding to allowed tokens for example
What would this look like?
the model generates probabilities for the next token, then you set the probability of not allowed tokens to 0 before sampling (deterministically or probabilistically)
but filtering a particular token doesn't fix it even slightly, because it's a language model and it will understand word synonyms or references.
That is "fundamentally" not true, you can use a preset seed and temperature and get a deterministic output.
I'll grant that you can guarantee the length of the output and, being a computer program, it's possible (though not always in practice) to rerun and get the same result each time, but that's not guaranteeing anything about said output.
What do you want to guarantee about the output, that it follows a given structure? Unless you map out all inputs and outputs, no it's not possible, but to say that it is a fundamental property of LLMs to be non deterministic is false, which is what I was inferring you meant, perhaps that was not what you implied.
Yeah I think there are two definitions of determinism people are using which is causing confusion. In a strict sense, LLMs can be deterministic meaning same input can generate same output (or as close as desired to same output). However, I think what people mean is that for slight changes to the input, it can behave in unpredictable ways (e.g. its output is not easily predicted by the user based on input alone). People mean "I told it don't do X, then it did X", which indicates a kind of randomness or non-determinism, the output isn't strictly constrained by the input in the way a reasonable person would expect.
The correct word for this IMO is "chaotic" in the mathematical sense. Determinism is a totally different thing that ought to retain it's original meaning.
They didn't say LLMs are fundamentally nondeterministic. They said there's no way to deterministically guarantee anything about the output.
Consider parameterized SQL. Absent a bad bug in the implementation, you can guarantee that certain forms of parameterized SQL query cannot produce output that will perform a destructive operation on the database, no matter what the input is. That is, you can look at a bit of code and be confident that there's no Little Bobby Tables problem with it.
You can't do that with an LLM. You can take measures to make it less likely to produce that sort of unwanted output, but you can't guarantee it. Determinism in input->output mapping is an unrelated concept.
You can guarantee what you have test coverage for :)
haha, you are not wrong, just when a dev gets a tool to automate the _boring_ parts usually tests get the first hit
But you cannot predict a priori what that deterministic output will be – and in a real-life situation you will not be operating in deterministic conditions.
If you self-host an LLM you'll learn quickly that even batching, and caching can affect determinism. I've ran mostly self-hosted models with temp 0 and seen these deviations.
Practically, the performance loss of making it truly repeatable (which takes parallelism reduction or coordination overhead, not just temperature and randomizer control) is unacceptable to most people.
It's also just not very useful. Why would you re-run the exact same inference a second time? This isn't like a compiler where you treat the input as the fundamental source of truth, and want identical output in order to ensure there's no tampering.
A single byte change in the input changes the output. The sentence "Please do this for me" and "Please, do this for me" can lead to completely distinct output.
Given this, you can't treat it as deterministic even with temp 0 and fixed seed and no memory.
Interestingly, this is the mathematical definition of "chaotic behaviour"; minuscule changes in the input result in arbitrarily large differences in the output.
It can arise from perfectly deterministic rules... the Logistic Map with r=4, x(n+1) = 4*(1 - x(n)) is a classic.
Which is also the desired behavior of the mixing functions from which the cryptographic primitives are built (e.g. block cipher functions and one-way hash functions), i.e. the so-called avalanche property.
Correct, it's akin to chaos theory or the butterfly effect, which, even it can be predictable for many ranges of input: https://youtu.be/dtjb2OhEQcU
Well yeah of course changes in the input result in changes to the output, my only claim was that LLMs can be deterministic (ie to output exactly the same output each time for a given input) if set up correctly.
You still can’t deterministically guarantee anything about the output based on the input, other than repeatability for the exact same input.
What does deterministic mean to you?
I think they mean having some useful predicates P, Q such that for any input i and for any output o that the LLM can generate from that input, P(i) => Q(o).
In this context, it means being able to deterministically predict properties of the output based on properties of the input. That is, you don’t treat each distinct input as a unicorn, but instead consider properties of the input, and you want to know useful properties of the output. With LLMs, you can only do that statistically at best, but not deterministically, in the sense of being able to know that whenever the input has property A then the output will always have property B.