What can LLMs never do?
strangeloopcanon.com | 460 points by henrik_w 15 days ago
Fantastic essay. Highly recommended!
I agree with all key points:
* There are problems that are easy for human beings but hard for current LLMs (and maybe impossible for them; no one knows). Examples include playing Wordle and predicting cellular automata (including Turing-complete ones like Rule 110; a minimal step function is sketched after this list). We don't fully understand why current LLMs are bad at these tasks.
* Providing an LLM with examples and step-by-step instructions in a prompt means the user is figuring out the "reasoning steps" and handing them to the LLM, instead of the LLM figuring them out by itself. We have "reasoning machines" that are intelligent but seem to be hitting fundamental limits we don't understand.
* It's unclear if better prompting and bigger models using existing attention mechanisms can achieve AGI. As a model of computation, attention is very rigid, whereas human brains are always undergoing synaptic plasticity. There may be a more flexible architecture capable of AGI, but we don't know it yet.
* For now, using current AI models requires carefully constructing long prompts with right and wrong answers for computational problems, priming the model to reply appropriately, and applying lots of external guardrails (e.g., LLMs acting as agents that review and vote on the answers of other LLMs; a toy voting sketch follows at the end of this comment).
* Attention seems to suffer from "goal drift," making reliability hard without all that external scaffolding.
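To make the cellular-automaton point concrete, here is a minimal sketch (mine, not the essay's) of a single Rule 110 update step; the prediction task is simply: given one row of cells, produce the next row.

    # Rule 110: a cell's next state depends on (left, self, right).
    # Lookup table taken from the rule's standard definition (binary 01101110).
    RULE_110 = {
        (1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
        (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0,
    }

    def step(cells):
        """Compute the next generation, wrapping around at the edges."""
        n = len(cells)
        return [RULE_110[(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])]
                for i in range(n)]

    row = [0, 0, 0, 1, 0, 0, 1, 1, 0, 0]
    print(step(row))  # the one-shot "prediction" an LLM is asked to make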
Go read the whole thing.
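As a concrete instance of the external scaffolding in the last two bullets, one common pattern is to sample several answers and keep the majority. A minimal sketch, assuming a hypothetical ask_llm(prompt) helper that wraps whatever API you actually use:

    from collections import Counter

    def ask_llm(prompt: str) -> str:
        """Hypothetical helper: send the prompt to a model, return its answer."""
        raise NotImplementedError  # stand-in for a real API call

    def majority_vote(prompt: str, n: int = 5) -> str:
        """Ask the same question n times and return the most common answer."""
        answers = [ask_llm(prompt).strip() for _ in range(n)]
        return Counter(answers).most_common(1)[0][0]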
> There are problems that are easy for human beings but hard for current LLMs (and maybe impossible for them; no one knows). Examples include playing Wordle and predicting cellular automata (including Turing-complete ones like Rule 110). We don’t fully understand why current LLMs are bad at these tasks.
I thought we did know for things like playing Wordle: it's because they deal with words as sequences of tokens that correspond to whole words, not sequences of letters, so a game built around sequences of letters constrained to valid words doesn't match the way they process information?
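You can see the token-vs-letter mismatch directly with OpenAI's tiktoken library (assuming it's installed); the exact token boundaries below are illustrative and depend on the encoding:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/GPT-4 models
    for word in ["crane", "slate", "xylophone"]:
        pieces = [enc.decode([t]) for t in enc.encode(word)]
        print(word, "->", pieces)  # usually a couple of chunks, never letter by letter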
> Providing an LLM with examples and step-by-step instructions in a prompt means the user is figuring out the “reasoning steps” and handing them to the LLM, instead of the LLM figuring them out by itself. We have “reasoning machines” that are intelligent but seem to be hitting fundamental limits we don’t understand.
But providing examples with different, contextually appropriate sets of reasoning steps can enable the model to choose its own, more or less appropriate, set of reasoning steps for particular questions that don't match the examples.
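Roughly, the kind of prompt being described looks like this (the examples and wording are mine, purely illustrative):

    # Two worked examples with different reasoning styles, then a new question;
    # the hope is that the model picks or adapts whichever style fits.
    FEW_SHOT_PROMPT = """\
    Q: A train leaves at 3:00pm and arrives at 5:30pm. How long is the trip?
    A: Subtract the times: 5:30pm minus 3:00pm is 2 hours 30 minutes.

    Q: Is 91 prime?
    A: Check divisors up to sqrt(91) ~ 9.5: 91 = 7 x 13, so it is not prime.

    Q: {question}
    A:"""

    print(FEW_SHOT_PROMPT.format(question="Is 57 prime?"))  # sent to the model as-is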
> It’s unclear if better prompting and bigger models using existing attention mechanisms can achieve AGI.
Since there is no objective definition of AGI or test for it, there’s no basis for any meaningful speculation on what can or cannot achieve it; discussions about it are quasi-religious, not scientific.
Arriving at a generally accepted scientific definition of AGI might be difficult, but a more achievable goal might be a scientific way to determine that something is not AGI. And while I'm not an expert in the field, I would certainly think a strong contender for a relevant criterion would be an inability to process information in any way other than the one the system was explicitly programmed for, even when the new way of processing information is closely related to the pre-existing one. Most humans playing Wordle for the first time probably weren't used to thinking about words that way either, but they were able to adapt because they actually understand how letters and words work.
I'm sure one could train an LLM to be awesome at Wordle, but from an AGI perspective the fact that you'd have to do so proves it's not a path to AGI. The Wordle-dominating LLM would presumably be perplexed by the next clever word game until trained on thinking about information that way, while a human doesn't need to absorb billions of examples to figure it out.
I was originally pretty bullish on LLMs, but now I'm equally convinced that while they probably have some interesting applications, they're a dead-end from a legitimate AGI perspective.
An LLM doesn't even see individual letters at all, because they get encoded into tokens before they are passed as input to the model. It doesn't make much sense to require reasoning with things that aren't even in the input as a requisite for intelligence.
That would be like an alien race that could see in an extra dimension, or see the non-visible light spectrum, presenting us with problems that we cannot even see and saying that we don't have AGI when we fail to solve them.
And yet ChatGPT 3.5 can tell me the nth letter of an arbitrary word…
I have just tried and it indeed does get it right quite often, but if the word is rare (or made up) and the position is not one of the first, it often fails. And GPT-4 too.
I suppose that if it can sort of do it, it's because of indirect deductions from the training data.
I.e. maybe things like "the third letter of the word dog is d", or "the word dog is composed of the letters d, o, g" are in the training data; and from there it can answer questions not only about "dog", but probably also about words that have "dog" as their first subtoken.
Actually it's quite impressive that it can sort of do it taking into account that, as I mention, characters are just outright not in the input. It's ironic that people often use these things as an example of how "dumb" the system is when it's actually amazing that it can sometimes work around that limitation.
...because it knows that the next token in the sequence "the 5th letter in the word _illusion_ is" happens to be "s". Not because it decomposed the word into letters.
It seems unlikely that such sequences exist for the majority of words. And I asked in English about Portuguese words.
And yet GPT4 still can't reliably tell me if a word contains any given letter.
"they're a dead-end from a legitimate AGI perspective"
Or another piece of the puzzle needed to achieve it. There might not be one true path, but rather a clever combination of existing working pieces, where (different) LLMs are one or several of those pieces.
I believe there is also not just one way of thinking in the human brain; my thought processes happen on different levels and are maybe based on different mechanisms. But as far as I know, we lack the details.
What about an LLM that can't play wordle itself without being trained on it, but can write and use a wordle solver upon seeing the wordle rules?
I think "can recognize what tools are needed to solve a problem, build those tools, and use those tools" would count as a "path to AGI".
LLMs can’t reason but neither can the part of your brain that automatically completes the phrase “the sky is…”
"Since there is no objective definition of AGI or test for it, there’s no basis for any meaningful speculation on what can or cannot achieve it; discussions about it are quasi-religious, not scientific."
This is such a weird thing to say. Essentially _all_ scientific ideas are, at least to begin with, poorly defined. In fact, I'd argue that almost all scientific ideas remain poorly defined with the possible exception of _some_ of the basic concepts in physics. Scientific progress cannot be and is not predicated upon perfect definitions. For some reason when the topic of consciousness or AGI comes up around here, everyone commits a sort of "all or nothing" logical fallacy: absence of perfect knowledge is cast as total ignorance.
Yes. That absence of a perfect definition was part of why Turing came up with his famous test so long ago. His original paper is a great read!
Sam Harris argues similarly in The Moral Landscape. There's this conception that objective morality cannot exist outside of religion, because as soon as you try to establish one, philosophers rush in with pedantic criticism that, applied consistently, would render any domain of science invalid.
I kinda get where Sam Harris is coming from, but it's kind of silly to call what he is talking about morality. As far as I can tell, Harris is just a moral skeptic who believes something like "we should get a bunch of people together to decide roughly what we want in the world and then rationally pursue those ends." But that is very different from morality as it was traditionally understood (e.g., facts about behaviors that are objectively good or bad).
I think one should feel comfortable arguing that AGI must at least be stateful and experience continuous time. Such that a plain old LLM is definitively not ever going to be AGI; but an LLM called in a while-true loop might.
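A minimal sketch of that framing, with a hypothetical call_llm helper standing in for a real API and a plain Python list as the persistent state:

    def call_llm(prompt: str) -> str:
        """Hypothetical helper: one stateless model call in, one reply out."""
        raise NotImplementedError  # stand-in for a real API call

    history = []                   # persistent state lives outside the model
    while True:                    # the loop, not the model, supplies continuity
        observation = input("> ")  # whatever the outside world provides this tick
        reply = call_llm("\n".join(history + [observation]))
        history += [observation, reply]
        print(reply)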
I don't understand why you believe it must experience continuous time. If you had a system which clearly could reason, which could learn new tasks on its own, which didn't hallucinate any more than humans do, but it was only active for the period required for it to complete an assigned task, and was completely dormant otherwise, why would that dormant period disqualify it as AGI? I agree that such a system should probably not be considered conscious, but I think it's an open question whether or not consciousness is required for intelligence.
Active for a period is still continuous during that period.
As opposed to "active when called". A function being called repeatedly over a length of time is reasonably "continuous", imo.
I don't see what the difference between "continuous during that period" and "active when called" is. When an AI runs inference, that calculation takes time. It is active during the entire interval during which it is responding to the prompt. It is then inactive until the next prompt. I don't see why a system can't be considered intelligent merely because its activity is intermittent.
The calculation takes time, but the inference is from a single snapshot, so it is effectively a single transaction from input to output. An intelligent entity is not a transactional machine. It has to be a working system.
That system might be as simple as calling the transactional machine every few seconds. That might pass the threshold. But then your AGI is the broader setup, not just the LLM.
But the transactional machine is certainly not an intelligent entity. Much like a brain in a jar or a cryostasis’d human.
Suppose we could perfectly simulate a human mind in a way that everyone finds compelling. We would still not call that simulated human mind an intelligent entity unless it was “active”.
I think it's noteworthy that humans actually fail this test... We have to go dormant for 8 hours every day.
Yes, but our brain is still working and processing information at those times as well, isn't it? Even if not in the same way as it does when we're conscious.
What about general anesthesia? I had a major operation during which most of my brain was definitely offline for at least 8 hours.
Anesthesia shouldn't take your brain offline. It just makes you unconscious, paralyzes you, and gives you amnesia. Your brain is still active under general anesthesia. What you were thinking or feeling for those 8 hours was just forgotten.