Let's Think Dot by Dot: Hidden Computation in Transformer Language Models

159 points by Jimmc414 a year ago

I get the feeling this works due to the following:

1. An input is processed with answer generated token-by-token.

2. The model can output based on probability the answer or a filler token. Low probability answers are ignored in favour of higher probability filler tokens (I don't have a better token to answer with than .....)

3. At a certain point, an alignment is made with what was learnt previously triggering a higher probability of outputting a better token.

This intuition being that I've noticed models respond differently based on where in context information appears: Can't speak for different embedding methods however as I'm sure this changes my thoughts on above.

If instead chain of thought prompting is used, the tokens further generated may interfere with the output probability.

So further to this, I'm thinking filler tokens allow for a purer ability for a model to surface the best answer it has been trained on without introducing more noise. Or we can use methods that resample multiple times to find the highest outputs.

These LLMs are practically search engines in disguise.

PoignardAzur - a year ago

I'm not sure you understood what the paper was saying.
The LLM in the paper isn't being trained to output filler tokens until it finds an answer, it's trained to provide a better answer when it's given filler tokens. The only tokens the paper's LLM will predict are "true" and "false", the filler tokens are input-only.
And the paper doesn't find that filler tokens are "purer" than chain-of-thought: it describes them as less effective than CoT, though still a perf boost over getting the raw answer on certain types of tasks.
- razodactyl - a year ago
  
  I think you may need to read my comment again. It's even mentioned in the summary:
  > In summary, our results show that additional tokens can provide computational benefits independent of token choice. The fact that intermediate tokens can act as filler tokens raises concerns about large language models engaging in unauditable, hidden computations that are increasingly detached from the observed chain-of-thought tokens.
  The paper discusses an unexplained benefit of additional computation regardless of which token is selected be it symbols or Lorem Ipsum.
  I didn't mention anything about training. I'm speaking based on how the Transformer architecture itself is designed.
  The "unauditable" computation is simply a result of how machine learning models work. The extra computation made available is mentioned in my explanation.
  Keen to hear your thoughts on it though.
- a year ago

[deleted]

diziet - a year ago

This is a surprising result to me, given that (in my mind) the method simply does a few more forward passes, without encoding or transferring meaningful state between each pass.

sdenton4 - a year ago

You get embeddings at every activation layer of the network, at every token. That's extra state accessible to the network when running in recurrent 'generate the next token' mode.
- ehsanu1 - a year ago
  
  How much extra state and computation is it per token exactly? Can we account for the improvement in just those terms?
  - sdenton4 - a year ago
    
    That's the point of this paper: investigating whether 'chain of thought' promoting kinda-works because it actually induces reasoning, or whether it's just that more verbose answers give the model more tokens to work with, and this more state in which to hide interesting computations. This work introduced a way to give the model more tokens - and thus compute and state - to work with, independent of the prompt, which makes it easier to separate the computational impacts of verbosity from prompting.
  - PoignardAzur - a year ago
    
    Basically in a chain of N tokens, the state of the token at layer L reflects L * N / 2 intermediary states worth of info. (In practice a lot less, since attention is a somewhat noisy channel between tokens.)
ehsanu1 - a year ago

I've only read the abstract, but also find this strange. I wonder if this is just tapping into the computational chains that are already available when tokens are further away, due to the positional encodings being trained that way. If so, that makes the reasoning/modeling powers of LLMs even more impressive and inscrutable.
mmoskal - a year ago

Every token generates a KV cache entry based on this token and all previous KV cache entries. This happens in every layer. KV cache is 100k-1MB per token so quite a bit.
Edit: also you can forward generate 5 or 10 dots in batch without much overhead compared to a single dot since the main cost is pulling KV cache from VRAM so you have free tensor units.
- diziet - a year ago
  
  I understand what you are trying to say, but correct me if I am wrong: Each forward pass is stateless. The KV cache you describe is a clever method to make compute scale closer to linear complexity for attention. I am trying to build some conceptual understanding of how performance on 3SUM problem improves on this fine-tuned example but not on the fine-tuned no shot.
  - edmara - a year ago
    
    In the case of 3sum, because the LLM has been fine tuned to use 'blank' token key values as a register to represent the sums of specific integer triplets.
- - a year ago
  
  [deleted]
HarHarVeryFunny - a year ago

I think the general idea is that since partial embeddings can be copied laterally (between token positions) from one layer of a transformer to the next, then additional work done at filler positions can also be copied to following real token positions. There's obviously a limit to how useful this can be since these are added parallel token steps rather than sequential transformer layer ones, and results from different experiments seem to be mixed.
Still, I don't see how this really works .. more compute / embedding transformations are being potentially applied to the prediction, but in what circumstances are these filler positions being used in a useful way? The filler token embeddings themselves presumably aren't matching attention keys, but positional encodings for adjacent tokens will be similar, which is maybe what triggers lateral copying into (and perhaps out of) filler positions?
imtringued - a year ago

Given an algorithm that takes 100 iterations, ask the computer to perform it in ten iterations. It gives you a nonsense answer. Tell it to do it in 100 steps and it might just be able to do it. What this tells us is that context size appears to be a limiting factor as models get bigger and bigger.
dist-epoch - a year ago

You can transfer some state just through dots. The dot count could mean "the first n ideas do not work, analyze the n+1 one, if that's bad, emit another dot"
- wongarsu - a year ago
  
  And this works even if we assume the dots don't actually transfer information but just slightly mutate the state. First it tries a random approach, if that approach doesn't lead to a powerful result it emits a dot to try another random approach in the next round, until you get a sufficiently good path forwards.
  Essentially a brute-force search. Which is a bit wasteful, but better than just blindly taking the first idea
  - im3w1l - a year ago
    
    Kind of like trying nonces until you find one that gives you a hash with lots of leading zeros? Dotchain?
    
    kevindamm - a year ago
    
    Chain of Dot.
- diziet - a year ago
  
  I believe the model is trained to always output the same number of dots. Additionally, the way LLMs generate output tokens are not really in in line with transferring state this way.
pyinstallwoes - a year ago

Can't anything be compressed into one word by comparison?
- ivankolev - a year ago
  
  Godel and Heisenberg say no, in the most generalized case. Our universe is not deterministic
  - ants_everywhere - a year ago
    
    I'm not sure what Godel is doing here but quantum mechanics is consistent with the universe being deterministic
    https://en.wikipedia.org/wiki/Superdeterminism
  - pyinstallwoes - a year ago
    
    Explain how words do that?

rgbrgb - a year ago

i found a nice thread-level walkthrough of this paper by the first coauthor here: https://twitter.com/jacob_pfau/status/1783951795238441449

segmondy - a year ago

What does a prompt for this look like?

nestorD - a year ago

My intuitive understanding of transformers is that each token (input+output) gives the model some "thinking space" that can be used to store reasoning information (independently of the token): thus, a transformer should be more clever when completing a 1000 tokens sequence than a 2 tokens sequence. This seems in line with the paper's finding.

avereveard - a year ago

Can't they just not drop the decoder part of the transformer architecture if they need some additional processing on the prompt?

Vetch - a year ago

This paper reads to me as being about fundamental limitations of Transformers and backdoor risk.

The paper starts off by reviewing work which uses an encompassing theoretical model of transformers to prove they're limited to only expressing computations in TC^0 (roughly, upperbounded by set of parallelizable problems that can be solved by relatively shallow circuits).

There's also a reference to a paper which finds that (wrt input problem size), a polynomial number of intermediate scratchpad decoding steps allow transformers to recognize the class of polynomial-time solvable problems, linear steps is context-sensitive languages.

This paper now ask about filler tokens, do they help? The answer is negative except for a very clever exception they work out: problems with demonstrations that can be decomposed to be solvable in parallel. This identifies a practical limitation (transformer next token prediction is not expressive enough to capture all of TC^0) at the same as it identifies a theoretical capability. From the paper:

> Taken together these findings suggest that although current LLMs are unlikely to benefit from filler tokens, this is not an in-principle limitation of current architectures.

If I've understood, this means for learning to use fillers to benefit from CoT data, demonstrations must be structured such that they can be computed in parallel and not as a more natural sequential, instance-adaptive process.

> in order to use filler tokens on natural language data, LLMs would need to discover parallelizable algorithmic solutions given access only to CoT demonstrations lacking parallel structure. By training on instance-adaptive chains of thought, we can study whether models can learn to use filler tokens having seen only more naturalistic chain-of-thought data >... > We find that models trained on instance-adaptive CoT data fail to use filler tokens. On filler token sequences, the resulting models remain at, or below, no-intermediate-token, baseline performance, Figure 6. This indicates that there is no transfer from serial, instance-adaptive demonstrations to filler tokens for the 3SUM problem.

It also appears that the parallelizable problem must have a certain amount of structural complexity before a gap appears versus no filler modes (unless using an impractical amount of filler tokens):

> we expect integer addition tasks will not offer suitably rich structures for taking advantage of filler tokens when using large models—natural-language tasks may offer alternatives

Empirically, other papers have shown that LLM performance on complex tasks deteriorates significantly with input length and distractor text. Anyone who has naively attempted to combine RAG with large contexts might also have first hand experience with this.

The reason I consider this to be primarily a backdoor risk is that the kind of data and learning required seems highly unlikely to occur naturally but someone could create documents to introduce triggerable obfuscated computations. While not an issue today, future LLM training might need to filter for data with meaningful parts separated by meaningless patterns of repeated characters.

This paper follows a recent trend of marketing excellent theoretical work as LLMs being capable of secretly plotting behind your back, when the realistic implication is backdoor risk.

An article currently on the first page is relevant:

https://www.strangeloopcanon.com/p/what-can-llms-never-do

casebash - a year ago

"This paper follows a recent trend of marketing excellent theoretical work as LLMs being capable of secretly plotting behind your back, when the realistic implication is backdoor risk".
Many top computer scientists consider loss of control risks to be a possibility that we need to take seriously.
So the question then becomes, is there a way to apply science to gain greater clarity on the possibility of these claims? And this is very tricky, since we're trying to evaluate claims not about models that currently exist, but about future models.
And I guess what people have realised recently is that, even if we can't directly run an experiment to determine the validity of the core claim of concern, we can run experiments on auxiliary claims in order to better inform discussions. For example, the best way to show that a future model could have a capability is to demonstrate that a current model possesses that capability.
I'm guessing you'd like to see more scientific evidence before you want to take possibilities like deceptive alignment seriously. I think that's reasonable. However, work like this is how we gather that evidence.
Obviously, each individual result doesn't provide much evidence on its own, but the accumulation of results has helped to provide more strategic clarity over time.
nopromisessir - a year ago

I would agree that choice of language 'hidden reasoning' is a poor one.
This paper demonstrates a novel training approach which could yield narrow capability growth on a certain class of tasks.
The narrow test tube environment in which we see better performance hints at the unknown which when better understood could promise further yields down the road.
To my mind, the idea that filler tokens might promote immergent capability leading to broader task complexity capability is more promising than the backdoor risk you lay out. The possible scale in each direction just doesn't seem comparable to me(assuming each scenario plays out in a meaningful way).
Re the article...
A single fundamental breakthrough could make his entire article obsolete in a single month. We've found a lot of limits to LLMs sure... This is always how it goes over the history of AI right? The pace of fundamental breakthroughs seems of more relevant conversation with respect to the prospects for AGI as framed by his article.
- Vetch - a year ago
  
  The paper also proves that this capability, one unlikely to occur naturally, does not help for tasks where one must create sequentially dependent chains of reasoning, a limiting constraint. At least not without overturning what we believe about TCS.
  > A single fundamental breakthrough
  Then we'd no longer be talking about transformers. That something unpredicted could happen is trivially true.
  > immergent capability
  It's specifically trained in, requires heavy supervision and is hard to learn. It's surprising that Transformers can achieve this at all but it's not emergent.
  - nopromisessir - a year ago
    
    Look...
    You are taking literally 2-4 token phrases from my comment and attacking them without context. I'll spend time on the latter quote. You quote 'emergent capability'.
    A) appreciate you correcting my spelling
    B) 'The narrow test tube environment in which we see better performance hints at the unknown which when better understood could promise further yields down the road.
    To my mind, the idea that filler tokens might promote immergent capability leading to broader task complexity'
    C) Now that we have actual context... I'll leave the rest to the thoughtful reader. I said the following key words: 'hints', 'could', 'might'
    D) Who asserted this behavior was emergent?
    Recommend slowing down next time. You might get a more clear picture before you attack a straw man. Expect no further exchange. Best of luck.