Let's Think Dot by Dot: Hidden Computation in Transformer Language Models

arxiv.org

159 points by Jimmc414 15 days ago


razodactyl - 14 days ago

I get the feeling this works due to the following:

1. An input is processed and the answer is generated token by token.

2. At each step the model outputs, based on probability, either an answer token or a filler token. Low-probability answers are passed over in favour of higher-probability filler tokens ("I don't have a better token to answer with than .....").

3. At a certain point an alignment is made with what was learnt previously, triggering a higher probability of outputting a better token.

This intuition comes from noticing that models respond differently depending on where in the context information appears. I can't speak for different embedding methods, however, as I'm sure they would change my thoughts on the above.

If chain-of-thought prompting is used instead, the additionally generated tokens may interfere with the output probabilities.

Further to this, I'm thinking filler tokens give the model a purer way to surface the best answer it has been trained on without introducing more noise. Alternatively, we can use methods that resample multiple times and keep the highest-scoring output (see the sketch below).
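A minimal sketch of that resampling idea (best-of-n, ranking by the model's own sequence log-probability). The sample_completion function and its candidate strings and scores are made up purely for illustration; a real sampler (an API call, or transformers' generate()) would slot in there.

    import math
    import random

    # Toy stand-in for an LLM sampler: returns a completion and the total
    # log-probability the "model" assigned to it. The strings and scores here
    # are fabricated for illustration only.
    def sample_completion(prompt):
        candidates = [
            ("answer: 42", -2.1),
            ("answer: 41", -5.7),
            ("answer: 40", -7.9),
        ]
        return random.choice(candidates)

    def best_of_n(prompt, n=8):
        """Resample n completions and keep the one with the highest log-probability."""
        samples = [sample_completion(prompt) for _ in range(n)]
        return max(samples, key=lambda s: s[1])

    if __name__ == "__main__":
        random.seed(0)
        text, logprob = best_of_n("What is 6 * 7?")
        print(text, "p =", round(math.exp(logprob), 3))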

These LLMs are practically search engines in disguise.

diziet - 14 days ago

This is a surprising result to me, given that (in my mind) the method simply does a few more forward passes, without encoding or transferring meaningful state between each pass.
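For what it's worth, the extra passes aren't entirely state-free: every filler position still produces hidden states at each layer, and later positions attend to them, so the final position has more intermediate vectors to read from even when the filler embeddings are identical and carry no information. A toy numpy sketch of single-head causal attention (made-up dimensions, not the paper's model):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 16                                         # toy hidden width
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    def causal_self_attention(x):
        """Single-head causal self-attention over a (seq_len, d) array."""
        T = x.shape[0]
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        scores = Q @ K.T / np.sqrt(d)
        scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # causal mask
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    prompt = rng.normal(size=(5, d))                    # 5 "real" prompt positions
    filler = np.tile(rng.normal(size=(1, d)), (10, 1))  # 10 identical "." embeddings

    short = causal_self_attention(prompt)
    longer = causal_self_attention(np.vstack([prompt, filler]))

    # Same output width either way, but the last position of the longer sequence
    # is a mixture over 15 value vectors instead of 5.
    print(short[-1].shape, longer[-1].shape)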

rgbrgb - 14 days ago

I found a nice thread walking through this paper by the first author here: https://twitter.com/jacob_pfau/status/1783951795238441449

segmondy - 14 days ago

What does a prompt for this look like?

nestorD - 14 days ago

My intuitive understanding of transformers is that each token (input + output) gives the model some "thinking space" that can be used to store reasoning information (independently of the token itself); thus, a transformer should be more clever when completing a 1000-token sequence than a 2-token sequence. This seems in line with the paper's finding.
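A rough back-of-envelope version of that intuition (the layer count and width below are made up, just for scale): the amount of intermediate "scratch" grows linearly with sequence length, and the number of attention interactions grows quadratically.

    layers, d_model = 32, 4096   # hypothetical model shape

    for seq_len in (2, 1000):
        hidden_vectors = layers * seq_len                   # one d_model-wide vector per layer per position
        attn_pairs = layers * seq_len * (seq_len + 1) // 2  # causal query-key pairs, summed over layers
        print(f"{seq_len:>4} tokens: {hidden_vectors:>6} hidden vectors, "
              f"{attn_pairs:>10} attention interactions")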

avereveard - 14 days ago

Couldn't they just keep the encoder part of the transformer architecture, rather than dropping it, if they need some additional processing on the prompt?

Vetch - 14 days ago

This paper reads to me as being about fundamental limitations of Transformers and backdoor risk.

The paper starts off by reviewing work which uses an encompassing theoretical model of transformers to prove they're limited to expressing only computations in TC^0 (roughly, upper-bounded by the class of parallelizable problems that can be solved by relatively shallow circuits).

There's also a reference to a paper which finds that (with respect to input problem size) a polynomial number of intermediate scratchpad decoding steps allows transformers to recognize the class of polynomial-time-solvable problems, while a linear number of steps corresponds to the context-sensitive languages.

This paper now asks about filler tokens: do they help? The answer is negative, except for a very clever exception they work out: problems whose demonstrations can be decomposed so as to be solvable in parallel. This identifies a practical limitation (transformer next-token prediction is not expressive enough to capture all of TC^0) at the same time as it identifies a theoretical capability. From the paper:

> Taken together these findings suggest that although current LLMs are unlikely to benefit from filler tokens, this is not an in-principle limitation of current architectures.

If I've understood correctly, this means that for a model to learn to use fillers from CoT data, the demonstrations must be structured such that they can be computed in parallel, rather than as a more natural sequential, instance-adaptive process.

> in order to use filler tokens on natural language data, LLMs would need to discover parallelizable algorithmic solutions given access only to CoT demonstrations lacking parallel structure. By training on instance-adaptive chains of thought, we can study whether models can learn to use filler tokens having seen only more naturalistic chain-of-thought data

> ...

> We find that models trained on instance-adaptive CoT data fail to use filler tokens. On filler token sequences, the resulting models remain at, or below, no-intermediate-token baseline performance (Figure 6). This indicates that there is no transfer from serial, instance-adaptive demonstrations to filler tokens for the 3SUM problem.
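To make the comparison concrete, here's a rough sketch of the two supervision formats for a 3SUM-style task (does any triple of the inputs sum to 0 mod 10?). The paper's actual tokenization and chain-of-thought format differ in the details; this only shows the shape of the data.

    import random
    from itertools import combinations

    def make_example(n=6, fillers=20):
        xs = [random.randrange(10) for _ in range(n)]
        label = any((a + b + c) % 10 == 0 for a, b, c in combinations(xs, 3))

        inputs = " ".join(map(str, xs))
        answer = "True" if label else "False"

        no_intermediate = f"{inputs} : {answer}"                   # baseline: answer immediately
        with_fillers    = f"{inputs} : {'. ' * fillers}{answer}"   # meaningless dots before the answer
        # An instance-adaptive chain of thought would instead spell out the
        # triples it checks before answering, which is serial rather than parallel.
        return no_intermediate, with_fillers

    if __name__ == "__main__":
        random.seed(0)
        for line in make_example():
            print(line)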

It also appears that the parallelizable problem must have a certain amount of structural complexity before a gap appears versus no-filler models (unless an impractical number of filler tokens is used):

> we expect integer addition tasks will not offer suitably rich structures for taking advantage of filler tokens when using large models—natural-language tasks may offer alternatives

Empirically, other papers have shown that LLM performance on complex tasks deteriorates significantly with input length and distractor text. Anyone who has naively attempted to combine RAG with large contexts may also have first-hand experience of this.

The reason I consider this to be primarily a backdoor risk is that the kind of data and learning required seems highly unlikely to occur naturally, but someone could deliberately create documents that introduce triggerable, obfuscated computations. While not an issue today, future LLM training might need to filter out data with meaningful parts separated by meaningless patterns of repeated characters.

This paper follows a recent trend of marketing excellent theoretical work as LLMs being capable of secretly plotting behind your back, when the realistic implication is backdoor risk.

An article currently on the first page is relevant:

https://www.strangeloopcanon.com/p/what-can-llms-never-do