Attention Wasn't All We Needed

stephendiehl.com

95 points by mooreds 7 hours ago


yorwba - 5 hours ago

The explanation for Multi-head Latent Attention https://www.stephendiehl.com/posts/post_transformers/#multi-... does not match the definition in the DeepSeek-V2 paper https://arxiv.org/pdf/2405.04434#subsection.2.1

MLA as developed by DeepSeek is a technique to reduce the memory footprint of the KV cache by storing only two vectors of size latent_dim and rope_dim per token and layer, instead of 2 * num_heads vectors of size head_dim. (DeepSeek-V3 has num_heads = 128 and head_dim = 128 vs latent_dim = 512 and rope_dim = 64, so a significant reduction https://arxiv.org/pdf/2412.19437#subsection.4.2 )
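
For a back-of-the-envelope feel for the saving, using the DeepSeek-V3 numbers above (a rough sketch; the variable names are mine):

    # per-token, per-layer KV cache entries (ignoring dtype width)
    num_heads, head_dim = 128, 128
    latent_dim, rope_dim = 512, 64

    standard_kv = 2 * num_heads * head_dim  # keys + values for every head
    mla_cache = latent_dim + rope_dim       # compressed latent + decoupled RoPE key

    print(standard_kv, mla_cache, standard_kv / mla_cache)
    # 32768 576 ~57x smaller cache per token and layer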

What this article describes instead is some kind of two-step attention scheme I haven't seen before and that I think wouldn't work with causal masking (despite mask appearing in the example code) because either you allow an earlier token to attend to a latent that attended to a later token (creating backwards information flow) or the latents can only attend to a limited prefix of the sequence, after which they're frozen and useless. I wonder whether the author dreamed it up himself or whether someone else is actually using this somewhere.

johnsmith1840 - 2 hours ago

One interesting thought process I've had around these topics is that it's not just attention: all DL methods suffer similar problems.

I truly believe the last step to AGI is solving continual learning. Efficiency will always inch up, but the "jump" is honestly not in sight.

Maybe attention + (unknown thing) really is all we need.

The thought is interesting because if you extrapolate that all DL models suffer from the same class of problem (CL), the solution implies two possibilities.

1. In the future, AGI-level models will be entirely new categories, sharing little to nothing with methods like attention. (Every part is different, as the article suggests.)

2. Or (maybe more likely) we will simply build on what we have. If that's true, then next-generation models in the AGI-like realm will be the same models we have now, with one unifying change to all of them.

I previously made a unique transformer model in which every single neuron acted like a decision gate. Every neuron would choose a "computation neuron" before going on. Backprop was modified so that only computation neurons contributed to the backprop of the next layer.

It had some interesting properties, the largest being that every token's loop through the model was essentially seeing a completely different model. I was/am under the belief that scaling dimensionality == solving CL.

I bring it up because technically this architecture was identical to the transformer. I could drop my special neuron into literally any DL model out there and train.
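
For concreteness, here is a rough sketch of what such a decision-gate neuron could look like in PyTorch. This is my own illustration, not the commenter's actual code; the hard argmax routing with a straight-through estimator (so that only the chosen "computation neuron" receives gradient) is an assumption, and the result ends up resembling hard mixture-of-experts routing:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DecisionGateLayer(nn.Module):
        """Each unit routes through one of `num_choices` candidate
        'computation neurons'; only the chosen one receives gradient."""
        def __init__(self, dim, num_choices=4):
            super().__init__()
            self.gate = nn.Linear(dim, num_choices)            # decision gate
            self.experts = nn.ModuleList(nn.Linear(dim, dim)   # candidate computations
                                         for _ in range(num_choices))

        def forward(self, x):                                  # x: (batch, dim)
            scores = self.gate(x)                              # (batch, num_choices)
            hard = F.one_hot(scores.argmax(-1), scores.size(-1)).float()
            soft = scores.softmax(-1)
            weights = hard + soft - soft.detach()              # straight-through trick
            outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (batch, dim, choices)
            return (outs * weights.unsqueeze(1)).sum(-1)       # only the chosen path fires

Dropping something like this into an existing model is indeed architecture-agnostic, which seems to be the commenter's point.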

I believe this kind of advancement is what the next generation of models will be: not a change to the transformer or attention, but to the fundamental building blocks of all DL models.

It honestly does feel like attention gets us part of the AGI equation well enough. It seems to have solved, or will soon solve, most short-term hard problems. Again, this is why CL is key: it's the time component that no AI method across the board has ever solved.

kouteiheika - 6 hours ago

> Let's look at some of the most important ones that have been developed over the years and try to implement the basic ideas as succinctly as possible.

One big architectural tweak that comes to mind and isn't in the article is QK norm: https://arxiv.org/pdf/2010.04245
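
For readers who haven't seen it: the idea is simply to normalize the per-head queries and keys before taking their dot product, which keeps the attention logits in a bounded range and stabilizes training. A minimal sketch of my own (here with RMSNorm; the paper and various models use slightly different normalizations):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class QKNormAttention(nn.Module):
        def __init__(self, dim, num_heads):
            super().__init__()
            self.num_heads, self.head_dim = num_heads, dim // num_heads
            self.qkv = nn.Linear(dim, 3 * dim)
            self.out = nn.Linear(dim, dim)
            self.q_norm = nn.RMSNorm(self.head_dim)  # normalize queries per head
            self.k_norm = nn.RMSNorm(self.head_dim)  # normalize keys per head

        def forward(self, x):                        # x: (batch, seq, dim)
            b, t, d = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            shape = (b, t, self.num_heads, self.head_dim)
            q, k, v = (z.view(shape).transpose(1, 2) for z in (q, k, v))
            q, k = self.q_norm(q), self.k_norm(k)    # <- the QK norm step
            y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            return self.out(y.transpose(1, 2).reshape(b, t, d))

(nn.RMSNorm needs a recent PyTorch; nn.LayerNorm works as a drop-in otherwise.)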

> Cosine Schedule

A lot (most?) of new training runs actually don't use a cosine schedule anymore; instead they keep the learning rate constant and only decay it at the very end, which gives equivalent or better results. See:

https://arxiv.org/pdf/2405.18392 https://arxiv.org/pdf/2404.06395
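
The shape of those schedules (sometimes called warmup-stable-decay, or trapezoidal) is easy to sketch. The exact split points and the linear decay at the end are my assumptions here; the papers above compare the variants:

    def wsd_lr(step, total_steps, peak_lr=3e-4, warmup_steps=1000, decay_frac=0.1):
        """Linear warmup, constant plateau, then decay only at the very end."""
        decay_start = int(total_steps * (1 - decay_frac))
        if step < warmup_steps:
            return peak_lr * step / warmup_steps   # warmup
        if step < decay_start:
            return peak_lr                         # constant for most of training
        return peak_lr * (total_steps - step) / (total_steps - decay_start)  # final decay

One selling point discussed in the linked papers is that you can branch a short decay off any checkpoint on the plateau, instead of committing to a total step count up front.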

> There is a highly optimized implementation of AdamW in PyTorch.

A fun tidbit - it's actually not that highly optimized, in my experience. Imagine my surprise when I reimplemented it in Triton (because I needed to tweak a few things) and got better performance than the built-in PyTorch implementation.
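
For the curious, a fused AdamW update is not much code in Triton. The following is a minimal single-tensor sketch of the idea, not the commenter's kernel; a production version would batch all parameters into one launch and handle mixed dtypes:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def adamw_kernel(p_ptr, g_ptr, m_ptr, v_ptr,
                     lr, beta1, beta2, eps, weight_decay,
                     bias1, bias2, n_elements, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n_elements
        p = tl.load(p_ptr + offs, mask=mask)
        g = tl.load(g_ptr + offs, mask=mask)
        m = tl.load(m_ptr + offs, mask=mask)
        v = tl.load(v_ptr + offs, mask=mask)
        m = beta1 * m + (1.0 - beta1) * g                  # first moment
        v = beta2 * v + (1.0 - beta2) * g * g              # second moment
        update = (m / bias1) / (tl.sqrt(v / bias2) + eps)  # bias-corrected step
        p = p * (1.0 - lr * weight_decay) - lr * update    # decoupled weight decay
        tl.store(p_ptr + offs, p, mask=mask)
        tl.store(m_ptr + offs, m, mask=mask)
        tl.store(v_ptr + offs, v, mask=mask)

    def adamw_step(p, g, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, weight_decay=0.01, BLOCK=1024):
        # p, g, m, v: contiguous float32 CUDA tensors of the same shape
        n = p.numel()
        adamw_kernel[(triton.cdiv(n, BLOCK),)](
            p, g, m, v, lr, beta1, beta2, eps, weight_decay,
            1 - beta1 ** step, 1 - beta2 ** step, n, BLOCK=BLOCK)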

empiko - 5 hours ago

Nice writeup, but regarding the title -- I find it fascinating how powerful attention really is. There were some tweaks developed, sure, but if I open the Llama 4 code on Hugging Face, it is more or less the same code I saw there 5 years ago. Despite all the AI hype, we are still just exploiting tech developed in 2015-2020. And despite NeurIPS boasting 25k papers this year, the innovation rate in deep learning seems to stagnate.

flebron - 6 hours ago

This is an excellent summary of these techniques :) I like that every single one comes with an example implementation, with shape comments on the tensors. Thanks Stephen!

andrewmcwatters - 7 hours ago

I know this probably seems like such a small detail to a lot of people, but I really love that the author adds comments.

I can't stand reading PyTorch or other neural network code and asking myself, "What architecture am I looking at here?" or "What the hell are these operations for?"

It's always like a mash-up of published paper code with deep effort behind it and all the worst programming practices of complete unreadability.

jdeaton - 5 hours ago

The first four things on the list are attention.