A History of Large Language Models

gregorygundersen.com

296 points by alexmolas 7 days ago


jph00 - 4 days ago

This is quite a good overview, and parts reflect well how things played out in language model research. It's certainly true that language models and deep learning were not considered particularly promising in NLP, which frustrated me greatly at the time since I knew otherwise!

However the article misses the first two LLMs entirely.

Radford cited CoVe, ELMo, and ULMFiT as the inspirations for GPT. ULMFiT (my paper with Sebastian Ruder) was the only one of the three that actually fine-tuned the full language model for downstream tasks. https://thundergolfer.com/blog/the-first-llm

ULMFiT also pioneered the 3-stage approach: pretrain a language model on a large general corpus, fine-tune it on the target domain with a causal LM objective, and then fine-tune that with a classification objective. Much later this recipe was used in GPT-3.5 instruct, and today it is used pretty much everywhere.
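
For readers who haven't seen it, the three stages look roughly like this in PyTorch-flavoured pseudocode (hypothetical sizes and helpers, a sketch rather than the actual ULMFiT/fastai code):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Shared encoder, reused across all three stages (AWD-LSTM-ish sizes).
    embed = nn.Embedding(30000, 400)
    encoder = nn.LSTM(400, 1150, num_layers=3, batch_first=True)
    lm_head = nn.Linear(1150, 30000)   # predicts the next token
    clf_head = nn.Linear(1150, 2)      # e.g. binary sentiment

    def lm_loss(token_ids):
        # Causal LM objective: predict token t+1 from tokens <= t.
        hidden, _ = encoder(embed(token_ids[:, :-1]))
        logits = lm_head(hidden)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               token_ids[:, 1:].reshape(-1))

    def clf_loss(token_ids, labels):
        # Classification objective on top of the same (fine-tuned) encoder.
        hidden, _ = encoder(embed(token_ids))
        return F.cross_entropy(clf_head(hidden[:, -1]), labels)

    # Stage 1: pretrain with lm_loss on a large general corpus (e.g. Wikipedia).
    # Stage 2: fine-tune with lm_loss on the target-domain corpus.
    # Stage 3: fine-tune with clf_loss on labelled target-task data,
    #          keeping the encoder weights from stages 1 and 2.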

The other major oversight in the article is that Dai and Le (2015) is missing -- that paper pre-dated even ULMFiT in fine-tuning a language model for downstream tasks, but it missed the key insight that a general-purpose model pretrained on a large corpus was the critical first step.

It's also missing a key piece of the puzzle regarding attention and transformers: the memory networks paper recently had its 10th birthday and there's a nice writeup of its history here: https://x.com/tesatory/status/1911150652556026328?s=46

It came out about the same time as the Neural Turing Machines paper (https://arxiv.org/abs/1410.5401), covering similar territory -- both pioneered the idea of combining attention and memory in ways later incorporated into transformers.

Al-Khwarizmi - 4 days ago

A great writeup; just let me make two nitpicks (not to diminish the author's awesome effort, but in case they wish to take suggestions).

1. I think the paper underemphasizes the relevance of BERT. From today's LLM-centric perspective it may seem minor because it sits in a different branch of the tech tree, but it smashed multiple benchmarks at the time and made previous approaches to many NLP analysis tasks immediately obsolete. While I don't much like citation counts as a metric, a testament to its impact is that it has more than 145K citations, in the same order of magnitude as the Transformer paper (197K) and many more than GPT-1 (16K). GPT-1 would ultimately be a landmark paper because of what came afterwards, but at the time it wasn't that useful: it was oriented toward generation (without being that good at it) and, IIRC, not really publicly available (technically open source, but not posted in a repository or packaged in a way that let you actually run it). It's also worth remarking that for many non-generative NLP tasks (NER, parsing, sentence/document classification, etc.) the best option is often still a BERT-like model, even in 2025.

2. The writing kind of implies that modern LLMs were consciously sought after ("the transformer architecture was not enough. Researchers also needed advancements in how these models were trained in order to make the commodity LLMs most people interact with today"). The truth is that no one in the field expected modern LLMs. The story was more that OpenAI researchers noticed GPT-2 was good at generating random text that looked fluent and thought, "if we make it bigger, it will do that even better." But it turned out that not only did it generate better random text, it started being able to actually state real facts (the occasional hallucination aside), answer questions, translate, be creative, etc. All those emergent abilities that are the basis of the "commodity LLMs most people interact with today" were a totally unexpected development. In fact, it is still poorly understood why they work.

empiko - 4 days ago

What a great write-up, kudos to the author! I’ve been in the field since 2014, so this really feels like reliving my career. I think one paradigm shift that isn’t fully represented in the article is what we now call “genAI.” Sure, we had all kinds of language models (BERTs, word embeddings, etc.), but in the end, most people used them to build customized classifiers or regression models. Nobody was thinking about “solving” tasks by asking oracle-like models questions in natural language. That was considered completely impossible with our technology even in 2018/19. Some people studied language models, but that definitely wasn’t their primary use case; they were mainly used to support tasks like speech-to-text, grammar correction, or similar applications.

With GPT-3 and later ChatGPT, there was a fundamental shift in how people think about approaching NLP problems. Many techniques and methods became outdated, and you could suddenly do things that were not feasible before.

jszymborski - 3 days ago

I enjoyed this. With the hindsight of today's LMs, people might get a kick out of reading Claude Shannon's "Prediction and Entropy of Printed English", which was published as early as 1950 [0] and later expanded on by Cover and King in 1978 [1].

They are fun reads, and people interested in LMs like myself probably won't be able to stop noticing the echoes of this work in Bengio et al.'s 2003 paper.

[0] Shannon CE. Prediction and Entropy of Printed English. In: Claude E Shannon: Collected Papers [Internet]. IEEE; 1993 [cited 2025 Sep 15]. p. 194–208. Available from: https://ieeexplore.ieee.org/document/5312178

[1] Cover T, King R. A convergent gambling estimate of the entropy of English. IEEE Trans Inform Theory. 1978 Jul;24(4):413–21.
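
If anyone wants to play with the basic idea, here's a toy sketch of my own (not Shannon's method: he used n-gram statistics and human prediction experiments, landing at roughly one bit per letter for English) that estimates bits per character with a smoothed character n-gram model:

    import math
    from collections import Counter, defaultdict

    def bits_per_char(text, order=3):
        # Cross-entropy (bits/char) of `text` under an order-n character model
        # estimated from the same text, with add-one smoothing. A real estimate
        # would train on one corpus and evaluate on held-out text.
        counts = defaultdict(Counter)
        for i in range(order, len(text)):
            counts[text[i - order:i]][text[i]] += 1

        vocab_size = len(set(text))
        total_bits = 0.0
        for i in range(order, len(text)):
            ctx = counts[text[i - order:i]]
            p = (ctx[text[i]] + 1) / (sum(ctx.values()) + vocab_size)
            total_bits -= math.log2(p)
        return total_bits / max(len(text) - order, 1)

    sample = ("the quick brown fox jumps over the lazy dog and then "
              "the dog chases the fox back across the field ") * 40
    print(f"{bits_per_char(sample):.2f} bits/char")  # repeated text, so this comes out low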

chermi - 12 hours ago

Very nice.

When I first started using LLMs, I thought this sort of history retracing would be something you could use them for. They were good at language, and research papers are language + math + graphs. At the time they didn't really understand math and weren't multimodal yet, but I still decided to try a very basic version: feed one some papers I knew very well in my area of expertise and try to construct the genealogy of the main idea by tracing references.

What I found at the time was garbage, but I attribute that mostly to my not being very rigorous. It suggested papers that came years after the actual catalysts and were basically regurgitations of existing results. Not even syntheses, just garbage papers that will never be cited by anyone but the authors themselves.

What I concluded was that it didn't work because LLMs don't understand ideas, so they can't really trace them. They were basically doing dot products to find the papers in the current literature that matched the wording best, which of course yields a recency bias, as subfields converge on common phrasings. I think there's also an "unoriginality" bias, in the sense that the true catalyst/origin of an idea will likely not have the most refined and "survivable" way of describing the new idea. New ideas are new, and upon digestion by the community will probably come out looking a little different. That is to say, raw text matching isn't the best approach to tracing ideas.
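
To be concrete about what I mean by "doing dot products" (the embedding function below is a stand-in, not any particular model): rank candidate papers by cosine similarity between their abstract embeddings and the query paper's, and whatever phrasing currently dominates the field wins, regardless of which paper originated the idea.

    import numpy as np

    def embed_text(text):
        # Placeholder embedding: hashed character trigram counts, L2-normalised.
        # Any sentence-embedding model would slot in here instead.
        v = np.zeros(256)
        for i in range(len(text) - 2):
            v[hash(text[i:i + 3]) % 256] += 1.0
        return v / (np.linalg.norm(v) + 1e-9)

    def rank_candidates(query_abstract, candidate_abstracts, k=3):
        q = embed_text(query_abstract)
        scored = [(float(q @ embed_text(a)), a) for a in candidate_abstracts]
        return sorted(scored, reverse=True)[:k]   # highest cosine similarity first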

I'm absolutely certain someone could do, and has done, a much better job than my amateur exploration, and I'd love to know more. As far as I know, methods based solely on analysis of citation graphs could probably beat what I tried.

Warning: ahead are less-than-half-baked ideas.

But now I'm wondering if you could extend the idea of "addition in language space" that word embeddings encode (king - man + woman = queen, or whatever that example is) to addition in the space of ideas/concepts as expressed in scientific research articles. It seems most doable in math, where results are encapsulated in theorems and mathematicians are otherwise precise about the pieces needed to construct a result. Maybe this already exists in automatic theorem provers, which I know exist but don't understand. Like, what is the missing piece between "two intersecting lines form a plane" and "n-d space is spanned by n independent vectors"? What's the "delta" that gets you from a 2-d basis to an n-d basis? I can't even come up with a clean example of what I'm trying to convey...
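
For reference, the arithmetic I'm alluding to, assuming you've already loaded pretrained word vectors (word2vec, GloVe, whatever) into a plain dict, which isn't shown here:

    import numpy as np

    def analogy(vectors, a, b, c, topn=1):
        # Words closest to vec(b) - vec(a) + vec(c), excluding the inputs,
        # e.g. analogy(vectors, "man", "king", "woman") should surface "queen".
        target = vectors[b] - vectors[a] + vectors[c]
        target = target / np.linalg.norm(target)
        scored = []
        for word, vec in vectors.items():
            if word in (a, b, c):
                continue
            scored.append((float(target @ (vec / np.linalg.norm(vec))), word))
        return sorted(scored, reverse=True)[:topn]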

What I'm trying to say is, wouldn't it be cool if we could 1) take a paper P published in 2025, 2) consider all papers/talks/proceedings/blog posts published before it, and 3) come up with the set of papers that requires the smallest "delta" in idea space to reach P. That is, new idea(s) = novel part of P = delta = P - (contributions of the ideas represented by the rest of the papers in the set). Suppose further you have some clustering to clean stuff up, so you have just one paper per contributing idea, P_x representing idea x (or maybe a set).

Then you could do stuff like remove (1) from the corpus all of the papers similar to the P_x representing the single idea x that contributed the most to the sum current_paper_idea(s) = delta + (contributions x_i from preexisting ideas). With that idea x no longer in existence, how hard is it to get to the new idea, i.e., how much bigger is delta? And perhaps more interesting, is there a novel route to the new idea? This presupposes the ability of the system to figure out the missing piece(s), but my optimistic take is that it's much easier to get to a result when you already know the result. Of course, the larger the delta, the harder it is to construct a new path. If culling an idea leads to the inability to construct any new path, it was probably quite important. I think this would be valuable for trying to trace the most likely path to a paper -- emphasis on most likely, with the enormous assumption that "shortest path" = most likely; we'll never really know where someone got an idea. But it would also be valuable for uncovering different trajectories/routes from one set of ideas to another via the proposed deletion perturbations. Maybe it unveils a better pedagogical approach, an otherwise unknown connection between subfields, or at the very least is instructive in the same way that knowing how to solve a problem multiple ways is instructive.

That's all very, very vague and hand-wavy, but I'm guessing there are ideas in epistemology, knowledge graphs, and other things I don't know about that could bring it a little closer to making sense.

Thank you for sitting through my brain dump, feel free to shit on it.

(1) This whole half-baked idea needs a lot of work. Most obviously, being sure you've cleansed the idea space of everything coming from those papers would probably require complete retraining. Also, this whole thing presupposes ideas are traceable to publications, which is unlikely for many reasons.

brcmthrowaway - 4 days ago

Dumb question: what is the difference between an embedding and a bag of words?

WolfOliver - 4 days ago

With what tool was this article written?

sreekanth850 - 4 days ago

I've been wondering on what basis @sama keeps saying they are near AGI when, in reality, LLMs just calculate sequences and probabilities. I really doubt this bubble is going to burst soon.