A History of Large Language Models

gregorygundersen.com

296 points by alexmolas 7 days ago


jph00 - 4 days ago

This is quite a good overview, and parts reflect well how things played out in language model research. It's certainly true that language models and deep learning were not considered particularly promising in NLP, which frustrated me greatly at the time since I knew otherwise!

However the article misses the first two LLMs entirely.

Radford cited CoVe, ELMo, and ULMFiT as the inspirations for GPT. ULMFiT (my paper with Sebastian Ruder) was the only one of the three that actually fine-tuned the full language model for downstream tasks. https://thundergolfer.com/blog/the-first-llm

ULMFiT also pioneered the 3-stage approach: pretrain a language model on a large general corpus, fine-tune it on the target domain with a causal LM objective, and then fine-tune that with a classification objective. Much later this recipe was used in GPT-3.5 instruct, and today it is used pretty much everywhere.
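
For readers who haven't seen it, the three stages look roughly like this in PyTorch-flavoured pseudocode (hypothetical sizes and helpers, a sketch rather than the actual ULMFiT/fastai code):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Shared encoder, reused across all three stages (AWD-LSTM-ish sizes).
    embed = nn.Embedding(30000, 400)
    encoder = nn.LSTM(400, 1150, num_layers=3, batch_first=True)
    lm_head = nn.Linear(1150, 30000)   # predicts the next token
    clf_head = nn.Linear(1150, 2)      # e.g. binary sentiment

    def lm_loss(token_ids):
        # Causal LM objective: predict token t+1 from tokens <= t.
        hidden, _ = encoder(embed(token_ids[:, :-1]))
        logits = lm_head(hidden)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               token_ids[:, 1:].reshape(-1))

    def clf_loss(token_ids, labels):
        # Classification objective on top of the same (fine-tuned) encoder.
        hidden, _ = encoder(embed(token_ids))
        return F.cross_entropy(clf_head(hidden[:, -1]), labels)

    # Stage 1: pretrain with lm_loss on a large general corpus (e.g. Wikipedia).
    # Stage 2: fine-tune with lm_loss on the target-domain corpus.
    # Stage 3: fine-tune with clf_loss on labelled target-task data,
    #          keeping the encoder weights from stages 1 and 2.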

The other major oversight in the article is that Dai and Le (2015) is missing -- that paper pre-dated even ULMFiT in fine-tuning a language model for downstream tasks, but it missed the key insight that a general-purpose model pretrained on a large corpus was the critical first step.

It's also missing a key piece of the puzzle regarding attention and transformers: the memory networks paper recently had its 10th birthday and there's a nice writeup of its history here: https://x.com/tesatory/status/1911150652556026328?s=46

It came out about the same time as the Neural Turing Machines paper (https://arxiv.org/abs/1410.5401), covering similar territory -- both pioneered the idea of combining attention and memory in ways later incorporated into transformers.

Al-Khwarizmi - 4 days ago

A great writeup; just let me make two nitpicks (not to diminish the author's awesome effort, but in case they wish to take suggestions).

1. I think the paper underemphasizes the relevance of BERT. From today's LLM-centric perspective it may seem minor because it sits in a different branch of the tech tree, but it smashed multiple benchmarks at the time and made previous approaches to many NLP analysis tasks immediately obsolete. While I don't much like citation counts as a metric, a testament to its impact is that it has more than 145K citations, in the same order of magnitude as the Transformer paper (197K) and many more than GPT-1 (16K). GPT-1 would ultimately be a landmark paper because of what came afterwards, but at the time it wasn't that useful: it was oriented toward generation (without being that good at it) and, IIRC, not really publicly available (technically open source, but not posted in a repository or packaged in a way that let you actually run it). It's also worth remarking that for many non-generative NLP tasks (NER, parsing, sentence/document classification, etc.) the best option is often still a BERT-like model, even in 2025.

2. The writing kind of implies that modern LLMs were consciously sought after ("the transformer architecture was not enough. Researchers also needed advancements in how these models were trained in order to make the commodity LLMs most people interact with today"). The truth is that no one in the field expected modern LLMs. The story was more that OpenAI researchers noticed GPT-2 was good at generating random text that looked fluent and thought, "if we make it bigger, it will do that even better." But it turned out that not only did it generate better random text, it started being able to actually state real facts (the occasional hallucination aside), answer questions, translate, be creative, etc. All those emergent abilities that are the basis of the "commodity LLMs most people interact with today" were a totally unexpected development. In fact, it is still poorly understood why they work.

empiko - 4 days ago

What a great write-up, kudos to the author! I’ve been in the field since 2014, so this really feels like reliving my career. I think one paradigm shift that isn’t fully represented in the article is what we now call “genAI.” Sure, we had all kinds of language models (BERTs, word embeddings, etc.), but in the end, most people used them to build customized classifiers or regression models. Nobody was thinking about “solving” tasks by asking oracle-like models questions in natural language. That was considered completely impossible with our technology even in 2018/19. Some people studied language models, but that definitely wasn’t their primary use case; they were mainly used to support tasks like speech-to-text, grammar correction, or similar applications.

With GPT-3 and later ChatGPT, there was a fundamental shift in how people think about approaching NLP problems. Many techniques and methods became outdated, and you could suddenly do things that were not feasible before.

jszymborski - 3 days ago

I enjoyed this. With the hindsight of today's LMs, people might get a kick out of reading Claude Shannon's "Prediction and Entropy of Printed English", which was published as early as 1950 [0] and later expanded on by Cover and King in 1978 [1].

They are fun reads, and people interested in LMs like myself probably won't be able to stop noticing the echoes of this work in Bengio et al.'s 2003 paper.

[0] Shannon CE. Prediction and Entropy of Printed English. In: Claude E Shannon: Collected Papers [Internet]. IEEE; 1993 [cited 2025 Sep 15]. p. 194–208. Available from: https://ieeexplore.ieee.org/document/5312178

[1] Cover T, King R. A convergent gambling estimate of the entropy of English. IEEE Trans Inform Theory. 1978 Jul;24(4):413–21.
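
If anyone wants to play with the basic idea, here's a toy sketch of my own (not Shannon's method: he used n-gram statistics and human prediction experiments, landing at roughly one bit per letter for English) that estimates bits per character with a smoothed character n-gram model:

    import math
    from collections import Counter, defaultdict

    def bits_per_char(text, order=3):
        # Cross-entropy (bits/char) of `text` under an order-n character model
        # estimated from the same text, with add-one smoothing. A real estimate
        # would train on one corpus and evaluate on held-out text.
        counts = defaultdict(Counter)
        for i in range(order, len(text)):
            counts[text[i - order:i]][text[i]] += 1

        vocab_size = len(set(text))
        total_bits = 0.0
        for i in range(order, len(text)):
            ctx = counts[text[i - order:i]]
            p = (ctx[text[i]] + 1) / (sum(ctx.values()) + vocab_size)
            total_bits -= math.log2(p)
        return total_bits / max(len(text) - order, 1)

    sample = ("the quick brown fox jumps over the lazy dog and then "
              "the dog chases the fox back across the field ") * 40
    print(f"{bits_per_char(sample):.2f} bits/char")  # repeated text, so this comes out low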

chermi - 12 hours ago

Very nice.

When I first started using LLMs, I thought this sort of history retracing would be something you could use them for. They were good at language, and research papers are language + math + graphs. At the time they didn't really understand math and weren't multimodal yet, but I still decided to try a very basic version: feed one some papers I knew very well in my area of expertise and try to construct the genealogy of the main idea by tracing references.

What I found at the time was garbage, but I attribute that mostly to my not being very rigorous. It suggested papers that came years after the actual catalysts and were basically regurgitations of existing results. Not even syntheses, just garbage papers that will never be cited by anyone but the authors themselves.

What I concluded was that it didn't work because LLMs don't understand ideas, so they can't really trace them. They were basically doing dot products to find the papers in the current literature that matched the wording best, which of course yields a recency bias, as subfields converge on common phrasings. I think there's also an "unoriginality" bias, in the sense that the true catalyst/origin of an idea will likely not have the most refined and "survivable" way of describing the new idea. New ideas are new, and upon digestion by the community will probably come out looking a little different. That is to say, raw text matching isn't the best approach to tracing ideas.
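
To be concrete about what I mean by "doing dot products" (the embedding function below is a stand-in, not any particular model): rank candidate papers by cosine similarity between their abstract embeddings and the query paper's, and whatever phrasing currently dominates the field wins, regardless of which paper originated the idea.

    import numpy as np

    def embed_text(text):
        # Placeholder embedding: hashed character trigram counts, L2-normalised.
        # Any sentence-embedding model would slot in here instead.
        v = np.zeros(256)
        for i in range(len(text) - 2):
            v[hash(text[i:i + 3]) % 256] += 1.0
        return v / (np.linalg.norm(v) + 1e-9)

    def rank_candidates(query_abstract, candidate_abstracts, k=3):
        q = embed_text(query_abstract)
        scored = [(float(q @ embed_text(a)), a) for a in candidate_abstracts]
        return sorted(scored, reverse=True)[:k]   # highest cosine similarity first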

I'm absolutely certain someone could do, and has done, a much better job than my amateur exploration, and I'd love to know more. As far as I know, methods based solely on analysis of citation graphs could probably beat what I tried.

Warning: ahead are less-than-half-baked ideas.

But now I'm wondering if you could extend the idea of "addition in language space" that word embeddings encode (king - man + woman = queen, or whatever that example is) to addition in the space of ideas/concepts as expressed in scientific research articles. It seems most doable in math, where results are encapsulated in theorems and mathematicians are otherwise precise about the pieces needed to construct a result. Maybe this already exists in automatic theorem provers, which I know exist but don't understand. Like, what is the missing piece between "two intersecting lines form a plane" and "n-d space is spanned by n independent vectors"? What's the "delta" that gets you from a 2-d basis to an n-d basis? I can't even come up with a clean example of what I'm trying to convey...
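
For reference, the arithmetic I'm alluding to, assuming you've already loaded pretrained word vectors (word2vec, GloVe, whatever) into a plain dict, which isn't shown here:

    import numpy as np

    def analogy(vectors, a, b, c, topn=1):
        # Words closest to vec(b) - vec(a) + vec(c), excluding the inputs,
        # e.g. analogy(vectors, "man", "king", "woman") should surface "queen".
        target = vectors[b] - vectors[a] + vectors[c]
        target = target / np.linalg.norm(target)
        scored = []
        for word, vec in vectors.items():
            if word in (a, b, c):
                continue
            scored.append((float(target @ (vec / np.linalg.norm(vec))), word))
        return sorted(scored, reverse=True)[:topn]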

What I'm trying to say is, wouldn't it be cool if we could 1) take a paper P published in 2025, 2) consider all papers/talks/proceedings/blog posts published before it, and 3) come up with the set of papers that requires the smallest "delta" in idea space to reach P. That is, new idea(s) = novel part of P = delta = P - (contributions of the ideas represented by the rest of the papers in the set). Suppose further you have some clustering to clean stuff up, so you have just one paper per contributing idea, P_x representing idea x (or maybe a set).

Then you could do stuff like remove (1) from the corpus all of the papers similar to the P_x representing the single idea x that contributed the most to the sum current_paper_idea(s) = delta + (contributions x_i from preexisting ideas). With that idea x no longer in existence, how hard is it to get to the new idea, i.e., how much bigger is delta? And perhaps more interesting, is there a novel route to the new idea? This presupposes the ability of the system to figure out the missing piece(s), but my optimistic take is that it's much easier to get to a result when you already know the result. Of course, the larger the delta, the harder it is to construct a new path. If culling an idea leads to the inability to construct any new path, it was probably quite important. I think this would be valuable for trying to trace the most likely path to a paper -- emphasis on most likely, with the enormous assumption that "shortest path" = most likely; we'll never really know where someone got an idea. But it would also be valuable for uncovering different trajectories/routes from one set of ideas to another via the proposed deletion perturbations. Maybe it unveils a better pedagogical approach, an otherwise unknown connection between subfields, or at the very least is instructive in the same way that knowing how to solve a problem multiple ways is instructive.

That's all very, very vague and hand-wavy, but I'm guessing there are ideas in epistemology, knowledge graphs, and other things I don't know about that could bring it a little closer to making sense.

Thank you for sitting through my brain dump, feel free to shit on it.

(1) This whole half-baked idea needs a lot of work. Most obviously, being sure you've cleansed the idea space of everything coming from those papers would probably require complete retraining. Also, this whole thing presupposes ideas are traceable to publications, which is unlikely for many reasons.

brcmthrowaway - 4 days ago

Dumb question: what is the difference between an embedding and a bag of words?

WolfOliver - 4 days ago

With what tool was this article written?

sreekanth850 - 4 days ago

I've been wondering on what basis @sama keeps saying they are near AGI when, in reality, LLMs just calculate sequences and probabilities. I really doubt this bubble is going to burst soon.