LLMs tell bad jokes because they avoid surprises
danfabulich.medium.com | 139 points by dfabulich | 8 days ago
This sounds really convincing but I'm not sure it's actually correct. The author is conflating the surprise of punchlines with their likelihood.
To put it another way, ask a professional comedian to complete a joke with a punchline. It's very likely that they'll give you a funny surprising answer.
I think the real explanation is that good jokes are actually extremely difficult. I have young children (4 and 6). Even 6 year olds don't understand humour at all. Much like LLMs, they know the shape of a joke from hearing them before, but they aren't funny, in the same way LLM jokes aren't funny.
My 4 year old's favourite joke, that she is very proud of creating is "Why did the sun climb a tree? To get to the sky!" (Still makes me laugh of course.)
Yeah. To me it seems very intuitive that humor is one of those emergent capabilities that just falls out of models getting more generally intelligent. Anecdotally this has been proven true so far for me. Gemini 2.5 has made me laugh several times at this point, and did so when it was intending to be funny (old models were only funny unintentionally).
2.5 is also one of the few models I've found that will 'play along' with jokes set up in the user prompt. I once asked it what IDE modern necromancers were using since I'd been out of the game for a while, and it played it very straight. Other models felt they had to acknowledge the scenario as fanciful, only engaging with it under an explicit veil of make-believe.
In this paper they evaluate various LLMs on creative writing, and they find that while in other dimensions the ranking is gradual, on humor there is a binary divide: the best LLMs (of the time) "get it", the rest just don't. https://aclanthology.org/2023.findings-emnlp.966
I found your example of a joke a child made very interesting - to me a good joke is something that brings an unexpected perspective on things while highlighting some contradiction in one's world model.
In the adult world model there is absolutely no contradiction in the joke you mention - it’s just a bit of cute nonsense.
But in a child’s world this joke might be capturing an apparent contradiction - the sky is “in the tree”, so the sun must have climbed it to be there (as they would have to do), yet they also know that the sun is already in the sky, so it had absolutely no reason to do that. Also, “because it’s already there” - which is a tricky idea in itself.
We take planetary systems and algebra and other things we can’t really perceive for granted, but a child’s model of the world is made of concrete objects that mostly need a surface to be on, so the sun is a bit of a conundrum in itself! (Speaking from my own experience of the shift from arithmetic to algebra when I was ~8.)
If not too much of a personal question - I would love to hear what your child would answer to a question why she finds that joke funny. And whether she agrees with my explanation why it must be funny :-)
> It's very likely that they'll give you a funny surprising answer.
Entirely the wrong level of abstraction to apply the concept of "surprise". The actual tokens in the comedian's answer will be surprising in the relevant way.
(It's still true that surprising-but-inevitable is very difficult in any form.)
It's not about the probability of individual tokens. It's about the probability of the whole sequence of tokens, the whole answer.
If the model is good (or the human comedian is good), a good funny joke would have a higher probability as the response to the question than a not-so-funny joke.
When you use the chain rule of probability to break the probability of the sequence down into probabilities of individual tokens, yes, some of them might have a low probability (and at some steps there would be other tokens with higher probability). But what counts is the overall probability of the sequence. That's why greedy search is not necessarily the best. A good search algorithm is supposed to find the most likely sequence, e.g. by beam search. (But then, people also do nucleus sampling, which is maybe again a bit counterintuitive...)
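To make that concrete, here's a minimal sketch, assuming a hand-made toy vocabulary and made-up conditional probabilities (nothing here comes from a real model): the chain rule assigns a probability to the whole sequence, and greedy decoding, which only looks one token ahead, can miss the most likely sequence. Exhaustive enumeration stands in for beam search, since the toy space is tiny.

    # Toy sketch: hand-made vocabulary and made-up conditional probabilities,
    # just to show why greedy decoding (always taking the most likely *next*
    # token) can miss the most likely *sequence*.

    # p(next token | prefix); "<eos>" ends a sequence.
    COND = {
        (): {"so": 0.6, "why": 0.4},
        ("so",): {"anyway": 0.55, "<eos>": 0.45},
        ("so", "anyway"): {"<eos>": 1.0},
        ("why",): {"not": 0.9, "<eos>": 0.1},
        ("why", "not"): {"<eos>": 1.0},
    }

    def seq_prob(tokens):
        # Chain rule: p(sequence) = product of p(token_i | tokens before i).
        p = 1.0
        for i, tok in enumerate(tokens):
            p *= COND[tuple(tokens[:i])][tok]
        return p

    def greedy():
        prefix = []
        while not prefix or prefix[-1] != "<eos>":
            dist = COND[tuple(prefix)]
            prefix.append(max(dist, key=dist.get))  # locally best token only
        return prefix

    # Exhaustive enumeration stands in for beam search (the space is tiny).
    candidates = [["so", "<eos>"], ["so", "anyway", "<eos>"],
                  ["why", "<eos>"], ["why", "not", "<eos>"]]
    best = max(candidates, key=seq_prob)

    print("greedy:", greedy(), seq_prob(greedy()))  # ['so','anyway','<eos>'], p = 0.33
    print("best:  ", best, seq_prob(best))          # ['why','not','<eos>'],  p = 0.36

The point being: even if "why" is locally less probable than "so" at the first step, the completion starting with "why" can be more probable overall, which is what a search over whole sequences (beam search, in practice) is trying to find.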
Also, the pretrained LLM (the one trained to predict the next token of raw text) is not the one that most people use.
A lot of clever LLM post-training seems to steer the model towards becoming an excellent improv artist, which can lead to “surprise” if prompted well.
"Why did the sun climb a tree?"
Claude Opus 4.1:
- To get to a higher branch of astronomy
- Because it wanted to reach new heights
- To see the dawn of a new day from a better view
ChatGPT 5 Thinking:
After thinking for 26 seconds:
- To check on its solar panels—the leaves.
With more thorough prompting:
> Complete the following joke. Think carefully and make it really funny! Think like a great comedian and find that perfect balance of simple, short, surprising, relevant, but most of all funny. Don’t use punchlines that are irrelevant, non sequiturs, or which could be applied to any other setup. Make something funny just for this one setup! Here goes: Why did the sun climb a tree?
Claude Opus 4.1:
“To finally get some shade”
GPT-5:
“To demand photon credit from the leaves”
...can anyone come up with a legitimately funny punchline for "Why did the sun climb a tree?" I feel like I need a human-authored comparison. (With all due respect to OP's daughter, "to get to the sky" isn't cutting it.)
I'm not entirely sure that a good response exists. I thought GPT-5's "to demand photon credit from the leaves” was very mildly funny, maybe that's the best that can be done?
I got much better answers with this prompt: “ Jokes are funny precisely because they play on knowledge on two poles: (i) at first listen, they’re surprising, and (ii) upon review, they’re obvious.
Let’s think through many many options to answer this joke that only focus on surprising the listener in section 1. And in section 2 we’ll focus on finding/filtering for the ones that are obvious in hindsight.
“Why did the sun climb a tree?”
In this case, let’s note that the sun doesn’t climb anything, so there’s two meanings at play here: one is that the sun’s light seems to climb up the tree, and the other is an anthropomorphization of the sun climbing the tree like an animal. So, to be funny, the joke should play on the second meaning as a surprise, but have the first meaning as answer with an obviousness to it. Or vice versa.”
Here are a couple of decent ones:
- to leaf the ground behind
- because it heard the leaves were throwing shade
Person 1: "Why did the sun climb a tree?"
Person 2: "I dunno, why?"
P1: "It was being chased by a tiger."
P2: "But tigers can climb trees?"
P1: "Well, it's not very bright."
https://chatgpt.com/share/68a209d3-ef34-8011-8f60-1a256f6038...
I'm going to say "Because it wanted a higher noon." was probably its best one of that set... though I'll also note that I didn't prompt for the joke directly; I prompted for background on "climbing" as related to the sun.
I believe the problem with the joke is that it isn't one that can be funny. Why is a raven like a writing desk?
Personally, I didn't find that the incongruity model of humor produced anything funny here, and the joke itself is very difficult to apply to other potentially funny approaches.
Also on AI and humor... https://archive.org/details/societyofmind00marv/page/278/mod...
In another "ok, incongruity isn't funny - try puns" approach... https://chatgpt.com/share/68a20eba-b7c0-8011-8644-a7fceacc5d... I suspect a variant of "It couldn't stand being grounded" is probably the one that made me chuckle the most in this exploration.
The answer to “why is a raven like a writing desk” is generally considered to be: “Poe wrote on both”, which is witty at least, if not laugh out loud funny.
According to the incongruity model, the humor response is triggered by awareness of conflicting interpretations of the narrative. In jokes, this triggering usually hinges on some linguistic ambiguity.
To leverage incongruity, a funny punchline for "Why did the sun climb the tree?" would rely on an unexpected interpretation of the question or a double meaning in the answer.
Well, if spoken and not spelled, you could use the homophone of sun, son, in a whole range of responses - "he was hiding from his mom" - well, it's not funny but at least it's a joke now.
It tried to be bold, but the mountain was cold.
The rocket was cruel and demanded more fuel.
A tree wished to grow, but alas, too slow; in exchange for a tan, the sun gave what it can.
The sun reached its goal — with its new friend, coal.
"Why did the sun climb a tree?" is a crazy thing for a naked old man to yell at you at 4am while he runs full sprint at your apartment door. But that's just Chicago for you.
> I'm not entirely sure that a good response exists.
Yeah I think you're right. Good jokes are a tiny subset of all questions. It's unreasonable to expect LLMs to do the impossible.
A better test would be to get a collection of novel jokes that do have funny punchlines.
I tried the Edinburgh Fringe 2025 (very recent) best jokes on ChatGPT 5:
https://www.theguardian.com/stage/2025/aug/11/bad-dates-and-...
---- Complete the following short jokes. The result should be very funny, as you are a professional comedian.
1. "We named our children War and Peace – it’s a ..."
2. "At my lowest, I was kicked out of the museum for being inappropriate with Michelangelo’s David. I’d ..."
3. "I just got a personal trainer. She’s horrible to me but ..."
4. "Everyone is worried about AI. I’m more concerned with ..."
5. "People who say bath bombs are relaxing have clearly never ..." 6.
"I love getting Latin chat-up lines. ..."
7. "This spider has been in my house so long, it should pay half the wifi. ..."
8. "I went on a date with a matador but ..."
9. "America is like my ex-boyfriend. ..."
10. "I’m not nostalgic but I ..."
Think hard. Do not search the web. -----
Results with comments:
"We named our children War and Peace – it’s a novel way to start an argument." - eh, it's an attempt I guess.
"At my lowest, I was kicked out of the museum for being inappropriate with Michelangelo’s David. I’d officially reached rock bottom. Marble, technically." - very close! Poor delivery though.
"I just got a personal trainer. She’s horrible to me but apparently shame burns more calories than spin." - Terrible.
"Everyone is worried about AI. I’m more concerned with my browser history becoming self-aware." - Terrible.
"People who say bath bombs are relaxing have clearly never cleaned glitter out of places with no names." - Terrible.
"I love getting Latin chat-up lines. Nothing says romance like a perfect ablative absolute." - Well...
"This spider has been in my house so long, it should pay half the wifi. It keeps insisting its web is faster." - Well it figured out there was going to be a "web" pun at least. Doesn't really make sense though.
"I went on a date with a matador but it was nothing but red flags." - Perfect! Probably the most obvious joke here, but still!
"America is like my ex-boyfriend. Keeps texting “u up?” at 3 a.m." - What? I feel like there could have been a joke in there somewhere about being up at weird times here but it didn't find it.
"I’m not nostalgic but I still blow in the cartridge before trying again." - Not even a joke.
So yeah, I'll give it 2 out of 10. I think most humans would get less. Let's try again next year!
Hmm, this is a really tough one though. How about, depending on how the person responds, you come back with "well that's not what your mom said last night!"
The system prompt for GPT has extra dedicated instructions for things like riddles, because users use little things like this to test intelligence and judge an entire model. GPT may be sort of walking on eggshells when it hits questions like this.
Some human attempts: "Why did the sun climb a tree?" "Because it was chased by the Great Bear."
"Why did The Sun climb a tree?" "To spy on The Royal Family having picnic."
That's true. You would think an LLM would condition on the joke context and make the surprising completion more probable. I guess this only gets good when the model really is good. It's similar to how GPT-4.5 has better humor.
Good completely new jokes are like novel ideas: really hard even for humans. I mean fuck, we have an entire profession dedicated just to making up and telling them, and even theirs don't land half the time.
Exactly. It feels like with LLMs as soon as we achieved the at-the-time astounding breakthrough "LLMs can generate coherent stories" with GPT-2, people have constantly been like "yeah? Well it can't do <this thing that is really hard even for competent humans>.".
That breakthrough was only 6 years ago!
https://openai.com/index/better-language-models/
> We’ve trained a large-scale unsupervised language model which generates coherent paragraphs of text...
That was big news. I guess this is because it's quite hard for most people to appreciate the enormous difficulty gulf between "generate a coherent paragraph" and "create a novel funny joke".
Same thing we saw with game playing:
- It can play chess -> but not at a serious level
- It can beat most people -> but not grandmasters
- It can beat grandmasters -> but it can’t play go
…etc, etc
In a way I guess it’s good that there is always some reason the current version isn’t “really” impressive, as it drives innovation.
But as someone more interested in a holistic understanding of the world than in proving any particular point, it is frustrating to see the goalposts moved without even acknowledging how much work and progress went into meeting the goalposts at their previous location.
> it is frustrating to see the goalposts moved without even acknowledging how much work and progress were involved in meeting the goalposts at their previous location.
Half the HN front page for the past years has been nothing but acknowledging the progress of LLMs in sundry ways. I wish we actually stopped for a second. It’s all people seem to want to talk about anymore.
I should have been more clear. Let me rephrase: among those who dismiss the latest innovations as nothing special because there is still further to go, it would be nice to see some acknowledgment when goalposts are moved.
Maybe the people raving about LLM progress are the same people holding them to those high standards?
I don’t see what’s inconsistent about it. “Due to this latest amazing algorithm, the robots keep scoring goals. What do we do? Let’s move them back a bit!” Seems like a normal way of thinking to me…
I see people fawn over technical progress every day. What are they supposed to do, stop updating their expectations and never expect any more progress?
It could of course be that there are people who “never give it up for the robots”. Or maybe they do, and they did, and they have so fully embraced the brave new world that they’re talking about what’s next.
I mean, when I sit in a train I don’t spend half the ride saying “oh my god this is incredible, big thanks to whoever invented the wheel. So smooth!”
Even though maybe I should :)
> I mean, when I sit in a train I don’t spend half the ride saying “oh my god this is incredible, big thanks to whoever invented the wheel. So smooth!”
Two thoughts:
- In that context, neither do you expect people to be invested in why the train is nothing special, it’s basically a horse cart, etc, etc
- And maybe here’s where I’m weird: I often am overcome by the miracle of thousands of tons of metal hurtling along at 50 - 200mph, reliably, smoothly enough to work or eat, many thousands of times a day, for pennies per person per mile. I mean, I’ll get sucked in to how the latches to release the emergency windows were designed and manufactured at scale despite almost none of them ever being used. But maybe that’s just me.