How outdated information hides in LLM token generation probabilities

blog.anj.ai

109 points by anjsimmo 3 days ago


0xKelsey - 3 days ago

> The scenario that I’m worried about, and that is playing out right now, is that they get good enough that we (or our leaders) become overconfident in their abilities and start integrating them into applications that they just aren’t ready for without a proper understanding of their limitations.

Very true.

ascorbic - 15 hours ago

The o1 example is interesting. In the CoT summary it acknowledges that the most recent official information is 1611m, but it then chooses to say 1622m because that figure is more commonly cited. It's like it overthinks itself into the wrong answer.

antirez - 5 hours ago

Prompting Claude to show the ambiguity:

Tell me the height of Mountain Bartle Frere. Please don't output any long text, also don't output a single height if you saw multiple heights around. Give me a list of potential heights cited around.

LLM:

Mount Bartle Frere in Queensland, Australia has commonly cited heights of:

1,622 meters (5,322 feet)

1,611 meters (5,285 feet)

Since this is quite specific geographic information that may appear in only a few sources, I should note that I may hallucinate details - you should verify these numbers.
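For reproducibility, here is a minimal sketch of how that prompt could be sent via the Anthropic Python SDK; the model name and token limit are assumptions, not taken from the comment:

```python
# Minimal sketch: send the ambiguity-surfacing prompt quoted above to Claude
# via the Anthropic Python SDK. Model name and max_tokens are assumptions;
# adjust to whatever model and limits you actually use.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

prompt = (
    "Tell me the height of Mountain Bartle Frere. Please don't output any "
    "long text, also don't output a single height if you saw multiple "
    "heights around. Give me a list of potential heights cited around."
)

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # assumed model name
    max_tokens=300,
    messages=[{"role": "user", "content": prompt}],
)

# The reply text lives in the first content block of the message.
print(response.content[0].text)
```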

Terr_ - 13 hours ago

> Welcome to the era of generative AI, where a mountain can have multiple heights, but also only one height, and the balance of my bank account gets to determine which one that is. All invisible to the end user and then rationalised away as a coincidence.

I've always found the idea of untraceable, unfixable, unpredictable bugs in software... Offensive. Dirty. Unprofessional.

So the last couple of years have been disconcerting, as a non-trivial portion of people who I thought felt similarly started to overlook it in LLMs, while also integrating those LLMs into flows where the bad output can't even be detected.

Workaccount2 - 7 hours ago

How would we rule out the possibility that the model noticed the difference was small and simply put less weight into determining which figure is true?

I get the author's point, but I would have liked to see an example with a more egregious error.

firtoz - 12 hours ago

Looking towards the future, we will need to move away from "tokens are characters to print". We're starting to approach this with "tool calls", but I believe an architectural shift will become necessary.

We do have some understanding of what kind of concept we want to emit next, e.g.

```
[The height:property name] of [Mount Bartle Frere:proper noun, describing an object to get a property out of], [in metres:attributes], is [?: retrieve value | (Mount Bartle Frere).("height", "metres")].
```
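One way to read that sketch is as a tool call: instead of sampling the digits, the model emits a structured retrieval request and the runtime fills in the value. A hypothetical illustration in Python, where the knowledge base and the resolve_property helper are made up for the example:

```python
# Hypothetical illustration of the "retrieve value" step above: the model
# emits a structured request instead of digit tokens, and the runtime
# resolves it against an authoritative source. The knowledge base and
# resolve_property are assumptions for illustration only.
KNOWLEDGE_BASE = {
    ("Mount Bartle Frere", "height", "metres"): 1611,  # most recent official figure
}

def resolve_property(entity: str, prop: str, unit: str) -> str:
    """Resolve an (entity, property, unit) request against a trusted source."""
    try:
        return str(KNOWLEDGE_BASE[(entity, prop, unit)])
    except KeyError:
        return "[unknown]"  # better to admit ignorance than to sample digits

# What the model would emit: text interleaved with a retrieval placeholder.
emitted = [
    ("text", "The height of Mount Bartle Frere, in metres, is "),
    ("retrieve", ("Mount Bartle Frere", "height", "metres")),
    ("text", "."),
]

rendered = "".join(
    chunk if kind == "text" else resolve_property(*chunk)
    for kind, chunk in emitted
)
print(rendered)  # -> The height of Mount Bartle Frere, in metres, is 1611.
```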

scarface_74 - 8 hours ago

This problem is simple to solve for most real-world use cases: don't trust any facts from an LLM; use your own trusted source of information and RAG, which will give you citations.

https://chatgpt.com/share/6783df4c-904c-8010-a4b5-7301faea3b...

https://chatgpt.com/share/6783e0b8-ce78-8010-9177-d95eb77eac...
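A minimal sketch of that retrieve-then-cite flow, assuming a hypothetical search_docs helper over your own document store and an OpenAI-style chat call; none of this is taken from the linked transcripts:

```python
# Hypothetical retrieve-then-cite sketch: look facts up in your own trusted
# store, pass the retrieved passages to the model, and require citations.
# search_docs, the document contents, and the model name are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def search_docs(query: str) -> list[dict]:
    """Placeholder for your own retrieval layer (vector DB, search index, ...)."""
    return [
        {"id": "survey-2023", "text": "Mount Bartle Frere was resurveyed at 1611 m."},
    ]

question = "What is the height of Mount Bartle Frere in metres?"
passages = search_docs(question)
context = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    messages=[
        {"role": "system",
         "content": "Answer only from the provided sources and cite their ids. "
                    "If the sources don't contain the answer, say so."},
        {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {question}"},
    ],
)
print(completion.choices[0].message.content)
```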

I use NotebookLM for most of my real world work these days with my project documentation.

Our company standard is GSuite and NotebookLM is specifically allowed.

throwawaymaths - 6 hours ago

Kind of crazy that models moving forward don't just strip all multi-numeral tokens. It would be great for LLM providers, too, since tokens consumed would go up.
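To see what this is getting at, here is a small sketch using OpenAI's tiktoken library; the exact splits depend on the vocabulary, so treat the output as illustrative:

```python
# Sketch: inspect how a number tokenizes. With BPE vocabularies that contain
# multi-digit tokens, a figure like 1622 can split into chunks such as
# "162" + "2" rather than one token per digit, which is what the comment
# above proposes stripping. Exact splits depend on the encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["1611", "1622", "1,622 metres"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r} -> {pieces}")

# A digit-level scheme would instead always emit ["1", "6", "2", "2"], at the
# cost of more tokens per number (hence the remark about consumption going up).
```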

stereobit - 11 hours ago

Is there no concept like PageRank that biases certain inputs to have higher impact during training, based on recency and popularity?
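A hypothetical sketch of what per-document weighting by recency and popularity could look like in a training loop; the weighting formula, field names, and half-life are illustrative assumptions, not a description of how any actual model is trained:

```python
# Hypothetical sketch: scale each document's contribution to the training
# loss by a recency/popularity weight. All formulas and values below are
# made up for illustration.
import math

def document_weight(age_days: float, popularity: float, half_life_days: float = 365.0) -> float:
    """Newer and more popular documents get a larger weight."""
    recency = 0.5 ** (age_days / half_life_days)  # exponential decay with age
    return recency * math.log1p(popularity)       # damped popularity boost

def weighted_loss(per_example_loss: list[float], docs: list[dict]) -> float:
    """Weighted mean of per-example losses."""
    weights = [document_weight(d["age_days"], d["popularity"]) for d in docs]
    total = sum(w * l for w, l in zip(weights, per_example_loss))
    return total / sum(weights)

docs = [
    {"age_days": 30.0, "popularity": 500.0},     # recent, moderately popular source
    {"age_days": 4000.0, "popularity": 5000.0},  # old but heavily mirrored source
]
print(weighted_loss([2.1, 1.9], docs))
```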

tucnak - 12 hours ago

I don't get why people demo CoT reasoning with o1 when there are models like Gemini 2.0 Thinking that will usually solve the same tasks and happily produce the full output.