Should LLMs just treat text content as an image?

seangoedecke.com

63 points by ingve 6 days ago


rebeccaskinner - 3 minutes ago

Although this isn’t directly related to the idea in the article, I’m reminded that one of the most effective hacks I’ve found for working with ChatGPT has been to attach screenshots of files rather than the files themselves. I’ve noticed the model will almost always pay attention to an image and pull relevant data out of it, but it takes a lot of detailed prompting to get it to reliably pay attention to text and PDF attachments instead of just hallucinating their contents.

wongarsu - 3 hours ago

Look long enough at literature on any machine learning task, and someone invariably gets the idea to turn the data into an image and do machine learning on that. Sometimes it works out (turning binaries into images and doing malware detection with a CNN surprisingly works), usually it doesn't. Just like in this example the images usually end up as a kludge to fix some deficiency in the prevalent input encoding.

I can certainly believe that images bring certain advantages over text for LLMs: the image representation does contain useful information that we as humans use (like better information hierarchies encoded in text size, boldness, color, saturation and position, not just n levels of markdown headings), letter shapes are already optimized for this kind of encoding, and continuous tokens seem to bring some advantages over discrete ones. But none of these advantages need the roundtrip via images; they merely point to how crude the state of the art of text tokenization is.
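
One way to see how crude discrete tokenization is: with a greedy longest-match subword tokenizer (a toy sketch below, with a made-up vocabulary, not any real model's tokenizer), a single transposition shatters a word into unrelated pieces, while the rendered image of the word would barely change.

```python
# Toy greedy longest-match subword tokenizer over a hypothetical
# vocabulary, illustrating how brittle discrete tokenization can be:
# transposing two letters changes the entire token sequence.
VOCAB = {"token", "tok", "ization", "tion",
         "t", "o", "k", "e", "n", "i", "z", "a"}

def tokenize(word: str) -> list[str]:
    """Greedily match the longest vocabulary piece at each position."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest piece first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])          # unknown character: emit as-is
            i += 1
    return tokens

print(tokenize("tokenization"))   # → ['token', 'ization']
print(tokenize("tokeniaztion"))   # → ['token', 'i', 'a', 'z', 'tion']
```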

bonsai_spool - 3 hours ago

This doesn’t cite the very significant example of DeepVariant (and, as of 10/16/25, DeepSomatic), which converts genomic data to images in order to find DNA mutations. This has been done since the late 2010s.

https://google.github.io/deepvariant/posts/2020-02-20-lookin...

LysPJ - 3 hours ago

Andrej Karpathy made an interesting comment on the same paper: https://x.com/karpathy/status/1980397031542989305

sojuz151 - 37 minutes ago

If text rendering + image input amounts to a better tokeniser, that means current tokenisers are bad and something better is needed.

mohsen1 - 3 hours ago

My understanding is that text tokens are too rigid. The way we read is not to process each character (a token, for LLMs) one by one, but to see a word, or sometimes a collection of familiar words, and make sense of the writing. The concept we take away from written text is what we really read, not the exact letters or words. This is why we can easily read text with typos: it's just similar enough. By letting LLMs not get too hung up on exact tokens and instead "skim" through text, we could make them more efficient, just like how humans read efficiently.
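
The typo-robustness of human reading can be sketched with a toy model of whole-word recognition (this is an illustrative assumption, not a claim about how LLMs or brains actually work): key each word by its first letter, last letter, and the multiset of letters in between, and scrambled interiors still resolve to the right word.

```python
# Toy whole-word recognition: words with scrambled interiors still map
# to the intended vocabulary entry if first letter, last letter, and
# the multiset of middle letters are preserved.
def signature(word: str) -> tuple:
    return (word[0], word[-1], tuple(sorted(word[1:-1])))

vocab = ["reading", "research", "letters", "understand"]
lookup = {signature(w): w for w in vocab}

for typo in ["raednig", "rseaerch", "lettres", "unredstnad"]:
    print(typo, "->", lookup.get(signature(typo)))
# each scrambled word resolves to its unscrambled vocabulary entry
```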

mannykannot - an hour ago

Language was spoken long before it was written (or so it seems). This article almost suggests that sound might be a superior input medium to either digital text or images.

skywhopper - 41 minutes ago

There’s some poor logic in this write-up. Yes, images can contain more information than words, but the extra information an image of a word conveys is usually not relevant to the intent of the communication, at least not for the purposes assumed here. I.e., pre-rendering the text you would have typed into ChatGPT and uploading it as an image instead will not better convey the meaning and intent behind your words.

If it gives better results (and no evidence of that is presented), that would be interesting, but it wouldn’t be because of the larger data size of the uploaded image versus the text.

aitchnyu - 2 hours ago

The amount of video/imagery needed to make a million tokens versus the amount of text to do the same is a surprisingly low ratio. Did they have the same intuition?

onesandofgrain - 4 hours ago

A picture is worth a thousand words

leemcalilly - 2 hours ago

and reading (aka “OCR”) is the fastest way for the brain to process language.

ToJans - 3 hours ago

A series of tokens is one-dimensional (a sequence); an image is two-dimensional. What about 3D/4D/... representations (until we end up with an LLM-dimensional solution, of course)?
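
The dimensionality point is really about neighborhood structure, not extra data, as a minimal sketch shows: the same 16 "tokens" in a 1-D sequence versus a 2-D grid hold identical values, but each position gains vertical neighbors in 2-D.

```python
# Same 16 "tokens" as a 1-D sequence and as a 4x4 grid: identical data,
# different neighborhood structure. In 1-D, token 5 neighbors only
# 4 and 6; in the grid it also neighbors 1 (above) and 9 (below).
seq = list(range(16))
grid = [seq[i:i + 4] for i in range(0, 16, 4)]
print(grid[1])   # → [4, 5, 6, 7]
```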

themoxon - 3 hours ago

There's a new paper from ICCV which basically tries to render every modality as images: https://openaccess.thecvf.com/content/ICCV2025/papers/Hudson...

vindex10 - 3 hours ago

reminds me of the difference between fasttext and word2vec.

fasttext can handle words it hasn't seen before by combining character n-grams; word2vec can learn the meaning of whole words better, but misses out on unknown words.

image tokens are the "text2vec" here, while text tokens are a proxy for building a text embedding even of previously unseen texts.
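
For context, fastText's subword trick is to wrap a word in boundary markers and split it into character n-grams (n = 3..6 by default), so an unseen word still shares subword units with known ones. A minimal sketch of that decomposition:

```python
# fastText-style character n-grams: the word is wrapped in "<" and ">"
# boundary markers and split into all n-grams for n in [nmin, nmax].
def char_ngrams(word: str, nmin: int = 3, nmax: int = 6) -> list[str]:
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(nmin, nmax + 1)
            for i in range(len(w) - n + 1)]

print(char_ngrams("where", nmax=3))   # → ['<wh', 'whe', 'her', 'ere', 're>']
```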

ghoul2 - 3 hours ago

But doesn't this miss the "context" that the embeddings of the text tokens carry? An LLM's embedding of a text token contains a compressed version of the entire set of tokens that came before it in the context, while the image embeddings are just representations of pixel values.

Sort of at the level of word2vec, where the representation of "flies" in "fruit flies like a banana" vs "time flies like an arrow" would be the same.
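
That context-blindness can be made concrete with a toy static lookup (the vectors below are made up for illustration): the sentence is accepted but never used, so "flies" gets the same vector in both readings.

```python
# Toy static (word2vec-style) embedding table with made-up vectors.
static = {"flies": [0.2, 0.9], "time": [0.5, 0.1], "fruit": [0.7, 0.3]}

def embed(word: str, sentence: str) -> list[float]:
    # `sentence` is accepted but ignored: a static embedding
    # cannot condition on surrounding context.
    return static.get(word, [0.0, 0.0])

v1 = embed("flies", "fruit flies like a banana")
v2 = embed("flies", "time flies like an arrow")
print(v1 == v2)   # → True
```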

DonHopkins - an hour ago

Wouldn’t it be ironic if Comic Sans turned out to be the most efficient font for LLM OCR understanding?

qiine - 2 hours ago

or maybe 3D objects, since that’s closer to what real life is and what the brain shaped itself around?

Havoc - 3 hours ago

Seems wildly counterintuitive to me, frankly.

Even if it's true, though, I'm not sure what we’d do with it. The bulk of knowledge available on the internet is text. Aside from maybe YouTube, so I guess it could work for world-model type things? Understanding physical interactions of objects, etc.

pcwelder - 3 hours ago

I ϲаn guаrаntее thаt thе ОСR ϲаn't rеаd thіs sеntеnсе ϲоrrесtlу.
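
The sentence above swaps Latin letters for visually identical lookalikes (homoglyphs), which text tokenizers see as entirely different codepoints. A small sketch of how to expose the trick with the standard `unicodedata` module (the sample string is my own assumed example, using Cyrillic а/е in "guarantee"):

```python
import unicodedata

# Flag every non-ASCII character in a string and report its official
# Unicode name, revealing homoglyph substitutions.
def find_homoglyphs(text: str) -> list[tuple[str, str]]:
    return [(ch, unicodedata.name(ch))
            for ch in text if ord(ch) > 127]

sample = "gu\u0430rant\u0435e"   # renders like "guarantee"
for ch, name in find_homoglyphs(sample):
    print(repr(ch), name)
# → 'а' CYRILLIC SMALL LETTER A
# → 'е' CYRILLIC SMALL LETTER IE
```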

metalliqaz - 2 hours ago

Future headline: "The unreasonable effectiveness of text encoding"

nacozarina - 3 hours ago

the enshittifiers simply haven’t yet weighed image-processing fees against potential token charges; once they have, your cost advantage goes bye-bye