Don't bother parsing: Just use images for RAG

morphik.ai

316 points by Adityav369 a day ago


serjester - a day ago

There are multiple fundamental problems people need to be aware of.

- LLMs are typically pre-trained on 4k text tokens and then extrapolated out to longer context windows (it's easy to go from 4,000 text tokens to 4,001). This is not possible with images due to how they're tokenized. As a result, you're out of distribution - hallucinations become a huge problem once you're dealing with more than a couple of images.

- PDFs at 1536 × 2048 use 3 to 5x more tokens than the raw text (i.e. higher inference costs and slower responses); see the rough estimate sketched after this list. Going to lower resolutions results in blurry images.

- Images are also inherently a much heavier representation in raw size; you add latency to every request just to download all the needed images.
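To make the token-count point concrete, here is a rough estimator assuming OpenAI-style image tiling (scale to fit 2048×2048, then shortest side to 768 px, then 170 tokens per 512 px tile plus an 85-token base); other providers tokenize images differently, so treat it as a sketch rather than an exact accounting.

```python
import math

def image_tokens(width: int, height: int) -> int:
    """Rough high-detail image token estimate under OpenAI-style tiling."""
    # Scale down to fit within a 2048 x 2048 box
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Scale down so the shortest side is at most 768 px
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # 170 tokens per 512 px tile, plus an 85-token base charge
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return tiles * 170 + 85

# One 1536 x 2048 page render costs ~765 tokens before any answer is generated;
# the same page as extracted text is often only a few hundred tokens.
print(image_tokens(1536, 2048))  # 765
```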

On their very small benchmark of finance docs heavy with charts and tables, this approach is obviously going to outperform basic text chunking. I would be far more interested in seeing an OCR step added with Gemini (which can annotate images) and then comparing results.

An end to end image approach makes sense in certain cases (like patents, architecture diagrams, etc) but it's a last resort.

pilooch - a day ago

Some colleagues and I implemented exactly this six months ago for a French gov agency.

It's open source and available here: https://github.com/jolibrain/colette

It's not our primary business, so it's just lying there and we don't advertise it much, but it works, with some tweaks needed to get it really efficient.

The true genius, though, is that the whole thing can be made fully differentiable, unlocking the ability to fine-tune the visual RAG on targeted datasets.

The layout model can also be customized for fine grained document understanding.

themanmaran - a day ago

Hey we've done a lot of research on this side [1] (OCR vs direct image + general LLM benchmarking).

The biggest problem with direct image extraction is multipage documents. We found that single-page extraction (OCR=>LLM vs Image=>LLM) slightly favored direct image extraction. But anything beyond 5 images had a sharp fall-off in accuracy compared to OCR first.

Which makes sense: long-context recall over text is already a hard problem, but that's what LLMs are optimized for. Long-context recall over images is still pretty bad.

[1] https://getomni.ai/blog/ocr-benchmark

freezed8 - 4 hours ago

This blog post makes some good points about using vision models for retrieval, but I do want to call out a few problems:

1. The blog conflates indexing/retrieval with document parsing. Document parsing itself is the task of converting a document into a structured text representation, whether that's markdown/JSON (or, in the case of extraction, an output that conforms to a schema). It has many uses, one of which is RAG, but many of which are not necessarily RAG related.

ColPali is great for retrieval, but you can't use ColPali (at least natively) for pure document parsing tasks. There are a lot of separate benchmarks for evaluating just doc parsing, while the author mostly talks about visual retrieval benchmarks.

2. This whole idea of "you can DIY document parsing by screenshotting a page" is not new at all; lots of people have been talking about it! It's certainly fine as a baseline and does work better than standard OCR in many cases.

a. But from our experience there's still a long tail of accuracy issues.

b. It's missing metadata like confidence scores/bounding boxes etc. out of the box.

c. Honestly, this is underrated, but building a good screenshotting pipeline is itself non-trivial.

3. In general for retrieval, it's helpful to have both text and image representations. Image tokens are obviously much more powerful. Text tokens are way cheaper to store and let you do things like retrieve entire documents (instead of just chunks) and feed them into the LLM.

(disclaimer: I am the CEO of LlamaIndex, and we have worked on both document parsing and retrieval with LlamaCloud, but I hope my point stands in a general sense)

thor-rodrigues - a day ago

I spent a good amount of time last year working on a system to analyse patent documents.

Patents are difficult as they can include anything from abstract diagrams, chemical formulas, to mathematical equations, so it tends to be really tricky to prepare the data in a way that later can be used by an LLM.

The simplest approach I found was to “take a picture” of each page of the document and ask an LLM to generate JSON explaining the content (plus some other metadata such as page number, number of visual elements, and so on).

If any complicated image is present, simply ask the model to describe it. Once that is done, you have a JSON file that can be embedded into your vector store of choice.
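A minimal sketch of that flow, assuming PyMuPDF for page rendering and an OpenAI-style vision model; the prompt, model name, and JSON fields are illustrative placeholders rather than the commenter's actual pipeline.

```python
import base64
import json

import fitz  # PyMuPDF
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Describe this page as JSON with keys: page_summary, visual_elements "
    "(descriptions of any diagrams, formulas, or figures), and full_text."
)

def page_to_record(pdf_path: str, page_number: int) -> dict:
    # Render one page to a PNG "picture"
    doc = fitz.open(pdf_path)
    png = doc[page_number].get_pixmap(dpi=150).tobytes("png")
    b64 = base64.b64encode(png).decode()

    # Ask the vision model to explain the page as structured JSON
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    record = json.loads(resp.choices[0].message.content)
    record["page_number"] = page_number
    return record  # embed record["full_text"] / ["page_summary"] into your vector store
```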

I can’t speak to the price-to-performance ratio, but this approach seems easier and more efficient than what the author is proposing.

ashishb - a day ago

I speak from experience that this is a bad idea.

There are cases where documents contain text with letters that look the same in many fonts. For example, 0 and O look identical in many fonts. So if you have a doc/xls/PDF/HTML file, you lose information by converting it into an image.

For cases like serial numbers, not even humans can distinguish 0 vs O (or l vs I) by looking at them.

hdjrudni - 20 hours ago

I was trying to copy a schedule into Gemini to ask it some questions about it. I struggled with copying and pasting it for several minutes, just wouldn't come out right even though it was already in HTML. Finally gave up, screenshotted it, and then put black boxes over the parts I wanted Gemini to ignore (irrelevant info) and pasted that image in. It worked very well.

emanuer - a day ago

Could someone please help me understand how a multi-modal RAG does not already solve this issue?[1]

What am I missing?

Flash 2.5, Sonnet 3.7, etc. always provided me with very satisfactory image analysis. And, I might be making this up, but to me it feels like some models provide better responses when I give them the text as an image, instead of feeding "just" the text.

[1] https://www.youtube.com/watch?v=p7yRLIj9IyQ

budududuroiu - 15 hours ago

I get that ColPali is straightforward and powerful, but document processing still has many advantages:

- lexical retrieval (based on BM25, TF-IDF), which is better at capturing specific terms

- full-text search
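As a small illustration of the first point, here is lexical retrieval over extracted text with the rank_bm25 package; the corpus and query are made up, and this is exactly the kind of exact-term matching a pure image index gives up.

```python
from rank_bm25 import BM25Okapi

# Chunks of extracted text -- unavailable if pages are stored only as images
corpus = [
    "Q3 revenue grew 18% driven by government contracts",
    "Part number XJ-4420-B replaces the discontinued XJ-4400",
    "Employee stock compensation expense policy",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

# Exact-term queries like part numbers are where lexical retrieval shines
query = "xj-4420-b".split()
print(bm25.get_scores(query))               # per-document relevance scores
print(bm25.get_top_n(query, corpus, n=1))   # ['Part number XJ-4420-B ...']
```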

tobyhinloopen - a day ago

This is something I've done as well - I wanted to scan all invoices that came into my mail, so I just exported ALL ATTACHMENTS from my mailbox and used a script to upload them one by one, forcing a tool call to extract "is invoice: yes / no" plus a bunch of fields: invoice lines, company name, date, invoice number, etc.

It had a surprisingly high hit rate. It took over 3 hours of LLM calls, but who cares - it was completely hands-off. I then compared the invoices to my bank statements (aka I asked an LLM to do it) and it just missed a few invoices that weren't included as attachments (like those "click to download" mails). It did a pretty poor job matching invoices to bank statements (like "oh this invoice is a few dollars off but I'm sure it's this statement"), so I'm afraid I still need an accountant for a while.
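A sketch of the forced-tool-call step using the Anthropic SDK; the record_invoice schema, model name, and fields are illustrative guesses at this kind of script, not the commenter's actual code.

```python
import base64

import anthropic

client = anthropic.Anthropic()

invoice_tool = {
    "name": "record_invoice",
    "description": "Record whether the attachment is an invoice and its key fields.",
    "input_schema": {
        "type": "object",
        "properties": {
            "is_invoice": {"type": "boolean"},
            "company_name": {"type": "string"},
            "invoice_number": {"type": "string"},
            "date": {"type": "string"},
            "total": {"type": "number"},
        },
        "required": ["is_invoice"],
    },
}

def extract_invoice(pdf_path: str) -> dict:
    data = base64.b64encode(open(pdf_path, "rb").read()).decode()
    resp = client.messages.create(
        model="claude-3-7-sonnet-latest",
        max_tokens=1024,
        tools=[invoice_tool],
        tool_choice={"type": "tool", "name": "record_invoice"},  # force the tool call
        messages=[{
            "role": "user",
            "content": [
                {"type": "document",
                 "source": {"type": "base64",
                            "media_type": "application/pdf",
                            "data": data}},
                {"type": "text", "text": "Extract invoice details from this attachment."},
            ],
        }],
    )
    return resp.content[0].input  # the structured fields from the forced tool call
```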

"What did it cost"? I don't know. I used a cheap-ish model, Claude 3.7 I think.

jamesblonde - a day ago

"The results transformed our system, and our query latency went from 3-4s to 30ms."

Ignoring the trade-offs introduced, the MUVERA paper presented a 90% drop in latency, with evidence in the form of a research paper. Yet you are reporting "99%" drops in latency. Big claims require big evidence.
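For reference, the arithmetic that turns the quoted latencies into a percentage:

```python
# The post quotes 3-4 s before and 30 ms after
for before_s in (3.0, 4.0):
    reduction = 1 - 0.030 / before_s
    print(f"{before_s:.0f} s -> 30 ms is a {reduction:.1%} reduction")  # ~99% either way
```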

urbandw311er - a day ago

Something just feels a bit off about this piece. It seems to labour the point about how “beautiful” or “perfect” their solution is a few times too many, to the point where it starts to feel more like marketing than any sort of useful technical observation.

tom_m - 20 hours ago

Is the text flattened? If not, you don't need to run PDFs through OCR. The text can be extracted, even with JavaScript in the web browser. You only need OCR for handwritten text or flattened text. Google's document parsing service can help as well. You could also run significantly cheaper tools on the PDF first; just sending everything to the LLM is more costly. And what about massive PDFs? They sometimes won't fit in the context window, or will cost a lot.
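For example, when the text layer is intact, extraction takes a few lines and needs no OCR or LLM at all; a sketch using PyMuPDF (pdfplumber, or pdf.js in the browser, work similarly):

```python
import fitz  # PyMuPDF

def extract_text(pdf_path: str) -> list[str]:
    """Pull the embedded text layer page by page; no OCR, no LLM call."""
    with fitz.open(pdf_path) as doc:
        return [page.get_text("text") for page in doc]

# Pages that come back empty are the ones that are actually scanned or
# flattened and genuinely need OCR or a vision model.
```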

LLMs are great, but use the right tool for the job.

meander_water - 21 hours ago

> You might still need to convert a document to text or a structured format, that’s essential for syncing information into structured databases or data lakes. In those cases, OCR works (with its quirks), but in my experience passing the original document to an LLM is better

Has anyone done any work to evaluate how good LLM parsing is compared to traditional OCR? I've only got anecdotal evidence saying LLMs are better. However, whenever I've tested it out, there was always an unacceptable level of hallucinations.

bravesoul2 - 21 hours ago

Looks like they cracked it? But I found both OCR and reading the whole page (various OpenAI models) to be unusable for scanning, say, a magazine and working out which heading belongs to which text.

ianbicking - a day ago

Using modern tools I would naturally be inclined to:

1. Have the LLM see the image and produce a text version using a kind of semantic markup (even hallucinated markup)

2. Use that text for most of the RAG

3. If the focus (of analysis or conversation) converges on one image, include that image in the context in addition to the text

If I use a simple prompt with GPT 4o on the Palantir slide from the article I get this: https://gist.github.com/ianb/7a380a66c033c638c2cd1163ea7b2e9... – seems pretty good!
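A sketch of step 3 in that scheme: keep a mapping from each markup chunk back to its source page image, do ordinary text RAG, and attach the image only once the conversation has clearly converged on one chunk. The data layout and threshold below are made up for illustration.

```python
# Each indexed chunk is the LLM's semantic-markup rendering of one page and
# remembers which page image it was generated from.
index = [
    {"text": "<table caption='Q4 revenue by segment'>...</table>", "image_path": "pages/07.png"},
    {"text": "<section title='Risk factors'>...</section>", "image_path": "pages/12.png"},
]

def build_context(retrieved: list[dict], focus_score: float) -> list[dict]:
    """Use the markup text for RAG; pull in the original page image only
    when retrieval has clearly converged on a single page."""
    context = [{"type": "text", "text": chunk["text"]} for chunk in retrieved]
    if len(retrieved) == 1 and focus_score > 0.8:
        context.append({"type": "image", "path": retrieved[0]["image_path"]})
    return context
```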

imperfect_light - 19 hours ago

The emphasis on PDFs for RAG seems like something out of the 1990s. Are there any good frameworks for using RAG if your company doesn't go around creating documents left and right?

After all, the documents/emails/presentations will cover the most common use cases. But we have databases that hold answers to all the questions the RAG might be asked, far more than what lives in documents.

anshumankmr - 16 hours ago

The problem is that transcription errors will mess things up for sure. With text, you just do not have to worry about transcription errors. Sure, it's a bit tricky handling tables, and chunking is a problem as well, but unless my document is more images than text, I would prefer handling it the "old-fashioned" way.

jasonthorsness - a day ago

It makes sense that a lossy transformation (OCR, which removes structure) would be worse than a perceptually lossless one (because even if the PDF file has additional information, you only see the rendered visual). But it's cool and a little surprising that the multi-modal models are getting this good at interpreting images!

CaptainFever - 16 hours ago

Interesting article, but this is also an ad for a SaaS.

etk934 - 21 hours ago

Can you report the relative storage requirements for multi-vector ColPali vs multi-vector ColPali with binary vectors vs MUVERA vs a single vector per page? Can your system scale to millions of vectors?
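For scale, a rough back-of-envelope behind that question, assuming roughly 1,000 patch vectors of dimension 128 per page for ColPali-style multi-vectors (actual counts depend on the model and page size, so these are order-of-magnitude figures only):

```python
def per_page_bytes(n_vectors: int, dim: int, bytes_per_dim: float) -> float:
    return n_vectors * dim * bytes_per_dim

configs = {
    "multi-vector, fp16 (ColPali-style)": per_page_bytes(1000, 128, 2),
    "multi-vector, binary quantized": per_page_bytes(1000, 128, 1 / 8),
    "single 768-dim fp16 vector per page": per_page_bytes(1, 768, 2),
}
for name, nbytes in configs.items():
    print(f"{name}: {nbytes / 1024:.1f} KiB/page, "
          f"{nbytes * 1_000_000 / 1e9:.1f} GB per 1M pages")
# MUVERA's fixed-dimensional encoding collapses the multi-vectors into one
# larger single vector, landing between these extremes.
```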

K0balt - 19 hours ago

Can multimodal LLMs read the PDF file format to extract text components as well as graphical ones? That would seem to me to be the best way to go.

coyotespike - 21 hours ago

Wow, this is tempting me to use Morphik to add memory to in-terminal AI agents, even just for personal use. Looks powerful and easy.

constantinum - 7 hours ago

LLMs are not yet there for complex and diverse document parsing use cases, especially at an enterprise scale (processing millions of pages).

Some of the reasons are:

Complex layouts, nested tables, tables spanning multiple pages, checkboxes, radio buttons, skewed or off-orientation scans, controlling LLM costs, checking for hallucinations, human-in-the-loop integration, and privacy.

More on the issues: https://unstract.com/blog/why-llms-struggle-with-unstructure...

abc03 - a day ago

Related question: what is today‘s best solution for invoices?

commanderkeen08 - 21 hours ago

> The ColPali model doesn't just "look" at documents. It understands them in a fundamentally different way than traditional approaches.

I’m so sick of this.

ekianjo - 17 hours ago

I did a bit of work in that space. It's not that simple. Models that work with images are not perfect either and often have problems finding the right information. So you trade parsing issues for corner cases that are much more difficult to debug. At the end of the day, whichever works better should be assessed against your test/validation set.