Llama-Scan: Convert PDFs to Text with Local LLMs

github.com

220 points by nawazgafar 4 days ago


visarga - 3 days ago

The crucial information is missing: an accuracy comparison with other OCR providers. In my experience, LLM-based OCR might misread the layout and hallucinate values; it's very subtle but sometimes critically wrong. Classical OCR has more precision but doesn't get the layout at all. Combining both has its own issues; no approach is 100% reliable.

HocusLocus - 4 days ago

By 1990, OmniPage 3 and its successors were 'good enough'; with their compact dictionaries and letter-form recognition, they were miracles of their time at ~300MB installed.

In 2025, LLMs can 'fake it' using trilobites of memory and petaflops. It's funny, actually, like a supercomputer being emulated in real time on a really fast Jacquard loom. By 2027, even simple handheld-calculator addition will be billed in kilowatt-hours.

ggnore7452 - 4 days ago

I’ve done a similar PDF → Markdown workflow.

For each page:

- Extract text as usual.

- Capture the whole page as an image (~200 DPI).

- Optionally extract images/graphs within the page and include them in the same LLM call.

- Optionally add a bit of context from neighboring pages.

Then wrap everything with a clear prompt (structured output + how you want graphs handled), and you’re set.

At this point, models like GPT-5-nano/mini or Gemini 2.5 Flash are cheap and strong enough to make this practical; see the sketch below.
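
A minimal version of that loop, assuming PyMuPDF for rendering and the OpenAI client as one possible backend; the model name, prompt wording, and file names are illustrative, and the optional image-crop and neighbor-context steps are omitted:

    # Sketch of the per-page PDF -> Markdown loop described above.
    # pip install pymupdf openai; model, prompt, and file names are placeholders.
    import base64

    import fitz  # PyMuPDF
    from openai import OpenAI

    client = OpenAI()
    doc = fitz.open("input.pdf")
    pages_md = []

    for page in doc:
        text = page.get_text()          # 1. extract embedded text as usual
        pix = page.get_pixmap(dpi=200)  # 2. render the whole page at ~200 DPI
        img = base64.b64encode(pix.tobytes("png")).decode()

        resp = client.chat.completions.create(
            model="gpt-5-mini",  # placeholder: any capable vision model
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Convert this page to Markdown. Describe any "
                             "graphs briefly. Extracted text for reference:\n"
                             + text},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{img}"}},
                ],
            }],
        )
        pages_md.append(resp.choices[0].message.content)

    print("\n\n".join(pages_md))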

Yeah, it's a bit like using a rocket launcher on a mosquito, but this is actually very easy to implement and quite flexible and powerful: it works across almost any format, Markdown is both AI- and human-friendly, and it's surprisingly maintainable.

fcoury - 4 days ago

I really wanted this to be good. Unfortunately, it converted a page containing a table that is usually very hard for converters to handle, and I got a full page with "! Picture 1:" and nothing else. On top of that, it hung at page 17 of a 25-page document and never resumed.

evolve2k - 4 days ago

“Turn images and diagrams into detailed text descriptions.”

I'd just prefer that any images and diagrams be copied over and rendered into a popular format like Markdown.

thorum - 4 days ago

I've been trying to convert a dense 60-page paper document to Markdown today from photos taken on my iPhone. I know this is probably not the best way to do it, but it's still been surprising to find that even the latest cloud models struggle to process many of the pages. Lots of hallucination and "I can't see the text" (when the photo is perfectly clear). Lots of retrying different models, switching between LLMs and old-fashioned OCR, and reading and correcting mistakes myself. It's still faster than doing the whole transcription manually, but I thought the tech was further along.

Areibman - 4 days ago

A similar project used to organize PDFs with Ollama: https://github.com/iyaja/llama-fs

david_draco - 4 days ago

Looking at the code, this converts PDF pages to images, then transcribes each image. I might have expected a pdftotext post-processor. The complexity of PDF, I guess...
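
For contrast, the pdftotext-style route is only a few lines with PyMuPDF (a sketch, not the repo's actual code); it only works when the PDF has a usable text layer, which is presumably why the repo rasterizes instead:

    # Direct extraction: pull the embedded text layer, no rasterization or LLM.
    import fitz  # PyMuPDF

    doc = fitz.open("input.pdf")
    for i, page in enumerate(doc, start=1):
        print(f"--- page {i} ---")
        print(page.get_text("text"))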

ahmedhawas123 - 4 days ago

This may be a bit of an irrelevant and at best imaginative rant, but there is no shortage of PDF-parsing solutions out there, ranging from mediocre to near perfect for specific use cases. This is a great addition to them.

That said, over the last two years I've come across many use cases for parsing PDFs, and each has its own requirements (e.g., figuring out titles, removing page numbers, extracting specific sections). Each requires a different approach.

My point is, this is awesome, but I wonder whether there needs to be a broader push to stop leaning on PDFs so heavily when HTML, XML, JSON, and a million other formats exist. It's a hard undertaking, no doubt, but it's not unheard of to drop a technology (e.g., fax) for a better one.

roscas - 4 days ago

Almost perfect: on the PDF I tested, it missed only a few symbols.

But that is something I will use for sure. Thank you.

cast42 - 3 days ago

It would be nice to see how it performs on this benchmark: https://github.com/opendatalab/OmniDocBench

deepsquirrelnet - 4 days ago

Give the nanonets-ocr-s model a try. It's a fine-tune of Qwen 2.5 VL that I've had good success with for Markdown and LaTeX with image captioning. It uses a simple tagging scheme for page numbers, captions, and tables.

pyuser583 - 3 days ago

It seems we've entered the "AI is local" phase.

philips - 3 days ago

It would be nice to provide a way to edit the prompt. I have a use case where I need to extract tabular handwritten data from PDFs scanned with a phone and I don't want it to extract the printed instructions on the form, etc.

I have a very similar Go script that does this. My prompt: "Create a CSV of the handwritten text in the table. Include the package number on each line. Only output a CSV."
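
With the ollama Python client, a custom prompt is just a parameter; a hedged sketch (the model tag and image path are placeholders, not this project's actual interface):

    # Sketch: run a custom extraction prompt against a local vision model.
    # pip install ollama; model tag and file name are placeholders.
    import ollama

    PROMPT = ("Create a CSV of the handwritten text in the table. "
              "Include the package number on each line. Only output a CSV.")

    resp = ollama.chat(
        model="qwen2.5vl",                   # any local vision-capable model
        messages=[{
            "role": "user",
            "content": PROMPT,
            "images": ["scanned_page.png"],  # one page rendered to an image
        }],
    )
    print(resp["message"]["content"])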

jdthedisciple - 3 days ago

If it's not nearly as accurate as state-of-the-art OCR (such as GPT's), then I'm not sure being offline is worth the tradeoff for me personally.

I'm on the lookout for the absolute best possible multilingual OCR performance, local or not, cost what it may (almost).

fforflo - 3 days ago

If you're interested in this sort of thing with an SQL flavor, you may find the pgpdf PostgreSQL extension useful: https://github.com/Florents-Tselai/pgpdf

It's basically an SQL wrapper around poppler.

firesteelrain - 4 days ago

Ironically, Ollama is likely using Tesseract under the hood. The Python library ocrmypdf uses Tesseract too: https://github.com/ocrmypdf/OCRmyPDF
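
For the classical route, ocrmypdf is nearly a one-liner; a sketch with placeholder file names:

    # Classical OCR: add a Tesseract text layer to a scanned PDF and
    # dump the recognized text to a sidecar file.
    # pip install ocrmypdf (Tesseract must be installed on the system).
    import ocrmypdf

    ocrmypdf.ocr("scanned.pdf", "searchable.pdf", deskew=True, sidecar="text.txt")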

treetalker - 4 days ago

I presume this doesn't handle handwriting.

Does anyone have a suggestion for locally converting PDFs of handwriting into text, say on a recent Mac? Use case would be converting handwritten journals and daily note-taking.

constantinum - 4 days ago

Other tools worth mentioning that help with OCR'ing PDFs/scans to Markdown or layout-preserved text:

LLMWhisperer (Unstract), Docling (IBM), Marker (Surya OCR), Nougat (Facebook Research), LlamaParse.

abnry - 4 days ago

I would really like a tool that reliably gets the title of a PDF. It is not as easy as it seems. If the PDF exists online (say, a paper or course notes), a bonus would be finding that page or related metadata.
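
Part of why it's hard: the PDF metadata Title field is often empty or wrong, so heuristics are needed. A purely illustrative sketch with PyMuPDF:

    # Heuristic title guess: try PDF metadata, then fall back to the
    # largest text span on page 1 (often, but not always, the title).
    import fitz  # PyMuPDF

    def guess_title(path: str) -> str:
        doc = fitz.open(path)
        meta = ((doc.metadata or {}).get("title") or "").strip()
        if meta:
            return meta
        spans = [span
                 for block in doc[0].get_text("dict")["blocks"]
                 if block.get("type") == 0  # text blocks only
                 for line in block["lines"]
                 for span in line["spans"]]
        return max(spans, key=lambda s: s["size"])["text"].strip() if spans else ""

    print(guess_title("paper.pdf"))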

no_creativity_ - 4 days ago

Which Llama model would give the best results for transcribing an image, I wonder? Say, for a screen grab of a newspaper page.

wittjeff - 4 days ago

Please add a license file. Thanks!

leodip - 4 days ago

Nice! I wonder what hardware is required to run qwen2.5vl locally. Would a 6GB, 2-CPU VPS do?

jonwinstanley - 3 days ago

What else can be hooked up to Ollama? Can Cursor use it yet?

cronoz30 - 4 days ago

Does this work with images embedded in the PDF and rasterized images?

ekianjo - 4 days ago

Careful if you plan on using this: it leverages PyMuPDF, which is AGPL.

AmazingTurtle - 3 days ago

Yet another prompt wrapper:

    TRANSCRIPTION_PROMPT = """Task: Transcribe the page from the provided book image.

    - Reproduce the text exactly as it appears, without adding or omitting anything.
    - Use Markdown syntax to preserve the original formatting (e.g., headings, bold, italics, lists).
    - Do not include triple backticks (```) or any other code block markers in your response, unless the page contains code.
    - Do not include any headers or footers (for example, page numbers).
    - If the page contains an image or a diagram, describe it in detail. Enclose the description in an <image> tag. For example:

    <image> This is an image of a cat. </image>
    """

KnuthIsGod - 4 days ago

Sub-2010-level OCR using an LLM.

It is hype-compatible so it is good.

It is AI so it is good.

It is blockchain so it is good.

It is cloud so it is good.

It is virtual so it is good.

It is UML so it is good.

It is RPN so it is good.

It is a steam engine so it is good.

Yawn...