How we made our OCR code more accurate

pieces.app

57 points by thunderbong 3 days ago


bluelightning2k - 2 days ago

I can't say I've ever wanted to transcribe code from an image. That seems super niche.

Perhaps the specific idea is to harvest coding textbooks as training data for LLMs?

camtarn - 2 days ago

Neat article, but I feel like I have no idea why they're doing this! Is transcribing code from images really such a big use case?

bobosha - 2 days ago

Has anyone tried feeding the admittedly noisy OCR'd text, at a document level, to an LLM to make sense of it? Presumably some of the less capable models would be quite affordable and accurate at scale as well.
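
A minimal sketch of that idea, assuming an OpenAI-compatible chat API and a pytesseract pass for the raw OCR (the model name, prompt, and libraries are illustrative assumptions, not anything from the article):

    # Sketch: clean up noisy document-level OCR output with an LLM.
    # Assumes the openai and pytesseract packages; model and prompt are placeholders.
    from openai import OpenAI
    from PIL import Image
    import pytesseract

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ocr_then_correct(image_path: str, model: str = "gpt-4o-mini") -> str:
        # Raw, possibly noisy OCR pass over the whole document.
        raw_text = pytesseract.image_to_string(Image.open(image_path))
        # Ask a cheap model to repair OCR artifacts without changing content.
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system",
                 "content": "Fix OCR errors in the following text. "
                            "Do not add, remove, or reorder content."},
                {"role": "user", "content": raw_text},
            ],
        )
        return response.choices[0].message.content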

lesuorac - 2 days ago

OCR is the biggest XY problem.

Stop accepting PDFs and force things to use APIs ...

MoonGhost - a day ago

Even a small upscaling model trained on text should do better than a big generic one.
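
For what it's worth, a minimal sketch of that approach using OpenCV's dnn_superres module with a pretrained ESPCN model in front of Tesseract (the model file, scale factor, and pytesseract call are assumptions for illustration):

    # Sketch: upscale a text image with a small super-resolution model before OCR.
    # Assumes opencv-contrib-python and pytesseract; the ESPCN_x3.pb model file
    # must be downloaded separately.
    import cv2
    import pytesseract

    def upscale_then_ocr(image_path: str) -> str:
        img = cv2.imread(image_path)
        sr = cv2.dnn_superres.DnnSuperResImpl_create()
        sr.readModel("ESPCN_x3.pb")  # pretrained model path (assumption)
        sr.setModel("espcn", 3)      # model name and upscale factor
        upscaled = sr.upsample(img)
        return pytesseract.image_to_string(upscaled)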

abc-1 - 2 days ago

Anything that mentions Tesseract is about 10 years out of date at this point.

sushid - 2 days ago

Making OCR more accurate for regular text (e.g. data extraction from documents) would be useful; I'm not sure how useful code transcription is.

vaxman - 2 days ago

Tesseract OCR was created by Digital (DEC) in 1985 (yes, 40, not four, years ago). Now go back and read the article and ROFL with me.