Llama-Scan: Convert PDFs to Text W Local LLMs

71 nawazgafar 45 8/17/2025, 9:40:47 PM github.com ↗

Comments (45)

Areibman · 9m ago

Similar project used to organize PDFs with Ollama https://github.com/iyaja/llama-fs

ggnore7452 · 42m ago

I’ve done a similar PDF → Markdown workflow.

For each page:

- Extract text as usual.

- Capture the whole page as an image (~200 DPI).

- Optionally extract images/graphs within the page and include them in the same LLM call.

- Optionally add a bit of context from neighboring pages.

Then wrap everything with a clear prompt (structured output + how you want graphs handled), and you’re set.

At this point, models like GPT-5-nano/mini or Gemini 2.5 Flash are cheap and strong enough to make this practical.

Yeah, it’s a bit like using a rocket launcher on a mosquito, but this is actually very easy to implement and quite flexible and powerfuL. works across almost any format, Markdown is both AI and human friendly, and surprisingly maintainable.

GaggiX · 32m ago

>are cheap and strong enough to make this practical.

It all depends on the scale you need them, with the API it's easy to generate millions of tokens without thinking.

deepsquirrelnet · 20m ago

Give the nanonets-ocr-s model a try. It’s a fine tune of Qwen 2.5 vl which I’ve had good success with for markdown and latex with image captioning. It uses a simple tagging scheme for page numbers, captions and tables.

captainregex · 7m ago

I desperately wanted Qwen vl to work but it just unleashes rambling hallucinations off basic screencaps. going to try nanonet!

HocusLocus · 3h ago

By 1990 Omnipage 3 and its successors were 'good enough' and with their compact dictionaries and letter form recognition were miracles of their time at ~300MB installed.

In 2025 LLMs can 'fake it' using Trilobites of memory and Petaflops. It's funny actually, like a supercomputer being emulated in real time on a really fast Jacquard loom. By 2027 even simple hand held calculator addition will be billed in kilowatt-hours.

privatelypublic · 1h ago

If you think 1990's ocr- even 2000's OCR is remotely as good as modern OCR... I`v3 g0ta bnedge to sell.

No comments yet

Y_Y · 1h ago

https://en.wikipedia.org/wiki/Trilobite

Trilobites? Those were truly primitve computers.

evolve2k · 1h ago

“Turn images and diagrams into detailed text descriptions.”

I’d just prefer that any images and diagrams are copied over, and rendered into a popular format like markdown.

fcoury · 2h ago

I really wanted this to be good. Unfortunately it converted a page that contained a table that is usually very hard for converters to properly convert and I got a full page with "! Picture 1:" and nothing else. On top of that, it hung at page 17 of a 25 page document and never resumed.

nawazgafar · 2h ago

Author here, that sucks. I'd love to recreate this locally. Would you be willing to share the PDF?

KnuthIsGod · 40m ago

Sub-2010 level OCR using LLM.

It is hype-compatible so it is good.

It is AI so it is good.

It is blockchain so it is good.

It is cloud so it is good.

It is virtual so it is good.

It is UML so it is good.

It is RPN so it is good.

It is a steam engine so it is good.

Yawn...

GaggiX · 25m ago

>Sub-2010 level OCR

It's not.

david_draco · 4h ago

Looking at the code, this converts PDF pages to images, then transcribes each image. I might have expected a pdftotext post-processor. The complexity of PDF I guess ...

firesteelrain · 4h ago

There is a very popular Python module called ocrmypdf. I used it to help my HOA and OCR’ing of old PDFs.

https://github.com/ocrmypdf/OCRmyPDF

No LLMs required.

enjaydee · 3h ago

Saw this tweet the other day that helped me understand just how crazy PDF parsing can be

https://threadreaderapp.com/thread/1955355127818358929.html

constantinum · 16m ago

There are a few other reasons why PDF parsing is Hell! > https://unstract.com/blog/pdf-hell-and-practical-rag-applica...

moritonal · 3h ago

I imagine part of the issue is how many PDFs are just a series of images anyway.

westurner · 3h ago

Shell: GNU parallel, pdftotext

Python: PyPdf2, PdfMiner.six, Grobid, PyMuPdf; pytesseract (C++)

paperetl is built on grobid: https://github.com/neuml/paperetl

annotateai: https://github.com/neuml/annotateai :

> annotateai automatically annotates papers using Large Language Models (LLMs). While LLMs can summarize papers, search papers and build generative text about papers, this project focuses on providing human readers with context as they read.

pdf.js-hypothes.is: https://github.com/hypothesis/pdf.js-hypothes.is:

> This is a copy of Mozilla's PDF.js viewer with Hypothesis annotation tools added

Hypothesis is built on the W3C Web Annotations spec.

dokieli implements W3C Web Annotations and many other Linked Data Specs: https://github.com/dokieli/dokieli :

> Implements versioning and has the notion of immutable resources.

> Embedding data blocks, e.g., Turtle, N-Triples, JSON-LD, TriG (Nanopublications).

A dokieli document interface to LLMs would be basically the anti-PDF.

Rust crates: rayon handles parallel processing, pdf-rs, tesseract (C++)

pdf-rs examples/src/bin/extract_page.rs: https://github.com/pdf-rs/pdf/blob/master/examples/src/bin/e...

thorum · 2h ago

I’ve been trying to convert a dense 60 page paper document to Markdown today from photos taken on my iPhone. I know this is probably not the best way to do it but it’s still been surprising to find that even the latest cloud models are struggling to process many of the pages. Lots of hallucination and “I can’t see the text” (when the photo is perfectly clear). Lots of retrying different models, switching between LLMs and old fashioned OCR, reading and correcting mistakes myself. It’s still faster than doing the whole transcription manually but I thought the tech was further along.

bugglebeetle · 2h ago

Try this:

https://github.com/rednote-hilab/dots.ocr

mdaniel · 2h ago

The code is MIT, and the weights are labeled MIT although the actual license file in the weights repo seems to be mostly Apache 2 https://huggingface.co/rednote-hilab/dots.ocr/blob/main/NOTI...

Seems to weigh about 6GB which feels reasonable to manage locally

firesteelrain · 3h ago

Ironically, Ollama likely is using Tesseract under the hood. Python library ocrmypdf uses Tesseract too. https://github.com/ocrmypdf/OCRmyPDF

rafram · 2h ago

> Ironically, Ollama likely is using Tesseract under the hood.

No, it isn’t.

roscas · 4h ago

Almost perfect, the PDF I tested it missed only a few symbols.

But that is something I will use for sure. Thank you.

nawazgafar · 3h ago

Glad to hear it! What types of symbols did it miss?

constantinum · 17m ago

Other tools worthy of mention that help with OCR'ing PDF/Scans to markdown/layout-preserved text:

LLMWhisperer(from Unstract), Docling(IBM), Marker(Surya OCR), Nougat(Facebook Research), Llamaparse.

ahmedhawas123 · 2h ago

This may be a bit of an irrelevant and at best imaginative rant, but there is no shortage of solutions that are mediocre or near perfect for specific use cases out there to parse PDFs. This is a great addition to that.

That said, over the last two years I've come across many use cases to parse PDFs and each has its own requirements (e.g., figuring out titles, removing page numbers, extracting specific sections, etc). And each require a different approach.

My point is, this is awesome, but I wonder if there needs to be a broader push / initiative to stop leveraging PDFs so much when things like HTML, XML, JSON and a million other formats exist. It's a hard undertaking I know, no doubt, but it's not unheard of to drop technologies (e.g., fax) for a better technology.

bm-rf · 1h ago

For the purposes of an llm "reading" a pdf, it just renders it as an image. The file format does not matter. Let's say you have documents that already exist, a robust ocr solution that can handle tables and diagrams could be very valuable.

mdaniel · 2h ago

That ship has sailed, and I'd guess the majority of the folks in these threads are in the same boat I am: one does not get to choose what files your customers send you, you have to meet them where they are

treetalker · 3h ago

I presume this doesn't handle handwriting.

Does anyone have a suggestion for locally converting PDFs of handwriting into text, say on a recent Mac? Use case would be converting handwritten journals and daily note-taking.

nawazgafar · 3h ago

Author here, I tested it with this PDF of a handwritten doc [1], and it converted both pages accurately.

1. https://github.com/pnshiralkar/text-to-handwriting/blob/mast...

treetalker · 2h ago

Amazing, can't wait to try it!

FYI, your GitHub link tells me it's unable to render because the pdf is invalid.

simonw · 3h ago

This one should handle handwriting - it's using Qwen 2.5 VL which is a vision LLM that is very good at handwritten text.

password4321 · 3h ago

I don't know re: handwriting so only barely relevant but here is a new contender for a CLI "OCR Tool using Apple's Vision Framework API": https://github.com/riddleling/macocr which I found while searching for this recent discussion:

My iPhone 8 Refuses to Die: Now It's a Solar-Powered Vision OCR Server

https://news.ycombinator.com/item?id=44310944

phren0logy · 1h ago

If you use Docling, you can set your OCR engine to OCRMac then set it to use LiveText. It’s a good arrangement. You can send these as command-line arguments, but I generally configure it from the Python API.

ntnsndr · 3h ago

+1. I have tried a bunch of local models (albeit the smaller end, b/c hardware limits), and I can't get handwriting recognition yet. But online Gemini and Claude do great. Hoping the local models catch up soon, as this is a wonderful LLM use case.

UPDATE: I just tried this with the default model on handwriting, and IT WORKED. Took about 5-10 minutes on my laptop, but it worked. I am so thrilled not to have to send my personal jottings into the cloud!

abnry · 3h ago

I would really like a tool to reliably get the title of PDF. It is not as easy as it seems. If the PDF exists online (say a paper or course notes) a bonus would be to find that or related metadata.

s0rce · 3h ago

Zotero does an ok job at this for papers.

leodip · 3h ago

Nice! I wonder what is the hardware required to run qwen2.5vl locally. A 6gb 2cpu VPS can do?

mdaniel · 2h ago

It does not appear that qwen2.5vl is one thing, so it would depend a great deal on the size you wish to use

Also, watch out, it seems the weights do not carry a libre license https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/blob/main...

wittjeff · 3h ago

Please add a license file. Thanks!

nawazgafar · 3h ago

Will do!

no_creativity_ · 3h ago

Which llama model would have the best results for transcribing an image, I wonder. Say, for a screen grab of a newspaper page.

cronoz30 · 2h ago

Does this work with images embedded in the PDF and rasterized images?

Clojure Async Flow Guide (clojure.github.io)

Claudia – Desktop companion for Claude code (claudiacode.com)

The Enterprise Experience (churchofturing.github.io)

AI vs. Professional Authors Results (mark---lawrence.blogspot.com)

NUMA Is the New Network: Reshaping Per-Socket Microservice Placement (codemia.io)

Modifying other people's software (natkr.com)

Show HN: Doxx – Terminal .docx viewer inspired by Glow (github.com)

Llama-Scan: Convert PDFs to Text W Local LLMs (github.com)

Show HN: OverType – A Markdown WYSIWYG editor that's just a textarea

Derivatives, Gradients, Jacobians and Hessians (blog.demofox.org)

ArchiveTeam has finished archiving all goo.gl short links (tracker.archiveteam.org)

Show HN: NextDNS Adds "Bypass Age Verification"

Node.js is able to execute TypeScript files without additional configuration (nodejs.org)

I Prefer RST to Markdown (2024) (buttondown.com)

BBC Micro, ancestor to ARM (retrogamecoders.com)

IQ tests results for AI (trackingai.org)

MS-DOS development resources (github.com)

Here be dragons: Preventing static damage, latchup, and metastability in the 386 (righto.com)

Mangle – a language for deductive database programming (github.com)

The Microscopic Forces That Break Hearts (thewaitlist.substack.com)

Why Nim? (undefined.pyfy.ch)

A Visual Exploration of Gaussian Processes (2019) (distill.pub)

Carved stone mask – Pre-pottery, Neolithic B period (imj.org.il)

Show HN: Fallinorg - Offline Mac app that organizes files by meaning (fallinorg.com)

LL3M: Large Language 3D Modelers (threedle.github.io)

HN Search isn't ingesting new data since Friday (github.com)

Faster Index I/O with NVMe SSDs (marginalia.nu)

Primitive Streaming Gods (tedium.co)

Teaching GPT-5 to Use a Computer (prava.co)

Electricity prices are climbing more than twice as fast as inflation (npr.org)

The decline of high-tech manufacturing in the United States (blog.waldrn.com)

Who does your assistant serve? (xeiaso.net)

SuperSight: A graphical enhancement mod for Brøderbund's "Stunts" (marnetto.net)

LLMs tell bad jokes because they avoid surprises (danfabulich.medium.com)

Dispelling misconceptions about RLHF (aerial-toothpaste-34a.notion.site)

Review of Anti-Aging Drugs (scienceblog.com)

Lego Transformers Soundwave (lego.com)

The Photographic Periodic Table of the Elements (2017) (periodictable.com)

PG Auto Upgrade – Docker (and K8s) container to auto upgrade your database (github.com)

Cache of WW2 bombs found under a playground in Northumberland (bbc.co.uk)

VictoriaLogs Practical Ingestion Guide for Message, Time and Streams (victoriametrics.com)

Men paying thousands to get their legs broken – and lengthened (theguardian.com)

When did AI take over Hacker News? (zachperk.com)

Does OLAP Need an ORM (clickhouse.com)

State of the U.S. Semiconductor Industry 2025 [pdf] (semiconductors.org)

Sunny days are warm: why LinkedIn rewards mediocrity (elliotcsmith.com)

AI doesn't lighten the burden of mastery (playtechnique.io)

8M patients: the only neurosurgeon in Sierra Leone (npr.org)

One GPU translates into three to five of the fastest Ethernet switch ports (theregister.com)

Car-carrier truck loaded with Teslas causes hazmat fire on 5 Freeway in Sylmar (abc7.com)

Llama-Scan: Convert PDFs to Text W Local LLMs

Comments (45)