There are multiple fundamental problems people need to be aware of.
- LLMs are typically pre-trained on ~4k text tokens and then extrapolated out to longer context windows (it's easy to go from 4,000 text tokens to 4,001). This isn't possible with images because of how they're tokenized. As a result, you're out of distribution, and hallucinations become a huge problem once you're dealing with more than a couple of images.
- PDFs rendered at 1536 × 2048 use 3 to 5x more tokens than the raw text (i.e. higher inference costs and slower responses), and going lower results in blurry images. (See the rough arithmetic sketch at the end of this comment.)
- Images are inherently a much heavier representation in raw size too; you're adding latency to every request just to download all the needed images.
Their very small benchmark is obviously going to outperform basic text chunking on finance docs heavy with charts and tables. I would be far more interested in seeing an OCR step added with Gemini (which can annotate images) and then comparing results.
An end to end image approach makes sense in certain cases (like patents, architecture diagrams, etc) but it's a last resort.
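To make the token math concrete, here's a rough back-of-envelope sketch. Every constant in it (tile size, tokens per tile, words per page) is an illustrative assumption rather than any specific model's numbers; tweak them for your own stack.

```python
# Back-of-envelope: text tokens vs. image tokens for one rendered PDF page.
# All constants are illustrative assumptions, not any particular model's numbers.

WORDS_PER_PAGE = 500      # a reasonably dense text page
TOKENS_PER_WORD = 1.3     # rough average for English BPE tokenizers

TILE_SIZE = 512           # assumed vision-encoder tile edge length
TOKENS_PER_TILE = 170     # assumed tokens emitted per tile

def text_tokens(words: int = WORDS_PER_PAGE) -> int:
    return int(words * TOKENS_PER_WORD)

def image_tokens(width: int = 1536, height: int = 2048) -> int:
    tiles_x = -(-width // TILE_SIZE)   # ceiling division
    tiles_y = -(-height // TILE_SIZE)
    return tiles_x * tiles_y * TOKENS_PER_TILE

if __name__ == "__main__":
    t, i = text_tokens(), image_tokens()
    print(f"text ~{t} tokens, image ~{i} tokens, ratio ~{i / t:.1f}x")
    # With these assumptions: ~650 vs ~2040 tokens, i.e. roughly 3x. Sparser pages
    # (or finer tiling / higher DPI) push the ratio toward the upper end of 3-5x.
```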
pilooch · 50m ago
True, but modern models alleviate these issues with tricks such as Gemma 3's pan-and-scan and training at multiple resolutions.
An interesting property of the Gemma 3 family is that increasing the input image size does not actually increase processing memory requirements, because a second-stage encoder compresses it into a fixed number of tokens. Very neat in practice.
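For anyone curious what "compresses it into a fixed number of tokens" means mechanically, here is a minimal sketch of the general idea (a learned set of queries cross-attending over a variable number of image patches). It is not Gemma 3's actual encoder; all shapes and sizes are made up for illustration.

```python
import torch
import torch.nn as nn

# Sketch of a fixed-output resampler: learned query tokens cross-attend over
# however many image patches come in, so the output token count is constant
# regardless of input resolution. Illustrative only; not Gemma 3's real encoder.

class FixedTokenResampler(nn.Module):
    def __init__(self, dim: int = 768, num_out_tokens: int = 256, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_out_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:  # (batch, n_patches, dim)
        q = self.queries.unsqueeze(0).expand(patches.shape[0], -1, -1)
        out, _ = self.attn(q, patches, patches)
        return out  # (batch, num_out_tokens, dim), independent of n_patches

resampler = FixedTokenResampler()
low_res = torch.randn(1, 1024, 768)    # fewer patches from a small image
high_res = torch.randn(1, 4096, 768)   # many more patches from a large image
print(resampler(low_res).shape, resampler(high_res).shape)  # both (1, 256, 768)
```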
ArnavAgrawal03 · 56m ago
You can add OCR with Gemini, and presumably that would lead to better results than the OCR model we compared against. However, it's important to note that you're then guaranteeing that the entire corpus of documents you're processing will go through a large VLM. That can be prohibitively expensive and slow.
There are definitely trade-offs to be made here; we found this to be the most effective approach in most cases.
pilooch · 3h ago
Some colleagues and I implemented exactly this six months ago for a French gov agency.
It's open source and available here: https://github.com/jolibrain/colette
It's not our primary business, so it's just sitting there and we don't advertise it much, but it works, with some tweaks to get it really efficient.
The true genius, though, is that the whole thing can be made fully differentiable, unlocking the ability to fine-tune the visual RAG on targeted datasets.
The layout model can also be customized for fine-grained document understanding.
ted_dunning · 2h ago
You don't have a license in your repository top-level. That means that nobody who takes licensing at all seriously can use your stuff, even just for reference.
deadbabe · 2m ago
Standard practice now is to just have an LLM read the whole repo and write a new original version in a different language. It’s code laundering.
pilooch · 1h ago
Good catch, will add it tomorrow. License is Apache2.
I agree it's better to have the full licence at top level, but is there a legal reason why this would be inadequate?
JSR_FDED · 2h ago
Great, thanks for sharing your code. Could you please add a license so I and others can understand if we're able to use it?
Adityav369 · 3h ago
Yeah, the fine-tuning is definitely the best part.
Often, the blocker becomes high-quality eval sets (which I guess is always the blocker).
bravesoul2 · 5m ago
Looks like they cracked it? But I've found both OCR and reading the whole page (various OpenAI models) unusable for scanning, say, a magazine, and for working out which heading goes with what text.
themanmaran · 3h ago
Hey, we've done a lot of research on this [1] (OCR vs. direct image + general LLM benchmarking).
The biggest problem with direct image extraction is multi-page documents. We found that single-page extraction (OCR => LLM vs. image => LLM) slightly favored direct image extraction, but anything beyond 5 images had a sharp fall-off in accuracy compared to OCR first.
Which makes sense: long-context recall over text is already a hard problem, but that's what LLMs are optimized for. Long-context recall over images is still pretty bad.
[1] https://getomni.ai/blog/ocr-benchmark
That's an interesting point. We've found that for most use cases, over 5 pages of context is overkill. Having a small LLM conversion layer on top of images also ends up working pretty well (i.e. instead of direct OCR, passing batches of 5 images - if you really need that many - to smaller vision models and having them extract the most important points from the document).
We're currently researching surgery on the cache or attention maps of LLMs to make larger batches of images work better. Sliding-window attention or Infinite Retrieval seem like promising directions to go into.
Also - and this is speculation - I think that the jump in multimodal capabilities that we're seeing from models is only going to increase, meaning long-context for images is probably not going to be a huge blocker as models improve.
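A minimal sketch of that small conversion layer, under stated assumptions: `summarize_pages` is a hypothetical placeholder for whatever small vision model you call, and the batch size of 5 just mirrors the number above.

```python
from typing import List

# Sketch of the conversion-layer idea: batch page images in groups of five and
# have a small vision model distill each batch into text notes. The VLM call is
# a hypothetical placeholder, not a real client library.

BATCH_SIZE = 5

def summarize_pages(page_images: List[bytes], prompt: str) -> str:
    """Placeholder: send images + prompt to a small vision model, return its text."""
    raise NotImplementedError

def condense_document(page_images: List[bytes]) -> List[str]:
    notes = []
    for start in range(0, len(page_images), BATCH_SIZE):
        batch = page_images[start:start + BATCH_SIZE]
        notes.append(summarize_pages(
            batch,
            "Extract the key facts, figures, and table values from these pages.",
        ))
    return notes  # the notes get retrieved/passed as text; raw images only when needed
```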
themanmaran · 2h ago
This just depends a lot on how well you can pare down the context before passing it to an LLM.
Ex: reading contracts or legal documents. Usually it's a 50-page document that you can't effectively cherry-pick from, since different clauses or sections are referenced multiple times across the full document.
In these scenarios, it's almost always better to pass the full document into the LLM rather than running RAG. And if you're passing the full document it's better as text rather than images.
thor-rodrigues · 1h ago
I spent a good amount of time last year working on a system to analyse patent documents.
Patents are difficult as they can include anything from abstract diagrams to chemical formulas to mathematical equations, so it tends to be really tricky to prepare the data in a way that can later be used by an LLM.
The simplest approach I found was to “take a picture” of each page of the document and ask an LLM to generate a JSON description of the content (plus some other metadata such as page number, number of visual elements, and so on).
If any complicated image is present, simply ask the model to describe it. Once that is done, you have a JSON file that can be embedded into your vector store of choice.
I can’t speak to the price-to-performance ratio, but this approach seems easier and more efficient than what the author is proposing.
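If it helps, here is roughly what that per-page flow looks like as a sketch, not the exact pipeline: `describe_page_with_vlm` and `embed_text` are hypothetical placeholders for your VLM and embedding calls, and rendering assumes pdf2image (with poppler) is available.

```python
import json
from pdf2image import convert_from_path  # renders PDF pages to PIL images (needs poppler)

def describe_page_with_vlm(image, page_number: int) -> dict:
    """Placeholder: ask a VLM for a JSON description of the page, e.g.
    {"page": 3, "text": "...", "visual_elements": 2, "figures": [...]}."""
    raise NotImplementedError

def embed_text(text: str) -> list:
    """Placeholder: return an embedding vector for the page description."""
    raise NotImplementedError

def index_document(pdf_path: str) -> list:
    records = []
    for page_number, image in enumerate(convert_from_path(pdf_path, dpi=150), start=1):
        page_json = describe_page_with_vlm(image, page_number)
        records.append({
            "metadata": page_json,                       # page number, element counts, etc.
            "embedding": embed_text(json.dumps(page_json)),
        })
    return records  # ready to upsert into the vector store of your choice
```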
Adityav369 · 54m ago
You can ask the model to describe the image, but that is inherently lossy. What if it's a chart and the model captures most of the x, y pairs, but the user asks about a missing x or y value? Presenting the image at inference time is effective because you're guaranteeing that the LLM can answer exactly the user's question. The only blocker then is how good retrieval is, and that's a smaller problem to solve. This approach lets us solve only for passing in relevant context; the rest is taken care of by the LLM. Otherwise the problem space expands to correct OCR, parsing, and getting every possible description of the images out of the model.
monkeyelite · 1h ago
This is a great example of how to use LLMs thanks.
But it also illustrates to me that the opportunities with LLMs right now are primarily about reclassifying or reprocessing existing sources of value like patent documents. In the 90-00s many successful SW businesses were building databases to replace traditional filing.
Creating fundamentally new collections of value which require upfront investment seems to still be challenging for our economy.
cheschire · 1h ago
how often has the model hallucinated the image though?
ashishb · 2h ago
I speak from experience that this is a bad idea.
There are cases where documents contain text with letters that look the same in many fonts. For example, 0 and O look identical in many fonts. So if you have a doc/xls/PDF/HTML file, you lose information by converting it into an image.
For cases like serial numbers, not even humans can distinguish 0 vs O (or l vs I) by looking at them.
zffr · 1h ago
PDFs don’t always contain actual text. Sometimes they just contain instructions to draw the letters.
For that reason, IMO rendering a PDF page as an image is a very reasonable way to extract information out of it.
For the other formats you mentioned, I agree that it is probably better to parse the document instead.
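A practical middle ground, sketched under the assumption that pypdf and pdf2image are available: check whether each page actually has an extractable text layer, and only render as images the pages that don't.

```python
from pypdf import PdfReader
from pdf2image import convert_from_path  # needs poppler installed

def pages_needing_rendering(pdf_path: str, min_chars: int = 50) -> list:
    """Return indices of pages whose text layer is missing or near-empty."""
    reader = PdfReader(pdf_path)
    return [
        i for i, page in enumerate(reader.pages)
        if len((page.extract_text() or "").strip()) < min_chars
    ]

def render_pages(pdf_path: str, page_indices) -> dict:
    # pdf2image uses 1-indexed first_page/last_page, so render one page at a time
    return {
        i: convert_from_path(pdf_path, dpi=150, first_page=i + 1, last_page=i + 1)[0]
        for i in page_indices
    }
```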
ArnavAgrawal03 · 51m ago
Completely agree with this. This is what we've observed in production too. Embedding images makes the RAG a lot more robust to the "inner workings" of a document.
ArnavAgrawal03 · 44m ago
For HTML, in a lot of cases, using the tags to chunk things works better. However, I've found that when I'm trying to design a page, showing models the actual image of the page leads to way better debugging than just sending the code back.
1 vs I or 0 vs O are valid issues, but in practice - and there's probably selection bias here - we've seen documents with a ton of diagrams and charts (that are much simpler to deal with as images).
weego · 1h ago
This is within the context of using it as an alternative to OCR, which would suffer the same issues, with more duct tape and string infrastructure and cost.
llm_nerd · 5m ago
Strangely, the linked marketing text repeatedly brings up OCR errors (I counted at least 4 separate instances), which is extremely weird because such a visual RAG suffers from precisely the same problem. It is such a weird thing to repeatedly harp on.
If the OCR has a problem understanding varying fonts and text, there is zero reason using embeddings instead is immune to this.
ashishb · 1h ago
You can win any race if you can cherry-pick your competitors.
emanuer · 3h ago
Could someone please help me understand how a multi-modal RAG does not already solve this issue?[1]
What am I missing?
Flash 2.5, Sonnet 3.7, etc. always provided me with very satisfactory image analysis. And, I might be making this up, but to me it feels like some models provide better responses when I give them the text as an image, instead of feeding "just" the text.
[1] https://www.youtube.com/watch?v=p7yRLIj9IyQ
Multimodal RAG is exactly what we argue for. In their original state, though, multivectors (which form the basis for multimodal RAG) are very unwieldy: computing the similarity scores is very expensive, so scaling them up in that state is hard.
You need to apply things like quantization, single-vector conversions (using fixed dimensional encodings), and better indexing to ensure that multimodal RAG works at scale.
That is exactly what we're doing at Morphik :)
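For anyone unfamiliar with why multivectors are unwieldy, here is a toy numpy comparison of ColPali-style MaxSim scoring vs. a single pooled vector. The shapes are illustrative, and the mean pooling is just a crude stand-in for a real fixed-dimensional encoding (MUVERA-style), not what Morphik actually ships.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128
query_vecs = rng.normal(size=(32, dim))    # one embedding per query token
doc_vecs = rng.normal(size=(1024, dim))    # one embedding per page patch

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    # Late interaction: every query token is scored against every document patch,
    # then each query token keeps its best match -> O(|q| * |d|) work per document.
    sims = query_vecs @ doc_vecs.T          # (32, 1024) similarity matrix
    return float(sims.max(axis=1).sum())

def pooled_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    # Crude single-vector stand-in: pool each side once, then one dot product.
    return float(np.mean(query_vecs, axis=0) @ np.mean(doc_vecs, axis=0))

print(maxsim(query_vecs, doc_vecs), pooled_score(query_vecs, doc_vecs))
```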
urbandw311er · 2h ago
Something just feels a bit off about this piece. It seems to labour the point about how “beautiful” or “perfect” their solution is a few times too many, to the point where it starts to feel more like marketing than any sort of useful technical observation.
bravesoul2 · 2m ago
It is marketing of course. Regardless of what it says it's a company blog. That sets constraints on the sort of stuff they say vs. a regular blog. Not picking on this company as it is the same for all such blogs.
programjames · 2h ago
I disagree. It feels like something you would say when you finally come across the "obviously right" solution, that's easier to implement and simpler to describe. As Kolmogorov said, the simplest solution is exponentially more correct than the others.
ianbicking · 2h ago
Using modern tools I would naturally be inclined to:
1. Have the LLM see the image and produce a text version using a kind of semantic markup (even hallucinated markup)
2. Use that text for most of the RAG
3. If the focus (of analysis or conversation) converges on one image, include that image in the context in addition to the text
If I use a simple prompt with GPT 4o on the Palantir slide from the article I get this: https://gist.github.com/ianb/7a380a66c033c638c2cd1163ea7b2e9... – seems pretty good!
This is something I've done as well - I wanted to scan all the invoices that came into my mail, so I just exported ALL ATTACHMENTS from my mailbox and used a script to upload them one by one, forcing a tool call to extract "is invoice: yes / no" and a bunch of fields: invoice lines, company name, date, invoice number, etc.
It had a surprisingly high hit rate. It took over 3 hours of LLM calls but who cares - It was completely hands-off. I then compared the invoices to my bank statements (aka I asked an LLM to do it) and it just missed a few invoices that weren't included as attachments (like those "click to download" mails). It did a pretty poor job matching invoices to bank statements (like "oh this invoice is a few dollars off but i'm sure its this statement") so I'm afraid I still need an accountant for a while.
"What did it cost"? I don't know. I used a cheap-ish model, Claude 3.7 I think.
taberiand · 1h ago
In your use case, for the simple data matching it gets wrong, I think it would be better to have the LLM write the code that processes the input files (the raw text it produced from the images, plus the bank statements), rather than have the LLM try to match up the data in the files itself.
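To illustrate, this is the kind of deterministic matching such LLM-written code might do once the fields are extracted; the field names, tolerances, and sample rows are all made up for the example.

```python
from datetime import date

# Toy example: match extracted invoices to bank-statement lines by amount and
# date proximity instead of asking the model to eyeball it. Data is invented.

invoices = [
    {"number": "INV-118", "amount": 49.00, "date": date(2024, 3, 2)},
    {"number": "INV-131", "amount": 120.50, "date": date(2024, 3, 18)},
]
statement = [
    {"description": "ACME SAAS", "amount": -49.00, "date": date(2024, 3, 4)},
    {"description": "HOSTING CO", "amount": -120.50, "date": date(2024, 3, 19)},
]

def match(invoices, statement, amount_tol=0.01, max_days=14):
    matches = []
    for inv in invoices:
        for line in statement:
            amount_ok = abs(abs(line["amount"]) - inv["amount"]) <= amount_tol
            date_ok = abs((line["date"] - inv["date"]).days) <= max_days
            if amount_ok and date_ok:
                matches.append((inv["number"], line["description"]))
                break
    return matches

print(match(invoices, statement))  # [('INV-118', 'ACME SAAS'), ('INV-131', 'HOSTING CO')]
```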
jamesblonde · 2h ago
"The results transformed our system, and our query latency went from 3-4s to 30ms."
Ignoring the trade-offs introduced, the MUVERA paper reported a 90% drop in latency, with evidence in the form of a research paper.
Yet you are reporting "99%" drops in latency. Big claims require big evidence.
jasonthorsness · 3h ago
It makes sense that a lossy transformation (OCR, which removes structure) would be worse than a perceptually lossless one (because even if the PDF file has additional information, you only see the rendered visual). But it's cool and a little surprising that the multi-modal models are getting this good at interpreting images!
abc03 · 3h ago
Related question: what is today's best solution for invoices?
ArnavAgrawal03 · 3h ago
This would depend on the exact use case. Feeding in the invoice directly to the model is - in my opinion - the best way to approach this. If you need to search over them, then directly embedding them as images is definitely a strong approach. Here's something we wrote explaining the process: https://www.morphik.ai/docs/concepts/colpali