FastVLM: Efficient Vision Encoding for Vision Language Models

Comments (3)

meatmanek · 7h ago

I guess this is the paper / announcement about https://github.com/apple/ml-fastvlm, which was previously discussed in https://news.ycombinator.com/item?id=44661527

yorwba · 4h ago

I think you meant to link to https://news.ycombinator.com/item?id=43968897

godelski · 27m ago

Personally, I didn't find too much value in this paper. I think it is good as a product demonstration; I just don't know if it added a ton of value to the research space (but maybe it did, because people have been making the same mistake for a while?).

I actually think the linked page makes it very easy to understand my main critique. The main problem here is that downscaling is a destructive process. It destroys information. Zoom in on that sign, can you read it?[0] No! But can you in the high res?[1] Of course!
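
To make the "destroys information" point concrete, here is a minimal sketch in plain Python. It assumes a simple box filter (averaging each block of pixels), which is only one of many downscaling methods and not necessarily what any given pipeline uses; the function name and the tiny 4x4 images are made up for illustration. Two clearly different images collapse to the same small image, so nothing downstream can tell them apart.

```python
def box_downsample(img, factor):
    """Average each factor x factor block into one pixel (a simple box filter)."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(0, h - h % factor, factor):
        row = []
        for c in range(0, w - w % factor, factor):
            block = [img[r + dr][c + dc]
                     for dr in range(factor)
                     for dc in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


# Two different 4x4 grayscale "images" (illustrative data only):
# one-pixel-wide black/white stripes, and a flat mid-gray.
stripes = [[0, 255, 0, 255] for _ in range(4)]
gray = [[127.5] * 4 for _ in range(4)]

small_stripes = box_downsample(stripes, 2)
small_gray = box_downsample(gray, 2)

# The inputs differ, but the 2x downsampled outputs are identical.
print(stripes == gray)              # False
print(small_stripes == small_gray)  # True
```

Once the fine detail (the stripes, or the lettering on a sign) has been averaged away, no encoder can recover it from the small image alone.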

We can of course train the model on those signs alone and then get it to recognize what the sign should say, the same way you might do this (not by reading words, but by reading the symbol), but we may run into problems when downsampling images, especially with subtle biases that those algorithms can create, which even includes tiling[3].

If the main thesis is "training on larger resolution results in better performance on high resolution images" then this seems to be a conclusion we already knew from a pure mathematical understanding of entropy, and is something many researchers have been discussing for decades.

There are a lot of evaluations here, but it is not clear to me that the architecture is playing the main role. There is very little in the ablation study and a larger focus on dataset coverage. Table 1 is difficult to interpret. While I commend the fine-tuning of ViT, it would not isolate the entropy problem, as (IIRC) ViT was pretrained on 224x224 resolution images and then fine-tuned to a higher resolution. More fine-tuning isn't going to make that problem go away. Table 2 helps us understand pooling but does more in terms of dataset coverage than coverage of the solution space.

I don't think it is bad in the way of "this is not a useful thing that was built" but "the way this is communicated makes it difficult for me as a researcher to interpret the reason for these results." In a way, my criticism here is much more general than just this paper. I am frustrated with the recent trend in AI research of putting more focus on coverage of datasets than on interpretation. By interpretation I mean things like more in-depth ablations (e.g. holding variables constant, changing specific parameters for a test[4]). There isn't infinite compute, so I'm not expecting the world. But in the trade-off between dataset coverage and more thorough ablations, I'd significantly prefer the latter. It is entirely possible that the architectural changes here are critical to the model's ability to properly encode the information. There are hints at it in the paper, but it is difficult to distinguish their effect from that of the training procedures and simply the entropy. There are many moving parts and the information provided is not enough to distinguish them (or to distinguish them to an acceptable threshold). I don't entirely blame researchers for making their choice in trade-offs; we can't encourage more in-depth ablations until reviewers stop using "what about x dataset" as an excuse[5]. This paradigm of dataset coverage really feels like a lot of wasted compute. And honestly, I suspect we'd make far more improvements were we to change paradigms, and many of those improvements would come from much smaller labs without these large compute resources.
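
As a rough illustration of the kind of ablation meant here, the sketch below enumerates one-factor-at-a-time runs against a fixed baseline and compares the count to a full grid. Every factor name and value is hypothetical and chosen only for the example; none of them are the paper's actual settings. It also assumes the factors can be varied independently, which (per [4]) is not always true.

```python
from itertools import product

# Hypothetical factors for illustration only; not the paper's settings.
baseline = {"encoder": "A", "resolution": 224, "pooling": "none"}
alternatives = {
    "encoder": ["B"],
    "resolution": [448, 896],
    "pooling": ["avg"],
}


def one_at_a_time(baseline, alternatives):
    """Baseline run plus runs that change exactly one factor from it."""
    runs = [dict(baseline)]
    for name, values in alternatives.items():
        for value in values:
            run = dict(baseline)
            run[name] = value
            runs.append(run)
    return runs


def full_grid(baseline, alternatives):
    """Every combination of baseline and alternative values."""
    names = list(baseline)
    levels = [[baseline[n]] + alternatives.get(n, []) for n in names]
    return [dict(zip(names, combo)) for combo in product(*levels)]


oaat_runs = one_at_a_time(baseline, alternatives)
grid_runs = full_grid(baseline, alternatives)

print(len(oaat_runs))  # 5 runs: each differs from the baseline in one factor
print(len(grid_runs))  # 12 runs: covers interactions between factors
```

The one-at-a-time set is cheap and attributes each difference to a single change, but it cannot reveal interactions between coupled factors; the full grid can, at a cost that grows multiplicatively with each added factor.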

[0] Small Res: http://0x0.st/8nU3.png

[1] High Res: https://0x0.st/8nUE.png

[2] https://www.cs.cmu.edu/~clean-fid/

[3] https://arxiv.org/abs/2104.05704

[4] It would be nice to change one parameter at a time but sometimes things are coupled.

[5] "I'm curious about performance on x dataset because x dataset has y quality that I think is important" is a perfectly fine critique. But I rarely see that type of criticism in reviews. They include the demand but not the motivation for the demand. That just leads to noisy reviewing, as an author can't infer whether the reviewer is asking because they're lazy or because they think the lack of inclusion undermines the author's claims.
