Life of an inference request (vLLM V1): How LLMs are served efficiently at scale

61 points by samaysharma on 6/28/2025, 6:42:05 PM | ubicloud.com

Comments (3)

0xjunhao · 1h ago
Hi, I'm the author of this post. Writing it was a great learning experience. I gained a lot of insight into vLLM. If you have any feedback or questions, feel free to drop a comment below!
criemen · 1h ago
Thanks for writing the article!

I didn't quite get this part:

> Note that during the prefill phase, all prompt tokens from a request can be processed in one batch. This is possible because the query (Q) tensors, calculated from the tokens immediately before them, are available for each prompt token position.

I know that in practice prefill is much faster per token than decoding. Would watching the 2h video from Karpathy help me understand why?
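(Editor's note: a minimal, hypothetical sketch of the distinction the quoted passage describes, written in plain PyTorch rather than vLLM's actual code, with made-up shapes. During prefill, the query for every prompt position is known up front, so one causally masked matrix multiply covers the whole prompt in a single batch; during decode, only the newest token's query exists, so each step attends over the growing KV cache one token at a time.)

```python
# Illustrative sketch only, not vLLM internals. Shapes and names are hypothetical.
import torch

d = 64            # head dimension (assumed)
prompt_len = 8    # number of prompt tokens (assumed)

# Assume the Q/K/V projections have already been computed for every prompt position.
Q = torch.randn(prompt_len, d)   # one query row per prompt token
K = torch.randn(prompt_len, d)
V = torch.randn(prompt_len, d)

# Prefill: all prompt positions attend in one pass; a causal mask keeps each
# position from seeing tokens that come after it.
scores = Q @ K.T / d**0.5                                   # (prompt_len, prompt_len)
causal_mask = torch.tril(torch.ones(prompt_len, prompt_len)).bool()
scores = scores.masked_fill(~causal_mask, float("-inf"))
prefill_out = torch.softmax(scores, dim=-1) @ V             # whole prompt at once

# Decode: only the newest token's query exists, so each step is a single query row
# attending over the cached K/V (the KV cache) extended by the new token's entry.
q_new = torch.randn(1, d)
k_new, v_new = torch.randn(1, d), torch.randn(1, d)
K_cache = torch.cat([K, k_new])                             # cache grows by one per step
V_cache = torch.cat([V, v_new])
decode_out = torch.softmax(q_new @ K_cache.T / d**0.5, dim=-1) @ V_cache
```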

criemen · 1h ago
And on the topic of prefill: do you know what role the GPU plays there versus during decoding?