Life of an inference request (vLLM V1): How LLMs are served efficiently at scale

61 points by samaysharma on 6/28/2025, 6:42:05 PM | ubicloud.com

Comments (3)

0xjunhao · 1h ago
Hi, I'm the author of this post. Writing it was a great learning experience. I gained a lot of insight into vLLM. If you have any feedback or questions, feel free to drop a comment below!
criemen · 1h ago
Thanks for writing the article!

I didn't quite get this part:

> Note that during the prefill phase, all prompt tokens from a request can be processed in one batch. This is possible because the query (Q) tensors, calculated from the tokens immediately before them, are available for each prompt token position.

I know that in practice prefill is much faster per token than decoding. Would watching the 2h video from Karpathy help me understand why?
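(Editor's note: a minimal, hypothetical sketch of the distinction the quoted passage describes, written in plain PyTorch rather than vLLM's actual code, with made-up shapes. During prefill, the query for every prompt position is known up front, so one causally masked matrix multiply covers the whole prompt in a single batch; during decode, only the newest token's query exists, so each step attends over the growing KV cache one token at a time.)

```python
# Illustrative sketch only, not vLLM internals. Shapes and names are hypothetical.
import torch

d = 64            # head dimension (assumed)
prompt_len = 8    # number of prompt tokens (assumed)

# Assume the Q/K/V projections have already been computed for every prompt position.
Q = torch.randn(prompt_len, d)   # one query row per prompt token
K = torch.randn(prompt_len, d)
V = torch.randn(prompt_len, d)

# Prefill: all prompt positions attend in one pass; a causal mask keeps each
# position from seeing tokens that come after it.
scores = Q @ K.T / d**0.5                                   # (prompt_len, prompt_len)
causal_mask = torch.tril(torch.ones(prompt_len, prompt_len)).bool()
scores = scores.masked_fill(~causal_mask, float("-inf"))
prefill_out = torch.softmax(scores, dim=-1) @ V             # whole prompt at once

# Decode: only the newest token's query exists, so each step is a single query row
# attending over the cached K/V (the KV cache) extended by the new token's entry.
q_new = torch.randn(1, d)
k_new, v_new = torch.randn(1, d), torch.randn(1, d)
K_cache = torch.cat([K, k_new])                             # cache grows by one per step
V_cache = torch.cat([V, v_new])
decode_out = torch.softmax(q_new @ K_cache.T / d**0.5, dim=-1) @ V_cache
```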

criemen · 1h ago
And on the topic of prefill: do you know what role the GPU plays there versus during decoding?