Hi, I'm the author of this post. Writing it was a great learning experience. I gained a lot of insight into vLLM. If you have any feedback or questions, feel free to drop a comment below!
criemen · 1h ago
Thanks for writing the article!
I didn't quite get this part:
Note that during the prefill phase, all prompt tokens from a request can be processed in one batch. This is possible because the query (Q) tensors, calculated from the tokens immediately before them, are available for each prompt token position.
I know that in practice prefill is much faster than inference. Would watching the 2h video from Karpathy help me understand why?
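Here's roughly how I picture the difference, as a toy numpy sketch (not vLLM's actual code, and the shapes/names are just my own guesses) — is this the right mental model?

```python
import numpy as np

d = 8                      # toy head dimension
prompt_len = 5
np.random.seed(0)

x = np.random.randn(prompt_len, d)        # embeddings of all prompt tokens
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

# Prefill: one pass computes Q, K, V for every prompt position at once,
# so attention for the whole prompt is a single batched matmul.
Q, K, V = x @ Wq, x @ Wk, x @ Wv
mask = np.triu(np.full((prompt_len, prompt_len), -np.inf), k=1)  # causal mask
scores = Q @ K.T / np.sqrt(d) + mask
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
prefill_out = weights @ V                 # outputs for all prompt positions in one shot

# Decode: only one new token exists per step, so each step computes a single
# query and attends over the cached K/V -- it can't be batched across positions
# the same way.
x_new = np.random.randn(1, d)             # embedding of the newly generated token
q_new = x_new @ Wq
K = np.vstack([K, x_new @ Wk])            # KV cache grows by one entry
V = np.vstack([V, x_new @ Wv])
s = q_new @ K.T / np.sqrt(d)
w = np.exp(s - s.max()); w /= w.sum()
decode_out = w @ V                        # output for just this one token
```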
criemen · 1h ago
And on the topic of prefill: do you know what the role of GPUs is in prefill vs. in inference?