Tokasaurus: An LLM Inference Engine for High-Throughput Workloads

113 points by rsehrlich on 6/5/2025, 9:27:07 PM | scalingintelligence.stanford.edu

Comments (9)

refibrillator · 51m ago
The code has few comments but gotta love when you can tell someone was having fun!

https://github.com/ScalingIntelligence/tokasaurus/blob/65efb...

I’m honestly impressed that a pure Python implementation can beat out vLLM and SGLang. Granted, they lean on FlashInfer, and of course torch.compile has gotten incredibly powerful in the last few years. Dynamic shapes have still been a huge thorn in my side, though, so I’ll need to look closer at how they pulled it off…
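
For what it's worth, one common workaround for the dynamic-shape problem (just a sketch of general practice, not necessarily what Tokasaurus does) is to bucket and pad the token count so the compiled graph only ever sees a handful of static shapes:

    import torch

    @torch.compile(dynamic=False)  # static shapes; compiles once per bucket size
    def fused_step(hidden: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.silu(hidden @ weight)

    def round_up_to_bucket(n: int, buckets=(512, 1024, 2048, 4096, 8192)) -> int:
        # Pick the smallest bucket that fits the current number of tokens.
        for b in buckets:
            if n <= b:
                return b
        return buckets[-1]

    def run(hidden: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
        n = hidden.shape[0]
        padded = round_up_to_bucket(n)
        # Pad to the bucketed length so the compiled graph is reused across batches
        # instead of triggering a recompile for every new sequence length.
        buf = torch.zeros(padded, hidden.shape[1], device=hidden.device, dtype=hidden.dtype)
        buf[:n] = hidden
        return fused_step(buf, weight)[:n]

The bucket sizes here are made up; the point is just trading a bit of padded compute for a bounded number of compilations.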

symbolicAGI · 4h ago
Given chat and API needs for low latency, llama.cpp is probably still the best choice for self-hosted models, with or without GPU support. And Ollama is the leader for wrapping llama.cpp.

Because Tokasaurus was mentioned as better than Ollama for running Darwinian Gödel Machine operations (self-improvement), I looked for the linked repo on GitHub and it was 404. So glad it is back: https://github.com/ScalingIntelligence/tokasaurus.

nabakin · 4h ago
> On throughput-focused benchmarks, Tokasaurus can outperform vLLM and SGLang by up to 3x+.

Looks like they don't compare to TensorRT-LLM throughput numbers which, last I checked, are SOTA in open source.

andersa · 2h ago
TensorRT-LLM being open source is a lie; all the important kernels are loaded from cubins.
radq · 1h ago
Cool project! The codebase is simple and well documented, a good starting point for anyone interested in how to implement a high-performance inference engine. The prefix sharing is very relevant for anyone running batch inference to generate RL rollouts.
behnamoh · 5h ago
While Tokasaurus’s Async-TP shows impressive throughput gains, it seems over-engineered for common use cases. The CPU overhead from async tensor parallelism only pays off at 6k+ token batches, and you need NVLink-connected GPUs to see real benefits. Most prod deployments don’t need this complexity — you’re better off with simpler approaches unless you’re specifically optimizing for massive batch throughput. The adaptive manager skipping “optional” tasks under load also feels concerning from a reliability perspective.
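
Concretely, the kind of dispatch heuristic I'd expect looks something like this (the threshold is just the figure quoted above and the function is hypothetical, not a real Tokasaurus config knob):

    def choose_tp_mode(batch_tokens: int, has_nvlink: bool,
                       async_tp_threshold: int = 6144) -> str:
        # Only pay the extra CPU overhead of async tensor parallelism when the
        # batch is large enough and the GPUs are NVLink-connected.
        if has_nvlink and batch_tokens >= async_tp_threshold:
            return "async_tp"   # overlap communication with compute
        return "plain_tp"       # simpler, lower overhead, fine for latency-focused serving

If your deployment rarely crosses that threshold, the simpler path wins.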
bjt12345 · 4h ago
But surely next year's production deployments will be very different from today's, with different use cases, etc.
jdiff · 3h ago
Sure. Things change over time. Is there a reason to believe they'd be different in such a way that this would be more useful than in today's landscape? I haven't seen such a forecast myself.
YetAnotherNick · 4h ago
Depends on what production means for you. This is useful for batch production jobs.

Also, this seems very useful for generating synthetic data or labelling a bunch of data. A 6k batch size is small for data labelling.