Show HN: TokenDagger – A tokenizer faster than OpenAI's Tiktoken
156 points by matthewolfe | 6/30/2025, 12:33:58 PM | 44 comments | github.com
TokenDagger is a drop-in replacement for OpenAI’s Tiktoken (the tokenizer behind Llama 3, Mistral, GPT-3.*, etc.). It’s written in C++ 17 with thin Python bindings, keeps the exact same BPE vocab/special-token rules, and focuses on raw speed.
I’m teaching myself LLM internals by re-implementing the stack from first principles. Profiling Tiktoken’s Python/Rust implementation showed a lot of time was spent on regex matching. Most of my perf gains come from (a) using a faster JIT-compiled regex engine, and (b) simplifying the algorithm to forgo regex matching of special tokens entirely.
Benchmarking code is included. Notable results:
- 4x faster code-sample tokenization on a single thread.
- 2-3x higher throughput when tested on a 1GB natural-language text file.
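The repo ships its own benchmark scripts; the following is only a minimal sketch of the kind of single-threaded timing loop involved. The file path is made up, and the comparison against a drop-in replacement is assumed to use the same `.encode()` surface the post describes.

```python
# Minimal single-thread timing sketch -- NOT the benchmark code shipped in
# the repo. Assumes `tiktoken` is installed and a local text file exists.
import time
import tiktoken

def bench(encode, text, runs=5):
    """Return the best wall-clock time over a few runs."""
    best = float("inf")
    for _ in range(runs):
        t0 = time.perf_counter()
        encode(text)
        best = min(best, time.perf_counter() - t0)
    return best

text = open("sample.txt", encoding="utf-8").read()  # hypothetical input file
enc = tiktoken.get_encoding("cl100k_base")

baseline = bench(enc.encode, text)
print(f"tiktoken: {baseline:.3f}s  ({len(text) / baseline / 1e6:.1f} MB/s)")

# A drop-in replacement exposing the same .encode() API can be timed with
# the identical loop and compared against this baseline.
```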
1. Make it work. 2. Make it fast. 3. Make it pretty.
Transformers & LLMs have been developed to a point where they work quite well. I feel as though we're at a stage where most substantial progress is being made on the performance side.
My mentor used to say it is the difference between a screw and glue.
You can glue some things together and prove that it works, but eventually you learn that anytime you had to break something to fix it, you should've used a screw.
It is a trade-off in coupling: the glue binds tightly over the entire surface, but a screw concentrates the loads, so it needs maintenance to stay tight.
You only really know which is "right" if you test it to destruction.
All of that advice is probably sounding dated now; even in materials science the glue might be winning (see the Tesla bumper or Lotus Elise bonding videos - every screw is extra grams).
Iteration speed trumps all in research. Most of what Python does is launch GPU operations; if you're seeing slowdowns from Pythonland, you're doing something terribly wrong.
Python is an excellent (and yes, fast!) language for orchestrating and calling ML stuff. If C++ code is needed, call it as a module.
ML researchers aren’t using Python because they are dumb. They use it because what takes 8 lines in Java can be done in 2 or 3 (including `import json`) in Python, for example.
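A trivial illustration of that point (the file name is made up):

```python
# Parse a JSON config in a couple of lines -- the kind of boilerplate
# savings the comment is referring to. "config.json" is a hypothetical path.
import json

with open("config.json", encoding="utf-8") as f:
    cfg = json.load(f)
```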
If you think Python is a bad language for AI integrations, try writing one in a compiled language.
ScyllaDB comes to mind
The takeaway I also found was that the running cost was really dominated by pretokenization (the regex). It's cool to see that you found a faster way to run the regex, but have you tried comparing the performance of just swapping out the regex engine and leaving the actual BPE to tiktoken? I wonder if that is upstreamable?
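A rough way to see that split for yourself is sketched below. The pattern is a simplified GPT-2-style stand-in, not the exact cl100k_base pretokenizer regex, and the input path is an assumption.

```python
# Rough profiling sketch: how much of encode() time is pretokenization?
# Requires `tiktoken` and the third-party `regex` module.
import time
import regex
import tiktoken

# Simplified stand-in pattern, NOT the real cl100k_base pretokenizer regex.
pat = regex.compile(r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+")
enc = tiktoken.get_encoding("cl100k_base")
text = open("sample.txt", encoding="utf-8").read()  # hypothetical input

t0 = time.perf_counter()
pieces = pat.findall(text)   # pretokenization only
t1 = time.perf_counter()
ids = enc.encode(text)       # pretokenization + BPE merges
t2 = time.perf_counter()

print(f"pretokenize only: {t1 - t0:.3f}s ({len(pieces)} pieces)")
print(f"full encode:      {t2 - t1:.3f}s ({len(ids)} tokens)")
```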
I've reached out to the guy who maintains Tiktoken to talk about this.
[1] https://crates.io/crates/bpe
For this project, I think there is value in building some of these model-specific quirks into the library. I could see some minor performance gains, and it would generally make the library easier to integrate with. It's probably not too much work to keep up with newer models; tokenizers change much less frequently.
[0] https://github.com/meta-llama/llama-models/blob/01dc8ce46fec...
[1] https://github.com/mistralai/mistral-common/tree/main/src/mi...
Out of the large proprietary Western AI labs (OpenAI, Anthropic, Google), only Anthropic with Claude 3 and newer lacks local tokenizers.
[1] https://github.com/google/sentencepiece
[2] https://github.com/googleapis/python-aiplatform/blob/main/ve...
[3] https://storage.googleapis.com/deepmind-media/gemma/gemma-2-...: "We inherit from the large Gemini vocabulary (256k entries)."
[4] https://storage.googleapis.com/deepmind-media/gemma/Gemma3Re...: "We use the same tokenizer as Gemini 2.0."
Or maybe even your speedups from "b" could be applied in the pure JS implementation.
Does that mean there could be cases where tokenization quality is lower?
The Tiktoken implementation takes a collection of all special tokens upon initialization and compiles them into a regex by joining them with `|` [0]. Then the actual encoding process checks for matches on this expression.
Models like Llama 4 define a list of 1,135 special tokens. Notably, 1,115 of those are "reserved" special tokens! So this yields a huge regexp of special tokens that shouldn't be considered at all.
TokenDagger does not do this. Instead, simple string matching is used. This works because we don't need to consider the entire special vocabulary every time. The caller of `encode` must explicitly define which special tokens should be considered [1]. So it's faster to check against the much smaller list we _know_ is being used.
[0] https://github.com/openai/tiktoken/blob/main/src/lib.rs#L476
[1] https://github.com/openai/tiktoken/blob/main/tiktoken/core.p...
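A toy contrast of the two strategies described above (the token names and counts are illustrative, not Llama 4's actual special vocab, and neither function is TokenDagger's or Tiktoken's real code):

```python
# Toy sketch: alternation regex over the whole special vocab vs. plain
# string matching against only the specials the caller allowed.
import re

special_tokens = {f"<|reserved_{i}|>" for i in range(1115)} | {"<|eot|>"}

# Tiktoken-style: one alternation over the ENTIRE special vocab, checked
# on every encode call.
special_re = re.compile("|".join(re.escape(t) for t in special_tokens))

def find_specials_regex(text):
    return [(m.start(), m.group()) for m in special_re.finditer(text)]

# Plain-string-matching idea: only scan for the specials the caller
# explicitly allowed (typically a handful, not 1,135).
def find_specials_plain(text, allowed_special):
    hits = []
    for tok in allowed_special:
        start = 0
        while (i := text.find(tok, start)) != -1:
            hits.append((i, tok))
            start = i + len(tok)
    return sorted(hits)

text = "hello <|eot|> world"
assert find_specials_regex(text) == find_specials_plain(text, {"<|eot|>"})
```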
During this process I also asked ChatGPT a lot of questions.
I'm definitely open to suggestions about "how to learn" with all the new tools we have. I felt this has not been straightforward to figure out.
[0] https://modal.com/gpu-glossary
[1] https://www.youtube.com/watch?v=7xTGNNLPyMI
[2] https://www.youtube.com/watch?v=wjZofJX0v4M
[3] https://siboehm.com/articles/22/CUDA-MMM