Show HN: TokenDagger – A tokenizer faster than OpenAI's Tiktoken
156 points by matthewolfe | 6/30/2025, 12:33:58 PM | 44 comments | github.com
TokenDagger is a drop-in replacement for OpenAI’s Tiktoken (the tokenizer behind Llama 3, Mistral, GPT-3.*, etc.). It’s written in C++ 17 with thin Python bindings, keeps the exact same BPE vocab/special-token rules, and focuses on raw speed.
I’m teaching myself LLM internals by re-implementing the stack from first principles. Profiling Tiktoken’s Python/Rust implementation showed a lot of time was spent on regex matching. Most of my perf gains come from (a) using a faster JIT-compiled regex engine, and (b) simplifying the algorithm to forgo regex matching of special tokens entirely.
Benchmarking code is included. Notable results:
- 4x faster code-sample tokenization on a single thread.
- 2-3x higher throughput when tested on a 1GB natural-language text file.
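The repo ships its own benchmark scripts; the following is only a minimal sketch of the kind of single-threaded timing loop involved. The file path is made up, and the comparison against a drop-in replacement is assumed to use the same `.encode()` surface the post describes.

```python
# Minimal single-thread timing sketch -- NOT the benchmark code shipped in
# the repo. Assumes `tiktoken` is installed and a local text file exists.
import time
import tiktoken

def bench(encode, text, runs=5):
    """Return the best wall-clock time over a few runs."""
    best = float("inf")
    for _ in range(runs):
        t0 = time.perf_counter()
        encode(text)
        best = min(best, time.perf_counter() - t0)
    return best

text = open("sample.txt", encoding="utf-8").read()  # hypothetical input file
enc = tiktoken.get_encoding("cl100k_base")

baseline = bench(enc.encode, text)
print(f"tiktoken: {baseline:.3f}s  ({len(text) / baseline / 1e6:.1f} MB/s)")

# A drop-in replacement exposing the same .encode() API can be timed with
# the identical loop and compared against this baseline.
```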
1. Make it work. 2. Make it fast. 3. Make it pretty.
Transformers & LLMs have been developed to a point where they work quite well. I feel as though we're at a stage where most substantial progress is being made on the performance side.
My mentor used to say it is the difference between a screw and glue.
You can glue some things together and prove that it works, but eventually you learn that anytime you had to break something to fix it, you should've used a screw.
It is a trade-off in coupling: the glue binds tightly over the entire surface, but a screw concentrates the loads, so it needs maintenance to stay tight.
You only really know which is "right" if you test it to destruction.
All of that advice is probably sounding dated now; even in materials science the glue might be winning (see the Tesla bumper or Lotus Elise bonding videos - every screw is extra grams).
Iteration speed trumps all in research. Most of what Python does is launch GPU operations; if you're seeing slowdowns from Pythonland, you're doing something terribly wrong.
Python is an excellent (and yes, fast!) language for orchestrating and calling ML stuff. If C++ code is needed, call it as a module.
ML researchers aren’t using Python because they are dumb. They use it because what takes 8 lines in Java can be done in 2 or 3 (including `import json`) in Python, for example.
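A trivial illustration of that point (the file name is made up):

```python
# Parse a JSON config in a couple of lines -- the kind of boilerplate
# savings the comment is referring to. "config.json" is a hypothetical path.
import json

with open("config.json", encoding="utf-8") as f:
    cfg = json.load(f)
```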
If you think Python is a bad language for AI integrations, try writing one in a compiled language.
ScyllaDB comes to mind
The takeaway I also found was that the running cost was really dominated by pretokenization (the regex). It's cool to see that you found a faster way to run the regex, but have you tried comparing the performance of just swapping out the regex engine and leaving the actual BPE to tiktoken? I wonder if that is upstreamable?
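A rough way to see that split for yourself is sketched below. The pattern is a simplified GPT-2-style stand-in, not the exact cl100k_base pretokenizer regex, and the input path is an assumption.

```python
# Rough profiling sketch: how much of encode() time is pretokenization?
# Requires `tiktoken` and the third-party `regex` module.
import time
import regex
import tiktoken

# Simplified stand-in pattern, NOT the real cl100k_base pretokenizer regex.
pat = regex.compile(r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+")
enc = tiktoken.get_encoding("cl100k_base")
text = open("sample.txt", encoding="utf-8").read()  # hypothetical input

t0 = time.perf_counter()
pieces = pat.findall(text)   # pretokenization only
t1 = time.perf_counter()
ids = enc.encode(text)       # pretokenization + BPE merges
t2 = time.perf_counter()

print(f"pretokenize only: {t1 - t0:.3f}s ({len(pieces)} pieces)")
print(f"full encode:      {t2 - t1:.3f}s ({len(ids)} tokens)")
```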
I've reached out to the guy who maintains Tiktoken to talk about this.
[1] https://crates.io/crates/bpe
For this project, I think there is value in building some of these model-specific quirks into the library. I could see some minor performance gains, and it would generally make the library easier to integrate with. It's probably not too much work to keep up with newer models; tokenizers change much less frequently.
[0] https://github.com/meta-llama/llama-models/blob/01dc8ce46fec...
[1] https://github.com/mistralai/mistral-common/tree/main/src/mi...
Out of the large proprietary Western AI labs (OpenAI, Anthropic, Google), only Anthropic with Claude 3 and newer lacks local tokenizers.
[1] https://github.com/google/sentencepiece
[2] https://github.com/googleapis/python-aiplatform/blob/main/ve...
[3] https://storage.googleapis.com/deepmind-media/gemma/gemma-2-...: "We inherit from the large Gemini vocabulary (256k entries)."
[4] https://storage.googleapis.com/deepmind-media/gemma/Gemma3Re...: "We use the same tokenizer as Gemini 2.0."
Or maybe even your speedups from "b" could be applied in the pure JS implementation.
Does that mean there could be cases where tokenization quality is lower?
The Tiktoken implementation takes a collection of all special tokens upon initialization and compiles them into a regex by joining them with `|` [0]. Then the actual encoding process checks for matches on this expression.
Models like Llama 4 define a list of 1,135 special tokens. Notably, 1,115 of those are "reserved" special tokens! So this yields a huge regexp of special tokens that shouldn't be considered at all.
TokenDagger does not do this. Instead, simple string matching is used. This works because we don't need to consider the entire special vocabulary every time. The caller of `encode` must explicitly define which special tokens should be considered [1]. So it's faster to check against the much smaller list we _know_ is being used.
[0] https://github.com/openai/tiktoken/blob/main/src/lib.rs#L476
[1] https://github.com/openai/tiktoken/blob/main/tiktoken/core.p...
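A toy contrast of the two strategies described above (the token names and counts are illustrative, not Llama 4's actual special vocab, and neither function is TokenDagger's or Tiktoken's real code):

```python
# Toy sketch: alternation regex over the whole special vocab vs. plain
# string matching against only the specials the caller allowed.
import re

special_tokens = {f"<|reserved_{i}|>" for i in range(1115)} | {"<|eot|>"}

# Tiktoken-style: one alternation over the ENTIRE special vocab, checked
# on every encode call.
special_re = re.compile("|".join(re.escape(t) for t in special_tokens))

def find_specials_regex(text):
    return [(m.start(), m.group()) for m in special_re.finditer(text)]

# Plain-string-matching idea: only scan for the specials the caller
# explicitly allowed (typically a handful, not 1,135).
def find_specials_plain(text, allowed_special):
    hits = []
    for tok in allowed_special:
        start = 0
        while (i := text.find(tok, start)) != -1:
            hits.append((i, tok))
            start = i + len(tok)
    return sorted(hits)

text = "hello <|eot|> world"
assert find_specials_regex(text) == find_specials_plain(text, {"<|eot|>"})
```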
During this process I also asked ChatGPT a lot of questions.
I'm definitely open to suggestions about "how to learn" with all the new tools we have. I felt this has not been straightforward to figure out.
[0] https://modal.com/gpu-glossary
[1] https://www.youtube.com/watch?v=7xTGNNLPyMI
[2] https://www.youtube.com/watch?v=wjZofJX0v4M
[3] https://siboehm.com/articles/22/CUDA-MMM