Show HN: KVSplit – Run 2-3x longer contexts on Apple Silicon

159 points by dipampaul17 on 5/16/2025 | github.com
I discovered that in LLM inference, keys and values in the KV cache have very different quantization sensitivities. Keys need higher precision than values to maintain quality.

I patched llama.cpp to enable different bit-widths for keys vs. values on Apple Silicon. The results are surprising:

- K8V4 (8-bit keys, 4-bit values): 59% memory reduction with only 0.86% perplexity loss
- K4V8 (4-bit keys, 8-bit values): 59% memory reduction but 6.06% perplexity loss
- The configurations use the same number of bits, but K8V4 is 7× better for quality

This means you can run LLMs with 2-3× longer context on the same Mac. Memory usage scales with sequence length, so savings compound as context grows.

Implementation was straightforward:

1. Added --kvq-key and --kvq-val flags to llama.cpp (see the sketch below for what they boil down to)
2. Applied existing quantization logic separately to K and V tensors
3. Validated with perplexity metrics across context lengths
4. Used Metal for acceleration (with the -mlong-calls flag to avoid vectorization issues)
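
For folks who drive llama.cpp programmatically rather than through the CLI, here's a rough sketch of what the split-precision cache amounts to via the C API. Treat it as illustrative only: the field names (type_k, type_v, flash_attn) are from a recent llama.h, and "model.gguf" is just a placeholder path.

    // Sketch: split-precision KV cache via llama.cpp's C API.
    // The idea behind --kvq-key / --kvq-val is simply an independent
    // ggml storage type for the K cache vs. the V cache.
    #include "llama.h"

    int main() {
        llama_model_params mparams = llama_model_default_params();
        llama_model * model = llama_load_model_from_file("model.gguf", mparams); // placeholder path
        if (!model) return 1;

        llama_context_params cparams = llama_context_default_params();
        cparams.n_ctx      = 8192;
        cparams.type_k     = GGML_TYPE_Q8_0;  // K8: keys keep higher precision
        cparams.type_v     = GGML_TYPE_Q4_0;  // V4: values tolerate 4-bit storage
        cparams.flash_attn = false;           // this cache path bypasses flash attention

        llama_context * ctx = llama_new_context_with_model(model, cparams);
        if (!ctx) { llama_free_model(model); return 1; }

        // ... run inference as usual; only the KV cache storage format changed ...
        // (backend init, sampling, and error handling omitted for brevity)

        llama_free(ctx);
        llama_free_model(model);
        return 0;
    }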

Benchmarked on an M4 MacBook Pro running TinyLlama with 8K context windows. Compatible with Metal/MPS and optimized for Apple Silicon.

GitHub: https://github.com/dipampaul17/KVSplit

Comments (15)

matheist · 2h ago
Looks interesting! Is there any intuition for why this should be the case? Did you discover it via that intuition, or just random experimentation?

A note: your install script appears to still have a placeholder at the "apply patch" step. A suggestion: it might be more user-friendly to fork llama.cpp and include that fork as a git submodule, rather than making it a "git clone and apply patch" step.

A further note: everyone and their dog has a different local Python setup, so it might be nice to let people separate the llama.cpp stuff from the Python stuff rather than bake in a dependence on Homebrew Python.

therealsmith · 5m ago
Am I missing something? As far as I can see this patch does nothing except add new options that replicate the functionality of the existing --cache-type-k and --cache-type-v options.

Using `--flash-attn --cache-type-k q8_0 --cache-type-v q8_0` is a very well known optimization to save VRAM.

And it's also very well known that the keys are more sensitive to quantization than values. E.g. https://arxiv.org/abs/2502.15075

behnamoh · 2h ago
Would this patch be possible on MLX? I'm getting better speeds on MLX. That, combined with your approach, would finally let Mac users have long conversations at usable speeds.

3abiton · 18m ago
This is a brilliant idea and initiative. Does this also apply to GPUs? And I assume it should be compatible with other quantization techniques, though they'd probably require their own patches?

entrepy123 · 2h ago
Are these significantly faster/better on 64GB or 128GB Apple silicon (over 36GB or 48GB)?

I've been reading that large contexts and large models are just painfully slow, even on the fastest and largest Apple silicon that money can buy.

So I wonder if this helps make more use of greater memory, or if really smallish models are still where it's at for Apple silicon, practically speaking.

badmonster · 2h ago
I'm curious: is it possible to apply differentiated KV quantization (like K8V4) to models after they're already converted to .gguf format, or does this require rebuilding the model with special support? If it's compatible with any .gguf file, are there any limitations on model types (e.g. Mistral, Phi-3, etc.) or tokenizer configs?

dipampaul17 · 2h ago
Yes, that's one of the key benefits - KVSplit works with any existing .gguf model without requiring reconstruction or special conversion. The quantization happens at runtime on the KV cache, not during model loading or conversion.

This works because the KV cache is created during inference as tokens are processed, completely separate from the model weights themselves. The --kvq-key and --kvq-val flags simply tell llama.cpp how to store these intermediate tensors in memory.
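
Since it's purely a storage-format decision, you can ask ggml directly what each cache type costs per element. A tiny sketch (assumes you're compiling against llama.cpp's bundled ggml; the calls used are ggml_type_size / ggml_blck_size / ggml_type_name):

    // Per-element storage cost of the relevant cache types, queried from ggml.
    // ggml_type_size() returns bytes per block; ggml_blck_size() returns elements per block.
    #include "ggml.h"
    #include <cstdio>

    int main() {
        const ggml_type types[] = { GGML_TYPE_F16, GGML_TYPE_Q8_0, GGML_TYPE_Q4_0 };
        for (ggml_type t : types) {
            const double bytes_per_elem =
                (double) ggml_type_size(t) / (double) ggml_blck_size(t);
            std::printf("%-5s : %.4f bytes/element\n", ggml_type_name(t), bytes_per_elem);
        }
        // Expect roughly: f16 2.0000, q8_0 1.0625, q4_0 0.5625
        return 0;
    }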

I've tested it successfully with:

- Llama-3 models
- Mistral models
- Phi-2/Phi-3
- TinyLlama
- Qwen variants

The only limitation is that it requires llama.cpp's Metal backend, and you need to disable Flash Attention with -fa 0 since the current FA implementation in llama.cpp bypasses the custom KV cache format. The technique itself should work with any transformer architecture that uses a standard attention mechanism.

ondra · 2h ago
Is this any different from using --cache-type-k and --cache-type-v?

azinman2 · 34m ago
That’s what I want to know!

nico · 2h ago
Great work. This seems very interesting, but I need something slightly more high-level to relate to it.

Will it just allow me to run, let's say, a model with a 2048-token context window at a 4-6k context window? Or a 128k model (like gemma3) with a 256k+ context window?

What’s the ideal use case for local models?

Thank you

dipampaul17 · 2h ago
With the K8V4 configuration providing 59% memory savings, you can effectively run contexts 2.4× longer on the same hardware. A model with a 2048 token context can now handle about 5000 tokens, while an 8K context model can reach approximately 19.5K tokens.

In practical terms, this means processing entire books at once on a MacBook, analyzing large codebases without splitting files, or maintaining comprehensive conversation history in chat applications.

The memory savings scale linearly with context length - the longer your context window, the more absolute memory you save. On my M4 MacBook with 8K context, I reduced KV cache from 176MB to 72MB. At 128K context, that same percentage saving would free up gigabytes.

This optimization is most valuable when you're context-window limited rather than model-parameter limited. If you're hitting OOM errors due to long inputs rather than large model weights, KVSplit directly addresses your bottleneck.
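
Rough math behind those numbers, in case it's useful. I'm assuming TinyLlama-1.1B's attention shape here (22 layers, 4 KV heads, 64-dim heads) plus llama.cpp's q8_0/q4_0 block costs, so treat it as a back-of-the-envelope sketch:

    // Back-of-the-envelope KV cache sizing for TinyLlama-1.1B at 8K context.
    // Assumed shape: 22 layers, 4 KV heads (GQA), head_dim 64.
    // Storage costs: f16 = 2 B/elem, q8_0 = 34 B per 32 elems, q4_0 = 18 B per 32 elems.
    #include <cstdio>

    int main() {
        const double n_layer   = 22;
        const double n_kv_head = 4;
        const double head_dim  = 64;
        const double n_ctx     = 8192;

        // element count of the K cache (the V cache has the same count)
        const double elems = n_layer * n_kv_head * head_dim * n_ctx;

        const double f16  = 2.0;          // bytes per element
        const double q8_0 = 34.0 / 32.0;  // 1.0625 bytes per element
        const double q4_0 = 18.0 / 32.0;  // 0.5625 bytes per element

        const double mib = 1024.0 * 1024.0;
        const double fp16_cache = elems * f16  + elems * f16;   // K + V, both f16
        const double k8v4_cache = elems * q8_0 + elems * q4_0;  // K in q8_0, V in q4_0

        std::printf("FP16 cache : %6.1f MiB\n", fp16_cache / mib);                        // ~176 MiB
        std::printf("K8V4 cache : %6.1f MiB\n", k8v4_cache / mib);                        // ~71.5 MiB
        std::printf("Savings    : %5.1f %%\n", 100.0 * (1.0 - k8v4_cache / fp16_cache));  // ~59.4%
        return 0;
    }

That ~59% saving is also where the 2.4× figure comes from: 1 / (1 - 0.59) ≈ 2.4× more tokens fit in the same cache budget.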

kmacdough · 2h ago
> Will it just allow me to run let’s say a model with a 2048 token context window with a 4-6k context window

It reduces the memory footprint of a particular model. You can do what you like with that. Extending the context window post-training isn't trivial, so unless you know what you're doing, you'd be better off finding a model trained on a larger context window.

There are many uses for local models, like working offline or privacy/security. Most folks, though, are using them to experiment with tweaking models.

nico · 2h ago
Will that make the model run/feel faster?

I can run models with 30-40b parameters on my computer, but they feel a lot slower than the 1-7b ones

So would this make the 30-40b parameter models run faster? Or at least “feel” faster?

smcleod · 1h ago
+0.86% perplexity is quite a bit at such a small context size though, isn't it? How is it at more reasonable context sizes like 64-128k?

nomel · 33m ago
> This means you can run LLMs with 2-3× longer context on the same Mac. Memory usage scales with sequence length, so savings compound as context grows.

The point seems to be that this reduces the memory footprint. That makes it possible to run longer contexts in the same limited memory, if you couldn't before. Or you can put the freed memory toward something else, like an IDE.