Show HN: Whisper at 1.58 bits with custom kernels for edge inference

6 points by coolhanhim · 2 comments · 7/28/2025, 1:18:19 PM · medium.com

Comments (2)

coolhanhim · 6h ago
We quantized OpenAI’s Whisper model to 1.58 bits using Quantization-Aware Training (QAT) to run speech recognition on resource-constrained embedded CPUs. Post-Training Quantization (PTQ) was unsuccessful below 4 bits, so we conducted QAT with a replicated dataset. To make inference feasible, we also implemented custom low-bit kernels optimized for edge deployment. This post walks through the technical challenges and how we addressed them to make extreme quantization work in real-world use.
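
To make the 1.58-bit part concrete, here is a minimal sketch of the weight fake-quantization step during QAT, assuming a BitNet-style absmean ternary scheme with a straight-through estimator (illustrative only, not the exact code from the post):

```python
import torch.nn as nn
import torch.nn.functional as F

class TernaryLinear(nn.Linear):
    """Linear layer with ternary (1.58-bit) weight fake-quantization."""
    def forward(self, x):
        w = self.weight
        # Per-tensor scale from the mean absolute weight (absmean scaling).
        scale = w.abs().mean().clamp(min=1e-5)
        # Round-and-clip weights to {-1, 0, +1}, then rescale.
        w_q = (w / scale).round().clamp(-1, 1) * scale
        # Straight-through estimator: forward uses w_q, gradients flow to w.
        w_ste = w + (w_q - w).detach()
        return F.linear(x, w_ste, self.bias)
```

The 1.58 figure is log2(3) ≈ 1.585, the information content of a ternary weight; packing those weights and running them efficiently is what the custom kernels handle at inference time.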
conjecTech · 4h ago
Very nice work. Training these from scratch is a big undertaking.

- Did you train the encoder and decoder together or separately? It would be nice to have the encoder representation stay compatible with the existing Whisper implementation, since that would let you swap your implementation into models where it's used as a component, like the recent Voxtral model. I'd imagine it might make training a bit faster as well.

- Did you consider training the turbo model as well?