The headline feature isn’t the 25 MB footprint alone. It’s that KittenTTS is Apache-2.0. That combo means you can embed a fully offline voice in Pi Zero-class hardware or even battery-powered toys without worrying about GPUs, cloud calls, or restrictive licenses. In one stroke it turns "voice everywhere" from a hardware/licensing problem into a packaging problem. Quality tweaks can come later; unlocking that deployment tier is the real game-changer.
keyle · 2h ago
I don't mind the size in MB so much, nor the fact that it's pure CPU, nor the quality; what I do mind, however, is the latency. I hope it's fast.
Aside: Are there any models for understanding voice to text, fully offline, without training?
I will be very impressed when we will be able to have a conversation with an AI at a natural rate and not "probe, space, response"
Teever · 5m ago
Any idea what factors play into latency in TTS models?
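Typical latency drivers are the text frontend, autoregressive vs. parallel decoding, and the vocoder pass, but whatever the cause it's easy to measure. A minimal sketch that times a synthesis call and computes the real-time factor; the synthesizer here is a stub I made up, but any callable returning samples (e.g. a wrapper around a real model) plugs in the same way:

```python
import time

def measure_tts_latency(synthesize, text, sample_rate=24000):
    """Time one synthesis call and report total latency plus the
    real-time factor (RTF): synthesis time / audio duration.
    RTF < 1.0 means audio is generated faster than it plays back."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    return {
        "latency_s": elapsed,
        "audio_s": audio_seconds,
        "rtf": elapsed / audio_seconds if audio_seconds else float("inf"),
    }

# Stub standing in for a real model: ~1 s of silence per 15 characters.
def fake_synth(text):
    return [0.0] * (len(text) * 24000 // 15)

stats = measure_tts_latency(fake_synth, "Hello from a tiny on-device voice.")
```

For interactive use, time-to-first-audio (synthesizing just the first chunk) usually matters more than total RTF.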
Does anybody find it funny that sci-fi movies have to heavily distort "robot voices" to make them sound "convincingly robotic"? A robotic, explicitly non-natural voice would be perfectly acceptable, and even desirable, in many situations. I don't expect a smart toaster to talk like a BBC host; it'd be enough if the speech is easy to recognize.
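Incidentally, that movie-robot distortion is often little more than ring modulation: multiplying the speech signal by a low-frequency sine carrier (famously how the Daleks were voiced). A toy sketch, with arbitrary parameter choices:

```python
import math

def ring_modulate(samples, sample_rate=16000, carrier_hz=50.0):
    """Multiply each sample by a sine carrier -- the classic
    robot-voice trick. Output has the same length as the input."""
    return [s * math.sin(2 * math.pi * carrier_hz * i / sample_rate)
            for i, s in enumerate(samples)]

# One second of a 220 Hz tone standing in for speech.
tone = [math.sin(2 * math.pi * 220 * i / 16000) for i in range(16000)]
robotic = ring_modulate(tone)
```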
The voice sounds great! I find it quite aesthetically pleasing, but it's far from genderless.
degamad · 56m ago
Interesting concept, but why is that site filled with Top X blogspam?
Retr0id · 2h ago
I tried to replicate their demo text but it doesn't sound as good for some reason.
If anyone else wants to try:
> Kitten TTS is an open-source series of tiny and expressive text-to-speech models for on-device applications. Our smallest model is less than 25 megabytes.
> Error generating speech: failed to call OrtRun(). ERROR_CODE: 2, ERROR_MESSAGE: Non-zero status code returned while running Expand node. Name:'/bert/Expand' Status Message: invalid expand shape
Not open source. "You will need internet connectivity to validate your AccessKey with Picovoice license servers ... If you wish to increase your limits, you can purchase a subscription plan." https://github.com/Picovoice/orca#accesskey
satvikpendem · 2h ago
Does an APK exist for Android to replace its text-to-speech engine? I tried sherpa-onnx but it seemed too slow for real-time usage, especially for audiobooks when sped up.
I can't test this out right now; is this just a demo, or is it actually an APK for replacing the engine? Those are two different things: the latter can be used any time you want to read something aloud on a page, for example. This is the sherpa-onnx one I'm talking about.
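Whichever engine ends up doing the reading, one trick that helps with long pages and audiobooks is to split the text into sentence-sized chunks and synthesize them incrementally, so playback can start before the whole text is processed. A rough sketch (the 200-character cap is an arbitrary assumption):

```python
import re

def chunk_sentences(text, max_chars=200):
    """Split text at sentence boundaries into chunks of roughly
    max_chars, so a TTS engine can speak chunk 1 while chunk 2 is
    still being synthesized. An over-long sentence becomes its own
    chunk rather than being cut mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Feed each chunk to the engine as the previous one plays.
chunks = chunk_sentences("First sentence. Second one! A third follows? Done.")
```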
Wow, amazing work! I hope to see more impressive models running on CPUs!
sandreas · 1h ago
Cool.
While I think this is indeed impressive and has a specific use case (e.g. in the embedded sector), I'm not totally convinced that the quality is good enough to replace bigger models.
With fish-speech[1] and f5-tts[2] there are at least two open-source models pushing the quality limits of offline text-to-speech. I tested F5-TTS on an old NVIDIA 1660 (6 GB VRAM) and it worked OK-ish, so running it on slightly more modern hardware will not cost you a fortune and will produce MUCH higher quality, with multi-language and zero-shot support.
For Android there is SherpaTTS[3], which plays pretty well with most TTS Applications.
I hope this is the future. Offline, small ML models, running inference on ubiquitous, inexpensive hardware. Models that are easy to integrate into other things, into devices and apps, and even to drive from other models maybe.
wkat4242 · 2h ago
Hmm, the quality is not so impressive. I'm looking for a really natural-sounding model. I'm not very happy with Piper/Kokoro, and XTTS was a bit complex to set up.
For STT, Whisper is really amazing, but I miss a good TTS, and I don't mind throwing GPU power at it. Anyway, this isn't it either; it sounds worse than Kokoro.
kamranjon · 2h ago
The best open one I've found so far is Dia - https://github.com/nari-labs/dia - it has some limitations, but i think it's really impressive and I can run it on my laptop.
> Hmm the quality is not so impressive. [...] And I don't mind throwing GPU power at it.
This isn't for you, then. You should evaluate quality here based on the fact you don't need a GPU.
Back in the pre-Tacotron2 days, I was running slim TTS and vocoder models like GlowTTS and MelGAN on Digital Ocean droplets. No GPU to speak of. It cost next to nothing to run.
Since then, the trend has been to scale up. We need more models to scale down.
In the future we'll see small models living on-device. Embedded within toys and tools that don't need or want a network connection. Deployed with Raspberry Pi.
Edge AI will be huge for robotics, toys and consumer products, and gaming (i.e. world models).
Thanks! Yeah. It definitely isn’t the absolute best in quality but it trounces the default TTS options on macOS (as third party developers are locked out of the Siri voices). And for less than the size of many modern web pages…
pkaye · 3h ago
Where does the training data come from for these models? Is there an openly available dataset that people use?
This is the model and its GitHub page; the blog post looks very much AI-generated.
mayli · 3h ago
Is this English only?
a2128 · 2h ago
If you're looking for other languages, Piper has been around in this scene for much longer and they have open-source training code and a lot of models (they're ~60MB instead of 25MB but whatever...) https://huggingface.co/rhasspy/piper-voices/tree/main
It sounds just OK, but that's impressive for the size.
https://clowerweb.github.io/node_modules/onnxruntime-web/dis...
(seems reverted now)
Doesn't seem to work with Thai.
https://k2-fsa.github.io/sherpa/onnx/tts/apk-engine.html
1: https://github.com/fishaudio/fish-speech
2: https://github.com/SWivid/F5-TTS
3: https://github.com/woheller69/ttsengine
https://codepen.io/logicalmadboy/pen/RwpqMRV
Foundational tools like this open up the possibility of one-time payment or even free tools.
It would be great if the training data were released too!
https://github.com/ggml-org/whisper.cpp
That being said, the ‘classical’ (pre-AI) speech synthesisers are much smaller than kitten, so you’re not wrong per se, just for the wrong reason.
https://project64.c64.org/Software/SAM10.TXT
Obviously it's not fair to compare these with ML models.
Running `man say` reveals that "this tool uses the Speech Synthesis manager", so I'm guessing the Apple Intelligence stuff is kicking in.
[0] https://old.reddit.com/r/LocalLLaMA/comments/1mhyzp7/kitten_...
https://github.com/KittenML/KittenTTS
But yeah, if it's like any of the others, we'll likely see a different "model" per language down the line, based on the same techniques.