The headline feature isn’t the 25 MB footprint alone. It’s that KittenTTS is Apache-2.0. That combo means you can embed a fully offline voice in Pi Zero-class hardware or even battery-powered toys without worrying about GPUs, cloud calls, or restrictive licenses. In one stroke it turns "voice everywhere" from a hardware/licensing problem into a packaging problem. Quality tweaks can come later; unlocking that deployment tier is the real game-changer.
keyle · 2h ago
I don't mind the size in MB so much, nor the fact that it's pure CPU, nor the quality; what I do mind, however, is the latency. I hope it's fast.
Aside: Are there any models for understanding voice to text, fully offline, without training?
I will be very impressed when we will be able to have a conversation with an AI at a natural rate and not "probe, space, response"
Teever · 5m ago
Any idea what factors play into latency in TTS models?
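Typical latency drivers are the text frontend, autoregressive vs. parallel decoding, and the vocoder pass, but whatever the cause it's easy to measure. A minimal sketch that times a synthesis call and computes the real-time factor; the synthesizer here is a stub I made up, but any callable returning samples (e.g. a wrapper around a real model) plugs in the same way:

```python
import time

def measure_tts_latency(synthesize, text, sample_rate=24000):
    """Time one synthesis call and report total latency plus the
    real-time factor (RTF): synthesis time / audio duration.
    RTF < 1.0 means audio is generated faster than it plays back."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    return {
        "latency_s": elapsed,
        "audio_s": audio_seconds,
        "rtf": elapsed / audio_seconds if audio_seconds else float("inf"),
    }

# Stub standing in for a real model: ~1 s of silence per 15 characters.
def fake_synth(text):
    return [0.0] * (len(text) * 24000 // 15)

stats = measure_tts_latency(fake_synth, "Hello from a tiny on-device voice.")
```

For interactive use, time-to-first-audio (synthesizing just the first chunk) usually matters more than total RTF.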
Does anybody find it funny that sci-fi movies have to heavily distort "robot voices" to make them sound "convincingly robotic"? A robotic, explicitly non-natural voice would be perfectly acceptable, and even desirable, in many situations. I don't expect a smart toaster to talk like a BBC host; it'd be enough if the speech is easy to recognize.
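Incidentally, that movie-robot distortion is often little more than ring modulation: multiplying the speech signal by a low-frequency sine carrier (famously how the Daleks were voiced). A toy sketch, with arbitrary parameter choices:

```python
import math

def ring_modulate(samples, sample_rate=16000, carrier_hz=50.0):
    """Multiply each sample by a sine carrier -- the classic
    robot-voice trick. Output has the same length as the input."""
    return [s * math.sin(2 * math.pi * carrier_hz * i / sample_rate)
            for i, s in enumerate(samples)]

# One second of a 220 Hz tone standing in for speech.
tone = [math.sin(2 * math.pi * 220 * i / 16000) for i in range(16000)]
robotic = ring_modulate(tone)
```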
The voice sounds great! I find it quite aesthetically pleasing, but it's far from genderless.
degamad · 56m ago
Interesting concept, but why is that site filled with Top X blogspam?
Retr0id · 2h ago
I tried to replicate their demo text but it doesn't sound as good for some reason.
If anyone else wants to try:
> Kitten TTS is an open-source series of tiny and expressive text-to-speech models for on-device applications. Our smallest model is less than 25 megabytes.
> Error generating speech: failed to call OrtRun(). ERROR_CODE: 2, ERROR_MESSAGE: Non-zero status code returned while running Expand node. Name:'/bert/Expand' Status Message: invalid expand shape
Not open source. "You will need internet connectivity to validate your AccessKey with Picovoice license servers ... If you wish to increase your limits, you can purchase a subscription plan." https://github.com/Picovoice/orca#accesskey
satvikpendem · 2h ago
Does an APK exist for Android to replace its text-to-speech engine? I tried sherpa-onnx but it seemed too slow for real-time usage, especially for audiobooks when sped up.
I can't test this out right now; is this just a demo, or is it actually an APK for replacing the engine? Those are two different things: the latter can be used any time you want to read something aloud on a page, for example. This is the sherpa-onnx one I'm talking about.
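Whichever engine ends up doing the reading, one trick that helps with long pages and audiobooks is to split the text into sentence-sized chunks and synthesize them incrementally, so playback can start before the whole text is processed. A rough sketch (the 200-character cap is an arbitrary assumption):

```python
import re

def chunk_sentences(text, max_chars=200):
    """Split text at sentence boundaries into chunks of roughly
    max_chars, so a TTS engine can speak chunk 1 while chunk 2 is
    still being synthesized. An over-long sentence becomes its own
    chunk rather than being cut mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Feed each chunk to the engine as the previous one plays.
chunks = chunk_sentences("First sentence. Second one! A third follows? Done.")
```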
Wow, amazing work! I hope to see more impressive models running on CPUs!
sandreas · 1h ago
Cool.
While I think this is indeed impressive and has a specific use case (e.g. in the embedded sector), I'm not totally convinced that the quality is good enough to replace bigger models.
With fish-speech[1] and f5-tts[2] there are at least two open-source models pushing the quality limits of offline text-to-speech. I tested F5-TTS on an old NVIDIA 1660 (6 GB VRAM) and it worked OK-ish, so running it on slightly more modern hardware will not cost you a fortune and will produce MUCH higher quality, with multi-language and zero-shot support.
For Android there is SherpaTTS[3], which plays pretty well with most TTS Applications.
I hope this is the future. Offline, small ML models, running inference on ubiquitous, inexpensive hardware. Models that are easy to integrate into other things, into devices and apps, and even to drive from other models maybe.
wkat4242 · 2h ago
Hmm, the quality is not so impressive. I'm looking for a really natural-sounding model. I'm not very happy with Piper/Kokoro, and XTTS was a bit complex to set up.
For STT, Whisper is really amazing, but I miss a good TTS, and I don't mind throwing GPU power at it. Anyway, this isn't it either; it sounds worse than Kokoro.
kamranjon · 2h ago
The best open one I've found so far is Dia - https://github.com/nari-labs/dia - it has some limitations, but i think it's really impressive and I can run it on my laptop.
> Hmm the quality is not so impressive. [...] And I don't mind throwing GPU power at it.
This isn't for you, then. You should evaluate quality here based on the fact you don't need a GPU.
Back in the pre-Tacotron2 days, I was running slim TTS and vocoder models like GlowTTS and MelGAN on Digital Ocean droplets. No GPU to speak of. It cost next to nothing to run.
Since then, the trend has been to scale up. We need more models to scale down.
In the future we'll see small models living on-device. Embedded within toys and tools that don't need or want a network connection. Deployed with Raspberry Pi.
Edge AI will be huge for robotics, toys and consumer products, and gaming (i.e. world models).
Thanks! Yeah. It definitely isn’t the absolute best in quality but it trounces the default TTS options on macOS (as third party developers are locked out of the Siri voices). And for less than the size of many modern web pages…
pkaye · 3h ago
Where does the training data come from for these models? Is there an openly available dataset that people use?
This is the model and its GitHub page; the blog post looks very much AI-generated.
mayli · 3h ago
Is this English only?
a2128 · 2h ago
If you're looking for other languages, Piper has been around in this scene for much longer and they have open-source training code and a lot of models (they're ~60MB instead of 25MB but whatever...) https://huggingface.co/rhasspy/piper-voices/tree/main
It sounds just OK, but that's impressive for the size.
https://clowerweb.github.io/node_modules/onnxruntime-web/dis...
(seems reverted now)
Doesn't seem to work with Thai.
https://k2-fsa.github.io/sherpa/onnx/tts/apk-engine.html
1: https://github.com/fishaudio/fish-speech
2: https://github.com/SWivid/F5-TTS
3: https://github.com/woheller69/ttsengine
https://codepen.io/logicalmadboy/pen/RwpqMRV
Foundational tools like this open up the possibility of one-time payment or even free tools.
It would be great if the training data were released too!
https://github.com/ggml-org/whisper.cpp
That being said, the ‘classical’ (pre-AI) speech synthesisers are much smaller than kitten, so you’re not wrong per se, just for the wrong reason.
https://project64.c64.org/Software/SAM10.TXT
Obviously it's not fair to compare these with ML models.
Running `man say` reveals that "this tool uses the Speech Synthesis manager", so I'm guessing the Apple Intelligence stuff is kicking in.
[0] https://old.reddit.com/r/LocalLLaMA/comments/1mhyzp7/kitten_...
https://github.com/KittenML/KittenTTS
But yeah, if it's like any of the others, we'll likely see a different "model" per language down the line, based on the same techniques.