Chatterbox TTS

627 points by pinter69 | 6/11/2025, 8:23:52 PM | 182 comments | github.com

Comments (182)

Mizza · 1d ago
Demos here: https://resemble-ai.github.io/chatterbox_demopage/ (not mine)

This is a good release if they're not too cherry picked!

I say this every time it comes up, and it's not as sexy to work on, but in my experiments voice AI is really held back by transcription, not TTS. Unless that's changed recently.

ianbicking · 1d ago
FWIW in my recent experience I've found LLMs are very good at reading through the transcription errors

(I've yet to experiment with giving the LLM alternate transcriptions or confidence levels, but I bet they could make good use of that too)

vunderba · 1d ago
Pairing speech recognition with an LLM acting as a post-processor is a pretty good approach.

I put together a script a while back which converts any passed audio file (wav, mp3, etc.), normalizes the audio, passes it to ggerganov whisper for transcription, and then forwards the result to an LLM to clean the text. I've used it with a pretty high rate of success on some of my very old and poorly recorded voice dictation recordings from over a decade ago.

Public gist in case anyone finds it useful:

https://gist.github.com/scpedicini/455409fe7656d3cca8959c123...
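
For a rough idea of the pipeline's shape, here's a minimal sketch. It assumes ffmpeg on PATH and the `openai-whisper` package rather than the whisper.cpp build the gist actually uses, and `clean_with_llm` is a hypothetical stand-in for whatever LLM does the cleanup:

  import subprocess
  import whisper

  def normalize(src, dst="normalized.wav"):
      # Downmix to mono, resample to 16 kHz, apply EBU R128 loudness normalization
      subprocess.run(["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000",
                      "-af", "loudnorm", dst], check=True)
      return dst

  def clean_with_llm(raw):
      # Stand-in: prompt your LLM of choice with something like
      # "Fix transcription errors in this text; change nothing else."
      raise NotImplementedError

  model = whisper.load_model("base")
  text = model.transcribe(normalize("old_dictation.mp3"))["text"]
  print(clean_with_llm(text))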

sovok · 1d ago
An LLM step also works pretty well for diarization. You get a transcript with speaker segmentation (with whisper and pyannote, for example): SPEAKER_01 says at some point "Hi, I'm Bob. And here's Alice", SPEAKER_02 says "Hi Bob", and now the LLM can infer that SPEAKER_01 = Bob and SPEAKER_02 = Alice.
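
A quick sketch of that inference step as a prompt (`ask_llm` here is a hypothetical stand-in for whichever local or hosted model you use):

  transcript = ("SPEAKER_01: Hi, I'm Bob. And here's Alice.\n"
                "SPEAKER_02: Hi Bob, thanks for having me.")

  prompt = (
      "Below is a diarized transcript. Infer the real name behind each "
      "SPEAKER_XX label from context and reply as JSON, e.g. "
      '{"SPEAKER_01": "Bob"}. Use null if a name never comes up.\n\n'
      + transcript
  )

  def ask_llm(prompt):
      raise NotImplementedError  # call your LLM here

  # Expected answer: {"SPEAKER_01": "Bob", "SPEAKER_02": "Alice"}
  print(ask_llm(prompt))
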
soulofmischief · 1d ago
Yep, my agent I built years ago worked very well with this approach, using a whisper-pyannote combo. The fun part is knowing when to end transcription in noisy environments like a coffee shop.
Tokumei-no-hito · 1d ago
thanks for sharing. are some local models better than others? can small models work well or do you want 8B+?
vunderba · 1d ago
In my experience smaller models tend to produce worse results, BUT I actually got really good transcription cleanup with CoT (Chain of Thought) models like Qwen, even quantized down to 8b.
dragonwriter · 23h ago
I think the 8B+ question was about parameter count (8 billion+ parameters), not quantization level (8 bits per weight).
vunderba · 21h ago
Yeah I should have been more specific - Qwen 8b at a 5_K_M quant worked very well.
mikepurvis · 1d ago
I was going to say, ideally you’d be able to funnel alternates to the LLM, because it would be vastly better equipped to judge what is a reasonable next word than a purely phonetic model.
ianbicking · 1d ago
If you just give the LLM the transcript and tell it that it's a voice transcript with possible errors, it actually does a great job in most cases. My problems are mostly with mistranscriptions that say something entirely plausible but not at all what I said: because the STT engine is trying to produce a semantically valid transcription, it often yields text that is grammatically correct, semantically plausible, and wrong. Those really foil the LLM.

Even if you can just mark the text as suspicious, I think in an interactive application this would give the LLM enough information to confirm what you were saying when a really critical piece of text is low confidence. The LLM doesn't just know what the most plausible words and phrases for the user to say are; it can also evaluate whether the overall gist is high or low confidence, and whether the resulting action is high or low risk.
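
One hypothetical way to do that marking: wrap low-confidence words in a sentinel and explain the convention in the system prompt. The 0.6 threshold and the bracket syntax below are arbitrary choices for illustration, not any STT engine's actual output format:

  # (word, confidence) pairs as they might come from an STT engine
  words = [("send", 0.98), ("the", 0.99), ("invoice", 0.41), ("today", 0.95)]

  marked = " ".join(w if c >= 0.6 else f"[{w}?]" for w, c in words)
  system = ("This is a voice transcript. Words in [...?] were transcribed "
            "with low confidence; confirm them before taking any risky action.")
  print(marked)  # send the [invoice?] today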

miki123211 · 1d ago
This is actually something people used to do.

Old ASR systems (even models like Wav2vec) were usually combined with a language model. It wasn't a large language model (those didn't exist at the time); it was usually something based on n-grams.

throwawaymaths · 1d ago
Do you know if any current locally hostable public transcribers are good at diarization? For some tasks, having even crude diarization would improve QOL by a huge factor. I was looking at a whisper diarization Python package for a bit, but it was a bitch to deploy.
philipkiely · 1d ago
throwawaymaths · 1d ago
Yeah, as I said, I couldn't figure out how to deploy whisper-diarization.
genewitch · 17h ago
So you need Python (a full install) and git; the OS doesn't matter. A Python venv (virtual environment) ensures that this folder, once it works, is locked to all the versions inside it, including the Python version. This works for any software that uses pip to set itself up, or any Python stuff in general.

  git clone <whisper-diarization.git URL>
  cd whisper-diarization
  python -m venv .
  cd scripts
  # and then depending on your OS it's activate.sh, activate.ps1, activate.bat, etc. so on linux [0] 
your prompt should change to say

(whisper-diarization) <your OS prompt>$

now you can type

  cd ..
  pip install -c constraints.txt -r requirements.txt
  python ./diarize.py --no-stem --suppress_numerals --whisper-model large-v3-turbo --device cuda -a <FILE>
next time you want to use it, you can just do like

  cd ~/whisper-diarization
  scripts/activate.sh (or whatever) [0]
  python ./diarize.py [...]

[0] To activate a Python virtual environment created with venv, use the command

  source venv/bin/activate 
on Linux or macOS, or

  venv\Scripts\activate 
on Windows. This will change your terminal prompt to indicate that the virtual environment is active.

(the [0] note was 'AI generated' by DDG, but whatever, linux puts it in ./bin/activate and windows puts it in ./Scripts/activate.ps1 (ideally))

iainmerrick · 1d ago
Deepgram does it.
throwawaymaths · 1d ago
sorry i meant locally hostable public. ill edit parent.
pinter69 · 1d ago
Right you are. I've used speechmatics, they do a decent jon with transcription
theyinwhy · 1d ago
1 error every 78 characters?
pinter69 · 1d ago
The way to measure transcription accuracy is word error rate, not character error rate. I haven't really checked (or trusted) Speechmatics' accuracy benchmarks, but from my experience and personal impression it looks good; I haven't done a quantitative benchmark.
theyinwhy · 1d ago
Thanks for your constructive reply on my bad joke. I was referring to your original comment where you had a typo. I just couldn't resist, sorry.


causal · 1d ago
Playing with the Huggingface demo, I'm guessing this page is a little cherry-picked? In particular, I am not getting that kind of emotion in my responses.
backnotprop · 1d ago
It is hard to get consistent emotion with this. There are some parameters, and you can go a bit crazy, but it gets weird…
lvl155 · 1d ago
Can’t you get around that by synthetic data?
echelon · 1d ago
I absolutely ADORE that this has swearing directly in the demo. And from Pulp Fiction, too!

> Any of you fucking pricks move and I'll execute every motherfucking last one of you.

I'm so tired of the boring old "miss daisy" demos.

People in the indie TTS community often use the Navy Seals copypasta [1, 2]. It's refreshing to see Resemble using swear words themselves.

They know how this will be used.

[1] https://en.wikipedia.org/wiki/Copypasta

[2] https://knowyourmeme.com/memes/navy-seal-copypasta

bschwindHN · 1d ago
Heh, I always type out the first sentence or two of the Navy Seal copypasta when trying out keyboards.
xnx · 1d ago
echelon · 1d ago
Sadly they don't publish any training or fine tuning code, so this isn't "open" in the way that Flux or Stable Diffusion are "open".

If you want better "open" models, these all sound better for zero shot:

Zeroshot TTS: MaskGCT, MegaTTS3

Zeroshot VC: Seed-VC, MegaTTS3

Granted, only Seed-VC has training/fine tuning code, but all of these models sound better than Chatterbox. So if you're going to deal with something you can't fine tune and you need a better zero shot fit to your voice, use one of these models instead. (Especially ByteDance's MegaTTS3. ByteDance research runs circles around most TTS research teams except for ElevenLabs. They've got way more money and PhD researchers than the smaller labs, plus a copious amount of training data.)

xnx · 1d ago
Great tip. I hadn't heard of MegaTTS3.


cpill · 17h ago
But what's the inference speed like on these? Can you use them in a realtime interaction with an agent?
Quarrel · 1d ago
Fun to play with.

It makes my Australian accent sound very English though, in a posh RP way.

Very natural sounding, but not at all recreating my accent.

Still, amazingly clear and perfect for most TTS uses where you aren't actually impersonating anyone.

skatanski · 1d ago
How does it work from the privacy standpoint? Can they use recorded samples for training?
travisvn · 1d ago
Chatterbox is fantastic.

I created an API wrapper that also makes installation easier (Dockerized as well) https://github.com/travisvn/chatterbox-tts-api/

Best voice cloning option available locally by far, in my experience.

mistersquid · 19h ago
> Chatterbox is fantastic.

> I created an API wrapper that also makes installation easier (Dockerized as well) https://github.com/travisvn/chatterbox-tts-api/

Gave your wrapper a try and, wow, I'm blown away by both Chatterbox TTS and your API wrapper.

Excuse the rudimentary level of what follows.

Was looking for a quick and dirty CLI incantation to specify a local text file instead of the inline `input` object, but couldn't figure it out.

Pointers much appreciated.

travisvn · 18h ago
This API wrapper was initially made to support a particular use case where someone's running, say, Open WebUI or AnythingLLM or some other local LLM frontend.

A lot of these frontends have an option for using OpenAI's TTS API, and some of them allow you to specify the URL for that endpoint, allowing for "drop-in replacements" like this project.

So the speech generation endpoint in the API is designed to fill that niche. However, its usage is pretty basic and there are curl statements in the README for testing your setup.

Anyway, to get to your actual question, let me see if I can whip something up. I'll edit this comment with the command if I can swing it.

In the meantime, can I assume your local text files are actual `.txt` files?

mistersquid · 18h ago
This is way more of a response than I could have even hoped for. Thank you so much.

To answer your question, yes, my local text files are .txt files.

travisvn · 18h ago
Ok, here's a command that works.

I'm new to actually commenting on HN as opposed to just lurking, so I hope this formatting works..

  cat your_file.txt | python3 -c 'import sys, json; print(json.dumps({"input": sys.stdin.read()}))' | curl -X POST http://localhost:5123/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d @- \
    --output speech.wav

Just replace the `your_file.txt` with.. well, you get it.

This'll hopefully handle any potential issues you'd have with quotes or other symbols breaking the JSON input.

Let me know how it goes!

Oh and you might want to change `python3` to `python` depending on your setup.

mistersquid · 18h ago
> Just replace the `your_file.txt` with.. well, you get it.

> This'll hopefully handle any potential issues you'd have with quotes or other symbols breaking the JSON input.

> Let me know how it goes!

Wow. I'm humbled and grateful.

I'll update once I'm done with work and back in front of my home machine.

travisvn · 11h ago
Hey — just pushed a big update that adds an (opt-in) frontend to test the API

For now, there's just a textarea for input (so you'll have to copy the `.txt` contents) — but it's a lot easier than trying to finagle into a `curl` request

Let me know if you have any issues!

mistersquid · 9h ago
(Didn't carefully read your reply. What follows are the results of cat-ing a text file in the CLI. Will give the new textbox a whirl in the morning PDT. A truly heartfelt thanks for helping me work with Chatterbox TTS!)

Absolutely blown away.

I fed it the first page of Gibson's "Neuromancer" and your incantation worked like a charm. Thanks for the shell script pipe mojo.

Some other details:

  - 3:01 (3 mins, 1 sec) of generated .wav took 4:28 to process
  - running on M4 Max with 128GB RAM
  - Chatterbox TTS inserted a few strange artifacts which sounded like air venting, machine whirring, and vehicles passing. Very odd and, oddly, apropos for cyberpunk.
  - Chatterbox TTS managed to enunciate the dialog _as_ dialog, even going so far as to mimic an Australian accent where the speaker was identified as such. (This might be the effect of wishful listening.)
I am astounded.
venusenvy47 · 1d ago
Would this be usable on a PC without a GPU?
travisvn · 19h ago
It can definitely run on CPU — but I'm not sure if it can run on a machine without a GPU entirely.

To be honest, it uses a decently large amount of resources. If you had a GPU, you could expect about 4-5 GB of memory usage. And given the optimizations for tensors on GPUs, I'm not sure how well things would work "CPU only".

If you try it, let me know. There are some "CPU" Docker builds in the repo you could look at for guidance.

If you want free TTS without using local resources, you could try edge-tts https://github.com/travisvn/openai-edge-tts

teraflop · 1d ago
> Every audio file generated by Chatterbox includes Resemble AI's Perth (Perceptual Threshold) Watermarker - imperceptible neural watermarks that survive MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.

Am I misunderstanding, or can you trivially disable the watermark by simply commenting out the call to the apply_watermark function in tts.py? https://github.com/resemble-ai/chatterbox/blob/master/src/ch...

I thought the point of this sort of watermark was that it was embedded somehow in the model weights, so that it couldn't easily be separated out. If you're going to release an open-source model that adds a watermark as a separate post-processing step, then why bother with the watermark at all?

jchw · 1d ago
Possibly a sort of CYA gesture, kinda like how original Stable Diffusion had a content filter IIRC. Could also just be to prevent people from accidentally getting peanut butter in the toothpaste WRT training data, too.
throw101010 · 1d ago
Stable Diffusion (or rather Automatic1111, initially the UI of choice for SD models) had a joke/fake "watermark" setting too, which deliberately did nothing besides poke fun at people who thought open source projects would really waste time developing something that could easily be stripped/reverted by virtue of being open source anyway.
vunderba · 1d ago
Yeah, there's even a flag to turn it off in the parser `--no-watermark`. I assumed they added it for downstream users pulling it in as a "feature" for their larger product.
echelon · 1d ago
1. Any non-OpenAI, non-Google, non-ElevenLabs player is going to have to aggressively open source or they'll become 100% irrelevant. The TTS market leaders are obvious and deeply entrenched, and Resemble, Play(HT), et al. have to aggressively cater to developers by offering up their weights [1].

2. This is CYA for that. Without watermarking, there will be cries from the media about abuse (from anti-AI outfits like 404Media [2] especially).

[1] This is the right way to do it. Offer source code and weights, offer their own API/fine tuning so developers don't have to deal with the hassle. That's how they win back some market share.

[2] https://www.404media.co/wikipedia-pauses-ai-generated-summar...

echelon · 1d ago
Nevermind, this is just ~3/10 open, or not really open at all [1]:

https://github.com/resemble-ai/chatterbox/issues/45#issuecom...

> For now, that means we’re not releasing the training code, and fine-tuning will be something we support through our paid API (https://app.resemble.ai). This helps us pay the bills and keep pushing out models that (hopefully) benefit everyone.

Big bummer here, Resemble. This is not at all open.

For everyone stumbling upon this, there are better "open weights" models than Resemble's Chatterbox TTS:

Zeroshot TTS: MaskGCT, MegaTTS3

Zeroshot VC: Seed-VC, MegaTTS3

These are really good robust models that score higher in openness.

Unfortunately only Seed-VC is fully open. But all of the above still beat Resemble's Chatterbox in zero shot MOS (we tested a lot), especially the mega-OP Chinese models.

(ByteDance slaps with all things AI. Their new secretive video model is better than Veo 3, if you haven't already seen it [2]!)

You can totally ignore this model masquerading as "open". Resemble isn't really being generous at all here, and this is some cheap wool over the eyes trickery. They know they retain all of the cards here, and really - if you're just going to use an API, why not just use ElevenLabs?

Shame on y'all, Resemble. This isn't "open" AI.

The Chinese are going to wipe the floor with TTS. ByteDance released their model in a more open manner than yours, and it sounds way better and generalizes to voices with higher speaker similarity.

Playing with open source is a path forward, but it has to be in good faith. Please do better.

[1] "10/10" open includes: 1. model code, 2. training code, 3. fine tuning code, 4. inference code, 5. raw training data, 6. processed training data, 7. weights, 8. license to outputs, 9. research paper, 10. patents. For something to be a good model, it should have 7/10 or above.

[2] https://artificialanalysis.ai/text-to-video/arena?tab=leader...

fastball · 1d ago
The weights are indeed open (both accessible and licensing-wise): you don't need to put that in square quotes. Training code is not. You can fine-tune the weights yourself with your own training code. Saying that isn't open is like saying ffmpeg isn't open because it doesn't do everything I need it to do and I have to wrap it with my own code to achieve my goals.
dragonwriter · 20h ago
It's really weird to say ByteDance's release is "more open" when the WaveVAE encoder isn't released at all, only the decoder, so new voices require submitting your sample to a public GDrive folder and getting extracted latents back through another public GDrive folder.
pmarreck · 17h ago
FYI, the term is scare quotes (because they imply suspicion), not square quotes
echelon · 1d ago
Machine learning assets are not binary "open" or "closed". There is a continuum of openness.

To make a really poor analogy, this repo is like a version of Linux that you can't cross-compile or port.

To make another really poor (but fitting) analogy, this is like an "open core" SaaS platform that you know you'll never be able to run the features that matter on your own.

This repo scores really low on the "openness" continuum. In this case, you're very limited in what you can do with Chatterbox TTS. You certainly can't improve it or fit it to your data.

> You can fine-tune the weights yourself with your own training code.

This will never be built by anyone, and they know that. If it could be, they'd provide it themselves.

If you're considering Chatterbox TTS, just use MegaTTS3 [1] instead. It's better by all accounts.

[1] https://github.com/bytedance/MegaTTS3

dragonwriter · 5h ago
> This will never be built by anyone, and they know that. If it could be, they'd provide it themselves.

Community fine-tuning code has been developed in the past for open-weights models without public first-party training code.

fastball · 1d ago
Why can't you improve it or fit it to your data?

This can be cross-compiled/ported in the Linux analogy. The Linux analogy would be more like: a kernel dev wrote code for some part of the Linux kernel using JetBrains' CLion. He used features of CLion that made this process much easier than if he had written the code using `nano`. By your logic, the resulting kernel code is not "open" because the tooling used to create it is not open. This is, of course, nonsense.

I agree that the project as a whole is less open than it could be, but the weights are indeed as open as they can be, no scare quotes required.

echelon · 1d ago
I really don't think your analogy fits the absurdity of lacking the tooling. It's more like you have to decompile an N64 cartridge ROM and don't have the tools. But I don't want to play that game.

I'll up the ante. I'll bet you money that nobody forks this and adds fine tuning for at least a year.

eginhard · 1d ago
fastball · 1d ago
You're supposed to wait to post this until I agree to the bet ;)
echelon · 23h ago
I'm totally humbled by this.

I haven't seen this level of involvement for a lot of the models I'm using, including several text to speech models.

The rapidity of this is also quite shocking. I don't think Resemble anticipated this either, given their wording on the aforementioned ticket.

There's probably a lot more work to do to ensure this works, adjusting learning rates, batching, etc., but it's all clearly being put into place and given attention. Even if this model has some finicky fine tuning behaviors, with this kind of willpower it'll be quickly overcome.

I suppose I owe you, haha.

tedip · 1d ago
Can't make everyone happy :)
echelon · 1d ago
This space is getting pretty crowded.

If you're going to drop weights on unsuspecting developers (who might not be familiar with TTS) and make them think that they'll fit their use case, that's a bit of a bait-and-switch.

Chatterbox TTS is only available over API for fine tunes. That's an incredibly saturated market, and there are better quality and cheaper models for this.

Chatterbox TTS is equivalent to already-released semi-open weights from ByteDance and other labs, and those models already sound and perform better.

It'd be truly exciting if Chatterbox fine tunes could be done as open weights, similar to how Flux operates. Black Forest Labs has an entire open weights ecosystem built around them. While they do withhold their pro / highest quality variants, they always release open weights with training code for each commercial release. That's a much better model for courting open source developers.

Another company doing "open weights" right is Lightricks with LTX-1. They have a commercial studio, but they release all of their weights and tuning code in the open.

I don't see how this is a carrot for open source. It's an ad for the hosted API.

gcr · 1d ago
Not a single top-tier lab has had a "10/10 open" model for any model type for any learning application since ResNet; it's not fair to shit on them solely for this.
unstablediffusi · 1d ago
>Without watermarking, there will be cries from the media about abuse (from anti-AI outfits like 404Media [2] especially).

it is highly amusing that they still believe they can put that genie back in the bottle with their usual crybully bullshit.

nine_k · 1d ago
Some measures like that still sort of work. Try loading a scanned picture of a dollar bill into Photoshop. Try printing it on a color printer. Try printing anything on a color printer without the yellow tracking pixels.

A lock need not be infinitely strong to be useful; it just needs to take more resources to crack than the locked thing is worth.


ineedasername · 1d ago
The emotional exaggeration is interesting, though I don't think I've come across anything quite so versatile and easy to "sculpt" as ElevenLabs and its ability to generate a voice from a description of how you want the voice to sound. SparkTTS allows some additional parameters, and its GitHub project has placeholders in its code indicating the model might be refined for more fine-grained emotional control. As it is, I've had some success with it and other models by trying to influence prosody and tonality with some heavy-handed cues in the text, which can then be used with VC to get closer to the desired results, but it's a much more cumbersome process than Eleven.
pryelluw · 1d ago
Silly question: what's the lowest-spec hardware this will run on?
thorum · 1d ago
This GitHub issue says 6-7 GB VRAM: https://github.com/resemble-ai/chatterbox/issues/44

But if the model is any good someone will probably find a way to optimize it to run on even less.

Edit: Got it running on an old Nvidia 2060, I'm seeing ~5 GB VRAM peak.

magicalhippo · 1d ago
Looking at the issues page, it seems it's not well optimized[1] currently.

So out of the box it seems quite beefy consumer hardware will be needed for it to perform reasonably. However it seems like there's significant potential for improvements, though I'm no expert.

[1]: https://github.com/resemble-ai/chatterbox/issues/127

01HNNWZ0MV43FF · 1d ago
I was going to report how it runs on an old CPU but after fussing with it for about 30 minutes, I can't even get it to run.

Listing the issues in case it helps anyone:

- It doesn't work with Python 3.13, luckily `uv` makes it easy to build a venv with 3.12

- It said numpy 1.26.4 doesn't exist. It definitely does, but `uv pip` was searching for it on the pytorch repo. I passed an `--index-strategy` flag so it would check other repos. This could just be a bug in uv, but when I see "numpy 1.26.4 doesn't exist" and numpy is currently on 2.x, my brain starts to cramp up.

- The `pip install chatterbox-tts` version has a bug in CPU-only mode, so I cloned the Git repo

- The version at the tip of main requires `protobuf-compiler` installed on Debian

- I got a weird CMake error that I can't decipher. I think maybe it's complaining that the Python dev headers are not installed. Why would they be, I'm trying to do inference, not compile Python...

I know anger isn't productive but this is my experience almost any time I'm running Somebody Else's Python Project. Hit an issue, back up, hit another issue, back up, after an hour it still doesn't run.

thorum · 1d ago
We’ll know AGI has arrived when it can figure out Python dependency conflicts
kevin_thibedeau · 17h ago
It'll just throw up its virtual hands and switch to something better after transpiling all the Python code in a fit.
blharr · 1d ago
Maybe this wasn't in the README when you looked, but try Python 3.11?

> We developed and tested Chatterbox on Python 3.11 on Debian 11 OS; the versions of the dependencies are pinned in pyproject.toml to ensure consistency.

keyle · 1d ago
It's not a silly question, it's the best question!

If something can be run for free but it's cheaper to rent, it voids the DIY aspect of it.

bityard · 1d ago
Not a silly question, I came here to ask too. Curious to know whether I need a GPU costing 4 digits or if it will run on my 12-year-old thinkpad shitbox. Or something in between.
nmstoker · 1d ago
I've found it excellent with really common accents, but with other accents (that are pretty common too) it can easily get stuck picking a different accent. For instance, several Scottish recordings ended up Australian, as did a fairly mild Yorkshire accent.
a_wild_dandan · 1d ago
I think this says more about Scottish than the model.
Quarrel · 1d ago
> For instance several Scottish recordings ended up Australian

Funnily enough, it made my Australian accent sound very English RP. I was suddenly very posh.

ltrg · 1d ago
I'm English (RP) and it gave me a Yorkshire accent and Scottish accent in turn.
m3sta · 1d ago
Like a professional actor!
audiala · 1d ago
What is the current state of the art for open-source multilingual TTS? I have found Kokoro to be great for English as well, but am still searching for a good solution for French, Japanese, German...
barrell · 21h ago
I’ve also been looking for this. OpenVoice2 supports a few languages (5 IIRC), but I haven’t seen anything usable yet
abraxas · 1d ago
Are these things good enough to narrate a book convincingly or does the voice lose coherence after a few paragraphs being spoken?
vunderba · 1d ago
Most of these TTS systems tend to fall apart the longer the text gets - it's a good idea to wrap any longform text into separate paragraph-segmented batches and then stitch them back together again at the end.

I've also found that if your one-shot sample WAV isn't really clean, Chatterbox sometimes produces random unholy whooshing sounds at the end of the generated audio, which is an added bonus if you're recording Dante's Inferno.
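
The batching approach looks something like this. It's a sketch assuming a `synthesize(paragraph)` helper (e.g. wrapping Chatterbox) that returns mono float32 audio at a known sample rate; the names here are placeholders, not Chatterbox's actual API:

  import numpy as np
  import soundfile as sf

  SAMPLE_RATE = 24_000

  def synthesize(paragraph):
      raise NotImplementedError  # one TTS call per paragraph goes here

  def longform_tts(text, out="longform.wav"):
      paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
      pause = np.zeros(int(0.4 * SAMPLE_RATE), dtype=np.float32)  # 400 ms gap
      chunks = []
      for p in paragraphs:
          chunks += [synthesize(p), pause]
      sf.write(out, np.concatenate(chunks), SAMPLE_RATE)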

elektor · 1d ago
Yes, I've generated an audiobook from an epub using this tool and the result was passable: https://github.com/santinic/audiblez
venusenvy47 · 1d ago
Regarding your example "On a Google Colab's T4 GPU via Cuda, it takes about 5 minutes to convert "Animal's Farm"", do you know the approximate cost to perform this? I've only used Colab at the free level, so I have no concept of the costs for GPU time.
raincole · 1d ago
Once it's good enough, Audible will be flooded with AI-narrated books, so we'll know soon. (The only question is whether Amazon would disclose it, ofc.)
landl0rd · 1d ago
Flip side: a setup where I can get an audiobook auto-generated for a book that doesn't have one (or use an existing ebook rather than paying Audible $30 for their version), and where it's "good enough", is a legit improvement. AI-generated isn't as good, but it's better than nothing. Also, being able to interrupt and ask for more detail/context would be pretty nice. Like, I'm reading some Pynchon and I sometimes have to stop and look up the name of a reference to some product nobody knows now, stuff like that.
skygazer · 1d ago
If you're willing to forgo the interactive LLM bit, kokoro-tts (just a script using Kokoro-ONNX) takes epubs and outputs a series of wavs or mp3s that need to be stitched together into chapters or audiobook m4a with some ffmpeg fu. I've listened to several generated audiobooks, and found them pretty good. Some nice generic narration-like prosody. It uses espeak-ng to generate phonemes and passes those to the model to render voice, so it generally pronounces things quite well. It comes with a handful of nice voices and several can be blended, but no easy voice cloning, like chatterbox, that I'm aware of.

https://github.com/nazdridoy/kokoro-tts/blob/main/kokoro-tts
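
The ffmpeg fu is roughly this (a sketch assuming ffmpeg on PATH and a `chapters/` directory of numbered WAVs; the concat demuxer does the stitching):

  import pathlib
  import subprocess

  wavs = sorted(pathlib.Path("chapters").glob("*.wav"))
  pathlib.Path("list.txt").write_text(
      "\n".join(f"file '{w.resolve()}'" for w in wavs)
  )
  # Concatenate the WAVs, then encode to AAC in an .m4a container
  subprocess.run(["ffmpeg", "-f", "concat", "-safe", "0", "-i", "list.txt",
                  "-c:a", "aac", "-b:a", "64k", "book.m4a"], check=True)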

vahid4m · 1d ago
I've used this repo and it's great. It was one of many things that inspired me to build a similar tool: https://desktop.with.audio

It was important to me that it be 100% private and local, and I wanted it to be a one-time-payment solution. Because it processes your data locally, it can be a one-time-payment text-to-speech app.

If you are interested in creating audiobooks from epubs, check this demo: https://www.youtube.com/watch?v=pOHzo6Oq0lQ

If you are interested in listening while reading with text highlighting, check these demos:

- https://www.youtube.com/watch?v=8yJ-lsbzAuw
- https://www.youtube.com/watch?v=y8wi4d8xmnw

jedbrooke · 1d ago
audiblez[1] does exactly that and handles the ffmpeg-fu part for you, and will output an m4b file which audiobook players support.

1. https://github.com/santinic/audiblez

ajolly · 21h ago
I've been using epub2tts / epub2tts-edge and it's been working well for me. It converts into m4b.
russellbeattie · 1d ago
Audible has already flooded their store with generated audio books. Go to the "Plus Catalog" and it's filled with them. The quality at the moment is complete trash, but I can't imagine it won't get better quickly.

The whole audiobook business will eventually disappear - probably within the decade. There will only be ebooks and on-device AI assistants will read it to you on demand.

I imagine it'll go like this: First pre-generated audiobooks as audio files. Next, online service to generate audio on demand with hyper customizable voices which can be downloaded. Next, a new ebook format which embeds instructions for narration and pronunciation to be read on-device. Finally, AI that's good enough to read it like a storyteller instantly without hints.

satvikpendem · 1d ago
> There will only be ebooks and on-device AI assistants will read it to you on demand.

Honestly I read (or rather, listen to) a lot of books already by getting the epubs onto my phone then using a very basic TTS to read it out. Yes, they're definitely not as lifelike as even the most common AI TTS systems but they're good enough to listen to at high speed. Moon+ Reader is pretty good for Android, not sure about iOS.

BoorishBears · 1d ago
fatesblind · 1d ago
It's watermarked.
mianos · 1d ago
It's open source. It's not in the model; the watermark function is added as a post-processing step to show you how it's used. You can just remove it.

``` watermarked_wav = self.watermarker.apply_watermark(... ```

pinter69 · 1d ago
I consult for a company in the space (not Resemble) and I can definitely say it can narrate a book.
wsintra2022 · 1d ago
A year ago, for fun, I gave a friend a Carl Rogers therapy audiobook with an Attenborough-esque reading. It was pretty good over a year ago, so it should be better now.
philipkiely · 1d ago
Example implementation with sample inference code + voice cloning example:

https://github.com/basetenlabs/truss-examples/tree/main/chat...

Still working on streaming

tevon · 1d ago
I just tested it out locally; really excellent quality, and the server was easy to set up and well documented.

I'd love to get to real-time generation if that's in the pipeline? Would like to use it along with Home Assistant.

DHolzer · 5h ago
I love Chatterbox, it's my favourite. While the generation speed is quick, I wonder what performance optimizations I could try on my 3090 to improve throughput. It's not quite enough for realtime.
iambateman · 1d ago
Just a regular reminder to tell your friends and family to be extra skeptical about phone conversations.

It’s becoming much more likely that the friend who desperately needs a gift card to Walmart isn’t the friend at all. :(

probably_wrong · 1d ago
My family members speak Spanish with an Argentinean accent. From what I've seen in the space it looks like I'm safe.
jeroenhd · 1d ago
Public research and well-intentioned AI companies are all focusing on (white) American English, but that doesn't mean the technology isn't being refined elsewhere. The scamming industry is massive and already goes to depths like slavery to get the job done.

I wouldn't assume you're safe just because the tech in your phone can't speak your language.

KaiserPro · 23h ago
In the UK I have been getting AI fancy-TTS calls quite often. I even got one today.

Interrupting them with "can you make me a poem about x" works reliably. However, the latency is a dead giveaway.

chii · 1d ago
The easiest way to defeat phone fraud is to decide ahead of time on a verbal password between family (and close friends, if they're close enough that you'd lend them money).

In a real scenario, they'd know the verbal password and you can authenticate them. Drum it into them that this password will prevent other people from impersonating you in this brave new world of AI voices and even video.

jimjimwii · 1d ago
That is more or less what I did with my parents, but this approach is still susceptible to active MITM attacks.

2-factor authentication through a secure app or a trusted family member is probably also needed, though I haven't tackled this part with them yet.

chii · 1d ago
> 2 factor authentication through a secure app

The problem is that the sort of emergency scenario in which a family member would need help often isn't doable via a secured app. It's often just a telephone, with a number you can't recognize - imagine getting that phone call from a police station in the middle of nowhere after an arrest, when you don't have access to any of your personal belongings because they've been confiscated. The phone is a landline from the police station!

Therefore, a verbal password is needed, as this scenario is exactly how a scammer would present the emergency they supposedly need help with (usually: wire some dollars to this account to post bail).

IshKebab · 22h ago
"Oh sorry son did we have a password? I totally forgot."

This is an HN fantasy solution.

Ylpertnodi · 21h ago
Works for me and the family. No code-word, no transfer of funds.
IshKebab · 18h ago
Have your parents been targeted by convincing fraudsters? It doesn't work for you; you hope it will work.
chii · 11h ago
It is better than not having it as a backup method. It might even trip the scammer up as they may not expect it.

Not to mention that it is your responsibility as the technically minded to hammer it into your family members.

mattigames · 1d ago
My bet is that the government will at some point have to put pressure on Walmart and others to stop selling those gift cards completely; doing impersonations is getting too easy and too cheap for there not to be a flood of those scam calls in the near future.
stevage · 1d ago
Interesting demo. A few observations, having uploaded a snippet of my own voice, and testing with some of my own text:

- the output had some of the qualities of my voice, but wasn't super similar. (Then again, the fact it could even do this from such a tiny snippet was impressive)

- increasing "CFG/pace" (whatever CFG is) even a little bit often just breaks down into total gibberish

- it was very inconsistent whether it would come out with a kind of British accent or an American one. (My accent is Australian...)

- the emotional exaggeration was interesting, but it seemed to vary a lot exactly what kind of emotion would come out

b0a04gl · 19h ago
> the emotion intensity control is killer. actual param you can tune per line.

> and the perth watermarking baked into every output, that's the part most people are sleeping on. survives mp3, editing, even resampling. no plugin, no postprocess.

> also noticed the chatterboxtoolkitui floating in the org, with audiobook mode and batch voice conversion already wired in.

Is it a banger? Yes, I guess so: a full setup ready for indies shipping voice-first products right now.

lukeinator42 · 14h ago
Does anyone know of an open-source TTS like this that can also encode speech to do voice conversion alongside TTS? i.e. a model that would take speech as input and convert it to one of the pretrained TTS voices.
yavorgiv · 2h ago
ojw0816 · 8h ago
Looks good! What is the difference between the open-source version and the priced version?
palmfacehn · 1d ago
Has anyone developed a way to annotate the input to provide emotional context?

In the past I've used different samples from the same speaker for this.

dragonwriter · 20h ago
There are models that are trained for some kind of (in- or out-of-band) emotional (or more generally, style) prompting, but Chatterbox isn't one of them. So beyond building a system that takes input and processes it into chunks of text to speak, plus the settings Chatterbox does support (mostly pace and exaggeration) for each chunk, there's probably no real way to do that with Chatterbox.
monksy · 9h ago
How would I install this alongside librechat or ollama using docker?
pzo · 1d ago
It's only for English sadly
darccio · 1d ago
Are there any good options for non-English languages?
jeroenhd · 1d ago
It's not on the same level in terms of emotion, but I believe the research that https://github.com/CorentinJ/Real-Time-Voice-Cloning was based on is mostly oriented around Chinese first (and then English). It seems to work well enough if you and the voice you're cloning speak the same language, though I haven't tested it much.
bachittle · 19h ago
I always have issues with TTS models that don't let you send large chunks of text. Seems this one doesn't resolve that either; there's always a limit of like 2-3 sentences.
travisvn · 19h ago
That's just for their demo.

If you want to run it without size limits, here's an open-source API wrapper that fixes some of the main headaches with the main repo https://github.com/travisvn/chatterbox-tts-api/

racecar789 · 1d ago
I’d sign up for a service that calls a pharmacy on my behalf to refill prescriptions. In certain situations, pharmacies will not list prescriptions on their websites, even though they have the prescriptions on file, which forces the customer to call by phone — a frustrating process.

I do feel bad for pharmacists, their job is challenging in so many ways.

jeroenhd · 1d ago
Didn't Google already demo that with Google Duplex? It's not available here so I can't test it, but I think that's exactly the kind of thing duplex was designed to do.

Although, from a risk avoidance point of view, I'd understand if Google wanted to stay as far away from having AI deal with medication as possible. Who knows what it'll do when it starts concocting new information while ordering medicine.

init0 · 22h ago
MrThoughtful · 1d ago
How do you set the voice?

On the Huggingface demo, there seems to be no option for it.

It has a female voice. Any way to set it to a male voice?

ipsum2 · 1d ago
It's voice cloning. Maybe not available in the demo, but you just provide a different input.
j2kun · 1d ago
They should put the meaning of "TTS" in the readme somewhere, probably near the top. Or their website.
byteknight · 1d ago
TTS is a very common initialism for Text-to-Speech going back to at least the 90s.
stevage · 1d ago
Yeah, it's a very common initialism for people who work in the space, and have some context.
j2kun · 1d ago
So? Acronym soup is bad communication.
aquariusDue · 1d ago
I miss glossaries.
dylan604 · 1d ago
Good writing rules can still be used even for repo READMEs: the first time an acronym is used, it's spelled out to show what it means. Too many assumptions are made that everyone will know it. Sometimes the author is too inside baseball and assumes anyone reading their README already knows the subject. Not all devs are literature majors, and probably just never think about these things.
rapfaria · 1d ago
An AI-powered browser extension that shows on hover the most likely acronym meaning, based on context you say?
aquariusDue · 1d ago
I've used this one for a hot minute a few weeks ago: https://lumetrium.com/definer/

It also can be configured to use Ollama or an API key from other providers (OpenRouter included) and from what I gather the default prompt can be changed too.

Sadly it's closed source.

sdenton4 · 1d ago
Table Top Simulator.

It's obviously an AI for playing wargames without having to bother painting all the miniatures, or finding someone with the same weird interest in Balkan engagements during the Napoleonic era.

ipsum2 · 1d ago
The voice cloning is okay, not as good as Eleven Labs. There's a Rick (from Rick and Morty) voice example, and the generated audio sounds muffled and low quality. I appreciate that its open source though.
kiririn7 · 1d ago
Definitely worse than the new ElevenLabs model (v3). That model is really good.
plangary123 · 1d ago
I disagree
andy_xor_andrew · 1d ago
in my experience, TTS has been a "pick two" situation:

- fast / cheap to run

- can clone voices

- sounds super realistic

from what I can tell, Chatterbox is the first that apparently lets you pick 3! (have not tried it myself yet, this is just what I can deduce)

CGamesPlay · 1d ago
Can you share one that is fast/cheap to run and sounds super realistic? I'm very interested in finding a good TTS and not really concerned about cloning any particular voice (but would like a "distinctive" voice that isn't just a preset one).
pzo · 1d ago
It's also about whether you want multilingual support and whether you want to run on edge devices. Chatterbox only supports English.
andymcsherry · 1d ago
ipsum2 · 1d ago
You failed to mention that this is an ad for the company you work at. Also, the links don't even work without signing up for some shitty service.
andymcsherry · 23h ago
Hey ipsum, sorry I could have mentioned that. We spend a ton of effort on open source and sharing our ML knowledge with the community. If you don't want to use our platform, the entire source code and a tutorial is there to run it on your own.
3ds · 1d ago
There are only English voices, even in the paid version. Using them in other languages results in an accent.
ash1224 · 5h ago
Wow! 200ms, very good!
Shopper0552 · 1d ago
Anyone know a good free open-source speech-to-text? Looking for something for my laptop, which is running Fedora KDE Plasma.
pzo · 1d ago
Whisper large-v3-turbo if you need support for many languages and want something fast enough for deployment even on smartphones (WhisperKit). You can also try lite-whisper on HF if you need even smaller weights and slightly faster speed.
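
For anyone who hasn't tried it, minimal usage with the reference `openai-whisper` package looks like this (note that the "turbo" alias for large-v3-turbo only exists in recent releases, so check your installed version):

  import whisper

  model = whisper.load_model("turbo")  # alias for large-v3-turbo
  result = model.transcribe("recording.mp3")
  print(result["text"])
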
hoherd · 1d ago
Whisper has been great for me. I have a single-file uv powered python script that creates SRT files or timestamped text files from media stored on the filesystem. https://github.com/danielhoherd/pub-bin/blob/main/whisper-tr...
santiagobasulto · 1d ago
Whisper?
az226 · 1d ago
How does one train a TTS model with an LLM backbone? Practically, how does this work?
cyanf · 1d ago
You use a neural audio codec to encode audio into codebooks.

Then you can treat the codebook entries as tokens and treat audio generation as a next-token prediction task.

You then take the generated codebook entries and run them through the codec's decoder to yield audio.

It works surprisingly well.

Speech-text models (TTS models with an LLM as backbone) are the current meta.
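
Here's a rough sketch of that loop using Meta's EnCodec as the codec (`pip install encodec`); the transformer itself is elided, and real systems interleave the codebooks more cleverly (delay patterns etc.):

  import torch
  from encodec import EncodecModel

  codec = EncodecModel.encodec_model_24khz()
  codec.set_target_bandwidth(6.0)  # 8 codebooks per frame at 24 kHz

  wav = torch.randn(1, 1, 24_000)  # stand-in for 1 second of real speech
  with torch.no_grad():
      frames = codec.encode(wav)
  codes = torch.cat([c for c, _ in frames], dim=-1)  # [batch, n_codebooks, time]

  # Flatten codebook entries into one token stream; training is then ordinary
  # next-token prediction over (text tokens + audio tokens).
  audio_tokens = codes.flatten(1)

  # ...train a decoder-only transformer to model p(audio_tokens | text_tokens)...

  # At inference, sampled tokens are reshaped back to [batch, n_codebooks, time]
  # and decoded into a waveform again:
  with torch.no_grad():
      audio = codec.decode([(codes, None)])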

benob · 1d ago
Watermarking is easily disabled in the code. I am wondering when they will release model weights with the watermarking embedded.
decide1000 · 1d ago
How does it perform on multi-lingual tasks?
yjftsjthsd-h · 1d ago
The readme says it only supports English
SV_BubbleTime · 15h ago
Fun stuff... I don't know how or why, but connecting Bluetooth while on this site made all of the audio clips play at once (Firefox, Linux). Not the best listening experience.
causality0 · 1d ago
Anyone know how this compares to Kokoro? I've found Kokoro very useful for generating audiobook but it almost always pronounces words with paired vowels incorrectly. Daisy becomes die-zee, leave becomes lay-ve, etc.
nmstoker · 1d ago
If you're running Kokoro yourself then it might be worth checking your phonemizer / espeak-ng installs in case they are messing up the phonemes for those words (which are then passed on as inputs to Kokoro itself)
BigBananaGuy · 1d ago
Chatterbox sounds much more natural. The zero shot voice cloning and exaggeration feature is sick!
pradeepodela · 1d ago
What is the latency?
tuananh · 1d ago
For this, what would it take to support another language?
internet_points · 1d ago
> Supported Lanugage

> Currenlty only English.

meh

_andrei_ · 1d ago
very cherry picked
andrewstuart · 1d ago
There's been surprisingly little advancement in TTS since a rapid leap forward three years ago or so.

There's ElevenLabs, which is quite good but not incredible, and very expensive.

Everything else, all the big AI companies included, has TTS systems that are kinda meh.

Everything else in AI has advanced in leaps and bounds; TTS remains deep in the uncanny valley.

hsavit1 · 1d ago
Another TTS that only supports English. This really irritates me.
nmstoker · 1d ago
Maybe that irritation could be channelled into contributing to one that supports more than English? Even small steps help, like tweaking docs, adding missing/extra examples, or fielding a few issues in GH (most are usually simple misunderstandings where a quick pointer can easily help a beginner).
jeroenhd · 1d ago
For what it's worth, there are also a whole bunch of models that speak Chinese.

So far the US and China are spearheading AI research, so it makes sense that models optimize for languages spoken there. Spanish is an interesting omission on the US part, but that's probably because most AI researchers in the US speak English even if their native tongue is Spanish.

gardnr · 1d ago
tomhow · 1d ago
Thanks for posting this but it's conventional to only post links to past submissions if they had significant discussion, which none of these did.
pinter69 · 1d ago
I did a quick Google search before posting and only found a reference in a comment. But I searched for the link to the GitHub.
andyferris · 1d ago
It took me ages to understand what TTS means!
andyferris · 1d ago
In the spirit of being more constructive...

https://github.com/resemble-ai/chatterbox/pull/156

SV_BubbleTime · 17h ago
I don't like how for text-to-image/video it's T2V and I2V, and reference video-to-video is V2V... then when we get to text-to-speech, the "2" turns into a "T" all of a sudden.
dragonwriter · 4h ago
TTS has been around as an initialism since long before the current AI wave; the x2y pattern is newer. (You do see it around TTS, even though TTS itself hasn't become T2S; e.g., TTS toolchains often include a g2p, grapheme-to-phoneme, component.)