As an aside, while this tool can be used to create an audiobook from a book you have in text format, for your private consumption, having an author employ something like this to create files for distribution is extremely risky, even if they acknowledge its use and intend those files to only be available on their website.
Indie authors struggle a lot to promote their works, and the new normal is that potential readers, the polite ones[^1], use the slightest hint of AI usage to discard their title and move on...as they are entitled to, since there are so many books.
I in particular have started to hire voice actors that have good acting skills and good diction but for whom English is their second language, or it's their first language but they speak something else at home; sometimes I even ask them to go a notch up with their accents. It helps with the non-AI recognition, and it also increases the appeal of the book for people who would like to try out something new. Once, I did an audition for a project and was pleasantly surprised with how much life people from around the Mediterranean basin were able to inject into their renderings, compared with people from Britain and North America.
[^1] Impolite readers set the town on fire, and then go about and spread that fire to neighboring towns, for good measure.
baxtr · 1h ago
I am a big-time user of Amazon’s WhisperSync feature. With that feature I can simultaneously read the book and listen to it.
This is especially helpful when you’re on the go but still want to have a visual now and then or highlight text for later.
The problem is that many books don’t offer that feature. There is a built-in read function now in the kindle app, but it’s crap.
So, if you ask me, I’d prefer a good human-written book with an additional AI voice on top to enable that feature for me.
amaccuish · 1h ago
Amazing, but I'm personally waiting for the one that generates a well-formatted ePub from a PDF.
anotherpaul · 3h ago
Does it turn it into spoken word or an audiobook?
Because good audiobooks often have voice actors that read the characters with different emphasis and dialects.
I imagine tools like chatgpt could do this for a few sentences but what about an 8-20 hour audiobook?
I think there are still basic hurdles to take before we can go epub to audiobook in a quality that can compete with current state of the art.
Or am I missing something?
jamilton · 2h ago
Elevenlabs has a feature for a "full cast"-type generation, where different characters will get different voices. It's certainly not automatically sensitive to dialect though.
It's probably possible with current systems to do though. I believe there are TTS systems that can use context/prompting to change emphasis and other speech qualities, though I'm not sure how reliably.
vorgol · 45m ago
Have you heard results from it? How does it know for example, when there is a romantic scene in the book, which voice to read out as?
It's definitely an excited voice, but is it read out as in a battle or as in a romantic scene?
pyman · 2h ago
Is it open source?
BenGosub · 1h ago
There are a few character voices that also can be mixed using the mixer, achieving different nuances. You can then write your own code to use different voices for different characters.
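A minimal sketch of what "your own code" could look like, assuming the simplest possible rule: quoted dialogue gets one voice and everything else gets the narrator voice. The voice names are placeholders, not names taken from the tool, and real per-character attribution (working out who is speaking) is a harder problem that this does not attempt.

```python
import re
from typing import List, Tuple

# Placeholder voice names (assumptions); substitute whatever the TTS tool offers.
NARRATOR_VOICE = "narrator_voice"
DIALOGUE_VOICE = "dialogue_voice"

# Matches a passage wrapped in straight or curly double quotes.
_QUOTE_RE = re.compile(r'("[^"]*"|“[^”]*”)')


def split_by_voice(text: str) -> List[Tuple[str, str]]:
    """Split text into (voice, chunk) pairs: quoted speech vs. narration."""
    segments = []
    # re.split keeps the quoted parts because the pattern has a capture group.
    for part in _QUOTE_RE.split(text):
        chunk = part.strip()
        if not chunk:
            continue
        is_dialogue = chunk[0] in '"“'
        voice = DIALOGUE_VOICE if is_dialogue else NARRATOR_VOICE
        segments.append((voice, chunk))
    return segments
```

Each pair could then be sent to the TTS engine with the matching voice and the resulting audio concatenated in order.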
poulpy123 · 10m ago
Perfect, I was looking for something like that! Is it GUI only, or is there an API available? I would like to be able to share a link or a text from my phone and get back the audio.
floppyd · 2h ago
I tried Kokoro for voicing blog posts and articles and wasn't impressed to be honest. Right now Gemini 2.5 Flash TTS is a much more capable system with generous free limits (about 10 minutes per generation and about 90 minutes per day). Voices are not very consistent between generations, but for shorter pieces it's not a big deal (but will obviously be for books)
ekianjo · 2h ago
Kokoro is fine for TTS, but it lacks emotion. But for a model of this size, that is kind of given.
8s2ngy · 3h ago
I've been using Kokoro TTS with the CLI app, audiblez, mentioned in the "Similar Projects" section of the README. The model is fast and delivers impressive quality for its small size. Some issues I have faced, however, are:
a) It doesn't distinguish periods at the end of sentences from the dots in abbreviations such as "Mr." or "Mrs." The result is an awkward pause between "Mr." and the name.
b) It doesn't handle ellipses well.
c) Words are pronounced the same way regardless of context.
rkagerer · 3h ago
The Mr. / Mrs. thing feels like it would be a pretty easy fix, at least to eliminate a lot of the more common cases.
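A rough sketch of that kind of fix, assuming the text can be pre-processed before it reaches the TTS engine. The abbreviation table is illustrative and far from complete, and the rule (a known title followed by a capitalized word) only covers the common cases; it will still misfire on something like "I saw the Dr. He left."

```python
import re

# Illustrative table (an assumption); extend it for the text being read.
ABBREVIATIONS = {
    "Mr.": "Mister",
    "Mrs.": "Missus",
    "Dr.": "Doctor",
}

# A known title, its period, then whitespace and a capital letter (likely a name).
_ABBREV_RE = re.compile(
    r"\b(" + "|".join(re.escape(a[:-1]) for a in ABBREVIATIONS) + r")\.(?=\s+[A-Z])"
)


def expand_abbreviations(text: str) -> str:
    """Replace known titles with their spoken form so no period is left to pause on."""
    return _ABBREV_RE.sub(lambda m: ABBREVIATIONS[m.group(1) + "."], text)
```

For example, `expand_abbreviations("Mr. Smith met Mrs. Jones.")` gives `"Mister Smith met Missus Jones."`, leaving only the real end-of-sentence period.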
hombre_fatal · 31m ago
^ A thought that everyone has had at one point when processing human text before learning the hard way (like end of sentence detection). :P
The difference is that even weak LLMs are good at magically doing this, so I wonder what the problem is for the TTS mentioned above.
leobg · 7m ago
Kokoro is small and fast because all the text -> phoneme conversion is done by “dumb code” and only the phoneme -> sound part is done using a neural net.
TOGoS · 3h ago
The demo video doesn't seem to have any audio in it! At least none that either ffmpeg or whatever Firefox uses can recognize.
noisem4ker · 1h ago
It's probably due to the unusual sound format, 24kHz mono PCM, and the fact that it was somehow forced into a WebM container, which only supports Vorbis and Opus officially.
It looks like the author created it using the "higher quality" ffmpeg command line, except for the "webm" final extension, producing the opposite of what's described as "an MP4 file that's compatible with more devices".
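For illustration, a small sketch of picking an audio codec that matches the output container instead of leaving raw PCM in it. The file names, bitrates, and helper name are assumptions, not the project's actual command line, and copying the video stream only works if it is already in a codec the target container accepts.

```python
from pathlib import Path
from typing import List

# Audio codec and bitrate per container; the bitrates are arbitrary choices.
AUDIO_CODEC_BY_CONTAINER = {
    ".webm": ("libopus", "64k"),  # WebM officially carries Vorbis or Opus audio
    ".mp4": ("aac", "128k"),      # AAC is the usual choice for MP4
}


def build_ffmpeg_command(src: str, dst: str) -> List[str]:
    """Build an ffmpeg argument list that re-encodes audio for dst's container."""
    suffix = Path(dst).suffix.lower()
    if suffix not in AUDIO_CODEC_BY_CONTAINER:
        raise ValueError(f"no audio codec configured for {suffix!r}")
    codec, bitrate = AUDIO_CODEC_BY_CONTAINER[suffix]
    # -c:v copy keeps the video stream untouched; only the audio is re-encoded.
    return ["ffmpeg", "-i", src, "-c:v", "copy", "-c:a", codec, "-b:a", bitrate, dst]
```

The returned list can be passed to `subprocess.run(..., check=True)` on a machine where ffmpeg is installed.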
Yeah, I've run a local Kokoro instance, and it doesn't work with Firefox. This uses Kokoro under the hood.
noisem4ker · 1h ago
The demo clip is static and has the Kokoro output encoded as the audio track. It's not Kokoro running and generating it in your browser in real time.
jamilton · 2h ago
Same here, but it worked when I opened it in Chrome. What a weird error - you would think that playing an embedded mp4 with audio wouldn't differ from browser to browser.
mnmalst · 1h ago
I was surprised by this as well at first but thinking about it, it would make sense when they use an audio codec which is not supported on the target system. In that case the video can still play but the audio can't. I wasn't aware tho that audio can be disabled separately.
Daunk · 3h ago
Same on my end, no audio in the video.
huseyinkeles · 3h ago
I can hear it on safari
leke · 1h ago
How big is this app?
nikolayasdf123 · 3h ago
can I choose any voice? would love to read software engineering books in voice of Morgan Freeman, or maybe even better, Scarlett Johansson
pyman · 2h ago
The voice of Mickey Mouse would be nice.
hulitu · 3h ago
Why not Stephen Hawking ?
hajimuz · 2h ago
Yeah, could be a buff like 500% brain supercharge.
scotty79 · 2h ago
I think the quality of the voice is super important for audiobooks and I think we are just closing in on the required quality with TTS.
I played a bit with ElevenLabs voices, and while they aren't bad, when I tried to make them read a fragment of a text that I wrote, it sounded chaotic, boring, quite terrible, for anything longer than a sentence or two. But when I tried their v3 voices, which they are currently in the process of rolling out, the same text sounded consistent, emotional, engaging, simply amazing. I think we are just crossing the vocal uncanny valley.
porker · 1h ago
Strong agree that voice quality (and voice acting) is important. I listen to a lot of fiction audiobooks, and will listen to the end of a middling book with a good narrator, but if the narration is flat or out of keeping with the characters I'll stop after a chapter or two.
https://github.com/denizsafak/abogen/tree/main/demo#for-high...