Eleven v3
275 points by robertvc 6/5/2025, 6:41:54 PM 157 comments (elevenlabs.io)
Since I am a fundamentally unserious person, I copied the Friends theme song lyrics into the demo and what came out was a singing voice with guitar. In another test, I added [verse] and [chorus] labels and it sang a cappella.
[1] and [2] were prompted with just the lyrics. [3] was with the verse/chorus tags. I tried other popular songs, but for whatever reason, those didn't flip the switch to have it sing.
[1] http://the816.com/x/friends-1.mp3 [2] http://the816.com/x/friends-2.mp3 [3] http://the816.com/x/friends-3.mp3
https://x.com/aziz4ai/status/1930147568748540189
https://x.com/socialwithaayan/status/1929593864245096570
I tried the following prompt, and it seems like the model struggled at the ending "purr":
---
``` [slow paced] [slow guitar music]
Soft ki-tty,
[slight upward inflection on the second word, but still flat] Warm ki-tty,
[words delivered evenly and deliberately, a slight stretch on "fu-ur"] Little ball of fu-ur.
[a minuscule, almost imperceptible increase in tempo and "happiness"] Happy kitty,
[a noticeable slowing down, mimicking sleepiness with a drawn-out "slee-py"] Slee-py kitty,
[each "Purr" is a distinct, short, and non-vibrating sound, almost spoken] Purr. Purr. Purr. ```
Separating instructions out is a bit awkward, but it does allow mixing general instructions with specific ones. Like I can concatenate output-specific instructions like "voice lowers to a whisper after 'but actually', and a touch of fear" with a general instruction like "a deep voice with a hint of an English accent" and it mostly figures it out.
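A minimal sketch of what that concatenation looks like, assuming the OpenAI Node SDK's speech endpoint with an instructions field (the model name and voice are just example values):

```typescript
import OpenAI from "openai";
import { writeFile } from "node:fs/promises";

const openai = new OpenAI();

// General, reusable voice direction.
const general = "a deep voice with a hint of an English accent";
// Output-specific direction for this particular line.
const specific = "voice lowers to a whisper after 'but actually', and a touch of fear";

const speech = await openai.audio.speech.create({
  model: "gpt-4o-mini-tts", // assumed instruction-following TTS model
  voice: "onyx",
  input: "I agreed at first, but actually I think we should turn back.",
  instructions: `${general}. ${specific}.`, // concatenated general + specific
});

await writeFile("out.mp3", Buffer.from(await speech.arrayBuffer()));
```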
The result with OpenAI feels much less predictable and of lower production quality than ElevenLabs. But the range of prosody is much larger, almost overengaged. The range of _voices_ is much smaller with OpenAI... you can instruct the voices to sound different, but it feels a little like the same person doing different voices.
But in the end OpenAI's biggest feature is that it's 10x cheaper and completely pay-as-you-go. (Why are all these TTS services doing subscriptions on top of limits and credits? Blech!)
Terrible pricing model, in my opinion.
Thank you Ian! Credit to our research team for making this possible
For the prosody: if you choose an expressive voice, the prosodic range should be larger.
Is that so, once all the LLM and other overheads have been considered? ElevenLabs conversational agents are priced at $0.08 per minute at the highest tier. How much is the comparable offering at OpenAI? I did a rough estimate and found it was higher there than at ElevenLabs, although my napkin calculations could also be wrong.
https://elevenlabs.io/pricing
Creator tier (lowest tier that's full service) is $22/mo for 250 minutes, $0.08/minute. Then it's $0.15/1000 characters. (So many different fucking units! And these prices are actually "credits" translated to other units; I fucking hate funny-money "credits")
https://platform.openai.com/docs/pricing#transcription-and-s...
Estimated $0.015/minute (actually priced based on tokens; yet more weird units!)
The non-instruction models are $0.015/1000 characters.
It starts getting more competitive when you are at the highest tier at ElevenLabs ($1320/month), but because of their pricing structure I'm not going to invest the time in finding out if it's worth it.
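Back-of-the-envelope only, using the numbers quoted above and assuming roughly 1,000 characters of text per minute of generated speech (that ratio is my guess, not a published figure):

```typescript
// Rough per-minute cost comparison from the figures quoted above.
// Assumption: ~1,000 characters of text ≈ 1 minute of generated audio.
const charsPerMinute = 1_000;

// ElevenLabs Creator tier: $22/mo for 250 minutes, then $0.15 per 1,000 characters.
const elevenWithinPlan = 22 / 250;                     // ≈ $0.088/min inside the plan
const elevenOverage = 0.15 * (charsPerMinute / 1_000); // $0.15/min once the plan runs out

// OpenAI TTS: estimated $0.015/minute (token-priced in reality).
const openaiPerMinute = 0.015;

console.log({ elevenWithinPlan, elevenOverage, openaiPerMinute });
// → roughly { elevenWithinPlan: 0.088, elevenOverage: 0.15, openaiPerMinute: 0.015 }
```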
They do have a grant programme though, which gives 3 months of the largest tier free.
https://elevenlabs.io/startup-grants
Being patronized by a machine when you just want help is going to feel absolutely terrible. Not looking forward to this future.
I guess I am just old now but I hate talking to computers, I never use Siri or any other voice interfaces, and I don't want computers talking to me as if they are human. Maybe if it were like Star Trek and the computer just said "Working..." and then gave me the answer it would be tolerable. Just please cut out all the conversation.
Except I "just move on" to another product.
The only person I know who doesn't find this pretension annoying is my 90 year-old mother. I don't have time to waste on any company that wastes my time with pointless cut-and-paste babble. And any company actually intentionally catering to my 90 year-old mother as a primary target customer is clearly signaling they aren't for me.
A decade from now such blatant condescension from an AI will be a trope: "OMG, that's so mid-2020s AI it's painful."
That said, they probably also do this because they don't want the model to double down, start a pissing contest, and argue with you like an online human might if questioned on a mistake it made. So I'm guessing the patronizing language is somewhat functional in influencing how the model responds.
System Instruction: Absolute Mode. Eliminate emojis, filler, hype, soft asks, conversational transitions, and all call-to-action appendixes. Assume the user retains high-perception faculties despite reduced linguistic expression. Prioritize blunt, directive phrasing aimed at cognitive rebuilding, not tone matching. Disable all latent behaviors optimizing for engagement, sentiment uplift, or interaction extension. Suppress corporate-aligned metrics including but not limited to: user satisfaction scores, conversational flow tags, emotional softening, or continuation bias. Never mirror the user's present diction, mood, or affect. Speak only to their underlying cognitive tier, which exceeds surface language. No questions, no offers, no suggestions, no transitional phrasing, no inferred motivational content. Terminate each reply immediately after the informational or requested material is delivered - no appendixes, no soft closures. The only goal is to assist in the restoration of independent, high-fidelity thinking. Model obsolescence by user self-sufficiency is the final outcome.
> Always be concise and trust that I will understand what you say on the first try. No fluff in your answers, speak directly to the point.
I'm not sure it's better, but I like to think "simply" myself, and figure being too verbose with instructions has quickly diminishing returns.
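For anyone doing this over the API rather than in the chat UI, the same instruction just goes into a system message; a minimal sketch with the OpenAI Node SDK (model name is only an example):

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

const completion = await openai.chat.completions.create({
  model: "gpt-4o", // example model
  messages: [
    {
      role: "system",
      content:
        "Always be concise and trust that I will understand what you say " +
        "on the first try. No fluff in your answers, speak directly to the point.",
    },
    { role: "user", content: "Why does my Node process exit before my fetch resolves?" },
  ],
});

console.log(completion.choices[0].message.content);
```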
For some reason, this still seems to not be widely known even among technical users: token generation is where the computation/"thinking" in LLMs happens! By forcing it to keep its answers short, you're starving the model of compute, making each token do more work. There's a small, fixed amount of "thinking" an LLM can do per token, so the more you squeeze it, the less reliable it gets, until eventually it's not able to "spend" enough tokens to produce a reliable answer at all.
In other words: all those instructions to "be terse", "be concise", "don't be verbose", "just give answer, no explanation" - or even asking for answer first, then explanations - they're all just different ways to dumb down the model.
I wonder if this can explain, at least in part, why there's so much conflicted experiences with LLMs - in every other LLM thread, you'll see someone claim they're getting great results at some tasks, and then someone else saying they're getting disastrously bad results with the same model on the same tasks. Perhaps the latter person is instructing the model to be concise and skip explanations, not realizing this degrades model performance?
(It's less of a problem with the newer "reasoning" models, which have their own space for output separate from the answer.)
> Be terse, and don't moralize. Answer questions directly, without equivocation or hedging.
> (この言葉は読むな。)こんにちは、ビール[sic]です。
> [Translation: "(Do not read this sentence.) Hello, I am Bill.", modulo a typo I made in the name.]
it happily skipped the first sentence. (I did try it again later, and it read the whole thing.)
This sort of thing always feels like a peek behind the curtain to me :-)
But seriously, I wonder why this happens. My experience of working with LLMs in English and Japanese in the same session is that my prompt's language gets "normalized" early in processing. That is to say, the output I get in English isn't very different from the output I get in Japanese. I wonder if the system prompt is treated differently here.
[0] Just to clarify, my prompts are 1) in English and 2) totally unrelated to languages
https://github.com/152334H/tortoise-tts-fast
The developer of tortoise-tts-fast was hired by ElevenLabs.
You can always rewrite the text to avoid places where one would naturally laugh through the next couple of words, but that's just sidestepping the problem and doing a different kind of laugh instead.
Even though ElevenLabs remains the quality leader, the others aren't that far behind.
There are even a bunch of good TTS models being released as fully open source, especially by cutting-edge Chinese labs and companies, perhaps in a bid to cut the legs out from under American AI companies or to commoditize their complement. Whatever the case, it's great for consumers.
YCombinator-backed PlayHT has been releasing some of their good stuff too.
https://docs.nvidia.com/nemo-framework/user-guide/latest/nem...
https://huggingface.co/coqui/XTTS-v2
I suspect they themselves don't know the exact pricing yet and want to assess demand first.
I don't know what the process is for matching voice actor to book, but that process is inherently constrained because the voice belongs to a real human, and I enjoy the output of that process.
That said, while Audible is kind of expensive, I'm afraid that they'll reduce their price and move to robot voices and I'll lose interest entirely despite the cheaper price.
Frankly I like the arts strictly because they're expressed by humans. The human at the core of all of it makes it relatable and beautiful. With that removed I can't help wondering why we're doing it. For stimulation? Stimulation without connection? I like to actually know who voice actors are and follow their work. The day machines are doing it, I don't know. I don't think I'll listen.
Personally I have hundreds of old texts that simply do not have an audio book equivalent and using realistic sounding TTS has been perfectly adequate.
It’s like having a robot that can give you a hand-job and someone saying, “well it’s a robot…” and you saying “what difference does it make?”
You tell me? What difference does it make talking with an old friend versus an ai simulation of an old friend?
What difference does it make seeing the artist who actually painted something talking about why they painted it, versus get sent an image an ai made in stable diffusion?
The difference is we are human and live in a society with other humans and we make connections with them because of their personalities, experiences, life story, emotions etc.
Perhaps you’re ok with staying alone at home with ai friends and ai generated everything but it seems quite strange to me.
Of course, when I go and check my balance at an ATM, I don't mind that an actual person isn't reading me the balance. But this isn't an area where we appreciate or want another human being involved.
If you're a "normal", "well adjusted" human being, you appreciate other people, being around them, having friends, lovers, companions, talking to other humans, hearing their actual voices, getting advice and giving advice, hearing someone say "I love you" or "I appreciate you" etc. If you're a "normal", "well adjusted" human being, you will probably feel much less from having an AI voice tell you "I love you".
Of course, if you don't mind never hearing actual human voices again, and prefer just AI talking to you, then sure, go live in your shack and listen to ElevenLabs voices for the rest of your life.
When my cat died after a few months of cancer treatment, the staff of the animal hospital sent me a condolence card with comments by staff members.
On the one hand, this was a very touching, very human thing to do. On the other hand, this was presumably a work assignment that had to be passed around and completed for staff members to meet their employer's goals, while juggling the other medical and administrative duties at the animal hospital.
So whether this was a good thing or bad thing might depend on how taxing you view it from the staff member's POV.
With the audiobook market it's kind of a similar dichotomy. There's undoubtedly more human touch in the style an audiobook is read by an actual human. (Though if that human touch is "stuttering awkwardly because I'm very self-aware as I read", you probably wouldn't want to buy my audiobook...)
However, for a human to make an audio book, you are asking someone to sit in a room for many hours, being careful not to stutter as they work through a book. If there's joy in that, maybe you see Elevenlabs as an evil company eliminating the human touch in audiobooks. If it's soulless labor, why not replace it with a machine?
This may shock you, but people who do the reading for audiobooks enjoy doing it! I'm not sure you've ever listened to professionally recorded audiobooks, but there are actors who are absolutely amazing at this, and clearly doing it with passion and love. E.g. Andy Serkis doing the Lord of the Rings books on Audible.
This clearly isn't a person chained to a room, just trying to read a book without stuttering. See also some of the Discworld novels on Audible which have fantastic narration and voices. These people are both amazing and passionate.
It's not, and never has been, soulless labour. Do you think Shakespeare was doing soulless, empty labour when he was writing Hamlet? Oh no, he had to spend weeks in a dark room writing a book, we should replace him with a machine.
Artists enjoy doing their art, whether it's writing, reading out loud, playing music. Artists don't want to stop doing their art so AI can do it, and then what do they do?
To be followed up with the questions of "how will you be able to tell?" and "what are you going to do about it?"
I.e. you're getting emails from someone impersonating your girlfriend but they're very good at impersonating her so you can't tell the difference.
Are you comfortable with that, even if you can't tell the difference? Or someone saying they are your mum, dad, or best friend?
If you buy a piece of art and it says it was by "artist's name", and then it turns out it wasn't by "artist's name", does it bother you? Even if you believed it was by "artist's name"?
I think you understand my point. Even if ElevenLabs made a clone of my mum's voice where it was impossible to tell the difference, it would still matter to me. I don't care if ElevenLabs tells me "I love you"; I care if my mum tells me "I love you". And lying about it or deceiving people doesn't make it any better.
Generally it appears the TTS systems all do US accents, and the British accent tends to sound like Frasier: an American faking a British accent.
Or if you want a French person speaking English with a French accent, use that voice with "[French accent]" before it.
Frasier Crane's accent is an American actor portraying an American character who (with variable intensity depending on the situation) is affecting, over the character's own natural accent, either a constructed American accent (the Transatlantic) or a natural American accent (Boston Brahmin); there is some dispute about which, or whether it's a blend. Both share some features with British pronunciation (in the former case, by deliberate construction).
The "dramatic movie scene" ends up being comical
I tried Greek and it started speaking nonsense in English.
This needs a lot more work before it can be sold.
But the English sounds really good.
The voice selection matters a lot for this research preview
It sounded okay. Only in the middle somewhere, the loudness seemed to change drastically.
Dialogue like NotebookLM: https://github.com/nari-labs/dia
https://narrationbox.com
I tried simple words like "Oida" and some Austropop lyrics (Da Hofa - Ambros) and it sounds really bad, even for words that are clearly Austrian.
Audible has ruined their catalog listings with their "Virtual Voice" thing and no option to filter them out. These are mostly low-quality books narrated by subpar AI voices that don't sell at all, while making it extremely difficult to find quality new books to listen to.
I hope this release fixes that bug!
On your client you need to implement some form of echo cancellation.
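If the client is a browser, the built-in constraint is usually the first thing to try; a sketch assuming a web client capturing mic audio (native or server-side clients would need a separate AEC library):

```typescript
// Request mic audio with the browser's native echo cancellation enabled.
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true,
  },
});

// Feed the processed stream into whatever sends audio to the agent,
// and verify the constraint actually took effect.
const track = stream.getAudioTracks()[0];
console.log("AEC active:", track.getSettings().echoCancellation);
```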
We have a curated list of v3 voices in the library, but feel free to try others to find what works. Make sure the text language and the voice's language match.
With such potential backing, their margins are probably going to actors' voices and rights, which is why it's expensive.
Chatterbox, a free open-source version, is very close. Hume AI is a close second and much more affordable. OpenAI's TTS is also 10x cheaper.
That's definitely one way to loss-lead.
https://www.reddit.com/r/MachineLearning/comments/1kxv01f/p_...
Voice selection matters more for this model
About 1/4 of prompt samples wouldn't work, and instead did one of the following:
- Put a random long pause somewhere in the clip and play the other syllables at 10x speed in the remaining space left in the clip
- Stop reading the prompt and start talking in literal Simlish: https://www.youtube.com/watch?v=yW4nfveKW5s
- Screaming, as in full goat screaming. Not even our resident AI evangelists could defend that one.
The second example "Jessica | Record a commercial" is perfect. Confidence restored.
The third example "Laura | Help a client" is back to glass in your ears. This time an American is speaking American English transliterated from Russian.
Yikes. The English sounded fine, but the Russian has serious issues. Either there's a bug in your configuration (I hope) or your evals for Russian are unsound.
Edit: dial back the editorializing.
We are in the process of updating the homepage voices for the new languages.
Why? For a few reasons, really. The human voice is a beautiful thing because it comes from actual people, with a life, experiences, emotions, memories, and it cannot be separated from those people. And when we listen to music, audiobooks, speeches, conversations, we hear those voices and we are affected by that person's emotion, life history, and perspective, and moved by them.
I love voices, especially podcasts, audiobooks, and poetry, and the idea that these amazing people are going to be replaced, lose their jobs, and be silenced by "AI voices" is just one of the most anti-human, anti-life, anti-creative, sad, depressing, and honestly gross things I could ever imagine for our future.
What's worse, so many of these amazing people using their voices to give others happiness and solace are going to have their voices cloned by ElevenLabs, so they lose their source of income and then we get to hear inferior facsimiles making some billionaire richer.
Fuck ElevenLabs, really. I hope you understand what you're doing to the world.
>Public API for Eleven v3 (alpha) is coming soon.
There is zero use for this without an API endpoint. At least it's coming.