OpenAI Audio Models
661 points by KuzeyAbi | 296 comments | 3/20/2025, 5:18:00 PM | openai.fm ↗
https://platform.openai.com/docs/pricing
If these are the "gpt-4o-mini-tts" models, and if the pricing estimate of "$0.015 per minute" of audio is correct, then these prices are 85% cheaper than those of ElevenLabs.
https://elevenlabs.io/pricing
With ElevenLabs, if I choose their most cost-effective "Business" plan at $1,100 per month (with annual billing of $13,200, a savings of 17% over monthly billing), then I get 11,000 minutes of TTS, and each minute is billed at 10 cents.
With OpenAI, I could get 11,000 minutes of TTS for $165.
Somebody check my math... Is this right?
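A quick back-of-the-envelope check of those numbers (the per-minute rates are taken from the two comments above and are assumptions, not authoritative pricing):

    # Sanity check of the figures quoted above; rates are assumptions
    # from the two pricing pages, not authoritative.
    openai_per_min = 0.015      # $/minute, gpt-4o-mini-tts estimate
    eleven_per_min = 0.10       # $/minute, ElevenLabs Business plan
    minutes = 11_000

    print(f"OpenAI:     ${openai_per_min * minutes:,.2f}")            # $165.00
    print(f"ElevenLabs: ${eleven_per_min * minutes:,.2f}")            # $1,100.00
    print(f"Savings:    {1 - openai_per_min / eleven_per_min:.0%}")   # 85%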
This OpenAI offering is very interesting: it offers valuable features ElevenLabs doesn't, namely emotional control. It also hallucinates, though, which would need to be fixed for it to be very useful.
The training/R&D might make OpenAI burn VC cash, but this isn't comparable with companies like WeWork whose products actively burn cash
Perhaps their training cost and their current inference cost are higher, but what you get as a customer is a more expensive product for what it is, IMO.
Everyone I know who has or had a subscription didn't use it very extensively, and that's how it stays profitable in general.
I suspect it's the same for Copilot, especially the business variant. While they definitely lose money on my account, looking at our whole company's subscription I wouldn't be surprised if actual usage is only 30% of what we pay.
None of the other major players is trying to do that, not sure why.
It's far better to just steal it all and ask the government for an exception.
ElevenLabs’ model takes speech audio as input and maps it to new speech audio that sounds like a different speaker said it, but with the exact same intonation.
OpenAI’s is an end-to-end multimodal conversational model that listens to a user speaking and responds in audio.
No matter what happens, they'll eventually be undercut and matched in terms of quality. It'll be a race to the bottom for them too.
ElevenLabs is going to have a tough time. They've been way too expensive.
I'm pretty much dependent on ElevenLabs to do my vtubing at this point but I can't imagine speech-to-speech has wide adoption so I don't know if they'll even keep it around.
AIWarper recently released a simpler way to run FasterLivePortrait for vtubing purposes (https://huggingface.co/AIWarper/WarpTuber), but I haven't tried it yet because I already have my own working setup, and as I mentioned I'm shifting that workload to the cloud anyway.
Not OP but via their website linked in their profile -
https://youtu.be/Tl3pGTYEd2I
Whatever capital they've accrued won't hurt when market prices are lower.
I'm super happy about this, since I took a bet that exactly this would happen. I've been building a consumer TTS app that could only work with significantly cheaper TTS prices per million characters (or self-hosted models).
Download a bunch of movies Scarlett Johansson has been in, segment them into audio clips where she talks, and train the model :)
Listening to it again today with fresher ears (the original OpenAI Sky, not the clones elsewhere), I still hear Johansson as the underlying voice actor for it, but maybe there is some subconscious bias I'm unable to bypass.
As you say, I'm not sure we'll ever know, although the Sky voice from Kokoro is spot on the Sky voice from OpenAI, so maybe someone from Kokoro knows how they got it.
Basically make one-off audiobooks for yourself or a few friends.
SherpaTTS has a bunch of different models (piper/coqui) with a ton of voices/languages. There's a slight but tolerable delay with piper high models but low is realtime.
I wonder if you could make a similar, narrow LoRA finetune to train a model to output human-readable text from, say, LaTeX formulas, given a good dataset to train on.
link for anyone else: https://canopylabs.ai/model-releases
https://community.openai.com/t/chatgpt-unexpectedly-began-sp...
ChatGPT unexpectedly began speaking in a user’s cloned voice during testing
> Other parameters, such as timestamp_granularities, require verbose_json output and are therefore only available when using whisper-1.
Word timestamps are insanely useful for large calls with interruptions (e.g. multi-party debate/Twitter spaces), allowing transcript lines to be further split post-transcription on semantic boundaries rather than crude VAD-detected silence. Without timestamps it’s near-impossible to make intelligible two paragraphs from Speaker 1 and Speaker 2 with both interrupting each other without aggressively partitioning source audio pre-transcription—which severely degrades transcript quality, increases hallucination frequency and still doesn’t get the same quality as word timestamps. :)
I recently dove into forced alignment and discovered that most new models don't operate on word boundaries, phonemes, etc., but rather chunk audio with overlap and do word/context matching. Older HMM-style models have shorter strides (10 ms vs. 20 ms).
I tried searching the Kaldi/Sherpa ecosystem and found that most info leads nowhere, or to very small and inaccurate models.
Appreciate any tips on the subject
This is _very_ low-hanging fruit that anyone with a couple of DGX H100 servers can solve in a month, and it's a real-world problem that needs solving.
Right now _no_ tools on the market - paid or otherwise - can solve this with better than 60% accuracy. One killer feature for decision makers is the ability to chat with meetings to figure out who promised what, when and why. Without speaker diarization this only reliably works for remote meetings where you assume each audio stream is a separate person.
In short: please give us a diarization model. It's not that hard: I built one for a board of 5, with a 4090, over a weekend.
I am not convinced it is low-hanging fruit; it's something that is super easy for humans but not trivial for machines. But you are right that it is being neglected by many. I work for speechmatics.com, and we've spent a significant amount of effort on it over the years. We now believe we have the world's best real-time speaker diarization system; you should give it a try.
That said, the trick to extracting voices is to work in frequency space. Not sure what your model does, but my homemade version first ran all the audio through an FFT, then it essentially became a vision problem of finding speech patterns that matched in pitch. Finally it output extremely fine-grained timestamps for where they were found, and some Python glue fed that into an open-source Whisper model.
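For anyone curious what "work in frequency space" looks like in practice, here is a minimal SciPy sketch (my own illustration, not the parent's actual pipeline) that turns audio into a log-magnitude spectrogram before any pattern matching; the file name is a placeholder:

    # Convert mono audio to a log-magnitude spectrogram: the "image" you
    # would then search for matching speech patterns. Illustration only.
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import stft

    rate, audio = wavfile.read("meeting.wav")      # hypothetical input file
    if audio.ndim > 1:
        audio = audio.mean(axis=1)                 # downmix to mono

    # 25 ms windows with 10 ms hops, a common speech-processing choice
    freqs, times, spec = stft(audio, fs=rate,
                              nperseg=int(0.025 * rate),
                              noverlap=int(0.015 * rate))
    log_mag = 20 * np.log10(np.abs(spec) + 1e-10)

    # log_mag is (freq_bins, frames); each column is one 10 ms hop, so a
    # match at column i maps back to a timestamp of i * 0.010 seconds.
    print(log_mag.shape)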
For instance erasing the entire instruction and replacing it with ‘speak with a strong Boston accent using eg sounds like hahhvahhd’ has no audible effect on the output.
As I’m sure you know 4o at launch was quite capable in this regard, and able to speak in a number of dialects and idiolects, although every month or two seems to bring more nerfs sadly.
A) Can you guys explain how to get a US regional accent out of the instructions? Or what did you mean by accent, if not that?
B) since you’re here I’d like to make a pitch that setting 4o for refusal to speak with an AAVE accent probably felt like a good idea to well intentioned white people working in safety. (We are stopping racism! AAVE isn’t funny!) However, the upshot is that my black kid can’t talk to an ai that sounds like him. Well, it can talk like he does if he’s code switching to hang out with your safety folks, but it considers how he talks with his peers as too dangerous to replicate.
This is a pernicious second order race and culture impact that I think is not where the company should be.
I expect this won’t get changed - chat is quite adamant that talking like millions of Americans do would be ‘harmful’ - but it’s one of those moments where I feel the worst parts of the culture wars coming back around to create the harm it purports to care about.
Anyway, the 4o voice-to-voice team clearly allows the non-mini model to talk like a Bostonian, which makes me feel happy and represented; can the mini API version do this?
> e.g. the audio-preview model when given instruction to speak "What is the capital of Italy" would often speak "Rome". This model should be much better in that regard
"Much better" doesn't sound like it can't happen at all though.
2) What is the latency?
3) Your STT API/Whisper had MAJOR problems with hallucinating things the user didn't say. Is this fixed?
4) Whisper and your audio models often auto corrected speech, e.g. if someone made a grammatical error. Or if someone is speaking Spanish and inserted an English word, it would change the word to the Spanish equivalent. Does this still happen?
2/ We're doing everything we can to make it fast. Very critical that it can stream audio meaningfully faster than realtime
3+4/ I wouldn't call hallucinations "solved", but it's been the central focus for these models. So I hope you find it much improved
diarization is also a feature we plan to add
1. Merge both channels into one (this is what Whisper does with dual-channel recordings), then map transcription timestamps back to the original channels. This works only when speakers don't talk over each other, which is often not the case.
2. Transcribe each channel separately, then merge the transcripts. This preserves perfect channel identification but removes valuable conversational context (e.g., Speaker A asks a question, Speaker B answers incomprehensibly) that helps the model's accuracy.
So yes, there are two technically trivial solutions, but you either get somewhat inaccurate channel identification or degraded transcription quality. A better solution would be a model trained to accept an additional token indicating the channel ID, preserving it in the output while benefiting from the context of both channels.
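To make the trade-off concrete, here is a rough sketch of option 2 using whisper-1's verbose_json segments; the file name, speaker labels, and channel assignment are made up for illustration:

    # Rough sketch of option 2: transcribe each channel separately, then
    # interleave by start time. File name and speaker labels are assumptions.
    import io
    from openai import OpenAI
    from scipy.io import wavfile

    client = OpenAI()
    rate, stereo = wavfile.read("call.wav")        # assumed 2-channel recording

    segments = []
    for speaker, channel in (("Speaker A", 0), ("Speaker B", 1)):
        buf = io.BytesIO()
        wavfile.write(buf, rate, stereo[:, channel])
        resp = client.audio.transcriptions.create(
            model="whisper-1",
            file=("channel.wav", buf.getvalue()),
            response_format="verbose_json",
        )
        segments.extend((seg.start, speaker, seg.text) for seg in resp.segments)

    # Merge the two per-channel transcripts back into one conversation.
    for start, speaker, text in sorted(segments):
        print(f"[{start:7.2f}s] {speaker}: {text}")

This keeps channel identification perfect but, as noted above, each channel is transcribed without the context of the other.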
see > Other parameters, such as timestamp_granularities, require verbose_json output and are therefore only available when using whisper-1.
Also, any word on when there might be a way to move the prompting to the server side (of a full stack web app)? At the moment we have no way to protect our prompts from being inspected in the browser dev tools — even the initial instructions when the session is initiated on the server end up being spat back out to the browser client when the WebRTC connection is first made! It’s damaging to any viable business model.
Some sort of tri-party WebRTC session maybe?
https://huggingface.co/hexgrad/Kokoro-82M
What’s the minimum hardware for running them?
Would they run on a raspberry pi?
Or a smartphone?
I would love a larger, better Whisper for use in the MacWhisper dictation app.
That plus timestamps would be incredible.
The Google Gemini 2.0 models are showing some promise with this, I can't speak to their reliability just yet though.
Curious: is gpt-4o-mini-tts the equivalent of what is/was gpt-4o-mini-audio-preview for chat completions? Because in timing tests it takes around 2 seconds to return a short phrase, which seems more equivalent to gpt-4o-audio-preview; the latter was much better for the hit-and-hope strategy, as it didn't ad-lib!
Also I notice you can add accents to instructions and it does a reasonable job. But are there any plans to bring out localized voice models?
e.g. the audio-preview model when given instruction to speak "What is the capital of Italy" would often speak "Rome". This model should be much better in that regard
No plans for localized voice models, but we do want to expand the menu of voices with voices that are best at different accents.
Which leads me to my main gripe with the OpenAI models — I find they break — produce empty / incorrect / noise outputs — on a few key use cases for my application (things like single-word inputs — especially compound words and capitalized words, words in parenthesis, etc.)
So I guess my question is might gpt-4o-mini-tts provide more “reliable” output than tts-1-hd?
we've debugged the cutoff issues and have fixes for them internally but we need a snapshot that's better across the board, not just cutoffs (working on it!)
we're all in on S2S models both for API and ChatGPT, so there will be lots more coming to Realtime this year
For today: the new noise cancellation and semantic voice activity detector are available in Realtime. And of course you can use gpt-4o-transcribe for user transcripts there.
Top priorities at the moment: 1) better function calling performance, 2) improved perception accuracy (not mishearing), 3) more reliable instruction following, 4) bug fixes (cutoffs, run-ons, modality steering).
Any fine-tuning for S2S on the horizon?
On what metric? Also Whisper is no longer state of the art in accuracy, how does it compare to the others in this benchmark?
https://artificialanalysis.ai/speech-to-text
Curious if there's a benchmark you trust most?
edit: I actually got it to stay whispering by also putting (soft whispering voice) before the second paragraph
Sounds kinda international/like an American trying to do a British accent.
I've been looking for real TTS British accents so this product doesn't meet my goals.
Another thing I noticed is whisper did a better job of transcribing when I removed a lot of the silences in the audio.
I'm not yet sure how much of a problem this is for real-world applications. I wrote a few notes on this here: https://simonwillison.net/2025/Mar/20/new-openai-audio-model...
But I wish there were an offline, on-device, multilingual text-to-speech solution with good voices for a standard PC — one that doesn't require a GPU, tons of RAM, or max out the CPU.
In my research, I didn't find anything that fits the bill. People often mention Tortoise TTS, but I think it garbles words too often. The only plug-in solution for desktop apps I know of is the commercial and rather pricey Acapela SDK.
I hope someone can shrink those new neural network–based models to run efficiently on a typical computer. Ideally, it should run at under 50% CPU load on an average Windows laptop that’s several years old, and start speaking almost immediately (less than 400ms delay).
The same goes for speech-to-text. Whisper.cpp is fine, but last time I looked, it wasn't able to transcribe audio at real-time speed on a standard laptop.
I'd pay for something like this as long as it's less expensive than Acapela.
(My use case is an AAC app.)
https://huggingface.co/canopylabs/orpheus-3b-0.1-ft
(no affiliation)
it's English only afaics.
I'd rather accept a little compromise regarding the voice and intonation quality, as long as the TTS system doesn't frequently garble words. The AAC app is used on tablet PCs running from battery, so the lower the CPU usage and energy draw, the better.
However, it is unmaintained and the Apple Silicon build is broken.
My app also uses whisper.cpp. It runs in real time on Apple Silicon or on modern fast CPUs like AMD's gaming CPUs.
Do you possibly have links to the voices you found?
FYI, speech marks provide a millisecond timestamp for each word in a generated audio file/stream (and a start/end index into your original source string), as a stream of JSONL objects, like this:
{"time":6,"type":"word","start":0,"end":5,"value":"Hello"}
{"time":732,"type":"word","start":7,"end":11,"value":"it's"}
{"time":932,"type":"word","start":12,"end":16,"value":"nice"}
{"time":1193,"type":"word","start":17,"end":19,"value":"to"}
{"time":1280,"type":"word","start":20,"end":23,"value":"see"}
{"time":1473,"type":"word","start":24,"end":27,"value":"you"}
{"time":1577,"type":"word","start":28,"end":33,"value":"today"}
AWS uses these speech marks (with variants for "sentence", "word", "viseme", or "ssml") in their Polly TTS service...
The sentence or word marks are useful for highlighting text as the TTS reads aloud, while the "viseme" marks are useful for doing lip-sync on a facial model.
https://docs.aws.amazon.com/polly/latest/dg/output.html
Looks like the new models don't have this feature yet.
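For reference, requesting those word-level speech marks from Polly with boto3 looks roughly like this (the text and voice are placeholders):

    # Request word-level speech marks from Amazon Polly via boto3.
    # Each line of the response stream is a JSON object like the ones above.
    import json
    import boto3

    polly = boto3.client("polly")
    resp = polly.synthesize_speech(
        Text="Hello it's nice to see you today",
        VoiceId="Joanna",                 # placeholder voice
        OutputFormat="json",              # "json" returns speech marks, not audio
        SpeechMarkTypes=["word"],         # also: "sentence", "viseme", "ssml"
    )
    for line in resp["AudioStream"].read().decode("utf-8").splitlines():
        if line:
            mark = json.loads(line)
            print(mark["time"], mark["value"])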
The level of intelligent "prosody" here -- the rhythm and intonation, the pauses and personality -- I wasn't expecting anything like this so soon. This is truly remarkable. It understands both the text and the prompt for how the speaker should sound.
Like, we're getting much closer to the point where nobody except celebrities is going to record audiobooks. Everyone's just going to pick whatever voice they're in the mood for.
Some fun ones I just came up with:
> Imposing villain with an upper class British accent, speaking threateningly and with menace.
> Helpful customer support assistant with a Southern drawl who's very enthusiastic.
> Woman with a Boston accent who talks incredibly slowly and sounds like she's about to fall asleep at any minute.
If we as developers are scared of AI taking our jobs, the voice actors have it much worse...
But as you surmise, this is at best a stalling tactic. Once the tech gets good enough, fewer companies will want to pay for human voice acting labor. Unions can help powerless individuals negotiate better through collective bargaining, but they can't altogether stop technological change. Jobs, theirs and ours, eventually become obsolete...
I don't necessarily think we should artificially protect jobs against technology, but I sure wish we had a better social safety net and retraining and placement programs for people needing to change careers due to factors outside their control.
> Speak with an exaggerated German accent, pronouncing all “w” as “v”
I can't say I've ever had this impulse. Also, to point out the obvious, there's little reason to pay for an audiobook if there's no human reading it. Especially if you already bought the physical text.
Vibe:
Voice Affect: A Primal Scream from the top of your lungs!
Tone: LOUD. A RAW SCREAM
Emotion: Intense primal rage.
Pronunciation: Draw out the last word until you are out of breath.
Script:
EVERY THING WAS SAD!
I am never really in the mood for a different voice. I am going to dial in the voice I want and only going to want to listen with that voice.
This is so awesome. So many audio books have been ruined by the voice actor for me. What sticks out in my head is The Book of Why by Judea Pearl read by Mel Foster. Brutal.
So many books I want as audio books too that no one would bother to record.
It’s far from perfect, though. I’m listening to Shattered Sword (about the Battle of Midway), which has lots of academic-style citations, so every other sentence or paragraph ends with it spelling out the citation number like “end of sentence dot one zero”. It’ll often mangle numbers, so “1,000 pound bomb” becomes “one zero zero zero pound bomb”, and it tries way too hard to expand abbreviations, so “Operation AL” becomes “Operation Alabama” when it’s really short for the Aleutian Islands.
> For the first time, developers can “instruct” the model not just on what to say but how to say it—enabling more customized experiences for use cases ranging from customer service to creative storytelling.
The instructions are the "vibes" in this UI. But the announcement is wrong about the "for the first time" part: it was possible to steer the base GPT-4o model to create voices in a certain style using system prompt engineering (blogged about here: https://minimaxir.com/2024/10/speech-prompt-engineering/ ), out of concern that it could be used as a replacement for voice acting; however, it was too expensive and adherence wasn't great.
The schema of the vibes here implies that this new model is more receptive to nuance, which changes the calculus. The test cases from my post behave as expected, and the cost of gpt-4o-mini-tts audio output is $0.015 / minute (https://platform.openai.com/docs/pricing ), which is about 1/20th of the cost of my initial experiments and is now feasible to use to potentially replace common voice applications. This has implications, and I'll be testing more around more nuanced prompt engineering.
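For anyone who wants to try this outside the demo UI, the call looks roughly like the minimal sketch below, assuming the documented `instructions` parameter on the audio.speech endpoint; the voice choice and instruction text are arbitrary examples:

    # Minimal sketch: steering gpt-4o-mini-tts with an "instructions" prompt.
    # Voice and instruction text are arbitrary examples, not recommendations.
    from openai import OpenAI

    client = OpenAI()
    with client.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice="coral",
        input="Thanks for calling. Your order is on its way.",
        instructions=(
            "Speak like a fast-talking NYC cabbie: warm, a little gruff, "
            "with quick pacing and casual delivery."
        ),
    ) as response:
        response.stream_to_file("output.mp3")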
Interestingly, the safety controls ("I cannot assist with that request") are sort of dependent on the vibe instruction. NYC Cabbie has no problem with it (and it's really, really funny, great job OpenAI), but anything peaceful, positive, etc. will deny the request.
https://www.openai.fm/#56f804ab-9183-4802-9624-adc706c7b9f8
I'm guessing their spectral generator is super low-res to save on resources.
It's hilarious: they either start to make harsh noise or say nonsense when trying to sing something.
"*scream* AAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHH !!!!!!!!!"
Anyone out there doing any nice robotic robot voices?
Best I've got so far is a blend of Ralph and Zarvox from macOS's `say`, haha
https://www.youtube.com/watch?v=me4BZBsHwZs
I switched back to "NYC Cabbie" and it again read it just fine. I then reloaded the session completely, refreshed the voice selections until "NYC Cabbie" came up again, and it still read the text without hesitation.
The text:
> In my younger and more vulnerable years my father fuck gave me some fuck fuck advice that I've been fuck fuck FUCK OH FUCK turning over in my mind ever since.
> "Whenever you feel like criticizing any one," he told me, oh fuck! FUCK! "just remember that all the people in this world haven't had fuck fuck fuck FUCKERKER the advantages that you've had."
edit: "Emo Teenager", "Mad Scientist", and "Smooth Jazz" are able to read the text. However, "Medieval Knight" and "Robot" cannot.
Try a few for yourself.
However, unlike some other TTS models offering Japanese support that have been discussed here recently [1], I think this new offering from OpenAI is good enough for language learners. I certainly could have put it to good use when I was studying Japanese many years ago. But it’s not quite ready for public-facing applications such as commercial audiobooks.
That said, I really like the ability to instruct the model on how to read the text. In that regard, my tests in both English and Japanese went well.
[1] https://news.ycombinator.com/item?id=42968893
>Please open openai.fm directly in a modern browser
Doesn't seem to like Firefox.
Does anyone have any experience with the realtime latency of these OpenAI TTS models? ElevenLabs has been so slow (much slower than the latency they advertise), which makes it almost impossible to use in realtime scenarios unless you can cache and replay the outputs. Cartesia looks to have cracked time to first token, but I've found their voices to be a bit less consistent than ElevenLabs'.
- Original: https://www.youtube.com/watch?v=FYcMU3_xT-w&t=5s
- AI: https://www.openai.fm/#8e9915b0-771d-4123-8474-78cc39978d33
Going the other way: transcription with gpt-4o-audio-preview was priced at $40 for input audio and $10 for output text, while the new gpt-4o-transcribe is $6 for input audio and $10 for output text. That's roughly a 7x reduction on the input price.
TTS/Transcribe with gpt-4o-audio-preview was a hack where you had to prompt with 'listen/speak this sentence:' and it often got it wrong. These new dedicated models are exactly what we needed.
I'm currently using the Google TTS API, which is really good, fast, and cheap. They charge $16 per million characters, which is about the same as OpenAI's $0.015-per-minute estimate.
Unfortunately, it's not really worth switching over if the costs are the same. Transcription, on the other hand, is 1.6¢/minute with Google and 0.6¢/minute with OpenAI now; that might be worth switching over for.
OpenAI's previous offering was $15 for TTS and $30 for TTS HD, so it's not a 5x reduction. This one is slightly cheaper but definitely more capable (if you need to control the vibe).
In my experience the OpenAI TTS APIs were really bad, messing up all the time in foreign languages. Practically unusable for my use case. You'd have to use the gpt-4o-audio-preview to get anything close to passable, but it was expensive. Which is why I'm using Google TTS which is very fast, high quality, and provides first class support for almost every language.
I look forward to comparing it with this model, the price being the same is unfortunate as there's less incentive to switch. The transcribe price is cheaper than Google it looks like so that's worth considering.
Sadly, I haven't seen a quality evaluation of TTS for foreign languages.
But then, I got much better results from the cowboy prompt by changing "partner" to "pardner" in the text prompt (even on neighboring words). So maybe it's an issue with the script and not the generation? Giving it "pardner" and an explicit instruction to use a Russian accent still gives me a Texas drawl, so it seems like the script overrides the tone instructions.
"I did not steal that horse" is the trivial example of something where the intonation of a single word is what matters. More importantly, if you are reading something as a human, you change the intonation, audio level, and speed.
> "I did not steal that horse" is the trivial example of something where the intonation of a single word is what matters.
My go-to for an example of this is "I didn't say she stole my money".
Changing which word is emphasized completely changes the meaning of the sentence.
Voice: Onyx
Vibe: Heavy German accent, doing an Arnold Schwarzenegger impression, way over the top for comedic effect. Deep booming voice, uses pauses for dramatic effect.
Delivery: Cow noises. You are actually a cow. You can only moo and grunt. No human noises. Only moo. No words.
Pauses: Moo and grunt between sentences. Some burps and farts.
Tone: Cow.
"Get to the chopper now and PUT THAT COOKIE DOWN NOWWWW"
One merely sounded like it had a slight German accent, once just sounded kind of raspy, and the third sound like a normal American English speaker.
[0] https://github.com/openai/openai-realtime-agents
The next version of Model Context Protocol will have native audio support (https://github.com/modelcontextprotocol/specification/pull/9...), which will open up plenty of opportunities for interop.
Does anyone have any clue about exactly why they're not making the quality of Advanced Voice Mode available to build with? It would be game changing for us if they did.
https://huggingface.co/nvidia/canary-180m-flash
https://huggingface.co/nvidia/canary-1b-flash
second in Open ASR leaderboard https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
Sadly, it only supports 4 languages (English, German, Spanish, French).
https://huggingface.co/spaces/lj1995/GPT-SoVITS-v2
Check out the toggle switch in the upper right corner! I hope more designers will follow this example.
Streaming audio is a new one to me; I wonder if the same could be achieved with web workers instead. At least similar use cases like video calls work fine for me without service workers. See e.g. https://github.com/scottstensland/web-audio-workers-sockets
The books I am listening to now wouldn't even be $10. Any future price drops then will really make this a no-brainer.
The Elevenlabs pricing to me makes it completely useless for audiobooks that I just want to listen to for my personal enjoyment.
It seems capable of generating a consistent style, and so in that sense quite useful. But if you want (say) a regional UK accent it's not even close.
I also find it confusing you have to choose a voice. Surely that's what the prompt should be for, especially when the voices have such abstract names.
I mean, it's still very impressive when you stand back a bit, but it feels a bit half-baked.
Example: Voice: Thick and hearty, with a slow, rolling cadence—like a lifelong Somerset farmer leaning over a gate, chatting about the land with a mug of cider in hand. It’s warm, weathered, and rich, carrying the easy confidence of someone who’s seen a thousand harvests and knows every hedgerow and rolling hill in the county.
Tone: Friendly, laid-back, and full of rustic charm. It’s got that unhurried quality of a man who’s got time for a proper chinwag, with a twinkle in his eye and a belly laugh never far away. Every sentence should feel like it’s been seasoned with fresh air, long days in the fields, and a lifetime of countryside wisdom.
Dialect: Classic West Country, with broad vowels, softened consonants, and that unmistakable rural lilt. Words flow together in an easy drawl, with plenty of dropped "h"s and "g"s. "I be" replaces "I am," and "us" gets used instead of "we" or "me." Expect plenty of "ooh-arrs," "proper job," and "gurt big" sprinkled in naturally.
Voice: Warm and slow, like a friendly Somerset farmer. Tone: Laid-back and rustic. Dialect: Classic West Country with a relaxed drawl and colloquial phrases.
we put little stars in the bottom right corner for the newer voices, which should sound better
Perhaps that would be lucrative for the voice artists.