
I read the comments praising these voices as very lifelike, and went to the page primed to hear very convincing voices. That is not at all what I heard, though.

The voices are decent, but the intonation is off on almost every phrase, and there is a very clear robotic-sounding modulation. It's generally very impressive compared to many text-to-speech solutions from a few years ago, but for today, I find it very uninspiring. The AI generated voice you hear all over YouTube shorts is at least as good as most of the samples on this page.

The only part that seemed impressive to me was the English + (Mandarin?) Chinese sample; that one seemed to switch very seamlessly between the two. But this may well be simply because (1) I'm not familiar with any Chinese language, so I couldn't really judge the pronunciation of that, and (2) the different character systems make it extremely clear that the model needs to switch between different languages. Peut-être que cela n'aurait pas été si simple if it had been switching between two languages using the same writing system - I'm particularly curious how it would have read "simple" in the phrase above (I think it should be read with the French pronunciation, for example).

And, of course, the singing part is painfully bad; I am very curious why they even included it.

IshKebab · 1h ago

I agree. For some reason the female voices are waaay more convincing than the male ones too, which sound barely better than speech synthesis from a decade ago.

mclau157 · 26m ago

ElevenLabs has a much more convincing voice model

rcarmo · 1h ago

One of the things this model is actually quite good at is voice cloning. Drop a recorded sample of your voice into the voices folder, and it just works.

echelon · 41m ago

This is close to SOTA emotional performance, at least the female voices.

I trust the human scores in the paper. At least my ear aligns with that figure.

With stuff like this coming out in the open, I wonder if ElevenLabs will maintain its huge ARR lead in the field. I really don't see how they can continue to maintain a lead when their offering is getting trounced by open models.

MengerSponge · 1h ago

> (1) I'm not familiar with any Chinese language, so I couldn't really judge the pronunciation of that

https://en.wikipedia.org/wiki/Gell-Mann_amnesia_effect

giancarlostoro · 2h ago

I really hope someone within Microsoft is naming their open source coding agent Microsoft VibeCode. Let this be a thing. It's either that or "Lo", then you can have Lo work with Phi, so you can Vibe code with Lo Phi.

https://techcommunity.microsoft.com/blog/azure-ai-foundry-bl...

simiones · 1h ago

Knowing the history of Microsoft marketing, it will either be called something like "Microsoft Copilot Code Generator for VSCode" or something like "Zunega"...

giancarlostoro · 1h ago

Well don't forget "Microsoft SQL" ;) They'll name something as though they invented it and now have the worst possible way to google it.

kelvinjps10 · 1h ago

For me it doesn't sound like they invented it, but that it's Microsoft's version of SQL. Idk, but I hate Microsoft's version of anything.

loloquwowndueo · 1h ago

“Microsoft Word” haha reminds me of the old joke: “Microsoft Works” is an oxymoron.

giancarlostoro · 1h ago

Oh my goodness, I forgot about "Microsoft Works". You just shot me back in time to the 2000s.

esafak · 1h ago

You misquoted Microsoft "Works"

polytely · 1h ago

GitHub Dotnet Copilot Code Generator for VSC (new)

datadrivenangel · 1h ago

(preview)

airstrike · 1h ago

Now I need a new project just so I can call it Zunega... lmao

TheAceOfHearts · 43m ago

Unfortunately it's not usable if you're GPU-poor. Couldn't figure out how to run this with an old 1080. I tried VibeVoice-1.5B on my old CPU with torch.float32 and it took 832 seconds to generate a 66 second audio clip. Switching from torch.bfloat16 also introduced some weird sound artifacts in the audio output. If you're GPU-poor the best TTS model I've tried so far is Kokoro.
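
For scale, the CPU timing above can be expressed as a real-time factor (seconds of compute per second of audio produced). This is only a minimal sketch of the arithmetic; the function name is made up for illustration and is not part of VibeVoice or PyTorch.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute spent per second of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Numbers quoted in the comment above: 832 s to generate a 66 s clip.
rtf = real_time_factor(832, 66)
print(f"about {rtf:.1f}x slower than real time")
```

Anything above 1.0 means the model cannot keep up with playback, so at this rate a ten-minute clip would take a little over two hours on that CPU.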

Someone else mentioned in this thread that you cannot add annotations to the text to control the output. I think for these models to really level up there will have to be an intermediate step that takes your regular text as input and it generates an annotated output, which can be passed to the TTS model. That would give users way more control over the final output, since they would be able to inspect and tweak any details instead of expecting the model to get everything correctly in a single pass.

tempodox · 18m ago

This is ludicrous. macOS has had text-to-speech for ages with acceptable quality, and they never needed energy- and compute-expensive models for it. And it reacts instantly, not after ridiculous delays. I cannot believe this hype about “AI”, it’s just too absurd.

strangescript · 1h ago

The male voices seem much worse than the female voices, borderline robotic. Every sample on their website starts with a female voice. They clearly are aware of the issue.

jsomedon · 1h ago

I felt the same, male voice feels kinda artificial.

aargh_aargh · 1h ago

Is there a current, updated list (ideally, a ranking) of the best open weights TTS models?

I'm actually more interested in STT (ASR) but the choices there are rather limited.

xnx · 1h ago

Click leaderboard in the hamburger menu: https://huggingface.co/spaces/TTS-AGI/TTS-Arena-V2

prophesi · 10m ago

Is there a way to filter out hosted models? The top three winners currently are all proprietary as far as I can tell.

edit: Ah, there's a lock icon next to the name of each proprietary model.

malnourish · 2h ago

This is clearly high quality but there's something about the voices, the male voices in particular, which immediately register as computer generated. My audio vocabulary is not rich enough to articulate what it is.

heeton · 2h ago

I'm no audio engineer either, but those computer voices sound "saw-tooth"y to me.

From what I understand, it's the more basic models/techniques that are undersampling, so there is a series of audio pulses which give it that buzzy quality. Better models produce smoother output.

https://www.perfectcircuit.com/signal/difference-between-wav...

codebastard · 2h ago

I would describe it as blocky: if we visualise the sound wave, it seems to be without peaks, cut off at the top and bottom, producing a metallic, boxy echo.

jofzar · 1h ago

Yeah, it sounds super low bitrate to me, reminds me of someone on a Bluetooth microphone.

lvncelot · 2h ago

After hearing them myself, I think I know what you mean. The voices get a bit warbly and sound at times like they are very mp3-compressed.

ehutch79 · 27m ago

The examples are kind of off-putting. We're definitely in uncanny valley territory here.

faxmeyourcode · 1h ago

I tried the colab notebook that they link to and couldn't replicate the quality for whatever reason. I just swapped out the text and let it run on the introduction paragraph of Metamorphosis by Franz Kafka and it seemingly could not handle the intricacies.

ml_basics · 37m ago

what's the relationship between this work and the recently announced voice models from Microsoft AI? https://microsoft.ai/news/two-new-in-house-models/

rafaelmn · 2h ago

The Spontaneous Emotion dialog sounds like a team member venting through LLMs.

They could have skipped the singing part, it would be better if the model did not try to do that :)

kridsdale1 · 28m ago

It did get me to look up the song [1] again though, which is a great stimulator of emotion. The robot singing has a long way to go.

1. https://music.youtube.com/watch?v=xl8thVrlvjI&si=dU6aIJIPWSs...

eibrahim · 2h ago

Hahahah. That's what I thought too

regularfry · 2h ago

Ok, this is nit-picking, but it's very obvious that the sample voices these were trained with were captured in different audio environments. There's noticeable reverb on the male voice that's not there on the other.

So that's a useful next step: for multi-voice TTS models, make them sound like they're in the same room.

glenstein · 2h ago

Very good and I could see how I might believe they are real people if I let my guard down. The male voice sounded a little sedated though and there was a smoothness to it that could be samey over long stretches.

Still not at the astonishing level of Google Notebook text to speech which has been out for a while now. I still can't believe how good that one is.

swiftcoder · 47m ago

Ah, yes, the Furious 7 soundtrack. Definitely something everyone recalls

closewith · 41m ago

The most popular song of the year from one of the most popular movie franchises that had been in the global news due to the death of its star. Probably the most memorable song from a soundtrack of the century so far.

wewewedxfgdf · 2h ago

I'm really hoping one day there will be a TTS that does really nice British accents - I've surveyed them all deeply, none do.

Most that claim to do a British accent end up sounding like Kelsey Grammer - sort of an American accent pretending to be British.

specproc · 2h ago

I'd like one that really nails Brummie.

baal80spam · 2h ago

Wow. I admit that I am not a native speaker, but this looks (or rather, sounds) VERY impressive and I could mistake it for hearing two people talking.

x187463 · 2h ago

The giveaway is they will never talk over each other. Only one speaker at a time, consistently.

kridsdale1 · 27m ago

And longer pause between turns than humans would do.

tracker1 · 1h ago

Fair enough... though it would be possible to generate that by introducing stuttering/pauses at the beginning and end of statements, then editing the output to overlay the speech.

Would probably want to do something similar with balance/crossfade anyway... having each speaker's input offset from center instead of straight mono.
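
A rough sketch of that post-processing idea, assuming each turn has already been rendered as a list of mono float samples. The function names, the constant-power pan law, and the fixed overlap are all illustrative choices, not anything VibeVoice provides.

```python
import math


def pan_gains(pan: float) -> tuple[float, float]:
    """Constant-power pan: -1.0 is hard left, 0.0 is center, +1.0 is hard right."""
    angle = (pan + 1.0) * math.pi / 4.0  # maps [-1, 1] onto [0, pi/2]
    return math.cos(angle), math.sin(angle)


def mix_turns(turns, sample_rate, overlap_seconds):
    """Mix mono turns into stereo frames, starting each turn slightly
    before the previous one ends so the speakers talk over each other.

    turns: list of (samples, pan) pairs, where samples is a list of floats.
    Returns a list of (left, right) tuples.
    """
    overlap = int(overlap_seconds * sample_rate)

    # Work out where each turn starts on the shared timeline.
    starts = []
    cursor = 0
    for samples, _pan in turns:
        start = max(0, cursor - overlap)
        starts.append(start)
        cursor = start + len(samples)

    total = max((s + len(t[0]) for s, t in zip(starts, turns)), default=0)
    left = [0.0] * total
    right = [0.0] * total

    # Sum every turn into the stereo buffers with its own pan position.
    for start, (samples, pan) in zip(starts, turns):
        gain_l, gain_r = pan_gains(pan)
        for i, x in enumerate(samples):
            left[start + i] += x * gain_l
            right[start + i] += x * gain_r

    return list(zip(left, right))


# Two tiny fake turns, panned slightly left and slightly right.
speaker_a = [1.0] * 4
speaker_b = [1.0] * 4
frames = mix_turns([(speaker_a, -0.3), (speaker_b, 0.3)],
                   sample_rate=2, overlap_seconds=1.0)
print(len(frames))  # 6 frames: 4 + 4 minus 2 overlapping samples
```

Real audio would also need clipping protection where the turns sum, and the overlap would want to vary per turn rather than stay fixed.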

kaptainscarlet · 1h ago

Also the lack of stutter and perfect flow of speech are a dead giveaway

tracker1 · 2h ago

Yeah, a lot of the TTS has gotten really impressive in general. Definitely a clear leap from the TTS stuff I worked with for training simulations a bit over a decade ago. Aside: Installing a sound card (unused) on a Windows server just to be able to generate TTS was interesting. It was required by the platform, even if it wasn't used for it.

I generally don't like a lot of the AI generated slop that's starting to pop up on YouTube these days... I do enjoy some of the reddit story channels, but have completely stopped with it all now. With the AI stuff, it really becomes apparent with dates/ages and when numbers are spoken. Dates/ages/timelines are just off as far as story generation, and really should be human tweaked. As to the voice gen, saying a year or measurement is just not how English speakers (US or otherwise) speak.

ementally · 55m ago

they vibecoded their demo website? the text is invisible on Firefox.

throwaw12 · 2h ago

Will there be support for SSML, to have more control over the conversation?

egorfine · 2h ago

[deleted - I'm an idiot]

x187463 · 2h ago

Whisper is speech-to-text. VibeVoice is text-to-speech.

mpeg · 2h ago

There is a text-to-speech version of whisper, but IMHO the quality is much worse than the demos of this model.

x187463 · 2h ago

Are you referring to this?

https://github.com/WhisperSpeech/WhisperSpeech

Or is there some OpenAI official Whisper TTS?

mpeg · 2h ago

Yep, nothing official that I know, but that one is fairly popular so maybe they were referring to it (although AFAIK it's not frontier?)

egorfine · 2h ago

I stand corrected

anarticle · 38m ago

The first example sounds like a cry for help.

Some of them have tone wobbles which iirc was more common in early TTS models. Looks like the huge context window is really helping out here.

Havoc · 2h ago

MIT license - very nice!

ComputerGuru · 6m ago

The application of known FOSS licenses to what is effectively a binary-only release is misleading and borderline meaningless.

em-bee · 51m ago

what does that mean in this context? it seems to depend on an LLM. so can i run this completely offline? if i have to sign up and pay for an LLM to make it work, then it's not really more useful than any other non-free system

baxuz · 1h ago

Looking forward to the day when tts and speech recognition will work on Croatian, or other less prevalent languages.

It seems that it's only variants of English, Spanish and Chinese which are somewhat working.

lukax · 17m ago

Have you tried Soniox for speech recognition? It supports Croatian. Or are you just looking for self-hosted open-source models? Soniox is very cheap ($0.1/h for async, $0.12/h for real-time) and you get $200 free credits on signup.

https://soniox.com/

Disclaimer: I used to work for Soniox

amelius · 1h ago

I tried some TTS models a while ago, but I noticed that none of them allowed putting markup statements in the text. For example, it would be nice to do something like:

     Hey look! [enthusiastic] Should we tell the others? Maybe not ... [giggles]

etc.

In fact, I think this kind of thing is absolutely necessary if you want to use this to replace a voice actor.
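
A minimal sketch of how bracketed cues like the ones in the example above could be separated from the spoken text before synthesis. The bracket syntax and the "text"/"tag" token names are assumptions for illustration only; none of the models discussed here is claimed to accept this format.

```python
import re

# A cue is anything between square brackets, with no nested brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def tokenize_markup(text):
    """Split text into ("text", str) and ("tag", str) tokens, in order."""
    tokens = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            tokens.append(("text", chunk))
        tokens.append(("tag", match.group(1).strip()))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        tokens.append(("text", tail))
    return tokens


line = "Hey look! [enthusiastic] Should we tell the others? Maybe not ... [giggles]"
for kind, value in tokenize_markup(line):
    print(f"{kind}: {value}")
```

Whether a cue applies to the text before it or after it is left open here; a real system would have to pick one convention and document it.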

data-ottawa · 35m ago

ElevenLabs has some models with support for that.

https://elevenlabs.io/blog/v3-audiotags

viggity · 1h ago

I feel like this is a step in the right direction, but a lot of emotive text-to-speech models are only changing the duration and loudness of each word; the timing/pauses are better too.

I would love to have a model that can make sense of things like stressing particular syllables or phonemes to make a point.

Microsoft VibeVoice: A Frontier Open-Source Text-to-Speech Model (microsoft.github.io)

Microsoft VibeVoice: A Frontier Open-Source Text-to-Speech Model

Comments (63)