Like others recently, I've been extremely impressed by LLMs' ability to play GeoGuessr, or, more generally, to geo-locate random snapshots that you give them, with what seem (to me) to be almost no context clues. (I gave ChatGPT loads of holiday snapshots, screenshotted to remove metadata, and it did amazingly.)
I assume that, with enough training, we could get similarly accurate guesses of a person's linguistic history from their voice data.
Obviously it would be extremely tricky for lots of people. For instance, many people think I sound English or Irish. I grew up in France to American parents who both went to Oxford and spent 15 years in England. I wouldn't be surprised, though, if a well-trained model could do much better on my accent than "you sound kinda Irish."
asveikau · 1h ago
Victor's problem isn't really the vowels or pacing. The final consonants are soft or not really audible. The most marked example is "long": I'm not hearing the /ŋ/, so it sounds closer to "law". In his "improved" recording he hasn't fixed this.
I sometimes see content on social media encouraging people to sound more native or improve their accent. But IMO it's perfectly ok to have an accent, as long as the speech meets some baseline of intelligibility. (So Victor needs to work on "long" but not "days".) I've even come across people who try to mimic a native accent but lose intelligibility, where they'd sound better with their foreign accent. (An example I've seen is a native Spanish speaker trying to imitate the American accent's intervocalic T and D, and I don't understand them. A Spanish /t/ or /d/ would sound different from most English-language accents, but would be way more understandable.)
anadalakra · 58m ago
"If Victor wanted to move beyond this point, the sound-by-sound phonetic analysis available in the BoldVoice app would allow him to understand the patterns in pronunciation and stress that contribute to Eliza’s accent and teach him how to apply them in his own speech."
Indeed Victor would likely receive a personalized lesson and practice on the NG sound on the app.
pjc50 · 1h ago
What the vector-space data gets right, and what the human commentary tends not to, is the idea that accents are a complex statistical distribution. You should be careful about the concept of a "default" or "neutral" accent. Telecommunications has spent the 20th century flattening accents together, as has accent discrimination. There's always the tendency for people to say "my accent is the neutral standard against which all others should be measured".
ilyausorov · 39m ago
For sure, and I don't think we ever use the term default or neutral. "The American English accent of our expert accent coach Eliza" is just that -- it's one accent.
As a learning platform that provides instruction to our users, we do need to set some kind of direction in our pedagogy, but we 100% recognize that there isn't just 1 American English accent, and there's lots of variance.
lurk2 · 1h ago
> There's always the tendency for people to say "my accent is the neutral standard against which all others should be measured".
You can measure this by mutual intelligibility with other accent groupings.
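As a toy illustration of that idea (the accent groupings and every number below are invented), you could score each accent by how well listeners from the other groups understand it, and treat the best-understood accent as the most "central":

    # Toy sketch: rank accents by average intelligibility to listeners from
    # other accent groups. All scores here are invented for illustration.
    import numpy as np

    accents = ["GenAm", "RP", "Scottish", "Indian", "Nigerian"]
    # intelligibility[i, j] = fraction of listeners with accent i
    # who report understanding speakers with accent j
    intelligibility = np.array([
        [1.00, 0.95, 0.80, 0.85, 0.80],
        [0.95, 1.00, 0.85, 0.85, 0.80],
        [0.85, 0.90, 1.00, 0.75, 0.70],
        [0.90, 0.90, 0.70, 1.00, 0.80],
        [0.85, 0.85, 0.70, 0.80, 1.00],
    ])

    # How well each accent is understood by everyone else (exclude self-scores)
    understood_by_others = (intelligibility.sum(axis=0) - 1.0) / (len(accents) - 1)
    for name, score in zip(accents, understood_by_others):
        print(f"{name}: {score:.2f}")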
georgewsinger · 2h ago
This is so cool. Real-time accent feedback is something language learners have never had throughout all of human history, until now.
Along similar lines, it would be useful to map a speaker's vowels in vowel-space (and likewise for consonants?) to compare native to non-native speakers.
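A rough sketch of what that vowel-space mapping could look like, assuming formants are estimated with LPC via librosa (the file names and segment times below are made up; serious formant tracking would want something more careful, e.g. Praat):

    # Estimate F1/F2 of a vowel segment via LPC and place speakers in vowel space.
    import numpy as np
    import librosa
    import matplotlib.pyplot as plt

    def first_formants(path, start_s, dur_s, sr=16000, n=2):
        """Crude estimate of the first n formant frequencies of a vowel segment."""
        y, sr = librosa.load(path, sr=sr, offset=start_s, duration=dur_s)
        y = np.append(y[0], y[1:] - 0.97 * y[:-1])       # pre-emphasis
        y = y * np.hamming(len(y))
        a = librosa.lpc(y, order=2 + sr // 1000)         # LPC of the windowed vowel
        roots = [r for r in np.roots(a) if np.imag(r) > 0]
        freqs = sorted(np.angle(roots) * sr / (2 * np.pi))
        return [f for f in freqs if f > 90][:n]          # drop the spectral-tilt pole

    # Hypothetical recordings of the same vowel by a native speaker and a learner
    nat_f1, nat_f2 = first_formants("native_beat.wav", 0.40, 0.12)
    lrn_f1, lrn_f2 = first_formants("learner_beat.wav", 0.35, 0.12)

    # Conventional vowel chart: F2 on x, F1 on y, both axes reversed
    plt.scatter([nat_f2, lrn_f2], [nat_f1, lrn_f1])
    plt.gca().invert_xaxis(); plt.gca().invert_yaxis()
    plt.xlabel("F2 (Hz)"); plt.ylabel("F1 (Hz)")
    plt.show()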
I can't wait until something like this is available for Japanese.
ilyausorov · 37m ago
That's a fascinating idea! Definitely something for our team to try out. We actively and continuously run all sorts of experiments with our machine learning models to extract the most useful insights. We will definitely share if we find something useful here.
pjc50 · 1h ago
> something language learners have never had throughout all of human history
... unless they had access to a native speaker and/or vocal coach? While an automated Henry Higgins is nifty, it's not something humans haven't been able to do themselves.
Unearned5161 · 19m ago
I'm always very entertained when I'm talking with someone and pick up on some very slight deviation from the "norm" in their accent. I think it shows two things: that it's near impossible to totally wipe that fingerprint of a past tongue, and that our ears are incredibly adept pieces of tooling.
ccppurcell · 1h ago
Oh pssh. There's no such thing as accent strength. There's only accent distance. Accent strength is just an artefact of distance from the accent of a socially dominant group.
dmurray · 37m ago
The article defines accent strength in precisely this way, as the difference "relative to native speakers of English".
That group has a vast range of accents, but it's believable that that range occupies an identifiable part of the multi-dimensional accent space, and has very little overlap with, for example, beginner ESL students from China.
Even between native speakers, I bet you could come up with some measure of centrality and measure accent strength as a distance from that. And if languages exist upon a continuum, there must be some point on that continuum where you are no longer speaking English, but say Scots or Frisian or Nigerian Creole instead. Accents close to those points are objectively stronger.
But there is a lot of freedom in how you measure centrality - if you weight by number of speakers, you might expect to get some mid-American or mid-Atlantic accent, but wind up with the dialect of semi-literate Hyderabad call centre workers.
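A minimal sketch of that kind of centrality measure (the embeddings and weights below are synthetic placeholders, not BoldVoice's actual model or data):

    # Accent strength as distance from a (weighted) centroid of native-speaker
    # embeddings. How you choose the weights is exactly the freedom noted above.
    import numpy as np

    rng = np.random.default_rng(0)
    native_embeddings = rng.normal(size=(500, 64))    # placeholder 64-dim accent vectors
    speaker_weights = rng.uniform(1, 10, size=500)    # e.g. weight by population size

    centroid = np.average(native_embeddings, axis=0, weights=speaker_weights)

    def accent_strength(embedding):
        """Distance from the weighted centre of the native-speaker cloud."""
        return float(np.linalg.norm(embedding - centroid))

    learner = rng.normal(loc=0.8, size=64)            # a hypothetical learner vector
    print(accent_strength(learner))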
joshuaissac · 29m ago
> relative to native speakers of English
> Even between native speakers, I bet you could come up with some measure of centrality and measure accent strength as a distance from that
Is that what BoldVoice is actually doing? At least from what the article is saying, it is measuring the strength of the user's American English accent (maybe GenAm?), and there is no discussion of any selectable choice for which native accent to target.
dmurray · 16m ago
> Is that what BoldVoice is actually doing?
No, I don't think it is doing that; I'm just taking issue with ccppurcell, who seems to believe that any definition of accent strength is chauvinistic.
ilyausorov · 31m ago
Indeed, although the model's inference output is based on the rating inputs we trained it on. Those ratings were done by native American English speakers, so this iteration of the model is centered on those accents more than, e.g., UK, Australian, or other accents of English from outside the US.
ilyausorov · 38m ago
Sure, that's fair. We apply labels that have a connotation of strength based on the distance, but the underlying calculation is indeed based on distance.
semiquaver · 38m ago
What a silly nitpick. You’re just using different words to say the same thing.
adhsu01 · 1h ago
Super cool work, congrats BoldVoice team! I've always thought that one of the non-obvious applications of voice cloning/matching is the ability to show a language learner what they would sound like with a more native accent.
ilyausorov · 39m ago
This and more exciting features are coming to the BoldVoice app soon!
oscar120 · 1h ago
this^
fxtentacle · 1h ago
What a great AI use-case! At first, I felt excited ...
But then I read their privacy policy. They want permission to save all of my audio interactions for all eternity. It's so sad that I will never try out their (admittedly super cool) AI tech.
anadalakra · 1h ago
You can reach out and request your data to be deleted at any time.
fxtentacle · 1h ago
"if you wish to opt out of future collection of voice samples, you may do so by disabling voice-related features in the BoldVoice app. Please note that this may limit the functionality of certain services."
Yeah, I can opt out. By not using any voice-related feature in their voice training app.
anadalakra · 1h ago
If you're still actively using the app, the voice will be retained and processed so that you can receive instant feedback, and also so that you receive additional personalized practice items and video lessons based on your speech needs. If you don't want the samples saved "in perpetuity", you can request them to be deleted once you decide that you're done with the application. Hope this helps!
childintime · 29m ago
I didn't find International English; that would have been interesting.
Also, the US writing convention falls short, like "who put the dot inside the string." Crazy. Rational people "put the dot after the string". No spelling corrector should change that.
vessenes · 1h ago
This is super cool.
A suggestion and some surprise: I’m surprised by your assertion that there’s no clustering. I see the representation shows no clustering, and believe you that there is therefore no broad high-dimensional clustering. I also agree that the demo where Victor’s voice moves closer to Eliza’s sounds more native.
But, how can it be that you can show directionality toward “native” without clustering? I would read this as a problem with my embedding, not a feature. Perhaps there are some smaller-dimensional sub-axes that do encode what sort of accent someone has?
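For what it's worth, a toy illustration (synthetic data, not the BoldVoice embedding) of how a "toward native" direction can exist without visible clusters: two heavily overlapping point clouds still define a difference-of-means axis, and projections onto that axis separate the groups on average even though a 2-D scatter would show a single blob.

    # Two overlapping Gaussian clouds: no obvious clusters, but a usable direction.
    import numpy as np

    rng = np.random.default_rng(42)
    dim = 64
    native = rng.normal(loc=0.0, scale=1.0, size=(1000, dim))
    non_native = rng.normal(loc=0.3, scale=1.0, size=(1000, dim))  # overlaps heavily per axis

    # "Native direction": difference of the group means (an LDA-like sub-axis)
    direction = native.mean(axis=0) - non_native.mean(axis=0)
    direction /= np.linalg.norm(direction)

    # Projections along this axis still order speakers by nativeness on average
    print((native @ direction).mean(), (non_native @ direction).mean())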
Suggestion for the BoldVoice team: if you’d like to go viral, I suggest you dig into American idiolects — two that are hard not to talk about / opine on / retweet are AAVE and Gay male speech (not sure if there’s a more formal name for this, it’s what Wikipedia uses).
I’m in a mixed race family, and we spent a lot of time playing with ChatGPT’s AAVE abilities which have, I think sadly, been completely nerfed over the releases. Chat seems to have no sense of shame when it says speaking like one of my kids is harmful; I imagine the well intentioned OpenAI folks were sort of thinking the opposite when they cut it out. It seems to have a list of “okay” and “bad” idiolects baked in - for instance, it will give you a thick Irish accent, a Boston accent, a NY/Bronx accent, but no Asian/SE Asian accents.
I like the idea of an idiolect-manager, something that could help me move my speech more or less toward a given idiolect. Similarly England is a rich minefield of idiolects, from scouse to highly posh.
I’m guessing you guys are aimed at the call center market based on your demo, but there could be a lot more applications! Voice coaches in Hollywood (the good ones) charge hundreds of dollars per hour, so there’s a valuable if small market out there for much of this. Thanks for the demo and write up. Very cool.
BalinKing · 1h ago
(Minor nitpick, but I think "dialect" is a more appropriate word than "idiolect" here—at least according to Wikipedia, "idiolect" refers to a single person's way of speaking, whereas AAVE et al. are shared and are therefore considered dialects.)
vessenes · 57m ago
OK, good read for me here. Based on your feedback and some research, I think I should have used ‘sociolect’ for both, in that I was complaining less about ChatGPT’s unwillingness to use, say, finna, in a sentence, and more about the vocalized accents. Anyway, good catch, thanks!
retrac · 37m ago
Sociolect is the right term for a dialect used by a particular social group. A related idea is "register" when multiple related and mutually understandable standards exist, and are used in different contexts.
pjc50 · 1h ago
> It seems to have a list of “okay” and “bad” idiolects baked in
We're back to "AI safety actually means brand safety": inept pushback against being made into an automated racism factory with their name on it.
vessenes · 1h ago
100%
treetalker · 2h ago
This is cool and one of the applications of LLMs that I'm actually looking forward to: accent training when acquiring a new language, particularly hearing what you would sound like without an accent!
That said, I found the recording of Victor's speech after practicing with the recording of his own unaccented voice to be far less intelligible than his original recording.
Looking forward to seeing the developments in this particular application.
ilyausorov · 45m ago
Fair point! When Victor sped up to match Coach Eliza's pace, the result sounded somewhat less accented, but a few parts of the phrase did become less intelligible. 10 minutes of practice is only a start, after all.
Interesting to note that we're also developing a separate measure of intelligibility, which will give a sense of how intelligible, as opposed to accented, a speech sample is.
sonny3690 · 29m ago
This is some insanely cool work. It's going to help so many people.
wbroo · 1h ago
Very interesting! Have you tested for other factors like speaking speed, emotional tone, or microphone quality to see what else is (or isn’t) influencing model perception?
ilyausorov · 43m ago
For sure we did! The training data we used for this was purposely highly varied to account for these factors, so that they don't cause too much bias in the model. But there's always some error rate, however good you make the model. We keep improving!
Goofy_Coyote · 56m ago
Glad to see BoldVoice here.
I’ve been using it for a few months, and I can confirm it’s working.
ilyausorov · 42m ago
Happy to see a happy BoldVoice user. Please don't hesitate to reach out to our team with feedback or thoughts on how we can continue to improve your learning journey. Helping you succeed is our #1 priority!
joshjhargreaves · 1h ago
Damn, this is really cool.
oscar120 · 1h ago
thanks!
mckirk · 2h ago
Is it just me, or did the sound files get hugged-to-death?