Alterego: Thought to Text

101 points by oldfuture | 9/8/2025, 9:17:50 PM | 67 comments | alterego.io

Comments (67)

stevage · 6h ago
The great thing about a product like this is that it's so easy to fake in a video.

I don't really buy that typing speed is a bottleneck for most people. We can't actually think all that fast. And I suspect AI is doing a lot of filling in the gaps here.

It might have some niche use cases, like being able to use your phone while cycling.

Bjartr · 6h ago
Personal anecdote: I do find typing to be a bottleneck in situations where typing speed is valuable (so notes in meetings, not when coding).

I can break 100wpm, especially if I accept typos. It's still much, much slower to type than I can think.

stevage · 4h ago
My experience with taking notes in meetings is definitely that the brain is the bottleneck, not the fingers. There are times when I literally just type what the person is saying, dictation style (i.e., recording a client's exact words, often helpful for reference, even later in the meeting). I can usually keep up. But if I'm trying to formulate original thoughts, or synthesise what I've heard, or find a way to summarise what they have been saying - that's where I fall behind, even though the total number of words I need to write is actually much smaller.

So this definitely wouldn't help me here. Realistically though, there ought to be better solutions like something that just listens to the meeting and automatically takes notes.

robofanatic · 5h ago
> notes in meetings

That’s already solved by AI, if you let AI listen to your meetings.

Feathercrown · 4h ago
I haven't found that to be very accurate. I suspect the internal idiosyncrasies of a company are an issue, as the AI doesn't have the necessary context.
aeroaero · 4h ago
Seems like it would be much easier to solve that problem than it would be to cross the brain barrier and start interfacing with our thoughts, no? Just provide some context on the company, etc.
j45 · 5h ago
Speech to text can be 130-200 wpm.

Also, keybr.com helps speed up typing if you were thinking about it.

dllthomas · 4h ago
Typing speed is very much a bottleneck when I'm washing dishes, at least.
w00ds · 5h ago
It's possible the demo is faked, and I'm skeptical. But I also don't think the speed is really the point of a device like this. Getting out a device, pressing keys or tapping on it, and putting it away again, those attentional costs of using some device... I know something like basic notetaking would feel really different to me if I was able to just do the thing in the demo at high accuracy instead. That's a big if, though - the accuracy would have to be high for it to really be useful, and the video is probably best-case clips.
com2kid · 3h ago
Pulling out my phone, unlocking it, opening my notes app, creating a new note, that is a bottleneck.

Pulling out my phone, unlocking it, remembering what the hotkey is today for starting Google/Gemini, is a bottleneck. Damned if I can remember what random gesture lets me ask Gemini to take a note today (presumably Gemini has notes support now; IIRC the original release didn't).

Finding where Google stashes todo items is also a bottleneck. Of course that entails me getting my phone out and navigating to whatever notes app they are shoved into (for a while, todos/notes were inside a separate Google search app!).

My Palm Pilot from 2000 had more usability than a modern smartphone.

This device can solve all of those issues.

soulofmischief · 4h ago
> We currently have a working prototype that, after training with user-specific example data, demonstrates over 90% accuracy on an application-specific vocabulary. The system is currently user-dependent and requires individual training. We are currently working on iterations that would not require any personalization.

https://www.media.mit.edu/projects/alterego/frequently-asked...

andymatuschak · 2h ago
That text was written about the Media Lab-era prototype in 2019: https://web.archive.org/web/20190102110930/https://www.media...

I wonder how far they've gotten past it.

vunderba · 2h ago
From the article:

> Alterego only responds to intentional, silent speech.

What exactly do they mean by this? Some kind of equivalent to subvocalization [1]?

[1] https://en.wikipedia.org/wiki/Subvocalization

ipsum2 · 7m ago
Yes. The paper the company is based on uses EMG (muscle-movement signals) and converts them into text.
hyperadvanced · 2h ago
Oh god we’re about to have the “I don’t have an inner monologue” debate again, aren’t we?
com2kid · 3h ago
I am surprised no one here has noted that a device like this almost completely negates the need for literacy. That is huge. Right now people still need to interact with written words, both typing and reading. Realistically, a quiet voice-based input device like this could have a UX built around it that does not require users to be literate at all.
jussaying2 · 2h ago
Not to mention the support it brings for people with disabilities! (speech, hands/fingers)
synapsomorphy · 6h ago
The accuracy is going to be the real make-or-break for this. In a paper from 2018 they reported 92% word accuracy [1]. That's a lifetime ago for ML, but they were also using five facial electrodes, whereas now it looks confined to the area around the ears. If the accuracy were great today, they would report it. In actual use I can see even 99% being pretty annoying and 95% being almost unusable (for people who can speak normally).
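
For scale: at 95% per-word accuracy, a 20-word sentence comes through error-free only about 0.95^20 ≈ 36% of the time, so roughly two out of three sentences would need at least one correction.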

[1] https://www.media.mit.edu/publications/alterego-IUI/

ivape · 1h ago
Why do you say that? I often vocalize near-gibberish and the LLM fixes it for me and mostly gets what I meant.
boznz · 1h ago
Spent all last year writing a techno-thriller about mind-reading. I'm sure this is about as factual, and, of course, nothing nefarious could possibly happen if this ever became real.
deekshith13 · 11m ago
You probably thought about some nefarious stuff that could happen. Mind sharing some interesting ones?
gcanyon · 4h ago
For those thinking about speed: an average human talks anywhere from 120-240 words per minute. An average human who touch types is probably 1/3 to 1/2 as fast as that, while an average human on a phone probably types 1/5 as fast as that.
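(At 180 wpm of speech, that works out to roughly 60-90 wpm for touch typing and about 36 wpm on a phone.)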

But for me speed isn't even the issue. I can dictate to Siri at near-regular-speech speeds -- and then spend another 200% of the time that took to fix what it got wrong. I have reasonable diction and enunciation, and speech to text is just that bad while walking down the street. If this is as accurate as they're showing, it would be worth it just for the accuracy.

keleftheriou · 3h ago
I agree, but I think LLM-based voice input is a lot better. I’m using OpenAI’s realtime API for my Apple Watch app, and it does wonders, even editing can be as simple as “add a heart emoji at the end”, and it just works.

https://x.com/keleftheriou/status/1963399069646426341

socalgal2 · 5h ago
I just imagine this going really wrong. My chain of thought would be something like: "Let's see, I need to rotate this image so I need to loop over rows then columns, .. gawd fuck this code base is shit designed, there are no units on these fields, this could be so much cleaner, ... for each row ... I wonder what's for lunch today? I hope it's good ... for each column ... Dang that response on HN really pissed me off, I'd better go check it ... read pixel from source ... tonight I'm meeting up with a friend, I'd better remember to confirm, ... write pixel to dest ...."
Briannaj · 4h ago
This is literally only as fast as speech-to-text; the only difference is that you don't have to speak aloud. Which is cool. But for using a computer it's still annoying and worse than a mouse, because with a mouse you can click or drag and place in a second, while in this format you have to think "move the box from point A to point B (with coordinates or a description)" etc.

I think it's cool; I've been brainstorming how a good MCI would work for a while and didn't think of this. I think it's a great, novel approach that will probably be expanded on soon.

com2kid · 4h ago
> But for using a computer it's still annoying and worse than a mouse, because with a mouse you can click or drag and place in a second, while in this format you have to think "move the box from point A to point B (with coordinates or a description)" etc.

You wouldn't use a regular WIMP [1] paradigm with this; that would completely defeat the advantages you have. You don't need to have a giant window full of icons and other clickable/tappable UI elements; that becomes pointless now.

[1] https://en.wikipedia.org/wiki/WIMP_(computing)

Briannaj · 4h ago
What it could be really cool for is stuff like "open my house door", "turn off the lights", "text so-and-so", "start my car": stuff we want to do without pulling out our phone that doesn't require a lot of detailed instruction.
stevage · 4h ago
I must be such a rarity around here, but if I could improve a hundred things about my life, none of those would make the list. Well, possibly the third one - more convenient ways to text people.

I guess I also kind of enjoy the physical sensations of putting a key in a lock, opening the door etc. Definitely don't want a digital-only existence.

pedalpete · 6h ago
I'd love to get a better understanding of the technology this is built with (without sitting through an exceedingly long video).

I suspect it's EMG through muscles in the ear and jaw bone, but that seems too rudimentary.

The TED talk describes a system which includes sensors on the chin across the jaw bone, but the demo obviously has removed that sensor.

jackthetab · 6h ago
Thirteen minutes is an "exceedingly long video"?! Man, I thought I was jaded complaining about 20 minute videos! :-)

What I want to know is: what are they connected to? A laptop? An AS/400? An old Cray they have lying around? I'd think doing the demo while walking would have been de rigueur.

Anyway, très cool!

esafak · 4h ago
These guys were not born when Crays roamed the earth.
ilaksh · 6h ago
Maybe they have combined an LLM or something with the speech detection convolution layers or whatever they were doing. Like with JSON schemas constraining the set of available tokens that are used for structured outputs. Except the set of tokens comes from the top 3-5 words that their first analysis/network decided are the most likely. So with that smarter system they can get by with fewer electrodes in a smaller area at the base of the skull where cranial nerves for the face and tongue emerge from the brainstem.
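
A toy sketch of that rescoring idea (everything below is invented for illustration; a real system would score with an actual language model and beam-search rather than enumerating):

  from itertools import product

  # Toy stand-in for an LLM scorer: common words score 0, anything else a penalty.
  COMMON = {"what", "time", "is", "it"}

  def lm_logprob(path):
      return sum(0.0 if w in COMMON else -5.0 for w in path)

  def best_path(candidates):
      # Exhaustive over the top-k candidates per position; fine for short phrases.
      return max(product(*candidates), key=lm_logprob)

  # e.g. top-2 candidates per position from the EMG decoder:
  candidates = [["watt", "what"], ["time", "dime"], ["is", "its"], ["it", "hit"]]
  print(" ".join(best_path(candidates)))  # -> "what time is it"

This is essentially what structured-output/JSON-schema constrained decoding does, with the "grammar" supplied by the EMG decoder instead.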
fxwin · 6h ago
i think this is what you're looking for: https://www.media.mit.edu/projects/alterego/publications/
blixt · 6h ago
I found it interesting that in the segment where two people were communicating "telepathically", they seem to be producing text, which is then put through text-to-speech (using what appeared to be a voice trained on their own -- nice touch).

I have to wonder, if they have enough signal to produce what essentially looks like speech-to-text (without the speech), wouldn't it be possible to use the exact same signal to directly produce the synthesized speech? It could also lower latency further by not needing extra surrounding context for the text to be pronounced correctly.

com2kid · 3h ago
> they seem to be producing text, which is then put through text-to-speech (using what appeared to be a voice trained on their own -- nice touch).

This is an LLM/TTS pipeline thing. Plenty of open-source (or at least MIT-licensed) LLMs and TTS models exist that do this translation and can be zero-shot adapted to a user's voice. Direct audio-to-audio models tend to be less researched and less advanced than the corresponding (but higher-latency) audio-to-text-to-audio pipelines.

That said, you can get audio->text->audio down to 400 ms or so of latency if you are really damn good at it.

stevage · 4h ago
Interesting, I remember reading a sci-fi book a long time ago with almost exactly this same method, which they called "sub-vocalisation".

(I think it was https://en.wikipedia.org/wiki/Oath_of_Fealty_%28novel%29 but can't find enough details to confirm.)

goopypoop · 1h ago
Speaker for the Dead, by Orson Scott Card
akdor1154 · 5h ago
From memory, I think other recent research is along this approach, but not yet good enough. Can't remember where I read this, but it was likely HN. I think the posted paper got 95% accuracy when picking from a known set of target sentences/words, but far less (60%?) when used for freeform input.

I'm sure that's not the last word though!

andsoitis · 4h ago
They don’t have something that anyone can try out, and it seems there have been no public demonstrations of early prototypes.

Seems like vaporware.

Dilettante_ · 6h ago
I want to believe so bad that I can finally get rid of my keyboard.
deckar01 · 4h ago
You can tell it’s fake, because it’s hard wired and super low profile, yet isn’t covered in LEDs.
kittikitti · 27m ago
So if I unintentionally think of a thought crime, is it still illegal? I wonder how governments will use this to "nudge" their citizens. I guess if you have nothing to hide, then there are no issues at all.
baroninthetrees · 4h ago
As someone with ADD and a lot of crosstalk in my "inner voice", I can't imagine this could make any sense of what I'm intending, let alone pick out one thing. Definitely a lot of use cases if it isn't vaporware.
laurieg · 4h ago
I'm a huge smart speaker user. I have one in every room. But as soon as guests come over I stop using them. I would never use Siri etc in public.

Going from voice input to silent voice input is a huge step forward for UX.

keleftheriou · 3h ago
I get the sentiment, but can you elaborate on why that is the case for you?
bromanko · 2h ago
I think the killer app is doing video calls in coffee shops without disturbing my neighbors.
vunderba · 2h ago
I can 100% guarantee that I lack the Mentat level of discipline necessary to prevent my true inner thoughts from leaking out in the middle of a conference. I'd also be the first to inadvertently summon the giant squid from Sphere.

paulbjensen · 3h ago
I wonder if they've considered testing it with people who have locked-in syndrome or motor neurone disease. It could be an amazing tool for them.
whymauri · 2h ago
The Harvard BIONICS lab is working on neuroprostheses for different forms of paralysis, like intestinal paralysis. They're great.
dwa3592 · 2h ago
Y’all are missing a few key points.

- There is an ML model which was trained on 31 hours of silently spoken text. That’s the training data. You still need to know the red fruit in front of you is called an apple, because that’s what the model is trained on. So you must be literate to get this working.

- The accuracy in the paper is on a very narrow text type: numerals. As far as I could understand, they asked users to do mathematical operations and they checked the accuracy on that. Someone with a deeper understanding, please correct me.

- Most of the video demo (honestly) is meh; once you have the text input for an LLM, you are limited to what the LLM can do. The real deal is the ML model that translates the neuromuscular signals to actual words. Those signals must be super noisy, so training a model with only 31 hours of data is a bit surprising and impressive. But the model would probably require calibration for each user’s silent voice, like: say this sentence silently, “a quick brown fox jumped over the rope”. I think this will be cool.

- I really hope this tech works. I really really hope they don’t sell to big tech jerks like Meta. I really really really hope this tech removes screens from our lives (or is at least a step in the right direction).

crooked-v · 27m ago
> So you must be literate to get this working.

Literacy is about written text, not spoken words. I think you've confused it with fluency.

dwa3592 · 2h ago
I should have started with: “Congratulations. Very cool tech, if it works.”
Theodores · 6h ago
The presentation of this product reminds me of peak crypto when a 'white paper' and a two-page website was all anyone needed to get bamboozled into handing their money over.
g42gregory · 4h ago
Are these friends of Elizabeth Holmes? :-)
lordofgibbons · 6h ago
Is the video hosted anywhere else? It seems to be the only source of info, but it's being played at like 0.5X speed.
bigolnik · 5h ago
So this is a bone-conducting microphone? That operates at the speed of speech? While you sit around awkwardly, hoping no one talks to you? This isn't thought. This is you saying to yourself quite clearly what you would like it to hear.
reassess_blind · 5h ago
It doesn't look like he's speaking.
p1dda · 1h ago
Brilliant idea to capture the neuromuscular signals and translate them to text!
desireco42 · 5h ago
What I picked up from this vision of the future... we will have mind-reading devices to capture our thoughts, but we will still be on a train, commuting to work... dang...

So they came up with this groundbreaking idea but couldn't come up with a better use case than typing on a train.

Look, I can't help but appreciate that at least they are doing something interesting, as opposed to the one-shot vibe-coded forks of VS Code that we see.

zknowledge · 6h ago
Either this is the world's biggest grift OR the 2nd greatest product of the 21st century... so far.
lukebechtel · 6h ago
been waiting for something like this. Looking forward to adoption!
tibbon · 5h ago
So... Sub-Etha?
desireco42 · 4h ago
Reminds me of the song Sound of Silence...
deadbabe · 3h ago
Great. I can imagine fucked up charlatans putting stuff like this on brain dead patients and convincing their family the person is still able to communicate with the help of AI.
goopypoop · 1h ago
"spirit box" don't need Al