Voxtral – Frontier open source speech understanding models

66 meetpateltech 18 7/15/2025, 2:47:02 PM mistral.ai ↗

Comments (18)

kamranjon · 1h ago
I'm pretty excited to play around with this. I've worked with whisper quite a bit; it's awesome to have another model in the same class, and from Mistral, who tend to be very open. I'm sure unsloth is already working on some GGUF quants - will probably spin it up tomorrow and try it on some audio.
homarp · 9h ago
Running Voxtral-Mini-3B-2507 on GPU requires ~9.5 GB of GPU RAM in bf16 or fp16.

Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.
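Those figures line up with a rough back-of-the-envelope estimate: 2 bytes per parameter in bf16/fp16 for the weights, plus headroom for the audio encoder, activations, and KV cache. A minimal sketch of the weights-only arithmetic (the overhead beyond weights is not broken down in Mistral's numbers, so only the first term is checkable):

```python
def weight_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """GiB needed just for the model weights (2 bytes/param in bf16 or fp16)."""
    return n_params * bytes_per_param / 2**30

# Weights alone: ~5.6 GiB for 3B params, ~44.7 GiB for 24B params.
# The quoted ~9.5 GB / ~55 GB figures also have to cover the audio
# encoder, activations, and KV cache, hence the gap above weights-only.
print(f"Voxtral-Mini-3B weights:   {weight_gib(3e9):.1f} GiB")
print(f"Voxtral-Small-24B weights: {weight_gib(24e9):.1f} GiB")
```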

ipsum2 · 5h ago
24B is crazy expensive for speech transcription. Conspicuously, there's no comparison with Parakeet, a 600M-param model that's currently dominating leaderboards (but only for English).
sheerun · 2h ago
In the demo they mention that Polish pronunciation is pretty bad, spoken as if Polish were the second language of a native English speaker. I wonder if it's the same for other languages. On the other hand, whispered English is hilariously good, especially with different emotions.
GaggiX · 10h ago
There is also a Voxtral Small 24B model available to download: https://huggingface.co/mistralai/Voxtral-Small-24B-2507
lostmsu · 9h ago
Does it support realtime transcription? What is the approximate latency?
danelski · 13h ago
They claim to undercut competitors of similar quality by half for both models, yet they released both as Apache 2.0 instead of following the smaller-open, larger-closed strategy used in their recent releases. What's different here?
halJordan · 5h ago
They didn't release Voxtral Large, so your question doesn't really make sense
wmf · 6h ago
They're working on a bunch of features so maybe those will be closed. I guess they're feeling generous on the base model.
Havoc · 6h ago
Probably not looking to compete directly in the transcription space
lostmsu · 9h ago
My Whisper v3 Large Turbo is $0.001/min, so their price comparison isn't exactly perfect.
ImageXav · 9h ago
How did you achieve that? I was looking into it and $0.006/min is quoted everywhere.
lostmsu · 8h ago
Harvesting idle compute. https://borgcloud.org/speech-to-text
4b11b4 · 40m ago
This is your service?
BetterWhisper · 7h ago
Do you support speaker recognition?
lostmsu · 5h ago
No. I found models doing that unreliable when there are many speakers.