Ask HN: What API or software are people using for transcription?
55 indigodaddy 63 6/9/2025, 4:10:31 PM
What API or software are people using for transcription? If remote API, would like it to be fast, cheap, and support summarization. Groq actually looks good as it apparently supports remote URLs for the audio file, which I actually would prefer. If local, would need to work on a base M4 mini. Looking at llamafile/whisperfile, as I'd want to be able to either cli batch it or use it as a local API/server.
I am also eyeing whisperX[2], because I want to play some more with speaker diarization.
Your use case seems to be batch transcription, so I'd suggest you just use whisperfile. It should work well on an M4 mini, and it also exposes an HTTP API if you start it without arguments.
If you want more interactivity, I have been using Vibe[3] as an open-source replacement for SuperWhisper[4], but VoiceInk from a sibling comment seems better.
Aside: So many of the mentioned projects use Whisper at the core that it would be interesting to explicitly mark the projects that don't, so we can have a real fundamental comparison.
[1] https://huggingface.co/Mozilla/whisperfile
[2] https://github.com/m-bain/whisperX
[3] https://github.com/thewh1teagle/vibe/
[4] https://superwhisper.com/
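For the "local API/server" use mentioned above, here is a minimal sketch of posting an audio file to a whisperfile started in server mode. The address `127.0.0.1:8080`, the `/inference` path, the `file` and `response_format` form fields, and the `text` key in the JSON reply are all assumptions carried over from the whisper.cpp server example that whisperfile builds on; check them against your build before relying on them.

```python
import json
import urllib.request
import uuid
from pathlib import Path

# Assumption: whisperfile is running in server mode at this address and path.
SERVER_URL = "http://127.0.0.1:8080/inference"


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode plain form fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    file_header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'
    )
    parts.append(file_header.encode() + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path, url=SERVER_URL):
    """POST one audio file to the server and return the transcript text."""
    audio = Path(path)
    body, content_type = build_multipart(
        {"response_format": "json"},  # assumed field name and value
        "file",                       # assumed name of the upload field
        audio.name,
        audio.read_bytes(),
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}
    )
    with urllib.request.urlopen(request) as response:
        # Assumption: the JSON reply carries the transcript under "text".
        return json.loads(response.read())["text"]
```

Batch use is then a loop over files, e.g. `for wav in Path("recordings").glob("*.wav"): print(transcribe(wav))`, where `recordings` is a hypothetical directory name.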
There are two ways to parse your first sentence. Are you saying that you used whisperX and it doesn't do well with diarization? Because I am curious about alternative ways of doing that.
I've run it on my ThinkPad P14s Gen 4, which doesn't have much of a GPU (Radeon 780M). It processes approximately in realtime.
https://pccnect.fit.vutbr.cz/gradio-demo/
It runs Whisper (or the newer Whisper Turbo) really well, and you can either drop MP3/MP4/etc. files into it or paste in the URL of a YouTube video or podcast to kick off a transcription. It exports to text, VTT subtitles, or a bunch of other formats. I use it several times a week.
Our stack is a hybrid approach, as we've found no single service is best for everything:
For Real-Time & High-Volume Transcription: We use Google's Speech-to-Text API (via Vertex AI). We found its accuracy with various accents and noisy environments to be top-tier, which is critical for turning real-world meeting audio into usable data. It's not the cheapest, but the reliability is worth it for our core product.
For Bundled Transcription + Summarization (What you're asking about): You're right to look at models that can handle this in one go. We use Gemini for this exact purpose. The ability to send an audio file and get back a structured summary, not just a raw transcript, is a huge workflow accelerator. We point our audio data to the model and ask it to not only transcribe but also extract action items, which is a core feature of our platform.
For Local/CLI Batching (As you mentioned): We've had great success with OpenAI's Whisper running in a containerized environment for internal, less time-sensitive tasks. Using a fine-tuned version of Whisper on a dedicated machine for batch processing can be very cost-effective. Your M4 mini should be able to handle the base models of Whisper quite well, especially using a tool like whisper.cpp.
My takeaway: For a production API, a robust cloud service like Google's or a powerful model like Gemini is more reliable. For local or batch processing, a self-hosted Whisper setup is a fantastic and powerful option. Good luck!
If you want to run it locally, I'd still go with whisper, then I'd look at something like whisper.cpp https://github.com/ggml-org/whisper.cpp. Runs quite well.
Mind you, this is from a few months back! Not sure if this is still the best approach ¯\_(ツ)_/¯
- Locally running, wrapper around whisper.cpp
- I've done a lot of work on noise profiling and stitching the segments, so when you speak for anything longer than 2-3 minutes, it's actually faster than cloud transcription. (Accuracy is a few WER points off since the models are quantized.)
- You can try it without paying or entering a credit card. After that it's ~$19, one time. No need to sign up or log in.
- BYOK to use your Groq or Gemini free daily credits for rewriting. Supports thinking models too. Can also plug into any locally running LLM.
- Works on my 1st-gen M1 without breaking a sweat.
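For anyone unfamiliar with the WER figure mentioned above: word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. A minimal sketch follows; the lowercasing and whitespace tokenization are simplifying assumptions, and real evaluations usually normalize punctuation and numbers as well.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion (reference word missing)
                current[j - 1] + 1,      # insertion (extra hypothesis word)
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)
```

Under this definition, "a few WER off" would mean the quantized model scores a few percentage points higher (worse) on the same audio.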
https://arxiv.org/abs/2402.08021
Out of the box it can transcribe with Whisper or Faster-Whisper, but it can also align audio with an existing human-written transcript, providing timing information without losing accuracy. This last feature was something I really needed, and my attempt at building it myself ended up much worse, so I'm glad I found this.
I self-host it using Modal.com, as do some other commenters.
It runs a small local model and has optional Power Modes that pass the transcript to a remote or local LLM for further enhancements, based on your currently opened apps or websites. Also the app is open-source, but with a one-time license purchase option (instabuy for me, of course).
These worked really well for our use case, where we needed to transcribe 6 different languages in production.
I needed to do an inventory of stuff in our house over the weekend, and I used Wispr Flow on iOS to take a very very long and rambly note in their app. Then the transcription text appeared on their Mac app, ready to be pasted into ChatGPT for parsing.
Wispr Flow handles language switches quite well in my experience using it in both English and Japanese.
Whisper-based options:
whisper.cpp works well on Mac and runs locally, though fine-tuning vocabulary is desired
noScribe is a good FOSS option
MacWhisper is a highly effective macOS desktop app that processes local files and remote URLs, exporting to various formats
whisperfile is suitable for batch transcription and can run on an M4 mini with an HTTP API
carelesswhisper.app is a local, whisper.cpp wrapper that offers fast transcription for longer audio, noise profiling, and one-time payment, working well on M1
Faster-Whisper-XXL standalone on Windows offers fantastic accuracy due to vocal extraction preprocessing
VoiceInk is an open-source solution that runs a small local model with optional remote/local LLM enhancements
Vibe is an open-source alternative to SuperWhisper
Cloud options:
OpenAI speech-to-text and AssemblyAI are considered high quality, with AssemblyAI being cheap and having robust SDK support
TurboScribe offers a generous free tier for web-based transcription
Hosting OpenAI Whisper large v3 on Modal.com provides a fast, cheap solution with no rate limits
AssemblyAI's Universal ASR has impressive WER, future textual prompting, and PII redaction capabilities
borgcloud.org offers competitive pricing and fast real-time transcription
GCP's Chirp & chirp2 are used for large-scale meeting minute transcription
Microsoft Word 365 (online) Transcribe is a surprisingly effective out-of-the-box solution for English, offering labeled speakers and timestamps
Uploading audio to YouTube can be a 'cheap hack' for transcription
Full summary here: https://extraakt.com/extraakts/transcription-tools-and-workf...
- Huge size (400 MB). It can be split, but then I want a single text file with correct timestamps.
- There are 3 speakers, and one is speaking far from the microphone and in a low voice. Whisper sometimes ignores this speaker.
- The last and most difficult is that 2 languages are being used at the same time. The same speaker might use Dutch or English, and even mix both in a sentence.
Is there a way to deal with all that?
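On the first point only (split file, single output with correct timestamps), a minimal sketch: transcribe each chunk separately, then shift every segment by the offset at which its chunk starts in the original recording. The `Segment` type, the `(chunk_start_seconds, segments)` input shape, and the `[HH:MM:SS.mmm]` output format are hypothetical and would need adapting to whatever your transcription tool emits. It does not address the quiet speaker or the mixed Dutch/English speech.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds, relative to the start of its chunk
    end: float
    text: str


def merge_chunks(chunks):
    """Merge per-chunk segments into one list with absolute timestamps.

    chunks: iterable of (chunk_start_seconds, [Segment, ...]), in any order.
    """
    merged = []
    for chunk_start, segments in sorted(chunks, key=lambda chunk: chunk[0]):
        for seg in segments:
            merged.append(
                Segment(seg.start + chunk_start, seg.end + chunk_start, seg.text)
            )
    return merged


def format_timestamp(seconds):
    """Render seconds as HH:MM:SS.mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def to_text(segments):
    """One line per segment, prefixed with its absolute start time."""
    return "\n".join(
        f"[{format_timestamp(seg.start)}] {seg.text.strip()}" for seg in segments
    )
```

If the chunks overlap (a common way to avoid cutting words at the boundaries), duplicated segments in the overlap would still need to be dropped; this sketch assumes non-overlapping cuts.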
(Wispr Flow is the best for general speech-to-text on desktop as well.)
1. record the audio on your phone audio recorder
2. send the mp3 to yourself in Slack
3. a few minutes later the transcription will appear on Slack
I then feed that to an LLM for summary and actions. Quality has been great for this workflow, all in English.
So far I've seen DiCoW-v2 work pretty well; it's a diarization-finetuned Whisper [0]. Paid options like Speechmatics also work well and are fairly cheap.
[0] https://pccnect.fit.vutbr.cz/gradio-demo/
I know, I know - it sounds super cheesy, but after trying several different LLMs and workflows, it just worked out of the box and gave me what I needed: labeled speakers, timestamps, and a nice way to review/jump from the generated text to the audio. It didn't work so well for mixed languages, but for English at least it was comparable to or better than the other solutions I tried.
I just saw Apple's new live transcribe. I wonder how that works, in a legal sense, in two-party consent states.
You can still be sued for doing something that’s completely legal, and because of the costs associated, you are punished by process rather than law.
I would really like to get transcription of my meetings without having the legal implications or notification requirements of sound recording.
http://transcribetranslate.app/
https://www.assemblyai.com/