Ask HN: What API or software are people using for transcription?

What API or software are people using for transcription? If remote API, would like it to be fast, cheap, and support summarization. Groq actually looks good as it apparently supports remote URLs for the audio file, which I actually would prefer. If local, would need to work on a base M4 mini. Looking at llamafile/whisperfile, as I'd want to be able to either cli batch it or use it as a local API/server.

Comments (63)

TachyonicBytes · 2d ago

I use whisperfile[1] directly. The whisper-large-v3 model seems good with non-English transcription, which is my main use-case.

I am also eyeing whisperX[2], because I want to play some more with speaker diarization.

Your use-case seems to be batch transcription, so I'd suggest you go ahead and just use whisperfile, it should work well on an M4 mini, and it also has an HTTP API if you just start it without arguments.
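A minimal sketch of calling that HTTP API from Python, assuming the whisperfile server behaves like the whisper.cpp example server: listening on 127.0.0.1:8080 and accepting a multipart POST to /inference with the audio in a `file` field. The port, path, and field names are assumptions to check against your whisperfile version; the helper names are hypothetical.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(fields, filename, file_bytes):
    """Encode text fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode()
        )
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path, url="http://127.0.0.1:8080/inference"):
    """POST one audio file to a locally running server (URL is an assumption)."""
    with open(path, "rb") as f:
        audio = f.read()
    body, content_type = build_multipart(
        {"response_format": "json"},  # assumed field, as in the whisper.cpp server
        os.path.basename(path),
        audio,
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For batch work, start the server once and call `transcribe(path)` in a loop over your files.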

If you want more interactivity, I have been using Vibe[3] as an open-source replacement of SuperWhisper[4], but VoiceInk from a sibling comment seems better.

Aside: It seems that so many of the mentioned projects use whisper at the core, that it would be interesting to explicitly mark the projects that don't use whisper, so we can have a real fundamental comparison.

[1] https://huggingface.co/Mozilla/whisperfile

[2] https://github.com/m-bain/whisperX

[3] https://github.com/thewh1teagle/vibe/

[4] https://superwhisper.com/

levocardia · 2d ago

I have used whisperX with success in a variety of languages, but not with diarization. If the goal is to use the transcript for something else, you can often feed the transcript into a text LLM and say "this is an audio transcript and might have some mistakes, please correct them." I played around with transcribing in the original language vs. having whisper translate it, and it seems to work better to transcribe in the original language, then feed that into an LLM and have that model do the translation. At least for French, Spanish, Italian, and Norwegian. I imagine a text-based LLM could also clean up any diarization weirdness.

TachyonicBytes · 2d ago

Yes, this is exactly where I am going. The LLM also has an advantage, because you can give it the context of the audio (e.g. "this is an audio transcript from a radio show about etc. etc."). I can foresee this working for a future whisper-like model as well.
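The cleanup step described here can be sketched as a small prompt builder. This only assembles the text to send; which LLM you send it to, and how, is left open. The function name and the exact wording of the extra instructions are hypothetical.

```python
def build_cleanup_prompt(transcript, context=None, target_language=None):
    """Assemble a correction prompt for a text LLM from a raw transcript."""
    lines = [
        "This is an audio transcript and might have some mistakes, "
        "please correct them."
    ]
    if context:
        # e.g. "a radio show about local politics"
        lines.append(f"Context: this is an audio transcript from {context}.")
    if target_language:
        # Transcribe in the original language first, translate in the LLM.
        lines.append(
            f"After correcting it, translate the transcript into {target_language}."
        )
    lines += ["", "Transcript:", transcript.strip()]
    return "\n".join(lines)
```

Passing `context` and `target_language` is optional, so the same helper covers plain correction and the transcribe-then-translate workflow.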

There are two ways to parse your first sentence. Are you saying that you used whisperX and it doesn't do well with diarization? I ask because I am curious about alternative ways of doing that.

anonymousiam · 2d ago

Whisper is amazing. It's better than any other speech recognition system I've seen, and it can be run locally.

I've run it on my ThinkPad P14s Gen 4, which doesn't have much of a GPU (Radeon 780M). It processes approximately in realtime.

satvikpendem · 2d ago

DiCoW-v2 seems to work better than whisperX for diarization, by the way.

https://pccnect.fit.vutbr.cz/gradio-demo/

TachyonicBytes · 2d ago

It seems that both use / leverage pyannote. I wonder if the whisperX pipeline can be combined with DiCoW-v2.

simonw · 2d ago

I really like the MacWhisper macOS desktop app - https://goodsnooze.gumroad.com/l/macwhisper

It runs Whisper (or the newer Whisper Turbo) really well, and you can either drop MP3/MP4/etc. files into it or paste in a YouTube video or podcast URL to kick off a transcription. It exports to text, VTT subtitles, or a bunch of other formats. I use it several times a week.
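If your tool only gives you timed segments, producing VTT yourself is straightforward. A minimal sketch, assuming segments are dicts with `start` and `end` in seconds plus `text` (a hypothetical shape, not MacWhisper's export code):

```python
def vtt_timestamp(seconds):
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_vtt(segments):
    """Render timed segments as a WebVTT document."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        lines.append(f"{vtt_timestamp(seg['start'])} --> {vtt_timestamp(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)
```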

droopyEyelids · 2d ago

I was surprised to find my old PC with a GTX 1080 could transcribe/diarize about 10x faster than my M1 Mac. If anyone reading this is looking to transcribe hundreds of hours of audio, do the extra work to get it set up on a desktop with a dedicated graphics card.

solardev · 1d ago

Was the M1 using its GPU cores? It should have more power than a 1080.

indigodaddy · 1d ago

Does MacWhisper use the GPU via Metal by default, or is it something you have to tick?

solardev · 1d ago

I'm not sure, sorry, been a while since I used it.

indigodaddy · 1d ago

Oh wow, if I can batch process a list of remote audio URLs with MacWhisper, then this would be perfect! Will check it out!

solardev · 1d ago

Another vote for this... super easy to use.

PaulShin · 1d ago

That's a great question, as the transcription stack landscape is evolving quickly. As the founder of an AI collaboration platform (Markhub), we've tested several options for our own internal "AI teammate," MAKi.

Our stack is a hybrid approach, as we've found no single service is best for everything:

For Real-Time & High-Volume Transcription: We use Google's Speech-to-Text API (via Vertex AI). We found its accuracy with various accents and noisy environments to be top-tier, which is critical for turning real-world meeting audio into usable data. It's not the cheapest, but the reliability is worth it for our core product.

For Bundled Transcription + Summarization (What you're asking about): You're right to look at models that can handle this in one go. We use Gemini for this exact purpose. The ability to send an audio file and get back a structured summary, not just a raw transcript, is a huge workflow accelerator. We point our audio data to the model and ask it to not only transcribe but also extract action items, which is a core feature of our platform.

For Local/CLI Batching (As you mentioned): We've had great success with OpenAI's Whisper running in a containerized environment for internal, less time-sensitive tasks. Using a fine-tuned version of Whisper on a dedicated machine for batch processing can be very cost-effective. Your M4 mini should be able to handle the base models of Whisper quite well, especially using a tool like whisper.cpp.

My takeaway: For a production API, a robust cloud service like Google's or a powerful model like Gemini is more reliable. For local or batch processing, a self-hosted Whisper setup is a fantastic and powerful option. Good luck!

codeptualize · 2d ago

Whisper large v3 from openai, but we host it ourselves on Modal.com. It's easy, fast, no rate limits, and cheap as well.

If you want to run it locally, I'd still go with whisper, then I'd look at something like whisper.cpp https://github.com/ggml-org/whisper.cpp. Runs quite well.

pramodbiligiri · 2d ago

I second this (whisper.cpp). I've had a good experience running whisper.cpp locally. I wrote a Python wrapper for invoking its whisper-cli: https://github.com/pramodbiligiri/annotate-subs/blob/main/ge... (that repo's readme might have more details).

Mind you, this is from a few months back! Not sure if this is still the best approach ¯\_(ツ)_/¯
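For a sense of what such a wrapper involves, here is a minimal sketch (not the code from the linked repo). It assumes a `whisper-cli` binary on your PATH and a downloaded ggml model file; `-m`, `-f`, `-l`, `-otxt`, and `-of` are whisper.cpp CLI flags, while the function names and paths are hypothetical.

```python
import subprocess
from pathlib import Path


def whisper_cli_command(audio, model, binary="whisper-cli", language=None):
    """Build the argument list for transcribing one file to <audio stem>.txt."""
    audio = Path(audio)
    cmd = [
        binary,
        "-m", str(model),                   # path to a ggml model file
        "-f", str(audio),                   # input audio
        "-otxt",                            # write a plain-text transcript
        "-of", str(audio.with_suffix("")),  # output path without extension
    ]
    if language:
        cmd += ["-l", language]
    return cmd


def transcribe_batch(audio_files, model):
    """Run whisper-cli on each file in turn; returns the transcript paths."""
    outputs = []
    for audio in audio_files:
        # check=True raises CalledProcessError if whisper-cli exits non-zero
        subprocess.run(whisper_cli_command(audio, model), check=True)
        outputs.append(Path(audio).with_suffix(".txt"))
    return outputs
```

whisper.cpp has traditionally expected 16 kHz WAV input, so a conversion step (for example with ffmpeg) may be needed before calling it.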

Tsarp · 2d ago

I'd love for you to try https://carelesswhisper.app

- Locally running, wrapper around whisper.cpp

- I've done a lot of work on noise profiling and stitching the segments. So when you are speaking for anything longer than 2-3 minutes, it's actually faster than cloud transcription. (Accuracy is a few WER points off since these are quantized models.)

- You can try it without paying or putting in a credit card. After that it's ~$19, one time. No need to sign up or log in.

- BYOK to use your Groq or Gemini free daily credits to rewrite. Supports thinking models too. Can also plug into any locally running LLM.

- Works on my 1st gen M1 without breaking a sweat.

onemoresoop · 2d ago

How much do you pay on average for an hour of transcription?

Tsarp · 1d ago

Runs locally on device. So no server costs.

meepmorp · 2d ago

simultaneously related and off topic:

https://arxiv.org/abs/2402.08021

Tsarp · 1d ago

huh! nice!

illright · 2d ago

A very worthwhile mention is also Stable-TS: https://github.com/jianfch/stable-ts

Out of the box it can transcribe with Whisper or Faster-Whisper, but it can also align audio with an existing human-written transcript, providing time information without losing accuracy. This last feature was something I really needed, and my attempt at building it myself ended up much worse, so I'm glad I found this

I self-host it using Modal.com, as do some other commenters

fchilmi · 7h ago

how much do you spend for modal.com?

ivm · 2d ago

Just configured VoiceInk yesterday and it's been flawless for all the languages I speak: https://tryvoiceink.com

It runs a small local model and has optional Power Modes that pass the transcript to a remote or local LLM for further enhancements, based on your currently opened apps or websites. Also the app is open-source, but with a one-time license purchase option (instabuy for me, of course).

swyx · 2d ago

i use https://voicebraindump.com/ which seems to do similar (but i happen to know the dev which is nice for support haha)

celurian92 · 13h ago

I have used the Whisper model on Azure AI services and, for some other users, Google Gemini 2.5.

Both worked really well for our use case, where we needed to transcribe six different languages in production.

ashryan · 2d ago

Thumbs up for Wispr Flow. Their iOS app was just released last week, and is an interesting addition to the product.

I needed to do an inventory of stuff in our house over the weekend, and I used Wispr Flow on iOS to take a very very long and rambly note in their app. Then the transcription text appeared on their Mac app, ready to be pasted into ChatGPT for parsing.

Wispr Flow handles language switches quite well in my experience using it in both English and Japanese.

guybedo · 1d ago

here's a few options cited in this thread:

whisper based options:

whisper.cpp works well on Mac and runs locally, though one commenter wished its vocabulary could be fine-tuned

noScribe is a good FOSS option

MacWhisper is a highly effective macOS desktop app that processes local files and remote URLs, exporting to various formats

whisperfile is suitable for batch transcription and can run on an M4 mini with an HTTP API

carelesswhisper.app is a local, whisper.cpp wrapper that offers fast transcription for longer audio, noise profiling, and one-time payment, working well on M1

Faster-Whisper-XXL standalone on Windows offers fantastic accuracy due to vocal extraction preprocessing

VoiceInk is an open-source solution that runs a small local model with optional remote/local LLM enhancements

Vibe is an open-source alternative to SuperWhisper

Cloud options:

OpenAI speech-to-text and AssemblyAI are considered high quality, with AssemblyAI being cheap and having robust SDK support

TurboScribe offers a generous free tier for web-based transcription

Hosting OpenAI Whisper large v3 on Modal.com provides a fast, cheap solution with no rate limits

AssemblyAI's Universal ASR has impressive WER, upcoming textual prompting (Slam-1), and PII redaction capabilities

borgcloud.org offers competitive pricing and fast real-time transcription

GCP's Chirp & chirp2 are used for large-scale meeting minute transcription

Microsoft Word 365 (online) Transcribe is a surprisingly effective out-of-the-box solution for English, offering labeled speakers and timestamps

Uploading audio to YouTube can be a 'cheap hack' for transcription

Full summary here: https://extraakt.com/extraakts/transcription-tools-and-workf...

jmward01 · 2d ago

A combination of engines generally gets the best WER, at additional cost. Hosted Whisper + Gemini 2.5 Flash Lite, with custom deconfliction based on what each one does best, is a reasonable path. Gemini does general conversation and silence better than Whisper v3 large, but Whisper v3 large does better on specialty vocab. Of course, both before and after the merge, common transcription errors are fixed with a dictionary-based lookup (that preserves punctuation, etc). This combo stays multi-lingual and is pretty cheap but is complex. There are better single-source transcription vendors out there, but they generally fail to either provide multi-lingual support or timing info, or are ridiculously expensive, or or or... I think the next gen of multi-modal models will make this all moot as they will likely crush transcription. Gemini shows that direction right now. OpenAI does a bad job of it but is in the game. Anthropic is surprisingly not really engaged in this yet (but they did just announce real time audio so they gotta be thinking about it).
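A minimal sketch of what a dictionary-based fix that preserves punctuation could look like. This is an illustration, not the commenter's pipeline; the function name and example entries are hypothetical, and it assumes every dictionary key starts and ends with a letter or digit so that word boundaries apply.

```python
import re


def apply_corrections(text, corrections):
    """Replace known mis-transcriptions, leaving surrounding punctuation alone."""
    if not corrections:
        return text
    lookup = {wrong.lower(): right for wrong, right in corrections.items()}
    # Longest entries first, so multi-word phrases win over shorter overlaps.
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(a) for a in alternatives) + r")\b",
        re.IGNORECASE,
    )
    # Only the matched words are swapped; commas, periods, etc. are untouched.
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)
```

Because matching is case-insensitive and bounded by `\b`, an entry like "cooper netties" is fixed mid-sentence or before a comma without touching longer words that merely contain it.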

user568439 · 1d ago

I just happen to have a 3-hour recording that needs transcription, and I couldn't get it to work with Whisper. It has 3 special characteristics:

- Huge size (400MB). It can be split, but then I want a single text file with correct timestamps.

- There are 3 speakers, and one is speaking far from the microphone and in a low voice. Whisper sometimes ignores this speaker.

- The last and most difficult is that there are 2 languages being used at the same time. The same speaker might use Dutch or English and even mix both in a sentence.

Is there a way to deal with all that?
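For the first point only (splitting, then getting one file with correct timestamps), a minimal sketch: transcribe each chunk separately, then shift every segment by the chunk's start time in the original recording. The data shape, a list of `(chunk_start_seconds, segments)` pairs, is an assumption; it does not address the quiet speaker or the mixed languages.

```python
def merge_chunks(chunks):
    """Merge per-chunk segments into one timeline with absolute timestamps.

    chunks: iterable of (chunk_start_seconds, segments), where each segment
    is a dict with 'start' and 'end' relative to its own chunk, plus 'text'.
    """
    merged = []
    for offset, segments in sorted(chunks, key=lambda chunk: chunk[0]):
        for seg in segments:
            merged.append({
                "start": seg["start"] + offset,  # shift into the full recording
                "end": seg["end"] + offset,
                "text": seg["text"],
            })
    return merged
```

Splitting at silences rather than at fixed intervals helps avoid cutting a word in half at a chunk boundary.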

lostmsu · 1d ago

Whisper 3 Large should be able to handle multiple languages in the same audio. Have you used that?

tcdent · 2d ago

OpenAI speech-to-text. Every time I try to use an open source model I am left unimpressed. If I'm going through the hassle of creating a system it might as well work correctly. At this point the closed source world is still miles ahead of the open world; hopefully that changes in the next year or two.

(Wispr Flow is the best for general speech-to-text on desktop as well.)

sexyman48 · 2d ago

I also used to wish open source would catch up with closed source. Then I realized kids and vacations cost money.

devoutsalsa · 2d ago

For a web UI, I've used TurboScribe and liked it: http://turboscribe.ai/. Their free tier allows 3 transcriptions per day for audio/video files up to 30 minutes in length. That was nice as many competing services limit their free tier to 10 minutes.

doebi · 2d ago

https://github.com/kaixxx/noScribe

swyx · 2d ago

this is really good FOSS. thanks for sharing. pyannote seems a pain to figure out

lostmsu · 2d ago

If you want to support a startup that does not get the bill footed by VC, we made https://borgcloud.org/speech-to-text and the price is competitive to Groq. We have average latency under 1s and do 5x-15x realtime.

mazzystar · 1d ago

I've been using WhisperNotes (https://whispernotes.app) for iOS/macOS and really appreciate the offline approach - no recurring API costs and everything stays local. The transcription quality is solid, though you're right that it doesn't handle summarization yet. Still worth checking out if privacy and costs matter to you.

io84 · 2d ago

Cheap hack I use for transcribing in-person customer sessions:

1. record the audio on your phone audio recorder

2. send the mp3 to yourself in Slack

3. a few minutes later the transcription will appear on Slack

I then feed that to an LLM for summary and actions. Quality has been great for this workflow, all in English.

satvikpendem · 2d ago

What are people using for realtime transcription and diarization specifically? I'm thinking something like Zoom's transcript feature but Zoom itself has the advantage of knowing exactly who is speaking at what time so they don't need to diarize from raw speech at all.

So far I've seen DiCoW-v2 work pretty well; it's a diarization-finetuned Whisper [0]. Paid options like Speechmatics also work well and are fairly cheap.

[0] https://pccnect.fit.vutbr.cz/gradio-demo/

xnx · 2d ago

Faster-Whisper-XXL standalone executable on Windows: https://github.com/Purfview/whisper-standalone-win

dinfinity · 1d ago

Same. The built in vocal extraction preprocessing step is fantastic for accuracy.

dweekly · 2d ago

AssemblyAI's Universal ASR has got really impressive WER compared to Google Cloud Voice or Nova. I'm looking forward to their rolling out Slam-1's textual prompting, which will let you use plain English to describe the domain/context of the recording. Very reasonably priced. I've used it on about 10,000 speech hours for a service I built. As a bonus, it also can produce PII redacted audio files and transcripts for a very reasonable price. That used to be hugely tedious and expensive work.

thadt · 2d ago

Microsoft Word 365 (online) -> Transcribe.

I know, I know - it sounds super cheesy, but after trying several different LLMs and workflows, it just worked out of the box and gave me what I needed: labeled speakers, timestamps, and a nice way to review/jump from the generated text to the audio. Didn't work so well for mixed languages, but for English at least it was comparable to or better than the other solutions I tried.

meepmorp · 2d ago

Is there anyone using anything that's NOT based on whisper? Maybe some hopefully not-too-ugly kaldi recipe?

_boffin_ · 2d ago

Question: California is a two-party consent state for recording, including (I could be wrong) transcribing. How are you handling this? Are you handling it, or are you just capturing without consent?

I just saw Apple's new live transcribe. I wonder how that works, in a legal sense for two party states.

janalsncm · 2d ago

This isn’t legal advice but practical: try to avoid doing things that will piss other people off, especially illegal things.

You can still be sued for doing something that’s completely legal, and because of the costs associated, you are punished by process rather than law.

Andugal · 2d ago

Bonus question: what API or software are people using for diarization? (On top of transcription)

daft_pink · 2d ago

Is there a way to get transcription without engaging in sound recording, ie on device?

I would really like to get transcription of my meetings without having the legal implications or notification requirements of sound recording.

mbanerjeepalmer · 2d ago

This wrapper for whisper.cpp https://github.com/akash-joshi/better-whisper

kaiwenwang · 2d ago

Something that I built myself using MacOS's speech recognizer:

http://transcribetranslate.app/

educationcto · 2d ago

AssemblyAI is quite good, pretty cheap and easy with robust SDK support.

https://www.assemblyai.com/

shafkathullah · 2d ago

https://replicate.com/collections/speech-to-text

randomgu · 2d ago

https://github.com/ggml-org/whisper.cpp

tonymet · 2d ago

Chirp & chirp2 on the GCP Cloud Speech-to-Text v2 API. I transcribed 1000+ hours of meeting minutes from public records.

airza · 2d ago

I use whisper.cpp and it works pretty okay on a Mac. Runs locally. Wish I could finetune its vocab.

lyime · 2d ago

I would love to automate transcription of our YouTube channel videos, with speaker identification.

wenbin · 2d ago

https://Transcript.New

oulipo · 2d ago

I'm using VoiceInk which is great and open-source (free if you build it yourself!)

TachyonicBytes · 2d ago

It seems to be pre-built on github, in releases.

oulipo · 1d ago

Yes, but I guess it includes the "payment check" in the pre-built releases? Not sure but perhaps

TachyonicBytes · 1d ago

It has the payment button at least

elif · 2d ago

Honestly if it were me and I wanted "fast and cheap" I would just upload all content to YouTube and let them pay for and upkeep the AI transcription models.
