GPT-OSS vs. Qwen3 and a detailed look how things evolved since GPT-2 (magazine.sebastianraschka.com)

Whisper is genuinely amazing - with the right nudging. It's the one AI thing that has genuinely turned my life upside-down in an unambiguously good way.

People should check out Subtitle Edit (and throw the dev some money) which is a great interface for experimenting with Whisper transcription. It's basically Aegisub 2.0, if you're old, like me.

HOWTO:

Drop a video or audio file to the right window, then go to Video > Audio to text (Whisper). I get the best results with Faster-Whisper-XXL. Use large-v2 if you can (v3 has some regressions), and you've got an easy transcription and translation workflow. The results aren't perfect, but Subtitle Edit is for cleaning up imperfect transcripts with features like Tools > Fix common errors.

EDIT: Oh, and if you're on the current gen of Nvidia card, you might have to add "--compute_type float32" to make the transcription run correctly. I think the error is about an empty file, output or something like that.

EDIT2: And if you get another error, possibly about whisper.exe, iirc I had to reinstall the Torch libs from a specific index like something along these lines (depending on whether you use pip or uv):

    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

    uv pip install --system torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

If you get the errors and the above fixes work, please type your error message in a reply with what worked to help those who come after. Or at least the web crawlers for those searching for help.

https://www.nikse.dk/subtitleedit

https://www.nikse.dk/donate

https://github.com/SubtitleEdit/subtitleedit/releases

tossit444 · 1h ago

Aegisub is still actively developed (forked), and imo, both software can't really be compared to one another. They can complement each other, but SE is much better for actual transcription. Aegisub still does the heavy lifting for typesetting and the like.

pawelduda · 23m ago

Can you give an example why it made your life that much better?

pmarreck · 32s ago

Now if it only did separate speaker identification (diarization)

londons_explore · 3h ago

Does this have the ability to edit historic words as more info becomes available?

Eg. If I say "I scream", it sounds phonetically identical to "Ice cream".

Yet the transcription of "I scream is the best dessert" makes a lot less sense than "Ice cream is the best dessert".

Doing this seems necessary to have both low latency and high accuracy, and things like transcription on android do that and you can see the adjusting guesses as you talk.

yvdriess · 1h ago

A good opportunity to point people to the paper with my favorite title of all time:

"How to wreck a nice beach you sing calm incense"

https://dl.acm.org/doi/10.1145/1040830.1040898

abound · 1h ago

For folks like me puzzling over what the correct transcription of the title should be, I think it's "How to recognize speech using common sense"

strken · 8m ago

Thank you! "Calm incense" makes very little sense when said in an accent where calm isn't pronounced like com.

efilife · 5m ago

Thanks. Now I know that I'm not that stupid and this actually makes no sense

fiatjaf · 33m ago

Thank you very much!

fmx · 1h ago

The paper: https://sci-hub.st/https://dl.acm.org/doi/10.1145/1040830.10...

(Agree that the title is awesome, by the way!)

Fluorescence · 1h ago

It makes me curious about how human subtitlers or even scriptwriters choose to transcribe intentionally ambiguous speech, puns and narratively important mishearings. It's like you need to subtitle what is heard not what is said.

Do those born profoundly deaf specifically study word sounds in order to understand/create puns, rhymes and such so they don't need assistance understanding narrative mishearings?

It must feel like a form of abstract mathematics without the experiential component... but then I suspect mathematicians manufacture an experiential phenomena with their abstractions with their claims of a beauty like music... hmm!

ph4evers · 3h ago

Whisper works on 30 second chunks. So yes it can do that and that’s also why it can hallucinate quite a bit.

jeroenhd · 3h ago

The ffmpeg code seems to default to three second chunks (https://ffmpeg.org/ffmpeg-filters.html#whisper-1):

    queue
    
         The maximum size that will be queued into the filter before processing the audio with whisper. Using a small value the audio stream will be processed more often, but the transcription quality will be lower and the required processing power will be higher. Using a large value (e.g. 10-20s) will produce more accurate results using less CPU (as using the whisper-cli tool), but the transcription latency will be higher, thus not useful to process real-time streams. Consider using the vad_model option associated with a large queue value. Default value: "3"

londons_explore · 2h ago

so if "I scream" is in one chunk, and "is the best dessert" is in the next, then there is no way to edit the first chunk to correct the mistake? That seems... suboptimal!

I don't think other streaming transcription services have this issue since, whilst they do chunk up the input, past chunks can still be edited. They tend to use "best of N" decoding, so there are always N possible outputs, each with a probability assigned, and as soon as one word is the same in all N outputs then it becomes fixed.

The internal state of the decoder needs to be duplicated N times, but that typically isn't more than a few kilobytes of state so N can be hundreds to cover many combinations of ambiguities many words back.

miki123211 · 2h ago

The right way to do this would be to use longer, overlapping chunks.

E.g. do thranscription every 3 seconds, but transcribe the most recent 15s of audio (or less if it's the beginning of the recording).

This would increase processing requirements significantly, though. You could probably get around some of that with clever use of caching, but I don't think any (open) implementation actually does that.

superluserdo · 1h ago

I basically implemented exactly this on top of whisper since I couldn't find any implementation that allowed for live transcription.

https://tomwh.uk/git/whisper-chunk.git/

I need to get around to cleaning it up but you can essentially alter the number of simultaneous overlapping whisper processes, the chunk length, and the chunk overlap fraction. I found that the `tiny.en` model is good enough with multiple simultaneous listeners to be able to have highly accurate live English transcription with 2-3s latency on a mid-range modern consumer CPU.

llarsson · 2h ago

Attention is all you need, as the transformative paper (pun definitely intended) put it.

Unfortunately, you're only getting attention in 3 second chunks.

no_wizard · 1h ago

That’s because at the end of the day this technology doesn’t “think”. It simply holds context until the next thing without regard for the previous information

anonymousiam · 1h ago

Whisper is excellent, but not perfect.

I used Whisper last week to transcribe a phone call. In the transcript, the name of the person I was speaking with (Gem) was alternately transcribed as either "Jim" or "Jem", but never "Gem."

JohnKemeny · 1h ago

Whisper supports adding a context, and if you're transcribing a phone call, you should probably add "Transcribe this phone call with Gem", in which case it would probably transcribe more correctly.

0points · 2h ago

So, yes, and also no.

lgessler · 2h ago

I recommend having a look at 16.3 onward here if you're curious about this: https://web.stanford.edu/~jurafsky/slp3/16.pdf

I'm not familiar with Whisper in particular, but typically what happens in an ASR model is that the decoder, speaking loosely, sees "the future" (i.e. the audio after the chunk it's trying to decode) in a sentence like this, and also has the benefit of a language model guiding its decoding so that grammatical productions like "I like ice cream" are favored over "I like I scream".

shaunpud · 2h ago

I Scream in the Sun https://carmageddon.fandom.com/wiki/I_Scream_in_the_Sun

didacusc · 1h ago

what would it make of this? https://www.youtube.com/watch?v=zyvZUxnIC3k

DiogenesKynikos · 2h ago

This is what your brain does when it processes language.

I find that in languages I don't speak well, my ability to understand degrades much more quickly as the audio quality goes down. But in my native language, even with piss poor audio quality, my brain fills in the garbled words with its prior expectation of what those words should be, based on context.

mockingloris · 2h ago

A slight segue to this; I was made aware of the phenomena that - The language in which you think in, sets the constraints to which you level of expanse the brain can think and parse information in.

I think in English fortunately and it's an ever evolving language so, expanding as the world does. That is compared to the majority of people where I'm from; English was a second language they had to learn and the people that thought them weren't well equipped with the resources to do a good job.

│

└── Dey well; Be well

Lio · 2h ago

Once local transcription is in more places hopefully we can persuade content creator not to burn bouncing sub-titles into their videos.

I've seen professionally produced recordings on dry and technical subjects with good sound quality where they've decided to use distracting sub-titles with no way to disable them.

It seems so unnecessary if you're not making novelty videos about cats.

Also local transcription allows for automatic translation and again overlaying subtitles on top of an existing burnt in set is a really poor reading experience.

ambicapter · 1h ago

They do that because it increases “engagement”, not because they care about the user’s experience with the subtitles.

whywhywhywhy · 1h ago

Algorithm boosts it that’s why they do it. Even if every device had real time 100% accurate subtitling built in they’d still do it if they video performs better with it.

HPsquared · 2h ago

The other problem with burned-in subtitles is you can't change the language.

LorenDB · 1h ago

The other other problem with burned-in subtitles is that they normally have horrible formatting. Who wants to try to read single words that only flash on-screen while they are being spoken?

rkomorn · 2h ago

True, but (as someone who not infrequently has to rewind content on just about all streaming apps because it decided one particular subtitle only needed to be display for less than 200ms this time around) sometimes burned-in seems like a good idea.

I don't understand why the problem seems so pervasive (I've seen it on Netflix, Viki, and Apple TV, at least) and so transient.

preisschild · 2h ago

They could also just upload those transcriptions as normal closed-captioning srt subtitles...

jimkleiber · 1h ago

not all social media will show subtitles/captions tho, which is the challenge. YouTube Shorts, TikTok videos, IG reels, FB reels, Whatsapp statuses, and more. I think some allow cc but some don't, and if someone reshares to another platform, it may not be there, so some of us burn them in begrudgingly :-)

dzhiurgis · 1h ago

It's just so annyoing how someone like Netflix offers like 3-4 languages for most of its content when you can basically get it for free via browser extensions (if you watch on browser).

Must be union thing.

dewey · 52m ago

That Netflix who would need to pay more to license more subtitles can't compete with pirated or unlicensed auto-generated subtitles shouldn't really be a surprise.

It's also annoying that you have to pay for Netflix when you can get the same movies for free with less restrictions on a pirate site.

JohnKemeny · 1h ago

Related, a blog article by the author of the patch:

Run Whisper audio transcriptions with one FFmpeg command

https://medium.com/@vpalmisano/run-whisper-audio-transcripti...

Posted here, with 0 comments: https://news.ycombinator.com/item?id=44869254

voxadam · 3h ago

Am I correct in understanding that Whisper is a speech recognition AI model originally created by OpenAI?

https://en.wikipedia.org/wiki/Whisper_(speech_recognition_sy...

Maxious · 3h ago

yep, there's a c++ implementation to run it https://github.com/ggml-org/whisper.cpp

oezi · 3h ago

Isn't WhisperX the canonical choice for running Whisper?

0points · 2h ago

While whisper and whisperx is python implementations, the whisper.cpp wins the benchmarks.

sampullman · 2h ago

Maybe for running locally? whisper.cpp is nice because you can embed it pretty easily in apps for various targets like iOS, OSX, Android, wasm, etc.

johnisgood · 3h ago

Yes.

From the documentation:

> It runs automatic speech recognition using the OpenAI's Whisper model.

voxadam · 3h ago

Thanks, I was being tripped up by DDOS protection on code.ffmpeg.org for a minute and couldn't read the patch. The combo of Firefox and the fact that Quantum/Lumen/CenturyLink seems to get off by rotating my dynamic IP for no reason occasionally triggers various DDOS protections schemes.

acidburnNSA · 3h ago

Yes, according to the comments in the patch, you are correct.

cess11 · 3h ago

Kind of, it's a family of audio transcription models.

https://huggingface.co/search/full-text?q=whisper

AlienRobot · 3h ago

I think so, if I remember correctly PotPlayer also supports it for automatic subtitling.

kwar13 · 3h ago

yes.

realxrobau · 9m ago

Annoyingly, something is broken with their anti not stuff, as it keeps refusing to let me see the page.

correa_brian · 40s ago

hell yeah

instagraham · 3h ago

Does this mean that any software which uses ffmpeg can now add a transcription option? Audacity, Chrome, OBS etc

ks2048 · 2h ago

If they want to support it out-of-the box, they'll still have to embed a model file (roughly 500 MB - 3GB, varying size and quality)

MaxikCZ · 53m ago

I tried to use whisper to generate non-english subs from english audio, but wasnt able to figure out. I know it can do english subs from non-english audio, and that earlier (less precise) versions could do any language audio -> any language subs, but latest whisper only to english subs.

Anyone found a way?

abdusco · 43m ago

I solved it by generating English subtitles, then passing those to an LLM in chunks that are ~20 entries in size. Include preceding and following subtitles as context for better translation. Make sure to replace the timestamps with simple integer ids, because LLMs like to mangle those, no matter how hard you prompt.

I could share a python script that is working pretty reliably for me.

donatj · 2h ago

I know nothing about Whisper, is this usable for automated translation?

I own a couple very old and as far as I'm aware never translated Japanese movies. I don't speak Japanese but I'd love to watch them.

A couple years ago I had been negotiating with a guy on Fiver to translate them. At his usual rate-per-minute of footage it would have cost thousands of dollars but I'd negotiated him down to a couple hundred before he presumably got sick of me and ghosted me.

ethan_smith · 1h ago

Whisper can indeed transcribe Japanese and translate it to English, though quality varies by dialect and audio clarity. You'll need the "large-v3" model for best results, and you can use ffmpeg's new integration with a command like `ffmpeg -i movie.mp4 -af whisper=model=large-v3:task=translate output.srt`.

waltbosz · 23m ago

I wonder how the results of an AI Japanese-audio-to-English-subtitles would compare to a fansub-ed anime. I'm guessing it would be a more literal translation vs. contextual or cultural.

I found an interesting article about trollsubs, which I guess are fansubs made with a contemptuous flare. https://neemblog.home.blog/2020/08/19/the-lost-art-of-fan-ma...

Tangent: I'm one of those people who watch movies with closed captions. Anime is difficult because the subtitle track is often the original Japanese-to-English subtitles and not closed captions, so the text does not match the English audio.

chazeon · 13m ago

I do japanese transcription + gemini translations. It’s worse than fansub, but its much much better than nothing. First thing that could struggle is actually the vad, then is special names and places, prompting can help but not always. Finally it’s uniformity (or style). I still feel that I can’t control the punctuation well.

prmoustache · 2h ago

My personnal experience trying to transcribe (not translate) was a complete failure. The thing would invent stuff. It would also be completely lost when more than one language is used.

It also doesn't understand contexts so does a lot of errors you see in automatic translations from videos in youtube for example.

okdood64 · 30m ago

It's curious how YouTube's is so bad still given the current state of the art; but it has got a lot better in the last 6 months.

trenchpilgrim · 2h ago

Whisper has quite bad issues with hallucination. It will inject sentences that were never said in the audio.

It's decent for classification but poor at transcription.

_def · 2h ago

May I ask which movies? I'm just curious

poglet · 2h ago

Yep, whisper can do that. You can also try whisperx (https://github.com/m-bain/whisperX) for a possibly better experience with aligning of subtitles to spoken words.

zoobab · 2h ago

Not sure it will be packaged in Debian, with an external binary model god knows how it was produced...

majewsky · 2h ago

It looks like the model file needs to be supplied at invocation time, so the binary blob would not be required for packaging.

zoobab · 35m ago

so 'apt install ffmpeg' won't be enough to have the feature?

SahAssar · 1m ago

You'd have the feature, but you also need to supply the model. The feature seems to just be that ffmpeg has the ability to run the model, it does not include the model.

webinar · 1h ago

I've been using FFmpeg and Whisper to record and transcribe live police scanner audio for my city, and update it in real-time to a live website. It works great, with the expected transcription errors and hallucinations.

waltbosz · 19m ago

I wanted to do this for my local county council meetings. I think in this context speaker recognition would be important.

Xunjin · 1h ago

Is this website open? Would love to see your work :P

webinar · 1h ago

somerville.votolab.com

mkayokay · 1m ago

Looks like this is a nice case were the LLM thinks that silence is "thanks for watching" which was discussed on here a few days ago.

bondarchuk · 3h ago

Can whisper do multilingual yet? Last time I tried it on some mixed dutch/english text it would spit out english translations for some of the dutch text. Strange bug/feature since from all appearances it had understood the dutch text perfectly fine.

clarionbell · 2h ago

I think the Dutch/English is probably the worst combination for this. Languages are rather close.

bondarchuk · 2h ago

I don't understand how this would happen, though. It's not like it will mishear a dutch sentence as if it's english; it will correctly pick up the dutch sentence, but (since the language is auto-detected as english at the start of the segment), seemingly auto-translate that (correct and correctly heard) dutch text to english. All we need is a way to get the dutch text that's surely somewhere in there, before the translation happens.

Unless it was trained end-to-end on dutch-subtitled english text?? Which might make the translation a somewhat inextricable part of the model..? Does anyone know?

numpad0 · 2h ago

Isn't that a bit much for ASR models? Humans can't handle simultaneous multilingual dictation task either, I have to stop and reinitialize ears before switching languages between English and my primary one.

cenamus · 18m ago

Isn't that exactly what intepreters do?

bondarchuk · 1h ago

Seems like it already has the capability somewhere in the model though - see my reply to clarionbell.

kwar13 · 2h ago

Best for English, but I've found it pretty decent for Spanish.

MaKey · 47m ago

It's even better for some languages other than English (e. g. Spanish), see: https://github.com/openai/whisper?tab=readme-ov-file#availab...

jeroenhd · 3h ago

I found that it works quite well for Dutch+English as long as you use one of the larger models. But that may just be luck, I imagine mixing Italian and Swedish will have very different results.

ph4evers · 3h ago

Whisper-v3 works well for multi-lingual. I tried it with Dutch, German and English

guilamu · 3h ago

Whisper has been multilingual for 5 years at least.

bondarchuk · 2h ago

I know it is ostensibly multilingual, it's less than a year since I tried, but it does this thing where it then translates everything (or only some things) into a single language regardless with no way to turn it off.

guilamu · 11m ago

Sorry, I've been using it for French audio files since 5 years and never had this issues.

porridgeraisin · 58m ago

I had a small bash pipeline for doing this until now.

  ffmpeg -f pulse -i "$(pactl get-default-source)" -t 5 -f wav -ar 16000 -ac 1 -c:a pcm_s16le - \
  | ./main - \
  | head -2 \
  | tail -1 \
  | cut -d] -f2 \
  | awk '{$1=$1};1'

The reading from mic part (-f pulse, pactl...) is linux-specific rest of it should be cross platform. The `main` executable is the whisper.cpp executable (see whisper.cpp github readme, it's just the output of `make base.en` from that).

Edit: -t 5 controls recording duration.

Oh and add 2>/dev/null to silence the debug output. I copied this from a pipe that further sends it into an LLM that then looks at the meaning and turns it into a variety of structured data (reminders, todo items, etc) which I then....

re · 3h ago

I've been playing with whisper to try to do local transcription of long videos, but one issue I've found is that long (>15 seconds) spans without any speech tend to send it into a hallucination loops that it often can't recover from. I wonder if, with direct integration into ffmpeg, they will be able to configure it in a way that can improve that situation.

franga2000 · 3h ago

Whisper is supposed to be used with voice activity detection and all production implementations that I've seen do that. The raw model is known to make up nonsense for silence because, as I understand it, it was never trained not to do that, assuming everyone will use VAD

No comments yet

42lux · 3h ago

You usually delete silence before using something like whisper.

re · 3h ago

I've heard that, but that doesn't sound like a useful approach for videos where (1) non-speech segments can have plenty of other sound (music, noise) and (2) you want timestamps to match up with the original video, like for subtitles. But maybe there are known mitigations for both of those issues that I'm not aware of. And if they do exist maybe they can be included in the ffmpeg whisper integration.

miki123211 · 2h ago

By "delete", people mostly mean "detect", so that you can avoid processing such segments through Whisper. There's no reason to actually cut the silence out from the original audio file.

No comments yet

hnlmorg · 2h ago

This is designed for real time use too. And in such cases, you couldn’t delete the silence before use.

42lux · 2h ago

The ffmpeg implementation might be the example was not.

lawik · 3h ago

I wonder if they'll be satisfied there or add a chunk of others now that they've started. Parakeet is supposed to be good?

Should they add Voice Activity Detection? Are these separate filters or just making the whisper filter more fancy?

shrx · 3h ago

Voice Activity Detection support is already included.

mockingloris · 2h ago

How could one in theory, use this to train on a new language? Say for a hubby project; I have recordings of some old folks stories in my local dialect.

│

└── Dey well; Be well

zzsshh · 3h ago

Does this finally enable dynamically generating subtitles for movies with AI?

jeroenhd · 3h ago

Docs say:

    If set, the transcription output will be sent to the specified file or URL
    (use one of the FFmpeg AVIO protocols); otherwise, the output will be logged as info messages.
    The output will also be set in the "lavfi.whisper.text" frame metadata.
    If the destination is a file and it already exists, it will be overwritten.

    @item format
    The destination format string; it could be "text" (only the transcribed text will be sent to the destination), "srt" (subtitle format) or "json".
    Default value: @code{"text"}

I don't know if this can embed the subtitles, but it does support generating accompanying srt files.

Of course, you could already do that by just manually calling whisper on files, but now you don't need to export parts or transformed media files to feed into whisper.

regularfry · 3h ago

If you have enough processing power. Without a GPU it's going to lag.

KeplerBoy · 3h ago

Whisper is pretty fast.

diggan · 3h ago

Finally? I think VLC demo'd this a while ago at some conference where they had a table, if I remember correctly.

SSLy · 3h ago

VLC and ffmpeg are unrelated projects

demurgos · 22m ago

I'm not very familiar with them, but I always assumed that there is a lot of overlap between the maintainers of both projects.

martzoukos · 2h ago

I guess that there is no streaming option for sending generated tokens to, say, an LLM service to process the text in real-time.

nomad_horse · 2h ago

Whisper has the encoder-decoder architecture, so it's hard to run streaming efficiently, though whisper-streaming is a thing.

https://kyutai.org/next/stt is natively streaming STT.

thedangler · 1h ago

Does this whisper also do text-to-speech?

boutell · 3h ago

Shut off the broken bot filter so we can read it please

majewsky · 2h ago

From experience, these bot filters are usually installed because the site would be down entirely without rejecting AI scrapers, so the argument to shut it off to improve usability is rather silly.

QuantumNomad_ · 3h ago

Archived snapshots of the linked page:

https://web.archive.org/web/20250813104007/https://code.ffmp...

https://archive.is/dmj17

You can read it on one of these without having to pass that specific bot check

diggan · 3h ago

Took my iPhone 12 Mini a whole of 0.1 seconds to pass it. What hardware/OS are you using?

politelemon · 2h ago

Took me zero seconds to be blocked with invalid response

miloignis · 2h ago

It also instantly blocks me on GrapheneOS, both Firefox and Vanadium. Very odd, as I've never had an issue with Anubis before.

shaky-carrousel · 1h ago

GrapheneOS here, with Vanadium in incognito, it doesn't block me, both in wifi and in mobile. Maybe it was a temporary hiccup.

londons_explore · 3h ago

Took about 30 secs for me (5 yr old intel cpu). Looked like there was a progress bar, but it didn't progress. Maybe the difficulty varies depending on IP address?

jeroenhd · 3h ago

Anubis has config for that: https://anubis.techaro.lol/docs/admin/policies#request-weigh...

It's up to the site admin to configure it that way, but it's possible some IP ranges/user agents are more often used by bots and therefore have an increased weight.

For old browsers there's also an option to use meta refresh instead of JS (https://anubis.techaro.lol/docs/admin/configuration/challeng...) but that's quite a recent addition and not enabled by default.

diggan · 2h ago

> Maybe the difficulty varies depending on IP address?

I'm currently roaming in Finland with a Spanish SIM so would have expected the opposite in that case.

ta1243 · 1h ago

my i5-6200U with firefox/linux is about 10 years old. I used a variety of add blocking and fingerprint blocking techniques. Cloudflare often complains and blocks me.

This page loaded pretty much instantly (certainly in the time it took to switch to the background tab I loaded in). But then ffmpeg is written by old school engineers with old school ways of working. Their social media accounts are a hilarity of trolling worthy of slashdot in its peak.

johnisgood · 3h ago

Took me 8 seconds on my shitty desktop.

blahyawnblah · 1h ago

The stock chrome browser Google news uses

jeroenhd · 3h ago

Check out commit 13ce36fef98a3f4e6d8360c24d6b8434cbb8869b from https://git.ffmpeg.org/ffmpeg.git if your web browser doesn't support Javascript. The linked page is just a git viewer for that specific commit.

yorwba · 3h ago

Or read the documentation for the new whisper filter: https://ffmpeg.org/ffmpeg-filters.html#whisper-1

jeroenhd · 3h ago

That also works, I assumed the ffmpeg website would also be behind Anubis if the git server is, but it doesn't actually seem to be.

majewsky · 2h ago

Anubis is not all that useful for static websites since serving them does not generate high load (unlike when a bot traverses a Git server UI).

yewenjie · 3h ago

I have recently found that parakeet from NVIDIA is way faster and pretty much as correct as Whisper, but it only works with English.

dncornholio · 1h ago

I was expecting a lot more comments on if this is a necessary feature or if this even belongs in a library like ffmpeg. I think this is bloat, especially when the feature doesn't work flawless, whisper is very limited.

MrGilbert · 1h ago

The only item that was discussed was that the subtitle workflow does not seem to be that good, afaict:

https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20022#issuecomme...

kwar13 · 3h ago

Fantastic! I am working on a speech-to-text GNOME extension that would immensely benefit from this.

https://github.com/kavehtehrani/gnome-speech2text

ggap · 3h ago

Very interesting to see this!

GPT-5 (openai.com)

Fight Chat Control (fightchatcontrol.eu)

GitHub is no longer independent at Microsoft after CEO resignation (theverge.com)

I tried every todo app and ended up with a .txt file (al3rez.com)

Claude Sonnet 4 now supports 1M tokens of context (anthropic.com)

Ultrathin business card runs a fluid simulation (github.com)

I want everything local – Building my offline AI workspace (instavm.io)

Wikipedia loses challenge against Online Safety Act (bbc.com)

Emailing a one-time code is worse than passwords (blog.danielh.cc)

Debian 13 “Trixie” (debian.org)

Vibechart (vibechart.net)

Claude Code is all you need (dwyer.co.za)

Show HN: The current sky at your approximate location, as a CSS gradient (sky.dlazaro.ca)

How I code with AI on a budget/free (wuu73.org)

Try and (ygdp.yale.edu)

GPT-5: Key characteristics, pricing and system card (simonwillison.net)

Wikimedia Foundation Challenges UK Online Safety Act Regulations (wikimediafoundation.org)

OpenFreeMap survived 100k requests per second (blog.hyperknot.com)

Jim Lovell, Apollo 13 commander, has died (nasa.gov)

Show HN: Building a web search engine from scratch with 3B neural embeddings (blog.wilsonl.in)

Ask HN: How can ChatGPT serve 700M users when I can't run one GPT-4 locally?

Historical Tech Tree (historicaltechtree.com)

Cursed Knowledge (immich.app)

The Chrome VRP Panel has decided to award $250k for this report (issues.chromium.org)

Meta Leaks Part 1: Israel and Meta (archive.org)

GPT-OSS vs. Qwen3 and a detailed look how things evolved since GPT-2 (magazine.sebastianraschka.com)

Flipper Zero dark web firmware bypasses rolling code security (rtl-sdr.com)

Monero appears to be in the midst of a successful 51% attack (twitter.com)

Getting good results from Claude Code (dzombak.com)

Why are there so many rationalist cults? (asteriskmag.com)

The Framework Desktop is a beast (world.hey.com)

GPT-5 for Developers (openai.com)

StarDict sends X11 clipboard to remote servers (lwn.net)

Linear sent me down a local-first rabbit hole (bytemash.net)

Show HN: Engineering.fyi – Search across tech engineering blogs in one place (engineering.fyi)

OpenSSH Post-Quantum Cryptography (openssh.com)

Search all text in New York City (alltext.nyc)

Trump Orders National Guard to Washington and Takeover of Capital’s Police (nytimes.com)

Vanishing from Hyundai’s data network (techno-fandom.org)

My Lethal Trifecta talk at the Bay Area AI Security Meetup (simonwillison.net)

The surprise deprecation of GPT-4o for ChatGPT consumers (simonwillison.net)

Windows XP Professional (win32.run)

MCP overlooks hard-won lessons from distributed systems (julsimon.medium.com)

Tor: How a military project became a lifeline for privacy (thereader.mitpress.mit.edu)

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models [pdf] (arxiv.org)

OpenAI's new open-source model is basically Phi-5 (seangoedecke.com)

Exit Tax: Leave Germany before your business gets big (eidel.io)

Building Bluesky comments for my blog (natalie.sh)

Cursor CLI (cursor.com)

Project Hyperion: Interstellar ship design competition (projecthyperion.org)

FFmpeg 8.0 adds Whisper support

Comments (129)