OpenAI Charges by the Minute, So Make the Minutes Shorter

235 points by georgemandis | 6/25/2025, 1:17:25 PM | george.mand.is

Comments (68)

w-m · 2h ago
By transcribing a talk by Andrej, you already picked the most challenging case possible, speed-wise. His natural talking speed is already >=1.5x that of a normal human. He's one of the people whose videos you absolutely have to set back down to 1x on YouTube to follow what's going on.

In the spirit of making more of an OpenAI minute, don't send it any silence.

E.g.

    ffmpeg -i video-audio.m4a \
      -af "silenceremove=start_periods=1:start_duration=0:start_threshold=-50dB:\
                         stop_periods=-1:stop_duration=0.02:stop_threshold=-50dB,\
                         apad=pad_dur=0.02" \
      -c:a aac -b:a 128k output_minpause.m4a -y
This will cut the talk down from 39m31s to 31m34s by replacing any silence (below a -50dB threshold) longer than 20ms with a 20ms pause. And to keep with the spirit of your post, I only measured that the input file got shorter; I didn't look at all at the quality of the transcription produced from the shorter version.
behnamoh · 1h ago
> His natural talking speed is already >=1.5x that of a normal human. He's one of the people whose videos you absolutely have to set back down to 1x on YouTube to follow what's going on.

I wonder if there's a way to automatically detect how "fast" a person talks in an audio file. I know it's subjective and different people talk at different paces, but it'd be cool to roughly know when OP's trick fails (they mention 4x ruined the output; maybe for Karpathy that would happen at 2x).

btown · 53m ago
Even a last-decade transcription model could be used to detect a rough number of syllables per unit time, and the accuracy of that model could be used to guide speed-up and dead-time detection before sending to a more expensive model. As with all things, it's a question of whether the cost savings justify the engineering work.
varispeed · 58m ago
It's a shame platforms don't generally support speeds greater than 2x. One of my "superpowers" (or curses) is that I cannot stand a normal speaking pace. When I watch lectures, I always go for maximum speed, and even that is too slow for me. I wish platforms included 4x, but done properly (with minimal artefacts).
narratives1 · 7m ago
I use a Chrome extension that lets you take any video player (including embedded) to 10x speed. Turn most things to 3-4x. It works on ads too
mrmuagi · 53m ago
All audiobooks are like this for me. I tried it for lectures, but if I'm taking handwritten notes I can't keep up with my writing.

I wonder if there are negative side effects to this, though. Do you notice that interacting with people who speak slower requires a greater deal of patience?

colechristensen · 10m ago
No, but a little. I struggle with people who repeat every point they're making several times, or who, when you say "you told me exactly this the last time we spoke," cannot be stopped from retelling the whole thing verbatim. Usually in those situations, though, there are some potential cognitive issues involved, so you can only be understanding.
dpcx · 46m ago
https://github.com/codebicycle/videospeed has been a wonderful addition for me.
lofaszvanitt · 54m ago
Robot in a human body identified :D.
echelon · 1h ago
> I wonder if there's a way to automatically detect how "fast" a person talks in an audio file.

Stupid heuristic: take a segment of the video, transcribe the text, and count the number of words per utterance duration. If you need speaker diarization, handle each speaker's utterance durations independently. You can slice further, e.g. by syllable count.
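
A minimal shell sketch of that heuristic, assuming the openai-whisper CLI, ffprobe and bc are installed (file names are placeholders):

    # transcribe a sample with a small local model; writes sample.txt
    whisper sample.m4a --model tiny --output_format txt --output_dir .

    # words per minute = word count / (duration in minutes)
    dur=$(ffprobe -v error -show_entries format=duration -of csv=p=0 sample.m4a)
    words=$(wc -w < sample.txt)
    echo "scale=1; $words / ($dur / 60)" | bc

Conversational English is often cited at roughly 150 words per minute, so a rate well above that suggests dialing the speed-up factor back.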

nand4011 · 1h ago
https://www.science.org/doi/10.1126/sciadv.aaw2594

Apparently human language conveys information at around 39 bits/s. You could use a technique similar to that paper's to determine a speaker's information rate and then normalize it to 39 bits/s by changing the speed of the video.

georgemandis · 2h ago
Oooh fun! I had a feeling there was more ffmpeg wizardry I could be leaning into here. I'll have to try this later—thanks for the idea!
w-m · 2h ago
In the meantime I realized that the apad part is nonsensical - it pads the end of the stream, not at each silence-removed cut. I wanted to get angry at o3 for proposing this, but then I had a look at the silenceremove= documentation myself: https://ffmpeg.org/ffmpeg-filters.html#silenceremove

Good god. You couldn't make that any more convoluted and hard-to-grasp if you wanted to. You gotta love ffmpeg!

I now think this might be a good solution:

    ffmpeg -i video-audio.m4a \
           -af "silenceremove=start_periods=1:stop_periods=-1:stop_duration=0.15:stop_threshold=-40dB:detection=rms" \
           -c:a aac -b:a 128k output.m4a -y
snickerdoodle12 · 57m ago
I love ffmpeg but the documentation is often close to incomprehensible.
pragmatic · 1h ago
No not really? The talk where he babbles about OSes and everyone is somehow impressed?
heeton · 2h ago
A point on skimming vs taking the time to read something properly.

I read a transcript + summary of that exact talk. I thought it was fine, but uninteresting, I moved on.

Later I saw it had been put on youtube and I was on the train, so I watched the whole thing at normal speed. I had a huge number of different ideas, thoughts and decisions, sparked by watching the whole thing.

This happens to me in other areas too. Watching a conference talk in person is far more useful to me than watching it online with other distractions. Watching it online is, in turn, more useful than reading a summary.

Going for a walk to think about something deeply beats a 10 minute session to "solve" the problem and forget it.

Slower is usually better for thinking.

pluc · 2h ago
Seriously this is bonkers to me. I, like many hackers, hated school because they just threw one-size-fits-all knowledge at you and here we are, paying for the privilege to have that in every facet of our lives.

Reading is a pleasure. Watching a lecture or a talk and feeling the pieces fall into place is great. Having your brain work out the meaning of things is surely something that defines us as a species. We're willingly heading for such stupidity; I don't get it. I don't get how we can all be so blind to what this is going to create.

hooverd · 2h ago
If you're not listening to summaries of different audiobooks at 2x speed in each ear you're not contentmaxing.
isaacremuant · 1h ago
> We're willingly heading for such stupidity; I don't get it. I don't get how we can all be so blind to what this is going to create.

Your doomerism and superiority don't follow from your initial "I, like many hackers, don't like one-size-fits-all."

This is literally offering you MANY sizes, and you have the freedom to choose. Somehow you're pretending it's uniformity being pushed down on you.

Consume it however you want and come up with actual criticisms next time?

georgemandis · 2h ago
For what it's worth, I completely agree with you, for all the reasons you're saying. With talks in particular I think it's seldom about the raw content and ideas presented and more about the ancillary ideas they provoke and inspire, like you're describing.

There is just so much content out there. And context is everything. If the person sharing it had led with some specific ideas or thoughts I might have taken the time to watch and looked for those ideas. But in the context it was received—a quick link with no additional context—I really just wanted the "gist" to know what I was even potentially responding to.

In this case, for me, it was worth it. I can go back and decide if I want to watch it. Your comment has intrigued me so I very well might!

++ to "Slower is usually better for thinking"

conradev · 1h ago
Was it the speed or the additional information vended by the audio and video? If someone is a compelling speaker, the same message will be way more effective in an audiovisual format. The audio has emphasis on certain parts of the content, for example, which is missing from the transcript or summary entirely. Video has gestural and facial cues, also often utilized to make a point.
mutagen · 1h ago
Not to discount slower speeds for thinking, but I wonder if there's also value in dipping into a talk or a subject and then revisiting (re-watching) it with the time to ponder the thoughts a little more deeply.
tass · 1h ago
This is similar to strategies in “how to read a book” (Adler).

By understanding the outline and themes of a book (or lecture, I suppose), it makes it easier to piece together thoughts as you delve deeper into the full content.

bongodongobob · 32m ago
You'd love where I work. Everything is needlessly long, bloviating PowerPoint meetings that could easily be digested in a 5-minute email.
georgemandis · 4h ago
I was trying to summarize a 40-minute talk with OpenAI's transcription API, but it was too long for the API's 25-minute cap. So I sped it up with ffmpeg to fit. It worked quite well (up to 3x speed) and was cheaper and faster, so I wrote about it.

Felt like a fun trick worth sharing. There’s a full script and cost breakdown.
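
For the impatient, the core of the trick looks roughly like this. This is a minimal sketch rather than the post's exact script, with file names as placeholders (the post pairs it with yt-dlp to fetch the audio):

    # speed the audio up 3x; chaining two atempo stages keeps each one
    # inside the 0.5-2.0 range that every ffmpeg build accepts
    ffmpeg -i talk.m4a -filter:a "atempo=2.0,atempo=1.5" \
           -c:a aac -b:a 64k talk_3x.m4a -y

    # send the (now much shorter) file to OpenAI's transcription endpoint
    curl https://api.openai.com/v1/audio/transcriptions \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -F model="gpt-4o-transcribe" \
      -F file="@talk_3x.m4a"

Newer ffmpeg builds also accept atempo=3.0 directly; the chained form is just the most portable.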

bravesoul2 · 2h ago
You could have kept quiet and started a cheaper-than-OpenAI transcription business :)
behnamoh · 1h ago
Sure, but now the world is a better place because he shared something useful!
hn8726 · 22m ago
Or openai will do it themselves for transcription tasks
4b11b4 · 1h ago
Pre-processing the audio is still a valid biz; multiple types of pre-processing might be valid.
timerol · 2h ago
> Is It Accurate?

> I don’t know—I didn’t watch it, lol. That was the whole point. And if that answer makes you uncomfortable, buckle-up for this future we're hurtling toward. Boy, howdy.

This is a great bit of work, and the author accurately summarizes my discomfort

alok-g · 5m ago
>> by jumping straight to the point ...

Love this! I wish more authors followed this approach. So many articles wander all over the place before 'the point' appears.

If they tried, perhaps some 50% of authors would realize that they don't _have_ a point.

simonw · 2h ago
There was a similar trick which worked with Gemini versions prior to Gemini 2.0: they charged a flat rate of 258 tokens for an image, and it turns out you could fit more than 258 tokens of text in an image of text and use that for a discount!
dataviz1000 · 45m ago
I built a Chrome extension with one feature that transcribes audio to text in the browser using huggingface/transformers.js running the OpenAI Whisper model with WebGPU. It works perfectly! Here is a list of examples of all the things you can do in the browser with WebGPU for free. [0]

The last thing in the world I want to do is listen to or watch presidential social media posts, but, on the other hand, sometimes enormously stupid things are said which move the S&P 500 up or down 60 points in a session. So this feature queries for new posts every minute, does OCR image-to-text and transcribes video audio to text locally, then sends the post with its text for analysis, all in the background inside a Chrome extension, before notifying me of anything economically significant.

[0] https://github.com/huggingface/transformers.js/tree/main/exa...

[1] https://github.com/adam-s/doomberg-terminal

kgc · 4m ago
Impressive
rob · 1h ago
For anybody trying to do this in bulk: instead of using OpenAI's Whisper via their API, you can also use Groq [0], which is much cheaper:

[0] https://groq.com/pricing/

Groq is ~$0.02/hr with distil-large-v3, or ~$0.04/hr with whisper-large-v3-turbo. I believe OpenAI comes out to like ~$0.36/hr.

We do this internally with our tool that automatically transcribes local government council meetings right when they get uploaded to YouTube. It uses Groq by default, but I also added support for Replicate and Deepgram as backups because sometimes Groq errors out.
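
Groq's endpoint is OpenAI-compatible, so the call is mostly a URL and model-name swap. Roughly (model name and file name here are illustrative):

    curl https://api.groq.com/openai/v1/audio/transcriptions \
      -H "Authorization: Bearer $GROQ_API_KEY" \
      -F model="whisper-large-v3-turbo" \
      -F file="@council-meeting.m4a"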

georgemandis · 1h ago
Interesting! At $0.02 to $0.04 an hour I suspect you haven't been hunting for optimizations, but I wonder if this "speed up the audio" trick would save you even more.

> We do this internally with our tool that automatically transcribes local government council meetings right when they get uploaded to YouTube

Doesn't YouTube do this for you automatically these days within a day or so?

jerjerjer · 1m ago
> I wonder if this "speed up the audio" trick would save you even more.

At this point you'll need to at least check how much running ffmpeg costs. Probably less than $0.01 per hour of audio (roughly what you'd be saving), but still.
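
A quick sanity check is to time the re-encode on one file and scale by whatever your compute costs per hour (this reuses the chained-atempo command from upthread; the file name is a placeholder):

    time ffmpeg -i talk.m4a -filter:a "atempo=2.0,atempo=1.5" \
         -c:a aac -b:a 64k talk_3x.m4a -y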

rob · 1h ago
> Doesn't YouTube do this for you automatically these days within a day or so?

Oh yeah, we do a check first and use youtube-transcript-api if there's an automatic one available:

https://github.com/jdepoix/youtube-transcript-api

The tool usually detects them within like ~5 mins of being uploaded though, so usually none are available yet. Then it'll send the summaries to our internal Slack channel for our editors, in case there's anything interesting to 'follow up on' from the meeting.

Probably would be a good idea to add a delay to it and wait for the automatic ones though :)

anshumankmr · 5m ago
Someone should try transcribing Eminem's "Rap God" with this trick.
xg15 · 14m ago
That's really cool! Also, isn't this effectively the same as supplying audio with a sampling rate of 8kHz instead of the 16kHz that the model is supposed to work with?
Tepix · 1h ago
Why would you give up your privacy by sending what interests you to OpenAI when Whisper doesn't need that much compute in the first place?

With faster-whisper (int8, batch=8) you can transcribe 13 minutes of audio in 51 seconds on CPU.

anigbrowl · 32m ago
I came here to ask the same question. This is a well-solved problem, red queen racing it seems utterly pointless, a symptom of reflexive adversarialism.
55555 · 1h ago
This seems like a good place for me to complain about the fact that the automatically generated subtitle files Youtube creates are horribly malformed. Every sentence is repeated twice. In many subtitle files, the subtitle timestamp ranges overlap one another while also repeating every sentence twice in two different ranges. It's absolutely bizarre and has been like this for years or possibly forever. Here's an example - I apologize that it's not in English. I don't know if this issue affects English. https://pastebin.com/raw/LTBps80F
brendanfinan · 3h ago
would this also work for my video consisting of 10,000 PDFs?

https://news.ycombinator.com/item?id=44125598

jasonjmcghee · 2h ago
I can't tell if this is a meme or not.

And if someone had this idea and pitched it to Claude (the model this project was vibe coded with) it would be like "what a great idea!"

pimlottc · 1h ago
Appreciated the concise summary + code snippet upfront, followed by more detail and background for those interested. More articles should be written this way!
karpathy · 32m ago
Omg long post. TLDR from an LLM for anyone interested

Speed your audio up 2–3× with ffmpeg before sending it to OpenAI’s gpt-4o-transcribe: the shorter file uses fewer input-tokens, cuts costs by roughly a third, and processes faster with little quality loss (4× is too fast). A sample yt-dlp → ffmpeg → curl script shows the workflow.

;)

bravesoul2 · 26m ago
This is the sort of content I want to see in Tweets and LinkedIn posts.

I have been thinking for a while about how to make good use of the short space in those places.

LLM did well here.

georgemandis · 26m ago
Hahaha. Okay, okay... I will watch it now ;)

(Thanks for your good sense of humor)

karpathy · 11m ago
I like that your post deliberately gets to the point first and then (optionally) expands later, I think it's a good and generally underutilized format. I often advise people to structure their emails in the same way, e.g. first just cutting to the chase with the specific ask, then giving more context optionally below.

It's not my intention to bloat the information or delivery, but I also don't super know how to follow this format, especially in this kind of talk, because it's not so much about relaying specific information (like your final script here) as it is a collection of prompts back to the audience, things to think about.

My companion tweet to this video on X included a brief TLDR/summary where I tried, but I didn't super think it was very reflective of the talk; it was more about the topics covered.

Anyway, I am overall a big fan of doing more compute at the "creation time" to compress other people's time during "consumption time" and I think it's the respectful and kind thing to do.

KTibow · 2h ago
This is really interesting, although the cheapest route is still to use an alternative audio-compatible LLM (Gemini 2.0 Flash Lite, Phi 4 Multimodal) or an alternative host for Whisper (Deepinfra, Fal).
babuloseo · 2h ago
I use the YouTube trick, and will share it here: upload to YouTube, use their built-in transcription service to turn it into text for you, and then use Gemini 2.5 Pro to rebuild the transcript.

    ffmpeg \
      -f lavfi \
      -i color=c=black:s=1920x1080:r=5 \
      -i file_you_want_transcripted.wav \
      -c:v libx264 \
      -preset medium \
      -tune stillimage \
      -crf 28 \
      -c:a aac \
      -b:a 192k \
      -pix_fmt yuv420p \
      -shortest \
      file_you_upload_to_youtube_for_free_transcripts.mp4

This works VERY well for my needs.

fallinditch · 2h ago
When extracting transcripts from YouTube videos, can anyone give advice on the best (cost effective, quick, accurate) way to do this?

I'm confused because I read in various places that the YouTube API doesn't provide access to transcripts ... so how do all these YouTube transcript extractor services do it?

I want to build my own YouTube summarizer app. Any advice and info on this topic greatly appreciated!

rob · 1h ago
There's a tool that uses YouTube's unofficial APIs to get them if they're available:

https://github.com/jdepoix/youtube-transcript-api

For our internal tool that transcribes local city council meetings on YouTube (often 1-3 hours long), we found that these automatic ones were never available though.

(Our tool usually 'processes' the videos within ~5-30 mins of being uploaded, so that's also why none are probably available 'officially' yet.)

So we use yt-dlp to download the highest quality audio and then process them with whisper via Groq, which is way cheaper (~$0.02-0.04/hr with Groq compared to $0.36/hr via OpenAI's API.) Sometimes groq errors out so there's built-in support for Replicate and Deepgram as well.

We run yt-dlp on our remote Linode server and I have a Python script I created that will automatically login to YouTube with a "clean" account and extract the proper cookies.txt file, and we also generate a 'po token' using another tool:

https://github.com/iv-org/youtube-trusted-session-generator

Both cookies.txt and the "po token" get passed to yt-dlp when running on the Linode server and I haven't had to re-generate anything in over a month. Runs smoothly every day.

(Note that I don't use cookies/po_token when running locally at home, it usually works fine there.)
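
The download step itself is plain yt-dlp. A stripped-down sketch of that part (URL and output name are placeholders; the po token goes through --extractor-args, whose exact syntax depends on the yt-dlp version):

    # best-quality audio only; cookies.txt comes from the "clean" account
    yt-dlp -f bestaudio -x --audio-format m4a \
           --cookies cookies.txt \
           -o "council-meeting.%(ext)s" \
           "https://www.youtube.com/watch?v=VIDEO_ID"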

fallinditch · 31m ago
Very useful, thanks. So does this mean that every month or so you have to create a new 'clean' YouTube account and use that to create new po_token/cookies?

It's frustrating to have to jump through all these hoops just to extract transcripts when the YouTube Data API already gives reasonable limits to free API calls ... would be nice if they allowed transcripts too.

Do you think the various YouTube transcript extractor services all follow a similar method as yours?

vjerancrnjak · 2h ago
If YouTube has already generated automatic captions, you can download them free of charge with yt-dlp.
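
For instance, something like this grabs just the auto-generated English captions without downloading the video (URL is a placeholder):

    yt-dlp --skip-download --write-auto-subs --sub-langs en --sub-format vtt \
           "https://www.youtube.com/watch?v=VIDEO_ID"
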
tmaly · 1h ago
The whisper model weights are free. You could save even more by just using them locally.
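
For example, with the reference openai-whisper CLI (a minimal sketch; the model size trades accuracy against speed, and ffmpeg needs to be on the PATH):

    pip install -U openai-whisper
    whisper talk.m4a --model small --language en --output_format txt
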
amelius · 53m ago
Solution: charge by number of characters generated.
b0a04gl · 2h ago
it's still decoding every frame and matching phonemes either way, but speeding it up reduces how many seconds they bill you for. so you may hack their billing logic more than the model itself.

also means the longer you talk, the more you pay, even if the actual info density is the same. so if your voice has longer pauses or you speak slowly, you may be subsidizing inefficiency.

makes me think maybe the next big compression is in delivery cadence. just auto-optimize voice tone and pacing before sending it to LLM. feed it synthetic fast speech with no emotion, just high density words. you lose human warmth but gain 40% cost savings

topaz0 · 2h ago
I have a way that is (all but) free -- just watch the video if you care about it, or decide not to if you don't, and move on with your life.
jasonjmcghee · 2h ago
Heads up, the token cost breakdown tables look white on white to me. I'm in dark mode on iOS using Brave.
georgemandis · 2h ago
Should be fixed now. Thank you!
mcc1ane · 2h ago
Longer*
stogot · 1h ago
Love this idea, but the accuracy section is lacking. Couldn't you do a simple diff of the outputs and see how many differences there are? 0.5% or 5%?
georgemandis · 1h ago
Yeah, I'd like to do a more formal analysis of the outputs if I can carve out the time.

I don't think a simple diff is the way to go, at least for what I'm interested in. What I care about more is the overall accuracy of the summary—not the word-for-word transcription.

The test I want to set up is using LLMs to evaluate the summarized output and see if the primary themes/topics persist. That's more interesting and useful to me for this exercise.

ada1981 · 3h ago
We discovered this last month.

There is also probably a way to send a smaller sample of audio at different speeds and compare them to find a speed optimization, unique to each clip, with no quality loss.

moralestapia · 2h ago
>We discovered this last month.

Nice. Any blog post, twitter comment or anything pointing to that?

babuloseo · 1h ago
source?