
OpenAI Charges by the Minute, So Make the Minutes Shorter

June 24th, 2025 * ~2,000 words * 9 minute read

Want to make OpenAI transcriptions faster and cheaper? Just speed up
your audio.

I mean that very literally. Run your audio through ffmpeg at 2x or 3x
before transcribing it. You'll spend fewer tokens and less time
waiting with almost no drop in transcription quality.

That's it!

Here's a script combining all of my favorite little toys and tricks
to get the job done. You'll need yt-dlp, ffmpeg and llm installed.

# Extract the audio from the video
yt-dlp -f 'bestaudio[ext=m4a]' --extract-audio --audio-format m4a -o 'video-audio.m4a' "https://www.youtube.com/watch?v=LCEmiRjPEtQ" -k;

# Create a low-bitrate MP3 version at 3x speed
ffmpeg -i "video-audio.m4a" -filter:a "atempo=3.0" -ac 1 -b:a 64k video-audio-3x.mp3;

# Send it along to OpenAI for a transcription
curl --request POST \
 --url https://api.openai.com/v1/audio/transcriptions \
 --header "Authorization: Bearer $OPENAI_API_KEY" \
 --header 'Content-Type: multipart/form-data' \
 --form file=@video-audio-3x.mp3 \
 --form model=gpt-4o-transcribe > video-transcript.txt;

# Get a nice little summary
cat video-transcript.txt | llm --system "Summarize the main points of this talk."

I just saved you time by jumping straight to the point, but read on
if you want more of a story about how I accidentally discovered this
while trying to summarize a 40-minute talk from Andrej Karpathy.

Also read on if you're wondering why I didn't just use the built-in
auto-transcription that YouTube provides, though the short answer
there is easy: I'm sort of a doofus and thought--incorrectly--it wasn't
available. So I did things the hard way.

I Just Wanted the TL;DW(atch)

A former colleague of mine sent me this talk from Andrej Karpathy
about how AI is changing software. I wasn't familiar with Andrej, but
saw he'd worked at Tesla. That, coupled with the talk being part of a
Y Combinator series and running 40 minutes, made me think "Ugh. Do
I... really want to watch this? Another 'AI is changing everything'
talk from the usual suspects, to the usual crowds?"

If ever there were a use-case for dumping something into an LLM to
get the gist of it and walk away, this felt like it. I respected the
person who sent it to me though and wanted to do the noble thing: use
AI to summarize the thing for me, blindly trust it and engage with
the person pretending I had watched it.

My first instinct was to pipe the transcript into an LLM and get the
gist of it. This script is the one I would previously reach for to
pull the auto-generated transcripts from YouTube:

yt-dlp --all-subs --skip-download \
 --sub-format ttml/vtt/best \
 [url]

For some reason though, no subtitles were downloaded. I kept running
into an error!

Later, after some head-scratching and rereading the documentation, I
realized my version (2025.04.03) was outdated.

Long story short: Updating to the latest version (2025.06.09) fixed
it, but for some reason I did not try this before going down a
totally different rabbit hole. I guess I got this little write-up and
exploration out of it though.
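
For the record, updating is a one-liner; which one depends on how you
installed yt-dlp (a couple of common possibilities):

# If you're using the standalone binary, it can update itself
yt-dlp -U

# If you installed it via pip or Homebrew instead
pip install -U yt-dlp
brew upgrade yt-dlp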

If you care more about summarizing transcripts and less about the
vagaries of audio-transcriptions and tokens, this is the correct
answer and your off-ramp.

My Transcription Workflow

I already had an old, home-brewed script that would extract the audio
from any video URL, pipe it through whisper locally and dump the
transcription in a text file.

That worked, but I was on dwindling battery power in a coffee shop.
Not ideal for longer, local inference, mighty as my M3 MacBook Air
still feels to me. I figured I would try offloading it to OpenAI's
API instead. Surely that would be faster?

Testing OpenAI's Transcription Tools

Okay, using the whisper-1 model it's still pretty slow, but it gets
the job done. Had I opted for the model I knew and moved on, the
story might end here.
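
For reference, the whisper-1 request hits the same endpoint as the
gpt-4o-transcribe call at the top of this post, just with a different
model parameter (a sketch; the file name is assumed, and note the API
caps uploads at 25 MB, so a low-bitrate MP3 is the safer bet):

curl --request POST \
 --url https://api.openai.com/v1/audio/transcriptions \
 --header "Authorization: Bearer $OPENAI_API_KEY" \
 --header 'Content-Type: multipart/form-data' \
 --form file=@video-audio.mp3 \
 --form model=whisper-1 > video-transcript.txt;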

However, out of curiosity, I went straight for the newer
gpt-4o-transcribe model first. It's built to handle multimodal inputs
and promises faster responses.

I quickly hit another roadblock: there's a 25-minute audio limit and
my audio was nearly 40 minutes long.

Let's Try Something Obvious

At first I thought about trimming the audio to fit somehow, but there
wasn't an obvious 14 minutes to cut. Trimming the beginning and end
would give me a minute or so at most.
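
For what it's worth, the trim itself would have been a one-liner; a
quick ffmpeg sketch that keeps only the first 25 minutes (untried
here, file names assumed):

# Keep the first 1,500 seconds (25 minutes) without re-encoding
ffmpeg -i video-audio.m4a -t 1500 -c copy video-audio-trimmed.m4a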

An interesting, weird idea I thought about for a second but never
tried was cutting a chunk or two out of the middle. Maybe I would
somehow still have enough info for a relevant summary?

Then it crossed my mind--what if I just sped up the audio before
sending it over? People listen to podcasts at accelerated 1-2x speeds
all the time.

So I wrote a quick script:

ffmpeg -i video-audio.m4a -filter:a "atempo=2.0" -ac 1 -b:a 64k video-audio-2x.mp3

Ta-da! Now I had something closer to a 20-minute file to send to
OpenAI.

I uploaded it and... it worked like a charm! Behold the summary
bestowed upon me that gave me enough confidence to reply to my
colleague as though I had watched it.

But there was something... interesting here. Did I just stumble
across a sort of obvious, straightforward hack? Is everyone in the
audio-transcription business already doing this and am I just
haphazardly bumbling into their secrets?

I had to dig deeper.

Why This Works: Our Brains Forgive, and So Does AI

There's an interesting parallel here in my mind with optimizing
images. Traditionally you have lossy and lossless file formats. A
lossy file format kind of gives away the game in its description--the
further you crunch and compact the bytes, the more fidelity you're
going to lose. It works because the human brain just isn't likely to
pick up on the artifacts and imperfections.

But even with a "lossless" file format there are tricks you can lean
into that rely on the limits of human perception. One of the primary
ways you can do that with a PNG or GIF is reducing the number of
unique colors in the palette. You'd be surprised by how often a
palette of 64 colors or fewer is actually enough, while still being
perceived as significantly more.
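
That trick is a one-liner too, for what it's worth; a sketch using
ImageMagick 7's magick command (file names are illustrative):

# Quantize a PNG down to a 64-color palette
magick photo.png -colors 64 photo-64.png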

There's also a parallel in my head between this and the brain's
ability to still comprehend text with spelling mistakes, dropped
words and other errors, i.e. transposed letter effects. Our brains
have a knack for filling in the gaps, and when you go looking through
the world with a magnifying glass you'll start to notice lots of them.

Speeding up the audio starts to drop the more subtle sounds and
occasionally shorter words from the audio, but it doesn't seem to
hurt my ability to comprehend what I'm hearing--even if I do have to
focus. These audio transcription models seem to be pretty good at
this as well.

Wait--how far can I push this? Does It Actually Save Money?

Turns out yes. OpenAI charges for transcription based on audio
tokens, which scale with the duration of the input. Faster audio =
fewer seconds = fewer tokens.

Here are some rounded numbers based on the 40-minute audio file
breaking down the audio input and text output token costs:

Speed           Duration (seconds)   Audio Input Tokens   Input Token Cost   Output Token Cost
1x (original)   2,372                N/A (too long)       N/A                N/A
2x              1,186                11,856               $0.07              $0.02
3x              791                  7,904                $0.04              $0.02
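
Dividing tokens by seconds, both versions work out to roughly 10
audio input tokens per second, so the input tokens fall in direct
proportion to the duration:

# Audio input tokens per second (2x and 3x versions)
11_856 / 1_186 = ~10
7_904 / 791 = ~10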

That's a solid 33% price reduction on input tokens at 3x! However,
the bulk of your costs for these transcription models is still going
to be the output tokens. Those are priced at $10 per 1M tokens,
whereas audio input tokens are priced at $6 per 1M tokens as of the
time of this writing.

Also interesting to note--my output tokens for the 2x and 3x versions
were exactly the same: 2,048. This kind of makes sense, I think? To
the extent the output tokens are a reflection of that model's ability
to understand and summarize the input, my takeaway is that a
"summarized" (i.e. reduced-token) version of the same audio yields
the same amount of comprehensibility.

This is also probably a reflection of the 4,096-token ceiling on
transcriptions generally when using the gpt-4o-transcribe model. I
suspect half the context window is reserved for the output tokens and
this is basically reflecting our request using it up in its entirety.
I suspect we might get diminishing results with longer
transcriptions.

But back to money.

So the back-of-the-envelope calculation for a single transcription
looks something like this:

6 * (audio_input_tokens / 1_000_000) + 10 * (text_output_tokens / 1_000_000);

That does not quite seem to jibe with the estimated cost of $0.006
per minute stated on the pricing page, at least for the 2x speed.
That version (19-20 minutes) seemed to cost about $0.09, whereas the
3x version (13 minutes) cost about $0.07 (pretty close to the
estimate, actually), if I'm adding up the tokens correctly.

# Pricing for 2x speed
6 * (11_856 / 1_000_000) + 10 * (2_048 / 1_000_000) = 0.09

# Pricing for 3x speed
6 * (7_904 / 1_000_000) + 10 * (2_048 / 1_000_000) = 0.07
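
And here's what the flat $0.006-per-minute estimate would predict for
those same files, with my rounding:

# Estimate at $0.006 per minute
0.006 * 20 = 0.12 (2x version, ~20 minutes; actual ~$0.09)
0.006 * 13 = 0.08 (3x version, ~13 minutes; actual ~$0.07)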

It would seem that estimate isn't just based on the length of the
audio but also some assumptions around how many tokens per minute are
going to be generated from a normal speaking cadence.

That's... kind of fascinating! I wonder how John Moschitta feels
about this.

Comparing these costs to whisper-1 is easy because the pricing table
more confidently advertises the cost--not "estimated" cost--as a flat
$0.006 per minute. I'm assuming that's a minute of audio processed,
not a minute of inference.

The gpt-4o-transcribe model actually compares pretty favorably.

Speed   Duration (seconds)   Cost
1x      2,372                $0.24
2x      1,186                $0.12
3x      791                  $0.08
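
Those numbers are just the flat rate multiplied by minutes of audio:

# whisper-1 pricing: $0.006 per minute of audio
0.006 * (2_372 / 60) = 0.24
0.006 * (1_186 / 60) = 0.12
0.006 * (791 / 60) = 0.08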

Does This Save Money?

In short, yes! It's not particularly rigorous, but it seems like we
reduced the cost of transcribing our 40-minute audio file by roughly
23%, from about $0.09 at 2x to $0.07 at 3x, simply by speeding up the
audio.

If we could compare to a 1x version of the audio file trimmed to the
25-minute limit, I bet we could paint an even more impressive picture
of cost reduction. We kind of can with the whisper-1 chart. You could
make the case this technique reduced costs by 67%!

Is It Accurate?

I don't know--I didn't watch it, lol. That was the whole point. And if
that answer makes you uncomfortable, buckle up for this future we're
hurtling toward. Boy, howdy.

More helpfully, I didn't compare word-for-word, but spot checks on
the 2x and 3x versions looked solid. 4x speed was too fast--the
transcription started getting hilariously weird. So, 2x and 3x seem
to be the sweet spot between efficiency and fidelity, though it will
obviously depend on how fast the people are speaking in the first
place.

Why Not 4x?

When I pushed it to 4x the results became comically unusable.
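
The command mirrors the earlier ones with atempo bumped up to 4.0 (a
sketch, assuming the same file names as before):

ffmpeg -i video-audio.m4a -filter:a "atempo=4.0" -ac 1 -b:a 64k video-audio-4x.mp3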

[Screenshot: output of the 4x transcription, mostly repeating "And
how do we talk about that?" over and over again]

That sure didn't stop my call to summarize from trying though.

Hey, not the worst talk I've been to!

In Summary

In short: to save time and money, consider doubling or tripling the
speed of the audio you want to transcribe. The trade-off, as always,
is fidelity, but the savings aren't insignificant.

Simple, fast, and surprisingly effective.

TL;DR

 * OpenAI charges for transcriptions based on audio duration
   (whisper-1) or tokens (gpt-4o-transcribe).
 * You can speed up audio with ffmpeg before uploading to save time
   and money.
 * This reduces audio tokens (or duration), lowering your bill.
 * 2x or 3x speed works well.
 * 4x speed? Probably too much--but fun to try.

If you find problems with my math, have questions, or have found a
more rigorous study comparing transcription quality at different
playback speeds, please get in touch! Or if you thought this was so
cool you want to hire me for something fun...
