[HN Gopher] OpenAI charges by the minute, so speed up your audio
___________________________________________________________________
OpenAI charges by the minute, so speed up your audio
Author : georgemandis
Score : 691 points
Date : 2025-06-25 13:17 UTC (1 day ago)
web link (george.mand.is)
w3m dump (george.mand.is)
| georgemandis wrote: | |
| I was trying to summarize a 40-minute talk with OpenAI's
| transcription API, but it was too long. So I sped it up with
| ffmpeg to fit within the 25-minute cap. It worked quite well (up
| to 3x speed) and was cheaper and faster, so I wrote about it.
| | |
| Felt like a fun trick worth sharing. There's a full script and | |
| cost breakdown. | |
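|
| The core of it is a minimal sketch like this (assuming the
| gpt-4o-transcribe model, an OPENAI_API_KEY in the environment,
| and placeholder filenames):
|
|   # Speed the audio up 3x; atempo accepts 0.5-100 in recent ffmpeg
|   ffmpeg -i talk.m4a -filter:a "atempo=3.0" -ac 1 -b:a 64k talk-3x.m4a
|
|   # Send the now-shorter file to the transcription endpoint
|   curl https://api.openai.com/v1/audio/transcriptions \
|     -H "Authorization: Bearer $OPENAI_API_KEY" \
|     -F file=@talk-3x.m4a \
|     -F model=gpt-4o-transcribe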
| bravesoul2 wrote: | |
| You could have kept quiet and started a cheaper-than-OpenAI
| transcription business :)
| behnamoh wrote: | |
| Sure, but now the world is a better place because he shared | |
| something useful! | |
| 4b11b4 wrote: | |
| Pre-processing the audio is still a valid business; multiple
| types of pre-processing might be viable.
| hn8726 wrote: | |
| Or OpenAI will do it themselves for transcription tasks
| ilyakaminsky wrote: | |
| I've already done that [1]. A fraction of the price, 24-hour | |
| limit per file, and speedup tricks like the OP's are welcome. | |
| :) | |
| | |
| [1] https://speechischeap.com | |
| bravesoul2 wrote: | |
| Nice. Don't expect you to spill the beans, but is it doing
| OK (some customers?)
|
| Just wondering if I can build a retirement out of APIs :)
| ilyakaminsky wrote: | |
| It's sustainable, but not enough to retire on at this | |
| point. | |
| | |
| > Just wondering if I can build a retirement out of APIs
| :)
| | |
| I think it's possible, but you need to find a way to add | |
| value beyond the commodity itself (e.g., audio | |
| classification and speaker diarization in my case). | |
| ada1981 wrote: | |
| We discovered this last month. | |
| | |
| There is also probably a way to send smaller samples of the
| audio at different speeds and compare them, to get a speed
| optimization with no quality loss unique to each clip.
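|
| Roughly, as an untested sketch (same endpoint as in the post;
| the sample length, speeds, and filenames are made up):
|
|   ffmpeg -y -i talk.m4a -t 60 sample.m4a   # grab a 60s sample
|   for speed in 1.5 2 2.5 3; do
|     ffmpeg -y -i sample.m4a -filter:a "atempo=$speed" "sample-$speed.m4a"
|     curl -s https://api.openai.com/v1/audio/transcriptions \
|       -H "Authorization: Bearer $OPENAI_API_KEY" \
|       -F file="@sample-$speed.m4a" -F model=gpt-4o-transcribe \
|       -F response_format=text > "transcript-$speed.txt"
|   done
|   wc -w transcript-*.txt   # a sharp drop in word count = too fast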
| moralestapia wrote: | |
| >We discovered this last month. | |
| | |
| Nice. Any blog post, Twitter comment or anything pointing to
| that?
| babuloseo wrote: | |
| source? | |
| brendanfinan wrote: | |
| would this also work for my video consisting of 10,000 PDFs? | |
| | |
| https://news.ycombinator.com/item?id=44125598 | |
| jasonjmcghee wrote: | |
| I can't tell if this is a meme or not. | |
| | |
| And if someone had this idea and pitched it to Claude (the | |
| model this project was vibe coded with) it would be like "what | |
| a great idea!" | |
| raincole wrote: | |
| Geez, that repo[0] has 8k stars on GitHub?
|
| Are people just starring it for meme value or something? Is this
| a scam?
| | |
| [0]: https://github.com/Olow304/memvid | |
| mcc1ane wrote: | |
| Longer* | |
| canyp wrote: | |
| Came here just for this. | |
| simonw wrote: | |
| There was a similar trick which worked with Gemini versions prior | |
| to Gemini 2.0: they charged a flat rate of 258 tokens for an | |
| image, and it turns out you could fit more than 258 tokens of | |
| text in an image of text and use that for a discount! | |
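|
| The rendering step needs nothing fancier than something like
| this (an illustrative ImageMagick command; the size is
| arbitrary):
|
|   # Draw the contents of text.txt into a PNG, word-wrapped
|   magick -background white -fill black -size 1024x1024 \
|     caption:@text.txt text.png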
| Graziano_M wrote: | |
| Well a picture is worth a thousand tokens. | |
| heeton wrote: | |
| A point on skimming vs taking the time to read something | |
| properly. | |
| | |
| I read a transcript + summary of that exact talk. I thought it
| was fine but uninteresting, and I moved on.
| | |
| Later I saw it had been put on YouTube and I was on the train, so
| I watched the whole thing at normal speed. I had a huge number of | |
| different ideas, thoughts and decisions, sparked by watching the | |
| whole thing. | |
| | |
| This happens to me in other areas too. Watching a conference talk | |
| in person is far more useful to me than watching it online with | |
| other distractions. Watching it online is more useful again than | |
| reading a summary. | |
| | |
| Going for a walk to think about something deeply beats a 10 | |
| minute session to "solve" the problem and forget it. | |
| | |
| Slower is usually better for thinking. | |
| pluc wrote: | |
| Seriously, this is bonkers to me. I, like many hackers, hated
| school because they just threw one-size-fits-all knowledge at
| you, and here we are, paying for the privilege of having that in
| every facet of our lives.
| | |
| Reading is a pleasure. Watching a lecture or a talk and feeling | |
| the pieces fall into place is great. Having your brain work out | |
| the meaning of things is surely something that defines us as a | |
| species. We're willingly heading toward such stupidity; I don't
| get it. I don't get how we can all be so blind to what this is
| going to create.
| hooverd wrote: | |
| If you're not listening to summaries of different audiobooks | |
| at 2x speed in each ear you're not contentmaxing. | |
| lovestory wrote: | |
| Or just use NotebookLM to convert your books into hour-long
| podcasts /s
| 0cf8612b2e1e wrote: | |
| I am genuinely curious how well this would go. There are | |
| so many books I "should" read, but will never get around | |
| to doing it. A one hour podcast would be more engaging | |
| than reading a Wikipedia summary. | |
| | |
| On the gripping hand, there are probably already | |
| excellent 10/30/60 minute book summaries on YouTube or | |
| wherever which are not going to hallucinate plot points. | |
| LanceH wrote: | |
| Read the title and go. | |
| isaacremuant wrote: | |
| > We're willingly heading for such stupidity, I don't get it. | |
| I don't get how we can all be so blind at what this is going | |
| to create. | |
| | |
| Your doomerism and superiority don't follow from your
| initial "I, like many hackers, don't like one-size-fits-all."
|
| This is literally offering you MANY sizes, and you have the
| freedom to choose. Somehow you're pretending it's imposed
| uniformity.
| | |
| Consume it however you want and come up with actual | |
| criticisms next time? | |
| colechristensen wrote: | |
| University didn't agree with me mostly because I can't pay | |
| attention to the average lecturer. Getting bored in between | |
| words or while waiting for them to write means I absorbed | |
| very little and had to teach myself nearly everything. | |
| | |
| Audiobooks before speed tools were the worst (are they
| _trying_ to speak extra slowly?). But when I can speed things
| up, comprehension is just fine.
| parpfish wrote: | |
| The worst part about talks/lectures is that once you lose | |
| the thread, the rest is meaningless. If my mind wanders a | |
| bit 5 minutes into an hour-long talk, the rest of that
| hour is a lost cause.
| bisby wrote: | |
| > I, like many hackers, hated school because they just threw | |
| one-size-fits-all knowledge at you | |
| | |
| "This specific knowledge format doesnt work for me, so I'm | |
| asking OpenAI to convert this knowledge into a format that is | |
| easier for me to digest" is exactly what this is about. | |
| | |
| I'm not quite sure what you're upset about? Unless you're | |
| referring to "one size fits all knowledge" as simplified | |
| topics, so you can tackle things at a surface level? I love | |
| having surface-level knowledge about a LOT of things. I
| certainly don't have time to go deep on every topic out
| there. But if this is a topic I find I am interested in, the
| full talk is still available. | |
| | |
| Breadth and depth are both important, and well summarized | |
| talks are important for breadth, but not helpful at all for | |
| depth, and that's ok. | |
| zahlman wrote: | |
| > I, like many hackers, hated school because they just threw | |
| one-size-fits-all knowledge at you and here we are, paying | |
| for the privilege to have that in every facet of our lives. | |
| | |
| But now we get to browse the knowledge rather than having it | |
| thrown at us. That's more important than the quality or | |
| formatting of the content. | |
| itake wrote: | |
| > I don't get how we can all be so blind at what this is | |
| going to create. | |
| | |
| There is too much information. People are trying to optimize
| breadth over depth, but obviously there are costs to this.
| georgemandis wrote: | |
| For what it's worth, I completely agree with you, for all the | |
| reasons you're saying. With talks in particular I think it's | |
| seldom about the raw content and ideas presented and more about | |
| the ancillary ideas they provoke and inspire, like you're | |
| describing. | |
| | |
| There is just _so_ much content out there. And context is | |
| everything. If the person sharing it had led with some specific | |
| ideas or thoughts I might have taken the time to watch and | |
| looked for those ideas. But in the context it was received--a | |
| quick link with no additional context--I really just wanted the | |
| "gist" to know what I was even potentially responding to. | |
| | |
| In this case, for me, it was worth it. I can go back and decide | |
| if I want to watch it. Your comment has intrigued me so I very | |
| well might! | |
| | |
| ++ to "Slower is usually better for thinking" | |
| mutagen wrote: | |
| Not to discount slower speeds for thinking, but I wonder if
| there is also value in dipping into a talk or a subject and
| then revisiting (re-watching) it with the time to ponder the
| thoughts a little more deeply.
| tass wrote: | |
| This is similar to strategies in "How to Read a Book"
| (Adler).
| | |
| By understanding the outline and themes of a book (or | |
| lecture, I suppose), it makes it easier to piece together | |
| thoughts as you delve deeper into the full content. | |
| conradev wrote: | |
| Was it the speed or the additional information vended by the | |
| audio and video? If someone is a compelling speaker, the same | |
| message will be way more effective in an audiovisual format. | |
| The audio has emphasis on certain parts of the content, for | |
| example, which is missing from the transcript or summary | |
| entirely. Video has gestural and facial cues, also often | |
| utilized to make a point. | |
| bongodongobob wrote: | |
| You'd love where I work. Everything is needlessly long,
| bloviating PowerPoint meetings that could easily be ingested
| in a 5-minute email.
| itsoktocry wrote: | |
| > _Slower is usually better for thinking._ | |
| | |
| Yeah, I see people talking about listening to podcasts or | |
| audiobooks on 2x or 3x. | |
| | |
| Sometimes I set mine to 0.8x. I find you get time to absorb and | |
| think. Am I an outlier? | |
| LanceH wrote: | |
| Depends on what you're listening to. If it's a recap of | |
| something and you're just looking for the answer to "what | |
| happened?", that can be fine for 2x. If you're getting into | |
| the "why?" maybe slower is better. Or if there are a lot of | |
| players involved. | |
| | |
| I'm trying to imagine listening to War and Peace faster. On | |
| the one hand, there are a lot of threads and people to keep | |
| track of (I had a notepad of who is who). On the other hand, | |
| having the stories compressed in time might help remember | |
| what was going on with a character when finally returning to | |
| them. | |
| | |
| Listening to something like Dune quickly, someone might come | |
| out only thinking of the main political thrusts, and the | |
| action, without building that same world in their mind they | |
| would if read slower. | |
| b0a04gl wrote: | |
| It's still decoding every frame and matching phonemes either
| way, but speeding it up reduces how many seconds they bill you
| for. So you may be hacking their billing logic more than the
| model itself.
|
| It also means the longer you talk, the more you pay, even if
| the actual info density is the same. So if your voice has
| longer pauses or you speak slowly, you may be subsidizing
| inefficiency.
|
| Makes me think maybe the next big compression is in delivery
| cadence: just auto-optimize voice tone and pacing before
| sending it to the LLM. Feed it synthetic fast speech with no
| emotion, just high-density words. You lose human warmth but
| gain 40% cost savings.
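|
| In ffmpeg terms that could be as blunt as chaining the two
| filters (a sketch; the thresholds are untested guesses):
|
|   # Strip pauses, then compress what's left to 2x
|   ffmpeg -i input.m4a \
|     -af "silenceremove=start_periods=1:stop_periods=-1:stop_duration=0.2:stop_threshold=-45dB,atempo=2.0" \
|     output-dense.m4a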
| timerol wrote: | |
| > Is It Accurate? | |
| | |
| > I don't know--I didn't watch it, lol. That was the whole point.
| And if that answer makes you uncomfortable, buckle up for this
| future we're hurtling toward. Boy, howdy.
| | |
| This is a great bit of work, and the author accurately summarizes | |
| my discomfort | |
| BHSPitMonkey wrote: | |
| As if human-generated transcriptions of audio ever came with | |
| guarantees of accuracy? | |
| | |
| This kind of transformation has always come with flaws, and I | |
| think that will continue to be expected implicitly. Far more | |
| worrying is the public's trust in _interpretations_ and claims | |
| of _fact_ produced by gen AI services, or at least the popular | |
| idea that "AI" is more trustworthy/unbiased than humans, | |
| journalists, experts, etc. | |
| angst wrote: | |
| At least with human-generated transcriptions there are
| entities that we can hold responsible...
| _kb wrote: | |
| That still holds true for gen-AI. Organisations that | |
| provide transcription services can't offload responsibility | |
| to a language model any more than they can to steno | |
| keyboard manufacturers. | |
| | |
| If you are the one feeding content to a model then you are | |
| that responsible entity. | |
| raincole wrote: | |
| A lot of people read the newspaper.
|
| A newspaper is essentially just an inaccurate summary of what
| really happened. So I don't find this realization that
| uncomfortable.
| dmix wrote: | |
| That's why I find the idea of training on breaking news from
| Reddit or Twitter funny: wild exaggerations and targeted spin
| are the sort of stuff that does best on those sites and
| generates the most comments. 50% of the output would be lies.
| jasonjmcghee wrote: | |
| Heads up, the token cost breakdown tables look white on white to | |
| me. I'm in dark mode on iOS using Brave. | |
| georgemandis wrote: | |
| Should be fixed now. Thank you! | |
| w-m wrote: | |
| By transcribing a talk by Andrej, you already picked the most
| challenging case possible, speed-wise. His natural talking speed
| is already >=1.5x that of a normal human. He's one of the people
| you absolutely have to set your YouTube speed back down to 1x
| for, just to follow what's going on.
| | |
| In the spirit of making more of an OpenAI minute, don't send it
| any silence.
| | |
| E.g.:
|
|   ffmpeg -i video-audio.m4a \
|     -af "silenceremove=start_periods=1:start_duration=0:start_threshold=-50dB:stop_periods=-1:stop_duration=0.02:stop_threshold=-50dB,apad=pad_dur=0.02" \
|     -c:a aac -b:a 128k output_minpause.m4a -y
| | |
| will cut the talk down from 39m31s to 31m34s, by replacing any
| silence longer than 20ms (with a -50dB threshold) with a 20ms
| pause. And in keeping with the spirit of your post, I only
| measured that the input file got shorter; I didn't look at the
| quality of the transcription of the shorter version at all.
| georgemandis wrote: | |
| Oooh fun! I had a feeling there was more ffmpeg wizardry I | |
| could be leaning into here. I'll have to try this later--thanks | |
| for the idea! | |
| w-m wrote: | |
| In the meantime I realized that the apad part is nonsensical | |
| - it pads the end of the stream, not at each silence-removed | |
| cut. I wanted to get angry at o3 for proposing this, but then | |
| I had a look at the silenceremove= documentation myself: | |
| https://ffmpeg.org/ffmpeg-filters.html#silenceremove | |
| | |
| Good god. You couldn't make that any more convoluted and | |
| hard-to-grasp if you wanted to. You gotta love ffmpeg! | |
| | |
| I now _think_ this might be a good solution: | |
|   ffmpeg -i video-audio.m4a \
|     -af "silenceremove=start_periods=1:stop_periods=-1:stop_duration=0.15:stop_threshold=-40dB:detection=rms" \
|     -c:a aac -b:a 128k output.m4a -y
| snickerdoodle12 wrote: | |
| I love ffmpeg but the documentation is often close to | |
| incomprehensible. | |
| squigz wrote: | |
| Out of curiosity, how might you improve those docs? They | |
| seem fairly reasonable to me | |
| w-m wrote: | |
| The documentation reads like it was written by a | |
| programmer who documented the different parameters to | |
| their implementation of a specific algorithm. Now when | |
| you as the user come along and want to use silenceremove, | |
| you'll have to carefully read through this, and build | |
| your own mental model of that algorithm, and then you'll | |
| be able to set these parameters accordingly. That takes a | |
| lot of time and energy, in this case multiple read- | |
| throughs and I'd say > 5 minutes. | |
| | |
| Good documentation should do this work for you. It should | |
| explain somewhat atomic concepts to you, that you can | |
| immediately adapt, and compose. Where it already works is | |
| for the "detection" and "window" parameters, which are | |
| straightforward. But the actions of trimming in the | |
| start/middle/end, and how to configure how long the | |
| silence lasts before trimming, whether to ignore short | |
| bursts of noise, whether to skip every nth silence | |
| period, these are all ideas and concepts that get mushed | |
| together in 10 parameters which are called start/stop- | |
| duration/threshold/silence/mode/periods. | |
| | |
| If you want to apply this filter, it takes a long time to | |
| build mental models for these 10 parameters. You do have | |
| some example calls, which is great, but which doesn't | |
| help if you need to adjust any of these - then you | |
| probably need to understand them all. | |
| | |
| Some stuff I stumbled over when reading it: | |
| | |
| "To remove silence from the middle of a file, specify a | |
| stop_periods that is negative. This value is then treated | |
| as a positive value [...]" - what? Why is this parameter | |
| so heavily overloaded? | |
| | |
| "start_duration: Specify the amount of time that non- | |
| silence must be detected before it stops trimming audio" | |
| - parameter is named start_something, but it's about | |
| stopping? Why? | |
| | |
| "start_periods: [...] Normally, [...] start_periods will | |
| be 1 [...]. Default value is 0." | |
| | |
| "start_mode: Specify mode of detection of silence end at | |
| start": start_mode end at start? | |
| | |
| It's very clunky. Every parameter has multiple modes of | |
| operation. Why is it start and stop for beginning and | |
| end, and why is "do stuff in the middle" part of the end? | |
| Why is there no global mode? | |
| | |
| You could nitpick this stuff to death. In the end, naming | |
| things is famously one of the two hard problems in | |
| computer science (the others being cache invalidation and | |
| off-by-one errors). And writing good documentation is | |
| also very, very hard work. Just exposing the internals of | |
| the algorithm is often not great UX, because then every | |
| user has to learn how the thing works internally before | |
| they can start using it (hey, looking at you, git). | |
| | |
| So while it's easy to point out where these docs fail, it | |
| would be a lot of work to rewrite this documentation from | |
| the top down, explaining the concepts first. Or even | |
| rewriting the interface to make this more approachable, | |
| and the parameters less overloaded. But since it's hard | |
| work, and not sexy to programmers, it won't get done, and | |
| many people will come after, having to spend time on | |
| reading and re-reading this current mess. | |
| phito wrote: | |
| > naming things is famously one of the two hard problems | |
| in computer science | |
| | |
| Isn't ffmpeg made by a French person? As a francophone
| myself, I can tell you one of the biggest weaknesses of
| francophone programmers is naming things; it's even worse when
| it's in English. Maybe that's what's at play here.
| ada1981 wrote: | |
| Curious if this is helpful. | |
| | |
| https://claude.ai/public/artifacts/96ea8227-48c3-484d-b30 | |
| b-6... | |
| | |
| I had Claude rewrite the documentation for silenceremove | |
| based on your feedback. | |
| zahlman wrote: | |
| > "start_mode: Specify mode of detection of silence end | |
| at start": start_mode end at start? | |
| | |
| In "start_mode", "start" means "initial", and "mode" | |
| means "method". But specifically, it's a method of | |
| figuring out where the silence ends. | |
| | |
| > In the end, naming things is famously one of the two | |
| hard problems in computer science | |
| | |
| It's also one of the hard problems in English. | |
| dylan604 wrote: | |
| If you did it in 2 passes, you could find the cut points
| using silencedetect, use a bunch of -ss/-t/-i based on
| those segments, and apad each segment with a -filter_complex
| chain that ends in concat. It would be a wonderfully
| gnarly command for very little benefit, but it could be
| done.
| pragmatic wrote: | |
| No not really? The talk where he babbles about OSes and | |
| everyone is somehow impressed? | |
| behnamoh wrote: | |
| > His natural talking speed is already >=1.5x that of a normal
| human. He's one of the people you absolutely have to set your
| YouTube speed back down to 1x for, just to follow what's
| going on.
| | |
| I wonder if there's a way to automatically detect how "fast" a
| person talks in an audio file. I know it's subjective and
| different people talk at different paces within an audio, but
| it'd be cool to know when OP's trick fails (they mention 4x
| ruined the output; maybe for Karpathy that would happen at 2x).
| echelon wrote: | |
| > I wonder if there's a way to automatically detect how | |
| "fast" a person talks in an audio file. | |
| | |
| Stupid heuristic: take a segment of video, transcribe text, | |
| count number of words per utterance duration. If you need | |
| speaker diarization, handle speaker utterance durations | |
| independently. You can further slice, such as syllable count, | |
| etc. | |
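|
| With the open-source whisper CLI, jq, and ffprobe installed,
| a crude version of that heuristic (the filename is a
| placeholder):
|
|   whisper sample.m4a --model tiny --output_format json --output_dir .
|   # word count of the transcript
|   jq '.text | split(" ") | length' sample.json
|   # duration of the audio in seconds
|   ffprobe -v error -show_entries format=duration -of csv=p=0 sample.m4a
|
| Dividing the first number by the second gives words per
| second; an unusually high rate suggests a smaller speed-up.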
| nand4011 wrote: | |
| https://www.science.org/doi/10.1126/sciadv.aaw2594 | |
| | |
| Apparently human language conveys information at around 39
| bits/s. You could use a technique similar to that paper's to
| determine the information rate of a speaker and then
| normalize it to 39 bits/s by changing the speed of the video.
| varispeed wrote: | |
| It's a shame platforms don't generally support speeds greater
| than 2x. One of my "superpowers" (or a curse) is that I cannot
| stand a normal speaking pace. When I watch lectures, I always
| go for maximum speed, and even that is too slow for me. I
| wish platforms would include 4x, but done properly (with
| minimal artefacts).
| lofaszvanitt wrote: | |
| Robot in a human body identified :D. | |
| mrmuagi wrote: | |
| All audiobooks are like this for me. I tried it for | |
| lectures but if I'm taking handwritten notes, I can't keep | |
| up my writing. | |
| | |
| I wonder if there are negative side effects of this though:
| do you notice that interacting with people who speak slower
| requires a greater deal of patience?
| colechristensen wrote: | |
| No but a little. I struggle with people who repeat every | |
| point of what they're saying to you several times or when | |
| you say "you told me exactly this the last time we spoke" | |
| they cannot be stopped from retelling the whole thing | |
| verbatim. Usually in those situations, though, there are some
| potential cognitive issues, so you can only be
| understanding.
| hamburglar wrote: | |
| I once attended a live talk by Leslie Lamport and as he | |
| talked, I had the overwhelming feeling that something was | |
| wrong, and was thinking "did he have a stroke or | |
| something?" but then I realized I had just always watched | |
| his lectures online and had become accustomed to | |
| listening to him at 2x. | |
| userbinator wrote: | |
| _I wonder if there are negative side effects of this
| though: do you notice that interacting with people who
| speak slower requires a greater deal of patience?_
| | |
| You are basically training your brain to work faster, and | |
| I suspect that causes some changes in the structure of | |
| your memory; if someone speaks too slowly, I'll be more | |
| likely to forget what they said earlier, compared to if | |
| they quickly gave me the entire sentence. | |
| dpcx wrote: | |
| https://github.com/codebicycle/videospeed has been a | |
| wonderful addition for me. | |
| narratives1 wrote: | |
| I use a Chrome extension that lets you take any video | |
| player (including embedded) to 10x speed. Turn most things | |
| to 3-4x. It works on ads too | |
| munch117 wrote: | |
| I use a bookmarklet: | |
| | |
|   javascript:void%20function(){document.querySelector(%22video,audio%22).playbackRate=parseFloat(prompt(%22Set%20the%20playback%20rate%22))}();
| cookingrobot wrote: | |
| There are fonts designed to be legible at really small
| sizes. I wonder if there are voices that are especially
| understandable at extreme speeds.
|
| Could use an "auctioneer" voice to play back text at 10x
| speed.
| bbatha wrote: | |
| I'm also a fast listener. I find audio quality is the | |
| main differentiator in my ability to listen quickly or | |
| not. A podcast recorded at high quality I can listen to | |
| at 3-4x (with silence trimmed) comfortably; the second
| someone calls in from their phone I'm getting every 4th
| word and often need to go down to 2x or less. Mumbly
| accents are also a driver of quality, though not as much;
| then again, I rarely have trouble understanding difficult
| accents IRL and almost never use subtitles on TV
| shows/YouTube to better understand the speaker. Your
| mileage may vary.
| | |
| I understand 4-6x speakers fairly well but don't enjoy | |
| listening at that pace. If I lose focus for a couple of | |
| seconds I effectively miss a paragraph of context and my | |
| brain can't fill in the missing details. | |
| seabass wrote: | |
| I made a super simplistic chrome extension for this. | |
| Doesn't work on all websites, but YouTube and most online | |
| video courses are covered. | |
| | |
| https://github.com/sebastiansandqvist/video-speed-extension | |
| JadeNB wrote: | |
| Can't you use VLC to watch almost anything streamable, and | |
| then play at your desired speed? | |
| ars wrote: | |
| I use this extension: https://mybrowseraddon.com/video- | |
| speed-control.html | |
| eitally wrote: | |
| Recently, YT started supporting 4x playback for Premium | |
| subscribers, but only in the mobile app, not on the web. | |
| btown wrote: | |
| Even a last-decade transcription model could be used to | |
| detect a rough number of syllables per unit time, and the | |
| accuracy of that model could be used to guide speed-up and | |
| dead-time detection before sending to a more expensive model. | |
| As with all things, it's a question of whether the cost | |
| savings justify the engineering work. | |
| janalsncm wrote: | |
| > I wonder if there's a way to automatically detect how | |
| "fast" a person talks in an audio file | |
| | |
| Transcribe it locally using whisper and output tokens/sec? | |
| maxall4 wrote: | |
| Just count syllables per second by doing an FFT plus some | |
| basic analysis. | |
| tucnak wrote: | |
| > FFT plus some basic analysis | |
| | |
| Yeah, totally easier than `len(transcribe(a))/len(a)` | |
| mrstone wrote: | |
| > I wonder if there's a way to automatically detect how | |
| "fast" a person talks in an audio file. | |
| | |
| Hilbert transform and FFT to get phoneme rate would work. | |
| WalterSear wrote: | |
| Better: just make everyone in the video speak at my | |
| comfortable speed. | |
| dTal wrote: | |
| Compress it using a VBR speech codec and measure the | |
| compression ratio? | |
| brunoborges wrote: | |
| The interesting thing here is that OpenAI likely has a layer | |
| that trims down videos exactly how you suggest, so they can | |
| still charge by the full length while costing less for them to | |
| actually process the content. | |
| cbsmith wrote: | |
| That's an amusing perspective. I really struggle with watching | |
| any video at double speed, but I've never had trouble listening | |
| to any of his talks at 1x. To me, he seems to speak at a | |
| perfectly reasonable pace. | |
| swyx wrote: | |
| > I didn't look at all at the quality of the transcription by | |
| feeding it the shorter version. | |
| | |
| guys how hard is it to toss both versions into like diffchecker
| or something haha you're just comparing text
| TimorousBestie wrote: | |
| Why use diffchecker when there's a perfectly good LLM you | |
| could ask right there? lol | |
| serf wrote: | |
| because a lot of LLMs will just eat tokens to call a | |
| diffchecker. | |
| | |
| really it becomes a question of whether the friction
| of invoking the command or the cost of tokens is greater.
|
| As I get older and more RSI'd, the tokens seem cheaper.
| trashchomper wrote: | |
| Assuming sarcasm but if not, because deterministic vs. | |
| nondeterministic output? | |
| TimorousBestie wrote: | |
| Not sarcasm, just a little joke. I thought the emote at | |
| the end would prevent it from being taken seriously. . . | |
| Der_Einzige wrote: | |
| Make it semi-deterministic with structured/constrained
| generation!
| QuantumGood wrote: | |
| I wish there was a 2.25x YouTube option for "normal" humans. I | |
| already use every shortcut, and listen at 2x 90% of the time. | |
| But Andrej I can't take faster than 1.25x | |
| zamadatix wrote: | |
| YouTube ran an experiment with up to 4x playback on mobile | |
| (???) but it went away in February. I get a lot of the | |
| experiments they do being experiments but why just allowing | |
| the slider to go farther is such a back and forth hoopla is | |
| beyond me. It's one of the oft touted features of 3rd party | |
| apps and extensions with nearly 0 UI impact to those who | |
| don't want to use it (just don't slide the slider past 2x if | |
| you don't want past 2x). | |
| | |
| https://www.theverge.com/news/603581/youtube-premium- | |
| experim... | |
| K2L8M11N2 wrote: | |
| As a premium subscriber I currently have 4x available on | |
| Android and they recently (in the last month) added it to | |
| web too | |
| zelphirkalt wrote: | |
| Probably because they are "A/B testing" things that do
| not really show much effect, or that depend on more
| circumstances than they care to eliminate, and then they
| overinterpret the results. Like almost all corporate A/B
| testing.
| ars wrote: | |
| Install this: https://mybrowseraddon.com/video-speed- | |
| control.html | |
| | |
| I listen to a lot of videos on 3 or even 4x. | |
| david_allison wrote: | |
| I have up to 4x (in steps of 0.05) with YouTube Premium on | |
| Android | |
| zahlman wrote: | |
| Meanwhile, I've found that just reading the transcript is | |
| often good enough. | |
| nickjj wrote: | |
| Andrej's talk seemed normal to listen at 2x but I've also | |
| listened to everything at 2x for a long time. | |
| | |
| Unfortunately, a byproduct of listening to everything at 2x is
| I've had a number of folks say they have to watch my videos at
| 0.75x, but even when I play back my own videos it feels
| painfully slow unless it's 2x.
| | |
| For reference I've always found John Carmack's pacing perfect / | |
| natural and watchable at 2x too. | |
| | |
| A recent video of mine is
| https://www.youtube.com/watch?v=pL-qft1ykek. It was posted on
| HN by someone else the other day so
| I'm not trying to do any self promotion here, it's just an | |
| example of a recent video I put up and am generally curious if | |
| anyone finds that too fast or it's normal. It's a regular | |
| unscripted video where I have a rough idea of what I want to | |
| cover and then turn on the mic, start recording and let it pan | |
| out organically. If I had to guess I'd say the last ~250-300 | |
| videos were recorded this way. | |
| noahjk wrote: | |
| To me you talk at what I would consider "1.2x" of podcast | |
| speed (which to me is a decent average measure of spoken word | |
| speed - I usually do 1.5x on all podcasts). You're definitely | |
| still in the normal distribution for tech YouTubers, in my | |
| experience - in fact it feels like a lot of tech YouTubers
| talk like they've had a bit too much Adderall, but you don't
| come off that way. Naturally people may choose to slow down | |
| tutorials, because the person giving the tutorial can never | |
| truly understand what someone learning would or wouldn't | |
| understand. So overall I think your speed is totally fine! | |
| Also, very timely video, I was interested in the exact topic, | |
| so I'm happy I found this. | |
| eru wrote: | |
| > "[I]n fact it feels like a lot of tech YouTube talks like | |
| they've had a bit too much adderall, [...]" | |
| | |
| Funnily enough, if you actually have ADHD, then stimulants
| like Adderall or even nicotine will calm you down.
| | |
| > Naturally people may choose to slow down tutorials, [...] | |
| | |
| For me it also depends on what mood I'm in and whether I'm | |
| doing anything else at the same time. If I'm fully | |
| concentrating on a video, 2x is often fine. If I'm doing | |
| some physical task at the same time, I need it slower than | |
| that. | |
| | |
| If I'm doing a mental task at the same time, I can forget
| about getting anything out of the video. At least, if the
| mental task involves any words. So e.g. I could probably still
| follow along with a technical discussion at roughly 1x speed
| while playing Tetris, but not while coding.
| Tyr42 wrote: | |
| Driving is a hard 1.0 for me. But otherwise 2.0 is good. | |
| SavioMak wrote: | |
| Yeah, you sound around 1.25-1.5x the speed of the average
| video I watch.
| viraptor wrote: | |
| > Andrej's talk seemed normal to listen at 2x but I've also | |
| listened to everything at 2x for a long time. | |
| | |
| We get used to higher speeds when we consume a lot of content | |
| that way. Have you heard the systems used by experienced | |
| blind people? I cannot even understand the words in them, but | |
| months of training would probably fix that. | |
| userbinator wrote: | |
| You can achieve a similar, less permanent effect by closing | |
| your eyes; I often do it when I'm on a call and the person | |
| on the other end is extremely difficult to understand. | |
| userbinator wrote: | |
| _but even when I play back my own videos it feels painfully
| slow unless it's 2x._
| | |
| Watching your video at 1x still feels too slow, and it's just | |
| right for me at 2x speed (that's approximately how fast I | |
| normally talk if others don't tell me to slow down), although | |
| my usual YouTube watching speed is closer to 2.5-3x. That is | |
| to say, you're still faster than a lot of others. | |
| | |
| I think it just takes practice --- I started at around 1.25x | |
| for videos, and slowly moved up from there. As you have | |
| noticed, once you've consumed enough sped-up content, your | |
| own speaking speed will also naturally increase. | |
| fuzztester wrote: | |
| James Goodnight of SAS Institute: | |
| | |
| https://en.m.wikipedia.org/wiki/James_Goodnight | |
| | |
| I have watched one or two videos of his, and he spoke slowly, | |
| compared to the average person. I liked that. It sounded | |
| good. | |
| makeitdouble wrote: | |
| Your video sounded a tad fast at 2x and pretty fine at 1.5. | |
| | |
| Now I think speed adjustment comes less from the natural
| speaking pace of the person than from the subject matter.
| | |
| I'm thinking of a channel like Accented Cinema | |
| (https://youtu.be/hfruMPONaYg), with a slowish talking pace, | |
| but as there's all the visual part going on at all times, it | |
| actually doesn't feel slow to my ear. | |
| | |
| I felt the same for videos explaining concepts I have no
| familiarity with, so I see it as how fast the brain can process
| the info, more than the talking speed per se.
| retsibsi wrote: | |
| Your speaking speed is noticeably faster than usual, but I | |
| think it's good for this kind of video. When the content is | |
| really dense and every word is chosen for maximum information | |
| value, a slower speed would be good, but for relatively | |
| natural speech with a normal amount of redundancy I think | |
| it's fine to go at this speed. | |
| quietbritishjim wrote: | |
| Your actual speed of talking sounds a little faster than | |
| average but not notably so. | |
| | |
| But it _feels_ (very subjectively) faster to me than usual
| because you don't really seem to take any pauses. It's like
| the whole video is a single run-on sentence that I keep | |
| buffering, but I never get a chance to process it and flush | |
| the buffer. | |
| fortran77 wrote: | |
| I always listen to YouTube and podcasts at 1.5. And when I | |
| meet a YouTuber/podcaster IRL, I'm always annoyed at how slow | |
| they speak. | |
| Der_Einzige wrote: | |
| This btw is also why spreading (speed reading) happens in | |
| American competitive debate. This gets ridiculed online but | |
| it's exactly why it happens. | |
| | |
| https://en.wikipedia.org/wiki/Spreading_(debate) | |
| hooverd wrote: | |
| They should put an upper WPM on competitive debate, like F1 | |
| does with certain car parts. | |
| jwrallie wrote: | |
| From my own experience with whisper.cpp, normalizing the audio
| and removing silence not only shortens the processing time
| significantly, but also greatly increases the quality of the
| transcription, as silence can mean hallucinations. You can do
| that graphically with Audacity too, if you do not want to deal
| with the command line. You also do not need any special
| hardware to run whisper.cpp: with the small model, literally
| any computer should be able to do it if you can wait a bit
| (less than the audio length).
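|
| For reference, the pipeline is something like this (the
| whisper.cpp binary name and model path vary by build and
| version):
|
|   # Normalize loudness, strip long silences, resample to the
|   # 16 kHz mono WAV whisper.cpp expects
|   ffmpeg -i meeting.m4a \
|     -af "loudnorm,silenceremove=start_periods=1:stop_periods=-1:stop_duration=0.3:stop_threshold=-40dB" \
|     -ar 16000 -ac 1 meeting-clean.wav
|
|   ./whisper-cli -m models/ggml-small.bin -f meeting-clean.wav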
| | |
| One half interesting / half depressing observation I made is | |
| that at my workplace any meeting recording I tried to | |
| transcribe in this way had its length reduced to almost 2/3 | |
| when cutting off the silence. Makes you think about the | |
| efficiency (or lack of it) of holding long(ish) meetings. | |
| d1sxeyes wrote: | |
| 1/3 of the meeting is silence? That's a good thing. It's
| allowing people time to think over what they're hearing;
| there are pauses to allow people to contribute or
| participate. What do you think a better percentage of silent
| time would be?
| jwrallie wrote: | |
| Good point, somehow if I think of a 30 minutes meeting, 10 | |
| minutes of silence sounds great, but seeing a 1 hour block | |
| disappear from a 3 hour recording makes me want to use that | |
| "free" hour to do something else. | |
| | |
| Well, I don't think silence is the real problem with a
| 3-hour meeting!
| literalAardvark wrote: | |
| If people could speak continuously for an entire meeting,
| then that meeting would be better off as an email.
| Meetings are for bouncing half-formed ideas around and
| coagulating them into something greater.
| | |
| There MUST be time to think | |
| sudhirj wrote: | |
| If a human meeting had a lot of silence (assuming it's between
| words and not before/after), I would consider it a very
| efficient meeting, where there was just enough information
| exchanged with adequate absorption, processing and response
| time.
| dogprez wrote: | |
| Others pointed out the value of silence, but I just wanted to | |
| say it saddens me when humanity is misclassified as | |
| inefficiency. The other day Sam Altman made a jest about how | |
| much energy is wasted by people saying "thanks" to ChatGPT.
| The corollary is how much human energy is wasted on humans | |
| saying thanks to each other. When making a judgement about | |
| inefficiency one is making a judgement on what is valuable, a | |
| very biased judgement that isn't necessarily aligned with | |
| what makes us thrive. =) (<-- a wasteful smiley) | |
| kristianbrigman wrote: | |
| I'll remember that you told me thanks. Will chatgpt? | |
| (Honestly curious... it's possible) | |
| Salgat wrote: | |
| I say thanks for my own well-being too. | |
| rz2k wrote: | |
| I get the impression that it sets a tone that encourages | |
| creative, more open ended responses. | |
| | |
| I think this is the reverse of confrontation with the | |
| LLM. Typically if you get a really dumb response, it is | |
| better to hang up the conversation and completely start | |
| over than it is to tell the LLM why it is wrong. Once you | |
| start arguing, they start getting stupider and respond | |
| with even faultier logic as they try to appease you. | |
| | |
| I suppose it makes sense if the training involves | |
| alternate models of discourse resembling two educated | |
| people in a forum with shared intellectual curiosity and | |
| a common goal, or two people having a ridiculous internet | |
| argument. | |
| Philip-J-Fry wrote: | |
| Well, humans saying thanks to each other isn't wasted
| energy. It has a real effect on our relationships.
|
| People say thank you to AI because they are portrayed as
| human-like chat bots, but in reality it has almost no
| effect on their ability to respond to our queries.
| | |
| Saying thank you to ChatGPT is no less wasteful than saying | |
| thank you to Windows for opening the calculator. | |
| | |
| I don't think anyone is trying to draw any parallels | |
| between that inefficiency and real humans saying thank you? | |
| mulmen wrote: | |
| Humans _are_ inefficient. The mistake is making a moral | |
| judgement about that. | |
| vayup wrote: | |
| Gemini charges by tokens rather than minutes. I used VAD to
| trim silence, hoping the token count would go down. I noticed
| the token count wasn't much different (e.g. 30 seconds of
| background noise had the same count as 2s of background
| noise). Either the Gemini API trims silence under the hood, or
| tokenization depends on speech content rather than
| length. Not sure which.
| | |
| In either case, I bet OpenAI is doing the same optimization | |
| under the hood and keeping the savings for themselves. | |
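|
| One way to check how much trimmable silence a file even has is
| ffmpeg's silencedetect (the threshold and minimum duration
| here are guesses):
|
|   ffmpeg -i input.m4a -af "silencedetect=noise=-40dB:d=1" \
|     -f null - 2>&1 | grep silence_duration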
| CSMastermind wrote: | |
| > to set your YouTube speed back down to 1x | |
| | |
| Is it common for people to watch YouTube sped up?
| | |
| I've heard of people doing this for podcasts and audiobooks and | |
| never understood it all that much there. Just feels like | |
| 'skimming' a real book instead of actually reading it. | |
| Feathercrown wrote: | |
| Some people talk slower than your natural listening speed. | |
| It's less like skimming and more like if some books used 36pt | |
| font and you normalized the size back down to a comfortable | |
| information-dense size. | |
| Eezee wrote: | |
| That's completely different. Imagine you are reading a book | |
| and the words only get revealed to you at 1 word a second. | |
| You would get annoyed if your natural reading speed was | |
| higher than that. | |
| | |
| Same with a video. A lot of people speak considerably slower | |
| than you could process the information they are conveying, so | |
| you speed it up. You still get the same content and are not | |
| skipping parts as you would when skimming a book. | |
| keithxm23 wrote: | |
| Often, I'll come across speakers who just speak slowly and | |
| listening at 1.5x or 2x barely feels sped-up. | |
| | |
| Additionally, the brain tends to adjust to a faster talking | |
| speed very quickly. If I'm watching an average-paced person | |
| talk and speed them up by 2x, the first couple minutes of | |
| listening might be difficult and will require more intent- | |
| listening. However, the brain starts processing it as the new | |
| normal and it does not feel sped-up anymore. To the extent | |
| that if I go back to 1x, it feels like the speaker is way too | |
| slow. | |
| 83 wrote: | |
| >>Just feels like 'skimming' a real book instead of actually | |
| reading it. | |
| | |
| That's the goal for me lately. I primarily use Youtube for | |
| technical assistance (where are the screws to adjust this | |
| carburetor?, how do I remove this brake hub?, etc). There | |
| used to be short 1 to 2m videos on this kind of stuff but | |
| nowadays I have to suffer through a 10-15 minute video with | |
| multiple ad breaks. | |
| | |
| So now I always watch youtube at 2x speed while rapidly | |
| jumping the slider forward to find relevant portions. | |
| babuloseo wrote: | |
| I use the YouTube trick, will share it here: upload to
| YouTube and use their built-in transcription service to
| convert the audio to text for you, and then use Gemini Pro 2.5
| to rebuild the transcript.
| | |
|   ffmpeg \
|     -f lavfi \
|     -i color=c=black:s=1920x1080:r=5 \
|     -i file_you_want_transcripted.wav \
|     -c:v libx264 \
|     -preset medium \
|     -tune stillimage \
|     -crf 28 \
|     -c:a aac \
|     -b:a 192k \
|     -pix_fmt yuv420p \
|     -shortest \
|     file_you_upload_to_youtube_for_free_transcripts.mp4
| | |
| This works VERY well for my needs. | |
| KTibow wrote: | |
| This is really interesting, although the cheapest route is still | |
| to use an alternative audio-compatible LLM (Gemini 2.0 Flash | |
| Lite, Phi 4 Multimodal) or an alternative host for Whisper | |
| (Deepinfra, Fal). | |
| fallinditch wrote: | |
| When extracting transcripts from YouTube videos, can anyone give | |
| advice on the best (cost effective, quick, accurate) way to do | |
| this? | |
| | |
| I'm confused because I read in various places that the YouTube | |
| API doesn't provide access to transcripts ... so how do all these | |
| YouTube transcript extractor services do it? | |
| | |
| I want to build my own YouTube summarizer app. Any advice and | |
| info on this topic greatly appreciated! | |
| vjerancrnjak wrote: | |
| If YouTube generated automatic captions, you can download them
| free of charge with yt-dlp.
| rob wrote: | |
| There's a tool that uses YouTube's unofficial APIs to get them | |
| if they're available: | |
| | |
| https://github.com/jdepoix/youtube-transcript-api | |
| | |
| For our internal tool that transcribes local city council | |
| meetings on YouTube (often 1-3 hours long), we found that these | |
| automatic ones were never available though. | |
| | |
| (Our tool usually 'processes' the videos within ~5-30 mins of | |
| being uploaded, so that's also why none are probably available | |
| 'officially' yet.) | |
| | |
| So we use yt-dlp to download the highest quality audio and then | |
| process them with whisper via Groq, which is way cheaper | |
| (~$0.02-0.04/hr with Groq compared to $0.36/hr via OpenAI's | |
| API.) Sometimes Groq errors out, so there's built-in support
| for Replicate and Deepgram as well.
| | |
| We run yt-dlp on our remote Linode server and I have a Python | |
| script I created that will automatically login to YouTube with | |
| a "clean" account and extract the proper cookies.txt file, and | |
| we also generate a 'po token' using another tool: | |
| | |
| https://github.com/iv-org/youtube-trusted-session-generator | |
| | |
| Both cookies.txt and the "po token" get passed to yt-dlp when | |
| running on the Linode server and I haven't had to re-generate | |
| anything in over a month. Runs smoothly every day. | |
| | |
| (Note that I don't use cookies/po_token when running locally at | |
| home, it usually works fine there.) | |
| fallinditch wrote: | |
| Very useful, thanks. So does this mean that every month or so | |
| you have to create a new 'clean' YouTube account and use that | |
| to create new po_token/cookies? | |
| | |
| It's frustrating to have to jump through all these hoops just | |
| to extract transcripts when the YouTube Data API already | |
| gives reasonable limits to free API calls ... would be nice | |
| if they allowed transcripts too. | |
| | |
| Do you think the various YouTube transcript extractor | |
| services all follow a similar method as yours? | |
| banana_giraffe wrote: | |
| You can use yt-dlp to get the transcripts. For instance, to
| grab just the transcript of a video:
|
|   ./yt-dlp --skip-download --write-sub --write-auto-sub \
|     --sub-lang en --sub-format json3 <youtube video URL>
| | |
| You can also feed the same command a playlist or channel URL | |
| and it'll run through and grab all the transcripts for each | |
| video in the playlist or channel. | |
| fallinditch wrote: | |
| That's cool, thanks for the info. But do you also have to use | |
| a rotating proxy to prevent YouTube from blocking your IP | |
| address? | |
| banana_giraffe wrote: | |
| Last time I ran this at scale was a couple of months ago, | |
| so my information is no doubt out of date, but in my | |
| experience, YouTube seems less concerned about this than | |
| they are when you're grabbing lots of videos. | |
| | |
| But that was a few months ago, so for all I know they've | |
| tightened down more hatches since then. | |
| topaz0 wrote: | |
| I have a way that is (all but) free -- just watch the video if | |
| you care about it, or decide not to if you don't, and move on | |
| with your life. | |
| Tepix wrote: | |
| Why would you give up your privacy by sending what interests
| you to OpenAI when Whisper doesn't need that much compute in
| the first place?
|
| With faster-whisper (int8, batch=8) you can transcribe 13
| minutes of audio in 51 seconds _on CPU_.
| anigbrowl wrote: | |
| I came here to ask the same question. This is a well-solved
| problem; red-queen racing it seems utterly pointless, a
| symptom of reflexive adversarialism.
| poly2it wrote: | |
| > symptom of reflexive adversarialism | |
| | |
| Is there a definition for this expression? I don't follow.
| | |
| > ... using corporate technology for the solved problem is a | |
| symptom of self-directed skepticism by the user against the | |
| corporate institutions ... | |
| | |
| Eh? | |
| ProllyInfamous wrote: | |
| I am a blue collar electrician. Not a coder (but definitely | |
| geeky). | |
| | |
| Whisper works quite well on Apple Silicon with simple drag/drop | |
| install (i.e. no terminal commands). Program is free; you can | |
| get an M4 mini for ~$550; don't see how an online platform can | |
| even compete with this, except for one-off customers (i.e. not | |
| great repeat customers). | |
| | |
| We used it to transcribe _ddaayyss_ of audio microcassettes | |
| which my mother had made during her lifetime. Whisper.app even | |
| transcribed a few hours that are difficult to comprehend as a | |
| human listener. It is _VERY_ fast. | |
| | |
| I've used the text to search for timestamps worth listening to, | |
| skipping most dead-space (e.g. she made most while driving, in | |
| a stream of not-always-focused consciousness). | |
| pimlottc wrote: | |
| Appreciated the concise summary + code snippet upfront, followed | |
| by more detail and background for those interested. More articles | |
| should be written this way! | |
| rob wrote: | |
| For anybody trying to do this in bulk, instead of using OpenAI's | |
| whisper via their API, you can also use Groq [0] which is much | |
| cheaper: | |
| | |
| [0] https://groq.com/pricing/ | |
| | |
| Groq is ~$0.02/hr with distil-large-v3, or ~$0.04/hr with | |
| whisper-large-v3-turbo. I believe OpenAI comes out to like | |
| ~$0.36/hr. | |
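|
| Groq's endpoint is OpenAI-compatible, so the swap is basically
| just the base URL and the model name (a sketch from memory;
| check their docs for current details):
|
|   curl https://api.groq.com/openai/v1/audio/transcriptions \
|     -H "Authorization: Bearer $GROQ_API_KEY" \
|     -F file=@meeting.m4a \
|     -F model=whisper-large-v3-turbo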
| | |
| We do this internally with our tool that automatically | |
| transcribes local government council meetings right when they get | |
| uploaded to YouTube. It uses Groq by default, but I also added | |
| support for Replicate and Deepgram as backups because sometimes | |
| Groq errors out. | |
| georgemandis wrote: | |
| Interesting! At $0.02 to $0.04 an hour I don't suspect you've | |
| been hunting for optimizations, but I wonder if this "speed up | |
| the audio" trick would save you even more. | |
| | |
| > We do this internally with our tool that automatically | |
| transcribes local government council meetings right when they | |
| get uploaded to YouTube | |
| | |
| Doesn't YouTube do this for you automatically these days within | |
| a day or so? | |
| rob wrote: | |
| > Doesn't YouTube do this for you automatically these days | |
| within a day or so? | |
| | |
| Oh yeah, we do a check first and use youtube-transcript-api | |
| if there's an automatic one available: | |
| | |
| https://github.com/jdepoix/youtube-transcript-api | |
| | |
| The tool usually detects them within like ~5 mins of being | |
| uploaded though, so usually none are available yet. Then | |
| it'll send the summaries to our internal Slack channel for | |
| our editors, in case there's anything interesting to 'follow | |
| up on' from the meeting. | |
| | |
| Probably would be a good idea to add a delay to it and wait | |
| for the automatic ones though :) | |
| jerjerjer wrote: | |
| > I wonder if this "speed up the audio" trick would save you | |
| even more. | |
| | |
| At this point you'll need to at least check how much running | |
| ffmpeg costs. Probably less than $0.01 per hour of audio | |
| (approximate savings) but still. | |
| ks2048 wrote: | |
| > Doesn't YouTube do this for you automatically these days | |
| within a day or so? | |
| | |
| Last time I checked, I think the Google auto-captions were | |
| noticeably worse quality than whisper, but maybe that has | |
| changed. | |
| colechristensen wrote: | |
| If you have a recent macbook you can run the same whisper model | |
| locally for free. People are really sleeping on how cheap the | |
| compute you own hardware for already is. | |
| rob wrote: | |
| I don't. I have a MacBook Pro from 2019 with an Intel chip | |
| and 16 GB of memory. Pretty sure when I tried the large | |
| whisper model it took like 30 minutes to an hour to do | |
| something that took hardly any time via Groq. It's been a | |
| while though so maybe my times are off. | |
| colechristensen wrote: | |
| Ah, no, Apple silicon Mac required with a decent amount of | |
| memory. But this kind of machine has been very common (a | |
| mid to high range recent macbook) at all of my employers | |
| for a long time. | |
| fragmede wrote: | |
| It's been roughly six years since that MacBook was top of | |
| the line, so your times are definitely off. | |
| likium wrote: | |
| What tool do you use? | |
| pzo wrote: | |
| There is also Cloudflare Workers AI, where you can run
| whisper-large-v3-turbo for around $0.03 per hour:
| | |
| https://developers.cloudflare.com/workers-ai/models/whisper-... | |
| abidlabs wrote: | |
| You could use Hugging Face's Inference API (which supports all
| of these API providers) directly, making it easier to switch
| between them; e.g. look at the panel on the right on:
| https://huggingface.co/openai/whisper-large-v3
| BrunoJo wrote: | |
| Let me know if you are interested in a more reliable | |
| transcription API. I'm building Lemonfox.ai and we've optimized | |
| our transcription API to be highly available and very fast for | |
| large files. Happy to give you a discount (email: bruno at | |
| lemonfox.ai) | |
| stogot wrote: | |
| Love this idea, but the accuracy section is lacking. Couldn't
| you do a simple diff of the outputs and see how many
| differences there are? 0.5% or 5%?
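|
| A word-level diff would be a quick first pass, e.g. with git
| (wdiff would work too; the filenames are placeholders):
|
|   # Word-level diff of the 1x and 3x transcripts
|   git diff --no-index --word-diff transcript-1x.txt transcript-3x.txt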
| georgemandis wrote: | |
| Yeah, I'd like to do a more formal analysis of the outputs if I | |
| can carve out the time. | |
| | |
| I don't think a simple diff is the way to go, at least for what | |
| I'm interested in. What I care about more is the overall | |
| accuracy of the summary--not the word-for-word transcription. | |
| | |
| The test I want to set up is using LLMs to evaluate the
| summarized output and see if the primary themes/topics persist.
| That's more interesting and useful to me for this exercise.
| tmaly wrote: | |
| The whisper model weights are free. You could save even more by | |
| just using them locally. | |
| pzo wrote: | |
| but this is still great trick if you want to reduce latency or | |
| inference speed even with local models e.g. in realtime chatbot | |
| 55555 wrote: | |
| This seems like a good place for me to complain about the fact | |
| that the automatically generated subtitle files Youtube creates | |
| are horribly malformed. Every sentence is repeated twice. In many | |
| subtitle files, the subtitle timestamp ranges overlap one another | |
| while also repeating every sentence twice in two different | |
| ranges. It's absolutely bizarre and has been like this for years | |
| or possibly forever. Here's an example - I apologize that it's | |
| not in English. I don't know if this issue affects English. | |
| https://pastebin.com/raw/LTBps80F | |
| xenator wrote: | |
| Seems like Thai. Thai translation and recognition is like 10 | |
| years ago comparing to other languages I'm dealing with in my | |
| everyday life. Good news tho is the same level was for Russian | |
| years ago, and now it is near perfect. | |
| 55555 wrote: | |
| Well the weird thing is honestly their speech to text | |
| recognizes 97% of words correctly. The subtitle content is | |
| pretty perfect. It's just the formatting that's awful. | |
| amelius wrote: | |
| Solution: charge by number of characters generated. | |
| dataviz1000 wrote: | |
| I built a Chrome extension with one feature that transcribes | |
| audio to text in the browser using huggingface/transformers.js | |
| running the OpenAI Whisper model with WebGPU. It works perfect! | |
| Here is a list of examples of all the things you can do in the | |
| browser with webgpu for free. [0] | |
| | |
| The last thing in the world I want to do is listen or watch | |
| presidential social media posts, but, on the other hand, | |
| sometimes enormously stupid things are said which move the SP500 | |
| up or down $60 in a session. So this feature queries for new | |
| posts every minute, does ORC image to text and transcribe video | |
| audio to text locally, sends the post with text for analysis, all | |
| in the background inside a Chrome extension before notify me of | |
| anything economically significant. | |
| | |
| [0] | |
| https://github.com/huggingface/transformers.js/tree/main/exa... | |
| | |
| [1] https://github.com/adam-s/doomberg-terminal | |
| kgc wrote: | |
| Impressive | |
| karpathy wrote: | |
| Omg long post. TLDR from an LLM for anyone interested | |
| | |
| Speed your audio up 2-3x with ffmpeg before sending it to | |
| OpenAI's gpt-4o-transcribe: the shorter file uses fewer input- | |
| tokens, cuts costs by roughly a third, and processes faster with | |
| little quality loss (4x is too fast). A sample yt-dlp - ffmpeg - | |
| curl script shows the workflow. | |
| | |
| ;) | |
| bravesoul2 wrote: | |
| This is the sort of content I want to see in Tweets and | |
| LinkedIn posts. | |
| | |
| I have been thinking for a while how do you make good use of | |
| the short space in those places. | |
| | |
| LLM did well here. | |
| georgemandis wrote: | |
| Hahaha. Okay, okay... I will watch it now ;) | |
| | |
| (Thanks for your good sense of humor) | |
| karpathy wrote: | |
| I like that your post deliberately gets to the point first | |
| and then (optionally) expands later, I think it's a good and | |
| generally underutilized format. I often advise people to | |
| structure their emails in the same way, e.g. first just | |
| cutting to the chase with the specific ask, then giving more | |
| context optionally below. | |
| | |
| It's not my intention to bloat information or delivery but I | |
| also don't super know how to follow this format especially in | |
| this kind of talk. Because it's not so much about relaying | |
| specific information (like your final script here), but more | |
| as a collection of prompts back to the audience as things to | |
| think about. | |
| | |
| My companion tweet to this video on X had a brief | |
| TLDR/Summary included where I tried, but I didn't super think | |
| it was very reflective of the talk, it was more about topics | |
| covered. | |
| | |
| Anyway, I am overall a big fan of doing more compute at the | |
| "creation time" to compress other people's time during | |
| "consumption time" and I think it's the respectful and kind | |
| thing to do. | |
| georgemandis wrote: | |
| I watched your talk. There are so many more interesting | |
| ideas in there that resonated with me that the summary | |
| (unsurprisingly) skipped over. I'm glad I watched it! | |
| | |
| LLMs as the operating system, the way you interface with | |
| vibe-coding (smaller chunks) and the idea that maybe we | |
| haven't found the "GUI for AI" yet are all things I've | |
| pondered and discussed with people. You articulated them | |
| well. | |
| | |
| I think some formats, like a talk, don't lend themselves | |
| easily to meaningful summaries. It's about giving the | |
| audience things to think about, to your point. It's the sum | |
| of storytelling that's more than the whole and why we still | |
| do it. | |
| | |
| My post is, at the end of the day, really more about a neat | |
| trick to optimize transcriptions. This particular video | |
| might be a great example of why you may not always want to | |
| do that :) | |
| | |
| Anyway, thanks for the time and thanks for the talk! | |
| mh- wrote: | |
| _> I often advise people to structure their emails [..]_ | |
| | |
| I frequently do the same, and eventually someone sent me | |
| this HBR article summarizing the concept nicely as "bottom | |
| line up front". It's a good primer for those interested. | |
| | |
| https://hbr.org/2016/11/how-to-write-email-with-military- | |
| pre... | |
| lordspace wrote: | |
| that's a really good summary :) | |
| xg15 wrote: | |
| That's really cool! Also, isn't this effectively the same as | |
| supplying audio with a sampling rate of 8kHz instead of the 16kHz | |
| that the model is supposed to work with? | |
| anshumankmr wrote: | |
| Someone should try transcribing Eminem's Rap god with this trick. | |
| alok-g wrote: | |
| >> by jumping straight to the point ... | |
| | |
| Love this! I wish more authors follow this approach. So many | |
| articles keep going all over the place before 'the point' | |
| appears. | |
| | |
| If trying, perhaps some 50% of the authors may realize that they | |
| don't _have_ a point. | |
| pknerd wrote: | |
| I guess it'd work even if you make it 2.5 or evebn 3x. | |
| donkey_brains wrote: | |
| Hmm...doesn't this technique effectively make the minute longer, | |
| not shorter? Because you can pack more speech into a minute of | |
| recording? Seems like making a minute shorter would be | |
| counterproductive. | |
| StochasticLi wrote: | |
| No. You're paying for a minute of audio, which will be more | |
| packed with speech, not for how long it's being computed. | |
| impossiblefork wrote: | |
| Make the minutes longer, you mean. | |
| pbbakkum wrote: | |
| This is great, thank you for sharing. I work on these APIs at | |
| OpenAI, it's a surprise to me that it still works reasonably well | |
| at 2/3x speed, but on the other hand for phone channels we get | |
| 8khz audio that is upsampled to 24khz for the model and it still | |
| works well. Note there's probably a measurable decrease in | |
| transcription accuracy that worsens as you deviate from 1x speed. | |
| Also we really need to support bigger/longer file uploads :) | |
| nerder92 wrote: | |
| Quick Feedback: Would it be cool to research this internally | |
| and maybe find a sweet spot in speed multiplier where the loss | |
| is minimal. This pre-processing is quite cheap and could bring | |
| down the API price eventually. | |
| georgemandis wrote: | |
| I kind of want to take a more proper poke at this but focus | |
| more one summarization accuracy over word-for-word accuracy, | |
| though I see the value in both. | |
| | |
| I'm actually curious, if I run transcriptions back-to-back-to- | |
| back on the exact same audio, how much variance should I | |
| expect? | |
| | |
| Maybe I'll try three approaches: | |
| | |
| - A straight diff comparison (I know a lot of people are | |
| calling for this, but I really think this is less useful than | |
| it sounds) | |
| | |
| - A "variance within the modal" test running it multiple times | |
| against the same audio, tracking how much it varies between | |
| runs | |
| | |
| - An LLM analysis assessing if the primary points from a talk | |
| were captured and summarized at 1x, 2x, 3x, 4x runs (I think | |
| this is far more useful and interesting) | |
| celltalk wrote: | |
| With this logic, you should also be able to trim the parts that | |
| doesn't have words. Just add a cut-off for db, and trim the video | |
| before transcription. | |
| | |
| Possibly another 10-20% gain? | |
| isubkhankulov wrote: | |
| Transcripts get much more valuable when one diarizes the audio | |
| beforehand to determine which speaker said what. | |
| | |
| I use this free tool to extract those and dump the transcripts | |
| into a LLM with basic prompts: https://contentflow.megalabs.co | |
| mt_ wrote: | |
| You can just dump the youtube link video in Google AI studio and | |
| ask it to transcribe the video with speaker labels and even ask | |
| it it to add useful visual clues, because the model is multimodal | |
| for video too. | |
| MaxDPS wrote: | |
| Can I ask what you mean by "useful visual clues"? | |
| mt_ wrote: | |
| What is the speaker showcasing in its slides, what is it's | |
| body language and so on. | |
| cprayingmantis wrote: | |
| I noticed something similar with images as inputs to Claude, you | |
| can scale down the images and still get good outputs. There is an | |
| accuracy drop off at a certain point but the token savings are | |
| worth doing a little tuning there. | |
| georgemandis wrote: | |
| Definitely in the same spirit! | |
| | |
| Clearly the next thing we need to test is removing all the | |
| vowels from words, or something like that :) | |
| meerab wrote: | |
| Interesting approach to transcript generation! | |
| | |
| I'm implementing a similar workflow for VideoToBe.com | |
| | |
| My Current Pipeline: | |
| | |
| Media Extraction - yt-dlp for reliable video/audio downloads | |
| Local Transcription - OpenAI Whisper running on my own hardware | |
| (no API costs) Storage & UI - Transcripts stored in S3 with a | |
| custom web interface for viewing | |
| | |
| Y Combinator playlist | |
| https://videotobe.com/play/playlist/ycombinator | |
| | |
| and Andrej's talk is | |
| https://videotobe.com/play/youtube/LCEmiRjPEtQ | |
| | |
| After reading your blog post, I will be testing effect on | |
| speeding audio for locally-hosted Whisper models. Running Whisper | |
| locally eliminates the ongoing cost concerns since my | |
| infrastructure is already a sunk cost. Speeding audio could be an | |
| interesting performance enhancement to explore! | |
| fuzztester wrote: | |
| Stop being slaves of extorters of any kind, and just leave. | |
| | |
| there is tons of this happening everywhere, and we need to fight | |
| this, and boycott it. | |
| pottertheotter wrote: | |
| You can just ask Gemini to summarize it for you. It's free. I do | |
| it all the time with YouTube videos. | |
| | |
| Or you can just copy the transcript that YouTube provides below | |
| the video. | |
| BrunoJo wrote: | |
| If you look for a cheaper transcription API you could als use | |
| https://Lemonfox.ai. We've optimized the API for long audio files | |
| and are much faster and cheaper than OpenAI. | |
| conjecTech wrote: | |
| If you are hosting whisper yourself, you can do something | |
| slightly more elegant, but with the same effect. You can | |
| downsample/pool the context 2:1 (or potentially more) a few | |
| layers into the encoder. That allows you to do the equivalent of | |
| speeding up audio without worry about potential spectral losses. | |
| For whisper large v3, that gets you nearly double throughput in | |
| exchange for a relative ~4% WER increase. | |
| nomercy400 wrote: | |
| Do you have more details or examples on how to downsample the | |
| context in the encoder? I treat the encoder as an opaque block, | |
| so I have no idea where to start. | |
| PeterStuer wrote: | |
| I wonder how much time and battery | |
| transcoding/uploading/downloading over coffeeshop wifi would | |
| realy save vs just running it locally through optimized Whisper. | |
| georgemandis wrote: | |
| I had this same thought and won't pretend my fear was rational, | |
| haha. | |
| | |
| One thing that I thought was fairly clear in my write-up but | |
| feels a little lost in the comments: I didn't just try this | |
| with whisper. I tried it with their newer gpt-4o-transcription | |
| model, which seems considerably faster. There's no way to run | |
| that one locally. | |
| KPennig86852 wrote: | |
| But you know that you can run OpenAI's Whisper audio recognition | |
| model locally for free, right? It has very little GPU | |
| requirements, and the new "turbo" model works quite fast (there | |
| are also several Python libraries which make it significantly | |
| faster still). | |
| dajonker wrote: | |
| Gemini 2.5 pro is, in my usage, quite superior for high quality | |
| transcriptions of phone calls, in Dutch in my case. As long as | |
| you upload the audio to GCS there you can easily process | |
| conversations of over an hour. It correctly identified and | |
| labeled speakers. | |
| | |
| The cheaper 2.5 flash made noticeably more mistakes, for example | |
| it didn't correctly output numbers while the Pro model did. | |
| | |
| As for OpenAI, their gpt-4o-transcribe model did worse than 2.5 | |
| flash, completely messing up names of places and/or people. Plus | |
| it doesn't label the conversation in turns, it just outputs a | |
| single continuous piece of text. | |
| yashasolutions wrote: | |
| the question would be how to do that but also still get proper | |
| time code when using whisper to get the subtitles | |
| ryanar wrote: | |
| In my experience, transcription software has no problem with | |
| transcribing sped up audio, or audio that is inaudible to humans | |
| or extremely loud (as long as not clipped), I wonder if LLM | |
| transcription works the same. | |
| mushishi wrote: | |
| Do the APIs support simultaneous voice transcription in a way | |
| that different voices are tagged? (either in text or as metadata) | |
| | |
| If so: could you split the audiofile and process the latter half | |
| by pitch shifting, say an octave, and then merging them together | |
| to get shorter audiofile -- then transcribe and join them back to | |
| a linear form, tagging removed. (You could insert some | |
| prerecorded voice to know at which point the second voice | |
| starts.). If pitch change is not enough, maybe manipulate it | |
| further by formants. | |
| godot wrote: | |
| If you're already doing local ffmpeg stuff (i.e. pretty involved | |
| with code and scripting already) you're only a couple of steps | |
| more away from just downloading the openai-whisper models (or | |
| even the faster-whisper models which runs about two times | |
| faster). Since this looks like personal usage and not building | |
| production quality code, you can use AI (e.g. Cursor) to write a | |
| script to run the whisper model inference in seconds. | |
| | |
| Then there is no cost at all to run any length of audio. (since | |
| cost seems to be the primary factor of this article) | |
| | |
| On my m1 mac laptop it takes me about 30 seconds to run it on a | |
| 3-minute audio file. I'm guessing for a 40 minute talk it takes | |
| about 5-10 minutes to run. | |
| ta8903 wrote: | |
| This "hack" also works in real life, youtubers low to talk slowly | |
| to increase the video runtime so I watch everything other than | |
| songs at 2x speed (and that's only because their player doesn't | |
| let you go faster). | |
___________________________________________________________________ | |
(page generated 2025-06-26 21:02 UTC) |