[HN Gopher] OpenAI charges by the minute, so speed up your audio
___________________________________________________________________
OpenAI charges by the minute, so speed up your audio
Author : georgemandis
Score : 691 points
Date : 2025-06-25 13:17 UTC (1 day ago)
web link (george.mand.is)
w3m dump (george.mand.is)
| georgemandis wrote: | |
| I was trying to summarize a 40-minute talk with OpenAI's
| transcription API, but it was too long. So I sped it up with
| ffmpeg to fit within the 25-minute cap. It worked quite well (up
| to 3x speed) and was cheaper and faster, so I wrote about it.
| | |
| Felt like a fun trick worth sharing. There's a full script and | |
| cost breakdown. | |
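|
| The core of it is a minimal sketch like this (assuming the
| gpt-4o-transcribe model, an OPENAI_API_KEY in the environment,
| and placeholder filenames):
|
|   # Speed the audio up 3x; atempo accepts 0.5-100 in recent ffmpeg
|   ffmpeg -i talk.m4a -filter:a "atempo=3.0" -ac 1 -b:a 64k talk-3x.m4a
|
|   # Send the now-shorter file to the transcription endpoint
|   curl https://api.openai.com/v1/audio/transcriptions \
|     -H "Authorization: Bearer $OPENAI_API_KEY" \
|     -F file=@talk-3x.m4a \
|     -F model=gpt-4o-transcribe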
| bravesoul2 wrote: | |
| You could have kept quiet and started a cheaper-than-OpenAI
| transcription business :)
| behnamoh wrote: | |
| Sure, but now the world is a better place because he shared | |
| something useful! | |
| 4b11b4 wrote: | |
| Pre-processing the audio is still a valid business; multiple
| types of pre-processing might be viable.
| hn8726 wrote: | |
| Or OpenAI will do it themselves for transcription tasks
| ilyakaminsky wrote: | |
| I've already done that [1]. A fraction of the price, 24-hour | |
| limit per file, and speedup tricks like the OP's are welcome. | |
| :) | |
| | |
| [1] https://speechischeap.com | |
| bravesoul2 wrote: | |
| Nice. Don't expect you to spill the beans, but is it doing
| OK (some customers?)
|
| Just wondering if I can build a retirement out of APIs :)
| ilyakaminsky wrote: | |
| It's sustainable, but not enough to retire on at this | |
| point. | |
| | |
| > Just wondering if I can build a retirement out of APIs
| :)
| | |
| I think it's possible, but you need to find a way to add | |
| value beyond the commodity itself (e.g., audio | |
| classification and speaker diarization in my case). | |
| ada1981 wrote: | |
| We discovered this last month. | |
| | |
| There is also probably a way to send smaller samples of the
| audio at different speeds and compare them, to get a speed
| optimization with no quality loss unique to each clip.
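|
| Roughly, as an untested sketch (same endpoint as in the post;
| the sample length, speeds, and filenames are made up):
|
|   ffmpeg -y -i talk.m4a -t 60 sample.m4a   # grab a 60s sample
|   for speed in 1.5 2 2.5 3; do
|     ffmpeg -y -i sample.m4a -filter:a "atempo=$speed" "sample-$speed.m4a"
|     curl -s https://api.openai.com/v1/audio/transcriptions \
|       -H "Authorization: Bearer $OPENAI_API_KEY" \
|       -F file="@sample-$speed.m4a" -F model=gpt-4o-transcribe \
|       -F response_format=text > "transcript-$speed.txt"
|   done
|   wc -w transcript-*.txt   # a sharp drop in word count = too fast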
| moralestapia wrote: | |
| >We discovered this last month. | |
| | |
| Nice. Any blog post, Twitter comment or anything pointing to
| that?
| babuloseo wrote: | |
| source? | |
| brendanfinan wrote: | |
| would this also work for my video consisting of 10,000 PDFs? | |
| | |
| https://news.ycombinator.com/item?id=44125598 | |
| jasonjmcghee wrote: | |
| I can't tell if this is a meme or not. | |
| | |
| And if someone had this idea and pitched it to Claude (the | |
| model this project was vibe coded with) it would be like "what | |
| a great idea!" | |
| raincole wrote: | |
| Geez, that repo[0] has 8k stars on GitHub?
|
| Are people just starring it for meme value or something? Is this
| a scam?
| | |
| [0]: https://github.com/Olow304/memvid | |
| mcc1ane wrote: | |
| Longer* | |
| canyp wrote: | |
| Came here just for this. | |
| simonw wrote: | |
| There was a similar trick which worked with Gemini versions prior | |
| to Gemini 2.0: they charged a flat rate of 258 tokens for an | |
| image, and it turns out you could fit more than 258 tokens of | |
| text in an image of text and use that for a discount! | |
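|
| The rendering step needs nothing fancier than something like
| this (an illustrative ImageMagick command; the size is
| arbitrary):
|
|   # Draw the contents of text.txt into a PNG, word-wrapped
|   magick -background white -fill black -size 1024x1024 \
|     caption:@text.txt text.png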
| Graziano_M wrote: | |
| Well a picture is worth a thousand tokens. | |
| heeton wrote: | |
| A point on skimming vs taking the time to read something | |
| properly. | |
| | |
| I read a transcript + summary of that exact talk. I thought it
| was fine but uninteresting, and I moved on.
| | |
| Later I saw it had been put on YouTube and I was on the train, so
| I watched the whole thing at normal speed. I had a huge number of | |
| different ideas, thoughts and decisions, sparked by watching the | |
| whole thing. | |
| | |
| This happens to me in other areas too. Watching a conference talk | |
| in person is far more useful to me than watching it online with | |
| other distractions. Watching it online is more useful again than | |
| reading a summary. | |
| | |
| Going for a walk to think about something deeply beats a 10 | |
| minute session to "solve" the problem and forget it. | |
| | |
| Slower is usually better for thinking. | |
| pluc wrote: | |
| Seriously, this is bonkers to me. I, like many hackers, hated
| school because they just threw one-size-fits-all knowledge at
| you, and here we are, paying for the privilege of having that in
| every facet of our lives.
| | |
| Reading is a pleasure. Watching a lecture or a talk and feeling | |
| the pieces fall into place is great. Having your brain work out | |
| the meaning of things is surely something that defines us as a | |
| species. We're willingly heading toward such stupidity; I don't
| get it. I don't get how we can all be so blind to what this is
| going to create.
| hooverd wrote: | |
| If you're not listening to summaries of different audiobooks | |
| at 2x speed in each ear you're not contentmaxing. | |
| lovestory wrote: | |
| Or just use NotebookLM to convert your books into hour-long
| podcasts /s
| 0cf8612b2e1e wrote: | |
| I am genuinely curious how well this would go. There are | |
| so many books I "should" read, but will never get around | |
| to doing it. A one hour podcast would be more engaging | |
| than reading a Wikipedia summary. | |
| | |
| On the gripping hand, there are probably already | |
| excellent 10/30/60 minute book summaries on YouTube or | |
| wherever which are not going to hallucinate plot points. | |
| LanceH wrote: | |
| Read the title and go. | |
| isaacremuant wrote: | |
| > We're willingly heading for such stupidity, I don't get it. | |
| I don't get how we can all be so blind at what this is going | |
| to create. | |
| | |
| Your doomerism and superiority don't follow from your
| initial "I, like many hackers, don't like one-size-fits-all."
|
| This is literally offering you MANY sizes, and you have the
| freedom to choose. Somehow you're pretending it's imposed
| uniformity.
| | |
| Consume it however you want and come up with actual | |
| criticisms next time? | |
| colechristensen wrote: | |
| University didn't agree with me mostly because I can't pay | |
| attention to the average lecturer. Getting bored in between | |
| words or while waiting for them to write means I absorbed | |
| very little and had to teach myself nearly everything. | |
| | |
| Audiobooks before speed tools were the worst (are they
| _trying_ to speak extra slowly?). But when I can speed things
| up, comprehension is just fine.
| parpfish wrote: | |
| The worst part about talks/lectures is that once you lose | |
| the thread, the rest is meaningless. If my mind wanders a | |
| bit 5 minutes into an hour-long talk, the rest of that
| hour is a lost cause.
| bisby wrote: | |
| > I, like many hackers, hated school because they just threw | |
| one-size-fits-all knowledge at you | |
| | |
| "This specific knowledge format doesnt work for me, so I'm | |
| asking OpenAI to convert this knowledge into a format that is | |
| easier for me to digest" is exactly what this is about. | |
| | |
| I'm not quite sure what you're upset about? Unless you're | |
| referring to "one size fits all knowledge" as simplified | |
| topics, so you can tackle things at a surface level? I love | |
| having surface-level knowledge about a LOT of things. I
| certainly don't have time to go deep on every topic out
| there. But if this is a topic I find I am interested in, the
| full talk is still available. | |
| | |
| Breadth and depth are both important, and well summarized | |
| talks are important for breadth, but not helpful at all for | |
| depth, and that's ok. | |
| zahlman wrote: | |
| > I, like many hackers, hated school because they just threw | |
| one-size-fits-all knowledge at you and here we are, paying | |
| for the privilege to have that in every facet of our lives. | |
| | |
| But now we get to browse the knowledge rather than having it | |
| thrown at us. That's more important than the quality or | |
| formatting of the content. | |
| itake wrote: | |
| > I don't get how we can all be so blind at what this is | |
| going to create. | |
| | |
| There is too much information. People are trying to optimize
| breadth over depth, but obviously there are costs to this.
| georgemandis wrote: | |
| For what it's worth, I completely agree with you, for all the | |
| reasons you're saying. With talks in particular I think it's | |
| seldom about the raw content and ideas presented and more about | |
| the ancillary ideas they provoke and inspire, like you're | |
| describing. | |
| | |
| There is just _so_ much content out there. And context is | |
| everything. If the person sharing it had led with some specific | |
| ideas or thoughts I might have taken the time to watch and | |
| looked for those ideas. But in the context it was received--a | |
| quick link with no additional context--I really just wanted the | |
| "gist" to know what I was even potentially responding to. | |
| | |
| In this case, for me, it was worth it. I can go back and decide | |
| if I want to watch it. Your comment has intrigued me so I very | |
| well might! | |
| | |
| ++ to "Slower is usually better for thinking" | |
| mutagen wrote: | |
| Not to discount slower speeds for thinking, but I wonder if
| there is also value in dipping into a talk or a subject and
| then revisiting (re-watching) it with the time to ponder the
| thoughts a little more deeply.
| tass wrote: | |
| This is similar to strategies in "How to Read a Book"
| (Adler).
| | |
| By understanding the outline and themes of a book (or | |
| lecture, I suppose), it makes it easier to piece together | |
| thoughts as you delve deeper into the full content. | |
| conradev wrote: | |
| Was it the speed or the additional information vended by the | |
| audio and video? If someone is a compelling speaker, the same | |
| message will be way more effective in an audiovisual format. | |
| The audio has emphasis on certain parts of the content, for | |
| example, which is missing from the transcript or summary | |
| entirely. Video has gestural and facial cues, also often | |
| utilized to make a point. | |
| bongodongobob wrote: | |
| You'd love where I work. Everything is needlessly long,
| bloviating PowerPoint meetings that could easily be ingested
| in a 5-minute email.
| itsoktocry wrote: | |
| > _Slower is usually better for thinking._ | |
| | |
| Yeah, I see people talking about listening to podcasts or | |
| audiobooks on 2x or 3x. | |
| | |
| Sometimes I set mine to 0.8x. I find you get time to absorb and | |
| think. Am I an outlier? | |
| LanceH wrote: | |
| Depends on what you're listening to. If it's a recap of | |
| something and you're just looking for the answer to "what | |
| happened?", that can be fine for 2x. If you're getting into | |
| the "why?" maybe slower is better. Or if there are a lot of | |
| players involved. | |
| | |
| I'm trying to imagine listening to War and Peace faster. On | |
| the one hand, there are a lot of threads and people to keep | |
| track of (I had a notepad of who is who). On the other hand, | |
| having the stories compressed in time might help remember | |
| what was going on with a character when finally returning to | |
| them. | |
| | |
| Listening to something like Dune quickly, someone might come | |
| out only thinking of the main political thrusts, and the | |
| action, without building that same world in their mind they | |
| would if read slower. | |
| b0a04gl wrote: | |
| It's still decoding every frame and matching phonemes either
| way, but speeding it up reduces how many seconds they bill you
| for. So you may be hacking their billing logic more than the
| model itself.
|
| It also means the longer you talk, the more you pay, even if
| the actual info density is the same. So if your voice has
| longer pauses or you speak slowly, you may be subsidizing
| inefficiency.
|
| Makes me think maybe the next big compression is in delivery
| cadence: just auto-optimize voice tone and pacing before
| sending it to the LLM. Feed it synthetic fast speech with no
| emotion, just high-density words. You lose human warmth but
| gain 40% cost savings.
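|
| In ffmpeg terms that could be as blunt as chaining the two
| filters (a sketch; the thresholds are untested guesses):
|
|   # Strip pauses, then compress what's left to 2x
|   ffmpeg -i input.m4a \
|     -af "silenceremove=start_periods=1:stop_periods=-1:stop_duration=0.2:stop_threshold=-45dB,atempo=2.0" \
|     output-dense.m4a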
| timerol wrote: | |
| > Is It Accurate? | |
| | |
| > I don't know--I didn't watch it, lol. That was the whole point.
| And if that answer makes you uncomfortable, buckle up for this
| future we're hurtling toward. Boy, howdy.
| | |
| This is a great bit of work, and the author accurately summarizes | |
| my discomfort | |
| BHSPitMonkey wrote: | |
| As if human-generated transcriptions of audio ever came with | |
| guarantees of accuracy? | |
| | |
| This kind of transformation has always come with flaws, and I | |
| think that will continue to be expected implicitly. Far more | |
| worrying is the public's trust in _interpretations_ and claims | |
| of _fact_ produced by gen AI services, or at least the popular | |
| idea that "AI" is more trustworthy/unbiased than humans, | |
| journalists, experts, etc. | |
| angst wrote: | |
| At least with human-generated transcriptions there are
| entities that we can hold responsible...
| _kb wrote: | |
| That still holds true for gen-AI. Organisations that | |
| provide transcription services can't offload responsibility | |
| to a language model any more than they can to steno | |
| keyboard manufacturers. | |
| | |
| If you are the one feeding content to a model then you are | |
| that responsible entity. | |
| raincole wrote: | |
| A lot of people read the newspaper.
|
| A newspaper is essentially just an inaccurate summary of what
| really happened. So I don't find this realization that
| uncomfortable.
| dmix wrote: | |
| That's why I find the idea of training on breaking news from
| Reddit or Twitter funny: wild exaggerations and targeted spin
| are the sort of stuff that does best on those sites and
| generates the most comments. 50% of the output would be lies.
| jasonjmcghee wrote: | |
| Heads up, the token cost breakdown tables look white on white to | |
| me. I'm in dark mode on iOS using Brave. | |
| georgemandis wrote: | |
| Should be fixed now. Thank you! | |
| w-m wrote: | |
| By transcribing a talk by Andrej, you already picked the most
| challenging case possible, speed-wise. His natural talking speed
| is already >=1.5x that of a normal human. He's one of the people
| you absolutely have to set your YouTube speed back down to 1x
| for, just to follow what's going on.
| | |
| In the spirit of making more of an OpenAI minute, don't send it
| any silence.
| | |
| E.g.:
|
|   ffmpeg -i video-audio.m4a \
|     -af "silenceremove=start_periods=1:start_duration=0:start_threshold=-50dB:stop_periods=-1:stop_duration=0.02:stop_threshold=-50dB,apad=pad_dur=0.02" \
|     -c:a aac -b:a 128k output_minpause.m4a -y
| | |
| will cut the talk down from 39m31s to 31m34s, by replacing any
| silence longer than 20ms (with a -50dB threshold) with a 20ms
| pause. And in keeping with the spirit of your post, I only
| measured that the input file got shorter; I didn't look at the
| quality of the transcription of the shorter version at all.
| georgemandis wrote: | |
| Oooh fun! I had a feeling there was more ffmpeg wizardry I | |
| could be leaning into here. I'll have to try this later--thanks | |
| for the idea! | |
| w-m wrote: | |
| In the meantime I realized that the apad part is nonsensical | |
| - it pads the end of the stream, not at each silence-removed | |
| cut. I wanted to get angry at o3 for proposing this, but then | |
| I had a look at the silenceremove= documentation myself: | |
| https://ffmpeg.org/ffmpeg-filters.html#silenceremove | |
| | |
| Good god. You couldn't make that any more convoluted and | |
| hard-to-grasp if you wanted to. You gotta love ffmpeg! | |
| | |
| I now _think_ this might be a good solution: | |
|   ffmpeg -i video-audio.m4a \
|     -af "silenceremove=start_periods=1:stop_periods=-1:stop_duration=0.15:stop_threshold=-40dB:detection=rms" \
|     -c:a aac -b:a 128k output.m4a -y
| snickerdoodle12 wrote: | |
| I love ffmpeg but the documentation is often close to | |
| incomprehensible. | |
| squigz wrote: | |
| Out of curiosity, how might you improve those docs? They | |
| seem fairly reasonable to me | |
| w-m wrote: | |
| The documentation reads like it was written by a | |
| programmer who documented the different parameters to | |
| their implementation of a specific algorithm. Now when | |
| you as the user come along and want to use silenceremove, | |
| you'll have to carefully read through this, and build | |
| your own mental model of that algorithm, and then you'll | |
| be able to set these parameters accordingly. That takes a | |
| lot of time and energy, in this case multiple read- | |
| throughs and I'd say > 5 minutes. | |
| | |
| Good documentation should do this work for you. It should | |
| explain somewhat atomic concepts to you, that you can | |
| immediately adapt, and compose. Where it already works is | |
| for the "detection" and "window" parameters, which are | |
| straightforward. But the actions of trimming in the | |
| start/middle/end, and how to configure how long the | |
| silence lasts before trimming, whether to ignore short | |
| bursts of noise, whether to skip every nth silence | |
| period, these are all ideas and concepts that get mushed | |
| together in 10 parameters which are called start/stop- | |
| duration/threshold/silence/mode/periods. | |
| | |
| If you want to apply this filter, it takes a long time to | |
| build mental models for these 10 parameters. You do have | |
| some example calls, which is great, but which doesn't | |
| help if you need to adjust any of these - then you | |
| probably need to understand them all. | |
| | |
| Some stuff I stumbled over when reading it: | |
| | |
| "To remove silence from the middle of a file, specify a | |
| stop_periods that is negative. This value is then treated | |
| as a positive value [...]" - what? Why is this parameter | |
| so heavily overloaded? | |
| | |
| "start_duration: Specify the amount of time that non- | |
| silence must be detected before it stops trimming audio" | |
| - parameter is named start_something, but it's about | |
| stopping? Why? | |
| | |
| "start_periods: [...] Normally, [...] start_periods will | |
| be 1 [...]. Default value is 0." | |
| | |
| "start_mode: Specify mode of detection of silence end at | |
| start": start_mode end at start? | |
| | |
| It's very clunky. Every parameter has multiple modes of | |
| operation. Why is it start and stop for beginning and | |
| end, and why is "do stuff in the middle" part of the end? | |
| Why is there no global mode? | |
| | |
| You could nitpick this stuff to death. In the end, naming | |
| things is famously one of the two hard problems in | |
| computer science (the others being cache invalidation and | |
| off-by-one errors). And writing good documentation is | |
| also very, very hard work. Just exposing the internals of | |
| the algorithm is often not great UX, because then every | |
| user has to learn how the thing works internally before | |
| they can start using it (hey, looking at you, git). | |
| | |
| So while it's easy to point out where these docs fail, it | |
| would be a lot of work to rewrite this documentation from | |
| the top down, explaining the concepts first. Or even | |
| rewriting the interface to make this more approachable, | |
| and the parameters less overloaded. But since it's hard | |
| work, and not sexy to programmers, it won't get done, and | |
| many people will come after, having to spend time on | |
| reading and re-reading this current mess. | |
| phito wrote: | |
| > naming things is famously one of the two hard problems | |
| in computer science | |
| | |
| Isn't ffmpeg made by a French person? As a francophone
| myself, I can tell you one of the biggest weaknesses of
| francophone programmers is naming things; it's even worse when
| it's in English. Maybe that's what's at play here.
| ada1981 wrote: | |
| Curious if this is helpful. | |
| | |
| https://claude.ai/public/artifacts/96ea8227-48c3-484d-b30 | |
| b-6... | |
| | |
| I had Claude rewrite the documentation for silenceremove | |
| based on your feedback. | |
| zahlman wrote: | |
| > "start_mode: Specify mode of detection of silence end | |
| at start": start_mode end at start? | |
| | |
| In "start_mode", "start" means "initial", and "mode" | |
| means "method". But specifically, it's a method of | |
| figuring out where the silence ends. | |
| | |
| > In the end, naming things is famously one of the two | |
| hard problems in computer science | |
| | |
| It's also one of the hard problems in English. | |
| dylan604 wrote: | |
| If you did it in 2 passes, you could find the cut points
| using silencedetect, use a bunch of -ss/-t/-i based on
| those segments, and apad each segment with a -filter_complex
| chain that ends in concat. It would be a wonderfully
| gnarly command for very little benefit, but it could be
| done.
| pragmatic wrote: | |
| No not really? The talk where he babbles about OSes and | |
| everyone is somehow impressed? | |
| behnamoh wrote: | |
| > His natural talking speed is already >=1.5x that of a normal
| human. He's one of the people you absolutely have to set your
| YouTube speed back down to 1x for, just to follow what's
| going on.
| | |
| I wonder if there's a way to automatically detect how "fast" a
| person talks in an audio file. I know it's subjective and
| different people talk at different paces within an audio, but
| it'd be cool to know when OP's trick fails (they mention 4x
| ruined the output; maybe for Karpathy that would happen at 2x).
| echelon wrote: | |
| > I wonder if there's a way to automatically detect how | |
| "fast" a person talks in an audio file. | |
| | |
| Stupid heuristic: take a segment of video, transcribe text, | |
| count number of words per utterance duration. If you need | |
| speaker diarization, handle speaker utterance durations | |
| independently. You can further slice, such as syllable count, | |
| etc. | |
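|
| With the open-source whisper CLI, jq, and ffprobe installed,
| a crude version of that heuristic (the filename is a
| placeholder):
|
|   whisper sample.m4a --model tiny --output_format json --output_dir .
|   # word count of the transcript
|   jq '.text | split(" ") | length' sample.json
|   # duration of the audio in seconds
|   ffprobe -v error -show_entries format=duration -of csv=p=0 sample.m4a
|
| Dividing the first number by the second gives words per
| second; an unusually high rate suggests a smaller speed-up.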
| nand4011 wrote: | |
| https://www.science.org/doi/10.1126/sciadv.aaw2594 | |
| | |
| Apparently human language conveys information at around 39
| bits/s. You could use a technique similar to that paper's to
| determine the information rate of a speaker and then
| normalize it to 39 bits/s by changing the speed of the video.
| varispeed wrote: | |
| It's a shame platforms don't generally support speeds greater
| than 2x. One of my "superpowers" (or a curse) is that I cannot
| stand a normal speaking pace. When I watch lectures, I always
| go for maximum speed, and even that is too slow for me. I
| wish platforms would include 4x, but done properly (with
| minimal artefacts).
| lofaszvanitt wrote: | |
| Robot in a human body identified :D. | |
| mrmuagi wrote: | |
| All audiobooks are like this for me. I tried it for | |
| lectures but if I'm taking handwritten notes, I can't keep | |
| up my writing. | |
| | |
| I wonder if there are negative side effects of this though:
| do you notice that interacting with people who speak slower
| requires a greater deal of patience?
| colechristensen wrote: | |
| No but a little. I struggle with people who repeat every | |
| point of what they're saying to you several times or when | |
| you say "you told me exactly this the last time we spoke" | |
| they cannot be stopped from retelling the whole thing | |
| verbatim. Usually in those situations, though, there are some
| potential cognitive issues, so you can only be
| understanding.
| hamburglar wrote: | |
| I once attended a live talk by Leslie Lamport and as he | |
| talked, I had the overwhelming feeling that something was | |
| wrong, and was thinking "did he have a stroke or | |
| something?" but then I realized I had just always watched | |
| his lectures online and had become accustomed to | |
| listening to him at 2x. | |
| userbinator wrote: | |
| _I wonder if there are negative side effects of this
| though: do you notice that interacting with people who
| speak slower requires a greater deal of patience?_
| | |
| You are basically training your brain to work faster, and | |
| I suspect that causes some changes in the structure of | |
| your memory; if someone speaks too slowly, I'll be more | |
| likely to forget what they said earlier, compared to if | |
| they quickly gave me the entire sentence. | |
| dpcx wrote: | |
| https://github.com/codebicycle/videospeed has been a | |
| wonderful addition for me. | |
| narratives1 wrote: | |
| I use a Chrome extension that lets you take any video | |
| player (including embedded) to 10x speed. Turn most things | |
| to 3-4x. It works on ads too | |
| munch117 wrote: | |
| I use a bookmarklet: | |
| | |
|   javascript:void%20function(){document.querySelector(%22video,audio%22).playbackRate=parseFloat(prompt(%22Set%20the%20playback%20rate%22))}();
| cookingrobot wrote: | |
| There are fonts designed to be legible at really small
| sizes. I wonder if there are voices that are especially
| understandable at extreme speeds.
|
| Could use an "auctioneer" voice to play back text at 10x
| speed.
| bbatha wrote: | |
| I'm also a fast listener. I find audio quality is the | |
| main differentiator in my ability to listen quickly or | |
| not. A podcast recorded at high quality I can listen to | |
| at 3-4x (with silence trimmed) comfortably; the second
| someone calls in from their phone I'm getting every 4th
| word and often need to go down to 2x or less. Mumbly
| accents are also a driver of quality, though not as much;
| then again, I rarely have trouble understanding difficult
| accents IRL and almost never use subtitles on TV
| shows/YouTube to better understand the speaker. Your
| mileage may vary.
| | |
| I understand 4-6x speakers fairly well but don't enjoy | |
| listening at that pace. If I lose focus for a couple of | |
| seconds I effectively miss a paragraph of context and my | |
| brain can't fill in the missing details. | |
| seabass wrote: | |
| I made a super simplistic chrome extension for this. | |
| Doesn't work on all websites, but YouTube and most online | |
| video courses are covered. | |
| | |
| https://github.com/sebastiansandqvist/video-speed-extension | |
| JadeNB wrote: | |
| Can't you use VLC to watch almost anything streamable, and | |
| then play at your desired speed? | |
| ars wrote: | |
| I use this extension: https://mybrowseraddon.com/video- | |
| speed-control.html | |
| eitally wrote: | |
| Recently, YT started supporting 4x playback for Premium | |
| subscribers, but only in the mobile app, not on the web. | |
| btown wrote: | |
| Even a last-decade transcription model could be used to | |
| detect a rough number of syllables per unit time, and the | |
| accuracy of that model could be used to guide speed-up and | |
| dead-time detection before sending to a more expensive model. | |
| As with all things, it's a question of whether the cost | |
| savings justify the engineering work. | |
| janalsncm wrote: | |
| > I wonder if there's a way to automatically detect how | |
| "fast" a person talks in an audio file | |
| | |
| Transcribe it locally using whisper and output tokens/sec? | |
| maxall4 wrote: | |
| Just count syllables per second by doing an FFT plus some | |
| basic analysis. | |
| tucnak wrote: | |
| > FFT plus some basic analysis | |
| | |
| Yeah, totally easier than `len(transcribe(a))/len(a)` | |
| mrstone wrote: | |
| > I wonder if there's a way to automatically detect how | |
| "fast" a person talks in an audio file. | |
| | |
| Hilbert transform and FFT to get phoneme rate would work. | |
| WalterSear wrote: | |
| Better: just make everyone in the video speak at my | |
| comfortable speed. | |
| dTal wrote: | |
| Compress it using a VBR speech codec and measure the | |
| compression ratio? | |
| brunoborges wrote: | |
| The interesting thing here is that OpenAI likely has a layer | |
| that trims down videos exactly how you suggest, so they can | |
| still charge by the full length while costing less for them to | |
| actually process the content. | |
| cbsmith wrote: | |
| That's an amusing perspective. I really struggle with watching | |
| any video at double speed, but I've never had trouble listening | |
| to any of his talks at 1x. To me, he seems to speak at a | |
| perfectly reasonable pace. | |
| swyx wrote: | |
| > I didn't look at all at the quality of the transcription by | |
| feeding it the shorter version. | |
| | |
| guys how hard is it to toss both versions into like diffchecker
| or something haha you're just comparing text
| TimorousBestie wrote: | |
| Why use diffchecker when there's a perfectly good LLM you | |
| could ask right there? lol | |
| serf wrote: | |
| because a lot of LLMs will just eat tokens to call a | |
| diffchecker. | |
| | |
| really it becomes a question of whether the friction
| of invoking the command or the cost of tokens is greater.
|
| As I get older and more RSI'd, the tokens seem cheaper.
| trashchomper wrote: | |
| Assuming sarcasm but if not, because deterministic vs. | |
| nondeterministic output? | |
| TimorousBestie wrote: | |
| Not sarcasm, just a little joke. I thought the emote at | |
| the end would prevent it from being taken seriously. . . | |
| Der_Einzige wrote: | |
| Make it semi-deterministic with structured/constrained
| generation!
| QuantumGood wrote: | |
| I wish there was a 2.25x YouTube option for "normal" humans. I | |
| already use every shortcut, and listen at 2x 90% of the time. | |
| But Andrej I can't take faster than 1.25x | |
| zamadatix wrote: | |
| YouTube ran an experiment with up to 4x playback on mobile | |
| (???) but it went away in February. I get a lot of the | |
| experiments they do being experiments but why just allowing | |
| the slider to go farther is such a back and forth hoopla is | |
| beyond me. It's one of the oft touted features of 3rd party | |
| apps and extensions with nearly 0 UI impact to those who | |
| don't want to use it (just don't slide the slider past 2x if | |
| you don't want past 2x). | |
| | |
| https://www.theverge.com/news/603581/youtube-premium- | |
| experim... | |
| K2L8M11N2 wrote: | |
| As a premium subscriber I currently have 4x available on | |
| Android and they recently (in the last month) added it to | |
| web too | |
| zelphirkalt wrote: | |
| Probably because they are "A/B testing" things that do
| not really show much effect, or that depend on more
| circumstances than they care to eliminate, and then they
| overinterpret the results. Like almost all corporate A/B
| testing.
| ars wrote: | |
| Install this: https://mybrowseraddon.com/video-speed- | |
| control.html | |
| | |
| I listen to a lot of videos on 3 or even 4x. | |
| david_allison wrote: | |
| I have up to 4x (in steps of 0.05) with YouTube Premium on | |
| Android | |
| zahlman wrote: | |
| Meanwhile, I've found that just reading the transcript is | |
| often good enough. | |
| nickjj wrote: | |
| Andrej's talk seemed normal to listen at 2x but I've also | |
| listened to everything at 2x for a long time. | |
| | |
| Unfortunately, a byproduct of listening to everything at 2x is
| I've had a number of folks say they have to watch my videos at
| 0.75x, but even when I play back my own videos it feels
| painfully slow unless it's 2x.
| | |
| For reference I've always found John Carmack's pacing perfect / | |
| natural and watchable at 2x too. | |
| | |
| A recent video of mine is
| https://www.youtube.com/watch?v=pL-qft1ykek. It was posted on
| HN by someone else the other day so
| I'm not trying to do any self promotion here, it's just an | |
| example of a recent video I put up and am generally curious if | |
| anyone finds that too fast or it's normal. It's a regular | |
| unscripted video where I have a rough idea of what I want to | |
| cover and then turn on the mic, start recording and let it pan | |
| out organically. If I had to guess I'd say the last ~250-300 | |
| videos were recorded this way. | |
| noahjk wrote: | |
| To me you talk at what I would consider "1.2x" of podcast | |
| speed (which to me is a decent average measure of spoken word | |
| speed - I usually do 1.5x on all podcasts). You're definitely | |
| still in the normal distribution for tech YouTubers, in my | |
| experience - in fact it feels like a lot of tech YouTubers
| talk like they've had a bit too much Adderall, but you don't
| come off that way. Naturally people may choose to slow down | |
| tutorials, because the person giving the tutorial can never | |
| truly understand what someone learning would or wouldn't | |
| understand. So overall I think your speed is totally fine! | |
| Also, very timely video, I was interested in the exact topic, | |
| so I'm happy I found this. | |
| eru wrote: | |
| > "[I]n fact it feels like a lot of tech YouTube talks like | |
| they've had a bit too much adderall, [...]" | |
| | |
| Funnily enough, if you actually have ADHD, then stimulants
| like Adderall or even nicotine will calm you down.
| | |
| > Naturally people may choose to slow down tutorials, [...] | |
| | |
| For me it also depends on what mood I'm in and whether I'm | |
| doing anything else at the same time. If I'm fully | |
| concentrating on a video, 2x is often fine. If I'm doing | |
| some physical task at the same time, I need it slower than | |
| that. | |
| | |
| If I'm doing a mental task at the same time, I can forget
| about getting anything out of the video. At least, if the
| mental task involves any words. So e.g. I could probably still
| follow along with a technical discussion at roughly 1x speed
| while playing Tetris, but not while coding.
| Tyr42 wrote: | |
| Driving is a hard 1.0 for me. But otherwise 2.0 is good. | |
| SavioMak wrote: | |
| Yeah, you sound around 1.25-1.5x the speed of the average
| video I watch.
| viraptor wrote: | |
| > Andrej's talk seemed normal to listen at 2x but I've also | |
| listened to everything at 2x for a long time. | |
| | |
| We get used to higher speeds when we consume a lot of content | |
| that way. Have you heard the systems used by experienced | |
| blind people? I cannot even understand the words in them, but | |
| months of training would probably fix that. | |
| userbinator wrote: | |
| You can achieve a similar, less permanent effect by closing | |
| your eyes; I often do it when I'm on a call and the person | |
| on the other end is extremely difficult to understand. | |
| userbinator wrote: | |
| _but even when I play back my own videos it feels painfully
| slow unless it's 2x._
| | |
| Watching your video at 1x still feels too slow, and it's just | |
| right for me at 2x speed (that's approximately how fast I | |
| normally talk if others don't tell me to slow down), although | |
| my usual YouTube watching speed is closer to 2.5-3x. That is | |
| to say, you're still faster than a lot of others. | |
| | |
| I think it just takes practice --- I started at around 1.25x | |
| for videos, and slowly moved up from there. As you have | |
| noticed, once you've consumed enough sped-up content, your | |
| own speaking speed will also naturally increase. | |
| fuzztester wrote: | |
| James Goodnight of SAS Institute: | |
| | |
| https://en.m.wikipedia.org/wiki/James_Goodnight | |
| | |
| I have watched one or two videos of his, and he spoke slowly, | |
| compared to the average person. I liked that. It sounded | |
| good. | |
| makeitdouble wrote: | |
| Your video sounded a tad fast at 2x and pretty fine at 1.5. | |
| | |
| Now I think speed adjustment comes less from the natural
| speaking pace of the person than from the subject matter.
| | |
| I'm thinking of a channel like Accented Cinema | |
| (https://youtu.be/hfruMPONaYg), with a slowish talking pace, | |
| but as there's all the visual part going on at all times, it | |
| actually doesn't feel slow to my ear. | |
| | |
| I felt the same for videos explaining concepts I have no
| familiarity with, so I see it as how fast the brain can process
| the info, more than the talking speed per se.
| retsibsi wrote: | |
| Your speaking speed is noticeably faster than usual, but I | |
| think it's good for this kind of video. When the content is | |
| really dense and every word is chosen for maximum information | |
| value, a slower speed would be good, but for relatively | |
| natural speech with a normal amount of redundancy I think | |
| it's fine to go at this speed. | |
| quietbritishjim wrote: | |
| Your actual speed of talking sounds a little faster than | |
| average but not notably so. | |
| | |
| But it _feels_ (very subjectively) faster to me than usual
| because you don't really seem to take any pauses. It's like
| the whole video is a single run-on sentence that I keep | |
| buffering, but I never get a chance to process it and flush | |
| the buffer. | |
| fortran77 wrote: | |
| I always listen to YouTube and podcasts at 1.5. And when I | |
| meet a YouTuber/podcaster IRL, I'm always annoyed at how slow | |
| they speak. | |
| Der_Einzige wrote: | |
| This btw is also why spreading (speed reading) happens in | |
| American competitive debate. This gets ridiculed online but | |
| it's exactly why it happens. | |
| | |
| https://en.wikipedia.org/wiki/Spreading_(debate) | |
| hooverd wrote: | |
| They should put an upper WPM on competitive debate, like F1 | |
| does with certain car parts. | |
| jwrallie wrote: | |
| From my own experience with whisper.cpp, normalizing the audio
| and removing silence not only shortens the processing time
| significantly, but also greatly increases the quality of the
| transcription, as silence can mean hallucinations. You can do
| that graphically with Audacity too, if you do not want to deal
| with the command line. You also do not need any special
| hardware to run whisper.cpp: with the small model, literally
| any computer should be able to do it if you can wait a bit
| (less than the audio length).
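|
| For reference, the pipeline is something like this (the
| whisper.cpp binary name and model path vary by build and
| version):
|
|   # Normalize loudness, strip long silences, resample to the
|   # 16 kHz mono WAV whisper.cpp expects
|   ffmpeg -i meeting.m4a \
|     -af "loudnorm,silenceremove=start_periods=1:stop_periods=-1:stop_duration=0.3:stop_threshold=-40dB" \
|     -ar 16000 -ac 1 meeting-clean.wav
|
|   ./whisper-cli -m models/ggml-small.bin -f meeting-clean.wav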
| | |
| One half interesting / half depressing observation I made is | |
| that at my workplace any meeting recording I tried to | |
| transcribe in this way had its length reduced to almost 2/3 | |
| when cutting off the silence. Makes you think about the | |
| efficiency (or lack of it) of holding long(ish) meetings. | |
| d1sxeyes wrote: | |
| 1/3 of the meeting is silence? That's a good thing. It's
| allowing people time to think over what they're hearing;
| there are pauses to allow people to contribute or
| participate. What do you think a better percentage of silent
| time would be?
| jwrallie wrote: | |
| Good point, somehow if I think of a 30 minutes meeting, 10 | |
| minutes of silence sounds great, but seeing a 1 hour block | |
| disappear from a 3 hour recording makes me want to use that | |
| "free" hour to do something else. | |
| | |
| Well, I don't think silence is the real problem with a
| 3-hour meeting!
| literalAardvark wrote: | |
| If people could speak continuously for an entire meeting,
| then that meeting would be better off as an email.
| Meetings are for bouncing half-formed ideas around and
| coagulating them into something greater.
| | |
| There MUST be time to think | |
| sudhirj wrote: | |
| If a human meeting had a lot of silence (assuming it's between
| words and not before/after), I would consider it a very
| efficient meeting, where there was just enough information
| exchanged with adequate absorption, processing and response
| time.
| dogprez wrote: | |
| Others pointed out the value of silence, but I just wanted to | |
| say it saddens me when humanity is misclassified as | |
| inefficiency. The other day Sam Altman made a jest about how | |
| much energy is wasted by people saying "thanks" to ChatGPT.
| The corollary is how much human energy is wasted on humans | |
| saying thanks to each other. When making a judgement about | |
| inefficiency one is making a judgement on what is valuable, a | |
| very biased judgement that isn't necessarily aligned with | |
| what makes us thrive. =) (<-- a wasteful smiley) | |
| kristianbrigman wrote: | |
| I'll remember that you told me thanks. Will chatgpt? | |
| (Honestly curious... it's possible) | |
| Salgat wrote: | |
| I say thanks for my own well-being too. | |
| rz2k wrote: | |
| I get the impression that it sets a tone that encourages | |
| creative, more open ended responses. | |
| | |
| I think this is the reverse of confrontation with the | |
| LLM. Typically if you get a really dumb response, it is | |
| better to hang up the conversation and completely start | |
| over than it is to tell the LLM why it is wrong. Once you | |
| start arguing, they start getting stupider and respond | |
| with even faultier logic as they try to appease you. | |
| | |
| I suppose it makes sense if the training involves | |
| alternate models of discourse resembling two educated | |
| people in a forum with shared intellectual curiosity and | |
| a common goal, or two people having a ridiculous internet | |
| argument. | |
| Philip-J-Fry wrote: | |
| Well, humans saying thanks to each other isn't wasted
| energy. It has a real effect on our relationships.
|
| People say thank you to AI because they are portrayed as
| human-like chat bots, but in reality it has almost no
| effect on their ability to respond to our queries.
| | |
| Saying thank you to ChatGPT is no less wasteful than saying | |
| thank you to Windows for opening the calculator. | |
| | |
| I don't think anyone is trying to draw any parallels | |
| between that inefficiency and real humans saying thank you? | |
| mulmen wrote: | |
| Humans _are_ inefficient. The mistake is making a moral | |
| judgement about that. | |
| vayup wrote: | |
| Gemini charges by tokens rather than minutes. I used VAD to
| trim silence, hoping the token count would go down. I noticed
| the token count wasn't much different (e.g. 30 seconds of
| background noise had the same count as 2s of background
| noise). Either the Gemini API trims silence under the hood, or
| tokenization depends on speech content rather than
| length. Not sure which.
| | |
| In either case, I bet OpenAI is doing the same optimization | |
| under the hood and keeping the savings for themselves. | |
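|
| One way to check how much trimmable silence a file even has is
| ffmpeg's silencedetect (the threshold and minimum duration
| here are guesses):
|
|   ffmpeg -i input.m4a -af "silencedetect=noise=-40dB:d=1" \
|     -f null - 2>&1 | grep silence_duration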
| CSMastermind wrote: | |
| > to set your YouTube speed back down to 1x | |
| | |
| Is it common for people to watch YouTube sped up?
| | |
| I've heard of people doing this for podcasts and audiobooks and | |
| never understood it all that much there. Just feels like | |
| 'skimming' a real book instead of actually reading it. | |
| Feathercrown wrote: | |
| Some people talk slower than your natural listening speed. | |
| It's less like skimming and more like if some books used 36pt | |
| font and you normalized the size back down to a comfortable | |
| information-dense size. | |
| Eezee wrote: | |
| That's completely different. Imagine you are reading a book | |
| and the words only get revealed to you at 1 word a second. | |
| You would get annoyed if your natural reading speed was | |
| higher than that. | |
| | |
| Same with a video. A lot of people speak considerably slower | |
| than you could process the information they are conveying, so | |
| you speed it up. You still get the same content and are not | |
| skipping parts as you would when skimming a book. | |
| keithxm23 wrote: | |
| Often, I'll come across speakers who just speak slowly and | |
| listening at 1.5x or 2x barely feels sped-up. | |
| | |
| Additionally, the brain tends to adjust to a faster talking | |
| speed very quickly. If I'm watching an average-paced person | |
| talk and speed them up by 2x, the first couple minutes of | |
| listening might be difficult and will require more intent- | |
| listening. However, the brain starts processing it as the new | |
| normal and it does not feel sped-up anymore. To the extent | |
| that if I go back to 1x, it feels like the speaker is way too | |
| slow. | |
| 83 wrote: | |
| >>Just feels like 'skimming' a real book instead of actually | |
| reading it. | |
| | |
| That's the goal for me lately. I primarily use Youtube for | |
| technical assistance (where are the screws to adjust this | |
| carburetor?, how do I remove this brake hub?, etc). There | |
| used to be short 1 to 2m videos on this kind of stuff but | |
| nowadays I have to suffer through a 10-15 minute video with | |
| multiple ad breaks. | |
| | |
| So now I always watch youtube at 2x speed while rapidly | |
| jumping the slider forward to find relevant portions. | |
| babuloseo wrote: | |
| I use the YouTube trick, will share it here: upload to
| YouTube and use their built-in transcription service to
| convert the audio to text for you, and then use Gemini Pro 2.5
| to rebuild the transcript.
| | |
|   ffmpeg \
|     -f lavfi \
|     -i color=c=black:s=1920x1080:r=5 \
|     -i file_you_want_transcripted.wav \
|     -c:v libx264 \
|     -preset medium \
|     -tune stillimage \
|     -crf 28 \
|     -c:a aac \
|     -b:a 192k \
|     -pix_fmt yuv420p \
|     -shortest \
|     file_you_upload_to_youtube_for_free_transcripts.mp4
| | |
| This works VERY well for my needs. | |
| KTibow wrote: | |
| This is really interesting, although the cheapest route is still | |
| to use an alternative audio-compatible LLM (Gemini 2.0 Flash | |
| Lite, Phi 4 Multimodal) or an alternative host for Whisper | |
| (Deepinfra, Fal). | |
| fallinditch wrote: | |
| When extracting transcripts from YouTube videos, can anyone give | |
| advice on the best (cost effective, quick, accurate) way to do | |
| this? | |
| | |
| I'm confused because I read in various places that the YouTube | |
| API doesn't provide access to transcripts ... so how do all these | |
| YouTube transcript extractor services do it? | |
| | |
| I want to build my own YouTube summarizer app. Any advice and | |
| info on this topic greatly appreciated! | |
| vjerancrnjak wrote: | |
| If YouTube generated automatic captions, you can download them
| free of charge with yt-dlp.
| rob wrote: | |
| There's a tool that uses YouTube's unofficial APIs to get them | |
| if they're available: | |
| | |
| https://github.com/jdepoix/youtube-transcript-api | |
| | |
| For our internal tool that transcribes local city council | |
| meetings on YouTube (often 1-3 hours long), we found that these | |
| automatic ones were never available though. | |
| | |
| (Our tool usually 'processes' the videos within ~5-30 mins of | |
| being uploaded, so that's also why none are probably available | |
| 'officially' yet.) | |
| | |
| So we use yt-dlp to download the highest quality audio and then | |
| process them with whisper via Groq, which is way cheaper | |
| (~$0.02-0.04/hr with Groq compared to $0.36/hr via OpenAI's | |
| API.) Sometimes Groq errors out, so there's built-in support
| for Replicate and Deepgram as well.
| | |
| We run yt-dlp on our remote Linode server and I have a Python | |
| script I created that will automatically login to YouTube with | |
| a "clean" account and extract the proper cookies.txt file, and | |
| we also generate a 'po token' using another tool: | |
| | |
| https://github.com/iv-org/youtube-trusted-session-generator | |
| | |
| Both cookies.txt and the "po token" get passed to yt-dlp when | |
| running on the Linode server and I haven't had to re-generate | |
| anything in over a month. Runs smoothly every day. | |
| | |
| (Note that I don't use cookies/po_token when running locally at | |
| home, it usually works fine there.) | |
| fallinditch wrote: | |
| Very useful, thanks. So does this mean that every month or so | |
| you have to create a new 'clean' YouTube account and use that | |
| to create new po_token/cookies? | |
| | |
| It's frustrating to have to jump through all these hoops just | |
| to extract transcripts when the YouTube Data API already | |
| gives reasonable limits to free API calls ... would be nice | |
| if they allowed transcripts too. | |
| | |
| Do you think the various YouTube transcript extractor | |
| services all follow a similar method as yours? | |
| banana_giraffe wrote: | |
| You can use yt-dlp to get the transcripts. For instance, to
| grab just the transcript of a video:
|
|   ./yt-dlp --skip-download --write-sub --write-auto-sub \
|     --sub-lang en --sub-format json3 <youtube video URL>
| | |
| You can also feed the same command a playlist or channel URL | |
| and it'll run through and grab all the transcripts for each | |
| video in the playlist or channel. | |
| fallinditch wrote: | |
| That's cool, thanks for the info. But do you also have to use | |
| a rotating proxy to prevent YouTube from blocking your IP | |
| address? | |
| banana_giraffe wrote: | |
| Last time I ran this at scale was a couple of months ago, | |
| so my information is no doubt out of date, but in my | |
| experience, YouTube seems less concerned about this than | |
| they are when you're grabbing lots of videos. | |
| | |
| But that was a few months ago, so for all I know they've | |
| tightened down more hatches since then. | |
| topaz0 wrote: | |
| I have a way that is (all but) free -- just watch the video if | |
| you care about it, or decide not to if you don't, and move on | |
| with your life. | |
| Tepix wrote: | |
| Why would you give up your privacy by sending what interests
| you to OpenAI when Whisper doesn't need that much compute in
| the first place?
|
| With faster-whisper (int8, batch=8) you can transcribe 13
| minutes of audio in 51 seconds _on CPU_.
| anigbrowl wrote: | |
| I came here to ask the same question. This is a well-solved
| problem; red-queen racing it seems utterly pointless, a
| symptom of reflexive adversarialism.
| poly2it wrote: | |
| > symptom of reflexive adversarialism | |
| | |
| Is there a definition for this expression? I don't follow.
| | |
| > ... using corporate technology for the solved problem is a | |
| symptom of self-directed skepticism by the user against the | |
| corporate institutions ... | |
| | |
| Eh? | |
| ProllyInfamous wrote: | |
| I am a blue collar electrician. Not a coder (but definitely | |
| geeky). | |
| | |
| Whisper works quite well on Apple Silicon with simple drag/drop | |
| install (i.e. no terminal commands). Program is free; you can | |
| get an M4 mini for ~$550; don't see how an online platform can | |
| even compete with this, except for one-off customers (i.e. not | |
| great repeat customers). | |
| | |
| We used it to transcribe _ddaayyss_ of audio microcassettes | |
| which my mother had made during her lifetime. Whisper.app even | |
| transcribed a few hours that are difficult to comprehend as a | |
| human listener. It is _VERY_ fast. | |
| | |
| I've used the text to search for timestamps worth listening to, | |
| skipping most dead-space (e.g. she made most while driving, in | |
| a stream of not-always-focused consciousness). | |
| pimlottc wrote: | |
| Appreciated the concise summary + code snippet upfront, followed | |
| by more detail and background for those interested. More articles | |
| should be written this way! | |
| rob wrote: | |
| For anybody trying to do this in bulk, instead of using OpenAI's | |
| whisper via their API, you can also use Groq [0] which is much | |
| cheaper: | |
| | |
| [0] https://groq.com/pricing/ | |
| | |
| Groq is ~$0.02/hr with distil-large-v3, or ~$0.04/hr with | |
| whisper-large-v3-turbo. I believe OpenAI comes out to like | |
| ~$0.36/hr. | |
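|
| Groq's endpoint is OpenAI-compatible, so the swap is basically
| just the base URL and the model name (a sketch from memory;
| check their docs for current details):
|
|   curl https://api.groq.com/openai/v1/audio/transcriptions \
|     -H "Authorization: Bearer $GROQ_API_KEY" \
|     -F file=@meeting.m4a \
|     -F model=whisper-large-v3-turbo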
| | |
| We do this internally with our tool that automatically | |
| transcribes local government council meetings right when they get | |
| uploaded to YouTube. It uses Groq by default, but I also added | |
| support for Replicate and Deepgram as backups because sometimes | |
| Groq errors out. | |
| georgemandis wrote: | |
| Interesting! At $0.02 to $0.04 an hour I don't suspect you've | |
| been hunting for optimizations, but I wonder if this "speed up | |
| the audio" trick would save you even more. | |
| | |
| > We do this internally with our tool that automatically | |
| transcribes local government council meetings right when they | |
| get uploaded to YouTube | |
| | |
| Doesn't YouTube do this for you automatically these days within | |
| a day or so? | |
| rob wrote: | |
| > Doesn't YouTube do this for you automatically these days | |
| within a day or so? | |
| | |
| Oh yeah, we do a check first and use youtube-transcript-api | |
| if there's an automatic one available: | |
| | |
| https://github.com/jdepoix/youtube-transcript-api | |
| | |
| The tool usually detects them within like ~5 mins of being | |
| uploaded though, so usually none are available yet. Then | |
| it'll send the summaries to our internal Slack channel for | |
| our editors, in case there's anything interesting to 'follow | |
| up on' from the meeting. | |
| | |
| Probably would be a good idea to add a delay to it and wait | |
| for the automatic ones though :) | |
| jerjerjer wrote: | |
| > I wonder if this "speed up the audio" trick would save you | |
| even more. | |
| | |
| At this point you'll need to at least check how much running | |
| ffmpeg costs. Probably less than $0.01 per hour of audio | |
| (approximate savings) but still. | |
| ks2048 wrote: | |
| > Doesn't YouTube do this for you automatically these days | |
| within a day or so? | |
| | |
| Last time I checked, I think the Google auto-captions were | |
| noticeably worse quality than whisper, but maybe that has | |
| changed. | |
| colechristensen wrote: | |
| If you have a recent macbook you can run the same whisper model | |
| locally for free. People are really sleeping on how cheap the | |
| compute you own hardware for already is. | |
| rob wrote: | |
| I don't. I have a MacBook Pro from 2019 with an Intel chip | |
| and 16 GB of memory. Pretty sure when I tried the large | |
| whisper model it took like 30 minutes to an hour to do | |
| something that took hardly any time via Groq. It's been a | |
| while though so maybe my times are off. | |
| colechristensen wrote: | |
| Ah, no, Apple silicon Mac required with a decent amount of | |
| memory. But this kind of machine has been very common (a | |
| mid to high range recent macbook) at all of my employers | |
| for a long time. | |
| fragmede wrote: | |
| It's been roughly six years since that MacBook was top of | |
| the line, so your times are definitely off. | |
| likium wrote: | |
| What tool do you use? | |
| pzo wrote: | |
| There is also Cloudflare Workers AI, where you can run
| whisper-large-v3-turbo for around $0.03 per hour:
| | |
| https://developers.cloudflare.com/workers-ai/models/whisper-... | |
| abidlabs wrote: | |
| You could use Hugging Face's Inference API (which supports all
| of these API providers) directly, making it easier to switch
| between them; e.g. look at the panel on the right on:
| https://huggingface.co/openai/whisper-large-v3
| BrunoJo wrote: | |
| Let me know if you are interested in a more reliable | |
| transcription API. I'm building Lemonfox.ai and we've optimized | |
| our transcription API to be highly available and very fast for | |
| large files. Happy to give you a discount (email: bruno at | |
| lemonfox.ai) | |
| stogot wrote: | |
| Love this idea, but the accuracy section is lacking. Couldn't
| you do a simple diff of the outputs and see how many
| differences there are? 0.5% or 5%?
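|
| A word-level diff would be a quick first pass, e.g. with git
| (wdiff would work too; the filenames are placeholders):
|
|   # Word-level diff of the 1x and 3x transcripts
|   git diff --no-index --word-diff transcript-1x.txt transcript-3x.txt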
| georgemandis wrote: | |
| Yeah, I'd like to do a more formal analysis of the outputs if I | |
| can carve out the time. | |
| | |
| I don't think a simple diff is the way to go, at least for what | |
| I'm interested in. What I care about more is the overall | |
| accuracy of the summary--not the word-for-word transcription. | |
| | |
| The test I want to set up is using LLMs to evaluate the
| summarized output and see if the primary themes/topics persist.
| That's more interesting and useful to me for this exercise.
| tmaly wrote: | |
| The whisper model weights are free. You could save even more by | |
| just using them locally. | |
| pzo wrote: | |
| but this is still great trick if you want to reduce latency or | |
| inference speed even with local models e.g. in realtime chatbot | |
| 55555 wrote: | |
| This seems like a good place for me to complain about the fact | |
| that the automatically generated subtitle files Youtube creates | |
| are horribly malformed. Every sentence is repeated twice. In many | |
| subtitle files, the subtitle timestamp ranges overlap one another | |
| while also repeating every sentence twice in two different | |
| ranges. It's absolutely bizarre and has been like this for years | |
| or possibly forever. Here's an example - I apologize that it's | |
| not in English. I don't know if this issue affects English. | |
| https://pastebin.com/raw/LTBps80F | |
| xenator wrote: | |
| Seems like Thai. Thai translation and recognition is like 10 | |
| years ago comparing to other languages I'm dealing with in my | |
| everyday life. Good news tho is the same level was for Russian | |
| years ago, and now it is near perfect. | |
| 55555 wrote: | |
| Well the weird thing is honestly their speech to text | |
| recognizes 97% of words correctly. The subtitle content is | |
| pretty perfect. It's just the formatting that's awful. | |
| amelius wrote: | |
| Solution: charge by number of characters generated. | |
| dataviz1000 wrote: | |
| I built a Chrome extension with one feature that transcribes | |
| audio to text in the browser using huggingface/transformers.js | |
| running the OpenAI Whisper model with WebGPU. It works perfect! | |
| Here is a list of examples of all the things you can do in the | |
| browser with webgpu for free. [0] | |
| | |
| The last thing in the world I want to do is listen or watch | |
| presidential social media posts, but, on the other hand, | |
| sometimes enormously stupid things are said which move the SP500 | |
| up or down $60 in a session. So this feature queries for new | |
| posts every minute, does ORC image to text and transcribe video | |
| audio to text locally, sends the post with text for analysis, all | |
| in the background inside a Chrome extension before notify me of | |
| anything economically significant. | |
| | |
| [0] | |
| https://github.com/huggingface/transformers.js/tree/main/exa... | |
| | |
| [1] https://github.com/adam-s/doomberg-terminal | |
| kgc wrote: | |
| Impressive | |
| karpathy wrote: | |
| Omg long post. TLDR from an LLM for anyone interested | |
| | |
| Speed your audio up 2-3x with ffmpeg before sending it to | |
| OpenAI's gpt-4o-transcribe: the shorter file uses fewer input- | |
| tokens, cuts costs by roughly a third, and processes faster with | |
| little quality loss (4x is too fast). A sample yt-dlp - ffmpeg - | |
| curl script shows the workflow. | |
| | |
| ;) | |
| bravesoul2 wrote: | |
| This is the sort of content I want to see in Tweets and | |
| LinkedIn posts. | |
| | |
| I have been thinking for a while how do you make good use of | |
| the short space in those places. | |
| | |
| LLM did well here. | |
| georgemandis wrote: | |
| Hahaha. Okay, okay... I will watch it now ;) | |
| | |
| (Thanks for your good sense of humor) | |
| karpathy wrote: | |
| I like that your post deliberately gets to the point first | |
| and then (optionally) expands later, I think it's a good and | |
| generally underutilized format. I often advise people to | |
| structure their emails in the same way, e.g. first just | |
| cutting to the chase with the specific ask, then giving more | |
| context optionally below. | |
| | |
| It's not my intention to bloat information or delivery but I | |
| also don't super know how to follow this format especially in | |
| this kind of talk. Because it's not so much about relaying | |
| specific information (like your final script here), but more | |
| as a collection of prompts back to the audience as things to | |
| think about. | |
| | |
| My companion tweet to this video on X had a brief | |
| TLDR/Summary included where I tried, but I didn't super think | |
| it was very reflective of the talk, it was more about topics | |
| covered. | |
| | |
| Anyway, I am overall a big fan of doing more compute at the | |
| "creation time" to compress other people's time during | |
| "consumption time" and I think it's the respectful and kind | |
| thing to do. | |
| georgemandis wrote: | |
| I watched your talk. There are so many more interesting | |
| ideas in there that resonated with me that the summary | |
| (unsurprisingly) skipped over. I'm glad I watched it! | |
| | |
| LLMs as the operating system, the way you interface with | |
| vibe-coding (smaller chunks) and the idea that maybe we | |
| haven't found the "GUI for AI" yet are all things I've | |
| pondered and discussed with people. You articulated them | |
| well. | |
| | |
| I think some formats, like a talk, don't lend themselves | |
| easily to meaningful summaries. It's about giving the | |
| audience things to think about, to your point. It's the sum | |
| of storytelling that's more than the whole and why we still | |
| do it. | |
| | |
| My post is, at the end of the day, really more about a neat | |
| trick to optimize transcriptions. This particular video | |
| might be a great example of why you may not always want to | |
| do that :) | |
| | |
| Anyway, thanks for the time and thanks for the talk! | |
| mh- wrote: | |
| _> I often advise people to structure their emails [..]_ | |
| | |
| I frequently do the same, and eventually someone sent me | |
| this HBR article summarizing the concept nicely as "bottom | |
| line up front". It's a good primer for those interested. | |
| | |
| https://hbr.org/2016/11/how-to-write-email-with-military- | |
| pre... | |
| lordspace wrote: | |
| that's a really good summary :) | |
| xg15 wrote: | |
| That's really cool! Also, isn't this effectively the same as | |
| supplying audio with a sampling rate of 8kHz instead of the 16kHz | |
| that the model is supposed to work with? | |
| anshumankmr wrote: | |
| Someone should try transcribing Eminem's Rap god with this trick. | |
| alok-g wrote: | |
| >> by jumping straight to the point ... | |
| | |
| Love this! I wish more authors follow this approach. So many | |
| articles keep going all over the place before 'the point' | |
| appears. | |
| | |
| If trying, perhaps some 50% of the authors may realize that they | |
| don't _have_ a point. | |
| pknerd wrote: | |
| I guess it'd work even if you make it 2.5 or evebn 3x. | |
| donkey_brains wrote: | |
| Hmm...doesn't this technique effectively make the minute longer, | |
| not shorter? Because you can pack more speech into a minute of | |
| recording? Seems like making a minute shorter would be | |
| counterproductive. | |
| StochasticLi wrote: | |
| No. You're paying for a minute of audio, which will be more | |
| packed with speech, not for how long it's being computed. | |
| impossiblefork wrote: | |
| Make the minutes longer, you mean. | |
| pbbakkum wrote: | |
| This is great, thank you for sharing. I work on these APIs at | |
| OpenAI, it's a surprise to me that it still works reasonably well | |
| at 2/3x speed, but on the other hand for phone channels we get | |
| 8khz audio that is upsampled to 24khz for the model and it still | |
| works well. Note there's probably a measurable decrease in | |
| transcription accuracy that worsens as you deviate from 1x speed. | |
| Also we really need to support bigger/longer file uploads :) | |
| nerder92 wrote: | |
| Quick Feedback: Would it be cool to research this internally | |
| and maybe find a sweet spot in speed multiplier where the loss | |
| is minimal. This pre-processing is quite cheap and could bring | |
| down the API price eventually. | |
| georgemandis wrote: | |
| I kind of want to take a more proper poke at this but focus | |
| more one summarization accuracy over word-for-word accuracy, | |
| though I see the value in both. | |
| | |
| I'm actually curious, if I run transcriptions back-to-back-to- | |
| back on the exact same audio, how much variance should I | |
| expect? | |
| | |
| Maybe I'll try three approaches: | |
| | |
| - A straight diff comparison (I know a lot of people are | |
| calling for this, but I really think this is less useful than | |
| it sounds) | |
| | |
| - A "variance within the modal" test running it multiple times | |
| against the same audio, tracking how much it varies between | |
| runs | |
| | |
| - An LLM analysis assessing if the primary points from a talk | |
| were captured and summarized at 1x, 2x, 3x, 4x runs (I think | |
| this is far more useful and interesting) | |
| celltalk wrote: | |
| With this logic, you should also be able to trim the parts that | |
| doesn't have words. Just add a cut-off for db, and trim the video | |
| before transcription. | |
| | |
| Possibly another 10-20% gain? | |
| isubkhankulov wrote: | |
| Transcripts get much more valuable when one diarizes the audio | |
| beforehand to determine which speaker said what. | |
| | |
| I use this free tool to extract those and dump the transcripts | |
| into a LLM with basic prompts: https://contentflow.megalabs.co | |
| mt_ wrote: | |
| You can just dump the youtube link video in Google AI studio and | |
| ask it to transcribe the video with speaker labels and even ask | |
| it it to add useful visual clues, because the model is multimodal | |
| for video too. | |
| MaxDPS wrote: | |
| Can I ask what you mean by "useful visual clues"? | |
| mt_ wrote: | |
| What is the speaker showcasing in its slides, what is it's | |
| body language and so on. | |
| cprayingmantis wrote: | |
| I noticed something similar with images as inputs to Claude, you | |
| can scale down the images and still get good outputs. There is an | |
| accuracy drop off at a certain point but the token savings are | |
| worth doing a little tuning there. | |
| georgemandis wrote: | |
| Definitely in the same spirit! | |
| | |
| Clearly the next thing we need to test is removing all the | |
| vowels from words, or something like that :) | |
| meerab wrote: | |
| Interesting approach to transcript generation! | |
| | |
| I'm implementing a similar workflow for VideoToBe.com | |
| | |
| My Current Pipeline: | |
| | |
| Media Extraction - yt-dlp for reliable video/audio downloads | |
| Local Transcription - OpenAI Whisper running on my own hardware | |
| (no API costs) Storage & UI - Transcripts stored in S3 with a | |
| custom web interface for viewing | |
| | |
| Y Combinator playlist | |
| https://videotobe.com/play/playlist/ycombinator | |
| | |
| and Andrej's talk is | |
| https://videotobe.com/play/youtube/LCEmiRjPEtQ | |
| | |
| After reading your blog post, I will be testing effect on | |
| speeding audio for locally-hosted Whisper models. Running Whisper | |
| locally eliminates the ongoing cost concerns since my | |
| infrastructure is already a sunk cost. Speeding audio could be an | |
| interesting performance enhancement to explore! | |
| fuzztester wrote: | |
| Stop being slaves of extorters of any kind, and just leave. | |
| | |
| there is tons of this happening everywhere, and we need to fight | |
| this, and boycott it. | |
| pottertheotter wrote: | |
| You can just ask Gemini to summarize it for you. It's free. I do | |
| it all the time with YouTube videos. | |
| | |
| Or you can just copy the transcript that YouTube provides below | |
| the video. | |
| BrunoJo wrote: | |
| If you look for a cheaper transcription API you could als use | |
| https://Lemonfox.ai. We've optimized the API for long audio files | |
| and are much faster and cheaper than OpenAI. | |
| conjecTech wrote: | |
| If you are hosting whisper yourself, you can do something | |
| slightly more elegant, but with the same effect. You can | |
| downsample/pool the context 2:1 (or potentially more) a few | |
| layers into the encoder. That allows you to do the equivalent of | |
| speeding up audio without worry about potential spectral losses. | |
| For whisper large v3, that gets you nearly double throughput in | |
| exchange for a relative ~4% WER increase. | |
| nomercy400 wrote: | |
| Do you have more details or examples on how to downsample the | |
| context in the encoder? I treat the encoder as an opaque block, | |
| so I have no idea where to start. | |
| PeterStuer wrote: | |
| I wonder how much time and battery | |
| transcoding/uploading/downloading over coffeeshop wifi would | |
| realy save vs just running it locally through optimized Whisper. | |
| georgemandis wrote: | |
| I had this same thought and won't pretend my fear was rational, | |
| haha. | |
| | |
| One thing that I thought was fairly clear in my write-up but | |
| feels a little lost in the comments: I didn't just try this | |
| with whisper. I tried it with their newer gpt-4o-transcription | |
| model, which seems considerably faster. There's no way to run | |
| that one locally. | |
| KPennig86852 wrote: | |
| But you know that you can run OpenAI's Whisper audio recognition | |
| model locally for free, right? It has very little GPU | |
| requirements, and the new "turbo" model works quite fast (there | |
| are also several Python libraries which make it significantly | |
| faster still). | |
| dajonker wrote: | |
| Gemini 2.5 pro is, in my usage, quite superior for high quality | |
| transcriptions of phone calls, in Dutch in my case. As long as | |
| you upload the audio to GCS there you can easily process | |
| conversations of over an hour. It correctly identified and | |
| labeled speakers. | |
| | |
| The cheaper 2.5 flash made noticeably more mistakes, for example | |
| it didn't correctly output numbers while the Pro model did. | |
| | |
| As for OpenAI, their gpt-4o-transcribe model did worse than 2.5 | |
| flash, completely messing up names of places and/or people. Plus | |
| it doesn't label the conversation in turns, it just outputs a | |
| single continuous piece of text. | |
| yashasolutions wrote: | |
| the question would be how to do that but also still get proper | |
| time code when using whisper to get the subtitles | |
| ryanar wrote: | |
| In my experience, transcription software has no problem with | |
| transcribing sped up audio, or audio that is inaudible to humans | |
| or extremely loud (as long as not clipped), I wonder if LLM | |
| transcription works the same. | |
| mushishi wrote: | |
| Do the APIs support simultaneous voice transcription in a way | |
| that different voices are tagged? (either in text or as metadata) | |
| | |
| If so: could you split the audiofile and process the latter half | |
| by pitch shifting, say an octave, and then merging them together | |
| to get shorter audiofile -- then transcribe and join them back to | |
| a linear form, tagging removed. (You could insert some | |
| prerecorded voice to know at which point the second voice | |
| starts.). If pitch change is not enough, maybe manipulate it | |
| further by formants. | |
| godot wrote: | |
| If you're already doing local ffmpeg stuff (i.e. pretty involved | |
| with code and scripting already) you're only a couple of steps | |
| more away from just downloading the openai-whisper models (or | |
| even the faster-whisper models which runs about two times | |
| faster). Since this looks like personal usage and not building | |
| production quality code, you can use AI (e.g. Cursor) to write a | |
| script to run the whisper model inference in seconds. | |
| | |
| Then there is no cost at all to run any length of audio. (since | |
| cost seems to be the primary factor of this article) | |
| | |
| On my m1 mac laptop it takes me about 30 seconds to run it on a | |
| 3-minute audio file. I'm guessing for a 40 minute talk it takes | |
| about 5-10 minutes to run. | |
| ta8903 wrote: | |
| This "hack" also works in real life, youtubers low to talk slowly | |
| to increase the video runtime so I watch everything other than | |
| songs at 2x speed (and that's only because their player doesn't | |
| let you go faster). | |
___________________________________________________________________ | |
(page generated 2025-06-26 21:02 UTC) |