The journey of converting "hardcoded" subtitles

Last year I was looking for something to watch and noticed a French
film on Mubi called In Bed with Victoria (known just as Victoria in
France). I seem to remember it having some middling reviews on
Letterboxd at the time but I decided to give it a punt anyway.

After I'd watched and enjoyed the film, I looked to buy a copy of the
film since Mubi used to have a system where films were only available
for thirty days. Looking at Blu-ray.com, I was disappointed to see
that none of the Blu-rays available had English subtitles included.
"No problem," I thought, I can rip the Blu-ray and graft on some
subtitles from somewhere else so I ordered a French copy of the film.
I don't think there were any DVD copies of the film which included
English subtitles but there are lots of websites offering subtitles
in SubRip format (SRT) and I found various versions for In Bed with
Victoria.

I've not had much success with Blu-rays on Linux so all the tools I'm
using are on Windows (specifically Windows 10 for this). MakeMKV is a
great piece of software for ripping discs to a single Matroska (MKV)
file, selecting which audio, video and subtitle streams you want (I
have the paid version, which is a one-off purchase, but I think you
can still rip Blu-rays even with the free version). You can then put
the resulting file into something like HandBrake to convert it and
create a smaller file. HandBrake doesn't support SRT as a format to
bake in a selectable subtitle stream for the outputted MKV file but I
normally watch films using VLC or Kodi/OSMC, which both support
separate SRT files (and if the SRT has the same filename as the video
file, apart from the file extension, it will be applied automatically).

When I started applying the downloaded subtitles to the Blu-ray rip,
I quickly noticed basic issues with the formatting (e.g. when two
characters are speaking at the same time) and tried downloading some
different ones when I started to realise there was a more fundamental
problem: the translation was totally different to the one I'd seen on
Mubi. Spending some more time checking these subtitles, it became
clear that the translations weren't just different, they were vastly
inferior. I don't know if it's because there's not a disc of this
film available to buy which includes English subtitles and these were
just the results of an automated translation, or whether they were
done by a less-skilled subtitler. One of the reasons I wanted to buy
the film was so I could introduce it to other people I thought would
enjoy it, and I could tell these new subtitles would be seriously
detrimental to my enjoyment of the film (or that of anyone else I
wanted to introduce the film to).

At this point, the film was still available to view on Mubi so I
watched it again and skipped to the end credits for details of the
subtitles and found they were credited to TETRAFILM and were written
by Sionann O'Neill. I tried contacting TETRAFILM to ask for a copy of
the subtitles but received no response; I looked for Sionann but
could only find an inactive Twitter account (though I can see there
has subsequently been some minor activity on it). Interestingly, I did
come across a 2011 article about her on a San Francisco news site
called SFGATE, promoting the release of François Ozon's latest film at
the time, Potiche, for which she had done the English subtitles. I
used to do captioning for live theatre shows so I'm more interested in
the art of subtitles and captions than most people, but it was an
extremely insightful article and it drives home the significance and
importance of her work (including the concept of translation vs
adaptation, which I never had to deal with):

>  O'Neill says in her work, she has two mottos. The first: Less is
>  more. "English is often more succinct than French," she says, "So
>  I have to find a way to synthesize the French while being true to
>  the character. I'm always trying to streamline it to have fewer
>  words, because it's terrible to have the eyes down at the
>  subtitles all the time. I want people to forget they're reading
>  subtitles."
>
>  Her second motto: Sometimes you have to go further away to get
>  closer. "That's the essence of adaptation," she says, using the
>  French word. "You're adapting it. The literal translation won't
>  cut it. You have to express how that character would put it if
>  they were an American saying it."

Realising that I'd hit a dead end with acquiring some soft subtitles
of the correct translation, I decided to figure out a way to extract
the subtitles from the Mubi stream of the film. I put "hardcoded" in
the title of this article in quotes because in the Mubi stream, the
subtitles are selectable but I ended up having to deal with them as
part of the video stream. Obviously, as a part of a paid-for
subscription service, the video streams on Mubi are using Digital
Rights Management (DRM) to protect them. I'd hoped the subtitles
might be delivered separately to this but inspecting the Network tab
in Firefox while the video played with subtitles showed there was no
way to grab them.

Using a popular streaming tool called Open Broadcaster Software
(OBS), I was able to capture a screen recording of the film
while it played on my computer. I'm not sure how Mubi determines its
output quality but when playing it in a browser on a screen at a
resolution of 1920x1080 I have never found it to be great, especially
not for retaining a copy of the film. The important thing was to
capture the subtitles in the video and luckily the film has an aspect
ratio of 2.40:1, which meant that playing it back on a 16:9 screen
left the subtitle text entirely on a black background. This allowed
me to use a tool called VideoSubFinder without changing any of the
presets (except for selecting the bottom portion of the video) to
output each frame of subtitle as an image with the timestamp in the
filename.
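
To put some numbers on that, here is a quick sketch of the letterbox arithmetic. It assumes the player scales the picture to the full width of the screen and centres it vertically, which is my guess at what the browser player does rather than anything Mubi documents. The function name is just for illustration.

```python
def letterbox_bars(screen_w: int, screen_h: int, aspect: float) -> tuple[int, int]:
    """Return (picture_height, bar_height) in pixels.

    Assumes the picture is scaled to the full screen width and centred
    vertically, so the leftover height is split into two equal black bars.
    """
    picture_h = round(screen_w / aspect)
    bar_h = (screen_h - picture_h) // 2
    return picture_h, bar_h


picture_h, bar_h = letterbox_bars(1920, 1080, 2.40)
print(f"picture: {picture_h}px tall, each black bar: {bar_h}px tall")
```

Under those assumptions each bar is 140 pixels tall on a 1080-pixel-high screen, which is the strip I cropped to in VideoSubFinder.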

Once you have these image files, you need to use Optical Character
Recognition (OCR) to convert them into text. Another popular tool
called SubtitleEdit is meant to have the functionality to do this
using various third-party systems such as Tesseract but I found it to
be so slow and unreliable that it was completely unusable. I found a
tool on GitHub by a user called Abu3safeer which is a Python script
that allows you to use the Google Docs OCR. You need to follow the
Google Developers Python quickstart guide to generate a
credentials.json file, then when you run the Python script you will
be prompted for your Google account credentials before it goes and
OCRs all your files. It ran much more quickly than the SubtitleEdit
process and produced 1329 text files with no errors.
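
Because the text files keep the timestamped filenames of the images, they already hold everything an SRT file needs. The sketch below shows roughly what assembling one involves. The filename pattern (`h_mm_ss_mmm__h_mm_ss_mmm_...`) is an assumption based on what my files looked like, so check yours before relying on it, and all the function names are made up for this example.

```python
import re
from pathlib import Path

# Assumed filename pattern: start and end times separated by a double
# underscore, e.g. "0_01_02_345__0_01_04_000_0000.txt".
NAME_RE = re.compile(
    r"^(\d+)_(\d{2})_(\d{2})_(\d{3})__(\d+)_(\d{2})_(\d{2})_(\d{3})"
)


def parse_times(filename: str) -> tuple[int, int]:
    """Return (start_ms, end_ms) parsed from the assumed filename pattern."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename}")
    h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
    start = ((h1 * 60 + m1) * 60 + s1) * 1000 + ms1
    end = ((h2 * 60 + m2) * 60 + s2) * 1000 + ms2
    return start, end


def fmt(ms: int) -> str:
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def build_srt(entries: dict[str, str]) -> str:
    """Build SRT text from a mapping of filename -> OCR text."""
    cues = sorted((parse_times(name), text.strip()) for name, text in entries.items())
    blocks = [
        f"{i}\n{fmt(start)} --> {fmt(end)}\n{text}\n"
        for i, ((start, end), text) in enumerate(cues, start=1)
    ]
    return "\n".join(blocks)


def build_srt_from_dir(folder: str) -> str:
    """Read every .txt file in a folder and build SRT text from them."""
    files = Path(folder).glob("*.txt")
    return build_srt({p.name: p.read_text(encoding="utf-8") for p in files})
```

In practice I let SubtitleEdit do this step, but it helps to see that the timing comes entirely from the filenames and the text entirely from the OCR.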

Going back to SubtitleEdit, you can import a batch of text files. The
ones output by the Python tool retain the timestamps in the filenames
so SubtitleEdit knows where to place them. The idea now is
to export this as an SRT file but unfortunately we still have some
hoops to jump through. Even before we start thoroughly inspecting the
quality of the OCR, it's clear that there's a timing issue. Setting
the first subtitle and offsetting the rest from there, I can see that
by the end of the film the subtitles and video are about 10 seconds
out of sync. I checked the framerate of the OBS recording using VLC
(Tools > Codec Information) and could see it was 30fps whereas the
Blu-ray rip was 23.976fps. Using a tool called Subtitle framerate
changer I thought I could simply convert my SRT file from 30 to
23.976fps but the resultant file was way out. I started to think
about how the framerate shouldn't matter because the SRT uses
timestamps, not frame information, but obviously there was a mismatch.
Even though my OBS recording was set to 30fps, whatever framerate
Mubi had broadcast the film at would affect the length of the film.
Sure enough, I used the tool to convert the SRT from 24 to
23.976fps and the resulting file matched perfectly.
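
As far as I can tell, a framerate conversion like this just multiplies every timestamp by the ratio of the two framerates, here 24 / 23.976, which stretches the timings by about 0.1%. Below is a minimal sketch of that idea. It is not the code of the tool I used, and `rescale_srt` is a name I've made up.

```python
import re

# Matches an SRT timestamp such as 01:02:03,456
TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def rescale_srt(srt_text: str, from_fps: float, to_fps: float) -> str:
    """Scale every timestamp on the timing lines by from_fps / to_fps."""
    factor = from_fps / to_fps

    def repl(match: re.Match) -> str:
        h, m, s, ms = (int(g) for g in match.groups())
        total = ((h * 60 + m) * 60 + s) * 1000 + ms
        scaled = round(total * factor)
        h, rem = divmod(scaled, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    out = []
    for line in srt_text.splitlines(keepends=True):
        # Only touch timing lines so subtitle text is never altered.
        out.append(TIME_RE.sub(repl, line) if "-->" in line else line)
    return "".join(out)


sample = "1\n01:00:00,000 --> 01:00:02,000\nBonjour\n"
print(rescale_srt(sample, 24, 23.976))
```

A cue that started exactly one hour in moves about 3.6 seconds later, which is why the drift is barely visible at the start of the film and obvious by the end.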

With the hard bit out of the way, now I had to actually work on the
content of the subtitles. Even though the OCR via Google had been
much more successful than the providers in SubtitleEdit, I could see
there was still a lot of work needing done. I don't know if it was a
problem with how I configured VideoSubFinder or the Python script but
they hadn't handled line breaks at all. Sometimes these were easy to
see where they should be because the last word of the first line and
the first word of the second line would be concatenated but plenty of
them were not so easy to see. Lines which included two characters
speaking at the same time should start with a hyphen and a space, but
many of the hyphens had been lost or converted to other characters
such as an em dash or interpunct. More critically, many of these
instances omitted the second line altogether and there were other
short, individual lines which had been missed. I'm not sure whether
these were missed by the extraction of the images or the OCR process
and I don't have an easy way to tell now.
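
Some of this clean-up could be scripted before the manual pass. The sketch below normalises dash-like characters at the start of a line to a hyphen and a space, and flags cues that look like two lines run together. The character list and the "joined words" heuristic are my own guesses, so treat the flag as a prompt to check a cue by eye. Neither function can recover a line that was dropped entirely.

```python
import re

# Dash-like characters seen in place of a leading hyphen: em dash and
# interpunct (from my files), plus en dash and bullet as extra guesses.
DASH_LIKE = "\u2014\u2013\u00b7\u2022"
LEADING_DASH_RE = re.compile(rf"^\s*[-{DASH_LIKE}]\s*")

# Heuristic: a lowercase letter or punctuation mark immediately followed
# by a capital often means a line break was lost between two lines.
JOINED_RE = re.compile(r"[a-z.,!?][A-Z]")


def normalise_dialogue_line(line: str) -> str:
    """Rewrite a line starting with a dash-like character as '- text'."""
    return LEADING_DASH_RE.sub("- ", line, count=1)


def normalise_cue(text: str) -> str:
    """Apply the dash normalisation to each line of a subtitle cue."""
    return "\n".join(normalise_dialogue_line(line) for line in text.split("\n"))


def looks_joined(text: str) -> bool:
    """Return True if the cue may contain two lines run together."""
    return JOINED_RE.search(text) is not None


print(normalise_cue("\u2014Where is he?\n\u00b7In court."))
print(looks_joined("See you there.Maybe"))
```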

After going through all the lines in SubtitleEdit and adding line
breaks where I thought necessary, I lined up my OBS copy of the film
in VLC with the window set to View > Always on top. This means I can
sit it just above the video player in SubtitleEdit, view the
subtitles being played at the same time and make changes to the
subtitles without losing visibility of VLC. I changed the skip time
feature in SubtitleEdit from 0.5 to 3 seconds under the Adjust tab
in the bottom left which matches the amount of time VLC jumps when
using Shift + Left/Right. Now I played the two videos together
looking for any changes I needed to make, like adding line breaks or
lines that were completely missing. Obviously because of the
different framerates, the videos will go out of sync, but you can
briefly pause the faster one periodically and you might need to stop
them altogether if you are doing larger edits.

Once the editing is finished, you can save your work as an SRT file
and place it in the same folder as your video file. If you want, you
can use a tool like MKVToolNix to add the subtitles as a selectable
track in your MKV file.
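
MKVToolNix includes a command-line tool called mkvmerge, so this last step can be scripted too. Here is a hedged sketch that builds the command. The filenames are placeholders, the helper names are made up, and it assumes mkvmerge is installed and on your PATH.

```python
import subprocess


def mkvmerge_command(video: str, subtitles: str, output: str,
                     language: str = "eng") -> list[str]:
    """Build an mkvmerge command that adds an SRT file as a subtitle track.

    The --language option applies to the file that follows it; track 0 is
    the only track in an SRT file.
    """
    return [
        "mkvmerge",
        "-o", output,
        video,
        "--language", f"0:{language}",
        subtitles,
    ]


def add_subtitles(video: str, subtitles: str, output: str) -> None:
    """Run mkvmerge; raises CalledProcessError if mkvmerge fails."""
    subprocess.run(mkvmerge_command(video, subtitles, output), check=True)


# Placeholder filenames, printed rather than executed.
print(" ".join(mkvmerge_command("victoria.mkv", "victoria.srt", "victoria.subbed.mkv")))
```

Writing to a new output file rather than over the original rip means nothing is lost if the result isn't what you expected.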