Thomas Cannon

The Seinfeld Transcoding Problem

Over the past 2 months, I’ve been working on digitizing our physical media library for easier access. Part of this project was ripping and transcoding the government-issued complete Seinfeld DVD collection. Which is weird in about 4 different ways.

Basically, it acted as a great lesson in:

  • The importance of good source data, both in how you structure it and what data you have available to you
  • When to pull the rip cord on automation (or, more importantly, when to recognize what you’ve already automated)
  • Not over-automating, and recognizing when a task is inefficient but the overall process is efficient

This is all over the place!

To start, the data on the disks for the early seasons is all over the place. If a disk holds episodes 1-4, they aren’t stored on it sequentially, which goes against the grain of basically every other disk I’ve transcoded in this process.

What is an episode, anyway?

Seinfeld also throws multiple curveballs in its file structure. Notably, The Stranded is listed as Season 3, Episode 10, but was filmed as part of Season 2’s production schedule. So, it’s actually on the Season 1 & 2 box set.

Pop quiz: If an episode is a 2-parter, how should it be broken up as a file?

There are 2 schools of thought:

  • It’s a contiguous story, so as a single video
  • Broken up into distinct episodes, since that’s how it aired and is syndicated

Seinfeld takes both approaches, based on air date. In the same season, one two-parter is two distinct video files, while another is a single video. This is based on whether it aired on the same day, or was broken up over different days. All metadata sources break them up into distinct entries.

This would have thrown both bulk-renaming and computer vision identification for a loop, so I had to make a judgment call and come up with a modified naming schema for these cases (e.g., S4.E10-11).
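
If you want to handle that schema programmatically, a parse like this covers both forms (a quick illustrative sketch in Python; the helper and filename are made up, not something from my actual pipeline):

    import re

    # Matches both a single episode ("S4.E10") and the two-parter
    # form ("S4.E10-11").
    EPISODE_PATTERN = re.compile(r"S(?P<season>\d+)\.E(?P<start>\d+)(?:-(?P<end>\d+))?")

    def episode_numbers(filename):
        """Return every episode number a file covers, so a combined
        two-parter still maps back to its distinct metadata entries."""
        match = EPISODE_PATTERN.search(filename)
        if not match:
            return []
        start = int(match.group("start"))
        end = int(match.group("end") or start)
        return list(range(start, end + 1))

    print(episode_numbers("Seinfeld S4.E10-11.mkv"))  # [10, 11]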

Make sure you’ve got good external data

My first pass was to try to find a computer vision tool to solve this, because it’s kind of a perfect use case for computer vision. I found a Python script that extracts frames from a video, searches TMDB for the episode whose stills best match, and renames the files.

A great workflow in theory, but I ran into the problem that plagues computer vision: sourcing. While TMDB does have an API for grabbing an episode’s stills, its dataset for Seinfeld has maybe 2 screenshots per episode, which isn’t a strong enough dataset for automated identification. When I ran a test against an episode I had already correctly ID’d, it whiffed.
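
For reference, this is roughly the TMDB lookup those scripts lean on, via the episode-images endpoint (a sketch; the API key and series ID are placeholders):

    import requests

    TMDB_API_KEY = "YOUR_API_KEY"  # placeholder
    SERIES_ID = 0                  # placeholder: Seinfeld's TMDB series ID

    def episode_stills(season, episode):
        """Fetch whatever still images TMDB has for one episode.
        For Seinfeld this comes back with only a couple of entries,
        which is why frame matching against it was so unreliable."""
        url = (f"https://api.themoviedb.org/3/tv/{SERIES_ID}"
               f"/season/{season}/episode/{episode}/images")
        response = requests.get(url, params={"api_key": TMDB_API_KEY}, timeout=10)
        response.raise_for_status()
        return [still["file_path"] for still in response.json().get("stills", [])]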

I’d also found some potential tools to match by subtitles, but I was not able to get them to work (I’m pretty sure they’re one-off vibe-coded projects that worked for someone in their specific case). Plus, they were inefficient: they wouldn’t look at the actual subtitles in the file, but would transcribe the audio, which would have slowed down the whole process and been much more resource intensive. I already have the source subtitles! They don’t need to listen and produce an imperfect transcript; I have the transcript! It would be a lot more efficient to compare text against text, so this didn’t inspire confidence in the project.
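
To make the “compare text against text” point concrete: with reference transcripts in hand, the matching is a few lines of standard-library Python (a sketch; where the transcripts come from is hand-waved, and the episode IDs are illustrative):

    from difflib import SequenceMatcher

    def best_episode_match(ripped_subtitles, transcripts):
        """Compare the subtitle text already sitting in the rip against
        known episode transcripts and return the closest match.
        `transcripts` maps an episode ID (e.g. "S04E10") to its text."""
        def similarity(text):
            return SequenceMatcher(None, ripped_subtitles, text).ratio()
        return max(transcripts, key=lambda episode: similarity(transcripts[episode]))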

Even though these approaches failed, they did help shape the final workflow.

How I processed all these videos in a few hours

The final workflow ended up being:

  • GUI:

    • Multiple windows of Windows Explorer:
      • One with tabs for the raw videos from the DVDs
      • One for the renamed, identified videos that will be transcoded
    • A browser with TMDB and seinfind.com (which I’d found while looking for subtitle sources) open
    • VLC in a small window in the corner
  • Workflow:
    1. Check the physical disk for the corresponding folder (Season 3, Disk 2) to get the episode range
    2. Bulk-rename the files based on their “created at” timestamp (see the sketch after this list)
    3. Open each video, scrub for frames, scenarios, or identifying phrases
    4. Check TMDB/Seinfind with that information, get the match
    5. Rename the file, onto the next one
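
Step 2 is the only part of that loop worth showing as code. I used a GUI batch renamer, but the idea boils down to this (an illustrative sketch; it assumes the rips are .mkv files and that the filesystem’s creation time reflects rip order):

    from pathlib import Path

    def best_guess_rename(disk_dir, season, first_episode):
        """Sort one disk's rips by creation timestamp and hand out
        sequential episode numbers as a first guess; manual verification
        catches the ones the disk stored out of order."""
        videos = sorted(disk_dir.glob("*.mkv"), key=lambda p: p.stat().st_ctime)
        for offset, video in enumerate(videos):
            new_name = f"Seinfeld S{season:02d}E{first_episode + offset:02d}{video.suffix}"
            video.rename(disk_dir / new_name)

    # e.g. best_guess_rename(Path("Season 3, Disk 2"), season=3, first_episode=9)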

The importance of structuring data

This workflow was possible because rather than having 23 episodes in a season, or 180 episodes total, to work through, the data was broken into chunks of 5-6 episodes using a stable identifier: the disk they were from. As I was ripping each disk, I made sure the files went into directories that matched the disk names, for example “Season 4, Disk 3”, so everything on that DVD would go in there for later processing.
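
Concretely, the rip folders looked something like this (the file names here are illustrative; rips come off a disk with generic titles):

    Season 3, Disk 2/
        title_01.mkv
        title_02.mkv
        ...
    Season 4, Disk 3/
        title_01.mkv
        ...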

Then, the problem of “oh, which season, which episode am I looking at?” became much more manageable because I was only looking at those 4-6 episodes at a time. With the batch renamer, I could take a best-guess pass and sort from there. Essentially, I automated that part of the process with:

For this part of the operation, these are episodes 4 through 10.

After that, manual verification kicked in. When the episodes were out of order, I renamed the correct match with a suffix, then jumped to the incorrect episode and identified it.

Which is just a manually implemented sorting algorithm. Rather than going through all 23 episodes, which would be really intensive for a slow or “differently efficient” CPU (my brain), I broke it up into chunks of six, because it’s much easier to work in those chunks. And because those chunks are sequential, concatenating them gives the ordered list of episodes.
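
In code-shaped terms, the “algorithm” is nothing fancier than this (toy data, purely to illustrate the chunk-then-concatenate idea):

    # Sort each disk's small chunk, then concatenate the chunks in disk
    # order; because the disks themselves are sequential, the result is
    # the whole season in order.
    disks = {
        "Season 4, Disk 1": ["S04E03", "S04E01", "S04E02"],
        "Season 4, Disk 2": ["S04E06", "S04E04", "S04E05"],
    }
    season_in_order = [episode for chunk in disks.values() for episode in sorted(chunk)]
    print(season_in_order)  # ['S04E01', 'S04E02', ..., 'S04E06']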

Task inefficient, Process efficient

This endeavor is a good example of being task inefficient while the overall process is efficient. If you said:

I, the human, am going to open each video, scrub through it, identify the episode, and sort 180 episodes of television

That feels like a horribly inefficient and daunting process. And at the task level, it is inefficient! However, at the process level, it is more efficient because of the work I did to structure the source data, the lessons & tools I found along the way, the existing automations I leveraged, and the nature of probability.

Having good source data and preparing it to be as workable and easy as possible despite its imperfections is what unlocked the sorting algorithm.

The next part was knowing my tools well enough to automate the things where it made sense, then using human discernment to finish the rest of the job. Extracting the DVD data, using a batch renamer, and auto-organizing files into disk-based directories were all parts of the process I’d already automated. TMDB and Seinfind ended up being crucial resources for pattern matching. And because I’m comfortable with my window manager and keyboard shortcuts, I built out a whole bespoke GUI just for this process, with zero code or extra work!

At the process level, this approach ended up being much more efficient, because rather than trying to find all the edge cases and particulars of this unstructured data (a series of flat files) and translate them into software or prompts, I wrote the program on the fly in my brain. I identified the patterns, used them to build out a structure, and got into a flow state.

Also, I’m only doing this once. I’m not regularly transcoding & organizing nine 23-episode seasons of a beloved broadcast television show that did not logically organize its data on disk. I mean, if I end up doing The West Wing, Deep Space 9, or some other giant set, it might make sense to expand these scripts to make the process faster. But I spent 3 hours trying to get them up and running, and the manual work for Seinfeld took 3-4 hours total.

Computer vision and transcription have very useful applications, but they require strong datasets. They also require understanding the data, trying to make the best sense of it, and human discernment.

I feel like in the quest to over-automate everything or write custom software for everything, you end up creating more inefficiencies and wasting more time, whereas a human can get into a flow state and better adapt to the complexities and problems that certain datasets present.