Extract YouTube video IDs from the end of filenames

Question

I downloaded some files from YouTube a while back with the YouTube IDs in their names. My problem is that some IDs contain a character that also marks the beginning of the ID itself.

$ ls *.mp4 | head -n 2
1 - How to make an operating system from scratch-rr-9w2gITDM.mp4
2 - How to make an operating system from scratch-WwitBbsUvc8.mp4

I want to return the IDs rr-9w2gITDM and WwitBbsUvc8.

In the following two examples, what I have written works for the second ID, but not the first. They seem to be greedily matching up to the last matching character. Using /1 at the end did not make any difference.

$ ls *.mp4 | sed -e 's/^.*\-//1;s/\..*$//1'
9w2gITDM
WwitBbsUvc8

$ ls *.mp4 | sed -e 's/^.*\-\(.*\)\..*$/\1/'
9w2gITDM
WwitBbsUvc8

I prefer to solve this using macOS sed, but I'm also open to using gsed (GNU sed) if no other option is possible.

You could greedily match everything except a dash ([^-]*), but that would get you in trouble if the actual name contains a dash. But since YouTube IDs always have 11 characters, maybe match everything up to dash + 11 characters + .mp4?
– DonHolgo, Commented May 31, 2024 at 9:51
@DonHolgo The YouTube video ID is not guaranteed to be 11 characters long. Related: webapps.stackexchange.com/q/54443
– Kusalananda ♦, Commented May 31, 2024 at 18:14
@Kusalananda It's not guaranteed to remain 11 characters long, but for existing downloads (as in the OP's case) I think it's safe to use that length.
– DonHolgo, Commented May 31, 2024 at 18:39
If we cannot assume that the last 11 characters before the dot preceding the extension will always be the ID, or that the same separator will always separate the [truncated] title from the ID, it's inherently impossible to do this reliably. The only way around that would be something overly involved for the task, such as calling YouTube's API and implementing logic robust enough to check whether the candidate ID actually matches the downloaded video.
– kos, Commented May 31, 2024 at 23:50
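
For reference, the fixed-length idea from the comments above can be sketched with sed -E, which both macOS sed and GNU sed accept. The attempts in the question fail because ^.*- is greedy and runs up to the last dash; pinning the tail to exactly 11 characters plus .mp4 leaves the regex only one dash it can stop at. This assumes every ID is 11 characters long and every file ends in .mp4, and the function name is made up for illustration.

```shell
# Hypothetical helper: reads filenames on stdin and prints the 11 characters
# that sit between a dash and the ".mp4" suffix.
# Assumption: IDs are exactly 11 characters long.
extract_id_fixed_length() {
    # -n together with the p flag prints only the lines that matched.
    sed -nE 's/^.*-(.{11})\.mp4$/\1/p'
}

printf '%s\n' \
    '1 - How to make an operating system from scratch-rr-9w2gITDM.mp4' \
    '2 - How to make an operating system from scratch-WwitBbsUvc8.mp4' |
    extract_id_fixed_length
```

Names that do not fit the pattern are silently skipped rather than printed unchanged, which makes unexpected filenames easier to spot.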

larsks · Accepted Answer · 2024-05-31 11:49:44Z

This is perhaps easier with awk, since that makes it easy to simply extract the last n characters of the filename:

$ ls -1
'1 - How to make an operating system from scratch-rr-9w2gITDM.mp4'
'2 - How to make an operating system from scratch-WwitBbsUvc8.mp4'
$ ls | awk '{id=substr($0, length($0)-14, 11); print id}'
rr-9w2gITDM
WwitBbsUvc8

You could do something similar with bash parameter expansion:

$ for filename in *.mp4; do
    id=${filename: -15: -4}
    echo "$id"
  done
rr-9w2gITDM
WwitBbsUvc8
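
One caveat: a negative length in ${var:offset:length} needs bash 4.2 or newer, and the /bin/bash that ships with macOS is version 3.2. A sketch that sticks to POSIX parameter expansion, under the same assumption of an 11-character ID followed by .mp4 (the function and variable names are only for illustration):

```shell
# Hypothetical helper: print the last 11 characters before ".mp4".
# Assumption: the ID is exactly 11 characters long.
last11_id() {
    name=${1%.mp4}                       # drop the extension
    prefix=${name%???????????}           # everything except the last 11 characters
    printf '%s\n' "${name#"$prefix"}"    # strip that prefix, leaving the ID
}

last11_id '1 - How to make an operating system from scratch-rr-9w2gITDM.mp4'
last11_id '2 - How to make an operating system from scratch-WwitBbsUvc8.mp4'
```

The eleven question marks each match exactly one character, so the pattern plays the role of the fixed offsets in the substring expansion.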

canupseq · Accepted Answer · 2024-06-01 07:26:43Z

If you always have two hyphens before your intended string, then here is one way of doing that:

$ sed 's/-/ /2' mp4_files | awk '{print $NF}'
rr-9w2gITDM.mp4
WwitBbsUvc8.mp4

The sed part simply replaces the second hyphen by a space. Then you extract the last 'word' using awk.

You can replace awk by another sed if you wish:

$ sed 's/-/ /2' mp4_files | sed 's/^.* //'
rr-9w2gITDM.mp4
WwitBbsUvc8.mp4
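
If the .mp4 extension should go as well, the whole thing fits into a single sed call. This is only a sketch under the same two-hyphen assumption; the function name is hypothetical, and the sample names are fed in through printf rather than read from mp4_files.

```shell
# Hypothetical helper: reads filenames on stdin.
# Assumption: exactly two hyphens come before the ID.
strip_to_second_hyphen() {
    # First expression: delete everything up to and including the second hyphen.
    # Second expression: delete the trailing ".mp4".
    sed -e 's/^[^-]*-[^-]*-//' -e 's/\.mp4$//'
}

printf '%s\n' \
    '1 - How to make an operating system from scratch-rr-9w2gITDM.mp4' \
    '2 - How to make an operating system from scratch-WwitBbsUvc8.mp4' |
    strip_to_second_hyphen
```

Because [^-]* cannot cross a hyphen, the match cannot run past the second one, so a dash inside the ID is left alone.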

Kusalananda · Accepted Answer · 2024-06-01 08:26:14Z

Assuming that the initial part of the name that is not a YouTube video ID always contains at most one internal dash and one terminating dash, then the part of the name that we must remove to get the YouTube video ID matches the filename globbing pattern *-*-. We can remove this with a standard shell parameter substitution. We may, at the same time, remove the filename suffix .mp4 with the standard basename utility:

$ for name in *-*-*.mp4; do basename -- "${name#*-*-}" .mp4; done
rr-9w2gITDM
WwitBbsUvc8

Slightly longer, but without using any external utilities:

$ for name in *-*-*.mp4; do name=${name#*-*-}; printf '%s\n' "${name%.mp4}"; done
rr-9w2gITDM
WwitBbsUvc8
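
The single # matters here: it removes the shortest prefix matching *-*-, so the match stops at the second dash. The doubled form ## removes the longest match and runs up to the last dash, which reproduces the greedy behaviour from the question. A small demonstration with one of the sample names (the variable names are only for illustration):

```shell
name='1 - How to make an operating system from scratch-rr-9w2gITDM.mp4'

shortest=${name#*-*-}     # shortest matching prefix removed
longest=${name##*-*-}     # longest matching prefix removed

printf '%s\n' "$shortest"   # rr-9w2gITDM.mp4
printf '%s\n' "$longest"    # 9w2gITDM.mp4
```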
