.srt files are structured in blocks of three parts (a frame number, a duration line, and the caption text), like this example:

228
00:39:06,680 --> 00:39:13,460
Lorem ipsum dolor sit amet

Now suppose that some captions contain a speaker quoting someone else's literary work, as in this additional example:

228
00:39:06,680 --> 00:39:13,460
According to Erasmus, book 1, chapter 23...

Problem: Using Vim, I want to extract only the text from the .srt by deleting the frame numbers and the frame durations, without erasing the numbers that appear inside the captions as part of quotations.

Attempts: Using a regular expression with the substitute command, I found a way to "delete" the duration line with :%s/\d\d:\d\d:\d\d,\d\d\d --> \d\d:\d\d:\d\d,\d\d\d/ /g. I handled the frame numbers the same way, except that I searched for every number with the /gc flags so I could skip the ones in the middle of the text by hand.

However, I have a considerable number of such quotations to extract, and their numbers must be kept. Answering yes/no for every match becomes tedious.

Since my regex skills are lacking, I presume there is at least a less "ugly" way to carry out this strategy, or perhaps a more elegant way not only to delete the unwanted portions but also to recover the raw text without the frame and duration lines, like:

Lorem ipsum dolor sit amet
According to Erasmus, book 1, chapter 23...

Does anyone know how to do that?

2 Answers

3
  1. Don't replace the content of the line with nothing; actually delete the line. Instead of using :s/PATTERN//g, use :g/PATTERN/d (see :help :g).
  2. Anchor your patterns using ^ and $ to only match lines that consist entirely of the thing you want to remove.

Put together:

:g/^\d\+$/d
:g/^\d\d:\d\d:\d\d,\d\d\d --> \d\d:\d\d:\d\d,\d\d\d$/d

(wow, that's a lot of "d").

This still has the possibility of nuking a "line of dialog" that consists only of digits, but it won't eat numbers that are just in the middle of a line.

To do a better job I would suggest using something a little more fit-for-purpose than Vim — either a programming language, or a subtitle editor :)
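As a rough illustration of the "programming language" route, here is a minimal Python sketch. It assumes a well-formed file where each block is a frame number, a duration line, then the caption text, with blank lines between blocks. The names `srt_to_text` and `TIMING` are made up for this example. Because it only drops the first two lines of each block, a caption line made entirely of digits survives.

```python
import re

# Duration line such as "00:39:06,680 --> 00:39:13,460" (assumed format).
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def srt_to_text(srt: str) -> str:
    """Return only the caption text, one caption line per output line."""
    out = []
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = block.splitlines()
        # Drop the header only when it really looks like
        # "frame number" followed by "duration line".
        if (
            len(lines) >= 2
            and lines[0].strip().isdigit()
            and TIMING.match(lines[1].strip())
        ):
            lines = lines[2:]
        # Keep every non-empty caption line, digits included.
        out.extend(line.strip() for line in lines if line.strip())
    return "\n".join(out)
```

Reading the file into a string and calling `srt_to_text` on it should give the raw text asked for in the question.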

2 Comments

:g/\d --> \d/d will probably be enough for the second step.
@hobbs, thanks for replying! Indeed, :g does a better job, and your points are noteworthy. I'll keep them in mind for next time. As for a programming language or a fit-for-purpose tool, that might indeed be useful, but this was a "hobby" task, so to speak, and Vim and regexp were the tools that came to mind for solving it quickly. Anyway, it's done. :) BTW, @romainl, your shorter suggestion also works fine. Thank you as well!
1

Things get a lot easier (although not necessarily better looking) if you use anchors:

:%s/\v(%^|\n)\zs\d+\n\d{2}(:\d{2}){2},\d{3} --\> \d{2}(:\d{2}){2},\d{3}$\n// 

This treats the sequence number and the duration line as a coupled pair, so you don't need to worry about either one matching in the middle of the text.
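For comparison, the same coupling idea can be sketched outside Vim with Python's `re` module. This is an approximate translation, not an exact equivalent of the Vim pattern. The names `HEADER` and `strip_headers` are made up for this example, and the input is assumed to use `\n` line endings.

```python
import re

# A frame-number line immediately followed by a duration line,
# matched as one unit so neither can match inside caption text.
HEADER = re.compile(
    r"^\d+\n\d{2}(?::\d{2}){2},\d{3} --> \d{2}(?::\d{2}){2},\d{3}[ \t]*\n",
    re.MULTILINE,
)


def strip_headers(srt: str) -> str:
    """Remove every frame-number/duration pair, leaving captions untouched."""
    return HEADER.sub("", srt)
```

As with the Vim command, the blank lines between blocks are left in place.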

1 Comment

Hi, @SatoKatsura! This definitely "kills two birds with one stone" ;). I appreciate your help. Getting to this would surely have cost me many hours of research and practice with regexp. Thank you.
