
For example, I want to convert the text below

    00:00:10.730
    this presentation is delivered by the

    00:00:13.230
    Stanford center for professional

    00:00:14.610
    development okay so let's get started

    00:00:25.500
    with today's material so um welcome back

    00:00:32.399
    to the second lecture what I want to do

to

    00:00:10.730 --> 00:00:13.230
    this presentation is delivered by the

    00:00:13.230 --> 00:00:14.610
    Stanford center for professional

    00:00:14.610 --> 00:00:25.500
    development okay so let's get started

    00:00:25.500 --> 00:00:32.399
    with today's material so um welcome back

    00:00:32.399
    to the second lecture what I want to do
  • Use something like gaupol for munging subtitles? Commented Jun 8, 2017 at 10:58

4 Answers


For the sake of code clarity, this uses GNU sed:

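    sed -nE '
      # Not a timestamp line: print it and start the next cycle.
      /^([0-9][0-9]:){2}[0-9]+[.][0-9]+/!{p;d;}
      # Timestamp line: collect lines in the hold space up to the next timestamp or EOF.
      h;:a
      $bb;n;H
      /^([0-9][0-9]:){2}[0-9]+[.][0-9]+/!ba
      :b
      x
      # Swap newlines and underscores so that [^_]* matches a single line.
      y/\n_/_\n/
      s/^([^_]*)_(.*)_([^_]*)$/\1 --> \3_\2/
      y/\n_/_\n/
      # Print the block, then restart the cycle with the closing timestamp.
      p;g;$!s/^/\n/;D
    ' yourfile

Results

    00:00:10.730 --> 00:00:13.230
    this presentation is delivered by the

    00:00:13.230 --> 00:00:14.610
    Stanford center for professional

    00:00:14.610 --> 00:00:25.500
    development okay so let's get started

    00:00:25.500 --> 00:00:32.399
    with today's material so um welcome back

    00:00:32.399
    to the second lecture what I want to do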

Explanation

  • We collect a range of lines running from one timestamp line to the next.
  • At the end of the range, the closing timestamp is brought forward and joined to the opening timestamp on the first line, and the range is printed. The pattern space is then replaced with that closing timestamp, and control returns to the top of the sed script, so the next cycle starts from the current end of range and runs until the next timestamp or until we hit EOF.

A single-gawk approach for relatively "small" (by size) files, since the whole file is held in memory:

    gawk 'BEGIN { RS = ""; FS = "[[:space:]]+" }
    {
        c++
        a[c]["t"] = $1                          # timestamp of this block
        a[c]["s"] = substr($0, length($1) + 2)  # text following the timestamp
    }
    END {
        len = length(a)
        for (i = 1; i <= len; i++) {
            if ((i + 1) <= len) {
                printf("%s --> %s\n%s\n\n", a[i]["t"], a[i+1]["t"], a[i]["s"])
            } else {
                printf("%s\n%s\n", a[i]["t"], a[i]["s"])
            }
        }
    }' file

The output:

    00:00:10.730 --> 00:00:13.230
    this presentation is delivered by the

    00:00:13.230 --> 00:00:14.610
    Stanford center for professional

    00:00:14.610 --> 00:00:25.500
    development okay so let's get started

    00:00:25.500 --> 00:00:32.399
    with today's material so um welcome back

    00:00:32.399
    to the second lecture what I want to do
  • This loads the entire file in memory. Commented Jun 8, 2017 at 11:58
  • @SatoKatsura, added a note to my answer. Might be used for "small" files Commented Jun 8, 2017 at 12:02
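
The memory concern raised in the comments can be sidestepped by buffering only one block at a time. The following is a minimal sketch, not taken from any answer here: add_end_times is a hypothetical function name, the timestamp pattern assumes the exact HH:MM:SS.mmm form of the sample, and the two-block input is made up. It sticks to plain awk features (no arrays of arrays, no interval expressions), so it should not need gawk.

```shell
# add_end_times is a hypothetical helper name, not part of any answer above.
add_end_times() {
  awk '
    # Print the buffered block; next_ts is the following timestamp, or "" at end of input.
    function emit(next_ts) {
      if (!have) return
      if (next_ts != "") { print start " --> " next_ts } else { print start }
      printf "%s", body
    }
    # A line holding nothing but a timestamp such as 00:00:10.730.
    /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9][.][0-9][0-9][0-9]$/ {
      emit($0); start = $0; body = ""; have = 1; next
    }
    have { body = body $0 "\n"; next }   # text and blank lines of the current block
    { print }                            # anything before the first timestamp
    END { emit("") }
  ' "$@"
}

# Made-up two-block sample in the same layout as the question.
printf '%s\n' '00:00:10.730' 'this presentation' '' '00:00:13.230' 'Stanford center' | add_end_times
```

Only the current block is kept in memory, so the input size should not matter; the last block is printed without an end time, as in the question's desired output.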

With GNU sed and tac:

    tac file | \
      sed -E '/^[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3}$/ { x; G; s/^\n//; s/(.*)\n(.*)/\2 --> \1/; }' | \
      tac

The same could be written with traditional sed (i.e. without -E), but it would be more verbose.

With GNU awk and tac:

    tac file | \
      gawk --re-interval '
        /^[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3} --> / { old = $1 }
        /^[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3}$/ {
          if (old != "") $0 = $0 " --> " old
          old = $1
        }
        1' | \
      tac

Please note that the awk version can handle time intervals such as 00:00:14.610 --> 00:00:25.500 in the input file, while the sed version is fooled by them.

Note also that tac can be emulated with sed:

sed -n '1!G; $p; h' 

or like this:

sed '1!G; h; $!d' 

However both forms will load the entire input file in memory, so they aren't very efficient.
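
As a quick check of the first emulation, here is a small sketch on a made-up three-line input; reverse_lines is a hypothetical wrapper name used only for this illustration.

```shell
# reverse_lines wraps the sed idiom above: every line except the first gets the
# hold space appended (1!G), the accumulated text is saved back (h), and the
# whole buffer is printed only on the last line ($p).
reverse_lines() {
  sed -n '1!G; $p; h' "$@"
}

printf '%s\n' one two three | reverse_lines
# prints "three", "two", "one", one per line
```

Because the whole input accumulates in the hold space before anything is printed, this illustrates why the emulation is only reasonable for small inputs.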

Result of either tac pipeline:

    00:00:10.730 --> 00:00:13.230
    this presentation is delivered by the

    00:00:13.230 --> 00:00:14.610
    Stanford center for professional

    00:00:14.610 --> 00:00:25.500
    development okay so let's get started

    00:00:25.500 --> 00:00:32.399
    with today's material so um welcome back

    00:00:32.399
    to the second lecture what I want to do

I see loops or piping to other tools in the given answers, and I don't like that if it is not necessary. I like one-liners:

sed -E '/^[0-9:.]+$/{x;G;s/(.*)\n(.*)\n(\n)(.*)/\1 --> \4\3\2\3/p;d;};H;$!d;x' 

But let's go step by step:

  • I use ^[0-9:.]+$ as the extended regular expression for a timestamp line. This should be sufficient in the real world, but feel free to make it more precise. I use this pattern as an address, so everything inside the {} pair is executed for timestamp lines only.
  • Obviously we need to remember everything until the next timestamp arrives. In sed, remembering means appending to the hold space.
  • Thus, each time we meet a timestamp, we can assume that everything since the previous timestamp sits in the hold space. So we exchange pattern and hold space (x) and append the hold space, which now contains the current timestamp, to the pattern space (G). This way the current timestamp is already saved in the hold space for the next cycle, while everything we need is in the pattern space.
  • We just need to reorganize it with a substitution: s/(.*)\n(.*)\n(\n)(.*)/\1 --> \4\3\2\3/. Here \1 is the starting timestamp, \2 is the text line, \3 is a newline (we need one in the replacement, but POSIX doesn't define \n in the replacement) and \4 is the ending timestamp. It looks more complicated than it is.
  • Adding the p flag to the substitution and then deleting the pattern space prevents unwanted output for the first timestamp, when the hold space was still empty.
  • Now all that's left is to append every other line to the hold space (H) and delete it ($!d), and
  • on the last line, to exchange the buffers again (x), so that the lines collected in the hold space get printed even without a closing timestamp.
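
To see these steps in action, here is a small run of the one-liner on a made-up two-block sample. It assumes GNU sed and the layout used in the question (timestamp line, one text line, blank line); add_arrows is a hypothetical wrapper name.

```shell
# add_arrows is a hypothetical wrapper name for the one-liner above.
add_arrows() {
  sed -E '/^[0-9:.]+$/{x;G;s/(.*)\n(.*)\n(\n)(.*)/\1 --> \4\3\2\3/p;d;};H;$!d;x' "$@"
}

# Made-up sample: two blocks separated by a blank line.
printf '%s\n' '00:00:01.000' 'first line' '' '00:00:02.000' 'second line' | add_arrows
# prints:
# 00:00:01.000 --> 00:00:02.000
# first line
#
# 00:00:02.000
# second line
```

The first timestamp produces no output on its own; it is only printed once the second timestamp arrives and the substitution finds a complete block in the pattern space.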

If someone still feels sed is not elegant, I can't help.
