Extract content between first one match and first different match across one or multiple lines

Question

Let's say I want to get the text between the first match of "start_" and the first match of "_end", no matter whether they are on the same line or spread across multiple lines. The matches themselves should not be included.

Example text 1:

This is a start_text with start_and some_end text with_end

out text 1:

text with start_and some

Example text 2:

This is a start_text with start_and some_end text with_end

out text 2:

text with start_and some

I've seen lots of answers, but they are line-focused, not file-focused. Any kind of tool or command will do as long as it's console-based.

Stéphane Chazelas · Accepted Answer · 2023-12-18 08:37:06Z

With perl:

$ perl -l -0777ne 'print $1 while /start_(.*?)_end/gs' your-example-2
text with start_and some

-n is perl's equivalent of sed -n: the supplied expression is run for each record of the input (each line by default), and nothing is printed unless the expression prints it.
-l is for a newline to be automatically appended when printing¹
-<octal-number> sets the record separator to be the byte with the given value instead of newline. 0777 (511) or anything above 0377 (255) is a byte value that cannot exist, so there will be one and only one record: the whole file.
*? like * matches 0 or more of the preceding atom (here . which matches any single character), but while * would match as many as possible, *? matches as few as possible, so that .*? will run until the first occurrence of _end, not the last.
the s flag to the /regexp/ pattern matching operator is needed for . to also match on newline characters, which it doesn't by default.

You should be able to use pcregrep as well, however I find (with Debian's version 8.39 2016-06-14) that it gives:

$ pcregrep -Mo1 '(?s)start_(.*?)_end' your-example-2
text with start_and some and some

Which I can't explain. pcre2grep (version 10.42 2022-12-11) is OK though:

$ pcre2grep -Mo1 '(?s)start_(.*?)_end' your-example-2
text with start_and some

¹ Technically, it causes the record separator to be stripped from the input before storing in $_ and the output record separator ($\) to be set to the same value as the input record separator ($/), which at that point is still newline, so it's important that -l come before the -0.... Beware that -l<octal> sets the output record separator to the given byte value, so it's different from -l -<octal>.

pcregrep version 8.45 2021-06-15 gives the correct output, the same as perl.
– Smeterlink, Commented Dec 18, 2023 at 8:45
@Smeterlink, thanks for checking. I'm not too sure what the exact status of the old PCRE (which Debian calls pcre3) is on Debian, but in any case that old PCRE is discontinued and we should all be switching to PCRE2.
– Stéphane Chazelas, Commented Dec 18, 2023 at 8:52
pcre2grep version 10.42 2022-12-11 also works fine, like perl; the latest versions of both have the bugs fixed.
– Smeterlink, Commented Dec 18, 2023 at 9:06
