Let's say I want to get the text between the first match of "start_" and the first match of "_end", no matter if it's on the same line or across multiple lines. Don't include the matches themselves.

Example text 1:

This is a start_text with start_and some_end text with_end 

out text 1:

text with start_and some 

Example text 2:

This is a start_text with start_and some_end text with_end 

out text 2:

text with start_and some 

I've seen lots of answers, but they are line-focused, not file-focused. Any kind of tool or command will do as long as it's console-based.

1 Answer

With perl:

$ perl -l -0777ne 'print $1 while /start_(.*?)_end/gs' your-example-2
text with start_and some
  • perl -n is the sed -n mode: the expression supplied with -e is run for each record of the input (each line by default), and nothing is printed unless the expression asks for it.
  • -l is for a newline to be automatically appended when printing¹
  • -0<octal-number> sets the record separator to be the byte with the given value instead of newline. 0777 (511) or anything above 0377 (255) is a byte value that cannot exist, so there will be one and only one record: the whole file.
  • *? like * matches 0 or more of the preceding atom (here . which matches any single character), but while * would match as many as possible, *? matches as few as possible, so that .*? will run until the first occurrence of _end, not the last.
  • the s flag to the /regexp/ pattern matching operator is needed for . to also match on newline characters, which it doesn't by default.

You should be able to use pcregrep as well, however I find (with Debian's version 8.39 2016-06-14) that it gives:

$ pcregrep -Mo1 '(?s)start_(.*?)_end' your-example-2
text with start_and some and some

Which I can't explain. pcre2grep (version 10.42 2022-12-11) is OK though:

$ pcre2grep -Mo1 '(?s)start_(.*?)_end' your-example-2
text with start_and some

¹ Technically, it causes the record separator to be stripped from the input before storing it in $_, and the output record separator ($\) to be set to the same value as the input record separator ($/), which at that point is still newline, so it's important that -l come before -0.... Beware that -l<octal> sets the output record separator to the given byte value, so it's different from -l -0<octal>.

  • That works perfectly thanks! Commented Dec 18, 2023 at 8:25
  • pcregrep version 8.45 2021-06-15 gives the correct output the same as perl. Commented Dec 18, 2023 at 8:45
  • @Smeterlink, thanks for checking. I'm not too sure what the exact status of the old PCRE (which Debian calls pcre3) is on Debian, but in any case that old PCRE is discontinued and we should all be switching to PCRE2. Commented Dec 18, 2023 at 8:52
  • pcre2grep version 10.42 2022-12-11 also works fine, like perl. The latest versions of both have the bugs fixed. Commented Dec 18, 2023 at 9:06
