10

With grep I can filter lines. But if the lines are pretty long it gets messy. How can I only get "some chars around" my search-string?

f.txt

this is a red cat in the room
this is a blue house at the street
this is a white mouse in the corner
this is a blue mouse in the bowl

What I do

cat f.txt | grep blue 

What I get

this is a blue house at the street
this is a blue mouse in the bowl

But what I want is e.g. only 10 chars after my search word (wherever it is).

blue house
blue mouse

How can I get that?

1
  • Would returning the matching word and the following 2 or 3 words suffice? Commented Oct 28 at 3:40

5 Answers

9

Using GNU grep:

$ grep -oP '\w*blue\w*(\s*\w+)?' input.txt
blue house
blue mouse

This shows the complete word containing the match and the following word (if any).

Note that \w counts underscores as "word" characters, so, for example, "light_blue" is treated as one word, while "light-blue" is not. If you want the regex to treat any non-space character as part of the word, use \S instead of \w, e.g.

$ grep -oP '\S*blue\S*(\s*\S+)?' input.txt
blue house
blue mouse
light_blue mouse
light-blue mouse
blue-grey mouse
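Where grep was built without -P, roughly the same two behaviours can be sketched with POSIX bracket expressions and -E. This is only an approximation under stated assumptions: [[:alnum:]_] stands in for \w and [^[:space:]] for \S, the function names are made up for illustration, and the non-standard -o option is still required.

```shell
# Sketch only: word_style and nonspace_style are hypothetical helper names.
# [[:alnum:]_] approximates \w, [^[:space:]] approximates \S.
# Assumes a grep that supports the non-standard -o option.
word_style() {
  grep -oE '[[:alnum:]_]*blue[[:alnum:]_]*([[:space:]]*[[:alnum:]_]+)?'
}

nonspace_style() {
  grep -oE '[^[:space:]]*blue[^[:space:]]*([[:space:]]*[^[:space:]]+)?'
}

# The hyphenated word is split by the first pattern and kept whole by the second.
printf '%s\n' 'a light_blue mouse' 'a light-blue mouse' | word_style
# light_blue mouse
# blue mouse
printf '%s\n' 'a light_blue mouse' 'a light-blue mouse' | nonspace_style
# light_blue mouse
# light-blue mouse
```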
2
  • 1
    For anything more complicated than this, you should use perl, or maybe awk (but awk would be more work). Commented Oct 8 at 10:41
  • Using gawk: gawk '{ match($0,"blue"); delete a; if(RSTART!=0) { split(substr($0,RSTART+RLENGTH+1),a); print substr($0,RSTART,RLENGTH),a[1]; }}' f.txt Commented Oct 8 at 19:40
7
-E for extended regex

grep -oE 'blue.{0,6}' f.txt

-o with a basic regex (escaped braces)

grep -o 'blue.\{0,6\}' f.txt

grep & cut

grep -o 'blue.*' f.txt | cut -c1-10

Perl-RegEx

grep -Po 'blue.{0,6}' f.txt
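Since the question asks for characters around the search string, the same bounded-repetition idea can be applied on both sides of the match. The following is a minimal sketch, assuming a grep with -o and -E; the helper name around and its argument order are invented for illustration, and the word is used as a regex, not as a literal string.

```shell
# Sketch only: "around" is a hypothetical helper, not a standard command.
# usage: around WORD BEFORE AFTER < input
# Prints each match of WORD with up to BEFORE characters of leading context
# and up to AFTER characters of trailing context.
around() {
  grep -oE ".{0,$2}$1.{0,$3}"
}

printf '%s\n' 'this is a blue house at the street' | around blue 5 6
# is a blue house
```

As with the variants above, each match consumes its context, so occurrences that lie inside the context of an earlier match are not reported separately.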

2
  • 3
    Beware that the GNU implementation of cut still cuts based on the number of bytes with -c, like with -b, so it would not be equivalent to the other approaches and could end up cutting in the middle of a character. Commented Oct 8 at 18:40
  • 3
    Beware that on an input like blue12345blue, it would output blue12345b and not show the second blue. Commented Oct 8 at 18:41
2
$ cat file
This is the blue tips of blue teeth of a blue mouse in a blue house
$ grep -Eo 'blue.{0,10}' file
blue tips of b
blue mouse in
blue house

See how we're missing blue teeth... above as the start of that blue was swallowed by the .{0,10} of the previous search.

Alternatively, you could do:

$ pcre2grep -o1 -o2 '(blue)(?=(.{0,10}))' file
blue tips of b
blue teeth of
blue mouse in
blue house

Or:

$ pcre2grep -o -o1 'blue(?=(.{0,10}))' file
blue tips of b
blue teeth of
blue mouse in
blue house

Which shows them all and repeats the b of the second blue.

pcre2grep is the example command that comes with PCRE2, the library that GNU grep uses for its -P option. Support for -P has to be enabled at build time; that is not the default, but distributions often enable it, as perl regexps have become a de-facto standard these days.

-o is a non-standard extension originally introduced by the GNU implementation of grep. pcre2grep (and pcregrep before that) extended it so it can take an optional number argument to print what's matched by the corresponding capture group instead of by the whole regexp with -o alone (and a --om-separator to put between each capture group if given more than one -o<n>).

The trick here is the (?=...) look-ahead operator: what the pattern inside it matches is not part of the overall match. We're only looking ahead to check whether .{0,10} matches (which it always does, as it matches even the empty string), but we still capture what .{0,10} matches (using (...)), so we can report it without it consuming any input. So after finding the first blue in blue tips... and outputting that blue and the next 10 characters, pcre2grep resumes looking for more blues just after the first blue, not after blue tips of b.

Or you could use the real thing (the p in pcre2grep or grep -P):

$ perl -C -lne 'print $1.$2 while /(blue)(?=(.{0,10}))/g' file
blue tips of b
blue teeth of
blue mouse in
blue house

Or:

$ perl -C -lne 'print $&.$1 while /blue(?=(.{0,10}))/g' file
blue tips of b
blue teeth of
blue mouse in
blue house

Where $& is for the full match (the equivalent of -o), and $1/$2 for each capture group (equivalent of -o1/-o2).

2

Just use awk instead of grep, e.g. using any awk in any shell on every Unix box:

$ awk 'match($0,/blue/){print substr($0,RSTART,10)}' f.txt
blue house
blue mouse

The above assumes you just want to print the 10 chars starting from the first blue on each input line. If instead you wanted to print the 10 chars starting from every blue on each input line then, again using any awk, you could do:

$ echo 'there is a blueblue mouse blue foo in blue house' |
awk '{
    while ( match($0,"blue") ) {
        print substr($0,RSTART,10)
        $0 = substr($0,RSTART+RLENGTH)
    }
}'
blueblue m
blue mouse
blue foo i
blue house
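The loop above hard-codes both the word and the width. A parameterized variant might look like the following sketch. The function name after and its arguments are invented for illustration. It swaps match() for index() so the word is treated as a literal string instead of a regex, and it assumes the word is non-empty and contains no backslashes (awk -v interprets escape sequences).

```shell
# Sketch only: "after" is a hypothetical helper, not a standard command.
# usage: after WORD N < input
# Prints every occurrence of the literal WORD followed by up to N characters.
after() {
  awk -v w="$1" -v n="$2" '
    {
      s = $0
      # index() finds the next literal occurrence; 0 means no more matches.
      while ((i = index(s, w)) > 0) {
        print substr(s, i, length(w) + n)
        # Resume right after the word itself, so overlapping context is kept.
        s = substr(s, i + length(w))
      }
    }'
}

printf '%s\n' 'there is a blueblue mouse blue foo in blue house' | after blue 6
# blueblue m
# blue mouse
# blue foo i
# blue house
```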

If your requirements are for something other than that then edit your question to include more truly representative sample input/output that covers all of your requirements.

0

Using Raku (formerly known as Perl_6)

Return the first word matching "blue" per line and the 2 words that follow:

~$ raku -ne 'my @a = .words; my $i = @a.grep(/blue/, :k).first; put splice(@a, $i, 3) with $i;' file 

Raku is a Unicode-ready programming language in the Perl family, and thus well suited to thorny text-processing problems. While this question can be solved using Regexes/Substrings*, it might be simpler (and more readable) to grep for the first word matching "blue", then splice off and return the matching word along with 2 more trailing words (3 words total), as above.

Sample Input:

this is a red cat in the room
this is a blue house at the street
this is a white mouse in the corner
this is a bluefin tuna in the bowl
Here are four: the blue tips of blue teeth of a blue mouse in a blue house

Sample Output (1):

blue house at
bluefin tuna in
blue tips of

Return every word matching "blue" per line, and the following 2 words for each:

~$ raku -ne 'my @a = .words; my @i = @a.grep(/blue/, :k).List; for @i -> $i {my @b = @a; put splice(@b, $i, 3) with $i };' file 

If you want to match every occurrence of "blue" in a line's worth of words (and not just the first occurrence), you can create a List of matches and iterate through them with the code above.

Sample Output (2):

blue house at
bluefin tuna in
blue tips of
blue teeth of
blue mouse in
blue house

Finally, it can get messy trying to keep track of the originating line with multiple matches per line. Below is code that numbers the input line:

~$ raku -ne 'BEGIN my $n; ++$n; my @a = .words; my @i = @a.grep(/blue/, :k).List; for @i -> $i {my @b = @a; put $n, ": ", splice(@b, $i, 3) with $i };' file 

Sample Output (3):

2: blue house at
4: bluefin tuna in
5: blue tips of
5: blue teeth of
5: blue mouse in
5: blue house

*If you prefer up-to-ten characters following the word "blue" you can use this instead: raku -ne '$/.map( "blue" ~ * ).join("\n").put if m:g/ <?after blue> .**0..10 /;'

https://docs.raku.org/routine/splice
https://docs.raku.org
https://raku.org
