Why doesn't grep -E work as I expect for negative whitespace? i.e. [^\s]+

I wrote a regex to parse my .ssh/config

grep -Ei '^host\s+[^*\s]+\s*$' ~/.ssh/config

# cat ~/.ssh/config
Host opengrok-01-Eight
Hostname opengrok-01.company.com
Host opengrok-02-SIX
Hostname opengrok-02.company.com
Host opengrok-03-forMe
Hostname opengrok-03.company.com
Host opengrok-04-ForSam
Hostname opengrok-04.company.com
Host opengrok-05-Okay
Hostname opengrok-05.company.com
Host opengrok-05-Okay opengrok-03-forMe
IdentityFile /path/to/file
Host opengrok-*
User root

What I got was

Host opengrok-01-Eight
Host opengrok-03-forMe
Host opengrok-05-Okay
Host opengrok-05-Okay opengrok-03-forMe

Where are SIX and Sam?

It took me some time to realise that [^\s*]+, which I meant as "match anything that isn't whitespace or *, one or more times", was actually "match anything that isn't \, s or *, one or more times"!

The fix is surprisingly easy, because that regex works on regex101.com (which uses Perl-style regexes): switch -E for -P.

# grep -Pi '^host\s+[^*\s]+\s*$' ~/.ssh/config
Host opengrok-01-Eight
Host opengrok-02-SIX
Host opengrok-03-forMe
Host opengrok-04-ForSam
Host opengrok-05-Okay

What scares me is I have been using grep -E for years in lots of scripts and not spotted that before. Maybe I've just got lucky but more likely my test cases have missed that edge case!

Questions:

  1. Other than changing to use grep -P for all my extended regex how should I be writing my grep -E for this case?
  2. Are there any other nasty gotchas that I have been missing with -E or that will bite me if I use -P?

grep (GNU grep) 3.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.

Running on Windows 10, WSL running Ubuntu 18.04 (bash) ... but I got the same from a proper Linux install

2 Answers

11

The complement of \s is \S, not [^\s]. Inside a bracket expression, \s is just a literal backslash and a literal s, so [^\s] (with the help of -i) excluded 'SIX' and 'Sam' from the result because they contain an s.
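A minimal sketch of that effect, using three of the host names from the question as input. The variable names are illustrative only, and the behaviour described assumes a POSIX-conforming grep -E:

```shell
# Three host names taken from the config in the question.
names='opengrok-01-Eight
opengrok-02-SIX
opengrok-04-ForSam'

# [^\s] excludes a literal backslash and a literal s; -i extends that to S.
literal=$(printf '%s\n' "$names" | grep -Ei '^[^\s]+$')

# [^[:space:]] excludes whitespace only, so every name survives.
posix=$(printf '%s\n' "$names" | grep -Ei '^[^[:space:]]+$')

printf '%s\n' 'with [^\s]:' "$literal" 'with [^[:space:]]:' "$posix"
```

Only the name without an s or S should survive the first filter, while the second should keep all three.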


How to grep -i for lines starting with "host", followed by one or more whitespace characters and then a sequence of one or more characters up to the end of the line, none of which may be a literal * or whitespace:

grep -Ei '^host[[:space:]]+[^*[:space:]]+$' file 
Host opengrok-01-Eight
Host opengrok-02-SIX
Host opengrok-03-forMe
Host opengrok-04-ForSam
Host opengrok-05-Okay
1
  • I had tried using [[:space:]] but I didn't understand the syntax when it was already inside [^. I was trying [^[[:space:]]*]+. Seems obvious now I've been shown Commented Oct 29, 2020 at 12:45
6

Interpreting \s as whitespace is an extension of GNU Grep. It is not defined in POSIX. BSD Grep, for example, does not identify \s as whitespace. Perl regexes (-P) are also an extension to POSIX, so they are not guaranteed to be available in every grep either. For a totally portable expression, you should use [[:space:]] instead.

The GNU Grep manual states somewhat loosely that "most meta-characters lose their special meaning inside bracket expressions." You have found that \s is one of them, and it is in fact specified by POSIX (again) that the special characters ., *, [ and \ should lose their special meaning within a bracket expression. But you can still portably use [:space:].
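As a rough illustration of the POSIX rule, the sketch below feeds three made-up probe lines (one with a backslash, one with an s, one with only a space) through both bracket expressions. The probe strings and variable names are assumptions chosen for the demonstration:

```shell
# Hypothetical probe lines: a backslash, an s, and a lone space.
probes=$(printf '%s\n' 'a\b' 'sun' 'x y')

# Per POSIX, [\s] matches a backslash or an s, not whitespace.
bracket=$(printf '%s\n' "$probes" | grep -E '[\s]')

# [[:space:]] is the portable way to match whitespace inside brackets.
class=$(printf '%s\n' "$probes" | grep -E '[[:space:]]')

printf '%s\n' '[\s] matched:' "$bracket" '[[:space:]] matched:' "$class"
```

The first expression should pick up the backslash and s lines, and the second only the line containing a space.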

So, answering your two questions,

How should I be writing my grep -E for this case?

grep -Ei '^host[[:space:]]+[^*[:space:]]+[[:space:]]*$' 
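The trailing [[:space:]]*$ is what lets a Host line with stray whitespace at the end still match. A small sketch of the difference, assuming a hypothetical entry with two trailing spaces (that entry is not in the original config):

```shell
# Sample lines; the second has two trailing spaces (hypothetical case).
lines=$(printf '%s\n' \
  'Host opengrok-01-Eight' \
  'Host opengrok-02-SIX  ' \
  'Host opengrok-05-Okay opengrok-03-forMe' \
  'Host opengrok-*')

# Without the trailing class, whitespace before end-of-line blocks the match.
strict=$(printf '%s\n' "$lines" | grep -Ei '^host[[:space:]]+[^*[:space:]]+$')

# With it, trailing whitespace is tolerated.
tolerant=$(printf '%s\n' "$lines" | grep -Ei '^host[[:space:]]+[^*[:space:]]+[[:space:]]*$')

printf '%s\n' 'strict:' "$strict" 'tolerant:' "$tolerant"
```

Both variants still reject the multi-host line and the wildcard line; they differ only in how they treat trailing whitespace.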

Are there any other nasty gotchas that I have been missing with -E or that will bite me if I use -P?

A common mistake is to try the Perl non-greedy .*? with no -P flag.

$ echo 'AB 14 34' | grep -Eo '^.*?4'
AB 14 34
$ echo 'AB 14 34' | grep -Po '^.*?4'
AB 14
$ echo 'AB 14 34' | grep -o '^.*?4'
{nothing}
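When -P is not an option, one common workaround is to replace the non-greedy .*? with a negated bracket expression. The sketch below assumes the terminator is a single character (here 4) and that grep supports -o, which is an extension rather than part of POSIX:

```shell
line='AB 14 34'

# Greedy .* runs on to the last 4 on the line.
greedy=$(printf '%s\n' "$line" | grep -Eo '^.*4')

# [^4]* cannot cross a 4, so the match stops at the first one.
shortest=$(printf '%s\n' "$line" | grep -Eo '^[^4]*4')

printf '%s\n' "greedy:   $greedy" "shortest: $shortest"
```

This trick does not generalise to multi-character terminators; for those, -P or a different tool is usually the simpler route.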

The final word: BRE and ERE and PRE are all different. Know your regexes!

5
  • Thanks. Sorry I already gave the accepted to thanasisp as they got in first. Yours is a nice answer but is not quite enough of an improvement for me to switch. Commented Oct 29, 2020 at 12:49
  • @Martin No problem at all, both answers are quite similar and complement each other in some regards. Commented Oct 29, 2020 at 12:50
  • The main point is that while POSIX leaves the behaviour unspecified if you do grep '\s', it requires grep '[\s]' to match on either \ or s (which IMO is a poor decision as in practice using \ non-doubled inside bracket expressions to match a \ is not portable, so they might as well have left [\s] unspecified as well, allowing implementations to treat it specially, and require applications to do [\\] if they want to match on a \) Commented Oct 29, 2020 at 13:16
  • @StéphaneChazelas Thank you for pointing that, I have added the passage in POSIX that requires [\s] to match backslash or s. Commented Oct 29, 2020 at 13:22
  • Note that the ast-open implementation of grep supports .*? with -E. Again, POSIX leaves the behaviour for x*? unspecified so implementations are free to interpret it as they want such as implementing that perl extension. Commented Oct 29, 2020 at 13:26
