
I am removing stop words from a text, roughly using the code below.

I have the following files:

    $ cat file
    file
    types
    extensions
    $ cat stopwords
    i
    file
    types

grep -vwFf stopwords file

I am expecting the result:

    extensions

but I get the (I think incorrect) result:

    file
    extensions

It is as if the word file has been skipped in the stopwords file. Now here's the cool bit: if I modify the stopwords file by changing the single letter i on the first line to any other ASCII letter apart from f, i, l, e, then the same grep command gives me a different, correct result of extensions.

What is going on here and how do I fix it?

I'm using grep (BSD grep) 2.5.1-FreeBSD on Mac OS X, with GNU bash, version 4.4.12(1).

  • You may want to use the -x switch for line regex instead of -w for word? However I think the -F switch will cancel either of them out, or vice-versa. Commented Oct 15, 2017 at 11:39
  • grep (GNU grep) 3.1 works as you expect. Commented Oct 15, 2017 at 12:31
  • I have replicated this. Another datum: Making the i pattern the second rather than the first pattern in the stopwords file also alters the behaviour. Commented Oct 15, 2017 at 15:06
  • I can't reproduce the behaviour on OpenBSD 6.2 with native grep nor with GNU grep 3.1. Commented Oct 15, 2017 at 15:42

2 Answers


This was a bug in bsdgrep. It relates to a variable that tracks the part of the current line still to be scanned, which is overwritten by successive calls to the regular expression matching engine when multiple patterns are involved.
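To see whether a particular grep is affected, a minimal check along the following lines may help. It recreates the two files from the question in a scratch directory and compares the output with the expected single line. The variable names `tmpdir`, `actual` and `status` are illustrative only, and the result depends entirely on which grep is first in your `PATH`.

```shell
# Scratch copies of the question's inputs (the directory is temporary).
tmpdir=$(mktemp -d)
printf '%s\n' file types extensions > "$tmpdir/file"
printf '%s\n' i file types > "$tmpdir/stopwords"

# With a correct grep, only "extensions" survives the stop word filter.
actual=$(grep -vwFf "$tmpdir/stopwords" "$tmpdir/file")
if [ "$actual" = "extensions" ]; then
    status="ok"
else
    status="affected"
fi
printf 'grep -vwFf: %s\n' "$status"

# Clean up the scratch directory.
rm -rf "$tmpdir"
```

A result of "affected" here only says that the output differs from the expected one; it does not by itself prove that the cause is this specific bug.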

local fix

You can work around this to an extent by not using the -w option, which relies upon this variable for correct operation and therefore fails. Instead, use the regular expression extensions that match the beginnings and ends of words, so that your stopwords file looks like:

    \<i\>
    \<file\>
    \<types\>

This workaround will also require that you do not use the -F option.
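As a rough sketch of that workaround, the word-boundary patterns can be generated from the existing stopwords file rather than typed by hand. The file name `stopwords.re` and the `sed` step are assumptions made for illustration, not part of the original setup, and the sketch assumes the stop words contain no regular expression metacharacters.

```shell
# Recreate the two files from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' file types extensions > "$tmpdir/file"
printf '%s\n' i file types > "$tmpdir/stopwords"

# Wrap each stop word in \< and \> so the pattern itself carries the word boundaries.
sed 's/.*/\\<&\\>/' "$tmpdir/stopwords" > "$tmpdir/stopwords.re"

# No -w and no -F: the patterns are now ordinary regular expressions.
result=$(grep -vf "$tmpdir/stopwords.re" "$tmpdir/file")
printf '%s\n' "$result"

# Clean up the scratch directory.
rm -rf "$tmpdir"
```

If the stop words could contain characters such as `.` or `*`, they would need escaping before being wrapped, since they are no longer treated as fixed strings.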

Note that the documented regular expression components [[:<:]] and [[:>:]] that the re_format manual tells you about will not work here. This is because the regular expression library that is compiled into bsdgrep has GNU regular expression compatibility support turned on. This is another bug, which is reportedly fixed.

service fix

This bug was fixed earlier this year. The fix has not yet made it into the STABLE or RELEASE flavours of FreeBSD, but is reportedly in CURRENT.

For getting this into the MacOS version of grep, that is derived from FreeBSD's bsdgrep, please consult Apple. ☺


  • Nice, and thanks for reporting this upstream. I would find this answer even more fascinating if it quoted the buggy code. Commented Oct 15, 2017 at 17:57

This code:

    pl " Input data file data1 and stopwords file data2:"
    head data1 data2

    pl " Expected output:"
    cat $E

    pl " Results, grep:"
    # grep -vwFf stopwords file
    grep -vwFf data2 data1

    pl " Results, cgrep:"
    cgrep -x1 -vFf data2 data1

produces:

    -----
     Input data file data1 and stopwords file data2:
    ==> data1 <==
    file
    types
    extensions

    ==> data2 <==
    i
    file
    types

    -----
     Expected output:
    extensions

    -----
     Results, grep:
    file
    extensions

    -----
     Results, cgrep:
    extensions

On a system like:

    OS, ker|rel, machine: Apple/BSD, Darwin 16.7.0, x86_64
    Distribution        : macOS 10.12.6 (16G29), Sierra
    bash GNU bash 3.2.57

More details on cgrep, available via brew, and from sourceforge:

    cgrep   shows context of matching patterns found in files (man)
    Path    : ~/executable/cgrep
    Version : 8.15
    Type    : Mach-O64-bitexecutablex86_64 ...)
    Home    : http://sourceforge.net/projects/cgrep/ (doc)

cheers, drl

  • just got myself a new grep. Commented Oct 17, 2017 at 12:26
  • @Tim -- I hope you find cgrep as useful as I have. The speed on the tests I have done put it roughly on a par with GNU grep, and the "context / windowing" features are very useful. It also builds easily on Linux systems ... cheers, drl Commented Oct 18, 2017 at 12:26
