
I am using the command grep -f inputfile searchfile. However, the resulting output appears sorted and all duplicates are removed. This is not what I want, and I cannot find a way to prevent it. Is there a way to get the exact same order as in the inputfile?

I don't understand why grep is doing this because, as far as I remember, this should not happen. What could I be doing wrong here?

An example: my inputfile looks like (first few lines shown):

Moraceae
Poaceae
Fagaceae
Rosaceae
Betulaceae
Salicaceae

My search file looks like (first few lines shown):

Acanthaceae 0 1 0 1 1 0 10 0 10 1 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0
Adoxaceae 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Aizoaceae 0 0 0 0 3 0 0 0 1 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Amaranthaceae 0 1 0 0 7 0 5 0 4 1 0 0 6 0 0 0 0 0 0 0 1 4 0 0 0

The output looks like (first few lines shown):

Acanthaceae 0 1 0 1 1 0 10 0 10 1 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0
Aizoaceae 0 0 0 0 3 0 0 0 1 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Amaranthaceae 0 1 0 0 7 0 5 0 4 1 0 0 6 0 0 0 0 0 0 0 1 4 0 0 0
Amaryllidaceae 0 1 0 0 6 1 11 0 6 4 2 0 0 0 0 0 0 0 0 0 1 8 0 0 0
Anacardiaceae 0 1 0 0 0 0 20 0 7 0 0 1 0 0 0 0 0 0 0 0 1 10 0 0 0

The names are printed in red, meaning that they matched. But I want to retrieve the exact same order and occurrence as in my inputfile, starting with "Moraceae". Any ideas?

Thanks a lot in advance!

  • Consider showing an example of this behaviour. Also, you are aware that inputfile should contain the regular expressions that you want to look for in searchfile? Commented Jan 8, 2021 at 15:05
  • My version of grep doesn't do that. Commented Jan 8, 2021 at 15:06
  • Good idea, I added an example. Commented Jan 8, 2021 at 15:22

2 Answers


The behavior you show (although not immediately apparent from your example, since you don't show the full content of your inputfile) is expected. grep searches line-wise, meaning that every line of searchfile is checked against all patterns in inputfile and printed if any of them apply.
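To see this concretely, here is a minimal sketch with two invented files (the family names are just placeholders):

```shell
# Patterns listed in one order, data lines in another:
printf 'Poaceae\nMoraceae\n' > inputfile
printf 'Moraceae 1 2\nPoaceae 3 4\n' > searchfile

# grep prints matches in the order they appear in searchfile,
# not in the order of the patterns in inputfile:
grep -f inputfile searchfile
# Moraceae 1 2
# Poaceae 3 4
```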

What you want to do seems to be a sequential check of searchfile against the patterns in inputfile, which is only possible in a multi-pass or buffering scenario.

You could write it as a bash loop (=multipass approach) or use awk (=buffering approach):

awk '
NR==FNR { pat[++np]=$0; next }
{
    for (i=1; i<=np; i++)
        if ($0 ~ pat[i]) buf[i] = buf[i] ? buf[i] ORS $0 : $0
}
END {
    for (i=1; i<=np; i++)
        if (buf[i]) print buf[i]
}' inputfile searchfile

This will first read all patterns in inputfile into an array pat, keeping the order of inputfile.

Once processing of inputfile is finished, it checks every line of searchfile against each pattern and stores matching lines, grouped by pattern, in a buffer array buf. At the end, the buffer content is printed, resulting in matched lines being printed

  • in the order of their matching pattern according to inputfile
  • and secondary to that in the order they appeared in searchfile
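As a sketch with two small invented files, the reordering can be observed directly:

```shell
# Patterns in one order, data lines in another (with a repeated family):
printf 'Poaceae\nMoraceae\n' > inputfile
printf 'Moraceae 1 2\nPoaceae 3 4\nMoraceae 5 6\n' > searchfile

# Matched lines come out grouped by pattern, in inputfile order;
# within each group, searchfile order is preserved:
awk 'NR==FNR{pat[++np]=$0;next}
     {for (i=1;i<=np;i++) if ($0~pat[i]) buf[i]=buf[i] ? buf[i] ORS $0 : $0}
     END{for (i=1;i<=np;i++) if (buf[i]) print buf[i]}' inputfile searchfile
# Poaceae 3 4
# Moraceae 1 2
# Moraceae 5 6
```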

Note that while using a dedicated tool like awk is usually a more efficient solution than using shell loops, this relies on buffering the relevant text and can become a resource problem if your inputfile and/or the amount of matched lines in searchfile becomes large.

Caveat: In its current form, the program uses full regular expression matching, meaning that the content of inputfile must not contain characters special to regular expressions (or, if it does, they must be properly escaped). If this is not what you want and a literal string comparison should be performed instead, change the condition

if ($0 ~ pat[i]) 

to either

if ($1 == pat[i]) 

if the taxonomic category can only amount to single-word names (likely in your example), or

if (index($0,pat[i])==1) 

if the names can contain multiple space-separated words.
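The difference matters whenever a name contains a regex metacharacter. A minimal sketch, using an invented dotted name (not from the question's data):

```shell
printf 'C.elegans\n' > inputfile
printf 'CXelegans 1\nC.elegans 2\n' > searchfile

# Regex matching: "." matches any single character, so both lines match:
awk 'NR==FNR{pat[++np]=$0;next}
     {for (i=1;i<=np;i++) if ($0~pat[i]) print}' inputfile searchfile
# CXelegans 1
# C.elegans 2

# Literal comparison of the first field: only the exact name matches:
awk 'NR==FNR{pat[++np]=$0;next}
     {for (i=1;i<=np;i++) if ($1==pat[i]) print}' inputfile searchfile
# C.elegans 2
```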

  • Thanks so much AdminBee for the clear explanation, really helpful. I learnt my lesson and need to try avoiding shell loops. Commented Jan 8, 2021 at 17:08
  • @TUnix You're welcome! Commented Jan 8, 2021 at 17:09

To preserve duplicates and print in the same order as inputfile, do this:

while read -r pattern
do
    grep "$pattern" searchfile
done < inputfile
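With two small invented files, one can see that both the order and the duplicates of inputfile are preserved:

```shell
# inputfile lists Poaceae twice, and in a different order than searchfile:
printf 'Poaceae\nMoraceae\nPoaceae\n' > inputfile
printf 'Moraceae 1 2\nPoaceae 3 4\n' > searchfile

# One grep pass per pattern, so output follows inputfile,
# and a repeated pattern yields a repeated match:
while read -r pattern
do
    grep "$pattern" searchfile
done < inputfile
# Poaceae 3 4
# Moraceae 1 2
# Poaceae 3 4
```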
  • Could you add an explanation as to why they should try your code? Consider explaining what issue your code solves and how it solves it. Also note that you probably want IFS= read -r (in case there are backslashes in the regular expressions or flanking whitespace) and to use grep -e "$pattern" (in case a pattern starts with a dash). An issue with your code is what would happen when several patterns match a single line in searchfile. Commented Jan 8, 2021 at 19:03
  • @Kusalananda Good ideas. I don't think IFS matters if not splitting the input line. Also not using grep -e since the original question isn't. Commented Jan 8, 2021 at 19:40
  • IFS matters in the case that flanking whitespaces on lines in inputfile matters. The command in the question does not need -e since it doesn't take regular expressions on the command line. Your code does. Commented Jan 8, 2021 at 19:44
