
I'm trying to give grep a pattern file (through -f), but I want to know which patterns are matching something in the search file.

For example, given 1.txt:

a/(.*)
b/(.*)
b/c/(.*)
b/foo/(.*)
d/(.*)
e/(.*)

and 2.txt:

a/
a/foo/bar/
b/foo/
d/foo/

The patterns from 1.txt that match something in 2.txt are (omitting the (.*) suffix) as follows:

a/
b/
b/foo/
d/

How can I "find the list of patterns that have a match"?

EDIT: I'm only looking for a prefix match but I think the question is interesting enough for general pattern matching.


EDIT: Since a for-loop based solution has been given, I should say I'm not looking to call grep 10000 times. :) The working solution I already have (listed below) is pretty slow:

for line in "${file1_arr[@]}"; do
  if ! grep -qE "^$v(.*)\$"; then
    echo "$line"
  fi
done

Ideally I'm looking for a single grep call or so with less overhead.
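
(For the prefix-only case, a single pass that hashes the stripped prefixes and looks up every "/"-terminated prefix of each line of 2.txt would avoid per-pattern grep calls entirely. A minimal awk sketch of that idea — just an illustration, not the approach used in the answers below, and it assumes every pattern is a fixed prefix ending in "/" followed by "(.*)":

awk '
NR==FNR {                               # first file: the patterns
    p = $0
    sub(/\(\.\*\)$/, "", p)             # strip the trailing "(.*)", keep the fixed prefix
    want[p]
    next
}
{                                       # second file: the data
    for (i = 1; i <= length($0); i++)
        if (substr($0, i, 1) == "/") {
            pre = substr($0, 1, i)      # every "/"-terminated prefix of the line
            if (pre in want) {
                print pre               # a pattern (minus its "(.*)") that matched
                delete want[pre]        # report each pattern only once
            }
        }
}' 1.txt 2.txt

For the sample files this would print a/, b/, b/foo/ and d/, and the work is proportional to the total length of 2.txt rather than patterns × lines.)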

8 Comments
  • Do all of your patterns consist of fixed strings plus (.*)? (By the way, I don't think your "working solution" will work at all. I'm guessing you tweaked it between when you tried it and when you posted it here.) Commented May 27, 2018 at 6:12
  • Putting (.*)$ at the end of all your patterns can only slow matching down. It has no impact on what is matched. Commented May 27, 2018 at 7:12
  • What output do you expect if more than one pattern matches a given line? Commented May 27, 2018 at 7:12
  • Are you sure b/foo/(.*) from 1.txt should match b/foo from 2.txt? Commented May 27, 2018 at 11:29
  • Finally, can you please specify what sorts of patterns you will be using and roughly how many there are. Your examples suggest that they are all simple prefix matches without regex operators; if that's true, there are some very efficient solutions. Commented May 27, 2018 at 17:13

4 Answers


In awk:

$ awk 'NR==FNR{a[$0]=FNR;next}{for(i in a)if($0 ~ i)print i,$0}' 1.txt 2.txt
a/(.*) a/
a/(.*) a/foo/bar
b/(.*) b/foo
d/(.*) d/foo

Explained:

$ awk '               # yes
NR==FNR {             # process first file
    a[$0]=FNR         # hash regex, store record number just in case
    next              # process next record
}
{                     # process second file
    for(i in a)       # loop every entry in 1.txt
        if($0 ~ i)    # if regex matches record
            print i,$0    # print all matching regex and record
}' 1.txt 2.txt

Edit: To output each regex just once (as shown in the expected output) you could delete the regex from a once it has been used; that way it won't get matched and printed more than once:

$ awk '
NR==FNR {
    a[$0]
    next
}
{
    for(i in a)
        if($0 ~ i) {
            print i
            delete a[i]    # deleted regex won't get matched again
        }
}' 1.txt 2.txt
vendor/cloud.google.com/go/compute/metadata/(.*)$
vendor/cloud.google.com/go/compute/(.*)$
vendor/cloud.google.com/go/(.*)$
vendor/cloud.google.com/(.*)$
vendor/github.com/Azure/azure-sdk-for-go/arm/dns/(.*)$
vendor/github.com/Azure/azure-sdk-for-go/arm/(.*)$
vendor/github.com/Azure/azure-sdk-for-go/(.*)$
vendor/github.com/Azure/(.*)$
vendor/github.com/(.*)$

Also, my test showed about a 60% reduction in runtime (mini laptop, from 1:16 down to 29 s) with this modification for GNU awk (using the data you provided in the comments, file1.txt and file2.txt):

$ awk '
BEGIN {
    FS="."                      # . splits the url
}
NR==FNR { a[$1][$0]; next }     # we index on the first part of url
{
    for(i in a[$1])             # search space decreased
        if($0 ~ i) {
            print i
            delete a[$1][i]
        }
}' file1.txt file2.txt

The speedup comes from decreasing the search space: the start of each string up to the first period is used as the key for the hash, i.e.:

FS="." # split at first . ... a[vendor/github][vendor/github.com/Azure/(.*)$] # example of a hash ... for(i in a[$1]) # search space decreased 

Now it does not have to search the whole hash for a matching regex. It would probably be more effective to use FS="/" ; a[$1 FS $2], but this was just a quick test.
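
A rough, untested sketch of that FS="/" variant might look like this (it needs GNU awk for its arrays of arrays, and it assumes the fixed part of every pattern extends past the second "/", as it does in file1.txt and file2.txt; otherwise the keys won't line up):

$ awk '
BEGIN { FS="/" }                      # split on "/" so $1 FS $2 is e.g. "vendor/github.com"
NR==FNR { a[$1 FS $2][$0]; next }     # index the regexps on their first two path components
{
    if (($1 FS $2) in a)              # only consider patterns sharing that key
        for (i in a[$1 FS $2])
            if ($0 ~ i) {
                print i
                delete a[$1 FS $2][i]
            }
}' file1.txt file2.txt

The guard on ($1 FS $2) in a just avoids touching keys that only occur in file2.txt.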


14 Comments

Excellent answer. IMHO awk based solutions are ideal for heavy duty pattern matching problems.
@rici : please explain, your comment is too short :)
@rann The benchmark could be comparing the time needed using the original solution, the excellent awk one and my sed solution. And perhaps add that awk not only performs well, but will also be easier to read/maintain than code full of backslashes, and can cope better with input containing special characters (in my solution I really hope that the input files don't have # or \a characters).
If there are really 10000 patterns, as suggested by OP, it's possible that awk is still faster than grep, but it's by no means certain, which is why I suggested benchmarking. The ideal would be to check all the patterns in parallel, which would be straightforward if the patterns are limited to (mathematical) regular expressions, or even better, to fixed strings as appears to be the case in the example.
I do have a dir called a_million_files for testing but I don't feel like creating 10000 regex patterns. :D

The following script:

#!/usr/bin/env bash
lines=$(wc -l < 1.txt)
for (( i=1; i<=$lines; i++ )); do
  line=$(sed -n "$i"p 1.txt)
  line=$(sed "s/\/(.*)$//" <<< "$line")
  grep -E "$line" 2.txt 1>/dev/null && echo "$line"
done

prints lines in 1.txt that matched in 2.txt:

a
b
b/foo
d

comments:

# gets a single line from 1.txt
line=$(sed -n "$i"p 1.txt)

# removes trailing pattern /(.*) from $line variable
line=$(sed "s/\/(.*)$//" <<< "$line")

# if $line matches in 2.txt, print $line
grep -E "$line" 2.txt 1>/dev/null && echo "$line"

3 Comments

I already have this for loop. The problem is it's too slow, because it's invoking grep >10000 times in my case, which has process overhead. I'm ideally looking for a single grep command.
The solution I already have was: for line in "${values[@]}"; do; if ! grep -qE "^$v(.*)\$"; then echo "$line"; fi; done <-- But this is pretty slow.
You can include -m1 option to speed up grep. But this doesn't solve the main problem, i.e. the multiple calls to grep.

I tried the awk- and sed-based solutions, and I realized I can do this much faster using bash's built-in regex engine if I read both files into memory.

Here's the gist of it:

text="$(cat 2.txt)" # read 2.txt while read -r line; do # for each 'line' from 1.txt re=[^\b]*${line} # prepend ^ or \b to the pattern if [[ "$text" =~ $re ]]; then # match the pattern to 2.txt echo "${line}" # if there's a match, print the pattern fi done < <(cat "1.txt") 

Since this doesn't spawn any extra processes and just does it in-memory, I suspect this is quite efficient. My benchmarks with the files I linked under James' answer show 8-9 seconds for this.
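
If the patterns should only match at the beginning of a line in 2.txt (the prefix case from the question), one small tweak to the same idea is to anchor each pattern at a line boundary instead of prepending [^\b]*. A sketch under that assumption (note that a . inside a pattern can still match across newlines, since bash's ERE matching has no multiline mode):

text="$(cat 2.txt)"               # read 2.txt into memory once
nl=$'\n'
while read -r line; do            # for each pattern from 1.txt
  re="(^|${nl})${line}"           # match only at the start of the text or of a line
  if [[ "$text" =~ $re ]]; then
    echo "$line"                  # the pattern matched somewhere in 2.txt
  fi
done < 1.txt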

2 Comments

With your question input this works fine. It works differently with the regexp f/(.*)g: your solution can match f/ on one line and g on another. If that is what you want, you have found the only correct solution; otherwise check my updated answer.
This was really helpful, when I had to find out which of ~500000 patterns matched a specific line in a huge file.

I don't see a solution with grep, but sed is an alternative to awk. With sed I would prefer to see patterns like b/foo/.* in 1.txt, but I will show a solution based on the (.*).
The purpose of the first command is to construct sed commands that replace the input line with the regular expression whenever it matches that regular expression. The generated lines must look like

sed -rn 's#b/c/(.*)#b/c/#p' 2.txt 

and this can be done with

# Use subprocess
sed 's/\(.*\)\(([.][*])\)/s#\1\2#\1#p/' 1.txt

# resulting in
sed -rnf <(sed 's/\(.*\)\(([.][*])\)/s#\1\2#\1#p/' 1.txt) 2.txt | sort -u

The solution is a bit difficult to read, caused by the layout of 1.txt, where I would rather have lines like b/foo/.*.

The above commands have two bugs:

When the match is on only part of the line, the non-matched part will be shown in the output. This can be fixed by matching the garbage as well:

# Use lines like 's#.*b/foo(.*)#b/foo#p'
sed -rnf <(sed 's/\(.*\)\(([.][*])\)/s#.*\1\2#\1#p/' 1.txt) 2.txt | sort -u

The second bug is that strings in 2.txt that have two matches will be matched only once (the first match edits the line in the stream).
This can be fixed by adding a unique marker (I will use \a) to the matching lines and repeating the input lines on the output (with \n&). The output can then be filtered by looking for the \a markers.

sed -rnf <(sed 's/\(.*\)\(([.][*])\)/s#.*\1\2#\\a\1\\n\&#p/' 1.txt) 2.txt| sed -rn '/\a/ s/.(.*)/\1/p' | sort -u 

EDIT:
The work-around with a marker and restoring the original input is not needed when you follow a different approach.
In sed you can print something to stdout without changing the stream.
One possibility (slow for this situation) is using

sed '/something/ e echo "something" '

Another possibility is using the "x" command (which eXchanges the pattern space with the hold buffer). You actually want a sed script with commands like

\%a/%     {h;s%.*%a/%p;x}
\%b/%     {h;s%.*%b/%p;x}
\%b/c/%   {h;s%.*%b/c/%p;x}
\%b/foo/% {h;s%.*%b/foo/%p;x}
\%d/%     {h;s%.*%d/%p;x}
\%e/%     {h;s%.*%e/%p;x}

Using the above method, the sed solution simplifies to

sed -nf <( sed 's#([.][*])##; s#.*#\\%&% {h;s%.*%&%p;x} #' 1.txt ) 2.txt | sort -u 

When the file 1.txt is not changed often, you might want to preprocess that file.

sed 's#([.][*])##; s#.*#\\%&% {h;s%.*%&%p;x} #' 1.txt > /tmp/sed.in
sed -nf /tmp/sed.in 2.txt | sort -u

