
I have 3 files.

File 1:

1111111
2222222
3333333
4444444
5555555

File 2:

6666666
7777777
8888888
9999999

File 3:

8888888 7777777
9999999 6666666
4444444 8888888

I want to search file 3 for lines that contain a string from both file 1 and file 2, so the result of this example would be:

4444444 8888888 

because 4444444 is in file 1 and 8888888 is in file 2.

I currently have a solution, however my files contain 500+ lines and it can take a very long time to run my script:

#!/bin/sh
cat file1 | while read line
do
    cat file2 | while read line2
    do
        grep -w -m 1 "$line" file3 | grep -w -m 1 "$line2" >> results
    done
done

How can I improve this script to make it run faster?

2 Answers

The current process is going to be slow due to the repeated scans of file2 (once for each row in file1) and file3 (once for each row in the Cartesian product of file1 and file2). The additional sub-processes spawned (as a result of the pipes |) also slow things down.

So, to speed this up we want to look at reducing the number of times each file is scanned and limit the number of sub-processes we spawn.

Assumptions:

  • there are only 2 fields (when using white space as the delimiter) in each row of file3 (e.g., we won't see a row like "field1 has several strings" "and field2 does, too"); otherwise we will need to come back and revisit the parsing of file3

First our data files (I've added a couple extra lines):

$ cat file1
1111111
2222222
3333333
4444444
5555555
5555555     # duplicate entry
$ cat file2
6666666
7777777
8888888
9999999
$ cat file3
8888888 7777777
9999999 6666666
4444444 8888888
8888888 4444444       # switched position of values
8888888XX 4444444XX   # larger values; we want to validate that we're matching on exact values and not sub-strings
5555555 7777777       # want to make sure we get a single hit even though 5555555 is duplicated in `file1`

One solution using awk:

$ awk '
BEGIN  { filenum=0 }
FNR==1 { filenum++ }
filenum==1 { array1[$1]++ ; next }
filenum==2 { array2[$1]++ ; next }
filenum==3 { if ( array1[$1]+array2[$2] >= 2 || array1[$2]+array2[$1] >= 2 )
                 print $0
           }
' file1 file2 file3

Explanation:

  • this single awk script will process our 3 files in the order in which they're listed (on the last line)
  • in order to apply different logic to each file we need to know which file we're processing; we'll use the variable filenum to keep track of the file we're currently processing
  • BEGIN { filenum=0 } - initialize our filenum variable; while the variable should automatically be set to zero the first time it's referenced, it doesn't hurt to be explicit
  • FNR maintains a running count of the records processed for the current file; each time a new file is opened FNR is reset to 1
  • when FNR==1 we know we just started processing a new file, so increment our variable { filenum++ }
  • as we read values from file1 and file2 we're going to use said values as the indexes for the associative arrays array1[] and array2[], respectively
  • filenum==1 { array1[$1]++ ; next } - create an entry in our first associative array (array1[]) with the index equal to field1 (from file1); the value of the array element will be a number > 0 (1 == the field appears once in the file, 2 == twice); next says to skip the rest of the processing and go to the next row of the current file
  • filenum==2 { array2[$1]++ ; next } - same as previous command except in this case we're saving fields from file2 in our second associative array (array2[])
  • filenum==3 - optional because if we get this far in this script we have to be on our third file (file3); again, doesn't hurt to be explicit (and makes this easier to read/understand)
  • { if ( ... ) } - test if the fields from file3 exist in both file1 and file2
  • array1[$1]+array2[$2] >= 2 - if (file3) field1 is in file1 and field2 is in file2 then we should find matches in both arrays and the sum of the array element values should be >= 2
  • array1[$2]+array2[$1] >= 2 - same as the previous test except we're checking for our 2 fields (from file3) being in the opposite source files/arrays
  • print $0 - if our test returns true (ie, the current fields from file3 exist in both file1 and file2) then print the current line (to stdout)
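As a side note, the count-summing tests can equivalently be written with awk's `in` membership operator, which checks whether an index exists without needing the counts at all; a minimal, runnable sketch using the question's original data (recreated in a temp dir so it stands alone):

```shell
# Recreate the question's three files in a temp dir so the sketch runs standalone
tmp=$(mktemp -d)
printf '%s\n' 1111111 2222222 3333333 4444444 5555555 > "$tmp/file1"
printf '%s\n' 6666666 7777777 8888888 9999999 > "$tmp/file2"
printf '%s\n' '8888888 7777777' '9999999 6666666' '4444444 8888888' > "$tmp/file3"

# Same single-pass structure as above, but "idx in array" replaces the
# count-summing test; a bare pattern with no action prints the matching line
awk '
FNR==1 { filenum++ }
filenum==1 { array1[$1] ; next }
filenum==2 { array2[$1] ; next }
($1 in array1 && $2 in array2) || ($2 in array1 && $1 in array2)
' "$tmp/file1" "$tmp/file2" "$tmp/file3"
```

Against the question's sample data this prints the single line `4444444 8888888`, the same result as the count-summing version.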

Running this awk script against my 3 files generates the following output:

4444444 8888888   # same as the desired output listed in the question
8888888 4444444   # verifies we still match if we swap positions; also verifies
                  # we're matching on actual values and not a sub-string (ie, no
                  # sign of the row `8888888XX 4444444XX`)
5555555 7777777   # only shows up in output once even though 5555555 shows up
                  # twice in `file1`

At this point we've a) limited ourselves to a single scan of each file and b) eliminated all sub-process calls, so this should run rather quickly.

NOTE: One trade-off of this awk solution is the requirement for memory to store the contents of file1 and file2 in the arrays, which shouldn't be an issue for the relatively small data sets referenced in the question.
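A different sketch, not from the answer: if the requirement is only "the line contains a word from file1 AND a word from file2" (rather than per-field checks), two fixed-string grep passes do the same filtering, with each file read once per pass; this assumes grep's standard -w (whole word), -F (fixed strings), and -f (patterns from file) options:

```shell
# Recreate the question's files in a temp dir so the sketch runs standalone
tmp=$(mktemp -d)
printf '%s\n' 1111111 2222222 3333333 4444444 5555555 > "$tmp/file1"
printf '%s\n' 6666666 7777777 8888888 9999999 > "$tmp/file2"
printf '%s\n' '8888888 7777777' '9999999 6666666' '4444444 8888888' > "$tmp/file3"

# First pass keeps file3 lines containing any whole-word fixed-string pattern
# from file1; the second pass narrows to lines that also contain a word from
# file2. -w prevents sub-string hits such as 4444444 matching 4444444XX.
grep -wFf "$tmp/file1" "$tmp/file3" | grep -wFf "$tmp/file2"
```

Note the looser semantics: this doesn't care which field matched, only that each pattern file contributed at least one whole-word hit somewhere on the line.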


1 Comment

Works perfectly and almost instant, this will knock hours off my daily scripts. thanks

You can do it faster if you load all the data first and then process it:

f1=$(cat file1)
f2=$(cat file2)
IFSOLD=$IFS; IFS=$'\n'
f3=( $(cat file3) )
IFS=$IFSOLD
for item in "${f3[@]}"; {
    sub=( $item )
    test1=${sub[0]}; test1=${f1//[!$test1]/}
    test2=${sub[1]}; test2=${f2//[!$test2]/}
    [[ "$test1 $test2" == "$item" ]] && result+="$item\n"
}
echo -e "$result" > result

1 Comment

Replace f1=$(cat file1) with IFS=$'\n' read -r -d '' -a f1 <file1, and do the same for file2 and file3. Replace sub=( $item ) with IFS=' ' read -r test1 test2 <<<"$item", or set -- $item; test1="$1"; test2="$2" and then use $1 and $2 directly. Rather than appending with >> result inside the loop, move the redirection to the end of the loop: for item in "${f3[@]}"; {...} > result.
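Pulling the comment's suggestions together, a sketch of the loop rewritten in that style; it swaps the answer's character-class trick for bash (4+) associative arrays, mirroring the membership test from the awk answer (the array names a1/a2 and the temp-dir setup are my own, not from either answer):

```shell
# Recreate the question's files in a temp dir so the sketch runs standalone
tmp=$(mktemp -d)
printf '%s\n' 1111111 2222222 3333333 4444444 5555555 > "$tmp/file1"
printf '%s\n' 6666666 7777777 8888888 9999999 > "$tmp/file2"
printf '%s\n' '8888888 7777777' '9999999 6666666' '4444444 8888888' > "$tmp/file3"

# Load file1/file2 into associative arrays keyed by value (bash 4+)
declare -A a1 a2
while IFS= read -r w; do a1[$w]=1; done < "$tmp/file1"
while IFS= read -r w; do a2[$w]=1; done < "$tmp/file2"

# read -r t1 t2 splits each file3 line into its two fields, as the comment
# suggests; the output redirection sits once at the end of the loop
while read -r t1 t2; do
    if [[ ( -n "${a1[$t1]}" && -n "${a2[$t2]}" ) || ( -n "${a1[$t2]}" && -n "${a2[$t1]}" ) ]]; then
        printf '%s %s\n' "$t1" "$t2"
    fi
done < "$tmp/file3" > "$tmp/result"
cat "$tmp/result"
```

Like the awk solution, this reads each file exactly once and spawns no sub-processes inside the loop.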
