1

I have file A with two columns, it looks like this:

7000000185249100 162280 7000000185249048 235500 7000000185249052 755361 7000000185249068 427550 7000000185249070 269102 7000000185249081 291122 

And I have file B with three columns, it looks like this:

7000000185249100 1622651 1623044 7000000185249048 235104 235805 7000000185249146 2500324 2502635 7000000185249100 1218818 1221734 7000000185249468 88587 89699 7000000185249239 299691 300277 7000000185249315 769635 769986 7000000185249374 1548986 1549747 

So what I wanted to do is to print out lines from file A,

  1. if the number in first column in file A matches the number in first column in file B, and
  2. the number in the second column in file A is within the range of the numbers in the second and third columns in file B.

Expected output will be:

7000000185249048 235500 

I tried with the following code, but failed.

awk -F '\t' 'FNR==NR{a[$1,$2,$3]=$0;next}{if(b=a[$1, >=$2 && <= $3]){print b}}' file B file A 
2
  • how is 162280 within range of 1622651 1623044 or 1218818 1221734? Commented Jan 24, 2018 at 3:34
  • Thanks Sundeep for pointing it out, it was a typo... It should be 1622800. Commented Jan 28, 2018 at 3:49

1 Answer 1

0
  1. Use join to find common 1st fields, then use bash to compare values:

    join --nocheck-order -j 1 A B | while read a b c d ; do [[ ( b -le d && b -ge c ) || ( b -le c && b -ge d ) ]] && echo $a $b done 

    The OP spec states "the number in the second column in file A is within the range of the numbers in the second and third columns in file B". This range might not ordered, so the logic between [[ and ]] handles it either way. Example:

    • If A2=3, B2=2, and B3=4, that's matched by ( b -le d && b -ge c ).
    • If A2=3, B2=4, and B3=2, that's matched by ( b -le c && b -ge d ).
  2. Not so good GNU sed code to turn each line of file B into two piped numgrep commands that search file A for ranges, then evaluate the commands. Because the resulting list of commands might have redundant output, pipe that to awk to perform an unsorted uniq:

     sed -n \ 's#\(\w*\)\W*\(\w*\)\W*\(\w*\)#numgrep /\1/ A\|numgrep /\2..\3,\3..\2/#e /./p' B | awk '!a[$0]++' 

Output of either method:

7000000185249048 235500 
6
  • @Sundeep's comment implies the first line output is an error. If so, my code is buggy, and it's bugs coincidentally match the OP's wrong output. Hmm... Commented Jan 24, 2018 at 3:45
  • Fixed code error, problem was that a , in numgrep is a logical OR. The code needs a logical AND, which required a pipe. The revised code can still fail if field #1 of file B is within range of fields #2 & #3, and field #2 of file A is not. Commented Jan 24, 2018 at 4:07
  • Thank you for your kind help @agc! I really appreciated it. Yes, as Sundeep pointed out, I made a mistake in the example files. The expected output will be 7000000185249048 235500 only in this example. Thanks again! Commented Jan 28, 2018 at 3:54
  • @Jing, Glad to hear it worked. Please read: What should I do when someone answers my question? Commented Jan 28, 2018 at 5:17
  • Hi @agc , could you explain a little bit more about what [[ b -le d && b -ge c ]] || [[ b -le c && b -ge d ]] does? I think the first part is to select values in column b that are lower than or equal to values in column d but greater than or equal to values in column c. What about the second part ? And what does || do? Thanks! Commented Jan 28, 2018 at 15:18

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.