
How to compare two columns of two files and print the non-matching patterns with awk

I have a data file A.tsv (field separator = \t):

    id   clade  mutation
    243  40A    siti,toto,mumu
    254
    267  40B    lala,sisi,sojo

and a template file B.tsv (field separator = \t):

    40A  toto,xixi,saxa
    40B  lala,sojo,huhu
    40C  sasa,sisi,lala

Based on their common column (clade), I want to compare the mutations in A.tsv against the template B.tsv. I have two questions:

  1. How can I list, in a new column named "missing_mutation", the mutations in B.tsv that aren't present in A.tsv?
  2. How can I list, in a new column named "remaining_mutation", the mutations that are present in A.tsv (and that start with the letter s, case-insensitive) but not present in B.tsv?

The expected output, C.tsv, should look like this:

    id   clade  mutation        missing_mutation  remaining_mutation
    243  40A    siti,toto,mumu  xixi,saxa         siti
    254
    267  40B    lala,sisi,sojo  huhu              sisi

I know how to compare two files like this:

    awk -F'\t' -v OFS='\t' '
        NR==FNR { a[$1]=$2; next }   # B.tsv: clade -> mutations
        { print $0, a[$2] }          # A.tsv: append the match
    ' B.tsv A.tsv > C.tsv

but I don't know how to print the entries that don't match. Do you have any ideas?
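
For reference, the two set differences asked about could be sketched roughly as follows. This is only one possible approach, not a confirmed solution: it assumes both files are tab-separated, that B.tsv has no header line and A.tsv has exactly one, and that mutations are comma-separated without surrounding spaces. The array names `tmpl`, `inA` and `inB` are made up for illustration. The two `printf` lines only recreate the sample data and would overwrite existing files of the same name, so skip them when working on real data.

```shell
# Recreate the sample files from the question (illustration only).
printf 'id\tclade\tmutation\n243\t40A\tsiti,toto,mumu\n254\n267\t40B\tlala,sisi,sojo\n' > A.tsv
printf '40A\ttoto,xixi,saxa\n40B\tlala,sojo,huhu\n40C\tsasa,sisi,lala\n' > B.tsv

awk -F'\t' -v OFS='\t' '
# First file (B.tsv): remember the template mutations of each clade.
NR == FNR { tmpl[$1] = $2; next }

# Header line of A.tsv: append the two new column names.
FNR == 1 { print $0, "missing_mutation", "remaining_mutation"; next }

# Rows with no clade, or a clade unknown to the template: print unchanged.
!($2 in tmpl) { print; next }

{
    missing = remaining = ""
    split("", inA); split("", inB)          # portable way to empty both arrays
    nA = split($3, a, ",")                  # mutations of this row
    nB = split(tmpl[$2], b, ",")            # template mutations of this clade
    for (i = 1; i <= nA; i++) inA[a[i]] = 1
    for (i = 1; i <= nB; i++) inB[b[i]] = 1

    # Template mutations that are absent from this row.
    for (i = 1; i <= nB; i++)
        if (!(b[i] in inA))
            missing = missing (missing == "" ? "" : ",") b[i]

    # Row mutations starting with s or S that the template does not list.
    for (i = 1; i <= nA; i++)
        if (a[i] ~ /^[sS]/ && !(a[i] in inB))
            remaining = remaining (remaining == "" ? "" : ",") a[i]

    print $0, missing, remaining
}
' B.tsv A.tsv > C.tsv
```

With the sample data, this should give `xixi,saxa` and `siti` for clade 40A, and `huhu` and `sisi` for clade 40B, while the row without a clade (254) passes through untouched. Mutations are listed in the order in which they appear in their source file.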

nstatam