edited Feb 23, 2021 at 8:22

How to compare two columns of two files and print the non-matching patterns with awk

I have a data file A.tsv (field separator = \t):

    id    clade    mutation
    243   40A      siti,toto,mumu
    254
    267   40B      lala,sisi,sojo

and a template file B.tsv (field separator = \t):

    40A   toto,xixi,saxa
    40B   lala,sojo,huhu
    40C   sasa,sisi,lala

Based on their common column (clade), I want to compare the mutations in A.tsv against the template B.tsv. I have two questions:

How can I list, in a new column named "missing_mutation", the mutations in B.tsv that aren't present in A.tsv?
How can I list, in a new column named "remaining_mutation", the mutations that are present in A.tsv (and that start with the letter s, case-insensitive) but are not present in B.tsv?

The expected output C.tsv should look like this:

    id    clade    mutation          missing_mutation    remaining_mutation
    243   40A      siti,toto,mumu    xixi,saxa           siti
    254
    267   40B      lala,sisi,sojo    huhu                sisi

I know how to compare two files like this:

    awk -F'\t' -v OFS='\t' '
      NR==FNR { a[$1]=$2; next }  # B.tsv: clade -> mutations
      { print $0, a[$2] }         # A.tsv: append template list
    ' B.tsv A.tsv > C.tsv

but I don't know how to print the entries that don't match. Do you have any ideas?
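For reference, here is a minimal sketch of one way both columns could be computed. It is not a definitive solution. It assumes the files are tab-separated as described, that rows without a clade (like id 254) or with a clade missing from B.tsv should pass through unchanged, and that "case-insensitive" means a leading s or S. The array names (tmpl, inA, inB, mutA, mutB) are only illustrative.

```shell
# Recreate the sample inputs from the question (tab-separated).
printf 'id\tclade\tmutation\n243\t40A\tsiti,toto,mumu\n254\n267\t40B\tlala,sisi,sojo\n' > A.tsv
printf '40A\ttoto,xixi,saxa\n40B\tlala,sojo,huhu\n40C\tsasa,sisi,lala\n' > B.tsv

awk -F'\t' -v OFS='\t' '
# First file (B.tsv): remember the template mutation list of each clade.
NR == FNR { tmpl[$1] = $2; next }

# Header line of A.tsv: append the two new column names.
FNR == 1 { print $0, "missing_mutation", "remaining_mutation"; next }

# Rows with no clade, or a clade absent from the template, pass through.
!($2 in tmpl) { print; next }

{
    missing = ""; remaining = ""
    split("", inA); split("", inB)      # portable way to empty both arrays
    nA = split($3, mutA, ",")
    nB = split(tmpl[$2], mutB, ",")
    for (i = 1; i <= nA; i++) inA[mutA[i]] = 1
    for (i = 1; i <= nB; i++) inB[mutB[i]] = 1

    # Template mutations that this row lacks, kept in template order.
    for (i = 1; i <= nB; i++)
        if (!(mutB[i] in inA)) {
            sep = (missing == "" ? "" : ",")
            missing = missing sep mutB[i]
        }

    # Row mutations starting with s or S that the template lacks.
    for (i = 1; i <= nA; i++)
        if (mutA[i] ~ /^[sS]/ && !(mutA[i] in inB)) {
            sep = (remaining == "" ? "" : ",")
            remaining = remaining sep mutA[i]
        }

    print $0, missing, remaining
}
' B.tsv A.tsv > C.tsv
```

The idea is to split each comma-separated list into an array, build a lookup set for each side, and then walk each list while testing membership in the other. That gives the two set differences without any nested loops.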

asked Feb 23, 2021 at 7:47