I have a data file A.tsv (field separator = \t) :

    id    clade   mutation
    243   40A     siti,toto,mumu
    254
    267   40B     lala,sisi,sojo

and a template file B.tsv (field separator = \t) :

    40A   toto,xixi,saxa
    40B   lala,sojo,huhu
    40C   sasa,sisi,lala

Based on their common column (clade), I want to compare the mutations in A.tsv against the template B.tsv. I have two questions:
- How can I indicate, in a new column named "missing_mutation", the names of the mutations in B.tsv that aren't present in A.tsv?
- How can I indicate, in a new column named "remaining_mutation", the names of the mutations that are present in A.tsv (and that start with the letters, case-insensitive) but not present in B.tsv?

C.tsv looks like this:

    id    clade   mutation          missing_mutation   remaining_mutation
    243   40A     titi,toto,lala    xixi,saxa          siti
    254
    267   40B     lala,jiji,jojo    huhu               sisi

I know how to compare two files like this:

    awk -F'\t' -v OFS='\t' '
        NR==FNR { a[$1]=$2; next }   # B.tsv: clade is column 1, mutations are column 2
        { print $0, a[$2] }          # A.tsv: clade is column 2
    ' B.tsv A.tsv > C.tsv

but I don't know how to print the ones that don't match. Do you have an idea?
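
One possible way to get both columns, as a minimal sketch rather than a definitive answer: load B.tsv into an array keyed by clade, then, for each row of A.tsv, split both mutation lists on commas and compute the two set differences. The prefix letter `s` is an assumption inferred from the expected output (the question does not state which letter is meant), and the names `tmpl`, `inA`, `inB`, `mutA`, `mutB` and `prefix` are only illustrative. Rows with no clade, or with a clade that is absent from B.tsv, are assumed to be printed unchanged.

```shell
# Recreate the sample inputs from the question (tab-separated).
printf 'id\tclade\tmutation\n243\t40A\tsiti,toto,mumu\n254\n267\t40B\tlala,sisi,sojo\n' > A.tsv
printf '40A\ttoto,xixi,saxa\n40B\tlala,sojo,huhu\n40C\tsasa,sisi,lala\n' > B.tsv

awk -F'\t' -v OFS='\t' -v prefix='s' '
# First file (B.tsv): remember the template mutation list of each clade.
NR == FNR { tmpl[$1] = $2; next }

# Header line of A.tsv: append the two new column names.
FNR == 1 { print $0, "missing_mutation", "remaining_mutation"; next }

# Rows without a clade known to the template are printed unchanged.
!($2 in tmpl) { print; next }

{
    split("", inA); split("", inB)      # portable way to empty both arrays
    nA = split($3, mutA, ",")           # mutations listed in A.tsv
    nB = split(tmpl[$2], mutB, ",")     # mutations listed in B.tsv
    for (i = 1; i <= nA; i++) inA[mutA[i]] = 1
    for (i = 1; i <= nB; i++) inB[mutB[i]] = 1

    missing = ""; remaining = ""
    # In B.tsv but not in A.tsv.
    for (i = 1; i <= nB; i++)
        if (!(mutB[i] in inA))
            missing = missing (missing == "" ? "" : ",") mutB[i]
    # In A.tsv but not in B.tsv, and starting with the prefix (case-insensitive).
    for (i = 1; i <= nA; i++)
        if (!(mutA[i] in inB) && index(tolower(mutA[i]), tolower(prefix)) == 1)
            remaining = remaining (remaining == "" ? "" : ",") mutA[i]

    print $0, missing, remaining
}
' B.tsv A.tsv > C.tsv
```

With the sample A.tsv and B.tsv above, this should give `xixi,saxa` and `siti` for clade 40A, and `huhu` and `sisi` for clade 40B. The prefix can be changed on the command line with `-v prefix=...` without touching the script.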