In the given input file there are 4 columns. We have to remove duplicates, but there is a catch: there is a preference order C2 > C3 > C4. So in the output there is only one row with a, one row with e, and one each for h and g.
Note that all the rows with a in C2 collapse into one. After that, e k, e f and e m collapse into one; h and g stay separate.
C1 C2 C3 C4
t  a  b  c
t  a  b  d
t  a  e
t  e  k
t  a  i
t  e  f
t  e  m
t  h
t  g

Output:

t a b c
t e k
t h
t g

I have tried the following command:
awk '!seen[$2]++' ac.txt

My problem: there are a lot of columns between C2, C3 and C4 in the real file. I tried

awk -F$'\t' '{ print $13 " " $18 " " $1 }' originalFile | awk '!seen[$2]++'

but this gives me only the deduplicated rows with those three columns. I want the full file (all the columns) deduplicated. There is also another constraint: the file size can run to 200 GB, so cutting out the columns and rejoining doesn't look like a good enough approach.
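If C2 sits in column 13 of the real file, one sketch of an approach (assuming a tab separator and that keeping the first line seen per C2 value is acceptable) is to key the `seen` array on that column while letting awk's default action print the whole line:

```shell
# Demo input mirroring the question's sample, tab-separated
# (hypothetical path; on the real file use the actual C2 column, e.g. !seen[$13]++).
printf 't\ta\tb\tc\nt\ta\tb\td\nt\te\tk\nt\th\nt\tg\n' > /tmp/dedup_demo.txt

# Print the entire line only the first time each C2 value appears.
awk -F'\t' '!seen[$2]++' /tmp/dedup_demo.txt
```

awk streams the file, so this keeps one hash entry per distinct C2 value rather than anything proportional to the 200 GB input; memory grows with the number of distinct keys.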
I am using Linux.
I deduce tab is your field separator, and that you then want space to separate fields in the output. But the example input uses spaces. The problem is that even if you had typed tabs, the site engine would, I think, misleadingly render them as multiple spaces. In addition, "empty" fields contain spaces, which is (or may be) a problem if space is (or will be) the separator. Please state clearly: (1) what is the field separator in the input? (2) what should be the field separator in the output? (3) are "empty" fields truly empty, or do (or will) they collide with the separator?
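To illustrate why question (3) matters, here is a small sketch with made-up data: under awk's default field splitting, any run of blanks is one separator, so a truly empty C3 silently shifts C4 left; with an explicit tab separator, consecutive tabs delimit a genuinely empty field and positions are preserved:

```shell
# Default FS: runs of blanks collapse, so the "empty" third field vanishes
# and x becomes $3.
printf 't h  x\n' | awk '{print NF, $3}'            # → 3 x

# Explicit tab FS: the two consecutive tabs mark an empty $3, and x stays in $4.
printf 't\th\t\tx\n' | awk -F'\t' '{print NF, $4}'  # → 4 x
```

This is why a solution keyed on `$13` only works if the separator guarantees that column 13 is always C2, even for rows with missing values.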