
The input file has 4 columns. We have to remove the duplicates, but there is a catch: there is a preference order C2 > C3 > C4. So the output contains only one row for a, one row for e, and one row each for h and g.

Note that for C1 all a's collapse into one. After that ek, ef and em collapse into one. h and g are separate.

C1 C2 C3 C4
t a b c
t a b d
t a e
t e k
t a i
t e f
t e m
t h
t g

Output:

t a b c
t e k
t h
t g

I have tried the following command:

awk '!seen[$2]++' ac.txt

My problem: there are a lot of columns between C2, C3 and C4. I tried

awk -F$'\t' '{ print $13 " " $18 " " $1 }' originalFile | awk '!seen[$2]++'

but this gives me deduplicated rows containing only these columns. I want the full file (all the columns) deduplicated. There is also another constraint: the file size can run to 200 GB, so cutting out the columns doesn't appear to be a good enough approach.

I am using Linux.

  • Note: by your awk command I deduce tab is your field separator and then you want space to separate fields in the output. But the example input uses spaces. The problem is even if you edited and used tabs, the site engine would misleadingly convert them to multiple spaces, I think. In addition "empty" fields contain spaces, this is (or may be) a problem if space is (or will be) the separator. Please state clearly: (1) what is the field separator in the input? (2) what should be the field separator in the output? (3) are "empty" fields truly empty or do (will) they collide with the separator? Commented May 2, 2019 at 8:10
  • 1. The field separator is tab; I have used spaces for ease. 2. The field separator should be tab in the output. 3. Empty fields are truly empty, preceded and followed by a delimiter. Commented May 2, 2019 at 8:28

2 Answers


This will treat a "0" column the same as an empty column, but gives the idea more simply:

awk 'A[$c2] + B[$c3] + C[$c4] == 0; $c2 {A[$c2]++; next} $c3 {B[$c3]++; next} $c4 {C[$c4]++}' c2=2 c3=3 c4=4 input

(set c2, c3, and c4 to the actual column numbers you care about)

To expand that to your case, you should be able to use:

awk 'A[$c2] + B[$c3] + C[$c4]==0; match($c2,"[^ ]"){A[$c2]++; next} match($c3,"[^ ]"){B[$c3]++;next} match($c4,"[^ ]"){C[$c4]++} ' FS=\\t c2=2 c3=3 c4=4 input 
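As a quick sanity check, here is a minimal sketch that runs the same logic on a small tab-separated file. The file names sample.tsv and deduped.tsv are made up, and the placement of the values in the sparse rows (h in C3, g in C4) is only my assumption about what the question's sample looks like with real tabs.

```shell
# Hypothetical sample: tab-separated, empty fields truly empty.
# Putting h in C3 and g in C4 is an assumption, not taken from the question.
{
  printf 'C1\tC2\tC3\tC4\n'
  printf 't\ta\tb\tc\n'
  printf 't\ta\tb\td\n'
  printf 't\ta\te\t\n'
  printf 't\te\tk\t\n'
  printf 't\ta\ti\t\n'
  printf 't\te\tf\t\n'
  printf 't\te\tm\t\n'
  printf 't\t\th\t\n'
  printf 't\t\t\tg\n'
} > sample.tsv

awk -F'\t' -v c2=2 -v c3=3 -v c4=4 '
  # Print the row only if none of its three key values was recorded before.
  A[$c2] + B[$c3] + C[$c4] == 0
  # Record the first non-blank key in preference order C2 > C3 > C4.
  match($c2, "[^ ]") { A[$c2]++; next }
  match($c3, "[^ ]") { B[$c3]++; next }
  match($c4, "[^ ]") { C[$c4]++ }
' sample.tsv > deduped.tsv

cat deduped.tsv
```

With that layout, deduped.tsv should hold the header plus one full row each for a, e, h and g. Only the distinct key values are kept in memory, not the rows themselves.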

How about this one (save into file and run that)

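#!/usr/bin/gawk -f
BEGIN {
    FS = "\t"
    OFS = "\t"
}
# Skip the header line.
FNR == 1 { next }
# Drop the row if a non-empty C2, C3 or C4 value is already recorded in a.
($2 ~ /.+/ && a[$2]++) { next }
($3 ~ /.+/ && a[$3]++) { next }
($4 ~ /.+/ && a[$4]++) { next }
{ print $0 }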
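To try the script, something like the following should work. It is only a sketch: dedup.awk, input.tsv and out.tsv are made-up names, the layout of the sparse rows (h in C3, g in C4) is my assumption, and I call awk -f instead of relying on the shebang so it does not depend on where gawk is installed.

```shell
# Save the script body to a file (hypothetical name dedup.awk).
cat > dedup.awk <<'EOF'
BEGIN { FS = "\t"; OFS = "\t" }
FNR == 1 { next }
($2 ~ /.+/ && a[$2]++) { next }
($3 ~ /.+/ && a[$3]++) { next }
($4 ~ /.+/ && a[$4]++) { next }
{ print $0 }
EOF

# Hypothetical tab-separated sample; where h and g sit is an assumption.
{
  printf 'C1\tC2\tC3\tC4\n'
  printf 't\ta\tb\tc\n'
  printf 't\ta\tb\td\n'
  printf 't\ta\te\t\n'
  printf 't\te\tk\t\n'
  printf 't\ta\ti\t\n'
  printf 't\te\tf\t\n'
  printf 't\te\tm\t\n'
  printf 't\t\th\t\n'
  printf 't\t\t\tg\n'
} > input.tsv

awk -f dedup.awk input.tsv > out.tsv
cat out.tsv
```

Two things to be aware of: FNR == 1 skips the header, so out.tsv has no header line, and the single array a is shared by all three columns, so a value seen in one column also counts as seen in the others.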
