Remove duplicates in a very large file

Question

The input file below has 4 columns. We have to remove the duplicates, but there is a catch: the columns have a preference order, C2 > C3 > C4. So in the output there is only one row with a, one row with e, and one row each for h and g.

Note that for C2 all the a's collapse into one row. After that, the rows with e k, e f and e m collapse into one. h and g stay separate.

C1 C2 C3 C4
t a b c
t a b d
t a e
t e k
t a i
t e f
t e m
t h
t g

Output:

t a b c
t e k
t h
t g

I have tried the following commands:

awk '!seen[$2]++' ac.txt

My problem: there are a lot of columns between C2, C3 and C4. I tried

awk -F$'\t' '{ print $13 " " $18 " " $1 }' originalFile | awk '!seen[$2]++'

but this gives me deduplicated rows containing only these columns. I want the full file (all the columns) deduplicated. There is another constraint as well: the file size can run to 200 GB, so cutting out the columns doesn't appear to be a good enough approach.

I am using Linux.

Note: from your awk command I deduce that tab is your field separator and that you want space to separate fields in the output. But the example input uses spaces. The problem is that even if you edited it and used tabs, the site engine would misleadingly convert them to multiple spaces, I think. In addition, "empty" fields contain spaces; this is (or may be) a problem if space is (or will be) the separator. Please state clearly: (1) what is the field separator in the input? (2) what should be the field separator in the output? (3) are "empty" fields truly empty or do (will) they collide with the separator?
– Kamil Maciorowski, Commented May 2, 2019 at 8:10
1. The field separator is tab; I have used spaces for ease. 2. The field separator should be tab in the output. 3. Empty fields are truly empty: preceded by a delimiter and succeeded by a delimiter.
– Vijay Kumar Attri, Commented May 2, 2019 at 8:28

William Pursell · Accepted Answer · 2019-05-02 08:49:16Z

This will treat a "0" column the same as an empty column, but gives the idea more simply:

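awk '
  A[$c2] + B[$c3] + C[$c4] == 0   # print a row only when none of its keys has been recorded
  $c2 { A[$c2]++; next }          # record the highest-priority non-empty key, then move on
  $c3 { B[$c3]++; next }
  $c4 { C[$c4]++ }
' c2=2 c3=3 c4=4 input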

(set c2, c3, and c4 to the actual column numbers you care about)

To expand that to your case, you should be able to use:

awk '
  A[$c2] + B[$c3] + C[$c4] == 0
  match($c2, "[^ ]") { A[$c2]++; next }
  match($c3, "[^ ]") { B[$c3]++; next }
  match($c4, "[^ ]") { C[$c4]++ }
' FS=\\t c2=2 c3=3 c4=4 input
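
As a quick check, the tab-aware command can be run against the question's sample. This is only a sketch: the scratch file is made up, and the positions of the empty fields are an assumption (the post shows them collapsed), chosen so that a sits in C2, e in C3, and h and g in C4.

```shell
# Scratch file for the sample (hypothetical; any path works).
sample=$(mktemp)

# Tab-separated sample. ASSUMPTION: the positions of the empty fields are
# inferred from the question's description; the post does not show them.
{
  printf 't\ta\tb\tc\n'
  printf 't\ta\tb\td\n'
  printf 't\ta\te\t\n'
  printf 't\t\te\tk\n'
  printf 't\ta\ti\t\n'
  printf 't\t\te\tf\n'
  printf 't\t\te\tm\n'
  printf 't\t\t\th\n'
  printf 't\t\t\tg\n'
} > "$sample"

# The tab-aware command, pointed at the sample instead of "input".
result=$(awk 'A[$c2] + B[$c3] + C[$c4]==0; match($c2,"[^ ]"){A[$c2]++; next} match($c3,"[^ ]"){B[$c3]++;next} match($c4,"[^ ]"){C[$c4]++} ' FS=\\t c2=2 c3=3 c4=4 "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

With these assumed positions, the surviving rows are the first a row, the first e row, and the h and g rows, which matches the output asked for in the question.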

jf1 · Accepted Answer · 2022-01-31 15:38:19Z

How about this one? (Save it into a file and run that.)

#!/usr/bin/gawk -f
BEGIN {
    FS = "\t"
    OFS = "\t"
}
FNR == 1 { next }                   # skip the first (header) line
($2 ~ /.+/ && a[$2]++) { next }     # non-empty value already recorded in a: drop the row
($3 ~ /.+/ && a[$3]++) { next }
($4 ~ /.+/ && a[$4]++) { next }
{ print $0 }
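
Note that the script above records all three columns in the single array a and drops the first line of the file. Whether that is wanted depends on the data. If it is not, a possible variant (not the script above, just a sketch in the same spirit) keeps one array per column and prints the header. The array names seen2, seen3, seen4 and the sample file are made up for illustration, and the empty-field positions in the sample are assumed.

```shell
# Hypothetical sample with a header line; empty-field positions are assumed.
sample=$(mktemp)
{
  printf 'C1\tC2\tC3\tC4\n'
  printf 't\ta\tb\tc\n'
  printf 't\ta\te\t\n'
  printf 't\t\te\tk\n'
  printf 't\t\te\tf\n'
  printf 't\t\t\th\n'
  printf 't\t\t\tg\n'
} > "$sample"

variant=$(awk -F'\t' '
FNR == 1 { print; next }            # keep the header instead of dropping it
$2 != "" && seen2[$2]++ { next }    # C2 value already seen in C2: drop the row
$3 != "" && seen3[$3]++ { next }    # same check for C3, with its own array
$4 != "" && seen4[$4]++ { next }    # same check for C4, with its own array
{ print }
' "$sample")

printf '%s\n' "$variant"
rm -f "$sample"
```

Like the script above, this variant still marks the C3 and C4 values of every printed row as seen, so it is not identical in behavior to the first answer, which records only the highest-priority non-empty column.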
