How do I remove duplicates from a large file of large numbers? This is an interview question about algorithms and data structures, not about sort -u and the like.
I assume that the file does not fit in memory and that the range of the numbers is large enough that I cannot use an in-memory counting/bucket sort.
The only option I see is to sort the file (e.g. with an external merge sort) and then make another pass over the sorted output to filter out duplicates, since duplicates end up adjacent after sorting.
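Roughly what I have in mind is the sketch below (just an illustration of the idea, assuming one number per line; the chunk size and file handling are made up): sort fixed-size chunks in memory, spill them to temporary files, then k-way merge the sorted chunks while dropping adjacent duplicates.

```python
import heapq
import os
import tempfile

CHUNK_SIZE = 1_000_000  # numbers per in-memory chunk; tune to available RAM


def external_sort_unique(in_path, out_path):
    """Sort a file with one integer per line and write it back without duplicates."""
    chunk_files = []
    try:
        # Phase 1: sort fixed-size chunks in memory and spill them to temp files.
        with open(in_path) as src:
            while True:
                chunk = [int(line) for _, line in zip(range(CHUNK_SIZE), src)]
                if not chunk:
                    break
                chunk.sort()
                tmp = tempfile.NamedTemporaryFile("w", delete=False, suffix=".chunk")
                tmp.write("\n".join(map(str, chunk)) + "\n")
                tmp.close()
                chunk_files.append(tmp.name)

        # Phase 2: k-way merge of the sorted chunks, skipping adjacent duplicates.
        readers = [open(name) for name in chunk_files]
        try:
            merged = heapq.merge(*[(int(line) for line in r) for r in readers])
            with open(out_path, "w") as dst:
                prev = None
                for n in merged:
                    if n != prev:
                        dst.write(f"{n}\n")
                        prev = n
        finally:
            for r in readers:
                r.close()
    finally:
        for name in chunk_files:
            os.remove(name)
```

Memory stays bounded by a single chunk plus the file buffers, since heapq.merge streams from the chunk files rather than loading them.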
Does that make sense? Are there other options?