How do I remove duplicates from a large file of large numbers? This is an interview question about algorithms and data structures, not about sort -u and the like.
I assume that the file does not fit in memory and that the range of the numbers is large enough that I cannot use an in-memory counting/bucket sort.
The only option I see is to sort the file (e.g. with an external merge sort) and then make another pass over the sorted output to filter out duplicates, since duplicates end up adjacent after sorting.
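Roughly what I have in mind is the sketch below (just an illustration of the idea, assuming one number per line; the chunk size and file handling are made up): sort fixed-size chunks in memory, spill them to temporary files, then k-way merge the sorted chunks while dropping adjacent duplicates.

```python
import heapq
import os
import tempfile

CHUNK_SIZE = 1_000_000  # numbers per in-memory chunk; tune to available RAM


def external_sort_unique(in_path, out_path):
    """Sort a file with one integer per line and write it back without duplicates."""
    chunk_files = []
    try:
        # Phase 1: sort fixed-size chunks in memory and spill them to temp files.
        with open(in_path) as src:
            while True:
                chunk = [int(line) for _, line in zip(range(CHUNK_SIZE), src)]
                if not chunk:
                    break
                chunk.sort()
                tmp = tempfile.NamedTemporaryFile("w", delete=False, suffix=".chunk")
                tmp.write("\n".join(map(str, chunk)) + "\n")
                tmp.close()
                chunk_files.append(tmp.name)

        # Phase 2: k-way merge of the sorted chunks, skipping adjacent duplicates.
        readers = [open(name) for name in chunk_files]
        try:
            merged = heapq.merge(*[(int(line) for line in r) for r in readers])
            with open(out_path, "w") as dst:
                prev = None
                for n in merged:
                    if n != prev:
                        dst.write(f"{n}\n")
                        prev = n
        finally:
            for r in readers:
                r.close()
    finally:
        for name in chunk_files:
            os.remove(name)
```

Memory stays bounded by a single chunk plus the file buffers, since heapq.merge streams from the chunk files rather than loading them.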
Does that make sense? Are there other options?