
We have a huge chunk of data and we want to perform a few operations on it. Removing duplicates is one of the main operations.

Ex.

    a,me,123,2631272164
    yrw,wq,1237,123712,126128361
    yrw,dsfswq,1323237,12xcvcx3712,1sd26128361

These are three entries in a file, and we want to remove duplicates on the basis of the 1st column, so the 3rd row should be deleted. Each row may have a different number of columns, but the column we are interested in will always be present.

An in-memory operation doesn't look feasible.

Another option is to store the data in a database and remove the duplicates there, but again that's not a trivial task. What design should I follow to dump the data into a database and remove the duplicates?

I am assuming that people must have faced such issues and solved them.

How do we usually solve this problem?

PS: Please consider this as a real life problem rather than interview question ;)

  • If in-memory isn't feasible, it does seem like dumping to a database will be the easiest solution. What turns you off doing things this way? Commented Apr 28, 2012 at 6:06
  • Good point! Modified the question! Pointer to design of such method/framework would be helpful! Commented Apr 28, 2012 at 6:10
  • Also what criteria is there for removal - given the duplicates in your example, do you keep the first only due to the order in the list? What is the ballpark row size of this data? Commented Apr 28, 2012 at 6:13
  • That is not an issue for me! I am ok with keeping any of these at this point of time! Commented Apr 28, 2012 at 6:13
  • Is the original input file sorted? Commented Apr 28, 2012 at 12:43

5 Answers

5

If the number of keys is also infeasible to load into memory, you'll have to do a stable (order-preserving) external merge sort to sort the data, followed by a linear scan to remove the duplicates. Or you could modify the external merge sort to eliminate duplicates while merging the sorted runs.
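
A rough sketch of how that second variant could look in Python, assuming comma-separated lines with the de-dup key in the first column; the file names, chunk size, and helper names are illustrative, not taken from the answer:

    import heapq
    import itertools
    import tempfile

    CHUNK_LINES = 1_000_000              # tune to the memory you can spare

    def key_of(line):
        return line.split(",", 1)[0]     # the first column is the key

    def write_runs(in_path):
        """Sort the input in memory-sized chunks, writing each sorted run to a temp file."""
        runs = []
        with open(in_path) as f:
            while True:
                chunk = list(itertools.islice(f, CHUNK_LINES))
                if not chunk:
                    break
                chunk.sort(key=key_of)   # Python's sort is stable
                run = tempfile.TemporaryFile(mode="w+")
                run.writelines(chunk)
                run.seek(0)
                runs.append(run)
        return runs

    def merge_dedup(runs, out_path):
        """Merge the sorted runs, keeping only the first line seen for each key."""
        last = object()                  # sentinel that compares unequal to any key
        with open(out_path, "w") as out:
            for line in heapq.merge(*runs, key=key_of):
                k = key_of(line)
                if k != last:            # note: the output ends up sorted by key
                    out.write(line)
                    last = k

    merge_dedup(write_runs("input.txt"), "deduped.txt")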

I guess since this isn't an interview question, efficiency/elegance doesn't seem to be an issue. Write a hack Python script that creates one table with the first field as the primary key. Parse the file and just insert the records into the database, wrapping each insert in a try/except statement. Then perform a SELECT * on the table, parse the data, and write it back to a file line by line.
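
Something along those lines, using Python's built-in sqlite3 as a stand-in for whatever database is at hand (the file names and table layout are placeholders); storing the whole line next to the key avoids re-parsing it on the way out:

    import sqlite3

    conn = sqlite3.connect("dedup.db")
    conn.execute("CREATE TABLE IF NOT EXISTS records (k TEXT PRIMARY KEY, line TEXT)")

    with open("input.txt") as f:
        for line in f:
            key = line.split(",", 1)[0]
            try:
                # The PRIMARY KEY constraint rejects repeated keys, so the
                # first occurrence wins and later ones are silently skipped.
                conn.execute("INSERT INTO records VALUES (?, ?)", (key, line))
            except sqlite3.IntegrityError:
                pass
    conn.commit()

    with open("output.txt", "w") as out:
        for (line,) in conn.execute("SELECT line FROM records"):
            out.write(line)
    conn.close()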


2

If you go down the database route, you can load the CSV into a database and use its duplicate-key handling ('ON DUPLICATE KEY UPDATE', or REPLACE as in the example below).

Using MySQL:

  1. Create a table with columns to match your data (you may be able to get away with just two columns: id and data)
  2. Load the data using something along the lines of:

    LOAD DATA LOCAL INFILE "rs.txt" REPLACE INTO TABLE data_table FIELDS TERMINATED BY ',';

  3. You should then be able to dump the data back out into CSV format without duplicates.


0

If the number of unique keys isn't extremely high, you could simply do this (pseudocode, since you haven't mentioned a language):

    Set keySet
    while (not end_of_input_file)
        read line from input file
        if first column is not in keySet
            add first column to keySet
            write line to output file
    end while


0

If the input is sorted or can be sorted, then one could do this which only needs to store one value in memory:

    # read_row()/write_row() are assumed helpers; read_row() returns None at end of input.
    r = read_row()
    if r is not None:
        last = r[0]
        write_row(r)
        while True:
            r = read_row()
            if r is None:
                break
            if r[0] != last:
                write_row(r)
                last = r[0]

Otherwise:

What I'd do is keep a set of the first column values that I have already seen and drop the row if it is in that set.

    S = set()
    while True:
        r = read_row()
        if r is None:
            break
        if r[0] not in S:
            write_row(r)
            S.add(r[0])

This will stream over the input using only memory proportional to the size of the set of values from the first column.
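
For completeness, a self-contained version of that streaming approach using the csv module; the file names are placeholders, and the key is taken from the first field of each row:

    import csv

    seen = set()
    with open("input.txt", newline="") as src, \
         open("deduped.txt", "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for row in reader:
            if not row:
                continue                 # skip blank lines
            key = row[0]                 # de-dup on the first column only
            if key not in seen:
                seen.add(key)
                writer.writerow(row)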


0

If you need to preserve order in your original data, it MAY be sensible to create new data that is a tuple of position and data, then sort on the data you want to de-dup. Once you've sorted by data, de-duplication is (essentially) a linear scan. After that, you can re-create the original order by sorting on the position-part of the tuple, then strip it off.

Say you have the following data: a, c, a, b

With a pos/data tuple, sorted by data, we end up with: 0/a, 2/a, 3/b, 1/c

We can then de-duplicate, trivially choosing either the first or the last entry to keep (with a bit more memory we could even keep some other occurrence), and get: 0/a, 3/b, 1/c.

We then sort by position and strip that: a, c, b

This would involve three linear scans over the data set and two sorting steps.
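
As a concrete illustration, here is that recipe in Python on the answer's toy data; for truly huge inputs each of the two sorting steps would itself have to be an external sort:

    rows = ["a", "c", "a", "b"]              # toy data from the example above

    def key_of(row):
        return row.split(",", 1)[0]          # de-dup key: the first column

    # 1. Tag every row with its original position.
    tagged = list(enumerate(rows))           # [(0, 'a'), (1, 'c'), (2, 'a'), (3, 'b')]

    # 2. Sort by key; the sort is stable, so equal keys keep their input order.
    tagged.sort(key=lambda pr: key_of(pr[1]))

    # 3. Linear scan, keeping the first entry for each key.
    deduped, last = [], object()
    for pos, row in tagged:
        if key_of(row) != last:
            deduped.append((pos, row))
            last = key_of(row)

    # 4. Sort by position to restore the original order, then strip it off.
    deduped.sort()
    print([row for pos, row in deduped])     # ['a', 'c', 'b']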

