We have a huge chunk of data and want to perform a few operations on it. Removing duplicates is one of the main ones.
Example:

a,me,123,2631272164
yrw,wq,1237,123712,126128361
yrw,dsfswq,1323237,12xcvcx3712,1sd26128361

These are three entries in a file, and we want to remove duplicates based on the first column. So the third row should be deleted, because its first column (yrw) already appears in the second row. Each row may have a different number of columns, but the column we are interested in will always be present.
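To pin down the behaviour I am after, here is a minimal Python sketch run on the three sample rows. The function name `dedupe_by_first_column` is just an illustrative name, and the sketch keeps the set of seen keys in memory, so it only describes the expected result; it is not meant as a solution at the real data size.

```python
import csv
import io

def dedupe_by_first_column(lines):
    """Yield rows whose first column has not been seen before (first occurrence wins)."""
    seen = set()  # holds only the first-column keys, not the full rows
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        key = row[0]
        if key not in seen:
            seen.add(key)
            yield row

# The three sample entries from above, one per line.
sample = io.StringIO(
    "a,me,123,2631272164\n"
    "yrw,wq,1237,123712,126128361\n"
    "yrw,dsfswq,1323237,12xcvcx3712,1sd26128361\n"
)
result = list(dedupe_by_first_column(sample))
```

Here `result` holds the first two rows only; the second `yrw` row is dropped.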
An in-memory operation doesn't look feasible for data of this size.
Another option is to store the data in a database and remove the duplicates there, but that is not a trivial task either. What design should I follow to load the data into a database and remove the duplicates?
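As a rough idea of what I imagine the database route could look like, here is a sketch that assumes SQLite through Python's standard `sqlite3` module. The table name `entries`, the column names, and the helper `load_unique` are all made up for illustration. It also assumes the first column never contains a quoted comma and that keeping the first occurrence is the right policy. The unique key plus `INSERT OR IGNORE` lets the database drop later rows with an already-seen key.

```python
import sqlite3

def load_unique(conn, lines):
    """Insert each line keyed by its first column; later duplicates are ignored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        "dedupe_key TEXT PRIMARY KEY, "  # first column of the row
        "line TEXT NOT NULL)"            # the full original row
    )
    with conn:  # one transaction for the whole batch, committed on success
        conn.executemany(
            "INSERT OR IGNORE INTO entries (dedupe_key, line) VALUES (?, ?)",
            (
                (line.split(",", 1)[0], line.rstrip("\n"))
                for line in lines
                if line.strip()  # skip blank lines
            ),
        )

# ":memory:" is used only so the sketch is self-contained; a real run
# would point at a database file on disk and stream lines from the input file.
conn = sqlite3.connect(":memory:")
load_unique(conn, [
    "a,me,123,2631272164\n",
    "yrw,wq,1237,123712,126128361\n",
    "yrw,dsfswq,1323237,12xcvcx3712,1sd26128361\n",
])
rows = [r[0] for r in conn.execute("SELECT line FROM entries ORDER BY rowid")]
```

I have not measured how this behaves at the real data size, so I cannot say whether it is a sensible design or whether something else is usually preferred.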
I assume people have faced this kind of issue before and solved it.
How do we usually solve this problem?
PS: Please consider this a real-life problem rather than an interview question ;)