I need to develop, or at least conceptualize, a module that does efficient data deduplication. Say we already have millions of data records. At the top level, the module needs to insert another 100 million records while making sure there are no duplicates in the resulting dataset. That presumably means comparing on the field (or fields) that decides whether two records are duplicates, but doing that serially is really naive when we're talking millions of records. A rough sketch of what I mean is below.
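To make the problem concrete, here's a minimal sketch of the obvious improvement over pairwise comparison: extract a dedup key from each record and check membership in a set. The record shape (dicts) and the key fields (`customer_id`, `email`) are just assumptions for illustration, not my actual schema.

```python
def dedup_key(record):
    # Normalize the fields that define "duplicate" -- field names are assumptions.
    return (record["customer_id"], record["email"].strip().lower())

def insert_without_duplicates(existing_records, new_records):
    # One pass over the existing data builds the key set; each new record
    # is then an O(1) membership check instead of a scan over everything.
    seen = {dedup_key(r) for r in existing_records}
    accepted = []
    for record in new_records:
        key = dedup_key(record)
        if key not in seen:
            seen.add(key)
            accepted.append(record)
    return accepted
```

Even this in-memory version feels shaky at 100 million records, which is why I'm wondering about the approaches below.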
What do you think a viable approach would be? Hashing? Divide-and-conquer algorithms to exploit parallelism? I have these ideas in my head, but at this scale it all gets dizzying.
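Here is roughly how I picture the hash-partitioned, divide-and-conquer version, again as a sketch with assumed names (and an arbitrary 16-partition count): records whose keys hash to the same bucket can only collide with each other, so each bucket can be deduplicated independently, and therefore in parallel.

```python
import zlib
from concurrent.futures import ProcessPoolExecutor

NUM_PARTITIONS = 16  # arbitrary; would be tuned to core count / memory

def dedup_key(record):
    # Same assumed key fields as above.
    return (record["customer_id"], record["email"].strip().lower())

def partition_of(key):
    # Stable hash so the same key always lands in the same partition.
    return zlib.crc32(repr(key).encode("utf-8")) % NUM_PARTITIONS

def dedup_partition(records):
    # Within a partition, duplicates are found with a plain set.
    seen, unique = set(), []
    for record in records:
        key = dedup_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

def parallel_dedup(records):
    # Divide: route each record to its hash bucket.
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for record in records:
        partitions[partition_of(dedup_key(record))].append(record)
    # Conquer: deduplicate the buckets in parallel worker processes.
    with ProcessPoolExecutor() as pool:
        return [r for part in pool.map(dedup_partition, partitions) for r in part]
```

Is this the right general direction, or is there a standard technique I'm missing (e.g. sort-based or database-side approaches)?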
Also, please post any pointers to resources on the web I can use. All I could find were debates and vendors touting their databases' "supreme data deduplication features".
In case it matters: the records are stored as BLOBs.