I need to develop, or at least conceptualize, a module that does efficient data deduplication. Say we already have millions of data records. At the top level, the module needs to insert another 100 million records while making sure there are no duplicates in the resulting dataset. That presumably means comparing on the field (or fields) that decides whether two records are duplicates, but doing that serially is really naive when we're talking millions of records. A rough sketch of what I mean is below.
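To make the problem concrete, here's a minimal sketch of the obvious improvement over pairwise comparison: extract a dedup key from each record and check membership in a set. The record shape (dicts) and the key fields (`customer_id`, `email`) are just assumptions for illustration, not my actual schema.

```python
def dedup_key(record):
    # Normalize the fields that define "duplicate" -- field names are assumptions.
    return (record["customer_id"], record["email"].strip().lower())

def insert_without_duplicates(existing_records, new_records):
    # One pass over the existing data builds the key set; each new record
    # is then an O(1) membership check instead of a scan over everything.
    seen = {dedup_key(r) for r in existing_records}
    accepted = []
    for record in new_records:
        key = dedup_key(record)
        if key not in seen:
            seen.add(key)
            accepted.append(record)
    return accepted
```

Even this in-memory version feels shaky at 100 million records, which is why I'm wondering about the approaches below.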
What do you think a viable approach would be? Hashing? Divide-and-conquer algorithms to exploit parallelism? I have these ideas in my head, but at this scale it all gets dizzying.
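Here is roughly how I picture the hash-partitioned, divide-and-conquer version, again as a sketch with assumed names (and an arbitrary 16-partition count): records whose keys hash to the same bucket can only collide with each other, so each bucket can be deduplicated independently, and therefore in parallel.

```python
import zlib
from concurrent.futures import ProcessPoolExecutor

NUM_PARTITIONS = 16  # arbitrary; would be tuned to core count / memory

def dedup_key(record):
    # Same assumed key fields as above.
    return (record["customer_id"], record["email"].strip().lower())

def partition_of(key):
    # Stable hash so the same key always lands in the same partition.
    return zlib.crc32(repr(key).encode("utf-8")) % NUM_PARTITIONS

def dedup_partition(records):
    # Within a partition, duplicates are found with a plain set.
    seen, unique = set(), []
    for record in records:
        key = dedup_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

def parallel_dedup(records):
    # Divide: route each record to its hash bucket.
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for record in records:
        partitions[partition_of(dedup_key(record))].append(record)
    # Conquer: deduplicate the buckets in parallel worker processes.
    with ProcessPoolExecutor() as pool:
        return [r for part in pool.map(dedup_partition, partitions) for r in part]
```

Is this the right general direction, or is there a standard technique I'm missing (e.g. sort-based or database-side approaches)?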
Also, please post any pointers to resources on the web I can use. All I could find were debates and vendors touting their databases' "supreme data deduplication features".
In case it matters: the records are stored as BLOBs.