Timeline for How do I go about data deduplication at scale?
Current License: CC BY-SA 3.0
6 events
| when | what | by | license | comment |
|---|---|---|---|---|
| Sep 13, 2011 at 9:07 | comment added | MSalters | | RAM scales well because you can use more machines. I'd change the approach a little: use a single hash for partitioning and fast checks, because that means every machine can cache only the corresponding part of the existing rows. Secondly, if you have N machines, divide the work up into 10*N chunks based on hash_value % (10*N). They're unlikely to be the same size, so every worker machine picks up one chunk when it's done with the last. This means you don't have to wait for the last machine to finish that one huge chunk. |
| Sep 12, 2011 at 22:09 | comment added | NoChance | | This is a good idea; the only problem may be the RAM requirement. I also think that it would be very fast. |
| Sep 12, 2011 at 20:46 | history edited | dagnelies | CC BY-SA 3.0 | edited body |
| Sep 12, 2011 at 20:10 | history edited | dagnelies | CC BY-SA 3.0 | added 248 characters in body |
| Sep 12, 2011 at 19:56 | comment added | yati sagade | | delightful :) Thanks. I think before accepting this I'll wait for a few more approaches :) |
| Sep 12, 2011 at 19:45 | history answered | dagnelies | CC BY-SA 3.0 | |
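
MSalters' comment above outlines a hash-partitioning scheme for deduplicating at scale. The following is a minimal single-process sketch of that idea, assuming Python; the function names (`partition_rows`, `dedup_chunk`), the choice of SHA-1, and the toy data are illustrative assumptions, not part of the original answer. The point it shows is that rows are routed to chunks by hash, so duplicates always land in the same chunk and each chunk can be deduplicated independently with only that chunk's rows in RAM.

```python
import hashlib
from collections import defaultdict

def partition_rows(rows, num_machines, chunks_per_machine=10):
    """Assign each row to one of 10*N chunks by hashing it.
    Duplicates hash identically, so they always end up in the same chunk."""
    num_chunks = chunks_per_machine * num_machines
    chunks = defaultdict(list)
    for row in rows:
        h = int(hashlib.sha1(row.encode("utf-8")).hexdigest(), 16)
        chunks[h % num_chunks].append(row)
    return chunks

def dedup_chunk(rows):
    """Deduplicate a single chunk in memory; only this chunk's rows
    need to fit in one worker's RAM."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

if __name__ == "__main__":
    data = ["a,1", "b,2", "a,1", "c,3", "b,2"]
    chunks = partition_rows(data, num_machines=2)
    # In a multi-machine setup, each worker would pull the next unprocessed
    # chunk when it finishes one, so a single oversized chunk does not
    # stall the whole job -- that is the reason for over-partitioning to 10*N.
    deduped = [r for cid in sorted(chunks) for r in dedup_chunk(chunks[cid])]
    print(deduped)
```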