Timeline for How do I go about data deduplication at scale?
Current License: CC BY-SA 3.0
6 events
| when | what | by | license | comment |
|---|---|---|---|---|
| Sep 13, 2011 at 9:07 | comment added | MSalters | | RAM scales well because you can use more machines. I'd change the approach a little: use a single hash for partitioning and fast checks, because that means every machine can cache only the corresponding part of the existing rows. Secondly, if you have N machines, divide the work up into 10*N chunks based on hash_value % (10*N). They're unlikely to be the same size, so every worker machine picks up one chunk when it's done with the last. This means you don't have to wait for the last machine to finish that one huge chunk. |
| Sep 12, 2011 at 22:09 | comment added | NoChance | | This is a good idea; the only problem may be the RAM requirement. I also think that it would be very fast. |
| Sep 12, 2011 at 20:46 | history edited | dagnelies | CC BY-SA 3.0 | edited body |
| Sep 12, 2011 at 20:10 | history edited | dagnelies | CC BY-SA 3.0 | added 248 characters in body |
| Sep 12, 2011 at 19:56 | comment added | yati sagade | | delightful :) Thanks. I think before accepting this I'll wait for a few more approaches :) |
| Sep 12, 2011 at 19:45 | history answered | dagnelies | CC BY-SA 3.0 | |
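
MSalters' comment above outlines a hash-partitioning scheme for deduplicating at scale. The following is a minimal single-process sketch of that idea, assuming Python; the function names (`partition_rows`, `dedup_chunk`), the choice of SHA-1, and the toy data are illustrative assumptions, not part of the original answer. The point it shows is that rows are routed to chunks by hash, so duplicates always land in the same chunk and each chunk can be deduplicated independently with only that chunk's rows in RAM.

```python
import hashlib
from collections import defaultdict

def partition_rows(rows, num_machines, chunks_per_machine=10):
    """Assign each row to one of 10*N chunks by hashing it.
    Duplicates hash identically, so they always end up in the same chunk."""
    num_chunks = chunks_per_machine * num_machines
    chunks = defaultdict(list)
    for row in rows:
        h = int(hashlib.sha1(row.encode("utf-8")).hexdigest(), 16)
        chunks[h % num_chunks].append(row)
    return chunks

def dedup_chunk(rows):
    """Deduplicate a single chunk in memory; only this chunk's rows
    need to fit in one worker's RAM."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

if __name__ == "__main__":
    data = ["a,1", "b,2", "a,1", "c,3", "b,2"]
    chunks = partition_rows(data, num_machines=2)
    # In a multi-machine setup, each worker would pull the next unprocessed
    # chunk when it finishes one, so a single oversized chunk does not
    # stall the whole job -- that is the reason for over-partitioning to 10*N.
    deduped = [r for cid in sorted(chunks) for r in dedup_chunk(chunks[cid])]
    print(deduped)
```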