Timeline for deduplication of lines in a large file
Current License: CC BY-SA 3.0
6 events
| when | what | action | by | license | comment |
|---|---|---|---|---|---|
| Dec 23, 2017 at 12:39 | history | edited | Chris Davies | CC BY-SA 3.0 | added 18 characters in body |
| Dec 22, 2017 at 23:17 | comment | added | Chris Davies | | @Borna that sounds like an interesting question in its own right. When you've asked it I'd appreciate a ping back here with the reference and I'll take a look |
| Dec 22, 2017 at 20:42 | comment | added | Boy | | That is exactly what I was looking for, thank you sir! One question: I was wondering how efficient it would be to create n files in a single directory (under Linux), where each file name is a row from the 'non-unique-lines' file (let's say no illegal chars for the file name), thereby eliminating duplicate rows. |
| Dec 21, 2017 at 19:35 | comment | added | Chris Davies | | @Borna why would you want a hash table when merging multiple pre-sorted files? These external merge-sort algorithms have been around since the days of magnetic tape - at least 50 years ago. [See the sketch below the table.] |
| Dec 21, 2017 at 19:15 | comment | added | Boy | | How to merge? To be able to merge in reasonable time we need some lookup logic, e.g. a hash table, but then we again face the same problem: not enough memory to store a huge hash table. |
| Mar 19, 2015 at 13:54 | history | answered | Chris Davies | CC BY-SA 3.0 | |
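
The comment exchange above turns on one idea: once the big file has been split into chunks that each fit in memory, sorted, and spilled to disk, the chunks can be merged in a single streaming pass and duplicates dropped as they appear next to each other, with no hash table of all lines ever held in memory. The sketch below illustrates that external merge-sort deduplication under stated assumptions; the function name `dedupe_large_file`, the `chunk_lines` parameter, and the file names are illustrative choices, not taken from the original answer.

```python
import heapq
import itertools
import os
import tempfile


def dedupe_large_file(src_path, dst_path, chunk_lines=1_000_000):
    """Remove duplicate lines from a file too large to sort in memory.

    Phase 1: read the input in runs of at most `chunk_lines` lines,
    sort each run in memory, and spill it to a temporary file.
    Phase 2: k-way merge the pre-sorted runs with heapq.merge; the merged
    stream is globally sorted, so duplicates arrive consecutively and can
    be dropped by comparing against the previous line written.
    """
    run_paths = []
    try:
        with open(src_path) as src:
            while True:
                run = list(itertools.islice(src, chunk_lines))
                if not run:
                    break
                # Ensure every line ends with a newline so sorting a run
                # cannot glue the file's final line onto another one.
                run = [line if line.endswith("\n") else line + "\n" for line in run]
                run.sort()
                fd, path = tempfile.mkstemp()
                with os.fdopen(fd, "w") as tmp:
                    tmp.writelines(run)
                run_paths.append(path)

        runs = [open(p) for p in run_paths]
        try:
            with open(dst_path, "w") as dst:
                previous = None
                for line in heapq.merge(*runs):
                    if line != previous:  # duplicates are adjacent in sorted order
                        dst.write(line)
                        previous = line
        finally:
            for f in runs:
                f.close()
    finally:
        for p in run_paths:
            os.remove(p)


# Example usage (hypothetical file names):
# dedupe_large_file("huge.txt", "huge.deduped.txt")
```

Note that the output is sorted rather than in the original line order, which is inherent to sort-based deduplication. In practice, GNU `sort -u huge.txt > huge.deduped.txt` applies the same external merge strategy discussed in the comments, spilling sorted runs to temporary files when the input exceeds available memory and discarding duplicates during the merge.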