Timeline for Removing duplicates in a large text list
Current License: CC BY-SA 3.0
13 events
| when | what | by | license | comment |
|---|---|---|---|---|
| Mar 10, 2017 at 22:49 | history · suggested | MikeD | | added text-processing tag |
| Mar 10, 2017 at 22:22 | comment · added | MikeD | | I added an answer (which I've tested) below: use cat to pipe the InputFile into a while loop, read each line in the loop, and grep -F (or fgrep) that line against the desired OutputFile; if it is not already in the OutputFile, append it with echo (see the full answer below, and the sketch after this timeline). |
| Mar 10, 2017 at 22:17 | review · Suggested edits | | | completed Mar 10, 2017 at 22:49 |
| Mar 10, 2017 at 22:16 | answer · added | MikeD | | timeline score: 0 |
| Mar 10, 2017 at 14:55 | answer · added | Stéphane Chazelas | | timeline score: 2 |
| Dec 10, 2015 at 14:52 | comment · added | Marco | | The following works here (8+ GiB of Unicode): for i in {1..30000000}; do echo 'ᚹᛖᛥᚫ\nəsoʊsiˈeıʃn\n⠙⠳⠃⠞\ntest123\n⌷←⍳→⍴∆∇⊃‾⍎⍕⌈\nTest123\nSTARGΛ̊TE\ntest\nκόψη\ntest123\nსაერთაშორისო\ntest 123\nКонференцию\nพระปกเกศกองบู๊กู้ขึ้นใหม่\nአይታረስ\n'; done | awk '!seen[$0]++'. Furthermore, how did you find out it is a Lisp source file? I don't believe awk's output is Lisp; maybe some tool's heuristics fail on the content of the resulting file. |
| Dec 10, 2015 at 12:33 | comment · added | user146854 | | I tried that exact command earlier. Like sort -u, it does not work properly and produces an "emacs-lisp-source-text" file. However, I think I may have found the source of the problem: all the large files I have tried contain "strange" characters (Arabic, Chinese, hex, ... you name it). Because this only happened with large files, I concluded that the size was the likely cause. Could it be that the sort and awk commands have difficulties with certain kinds of characters? And if so, do you know an alternative that does not? |
| Dec 10, 2015 at 12:20 | comment · added | Marco | | And awk '!seen[$0]++' 8GiB_file > output works without problems here; no issues with the file size. The same goes for sort -u -o output 8GiB_file. (Both approaches are sketched after the timeline.) |
| Dec 10, 2015 at 10:54 | comment · added | Marco | | You should state in your question what exactly you have tried, which commands you ran, and what the console output and return value were. If possible, provide an example file that demonstrates the issue, though that may not be feasible if the error only shows up after a certain size is reached. Also clarify what you mean by "more vulnerable to errors involving large files": small vs. large is relative. On an old laptop with 64 MiB of memory, 100 MiB might be large; on a server with 512 GiB of memory, 100 GiB might be small. |
| Dec 10, 2015 at 10:34 | comment · added | user146854 | | I've gone through those threads, but unfortunately the solutions don't work for me (awk seems to be even more vulnerable to errors involving large files). |
| Dec 10, 2015 at 9:46 | comment · added | Marco | | Possible duplicates: How to remove duplicate lines inside a text file? and How to remove duplicate lines in a large multi-GB textfile? Please check whether the answers work for you (especially the awk one). |
| Dec 10, 2015 at 9:30 | review · First posts | | | |
| Dec 10, 2015 at 9:26 | history · asked | user146854 | CC BY-SA 3.0 | |
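
For reference, a minimal sketch of the two deduplication commands discussed in the Dec 10, 2015 comments. The file names bigfile.txt, deduped.txt and deduped_sorted.txt are placeholders, not names from the original question. awk '!seen[$0]++' keeps the first occurrence of each line and preserves input order; sort -u collapses duplicates but sorts the output. The C-locale variant is an assumption on my part (a common workaround when collation trips over unusual byte sequences), not something stated in the timeline.

```sh
# Order-preserving dedupe: `seen` is an associative array keyed by the whole
# line; the expression is true only the first time a given line appears.
awk '!seen[$0]++' bigfile.txt > deduped.txt

# Sort-based dedupe: duplicates are collapsed, but the output is sorted.
sort -u -o deduped_sorted.txt bigfile.txt

# Assumption: byte-wise comparison in the C locale can sidestep collation
# problems caused by mixed or malformed encodings.
LC_ALL=C sort -u -o deduped_sorted.txt bigfile.txt
```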
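The loop MikeD describes in the Mar 10, 2017 comment can be reconstructed roughly as follows; only the names InputFile and OutputFile come from the comment, everything else is an assumption, and printf stands in for the echo the comment mentions to avoid backslash surprises. Note that grep re-reads OutputFile for every input line, so this scales far worse than the single-pass awk command above.

```sh
#!/bin/sh
# Sketch of the comment's approach: append each line of InputFile to
# OutputFile only if an identical line is not already present there.
: > OutputFile                        # start from an empty output file
cat InputFile | while IFS= read -r line; do
    # -F: fixed-string match, -x: whole line, -q: exit status only
    if ! grep -Fxq -- "$line" OutputFile; then
        printf '%s\n' "$line" >> OutputFile
    fi
done
```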