Timeline for Removing duplicates in a large text list
Current License: CC BY-SA 3.0
13 events
| when | what | by | license | comment |
|---|---|---|---|---|
| Mar 10, 2017 at 22:49 | history · suggested | MikeD | | added text-processing tag |
| Mar 10, 2017 at 22:22 | comment · added | MikeD | | I added an answer (which I've tested) below: use cat to pipe the InputFile into a while loop, read each line in the loop, and grep -F (or fgrep) that line against the desired OutputFile; if it is not already in the OutputFile, append it with echo (see the full answer below, and the sketch after this timeline). |
| Mar 10, 2017 at 22:17 | review · Suggested edits | | | completed Mar 10, 2017 at 22:49 |
| Mar 10, 2017 at 22:16 | answer · added | MikeD | | timeline score: 0 |
| Mar 10, 2017 at 14:55 | answer · added | Stéphane Chazelas | | timeline score: 2 |
| Dec 10, 2015 at 14:52 | comment · added | Marco | | The following works here (8+ GiB of Unicode): for i in {1..30000000}; do echo 'ᚹᛖᛥᚫ\nəsoʊsiˈeıʃn\n⠙⠳⠃⠞\ntest123\n⌷←⍳→⍴∆∇⊃‾⍎⍕⌈\nTest123\nSTARGΛ̊TE\ntest\nκόψη\ntest123\nსაერთაშორისო\ntest 123\nКонференцию\nพระปกเกศกองบู๊กู้ขึ้นใหม่\nአይታረስ\n'; done | awk '!seen[$0]++'. Furthermore, how did you find out it is a Lisp source file? I don't believe awk's output is Lisp; maybe some tool's heuristics fail on the content of the resulting file. |
| Dec 10, 2015 at 12:33 | comment · added | user146854 | | I tried that exact command earlier. Like sort -u, it does not work properly and produces an "emacs-lisp-source-text" file. However, I think I may have found the source of the problem: all the large files I have tried contain "strange" characters (Arabic, Chinese, hex, ... you name it). Because this only happened with large files, I concluded that the size was the likely cause. Could it be that the sort and awk commands have difficulties with certain kinds of characters? And if so, do you know an alternative that does not? |
| Dec 10, 2015 at 12:20 | comment · added | Marco | | And awk '!seen[$0]++' 8GiB_file > output works without problems here; no issues with the file size. The same goes for sort -u -o output 8GiB_file. (Both approaches are sketched after the timeline.) |
| Dec 10, 2015 at 10:54 | comment · added | Marco | | You should state in your question what exactly you have tried, which commands you ran, and what the console output and return value were. If possible, provide an example file that demonstrates the issue, though that may not be feasible if the error only shows up after a certain size is reached. Also clarify what you mean by "more vulnerable to errors involving large files": small vs. large is relative. On an old laptop with 64 MiB of memory, 100 MiB might be large; on a server with 512 GiB of memory, 100 GiB might be small. |
| Dec 10, 2015 at 10:34 | comment · added | user146854 | | I've gone through those threads, but unfortunately the solutions don't work for me (awk seems to be even more vulnerable to errors involving large files). |
| Dec 10, 2015 at 9:46 | comment · added | Marco | | Possible duplicates: How to remove duplicate lines inside a text file? and How to remove duplicate lines in a large multi-GB textfile? Please check whether the answers work for you (especially the awk one). |
| Dec 10, 2015 at 9:30 | review · First posts | | | |
| Dec 10, 2015 at 9:26 | history · asked | user146854 | CC BY-SA 3.0 | |
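
For reference, a minimal sketch of the two deduplication commands discussed in the Dec 10, 2015 comments. The file names bigfile.txt, deduped.txt and deduped_sorted.txt are placeholders, not names from the original question. awk '!seen[$0]++' keeps the first occurrence of each line and preserves input order; sort -u collapses duplicates but sorts the output. The C-locale variant is an assumption on my part (a common workaround when collation trips over unusual byte sequences), not something stated in the timeline.

```sh
# Order-preserving dedupe: `seen` is an associative array keyed by the whole
# line; the expression is true only the first time a given line appears.
awk '!seen[$0]++' bigfile.txt > deduped.txt

# Sort-based dedupe: duplicates are collapsed, but the output is sorted.
sort -u -o deduped_sorted.txt bigfile.txt

# Assumption: byte-wise comparison in the C locale can sidestep collation
# problems caused by mixed or malformed encodings.
LC_ALL=C sort -u -o deduped_sorted.txt bigfile.txt
```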
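The loop MikeD describes in the Mar 10, 2017 comment can be reconstructed roughly as follows; only the names InputFile and OutputFile come from the comment, everything else is an assumption, and printf stands in for the echo the comment mentions to avoid backslash surprises. Note that grep re-reads OutputFile for every input line, so this scales far worse than the single-pass awk command above.

```sh
#!/bin/sh
# Sketch of the comment's approach: append each line of InputFile to
# OutputFile only if an identical line is not already present there.
: > OutputFile                        # start from an empty output file
cat InputFile | while IFS= read -r line; do
    # -F: fixed-string match, -x: whole line, -q: exit status only
    if ! grep -Fxq -- "$line" OutputFile; then
        printf '%s\n' "$line" >> OutputFile
    fi
done
```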