8

I am trying the naive approach:

$ cat * | sort -u > /tmp/bla.txt 

which fails with:

-bash: /bin/cat: Argument list too long 

So, in order to avoid a silly solution like the following (which creates an enormous temporary file):

$ find . -type f -exec cat {} >> /tmp/unsorted.txt \;
$ cat /tmp/unsorted.txt | sort -u > /tmp/bla.txt

I thought I could process the files one by one instead (this should reduce memory consumption, and be closer to a streaming mechanism):

$ cat proc.sh
#!/bin/sh
old=/tmp/old.txt
tmp=/tmp/tmp.txt
cat $old "$1" | sort -u > $tmp
mv $tmp $old

Followed then by:

$ touch /tmp/old.txt
$ find . -type f -exec /tmp/proc.sh {} \;

Is there a simpler, more Unix-style replacement for cat * | sort -u when the number of files reaches MAX_ARG? It feels awkward to write a small shell script for such a common task.

1
  • 2
    Is concatenation needed at all? sort does it automatically for multiple file inputs... but then sort -u * would fail with Argument list too long as well, I suppose. Commented May 15, 2017 at 9:00

5 Answers

11

A simple fix, which works at least in Bash: printf is a builtin, and the command-line argument limits don't apply to it:

printf "%s\0" * | xargs -0 cat | sort -u > /tmp/bla.txt 

(echo * | xargs would also work, except for the handling of file names with white space etc.)
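To see why, here is a quick illustration of the whitespace problem (assuming a directory containing nothing but a file named a b.txt):

$ touch 'a b.txt'
$ echo * | xargs cat
cat: a: No such file or directory
cat: b.txt: No such file or directory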

3
  • This seems like a better answer than the accepted one, since it doesn't require spawning a separate cat process for every file. Commented May 15, 2017 at 18:11
  • 4
    @LarsH, find -exec {} + bunches up multiple files per one execution. With find -exec \; it would be one cat per file. Commented May 15, 2017 at 19:32
  • Ah, good to know. Commented May 16, 2017 at 3:18
9

With GNU sort, and a shell where printf is built-in (all POSIX-like ones nowadays except some variants of pdksh):

printf '%s\0' * | sort -u --files0-from=- > output 

Now, a problem with that: because the two components of the pipeline run concurrently and independently, by the time the left one expands the * glob, the right one may already have created the output file. That could cause problems (maybe not with -u here), as output would then be both an input file and the output file. So you may want to have the output go to another directory (> ../output for instance), or to make sure the glob doesn't match the output file.

Another way to address it in this instance is to write it:

printf '%s\0' * | sort -u --files0-from=- -o output 

That way, it's sort that opens output for writing, and (in my tests) it won't do so before it has received the full list of files (that is, long after the glob has been expanded). This will also avoid clobbering output if none of the input files are readable.

Another way to write it, with zsh or bash:

sort -u --files0-from=<(printf '%s\0' *) -o output 

That's using process substitution (where <(...) is replaced by a file path that refers to the reading end of the pipe printf is writing to). That feature comes from ksh, but ksh insists on making the expansion of <(...) a separate argument to the command, so you can't use it with the --option=<(...) syntax. It would work with this syntax though:

sort -u --files0-from <(printf '%s\0' *) -o output 

Note that you'll see a difference from the approaches that feed sort the output of cat when some files don't end in a newline character:

$ printf a > a
$ printf b > b
$ printf '%s\0' a b | sort -u --files0-from=-
a
b
$ printf '%s\0' a b | xargs -r0 cat | sort -u
ab

Also note that sort sorts using the collation algorithm of the locale (strcoll()), and sort -u reports one of each set of lines that sort the same by that algorithm, not lines that are unique at the byte level. If you only care about lines being unique at the byte level and don't care so much about the order they're sorted in, you may want to fix the locale to C, where sorting is based on byte values (memcmp()); that would probably also speed things up significantly:

printf '%s\0' * | LC_ALL=C sort -u --files0-from=- -o output 
6
  • Feels more natural to write; this also gives sort the opportunity to optimize its memory consumption. I still find printf '%s\0' * a bit complex to type, though. Commented May 16, 2017 at 6:24
  • You could use find . -type f -maxdepth 1 -print0 instead of printf '%s\0' *, but I can't claim it's any easier to type. And the latter is easier to define as an alias, of course! Commented May 16, 2017 at 7:57
  • @TobySpeight echo does have a -n; I would have preferred something like printf -0 %s, which seems a little less low-level than '%s\0'. Commented May 16, 2017 at 8:10
  • @Toby, -maxdepth and -print0 are GNU extensions (though widely supported these days). With other finds (though if you have GNU sort, you're likely to have GNU find as well), you can do LC_ALL=C find . ! -name . -prune -type f ! -name '.*' -exec printf '%s\0' {} + (LC_ALL=C to still exclude hidden files that contain invalid characters, even with GNU find), but that's a bit overkill when you generally have printf builtin. Commented May 16, 2017 at 8:22
  • 2
    @malat, you could always define a print0 function as print0() { [ "$#" -eq 0 ] || printf '%s\0' "$@";} and then print0 * | sort... Commented May 16, 2017 at 8:24
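Spelled out in full, the helper from the last comment could be used like this (a sketch, reusing the -o form from the answer above):

print0() { [ "$#" -eq 0 ] || printf '%s\0' "$@"; }
print0 * | sort -u --files0-from=- -o output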
9
find . -maxdepth 1 -type f ! -name ".*" -exec cat {} + | sort -u -o /path/to/sorted.txt 

This will concatenate all non-hidden regular files in the current directory and sort their combined contents (while removing duplicated lines) into the file /path/to/sorted.txt.

2
  • I was trying to use only two files at a time to avoid consuming lots of memory (my number of files is rather large). Do you believe | will properly chain operations to limit memory usage? Commented May 15, 2017 at 9:00
  • 2
    @malat sort will do an out-of-core sort if memory requirements require it. The left side of the pipeline will consume very little memory in comparison. Commented May 15, 2017 at 9:02
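If you nevertheless want an explicit cap on sort's in-memory buffer, GNU sort accepts -S/--buffer-size; a minimal sketch building on the command above (the 64M figure is an arbitrary example):

find . -maxdepth 1 -type f ! -name ".*" -exec cat {} + | sort -u -S 64M -o /path/to/sorted.txt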
1

Efficiency is a relative term, so you really have to specify which factor you want to minimize: CPU, memory, disk, time, etc. For the sake of argument, I am going to assume that you want to minimize memory usage and are willing to spend more CPU cycles to achieve that. Solutions such as the one given by Stéphane Chazelas work well

sort -u --files0-from <(printf '%s\0' *) > ../output 

but they assume that the individual text files have a high degree of uniqueness to start with. If they don't, i.e. if after

sort -u < sample.txt > sample.srt 

sample.srt is more than 10% smaller than sample.txt, then you will save significant memory by removing the duplicates within each file before you merge. You will also save even more memory by not chaining the commands, which means the results from different processes do not need to be in memory at the same time.

find /somedir -maxdepth 1 -type f -exec sort -u -o {} {} \;
sort -u --files0-from <(printf '%s\0' *) > ../output
3
  • 1
    Memory usage is rarely a concern with sort as sort resorts to using temporary files when memory usage goes beyond a threshold (usually relatively small). base64 /dev/urandom | sort -u will fill up your disk but not use a lot of memory. Commented May 16, 2017 at 10:13
  • Well, at least that's the case for most sort implementations, including the original one in Unix v3 in 1972, but apparently not busybox sort. Presumably because that one is intended to run on small systems that don't have permanent storage. Commented May 16, 2017 at 11:57
  • Note that yes | sort -u (all duplicated data) doesn't have to use more than a few bytes of memory, let alone disk. But with GNU and Solaris sort at least, we see it writing a lot of 2-byte files in /tmp (y\n for every few megabytes of input), so it will end up filling the disk eventually. Commented May 16, 2017 at 12:01
0

Like @ilkkachu, but the cat(1) is unnecessary:

printf "%s\0" * | xargs -0 sort -u 

Note that if the list of file names exceeds the argument-length limit, xargs will invoke sort more than once, and the combined output will then no longer be globally unique.

Also, if the data is very large, you may want to use the sort(1) option --parallel=N, where N is the number of CPUs your computer has.
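For example, combining this with the --files0-from approach from the earlier answer (a sketch assuming GNU sort and nproc from GNU coreutils):

printf '%s\0' * | sort -u --parallel="$(nproc)" --files0-from=- -o output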
