I've got 10k+ files totaling over 20GB that I need to concatenate into one file.
Is there a faster way than the following?

```shell
cat input_file* >> out
```
The preferred way would be a bash command, Python is acceptable too if not considerably slower.
Nope, cat is surely the best way to do this. Why use Python when there is a program already written in C for this purpose? However, you might want to consider using xargs in case the command line length exceeds ARG_MAX and you need more than one cat. Using GNU tools, this is equivalent to what you already have:
```shell
find . -maxdepth 1 -type f -name 'input_file*' -print0 |
  sort -z |
  xargs -0 cat -- >> out
```

Here find is piped through sort. Without this, the files would be listed in an arbitrary order (defined by the file system, which could be file creation order), and note that sort's ordering may differ from that of a bash glob. Otherwise I don't see any cases where xargs or cat would not behave as expected. xargs will call as many cats as is necessary to avoid an E2BIG error from execve(2).

Allocating the space for the output file first may improve the overall speed, as the system won't have to update the allocation for every write.
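As a self-contained sketch of the pipeline above (the working directory and file names are made up for the demo), it can be exercised on a few generated files:

```shell
# Demo of the find | sort | xargs pipeline on three generated files.
# Assumes GNU find/sort/xargs; the names are ASCII, so sort -z yields
# the same order a shell glob would.
dir=$(mktemp -d) && cd "$dir"
for i in 1 2 3; do
  printf 'part%s\n' "$i" > "input_file$i"
done

find . -maxdepth 1 -type f -name 'input_file*' -print0 |
  sort -z |
  xargs -0 cat -- >> out

cat out   # part1, part2, part3, one per line
```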
For instance, if on Linux:
```shell
size=$({ find . -maxdepth 1 -type f -name 'input_file*' -printf '%s+'; echo 0; } | bc)
fallocate -l "$size" out &&
  find . -maxdepth 1 -type f -name 'input_file*' -print0 |
  sort -z |
  xargs -r0 cat 1<> out
```

Another benefit is that if there's not enough free space, the copy will not be attempted.
If on btrfs, you could copy the first file with cp --reflink=always (which implies no data copy and would therefore be almost instantaneous), and append the rest. With 10000 files, that probably won't make much difference though, unless the first file is very big.
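As an illustrative sketch (using --reflink=auto rather than always so the demo degrades to a plain copy on file systems without reflink support; file names are invented):

```shell
# Reflink-copy the first file (shares extents, no data copy, on btrfs/XFS),
# then append the remaining files normally.
dir=$(mktemp -d) && cd "$dir"
printf 'first\n'  > input_file1
printf 'second\n' > input_file2

cp --reflink=auto input_file1 out   # --reflink=always would fail where unsupported
cat input_file2 >> out

cat out   # first, then second
```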
There's an API to generalise that to ref-copy all the files (the BTRFS_IOC_CLONE_RANGE ioctl), but I could not find any utility exposing that API, so you'd have to do it in C (or python or other languages provided they can call arbitrary ioctls).
If the source files are sparse or have large sequences of NUL characters, you could make a sparse output file (saving time and disk space) with (on GNU systems):
```shell
find . -maxdepth 1 -type f -name 'input_file*' -print0 |
  sort -z |
  xargs -r0 cat |
  cp --sparse=always /dev/stdin out
```

On the fallocate variant above: the redirection is neither > nor >>, but 1<>, to write into the file in place. <> is the standard Bourne/POSIX read+write redirection operator; see your shell manual or the POSIX spec for details. The default fd for the <> operator is 0 (<> is short for 0<>, just as < is short for 0< and > is short for 1>), so you need the 1 to explicitly redirect stdout. Here it's not so much that we need read+write (O_RDWR), but that we don't want O_TRUNC (as with >), which would deallocate what we've just allocated. The gain from fallocate should negate the overhead of the extra find, even though find will be faster the second time round. btrfs certainly opens up some interesting possibilities, though.
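A quick sketch of the sparse-output effect, with a single generated file containing a megabyte of NULs (sizes shown assume GNU stat and a file system with sparse-file support):

```shell
# Pipe data with a long NUL run through cp --sparse=always: the NULs become
# a hole in the output, so its allocated size is far below its length.
dir=$(mktemp -d) && cd "$dir"
{ printf 'head'; dd if=/dev/zero bs=1M count=1 status=none; printf 'tail'; } > input_file1

cat input_file1 | cp --sparse=always /dev/stdin out

stat -c '%s %b' out   # apparent size 1048584 bytes; far fewer blocks allocated
```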