I regularly need to transfer millions of small files (small images, txt, json), averaging 5-50 KB per file, between servers or to AWS S3.
Is there a faster way to merge them into a single file to optimize transfer speed, other than zip or tar -cf?
Something similar to tar czf - * | ssh <host> "tar xzf -"? Seriously, what is wrong with tar? This command does not create any intermediate files.
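For example, a minimal variant of that pipeline that also picks the destination directory on the remote side (just a sketch; <host> and /dest are placeholders):

tar czf - . | ssh <host> "mkdir -p /dest && tar xzf - -C /dest"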
ssh to access S3? Try tar --zstd -cf - * | ssh <host> "tar --zstd -xf -", since zstd mostly obsoletes gzip, being much faster at similar or better compression ratios. But can you really ssh to S3 and run a tar command there (without a virtual machine)? zstd is better there too, provided whatever reads the upload can extract a .tar.zst afterwards; .tar.zst is great. Developing the idea from another answer, you can send the data through a pipe and never even create a local tar file. The command will be something like:
tar cf - * | aws s3 cp - s3://some-bucket/archive.tar
This command has the advantage that tar and aws run in parallel. You can even add compression (which again executes in parallel):
tar cf - * | gzip -c | aws s3 cp - s3://some-bucket/archive.tar.gz
To simplify the operation you can put the files under a top-level directory and avoid *:
tar cf - top_level_directory | aws s3 cp - s3://some-bucket/archive.tar
tar cf - top_level_directory | gzip -c | aws s3 cp - s3://some-bucket/archive.tar.gz
Inspired by another answer, you can also use cpio. It seems to be faster and to produce smaller files:
ls | cpio -o | gzip -c | aws s3 cp - s3://some-bucket/archive.cpio.gz
You don't need gzip's -c here: tar -cf - * | gzip | ... should work, at least with the tools on my Linux machine. Also, you don't really need a separate gzip at all; you could do tar -czf - *, unless there is some advantage in adding a separate call to gzip. I timed tar czf file.tgz file against tar cf - file | gzip -c > file.tgz using a 288M file on an SSD, and the first took 0m20.004s (real) while the second took 0m20.141s. So, at least in this particular test, using the pipe was marginally slower. Granted, the difference is trivial, but it could become relevant for larger files. Maybe? At the very least, the pipe does not seem to add any advantage that justifies the more complex command. Note, though, that tar | gzip is the same as tar cz: both run tar and gzip in parallel through a pipe. To deny that would suggest tar cz either keeps all the data in a buffer (risking memory exhaustion) or creates a temporary file, neither of which is true. One approach is simply never to create an intermediate file, as suggested in the other answers. This cuts out local I/O, though it makes resuming a partial upload impossible.
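The reverse direction streams just as well, since aws s3 cp can write the object to stdout. A sketch, reusing the placeholder bucket and key names from above:

aws s3 cp s3://some-bucket/archive.tar.gz - | tar xzf -
aws s3 cp s3://some-bucket/archive.cpio.gz - | gunzip | cpio -idm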
There are other options, though, to improve things further:
Use lz4 or zstd for the compression step between tar and the transfer you're describing. ZSTD is not as fast as LZ4, but will get you compression ratios similar to GZip in less time. Irrespective of the choice, this will cut down, possibly significantly, on the total data to transfer.
Use cpio instead of tar. tar is not exactly a space-efficient archive format. This usually doesn't matter much, but if you're dealing with millions of very small files, the overhead is actually pretty significant. cpio still has a nontrivial amount of overhead, but it's less than tar's, so in theory using cpio here should result in a nontrivial reduction in the amount of data to transfer (a sketch combining cpio with zstd follows below).
I tested this with mkdir -p foo/{0001..1000}; touch foo/{0001..1000}/{001..100}; tar cf foo.tar foo; find foo/ | bsdcpio -o > foo.cpio (bsdcpio is the libarchive implementation); zstd foo.cpio foo.tar. The uncompressed tar archive was much bigger, 50M vs. 8.6M, but the compressed tar archive was half the size, 231K vs. 422K; tar compressed better. cpio -o --binary (aka -7, to use binary instead of ASCII text) doesn't work with that many files, nor does -6, and the other formats available, like --format newc, were larger than the default cpio format. The zero bytes in tar records compress extremely well, especially since the padding is the same length every time, and the test probably exaggerates the effect by not having any file contents to make the compression dictionary work harder rather than just using very short symbols for the similar metadata.
Still, a point in favour of cpio over tar: cpio archives will also generally be processed faster, though the difference is again pretty small (at least on modern systems; on older systems it could be pretty significant, which is part of why cpio stuck around).
It might be worth it to investigate the rsync command, as what you want to do is transfer the files.
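Putting the zstd and cpio suggestions together, a minimal sketch of a no-intermediate-file transfer (assuming GNU find and cpio plus zstd on both ends; <host> and /dest are placeholders):

find top_level_directory -depth -print0 \
  | cpio -o --null \
  | zstd -T0 \
  | ssh <host> "cd /dest && zstd -d | cpio -idm"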
--- Added after comments.
This is an XY question: "It takes a long time to transfer the files, so I want to compress them more." I believe a better question would be: "It takes a long time to transfer the files; how can I make it happen faster?" Rsync is an answer to that second question.
Note that rsync creates and copies the files; it does not create an archive that you later need to decompress (which would add to the total time as well). Because it copies, basically, only "changed" or "new" files, it can be set to run nearly continuously (say, every hour) and will have no problems restarting after network failures. It pays to think a bit about the directory structure, though.
The exact options to use for rsync in different situations depend on the relative speed of the network compared to the CPU. One important thing to note is that if you select compression, it is done simultaneously with the transfer, unlike zipping or similar approaches where you first compress, then transfer, then decompress. This can save a lot of total time, even if you end up skipping the compression. (Compression, in my experience, saves very little time in rsync when sending lots of small files.)
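As a concrete but hedged example of such an invocation (the user, host, and paths are placeholders, and -z is the optional compression discussed above):

# plain copy; rsync runs over ssh by default
rsync -a --partial top_level_directory/ user@host:/dest/
# add -z only if the network, not the CPU, is the bottleneck
rsync -az --partial top_level_directory/ user@host:/dest/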
Rsync is available on most if not all modern systems, including servers and, to my understanding, AWS S3, but I may be wrong.
For AWS S3? Heavily multithreaded use of rclone is what I use. I do a similar thing, both between servers and to S3.
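A sketch of that approach (the remote name s3remote, the bucket, and the concurrency numbers are placeholders to tune; it assumes an S3 remote already set up with rclone config):

rclone copy top_level_directory s3remote:some-bucket/prefix \
  --transfers 64 --checkers 128 --fast-list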
What is your storage?
This answer shows that on a single spindle, sorting by inode and running jobs in parallel gets the fastest results.
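A rough sketch of that idea, not the linked answer itself (GNU find, sort, split, tar, zstd, and the aws CLI are assumed; the bucket name and the chunk count of 4 are placeholders):

# list files in inode order so a single spindle reads mostly sequentially
find top_level_directory -type f -printf '%i\t%p\n' | sort -n | cut -f2- > /tmp/filelist
# split the list into 4 contiguous chunks and upload them in parallel
split -n l/4 /tmp/filelist /tmp/chunk.
for list in /tmp/chunk.*; do
  tar cf - --no-recursion -T "$list" | zstd -T0 \
    | aws s3 cp - "s3://some-bucket/$(basename "$list").tar.zst" &
done
wait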
tar/zip?