I regularly need to transfer millions of small files (small images, txt, json), averaging 5-50 KB per file, between servers or to AWS S3.
Is there a faster way to merge them into a single file to optimize transfer speed, other than zip or tar -cf?
Something similar to tar czf - * | ssh <host> "tar xzf -"? Seriously, what is wrong with tar? This command does not create any intermediate files.
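For example, a minimal variant of that pipeline that also picks the destination directory on the remote side (just a sketch; <host> and /dest are placeholders):

tar czf - . | ssh <host> "mkdir -p /dest && tar xzf - -C /dest"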
ssh to access S3? Try tar --zstd -cf - * | ssh <host> "tar --zstd -xf -", since zstd mostly obsoletes gzip, being much faster at similar or better compression ratios. But can you really ssh to S3 and run a tar command there (without a virtual machine)? zstd is better there too, provided whatever reads the upload can extract a .tar.zst afterwards; .tar.zst is great. Developing the idea from another answer, you can send the data through a pipe and never even create a local tar file. The command will be something like:
tar cf - * | aws s3 cp - s3://some-bucket/archive.tar
This command has the advantage that tar and aws run in parallel. You can even add compression (which again executes in parallel):
tar cf - * | gzip -c | aws s3 cp - s3://some-bucket/archive.tar.gz
To simplify the operation you can put the files under a top-level directory and avoid *:
tar cf - top_level_directory | aws s3 cp - s3://some-bucket/archive.tar
tar cf - top_level_directory | gzip -c | aws s3 cp - s3://some-bucket/archive.tar.gz
Inspired by another answer, you can also use cpio. It seems to be faster and to produce smaller files:
ls | cpio -o | gzip -c | aws s3 cp - s3://some-bucket/archive.cpio.gz
You don't need gzip's -c here: tar -cf - * | gzip | ... should work, at least with the tools on my Linux machine. Also, you don't really need a separate gzip at all; you could do tar -czf - *, unless there is some advantage in adding a separate call to gzip. I timed tar czf file.tgz file against tar cf - file | gzip -c > file.tgz using a 288M file on an SSD, and the first took 0m20.004s (real) while the second took 0m20.141s. So, at least in this particular test, using the pipe was marginally slower. Granted, the difference is trivial, but it could become relevant for larger files. Maybe? At the very least, the pipe does not seem to add any advantage that justifies the more complex command. Note, though, that tar | gzip is the same as tar cz: both run tar and gzip in parallel through a pipe. To deny that would suggest tar cz either keeps all the data in a buffer (risking memory exhaustion) or creates a temporary file, neither of which is true. One approach is simply never to create an intermediate file, as suggested in the other answers. This cuts out local I/O, though it makes resuming a partial upload impossible.
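The reverse direction streams just as well, since aws s3 cp can write the object to stdout. A sketch, reusing the placeholder bucket and key names from above:

aws s3 cp s3://some-bucket/archive.tar.gz - | tar xzf -
aws s3 cp s3://some-bucket/archive.cpio.gz - | gunzip | cpio -idm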
There are other options, though, to improve things further:
Use lz4 or zstd for the compression step between tar and the transfer you're describing. ZSTD is not as fast as LZ4, but will get you compression ratios similar to GZip in less time. Irrespective of the choice, this will cut down, possibly significantly, on the total data to transfer.
Use cpio instead of tar. tar is not exactly a space-efficient archive format. This usually doesn't matter much, but if you're dealing with millions of very small files, the overhead is actually pretty significant. cpio still has a nontrivial amount of overhead, but it's less than tar's, so in theory using cpio here should result in a nontrivial reduction in the amount of data to transfer (a sketch combining cpio with zstd follows below).
I tested this with mkdir -p foo/{0001..1000}; touch foo/{0001..1000}/{001..100}; tar cf foo.tar foo; find foo/ | bsdcpio -o > foo.cpio (bsdcpio is the libarchive implementation); zstd foo.cpio foo.tar. The uncompressed tar archive was much bigger, 50M vs. 8.6M, but the compressed tar archive was half the size, 231K vs. 422K; tar compressed better. cpio -o --binary (aka -7, to use binary instead of ASCII text) doesn't work with that many files, nor does -6, and the other formats available, like --format newc, were larger than the default cpio format. The zero bytes in tar records compress extremely well, especially since the padding is the same length every time, and the test probably exaggerates the effect by not having any file contents to make the compression dictionary work harder rather than just using very short symbols for the similar metadata.
Still, a point in favour of cpio over tar: cpio archives will also generally be processed faster, though the difference is again pretty small (at least on modern systems; on older systems it could be pretty significant, which is part of why cpio stuck around).
It might be worth it to investigate the rsync command, as what you want to do is transfer the files.
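Putting the zstd and cpio suggestions together, a minimal sketch of a no-intermediate-file transfer (assuming GNU find and cpio plus zstd on both ends; <host> and /dest are placeholders):

find top_level_directory -depth -print0 \
  | cpio -o --null \
  | zstd -T0 \
  | ssh <host> "cd /dest && zstd -d | cpio -idm"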
--- Added after comments.
This is an XY question: "It takes a long time to transfer the files, so I want to compress them more." I believe a better question would be: "It takes a long time to transfer the files; how can I make it happen faster?" Rsync is an answer to that second question.
Note that rsync creates and copies the files; it does not create an archive that you later need to decompress (which would add to the total time as well). Because it copies, basically, only "changed" or "new" files, it can be set to run nearly continuously (say, every hour) and will have no problems restarting after network failures. It pays to think a bit about the directory structure, though.
The exact options to use for rsync in different situations depend on the relative speed of the network compared to the CPU. One important thing to note is that if you select compression, it is done simultaneously with the transfer, unlike zipping or similar approaches where you first compress, then transfer, then decompress. This can save a lot of total time, even if you end up skipping the compression. (Compression, in my experience, saves very little time in rsync when sending lots of small files.)
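As a concrete but hedged example of such an invocation (the user, host, and paths are placeholders, and -z is the optional compression discussed above):

# plain copy; rsync runs over ssh by default
rsync -a --partial top_level_directory/ user@host:/dest/
# add -z only if the network, not the CPU, is the bottleneck
rsync -az --partial top_level_directory/ user@host:/dest/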
Rsync is available on most if not all modern systems, including servers and, to my understanding, AWS S3, but I may be wrong.
For AWS S3? Heavily multithreaded use of rclone is what I use. I do a similar thing, both between servers and to S3.
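A sketch of that approach (the remote name s3remote, the bucket, and the concurrency numbers are placeholders to tune; it assumes an S3 remote already set up with rclone config):

rclone copy top_level_directory s3remote:some-bucket/prefix \
  --transfers 64 --checkers 128 --fast-list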
What is your storage?
This answer shows that on a single spindle, sorting by inode and running jobs in parallel gets the fastest results.
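A rough sketch of that idea, not the linked answer itself (GNU find, sort, split, tar, zstd, and the aws CLI are assumed; the bucket name and the chunk count of 4 are placeholders):

# list files in inode order so a single spindle reads mostly sequentially
find top_level_directory -type f -printf '%i\t%p\n' | sort -n | cut -f2- > /tmp/filelist
# split the list into 4 contiguous chunks and upload them in parallel
split -n l/4 /tmp/filelist /tmp/chunk.
for list in /tmp/chunk.*; do
  tar cf - --no-recursion -T "$list" | zstd -T0 \
    | aws s3 cp - "s3://some-bucket/$(basename "$list").tar.zst" &
done
wait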
tar/zip?