
I'm currently looking for a way to improve my system backup performance and realized the following during some tests:

Backing up my Ubuntu system from an SSD to an HDD (both ext4) using TAR without compression is much faster than rsyncing the same content from SSD to HDD.

In detail:

  • TAR takes 1 h 15 min and produces a 429 G file
  • rsync takes 5 h and produces a 406 G folder

Both should copy the same data, as they're using exclude files with the same content, slightly adjusted to each tool's syntax.

I'm not sure why the final TAR archive is actually larger than the rsynced folder, but I don't really care about that at the moment.

What I'm really interested in is why TAR is so much faster, and whether I could somehow tune rsync (or some other file copy tool) to reach similar performance.

I don't really want to use TAR for my backup strategy, as unpacking such large archives, or extracting only single files from them, takes "ages", which would be problematic when I really need access to them.
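
For illustration, restoring a single file from the archive would look roughly like the sketch below (the member path is hypothetical); since the archive has no index, tar has to read through it sequentially until it finds the requested member, which is why this takes so long on a ~429 G file.

    # tar scans the archive from the start until it hits the requested member.
    mkdir -p /tmp/restore
    tar -xvf "/media/backup/full" -C /tmp/restore "etc/fstab"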

I realize that I could improve rsync performance drastically by always copying to the same target folder and therefore getting incremental copies, but that's explicitly not what I'm looking for, as I'd like to always have multiple backups from different days.


Update: additional info

Tested alternative "copy via TAR"

I also tried "copying via TAR" (as described here or here), which is a bit slower than rsync, so the bottleneck seems to be the write speed of the destination drive.
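
For reference, that "copy via TAR" idiom usually looks roughly like the following (a sketch, not necessarily the exact command I ran; the --acls/--xattrs flags mirror rsync's -A/-X):

    # Stream the source tree into a second tar that unpacks it at the destination.
    mkdir -p /media/backup/full
    tar -X "tar-excludes.txt" --acls --xattrs -C / -cf - . |
        tar --acls --xattrs -C /media/backup/full -xf -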

Commands used

I've used the following commands, which led to the results above:

  • tar -X "tar-excludes.txt" -cvf "/media/backup/full" "/"
  • rsync -aAXWvh --stats --info=progress2 --exclude-from "rsync-excludes.txt" --log-file="log.txt" "/" "/media/backup/full"

Files

I'm backing up my whole OS (with some exclusions), so the backup includes all kinds of files: some large ones as well as, surely, many small ones.

Device details

The host is an ~8-year-old Intel NUC D34010WYKH.

The source drive is an internal SSD and the destination drive is an external HDD connected via USB 3.0. Both drives are formatted with ext4.
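
To verify that the HDD's sequential write speed is not already the ceiling, a quick sanity check could look like this (a sketch; it assumes the backup drive is mounted at /media/backup and writes about 4 GiB of throw-away data):

    # Rough sequential-write benchmark of the external HDD.
    dd if=/dev/zero of=/media/backup/ddtest bs=1M count=4096 conv=fdatasync status=progress
    rm /media/backup/ddtest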

  • One guess: creating one large file is much, much faster than creating thousands/millions of very small files. Commented May 31, 2023 at 6:12
  • @muru Makes sense. This would also explain why "TAR copy" (see update above) is not faster than rsync or plain cp in my case. Commented May 31, 2023 at 7:53
  • @Archemar Could you explain that in a bit more detail? At the moment I'm syncing /. Are you suggesting starting a dedicated rsync process for each folder in / that I'm interested in? E.g. /bin, /boot, /etc, /home, ...? Commented May 31, 2023 at 8:34
  • Note: tar or scp is often better for a first copy, but for subsequent copies (if you don't have many changes) rsync may perform better. It also depends on whether you have some support on the remote side. Incremental is better, so a common technique is to duplicate your backup on the remote site and then do incremental backups, possibly with hard links so you save space (a sketch of that hard-link approach follows these comments). Commented May 31, 2023 at 10:31
  • The extra size (~5%) of the tar file is the 512-byte (usually) metadata header for every file, kept inline in the archive, plus some block rounding. The rsync folder spreads this metadata across many inodes, which you won't be measuring as part of the folder size, but it has both a time and a space cost. Commented Jan 2, 2024 at 9:54
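
The hard-link technique mentioned in the comment above is usually done with rsync's --link-dest: each day gets its own folder, and unchanged files are hard-linked against the previous day's snapshot. A minimal sketch, assuming a hypothetical /media/backup/YYYY-MM-DD layout:

    # Every dated folder looks like a full backup, but unchanged files share
    # disk space with yesterday's snapshot via hard links.
    TODAY="/media/backup/$(date +%F)"
    YESTERDAY="/media/backup/$(date -d yesterday +%F)"
    rsync -aAXH --exclude-from "rsync-excludes.txt" \
        --link-dest="$YESTERDAY" / "$TODAY/"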

3 Answers


The various cpio and tar file formats are simple sequences of file headers followed by file data. Writing out a new file header is simply appending a record to the output file. Writing out file data is simply appending more records to the output file.

That's the only thing that ever happens: records get appended to the output file. Usually these records are batched into 10KiB or 5KiB (or sometimes even 1MiB) blocks, too.
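
You can see this layout directly on a tiny example (a sketch using od, which should be available everywhere):

    # A tar stream is just 512-byte records: one header per file, then the
    # file data padded to the next 512-byte boundary.
    printf 'hello\n' > demo.txt
    tar -cf demo.tar demo.txt
    od -A x -c demo.tar | head   # header at offset 0x000, file data at 0x200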

This is a very efficient thing to do. If the output file is an actual tape device this is just append writes, sequential output, to the current position on the tape. This comes as no surprise. These utilities were designed to archive files to tape, with its good sequential I/O characteristics, and terrible random-access I/O characteristics.

(Adding on compression doesn't change this, either. The compression utilities are designed to use sequential I/O, too.)

Even if this is a file on a disc volume, each appended batch of records is basically three actions: twiddling the disc volume's free space maps to get another block, twiddling the file i-node to include that new block at the end of the file (which filesystems can make cheap by using extents and suitable allocation policies), and writing out the block. With the common filesystem driver optimization of speculatively pre-allocating runs of contiguous data blocks when it detects sequential append write patterns, this can be done very cheaply indeed.
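
On ext4 you can observe the result of this: the finished archive typically ends up in a handful of large extents rather than scattered blocks. A quick check (filefrag is part of e2fsprogs; the path is taken from the commands in the question):

    sudo filefrag -v /media/backup/full   # usually shows few, large extents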

Whereas an rsync-style backup is making an entire tree in the disc volume, which involves creating directory entries, updating B-trees and whatnot, allocating i-nodes, making hard links, and updating any journals, in addition to doing at the individual file level what the cpio/tar archive does: twiddling the disc volume's free space maps, twiddling an i-node's block allocations, and writing out file data blocks.

Writing out an archive using only sequential append operations is very efficient for tape, and can be made reasonably efficient for a single archive file stored on a disc volume. Writing out lots of individual files involves inherently doing a lot more.

The obvious things that you are trading for this efficiency are of course ease of making in-line modifications to an archive, good random access reading of the archive, and smart incremental backup facilities.

In the 1980s, Rahul Dhesi created an archive format that was mostly serial, that enabled in-line updates to be made to existing archives using a modicum of random access I/O to go back and overwrite the headers for superseded files. Its tradeoffs were that file headers and data for superseded files had to be purged by rewriting the whole archive, every so often, and of course the greater space requirements for holding old not-yet-purged versions of files.

  • Thank you very much for this very detailed answer! I marked this as accepted as it answers my main question. Commented Jun 1, 2023 at 8:24
  • In the case of tape, it is very important that the tape-writing host can keep up with the tape device: a streaming tape drive demands data arriving while the fast-moving tape is under the write head, so it is a hard real-time system. A system which can't keep up will cause the tape device to stop, overrun, reverse a little, wait a little for data, and then spool up to writing speed again. Commented Feb 7 at 15:11

TAR takes 1h 15min and produces a 429G large file

rsync takes 5h and produces a 406G large folder

Peering into my crystal ball, I can make some deductions from this: there are lots of small files and significant latency between the source and destination devices. It would have been helpful if you'd had a look at these factors and included what you found in your question, along with the actual commands you are running to create the backups.

Tar is so much faster because:

  • the data traffic is only going in one direction, so it can (potentially) saturate the connection; rsync, on the other hand, needs to retrieve data synchronously from both ends
  • tar writes a single stream so there is no impact from file creation

I realize that I could improve rsync performance drastically by always copying to the same target folder & therefore getting incremental copies but that's explicitly not what I'm looking for as I'd like to always have multiple backups of different days

Assuming that the source and destination are block devices attached to the same host (again, not specified), maybe you should try overlay filesystems.
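
One possible shape of that idea, purely as an illustrative sketch with hypothetical paths: keep the previous backup as the read-only lower layer and let a new, initially empty upper layer collect that day's differences.

    # The merged view shows a full backup, but only changed files occupy
    # space in the upper layer (workdir must be empty and on the same fs as upperdir).
    mkdir -p /media/backup/2023-06-01 /media/backup/work /media/backup/merged
    sudo mount -t overlay overlay \
        -o lowerdir=/media/backup/2023-05-31,upperdir=/media/backup/2023-06-01,workdir=/media/backup/work \
        /media/backup/merged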

  • I added some more info to my question in the update section. Please let me know if I can provide anything else. I'm not familiar with overlay FS yet and will consider it once I've concluded some other tests. Commented Jun 1, 2023 at 8:23

This is the / directory of a CentOS system (the distribution is irrelevant here):

bin boot dev etc home lib lost+found media mnt opt proc root sbin selinux srv sys tmp usr var 

Most likely you don't want to copy /dev, /proc and /sys; depending on your needs, you may not want /media either.

So instead of a single rsync / $DEST (I assume $DEST is on another host), you can run something like:

    rsync /bin /boot /etc /lib /root /sbin /selinux $DEST &
    sleep 300
    rsync /home $DEST &
    sleep 300
    rsync /opt $DEST &
    ...
    wait

If all your data is in /home, you can split further down:

    rsync /home/dir1 $DEST &
    sleep 300
    rsync /home/dir2 $DEST &
    ...

You will have to adjust $DEST accordingly, or use rsync's exclude options.

Let's say you have 1,000,000 files: rsync (one process or many) still has to stat 1 M files on the source side and 1 M files on the destination side, do some compression, etc.

As mentioned in a comment, I had to synchronize, twice a day, directories with 100 M files. A single rsync would take 14 to 16 hours; with the above strategy (and a bit of trial and error), I managed to get down to 4 to 5 hours, using 20 rsync processes (15 of them short-lived).
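
If listing the directories by hand gets unwieldy, the same idea can be scripted; a sketch (assuming GNU find/xargs and that $DEST is already set):

    # Run up to 4 rsyncs in parallel, one per top-level directory under /home.
    find /home -mindepth 1 -maxdepth 1 -type d -print0 |
        xargs -0 -P4 -I{} rsync -aAX {} "$DEST/home/"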

  • Very interesting approach. Most of my files are somewhere within /var/www/nextcloud (mostly data from 2 main users, ~ 340 G in total). I'll have to split/parallelize somewhere inside this folder but already have an idea on how to do this. I'll give this a try and let you know in the comments. Thanks for the details! Commented May 31, 2023 at 9:06
  • Result: Parallelizing rsync brought the overall time from 5 h down to 4 h 30 min which is not enough gain to justify the additional complexity & effort of this method for me. Thanks for the tip nonetheless! Commented Jun 1, 2023 at 13:15
