
I need to speed up the process of syncing log files between servers.

The machine that generates the logs (LOGMACHINE) creates them in a tree like this:

    /Files
    /Files/LOGS1/
    /Files/LOGS1/logFiles/
    /Files/LOGS1/logFiles/typeLog1A
    /Files/LOGS1/logFiles/typeLog1B
    /Files/LOGS1/logFiles/typeLog1C
    /Files/LOGS1/logFiles/typeLog1C/fileLog1C-20210113-0900.xml.gz
    /Files/LOGS1/logFiles/typeLog1C/fileLog1C-20210113-0915.xml.gz
    /Files/LOGS1/logFiles/typeLog1C/fileLog1C-20210113-0930.xml.gz
    /Files/LOGS1/logFiles/typeLog2A
    /Files/LOGS1/logFiles/typeLog2A/fileLog2A-20210113-0900.xml.gz
    /Files/LOGS1/logFiles/typeLog2A/fileLog2A-20210113-0915.xml.gz
    /Files/LOGS1/logFiles/typeLog2A/fileLog2A-20210113-0930.xml.gz
    /Files/LOGS2/
    /Files/LOGS2/logFiles/
    /Files/LOGS2/logFiles/typeLog1A
    /Files/LOGS2/logFiles/typeLog1B
    /Files/LOGS2/logFiles/typeLog1C
    /Files/LOGS2/logFiles/typeLog1C/fileLog1C-20210113-0900.xml.gz
    /Files/LOGS2/logFiles/typeLog1C/fileLog1C-20210113-0915.xml.gz
    /Files/LOGS2/logFiles/typeLog1C/fileLog1C-20210113-0930.xml.gz
    /Files/LOGS2/logFiles/typeLog2A
    /Files/LOGS2/logFiles/typeLog2A/fileLog2A-20210113-0900.xml.gz
    /Files/LOGS2/logFiles/typeLog2A/fileLog2A-20210113-0915.xml.gz
    /Files/LOGS2/logFiles/typeLog2A/fileLog2A-20210113-0930.xml.gz

There are around 4,000 typeLog1* folders and 9,000 typeLog2* folders. Each one gets a new file every 15 minutes.

I run two servers: SERV1 syncs the typeLog1* folders and SERV2 syncs the typeLog2* folders, both from LOGMACHINE. Each server syncs both the LOGS1 and LOGS2 folders.

Right now I'm using rsync, and a single pass over just one of the LOGS folders takes 30 minutes. That delays each file by roughly 30 minutes to an hour.

I built a workaround that runs several rsync processes in parallel. Unfortunately, I can only have 8 SSH sessions in parallel; that's a limit imposed by the machine that creates the logs.
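
One way to make the most of those 8 sessions is to let xargs dispatch one rsync per type directory, capped at 8 workers. This is a sketch, not my exact setup: HOST, SRC, and DST mirror the paths from the question, and listing the directories in a separate first step is an assumption (it keeps the listing session closed before the workers open theirs):

```shell
#!/bin/sh
# Sketch: one rsync per typeLog1* directory, at most 8 in parallel.
# HOST/SRC/DST mirror the paths in the question; adjust as needed.
HOST=user@logmachine
SRC=/home/user/Files/LOGS1/logFiles
DST=Files/LOGS1/logFiles

# Grab the directory list first, so the listing session has closed
# before the 8 worker sessions open.
dirs=$(ssh "$HOST" "cd $SRC && find . -maxdepth 1 -type d -name 'typeLog1*'")

# -I{} runs one rsync per input line; -P 8 keeps up to 8 running at once.
printf '%s\n' "$dirs" |
  xargs -P 8 -I{} \
    rsync -a --rsync-path=/usr/local/bin/rsync \
      "$HOST:$SRC/{}/" "$DST/{}/"
```

With -I, xargs substitutes each directory name into the command, so every worker session only walks one small subtree instead of the whole 13,000-folder tree.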

Limitations:

  • I have to use SSH.
  • I can't install any software on the machine that creates the logs.

Is there any way to speed up the process, using rsync or an alternative?

Update:

Current rsyncs:

  • On SERV1:

    rsync -avz --rsync-path=/usr/local/bin/rsync --ignore-existing --delete \
        --files-from=<(ssh user@logmachine 'cd /home/user/Files/LOGS1/logFiles/; find . -mtime -1 -type f -name "*fileLog1*.xml.gz"') \
        user@logmachine:/home/user/Files/LOGS1/logFiles/ Files/LOGS1/logFiles/

    rsync -avz --rsync-path=/usr/local/bin/rsync --ignore-existing --delete \
        --files-from=<(ssh user@logmachine 'cd /home/user/Files/LOGS2/logFiles/; find . -mtime -1 -type f -name "*fileLog1*.xml.gz"') \
        user@logmachine:/home/user/Files/LOGS2/logFiles/ Files/LOGS2/logFiles/
  • On SERV2:

    rsync -avz --rsync-path=/usr/local/bin/rsync --ignore-existing --delete \
        --files-from=<(ssh user@logmachine 'cd /home/user/Files/LOGS1/logFiles/; find . -mtime -1 -type f -name "*fileLog2*.xml.gz"') \
        user@logmachine:/home/user/Files/LOGS1/logFiles/ Files/LOGS1/logFiles/

    rsync -avz --rsync-path=/usr/local/bin/rsync --ignore-existing --delete \
        --files-from=<(ssh user@logmachine 'cd /home/user/Files/LOGS2/logFiles/; find . -mtime -1 -type f -name "*fileLog2*.xml.gz"') \
        user@logmachine:/home/user/Files/LOGS2/logFiles/ Files/LOGS2/logFiles/

The filename pattern in the find command is necessary because there are other files in those folders.
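
If the 8-session cap is actually sshd's limit on concurrent connections (MaxStartups) rather than on channels per connection, SSH connection multiplexing on SERV1/SERV2 might let several rsyncs share one TCP connection without touching LOGMACHINE at all. That is an assumption worth verifying against your server's actual limit; the client-side config would be a sketch like:

```
# ~/.ssh/config on SERV1/SERV2 (sketch; the "logmachine" alias is assumed)
Host logmachine
    User user
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h-%p
    ControlPersist 10m
```

Note that sshd's MaxSessions (default 10) still caps how many channels one multiplexed connection can carry, so this helps only if connections, not sessions, are the bottleneck.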

  • Can you provide the exact rsync command? As far as I understand, you are syncing LOGS1 from SERV1 to SERV2 (executed on SERV1), and LOGS2 from SERV2 to SERV1 (executed on SERV2)? Also: if you are using compression, try without it if the servers are in the same data center. Commented Jan 13, 2021 at 11:46
  • I'm sorry I misled you; I updated the post. Actually, I'm syncing typeLog1* files from LOGMACHINE to SERV1 and typeLog2* files from LOGMACHINE to SERV2. The servers are not in the same data center. I tried rsync without compression but it doesn't get any better. Commented Jan 13, 2021 at 12:06

3 Answers


Bundle the logs into a single tar.gz archive, copy that one file with rsync, and compare how long the transfer takes.


The way this command is built works against rsync's efficiency:

rsync -avz --rsync-path=/usr/local/bin/rsync --ignore-existing --delete \
    --files-from=<(ssh user@logmachine 'cd /home/user/Files/LOGS1/logFiles/; find . -mtime -1 -type f -name "*fileLog1*.xml.gz"') \
    user@logmachine:/home/user/Files/LOGS1/logFiles/ Files/LOGS1/logFiles/

I would recommend you experiment with this variation for the gz-compressed type-1 logs, modifying it appropriately on the second server that copies the type-2 logs:

rsync -av --rsync-path=/usr/local/bin/rsync --delete --prune-empty-dirs \
    --include '*.gz' --include '*/' --exclude '*' \
    user@logmachine:'Files/LOGS1/logFiles/typeLog1*/' Files/LOGS1/logFiles/

You may find that the latest file is copied only partially, and is then updated and replaced on the next run. That's because there is no way to tell that a file has been completely written on logmachine, unless it is written and compressed elsewhere and then moved into the target directory tree in one step. However, you already had this problem, only worse: because of --ignore-existing, a partially copied file would never have been updated at all.

If you don't need the logs on logmachine any more, carefully consider using --remove-source-files.


Don't forget that compressing twice eats your CPU with no gain in transfer speed:

  • rsync can compress data on the fly with its -z flag, but that only helps when transmitting compressible data (text of all kinds, some binary formats).
  • Data that is already compressed (gz archives and the like, MP3s, JPEGs, ...) is already size-optimized and will only heat up your CPU with no benefit to speed.
