I am hoping to clean up about 20TB on my NAS with rsync on Linux, excluding whole directories (and all their contents) whenever a directory contains a ".protect" file.

I generate really large caches in subfolders like

cache/simulation_v001/reallybigfiles_*.bgeo

cache/simulation_v002/reallybigfiles_*.bgeo

cache/simulation_v003/reallybigfiles_*.bgeo

and if a file like this existed: cache/simulation_v002/.protect

then I'd like to build an rsync operation that moves all folders to a temporary /recycle location, excluding cache/simulation_v002/ and all its contents.

I've done something like this before with python, but I'm curious to see if the operation can be simplified with rsync or another method.

  • rsync alone can't do this, but you could use find to construct an exclude file for rsync, e.g. starting with something like find . -name .protect -printf '%h/***\n' Commented Jul 31, 2019 at 3:32
  • This doesn't seem to work. The find command generates entries in the list like ./simulation_v002/***, but rsync then still ends up including files it shouldn't: rsync -a -m --remove-source-files --exclude-from='cache/exclude_list.txt' cache/ cache_trash. Is it possible for find to generate simulation_v002/*** instead? Commented Aug 2, 2019 at 13:31
  • Use sed or something similar to edit find's output before saving it to a file, e.g. sed -e 's=^\./=='. Don't expect one tool to do everything. It's normal to combine multiple small tools to achieve a desired result, each tool being good at its own job: find to get the list of files, sed to transform it into the required format, rsync to do the copy. Commented Aug 2, 2019 at 14:28
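The find, sed and rsync combination suggested in these comments can be sketched roughly as below. This is only an illustration under assumptions: the sandbox directory, the file names and the cache_trash target are made up for the demo, find's -printf requires GNU find, and the rsync step is left as a dry run so nothing is removed.

```shell
#!/bin/bash
# Sketch: build an rsync exclude list from directories holding a .protect file.
# The sandbox layout below is hypothetical and exists only for this demo.
set -euo pipefail

sandbox=$(mktemp -d)
mkdir -p "$sandbox/cache/simulation_v001" \
         "$sandbox/cache/simulation_v002" \
         "$sandbox/cache/simulation_v003"
touch "$sandbox/cache/simulation_v001/big_0001.bgeo" \
      "$sandbox/cache/simulation_v002/big_0001.bgeo" \
      "$sandbox/cache/simulation_v002/.protect" \
      "$sandbox/cache/simulation_v003/big_0001.bgeo"

exclude_list="$sandbox/exclude_list.txt"

# find prints "./simulation_v002/***"; sed strips the leading "./" so the
# pattern is relative to the rsync source directory (GNU find assumed).
(cd "$sandbox/cache" &&
    find . -name .protect -printf '%h/***\n' | sed -e 's=^\./==') > "$exclude_list"

cat "$exclude_list"

# Preview the transfer; drop --dry-run only after checking the list.
if command -v rsync >/dev/null 2>&1; then
    rsync -a -m -v --dry-run --remove-source-files \
        --exclude-from="$exclude_list" \
        "$sandbox/cache/" "$sandbox/cache_trash/"
fi
```

Keeping the exclude list outside the source tree, as here, avoids having to exclude the list file itself from the transfer.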

1 Answer

Thanks to tips from cas, I was able to create this workflow to solve the problem with a bash script. It's not ideal, because a true move would be faster (I wish rsync had this ability). The script uses find to search below the current folder for .protect files, creates an exclusion list, then runs rsync from the base volume to move all other folders to a trash folder, retaining the full path underneath so any mistakes can be restored non-destructively.

Link to the current state of this solution in the git dev branch: https://github.com/firehawkvfx/openfirehawk-houdini-tools/blob/dev/scripts/modules/trashcan.sh

    #!/bin/bash

    # trash everything below the current path that does not have a .protect file in
    # the folder.  it should normally only be run from the folder such as
    # 'job/seq/shot/cache' to trash all data below this path.

    # see opmenu and firehawk_submit.py for tools to add protect files based on
    # a top net tree for any given hip file.

    argument="$1"

    echo ""
    ARGS=''

    if [[ -z $argument ]] ; then
      echo "DRY RUN. To move files to trash, use argument -m after reviewing the exclude_list.txt and you are sure it lists everything you wish to protect from being moved to the trash."
      echo ""
      ARGS1='--remove-source-files'
      ARGS2='--dry-run'
    else
      case $argument in
        -m|--move)
          echo "MOVING FILES TO TRASH."
          echo ""
          ARGS1='--remove-source-files'
          ARGS2=''
          ;;
        *)
          # report the problem and stop, whether the script is sourced or executed.
          echo "Unknown argument: ${argument}" >&2
          return 1 2>/dev/null || exit 1
          ;;
      esac
    fi

    current_dir=$(pwd)
    echo "current dir $current_dir"
    base_dir=$(pwd | cut -d/ -f1-2)
    echo "base_dir $base_dir"

    source=$(realpath --relative-to=$base_dir $current_dir)/
    echo "source $source"
    target=trash/
    echo "target $target"

    # ensure trash exists at base dir.
    mkdir -p $base_dir/$target
    echo ""
    echo "Build exclude_list.txt contents with directories containing .protect files"
    find . -name .protect -print0 |
        while IFS= read -r -d '' line; do
            path=$(realpath --relative-to=. "$line")
            dirname $path
        done > exclude_list.txt

    # exclude the list itself so it is not moved to the trash.
    path_to_list=$(realpath --relative-to=. exclude_list.txt)
    echo $path_to_list >> exclude_list.txt

    cat exclude_list.txt

    cd $base_dir

    # run this command from the drive root, eg /prod.
    rsync -a $ARGS1 --prune-empty-dirs --inplace --relative --exclude-from="$current_dir/exclude_list.txt" --include='*' --include='*/' $source $target $ARGS2 -v
    cd $current_dir
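On the wish for a true move: when the trash folder is on the same filesystem as the caches, mv is a rename rather than a copy, so a plain loop can cover the simple case from the question. The following is a minimal sketch, not the script above. It assumes .protect files sit directly inside the top-level version folders (nested .protect files are not handled), and the sandbox layout and names are hypothetical.

```shell
#!/bin/bash
# Sketch: move unprotected top-level cache folders with mv instead of rsync.
# Assumes .protect sits directly inside each version folder and that the
# trash folder is on the same filesystem, so mv is a rename, not a copy.
set -euo pipefail

sandbox=$(mktemp -d)    # hypothetical demo layout
mkdir -p "$sandbox/cache/simulation_v001" \
         "$sandbox/cache/simulation_v002" \
         "$sandbox/trash/cache"
touch "$sandbox/cache/simulation_v001/big_0001.bgeo" \
      "$sandbox/cache/simulation_v002/big_0001.bgeo" \
      "$sandbox/cache/simulation_v002/.protect"

cd "$sandbox/cache"
for dir in */; do
    dir=${dir%/}
    [[ -d $dir ]] || continue    # the glob matched nothing
    if [[ -e $dir/.protect ]]; then
        echo "keeping $dir"
    else
        echo "trashing $dir"
        # the trash keeps the cache/ prefix so a folder can be moved back later
        mv -- "$dir" "$sandbox/trash/cache/"
    fi
done
```

Unlike the rsync approach, mv does not merge into an existing non-empty folder of the same name in the trash, so repeated runs would need unique target names.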
  • +1. You should use double quotes around your variables except where you actually want the shell's whitespace argument expansion to occur. And, since you're using bash, you should probably use an array for the rsync args rather than $ARGS1, $ARGS2. Commented Aug 7, 2019 at 3:31
  • Thanks cas, can you explain what you mean by 'except where you actually want the shell's white space argument expansion to occur'? Commented Aug 12, 2019 at 4:23
  • If you have a variable, say $ARGS, that contains word1 word2 word3, then wrapping it in double quotes when you use it (e.g. newvar="$ARGS") will prevent the shell from expanding it into 3 separate arguments. 99.999999% of the time, this is what you want. If you don't wrap it in double quotes, then the shell will expand it into three arguments. This is useful in cases where you want to use that variable to hold one or more arguments to another command (e.g. rsync $ARGS ...). And in most of those cases, you're better off using an array (e.g. rsync "${ARGS[@]}" ...). Commented Aug 13, 2019 at 8:43
  • More on "most of the time, this is what you want": unquoted variables tend to result in unfortunate (possibly disastrous) consequences if they contain unexpected spaces or other shell meta-characters. A (contrived) example: say DIR='foo bar', and you have a directory containing sub-directories foo, bar, and foo bar. rm -rf $DIR will delete directories foo and bar. rm -rf "$DIR" will delete directory 'foo bar'. Spaces are important. So are other meta-chars like & or ;. Proper quoting is extremely important. Commented Aug 13, 2019 at 8:49
  • There are lots of good questions and answers on this site about quoting variables (search for "quote variable"), but this explains it extremely well: unix.stackexchange.com/questions/131766/… Commented Aug 13, 2019 at 8:56
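The quoting and array advice in these comments can be illustrated with a short sketch. The names rsync_args, dry_run, exclude_file and count_args are made up for this example, and the rsync call itself is left as a comment so that nothing is transferred.

```shell
#!/bin/bash
# Sketch: collect rsync options in an array instead of $ARGS1 / $ARGS2.
set -euo pipefail

dry_run=true    # hypothetical switch; set to false to really move files
rsync_args=(-a -v --prune-empty-dirs --remove-source-files)
if [[ $dry_run == true ]]; then
    rsync_args+=(--dry-run)
fi

# Each array element stays one argument, even if it contains spaces.
exclude_file='exclude list.txt'    # hypothetical name with a space in it
rsync_args+=(--exclude-from="$exclude_file")

printf 'arg: %s\n' "${rsync_args[@]}"
# The real call would then look like:
#   rsync "${rsync_args[@]}" "$source" "$target"

# Word splitting in action: unquoted expands to two words, quoted to one.
DIR='foo bar'
count_args() { echo "$#"; }
unquoted=$(count_args $DIR)
quoted=$(count_args "$DIR")
echo "unquoted: $unquoted, quoted: $quoted"
```

With an array there is no empty-string placeholder such as ARGS2='' to worry about: an option that is not wanted is simply never appended.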
