
I would like to find the differences in file sizes (in bytes) between two directory trees. However, du with -a (whose output I pass to diff) also lists directories and subdirectories. I want only the files inside the directories and subdirectories, not the directories themselves.

I know about the --exclude option, but I don't know how to use it to do that. Thanks.

My OS is Debian Linux.

My command is:

dira=/mnt/hdd_a/; dirb=/mnt/hdd_b/; diff -u <(cd $dira && du -ab | sort -k2) <(cd $dirb && du -ab | sort -k2) 

Also, I cannot fully understand the output. Directories are marked with + or - for multiple reasons, I suppose, e.g. attributes. I don't care about that. However, among hundreds of files, diff prints some files without + or -. Why? Might they differ in some attribute other than size?

--- /dev/fd/63 2023-08-22 01:38:15.775099368 +0300
+++ /dev/fd/62 2023-08-22 01:38:15.775099368 +0300
@@ -1,6 +1,6 @@
-364123856483 .
+364123860579 .
 435823780 ./vid_01.mkv
-33781164566 ./news_a
+33781168662 ./news_b
 19110023 ./news_c/covers_09.rar
 161634304 ./news_d/video_d7.avi
 17080320 ./news_e/video_d17.avi

As I understand it, the -u option prints only 3 lines of context around each change. I want all the differing lines and only those, not the identical lines (file sizes).

Using diff --changed-group-format='%<' --unchanged-group-format='' <(cd "$dira" && du -ab | sort -k2) <(cd "$dirb" && du -ab | sort -k2)

prints some files with their sizes without any "+/-" indication, so I cannot tell whether a differing line comes from the source or from the destination. Note that some files are missing entirely from the destination.

1 Answer


du/diff command with -a also lists directories and subdirectories. I want only the files in subdirectories and directories, not these ones.

I know about the --exclude option, but I don't know how to manipulate it to do that. Thanks.

If I understand you correctly, you only want to see the sizes of all files in the directory tree, not the total size of the contents of any directories themselves. Unfortunately, the --exclude option of du doesn't appear to support using something like / to indicate directories, e.g. du --exclude='*/' will still output the sizes of directories.

Instead of using any options of du itself to filter out directories, you can use a command like find to get a list of files only (e.g. using its -type f option), and then pass this list to du. By default, find outputs each filename on its own line, and we can pipe this list of filenames to du with the aid of xargs. However, xargs by default splits its input on any whitespace character (space, tab, newline), so a filename containing whitespace would be broken into several arguments. To avoid that, we tell find to terminate each filename with a NUL character using -print0, and tell xargs to expect such input with -0:

find . -type f -print0 | xargs -0 du -b 
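As a rough sketch of how this could be plugged into the diff from the question: the helper name list_file_sizes is made up for illustration, and the -b and -r options assume GNU du and GNU xargs, as shipped with Debian. It prints one size-and-path line per regular file, sorted by path, so directory totals never reach diff.

```bash
#!/bin/bash
# list_file_sizes DIR (hypothetical helper name):
# print "<bytes><TAB><relative path>" for every regular file under DIR,
# sorted by path. Directories are never listed.
list_file_sizes() {
    # The subshell keeps the cd from affecting the caller.
    # xargs -r (GNU) skips running du when find produces no files.
    (cd "$1" && find . -type f -print0 | xargs -0 -r du -b | LC_ALL=C sort -k2)
}

# Usage with the directories from the question:
#   diff -u <(list_file_sizes "$dira") <(list_file_sizes "$dirb")
```

Forcing LC_ALL=C on sort is a choice made here so that both listings are ordered byte by byte in the same way regardless of locale.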

I would like to find difference in bytes in files. [...] diff prints some files without + or -. Why? They may differ in some other attribute except size?

To do this, you need to compare the sizes of each pair of files directly. The diff command does not do this. Rather, diff compares the contents of two files line by line, e.g. if file a.txt contains the following...

a
b
c

... and file b.txt contains the following...

a
b
d

then diff a.txt b.txt outputs this:

3c3
< c
---
> d

This tells you that the difference between the two files is this: on line 3 of a.txt, the line c was removed (<) and the line d was added (>).

Using diff with the -u option causes it to format the output in the style of a "unified context" patch file, as is used by the patch command, and similar in style to patch files used by other tools, such as Git. That is, diff -u a.txt b.txt gets you this instead:

--- a.txt 2023-08-22 00:38:07.477617454 +0100
+++ b.txt 2023-08-22 00:38:12.533616240 +0100
@@ -1,3 +1,3 @@
 a
 b
-c
+d

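This should help you understand why you are seeing + and - in the output of the command you have run. Specifically, cd $dira && du -ab | sort -k2 outputs the sizes of the contents of $dira, sorted by item name, and thus diff -u <(...) <(...) takes two such outputs and shows you the differences between those outputs. Lines preceded by - appear only in the output for $dira, and lines preceded by + appear only in the output for $dirb. Since each line contains both a size and a name, an item that exists in only one tree produces a single - or + line, while an item whose size differs produces both a - line and a + line. Lines with no prefix are identical in both outputs; diff -u prints up to three of them around each change as context, which is why some files appear without + or -.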
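If only the differing lines are wanted, each marked with its origin, GNU diff's line-format options are one way to get that. The sketch below assumes GNU diff (the default on Debian), and the helper name changed_lines is made up for illustration. For comparison, the --changed-group-format='%<' attempt in the question prints only the first file's lines of each changed group, which is why no +/- markers appear there.

```bash
#!/bin/bash
# changed_lines OLD NEW (hypothetical helper name):
# print only the lines that differ, prefixed with "-" when the line is
# only in OLD and "+" when it is only in NEW; identical lines are dropped.
changed_lines() {
    diff --old-line-format='-%L' \
         --new-line-format='+%L' \
         --unchanged-line-format='' \
         "$1" "$2"
}

# Usage with the listings from the question:
#   changed_lines <(cd "$dira" && du -ab | sort -k2) <(cd "$dirb" && du -ab | sort -k2)
```

As with plain diff, the exit status is 1 when the inputs differ, so a script running under set -e would need to allow for that.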


The diff command does not do anything more intelligent, such as directly showing you the difference in file sizes between specific pairs of files in $dira and $dirb. For that, you need to somehow specify which pairs of files you'd like to compare the sizes of.

For example, if you want to compare the sizes of $dira/news_a and $dirb/news_b, then you should do so directly. If you only want to compare the sizes of pairs of files whose relative names are exactly the same in both trees, e.g. $dir_a/news_a and $dir_b/news_a, then this can be done programmatically, as in the following Bash script, which takes the two directories as its arguments and stores them in $dir_a and $dir_b:

#!/bin/bash
script_location="$( dirname "$(readlink -f "${BASH_SOURCE:-$0}")" )"
dir_a="$1"
dir_b="$2"
cd "$dir_a"
dir_a_filenames="$(find . -type f)"
cd "$script_location"
cd "$dir_b"
dir_b_filenames="$(find . -type f)"
# Combine filename lists
all_filenames="$( sort -u <(echo "$dir_a_filenames") <(echo "$dir_b_filenames") )"
# For each filename in $all_filenames, compare the size of that file in $dir_a with the same file in $dir_b
IFS=$'\n'
cd "$script_location"
for file in $(echo "$all_filenames"); do
    file_a="$dir_a/$file"
    file_b="$dir_b/$file"
    file_a_size="$(if [ -f "$file_a" ]; then stat --format='%s' "$file_a"; else echo 0; fi)"
    file_b_size="$(if [ -f "$file_b" ]; then stat --format='%s' "$file_b"; else echo 0; fi)"
    size_diff=$(($file_b_size - $file_a_size))
    echo -e "$file\tA size = $file_a_size\tB size = $file_b_size\tSize difference = $size_diff"
done

The IFS shell variable defines which characters Bash uses to split the result of an unquoted expansion into separate words, such as the $(...) that feeds the for loop. Here, we set it to the newline character, $'\n', so that filenames containing spaces are not split apart, for a similar reason as we used NUL delimiters with xargs earlier.

We use stat instead of du to get the file sizes, since it is a bit quicker, and we treat the sizes of non-existent files as being zero for the purposes of reporting their size and calculating the size differences; the command [ -f filename ] is used to check whether filename exists and is a regular file.

The Bash syntax $((...)) is used to perform integer arithmetic, e.g. $((2+3)) expands to 5; here we are just subtracting one file size from the other.
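Because a missing file is counted as size zero, it cannot be told apart from an empty file that exists on one side only. A possible variant that reports missing files explicitly and passes names around NUL-delimited is sketched below. It is an alternative illustration, not the script above: the function name compare_sizes and the "missing in A/B" wording are made up, and sort -z and stat --format assume GNU coreutils.

```bash
#!/bin/bash
# compare_sizes DIR_A DIR_B (hypothetical helper name):
# for every regular file found in either tree, print its size in both trees
# and the difference, or say on which side it is missing.
compare_sizes() {
    local dir_a=$1 dir_b=$2 file size_a size_b
    # Collect relative paths from both trees, NUL-delimited, and de-duplicate.
    {
        (cd "$dir_a" && find . -type f -print0)
        (cd "$dir_b" && find . -type f -print0)
    } | LC_ALL=C sort -zu |
    while IFS= read -r -d '' file; do
        if [ ! -f "$dir_a/$file" ]; then
            printf '%s\tmissing in A\n' "$file"
        elif [ ! -f "$dir_b/$file" ]; then
            printf '%s\tmissing in B\n' "$file"
        else
            size_a=$(stat --format='%s' "$dir_a/$file")
            size_b=$(stat --format='%s' "$dir_b/$file")
            printf '%s\tA size = %s\tB size = %s\tSize difference = %s\n' \
                "$file" "$size_a" "$size_b" "$((size_b - size_a))"
        fi
    done
}

# Usage: compare_sizes /mnt/hdd_a /mnt/hdd_b
```

Reading names with read -d '' avoids relying on IFS word splitting, so names containing spaces, newlines, or glob characters are passed through unchanged.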

  • There are many valid points in this answer, but the shell code near the end is very poor. You're storing the output of finds in "scalar" (non-array) variables, then you echo them while giving to sort, then you echo the result to build the for loop. These echos are unnecessary, transparent at best, otherwise possibly harmful. Finally you're relying on word splitting from the unquoted $() which is the Bash pitfall number one, just obfuscated with variables and echos. Commented Aug 22, 2023 at 4:19
  • It works, thanks! I removed cd "$script_location". I created an empty file in the destination, e.g. a.txt (with every line deleted), but not in the source. The script reports ./a.txt A size = 0 B size = 0 Size difference = 0. How can I report which files are missing from one of the directories? Commented Aug 22, 2023 at 7:41
  • After size_diff=$(($file_b_size - $file_a_size)) I added: zerov=0; if [ $size_diff -ne $zerov ]; then echo -e "$file, $file_a_size, $size_diff, not equal"; else echo -e "$file, $file_a_size, $size_diff, equal"; fi so that it is easy to grep the results for equal or not equal. Commented Aug 22, 2023 at 9:09
  • @Kamil, on Debian, I have filenames with spaces, trailing spaces, newlines, special characters, Chinese characters, underscores etc. The above script works. Maybe under other circumstances you are right. Commented Aug 22, 2023 at 9:19
  • @KamilMaciorowski I know enough Bash to get complex stuff done, and haven't had a need to use arrays in my 13 years of using Linux. Frankly, I also just really don't like the array syntax. Combine that with exposure to (and preference of) Zsh, which does arrays completely differently, and I prefer to just steer clear of arrays in shell contexts entirely. In general I get what you're saying about redundant echoes, but I don't think there are any in the code I wrote above, e.g. if $all_filenames were an array, then you could do for f in $all_filenames; but it isn't. Commented Aug 22, 2023 at 19:16
