Update dtreetrawl usage
six-k
Usage:
  dtreetrawl [OPTION...] "/trawl/me" [path2,...]

Help Options:
  -h, --help                Show help options

Application Options:
  -t, --terse               Produce a terse output; parsable.
  -d, --delim=:             Character or string delimiter/separator for terse output(default ':')
  -l, --max-level=N         Do not traverse tree beyond N level(s)
  --hash                    Hash the files to produce checksums(default is MD5).
  -c, --checksum=md5        Valid hashing algorithms: md5, sha1, sha256, sha512.
  -s, --hash-symlink        Include symbolic links' referent name while calculating the root checksum
  -R, --only-root-hash      Output only the root hash. Blank line if --hash is not set
  -N, --no-name-hash        Exclude path name while calculating the root checksum
  -F, --no-content-hash     Do not hash the contents of the file
Added the tool's usage and sample output.

Here's a tool (disclaimer: I'm a contributor to it), dtreetrawl. It is very light on memory and addresses most of these cases; it might be a bit rough around the edges but has been quite helpful.

Usage:
  dtreetrawl [OPTION...] "/trawl/me" [path2,...]

Help Options:
  -h, --help                Show help options

Application Options:
  -t, --terse               Produce a terse output; parsable.
  -d, --delim=:             Character or string delimiter/separator for terse output(default ':')
  --hash                    Hash the files to produce checksums(default is MD5).
  -c, --checksum=md5        Valid hashing algorithms: md5, sha1, sha256, sha512.
  -s, --hash-symlink        Include symbolic links' referent name while calculating the root checksum
  -R, --only-root-hash      Output only the root hash. NULL if --hash is not set

An example human friendly output:

...
... //clipped
...
/home/lab/linux-4.14-rc8/CREDITS
        Base name                    : CREDITS
        Level                        : 1
        Type                         : regular file
        Referent name                :
        File size                    : 98443 bytes
        I-node number                : 290850
        No. directory entries        : 0
        Permission (octal)           : 0644
        Link count                   : 1
        Ownership                    : UID=0, GID=0
        Preferred I/O block size     : 4096 bytes
        Blocks allocated             : 200
        Last status change           : Tue, 21 Nov 17 21:28:18 +0530
        Last file access             : Thu, 28 Dec 17 00:53:27 +0530
        Last file modification       : Tue, 21 Nov 17 21:28:18 +0530
        Hash                         : 9f0312d130016d103aa5fc9d16a2437e

Stats for /home/lab/linux-4.14-rc8:
        Elapsed time     : 1.305767 s
        Start time       : Sun, 07 Jan 18 03:42:39 +0530
        Root hash        : 434e93111ad6f9335bb4954bc8f4eca4
        Hash type        : md5
        Depth            : 8
        Total,
                size           : 66850916 bytes
                entries        : 12484
                directories    : 763
                regular files  : 11715
                symlinks       : 6
                block devices  : 0
                char devices   : 0
                sockets        : 0
                FIFOs/pipes    : 0

A robust and clean approach

  • First things first, don't hog the available memory! Hash a file in chunks rather than feeding the entire file to the hash function at once.
  • Different approaches for different needs/purposes (all of the below, or pick whatever applies):
    • Hash only the entry name of all entries in the directory tree
    • Hash the file contents of all entries (leaving out metadata such as inode number, ctime, atime, mtime, size, etc.; you get the idea)
    • For a symbolic link, its content is the referent name. Hash it or choose to skip
    • Decide whether to follow the symlink (to its resolved name) when hashing the contents of the entry
    • If it's a directory, its contents are just directory entries. A recursive traversal will hash them eventually, but should the entry names at that level also be hashed to tag the directory itself? This helps in use cases where the hash must reveal a change quickly without traversing deeply to hash the contents. An example would be a file's name changing while the rest of the contents remain the same and the files are all fairly large
    • Handle large files well (again, mind the RAM)
    • Handle very deep directory trees (mind the open file descriptors)
    • Handle non-standard file names
    • How to proceed with files that are sockets, pipes/FIFOs, block devices, or char devices? Must they be hashed as well?
    • Don't update the access time of any entry while traversing, because that is a side effect and counter-productive (counter-intuitive?) for certain use cases.
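
Several of the points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how dtreetrawl itself works: the names `hash_file`, `root_hash`, and `CHUNK` are hypothetical, the 1 MiB chunk size is arbitrary, and special files (sockets, FIFOs, devices) are assumed to contribute only their names. `O_NOATIME` is Linux-specific, so the sketch falls back to a plain open where it is missing or refused.

```python
import hashlib
import os
import stat

CHUNK = 1 << 20  # hypothetical chunk size: read 1 MiB at a time


def hash_file(path, algo="md5"):
    """Return the hex digest of a regular file, read in fixed-size chunks."""
    h = hashlib.new(algo)
    base = os.O_RDONLY | getattr(os, "O_BINARY", 0)
    try:
        # O_NOATIME (Linux only) avoids touching the access time.
        fd = os.open(path, base | getattr(os, "O_NOATIME", 0))
    except PermissionError:
        # The kernel refuses O_NOATIME for files we do not own; retry without.
        fd = os.open(path, base)
    with os.fdopen(fd, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            h.update(chunk)
    return h.hexdigest()


def root_hash(root, algo="md5", hash_names=True, hash_content=True,
              hash_symlink=True):
    """Fold entry names, file contents and symlink referents into one digest."""
    h = hashlib.new(algo)
    stack = [root]  # explicit stack: no recursion, one directory open at a time
    while stack:
        directory = stack.pop()
        with os.scandir(directory) as it:
            # Sort by raw bytes so the result does not depend on readdir order.
            entries = sorted(it, key=lambda e: os.fsencode(e.name))
        for entry in entries:
            if hash_names:
                rel = os.path.relpath(entry.path, root)
                # fsencode copes with names that are not valid UTF-8.
                h.update(os.fsencode(rel) + b"\0")
            mode = entry.stat(follow_symlinks=False).st_mode
            if stat.S_ISLNK(mode):
                if hash_symlink:
                    # A symlink's "content" is its referent name; never follow it.
                    h.update(os.fsencode(os.readlink(entry.path)))
            elif stat.S_ISDIR(mode):
                stack.append(entry.path)
            elif stat.S_ISREG(mode) and hash_content:
                h.update(bytes.fromhex(hash_file(entry.path, algo)))
            # Sockets, FIFOs and device nodes: name only (an assumption).
    return h.hexdigest()
```

Calling `root_hash("/trawl/me", algo="sha256", hash_names=False)` would, under these assumptions, give a digest that ignores entry names, which is the trade-off described in the directory bullet above.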

This is what I have off the top of my head; anyone who has spent some time working on this in practice will have caught other gotchas and corner cases.

Here's a tool, dtreetrawl, that is very light on memory and addresses most of these cases; it might be a bit rough around the edges but has been quite helpful.