# A robust and clean approach #

- First things first, **don't hog the available memory**! Hash a file in chunks rather than feeding in the entire file at once.
- Different approaches for different needs/purposes (all of the below, or pick whatever applies):
    - Hash only the entry names of all entries in the directory tree
    - Hash the file contents of all entries (leaving out the metadata such as inode number, ctime, atime, mtime, size, etc.; you get the idea)
    - For a symbolic link, its content is the referent name. Hash it or choose to skip it
    - Follow or don't follow (resolved name) the symlink while hashing the contents of the entry
    - If it's a directory, its contents are just directory entries. While traversing recursively they will be hashed eventually, but should the directory entry names of that level be hashed to tag this directory? This is helpful in use cases where the hash is needed to identify a change quickly without having to traverse deeply to hash the contents. An example would be when a file's name changes but the rest of the contents remain the same and they are all fairly large files
- Handle large files well (again, mind the RAM)
- Handle very deep directory trees (mind the open file descriptors)
- Handle non-standard file names
- How to proceed with files that are sockets, pipes/FIFOs, block devices, char devices? Must they be hashed as well?
- Don't update the access time of any entry while traversing, because this is a side effect and counter-productive (counter-intuitive?) for certain use cases.

This is what I have off the top of my head; anyone who has spent some time working on this in practice will have caught other gotchas and corner cases.

Here's a tool (disclaimer: I'm a contributor to it), [dtreetrawl](https://github.com/six-k/dtreetrawl), which is very light on memory and addresses most of these cases. It might be a bit rough around the edges, but it has been quite helpful.

>     Usage:
>       dtreetrawl [OPTION...] "/trawl/me" [path2,...]
>
>     Help Options:
>       -h, --help                Show help options
>
>     Application Options:
>       -t, --terse               Produce a terse output; parsable.
>       -d, --delim=:             Character or string delimiter/separator for terse output(default ':')
>       --hash                    Hash the files to produce checksums(default is MD5).
>       -c, --checksum=md5        Valid hashing algorithms: md5, sha1, sha256, sha512.
>       -s, --hash-symlink        Include symbolic links' referent name while calculating the root checksum
>       -R, --only-root-hash      Output only the root hash. NULL if --hash is not set

*An example human friendly output:*

>     ...
>     ... //clipped
>     ...
>     /home/lab/linux-4.14-rc8/CREDITS
>             Base name                    : CREDITS
>             Level                        : 1
>             Type                         : regular file
>             Referent name                :
>             File size                    : 98443 bytes
>             I-node number                : 290850
>             No. directory entries        : 0
>             Permission (octal)           : 0644
>             Link count                   : 1
>             Ownership                    : UID=0, GID=0
>             Preferred I/O block size     : 4096 bytes
>             Blocks allocated             : 200
>             Last status change           : Tue, 21 Nov 17 21:28:18 +0530
>             Last file access             : Thu, 28 Dec 17 00:53:27 +0530
>             Last file modification       : Tue, 21 Nov 17 21:28:18 +0530
>             Hash                         : 9f0312d130016d103aa5fc9d16a2437e
>
>     Stats for /home/lab/linux-4.14-rc8:
>             Elapsed time     : 1.305767 s
>             Start time       : Sun, 07 Jan 18 03:42:39 +0530
>             Root hash        : 434e93111ad6f9335bb4954bc8f4eca4
>             Hash type        : md5
>             Depth            : 8
>             Total,
>                     size           : 66850916 bytes
>                     entries        : 12484
>                     directories    : 763
>                     regular files  : 11715
>                     symlinks       : 6
>                     block devices  : 0
>                     char devices   : 0
>                     sockets        : 0
>                     FIFOs/pipes    : 0
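
If you'd rather roll your own, a few of the points from the list (chunked reads, hashing entry names, not following symlinks, skipping the contents of special files) can be sketched in Python roughly as below. This is only an illustration under my own assumptions, not how dtreetrawl computes its root hash; `hash_file`, `hash_tree` and `CHUNK_SIZE` are names made up for the example, and the sketch makes no attempt to preserve access times.

```python
import hashlib
import os
import stat

CHUNK_SIZE = 1 << 20  # 1 MiB per read; an arbitrary choice for this sketch


def hash_file(path, algo="md5"):
    """Hash a regular file's contents in fixed-size chunks."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        # Read until f.read() returns b"", so memory use stays bounded.
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
            h.update(chunk)
    return h.hexdigest()


def hash_tree(root, algo="md5", hash_symlinks=False):
    """Return one hex digest for the directory tree under root.

    Entry names and regular-file contents feed the digest; metadata such as
    inode number or timestamps does not. Symlinks are never followed.
    Sockets, FIFOs and devices contribute their names only.
    """
    root_hash = hashlib.new(algo)
    # os.walk does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sort in place so the traversal order is repeatable
        rel_dir = os.path.relpath(dirpath, root)
        root_hash.update(os.fsencode(rel_dir) + b"\0")
        for name in sorted(dirnames + filenames):
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: inspect the link itself, not its target
            # os.fsencode copes with names that are not valid UTF-8.
            root_hash.update(os.fsencode(name) + b"\0")
            if stat.S_ISREG(st.st_mode):
                root_hash.update(hash_file(path, algo).encode("ascii"))
            elif stat.S_ISLNK(st.st_mode) and hash_symlinks:
                # A symlink's "content" is its referent name.
                root_hash.update(os.fsencode(os.readlink(path)) + b"\0")
    return root_hash.hexdigest()
```

Something like `print(hash_tree("/trawl/me", algo="sha256"))` would then give a single digest for the tree. Sorting the names matters because directory listing order is not guaranteed, and without it the same tree could produce different hashes on different runs or filesystems.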