Since I'm not a fan of Perl, here's a bash version:
    #!/bin/bash
    DIR="/path/to/big/files"

    # Checksum every regular file; sorting puts identical checksums on adjacent lines.
    find "$DIR" -type f -exec md5sum {} + | sort > /tmp/sums-sorted.txt

    OLDSUM=""
    OLDFILE=""
    while IFS= read -r line; do
        NEWSUM=${line%% *}               # checksum: everything before the first space
        NEWFILE=${line:${#NEWSUM}+2}     # file name: skip the checksum and the 2-character separator
        if [ "$OLDSUM" = "$NEWSUM" ]; then
            # Dry run: this only prints the command. Remove "echo" to actually create the hardlinks.
            echo ln -f "$OLDFILE" "$NEWFILE"
        else
            OLDSUM="$NEWSUM"
            OLDFILE="$NEWFILE"
        fi
    done < /tmp/sums-sorted.txt

This finds all files with the same checksum (whether they're big, small, or already hardlinks), and hardlinks them together.
This can be greatly optimized for repeated runs with additional find flags (e.g. -size) and a file cache (so you don't have to redo the checksums each time). If anyone's interested in the smarter, longer version, I can post it.
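A rough sketch of how those two optimizations could look (an illustration only, not the longer version mentioned above). It assumes GNU find, stat and md5sum, and file names without newlines. The names CACHE, cached_md5, same_size_files and checksum_candidates are made up for this example. Only files that share their size with another file get checksummed, and a checksum is reused as long as the file's size and modification time are unchanged.

```shell
#!/bin/bash
# Hypothetical cache location; override by setting CACHE before calling the functions.
CACHE="${CACHE:-/tmp/sums-cache.txt}"

# Print the MD5 of "$1", reusing the cached value if size and mtime are unchanged.
# Cache lines look like: "<md5> <size> <mtime> <path>".
cached_md5() {
    local file=$1 key sum
    touch "$CACHE"
    key="$(stat -c '%s %Y' -- "$file") $file"
    # The key starts at column 34 (32 hex digits plus one space).
    sum=$(KEY="$key" awk 'substr($0, 34) == ENVIRON["KEY"] { print $1; exit }' "$CACHE")
    if [ -z "$sum" ]; then
        sum=$(md5sum < "$file")
        sum=${sum%% *}
        printf '%s %s\n' "$sum" "$key" >> "$CACHE"
    fi
    printf '%s\n' "$sum"
}

# Print the paths under "$1" whose size is shared with at least one other file.
# A file with a unique size cannot have a duplicate, so it is never checksummed.
same_size_files() {
    find "$1" -type f -printf '%s\t%p\n' | sort -n |
    awk -F'\t' '
        BEGIN { prev = -1 }
        {
            path = substr($0, length($1) + 2)
            if ($1 == prev) {
                if (held != "") { print held; held = "" }
                print path
            } else {
                prev = $1
                held = path
            }
        }'
}

# Print sorted "<md5>  <path>" lines for the candidate files under "$1".
checksum_candidates() {
    same_size_files "$1" | while IFS= read -r f; do
        printf '%s  %s\n' "$(cached_md5 "$f")" "$f"
    done | sort
}
```

Running `checksum_candidates "$DIR" > /tmp/sums-sorted.txt` would then stand in for the `find ... | sort` line of the script. One caveat: the modification time has one-second resolution here, so a same-size rewrite within the same second would go unnoticed by the cache.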
NOTE: As has been mentioned before, hardlinks only work as long as the files never need to be modified or moved to another filesystem.
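To make the modification caveat concrete, here is a small throwaway demonstration (file names are made up, and everything happens in a temporary directory). It shows that an in-place write through one name is visible through the other, and that replacing a file by rename silently breaks the link.

```shell
#!/bin/bash
# Illustration only: works in a temporary directory that is removed on exit.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

printf 'same data\n' > "$tmp/a"
printf 'same data\n' > "$tmp/b"

ln -f "$tmp/a" "$tmp/b"        # b is now just another name for a's data

# Writing in place through one name changes what the other name sees.
printf 'edited\n' >> "$tmp/b"
cat "$tmp/a"                   # prints "same data" followed by "edited"

# -ef is true when both names refer to the same file (same device and inode).
[ "$tmp/a" -ef "$tmp/b" ] && echo "still linked"

# Programs that save by writing a new file and renaming it over the old name
# break the link: b becomes a separate file again.
printf 'replaced\n' > "$tmp/b.new" && mv "$tmp/b.new" "$tmp/b"
[ "$tmp/a" -ef "$tmp/b" ] || echo "no longer linked"
```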