Is it possible to show the similarity index of two files in a Git repository using git diff? According to the man pages, git diff -p may produce patches with this information in certain cases, but the output of the following command, for example, does not contain it:

git diff -p --no-index a b 

Here a and b are two files known to the repository. Can Git be made to calculate and report the similarity index between two existing files in a repository?

1 Answer

Unfortunately, no, or more precisely, not with any existing front-end command. The only way to get Git to compute a similarity index for two files is to give it two tree objects between which, as far as Git can tell, the file might have been renamed.

We can, however, do just that. Here's the method:

  1. Pick a temporary index file name and point GIT_INDEX_FILE at it.
  2. Add the first file to the temporary index and write out a tree, saving its hash ID.
  3. Remove the first file and add the second file; write out a new tree as before.
  4. Diff the two trees with --find-renames=01.

(Using a rename threshold of 00 does not work: this just disables rename-detection.)
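
The four steps can also be tried by hand. The following is a minimal sketch, assuming a POSIX shell with git and mktemp available and that it is run inside a repository; the function name similarity_by_hand is hypothetical. It adds -r to git diff-tree so that files in subdirectories are handled too, and it deletes the empty file mktemp creates so that Git can write a fresh index there.

```shell
#!/bin/sh
# Hypothetical helper: prints what git diff-tree reports for two files,
# e.g. a line starting with Rnnn when Git considers them a rename pair.
# A throwaway index is used, so the repository's real index is untouched.
similarity_by_hand() {
    # Step 1: pick a temporary index file name. mktemp leaves an empty
    # file behind, which is not a valid index, so remove it again.
    tmp_index=$(mktemp) || return 1
    rm -f "$tmp_index"

    # Step 2: add the first file and write out a tree.
    GIT_INDEX_FILE=$tmp_index git add -- "$1" || return 1
    tree1=$(GIT_INDEX_FILE=$tmp_index git write-tree) || return 1

    # Step 3: start over with only the second file and write a new tree.
    rm -f "$tmp_index"
    GIT_INDEX_FILE=$tmp_index git add -- "$2" || return 1
    tree2=$(GIT_INDEX_FILE=$tmp_index git write-tree) || return 1
    rm -f "$tmp_index"

    # Step 4: diff the two trees with the lowest usable rename threshold.
    git diff-tree -r --name-status --find-renames=01 "$tree1" "$tree2"
}
```

Calling similarity_by_hand a b prints a single R line with the three-digit similarity when Git pairs the files, or separate D and A lines when it does not.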

I wrapped this up into a script that is here and also appears below. Place the script, named git-similarity, somewhere in your $PATH (I use $HOME/scripts/ as a directory containing executable scripts that run on any architecture) and you can run git similarity a b.

(This is lightly tested.)


    #! /bin/sh
    #
    # git-similarity: script to compute similarity of two files

    . git-sh-setup # for die() etc

    TAB=$'\t'

    # should probably use OPTIONS_SPEC, but not yet
    usage()
    {
        echo "usage: git similarity file1 file2"
    }

    case $# in
    2) ;;
    *) usage 1>&2; exit 1;;
    esac

    test -f "$1" || die "cannot find file $1, or not a regular file"
    test -f "$2" || die "cannot find file $2, or not a regular file"
    test "x$1" != "x$2" || die "file names $1 and $2 are identical"

    TF=$(mktemp) || exit 1
    trap "rm -f $TF" 0 1 2 3 15
    export GIT_INDEX_FILE=$TF

    # create a tree holding (just) the argument file
    maketree()
    {
        rm -f $TF
        git add "$1" || exit 1
        git write-tree || exit 1
    }

    # Use git diff-tree here for repeatability.  We expect output of
    # the form Rnnn$TAB$file1$TAB$file2, but if we get two lines,
    # with D and A, we'll just print 000 here.
    # The -r is needed when the files live in a subdirectory; "set --"
    # keeps an empty result from dumping all shell variables.
    print_similarity()
    {
        set -- $(git diff-tree -r --name-status --find-renames=01 $1 $2)
        case "$1" in
        R*) echo "${1#R}";;
        *) echo "000";;
        esac
    }

    h1=$(maketree "$1")
    h2=$(maketree "$2")
    print_similarity $h1 $h2
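
For git similarity to work, Git has to find an executable called git-similarity on $PATH, since that is how it resolves subcommands it does not know itself. A minimal install sketch, assuming a POSIX shell; the helper name install_git_subcommand is hypothetical, and $HOME/scripts is only the default directory mentioned above:

```shell
#!/bin/sh
# Hypothetical helper: install a script so that `git <name>` finds it.
# Usage: install_git_subcommand path/to/script name [bindir]
install_git_subcommand() {
    src=$1
    name=$2
    bindir=${3:-$HOME/scripts}

    mkdir -p "$bindir" || return 1
    # Git looks for an executable named git-<name> on PATH.
    cp "$src" "$bindir/git-$name" || return 1
    chmod +x "$bindir/git-$name" || return 1

    # Put bindir on PATH for the current shell if it is not there yet.
    case ":$PATH:" in
    *":$bindir:"*) ;;
    *) PATH=$bindir:$PATH; export PATH ;;
    esac
}
```

With the script saved as ./git-similarity, running install_git_subcommand ./git-similarity similarity and then git similarity a b should invoke it. The PATH change only lasts for the current shell, so a permanent setup would still need the directory added in a shell startup file.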

6 Comments

Thank you. I am wondering now though whether it is not simpler to just write a script that calculates the ratio between the number of lines added/removed and the overall/maximum(?) number of lines of the input files (which seems to be roughly what Git uses to calculate a similarity index).
The similarity index isn't line-based though: it uses the internal delta compression code (in a different way than the actual delta compression code) after stripping out any carriage returns. It's kind of an odd calculation.
Ok, I'm convinced :) I'll use your script, thanks!
I had to add -r to the git diff-tree command for git-similarity to work when called from a sub directory in a repo.
Ah, right, I only tested it with simple file names. It would perhaps be better to whack on maketree() to allow files outside the Git repository entirely, but that's more complicated. I'll update my script on github, meanwhile.