
I am currently trying to figure out how git diff -M<limit> works.

What I found out is that git diff checks how similar two files are (say fileA in revision 1 and fileC in revision 2) by calculating a similarity score. If the similarity score is >= limit, fileA is considered to have been renamed to fileC, possibly with modifications (if the score is < 100%).
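The threshold idea can be sketched in a few lines of Python. This is only a toy model: the line-based `difflib` ratio used here is an assumption for illustration and is not the metric Git computes internally, and the function names `similarity` and `looks_like_rename` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(old_text: str, new_text: str) -> float:
    """Toy line-based similarity in [0.0, 1.0]; NOT Git's real metric."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    return SequenceMatcher(None, old_lines, new_lines, autojunk=False).ratio()


def looks_like_rename(old_text: str, new_text: str, limit: float = 0.5) -> bool:
    """Treat the pair as a rename candidate if the score reaches the limit.

    The default of 0.5 mirrors the 50% default documented for -M.
    """
    return similarity(old_text, new_text) >= limit


old = "a\nb\nc\nd\ne\nf\ng\n"   # fileA in revision 1
new = "b\nc\nd\ne\nf\ng\n"      # fileC in revision 2, first line removed
print(looks_like_rename(old, new, limit=0.8))
```

A score of 1.0 corresponds to an unmodified rename; anything between the limit and 1.0 corresponds to "renamed and changed".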

Then I asked myself, what if there are more files with the same sha1-hash within the directory? How does git know which one is the renamed (and changed) version?

To find this out, I tried the following:

First, I created two files with 7 lines ("a", "b", "c", "d", "e", "f", "g")

vi fileA
vi fileB

Then I added them to the repository and committed:

git add fileA fileB
git commit -m "Added fileA and fileB"
[master ffc8964] Added fileA and fileB
 2 files changed, 6 insertions(+)
 create mode 100644 tests/fileA
 create mode 100644 tests/fileB

Next, I renamed fileA to fileC using git mv and deleted the first line in both fileB and fileC. After that I committed the changes:

git mv fileA fileC
vi fileB
vi fileC
git commit -a -m "Renamed and changed files"
[master 57ff82a] Renamed and changed files
 2 files changed, 2 deletions(-)
 rename tests/{fileA => fileC} (85%)

fileB and fileC now look like this:

b
c
d
e
f
g

What I expected now is that the checksums of fileB and fileC are equal:

git hash-object fileB fileC
9fbb6235d2d7eb798268d4537acebea297321241
9fbb6235d2d7eb798268d4537acebea297321241

Indeed they are :-)
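The identical checksums follow from how blob IDs are computed: the hash depends only on the file content, not on its name or path. A minimal sketch, assuming a repository that uses the default SHA-1 object format (the helper name `git_blob_hash` is hypothetical):

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Compute the ID that `git hash-object` reports for a blob.

    Git hashes the header "blob <size>\\0" followed by the raw content.
    Assumes the SHA-1 object format.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


file_b = b"b\nc\nd\ne\nf\ng\n"
file_c = b"b\nc\nd\ne\nf\ng\n"
# Same bytes in, same ID out, regardless of file name.
print(git_blob_hash(file_b) == git_blob_hash(file_c))
```

Because the name is not part of the blob, two files with equal content share a single blob object in the repository.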

So how should git diff now know which file is the renamed one? Since fileC has been changed, a new blob has been generated by the commit, and the checksums of fileC and fileA differ as well (obviously).

I tried it:

git diff -M80% HEAD master~1 

The output however confused me :-(

diff --git a/tests/fileC b/tests/fileA
similarity index 85%
rename from tests/fileC
rename to tests/fileA
index 9fbb623..f9d9a01 100644
--- a/tests/fileC
+++ b/tests/fileA
@@ -1,3 +1,4 @@
+a
 b
 c
 d
diff --git a/tests/fileB b/tests/fileB
index 9fbb623..f9d9a01 100644
--- a/tests/fileB
+++ b/tests/fileB
@@ -1,3 +1,4 @@
+a
 b
 c
 d

Apparently git diff DID find out that fileA has been renamed to fileC.

But how? Did git save some kind of connection between fileA and fileC?

1 Answer


No such connection was saved. Renames can be detected when one file is deleted, and another is added. Git saw that fileC was newly added, and went through all deleted files to see if it might have been a rename. Here, the only deleted file was fileA, so it was a rather quick check.

Note: it only has to go through the deleted files, because otherwise you wouldn't have a rename, you'd have a copy. Copies can be detected too, and it works roughly the same way, but they're covered by a separate option (-C).
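The procedure described above (compare each added file against the deleted files and keep sufficiently similar pairs) can be sketched as follows. This is a simplified model, not Git's implementation: the trees are plain dictionaries mapping path to content, the `difflib` ratio stands in for Git's similarity score, and `detect_renames` is a hypothetical name.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Toy line-based similarity; a stand-in for Git's real score."""
    return SequenceMatcher(None, a.splitlines(), b.splitlines(), autojunk=False).ratio()


def detect_renames(old_tree: dict, new_tree: dict, limit: float = 0.5) -> dict:
    """Map each added path to the deleted path it was most likely renamed from."""
    deleted = sorted(set(old_tree) - set(new_tree))  # only in the old tree
    added = sorted(set(new_tree) - set(old_tree))    # only in the new tree
    candidates = [
        (similarity(old_tree[src], new_tree[dst]), src, dst)
        for src in deleted
        for dst in added
    ]
    candidates = [c for c in candidates if c[0] >= limit]
    # Best scores first; ties are broken by path name, an arbitrary choice here.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    renames, used_sources = {}, set()
    for _score, src, dst in candidates:
        if src in used_sources or dst in renames:
            continue  # each deleted file can be the source of one rename only
        renames[dst] = src
        used_sources.add(src)
    return renames


seven = "a\nb\nc\nd\ne\nf\ng\n"
six = "b\nc\nd\ne\nf\ng\n"
old_tree = {"fileA": seven, "fileB": seven}
new_tree = {"fileB": six, "fileC": six}
print(detect_renames(old_tree, new_tree, limit=0.8))  # {'fileC': 'fileA'}
```

fileB exists in both trees, so it is treated as a modification and never enters the comparison; that is why its identical content does not confuse the result.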


5 Comments

This is how I imagine commit to detect renames. But how does git diff know? Does it consult the histories of both commits?
@Paddre A commit doesn't store the changes, it stores the new tree. Commands such as git diff calculate the changes on demand. So git diff has immediate access to the directories and files in both commits, without having to go through the histories.
Thx (makes perfect sense... don't know what was wrong in my mind yesterday night :-P ). So git diff checks whether there is an added file which is sufficiently similar to a deleted file, right? So I tried the following: git mv fileC fileD, cp fileD fileE, delete the first lines of fileD and fileE, git add fileE. Now there has been a rename (that is, a deletion and an addition) plus a separate addition, and both the renamed and the added file have the same checksum. How does git diff still find out that fileD is the renamed version of fileC?
@Paddre It doesn't, it guesses. You'll see the exact same result if you do git mv fileC fileE, and then cp fileE fileD: git will still detect a rename from fileC to fileD.
Thanks. That helped me a lot (even though the answer is somewhat "sobering" :-D ). I think this works because it doesn't matter which file is the renamed one. The base file doesn't exist any more in the newer revision anyway, and neither fileD nor fileE exists in the older revision. So it doesn't matter whether fileD or fileE is the renamed version of fileC.
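The "it guesses" point from the comments can be illustrated with a small sketch: when two added files have identical content, they score exactly the same against the deleted file, so nothing in the content distinguishes them. As before, the `difflib` ratio is only an assumed stand-in for Git's score.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Toy line-based similarity; a stand-in for Git's real score."""
    return SequenceMatcher(None, a.splitlines(), b.splitlines(), autojunk=False).ratio()


deleted_content = "b\nc\nd\ne\nf\ng\n"   # fileC in the older revision
new_content = "c\nd\ne\nf\ng\n"          # first line removed
added = {"fileD": new_content, "fileE": new_content}

scores = {path: similarity(deleted_content, text) for path, text in added.items()}
best = max(scores.values())
tied = sorted(path for path, score in scores.items() if score == best)
print(tied)  # ['fileD', 'fileE']
```

Both paths tie for the best score, so whichever one is reported as the rename target is a tie-break rather than recorded history.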
