Trying to understand `git diff` and `git mv` rename detection mechanism

Question

This is a followup to another question I asked before.

Before being edited, the initially created file something gets renamed to somethingelse which can be observed here:

git mv something somethingelse

The file somethingelse then gets renamed back to something before the second vim edit:

git mv somethingelse something

Basically in the following portion of the code:

# If you add something to the first line, the rename will not be detected by Git # However, if you instead create 2 newlines and fill line 3 with new code, # the rename gets detected for whatever reason printf "\nCOMMAND: vim something\n\n" vim something

If at this point I add abc to the code, we would end up with:

First line of code. abc

Which I think is an addition of 4 bytes on line 1, which in turn will end up in this:

On branch master Changes to be committed: (use "git reset HEAD <file>..." to unstage) new file: something deleted: somethingelse

Then, if we add a newline and type in abc into the third line (which should also be 4 bytes, correct me if wrong):

First line of code. abc

Suddenly, Git will detect the rename (edit inclusive):

On branch master Changes to be committed: (use "git reset HEAD <file>..." to unstage) renamed: somethingelse -> something

One good answer/comment by @torek given here explains this to a certain extent, taking the git diff rename detection treshold of git status into consideration.

Shouldn't Git behave identically since we added 4 bytes in both cases, but in a different manner or does the newline have something to do with this?

Why would modifying a file to add abc end up being reported as a new file and a deleted file with another name? Did you also rename the file? You keep saying and mentioning "the rename" but your question doesn't actually show that you are renaming the file. Please be explicit about what you're doing. — Lasse V. Karlsen
– Lasse V. Karlsen, Commented Sep 16, 2017 at 16:49
The file gets renamed prior to editing, which happens for the first time at line 34 — Christian Heinrichs
– Christian Heinrichs, Commented Sep 16, 2017 at 16:51
But this isn't mentioned in the question. If there is prior information (even if this is in that question you're posting a followup to), you need to post it here. Stack Overflow is not a threaded discussion board and every question is expected to stand on its own feet. — Lasse V. Karlsen
– Lasse V. Karlsen, Commented Sep 16, 2017 at 16:52
For small files, rename detection is always going to be flaky. — o11c
– o11c, Commented Sep 16, 2017 at 18:04

torek · Accepted Answer · 2022-01-29 05:33:35Z

Git's "similarity index" computation is not, as far as I know, documented anywhere other than in the source, starting with diffcore-delta.c.

To compute the similarity index for two files S (source) and D (destination), Git:

reads both files
computes a hash table of all of the chunks of file S
computes a second hash table of all of the chunks of file D

The entries in these two hash tables are simply a count of occurrences of instances of that hash value (plus, as noted below, the length of the chunk).

The hash value for a file chunk is computed by:

start at the current file offset (initially zero)
read 64 bytes or until '\n' character, whichever occurs first
if the file is claimed to be text and there is a '\r' before the '\n', discard the '\r'
hash the resulting string-of-up-to-64 bytes using the algorithm shown in the linked file

Now that there are hash tables for both S and D, each possible hash h_i appears n_S times in S and n_D in D (either may be zero, though the code skips right over both-zero hash values). If the number of occurrences in D is less than or the same as the number of occurrences in S—i.e., n_D ≤ n_S—then D "copies from S" n_D times. If the number of occurrences in D exceeds the number in S (including when the number in S is zero), then D has a "literal add" of n_D - n_S occurrences of the hashed chunk, and D also copies all n_S original occurrences as well.

Each hashed chunk retains its number-of-input-bytes, and these multiply the number of copies or number of additions of "chunks" to get the number of bytes copied or added. (Deletions, where D lacks items that exist in S, have only indirect effect here: the byte copy and add counts get smaller, but Git does not specifically count the deletions themselves.)

These two values (src_copied and literal_added) computed in diffcore_count_changes are handed over to function estimate_similarity in diffcore-rename.c. It completely ignores the literal_added count (this count is used in deciding how to build packfile deltas, but not in terms of rename scoring). Instead, only the src_copied number matters:

score = (int)(src_copied * MAX_SCORE / max_size);

where max_size is the size in bytes of larger of the two input files S and D.

Note that there is an earlier computation:

max_size = ((src->size > dst->size) ? src->size : dst->size); base_size = ((src->size < dst->size) ? src->size : dst->size); delta_size = max_size - base_size;

and if the two files have changed size "too much":

if (max_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE) return 0;

we never even get into the diffcore-delta.c code to hash them. The minimum_score here is the argument to -M or --find-renames, converted to a scaled number. MAX_SCORE is 60000.0 (type double), so the default minimum_score, when you use the default -M50%, is 30000 (half of 60000). Except for the case of CR-before-LF eating, though, this particular shortcut should not affect the outcome of the more expensive similarity computation.

[Edit: this is now obsolete:] git status always uses the default. There is no knob to change the threshold (nor the number of files allowed in the rename-finding queue). If there were the code would go here, setting the rename_score field of the diff options. Until Git version 2.18.0, there was no way to control this for git status. In Git 2.18.0 and later, git status has the same --find-renames option as git diff. The status.renames option in the Git configuration enables any default detection, and if unset, git status obeys the diff.renames setting; see the git config documentation and the git status documentation.

Collectives™ on Stack Overflow

Trying to understand `git diff` and `git mv` rename detection mechanism

1 Answer 1

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Linked

Related