
This earlier question asked for the differences between 4 different Git diff strategies, but the only difference that was explained was the difference between myers and patience, which is pretty well explained elsewhere.

How does the histogram strategy work? What differentiates it from patience? The git-diff man page only says that it "extends the patience algorithm to 'support low-occurrence common elements'". Other pages mention that it's faster, and that it comes from JGit, but they don't explain where or how its algorithm or results will differ from patience.

Where can I find a description of the histogram algorithm relative to the patience algorithm, with the same level of detail as Bram Cohen's original description of the patience algorithm?

(If it's just a matter of implementation performance with no case that will produce different results, why wasn't it just implemented as a new backend for patience?)

1 Comment

  • Although this paper compares only two algorithms (Myers and Histogram), I think it can help. Commented Jul 8, 2019 at 8:20

2 Answers


This histogram strategy was introduced in git 1.7.7 (Sept 2011), with the following description (as mentioned by the OP):

"git diff" learned a "--histogram" option to use a different diff generation machinery stolen from jgit, which might give better performance.

JGit includes src/org/eclipse/jgit/diff/HistogramDiff.java and tst/org/eclipse/jgit/diff/HistogramDiffTest.java

The description there is fairly complete:

HistogramDiff

An extended form of Bram Cohen's patience diff algorithm.

This implementation was derived by using the 4 rules that are outlined in Bram Cohen's blog, and then was further extended to support low-occurrence common elements.

The basic idea of the algorithm is to create a histogram of occurrences for each element of sequence A. Each element of sequence B is then considered in turn. If the element also exists in sequence A, and has a lower occurrence count, the positions are considered as a candidate for the longest common subsequence (LCS).
After scanning of B is complete the LCS that has the lowest number of occurrences is chosen as a split point. The region is split around the LCS, and the algorithm is recursively applied to the sections before and after the LCS.

By always selecting an LCS position with the lowest occurrence count, this algorithm behaves exactly like Bram Cohen's patience diff whenever there is a unique common element available between the two sequences.
When no unique elements exist, the lowest occurrence element is chosen instead.
This offers more readable diffs than simply falling back on the standard Myers' O(ND) algorithm would produce.

To prevent the algorithm from having an O(N^2) running time, an upper limit on the number of unique elements in a histogram bucket is configured by #setMaxChainLength(int).
If sequence A has more than this many elements that hash into the same hash bucket, the algorithm passes the region to #setFallbackAlgorithm(DiffAlgorithm).
If no fallback algorithm is configured, the region is emitted as a replace edit.

During scanning of sequence B, any element of A that occurs more than #setMaxChainLength(int) times is never considered for an LCS match position, even if it is common between the two sequences. This limits the number of locations in sequence A that must be considered to find the LCS, and helps maintain a lower running time bound.

So long as #setMaxChainLength(int) is a small constant (such as 64), the algorithm runs in O(N * D) time, where N is the sum of the input lengths and D is the number of edits in the resulting EditList.
If the supplied SequenceComparator has a good hash function, this implementation typically out-performs MyersDiff, even though its theoretical running time is the same.

This implementation has an internal limitation that prevents it from handling sequences with more than 268,435,456 (2^28) elements.
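The split-and-recurse core of that description can be sketched in Python. This is an assumed, much-simplified illustration, not JGit's code: it splits on a single lowest-count common element instead of a widened LCS region, and it omits the chain-length cap and fallback:

```python
from collections import Counter

def simple_histogram_diff(a, b, a_off=0, b_off=0):
    """Much-simplified sketch of the histogram idea (not JGit's code):
    split on a single lowest-count common element and recurse.
    Returns half-open differing regions ((a_start, a_end), (b_start, b_end))."""
    if not a or not b:
        # one side exhausted: whatever remains is a pure insert/delete region
        if a or b:
            return [((a_off, a_off + len(a)), (b_off, b_off + len(b)))]
        return []
    counts = Counter(a)                      # histogram of occurrences in A
    best = None                              # (count in A, index in A, index in B)
    for j, line in enumerate(b):             # scan B for the rarest common line
        c = counts.get(line)
        if c is not None and (best is None or c < best[0]):
            best = (c, a.index(line), j)
    if best is None:                         # nothing in common: one replace edit
        return [((a_off, a_off + len(a)), (b_off, b_off + len(b)))]
    _, i, j = best
    # split around the chosen element and recurse on both sides
    return (simple_histogram_diff(a[:i], b[:j], a_off, b_off)
            + simple_histogram_diff(a[i + 1:], b[j + 1:],
                                    a_off + i + 1, b_off + j + 1))
```

For `["a", "b", "c"]` against `["a", "x", "c"]` this reports the single differing region `((1, 2), (1, 2))`, i.e. line 1 on each side.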


Note that this kind of algorithm was already used for pack_check, back in 2006 (git 1.3), for git-verify-pack -v. It was reused for index-pack in git 1.7.7.


Commit 8c912ee actually introduced --histogram to diff:

Port JGit's HistogramDiff algorithm over to C. Rough numbers (TODO) show that it is faster than its --patience cousin, as well as the default Myers algorithm.

The implementation has been reworked to use structs and pointers, instead of bitmasks, thus doing away with JGit's 2^28 line limit.

We also use xdiff's default hash table implementation (xdl_hash_bits() with XDL_HASHLONG()) for convenience.

Commit 8555123 (git 1.7.10, April 2012) added:

8c912ee (teach --histogram to diff, 2011-07-12) claimed histogram diff was faster than both Myers and patience.

We have since incorporated a performance testing framework, so add a test that compares the various diff tasks performed in a real 'log -p' workload.
This does indeed show that histogram diff slightly beats Myers, while patience is much slower than the others.

Finally, commit 07ab4de (git 1.8.2, March 2013) added:

config: Introduce diff.algorithm variable

Some users or projects prefer different algorithms over others, e.g. patience over myers or similar.
However, specifying appropriate argument every time diff is to be used is impractical. Moreover, creating an alias doesn't play nicely with other tools based on diff (git-show for instance).

Hence, a configuration variable which is able to set specific algorithm is needed.
For now, these four values are accepted:

  • 'myers' (which has the same effect as not setting the config variable at all),
  • 'minimal',
  • 'patience' and
  • 'histogram'.

Commit 07924d4 added concurrently the --diff-algorithm command line option.
As the OP Stuart P. Bentley mentions in the comments:

you can configure Git to use histogram by default with:

git config --global diff.algorithm histogram 
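Taken together, the persistent and the per-invocation routes look like this (a sketch, assuming git 1.8.2 or later so that both the config variable and the option exist):

```shell
# persistent choice (writes to ~/.gitconfig):
git config --global diff.algorithm histogram
git config --global diff.algorithm    # prints: histogram

# one-off override for a single invocation, regardless of configuration:
#   git diff --diff-algorithm=myers
```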

Update: Git 2.12 (Q1 2017) will retire the "fast hash" that had disastrous performance issues in some corner cases.

See commit 1f7c926 (01 Dec 2016) by Jeff King (peff). (Merged by Junio C Hamano -- gitster -- in commit 731490b, 19 Dec 2016)

xdiff: drop XDL_FAST_HASH

The xdiff code hashes every line of both sides of a diff, and then compares those hashes to find duplicates. The overall performance depends both on how fast we can compute the hashes, but also on how many hash collisions we see.

The idea of XDL_FAST_HASH is to speed up the hash computation.
But the generated hashes have worse collision behavior. This means that in some cases it speeds diffs up (running "git log -p" on git.git improves by ~8% with it), but in others it can slow things down. One pathological case saw over a 100x slowdown.

There may be a better hash function that covers both properties, but in the meantime we are better off with the original hash. It's slightly slower in the common case, but it has fewer surprising pathological cases.


Note: "git diff --histogram" had a bad memory usage pattern, which has been rearranged to reduce the peak usage, with Git 2.19 (Q3 2018).

See commit 79cb2eb, commit 64c4e8b, commit c671d4b, commit 2820985 (19 Jul 2018) by Stefan Beller (stefanbeller).
(Merged by Junio C Hamano -- gitster -- in commit 57fbd8e, 15 Aug 2018)

xdiff/xhistogram: move index allocation into find_lcs

This fixes a memory issue when recursing a lot, which can be reproduced as

seq 1 100000 >one
seq 1 4 100000 >two
git diff --no-index --histogram one two

Before this patch, histogram_diff would call itself recursively before calling free_index, which would mean a lot of memory is allocated during the recursion and only freed afterwards.

By moving the memory allocation (and its free call) into find_lcs, the memory is free'd before we recurse, such that memory is reused in the next step of the recursion instead of using new memory.

This addresses only the memory pressure, not the run time complexity, that is also awful for the corner case outlined above.


Note: the patience and histogram algorithms had memory leaks, fixed with Git 2.36 (Q2 2022)

See commit 43ad3af, commit 4a37b80, commit 61f8839, commit 9df0fc3 (16 Feb 2022) by Phillip Wood (phillipwood).
(Merged by Junio C Hamano -- gitster -- in commit 47be28e, 09 Mar 2022)

xdiff: fix a memory leak

Reported-by: Junio C Hamano
Signed-off-by: Phillip Wood

Although the patience and histogram algorithms initialize the environment they do not free it if there is an error.
In contrast for the Myers algorithm the environment is initialized in xdl_do_diff() and it is freed if there is an error.
Fix this by always initializing the environment in xdl_do_diff() and freeing it there if there is an error.


4 Comments

For reference, git config --global diff.algorithm histogram is the command that uses that last commit's config variable to make Git default to histogram.
@StuartP.Bentley Good point. I have included your comment in the answer for more visibility.
What does XDL_FAST_HASH have to do with any of this?
@StuartP.Bentley It was used to try and optimize the histogram and patience diffs, as described in github.com/git/git/commit/…. But it backfired and was recently pulled out.

It's hard to compare histogram diff with patience diff precisely, because patience diff is not well-defined IMO, and the descriptions on the web of histogram diff (including the answer above) are vague and incomplete.

I think both methods can be considered sort of heuristic, as they are not based on any formal theory such as the well-studied longest common subsequence algorithms (Myers, Hunt-McIlroy, etc). They are empirical methods that just work pretty well.

Patience diff depends on matching up lines common between the files being compared that are unique to each file, and then running a "longest ascending subsequence" algorithm to find a longest set of those lines that are in the same order in both files. Then another pass has to find further diffs in the sections between those matching lines, and that's the part that may differ between implementations. Some apparently use Myers's algorithm there; Bram's original writeup implied recursing with patience. And if there are no unique matching lines, patience needs to fall back to another algorithm completely.
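That first pass (unique common lines plus a longest ascending subsequence over them) can be sketched in Python; this is assumed illustrative code, not git's implementation:

```python
from bisect import bisect_left
from collections import Counter

def patience_anchors(a, b):
    """Illustrative sketch (not git's code) of patience diff's anchor step:
    lines unique in both files, reduced to the longest subset appearing in
    the same relative order in A and B."""
    ca, cb = Counter(a), Counter(b)
    pos_a = {line: i for i, line in enumerate(a)}
    # unique-in-both lines, walked in B's order, tagged with their A index
    cand = [(pos_a[line], line) for line in b if ca[line] == 1 and cb[line] == 1]
    # longest increasing subsequence over the A indices keeps only the
    # anchors whose order agrees between the two files (the O(n) scan in
    # the loop below is kept for clarity over asymptotic efficiency)
    tails, prev = [], [None] * len(cand)
    for i, (ai, _) in enumerate(cand):
        k = bisect_left([cand[t][0] for t in tails], ai)
        if k > 0:
            prev[i] = tails[k - 1]
        if k == len(tails):
            tails.append(i)
        else:
            tails[k] = i
    chain, i = [], tails[-1] if tails else None
    while i is not None:
        chain.append(cand[i][1])
        i = prev[i]
    return chain[::-1]                       # anchor lines, in order
```

For `["a", "b", "c"]` against `["a", "x", "c"]` the anchors are `["a", "c"]`; everything between consecutive anchors is then re-diffed by whatever the implementation recurses with.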

(Also note the similarity between patience and Heckel's method from 1978: https://dl.acm.org/doi/pdf/10.1145/359460.359467 )

The histogram algorithm does not need unique lines. Here's a fairly precise (I hope) description of the logic underlying the original jgit implementation:

The idea is to find a section of consecutive (adjacent) corresponding (same order) lines that match between the files, then do the same to the portions of the files before and after the section just matched, recursively, until sections have no matching lines. These sections that have no lines in common are the differences.

When finding the sections that match, try to find a section that is longest, and/or has a line in the first file that occurs as infrequently as possible. (The frequency of occurrence of lines in the second file is not used in the original jgit implementation, and I suspect also not in git, contrary to some info on the web.)

More precisely, consider files A and B.

  1. Count the number of occurrences of each distinct line in A.

  2. Set lowcount to 65.

  3. Find the first line in B that matches a line in A that has an occurrence count <= lowcount. Choose the first occurrence of this line in A as the match with B.

  4. Widen the match by including adjacent corresponding matching lines before and after this match.

  5. Note the bounds of the widened match (first and last locations of the A and B lines) as the "best match" and set lowcount to the minimum occurrence count of the A lines in the widened match.

  6. Continue with subsequent A lines that match this B line, if any, skipping any A lines already matched in the widening. Repeat the widening for each such match. If the widened match has more lines than the "best match", or if its minimum A occurrence count is less than lowcount, then make this the "best match", note its bounds and set lowcount to its minimum A occurrence count.

  7. Proceed to the next B line that matches an A line with count <= lowcount, skipping any B lines already matched so far. Match this B line against each A line it matches, in the same manner as above, widening each match, and replace the "best match" with each new match that is longer or has a lower minimum A count than lowcount, re-setting lowcount to that count if any such match is found.

  8. After all B lines have been matched against all A lines, re-do the above process steps 1 through 7 with the portions of the file before and after the "best match". Recount the A lines and reset lowcount to 65 before looking for a "best match" in each sub-section. Continue to split each part of the file recursively, finding the "best match" in each sub-section.

When a sub-section has no "best match" with which to split it, the sub-section is a diff.

If no B line matches an A line with a count <= lowcount, then either consider the section as a diff, or fall back to another diff algorithm (e.g. Myers).
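The eight steps can be sketched in Python (assumed illustrative code, not jgit's or git's; the skipping of already-matched A lines in step 6 is simplified away, and only already-matched B positions are skipped):

```python
from collections import Counter

MAX_CHAIN = 64  # occurrence cap, as in JGit's #setMaxChainLength default

def find_best_match(a, b):
    """Steps 1-7 (sketch): scan B for the widened common region that is
    longest and/or anchored on the rarest line of A.
    Returns (a_start, b_start, length), or None if nothing qualifies."""
    counts = Counter(a)                       # step 1: histogram of A's lines
    positions = {}
    for i, line in enumerate(a):
        positions.setdefault(line, []).append(i)
    lowcount = MAX_CHAIN + 1                  # step 2: lowcount starts at 65
    best = None
    j = 0
    while j < len(b):
        line = b[j]
        if counts.get(line, lowcount + 1) > lowcount:
            j += 1                            # absent or too frequent: skip
            continue
        next_j = j + 1
        for i in positions[line]:             # steps 3 and 6: each A position
            # step 4: widen the match backwards and forwards
            a_s, b_s = i, j
            while a_s > 0 and b_s > 0 and a[a_s - 1] == b[b_s - 1]:
                a_s, b_s = a_s - 1, b_s - 1
            a_e, b_e = i + 1, j + 1
            while a_e < len(a) and b_e < len(b) and a[a_e] == b[b_e]:
                a_e, b_e = a_e + 1, b_e + 1
            size = a_e - a_s
            mincount = min(counts[x] for x in a[a_s:a_e])
            # steps 5-7: keep the longer match, or the rarer-anchored one
            if best is None or size > best[2] or mincount < lowcount:
                best = (a_s, b_s, size)
                lowcount = min(lowcount, mincount)
            next_j = max(next_j, b_e)         # step 7: skip matched B lines
        j = next_j
    return best

def histogram_diff(a, b):
    """Step 8 (sketch): split around the best match and recurse.
    Differing regions come back as ((a_start, a_end), (b_start, b_end))."""
    if not a and not b:
        return []
    m = find_best_match(a, b)
    if m is None:                             # no usable match: a diff region
        return [((0, len(a)), (0, len(b)))]
    a_s, b_s, size = m

    def shift(region, da, db):
        (a0, a1), (b0, b1) = region
        return ((a0 + da, a1 + da), (b0 + db, b1 + db))

    return (histogram_diff(a[:a_s], b[:b_s])
            + [shift(r, a_s + size, b_s + size)
               for r in histogram_diff(a[a_s + size:], b[b_s + size:])])
```

With this sketch, `histogram_diff(["a", "b", "c", "d"], ["a", "x", "c", "d"])` reports the single differing region `((1, 2), (1, 2))`; a real implementation would additionally hand regions with no match below the cap to a fallback algorithm rather than always emitting them as replace edits.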
