How do I identify & list unique hunks in a git commit?

Question

I have a commit with a large number (hundreds) of similar hunks, and I'd like to list each unique hunk in the commit in order to compare them.

I wrote the following GNU awk script, which writes each hunk to a unique file (hunk-[md5-of-hunk].txt):

BEGIN {
    hunk = ""
    buildhunk = 0
}

# Write the current hunk to hunk-<md5>.txt unless an identical one was already seen.
function writeHunk() {
    if (length(hunk) > 0) {
        print hunk > "hunk.tmp"
        close("hunk.tmp")
        cmd = "cat hunk.tmp | md5"
        cmd | getline md5
        close(cmd)
        if (!(md5 in hunkfiles)) {
            hunkfilename = "hunk-" md5 ".txt"
            print hunk > hunkfilename
            hunkfiles[md5] = hunkfilename
        }
    }
}

# A new hunk header or file header ends the previous hunk.
/^@@|^diff/ {
    writeHunk()
    hunk = ""
    buildhunk = ($1 == "@@") ? 1 : 0
}

# Collect context, added, and removed lines while inside a hunk.
/^[ +-]/ {
    if (buildhunk) {
        hunk = hunk $0 "\n"
    }
}

END {
    writeHunk()
    system("rm hunk.tmp")
    for (md5 in hunkfiles) {
        print hunkfiles[md5]
    }
}

I then run this with git show [commit-SHA] | awk -f my_script.awk, which creates & lists the resulting files. It works for my purposes, but is there a way to do this more efficiently using git's plumbing commands?

Example

Suppose the commit's patch looks like this (reduced to 1 line of context below for clarity's sake):

diff --git a/file1.txt b/file1.txt
index a3fb2ed..4d6f587 100644
--- a/file1.txt
+++ b/file1.txt
@@ -3,2 +3,3 @@ context
 context
+added line
 context
@@ -7,2 +8,3 @@ context
 context
+added line
 context
@@ -11,2 +13,3 @@ context
 context
+added line
 context
@@ -15,2 +18,3 @@ context
 context
+different added line
 context
@@ -19,2 +23,3 @@ context
 context
+different added line
 context
@@ -23,2 +28,3 @@ context
 context
+different added line
 context
@@ -27,2 +33,3 @@ context
 context
+even more different added line
 context
@@ -31,2 +38,3 @@ context
 context
+even more different added line
 context

I want to be able to identify that there are only 3 unique hunks, and see what they are. Namely:

Unique hunk 1:

 context
+added line
 context

Unique hunk 2:

 context
+different added line
 context

Unique hunk 3:

 context
+even more different added line
 context

@BenjaminW. Thanks, I looked at that question, and it's similar in that both concern hunks, but my question is more about identifying & listing unique hunks in a commit, while that question is about non-interactively adding hunks in some scripted way. I'll update my question to clarify.
– tavnab, Commented May 17, 2017 at 4:49
I agree, that's why I didn't mark it as a duplicate, but I think a solution that programmatically gives you your hunks separately could be used to solve that other problem as well.
– Benjamin W., Commented May 17, 2017 at 4:50
@BenjaminW. Agreed & appreciated :) I have a feeling git already does this somewhere in its plumbing. I'm hoping someone who knows git's internals well enough has some simple one-liner to list a commit's unique hunks.
– tavnab, Commented May 17, 2017 at 5:01

torek · Accepted Answer · 2017-05-17 05:38:11Z

Commits are snapshots, and as such, they don't have diff hunks.

Diffs, of course, do have diff hunks. So if you have just one commit, you cannot do this at all. You need two commits. You then simply diff them and do what you are doing.

Note that git show <commit-hash> really means git diff <parent or parents of commit> <commit-hash>. If the specified commit is a merge commit, this produces a combined diff, which is probably not useful for your purposes since combined diffs intentionally omit many changes entirely. You might want to run an explicit diff against the commit's first parent only (to view only changes brought in as part of the merge).
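As a rough illustration of the explicit first-parent diff mentioned above, here is a minimal Python sketch. The helper names first_parent_diff_args and first_parent_diff are hypothetical, invented for this example; the only Git features assumed are the git diff command and the <commit>^1 revision syntax for a commit's first parent.

```python
import subprocess


def first_parent_diff_args(commit):
    """Build the argv for diffing a commit against its first parent."""
    # "<commit>^1" is Git's revision syntax for the first parent of <commit>.
    return ["git", "diff", f"{commit}^1", commit]


def first_parent_diff(commit, repo="."):
    """Run the diff inside `repo` and return the patch text.

    Raises subprocess.CalledProcessError if git exits non-zero,
    for example when the commit has no parent at all.
    """
    result = subprocess.run(
        first_parent_diff_args(commit),
        cwd=repo,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For an ordinary single-parent commit this should produce the same patch that git show prints after the commit message; for a merge it yields a plain two-way diff rather than a combined one.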

There are some parts of Git that internally do something like what you're doing, for git rerere and git patch-id. However, they don't do exactly what you're doing: for rerere they record only diff hunks where there was a merge conflict, and match up those diff hunks (saved by hash ID and file name) with resolutions recorded later. For patch-id they strip off line numbers and white-space but accumulate the entire set of changes from a commit into one big piece. It might be nice if Git had a bit of plumbing that did the git patch-id part hunk by hunk, independent of computing the overall patch ID for the commit, but it doesn't.
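Since Git has no such plumbing, the per-hunk hashing has to happen outside Git. The following is a minimal Python sketch of that idea, roughly mirroring what the awk script in the question does but keeping everything in memory. The function names split_hunks and unique_hunks are made up for this example, and the choice of SHA-1 over the hunk body (with the @@ header dropped) is an assumption, not anything Git itself computes.

```python
import hashlib
from collections import OrderedDict


def split_hunks(diff_text):
    """Yield the body of each hunk in a unified diff, without its @@ header."""
    body, in_hunk = [], False
    for line in diff_text.splitlines():
        if line.startswith("@@") or line.startswith("diff "):
            # A hunk header or file header closes the hunk being collected.
            if in_hunk and body:
                yield "\n".join(body) + "\n"
            body, in_hunk = [], line.startswith("@@")
        elif in_hunk and line[:1] in (" ", "+", "-"):
            # Context, added, and removed lines belong to the current hunk.
            body.append(line)
    if in_hunk and body:
        yield "\n".join(body) + "\n"


def unique_hunks(diff_text):
    """Map the SHA-1 of each distinct hunk body to (body, occurrence count).

    Entries are kept in order of first appearance.
    """
    seen = OrderedDict()
    for hunk in split_hunks(diff_text):
        key = hashlib.sha1(hunk.encode("utf-8")).hexdigest()
        text, count = seen.get(key, (hunk, 0))
        seen[key] = (text, count + 1)
    return seen
```

Passing it the text of the example patch from the question would yield three entries, one per distinct hunk body, because the line numbers in the @@ headers are left out of the hash.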

thanks! You're of course right that commits don't have hunks in their own right, but rather the hunks are a product of comparing 2 commits (and that the hunks themselves are context-sensitive). Supposing I isolate the hunks myself, do you see any advantage of using patch-id in place of md5 to ID the hunks?
@tavnab: Might be a bit faster, might not. See whether you like what it does to white space. Also you'll get sha-1s rather than md5s, so you get a few more bits of hash (160 bits, vs 128). None of these seem big arguments one way or another, except for treatment of white space.
I think I'll stick with md5 for now, but good to know I have the whitespace-stripping option of patch-id in case I need it.
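The whitespace trade-off discussed in these comments can be sketched in Python as two alternative hunk keys. This is only loosely modelled on the idea behind git patch-id and will not reproduce its IDs; the function names exact_key and whitespace_insensitive_key are hypothetical, and keeping the leading diff marker while squeezing the rest of each line is a design choice made for this example.

```python
import hashlib


def exact_key(hunk):
    """MD5 of the hunk exactly as written (128-bit digest)."""
    return hashlib.md5(hunk.encode("utf-8")).hexdigest()


def whitespace_insensitive_key(hunk):
    """SHA-1 of the hunk with whitespace removed from each line's content (160-bit digest).

    The first character of every line (the ' ', '+' or '-' diff marker) is
    kept so that context, added, and removed lines stay distinguishable.
    """
    squeezed = "\n".join(
        line[:1] + "".join(line[1:].split()) for line in hunk.splitlines()
    )
    return hashlib.sha1(squeezed.encode("utf-8")).hexdigest()
```

Two hunks that differ only in spacing get different exact keys but the same whitespace-insensitive key, which is the behaviour to weigh when choosing between the two approaches.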
