2
$\begingroup$

(not sure if this is a SO or a CS StackExchange Question, but I'm putting it here for now.)

I've been working on a project developing a more extensible way to create synopses of ancient texts (essentially side-by-side comparisons with the diffs highlighted, but with options to focus only on semantically-useful diffs.) This means that it makes the most sense to use a character diff (often some of these diffs are one extra vowel letter, which is not 'important', so to speak.) I've implemented the Myers diff algorithm, as described in the wonderful blog post in https://blog.jcoglan.com/2017/02/12/the-myers-diff-algorithm-part-1/ (and subsequent two parts) in Javascript, which I attach at the bottom of this post.
Coglan notes that the Myers diff algorithm is greedy - 'trying to consume as many lines that are the same before making a change' - and thus avoids the so-called "wrong end" problem:

Good: class Foo Bad: class Foo def initialize(name) def initialize(name) @name = name @name = name end + end + + + def inspect + def inspect + @name + @name + end end end end 

I'm wondering if there is a way to avoid this situation when comparing letter-by-letter. The implementation Coglan describes leads to the following result in that case, using the example above. (I have highlighted the letters which are added; the un-highlighted end, which should be the last three letters, are instead the "e" in def, "n" in inspect, and "d" in the first end):

 class Foo def initialize(name) @name = name end def inspect @name end end 

I understand that the preferred method would likely simply be to just go back to diffing words, but then I lose some of the functionality I've developed... Anyhow, here's the code (in Javascript, again; based on the blog post above.)

function myersDiff(oldTokens, newTokens) { const m = oldTokens.length; const n = newTokens.length; const max = m + n; //max amount of moves we may need to make const v = new Map(); //map contains trace in order v.set(1, 0); let path = []; //get shortest edit graph using the 45-degree k-lines at each depth d for (let d = 0; d <= max; d++) { path[d] = new Map(); for (let k = -d; k <= d; k += 2) { /* e.g. if depth is 6, consider all options from -12 to 12 since the trace is a sparse binary tree if k = -d, we're on the edge of the graph. We can only go downward (i.e. left). If k = d, we can only go up-right (i.e. right.) Otherwise, check the elements to the right and left (k+1, k-1) and take the highest x-value */ let x; if (k === -d || (k !== d && v.get(k - 1) < v.get(k + 1))) { x = v.get(k + 1); } else { x = v.get(k - 1) + 1; //when moving rightward, the k-line index is higher } let y = x - k; //edit graph should consider diagonals to add 0 cost while (x < m && y < n && oldTokens[x].letter === newTokens[y].letter) { x++; y++; } v.set(k, x); path[d].set(k, { x, y }); //check for end if (x >= m && y >= n) { return buildChanges(path, oldTokens, newTokens); } } } } function buildChanges(path, oldTokens, newTokens) { const changes = []; let d = path.length - 1; let x = oldTokens.length; let y = newTokens.length; while (d >= 0) { const k = x - y; const step = path[d].get(k); let prevK; if (k === -d || (k !== d && path[d - 1] && path[d - 1].has(k - 1) && path[d - 1].get(k - 1).x < path[d - 1].get(k + 1)?.x)) { prevK = k + 1; } else { prevK = k - 1; } const prevStep = path[d - 1]?.get(prevK); //backtrack to construct the list, privileging diagonals while (x > (prevStep?.x || 0) && y > (prevStep?.y || 0)) { changes.unshift({ type: "unchanged", token: oldTokens[x - 1].letter, capital: oldTokens[x-1].capital }); x--; y--; } if (x > (prevStep?.x || 0)) { changes.unshift({ type: "removed", token: oldTokens[x - 1].letter, capital: oldTokens[x-1].capital }); x--; } else if (y > (prevStep?.y || 0)) { changes.unshift({ type: "added", token: newTokens[y - 1].letter, capital: newTokens[y-1].capital }); y--; } d--; } return changes; } 
$\endgroup$

0

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.