If I have a file like this

foo bar bat hukarz foo bar bat 

then I would like to be told that one region is identical to another region:

foo bar bat 

The reason is that I have some large text files containing identical regions, often repeated more than once. I want to clean them up.
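To make the goal concrete, here is a minimal Python sketch of the kind of report I have in mind. It is not how Carrot2 or Lingo4G work; it only finds exact repeats of a fixed-length run of words, and the names `repeated_regions` and `min_words` are made up for illustration.

```python
from collections import defaultdict


def repeated_regions(text, min_words=3):
    """Map each run of `min_words` words that occurs more than once
    to the word offsets at which it starts."""
    words = text.split()
    positions = defaultdict(list)
    # Slide a fixed-size window over the words and record where each run starts.
    for i in range(len(words) - min_words + 1):
        positions[tuple(words[i:i + min_words])].append(i)
    # Keep only the runs that were seen at more than one offset.
    return {" ".join(run): starts for run, starts in positions.items() if len(starts) > 1}


print(repeated_regions("foo bar bat hukarz foo bar bat"))
# prints {'foo bar bat': [0, 4]}
```

This only catches exact matches of a fixed window size. It does not merge adjacent windows into longer regions, and it does not detect regions that are merely similar.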

Lingo4G and the Carrot2 engine define this as Document Overlap and Pairwise Similarity: identifying identical text fragments in documents and returning information useful for visualizing such regions.

I was thinking Emacs might have a mode where it could use the Carrot2 engine for identifying identical or similar regions in a buffer, like they do with Carrot2:

[Image: Carrot2 engine identifying identical or similar regions in a file]

  • What you're asking for is just a diff with additional highlighting of removed/added regions. Emacs has some basic support for this, which I recall seeing in the "git conflicts" highlights, but it is very basic and poor, similar to git's git-highlight. If it's a one-off situation, I'd recommend just using diffr, as in diff -u file1 file2 | diffr. Otherwise (unless there's a mode that solves this) you'd need to dig the highlighting functionality out of Emacs and write a minor mode around it. Commented Jan 6 at 23:51
  • Oh yeah, and this is assuming the regions are sorted alphabetically. If they aren't, that complicates the situation even further, and in that case I'm 99% certain there isn't a mode to solve this, because situations where you want to find duplicate lines are typically well solved by cat myfile | sort > myfile_sorted && … (potentially with the -u option), and there's not much point in writing some complicated algo around it. Commented Jan 6 at 23:54
  • Does this question have anything to do with Emacs? Commented Jan 7 at 4:47
  • I disagree with @Hi-Angel - what you are asking for is the development of a suitable algorithm (I don't know of a reasonably efficient such algorithm and I don't think that the suggestions above are enough). If you have a suitable algorithm, then expressing it in Emacs Lisp might be an appropriate question for this site (even in that case, I think it would be a development project, so probably too long for Emacs SE). But in the absence of a suitable algorithm, this probably belongs on cstheory.stackexchange.com where somebody might be able to suggest such an algorithm. Commented Jan 7 at 21:47
  • No, I don't intend to develop an algorithm for this. I was thinking Emacs might have a mode where it could use the Carrot2 engine for identifying identical or similar regions in a buffer. Commented Jan 7 at 23:19
