How can I find identical regions in a buffer?

Question

If I have a file like this

foo bar bat hukarz foo bar bat

then I would like to be made aware that one region is identical to another region:

foo bar bat

The reason is that I have some large text files that contain identical regions, often more than once. I want to clean them up.

Lingo4G and the Carrot2 engine define this as Document Overlap and Pairwise Similarity, that is, identifying identical text fragments in documents and returning information useful for visualizing such regions.

I was thinking Emacs might have a mode that could use the Carrot2 engine to identify identical or similar regions in a buffer, as is done with Carrot2.

What you're asking for is just a diff with additional highlighting of removed/added regions. Emacs has some basic support for this, which I recall seeing in the "git conflicts" highlights, but it is very basic and poor, much like git's git-highlight. If it's a one-off situation, I'd recommend just using diffr, as in diff -u file1 file2 | diffr. Otherwise (unless there's a mode that solves this) you'd need to dig the highlighting functionality out of Emacs and write a minor mode around it.
– Hi-Angel, Commented Jan 6 at 23:51
Oh yeah, and this assumes the regions are sorted alphabetically. If they aren't, the situation gets even more complicated, and in that case I'm 99% certain there isn't a mode to solve this, because the typical case of finding duplicate lines is well solved by cat myfile | sort > myfile_sorted && … (potentially with the -u parameter), so there's not much point in writing some complicated algorithm around it.
– Hi-Angel, Commented Jan 6 at 23:54
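For the duplicate-line case mentioned in the comment above, a minimal Python sketch is shown here. Unlike sorting, it leaves the file order untouched and reports where each repeated line occurs. The function name `duplicate_lines` and the choice to ignore blank lines and surrounding whitespace are assumptions made for illustration; nothing here is an Emacs mode or part of Carrot2.

```python
from collections import defaultdict


def duplicate_lines(text):
    """Map each non-blank line that occurs more than once to its 1-based line numbers."""
    positions = defaultdict(list)
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()  # assumption: ignore leading/trailing whitespace
        if stripped:             # assumption: blank lines are not interesting
            positions[stripped].append(number)
    # Keep only the lines that were seen at least twice.
    return {line: nums for line, nums in positions.items() if len(nums) > 1}


if __name__ == "__main__":
    sample = "foo bar bat\nhukarz\nfoo bar bat\n"
    print(duplicate_lines(sample))
```

This only catches regions that are whole lines; repeated fragments inside a line need a different approach.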
I disagree with @Hi-Angel: what you are asking for is the development of a suitable algorithm (I don't know of a reasonably efficient one, and I don't think the suggestions above are enough). If you have a suitable algorithm, then expressing it in Emacs Lisp might be an appropriate question for this site (though even then I think it would be a development project, so probably too long for Emacs SE). In the absence of a suitable algorithm, this probably belongs on cstheory.stackexchange.com, where somebody might be able to suggest one.
– NickD, Commented Jan 7 at 21:47
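As a rough baseline for discussion, and not a claim about how Carrot2 or Lingo4G work, one simple idea is to index every run of N consecutive words and report the runs that start at more than one position. The sketch below does this in Python; the name `repeated_word_runs` and the default window of three words are assumptions chosen to match the example in the question.

```python
from collections import defaultdict


def repeated_word_runs(text, size=3):
    """Return every run of `size` consecutive words that occurs more than once,
    mapped to the 0-based word offsets at which it starts."""
    words = text.split()
    starts = defaultdict(list)
    # Slide a window of `size` words over the text and record where each run begins.
    for offset in range(len(words) - size + 1):
        starts[tuple(words[offset:offset + size])].append(offset)
    # Keep only the runs seen at two or more offsets.
    return {" ".join(run): offs for run, offs in starts.items() if len(offs) > 1}


if __name__ == "__main__":
    print(repeated_word_runs("foo bar bat hukarz foo bar bat"))
    # {'foo bar bat': [0, 4]}
```

The window size is fixed, so a longer repeated region shows up as several overlapping runs rather than one maximal region; merging those is left out of this sketch.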
No, I don't intend to develop an algorithm for this. I was thinking Emacs might have a mode where it could use the Carrot2 engine for identifying identical or similar regions in a buffer.
– Jason Hunter, Commented Jan 7 at 23:19
