Timeline for How can I find identical regions in a buffer?

Current License: CC BY-SA 4.0

13 events

when	what		by	license	comment
Jan 11 at 8:50	comment	added	Jean Louis		I would use the pg_trgm PostgreSQL module. Insert the entries into a database: `CREATE TABLE text_regions (id SERIAL PRIMARY KEY, content TEXT);` with `INSERT INTO text_regions (content) VALUES ('This is a sample text region.');` and `INSERT INTO text_regions (content) VALUES ('This is another text region.');`, then `SELECT a.content AS region1, b.content AS region2, similarity(a.content, b.content) FROM text_regions a, text_regions b WHERE a.id != b.id AND similarity(a.content, b.content) > 0.5;`, which of course relies on the `similarity` function from the pg_trgm extension. Good for large numbers of documents.
Jan 8 at 3:20	comment	added	NickD		Okay - given those parameters, I have retracted my close vote.
Jan 7 at 23:55	comment	added	Drew		@NickD: If the question is only asking whether Emacs itself, or some Emacs package, provides such a feature, I think it's OK for this site. I'm guessing it won't get much in the way of a solution/answer here, but it seems OK to me to ask it.
Jan 7 at 23:21	history	edited	Jason Hunter	CC BY-SA 4.0	added 311 characters in body
Jan 7 at 23:19	comment	added	Jason Hunter		No, I don't intend to develop an algorithm for this. I was thinking Emacs might have a mode where it could use the Carrot2 engine for identifying identical or similar regions in a buffer.
Jan 7 at 22:09 to Jan 8 at 3:22	review	Close votes
Jan 7 at 21:47	comment	added	NickD		I disagree with @Hi-Angel - what you are asking for is the development of a suitable algorithm (I don't know of a reasonably efficient such algorithm and I don't think that the suggestions above are enough). If you have a suitable algorithm, then expressing it in Emacs Lisp might be an appropriate question for this site (even in that case, I think it would be a development project, so probably too long for Emacs SE). But in the absence of a suitable algorithm, this probably belongs on cstheory.stackexchange.com where somebody might be able to suggest such an algorithm.
Jan 7 at 4:47	comment	added	NickD		Does this question have anything to do with Emacs?
Jan 7 at 1:54	history	edited	Drew		edited tags
Jan 6 at 23:54	comment	added	Hi-Angel		Oh yeah, and this assumes the regions are sorted alphabetically. If they aren't, the situation gets even more complicated, and in that case I'm 99% certain there isn't a mode that solves this, because the typical situations where you want to find duplicate lines are handled well by `cat myfile | sort > myfile_sorted && …` (potentially with the `-u` parameter), so there's not much point in writing a complicated algorithm around it.
Jan 6 at 23:51	comment	added	Hi-Angel		What you're asking for is just a diff with additional highlighting of the removed/added regions. Emacs has some basic support for this, which I recall seeing in the "git conflicts" highlighting, but it is very basic and poor, similar to git's `git-highlight`. If it's a one-off situation, I'd recommend just using `diffr`, as in `diff -u file1 file2 | diffr`. Otherwise (unless there's a mode that solves this) you'd need to dig the highlighting functionality out of Emacs and write a minor mode around it.
Jan 6 at 20:38	history	edited	Jason Hunter	CC BY-SA 4.0	added 89 characters in body
Jan 6 at 20:28	history	asked	Jason Hunter	CC BY-SA 4.0
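
Two ideas come up in the comments above: grouping exact duplicates, and scoring near-duplicates by trigram similarity in the spirit of pg_trgm. As a rough illustration only, here is a minimal Python sketch of both. It assumes that a "region" means a paragraph separated by blank lines, which the question does not state, and the helper names (`split_regions`, `identical_regions`, `trigrams`, `similarity`) are made up for this example. The trigram score only approximates what pg_trgm computes and is not guaranteed to match PostgreSQL's values.

```python
import re
from collections import defaultdict


def split_regions(text):
    """Split text into regions, assuming regions are separated by blank lines."""
    return [r.strip() for r in re.split(r"\n\s*\n", text) if r.strip()]


def identical_regions(regions):
    """Map each region that occurs more than once to the indices where it occurs.

    Uses a dictionary, so the regions do not need to be sorted first.
    """
    seen = defaultdict(list)
    for index, region in enumerate(regions):
        seen[region].append(index)
    return {region: idxs for region, idxs in seen.items() if len(idxs) > 1}


def trigrams(text):
    """Set of character trigrams per lowercased word.

    Each word is padded with two leading spaces and one trailing space,
    an approximation of pg_trgm's padding (assumption, not verified
    against PostgreSQL).
    """
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (Jaccard index)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    buffer_text = "alpha beta\n\ngamma\n\nalpha beta\n"
    regions = split_regions(buffer_text)
    print(identical_regions(regions))
    print(similarity("This is a sample text region.",
                     "This is another text region."))
```

Exact duplicates are found in a single pass, while the similarity score has to be computed pairwise, so for a large number of regions an indexed store such as the PostgreSQL setup suggested above is likely the more practical route.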