13 events
when what by license comment
Jan 11 at 8:50 comment added Jean Louis I would use the pg_trgm PostgreSQL module. Insert entries into the database: CREATE TABLE text_regions (id SERIAL PRIMARY KEY, content TEXT); then INSERT INTO text_regions (content) VALUES ('This is a sample text region.'); and INSERT INTO text_regions (content) VALUES ('This is another text region.'); then SELECT a.content AS region1, b.content AS region2, similarity(a.content, b.content) FROM text_regions a, text_regions b WHERE a.id != b.id AND similarity(a.content, b.content) > 0.5; which relies on the similarity() function that pg_trgm provides. Good for a large number of docs.
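To make the idea behind that query concrete outside a database, here is a minimal Python sketch of trigram similarity. It only approximates what pg_trgm does (lowercasing, padding each word, and dividing shared trigrams by total distinct trigrams), and the function names `trigrams`, `similarity`, and `similar_pairs` are hypothetical helpers, not part of PostgreSQL or Emacs. Scores may differ from the real extension.

```python
import re
from itertools import combinations


def trigrams(text: str) -> set[str]:
    """Return the set of 3-character grams of text.

    Assumption: each word is lowercased and padded with two leading
    spaces and one trailing space, roughly mimicking pg_trgm.
    """
    grams: set[str] = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def similar_pairs(regions: list[str], threshold: float = 0.5):
    """Return (region1, region2, score) for each pair scoring above threshold."""
    pairs = []
    for a, b in combinations(regions, 2):
        score = similarity(a, b)
        if score > threshold:
            pairs.append((a, b, score))
    return pairs


if __name__ == "__main__":
    regions = [
        "This is a sample text region.",
        "This is another text region.",
    ]
    for region1, region2, score in similar_pairs(regions):
        print(f"{score:.2f}  {region1!r}  {region2!r}")
```

Unlike the SQL self-join, `combinations` visits each pair once, so a matching pair is not reported twice in swapped order.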
Jan 8 at 3:20 comment added NickD Okay - given those parameters, I have retracted my close vote.
Jan 7 at 23:55 comment added Drew @NickD: If the question is only asking whether Emacs itself, or some Emacs package, provides such a feature, I think it's OK for this site. I'm guessing it won't get much in the way of a solution/answer here, but it seems OK to me to ask it.
Jan 7 at 23:21 history edited Jason Hunter CC BY-SA 4.0
added 311 characters in body
Jan 7 at 23:19 comment added Jason Hunter No, I don't intend to develop an algorithm for this. I was thinking Emacs might have a mode where it could use the Carrot2 engine for identifying identical or similar regions in a buffer.
Jan 7 at 22:09 review Close votes
Jan 8 at 3:22
Jan 7 at 21:47 comment added NickD I disagree with @Hi-Angel - what you are asking for is the development of a suitable algorithm (I don't know of a reasonably efficient such algorithm and I don't think that the suggestions above are enough). If you have a suitable algorithm, then expressing it in Emacs Lisp might be an appropriate question for this site (even in that case, I think it would be a development project, so probably too long for Emacs SE). But in the absence of a suitable algorithm, this probably belongs on cstheory.stackexchange.com where somebody might be able to suggest such an algorithm.
Jan 7 at 4:47 comment added NickD Does this question have anything to do with Emacs?
Jan 7 at 1:54 history edited Drew
edited tags
Jan 6 at 23:54 comment added Hi-Angel Oh yeah, and this is assuming the regions are sorted alphabetically. Because if they aren't, this complicates the situation even further, and in this case I'm 99% certain there isn't a mode to solve this, because typically situations where you want to find duplicate lines are well solved by cat myfile | sort > myfile_sorted && … (potentially with -u parameter), and there's not much point in writing some complicated algo around it.
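For readers who want the sort-and-deduplicate idea from that comment without a shell, a rough Python equivalent follows. It is only a sketch for exact duplicate lines, not for similar regions, and the helper names `sorted_unique` and `duplicate_lines` are made up for illustration.

```python
from collections import Counter


def sorted_unique(lines: list[str]) -> list[str]:
    """Return the distinct lines in sorted order, similar in spirit to sort -u."""
    return sorted({line.rstrip("\n") for line in lines})


def duplicate_lines(lines: list[str]) -> list[str]:
    """Return, in sorted order, every line that occurs more than once."""
    counts = Counter(line.rstrip("\n") for line in lines)
    return sorted(line for line, n in counts.items() if n > 1)


if __name__ == "__main__":
    sample = ["pear\n", "apple\n", "pear\n", "fig\n"]
    print(sorted_unique(sample))
    print(duplicate_lines(sample))
```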
Jan 6 at 23:51 comment added Hi-Angel What you're asking for is just a diff with additional highlighting of removed/added regions. Emacs has some basic support for this, which I recall seeing in "git conflicts" highlights, but it is very basic and poor, similar to git's git-highlight. If it's a one-off situation, I'd recommend just using diffr, as in diff -u file1 file2 | diffr. Otherwise (unless there's a mode that solves this) you'd need to dig the highlighting functionality out of Emacs and write a minor mode around it.
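As an illustration of what "a diff with highlighting of removed/added regions" means at the word level, here is a small Python sketch built on the standard library's `difflib.SequenceMatcher`. It is not how diffr or Emacs implements it; the `mark_changes` helper and the `[-...-]` / `{+...+}` markers are assumptions chosen for readability.

```python
import difflib


def mark_changes(old: str, new: str) -> str:
    """Return one line where removed words are wrapped in [-...-]
    and added words in {+...+}; unchanged words are left alone."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    out: list[str] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(old_words[i1:i2])
            continue
        if i1 < i2:  # words present only in the old text
            out.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if j1 < j2:  # words present only in the new text
            out.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(out)


if __name__ == "__main__":
    print(mark_changes("This is a sample text region.",
                       "This is another text region."))
    # prints: This is [-a sample-] {+another+} text region.
```

Splitting on whitespace keeps the sketch short; a real tool would also preserve the original spacing and handle multi-line regions.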
Jan 6 at 20:38 history edited Jason Hunter CC BY-SA 4.0
added 89 characters in body
Jan 6 at 20:28 history asked Jason Hunter CC BY-SA 4.0