Timeline for How can I find identical regions in a buffer?
Current License: CC BY-SA 4.0
13 events
| when | what | | by | license | comment |
|---|---|---|---|---|---|
| Jan 11 at 8:50 | comment | added | Jean Louis | I would use the pg_trgm PostgreSQL module. Insert entries into the database: CREATE TABLE text_regions (id SERIAL PRIMARY KEY, content TEXT); with INSERT INTO text_regions (content) VALUES ('This is a sample text region.'); and INSERT INTO text_regions (content) VALUES ('This is another text region.'); then SELECT a.content AS region1, b.content AS region2, similarity(a.content, b.content) FROM text_regions a, text_regions b WHERE a.id != b.id AND similarity(a.content, b.content) > 0.5; which of course relies on the similarity function from pg_trgm. Good for a large number of documents. | |
| Jan 8 at 3:20 | comment | added | NickD | Okay - given those parameters, I have retracted my close vote. | |
| Jan 7 at 23:55 | comment | added | Drew | @NickD: If the question is only asking whether Emacs itself, or some Emacs package, provides such a feature, I think it's OK for this site. I'm guessing it won't get much in the way of a solution/answer here, but it seems OK to me to ask it. | |
| Jan 7 at 23:21 | history | edited | Jason Hunter | CC BY-SA 4.0 | added 311 characters in body |
| Jan 7 at 23:19 | comment | added | Jason Hunter | No, I don't intend to develop an algorithm for this. I was thinking Emacs might have a mode where it could use the Carrot2 engine for identifying identical or similar regions in a buffer. | |
| Jan 7 at 22:09 to Jan 8 at 3:22 | review | Close votes | | | |
| Jan 7 at 21:47 | comment | added | NickD | I disagree with @Hi-Angel - what you are asking for is the development of a suitable algorithm (I don't know of a reasonably efficient algorithm for this and I don't think that the suggestions above are enough). If you have a suitable algorithm, then expressing it in Emacs Lisp might be an appropriate question for this site (even in that case, I think it would be a development project, so probably too long for Emacs SE). But in the absence of a suitable algorithm, this probably belongs on cstheory.stackexchange.com where somebody might be able to suggest such an algorithm. | |
| Jan 7 at 4:47 | comment | added | NickD | Does this question have anything to do with Emacs? | |
| Jan 7 at 1:54 | history | edited | Drew | edited tags | |
| Jan 6 at 23:54 | comment | added | Hi-Angel | Oh yeah, and this is assuming the regions are sorted alphabetically. If they aren't, that complicates the situation even further, and in that case I'm 99% certain there isn't a mode to solve this, because situations where you want to find duplicate lines are typically well solved by cat myfile \| sort > myfile_sorted && … (potentially with the -u parameter), and there's not much point in writing some complicated algorithm around it. | |
| Jan 6 at 23:51 | comment | added | Hi-Angel | What you're asking for is just a diff with additional highlighting of removed/added regions. Emacs has some basic support for this that I recall seeing in "git conflicts" highlights, but it is very basic and poor, similar to git's git-highlight. If it's a one-off situation, I'd recommend just using diffr, as in diff -u file1 file2 \| diffr. Otherwise (unless there's a mode that solves this) you'd need to dig the highlighting functionality out of Emacs and write a minor mode around it. | |
| Jan 6 at 20:38 | history | edited | Jason Hunter | CC BY-SA 4.0 | added 89 characters in body |
| Jan 6 at 20:28 | history | asked | Jason Hunter | CC BY-SA 4.0 | |
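
The comments above mention two approaches: grouping exact duplicates (what sort-based tools do) and scoring near-duplicates by trigram overlap (what pg_trgm's similarity does). A minimal Python sketch of both follows. It is not an Emacs feature or anyone's posted solution. It assumes regions are separated by blank lines, and the function names (`split_regions`, `identical_regions`, `trigram_similarity`, `similar_regions`) are hypothetical. The trigram scoring only approximates pg_trgm, so its scores may differ from PostgreSQL's.

```python
import re
from collections import defaultdict
from itertools import combinations


def split_regions(text):
    """Split text into regions, assuming blank lines separate them."""
    return [r.strip() for r in re.split(r"\n\s*\n", text) if r.strip()]


def identical_regions(regions):
    """Return lists of indices of regions whose text is exactly equal."""
    groups = defaultdict(list)
    for index, region in enumerate(regions):
        groups[region].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


def trigrams(text):
    """Character trigrams per lowercased word, padded with two leading
    spaces and one trailing space (an approximation of pg_trgm)."""
    result = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = "  " + word + " "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def trigram_similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (Jaccard index)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def similar_regions(regions, threshold=0.5):
    """Return index pairs of regions scoring above the threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(regions)), 2)
        if trigram_similarity(regions[i], regions[j]) > threshold
    ]


if __name__ == "__main__":
    sample = "alpha beta\n\ngamma delta\n\nalpha beta\n"
    regions = split_regions(sample)
    print(identical_regions(regions))
    print(similar_regions(regions))
```

The pairwise comparison in `similar_regions` is quadratic in the number of regions, which is fine for one buffer but is the part a database index would replace for large collections.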