Timeline for What algorithms can I use to detect if articles or posts are duplicates?

Current License: CC BY-SA 3.0

18 events

when	what		by	license	comment
Oct 19, 2012 at 14:16	vote	accept	michael
Oct 8, 2012 at 11:48	answer	added	Inverted Llama		timeline score: 1
Oct 8, 2012 at 4:50	answer	added	Peter Rowell		timeline score: 6
Oct 8, 2012 at 3:01	history	tweeted			twitter.com/#!/StackProgrammer/status/255140837242052608
Oct 8, 2012 at 2:47	comment	added	jfrankcarr		@rdlowrey - Levenshtein algorithms are what I used in a similar project I did in C#. I agree, it's a good place to start and may be enough.
Oct 8, 2012 at 2:30	history	edited	michael	CC BY-SA 3.0	added 173 characters in body
Oct 8, 2012 at 1:55	history	edited	michael	CC BY-SA 3.0	deleted 2 characters in body
Oct 8, 2012 at 1:48	history	edited	michael	CC BY-SA 3.0	added 422 characters in body
Oct 8, 2012 at 1:14	comment	added	GlenPeterson		Great idea to have a human check likely duplicates! You may be able to automatically decide that > 7 is a duplicate and < 6 is different and only have humans check scores of 6 or 7. I know that with spam identification there is a machine-doesn't-know-AND-human-doesn't-know-either category; a gray area between a near duplicate and an original work where the best you can do is make a somewhat arbitrary judgement call.
Oct 8, 2012 at 1:03	history	edited	Adam Lear♦	CC BY-SA 3.0	appended answer 167912 as supplemental
Oct 8, 2012 at 0:34	answer	added	gam3		timeline score: 4
Oct 8, 2012 at 0:22	answer	added	Roland Mai		timeline score: 4
Oct 8, 2012 at 0:07	comment	added	xdlox		Also, before it was migrated from SO this was tagged [php], so you might check out PHP's native levenshtein() function.
Oct 8, 2012 at 0:05	comment	added	xdlox		There are many directions in which this kind of analysis can go. People write entire books on this sort of topic. If your goal is to determine "relative closeness" you really have little choice but to dig into what's called Natural Language Processing and Machine Learning. That's what computer scientists call it, but it's really just advanced statistical analysis. A good starting point might be looking at Levenshtein distances, but "dumb" stats like word/sentence counts are likely going to do very little for you.
Oct 8, 2012 at 0:03	history	edited	Thomas Owens♦	CC BY-SA 3.0	deleted 13 characters in body; edited tags; edited title
Oct 8, 2012 at 0:01	history	migrated			from stackoverflow.com (revisions)
Oct 8, 2012 at 0:01	comment	added	James P.		If it's an exact match you could simply set a field to unique. If not, you'd need to decide at what point a text can be considered a copy or a closely derived work.
Oct 7, 2012 at 23:41	history	asked	michael	CC BY-SA 3.0
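
The comments above suggest two ideas: start with Levenshtein distance, and let the machine decide only the clear cases while a human checks the gray area. Below is a minimal Python sketch of how those two ideas might fit together. It is not taken from any of the answers. The function names, the normalization by the longer text's length, and the 0.9 / 0.6 thresholds are all assumptions chosen for illustration, and would need tuning against real articles.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def triage(a: str, b: str, duplicate_at: float = 0.9, distinct_below: float = 0.6) -> str:
    """Hypothetical thresholds: auto-decide clear cases, send the rest to a human."""
    score = similarity(a, b)
    if score >= duplicate_at:
        return "duplicate"
    if score < distinct_below:
        return "different"
    return "needs human review"
```

For example, `triage("abcdefghij", "abcdefgxyz")` has an edit distance of 3 over 10 characters, which falls between the two assumed thresholds and so goes to a human. This pure-Python version is quadratic in text length, so for full articles it would likely be applied to shorter fingerprints or replaced by a faster method.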
