18 events
when what by license comment
Oct 19, 2012 at 14:16 vote accept michael
Oct 8, 2012 at 11:48 answer added Inverted Llama timeline score: 1
Oct 8, 2012 at 4:50 answer added Peter Rowell timeline score: 6
Oct 8, 2012 at 3:01 history tweeted twitter.com/#!/StackProgrammer/status/255140837242052608
Oct 8, 2012 at 2:47 comment added jfrankcarr @rdlowrey - Levenshtein algorithms are what I used in a similar project I did in C#. I agree, it's a good place to start and may be enough.
Oct 8, 2012 at 2:30 history edited michael CC BY-SA 3.0
added 173 characters in body
Oct 8, 2012 at 1:55 history edited michael CC BY-SA 3.0
deleted 2 characters in body
Oct 8, 2012 at 1:48 history edited michael CC BY-SA 3.0
added 422 characters in body
Oct 8, 2012 at 1:14 comment added GlenPeterson Great idea to have a human check likely duplicates! You may be able to automatically decide that > 7 is a duplicate and < 6 is different and only have humans check scores of 6 or 7. I know that with spam identification there is a machine-doesn't-know-AND-human-doesn't-know-either category; a gray area between a near duplicate and an original work where the best you can do is make a somewhat arbitrary judgement call.
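A minimal sketch of the triage GlenPeterson describes, assuming a numeric similarity score where higher means "more alike". The function name `triage` and the labels it returns are hypothetical; only the cutoffs (above 7, below 6, and 6 to 7 for human review) come from the comment.

```python
def triage(score: float, low: float = 6, high: float = 7) -> str:
    """Sort a similarity score into one of three buckets.

    Assumption: the score scale and cutoffs follow the comment above;
    the bucket names are illustrative only.
    """
    if score > high:
        return "duplicate"      # confident enough to decide automatically
    if score < low:
        return "different"      # confident enough to decide automatically
    return "needs_review"       # gray area: hand it to a human


# Only the middle band ends up in the human review queue.
scores = {"essay_a": 8.2, "essay_b": 6.5, "essay_c": 3.0}
review_queue = [name for name, s in scores.items() if triage(s) == "needs_review"]
```

Keeping the cutoffs as parameters makes it easy to widen or narrow the human-review band once real data shows where the gray area actually sits.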
Oct 8, 2012 at 1:03 history edited Adam Lear CC BY-SA 3.0
appended answer 167912 as supplemental
Oct 8, 2012 at 0:34 answer added gam3 timeline score: 4
Oct 8, 2012 at 0:22 answer added Roland Mai timeline score: 4
Oct 8, 2012 at 0:07 comment added xdlox Also, before it was migrated from SO, this was tagged [php], so you might check out PHP's native levenshtein function.
Oct 8, 2012 at 0:05 comment added xdlox There are many directions in which this kind of analysis can go. People write entire books on this sort of topic. If your goal is to determine "relative closeness" you really have little choice but to dig into what's called Natural Language Processing and Machine Learning. That's what computer scientists call it, but it's really just advanced statistical analysis. A good starting point might be looking at Levenshtein distances, but "dumb" stats like word/sentence counts are likely going to do very little for you.
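For readers unfamiliar with the Levenshtein distance mentioned in this comment, here is a rough Python sketch of the standard dynamic-programming version. The `similarity` helper, which scales the distance to a 0..1 score, is an assumption added for illustration and is not something the commenters proposed.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum single-character inserts, deletes, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical helper: 1.0 for identical texts, approaching 0.0 as they diverge."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

This runs in time proportional to the product of the two lengths, so on long texts it may be more practical to compare word tokens or short excerpts rather than whole documents character by character.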
Oct 8, 2012 at 0:03 history edited Thomas Owens CC BY-SA 3.0
deleted 13 characters in body; edited tags; edited title
Oct 8, 2012 at 0:01 history migrated from stackoverflow.com (revisions)
Oct 8, 2012 at 0:01 comment added James P. If it's an exact match you could simply set a field to unique. If not, you'd need to decide at what point a text can be considered a copy or a closely derived work.
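One way the exact-match case could look, sketched with Python's built-in `sqlite3`. The table and column names are hypothetical, and putting the unique constraint on a SHA-256 hash of the text rather than on the text itself is an assumption made here to keep the indexed value short.

```python
import hashlib
import sqlite3


def make_store() -> sqlite3.Connection:
    """Create an in-memory table whose hash column must be unique."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE submissions ("
        " id INTEGER PRIMARY KEY,"
        " body TEXT NOT NULL,"
        " body_hash TEXT NOT NULL UNIQUE)"  # the database rejects exact repeats
    )
    return conn


def add_submission(conn: sqlite3.Connection, body: str) -> bool:
    """Insert the text; return False if an identical text is already stored."""
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO submissions (body, body_hash) VALUES (?, ?)",
                (body, digest),
            )
    except sqlite3.IntegrityError:
        return False
    return True
```

As the comment notes, this only catches byte-for-byte copies; a single changed character produces a different hash, which is where the fuzzier measures discussed in the other comments come in.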
Oct 7, 2012 at 23:41 history asked michael CC BY-SA 3.0