I'm trying to detect whether an article or forum post is a duplicate of an entry already in the database, a little like this site does.
I've given this some thought and concluded that someone who duplicates content will do so in one of three ways (listed in increasing order of difficulty to detect):
- simply copy and paste the whole text
- copy and paste parts of the text, merging them with their own
- copy an article from an external site and pass it off as their own
I store information about each article in (1) a statistics table and (2) a keywords table.
(1) Statistics Table
The following statistics are stored about the textual content (much like this post):
- text length
- letter count
- word count
- sentence count
- average words per sentence
- automated readability index
- gunning fog score
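For context, here is a rough sketch of how statistics like these might be computed. It is not my actual implementation: the function names, the regex-based word and sentence splitting, and the vowel-group syllable heuristic are all assumptions made for illustration. The two readability scores use the standard published formulas, with letters standing in for characters in the automated readability index.

```python
import re


def count_syllables(word):
    # Rough heuristic (assumption): one syllable per group of consecutive vowels.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def text_statistics(text):
    # Words are runs of letters/apostrophes; sentences end at '.', '!' or '?'.
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    letter_count = sum(ch.isalpha() for ch in text)
    word_count = len(words)
    sentence_count = len(sentences)

    if word_count and sentence_count:
        words_per_sentence = word_count / sentence_count
        # "Complex" words for Gunning fog: three or more syllables.
        complex_words = sum(1 for w in words if count_syllables(w) >= 3)
        auto_read_index = (4.71 * letter_count / word_count
                           + 0.5 * words_per_sentence - 21.43)
        gunning_fog = 0.4 * (words_per_sentence
                             + 100 * complex_words / word_count)
    else:
        words_per_sentence = auto_read_index = gunning_fog = 0.0

    return {
        "text_length": len(text),
        "letter_count": letter_count,
        "word_count": word_count,
        "sentence_count": sentence_count,
        "word_per_sentence": words_per_sentence,
        "auto_read_index": auto_read_index,
        "gunning_fog": gunning_fog,
    }


stats = text_statistics("The cat sat. The dog ran!")
```

Because every value is derived from the raw text, even a small edit to an article shifts several of them at once.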
(2) Keywords Table
The keywords are generated by excluding a huge list of stop words (common words) such as 'the', 'a', 'of' and 'to'.
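Roughly, the keyword step looks like the sketch below. The tiny `STOP_WORDS` set stands in for the real, much larger list, and ranking the remaining words by frequency (as well as the `extract_keywords` name and `top_n` parameter) is an assumption for illustration only.

```python
import re
from collections import Counter

# Stand-in for the real stop-word list, which is far larger (assumption).
STOP_WORDS = {"the", "a", "of", "to", "and", "in", "is", "were", "by"}


def extract_keywords(text, top_n=3):
    # Lowercase the text, split into words, and drop the stop words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    # Keep the most frequent remaining words as the article's keywords.
    return [word for word, _ in counts.most_common(top_n)]


keywords = extract_keywords(
    "The police officers were killed. Police and officers of the police."
)
```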
Sample Data
- text_length, 3963
- letter_count, 3052
- word_count, 684
- sentence_count, 33
- word_per_sentence, 21
- gunning_fog, 11.5
- auto_read_index, 9.9
- keyword 1, killed
- keyword 2, officers
- keyword 3, police
Note that whenever an article is updated, all of the above statistics are regenerated and could change completely.
How could I use the above information to detect whether an article that is being published for the first time already exists in the database?
Thank you.