Post Locked by CommunityBot

occurred Oct 8, 2012 at 0:01

Post Migrated Away to programmers.stackexchange.com by CommunityBot, DaveRandom

occurred Oct 8, 2012 at 0:01

Post Closed as "off topic" by CommunityBot, DaveRandom

occurred Oct 8, 2012 at 0:01

deleted 30 characters in body

edited Oct 7, 2012 at 23:56

Michael Rich

Detecting Duplicate Content in PHP

I'm trying to detect whether an article or forum post is a duplicate of an entry already in the database, a little like this site does.

I've given this some thought and concluded that someone who duplicates content will do so in one of three ways (listed from easiest to hardest to detect):

simply copy and paste the whole text
copy and paste parts of the text, merging them with their own
copy an article from an external site and pass it off as their own
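As a rough illustration of the first two cases, here is a minimal sketch, not a definitive implementation. It assumes a whole-text copy can be caught by hashing a normalized form of the text, and a partial copy by comparing overlapping word sequences ("shingles") with a Jaccard index. The function names and the shingle size of 3 are hypothetical choices, not anything from the post.

```php
<?php
declare(strict_types=1);

// Hypothetical helpers for illustration only.

/** Lowercase (ASCII only), replace punctuation with spaces, collapse whitespace. */
function normalize_text(string $text): string
{
    $collapsed = preg_replace('/[^\p{L}\p{N}]+/u', ' ', strtolower($text));
    return trim($collapsed ?? '');
}

/** Hash of the normalized text; equal hashes suggest a whole-text copy. */
function fingerprint(string $text): string
{
    return sha1(normalize_text($text));
}

/** Set of overlapping $size-word sequences, stored as array keys. */
function shingles(string $text, int $size = 3): array
{
    $words = explode(' ', normalize_text($text));
    $set = [];
    for ($i = 0; $i + $size <= count($words); $i++) {
        $set[implode(' ', array_slice($words, $i, $size))] = true;
    }
    return $set;
}

/** Jaccard index of two key sets: shared keys divided by all distinct keys. */
function jaccard(array $a, array $b): float
{
    $union = count($a + $b);
    if ($union === 0) {
        return 0.0;
    }
    return count(array_intersect_key($a, $b)) / $union;
}
```

A fingerprint could be stored in an indexed column so the exact-copy check is a single lookup. The shingle comparison is costlier and would probably only be run against a short list of candidates.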

I store information about each article in (1) a statistics table and (2) a keywords table.

(1) Statistics Table

The following statistics are stored about the textual content (much like this post):

text length
letter count
word count
sentence count
average words per sentence
automated readability index
gunning fog score
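For illustration, several of these statistics could be computed along the following lines. This is only a sketch: the post does not say how sentences or words are delimited, so the splitting rules and the function name `text_statistics` are assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Compute a few of the listed statistics for $text.
 * Sentence and word boundaries are rough heuristics, assumed for illustration.
 */
function text_statistics(string $text): array
{
    $letters = preg_match_all('/[a-z]/i', $text);
    $words = str_word_count($text);

    // Split on runs of '.', '!' or '?' and ignore fragments that are only whitespace.
    $fragments = preg_split('/[.!?]+/', $text) ?: [];
    $sentences = count(array_filter($fragments, fn($s) => trim($s) !== ''));

    return [
        'text_length'       => strlen($text), // bytes, not characters, for multibyte text
        'letter_count'      => (int) $letters,
        'word_count'        => $words,
        'sentence_count'    => $sentences,
        'word_per_sentence' => $sentences > 0 ? $words / $sentences : 0.0,
    ];
}
```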

(2) Keywords Table

The keywords are generated by excluding a huge list of stop words (common words), e.g., 'the', 'a', 'of', 'to', etc.
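A minimal sketch of that keyword step might look like this. The post does not say how keywords are ranked, so ranking by frequency is an assumption, as are the function name `extract_keywords` and the tiny stop list in the usage example.

```php
<?php
declare(strict_types=1);

/**
 * Return the $limit most frequent words in $text that are not stop words.
 * Ranking by frequency is an assumption; the post does not specify it.
 */
function extract_keywords(string $text, array $stopWords, int $limit = 3): array
{
    $words = preg_split('/[^a-z0-9\']+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $stop = array_flip($stopWords);

    $counts = [];
    foreach ($words as $word) {
        if (!isset($stop[$word])) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }

    arsort($counts); // highest count first
    return array_map('strval', array_slice(array_keys($counts), 0, $limit));
}

$keywords = extract_keywords(
    'Killed, killed, killed: the officers and officers of the police.',
    ['the', 'and', 'of']
);
print_r($keywords);
```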

Sample Data

text_length, 3963
letter_count, 3052
word_count, 684
sentence_count, 33
word_per_sentence, 21
gunning_fog, 11.5
auto_read_index, 9.9
keyword 1, killed
keyword 2, officers
keyword 3, police
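As a sanity check on the sample data, the standard automated readability index formula, 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43, can be applied to the counts above. This sketch assumes `letter_count` is what feeds the "characters" term, which the post does not confirm; the function name is hypothetical.

```php
<?php
declare(strict_types=1);

/** Automated readability index from raw counts (standard published formula). */
function automated_readability_index(int $letters, int $words, int $sentences): float
{
    if ($words === 0 || $sentences === 0) {
        return 0.0; // avoid division by zero on empty text
    }
    return 4.71 * ($letters / $words) + 0.5 * ($words / $sentences) - 21.43;
}

// Counts taken from the sample data above.
$ari = automated_readability_index(3052, 684, 33);
printf("%.2f\n", $ari);                   // prints 9.95
printf("%d\n", (int) round(684 / 33));    // prints 21
```

Under that assumption the result is about 9.95, which agrees with the stored `auto_read_index` of 9.9 up to rounding, and 684 / 33 rounds to the stored `word_per_sentence` of 21.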

Note that whenever an article is updated, all of the above statistics are regenerated and may change completely.

How could I use the above information to detect whether an article being published for the first time already exists in the database?
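One possible way to combine the two tables, offered only as a sketch: use the statistics as a cheap pre-filter and then compare keyword lists for the articles that survive it. The statistics are not unique to an article, so a match here can only flag a candidate, not prove duplication. The array layout, function names, 5% tolerance and 0.6 overlap threshold are all hypothetical.

```php
<?php
declare(strict_types=1);

/** True when $a and $b differ by at most $tolerance, as a fraction of the larger value. */
function within_tolerance(float $a, float $b, float $tolerance): bool
{
    $largest = max(abs($a), abs($b));
    return $largest == 0.0 || abs($a - $b) / $largest <= $tolerance;
}

/** Share of keywords common to both lists (Jaccard index, 0.0 to 1.0). */
function keyword_overlap(array $a, array $b): float
{
    $union = array_unique(array_merge($a, $b));
    if (count($union) === 0) {
        return 0.0;
    }
    return count(array_intersect(array_unique($a), array_unique($b))) / count($union);
}

/**
 * Return ids of stored articles whose statistics and keywords both resemble $new.
 * Each article is ['id' => int, 'stats' => [name => number], 'keywords' => string[]].
 */
function find_candidates(array $new, array $stored, float $tolerance = 0.05, float $minOverlap = 0.6): array
{
    $matches = [];
    foreach ($stored as $article) {
        foreach (['word_count', 'sentence_count', 'letter_count'] as $field) {
            $a = (float) $new['stats'][$field];
            $b = (float) $article['stats'][$field];
            if (!within_tolerance($a, $b, $tolerance)) {
                continue 2; // statistics too different: skip the keyword comparison
            }
        }
        if (keyword_overlap($new['keywords'], $article['keywords']) >= $minOverlap) {
            $matches[] = $article['id'];
        }
    }
    return $matches;
}
```

In a real database the tolerance check would more likely be a `BETWEEN` range in the query, so only the keyword comparison runs in PHP.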

Thank You

deleted 46 characters in body

edited Oct 7, 2012 at 23:50

Michael Rich

asked Oct 7, 2012 at 23:41

Michael Rich

Detecting Duplicate Content in PHP

As the title says, I'm trying to detect whether an article or forum post (whatever) is a duplicate entry within the database, much like Stack Overflow does.

I've given this some thought, and I've come to the conclusion that there are three ways a "content duplicator" will go about copy/pasting their way to glory:

They'll simply copy and paste the whole text
They'll copy and paste only parts of the text and merge them with their own
They'll copy an article from an external site and pass it off as their own

I store information about arbitrary text in a (1) statistics table in my database and also in a (2) keywords table.

(1) Statistics

The following statistics are stored about the textual content (much like this post):

text length
letter count
word count
sentence count
average words per sentence
automated readability index
gunning fog score

(2) Keywords

The keywords are generated by excluding a huge list of stop words (common words), e.g., 'the', 'a', 'of', 'to', etc. In addition, words shorter than 1 or longer than 30 characters are excluded from becoming keywords.

My average set of statistics per article may look like this (I copied and pasted a BBC article for this):

text_length, 3963
letter_count, 3052
word_count, 684
sentence_count, 33
word_per_sentence, 21
gunning_fog, 11.5
auto_read_index, 9.9
keyword 1, killed
keyword 2, officers
keyword 3, police

Note that whenever an article is updated, all of the above statistics are regenerated and may change completely.

My problem comes when trying to use this information to deter spam. I wonder whether I should combine the statistics into some sort of weighted average and store the result as a field next to the article in the DB. I could then query against it, possibly within a range, e.g., statistics average 9837893.7373 > x AND < y. If the query returns a result, the article would be sent to an administrator for a final decision.
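To make the weighted-average idea concrete, here is a small sketch with made-up weights and a made-up margin; none of the names or numbers come from an existing implementation. It also shows one limitation: because several statistics are collapsed into a single number, two quite different articles can land on the same score, so a range hit is at best a reason for manual review.

```php
<?php
declare(strict_types=1);

/** Weighted sum of the chosen statistics. The weights are hypothetical. */
function weighted_score(array $stats, array $weights): float
{
    $score = 0.0;
    foreach ($weights as $field => $weight) {
        $score += $weight * $stats[$field];
    }
    return $score;
}

/** True when $score lies within $margin of $target (the "x AND y" range check). */
function in_range(float $score, float $target, float $margin): bool
{
    return abs($score - $target) <= $margin;
}

$weights = ['word_count' => 1.0, 'sentence_count' => 10.0];

$first  = weighted_score(['word_count' => 684, 'sentence_count' => 33], $weights);
$second = weighted_score(['word_count' => 704, 'sentence_count' => 31], $weights);

// Different statistics, identical score: both evaluate to 1014.
var_dump($first === $second);           // bool(true)
var_dump(in_range($second, $first, 5.0)); // bool(true)
```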

I'm really at a loss as to how to go about this. As for detecting external duplicate content, e.g., on another domain, I don't even know where to begin. Can anyone help?

php duplicates spam
