Post Locked by CommunityBot
Post Migrated Away to programmers.stackexchange.com by CommunityBot, DaveRandom
Post Closed as "off topic" by CommunityBot, DaveRandom

Detecting Duplicate Content in PHP

I'm trying to detect whether an article or forum post is a duplicate entry within the database, a little like this site does.

I've given this some thought and concluded that someone duplicating content will do so in one of three ways (roughly in increasing order of detection difficulty):

  1. copy and paste the whole text
  2. copy and paste parts of the text, merging them with their own writing
  3. copy an article from an external site and pass it off as their own

I store information about each article in a (1) statistics table and a (2) keywords table.

(1) Statistics Table

The following statistics are stored about the textual content (much like this post):

  1. text length
  2. letter count
  3. word count
  4. sentence count
  5. average words per sentence
  6. Automated Readability Index
  7. Gunning fog score
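These statistics are all derivable from the raw text. As a rough sketch of how they might be computed (not the author's actual code; the syllable counter needed for the Gunning fog index is approximated here with a crude vowel-group heuristic):

```php
<?php
// Sketch: compute the statistics-table fields for a plain-text article.

function syllable_estimate(string $word): int {
    // Very rough heuristic: count groups of consecutive vowels.
    preg_match_all('/[aeiouy]+/i', $word, $m);
    return max(1, count($m[0]));
}

function text_statistics(string $text): array {
    $letters   = preg_match_all('/[a-z]/i', $text);
    $words     = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $sentences = preg_split('/[.!?]+\s/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    $wordCount     = count($words);
    $sentenceCount = max(1, count($sentences));
    $wps           = $wordCount / $sentenceCount;

    // Words of three or more syllables count as "complex" for Gunning fog.
    $complex = count(array_filter($words, fn($w) => syllable_estimate($w) >= 3));

    return [
        'text_length'       => strlen($text),
        'letter_count'      => $letters,
        'word_count'        => $wordCount,
        'sentence_count'    => $sentenceCount,
        'word_per_sentence' => round($wps, 1),
        // ARI = 4.71 * (letters/words) + 0.5 * (words/sentences) - 21.43
        'auto_read_index'   => round(4.71 * ($letters / max(1, $wordCount)) + 0.5 * $wps - 21.43, 1),
        // Gunning fog = 0.4 * (words/sentences + 100 * complex_words/words)
        'gunning_fog'       => round(0.4 * ($wps + 100 * $complex / max(1, $wordCount)), 1),
    ];
}
```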

(2) Keywords Table

The keywords are generated by excluding a long list of stop words (common words such as 'the', 'a', 'of', and 'to').
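The extraction itself could be sketched like this (the stop-word list here is a tiny placeholder for the "huge list" described above, and the length limits are illustrative):

```php
<?php
// Sketch: extract the top keywords from a text by lowercasing, stripping
// punctuation, dropping stop words, and ranking the rest by frequency.

function extract_keywords(string $text, int $limit = 3): array {
    $stopWords = ['the', 'a', 'an', 'of', 'to', 'and', 'in', 'is', 'it', 'on', 'when', 'were'];

    $clean = strtolower(preg_replace('/[^a-z\s]/i', ' ', $text));
    $words = preg_split('/\s+/', trim($clean), -1, PREG_SPLIT_NO_EMPTY);

    $words = array_filter($words, function ($w) use ($stopWords) {
        // Drop stop words and implausibly short or long tokens.
        return !in_array($w, $stopWords, true)
            && strlen($w) >= 2 && strlen($w) <= 30;
    });

    $freq = array_count_values($words);   // word => occurrence count
    arsort($freq);                        // most frequent first

    return array_slice(array_keys($freq), 0, $limit);
}
```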

Sample Data (taken from a copied BBC article)

  • text_length: 3963
  • letter_count: 3052
  • word_count: 684
  • sentence_count: 33
  • word_per_sentence: 21
  • gunning_fog: 11.5
  • auto_read_index: 9.9
  • keyword 1: killed
  • keyword 2: officers
  • keyword 3: police

Note that once an article is updated, all of the above statistics are regenerated and may end up with completely different values.

How could I use the above information to detect whether an article being published for the first time already exists within the database?

One idea I've had is to combine the statistics into some sort of weighted average, store the result as a field next to the article in the database, and query against it within a range (e.g., average > x AND average < y), sending any matches to an administrator for a final decision. I'm also unsure where to begin with detecting duplicates of external content, e.g., an article copied from another domain.

Thank you.
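One possible answer to the question above (a sketch, not a definitive method) is to score a new article against each stored one on two axes: how close the raw statistics are, and how much the keyword sets overlap (Jaccard similarity). The thresholds below are illustrative guesses that would need tuning against real data:

```php
<?php
// Sketch: flag a candidate article as a likely duplicate of an existing one
// using the stored statistics and keywords.

function jaccard(array $a, array $b): float {
    // |intersection| / |union| of two keyword sets.
    $union = array_unique(array_merge($a, $b));
    $inter = array_intersect($a, $b);
    return count($union) === 0 ? 0.0 : count($inter) / count($union);
}

function looks_like_duplicate(array $new, array $existing): bool {
    // Relative difference in word count; near-identical copies stay small.
    $wcDiff = abs($new['word_count'] - $existing['word_count'])
            / max(1, $existing['word_count']);

    $overlap = jaccard($new['keywords'], $existing['keywords']);

    // Flag when the texts are statistically close AND share most keywords.
    return $wcDiff < 0.15 && $overlap > 0.6;
}
```

Anything flagged this way could be queued for administrator review rather than rejected outright, which keeps false positives cheap.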




