DATA MINING
KESHAV MAHAVIDYALAYA - SANJEEV, AJEETH, VINIT, SNEHA
INTRODUCTION
• Spam constitutes roughly 55% of all emails, posing a significant challenge to communication.
• It inundates mailboxes with unwanted advertisements and junk, consuming users' time and risking the deletion of legitimate emails.
• Its economic impact has led to legislative measures in some countries.
• Text classification, essential for organizing and categorizing text, distinguishes between spam and legitimate messages.
• Machine learning automates this process efficiently by learning associations from pre-labeled data.
• Feature extraction transforms text into numerical representations, aiding accurate classification.
• ML techniques enhance the precision and speed of analyzing big data, which is crucial for informing business decisions and automating processes.
• This project employs machine learning to detect spam messages without explicit programming.
• Algorithms learn classification rules from pre-labeled data and predict the category of unknown texts based on a majority vote.
PROBLEM STATEMENT
• Spammers are in a continuous war with e-mail service providers. Providers implement various spam filtering methods to retain their users, while spammers continuously change their patterns and use various embedding tricks to get through the filters. These filters can never be too aggressive, because a slight misclassification may cause the consumer to lose important information. A rigid filtering method with additional reinforcements is needed to tackle the problem.
• To combat the ever-evolving tactics of spammers, email service providers must continuously adapt their spam filtering strategies. By combining sophisticated techniques such as content analysis, sender verification, and machine learning algorithms, providers can effectively block unwanted messages while allowing legitimate emails to reach their recipients.
OBJECTIVES
The objectives of this project are:
• To create an ensemble algorithm for the classification of spam with the highest possible accuracy.
• To study how to use machine learning for spam detection.
• To study how natural language processing (NLP) techniques can be implemented in spam detection.
• To provide the user with insights into a given text, leveraging the created algorithm and NLP.
• To improve spam detection overall for a more secure online experience.
WORKFLOW: (pipeline diagram)
DATA DESCRIPTION
Dataset: UCI SMS Spam Collection.
Source: Kaggle.
Description: Includes a subset of 3,375 randomly chosen ham SMS messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the NUS. The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.
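As an illustration, a minimal sketch of loading the dataset with pandas. The file name spam.csv and the latin-1 encoding are assumptions based on the common Kaggle release of this collection:

```python
import pandas as pd

# Load the Kaggle SMS Spam Collection; file name and encoding are
# assumptions based on the usual Kaggle release of this dataset.
df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]
df.columns = ["label", "text"]                        # v1 = ham/spam, v2 = raw text
df["label"] = df["label"].map({"ham": 0, "spam": 1})  # numeric labels for training
print(df.shape, df["label"].value_counts().to_dict())
```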
DATA PROCESSING:
• Dataset cleaning
• Dataset merging
TEXTUAL DATA PROCESSING:
• Tag removal
• Sentencing, tokenization
• Stop-word removal
• Lemmatization
• Sentence formation
FEATURE VECTOR FORMATION:
• The texts are converted into feature vectors (numerical data) using the words present in all the texts combined.
• This is done using count vectorization; the textual preprocessing itself uses the NLTK library (a sketch of these steps follows).
• The feature vectors can be formed using two language models: Bag of Words and Term Frequency-Inverse Document Frequency.
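A minimal sketch of the textual processing steps with NLTK. The exact cleaning rules (dropping non-alphabetic tokens, the tag-removal regex) are assumptions, not the project's confirmed pipeline:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time resource downloads (quiet no-ops if already present).
for pkg in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)                 # tag removal
    tokens = nltk.word_tokenize(text.lower())            # tokenization
    tokens = [t for t in tokens if t.isalpha()]          # drop punctuation/numbers
    tokens = [t for t in tokens if t not in stop_words]  # stop-word removal
    tokens = [lemmatizer.lemmatize(t) for t in tokens]   # lemmatization
    return " ".join(tokens)                              # sentence formation

print(preprocess("WINNER!! You have won a free ticket. Call now!"))
```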
BAG OF WORDS
Bag of Words is a language model used mainly in text classification. It represents text in a numerical form. The two things required for Bag of Words are:
• A vocabulary of words known to us.
• A way to measure the presence of those words.
Example: a few lines from the book "A Tale of Two Cities" by Charles Dickens:
"It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness,"
The unique words here (ignoring case and punctuation) are:
["it", "was", "the", "best", "of", "times", "worst", "age", "wisdom", "foolishness"]
The next step is scoring the words present in every document.
After scoring, the four lines from the above stanza can be represented in vector form as:
"It was the best of times" = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
"it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
"it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY
• The term frequency-inverse document frequency of a word is a measurement of that word's importance.
• It compares the recurrence of words across the collection of documents and calculates a score.
• Terminology for the formulae below: t = term (word), d = document, N = count of documents.
The TF-IDF process consists of the steps listed below.
i) Term Frequency
• The number of times a particular word appears in a document is called term frequency.
tf(t, d) = count of t in d / number of words in d
ii) Document Frequency
• Document frequency is the count of documents in which the word appears. We count one instance of a word per document; it doesn't matter if the word is present multiple times.
df(t) = occurrence of t in documents
iii) Inverse Document Frequency
• IDF (Inverse Document Frequency) is the inverse of document frequency.
• It evaluates the significance of a term by considering its informational contribution.
• Common terms like "are," "if," and "a" provide minimal document insight.
• IDF diminishes the importance of frequently occurring terms and boosts rare ones.
idf(t) = N / df(t)
Finally, TF-IDF is calculated by combining the term frequency and inverse document frequency:
tf-idf(t, d) = tf(t, d) * log(N / (df + 1))
The process can be explained using the following example:
Document 1: It is going to rain today.
Document 2: Today I am not going outside.
Document 3: I am going to watch the season premiere.
The Bag of Words counts for the above sentences (partial list) are:
[going: 3, to: 2, today: 2, i: 2, am: 2, it: 1, is: 1, rain: 1]
• TF-IDF combines term frequency (TF) and inverse document frequency (IDF).
• TF represents the frequency of a word in a document, while IDF evaluates its significance across the collection.
• By assigning weights to words, TF-IDF aids text mining, information retrieval, and natural language processing.
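A small sketch that computes these quantities directly from the formulas above, in pure Python on the three example documents. The lowercase-split tokenization is an assumption:

```python
import math

docs = [
    "It is going to rain today",
    "Today I am not going outside",
    "I am going to watch the season premiere",
]
tokenized = [d.lower().split() for d in docs]
N = len(tokenized)

def tf(t, d):
    return d.count(t) / len(d)                   # count of t in d / words in d

def df(t):
    return sum(1 for d in tokenized if t in d)   # documents containing t

def tf_idf(t, d):
    return tf(t, d) * math.log(N / (df(t) + 1))  # the smoothed formula above

for word in ("going", "rain", "today"):
    print(word, [round(tf_idf(word, d), 3) for d in tokenized])
```

Note that with this smoothed formula, a word appearing in every document (such as "going") gets a slightly negative score: this is how the equation downweights ubiquitous terms.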
Term Frequency: (table of TF values for each word in the three documents)
Then finding the inverse document frequency:
Inverse Document Frequency: (table of IDF values for each word)
Applying the final equation gives the TF-IDF values (table of TF-IDF scores for each word). Using the above two language models, the complete dataset has been converted into two kinds of vectors and stored in a CSV file for easy access and minimal processing.
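A minimal sketch of forming both kinds of feature vectors and saving them to CSV. scikit-learn's CountVectorizer and TfidfVectorizer are assumptions; the deck names the techniques but not the exact implementation:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["free entry in a weekly competition", "see you at home tonight"]

# Bag of Words: raw term counts over the combined vocabulary.
bow = CountVectorizer()
X_bow = bow.fit_transform(texts)

# TF-IDF: counts reweighted by inverse document frequency.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(texts)

# Store the dense vectors as CSV for easy access, as described above.
pd.DataFrame(X_bow.toarray(), columns=bow.get_feature_names_out()).to_csv("bow.csv", index=False)
pd.DataFrame(X_tfidf.toarray(), columns=tfidf.get_feature_names_out()).to_csv("tfidf.csv", index=False)
```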
MACHINE LEARNING
• Machine learning is a process in which the computer performs certain tasks without being given explicit instructions. In this case the models take the training data and train on it.
• New, unknown data is then processed based on the rules derived from the training data.
• After the count vectorization and TF-IDF stages of the workflow, the data is in vector (numerical) form, which is used for training and testing the models.
• For our study, various machine learning models are compared to determine which method is most suitable for this task.
• The models used for the study are Naïve Bayes, K-Nearest Neighbors, and Support Vector Machine.
ALGORITHMS
A combination of three algorithms is used for the classification.
NAÏVE BAYES CLASSIFIER
A Naïve Bayes classifier is a supervised probabilistic machine learning model used for classification tasks. The main principle behind this model is Bayes' theorem.
Bayes' Theorem: Naive Bayes is a classification technique based on Bayes' theorem, with the assumption that all the features that predict the target value are independent of each other. It calculates the probability of each class and then picks the one with the highest probability.
P(A|B) = P(B|A) P(A) / P(B)
P(A|B) is the probability of hypothesis A given the data B. This is called the posterior probability.
P(B|A) is the probability of data B given that hypothesis A was true.
P(A) is the probability of hypothesis A being true (regardless of the data). This is called the prior probability of A.
P(B) is the probability of the data (regardless of the hypothesis).
Naïve Bayes classifiers are mostly used for text classification. The limitation of the Naïve Bayes model is that it treats every word in a text as independent and equally important, but every word cannot be treated as equally important: articles and nouns do not carry the same weight in language.
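A minimal sketch of a Naïve Bayes text classifier, assuming scikit-learn's MultinomialNB over TF-IDF vectors; the toy messages and labels are illustrative only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free prize now", "are we meeting for lunch", "free cash claim now"]
labels = [1, 0, 1]  # 1 = spam, 0 = ham (toy data)

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# MultinomialNB applies Bayes' theorem under the word-independence assumption.
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["free prize waiting"])))  # likely [1]
```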
K-NEAREST NEIGHBORS
• KNN is a supervised classification algorithm. All the data points are assumed to lie in an n-dimensional space, and the category of a new data point is determined by the majority category among its neighbors.
• Euclidean distance is used to determine the distance between points. The distance between two points is calculated as
d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
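A small pure-Python sketch of this neighbor search on toy 2-D points: the distances to all training points are computed, the k closest are taken, and the majority label wins, exactly as the next bullets describe. The data and k value are illustrative:

```python
import math
from collections import Counter

train = [((1.0, 1.0), "ham"), ((1.2, 0.8), "ham"),
         ((4.0, 4.2), "spam"), ((4.1, 3.9), "spam")]

def knn_predict(point, k=3):
    # Euclidean distance from the unknown point to every training point.
    by_distance = sorted(train, key=lambda p: math.dist(point, p[0]))
    # Majority vote among the k closest neighbors.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((3.8, 4.0)))  # "spam"
```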
• The distances between the unknown point and all the others are calculated. Given a value of K, the k closest neighbors are determined, and the category to which the majority of those neighbors belong is assigned to the unknown point.
• If the data contains up to 3 features, the plot can be visualized. KNN is fairly slow compared to other algorithms such as SVM, as it needs to compute the distance to all points to find the closest neighbors of a given point.
SUPPORT VECTOR MACHINES (SVM)
SVM is a machine learning algorithm for classification. Decision boundaries are drawn between the categories, and the category of a point is determined by which side of the boundary it falls on.
Support Vectors: The vectors closest to the boundaries are called support vectors/planes. If there are n categories, there will be n+1 support vectors. These are called vectors instead of points because they are assumed to start from the origin. The distance between the support vectors is called the margin. We want the margin to be as wide as possible because it yields better results.
SVM uses three types of kernels to create boundaries:
Linear: used if the data is linearly separable.
Poly: used if the data is not linearly separable; it maps the data into 3-dimensional space.
Radial (RBF): the default kernel in SVM; it maps the data into an infinite-dimensional space.
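A short sketch of fitting an SVM with each kernel, assuming scikit-learn's SVC on toy 2-D data; all other parameters are left at their defaults:

```python
from sklearn.svm import SVC

X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]]
y = [0, 0, 1, 1]

for kernel in ("linear", "poly", "rbf"):  # rbf is SVC's default kernel
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, clf.predict([[0.9, 1.0]]))  # expected class 1
```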
• If the data is 2-dimensional, the boundaries are lines. If the data is 3-dimensional, the boundaries are planes. If the data has more than 3 dimensions, the boundaries are called hyperplanes.
• An SVM depends mainly on its decision boundaries for predictions. It doesn't compare a new point to all the other data to make a prediction, so SVMs tend to be quick with predictions.
RESULTS
MODEL SELECTION
• To select the best language model, the data was converted into both types of vectors and the models were then tested to determine the best one for classifying spam.
• The results from the individual models are presented in the experimentation section under methodology. The results from the models are compared next.
Model         Accuracy   Precision   F1 Score
Naive Bayes   95.94%     100%        97.91%
KNN           90.04%     100%        94.92%
SVM           97.29%     97.41%      97.35%
• From these results it is clear that TF-IDF proves better than BoW in every model tested. Hence TF-IDF has been selected as the primary language model for textual data conversion in the feature vector formation step.
COMPARISON
The results from the proposed model have been compared with each of the individual models in tabular form to illustrate the differences clearly.
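A hedged sketch of the proposed ensemble: the three classifiers combined by majority vote, assuming scikit-learn's VotingClassifier. The exact combination scheme is an assumption; the deck says only that the three algorithms are combined. The toy texts and labels are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

texts = ["win cash now", "free prize claim", "lunch at noon?",
         "see you tonight", "urgent prize waiting", "meeting moved to friday"]
labels = [1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = ham (toy data)

X = TfidfVectorizer().fit_transform(texts)

# Hard voting: each classifier votes and the majority label wins,
# matching the majority-vote idea from the introduction.
ensemble = VotingClassifier(estimators=[
    ("nb", MultinomialNB()),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
    ("svm", SVC(kernel="rbf")),
], voting="hard").fit(X, labels)

print(ensemble.score(X, labels))  # training accuracy on the toy set
```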
SUMMARY
• There are two main tasks in the project implementation: language model selection, to complete the textual processing phase, and creation of the proposed model from the individual algorithms. These two tasks require comparison with other models and selection of various parameters for better efficiency.
• During the language model selection phase, two models, Bag of Words and TF-IDF, were compared; from the results obtained it is evident that TF-IDF performs better.
CONCLUSION AND FUTURE SCOPE
Conclusion: From the results obtained we can conclude that an ensemble machine learning model is more effective at detecting and classifying spam than any individual algorithm.
We can also conclude that the TF-IDF (term frequency-inverse document frequency) language model is more effective than the Bag of Words model for spam classification when combined with several algorithms. Finally, spam detection can improve further if machine learning algorithms are combined and tuned to specific needs.
Project Scope
This project needs a coordinated scope of work:
i. Combine existing machine learning algorithms to form a better ensemble algorithm.
ii. Clean, process, and make use of the dataset for training and testing the created model.
iii. Analyse the texts and extract entities for presentation.
Limitations
This project has certain limitations:
i. It can only predict and classify spam, not block it.
ii. Analysis can be tricky for some alphanumeric messages, and it may struggle with entity detection.
iii. Since the dataset is reasonably large, it may take a few seconds to classify and analyse a message.
THANK YOU