International Research Journal of Engineering and Technology (IRJET)
e-ISSN: 2395-0056 | p-ISSN: 2395-0072 | Volume: 10 Issue: 04 | Apr 2023 | www.irjet.net
© 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 325

Case Study: Prediction on Iris Dataset Using KNN Algorithm

Shreyas Tayade1, Rakhi Gupta2, Deval Kherde3, Chaitanya Ubale4

1Student, Sipna College of Engineering and Technology, Maharashtra, India
2Assistant Professor, Sipna College of Engineering and Technology, Maharashtra, India
3Student, Sipna College of Engineering and Technology, Maharashtra, India
4Student, Sipna College of Engineering and Technology, Maharashtra, India

Abstract - This case study applies the K-Nearest Neighbors (KNN) method to the well-known Iris dataset. The dataset contains 150 iris flower observations, 50 for each of three species: Setosa, Versicolor, and Virginica. The aim is to use four features (sepal length, sepal width, petal length, and petal width) to classify iris flowers into their respective species. KNN is a popular and straightforward classification technique that predicts the class of an observation from the classes of its nearest neighbors. The dataset is first divided into training and testing sets, and the features are then normalized so that they are all on the same scale. A KNN model is then trained with k=3, so that each prediction takes into account the three nearest neighbors of the observation. Lastly, the accuracy score is used to assess how well the model performs on the test set.

Key Words: K-Nearest Neighbors, sepal length, sepal width, petal length, petal width

1. INTRODUCTION

The Iris dataset, which contains measurements of three iris flower species, is well known in the machine learning field. It is a classic example of a problem that can be solved with supervised learning and has been widely used as a benchmark for classification systems. One straightforward and popular technique for this problem is K-Nearest Neighbors (KNN). In this case study, we use the Iris dataset and the KNN method to classify iris flowers according to four features: sepal length, sepal width, petal length, and petal width.

The main objective of this case study is to outline the fundamental steps for using KNN on the Iris dataset, from loading the data through to assessing the model's performance on unseen data. We load the dataset, divide it into training and testing sets, normalize the data, train the KNN model, and assess its performance.

Fig-1: Dataset

For those who are new to machine learning, the Iris dataset serves as a good example of a classification problem that can be handled using KNN. The knowledge and methods obtained from this case study can be applied to other classification problems.

2. ATTRIBUTE SELECTION

Selecting the best attributes for KNN is key to attaining good classification accuracy on the Iris dataset. The four features in this dataset are sepal length, sepal width, petal length, and petal width.

Fig-2: Description of Data

One method for selecting the best attributes is to use feature selection approaches that rank the features according to their relevance to the classification task. This can be done with a variety of techniques, including feature selection based on mutual information, correlation, or tree-based models.
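As a rough illustration of correlation-based ranking, the sketch below scores each feature by its correlation ratio (eta squared), the share of a feature's variance explained by the class labels. This particular score is our assumption about one reasonable choice, not a method reported in this case study. The function names and the toy values are hypothetical and are not real Iris measurements.

```python
from statistics import mean


def correlation_ratio(values, labels):
    """Return eta squared: between-class sum of squares / total sum of squares."""
    overall = mean(values)
    total_ss = sum((v - overall) ** 2 for v in values)
    if total_ss == 0:
        return 0.0  # a constant feature carries no class information
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    between_ss = sum(len(g) * (mean(g) - overall) ** 2 for g in groups.values())
    return between_ss / total_ss


def rank_features(rows, labels, names):
    """Return (name, score) pairs sorted by descending correlation ratio."""
    scores = {
        name: correlation_ratio([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data (NOT real Iris rows): two features, three classes.
TOY_NAMES = ["feature_a", "feature_b"]
TOY_ROWS = [
    [1.0, 5.0], [1.2, 3.0], [0.8, 4.0],  # class "x"
    [4.0, 4.0], [4.2, 5.0], [3.8, 3.0],  # class "y"
    [6.0, 3.0], [6.2, 4.0], [5.8, 5.0],  # class "z"
]
TOY_LABELS = ["x"] * 3 + ["y"] * 3 + ["z"] * 3

ranking = rank_features(TOY_ROWS, TOY_LABELS, TOY_NAMES)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy data, feature_a takes clearly different values in each class while feature_b has the same mean in every class, so feature_a ranks first. A score near 1 means the classes are well separated on that feature, and a score near 0 means the feature says little about the class.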
An alternative strategy is to display the data with scatter plots or other visualization tools and assess how easily the classes can be distinguished on each feature. For instance, we can plot every pair of features and see which combination best separates the classes.

For the Iris dataset, petal length and petal width are recognised as offering the best separation between the three classes, as many studies and visualizations have shown. Consequently, these two features are frequently used as the best attributes for KNN on this dataset.

The best attributes can change with the problem and the dataset. It is therefore advisable to experiment with various attribute combinations and assess how well the KNN model performs on a validation or test set.

3. DATA VISUALIZATION

1. Scatter plot: A scatter plot visualizes the relationship between two continuous features, such as sepal length and sepal width. Scatter plots can reveal patterns or trends in the data, for example a linear relationship between two features, as well as any outliers.

2. Box plot: Box plots show how a continuous variable is distributed across groups. For instance, a box plot can display the distribution of sepal length for each iris species. The box represents the interquartile range (IQR), which contains the middle 50% of the data, while the whiskers extend to the most extreme data points within 1.5 times the IQR of the box. Box plots can be used to spot differences in how a measurement is distributed among groups.

3. Histogram: A histogram shows the distribution of a single continuous variable. For instance, the distribution of petal length in the Iris sample can be displayed using a histogram. Histograms can be used to determine the shape of the distribution (such as normal or skewed), as well as any possible outliers or gaps in the data.

4. Heatmap: A heatmap uses colour to show the magnitude of a value across two dimensions, such as the counts for each combination of two categorical (or binned) variables. For instance, petal length and petal width can be binned and the frequency of each combination displayed for each iris species. Heatmaps can be used to find patterns in the data, such as whether particular combinations of values are more prevalent in one species than in another.

5. Pie chart: A pie chart shows the proportion of each group within a single categorical variable. For instance, a pie chart can display the percentage of each species in the Iris dataset. Pie charts are helpful for comparing the proportions of groups and for visualizing the distribution of a categorical variable.

These are only a few of the many visualizations that can be made from the Iris dataset. Data visualization helps researchers and practitioners gain insights from the data, eventually resulting in better decision-making. So, whether for the Iris dataset or another dataset, data visualization is an essential stage in the data analysis process.

4. MODEL COMPARISON

1. Logistic Regression: Logistic regression is a linear classification model that predicts the probability that an instance belongs to a particular class. It assumes a linear relationship between the features and the log-odds of the target class. The approach is straightforward and easy to understand, and it can be applied to binary or multi-class classification problems.

2. Decision Trees: Decision trees are non-linear models that can be applied to both classification and regression problems. They recursively divide the data into subsets according to the values of the features and then predict the dominant class in each subset. Decision trees can handle both categorical and numerical features and are easy to interpret.

3. Random Forests: Random Forests are an ensemble technique that combines many decision trees to produce a more reliable and accurate model. Each tree in the forest is trained on a random subset of the data, and for classification the predictions of all the trees are combined, by majority vote or by averaging class probabilities, to produce the final prediction. Random forests are well known for their high accuracy and their ability to handle complex datasets.

4. Support Vector Machines (SVM): SVMs are a common choice for binary and multi-class classification problems. They operate by finding the hyperplane that best separates the classes by maximising the margin between them. SVMs can handle both linear and non-linear relationships between the features and the target variable (the latter through kernel functions) and are especially effective for datasets with distinct class boundaries.
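To make the recursive splitting used by decision trees (item 2 above) more concrete, the following minimal sketch finds the best single threshold on one feature. It assumes Gini impurity as the split criterion, which is a common choice but is not stated in this case study. The function names and the toy values are hypothetical and are not real Iris measurements.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure subset)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best 'value <= threshold' split."""
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


def majority(labels):
    """Dominant class of a subset, used as the prediction for that leaf."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical toy values for a single feature (NOT real Iris measurements).
toy_values = [1.0, 1.5, 2.0, 4.0, 4.5, 5.0]
toy_labels = ["a", "a", "a", "b", "b", "b"]

threshold, impurity = best_split(toy_values, toy_labels)
left_leaf = majority([l for v, l in zip(toy_values, toy_labels) if v <= threshold])
right_leaf = majority([l for v, l in zip(toy_values, toy_labels) if v > threshold])
print(threshold, impurity, left_leaf, right_leaf)
```

A full tree would apply `best_split` to every feature, keep the best one, and repeat the procedure on each resulting subset until the subsets are pure or a depth limit is reached. A random forest would then train many such trees on random subsets of the data and combine their votes.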
RESULTS AND ANALYSIS

For the Iris dataset, the highest accuracy, 95.5%, was obtained with KNN and the lowest, 88.88%, with Logistic Regression. The accuracies of all the models used are plotted and tabulated below.

Fig-3: Accuracy plot

Model                 Accuracy
KNN                   95.50%
Decision Tree         93.33%
Logistic Regression   88.88%
SVM                   93.33%
Naive Bayes           91.11%
Random Forest         91.11%

Table 1: Comparison of algorithms

5. MODEL TRAINING

For KNN on the Iris dataset, model training involves the following steps:

1. Dataset loading: The Iris dataset must first be loaded into the machine learning environment. The dataset consists of 150 samples with 4 features.

2. Division of the dataset: The dataset is divided into a training set and a testing set so that the KNN model can be assessed on unseen data. A 70:30 split is common, with 70% of the data used for training and 30% for testing.

3. Feature scaling: As KNN is a distance-based algorithm, it is crucial that all of the features are on the same scale. To achieve this, each feature is standardized by subtracting its mean and dividing by its standard deviation.

4. KNN model training: The training set is used to train the KNN model. The primary KNN parameter is the number of neighbors to take into account (k). With the Iris dataset, a value of k=3 or k=5 is frequently employed.

5. Model evaluation: A performance metric, such as accuracy, precision, recall, or F1 score, is used to assess the KNN model's performance on the testing set. For the Iris dataset, the accuracy score is frequently employed.

6. Model tuning: If the performance of the KNN model is not adequate, the model can be tuned by changing the value of k or by experimenting with other distance measures.

Fig-4: Accuracy plot for k values

Overall, the KNN algorithm is relatively simple and easy to implement for the Iris dataset. The key steps are to split the data, normalize the data, train the model, and evaluate the performance. By following these steps and experimenting with different parameter values, it is possible to achieve high classification accuracy on the Iris dataset.

6. CONCLUSIONS

As shown by its successful use on the well-known Iris dataset, the K-Nearest Neighbors (KNN) algorithm provides a straightforward and practical approach to classification problems. For those who are new to machine learning, the Iris dataset serves as a good example of a classification problem that can be handled using KNN.
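The training steps of Section 5 (split, scale, train, evaluate) can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the code used to obtain the results in Table 1. It assumes Euclidean distance, a 70:30 split, and scaling statistics taken from the training set only. All function names are hypothetical, and the toy clusters are not real Iris measurements.

```python
import math
import random
from collections import Counter
from statistics import mean, pstdev


def train_test_split(rows, labels, test_ratio=0.3, seed=0):
    """Shuffle the indices and split them into training and testing parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_ratio)

    def pick(idx):
        return [rows[i] for i in idx], [labels[i] for i in idx]

    return pick(indices[n_test:]), pick(indices[:n_test])


def fit_scaler(rows):
    """Per-feature mean and standard deviation, computed on training rows only."""
    columns = list(zip(*rows))
    return [mean(c) for c in columns], [pstdev(c) or 1.0 for c in columns]


def scale(rows, means, stds):
    """Standardize each value: z = (x - mean) / std."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def knn_predict(train_rows, train_labels, query, k=3):
    """Majority vote among the k training rows closest to query (Euclidean)."""
    nearest = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(predictions, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


# Hypothetical toy data (NOT real Iris rows): three well-separated clusters
# whose two features live on very different scales.
toy_rows, toy_labels = [], []
for label, (cx, cy) in {"a": (1.0, 100.0), "b": (5.0, 500.0), "c": (9.0, 900.0)}.items():
    for step in range(20):
        toy_rows.append([cx + 0.05 * step, cy + 1.0 * step])
        toy_labels.append(label)

(train_x, train_y), (test_x, test_y) = train_test_split(toy_rows, toy_labels)
means, stds = fit_scaler(train_x)
train_z, test_z = scale(train_x, means, stds), scale(test_x, means, stds)
predictions = [knn_predict(train_z, train_y, q, k=3) for q in test_z]
score = accuracy(predictions, test_y)
print(f"test samples: {len(test_y)}, accuracy: {score:.2f}")
```

Fitting the scaler on the training rows and reusing the same means and standard deviations for the test rows keeps information from the test set out of the model. Tuning as in step 6 amounts to repeating the last three lines for several values of `k` and comparing the scores.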
In this case study, we have demonstrated the fundamental steps needed to use KNN on the Iris dataset: loading the data, dividing it into training and testing sets, normalizing the data, and finally training and assessing the KNN model. The model performed well on the test set, demonstrating its effectiveness in identifying the different iris species. This case study shows the value of data preprocessing and evaluation in obtaining accurate and trustworthy results, and it may serve as a useful reference for people interested in using KNN to solve classification problems.
