DA 5230 – Statistical & Machine Learning Lecture 2 – Introduction to Statistical Learning Maninda Edirisooriya manindaw@uom.lk
Machine Learning Overview
[Figure: AI / ML / DL relationship diagram]
Source: https://en.wikipedia.org/wiki/Deep_learning#/media/File:AI-ML-DL.svg
Machine Learning Overview
• Intelligence: understanding nature in order to generate useful information
• Artificial Intelligence (AI): mimicking the intelligence of animals/humans with man-made machines
• Machine Learning (ML): machines consuming data to achieve Artificial Intelligence
• Deep Learning (DL): Machine Learning using multiple layers of nature-inspired neurons (in Deep Neural Networks)
AI vs ML
• AI may consist of theory- and rule-based intelligence
  • Expert Systems
  • Control Systems
  • Algorithms
  • And Machine Learning Systems
• ML is developed mainly from available data, whereas AI can also be built from a fixed set of rules without relying on data
• ML systems are almost free of fixed rules added by experts; the data shapes the system
  • Less domain knowledge is required
• ML does not contain if-else statements (a common misconception)
What is Statistical Learning (SL)?
• Uses statistics to understand nature through data
• Has well-established, mathematically proven methods, while ML can sometimes be a form of alchemy with data, where the focus is more on results
• Is the foundation of ML, although the statistics behind some ML models may not have been well studied yet
• Has higher interpretability, as its methods are proven with mathematics
• Has a blurry boundary with ML
SL vs ML

Focus
• Statistical Learning: Primarily focuses on understanding and modeling the relationships between variables in data using statistical methods. It aims to make inferences and predictions based on these relationships.
• Machine Learning: A broader field that encompasses various techniques for building predictive models and making decisions without being overly concerned with the underlying statistical assumptions. It is often used for tasks such as classification, regression, clustering, and more.

Foundation
• Statistical Learning: Rooted in statistical theory and often uses classical statistical techniques like linear regression, logistic regression, and analysis of variance.
• Machine Learning: Draws from a wider range of techniques, including traditional statistics, but also incorporates methods like decision trees, support vector machines, neural networks, and more. It is less reliant on statistical theory and more focused on empirical performance.

Assumptions
• Statistical Learning: Methods often make explicit assumptions about the underlying data distribution, such as normality or linearity. These assumptions help in making inferences about population parameters.
• Machine Learning: Models are often designed to be more flexible and adaptive, which can make them less reliant on strict data distribution assumptions.

Interpretability
• Statistical Learning: Models tend to be more interpretable, meaning it is easier to understand how the model arrives at its predictions. This interpretability is important in fields where understanding the underlying relationships is crucial.
• Machine Learning: While interpretability can be a concern in some machine learning models (e.g., deep neural networks), many machine learning models are designed with a primary focus on predictive accuracy rather than interpretability.
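As a small illustration of the interpretability point above, here is a minimal sketch (an assumption for illustration only, using scikit-learn and synthetic data) that fits an interpretable linear model and a more flexible tree-based model on the same data:

```python
# A minimal sketch of the interpretability contrast: a linear model exposes
# its coefficients directly, while a tree ensemble is typically used for its
# predictive accuracy. Synthetic data; not part of the lecture material.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                                   # 3 hypothetical features
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

linear = LinearRegression().fit(X, y)
forest = RandomForestRegressor(random_state=0).fit(X, y)

print("Linear coefficients (directly interpretable):", linear.coef_)
print("Forest prediction for one new point:", forest.predict(X[:1]))
```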
Course Structure
• Machine Learning will be the main focus
  • You should be able to do ML work yourself with the available data
  • You should be familiar with every phase of the ML lifecycle
• The statistical background will be explained depending on your progress with the above requirement
  • ML will first be taught with simpler mathematics and intuition, and then explained with statistical fundamentals
  • You will first be able to work on ML projects, and the theory behind them will then be learned with statistics
For Your Reference
• Machine Learning can be self-learned with the free course https://www.coursera.org/specializations/machine-learning-introduction
• You can learn more about Statistical Learning from the free Python-based SL book at https://www.statlearning.com
• Learn Python, NumPy, Pandas and scikit-learn from online tutorials and YouTube videos
• You can also clarify tricky ML/SL problems with ChatGPT
• However, note that some online tutorials, videos and ChatGPT may provide incorrect information, so be careful when learning from these resources
• Never use ChatGPT for answering quizzes or exams! (at least until the AI takes over the world)
What Do We Want from Machine Learning?
• Say we have some collected data
• We want a computer/machine to learn from that data and capture its insights in a model
• Our expectation is to use that model to predict/make inferences on newly provided data
• This is like teaching a kid a certain pattern from example pictures and later asking them to draw/classify similar pictures
• After the model is made (known as "trained"), you want to make sure the model has learned the insights with sufficient accuracy
• For that, you train the model with only a part of the given data and use the remaining data to check (known as "test") the accuracy of the model
• The model will be used for our needs (to predict/make inferences) only if the tests pass. Otherwise, we have to revisit the problem and may have to start again from data collection
What Do We Do in Machine Learning?
• We find a dataset
  • In supervised ML we have labeled data (i.e., data has both X values and Y values)
  • In unsupervised ML we have unlabeled data (i.e., data has only X values but no Y values)
• We select a suitable ML algorithm for modeling (e.g., Linear Regression)
• We train a model with most of the data (say 80% of the total data) using that algorithm
• We test (check the accuracy of) the trained model with the remaining data (say 20% of the total data)
• If the tests pass (i.e., the trained model is accurate enough), we can use the model to label more unlabeled data (in supervised ML) or to make inferences on more data (in unsupervised ML)
• Otherwise, we iterate the above process until the tests pass (a minimal code sketch of this workflow follows)
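The workflow above can be sketched in Python with scikit-learn. This is a minimal sketch, assuming a synthetic dataset, a Linear Regression model and an arbitrary accuracy threshold; real projects would differ in all three.

```python
# A minimal sketch of the supervised ML workflow described above:
# split the labeled data 80/20, train on the 80%, test on the 20%,
# and only then use the model on new, unlabeled data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))                        # hypothetical features
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)            # 80% train, 20% test

model = LinearRegression().fit(X_train, y_train)     # train
score = r2_score(y_test, model.predict(X_test))      # test

if score > 0.9:                                      # threshold is an assumption
    X_new = rng.normal(size=(1, 3))                  # newly provided, unlabeled data
    print("Prediction for new data:", model.predict(X_new))
else:
    print("Tests failed; revisit the data/algorithm and iterate.")
```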
Supervised Machine Learning
• Now, let's look at Supervised Machine Learning in more detail
• There are two types of fields/variables/parameters in a supervised ML dataset:
  1. Independent variables / features / predictors / X values
  2. Dependent variable / target variable / response / Y value
• Datasets contain a set of records, where each record contains values for a certain set of X variables and one Y value
• E.g. (the first three rows are given for training/testing; the last Y value needs to be predicted):

  X1 - GPA | X2 - income | X3 - IQ | Y - life_expectancy
  3.41     | 3000        | 105     | 72
  2.32     | 1800        | 86      | 65
  3.82     | 6000        | 130     | 86
  3.56     | 4800        | 112     | ?  (to be predicted)
Supervised Machine Learning
[Diagram: the ML workflow on the toy dataset above]
1. Training: an ML model is trained on the labeled rows (3.41, 3000, 105 → 72) and (2.32, 1800, 86 → 65), producing a trained ML model.
2. Testing: the trained model is tested on the held-out row (3.82, 6000, 130), whose true life_expectancy is 86; Accuracy = 80%.
3. Predicting: the trained model predicts life_expectancy ≈ 76 for the new row (3.56, 4800, 112).
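The three numbered steps in the diagram can be mirrored in code. This is a rough sketch, assuming scikit-learn and treating the toy rows above as the entire dataset, which is far too small for real use.

```python
# A rough sketch of the diagram's three steps on the toy data above:
# (1) train on the labeled rows, (2) test on the held-out row,
# (3) predict the unknown life expectancy for the new person.
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[3.41, 3000, 105],
                    [2.32, 1800,  86]])
y_train = np.array([72, 65])

X_test = np.array([[3.82, 6000, 130]])              # held-out row, true Y = 86
y_test = np.array([86])

model = LinearRegression().fit(X_train, y_train)    # (1) training
print("Test prediction:", model.predict(X_test), "vs true:", y_test)  # (2) testing

X_new = np.array([[3.56, 4800, 112]])               # (3) predicting the "?" row
print("Predicted life expectancy:", model.predict(X_new))
```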
Supervised Machine Learning
• You are asked to train a model that identifies how X1, X2, X3 relate to Y through a function f
• That is, Y = f(X1, X2, X3), or simply Y = f(X)
• Once the model is trained, it gives an estimator of f, written f̂, which is not the exact f, since the model is only an approximation of the true f
• When predicting Y values for new X data, the model generates Ŷ, an estimate of Y produced by f̂
• Because Ŷ ≠ Y in general, there is an error ε
• The trained model is therefore f̂(X), where f̂(X) = Ŷ = f(X) + ε
  • f: the true function to be approximated
  • f̂: the approximated model function
  • Ŷ: the predicted values from the model
  • ε: the model's error
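As a quick worked illustration (hypothetical numbers, not from the dataset above): if the true value for some record is Y = 86 but the trained model predicts Ŷ = 76, then the model's error for that record is ε = Ŷ − Y = 76 − 86 = −10. The smaller such errors are across unseen records, the better f̂ approximates the true f.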
Supervised Machine Learning
• There are mainly 2 types of Supervised Machine Learning problems:
  • Regression problems
  • Classification problems
• The difference comes from the type of the data we are going to predict (Y)
• If Y is a continuous number, such as a temperature or a length, it is a regression problem
• If Y is a discrete value from a finite set, such as gender or country, it is a classification problem (a minimal sketch of the two cases follows)
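In scikit-learn terms, the type of Y drives the choice of estimator. A minimal sketch, with synthetic data and assumed model choices:

```python
# A minimal sketch of how the Y type drives the choice of algorithm:
# a continuous Y calls for a regressor, a discrete Y for a classifier.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

y_continuous = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)   # e.g. a temperature
y_discrete = (X[:, 1] > 0).astype(int)                           # e.g. class 0 or 1

regressor = LinearRegression().fit(X, y_continuous)     # regression problem
classifier = LogisticRegression().fit(X, y_discrete)    # classification problem

print("Regression output:", regressor.predict(X[:1]))        # a real number
print("Classification output:", classifier.predict(X[:1]))   # a class label
```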
Supervised Machine Learning – Example 1
• Problem: A real estate company wants to estimate the sales price of a house, given the following details of the last 100 houses sold, with the sale price included as one of the parameters:
  • Area of the house
  • Area of the land
  • Number of rooms
  • Number of floors
  • Distance to the main road
• Solution: This is a supervised learning regression problem, where the sales price is the Y parameter and the other parameters of the given dataset are the X parameters
Supervised Machine Learning – Example 2
• Problem: A doctor wants to diagnose a tumor as malignant or benign using labeled data from 500 tumors, with the following parameters:
  • Length of the tumor
  • Age of the patient
  • Having a cancer patient in the family
• Solution: This is a supervised learning classification problem, where the malignant/benign label is the Boolean Y parameter and the other parameters of the given dataset are the X parameters. Here, the length of the tumor and the age of the patient are float-type X variables, while having a cancer patient in the family is a Boolean X variable. (A rough code sketch follows.)
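A rough sketch of how such a dataset could be modeled, assuming hypothetical column names, made-up values and a Logistic Regression classifier; the real data and model choice would differ.

```python
# A rough sketch for Example 2: a classifier on two float features and one
# boolean feature. Column names and values are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

data = pd.DataFrame({
    "tumor_length_mm": [12.0, 35.5, 8.2, 40.1],
    "patient_age":     [45,   62,   30,  71],
    "family_history":  [0,    1,    0,   1],    # boolean encoded as 0/1
    "is_malignant":    [0,    1,    0,   1],    # Y label (0 = benign, 1 = malignant)
})

X = data[["tumor_length_mm", "patient_age", "family_history"]]
y = data["is_malignant"]

model = LogisticRegression().fit(X, y)
print(model.predict(X.iloc[[0]]))   # predicted label for one record
```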
Unsupervised Machine Learning
• Now, let's look at Unsupervised Machine Learning in more detail
• There is only one type of field/variable/parameter in an unsupervised ML dataset:
  • Independent variables / features / X values
  • No dependent variables
• There are several types of Unsupervised Machine Learning problems:
  • Clustering
  • Dimensionality reduction (a PCA sketch follows)
  • Anomaly detection
  • ...
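For instance, dimensionality reduction can be sketched with PCA in scikit-learn. This is a minimal sketch, with synthetic data and an assumed choice of 2 components:

```python
# A minimal sketch of dimensionality reduction: project 5 correlated
# features down to 2 principal components with PCA. No Y values are used.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])   # 5 correlated features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                   # (300, 2)
print(pca.explained_variance_ratio_)     # variance captured by each component
```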
Unsupervised Machine Learning – Example 1
• Problem: A website owner wants to categorize its past 1000 visitors into 10 types based on the following data:
  • Visited hour of the day
  • Visit time
  • Most preferred product
  • Web browser used
  • Country of the IP address
• Solution: As there is no labeled data (no Y parameter), this is an unsupervised learning clustering problem, where the given parameters of the dataset are the X parameters. We can use K-means clustering to cluster the visitors into 10 clusters (a minimal sketch follows)
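A minimal sketch with scikit-learn's KMeans, assuming the visitor features have already been converted to numbers; categorical features such as browser, country or preferred product would need to be encoded numerically first.

```python
# A minimal sketch for Example 1: cluster 1000 website visitors into
# 10 groups with K-means. The feature matrix here is synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                   # 1000 visitors, 5 numeric features

X_scaled = StandardScaler().fit_transform(X)     # put features on a comparable scale
kmeans = KMeans(n_clusters=10, random_state=0, n_init=10).fit(X_scaled)

print(kmeans.labels_[:20])   # cluster id (0-9) assigned to the first 20 visitors
```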
Questions?
