Machine Learning Basics and Supervised, unsupervised
1.
Unit 1 Introductionto Machine learning 1 Lecture Number Topic Hours 1 Classic and adaptive machines, Machine learning matters 01 2 Beyond machine learning-deep learning 01 3 Bio inspired adaptive systems 01 4 Machine learning and Big data 01 5 Important Elements of Machine Learning- Data formats, Learnability 01 6 Statistical learning approaches 01 7 Elements of information theory 01 Total hours =07
Classic and adaptivemachines • machine is never efficient or trendy without a concrete possibility to use it with pragmatism. • A machine is immediately considered useful and destined to be continuously improved if its users can easily understand what tasks can be completed with less effort or completely automatically
Machine learning matters •learning is the ability to change according to external stimuli and remembering most of all previous experiences • machine learning is an engineering approach that gives maximum importance to every technique that increases or improves the propensity for changing adaptively.
9.
Goal of ML •to study, engineer, and improve mathematical models which can be trained (once or continuously) with context-related data (provided by a generic environment), to infer the future and to make decisions without complete knowledge of all influencing elements (external factors).
Goal of ML •In other words • an agent (which is a software entity that receives information from an environment, picks the best action to reach a specific goal, and observes the results of it) adopts a statistical learning approach, trying to determine the right probability distributions and use them to compute the action (value or decision) that is most likely to be successful (with the least error).
Supervised learning • Asupervised scenario is characterized by the concept of a teacher or supervisor, whose main task is to provide the agent with a precise measure of its error (directly comparable with output values). • In a supervised scenario, the goal is training a system that must also work with samples never seen before. So, it's necessary to allow the model to develop a generalization ability and avoid a common problem called overfitting, which causes an overlearning due to an excessive capacity
Common supervised learningapplications include: • Predictive analysis based on regression or categorical classification • Spam detection • Pattern detection • Natural Language Processing • Sentiment analysis • Automatic image classification • Automatic sequence processing (for example, music or speech)
16.
Unsupervised learning • Thisapproach is based on the absence of any supervisor and therefore of absolute error measures; it's useful when it's necessary to learn how a set of elements can be grouped(clustered) according to their similarity (or distance measure)
Reinforcement Learning • reinforcementlearning is also based on feedback provided by the environment • the information is more qualitative and doesn't help the agent in determining a precise measure of its error. In reinforcement learning, this feedback is usually called reward (sometimes, a negative one is defined as a penalty) and it's useful to understand whether a certain action performed in a state is positive or not.
21.
Reinforcement Learning • Thesequence of most useful actions is a policy that the agent has to learn, so to be able to make always the best decision in terms of the highest immediate and cumulative reward. In other words, an action can also be imperfect, but in terms of a global policy it has to offer the highest total reward. • Reinforcement learning is particularly efficient when the environment is not completely deterministic, when it's often very dynamic, and when it's impossible to have a precise error measure.
Reinforcement Learning • reinforcementlearning is also based on feedback provided by the environment • the information is more qualitative and doesn't help the agent in determining a precise measure of its error. In reinforcement learning, this feedback is usually called reward (sometimes, a negative one is defined as a penalty) and it's useful to understand whether a certain action performed in a state is positive or not.
26.
Reinforcement Learning • Thesequence of most useful actions is a policy that the agent has to learn, so to be able to make always the best decision in terms of the highest immediate and cumulative reward. In other words, an action can also be imperfect, but in terms of a global policy it has to offer the highest total reward. • Reinforcement learning is particularly efficient when the environment is not completely deterministic, when it's often very dynamic, and when it's impossible to have a precise error measure.
Beyond machine learning deeplearning , bio inspired adaptive systems • complex (deep) neural architectures: deep learning – Rosenblatt invented the first perceptron, interest in neural networks • The idea behind these techniques is to create algorithms that work like a brain • Common deep learning applications include: – Image classification – Real-time visual tracking – Autonomous car driving – Logistic optimization – Bioinformatics – Speech recognition
29.
Machine learning andBig data. • the amount of information managed in different business contexts grew exponentially • opportunity to use it for machine learning purposes arose • it's possible to asynchronously train several local models, periodically share the updates, and re- synchronize them all with a master model • Not every machine learning problem is suitable for big data, and not all big datasets are really useful when training models.
30.
Parametric learning • avector-values function depend on an internal parameter vector which determines the actual instance of a generic predictor, the approach is called parametric learning:
31.
non-parametric learning • doesn'tmake initial assumptions about the family of predictors • instance-based learning and makes real-time predictions (without pre-computing parameter values) based on hypothesis determined only by the training samples (instance set). • concept of neighborhoods :classification
32.
• In aclassification problem, a new sample is automatically surrounded by classified training elements and the output class is determined considering the preponderant one in the neighborhood • Eg: kernel-based support vector machines
33.
• A genericparametric training process must find the best parameter vector which minimizes the regression/classification error given a specific training dataset and it should also generate a predictor that can correctly generalize when unknown samples are provided.
34.
• Supervised Learning –Regression and classification • Single value /single label • Multi value: multi-label classification and multi-output regression • Unsupervised learning – We only have an input set X with m-length vectors, and we define clustering function (with n target clusters) with the following expression:
two possibilities • Ifwe expect future data to be exactly distributed as training samples, a more complex model can be a good choice, to capture small variations that a lowerlevel one will discard. In this case, a linear (or lower-level) model will drive to underfitting, because it won't be able to capture an appropriate level of expressivity. • If we think that future data can be locally distributed differently but keeps a global trend, it's preferable to have a higher residual misclassification error as well as a more precise generalization ability. Using a bigger model focusing only on training data can drive to overfitting.
37.
Underfitting and overfitting •The purpose of a machine learning model is to approximate an unknown function that associates input elements to output ones (classes) • Underfitting: It means that the model isn't able to capture the dynamics shown by the same training set (probably because its capacity is too limited). • Overfitting: the model has an excessive capacity and it's not more able to generalize considering the original dynamics provided by the training set. It can associate almost perfectly all the known samples to the corresponding output values, but when an unknown input is presented, the corresponding prediction error can be very high.
38.
• low-capacity (underfitting): – easier to detect considering the prediction error • normal-capacity (normal fitting): – may prove to be more difficult to discover as it could be initially considered the result of a perfect fitting. • excessive capacity (overfitting):
39.
Data formats • Ina supervised learning problem, there will always be a dataset, defined as a finite set of real vectors with m features each: Considering approach is always probabilistic each X as drawn from a statistical multivariate distribution D.
40.
• dataset X:we expect all samples to be independent and identically distributed (i.i.d). • This means all variables belong to the same distribution D, and considering an arbitrary subset of m values, it happens that: • The corresponding output values can be both numerical- continuous or categorical. • In the first case, the process is called regression, while in the second, it is called classification.
• Generic Regressor: a vector-valued function which associates an input value to a continuous output • Generic Classifier, a vector-values function whose predicted output is categorical (discrete).
43.
Multiclass strategies • numberof output classes is greater than one • two main possibilities to manage a classification problem: – One-vs-all – One-vs-one • the choice is transparent and the output returned to the user will always be the final value or class.
44.
One-vs-one • training amodel for each pair of classes • complexity is O(n2) • the right class is determined by a majority vote. • this choice is more expensive and should be adopted only when a full dataset comparison is not preferable.
45.
One-vs-all • widely adoptedby scikit-learn • If there are n output classes, n classifiers will be trained in parallel considering there is always a separation between an actual class and the remaining ones. • lightweight approach (O(n) complexity)
46.
Learnability • A parametricmodel can be split into two parts: a static structure and a dynamic set of parameters. • a static structure : is determined by choice of a specific algorithm and is normally immutable • dynamic set of parameters is the objective of our optimization
47.
Learnability • Considering nunbounded parameters, they generate an n-dimensional space where each point, together with the immutable part of the estimator function, represents a learning hypothesis H • The goal of a parametric learning process is to find the best hypothesis whose corresponding prediction error is minimum and the residual generalization ability is enough to avoid overfitting.
49.
• the datasetX is linearly separable (without transformations) if there exists a hyperplane which divides the space into two subspaces containing only elements belonging to the same class
51.
• Cross-validation andother techniques easily show how our model works with test samples never seen during the training phase. • generic rule of thumb says that a residual error is always necessary to guarantee a good generalization ability, while a model that shows a validation accuracy of 99.999... percent on training samples is almost surely overfitted and will likely be unable to predict correctly when never-seen input samples are provided.
52.
Error measures • Ingeneral, when working with a supervised scenario, we define a non-negative error measure em which takes two arguments (expected and predicted output) and allows us to compute a total error value over the whole dataset (made up of n samples):
53.
loss function • it'suseful to consider the mean square error (MSE): • A generic training algorithm has to find the global minimum or a point quite close to it (there's always a tolerance to avoid an excessive number of iterations and a consequent risk of overfitting). • This measure is also called loss function because its value must be minimized through an optimization problem.
54.
zero-one-loss • efficient forbinary classifications (also for one- vs-rest multiclass strategy): • adopted in loss functions based on the probability of misclassification
55.
Statistical learning approaches •Imagine that you need to design a spam-filtering algorithm starting from this initial (over simplistic) classification based on two parameters: • We have collected 200 email messages (X) : p1 and p2 mutually exclusive • find a couple of probabilistic hypotheses to determine