Lecture 7 - Bias, Variance and Regularization, a lecture in subject module Statistical & Machine Learning
The document covers key concepts in machine learning, including model training, bias, variance, and regularization techniques. It discusses how bias occurs when a model is too simple, leading to underfitting, while variance arises from a model being overly complex, leading to overfitting. The document also explains the bias-variance tradeoff and introduces regularization methods like L1 and L2 to manage these issues in model performance.
1.
DA 5230 – Statistical & Machine Learning | Lecture 7 – Bias, Variance and Regularization | Maninda Edirisooriya | manindaw@uom.lk
2.
ML Process
• You split your dataset into two parts
• A large proportion goes to training and the rest is held out for testing
• Then train a model on the training dataset with a suitable learning algorithm
• Once trained, evaluate that model with the test set and get the performance numbers (e.g. accuracy)
• Repeat the Data Collection, EDA, ML Algorithm Selection and Training phases iteratively until you reach the expected level of performance (a code sketch of this workflow follows below)
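A minimal sketch of this train/evaluate workflow, assuming scikit-learn and a synthetic dataset (the dataset and the choice of model here are purely illustrative):

```python
# Minimal train/test workflow sketch using scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic data standing in for the collected dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Split: a large proportion for training, the rest held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train on the training set with a chosen learning algorithm
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on the unseen test set to get a performance number
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {test_accuracy:.3f}")
```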
3.
Model Fit
• The same training dataset can be fitted differently by different learning algorithms, producing models that fit the data differently
• Even for a given algorithm, how well the model explains the given dataset can differ depending on:
  • The number of model parameters
  • The amount of data used for training
  • The number of iterations used for training
  • The regularization techniques used (discussed later)
Bias and Variance
• When an ML model cannot make correct predictions because the model is too simple, it is known as a Bias problem
• When an ML model becomes very good at making predictions on its training dataset but bad (larger error) on real-world data (data unseen during training), it is a Variance problem
• As the test set represents data unseen by the model, a Variance problem shows up as higher error on the test data
• A good ML model should reduce both Bias and Variance to an acceptable level
6.
Bias and Variance as forms of Errors. Source: https://towardsdatascience.com/bias-and-variance-in-linear-models-e772546e0c30
7.
Bias – Variance Comparison

Underfitting (i.e. Bias Problem):
• Can happen when the model is not complex enough to capture the dataset (i.e. a small number of parameters for a larger dataset)
• Can be due to undertraining (i.e. training for too few iterations)
• Results in lower performance (e.g. lower accuracy) on both the training and the test data
• The problem is low accuracy

Overfitting (i.e. Variance Problem):
• Can happen when the model is too complex for the dataset (i.e. a large number of parameters for a smaller dataset)
• Can be due to overtraining
• Results in higher performance on the training dataset but much lower performance on the test dataset
• The problem is that the model does not generalize to real-world data
Bias
• Bias is caused by the model not learning enough of the structure of the dataset
• Either due to the limited expressive power of the model (i.e. a low number of parameters)
• Or due to a training dataset too small to contain enough information about the data distribution
• When training with an iterative method like Gradient Descent, bias may also be due to stopping the training process before completion (i.e. before the cost has been reduced sufficiently)
10.
Bias
• Bias is defined as Bias[f̂(X)] = E[Ŷ] − Y, where Ŷ = f̂(X) is the model's prediction
• Bias can be reduced by:
  • Using a better ML algorithm
  • Using a larger model (i.e. with more parameters)
  • Training for more iterations if training was stopped too early
  • Using a larger training dataset
  • Reducing regularization (if it is applied)
• Example of high bias: using a straight line to model a quadratic polynomial distribution (see the sketch below)
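The straight-line example can be made concrete with a short, illustrative sketch (synthetic quadratic data, scikit-learn): the linear model leaves a large error no matter how well it is optimized, while a degree-2 model removes most of it.

```python
# High bias illustration: a straight line fitted to quadratic data underfits.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 1.0 + 2.0 * X[:, 0] ** 2 + rng.normal(scale=0.5, size=200)  # quadratic truth + noise

# Straight line: too simple for the data (high bias / underfitting)
line = LinearRegression().fit(X, y)
print("Linear model MSE:   ", mean_squared_error(y, line.predict(X)))

# Degree-2 features: enough expressive power to capture the curve
X2 = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
quad = LinearRegression().fit(X2, y)
print("Quadratic model MSE:", mean_squared_error(y, quad.predict(X2)))
```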
11.
Variance
• Variance is introduced when the model fits the training dataset too closely
• The model can get highly optimized for the dataset it is trained on, including its noise
• As the model fits the noise in the training dataset, it will perform poorly on real-world data that differ from the training set
12.
Variance
• Variance is defined as Variance[f̂(X)] = E[(Ŷ − E[Ŷ])²]
• Variance can be reduced by:
  • Using a larger training dataset, as the noise-driven errors tend to cancel out
  • Reducing the number of parameters, so that less significant patterns (like noise) are not captured by the model
    • Dimensionality Reduction and Feature Selection can be used for this (discussed later)
  • Using Early Stopping to stop training at an optimal point
  • Dropout, which is used in Deep Learning models (not relevant to our subject module ☺)
  • Increasing (or introducing, if not yet present) regularization
• Example of high variance: using a degree-8 polynomial to model a linear distribution (see the sketch below)
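The degree-8 example can be sketched in the same illustrative way (synthetic noisy linear data, arbitrary parameters): the flexible model reaches a very low training error but a much higher test error than the plain linear fit.

```python
# High variance illustration: a degree-8 polynomial fitted to noisy linear data
# chases the noise, so training error is low while test error is much higher.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(20, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.3, size=20)          # linear truth + noise
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

for degree in (1, 8):
    features = PolynomialFeatures(degree=degree, include_bias=False)
    model = LinearRegression().fit(features.fit_transform(X_train), y_train)
    train_mse = mean_squared_error(y_train, model.predict(features.transform(X_train)))
    test_mse = mean_squared_error(y_test, model.predict(features.transform(X_test)))
    print(f"degree={degree}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```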
13.
Error Composition
• Mean Squared Error of the estimator: MSE[f̂(x)] = (Bias[Ŷ])² + Var[Ŷ]
• Error in prediction: E[(Ŷ − Y)²] = MSE[f̂(x)] + σ², where σ² is the variance of the irreducible error
• A small simulation estimating these terms follows below
Source: https://www.geeksforgeeks.org/bias-vs-variance-in-machine-learning/
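A small simulation (with an assumed quadratic true function and a deliberately too-simple linear model) can estimate the Bias² and Variance terms at a single test point by refitting the model on many independent training sets:

```python
# Estimate Bias^2 and Variance of a prediction at one point x0 by simulation.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2                  # assumed true function
sigma = 0.5                           # std of the irreducible noise
x0, n, trials = 1.5, 50, 2000
predictions = []

for _ in range(trials):
    X = rng.uniform(-2, 2, size=n)
    y = f(X) + rng.normal(scale=sigma, size=n)
    # Fit a straight line (deliberately too simple, so bias is visible)
    slope, intercept = np.polyfit(X, y, deg=1)
    predictions.append(slope * x0 + intercept)

predictions = np.array(predictions)
bias_sq = (predictions.mean() - f(x0)) ** 2     # (E[Yhat] - Y)^2
variance = predictions.var()                    # E[(Yhat - E[Yhat])^2]
print(f"Bias^2 ~ {bias_sq:.3f}, Variance ~ {variance:.3f}, "
      f"expected squared error ~ {bias_sq + variance + sigma**2:.3f}")
```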
14.
Bias-Variance Tradeoff
• The ML algorithm, the number of model parameters, the amount of data, the number of training iterations and regularization can all be tuned to try to reduce both bias and variance
• But both cannot be minimized at the same time: when bias is reduced, variance increases, and when variance is reduced, bias increases
• This is known as the Bias-Variance Tradeoff
• Therefore, a good balance between bias and variance is sought to create a better model
15.
Early Stopping
• When training with iterative methods like Gradient Descent:
  • The training error decreases monotonically as the model fits the training data more and more closely
  • But the test error decreases only up to a certain point and then starts to increase again due to increasing variance
• Training the model can be stopped where the test error is at its minimum (see the sketch below)
• This is known as Early Stopping
Source: https://pub.towardsai.net/keras-earlystopping-callback-to-train-the-neural-networks-perfectly-2a3f865148f7
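A minimal early-stopping loop might look like the following sketch (SGDRegressor on synthetic data, with a hypothetical patience of 5 epochs); scikit-learn's SGDRegressor also offers a built-in early_stopping=True option.

```python
# Manual early stopping: stop when the validation error has not improved
# for `patience` consecutive epochs.
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=500)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=0)
best_val, patience, bad_epochs = np.inf, 5, 0

for epoch in range(200):
    model.partial_fit(X_train, y_train)                  # one epoch of gradient updates
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    if val_mse < best_val - 1e-6:                        # validation error still improving
        best_val, bad_epochs = val_mse, 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:                           # stop before overfitting sets in
        print(f"Early stopping at epoch {epoch}, best validation MSE {best_val:.4f}")
        break
```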
16.
Regularization
• When high variance is observed during ML we may try to find more data for training, but that may be expensive
• We may then try to reduce the number of parameters in the model instead, but identifying which parameters to remove may not be obvious
• The next best option is to apply Regularization during the training process
• Regularization is a technique that penalizes some of the information in the model, on the assumption that it corresponds to noise
17.
Regularization
• For regularization, a penalty is added to the loss: Loss := Loss + λ · Σ_{i=1}^{n} |β_i|^k
• where β_i is the i-th parameter of the model, and k is generally 1 or 2
• The penalty (or regularization term) is scaled by a factor λ known as the Regularization Strength
• The best value for λ is found using Cross Validation (covered later)
• There are 2 common regularization techniques: L1 (Lasso Regression) and L2 (Ridge Regression)
• For L1 take k = 1, and for L2 take k = 2
18.
L1 (Lasso) Regression
• The loss function becomes: Loss := Loss + λ · Σ_{i=1}^{n} |β_i|
• The penalty is proportional to the sum of the absolute values of the parameter weights
• Selects features: the least significant parameters end up exactly at zero
• Used when only a few of the existing parameters are believed to be relevant to the model, and the other parameters should be eliminated from it
• As λ grows larger, more parameters become zero (see the sketch below)
• When λ is very large, only the bias term β_0 remains non-zero
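An illustrative sketch of this feature-selection effect (synthetic data where only 5 of 20 features are informative; scikit-learn's Lasso calls its regularization strength `alpha`, which plays the role of λ here):

```python
# As the regularization strength grows, Lasso drives more coefficients to exactly zero.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression

# 20 features, but only 5 actually carry information
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

for alpha in (0.1, 1.0, 10.0, 100.0):
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    n_zero = np.sum(lasso.coef_ == 0)
    print(f"lambda={alpha:>6}: {n_zero} of 20 coefficients are exactly zero")
```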
19.
L2 (Ridge) Regression
• The loss function becomes: Loss := Loss + λ · Σ_{i=1}^{n} β_i²
• The penalty is proportional to the sum of the squares of the parameter weights
• Weight Decay: shrinks the weights of parameters with large values
• Used when all the parameters are believed to contribute to the model, so the aim is to significantly shrink the weights of excessively large parameters rather than eliminate them
• As λ grows larger, all the parameters β_i shrink, but they do not become exactly zero (see the sketch below)
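A corresponding sketch for Ridge (same kind of synthetic data, with `alpha` again standing in for λ): the largest coefficient shrinks as λ grows, but no coefficient is driven exactly to zero.

```python
# Ridge shrinks all coefficients towards zero but does not zero them out.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)

for alpha in (1.0, 100.0, 10000.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(f"lambda={alpha:>8}: max |coef| = {np.abs(ridge.coef_).max():.3f}, "
          f"coefficients exactly zero: {np.sum(ridge.coef_ == 0)}")
```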
20.
Elastic Net Regression
• Both the L1 and L2 penalties can be combined by weighting each of them, which results in Elastic Net Regression
• This brings some small parameters to zero (the L1 effect) and shrinks some larger parameters (the L2 effect)
• Select α to adjust the balance between the L1 and L2 effects (a scikit-learn sketch follows below):
  Loss := Loss + α·λ · Σ_{i=1}^{n} |β_i| + (1−α)·λ · Σ_{i=1}^{n} β_i²,  where 0 < α < 1
  or equivalently: Loss := Loss + λ · [ α · Σ_{i=1}^{n} |β_i| + (1−α) · Σ_{i=1}^{n} β_i² ]
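A sketch of Elastic Net in scikit-learn, where `l1_ratio` plays the role of α in the formulation above (l1_ratio = 1 is pure Lasso, l1_ratio = 0 is pure Ridge); the data and parameter values are illustrative.

```python
# Elastic Net: mixing the L1 (sparsity) and L2 (shrinkage) effects.
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

for l1_ratio in (0.1, 0.5, 0.9):
    enet = ElasticNet(alpha=1.0, l1_ratio=l1_ratio, max_iter=10000).fit(X, y)
    print(f"l1_ratio={l1_ratio}: zero coefficients={np.sum(enet.coef_ == 0)}, "
          f"max |coef|={np.abs(enet.coef_).max():.3f}")
```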
21.
Linear Regression with L1
As the cost (total loss) function for Linear Regression is the Mean Squared Error (MSE), after L1 regularization:
  J(β) = MSE + λ · Σ_{j=1}^{m} |β_j|
  J(β) = (1/2) Σ_{i=1}^{n} (Ŷ_i − Y_i)² + λ · Σ_{j=1}^{m} |β_j|
  ∂J(β)/∂β_j = Σ_{i=1}^{n} (Ŷ_i − Y_i) · X_{i,j} + λ · sign(β_j)
A gradient-descent sketch of these updates follows below.
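A minimal batch gradient-descent sketch that follows these update rules on toy data (the learning rate, data and λ are illustrative assumptions; the intercept β₀ is left unpenalized):

```python
# Batch gradient descent for linear regression with an L1 penalty.
import numpy as np

def lasso_gradient_descent(X, y, lam=0.1, lr=0.005, epochs=2000):
    n, m = X.shape
    beta0, beta = 0.0, np.zeros(m)
    for _ in range(epochs):
        y_hat = beta0 + X @ beta
        error = y_hat - y                            # (Yhat_i - Y_i)
        beta0 -= lr * error.sum()                    # intercept: no penalty term
        grad = X.T @ error + lam * np.sign(beta)     # Sum(error * X_ij) + lambda * sign(beta_j)
        beta -= lr * grad
    return beta0, beta

# Usage on toy data where only the first feature matters
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 4.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
print(lasso_gradient_descent(X, y, lam=1.0))
```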
22.
Linear Regression with L2
As the cost function for Linear Regression is the Mean Squared Error (MSE), after L2 regularization:
  J(β) = MSE + (λ/2) · Σ_{j=1}^{m} β_j²
  J(β) = (1/2) Σ_{i=1}^{n} (Ŷ_i − Y_i)² + (λ/2) · Σ_{j=1}^{m} β_j²
  ∂J(β)/∂β_j = Σ_{i=1}^{n} (Ŷ_i − Y_i) · X_{i,j} + λ · β_j
23.
Logistic Regression with L1 & L2
Although the cost function for Logistic Regression is the Cross Entropy function, the gradients take the same form as for Linear Regression; the difference lies in Ŷ = f̂(X), which is the sigmoid output for Logistic Regression.
L1:
  J(β) = −Σ_{i=1}^{n} [ Y_i · log(Ŷ_i) + (1 − Y_i) · log(1 − Ŷ_i) ] + λ · Σ_{j=1}^{m} |β_j|
  ∂J(β)/∂β_j = Σ_{i=1}^{n} (Ŷ_i − Y_i) · X_{i,j} + λ · sign(β_j)
L2:
  J(β) = −Σ_{i=1}^{n} [ Y_i · log(Ŷ_i) + (1 − Y_i) · log(1 − Ŷ_i) ] + (λ/2) · Σ_{j=1}^{m} β_j²
  ∂J(β)/∂β_j = Σ_{i=1}^{n} (Ŷ_i − Y_i) · X_{i,j} + λ · β_j
A scikit-learn sketch follows below.
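In practice, library implementations expose these penalties directly; for example, scikit-learn's LogisticRegression regularizes by default, and its parameter C is the inverse of the regularization strength (C ≈ 1/λ). A short illustrative sketch on synthetic data:

```python
# L1 vs L2 regularized logistic regression in scikit-learn (C is 1/lambda).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2_model = LogisticRegression(penalty="l2", C=0.1).fit(X, y)

print("L1: zero coefficients =", np.sum(l1_model.coef_ == 0))      # sparse weights
print("L2: max |coefficient| =", np.abs(l2_model.coef_).max())     # shrunk, but non-zero
```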
24.
One Hour Homework
• Officially we have one more hour to spend after the end of the lectures
• Therefore, for this week's extra hour you have a homework
• Bias and Variance are very important concepts in ML, and regularization is widely used, especially in Deep Learning
• Go through the slides, get a clear understanding of the Bias-Variance concept and become familiar with regularization
• Refer to external sources to clarify any remaining ambiguities
• Good Luck!