DA 5230 – Statistical & Machine Learning Lecture 5 – Logistic Regression Maninda Edirisooriya manindaw@uom.lk
Classification
• When the Y variable of a Supervised Learning problem takes one of several discrete classes (e.g.: Color, Age group) the problem is known as a Classification problem
• A Classification problem has to predict/select a certain Category (or Class) as the dependent variable
• When there are only 2 classes to be classified, it is known as a Binary Classification problem
E.g.: Predicting a person's gender (either male or female) from testosterone concentration in blood, height and bone density
Binary Classification
• Output classes of a binary classification can be represented by either
  • Boolean values, True or False (or Positive or Negative)
  • Numbers 1 or 0
• True or 1 is used for the Positive Class, which is generally the class we want to analyze
• False or 0 is used for the Negative Class, i.e. the other class
• E.g.: For classifying a tumor as malignant (a cancer) or benign (not a cancer) by the tumor size, malignant can be taken as the Positive class and benign as the Negative class
Binary Classification - Example
[Figure: scatter plot of tumor size (X) against the class label Y, where Y = 0 (Benign) or 1 (Malignant)]
Binary Classification – with Linear Regression
[Figure: the same data fitted with a Linear Regression classifier; points with predicted value ≥ 0.5 are labelled Malignant, points below 0.5 are labelled Benign]
Binary Classification – Problem with LR
[Figure: with the 0.5 threshold, the Linear Regression classifier misclassifies some points]
Binary Classification – Requirement
[Figure: the required regression classifier is a variant of the Unit Step Function rather than the Linear Regression line]
Binary Classification – Requirement
[Figure: the Unit Step Function variant is not differentiable at the step, so Gradient Descent cannot be applied there]
Binary Classification – Requirement
[Figure: a continuous regression classifier that approximates the step while remaining differentiable]
Logistic/Sigmoid Function
• Sigmoid function: f(z) = 1 / (1 + e^(−z))
• z = 0 ⇒ f(z) = 0.5, and 0 < f(z) < 1 for all z
• A Non-linear function
• This is a continuous alternative for the Unit Step Function
[Figure: plot of f(z) against z showing the S-shaped sigmoid curve]
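A minimal NumPy sketch (not from the slides) illustrating the sigmoid function and the properties listed above:

```python
import numpy as np

def sigmoid(z):
    """Logistic/sigmoid function: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))  # ~[0.0067, 0.5, 0.9933]; sigmoid(0) = 0.5 as stated
```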
Logistic Regression
Like Linear Regression, say Z = β0 + β1*X1 + β2*X2 + ... + βn*Xn
Applying the Logistic Function, f(Z) = 1 / (1 + e^(−Z)):
f(X) = 1 / (1 + e^(−(β0 + β1*X1 + β2*X2 + ... + βn*Xn)))
In vector form, f(X) = 1 / (1 + e^(−βᵀX)), where the intercept term is written as β0*X0 taking X0 = 1
This is the function of Logistic Regression.
Logistic Regression - Prediction
Let's take predictions as:
  1 (or Positive) if f(X) ≥ 0.5
  0 (or Negative) if f(X) < 0.5
Equivalently:
  Positive ⇒ f(X) ≥ 0.5 ⇒ 1 / (1 + e^(−βᵀX)) ≥ 0.5 ⇒ βᵀX ≥ 0
  Negative ⇒ f(X) < 0.5 ⇒ 1 / (1 + e^(−βᵀX)) < 0.5 ⇒ βᵀX < 0
Here, βᵀX = β0 + β1*X1 + β2*X2 + ... + βn*Xn
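A short sketch of this prediction rule; the parameter values and feature vectors below are illustrative, not from the slides:

```python
import numpy as np

def predict(beta, X, threshold=0.5):
    """Logistic regression prediction: 1 (Positive) if f(X) >= threshold, else 0."""
    z = X @ beta                      # beta^T X, with X0 = 1 prepended for the intercept
    prob = 1.0 / (1.0 + np.exp(-z))   # f(X), the predicted probability of the Positive class
    return (prob >= threshold).astype(int), prob

beta = np.array([-3.0, 0.8])                 # illustrative parameters [beta0, beta1]
X = np.array([[1.0, 2.0], [1.0, 5.0]])       # two data points with X0 = 1 prepended
labels, probs = predict(beta, X)
print(labels, probs)  # second point has beta^T X = 1 > 0, so it is classified Positive
```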
Prediction Example
Take a classification problem with 2 independent variables where
f(X) = 1 / (1 + e^(−(β0 + β1*X1 + β2*X2)))
Decision boundary: Z = β0 + β1*X1 + β2*X2 = 0
  Z > 0 ⇒ Positive, Z < 0 ⇒ Negative
[Figure: X1–X2 plane split by the linear decision boundary into Positive and Negative regions]
Non-linear Classification
Taking polynomials of the X values (as discussed in Polynomial Regression) can classify non-linear data points with non-linear decision boundaries
E.g.: f(X) = 1 / (1 + e^(−(β0 + β1*X1² + β2*X2²)))
Decision boundary: Z = β0 + β1*X1² + β2*X2² = 0
  Z > 0 ⇒ Positive, Z < 0 ⇒ Negative
[Figure: X1–X2 plane split by a curved decision boundary into Positive and Negative regions]
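A brief sketch (with illustrative parameter values, not from the slides) showing how squared features give a curved decision boundary: with β = [−4, 1, 1], Z = 0 is the circle X1² + X2² = 4.

```python
import numpy as np

def predict_nonlinear(beta, X1, X2):
    """Classify using squared features: Z = beta0 + beta1*X1^2 + beta2*X2^2."""
    z = beta[0] + beta[1] * X1**2 + beta[2] * X2**2
    return (z > 0).astype(int)  # 1 = Positive (outside the circle), 0 = Negative

beta = np.array([-4.0, 1.0, 1.0])       # illustrative: boundary is X1^2 + X2^2 = 4
X1 = np.array([0.0, 1.0, 3.0])
X2 = np.array([0.0, 1.0, 0.0])
print(predict_nonlinear(beta, X1, X2))  # [0 0 1]: only the point (3, 0) lies outside
```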
Binary Logistic Regression – Cost Function
Cost for a single data point is known as the Loss
Take the Loss Function of Logistic Regression as L(f(X), Y):
  L(f(X), Y) = −log(f(X))       if Y = 1
  L(f(X), Y) = −log(1 − f(X))   if Y = 0
Combined: L(f(X), Y) = −Y log(f(X)) − (1 − Y) log(1 − f(X))
Cost function: J(β) = (1/n) Σ_{i=1}^{n} L(f(xᵢ), Yᵢ)
  J(β) = (1/n) Σ_{i=1}^{n} [−Yᵢ log(f(xᵢ)) − (1 − Yᵢ) log(1 − f(xᵢ))]
This Cost Function is Convex (has a Global Minimum)
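A minimal sketch (illustrative data, not from the slides) of the binary cross-entropy cost J(β) defined above:

```python
import numpy as np

def cost(beta, X, Y):
    """Binary cross-entropy cost J(beta), averaged over n data points."""
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))           # f(x_i) for each data point
    loss = -Y * np.log(prob) - (1 - Y) * np.log(1 - prob)
    return loss.mean()

# Illustrative data: X has X0 = 1 prepended, Y holds the 0/1 labels
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 4.0]])
Y = np.array([0, 0, 1])
print(cost(np.array([-3.0, 1.0]), X, Y))  # ~0.25; lower cost means a better fit to these labels
```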
Multiclass Logistic Regression
• Up to now we have looked at Binary Classification problems where there can be only two outcomes/categories/classes as the Y variable
• When there are more than 2 classes available (only one of them is positive for any given data point) the problem becomes a Multiclass Classification problem
• One way to handle Multiclass Classification is with Binary Classifiers, known as One-vs-All (OvA), also called One-vs-Rest (OvR); a sketch follows below
• It trains one binary classifier per class, each predicting the confidence (probability) of that class against the rest, and the class with the highest confidence is selected
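A minimal One-vs-All sketch (illustrative, not the lecture's code): one binary logistic classifier is fitted per class with plain gradient descent, and prediction picks the class with the highest probability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_ova(X, Y, num_classes, lr=0.1, steps=2000):
    """One-vs-All: fit one binary logistic classifier per class with gradient descent."""
    n, d = X.shape
    betas = np.zeros((num_classes, d))
    for k in range(num_classes):
        yk = (Y == k).astype(float)                 # class k against the rest
        for _ in range(steps):
            grad = X.T @ (sigmoid(X @ betas[k]) - yk) / n
            betas[k] -= lr * grad
    return betas

def predict_ova(betas, X):
    """Pick the class whose binary classifier gives the highest probability."""
    probs = sigmoid(X @ betas.T)                    # shape (n, K)
    return probs.argmax(axis=1)

# Illustrative data: X0 = 1 prepended, three well-separated classes
X = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 1.0],
              [1.0, 4.0, 0.0], [1.0, 4.0, 1.0],
              [1.0, 2.0, 4.0], [1.0, 2.0, 5.0]])
Y = np.array([0, 0, 1, 1, 2, 2])
betas = train_ova(X, Y, num_classes=3)
print(predict_ova(betas, X))                        # expected to recover [0 0 1 1 2 2]
```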
Multiclass Logistic Regression
• OvA can be used
  • When you want to use different binary classifiers (e.g., SVMs or logistic regression) for each class
  • When available memory is limited or when training needs to be highly parallelized
• There is another technique for Multiclass Logistic Regression: simply generalizing the binary Logistic Regression classifier
• This general form of classifier is known as the Softmax Classifier
• There, the Softmax Function is used instead of the Sigmoid function when there are multiple classes
Softmax Function
• The name Softmax is used as it is a continuous function approximation to the Maximum Function, where only one class (the maximum) is considered Positive
• The Softmax function is used instead of the Maximum Function to make the function differentiable
• Softmax Function: S(x)ᵢ = e^(xᵢ) / Σ_{j=1}^{n} e^(xⱼ)
  where i is the index of a component of the input vector x and j runs over all n components
Softmax Function
• The Softmax function exponentially highlights the dimension where the value is maximum, while suppressing all other dimensions
• The output values of a Softmax function sum to 1
• E.g.: [Figure: an input vector passed through the Softmax Function to produce an output vector]
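A quick numeric illustration (values computed here, not taken from the slide's figure):

```python
import numpy as np

def softmax(x):
    """Softmax: exponentiate each component and normalize so the outputs sum to 1."""
    e = np.exp(x - x.max())    # subtract max for numerical stability; result unchanged
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])
print(softmax(x))              # ~[0.090, 0.245, 0.665]: the largest input dominates
print(softmax(x).sum())        # 1.0
```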
Softmax Regression
• Like Z = βᵀX is used for binary classification, Zₖ = βₖᵀX is used for Multiclass classification, where k is the index of the class
• Note that there are K β vectors as model parameters
• Like Y is used for binary classification where there is only a single dependent variable, Multiclass classification has K dependent variables, each denoted by Yₖ with estimator Ŷₖ
• Ŷₖ = e^(Zₖ) / Σ_{j=1}^{K} e^(Zⱼ)
Softmax Regression
Loss function: L(f(X), Y) = −log(Ŷₖ) = −log(e^(Zₖ) / Σ_{j=1}^{K} e^(Zⱼ)) = −log(e^(βₖᵀX) / Σ_{j=1}^{K} e^(βⱼᵀX)), where k is the true class of the data point
Cost function (Cross Entropy Loss): J(β) = −Σ_{i=1}^{N} Σ_{k=1}^{K} I[Yᵢ = k] log(e^(βₖᵀXᵢ) / Σ_{j=1}^{K} e^(βⱼᵀXᵢ))
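A compact sketch (illustrative data, not the lecture's code) of Ŷ and the cross-entropy cost J(β) above:

```python
import numpy as np

def softmax(Z):
    e = np.exp(Z - Z.max(axis=1, keepdims=True))   # row-wise softmax over the K classes
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_cost(betas, X, Y, num_classes):
    """J(beta) = -sum_i sum_k I[Y_i = k] * log(Yhat_ik)."""
    Z = X @ betas.T                                 # Z_ik = beta_k^T x_i, shape (N, K)
    Yhat = softmax(Z)                               # estimated class probabilities
    onehot = np.eye(num_classes)[Y]                 # I[Y_i = k] as a one-hot matrix
    return -(onehot * np.log(Yhat)).sum()

# Illustrative data: X0 = 1 prepended, three classes, three parameter vectors
X = np.array([[1.0, 0.0], [1.0, 2.0], [1.0, 4.0]])
Y = np.array([0, 1, 2])
betas = np.zeros((3, 2))                            # with all-zero betas, Yhat = 1/3 everywhere
print(cross_entropy_cost(betas, X, Y, 3))           # = 3 * -log(1/3) ≈ 3.296
```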
One Hour Homework
• Officially we have one more hour to spend after the end of the lectures
• Therefore, for this week's extra hour you have a homework
• Logistic Regression is the basic building block of Deep Neural Networks (DNNs); Softmax classifiers are used as-is in DNNs as the final classification layer
• Go through the slides and get a clear understanding of Logistic and Softmax Regression
• Refer to external sources to clarify any remaining ambiguities
• Good Luck!
Questions?
