Trying to understand Logistic Regression Implementation

Question

I'm currently using the following code as a starting point to deepen my understanding of regularized logistic regression. As a first pass I'm just trying to do a binary classification on part of the iris data set.

One problem I think I have encountered is that the negative log-loss (computed with loss and stored in loss_vec) doesn't change much from one iteration to the next.

Another challenge I am facing is trying to figure out how to plot the decision boundary once I have learned the logistic regression coefficients. Using the coefficients to plot the 0.5 decision boundary is way off. This makes me think I have made a mistake somewhere

http://fa.bianp.net/blog/2013/numerical-optimizers-for-logistic-regression/

import numpy as np import matplotlib.pyplot as plt from sklearn import linear_model, datasets iris = datasets.load_iris() X = iris.data[:, :2] # we only take the first two features. Y = iris.target X = X[:100,:] Y = Y[:100] def phi(t): # logistic function, returns 1 / (1 + exp(-t)) idx = t > 0 out = np.empty(t.size, dtype=np.float) out[idx] = 1. / (1 + np.exp(-t[idx])) exp_t = np.exp(t[~idx]) out[~idx] = exp_t / (1. + exp_t) return out def loss(x0, X, y, alpha): # logistic loss function, returns Sum{-log(phi(t))} #x0 is the weight vector w are the paramaters, c is the bias term w, c = x0[:X.shape[1]], x0[-1] z = X.dot(w) + c yz = y * z idx = yz > 0 out = np.zeros_like(yz) out[idx] = np.log(1 + np.exp(-yz[idx])) out[~idx] = (-yz[~idx] + np.log(1 + np.exp(yz[~idx]))) out = out.sum() / X.shape[0] + .5 * alpha * w.dot(w) return out def gradient(x0, X, y, alpha): # gradient of the logistic loss w, c = x0[:X.shape[1]], x0[-1] z = X.dot(w) + c z = phi(y * z) z0 = (z - 1) * y grad_w = X.T.dot(z0) / X.shape[0] + alpha * w grad_c = z0.sum() / X.shape[0] return np.concatenate((grad_w, [grad_c])) def bgd(X, y, alpha, max_iter): step_sizes = np.array([100,10,1,.1,.01,.001,.0001,.00001]) iter_no = 0 x0 = np.random.random(X.shape[1] + 1) #initialize weight vector #placeholder for coefficients to test against the loss function next_thetas = np.zeros((step_sizes.shape[0],X.shape[1]+1) ) J = loss(x0,X,y,alpha) running_loss = [] while iter_no < max_iter: grad = gradient(x0,X,y,alpha) next_thetas = -(np.outer(step_sizes,grad)-x0) loss_vec = [] for i in range(np.shape(next_thetas)[0]): loss_vec.append(loss(next_thetas[i],X,y,alpha)) ind = np.argmin(loss_vec) x0 = next_thetas[ind] if iter_no % 500 == 0: running_loss.append(loss(x0,X,y,alpha)) iter_no += 1 return next_thetas

Good for you for trying to write your own implementation! Try increasing your step size (eta) and checking your error converges. Adaptively reduce the step size if necessary. — Emre
– Emre, Commented Oct 30, 2015 at 7:43
Thanks @Emre for the suggestion. I tried adaptively updating the step size, but I still don't see the loss decreasing after successive iterations. Any more thoughts? — razalfuhl
– razalfuhl, Commented Oct 30, 2015 at 17:30
It would be better if you can be more precise in your questions. What do you mean that the loss doesn't change much? What is the expected outcome and what are you seeing? It isn't expected to move much once it's in a vicinity of the solution... — Fabian Pedregosa
– Fabian Pedregosa, Commented Oct 31, 2015 at 7:53

metjush · Accepted Answer · 2015-10-30 18:07:44Z

There are several issues I see with the implementation. Some are just unnecessarily complicated ways of doing it, but some are genuine errors.

Primary takeaways

A: Try to start from the math behind the model. The logistic regression is a relatively simple one. Find the two equations you need and stick to them, replicate them letter by letter.

B: Vectorize. It will save you a lot of unnecessary steps and computations, if you step back for a bit and think of the best vectorized implementation.

C: Write more comments in the code. It will help those trying to help you. It will also help you understand each part better and maybe uncover errors yourself.

Now let's go over the code step by step.

1. The sigmoid function

Is there a reason for such a complicate implementation in phi(t)? Assuming that t is a vector (a numpy array), then all you really need is:

def phi(t): 1. / (1. + np.exp(-t))

As np.exp() operates element-wise over arrays. Ideally, I'd implement it as a function that can also return its derivative (not necessary here, but might be handy if you try to implement a basic neural net with sigmoid activations):

def phi(t, dt = False): if dt: phi(t) * (1. - phi(t)) else: 1. / (1. + np.exp(-t))

2. Cost function

Usually, the logistic cost function is defined as a log cost in the following way (vectorized): $ \frac{1}{m} (-(y^T \log{(\phi(X\theta))})-(1-y^T)(\log{(1 - \phi(X\theta))}) + \frac{\lambda}{2m} \theta^{1T}\theta $ where $\phi(z)$ is the logistic (sigmoid) function, $\theta$ is the full parameter vector (including bias weight), $\theta^1$ is parameter vector with $\theta_1=0$ (by convention, bias is not regularized) and $\lambda$ is the regularization parameter.

What I really don't understand is the part where you multiply y * z. Assuming y is your label vector $y$, why are you multiplying it with your z before applying the sigmoid function? And why do you need to split the cost function into zeros and ones and calculate losses for either sample separately? I think the problem in your code really lies in this part: you must be erroneously multiplying $y$ with $X\theta$ before applying $\phi(.)$.

Also, this bit here: X.dot(w) + c. So c is your bias parameter, right? Why are you adding it to every element of $X\theta$? It shouldn't be added - it should be the first element of the vector $X\theta$. Yes, you don't regularize it, but you need to use in the "prediction" part of the loss function.

In your code, I also see the cost function as being overly complicated. Here's what I would try:

def loss(X,y,w,lam): #calculate "prediction" Z = np.dot(X,w) #calculate cost without regularization #shape of X is (m,n), shape of w is (n,1) J = (1./len(X)) * (-np.dot(y.T, np.log(phi(Z))) * np.dot((1-y.T),np.log(1 - phi(Z)))) #add regularization #temporary weight vector w1 = copy.copy(w) #import copy to create a true copy w1[0] = 0 J += (lam/(2.*len(X))) * np.dot(w1.T, w) return J

3. Gradient

Again, let's first go over the formula for the gradient of the logistic loss, again, vectorized: $\frac{1}{m} ((\phi(X\theta) - y)^TX)^T + \frac{\lambda}{m}\theta^1$. This will return a vector of derivatives (i.e. gradient) for all parameters, regularized properly (without the bias term regularized).

Here again, you've multiplied by $y$ way too soon: phi(y * z). In fact, you shouldn't have multiplied by $y$ in gradient at all.

This is what I would do for the gradient:

def gradient(X, y, w, lam): #calculate the prediction Z = np.dot(X,w) #temporary weight vector w1 = copy.copy(w) #import copy to create a true copy w1[0] = 0 #calc gradient grad = (1./len(X)) * (np.dot((phi(Z) - y).T,X).T) + (lam/len(X)) * w1 return grad

The actual gradient descent implementation seems ok to me, but because there are errors in the gradient and cost function implementations, it fails to deliver :/

Hope this will help you get on track.

Thank you for the feedback. I originally tried to compute the logistic function directly, but I had some trouble with the exponential leading to overflow errors. I tried to implement Fabian Pedregosa's code (see code and link) considering that he clearly thought a lot about solving that particular problem. All that said, this has helped me tremendously. — razalfuhl
– razalfuhl, Commented Oct 30, 2015 at 21:14
Glad it helped. Yeah, overflow might be a problem. I haven't considered it my self - that's why your code looked so weird to me. But scaling all your features before training might be helpful (subtracting means and potentially dividing by the standard deviation) to get all the features within a tighter range of values centered around zero. By doing this, you may not need a more complicated implementation... — metjush
– metjush, Commented Oct 30, 2015 at 21:25
The reason to implement phi "in such a complicated way" is to avoid overflows. For example, your implementation will overflow for values of t below - 100. State of the art implementations like the ones found on liblinear use this trick.And the multiplication by y can be done only if the two classes are - 1 and 1, in which it defaults to the same loss you wrote. — Fabian Pedregosa
– Fabian Pedregosa, Commented Oct 31, 2015 at 7:50
@metjush isn't it J = (1./len(X)) * (-np.dot(y.T, np.log(phi(Z))) - np.dot((1-y.T),np.log(1 - phi(Z)))) instead of J = (1./len(X)) * (-np.dot(y.T, np.log(phi(Z))) * np.dot((1-y.T),np.log(1 - phi(Z)))) i.e. subtracting two entities? — Anish
– Anish, Commented Mar 20, 2017 at 19:06

oW_ · Accepted Answer · 2017-08-21 15:16:23Z

Below is how you can implement gradient descent in Python:

def sigmoid(z): s= 1/(1 + np.exp(-z)) return s def propagate(w, b, X, Y): m = X.shape[1] A = sigmoid(np.dot(w.T,X)+b) # compute activation cost = -1/m * np.sum(Y * np.log(A) + (1-Y) * (np.log(1-A))) dz= (1/m)*(A - Y) dw = np.dot(X, dz.T) db = np.sum(dz) cost = np.squeeze(cost) grads = {"dw": dw, "db": db} return grads, cost def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost = False): costs = [] for i in range(num_iterations): m = X.shape[1] grads,cost = propagate(w, b, X, Y) b = b - learning_rate*grads["db"] w = w - learning_rate*grads["dw"] if i % 100 == 0: costs.append(cost) if print_cost and i % 100 == 0: print ("Cost after iteration %i: %f" %(i, cost)) params = {"w": w, "b": b} return params, grads, costs def predict(w, b, X): m = X.shape[1] Y_prediction = np.zeros((1,m)) w = w.reshape(X.shape[0], 1) A = sigmoid(np.dot(w.T,X)+ b) for i in range(A.shape[1]): x_exp = np.exp(A) x_sum = np.sum(x_exp,axis=1,keepdims=True) s = np.divide(x_exp,x_sum) Y_prediction = 1. * (A > 0.5) return Y_prediction def model(X_train, Y_train, X_test, Y_test, num_iterations = 2000, learning_rate = 0.5, print_cost = False): w, b = initialize_with_zeros(X_train.shape[0]) print("learning rate:",learning_rate) parameters, grads, costs = optimize(w, b, X_train, Y_train, num_iterations, learning_rate, print_cost = False) w = parameters["w"] b = parameters["b"] Y_prediction_train = predict(w,b,X_train) Y_prediction_test = predict(w,b,X_test) d = {"costs": costs, "Y_prediction_test": Y_prediction_test, "Y_prediction_train" : Y_prediction_train, "w" : w, "b" : b, "learning_rate" : learning_rate, "num_iterations": num_iterations} return d d = model(train_set_x, train_set_y, test_set_x, test_set_y, num_iterations = 2000, learning_rate = 0.005, print_cost = True)

Welcome to StackExchange Data Science! While we appreciate the contribution, it would be better if you could add some prose to explain how this helps to answer the question. Thanks. — Stephen Rauch
– Stephen Rauch ♦, Commented Aug 21, 2017 at 13:32

Stack Exchange Network

Trying to understand Logistic Regression Implementation

2 Answers 2

Hot Network Questions

Trying to understand Logistic Regression Implementation

2 Answers 2

Related

Hot Network Questions