
I'm trying to build a linear SVC from scratch. I used some references from MIT course 6.034 and some YouTube videos. I was able to get the code running; however, the results do not look right. I could not figure out what I did wrong, so it would be nice if someone could point out my mistake. If I understand it correctly, the hinge-loss cost should have only one global minimum, and I should expect the cost to decrease monotonically. It certainly fluctuates towards the end.

    #Generating data
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_blobs

    X, y = make_blobs(n_samples=300, n_features=2, centers=2, cluster_std=1, random_state=42)

    # Propose a model ==> (w,b) initialize a random plane
    np.random.seed(42)
    w = np.random.randn(2,1)
    b = np.random.randn(1,1)

    # Get output using the proposed model ==> distance score
    def cal_score(point_v, label):
        return label * (X @ w + b)

    s = cal_score(X, y)

    # Evaluate performance of the initial model ==> Hinge Loss
    def cal_hinge(score):
        hinge_loss = 1 - score
        hinge_loss[hinge_loss < 0] = 0
        cost = 0.5 * sum(w**2) + sum(hinge_loss) / len(y)
        return hinge_loss, cost

    _, J = cal_hinge(s)
    loss = [J[0]]
    print('Cost of initial model: {}'.format(J[0]))

    # Gradient descent, update (w,b)
    def cal_grad(point_v, label):
        hinge, _ = cal_hinge(cal_score(point_v, label))
        grad_w = np.zeros(w.shape)
        grad_b = np.zeros(b.shape)
        for i, h in enumerate(hinge):
            if h == 0:
                grad_w += w
            else:
                grad_w += w - (X[i] * y[i]).reshape(-1,1)
                grad_b += y[i]
        return grad_w/len(X), grad_b/len(X)

    grad_w, grad_b = cal_grad(X, y)
    w = w - 0.03*grad_w
    b = b - 0.03*grad_b

    # Re-evaluation after 1-step gradient descent
    s = cal_score(X, y)
    _, J = cal_hinge(s)
    print('Cost of 1-step model: {}'.format(J[0]))
    loss.append(J[0])

    # How about 30 steps:
    for i in range(28):
        grad_w, grad_b = cal_grad(X, y)
        w = w - 0.04*grad_w
        b = b - 0.03*grad_b
        s = cal_score(X, y)
        _, J = cal_hinge(s)
        loss.append(J[0])
        print('Cost of {}-step model: {}'.format(i+2, J[0]))

    print('Final model: w = {}, b = {}'.format(w, b))

Output

    Cost of initial model: 0.13866202810721154
    Cost of 1-step model: 0.13150688874177027
    Cost of 2-step model: 0.12273179526491895
    Cost of 3-step model: 0.11480467935989988
    Cost of 4-step model: 0.1075336912554962
    Cost of 5-step model: 0.10084006850825472
    Cost of 6-step model: 0.09467250631773037
    Cost of 7-step model: 0.08898976153627648
    Cost of 8-step model: 0.08375382447902188
    Cost of 9-step model: 0.07892966542038939
    Cost of 10-step model: 0.07448500096528701
    Cost of 11-step model: 0.07039007873679798
    Cost of 12-step model: 0.06662137485152193
    Cost of 13-step model: 0.0631641256490808
    Cost of 14-step model: 0.06007003664049003
    Cost of 15-step model: 0.05743247238207012
    Cost of 16-step model: 0.05547068741404436
    Cost of 17-step model: 0.05381989797841767
    Cost of 18-step model: 0.05248657667528307
    Cost of 19-step model: 0.051457041091025085
    Cost of 20-step model: 0.050775749386560806
    Cost of 21-step model: 0.0502143321989
    Cost of 22-step model: 0.04964305284192223
    Cost of 23-step model: 0.04934419897947399
    Cost of 24-step model: 0.04918626712575319
    Cost of 25-step model: 0.048988709405470836
    Cost of 26-step model: 0.048964173310432575
    Cost of 27-step model: 0.04890689234556096
    Cost of 28-step model: 0.04901146890814169
    Cost of 29-step model: 0.04882640882453289
    Final model: w = [[ 0.21833245] [-0.16428035]], b = [[0.65908854]]


2 Comments

  • The end result looks to be what you would expect, and apart from step 28, the cost is monotonically decreasing. (Commented Mar 8, 2021 at 6:42)
  • I compared my code's performance to someone else's. I think the main issue is that I'm using batch gradient descent, while others use SGD; I'm not sure why SGD produces a better-performing model. This picture shows the main difference between the two methods (a sketch of the two update rules follows below): user-images.githubusercontent.com/66216181/… (Commented Mar 8, 2021 at 18:41)
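
For reference, here is a minimal sketch of the difference between the two update schemes mentioned in that comment. It assumes labels in {-1, +1} and a 1-D weight vector; the names hinge_grad, batch_step, sgd_step, and lr are illustrative and not taken from the question's code:

    import numpy as np

    # Illustrative sketch, assuming labels y in {-1, +1} and w as a 1-D weight vector.
    def hinge_grad(x_i, y_i, w, b, reg=1.0):
        # Sub-gradient of 0.5*reg*||w||^2 + max(0, 1 - y_i*(x_i @ w + b)) for one sample.
        if y_i * (x_i @ w + b) >= 1:
            return reg * w, 0.0
        return reg * w - y_i * x_i, -y_i

    def batch_step(X, y, w, b, lr=0.03):
        # Batch GD: average the gradient over every sample, then take ONE step.
        gw, gb = np.zeros_like(w), 0.0
        for x_i, y_i in zip(X, y):
            gwi, gbi = hinge_grad(x_i, y_i, w, b)
            gw, gb = gw + gwi, gb + gbi
        n = len(X)
        return w - lr * gw / n, b - lr * gb / n

    def sgd_step(X, y, w, b, lr=0.03):
        # SGD: take one small step per sample, in random order (many updates per pass).
        for i in np.random.permutation(len(X)):
            gw, gb = hinge_grad(X[i], y[i], w, b)
            w, b = w - lr * gw, b - lr * gb
        return w, b

Both schemes descend the same convex objective; SGD simply makes many noisy updates per pass over the data, while batch GD makes one averaged update.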

2 Answers


It seems like your implementation is correct. At such small margins, you should not worry about minor increases in the cost.

The increase in cost occurs when the learning rate multiplied by the gradient "overshoots" the optimal value. In this example it happened by an extremely small amount, so I would not worry about it.

If you are curious about why the cost increases at all, we first have to ask: why shouldn't it? Gradient descent points in the direction that reduces the loss, but if the learning rate is large enough, a step can shoot past the optimal value and end up with a larger cost. This is essentially what your code did, just at an extremely small and negligible scale.
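
As a toy illustration of that point (not part of the original code), here is plain gradient descent on f(x) = x², whose minimum is at x = 0; a small learning rate decreases the cost every step, while a rate above 1.0 overshoots so far that the cost grows:

    # Toy example: gradient descent on f(x) = x^2 (gradient 2x), minimum at x = 0.
    def descend(lr, x=2.0, steps=5):
        costs = []
        for _ in range(steps):
            x = x - lr * 2 * x          # one gradient-descent step
            costs.append(x ** 2)
        return costs

    print(descend(lr=0.1))   # cost shrinks monotonically
    print(descend(lr=0.9))   # each step overshoots past 0, yet the cost still shrinks
    print(descend(lr=1.1))   # overshoots so far that the cost increases every step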


3 Comments

The separation seems acceptable in this case, but it can get much worse if I try other random seeds. I included an image comparing my results to the expected results: user-images.githubusercontent.com/66216181/…
I can't say for sure what is causing the difference, but I have a couple of guesses. First, you could let the algorithm run for more iterations so it has more time to fit the data. Second, you could try tweaking your hyperparameters (the learning rates); a rough sketch of both ideas follows after these comments. Let me know if neither of those works.
Here is the Colab link if you are interested. I used batch gradient descent in my original code, and I also included someone else's SGD code. I think SGD performs much better, although, since the hinge loss is a convex function, both methods should end up with similar parameters. Not sure what went wrong. colab.research.google.com/drive/…
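
In case it helps, here is a rough sketch of the two suggestions above (more iterations, different learning rates). The train helper is hypothetical and is not in the original code: it wraps a batch-GD loop like the question's, but uses the standard sub-gradient (note the minus sign on the bias gradient) and remaps the make_blobs labels to {-1, +1}, which the hinge loss assumes:

    import numpy as np
    from sklearn.datasets import make_blobs

    X, y = make_blobs(n_samples=300, n_features=2, centers=2, cluster_std=1, random_state=42)
    y = 2 * y - 1   # hinge loss expects labels in {-1, +1}; make_blobs returns {0, 1}

    # Hypothetical helper: batch GD with a configurable learning rate and number of
    # iterations, returning the cost history so the runs can be compared.
    def train(X, y, lr=0.03, n_steps=500, seed=42):
        rng = np.random.default_rng(seed)
        w = rng.standard_normal((X.shape[1], 1))
        b = rng.standard_normal((1, 1))
        history = []
        for _ in range(n_steps):
            score = y.reshape(-1, 1) * (X @ w + b)
            hinge = np.maximum(0.0, 1.0 - score)
            history.append(0.5 * (w ** 2).sum() + hinge.mean())
            violators = (hinge > 0).ravel()
            grad_w = w - (X[violators] * y[violators].reshape(-1, 1)).sum(axis=0).reshape(-1, 1) / len(X)
            grad_b = -y[violators].sum() / len(X)   # standard sub-gradient; note the minus sign
            w, b = w - lr * grad_w, b - lr * grad_b
        return w, b, history

    # Try a few learning rates and keep the run with the lowest final cost.
    best = None
    for lr in (0.001, 0.01, 0.03, 0.1):
        w_fit, b_fit, history = train(X, y, lr=lr, n_steps=500)
        if best is None or history[-1] < best[0]:
            best = (history[-1], lr, w_fit, b_fit)
    print('best learning rate: {}, final cost: {:.4f}'.format(best[1], best[0]))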

I figured out what was wrong. When calculating the hinge loss, I should have used X @ w - b instead of X @ w + b; this affects how the bias term is updated during gradient descent. From my experience with this code, SGD performs better than batch GD in general and requires less hyperparameter tuning. But in theory, if a decaying learning rate is used for batch GD, it should perform no worse than SGD.
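
As a concrete example of the decaying learning rate mentioned above, a simple 1/t-style schedule could replace the fixed 0.03/0.04 steps in the question's loop; the constants here are illustrative, not taken from the answer:

    # Illustrative 1/t-style decay, not from the original code: the step size shrinks
    # as training progresses so batch GD can settle instead of bouncing around the minimum.
    lr0, decay = 0.05, 0.01
    schedule = [lr0 / (1 + decay * step) for step in range(300)]
    print(schedule[0], schedule[99], schedule[299])   # 0.05 -> ~0.025 -> ~0.0125
    # Inside the training loop one would then use, for example:
    #   lr = lr0 / (1 + decay * step)
    #   w, b = w - lr * grad_w, b - lr * grad_b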

