32
$\begingroup$

People say that soft-margin SVM uses the hinge loss function $\max(0,1-y_i(w^\intercal x_i+b))$. However, the actual objective function that soft-margin SVM minimizes is $$ \frac{1}{2}\|w\|^2+C\sum_i\max(0,1-y_i(w^\intercal x_i+b)) $$ Some authors call the $\|w\|^2$ term the regularizer and the $\max(0,1-y_i(w^\intercal x_i+b))$ term the loss function.

However, for hard-margin SVM, the whole objective function is just $$ \frac{1}{2}\|w\|^2 $$ Does that mean hard-margin SVM only minimizes a regularizer, without any loss function? That sounds very strange.

Well, if $\frac{1}{2}\|w\|^2$ is the loss function in this case, can we call it a quadratic loss function? If so, why does the loss function of hard-margin SVM become the regularizer in soft-margin SVM, and why does the loss change from quadratic loss to hinge loss?
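For concreteness, here is how I understand the two objectives in code (a minimal numpy sketch; the function names are mine, and the labels are assumed to be in $\{-1,+1\}$):

```python
import numpy as np

def hinge_losses(w, b, X, y):
    """Per-example hinge loss max(0, 1 - y_i (w . x_i + b)); labels y_i in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

def soft_margin_objective(w, b, X, y, C=1.0):
    """0.5 * ||w||^2 + C * sum of hinge losses (the soft-margin objective above)."""
    return 0.5 * np.dot(w, w) + C * hinge_losses(w, b, X, y).sum()

def hard_margin_objective(w):
    """0.5 * ||w||^2 alone, minimized subject to y_i (w . x_i + b) >= 1 for all i."""
    return 0.5 * np.dot(w, w)
```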

$\endgroup$
2
  • 1
    $\begingroup$ From what I understand, hard margin means you don't accept data points inside your margin. As a consequence, $\max(0, \cdot)$ will always return 0. $\endgroup$ Commented Feb 19, 2014 at 15:07
  • $\begingroup$ There is an s.t. (subject to) constraint. Otherwise you could minimize with the trivial solution $w=0$. $\endgroup$ Commented Jul 27, 2023 at 21:41

3 Answers

35
+50
$\begingroup$

The hinge loss term $\sum_i\max(0,1-y_i(\mathbf{w}^\intercal \mathbf{x}_i+b))$ in soft margin SVM penalizes margin violations (and hence misclassifications). In hard margin SVM there are, by definition, no margin violations.

This indeed means that hard margin SVM tries to minimize $\|\mathbf{w}\|^2$. Due to the formulation of the SVM problem, the margin is $2/\|\mathbf{w}\|$. As such, minimizing the norm of $\mathbf{w}$ is geometrically equivalent to maximizing the margin. Exactly what we want!
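To see where the $2/\|\mathbf{w}\|$ comes from: the constraints $y_i(\mathbf{w}^\intercal \mathbf{x}_i+b)\ge 1$ force every point onto or outside the two hyperplanes $\mathbf{w}^\intercal \mathbf{x}+b=\pm 1$, and the distance between those hyperplanes is $$ \frac{|1-(-1)|}{\|\mathbf{w}\|}=\frac{2}{\|\mathbf{w}\|}. $$ Without these constraints, shrinking $\|\mathbf{w}\|$ would of course be trivial; it is the constraints that make the minimization meaningful.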

Regularization is a technique to avoid overfitting by penalizing large coefficients in the solution vector. In hard margin SVM $\|\mathbf{w}\|^2$ is both the loss function and an $L_2$ regularizer.

In soft-margin SVM, the hinge loss term also acts like a regularizer, but on the slack variables instead of $\mathbf{w}$, and in the $L_1$ sense rather than $L_2$. $L_1$ regularization induces sparsity, which is why the standard SVM is sparse in terms of support vectors (in contrast to least-squares SVM).
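To see that sparsity in practice, here is a small scikit-learn illustration (the toy data and parameters are arbitrary; only the examples on or inside the margin end up as support vectors):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class data; SVC handles the 0/1 label encoding internally.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Typically far fewer than 200 points are kept as support vectors:
# only those on or inside the margin get nonzero dual coefficients.
print(clf.support_vectors_.shape[0], "support vectors out of", len(X))
```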

$\endgroup$
3
  • 1
    $\begingroup$ Can you explain the last two paragraphs with some more details and maths? $\endgroup$ Commented Aug 23, 2018 at 19:33
  • 1
    $\begingroup$ Soft margin SVM uses $L_1$ regularization on the slack variable terms $ξ_i$, which leads to more of the $1-y_i(\mathbf{w}^\intercal \mathbf{x}_i+b)$ terms becoming $\le 0$? Can I say $ξ_i=\max(0,1-y_i(\mathbf{w}^\intercal \mathbf{x}_i+b))$ here? $\endgroup$ Commented May 21, 2022 at 14:08
  • $\begingroup$ Why is the margin 2/||w||? In that case why can't we just pick the norm of w to be super small? There is an unstated s.t. condition lurking here, right? $\endgroup$ Commented Jul 27, 2023 at 21:42
2
$\begingroup$

There is no explicit "loss function" in the hard-margin SVM formulation, but when we move to soft-margin SVM, a loss function does appear.

Here is the detailed explanation:

When we talk about a loss function, what we really mean is a training objective that we want to minimize.

In the hard-margin SVM setting, the "objective" is to maximize the geometric margin subject to the constraint that each training example lies on the correct side of, and outside, the margin, i.e. $$\begin{aligned} & \max_{w, b}\frac{1}{\Vert w \Vert} \\ &\text{s.t.}\quad y_i(w^Tx_i+b) \ge 1,\quad i=1,\dots,n \end{aligned} $$ Note that this is a constrained optimization problem, so we cannot solve it by plain (unconstrained) gradient descent; in that sense there is no standalone analytic "loss function" for hard-margin SVMs.
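To connect this with the objective in the question: maximizing $1/\Vert w\Vert$ under these constraints is equivalent to minimizing $\frac{1}{2}\Vert w\Vert_2^2$ under the same constraints, which is the standard QP form $$\begin{aligned} & \min_{w,b}\frac{1}{2}\Vert w \Vert_2^2 \\ &\text{s.t.}\quad y_i(w^Tx_i+b) \ge 1,\quad i=1,\dots,n. \end{aligned} $$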

However, in the soft-margin SVM setting, we add slack variables to allow our SVM to make mistakes. We now try to solve $$\begin{aligned} & \min_{w,b,\boldsymbol{\xi}}\frac{1}{2}\Vert w \Vert_2^2 + C\sum_i \xi_i \\ \text{s.t.}\quad &y_i(w^Tx_i+b) \ge 1-\xi_i \\ & \boldsymbol{\xi} \succeq \mathbf{0} \end{aligned} $$ This is the same as penalizing each training example $x_i$ that violates the margin by adding $C\xi_i$ to the objective we minimize. Recall the hinge loss $$ \ell_{\text{hinge}}(z) = \max\{0, 1-z\}. $$ At the optimum, $\xi_i$ is zero whenever the training example lies on the correct side of the margin and is nonzero only when it falls inside the margin (or on the wrong side), and since the hinge loss is always nonnegative, we can rephrase our problem as $$ \min_{w,b} \frac{1}{2}\Vert w \Vert_2^2 + C\sum_i\ell_{\text{hinge}}(y_i(w^Tx_i+b)). $$ The hinge loss is convex and has a known subgradient, so we can solve the soft-margin SVM directly by (sub)gradient descent.
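As a concrete illustration of that last point, here is a minimal batch (sub)gradient-descent sketch for the unconstrained hinge-loss objective; the function name, fixed step size, and iteration count are arbitrary choices, and labels are assumed to be in $\{-1,+1\}$:

```python
import numpy as np

def soft_margin_svm_subgradient(X, y, C=1.0, lr=1e-3, n_iter=1000):
    """Minimize 0.5*||w||^2 + C*sum_i max(0, 1 - y_i(w.x_i + b)) by subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        margins = y * (X @ w + b)
        active = margins < 1  # examples with nonzero hinge loss
        # Subgradient of the objective: w from the L2 term, minus C*y_i*x_i for active examples
        grad_w = w - C * (y[active][:, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

In practice one would use a decreasing step size (as in Pegasos) or a dedicated QP/SMO solver, but the point is that the unconstrained hinge-loss form admits this kind of direct optimization.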

So the slack variable is just the hinge loss in disguise, and the properties of the hinge loss happen to absorb the optimization constraints (it is nonnegative, and it activates only when its argument is less than 1).

$\endgroup$
1
  • $\begingroup$ There could be a loss function for HM SVM if it were solved with SGD, no? $\endgroup$ Commented Aug 27, 2020 at 5:18
0
$\begingroup$

Just to clarify, $$ \frac{1}{2}\|w\|^2 $$ is minimized subject to the constraints $y_i(w^\intercal x_i+b) \ge 1$, which can only all be satisfied when the points are linearly separable (i.e. one can draw a hyperplane that perfectly separates the two classes). In other words, the only values of $w$ that we can consider as solutions are those that separate the two sets of points.

Now, it is thought that hard-margin SVM "overfits" more readily than soft-margin SVM. This is easier to imagine with an RBF SVM with a high enough $\gamma$, which can create (overly) complicated and (potentially) over-fit decision boundaries. The harder the margin (emulated, imprecisely, by a higher $C$), the harder the search will try to find decision boundaries that perfectly classify the two sets of points.

When we move to "soft margin", the constraints are relaxed and replaced with a penalty through the introduction of "slack". This slack variable is defined by a "hinge loss" term. After simplification, one arrives at the hinge + $L_2$ objective everyone associates with SVMs. FWIW, I like to frame SVMs as an optimization problem rather than the omnipresent "follow the gradients" problem.
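The "simplification" mentioned above is just eliminating the slack: at the optimum each slack variable sits at the smallest value its constraints allow, namely $$ \xi_i=\max\left(0,\,1-y_i(w^\intercal x_i+b)\right), $$ and substituting this back into the objective gives the unconstrained hinge + $L_2$ form from the question.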

$\endgroup$
