
I've understood that SVMs are binary, linear classifiers (without the kernel trick). They have training data $(x_i, y_i)$ where $x_i$ is a vector and $y_i \in \{-1, 1\}$ is the class. As they are binary, linear classifiers the task is to find a hyperplane which separates the data points with the label $-1$ from the data points with the label $+1$.

Assume for now, that the data points are linearly separable and we don't need slack variables.

Now I've read that the training problem is the following optimization problem:

  • ${\min_{w, b} \frac{1}{2} \|w\|^2}$
  • s.t. $y_i ( \langle w, x_i \rangle + b) \geq 1$ for all $i$
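
For concreteness, here is how this primal problem could be written down and solved numerically on a tiny, linearly separable toy dataset (a minimal sketch using cvxpy; the data and variable names are just my own illustration):

```python
import numpy as np
import cvxpy as cp

# Tiny linearly separable toy data: two points per class, labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0],      # class +1
              [-1.0, -1.0], [-2.0, -2.0]])  # class -1
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)   # normal vector of the hyperplane
b = cp.Variable()    # offset

# Hard-margin primal: min 1/2 ||w||^2  s.t.  y_i (<w, x_i> + b) >= 1 for all i
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
problem.solve()

print("w =", w.value, "b =", b.value)
print("margin = 2 / ||w|| =", 2 / np.linalg.norm(w.value))
```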

I think I got that minimizing $\|w\|^2$ means maximizing the margin (however, I don't understand why it is the square here. Would anything change if one tried to minimize $\|w\|$ instead?).

I also understood that $y_i ( \langle w, x_i \rangle + b) \geq 0$ would mean that the model has to classify the training data correctly. However, the constraint has a $1$ on the right-hand side, not a $0$. Why?

  • When minimizing (setting the derivative to zero), the squared form probably turns out to give an easier equation. Commented Dec 26, 2015 at 20:20
  • See also: Alexander Ihler: Support Vector Machines (1): Linear SVMs, primal form on YouTube. 25.01.2015. Commented Jan 17, 2016 at 13:43

1 Answer


First problem: Minimizing $\|w\|$ or $\|w\|^2$:

It is correct that one wants to maximize the margin. This is actually done by maximizing $\frac{2}{\|w\|}$. This would be the "correct" way of doing it, but it is rather inconvenient. Let's first drop the $2$, as it is just a constant. Now if $\frac{1}{\|w\|}$ is maximal, $\|w\|$ will have to be as small as possible. We can thus find the identical solution by minimizing $\|w\|$.

$\|w\|$ can be calculated as $\sqrt{w^T w}$. As the square root is a monotonically increasing function, any point $x$ which minimizes $\sqrt{f(x)}$ also minimizes $f(x)$ (and vice versa, as long as $f(x) \geq 0$). To find this point we thus don't have to calculate the square root at all and can minimize $w^T w = \|w\|^2$ directly.

Finally, since we often have to compute derivatives, we multiply the objective by a factor of $\frac{1}{2}$. This is a very common convention, because $\frac{d}{dx} x^2 = 2 x$ and thus $\frac{d}{dx} \frac{1}{2} x^2 = x$, so the constant $2$ cancels in the gradient. This is how we end up with the problem: minimize $\frac{1}{2} \|w\|^2$.

tl;dr: yes, minimizing $\|w\|$ instead of $\frac{1}{2} \|w\|^2$ would work.
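
As a quick sanity check of this equivalence, one can solve the same constrained problem with both objectives and compare the resulting hyperplanes. This is only a sketch of the argument above (made-up toy data, scipy.optimize as a generic solver), not a sensible way to train an SVM:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data with labels y_i in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Parameters packed as theta = [w_1, w_2, b].
def margins(theta):
    w, b = theta[:2], theta[2]
    return y * (X @ w + b) - 1.0   # constraints: every entry must be >= 0

cons = [{"type": "ineq", "fun": margins}]
theta0 = np.array([1.0, 1.0, 0.0])  # arbitrary starting point

# Same constraints, two objectives: ||w|| versus 1/2 ||w||^2.
res_norm = minimize(lambda t: np.linalg.norm(t[:2]), theta0,
                    method="SLSQP", constraints=cons)
res_sq = minimize(lambda t: 0.5 * np.dot(t[:2], t[:2]), theta0,
                  method="SLSQP", constraints=cons)

# Both runs should converge to (numerically) the same (w, b).
print("argmin of ||w||:      ", res_norm.x)
print("argmin of 1/2 ||w||^2:", res_sq.x)
```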

Second problem: $\geq 0$ or $\geq 1$:

As already stated in the question, $y_i \left( \langle w,x_i \rangle + b \right) \geq 0$ means that the point has to lie on the correct side of the hyperplane. However, this isn't enough: we want the point to be at least as far away from the hyperplane as the margin (in which case it is a support vector), or even further away.

Remember the definition of the hyperplane,

$\mathcal{H} = \{ x \mid \langle w,x \rangle + b = 0\}$.

This description, however, is not unique: if we scale $w$ and $b$ by a constant $c > 0$, we get an equivalent description of the same hyperplane. To fix this scaling freedom (otherwise the optimizer could shrink $\|w\|$ arbitrarily just by rescaling), we normalize $w$ and $b$ so that the points closest to the hyperplane satisfy $y_i \left( \langle w,x_i \rangle + b \right) = 1$; these points are the support vectors. With this normalization, the geometric distance of a support vector from the hyperplane is $\frac{1}{\|w\|}$, and the full margin between the two classes is $\frac{2}{\|w\|}$.

As mentioned earlier, we want every point to either be a support vector or lie even further from the hyperplane. In training, we therefore add the constraint $y_i \left( \langle w,x_i \rangle + b \right) \geq 1$, which ensures exactly that.

tl;dr: Training points don't just need to be classified correctly; they have to lie on the margin or further away.
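
To see this normalization in action, one can fit a (nearly) hard-margin linear SVM and inspect the constraint values directly. This is a sketch using scikit-learn's SVC with a very large C to approximate the hard-margin case; the data and the value of C are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data, labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])

# A very large C effectively disables the slack variables (hard margin).
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_.ravel()
b = clf.intercept_[0]

# y_i (<w, x_i> + b) is exactly 1 for support vectors and >= 1 for the rest.
print("y_i (<w, x_i> + b):", y * (X @ w + b))
print("support vectors:", clf.support_vectors_)
print("distance of support vectors from hyperplane, 1/||w||:",
      1 / np.linalg.norm(w))
```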

  • Just to check that I understood it: instead of writing $\geq 1$ we could also use any constant $\epsilon$ and write $\geq \epsilon$, where $\epsilon > 0$? Commented Dec 27, 2015 at 10:53
  • In principle, yes. E.g. in soft-margin SVMs (where you allow for some misclassifications or points within the margin), you use $\geq 1-\xi_i$, so a point may fall short of the margin by $\xi_i$. Of course, you then need a penalty term which forces most $\xi_i$ to be zero or at least very small. Commented Dec 27, 2015 at 11:00
  • I think in the above comment Martin was asking not about the soft-margin case, where you add a $\xi_i$ to let some points cross over, but rather about what happens if you replace the $1$ with a different positive constant $\epsilon$. I believe the result in that case would be the same (i.e., you'd find the same separating hyperplane), but $w$ would be scaled so that the margin becomes $\frac{2 \epsilon}{\|w\|}$ instead of $\frac{2}{\|w\|}$. Commented Nov 16, 2016 at 20:37
  • This is because $\langle w, x \rangle + b = \epsilon$ defines a plane orthogonal to $w$ and offset from the origin by $\frac{\epsilon - b}{\|w\|}$ in the $w$ direction, and likewise $\langle w, x \rangle + b = -\epsilon$ defines a plane orthogonal to $w$ and offset from the origin by $\frac{-\epsilon - b}{\|w\|}$. So the distance between the two planes is $\frac{\epsilon - b}{\|w\|} - \frac{-\epsilon - b}{\|w\|} = \frac{2 \epsilon}{\|w\|}$. Commented Nov 16, 2016 at 20:52
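
A quick numeric check of the claim in the last two comments (a sketch with arbitrary toy data and the arbitrary choice $\epsilon = 3$): replacing the $1$ on the right-hand side by $\epsilon$ scales $w$ and $b$ by $\epsilon$, leaves the separating hyperplane itself unchanged, and gives a margin of $\frac{2 \epsilon}{\|w\|}$.

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def solve_primal(eps):
    """Hard-margin primal with the constraint y_i (<w, x_i> + b) >= eps."""
    w, b = cp.Variable(2), cp.Variable()
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
               [cp.multiply(y, X @ w + b) >= eps]).solve()
    return w.value, b.value

w1, b1 = solve_primal(1.0)
w3, b3 = solve_primal(3.0)

print("scaling of (w, b):", w3 / w1, b3 / b1)   # all ratios ~ 3
print("same unit normal: ", w1 / np.linalg.norm(w1), w3 / np.linalg.norm(w3))
print("margins 2*eps/||w||:", 2 * 1.0 / np.linalg.norm(w1),
      2 * 3.0 / np.linalg.norm(w3))             # equal
```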
