Support Vector Machines
Optimization objective
Logistic regression (regularized) cost function:
$\min_\theta \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log h_\theta(x^{(i)}) - (1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
The SVM replaces the two log terms with piecewise-linear costs $\text{cost}_1(z)$ and $\text{cost}_0(z)$, and swaps the regularization weight $\lambda$ for a weight $C$ (playing the role of $\frac{1}{\lambda}$) on the data term:
$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^T x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
SVM hypothesis: $h_\theta(x) = 1$ if $\theta^T x \ge 0$, else $h_\theta(x) = 0$.
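The lecture only draws $\text{cost}_1$ and $\text{cost}_0$ as piecewise-linear curves; a minimal NumPy sketch, assuming the standard hinge-loss choice for them:

```python
import numpy as np

def cost1(z):
    # Cost for y = 1: zero once z >= 1, growing linearly below that.
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    # Cost for y = 0: zero once z <= -1, growing linearly above that.
    return np.maximum(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    # X: (m, n) feature matrix, y: (m,) labels in {0, 1}, theta: (n,) parameters.
    z = X @ theta
    data_term = C * np.sum(y * cost1(z) + (1 - y) * cost0(z))
    reg_term = 0.5 * np.sum(theta ** 2)  # in practice theta_0 is excluded
    return data_term + reg_term
```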
Large Margin Intuition
If $y = 1$, we want $\theta^T x \ge 1$ (not just $\ge 0$).
If $y = 0$, we want $\theta^T x \le -1$ (not just $< 0$).
If C is too large, the decision boundary will be sensitive to outliers.
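A quick way to see this (a sketch assuming scikit-learn; the toy dataset is made up for illustration) is to fit a linear SVM with a large and a small C and compare boundaries:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data with one outlier in the positive class (illustrative only).
X = np.array([[1, 1], [2, 2], [2, 0.5], [4, 4], [5, 5], [5, 3.5], [1.5, 4.5]])
y = np.array([0, 0, 0, 1, 1, 1, 1])  # last point is an outlier

for C in (100.0, 0.1):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    print(f"C={C}: boundary {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")
# With large C the boundary bends to accommodate the outlier;
# with small C it stays close to the large-margin separator for the bulk of the data.
```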
The mathematics behind large margin classification (optional)
Vector Inner Product: $u^T v = u_1 v_1 + u_2 v_2 = p \cdot \|u\|$, where $p$ is the signed length of the projection of $v$ onto $u$, and $\|u\| = \sqrt{u_1^2 + u_2^2}$.
SVM Decision Boundary: the objective can be rewritten as $\min_\theta \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$ subject to $p^{(i)}\|\theta\| \ge 1$ if $y^{(i)} = 1$ and $p^{(i)}\|\theta\| \le -1$ if $y^{(i)} = 0$, where $p^{(i)}$ is the projection of $x^{(i)}$ onto $\theta$. Minimizing $\|\theta\|$ forces the projections $p^{(i)}$ to be large in magnitude, which is exactly a large margin.
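A worked numeric check of the projection identity (a minimal NumPy sketch; the vectors are arbitrary):

```python
import numpy as np

u = np.array([4.0, 2.0])
v = np.array([1.0, 3.0])

inner = u @ v                    # u^T v
p = inner / np.linalg.norm(u)    # signed projection length of v onto u
assert np.isclose(inner, p * np.linalg.norm(u))  # u^T v = p * ||u||
print(inner, p)
```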
Kernels I
Non-linear decision boundary:
Given x, compute new features depending on proximity to landmarks $l^{(1)}, l^{(2)}, l^{(3)}$ chosen manually.
Kernels and Similarity (Gaussian kernel):
$f_i = \text{similarity}(x, l^{(i)}) = \exp\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$
If $x \approx l^{(i)}$: $f_i \approx 1$.
If $x$ is far from $l^{(i)}$: $f_i \approx 0$.
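A minimal sketch of this similarity function (assuming NumPy; `sigma` matches the $\sigma$ above):

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    # f = exp(-||x - l||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
l = np.array([1.0, 2.1])
print(gaussian_kernel(x, l, sigma=1.0))                       # near 1: x close to landmark
print(gaussian_kernel(x, np.array([10.0, 10.0]), sigma=1.0))  # near 0: x far from landmark
```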
Kernels II
Choosing the landmarks:
Where to get $l^{(1)}, l^{(2)}, \dots$?
Given $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})$,
choose $l^{(1)} = x^{(1)},\ l^{(2)} = x^{(2)},\ \dots,\ l^{(m)} = x^{(m)}$ (one landmark at each training example).
For a training example $(x^{(i)}, y^{(i)})$, compute $f^{(i)} = [f_0^{(i)}, f_1^{(i)}, \dots, f_m^{(i)}]^T$ with $f_0^{(i)} = 1$ and $f_j^{(i)} = \text{similarity}(x^{(i)}, l^{(j)})$, as in the sketch below.
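A sketch of building these kernel features for a whole training set (assuming NumPy; vectorized version of the `gaussian_kernel` above):

```python
import numpy as np

def kernel_features(X, landmarks, sigma):
    # X: (m, n) inputs; landmarks: (k, n); returns (m, k+1) with a bias column.
    sq_dists = np.sum((X[:, None, :] - landmarks[None, :, :]) ** 2, axis=2)
    F = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), F])  # prepend f_0 = 1

X = np.random.randn(5, 2)
F = kernel_features(X, landmarks=X, sigma=1.0)  # landmarks = training examples
print(F.shape)  # (5, 6)
```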
SVM with Kernels
Hypothesis: Given $x$, compute features $f \in \mathbb{R}^{m+1}$.
Predict $y = 1$ if $\theta^T f \ge 0$.
Training:
$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^T f^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^T f^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{m}\theta_j^2$
Kernels are usually used with SVMs; they can also be used with logistic regression, but that runs slowly.
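In practice the training problem is handed to a library rather than solved by hand. A minimal sketch assuming scikit-learn, whose RBF kernel is the Gaussian kernel above with $\gamma = \frac{1}{2\sigma^2}$:

```python
import numpy as np
from sklearn.svm import SVC

X = np.random.randn(100, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)  # non-linear (circular) boundary

sigma = 0.5
clf = SVC(kernel="rbf", C=1.0, gamma=1.0 / (2.0 * sigma ** 2))
clf.fit(X, y)
print(clf.score(X, y))
```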
SVM parameters
C (plays the role of $\frac{1}{\lambda}$):
- Large C: lower bias, higher variance.
- Small C: higher bias, lower variance.
$\sigma^2$:
- Large $\sigma^2$: features $f_i$ vary more smoothly. Higher bias, lower variance. (Underfit)
- Small $\sigma^2$: features $f_i$ vary less smoothly. Lower bias, higher variance. (Overfit)
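Since C and $\sigma^2$ pull bias and variance in opposite directions, they are usually chosen by cross-validation. A sketch assuming scikit-learn; the grid values are arbitrary:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "gamma": [0.01, 0.1, 1, 10],  # gamma = 1 / (2 * sigma^2)
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)  # X, y from the earlier sketch
print(search.best_params_)
```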
Using an SVM
Need to specify:
- Choice of parameter C
- Choice of kernel (similarity function)
Note: Do perform feature scaling before using the Gaussian kernel.
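A sketch of that scaling step, assuming scikit-learn's StandardScaler inside a Pipeline so the same scaling is applied at predict time:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scale features to comparable ranges so no one feature dominates ||x - l||^2.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=0.5))
model.fit(X, y)  # X, y as before
```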
Other choices of kernel
Not all similarity functions make valid kernels: they need to satisfy a technical condition called Mercer's Theorem, which ensures that SVM packages' optimizations run correctly and do not diverge.
Many off-the-shelf kernels available:
- Polynomial kernel: $k(x, l) = (x^T l + c)^d$
- String kernel
- Chi-square kernel
- Histogram intersection kernel
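scikit-learn's SVC also accepts a Python callable as the kernel, which is one way to experiment with non-standard kernels (a sketch; c and d are arbitrary polynomial parameters):

```python
import numpy as np
from sklearn.svm import SVC

def polynomial_kernel(A, B, c=1.0, d=3):
    # Gram matrix of k(a, b) = (a^T b + c)^d between all rows of A and B.
    return (A @ B.T + c) ** d

clf = SVC(kernel=polynomial_kernel).fit(X, y)  # X, y as before
```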
Multi-class classification
Many SVM packages already have built-in multi-class classification functionality. Otherwise, use the one-vs-all method: train $K$ SVMs, one to distinguish each class from the rest, and pick the class $i$ with the largest $(\theta^{(i)})^T x$.
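For illustration, a sketch assuming scikit-learn, whose OneVsRestClassifier makes the one-vs-all strategy explicit:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X_multi, y_multi = load_iris(return_X_y=True)  # 3 classes
ova = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale")).fit(X_multi, y_multi)
print(ova.predict(X_multi[:5]))
```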
Logistic regression vs. SVM
n = number of features, m = number of training examples.
- If n is large (relative to m):
  Use logistic regression, or SVM without a kernel ("linear kernel").
- If n is small, m is intermediate:
  Use SVM with Gaussian kernel.
- If n is small, m is large:
  Create/add more features, then use logistic regression or SVM without a kernel.