Machine Learning in Practice: Implementing SVM with the scikit-learn Library

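The theory behind SVM is covered in detail in the blog post "Support Vector Machine (SVM): Introductory Understanding and Derivation". The main properties of SVM are:

  • The complexity of a trained model is determined by the number of support vectors rather than by the dimensionality of the data, so SVM is less prone to overfitting;
  • A trained SVM model depends entirely on its support vectors: even if every non-support vector is removed from the training set and training is repeated, the result is still the same model;
  • If training an SVM yields relatively few support vectors, the model tends to generalize more easily.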
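The second property can be checked empirically. The sketch below is only an illustration under assumed data (two Gaussian blobs generated with a fixed seed, not data from this post): it trains a linear SVC, retrains on the support vectors alone, and prints both sets of hyperplane parameters so they can be compared. The two should agree up to the solver tolerance.

```python
import numpy as np
from sklearn import svm

# Assumed toy data: two Gaussian blobs, 20 points per class
rng = np.random.RandomState(0)
X = np.r_[rng.randn(20, 2) - [2, 2], rng.randn(20, 2) + [2, 2]]
y = np.array([0] * 20 + [1] * 20)

# Model trained on the full training set
full = svm.SVC(kernel='linear').fit(X, y)

# Keep only the support vectors and train again
sv_idx = full.support_
reduced = svm.SVC(kernel='linear').fit(X[sv_idx], y[sv_idx])

print("full:   ", full.coef_, full.intercept_)
print("reduced:", reduced.coef_, reduced.intercept_)
```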

This post only covers how to implement the SVM algorithm with the scikit-learn library.

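1. Linearly separable example 1

Here is the simplest possible example. The training samples are x = [[2,0], [1,1], [2,3]] with labels y = [0,0,1], which makes the data linearly separable. The code is as follows: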

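#!/usr/bin/env python
# coding=utf-8
# Implement an SVM with sklearn

from sklearn import svm

# Training samples
x = [[2, 0], [1, 1], [2, 3]]
# Labels
y = [0, 0, 1]

clf = svm.SVC(kernel='linear')
clf.fit(x, y)

# Print the parameter settings; only kernel is set, everything else is the default
print(clf)

# Support vectors
print(clf.support_vectors_)

# Indices of the support vectors in the training set
print(clf.support_)

# Number of support vectors for each class
print(clf.n_support_)

# Predict the class of a sample
print(clf.predict([[2, 0]]))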

Output (produced with an older scikit-learn release; recent releases print only the non-default parameters, i.e. SVC(kernel='linear')):

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
[[1. 1.]
 [2. 3.]]
[1 2]
[1 1]
[0]
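To see what the fitted model does with these three points, the following sketch (an addition for illustration, reusing the same training data) reads the hyperplane parameters from coef_ and intercept_, evaluates w·x + b by hand, and compares the result with decision_function. For a binary linear SVC, a positive score maps to the second class in clf.classes_.

```python
import numpy as np
from sklearn import svm

x = np.array([[2, 0], [1, 1], [2, 3]])
y = [0, 0, 1]

clf = svm.SVC(kernel='linear').fit(x, y)

# Hyperplane parameters learned by the linear SVM
w = clf.coef_[0]
b = clf.intercept_[0]

# Signed score computed by hand: w . x + b
manual = x @ w + b

print("manual score:     ", manual)
print("decision_function:", clf.decision_function(x))
print("predicted labels: ", clf.predict(x))
```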

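2. Linearly separable example 2

The second linearly separable example generates random training samples, trains on them, and plots the result so that the fit is easy to inspect. The code is as follows: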

#!/usr/bin/env python
# coding=utf-8

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# Fix the seed so the random data is identical on every run
np.random.seed(0)

x = np.r_[np.random.randn(20, 2) - [2, 2], np.random.randn(20, 2) + [2, 2]]
y = [0] * 20 + [1] * 20

clf = svm.SVC(kernel='linear')
clf.fit(x, y)

w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)

# Maximum-margin separating line
yy = a * xx - (clf.intercept_[0] / w[1])
# Margin line below the boundary (parallel line through the first support vector)
b = clf.support_vectors_[0]
yy_down = a * xx + (b[1] - a * b[0])
# Margin line above the boundary (parallel line through the last support vector)
b = clf.support_vectors_[-1]
yy_up = a * xx + (b[1] - a * b[0])

# Print the parameters
print("w: ", w)
print("a: ", a)
print("support_vectors_: ", clf.support_vectors_)
print("clf.coef_: ", clf.coef_)

plt.plot(xx, yy, 'k-')
plt.plot(xx, yy_down, 'k--')
plt.plot(xx, yy_up, 'k--')

# Circle the support vectors (an explicit edge color keeps the hollow markers visible)
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=80, facecolors='none', edgecolors='k')

# Plot all the sample points
plt.scatter(x[:, 0], x[:, 1], c=y, cmap=plt.cm.Paired)

plt.axis('tight')
plt.show()

Output:

w: [0.90230696 0.64821811]
a: -1.39198047626
support_vectors_: [[-1.02126202 0.2408932 ]
[-0.46722079 -0.53064123]
[ 0.95144703 0.57998206]]
clf.coef_: [[0.90230696 0.64821811]]

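In the figure below, the circled points are the support vectors; the plot shows the maximum margin and the separating boundary:
[Figure 1: separating line, margin lines, and circled support vectors]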
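The margin in the plot can also be quantified. As a rough sketch on the same random data, the width of the margin for a linear SVM is 2 / ||w||, and every support vector that lies exactly on a margin line has a decision value of about ±1. The variable names margin_width and sv_scores are introduced here for illustration only.

```python
import numpy as np
from sklearn import svm

# Same random data as in the example above
np.random.seed(0)
x = np.r_[np.random.randn(20, 2) - [2, 2], np.random.randn(20, 2) + [2, 2]]
y = [0] * 20 + [1] * 20

clf = svm.SVC(kernel='linear').fit(x, y)
w = clf.coef_[0]

# Distance between the two dashed margin lines
margin_width = 2 / np.linalg.norm(w)

# Decision values of the support vectors (close to -1 or +1 on the margin)
sv_scores = clf.decision_function(clf.support_vectors_)

print("margin width:", margin_width)
print("support vector scores:", sv_scores)
```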

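3. Linearly non-separable example (face recognition)

A kernel function serves two purposes: it implicitly maps the data from a low-dimensional space into a higher-dimensional one, and it greatly reduces the cost of computing inner products, because the inner product in the high-dimensional space is evaluated directly from the original vectors.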
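A small numeric check makes the second point concrete. The sketch below uses the degree-2 polynomial kernel k(u, v) = (u·v)² on 2-D vectors as an assumed example (the face recognition code itself uses an RBF kernel): the kernel value computed in the original space equals the inner product of the explicitly mapped 3-D feature vectors. The helper names phi and poly_kernel are made up for this illustration.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (maps into 3-D)."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def poly_kernel(u, v):
    """Homogeneous polynomial kernel of degree 2, computed in the 2-D space."""
    return np.dot(u, v) ** 2

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])

explicit = np.dot(phi(u), phi(v))   # inner product after mapping to 3-D
implicit = poly_kernel(u, v)        # same value without ever mapping

print(explicit, implicit)
```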

Code:

#!/usr/bin/env python
# coding=utf-8
# Face recognition with an SVM

from __future__ import print_function

from time import time
import logging
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA
from sklearn.svm import SVC


# Display progress logs on stdout
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')


###############################################################################
# Download the data, if not already on disk and load it as numpy arrays

lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

# introspect the images arrays to find the shapes (for plotting)
n_samples, h, w = lfw_people.images.shape

# for machine learning we use the flattened pixel data directly (relative
# pixel position info is ignored by this model)
X = lfw_people.data
n_features = X.shape[1]

# the label to predict is the id of the person
y = lfw_people.target
target_names = lfw_people.target_names
n_classes = target_names.shape[0]

print("Total dataset size:")
print("n_samples: %d" % n_samples)
print("n_features: %d" % n_features)
print("n_classes: %d" % n_classes)


###############################################################################
# Split into a training set and a test set (random 75% / 25% split)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25)


###############################################################################
# Compute a PCA (eigenfaces) on the face dataset (treated as unlabeled
# dataset): unsupervised feature extraction / dimensionality reduction
n_components = 150

print("Extracting the top %d eigenfaces from %d faces"
      % (n_components, X_train.shape[0]))
t0 = time()
# RandomizedPCA was removed from scikit-learn; PCA with the randomized
# SVD solver replaces it
pca = PCA(n_components=n_components, svd_solver='randomized',
          whiten=True).fit(X_train)
print("done in %0.3fs" % (time() - t0))

eigenfaces = pca.components_.reshape((n_components, h, w))

print("Projecting the input data on the eigenfaces orthonormal basis")
t0 = time()
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
print("done in %0.3fs" % (time() - t0))


###############################################################################
# Train a SVM classification model

print("Fitting the classifier to the training set")
t0 = time()
param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1], }
clf = GridSearchCV(SVC(kernel='rbf', class_weight= None), param_grid)
clf = clf.fit(X_train_pca, y_train)
print("done in %0.3fs" % (time() - t0))
print("Best estimator found by grid search:")
print(clf.best_estimator_)


###############################################################################
# Quantitative evaluation of the model quality on the test set

print("Predicting people's names on the test set")
t0 = time()
y_pred = clf.predict(X_test_pca)
print("done in %0.3fs" % (time() - t0))

print(classification_report(y_test, y_pred, target_names=target_names))
print(confusion_matrix(y_test, y_pred, labels=range(n_classes)))


###############################################################################
# Qualitative evaluation of the predictions using matplotlib

def plot_gallery(images, titles, h, w, n_row=3, n_col=4):
    """Helper function to plot a gallery of portraits"""
    plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())


# plot the result of the prediction on a portion of the test set

def title(y_pred, y_test, target_names, i):
    pred_name = target_names[y_pred[i]].rsplit(' ', 1)[-1]
    true_name = target_names[y_test[i]].rsplit(' ', 1)[-1]
    return 'predicted: %s\ntrue:      %s' % (pred_name, true_name)

prediction_titles = [title(y_pred, y_test, target_names, i)
                     for i in range(y_pred.shape[0])]

plot_gallery(X_test, prediction_titles, h, w)

# plot the gallery of the most significative eigenfaces

eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)

plt.show()

Output:


Total dataset size:
n_samples: 1288
n_features: 1850
n_classes: 7
Extracting the top 150 eigenfaces from 966 faces
done in 0.089s
Projecting the input data on the eigenfaces orthonormal basis
done in 0.011s
Fitting the classifier to the training set
done in 16.597s
Best estimator found by grid search:
SVC(C=1000.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.005, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
Predicting people's names on the test set
done in 0.056s
                   precision    recall  f1-score   support

     Ariel Sharon       0.91      0.77      0.83        13
     Colin Powell       0.78      0.88      0.83        58
  Donald Rumsfeld       0.90      0.69      0.78        26
    George W Bush       0.86      0.99      0.92       133
Gerhard Schroeder       0.92      0.83      0.87        29
      Hugo Chavez       1.00      0.50      0.67        20
       Tony Blair       0.94      0.79      0.86        43

      avg / total       0.88      0.87      0.86       322

[[ 10   1   0   2   0   0   0]
 [  1  51   1   5   0   0   0]
 [  0   3  18   3   2   0   0]
 [  0   1   0 132   0   0   0]
 [  0   2   0   2  24   0   1]
 [  0   4   0   5   0  10   1]
 [  0   3   1   5   0   0  34]]
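One design point worth noting: the script above fits PCA once on the whole training set and then runs the grid search on the projected data. A possible alternative, sketched below, wraps PCA and SVC in a Pipeline so that the grid search refits PCA inside every cross-validation fold. This is an illustration rather than the method used in this post, and it makes several assumptions: it uses the small built-in digits dataset as a stand-in for LFW (so nothing is downloaded), and n_components=30, the parameter grid, and cv=3 are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Stand-in dataset (assumption): 8x8 digit images instead of LFW faces
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# PCA and the RBF SVM are chained, so PCA is refitted on each CV training fold
pipe = Pipeline([
    ('pca', PCA(n_components=30, whiten=True, random_state=0)),
    ('svc', SVC(kernel='rbf')),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    'svc__C': [1, 10, 100],
    'svc__gamma': [0.001, 0.01, 0.1],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("best parameters:", search.best_params_)
print("test accuracy:", test_accuracy)
```

The fitted search object can be used like clf in the script above: search.predict(X_test) applies the PCA projection and the SVM in one call.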

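Finally, here is a discussion of how to set the class_weight parameter in SVC(kernel='rbf', class_weight=None). It comes from a Zhihu user and is very clear, so it is included here for reference.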

[Figure 2: screenshot of the Zhihu discussion on class_weight]

[Figure 3: screenshot of the Zhihu discussion on class_weight, continued]
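Independently of that discussion, a general illustration of what class_weight does in scikit-learn may help. With class_weight=None every class uses the same penalty C; with class_weight='balanced' the C of class i is multiplied by n_samples / (n_classes * count_i), so rarer classes are penalized more heavily when misclassified. The sketch below computes these weights for an assumed, made-up label vector with a 90/10 imbalance and compares them with the formula.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels (assumption): 90 samples of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# Weights scikit-learn derives for class_weight='balanced'
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)

# The same weights from the formula n_samples / (n_classes * count_per_class)
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes, weights)))
print(manual)
```

Passing class_weight='balanced' to SVC applies these multipliers to C internally, so no manual computation is needed in practice.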
