I'm a junior undergraduate (not from a 985/211 university), and this is my first blog post. It summarizes what I've learned recently about the SVM (support vector machine) classifier and the PCA dimensionality-reduction algorithm. If you spot any mistakes, please point them out. If you repost this article, please credit the source. Thanks!
A support vector machine (SVM) is an excellent binary classification algorithm. Its core idea is to find a hyperplane that separates the sample data with the maximum margin. If the samples have only two features, this hyperplane is simply a straight line that splits the samples apart.
SVM searches for the optimal decision boundaries (the two dashed lines in the figure below) that pin down the classification line b. The support vectors are the sample points closest to the decision boundary (points p1, p2, and p3 in the figure). Without these support vectors propping it up, the position of line b would change. In short, SVM maximizes the margin based on these support vector points and thereby finds the optimal classifier (here, a straight line).
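To make this concrete, here is a minimal sketch (not part of the original pipeline): fit a linear SVC on a toy two-class dataset and inspect the support vectors that pin down the margin. The blob centers and seed are arbitrary illustration choices.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X_toy, y_toy = make_blobs(n_samples=60, centers=2, random_state=6)
clf = SVC(kernel="linear", C=1000)   # a large C approximates a hard margin
clf.fit(X_toy, y_toy)
print(clf.support_vectors_)          # the points closest to the decision boundary
print(clf.coef_, clf.intercept_)     # w and b of the separating line w·x + b = 0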
SVM's objective function (loss function):
$$\min_{w,b}\ \frac{1}{2}\lVert w\rVert^2 \quad \text{s.t.}\quad y_i\left(w^{T} x_i + b\right) \ge 1,\quad i = 1, 2, \dots, n$$
Once the objective is known, the values of w and b that define the separating line can be computed via the Lagrange multiplier method, the KKT conditions, the dual-problem transformation, and the SMO algorithm. With w and b determined, we can construct the maximum-margin separating hyperplane and perform classification.
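A small sketch of how the dual solution relates to w: for a linear kernel, w = Σᵢ αᵢyᵢxᵢ, and scikit-learn exposes the products αᵢyᵢ as dual_coef_, so w can be reconstructed from the support vectors alone (the toy data below is again an arbitrary choice):
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X_toy, y_toy = make_blobs(n_samples=60, centers=2, random_state=6)
clf = SVC(kernel="linear").fit(X_toy, y_toy)
w_from_dual = clf.dual_coef_ @ clf.support_vectors_   # Σ (α_i y_i) x_i
print(np.allclose(w_from_dual, clf.coef_))            # True: matches coef_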
Imports
import numpy as np
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.svm import SVC
from sklearn import datasets
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
Loading the data
data = datasets.fetch_lfw_people(resize = 1,min_faces_per_person = 70)
X = data["data"]
y = data["target"]
faces = data["images"]
target_names = data["target_names"]
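A quick optional sanity check on what the loader returned (a sketch, not in the original post). With min_faces_per_person = 70, only people with at least 70 photos are kept, which leaves 7 public figures in the standard LFW release:
print(X.shape, faces.shape)   # flattened pixel vectors vs. 2-D face images
print(target_names)           # the names of the 7 retained people
print(np.bincount(y))         # images per person (note the strong imbalance)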
Dimensionality reduction with PCA
pca = PCA(n_components = 0.95)
X_pca = pca.fit_transform(X)
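With n_components = 0.95, PCA keeps just enough principal components to explain 95% of the pixel variance. A short sketch to check how many components that turned out to be:
print(pca.n_components_)                              # components actually kept
print(pca.explained_variance_ratio_.sum())            # ≈ 0.95 by construction
print(X.shape[1], "->", X_pca.shape[1], "features")   # before vs. after PCA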
Splitting the data
X_train,X_test,y_train,y_test,faces_train,faces_test = train_test_split(X_pca,y,faces)
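Since the LFW classes are heavily imbalanced, a stratified, seeded split is a reasonable variant of the call above (a sketch; stratify and random_state are my additions, not the original code):
X_train, X_test, y_train, y_test, faces_train, faces_test = train_test_split(
    X_pca, y, faces,
    stratify=y,        # keep the per-person class ratios in both halves
    random_state=42,   # make the split reproducible
)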
Hyperparameter tuning
svc = SVC()
params = {"C":np.logspace(-3,3,50),"kernel":["rbf","linear","poly"],"tol":[0.0001,0.01,0.1,1]}
gc = GridSearchCV(estimator = svc,param_grid = params)
gc.fit(X_train, y_train)   # search on the training split only, so the test set stays unseen
gc.best_params_
Out:{'C': 0.001, 'kernel': 'linear', 'tol': 1}
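GridSearchCV refits the best configuration on its training data by default, so the tuned model can be used directly; a sketch of the relevant accessors:
print(gc.best_score_)           # mean cross-validated accuracy of the best params
best_svc = gc.best_estimator_   # already refit with the best params; ready to use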
Building the model (test accuracy is about 83%)
svc = SVC(C = 0.001,kernel = "linear",tol = 1)
svc.fit(X_train,y_train)
y_predict = svc.predict(X_test)
display(svc.score(X_test,y_test),y_predict)
0.8260869565217391
array([1, 2, 1, 3, 3, 3, 3, 3, 3, 4, 3, 2, ...], dtype=int64)   # 322 predicted labels, truncated here
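Accuracy alone hides the class imbalance (George W. Bush dominates LFW), so a per-person breakdown is more informative. A sketch using scikit-learn's classification_report:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predict, target_names=target_names))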
Visualizing the model's predictions
plt.figure(figsize=(5*2, 10*3))
plt.rcParams["font.family"] = "PingFang SC"   # macOS font; swap or drop this line on other systems
for i in range(50):
    ax = plt.subplot(10, 5, i + 1)
    ax.imshow(faces_test[i], cmap="gray")
    ax.axis("off")
    # Show only the surname so the titles stay short
    true_name = target_names[y_test[i]].split(" ")[-1]
    predict_name = target_names[y_predict[i]].split(" ")[-1]
    plt.title("True:%s\nPredict:%s" % (true_name, predict_name))
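As a closing sketch related to the PCA step: since PCA is invertible up to the discarded 5% of variance, the compressed features can be mapped back to pixel space to see what the classifier actually "sees" (the plot layout and sample index are my illustration choices):
h, w = faces.shape[1], faces.shape[2]        # image size taken from the data itself
X_restored = pca.inverse_transform(X_pca)    # back from PCA space to pixel space
plt.figure(figsize=(6, 3))
plt.subplot(1, 2, 1)
plt.imshow(faces[0], cmap="gray")
plt.title("original")
plt.axis("off")
plt.subplot(1, 2, 2)
plt.imshow(X_restored[0].reshape(h, w), cmap="gray")
plt.title("95% variance PCA")
plt.axis("off")
plt.show()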