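The support vector machine (SVM) was first proposed by Corinna Cortes and Vapnik in 1995. It shows particular advantages on small-sample, nonlinear, and high-dimensional pattern recognition problems, and it can be extended to other machine learning tasks such as function fitting.
In machine learning, support vector machines (SVMs, also called support-vector networks) are supervised learning models with associated learning algorithms that analyze data and recognize patterns; they are used for classification and regression analysis. (Source: Baidu)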
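Separating hyperplane: the boundary that splits the dataset into classes; for two-dimensional data it is a straight line.
Hyperplane: if the dataset is N-dimensional, an (N-1)-dimensional object is needed to split it. That object is called a hyperplane, and it is the decision boundary of the classifier.
Margin:
The distance from a point to the separating surface is called the margin of that point with respect to the surface.
Twice the smallest margin over all points in the dataset is called the margin of the classifier (or of the dataset).
Maximum margin:
An SVM classifier looks for the separating hyperplane that gives the largest dataset margin.
Support vectors: the points that lie on the two margin hyperplanes on either side of the decision boundary are called support vectors.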
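The definitions above can be checked by hand on a toy dataset. The following is a minimal sketch, assuming a hypothetical separating line x1 + x2 - 4 = 0 and four made-up points; it computes each point's margin, the dataset margin, and which points act as support vectors.

```python
import math

def distance_to_hyperplane(w, b, point):
    """Distance from `point` to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, point))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm

# Hypothetical separating line x1 + x2 - 4 = 0 and made-up sample points.
w, b = [1.0, 1.0], -4.0
points = [[1.0, 1.0], [0.0, 0.0], [3.0, 3.0], [5.0, 4.0]]

# Margin of every point, then the dataset margin (twice the smallest one).
distances = [distance_to_hyperplane(w, b, p) for p in points]
dataset_margin = 2 * min(distances)

# Points at the minimum distance play the role of support vectors.
support = [p for p, d in zip(points, distances)
           if math.isclose(d, min(distances))]

print(dataset_margin)
print(support)
```

An SVM solver searches over w and b to make this dataset margin as large as possible; here w and b are fixed by hand purely for illustration.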
Reference: http://www.th7.cn/Program/Python/201605/849859.shtml
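1) Finding the optimal classification boundary:
Correct: most samples are assigned to the right class.
Generalizable: the distance between the support vectors is maximized.
Fair: the boundary is equidistant from the support vectors on both sides.
Simple: linear, that is, a straight line or plane serving as the separating hyperplane.
2) Dimension-raising transformation based on kernel functions:
A feature transformation known as a kernel function adds new features, so that a problem that is not linearly separable in the low-dimensional space becomes linearly separable in a higher-dimensional space.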
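A small sketch of the dimension-raising idea, using made-up one-dimensional data: no single threshold on x separates the classes, but a threshold on the added feature x^2 does. The helper name separable_by_threshold is hypothetical and only serves this illustration.

```python
# One-dimensional toy data: class 1 sits on both sides of class 0,
# so no single threshold on x separates the two classes.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1, 1, 0, 0, 0, 1, 1]

def separable_by_threshold(values, labels):
    """True if one threshold on `values` splits the two classes."""
    ordered = [label for _, label in sorted(zip(values, labels))]
    # Separable when the sorted labels switch class at most once.
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1

# Raise the dimension by adding the squared feature.
lifted = [x * x for x in xs]

original_ok = separable_by_threshold(xs, ys)
lifted_ok = separable_by_threshold(lifted, ys)
print(original_ok, lifted_ok)
```

A kernel SVM achieves the same effect without building the new features explicitly; the explicit x^2 column here just makes the idea visible.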
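C: penalty parameter (default=1.0).
kernel: kernel function (default='rbf').
Must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed', or a user-defined callable.
degree: degree (default=3).
The degree of the polynomial kernel function.
gamma: float, optional (default='auto' before version 0.22). 'auto' means 1/n_features, that is, 1 divided by the number of features.
Kernel coefficient for 'rbf', 'poly', and 'sigmoid'. In scikit-learn releases before 0.22 the default was 'auto', which uses 1 / n_features; from 0.22 onward the default is 'scale', which uses 1 / (n_features * X.var()) (older documentation gave this as 1 / (n_features * X.std())). During the transition, 'auto_deprecated' behaved the same as 'auto' and indicated that no explicit gamma value had been passed.
gamma must be greater than 0. As gamma grows, the model may classify the training set well but the test set poorly, which is overfitting. In the original figure (not reproduced here), accuracy was highest at gamma=0.01.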
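To make the two gamma settings concrete, here is a minimal sketch that evaluates both formulas by hand on a made-up 3x2 feature matrix. It assumes the 'scale' definition 1 / (n_features * X.var()), where the variance is taken over all entries of X.

```python
from statistics import pvariance

# Made-up feature matrix: 3 samples, 2 features.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
n_features = len(X[0])

# 'auto': depends only on the number of features.
gamma_auto = 1.0 / n_features

# 'scale': also divides by the population variance of all entries of X.
flat = [value for row in X for value in row]
gamma_scale = 1.0 / (n_features * pvariance(flat))

print(gamma_auto)
print(gamma_scale)
```

Because 'scale' depends on the spread of the data, rescaling the features changes the effective gamma, which is one reason feature standardization matters for RBF kernels.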
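class_weight: class balancing, {dict, 'balanced'}.
For SVC, the parameter C of class i is set to class_weight[i]*C. If not given, all classes are assumed to have weight 1. The 'balanced' mode uses the values of y to adjust the weights automatically, inversely proportional to the class frequencies in the input data.
For training samples whose class proportions are severely imbalanced, set the balancing parameter class_weight='balanced'. The aim is to strengthen the influence of the minority-class samples through the weight assignment, and so improve the classifier's prediction accuracy.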
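As a rough illustration of "inversely proportional to the class frequencies", the sketch below applies the formula n_samples / (n_classes * count_of_class) to made-up labels. The label list and variable names are assumptions for the example only.

```python
from collections import Counter

# Made-up imbalanced labels: 8 samples of class 0, 2 samples of class 1.
y = [0] * 8 + [1] * 2

counts = Counter(y)
n_samples = len(y)
n_classes = len(counts)

# Balanced weight per class: n_samples / (n_classes * class count).
weights = {label: n_samples / (n_classes * count)
           for label, count in counts.items()}

# Effective penalty per class when the base penalty is C.
C = 1.0
effective_C = {label: weight * C for label, weight in weights.items()}
print(weights)
```

The minority class ends up with the larger weight, so misclassifying one of its samples costs more during training.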
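Linear kernel: linear. No kernel-based dimension raising is performed; a linear classification boundary is sought in the original feature space only.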
Code:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm
import matplotlib.pyplot as mp

# Load samples: comma-separated features per line, label in the last column
x, y = [], []
with open('../../data/multiple2.txt', 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr
                in line.split(',')]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y, dtype=int)
train_x, test_x, train_y, test_y = \
    ms.train_test_split(
        x, y, test_size=0.25, random_state=5)
# SVM classifier with a linear kernel
model = svm.SVC(kernel='linear')
model.fit(train_x, train_y)
# Dense grid over the feature space, used to draw the decision regions
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h),
                     np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
# Evaluate on the held-out test set
pred_test_y = model.predict(test_x)
cr = sm.classification_report(test_y, pred_test_y)
print(cr)
mp.figure('SVM Linear Classification',
          facecolor='lightgray')
mp.title('SVM Linear Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y,
              cmap='gray')
mp.scatter(test_x[:, 0], test_x[:, 1], c=test_y,
           cmap='brg', s=80)
mp.show()
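Polynomial kernel: poly. Adds higher powers of the original sample features by means of a polynomial function.
x1 x2 -> y
x1 x2 x1^2 x1x2 x2^2 -> y (degree-2 polynomial dimension raising)
x1 x2 x1^3 x1^2x2 x1x2^2 x2^3 -> y (degree-3 polynomial dimension raising)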
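The feature lists above can be generated mechanically. The sketch below builds them explicitly for one made-up sample; the helper name monomials is hypothetical. Note that a polynomial-kernel SVM does not materialize these features; it only evaluates the kernel between pairs of samples, so this is purely an illustration of what the expansion contains.

```python
from itertools import combinations_with_replacement

def monomials(sample, degree):
    """All products of `degree` features chosen with repetition."""
    terms = []
    for combo in combinations_with_replacement(range(len(sample)), degree):
        value = 1.0
        for index in combo:
            value *= sample[index]
        terms.append(value)
    return terms

sample = [2.0, 3.0]  # made-up values for x1, x2

# x1 x2 x1^2 x1x2 x2^2
degree2 = sample + monomials(sample, 2)
# x1 x2 x1^3 x1^2x2 x1x2^2 x2^3
degree3 = sample + monomials(sample, 3)

print(degree2)
print(degree3)
```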
Code:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm
import matplotlib.pyplot as mp

# Load samples: comma-separated features per line, label in the last column
x, y = [], []
with open('../../data/multiple2.txt', 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr
                in line.split(',')]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y, dtype=int)
train_x, test_x, train_y, test_y = \
    ms.train_test_split(
        x, y, test_size=0.25, random_state=5)
# SVM classifier with a degree-3 polynomial kernel
model = svm.SVC(kernel='poly', degree=3)
model.fit(train_x, train_y)
# Dense grid over the feature space, used to draw the decision regions
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h),
                     np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
# Evaluate on the held-out test set
pred_test_y = model.predict(test_x)
cr = sm.classification_report(test_y, pred_test_y)
print(cr)
mp.figure('SVM Polynomial Classification',
          facecolor='lightgray')
mp.title('SVM Polynomial Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y,
              cmap='gray')
mp.scatter(test_x[:, 0], test_x[:, 1], c=test_y,
           cmap='brg', s=80)
mp.show()
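Radial basis function (RBF) kernel: rbf. Augments the original sample features by means of a Gaussian function of the distance between samples.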
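A minimal sketch of the Gaussian function behind the RBF kernel, K(a, b) = exp(-gamma * ||a - b||^2), evaluated on made-up points. The function name rbf_kernel and the sample coordinates are assumptions for illustration.

```python
import math

def rbf_kernel(a, b, gamma):
    """Gaussian similarity exp(-gamma * squared distance) of two points."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)

gamma = 0.01
same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma)  # identical points
near = rbf_kernel([1.0, 2.0], [2.0, 3.0], gamma)  # squared distance 2
far = rbf_kernel([1.0, 2.0], [9.0, 9.0], gamma)   # squared distance 113

print(same, near, far)
```

The similarity falls from 1 toward 0 as points move apart, and a larger gamma makes it fall faster, which is why large gamma values tend to produce tightly fitted, overfit boundaries.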
Code:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm
import matplotlib.pyplot as mp

# Load samples: comma-separated features per line, label in the last column
x, y = [], []
with open('../../data/multiple2.txt', 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr
                in line.split(',')]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y, dtype=int)
train_x, test_x, train_y, test_y = \
    ms.train_test_split(
        x, y, test_size=0.25, random_state=5)
# SVM classifier with a radial basis function kernel
model = svm.SVC(kernel='rbf', C=600, gamma=0.01)
model.fit(train_x, train_y)
# Dense grid over the feature space, used to draw the decision regions
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h),
                     np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
# Evaluate on the held-out test set
pred_test_y = model.predict(test_x)
cr = sm.classification_report(test_y, pred_test_y)
print(cr)
mp.figure('SVM RBF Classification',
          facecolor='lightgray')
mp.title('SVM RBF Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y,
              cmap='gray')
mp.scatter(test_x[:, 0], test_x[:, 1], c=test_y,
           cmap='brg', s=80)
mp.show()
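…, class_weight='balanced', …
Balancing the class weights gives a higher weight to samples of the less frequent classes and a lower weight to samples of the more frequent classes. This evens out the contribution of the different classes to the classification model and improves its performance.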
Code:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm
import matplotlib.pyplot as mp

# Load samples: comma-separated features per line, label in the last column
x, y = [], []
with open('../../data/imbalance.txt', 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr
                in line.split(',')]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y, dtype=int)
train_x, test_x, train_y, test_y = \
    ms.train_test_split(
        x, y, test_size=0.25, random_state=5)
# SVM classifier with balanced class weights
model = svm.SVC(kernel='rbf', C=100, gamma=1,
                class_weight='balanced')
model.fit(train_x, train_y)
# Dense grid over the feature space, used to draw the decision regions
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h),
                     np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
# Evaluate on the held-out test set
pred_test_y = model.predict(test_x)
cr = sm.classification_report(test_y, pred_test_y)
print(cr)
mp.figure('SVM Balanced Classification',
          facecolor='lightgray')
mp.title('SVM Balanced Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y,
              cmap='gray')
mp.scatter(test_x[:, 0], test_x[:, 1], c=test_y,
           cmap='brg', s=80)
mp.show()
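Based on how far a sample lies from the classification boundary, the reliability of its predicted class can be quantified: the farther a sample is from the boundary, the higher its confidence probability; the closer it is, the lower.
Pass probability=True when constructing the model.
model.predict_proba(input sample matrix) -> confidence probability matrix
Prediction result (returned by model.predict()):
Sample 1  Class 1
Sample 2  Class 1
Sample 3  Class 2
Confidence probability matrix:
          Class 1  Class 2
Sample 1  0.8      0.2
Sample 2  0.9      0.1
Sample 3  0.4      0.6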
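One common way to turn a signed distance from the boundary into a confidence probability is a sigmoid (Platt scaling), which is the approach scikit-learn's SVC uses for two-class problems when probability=True. The sketch below shows only the shape of that mapping; the coefficients a and b are made-up values, whereas a real model fits them on held-out decision values.

```python
import math

def platt_probability(decision_value, a=-2.0, b=0.0):
    """Sigmoid mapping from a signed boundary distance to a probability.

    a and b are hypothetical coefficients chosen for illustration.
    """
    return 1.0 / (1.0 + math.exp(a * decision_value + b))

# Made-up decision values: on the boundary, near it, and far from it.
decision_values = [0.0, 0.5, 2.0]
probs = [platt_probability(d) for d in decision_values]

# Two-column confidence matrix: [P(positive class), P(negative class)].
matrix = [[p, 1.0 - p] for p in probs]
for row in matrix:
    print(row)
```

A sample exactly on the boundary gets probability 0.5, and the probability approaches 1 as the sample moves deeper into its class region.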
Code: svm_prob.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm
import matplotlib.pyplot as mp
# x and y are assumed to be loaded as in the earlier examples
train_x, test_x, train_y, test_y = \
    ms.train_test_split(
        x, y, test_size=0.25, random_state=5)
# SVM classifier that can compute confidence probabilities
model = svm.SVC(kernel='rbf', C=600, gamma=0.01,
                probability=True)
model.fit(train_x, train_y)
...
# New samples whose class and confidence are to be predicted
prob_x = np.array([
    [2, 1.5],
    [8, 9],
    [4.8, 5.2],
    [4, 4],
    [2.5, 7],
    [7.6, 2],
    [5.4, 5.9]])
print(prob_x)
pred_prob_y = model.predict(prob_x)
print(pred_prob_y)
# One row per sample, one column per class
probs = model.predict_proba(prob_x)
print(probs)
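Grid search is a simpler and more powerful model optimization strategy than validation curves.
Parameter combination list := [{parameter name: [list of values], …}, …]
model = ms.GridSearchCV(model object, parameter combination list, cv=number of folds)
model.fit(inputs, outputs)
(1) For every hyperparameter combination in the list, run cv-fold cross-validation and obtain the mean score. Unless scoring is specified, this is the estimator's default score, which is accuracy for SVC; pass scoring='f1_weighted' to rank by F1 instead.
(2) Configure the model object with the hyperparameter combination that achieved the best mean score.
(3) Train the model object on the complete dataset passed to fit.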
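To see what "every hyperparameter combination" amounts to, the sketch below expands a parameter combination list into individual settings with the standard library. The helper name expand is hypothetical; the grid values are sample values of the kind passed to GridSearchCV.

```python
from itertools import product

# Sample parameter combination list: one dict per kernel family.
params = [
    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
    {'kernel': ['poly'], 'C': [1], 'degree': [2, 3]},
    {'kernel': ['rbf'], 'C': [1, 10, 100, 1000],
     'gamma': [1, 0.1, 0.01, 0.001]}]

def expand(grid):
    """Yield every combination of values in one parameter dict."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

combinations = [combo for grid in params for combo in expand(grid)]

cv = 5
n_fits = len(combinations) * cv  # model fits before the final refit
print(len(combinations), n_fits)
```

Each dict is expanded independently, so the total is the sum of the per-dict products rather than one large product; the fit count grows quickly as values are added.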
import numpy as np
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm
import matplotlib.pyplot as mp
...
params = [
    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
    {'kernel': ['poly'], 'C': [1], 'degree': [2, 3]},
    {'kernel': ['rbf'], 'C': [1, 10, 100, 1000],
     'gamma': [1, 0.1, 0.01, 0.001]}]
# Grid search for the best hyperparameters
model = ms.GridSearchCV(svm.SVC(probability=True), params, cv=5)
model.fit(train_x, train_y)
# Mean cross-validation score of every hyperparameter combination
for param, score in zip(model.cv_results_['params'],
                        model.cv_results_['mean_test_score']):
    print(param, score)
print(model.best_params_)
print(model.best_score_)
print(model.best_estimator_)
# Evaluate the refitted best model on the test set
pred_test_y = model.predict(test_x)
cr = sm.classification_report(test_y, pred_test_y)
print(cr)
...
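Not every string feature can be encoded with a label encoder. If the strings stand for numeric values, so that they carry an ordering (greater-than/less-than) meaning, a label encoder must not be used and you have to write your own preprocessing function.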
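The problem is easy to reproduce without any library. A label encoder numbers the sorted distinct values, and strings sort character by character, so numeric strings receive codes that contradict their numeric order. The sketch below mimics that behavior on made-up values; it is an illustration, not the library's implementation.

```python
# Made-up numeric strings, e.g. head counts read from a text file.
values = ['9', '10', '100', '21']

# Label-encoder-style coding: index into the sorted distinct strings.
classes = sorted(set(values))
label_codes = [classes.index(v) for v in values]

# Direct integer conversion keeps the real numeric ordering.
int_codes = [int(v) for v in values]

print(classes)
print(label_codes)
print(int_codes)
```

Here '9' receives the largest label code although it is the smallest number, which is why numeric strings are better converted with a custom encoder.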
Core code:
import numpy as np
import sklearn.preprocessing as sp
import sklearn.model_selection as ms
import sklearn.svm as svm


class DigitEncoder():
    """Encoder for numeric strings: converts them to integers."""

    def fit_transform(self, y):
        return y.astype(int)

    def transform(self, y):
        return y.astype(int)

    # Inverse transform
    def inverse_transform(self, y):
        return y.astype(str)


data = []
with open('events.txt', 'r') as f:
    for line in f.readlines():
        data.append(line.rstrip('\n').split(','))
# Keep the raw values of column 2, then drop column 1 and transpose
z = np.array(data)[:, 2]
data = np.delete(np.array(data).T, 1, 0)
encoders, x = [], []
for row in range(len(data)):
    # Numeric strings are converted directly; other strings are label-encoded
    if data[row][0].isdigit():
        encoder = DigitEncoder()
    else:
        encoder = sp.LabelEncoder()
    # The last row is the output; all other rows are input features
    if row < len(data) - 1:
        x.append(encoder.fit_transform(data[row]))
    else:
        y = encoder.fit_transform(data[row])
    encoders.append(encoder)
x = np.array(x).T
Code:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.preprocessing as sp
import sklearn.model_selection as ms
import sklearn.svm as svm


class DigitEncoder():
    """Encoder for numeric strings: converts them to integers."""

    def fit_transform(self, y):
        return y.astype(int)

    def transform(self, y):
        return y.astype(int)

    # Inverse transform
    def inverse_transform(self, y):
        return y.astype(str)


data = []
# Binary classification
# with open(r'C:\Users\Cs\Desktop\机器学习\ML\data\event.txt', 'r') as f:
# Multiclass classification
with open(r'C:\Users\Cs\Desktop\机器学习\ML\data\events.txt', 'r') as f:
    for line in f.readlines():
        data.append(line.rstrip('\n').split(','))
# Keep the raw values of column 2, then drop column 1 and transpose
z = np.array(data)[:, 2]
data = np.delete(np.array(data).T, 1, 0)
encoders, x = [], []
for row in range(len(data)):
    # Numeric strings are converted directly; other strings are label-encoded
    if data[row][0].isdigit():
        encoder = DigitEncoder()
    else:
        encoder = sp.LabelEncoder()
    # The last row is the output; all other rows are input features
    if row < len(data) - 1:
        x.append(encoder.fit_transform(data[row]))
    else:
        y = encoder.fit_transform(data[row])
    encoders.append(encoder)
x = np.array(x).T
# Map each raw column-2 string to its encoded value; a list is used
# because a zip object can only be iterated once
z = list(zip(z, x[:, 1]))
w = {}
for i in z:
    w[i[0]] = i[1]
print(w)
for i in z:
    print(i)
train_x, test_x, train_y, test_y = \
    ms.train_test_split(x, y, test_size=0.25,
                        random_state=5)
model = svm.SVC(kernel='rbf',
                class_weight='balanced')
# Mean accuracy over 3-fold cross-validation on the training set
print(ms.cross_val_score(
    model, train_x, train_y, cv=3,
    scoring='accuracy').mean())
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
# Accuracy on the test set
print((pred_test_y == test_y).sum() /
      pred_test_y.size)
# Predict the class of a new sample, reusing the fitted encoders
data = [['Tuesday', '12:30:00', '21', '23']]
data = np.array(data).T
x = []
for row in range(len(data)):
    encoder = encoders[row]
    x.append(encoder.transform(data[row]))
x = np.array(x).T
pred_y = model.predict(x)
print(encoders[-1].inverse_transform(pred_y))
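A regressor cannot be cross-validated with a classification metric such as accuracy; use a regression score such as R² (scoring='r2') instead.
Traffic flow prediction (regression)
Code: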
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.preprocessing as sp
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm


class DigitEncoder():
    """Encoder for numeric strings: converts them to integers."""

    def fit_transform(self, y):
        return y.astype(int)

    def transform(self, y):
        return y.astype(int)

    def inverse_transform(self, y):
        return y.astype(str)


data = []
# Regression
with open(r'C:\Users\Cs\Desktop\机器学习\ML\data\traffic.txt', 'r') as f:
    for line in f.readlines():
        data.append(line.rstrip('\n').split(','))
data = np.array(data).T
encoders, x = [], []
for row in range(len(data)):
    # Numeric strings are converted directly; other strings are label-encoded
    if data[row][0].isdigit():
        encoder = DigitEncoder()
    else:
        encoder = sp.LabelEncoder()
    # The last row is the output; all other rows are input features
    if row < len(data) - 1:
        x.append(encoder.fit_transform(data[row]))
    else:
        y = encoder.fit_transform(data[row])
    encoders.append(encoder)
x = np.array(x).T
train_x, test_x, train_y, test_y = \
    ms.train_test_split(x, y, test_size=0.25,
                        random_state=5)
# Support vector regressor. epsilon is the error tolerance: predictions
# within y ± epsilon of the true value incur no loss.
model = svm.SVR(kernel='rbf', C=10, epsilon=0.2)
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
print(sm.r2_score(test_y, pred_test_y))
# Predict the traffic for a new sample, reusing the fitted encoders
data = [['Tuesday', '13:35', 'San Francisco', 'yes']]
data = np.array(data).T
x = []
for row in range(len(data)):
    encoder = encoders[row]
    x.append(encoder.transform(data[row]))
x = np.array(x).T
pred_y = model.predict(x)
# predict returns an array; take its single element before converting
print(int(pred_y[0]))
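The role of the epsilon parameter in the regressor above can be illustrated with the epsilon-insensitive loss, max(0, |y - prediction| - epsilon). The sketch below is a hand-written version on made-up numbers, not the library's internal code.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.2):
    """Zero inside the tube y ± epsilon, linear in the excess outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# Made-up target and predictions.
inside = epsilon_insensitive_loss(10.0, 10.15)   # error 0.15, within the tube
outside = epsilon_insensitive_loss(10.0, 10.5)   # error 0.5, exceeds it by 0.3

print(inside, outside)
```

A larger epsilon widens the tube, so more training samples incur no loss and the fitted function tends to depend on fewer support vectors.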