Statistical Learning Methods: A Python Implementation of SVM Based on the SMO Algorithm
Preface:
Before reading this article, you should have read Chapter 7 ("Support Vector Machines") of Li Hang's Statistical Learning Methods (《统计学习方法》). The SVM here is trained with the Sequential Minimal Optimization (SMO) algorithm. All equation references, e.g. Eq. (7.107), point to that book; Eq. (7.107), for instance, appears on page 127.
The algorithm:
------------------------------------------------------------------------------------------------------------------------
The SMO algorithm
------------------------------------------------------------------------------------------------------------------------
Input:  training set T; maximum number of iterations max_iter; constant C; tolerance epsilon
Output: parameters w and b of the separating hyperplane
(1) Initialize the input parameters max_iter, C and epsilon; read in the training set;
(2) Initialize the alpha vector and the iteration counter iters = 0;
(3) Core loop:
1:  while True:
2:      iters += 1
3:      for j in range(N):              # N is the number of training samples
4:          pick a random i in [0, N) with i != j
5:          i and j index the chosen multipliers a1 and a2; their current values are a1old, a2old
6:          compute eta = kii + kjj - 2*kij, i.e. Eq. (7.107)
7:          if eta == 0: continue       # eta is the denominator of Eq. (7.106), which is needed
                                        # to compute a2newunc, so re-pick when it is zero
8:          compute w and b via Eqs. (7.50) and (7.51) (page 111)
9:          compute Ei and Ej, the differences between predicted and true outputs, Eq. (7.105)
10:         compute a2newunc, Eq. (7.106)
11:         # compute L and H, see the formulas at the bottom of page 126
            if yi != yj:                # yi and yj are the corresponding labels
                L = max(0, a2old - a1old); H = min(C, C + a2old - a1old)
            else:
                L = max(0, a2old + a1old - C); H = min(C, a2old + a1old)
12:         update a2, Eq. (7.108)
13:         update a1, Eq. (7.109)
14:     compute the change diff and test for convergence
            if diff < epsilon:
                break
15:     check the iteration count
            if iters >= max_iter:
                tell the user the iteration limit was reached
                return
16: compute w and b from the final alpha
Implementation:
Now let's implement the algorithm above in Python. Our data set is as follows:
------------------------------------------------------------------------------------------------------------------------
3.542485,1.977398,-1
3.018896,2.556416,-1
7.551510,-1.580030,1
2.114999,-0.004466,-1
8.127113,1.274372,1
7.108772,-0.986906,1
8.610639,2.046708,1
2.326297,0.265213,-1
3.634009,1.730537,-1
0.341367,-0.894998,-1
3.125951,0.293251,-1
2.123252,-0.783563,-1
0.887835,-2.797792,-1
7.139979,-2.329896,1
1.696414,-1.212496,-1
8.117032,0.623493,1
8.497162,-0.266649,1
4.658191,3.507396,-1
8.197181,1.545132,1
1.208047,0.213100,-1
1.928486,-0.321870,-1
2.175808,-0.014527,-1
7.886608,0.461755,1
3.223038,-0.552392,-1
3.628502,2.190585,-1
7.407860,-0.121961,1
7.286357,0.251077,1
2.301095,-0.533988,-1
-0.232542,-0.547690,-1
3.457096,-0.082216,-1
3.023938,-0.057392,-1
8.015003,0.885325,1
8.991748,0.923154,1
7.916831,-1.781735,1
7.616862,-0.217958,1
2.450939,0.744967,-1
7.270337,-2.507834,1
1.749721,-0.961902,-1
1.803111,-0.176349,-1
8.804461,3.044301,1
1.231257,-0.568573,-1
2.074915,1.410550,-1
-0.743036,-1.736103,-1
3.536555,3.964960,-1
8.410143,0.025606,1
7.382988,-0.478764,1
6.960661,-0.245353,1
8.234460,0.701868,1
8.168618,-0.903835,1
1.534187,-0.622492,-1
9.229518,2.066088,1
7.886242,0.191813,1
2.893743,-1.643468,-1
1.870457,-1.040420,-1
5.286862,-2.358286,1
6.080573,0.418886,1
2.544314,1.714165,-1
6.016004,-3.753712,1
0.926310,-0.564359,-1
0.870296,-0.109952,-1
2.369345,1.375695,-1
1.363782,-0.254082,-1
7.279460,-0.189572,1
1.896005,0.515080,-1
8.102154,-0.603875,1
2.529893,0.662657,-1
1.963874,-0.365233,-1
8.132048,0.785914,1
8.245938,0.372366,1
6.543888,0.433164,1
-0.236713,-5.766721,-1
8.112593,0.295839,1
9.803425,1.495167,1
1.497407,-0.552916,-1
1.336267,-1.632889,-1
9.205805,-0.586480,1
1.966279,-1.840439,-1
8.398012,1.584918,1
7.239953,-1.764292,1
7.556201,0.241185,1
9.015509,0.345019,1
8.266085,-0.230977,1
8.545620,2.788799,1
9.295969,1.346332,1
2.404234,0.570278,-1
2.037772,0.021919,-1
1.727631,-0.453143,-1
1.979395,-0.050773,-1
8.092288,-1.372433,1
1.667645,0.239204,-1
9.854303,1.365116,1
7.921057,-1.327587,1
8.500757,1.492372,1
1.339746,-0.291183,-1
3.107511,0.758367,-1
2.609525,0.902979,-1
3.263585,1.367898,-1
2.912122,-0.202359,-1
1.731786,0.589096,-1
2.387003,1.573131,-1
------------------------------------------------------------------------------------------------------------------------------
Steps (1) and (2) of the algorithm:
import csv
import numpy as np

# read the data set
def readData(filename):
    data = []
    with open(filename, 'r') as csvfile:
        spamreader = csv.reader(csvfile, delimiter=',')
        for row in spamreader:
            data.append(row)
    return np.array(data)

max_iter = 1000      # maximum number of iterations
C = 1.0              # the constant C
epsilon = 0.001      # tolerance
filename = 'data/testSet.txt'
data = readData(filename)
data = data.astype(float)
iters = 0            # iteration counter
x, y = data[:, 0:-1], data[:, -1].astype(int)   # points and their labels
n = x.shape[0]
alpha = np.zeros(n)
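As an aside, since every field in the file is numeric, NumPy can also load it in a single call with np.loadtxt, replacing readData plus the astype(float) conversion. This is just an optional sketch, demonstrated on a throwaway two-line file in the same comma-separated format as the data set above:

```python
import numpy as np
import tempfile
import os

# np.loadtxt parses a purely numeric CSV straight into a float array
def load_numeric_csv(filename):
    return np.loadtxt(filename, delimiter=',')

# demo on a temporary file with two rows in the data set's format
tmp = tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False)
tmp.write("3.542485,1.977398,-1\n7.551510,-1.580030,1\n")
tmp.close()
data = load_numeric_csv(tmp.name)   # shape (2, 3), dtype float64
os.unlink(tmp.name)
```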
Now for the core of the algorithm. Since this part is relatively involved, we build it up step by step. First, define a function SVM that takes two arguments, x and y, i.e. the arrays from the code above. Each iteration must compute the change diff (step 14 of the pseudocode), which compares the alpha vector before and after the iteration, so we keep a copy of the previous alpha. We also move the definition of iters inside SVM. The code now looks like this:
import csv
import numpy as np

# read the data set
def readData(filename):
    data = []
    with open(filename, 'r') as csvfile:
        spamreader = csv.reader(csvfile, delimiter=',')
        for row in spamreader:
            data.append(row)
    return np.array(data)

def SVM(x, y):
    N = x.shape[0]
    iters = 0                                       # iteration counter
    while True:
        alpha_prev = np.copy(alpha)                 # alpha is the module-level array below
        iters += 1
        for j in range(0, N):
            pass                                    # to be filled in below
        diff = np.linalg.norm(alpha - alpha_prev)   # the change, i.e. the 2-norm of the difference
        if diff < epsilon:
            break
        if iters >= max_iter:
            return

max_iter = 1000      # maximum number of iterations
C = 1.0              # the constant C
epsilon = 0.001      # tolerance
filename = 'data/testSet.txt'
data = readData(filename)
data = data.astype(float)
x, y = data[:, 0:-1], data[:, -1].astype(int)   # points and their labels
n = x.shape[0]
alpha = np.zeros(n)
SVM(x, y)
Next comes the body of the for loop. First we need a random i in [0, N) (excluding N) with i != j; here is a function for that (remember to import random as rnd):

import random as rnd

# return a random integer i with a <= i <= b and i != z
def rand_int(a, b, z):
    i = z
    while i == z:
        i = rnd.randint(a, b)
    return i
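A quick sanity check on rand_int: over many draws it should stay inside [a, b] and never return z (the function is repeated here so the snippet runs on its own):

```python
import random as rnd

# return a random integer i with a <= i <= b and i != z
def rand_int(a, b, z):
    i = z
    while i == z:
        i = rnd.randint(a, b)
    return i

# 1000 draws from [0, 9] that must all avoid 3
draws = [rand_int(0, 9, 3) for _ in range(1000)]
```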
OK, with i and j in hand we can compute eta. Its formula involves the kernel function; for simplicity we implement a linear kernel, though any other kernel could be substituted.

# linear kernel
def kernel(x1, x2):
    return np.dot(x1, x2.T)
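For two vectors the linear kernel is just their dot product, and because of NumPy broadcasting the same function can also evaluate one point against a whole matrix of points at once. A small illustration:

```python
import numpy as np

# linear kernel (same definition as in the article)
def kernel(x1, x2):
    return np.dot(x1, x2.T)

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])
k = kernel(a, b)                       # 1*3 + 2*4 = 11.0

X = np.array([[1.0, 0.0],
              [0.0, 1.0]])
row = kernel(a, X)                     # a against every row of X
```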
With the kernel in place, eta is easy to compute, and SVM now looks like this:
def SVM(x, y):
    N = x.shape[0]
    iters = 0                                       # iteration counter
    while True:
        alpha_prev = np.copy(alpha)
        iters += 1
        for j in range(0, N):
            i = rand_int(0, N - 1, j)
            # x_i, x_j, y_i, y_j correspond to xi, xj, yi, yj in the book
            x_i, x_j, y_i, y_j = x[i, :], x[j, :], y[i], y[j]
            eta = kernel(x_i, x_i) + kernel(x_j, x_j) - 2 * kernel(x_i, x_j)
            if eta == 0:
                continue
        diff = np.linalg.norm(alpha - alpha_prev)   # the change, i.e. the 2-norm of the difference
        if diff < epsilon:
            break
        if iters >= max_iter:
            return
Next, compute w and b:

w = np.dot(alpha * y, x)
# b is really an interval of valid values, so we take the mean
b = np.mean(y - np.dot(w.T, x.T))
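The one-liner for w is exactly the sum of Eq. (7.50), w = Σ αᵢ yᵢ xᵢ, vectorized: alpha * y scales each label, and the dot product with x sums the scaled rows. A check on invented toy numbers:

```python
import numpy as np

# toy data: two points, two multipliers (all values invented for illustration)
x = np.array([[1.0, 0.0],
              [0.0, 2.0]])
y = np.array([1, -1])
alpha = np.array([0.5, 0.25])

w = np.dot(alpha * y, x)                               # vectorized Eq. (7.50)
w_manual = alpha[0]*y[0]*x[0] + alpha[1]*y[1]*x[1]     # the explicit sum
```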
Then Ei and Ej, computed by a small helper:

# difference between the prediction for x_k and the true output y_k
def E(x_k, y_k, w, b):
    return np.sign(np.dot(w.T, x_k.T) + b).astype(int) - y_k

# compute Ei and Ej
E_i = E(x_i, y_i, w, b)
E_j = E(x_j, y_j, w, b)
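Since both the prediction sign(w·x + b) and the label are ±1, E returns 0 for a correctly classified point and ±2 for a misclassified one. A deterministic example with a hand-picked w and b:

```python
import numpy as np

# difference between the prediction for x_k and the true output y_k
def E(x_k, y_k, w, b):
    return np.sign(np.dot(w.T, x_k.T) + b).astype(int) - y_k

w = np.array([1.0, 0.0])
b = 0.0
e_ok  = E(np.array([2.0, 0.0]), 1, w, b)    # sign(2) = 1, correct  -> 0
e_bad = E(np.array([-2.0, 0.0]), 1, w, b)   # sign(-2) = -1, wrong -> -2
```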
The complete code so far:
import csv
import numpy as np
import random as rnd

# read the data set
def readData(filename):
    data = []
    with open(filename, 'r') as csvfile:
        spamreader = csv.reader(csvfile, delimiter=',')
        for row in spamreader:
            data.append(row)
    return np.array(data)

# return a random integer i with a <= i <= b and i != z
def rand_int(a, b, z):
    i = z
    while i == z:
        i = rnd.randint(a, b)
    return i

# linear kernel
def kernel(x1, x2):
    return np.dot(x1, x2.T)

# difference between the prediction for x_k and the true output y_k
def E(x_k, y_k, w, b):
    return np.sign(np.dot(w.T, x_k.T) + b).astype(int) - y_k

def SVM(x, y):
    N = x.shape[0]
    iters = 0                                       # iteration counter
    w = []
    b = 0
    while True:
        alpha_prev = np.copy(alpha)
        iters += 1
        for j in range(0, N):
            i = rand_int(0, N - 1, j)
            # x_i, x_j, y_i, y_j correspond to xi, xj, yi, yj in the book
            x_i, x_j, y_i, y_j = x[i, :], x[j, :], y[i], y[j]
            eta = kernel(x_i, x_i) + kernel(x_j, x_j) - 2 * kernel(x_i, x_j)
            if eta == 0:
                continue
            w = np.dot(alpha * y, x)
            # b is really an interval of valid values, so we take the mean
            b = np.mean(y - np.dot(w.T, x.T))
            # compute Ei and Ej
            E_i = E(x_i, y_i, w, b)
            E_j = E(x_j, y_j, w, b)
        diff = np.linalg.norm(alpha - alpha_prev)   # the change, i.e. the 2-norm of the difference
        if diff < epsilon:
            break
        if iters >= max_iter:
            return

max_iter = 1000      # maximum number of iterations
C = 1.0              # the constant C
epsilon = 0.001      # tolerance
filename = 'data/testSet.txt'
data = readData(filename)
data = data.astype(float)
x, y = data[:, 0:-1], data[:, -1].astype(int)   # points and their labels
n = x.shape[0]
alpha = np.zeros(n)
SVM(x, y)
OK, at this point only the updates of a2 and a1 and the final computation of w and b remain. Computing w and b was covered above, so we just need the updates. First, the unclipped value a2newunc:

# compute a2newunc
a2old, a1old = alpha[j], alpha[i]
a2newunc = a2old + float(y_j * (E_i - E_j)) / eta
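To make Eq. (7.106) concrete, here is the unclipped update evaluated on made-up numbers (x_i, x_j, the error values, and a2old are all invented for illustration):

```python
import numpy as np

# linear kernel, as above
def kernel(x1, x2):
    return np.dot(x1, x2.T)

x_i = np.array([1.0, 0.0])
x_j = np.array([0.0, 1.0])
eta = kernel(x_i, x_i) + kernel(x_j, x_j) - 2 * kernel(x_i, x_j)   # 1 + 1 - 0 = 2.0

y_j = 1
E_i, E_j = 2, 0        # hypothetical error values
a2old = 0.0
a2newunc = a2old + y_j * (E_i - E_j) / eta    # 0 + 2/2 = 1.0
```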
Then the bounds H and L:

# compute H and L
L, H = 0, 0
if y_i != y_j:
    L = max(0, a2old - a1old)
    H = min(C, C + a2old - a1old)
else:
    L = max(0, a2old + a1old - C)
    H = min(C, a2old + a1old)
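With C = 1, two quick evaluations of the bounds on invented multiplier values, one for each label case:

```python
C = 1.0
a1old, a2old = 0.25, 0.5     # invented old multiplier values

# case y_i != y_j
L1 = max(0, a2old - a1old)          # 0.25
H1 = min(C, C + a2old - a1old)      # 1.0

# case y_i == y_j, same old multipliers
L2 = max(0, a2old + a1old - C)      # 0
H2 = min(C, a2old + a1old)          # 0.75
```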
Finally, clip a2newunc into [L, H] to get the new a2, then update a1 with Eq. (7.109):

alpha[j] = min(max(a2newunc, L), H)               # clip a2newunc into [L, H]
alpha[i] = a1old + y_i * y_j * (a2old - alpha[j])
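Clipping a2newunc into [L, H] can be written as min(max(a2newunc, L), H), which is equivalent to NumPy's np.clip; comparing the two is an easy way to double-check the update:

```python
import numpy as np

L, H = 0.2, 1.0
candidates = (-0.5, 0.6, 1.5)   # below, inside, and above [L, H]

stepwise = [min(max(a, L), H) for a in candidates]        # two-step clip
clipped  = [float(np.clip(a, L, H)) for a in candidates]  # np.clip equivalent
```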
The whole program now looks like this:
import csv
import numpy as np
import random as rnd

# read the data set
def readData(filename):
    data = []
    with open(filename, 'r') as csvfile:
        spamreader = csv.reader(csvfile, delimiter=',')
        for row in spamreader:
            data.append(row)
    return np.array(data)

# return a random integer i with a <= i <= b and i != z
def rand_int(a, b, z):
    i = z
    while i == z:
        i = rnd.randint(a, b)
    return i

# linear kernel
def kernel(x1, x2):
    return np.dot(x1, x2.T)

# difference between the prediction for x_k and the true output y_k
def E(x_k, y_k, w, b):
    return np.sign(np.dot(w.T, x_k.T) + b).astype(int) - y_k

def SVM(x, y):
    N = x.shape[0]
    iters = 0                                       # iteration counter
    w = []
    b = 0
    while True:
        alpha_prev = np.copy(alpha)
        iters += 1
        for j in range(0, N):
            i = rand_int(0, N - 1, j)
            # x_i, x_j, y_i, y_j correspond to xi, xj, yi, yj in the book
            x_i, x_j, y_i, y_j = x[i, :], x[j, :], y[i], y[j]
            eta = kernel(x_i, x_i) + kernel(x_j, x_j) - 2 * kernel(x_i, x_j)
            if eta == 0:
                continue
            w = np.dot(alpha * y, x)
            # b is really an interval of valid values, so we take the mean
            b = np.mean(y - np.dot(w.T, x.T))
            # compute Ei and Ej
            E_i = E(x_i, y_i, w, b)
            E_j = E(x_j, y_j, w, b)
            # compute a2newunc
            a2old, a1old = alpha[j], alpha[i]
            a2newunc = a2old + float(y_j * (E_i - E_j)) / eta
            # compute H and L
            L, H = 0, 0
            if y_i != y_j:
                L = max(0, a2old - a1old)
                H = min(C, C + a2old - a1old)
            else:
                L = max(0, a2old + a1old - C)
                H = min(C, a2old + a1old)
            alpha[j] = min(max(a2newunc, L), H)     # clip a2newunc into [L, H]
            alpha[i] = a1old + y_i * y_j * (a2old - alpha[j])
        diff = np.linalg.norm(alpha - alpha_prev)   # the change, i.e. the 2-norm of the difference
        if diff < epsilon:
            break
        if iters >= max_iter:
            break                                   # give up once the limit is hit
    w = np.dot(alpha * y, x)
    b = np.mean(y - np.dot(w.T, x.T))
    return w, b

max_iter = 1000      # maximum number of iterations
C = 1.0              # the constant C
epsilon = 0.001      # tolerance
filename = 'data/testSet.txt'
data = readData(filename)
data = data.astype(float)
x, y = data[:, 0:-1], data[:, -1].astype(int)   # points and their labels
n = x.shape[0]
alpha = np.zeros(n)
w, b = SVM(x, y)
print(w, b)
To visually verify the classification on this data set, let's draw a plot. We won't walk through this part, since it is only a quick sanity check:
import csv
import numpy as np
import random as rnd
import matplotlib.pyplot as plt

# read the data set
def readData(filename):
    data = []
    with open(filename, 'r') as csvfile:
        spamreader = csv.reader(csvfile, delimiter=',')
        for row in spamreader:
            data.append(row)
    return np.array(data)

# return a random integer i with a <= i <= b and i != z
def rand_int(a, b, z):
    i = z
    while i == z:
        i = rnd.randint(a, b)
    return i

# linear kernel
def kernel(x1, x2):
    return np.dot(x1, x2.T)

# difference between the prediction for x_k and the true output y_k
def E(x_k, y_k, w, b):
    return np.sign(np.dot(w.T, x_k.T) + b).astype(int) - y_k

def SVM(x, y):
    N = x.shape[0]
    iters = 0                                       # iteration counter
    w = []
    b = 0
    while True:
        alpha_prev = np.copy(alpha)
        iters += 1
        for j in range(0, N):
            i = rand_int(0, N - 1, j)
            # x_i, x_j, y_i, y_j correspond to xi, xj, yi, yj in the book
            x_i, x_j, y_i, y_j = x[i, :], x[j, :], y[i], y[j]
            eta = kernel(x_i, x_i) + kernel(x_j, x_j) - 2 * kernel(x_i, x_j)
            if eta == 0:
                continue
            w = np.dot(alpha * y, x)
            # b is really an interval of valid values, so we take the mean
            b = np.mean(y - np.dot(w.T, x.T))
            # compute Ei and Ej
            E_i = E(x_i, y_i, w, b)
            E_j = E(x_j, y_j, w, b)
            # compute a2newunc
            a2old, a1old = alpha[j], alpha[i]
            a2newunc = a2old + float(y_j * (E_i - E_j)) / eta
            # compute H and L
            L, H = 0, 0
            if y_i != y_j:
                L = max(0, a2old - a1old)
                H = min(C, C + a2old - a1old)
            else:
                L = max(0, a2old + a1old - C)
                H = min(C, a2old + a1old)
            alpha[j] = min(max(a2newunc, L), H)     # clip a2newunc into [L, H]
            alpha[i] = a1old + y_i * y_j * (a2old - alpha[j])
        diff = np.linalg.norm(alpha - alpha_prev)   # the change, i.e. the 2-norm of the difference
        if diff < epsilon:
            break
        if iters >= max_iter:
            print("Iteration number exceeded the max of %d iterations" % max_iter)
            break
    w = np.dot(alpha * y, x)
    b = np.mean(y - np.dot(w.T, x.T))
    return w, b

# scatter plot of the data and the separating line
def plotFeature(dataMat, labelMat, w, b):
    plt.figure(figsize=(8, 6), dpi=80)
    x = []
    y = []
    l = []
    for data in dataMat:
        x.append(data[0])
        y.append(data[1])
    for label in labelMat:
        if label > 0:
            l.append('r')
        else:
            l.append('b')
    plt.scatter(x, y, marker='o', color=l, s=15)
    # the separating hyperplane w[0]*x + w[1]*y + b = 0, drawn from x = 0 to x = 10
    x1 = 0
    x2 = 10
    y1 = -b / w[1]
    y2 = (-b - w[0] * x2) / w[1]
    lines = plt.plot([x1, x2], [y1, y2])
    lines[0].set_color('green')
    lines[0].set_linewidth(2.0)
    plt.show()

max_iter = 1000      # maximum number of iterations
C = 1.0              # the constant C
epsilon = 0.001      # tolerance
filename = 'data/testSet.txt'
data = readData(filename)
data = data.astype(float)
x, y = data[:, 0:-1], data[:, -1].astype(int)   # points and their labels
n = x.shape[0]
alpha = np.zeros(n)
w, b = SVM(x, y)
print(w, b)
plotFeature(x, y, w, b)
The result is a scatter plot of the two classes with the learned green separating line between them.
Additional notes:
The algorithm works for higher-dimensional data sets as well, but the plotting function only applies to this two-dimensional one; the plot is there purely to sanity-check the algorithm.
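To check the model numerically rather than visually (and in any number of dimensions), one can compare sign(x·w + b) against the labels. The predict/accuracy helpers below are not part of the code above, just a sketch, exercised here on a tiny hand-made separable set with a known w and b:

```python
import numpy as np

# decision rule of the separating hyperplane: sign(w.x + b)
def predict(x, w, b):
    return np.sign(np.dot(x, w) + b).astype(int)

# fraction of points whose predicted label matches the true label
def accuracy(x, y, w, b):
    return np.mean(predict(x, w, b) == y)

# tiny separable toy set: classes split by the vertical line x0 = 5
x = np.array([[8.0, 1.0], [7.0, -1.0], [2.0, 0.5], [1.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 0.0]), -5.0
acc = accuracy(x, y, w, b)      # 1.0 on this toy set
```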
------------------------------------------------------------------------------------------------------------------------