Classifying CIFAR-10 with an SVM Trained by a Hand-Written SMO Algorithm

The formulas were too tedious to type, so they are included as images. It may not look great, but the content is unchanged.
[Figures 1 to 8: images of the formulas]

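3: Experiments

3.1 Extracting HOG features

The core of this experiment is designing the SVM algorithm, so feature extraction is done with a library function. The key code is: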

from skimage import feature as ft
ft.hog(data[i],feature_vector=True,block_norm='L2-Hys',transform_sqrt=True)
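
Before ft.hog can be applied, each CIFAR-10 row has to become a grayscale image. The following is a minimal sketch of that step, assuming the rows hold 3072 values laid out as three 32x32 colour planes (R, G, B) and using the same luminance weights as data2image in the appendix. The name rows_to_gray is hypothetical and not part of the original code.

```python
import numpy as np

def rows_to_gray(rows):
    """Convert rows of 3072 values (R, G, B planes of 32x32) to grayscale images."""
    rows = np.asarray(rows, dtype=np.float64)
    imgs = rows.reshape(-1, 3, 32, 32)             # (n, channel, height, width)
    weights = np.array([0.2990, 0.5870, 0.1140])   # same weights as data2image
    # Weighted sum over the channel axis gives one 32x32 image per row.
    return np.tensordot(imgs, weights, axes=([1], [0]))

demo = np.full((2, 3072), 255.0)   # two all-white dummy images (made-up data)
gray = rows_to_gray(demo)
print(gray.shape)
```

Each 32x32 array returned here is what would be passed to ft.hog one image at a time.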

3.2 Checking the extracted features with a library SVM

The core code using the library is:

trainmatrix=data2image(trainImg['Data'])
hogtrain=meanlie(feature_hog(trainmatrix))
testmatrix=data2image(testImg['Data'])
hogtest=meanlie(feature_hog(testmatrix))
from sklearn import svm
from skimage import feature as ft
clf=svm.SVC()
clf.fit(hogtrain,trainImg['Label'])
pre=clf.predict(hogtest)

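The classification result is as follows.
[Figure: program output] Here count is the number of correctly classified samples, 4075, out of 5000 test samples, so the accuracy is 0.815. This is without careful parameter tuning, which is a good result and shows that HOG features plus an SVM is a workable approach. Next, the same features are used with the hand-written SVM.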

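3.3 Verifying the hand-written binary classifier

This experiment is a multi-class problem, so the hand-written SVM is built in two steps: first a binary classifier, then a multi-class classifier that combines binary classifiers using the multi-class strategy chosen earlier.
Step one is to verify the binary classifier.
Here I extract classes 6 and 9 from both the training set and the test set, taking 500 training samples per class (only to make it run faster). The binary classifier class is named PlattSMO and is saved as its own module, plattSMO, so it is easy to import.
The complete SMO code is too long and is given in the appendix. The main calling code is: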

trainmatrix=data2image(trainImg['Data'])
hogtrain=meanlie(feature_hog(trainmatrix))
testmatrix=data2image(testImg['Data'])
hogtest=meanlie(feature_hog(testmatrix))
hogtraindata,hogtrainlabel=extractClass(hogtrain,trainImg['Label'],6,9)
hogtraindata,hogtrainlabel=extractPart(hogtraindata,hogtrainlabel,500)
hogtestdata,hogtestlabel=extractClass(hogtest,testImg['Label'],6,9)
smo = plattSMO.PlattSMO(hogtraindata, hogtrainlabel, 0.05, 0.0001, 200, name='rbf', theta=20)
smo.smoP()
testResult = smo.predict(hogtestdata)
count=0
for i in range(len(testResult)):
    if testResult[i]==hogtestlabel[i]:
        count+=1
print('right rate:%f'%(float(count)/len(hogtestlabel)))

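The smoP function is the complete SMO training routine (the call above uses the RBF kernel).
The result is as follows.
[Figure: program output] Here count is the number of correctly classified samples, 1724, out of 2000 test samples from these two classes, giving a binary classification accuracy of 0.862. This confirms that the hand-written binary classifier works.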
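
For orientation, the two-variable update at the heart of each SMO step (innerL in the appendix) can be isolated as a small pure function. This is a sketch that follows the same bounds, eta, and clipping as the appendix code; the function name smo_pair_update and the demo numbers are illustrative assumptions, and the update of b is left out.

```python
def smo_pair_update(a_i, a_j, y_i, y_j, E_i, E_j, K_ii, K_jj, K_ij, C):
    """One two-variable SMO step; returns the new (a_i, a_j), or None if no step is taken."""
    # Box bounds for a_j that keep both multipliers inside [0, C].
    if y_i != y_j:
        low, high = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
    else:
        low, high = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
    if low == high:
        return None
    eta = 2.0 * K_ij - K_ii - K_jj        # must be negative for a valid step
    if eta >= 0:
        return None
    a_j_new = a_j - y_j * (E_i - E_j) / eta
    a_j_new = min(max(a_j_new, low), high)       # clip to [low, high]
    a_i_new = a_i + y_i * y_j * (a_j - a_j_new)  # keeps sum(alpha * y) unchanged
    return a_i_new, a_j_new

# Made-up starting point: both multipliers are 0, so f(x) = 0 and E = -y.
step = smo_pair_update(0.0, 0.0, 1, -1, -1.0, 1.0, 1.0, 1.0, 0.0, C=0.05)
print(step)
```

With opposite labels and C=0.05 the unclipped step would move a_j to 1, so the clip to the box [0, C] is what decides the result here.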

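3.4 Verifying the hand-written multi-class classifier

The multi-class algorithm needs n(n-1)/2 binary models, one per pair of classes, each an instance of the PlattSMO class. With the 5 classes used here (0, 6, 7, 8, 9) that is 10 models. Each test sample is run through all 10 models, each model adds one vote to the class it picks, and the class with the most votes wins. The multi-class class is named LibSVM, saved as the module libsvm, and called from the main function. It has two main functions: train, which trains the models and stores them in self.classfy, and predict, which implements the voting strategy. One detail worth noting: the 10 models can leave several classes tied for the most votes. In that case I take the tied classes and run the sample through the models for just those classes, voting again. This amounts to applying the multi-class strategy twice, the second time over fewer classes, because the classes with fewer votes in the first round are dropped.
The complete code is in the appendix. The main code is: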

class LibSVM:
    def __init__(self,data=[],label=[],C=0,toler=0,maxIter=0,**kernelargs):
        self.classlabel = unique(label)
        self.classNum = len(self.classlabel)
        self.classfyNum = (self.classNum * (self.classNum-1))/2
        self.classfy = []
        self.dataSet={}
        self.kernelargs = kernelargs
        self.C = C
        self.toler = toler
        self.maxIter = maxIter
        m = shape(data)[0]
        for i in range(m):
            label[i]=int(label[i])
            if label[i] not in self.dataSet.keys():
                self.dataSet[int(label[i])] = []
                self.dataSet[int(label[i])].append(data[i][:])
            else:
                self.dataSet[int(label[i])].append(data[i][:])
    def train(self):
        num = self.classNum
        for i in range(num):
            for j in range(i+1,num):
                data = []
                label = [1.0]*shape(self.dataSet[self.classlabel[i]])[0]
                label.extend([-1.0]*shape(self.dataSet[self.classlabel[j]])[0])
                data.extend(self.dataSet[self.classlabel[i]])
                data.extend(self.dataSet[self.classlabel[j]])
                svm = PlattSMO(array(data),array(label),self.C,self.toler,self.maxIter,**self.kernelargs)
                svm.smoP()
                self.classfy.append(svm)
        self.dataSet = None
    def predict(self,data,label):
        m = shape(data)[0]
        num = self.classNum
        classlabel = []
        count = 0.0
        for n in range(m):
            result = [0] * num
            index = -1
            for i in range(num):
                for j in range(i + 1, num):
                    index += 1
                    s = self.classfy[index]
                    t = s.predict([data[n]])[0]
                    if t > 0.0:
                        result[i] +=1
                    else:
                        result[j] +=1
            #classlabel.append(self.classlabel[result.index(max(result))])
            
            # Collect every class that shares the highest vote count.
            resultmax=max(result)
            index1=[i for i in range(num) if result[i]==resultmax]
            index2 = [0 for _ in range(len(index1))]
            if len(index1) > 1:
                # Tie: let only the tied classes vote against each other again.
                for i in range(len(index1)):
                    for j in range(i+1,len(index1)):
                        a,b = index1[i],index1[j]
                        # Position of the (a,b) model in self.classfy, following the loop order in train().
                        s = self.classfy[a*num - a*(a+1)//2 + (b-a-1)]
                        t = s.predict([data[n]])[0]
                        if t > 0.0:
                            index2[i]+=1
                        else:
                            index2[j]+=1
            classlabel.append(self.classlabel[index1[index2.index(max(index2))]])
                        
            if classlabel[-1] != label[n]:
                count +=1
                print("%s %s" % (label[n], classlabel[n]))
        #print classlabel
        countright=m-count
        print("right rate: %f" % (countright / m))
        return classlabel
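
The pairwise voting and the second-round tie-break can be sketched independently of the SVM itself. The sketch below assumes the models are stored in the order produced by the nested loops in train, so that the model for classes (a, b) with a < b sits at a fixed position. The names pair_index, ovo_vote, and const are hypothetical, and the "models" are stand-in functions that return +1 or -1.

```python
from itertools import combinations

def pair_index(a, b, num):
    """Position of the (a, b) model, a < b, in the order for a in range(num): for b in range(a+1, num)."""
    return a * num - a * (a + 1) // 2 + (b - a - 1)

def ovo_vote(models, x, num):
    """One-vs-one voting; models[k](x) > 0 means the first class of pair k wins."""
    def winner(a, b):
        return a if models[pair_index(a, b, num)](x) > 0 else b

    votes = [0] * num
    for a, b in combinations(range(num), 2):
        votes[winner(a, b)] += 1
    best = max(votes)
    tied = [c for c in range(num) if votes[c] == best]
    if len(tied) > 1:
        # Second round: only the tied classes vote against each other.
        revote = {c: 0 for c in tied}
        for a, b in combinations(tied, 2):
            revote[winner(a, b)] += 1
        tied.sort(key=lambda c: -revote[c])   # stable sort keeps the lowest index on a second tie
    return tied[0]

def const(value):
    """Stand-in model that ignores its input and always answers `value`."""
    return lambda x: value

# Four classes, pairs in order (0,1),(0,2),(0,3),(1,2),(1,3),(2,3).
# Classes 0 and 1 both collect 2 votes; the (0,1) model prefers class 1.
demo_models = [const(v) for v in (-1, 1, 1, 1, -1, 1)]
print(ovo_vote(demo_models, None, 4))
```

In the demo a plain "first maximum" rule would pick class 0, while the second round picks class 1 because the (0, 1) model prefers it.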

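Core code of the main function. The arguments to libSVM.LibSVM matter a lot. Here the penalty parameter C (which weights the slack variables) is 10, the tolerance toler is 0.0001, the maximum number of iterations maxIter is 200, the kernel is the Gaussian kernel 'rbf', and its bandwidth theta is 20. These can be tuned further.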

trainImg = loadData(file)
testImg=loadData(file1)
traindata,trainlabel=extractData(trainImg['Data'],trainImg['Label'],400)
trainmatrix=data2image(traindata)
hogtrain=meanlie(feature_hog(trainmatrix))
testmatrix=data2image(testImg['Data'])
hogtest=meanlie(feature_hog(testmatrix))
# C=10 worked best: 0.678
svm = libSVM.LibSVM(hogtrain, trainlabel, 10, 0.0001, 200, name='rbf', theta=20)
svm.train()
svm.predict(hogtest,testImg['Label'])
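
The Gaussian kernel selected by name='rbf' and theta has the form K(x, z) = exp(-||x - z||^2 / theta^2) in kernelTrans. The appendix code fills the kernel matrix with a double Python loop; the following is a vectorized sketch of the same formula, offered as an alternative rather than as the original implementation. The function name rbf_kernel_matrix and the demo points are assumptions.

```python
import numpy as np

def rbf_kernel_matrix(X, Z, theta):
    """K[i, j] = exp(-||X[i] - Z[j]||^2 / theta^2), the same form as kernelTrans with name='rbf'."""
    X = np.asarray(X, dtype=np.float64)
    Z = np.asarray(Z, dtype=np.float64)
    # Squared distances via ||x||^2 + ||z||^2 - 2 x.z, computed for all pairs at once.
    sq = (X ** 2).sum(axis=1)[:, None] + (Z ** 2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    np.maximum(sq, 0.0, out=sq)   # guard against tiny negative values from rounding
    return np.exp(-sq / theta ** 2)

pts = [[0.0, 0.0], [3.0, 4.0]]    # made-up points, distance 5 apart
K = rbf_kernel_matrix(pts, pts, theta=5.0)
print(K)
```

A larger theta makes the kernel flatter, so distant samples still influence each other; a smaller theta makes each support vector act more locally.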

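Training uses only part of the training set: 200 or 400 samples per class, 1000 or 2000 samples in total. This runs in under about 5 minutes, which is fast, whereas training on all samples takes an unbearably long time. Using so few samples lowers the accuracy, but even with 400 samples per class the accuracy already reaches 0.678 (0.6946 in the run reported below), which is a decent result. I expect that training on the full set would bring the accuracy close to the 0.815 obtained earlier with the library SVM.
Experimental results:
200 samples per class
[Figure: program output] Here countright is the number of correctly classified samples, 3282, for an accuracy of 0.6564.
400 samples per class
[Figure: program output] The number of correctly classified samples, countright, is 3473, for an accuracy of 0.6946. This is about 4 percentage points higher than with 200 samples per class. The full training set has 14968 samples, so using all of it should raise the accuracy noticeably further.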
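
The per-class subsampling used in this section (200 or 400 samples per class) can be sketched as below. This is an illustrative version, not the original extractData: it assumes labels arrive as a column vector, as scipy.io.loadmat typically returns them, and keeps the first num rows of each class. The name sample_per_class and the toy data are hypothetical.

```python
import numpy as np

def sample_per_class(data, labels, classes, num):
    """Keep the first `num` rows of each class in `classes`, preserving the original row order."""
    labels = np.asarray(labels).ravel()            # accept (m, 1) or (m,) labels
    keep = []
    for c in classes:
        keep.extend(np.flatnonzero(labels == c)[:num].tolist())
    keep = np.sort(np.array(keep, dtype=int))
    return data[keep], labels[keep]

toy_data = np.arange(20).reshape(10, 2)                          # made-up features
toy_labels = np.array([0, 6, 0, 6, 7, 0, 6, 7, 7, 0]).reshape(-1, 1)
sub_data, sub_labels = sample_per_class(toy_data, toy_labels, [0, 6, 7], 2)
print(sub_labels.tolist())
```

Taking the first num rows keeps runs reproducible; drawing them at random would be an equally reasonable choice.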

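Appendix:

What each function does

trialsvm.py
def meanlie(data):
Divides every element of data by the mean of its column. This replaces normalization. After extracting HOG features the data needs to be normalized for the later classification, but in my experiments normalization did not give good accuracy. The cause was that each feature column of data always had a few very large values while the rest were small, so each element is divided by its column mean instead.
def loadData(file):
Loads the dataset. Very simple.
def data2image(data):
Converts each sample's row vector into an image and converts it to grayscale, because HOG features are extracted from images while the dataset stores each image as a 1x3072 row vector.
def feature_hog(data):
Extracts HOG features from data and returns the extracted features.
def extractData(data,label,num):
Draws a subset of samples from the large dataset data. num is the number of samples to draw per class. Returns the sampled data and the corresponding labels.
plattSMO.py
class PlattSMO:
The binary classifier class.
def __init__(self,dataMat,classlabels,C,toler,maxIter,**kernelargs):
Constructor. Initializes the variables: dataMat is the data matrix, classlabels the data labels, C the penalty parameter, toler the tolerance, maxIter the maximum number of iterations, and **kernelargs the kernel parameters.
def kernelTrans(self,x,z):
Evaluates the kernel function for x and z, which implicitly maps the data to a higher-dimensional space.
def calcEK(self,k):
Computes the error Ek.
def updateEK(self,k):
Computes Ek and updates the error cache.
def selectJ(self,i,Ei):
Inner-loop heuristic: the second choice heuristic, which picks the j with the largest |Ei-Ej|.
def innerL(self,i):
The optimized SMO inner step.
def smoP(self):
The complete SMO training loop.
def calcw(self):
Computes the weight vector w.
def predict(self,testData):
Prediction function. Predicts the class of each sample.
libsvm.py
class LibSVM:
The multi-class classifier class.
def train(self):
Training function. Trains the 10 pairwise models.
def predict(self,data,label):
Prediction function. Implements the multi-class strategy: uses the trained models to predict the test data data and reports the classification result.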
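
The column-mean scaling done by meanlie can also be written without an explicit loop. The sketch below is a vectorized variant under one added assumption that is not in the original: a column whose mean is zero is left unscaled, to avoid dividing by zero. The name scale_by_column_mean is hypothetical.

```python
import numpy as np

def scale_by_column_mean(data):
    """Divide each element by the mean of its column (vectorized variant of meanlie)."""
    data = np.asarray(data, dtype=np.float64)
    col_mean = data.mean(axis=0)
    # Assumption: columns with zero mean are left as they are.
    safe = np.where(col_mean == 0, 1.0, col_mean)
    return data / safe            # broadcasting divides every row by the column means

toy = np.array([[1.0, 10.0, 0.0],
                [3.0, 30.0, 0.0]])   # made-up feature matrix
scaled = scale_by_column_mean(toy)
print(scaled)
```

Unlike meanlie, this version returns a new array instead of modifying its input in place.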

trialsvm.py

# -*- coding: utf-8 -*-
"""
Created on Fri Dec 14 19:15:13 2018

@author: Administrator
"""
from mysvm import plattSMO,libSVM
import matplotlib.pyplot as plt
import numpy as np
import random
from numpy import *
import scipy.io as sio
from sklearn.decomposition import PCA
from sklearn import preprocessing
from skimage import feature as ft
def meanlie(data):
    # divide each element by its column mean; this replaces normalization, which worked poorly
    m,n=data.shape
    meandata=np.mean(data,axis=0)# axis=0: mean of each column
    for i in range(n):
        data[:,i]/=meandata[i]
    return data
    
def loadData(file):
    #file='G:/lecture of grade one/pattern recognition/data/train_data.mat'
    trainImg=sio.loadmat(file)
    return trainImg
def data2image(data):
    newdata=[]
    m,n=data.shape
    for i in range(m):
        img=data[i,:].reshape((3,32,32))
        #gray=img.convert('L')
        gray = img[0,:, :]*0.2990+img[1,:, :]*0.5870+img[2,:, :]*0.1140
        newdata.append(gray)
    #np.array(newdata)
    return newdata    
def feature_hog(data):
    # extract HOG features
    fea=[]
    for i in range(len(data)):
        #data[i]=Image.fromarray(data[i][0])
        fea.append(ft.hog(data[i],feature_vector=True,block_norm='L2-Hys',transform_sqrt=True))
    fea=np.array(fea)
    return fea
'''
def extractClass(data,label,class1,class2):
    # extract two classes and relabel them as 1 or -1 for the SVM
    m,n=data.shape
    index=[]
    for i in range(m):
        if label[i][0]!=class1 and label[i][0]!=class2:
            index.append(i)
    data=np.delete(data,index,0)
    label=np.delete(label,index,0)
    min_max_scaler=preprocessing.MinMaxScaler()
    data=min_max_scaler.fit_transform(data)
    Y=[]        
    for i in label:
        if i[0]==class1:# class1 maps to 1
            Y.append(1)
        else:
            Y.append(-1)# class2 maps to -1
    Y=np.array(Y)
    return data,Y
def extractPart(data,label,nums):
    m,n=data.shape
    index=[]
    a=0;b=0
    for i in range(m):
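        # reconstructed from context (the source line was garbled): take at most nums samples per class
        if label[i]==1 and a<nums:
            index.append(i)
            a+=1
        if label[i]==-1 and b<nums:
            index.append(i)
            b+=1
        if a>=nums and b>=nums:
            break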
    data=data[index]
    label=label[index]    
    return data,label
'''
def extractData(data,label,num):
    m,n=data.shape
    count=[0,0,0,0,0]
    cla=[0,6,7,8,9]
    index=[]
    for i in range(m):
        for j in range(5):
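            if label[i]==cla[j] and count[j]<num:
                # reconstructed from context: the source is truncated after "count[j]"
                index.append(i)
                count[j]+=1
    data=data[index]
    label=label[index]
    return data,label
# (the rest of trialsvm.py, including the main function, is truncated in the source)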

plattSMO.py

import sys
from numpy import *
from svm import *
from os import listdir
class PlattSMO:
    def __init__(self,dataMat,classlabels,C,toler,maxIter,**kernelargs):
        self.x = array(dataMat)
        self.label = array(classlabels).transpose()
        self.C = C
        self.toler = toler
        self.maxIter = maxIter
        self.m = shape(dataMat)[0]
        self.n = shape(dataMat)[1]
        self.alpha = array(zeros(self.m),dtype='float64')
        self.b = 0.0
        self.eCache = array(zeros((self.m,2)))
        self.K = zeros((self.m,self.m),dtype='float64')
        self.kwargs = kernelargs
        self.SV = ()
        self.SVIndex = None
        for i in range(self.m):
            for j in range(self.m):
                self.K[i,j] = self.kernelTrans(self.x[i,:],self.x[j,:])
    def calcEK(self,k):
        fxk = dot(self.alpha*self.label,self.K[:,k])+self.b
        Ek = fxk - float(self.label[k])
        return Ek
    def updateEK(self,k):
        Ek = self.calcEK(k)

        self.eCache[k] = [1 ,Ek]
    def selectJ(self,i,Ei):
        maxE = 0.0
        selectJ = 0
        Ej = 0.0
        validECacheList = nonzero(self.eCache[:,0])[0]
        if len(validECacheList) > 1:
            for k in validECacheList:
                if k == i:continue
                Ek = self.calcEK(k)
                deltaE = abs(Ei-Ek)
                if deltaE > maxE:
                    selectJ = k
                    maxE = deltaE
                    Ej = Ek
            return selectJ,Ej
        else:
            selectJ = selectJrand(i,self.m)
            Ej = self.calcEK(selectJ)
            return selectJ,Ej

    def innerL(self,i):
        Ei = self.calcEK(i)
        if (self.label[i] * Ei < -self.toler and self.alpha[i] < self.C) or \
                (self.label[i] * Ei > self.toler and self.alpha[i] > 0):
            self.updateEK(i)
            j,Ej = self.selectJ(i,Ei)
            alphaIOld = self.alpha[i].copy()
            alphaJOld = self.alpha[j].copy()
            if self.label[i] != self.label[j]:
                L = max(0,self.alpha[j]-self.alpha[i])
                H = min(self.C,self.C + self.alpha[j]-self.alpha[i])
            else:
                L = max(0,self.alpha[j]+self.alpha[i] - self.C)
                H = min(self.C,self.alpha[i]+self.alpha[j])
            if L == H:
                return 0
            eta = 2*self.K[i,j] - self.K[i,i] - self.K[j,j]
            if eta >= 0:
                return 0
            self.alpha[j] -= self.label[j]*(Ei-Ej)/eta
            self.alpha[j] = clipAlpha(self.alpha[j],H,L)
            self.updateEK(j)
            if abs(alphaJOld-self.alpha[j]) < 0.00001:
                return 0
            self.alpha[i] +=  self.label[i]*self.label[j]*(alphaJOld-self.alpha[j])
            self.updateEK(i)
            b1 = self.b - Ei - self.label[i] * self.K[i, i] * (self.alpha[i] - alphaIOld) - \
                 self.label[j] * self.K[i, j] * (self.alpha[j] - alphaJOld)
            b2 = self.b - Ej - self.label[i] * self.K[i, j] * (self.alpha[i] - alphaIOld) - \
                 self.label[j] * self.K[j, j] * (self.alpha[j] - alphaJOld)
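            # The b update and the smoP header were garbled in the source; reconstructed from context.
            if 0 < self.alpha[i] < self.C:
                self.b = b1
            elif 0 < self.alpha[j] < self.C:
                self.b = b2
            else:
                self.b = (b1 + b2) / 2.0
            return 1
        else:
            return 0
    def smoP(self):
        iter = 0
        entrySet = True
        alphaPairChanged = 0
        while iter < self.maxIter and ((alphaPairChanged > 0) or (entrySet)):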
            alphaPairChanged = 0
            if entrySet:
                for i in range(self.m):
                    alphaPairChanged+=self.innerL(i)
                iter += 1
            else:
                nonBounds = nonzero((self.alpha > 0)*(self.alpha < self.C))[0]
                for i in nonBounds:
                    alphaPairChanged+=self.innerL(i)
                iter+=1
            if entrySet:
                entrySet = False
            elif alphaPairChanged == 0:
                entrySet = True
        self.SVIndex = nonzero(self.alpha)[0]
        self.SV = self.x[self.SVIndex]
        self.SVAlpha = self.alpha[self.SVIndex]
        self.SVLabel = self.label[self.SVIndex]
        self.x = None
        self.K = None
        self.label = None
        self.alpha = None
        self.eCache = None
#   def K(self,i,j):
#       return self.x[i,:]*self.x[j,:].T
    def kernelTrans(self,x,z):
        if array(x).ndim != 1 or array(z).ndim != 1:
            raise Exception("input vector is not 1 dim")
        if self.kwargs['name'] == 'linear':
            return sum(x*z)
        elif self.kwargs['name'] == 'rbf':
            theta = self.kwargs['theta']
            return exp(sum((x-z)*(x-z))/(-1*theta**2))

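    def calcw(self):
        # Only meaningful for the linear kernel. Call after smoP(): it uses the
        # support vectors kept there, because self.x and self.alpha are cleared.
        self.w = zeros(self.n)
        for j in range(len(self.SVIndex)):
            self.w += self.SVAlpha[j]*self.SVLabel[j]*self.SV[j]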

    def predict(self,testData):
        test = array(testData)
        #return (test * self.w + self.b).getA()
        result = []
        m = shape(test)[0]
        for i in range(m):
            tmp = self.b
            for j in range(len(self.SVIndex)):
                tmp += self.SVAlpha[j] * self.SVLabel[j] * self.kernelTrans(self.SV[j],test[i,:])
            while tmp == 0:
                tmp = random.uniform(-1,1)
            if tmp > 0:
                tmp = 1
            else:
                tmp = -1
            result.append(tmp)
        return result
def plotBestfit(data,label,w,b):
    import matplotlib.pyplot as plt
    n = shape(data)[0]
    fig = plt.figure()
    ax = fig.add_subplot(111)
    x1 = []
    x2 = []
    y1 = []
    y2 = []
    for i in range(n):
        if int(label[i]) == 1:
            x1.append(data[i][0])
            y1.append(data[i][1])
        else:
            x2.append(data[i][0])
            y2.append(data[i][1])
    ax.scatter(x1,y1,s=10,c='red',marker='s')
    ax.scatter(x2,y2, s=10, c='green', marker='s')
    x = arange(-2,10,0.1)
    y = ((-b-w[0]*x)/w[1])
    plt.plot(x,y)
    plt.xlabel('X')
    plt.ylabel('y')
    plt.show()
def loadImage(dir,maps = None):
    dirList = listdir(dir)
    data = []
    label = []
    for file in dirList:
        label.append(file.split('_')[0])
        lines = open(dir +'/'+file).readlines()
        row = len(lines)
        col = len(lines[0].strip())
        line = []
        for i in range(row):
            for j in range(col):
                line.append(float(lines[i][j]))
        data.append(line)
        if maps != None:
            label[-1] = float(maps[label[-1]])
        else:
            label[-1] = float(label[-1])
    return array(data),array(label)

def main():
    '''
    data,label = loadDataSet('testSetRBF.txt')
    smo = PlattSMO(data,label,200,0.0001,10000,name = 'rbf',theta = 1.3)
    smo.smoP()
    smo.calcw()
    print smo.predict(data)
    '''
    maps = {'1':1.0,'9':-1.0}
    data,label = loadImage("digits/trainingDigits",maps)
    smo = PlattSMO(data, label, 200, 0.0001, 10000, name='rbf', theta=20)
    smo.smoP()
    print(len(smo.SVIndex))
    test,testLabel = loadImage("digits/testDigits",maps)
    testResult = smo.predict(test)
    m = shape(test)[0]
    count  = 0.0
    for i in range(m):
        if testLabel[i] != testResult[i]:
            count += 1
    print("classfied error rate is: %f" % (count / m))
    #smo.kernelTrans(data,smo.SV[0])

if __name__ == "__main__":
    sys.exit(main())

libsvm.py

import sys
from numpy import *
from svm import *
from os import listdir
from plattSMO import PlattSMO
import pickle
class LibSVM:

    def __init__(self,data=[],label=[],C=0,toler=0,maxIter=0,**kernelargs):
        self.classlabel = unique(label)
        self.classNum = len(self.classlabel)
        self.classfyNum = (self.classNum * (self.classNum-1))/2
        self.classfy = []
        self.dataSet={}
        self.kernelargs = kernelargs
        self.C = C
        self.toler = toler
        self.maxIter = maxIter
        m = shape(data)[0]
        for i in range(m):
            label[i]=int(label[i])
            if label[i] not in self.dataSet.keys():
                self.dataSet[int(label[i])] = []
                self.dataSet[int(label[i])].append(data[i][:])
            else:
                self.dataSet[int(label[i])].append(data[i][:])
    def train(self):
        num = self.classNum
        for i in range(num):
            for j in range(i+1,num):
                data = []
                label = [1.0]*shape(self.dataSet[self.classlabel[i]])[0]
                label.extend([-1.0]*shape(self.dataSet[self.classlabel[j]])[0])
                data.extend(self.dataSet[self.classlabel[i]])
                data.extend(self.dataSet[self.classlabel[j]])
                svm = PlattSMO(array(data),array(label),self.C,self.toler,self.maxIter,**self.kernelargs)
                svm.smoP()
                self.classfy.append(svm)
        self.dataSet = None
    def predict(self,data,label):
        m = shape(data)[0]
        num = self.classNum
        classlabel = []
        count = 0.0
        for n in range(m):
            result = [0] * num
            index = -1
            for i in range(num):
                for j in range(i + 1, num):
                    index += 1
                    s = self.classfy[index]
                    t = s.predict([data[n]])[0]
                    if t > 0.0:
                        result[i] +=1
                    else:
                        result[j] +=1
            #classlabel.append(self.classlabel[result.index(max(result))])
            
            # Collect every class that shares the highest vote count.
            resultmax=max(result)
            index1=[i for i in range(num) if result[i]==resultmax]
            index2 = [0 for _ in range(len(index1))]
            if len(index1) > 1:
                # Tie: let only the tied classes vote against each other again.
                for i in range(len(index1)):
                    for j in range(i+1,len(index1)):
                        a,b = index1[i],index1[j]
                        # Position of the (a,b) model in self.classfy, following the loop order in train().
                        s = self.classfy[a*num - a*(a+1)//2 + (b-a-1)]
                        t = s.predict([data[n]])[0]
                        if t > 0.0:
                            index2[i]+=1
                        else:
                            index2[j]+=1
            classlabel.append(self.classlabel[index1[index2.index(max(index2))]])
                        
            if classlabel[-1] != label[n]:
                count +=1
                #print label[n],classlabel[n]
        #print classlabel
        countright=m-count
        print("right rate: %f" % (countright / m))
        return classlabel
    def save(self,filename):
        fw = open(filename,'wb')
        pickle.dump(self,fw,2)
        fw.close()

    @staticmethod
    def load(filename):
        fr = open(filename,'rb')
        svm = pickle.load(fr)
        fr.close()
        return svm

def loadImage(dir,maps = None):
    dirList = listdir(dir)
    data = []
    label = []
    for file in dirList:
        label.append(file.split('_')[0])
        lines = open(dir +'/'+file).readlines()
        row = len(lines)
        col = len(lines[0].strip())
        line = []
        for i in range(row):
            for j in range(col):
                line.append(float(lines[i][j]))
        data.append(line)
        if maps != None:
            label[-1] = float(maps[label[-1]])
        else:
            label[-1] = float(label[-1])
    return data,label
def main():
    '''
    data,label = loadImage('trainingDigits')
    svm = LibSVM(data, label, 200, 0.0001, 10000, name='rbf', theta=20)
    svm.train()
    svm.save("svm.txt")
    '''
    svm = LibSVM.load("svm.txt")
    test,testlabel = loadImage('testDigits')
    svm.predict(test,testlabel)

if __name__ == "__main__":
    sys.exit(main())
