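SVM is a classic machine learning algorithm that has long been widely used in practice. Its accuracy is usually high, especially once a kernel function is introduced.
Note: this article does not analyze SVM theory in depth. I understand the basics, but my grasp of the SMO algorithm is fairly shallow, so I will only touch on the theory briefly. The focus here is on applying SVM, specifically to text classification of spam email.
For a theoretical analysis of support vector machines, see July's blog post on CSDN: http://blog.csdn.net/v_july_v/article/details/7624837. I will only outline the principle here.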
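I. A token overview of the SVM principle:
SVM is mainly used for classification. Taking binary linear classification as an example, the idea is to construct a separating hyperplane in the feature space that divides the two classes. With a kernel function, a problem that is not linearly separable can be mapped into a higher-dimensional space where it becomes linearly separable. The model parameters are solved for with the SMO algorithm, which performs well.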
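To make the idea of a separating hyperplane concrete, here is a minimal sketch of a linear decision rule. The weight vector w and offset b are hypothetical hand-picked values, not the output of any training run, and the function name linear_decision is only illustrative.

```python
import numpy as np

def linear_decision(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane w.x + b = 0 the point x lies."""
    return 1 if float(np.dot(w, x) + b) >= 0 else -1

# Hypothetical hyperplane x1 + x2 - 1 = 0 in a 2-D feature space
w = np.array([1.0, 1.0])
b = -1.0

side_a = linear_decision(w, b, np.array([2.0, 2.0]))  # point on the positive side
side_b = linear_decision(w, b, np.array([0.0, 0.0]))  # point on the negative side
```

Training an SVM amounts to choosing w and b (or, in the kernel case, the alphas and b) so that this rule separates the training classes.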
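II. Classifying spam email with SVM:
The Python SVM implementation in this article is adapted from the book Machine Learning in Action. The training data comes from that book's chapter on naive Bayes classification. Naive Bayes is itself a simple and practical classification method; here I reuse the email data from that chapter.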
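Steps:
1. Extract feature vectors from the training emails
An email contains a lot of text, so finding the keywords that matter most for classification is crucial. Once the keywords are found, each email can be marked with them: a keyword that appears in the email is marked 1, and one that does not appear is marked 0.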
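There are many ways to choose the keywords for each class of email. I decided to use TF-IDF. When computing the IDF part, since we are working with a whole class of emails, I did not use the traditional IDF formula. Instead I computed the share of emails in the class that contain the word, that is, the number of documents containing the word divided by the total number of documents.
For an introduction to TF-IDF, see Baidu Baike:
http://baike.baidu.com/link?url=oYpXqrTb6yQB1KaNUl8LS-01gUsy09s0w9JGpPq4QH8s_AzFI796tvWXnXtoGtpW-WAvLrKYwhsHp1l3i3JGpK
After computing the TF-IDF score of every word, I took the 100 highest-scoring words as the keywords for that class of email (each class has a little over 300 distinct words).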
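With the keywords in hand, each email can be turned into a feature vector of 100 values: for each keyword selected above, the value is 1 if the email contains the word and 0 if it does not.
This yields the feature vector of every email.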
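Step 1 can be sketched roughly as follows. This is only an illustration under stated assumptions: the article does not say exactly how the term frequency is computed, so the sketch assumes TF is the word's count divided by the total number of tokens in the class, and multiplies it by the share of emails that contain the word. The names select_keywords and to_feature_vector and the toy emails are hypothetical, and emails are assumed to be already tokenized and non-empty.

```python
from collections import Counter

def select_keywords(docs, top_n=100):
    """docs: list of token lists, one per email of a single class.
    Score = (term frequency in the class) * (share of emails containing the word)."""
    total_tokens = sum(len(d) for d in docs)
    term_counts = Counter(tok for d in docs for tok in d)
    doc_counts = Counter(tok for d in docs for tok in set(d))
    scores = {
        w: (term_counts[w] / float(total_tokens)) * (doc_counts[w] / float(len(docs)))
        for w in term_counts
    }
    # Highest score first; ties are broken alphabetically so the result is deterministic
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]

def to_feature_vector(tokens, keywords):
    """Binary vector: 1 if the keyword occurs in the email, else 0."""
    present = set(tokens)
    return [1 if w in present else 0 for w in keywords]

# Toy data standing in for one class of emails (hypothetical)
spam_docs = [
    ["buy", "cheap", "pills", "buy"],
    ["cheap", "pills", "online"],
    ["buy", "now"],
]
keywords = select_keywords(spam_docs, top_n=3)
vector = to_feature_vector(["please", "buy", "pills"], keywords)
```

In the article's setting top_n would be 100, so every email becomes a 100-dimensional 0/1 vector.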
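2. Train an SVM on the feature vectors from step 1. The SVM code in this article essentially follows Li Hang's book 《统计学习方法》 (Statistical Learning Methods). Since this article is not about theory, the details are skipped here. This example uses rbf as the kernel function.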
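For reference, the rbf kernel used by the kernelTrans function in the listing in this article has the form exp(-||x - y||^2 / sigma^2); note that it divides by sigma^2 rather than the 2*sigma^2 seen in some other formulations. Below is a vectorized sketch that computes the whole kernel matrix at once. The function name rbf_kernel_matrix, the toy vectors, and the value sigma=1.3 are assumptions for illustration; the article does not state which sigma was used.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma):
    """K[i, j] = exp(-||x_i - x_j||^2 / sigma^2) for all pairs of rows of X."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negatives from rounding
    return np.exp(-sq_dists / sigma ** 2)

# Three toy 0/1 feature vectors (hypothetical); the first two are identical
X = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
K = rbf_kernel_matrix(X, sigma=1.3)
```

Identical emails get a kernel value of 1, and the value decays toward 0 as the number of differing keywords grows.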
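3. Once the model is trained, use a hold-out form of cross-validation to check its accuracy. Specifically, the feature vectors of 40 of the 50 emails are used for training, and the feature vectors of the remaining 10 emails are used as test vectors.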
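The 40/10 split described in step 3 could look something like the sketch below. The helper names holdout_split and error_rate, the fixed seed, and the sample predictions are hypothetical; the article does not say how the 10 test emails were chosen, so a random choice is assumed here.

```python
import random

def holdout_split(n_samples, n_test, seed=0):
    """Shuffle the indices 0..n_samples-1 and reserve n_test of them for testing."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return idx[n_test:], idx[:n_test]  # (training indices, test indices)

def error_rate(predicted, actual):
    """Fraction of test samples whose predicted label differs from the true label."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / float(len(actual))

train_idx, test_idx = holdout_split(50, 10)

# Made-up labels, only to show how the error rate is computed
rate = error_rate([1, -1, 1, 1], [1, -1, -1, 1])
```

With only 10 test emails a single split gives a noisy estimate, so repeating the split with different seeds and averaging the error rates is a reasonable refinement.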
As always, code speaks loudest, so here it is.
Main SVM code:
# -*- coding: utf-8 -*-
from numpy import *
from time import sleep
import matplotlib.pyplot as plt

def loadDataSet(fileName):
    # Each line holds two tab-separated feature values followed by the class label
    dataMat = []; labelMat = []
    with open(fileName) as fr:
        for line in fr:
            lineArr = line.strip().split('\t')
            dataMat.append([float(lineArr[0]), float(lineArr[1])])
            labelMat.append(float(lineArr[2]))
    return dataMat, labelMat

def selectJrand(i, m):
    j = i  # we want to select any j not equal to i
    while (j == i):
        j = int(random.uniform(0, m))
    return j

def clipAlpha(aj, H, L):
    # Clip aj into the interval [L, H]
    if aj > H:
        aj = H
    if L > aj:
        aj = L
    return aj

def smoSimple(dataMatIn, classLabels, C, toler, maxIter):
    dataMatrix = mat(dataMatIn); labelMat = mat(classLabels).transpose()
    b = 0; m,n = shape(dataMatrix)
    alphas = mat(zeros((m,1)))
    iter = 0
    while (iter < maxIter):
        alphaPairsChanged = 0
        for i in range(m):
            fXi = float(multiply(alphas,labelMat).T*(dataMatrix*dataMatrix[i,:].T)) + b
            Ei = fXi - float(labelMat[i])
            # optimize alpha i only if example i violates the KKT conditions
            if ((labelMat[i]*Ei < -toler) and (alphas[i] < C)) or ((labelMat[i]*Ei > toler) and (alphas[i] > 0)):
                j = selectJrand(i,m)
                fXj = float(multiply(alphas,labelMat).T*(dataMatrix*dataMatrix[j,:].T)) + b
                Ej = fXj - float(labelMat[j])
                alphaIold = alphas[i].copy(); alphaJold = alphas[j].copy()
                # L and H keep alpha j inside the box [0, C]
                if (labelMat[i] != labelMat[j]):
                    L = max(0, alphas[j] - alphas[i])
                    H = min(C, C + alphas[j] - alphas[i])
                else:
                    L = max(0, alphas[j] + alphas[i] - C)
                    H = min(C, alphas[j] + alphas[i])
                if L==H: print("L==H"); continue
                eta = 2.0 * dataMatrix[i,:]*dataMatrix[j,:].T - dataMatrix[i,:]*dataMatrix[i,:].T - dataMatrix[j,:]*dataMatrix[j,:].T
                if eta >= 0: print("eta>=0"); continue
                alphas[j] -= labelMat[j]*(Ei - Ej)/eta
                alphas[j] = clipAlpha(alphas[j],H,L)
                if (abs(alphas[j] - alphaJold) < 0.00001): print("j not moving enough"); continue
                # update i by the same amount as j, in the opposite direction
                alphas[i] += labelMat[j]*labelMat[i]*(alphaJold - alphas[j])
                b1 = b - Ei- labelMat[i]*(alphas[i]-alphaIold)*dataMatrix[i,:]*dataMatrix[i,:].T - labelMat[j]*(alphas[j]-alphaJold)*dataMatrix[i,:]*dataMatrix[j,:].T
                b2 = b - Ej- labelMat[i]*(alphas[i]-alphaIold)*dataMatrix[i,:]*dataMatrix[j,:].T - labelMat[j]*(alphas[j]-alphaJold)*dataMatrix[j,:]*dataMatrix[j,:].T
                if (0 < alphas[i]) and (C > alphas[i]): b = b1
                elif (0 < alphas[j]) and (C > alphas[j]): b = b2
                else: b = (b1 + b2)/2.0
                alphaPairsChanged += 1
                print("iter: %d i:%d, pairs changed %d" % (iter,i,alphaPairsChanged))
        # count only consecutive passes in which no alpha pair changed
        if (alphaPairsChanged == 0): iter += 1
        else: iter = 0
        print("iteration number: %d" % iter)
    return b,alphas

def kernelTrans(X, A, kTup):  # calc the kernel or transform data to a higher dimensional space
    m,n = shape(X)
    K = mat(zeros((m,1)))
    if kTup[0]=='lin': K = X * A.T  # linear kernel
    elif kTup[0]=='rbf':
        for j in range(m):
            deltaRow = X[j,:] - A
            K[j] = deltaRow*deltaRow.T
        K = exp(K/(-1*kTup[1]**2))  # divide in NumPy is element-wise not matrix like Matlab
    else:
        raise NameError('Houston We Have a Problem -- That Kernel is not recognized')
    return K

class optStruct:
    def __init__(self, dataMatIn, classLabels, C, toler, kTup):  # Initialize the structure with the parameters
        self.X = dataMatIn
        self.labelMat = classLabels
        self.C = C
        self.tol = toler
        self.m = shape(dataMatIn)[0]
        self.alphas = mat(zeros((self.m,1)))
        self.b = 0
        self.eCache = mat(zeros((self.m,2)))  # first column is valid flag
        # precompute the full kernel matrix, one column per training sample
        self.K = mat(zeros((self.m,self.m)))
        for i in range(self.m):
            self.K[:,i] = kernelTrans(self.X, self.X[i,:], kTup)

def calcEk(oS, k):
    # prediction error for sample k: f(x_k) - y_k
    fXk = float(multiply(oS.alphas,oS.labelMat).T*oS.K[:,k] + oS.b)
    Ek = fXk - float(oS.labelMat[k])
    return Ek

def selectJ(i, oS, Ei):  # this is the second choice heuristic, and calcs Ej
    maxK = -1; maxDeltaE = 0; Ej = 0
    oS.eCache[i] = [1,Ei]  # set valid; choose the alpha that gives the maximum delta E
    validEcacheList = nonzero(oS.eCache[:,0].A)[0]
    if (len(validEcacheList)) > 1:
        for k in validEcacheList:  # loop through valid Ecache values and find the one that maximizes delta E
            if k == i: continue  # don't calc for i, waste of time
            Ek = calcEk(oS, k)
            deltaE = abs(Ei - Ek)
            if (deltaE > maxDeltaE):
                maxK = k; maxDeltaE = deltaE; Ej = Ek
        return maxK, Ej
    else:  # in this case (first time around) we don't have any valid eCache values
        j = selectJrand(i, oS.m)
        Ej = calcEk(oS, j)
    return j, Ej

def updateEk(oS, k):  # after any alpha has changed update the new value in the cache
    Ek = calcEk(oS, k)
    oS.eCache[k] = [1,Ek]

def innerL(i, oS):
    Ei = calcEk(oS, i)
    if ((oS.labelMat[i]*Ei < -oS.tol) and (oS.alphas[i] < oS.C)) or ((oS.labelMat[i]*Ei > oS.tol) and (oS.alphas[i] > 0)):
        j,Ej = selectJ(i, oS, Ei)  # this has been changed from selectJrand
        alphaIold = oS.alphas[i].copy(); alphaJold = oS.alphas[j].copy()
        if (oS.labelMat[i] != oS.labelMat[j]):
            L = max(0, oS.alphas[j] - oS.alphas[i])
            H = min(oS.C, oS.C + oS.alphas[j] - oS.alphas[i])
        else:
            L = max(0, oS.alphas[j] + oS.alphas[i] - oS.C)
            H = min(oS.C, oS.alphas[j] + oS.alphas[i])
        if L==H: print("L==H"); return 0
        eta = 2.0 * oS.K[i,j] - oS.K[i,i] - oS.K[j,j]  # changed for kernel
        if eta >= 0: print("eta>=0"); return 0
        oS.alphas[j] -= oS.labelMat[j]*(Ei - Ej)/eta
        oS.alphas[j] = clipAlpha(oS.alphas[j],H,L)
        updateEk(oS, j)  # added this for the Ecache
        if (abs(oS.alphas[j] - alphaJold) < 0.00001): print("j not moving enough"); return 0
        # update i by the same amount as j, in the opposite direction
        oS.alphas[i] += oS.labelMat[j]*oS.labelMat[i]*(alphaJold - oS.alphas[j])
        updateEk(oS, i)  # added this for the Ecache
        b1 = oS.b - Ei- oS.labelMat[i]*(oS.alphas[i]-alphaIold)*oS.K[i,i] - oS.labelMat[j]*(oS.alphas[j]-alphaJold)*oS.K[i,j]
        b2 = oS.b - Ej- oS.labelMat[i]*(oS.alphas[i]-alphaIold)*oS.K[i,j]- oS.labelMat[j]*(oS.alphas[j]-alphaJold)*oS.K[j,j]
        if (0 < oS.alphas[i]) and (oS.C > oS.alphas[i]): oS.b = b1
        elif (0 < oS.alphas[j]) and (oS.C > oS.alphas[j]): oS.b = b2
        else: oS.b = (b1 + b2)/2.0
        return 1
    else: return 0

def smoP(dataMatIn, classLabels, C, toler, maxIter, kTup=('lin', 0)):  # full Platt SMO
    oS = optStruct(mat(dataMatIn),mat(classLabels).transpose(),C,toler, kTup)
    iter = 0
    entireSet = True; alphaPairsChanged = 0
    while (iter < maxIter) and ((alphaPairsChanged > 0) or (entireSet)):
        alphaPairsChanged = 0
        if entireSet:  # go over all
            for i in range(oS.m):
                alphaPairsChanged += innerL(i,oS)
                print("fullSet, iter: %d i:%d, pairs changed %d" % (iter,i,alphaPairsChanged))
            iter += 1
        else:  # go over non-bound (railed) alphas
            nonBoundIs = nonzero((oS.alphas.A > 0) * (oS.alphas.A < C))[0]
            for i in nonBoundIs:
                alphaPairsChanged += innerL(i,oS)
                print("non-bound, iter: %d i:%d, pairs changed %d" % (iter,i,alphaPairsChanged))
            iter += 1
        if entireSet: entireSet = False  # toggle entire set loop
        elif (alphaPairsChanged == 0): entireSet = True
        print("iteration number: %d" % iter)
    return oS.b,oS.alphas
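
smoP returns the offset b and the vector of alphas. To classify a new feature vector x, one evaluates f(x) = sum_i alpha_i * y_i * K(x_i, x) + b and takes the sign; only training samples with alpha_i > 0 (the support vectors) contribute. The sketch below illustrates this with the same rbf form as kernelTrans. The function name predict and all numeric values are hypothetical and hand-picked, not the result of running the training code above.

```python
import numpy as np

def predict(x, support_vectors, support_labels, support_alphas, b, sigma):
    """Sign of sum_i alpha_i * y_i * exp(-||x_i - x||^2 / sigma^2) + b."""
    sv = np.asarray(support_vectors, dtype=float)
    diff = sv - np.asarray(x, dtype=float)
    k = np.exp(-np.sum(diff ** 2, axis=1) / sigma ** 2)  # kernel value against each support vector
    weights = np.asarray(support_alphas, dtype=float) * np.asarray(support_labels, dtype=float)
    f = float(np.dot(weights, k)) + b
    return 1 if f >= 0 else -1

# Hand-picked toy model: one support vector per class (hypothetical values)
svs = [[0.0, 0.0], [1.0, 1.0]]
labels = [-1.0, 1.0]
alphas = [1.0, 1.0]

near_positive = predict([1.0, 1.0], svs, labels, alphas, b=0.0, sigma=1.0)
near_negative = predict([0.0, 0.0], svs, labels, alphas, b=0.0, sigma=1.0)
```

With the matrices returned by smoP, the support vectors can be picked out with nonzero(alphas.A > 0)[0] before calling a function like this one.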