I have recently been studying machine learning algorithms and implementing each of them myself, mainly following《机器学习实战》(Machine Learning in Action) and using Python. OpenCV also ships implementations of these algorithms (see: http://blog.csdn.net/xiaowei_cqu/article/details/23782561), which are convenient to call directly; we will come back to those later.
The k-nearest neighbor (kNN) classification algorithm
In one sentence: compute the distance between the query point and every sample whose class is already known, take the k samples with the smallest distances, count how often each class appears among those k samples, and assign the most frequent class to the query point.
Algorithm steps:
1. Compute the distance between every point in the labeled data set and the query point;
2. Select the k points closest to the query point;
3. Count the frequency of each class among those k points;
4. Return the most frequent class among the k points as the predicted class of the query point.
Main implementation:
import operator
from numpy import tile

# inX: the input vector to classify
# dataSet: sample data set (m x n)
# labels: the labels of dataSet (1 x m)
# k: number of nearest neighbors to consider
def classify_knn(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every sample
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    # majority vote among the k nearest samples
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
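As a quick sanity check, here is a toy call in the spirit of the book's createDataSet example; the data values below are my own illustration, not from the original post:

from numpy import array
group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify_knn([0.0, 0.0], group, labels, 3))   # expected output: 'B'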
Advantages:
Supervised learning.
Simple, easy to understand and implement; no parameters to estimate and no explicit training phase.
Well suited to classifying rare events (for example, building a churn-prediction model when the churn rate is very low, say below 0.5%).
Particularly well suited to multi-class / multi-label problems (multi-modal, where an object carries several class labels), for example inferring functional categories from gene features, where kNN can outperform SVM.
Disadvantages:
Lazy learning: classifying a test sample is computationally expensive, memory-hungry, and slow to score.
Poor interpretability: it cannot produce explicit rules the way a decision tree can.
Notes:
1. Distance metric: Euclidean distance, or any other distance measure.
2. Choice of k: a small constant, usually no more than 20; the exact value should be tuned for the problem at hand.
3. Input/output format: data that does not match the format the algorithm expects must be converted first.
4. Python always passes object references; be especially careful with this in recursive code, which becomes very visible in the decision-tree section below.
5. Further reading:
http://blog.csdn.net/jmydream/article/details/8644004
The decision tree algorithm
A decision tree is a tree structure (binary or not). Each internal (non-leaf) node represents a test on one feature, each branch represents one outcome of that test over some range of values, and each leaf node stores a class label. To classify an item, start at the root, test the corresponding feature of the item, follow the branch that matches its value, and repeat until a leaf is reached; the class stored at that leaf is the decision.
Steps for building a decision tree:
1. Check whether every item in the data set belongs to the same class; if so, return that class.
2. Check whether all features have been used up while the class labels are still not unique; if so, decide the class by majority vote (similar to kNN) and return it.
3. Choose the best feature to split on.
4. Create a branch node for that feature and split the data set accordingly.
5. For each subset: run steps 1-4 recursively and attach the result to the branch node.
This raises the question of how to choose the best splitting feature.
The answer here is entropy, a measure of how disordered a system is: the more mixed the class labels, the higher the entropy (for class proportions p_i, the Shannon entropy is H = -sum_i p_i * log2(p_i)). The preferred splitting feature is therefore the one that leaves the resulting subsets with the lowest entropy, i.e. the largest information gain, which settles what "best" means.
Main implementation
1. Computing the entropy
from math import log

# for example: dataSet = [[1,1,'yes'],[1,1,'yes'],[1,0,'no'],[0,1,'no'],[0,1,'no']]
# the last column of each row is the class label of that sample
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    # count how many samples fall into each class
    for vec in dataSet:
        currentLabel = vec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
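For the toy data set in the comment above (two 'yes' against three 'no'), the entropy works out to about 0.971; a quick check:

dataSet = [[1,1,'yes'], [1,1,'yes'], [1,0,'no'], [0,1,'no'], [0,1,'no']]
print(calcShannonEnt(dataSet))   # ~0.971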
2. Splitting the data set on a feature
# return all samples whose feature `axis` equals `value`,
# with that feature column removed
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
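For example, splitting the toy data set above on feature 0 with value 1 keeps the three matching samples and drops that column (my own illustration):

print(splitDataSet(dataSet, 0, 1))   # [[1, 'yes'], [1, 'yes'], [0, 'no']]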
3. Finding the best feature to split on
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    # try every feature
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        # entropy after splitting on feature i, weighted by subset size
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if bestInfoGain < infoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
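On the same toy data set, feature 0 gives the larger information gain (roughly 0.42 versus 0.17 for feature 1, hand-checked), so the function returns 0:

print(chooseBestFeatureToSplit(dataSet))   # 0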
4. Building the tree recursively
def createTree(dataSet, featLabels):
    classList = [example[-1] for example in dataSet]
    # condition 1: every sample has the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # condition 2: all features used up, decide by majority vote
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeature = chooseBestFeatureToSplit(dataSet)   # step 3
    bestFeatLabel = featLabels[bestFeature]
    print('bestFeature %d, bestLabels %s' % (bestFeature, bestFeatLabel))
    myTree = {bestFeatLabel: {}}                      # step 4.1
    del(featLabels[bestFeature])                      # careful: modifies the caller's list
    featValues = [example[bestFeature] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = featLabels[:]   # copy, so the recursion does not clobber the list
        # step 4.2: recurse on each subset
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeature, value), subLabels)
    return myTree
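createTree relies on a majorityCnt helper (the majority-vote step 2 from the list above) that is not shown in the original post; a minimal sketch following the same idea:

import operator
def majorityCnt(classList):
    # count each class label and return the most frequent one
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]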
5. Classifying with the decision tree
def classify(inputTree, featLabels, testVec):
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)   # which feature this node tests
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if isinstance(secondDict[key], dict):
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                classLabel = secondDict[key]
    return classLabel
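For example, with the first hard-coded tree from retrieveTree below (feature names 'no surfacing' and 'flippers'), the calls would look like this:

myTree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
print(classify(myTree, ['no surfacing', 'flippers'], [1, 0]))   # 'no'
print(classify(myTree, ['no surfacing', 'flippers'], [1, 1]))   # 'yes'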
Advantages:
Low computational cost, results that are easy to interpret, insensitivity to missing intermediate values, and the ability to cope with irrelevant features.
Disadvantages:
Prone to overfitting.
Mainly suited to nominal (categorical) data; numeric data is not handled well.
Notes:
1. What if the features run out: use the majority-vote approach mentioned above.
2. Pruning: in practice a decision tree is usually pruned to counter overfitting caused by noise and outliers in the data. There are two kinds, pre-pruning and post-pruning; CART, covered later, touches on this, so it is not elaborated here.
3. When choosing the best splitting feature, different criteria can be used: information gain, purity, or others.
4. Storing the tree: rebuilding it every time is expensive, so consider saving the finished tree for reuse with the pickle module: pickle.dump(inputTree, fw) to store, pickle.load(fr) to reload (a small sketch follows this list).
5. Drawing the tree: the matplotlib plotting library available for Python can be used.
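Note 4 only names pickle.dump and pickle.load; here is a minimal sketch of how they might be wrapped. The helper names storeTree/grabTree are my own choice, and Python 3 requires the files to be opened in binary mode:

import pickle
def storeTree(inputTree, filename):
    # serialize the finished tree to disk
    with open(filename, 'wb') as fw:
        pickle.dump(inputTree, fw)
def grabTree(filename):
    # load a previously stored tree
    with open(filename, 'rb') as fr:
        return pickle.load(fr)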
I have not studied the plotting in depth; the related source is given below, and interested readers can look up further material or message me.
''' Created on Oct 14, 2010 @author: Peter Harrington '''
import matplotlib.pyplot as plt
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")
def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):   # a dict means an internal node, otherwise a leaf
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs
def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):   # a dict means an internal node, otherwise a leaf
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)

def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)
def plotTree(myTree, parentPt, nodeTxt):   # the first key tells you what feature was split on
    numLeafs = getNumLeafs(myTree)   # this determines the x width of this subtree
    depth = getTreeDepth(myTree)
    firstStr = list(myTree.keys())[0]   # the text label for this node
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):   # internal node: recurse
            plotTree(secondDict[key], cntrPt, str(key))
        else:                                   # leaf node: plot it directly
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD

# if you do get a dictionary you know it's a tree, and the first element will be another dict
def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)   # no ticks
    #createPlot.ax1 = plt.subplot(111, frameon=False)   # ticks for demo purposes
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()
#def createPlot():
#    fig = plt.figure(1, facecolor='white')
#    fig.clf()
#    createPlot.ax1 = plt.subplot(111, frameon=False)   # ticks for demo purposes
#    plotNode('a decision node', (0.5, 0.1), (0.1, 0.5), decisionNode)
#    plotNode('a leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
#    plt.show()
def retrieveTree(i):
    # two hard-coded test trees
    listOfTrees = [{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                   {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}]
    return listOfTrees[i]
#createPlot(thisTree)
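To try the plotting code, one of the hard-coded test trees above can be passed in directly, for example:

createPlot(retrieveTree(0))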