Machine Learning in Action -- kNN and Decision Trees

I've recently been working through machine learning algorithms and implementing each one myself in Python, mainly following Machine Learning in Action (《机器学习实战》). OpenCV also ships implementations of these algorithms (see: http://blog.csdn.net/xiaowei_cqu/article/details/23782561) that are convenient to call directly; we'll look at those later.

I. kNN

k-nearest neighbor (kNN) classification
In one sentence: compare the query point against every sample whose class is known, take the k samples with the smallest distances, count how many of those k belong to each class, and assign the query point the class that appears most often.
Algorithm steps:
1. Compute the distance between the query point and every point in the labeled data set.
2. Select the k points closest to the query point.
3. Count how often each class appears among those k points.
4. Return the most frequent class among the k points as the predicted class of the query point.
The core implementation:

from numpy import tile          # for the vectorised distance computation
import operator

#inX: the input vector to classify
#dataSet: sample data set (m x n)
#labels: labels of dataSet (1 x m)
#k: the kNN parameter
def classify_knn(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Euclidean distance between inX and every sample
    diffMat = tile(inX, (dataSetSize,1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    # let the k nearest samples vote
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
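
A quick sanity check on a toy data set (these particular points and labels are just for illustration):

from numpy import array

group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify_knn([0.0, 0.0], group, labels, 3))   # expected: 'B'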

Strengths:
It is a supervised learning method.
Simple, easy to understand and implement; no parameters to estimate and no training phase.
Works well for classifying rare events (for example, building a churn-prediction model when the churn rate is very low, say below 0.5%).
Particularly suited to multi-class / multi-modal problems (objects carrying several class labels), e.g. classifying gene function from gene features, where kNN has been reported to outperform SVM.
Weaknesses:
It is a lazy algorithm: classifying a test sample is computationally expensive, memory-hungry, and slow to score.
Poor interpretability; it cannot produce explicit rules the way a decision tree can.
Notes:
1. Distance metric: Euclidean distance, or any other suitable distance measure.
2. Choice of k: a small constant, usually below 20; tune it to the problem at hand.
3. Input/output format: data that does not match the format the algorithm expects must be converted first.
4. Python passes object references, so be careful in recursive code; this matters a lot in the decision tree section below (see the small illustration right after this list).
5. Other considerations are covered here:
http://blog.csdn.net/jmydream/article/details/8644004
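
A tiny illustration of note 4 (the names here are made up for the example): a list passed into a function is not copied, so in-place changes leak back to the caller; pass an explicit copy when you need to protect it.

def drop_first(features):
    del(features[0])          # mutates the caller's list

labels = ['no surfacing', 'flippers']
drop_first(labels)
print(labels)                 # ['flippers'] -- the original list was changed

labels = ['no surfacing', 'flippers']
drop_first(labels[:])         # pass a shallow copy instead
print(labels)                 # ['no surfacing', 'flippers'] -- unchanged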

II. Decision Tree

A decision tree is a tree structure (binary or not). Each internal node is a test on one feature, each branch is one possible outcome of that test, and each leaf node stores a class. To classify an item, start at the root, test the corresponding feature of the item, follow the branch that matches its value, and repeat until a leaf is reached; the class stored at that leaf is the decision.
Steps to build a decision tree:
1. Check whether every item in the data set belongs to the same class; if so, return that class.
2. Check whether all feature attributes have been used up while the class labels are still not unique; if so, decide the class by majority vote (similar to kNN) and return it.
3. Pick the best feature to split on.
4. Create a branch node for the best feature and split the data set on it.
5. For each subset: recursively apply steps 1-4 and attach the result to the branch node.
This raises the question of how to pick the best splitting feature.
The usual answer is entropy, a measure of the disorder in a system: the more mixed the classes, the higher the entropy (H = -Σ p_i * log2(p_i), summed over the classes). We therefore prefer the feature whose split leaves the subsets with the lowest entropy, which settles what "best" means.

The core implementation
1. Computing the Shannon entropy

from math import log

#for example: dataSet = [[1,1,'yes'],[1,1,'yes'],[1,0,'no'],[0,1,'no'],[0,1,'no']]
#the last element of each sample vector is its class label
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    # count how many samples fall into each class
    for vec in dataSet:
        currentLabel = vec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    # H = -sum(p * log2(p)) over all classes
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
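
With the example data set above (2 'yes', 3 'no'), the entropy works out to -(2/5)*log2(2/5) - (3/5)*log2(3/5) ≈ 0.971:

dataSet = [[1,1,'yes'],[1,1,'yes'],[1,0,'no'],[0,1,'no'],[0,1,'no']]
print(calcShannonEnt(dataSet))   # ~0.9710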

2. Splitting the data set on a feature

#axis: index of the feature to split on; value: the feature value to keep
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            # keep the sample, with the splitting feature removed
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
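
For example, splitting the toy data set on feature 0 with value 1 keeps the three samples whose first feature is 1 and drops that column:

print(splitDataSet(dataSet, 0, 1))   # [[1, 'yes'], [1, 'yes'], [0, 'no']]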

3. Choosing the best feature to split on

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    #try each feature in turn
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        #expected entropy after splitting on feature i
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if bestInfoGain < infoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
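
On the toy data set, splitting on feature 0 gives an information gain of about 0.42 versus about 0.17 for feature 1, so feature 0 is chosen:

print(chooseBestFeatureToSplit(dataSet))   # 0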

4. Building the tree recursively

#fetlabels: the list of feature names; note it is modified in place (see note 4 in the kNN section)
def createTree(dataSet, fetlabels):
    classList = [example[-1] for example in dataSet]
    #condition 1: all samples belong to the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    #condition 2: no features left to split on, fall back to majority vote
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeature = chooseBestFeatureToSplit(dataSet)   #step 3
    bestFeatLabel = fetlabels[bestFeature]
    print('bestFeature %d, bestLabel %s' % (bestFeature, bestFeatLabel))
    myTree = {bestFeatLabel: {}}   #step 4.1
    del(fetlabels[bestFeature])    #careful: mutates the caller's list
    featValues = [example[bestFeature] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = fetlabels[:]   #copy, so the recursion cannot clobber this level's labels
        #step 4.2: recurse on each subset
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeature, value), subLabels)
    return myTree
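
createTree calls a majorityCnt helper that is not listed above; a minimal sketch of that majority-vote helper (the same voting idea as in kNN) might look like this:

import operator

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    # return the class with the highest count
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]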

5. Classifying with the tree

def classify(inputTree, featLabels, testVec):
    firstStr = list(inputTree.keys())[0]           # the feature tested at this node
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    classLabel = None
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if isinstance(secondDict[key], dict):  # an internal node: keep descending
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:                                  # a leaf node: this is the answer
                classLabel = secondDict[key]
    return classLabel
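
Using the first sample tree returned by retrieveTree (defined in the plotting code below) and the two feature names it assumes:

myTree = retrieveTree(0)   # {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
featLabels = ['no surfacing', 'flippers']
print(classify(myTree, featLabels, [1, 0]))   # 'no'
print(classify(myTree, featLabels, [1, 1]))   # 'yes'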

Strengths:
Low computational cost, results that are easy to interpret, insensitivity to missing intermediate values, and the ability to handle irrelevant features.
Weaknesses:
It tends to overfit.
It mainly suits nominal (categorical) features; numeric features are a poor fit without discretization.
Notes:
1. What if the features run out? Use the majority-vote approach mentioned above.
2. Pruning: in practice a tree is usually pruned to curb the overfitting caused by noise and outliers in the data. There are two kinds, pre-pruning and post-pruning; the CART algorithm covered later touches on this, so I will not go into it here.
3. When choosing the best splitting feature, different criteria can be used: information gain, purity measures, or others.
4. Storing the tree: rebuilding the tree every time is expensive, so consider saving a built tree and reloading it next time with the pickle module - pickle.dump(inputTree, fw) to save, pickle.load(fr) to load (see the sketch right after this list).
5. Drawing the tree: Python's matplotlib plotting library can be used. I have not studied it in depth; the source code is given below after the pickle sketch, and interested readers can look up more material or message me.
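
A minimal sketch of note 4, assuming Python 3 (the file name is arbitrary):

import pickle

def storeTree(inputTree, filename):
    with open(filename, 'wb') as fw:      # pickle needs binary mode
        pickle.dump(inputTree, fw)

def grabTree(filename):
    with open(filename, 'rb') as fr:
        return pickle.load(fr)

#storeTree(myTree, 'classifierStorage.txt')
#print(grabTree('classifierStorage.txt'))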

''' Created on Oct 14, 2010 @author: Peter Harrington '''
import matplotlib.pyplot as plt

decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':#test to see if the nodes are dictionaries, if not they are leaf nodes
            numLeafs += getNumLeafs(secondDict[key])
        else:   numLeafs +=1
    return numLeafs

def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':#test to see if the nodes are dictionaries, if not they are leaf nodes
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:   thisDepth = 1
        if thisDepth > maxDepth: maxDepth = thisDepth
    return maxDepth

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt,  xycoords='axes fraction',
             xytext=centerPt, textcoords='axes fraction',
             va="center", ha="center", bbox=nodeType, arrowprops=arrow_args )

def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]
    yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

def plotTree(myTree, parentPt, nodeTxt):#if the first key tells you what feat was split on
    numLeafs = getNumLeafs(myTree)  #this determines the x width of this tree
    depth = getTreeDepth(myTree)
    firstStr = list(myTree.keys())[0]     #the text label for this node should be this
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':#test to see if the nodes are dictionaries, if not they are leaf nodes
            plotTree(secondDict[key],cntrPt,str(key))        #recursion
        else:   #it's a leaf node print the leaf node
            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD
#if you do get a dictonary you know it's a tree, and the first element will be another dict

def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)    #no ticks
    #createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo purposes
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5/plotTree.totalW; plotTree.yOff = 1.0;
    plotTree(inTree, (0.5,1.0), '')
    plt.show()

#def createPlot():
# fig = plt.figure(1, facecolor='white')
# fig.clf()
# createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo purposes
# plotNode('a decision node', (0.5, 0.1), (0.1, 0.5), decisionNode)
# plotNode('a leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
# plt.show()

def retrieveTree(i):
    listOfTrees =[{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                  {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}
                  ]
    return listOfTrees[i]

#createPlot(thisTree)
