Notes on Machine Learning in Action, Chapter 3: Decision Trees

       Decision trees are typically used for supervised classification problems. This chapter focuses on the ID3 algorithm.

1. How It Works

       A decision tree splits and classifies a dataset according to its features, so the key question is how to choose the feature to split on. This is where information theory comes in. In information theory and statistics, entropy measures the uncertainty of a random variable: the larger the entropy, the more uncertain the variable. Before any split, the dataset is unordered and its entropy is at its largest; splitting makes the dataset more ordered and lowers the entropy. The entropy $H(D)$ of the original dataset $D$ is computed as:

$$H(D) = -\sum_{k=1}^{K}\frac{|C_k|}{|D|}\log_{2}\frac{|C_k|}{|D|}$$

where $|D|$ is the total number of samples in the dataset, $D$ has $K$ classes $C_k$, $k = 1,2,\ldots,K$, $|C_k|$ is the number of samples belonging to class $C_k$, and therefore $\sum_{k=1}^{K}|C_k| = |D|$.
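       As a quick sanity check, take the toy dataset created by createDataSet() in the code below: 5 samples, 2 labeled 'yes' and 3 labeled 'no'. Plugging into the formula:

$$H(D) = -\frac{2}{5}\log_{2}\frac{2}{5} - \frac{3}{5}\log_{2}\frac{3}{5} \approx 0.971$$

which is the value calcShannonEnt() prints for that dataset.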

       Next, we pick a feature, partition the dataset into subsets according to that feature's values, and compute the entropy after the split.

       Suppose we pick feature $A$, and $A$ has $n$ distinct values $\{a_1, a_2, \ldots, a_n\}$. According to the value of $A$, $D$ is partitioned into $n$ subsets $D_1, D_2, \ldots, D_n$, where $|D_i|$ is the number of samples in $D_i$ and $\sum_{i=1}^{n}|D_i| = |D|$. Let $D_{ik}$ denote the set of samples in $D_i$ that belong to class $C_k$, i.e. $D_{ik} = D_i \cap C_k$, and let $|D_{ik}|$ be the number of samples in $D_{ik}$. The entropy after the split, $H(D|A)$, is then:
$$H(D|A) = \sum_{i=1}^{n}\frac{|D_i|}{|D|}H(D_i) = -\sum_{i=1}^{n}\frac{|D_i|}{|D|}\sum_{k=1}^{K}\frac{|D_{ik}|}{|D_i|}\log_{2}\frac{|D_{ik}|}{|D_i|}$$

       Subtracting the post-split entropy from the entropy of the original dataset gives the information gain $g(D,A)$:

$$g(D,A) = H(D) - H(D|A)$$

       We always choose the feature that yields the largest information gain. For each resulting subset we first remove the feature that was just used, then repeat the procedure on the remaining features, until either all samples in a subset belong to the same class or all features have been used up. In the latter case we fall back to majority voting: the subset is assigned the class label that occurs most often among its samples. The tree can therefore be built recursively.
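       To make the selection rule concrete, here is the computation for the same toy dataset, writing $A_0$ and $A_1$ for its two features 'no surfacing' and 'flippers'. Splitting on 'no surfacing' yields the subsets {yes, yes, no} and {no, no}:

$$H(D|A_0) = \frac{3}{5}\left(-\frac{2}{3}\log_{2}\frac{2}{3} - \frac{1}{3}\log_{2}\frac{1}{3}\right) + \frac{2}{5}\cdot 0 \approx 0.551, \qquad g(D,A_0) \approx 0.971 - 0.551 = 0.420$$

Splitting on 'flippers' yields {yes, yes, no, no} and {no}:

$$H(D|A_1) = \frac{4}{5}\cdot 1 + \frac{1}{5}\cdot 0 = 0.8, \qquad g(D,A_1) \approx 0.971 - 0.8 = 0.171$$

Feature 0 ('no surfacing') has the larger gain, which is exactly the index that chooseBestFeatureToSplit() returns below.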

2. Pros, Cons, and Applicable Data

       Pros: low computational complexity; output that is easy to interpret; insensitivity to missing intermediate values; ability to handle irrelevant features.
       Cons: prone to overfitting.
       Applicable data types: numeric and nominal.

3. Code Implementation

       This post implements the book's code and turned up a few small issues along the way, listed here (a short sketch illustrating fixes 1-3 follows the list):

  1. The dict keys() method returns a dict_keys object, which does not support indexing, so it must be converted to a list first;

  2. When building the tree, the chosen feature is deleted from the labels argument, i.e. the caller's labels list is mutated, so it is best to pass in a (shallow) copy of labels;

  3. Storing the classifier raises an error when the file is opened in text mode, because pickle works on bytes; the file is therefore read and written in binary mode;

  4. When plotting with matplotlib, Chinese characters are not displayed and show up as small boxes; adding the following two lines to the file fixes this:

    from pylab import *
    mpl.rcParams['font.sans-serif'] = ['SimHei']
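       A minimal sketch illustrating fixes 1-3 (it assumes the trees.py listed below is importable):

import pickle
from trees import createDataSet, createTree

myDat, labels = createDataSet()
# Fix 2: pass a copy, because createTree() mutates its labels argument
myTree = createTree(myDat, labels[:])
# Fix 1: dict_keys cannot be indexed in Python 3; convert to list first
firstStr = list(myTree.keys())[0]     # myTree.keys()[0] would raise TypeError
# Fix 3: pickle reads and writes bytes, so open files in binary mode ('wb'/'rb')
with open('classifierStorage.txt', 'wb') as fw:
    pickle.dump(myTree, fw)
with open('classifierStorage.txt', 'rb') as fr:
    restored = pickle.load(fr)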
    

       trees.py (each numbered block in the main section is best run on its own):

from math import log
import operator
import pickle

import treePlotter

def calcShannonEnt(dataSet):
    '''Compute the Shannon entropy of the dataset'''
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob,2)
    return shannonEnt

def createDataSet():
    '''Create a toy dataset'''
    dataSet = [[1,1,'yes'],
               [1,1,'yes'],
               [1,0,'no'],
               [0,1,'no'],
               [0,1,'no']]
    labels = ['no surfacing','flippers']
    return dataSet,labels

def splitDataSet(dataSet,axis,value):
    '''Split the dataset on the given feature value'''
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

def chooseBestFeatureToSplit(dataSet):
    '''Choose the best feature to split on'''
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        # Collect the unique values of this feature
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        # Compute the entropy of each split
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet,i,value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        # Keep the feature with the largest information gain
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

def majorityCnt(classList):
    '''Return the class label that occurs most often'''
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1),
                              reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet,labels):
    '''Build the decision tree recursively'''
    classList = [example[-1] for example in dataSet]
    # Stop splitting when all samples have the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # When all features are used up, return the majority class label
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del labels[bestFeat]
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet,bestFeat,value),subLabels)
    return myTree

def classify(inputTree,featLabels,testVec):
    '''Classify a sample with the decision tree'''
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__  == 'dict':
                classLabel = classify(secondDict[key],featLabels,testVec)
            else:
                classLabel = secondDict[key]
    return classLabel

def storeTree(inputTree,filename):
    '''Persist the decision tree with pickle (binary mode)'''
    fw = open(filename,'wb')
    pickle.dump(inputTree,fw)
    fw.close()

def grabTree(filename):
    '''Load a pickled decision tree'''
    fr = open(filename,'rb')
    return pickle.load(fr)


if __name__ == '__main__':

    # 1. Compute the Shannon entropy
    # myDat,labels = createDataSet()
    # print(myDat)
    # print(calcShannonEnt(myDat))
    # myDat[0][-1] = 'maybe'
    # print(myDat)
    # print(calcShannonEnt(myDat))

    # 2. Split the dataset by feature value
    # myDat,labels = createDataSet()
    # print(splitDataSet(myDat,0,1))
    # print(splitDataSet(myDat,0,0))

    # 3. Choose the feature with the largest information gain
    # myDat,labels = createDataSet()
    # print(chooseBestFeatureToSplit(myDat))

    # 4. Build the decision tree
    # myDat,labels = createDataSet()
    # myTree = createTree(myDat,labels)
    # print(myTree)

    # 5. Classify with the decision tree
    # myDat,labels = createDataSet()
    # myTree = createTree(myDat,labels[:])
    # print(myTree)
    # print(classify(myTree,labels,[1,0]))
    # print(classify(myTree,labels,[1,1]))

    # 6. Store and load the decision tree
    # myDat,labels = createDataSet()
    # myTree = createTree(myDat,labels[:])
    # storeTree(myTree,'classifierStorage.txt')
    # print(grabTree('classifierStorage.txt'))

    # 7. Predict contact lens type with a decision tree
    fr = open('lenses.txt')
    lenses = [inst.strip().split('\t') for inst in fr.readlines()]
    lensesLabels = ['age','prescript','astigmatic','tearRate']
    lensesTree = createTree(lenses,lensesLabels[:])
    print(lensesTree)
    treePlotter.createPlot(lensesTree)

       treePlotter.py (each numbered block in the main section is best run on its own):

'''Plot tree nodes with text annotations'''

import matplotlib.pyplot as plt
from pylab import *
mpl.rcParams['font.sans-serif'] = ['SimHei']

# Define box and arrow styles
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")


def plotNode(nodeTxt,centerPt,parentPt,nodeType):
    '''Draw an annotated node with an arrow'''
    createPlot.ax1.annotate(nodeTxt,xy=parentPt,
                            xycoords='axes fraction',
                            xytext=centerPt,textcoords='axes fraction',
                            va="center",ha="center",
                            bbox=nodeType,arrowprops=arrow_args )

def getNumLeafs(myTree):
    '''Count the leaf nodes'''
    numLeafs = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs

def getTreeDepth(myTree):
    '''Get the depth of the tree'''
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth

def retrieveTree(i):
    '''Return one of two predefined trees for testing'''
    listOfTrees =[{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                  {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}]
    return listOfTrees[i]

def plotMidText(cntrPt,parentPt,txtString):
    '''Fill in text between parent and child nodes'''
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.ax1.text(xMid,yMid,txtString,va="center",ha="center",rotation=30)

def plotTree(myTree,parentPt,nodeTxt):
    '''Recursively plot the tree (first compute its width and depth)'''
    numLeafs = getNumLeafs(myTree)
    depth = getTreeDepth(myTree)
    firstStr = list(myTree.keys())[0]
    cntrPt = (plotTree.xOff
              + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW,
              plotTree.yOff)
    plotMidText(cntrPt,parentPt,nodeTxt)
    plotNode(firstStr,cntrPt,parentPt,decisionNode)
    secondDic = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in secondDic.keys():
        if type(secondDic[key]).__name__ == 'dict':
            plotTree(secondDic[key],cntrPt,str(key))
        else:
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDic[key],(plotTree.xOff,plotTree.yOff),
                     cntrPt,leafNode)
            plotMidText((plotTree.xOff,plotTree.yOff),cntrPt,str(key))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD

def createPlot(inTree):
    '''Set up the figure and draw the whole tree'''
    fig = plt.figure(1,facecolor='white')
    fig.clf()
    axprops = dict(xticks=[],yticks=[])
    createPlot.ax1 = plt.subplot(111,frameon=False,**axprops)
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree,(0.5,1.0),'')
    plt.show()


# def createPlot():
#    fig = plt.figure(1,facecolor='white')
#    fig.clf()
#    createPlot.ax1 = plt.subplot(111,frameon=False)
#    plotNode(U'决策节点',(0.5, 0.1),(0.1, 0.5),decisionNode)
#    plotNode(U'叶节点',(0.8, 0.1),(0.3, 0.8),leafNode)
#    plt.show()

if __name__ == '__main__':

    # 1. Plot tree nodes using the commented-out createPlot() above
    # createPlot()

    # 2. Test the predefined trees and the leaf-count / tree-depth functions
    # print(retrieveTree(1))
    # myTree = retrieveTree(0)
    # print(getNumLeafs(myTree))
    # print(getTreeDepth(myTree))

    # 3. Plot the decision tree
    myTree = retrieveTree(0)
    createPlot(myTree)
    myTree['no surfacing'][3] = 'maybe'
    print(myTree)
    createPlot(myTree)

4. Related Files

       The files used in this post are available at the link below.

       Link: https://pan.baidu.com/s/1SRHqvRF8Q0iZjZs_tpm35Q (extraction code: 9c4g)

References

  1. Peter Harrington, Machine Learning in Action (《机器学习实战》)
  2. Li Hang (李航), Statistical Learning Methods (《统计学习方法》)
