《机器学习实战》(Machine Learning in Action) Python 3 Study Notes: Decision Trees

Decision Tree Theory

  • Preface
    • 1.1 Information Entropy
      • 1.1.1 Code Implementation
    • 1.2 Information Gain
      • 1.2.1 Code Implementation
    • 1.3 Building the Decision Tree
  • 2.1 Visualizing the Decision Tree
  • 3.1 Testing and Storing the Decision Tree
    • Storing and Loading
  • 4 Textbook Contact Lens Example
  • References

Preface

A decision tree is a common class of machine learning algorithms. Taking the watermelon dataset from Zhou Zhihua's 《机器学习》 (Machine Learning) as an example, we want to learn a model from a given training set that can classify samples in a test set. This post works through the book's code in detail and closes with a decision tree built with the sklearn library (the textbook contact-lens example).

编号  色泽  根蒂  敲声  纹理  脐部  触感  好瓜
(ID, color, root, knock sound, texture, navel, touch, good melon)
1   青绿  蜷缩  浊响  清晰  凹陷  硬滑  是
2   乌黑  蜷缩  沉闷  清晰  凹陷  硬滑  是
3   乌黑  蜷缩  浊响  清晰  凹陷  硬滑  是
4   青绿  蜷缩  沉闷  清晰  凹陷  硬滑  是
5   浅白  蜷缩  浊响  清晰  凹陷  硬滑  是
6   青绿  稍蜷  浊响  清晰  稍凹  软粘  是
7   乌黑  稍蜷  浊响  稍糊  稍凹  软粘  是
8   乌黑  稍蜷  浊响  清晰  稍凹  硬滑  是
9   乌黑  稍蜷  沉闷  稍糊  稍凹  硬滑  否
10  青绿  硬挺  清脆  清晰  平坦  软粘  否
11  浅白  硬挺  清脆  模糊  平坦  硬滑  否
12  浅白  蜷缩  浊响  模糊  平坦  软粘  否
13  青绿  稍蜷  浊响  稍糊  凹陷  硬滑  否
14  浅白  稍蜷  沉闷  模糊  凹陷  硬滑  否
15  乌黑  稍蜷  浊响  清晰  稍凹  软粘  否
16  浅白  蜷缩  浊响  模糊  平坦  硬滑  否
17  青绿  蜷缩  沉闷  稍糊  稍凹  硬滑  否

1.1 Information Entropy

"Information entropy" is the most commonly used measure of the purity of a sample set. Suppose the proportion of class-k samples in the current sample set D is p_k; then the information entropy of D is defined as

Ent(D) = -\sum_{k=1}^{|y|} p_k \log_2 p_k

Taking the dataset in the table above as an example, there are 17 training samples, which we use to learn a decision tree that predicts whether an uncut melon is a good one. Clearly |y| = 2. When learning begins, the root node contains all the samples in D, with positive proportion p1 = 8/17 and negative proportion p2 = 9/17, so by the formula the entropy is

Ent(D) = -\sum_{k=1}^{2} p_k \log_2 p_k = -\left( \frac{8}{17}\log_2\frac{8}{17} + \frac{9}{17}\log_2\frac{9}{17} \right) = 0.998
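
As a quick sanity check (my own minimal sketch, not from the book), the same number can be computed directly in Python:

from math import log2

# entropy of a root node with 8 positive and 9 negative samples out of 17
p_pos, p_neg = 8 / 17, 9 / 17
ent = -(p_pos * log2(p_pos) + p_neg * log2(p_neg))
print(round(ent, 3))  # prints 0.998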

1.1.1 Code Implementation

Before writing the code, we first encode the dataset with integer labels:
色泽 (color): 0 = 浅白, 1 = 青绿, 2 = 乌黑
根蒂 (root): 0 = 蜷缩, 1 = 稍蜷, 2 = 硬挺
敲声 (knock sound): 0 = 沉闷, 1 = 浊响, 2 = 清脆
纹理 (texture): 0 = 清晰, 1 = 稍糊, 2 = 模糊
脐部 (navel): 0 = 凹陷, 1 = 稍凹, 2 = 平坦
触感 (touch): 0 = 硬滑, 1 = 软粘
好瓜 (good melon): 'yes' = good melon, 'no' = bad melon
With these encodings fixed, we create the dataset and compute its entropy.

from math import log
import operator

def createDataSet():
    dataSet = [[1,0,1,0,0,0,'yes'],
               [2,0,0,0,0,0,'yes'],
               [2,0,1,0,0,0,'yes'],
               [1,0,1,0,0,0,'yes'],
               [0,0,1,0,0,0,'yes'],
               [1,1,1,0,1,1,'yes'],
               [2,1,1,1,1,1,'yes'],
               [2,1,1,0,1,0,'yes'],
               [2,1,0,1,1,0,'no'],
               [1,2,2,0,2,1,'no'],
               [0,2,2,2,2,1,'no'],
               [0,0,1,2,2,1,'no'],
               [1,1,1,1,0,0,'no'],
               [0,1,0,1,0,0,'no'],
               [2,1,1,0,1,1,'no'],
               [0,0,1,1,1,1,'no'],
               [1,0,0,1,1,0,'no']]
    labels = ['色泽','根蒂','敲声','纹理','脐部','触感']
    return dataSet,labels

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)  # number of rows in the dataset
    labelCounts = {}   # dict of class-label counts
    for featVec in dataSet:
        currentLabel = featVec[-1]  # the class label is the last column
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0  # first occurrence of this label: create its entry
        labelCounts[currentLabel] += 1  # count one occurrence per row
    shannonEnt = 0.0  # information entropy
    for key in labelCounts:  # iterate over the class labels
        prob = float(labelCounts[key])/numEntries  # probability of this label
        shannonEnt -= prob*log(prob,2)  # Shannon entropy formula
    return shannonEnt

if __name__=='__main__':
    a,b = createDataSet()
    c = calcShannonEnt(a)
    print(c)

The value c computed by the program is 0.9975025463691153, which agrees with our hand-calculated result of about 0.998.

1.2 Information Gain

Suppose a discrete attribute a has V possible values {a1, a2, ..., aV}. If a is used to split the sample set D, V branch nodes are produced, and the v-th branch node contains exactly those samples in D whose value on attribute a is a^v; call this subset D^v. The entropy formula above gives the information entropy of D^v. Because different branch nodes contain different numbers of samples, each branch is weighted by |D^v|/|D|. The information gain of the attribute is then the entropy of D minus the weighted sum of the branch entropies:

Gain(D, a) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v)
This may be hard to follow from the text alone, so consider an example. Take the attribute 色泽 (color) in the watermelon dataset. It has three possible values, which split the dataset into three subsets: D1 (浅白), D2 (青绿), and D3 (乌黑). Subset D1 contains the five samples {5, 11, 12, 14, 16}, D2 contains the six samples {1, 4, 6, 10, 13, 17}, and D3 contains the six samples {2, 3, 7, 8, 9, 15}. By the entropy formula, the three branch nodes obtained by splitting on 色泽 have entropies

Ent(D^1) = -\left( \frac{1}{5}\log_2\frac{1}{5} + \frac{4}{5}\log_2\frac{4}{5} \right) = 0.722
Ent(D^2) = -\left( \frac{3}{6}\log_2\frac{3}{6} + \frac{3}{6}\log_2\frac{3}{6} \right) = 1.000
Ent(D^3) = -\left( \frac{4}{6}\log_2\frac{4}{6} + \frac{2}{6}\log_2\frac{2}{6} \right) = 0.918

Gain(D, 色泽) = Ent(D) - \sum_{v=1}^{3} \frac{|D^v|}{|D|} Ent(D^v) = 0.998 - \left( \frac{5}{17}\times 0.722 + \frac{6}{17}\times 1.000 + \frac{6}{17}\times 0.918 \right) = 0.109

So the information gain of the 色泽 attribute is 0.109.
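
As a quick numeric check (my own sketch, not from the book), the same gain can be computed from the class counts in each subset:

from math import log2

def ent(pos, neg):
    # entropy of a node containing pos positive and neg negative samples
    total = pos + neg
    e = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            e -= p * log2(p)
    return e

# class counts per 色泽 value: 浅白 (1+, 4-), 青绿 (3+, 3-), 乌黑 (4+, 2-)
gain = ent(8, 9) - (5/17 * ent(1, 4) + 6/17 * ent(3, 3) + 6/17 * ent(4, 2))
print(round(gain, 3))  # prints 0.108; the book rounds Ent(D) to 0.998 and reports 0.109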

1.2.1 Code Implementation

from math import log

def createDataSet():
    dataSet = [[1,0,1,0,0,0,'yes'],
               [2,0,0,0,0,0,'yes'],
               [2,0,1,0,0,0,'yes'],
               [1,0,1,0,0,0,'yes'],
               [0,0,1,0,0,0,'yes'],
               [1,1,1,0,1,1,'yes'],
               [2,1,1,1,1,1,'yes'],
               [2,1,1,0,1,0,'yes'],
               [2,1,0,1,1,0,'no'],
               [1,2,2,0,2,1,'no'],
               [0,2,2,2,2,1,'no'],
               [0,0,1,2,2,1,'no'],
               [1,1,1,1,0,0,'no'],
               [0,1,0,1,0,0,'no'],
               [2,1,1,0,1,1,'no'],
               [0,0,1,1,1,1,'no'],
               [1,0,0,1,1,0,'no']]
    labels = ['色泽','根蒂','敲声','纹理','脐部','触感']
    return dataSet,labels

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)  # number of rows in the dataset
    labelCounts = {}   # dict of class-label counts
    for featVec in dataSet:
        currentLabel = featVec[-1]  # the class label is the last column
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0  # first occurrence of this label: create its entry
        labelCounts[currentLabel] += 1  # count one occurrence per row
    shannonEnt = 0.0  # information entropy
    for key in labelCounts:  # iterate over the class labels
        prob = float(labelCounts[key])/numEntries  # probability of this label
        shannonEnt -= prob*log(prob,2)  # Shannon entropy formula
    return shannonEnt

def splitDataSet(dataSet,axis,value):  # dataset, index of the splitting feature, feature value to keep
    retDataSet = []  # list holding the resulting subset
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]  # drop the splitting feature itself
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

def chooseBestFeatureToSpilt(dataSet):
    numFeature = len(dataSet[0]) - 1     # number of features
    baseEntropy = calcShannonEnt(dataSet)  # entropy of the full training set
    bestInfoGain = 0  # best information gain found so far
    bestFeature = -1  # index of the best feature (placeholder initial value)
    for i in range(numFeature):
        featList = [example[i] for example in dataSet]  # values of the i-th feature across the training set
        uniqueVals = set(featList)  # distinct values of this feature
        newEntropy = 0  # empirical conditional entropy
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet,i,value)  # subset with this feature value
            prob = len(subDataSet)/float(len(dataSet))  # weight of the subset, i.e. its probability
            newEntropy += prob*calcShannonEnt(subDataSet)  # weighted entropy of the subset
        infoGain = baseEntropy - newEntropy  # information gain
        print("Information gain of feature {}: {}".format(i+1,infoGain))
        if (infoGain > bestInfoGain):  # keep the feature with the largest gain and return its index
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

if __name__=='__main__':
    data,label = createDataSet()
    chooseBestFeatureToSpilt(data)
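
Running this prints the information gain of each of the six features. According to the worked numbers in Zhou Zhihua's book they are roughly 0.109 (色泽), 0.143 (根蒂), 0.141 (敲声), 0.381 (纹理), 0.289 (脐部) and 0.006 (触感), so the function returns index 3, i.e. 纹理, as the best feature to split on.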

1.3 Building the Decision Tree

We build the decision tree with the ID3 algorithm. The procedure is: compute the information gain of every candidate feature at the current node, take the feature with the largest gain as that node's splitting feature, create one child node per value of that feature, and recurse on the children, stopping when no features remain to split on or when no new child node can be created. Save the code below as trees.py.

from math import log
import operator

def createDataSet():
    dataSet = [[1,0,1,0,0,0,'yes'],
               [2,0,0,0,0,0,'yes'],
               [2,0,1,0,0,0,'yes'],
               [1,0,1,0,0,0,'yes'],
               [0,0,1,0,0,0,'yes'],
               [1,1,1,0,1,1,'yes'],
               [2,1,1,1,1,1,'yes'],
               [2,1,1,0,1,0,'yes'],
               [2,1,0,1,1,0,'no'],
               [1,2,2,0,2,1,'no'],
               [0,2,2,2,2,1,'no'],
               [0,0,1,2,2,1,'no'],
               [1,1,1,1,0,0,'no'],
               [0,1,0,1,0,0,'no'],
               [2,1,1,0,1,1,'no'],
               [0,0,1,1,1,1,'no'],
               [1,0,0,1,1,0,'no']]
    labels = ['色泽','根蒂','敲声','纹理','脐部','触感']
    return dataSet,labels

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)  # number of rows in the dataset
    labelCounts = {}   # dict of class-label counts
    for featVec in dataSet:
        currentLabel = featVec[-1]  # the class label is the last column
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0  # first occurrence of this label: create its entry
        labelCounts[currentLabel] += 1  # count one occurrence per row
    shannonEnt = 0.0  # information entropy
    for key in labelCounts:  # iterate over the class labels
        prob = float(labelCounts[key])/numEntries  # probability of this label
        shannonEnt -= prob*log(prob,2)  # Shannon entropy formula
    return shannonEnt

def splitDataSet(dataSet,axis,value):  # dataset, index of the splitting feature, feature value to keep
    retDataSet = []  # list holding the resulting subset
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]  # drop the splitting feature itself
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

def chooseBestFeatureToSpilt(dataSet):
    numFeature = len(dataSet[0]) - 1     # number of features
    baseEntropy = calcShannonEnt(dataSet)  # entropy of the full training set
    bestInfoGain = 0  # best information gain found so far
    bestFeature = -1  # index of the best feature (placeholder initial value)
    for i in range(numFeature):
        featList = [example[i] for example in dataSet]  # values of the i-th feature across the training set
        uniqueVals = set(featList)  # distinct values of this feature
        newEntropy = 0  # empirical conditional entropy
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet,i,value)  # subset with this feature value
            prob = len(subDataSet)/float(len(dataSet))  # weight of the subset, i.e. its probability
            newEntropy += prob*calcShannonEnt(subDataSet)  # weighted entropy of the subset
        infoGain = baseEntropy - newEntropy  # information gain
        #print("Information gain of feature {}: {}".format(i+1,infoGain))
        if (infoGain > bestInfoGain):  # keep the feature with the largest gain and return its index
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

def majorityCnt(classList):
    # majority vote: return the most common class label in classList
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]

# build the decision tree recursively
def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]  # class labels of the current subset
    if classList.count(classList[0]) == len(classList):  # all samples share one class: stop splitting
        return classList[0]
    if len(dataSet[0]) == 1:  # all features used up: stop and take a majority vote
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSpilt(dataSet)  # index of the best feature
    bestFeatLabel = labels[bestFeat]  # name of the best feature
    myTree = {bestFeatLabel:{}}  # subtree rooted at the best feature
    del(labels[bestFeat])  # remove the used feature name
    featValues = [example[bestFeat] for example in dataSet]  # values of the best feature in this subset
    uniqueVals = set(featValues)  # distinct values
    for value in uniqueVals:  # one branch per value
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet,bestFeat,value),subLabels)
    return myTree

if __name__=='__main__':
    data,label = createDataSet()
    tree = createTree(data,label)
    print(tree)

2.1 Visualizing the Decision Tree

After running the code above, we obtain the dictionary that describes the decision tree. To make the tree easier to read, we use the matplotlib library to visualize it. Save the code below as treePlot.py.

import matplotlib.pyplot as plt
from trees import *
from matplotlib.font_manager import FontProperties

decisionNode = dict(boxstyle = 'sawtooth',fc='0.8')
leafNode = dict(boxstyle='round4',fc='0.8')
arrow_args = dict(arrowstyle='<-')

def plotNode(nodeTxt,centerPt,parentPt,nodeType):
    arrow_args = dict(arrowstyle="<-")  # arrow style for the edge
    font = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=14)  # SimSun font so Chinese labels render (Windows path)

    createPlot.ax1.annotate(nodeTxt,xy=parentPt,xycoords='axes fraction',
                            xytext=centerPt,textcoords='axes fraction',
                            va='center',ha='center',bbox=nodeType,arrowprops=arrow_args,
                            fontproperties=font)


def getNumLeafs(myTree):  # count the leaf nodes, which determines the width of the plot
    numLeafs = 0
    firstStr = next(iter(myTree))
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs

def getTreeDepth(myTree):  # number of levels in the tree, which determines the height of the plot
    maxDepth = 0
    firstStr = next(iter(myTree))
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':
            thisDepth = 1+getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth

def plotMidText(cntrPt,parentPt,txtString):  # write the feature value on the edge between parent and child
    xMid = (parentPt[0]-cntrPt[0])/2 + cntrPt[0]
    yMid = (parentPt[1]-cntrPt[1])/2 + cntrPt[1]
    createPlot.ax1.text(xMid,yMid,txtString)

def plotTree(myTree, parentPt, nodeTxt):
    decisionNode = dict(boxstyle="sawtooth", fc="0.8")                                        # style of internal (decision) nodes
    leafNode = dict(boxstyle="round4", fc="0.8")                                              # style of leaf nodes
    numLeafs = getNumLeafs(myTree)                                                            # number of leaves sets the width of the tree
    depth = getTreeDepth(myTree)                                                              # number of levels in the tree
    firstStr = next(iter(myTree))                                                             # key of the current subtree (the splitting feature)
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)     # center position of this node
    plotMidText(cntrPt, parentPt, nodeTxt)                                                    # label the incoming edge
    plotNode(firstStr, cntrPt, parentPt, decisionNode)                                        # draw the decision node
    secondDict = myTree[firstStr]                                                             # children of this node
    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD                                       # move down one level
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':                                            # a dict means another decision node
            plotTree(secondDict[key],cntrPt,str(key))                                         # recurse into the subtree
        else:                                                                                 # otherwise it is a leaf: draw it and label the edge
            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD

def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')                                                    # create the figure
    fig.clf()                                                                                 # clear it
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)                               # hide the x and y axes
    plotTree.totalW = float(getNumLeafs(inTree))                                              # total number of leaves
    plotTree.totalD = float(getTreeDepth(inTree))                                             # total depth of the tree
    plotTree.xOff = -0.5/plotTree.totalW                                                      # initial x offset
    plotTree.yOff = 1.0                                                                       # initial y offset
    plotTree(inTree, (0.5,1.0), '')                                                           # draw the tree
    plt.show()

if __name__ == '__main__':
    dataSet, labels = createDataSet()
    myTree = createTree(dataSet, labels)
    print(myTree)
    createPlot(myTree)

3.1 Testing and Storing the Decision Tree

Once the decision tree has been built from the training data, we can use it to classify real samples. To classify, we need to supply the tree's feature-label vector. Note that the createTree function deletes the feature with the largest information gain from the labels list, so when testing you must redefine the label list in the main function; otherwise looking up a feature index in the pruned list will raise an error. To save space, only the classification code and the main function are shown here; they can be appended to trees.py.

def classify(inputTree,featLabels,testVec):
    firstStr = next(iter(inputTree))  # splitting feature at the root of this subtree
    secondDict = inputTree[firstStr]  # its branches
    featIndex = featLabels.index(firstStr)  # position of that feature in the test vector
    for key in secondDict.keys():
        if testVec[featIndex] == key:  # follow the branch matching the test vector
            if type(secondDict[key]).__name__== 'dict':
                classLabel = classify(secondDict[key],featLabels,testVec)  # keep descending
            else:
                classLabel = secondDict[key]  # reached a leaf: this is the predicted class
    return classLabel

if __name__=='__main__':
    data,labels = createDataSet()
    print(labels)
    featlabel = ['色泽','根蒂','敲声','纹理','脐部','触感']
    tree = createTree(data,labels)
    print(tree)
    result = classify(tree,featlabel,[0,0,1,0,0,0])  # sample 5 (浅白,蜷缩,浊响,清晰,凹陷,硬滑); expected output: 'yes'
    print(result)

Storing and Loading

import pickle

def storeTree(inputTree,filename):
    # serialize the tree dict to disk with pickle
    with open(filename,'wb') as fw:
        pickle.dump(inputTree,fw)

def grabTree(filename):
    # load a previously stored tree dict from disk
    with open(filename,'rb') as fr:
        return pickle.load(fr)

if __name__=='__main__':
    data,labels = createDataSet()
    tree = createTree(data,labels)
    print(tree)
    storeTree(tree,'xigua.txt')
    mytree = grabTree('xigua.txt')
    print(mytree)
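
Note that pickle writes a binary serialization, so even though the file is named 'xigua.txt' it is not human-readable text; the extension is arbitrary.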

4 Textbook Contact Lens Example

import pandas as pd
from sklearn.preprocessing import LabelEncoder
import pydotplus
from io import StringIO  # sklearn.externals.six was removed in newer scikit-learn; io.StringIO serves the same purpose here
from sklearn import tree

if __name__ == '__main__':
    with open('lenses.txt', 'r') as fr:                                       # load the file
        lenses = [inst.strip().split('\t') for inst in fr.readlines()]        # split each line into fields
    lenses_target = []                                                        # class label of each sample
    for each in lenses:
        lenses_target.append(each[-1])

    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']             # feature names
    lenses_list = []                                                          # temporary list for one feature column
    lenses_dict = {}                                                          # dict of feature columns, used to build the DataFrame
    for each_label in lensesLabels:                                           # collect one column per feature
        for each in lenses:
            lenses_list.append(each[lensesLabels.index(each_label)])
        lenses_dict[each_label] = lenses_list
        lenses_list = []
    #print(lenses_dict)                                                       # inspect the dict if needed
    lenses_pd = pd.DataFrame(lenses_dict)                                     # build the pandas DataFrame
    #print(lenses_pd)
    le = LabelEncoder()                                                       # encode string values as integers, column by column
    for col in lenses_pd.columns:
        lenses_pd[col] = le.fit_transform(lenses_pd[col])
    #print(lenses_pd)

    clf = tree.DecisionTreeClassifier(max_depth=4)                            # fit a decision tree with maximum depth 4
    clf = clf.fit(lenses_pd.values.tolist(),lenses_target)
    dot_data = StringIO()
    tree.export_graphviz(clf,out_file=dot_data,feature_names=lenses_pd.keys(),
                         class_names=clf.classes_,filled=True,rounded=True,special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())                # render the Graphviz dot data to a PDF
    graph.write_pdf('tree.pdf')
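
If Graphviz/pydotplus is not installed, a rough alternative (my own sketch, assuming scikit-learn 0.21 or newer, where tree.plot_tree is available, and reusing clf and lenses_pd from the code above) is to draw the fitted tree directly with matplotlib:

import matplotlib.pyplot as plt
from sklearn import tree

# plot_tree renders the fitted classifier without any Graphviz dependency
plt.figure(figsize=(10, 6))
tree.plot_tree(clf, feature_names=list(lenses_pd.columns),
               class_names=list(clf.classes_), filled=True, rounded=True)
plt.savefig('tree.png')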

References

Zhou Zhihua, 《机器学习》 (Machine Learning)
Peter Harrington, 《机器学习实战》 (Machine Learning in Action)

PS: all of the code in this post has been run successfully in PyCharm. If you have any questions, feel free to leave a comment or message me.
