Decision trees are a common family of machine learning algorithms. Taking the watermelon dataset from Zhou Zhihua's *Machine Learning* as an example, we want to learn a model from a given training set that can classify the samples in a test set. This post walks through the book's code in detail, and then uses the sklearn library for four-class classification of pipeline leak signals.
ID | 色泽 (color) | 根蒂 (root) | 敲声 (knock sound) | 纹理 (texture) | 脐部 (navel) | 触感 (touch) | 好瓜 (good melon?) |
---|---|---|---|---|---|---|---|
1 | 青绿 | 蜷缩 | 浊响 | 清晰 | 凹陷 | 硬滑 | 是 |
2 | 乌黑 | 蜷缩 | 沉闷 | 清晰 | 凹陷 | 硬滑 | 是 |
3 | 乌黑 | 蜷缩 | 浊响 | 清晰 | 凹陷 | 硬滑 | 是 |
4 | 青绿 | 蜷缩 | 沉闷 | 清晰 | 凹陷 | 硬滑 | 是 |
5 | 浅白 | 蜷缩 | 浊响 | 清晰 | 凹陷 | 硬滑 | 是 |
6 | 青绿 | 稍蜷 | 浊响 | 清晰 | 稍凹 | 软粘 | 是 |
7 | 乌黑 | 稍蜷 | 浊响 | 稍糊 | 稍凹 | 软粘 | 是 |
8 | 乌黑 | 稍蜷 | 浊响 | 清晰 | 稍凹 | 硬滑 | 是 |
9 | 乌黑 | 稍蜷 | 沉闷 | 稍糊 | 稍凹 | 硬滑 | 否 |
10 | 青绿 | 硬挺 | 清脆 | 清晰 | 平坦 | 软粘 | 否 |
11 | 浅白 | 硬挺 | 清脆 | 模糊 | 平坦 | 硬滑 | 否 |
12 | 浅白 | 蜷缩 | 浊响 | 模糊 | 平坦 | 软粘 | 否 |
13 | 青绿 | 稍蜷 | 浊响 | 稍糊 | 凹陷 | 硬滑 | 否 |
14 | 浅白 | 稍蜷 | 沉闷 | 模糊 | 凹陷 | 硬滑 | 否 |
15 | 乌黑 | 稍蜷 | 浊响 | 清晰 | 稍凹 | 软粘 | 否 |
16 | 浅白 | 蜷缩 | 浊响 | 模糊 | 平坦 | 硬滑 | 否 |
17 | 青绿 | 蜷缩 | 沉闷 | 稍糊 | 稍凹 | 硬滑 | 否 |
"Information entropy" is the most commonly used measure of the purity of a sample set. Suppose the proportion of class-$k$ samples in the current sample set $D$ is $p_k$; then the information entropy of $D$ is defined as
$$\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k$$
Take the dataset in the table above as an example: it contains 17 training samples, from which we learn a decision tree that predicts whether an uncut watermelon is a good one. Clearly $|\mathcal{Y}| = 2$. When learning starts, the root node contains all samples in $D$, with a proportion of positive samples $p_1 = 8/17$ and of negative samples $p_2 = 9/17$, so by the formula the information entropy is
$$\mathrm{Ent}(D) = -\sum_{k=1}^{2} p_k \log_2 p_k = -\left(\frac{8}{17}\log_2\frac{8}{17} + \frac{9}{17}\log_2\frac{9}{17}\right) = 0.998$$
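As a quick sanity check (a minimal sketch, not part of the book's code), the same number can be reproduced in a couple of lines:

    from math import log2

    # Ent(D) for the root node: 8 positive and 9 negative samples out of 17
    ent = -(8/17 * log2(8/17) + 9/17 * log2(9/17))
    print(round(ent, 3))   # prints 0.998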
Before writing any code, we first encode the dataset with integer labels:
色泽 (color): 0 = 浅白 (light), 1 = 青绿 (green), 2 = 乌黑 (dark)
根蒂 (root): 0 = 蜷缩 (curled up), 1 = 稍蜷 (slightly curled), 2 = 硬挺 (stiff)
敲声 (knock sound): 0 = 沉闷 (dull), 1 = 浊响 (muffled), 2 = 清脆 (crisp)
纹理 (texture): 0 = 清晰 (clear), 1 = 稍糊 (slightly blurry), 2 = 模糊 (blurry)
脐部 (navel): 0 = 凹陷 (sunken), 1 = 稍凹 (slightly sunken), 2 = 平坦 (flat)
触感 (touch): 0 = 硬滑 (hard and smooth), 1 = 软粘 (soft and sticky)
好瓜 (good melon): 'yes' = good melon, 'no' = bad melon
With the encoding in place, we create the dataset and compute its information entropy:
    from math import log
    import operator

    def createDataSet():
        # Watermelon dataset 2.0, encoded with the integer labels defined above
        dataSet = [[1,0,1,0,0,0,'yes'],
                   [2,0,0,0,0,0,'yes'],
                   [2,0,1,0,0,0,'yes'],
                   [1,0,1,0,0,0,'yes'],
                   [0,0,1,0,0,0,'yes'],
                   [1,1,1,0,1,1,'yes'],
                   [2,1,1,1,1,1,'yes'],
                   [2,1,1,0,1,0,'yes'],
                   [2,1,0,1,1,0,'no'],
                   [1,2,2,0,2,1,'no'],
                   [0,2,2,2,2,1,'no'],
                   [0,0,1,2,2,1,'no'],
                   [1,1,1,1,0,0,'no'],
                   [0,1,0,1,0,0,'no'],
                   [2,1,1,0,1,1,'no'],
                   [0,0,1,1,1,1,'no'],
                   [1,0,0,1,1,0,'no']]
        labels = ['色泽','根蒂','敲声','纹理','脐部','触感']
        return dataSet, labels

    def calcShannonEnt(dataSet):
        numEntries = len(dataSet)                      # number of rows (samples) in the dataset
        labelCounts = {}                               # dictionary mapping class label -> count
        for featVec in dataSet:
            currentLabel = featVec[-1]                 # the class label is the last column
            if currentLabel not in labelCounts.keys():
                labelCounts[currentLabel] = 0          # create an entry for a label seen for the first time
            labelCounts[currentLabel] += 1             # count this sample
        shannonEnt = 0.0                               # information entropy
        for key in labelCounts:                        # iterate over the class labels
            prob = float(labelCounts[key])/numEntries  # proportion (probability) of this class
            shannonEnt -= prob*log(prob,2)             # Shannon entropy formula
        return shannonEnt

    if __name__ == '__main__':
        a, b = createDataSet()
        c = calcShannonEnt(a)
        print(c)
Here c is the entropy computed by the program, 0.9975025463691153, which agrees with the hand-computed value of 0.998.
Suppose a discrete attribute a has V possible values {a^1, a^2, ..., a^V}. If a is used to split the sample set D, V child nodes are produced, and the v-th child node contains the subset D^v of all samples in D whose value of attribute a is a^v. The information entropy of each D^v can be computed with the formula above; since different child nodes contain different numbers of samples, each child node is given the weight |D^v|/|D|. Subtracting the weighted entropies of the subsets from the entropy of D gives the information gain of the attribute:
$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$$
If the formula alone is hard to follow, consider an example. Take the attribute 色泽 (color) in the watermelon dataset. It has three possible values, which split the dataset into three subsets: D^1 (浅白), D^2 (青绿) and D^3 (乌黑). Subset D^1 contains the five samples {5, 11, 12, 14, 16}, D^2 contains the six samples {1, 4, 6, 10, 13, 17}, and D^3 contains the six samples {2, 3, 7, 8, 9, 15}. Using the entropy formula, the entropies of the three child nodes produced by splitting on 色泽 are
$$\mathrm{Ent}(D^1) = -\left(\frac{1}{5}\log_2\frac{1}{5} + \frac{4}{5}\log_2\frac{4}{5}\right) = 0.722$$
$$\mathrm{Ent}(D^2) = -\left(\frac{3}{6}\log_2\frac{3}{6} + \frac{3}{6}\log_2\frac{3}{6}\right) = 1.000$$
$$\mathrm{Ent}(D^3) = -\left(\frac{4}{6}\log_2\frac{4}{6} + \frac{2}{6}\log_2\frac{2}{6}\right) = 0.918$$
$$\mathrm{Gain}(D, \text{色泽}) = \mathrm{Ent}(D) - \sum_{v=1}^{3} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v) = 0.998 - \left(\frac{5}{17}\times 0.722 + \frac{6}{17}\times 1.000 + \frac{6}{17}\times 0.918\right) = 0.109$$
So the information gain of the attribute 色泽 is 0.109.
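As a quick check on the arithmetic (a small, self-contained sketch, separate from the script below), the same gain can be computed directly from the class counts of each subset:

    from math import log2

    def ent(pos, neg):
        # Entropy of a node containing `pos` positive and `neg` negative samples
        total = pos + neg
        return -sum(p * log2(p) for p in (pos/total, neg/total) if p > 0)

    # D1 (浅白): 1 good / 4 bad,  D2 (青绿): 3 / 3,  D3 (乌黑): 4 / 2
    gain = ent(8, 9) - (5/17*ent(1, 4) + 6/17*ent(3, 3) + 6/17*ent(4, 2))
    print(round(gain, 3))   # prints 0.108, i.e. 0.109 up to intermediate rounding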
The script below extends the previous code with splitDataSet and chooseBestFeatureToSpilt, printing the information gain of every feature and returning the index of the best one:

    from math import log

    def createDataSet():
        dataSet = [[1,0,1,0,0,0,'yes'],
                   [2,0,0,0,0,0,'yes'],
                   [2,0,1,0,0,0,'yes'],
                   [1,0,1,0,0,0,'yes'],
                   [0,0,1,0,0,0,'yes'],
                   [1,1,1,0,1,1,'yes'],
                   [2,1,1,1,1,1,'yes'],
                   [2,1,1,0,1,0,'yes'],
                   [2,1,0,1,1,0,'no'],
                   [1,2,2,0,2,1,'no'],
                   [0,2,2,2,2,1,'no'],
                   [0,0,1,2,2,1,'no'],
                   [1,1,1,1,0,0,'no'],
                   [0,1,0,1,0,0,'no'],
                   [2,1,1,0,1,1,'no'],
                   [0,0,1,1,1,1,'no'],
                   [1,0,0,1,1,0,'no']]
        labels = ['色泽','根蒂','敲声','纹理','脐部','触感']
        return dataSet, labels

    def calcShannonEnt(dataSet):
        numEntries = len(dataSet)                      # number of samples
        labelCounts = {}                               # class label -> count
        for featVec in dataSet:
            currentLabel = featVec[-1]                 # the class label is the last column
            if currentLabel not in labelCounts.keys():
                labelCounts[currentLabel] = 0
            labelCounts[currentLabel] += 1
        shannonEnt = 0.0
        for key in labelCounts:
            prob = float(labelCounts[key])/numEntries  # proportion of this class
            shannonEnt -= prob*log(prob,2)             # Shannon entropy formula
        return shannonEnt

    def splitDataSet(dataSet, axis, value):
        # dataSet: the dataset; axis: index of the feature to split on; value: feature value to keep
        retDataSet = []
        for featVec in dataSet:
            if featVec[axis] == value:
                reducedFeatVec = featVec[:axis]        # drop the column of the feature that was split on
                reducedFeatVec.extend(featVec[axis+1:])
                retDataSet.append(reducedFeatVec)
        return retDataSet

    def chooseBestFeatureToSpilt(dataSet):
        numFeature = len(dataSet[0]) - 1               # number of features (last column is the label)
        baseEntropy = calcShannonEnt(dataSet)          # entropy of the whole training set
        bestInfoGain = 0                               # largest information gain found so far
        bestFeature = -1                               # index of the best feature (placeholder)
        for i in range(numFeature):
            featList = [example[i] for example in dataSet]   # all values of the i-th feature
            uniqueVals = set(featList)                 # distinct values of the feature
            newEntropy = 0                             # conditional entropy after splitting on feature i
            for value in uniqueVals:
                subDataSet = splitDataSet(dataSet, i, value)     # subset with feature i == value
                prob = len(subDataSet)/float(len(dataSet))       # weight |D^v| / |D|
                newEntropy += prob*calcShannonEnt(subDataSet)    # weighted entropy of the subset
            infoGain = baseEntropy - newEntropy        # information gain of feature i
            print("Information gain of feature {}: {}".format(i+1, infoGain))
            if (infoGain > bestInfoGain):              # keep the feature with the largest gain
                bestInfoGain = infoGain
                bestFeature = i
        return bestFeature

    if __name__ == '__main__':
        data, label = createDataSet()
        chooseBestFeatureToSpilt(data)
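If the implementation matches the hand calculation, the printed gains should be roughly 0.109 for 色泽, 0.143 for 根蒂, 0.141 for 敲声, 0.381 for 纹理, 0.289 for 脐部 and 0.006 for 触感 (the values given in Zhou's book), so the function returns 3, the index of 纹理.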
We now build the decision tree with the ID3 algorithm. Concretely: compute the information gain of every available feature at the current node, choose the feature with the largest gain as the splitting feature, create one child node for each value of that feature, and recurse on the children until either all samples in a node belong to one class or there are no features left to split on. Save the following script as trees.py.
    from math import log
    import operator

    def createDataSet():
        dataSet = [[1,0,1,0,0,0,'yes'],
                   [2,0,0,0,0,0,'yes'],
                   [2,0,1,0,0,0,'yes'],
                   [1,0,1,0,0,0,'yes'],
                   [0,0,1,0,0,0,'yes'],
                   [1,1,1,0,1,1,'yes'],
                   [2,1,1,1,1,1,'yes'],
                   [2,1,1,0,1,0,'yes'],
                   [2,1,0,1,1,0,'no'],
                   [1,2,2,0,2,1,'no'],
                   [0,2,2,2,2,1,'no'],
                   [0,0,1,2,2,1,'no'],
                   [1,1,1,1,0,0,'no'],
                   [0,1,0,1,0,0,'no'],
                   [2,1,1,0,1,1,'no'],
                   [0,0,1,1,1,1,'no'],
                   [1,0,0,1,1,0,'no']]
        labels = ['色泽','根蒂','敲声','纹理','脐部','触感']
        return dataSet, labels

    def calcShannonEnt(dataSet):
        numEntries = len(dataSet)                      # number of samples
        labelCounts = {}                               # class label -> count
        for featVec in dataSet:
            currentLabel = featVec[-1]                 # the class label is the last column
            if currentLabel not in labelCounts.keys():
                labelCounts[currentLabel] = 0
            labelCounts[currentLabel] += 1
        shannonEnt = 0.0
        for key in labelCounts:
            prob = float(labelCounts[key])/numEntries  # proportion of this class
            shannonEnt -= prob*log(prob,2)             # Shannon entropy formula
        return shannonEnt

    def splitDataSet(dataSet, axis, value):
        # dataSet: the dataset; axis: index of the feature to split on; value: feature value to keep
        retDataSet = []
        for featVec in dataSet:
            if featVec[axis] == value:
                reducedFeatVec = featVec[:axis]
                reducedFeatVec.extend(featVec[axis+1:])
                retDataSet.append(reducedFeatVec)
        return retDataSet

    def chooseBestFeatureToSpilt(dataSet):
        numFeature = len(dataSet[0]) - 1               # number of features
        baseEntropy = calcShannonEnt(dataSet)          # entropy of the whole training set
        bestInfoGain = 0
        bestFeature = -1
        for i in range(numFeature):
            featList = [example[i] for example in dataSet]   # all values of the i-th feature
            uniqueVals = set(featList)
            newEntropy = 0                             # conditional entropy after splitting on feature i
            for value in uniqueVals:
                subDataSet = splitDataSet(dataSet, i, value)
                prob = len(subDataSet)/float(len(dataSet))
                newEntropy += prob*calcShannonEnt(subDataSet)
            infoGain = baseEntropy - newEntropy        # information gain of feature i
            #print("Information gain of feature {}: {}".format(i+1, infoGain))
            if (infoGain > bestInfoGain):              # keep the feature with the largest gain
                bestInfoGain = infoGain
                bestFeature = i
        return bestFeature

    def majorityCnt(classList):
        # Return the class label that occurs most often in classList
        classCount = {}
        for vote in classList:
            if vote not in classCount.keys():
                classCount[vote] = 0
            classCount[vote] += 1
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]

    # Build the decision tree recursively (ID3)
    def createTree(dataSet, labels):
        classList = [example[-1] for example in dataSet]       # class labels of the current node
        if classList.count(classList[0]) == len(classList):    # all samples belong to one class: stop splitting
            return classList[0]
        if len(dataSet[0]) == 1:                               # no features left: return the majority class
            return majorityCnt(classList)
        bestFeat = chooseBestFeatureToSpilt(dataSet)           # index of the best feature
        bestFeatLabel = labels[bestFeat]                       # name of the best feature
        myTree = {bestFeatLabel: {}}                           # subtree rooted at this feature
        del(labels[bestFeat])                                  # remove the used feature name
        featValues = [example[bestFeat] for example in dataSet]   # values of the best feature in the training set
        uniqueVals = set(featValues)                           # remove duplicates
        for value in uniqueVals:                               # build one branch per feature value
            subLabels = labels[:]
            myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
        return myTree

    if __name__ == '__main__':
        data, label = createDataSet()
        tree = createTree(data, label)
        print(tree)
Running the code above yields a nested dictionary that describes the decision tree. To make the tree easier to read, we use the matplotlib library to visualize it. Save the following as treePlot.py.
    import matplotlib.pyplot as plt
    from trees import *
    from matplotlib.font_manager import FontProperties

    decisionNode = dict(boxstyle='sawtooth', fc='0.8')   # style of internal (decision) nodes
    leafNode = dict(boxstyle='round4', fc='0.8')         # style of leaf nodes
    arrow_args = dict(arrowstyle='<-')                   # arrow style

    def plotNode(nodeTxt, centerPt, parentPt, nodeType):
        arrow_args = dict(arrowstyle="<-")                # arrow style
        font = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=14)   # Chinese font (Windows path)
        createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                                xytext=centerPt, textcoords='axes fraction',
                                va='center', ha='center', bbox=nodeType, arrowprops=arrow_args,
                                fontproperties=font)

    def getNumLeafs(myTree):
        # Count the leaf nodes; this determines the width of the plot
        numLeafs = 0
        firstStr = next(iter(myTree))
        secondDict = myTree[firstStr]
        for key in secondDict.keys():
            if type(secondDict[key]).__name__ == 'dict':
                numLeafs += getNumLeafs(secondDict[key])
            else:
                numLeafs += 1
        return numLeafs

    def getTreeDepth(myTree):
        # Compute the depth of the tree; this determines the height of the plot
        maxDepth = 0
        firstStr = next(iter(myTree))
        secondDict = myTree[firstStr]
        for key in secondDict.keys():
            if type(secondDict[key]).__name__ == 'dict':
                thisDepth = 1 + getTreeDepth(secondDict[key])
            else:
                thisDepth = 1
            if thisDepth > maxDepth:
                maxDepth = thisDepth
        return maxDepth

    def plotMidText(cntrPt, parentPt, txtString):
        # Write the edge label halfway between the parent and the child node
        xMid = (parentPt[0]-cntrPt[0])/2 + cntrPt[0]
        yMid = (parentPt[1]-cntrPt[1])/2 + cntrPt[1]
        createPlot.ax1.text(xMid, yMid, txtString)

    def plotTree(myTree, parentPt, nodeTxt):
        decisionNode = dict(boxstyle="sawtooth", fc="0.8")    # style of decision nodes
        leafNode = dict(boxstyle="round4", fc="0.8")          # style of leaf nodes
        numLeafs = getNumLeafs(myTree)                        # number of leaves: determines the width
        depth = getTreeDepth(myTree)                          # depth of the tree
        firstStr = next(iter(myTree))                         # feature at the root of this subtree
        cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)   # centre of this node
        plotMidText(cntrPt, parentPt, nodeTxt)                # label the edge from the parent
        plotNode(firstStr, cntrPt, parentPt, decisionNode)    # draw the decision node
        secondDict = myTree[firstStr]                         # children of this node
        plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD   # move down one level
        for key in secondDict.keys():
            if type(secondDict[key]).__name__ == 'dict':      # a dict means the child is another subtree
                plotTree(secondDict[key], cntrPt, str(key))   # recurse into the subtree
            else:                                             # otherwise the child is a leaf
                plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
                plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
                plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
        plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD   # move back up after drawing this subtree

    def createPlot(inTree):
        fig = plt.figure(1, facecolor='white')                # create the figure
        fig.clf()                                             # clear it
        axprops = dict(xticks=[], yticks=[])
        createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)   # remove the x and y axes
        plotTree.totalW = float(getNumLeafs(inTree))          # total number of leaves
        plotTree.totalD = float(getTreeDepth(inTree))         # total depth
        plotTree.xOff = -0.5/plotTree.totalW; plotTree.yOff = 1.0     # initial x/y offsets
        plotTree(inTree, (0.5, 1.0), '')                      # draw the tree
        plt.show()

    if __name__ == '__main__':
        dataSet, labels = createDataSet()
        myTree = createTree(dataSet, labels)
        print(myTree)
        createPlot(myTree)
Once the decision tree has been built from the training data, it can be used to classify new samples. To do so we need the feature-label vector of the tree. Note that createTree deletes the attribute with the largest information gain from the label list it is given, so the test code must define a fresh label list in the main function (or pass a copy such as labels[:]); otherwise a "list index out of range" error is raised. To save space, only the classification function and the main function are shown here; they can be appended to trees.py.
    def classify(inputTree, featLabels, testVec):
        # Walk down the tree according to the feature values in testVec and return the predicted label
        firstStr = next(iter(inputTree))                # feature tested at this node
        secondDict = inputTree[firstStr]
        featIndex = featLabels.index(firstStr)          # position of that feature in the test vector
        for key in secondDict.keys():
            if testVec[featIndex] == key:
                if type(secondDict[key]).__name__ == 'dict':
                    classLabel = classify(secondDict[key], featLabels, testVec)   # keep descending
                else:
                    classLabel = secondDict[key]        # reached a leaf
        return classLabel

    if __name__ == '__main__':
        data, labels = createDataSet()
        print(labels)
        featlabel = ['色泽','根蒂','敲声','纹理','脐部','触感']   # fresh label list, since createTree deletes entries from labels
        tree = createTree(data, labels)
        print(tree)
        result = classify(tree, featlabel, [0,0,1,0,0,0])
        print(result)
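For reference, the test vector [0,0,1,0,0,0] decodes to 浅白, 蜷缩, 浊响, 清晰, 凹陷, 硬滑, the same attribute values as sample 5 in the table, so the classifier should print 'yes'.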
Rebuilding the tree from scratch every time is wasteful, so we can serialize the finished tree with pickle and load it back later:

    def storeTree(inputTree, filename):
        # Serialize the decision tree to disk
        import pickle
        fw = open(filename, 'wb')
        pickle.dump(inputTree, fw)
        fw.close()

    def grabTree(filename):
        # Load a previously stored decision tree
        import pickle
        fr = open(filename, 'rb')
        return pickle.load(fr)

    if __name__ == '__main__':
        data, labels = createDataSet()
        tree = createTree(data, labels)
        print(tree)
        storeTree(tree, 'xigua.txt')
        mytree = grabTree('xigua.txt')
        print(mytree)
The hand-written version above is useful for understanding the algorithm, but in practice we can simply use sklearn's DecisionTreeClassifier. The following example trains a tree on the lenses dataset (lenses.txt, from Machine Learning in Action) and exports it to tree.pdf with pydotplus and Graphviz:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from sklearn import tree
    from io import StringIO        # sklearn.externals.six has been removed from recent scikit-learn releases
    import pydotplus

    if __name__ == '__main__':
        with open('lenses.txt', 'r') as fr:                                 # load the file
            lenses = [inst.strip().split('\t') for inst in fr.readlines()]  # parse it line by line
        lenses_target = []                                  # class label of every sample
        for each in lenses:
            lenses_target.append(each[-1])
        lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']       # feature names
        lenses_list = []                                    # temporary list holding one feature column
        lenses_dict = {}                                    # dict used to build the pandas DataFrame
        for each_label in lensesLabels:                     # collect the values of each feature
            for each in lenses:
                lenses_list.append(each[lensesLabels.index(each_label)])
            lenses_dict[each_label] = lenses_list
            lenses_list = []
        #print(lenses_dict)                                 # print the dict
        lenses_pd = pd.DataFrame(lenses_dict)               # build the DataFrame
        #print(lenses_pd)
        le = LabelEncoder()                                 # encode the string values as integers
        for col in lenses_pd.columns:
            lenses_pd[col] = le.fit_transform(lenses_pd[col])
        #print(lenses_pd)
        clf = tree.DecisionTreeClassifier(max_depth=4)
        clf = clf.fit(lenses_pd.values.tolist(), lenses_target)
        dot_data = StringIO()
        tree.export_graphviz(clf, out_file=dot_data, feature_names=lenses_pd.keys(),
                             class_names=clf.classes_, filled=True, rounded=True,
                             special_characters=True)
        graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
        graph.write_pdf('tree.pdf')
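If Graphviz and pydotplus are not available, a minimal alternative (assuming scikit-learn >= 0.21, where plot_tree was added) is to draw the fitted classifier with matplotlib only:

    import matplotlib.pyplot as plt
    from sklearn import tree

    # Draw the fitted classifier `clf` from the script above without Graphviz
    plt.figure(figsize=(10, 8))
    tree.plot_tree(clf, feature_names=list(lenses_pd.keys()),
                   class_names=list(clf.classes_), filled=True, rounded=True)
    plt.show()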
References:
Zhou Zhihua, 《机器学习》 (Machine Learning)
Peter Harrington, 《机器学习实战》 (Machine Learning in Action)
P.S. All of the code in this post ran successfully in PyCharm. If you have any questions, feel free to comment or send me a private message.