Constructing a Decision Tree with the ID3 Algorithm
The theory behind decision trees is not repeated here; this post is simply my study notes on *Machine Learning in Action*.
First, look at the decision tree below; hopefully it helps in understanding what a decision tree is.
3.1.1 Information Gain
First, two formulas need to be understood:
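The two formulas are the Shannon entropy of a data set and the information gain of a split. Since the original images are not reproduced here, this is a standard rendering of both:

```latex
% Shannon entropy of a data set D whose classes occur with proportions p(x_i)
H(D) = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i)

% Information gain from splitting D on feature A into subsets D_v
\operatorname{Gain}(D, A) = H(D) - \sum_{v \in \operatorname{values}(A)} \frac{|D_v|}{|D|}\, H(D_v)
```

The code below computes exactly these two quantities: calcShannonEnt() is H(D), and chooseBestFeatureToSplit() picks the feature A with the largest Gain(D, A).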
Create a file named trees.py and add the following code to it:
from math import log

def calcShannonEnt(dataSet):   # compute the Shannon entropy of the given data set
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]              # the class label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)       # H = -sum(p * log2(p))
    return shannonEnt
Next, a simple input data set:
def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
At the Python command prompt, enter the commands below.
The value returned, about 0.9709, is the entropy; the higher the entropy, the more mixed the data.
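A sketch of that session, with calcShannonEnt() inlined so the snippet runs on its own:

```python
from math import log

def calcShannonEnt(dataSet):
    # count how often each class label (last column) occurs
    labelCounts = {}
    for featVec in dataSet:
        labelCounts[featVec[-1]] = labelCounts.get(featVec[-1], 0) + 1
    shannonEnt = 0.0
    for count in labelCounts.values():
        prob = float(count) / len(dataSet)
        shannonEnt -= prob * log(prob, 2)   # H = -sum(p * log2(p))
    return shannonEnt

myDat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
         [0, 1, 'no'], [0, 1, 'no']]
print(calcShannonEnt(myDat))   # ≈ 0.9709
```

Changing one of the labels to a third class (say, 'maybe') raises the entropy, which is the sense in which higher entropy means more mixed data.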
3.1.2 Splitting the Data Set
def splitDataSet(dataSet, axis, value):   # split the data set on the given feature
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:        # keep rows whose feature `axis` equals `value`
            reduceFeatVec = featVec[:axis]
            # extend() takes a list and appends each of its elements to the existing list
            reduceFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reduceFeatVec)
    return retDataSet
Now test the function. At the Python prompt, enter:
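For instance, a self-contained sketch with the data set from createDataSet() written out; note that the matched feature column is removed from each returned row:

```python
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:                    # keep matching rows...
            reduceFeatVec = featVec[:axis]
            reduceFeatVec.extend(featVec[axis + 1:])  # ...minus the feature itself
            retDataSet.append(reduceFeatVec)
    return retDataSet

myDat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
         [0, 1, 'no'], [0, 1, 'no']]
print(splitDataSet(myDat, 0, 1))   # -> [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(splitDataSet(myDat, 0, 0))   # -> [[1, 'no'], [1, 'no']]
```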
Next we iterate over the whole data set, calling calcShannonEnt() and splitDataSet() for each feature, to find the feature that gives the best split.
Again, add the following code to trees.py:
# choose the best feature to split the data set on
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1        # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy  # information gain = drop in entropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
Test the code:
The best feature to split on is feature 0.
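A self-contained check, using condensed restatements of the helpers above, confirming that feature 0 ('no surfacing') gives the largest information gain:

```python
from math import log

def calcShannonEnt(dataSet):
    labelCounts = {}
    for featVec in dataSet:
        labelCounts[featVec[-1]] = labelCounts.get(featVec[-1], 0) + 1
    return -sum(c / float(len(dataSet)) * log(c / float(len(dataSet)), 2)
                for c in labelCounts.values())

def splitDataSet(dataSet, axis, value):
    return [fv[:axis] + fv[axis + 1:] for fv in dataSet if fv[axis] == value]

def chooseBestFeatureToSplit(dataSet):
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain, bestFeature = 0.0, -1
    for i in range(len(dataSet[0]) - 1):       # last column is the class label
        newEntropy = 0.0
        for value in set(ex[i] for ex in dataSet):
            subDataSet = splitDataSet(dataSet, i, value)
            newEntropy += len(subDataSet) / float(len(dataSet)) * calcShannonEnt(subDataSet)
        if baseEntropy - newEntropy > bestInfoGain:   # largest information gain wins
            bestInfoGain, bestFeature = baseEntropy - newEntropy, i
    return bestFeature

myDat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
         [0, 1, 'no'], [0, 1, 'no']]
print(chooseBestFeatureToSplit(myDat))   # -> 0
```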
3.1.3 Recursively Building the Tree
import operator

def majorityCnt(classList):   # return the class label that occurs most often
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
This function takes a list of class names and builds a dictionary whose keys are the unique values in classList; the dictionary stores how often each class label occurs. It then uses operator.itemgetter to sort the dictionary entries by count and returns the most frequent class name.
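A quick check of the majority vote (a condensed restatement so the snippet runs standalone):

```python
import operator

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    # sort the (label, count) pairs by count, descending, and return the top label
    return sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)[0][0]

print(majorityCnt(['yes', 'no', 'no']))   # -> no
```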
def createTree(dataSet, labels):   # recursively build the decision tree
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]              # all labels identical: return a leaf
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)    # no features left: take a majority vote
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]            # copy so recursive calls do not clobber labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
Run a quick test:
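A self-contained run, with condensed helpers; as an assumption to keep the sketch short, the majority-vote fallback is a one-line max() instead of majorityCnt():

```python
from math import log

def calcShannonEnt(dataSet):
    labelCounts = {}
    for featVec in dataSet:
        labelCounts[featVec[-1]] = labelCounts.get(featVec[-1], 0) + 1
    return -sum(c / float(len(dataSet)) * log(c / float(len(dataSet)), 2)
                for c in labelCounts.values())

def splitDataSet(dataSet, axis, value):
    return [fv[:axis] + fv[axis + 1:] for fv in dataSet if fv[axis] == value]

def chooseBestFeatureToSplit(dataSet):
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain, bestFeature = 0.0, -1
    for i in range(len(dataSet[0]) - 1):
        newEntropy = sum(len(sub) / float(len(dataSet)) * calcShannonEnt(sub)
                         for sub in (splitDataSet(dataSet, i, v)
                                     for v in set(ex[i] for ex in dataSet)))
        if baseEntropy - newEntropy > bestInfoGain:
            bestInfoGain, bestFeature = baseEntropy - newEntropy, i
    return bestFeature

def createTree(dataSet, labels):
    classList = [ex[-1] for ex in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]                             # pure leaf
    if len(dataSet[0]) == 1:                            # out of features: majority vote
        return max(set(classList), key=classList.count)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    restLabels = labels[:bestFeat] + labels[bestFeat + 1:]
    return {labels[bestFeat]: {
        v: createTree(splitDataSet(dataSet, bestFeat, v), restLabels)
        for v in set(ex[bestFeat] for ex in dataSet)}}

myDat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
         [0, 1, 'no'], [0, 1, 'no']]
myTree = createTree(myDat, ['no surfacing', 'flippers'])
print(myTree)
# -> {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
```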
The tree is built, but a nested dictionary is hard to read. We can use Matplotlib annotations to draw the tree as a figure,
as covered in my previous post.
3.3 Testing and Storing the Classifier
3.3.1 Testing the Algorithm: Classifying with the Decision Tree
def classify(inputTree, featLabels, testVec):   # walk the tree to classify testVec
    firstStr = next(iter(inputTree))            # label of the root feature
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)      # translate the label into a feature index
    for key in secondDict:
        if testVec[featIndex] == key:
            if isinstance(secondDict[key], dict):
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                classLabel = secondDict[key]
    return classLabel
Test the code:
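A self-contained check; the tree dictionary is hard-coded to the one built earlier, so classify() can be exercised on its own:

```python
def classify(inputTree, featLabels, testVec):
    firstStr = next(iter(inputTree))        # root feature label
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)  # which column of testVec to compare
    for key in secondDict:
        if testVec[featIndex] == key:
            if isinstance(secondDict[key], dict):
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                classLabel = secondDict[key]
    return classLabel

myTree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']
print(classify(myTree, labels, [1, 0]))   # -> no
print(classify(myTree, labels, [1, 1]))   # -> yes
```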
3.3.2 Using the Algorithm: Storing the Decision Tree
def storeTree(inputTree, filename):   # store the decision tree with the pickle module
    import pickle
    fw = open(filename, 'wb')         # pickle writes bytes, so open in binary mode
    pickle.dump(inputTree, fw)
    fw.close()

def grabTree(filename):
    import pickle
    fr = open(filename, 'rb')
    return pickle.load(fr)
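A round-trip sketch; the temp-file path and the with-blocks are my additions for the sake of a runnable example, not the book's code:

```python
import os
import pickle
import tempfile

def storeTree(inputTree, filename):
    with open(filename, 'wb') as fw:    # binary mode: pickle writes bytes
        pickle.dump(inputTree, fw)

def grabTree(filename):
    with open(filename, 'rb') as fr:
        return pickle.load(fr)

myTree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
path = os.path.join(tempfile.mkdtemp(), 'classifierStorage.txt')
storeTree(myTree, path)
print(grabTree(path) == myTree)   # -> True
```

Storing the tree this way means the (potentially expensive) tree construction only has to run once; later classifications just reload the pickled dictionary.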
3.4 Example: Predicting Contact Lens Type with a Decision Tree
The data are as follows (the columns are age, prescription, astigmatic, tear production rate, and lens type):
young myope no reduced no lenses
young myope no normal soft
young myope yes reduced no lenses
young myope yes normal hard
young hyper no reduced no lenses
young hyper no normal soft
young hyper yes reduced no lenses
young hyper yes normal hard
pre myope no reduced no lenses
pre myope no normal soft
pre myope yes reduced no lenses
pre myope yes normal hard
pre hyper no reduced no lenses
pre hyper no normal soft
pre hyper yes reduced no lenses
pre hyper yes normal no lenses
presbyopic myope no reduced no lenses
presbyopic myope no normal no lenses
presbyopic myope yes reduced no lenses
presbyopic myope yes normal hard
presbyopic hyper no reduced no lenses
presbyopic hyper no normal soft
presbyopic hyper yes reduced no lenses
presbyopic hyper yes normal no lenses
Enter the following commands at the Python prompt:
The result is as follows:
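A self-contained sketch of that run. The data above are pasted into a string; the feature names ('age', 'prescript', 'astigmatic', 'tearRate') follow the book's lensesLabels, and the helper functions are condensed restatements of trees.py (with a one-line majority vote):

```python
from math import log

def calcShannonEnt(dataSet):
    labelCounts = {}
    for featVec in dataSet:
        labelCounts[featVec[-1]] = labelCounts.get(featVec[-1], 0) + 1
    return -sum(c / float(len(dataSet)) * log(c / float(len(dataSet)), 2)
                for c in labelCounts.values())

def splitDataSet(dataSet, axis, value):
    return [fv[:axis] + fv[axis + 1:] for fv in dataSet if fv[axis] == value]

def chooseBestFeatureToSplit(dataSet):
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain, bestFeature = 0.0, -1
    for i in range(len(dataSet[0]) - 1):
        newEntropy = sum(len(sub) / float(len(dataSet)) * calcShannonEnt(sub)
                         for sub in (splitDataSet(dataSet, i, v)
                                     for v in set(ex[i] for ex in dataSet)))
        if baseEntropy - newEntropy > bestInfoGain:
            bestInfoGain, bestFeature = baseEntropy - newEntropy, i
    return bestFeature

def createTree(dataSet, labels):
    classList = [ex[-1] for ex in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    if len(dataSet[0]) == 1:
        return max(set(classList), key=classList.count)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    restLabels = labels[:bestFeat] + labels[bestFeat + 1:]
    return {labels[bestFeat]: {
        v: createTree(splitDataSet(dataSet, bestFeat, v), restLabels)
        for v in set(ex[bestFeat] for ex in dataSet)}}

lensesText = """young myope no reduced no lenses
young myope no normal soft
young myope yes reduced no lenses
young myope yes normal hard
young hyper no reduced no lenses
young hyper no normal soft
young hyper yes reduced no lenses
young hyper yes normal hard
pre myope no reduced no lenses
pre myope no normal soft
pre myope yes reduced no lenses
pre myope yes normal hard
pre hyper no reduced no lenses
pre hyper no normal soft
pre hyper yes reduced no lenses
pre hyper yes normal no lenses
presbyopic myope no reduced no lenses
presbyopic myope no normal no lenses
presbyopic myope yes reduced no lenses
presbyopic myope yes normal hard
presbyopic hyper no reduced no lenses
presbyopic hyper no normal soft
presbyopic hyper yes reduced no lenses
presbyopic hyper yes normal no lenses"""

# the first four tokens of each row are features; the remainder
# ('no lenses', 'soft', 'hard') is the class label, which may contain a space
lenses = []
for line in lensesText.splitlines():
    tokens = line.split()
    lenses.append(tokens[:4] + [' '.join(tokens[4:])])

lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
lensesTree = createTree(lenses, lensesLabels)
print(lensesTree)   # the root split is on 'tearRate'
```

The root of the resulting tree is the tear production rate: a reduced tear rate always leads to 'no lenses', so that feature has by far the largest information gain.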