The essential difference between ID3 and the C4.5/CART decision trees: ID3 chooses the split that maximizes information gain.
Information gain = entropy of the original dataset - weighted sum of the entropies of the subsets produced by the split.
The Python code for computing the entropy is as follows:
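Written out in standard notation (this is only a restatement of the line above, where $p_k$ is the proportion of class $k$ in dataset $D$ and $D_v$ is the subset in which feature $a$ takes value $v$):

$$\mathrm{Ent}(D) = -\sum_{k} p_k \log_2 p_k, \qquad \mathrm{Gain}(D,a) = \mathrm{Ent}(D) - \sum_{v} \frac{|D_v|}{|D|}\,\mathrm{Ent}(D_v)$$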
from math import log

def calcShannonEnt(dataSet):
    # Shannon entropy of the class labels (the last element of each row).
    numEntries = len(dataSet)
    labelCounts = {}
    for featVect in dataSet:
        label = featVect[-1]
        if label not in labelCounts.keys():
            labelCounts[label] = 0
        labelCounts[label] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob*log(prob,2)
    return shannonEnt
def createDataSet():
    # Toy dataset: two binary features ('no surfacing', 'flippers') plus a class label.
    dataSet = [[1,1,'yes'],
               [1,1,'yes'],
               [1,0,'no'],
               [0,1,'no'],
               [0,1,'no']]
    labels = ['no surfacing','flippers']
    return dataSet,labels
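As a quick sanity check (the variable names below are just for illustration): the toy dataset has two 'yes' and three 'no' labels, so its entropy is -(2/5)*log2(2/5) - (3/5)*log2(3/5), roughly 0.971.

myDat, labels = createDataSet()
print(calcShannonEnt(myDat))   # about 0.9710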
Find the feature that maximizes the information gain, then split the dataset on that feature.

def splitDataSet(dataSet,axis,value):
    # Return the rows whose feature `axis` equals `value`, with that feature column removed.
    retDataSet = []
    for featVect in dataSet:
        if featVect[axis] == value:
            reducedFeatVect = featVect[:axis]
            reducedFeatVect.extend(featVect[axis+1:])
            retDataSet.append(reducedFeatVect)
    return retDataSet
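For example, splitting the toy dataset on feature 0 keeps only the matching rows and drops that column:

myDat, labels = createDataSet()
print(splitDataSet(myDat, 0, 1))   # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(splitDataSet(myDat, 0, 0))   # [[1, 'no'], [1, 'no']]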
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0])-1          # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = -9999
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet,i,value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob*calcShannonEnt(subDataSet)   # weighted sum of subset entropies
        infoGain = baseEntropy - newEntropy
        if (infoGain > bestInfoGain):
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
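On the toy dataset this returns feature 0 ('no surfacing'): its information gain is about 0.420, versus about 0.171 for 'flippers'.

myDat, labels = createDataSet()
print(chooseBestFeatureToSplit(myDat))   # 0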
How it works: start from the original dataset and split it on the best feature. Since a feature can take more than two values, a split may produce more than two branches. After the first split, the data is passed down to the next node on each branch, where it can be split again; the dataset is therefore processed recursively.
Termination conditions for the recursion: either all features available for splitting have been used up, or all instances in a branch belong to the same class.
import operator

def majorityCnt(classList):
    # Return the class label that occurs most often in classList.
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(),key = operator.itemgetter(1),reverse = True)
    return sortedClassCount[0][0]
def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]                 # all instances have the same class
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)       # no features left, fall back to majority vote
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    print(bestFeat,bestFeatLabel)           # debug output
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])                   # note: this mutates the caller's labels list
    featVals = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featVals)
    for value in uniqueVals:
        subLabels = labels[:]               # copy so recursion on one branch does not affect the others
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet,bestFeat,value),subLabels)
    return myTree
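On the toy dataset the result is the nested dictionary below (key order may vary). Because createTree deletes entries from the labels list passed to it, pass a copy if the original list is still needed:

myDat, labels = createDataSet()
myTree = createTree(myDat, labels[:])
print(myTree)   # {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}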
def classify(inputTree,featLabels,testVect):
    # Walk down the tree until a leaf (a class label) is reached.
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    classLabel = None                        # stays None if the feature value matches no branch
    for key in secondDict.keys():
        if testVect[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':
                classLabel = classify(secondDict[key],featLabels,testVect)
            else:
                classLabel = secondDict[key]
    return classLabel
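Classification takes the full feature-label list from createDataSet (not a list already mutated by createTree):

myDat, labels = createDataSet()
print(classify(myTree, labels, [1, 0]))   # 'no'
print(classify(myTree, labels, [1, 1]))   # 'yes'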
Using Python's pickle module, the tree object can be serialized and saved to disk, so the decision tree classifier is persisted.

def storeTree(inputTree,filename):
    # Serialize the tree to disk with pickle.
    import pickle
    with open(filename,'wb') as fw:
        pickle.dump(inputTree,fw)

def grabTree(filename):
    # Load a previously stored tree.
    import pickle
    with open(filename,'rb') as fr:
        return pickle.load(fr)
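For example (the filename is arbitrary):

storeTree(myTree, 'classifierStorage.txt')
print(grabTree('classifierStorage.txt'))   # the same nested dict as myTree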
Overfitting: to reduce overfitting we can prune the decision tree (both pre-pruning and post-pruning), removing unnecessary leaf nodes. If a leaf contributes only a little information, it can be cut and merged into a neighboring leaf (i.e., merge adjacent leaves that fail to produce a large information gain).
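One common concrete form of the idea above is reduced-error post-pruning against a held-out validation set. The sketch below is only an illustration built on the functions in this note; prune and getLeafClasses are hypothetical helpers, not part of the original code. A call such as prune(myTree, validationData, labels) would return a possibly smaller tree.

def getLeafClasses(tree):
    # Collect the class labels of every leaf below this subtree.
    classes = []
    for child in tree[list(tree.keys())[0]].values():
        if isinstance(child, dict):
            classes.extend(getLeafClasses(child))
        else:
            classes.append(child)
    return classes

def prune(tree, valData, featLabels):
    # Reduced-error pruning sketch: prune the children bottom-up, then decide
    # whether replacing this subtree by a single majority-class leaf does at
    # least as well on the validation rows that reach this node.
    firstStr = list(tree.keys())[0]
    featIndex = featLabels.index(firstStr)
    for value, child in list(tree[firstStr].items()):
        if isinstance(child, dict):
            branchData = [row for row in valData if row[featIndex] == value]
            tree[firstStr][value] = prune(child, branchData, featLabels)
    if not valData:
        return tree
    majority = majorityCnt(getLeafClasses(tree))   # crude majority, counted over leaves
    subtreeErr = sum(1 for row in valData
                     if classify(tree, featLabels, row) != row[-1])
    leafErr = sum(1 for row in valData if row[-1] != majority)
    return majority if leafErr <= subtreeErr else tree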
Drawbacks of ID3: it cannot handle numeric features directly, so they must be discretized in advance, and it is biased toward features with many distinct values.
In practice, the most popular algorithms are the C4.5 and CART decision trees.
C4.5 decision tree: it chooses splits by the information gain ratio, which is biased toward features with fewer distinct values.
CART decision tree: it uses the Gini index to choose the splitting feature.
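As a rough illustration of these two criteria, the sketch below reuses calcShannonEnt and splitDataSet from above; calcGini and gainRatio are illustrative names, not taken from any C4.5/CART reference implementation.

def calcGini(dataSet):
    # Gini index: 1 minus the sum of squared class probabilities.
    numEntries = len(dataSet)
    labelCounts = {}
    for featVect in dataSet:
        label = featVect[-1]
        labelCounts[label] = labelCounts.get(label, 0) + 1
    return 1.0 - sum((count / float(numEntries)) ** 2 for count in labelCounts.values())

def gainRatio(dataSet, axis):
    # C4.5-style gain ratio = information gain / intrinsic value of the feature.
    baseEntropy = calcShannonEnt(dataSet)
    newEntropy, intrinsicValue = 0.0, 0.0
    for value in set(example[axis] for example in dataSet):
        subDataSet = splitDataSet(dataSet, axis, value)
        prob = len(subDataSet) / float(len(dataSet))
        newEntropy += prob * calcShannonEnt(subDataSet)
        intrinsicValue -= prob * log(prob, 2)
    infoGain = baseEntropy - newEntropy
    return infoGain / intrinsicValue if intrinsicValue > 0 else 0.0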