The previous post, 机器学习算法–决策树ID3–python实现 (Machine Learning Algorithms: Decision Tree ID3, a Python Implementation), covered the basic concepts of decision trees and the classic ID3 algorithm.
In that post I concentrated on the algorithm and the underlying concepts and did not discuss ID3's shortcomings. In fact, ID3 has two notable flaws:
1. The information-gain criterion is biased toward attributes with many possible values.
For example, if we add an index column ([序号]) to the original data and run ID3, the index is chosen as the best attribute and split on first. Common sense, however, tells us that the index has nothing to do with the class of a sample.
2. ID3 can only handle discrete attributes; it has no way to deal with continuous values such as height, weight, age, or salary, which can take on an unlimited number of values.
The well-known C4.5 algorithm was proposed to fix these two problems.
Instead of selecting the best splitting attribute with information gain directly, C4.5 uses the gain ratio, defined as:
$$Gain\_ratio(D,a)=\dfrac{Gain(D,a)}{IV(a)},$$
where
$$IV(a)=-\sum_{v=1}^{V}\dfrac{|D^v|}{|D|}\log_2\dfrac{|D^v|}{|D|}$$
is called the intrinsic value of attribute a. As the expressions show, Gain(D,a) is still the same information gain as in ID3; the crucial part is the IV(a) term: the more possible values attribute a has (i.e., the larger V is), the larger IV(a) tends to be, so the resulting Gain_ratio shrinks accordingly. This is how C4.5 counteracts the bias described above.
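To make the penalty concrete, here is a minimal standalone sketch (my own illustration, separate from the implementation below; the helper names entropy and gain_and_ratio are mine) that computes the information gain and the gain ratio of a column when it is treated as an ordinary discrete attribute, the way ID3 would, on the eight hair/voice samples used later in this post:
from math import log
from collections import Counter

def entropy(class_labels):
    # Shannon entropy of a list of class labels
    n = len(class_labels)
    return -sum(c / n * log(c / n, 2) for c in Counter(class_labels).values())

def gain_and_ratio(rows, col):
    # Information gain and gain ratio of column `col`; the class label is the last element of each row.
    class_labels = [r[-1] for r in rows]
    base = entropy(class_labels)
    n = len(rows)
    cond, iv = 0.0, 0.0
    for value, count in Counter(r[col] for r in rows).items():
        branch = [r[-1] for r in rows if r[col] == value]
        p = count / n
        cond += p * entropy(branch)   # weighted entropy of the children
        iv -= p * log(p, 2)           # intrinsic value IV(a)
    gain = base - cond
    return gain, gain / iv if iv else 0.0

rows = [[1, '长', '粗', '男'], [2, '短', '粗', '男'], [3, '短', '粗', '男'],
        [4, '长', '细', '女'], [5, '短', '细', '女'], [6, '短', '粗', '女'],
        [7, '长', '粗', '女'], [8, '长', '粗', '女']]
print(gain_and_ratio(rows, 0))  # index column: gain ≈ 0.954 (the maximum possible), but IV = 3, so the ratio drops to ≈ 0.318
print(gain_and_ratio(rows, 2))  # voice column: gain ≈ 0.204, IV ≈ 0.811, ratio ≈ 0.252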
To handle the second problem, continuous attributes, the implementation first has to tell numeric feature columns apart from discrete (string) ones. The type checks look like this:
if type(featVals[0]).__name__=='float' or type(featVals[0]).__name__=='int':
    ...
if type(dataSet[0][bestFeat]).__name__=='str':
    ...
*If you have already replaced your discrete values with integer codes beforehand, adjust these checks to fit your data.
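As an aside, the same test can be written with isinstance, which is the more idiomatic way to check types in Python and also avoids surprises with bool (a subclass of int). This is only an alternative sketch; the code below sticks with the type(...).__name__ comparisons:
def isContinuousFeature(value):
    # Numeric values (int or float, but not bool) are treated as continuous,
    # everything else (typically str) as discrete.
    return isinstance(value, (int, float)) and not isinstance(value, bool)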
Splitting on a continuous attribute needs its own helper, which keeps only the samples on one side of a threshold:
def splitContinuousDataSet(dataSet,i,value,direction):
    # Binary split on continuous feature i at the given threshold, with column i removed.
    # direction == 0: keep samples with feature i >  value
    # direction == 1: keep samples with feature i <= value
    subDataSet=[]
    for one in dataSet:
        if direction==0:
            if one[i]>value:
                reduceData=one[:i]
                reduceData.extend(one[i+1:])
                subDataSet.append(reduceData)
        if direction==1:
            if one[i]<=value:
                reduceData=one[:i]
                reduceData.extend(one[i+1:])
                subDataSet.append(reduceData)
    return subDataSet
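A quick check of what this returns (toy rows of my own, not the article's dataset):
rows = [[1, '长', '男'], [2, '短', '男'], [3, '短', '女'], [4, '长', '女']]
print(splitContinuousDataSet(rows, 0, 2.5, 0))   # feature 0 >  2.5 -> [['短', '女'], ['长', '女']]
print(splitContinuousDataSet(rows, 0, 2.5, 1))   # feature 0 <= 2.5 -> [['长', '男'], ['短', '男']]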
Because the attribute is continuous, every sample may well have a distinct value, so creating one branch per value, as we do for discrete attributes, is neither practical nor meaningful. Instead, we pick one (or two) cut points among the attribute's values and split the samples into those greater than (>) the cut point and those less than or equal to (<=) it. The candidate cut points used below are the midpoints between adjacent values after sorting; for example, the sorted values 60, 70, 75 yield the candidates 65 and 72.5.
bestSplitDic={}
...
sortedFeatVals=sorted(featVals)
splitList=[]
# candidate thresholds: midpoints between adjacent sorted values
for j in range(len(featVals)-1):
    splitList.append((sortedFeatVals[j]+sortedFeatVals[j+1])/2.0)
for j in range(len(splitList)):
    newEntropy=0.0
    gainRatio=0.0
    splitInfo=0.0
    value=splitList[j]
    subDataSet0=splitContinuousDataSet(dataSet,i,value,0)   # samples with feature i >  value
    subDataSet1=splitContinuousDataSet(dataSet,i,value,1)   # samples with feature i <= value
    if len(subDataSet0)==0 or len(subDataSet1)==0:
        # degenerate threshold (possible when values repeat): skip it
        continue
    prob0=float(len(subDataSet0))/len(dataSet)
    newEntropy+=prob0*calcShannonEntropy(subDataSet0)       # weighted entropy of the two children
    prob1=float(len(subDataSet1))/len(dataSet)
    newEntropy+=prob1*calcShannonEntropy(subDataSet1)
    splitInfo-=prob0*log(prob0,2)                           # intrinsic value IV(a) of this binary split
    splitInfo-=prob1*log(prob1,2)
    gainRatio=float(baseEntropy-newEntropy)/splitInfo
    print('IVa '+str(j)+':'+str(splitInfo))
    if gainRatio>baseGainRatio:
        baseGainRatio=gainRatio
        bestSplit=j
        bestFeat=i
bestSplitDic[labels[i]]=splitList[bestSplit]
Once the best continuous attribute and its threshold have been found, the tree grows two branches, one for the samples above the threshold and one for the rest:
myTree[labels[bestFeat]]['>' + str(value)] = createTree(greaterDataSet, subLabels)
print(myTree)
print('== ' * len(dataSet[0]))
myTree[labels[bestFeat]]['<=' + str(value)] = createTree(smallerDataSet, subLabels)
Input: training set D = {(x1,y1),(x2,y2),…,(xm,ym)};
       attribute set A = {a1,a2,a3,…,ad};
Process: function CreateTree(D,A)
1. generate node node;
2. if all samples in D belong to the same class C, mark node as a class-C leaf and return;
3. if A is empty, mark node as a leaf labelled with the majority class in D and return;
4. select the best splitting attribute a* from A by gain ratio;
5. for each branch of a* (one branch per value for a discrete attribute; the '>' and '<=' sides of the best threshold for a continuous one), call CreateTree recursively on the corresponding subset of D with a* removed from A and attach the returned subtree to node;
6. return node.
Now that the principle of the algorithm has been covered, let's look at the concrete Python implementation.
from math import log
import operator
def createDataSet():
    # The toy dataset from the ID3 article, with an integer index column added
    # so that the continuous-value handling also gets exercised.
    dataSet = [[1,'长', '粗', '男'],
               [2,'短', '粗', '男'],
               [3,'短', '粗', '男'],
               [4,'长', '细', '女'],
               [5,'短', '细', '女'],
               [6,'短', '粗', '女'],
               [7,'长', '粗', '女'],
               [8,'长', '粗', '女']]
    labels = ['序号','头发', '声音']  # the index column plus the two original features
    return dataSet, labels
def classCount(dataSet):
    # Count how many samples of each class (last column) appear in dataSet.
    labelCount={}
    for one in dataSet:
        if one[-1] not in labelCount.keys():
            labelCount[one[-1]]=0
        labelCount[one[-1]]+=1
    return labelCount
def calcShannonEntropy(dataSet):
    # Shannon entropy of the class distribution of dataSet.
    labelCount=classCount(dataSet)
    numEntries=len(dataSet)
    Entropy=0.0
    for i in labelCount:
        prob=float(labelCount[i])/numEntries
        Entropy-=prob*log(prob,2)
    return Entropy
def majorityClass(dataSet):
    # The most frequent class in dataSet (used when no attributes are left to split on).
    labelCount=classCount(dataSet)
    sortedLabelCount=sorted(labelCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedLabelCount[0][0]
def splitDataSet(dataSet,i,value):
    # Samples whose discrete feature i equals value, with column i removed.
    subDataSet=[]
    for one in dataSet:
        if one[i]==value:
            reduceData=one[:i]
            reduceData.extend(one[i+1:])
            subDataSet.append(reduceData)
    return subDataSet
def splitContinuousDataSet(dataSet,i,value,direction):
    # Binary split on continuous feature i at the given threshold, with column i removed.
    # direction == 0: keep samples with feature i >  value
    # direction == 1: keep samples with feature i <= value
    subDataSet=[]
    for one in dataSet:
        if direction==0:
            if one[i]>value:
                reduceData=one[:i]
                reduceData.extend(one[i+1:])
                subDataSet.append(reduceData)
        if direction==1:
            if one[i]<=value:
                reduceData=one[:i]
                reduceData.extend(one[i+1:])
                subDataSet.append(reduceData)
    return subDataSet
def chooseBestFeat(dataSet,labels):
    # Pick the attribute (and, for a continuous attribute, the threshold) with the highest gain ratio.
    baseEntropy=calcShannonEntropy(dataSet)
    bestFeat=0
    baseGainRatio=-1
    numFeats=len(dataSet[0])-1
    bestSplitDic={}
    print('dataSet[0]:' + str(dataSet[0]))
    for i in range(numFeats):
        featVals=[example[i] for example in dataSet]
        #print('chooseBestFeat:'+str(i))
        if type(featVals[0]).__name__=='float' or type(featVals[0]).__name__=='int':
            # Continuous attribute: candidate thresholds are the midpoints of adjacent sorted values.
            sortedFeatVals=sorted(featVals)
            splitList=[]
            for j in range(len(featVals)-1):
                splitList.append((sortedFeatVals[j]+sortedFeatVals[j+1])/2.0)
            for j in range(len(splitList)):
                newEntropy=0.0
                gainRatio=0.0
                splitInfo=0.0
                value=splitList[j]
                subDataSet0=splitContinuousDataSet(dataSet,i,value,0)
                subDataSet1=splitContinuousDataSet(dataSet,i,value,1)
                if len(subDataSet0)==0 or len(subDataSet1)==0:
                    # degenerate threshold (possible when values repeat): skip it
                    continue
                prob0=float(len(subDataSet0))/len(dataSet)
                newEntropy+=prob0*calcShannonEntropy(subDataSet0)
                prob1=float(len(subDataSet1))/len(dataSet)
                newEntropy+=prob1*calcShannonEntropy(subDataSet1)
                splitInfo-=prob0*log(prob0,2)
                splitInfo-=prob1*log(prob1,2)
                gainRatio=float(baseEntropy-newEntropy)/splitInfo
                print('IVa '+str(j)+':'+str(splitInfo))
                if gainRatio>baseGainRatio:
                    baseGainRatio=gainRatio
                    bestSplit=j
                    bestFeat=i
            bestSplitDic[labels[i]]=splitList[bestSplit]
        else:
            # Discrete attribute: one branch per distinct value.
            uniqueFeatVals=set(featVals)
            splitInfo=0.0
            newEntropy=0.0
            for value in uniqueFeatVals:
                subDataSet=splitDataSet(dataSet,i,value)
                prob=float(len(subDataSet))/len(dataSet)
                splitInfo-=prob*log(prob,2)
                newEntropy+=prob*calcShannonEntropy(subDataSet)
            if splitInfo==0.0:
                # only one distinct value here: the attribute is useless for splitting
                continue
            gainRatio=float(baseEntropy-newEntropy)/splitInfo
            if gainRatio > baseGainRatio:
                bestFeat = i
                baseGainRatio = gainRatio
    if type(dataSet[0][bestFeat]).__name__=='float' or type(dataSet[0][bestFeat]).__name__=='int':
        bestFeatValue=bestSplitDic[labels[bestFeat]]
    if type(dataSet[0][bestFeat]).__name__=='str':
        bestFeatValue=labels[bestFeat]
    return bestFeat,bestFeatValue
def createTree(dataSet,labels):
    classList=[example[-1] for example in dataSet]
    if len(set(classList))==1:
        # all samples share one class: return it as a leaf
        return classList[0]
    if len(dataSet[0])==1:
        # no attributes left: return the majority class
        return majorityClass(dataSet)
    bestFeat,bestFeatLabel=chooseBestFeat(dataSet,labels)
    print('bestFeat:'+str(bestFeat)+'--'+str(labels[bestFeat])+', bestFeatLabel:'+str(bestFeatLabel))
    myTree={labels[bestFeat]:{}}
    subLabels = labels[:bestFeat]
    subLabels.extend(labels[bestFeat+1:])
    print('subLabels:'+str(subLabels))
    if type(dataSet[0][bestFeat]).__name__=='str':
        # discrete attribute: one subtree per distinct value
        featVals = [example[bestFeat] for example in dataSet]
        uniqueVals = set(featVals)
        print('uniqueVals:' + str(uniqueVals))
        for value in uniqueVals:
            reduceDataSet=splitDataSet(dataSet,bestFeat,value)
            print('reduceDataSet:'+str(reduceDataSet))
            myTree[labels[bestFeat]][value]=createTree(reduceDataSet,subLabels)
    if type(dataSet[0][bestFeat]).__name__=='int' or type(dataSet[0][bestFeat]).__name__=='float':
        # continuous attribute: one subtree for '>' threshold and one for '<='
        value=bestFeatLabel
        greaterDataSet=splitContinuousDataSet(dataSet,bestFeat,value,0)
        smallerDataSet=splitContinuousDataSet(dataSet,bestFeat,value,1)
        print('greaterDataset:' + str(greaterDataSet))
        print('smallerDataSet:' + str(smallerDataSet))
        print('== ' * len(dataSet[0]))
        myTree[labels[bestFeat]]['>' + str(value)] = createTree(greaterDataSet, subLabels)
        print(myTree)
        print('== ' * len(dataSet[0]))
        myTree[labels[bestFeat]]['<=' + str(value)] = createTree(smallerDataSet, subLabels)
    return myTree
if __name__ == '__main__':
    dataSet,labels=createDataSet()
    print(createTree(dataSet,labels))
*The code above is sprinkled with print() calls. Printing these seemingly redundant intermediate values makes it easier to follow the whole execution. I hesitated while writing this post, but in the end decided to keep them; if you find them superfluous, just delete them. :/
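If you want to exercise the continuous-value branch on something more realistic than the index column, one possibility (a purely hypothetical extension of mine, not part of the original post) is to swap the index for a made-up height column and rerun the script:
def createHeightDataSet():
    # Hypothetical variant of createDataSet(): the index column is replaced by a
    # continuous height value (cm) so the '>' / '<=' splitting is exercised on a
    # feature that is genuinely continuous. The heights are invented for illustration.
    dataSet = [[180.0,'长', '粗', '男'],
               [178.0,'短', '粗', '男'],
               [175.0,'短', '粗', '男'],
               [165.0,'长', '细', '女'],
               [160.0,'短', '细', '女'],
               [163.0,'短', '粗', '女'],
               [158.0,'长', '粗', '女'],
               [170.0,'长', '粗', '女']]
    labels = ['身高','头发', '声音']
    return dataSet, labels

dataSet, labels = createHeightDataSet()
print(createTree(dataSet, labels))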