Machine Learning Algorithms | Decision Tree C4.5 -- Python Implementation

1. The C4.5 Algorithm

The previous article, Machine Learning Algorithms -- Decision Tree ID3 -- Python Implementation, covered the basic concepts of decision trees and the classic ID3 algorithm.
That article focused on introducing the algorithm and the underlying concepts and did not discuss ID3's shortcomings. In fact, ID3 suffers from two problems:
1. The information-gain criterion is biased toward attributes with many possible values.

For example, if we add a "serial number" column to the original data and run ID3, we find that the serial number is chosen as the best attribute and split on first, even though common sense tells us a sample's serial number has nothing to do with its class.

2. It can only handle discrete attributes; it has no way of dealing with continuous values such as height, weight, age, or salary, which can take an unbounded number of values.

The well-known C4.5 algorithm was proposed to fix these two problems.

1.1 Handling interference from attributes like "serial number"

Instead of choosing the splitting attribute by information gain directly, the C4.5 decision tree algorithm uses the gain ratio, defined as

$Gain\_ratio(D,a)=\dfrac{Gain(D,a)}{IV(a)},$

where

$IV(a)=-\sum_{v=1}^{V}\dfrac{|D^v|}{|D|}\log_2\dfrac{|D^v|}{|D|}$

is called the intrinsic value of attribute a. Gain(D,a) is still the information gain, exactly as in ID3; the key is the IV(a) term: the more possible values attribute a has (i.e., the larger V is), the larger IV(a) tends to be, so the resulting Gain_ratio is correspondingly smaller. This counteracts the bias described above.
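To make the effect concrete, here is a small self-contained sketch (illustrative only, not part of the implementation below; the helper names entropy and gain_ratio are my own) that computes IV(a) and the gain ratio for one attribute column. For a "serial number" style attribute whose values are all distinct, IV(a) = log2(m), which heavily penalizes its otherwise maximal information gain:

from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    # information gain of the split divided by the intrinsic value IV(a)
    total = len(labels)
    cond_entropy, iv = 0.0, 0.0
    for v in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == v]
        p = len(subset) / total
        cond_entropy += p * entropy(subset)   # conditional entropy after the split
        iv -= p * log2(p)                     # intrinsic value IV(a)
    return (entropy(labels) - cond_entropy) / iv if iv > 0 else 0.0

# a "serial number" style attribute: every value unique, so IV = log2(8) = 3
print(gain_ratio(list(range(8)), ['m', 'm', 'm', 'f', 'f', 'f', 'f', 'f']))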

1.2 Adding support for continuous attributes
  • First, before processing each attribute, check whether its values are characters/strings (which means the attribute is discrete) or ints/floats (which means the attribute is numeric and continuous); the two kinds of values have to be handled separately.
if type(featVals[0]).__name__=='float' or type(featVals[0]).__name__=='int':   # numeric -> treat as continuous
...
if type(dataSet[0][bestFeat]).__name__=='str':   # string -> treat as discrete
...

*If you have already replaced the discrete values with integer encodings beforehand, adjust the type checks above to suit your data.
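A minimal sketch of one way to do that (the categorical_cols set below is a hypothetical configuration, not something this article's code defines): keep an explicit set of column indices that should stay discrete even though their values are numbers, and treat everything else numeric as continuous.

# hypothetical adaptation: columns listed in categorical_cols stay discrete
# even though they hold integers; other numeric columns are treated as continuous
categorical_cols = {1, 2}   # example indices of integer-encoded discrete columns

def is_continuous(dataSet, i):
    return i not in categorical_cols and isinstance(dataSet[0][i], (int, float))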

  • Next, add a function that splits the data set on a continuous attribute:
def splitContinuousDataSet(dataSet,i,value,direction):
    # direction 0: keep the samples whose feature i is >  value
    # direction 1: keep the samples whose feature i is <= value
    # column i is removed from the returned rows
    subDataSet=[]
    for one in dataSet:
        if direction==0:
            if one[i]>value:
                reduceData=one[:i]
                reduceData.extend(one[i+1:])
                subDataSet.append(reduceData)
        if direction==1:
            if one[i]<=value:
                reduceData=one[:i]
                reduceData.extend(one[i+1:])
                subDataSet.append(reduceData)
    return subDataSet
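A quick usage example (toy rows, not from this article's data set) showing how direction selects the two sides of the threshold and that the split column is dropped from the returned rows:

# column 0 is a continuous attribute (e.g. height in metres)
rows = [[1.62, 'short', 'female'], [1.80, 'long', 'male'], [1.75, 'short', 'male']]
print(splitContinuousDataSet(rows, 0, 1.70, 0))  # > 1.70  -> [['long', 'male'], ['short', 'male']]
print(splitContinuousDataSet(rows, 0, 1.70, 1))  # <= 1.70 -> [['short', 'female']]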

Because the attribute is continuous, every sample may well have a distinct value, so creating one branch per value (as we do for discrete attributes) is neither practical nor meaningful. Instead, we pick candidate split points among the attribute's values (the midpoints between adjacent sorted values) and split the samples into those greater than (>) the chosen point and those less than or equal to (<=) it.

  • Then, extend the return value of the best-attribute-selection module with the concrete split point. For a numeric attribute the return value is no longer just a column index; we also need to know which value we split above or below. So we build a dictionary that stores, for every numeric attribute, the value of its best split point. The gain ratio also has to be computed for every candidate split point, so that the best one among them can be found and returned.
bestSplitDic={}
...
    sortedFeatVals=sorted(featVals)
    splitList=[]
    # candidate thresholds: the midpoints of adjacent sorted values
    for j in range(len(featVals)-1):
        splitList.append((sortedFeatVals[j]+sortedFeatVals[j+1])/2.0)
    for j in range(len(splitList)):
        newEntropy=0.0
        splitInfo=0.0
        value=splitList[j]
        subDataSet0=splitContinuousDataSet(dataSet,i,value,0)
        subDataSet1=splitContinuousDataSet(dataSet,i,value,1)
        prob0=float(len(subDataSet0))/len(dataSet)
        prob1=float(len(subDataSet1))/len(dataSet)
        if prob0==0 or prob1==0:
            continue   # every sample fell on one side; skip this threshold
        # conditional entropy of the binary split (added up, then subtracted from baseEntropy)
        newEntropy+=prob0*calcShannonEntropy(subDataSet0)
        newEntropy+=prob1*calcShannonEntropy(subDataSet1)
        # intrinsic value IV(a) of the binary split
        splitInfo-=prob0*log(prob0,2)
        splitInfo-=prob1*log(prob1,2)
        gainRatio=float(baseEntropy-newEntropy)/splitInfo
        print('IVa '+str(j)+':'+str(splitInfo))
        if gainRatio>baseGainRatio:
            baseGainRatio=gainRatio
            bestFeat=i
            bestSplitDic[labels[i]]=value   # remember the winning threshold for this attribute
  • Finally, in the tree-building module, the branch labels for a numeric split need to carry the split point, e.g. "> x.xxx" and "<= x.xxx". The key lines:
myTree[labels[bestFeat]]['>' + str(value)] = createTree(greaterDataSet, subLabels)
print(myTree)
print('== ' * len(dataSet[0]))
myTree[labels[bestFeat]]['<=' + str(value)] = createTree(smallerDataSet, subLabels)
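So for a numeric attribute, the subtree stored in myTree ends up with a shape like the following (the attribute name and threshold are placeholders, not actual program output):

# illustrative shape only -- not real output
# {'someNumericAttr': {'>2.5':  <subtree or class label>,
#                      '<=2.5': <subtree or class label>}}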

2. Python Implementation of C4.5

2.1 Approach

Input:  training set D = {(x1,y1),(x2,y2),…,(xm,ym)};
        attribute set A = {a1,a2,…,ad}
Process: function CreateTree(D, A)
 1. create node;
 2. if all samples in D belong to the same class C then
 3.     mark node as a leaf of class C; return
 4. end if
 5. if A = ∅ OR all samples in D take the same values on A then
 6.     mark node as a leaf labelled with the majority class in D; return
 7. end if
 8. select the best splitting attribute a* from A;
 9. for each value a*_v of a* do
10.     generate a branch for node; let D_v be the subset of samples in D taking value a*_v on a*;
11.     if D_v is empty then
12.         mark the branch node as a leaf labelled with the majority class in D; return
13.     else
14.         use CreateTree(D_v, A \ {a*}) as the branch node
15.     end if
16. end for
Output: a decision tree rooted at node
2.2 The Code

Now that the principle of the algorithm has been covered, let's look at the concrete Python implementation.

from math import log
import operator

def createDataSet():
    # toy data set: serial number, hair length, voice pitch -> gender
    dataSet = [[1, 'long',  'deep', 'male'],
               [2, 'short', 'deep', 'male'],
               [3, 'short', 'deep', 'male'],
               [4, 'long',  'high', 'female'],
               [5, 'short', 'high', 'female'],
               [6, 'short', 'deep', 'female'],
               [7, 'long',  'deep', 'female'],
               [8, 'long',  'deep', 'female']]
    labels = ['serial number', 'hair', 'voice']  # three features; the last column of each row is the class
    return dataSet, labels

def classCount(dataSet):
    labelCount={}
    for one in dataSet:
        if one[-1] not in labelCount.keys():
            labelCount[one[-1]]=0
        labelCount[one[-1]]+=1
    return labelCount

def calcShannonEntropy(dataSet):
    labelCount=classCount(dataSet)
    numEntries=len(dataSet)
    Entropy=0.0
    for i in labelCount:
        prob=float(labelCount[i])/numEntries
        Entropy-=prob*log(prob,2)
    return Entropy

def majorityClass(dataSet):
    labelCount=classCount(dataSet)
    sortedLabelCount=sorted(labelCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedLabelCount[0][0]

def splitDataSet(dataSet,i,value):
    # return the samples whose i-th feature equals value, with column i removed
    subDataSet=[]
    for one in dataSet:
        if one[i]==value:
            reduceData=one[:i]
            reduceData.extend(one[i+1:])
            subDataSet.append(reduceData)
    return subDataSet

def splitContinuousDataSet(dataSet,i,value,direction):
    # direction 0: keep the samples whose feature i is >  value
    # direction 1: keep the samples whose feature i is <= value
    # column i is removed from the returned rows
    subDataSet=[]
    for one in dataSet:
        if direction==0:
            if one[i]>value:
                reduceData=one[:i]
                reduceData.extend(one[i+1:])
                subDataSet.append(reduceData)
        if direction==1:
            if one[i]<=value:
                reduceData=one[:i]
                reduceData.extend(one[i+1:])
                subDataSet.append(reduceData)
    return subDataSet

def chooseBestFeat(dataSet,labels):
    # choose the attribute with the highest gain ratio; for a numeric attribute
    # also determine the best split threshold
    baseEntropy=calcShannonEntropy(dataSet)
    bestFeat=0
    baseGainRatio=-1
    numFeats=len(dataSet[0])-1
    bestSplitDic={}   # best threshold found so far for each numeric attribute
    print('dataSet[0]:' + str(dataSet[0]))
    for i in range(numFeats):
        featVals=[example[i] for example in dataSet]
        #print('chooseBestFeat:'+str(i))
        if type(featVals[0]).__name__=='float' or type(featVals[0]).__name__=='int':
            # numeric attribute: candidate thresholds are the midpoints of adjacent sorted values
            sortedFeatVals=sorted(featVals)
            splitList=[]
            for j in range(len(featVals)-1):
                splitList.append((sortedFeatVals[j]+sortedFeatVals[j+1])/2.0)
            for j in range(len(splitList)):
                newEntropy=0.0
                splitInfo=0.0
                value=splitList[j]
                subDataSet0=splitContinuousDataSet(dataSet,i,value,0)
                subDataSet1=splitContinuousDataSet(dataSet,i,value,1)
                prob0=float(len(subDataSet0))/len(dataSet)
                prob1=float(len(subDataSet1))/len(dataSet)
                if prob0==0 or prob1==0:
                    continue   # every sample fell on one side; skip this threshold
                # conditional entropy of the binary split
                newEntropy+=prob0*calcShannonEntropy(subDataSet0)
                newEntropy+=prob1*calcShannonEntropy(subDataSet1)
                # intrinsic value IV(a) of the binary split
                splitInfo-=prob0*log(prob0,2)
                splitInfo-=prob1*log(prob1,2)
                gainRatio=float(baseEntropy-newEntropy)/splitInfo
                print('IVa '+str(j)+':'+str(splitInfo))
                if gainRatio>baseGainRatio:
                    baseGainRatio=gainRatio
                    bestFeat=i
                    bestSplitDic[labels[i]]=value   # remember the winning threshold
        else:
            # discrete attribute: one branch per distinct value
            uniqueFeatVals=set(featVals)
            splitInfo=0.0
            newEntropy=0.0
            for value in uniqueFeatVals:
                subDataSet=splitDataSet(dataSet,i,value)
                prob=float(len(subDataSet))/len(dataSet)
                splitInfo-=prob*log(prob,2)
                newEntropy+=prob*calcShannonEntropy(subDataSet)
            if splitInfo==0:
                continue   # the attribute has a single distinct value and cannot split the data
            gainRatio=float(baseEntropy-newEntropy)/splitInfo
            if gainRatio > baseGainRatio:
                bestFeat = i
                baseGainRatio = gainRatio
    if type(dataSet[0][bestFeat]).__name__=='float' or type(dataSet[0][bestFeat]).__name__=='int':
        bestFeatValue=bestSplitDic[labels[bestFeat]]   # threshold of the winning numeric attribute
    if type(dataSet[0][bestFeat]).__name__=='str':
        bestFeatValue=labels[bestFeat]
    return bestFeat,bestFeatValue



def createTree(dataSet,labels):
    # recursively build the decision tree as nested dictionaries
    classList=[example[-1] for example in dataSet]
    if len(set(classList))==1:
        return classList[0]            # all samples share one class: return a leaf
    if len(dataSet[0])==1:
        return majorityClass(dataSet)  # no attributes left: return the majority class
    bestFeat,bestFeatLabel=chooseBestFeat(dataSet,labels)
    print('bestFeat:'+str(bestFeat)+'--'+str(labels[bestFeat])+', bestFeatLabel:'+str(bestFeatLabel))
    myTree={labels[bestFeat]:{}}
    subLabels = labels[:bestFeat]
    subLabels.extend(labels[bestFeat+1:])
    print('subLabels:'+str(subLabels))
    if type(dataSet[0][bestFeat]).__name__=='str':
        # discrete attribute: one branch per distinct value
        featVals = [example[bestFeat] for example in dataSet]
        uniqueVals = set(featVals)
        print('uniqueVals:' + str(uniqueVals))
        for value in uniqueVals:
            reduceDataSet=splitDataSet(dataSet,bestFeat,value)
            print('reduceDataSet:'+str(reduceDataSet))
            myTree[labels[bestFeat]][value]=createTree(reduceDataSet,subLabels)
    if type(dataSet[0][bestFeat]).__name__=='int' or type(dataSet[0][bestFeat]).__name__=='float':
        # numeric attribute: two branches, "> threshold" and "<= threshold"
        value=bestFeatLabel
        greaterDataSet=splitContinuousDataSet(dataSet,bestFeat,value,0)
        smallerDataSet=splitContinuousDataSet(dataSet,bestFeat,value,1)
        print('greaterDataset:' + str(greaterDataSet))
        print('smallerDataSet:' + str(smallerDataSet))
        print('== ' * len(dataSet[0]))
        myTree[labels[bestFeat]]['>' + str(value)] = createTree(greaterDataSet, subLabels)
        print(myTree)
        print('== ' * len(dataSet[0]))
        myTree[labels[bestFeat]]['<=' + str(value)] = createTree(smallerDataSet, subLabels)
    return myTree

if __name__ == '__main__':
    dataSet,labels=createDataSet()
    print(createTree(dataSet,labels))

*The code above contains a lot of print() calls. Printing these seemingly redundant intermediate values makes it easy to monitor the whole execution. I hesitated while writing this article, but in the end decided to keep them; if you find them unnecessary, just delete them yourself. :/
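The article stops once the tree has been built. As a usage sketch (not part of the original code; the classify function and its traversal logic are my own assumption about how the nested-dict tree would be consumed), a new sample could be classified by walking the dictionary returned by createTree and parsing the ">x.xxx" / "<=x.xxx" branch labels created for numeric attributes:

def classify(tree, labels, sample):
    # walk the nested dict produced by createTree; sample is a full feature row
    if not isinstance(tree, dict):
        return tree                              # reached a leaf: the class label
    featName = next(iter(tree))                  # attribute tested at this node
    featValue = sample[labels.index(featName)]
    for branchLabel, subTree in tree[featName].items():
        if isinstance(branchLabel, str) and branchLabel.startswith('>'):
            if featValue > float(branchLabel[1:]):
                return classify(subTree, labels, sample)
        elif isinstance(branchLabel, str) and branchLabel.startswith('<='):
            if featValue <= float(branchLabel[2:]):
                return classify(subTree, labels, sample)
        elif branchLabel == featValue:           # discrete branch
            return classify(subTree, labels, sample)
    return None                                  # no matching branch found

# hypothetical usage:
#   tree = createTree(dataSet, labels)
#   print(classify(tree, labels, [9, 'long', 'deep']))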
