Machine Learning in Action: Decision Trees


1. Constructing the Decision Tree

Before constructing a decision tree, the first problem we need to solve is which feature of the current dataset is most decisive for splitting the data. To find the best split, we must evaluate every feature.

Data is often split by binary partitioning, but this article uses the ID3 algorithm to split the dataset.

No.   Can survive without surfacing   Has flippers   Is a fish?
1     yes                             yes            yes
2     yes                             yes            yes
3     yes                             no             no
4     no                              yes            no
5     no                              yes            no


The table above contains data for five marine animals: two features plus a label classifying each animal as a fish or not a fish. We now have to decide whether to split the data on the first feature or the second.


2. Information Gain and Entropy

Before discussing how to split, let us first look at information gain: the change in information before and after splitting the dataset is called the information gain. The measure of information in a collection of data is called Shannon entropy, or simply entropy.

In 1948, Shannon introduced information entropy, defined in terms of the probabilities of discrete random events: the more ordered a system is, the lower its entropy, and the more disordered it is, the higher its entropy. Information entropy can therefore be regarded as a measure of how ordered a system is.

Suppose a random variable $X$ can take the values $x_1, x_2, \ldots, x_n$, and the probability of each value is $p_1, p_2, \ldots, p_n$ respectively. The entropy of $X$ is then defined as

$$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

The idea is that the more possible states a variable has, the more information it carries.

For a classification system, the class label $C$ is the variable. It takes the values $C_1, C_2, \ldots, C_n$, and each class occurs with probability

$$P(C_1), P(C_2), \ldots, P(C_n)$$

Here $n$ is the total number of classes, and the entropy of the classification system can then be written as

$$H(C) = -\sum_{i=1}^{n} P(C_i) \log_2 P(C_i)$$

For more details on entropy and information gain, see http://blog.csdn.net/acdreamers/article/details/44661149.
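As a quick check of this definition (my own worked example, not taken from the book): two of the five animals in the table are fish and three are not, so the entropy of the class label is

$$H = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} \approx 0.971$$

which is exactly the value that calShannonEnt below returns for the sample dataset. ID3 chooses the feature to split on by information gain; written out (again my own summary rather than a quotation from the book), splitting a dataset $D$ on feature $A$ into subsets $D_v$, one for each value $v$ of $A$, gives

$$\operatorname{Gain}(D, A) = H(D) - \sum_{v} \frac{|D_v|}{|D|}\, H(D_v)$$

and chooseBestFeatureToSplit below computes exactly this quantity as baseEntropy minus newEntropy.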


3. Implementing the Decision Tree in Python

from math import log
import operator

def calShannonEnt(dataSet):
    # compute the Shannon entropy of the class labels in dataSet
    numEntries = len(dataSet)
    labelCount = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]        # the class label is the last column
        if currentLabel not in labelCount:
            labelCount[currentLabel] = 0
        labelCount[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCount:
        prob = float(labelCount[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
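# Example (my own illustration, not from the book):
#   calShannonEnt([[1, 'yes'], [1, 'yes'], [0, 'no']])  ->  about 0.918 (two classes, 2/3 vs 1/3)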
 
 
def createDataSet():
    # the sample dataset from the table above: two features plus the class label
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels


def splitDataSet(dataSet, axis, value):
    # return the rows whose feature at position 'axis' equals 'value',
    # with that feature column removed
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
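# Example (my own illustration): with myDat, labels = createDataSet(),
#   splitDataSet(myDat, 0, 1)  ->  [[1, 'yes'], [1, 'yes'], [0, 'no']]
#   splitDataSet(myDat, 0, 0)  ->  [[1, 'no'], [1, 'no']]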
    
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      # the last column is used for the labels
    baseEntropy = calShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):           # iterate over all the features
        featList = [example[i] for example in dataSet]   # all values of this feature
        uniqueVals = set(featList)         # get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy    # calculate the info gain; i.e. reduction in entropy
        if infoGain > bestInfoGain:            # compare this to the best gain so far
            bestInfoGain = infoGain            # if better than current best, set to best
            bestFeature = i
    return bestFeature
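# Example (my own illustration): for the sample data, splitting on feature 0
# ('no surfacing') gives an information gain of about 0.420 versus about 0.171
# for feature 1, so chooseBestFeatureToSplit(myDat) returns 0.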
    
def majorityCnt(classList):
    # return the class that occurs most often in classList
    classCount = {}
    for vote in classList:
        if vote not in classCount: classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
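# Example (my own illustration):
#   majorityCnt(['yes', 'no', 'no'])  ->  'no'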

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]        # stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1:       # stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]      # copy labels so recursion doesn't mess up the existing list
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
    

4. Results
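Below is a minimal usage sketch (my own, based only on the code above) together with the tree it should produce for the sample data:

myDat, labels = createDataSet()
print(calShannonEnt(myDat))              # about 0.971
print(chooseBestFeatureToSplit(myDat))   # 0, i.e. 'no surfacing'
myTree = createTree(myDat, labels)
print(myTree)
# expected: {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}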

