Decision Trees: Splitting the Data Set

This article builds on the Shannon entropy calculation (calcShannonEnt) written in the previous post, and turns to splitting the data set.
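Since chooseBestFeatureToSplit later in this post calls calcShannonEnt, a minimal sketch of that helper is included here for reference. It follows the usual Machine Learning in Action style implementation and is only an assumption about what the earlier post contains:

from math import log

def calcShannonEnt(dataSet):
    # Count how often each class label (the last column) appears
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        labelCounts[currentLabel] = labelCounts.get(currentLabel, 0) + 1
    # Shannon entropy: -sum(p * log2(p)) over all class labels
    shannonEnt = 0.0
    for count in labelCounts.values():
        prob = count / float(numEntries)
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

With that helper in hand, first a simple split routine: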

 

def splitDataSet(dataSet, axis, value):
    # Return the rows whose feature at index `axis` equals `value`,
    # with that feature column removed from each returned row
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

This is the simplest possible split: build a new data set, append every matching row to it with the splitting feature removed, and return the result.
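As a quick illustration, here is how it behaves on a made-up toy data set (these rows are hypothetical and not taken from the original post):

myDat = [[1, 1, 'yes'],
         [1, 1, 'yes'],
         [1, 0, 'no'],
         [0, 1, 'no'],
         [0, 1, 'no']]

# Keep the rows where feature 0 equals 1, dropping that column
print(splitDataSet(myDat, 0, 1))   # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(splitDataSet(myDat, 0, 0))   # [[1, 'no'], [1, 'no']]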

 

Next, choosing the best way to split the data set. For each feature:

1. Collect the distinct values taken by that feature column.

2. Split the data set on each of those values and compute the entropy of each resulting subset.

Loop these steps over every feature and keep the feature whose split yields the largest information gain.
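Written as a formula, this is the standard ID3 information-gain criterion (the notation here is mine and does not appear in the original post):

Gain(D, A) = Ent(D) - \sum_{v \in Values(A)} \frac{|D_v|}{|D|} \, Ent(D_v)

Ent(D) is the Shannon entropy of the whole data set (baseEntropy in the code) and D_v is the subset of rows on which feature A takes the value v; the function below computes this gain for every feature and returns the index with the largest value.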

Here is the code for this step:

 

def chooseBestFeatureToSplit(dataSet):
    # Number of features; the last column holds the class label
    numberFeatures = len(dataSet[0]) - 1
    # Entropy of the whole data set, before any split
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numberFeatures):
        # All values taken by feature i, then the distinct ones
        featList = [example[i] for example in dataSet]
        print(featList)
        uniqueVals = set(featList)
        print(uniqueVals)
        # Weighted entropy of the subsets produced by splitting on feature i
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        # Information gain = base entropy minus the weighted entropy after the split
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
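A quick usage sketch, reusing the toy data set from the splitDataSet example above (again, made up purely for illustration):

print(chooseBestFeatureToSplit(myDat))

# On that data this should print 0: the rows with feature 0 equal to 0 are all
# labelled 'no', so splitting on feature 0 gives the larger information gain.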

The original post closes with a screenshot of the run results here.

 

 
