Decision trees are a common machine learning algorithm. This walkthrough follows the code from *Machine Learning in Action* (《机器学习实战》).
In decision tree learning, information entropy is used to measure the purity of a dataset.
Suppose the proportion of class-$k$ samples in the current dataset $D$ is $p_k$. The entropy of $D$ is then

$$Ent(D) = -\sum_{k=1}^{n} p_k \log_2 p_k$$
The smaller $Ent(D)$ is, the higher the purity of $D$.
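For intuition, consider a two-class set (my own toy numbers, not from the book). If every sample belongs to one class, the entropy reaches its minimum:

$$Ent(D) = -1 \cdot \log_2 1 = 0$$

while a 50/50 split reaches the maximum for two classes:

$$Ent(D) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$$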
Suppose the discrete attribute $a$ has $V$ possible values $\{a^1, a^2, \ldots, a^V\}$. Splitting the dataset $D$ on $a$ produces $V$ branch nodes, where the $v$-th node contains all samples in $D$ that take value $a^v$ on attribute $a$; we denote this subset $D^v$. We use the "information gain" to measure how good a split of $D$ on $a$ is:
$$Gain(D, a) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v)$$
In general, the larger the information gain, the greater the purity improvement obtained by splitting on attribute $a$. We can therefore use information gain to choose the splitting attribute for the decision tree: at each split, we pick
$$a_* = \arg\max_{a \in A} Gain(D, a)$$
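As a quick sanity check (a made-up toy split, not from the book): take a set $D$ with two samples of each of two classes, so $Ent(D) = 1$. If an attribute $a$ splits $D$ into $D^1$ (both positive) and $D^2$ (both negative), each subset is pure with entropy $0$, so

$$Gain(D, a) = 1 - \tfrac{2}{4}\cdot 0 - \tfrac{2}{4}\cdot 0 = 1$$

the largest gain possible here, and $a$ would be selected.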
Now let's build the decision tree. First we need a function that computes the information entropy of a given dataset:
```python
from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    # Count how many samples belong to each class label
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 1
        else:
            labelCounts[currentLabel] += 1
    # Accumulate -p * log2(p) over all class labels
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
```
Next we define a dataset and compute its entropy:
```python
def createDataSet():
    dataSet = [
        [1, 1, "yes"],
        [1, 1, "yes"],
        [1, 0, "no"],
        [0, 1, "no"],
        [0, 1, "no"]
    ]
    labels = ["no surfacing", "flippers"]
    return dataSet, labels

dataSet, labels = createDataSet()
print(dataSet)
print(labels)
print(calcShannonEnt(dataSet))
```
```
[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
['no surfacing', 'flippers']
0.9709505944546686
```
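Entropy grows as the data gets more mixed. A quick check (my own experiment, using the functions above): relabel the first sample with a hypothetical third class and recompute.

```python
# Hypothetical third class "maybe" to show entropy rising with more mixing
dataSet[0][-1] = "maybe"
print(calcShannonEnt(dataSet))  # ~1.3710, higher than 0.9710 above
dataSet[0][-1] = "yes"          # restore the original label
```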
Next, we need a way to split the dataset:
```python
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        # Keep only samples whose value on `axis` equals `value`,
        # and drop that column from each of them
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
```
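For example, splitting our dataset on the first feature (a quick check of my own, not from the book):

```python
print(splitDataSet(dataSet, 0, 1))  # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(splitDataSet(dataSet, 0, 0))  # [[1, 'no'], [1, 'no']]
```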
Using information gain, we select the best attribute to split on:
```python
def chooseBestFeatureToSplit(dataSet):
    # Number of attributes available for splitting (the last column is the class label)
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0  # tracks the largest information gain seen so far
    bestFeature = -1    # tracks the attribute achieving that gain
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        # Weighted entropy of the subsets produced by splitting on attribute i
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
```
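On our dataset the two gains work out to (my own hand computation):

$$Gain(D, \text{no surfacing}) = 0.971 - \tfrac{3}{5}\cdot 0.918 - \tfrac{2}{5}\cdot 0 \approx 0.420$$

$$Gain(D, \text{flippers}) = 0.971 - \tfrac{4}{5}\cdot 1.0 - \tfrac{1}{5}\cdot 0 \approx 0.171$$

so "no surfacing" (feature index 0) wins:

```python
print(chooseBestFeatureToSplit(dataSet))  # 0
```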
If a node must become a leaf but its samples still belong to more than one class, we settle the label by majority vote:
```python
import operator

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # Sort class counts in descending order and return the most common class
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
```
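A quick illustration (my own toy input):

```python
print(majorityCnt(["yes", "no", "no"]))  # no
```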
We build the decision tree recursively:
```python
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    # All samples share one class: stop splitting and return that class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # No attributes left (only the class column remains): return the majority class
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    # Choose the attribute with the largest information gain
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    # Create the tree node for this attribute
    myTree = {bestFeatLabel: {}}
    # Remove the chosen attribute so it is not reused deeper in the tree
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
```
Now we build a decision tree from the dataset defined earlier. Note that createTree deletes entries from the labels list as it goes, so we fetch fresh copies first:
```python
dataSet, labels = createDataSet()
myTree = createTree(dataSet, labels)

import json
print(json.dumps(myTree, indent=4))
```
```
{
    "no surfacing": {
        "0": "no",
        "1": {
            "flippers": {
                "0": "no",
                "1": "yes"
            }
        }
    }
}
```
This is our decision tree. The first attribute tested is "no surfacing": if its value is 0, the result is "no" immediately; if it is 1, we go on to test "flippers", where a value of 0 yields "no" and a value of 1 yields "yes". (The branch keys are actually the integers 0 and 1; json.dumps just renders dictionary keys as strings.)
Now let's write the code that uses the model to classify a new sample:
```python
def model(inputTree, featLabels, testVec):
    # The root of the (sub)tree is the attribute to test next
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            # An inner dict means another attribute test: recurse
            if isinstance(secondDict[key], dict):
                return model(secondDict[key], featLabels, testVec)
            # A leaf holds the final classification
            return secondDict[key]
```
```python
test = [0, 0]
label = ["no surfacing", "flippers"]
print(model(myTree, label, test))
```
```
no
```
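As one more check (my own extra test case), a sample with both features set should come out "yes":

```python
print(model(myTree, label, [1, 1]))  # yes
```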
And that wraps up this walkthrough.