A decision tree model is a collection of sequential tests arranged in a tree-and-branch structure, as shown in the figure below (example taken from the watermelon book, Zhou Zhihua's *Machine Learning*):
The colored nodes each represent a test (a set of if-then rules) and are usually called internal nodes (the topmost one is the root node); the white nodes sit at the ends of the tree, carry the class labels, and are usually called leaf nodes.
A feature is one dimension along which the problem is described. For example, when judging whether a melon is sweet, the judgment may rest on the melon's color, its stem, the sound it makes when tapped, and so on; each of these is a dimension of the problem, i.e., a feature of the "is the melon sweet" question.
For a decision tree model, the first questions are therefore: how do we choose the feature for the root node, and, once a feature is chosen, how do we split the data into subsets? What is the splitting rule?
Information theory uses entropy to measure the uncertainty of a random variable. For a classification problem, let the class label $Z$ be a discrete random variable taking finitely many values; its distribution can be estimated from the frequencies in the training set $D$:
$$P(Z=z_k)=p_k=\frac{\sum I(Z=z_k)}{N(D)},\qquad k=1,\dots,m$$
The information entropy is then defined as
$$Ent(D)=-\sum_{k=1}^{m}p_k\log p_k,\qquad D=\{Z=z_k,\ k=1,\dots,m\}$$
As one might expect, the entropy of $Z$ is a concave function of the probabilities $p_k$: when the entropy is maximal the random variable is maximally uncertain, and when the entropy is minimal the random variable becomes deterministic.
Take a Bernoulli random variable (i.e., a binary classification problem) as an example: $Ent(D)$ is largest when $p=0.5$, and $Ent(D)=0$ is smallest when $p=0$ or $p=1$. In other words, when half of the set is labeled +1 and half is labeled -1, the set is maximally uncertain, and for any new point falling into this set we have no basis at all for guessing whether it belongs to +1 or -1.
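For concreteness, here is the arithmetic for this Bernoulli case, taking the logarithm base 2 so that entropy is measured in bits (and using the convention $0\log 0=0$):
$$Ent(D)\big|_{p=0.5}=-(0.5\log_2 0.5+0.5\log_2 0.5)=1,\qquad Ent(D)\big|_{p=0}=Ent(D)\big|_{p=1}=0$$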
That is, as the entropy decreases, the uncertainty of the random variable decreases.
In a classification problem, the question becomes: once we know the splitting feature $x_i$ at the root node, how do we measure the overall uncertainty of the subsets produced by the split?
(Equivalently: when an event involves several random variables, how do we measure the remaining uncertainty after learning the value of one of them?)
Conditional entropy and the per-subset conditional entropy
Suppose the splitting feature is $x_i$, partitioning the training set $D$ into subsets $\{D_i\}$, and let the class label be $z_j$. For the pair of random variables $(X=x_i, Z=z_j)$, the conditional probability distribution is
$$P(Z=z_j\mid X=x_i)=p_{j|i}=\frac{p_{ij}}{p_i}=\frac{\sum I(X=x_i,Z=z_j)}{N}\Big/\frac{N(D_i)}{N},\qquad i=1,\dots,n;\ j=1,\dots,m$$
The problem now becomes: having observed the value of $X$, how do we assess the remaining randomness over the subsets $\{D_i\}$? By analogy with conditional probability distributions, this leads to the notion of the per-subset conditional entropy.
The per-subset conditional entropy is defined as:
$$Ent(D_i)=-\sum_{j=1}^{m}p_{j|i}\log p_{j|i},\qquad D_i=\{Z=z_j,\ j=1,\dots,m\mid X=x_i\}$$
The conditional entropy is then defined as:
$$Ent(\hat D)=\sum_{i=1}^{n}p_i\,Ent(D_i),\qquad \hat D=\{D_i=\{Z=z_j,\ j=1,\dots,m\mid X=x_i\},\ i=1,\dots,n\}$$
Clearly $Ent(\hat D)\le Ent(D)$: as we gain information about the random variable $X$, the uncertainty of the event decreases (equality holds only when $X$ tells us nothing about $Z$).
In practice, the probability distributions are estimated from data, so the entropies above are empirical (sample) entropies.
Decision trees are in fact grown (branched) on exactly this principle of entropy reduction.
The simplest entropy-reduction criterion is the information gain, defined as:
$$Gain(D,i)=Ent(D)-Ent(\hat D)=Ent(D)-\sum_{i=1}^{n}p_i\,Ent(D_i)$$
At each node, we compare the information gain of every candidate feature, pick the feature with the largest gain as the splitting feature of this node, and continue branching to grow the subtrees.
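As a hypothetical toy illustration: if $D$ holds 10 samples, 5 labeled +1 and 5 labeled -1, then $Ent(D)=1$ (base-2 logarithm). A feature that splits $D$ into two pure subsets of 5 samples each gives conditional entropy $\frac{5}{10}\cdot 0+\frac{5}{10}\cdot 0=0$ and therefore $Gain=1-0=1$, the largest possible; a feature whose two subsets both keep the 50/50 label mix gives $Gain=1-1=0$.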
Another entropy-reduction criterion is the information gain ratio, defined here as:
$$Gain\_ratio(D,i)=\frac{Ent(D)-Ent(\hat D)}{Ent(D)}$$
(C4.5 as usually presented normalizes the gain by the split information, i.e. the intrinsic value of the feature, $-\sum_{i=1}^{n}p_i\log p_i$, rather than by $Ent(D)$.)
Information gain and information gain ratio are different entropy-reduction criteria, and different criteria lead to different decision tree algorithms, most notably ID3 and C4.5.
For a discrete feature such as {melon color}, the subsets follow naturally from the feature values; for example, the set of melons can be split into the subsets {green}, {black} and {light white}. For a continuous feature the subsets are no longer obvious, so the feature is usually discretized before splitting.
The bi-partition discretization is described below (adapted from the watermelon book):
Input: training set $D$ with a continuous attribute $a$ whose values in $D$, sorted in ascending order, are $\{a^1, a^2, \dots, a^n\}$.
Procedure:
1. Bi-partition takes the midpoint of every pair of adjacent values as a candidate split point (a small sketch follows this list), giving the candidate set
$$T_a=\Big\{\frac{a^i+a^{i+1}}{2}\ \Big|\ 1\le i\le n-1\Big\}$$
2. Each candidate $t\in T_a$ splits the set into a left and a right subset:
$$D_t^-=\{x\in D\mid a(x)\le t\},\qquad D_t^+=\{x\in D\mid a(x)>t\}$$
3. Compute the information gain (or gain ratio) of each candidate split and take the $t\in T_a$ with the largest value as the best split point.
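A minimal sketch of step 1, assuming the attribute values sit in a plain Python list (the function name candidate_splits is illustrative, not from the original):

import numpy as np

def candidate_splits(values):
    # midpoints of adjacent sorted values, e.g. [0.2, 0.5, 0.9] -> [0.35, 0.7]
    a = np.sort(np.asarray(values, dtype=float))
    return ((a[:-1] + a[1:]) / 2).tolist()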
After a decision tree has been grown it may overfit the training set, which hurts its predictive power on new data (poor generalization). Overfitting in decision trees mainly comes from the tree growing too large and complex, so that it also fits idiosyncrasies of the training data that are irrelevant to the decision. To alleviate overfitting, the tree is usually pruned after it has been grown (post-pruning), or its growth is restricted while it is being built (pre-pruning).
Pruning is generally carried out by minimizing an overall loss (cost) function of the decision tree.
There are many pruning algorithms; typical ones include Cost-Complexity Pruning, Reduced-Error Pruning, Pessimistic-Error Pruning, and Minimum Error Pruning.
The ID3 decision tree builds its model mainly on information gain (see the algorithm flowchart).
The C4.5 decision tree builds its model mainly on the information gain ratio (see the algorithm flowchart).
① CCP algorithm (Cost-Complexity Pruning)
Let $|T|$ be the number of leaf nodes of the final tree, let $t$ be a leaf node of tree $T$ holding $N_t$ sample points, of which $N_{tk}$ belong to class $k$, let $H_t(T)$ be the empirical entropy on leaf $t$, and let $\alpha\ge 0$. The loss function of the decision tree is defined as
$$\begin{aligned}C_\alpha(T)&=\sum_{t=1}^{|T|}N_tH_t(T)+\alpha|T|\\&=-\sum_{t=1}^{|T|}N_t\sum_{k=1}^{l}\frac{N_{tk}}{N_t}\log\frac{N_{tk}}{N_t}+\alpha|T|\\&=-\sum_{t=1}^{|T|}\sum_{k=1}^{l}N_{tk}\log\frac{N_{tk}}{N_t}+\alpha|T|\end{aligned}$$
The first term measures how well the model fits the training data; the second term measures the complexity of the tree (a small sketch of this computation follows the steps below).
The algorithm proceeds as follows:
1. Compute the empirical entropy of each leaf node and of its parent node.
2. Compute the loss $C_\alpha(T_A)$ of the tree $T_A$ before a leaf is collapsed into its parent, and the loss $C_\alpha(T_B)$ of the tree $T_B$ after the collapse. If $C_\alpha(T_B)\le C_\alpha(T_A)$, prune: make the parent a new leaf node and return the updated tree $T$.
3. Repeat steps 1-2 until no further pruning is possible.
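As a minimal sketch of the loss computation, assuming each leaf is summarized as a dict of per-class sample counts (the names cost_complexity and leaves are illustrative, not from the original):

import math

def cost_complexity(leaves, alpha):
    # leaves: one {class_label: count} dict per leaf node; alpha >= 0 weighs tree size
    loss = 0.0
    for counts in leaves:
        n_t = sum(counts.values())
        # contribution of leaf t: N_t * H_t(T) = -sum_k N_tk * log(N_tk / N_t)
        loss += -sum(n_tk * math.log(n_tk / n_t, 2) for n_tk in counts.values() if n_tk > 0)
    return loss + alpha * len(leaves)

Pruning then compares cost_complexity before and after collapsing a parent node into a leaf, and keeps whichever tree has the smaller value.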
The CART decision tree is a binary classification-and-regression tree: at each node the set is split into exactly two subsets.
For classification, CART typically selects the splitting feature (and split point) at each node by the Gini index.
The Gini index is defined as:
$$Gini(p)=\sum_{k=1}^{l}p_k(1-p_k)=1-\sum_{k=1}^{l}p_k^2$$
The Gini index can be interpreted in two ways:
1. the probability that two samples drawn at random from the set belong to different classes;
2. the probability that a sample drawn at random is misclassified when it is labeled according to the class distribution of the set.
So, at each node, the feature (and split) that minimizes the Gini index is chosen, and the two binary subtrees are grown from the resulting subsets.
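A minimal sketch of the Gini criterion for a candidate binary split, assuming a pandas DataFrame with a 'label' column as in the code below (gini and split_gini are illustrative names, not the author's implementation):

import pandas as pd

def gini(dataset):
    # Gini(p) = sum_k p_k * (1 - p_k) = 1 - sum_k p_k^2, estimated from label frequencies
    p = dataset.label.value_counts(normalize=True)
    return 1.0 - (p ** 2).sum()

def split_gini(dataset, feature, threshold):
    # weighted Gini index of the two children produced by the test: feature < threshold
    left = dataset[dataset[feature] < threshold]
    right = dataset[dataset[feature] >= threshold]
    n = dataset.shape[0]
    return left.shape[0] / n * gini(left) + right.shape[0] / n * gini(right)

A CART classifier would then pick the (feature, threshold) pair with the smallest split_gini at each node.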
Problem description
Input: the training samples are
4.45925637575900 8.22541838354701 -1
0.0432761720122110 6.30740040001402 -1
6.99716180262699 9.31339338579386 -1
4.75483224215432 9.26037784240288 -1
0.640487340762152 2.96504627163533 -1
7.09749399121559 4.84058301823207 1
4.15244831176753 1.44597290703838 1
9.55986996363196 1.13832040773527 1
1.63276516895206 0.446783742774178 1
9.38532498107474 0.913169554364942 1
Target samples to predict:
2.56324459286653 7.83286351946551
5.42032358874179 8.77024851395462
2.78940677946772 5.84911155908821
9.19417616894172 2.68752405465039
7.21993188858784 0.108525252131188
5.61112164928016 1.53003323224417
ID3 decision tree in Python
Compute the Shannon entropy (the snippets below assume the imports math, numpy as np, pandas as pd and matplotlib, as in the complete script at the end):
def calshannonEnt(dataset):
    # Shannon entropy of the 'label' column of a pandas DataFrame
    feature_vector = {}
    for loopi in dataset.label:
        if loopi in feature_vector.keys():
            feature_vector[loopi] += 1
        else:
            feature_vector[loopi] = 1
    shannon_entropy = 0
    for key in feature_vector.keys():
        pk = feature_vector[key] / dataset.shape[0]   # class frequency p_k
        shannon_entropy += -pk * math.log(pk, 2)
    return shannon_entropy
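A quick sanity check (hypothetical usage, assuming pandas has been imported as in the full listing):

labels = pd.DataFrame({'label': [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]})
print(calshannonEnt(labels))   # 1.0: the maximally uncertain balanced binary case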
Compute the best split point of a continuous feature:
def bestsplit(dataset, feature_label, valuelist):
    # dataset: DataFrame of feature columns plus the label (last column)
    # feature_label: name of the continuous feature being split
    # valuelist: candidate split points (midpoints of adjacent sorted values)
    entropy = calshannonEnt(dataset)
    bestgain = 0
    dataleft, dataright, value = None, None, None   # in case no candidate improves the entropy
    for loopi in range(len(valuelist)):
        left = dataset[dataset.loc[:, feature_label] < valuelist[loopi]]
        right = dataset[dataset.loc[:, feature_label] >= valuelist[loopi]]
        leftentropy = calshannonEnt(left)
        rightentropy = calshannonEnt(right)
        # information gain of this split (ID3); dividing the gain by the entropy
        # would give the gain ratio as defined above (C4.5)
        gain = entropy - left.shape[0] * leftentropy / dataset.shape[0] - right.shape[0] * rightentropy / dataset.shape[0]
        if gain > bestgain:
            dataleft = left
            dataright = right
            bestgain = gain
            value = valuelist[loopi]
    return bestgain, dataleft, dataright, value
Recursively pick the feature with the largest information gain (or gain ratio), split the set into left and right subsets, and stop when a subset becomes a leaf node:
def buildtree(dataset):
    subtree = {}
    label = dataset.label
    # stop if every sample carries the same label (labels are +1/-1, so |sum| == size)
    if np.abs(label.sum()) == dataset.shape[0]:
        subtree['label'] = dataset.label.iloc[0]
        return subtree
    feature_label = dataset.columns.values.tolist()
    bestgain = 0
    for loopi in range(len(feature_label) - 1):   # last column is the label
        # sort by the current feature and build the bi-partition candidate list
        dataset1 = dataset.sort_values(by=feature_label[loopi], axis=0, ascending=True)
        valuelist = []
        for loopj in range(0, dataset.shape[0] - 1):
            temp = dataset1.iloc[loopj][feature_label[loopi]] / 2 + dataset1.iloc[loopj + 1][feature_label[loopi]] / 2
            valuelist.append(temp)
        labelgain, dataleft, dataright, value = bestsplit(dataset1, feature_label[loopi], valuelist)
        if labelgain > bestgain:
            left = dataleft
            right = dataright
            bestgain = labelgain
            feature_value = value
            feature = feature_label[loopi]
    if bestgain == 0:
        # no split improves the entropy: fall back to a majority-vote leaf
        subtree['label'] = label.value_counts().idxmax()
        return subtree
    subtree['feature'] = feature
    subtree['value'] = feature_value
    subtree['leftsubtree'] = buildtree(left)
    subtree['rightsubtree'] = buildtree(right)
    return subtree
Prediction with the decision tree:
def predict(vector, tree):
    # vector: DataFrame of target feature vectors (default integer index)
    label = []
    for loopi in range(vector.shape[0]):
        subtree = tree
        # walk down the tree until a leaf (a node without a 'feature' key) is reached
        while 'feature' in subtree:
            if vector.loc[loopi, subtree['feature']] < subtree['value']:
                subtree = subtree['leftsubtree']
            else:
                subtree = subtree['rightsubtree']
        label.append(subtree['label'])
    label = pd.DataFrame(label, columns=['label'])
    return label
The prediction results are shown in the figure below: red marks the target samples to be predicted, green and blue mark the training samples. The complete script follows.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
def calshannonEnt(dataset):
    # Shannon entropy of the 'label' column of a pandas DataFrame
    feature_vector = {}
    for loopi in dataset.label:
        if loopi in feature_vector.keys():
            feature_vector[loopi] += 1
        else:
            feature_vector[loopi] = 1
    shannon_entropy = 0
    for key in feature_vector.keys():
        pk = feature_vector[key] / dataset.shape[0]   # class frequency p_k
        shannon_entropy += -pk * math.log(pk, 2)
    return shannon_entropy
'''
def labeljudge(dataset):
labelvote = {}
for loopi in range(dataset.shape[0]):
if dataset.label.iloc[loopi] not in labelvote.keys():
labelvote[dataset.label.iloc[loopi]] = 0
labelvote[dataset.label.iloc[loopi]] += 1
label = max(labelvote,key=labelvote.get)
return label
'''
def buildtree(dataset):
    subtree = {}
    label = dataset.label
    # stop if every sample carries the same label (labels are +1/-1, so |sum| == size)
    if np.abs(label.sum()) == dataset.shape[0]:
        subtree['label'] = dataset.label.iloc[0]
        return subtree
    feature_label = dataset.columns.values.tolist()
    bestgain = 0
    for loopi in range(len(feature_label) - 1):   # last column is the label
        # sort by the current feature and build the bi-partition candidate list
        dataset1 = dataset.sort_values(by=feature_label[loopi], axis=0, ascending=True)
        valuelist = []
        for loopj in range(0, dataset.shape[0] - 1):
            temp = dataset1.iloc[loopj][feature_label[loopi]] / 2 + dataset1.iloc[loopj + 1][feature_label[loopi]] / 2
            valuelist.append(temp)
        labelgain, dataleft, dataright, value = bestsplit(dataset1, feature_label[loopi], valuelist)
        if labelgain > bestgain:
            left = dataleft
            right = dataright
            bestgain = labelgain
            feature_value = value
            feature = feature_label[loopi]
    if bestgain == 0:
        # no split improves the entropy: fall back to a majority-vote leaf
        subtree['label'] = label.value_counts().idxmax()
        return subtree
    subtree['feature'] = feature
    subtree['value'] = feature_value
    subtree['leftsubtree'] = buildtree(left)
    subtree['rightsubtree'] = buildtree(right)
    return subtree
def predict(vector, tree):
    # vector: DataFrame of target feature vectors (default integer index)
    label = []
    for loopi in range(vector.shape[0]):
        subtree = tree
        # walk down the tree until a leaf (a node without a 'feature' key) is reached
        while 'feature' in subtree:
            if vector.loc[loopi, subtree['feature']] < subtree['value']:
                subtree = subtree['leftsubtree']
            else:
                subtree = subtree['rightsubtree']
        label.append(subtree['label'])
    label = pd.DataFrame(label, columns=['label'])
    return label
def bestsplit(dataset, feature_label, valuelist):
    # dataset: DataFrame of feature columns plus the label (last column)
    # feature_label: name of the continuous feature being split
    # valuelist: candidate split points (midpoints of adjacent sorted values)
    entropy = calshannonEnt(dataset)
    bestgain = 0
    dataleft, dataright, value = None, None, None   # in case no candidate improves the entropy
    for loopi in range(len(valuelist)):
        left = dataset[dataset.loc[:, feature_label] < valuelist[loopi]]
        right = dataset[dataset.loc[:, feature_label] >= valuelist[loopi]]
        leftentropy = calshannonEnt(left)
        rightentropy = calshannonEnt(right)
        # information gain of this split (ID3); dividing the gain by the entropy
        # would give the gain ratio as defined above (C4.5)
        gain = entropy - left.shape[0] * leftentropy / dataset.shape[0] - right.shape[0] * rightentropy / dataset.shape[0]
        if gain > bestgain:
            dataleft = left
            dataright = right
            bestgain = gain
            value = valuelist[loopi]
    return bestgain, dataleft, dataright, value
if __name__ == '__main__':
    # load the training samples (x, y, label) from a tab-separated file
    dataset = []
    with open(r'XX\train-data2.txt') as f:
        for loopi in f.readlines():
            line = loopi.strip().split('\t')
            dataset.append([float(line[0]), float(line[1]), float(line[2])])
    data_set = pd.DataFrame(dataset, columns=['x', 'y', 'label'])
    # load the target samples (x, y) to be predicted
    predict_dataset = []
    with open(r'XX\predict-data1.txt') as f:
        for loopi in f.readlines():
            line = loopi.strip().split('\t')
            predict_dataset.append([float(line[0]), float(line[1])])
    predict_dataset = pd.DataFrame(predict_dataset, columns=['x', 'y'])
    entropy = calshannonEnt(data_set)
    tree = buildtree(data_set)
    label = predict(predict_dataset, tree)
    datapredict = pd.concat([predict_dataset, label], axis=1)
    # training samples in green (+) / blue (o), predicted targets in red
    plt.scatter(data_set[data_set.label == -1].x, data_set[data_set.label == -1].y, marker='+', c='green')
    plt.scatter(data_set[data_set.label == 1].x, data_set[data_set.label == 1].y, marker='o', c='blue')
    plt.scatter(datapredict[datapredict.label == -1].x, datapredict[datapredict.label == -1].y, marker='+', c='red')
    plt.scatter(datapredict[datapredict.label == 1].x, datapredict[datapredict.label == 1].y, marker='o', c='red')
    plt.show()