A decision tree is built for two purposes: exploration and prediction. For exploration, the data that take part in growing the tree are the training data; once the tree is fully grown, it can be used to explore the information hidden in those data. For prediction, the rules derived from the tree can be used to predict future data. Because the model's classification performance on future data must be considered, after the tree has been built from the training data, test data can be used to measure the model's robustness and classification performance. Through a series of validation steps, the best classification rules are finally obtained and used to predict future data.
The steps in building a decision tree are data preparation, tree growing, tree pruning, and rule extraction.
The data for a decision tree analysis involve two kinds of variables: the target variable, determined by the problem itself, and the attribute variables, chosen from the problem's background and environment to serve as splitting variables. Whether the splitting variables are easy to understand and interpret largely determines the quality of the resulting analysis. Splitting attributes fall into the following types:
(1) Binary attributes: the test condition produces exactly two outcomes.
(2) Nominal attributes: a nominal attribute can take several values, each representing a different category; for example, blood type can be A, B, AB, or O.
(3) Ordinal attributes: these can produce binary or multi-way splits; the values may be grouped, but the grouping must respect the ordering of the attribute values. For example, age can be grouped into young, middle-aged, and old.
(4) Continuous attributes: the test condition is expressed as a comparison such as x < a or x >= a. The tree must consider all possible split points and then choose the best one.
After the data have been obtained, they are divided into a training data set and a test data set. The split can be made as follows.
Data splitting divides the data into a training set, a test set, and a validation set. The training set is used to build the model, the test set to assess whether the model is overly complex and how well it generalizes, and the validation set to measure how good the model is, for example its classification error rate or mean squared error. A good model should still fit unseen data well; if the model keeps growing more complex while the error on the test data keeps increasing, the model is overfitting.
There are different conventions for the splitting ratio, but every subset should remain representative of the original data. One approach is to draw 80% of the data as the training set for building the model and use the remaining 20% to check its validity. Another is k-fold cross-validation: the data are divided into k equal parts; each time, k-1 parts are used to train the model and the remaining part is used to test it. This is repeated k times, so that every record appears in both the training and the test role, and the average of the k results represents the model's validity. This method suits situations with few samples and makes full use of the data, but its drawback is the long computation time.
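As an illustration of the k-fold procedure (not part of the original text), the following R sketch assumes a data frame whose class label is stored in a factor column named type, as in the Pima data used later in this section; the rpart call is only a placeholder for whatever model is being validated.

library(rpart)   # rpart is also used for the CART examples below
kfold_error <- function(dat, k = 5) {
  folds <- sample(rep(1:k, length.out = nrow(dat)))  # randomly assign every record to one of k folds
  errs <- numeric(k)
  for (i in 1:k) {
    train <- dat[folds != i, ]                       # k-1 folds for training
    test  <- dat[folds == i, ]                       # the remaining fold for testing
    fit   <- rpart(type ~ ., data = train, method = "class")
    pred  <- predict(fit, test, type = "class")
    errs[i] <- mean(pred != test$type)               # misclassification rate on the held-out fold
  }
  mean(errs)                                         # average error over the k repetitions
}
# e.g. library(MASS); kfold_error(Pima.tr, k = 10)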
If, during construction, a decision-tree model has a very low error rate on the training data but a very high error rate on the test data, the model is overfitting and cannot be used to generalize to other data. Therefore, after the model has been built, the tree should be pruned appropriately according to its classification performance on the test data, which improves its classification or prediction accuracy and avoids overfitting.
The splitting criterion determines the size of the tree, both its width and its depth. Common splitting criteria include information gain, the Gini index, the chi-square statistic, and the gain ratio.
Suppose the training data set contains k classes C1, C2, ..., Ck and attribute A takes l distinct values A1, A2, ..., Al. The counts can be arranged in the following contingency table.
Attribute\Class    C1     C2     ...    Ck     Total
A1                 x11    x12    ...    x1k    x1.
A2                 x21    x22    ...    x2k    x2.
...                ...    ...    ...    ...    ...
Al                 xl1    xl2    ...    xlk    xl.
Total              x.1    x.2    ...    x.k    N
(1) Information gain
Information gain measures the amount of information under different conditions according to the likelihood or probability of the different outcomes.
Let x.j denote the number of records belonging to class Cj and N the total number of records in the data set, so that the probability of class Cj is pj = x.j/N. From information theory, the information associated with each class is -log2(pj), and the total information Info(D) carried by the classes C1, C2, ..., Ck is:
Info(D) = -(x.1/N)*log2(x.1/N) - (x.2/N)*log2(x.2/N) - ... - (x.k/N)*log2(x.k/N)
Info(D) is also called entropy and is commonly used to measure how mixed the data are. When all classes occur with equal probability, Info(D) reaches its maximum, log2(k) (equal to 1 when there are two classes), meaning the class composition is at its most complex.
Suppose the data set D is partitioned by attribute A into l subsets Di, where xi. is the total number of records with attribute value Ai and xij is the number of records with attribute value Ai that belong to class Cj. The information within the subset for Ai, Info(Ai), is:
Info(Ai) = -(xi1/xi.)*log2(xi1/xi.) - (xi2/xi.)*log2(xi2/xi.) - ... - (xik/xi.)*log2(xik/xi.)
The information of attribute A is then a weighted sum, with weights given by the number of records under each attribute value:
InfoA(D) = (x1./N)*Info(A1) + (x2./N)*Info(A2) + ... + (xl./N)*Info(Al)
The information gain is the total information of the original data minus the total information after the split, Gain(A) = Info(D) - InfoA(D); it expresses how much attribute A contributes as a splitting attribute. The same computation can be carried out for every candidate attribute, and the attribute with the best information gain is selected.
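To make the calculation concrete, here is a small R sketch (not from the original text) that computes Info(D), InfoA(D), and the information gain from a contingency table x laid out as in the table above, with one row per attribute value and one column per class; the example matrix is invented purely for illustration.

entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                            # treat 0*log2(0) as 0
  -sum(p * log2(p))
}
info_gain <- function(x) {
  info_D <- entropy(colSums(x))            # Info(D) from the class totals x.j
  info_A <- sum(rowSums(x) / sum(x) *      # InfoA(D): weighted sum of Info(Ai)
                  apply(x, 1, entropy))
  info_D - info_A                          # Gain(A) = Info(D) - InfoA(D)
}
x <- matrix(c(30, 10, 20, 40), nrow = 2, byrow = TRUE)  # 2 attribute values x 2 classes
info_gain(x)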
(2) Gini index
The Gini index measures the impurity of a data set with respect to all classes:
Gini(D) = 1 - sum(pj^2, j = 1, ..., k)
The impurity Gini(Ai) of the subset of records with attribute value Ai is:
Gini(Ai) = 1 - (xi1/xi.)^2 - (xi2/xi.)^2 - ... - (xik/xi.)^2
The overall impurity of the data when split by attribute A is:
GiniA(D) = (x1./N)*Gini(A1) + (x2./N)*Gini(A2) + ... + (xl./N)*Gini(Al)
The reduction in impurity contributed by attribute A is:
deltaGini(A) = Gini(D) - GiniA(D)
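A matching R sketch (again illustrative, reusing the contingency-table layout x from the information-gain example) for the Gini impurity and the reduction deltaGini(A):

gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}
gini_reduction <- function(x) {
  gini_D <- gini(colSums(x))               # Gini(D) from the class totals
  gini_A <- sum(rowSums(x) / sum(x) *      # GiniA(D): weighted impurity after the split
                  apply(x, 1, gini))
  gini_D - gini_A                          # deltaGini(A)
}
x <- matrix(c(30, 10, 20, 40), nrow = 2, byrow = TRUE)
gini_reduction(x)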
(3) Chi-square statistic
The chi-square statistic uses a contingency table to measure the degree of dependence between two categorical variables: the larger the sample chi-square value, the stronger the dependence between the two variables.
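In R the statistic for such a contingency table can be obtained with chisq.test(); a short sketch with the same illustrative matrix used above:

x <- matrix(c(30, 10, 20, 40), nrow = 2, byrow = TRUE)   # attribute values x classes
chisq.test(x, correct = FALSE)$statistic                 # larger value => stronger dependence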
(4) Gain ratio
The gain ratio takes into account the information carried by the candidate attribute itself before that information is transferred to the tree: the most suitable splitting attribute is found by computing the ratio of the information gain to the information content of the split itself.
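A sketch of the gain ratio (illustrative only; it reuses the entropy() and info_gain() helpers defined in the information-gain sketch above): the split information is the entropy of the row totals, i.e. the information carried by the split itself.

gain_ratio <- function(x) {
  split_info <- entropy(rowSums(x))        # information of the partition induced by A
  info_gain(x) / split_info                # gain ratio = Gain(A) / SplitInfo(A)
}
x <- matrix(c(30, 10, 20, 40), nrow = 2, byrow = TRUE)
gain_ratio(x)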
(5) Variance reduction
When the target variable is continuous, variance reduction can be used as the splitting criterion.
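As a hedged illustration (not from the original text), variance reduction for a continuous target y split into groups by an attribute a can be sketched as the parent-node variance minus the weighted within-group variance; the data below are simulated only for the example.

variance_reduction <- function(y, a) {
  n <- length(y)
  v_parent <- var(y) * (n - 1) / n                       # population variance before the split
  v_within <- sum(tapply(y, a, function(g)
                  var(g) * (length(g) - 1))) / n         # weighted within-group variance
  v_parent - v_within                                    # variance removed by splitting on a
}
set.seed(1)
y <- c(rnorm(50, 0), rnorm(50, 3))
a <- rep(c("left", "right"), each = 50)
variance_reduction(y, a)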
Decision trees can be pruned before or after growth (pre-pruning and post-pruning). Pre-pruning is applied while the tree is being grown: thresholds for stopping growth are set in advance, and when the evaluation value of a split fails to reach the threshold, growth stops at that node; typical thresholds require, for example, an information gain greater than 0.1 or a minimum number of samples in a node. Pre-pruning is efficient to carry out, but it may prune too aggressively. Post-pruning is less efficient, but it is very effective at curing a tree's overfitting.
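For the rpart trees used later in this section, pre-pruning thresholds of this kind can be set up front through rpart.control; a hedged sketch (the parameter values are chosen only for illustration):

library(rpart)
library(MASS)                                      # provides the Pima.tr data used below
pre_pruned <- rpart(type ~ ., data = Pima.tr,
                    control = rpart.control(cp = 0.05,      # a split must improve the fit by at least cp
                                            minsplit = 20)) # a node needs at least 20 observations to be split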
Once the tree has been grown and pruned, it can be used to extract the information hidden in the data. The table below compares three commonly used decision-tree algorithms.
Algorithm                             | CART                    | C4.5/C5.0            | CHAID
Data types handled                    | discrete, continuous    | discrete, continuous | discrete
Branches on continuous data           | two branches only       | unlimited            | cannot handle
Splitting criterion
  categorical dependent variable      | Gini diversity index    | gain ratio           | chi-square test
  continuous dependent variable       | variance reduction      | variance reduction   | chi-square or F test (after conversion to a categorical variable)
Splitting method
  categorical independent variable    | binary splits           | multi-way splits     | multi-way splits
  continuous independent variable     | binary splits           | binary splits        | multi-way splits (after conversion to a categorical variable)
Pruning method                        | cost-complexity pruning | error-based pruning  | none
The classification and prediction performance of a decision-tree model can be assessed from two angles: (1) objectively, by the results on the test data, for example the classification error rate, to identify the better tree; (2) because the extracted classification rules vary with the problem, after the objective evaluation a domain expert usually selects the most suitable tree in light of the problem background.
Load the packages and the data set. The CART trees below are built with the rpart package, so it is loaded alongside MASS, which provides the Pima data.
> library(rpart)
> library(MASS)
> data("Pima.tr")
> str(Pima.tr)
'data.frame': 200 obs. of 8 variables:
$ npreg: int 5 7 5 0 0 5 3 1 3 2 ...
$ glu : int 86 195 77 165 107 97 83 193 142 128 ...
$ bp : int 68 70 82 76 60 76 58 50 80 78 ...
$ skin : int 28 33 41 43 25 27 31 16 15 37 ...
$ bmi : num 30.2 25.1 35.8 47.9 26.4 35.6 34.3 25.9 32.4 43.3 ...
$ ped : num 0.364 0.163 0.156 0.259 0.133 ...
$ age : int 24 55 35 26 23 52 25 24 63 31 ...
$ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 1 1 2 ...
The Pima data are already split into two parts: Pima.tr is the training set and Pima.te is the test set.
> # First grow the tree without pruning, so set the complexity parameter cp to 0
> cart_tree1 = rpart(type~., Pima.tr, control = rpart.control(cp = 0))
> summary(cart_tree1)
Call:
rpart(formula = type ~ ., data = Pima.tr, control = rpart.control(cp = 0))
n= 200
CP nsplit rel error xerror xstd
1 0.22058824 0 1.0000000 1.0000000 0.09851844
2 0.16176471 1 0.7794118 0.9852941 0.09816108
3 0.07352941 2 0.6176471 0.8235294 0.09337946
4 0.05882353 3 0.5441176 0.7941176 0.09233140
5 0.01470588 4 0.4852941 0.6176471 0.08470895
6 0.00000000 7 0.4411765 0.7500000 0.09064718
Node number 1: 200 observations, complexity param=0.2205882
predicted class=No expected loss=0.34 P(node) =1
class counts: 132 68
probabilities: 0.660 0.340
left son=2 (109 obs) right son=3 (91 obs)
Primary splits:
glu < 123.5 to the left, improve=19.624700, (0 missing)
age < 28.5 to the left, improve=15.016410, (0 missing)
npreg < 6.5 to the left, improve=10.465630, (0 missing)
bmi < 27.35 to the left, improve= 9.727105, (0 missing)
skin < 22.5 to the left, improve= 8.201159, (0 missing)
Surrogate splits:
age < 30.5 to the left, agree=0.685, adj=0.308, (0 split)
bp < 77 to the left, agree=0.650, adj=0.231, (0 split)
npreg < 6.5 to the left, agree=0.640, adj=0.209, (0 split)
skin < 32.5 to the left, agree=0.635, adj=0.198, (0 split)
bmi < 30.85 to the left, agree=0.575, adj=0.066, (0 split)
Node number 2: 109 observations, complexity param=0.01470588
predicted class=No expected loss=0.1376147 P(node) =0.545
class counts: 94 15
probabilities: 0.862 0.138
left son=4 (74 obs) right son=5 (35 obs)
Primary splits:
age < 28.5 to the left, improve=3.2182780, (0 missing)
npreg < 6.5 to the left, improve=2.4578310, (0 missing)
bmi < 33.5 to the left, improve=1.6403660, (0 missing)
bp < 59 to the left, improve=0.9851960, (0 missing)
skin < 24 to the left, improve=0.8342926, (0 missing)
Surrogate splits:
npreg < 4.5 to the left, agree=0.798, adj=0.371, (0 split)
bp < 77 to the left, agree=0.734, adj=0.171, (0 split)
skin < 36.5 to the left, agree=0.725, adj=0.143, (0 split)
bmi < 38.85 to the left, agree=0.716, adj=0.114, (0 split)
glu < 66 to the right, agree=0.688, adj=0.029, (0 split)
Node number 3: 91 observations, complexity param=0.1617647
predicted class=Yes expected loss=0.4175824 P(node) =0.455
class counts: 38 53
probabilities: 0.418 0.582
left son=6 (35 obs) right son=7 (56 obs)
Primary splits:
ped < 0.3095 to the left, improve=6.528022, (0 missing)
bmi < 28.65 to the left, improve=6.473260, (0 missing)
skin < 19.5 to the left, improve=4.778504, (0 missing)
glu < 166 to the left, improve=4.104532, (0 missing)
age < 39.5 to the left, improve=3.607390, (0 missing)
Surrogate splits:
glu < 126.5 to the left, agree=0.670, adj=0.143, (0 split)
bp < 93 to the right, agree=0.659, adj=0.114, (0 split)
bmi < 27.45 to the left, agree=0.659, adj=0.114, (0 split)
npreg < 9.5 to the right, agree=0.648, adj=0.086, (0 split)
skin < 20.5 to the left, agree=0.637, adj=0.057, (0 split)
Node number 4: 74 observations
predicted class=No expected loss=0.05405405 P(node) =0.37
class counts: 70 4
probabilities: 0.946 0.054
Node number 5: 35 observations, complexity param=0.01470588
predicted class=No expected loss=0.3142857 P(node) =0.175
class counts: 24 11
probabilities: 0.686 0.314
left son=10 (9 obs) right son=11 (26 obs)
Primary splits:
glu < 90 to the left, improve=2.3934070, (0 missing)
bmi < 33.4 to the left, improve=1.3714290, (0 missing)
bp < 68 to the right, improve=0.9657143, (0 missing)
ped < 0.334 to the left, improve=0.9475564, (0 missing)
skin < 39.5 to the right, improve=0.7958592, (0 missing)
Surrogate splits:
ped < 0.1795 to the left, agree=0.8, adj=0.222, (0 split)
Node number 6: 35 observations, complexity param=0.05882353
predicted class=No expected loss=0.3428571 P(node) =0.175
class counts: 23 12
probabilities: 0.657 0.343
left son=12 (27 obs) right son=13 (8 obs)
Primary splits:
glu < 166 to the left, improve=3.438095, (0 missing)
ped < 0.2545 to the right, improve=1.651429, (0 missing)
skin < 25.5 to the left, improve=1.651429, (0 missing)
npreg < 3.5 to the left, improve=1.078618, (0 missing)
bp < 73 to the right, improve=1.078618, (0 missing)
Surrogate splits:
bp < 94.5 to the left, agree=0.8, adj=0.125, (0 split)
Node number 7: 56 observations, complexity param=0.07352941
predicted class=Yes expected loss=0.2678571 P(node) =0.28
class counts: 15 41
probabilities: 0.268 0.732
left son=14 (11 obs) right son=15 (45 obs)
Primary splits:
bmi < 28.65 to the left, improve=5.778427, (0 missing)
age < 39.5 to the left, improve=3.259524, (0 missing)
npreg < 6.5 to the left, improve=2.133215, (0 missing)
ped < 0.8295 to the left, improve=1.746894, (0 missing)
skin < 22 to the left, improve=1.474490, (0 missing)
Surrogate splits:
skin < 19.5 to the left, agree=0.839, adj=0.182, (0 split)
Node number 10: 9 observations
predicted class=No expected loss=0 P(node) =0.045
class counts: 9 0
probabilities: 1.000 0.000
Node number 11: 26 observations, complexity param=0.01470588
predicted class=No expected loss=0.4230769 P(node) =0.13
class counts: 15 11
probabilities: 0.577 0.423
left son=22 (19 obs) right son=23 (7 obs)
Primary splits:
bp < 68 to the right, improve=1.6246390, (0 missing)
bmi < 33.4 to the left, improve=1.6173080, (0 missing)
npreg < 6.5 to the left, improve=0.9423077, (0 missing)
skin < 39.5 to the right, improve=0.6923077, (0 missing)
ped < 0.334 to the left, improve=0.4923077, (0 missing)
Surrogate splits:
glu < 94.5 to the right, agree=0.808, adj=0.286, (0 split)
ped < 0.2105 to the right, agree=0.808, adj=0.286, (0 split)
Node number 12: 27 observations
predicted class=No expected loss=0.2222222 P(node) =0.135
class counts: 21 6
probabilities: 0.778 0.222
Node number 13: 8 observations
predicted class=Yes expected loss=0.25 P(node) =0.04
class counts: 2 6
probabilities: 0.250 0.750
Node number 14: 11 observations
predicted class=No expected loss=0.2727273 P(node) =0.055
class counts: 8 3
probabilities: 0.727 0.273
Node number 15: 45 observations
predicted class=Yes expected loss=0.1555556 P(node) =0.225
class counts: 7 38
probabilities: 0.156 0.844
Node number 22: 19 observations
predicted class=No expected loss=0.3157895 P(node) =0.095
class counts: 13 6
probabilities: 0.684 0.316
Node number 23: 7 observations
predicted class=Yes expected loss=0.2857143 P(node) =0.035
class counts: 2 5
probabilities: 0.286 0.714
> par(xpd = TRUE); plot(cart_tree1); text(cart_tree1)
> # Predict on the test set and compute the prediction accuracy
> pre_cart_tree1 = predict(cart_tree1, Pima.te, type = "class")
> matrix1 = table(Type = Pima.te$type, predict = pre_cart_tree1)
> matrix1
predict
Type No Yes
No 223 0
Yes 109 0
> accuracy_tree1 = sum(diag(matrix1))/sum(matrix1)
> accuracy_tree1
[1] 0.6716867
> # Prune the fitted tree, setting cp to 0.03
> cart_tree2 = prune(cart_tree1, cp = 0.03)
> par(xpd = TRUE); plot(cart_tree2); text(cart_tree2)
> # Predict on the test set with the pruned model and compute the accuracy
> pre_cart_tree2 = predict(cart_tree2, Pima.te, type = "class")
> matrix2 = table(Type = Pima.te$type, predict = pre_cart_tree2)
> matrix2
predict
Type No Yes
No 223 0
Yes 109 0
> accuracy_tree2 = sum(diag(matrix2))/sum(matrix2)
> accuracy_tree2
[1] 0.6716867
> # Prune the tree further, setting cp to 0.1
> cart_tree3 = prune(cart_tree2, cp = 0.1)
> par(xpd = TRUE); plot(cart_tree3); text(cart_tree3)
> # Predict on the test set with the pruned model and compute the accuracy
> pre_cart_tree3 = predict(cart_tree3, Pima.te, type = "class")
> matrix3 = table(Type = Pima.te$type, predict = pre_cart_tree3)
> matrix3
predict
Type No Yes
No 223 0
Yes 109 0
> accuracy_tree3 = sum(diag(matrix3))/sum(matrix3)
> accuracy_tree3
[1] 0.6716867
In this example all three trees achieve the same accuracy on the test set; with cp set to 0.1 the model is greatly simplified without any loss of accuracy, so the simpler tree is preferable.
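A hedged aside, not part of the original analysis: instead of trying cp values by hand, the cross-validated error already reported in the CP table of summary(cart_tree1) can be used to pick cp automatically, for example:

printcp(cart_tree1)                                  # CP table with cross-validated error (xerror)
best_cp <- cart_tree1$cptable[which.min(cart_tree1$cptable[, "xerror"]), "CP"]
cart_tree_best <- prune(cart_tree1, cp = best_cp)    # prune at the cp with the lowest xerror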
> # C5.0 decision tree analysis
> library(C50)
> library(MASS)
> data("Pima.tr")
> str(Pima.tr)
'data.frame': 200 obs. of 8 variables:
$ npreg: int 5 7 5 0 0 5 3 1 3 2 ...
$ glu : int 86 195 77 165 107 97 83 193 142 128 ...
$ bp : int 68 70 82 76 60 76 58 50 80 78 ...
$ skin : int 28 33 41 43 25 27 31 16 15 37 ...
$ bmi : num 30.2 25.1 35.8 47.9 26.4 35.6 34.3 25.9 32.4 43.3 ...
$ ped : num 0.364 0.163 0.156 0.259 0.133 ...
$ age : int 24 55 35 26 23 52 25 24 63 31 ...
$ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 1 1 2 ...
> C50_tree2 = C5.0(type~., Pima.tr, control=C5.0Control(noGlobalPruning = TRUE)) # do not prune the tree
> summary(C50_tree2)
Call:
C5.0.formula(formula = type ~ ., data = Pima.tr, control = C5.0Control(noGlobalPruning = TRUE))
C5.0 [Release 2.07 GPL Edition] Sat Sep 16 12:12:54 2017
-------------------------------
Class specified by attribute `outcome'
Read 200 cases (8 attributes) from undefined.data
Decision tree:
glu <= 123: No (109/15)
glu > 123:
:...bmi > 28.6:
:...ped <= 0.344: No (29/12)
: ped > 0.344: Yes (41/5)
bmi <= 28.6:
:...age <= 32: No (11)
age > 32:
:...bp > 80: No (3)
bp <= 80:
:...ped <= 0.162: No (2)
ped > 0.162: Yes (5)
Evaluation on training data (200 cases):
Decision Tree
----------------
Size Errors
7 32(16.0%) <<
(a) (b) <-classified as
---- ----
127 5 (a): class No
27 41 (b): class Yes
Attribute usage:
100.00% glu
45.50% bmi
38.50% ped
10.50% age
5.00% bp
Time: 0.0 secs
> plot(C50_tree2)
> C50_tree3 = C5.0(type~., Pima.tr, control=C5.0Control(noGlobalPruning = FALSE)) # prune the tree
> summary(C50_tree3)
Call:
C5.0.formula(formula = type ~ ., data = Pima.tr, control = C5.0Control(noGlobalPruning = FALSE))
C5.0 [Release 2.07 GPL Edition] Sat Sep 16 12:14:14 2017
-------------------------------
Class specified by attribute `outcome'
Read 200 cases (8 attributes) from undefined.data
Decision tree:
glu <= 123: No (109/15)
glu > 123:
:...bmi <= 28.6: No (21/5)
bmi > 28.6:
:...ped <= 0.344: No (29/12)
ped > 0.344: Yes (41/5)
Evaluation on training data (200 cases):
Decision Tree
----------------
Size Errors
4 37(18.5%) <<
(a) (b) <-classified as
---- ----
127 5 (a): class No
32 36 (b): class Yes
Attribute usage:
100.00% glu
45.50% bmi
35.00% ped
Time: 0.0 secs
> plot(C50_tree3)
> pre_C50_Cla2 = predict(C50_tree2, Pima.te, type = "class")
> matrix2 = table(Type = Pima.te$type, predict = pre_C50_Cla2)
> matrix2
predict
Type No Yes
No 193 30
Yes 58 51
> accuracy_tree2 = sum(diag(matrix2))/sum(matrix2)
> accuracy_tree2
[1] 0.7349398
> pre_C50_Cla3 = predict(C50_tree3, Pima.te, type = "class")
> matrix3 = table(Type = Pima.te$type, predict = pre_C50_Cla3)
> matrix3
predict
Type No Yes
No 195 28
Yes 60 49
> accuracy_tree3 = sum(diag(matrix3))/sum(matrix3)
> accuracy_tree3
[1] 0.7349398
We find that pruning makes no difference to the model's accuracy here, but the pruned model is clearly easier to interpret.
> # CHAID decision tree analysis
> # The CHAID tree can only handle discrete attributes, so the continuous variables in the data must first be discretized; post-pruning does not need to be considered.
> install.packages("CHAID") # if the package cannot be found, it can be downloaded from https://r-forge.r-project.org/R/?group_id=343 and installed
> library(CHAID)
> # Load the training and test data sets
> data("Pima.tr")
> data("Pima.te")
> # Combine the two data sets
> Pima = rbind(Pima.tr, Pima.te)
> # Discretize the data and print the discretized levels of each attribute
> level_name = {}
> for(i in 1:7)
+ {
+ Pima[,i] = cut(Pima[,i], breaks = 3, ordered_result = TRUE, include.lowest = TRUE)
+ level_name <- rbind(level_name, levels(Pima[,i]))
+ }
> level_name = data.frame(level_name)
> row.names(level_name) = colnames(Pima)[1:7]
> colnames(level_name) = paste("L",1:3,sep="")
> level_name
L1 L2 L3
npreg [-0.017,5.67] (5.67,11.3] (11.3,17]
glu [55.9,104] (104,151] (151,199]
bp [23.9,52.7] (52.7,81.3] (81.3,110]
skin [6.91,37.7] (37.7,68.3] (68.3,99.1]
bmi [18.2,34.5] (34.5,50.8] (50.8,67.1]
ped [0.0827,0.863] (0.863,1.64] (1.64,2.42]
age [20.9,41] (41,61] (61,81.1]
> # Use the first 200 records as the training set and the remaining 332 as the test set
> Pima.tr = Pima[1:200,]
> Pima.te = Pima[201:nrow(Pima),]
> CHAID_tree = chaid(type~., Pima.tr)
> CHAID_tree
Model formula:
type ~ npreg + glu + bp + skin + bmi + ped + age
Fitted party:
[1] root
| [2] glu in [55.9,104]
| | [3] age in [20.9,41]: No (n = 50, err = 6.0%)
| | [4] age in (41,61], (61,81.1]: No (n = 10, err = 40.0%)
| [5] glu in (104,151]
| | [6] age in [20.9,41]: No (n = 86, err = 27.9%)
| | [7] age in (41,61], (61,81.1]: Yes (n = 15, err = 26.7%)
| [8] glu in (151,199]: Yes (n = 39, err = 33.3%)
Number of inner nodes: 3
Number of terminal nodes: 5
> plot(CHAID_tree)
> # Predict on the test set and compute the prediction accuracy
> pre_CHAID_tree = predict(CHAID_tree, Pima.te)
> matrix = table(Type = Pima.te$type, predict = pre_CHAID_tree)
> matrix
predict
Type No Yes
No 199 24
Yes 47 62
> accuracy_tree = sum(diag(matrix))/sum(matrix)
> accuracy_tree
[1] 0.7861446