Classification Models in R (Part 1): Recursive Partitioning Trees (Traditional Decision Trees)

Introduction

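A classification algorithm builds a model from a training data set whose class labels are known, and then uses that model to classify new observations (the test data set). Like regression, classification is therefore a supervised learning method: both use the known outcomes of the training set to predict outcomes for the test set. The main difference is that regression predicts continuous values, whereas classification predicts discrete class labels.
We first use the churn data set to create training and test sets, and then apply different classification models to them. We introduce tree-based classification through traditional classification trees and conditional inference trees, and we also cover lazy (instance-based) and probability-based algorithms. Each of these algorithms builds a classification model from the training data and then uses that model to predict the classes of the test data. We also build a confusion matrix to evaluate each model's predictive performance.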

Preparing the training and test data sets

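We use the customer churn data from the C50 package (the churn data set). Its training portion, churnTrain, has 3,333 observations and 20 variables. We build a classification model to judge whether a customer will churn. Because acquiring a new customer usually costs more than retaining an existing one, this prediction matters.
Before building the model, we preprocess the data. On inspection, state, area_code, and account_length do not help the classification, so we drop these three attributes first.
After preprocessing, we split the data into a training set and a test set. We use sample() to assign each row the label 1 or 2 with probabilities 0.7 and 0.3. Rows labeled 1 form the training set (roughly 70% of the rows) and rows labeled 2 form the test set (roughly 30%).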

library(grid)
library(partykit)
library(C50)
library(rpart)  # provides rpart(), printcp() and plotcp(), which are used below

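# Note: newer C50 releases may no longer bundle the churn data;
# if data(churn) fails, the data has to be obtained from another source.
data(churn)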

str(churnTrain)
'data.frame':   3333 obs. of  20 variables:
 $ state                        : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
 $ account_length               : int  128 107 137 84 75 118 121 147 117 141 ...
 $ area_code                    : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
 $ international_plan           : Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 1 2 1 2 ...
 $ voice_mail_plan              : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 2 1 1 2 ...
 $ number_vmail_messages        : int  25 26 0 0 0 0 24 0 0 37 ...
 $ total_day_minutes            : num  265 162 243 299 167 ...
 $ total_day_calls              : int  110 123 114 71 113 98 88 79 97 84 ...
 $ total_day_charge             : num  45.1 27.5 41.4 50.9 28.3 ...
 $ total_eve_minutes            : num  197.4 195.5 121.2 61.9 148.3 ...
 $ total_eve_calls              : int  99 103 110 88 122 101 108 94 80 111 ...
 $ total_eve_charge             : num  16.78 16.62 10.3 5.26 12.61 ...
 $ total_night_minutes          : num  245 254 163 197 187 ...
 $ total_night_calls            : int  91 103 104 89 121 118 118 96 90 97 ...
 $ total_night_charge           : num  11.01 11.45 7.32 8.86 8.41 ...
 $ total_intl_minutes           : num  10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
 $ total_intl_calls             : int  3 3 5 7 3 6 7 6 4 5 ...
 $ total_intl_charge            : num  2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
 $ number_customer_service_calls: int  1 1 0 2 3 0 3 0 1 0 ...
 $ churn                        : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
churnTrain = churnTrain[,!names(churnTrain) %in% c("state","area_code","account_length")]
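# Set the random seed to 2 so that the split is reproducible
set.seed(2)
# Label each row of churnTrain 1 or 2 with probabilities 0.7 and 0.3 (labels are drawn with replacement)
ind = sample(2,nrow(churnTrain),replace = TRUE,prob = c(0.7,0.3))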

trainset = churnTrain[ind == 1,]
testset = churnTrain[ind == 2,]
dim(trainset)
[1] 2315   17
dim(testset)
[1] 1018   17
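
The split above assigns labels at random, so the 70/30 proportions are only approximate. As a minimal sketch of an alternative, assuming an exact 70/30 split is wanted, the row indices can be sampled without replacement. The data frame df and the names train_idx, train, and test are placeholders introduced for illustration; they are not part of the original code.

```r
# Hypothetical stand-in for churnTrain: 3333 rows, one numeric column.
set.seed(2)
df <- data.frame(x = rnorm(3333))

# Draw exactly 70% of the row indices, without replacement.
n <- nrow(df)
train_idx <- sample(n, size = round(0.7 * n))

# Rows that were drawn go to training, all others go to testing.
train <- df[train_idx, , drop = FALSE]
test  <- df[-train_idx, , drop = FALSE]

dim(train)
dim(test)
```

With this approach the two sets never overlap and their sizes are fixed in advance, at the cost of a slightly different sampling scheme from the one used in this article.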

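Building a classification model with a recursive partitioning tree

A classification tree predicts the class from one or more input variables combined with split conditions. Splitting starts at the root node: at each node, the algorithm checks the input variables against the split condition and continues recursively into the left or right subtree. Splitting stops when a leaf (terminal) node is reached.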

churn.rp = rpart(churn ~ .,data = trainset)

churn.rp
n= 2315 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 2315 342 no (0.14773218 0.85226782)  
   2) total_day_minutes>=265.45 144  59 yes (0.59027778 0.40972222)  
     4) voice_mail_plan=no 110  29 yes (0.73636364 0.26363636)  
       8) total_eve_minutes>=188.5 67   3 yes (0.95522388 0.04477612) *
       9) total_eve_minutes< 188.5 43  17 no (0.39534884 0.60465116)  
        18) total_day_minutes>=282.7 19   6 yes (0.68421053 0.31578947) *
        19) total_day_minutes< 282.7 24   4 no (0.16666667 0.83333333) *
     5) voice_mail_plan=yes 34   4 no (0.11764706 0.88235294) *
   3) total_day_minutes< 265.45 2171 257 no (0.11837863 0.88162137)  
     6) number_customer_service_calls>=3.5 168  82 yes (0.51190476 0.48809524)  
      12) total_day_minutes< 160.2 71  10 yes (0.85915493 0.14084507) *
      13) total_day_minutes>=160.2 97  25 no (0.25773196 0.74226804)  
        26) total_eve_minutes< 155.5 20   7 yes (0.65000000 0.35000000) *
        27) total_eve_minutes>=155.5 77  12 no (0.15584416 0.84415584) *
     7) number_customer_service_calls< 3.5 2003 171 no (0.08537194 0.91462806)  
      14) international_plan=yes 188  76 no (0.40425532 0.59574468)  
        28) total_intl_calls< 2.5 38   0 yes (1.00000000 0.00000000) *
        29) total_intl_calls>=2.5 150  38 no (0.25333333 0.74666667)  
          58) total_intl_minutes>=13.1 32   0 yes (1.00000000 0.00000000) *
          59) total_intl_minutes< 13.1 118   6 no (0.05084746 0.94915254) *
      15) international_plan=no 1815  95 no (0.05234160 0.94765840)  
        30) total_day_minutes>=224.15 251  50 no (0.19920319 0.80079681)  
          60) total_eve_minutes>=259.8 36  10 yes (0.72222222 0.27777778) *
          61) total_eve_minutes< 259.8 215  24 no (0.11162791 0.88837209) *
        31) total_day_minutes< 224.15 1564  45 no (0.02877238 0.97122762) *

Next, call printcp to examine the complexity parameter.

printcp(churn.rp)

Classification tree:
rpart(formula = churn ~ ., data = trainset)

Variables actually used in tree construction:
[1] international_plan            number_customer_service_calls
[3] total_day_minutes             total_eve_minutes            
[5] total_intl_calls              total_intl_minutes           
[7] voice_mail_plan              

Root node error: 342/2315 = 0.14773

n= 2315 

        CP nsplit rel error  xerror     xstd
1 0.076023      0   1.00000 1.00000 0.049920
2 0.074561      2   0.84795 0.99708 0.049860
3 0.055556      4   0.69883 0.76023 0.044421
4 0.026316      7   0.49415 0.52632 0.037673
5 0.023392      8   0.46784 0.52047 0.037481
6 0.020468     10   0.42105 0.50877 0.037092
7 0.017544     11   0.40058 0.47076 0.035788
8 0.010000     12   0.38304 0.47661 0.035993
plotcp(churn.rp)

[Figure 1: Cost-complexity plot produced by plotcp(churn.rp)]
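Finally, use summary to inspect the fitted model. The xerror and xstd values below differ slightly from the printcp output above; cross-validation in rpart is random, so the two outputs come from separate fits of the same tree.

summary(churn.rp)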
Call:
rpart(formula = churn ~ ., data = trainset)
  n= 2315 

          CP nsplit rel error    xerror       xstd
1 0.07602339      0 1.0000000 1.0000000 0.04992005
2 0.07456140      2 0.8479532 0.9590643 0.04906076
3 0.05555556      4 0.6988304 0.7953216 0.04530196
4 0.02631579      7 0.4941520 0.5233918 0.03757730
5 0.02339181      8 0.4678363 0.5263158 0.03767329
6 0.02046784     10 0.4210526 0.5175439 0.03738427
7 0.01754386     11 0.4005848 0.5058480 0.03699399
8 0.01000000     12 0.3830409 0.4970760 0.03669750

Variable importance
            total_day_minutes              total_day_charge number_customer_service_calls 
                           18                            18                            10 
           total_intl_minutes             total_intl_charge              total_eve_charge 
                            8                             8                             8 
            total_eve_minutes            international_plan              total_intl_calls 
                            8                             7                             6 
        number_vmail_messages               voice_mail_plan             total_night_calls 
                            3                             3                             1 
              total_eve_calls 
                            1 

Node number 1: 2315 observations,    complexity param=0.07602339
  predicted class=no   expected loss=0.1477322  P(node) =1
    class counts:   342  1973
   probabilities: 0.148 0.852 
  left son=2 (144 obs) right son=3 (2171 obs)
  Primary splits:
      total_day_minutes             < 265.45 to the right, improve=60.145020, (0 missing)
      total_day_charge              < 45.125 to the right, improve=60.145020, (0 missing)
      number_customer_service_calls < 3.5    to the right, improve=53.641430, (0 missing)
      international_plan            splits as  RL,         improve=43.729370, (0 missing)
      voice_mail_plan               splits as  LR,         improve= 6.089388, (0 missing)
  Surrogate splits:
      total_day_charge < 45.125 to the right, agree=1, adj=1, (0 split)

Node number 2: 144 observations,    complexity param=0.07602339
  predicted class=yes  expected loss=0.4097222  P(node) =0.06220302
    class counts:    85    59
   probabilities: 0.590 0.410 
  left son=4 (110 obs) right son=5 (34 obs)
  Primary splits:
      voice_mail_plan       splits as  LR,         improve=19.884860, (0 missing)
      number_vmail_messages < 9.5    to the left,  improve=19.884860, (0 missing)
      total_eve_minutes     < 167.05 to the right, improve=14.540020, (0 missing)
      total_eve_charge      < 14.2   to the right, improve=14.540020, (0 missing)
      total_day_minutes     < 283.9  to the right, improve= 6.339827, (0 missing)
  Surrogate splits:
      number_vmail_messages < 9.5    to the left,  agree=1.000, adj=1.000, (0 split)
      total_night_minutes   < 110.3  to the right, agree=0.785, adj=0.088, (0 split)
      total_night_charge    < 4.965  to the right, agree=0.785, adj=0.088, (0 split)
      total_night_calls     < 50     to the right, agree=0.778, adj=0.059, (0 split)
      total_intl_minutes    < 15.3   to the left,  agree=0.771, adj=0.029, (0 split)

Node number 3: 2171 observations,    complexity param=0.0745614
  predicted class=no   expected loss=0.1183786  P(node) =0.937797
    class counts:   257  1914
   probabilities: 0.118 0.882 
  left son=6 (168 obs) right son=7 (2003 obs)
  Primary splits:
      number_customer_service_calls < 3.5    to the right, improve=56.398210, (0 missing)
      international_plan            splits as  RL,         improve=43.059160, (0 missing)
      total_day_minutes             < 224.15 to the right, improve=10.847440, (0 missing)
      total_day_charge              < 38.105 to the right, improve=10.847440, (0 missing)
      total_intl_minutes            < 13.15  to the right, improve= 6.347319, (0 missing)

Node number 4: 110 observations,    complexity param=0.02631579
  predicted class=yes  expected loss=0.2636364  P(node) =0.0475162
    class counts:    81    29
   probabilities: 0.736 0.264 
  left son=8 (67 obs) right son=9 (43 obs)
  Primary splits:
      total_eve_minutes   < 188.5  to the right, improve=16.419610, (0 missing)
      total_eve_charge    < 16.025 to the right, improve=16.419610, (0 missing)
      total_night_minutes < 206.85 to the right, improve= 5.350500, (0 missing)
      total_night_charge  < 9.305  to the right, improve= 5.350500, (0 missing)
      total_day_minutes   < 281.15 to the right, improve= 5.254545, (0 missing)
  Surrogate splits:
      total_eve_charge   < 16.025 to the right, agree=1.000, adj=1.000, (0 split)
      total_night_calls  < 82     to the right, agree=0.655, adj=0.116, (0 split)
      total_intl_minutes < 3.35   to the right, agree=0.636, adj=0.070, (0 split)
      total_intl_charge  < 0.905  to the right, agree=0.636, adj=0.070, (0 split)
      total_day_minutes  < 268.55 to the right, agree=0.627, adj=0.047, (0 split)

Node number 5: 34 observations
  predicted class=no   expected loss=0.1176471  P(node) =0.01468683
    class counts:     4    30
   probabilities: 0.118 0.882 

Node number 6: 168 observations,    complexity param=0.0745614
  predicted class=yes  expected loss=0.4880952  P(node) =0.07257019
    class counts:    86    82
   probabilities: 0.512 0.488 
  left son=12 (71 obs) right son=13 (97 obs)
  Primary splits:
      total_day_minutes             < 160.2  to the left,  improve=29.655880, (0 missing)
      total_day_charge              < 27.235 to the left,  improve=29.655880, (0 missing)
      total_eve_minutes             < 180.65 to the left,  improve= 8.556953, (0 missing)
      total_eve_charge              < 15.355 to the left,  improve= 8.556953, (0 missing)
      number_customer_service_calls < 4.5    to the right, improve= 5.975362, (0 missing)
  Surrogate splits:
      total_day_charge              < 27.235 to the left,  agree=1.000, adj=1.000, (0 split)
      total_night_calls             < 79     to the left,  agree=0.625, adj=0.113, (0 split)
      total_intl_calls              < 2.5    to the left,  agree=0.619, adj=0.099, (0 split)
      number_customer_service_calls < 4.5    to the right, agree=0.607, adj=0.070, (0 split)
      total_eve_calls               < 89.5   to the left,  agree=0.601, adj=0.056, (0 split)

Node number 7: 2003 observations,    complexity param=0.05555556
  predicted class=no   expected loss=0.08537194  P(node) =0.8652268
    class counts:   171  1832
   probabilities: 0.085 0.915 
  left son=14 (188 obs) right son=15 (1815 obs)
  Primary splits:
      international_plan splits as  RL,         improve=42.194510, (0 missing)
      total_day_minutes  < 224.15 to the right, improve=16.838410, (0 missing)
      total_day_charge   < 38.105 to the right, improve=16.838410, (0 missing)
      total_intl_minutes < 13.15  to the right, improve= 6.210678, (0 missing)
      total_intl_charge  < 3.55   to the right, improve= 6.210678, (0 missing)

Node number 8: 67 observations
  predicted class=yes  expected loss=0.04477612  P(node) =0.02894168
    class counts:    64     3
   probabilities: 0.955 0.045 

Node number 9: 43 observations,    complexity param=0.02046784
  predicted class=no   expected loss=0.3953488  P(node) =0.01857451
    class counts:    17    26
   probabilities: 0.395 0.605 
  left son=18 (19 obs) right son=19 (24 obs)
  Primary splits:
      total_day_minutes   < 282.7  to the right, improve=5.680947, (0 missing)
      total_day_charge    < 48.06  to the right, improve=5.680947, (0 missing)
      total_night_minutes < 212.65 to the right, improve=4.558140, (0 missing)
      total_night_charge  < 9.57   to the right, improve=4.558140, (0 missing)
      total_eve_minutes   < 145.4  to the right, improve=4.356169, (0 missing)
  Surrogate splits:
      total_day_charge   < 48.06  to the right, agree=1.000, adj=1.000, (0 split)
      total_day_calls    < 103    to the left,  agree=0.674, adj=0.263, (0 split)
      total_eve_calls    < 104.5  to the left,  agree=0.674, adj=0.263, (0 split)
      total_intl_minutes < 11.55  to the left,  agree=0.651, adj=0.211, (0 split)
      total_intl_charge  < 3.12   to the left,  agree=0.651, adj=0.211, (0 split)

Node number 12: 71 observations
  predicted class=yes  expected loss=0.1408451  P(node) =0.03066955
    class counts:    61    10
   probabilities: 0.859 0.141 

Node number 13: 97 observations,    complexity param=0.01754386
  predicted class=no   expected loss=0.257732  P(node) =0.04190065
    class counts:    25    72
   probabilities: 0.258 0.742 
  left son=26 (20 obs) right son=27 (77 obs)
  Primary splits:
      total_eve_minutes             < 155.5  to the left,  improve=7.753662, (0 missing)
      total_eve_charge              < 13.22  to the left,  improve=7.753662, (0 missing)
      total_intl_minutes            < 13.55  to the right, improve=2.366149, (0 missing)
      total_intl_charge             < 3.66   to the right, improve=2.366149, (0 missing)
      number_customer_service_calls < 4.5    to the right, improve=2.297667, (0 missing)
  Surrogate splits:
      total_eve_charge  < 13.22  to the left,  agree=1.000, adj=1.00, (0 split)
      total_night_calls < 143.5  to the right, agree=0.814, adj=0.10, (0 split)
      total_eve_calls   < 62     to the left,  agree=0.804, adj=0.05, (0 split)

Node number 14: 188 observations,    complexity param=0.05555556
  predicted class=no   expected loss=0.4042553  P(node) =0.0812095
    class counts:    76   112
   probabilities: 0.404 0.596 
  left son=28 (38 obs) right son=29 (150 obs)
  Primary splits:
      total_intl_calls   < 2.5    to the left,  improve=33.806520, (0 missing)
      total_intl_minutes < 13.1   to the right, improve=30.527050, (0 missing)
      total_intl_charge  < 3.535  to the right, improve=30.527050, (0 missing)
      total_day_minutes  < 221.95 to the right, improve= 3.386095, (0 missing)
      total_day_charge   < 37.735 to the right, improve= 3.386095, (0 missing)

Node number 15: 1815 observations,    complexity param=0.02339181
  predicted class=no   expected loss=0.0523416  P(node) =0.7840173
    class counts:    95  1720
   probabilities: 0.052 0.948 
  left son=30 (251 obs) right son=31 (1564 obs)
  Primary splits:
      total_day_minutes   < 224.15 to the right, improve=12.5649300, (0 missing)
      total_day_charge    < 38.105 to the right, improve=12.5649300, (0 missing)
      total_eve_minutes   < 244.95 to the right, improve= 4.7875890, (0 missing)
      total_eve_charge    < 20.825 to the right, improve= 4.7875890, (0 missing)
      total_night_minutes < 163.85 to the right, improve= 0.9074391, (0 missing)
  Surrogate splits:
      total_day_charge < 38.105 to the right, agree=1, adj=1, (0 split)

Node number 18: 19 observations
  predicted class=yes  expected loss=0.3157895  P(node) =0.008207343
    class counts:    13     6
   probabilities: 0.684 0.316 

Node number 19: 24 observations
  predicted class=no   expected loss=0.1666667  P(node) =0.01036717
    class counts:     4    20
   probabilities: 0.167 0.833 

Node number 26: 20 observations
  predicted class=yes  expected loss=0.35  P(node) =0.008639309
    class counts:    13     7
   probabilities: 0.650 0.350 

Node number 27: 77 observations
  predicted class=no   expected loss=0.1558442  P(node) =0.03326134
    class counts:    12    65
   probabilities: 0.156 0.844 

Node number 28: 38 observations
  predicted class=yes  expected loss=0  P(node) =0.01641469
    class counts:    38     0
   probabilities: 1.000 0.000 

Node number 29: 150 observations,    complexity param=0.05555556
  predicted class=no   expected loss=0.2533333  P(node) =0.06479482
    class counts:    38   112
   probabilities: 0.253 0.747 
  left son=58 (32 obs) right son=59 (118 obs)
  Primary splits:
      total_intl_minutes < 13.1   to the right, improve=45.356840, (0 missing)
      total_intl_charge  < 3.535  to the right, improve=45.356840, (0 missing)
      total_day_calls    < 95.5   to the left,  improve= 4.036407, (0 missing)
      total_day_minutes  < 237.75 to the right, improve= 1.879020, (0 missing)
      total_day_charge   < 40.42  to the right, improve= 1.879020, (0 missing)
  Surrogate splits:
      total_intl_charge < 3.535  to the right, agree=1.0, adj=1.000, (0 split)
      total_day_minutes < 52.45  to the left,  agree=0.8, adj=0.063, (0 split)
      total_day_charge  < 8.92   to the left,  agree=0.8, adj=0.063, (0 split)

Node number 30: 251 observations,    complexity param=0.02339181
  predicted class=no   expected loss=0.1992032  P(node) =0.1084233
    class counts:    50   201
   probabilities: 0.199 0.801 
  left son=60 (36 obs) right son=61 (215 obs)
  Primary splits:
      total_eve_minutes     < 259.8  to the right, improve=22.993380, (0 missing)
      total_eve_charge      < 22.08  to the right, improve=22.993380, (0 missing)
      voice_mail_plan       splits as  LR,         improve= 4.745664, (0 missing)
      number_vmail_messages < 7.5    to the left,  improve= 4.745664, (0 missing)
      total_night_minutes   < 181.15 to the right, improve= 3.509731, (0 missing)
  Surrogate splits:
      total_eve_charge < 22.08  to the right, agree=1, adj=1, (0 split)

Node number 31: 1564 observations
  predicted class=no   expected loss=0.02877238  P(node) =0.675594
    class counts:    45  1519
   probabilities: 0.029 0.971 

Node number 58: 32 observations
  predicted class=yes  expected loss=0  P(node) =0.01382289
    class counts:    32     0
   probabilities: 1.000 0.000 

Node number 59: 118 observations
  predicted class=no   expected loss=0.05084746  P(node) =0.05097192
    class counts:     6   112
   probabilities: 0.051 0.949 

Node number 60: 36 observations
  predicted class=yes  expected loss=0.2777778  P(node) =0.01555076
    class counts:    26    10
   probabilities: 0.722 0.278 

Node number 61: 215 observations
  predicted class=no   expected loss=0.1116279  P(node) =0.09287257
    class counts:    24   191
   probabilities: 0.112 0.888 

Summary

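We used the recursive partitioning tree from the rpart package to build a classification tree model. The algorithm consists of two steps: recursion and partitioning.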
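Partitioning uses a statistical criterion to evaluate candidate splits and divides the data into subsets based on the result. Once the child nodes are determined, the split is repeated on each of them until a stopping condition is met.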
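
For classification, rpart's default splitting criterion is the Gini index. As a worked illustration, the sketch below recomputes the improvement of the root split (total_day_minutes at 265.45) from the class counts printed in the summary output for nodes 1, 2, and 3. The helper name gini_n is introduced here only for illustration.

```r
# n times the Gini impurity of a node, given its (yes, no) class counts.
gini_n <- function(counts) {
  n <- sum(counts)
  p <- counts / n
  n * (1 - sum(p^2))
}

root  <- c(yes = 342, no = 1973)  # node 1
left  <- c(yes = 85,  no = 59)    # node 2: total_day_minutes >= 265.45
right <- c(yes = 257, no = 1914)  # node 3: total_day_minutes <  265.45

# Improvement = impurity of the parent minus impurity of the two children.
improve <- gini_n(root) - gini_n(left) - gini_n(right)
improve
```

The result is about 60.145, which agrees with the improve value listed for the first primary split of node 1.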
After loading rpart with library(), we use churn as the class label and all remaining variables as input features to build the classification model.
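
The call rpart(churn ~ ., data = trainset) relies on defaults. The sketch below spells out the main ones. It uses the kyphosis data set that ships with rpart so that it runs without the churn data; the values shown (method = "class", minsplit = 20, cp = 0.01, xval = 10) are rpart's documented defaults for a factor response.

```r
library(rpart)

# Fit with defaults, as done for churn in this article.
fit_default <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# The same fit with the main defaults written out explicitly.
fit_explicit <- rpart(
  Kyphosis ~ Age + Number + Start,
  data    = kyphosis,
  method  = "class",  # classification tree
  control = rpart.control(
    minsplit = 20,    # minimum observations in a node to attempt a split
    cp       = 0.01,  # minimum complexity improvement required for a split
    xval     = 10     # number of cross-validation folds
  )
)

# Both fits use the same sequence of split variables.
identical(as.character(fit_default$frame$var),
          as.character(fit_explicit$frame$var))
```

Raising cp or minsplit yields a smaller tree, while lowering them lets the tree grow deeper.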
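Typing churn.rp prints the details of each node. In this output, n is the number of observations in the node, loss is the number of observations in the node that are misclassified when the node predicts its majority class, yval is the predicted class (no or yes in this example), and yprob gives the proportion of each class in the node.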
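
To make these fields concrete, the following sketch rebuilds the printed line for node 2 from its class counts, assuming the default setting in which every misclassification costs the same. The variable names are chosen for illustration only.

```r
# Class counts (yes, no) for node 2, taken from the summary output.
counts <- c(yes = 85, no = 59)

n     <- sum(counts)               # observations in the node
yval  <- names(which.max(counts))  # predicted (majority) class
loss  <- n - max(counts)           # observations outside the majority class
yprob <- counts / n                # class proportions

n
yval
loss
round(yprob, 8)
```

These values correspond to the line "2) total_day_minutes>=265.45 144  59 yes (0.59027778 0.40972222)" in the printed tree.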
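Next, printcp prints the complexity parameter (cp) table. cp acts as a penalty that controls the size of the tree: the larger the cp value, the fewer the splits (nsplit). rel error is the training error of the tree relative to the error of the root node (the tree with no splits). xerror is the relative error estimated by 10-fold cross-validation, and xstd is the standard error of that cross-validated error.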
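
As a check on the meaning of rel error, the sketch below adds up the loss values of the 13 terminal nodes in the printed tree and divides by the loss at the root. The numbers are copied from the output in this article; the variable names are illustrative.

```r
# loss of each terminal node (marked with *), in the order printed:
# nodes 8, 18, 19, 5, 12, 26, 27, 28, 58, 59, 60, 61, 31
leaf_loss <- c(3, 6, 4, 4, 10, 7, 12, 0, 0, 6, 10, 24, 45)

root_loss <- 342   # misclassified observations at the root node
n_train   <- 2315  # size of the training set

rel_error   <- sum(leaf_loss) / root_loss  # error relative to the root
train_error <- sum(leaf_loss) / n_train    # absolute training error rate

rel_error
train_error
```

The relative error comes out at about 0.383, matching the last row of the cp table (nsplit = 12), and corresponds to a training error rate of roughly 5.7%.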
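To make the cp table easier to read, we plot it with plotcp. The bottom x-axis shows the cp value, the y-axis shows the cross-validated relative error, and the top x-axis shows the size of the tree. The dotted line marks one standard error above the minimum cross-validated error. In the printcp table above, the cross-validated error is lowest when the tree size is 12 (nsplit = 11).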
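
The same reading can be done numerically. The sketch below re-enters the printcp table from this article as a data frame (with a fitted model, the table is available as churn.rp$cptable), finds the row with the lowest xerror, and then applies the common one-standard-error rule. The names cp_table, best, and one_se are illustrative.

```r
# The cp table as printed by printcp(churn.rp) in this article.
cp_table <- data.frame(
  CP     = c(0.076023, 0.074561, 0.055556, 0.026316,
             0.023392, 0.020468, 0.017544, 0.010000),
  nsplit = c(0, 2, 4, 7, 8, 10, 11, 12),
  xerror = c(1.00000, 0.99708, 0.76023, 0.52632,
             0.52047, 0.50877, 0.47076, 0.47661),
  xstd   = c(0.049920, 0.049860, 0.044421, 0.037673,
             0.037481, 0.037092, 0.035788, 0.035993)
)

# Row with the smallest cross-validated error.
best      <- which.min(cp_table$xerror)
best_size <- cp_table$nsplit[best] + 1  # tree size = number of splits + 1

# One-standard-error rule: the smallest tree whose xerror is within
# one standard error of the minimum (the dotted line in plotcp).
threshold <- cp_table$xerror[best] + cp_table$xstd[best]
one_se    <- min(which(cp_table$xerror <= threshold))

cp_table[best, ]
best_size
one_se
```

For this table both criteria select the same row, the one with nsplit = 11, so the tree of size 12 is the natural candidate if the tree is later pruned.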
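The summary() function shows the function call, the cp table, the variable importance (scaled so that the values sum to 100), and detailed information for every node. This helps identify the most important variables in the classification tree.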
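
The variable importance can also be read directly from the fitted object. The sketch below uses the kyphosis data set bundled with rpart so that it runs without the churn data, and rescales the raw importance values to a total of 100, as summary() does. With the model from this article, churn.rp would take the place of fit.

```r
library(rpart)

# Small example tree on a data set that ships with rpart.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Raw importance values, sorted from most to least important.
imp <- fit$variable.importance

# Rescale so that the values sum to 100.
scaled <- 100 * imp / sum(imp)
round(scaled)
```

Because surrogate splits also contribute, a variable can appear in this list even if it is never used as a primary split in the tree.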
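Decision trees are flexible and easy to interpret, and they handle both classification and regression problems. They are nonparametric, so the user does not need to worry about whether the data are linearly separable. Their main weaknesses are a tendency toward bias and overfitting. Conditional inference trees address the bias problem, while overfitting can be reduced with random forests or by pruning the tree.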
