'data.frame': 3333 obs. of 20 variables:
$ state : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
$ account_length : int 128 107 137 84 75 118 121 147 117 141 ...
$ area_code : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
$ international_plan : Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 1 2 1 2 ...
$ voice_mail_plan : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 2 1 1 2 ...
$ number_vmail_messages : int 25 26 0 0 0 0 24 0 0 37 ...
$ total_day_minutes : num 265 162 243 299 167 ...
$ total_day_calls : int 110 123 114 71 113 98 88 79 97 84 ...
$ total_day_charge : num 45.1 27.5 41.4 50.9 28.3 ...
$ total_eve_minutes : num 197.4 195.5 121.2 61.9 148.3 ...
$ total_eve_calls : int 99 103 110 88 122 101 108 94 80 111 ...
$ total_eve_charge : num 16.78 16.62 10.3 5.26 12.61 ...
$ total_night_minutes : num 245 254 163 197 187 ...
$ total_night_calls : int 91 103 104 89 121 118 118 96 90 97 ...
$ total_night_charge : num 11.01 11.45 7.32 8.86 8.41 ...
$ total_intl_minutes : num 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
$ total_intl_calls : int 3 3 5 7 3 6 7 6 4 5 ...
$ total_intl_charge : num 2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
$ number_customer_service_calls: int 1 1 0 2 3 0 3 0 1 0 ...
$ churn : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
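# drop columns that are not used as predictors in this analysis
churnTrain = churnTrain[, !names(churnTrain) %in% c("state", "area_code", "account_length")]
# randomly assign each row to group 1 (about 70%) or group 2 (about 30%)
ind = sample(2, nrow(churnTrain), replace = TRUE, prob = c(0.7, 0.3))
trainset = churnTrain[ind == 1, ]  # training set
testset = churnTrain[ind == 2, ]   # test set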
dim(trainset)
[1] 2315 17
dim(testset)
[1] 1018 17
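Because `sample()` draws at random, the row counts above change from run to run unless the random seed is fixed. A minimal sketch of the same 70/30 assignment on a small synthetic data frame follows; the data frame `toy` and the seed value are illustrative assumptions, not part of the original session.

```r
# Hypothetical stand-in data; the original session uses churnTrain instead
toy <- data.frame(x = 1:1000, y = rep(c("yes", "no"), times = 500))

set.seed(123)  # assumed seed, chosen only to make the split repeatable
ind <- sample(2, nrow(toy), replace = TRUE, prob = c(0.7, 0.3))

toy_train <- toy[ind == 1, ]
toy_test  <- toy[ind == 2, ]

# Every row lands in exactly one of the two sets
nrow(toy_train) + nrow(toy_test) == nrow(toy)

# The realised share is close to, but rarely exactly, 0.7
nrow(toy_train) / nrow(toy)
```

In the session above the realised training share is 2315 / 3333, about 0.695.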
library(rpart)
churn.rp = rpart(churn ~ ., data = trainset)  # classification tree on all remaining predictors
churn.rp  # print the fitted tree
n= 2315
node), split, n, loss, yval, (yprob)
* denotes terminal node
 1) root 2315 342 no (0.14773218 0.85226782)
   2) total_day_minutes>=265.45 144 59 yes (0.59027778 0.40972222)
     4) voice_mail_plan=no 110 29 yes (0.73636364 0.26363636)
       8) total_eve_minutes>=188.5 67 3 yes (0.95522388 0.04477612) *
       9) total_eve_minutes< 188.5 43 17 no (0.39534884 0.60465116)
        18) total_day_minutes>=282.7 19 6 yes (0.68421053 0.31578947) *
        19) total_day_minutes< 282.7 24 4 no (0.16666667 0.83333333) *
     5) voice_mail_plan=yes 34 4 no (0.11764706 0.88235294) *
   3) total_day_minutes< 265.45 2171 257 no (0.11837863 0.88162137)
     6) number_customer_service_calls>=3.5 168 82 yes (0.51190476 0.48809524)
      12) total_day_minutes< 160.2 71 10 yes (0.85915493 0.14084507) *
      13) total_day_minutes>=160.2 97 25 no (0.25773196 0.74226804)
        26) total_eve_minutes< 155.5 20 7 yes (0.65000000 0.35000000) *
        27) total_eve_minutes>=155.5 77 12 no (0.15584416 0.84415584) *
     7) number_customer_service_calls< 3.5 2003 171 no (0.08537194 0.91462806)
      14) international_plan=yes 188 76 no (0.40425532 0.59574468)
        28) total_intl_calls< 2.5 38 0 yes (1.00000000 0.00000000) *
        29) total_intl_calls>=2.5 150 38 no (0.25333333 0.74666667)
          58) total_intl_minutes>=13.1 32 0 yes (1.00000000 0.00000000) *
          59) total_intl_minutes< 13.1 118 6 no (0.05084746 0.94915254) *
      15) international_plan=no 1815 95 no (0.05234160 0.94765840)
        30) total_day_minutes>=224.15 251 50 no (0.19920319 0.80079681)
          60) total_eve_minutes>=259.8 36 10 yes (0.72222222 0.27777778) *
          61) total_eve_minutes< 259.8 215 24 no (0.11162791 0.88837209) *
        31) total_day_minutes< 224.15 1564 45 no (0.02877238 0.97122762) *
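Each line of the printed tree follows the legend `node), split, n, loss, yval, (yprob)`. As a small check of how those fields relate, the sketch below recomputes the class probabilities of node 2 from its `n` and `loss`, and derives child and parent node numbers. The helper names `children` and `parent` are hypothetical and used only for illustration.

```r
# Node 2 from the printed tree: 144 observations, 59 of them misclassified
n    <- 144
loss <- 59

# yval is "yes", so the 85 correctly classified rows belong to the "yes" class
yprob <- c(yes = (n - loss) / n, no = loss / n)
round(yprob, 8)

# rpart numbers nodes so that node k has children 2k and 2k + 1
children <- function(k) c(left = 2 * k, right = 2 * k + 1)
parent   <- function(k) k %/% 2

children(29)  # nodes 58 and 59
parent(61)    # node 30
```

The same numbering explains why the node labels jump from 9 to 18 and from 29 to 58 in the output.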
printcp(churn.rp)
Classification tree:
rpart(formula = churn ~ ., data = trainset)
Variables actually used in tree construction:
[1] international_plan number_customer_service_calls
[3] total_day_minutes total_eve_minutes
[5] total_intl_calls total_intl_minutes
[7] voice_mail_plan
Root node error: 342/2315 = 0.14773
n= 2315
        CP nsplit rel error  xerror     xstd
1 0.076023      0   1.00000 1.00000 0.049920
2 0.074561      2   0.84795 0.99708 0.049860
3 0.055556      4   0.69883 0.76023 0.044421
4 0.026316      7   0.49415 0.52632 0.037673
5 0.023392      8   0.46784 0.52047 0.037481
6 0.020468     10   0.42105 0.50877 0.037092
7 0.017544     11   0.40058 0.47076 0.035788
8 0.010000     12   0.38304 0.47661 0.035993
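The `rel error` and `xerror` columns are expressed relative to the root node error, so multiplying them by 342/2315 gives absolute error rates. A minimal sketch with the table values typed in by hand follows; the object name `cp_table` is an assumption, and with the fitted model available `churn.rp$cptable` should hold the same table.

```r
# Values copied from the printcp output above
cp_table <- data.frame(
  CP        = c(0.076023, 0.074561, 0.055556, 0.026316,
                0.023392, 0.020468, 0.017544, 0.010000),
  nsplit    = c(0, 2, 4, 7, 8, 10, 11, 12),
  rel_error = c(1.00000, 0.84795, 0.69883, 0.49415,
                0.46784, 0.42105, 0.40058, 0.38304),
  xerror    = c(1.00000, 0.99708, 0.76023, 0.52632,
                0.52047, 0.50877, 0.47076, 0.47661)
)

root_error <- 342 / 2315  # misclassification rate of the root-only tree

# Absolute training and cross-validated error rates
cp_table$train_error <- cp_table$rel_error * root_error
cp_table$cv_error    <- cp_table$xerror * root_error

# Misclassified training rows implied by the 12-split tree
round(cp_table$rel_error[8] * 342)  # 131, the sum of the leaf losses
```

Read this way, the 12-split tree misclassifies 131 of the 2315 training rows (about 5.7%), while its cross-validated error rate is about 7.0%.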
summary(churn.rp)
Call:
rpart(formula = churn ~ ., data = trainset)
n= 2315
          CP nsplit rel error    xerror       xstd
1 0.07602339      0 1.0000000 1.0000000 0.04992005
2 0.07456140      2 0.8479532 0.9590643 0.04906076
3 0.05555556      4 0.6988304 0.7953216 0.04530196
4 0.02631579      7 0.4941520 0.5233918 0.03757730
5 0.02339181      8 0.4678363 0.5263158 0.03767329
6 0.02046784     10 0.4210526 0.5175439 0.03738427
7 0.01754386     11 0.4005848 0.5058480 0.03699399
8 0.01000000     12 0.3830409 0.4970760 0.03669750
Variable importance
            total_day_minutes              total_day_charge number_customer_service_calls
                           18                            18                            10
           total_intl_minutes             total_intl_charge              total_eve_charge
                            8                             8                             8
            total_eve_minutes            international_plan              total_intl_calls
                            8                             7                             6
        number_vmail_messages               voice_mail_plan             total_night_calls
                            3                             3                             1
Node number 1: 2315 observations, complexity param=0.07602339
predicted class=no expected loss=0.1477322 P(node) =1
class counts: 342 1973
probabilities: 0.148 0.852
left son=2 (144 obs) right son=3 (2171 obs)
Primary splits:
total_day_minutes < 265.45 to the right, improve=60.145020, (0 missing)
total_day_charge < 45.125 to the right, improve=60.145020, (0 missing)
number_customer_service_calls < 3.5 to the right, improve=53.641430, (0 missing)
international_plan splits as RL, improve=43.729370, (0 missing)
voice_mail_plan splits as LR, improve= 6.089388, (0 missing)
Surrogate splits:
total_day_charge < 45.125 to the right, agree=1, adj=1, (0 split)
Node number 2: 144 observations, complexity param=0.07602339
predicted class=yes expected loss=0.4097222 P(node) =0.06220302
class counts: 85 59
probabilities: 0.590 0.410
left son=4 (110 obs) right son=5 (34 obs)
Primary splits:
voice_mail_plan splits as LR, improve=19.884860, (0 missing)
number_vmail_messages < 9.5 to the left, improve=19.884860, (0 missing)
total_eve_minutes < 167.05 to the right, improve=14.540020, (0 missing)
total_eve_charge < 14.2 to the right, improve=14.540020, (0 missing)
total_day_minutes < 283.9 to the right, improve= 6.339827, (0 missing)
Surrogate splits:
number_vmail_messages < 9.5 to the left, agree=1.000, adj=1.000, (0 split)
total_night_minutes < 110.3 to the right, agree=0.785, adj=0.088, (0 split)
total_night_charge < 4.965 to the right, agree=0.785, adj=0.088, (0 split)
total_night_calls < 50 to the right, agree=0.778, adj=0.059, (0 split)
total_intl_minutes < 15.3 to the left, agree=0.771, adj=0.029, (0 split)
Node number 3: 2171 observations, complexity param=0.0745614
predicted class=no expected loss=0.1183786 P(node) =0.937797
class counts: 257 1914
probabilities: 0.118 0.882
left son=6 (168 obs) right son=7 (2003 obs)
Primary splits:
number_customer_service_calls < 3.5 to the right, improve=56.398210, (0 missing)
international_plan splits as RL, improve=43.059160, (0 missing)
total_day_minutes < 224.15 to the right, improve=10.847440, (0 missing)
total_day_charge < 38.105 to the right, improve=10.847440, (0 missing)
total_intl_minutes < 13.15 to the right, improve= 6.347319, (0 missing)
Node number 4: 110 observations, complexity param=0.02631579
predicted class=yes expected loss=0.2636364 P(node) =0.0475162
class counts: 81 29
probabilities: 0.736 0.264
left son=8 (67 obs) right son=9 (43 obs)
Primary splits:
total_eve_minutes < 188.5 to the right, improve=16.419610, (0 missing)
total_eve_charge < 16.025 to the right, improve=16.419610, (0 missing)
total_night_minutes < 206.85 to the right, improve= 5.350500, (0 missing)
total_night_charge < 9.305 to the right, improve= 5.350500, (0 missing)
total_day_minutes < 281.15 to the right, improve= 5.254545, (0 missing)
Surrogate splits:
total_eve_charge < 16.025 to the right, agree=1.000, adj=1.000, (0 split)
total_night_calls < 82 to the right, agree=0.655, adj=0.116, (0 split)
total_intl_minutes < 3.35 to the right, agree=0.636, adj=0.070, (0 split)
total_intl_charge < 0.905 to the right, agree=0.636, adj=0.070, (0 split)
total_day_minutes < 268.55 to the right, agree=0.627, adj=0.047, (0 split)
Node number 5: 34 observations
predicted class=no expected loss=0.1176471 P(node) =0.01468683
class counts: 4 30
probabilities: 0.118 0.882
Node number 6: 168 observations, complexity param=0.0745614
predicted class=yes expected loss=0.4880952 P(node) =0.07257019
class counts: 86 82
probabilities: 0.512 0.488
left son=12 (71 obs) right son=13 (97 obs)
Primary splits:
total_day_minutes < 160.2 to the left, improve=29.655880, (0 missing)
total_day_charge < 27.235 to the left, improve=29.655880, (0 missing)
total_eve_minutes < 180.65 to the left, improve= 8.556953, (0 missing)
total_eve_charge < 15.355 to the left, improve= 8.556953, (0 missing)
number_customer_service_calls < 4.5 to the right, improve= 5.975362, (0 missing)
Surrogate splits:
total_day_charge < 27.235 to the left, agree=1.000, adj=1.000, (0 split)
total_night_calls < 79 to the left, agree=0.625, adj=0.113, (0 split)
total_intl_calls < 2.5 to the left, agree=0.619, adj=0.099, (0 split)
number_customer_service_calls < 4.5 to the right, agree=0.607, adj=0.070, (0 split)
total_eve_calls < 89.5 to the left, agree=0.601, adj=0.056, (0 split)
Node number 7: 2003 observations, complexity param=0.05555556
predicted class=no expected loss=0.08537194 P(node) =0.8652268
class counts: 171 1832
probabilities: 0.085 0.915
left son=14 (188 obs) right son=15 (1815 obs)
Primary splits:
international_plan splits as RL, improve=42.194510, (0 missing)
total_day_minutes < 224.15 to the right, improve=16.838410, (0 missing)
total_day_charge < 38.105 to the right, improve=16.838410, (0 missing)
total_intl_minutes < 13.15 to the right, improve= 6.210678, (0 missing)
total_intl_charge < 3.55 to the right, improve= 6.210678, (0 missing)
Node number 8: 67 observations
predicted class=yes expected loss=0.04477612 P(node) =0.02894168
class counts: 64 3
probabilities: 0.955 0.045
Node number 9: 43 observations, complexity param=0.02046784
predicted class=no expected loss=0.3953488 P(node) =0.01857451
class counts: 17 26
probabilities: 0.395 0.605
left son=18 (19 obs) right son=19 (24 obs)
Primary splits:
total_day_minutes < 282.7 to the right, improve=5.680947, (0 missing)
total_day_charge < 48.06 to the right, improve=5.680947, (0 missing)
total_night_minutes < 212.65 to the right, improve=4.558140, (0 missing)
total_night_charge < 9.57 to the right, improve=4.558140, (0 missing)
total_eve_minutes < 145.4 to the right, improve=4.356169, (0 missing)
Surrogate splits:
total_day_charge < 48.06 to the right, agree=1.000, adj=1.000, (0 split)
total_day_calls < 103 to the left, agree=0.674, adj=0.263, (0 split)
total_eve_calls < 104.5 to the left, agree=0.674, adj=0.263, (0 split)
total_intl_minutes < 11.55 to the left, agree=0.651, adj=0.211, (0 split)
total_intl_charge < 3.12 to the left, agree=0.651, adj=0.211, (0 split)
Node number 12: 71 observations
predicted class=yes expected loss=0.1408451 P(node) =0.03066955
class counts: 61 10
probabilities: 0.859 0.141
Node number 13: 97 observations, complexity param=0.01754386
predicted class=no expected loss=0.257732 P(node) =0.04190065
class counts: 25 72
probabilities: 0.258 0.742
left son=26 (20 obs) right son=27 (77 obs)
Primary splits:
total_eve_minutes < 155.5 to the left, improve=7.753662, (0 missing)
total_eve_charge < 13.22 to the left, improve=7.753662, (0 missing)
total_intl_minutes < 13.55 to the right, improve=2.366149, (0 missing)
total_intl_charge < 3.66 to the right, improve=2.366149, (0 missing)
number_customer_service_calls < 4.5 to the right, improve=2.297667, (0 missing)
Surrogate splits:
total_eve_charge < 13.22 to the left, agree=1.000, adj=1.00, (0 split)
total_night_calls < 143.5 to the right, agree=0.814, adj=0.10, (0 split)
total_eve_calls < 62 to the left, agree=0.804, adj=0.05, (0 split)
Node number 14: 188 observations, complexity param=0.05555556
predicted class=no expected loss=0.4042553 P(node) =0.0812095
class counts: 76 112
probabilities: 0.404 0.596
left son=28 (38 obs) right son=29 (150 obs)
Primary splits:
total_intl_calls < 2.5 to the left, improve=33.806520, (0 missing)
total_intl_minutes < 13.1 to the right, improve=30.527050, (0 missing)
total_intl_charge < 3.535 to the right, improve=30.527050, (0 missing)
total_day_minutes < 221.95 to the right, improve= 3.386095, (0 missing)
total_day_charge < 37.735 to the right, improve= 3.386095, (0 missing)
Node number 15: 1815 observations, complexity param=0.02339181
predicted class=no expected loss=0.0523416 P(node) =0.7840173
class counts: 95 1720
probabilities: 0.052 0.948
left son=30 (251 obs) right son=31 (1564 obs)
Primary splits:
total_day_minutes < 224.15 to the right, improve=12.5649300, (0 missing)
total_day_charge < 38.105 to the right, improve=12.5649300, (0 missing)
total_eve_minutes < 244.95 to the right, improve= 4.7875890, (0 missing)
total_eve_charge < 20.825 to the right, improve= 4.7875890, (0 missing)
total_night_minutes < 163.85 to the right, improve= 0.9074391, (0 missing)
Surrogate splits:
total_day_charge < 38.105 to the right, agree=1, adj=1, (0 split)
Node number 18: 19 observations
predicted class=yes expected loss=0.3157895 P(node) =0.008207343
class counts: 13 6
probabilities: 0.684 0.316
Node number 19: 24 observations
predicted class=no expected loss=0.1666667 P(node) =0.01036717
class counts: 4 20
probabilities: 0.167 0.833
Node number 26: 20 observations
predicted class=yes expected loss=0.35 P(node) =0.008639309
class counts: 13 7
probabilities: 0.650 0.350
Node number 27: 77 observations
predicted class=no expected loss=0.1558442 P(node) =0.03326134
class counts: 12 65
probabilities: 0.156 0.844
Node number 28: 38 observations
predicted class=yes expected loss=0 P(node) =0.01641469
class counts: 38 0
probabilities: 1.000 0.000
Node number 29: 150 observations, complexity param=0.05555556
predicted class=no expected loss=0.2533333 P(node) =0.06479482
class counts: 38 112
probabilities: 0.253 0.747
left son=58 (32 obs) right son=59 (118 obs)
Primary splits:
total_intl_minutes < 13.1 to the right, improve=45.356840, (0 missing)
total_intl_charge < 3.535 to the right, improve=45.356840, (0 missing)
total_day_calls < 95.5 to the left, improve= 4.036407, (0 missing)
total_day_minutes < 237.75 to the right, improve= 1.879020, (0 missing)
total_day_charge < 40.42 to the right, improve= 1.879020, (0 missing)
Surrogate splits:
total_intl_charge < 3.535 to the right, agree=1.0, adj=1.000, (0 split)
total_day_minutes < 52.45 to the left, agree=0.8, adj=0.063, (0 split)
total_day_charge < 8.92 to the left, agree=0.8, adj=0.063, (0 split)
Node number 30: 251 observations, complexity param=0.02339181
predicted class=no expected loss=0.1992032 P(node) =0.1084233
class counts: 50 201
probabilities: 0.199 0.801
left son=60 (36 obs) right son=61 (215 obs)
Primary splits:
total_eve_minutes < 259.8 to the right, improve=22.993380, (0 missing)
total_eve_charge < 22.08 to the right, improve=22.993380, (0 missing)
voice_mail_plan splits as LR, improve= 4.745664, (0 missing)
number_vmail_messages < 7.5 to the left, improve= 4.745664, (0 missing)
total_night_minutes < 181.15 to the right, improve= 3.509731, (0 missing)
Surrogate splits:
total_eve_charge < 22.08 to the right, agree=1, adj=1, (0 split)
Node number 31: 1564 observations
predicted class=no expected loss=0.02877238 P(node) =0.675594
class counts: 45 1519
probabilities: 0.029 0.971
Node number 58: 32 observations
predicted class=yes expected loss=0 P(node) =0.01382289
class counts: 32 0
probabilities: 1.000 0.000
Node number 59: 118 observations
predicted class=no expected loss=0.05084746 P(node) =0.05097192
class counts: 6 112
probabilities: 0.051 0.949
Node number 60: 36 observations
predicted class=yes expected loss=0.2777778 P(node) =0.01555076
class counts: 26 10
probabilities: 0.722 0.278
Node number 61: 215 observations
predicted class=no expected loss=0.1116279 P(node) =0.09287257
class counts: 24 191
probabilities: 0.112 0.888
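Several quantities in the node summaries can be reproduced by hand. The sketch below assumes the default Gini splitting criterion for classification, under which `improve` equals the node size times the drop in impurity. It also infers the surrogate agreement count (113 of 144) from the rounded `agree=0.785` reported for node 2. The function name `gini` and the variable names are hypothetical.

```r
# Gini impurity of a node, given its class counts
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Node 1 and its children (class counts: yes, no)
root  <- c(342, 1973)
left  <- c(85, 59)     # node 2
right <- c(257, 1914)  # node 3

# improve = n * impurity(parent) - sum over children of n_child * impurity(child)
improve <- sum(root) * gini(root) -
  (sum(left) * gini(left) + sum(right) * gini(right))
round(improve, 3)  # 60.145, matching the first primary split of node 1

# P(node) is the share of all training rows that reach the node
sum(left) / sum(root)  # 0.06220302 for node 2

# Surrogate adj for node 2: gain over always following the majority direction
n_node     <- 144
n_majority <- 110  # rows sent to the larger child (node 4)
n_agree    <- 113  # inferred from agree=0.785, not printed in the output
agree <- n_agree / n_node
adj   <- (n_agree - n_majority) / (n_node - n_majority)
round(c(agree = agree, adj = adj), 3)  # 0.785 and 0.088
```

Surrogates with `adj=1`, such as `total_day_charge` for `total_day_minutes`, reproduce the primary split exactly, which suggests that the charge columns are deterministic functions of the corresponding minutes columns.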
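Next, use the printcp function to print the model's complexity parameter (CP) table. The complexity parameter acts as a penalty that controls the size of the tree: the larger the CP value, the fewer the splits (nsplit). The rel error column gives the error of the current tree relative to the error of the root-only (null) tree, xerror is the relative error estimated by 10-fold cross-validation, and xstd is the standard error of that cross-validated relative error.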
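One way the CP table is typically used is to pick a complexity value and prune the tree back to it. The following is a minimal sketch, not the original session: it uses the `kyphosis` data that ships with rpart so that it runs on its own, and both the seed value and the choice of the CP with the lowest `xerror` are assumptions.

```r
library(rpart)

set.seed(1)  # assumed seed; xerror depends on the random cross-validation folds
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

printcp(fit)  # columns: CP, nsplit, rel error, xerror, xstd

# The table is also stored on the fitted object as a matrix
cp_tab <- fit$cptable

# Take the CP value of the row with the smallest cross-validated error
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]

# Prune the tree back to that complexity
pruned <- prune(fit, cp = best_cp)
nrow(pruned$frame) <= nrow(fit$frame)  # a pruned tree never has more nodes
```

For the churn model, the same calls would be applied to `churn.rp` in place of `fit`.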