Midterm Assignment Reference
Table of Contents
- 1. Supervised binning with the minimum description length (MDL) criterion
- 2. Association analysis, with selected rules printed
- 3. Prediction with Fisher linear discriminant analysis
- 4. New variables from principal component / factor analysis, followed by discriminant analysis
- 5. Generalized linear regression and prediction on the test set
1 Supervised binning with the minimum description length (MDL) criterion
bank <- read.table("data/bank/bank.csv", header = TRUE, sep = ";")
m <- ncol(bank)
factors <- lapply(bank, is.factor)
iffactor <- as.logical(factors)
bank.factors <- (1:m)[iffactor]      # column indices of the factor variables
bank.int <- (1:m)[!iffactor]         # column indices of the numeric variables
bankint <- bank[, c(bank.int, m)]    # numeric variables plus the response y, saved as bankint
bankfac <- bank[, iffactor]
summary(bankfac)
          job           marital         education     default     housing
 management :969   divorced: 528   primary  : 678   no :4445   no :1962
 blue-collar:946   married :2797   secondary:2306   yes:  76   yes:2559
 technician :768   single  :1196   tertiary :1350
 admin.     :478                   unknown  : 187
 services   :417
 retired    :230
 (Other)    :713

  loan           contact          month        poutcome       y
 no :3830   cellular :2896   may    :1398   failure: 490   no :4000
 yes: 691   telephone: 301   jul    : 706   other  : 197   yes: 521
            unknown  :1324   aug    : 633   success: 129
                             jun    : 531   unknown:3705
                             nov    : 389
                             apr    : 293
                             (Other): 571
Use mdlp for supervised binning: discretize the continuous variables, convert the results to factors, and substitute them for the original continuous columns. The data frame after replacement is called banknew.
library(discretization)
bankintDisc <- mdlp(bankint)$Disc.data                      # supervised (MDL) binning
bankintFa <- as.data.frame(lapply(bankintDisc, as.factor))  # turn the binned variables into factors, collected in a data frame
banknew <- bank                                             # start from a copy of the original data
p <- ncol(bankintFa)
banknew[, bank.int] <- bankintFa[, -p]                      # replace the numeric columns with the binned factor variables
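As a quick check (not part of the original script), one can look at where mdlp placed its cut points: the mdlp() result also carries a cutp component, one entry per numeric column. A minimal sketch, assuming the result is stored once instead of recomputed:

disc <- mdlp(bankint)                            # sketch: keep the full mdlp() result
cuts <- disc$cutp                                # list of cut points, one entry per numeric column
names(cuts) <- names(bankint)[-ncol(bankint)]    # label the entries with the variable names
cuts                                             # inspect where each continuous variable was split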
2 Association analysis, with selected rules printed
First, split the data set into a training set and a test set.
## split the data set into a training set and a test set
set.seed(1314)
id <- sample(1:4521, 2712)
bank.train <- banknew[id, ]
bank.test <- bank[-id, ]
- Run association analysis on the training set
library("arules") banktransac <- as(bank.train, "transactions") rules <- apriori(banktransac,parameter = list(support = 0.1, confidence = 0.6)) summary(rules)
parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
        0.6    0.1    1 none FALSE            TRUE     0.1      1     10  rules FALSE

algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[64 item(s), 2712 transaction(s)] done [0.00s].
sorting and recoding items ... [36 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.27s].
writing ... [271827 rule(s)] done [0.04s].
creating S4 object  ... done [0.11s].

set of 271827 rules

rule length distribution (lhs + rhs):sizes
    1     2     3     4     5     6     7     8     9    10
   11   341  2977 13135 34796 59692 68851 54041 28443  9540

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   6.000   7.000   6.842   8.000  10.000

summary of quality measures:
    support         confidence          lift
 Min.   :0.1003   Min.   :0.6000   Min.   :0.7879
 1st Qu.:0.1136   1st Qu.:0.8701   1st Qu.:1.0000
 Median :0.1364   Median :0.9904   Median :1.0164
 Mean   :0.1673   Mean   :0.9282   Mean   :1.0889
 3rd Qu.:0.1881   3rd Qu.:1.0000   3rd Qu.:1.2147
 Max.   :1.0000   Max.   :1.0000   Max.   :5.4567

mining info:
        data ntransactions support confidence
 banktransac          2712     0.1        0.6
## select the rules whose consequent is "y=no" (and lift > 1.13)
rulesyno <- subset(rules, subset = rhs %in% "y=no" & lift > 1.13)
summary(rulesyno)
## look at the general form of the selected rules, sorted by lift
inspect(head(sort(rulesyno, by = "lift"), n = 10))
## save all of the selected rules to a text file
write(rulesyno, file = "data/bank/bankyapriorirules.csv", sep = ",")
set of 319 rules

rule length distribution (lhs + rhs):sizes
 3  4  5  6  7  8  9 10
 3 18 48 76 80 58 28  8

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  3.000   6.000   7.000   6.687   8.000  10.000

summary of quality measures:
    support         confidence      lift
 Min.   :0.1010   Min.   :1    Min.   :1.132
 1st Qu.:0.1034   1st Qu.:1    1st Qu.:1.132
 Median :0.1125   Median :1    Median :1.132
 Mean   :0.1118   Mean   :1    Mean   :1.132
 3rd Qu.:0.1206   3rd Qu.:1    3rd Qu.:1.132
 Max.   :0.1239   Max.   :1    Max.   :1.132

mining info:
        data ntransactions support confidence
 banktransac          2712     0.1        0.6

   lhs                                           rhs      support   confidence lift
1  {duration=1, poutcome=unknown}             => {y=no}   0.1231563 1          1.131886
2  {duration=1, previous=1}                   => {y=no}   0.1231563 1          1.131886
3  {duration=1, pdays=1}                      => {y=no}   0.1238938 1          1.131886
4  {duration=1, previous=1, poutcome=unknown} => {y=no}   0.1231563 1          1.131886
5  {duration=1, pdays=1, poutcome=unknown}    => {y=no}   0.1231563 1          1.131886
6  {loan=no, duration=1, poutcome=unknown}    => {y=no}   0.1039823 1          1.131886
7  {age=1, duration=1, poutcome=unknown}      => {y=no}   0.1224189 1          1.131886
8  {default=no, duration=1, poutcome=unknown} => {y=no}   0.1205752 1          1.131886
9  {duration=1, campaign=1, poutcome=unknown} => {y=no}   0.1231563 1          1.131886
10 {day=1, duration=1, poutcome=unknown}      => {y=no}   0.1231563 1          1.131886
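The same mining step could be repeated for rules that predict the rare outcome y=yes. A minimal sketch, with illustrative thresholds (the support 0.03 and confidence 0.5 values are not from the original analysis) and an appearance constraint so that only y=yes may occur in the consequent:

## sketch: rules with y=yes as the consequent, illustrative thresholds
rulesyyes <- apriori(banktransac,
                     parameter  = list(support = 0.03, confidence = 0.5),
                     appearance = list(rhs = "y=yes", default = "lhs"))
inspect(head(sort(rulesyyes, by = "lift"), n = 5))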
3 Prediction with Fisher linear discriminant analysis
## build the discriminant rule on the training set
require("MASS")
z <- lda(y ~ ., bankint, prior = c(1, 1)/2, subset = id)
z.predict <- predict(z, bank.test)
predicty <- z.predict$class
truey <- bank.test$y
table(predicty, truey)
Loading required package: MASS

        truey
predicty   no  yes
     no  1389   71
     yes  215  134
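To make the comparisons between the classifiers below concrete, a small helper (not part of the original script) can turn such a confusion matrix into overall accuracy and the recall of the "yes" class:

## sketch: summary measures from a confusion matrix (rows = predicted, columns = true)
evalTable <- function(tab) {
  c(accuracy   = sum(diag(tab)) / sum(tab),
    yes.recall = tab["yes", "yes"] / sum(tab[, "yes"]))
}
evalTable(table(predicty, truey))

For the matrix above this gives an overall accuracy of roughly 0.84 and a "yes" recall of roughly 0.65 (134 of the 205 true "yes" cases are found).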
4 New variables from principal component / factor analysis, followed by discriminant analysis
Here we illustrate with principal components.
bankinttrain <- bankint[id, ]
p <- ncol(bankint)
trainpr <- princomp(bankinttrain[, -p], cor = TRUE, loadings = TRUE)
summary(trainpr)
Importance of components:
                          Comp.1    Comp.2    Comp.3    Comp.4    Comp.5
Standard deviation     1.2904231 1.0802812 1.0501263 0.9857049 0.9408610
Proportion of Variance 0.2378845 0.1667153 0.1575379 0.1388020 0.1264599
Cumulative Proportion  0.2378845 0.4045999 0.5621378 0.7009398 0.8273997
                          Comp.6     Comp.7
Standard deviation     0.8991575 0.63223248
Proportion of Variance 0.1154978 0.05710256
Cumulative Proportion  0.9428974 1.00000000
- Select the first five principal components as the new variables
loading5 <- as.matrix(trainpr$loadings)[, 1:5]          # loadings of the first five components
components <- as.matrix(bankinttrain[, -p]) %*% loading5
bankprtrain <- data.frame(components, y = bankint[id, 8])
# the test set needs the same transformation
components <- as.matrix(bankint[-id, -p]) %*% loading5
bankprtest <- data.frame(components, y = bankint[-id, 8])
# head(bankprtrain)
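Note that with cor = TRUE the loadings are computed from the standardized variables, while the matrix products above use the raw columns. A sketch of the more conventional score computation, via predict() on the princomp object (it applies the stored centering and scaling, so it would give numbers different from the results reported below):

## sketch: component scores with princomp's own centering and scaling
scoretrain <- predict(trainpr)[, 1:5]                              # training-set scores
scoretest  <- predict(trainpr, newdata = bankint[-id, -p])[, 1:5]  # same transformation on the test set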
- Fit a new Fisher linear discriminant analysis on the training set, then apply it to the transformed test set
z <- lda(y ~ ., data = bankprtrain)
prpredic <- predict(z, newdata = bankprtest)
predicy <- prpredic$class
truey <- bankprtest$y
table(predicy, truey)
       truey
predicy   no  yes
    no  1565  158
    yes   39   47
The results are worse than applying discriminant analysis directly to the original variables (far fewer true "yes" customers are identified), so this direction is not pursued further.
5 Generalized linear regression and prediction on the test set
bankinttest <- bankint[-id, ]
m <- glm(y ~ ., family = "binomial", subset = id, data = bankint)
step(m)   # stepwise selection of the best subset of variables
Start:  AIC=1611.09
y ~ age + balance + day + duration + campaign + pdays + previous

           Df Deviance    AIC
- day       1   1595.5 1609.5
- balance   1   1596.6 1610.6
- age       1   1596.7 1610.7
<none>          1595.1 1611.1
- pdays     1   1602.3 1616.3
- campaign  1   1602.9 1616.9
- previous  1   1604.3 1618.3
- duration  1   1905.0 1919.0

Step:  AIC=1609.46
y ~ age + balance + duration + campaign + pdays + previous

           Df Deviance    AIC
- balance   1   1596.9 1608.9
- age       1   1597.0 1609.0
<none>          1595.5 1609.5
- pdays     1   1602.5 1614.5
- campaign  1   1602.9 1614.9
- previous  1   1604.6 1616.6
- duration  1   1905.1 1917.1

Step:  AIC=1608.89
y ~ age + duration + campaign + pdays + previous

           Df Deviance    AIC
- age       1   1598.8 1608.8
<none>          1596.9 1608.9
- pdays     1   1603.9 1613.9
- campaign  1   1604.4 1614.4
- previous  1   1606.2 1616.2
- duration  1   1905.4 1915.4

Step:  AIC=1608.8
y ~ duration + campaign + pdays + previous

           Df Deviance    AIC
<none>          1598.8 1608.8
- pdays     1   1605.6 1613.6
- campaign  1   1606.7 1614.7
- previous  1   1608.3 1616.3
- duration  1   1908.1 1916.1

Call:  glm(formula = y ~ duration + campaign + pdays + previous, family = "binomial",
    data = bankint, subset = id)

Coefficients:
(Intercept)     duration     campaign        pdays     previous
  -3.191878     0.003464    -0.084405     0.001790     0.110634

Degrees of Freedom: 2711 Total (i.e. Null);  2707 Residual
Null Deviance:      1952
Residual Deviance: 1599        AIC: 1609
m <- glm(y ~ age + duration + campaign + pdays + previous,
         family = "binomial", subset = id, data = bankint)
summary(m)
Call:
glm(formula = y ~ age + duration + campaign + pdays + previous,
    family = "binomial", data = bankint, subset = id)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.8052  -0.4398  -0.3412  -0.2859   2.5410

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.5381351  0.2880890 -12.281  < 2e-16 ***
age          0.0083428  0.0060098   1.388  0.16507
duration     0.0034554  0.0002189  15.785  < 2e-16 ***
campaign    -0.0829148  0.0328637  -2.523  0.01164 *
pdays        0.0018123  0.0006601   2.746  0.00604 **
previous     0.1094893  0.0347442   3.151  0.00163 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1952.3  on 2711  degrees of freedom
Residual deviance: 1596.9  on 2706  degrees of freedom
AIC: 1608.9

Number of Fisher Scoring iterations: 6
mpredict <- predict(m, newdata = bankinttest, type = "response")
predicty <- ifelse(mpredict < 0.5, "no", "yes")   # recode the predicted probabilities as class labels
truey <- bankint$y[-id]
table(predicty, truey)                            # cross-tabulate predictions against the truth
        truey
predicty   no  yes
     no  1582  175
     yes   22   30
Measured by its ability to identify the "yes" class, the logistic regression is again worse than the Fisher discriminant analysis, and also worse than the discriminant analysis on the principal components.
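Because "yes" responses are scarce, the 0.5 cut-off strongly favours the majority class. A sketch of lowering the threshold (the value 0.2 is illustrative, not from the original analysis) shows how the trade-off could be shifted toward flagging more potential "yes" customers at the cost of more false alarms:

## sketch: rescore the fitted logistic model with a lower, illustrative cut-off
predicty2 <- ifelse(mpredict < 0.2, "no", "yes")
table(predicty2, truey)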