Data Mining Midterm Assignment Reference

Midterm Assignment Reference

Table of Contents

  • 1. Supervised binning with the minimum description length principle
  • 2. Association analysis and printing selected rules
  • 3. Prediction with Fisher linear discriminant analysis
  • 4. New variables from principal component or factor analysis, followed by discriminant analysis
  • 5. Generalized linear regression and prediction on the test set

1 Supervised binning with the minimum description length principle

bank<-read.table("data/bank/bank.csv",header=TRUE,sep=";")
m<-ncol(bank)
factors<-lapply(bank,is.factor)
iffactor<-as.logical(factors)
bank.factors<-(1:m)[iffactor]
bank.int<-(1:m)[!iffactor]
bankint<-bank[,c(bank.int,m)] # keep the numeric columns plus the response y, saved as bankint
bankfac<-bank[,iffactor]
summary(bankfac)
         job          marital         education    default    housing   
management :969   divorced: 528   primary  : 678   no :4445   no :1962  
blue-collar:946   married :2797   secondary:2306   yes:  76   yes:2559  
technician :768   single  :1196   tertiary :1350                        
admin.     :478                   unknown  : 187                        
services   :417                                                         
retired    :230                                                         
(Other)    :713                                                         
 loan           contact         month         poutcome      y       
no :3830   cellular :2896   may    :1398   failure: 490   no :4000  
yes: 691   telephone: 301   jul    : 706   other  : 197   yes: 521  
           unknown  :1324   aug    : 633   success: 129             
                            jun    : 531   unknown:3705             
                            nov    : 389                            
                            apr    : 293                            
                            (Other): 571

Use mdlp() from the discretization package (Fayyad and Irani's minimum description length method) for supervised binning: discretize the continuous variables, convert the result to factors, and replace the original continuous columns. The resulting data frame is called banknew.

library(discretization)
bankintDisc<-mdlp(bankint)$Disc.data # supervised binning of the continuous variables
bankintFa<-as.data.frame(lapply(bankintDisc,as.factor)) ## convert the binned variables to
## factors and collect them in a data frame
banknew<-bank   ## start from a copy of the original data
p<-ncol(bankintFa)
banknew[,bank.int]<-bankintFa[,-p] # replace the numeric columns with their binned factor versions
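
As a quick check of the binning (a sketch; the object name bankmdlp is illustrative and re-runs the discretization), mdlp() also returns the cut points it chose for each continuous variable, and the replaced columns of banknew should now be factors:

bankmdlp<-mdlp(bankint)     # keep the full result so the cut points can be inspected
bankmdlp$cutp[[1]]          # cut points chosen for the first continuous variable
str(banknew[,bank.int])     # the replaced columns are now factors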

2 Association analysis and printing selected rules

First split the dataset into a training set and a test set.

## split the dataset into a training set and a test set
set.seed(1314) 
id<-sample(1:4521,2712)
bank.train<-banknew[id,]  # discretized training data, used for the association analysis
bank.test<-bank[-id,]     # the test set keeps the original continuous variables
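
A quick sanity check of the random split (a sketch): the response should be similarly distributed in the training and test sets.

prop.table(table(banknew$y[id]))   # class proportions in the training set
prop.table(table(banknew$y[-id]))  # class proportions in the test set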
  1. Run the association analysis on the training set
    library("arules") 
    banktransac <- as(bank.train, "transactions")
    rules <- apriori(banktransac,parameter = list(support = 0.1, confidence = 0.6)) 
    summary(rules)
    
    parameter specification:
     confidence minval smax arem  aval originalSupport support minlen maxlen target
            0.6    0.1    1 none FALSE            TRUE     0.1      1     10  rules
       ext
     FALSE
    
    algorithmic control:
     filter tree heap memopt load sort verbose
        0.1 TRUE TRUE  FALSE TRUE    2    TRUE
    
    apriori - find association rules with the apriori algorithm
    version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
    set item appearances ...[0 item(s)] done [0.00s].
    set transactions ...[64 item(s), 2712 transaction(s)] done [0.00s].
    sorting and recoding items ... [36 item(s)] done [0.00s].
    creating transaction tree ... done [0.00s].
    checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.27s].
    writing ... [271827 rule(s)] done [0.04s].
    creating S4 object  ... done [0.11s].
    set of 271827 rules
    
    rule length distribution (lhs + rhs):sizes
        1     2     3     4     5     6     7     8     9    10 
       11   341  2977 13135 34796 59692 68851 54041 28443  9540 
    
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1.000   6.000   7.000   6.842   8.000  10.000 
    
    summary of quality measures:
        support         confidence          lift       
     Min.   :0.1003   Min.   :0.6000   Min.   :0.7879  
     1st Qu.:0.1136   1st Qu.:0.8701   1st Qu.:1.0000  
     Median :0.1364   Median :0.9904   Median :1.0164  
     Mean   :0.1673   Mean   :0.9282   Mean   :1.0889  
     3rd Qu.:0.1881   3rd Qu.:1.0000   3rd Qu.:1.2147  
     Max.   :1.0000   Max.   :1.0000   Max.   :5.4567  
    
    mining info:
            data ntransactions support confidence
     banktransac          2712     0.1        0.6
    
    ## select the rules whose consequent is "y=no"
    rulesyno <- subset(rules, subset = rhs %in% "y=no" & lift > 1.13) 
    summary(rulesyno)
    ## inspect the general form of the selected rules, sorted by lift
    inspect(head(sort(rulesyno, by = "lift"), n = 10))
    ## save the selected rules to a text file
    write(rulesyno, file = "data/bank/bankyapriorirules.csv", sep = ",")
    
    set of 319 rules
    
    rule length distribution (lhs + rhs):sizes
     3  4  5  6  7  8  9 10 
     3 18 48 76 80 58 28  8 
    
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      3.000   6.000   7.000   6.687   8.000  10.000 
    
    summary of quality measures:
        support         confidence      lift      
     Min.   :0.1010   Min.   :1    Min.   :1.132  
     1st Qu.:0.1034   1st Qu.:1    1st Qu.:1.132  
     Median :0.1125   Median :1    Median :1.132  
     Mean   :0.1118   Mean   :1    Mean   :1.132  
     3rd Qu.:0.1206   3rd Qu.:1    3rd Qu.:1.132  
     Max.   :0.1239   Max.   :1    Max.   :1.132  
    
    mining info:
            data ntransactions support confidence
     banktransac          2712     0.1        0.6
       lhs                   rhs      support confidence     lift
    1  {duration=1,                                              
        poutcome=unknown} => {y=no} 0.1231563          1 1.131886
    2  {duration=1,                                              
        previous=1}       => {y=no} 0.1231563          1 1.131886
    3  {duration=1,                                              
        pdays=1}          => {y=no} 0.1238938          1 1.131886
    4  {duration=1,                                              
        previous=1,                                              
        poutcome=unknown} => {y=no} 0.1231563          1 1.131886
    5  {duration=1,                                              
        pdays=1,                                                 
        poutcome=unknown} => {y=no} 0.1231563          1 1.131886
    6  {loan=no,                                                 
        duration=1,                                              
        poutcome=unknown} => {y=no} 0.1039823          1 1.131886
    7  {age=1,                                                   
        duration=1,                                              
        poutcome=unknown} => {y=no} 0.1224189          1 1.131886
    8  {default=no,                                              
        duration=1,                                              
        poutcome=unknown} => {y=no} 0.1205752          1 1.131886
    9  {duration=1,                                              
        campaign=1,                                              
        poutcome=unknown} => {y=no} 0.1231563          1 1.131886
    10 {day=1,                                                   
        duration=1,                                              
        poutcome=unknown} => {y=no} 0.1231563          1 1.131886
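
    The 319 rules above all have confidence 1 and identical lift, so most are
    redundant supersets of shorter rules. A sketch of pruning them with
    arules::is.redundant() (assuming the arules version in use provides it):

    rulesyno.pruned <- rulesyno[!is.redundant(rulesyno)]  # drop rules implied by simpler ones
    summary(rulesyno.pruned)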
    

3 Prediction with Fisher linear discriminant analysis

## first build the discrimination rule on the training data
require("MASS")
z <- lda(y ~ ., bankint, prior = c(1,1)/2, subset = id)
z.predict<- predict(z,bank.test)
predicty<-z.predict$class
truey<-bank.test$y
table(predicty, truey)
Loading required package: MASS
        truey
predicty   no  yes
     no  1389   71
     yes  215  134
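
A small sketch (variable names are illustrative) that condenses the confusion matrix above into overall accuracy and recall on the "yes" class:

conf <- table(predicty, truey)
accuracy <- sum(diag(conf)) / sum(conf)               # (1389 + 134) / 1809
recall.yes <- conf["yes", "yes"] / sum(conf[, "yes"]) # 134 / (71 + 134)
c(accuracy = accuracy, recall.yes = recall.yes)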

4 New variables from principal component or factor analysis, followed by discriminant analysis

Here we take principal components as the example.

bankinttrain<-bankint[id,]
p<-ncol(bankint)
trainpr<-princomp(bankinttrain[,-p],cor=TRUE,loadings=TRUE)
summary(trainpr)
Importance of components:
                          Comp.1    Comp.2    Comp.3    Comp.4    Comp.5
Standard deviation     1.2904231 1.0802812 1.0501263 0.9857049 0.9408610
Proportion of Variance 0.2378845 0.1667153 0.1575379 0.1388020 0.1264599
Cumulative Proportion  0.2378845 0.4045999 0.5621378 0.7009398 0.8273997
                          Comp.6     Comp.7
Standard deviation     0.8991575 0.63223248
Proportion of Variance 0.1154978 0.05710256
Cumulative Proportion  0.9428974 1.00000000
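
A scree plot is a quick way to justify how many components to keep (a sketch; per the summary above, the first five components explain roughly 83% of the variance):

screeplot(trainpr, type = "lines")   # variance of each component
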
  1. Select the first five principal components to form the new variables
    loading5<-as.matrix(trainpr$loadings)[,1:5]
    components<-as.matrix(bankinttrain[,-p])%*%loading5
    bankprtrain<-data.frame(components,y=bankint[id,8])
    # the test set needs the same projection (an alternative using predict() is sketched at the end of this section)
    components<-as.matrix(bankint[-id,-p])%*%loading5
    bankprtest<-data.frame(components,y=bankint[-id,8])
    #head(bankprtrain)
    
  2. Fit a Fisher linear discriminant analysis on the new training set, then apply it to the new test set
    z<-lda(y~.,data=bankprtrain)
    prpredic<-predict(z,newdata=bankprtest)
    predicy<-prpredic$class
    truey<-bankprtest$y
    table(predicy,truey)
    
           truey
    predicy   no  yes
        no  1565  158
        yes   39   47
    

    The results are not as good as the discriminant analysis applied directly to the original variables, so this direction is not pursued further.
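
Note that the manual projection above multiplies the raw (uncentered, unscaled) data by the loadings, whereas princomp() was fitted on the correlation matrix. A sketch of an alternative that reuses the centering and scaling stored in the princomp object, and would therefore give somewhat different scores than the tables above:

scores.train <- predict(trainpr)[, 1:5]                              # scores of the training data
scores.test  <- predict(trainpr, newdata = bankint[-id, -p])[, 1:5]  # same transformation on the test data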

5 Generalized linear regression and prediction on the test set

bankinttest<-bankint[-id,]
m<-glm(y ~.,family="binomial",subset=id,data=bankint)
step(m) # stepwise selection of an optimal subset of variables
Start:  AIC=1611.09
y ~ age + balance + day + duration + campaign + pdays + previous

           Df Deviance    AIC
- day       1   1595.5 1609.5
- balance   1   1596.6 1610.6
- age       1   1596.7 1610.7
<none>          1595.1 1611.1
- pdays     1   1602.3 1616.3
- campaign  1   1602.9 1616.9
- previous  1   1604.3 1618.3
- duration  1   1905.0 1919.0

Step:  AIC=1609.46
y ~ age + balance + duration + campaign + pdays + previous

           Df Deviance    AIC
- balance   1   1596.9 1608.9
- age       1   1597.0 1609.0
<none>          1595.5 1609.5
- pdays     1   1602.5 1614.5
- campaign  1   1602.9 1614.9
- previous  1   1604.6 1616.6
- duration  1   1905.1 1917.1

Step:  AIC=1608.89
y ~ age + duration + campaign + pdays + previous

           Df Deviance    AIC
- age       1   1598.8 1608.8
<none>          1596.9 1608.9
- pdays     1   1603.9 1613.9
- campaign  1   1604.4 1614.4
- previous  1   1606.2 1616.2
- duration  1   1905.4 1915.4

Step:  AIC=1608.8
y ~ duration + campaign + pdays + previous

           Df Deviance    AIC
<none>          1598.8 1608.8
- pdays     1   1605.6 1613.6
- campaign  1   1606.7 1614.7
- previous  1   1608.3 1616.3
- duration  1   1908.1 1916.1

Call:  glm(formula = y ~ duration + campaign + pdays + previous, family = "binomial", 
    data = bankint, subset = id)

Coefficients:
(Intercept)     duration     campaign        pdays     previous  
  -3.191878     0.003464    -0.084405     0.001790     0.110634  

Degrees of Freedom: 2711 Total (i.e. Null);  2707 Residual
Null Deviance:	    1952 
Residual Deviance: 1599 	AIC: 1609
m<-glm(y ~ age + duration + campaign + pdays + previous,family="binomial",subset=id,data=bankint)
summary(m)
Call:
glm(formula = y ~ age + duration + campaign + pdays + previous, 
    family = "binomial", data = bankint, subset = id)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.8052  -0.4398  -0.3412  -0.2859   2.5410  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.5381351  0.2880890 -12.281  < 2e-16 ***
age          0.0083428  0.0060098   1.388  0.16507    
duration     0.0034554  0.0002189  15.785  < 2e-16 ***
campaign    -0.0829148  0.0328637  -2.523  0.01164 *  
pdays        0.0018123  0.0006601   2.746  0.00604 ** 
previous     0.1094893  0.0347442   3.151  0.00163 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1952.3  on 2711  degrees of freedom
Residual deviance: 1596.9  on 2706  degrees of freedom
AIC: 1608.9

Number of Fisher Scoring iterations: 6
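
As a sketch of how to read the deviances above, the drop from the null deviance to the residual deviance can be turned into an overall likelihood-ratio test of the fitted model (base R only; uses the components stored in the glm object):

with(m, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))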
mpredict<-predict(m, newdata=bankinttest,type = "response")
predicty<-ifelse(mpredict<0.5,"no","yes") # recode the predicted probabilities as class labels
# tabulate the results
truey<-bankint$y[-id]
table(predicty,truey)
        truey
predicty   no  yes
     no  1582  175
     yes   22   30

The results are again not as good as the Fisher discriminant analysis, nor as the discriminant analysis after principal components: only 30 of the 205 true "yes" cases in the test set are recovered.
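
Because the classes are imbalanced, the 0.5 cutoff used above strongly favours the "no" class. A sketch of how a lower cutoff (0.2 here is purely illustrative) trades overall accuracy for recall on y = yes:

table(ifelse(mpredict < 0.2, "no", "yes"), truey)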

