R数据分析及可视化的一个简单例子

需求分析

葡萄牙某银行拟根据现有客户资料建立预测模型,以配合其数据库营销策略,营销方式为电话直销,销售产品为某金融产品(term deposit),数据分析的目标为通过预测模型识别对该金融产品有较高购买意愿的用户群。  

数据形式:从数据库中导出的excel文件

数据内容:

 # bank client data:
   1 - age (numeric)
   2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
                                       "blue-collar","self-employed","retired","technician","services")
   3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)
   4 - education (categorical: "unknown","secondary","primary","tertiary")
   5 - default: has credit in default? (binary: "yes","no")
   6 - balance: average yearly balance, in euros (numeric)
   7 - housing: has housing loan? (binary: "yes","no")
   8 - loan: has personal loan? (binary: "yes","no")
   # related with the last contact of the current campaign:
   9 - contact: contact communication type (categorical: "unknown","telephone","cellular")
  10 - day: last contact day of the month (numeric)
  11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
  12 - duration: last contact duration, in seconds (numeric)
   # other attributes:
  13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
  15 - previous: number of contacts performed before this campaign and for this client (numeric)
  16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

  Output variable (desired target):
  17 - y - has the client subscribed a term deposit? (binary: "yes","no")

下面用R语言进行数据分析及可视化

> bank <- read.csv("H:/bank/bank-full.csv", sep = ";", header = T) //载入数据
> summary(bank) //对数据进行分析汇总

R数据分析及可视化的一个简单例子_第1张图片

//导入要用到的包
> library(caret)
> library(ggplot2)
> library(gplots)
 
> require(rpart)
> require(caret)
> require(ggplot2)
> require(gplots)
//运用决策树模型对数据做初步分类建模和变量选择
> bank.tree <- rpart(y ~ ., data = bank, method = "class", cp = 0.001)
> treeImp <- varImp(bank.tree, scale = TRUE, surrogates = FALSE, competes = TRUE)
> treeImp$Variable <- rownames(treeImp)
> treeImp.sort <- treeImp[order(-treeImp$Overall), ]
> ggplot(treeImp, aes(Variable, Overall)) + geom_bar(stat = "identity") + coord_flip()

R数据分析及可视化的一个简单例子_第2张图片

//根据cpplot对树做裁剪
plotcp(bank.tree)

R数据分析及可视化的一个简单例子_第3张图片

//输出
> printcp(bank.tree)

Classification tree:

rpart(formula = y ~ ., data = bank, method = "class", cp = 0.001)

 

Variables actually used in tree construction:

 [1] age       balance   contact   day       duration  education housing   job       marital 

[10] month     pdays     poutcome  previous

 

Root node error: 5289/45211 = 0.11698

 

n= 45211

 

          CP nsplit rel error  xerror     xstd

0.0380034      0   1.00000 1.00000 0.012921

0.0253356      3   0.88599 0.89147 0.012287

0.0170164      4   0.86065 0.86425 0.012120

0.0080355      5   0.84364 0.84969 0.012028

0.0042226      7   0.82757 0.84061 0.011971

0.0039705     10   0.81490 0.84061 0.011971

0.0034033     13   0.80299 0.83324 0.011924

0.0022373     15   0.79618 0.82133 0.011848

0.0019537     21   0.78276 0.81584 0.011812

10 0.0017962     24   0.77690 0.81566 0.011811

11 0.0016071     26   0.77330 0.81717 0.011821

12 0.0015126     30   0.76687 0.82057 0.011843

13 0.0014180     32   0.76385 0.81906 0.011833

14 0.0013235     40   0.75099 0.81887 0.011832

15 0.0012290     44   0.74570 0.81755 0.011823

16 0.0011344     51   0.73587 0.81660 0.011817

17 0.0010399     61   0.72452 0.81868 0.011831

18 0.0010000     63   0.72244 0.82416 0.011866

//绘制决策树
> bank.tree <- rpart(y ~ ., data = bank, method = "class", cp = 0.0022373)
> plot(bank.tree, branch = 0, margin = 0.1, uniform = T)
> text(bank.tree, use.n = T, col = "red", cex = 0.6)

R数据分析及可视化的一个简单例子_第4张图片

//变量初选,分析和变换

根据决策树分析的结果,我们选择变量重要性最高的前5个变量做进一步研究,依次是:

 

Duration : last contact duration, in seconds (numeric)

month : last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")

poutcome : outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

pdays : number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

previous : number of contacts performed before this campaign and for this client (numeric)

//a) Duration参数分析
> bank$y_dummy = ifelse(bank$y == "yes", 1, 0)
> summary(bank$duration)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0   103.0   180.0   258.2   319.0  4918.0 
> ggplot(bank, aes(duration, y_dummy)) + geom_smooth() + geom_point()

R数据分析及可视化的一个简单例子_第5张图片

//根据拟合形态对Duration做一个二次项。
> bank$duration.sq <- bank$duration * bank$duration

//b)对month 参数分析
> summary(bank$month)
  apr   aug   dec   feb   jan   jul   jun   mar   may   nov   oct   sep 
 2932  6247   214  2649  1403  6895  5341   477 13766  3970   738   579 
> library(Rcmdr)
> plotMeans(bank$y_dummy, bank$month, error.bars = "se")

R数据分析及可视化的一个简单例子_第6张图片

> bank$month.sel <- ifelse(bank$month == "dec", 1, 0)
> bank$month.sel <- ifelse(bank$month == "mar", 1, bank$month)
> bank$month.sel <- ifelse(bank$month == "oct", 1, bank$month)
> bank$month.sel <- ifelse(bank$month == "sep", 1, bank$month)

//c) poutcome参数分析
>summary(bank$poutcome)
failure   other success unknown 
   4901    1840    1511   36959 
> plotMeans(bank$y_dummy, bank$poutcome, error.bars = "se")

R数据分析及可视化的一个简单例子_第7张图片

> bank$poutcome.success <- ifelse(bank$poutcome == "success", 1, 0)

//d)pdays参数分析
 
> summary(bank$pdays)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   -1.0    -1.0    -1.0    40.2    -1.0   871.0 
> bank$nocontact <- ifelse(bank$pdays == -1, 1, 0)
> bank$pdays <- ifelse(bank$pdays == -1, 0, bank$pdays)
> summary(bank$nocontact)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  1.0000  1.0000  0.8174  1.0000  1.0000 
> plotMeans(bank$y_dummy, as.factor(bank$nocontact), error.bars = "se")

R数据分析及可视化的一个简单例子_第8张图片

> ggplot(bank, aes(log(pdays + 1))) + geom_histogram()

R数据分析及可视化的一个简单例子_第9张图片

> ggplot(bank, aes(log(pdays + 1), y_dummy)) + geom_smooth() + geom_point()

R数据分析及可视化的一个简单例子_第10张图片

//e) previous参数分析
 
> summary(bank$previous)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  0.0000   0.0000   0.0000   0.5803   0.0000 275.0000 
 
> ggplot(bank, aes(log(previous + 1))) + geom_histogram()

R数据分析及可视化的一个简单例子_第11张图片

> ggplot(bank, aes(log(previous + 1), y_dummy)) + geom_smooth() + geom_point()

R数据分析及可视化的一个简单例子_第12张图片

> bank$previous.0 <- as.factor(ifelse(bank$previous == 0, 1, 0))
> bank$previous.1 <- as.factor(ifelse(bank$previous == 1, 1, 0))
> bank$previous.2 <- as.factor(ifelse(bank$previous == 2, 1, 0))
> bank$previous.2plus <- as.factor(ifelse(bank$previous > 2, 1, 0))

//逻辑回归建模
> logistic.full <- glm(y_dummy ~ duration + duration.sq + month.sel + poutcome.success + 
+     bank$nocontact + log(pdays + 1) + bank$previous.0 + bank$previous.1 + bank$previous.2 + 
+     bank$previous.2plus, data = bank)
> summary(logistic.full)

Call:

glm(formula = y_dummy ~ duration + duration.sq + month.sel +

    poutcome.success + bank$nocontact + log(pdays + 1) + bank$previous.0 +

    bank$previous.1 + bank$previous.2 + bank$previous.2plus,

    data = bank)

 

Deviance Residuals:

     Min        1Q    Median        3Q       Max 

-1.15672  -0.11482  -0.04176   0.01313   1.08332 

 

Coefficients: (2 not defined because of singularities)

                       Estimate Std. Error t value Pr(>|t|)   

(Intercept)           1.583e-01  2.348e-02   6.742 1.58e-11 ***

duration              6.574e-04  9.605e-06  68.444  < 2e-16 ***

duration.sq          -1.350e-07  6.145e-09 -21.971  < 2e-16 ***

month.sel            -6.720e-03  4.346e-04 -15.463  < 2e-16 ***

poutcome.success      4.555e-01  8.085e-03  56.340  < 2e-16 ***

bank$nocontact       -1.749e-01  2.336e-02  -7.490 7.04e-14 ***

log(pdays + 1)       -2.106e-02  4.335e-03  -4.858 1.19e-06 ***

bank$previous.01             NA         NA      NA       NA   

bank$previous.11     -2.522e-02  7.126e-03  -3.540 0.000401 ***

bank$previous.21     -1.641e-02  7.714e-03  -2.128 0.033367 * 

bank$previous.2plus1         NA         NA      NA       NA   

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

(Dispersion parameter for gaussian family taken to be 0.07690626)

 

    Null deviance: 4670.3  on 45210  degrees of freedom

Residual deviance: 3476.3  on 45202  degrees of freedom

AIC: 12340

 

Number of Fisher Scoring iterations: 2

> logistic.step <- step(logistic.full, direction = "both", k = 2)


Start:  AIC=12340.45

y_dummy ~ duration + duration.sq + month.sel + poutcome.success +

    bank$nocontact + log(pdays + 1) + bank$previous.0 + bank$previous.1 +

    bank$previous.2 + bank$previous.2plus

 

 

Step:  AIC=12340.45

y_dummy ~ duration + duration.sq + month.sel + poutcome.success +

    bank$nocontact + log(pdays + 1) + bank$previous.0 + bank$previous.1 +

    bank$previous.2

 

 

Step:  AIC=12340.45

y_dummy ~ duration + duration.sq + month.sel + poutcome.success +

    bank$nocontact + log(pdays + 1) + bank$previous.1 + bank$previous.2

 

                   Df Deviance   AIC

<none>                  3476.3 12340

- bank$previous.2   1   3476.7 12343

- bank$previous.1   1   3477.3 12351

- log(pdays + 1)    1   3478.1 12362

- bank$nocontact    1   3480.6 12394

- month.sel         1   3494.7 12577

- duration.sq       1   3513.4 12819

- poutcome.success  1   3720.4 15407

- duration          1   3836.6 16797

> summary(logistic.step)

Call:

glm(formula = y_dummy ~ duration + duration.sq + month.sel +

    poutcome.success + bank$nocontact + log(pdays + 1) + bank$previous.1 +

    bank$previous.2, data = bank)

 

Deviance Residuals:

     Min        1Q    Median        3Q       Max 

-1.15672  -0.11482  -0.04176   0.01313   1.08332 

 

Coefficients:

                   Estimate Std. Error t value Pr(>|t|)   

(Intercept)       1.583e-01  2.348e-02   6.742 1.58e-11 ***

duration          6.574e-04  9.605e-06  68.444  < 2e-16 ***

duration.sq      -1.350e-07  6.145e-09 -21.971  < 2e-16 ***

month.sel        -6.720e-03  4.346e-04 -15.463  < 2e-16 ***

poutcome.success  4.555e-01  8.085e-03  56.340  < 2e-16 ***

bank$nocontact   -1.749e-01  2.336e-02  -7.490 7.04e-14 ***

log(pdays + 1)   -2.106e-02  4.335e-03  -4.858 1.19e-06 ***

bank$previous.11 -2.522e-02  7.126e-03  -3.540 0.000401 ***

bank$previous.21 -1.641e-02  7.714e-03  -2.128 0.033367 * 

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

(Dispersion parameter for gaussian family taken to be 0.07690626)

 

    Null deviance: 4670.3  on 45210  degrees of freedom

Residual deviance: 3476.3  on 45202  degrees of freedom

AIC: 12340

 

Number of Fisher Scoring iterations: 2

//模型scoring和ROC评估
> require(ROCR)
载入需要的程辑包:ROCR
> bank.pred<-1/(1+exp(-predict(logistic.step)))
> roc.data <- prediction(bank.pred, labels = bank$y)
> roc.data <- performance(roc.data, "tpr", "fpr")
> plot(roc.data)

R数据分析及可视化的一个简单例子_第13张图片

> score<-data.frame("prob.y"=bank.pred,"y"=as.factor(bank$y_dummy))
>ggplot(score, aes(x=prob.y, fill=y)) + geom_histogram(position="identity", binwidth=0.01,alpha=0.5)

R数据分析及可视化的一个简单例子_第14张图片

通过对ROCScore分布的分析,逻辑回归Score的分类效果还是不错的。具体的score cutoff值需要根据业务要求和营销成本而定。

你可能感兴趣的:(数据分析,数据挖掘,R语言,决策树,可视化)