R数据分析及可视化的一个简单例子

需求分析

葡萄牙某银行拟根据现有客户资料建立预测模型，以配合其数据库营销策略，营销方式为电话直销，销售产品为某金融产品（term deposit），数据分析的目标为通过预测模型识别对该金融产品有较高购买意愿的用户群。

数据形式：从数据库中导出的excel文件

数据内容：

# bank client data:
   1 - age (numeric)
   2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
                                       "blue-collar","self-employed","retired","technician","services")
   3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)
   4 - education (categorical: "unknown","secondary","primary","tertiary")
   5 - default: has credit in default? (binary: "yes","no")
   6 - balance: average yearly balance, in euros (numeric)
   7 - housing: has housing loan? (binary: "yes","no")
   8 - loan: has personal loan? (binary: "yes","no")
   # related with the last contact of the current campaign:
   9 - contact: contact communication type (categorical: "unknown","telephone","cellular")
10 - day: last contact day of the month (numeric)
11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
12 - duration: last contact duration, in seconds (numeric)
   # other attributes:
13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
15 - previous: number of contacts performed before this campaign and for this client (numeric)
16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

Output variable (desired target):
17 - y - has the client subscribed a term deposit? (binary: "yes","no")

下面用R语言进行数据分析及可视化

> bank <- read.csv("H:/bank/bank-full.csv", sep = ";", header = T) //载入数据
> summary(bank) //对数据进行分析汇总

//导入要用到的包
> library(caret)
> library(ggplot2)
> library(gplots)
 
> require(rpart)
> require(caret)
> require(ggplot2)
> require(gplots)
//运用决策树模型对数据做初步分类建模和变量选择
> bank.tree <- rpart(y ~ ., data = bank, method = "class", cp = 0.001)
> treeImp <- varImp(bank.tree, scale = TRUE, surrogates = FALSE, competes = TRUE)
> treeImp$Variable <- rownames(treeImp)
> treeImp.sort <- treeImp[order(-treeImp$Overall), ]
> ggplot(treeImp, aes(Variable, Overall)) + geom_bar(stat = "identity") + coord_flip()

//根据cpplot对树做裁剪
plotcp(bank.tree)

//输出
> printcp(bank.tree)

Classification tree:

rpart(formula = y ~ ., data = bank, method = "class", cp = 0.001)

Variables actually used in tree construction:

[1] age balance contact day duration education housing job marital

[10] month pdays poutcome previous

Root node error: 5289/45211 = 0.11698

n= 45211

CP nsplit rel error xerror xstd

1 0.0380034 0 1.00000 1.00000 0.012921

2 0.0253356 3 0.88599 0.89147 0.012287

3 0.0170164 4 0.86065 0.86425 0.012120

4 0.0080355 5 0.84364 0.84969 0.012028

5 0.0042226 7 0.82757 0.84061 0.011971

6 0.0039705 10 0.81490 0.84061 0.011971

7 0.0034033 13 0.80299 0.83324 0.011924

8 0.0022373 15 0.79618 0.82133 0.011848

9 0.0019537 21 0.78276 0.81584 0.011812

10 0.0017962 24 0.77690 0.81566 0.011811

11 0.0016071 26 0.77330 0.81717 0.011821

12 0.0015126 30 0.76687 0.82057 0.011843

13 0.0014180 32 0.76385 0.81906 0.011833

14 0.0013235 40 0.75099 0.81887 0.011832

15 0.0012290 44 0.74570 0.81755 0.011823

16 0.0011344 51 0.73587 0.81660 0.011817

17 0.0010399 61 0.72452 0.81868 0.011831

18 0.0010000 63 0.72244 0.82416 0.011866

//绘制决策树
> bank.tree <- rpart(y ~ ., data = bank, method = "class", cp = 0.0022373)
> plot(bank.tree, branch = 0, margin = 0.1, uniform = T)
> text(bank.tree, use.n = T, col = "red", cex = 0.6)

//变量初选,分析和变换

根据决策树分析的结果，我们选择变量重要性最高的前5个变量做进一步研究，依次是：

Duration : last contact duration, in seconds (numeric)

month : last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")

poutcome : outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

pdays : number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

previous : number of contacts performed before this campaign and for this client (numeric)

//a) Duration参数分析
> bank$y_dummy = ifelse(bank$y == "yes", 1, 0)
> summary(bank$duration)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0   103.0   180.0   258.2   319.0  4918.0 
> ggplot(bank, aes(duration, y_dummy)) + geom_smooth() + geom_point()

//根据拟合形态对Duration做一个二次项。
> bank$duration.sq <- bank$duration * bank$duration

//b）对month 参数分析
> summary(bank$month)
  apr   aug   dec   feb   jan   jul   jun   mar   may   nov   oct   sep 
 2932  6247   214  2649  1403  6895  5341   477 13766  3970   738   579 
> library(Rcmdr)
> plotMeans(bank$y_dummy, bank$month, error.bars = "se")

> bank$month.sel <- ifelse(bank$month == "dec", 1, 0)
> bank$month.sel <- ifelse(bank$month == "mar", 1, bank$month)
> bank$month.sel <- ifelse(bank$month == "oct", 1, bank$month)
> bank$month.sel <- ifelse(bank$month == "sep", 1, bank$month)

//c) poutcome参数分析
>summary(bank$poutcome)
failure   other success unknown 
   4901    1840    1511   36959 
> plotMeans(bank$y_dummy, bank$poutcome, error.bars = "se")

> bank$poutcome.success <- ifelse(bank$poutcome == "success", 1, 0)

//d）pdays参数分析
 
> summary(bank$pdays)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   -1.0    -1.0    -1.0    40.2    -1.0   871.0 
> bank$nocontact <- ifelse(bank$pdays == -1, 1, 0)
> bank$pdays <- ifelse(bank$pdays == -1, 0, bank$pdays)
> summary(bank$nocontact)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  1.0000  1.0000  0.8174  1.0000  1.0000 
> plotMeans(bank$y_dummy, as.factor(bank$nocontact), error.bars = "se")

> ggplot(bank, aes(log(pdays + 1))) + geom_histogram()

> ggplot(bank, aes(log(pdays + 1), y_dummy)) + geom_smooth() + geom_point()

//e) previous参数分析
 
> summary(bank$previous)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  0.0000   0.0000   0.0000   0.5803   0.0000 275.0000 
 
> ggplot(bank, aes(log(previous + 1))) + geom_histogram()

> ggplot(bank, aes(log(previous + 1), y_dummy)) + geom_smooth() + geom_point()

> bank$previous.0 <- as.factor(ifelse(bank$previous == 0, 1, 0))
> bank$previous.1 <- as.factor(ifelse(bank$previous == 1, 1, 0))
> bank$previous.2 <- as.factor(ifelse(bank$previous == 2, 1, 0))
> bank$previous.2plus <- as.factor(ifelse(bank$previous > 2, 1, 0))

//逻辑回归建模
> logistic.full <- glm(y_dummy ~ duration + duration.sq + month.sel + poutcome.success + 
+     bank$nocontact + log(pdays + 1) + bank$previous.0 + bank$previous.1 + bank$previous.2 + 
+     bank$previous.2plus, data = bank)
> summary(logistic.full)

Call:

glm(formula = y_dummy ~ duration + duration.sq + month.sel +

poutcome.success + bank$nocontact + log(pdays + 1) + bank$previous.0 +

bank$previous.1 + bank$previous.2 + bank$previous.2plus,

data = bank)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.15672 -0.11482 -0.04176 0.01313 1.08332

Coefficients: (2 not defined because of singularities)

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.583e-01 2.348e-02 6.742 1.58e-11 ***

duration 6.574e-04 9.605e-06 68.444 < 2e-16 ***

duration.sq -1.350e-07 6.145e-09 -21.971 < 2e-16 ***

month.sel -6.720e-03 4.346e-04 -15.463 < 2e-16 ***

poutcome.success 4.555e-01 8.085e-03 56.340 < 2e-16 ***

bank$nocontact -1.749e-01 2.336e-02 -7.490 7.04e-14 ***

log(pdays + 1) -2.106e-02 4.335e-03 -4.858 1.19e-06 ***

bank$previous.01 NA NA NA NA

bank$previous.11 -2.522e-02 7.126e-03 -3.540 0.000401 ***

bank$previous.21 -1.641e-02 7.714e-03 -2.128 0.033367 *

bank$previous.2plus1 NA NA NA NA

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.07690626)

Null deviance: 4670.3 on 45210 degrees of freedom

Residual deviance: 3476.3 on 45202 degrees of freedom

AIC: 12340

Number of Fisher Scoring iterations: 2

> logistic.step <- step(logistic.full, direction = "both", k = 2)

Start: AIC=12340.45

y_dummy ~ duration + duration.sq + month.sel + poutcome.success +

bank$nocontact + log(pdays + 1) + bank$previous.0 + bank$previous.1 +

bank$previous.2 + bank$previous.2plus

Step: AIC=12340.45

y_dummy ~ duration + duration.sq + month.sel + poutcome.success +

bank$nocontact + log(pdays + 1) + bank$previous.0 + bank$previous.1 +

bank$previous.2

Step: AIC=12340.45

y_dummy ~ duration + duration.sq + month.sel + poutcome.success +

bank$nocontact + log(pdays + 1) + bank$previous.1 + bank$previous.2

Df Deviance AIC

3476.3 12340

- bank$previous.2 1 3476.7 12343

- bank$previous.1 1 3477.3 12351

- log(pdays + 1) 1 3478.1 12362

- bank$nocontact 1 3480.6 12394

- month.sel 1 3494.7 12577

- duration.sq 1 3513.4 12819

- poutcome.success 1 3720.4 15407

- duration 1 3836.6 16797

> summary(logistic.step)

Call:

glm(formula = y_dummy ~ duration + duration.sq + month.sel +

poutcome.success + bank$nocontact + log(pdays + 1) + bank$previous.1 +

bank$previous.2, data = bank)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.15672 -0.11482 -0.04176 0.01313 1.08332

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.583e-01 2.348e-02 6.742 1.58e-11 ***

duration 6.574e-04 9.605e-06 68.444 < 2e-16 ***

duration.sq -1.350e-07 6.145e-09 -21.971 < 2e-16 ***

month.sel -6.720e-03 4.346e-04 -15.463 < 2e-16 ***

poutcome.success 4.555e-01 8.085e-03 56.340 < 2e-16 ***

bank$nocontact -1.749e-01 2.336e-02 -7.490 7.04e-14 ***

log(pdays + 1) -2.106e-02 4.335e-03 -4.858 1.19e-06 ***

bank$previous.11 -2.522e-02 7.126e-03 -3.540 0.000401 ***

bank$previous.21 -1.641e-02 7.714e-03 -2.128 0.033367 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.07690626)

Null deviance: 4670.3 on 45210 degrees of freedom

Residual deviance: 3476.3 on 45202 degrees of freedom

AIC: 12340

Number of Fisher Scoring iterations: 2

//模型scoring和ROC评估
> require(ROCR)
载入需要的程辑包：ROCR
> bank.pred<-1/(1+exp(-predict(logistic.step)))
> roc.data <- prediction(bank.pred, labels = bank$y)
> roc.data <- performance(roc.data, "tpr", "fpr")
> plot(roc.data)

> score<-data.frame("prob.y"=bank.pred,"y"=as.factor(bank$y_dummy))
>ggplot(score, aes(x=prob.y, fill=y)) + geom_histogram(position="identity", binwidth=0.01,alpha=0.5)

通过对ROC和Score分布的分析，逻辑回归Score的分类效果还是不错的。具体的score cutoff值需要根据业务要求和营销成本而定。

R数据分析及可视化的一个简单例子

你可能感兴趣的:(R数据分析及可视化的一个简单例子)