45 - Machine Learning with R: Neural Networks and Deep Learning

Study notes for "Mastering Machine Learning with R, Second Edition".

1. Introduction to Neural Networks

The term "neural network" is quite broad and covers many related methods. We focus mainly on feed-forward networks trained with backpropagation.
The strength of neural network models is that they can model highly complex relationships between the input variables (features) and the response, especially when the relationship is highly nonlinear. Building and evaluating them requires no underlying assumptions, and they work for both quantitative and qualitative response variables.
The result of a neural network, however, is a black box: there is no equation with coefficients that you can examine and share with business partners, and in practice the results are nearly impossible to interpret. Another criticism is that it is unclear how the results will change when the random initialization changes. Finally, training a neural network is expensive in both time and computation.
Commonly used activation functions: the sigmoid, the rectifier (ReLU), maxout, and the hyperbolic tangent (tanh).
Plot the sigmoid function with R:

> library(pacman)
> p_load(ggplot2, dplyr, hrbrthemes)
> sigmoid <- function(x) {
+     1/(1 + exp(-x))
+ }
> 
> x <- seq(-5, 5, 0.1)
> df <- tibble(sigmoid.x = sigmoid(x), index = 1:length(x))
> ggplot(df, aes(index, sigmoid.x)) + geom_point() + theme_ft_rc()
(Figure: the sigmoid function)

The tanh() function (hyperbolic tangent) is a variant of the sigmoid whose output ranges from -1 to 1.

> tibble(x = x, sigmoid.x = sigmoid(x), tanh.x = tanh(x)) %>% ggplot(aes(x)) + 
+     geom_line(aes(y = sigmoid.x, color = "sigmoid"), size = 1) + 
+     geom_line(aes(y = tanh.x, color = "tanh"), size = 1) + 
+     theme_ft_rc() + theme(legend.position = "top", legend.title = element_blank())
(Figure: the sigmoid and tanh functions)
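
The rectifier (ReLU) mentioned above is even simpler: it returns max(0, x). A minimal sketch, not part of the original notes, reusing the x grid defined above:

> relu <- function(x) pmax(0, x)
> tibble(x = x, relu.x = relu(x)) %>% ggplot(aes(x, relu.x)) + 
+     geom_line(size = 1) + theme_ft_rc()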

2. A Brief Introduction to Deep Learning

Deep learning is a branch of machine learning whose foundation is the neural network. Its distinguishing feature is that it uses machine learning techniques (generally unsupervised learning) to build new features on top of the input variables.

3. Understanding and Preparing the Data

> library(pacman)
> p_load(MASS)
> 
> data("shuttle")
> str(shuttle)
## 'data.frame':    256 obs. of  7 variables:
##  $ stability: Factor w/ 2 levels "stab","xstab": 2 2 2 2 2 2 2 2 2 2 ...
##  $ error    : Factor w/ 4 levels "LX","MM","SS",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ sign     : Factor w/ 2 levels "nn","pp": 2 2 2 2 2 2 1 1 1 1 ...
##  $ wind     : Factor w/ 2 levels "head","tail": 1 1 1 2 2 2 1 1 1 2 ...
##  $ magn     : Factor w/ 4 levels "Light","Medium",..: 1 2 4 1 2 4 1 2 4 1 ...
##  $ vis      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ use      : Factor w/ 2 levels "auto","noauto": 1 1 1 1 1 1 1 1 1 1 ...

The dataset contains 256 observations of 7 variables. All variables are categorical, and the response variable use has two levels, auto and noauto.
 stability: whether positioning is stable (stab / xstab)
 error: size of the error (LX / MM / SS / XL)
 sign: sign of the error, positive or negative (pp / nn)
 wind: wind direction (head / tail)
 magn: wind strength (Light / Medium / Strong / Out of Range)
 vis: visibility (yes / no)
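
A quick way to confirm that every variable is a factor, and to count the levels (a small check, not from the book):

> sapply(shuttle, nlevels)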

> table(shuttle$use)
## 
##   auto noauto 
##    145    111

The auto-landing decision was used in 57% of the cases (145 of 256). The table() function is perfect for comparing two variables, but once more variables come in, the structable() function from the vcd package is a better choice:

> p_load(vcd)
> tab1 <- structable(wind + magn ~ use, shuttle)
> print(tab1)
##        wind  head                    tail                  
##        magn Light Medium Out Strong Light Medium Out Strong
## use                                                        
## auto           19     19  16     18    19     19  16     19
## noauto         13     13  16     14    13     13  16     13

From the table we can see that with a headwind (head) of light strength (Light), auto landing (auto) occurred 19 times and non-auto landing (noauto) occurred 13 times.
The mosaic() function draws the table produced by structable() as a mosaic plot and also provides the p-value of a chi-squared test:

> mosaic(tab1, shade = T)
(Figure: mosaic plot of tab1)

The tiles in the plot are proportional to the counts in the corresponding cells of the table. The p-value is not significant, so these features are unrelated to the response variable; in other words, wind strength (magn) does not help us predict whether auto landing is used.
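
The same conclusion can be checked with a direct chi-squared test (a quick sketch, not from the book):

> chisq.test(table(shuttle$use, shuttle$magn))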

> mosaic(use ~ error + vis, shuttle)
(Figure: mosaic plot of use by error and vis)

Data preparation for a neural network is very important, because all covariates and the response must be numeric. In this dataset every variable is categorical, so we use the caret package to quickly build dummy variables as input features:

> p_load(caret)
> dummies <- dummyVars(use ~ ., shuttle, fullRank = T)
> dummies
## Dummy Variable Object
## 
## Formula: use ~ .
## 
## 7 variables, 7 factors
## Variables and levels will be separated by '.'
## A full rank encoding is used

Convert to a data frame:

> shuttle.2 <- as.data.frame(predict(dummies, newdata = shuttle))
> names(shuttle.2)
##  [1] "stability.xstab" "error.MM"        "error.SS"        "error.XL"       
##  [5] "sign.pp"         "wind.tail"       "magn.Medium"     "magn.Out"       
##  [9] "magn.Strong"     "vis.yes"
> str(shuttle.2)
## 'data.frame':    256 obs. of  10 variables:
##  $ stability.xstab: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ error.MM       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ error.SS       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ error.XL       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ sign.pp        : num  1 1 1 1 1 1 0 0 0 0 ...
##  $ wind.tail      : num  0 0 0 1 1 1 0 0 0 1 ...
##  $ magn.Medium    : num  0 1 0 0 1 0 0 1 0 0 ...
##  $ magn.Out       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ magn.Strong    : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ vis.yes        : num  0 0 0 0 0 0 0 0 0 0 ...

We now have an input feature space with 10 variables. For stability, 0 stands for stab and 1 for xstab. The base level of error is LX, and three dummy variables encode the other categories.
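
A quick sanity check of the encoding (a sketch, not from the book): listing the unique combinations shows how each error level maps to the three dummies, with LX as the all-zero base level.

> unique(data.frame(error = shuttle$error, shuttle.2[, 2:4]))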
The response variable can be created with the ifelse() function:

> shuttle.2$use <- ifelse(shuttle$use == "auto", 1, 0)
> table(shuttle.2$use)
## 
##   0   1 
## 111 145

Split into training and test sets:

> set.seed(123)
> train.index <- createDataPartition(shuttle.2$use, p = 0.7, list = F)
> str(train.index)
##  int [1:180, 1] 1 4 5 6 7 8 9 11 13 14 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr "Resample1"
> shuttle.train <- shuttle.2[train.index, ]
> shuttle.test <- shuttle.2[-train.index, ]
> 
> dim(shuttle.train)
## [1] 180  11
> dim(shuttle.test)
## [1] 76 11

4. Model Building and Evaluation

Previously, we used y ~ . to specify all variables in a dataset except the response as inputs, but neuralnet does not allow this notation. The way around the restriction is the as.formula() function: first build an object holding the variable names, then use it to paste the names onto the right-hand side of the formula.

> p_load(neuralnet)
> 
> n <- names(shuttle.train)
> form <- as.formula(paste("use ~", paste(n[!n %in% "use"], collapse = "+")))
> print(form)
## use ~ stability.xstab + error.MM + error.SS + error.XL + sign.pp + 
##     wind.tail + magn.Medium + magn.Out + magn.Strong + vis.yes

Build the model:

> fit <- neuralnet(form, data = shuttle.train, err.fct = "ce", linear.output = F)

Parameter notes:
 hidden: the number of hidden neurons in each layer; up to three hidden layers can be specified; the default is 1
 act.fct: the activation function; the default is the logistic function, and tanh is also available
 err.fct: the error function; the default is sse; because we are dealing with a binary outcome, we set it to ce to use cross-entropy
 linear.output: a logical parameter controlling whether act.fct is ignored at the output; the default is TRUE, but for our data it must be FALSE
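
To make these options concrete, here is a hedged variant (a sketch only; it was not run in these notes) that adds a second hidden layer while keeping the cross-entropy error:

> fit2 <- neuralnet(form, data = shuttle.train, hidden = c(4, 2), 
+     err.fct = "ce", linear.output = FALSE)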

> fit$result.matrix
##                                      [,1]
## error                         0.013651024
## reached.threshold             0.009868817
## steps                       670.000000000
## Intercept.to.1layhid1         5.136942014
## stability.xstab.to.1layhid1  -2.485264957
## error.MM.to.1layhid1          1.032588807
## error.SS.to.1layhid1          2.543705586
## error.XL.to.1layhid1          0.030906433
## sign.pp.to.1layhid1           0.840732458
## wind.tail.to.1layhid1         0.721638821
## magn.Medium.to.1layhid1       0.034567106
## magn.Out.to.1layhid1         -2.436662220
## magn.Strong.to.1layhid1      -0.099174792
## vis.yes.to.1layhid1          -7.556133035
## Intercept.to.use            -28.580429411
## 1layhid1.to.use              66.014874838

We can see that the error is 0.013651024. The value of steps is the number of training iterations the algorithm needed to reach the threshold, that is, until the absolute partial derivatives of the error function fell below the threshold (0.01 by default, consistent with the reached.threshold value above). Among the covariate weights into the hidden neuron, the largest positive weight is error.SS.to.1layhid1, at 2.543705586.
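
Because the fitted weights depend on the random initialization, a common precaution is to train several repetitions and compare their errors. A sketch using neuralnet's rep argument (not run in the original notes); plot(fit3, rep = "best") would then draw the repetition with the lowest error:

> fit3 <- neuralnet(form, data = shuttle.train, err.fct = "ce", 
+     linear.output = FALSE, rep = 3)
> fit3$result.matrix["error", ]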

Inspect the generalized weights. The generalized weight of the i-th covariate is its contribution to the log-odds, d log(p / (1 - p)) / d x_i, so it behaves like a logistic regression coefficient that can vary from observation to observation:

> head(fit$generalized.weights[[1]])
##         [,1]      [,2]      [,3]       [,4]      [,5]      [,6]       [,7]
## 1  -4.701598 1.9534405  4.812155 0.05846846 1.5904887 1.3651886 0.06539368
## 4  -2.355740 0.9787731  2.411135 0.02929567 0.7969158 0.6840290 0.03276556
## 5  -2.277955 0.9464547  2.331521 0.02832835 0.7706022 0.6614428 0.03168367
## 6  -2.593462 1.0775429  2.654447 0.03225195 0.8773340 0.7530556 0.03607199
## 7 -10.097313 4.1952759 10.334750 0.12556887 3.4157882 2.9319260 0.14044172
## 8  -9.798060 4.0709409 10.028460 0.12184740 3.3145548 2.8450328 0.13627946
##        [,8]        [,9]      [,10]
## 1 -4.609652 -0.18761781 -14.294612
## 4 -2.309670 -0.09400607  -7.162328
## 5 -2.233406 -0.09090206  -6.925833
## 6 -2.542743 -0.10349240  -7.885092
## 7 -9.899846 -0.40293446 -30.699599
## 8 -9.606445 -0.39099273 -29.789758

Visualize the network:

> plot(fit)
(Figure: network weights plot)

From this plot we can read the intercept (bias) terms and the weight of each variable.
Plot the generalized weights:

> par(mfrow = c(1, 2))
> # gwplot() kept failing here: Error in plot.window(...): 'ylim' values cannot be infinite
> # gwplot(fit, selected.covariate = "vis.yes")
> # gwplot(fit, selected.covariate = "wind.tail")
> # Workaround: plot the generalized weights manually. The columns of
> # fit$generalized.weights[[1]] follow the covariate order, so wind.tail is
> # column 6 and vis.yes is column 10.
> gw <- fit$generalized.weights[[1]]
> covar <- as.data.frame(fit$covariate)
> plot(covar$vis.yes, gw[, 10], main = "vis.yes", xlab = "", ylab = "")
> plot(covar$wind.tail, gw[, 6], main = "wind.tail", xlab = "", ylab = "")
(Figure: generalized weights plots)

The generalized weights of vis.yes are highly asymmetric, while those of wind.tail sit at a low level overall and are distributed very evenly, which indicates that this variable has essentially no predictive power.

First, look at how the model performs on the training set:

> results.train <- compute(fit, shuttle.train[, 1:10])
> pred.train <- results.train$net.result
> print(pred.train)
##             [,1]
## 1   1.000000e+00
## 4   1.000000e+00
## 5   1.000000e+00
## 6   1.000000e+00
## ---------------------
## 251 6.119581e-04
## 252 1.484097e-10
## 253 2.765947e-08
## 255 3.022915e-07
## 256 7.566422e-08
> pred.train <- ifelse(pred.train < 0.5, 0, 1)
> table(pred.train, shuttle.train$use)
##           
## pred.train   0   1
##          0  73   0
##          1   0 107

The neural network model reached 100% accuracy on the training set! Now let's see how it does on the test set:

> result.test <- compute(fit, shuttle.test[, 1:10])
> pred.test <- result.test$net.result
> pred.test <- ifelse(pred.test < 0.5, 0, 1)
> table(pred.test, shuttle.test$use)
##          
## pred.test  0  1
##         0 38  2
##         1  0 36

There are two errors on the test set. Let's find out which ones they are:

> which(pred.test == 0 & shuttle.test$use == 1)
## [1] 58 59

Rows 58 and 59 of the test set were predicted incorrectly.
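
To inspect those two observations directly (a quick check, not from the book):

> shuttle.test[which(pred.test == 0 & shuttle.test$use == 1), ]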

5. A Deep Learning Example

5.1 Installing H2O

1. If the h2o package has been installed before, remove it first:

> if ("package:h2o" %in% search()) {
+     detach("package:h2o", unload = TRUE)
+ }
## [1] "A shutdown has been triggered. "
> if ("package:h2o" %in% rownames(installed.packages())) {
+     remove.packages("h2o")
+ }

2. Download and install the packages that h2o depends on:

> library(pacman)
> p_load(methods, statmod, stats, graphics, RCurl, jsonlite, tools, utils)

3. Install and load the h2o package:

> p_load(h2o)

5.2 Uploading Data to the H2O Platform

> path <- "./data_set/data-master/bank_DL.csv"

Connect to the H2O platform and start an instance on the cluster:

> # nthreads = -1 lets the instance use all CPUs on the cluster
> local.h2o <- h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         5 hours 57 minutes 
##     H2O cluster timezone:       Asia/Shanghai 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.28.0.4 
##     H2O cluster version age:    15 days  
##     H2O cluster name:           H2O_started_from_R_Admin_wzk082 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.57 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 3.6.2 (2019-12-12)

The service is up. At this point it can also be seen from a browser:


(Figure: the H2O browser interface)

Upload the data file to the H2O platform:

> # alternatives: h2o.importFolder, h2o.importURL, h2o.importHDFS
> bank <- h2o.uploadFile(path = path)
> class(bank)
## [1] "H2OFrame"
> # In H2O, the output of many R functions differs from what we have seen before.
> str(bank)
## Class 'H2OFrame'  
##  - attr(*, "op")= chr "Parse"
##  - attr(*, "id")= chr "bank_DL_sid_8edb_2"
##  - attr(*, "eval")= logi FALSE
##  - attr(*, "nrow")= int 4521
##  - attr(*, "ncol")= int 64
##  - attr(*, "types")=List of 64
##   ..$ : chr "real"
## --------------------------------------------------------
##   ..$ previous_2         : num  0 0 0 0 0 0 1 0 0 1
##   ..$ previous_3         : num  0 0 0 0 0 1 0 0 0 0
##   ..$ previous_4         : num  0 1 0 0 0 0 0 0 0 0
##   ..$ previous_5         : num  0 0 0 0 0 0 0 0 0 0
##   ..$ poutcome_failure   : num  0 1 1 0 0 1 0 0 0 1
##   ..$ poutcome_other     : num  0 0 0 0 0 0 1 0 0 0
##   ..$ poutcome_success   : num  0 0 0 0 0 0 0 0 0 0
##   ..$ poutcome_unknown   : num  1 0 0 1 1 0 0 1 1 0
##   ..$ y                  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1

Check the distribution of the response variable:

> h2o.table(bank$y)
##     y Count
## 1  no  4000
## 2 yes   521

We can see that 521 customers responded "yes" to the bank's marketing campaign, while the other 4000 responded "no", so only about 11.5% of the responses are positive. The response variable is somewhat unbalanced.

5.3 Splitting into Training and Test Sets

> # create a uniform random vector for the split
> rand <- h2o.runif(bank, seed = 123)
> 
> train <- bank[rand <= 0.7, ] %>% h2o.assign(key = "train")
> test <- bank[rand > 0.7, ] %>% h2o.assign(key = "test")
> # check whether the split is balanced
> h2o.table(train[, 64])
##     y Count
## 1  no  2783
## 2 yes   396
> h2o.table(test[, 64])
##     y Count
## 1  no  1217
## 2 yes   125

5.4 Building the Model

We tune the hyperparameters with random search, which saves time compared with a full grid search. The hyperparameters to examine: the tanh activation function with and without dropout, three hidden-layer configurations (combinations of neurons), two input dropout ratios, and two learning rates.

> # list of hyperparameters for the random search
> hyper.params <- list(activation = c("Tanh", "TanhWithDropout"), 
+     hidden = list(c(20, 20), c(30, 30), c(30, 30, 30)), 
+     input_dropout_ratio = c(0, 0.05), rate = c(0.01, 0.25))
> # list of search criteria; strategy = "RandomDiscrete" requests random search
> # (for a full grid search, set it to "Cartesian")
> search.criteria <- list(
+   strategy = "RandomDiscrete", max_runtime_secs = 420,
+   max_models = 100, seed = 123, stopping_rounds = 5,
+   # stopping rule: the last 5 models are within 1% of one another
+   stopping_tolerance = 0.01
+ )
> random.search <- h2o.grid(
+   # the deep learning algorithm
+   algorithm = "deeplearning",
+   grid_id = "random.search",
+   # training set
+   training_frame = train,
+   # validation set
+   validation_frame = test,
+   # input features
+   x = 1:63,
+   # response variable
+   y = 64,
+   epochs = 1,
+   stopping_metric = "misclassification",
+   hyper_params = hyper.params,
+   search_criteria = search.criteria
+ )

Examine the results of the five best models:

> grid <- h2o.getGrid("random.search", sort_by = "auc", decreasing = T)
> grid
## H2O Grid Details
## ================
## 
## Grid ID: random.search 
## Used hyper parameters: 
##   -  activation 
##   -  hidden 
##   -  input_dropout_ratio 
##   -  rate 
## Number of models: 24 
## Number of failed models: 0 
## 
## Hyper-Parameter Search Summary: ordered by decreasing auc
##        activation       hidden input_dropout_ratio rate              model_ids
## 1 TanhWithDropout     [30, 30]                 0.0 0.01 random.search_model_17
## 2 TanhWithDropout     [30, 30]                0.05 0.01 random.search_model_16
## 3 TanhWithDropout     [30, 30]                 0.0 0.25  random.search_model_6
## 4 TanhWithDropout [30, 30, 30]                0.05 0.01 random.search_model_23
## 5            Tanh [30, 30, 30]                0.05 0.01 random.search_model_14
##                  auc
## 1 0.8635497124075596
## 2 0.8588824979457683
## 3 0.8580049301561217
## 4   0.85473459326212
## 5 0.8506162695152013
## 
## ---
##         activation       hidden input_dropout_ratio rate
## 19            Tanh     [20, 20]                0.05 0.25
## 20 TanhWithDropout     [20, 20]                 0.0 0.25
## 21            Tanh [30, 30, 30]                0.05 0.25
## 22            Tanh     [30, 30]                0.05 0.01
## 23 TanhWithDropout [30, 30, 30]                 0.0 0.01
## 24 TanhWithDropout [30, 30, 30]                0.05 0.25
##                 model_ids                auc
## 19 random.search_model_18 0.8081840591618734
## 20  random.search_model_8  0.802872637633525
## 21  random.search_model_3  0.798695152013147
## 22  random.search_model_2 0.7974330320460148
## 23 random.search_model_22 0.7543467543138865
## 24 random.search_model_11  0.716936729663106

So model 17 is the final winner: tanh activation with dropout, two hidden layers of 30 neurons each, an input dropout ratio of 0.0, and a learning rate of 0.01, with an AUC of about 0.864.
Check the model's performance on the test set with a confusion matrix:

> best.model <- h2o.getModel(grid@model_ids[[1]])
> h2o.confusionMatrix(best.model, valid = T)
## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.123302996428301:
##          no yes    Error       Rate
## no     1133  84 0.069022   =84/1217
## yes      60  65 0.480000    =60/125
## Totals 1193 149 0.107303  =144/1342

Although the overall error rate is below 11%, the model makes far too many errors on the yes label, misclassifying 48% of the actual yes cases. This suggests that the unbalanced classes may be a problem.
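
As a side note (a sketch, not in the original notes), the validation AUC of the winning model can also be pulled out directly:

> h2o.auc(best.model, valid = TRUE)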

5.5 Building a Model with Cross-Validation

> dlmodel <- h2o.deeplearning(x = 1:63, y = 64, training_frame = train, 
+     # the winning architecture from the random search
+     hidden = c(30, 30), 
+     # 5-fold stratified cross-validation, oversampling the minority class
+     epochs = 3, nfolds = 5, fold_assignment = "Stratified", balance_classes = T, 
+     activation = "TanhWithDropout", seed = 123, adaptive_rate = F, input_dropout_ratio = 0, 
+     stopping_metric = "misclassification", variable_importances = T)

> dlmodel
## Model Details:
## ==============
## 
## H2OBinomialModel: deeplearning
## Model ID:  DeepLearning_model_R_1583821038674_163 
## Status of Neuron Layers: predicting y, 2-class classification, bernoulli distribution, CrossEntropy loss,
## 2,912 weights/biases, 22.9 KB, 18,957 training samples, mini-batch size 1
##   layer units        type dropout       l1       l2 mean_rate rate_rms
## 1     1    63       Input  0.00 %       NA       NA        NA       NA
## 2     2    30 TanhDropout 50.00 % 0.000000 0.000000  0.004907 0.000000
## 3     3    30 TanhDropout 50.00 % 0.000000 0.000000  0.004907 0.000000
## 4     4     2     Softmax      NA 0.000000 0.000000  0.004907 0.000000
##   momentum mean_weight weight_rms mean_bias bias_rms
## 1       NA          NA         NA        NA       NA
## 2 0.000000    0.025206   0.814655  0.147166 0.735583
## 3 0.000000   -0.005851   0.337015 -0.084604 0.411204
## 4 0.000000    0.053744   0.769235 -0.004516 0.303754
## 
## 
## H2OBinomialMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on full training frame **
## 
## MSE:  0.1826312
## RMSE:  0.4273537
## LogLoss:  0.5439753
## Mean Per-Class Error:  0.1502497
## AUC:  0.9159916
## AUCPR:  0.8807761
## Gini:  0.8319832
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          no  yes    Error       Rate
## no     2108  675 0.242544  =675/2783
## yes     161 2617 0.057955  =161/2778
## Totals 2269 3292 0.150333  =836/5561
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.105880    0.862273 298
## 2                       max f2  0.034591    0.918540 352
## 3                 max f0point5  0.300881    0.845340 190
## 4                 max accuracy  0.125034    0.850207 286
## 5                max precision  0.595273    0.942308   4
## 6                   max recall  0.004675    1.000000 393
## 7              max specificity  0.600539    0.999641   0
## 8             max absolute_mcc  0.105880    0.711645 298
## 9   max min_per_class_accuracy  0.245084    0.838733 218
## 10 max mean_per_class_accuracy  0.125034    0.850278 286
## 11                     max tns  0.600539 2782.000000   0
## 12                     max fns  0.600539 2764.000000   0
## 13                     max fps  0.003311 2783.000000 399
## 14                     max tps  0.004675 2778.000000 393
## 15                     max tnr  0.600539    0.999641   0
## 16                     max fnr  0.600539    0.994960   0
## 17                     max fpr  0.003311    1.000000 399
## 18                     max tpr  0.004675    1.000000 393
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(, )` or `h2o.gainsLift(, valid=, xval=)`
## 
## H2OBinomialMetrics: deeplearning
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.08705731
## RMSE:  0.2950548
## LogLoss:  0.2875898
## Mean Per-Class Error:  0.2193127
## AUC:  0.872492
## AUCPR:  0.4780927
## Gini:  0.744984
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          no yes    Error       Rate
## no     2497 286 0.102767  =286/2783
## yes     133 263 0.335859   =133/396
## Totals 2630 549 0.131802  =419/3179
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.424191    0.556614 172
## 2                       max f2  0.122184    0.657076 295
## 3                 max f0point5  0.424191    0.507330 172
## 4                 max accuracy  0.717398    0.884869  63
## 5                max precision  0.957903    0.800000   3
## 6                   max recall  0.000684    1.000000 398
## 7              max specificity  0.972282    0.999281   0
## 8             max absolute_mcc  0.424191    0.490448 172
## 9   max min_per_class_accuracy  0.188988    0.804887 266
## 10 max mean_per_class_accuracy  0.122184    0.809987 295
## 11                     max tns  0.972282 2781.000000   0
## 12                     max fns  0.972282  395.000000   0
## 13                     max fps  0.000346 2783.000000 399
## 14                     max tps  0.000684  396.000000 398
## 15                     max tnr  0.972282    0.999281   0
## 16                     max fnr  0.972282    0.997475   0
## 17                     max fpr  0.000346    1.000000 399
## 18                     max tpr  0.000684    1.000000 398
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(, )` or `h2o.gainsLift(, valid=, xval=)`
## Cross-Validation Metrics Summary: 
##                 mean          sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
## accuracy   0.8826488 0.019553455  0.9073783  0.8575949 0.88621795 0.89269054
## auc        0.8777314 0.022875382  0.8861917 0.89000493  0.8754107 0.89764535
## aucpr     0.47863057 0.055181786 0.52034676  0.4284161 0.52908653 0.50504553
## err       0.11735118 0.019553455 0.09262166 0.14240506 0.11378205 0.10730949
## err_count       74.6   12.381437       59.0       90.0       71.0       69.0
##           cv_5_valid
## accuracy  0.86936235
## auc       0.83940434
## aucpr     0.41025785
## err       0.13063763
## err_count       84.0
## 
## ---
##                   mean          sd cv_1_valid cv_2_valid  cv_3_valid
## pr_auc      0.47863057 0.055181786 0.52034676  0.4284161  0.52908653
## precision    0.5355106    0.077891  0.5882353 0.42741936   0.5903614
## r2           0.2003508  0.08083034 0.18793476 0.22176513 0.086192444
## recall       0.6443926 0.075244226  0.5633803  0.7361111   0.5697674
## rmse        0.29464537  0.02022303 0.28359163 0.28028414  0.32952103
## specificity  0.9167673 0.031265505 0.95053005  0.8732143    0.936803
##             cv_4_valid cv_5_valid
## pr_auc      0.50504553 0.41025785
## precision    0.5940594  0.4774775
## r2          0.31192234 0.19393937
## recall       0.6818182  0.6708861
## rmse        0.28509894  0.2947311
## specificity  0.9261261  0.8971631

Now look at the performance on the test set:

> perf <- h2o.performance(dlmodel, test)

> perf
## H2OBinomialMetrics: deeplearning
## 
## MSE:  0.07173268
## RMSE:  0.2678296
## LogLoss:  0.2341893
## Mean Per-Class Error:  0.2004174
## AUC:  0.8735283
## AUCPR:  0.3951697
## Gini:  0.7470567
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          no yes    Error       Rate
## no     1031 186 0.152835  =186/1217
## yes      31  94 0.248000    =31/125
## Totals 1062 280 0.161699  =217/1342
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold       value idx
## 1                       max f1  0.257779    0.464198 175
## 2                       max f2  0.183122    0.609168 210
## 3                 max f0point5  0.575237    0.496183  25
## 4                 max accuracy  0.576611    0.915052  23
## 5                max precision  0.604630    1.000000   0
## 6                   max recall  0.003262    1.000000 399
## 7              max specificity  0.604630    1.000000   0
## 8             max absolute_mcc  0.257779    0.428554 175
## 9   max min_per_class_accuracy  0.183122    0.808000 210
## 10 max mean_per_class_accuracy  0.117068    0.816250 242
## 11                     max tns  0.604630 1217.000000   0
## 12                     max fns  0.604630  124.000000   0
## 13                     max fps  0.003262 1217.000000 399
## 14                     max tps  0.003262  125.000000 399
## 15                     max tnr  0.604630    1.000000   0
## 16                     max fnr  0.604630    0.992000   0
## 17                     max fpr  0.003262    1.000000 399
## 18                     max tpr  0.003262    1.000000 399
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(, )` or `h2o.gainsLift(, valid=, xval=)`

For comparison:
The cross-validation confusion matrix on the training data:

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
         no yes    Error       Rate
no     2497 286 0.102767  =286/2783
yes     133 263 0.335859   =133/396
Totals 2630 549 0.131802  =419/3179

The confusion matrix on the test set:

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
         no yes    Error       Rate
no     1031 186 0.152835  =186/1217
yes      31  94 0.248000    =31/125
Totals 1062 280 0.161699  =217/1342

The overall error rate rose (from 13.2% to 16.2%), although the false negative rate on yes fell, so more tuning work is needed.
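
One easy tuning lever is the classification cutoff. A hedged sketch (not from the book; the 0.3 cutoff is an illustrative assumption): score the test set with h2o.predict() and threshold the yes probability manually:

> pred <- as.data.frame(h2o.predict(dlmodel, test))
> table(ifelse(pred$yes > 0.3, "yes", "no"), as.data.frame(test$y)$y)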
Finally, we can compute variable importance. In the table, the variables are sorted in order of importance, but variable importance is affected by sampling variation: with a different random seed, the order may well change. Here are the top five and the bottom six variables by importance:

> dlmodel@model$variable_importances
## Variable Importances: 
##           variable relative_importance scaled_importance percentage
## 1         duration            1.000000          1.000000   0.105319
## 2 poutcome_success            0.738116          0.738116   0.077738
## 3        month_oct            0.415810          0.415810   0.043793
## 4        month_mar            0.282554          0.282554   0.029758
## 5 poutcome_unknown            0.263573          0.263573   0.027759
## 
## ---
##             variable relative_importance scaled_importance percentage
## 58 contact_telephone            0.072925          0.072925   0.007680
## 59    job_unemployed            0.071896          0.071896   0.007572
## 60        campaign_3            0.070856          0.070856   0.007463
## 61     job_housemaid            0.066491          0.066491   0.007003
## 62        campaign_6            0.065220          0.065220   0.006869
## 63       campaign_10            0.065091          0.065091   0.006855
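
For a visual version of this table, h2o provides a plotting helper (a sketch, not run here):

> h2o.varimp_plot(dlmodel, num_of_features = 10)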
