MachineLearning 8. Machine Learning for Cancer Diagnosis: Random Forest

Introduction

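A random forest is built from many decision trees that are grown independently of one another. For a classification task, each new input sample is passed to every tree in the forest. Each tree produces its own class prediction, and the forest returns the class that receives the most votes as the final result.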
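The voting step can be sketched in a few lines of base R. The vote matrix below is made up for illustration (5 hypothetical trees, 3 samples), and `majority_vote` is a helper name introduced here, not part of any package.

```r
# Hypothetical votes: one row per tree, one column per sample.
votes <- matrix(c("B", "M", "B",
                  "B", "M", "M",
                  "M", "M", "B",
                  "B", "B", "B",
                  "B", "M", "M"),
                nrow = 5, byrow = TRUE)

# For one sample, return the class that received the most votes.
# Ties go to the class that comes first alphabetically in this sketch.
majority_vote <- function(v) names(which.max(table(v)))

# Apply the vote column by column (sample by sample).
forest_pred <- apply(votes, 2, majority_vote)
forest_pred
```

A real forest does the same thing with hundreds of trees, which is why a single unusual tree rarely changes the final class.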

Basic principles

Random forests use two techniques during model building to improve predictive performance.

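The first is bootstrap aggregation, or bagging. In bagging, each tree is built on a random sample of the dataset drawn with replacement, which contains roughly 2/3 of the distinct observations (remember that the remaining 1/3 or so is called the out-of-bag, or OOB, data). This is repeated dozens or hundreds of times and the results are averaged. Each tree is grown fully, with no pruning based on an error measure, so every individual tree has high variance. Averaging the results reduces the variance without increasing the bias.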
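The "about 2/3 in the bag, 1/3 out of bag" split can be checked with a small simulation. This is a sketch in base R; the sample size of 1000 and the 200 repetitions are arbitrary choices made here.

```r
set.seed(123)
n <- 1000  # hypothetical number of observations

# Repeat the bootstrap draw and record the share of distinct rows in each bag.
in_bag_frac <- replicate(200, {
  idx <- sample(n, n, replace = TRUE)  # bootstrap sample of row indices
  length(unique(idx)) / n              # fraction of distinct rows drawn
})

mean(in_bag_frac)      # close to 1 - exp(-1), roughly 0.632
1 - mean(in_bag_frac)  # out-of-bag share, roughly 0.368
```

The rows left out of a given bag are the ones used to compute that tree's contribution to the OOB error.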

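The second is as follows: in addition to randomly sampling the data (bagging), the input features are also randomly sampled at every split of each tree. The randomForest package uses a default for the number of features sampled at each split (the mtry argument):

For classification, the default is the square root of the number of predictors, rounded down;

For regression, the default is the number of predictors divided by 3, rounded down.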
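As a quick check of these defaults, the sketch below reproduces them in base R. `default_mtry` is a hypothetical helper written for this illustration; it mirrors the default documented for the mtry argument of randomForest but is not a function from the package.

```r
# Hypothetical helper mirroring the documented mtry defaults.
default_mtry <- function(p, type = c("classification", "regression")) {
  type <- match.arg(type)
  if (type == "classification") floor(sqrt(p)) else max(floor(p / 3), 1)
}

default_mtry(30, "classification")  # 30 predictors (breast cancer data)
default_mtry(8, "regression")       # 8 predictors (prostate data)

# At one split, the candidate features are a random subset of that size.
set.seed(1)
candidates <- sample(30, default_mtry(30, "classification"))
candidates
```

These values agree with the "No. of variables tried at each split" lines printed by the two models in this post (5 and 2).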

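The number of features randomly selected at each split can be changed during model tuning. Sampling features at every split lessens the influence of highly correlated predictors, which otherwise tend to dominate the individual trees produced by bagging alone. Because the trees become less correlated with one another, averaging them generalizes better and is less sensitive to outliers than bagging alone.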
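The benefit of decorrelating the trees can be made concrete with a standard textbook result, given here as background rather than taken from the original post. Assume the B trees are identically distributed, each with variance σ² and pairwise correlation ρ. The variance of their average is then:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

Growing more trees shrinks only the second term. The first term, ρσ², stays no matter how many trees are averaged, so lowering ρ through feature sampling is what lets a random forest improve on plain bagging.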

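Worked example

We use two datasets: prostate, whose outcome variable is continuous, and BreastCancer, whose outcome variable is categorical. This shows how to build a random forest for both continuous and categorical outcomes.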

1. Software installation

Random forests are implemented here with the randomForest package. First install and load it:

if (!require(randomForest)) install.packages("randomForest")

library(randomForest)

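2. Classification random forest

The data come from the book Machine Learning with R and originally from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/ Download the two files wdbc.data and wdbc.names; the data used here have been tidied into a single table. Looking at the structure, the first column is an id column that carries no predictive information and must be removed. The second column, diagnosis, is the response variable and is stored as character; classification tasks in R generally require the response to be a factor, so it needs to be converted. The remaining columns are numeric predictors. The data have 568 samples and 32 columns (the id, the response, and 30 predictors).

### Classification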
BreastCancer <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
dim(BreastCancer)
## [1] 568  32
table(BreastCancer$diagnosis)
## 
##   B   M 
## 357 211
library(tidyverse)
data <- select(BreastCancer, -1) %>%
    mutate_at("diagnosis", as.factor)
sum(is.na(data))  ## check for missing values
## [1] 0

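When only one dataset is available, it can be split into a training set and a test set. For details on how to split, see the earlier post in this series: Topic 5. Sample Size Determination and Splitting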

library(sampling)
set.seed(123)
# Sample 70% of each stratum (class)
train_id <- strata(data, "diagnosis", size = rev(round(table(data$diagnosis) * 0.7)))$ID_unit
# Training data
train_data <- data[train_id, ]
# Test data
test_data <- data[-train_id, ]

# Check the class proportions in the training and test data
prop.table(table(train_data$diagnosis))
## 
##         B         M 
## 0.6281407 0.3718593
prop.table(table(test_data$diagnosis))
## 
##         B         M 
## 0.6294118 0.3705882
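For readers without the sampling package, a stratified 70/30 split can be sketched in base R. This is an alternative illustration, not the method used above: the `toy` data frame is hypothetical and only reuses the class counts of the tutorial data (357 B, 211 M).

```r
set.seed(123)

# Hypothetical labels with the same class counts as the tutorial data.
toy <- data.frame(diagnosis = factor(rep(c("B", "M"), times = c(357, 211))))

# Group the row indices by class, then draw 70% of the indices within each class.
idx_by_class <- split(seq_len(nrow(toy)), toy$diagnosis)
train_idx <- unlist(lapply(idx_by_class,
                           function(i) sample(i, round(0.7 * length(i)))))

toy_train <- toy[train_idx, , drop = FALSE]
toy_test  <- toy[-train_idx, , drop = FALSE]

# Class proportions are preserved in both parts.
prop.table(table(toy_train$diagnosis))
prop.table(table(toy_test$diagnosis))
```

Sampling within each class keeps the B to M ratio nearly identical in the two parts, which is what the proportions printed above confirm for the real split.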

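Building the classification random forest model

We use the breast cancer diagnosis (malignant or benign) as the outcome variable, fit a random forest, and find the number of trees with the lowest out-of-bag (OOB) error rate: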

set.seed(123)
rf.biop <- randomForest(diagnosis ~ ., data = train_data)
rf.biop
## 
## Call:
##  randomForest(formula = diagnosis ~ ., data = train_data) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 4.77%
## Confusion matrix:
##     B   M class.error
## B 244   6  0.02400000
## M  13 135  0.08783784
plot(rf.biop)

which.min(rf.biop$err.rate[, 1])
## [1] 53
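For a classification forest, the `err.rate` component has one row per forest size and a first column named "OOB", followed by one column per class. The sketch below uses a small made-up matrix of that shape (the numbers are hypothetical) to show what the `which.min()` call selects.

```r
# Hypothetical error matrix shaped like rf.biop$err.rate (5 forest sizes).
err <- cbind(OOB = c(0.080, 0.061, 0.048, 0.052, 0.048),
             B   = c(0.050, 0.040, 0.024, 0.030, 0.024),
             M   = c(0.130, 0.095, 0.088, 0.090, 0.088))

# which.min() returns the first row with the lowest OOB error,
# i.e. the smallest forest that reaches the minimum.
best <- which.min(err[, "OOB"])
best
```

Selecting by column name instead of position 1 makes the intent explicit. The OOB curve is noisy, so the exact minimum can move with the random seed.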

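Optimal number of trees

The OOB error is lowest with 53 trees, so we refit the random forest with ntree = 53. The OOB error rate drops from 4.77% to 3.52%. On the test set (evaluated in the next step) only 6 samples are misclassified (false positives plus false negatives), for an accuracy of 0.9647, which is quite high: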

set.seed(123)
rf.biop.2 <- randomForest(diagnosis ~ ., data = train_data, ntree = 53)
getTree(rf.biop, 1)
##    left daughter right daughter split var split point status prediction
## 1              2              3         8    0.051455      1          0
## 2              4              5        22   28.025000      1          0
## 3              6              7         2   15.900000      1          0
## 4              0              0         0    0.000000     -1          1
## 5              8              9         4  697.800000      1          0
## 6             10             11         8    0.079050      1          0
## 7             12             13        29    0.228650      1          0
## 8             14             15        24  766.450000      1          0
## 9             16             17        21   17.900000      1          0
## 10             0              0         0    0.000000     -1          1
## 11             0              0         0    0.000000     -1          2
## 12            18             19         2   20.595000      1          0
## 13            20             21        28    0.142350      1          0
## 14             0              0         0    0.000000     -1          1
## 15            22             23         3   90.995000      1          0
## 16             0              0         0    0.000000     -1          2
## 17             0              0         0    0.000000     -1          1
## 18             0              0         0    0.000000     -1          1
## 19             0              0         0    0.000000     -1          2
## 20            24             25         9    0.195450      1          0
## 21             0              0         0    0.000000     -1          2
## 22             0              0         0    0.000000     -1          2
## 23             0              0         0    0.000000     -1          1
## 24             0              0         0    0.000000     -1          2
## 25            26             27        17    0.030105      1          0
## 26             0              0         0    0.000000     -1          2
## 27             0              0         0    0.000000     -1          1
rf.biop.2
## 
## Call:
##  randomForest(formula = diagnosis ~ ., data = train_data, ntree = 53) 
##                Type of random forest: classification
##                      Number of trees: 53
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 3.52%
## Confusion matrix:
##     B   M class.error
## B 246   4  0.01600000
## M  10 138  0.06756757

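Test set prediction

We predict on the test set and summarize the results in a confusion matrix. The variable importance plot drawn at the end shows about 6 variables that act as the key drivers of the classification: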

library(caret)
rf.biop.test <- predict(rf.biop.2, newdata = test_data, type = "response")
table(rf.biop.test, test_data$diagnosis)
##             
## rf.biop.test   B   M
##            B 106   5
##            M   1  58
confusionMatrix(rf.biop.test, test_data$diagnosis, positive = "B")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 106   5
##          M   1  58
##                                           
##                Accuracy : 0.9647          
##                  95% CI : (0.9248, 0.9869)
##     No Information Rate : 0.6294          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9233          
##                                           
##  Mcnemar's Test P-Value : 0.2207          
##                                           
##             Sensitivity : 0.9907          
##             Specificity : 0.9206          
##          Pos Pred Value : 0.9550          
##          Neg Pred Value : 0.9831          
##              Prevalence : 0.6294          
##          Detection Rate : 0.6235          
##    Detection Prevalence : 0.6529          
##       Balanced Accuracy : 0.9556          
##                                           
##        'Positive' Class : B               
## 
varImpPlot(rf.biop.2)
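The headline statistics in the confusionMatrix() output can be recomputed by hand from the four counts. A base-R sketch, with "B" treated as the positive class as in the call above (`cohen_kappa` is a variable name chosen here):

```r
# Test-set confusion matrix: rows are predictions, columns are the reference.
cm <- matrix(c(106, 1, 5, 58), nrow = 2,
             dimnames = list(Prediction = c("B", "M"), Reference = c("B", "M")))

n <- sum(cm)
accuracy    <- sum(diag(cm)) / n               # correct predictions / all
sensitivity <- cm["B", "B"] / sum(cm[, "B"])   # true B predicted as B
specificity <- cm["M", "M"] / sum(cm[, "M"])   # true M predicted as M

# Cohen's kappa: agreement beyond what the marginals would give by chance.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, kappa = cohen_kappa), 4)
```

Because "B" is the positive class here, the 5 malignant cases predicted as benign show up in the specificity (0.9206), not in the sensitivity.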

Plotting the ROC curve

To assess model performance, a classification random forest is judged by its classification accuracy.

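### Plot the ROC curve
library(ROSE)
# roc.curve() takes the true labels first and the predictions second.
# Hard class labels give a single-threshold curve; class probabilities
# from predict(..., type = "prob") would give a full curve.
roc.curve(test_data$diagnosis, rf.biop.test, main = "ROC curve of randomForest",
    col = 2, lwd = 2, lty = 2)
## Area under the curve (AUC): 0.956
legend("bottomright", "AUC:0.956", col = 2, lty = 2, lwd = 2, bty = "n")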
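When an ROC curve is drawn from hard class labels instead of probabilities, it has a single interior point, so its area reduces to the mean of sensitivity and specificity. The base-R sketch below works this out from the test-set confusion matrix counts and shows how the value changes with the direction of scoring.

```r
# Counts from the test-set confusion matrix, with "B" as the positive class.
tp <- 106; fn <- 1; fp <- 5; tn <- 58

sens <- tp / (tp + fn)
spec <- tn / (tn + fp)

# Area under the curve through (0,0), (1 - spec, sens), (1,1).
auc_labels <- (sens + spec) / 2

# Treating the predictions as the reference and the true labels as the
# score averages the two predictive values instead.
ppv <- tp / (tp + fp)
npv <- tn / (tn + fn)
auc_swapped <- (ppv + npv) / 2

round(c(auc_labels = auc_labels, auc_swapped = auc_swapped), 3)
```

The two directions give 0.956 and 0.969, so the argument order passed to an ROC function matters whenever the inputs are class labels.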

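3. Regression random forest

For the regression random forest we use the prostate cancer data from the ElemStatLearn package, whose outcome variable is continuous. Load and split the data as follows: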

library(ElemStatLearn)
data(prostate)
prostate$gleason <- ifelse(prostate$gleason == 6, 0, 1)
pros.train <- subset(prostate, train == TRUE)[, 1:9]
pros.test <- subset(prostate, train == FALSE)[, 1:9]

Building the regression random forest model

################ RF
set.seed(123)
rf.pros <- randomForest(lpsa ~ ., data = pros.train)
rf.pros
## 
## Call:
##  randomForest(formula = lpsa ~ ., data = pros.train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 0.6936697
##                     % Var explained: 51.73
plot(rf.pros)
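The "% Var explained" line is a pseudo R-squared: one minus the OOB mean of squared residuals divided by the variance of the response. The sketch below assumes that relationship and checks that the two printed model summaries in this post imply the same response variance. `pseudo_r2` is a helper name introduced here.

```r
# Hypothetical helper: share of variance explained given an MSE.
pseudo_r2 <- function(mse, var_y) 1 - mse / var_y

# Back out the response variance implied by each printed summary
# (500 trees: MSE 0.6936697, 51.73%; 80 trees: MSE 0.6566502, 54.31%).
implied_var <- c(ntree500 = 0.6936697 / (1 - 0.5173),
                 ntree80  = 0.6566502 / (1 - 0.5431))
implied_var

# Plugging the shared variance back in recovers the printed percentage.
round(100 * pseudo_r2(0.6566502, mean(implied_var)), 1)
```

Both summaries point to a response variance of about 1.437, so the gain from 51.73% to 54.31% comes entirely from the lower OOB error.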

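Optimal number of trees

The lowest error is reached at 80 trees. Refitting the model with ntree = 80 shows that lcavol and lweight are the most important variables in the model: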

which.min(rf.pros$mse)
## [1] 80
set.seed(123)
rf.pros.2 <- randomForest(lpsa ~ ., data = pros.train, ntree = 80)
rf.pros.2
## 
## Call:
##  randomForest(formula = lpsa ~ ., data = pros.train, ntree = 80) 
##                Type of random forest: regression
##                      Number of trees: 80
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 0.6566502
##                     % Var explained: 54.31
varImpPlot(rf.pros.2, scale = TRUE, main = "Variable Importance Plot - PSA Score")

importance(rf.pros.2)
##         IncNodePurity
## lcavol      25.011557
## lweight     15.822110
## age          7.167320
## lbph         5.471032
## svi          8.497838
## lcp          8.113947
## gleason      4.990213
## pgg45        6.663911

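Test set validation

The mean of the squared residuals on the test set is 0.551. Note that for a regression random forest the goal is to make this squared residual error as small as possible: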

rf.pros.test <- predict(rf.pros.2, newdata = pros.test)
plot(rf.pros.test, pros.test$lpsa)

rf.resid <- rf.pros.test - pros.test$lpsa  #calculate residual
mean(rf.resid^2)
## [1] 0.5512549
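The quantity computed above is a mean of squared residuals (MSE). It differs from the residual sum of squares (RSS) only by a factor of the sample size. A short sketch with made-up numbers shows how the two relate:

```r
# Hypothetical observed and predicted values, for illustration only.
obs  <- c(2.1, 3.0, 1.4, 2.8)
pred <- c(1.9, 3.4, 1.0, 2.5)

resid <- pred - obs
rss  <- sum(resid^2)        # residual sum of squares
mse  <- rss / length(resid) # mean squared error, same as mean(resid^2)
rmse <- sqrt(mse)           # back on the scale of the response

c(rss = rss, mse = mse, rmse = rmse)
```

MSE is the better choice when comparing models across test sets of different sizes, because RSS grows with the number of observations.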

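Plotting the ROC curve

We again plot an ROC curve, but with a different package. Here the continuous predictions are scored against the binarized gleason variable, so the choice of method also depends on the data type of the outcome variable.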

## ROC
library(ROCR)
pred <- prediction(predictions = rf.pros.test, labels = pros.test$gleason)

perf <- performance(prediction.obj = pred, measure = "tpr", x.measure = "fpr")
perf
## A performance instance
##   'False positive rate' vs. 'True positive rate' (alpha: 'Cutoff')
##   with 31 data points
# x axis is the false positive rate and y axis the true positive rate
plot(perf, colorize = TRUE, main = "ROC", lwd = 2, xlab = "False positive rate", ylab = "True positive rate",
    box.lty = 7, box.lwd = 2, box.col = "gray")
abline(a = 0, b = 1, lty = 2, col = "gray")
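What a ROC routine does with continuous scores can be sketched in base R: sort the cases by score, sweep the cutoff from high to low, and accumulate the true and false positive rates. The scores and labels below are made up for illustration.

```r
# Hypothetical continuous scores and binary labels (1 = positive).
scores <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2)
labels <- c(1, 1, 0, 1, 0, 1, 0, 0)

# Lowering the cutoff admits one case at a time, highest score first.
ord <- order(scores, decreasing = TRUE)
tpr <- c(0, cumsum(labels[ord] == 1) / sum(labels == 1))
fpr <- c(0, cumsum(labels[ord] == 0) / sum(labels == 0))

# Area under the curve by the trapezoid rule.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc

plot(fpr, tpr, type = "s", xlab = "False positive rate",
     ylab = "True positive rate", main = "ROC (toy data)")
abline(a = 0, b = 1, lty = 2, col = "gray")
```

The AUC equals the share of positive and negative pairs in which the positive case has the higher score, here 13 out of 16.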

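Interpreting the results

Let us recall which machine learning methods we have applied to the breast cancer data.

So far we have tried three machine learning algorithms on the breast cancer data:

K-nearest neighbors (KNN): accuracy 0.9471;

Support vector machine (SVM): accuracy 0.9765;

Classification random forest (RF): accuracy 0.9647.

We have likewise tried three machine learning algorithms on the prostate cancer data:

Lasso regression: RSS 0.5084126;

Regression tree (RT): RSS 0.5267748;

Regression random forest: 0.551 (computed above as the mean of the squared residuals).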

This shows that modeling should not rely on a single algorithm: compare several and pick the best one!
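The comparison can be written down as a small lookup. The sketch below reuses the figures reported in this series; it assumes all of them were measured on the same test split, which is what makes them comparable.

```r
# Test-set accuracy on the breast cancer data (higher is better).
acc <- c(KNN = 0.9471, SVM = 0.9765, RF = 0.9647)

# Test-set error on the prostate data (lower is better).
err <- c(Lasso = 0.5084126, RT = 0.5267748, RF = 0.5512549)

best_classifier <- names(which.max(acc))
best_regressor  <- names(which.min(err))
c(classification = best_classifier, regression = best_regressor)
```

On these numbers the random forest is competitive but not the winner in either task, which is exactly why several algorithms should be tried.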

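Also pay attention to how the ROC curve is drawn: we used ROSE for the classification random forest and ROCR for the regression random forest, so choose the method to suit the task.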

