[R - ml] Using ensembles with various models

Benefits of ensembles

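  • No need to agonize over finding a single best model
  • Generalizes better to future problems
  • Better performance on large datasets, since the work can be done in parallel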

Using random forests in R

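Random forest is an ensemble method built on decision trees.
It combines bagging with random selection of features to make the individual trees more diverse.
Once the ensemble has been built, the trees still have to vote to settle their disagreements.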
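
The voting step can be sketched in base R. This is only an illustration, assuming three hypothetical classifiers whose predicted labels are already available; `votes` and `majority_vote` are made-up names, not part of randomForest.

```r
# Hypothetical predictions from three classifiers for five cases
# (made-up labels, for illustration only).
votes <- data.frame(
  tree1 = c("yes", "no",  "yes", "no", "yes"),
  tree2 = c("yes", "yes", "no",  "no", "yes"),
  tree3 = c("no",  "yes", "yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Return the most frequent label in a vector of votes.
majority_vote <- function(labels) {
  counts <- table(labels)
  names(counts)[which.max(counts)]
}

# Apply the vote row by row: one ensemble prediction per case.
ensemble_pred <- apply(votes, 1, majority_vote)
ensemble_pred
```

With an odd number of voters and two classes there are no ties; with more classes a tie-breaking rule would be needed (here `which.max` simply takes the first maximum).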

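Random forests cope well with data that has a very large number of features, select only the most important features, and are easy to use.
The trade-off is that the model is harder to interpret, and it may need tuning to fit the data.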

m = randomForest(train, class, ntree = 500, mtry = sqrt(p))

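  • ntree: number of trees to grow
  • mtry: optional; the number of features randomly sampled as candidates at each split. The default for classification is sqrt(p), where p is the number of features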
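
As a small illustration of `mtry`, the sketch below computes the package defaults as I understand them from the randomForest documentation (floor(sqrt(p)) for classification, max(floor(p/3), 1) for regression) and draws one random feature subset. The feature names and p = 16 are hypothetical.

```r
set.seed(1)

# Hypothetical feature names; p is the number of predictors.
features <- paste0("x", 1:16)
p <- length(features)

# Default mtry values as documented for randomForest:
# classification uses floor(sqrt(p)), regression uses max(floor(p / 3), 1).
mtry_class <- floor(sqrt(p))
mtry_reg   <- max(floor(p / 3), 1)

# At each split only a random subset of mtry features is considered.
candidates <- sample(features, size = mtry_class)

mtry_class
mtry_reg
candidates
```
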
# install.packages('randomForest')
library(randomForest)
set.seed(300)
rf = randomForest(Good.Loan ~. , data = credit)
rf # OOB error rate of 23.8%, a big improvement over a single decision tree

OOB stands for the out-of-bag error rate, which serves as an estimate of the error rate on future (unseen) data.
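
To see where out-of-bag cases come from, here is a base-R sketch (not randomForest's internal code): each tree is grown on a bootstrap sample drawn with replacement, and the rows that were never drawn are that tree's out-of-bag cases. The values of `n` and `n_trees` are arbitrary.

```r
set.seed(300)

n <- 1000        # number of training rows (arbitrary)
n_trees <- 200   # number of bootstrap samples to draw (arbitrary)

# For each bootstrap sample, compute the share of rows left out of the bag.
oob_share <- sapply(seq_len(n_trees), function(i) {
  in_bag <- sample(n, size = n, replace = TRUE)
  mean(!(seq_len(n) %in% in_bag))
})

# The average share is close to (1 - 1/n)^n, about 0.368 for large n.
mean(oob_share)
(1 - 1 / n)^n
```

Because roughly a third of the rows are unseen by any given tree, each row can be scored by the trees that did not train on it, which is what makes the OOB estimate possible without a separate test set.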

We can also use caret

library(caret)
ctrl = trainControl(method = 'repeatedcv', number = 5, repeats = 5)
grid_rf = expand.grid(.mtry = c(2, 4, 8, 16))
set.seed(300)
m_rf = train(Good.Loan ~., data = credit, method = 'rf',
             metric = 'Kappa', trControl = ctrl,
             tuneGrid = grid_rf)
m_rf # the best mtry is 16
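
Since the tuning metric here is Kappa, a short base-R sketch of Cohen's kappa computed from a confusion matrix may help. The counts are made up, and `cohen_kappa` is a hypothetical helper, not a caret function.

```r
# Made-up confusion matrix: rows = predicted, columns = actual.
conf_mat <- matrix(c(40, 10,
                     5,  45),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(predicted = c("no", "yes"),
                                   actual    = c("no", "yes")))

# Cohen's kappa: agreement beyond what chance alone would give.
cohen_kappa <- function(tab) {
  n  <- sum(tab)
  po <- sum(diag(tab)) / n                      # observed agreement
  pe <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  (po - pe) / (1 - pe)
}

cohen_kappa(conf_mat)
```

For this table the accuracy is 0.85, the chance agreement is 0.5, and kappa is 0.7, which shows why Kappa is a stricter yardstick than raw accuracy.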

Once more, this time with boosted trees

library(C50)
grid_c50 = expand.grid(.model = 'tree', .trials = c(10, 20, 30, 40), .winnow = FALSE)
set.seed(300)
m_c50 = train(Good.Loan ~. , data = credit, method = 'C5.0',
              metric = 'Kappa', trControl = ctrl,
              tuneGrid = grid_c50)

The C5.0 method in the caret package has since been updated.
http://topepo.github.io/caret/train-models-by-tag.html#
Check the documentation again and adjust the tuning grid accordingly.

Building an ensemble by hand

Simulating the data

set.seed(10)

y = seq(1:1000)
x1 = seq(1:1000)*runif(1000, min = 0, max = 2)
x2 = (seq(1:1000)*runif(1000, min = 0, max = 2))^2
x3 = log(seq(1:1000)*runif(1000, min = 0, max = 2))

lm_fit = lm(y ~ x1 + x2 + x3)
summary(lm_fit)

The model explains 0.6648 of the variance in y (the R-squared reported by summary).

Linear model

set.seed(10)
all_data = data.frame(y, x1, x2, x3)
positions = sample(nrow(all_data), size = floor(nrow(all_data) / 4 * 3))
training = all_data[positions, ]
testing = all_data[-positions, ]
lm_fit = lm(y ~ x1 + x2 + x3, data = training)
predictions = predict(lm_fit, newdata = testing)
error = sqrt(sum((testing$y - predictions)^2) / nrow(testing))
# error 177.3647
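
The error used throughout is the root mean squared error (RMSE). A small helper, here called `rmse` (a name introduced for illustration), would keep the later comparisons shorter; the numbers below are made up.

```r
# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up values, just to show the helper agrees with the formula above.
actual    <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 40)

rmse(actual, predicted)
sqrt(sum((actual - predicted)^2) / length(actual))  # same value
```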

bagging

library(foreach)
length_divisor = 6
iterations = 1000
predictions = foreach(m = 1:iterations, .combine = cbind) %do% {
  training_positions = sample(nrow(training), size = floor(nrow(training) / length_divisor)) # sample 1/6 of the rows; floor() rounds down
  train_pos = 1:nrow(training) %in% training_positions
  lm_fit = lm(y ~ x1 + x2 + x3, data = training[train_pos, ])
  predict(lm_fit, newdata = testing)
}
predictions = rowMeans(predictions)
error = sqrt(sum((testing$y - predictions)^2) / nrow(testing))
# error 177.2885

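The improvement is negligible because the models fitted on the different subsamples differ too little from one another.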
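
One way to check this explanation is to measure how much the bagged models disagree. The sketch below regenerates data in the same way as the simulation above (the exact numbers may differ from the article's), fits 200 linear models on 1/6 subsamples using base R's `sapply` in place of foreach, and compares the spread of their predictions with the error of the averaged prediction.

```r
set.seed(10)

# Simulated data, generated the same way as in the article.
n  <- 1000
y  <- 1:n
x1 <- (1:n) * runif(n, min = 0, max = 2)
x2 <- ((1:n) * runif(n, min = 0, max = 2))^2
x3 <- log((1:n) * runif(n, min = 0, max = 2))
all_data <- data.frame(y, x1, x2, x3)

idx      <- sample(n, size = floor(n * 3 / 4))
training <- all_data[idx, ]
testing  <- all_data[-idx, ]

# Fit 200 linear models, each on a random 1/6 of the training rows.
preds <- sapply(1:200, function(i) {
  rows <- sample(nrow(training), size = floor(nrow(training) / 6))
  fit  <- lm(y ~ x1 + x2 + x3, data = training[rows, ])
  predict(fit, newdata = testing)
})

# Spread of the individual models around the bagged mean, per test row.
model_sd <- mean(apply(preds, 1, sd))
bag_rmse <- sqrt(mean((testing$y - rowMeans(preds))^2))

model_sd   # typical disagreement between models
bag_rmse   # error of the averaged prediction
```

If the disagreement between models is small compared with the overall error, averaging them has little variance left to remove, which is consistent with the tiny gain seen above.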

rf - ensemble

library(randomForest)
rf_fit = randomForest(y ~ x1 + x2 + x3, data = training, ntree = 500)
predictions = predict(rf_fit, newdata = testing)
error = sqrt(sum((testing$y - predictions)^2) / nrow(testing))
# error 135.7778

Because it is an ensemble method, the random forest improves on the linear model by a fair margin.

rf + bagging

Combine the randomForest model with the bagged linear models.

length_divisor = 6
iterations = 1000
predictions = foreach(m = 1:iterations, .combine = cbind) %do% { 
  training_positions = sample(nrow(training), size = floor((nrow(training) / length_divisor)))
  train_pos = 1:nrow(training) %in% training_positions
  lm_fit = lm(y ~ x1 + x2 + x3, data = training[train_pos, ])
  predict(lm_fit, newdata = testing)
}
lm_predictions = rowMeans(predictions)

library(randomForest)
rf_fit = randomForest(y ~ x1 + x2 + x3, data = training, ntree = 500)
rf_predictions = predict(rf_fit, newdata = testing)
predictions = (lm_predictions + rf_predictions) / 2 # simple average of the two sets of predictions

error =  sqrt(sum((testing$y - predictions)^2) / nrow(testing))
# error 148.1932

This is worse than using randomForest directly: randomForest is already an ensemble, so averaging in a weaker linear model does not help.

Try giving rf 90% of the weight

predictions = (lm_predictions + rf_predictions*9) / 10 

error =  sqrt(sum((testing$y - predictions)^2) / nrow(testing))
# error 136.5708

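This is much better than the simple average (148.1932), but at 136.5708 it is still slightly above the 135.7778 obtained with rf alone earlier, so the weighting does not actually beat rf.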
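
Rather than trying weights one at a time, the blend weight can be scanned over a grid. The sketch below uses made-up prediction vectors (`pred_a`, `pred_b`) in place of the lm and rf predictions. Note that choosing the weight on the test set, as done here and in the article, gives an optimistic error estimate; a separate validation set would be the safer choice.

```r
set.seed(42)

# Made-up target and two imperfect sets of predictions (illustration only).
actual <- 1:200
pred_a <- actual + rnorm(200, sd = 30)  # weaker model
pred_b <- actual + rnorm(200, sd = 15)  # stronger model

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Weight w goes to pred_b, (1 - w) to pred_a.
weights <- seq(0, 1, by = 0.1)
errors  <- sapply(weights, function(w) {
  rmse(actual, (1 - w) * pred_a + w * pred_b)
})

data.frame(weight_b = weights, rmse = errors)
best_w <- weights[which.min(errors)]
best_w
```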

svm

library(e1071)
svm_fit = svm(y ~ x1 + x2 + x3, data = training)
svm_predictions = predict(svm_fit, newdata = testing)

error =  sqrt(sum((testing$y - svm_predictions)^2) / nrow(testing))
# error 129.8723

The support vector machine does better than randomForest.

Apply bagging to svm as well

length_divisor = 6
iterations = 5000
predictions = foreach(m = 1:iterations, .combine = cbind) %do% { 
  training_positions = sample(nrow(training), size = floor((nrow(training) / length_divisor)))
  train_pos = 1:nrow(training) %in% training_positions
  svm_fit = svm(y ~ x1 + x2 + x3, data = training[train_pos, ])
  predict(svm_fit, newdata = testing)
}
svm2_predictions = rowMeans(predictions)
error =  sqrt(sum((testing$y - svm2_predictions)^2) / nrow(testing))
# error 140.9896

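The result actually gets worse. Two likely reasons:

  1. The subsamples are similar to one another
  2. The sample is small: only 1000 observations in total, so each bagged svm is trained on just 125 rows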

ensemble svm+rf

predictions = (svm_predictions + rf_predictions) / 2 
error =  sqrt(sum((testing$y - predictions)^2) / nrow(testing))
# error 129.0832, slightly better
predictions = (svm_predictions*3 + rf_predictions) / 4 
error =  sqrt(sum((testing$y - predictions)^2) / nrow(testing))
# error 128.5303
