Benefits of ensembles
- avoid getting stuck searching for a single best model
- generalize better to future problems
- performance gains on large-scale data, since the base models can be trained in parallel (see the sketch after this list)
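A minimal sketch of the parallelism point, assuming the foreach and doParallel packages and a placeholder data frame d: once a parallel backend is registered, %dopar% fits the base models of a bagging ensemble across worker processes.
library(foreach)
library(doParallel)
cl = makeCluster(2)        # two worker processes; adjust to your cores
registerDoParallel(cl)
models = foreach(i = 1:100) %dopar% {
  idx = sample(nrow(d), replace = TRUE)  # bootstrap sample of the rows
  lm(y ~ ., data = d[idx, ])             # one base model per iteration
}
stopCluster(cl)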
Using random forests in R
Random forests are an ensemble method based on decision trees.
The method combines the bagging principle with random selection of features at each split to increase the diversity of the decision trees;
once the ensemble is built, disagreements among the trees are still resolved by voting (illustrated in the sketch after the parameter list below).
Random forests handle data with very large numbers of features well, use only the important ones, and are simple to apply,
but interpretability suffers, and the model still needs to be tuned to the data.
m = randomForest(train, class, ntree = 500, mtry = sqrt(p))
- ntree: number of trees to grow
- mtry: optional; defaults to sqrt(p), where p is the number of features
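To see the voting at work, randomForest's predict() can return each tree's individual vote via predict.all = TRUE (a sketch; m is a fitted classification forest and test is assumed holdout data):
p = predict(m, newdata = test, predict.all = TRUE)
p$individual[1, 1:5]      # votes of the first five trees for observation 1
table(p$individual[1, ])  # vote counts across all ntree trees
p$aggregate[1]            # the majority-vote prediction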
# install.packages('randomForest')
library(randomForest)
set.seed(300)
rf = randomForest(Good.Loan ~ ., data = credit)
rf  # OOB error rate 23.8%, a big improvement over a single decision tree
OOB is the out-of-bag error rate, an estimate of the error rate on future data.
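The OOB estimate can also be inspected inside the fitted object; for classification forests, err.rate tracks it as trees are added (a short sketch):
rf$err.rate[nrow(rf$err.rate), ]  # final OOB error rate plus per-class errors
plot(rf)                          # error rates as a function of the number of trees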
We can also use the caret package:
library(caret)
ctrl = trainControl(method = 'repeatedcv', number = 5, repeats = 5)
grid_rf = expand.grid(.mtry = c(2, 4, 8, 16))
set.seed(300)
m_rf = train(Good.Loan ~., data = credit, method = 'rf',
metric = 'Kappa', trControl = ctrl,
tuneGrid = grid_rf)
m_rf  # the best value is mtry = 16
Try again with a boosted tree (C5.0):
library(C50)
grid_c50 = expand.grid(.model = 'tree', .trials = c(10, 20, 30, 40), .winnow = FALSE)
set.seed(300)
m_c50 = train(Good.Loan ~. , data = credit, method = 'C5.0',
metric = 'Kappa', trControl = ctrl,
tuneGrid = grid_c50)
The C5.0 method in the caret package has been updated; see
http://topepo.github.io/caret/train-models-by-tag.html#
and check the documentation again before adjusting the tuning grid.
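caret can report which tuning parameters the installed version expects for a method, which is a quick way to catch renamed parameters after an update (a sketch):
library(caret)
modelLookup('C5.0')  # lists the C5.0 tuning parameters: trials, model, winnow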
Building an ensemble by hand
Simulating data
set.seed(10)
y = 1:1000
x1 = (1:1000) * runif(1000, min = 0, max = 2)
x2 = ((1:1000) * runif(1000, min = 0, max = 2))^2
x3 = log((1:1000) * runif(1000, min = 0, max = 2))
lm_fit = lm(y ~ x1 + x2 + x3)
summary(lm_fit)
# Multiple R-squared 0.6648: the three predictors explain about two-thirds of the variance in y
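The 0.6648 in the summary output is the multiple R-squared; it can also be extracted directly:
summary(lm_fit)$r.squared  # 0.6648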
Linear model (baseline)
set.seed(10)
all_data = data.frame(y, x1, x2, x3)
positions = sample(nrow(all_data), size = floor(nrow(all_data) / 4 * 3))  # 75/25 train/test split
training = all_data[positions, ]
testing = all_data[-positions, ]
lm_fit = lm(y ~ x1 + x2 + x3, data = training)
predictions = predict(lm_fit, newdata = testing)
error = sqrt(sum((testing$y - predictions)^2) / nrow(testing))
# error 177.3647
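The same RMSE expression recurs throughout the rest of the section; a small helper is mathematically equivalent (square root of the mean squared error) and less error-prone, though the code below keeps the explicit form from the original:
rmse = function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(testing$y, predictions)  # 177.3647, identical to the expression above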
bagging
library(foreach)
length_divisor = 6
iterations = 1000
predictions = foreach(m = 1:iterations, .combine = cbind) %do% {
training_positions = sample(nrow(training), size = floor(nrow(training) / length_divisor))  # sample 1/6 of the training rows; floor rounds down
train_pos = 1:nrow(training) %in% training_positions
lm_fit = lm(y ~ x1 + x2 + x3, data = training[train_pos, ])
predict(lm_fit, newdata = testing)
}
predictions = rowMeans(predictions)
error = sqrt(sum((testing$y - predictions)^2) / nrow(testing))
# error 177.2885
The improvement is negligible because the bagged linear models differ too little from one another; a check of this follows.
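One way to verify the low-diversity claim (a sketch; this must run before the rowMeans() call above overwrites the prediction matrix):
pred_sd = apply(predictions, 1, sd)  # per-observation spread across the 1000 base models
summary(pred_sd)                     # small spreads mean the base models barely disagree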
rf - ensemble
library(randomForest)
rf_fit = randomForest(y ~ x1 + x2 + x3, data = training, ntree = 500)
predictions = predict(rf_fit, newdata = testing)
error = sqrt(sum((testing$y - predictions)^2) / nrow(testing))
# error 135.7778
Random forest, being an ensemble method itself, yields a fairly large improvement.
rf + bagging
Combine the randomForest model with the bagged linear models.
length_divisor = 6
iterations = 1000
predictions = foreach(m = 1:iterations, .combine = cbind) %do% {
training_positions = sample(nrow(training), size = floor((nrow(training) / length_divisor)))
train_pos = 1:nrow(training) %in% training_positions
lm_fit = lm(y ~ x1 + x2 + x3, data = training[train_pos, ])
predict(lm_fit, newdata = testing)
}
lm_predictions = rowMeans(predictions)
library(randomForest)
rf_fit = randomForest(y ~ x1 + x2 + x3, data = training, ntree = 500)
rf_predictions = predict(rf_fit, newdata = testing)
predictions = (lm_predictions + rf_predictions) / 2  # simple average of the two methods' predictions
error = sqrt(sum((testing$y - predictions)^2) / nrow(testing))
# error 148.1932
Worse than using randomForest alone: randomForest is already an ensemble, and stacking a weaker linear model on top brings little improvement.
Try giving rf a 90% weight:
predictions = (lm_predictions + rf_predictions * 9) / 10  # 90% weight on rf
error = sqrt(sum((testing$y - predictions)^2) / nrow(testing))
# error 136.5708
Much better than the 50/50 blend, though by these numbers (136.57 vs 135.78) still marginally worse than rf alone.
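Rather than guessing the weight, one could sweep candidate weights and pick the best (a sketch; note that choosing the weight on the test set is optimistic, a separate validation split would be cleaner):
weights = seq(0, 1, by = 0.05)
errors = sapply(weights, function(w) {
  blend = w * rf_predictions + (1 - w) * lm_predictions
  sqrt(mean((testing$y - blend)^2))   # RMSE of the weighted blend
})
weights[which.min(errors)]            # weight on rf with the lowest test RMSE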
svm
library(e1071)
svm_fit = svm(y ~ x1 + x2 + x3, data = training)
svm_predictions = predict(svm_fit, newdata = testing)
error = sqrt(sum((testing$y - svm_predictions)^2) / nrow(testing))
# error 129.8723
The support vector machine does better than randomForest here.
Apply bagging to the svm as well:
length_divisor = 6
iterations = 5000
predictions = foreach(m = 1:iterations, .combine = cbind) %do% {
training_positions = sample(nrow(training), size = floor((nrow(training) / length_divisor)))
train_pos = 1:nrow(training) %in% training_positions
svm_fit = svm(y ~ x1 + x2 + x3, data = training[train_pos, ])
predict(svm_fit, newdata = testing)
}
svm2_predictions = rowMeans(predictions)
error = sqrt(sum((testing$y - svm2_predictions)^2) / nrow(testing))
# error 140.9896
Instead, the result got worse, possibly because:
- the bootstrap subsamples are too similar to one another (checked in the sketch below)
- the sample size is small, only 1000 observations
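The similarity explanation can be checked directly, since at this point predictions still holds the matrix of 5000 prediction columns (a sketch; only the first 50 columns to keep it cheap):
cor_mat = cor(predictions[, 1:50])
mean(cor_mat[upper.tri(cor_mat)])  # values near 1 mean nearly identical base models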
ensemble svm+rf
predictions = (svm_predictions + rf_predictions) / 2
error = sqrt(sum((testing$y - predictions)^2) / nrow(testing))
# error 129.0832, slightly better
predictions = (svm_predictions * 3 + rf_predictions) / 4  # 3:1 weight on svm
error = sqrt(sum((testing$y - predictions)^2) / nrow(testing))
# error 128.5303, the best result so far