Using xgboost in R

This post introduces how to use xgboost in R. It is mainly based on the article here.

data loading

require(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test

train is a list with two elements, data (the feature matrix) and label (the target vector); test has the same structure.
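To see what these two objects look like, a quick inspection along the following lines can help. This is only a sketch that uses base R accessors on the bundled agaricus data; it is not part of the referenced article.

```r
library(xgboost)
data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")
train <- agaricus.train
test <- agaricus.test

# Each object is a list holding a sparse feature matrix and a 0/1 label vector
class(train$data)    # sparse matrix class from the Matrix package
dim(train$data)      # rows = observations, columns = features
dim(test$data)
table(train$label)   # class balance of the training labels
```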

basic training

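The main parameters to set when using xgboost:
objective: the objective function, e.g. binary:logistic for binary classification
max_depth: the maximum depth of each tree
nthread: the number of threads to use
nrounds: the number of boosting rounds (one tree is built per round here)
eta: the learning rate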

xgb = xgboost(data=train$data, label=train$label, max_depth=2, eta=1, objective='binary:logistic', nrounds=2)
## [1]  train-error:0.046522 
## [2]  train-error:0.022263

xgb.DMatrix

xgb.DMatrix bundles train$data and train$label into a single object:

dtrain = xgb.DMatrix(data=train$data, label=train$label)
bst = xgboost(data=dtrain, max_depth=2, eta=1, objective='binary:logistic', nrounds=2)
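As a small check that the label really travels with the DMatrix, it can be read back with getinfo(). The sketch below only assumes the bundled agaricus data; the variable name stored_label is chosen here for illustration.

```r
library(xgboost)
data(agaricus.train, package = "xgboost")
train <- agaricus.train

dtrain <- xgb.DMatrix(data = train$data, label = train$label)

# The label is stored inside the DMatrix and can be read back with getinfo()
stored_label <- getinfo(dtrain, "label")

dim(dtrain)                       # same shape as train$data
all(stored_label == train$label)  # TRUE if the labels came back unchanged
```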

verbose option

verbose controls how much information is printed during training:
verbose=0: no messages
verbose=1: the evaluation metric
verbose=2: the evaluation metric plus information about each tree

bst = xgboost(data=dtrain, max_depth=2, eta=1, objective='binary:logistic', nrounds=2, verbose=2)
## [22:38:46] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes, 0 pruned nodes, max_depth=2
## [1]  train-error:0.046522 
## [22:38:46] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 4 extra nodes, 0 pruned nodes, max_depth=2
## [2]  train-error:0.022263

predict

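Once the model bst has been trained with xgboost, it can be used to predict the labels of test$data. For binary:logistic, predict returns probabilities, so they are thresholded at 0.5 to get class labels.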

pred = predict(bst, test$data)
prediction = as.numeric(pred > 0.5)
print(head(prediction, 5))
err = mean(as.numeric(prediction != test$label))
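A single error rate hides which kind of mistake the model makes, so it can be useful to cross-tabulate predictions against the true labels. The sketch below retrains the same small model so that it runs on its own; it passes the settings through the params list of xgb.train(), which is an assumption made here for portability across xgboost versions, not the call used above.

```r
library(xgboost)
data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")
train <- agaricus.train
test <- agaricus.test

dtrain <- xgb.DMatrix(data = train$data, label = train$label)
dtest <- xgb.DMatrix(data = test$data, label = test$label)

# Same settings as above, passed through the params list of xgb.train()
params <- list(max_depth = 2, eta = 1, objective = "binary:logistic")
bst <- xgb.train(params = params, data = dtrain, nrounds = 2)

pred <- predict(bst, dtest)           # predicted probabilities of class 1
prediction <- as.numeric(pred > 0.5)  # 0.5 is a conventional cutoff, not a tuned one

# Cross-tabulate predicted vs. actual labels
confusion <- table(predicted = prediction, actual = test$label)
print(confusion)

# Test error = share of predictions that disagree with the true label
err <- mean(prediction != test$label)
print(err)
```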

xgb.train

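With xgb.train, the evaluation metric (the classification error by default for this objective) is computed on every dataset in the watchlist after each round. Watching the test metric alongside the training metric helps to pick a model that does not overfit.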

dtrain = xgb.DMatrix(data=train$data, label=train$label)
dtest = xgb.DMatrix(data=test$data, label=test$label)
watchlist = list(train=dtrain, test=dtest)

bst = xgb.train(data=dtrain, max_depth=2, eta=1, nthread=2, nrounds=2, watchlist=watchlist, objective='binary:logistic')
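One way to act on the per-round test metric is early stopping. The following is a sketch under two assumptions: that the installed xgboost version still accepts the watchlist argument used in this post (newer releases may rename it), and that 20 rounds with a patience of 3 are reasonable illustrative values rather than tuned ones.

```r
library(xgboost)
data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")
train <- agaricus.train
test <- agaricus.test

dtrain <- xgb.DMatrix(data = train$data, label = train$label)
dtest <- xgb.DMatrix(data = test$data, label = test$label)
watchlist <- list(train = dtrain, test = dtest)

params <- list(max_depth = 2, eta = 1, nthread = 2,
               objective = "binary:logistic")

# Ask for up to 20 rounds, but stop once the monitored metric has not
# improved for 3 consecutive rounds. With this watchlist order the
# monitored dataset is expected to be the last one, i.e. test.
bst <- xgb.train(params = params, data = dtrain, nrounds = 20,
                 watchlist = watchlist, early_stopping_rounds = 3)

# Error of the resulting model on the test set
pred <- predict(bst, dtest)
err <- mean(as.numeric(pred > 0.5) != test$label)
print(err)
```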

linear boosting

All the models so far were based on boosted trees. By setting the booster parameter to booster='gblinear' and removing eta, we can use linear boosting instead.

bst <- xgb.train(data=dtrain, booster = "gblinear", max_depth=2, nthread = 2, nrounds=2, watchlist=watchlist, eval_metric = "error", eval_metric = "logloss", objective = "binary:logistic")
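# Two eval_metric values are set, so both error and logloss are reported for each round.
# max_depth is a tree parameter and has no effect on the gblinear booster.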
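To judge whether the linear booster is worth using on a given dataset, the two boosters can be compared on the same test set. The helper test_error below is a hypothetical name introduced for this sketch, and the settings simply mirror the ones used in this post. No result is quoted here, since the numbers depend on the xgboost version and threading.

```r
library(xgboost)
data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")
train <- agaricus.train
test <- agaricus.test

dtrain <- xgb.DMatrix(data = train$data, label = train$label)
dtest <- xgb.DMatrix(data = test$data, label = test$label)

# Hypothetical helper: train for 2 rounds and return the test error
test_error <- function(params) {
  bst <- xgb.train(params = params, data = dtrain, nrounds = 2)
  pred <- predict(bst, dtest)
  mean(as.numeric(pred > 0.5) != test$label)
}

tree_params <- list(booster = "gbtree", max_depth = 2, eta = 1,
                    objective = "binary:logistic")
linear_params <- list(booster = "gblinear",
                      objective = "binary:logistic")

errors <- c(gbtree = test_error(tree_params),
            gblinear = test_error(linear_params))
print(errors)
```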

saving and loading

# DMatrix save & load
xgb.DMatrix.save(dtrain, 'dtrain.buffer')
dtrain = xgb.DMatrix('dtrain.buffer')

# model save & load
xgb.save(bst, 'xgboost.model')
bst = xgb.load('xgboost.model')
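A simple way to confirm that saving and loading worked is to compare predictions before and after the round trip. This sketch writes to a temporary file rather than the working directory; the .json extension is an assumption made here because recent xgboost releases choose the file format from the extension.

```r
library(xgboost)
data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")
train <- agaricus.train
test <- agaricus.test

dtrain <- xgb.DMatrix(data = train$data, label = train$label)
dtest <- xgb.DMatrix(data = test$data, label = test$label)

params <- list(max_depth = 2, eta = 1, objective = "binary:logistic")
bst <- xgb.train(params = params, data = dtrain, nrounds = 2)
pred_before <- predict(bst, dtest)

# Save to a temporary file and load the model back
model_file <- tempfile(fileext = ".json")
xgb.save(bst, model_file)
bst_loaded <- xgb.load(model_file)
pred_after <- predict(bst_loaded, dtest)

# The reloaded model should reproduce the original predictions
same <- isTRUE(all.equal(pred_before, pred_after))
print(same)
```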

feature importance

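# train$data is a sparse matrix, so the feature names come from colnames(), not names()
import_mat = xgb.importance(feature_names=colnames(train$data), model=bst)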
print(import_mat)
xgb.plot.importance(importance_matrix=import_mat)

inspecting the trees

xgb.dump(model, with_stats=TRUE) dumps the trees of a tree-based model as text, and xgb.plot.tree(model=model) plots them.
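The text dump is the easier of the two to try, since plotting relies on an additional graphing package (DiagrammeR, as far as I know). The following sketch trains the small tree model from this post again and prints its dump; the variable name tree_dump is chosen here for illustration.

```r
library(xgboost)
data(agaricus.train, package = "xgboost")
train <- agaricus.train

dtrain <- xgb.DMatrix(data = train$data, label = train$label)

# A tree booster is needed here; the gblinear model has no trees to show
params <- list(booster = "gbtree", max_depth = 2, eta = 1,
               objective = "binary:logistic")
bst <- xgb.train(params = params, data = dtrain, nrounds = 2)

# One character string per line of the dump: booster headers,
# split nodes and leaves, with gain/cover statistics on each node
tree_dump <- xgb.dump(bst, with_stats = TRUE)
cat(tree_dump, sep = "\n")
```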