I came across an article on Analytics Vidhya, "Complete Guide to Parameter Tuning in XGBoost in Python", which is very well written. I decided to translate it, partly to leave myself with a deeper impression; the translation below focuses on the key points of the article rather than a word-for-word rendering.
The original article:
http://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
Following the original's structure, the translation is split into three parts; this post covers the first one.
1. Introduction and the XGBoost advantage
2. Understanding the parameters
3. Parameter tuning
Introduction
If things don't go your way in predictive modeling, use XGBoost. The XGBoost algorithm has become the ultimate weapon of many data scientists. It's a highly sophisticated algorithm, powerful enough to deal with all sorts of irregularities in data.
Building a model using XGBoost is easy. But improving the model using XGBoost is difficult (at least I struggled a lot). The algorithm uses many parameters, and to improve the model, parameter tuning is a must. It is very difficult to get answers to practical questions like: which set of parameters should you tune, and what are the ideal values of these parameters to obtain the optimal output?
This article is best suited to people who are new to XGBoost. In this article, we'll learn the art of parameter tuning along with some useful information about XGBoost. Also, we'll practice this algorithm using a data set in Python.
What should you know?
XGBoost (eXtreme Gradient Boosting) is an advanced implementation of the gradient boosting algorithm. Since I covered Gradient Boosting Machine in detail in my previous article, Complete Guide to Parameter Tuning in Gradient Boosting (GBM) in Python, I highly recommend going through that before reading further. It will help you bolster your understanding of boosting in general and of parameter tuning for GBM.
Special Thanks: Personally, I would like to acknowledge the timeless support provided by Mr. Sudalai Rajkumar (aka SRK), currently AV Rank 2. This article wouldn't be possible without his help. He is helping to guide thousands of data scientists. A big thanks to SRK!
Table of Contents
- The XGBoost Advantage
- Understanding XGBoost Parameters
- Tuning Parameters (with Example)
1. The XGBoost Advantage
I’ve always admired the boosting capabilities that this algorithm infuses in a predictive model. When I explored more about its performance and science behind its high accuracy, I discovered many advantages:
- Regularization:
- A standard GBM implementation has no regularization, so the regularization built into XGBoost also helps it reduce overfitting (a minimal sketch of the relevant parameters follows this list).
- In fact, XGBoost is also known as a 'regularized boosting' technique.
- Parallel Processing:
- XGBoost implements parallel processing and is blazingly fast compared to GBM.
- But hang on: we know boosting is a sequential process, so how can it be parallelized? Each tree can be built only after the previous one, so what stops us from building a single tree using all cores? I hope you get where I'm coming from. Check this link out to explore further.
- XGBoost also supports running on Hadoop.
- High Flexibility:
- XGBoost allows users to define custom optimization objectives and evaluation criteria (see the custom-objective sketch after this list).
- This adds a whole new dimension to the model and there is no limit to what we can do.
- Handling Missing Values:
- XGBoost has an in-built routine to handle missing values.
- The user is required to supply a value different from other observations and pass it as a parameter. XGBoost tries different directions when it encounters a missing value at each node and learns which path to take for missing values in the future (a short sketch follows this list).
- Tree Pruning:
- A GBM stops splitting a node when it encounters a negative loss at that split, which makes it more of a greedy algorithm.
- XGBoost, on the other hand, makes splits up to the specified max_depth and then prunes the tree backwards, removing splits beyond which there is no positive gain. Growing the full tree first and pruning bottom-up makes XGBoost less likely than GBM to settle for a locally optimal split.
- Another advantage: a split with a negative loss of, say, -2 may be followed by a split with a positive loss of +10. GBM would stop as soon as it encounters the -2, but XGBoost goes deeper, sees the combined gain of +8, and keeps both splits.
- Built-in Cross-Validation:
- XGBoost allows the user to run cross-validation at each iteration of the boosting process, so it is easy to get the exact optimum number of boosting iterations in a single run (a sketch using xgb.cv follows this list).
- This is unlike GBM, where we have to run a grid search and only a limited set of values can be tested.
- Continue on Existing Model:
- The user can resume training an XGBoost model from the last iteration of a previous run, which can be a significant advantage in certain applications (see the sketch after this list).
- sklearn's GBM implementation also has this feature, so the two libraries are even on this point.
Did I whet your appetite? Good. You can refer to the following web pages for a deeper understanding:
- XGBoost Guide – Introduction to Boosted Trees
- Words from the Author of XGBoost [Video]