Implementation of XGBoost on Income Data
After covering some of the theory behind the XGBoost framework in the previous stories What Makes XGBoost Fast and Powerful? and How XGBoost Handles Sparsities Arising From Missing Data? (With an Example), it is now time to get our hands dirty and apply it to a small dataset for classifying and predicting income level.
Let’s say you have applied for a mortgage loan. It is important for the bank to assess your credibility, and they think that your income is a good indicator of whether to lend you the money. You know that you are pretty much new to this bank, so at most they could find some demographic information about you, but not more than that. That makes it hard for the bank where you applied for your mortgage to assess your worthiness. So you lied, told them you earned a lot and had a fantastic job, just to get the money. But somehow they checked their limited records about you and found out that you make less than 50K a year, and no! You’ve been rejected.
But how did this happen? Wait, what? They have an XGBoost classifier model to check your income level? But how? Here comes the how part below:
The data used in the model, along with the code for preprocessing and model building used in this story, can be found on my GitHub page here.
There will be other stories on Hypatai showing the implementation of other frameworks, such as H2O and LightGBM, on this same dataset; therefore, the data preparation notebook on GitHub is designed to produce a different model dataset for each related algorithm.
Now we are ready to start!
Let’s read the prepared data, which is ready for the modeling phase:
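The loading code itself is in the notebook on GitHub rather than reproduced here; as a rough sketch, assuming hypothetical file names and a binary target column called target, it could look like this:

```python
import pandas as pd

# Hypothetical file name; the actual prepared file comes from the data
# preparation notebook linked above.
train_df = pd.read_csv("income_train_xgb.csv")

# Assume a binary target column named "target" (1: income >50K, 0: income <=50K).
X = train_df.drop(columns=["target"])
y = train_df["target"]
```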
Taking a quick approach and keeping mostly the default values, let’s set our model hyperparameters; a sketch of the resulting configuration is shown after the parameter list below.
max_depth: Maximum depth of each tree. The default value was kept here.
eta: Learning rate. It shrinks the feature weights to make the boosting process more conservative after each boosting step. The default value was kept here.
objective: Since the target takes the income levels ≤50K and >50K, this is a binary classification problem, and the objective is set accordingly.
seed: This is needed to reproduce the same results each time we train the model.
min_child_weight: As stated in the story What Makes XGBoost Fast and Powerful?, in the part explaining the Weighted Quantile Sketch algorithm, the weight (or cover) in XGBoost is the number of points in the node (or leaf) for regression problems. Therefore, maintaining equal weight among quantiles is the same as maintaining an equal number of observations for regression; in other words, quantiles in regression are just ordinary quantiles. For classification problems, however, the weight has a different calculation. Badly predicted instances have higher weights, and to keep the weight in each quantile equal, instances with large residuals go into more specialized quantiles, leaving those quantiles with fewer data points. This can increase accuracy by effectively increasing the number of quantiles. min_child_weight is therefore the minimum threshold for the sum of hessians in each child. The higher this parameter, the less prone the model is to overfitting. The default value in XGBoost is 1. Here I used a value of 5, found by quick trial and error.
n_estimators: Number of gradient boosted trees, equivalent to the number of boosting rounds. The default value is 100. Since we have about 104 features to train the model on, I increased this value to 250 to avoid underfitting.
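Putting the choices above together, a minimal sketch of the setup with XGBoost’s scikit-learn style wrapper might look like the following; apart from min_child_weight and n_estimators the values shown are the XGBoost defaults, and the seed value is purely illustrative:

```python
from xgboost import XGBClassifier

model = XGBClassifier(
    max_depth=6,                   # default maximum tree depth
    learning_rate=0.3,             # eta, kept at its default
    objective="binary:logistic",   # binary target: <=50K vs >50K
    min_child_weight=5,            # raised from the default of 1
    n_estimators=250,              # number of boosting rounds
    random_state=42,               # seed for reproducibility (illustrative value)
)
```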
There are, of course, always other parameters to tune; however, to keep the example simple, I stopped here and skipped to the training part.
The data was already shuffled in the preprocessing step, so I used an 80 to 20 ratio for the train and validation sets in this phase. The metric for evaluating model performance is the area under the ROC curve (AUC). I also set early_stopping_rounds to 20, meaning that if our metric does not improve for 20 rounds at any point before round 250 (the value we set for the n_estimators parameter), training will stop.
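A minimal sketch of the split and training step, reusing the hypothetical X and y from the loading sketch above; note that in recent XGBoost versions eval_metric and early_stopping_rounds are passed to the constructor rather than to fit():

```python
from sklearn.model_selection import train_test_split

# The data was already shuffled upstream, so a plain 80/20 split is enough here.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    eval_metric="auc",             # evaluate with area under the ROC curve
    early_stopping_rounds=20,      # stop if AUC does not improve for 20 rounds
    verbose=True,
)
```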
Here comes the point where our training ended:
It seems we did not need to set our n_estimators hyperparameter that high; thankfully, early_stopping_rounds did a good job here and saved us from going further.
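To double-check the validation AUC ourselves, a quick calculation with scikit-learn (using the hypothetical variable names from the sketches above) could look like this:

```python
from sklearn.metrics import roc_auc_score

# Predicted probability of the positive class (>50K) on the validation set.
valid_pred = model.predict_proba(X_valid)[:, 1]
print(f"Validation AUC: {roc_auc_score(y_valid, valid_pred):.4f}")
```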
Our validation set shows an AUC of about 93%, which does not seem to be a bad result. We will soon put up another story on how to calculate and interpret various model performance metrics for classification problems, so for now I would like to continue by getting a better idea of what kind of model we have constructed. Here is the feature importance list showing the top features of our model, calculated from the total gain generated across all trees.
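One way to obtain such a gain-based importance list is to query the trained model’s underlying booster; a sketch:

```python
# Total-gain importances from the underlying booster.
booster = model.get_booster()
gain_importance = booster.get_score(importance_type="total_gain")

# Sort features by total gain, highest first, and show the top of the list.
top_features = sorted(gain_importance.items(), key=lambda kv: kv[1], reverse=True)
print(top_features[:10])
```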
Let’s define a function for plotting the top ten features of our model:
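The original plotting code lives in the notebook on GitHub; a minimal sketch of such a function, reusing gain_importance from the sketch above, could look like this:

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_top_features(importance_dict, top_n=10):
    """Plot the top_n features by total gain as a horizontal bar chart."""
    imp = pd.Series(importance_dict).sort_values(ascending=True).tail(top_n)
    imp.plot(kind="barh", figsize=(8, 5), title=f"Top {top_n} features by total gain")
    plt.xlabel("Total gain")
    plt.tight_layout()
    plt.show()

plot_top_features(gain_importance)
```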
It looks like being a married spouse (a one-hot encoded feature) and capital gain are the most important features after the age variable when we need to predict the income level.
The next story will be all about AUC, optimal probability thresholds, confusion matrices, and other metrics for evaluating our model performance on the validation and test sets.
Stay tuned!
Translated from: https://medium.com/hypatai/implementation-of-xgboost-on-income-data-eda80ca6828e