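The data comes from the Tianchi competition: Getting Started with Data Mining from Scratch - Used Car Transaction Price Prediction
URL: https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX
XGBoost is a boosted-tree method: it combines many trees into an ensemble, which usually generalizes well. XGBoost also handles missing values well. For samples whose feature value is missing, its sparsity-aware algorithm automatically learns which branch of each split they should follow. This post summarizes my notes on applying the algorithm.
For the full derivation of the algorithm, see
Link: https://mp.weixin.qq.com/s/HDEKnIufbW8xQcOgHaXlZw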
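To make the sparsity-aware idea concrete, here is a minimal sketch for a single candidate split. It is not XGBoost's actual implementation, which scans every threshold of every feature and also subtracts the complexity penalty gamma. The function name `best_default_direction` and the toy data are hypothetical; the gain uses the usual structure score G^2 / (H + lambda).

```python
import numpy as np

def best_default_direction(x, g, h, threshold, lam=1.0):
    """Pick the default direction for missing values at one candidate split.

    x: feature values (np.nan marks missing); g, h: per-sample first- and
    second-order gradients; threshold: split point for the observed values.
    Returns (direction, gain) where direction is "left" or "right".
    """
    def score(G, H):
        # Structure score of a leaf with gradient sum G and hessian sum H
        return G * G / (H + lam)

    missing = np.isnan(x)
    observed = np.where(missing, threshold, x)  # placeholder keeps NaN out of comparisons
    left = ~missing & (observed < threshold)
    right = ~missing & (observed >= threshold)

    parent = score(g.sum(), h.sum())
    gains = {}
    for direction in ("left", "right"):
        # Send all missing samples to one side and evaluate the split gain
        l_mask = (left | missing) if direction == "left" else left
        r_mask = right if direction == "left" else (right | missing)
        gains[direction] = 0.5 * (score(g[l_mask].sum(), h[l_mask].sum())
                                  + score(g[r_mask].sum(), h[r_mask].sum())
                                  - parent)
    best = max(gains, key=gains.get)
    return best, gains[best]

# Toy data (made up): the two rows with missing x have targets near the right-hand group
x = np.array([1.0, 2.0, np.nan, 8.0, 9.0, np.nan])
y = np.array([10.0, 11.0, 30.0, 31.0, 29.0, 32.0])
g = 0.0 - y            # squared loss with initial prediction 0: g = pred - y
h = np.ones_like(y)    # squared loss: h = 1
direction, gain = best_default_direction(x, g, h, threshold=5.0)
print(direction, gain)
```

Because the rows with missing `x` have targets close to the right-hand group, sending them right gives the larger gain, so that side becomes the default direction for this split.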
# Load the required modules
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, make_scorer
from xgboost.sklearn import XGBRegressor
import xgboost as xgb
from lightgbm.sklearn import LGBMRegressor
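# Load the data
data = pd.read_csv('F:/data/used_car_train_20200313.csv', sep=' ')  # sep=' ': the raw file is space-delimited
# Keep the features selected during the earlier feature-engineering step
data = data[['v_12','v_10','v_9','v_11','price']]
data.shape
(150000, 5)
The feature-selection process is described in the previous post
Link: https://blog.csdn.net/weixin_45481473/article/details/105159419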
# Separate the features from the label
x = data.drop("price",axis=1)
y = data["price"]
# Hold-out validation split
from sklearn.model_selection import train_test_split
x_train,x_val,y_train,y_val = train_test_split(x,y,test_size=0.3)  # split into a training set (train) and a validation set (val)
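The split above is a single hold-out split, so the resulting score depends on which rows land in the validation set. A k-fold estimate with `cross_val_score` and `make_scorer` (both imported at the top of this post) is usually more stable. The sketch below is only an illustration: `cv_mae`, `x_demo` and `y_demo` are hypothetical names, the data is synthetic rather than the competition data, and `GradientBoostingRegressor` is a stand-in so that the example depends only on scikit-learn.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.model_selection import cross_val_score

def cv_mae(model, x, y, cv=5):
    """Return the per-fold MAE of `model` under k-fold cross-validation."""
    # greater_is_better=False makes the scorer return negative MAE,
    # so the sign is flipped to recover ordinary MAE values.
    scorer = make_scorer(mean_absolute_error, greater_is_better=False)
    return -cross_val_score(model, x, y, cv=cv, scoring=scorer)

# Synthetic stand-in data (NOT the used-car data)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(300, 4))
y_demo = 3.0 * x_demo[:, 0] - 2.0 * x_demo[:, 1] + rng.normal(scale=0.1, size=300)

# Stand-in estimator; any scikit-learn compatible regressor can be passed instead
fold_mae = cv_mae(GradientBoostingRegressor(n_estimators=50, random_state=0), x_demo, y_demo)
print('MAE per fold:', fold_mae)
print('mean MAE:', fold_mae.mean())
```

To apply this to the post's data, pass an `xgb.XGBRegressor(...)` instance together with `x` and `y` in place of the stand-in model and the synthetic arrays.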
# Set the model parameters
def build_model_xgb(x_train, y_train):
    model = xgb.XGBRegressor(n_estimators=150, learning_rate=0.1, gamma=0, subsample=0.8,
                             colsample_bytree=0.9, max_depth=7)  # , objective ='reg:squarederror'
    model.fit(x_train, y_train)
    return model
# Train the model
model_xgb = build_model_xgb(x_train,y_train)
val_xgb = model_xgb.predict(x_val)
MAE_xgb = mean_absolute_error(y_val,val_xgb)  # evaluate the model with the mean absolute error (MAE)
print('MAE of val with xgb:',MAE_xgb)
MAE of val with xgb: 953.8296908175865
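The smaller the mean absolute error (MAE), the better the regression model fits. MAE reflects the actual size of the error between the predicted and true values, and is computed as follows: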
$$\mathrm{MAE}=\frac{1}{N} \sum_{i=1}^{N}\left|y_{i}-\hat{y}_{i}\right|$$
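In the model built above, MAE ≈ 953.83, which looks large in absolute terms. MAE is already an average over the N samples, so the sample size of 150,000 does not by itself make the value larger or smaller; it should instead be judged against the typical scale of price. For a model that uses only four features and the hand-set parameters above, the fit is acceptable as a baseline.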
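The MAE formula can be checked by hand on a few numbers. The sketch below uses made-up prices and predictions (not taken from the competition data), and the helper name `mae` is hypothetical. It also prints MAE relative to the mean true value, which is one simple way to judge the error against the scale of the target.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: (1/N) * sum(|y_i - y_hat_i|)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_true - y_pred).mean()

# Made-up values for illustration only
y_true = np.array([3000.0, 5000.0, 12000.0, 800.0])
y_pred = np.array([3500.0, 4200.0, 11000.0, 1000.0])

error = mae(y_true, y_pred)      # (500 + 800 + 1000 + 200) / 4
relative = error / y_true.mean() # MAE as a fraction of the mean true value
print('MAE:', error)             # MAE: 625.0
print('MAE / mean(y):', relative)
```

The result should match `sklearn.metrics.mean_absolute_error(y_true, y_pred)`, the function used in the evaluation step above.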