XGBoost = eXtreme + GBDT
        = eXtreme + (Gradient + BDT)
        = eXtreme + Gradient + (Boosting + Decision Tree)
Boosting --> BDT --> GBDT --> XGBoost
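To make this lineage concrete, here is a minimal from-scratch sketch of the GBDT core that XGBoost builds on: each new tree fits the negative gradient of the loss, which for squared loss is simply the residual. All names and the toy data here are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=100, lr=0.1):
    base = y.mean()                   # initial constant prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residual = y - pred           # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        pred += lr * tree.predict(X)  # shrink each tree's contribution
        trees.append(tree)
    return base, trees

def gbdt_predict(base, trees, X, lr=0.1):
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred += lr * tree.predict(X)
    return pred

X_toy = np.random.rand(200, 3)
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(200)
base, trees = gbdt_fit(X_toy, y_toy)
print(np.mean((gbdt_predict(base, trees, X_toy) - y_toy) ** 2))  # training MSE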
from xgboost import XGBClassifier
XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True,
              objective='binary:logistic', booster='gbtree', n_jobs=1,
              nthread=None, gamma=0, min_child_weight=1, max_delta_step=0,
              subsample=1, colsample_bytree=1, colsample_bylevel=1,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5,
              random_state=0, seed=None, missing=None, **kwargs)
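Most of these defaults are left untouched in practice; below is a minimal sketch of overriding the handful of parameters that are tuned most often (the values are illustrative, not recommendations):

from xgboost import XGBClassifier

clf = XGBClassifier(
    max_depth=5,           # deeper trees capture more feature interactions
    learning_rate=0.05,    # smaller shrinkage, usually paired with more trees
    n_estimators=300,      # number of boosting rounds
    subsample=0.8,         # fraction of rows sampled per tree
    colsample_bytree=0.8,  # fraction of columns sampled per tree
    reg_lambda=1.0,        # L2 regularization on leaf weights
)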
Three types of parameters
General parameters: control which booster is used during the boosting process;
the commonly used boosters are the tree model and the linear model.
Booster parameters: depend on which booster has been chosen.
Task parameters: control the learning scenario; for example, ranking and regression problems are controlled by different parameters (see the sketch after this list).
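All three groups are passed through a single params dict in the native xgboost.train API; a minimal sketch with illustrative toy data:

import numpy as np
import xgboost as xgb

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)
dtrain = xgb.DMatrix(X, label=y)
params = {
    'booster': 'gbtree',             # general parameter: which booster to use
    'max_depth': 3,                  # booster parameter: depth of each tree
    'eta': 0.1,                      # booster parameter: learning rate
    'objective': 'binary:logistic',  # task parameter: the optimization objective
    'eval_metric': 'logloss',        # task parameter: how each step is evaluated
}
bst = xgb.train(params, dtrain, num_boost_round=10)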
General parameters
These parameters control the overall behavior of XGBoost.
nthread [default: the maximum number of available threads]: controls multithreading; set it to the number of cores on your system.
Booster parameters
Although two boosters are available, only the tree booster is covered here, because it far outperforms the linear booster; the linear booster is therefore rarely used.
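A quick way to see the gap for yourself, as a sketch on a synthetic scikit-learn dataset (the margin naturally varies by problem):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

Xc, yc = make_classification(n_samples=500, n_features=20, random_state=0)
for booster in ('gbtree', 'gblinear'):
    clf = XGBClassifier(booster=booster)
    print(booster, cross_val_score(clf, Xc, yc, cv=5).mean())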
Learning task parameters
These parameters control the optimization objective and the metric used to evaluate each step's results.
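The objective is what selects the learning scenario; a sketch of common choices (note that, depending on the XGBoost version, the evaluation metric eval_metric is passed either to the constructor or to fit()):

from xgboost import XGBClassifier, XGBRegressor

clf_bin = XGBClassifier(objective='binary:logistic')   # binary classification
clf_multi = XGBClassifier(objective='multi:softprob')  # multi-class with probabilities
reg = XGBRegressor(objective='reg:squarederror')       # regression ('reg:linear' in old versions)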
Attributes
Methods
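A hedged sketch of the attributes and methods of the scikit-learn wrapper that come up most often (the toy data is illustrative):

from sklearn.datasets import make_classification
from xgboost import XGBClassifier

Xa, ya = make_classification(n_samples=200, random_state=0)
clf = XGBClassifier(n_estimators=20).fit(Xa, ya)  # fit() returns the model itself
print(clf.predict(Xa)[:5])        # method: predicted class labels
print(clf.predict_proba(Xa)[:5])  # method: per-class probabilities
print(clf.feature_importances_)   # attribute: per-feature importance scores
booster = clf.get_booster()       # method: the underlying native Booster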
from xgboost import XGBRegressor
XGBRegressor(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True,
             objective='reg:linear', booster='gbtree', n_jobs=1,
             nthread=None, gamma=0, min_child_weight=1, max_delta_step=0,
             subsample=1, colsample_bytree=1, colsample_bylevel=1,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5,
             random_state=0, seed=None, missing=None, **kwargs)
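A minimal usage sketch for XGBRegressor, mirroring the classifier walkthrough below (synthetic data for illustration):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

Xr, yr = make_regression(n_samples=300, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(Xr, yr, random_state=0)
reg = XGBRegressor(n_estimators=100, learning_rate=0.1)
reg.fit(Xtr, ytr)
print(reg.score(Xte, yte))  # R^2 on the held-out split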
Attributes
Usage
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'  # workaround for crashes from duplicate OpenMP runtimes on some systems
# Load the data
dataset = pd.read_csv(r"diabetes.csv")  # widely available online
print(dataset.shape)
x = dataset.iloc[:,0:8]
y = dataset.iloc[:,-1]
# Split the dataset into train and test sets
seed = 7
test_size = 0.33
X_train,X_test,y_train,y_test = train_test_split(x,y,test_size=test_size,random_state=seed)
print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)
# Create and train the model
model = XGBClassifier(n_jobs=-1)
model.fit(X_train,y_train)
# Predict on the test set with the trained model and compute the accuracy against the true labels
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print("accuracy:{:.2f}".format(accuracy * 100.0))
# Predict on the test set again, this time also getting the predicted probability of each class:
y_pred = model.predict(X_test)
print(y_pred)
y_pred_proba = model.predict_proba(X_test)
print(y_pred_proba)
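For binary:logistic, predict() amounts to thresholding the positive-class probability at the default 0.5 cutoff; a quick sanity check against the outputs above:

import numpy as np
manual_pred = (y_pred_proba[:, 1] > 0.5).astype(int)
print(np.array_equal(manual_pred, y_pred))  # expected: True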
# Plot how important each feature is
from xgboost import plot_importance
import matplotlib.pyplot as plt
%matplotlib inline
plot_importance(model)
plt.show()
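By default plot_importance counts how often each feature appears in a split ('weight'); 'gain' and 'cover' are the other supported importance types, and the fitted model also exposes importances as a plain array:

plot_importance(model, importance_type='gain')  # average gain of the splits that use each feature
plt.show()
print(model.feature_importances_)  # importances as an array, one value per feature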
# Plot one of the model's trees (plot_tree requires the graphviz package)
from xgboost import plot_tree
_, ax = plt.subplots(figsize=(30, 30))
plot_tree(model, ax=ax)
plt.show()
# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
# Create the model and the parameter search space
model_GS = XGBClassifier()
learning_rate = [0.00001,0.001,0.01,0.1,0.2,0.3]
max_depth = [1,2,3,4,5]
param_grid = dict(learning_rate=learning_rate,max_depth=max_depth)
# Set up stratified k-fold cross-validation and create the search object
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
grid_search = GridSearchCV(model_GS, param_grid=param_grid, scoring='neg_log_loss', n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X_train, y_train)  # search on the training split only, so X_test stays unseen
y_pred = grid_result.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print("accuracy:{:.2f}".format(accuracy * 100.0))
grid_result.best_score_,grid_result.best_params_
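Because refit=True is GridSearchCV's default, best_estimator_ has already been retrained with the winning parameters; a sketch of reusing and persisting it (the filename is illustrative, and save_model on the wrapper requires a reasonably recent XGBoost):

best_model = grid_result.best_estimator_  # already refit on the full training split
print(best_model.get_params()['learning_rate'], best_model.get_params()['max_depth'])
best_model.save_model('xgb_diabetes.json')  # illustrative filename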