Machine Learning: XGBoost in Practice

1. Import the required packages

import xgboost as xgb
from sklearn.metrics import accuracy_score  # used below to score the predictions

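2. Reading the data

XGBoost stores the data it loads in a DMatrix object.
DMatrix is XGBoost's own data-matrix class, optimized for both storage and computation speed.

# Read in the data. The files ship in the demo directory of the XGBoost installation;
# here they have been copied to the data directory next to the code.
my_workpath = './data/'

# Note: newer XGBoost releases may require the text format to be stated in the path,
# for example by appending '?format=libsvm' to the file name.
dtrain = xgb.DMatrix('G:\\ML\\12.agaricus_train.txt')
dtest = xgb.DMatrix('G:\\ML\\12.agaricus_test.txt')

Inspect the data:

dtrain.num_row()
dtrain.num_col()
dtest.num_col()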
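
The agaricus files are plain text in LibSVM format: each line holds a label followed by sparse index:value pairs. As a rough illustration of what ends up in the DMatrix, the sketch below parses one such line in plain Python. The sample line and the helper name parse_libsvm_line are made up for illustration, and this is not how XGBoost itself reads files.

```python
def parse_libsvm_line(line):
    """Split one LibSVM-format line into (label, {feature_index: value})."""
    parts = line.split()
    label = float(parts[0])          # the first field is the label
    features = {}
    for item in parts[1:]:           # remaining fields are index:value pairs
        index, value = item.split(':')
        features[int(index)] = float(value)
    return label, features


# Hypothetical sample line: label 1, with features 3, 10 and 21 set to 1
label, features = parse_libsvm_line('1 3:1 10:1 21:1')
print(label, features)
```

Features that do not appear on a line are simply absent from the dictionary, which is why the format suits sparse data such as one-hot encoded attributes.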

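3. Setting the training parameters

max_depth: the maximum depth of a tree. Default is 6; range [1, ∞].
eta: the shrinkage step size used in each update to prevent overfitting. After each boosting step the algorithm directly obtains the weights of the new features, and eta shrinks those weights to make the boosting process more conservative. Default is 0.3; range [0, 1].
silent: 0 prints run-time messages; 1 runs silently without printing them. Default is 0. (Newer XGBoost releases deprecate silent in favor of verbosity.)
objective: defines the learning task and the corresponding objective. "binary:logistic" is logistic regression for binary classification and outputs a probability.

All other parameters keep their default values.

# specify parameters via map
param = {'max_depth': 2, 'eta': 1, 'silent': 0, 'objective': 'binary:logistic'}
print(param)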
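
To make the role of eta more concrete, here is a minimal sketch in plain Python. It assumes the model's margin is the sum of each tree's output scaled by eta, and that binary:logistic passes that margin through a sigmoid. The tree outputs are made-up numbers, the base score is ignored, and the outputs are held fixed across both settings, which real boosting would not do since later trees depend on earlier ones.

```python
import math


def predict_proba(tree_outputs, eta):
    """Sum the eta-scaled tree outputs into a margin, then apply the sigmoid."""
    margin = sum(eta * w for w in tree_outputs)
    return 1.0 / (1.0 + math.exp(-margin))


# Hypothetical raw outputs of two trees for a single sample
tree_outputs = [1.2, 0.8]

p_full = predict_proba(tree_outputs, eta=1.0)    # no shrinkage, as in param above
p_shrunk = predict_proba(tree_outputs, eta=0.3)  # the default shrinkage
print(p_full, p_shrunk)
```

With the smaller eta each tree moves the prediction less, so more boosting rounds are usually needed to reach the same fit.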

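4. Training the model

Set the number of boosting iterations:

num_round = 2

import time
starttime = time.perf_counter()  # time.clock() was removed in Python 3.8

bst = xgb.train(param, dtrain, num_round)  # dtrain is the training set

endtime = time.perf_counter()
print(endtime - starttime)

XGBoost's prediction output is a probability. Mushroom classification is a binary problem, so the output is the probability that a sample belongs to the positive class (label 1).
We need to convert these probabilities to 0 or 1.

train_preds = bst.predict(dtrain)
train_predictions = [round(value) for value in train_preds]
y_train = dtrain.get_label()  # labels come from the first column of the input file
train_accuracy = accuracy_score(y_train, train_predictions)
print("Train Accuracy: %.2f%%" % (train_accuracy * 100.0))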
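
The probability-to-label step and the accuracy computation can be sketched without any library. The probabilities and labels below are made-up values, and to_labels and accuracy are illustrative helper names rather than part of XGBoost or scikit-learn.

```python
def to_labels(probs, threshold=0.5):
    """Map each probability to 1 if it exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probs]


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical predicted probabilities and true labels
probs = [0.91, 0.12, 0.5, 0.73]
y_true = [1, 0, 0, 1]

labels = to_labels(probs)
print(labels)
print("Accuracy: %.2f%%" % (accuracy(y_true, labels) * 100.0))
```

An explicit threshold makes the decision rule visible. Python's built-in round() rounds exact halves to the nearest even integer, so round(0.5) gives 0, which agrees with the strict comparison used here.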

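5. Testing

Once the model is trained, it can be used to predict on the test data.

# make prediction
preds = bst.predict(dtest)

Check the model's accuracy on the test set.
As before, the predictions are probabilities of the positive class, so they must be converted to 0 or 1.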

predictions = [round(value) for value in preds]
y_test = dtest.get_label()
test_accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy: %.2f%%" % (test_accuracy * 100.0))

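6. Visualizing the model

I have not visualized the data here.

Call plot_tree from the XGBoost package to display a tree.
Visualizing the model requires the graphviz package to be installed.
plot_tree() takes three arguments:

  1. The model
  2. The index of the tree, starting from 0
  3. The layout direction; the default is vertical, and 'LR' is horizontal (left to right)

from matplotlib import pyplot
import graphviz
xgb.plot_tree(bst, num_trees=0, rankdir='LR')
pyplot.show()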

#xgb.plot_tree(bst,num_trees=1, rankdir= 'LR' )
#pyplot.show()
#xgb.to_graphviz(bst,num_trees=0)
#xgb.to_graphviz(bst,num_trees=1)

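7. Putting the code together

#!/usr/bin/python
# -*- encoding:utf-8 -*-

import xgboost as xgb
import numpy as np
import time

# 1. Basic usage of XGBoost
# 2. Custom gradient and second derivative of the loss function
# 3. binary:logistic/logitraw


# Define f: theta * x
def log_reg(y_hat, y):
    # y_hat is the raw margin; y is the DMatrix that holds the labels
    p = 1.0 / (1.0 + np.exp(-y_hat))
    g = p - y.get_label()  # gradient of the log loss
    h = p * (1.0-p)        # second derivative of the log loss
    return g, h


def error_rate(y_hat, y):
    return 'error', float(sum(y.get_label() != (y_hat > 0.5))) / len(y_hat)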

if __name__ == "__main__":
    # Load the data
    data_train = xgb.DMatrix('G:\\ML\\12.agaricus_train.txt')
    data_test = xgb.DMatrix('G:\\ML\\12.agaricus_test.txt')

    print(data_train.num_col())
    print(data_train.num_row())

    # Set the parameters
    param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logitraw'} # logitraw
    # param = {'max_depth': 3, 'eta': 0.3, 'silent': 1, 'objective': 'reg:logistic'}
    watchlist = [(data_test, 'eval'), (data_train, 'train')]
    n_round = 3
    starttime = time.perf_counter()
    # bst = xgb.train(param, data_train, num_boost_round=n_round, evals=watchlist)
    bst = xgb.train(param, data_train, num_boost_round=n_round, evals=watchlist, obj=log_reg, feval=error_rate)

    endtime = time.perf_counter()
    print(endtime - starttime)

    # Compute the error rate
    y_hat = bst.predict(data_test)
    y = data_test.get_label()  # true labels of the test set

    # print(y_hat)
    # print(y)

    # y_hat is a raw margin (logit), so the decision threshold is 0
    error = sum(y != (y_hat > 0))
    err_rate = float(error) / len(y_hat)
    print('Total samples:\t', len(y_hat))
    print('Errors:\t%4d' % error)
    print('Error rate:\t%.5f%%' % (100 * err_rate))

Test results:

[0] eval-auc:0.96037 train-auc:0.95823 eval-error:0.04283 train-error:0.04652
[1] eval-auc:0.97993 train-auc:0.98141 eval-error:0.02173 train-error:0.02226
[2] eval-auc:0.99852 train-auc:0.99707 eval-error:0.01800 train-error:0.01520
0.23339390000001004
Total samples: 1611
Errors: 10
Error rate: 0.62073%
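
The custom objective log_reg in the listing above returns p - y as the gradient and p * (1 - p) as the second derivative of the log loss with respect to the raw margin. The sketch below checks both against finite differences in plain Python and also confirms that thresholding the margin at 0 is the same as thresholding the probability at 0.5. The margin value, label and step size are arbitrary choices for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(margin, y):
    """Binary log loss as a function of the raw margin."""
    p = sigmoid(margin)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def grad_hess(margin, y):
    """Analytic gradient and second derivative, as in log_reg."""
    p = sigmoid(margin)
    return p - y, p * (1.0 - p)


margin, y, eps = 0.7, 1.0, 1e-4   # arbitrary test point and step size
g, h = grad_hess(margin, y)

# Central finite differences of the loss around the test point
g_num = (log_loss(margin + eps, y) - log_loss(margin - eps, y)) / (2 * eps)
h_num = (log_loss(margin + eps, y) - 2 * log_loss(margin, y)
         + log_loss(margin - eps, y)) / eps ** 2

print(g, g_num)
print(h, h_num)

# A margin above 0 corresponds exactly to a probability above 0.5
same_decision = all((m > 0) == (sigmoid(m) > 0.5) for m in (-2.0, -0.1, 0.1, 3.0))
print(same_decision)
```

This is why the final error count compares y_hat against 0: with a raw-margin output there is no sigmoid applied to the predictions.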

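XGBoost in practice on the iris data

The parameters of sklearn's train_test_split() function

In machine learning we usually split the original data into a "training set" and a "test set" by some ratio, using the train_test_split function from sklearn.model_selection.

Basic usage:

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(train_data, train_target, test_size=0.4, random_state=0, stratify=train_target)

train_data: the sample features to split
train_target: the sample labels to split
test_size: the proportion of samples that go to the test set; if it is an integer, it is the number of test samples
random_state: the seed for the random number generator.
Random seed: in effect an identifier for a particular sequence of random numbers, which guarantees the same sequence when an experiment has to be repeated. Any fixed integer (0 included) gives the same split on every run as long as the other arguments are unchanged; if it is left unset (None), the split differs from run to run.
stratify: when given, the split keeps the class proportions of this array in both subsets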
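
To illustrate what test_size and random_state control, here is a minimal plain-Python sketch of a shuffled split. It is not scikit-learn's implementation: simple_split is a hypothetical helper, it does not support stratify, and rounding a fractional test_size up is an assumption made for this sketch.

```python
import math
import random


def simple_split(data, target, test_size, random_state=None):
    """Shuffle indices with a seeded generator and cut off a test portion."""
    n = len(data)
    # An integer test_size is a sample count; a float is a proportion
    if isinstance(test_size, int):
        n_test = test_size
    else:
        n_test = math.ceil(test_size * n)

    indices = list(range(n))
    random.Random(random_state).shuffle(indices)  # None seeds from the system
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(data, train_idx), pick(data, test_idx),
            pick(target, train_idx), pick(target, test_idx))


data = list(range(10))
target = [i % 2 for i in data]

a = simple_split(data, target, test_size=0.4, random_state=0)
b = simple_split(data, target, test_size=0.4, random_state=0)
print(a == b)                 # same seed, same split
print(len(a[0]), len(a[1]))   # training and test sizes
```

Calling it with test_size=3 instead would put exactly three samples in the test set, mirroring the integer form used in the iris listing (test_size=50).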

Example code:

#!/usr/bin/python
# -*- encoding:utf-8 -*-

import xgboost as xgb
import numpy as np
from sklearn.model_selection import train_test_split   # cross_validation


def iris_type(s):
    # np.loadtxt may pass bytes or str to converters, depending on the NumPy version
    if isinstance(s, bytes):
        s = s.decode()
    it = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    return it[s]


if __name__ == "__main__":
    path = u'G:\\ML\\8.iris.data'  # path to the data file
    data = np.loadtxt(path, dtype=float, delimiter=',', converters={4: iris_type})
    x, y = np.split(data, (4,), axis=1)
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, test_size=50)

    data_train = xgb.DMatrix(x_train, label=y_train)
    data_test = xgb.DMatrix(x_test, label=y_test)
    watch_list = [(data_test, 'eval'), (data_train, 'train')]
    param = {'max_depth': 3, 'eta': 1, 'silent': 1, 'objective': 'multi:softmax', 'num_class': 3}

    bst = xgb.train(param, data_train, num_boost_round=6, evals=watch_list)
    y_hat = bst.predict(data_test)
    result = y_test.reshape(1, -1) == y_hat
    print('Accuracy:\t', float(np.sum(result)) / len(y_hat))
    print('END.....\n')

Test results:
[0] eval-merror:0.02000 train-merror:0.02000
[1] eval-merror:0.02000 train-merror:0.02000
[2] eval-merror:0.04000 train-merror:0.01000
[3] eval-merror:0.04000 train-merror:0.01000
[4] eval-merror:0.04000 train-merror:0.00000
[5] eval-merror:0.04000 train-merror:0.00000
[6] eval-merror:0.04000 train-merror:0.00000
Accuracy: 0.96
END.....
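
The merror values in the log are multiclass error rates: the fraction of samples whose predicted class differs from the true class, so accuracy is 1 - merror. With multi:softmax the prediction is already a class index, so no thresholding is needed. The sketch below reproduces the arithmetic in plain Python with made-up labels for 50 samples, 2 of which are misclassified; the helper name merror is chosen only to match the log.

```python
def merror(y_true, y_pred):
    """Fraction of samples whose predicted class differs from the true class."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Hypothetical labels for 50 test samples
y_true = [0] * 17 + [1] * 17 + [2] * 16
y_pred = list(y_true)
y_pred[20] = 2   # a class-1 sample predicted as class 2
y_pred[40] = 1   # a class-2 sample predicted as class 1

err = merror(y_true, y_pred)
print('merror: %.5f' % err)
print('accuracy: %.2f' % (1 - err))
```

Two errors out of 50 give an merror of 0.04, the same arithmetic that links the last eval-merror value to the reported accuracy of 0.96.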

XGBoost in practice: predicting wine classes

Example code:

#!/usr/bin/python
# -*- encoding:utf-8 -*-

import xgboost as xgb
import numpy as np
from sklearn.model_selection import train_test_split   # cross_validation
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def show_accuracy(a, b, tip):
    acc = a.ravel() == b.ravel()
    # print(acc)
    print(tip + 'accuracy:\t', float(acc.sum()) / a.size)


if __name__ == "__main__":
    data = np.loadtxt('12.wine.data', dtype=float, delimiter=',')
    y, x = np.split(data, (1,), axis=1)  # the first column is the class label
    # x = StandardScaler().fit_transform(x)
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, test_size=0.5)

    # Logistic regression
    lr = LogisticRegression(penalty='l2')

    lr.fit(x_train, y_train.ravel())
    y_hat = lr.predict(x_test)
    print(y_hat)
    show_accuracy(y_hat, y_test, 'Logistic regression ')

    # XGBoost
    # multi:softmax expects labels in [0, num_class), so relabel class 3 as 0
    y_train[y_train == 3] = 0
    y_test[y_test == 3] = 0
    data_train = xgb.DMatrix(x_train, label=y_train)
    data_test = xgb.DMatrix(x_test, label=y_test)
    watch_list = [(data_test, 'eval'), (data_train, 'train')]
    param = {'max_depth': 3, 'eta': 1, 'silent': 0, 'objective': 'multi:softmax', 'num_class': 3}
    bst = xgb.train(param, data_train, num_boost_round=4)
    y_hat = bst.predict(data_test)
    print(y_hat)
    show_accuracy(y_hat, y_test, 'XGBoost ')

Result analysis
[3. 2. 1. 2. 1. 3. 2. 1. 3. 2. 1. 2. 2. 1. 2. 2. 3. 1. 2. 1. 1. 2. 2. 2. 1. 3. 1. 1. 1. 3. 2. 3. 3. 1. 2. 2. 2. 2. 2. 1. 1. 2. 3. 1. 1. 1. 1. 1. 1. 1. 2. 3. 3. 1. 2. 1. 1. 2. 3. 2. 2. 1. 3. 2. 3. 1. 2. 1. 2. 1. 3. 3. 3. 3. 2. 2. 1. 3. 1. 1. 3. 1. 2. 1. 3. 2. 2. 1. 2.]
Logistic regression accuracy: 0.9438202247191011

[0. 2. 1. 2. 1. 0. 2. 1. 0. 2. 1. 1. 2. 1. 2. 2. 0. 1. 2. 1. 1. 2. 0. 1. 1. 0. 1. 1. 1. 0. 2. 0. 0. 1. 2. 2. 2. 2. 2. 1. 1. 2. 0. 1. 1. 1. 2. 1. 1. 1. 2. 0. 0. 1. 2. 2. 1. 2. 0. 2. 2. 1. 0. 2. 0. 1. 2. 1. 2. 1. 0. 0. 0. 0. 2. 2. 1. 0. 1. 2. 0. 1. 2. 1. 0. 2. 2. 1. 2.]
XGBoost accuracy: 0.9887640449438202

The results show that XGBoost classifies this data very accurately: about 98.9% in this run, against about 94.4% for logistic regression. The algorithm itself is an ensemble method built on gradient boosted trees.
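
The wine listing relabels class 3 as 0 because multi:softmax expects labels in the range [0, num_class). A more general way to handle arbitrary class labels, sketched here in plain Python, is to build a mapping to contiguous 0-based indices plus its inverse for translating predictions back. The helper name build_label_maps and the sample labels are hypothetical, and the resulting mapping differs from the one used in the listing (here 1, 2, 3 become 0, 1, 2).

```python
def build_label_maps(labels):
    """Map the distinct labels to 0..k-1 (sorted order) and back again."""
    classes = sorted(set(labels))
    to_index = {c: i for i, c in enumerate(classes)}
    to_class = {i: c for c, i in to_index.items()}
    return to_index, to_class


# Hypothetical raw class labels, as they might appear in a data file
raw_labels = [3, 1, 2, 1, 3]

to_index, to_class = build_label_maps(raw_labels)
encoded = [to_index[c] for c in raw_labels]   # labels suitable for training
decoded = [to_class[i] for i in encoded]      # predictions mapped back

print(encoded)
print(decoded)
```

Keeping the inverse mapping around means predictions can be reported in the original label space, which also makes them directly comparable with the logistic regression output.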
