神洛华

lightGBM实战

文章目录

- 一、使用LGBMClassifier对iris进行训练
- - 1.1 使用lgb.LGBMClassifier
  - - 1.1.2使用pickle进行保存模型，然后加载预测
    - 1.1.3 使用txt和json保存模型并加载
  - 1.2使用原生的API进行模型训练和预测
  - - 1.2.2 使用txt/json格式保存模型
    - 1.2.3 使用pickle进行保存模型
- 三、任务3 分类、回归和排序任务
- - 3.1使用 make_classification生成二分类数据进行训练
  - - 3.1.1 sklearn接口
    - 3.1.2 原生train接口
  - 3.2使用 make_classification生成多分类数据进行训练
  - - 3.2.1 sklearn接口
    - 3.2.2 原生train接口
  - 3.3使用 make_regression生成回归数据
  - - 3.3.1 sklearn接口
    - 3.3.2 原生train接口
- 四、graphviz可视化
- 五、模型调参（网格、随机、贝叶斯）
- - 5.1 加载数据集
  - 5.2:步骤2 设置树模型深度分别为[3,5,6,9]，记录下验证集AUC精度。
  - 5.3 步骤3：category变量设置为categorical_feature
  - 5.4 步骤4：超参搜索
  - - 5.4.1 GridSearchCV
  - 5.4.3 随机搜索
  - 5.4.4 贝叶斯搜索
- 六、模型微调与参数衰减
- - 6.2 学习率衰减
- 七、特征筛选方法
- - 7.1 筛选最重要的3个特征
  - 7.2 利用PermutationImportance排列特征重要性
  - 7.3 Null Importances进行特征选择
  - - 7.3.1 主要思想：
    - 7.3.2实现步骤
    - 7.3.3 读取数据集，计算Real Targe和shuffle Target下的特征重要度
    - 7.3.4计算Score
- 八、自定义损失函数和评测函数
- 九模型部署与加速
- 九模型部署与加速

LightGBM有两种接口：

sklearn接口接口文档
原生train接口文档
Python API（包括Scikit-learn API）
阿水知乎贴：《你应该知道的LightGBM各种操作》
《Coggle 30 Days of ML（22年1&2月）》

学习内容：

LightGBM（Light Gradient Boosting Machine）是微软开源的一个实现 GBDT 算法的框架，支持高效率的并行训练。LightGBM 提出的主要原因是为了解决 GBDT 在海量数据遇到的问题。本次学习内容包括使用LightGBM完成各种操作，包括竞赛和数据挖掘中的模型训练、验证和调参过程。

打卡汇总：

任务名称	难度、分数	所需技能
任务1模型训练与预测	低、1	LightGBM
任务2：模型保存与加载	低、1	LightGBM
任务3：分类、回归和排序任务	高、3	LightGBM
任务4：模型可视化	低、1	graphviz
任务5：模型调参（网格、随机、贝叶斯）	中、2	模型调参
任务6：模型微调与参数衰减	中、2	LightGBM
任务7：特征筛选方法	高、3	特征筛选方法
任务8：自定义损失函数	中、2	损失函数&评价函数
任务9：模型部署与加速	高、3	Treelite

一、使用LGBMClassifier对iris进行训练

from sklearn.datasets import load_iris
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import lightgbm as lgb
from lightgbm import LGBMClassifier
import numpy as np
import pandas as pd
from sklearn import metrics
import warnings
warnings.filterwarnings("ignore")

iris = load_iris()
X,y = iris.data,iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2022) # 数据集分割

1.1 使用lgb.LGBMClassifier

gbm = lgb.LGBMClassifier(max_depth=10,
            learning_rate=0.01,
            n_estimators=2000,#提升迭代次数
            objective='multi:softmax',#默认regression，用于设置损失函数
            num_class=3 ,          
            nthread=-1,#LightGBM 的线程数
            min_child_weight=1,
            max_delta_step=0,
            subsample=0.85,
            colsample_bytree=0.7,
            reg_alpha=0,#L1正则化系数
            reg_lambda=1,#L2正则化系数
            scale_pos_weight=1,
            seed=0,
            missing=None)
gbm.fit(X_train, y_train)

y_pred = gbm.predict(X_test)
# 计算准确率
accuracy = accuracy_score(y_test,y_pred)
print("accuarcy: %.2f%%" % (accuracy*100.0))

[LightGBM] [Warning] num_threads is set with n_jobs=-1, nthread=-1 will be ignored. Current value: num_threads=-1
accuarcy: 93.33%

1.1.2使用pickle进行保存模型，然后加载预测

import pickle

with open('model.pkl', 'wb') as fout:
    pickle.dump(gbm, fout)
# load model with pickle to predict
with open('model.pkl', 'rb') as fin:
    pkl_bst = pickle.load(fin)
# can predict with any iteration when loaded in pickle way
y_pred = pkl_bst.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print("accuarcy: %.2f%%" % (accuracy*100.0))

accuarcy: 93.33%

1.1.3 使用txt和json保存模型并加载

# txt格式
gbm.booster_.save_model("skmodel.txt")

clf_loads = lgb.Booster(model_file='skmodel.txt')
y_pred = clf_loads.predict(X_test)
y_pred=np.argmax(y_pred,axis=-1)
# 计算准确率
accuracy = accuracy_score(y_test,y_pred)
print("accuarcy: %.2f%%" % (accuracy*100.0))

accuarcy: 93.33%

#json格式
gbm.booster_.save_model("skmodel.json")

clf_loads = lgb.Booster(model_file='skmodel.json')
y_pred = clf_loads.predict(X_test)
y_pred=np.argmax(y_pred,axis=-1)
# 计算准确率
accuracy = accuracy_score(y_test,y_pred)
print("accuarcy: %.2f%%" % (accuracy*100.0))

accuarcy: 93.33%

1.2使用原生的API进行模型训练和预测

import numpy as np

lgb_train = lgb.Dataset(X_train, y_train)
#reference:如果这是用于验证的数据集，则应使用训练数据作为参考
#weight : list每个实例的权重
#free_raw_data：default=True，构建内部 Dataset 后释放原始数据，节省内存。
#silent:布尔类型，default=False。是否在构建过程中打印消息。
#init_score：数据集初始分数
#feature_name：设为 'auto' 时，如果 data 是 pandas DataFrame，则使用数据列名称。
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
#设置参数
#多分类的objective为multiclass或者别名softmax
params = {
    'boosting_type': 'gbdt',
    'objective': 'softmax',
    'num_class': 3,
    'max_depth': 6,
    'lambda_l2': 1,
    'subsample': 0.85,
    'colsample_bytree': 0.7,
    'min_child_weight': 1,
    'learning_rate':0.01,
    "verbosity":-1}

gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                valid_sets=lgb_eval,
                callbacks=[lgb.early_stopping(stopping_rounds=5)])

y_pred = gbm.predict(X_test,num_iteration=gbm.best_iteration)
y_pred=np.argmax(y_pred,axis=-1)
# 计算准确率
accuracy = accuracy_score(y_test,y_pred)
print("accuarcy: %.2f%%" % (accuracy*100.0))

Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[10]	valid_0's multi_logloss: 0.96537
accuarcy: 93.33%

1.2.2 使用txt/json格式保存模型

#使用txt保存模型
gbm.save_model('model.txt')
bst = lgb.Booster(model_file='model.txt')
y_pred = bst.predict(X_test, num_iteration=gbm.best_iteration)
y_pred=np.argmax(y_pred,axis=-1)
# 计算准确率
accuracy = accuracy_score(y_test,y_pred)
print("accuarcy: %.2f%%" % (accuracy*100.0))

#使用json格式保存模型
    
gbm.save_model('model.json')
bst = lgb.Booster(model_file='model.json')
y_pred = bst.predict(X_test, num_iteration=gbm.best_iteration)
y_pred=np.argmax(y_pred,axis=-1)
# 计算准确率
accuracy = accuracy_score(y_test,y_pred)
print("accuarcy: %.2f%%" % (accuracy*100.0))

accuarcy: 93.33%
accuarcy: 93.33%

1.2.3 使用pickle进行保存模型

import pickle

with open('model.pkl', 'wb') as fout:
    pickle.dump(gbm, fout)
# load model with pickle to predict
with open('model.pkl', 'rb') as fin:
    pkl_bst = pickle.load(fin)
# can predict with any iteration when loaded in pickle way
y_pred = pkl_bst.predict(X_test)
y_pred=np.argmax(y_pred,axis=-1)
accuracy = accuracy_score(y_test,y_pred)
print("accuarcy: %.2f%%" % (accuracy*100.0))

accuarcy: 93.33%

三、任务3 分类、回归和排序任务

3.1使用 make_classification生成二分类数据进行训练

from sklearn.datasets import make_classification
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
#n_samples:样本数量，默认100
#n_features:特征数，默认20
#n_informative：有效特征数量，默认2
#n_redundant:冗余特征，默认2
#n_repeated :重复的特征个数，默认0
#n_clusters_per_class：每个类别中cluster数量，默认2

#weight：各个类的占比

#n_classes* n_clusters_per_class 必须≤ 2**有效特征数
data, target = make_classification(n_samples=1000,n_features=3,n_informative=3,n_redundant=0,n_classes=2)

df = pd.DataFrame(data)
df['target'] = target

df1 = df[df['target']==0]
df2 = df[df['target']==1]
df1.index = range(len(df1))
df2.index = range(len(df2))

# 画出数据集的数据分布
plt.figure(figsize=(3,3))
plt.scatter(df1[0],df1[1],color='red')
plt.scatter(df2[0],df2[1],color='green')

plt.figure(figsize=(6,2))
df1[0].hist()
df1[0].plot(kind = 'kde', secondary_y=True)

mean_ = df1[0].mean()
std_ = df1[0].std()

stats.kstest(df1[0], 'norm', (mean_, std_))

KstestResult(statistic=0.03723785150172143, pvalue=0.4930944895472954)

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-3jpNhDmW-1642190201168)(lightGBM_files/lightGBM_17_1.png)]

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-QBs17PgZ-1642190201170)(lightGBM_files/lightGBM_17_2.png)]

3.1.1 sklearn接口

sklearn接口lgb分类器参考文档
注意：每次产生的数据都不一样，所以

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import lightgbm as lgb
from lightgbm import LGBMClassifier

X,y = data,target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2022) # 数据集分割


gbm = lgb.LGBMClassifier()

gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric='binary_logloss',
        callbacks=[lgb.early_stopping(5)])
#eval_metric默认值：LGBMRegressor 为“l2”，LGBMClassifier 为“logloss”，LGBMRanker 为“ndcg”。
#使用binary_logloss或者logloss准确率都是一样的。默认logloss
y_pred = gbm.predict(X_test)
# 计算准确率
accuracy = accuracy_score(y_test,y_pred)
print("accuarcy: %.2f%%" % (accuracy*100.0))

Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[39]	valid_0's binary_logloss: 0.255538
accuarcy: 88.50%

3.1.2 原生train接口

import numpy as np

lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
#设置参数
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'max_depth': 10,
    'metric': 'binary_logloss',
    "verbosity":-1}

gbm2 = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                valid_sets=lgb_eval,
                callbacks=[lgb.early_stopping(stopping_rounds=5)])

y_pred = gbm2.predict(X_test,num_iteration=gbm2.best_iteration)#结果是0-1之间的概率值，是一维数组
y_pred =[1 if x >0.5 else 0 for x in y_pred]
accuracy = accuracy_score(y_test,y_pred)
print("accuarcy: %.2f%%" % (accuracy*100.0))

Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[10]	valid_0's binary_logloss: 0.371903
accuarcy: 88.00%

3.2使用 make_classification生成多分类数据进行训练

3.2.1 sklearn接口

from sklearn.datasets import make_classification
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

data, target = make_classification(n_samples=1000,n_features=3,n_informative=3,n_redundant=0,n_classes=4)

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import lightgbm as lgb
from lightgbm import LGBMClassifier

X,y = data,target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2022) # 数据集分割


gbm = lgb.LGBMClassifier()

gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric='logloss',
        callbacks=[lgb.early_stopping(5)])
#eval_metric默认值：LGBMRegressor 为“l2”，LGBMClassifier 为“logloss”，LGBMRanker 为“ndcg”。
#使用binary_logloss或者logloss准确率都是一样的。默认logloss
y_pred = gbm.predict(X_test)
# 计算准确率
accuracy = accuracy_score(y_test,y_pred)
print("accuarcy: %.2f%%" % (accuracy*100.0))

Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[39]	valid_0's binary_logloss: 0.255538
accuarcy: 88.50%

3.2.2 原生train接口

import numpy as np

lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
#设置参数
params = {
    'boosting_type': 'gbdt',
    'objective': 'softmax',
    'num_class': 4,
    'max_depth': 10,
    'metric': 'softmax',
    "verbosity":-1}

gbm2 = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                valid_sets=lgb_eval,
                callbacks=[lgb.early_stopping(stopping_rounds=5)])

y_pred = gbm2.predict(X_test,num_iteration=gbm2.best_iteration)#结果是0-1之间的概率值，是一维数组
y_pred=np.argmax(y_pred,axis=-1)
accuracy = accuracy_score(y_test,y_pred)
print("accuarcy: %.2f%%" % (accuracy*100.0))

Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[10]	valid_0's multi_logloss: 0.322024
accuarcy: 87.50%

3.3使用 make_regression生成回归数据

3.3.1 sklearn接口

make_regression(n_samples=100, n_features=100, n_informative=10, n_targets=1, bias=0.0, 
                effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=True, coef=False, random_state=None)

n_samples：样本数
n_features：特征数(自变量个数)
n_informative：参与建模特征数
n_targets：因变量个数
noise：噪音
bias：偏差(截距)
coef：是否输出coef标识
random_state：随机状态若为固定值则每次产生的数据都一样

from sklearn.datasets import make_regression
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

data, target = make_regression(n_samples=1000, n_features=5,n_targets=1,noise=1.5,random_state=1)

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import lightgbm as lgb
from lightgbm import LGBMClassifier

X,y = data,target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2022) # 数据集分割


gbm = lgb.LGBMRegressor()#直接使用默认参数，mse较小。

gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric='l1',
        callbacks=[lgb.early_stopping(5)])
#eval_metric默认值：LGBMRegressor 为“l2”，LGBMClassifier 为“logloss”，LGBMRanker 为“ndcg”。
#使用binary_logloss或者logloss准确率都是一样的。默认logloss
y_pred = gbm.predict(X_test)
# 计算准确率
mse= mean_squared_error(y_test,y_pred)
print("mse: %.2f" % (mse))

Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[99]	valid_0's l1: 8.13036	valid_0's l2: 119.246
mse: 119.25

3.3.2 原生train接口

lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
#设置参数为lgb.LGBMRegressor的默认参数。
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    "num_leaves":31,
    "learning_rate": 0.1, 
    "n_estimators": 100,
    "min_child_samples": 20,
    "verbosity":-1}

gbm2 = lgb.train(params,
                lgb_train,
                num_boost_round=5,
                valid_sets=lgb_eval,
                callbacks=[lgb.early_stopping(stopping_rounds=5)])

y_pred = gbm2.predict(X_test,num_iteration=gbm2.best_iteration)#结果是0-1之间的概率值，是一维数组
mse= mean_squared_error(y_test,y_pred)
print("mse: %.2f" % (mse))

Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[99]	valid_0's l2: 119.246
mse: 119.25

四、graphviz可视化

参考文档：《lightgbm 决策树可视化 graphviz》、 graphviz参考文档、《xgboost 可视化API文档》、《lightgbm可视化API》

from sklearn.datasets import load_iris
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import lightgbm as lgb
from lightgbm import LGBMClassifier
import graphviz

iris = load_iris()
X,y = iris.data,iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2022) # 数据集分割

lgb_clf = lgb.LGBMClassifier()
lgb_clf.fit(X_train, y_train)
lgb.create_tree_digraph(lgb_clf, tree_index=1)
#设置参数
#多分类的objective为multiclass或者别名softmax

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-1QMxh4Kb-1642190201171)(lightGBM_files/lightGBM_33_0.svg)]

#lgb没有to_graphviz，无法这样保存图片通
digraph = lgb.to_graphviz(lgb_clf , num_trees=1)#报错module 'lightgbm' has no attribute 'to_graphviz'
digraph.format = 'png'
digraph.view('./iris_lgb')

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

 in 
      1 #lgb没有to_graphviz，无法这样保存图片通
----> 2 digraph = lgb.to_graphviz(lgb_clf , num_trees=1)#报错module 'lightgbm' has no attribute 'to_graphviz'
      3 digraph.format = 'png'
      4 digraph.view('./iris_lgb')


AttributeError: module 'lightgbm' has no attribute 'to_graphviz'

import xgboost as xgb
from sklearn.datasets import load_iris
iris = load_iris()

xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(iris.data, iris.target)
xgb.to_graphviz(xgb_clf, num_trees=1)

[02:55:18] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.1/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-CK4h5BDz-1642190201171)(lightGBM_files/lightGBM_35_1.svg)]

#通过Digraph对象可以将保存文件并查看
digraph = xgb.to_graphviz(xgb_clf, num_trees=1)
digraph.format = 'png'#将图像保存为png图片
digraph.view('./iris_xgb')

'iris_xgb.png'

### 步骤3：读取任务2的json格式模型文件
bst = lgb.Booster(model_file='model.json')
lgb.create_tree_digraph(bst, tree_index=1)

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-rAEbwKaU-1642190201172)(lightGBM_files/lightGBM_37_0.svg)]

五、模型调参（网格、随机、贝叶斯）

5.1 加载数据集

import pandas as pd, numpy as np, time
data= pd.read_csv("https://cdn.coggle.club/kaggle-flight-delays/flights_10k.csv.zip")

# 提取有用的列
data= data[["MONTH","DAY","DAY_OF_WEEK","AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT",
                 "ORIGIN_AIRPORT","AIR_TIME", "DEPARTURE_TIME","DISTANCE","ARRIVAL_DELAY"]]
data.dropna(inplace=True)

from sklearn.model_selection import train_test_split
# 筛选出部分数据
data["ARRIVAL_DELAY"] = (data["ARRIVAL_DELAY"]>10)*1

# 以下四列数据转换为类别
cols = ["AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT","ORIGIN_AIRPORT"]
for item in cols:
    data[item] = data[item].astype("category").cat.codes +1

# 划分训练集和测试集
train, test, y_train, y_test = train_test_split(data.drop(["ARRIVAL_DELAY"], axis=1), data["ARRIVAL_DELAY"], random_state=10, test_size=0.25)
data

	MONTH	DAY	DAY_OF_WEEK	AIRLINE	FLIGHT_NUMBER	DESTINATION_AIRPORT	ORIGIN_AIRPORT	AIR_TIME	DEPARTURE_TIME	DISTANCE	ARRIVAL_DELAY
0	1	1	4	2	88	253	13	169.0	2354.0	1448	0
1	1	1	4	1	2120	213	164	263.0	2.0	2330	0
2	1	1	4	12	803	60	262	266.0	18.0	2296	0
3	1	1	4	1	238	185	164	258.0	15.0	2342	0
4	1	1	4	2	122	14	261	199.0	24.0	1448	0
...	...	...	...	...	...	...	...	...	...	...	...
9994	1	1	4	8	2399	44	215	62.0	1710.0	473	0
9995	1	1	4	7	149	128	210	28.0	1716.0	100	1
9996	1	1	4	8	2510	208	76	29.0	1653.0	147	0
9997	1	1	4	8	2512	62	215	28.0	1721.0	135	1
9998	1	1	4	8	2541	208	182	103.0	2000.0	594	1

9592 rows × 11 columns

from sklearn.datasets import load_iris
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import lightgbm as lgb
from lightgbm import LGBMClassifier

5.2:步骤2 设置树模型深度分别为[3,5,6,9]，记录下验证集AUC精度。

predict：lgb.LGBMClassifier()等sklearn接口中是返回预测的类别
predict_proba：klearn接口中是返回预测的概率。重点是求auc时，我们必须用predict_proba。因为roc曲线的阀值是根据其正样本的概率求的。

参考《数据挖掘竞赛lightgbm通过求最大auc调参》

#sklearn接口
def test_depth(max_depth):    
    gbm = lgb.LGBMClassifier(max_depth=max_depth)
    gbm.fit(train, y_train,
            eval_set=[(test, y_test)],
            eval_metric='binary_logloss',
            callbacks=[lgb.early_stopping(5)])
    #eval_metric默认值：LGBMRegressor 为“l2”，LGBMClassifier 为“logloss”，LGBMRanker 为“ndcg”。
    #使用binary_logloss或者logloss准确率都是一样的。默认logloss
    y_pred = gbm.predict(test)
    # 计算准确率
    accuracy = accuracy_score(y_test,y_pred)
    auc_score=metrics.roc_auc_score(y_test,gbm.predict_proba(test)[:,1])#predict_proba输出正负样本概率值，取第二列为正样本概率值
    print("max_depth=",max_depth,"accuarcy: %.2f%%" % (accuracy*100.0),"auc_score: %.2f%%" % (auc_score*100.0))

test_depth(3)
test_depth(5)
test_depth(6)
test_depth(9)

Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[98]	valid_0's binary_logloss: 0.429334
max_depth= 3 accuarcy: 81.90% auc_score: 76.32%
Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[60]	valid_0's binary_logloss: 0.430826
max_depth= 5 accuarcy: 81.98% auc_score: 75.54%
Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[65]	valid_0's binary_logloss: 0.429341
max_depth= 6 accuarcy: 81.69% auc_score: 75.63%
Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[52]	valid_0's binary_logloss: 0.429146
max_depth= 9 accuarcy: 81.94% auc_score: 76.07%

#原生train接口
import numpy as np
from sklearn import metrics
lgb_train = lgb.Dataset(train, y_train)
lgb_eval = lgb.Dataset(test, y_test, reference=lgb_train)
#设置参数
def test_depth(max_depth): 
    params = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'max_depth': 10,
        'metric': 'binary_logloss',
        "learning_rate":0.1, 
        "min_child_samples":20,
         "num_leaves":31,
         "max_depth":max_depth}

    gbm2 = lgb.train(params,
                    lgb_train,
                    num_boost_round=10,
                    valid_sets=lgb_eval,
                    callbacks=[lgb.early_stopping(stopping_rounds=5)])

    y_pred = gbm2.predict(test,num_iteration=gbm2.best_iteration)#结果是0-1之间的概率值，是一维数组
    pred =[1 if x >0.5 else 0 for x in y_pred]
    accuracy = accuracy_score(y_test,pred)
    auc_score=metrics.roc_auc_score(y_test,y_pred)
    print("max_depth=",max_depth,"accuarcy: %.2f%%" % (accuracy*100.0),"auc_score: %.2f%%" % (auc_score*100.0))
test_depth(3)
print('-------------------------------------------------------------------------')
test_depth(5)
print('-------------------------------------------------------------------------')
test_depth(6)
print('-------------------------------------------------------------------------')
test_depth(9)

5.3 步骤3：category变量设置为categorical_feature

参考《Lightgbm如何处理类别特征？》
参考kaggle教程《Feature Selection with Null Importances》中的代码。

lightGBM比XGBoost的1个改进之处在于对类别特征的处理, 不再需要将类别特征转为one-hot形式。这一步通过设置categorical_feature来实现。

唯一疑惑的是真正的object特征只有’AIRLINE’, ‘DESTINATION_AIRPORT’, ‘ORIGIN_AIRPORT’，但是’FLIGHT_NUMBER’也设置成类别特征效果更好。

'FLIGHT_NUMBER’也为类别特征：accuarcy: 81.82% auc_score: 77.52%
'FLIGHT_NUMBER’不是类别特征：accuarcy: 81.69% auc_score: 76.48%

估计跟数据集有关系，没有仔细研究数据集。

import pandas as pd, numpy as np, time
data= pd.read_csv("https://cdn.coggle.club/kaggle-flight-delays/flights_10k.csv.zip")

# 提取有用的列
data= data[["MONTH","DAY","DAY_OF_WEEK","AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT",
                 "ORIGIN_AIRPORT","AIR_TIME", "DEPARTURE_TIME","DISTANCE","ARRIVAL_DELAY"]]
data.dropna(inplace=True)

from sklearn.model_selection import train_test_split
# 筛选出部分数据
data["ARRIVAL_DELAY"] = (data["ARRIVAL_DELAY"]>10)*1
categorical_feats  = ["AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT","ORIGIN_AIRPORT"]
#categorical_feats = [f for f in data.columns if data[f].dtype == 'object']

#将上面四列特征转为类别特征，但不是one-hot编码
for f_ in categorical_feats:
    data[f_], _ = pd.factorize(data[f_])
    # Set feature type as categorical
    data[f_] = data[f_].astype('category')
# 划分训练集和测试集
X_train, X_test, y_train, y_test = train_test_split(data.drop(["ARRIVAL_DELAY"], axis=1), data["ARRIVAL_DELAY"], random_state=10, test_size=0.25)
categorical_feats

['AIRLINE', 'FLIGHT_NUMBER', 'DESTINATION_AIRPORT', 'ORIGIN_AIRPORT']

import numpy as np
from sklearn import metrics

lgb_train = lgb.Dataset(X_train, y_train,free_raw_data=False)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train,free_raw_data=False)
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 16,
    'learning_rate': 0.3,
    'feature_fraction': 0.5,
    'lambda_l1': 0.0,
    'lambda_l2': 2.9,
    'max_depth': 15,
    'min_data_in_leaf': 12,
    'min_gain_to_split': 1.0,
    'min_sum_hessian_in_leaf': 0.0038,
    "verbosity":-1}


# 特征命名
#num_train, num_feature = X_train.shape#X_train是7194行10列的数据集，num_feature=10表示特征数量
#feature_name = ['feature_' + str(col) for col in range(num_feature)]#feature_0到9

gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                valid_sets=lgb_eval,  #验证集设置
                #feature_name=feature_name,  #特征命名
                categorical_feature=categorical_feats,
                callbacks=[lgb.early_stopping(stopping_rounds=5)]) #设置分类变量

y_pred = gbm.predict(X_test,num_iteration=gbm.best_iteration)#结果是0-1之间的概率值，是一维数组
pred =[1 if x >0.5 else 0 for x in y_pred]
accuracy = accuracy_score(y_test,pred)
auc_score=metrics.roc_auc_score(y_test,y_pred)
print("accuarcy: %.2f%%" % (accuracy*100.0),"auc_score: %.2f%%" % (auc_score*100.0))

Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[10]	valid_0's binary_logloss: 0.424384
accuarcy: 81.82% auc_score: 77.52%

#不设置categorical_feature结果一样啊，不知道为何？

# 进行one-hot编码
cols = ["AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT","ORIGIN_AIRPORT"]
for item in cols:
    data[item] = data[item].astype("category").cat.codes +1

# 划分训练集和测试集
train, test, y_train, y_test = train_test_split(data.drop(["ARRIVAL_DELAY"], axis=1), data["ARRIVAL_DELAY"], random_state=10, test_size=0.25)
gbm2 = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                valid_sets=lgb_eval,
                 callbacks=[lgb.early_stopping(stopping_rounds=5)]) 

y_pred2 = gbm2.predict(X_test,num_iteration=gbm2.best_iteration)#结果是0-1之间的概率值，是一维数组
pred2 =[1 if x >0.5 else 0 for x in y_pred2]
accuracy2 = accuracy_score(y_test,pred2)
auc_score2=metrics.roc_auc_score(y_test,y_pred2)
print("accuarcy: %.2f%%" % (accuracy2*100.0),"auc_score: %.2f%%" % (auc_score2*100.0))

Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[10]	valid_0's binary_logloss: 0.424384
accuarcy: 81.82% auc_score: 77.52%

5.4 步骤4：超参搜索

5.4.1 GridSearchCV

GridSearchCV参考文档

sklearn.model_selection.GridSearchCV(estimator, param_grid, *, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)

其中 scoring是字符串格式或者str列表、字典。具体的参数列表参考文档scoring-parameter

网格搜索——尝试所有可能的组合：

from sklearn.datasets import load_iris
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import lightgbm as lgb
from lightgbm import LGBMClassifier
import numpy as np
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

import pandas as pd, numpy as np, time
# 读取数据
data = pd.read_csv("https://cdn.coggle.club/kaggle-flight-delays/flights_10k.csv.zip")

# 提取有用的列
data = data[["MONTH","DAY","DAY_OF_WEEK","AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT",
                 "ORIGIN_AIRPORT","AIR_TIME", "DEPARTURE_TIME","DISTANCE","ARRIVAL_DELAY"]]
data.dropna(inplace=True)

# 筛选出部分数据
data["ARRIVAL_DELAY"] = (data["ARRIVAL_DELAY"]>10)*1

# 进行编码
cols = ["AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT","ORIGIN_AIRPORT"]
for item in cols:
    data[item] = data[item].astype("category").cat.codes +1

# 划分训练集和测试集
X_train, X_test, y_train, y_test = train_test_split(data.drop(["ARRIVAL_DELAY"], axis=1), data["ARRIVAL_DELAY"], random_state=10, test_size=0.25)

parameters = {
              'max_depth': [8, 10, 12],
              'learning_rate': [0.05, 0.1, 0.15],
              'n_estimators': [100, 200,500],
              "num_leaves":[25,31,36]}

gbm = lgb.LGBMClassifier(max_depth=10,#构建树的深度，越大越容易过拟合
            learning_rate=0.01,
            n_estimators=100,         
            seed=0,
            missing=None)

gs = GridSearchCV(gbm, param_grid=parameters, scoring='accuracy', cv=3)
gs.fit(X_train, y_train)

print("Best score: %0.3f" % gs.best_score_)
print("Best parameters set: %s" % gs.best_params_ )

Best score: 0.805
Best parameters set: {'learning_rate': 0.05, 'max_depth': 8, 'n_estimators': 100, 'num_leaves': 36}

#使用最优参数预测验证集
y_pred = gs.predict(X_test)
# 计算准确率和auc
accuracy = accuracy_score(y_test,y_pred)
auc_score=metrics.roc_auc_score(y_test,gs.predict_proba(X_test)[:,1])#predict_proba输出正负样本概率值，取第二列为正样本概率值
print("accuarcy: %.2f%%" % (accuracy*100.0),"auc_score: %.2f%%" % (auc_score*100.0))

accuarcy: 82.07% auc_score: 75.92%

5.4.3 随机搜索

参考文档

网格搜索尝试超参数的所有组合，因此增加了计算的时间复杂度，在数据量较大，或者模型较为复杂等等情况下，可能导致不可行的计算成本，这样网格搜索调参方法就不适用了。然而，随机搜索提供更便利的替代方案，该方法只测试你选择的超参数组成的元组，并且超参数值的选择是完全随机的，如下图所示。

from sklearn.model_selection import RandomizedSearchCV

param = dict(n_estimators=[80,100, 200],
             max_depth=[6,8,10],
            learning_rate= [0.02,0.05, 0.1],
            num_leaves=[25,31,36])

grid = RandomizedSearchCV(estimator=lgb.LGBMClassifier(),
                          param_distributions=param,scoring='accuracy',cv=3)
grid.fit(X_train, y_train)

print("Best score: %0.3f" % grid.best_score_)
print("Best parameters set: %s" % grid.best_params_ )
# 找到最好的模型
grid.best_estimator_

Best score: 0.806
Best parameters set: {'num_leaves': 36, 'n_estimators': 80, 'max_depth': 6, 'learning_rate': 0.1}





LGBMClassifier(max_depth=6, n_estimators=80, num_leaves=36)

最优模型直接用grid或者rid.best_estimator_都行

#使用最优参数预测验证集
y_pred = grid .predict(X_test)
# 计算准确率和auc
accuracy = accuracy_score(y_test,y_pred)
auc_score=metrics.roc_auc_score(y_test,grid .predict_proba(X_test)[:,1])#predict_proba输出正负样本概率值，取第二列为正样本概率值
print("accuarcy: %.2f%%" % (accuracy*100.0),"auc_score: %.2f%%" % (auc_score*100.0))

# 找到最好的模型
gd=grid.best_estimator_
y_pred = gd .predict(X_test)
# 计算准确率和auc
accuracy = accuracy_score(y_test,y_pred)
auc_score=metrics.roc_auc_score(y_test,gd .predict_proba(X_test)[:,1])#predict_proba输出正负样本概率值，取第二列为正样本概率值
print("accuarcy: %.2f%%" % (accuracy*100.0),"auc_score: %.2f%%" % (auc_score*100.0))

accuarcy: 81.69% auc_score: 75.62%
accuarcy: 81.69% auc_score: 75.62%

5.4.4 贝叶斯搜索

参考《贝叶斯全局优化（LightGBM调参）》

贝叶斯搜索使用贝叶斯优化技术对搜索空间进行建模，以尽快获得优化的参数值。它使用搜索空间的结构来优化搜索时间。贝叶斯搜索方法使用过去的评估结果来采样最有可能提供更好结果的新候选参数（如下图所示）:

#设定贝叶斯优化的黑盒函数LGB_bayesian
def LGB_bayesian(
    num_leaves,  # int
    min_data_in_leaf,  # int
    learning_rate,
    min_sum_hessian_in_leaf,    # int  
    feature_fraction,
    lambda_l1,
    lambda_l2,
    min_gain_to_split,
    max_depth):
    
    # LightGBM expects next three parameters need to be integer. So we make them integer
    num_leaves = int(num_leaves)
    min_data_in_leaf = int(min_data_in_leaf)
    max_depth = int(max_depth)

    assert type(num_leaves) == int
    assert type(min_data_in_leaf) == int
    assert type(max_depth) == int

    param = {
        'num_leaves': num_leaves,
        'max_bin': 63,
        'min_data_in_leaf': min_data_in_leaf,
        'learning_rate': learning_rate,
        'min_sum_hessian_in_leaf': min_sum_hessian_in_leaf,
        'bagging_fraction': 1.0,
        'bagging_freq': 5,
        'feature_fraction': feature_fraction,
        'lambda_l1': lambda_l1,
        'lambda_l2': lambda_l2,
        'min_gain_to_split': min_gain_to_split,
        'max_depth': max_depth,
        'save_binary': True, 
        'seed': 1337,
        'feature_fraction_seed': 1337,
        'bagging_seed': 1337,
        'drop_seed': 1337,
        'data_random_seed': 1337,
        'objective': 'binary',
        'boosting_type': 'gbdt',
        'verbose': 1,
        'metric': 'auc',
        'is_unbalance': True,
        'boost_from_average': False,
        "verbosity":-1

    }    
    
    
    lgb_train = lgb.Dataset(X_train,
                           label=y_train)
    lgb_valid = lgb.Dataset(X_test,label=y_test,reference=lgb_train)   

    num_round = 500
    gbm= lgb.train(param, lgb_train, num_round, valid_sets = [lgb_valid],callbacks=[lgb.early_stopping(stopping_rounds=5)])   
    predictions = gbm.predict(X_test,num_iteration=gbm.best_iteration)
    score = metrics.roc_auc_score(y_test, predictions)
    
    return score

LGB_bayesian函数从贝叶斯优化框架获取num_leaves，min_data_in_leaf，learning_rate，min_sum_hessian_in_leaf，feature_fraction，lambda_l1，lambda_l2，min_gain_to_split，max_depth的值。请记住，对于LightGBM，num_leaves，min_data_in_leaf和max_depth应该是整数。但贝叶斯优化会发送连续的函数。所以我强制它们是整数。我只会找到它们的最佳参数值。读者可以增加或减少要优化的参数数量。
现在需要为这些参数提供边界，以便贝叶斯优化仅在边界内搜索

bounds_LGB = {
    'num_leaves': (5, 20), 
    'min_data_in_leaf': (5, 20),  
    'learning_rate': (0.01, 0.3),
    'min_sum_hessian_in_leaf': (0.00001, 0.01),    
    'feature_fraction': (0.05, 0.5),
    'lambda_l1': (0, 5.0), 
    'lambda_l2': (0, 5.0), 
    'min_gain_to_split': (0, 1.0),
    'max_depth':(3,15),
}

#将它们全部放在BayesianOptimization对象中
from bayes_opt import BayesianOptimization
LGB_BO = BayesianOptimization(LGB_bayesian, bounds_LGB, random_state=13)
print(LGB_BO.space.keys)#显示要优化的参数

['feature_fraction', 'lambda_l1', 'lambda_l2', 'learning_rate', 'max_depth', 'min_data_in_leaf', 'min_gain_to_split', 'min_sum_hessian_in_leaf', 'num_leaves']

调用maximize方法LGB_BO才会开始搜索。

init_points：我们想要执行的随机探索的初始随机运行次数。在我们的例子中，LGB_bayesian将被运行n_iter次。
n_iter：运行init_points数后，我们要执行多少次贝叶斯优化运行。

import warnings
import gc
pd.set_option('display.max_columns', 200)

init_points = 5
n_iter = 5
print('-' * 130)

with warnings.catch_warnings():
    warnings.filterwarnings('ignore')
    LGB_BO.maximize(init_points=init_points, n_iter=n_iter, acq='ucb', xi=0.0, alpha=1e-6)

----------------------------------------------------------------------------------------------------------------------------------
|   iter    |  target   | featur... | lambda_l1 | lambda_l2 | learni... | max_depth | min_da... | min_ga... | min_su... | num_le... |
-------------------------------------------------------------------------------------------------------------------------------------
[LightGBM] [Warning] verbosity is set=-1, verbose=1 will be ignored. Current value: verbosity=-1
Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[22]	valid_0's auc: 0.770384
| [0m 1       [0m | [0m 0.7704  [0m | [0m 0.4     [0m | [0m 1.188   [0m | [0m 4.121   [0m | [0m 0.2901  [0m | [0m 14.67   [0m | [0m 11.8    [0m | [0m 0.609   [0m | [0m 0.007758[0m | [0m 14.62   [0m |
[LightGBM] [Warning] verbosity is set=-1, verbose=1 will be ignored. Current value: verbosity=-1
Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[5]	valid_0's auc: 0.737399
| [0m 2       [0m | [0m 0.7374  [0m | [0m 0.3749  [0m | [0m 0.1752  [0m | [0m 1.492   [0m | [0m 0.02697 [0m | [0m 13.28   [0m | [0m 10.59   [0m | [0m 0.6798  [0m | [0m 0.00257 [0m | [0m 10.21   [0m |
[LightGBM] [Warning] verbosity is set=-1, verbose=1 will be ignored. Current value: verbosity=-1
Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[8]	valid_0's auc: 0.719501
| [0m 3       [0m | [0m 0.7195  [0m | [0m 0.05424 [0m | [0m 1.792   [0m | [0m 4.745   [0m | [0m 0.07319 [0m | [0m 6.833   [0m | [0m 18.77   [0m | [0m 0.0319  [0m | [0m 0.000660[0m | [0m 14.45   [0m |
[LightGBM] [Warning] verbosity is set=-1, verbose=1 will be ignored. Current value: verbosity=-1
Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[19]	valid_0's auc: 0.760323
| [0m 4       [0m | [0m 0.7603  [0m | [0m 0.4432  [0m | [0m 0.04358 [0m | [0m 3.733   [0m | [0m 0.2457  [0m | [0m 3.909   [0m | [0m 14.85   [0m | [0m 0.5093  [0m | [0m 0.004804[0m | [0m 19.33   [0m |
[LightGBM] [Warning] verbosity is set=-1, verbose=1 will be ignored. Current value: verbosity=-1
Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[8]	valid_0's auc: 0.719412
| [0m 5       [0m | [0m 0.7194  [0m | [0m 0.05001 [0m | [0m 1.235   [0m | [0m 3.561   [0m | [0m 0.1041  [0m | [0m 6.324   [0m | [0m 15.43   [0m | [0m 0.9186  [0m | [0m 0.002452[0m | [0m 11.87   [0m |
[LightGBM] [Warning] verbosity is set=-1, verbose=1 will be ignored. Current value: verbosity=-1
Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[11]	valid_0's auc: 0.761779
| [0m 6       [0m | [0m 0.7618  [0m | [0m 0.5     [0m | [0m 1.457   [0m | [0m 5.0     [0m | [0m 0.3     [0m | [0m 15.0    [0m | [0m 11.16   [0m | [0m 0.5786  [0m | [0m 0.01    [0m | [0m 17.3    [0m |
[LightGBM] [Warning] verbosity is set=-1, verbose=1 will be ignored. Current value: verbosity=-1
Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[8]	valid_0's auc: 0.723696
| [0m 7       [0m | [0m 0.7237  [0m | [0m 0.05    [0m | [0m 5.0     [0m | [0m 5.0     [0m | [0m 0.3     [0m | [0m 15.0    [0m | [0m 12.07   [0m | [0m 0.0     [0m | [0m 0.01    [0m | [0m 14.22   [0m |
[LightGBM] [Warning] verbosity is set=-1, verbose=1 will be ignored. Current value: verbosity=-1
Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[17]	valid_0's auc: 0.770585
| [95m 8       [0m | [95m 0.7706  [0m | [95m 0.5     [0m | [95m 0.0     [0m | [95m 2.9     [0m | [95m 0.3     [0m | [95m 14.75   [0m | [95m 11.78   [0m | [95m 1.0     [0m | [95m 0.003764[0m | [95m 16.12   [0m |
[LightGBM] [Warning] verbosity is set=-1, verbose=1 will be ignored. Current value: verbosity=-1
Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[19]	valid_0's auc: 0.769728
| [0m 9       [0m | [0m 0.7697  [0m | [0m 0.5     [0m | [0m 0.0     [0m | [0m 4.272   [0m | [0m 0.3     [0m | [0m 15.0    [0m | [0m 8.6     [0m | [0m 1.0     [0m | [0m 0.009985[0m | [0m 15.0    [0m |
[LightGBM] [Warning] verbosity is set=-1, verbose=1 will be ignored. Current value: verbosity=-1
Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[28]	valid_0's auc: 0.770488
| [0m 10      [0m | [0m 0.7705  [0m | [0m 0.5     [0m | [0m 0.0     [0m | [0m 4.457   [0m | [0m 0.3     [0m | [0m 10.79   [0m | [0m 9.347   [0m | [0m 1.0     [0m | [0m 0.01    [0m | [0m 16.83   [0m |
=====================================================================================================================================

print(LGB_BO.max['target'])#最佳的auc值
LGB_BO.max['params']#最佳模型参数

0.7705848546741305





{'feature_fraction': 0.5,
 'lambda_l1': 0.0,
 'lambda_l2': 2.899605369776912,
 'learning_rate': 0.3,
 'max_depth': 14.752822601781512,
 'min_data_in_leaf': 11.782200828907708,
 'min_gain_to_split': 1.0,
 'min_sum_hessian_in_leaf': 0.0037639771497955552,
 'num_leaves': 16.11909067874899}

#将这些参数用于我们的最终模型
LGB_BO.probe(
    params={'feature_fraction': 0.5,
            'lambda_l1': 0.0,
            'lambda_l2': 2.9,
            'learning_rate': 0.3,
            'max_depth': 15,
            'min_data_in_leaf': 12,
            'min_gain_to_split': 1.0,
            'min_sum_hessian_in_leaf': 0.0038,
            'num_leaves': 16},
            lazy=True)

#对LGB_BO对象进行最大化调用。
LGB_BO.maximize(init_points=0, n_iter=0)

|   iter    |  target   | featur... | lambda_l1 | lambda_l2 | learni... | max_depth | min_da... | min_ga... | min_su... | num_le... |
-------------------------------------------------------------------------------------------------------------------------------------
[LightGBM] [Warning] verbosity is set=-1, verbose=1 will be ignored. Current value: verbosity=-1
Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[17]	valid_0's auc: 0.770586
| [95m 11      [0m | [95m 0.7706  [0m | [95m 0.5     [0m | [95m 0.0     [0m | [95m 2.9     [0m | [95m 0.3     [0m | [95m 15.0    [0m | [95m 12.0    [0m | [95m 1.0     [0m | [95m 0.0038  [0m | [95m 16.0    [0m |
=====================================================================================================================================

#通过属性LGB_BO.res可以获得探测的所有参数列表及其相应的目标值。
for i, res in enumerate(LGB_BO.res):
    print("Iteration {}: \n\t{}".format(i, res))

将LGB_BO的最佳参数保存到param_lgb字典中，然后进行5折交叉训练

from sklearn.model_selection import StratifiedKFold
from scipy.stats import rankdata
param_lgb = {
        'num_leaves': int(LGB_BO.max['params']['num_leaves']), # remember to int here
        'max_bin': 63,
        'min_data_in_leaf': int(LGB_BO.max['params']['min_data_in_leaf']), # remember to int here
        'learning_rate': LGB_BO.max['params']['learning_rate'],
        'min_sum_hessian_in_leaf': LGB_BO.max['params']['min_sum_hessian_in_leaf'],
        'bagging_fraction': 1.0, 
        'bagging_freq': 5, 
        'feature_fraction': LGB_BO.max['params']['feature_fraction'],
        'lambda_l1': LGB_BO.max['params']['lambda_l1'],
        'lambda_l2': LGB_BO.max['params']['lambda_l2'],
        'min_gain_to_split': LGB_BO.max['params']['min_gain_to_split'],
        'max_depth': int(LGB_BO.max['params']['max_depth']), # remember to int here
        'save_binary': True,
        'seed': 1337,
        'feature_fraction_seed': 1337,
        'bagging_seed': 1337,
        'drop_seed': 1337,
        'data_random_seed': 1337,
        'objective': 'binary',
        'boosting_type': 'gbdt',
        'verbose': 1,
        'metric': 'auc',
        'is_unbalance': True,
        'boost_from_average': False,
    }

nfold = 5
gc.collect()
skf = StratifiedKFold(n_splits=nfold, shuffle=True, random_state=2019)

oof = np.zeros(len(y_train))
predictions = np.zeros((len(X_test),nfold))

i = 1
for train_index, valid_index in skf.split(X_train, y_train):
    print("\nfold {}".format(i))
    lgb_train = lgb.Dataset(X_train,label=y_train)
    lgb_valid = lgb.Dataset(X_test,label=y_test,reference=lgb_train)  
   
    clf = lgb.train(param_lgb, lgb_train,500, valid_sets = [lgb_valid ], verbose_eval=250, callbacks=[lgb.early_stopping(stopping_rounds=5)])
    print(clf.predict(X_train, num_iteration=clf.best_iteration) )
    oof[valid_index] = clf.predict(X_train.iloc[valid_index].values, num_iteration=clf.best_iteration) 
    
    predictions[:,i-1] += clf.predict(X_test, num_iteration=clf.best_iteration)
    i = i + 1

print("\n\nCV AUC: {:<0.2f}".format(metrics.roc_auc_score(y_train, oof)))

fold 1
[LightGBM] [Info] Number of positive: 1600, number of negative: 5594
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000288 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 393
[LightGBM] [Info] Number of data points in the train set: 7194, number of used features: 7
Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[17]	valid_0's auc: 0.770586
[0.52559221 0.40000825 0.43907974 ... 0.40122056 0.46515425 0.56678622]

fold 2
[LightGBM] [Info] Number of positive: 1600, number of negative: 5594
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000330 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 393
[LightGBM] [Info] Number of data points in the train set: 7194, number of used features: 7
Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[17]	valid_0's auc: 0.770586
[0.52559221 0.40000825 0.43907974 ... 0.40122056 0.46515425 0.56678622]

fold 3
[LightGBM] [Info] Number of positive: 1600, number of negative: 5594
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000292 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 393
[LightGBM] [Info] Number of data points in the train set: 7194, number of used features: 7
Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[17]	valid_0's auc: 0.770586
[0.52559221 0.40000825 0.43907974 ... 0.40122056 0.46515425 0.56678622]

fold 4
[LightGBM] [Info] Number of positive: 1600, number of negative: 5594
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000302 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 393
[LightGBM] [Info] Number of data points in the train set: 7194, number of used features: 7
Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[17]	valid_0's auc: 0.770586
[0.52559221 0.40000825 0.43907974 ... 0.40122056 0.46515425 0.56678622]

fold 5
[LightGBM] [Info] Number of positive: 1600, number of negative: 5594
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000300 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 393
[LightGBM] [Info] Number of data points in the train set: 7194, number of used features: 7
Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[17]	valid_0's auc: 0.770586
[0.52559221 0.40000825 0.43907974 ... 0.40122056 0.46515425 0.56678622]


CV AUC: 0.81

另一种贝叶斯搜索，参考：《网格搜索、随机搜索和贝叶斯搜索实用教程》

#还没有写完，并不能正确运行
from skopt import BayesSearchCV
# 参数范围由下面的一个指定
from skopt.space import Real, Categorical, Integer
search_spaces = {
  'C': Real(0.1, 1e+4),
  'gamma': Real(1e-6, 1e+1, 'log-uniform')}

#接下来创建一个RandomizedSearchCV带参数n_iter_search的对象，并将使用训练数据来训练模型。

n_iter_search = 20 
bayes_search = BayesSearchCV( 
    lgb.LGBMClassifier(), 
    search_spaces, 
    n_iter=n_iter_search, 
    cv=3, 
    verbose=3 
) 
bayes_search.fit(X_train, y_train)
bayes_search.best_params_

六、模型微调与参数衰减

6.2 学习率衰减

参考《python实现LightGBM(进阶)python实现LightGBM(进阶)》

import pandas as pd, numpy as np, time
from sklearn.model_selection import train_test_split

# 读取数据
data = pd.read_csv("https://cdn.coggle.club/kaggle-flight-delays/flights_10k.csv.zip")

# 提取有用的列
data = data[["MONTH","DAY","DAY_OF_WEEK","AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT",
                 "ORIGIN_AIRPORT","AIR_TIME", "DEPARTURE_TIME","DISTANCE","ARRIVAL_DELAY"]]
data.dropna(inplace=True)

# 筛选出部分数据
data["ARRIVAL_DELAY"] = (data["ARRIVAL_DELAY"]>10)*1

# 进行编码
cols = ["AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT","ORIGIN_AIRPORT"]
for item in cols:
    data[item] = data[item].astype("category").cat.codes +1

# 划分训练集和测试集
X_train, X_test, y_train, y_test = train_test_split(data.drop(["ARRIVAL_DELAY"], axis=1), data["ARRIVAL_DELAY"], random_state=10, test_size=0.25)

lgb_train = lgb.Dataset(X_train, y_train,free_raw_data=False)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train,free_raw_data=False)
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 16,
    'learning_rate': 0.3,
    'feature_fraction': 0.5,
    'lambda_l1': 0.0,
    'lambda_l2': 2.9,
    'max_depth': 15,
    'min_data_in_leaf': 12,
    'min_gain_to_split': 1.0,
    'min_sum_hessian_in_leaf': 0.0038,
    "verbosity":-1}

# 学习率指数衰减,learning_rates弃用了
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                learning_rates=lambda iter: 0.3 * (0.99 ** iter),# 学习率衰减
                valid_sets=lgb_eval)
#设置learning_rates结果是accuarcy: 82.07% auc_score: 75.40%
#不设置learning_rates结果是accuarcy: 81.61% auc_score: 75.74%,还是不一样
# 学习率指数衰减
gbm2 = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                valid_sets=lgb_eval,
                callbacks=[lgb.reset_parameter(bagging_fraction=lambda iter: 0.3 * (0.99 ** iter))])
#不设置init_model，结果是accuarcy: 81.69% auc_score: 75.25%
#  设置init_model，结果是accuarcy: 81.94% auc_score: 76.32%
#lgb.reset_parameter参数可以是列表或者衰减函数，不知道为啥bagging_fraction设置不同值结果是一样的
y_pred1,y_pred2 = gbm.predict(X_test,num_iteration=gbm.best_iteration),gbm2.predict(X_test,num_iteration=gbm2.best_iteration)
pred1,pred2 =[1 if x >0.5 else 0 for x in y_pred1],[1 if x >0.5 else 0 for x in y_pred2]
accuracy1,accuracy2 = accuracy_score(y_test,pred1),accuracy_score(y_test,pred2)
auc_score1,auc_score2=metrics.roc_auc_score(y_test,y_pred1),metrics.roc_auc_score(y_test,y_pred2)
print("accuarcy: %.2f%%" % (accuracy1*100.0),"auc_score: %.2f%%" % (auc_score1*100.0))
print("accuarcy: %.2f%%" % (accuracy2*100.0),"auc_score: %.2f%%" % (auc_score2*100.0))

[1]	valid_0's binary_logloss: 0.48425
[2]	valid_0's binary_logloss: 0.471031
[3]	valid_0's binary_logloss: 0.46278
[4]	valid_0's binary_logloss: 0.456369
[5]	valid_0's binary_logloss: 0.449357
[6]	valid_0's binary_logloss: 0.444377
[7]	valid_0's binary_logloss: 0.440908
[8]	valid_0's binary_logloss: 0.438597
[9]	valid_0's binary_logloss: 0.435632
[10]	valid_0's binary_logloss: 0.434647
accuarcy: 82.07% auc_score: 75.40%
accuarcy: 82.19% auc_score: 76.08%

# 学习率阶梯衰减,bagging_fraction'如果使用列表，列表元素数量要和  'num_boost_round'值一样
gbm3 = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,#
                valid_sets=lgb_eval,
                callbacks=[lgb.reset_parameter(bagging_fraction=[0.6]*5+[0.2]*3+[0.1]*2)])

y_pred = gbm3.predict(X_test,num_iteration=gbm3.best_iteration)#结果是0-1之间的概率值，是一维数组
pred =[1 if x >0.5 else 0 for x in y_pred]
accuracy = accuracy_score(y_test,pred)
auc_score=metrics.roc_auc_score(y_test,y_pred)
print("accuarcy: %.2f%%" % (accuracy*100.0),"auc_score: %.2f%%" % (auc_score*100.0))

accuarcy: 81.94% auc_score: 76.30%

七、特征筛选方法

7.1 筛选最重要的3个特征

#通过feature_importances_方法得到特征重要性，值越高越重要
gbm = lgb.LGBMClassifier(max_depth=9)
gbm.fit(train, y_train,
            eval_set=[(test, y_test)],
            eval_metric='binary_logloss',
            callbacks=[lgb.early_stopping(5)])
df=pd.DataFrame(gbm.feature_importances_,gbm.feature_name_,columns=['value'])
df.sort_values('value',inplace=True,ascending=False)
df

Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[52]	valid_0's binary_logloss: 0.429146

	value
DEPARTURE_TIME	276
ORIGIN_AIRPORT	262
DESTINATION_AIRPORT	250
FLIGHT_NUMBER	236
AIR_TIME	227
DISTANCE	184
AIRLINE	124
MONTH	0
DAY	0
DAY_OF_WEEK	0

上图看出，最重要的是DEPARTURE_TIME、ORIGIN_AIRPORT、DESTINATION_AIRPORT

7.2 利用PermutationImportance排列特征重要性

eli5文档
利用PermutationImportance挑选变量
kaggle教程

import eli5
from eli5.sklearn import PermutationImportance
perm = PermutationImportance(gbm, random_state=1).fit(test,y_test)
eli5.show_weights(perm, feature_names =gbm.feature_name_)

Training until validation scores don’t improve for 5 rounds
Early stopping, best iteration is:
[52] valid_0’s binary_logloss: 0.429146

Weight	Feature

0.0396

        ± 0.0096



    DEPARTURE_TIME

0.0144

        ± 0.0057



    DESTINATION_AIRPORT

0.0137

        ± 0.0043



    ORIGIN_AIRPORT

0.0089

        ± 0.0067



    AIR_TIME

0.0048

        ± 0.0041



    AIRLINE

0.0042

        ± 0.0045



    DISTANCE

0.0038

        ± 0.0029



    FLIGHT_NUMBER

        ± 0.0000



    DAY_OF_WEEK

        ± 0.0000



    MONTH

所以前三重要的特征是DEPARTURE_TIME、DESTINATION_AIRPORT、ORIGIN_AIRPORT

7.3 Null Importances进行特征选择

参考kaggkle教程
《数据竞赛】99%情况下都有效的特征筛选策略–Null Importance》

特征筛选策略 – Null Importance 特征筛选

7.3.1 主要思想：

通过利用跑树模型得到特征的importance来判断特征的稳定性和好坏。

将构建好的特征和正确的标签扔进树模型中，此时可以得到每个特征的重要性（split 和 gain）
将数据的标签打乱，再扔进模型中，得到打乱标签后，每个特征的重要性（split和gain）；重复n次；取n次特征重要性的平均值。
将1中正确标签跑的特征的重要性和2中打乱标签的特征中重要性进行比较；具体比较方式可以参考上面的kernel

当一个特征非常work，那它在正确标签的树模型中的importance应该很高，但它在打乱标签的树模型中的importance将很低（无法识别随机标签）；反之，一个垃圾特征，那它在正确标签的模型中importance很一般，打乱标签的树模型中importance将大于等于正确标签模型的importance。所以通过同时判断每个特征在正确标签的模型和打乱标签的模型中的importance（split和gain），可以选择特征稳定和work的特征。
思想大概就是这样吧，importance受到特征相关性的影响，特征的重要性会被相关特征的重要性稀释，看importance也不一定准，用这个来对暴力特征进行筛选还是可以的。

7.3.2实现步骤

Null Importance算法的实现步骤为：

在原始数据集上运行模型并且记录每个特征重要性。以此作为基准；
构建Null importances分布：对我们的标签进行随机Shuffle，并且计算shuffle之后的特征的重要性；
对2进行多循环操作，得到多个不同shuffle之后的特征重要性；
设计score函数，得到未shuffle的特征重要性与shuffle之后特征重要性的偏离度，并以此设计特征筛选策略；
计算不同筛选情况下的模型的分数，并进行记录；
将分数最好的几个分数对应的特征进行返回。实现步骤

7.3.3 读取数据集，计算Real Targe和shuffle Target下的特征重要度

import pandas as pd
import numpy as np

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
import time
from lightgbm import LGBMClassifier
import lightgbm as lgb

import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
%matplotlib inline

import warnings
warnings.simplefilter('ignore', UserWarning)

import gc
gc.enable()

import pandas as pd, numpy as np, time
data= pd.read_csv("https://cdn.coggle.club/kaggle-flight-delays/flights_10k.csv.zip")

# 提取有用的列
data= data[["AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT",
                 "ORIGIN_AIRPORT","AIR_TIME", "DEPARTURE_TIME","DISTANCE","ARRIVAL_DELAY"]]
data.dropna(inplace=True)

from sklearn.model_selection import train_test_split
# 筛选出部分数据
data["ARRIVAL_DELAY"] = (data["ARRIVAL_DELAY"]>10)*1
#categorical_feats  = ["AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT","ORIGIN_AIRPORT"]
categorical_feats = [f for f in data.columns if data[f].dtype == 'object']

#将上面四列特征转为类别特征，但不是one-hot编码
for f_ in categorical_feats:
    data[f_], _ = pd.factorize(data[f_])
    # Set feature type as categorical
    data[f_] = data[f_].astype('category')
# 划分训练集和测试集
X_train, X_test, y_train, y_test = train_test_split(data.drop(["ARRIVAL_DELAY"], axis=1), data["ARRIVAL_DELAY"], random_state=10, test_size=0.25)

创建评分函数
feature_importances_ :特征重要性的类型。default=‘split’。

如果是split，则结果包含该特征在模型中使用的次数。
如果为“gain”，则结果包含使用该特征的分割的总增益。

def get_feature_importances(X_train, X_test, y_train, y_test,shuffle, seed=None):
    # 获取特征
    train_features = list(X_train.columns)   
    # 判断是否shuffle TARGET
    y_train,y_test= y_train.copy(),y_test.copy()
    if shuffle:
        # Here you could as well use a binomial distribution
        y_train,y_test= y_train.copy().sample(frac=1.0),y_test.copy().sample(frac=1.0)
    
    
    lgb_train = lgb.Dataset(X_train, y_train,free_raw_data=False,silent=True)
    lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train,free_raw_data=False,silent=True)
    # 在 RF 模式下安装 LightGBM，它比 sklearn RandomForest 更快   
    lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 16,
    'learning_rate': 0.3,
    'feature_fraction': 0.5,
    'lambda_l1': 0.0,
    'lambda_l2': 2.9,
    'max_depth': 15,
    'min_data_in_leaf': 12,
    'min_gain_to_split': 1.0,
    'min_sum_hessian_in_leaf': 0.0038}
    
    # 训练模型
    clf = lgb.train(params=lgb_params,train_set=lgb_train,valid_sets=lgb_eval,num_boost_round=10, categorical_feature=categorical_feats)#将object特征设置为分类特征，但是并不需要进行one-hot编码

    #得到特征重要性
    imp_df = pd.DataFrame()
    imp_df["feature"] = list(train_features)
    imp_df["importance_gain"] = clf.feature_importance(importance_type='gain')
    imp_df["importance_split"] = clf.feature_importance(importance_type='split')
    imp_df['trn_score'] = roc_auc_score(y_test, clf.predict( X_test))
    
    return imp_df

np.random.seed(123)
# 获得市实际的特征重要性，即没有shuffletarget
actual_imp_df = get_feature_importances(X_train, X_test, y_train, y_test, shuffle=False)
actual_imp_df

[LightGBM] [Info] Number of positive: 1600, number of negative: 5594
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000416 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1406
[LightGBM] [Info] Number of data points in the train set: 7194, number of used features: 7
[1]	valid_0's binary_logloss: 0.479157
[2]	valid_0's binary_logloss: 0.46882
[3]	valid_0's binary_logloss: 0.454724
[4]	valid_0's binary_logloss: 0.445913
[5]	valid_0's binary_logloss: 0.440924
[6]	valid_0's binary_logloss: 0.438309
[7]	valid_0's binary_logloss: 0.433886
[8]	valid_0's binary_logloss: 0.432747
[9]	valid_0's binary_logloss: 0.431001
[10]	valid_0's binary_logloss: 0.429621

	feature	importance_gain	importance_split	trn_score
0	AIRLINE	153.229680	15	0.764829
1	FLIGHT_NUMBER	189.481180	23	0.764829
2	DESTINATION_AIRPORT	1036.401096	23	0.764829
3	ORIGIN_AIRPORT	650.938854	22	0.764829
4	AIR_TIME	119.763649	17	0.764829
5	DEPARTURE_TIME	994.109417	37	0.764829
6	DISTANCE	93.170790	13	0.764829

null_imp_df = pd.DataFrame()
nb_runs = 10
import time
start = time.time()
dsp = ''
for i in range(nb_runs):
    # 获取当前的特征重要性
    imp_df = get_feature_importances(X_train, X_test, y_train, y_test, shuffle=True)
    imp_df['run'] = i + 1 
    # 将特征重要性连起来
    null_imp_df = pd.concat([null_imp_df, imp_df], axis=0)
    # 删除上一条信息
    for l in range(len(dsp)):
        print('\b', end='', flush=True)
    # Display current run and time used
    spent = (time.time() - start) / 60
    dsp = 'Done with %4d of %4d (Spent %5.1f min)' % (i + 1, nb_runs, spent)
    print(dsp, end='', flush=True)

null_imp_df

	feature	importance_gain	importance_split	trn_score	run
0	AIRLINE	26.436000	8	0.525050	1
1	FLIGHT_NUMBER	142.159161	35	0.525050	1
2	DESTINATION_AIRPORT	231.459383	20	0.525050	1
3	ORIGIN_AIRPORT	319.862975	26	0.525050	1
4	AIR_TIME	97.764902	24	0.525050	1
...	...	...	...	...	...
2	DESTINATION_AIRPORT	254.016771	20	0.509197	10
3	ORIGIN_AIRPORT	271.220462	20	0.509197	10
4	AIR_TIME	82.260759	17	0.509197	10
5	DEPARTURE_TIME	137.511192	25	0.509197	10
6	DISTANCE	73.353821	19	0.509197	10

70 rows × 5 columns

可视化演示

def display_distributions(actual_imp_df_, null_imp_df_, feature_):
    plt.figure(figsize=(13, 6))
    gs = gridspec.GridSpec(1, 2)
    # 画出 Split importances
    ax = plt.subplot(gs[0, 0])
    a = ax.hist(null_imp_df_.loc[null_imp_df_['feature'] == feature_, 'importance_split'].values, label='Null importances')
    ax.vlines(x=actual_imp_df_.loc[actual_imp_df_['feature'] == feature_, 'importance_split'].mean(), 
               ymin=0, ymax=np.max(a[0]), color='r',linewidth=10, label='Real Target')
    ax.legend()
    ax.set_title('Split Importance of %s' % feature_.upper(), fontweight='bold')
    plt.xlabel('Null Importance (split) Distribution for %s ' % feature_.upper())
    # 画出 Gain importances
    ax = plt.subplot(gs[0, 1])
    a = ax.hist(null_imp_df_.loc[null_imp_df_['feature'] == feature_, 'importance_gain'].values, label='Null importances')
    ax.vlines(x=actual_imp_df_.loc[actual_imp_df_['feature'] == feature_, 'importance_gain'].mean(), 
               ymin=0, ymax=np.max(a[0]), color='r',linewidth=10, label='Real Target')
    ax.legend()
    ax.set_title('Gain Importance of %s' % feature_.upper(), fontweight='bold')
    plt.xlabel('Null Importance (gain) Distribution for %s ' % feature_.upper())

#画出“DESTINATION_AIRPORT”的特征重要性
display_distributions(actual_imp_df_=actual_imp_df, null_imp_df_=null_imp_df, feature_='DESTINATION_AIRPORT')

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-irMW8jCB-1642190201174)(lightGBM_files/lightGBM_91_0.png)]

7.3.4计算Score

以未进行特征shuffle的特征重要性除以shuffle之后的0.75分位数作为我们的score
因为’MONTH’,‘DAY’,'DAY_OF_WEEK’三个特征没有什么用，画的的图结果不好看，所以把这三个去掉了。

feature_scores = []
for _f in actual_imp_df['feature'].unique():
    f_null_imps_gain = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance_gain'].values
    f_act_imps_gain = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance_gain'].mean()
    gain_score = np.log(1e-10 + f_act_imps_gain / (1 + np.percentile(f_null_imps_gain, 75)))  # Avoid didvide by zero
    f_null_imps_split = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance_split'].values
    f_act_imps_split = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance_split'].mean()
    split_score = np.log(1e-10 + f_act_imps_split / (1 + np.percentile(f_null_imps_split, 75)))  # Avoid didvide by zero
    feature_scores.append((_f, split_score, gain_score))

scores_df = pd.DataFrame(feature_scores, columns=['feature', 'split_score', 'gain_score'])

plt.figure(figsize=(16, 16))
gs = gridspec.GridSpec(1, 2)
# Plot Split importances
ax = plt.subplot(gs[0, 0])
sns.barplot(x='split_score', y='feature', data=scores_df.sort_values('split_score', ascending=False).iloc[0:70], ax=ax)
ax.set_title('Feature scores wrt split importances', fontweight='bold', fontsize=14)
# Plot Gain importances
ax = plt.subplot(gs[0, 1])
sns.barplot(x='gain_score', y='feature', data=scores_df.sort_values('gain_score', ascending=False).iloc[0:70], ax=ax)
ax.set_title('Feature scores wrt gain importances', fontweight='bold', fontsize=14)
plt.tight_layout()

null_imp_df.to_csv('null_importances_distribution_rf.csv')
actual_imp_df.to_csv('actual_importances_ditribution_rf.csv')

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-MbfvGl3f-1642190201175)(lightGBM_files/lightGBM_93_0.png)]

shuffle target之后特征重要性低于实际target对应特征的重要性0.25分位数的次数百分比。

correlation_scores = []
for _f in actual_imp_df['feature'].unique():
    f_null_imps = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance_gain'].values
    f_act_imps = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance_gain'].values
    gain_score = 100 * (f_null_imps < np.percentile(f_act_imps, 25)).sum() / f_null_imps.size
    f_null_imps = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance_split'].values
    f_act_imps = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance_split'].values
    split_score = 100 * (f_null_imps < np.percentile(f_act_imps, 25)).sum() / f_null_imps.size
    correlation_scores.append((_f, split_score, gain_score))

corr_scores_df = pd.DataFrame(correlation_scores, columns=['feature', 'split_score', 'gain_score'])

fig = plt.figure(figsize=(16, 16))
gs = gridspec.GridSpec(1, 2)
# Plot Split importances
ax = plt.subplot(gs[0, 0])
sns.barplot(x='split_score', y='feature', data=corr_scores_df.sort_values('split_score', ascending=False).iloc[0:70], ax=ax)
ax.set_title('Feature scores wrt split importances', fontweight='bold', fontsize=14)
# Plot Gain importances
ax = plt.subplot(gs[0, 1])
sns.barplot(x='gain_score', y='feature', data=corr_scores_df.sort_values('gain_score', ascending=False).iloc[0:70], ax=ax)
ax.set_title('Feature scores wrt gain importances', fontweight='bold', fontsize=14)
plt.tight_layout()
plt.suptitle("Features' split and gain scores", fontweight='bold', fontsize=16)
fig.subplots_adjust(top=0.93)

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-aDm9kLIm-1642190201175)(lightGBM_files/lightGBM_95_0.png)]

correlation_scores

[('AIRLINE', 100.0, 100.0),
 ('FLIGHT_NUMBER', 0.0, 60.0),
 ('DESTINATION_AIRPORT', 100.0, 100.0),
 ('ORIGIN_AIRPORT', 50.0, 100.0),
 ('AIR_TIME', 10.0, 90.0),
 ('DEPARTURE_TIME', 100.0, 100.0),
 ('DISTANCE', 30.0, 100.0)]

计算特征筛选之后的最佳分数并记录相应特征
通过运行下面的代码，train_features选择不同的特征来拟合模型，最终Results for threshold 20/30效果最好。此时的模型特征为

split_feats = [_f for _f, _score, _ in correlation_scores if _score >=20]
split_feats

['AIRLINE',
 'DESTINATION_AIRPORT',
 'ORIGIN_AIRPORT',
 'DEPARTURE_TIME',
 'DISTANCE']

#此时的特征为

def score_feature_selection(data,train_features=None, cat_feats=None):
    # Fit LightGBM 
    lgb_train = lgb.Dataset(data[train_features], data["ARRIVAL_DELAY"],free_raw_data=False,silent=True)
    # 在 RF 模式下安装 LightGBM，它比 sklearn RandomForest 更快   
    lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 16,
    'learning_rate': 0.3,
    'feature_fraction': 0.5,
    'lambda_l1': 0.0,
    'lambda_l2': 2.9,
    'max_depth': 15,
    'min_data_in_leaf': 12,
    'min_gain_to_split': 1.0,
    'metric': 'auc',
    'min_sum_hessian_in_leaf': 0.0038,
    "verbosity":-1}
    #"force_col_wise":true}
    
    # 训练模型
    hist = lgb.cv(params=lgb_params,train_set=lgb_train,
    num_boost_round=10, categorical_feature=cat_feats,
    nfold=5,stratified=True,shuffle=True,early_stopping_rounds=5,seed=17)

    # Return the last mean / std values 
    return hist['auc-mean'][-1], hist['auc-stdv'][-1]

# features = [f for f in data.columns if f not in ['SK_ID_CURR', 'TARGET']]
# score_feature_selection(df=data[features], train_features=features, target=data['TARGET'])

for threshold in [0, 10, 20, 30 , 40, 50 ,60 , 70, 80 , 90, 99]:
    split_feats = [_f for _f, _score, _ in correlation_scores if _score >= threshold]
    split_cat_feats = [_f for _f, _score, _ in correlation_scores if (_score >= threshold) & (_f in categorical_feats)]
    gain_feats = [_f for _f, _, _score in correlation_scores if _score >= threshold]
    gain_cat_feats = [_f for _f, _, _score in correlation_scores if (_score >= threshold) & (_f in categorical_feats)]
                                                                                             
    print('Results for threshold %3d' % threshold)
    split_results = score_feature_selection(data,train_features=split_feats, cat_feats=split_cat_feats)
    print('\t SPLIT : %.6f +/- %.6f' % (split_results[0], split_results[1]))
    gain_results = score_feature_selection(data,train_features=gain_feats, cat_feats=gain_cat_feats)
    print('\t GAIN  : %.6f +/- %.6f' % (gain_results[0], gain_results[1]))

Results for threshold   0
	 SPLIT : 0.757882 +/- 0.012114
	 GAIN  : 0.757882 +/- 0.012114
Results for threshold  10
	 SPLIT : 0.756999 +/- 0.011506
	 GAIN  : 0.757882 +/- 0.012114
Results for threshold  20
	 SPLIT : 0.757959 +/- 0.012558
	 GAIN  : 0.757882 +/- 0.012114
Results for threshold  30
	 SPLIT : 0.757959 +/- 0.012558
	 GAIN  : 0.757882 +/- 0.012114
Results for threshold  40
	 SPLIT : 0.745729 +/- 0.013217
	 GAIN  : 0.757882 +/- 0.012114
Results for threshold  50
	 SPLIT : 0.745729 +/- 0.013217
	 GAIN  : 0.757882 +/- 0.012114
Results for threshold  60
	 SPLIT : 0.727063 +/- 0.006758
	 GAIN  : 0.757882 +/- 0.012114
Results for threshold  70
	 SPLIT : 0.727063 +/- 0.006758
	 GAIN  : 0.756999 +/- 0.011506
Results for threshold  80
	 SPLIT : 0.727063 +/- 0.006758
	 GAIN  : 0.756999 +/- 0.011506
Results for threshold  90
	 SPLIT : 0.727063 +/- 0.006758
	 GAIN  : 0.756999 +/- 0.011506
Results for threshold  99
	 SPLIT : 0.727063 +/- 0.006758
	 GAIN  : 0.757959 +/- 0.012558

八、自定义损失函数和评测函数

参考XGB/LGB—自定义损失函数与评价函数
参考示例

自定义损失函数，预测概率小于0.1的正样本（标签为正样本，但模型预测概率小于0.1），梯度增加一倍。
自定义评价函数，阈值大于0.8视为正样本（标签为正样本，但模型预测概率大于0.8）。

#正常模型效果
import warnings
warnings.filterwarnings("ignore")

#特征去掉'MONTH','DAY','DAY_OF_WEEK'三个没用的之后，正常模型阈值0.5时accuarcy: 83.03% auc_score: 83.67%
#acc阈值 0.8时accuarcy: 80.40% auc_score: 83.67%
lgb_train = lgb.Dataset(X_train, y_train,free_raw_data=False,silent=True)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train,free_raw_data=False,silent=True)
    # 在 RF 模式下安装 LightGBM，它比 sklearn RandomForest 更快   
lgb_params = {
'boosting_type': 'gbdt',
'objective': 'binary',
'metric': 'binary_logloss',#别名binary_error
'num_leaves': 16,
'learning_rate': 0.3,
'feature_fraction': 0.5,
'lambda_l1': 0.0,
'lambda_l2': 2.9,
'max_depth': 15,
'min_data_in_leaf': 12,
'min_gain_to_split': 1.0,
'min_sum_hessian_in_leaf': 0.0038,
"verbosity":-5}
    
    # 训练模型
clf2 = lgb.train(params=lgb_params,train_set=lgb_train,valid_sets=lgb_eval,num_boost_round=10, 
                categorical_feature=categorical_feats)

y_pred = clf2.predict(X_test,num_iteration=clf2.best_iteration)#结果是0-1之间的概率值，是一维数组
pred =[1 if x >0.5 else 0 for x in y_pred]
accuracy = accuracy_score(y_test,pred)
auc_score=metrics.roc_auc_score(y_test,y_pred)
print("accuarcy: %.2f%%" % (accuracy*100.0),"auc_score: %.2f%%" % (auc_score*100.0))

[1]	valid_0's binary_logloss: 0.479157
[2]	valid_0's binary_logloss: 0.46882
[3]	valid_0's binary_logloss: 0.454724
[4]	valid_0's binary_logloss: 0.445913
[5]	valid_0's binary_logloss: 0.440924
[6]	valid_0's binary_logloss: 0.438309
[7]	valid_0's binary_logloss: 0.433886
[8]	valid_0's binary_logloss: 0.432747
[9]	valid_0's binary_logloss: 0.431001
[10]	valid_0's binary_logloss: 0.429621
accuarcy: 81.69% auc_score: 76.48%

# 自定义目标函数，预测概率小于0.1的正样本（标签为正样本，但模型预测概率小于0.1），梯度增加一倍。
def loglikelihood(preds, train_data):
    labels=train_data.get_label()
    preds=1./(1.+np.exp(-preds))
    
    grad=[(p-l) if p>=0.1 else 2*(p-l) for (p,l) in zip(preds,labels) ]
    hess=[p*(1.-p) if p>=0.1 else 2*p*(1.-p) for p in preds ]

    return grad, hess

# 自定义评价指标binary_error，阈值大于0.8视为正样本
def binary_error(preds, train_data):
    labels = train_data.get_label()
    preds = 1. / (1. + np.exp(-preds))
    return 'error', np.mean(labels != (preds > 0.8)), False

clf3 = lgb.train(lgb_params,
                lgb_train,
                num_boost_round=10,
                init_model=clf2,
                fobj=loglikelihood, # 目标函数
                feval=binary_error, # 评价指标
                valid_sets=lgb_eval)

y_pred = clf3.predict(X_test,num_iteration=clf3.best_iteration)#结果是0-1之间的概率值，是一维数组
pred =[1 if x >0.8 else 0 for x in y_pred]
accuracy2 = accuracy_score(y_test,pred)
auc_score2=metrics.roc_auc_score(y_test,y_pred)
print("accuarcy: %.2f%%" % (accuracy2*100.0),"auc_score: %.2f%%" % (auc_score2*100.0))

[11]	valid_0's binary_logloss: 4.68218	valid_0's error: 0.196414
[12]	valid_0's binary_logloss: 4.51541	valid_0's error: 0.196414
[13]	valid_0's binary_logloss: 4.4647	valid_0's error: 0.19558
[14]	valid_0's binary_logloss: 4.5248	valid_0's error: 0.196414
[15]	valid_0's binary_logloss: 4.51904	valid_0's error: 0.196414
[16]	valid_0's binary_logloss: 4.52481	valid_0's error: 0.196414
[17]	valid_0's binary_logloss: 4.4928	valid_0's error: 0.196414
[18]	valid_0's binary_logloss: 4.43027	valid_0's error: 0.196414
[19]	valid_0's binary_logloss: 4.4285	valid_0's error: 0.196414
[20]	valid_0's binary_logloss: 4.42314	valid_0's error: 0.196831
accuarcy: 81.19% auc_score: 76.47%

九模型部署与加速

参考文档
python+Treelite：Sklearn树模型训练迁移到c、java部署

由于 Treelite 的范围仅限于预测，因此必须使用其他机器学习包来训练决策树集成模型。在本文档中，我们将展示如何导入已在其他地方训练过的集成模型。

import treelite失败，无法导入，不知道为什么

gbm9 = lgb.LGBMClassifier()
gbm9.fit(X_train, y_train)

y_pred = gbm9.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
auc_score=metrics.roc_auc_score(y_test,gbm9.predict_proba(X_test)[:,1])#predict_proba输出正负样本概率值，取第二列为正样本概率值
print("accuarcy: %.2f%%" % (accuracy*100.0),"auc_score: %.2f%%" % (auc_score*100.0))

gbm9.booster_.save_model("model9.txt")#保存模型为txt格式

import treelite
import treelite.sklearn
model = treelite.sklearn.import_model(gbm9)#导入 scikit-learn 模型
y_pred = model.predict(X_test)
accuracy2 = accuracy_score(y_test,y_pred)
auc_score2=metrics.roc_auc_score(y_test,model.predict_proba(X_test)[:,1])
print("accuarcy: %.2f%%" % (accuracy2*100.0),"auc_score: %.2f%%" % (auc_score2*100.0))

lgb.train(lgb_params,
                lgb_train,
                num_boost_round=10,
                init_model=clf2,
                fobj=loglikelihood, # 目标函数
                feval=binary_error, # 评价指标
                valid_sets=lgb_eval)

y_pred = clf3.predict(X_test,num_iteration=clf3.best_iteration)#结果是0-1之间的概率值，是一维数组
pred =[1 if x >0.8 else 0 for x in y_pred]
accuracy2 = accuracy_score(y_test,pred)
auc_score2=metrics.roc_auc_score(y_test,y_pred)
print("accuarcy: %.2f%%" % (accuracy2*100.0),"auc_score: %.2f%%" % (auc_score2*100.0))

[11]	valid_0's binary_logloss: 4.68218	valid_0's error: 0.196414
[12]	valid_0's binary_logloss: 4.51541	valid_0's error: 0.196414
[13]	valid_0's binary_logloss: 4.4647	valid_0's error: 0.19558
[14]	valid_0's binary_logloss: 4.5248	valid_0's error: 0.196414
[15]	valid_0's binary_logloss: 4.51904	valid_0's error: 0.196414
[16]	valid_0's binary_logloss: 4.52481	valid_0's error: 0.196414
[17]	valid_0's binary_logloss: 4.4928	valid_0's error: 0.196414
[18]	valid_0's binary_logloss: 4.43027	valid_0's error: 0.196414
[19]	valid_0's binary_logloss: 4.4285	valid_0's error: 0.196414
[20]	valid_0's binary_logloss: 4.42314	valid_0's error: 0.196831
accuarcy: 81.19% auc_score: 76.47%

九模型部署与加速

参考文档
python+Treelite：Sklearn树模型训练迁移到c、java部署

import treelite失败，无法导入，不知道为什么

gbm9 = lgb.LGBMClassifier()
gbm9.fit(X_train, y_train)

y_pred = gbm9.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
auc_score=metrics.roc_auc_score(y_test,gbm9.predict_proba(X_test)[:,1])#predict_proba输出正负样本概率值，取第二列为正样本概率值
print("accuarcy: %.2f%%" % (accuracy*100.0),"auc_score: %.2f%%" % (auc_score*100.0))

gbm9.booster_.save_model("model9.txt")#保存模型为txt格式

import treelite
import treelite.sklearn
model = treelite.sklearn.import_model(gbm9)#导入 scikit-learn 模型
y_pred = model.predict(X_test)
accuracy2 = accuracy_score(y_test,y_pred)
auc_score2=metrics.roc_auc_score(y_test,model.predict_proba(X_test)[:,1])
print("accuarcy: %.2f%%" % (accuracy2*100.0),"auc_score: %.2f%%" % (auc_score2*100.0))

你可能感兴趣的:(集成学习,数据挖掘,python)

量化交易系统中如何处理机器学习模型的训练和部署？ openwin_top 量化交易系统开发机器学习人工智能量化交易
microPythonPython最小内核源码解析NI-motion运动控制c语言示例代码解析python编程示例系列python编程示例系列二python的Web神器Streamlit如何应聘高薪职位量化交易系统中，机器学习模型的训练和部署需要遵循一套严密的流程，以确保模型的可靠性、性能和安全性。以下是详细描述以及相关的示例：1.数据收集和预处理数据收集在量化交易中，数据是最重要的资产。收集的数
Mac下载python并安装小小酥*
下载pythonPython官网：https://www.python.org/进入官网后点击download，选择MacOSX版本2.安装MAC系统一般都自带有Python2.x版本的环境，你也可以在链接https://www.python.org/downloads/mac-osx/上下载最新版安装。3.设置环境变量程序和可执行文件可以在许多目录，而这些路径很可能不在操作系统提供可执行文件的搜
Python使用minIO上传下载身似山河挺脊梁 python
前提VSCode+Python3.9minIO有Python的例子1.python生成临时文件2.写入一些数据3.上传到minIO4.获取分享出连接5.发出通知#创建一个客户端minioClient=Minio(endpoint='xx',access_key='xx',secret_key='xx',secure=False)#生成文件名current_datetime=datetime.dat
深入理解Python上下文管理器 ……-…… python 开发语言
1.什么是上下文管理器？2.with语句的魔法3.创建上下文管理器的两种方式3.1基于类的实现3.2使用contextlib模块4.异常处理1.什么是上下文管理器？上下文管理器（ContextManager）是Python中用于精确分配和释放资源的机制。它通过__enter__()和__exit__()两个魔术方法实现了上下文管理协议，确保即使在代码执行出错的情况下，资源也能被正确清理。#经典文件
【Appium】Appium征服安卓自动化：GitHub 10.5k+星开源神器，Python代码实战全解析！山河不见老 python 测试 appium android 自动化
Appium一、为什么开发者都在用Appium？二、环境搭建：5分钟极速配置2.1核心工具链2.2安卓设备连接三、脚本实战：从零编写自动化操作3.1示例1：自动登录微信并发送消息3.2示例2：动态滑动屏幕与数据抓取四、避坑指南4.1元素定位优化4.2稳定性增强4.3云真机集成五、生态扩展：超越安卓的自动化版图一、为什么开发者都在用Appium？万星认证：GitHub超10.5k+星标，活跃社区持续
基于Streamlit实现的音频处理示例大霸王龙音视频 ffmpeg
基于Streamlit实现的音频处理示例，包含录音、语音转文本、文件下载和进度显示功能，整合了多个技术方案：一、环境准备#安装依赖库pipinstallstreamlitstreamlit-webrtcaudio-recorder-streamlitopenai-whisperpython-dotx二、完整示例代码importstreamlitasstfromaudio_recorder_stre
npm错误 gyp错误 vs版本不对 msvs_version不兼容澎湖Java架构师前端 html npm node.js 前端
npm错误gyp错误vs版本不对msvs_version不兼容windowsSDK报错执行更新GYP语句第一种方案第二种方案执行更新GYP语句npminstall-gnode-gyp最新的GYP好像已经不支持Python2.7版本，npm会提示你更新都3.*.*版本安装Node.js的时候一定要勾选以下这个，会自动检测安装缺少的环境第一种方案管理员运行CMD（PowerShell也行）执行更新工具
深入了解 ArangoDB 的图数据库应用与 Python 实践 eahba 数据库 python 开发语言
在当前数据驱动的时代，对连接数据的高效处理和分析需求日益增长。ArangoDB作为一个可扩展的图数据库系统，能够加速从连接数据中获取价值。本文将介绍如何使用Python连接和操作ArangoDB，并展示如何结合图问答链来获取数据洞察。技术背景介绍ArangoDB是一个多模型数据库，支持文档、图和键值类型的数据存储。其强大的图形存储和查询能力使其成为处理复杂数据关系的理想选择。通过JSON支持和单一
不懂英语可以学编程吗?,不懂英文可以学编程吗 P5688346 人工智能
大家好，给大家分享一下英语不好能学python编程吗，很多人还不知道这一点。下面详细解释一下。现在让我们来看看！Sourcecodedownload:本文相关源码提到人工智能，就不得不提Python编程语言，大多数人觉得编程语言肯定会涉及到很多代码，满屏的英文字母，想想就头疼，觉得自己不会英语，肯定学不好Python，但是不会英语到底能不能够学习Python呢，下面小编给大家分析分析。其实各位想要
一、Python入门基础 MeyrlNotFound python 开发语言
1.Python简介与环境搭建•了解Python的历史、特点和应用领域Python的历史Python是一种高级编程语言，由GuidovanRossum于1989年发明。Python语言的设计目标是让代码易读、易写、易维护，从而提高开发效率和代码质量。自其诞生以来，Python已从一个简单的系统管理工具发展成为一种广泛应用于多个领域的编程语言。Python的特点1.简单易学：Python的语法简洁明
npm error gyp info 计算机辅助工程 npm 前端 node.js
在使用npm安装Node.js包时，可能会遇到各种错误，其中gyp错误是比较常见的一种。gyp是Node.js的一个工具，用于编译C++代码。这些错误通常发生在需要编译原生模块的npm包时。下面是一些常见的原因和解决方法：常见原因及解决方法Python未安装或版本不兼容：Node.js使用Python来运行gyp。确保你的系统上安装了Python，并且版本与node-gyp兼容。通常推荐使用Pyt
股票量化交易开发 Yfinance 数字化转型2025 python 开发语言
以下是一段基于Python的股票量化分析代码，包含数据获取、技术指标计算、策略回测和可视化功能：pythonimportyfinanceasyfimportpandasaspdimportnumpyasnpimportmatplotlib.pyplotaspltimportseabornassnsfrombacktestingimportBacktest,Strategyfrombacktesti
sqlmap笔记君如尘网络安全-渗透笔记笔记
1.运行环境sqlmap是用Python编写的，因此首先需要确保你的系统上安装了Python。sqlmap支持Python2.6、2.7和Python3.4及以上版本。2.常用命令通用格式：bythonsqlmap.py-r注入点地址--参数-rpost请求-uget请求--level=测试等级--risk=测试风险-v显示详细信息级别-p针对某个注入点注入-threads更改线程数，加速--ba
python环境部署工具 uv Honnnnnn uv
以原先使用的pipenv工具为例子，通过pipfile.lock生成requirements文件，再将requirements转成pyproject.toml文件，最后生成uv.lock基于当前虚拟环境导出requirements.txt--pipfreeze>requirements.txt（如果原先不是env而是基础的通过requirements.txt文件，省去转化requirements的
leetcode-hot100-python-专题三：滑动窗口 ༺ Dorothy ༻ leetcode hot100 leetcode python 算法
1、无重复字符的最长子串中等给定一个字符串s，请你找出其中不含有重复字符的最长子串的长度。示例1:输入:s=“abcabcbb”输出:3解释:因为无重复字符的最长子串是“abc”，所以其长度为3示例2:输入:s=“bbbbb”输出:1解释:因为无重复字符的最长子串是“b”，所以其长度为1。示例3:输入:s=“pwwkew”输出:3解释:因为无重复字符的最长子串是“wke”，所以其长度为3。请注意，
Python UV - 安装、升级、卸载云客Coder python uv 开发语言
文章目录安装检查升级设置自动补全卸载UV命令官方文档详见：https://docs.astral.sh/uv/getting-started/installation/安装pipinstalluv检查安装后可运行下面命令，查看是否安装成功uv--version%uv--versionuv0.6.3(a0b9f22a22025-02-24)升级uvselfupdate将重新运行安装程序并可能修改您的
使用Python构建去中心化预测市场：从概念到实现 Echo_Wish Python！实战！python 去中心化开发语言
使用Python构建去中心化预测市场：从概念到实现大家好，我是Echo_Wish。今天，我们将深入探讨一个前沿的区块链应用——去中心化预测市场，并学习如何使用Python来构建一个简易的预测市场平台。预测市场是基于市场参与者对未来事件的预测来产生结果的地方，通常被用来预测政治事件、金融市场走向、体育比赛结果等。传统的预测市场如Augur、Polymarket等，基于去中心化平台，利用区块链技术确保
Python自动登陆、登出南京理工大学NJUST校园网程序 JimesMz python 开发语言
本文程序针对南京理工大学NJUST和NJUST-FREE校园网开发，其他学校无法使用。文章目录开发目的使用说明参考资料开发目的今天突然想要用代码实现一下自动登陆校园网，上网搜寻了一下。知乎有一些教程，CSDN也有一些完整的代码，但是我跟随教程或者直接运行现有代码都没有能够成功登陆，且NJUST校园网付费，我想要一个“登出”功能，借助Kimi自己写了一下。本人技术不精，以实现功能为主。使用说明请确保
Python爬虫笔记一（来自MOOC） Requests库入门小灰不停前进 #Python python pycharm 爬虫
Python爬虫笔记一通用代码框架：importrequestsdefgetHTMLText(url):try:r=requests.get(url,timeput=30)r.raise_for_status()#如果状态不是200，引发HTTPError异常r.encoding=r.apparemt_encodingreturnr.textexcept:return"产生异常"if__name_
Python调用fofa API接口并写入csv文件中 YOHO !GIRL 网络测绘 python 网络安全
前言一.功能目的二.功能调研三.编写代码1.引入库2.读取数据3.写入csv文件中总结前言上一篇我们讲述了目前较为主流的几款网络探测系统，简单介绍了页面的使用方法。链接如下，点击跳转：网络空间测绘引擎集合：Zoomeye、fofa、360、shodan、censys、鹰图然而当我们需要针对单个引擎进行二次开发时，页面就不能满足我们的需求了，这就需要参考API文档进行简单的数据处理，接下来，给大家介
SenseVoice 部署记录安静六角开源软件
最近试用了SenseVoice（阿里团队开源的语音转文字）效果可以，可以本地部署，有webui界面，测试了万字以上的转换效果可以。首先部署好conda环境和cuda，这个可以查看他人的文章。步骤1.创建虚拟环境：condacreate-nmainenvpython=3.102.然后安装依赖condaactivatemainenvpipinstall-rC:\Users\xx\Documents\P
Python基于深度学习的动物图片识别技术的研究与实现 Java老徐 Python 毕业设计 python 深度学习开发语言深度学习的动物图片识别技术 Python动物图片识别技术
博主介绍：✌程序员徐师兄、7年大厂程序员经历。全网粉丝12w+、csdn博客专家、掘金/华为云/阿里云/InfoQ等平台优质作者、专注于Java技术领域和毕业项目实战✌文末获取源码联系精彩专栏推荐订阅不然下次找不到哟2022-2024年最全的计算机软件毕业设计选题大全：1000个热门选题推荐✅Java项目精品实战案例《100套》Java微信小程序项目实战《100套》感兴趣的可以先收藏起来，还有大家
Python实现微信自动发送消息热心市民小汪 python 微信开发语言
实现需求：Python定时发送微信消息importpyautoguiaspgimportpyperclipaspcfromapscheduler.schedulers.blockingimportBlockingScheduler"""实现定时自动发送消息"""#操作间隔为1秒pg.PAUSE=1name='Hello~'msg='是时候点餐啦！！'defmain():#打开微信pg.hotkey
震惊！ “深度学习”都在学习什么扉间798 深度学习学习人工智能
常见的机器学习分类算法俗话说三个臭皮匠胜过诸葛亮这里面集成学习就是将单一的算法弱弱结合算法融合用投票给特征值加权重AdaBoost集成学习算法通过迭代训练一系列弱分类器，给予分类错误样本更高权重，使得后续弱分类器更关注这些样本，然后将这些弱分类器线性组合成强分类器，提高整体分类性能。（一）投票机制投票是一种直观且常用的算法融合策略。在多分类问题中，假设有多个分类器对同一数据进行分类判断。每个分类器
程序代码篇---Pyqt的密码界面 Ronin-Lotus 程序代码篇上位机知识篇 pyqt 数据库 python ubuntu
文章目录前言一、代码二、代码解释2.1用户数据库定义2.2窗口初始化2.3认证逻辑2.5角色处理2.6错误处理优化2.7功能扩展说明2.7.1用户类型区分管理员普通用户其他用户2.7.2安全增强建议三、运行效果四、运行命令五、界面改进建议5.1密码显示5.2用户头像显示5.3输入框动画效果5.4加载进度显示5.5键盘快捷键前言本文简单介绍了在Ubuntu系统上使用Python的Pyqt创建密码登录
深度学习 | pytorch + torchvision + python 版本对应及环境安装 zfgfdgbhs 深度学习 python pytorch
目录一、版本对应二、安装命令（pip）1.版本（1）v2.5.1~v2.0.0（2）v1.13.1~v1.11.0（3）v1.10.1~v1.7.02.安装全过程（1）选择版本（2）安装结果参考文章一、版本对应下表来自pytorch的github官方文档：pytorch/vision:Datasets,TransformsandModelsspecifictoComputerVisionpytor
Python读取.nc文件的方法与技术详解傻啦嘿哟关于python那些事儿人工智能前端服务器
目录一、引言二、使用netCDF4库读取.nc文件安装netCDF4库导入netCDF4库打开.nc文件获取变量读取变量数据案例与代码三、使用xarray库读取.nc文件安装xarray库导入xarray库打开.nc文件访问变量数据案例与代码四、性能与优化分块读取使用Dask进行并行计算减少不必要的变量加载五、其他注意事项文件路径变量命名数据类型文件关闭六、总结一、引言.nc文件，即NetCDF（
Python画词云图，Python画圆形词云图，API详解请一直在路上 python 开发语言
在Python中，词云图的常用库是wordcloud。以下是核心API参数的详细讲解，以及一个完整的使用示例。一、参数类型默认值说明参数类型默认值说明widthint400词云图的宽度（像素）heightint200词云图的高度（像素）background_colorstr“black”背景颜色，可以是颜色名称（如“white”）或十六进制值（如“#FFFFFF”）colormapstr/matp
23、nc文件快速切片与索引爱转呼啦圈的小兔子气象数据处理与可视化 python 气象气象可视化气候变化
1前言在气象、海洋学和环境科学等领域，.nc（NetCDF）格式文件是存储和共享多维科学数据的常用格式。这些数据文件通常包含大量的经度、纬度、时间和垂直层次数据。在处理这些数据时，研究人员常常需要根据特定的地理和时间范围提取数据，以便进行深入分析。为此，我们开发了一个名为nc_slice的Python函数，用于从一个或多个.nc格式文件中高效地筛选和提取数据。nc_slice函数提供了一种简洁而灵
【最新】TensorFlow、cuDNN、CUDA三者之间的最新版本对应及下载地址江上_酒开发环境及工具配置 TensorFlow CUDA cuDNN
TensorFlow、cuDNN、CUDA对应关系官网查询地址CUDA下载地址cuDNN下载地址VersionPythonversionCompilerBuildtoolscuDNNCUDAtensorflow_gpu-2.9.03.7-3.10MSVC2019Bazel5.0.08.111.2tensorflow_gpu-2.8.03.7-3.10MSVC2019Bazel4.2.18.111.
Nginx负载均衡 510888780 nginx 应用服务器
Nginx负载均衡一些基础知识: nginx 的 upstream目前支持 4 种方式的分配 1)、轮询（默认）每个请求按时间顺序逐一分配到不同的后端服务器，如果后端服务器down掉，能自动剔除。 2)、weight 指定轮询几率，weight和访问比率成正比
RedHat 6.4 安装 rabbitmq bylijinnan erlang rabbitmq redhat
在 linux 下安装软件就是折腾，首先是测试机不能上外网要找运维开通，开通后发现测试机的 yum 不能使用于是又要配置 yum 源，最后安装 rabbitmq 时也尝试了两种方法最后才安装成功机器版本： [root@redhat1 rabbitmq]# lsb_release LSB Version: :base-4.0-amd64:base-4.0-noarch:core
FilenameUtils工具类 eksliang FilenameUtils common-io
转载请出自出处：http://eksliang.iteye.com/blog/2217081 一、概述这是一个Java操作文件的常用库，是Apache对java的IO包的封装，这里面有两个非常核心的类FilenameUtils跟FileUtils，其中FilenameUtils是对文件名操作的封装;FileUtils是文件封装，开发中对文件的操作，几乎都可以在这个框架里面找到。非常的好用。
xml文件解析SAX 不懂事的小屁孩 xml
xml文件解析:xml文件解析有四种方式， 1.DOM生成和解析XML文档(SAX是基于事件流的解析) 2.SAX生成和解析XML文档(基于XML文档树结构的解析) 3.DOM4J生成和解析XML文档 4.JDOM生成和解析XML 本文章用第一种方法进行解析，使用android常用的DefaultHandler import org.xml.sax.Attributes;
通过定时任务执行mysql的定期删除和新建分区，此处是按日分区酷的飞上天空 mysql
使用python脚本作为命令脚本，linux的定时任务来每天定时执行 #!/usr/bin/python # -*- coding: utf8 -*- import pymysql import datetime import calendar #要分区的表 table_name = 'my_table' #连接数据库的信息 host,user,passwd,db =
如何搭建数据湖架构？听听专家的意见蓝儿唯美架构
Edo Interactive在几年前遇到一个大问题：公司使用交易数据来帮助零售商和餐馆进行个性化促销，但其数据仓库没有足够时间去处理所有的信用卡和借记卡交易数据 “我们要花费27小时来处理每日的数据量，”Edo主管基础设施和信息系统的高级副总裁Tim Garnto说道：“所以在2013年，我们放弃了现有的基于PostgreSQL的关系型数据库系统，使用了Hadoop集群作为公司的数
spring学习——控制反转与依赖注入 a-john spring
控制反转（Inversion of Control，英文缩写为IoC）是一个重要的面向对象编程的法则来削减计算机程序的耦合问题，也是轻量级的Spring框架的核心。控制反转一般分为两种类型，依赖注入（Dependency Injection，简称DI）和依赖查找（Dependency Lookup）。依赖注入应用比较广泛。
用spool+unixshell生成文本文件的方法 aijuans xshell
例如我们把scott.dept表生成文本文件的语句写成dept.sql,内容如下: 　　set pages 50000; 　　set lines 200; 　　set trims on; 　　set heading off; 　　spool /oracle_backup/log/test/dept.lst; 　　select deptno||','||dname||','||loc
1、基础--名词解析(OOA/OOD/OOP) asia007 学习基础知识
OOA:Object-Oriented Analysis（面向对象分析方法）是在一个系统的开发过程中进行了系统业务调查以后，按照面向对象的思想来分析问题。OOA与结构化分析有较大的区别。OOA所强调的是在系统调查资料的基础上，针对OO方法所需要的素材进行的归类分析和整理，而不是对管理业务现状和方法的分析。　　OOA（面向对象的分析）模型由5个层次（主题层、对象类层、结构层、属性层和服务层）
浅谈java转成json编码格式技术百合不是茶 json编码 java转成json编码
json编码;是一个轻量级的数据存储和传输的语言在java中需要引入json相关的包,引包方式在工程的lib下就可以了 JSON与JAVA数据的转换（JSON 即 JavaScript Object Natation，它是一种轻量级的数据交换格式，非常适合于服务器与 JavaScript 之间的数据的交
web.xml之Spring配置(基于Spring+Struts+Ibatis) bijian1013 java web.xml SSI spring配置
指定Spring配置文件位置 <context-param> <param-name>contextConfigLocation</param-name> <param-value> /WEB-INF/spring-dao-bean.xml,/WEB-INF/spring-resources.xml, /WEB-INF/
Installing SonarQube（Fail to download libraries from server） sunjing Install Sonar
1. Download and unzip the SonarQube distribution 2. Starting the Web Server The default port is "9000" and the context path is "/". These values can be changed in &l
【MongoDB学习笔记十一】Mongo副本集基本的增删查 bit1129 mongodb
一、创建复本集假设mongod,mongo已经配置在系统路径变量上，启动三个命令行窗口，分别执行如下命令： mongod --port 27017 --dbpath data1 --replSet rs0 mongod --port 27018 --dbpath data2 --replSet rs0 mongod --port 27019 -
Anychart图表系列二之执行Flash和HTML5渲染白糖_ Flash
今天介绍Anychart的Flash和HTML5渲染功能 HTML5 Anychart从6.0第一个版本起，已经逐渐开始支持各种图的HTML5渲染效果了，也就是说即使你没有安装Flash插件，只要浏览器支持HTML5，也能看到Anychart的图形（不过这些是需要做一些配置的）。这里要提醒下大家，Anychart6.0版本对HTML5的支持还不算很成熟，目前还处于
Laravel版本更新异常4.2.8-> 4.2.9 Declaration of ... CompilerEngine ... should be compa bozch laravel
昨天在为了把laravel升级到最新的版本，突然之间就出现了如下错误： ErrorException thrown with message "Declaration of Illuminate\View\Engines\CompilerEngine::handleViewException() should be compatible with Illuminate\View\Eng
编程之美-NIM游戏分析-石头总数为奇数时如何保证先动手者必胜 bylijinnan 编程之美
import java.util.Arrays; import java.util.Random; public class Nim { /**编程之美 NIM游戏分析问题：有N块石头和两个玩家A和B，玩家A先将石头随机分成若干堆，然后按照BABA...的顺序不断轮流取石头，能将剩下的石头一次取光的玩家获胜，每次取石头时，每个玩家只能从若干堆石头中任选一堆，
lunce创建索引及简单查询 chengxuyuancsdn 查询创建索引 lunce
import java.io.File; import java.io.IOException; import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.standard.StandardAnalyzer; import org.apache.lucene.document.Docume
[IT与投资]坚持独立自主的研究核心技术 comsci it
和别人合作开发某项产品....如果互相之间的技术水平不同,那么这种合作很难进行,一般都会成为强者控制弱者的方法和手段..... 所以弱者,在遇到技术难题的时候,最好不要一开始就去寻求强者的帮助,因为在我们这颗星球上,生物都有一种控制其
flashback transaction闪回事务查询 daizj oracle sql 闪回事务
闪回事务查询有别于闪回查询的特点有以下3个：（1）其正常工作不但需要利用撤销数据，还需要事先启用最小补充日志。（2）返回的结果不是以前的“旧”数据，而是能够将当前数据修改为以前的样子的撤销SQL（Undo SQL）语句。（3）集中地在名为flashback_transaction_query表上查询，而不是在各个表上通过“as of”或“vers
Java I/O之FilenameFilter类列举出指定路径下某个扩展名的文件游其是你 FilenameFilter
这是一个FilenameFilter类用法的例子，实现的列举出“c:\\folder“路径下所有以“.jpg”扩展名的文件。 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
C语言学习五函数，函数的前置声明以及如何在软件开发中合理的设计函数来解决实际问题 dcj3sjt126com c
# include <stdio.h> int f(void) //括号中的void表示该函数不能接受数据，int表示返回的类型为int类型 { return 10; //向主调函数返回10 } void g(void) //函数名前面的void表示该函数没有返回值 { //return 10; //error 与第8行行首的void相矛盾 } in
今天在测试环境使用yum安装，遇到一个问题： Error: Cannot retrieve metalink for repository: epel. Pl dcj3sjt126com centos
今天在测试环境使用yum安装，遇到一个问题： Error: Cannot retrieve metalink for repository: epel. Please verify its path and try again 处理很简单，修改文件“/etc/yum.repos.d/epel.repo”，将baseurl的注释取消， mirrorlist注释掉。即可。 &n
单例模式 shuizhaosi888 单例模式
单例模式懒汉式 public class RunMain { /** * 私有构造 */ private RunMain() { } /** * 内部类，用于占位，只有 */ private static class SingletonRunMain { priv
Spring Security（09）——Filter 234390216 Spring Security
Filter 目录 1.1 Filter顺序 1.2 添加Filter到FilterChain 1.3 DelegatingFilterProxy 1.4 FilterChainProxy 1.5
公司项目NODEJS实践0.1 逐行分析JS源代码 mongodb nginx ubuntu nodejs
一、前言前端如何独立用nodeJs实现一个简单的注册、登录功能，是不是只用nodejs+sql就可以了？其实是可以实现，但离实际应用还有距离，那要怎么做才是实际可用的。网上有很多nod
java.lang.Math liuhaibo_ljf java Math lang
System.out.println(Math.PI); System.out.println(Math.abs(1.2)); System.out.println(Math.abs(1.2)); System.out.println(Math.abs(1)); System.out.println(Math.abs(111111111)); System.out.println(Mat
linux下时间同步 nonobaba ntp
今天在linux下做hbase集群的时候，发现hmaster启动成功了，但是用hbase命令进入shell的时候报了一个错误 PleaseHoldException: Master is initializing，查看了日志，大致意思是说master和slave时间不同步，没办法，只好找一种手动同步一下，后来发现一共部署了10来台机器，手动同步偏差又比较大，所以还是从网上找现成的解决方
ZooKeeper3.4.6的集群部署 roadrunners zookeeper 集群部署
ZooKeeper是Apache的一个开源项目，在分布式服务中应用比较广泛。它主要用来解决分布式应用中经常遇到的一些数据管理问题，如：统一命名服务、状态同步、集群管理、配置文件管理、同步锁、队列等。这里主要讲集群中ZooKeeper的部署。 1、准备工作我们准备3台机器做ZooKeeper集群，分别在3台机器上创建ZooKeeper需要的目录。数据存储目录
Java高效读取大文件 tomcat_oracle java
　　读取文件行的标准方式是在内存中读取，Guava 和Apache Commons IO都提供了如下所示快速读取文件行的方法：　　Files.readLines(new File(path), Charsets.UTF_8); 　　FileUtils.readLines(new File(path)); 　　这种方法带来的问题是文件的所有行都被存放在内存中，当文件足够大时很快就会导致
微信支付api返回的xml转换为Map的方法 xu3508620 xml map 微信api
举例如下： <xml> <return_code><![CDATA[SUCCESS]]></return_code> <return_msg><![CDATA[OK]]></return_msg> <appid><

lightGBM实战

文章目录

一、使用LGBMClassifier对iris进行训练

1.1 使用lgb.LGBMClassifier

1.1.2使用pickle进行保存模型，然后加载预测

1.1.3 使用txt和json保存模型并加载

1.2使用原生的API进行模型训练和预测

1.2.2 使用txt/json格式保存模型

1.2.3 使用pickle进行保存模型

三、任务3 分类、回归和排序任务

3.1使用 make_classification生成二分类数据进行训练

3.1.1 sklearn接口

3.1.2 原生train接口

3.2使用 make_classification生成多分类数据进行训练

3.2.1 sklearn接口

3.2.2 原生train接口

3.3使用 make_regression生成回归数据

3.3.1 sklearn接口

3.3.2 原生train接口

四、graphviz可视化

五、模型调参（网格、随机、贝叶斯）

5.1 加载数据集

5.2:步骤2 设置树模型深度分别为[3,5,6,9]，记录下验证集AUC精度。

5.3 步骤3：category变量设置为categorical_feature

5.4 步骤4：超参搜索

5.4.1 GridSearchCV

5.4.3 随机搜索

5.4.4 贝叶斯搜索

六、模型微调与参数衰减

6.2 学习率衰减

七、特征筛选方法

7.1 筛选最重要的3个特征

7.2 利用PermutationImportance排列特征重要性

7.3 Null Importances进行特征选择

7.3.1 主要思想：

7.3.2实现步骤

7.3.3 读取数据集，计算Real Targe和shuffle Target下的特征重要度

7.3.4计算Score

八、自定义损失函数和评测函数

九 模型部署与加速

九 模型部署与加速

你可能感兴趣的:(集成学习,数据挖掘,python)

九模型部署与加速

九模型部署与加速