Task 5: Classifying the data with LightGBM and evaluating the result

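1. Classify the features produced in the previous steps. The model is LightGBM, used through its scikit-learn-compatible interface (LGBMClassifier). It is evaluated with the weighted F1 score, and its hyperparameters are tuned with grid search.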
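To make the grid-search step concrete, here is a minimal sketch of what such a search enumerates. It uses only the standard library, the grid values mirror the ones in the training code of this task, and the helper name expand_grid is hypothetical (it is not part of scikit-learn).

```python
from itertools import product

# Same grid values as in the training code of this task
param_grid = {
    'learning_rate': [0.01, 0.1, 0.5],
    'n_estimators': [30, 40],
}

def expand_grid(grid):
    """Hypothetical helper: list every parameter combination in the grid."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

candidates = expand_grid(param_grid)
print(len(candidates))  # 3 learning rates x 2 tree counts = 6 candidates
for params in candidates:
    print(params)
```

GridSearchCV cross-validates each of these candidates (5-fold by default in recent scikit-learn versions), so even this small grid should mean roughly 30 model fits plus one final refit on the best combination.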

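2. LightGBM was introduced in a paper published at NIPS 2017 (the conference now known as NeurIPS). The paper's central claim is that LightGBM trains more efficiently than XGBoost, mainly through two techniques it proposes: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).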

import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
import lightgbm as lgb
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import TfidfVectorizer

"""Load the data"""
print("data read begin...")
train_data = pd.read_csv('./new_data/train_set.csv')
test_data = pd.read_csv('./new_data/test_set.csv')
# Drop the columns that are not used as features below
train_data.drop(columns=['article', 'id'], inplace=True)
test_data.drop(columns=['article'], inplace=True)
print("data read end...")
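train_data.head()

"""Extract features and split the dataset"""
tfidf = TfidfVectorizer()
x_all = tfidf.fit_transform(train_data['word_seg'])
# Labels: assumed to be in the 'class' column; adjust if your CSV names it differently
y_all = train_data['class']

# Hold out 25% of the labelled data for validation
x_train, x_test, y_train, y_test = train_test_split(x_all, y_all, test_size=0.25, random_state=33)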
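As a small sanity check of the step above, the following sketch runs the same two calls on made-up data. The toy documents and labels are invented for illustration: they only imitate strings of space-separated token IDs and are not taken from the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Invented toy corpus: each document is a string of space-separated token IDs
docs = [
    "101 202 303", "202 303 404", "101 404 505", "303 505 606",
    "101 202 606", "404 505 606", "101 303 505", "202 404 606",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)  # sparse matrix: one row per document, one column per token
print(X.shape)                 # 8 documents, 6 distinct tokens

# Same split settings as in the task: 25% of the rows are held out
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25, random_state=33)
print(X_tr.shape, X_te.shape)  # 6 training rows and 2 held-out rows
```

The result of fit_transform is a SciPy sparse matrix, which train_test_split can slice directly, so there should be no need to convert it to a dense array first.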

"""Train and validate"""
model = lgb.LGBMClassifier()
param_grid = {
    'learning_rate': [0.01, 0.1, 0.5],
    'n_estimators': [30, 40]
}
# Cross-validate every combination in param_grid, then refit on the best one
grid_search = GridSearchCV(model, param_grid)
grid_search.fit(x_train, y_train)
y_lgb = grid_search.predict(x_test)
# Weighted F1: per-class F1 scores averaged by class support
print(f1_score(y_test, y_lgb, average='weighted'))
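The score printed above is a weighted F1. To show what that average means, here is a minimal standard-library sketch that computes it by hand on a made-up example. The function name weighted_f1 and the toy labels are assumptions for illustration; in practice sklearn.metrics.f1_score with average='weighted' is the call to use.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's count in y_true."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        total += n * f1
    return total / len(y_true)

# Invented labels: classes 0 and 1 each score F1 = 0.8, class 2 scores 1.0
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))  # 0.8333
```

Because each class is weighted by its number of true samples, frequent classes dominate the score, which is worth keeping in mind if the class distribution is imbalanced.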

The final result is:
