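In recent years, China's economy has grown rapidly. As the pressure of living in first-tier cities keeps rising, more and more young people are choosing to return to their hometowns to build their careers. China Unicom's big data platform now comprises more than 3,000 tags across nine major categories, has a daily data collection and processing capacity at the hundred-billion scale, and has PB-scale storage that can provide full-sample data for 400 million users. China Unicom has also kept researching innovative big data applications and service models, contributing considerable experience and guidance to the data industry, and its big data business has ranked first among telecom operators for many consecutive years.
Building on these capabilities, the task is to model China Unicom's signaling data, call data, internet behavior data, and similar sources in order to predict whether an individual will return to their hometown to work.
The solution relies on two tools: StratifiedKFold (from scikit-learn) and LightGBM.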
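StratifiedKFold (used to split the dataset): this differs from the KFold we know from our machine learning course. KFold simply cuts the dataset into n_splits folds without looking at the labels, whereas StratifiedKFold splits it so that the class distribution in each training and validation set stays as close as possible to that of the original dataset. It takes much the same parameters as KFold: n_splits sets the number of folds, shuffle controls whether the data is shuffled before splitting, and random_state fixes the random seed used for that shuffling.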
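To make the difference concrete, here is a minimal sketch on made-up labels (90 negatives followed by 10 positives; the data and the helper name positive_rates are illustrative assumptions, not part of the competition data). It compares the share of positives in each validation fold for the two splitters.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels: 90 negatives followed by 10 positives (10% positive rate).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # the feature values do not matter for the split itself


def positive_rates(splitter):
    """Return the share of positive labels in each validation fold."""
    return [float(y[val_idx].mean()) for _, val_idx in splitter.split(X, y)]


kfold_rates = positive_rates(KFold(n_splits=5))
stratified_rates = positive_rates(StratifiedKFold(n_splits=5))

print("KFold:          ", kfold_rates)
print("StratifiedKFold:", stratified_rates)
```

Without shuffling, KFold puts every positive into the last fold, while StratifiedKFold keeps roughly 10% positives in each fold, which is what makes per-fold AUC values comparable.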
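LightGBM (used for model training): GBDT (Gradient Boosting Decision Tree) is an enduringly popular machine learning model. Its main idea is to train weak learners (decision trees) iteratively, with each new tree correcting the errors of the trees before it, to arrive at a strong model. Its advantages include good training results and relative resistance to overfitting. LightGBM (Light Gradient Boosting Machine) is a framework that implements the GBDT algorithm. It supports efficient parallel training and offers faster training, lower memory usage, better accuracy, and distributed training that can process massive datasets quickly.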
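The idea of iteratively training weak learners can be sketched in a few lines of NumPy. This is not how LightGBM is implemented: it is a toy version that assumes squared loss, a single feature, and one-split trees (stumps), and every function name in it is made up for illustration.

```python
import numpy as np


def fit_stump(x, residual):
    """Find the threshold on x whose two-sided means best fit the residual."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the largest value so both sides are non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = y.mean()
    prediction = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residual = y - prediction  # negative gradient of the squared loss
        stump = fit_stump(x, residual)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, prediction


rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.0, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

base, stumps, fitted = boost(x, y)
mse_start = float(np.mean((y - base) ** 2))
mse_end = float(np.mean((y - fitted) ** 2))
print(f"MSE with a constant prediction: {mse_start:.4f}")
print(f"MSE after {len(stumps)} boosting rounds: {mse_end:.4f}")
```

Each round adds a small correction fitted to what the ensemble still gets wrong, so the training error shrinks as rounds accumulate. LightGBM applies the same principle with full decision trees, other loss functions such as the binary objective used here, and many engineering optimizations.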
1. Importing the libraries
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
- LightGBM: used for model training
- StratifiedKFold: used for splitting the dataset
2. Reading the data
train = pd.read_csv('E:/ccf返乡发展人员预测/dataTrain.csv')
test = pd.read_csv('E:/ccf返乡发展人员预测/dataA.csv')
train['f3'] = train['f3'].map({'low': 1, 'mid': 2, 'high': 3})
test['f3'] = test['f3'].map({'low': 1, 'mid': 2, 'high': 3})
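The f3 mapping above turns an ordinal text column into integers. The sketch below uses a made-up sample (only the column name f3 and its three levels come from the text) to show how Series.map treats a value that is missing from the mapping: it becomes NaN instead of raising an error, so counting unmapped values after the conversion is a cheap sanity check.

```python
import pandas as pd

# Hypothetical sample; 'unknown' is deliberately absent from the mapping.
sample = pd.DataFrame({'f3': ['low', 'high', 'mid', 'unknown']})
level_order = {'low': 1, 'mid': 2, 'high': 3}

encoded = sample['f3'].map(level_order)
print(encoded.tolist())

# Values missing from the mapping become NaN, so it is worth counting them.
n_unmapped = int(encoded.isna().sum())
print("unmapped values:", n_unmapped)
```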
3. Building additional features
loc_f = ['f1', 'f2', 'f4', 'f5', 'f6']
# Combine every pair of columns; the bitwise operators (&, |, ^) require integer-typed columns.
for i in range(len(loc_f)):
    for j in range(i + 1, len(loc_f)):
        train[f'{loc_f[i]}+{loc_f[j]}'] = train[loc_f[i]] + train[loc_f[j]]
        train[f'{loc_f[i]}-{loc_f[j]}'] = train[loc_f[i]] - train[loc_f[j]]
        train[f'{loc_f[i]}*{loc_f[j]}'] = train[loc_f[i]] * train[loc_f[j]]
        train[f'{loc_f[i]}/{loc_f[j]}'] = train[loc_f[i]] / train[loc_f[j]]
        train[f'{loc_f[i]}&{loc_f[j]}'] = train[loc_f[i]] & train[loc_f[j]]
        train[f'{loc_f[i]}|{loc_f[j]}'] = train[loc_f[i]] | train[loc_f[j]]
        train[f'{loc_f[i]}^{loc_f[j]}'] = train[loc_f[i]] ^ train[loc_f[j]]
        test[f'{loc_f[i]}+{loc_f[j]}'] = test[loc_f[i]] + test[loc_f[j]]
        test[f'{loc_f[i]}-{loc_f[j]}'] = test[loc_f[i]] - test[loc_f[j]]
        test[f'{loc_f[i]}*{loc_f[j]}'] = test[loc_f[i]] * test[loc_f[j]]
        test[f'{loc_f[i]}/{loc_f[j]}'] = test[loc_f[i]] / test[loc_f[j]]
        test[f'{loc_f[i]}&{loc_f[j]}'] = test[loc_f[i]] & test[loc_f[j]]
        test[f'{loc_f[i]}|{loc_f[j]}'] = test[loc_f[i]] | test[loc_f[j]]
        test[f'{loc_f[i]}^{loc_f[j]}'] = test[loc_f[i]] ^ test[loc_f[j]]
com_f = ['f43', 'f44', 'f45', 'f46']
for i in range(len(com_f)):
    for j in range(i + 1, len(com_f)):
        train[f'{com_f[i]}+{com_f[j]}'] = train[com_f[i]] + train[com_f[j]]
        train[f'{com_f[i]}-{com_f[j]}'] = train[com_f[i]] - train[com_f[j]]
        train[f'{com_f[i]}*{com_f[j]}'] = train[com_f[i]] * train[com_f[j]]
        train[f'{com_f[i]}/{com_f[j]}'] = train[com_f[i]] / train[com_f[j]]
        train[f'{com_f[i]}&{com_f[j]}'] = train[com_f[i]] & train[com_f[j]]
        train[f'{com_f[i]}|{com_f[j]}'] = train[com_f[i]] | train[com_f[j]]
        train[f'{com_f[i]}^{com_f[j]}'] = train[com_f[i]] ^ train[com_f[j]]
        test[f'{com_f[i]}+{com_f[j]}'] = test[com_f[i]] + test[com_f[j]]
        test[f'{com_f[i]}-{com_f[j]}'] = test[com_f[i]] - test[com_f[j]]
        test[f'{com_f[i]}*{com_f[j]}'] = test[com_f[i]] * test[com_f[j]]
        test[f'{com_f[i]}/{com_f[j]}'] = test[com_f[i]] / test[com_f[j]]
        test[f'{com_f[i]}&{com_f[j]}'] = test[com_f[i]] & test[com_f[j]]
        test[f'{com_f[i]}|{com_f[j]}'] = test[com_f[i]] | test[com_f[j]]
        test[f'{com_f[i]}^{com_f[j]}'] = test[com_f[i]] ^ test[com_f[j]]
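Optimization 1: combining features pairwise with arithmetic and bitwise operators produces several new groups of features, which can improve the accuracy of the analysis.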
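The same pairwise construction can be written once and applied to both data frames. The helper below is a sketch rather than the author's code: the name add_pairwise_features and the demo values are assumptions, and unlike the loops above it replaces infinite division results with NaN, which LightGBM treats as missing by default.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def add_pairwise_features(df, columns):
    """Return a copy of df with +, -, *, /, &, |, ^ features for every column pair."""
    new_cols = {}
    for a, b in combinations(columns, 2):
        new_cols[f'{a}+{b}'] = df[a] + df[b]
        new_cols[f'{a}-{b}'] = df[a] - df[b]
        new_cols[f'{a}*{b}'] = df[a] * df[b]
        # Division by zero yields inf; replace it with NaN.
        new_cols[f'{a}/{b}'] = (df[a] / df[b]).replace([np.inf, -np.inf], np.nan)
        # Bitwise operators only work on integer (or boolean) columns.
        new_cols[f'{a}&{b}'] = df[a] & df[b]
        new_cols[f'{a}|{b}'] = df[a] | df[b]
        new_cols[f'{a}^{b}'] = df[a] ^ df[b]
    # A single concat avoids the fragmentation caused by many column inserts.
    return pd.concat([df, pd.DataFrame(new_cols, index=df.index)], axis=1)


demo = pd.DataFrame({'f1': [6, 4], 'f2': [3, 0], 'f4': [5, 1]})
demo = add_pairwise_features(demo, ['f1', 'f2', 'f4'])
print(demo.shape)
print(demo[['f1/f2', 'f1&f2', 'f1^f4']])
```

With such a helper, the two feature groups would be built by calling it on train and test with loc_f and then com_f, which guarantees that both frames receive identical columns in the same order.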
4. Removing noisy data
train = train[:50000]
Optimization 2: testing showed that the last 10,000 rows of the training set are noisy, so they are dropped and only the first 50,000 rows are kept.
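The text does not say how the noisy rows were identified. One plausible check, sketched here on synthetic data with LogisticRegression standing in for LightGBM (both are assumptions), is to compute out-of-fold predictions for all rows and then measure AUC separately on each contiguous block of rows. A block whose AUC sits near 0.5 carries labels the model cannot explain.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
n_clean, n_noisy = 1000, 200

# Synthetic features; labels depend on the features only in the clean part.
X = rng.normal(size=(n_clean + n_noisy, 3))
signal = X[:, 0] + 0.5 * X[:, 1]
y = (signal + rng.normal(scale=0.5, size=len(X)) > 0).astype(int)
y[n_clean:] = rng.integers(0, 2, size=n_noisy)  # tail labels are pure noise

# Out-of-fold predictions over all rows.
oof = np.zeros(len(X))
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for trn_idx, val_idx in skf.split(X, y):
    model = LogisticRegression().fit(X[trn_idx], y[trn_idx])
    oof[val_idx] = model.predict_proba(X[val_idx])[:, 1]

# AUC per contiguous block of rows; the last block is the noisy tail.
block_size = 200
block_auc = [
    roc_auc_score(y[start:start + block_size], oof[start:start + block_size])
    for start in range(0, len(X), block_size)
]
for k, auc in enumerate(block_auc):
    print(f"rows {k * block_size}-{(k + 1) * block_size - 1}: AUC = {auc:.3f}")
```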
5. Training with StratifiedKFold
features = [i for i in train.columns if i not in ['label', 'id']]
y = train['label']
KF = StratifiedKFold(n_splits=5, random_state=4000, shuffle=True)
feat_imp_df = pd.DataFrame({'feat': features, 'imp': 0})
params = {
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'metric': 'auc',
    'n_jobs': 30,
    'learning_rate': 0.05,
    'num_leaves': 64,
    'max_depth': 8,
    'tree_learner': 'serial',
    'subsample_freq': 1,
    'subsample': 0.9,
    'max_bin': 255,
    'verbose': -1,
    'seed': 4000,
    'bagging_seed': 4000,
    'feature_fraction_seed': 4000,
}
# The original listed num_boost_round=3000 and early_stopping_rounds=300 in params
# while also passing 5000 and 50 to lgb.train(). LightGBM gives the params entries
# precedence, so those effective values are passed explicitly below.
num_round = 3000
oof_lgb = np.zeros(len(train))
predictions_lgb = np.zeros(len(test))
for fold_, (trn_idx, val_idx) in enumerate(KF.split(train.values, y.values)):
    trn_data = lgb.Dataset(train.iloc[trn_idx][features], label=y.iloc[trn_idx])
    val_data = lgb.Dataset(train.iloc[val_idx][features], label=y.iloc[val_idx])
    clf = lgb.train(
        params,
        trn_data,
        num_round,
        valid_sets=[trn_data, val_data],
        # Callbacks replace the verbose_eval / early_stopping_rounds arguments,
        # which were removed from lgb.train() in LightGBM 4.0.
        callbacks=[
            lgb.early_stopping(stopping_rounds=300),
            lgb.log_evaluation(period=100),
        ],
    )
    # Out-of-fold predictions for the validation rows of this fold
    oof_lgb[val_idx] = clf.predict(train.iloc[val_idx][features], num_iteration=clf.best_iteration)
    # Average the test predictions and feature importances over the folds
    predictions_lgb += clf.predict(test[features], num_iteration=clf.best_iteration) / KF.n_splits
    feat_imp_df['imp'] += clf.feature_importance() / KF.n_splits
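Compared with KFold, the advantage of training with StratifiedKFold is that the class distribution in each training and validation split stays as close as possible to that of the original dataset.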
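The training loop fills oof_lgb and feat_imp_df but never reads them. A common use, sketched below, is to score the out-of-fold predictions with AUC (the metric set in params) and to rank features by their averaged importance. To keep the sketch self-contained it runs on synthetic data and uses scikit-learn's GradientBoostingClassifier as a stand-in for LightGBM; the feature names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
features = ['a', 'b', 'noise']  # hypothetical feature names
train = pd.DataFrame(rng.normal(size=(400, 3)), columns=features)
score = train['a'] + 0.5 * train['b'] + rng.normal(scale=0.5, size=400)
y = (score > 0).astype(int)
test = pd.DataFrame(rng.normal(size=(100, 3)), columns=features)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=4000)
oof = np.zeros(len(train))
test_pred = np.zeros(len(test))
feat_imp = pd.DataFrame({'feat': features, 'imp': 0.0})

for trn_idx, val_idx in skf.split(train, y):
    model = GradientBoostingClassifier(n_estimators=50, random_state=0)
    model.fit(train.iloc[trn_idx], y.iloc[trn_idx])
    # Each row is predicted by the one model that never saw it.
    oof[val_idx] = model.predict_proba(train.iloc[val_idx])[:, 1]
    # Test predictions and importances are averaged over the folds.
    test_pred += model.predict_proba(test)[:, 1] / skf.n_splits
    feat_imp['imp'] += model.feature_importances_ / skf.n_splits

oof_auc = roc_auc_score(y, oof)
print(f"out-of-fold AUC: {oof_auc:.4f}")
print(feat_imp.sort_values('imp', ascending=False))
```

In the LightGBM loop the equivalent calls would be roc_auc_score(y, oof_lgb) and feat_imp_df.sort_values('imp', ascending=False). The out-of-fold AUC gives a local estimate of the score before submitting.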
6. Saving the results to a CSV file
test['label'] = predictions_lgb
test[['id', 'label']].to_csv('E:/ccf返乡发展人员预测/submission.csv', index=False)
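Before uploading, it can help to read the file back and confirm that it has the expected columns, row count, and value range. The sketch below does this with a temporary directory and made-up ids and predictions; the exact submission rules of the competition are not stated in the text, so the checks are assumptions.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real test ids and the averaged predictions.
test = pd.DataFrame({'id': [101, 102, 103]})
test['label'] = np.array([0.12, 0.87, 0.5])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'submission.csv')
    test[['id', 'label']].to_csv(path, index=False)

    # Read the file back while the temporary directory still exists.
    submission = pd.read_csv(path)

# Basic sanity checks on the written file.
assert list(submission.columns) == ['id', 'label']
assert len(submission) == len(test)
assert submission['label'].between(0, 1).all()
print(submission)
```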
7. Complete code
# Import the required libraries
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
# Read the data
train = pd.read_csv('E:/ccf返乡发展人员预测/dataTrain.csv')
test = pd.read_csv('E:/ccf返乡发展人员预测/dataA.csv')
train['f3'] = train['f3'].map({'low': 1, 'mid': 2, 'high': 3})
test['f3'] = test['f3'].map({'low': 1, 'mid': 2, 'high': 3})
# Build additional features (the bitwise operators &, |, ^ require integer-typed columns)
loc_f = ['f1', 'f2', 'f4', 'f5', 'f6']
for i in range(len(loc_f)):
    for j in range(i + 1, len(loc_f)):
        train[f'{loc_f[i]}+{loc_f[j]}'] = train[loc_f[i]] + train[loc_f[j]]
        train[f'{loc_f[i]}-{loc_f[j]}'] = train[loc_f[i]] - train[loc_f[j]]
        train[f'{loc_f[i]}*{loc_f[j]}'] = train[loc_f[i]] * train[loc_f[j]]
        train[f'{loc_f[i]}/{loc_f[j]}'] = train[loc_f[i]] / train[loc_f[j]]
        train[f'{loc_f[i]}&{loc_f[j]}'] = train[loc_f[i]] & train[loc_f[j]]
        train[f'{loc_f[i]}|{loc_f[j]}'] = train[loc_f[i]] | train[loc_f[j]]
        train[f'{loc_f[i]}^{loc_f[j]}'] = train[loc_f[i]] ^ train[loc_f[j]]
        test[f'{loc_f[i]}+{loc_f[j]}'] = test[loc_f[i]] + test[loc_f[j]]
        test[f'{loc_f[i]}-{loc_f[j]}'] = test[loc_f[i]] - test[loc_f[j]]
        test[f'{loc_f[i]}*{loc_f[j]}'] = test[loc_f[i]] * test[loc_f[j]]
        test[f'{loc_f[i]}/{loc_f[j]}'] = test[loc_f[i]] / test[loc_f[j]]
        test[f'{loc_f[i]}&{loc_f[j]}'] = test[loc_f[i]] & test[loc_f[j]]
        test[f'{loc_f[i]}|{loc_f[j]}'] = test[loc_f[i]] | test[loc_f[j]]
        test[f'{loc_f[i]}^{loc_f[j]}'] = test[loc_f[i]] ^ test[loc_f[j]]
com_f = ['f43', 'f44', 'f45', 'f46']
for i in range(len(com_f)):
    for j in range(i + 1, len(com_f)):
        train[f'{com_f[i]}+{com_f[j]}'] = train[com_f[i]] + train[com_f[j]]
        train[f'{com_f[i]}-{com_f[j]}'] = train[com_f[i]] - train[com_f[j]]
        train[f'{com_f[i]}*{com_f[j]}'] = train[com_f[i]] * train[com_f[j]]
        train[f'{com_f[i]}/{com_f[j]}'] = train[com_f[i]] / train[com_f[j]]
        train[f'{com_f[i]}&{com_f[j]}'] = train[com_f[i]] & train[com_f[j]]
        train[f'{com_f[i]}|{com_f[j]}'] = train[com_f[i]] | train[com_f[j]]
        train[f'{com_f[i]}^{com_f[j]}'] = train[com_f[i]] ^ train[com_f[j]]
        test[f'{com_f[i]}+{com_f[j]}'] = test[com_f[i]] + test[com_f[j]]
        test[f'{com_f[i]}-{com_f[j]}'] = test[com_f[i]] - test[com_f[j]]
        test[f'{com_f[i]}*{com_f[j]}'] = test[com_f[i]] * test[com_f[j]]
        test[f'{com_f[i]}/{com_f[j]}'] = test[com_f[i]] / test[com_f[j]]
        test[f'{com_f[i]}&{com_f[j]}'] = test[com_f[i]] & test[com_f[j]]
        test[f'{com_f[i]}|{com_f[j]}'] = test[com_f[i]] | test[com_f[j]]
        test[f'{com_f[i]}^{com_f[j]}'] = test[com_f[i]] ^ test[com_f[j]]
# Drop the noisy data and keep only the first 50,000 rows
train = train[:50000]
# Train with StratifiedKFold
features = [i for i in train.columns if i not in ['label', 'id']]
y = train['label']
KF = StratifiedKFold(n_splits=5, random_state=4000, shuffle=True)
feat_imp_df = pd.DataFrame({'feat': features, 'imp': 0})
params = {
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'metric': 'auc',
    'n_jobs': 30,
    'learning_rate': 0.05,
    'num_leaves': 64,
    'max_depth': 8,
    'tree_learner': 'serial',
    'subsample_freq': 1,
    'subsample': 0.9,
    'max_bin': 255,
    'verbose': -1,
    'seed': 4000,
    'bagging_seed': 4000,
    'feature_fraction_seed': 4000,
}
# The original listed num_boost_round=3000 and early_stopping_rounds=300 in params
# while also passing 5000 and 50 to lgb.train(). LightGBM gives the params entries
# precedence, so those effective values are passed explicitly below.
num_round = 3000
oof_lgb = np.zeros(len(train))
predictions_lgb = np.zeros(len(test))
for fold_, (trn_idx, val_idx) in enumerate(KF.split(train.values, y.values)):
    trn_data = lgb.Dataset(train.iloc[trn_idx][features], label=y.iloc[trn_idx])
    val_data = lgb.Dataset(train.iloc[val_idx][features], label=y.iloc[val_idx])
    clf = lgb.train(
        params,
        trn_data,
        num_round,
        valid_sets=[trn_data, val_data],
        # Callbacks replace the verbose_eval / early_stopping_rounds arguments,
        # which were removed from lgb.train() in LightGBM 4.0.
        callbacks=[
            lgb.early_stopping(stopping_rounds=300),
            lgb.log_evaluation(period=100),
        ],
    )
    # Out-of-fold predictions for the validation rows of this fold
    oof_lgb[val_idx] = clf.predict(train.iloc[val_idx][features], num_iteration=clf.best_iteration)
    # Average the test predictions and feature importances over the folds
    predictions_lgb += clf.predict(test[features], num_iteration=clf.best_iteration) / KF.n_splits
    feat_imp_df['imp'] += clf.feature_importance() / KF.n_splits
# Save the results to a CSV file
test['label'] = predictions_lgb
test[['id', 'label']].to_csv('E:/ccf返乡发展人员预测/submission.csv', index=False)