DataCastle Employee Attrition Prediction: https://challenge.datacastle.cn/v3/cmptDetail.html?id=342
Given records of the factors that influence attrition and of whether each employee actually left, the task is to build a model that predicts which employees are likely to leave.
The evaluation metric is accuracy: the higher the accuracy, the better the model separates employees who left from those who stayed.
Reference code for the metric:
from sklearn.metrics import accuracy_score
y_true = [1, 0, 1, 0]
y_pred = [1, 1, 1, 0]
score = accuracy_score(y_true, y_pred)
print(score)  # 0.75: three of the four predictions are correct
Baidu Netdisk: https://pan.baidu.com/share/init?surl=UjkKggnWQMIBhrU1vPm1sw&pwd=99gu
The data covers the factors that influence attrition (salary, business travel, work-environment satisfaction, job involvement, overtime, promotion, salary-hike percentage, and so on) along with a record of whether each employee has left. It is split into training and test sets, saved in train.csv and test_noLabel.csv respectively. The fields are described below:
Age: employee age.
Label: whether the employee has left; 1 means the employee has left, 0 means the employee has not left. This is the prediction target.
BusinessTravel: business-travel frequency; Non-Travel means no travel, Travel_Rarely means occasional travel, Travel_Frequently means frequent travel.
Department: the employee's department; Sales, Research & Development, or Human Resources.
DistanceFromHome: distance between home and the company, from 1 (nearest) to 29 (farthest).
Education: education level, from 1 to 5, where 5 is the highest.
EducationField: field of study; Life Sciences, Medical, Marketing, Technical Degree, Human Resources, or Other.
EmployeeNumber: employee ID.
EnvironmentSatisfaction: satisfaction with the work environment, from 1 (lowest) to 4 (highest).
Gender: Male or Female.
JobInvolvement: job involvement, from 1 (lowest) to 4 (highest).
JobLevel: job level, from 1 (lowest) to 5 (highest).
JobRole: job role; Sales Executive, Research Scientist, Laboratory Technician, Manufacturing Director, Healthcare Representative, Manager, Sales Representative, Research Director, or Human Resources.
JobSatisfaction: job satisfaction, from 1 (lowest) to 4 (highest).
MaritalStatus: marital status; Single, Married, or Divorced.
MonthlyIncome: monthly income, ranging from 1009 to 19999.
NumCompaniesWorked: number of companies the employee has previously worked for.
Over18: whether the employee is over 18.
OverTime: whether the employee works overtime; Yes or No.
PercentSalaryHike: percentage salary increase.
PerformanceRating: performance rating.
RelationshipSatisfaction: relationship satisfaction, from 1 (lowest) to 4 (highest).
StandardHours: standard working hours.
StockOptionLevel: stock-option level.
TotalWorkingYears: total years of working experience.
TrainingTimesLastYear: training received last year, from 0 (no training) to 6 (most training).
WorkLifeBalance: work-life balance, from 1 (lowest) to 4 (highest).
YearsAtCompany: years at the current company.
YearsInCurrentRole: years in the current role.
YearsSinceLastPromotion: years since the last promotion.
YearsWithCurrManager: years working with the current manager.
The test data contains 350 records. Unlike the training data, it does not include the attrition label; the goal is to use a model built on the training data to predict whether each employee in the test set has left.
Note: the competition data is taken from sample data shared by the IBM Watson Analytics platform. Only a subset was selected, and some preprocessing was applied so that the data better fits the requirements of the logistic regression analysis competition.
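Before modeling, it helps to load both files and confirm the layout described above. A minimal sanity check (assuming the two CSVs sit under ./employee_leave/ with the row ID as the first column, as in the pipeline below):

import pandas as pd

train = pd.read_csv("./employee_leave/train.csv", index_col=0)
test = pd.read_csv("./employee_leave/test_noLabel.csv", index_col=0)
print(train.shape, test.shape)  # the test set has 350 rows and no Label column
print(train.dtypes)             # the eight object-typed columns are the categorical features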
import warnings
warnings.filterwarnings("ignore")
import random
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from category_encoders import OrdinalEncoder, BinaryEncoder
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import StackingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
seed = 1
random.seed(seed)
np.random.seed(seed)
# Load the training and test data
df = pd.read_csv("./employee_leave/train.csv", index_col=0)
X_test = pd.read_csv("./employee_leave/test_noLabel.csv", index_col=0)
X = df.drop(columns="Label")
y = df["Label"].copy()
# Check for missing values in train and test
print(df.isnull().sum().sum(), X_test.isnull().sum().sum())
# Output: 0 0
# Check class balance: the labels are imbalanced at roughly 5:1, which motivates the SMOTE oversampling later
y.value_counts()
# Output:
# Label
# 0    922
# 1    178
# Name: count, dtype: int64
# Categorical feature columns
cat_cols = set(X.select_dtypes(include="object").columns)
print(cat_cols)
# Output: {'BusinessTravel', 'MaritalStatus', 'OverTime', 'JobRole', 'Over18', 'Department', 'EducationField', 'Gender'}
# Ordinal-encode BusinessTravel, whose categories have a natural order (travel frequency)
encoder_map = {"Non-Travel": 0, "Travel_Rarely": 1, "Travel_Frequently": 2}
X["BusinessTravel"] = X["BusinessTravel"].map(encoder_map)
X_test["BusinessTravel"] = X_test["BusinessTravel"].map(encoder_map)
# Ordinal-encode Gender, OverTime, and Over18; OrdinalEncoder assigns integer codes in order of appearance, which is harmless for binary columns
ord_cols = {"Gender", "OverTime", "Over18"}
encoder = OrdinalEncoder(cols=ord_cols)
X = encoder.fit_transform(X)
X_test = encoder.transform(X_test)
# Binary-encode the remaining categorical features (multi-valued, with no natural order)
bin_cols = cat_cols - ord_cols - {"BusinessTravel"}
encoder = BinaryEncoder(cols=bin_cols)
X = encoder.fit_transform(X)
X_test = encoder.transform(X_test)
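BinaryEncoder first assigns each category an integer ID and then spreads the ID's binary digits across new 0/1 columns, so a feature with k categories needs only about log2(k) columns instead of the k columns one-hot encoding would produce. A standalone toy demo of the idea (reusing the imports above; not part of the pipeline):

demo = pd.DataFrame({"JobRole": ["Manager", "Sales Executive", "Research Scientist", "Manager"]})
print(BinaryEncoder(cols=["JobRole"]).fit_transform(demo))
# Three distinct roles get IDs 1-3, which fit into two 0/1 columns (JobRole_0, JobRole_1)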
# Feature derivation (placeholder; no derived features are added in this solution)
pass
# Standardize features to zero mean and unit variance (the PCA below assumes comparable scales)
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), index=X.index)
X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index)
# Dimensionality reduction: keep enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X = pd.DataFrame(pca.fit_transform(X), index=X.index)
X_test = pd.DataFrame(pca.transform(X_test), index=X_test.index)
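With a float n_components, PCA keeps the smallest number of leading components whose cumulative explained variance reaches that fraction (95% here). The fitted object records what was kept:

print(pca.n_components_)                    # number of components retained
print(pca.explained_variance_ratio_.sum())  # cumulative explained variance, >= 0.95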
# Feature selection: drop near-constant columns
selector = VarianceThreshold(threshold=0.01).fit(X)
X = pd.DataFrame(selector.transform(X), index=X.index)
X_test = pd.DataFrame(selector.transform(X_test), index=X_test.index)
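After standardization and PCA, each column's variance is the corresponding PCA eigenvalue, so a 0.01 threshold only drops directions carrying almost no variance. The selector's mask shows how many columns survive:

print(selector.get_support().sum())  # columns kept (35 here, matching the shape printed below)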
# SMOTE oversampling on the training data to balance the two classes
sampler = SMOTE(random_state=seed)
X, y = sampler.fit_resample(X, y)
print(X.shape, y.shape)
# Output: (1844, 35) (1844,)
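SMOTE synthesizes new minority-class rows by interpolating between a minority sample and one of its nearest minority-class neighbors, so the resampled training set is exactly balanced: 922 + 922 = 1844 rows, matching the shape above. A quick check:

print(y.value_counts())  # both classes now have 922 samples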
# XGBoost: 5-fold grid search (the grid holds a single, previously tuned value per parameter)
xgb = XGBClassifier(random_state=seed)
param_grid = {
    "n_estimators": [300],
    "max_depth": [4],
    "min_child_weight": [1],
    "gamma": [0.1],
    "subsample": [0.8],
    "colsample_bytree": [0.2],
    "reg_alpha": [0],
    "reg_lambda": [0],
    "learning_rate": [0.1],
}
grid_search = GridSearchCV(xgb, param_grid, cv=5, scoring="accuracy", n_jobs=-1, verbose=1)
grid_search.fit(X, y)
xgb_params = grid_search.best_params_
print(xgb_params)
print(grid_search.best_score_)
# Output:
# Fitting 5 folds for each of 1 candidates, totalling 5 fits
# {'colsample_bytree': 0.2, 'gamma': 0.1, 'learning_rate': 0.1, 'max_depth': 4, 'min_child_weight': 1, 'n_estimators': 300, 'reg_alpha': 0, 'reg_lambda': 0, 'subsample': 0.8}
# 0.8932102038411689
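Each param_grid in this post holds a single value per key (the endpoint of earlier tuning), so every GridSearchCV run fits exactly one candidate, as the "Fitting 5 folds for each of 1 candidates" message shows. To rerun the search itself, list several values per key; a sketch with illustrative (untuned) values, not fitted here:

param_grid_wide = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 4, 6],
    "learning_rate": [0.05, 0.1],
    "subsample": [0.6, 0.8, 1.0],
}  # pass this to GridSearchCV instead of param_grid to search 54 candidates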
# LightGBM: same single-candidate grid search
lgb = LGBMClassifier(random_state=seed, verbose=-1)
param_grid = {
    "n_estimators": [300],
    "max_depth": [9],
    "num_leaves": [52],
    "max_bin": [100],
    "min_data_in_leaf": [3],
    "feature_fraction": [0.7],
    "bagging_fraction": [0.8],
    "bagging_freq": [2],
    "lambda_l1": [0],
    "lambda_l2": [0],
    "min_split_gain": [0],
    "learning_rate": [0.1],
}
grid_search = GridSearchCV(lgb, param_grid, cv=5, scoring="accuracy", n_jobs=-1, verbose=1)
grid_search.fit(X, y)
lgb_params = grid_search.best_params_
print(lgb_params)
print(grid_search.best_score_)
# Output:
# Fitting 5 folds for each of 1 candidates, totalling 5 fits
# {'bagging_fraction': 0.8, 'bagging_freq': 2, 'feature_fraction': 0.7, 'lambda_l1': 0, 'lambda_l2': 0, 'learning_rate': 0.1, 'max_bin': 100, 'max_depth': 9, 'min_data_in_leaf': 3, 'min_split_gain': 0, 'n_estimators': 300, 'num_leaves': 52}
# 0.9105573229645341
# CatBoost: same single-candidate grid search
cat = CatBoostClassifier(random_state=seed, verbose=0)
param_grid = {
    "n_estimators": [300],
    "depth": [9],
    "rsm": [0.8],
    "l2_leaf_reg": [1],
    "learning_rate": [0.1],
}
grid_search = GridSearchCV(cat, param_grid, cv=5, scoring="accuracy", n_jobs=-1, verbose=1)
grid_search.fit(X, y)
cat_params = grid_search.best_params_
print(cat_params)
print(grid_search.best_score_)
# Output:
# Fitting 5 folds for each of 1 candidates, totalling 5 fits
# {'depth': 9, 'l2_leaf_reg': 1, 'learning_rate': 0.1, 'n_estimators': 300, 'rsm': 0.8}
# 0.9219350182632262
# Stack the three tuned models; each base model is trained on 5-fold out-of-fold splits
stacking = StackingClassifier(
    estimators=[
        ("xgb", XGBClassifier(random_state=seed, **xgb_params)),
        ("lgb", LGBMClassifier(random_state=seed, verbose=-1, **lgb_params)),
        ("cat", CatBoostClassifier(random_state=seed, verbose=0, **cat_params)),
    ],
    cv=5,
    n_jobs=-1,
)
stacking.fit(X, y)
y_pred = pd.Series(stacking.predict(X_test), index=X_test.index, name="Label")
prob = pd.DataFrame(stacking.predict_proba(X_test), index=X_test.index, columns=stacking.classes_)
prob.head()
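StackingClassifier feeds the base models' out-of-fold predictions to a meta-learner; when final_estimator is omitted, as above, scikit-learn defaults to LogisticRegression. Written out explicitly (equivalent to the model above):

from sklearn.linear_model import LogisticRegression

stacking_explicit = StackingClassifier(
    estimators=[
        ("xgb", XGBClassifier(random_state=seed, **xgb_params)),
        ("lgb", LGBMClassifier(random_state=seed, verbose=-1, **lgb_params)),
        ("cat", CatBoostClassifier(random_state=seed, verbose=0, **cat_params)),
    ],
    final_estimator=LogisticRegression(),  # the default meta-learner, made explicit
    cv=5,
    n_jobs=-1,
)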
# Pseudo-labeling: add confidently predicted test rows to the training set
idx_pseudo = prob[prob.max(axis=1) > 0.9].index
X_pseudo = X_test.loc[idx_pseudo]
y_pseudo = y_pred[idx_pseudo]
X_new = pd.concat([X, X_pseudo])
y_new = pd.concat([y, y_pseudo])
print(X_new.shape, y_new.shape)
# Output: (2123, 35) (2123,)
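The 0.9 probability cutoff keeps only test rows the ensemble is confident about: 2123 - 1844 = 279 of the 350 test rows qualify. A higher cutoff adds fewer but cleaner pseudo-labels; a lower one adds more rows at the risk of reinforcing the model's own mistakes.

print(len(idx_pseudo))  # 279 test rows passed the 0.9 confidence threshold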
# Retrain the stack on the augmented data and write the submission
stacking.fit(X_new, y_new)
y_pred = pd.Series(stacking.predict(X_test), index=X_test.index, name="Label")
submission = y_pred.to_frame()
submission.to_csv("submission.csv")
submission.head()
Online leaderboard score: 0.8714