2.优秀员工离职原因分析与预测-R（一）

一、背景

1.公司大量优秀且有经验的员工过早的离开

2.数据来源：kaggle

3.变量

satisfaction: Employee satisfaction level
evaluation: Last evaluation
project: Number of projects
hours: Average monthly hours
years: Time spent at the company
accident: Whether they have had a work accident
promotion: Whether they have had a promotion in the last 5 years
sales: Department
salary: Salary
left: Whether the employee has left

4.分析目的与衡量标准：

（1）分析并得出优秀员工离职的主要可能的原因
（2）构建预测模型，预测下一位将会离开的优秀员工是谁

二、数据分析

所需包导入

library(readr)
library(dplyr)
library(ggplot2)
library(gmodels)

（一）导入数据并查看

## 1.1 数据导入

library(readr)
hr <- read_csv("HR_comma_sep.csv")
hr <- tbl_df(hr)
View(hr)
str(hr)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 14999 obs. of 10 variables:

$ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
$ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
$ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
$ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
$ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
$ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
$ left : int 1 1 1 1 1 1 1 1 1 1 ...
$ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
$ sales : chr "sales" "sales" "sales" "sales" ...
$ salary : chr "low" "medium" "medium" "low" ...

## 1.2 变量重命名

hr_good <- filter(hr, evaluation>=0.75 & year>=4 & project>= 4)

colnames(hr) <- c("satisfaction","evaluation","project","hours","years","accident","left","promotion","sales","salary")

## 1.3 因子化

hr$sales <- factor(hr$sales)
hr$salary <- factor(hr$salary, levels=c("low","medium","high"))

## 1.4 查看数据

sum(is.na(hr))

# [1] 0

summary(hr)

satisfaction_level last_evaluation number_project average_montly_hours
Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0
Median :0.6400 Median :0.7200 Median :4.000 Median :200.0
Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1
3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0

time_spend_company Work_accident left promotion_last_5years
Min. : 2.000 Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
Median : 3.000 Median :0.0000 Median :0.0000 Median :0.00000
Mean : 3.498 Mean :0.1446 Mean :0.2381 Mean :0.02127
3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
ax. :10.000 Max. :1.0000 Max. :1.0000 Max. :1.00000

sales salary
sales :4140 low :7316
technical :2720 medium:6446
support :2229 high :1237
IT :1227
product_mng: 902
marketing : 858
(Other) :2923

（二）根据定义选取优秀员工的子集，并做初步分析

优秀员工的定义：
（1）评价(evaluation)>=0.75
（2）项目数量(project)>=4
（3）有经验的(year)>=4

## 2.1 根据定义选取子集

hr_good <- filter(hr, evaluation>=0.75 & years>=4 & project>= 4)

## 2.2 对比总体与选取子集中离职员工的占比情况

【结论】：优秀员工离职情况非常严重
1.在总离职员工中，优秀员工的数量占了(1778/3571=) 50%；
2.在优秀员工子集中，离职的数量高达(1778/2753=) 64%；

CrossTable(hr$left)

Total Observations in Table: 14999

| stay | left |
|-----------|-----------|
| 11428 | 3571 |
| 0.762 | 0.238 |
|-----------|-----------|

CrossTable(hr_good$left)

Total Observations in Table: 2763
| stay | left |
|-----------|-----------|
| 985 | 1778 |
| 0.356 | 0.644 |
|-----------|-----------|

## 2.3 了解优秀员工子集的统计量

summary(hr_good)

## 2.4 了解优秀员工子集中各变量之间的相关性

【结论】：离职与满意度呈负相关，且相关度最高

hr_good_corr <- select(hr, -sales,-salary) %>% cor()
corrplot(hr_good_corr, method="circle", tl.col="black",title="离职与满意度呈负相关，且相关度最高",mar=c(1,1,3,1))

（三）逐个变量分析员工离职、满意度与其他变量之间的关系

## 3.1 查看满意度的分布图

hr_good$left <- factor(hr_good$left, levels=c(0,1), labels=c("stay", "left"))

ggplot(hr_good, aes(satisfaction, fill=left)) + geom_histogram(position="dodge") + scale_x_continuous(breaks=c(0.1,0.13,0.25,0.50,0.73,0.75,0.92,1.00)) + theme3 + theme(axis.text.x=element_text(angle=90)) + labs(title="满意度在[0.1,0.13]与[0.73,0.92]两个区间离职人数非常多")

## 3.2 收入、工作时间、满意度之间的关系

【结论】：工作超长时间的员工满意度低，且离职率高
1.低&中等薪资水平中较高工作时间的员工大量离职
2.超长工作时间的员工满意度都很低且几乎都已离职

ggplot(hr_good, aes(salary, hours, alpha=satisfaction, color=left)) + geom_jitter() + theme3 + labs(title=paste("低&中等薪资水平中较高工作时间的员工大量离职","\n","超长工作时间的员工满意度都很低且几乎都已离职"))

## 3.3 晋升、满意度与离职的关系

【结论】：高评价的员工几乎没有人晋升，离职人员也主要集中在未晋升中

ggplot(hr_good, aes(promotion, evaluation, color=left)) + geom_jitter() + theme3 + scale_x_discrete(limits=c(0,1)) + labs(title=paste("高评价的员工几乎没有人晋升","\n","离职人员也主要集中在未晋升中"))

## 3.4 工作年限、满意度与离职的关系

【结论】：
1.低满意度(0.1)水平下，4年司龄的员工大量离职
2.大量高满意度员工在第5年与第6年离职
3.7年以上的员工没有人离职

ggplot(hr_good, aes(years,satisfaction, color=left)) + geom_jitter() + scale_x_discrete(limits=c(4,5,6,7,8,9,10)) + theme3 + labs(title=paste("低满意度(","0.1)","水平下，4年司龄的员工大量离职","\n","大量高满意度员工在第5年与第6年离职"))

## 3.5 部门、项目数与离职的关系

【结论】：对于6个以上的项目无论在哪个部门离职率都非常高

ggplot(hr_good, aes(sales,fill=left)) + geom_bar(position="fill") + facet_wrap(~factor(project),ncol=1) + theme3 + theme(axis.text.x=element_text(angle=270)) + labs(y="number projcet",title="对于6个以上的项目无论在哪个部门离职率都非常高")

## 3.6 部门与离职的关系

【结论】：各部门离职人员均高于在职人员，管理部门除外

ggplot(hr_good, aes(sales, fill=left)) + geom_bar(position="dodge") + coord_flip() + scale_x_discrete(limits=c("management","RandD","hr","accounting","marketing","product_mng","IT","support","technical","sales")) + labs(title="各部门离职人员均高于在职人员，管理部门除外") + theme3

三、构建预测模型1：分类

## 数据分割

library(caret)
set.seed(0001)
train <- createDataPartition(hr_good$left, p=0.75, list=FALSE)
hr_good_train <- hr_good[train, ]
hr_good_test <- hr_good[-train, ]

（一）Logistic回归

## 1. 构建逻辑回归并验证

ctrl <- trainControl(method="cv",number=5)
logit <- train(left~., hr_good_train, method="LogitBoost", trControl=ctrl)
logit.pred <- predict(logit, hr_good_test, tyep="response")
confusionMatrix( hr_good_test$left, logit.pred)

Confusion Matrix and Statistics

Reference
Prediction stay left
stay 213 33
left 8 436

Accuracy : 0.9406
95% CI : (0.9203, 0.957)
No Information Rate : 0.6797
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8675
Mcnemar's Test P-Value : 0.0001781
Sensitivity : 0.9638
Specificity : 0.9296
Pos Pred Value : 0.8659
Neg Pred Value : 0.9820
Prevalence : 0.3203
Detection Rate : 0.3087
Detection Prevalence : 0.3565
Balanced Accuracy : 0.9467
'Positive' Class : stay

## 2. 评价模型，绘制ROC/AUC曲线

library(pROC)
roc(as.numeric(hr_good_test$left), as.numeric(logit.pred), plot=TRUE, print.thres=TRUE, print.auc=TRUE, col="black")

（二）决策树

## 1. 构建决策树

library(rpart)
dtree <- rpart(left~., hr_good_train, method="class",parms=list(split="information"))
dtree$cptable

CP nsplit rel error xerror xstd

1 0.55209743 0 1.00000000 1.00000000 0.02950911
2 0.12855210 1 0.44790257 0.44790257 0.02256805
3 0.08254398 3 0.19079838 0.19079838 0.01551204
4 0.02300406 4 0.10825440 0.10825440 0.01186737
5 0.01000000 5 0.08525034 0.08525034 0.01057607

plotcp(dtree)

dtree.pruned <- prune(dtree, cp=0.01)
library(partykit)
library(grid)
plot(as.party(dtree.pruned),main="Decision Tree")

dtree.pruned.pred <- predict(dtree.pruned, hr_good_test, type="class")
confusionMatrix(hr_good_test$left, dtree.pruned.pred)

Confusion Matrix and Statistics

Reference
Prediction stay left
stay 236 10
left 13 431

Accuracy : 0.9667
95% CI : (0.9504, 0.9788)
No Information Rate : 0.6391
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9275
Mcnemar's Test P-Value : 0.6767
Sensitivity : 0.9478
Specificity : 0.9773
Pos Pred Value : 0.9593
Neg Pred Value : 0.9707
Prevalence : 0.3609
Detection Rate : 0.3420
Detection Prevalence : 0.3565
Balanced Accuracy : 0.9626
'Positive' Class : stay

## 2. 评价模型，绘制ROC/AUC曲线

roc(as.numeric(hr_good_test$left),as.numeric(dtree.pruned.pred), plot=TRUE, print.thres=TRUE, print.auc=TRUE,col="blue")

（三）随机森林

## 1. 构建随机森林

library(randomForest)
set.seed(0002)
forest <- randomForest(left~., hr_good_train, importance=TRUE, na.action=na.roughfix)
forest

Call:

randomForest(formula = left ~ ., data = hr_good_train, importance = TRUE, na.action = na.roughfix)

Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3
OOB estimate of error rate: 1.35%
Confusion matrix:
stay left class.error
stay 731 8 0.01082544
left 20 1314 0.01499250

importance(forest, type=2)

MeanDecreaseGini

satisfaction 315.455428
evaluation 63.101527
project 51.659717
hours 307.201129
years 168.530576
accident 6.232136
promotion 1.897123
sales 20.666829
salary 12.338717

forest.pred <- predict(forest, hr_good_test)
confusionMatrix(hr_good_test$left, forest.pred)

Confusion Matrix and Statistics

Reference
Prediction stay left
stay 241 5
left 4 440

Accuracy : 0.987
95% CI : (0.9754, 0.994)
No Information Rate : 0.6449
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9715
Mcnemar's Test P-Value : 1
Sensitivity : 0.9837
Specificity : 0.9888
Pos Pred Value : 0.9797
Neg Pred Value : 0.9910
Prevalence : 0.3551
Detection Rate : 0.3493
Detection Prevalence : 0.3565
Balanced Accuracy : 0.9862
'Positive' Class : stay

## 2. 评价模型，绘制ROC/AUC曲线

roc(as.numeric(hr_good_test$left), as.numeric(forest.pred), plot=TRUE, print.thres=TRUE, print.auc=T, col="green")

（四）支持向量机SVM

## 1. 构建SVM

library(e1071)
set.seed(0003)
svm <- svm(left~., hr_good_train)
svm.pred <- predict(svm, na.omit(hr_good_test))
confusionMatrix(na.omit(hr_good_test)$left, svm.pred)

Confusion Matrix and Statistics

Reference

Prediction stay left
stay 211 35
left 11 433

Accuracy : 0.9333
95% CI : (0.9121, 0.9508)
No Information Rate : 0.6783
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8515
Mcnemar's Test P-Value : 0.000696
Sensitivity : 0.9505
Specificity : 0.9252
Pos Pred Value : 0.8577
Neg Pred Value : 0.9752
Prevalence : 0.3217
Detection Rate : 0.3058
Detection Prevalence : 0.3565
Balanced Accuracy : 0.9378
'Positive' Class : stay

## 2. 评价模型，绘制ROC/AUC曲线

roc(as.numeric(na.omit(hr_good_test)$left), as.numeric(svm.pred), plot=T, print.thres=T, print.auc=T, col="orange")

（五）对比模型，选择准确性最高的模型

【结论】：随机森林的拟合度最高，选择该模型为预测模型

roc(as.numeric(hr_good_test$left), as.numeric(logit.pred), plot=TRUE,col="black",main=paste("ROC曲线:","Logitis(black)","dtree(blue)","randomForest(green)","SVM(orange)",sep=" "))
roc(as.numeric(hr_good_test$left),as.numeric(dtree.pruned.pred), plot=TRUE, col="blue", add=T)
roc(as.numeric(hr_good_test$left), as.numeric(forest.pred), plot=TRUE, col="green", add=T)
roc(as.numeric(na.omit(hr_good_test)$left), as.numeric(svm.pred), plot=T, col="orange", add=T)

（六）模型应用

importance(forest,type=2)

MeanDecreaseGini

satisfaction 315.455428
evaluation 63.101527
project 51.659717
hours 307.201129
years 168.530576
accident 6.232136
promotion 1.897123
sales 20.666829
salary 12.338717

## 1. 剔除明显不重要的因子，重新构建模型

forest2 <- randomForest(left~.-promotion-accident-salary-sales, hr_good, na.action=na.roughfix, importance=TRUE)
importance(forest2, type=2)

MeanDecreaseGini

satisfaction 462.97925
evaluation 72.49430
project 71.09463
hours 417.63782
years 231.64080

forest2.pred <- predict(forest2, hr_good_test)
confusionMatrix(hr_good_test$left, forest2.pred, positive="left")

Confusion Matrix and Statistics

Reference

Prediction stay left
stay 244 2
left 1 443

Accuracy : 0.9957
95% CI : (0.9873, 0.9991)
No Information Rate : 0.6449
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9905
Mcnemar's Test P-Value : 1
Sensitivity : 0.9959
Specificity : 0.9955
Pos Pred Value : 0.9919
Neg Pred Value : 0.9977
Prevalence : 0.3551
Detection Rate : 0.3536
Detection Prevalence : 0.3565
Balanced Accuracy : 0.9957
'Positive' Class : stay

## 2. 评价模型，绘制ROC/AUC曲线

roc(as.numeric(hr_good_test$left), as.numeric(forest2.pred), plot=TRUE, print.thres=T, print.auc=T, main="Random Forest", col="green")

（七）结论

1.调整后的随机森林预测模型员工离职的准确性达99.5%；其中离职的员工被正确预测的概率为99.5%，被预测离职的员工中，实际离职的概率为99.8%；

2.剔除不重要的变量（promotion,accident,sales,salary）并不会对模型造成影响；

3.满意度（satisfaction）、月平均工作时间（hours）、工作年限（years）是影响优秀员工离职的主要三个变量

2.优秀员工离职原因分析与预测-R（一）

二、数据分析

（一）导入数据并查看

Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 14999 obs. of 10 variables:

（二）根据定义选取优秀员工的子集，并做初步分析

（三）逐个变量分析员工离职、满意度与其他变量之间的关系

三、构建预测模型1：分类

（一）Logistic回归

Confusion Matrix and Statistics

（二）决策树

CP nsplit rel error xerror xstd

（三）随机森林

MeanDecreaseGini

Confusion Matrix and Statistics

（四）支持向量机SVM

（五）对比模型，选择准确性最高的模型

（六）模型应用

MeanDecreaseGini

（七）结论

你可能感兴趣的:(2.优秀员工离职原因分析与预测-R（一）)