The author selected Girls Who Code to receive a donation as part of the Write for DOnations program.
Keras is a neural network API that is written in Python. It runs on top of TensorFlow, CNTK, or Theano. It is a high-level abstraction of these deep learning frameworks and therefore makes experimentation faster and easier. Keras is modular, which means implementation is seamless as developers can quickly extend models by adding modules.
TensorFlow is an open-source software library for machine learning. It works efficiently with computations involving arrays, so it’s a great choice for the model you’ll build in this tutorial. Furthermore, TensorFlow can execute code on either a CPU or a GPU, which is especially useful when you’re working with a massive dataset.
In this tutorial, you’ll build a deep learning model that predicts the probability of an employee leaving a company. Retaining the best employees is an important factor for most organizations. To build your model, you’ll use a dataset available on Kaggle, which has features that measure employee satisfaction in a company. To create this model, you’ll use the Keras Sequential model to stack the different layers.
Before you begin this tutorial you’ll need the following:
An Anaconda development environment on your machine.
A Jupyter Notebook installation. Anaconda will install Jupyter Notebook for you during its installation. You can also follow this tutorial for a guide on how to navigate and use Jupyter Notebook.
Familiarity with machine learning.
Data pre-processing is necessary to prepare your data in a form that a deep learning model can accept. If there are categorical variables in your data, you have to convert them to numbers because the algorithm only accepts numerical input. A categorical variable represents qualitative data whose values are names or labels, such as a department or a salary band. In this step, you’ll load your dataset using pandas, which is a data manipulation library for Python.
Before you begin data pre-processing, you’ll activate your environment and ensure you have all the necessary packages installed on your machine. It’s advantageous to use conda to install keras and tensorflow since it will handle the installation of any necessary dependencies for these packages and ensure they are compatible with keras and tensorflow. For this reason, the Anaconda Python distribution is a good choice for data science projects.
Move into the environment you created in the prerequisite tutorial:
conda activate my_env
Run the following command to install keras and tensorflow:
Now, open Jupyter Notebook to get started. Open it by typing the following command in your terminal:
Note: If you’re working from a remote server, you’ll need to use SSH tunneling to access your notebook. Please revisit step 2 of the prerequisite tutorial for detailed instructions on setting up SSH tunneling. You can use the following command from your local machine to initiate your SSH tunnel:
ssh -L 8888:localhost:8888 your_username@your_server_ip
After accessing Jupyter Notebook, click on the anaconda3 file, and then click New at the top of the screen, and select Python 3 to load a new notebook.
Now, you’ll import the required modules for the project and then load the dataset in a notebook cell. You’ll load the pandas module for manipulating your data and numpy for converting the data into numpy arrays. You’ll also convert all the columns that are in string format to numerical values for your computer to process.
Insert the following code into a notebook cell and then click Run:
import pandas as pd
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/mwitiderrick/kerasDO/master/HR_comma_sep.csv")
You’ve imported numpy and pandas. You then used pandas to load the dataset for the model.
You can get a glimpse of the dataset you’re working with by using head(). This is a useful method from pandas that allows you to view the first five records of your dataframe. Add the following code to a notebook cell and then run it:
df.head()
You’ll now proceed to convert the categorical columns to numbers. You do this by converting them to dummy variables. Dummy variables are usually ones and zeros that indicate the presence or absence of a categorical feature. In this kind of situation, you also avoid the dummy variable trap by dropping the first dummy.
Note: The dummy variable trap is a situation in which two or more variables are highly correlated, so that one of them can be predicted from the others. This redundancy can hurt your model. You therefore drop one dummy variable so that you always keep N-1 dummy variables for a feature with N categories. It doesn’t matter which dummy variable you drop, as long as N-1 remain. For example, suppose you have an on/off switch. When you create the dummy variables you get two columns: an on column and an off column. You can drop one of the columns because if the switch isn’t on, then it is off.
Insert this code in the next notebook cell and execute it:
feats = ['department','salary']
df_final = pd.get_dummies(df,columns=feats,drop_first=True)
feats = ['department','salary'] defines the two columns for which you want to create dummy variables. pd.get_dummies(df,columns=feats,drop_first=True) generates the numerical variables that your employee retention model requires. It does this by converting the feats that you define from categorical to numerical variables.
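To see the effect of drop_first=True in isolation, here is a minimal sketch on a toy frame. The frame toy and its values are made up for illustration and are not part of the HR dataset.

```python
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column.
toy = pd.DataFrame({
    "hours": [150, 200, 250],
    "salary": ["low", "medium", "high"],
})

# Without drop_first, each of the three categories gets its own column.
full = pd.get_dummies(toy, columns=["salary"])

# With drop_first, the first category in sorted order ("high") is dropped,
# leaving N-1 = 2 dummy columns; a row with no flag set means "high".
reduced = pd.get_dummies(toy, columns=["salary"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

The dropped category becomes the implicit baseline, which is why no information is lost.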
You’ve loaded the dataset and converted the salary and department columns into a format the keras deep learning model can accept. In the next step, you will split the dataset into a training and testing set.
You’ll use scikit-learn to split your dataset into a training and a testing set. This is necessary so you can use part of the employee data to train the model and a part of it to test its performance. Splitting a dataset in this way is a common practice when building deep learning models.
It is important to implement this split in the dataset so the model you build doesn’t have access to the testing data during the training process. This ensures that the model learns only from the training data, and you can then test its performance with the testing data. If you exposed your model to testing data during the training process then it would memorize the expected outcomes. Consequently, it would fail to give accurate predictions on data that it hasn’t seen.
You’ll start by importing the train_test_split function from the scikit-learn package. This is the function that provides the splitting functionality. Insert this code in the next notebook cell and run:
from sklearn.model_selection import train_test_split
With train_test_split imported, you’ll use the left column in your dataset as the target, since it records whether an employee left the company. Because it is the value you want to predict, it must not appear among the input features of your deep learning model. Insert the following into a cell to separate the left column from the features:
X = df_final.drop(['left'],axis=1).values
y = df_final['left'].values
Your deep learning model expects to get the data as arrays. Therefore you use the .values attribute of the pandas objects to convert the data to numpy arrays.
You’re now ready to split the dataset into a training and a testing set. You’ll use 70% of the data for training and 30% for testing. The training share is larger than the testing share because the model needs most of the data to learn from. If desired, you can also experiment with a ratio of 80% for the training set and 20% for the testing set.
Now add this code to the next cell and run to split your training and testing data to the specified ratio:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
You have now converted the data into the type that Keras expects (numpy arrays), and your data is split into a training and testing set. You’ll pass this data to the keras model later in the tutorial. Before that, you need to scale the data, which you’ll do in the next step.
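The split above is random, so the resulting sets differ from run to run. As a hedged sketch, the following shows how the random_state and stratify arguments of train_test_split make a split repeatable and keep class proportions similar in both parts. The arrays X_demo and y_demo are stand-ins invented for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 20 rows, 3 features, imbalanced binary labels.
X_demo = np.arange(60).reshape(20, 3)
y_demo = np.array([0] * 15 + [1] * 5)

# random_state makes the split repeatable; stratify keeps the class
# proportions of y_demo roughly equal in the two parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
```

Whether stratification is worthwhile here is a judgment call; it tends to matter more when one class is rare.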
When building deep learning models it is usually good practice to scale your dataset so that training is more efficient. In this step, you’ll scale the data using the scikit-learn StandardScaler, which transforms each feature to have a mean of 0 and a standard deviation of 1 (unit variance). Standardizing does not change the shape of a feature’s distribution; it only shifts and rescales it so that all features are on a comparable scale. This step is important because your features are measured in different units and ranges, so it is typically required in machine learning.
To scale the training set and the test set, add this code to the notebook cell and run it:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Here, you start by importing StandardScaler and creating an instance of it. You then use its fit_transform method to learn the scaling parameters from the training set and scale it, and its transform method to apply the same parameters to the testing set.
You have scaled all your dataset features to be on a comparable scale. You can start building the artificial neural network in the next step.
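To make the scaling step concrete, here is a small numpy sketch of the arithmetic that standardization performs. It is not the scikit-learn implementation, and the train and test arrays are invented for illustration.

```python
import numpy as np

# Hypothetical training and test features on very different scales.
train = np.array([[0.2, 150.0], [0.5, 200.0], [0.8, 250.0]])
test = np.array([[0.5, 300.0]])

# "fit": learn per-column mean and standard deviation from training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# "transform": apply the same training statistics to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)
```

The test row is scaled with the training mean and standard deviation, so no information from the test set leaks into the fitted parameters.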
Now you will use keras to build the deep learning model. To do this, you’ll import keras, which uses tensorflow as the backend by default. From keras, you’ll then import the Sequential class to initialize the artificial neural network. An artificial neural network is a computational model inspired by the workings of the human brain. You’ll import the Dense class as well, which you’ll use to add layers to your deep learning model.
When building a deep learning model you usually specify three layer types:
The input layer is the layer to which you’ll pass the features of your dataset. There is no computation that occurs in this layer. It serves to pass features to the hidden layers.
The hidden layers are usually the layers between the input layer and the output layer—and there can be more than one. These layers perform the computations and pass the information to the output layer.
The output layer represents the layer of your neural network that will give you the results after training your model. It is responsible for producing the output variables.
To import keras, Sequential, and Dense, run the following code in your notebook cell:
import keras
from keras.models import Sequential
from keras.layers import Dense
You’ll use Sequential to initialize a linear stack of layers. Since this is a classification problem, you’ll create a classifier variable. A classification problem is a task where you have labeled data and want to predict which class a new observation belongs to; here, whether an employee leaves or stays. Add this code to your notebook to create a classifier variable:
classifier = Sequential()
You’ve used Sequential to initialize the classifier.
You can now start adding layers to your network. Run this code in your next cell:
classifier.add(Dense(9, kernel_initializer = "uniform",activation = "relu", input_dim=18))
You add layers using the .add() method on your classifier and specify some parameters:
The first parameter is the number of nodes that this layer should have. The connections between nodes in different layers are what form the neural network. One strategy for choosing the number of nodes is to take the average of the number of nodes in the input layer and the output layer; here, (18 + 1) / 2 = 9.5, which is rounded down to 9.
The second parameter is kernel_initializer, the function that initializes the weights. Before training starts, the weights should be set to numbers close to zero, but not zero. To achieve this you use the uniform distribution initializer.
The third parameter is the activation function, which is applied to the output of each node and lets the network learn non-linear relationships. Activation functions can be linear or non-linear. You use the relu activation function here, a common default for hidden layers. A linear activation is a poor fit for problems like this one because a stack of linear layers can only represent a straight-line (linear) relationship between the inputs and the output.
The last parameter is input_dim, which represents the number of features in your dataset (18 here).
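As a rough illustration of what these four parameters amount to, the sketch below computes the output of one fully connected layer by hand with numpy. It is a simplified stand-in rather than the Keras implementation: the weight range, the random seed, and the input row x are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_hidden = 18, 9          # mirrors input_dim=18 and Dense(9, ...)

# Small uniform weights near zero, in the spirit of the "uniform" initializer
# (the range used here is an assumption for illustration).
W = rng.uniform(-0.05, 0.05, size=(n_features, n_hidden))
b = np.zeros(n_hidden)

x = rng.normal(size=(1, n_features))  # one hypothetical scaled employee row

# Dense + relu: an affine transform followed by max(0, z) per node.
hidden = np.maximum(0.0, x @ W + b)

print(hidden.shape)
print(W.size + b.size)  # number of trainable parameters in this layer
```

Counting parameters this way (18 x 9 weights plus 9 biases) is a quick check that input_dim matches the number of columns in your feature array.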
Now you’ll add the output layer that will give you the predictions:
classifier.add(Dense(1, kernel_initializer = "uniform",activation = "sigmoid"))
The output layer takes the following parameters:
The number of output nodes. You expect to get one output: whether an employee leaves the company. Therefore you specify one output node.
For activation you use the sigmoid function so that you can get the probability that an employee will leave; kernel_initializer is set to uniform as before. If you were dealing with more than two categories, you would use the softmax activation function, which generalizes the sigmoid function to multiple classes.
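The relationship between sigmoid and softmax can be checked numerically. In this sketch, the helper functions and the raw output value z are written for illustration only; they are not taken from Keras.

```python
import numpy as np

def sigmoid(z):
    """Squash a raw score into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    """Turn a vector of raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

z = 0.8  # hypothetical raw output of the last layer

p_sigmoid = sigmoid(z)
# A two-class softmax with the scores [0, z] gives the same probability.
p_softmax = softmax(np.array([0.0, z]))[1]

print(p_sigmoid, p_softmax)
```

With two classes the two functions agree, which is why a single sigmoid node is enough for the leave/stay prediction.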
Next, you’ll apply gradient descent to the neural network. This is an optimization strategy that works to reduce errors during the training process. Gradient descent adjusts the randomly initialized weights of a neural network so as to reduce the cost function, which measures how far the network’s output is from the output expected of it.
The aim of gradient descent is to reach the point where the error is at its least. This is done by finding where the cost function is at a minimum; the point it reaches is a local minimum, which is not guaranteed to be the lowest point overall. In gradient descent, you differentiate to find the slope at a specific point and check whether the slope is negative or positive, then move in the direction that descends toward the minimum of the cost function. There are several optimization strategies, but in this tutorial you’ll use a popular one known as adam.
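The idea of following the slope downhill can be sketched with a single weight. This is plain gradient descent on a made-up cost function, not the adam optimizer, which additionally adapts the step size as it goes; the cost, starting point, and learning rate are assumptions for illustration.

```python
def cost(w):
    """Hypothetical cost function with its minimum at w = 3."""
    return (w - 3.0) ** 2

def slope(w):
    """Derivative of the cost with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0              # starting weight
learning_rate = 0.1  # size of each step

for step in range(50):
    # A positive slope means the minimum lies to the left, a negative slope
    # means it lies to the right, so always move against the slope.
    w -= learning_rate * slope(w)

print(round(w, 4), round(cost(w), 8))
```

After enough steps the weight settles very close to the value that minimizes the cost.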
Add this code to your notebook cell and run it:
classifier.compile(optimizer= "adam",loss = "binary_crossentropy",metrics = ["accuracy"])
Applying gradient descent is done via the compile method, which takes the following parameters:
optimizer is the optimization algorithm, a variant of gradient descent (adam in this case).
loss is the function that gradient descent minimizes. Since this is a binary classification problem you use the binary_crossentropy loss function.
The last parameter is metrics, the list of metrics that you’ll use to evaluate your model. In this case, you’d like to evaluate it based on its accuracy when making predictions.
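To give a feel for what the binary cross-entropy loss measures, here is a small hand-written version. The function below is an illustrative sketch, not the Keras implementation, and the labels and probabilities are invented.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over paired labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]  # hypothetical confident, correct outputs
mostly_wrong = [0.1, 0.9, 0.2, 0.8]  # hypothetical confident, wrong outputs

loss_right = binary_crossentropy(y_true, mostly_right)
loss_wrong = binary_crossentropy(y_true, mostly_wrong)
print(loss_right, loss_wrong)
```

The loss is small when the predicted probabilities agree with the labels and grows quickly when the model is confidently wrong, which is the signal the optimizer uses to adjust the weights.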
You’re ready to fit your classifier to your dataset. Keras makes this possible via the .fit() method. Insert the following code into your notebook and run it to fit the model to your dataset:
classifier.fit(X_train, y_train, batch_size = 10, epochs = 1)
The .fit() method takes the following parameters:
The first parameter is the training set with the features.
The second parameter is the target for the training set, that is, the column that you’re making the predictions on.
batch_size is the number of samples that pass through the neural network before each update of the weights.
epochs is the number of times that the whole training set is passed through the neural network. The more epochs, the longer it takes to train your model. More epochs usually improve results up to a point, after which the model may start to overfit.
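The interplay of batch_size and epochs comes down to simple arithmetic. In the sketch below, n_train is an assumption: roughly 10,500 training rows, inferred from the 4,500 test rows reported later in the tutorial.

```python
import math

n_train = 10500   # assumption: approximate number of training rows
batch_size = 10
epochs = 1

# One weight update happens per batch; the last batch may be smaller.
updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * epochs

print(updates_per_epoch, total_updates)
```

A larger batch_size means fewer, smoother updates per epoch, while more epochs multiply the total number of updates.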
You’ve created your deep learning model, compiled it, and fitted it to your dataset. You’re ready to make some predictions using the deep learning model. In the next step, you’ll start making predictions with the dataset that the model hasn’t yet seen.
To start making predictions, you’ll use the testing dataset with the model that you’ve created. Keras enables you to make predictions by using the .predict() method.
Insert the following code in the next notebook cell to begin making predictions:
y_pred = classifier.predict(X_test)
Since you’ve already trained the classifier with the training set, this code will use what the model learned during training to make predictions on the test set. This will give you the probabilities of an employee leaving. You’ll treat a probability above 50% as indicating a high chance of the employee leaving the company.
Enter the following line of code in your notebook cell in order to set this threshold:
y_pred = (y_pred > 0.5)
You’ve created predictions using the predict method and set the threshold for determining if an employee is likely to leave. To evaluate how well the model performed on the predictions, you will next use a confusion matrix.
In this step, you will use a confusion matrix to check the number of correct and incorrect predictions. A confusion matrix, also known as an error matrix, is a square matrix that reports the number of true positives (tp), false positives (fp), true negatives (tn), and false negatives (fn) of a classifier.
A true positive is an outcome where the model correctly predicts the positive class. The share of actual positives that the model identifies is known as sensitivity or recall.
A true negative is an outcome where the model correctly predicts the negative class.
A false positive is an outcome where the model incorrectly predicts the positive class.
A false negative is an outcome where the model incorrectly predicts the negative class.
To achieve this you’ll use the confusion matrix that scikit-learn provides.
Insert this code in the next notebook cell to import the scikit-learn confusion_matrix function and compute the confusion matrix:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm
The confusion matrix output means that your deep learning model made 3305 + 375 correct predictions and 106 + 714 wrong predictions. The total number of observations in your testing set is 4500, so you can calculate the accuracy with (3305 + 375) / 4500. This gives you an accuracy of about 81.8%. Note that 714 of the 1089 employees who actually left were predicted to stay, so accuracy alone does not tell the whole story.
Output
array([[3305, 106],
[ 714, 375]])
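The counts in this output can be turned into a few summary metrics with plain arithmetic. The sketch below assumes the scikit-learn layout, with true classes in rows and predicted classes in columns; the variable names are chosen for this example.

```python
# Counts copied from the confusion matrix output above.
tn, fp = 3305, 106   # employees who stayed: predicted stay / predicted leave
fn, tp = 714, 375    # employees who left:   predicted stay / predicted leave
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)     # of the predicted leavers, how many really left
recall = tp / (tp + fn)        # of the actual leavers, how many were caught
baseline = (tn + fp) / total   # accuracy of always predicting "stays"

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} baseline={baseline:.3f}")
```

Comparing accuracy with the always-stays baseline and with recall gives a fuller picture of how well the model identifies employees who leave.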
You’ve evaluated your model using the confusion matrix. Next, you’ll work on making a single prediction using the model that you have developed.
In this step you’ll make a single prediction given the details of one employee. You will achieve this by predicting the probability of a single employee leaving the company. You’ll pass this employee’s features to the predict method. As you did earlier, you’ll scale the features as well and convert them to a numpy array.
To pass the employee’s features, run the following code in a cell:
new_pred = classifier.predict(sc.transform(np.array([[0.26,0.7 ,3., 238., 6., 0.,0.,0.,0., 0.,0.,0.,0.,0.,1.,0., 0.,1.]])))
These features represent the features of a single employee. As shown in the dataset in step 1, these features represent: satisfaction level, last evaluation, number of projects, and so on. As you did in step 3, you have to transform the features in a manner that the deep learning model can accept.
Add a threshold of 50% with the following code:
new_pred = (new_pred > 0.5)
new_pred
With this threshold, an employee is predicted to leave the company when the probability is above 50%.
You can see in your output that the employee won’t leave the company:
Output
array([[False]])
You might decide to set a lower or higher threshold for your model. For example, you can set the threshold to be 60%:
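# Recompute the probability first: new_pred currently holds the boolean from the 50% threshold.
new_pred = classifier.predict(sc.transform(np.array([[0.26,0.7 ,3., 238., 6., 0.,0.,0.,0., 0.,0.,0.,0.,0.,1.,0., 0.,1.]])))
new_pred = (new_pred > 0.6)
new_pred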
This new threshold still shows that the employee won’t leave the company:
Output
array([[False]])
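When experimenting with thresholds, it helps to keep the raw probability in its own variable and compare against it each time. The following sketch uses an invented probability value to show why.

```python
import numpy as np

prob = np.array([[0.55]])    # hypothetical predicted probability

at_50 = prob > 0.5           # above the 50% threshold
at_60 = prob > 0.6           # not above the 60% threshold

# Pitfall: comparing the boolean result again does not apply a new
# threshold, because True behaves like 1.0 and False like 0.0.
rethresholded = at_50 > 0.6

print(at_50, at_60, rethresholded)
```

Thresholding the stored probability gives different answers at 50% and 60%, whereas re-thresholding the boolean simply repeats the earlier decision.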
In this step, you have seen how to make a single prediction given the features of a single employee. In the next step, you will work on improving the accuracy of your model.
If you train your model many times you’ll keep getting different results, so the accuracies from individual training runs have a high variance. To get a more reliable estimate, you’ll use K-fold cross-validation. Usually, K is set to 10. In this technique, the training data is divided into K folds; the model is trained on 9 of the folds and tested on the remaining fold. This is repeated until every fold has served as the test fold once. Each of the iterations gives its own accuracy. The accuracy of the model is then the average of all these accuracies.
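The fold rotation described above can be sketched in a few lines of plain Python. The helper k_fold_indices is hypothetical and written for illustration; it splits the rows in order without shuffling, which library implementations may or may not do depending on how they are configured.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each fold is the test set once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=25, k=10))
print(len(folds))
print(folds[0][1], folds[-1][1])
```

Every row lands in exactly one test fold, so each observation is used for evaluation once and for training in the other K-1 rounds.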
keras enables you to implement K-fold cross-validation via the KerasClassifier wrapper. This wrapper makes a Keras model usable with the scikit-learn cross-validation tools. You’ll start by importing the cross_val_score cross-validation function and the KerasClassifier. To do this, insert and run the following code in your notebook cell:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
To create the function that you will pass to the KerasClassifier, add this code to the next cell:
def make_classifier():
    classifier = Sequential()
    classifier.add(Dense(9, kernel_initializer = "uniform", activation = "relu", input_dim=18))
    classifier.add(Dense(1, kernel_initializer = "uniform", activation = "sigmoid"))
    classifier.compile(optimizer= "adam",loss = "binary_crossentropy",metrics = ["accuracy"])
    return classifier
Here, you create a function that you’ll pass to the KerasClassifier; the function is one of the arguments that the classifier expects. The function is a wrapper around the neural network design that you used earlier. The parameters passed are the same as the ones used earlier in the tutorial. In the function, you first initialize the classifier using Sequential(), then use Dense to add the hidden and output layers. Finally, you compile the classifier and return it.
To pass the function you’ve built to the KerasClassifier, add this line of code to your notebook:
classifier = KerasClassifier(build_fn = make_classifier, batch_size=10, epochs=1)
The KerasClassifier takes three arguments:
build_fn: the function with the neural network design
batch_size: the number of samples to be passed through the network in each iteration
epochs: the number of epochs the network will run (the same epochs argument you passed to .fit() earlier; older Keras releases called it nb_epoch)
Next, you apply the cross-validation using scikit-learn’s cross_val_score. Add the following code to your notebook cell and run it:
accuracies = cross_val_score(estimator = classifier,X = X_train,y = y_train,cv = 10,n_jobs = -1)
This function returns ten accuracies since you have specified the number of folds as 10. You assign them to the accuracies variable and later use it to compute the mean accuracy. The function takes the following arguments:
estimator: the classifier that you’ve just defined
X: the training set features
y: the value to be predicted in the training set
cv: the number of folds
n_jobs: the number of CPUs to use (specifying it as -1 will make use of all the available CPUs)
Now that you have applied the cross-validation, you can compute the mean and variance of the accuracies. To achieve this, insert the following code into your notebook:
mean = accuracies.mean()
mean
In your output you’ll see that the mean is 83%:
Output
0.8343617910685696
To compute the variance of the accuracies, add this code to the next notebook cell:
variance = accuracies.var()
variance
You see that the variance is 0.00109. Since the variance is very low, the accuracy is consistent from fold to fold, which suggests that the estimate of your model’s performance is stable.
Output
0.0010935021002275425
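A variance is easier to interpret once it is converted to a standard deviation, which is in the same units as the accuracy itself. The sketch below simply reuses the two numbers reported in the outputs above.

```python
import math

mean_accuracy = 0.8343617910685696   # mean reported above
variance = 0.0010935021002275425     # variance reported above

# The standard deviation is the square root of the variance.
std = math.sqrt(variance)

print(f"{mean_accuracy:.3f} +/- {std:.3f}")
```

Read this way, the fold accuracies typically sit within a few percentage points of the mean.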
You’ve obtained a more reliable estimate of your model’s accuracy by using K-fold cross-validation. In the next step, you’ll work on the overfitting problem.
Predictive models are prone to a problem known as overfitting. This is a scenario in which the model memorizes the results in the training set and isn’t able to generalize to data that it hasn’t seen. Typically you observe overfitting when you have a very high variance in accuracies. To help fight overfitting in your model, you will add a dropout layer to your model.
In neural networks, dropout regularization is a technique that fights overfitting by adding a Dropout layer to your neural network. It has a rate parameter that indicates the fraction of neurons that will be deactivated at each training iteration. The neurons to deactivate are chosen at random. In this case, you specify 0.1 as the rate, meaning that about 10% of the neurons will be deactivated at each training iteration. The rest of the network design remains the same.
To add your Dropout layer, add the following code to the next cell:
from keras.layers import Dropout
classifier = Sequential()
classifier.add(Dense(9, kernel_initializer = "uniform", activation = "relu", input_dim=18))
classifier.add(Dropout(rate = 0.1))
classifier.add(Dense(1, kernel_initializer = "uniform", activation = "sigmoid"))
classifier.compile(optimizer= "adam",loss = "binary_crossentropy",metrics = ["accuracy"])
You have added a Dropout layer between the hidden layer and the output layer. Having set a dropout rate of 0.1 means that during the training process about 10% of the hidden layer’s neurons will be deactivated at each iteration so that the classifier doesn’t overfit on the training set. After adding the Dropout and output layers you then compiled the classifier as you have done previously.
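The effect of a dropout rate of 0.1 can be simulated with numpy. This is a simplified sketch of the idea rather than the Keras layer itself; the activations array and the random seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
rate = 0.1

# Hypothetical outputs of the 9 hidden units for 1000 training samples.
activations = np.ones((1000, 9))

# Keep each unit with probability 1 - rate, independently per sample.
keep_mask = rng.random(activations.shape) >= rate

# Scale the surviving units by 1 / (1 - rate) so that the expected
# activation stays the same as without dropout.
dropped = activations * keep_mask / (1.0 - rate)

print(1.0 - keep_mask.mean())  # fraction deactivated, close to 0.1
print(dropped.mean())          # close to 1.0
```

Dropout is only active during training; at prediction time all units are used.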
You worked to fight overfitting in this step with a Dropout layer. Next, you’ll work on further improving the model by tuning the parameters you used while creating the model.
Grid search is a technique that you can use to experiment with different model parameters in order to find the ones that give you the best accuracy. It does this by training the model with every combination of the parameter values you supply and returning the combination that gives the best results. You’ll use grid search to find the best parameters for your deep learning model, which will help improve its accuracy. scikit-learn provides the GridSearchCV class to enable this functionality. You will now proceed to modify the make_classifier function to try out different parameters.
Add this code to your notebook to modify the make_classifier function so you can test out different optimizers:
from sklearn.model_selection import GridSearchCV

def make_classifier(optimizer):
    # Build and compile the network with the optimizer chosen by the grid search
    classifier = Sequential()
    classifier.add(Dense(9, kernel_initializer="uniform", activation="relu", input_dim=18))
    classifier.add(Dense(1, kernel_initializer="uniform", activation="sigmoid"))
    classifier.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])
    return classifier
You started by importing GridSearchCV. You then changed the make_classifier function so that it accepts an optimizer argument, which lets you try different optimizers. Inside the function you initialized the classifier, added the first and output layers, and compiled the classifier. Finally, you returned the classifier so you can use it.
As in Step 4, insert this line of code to define the classifier:
classifier = KerasClassifier(build_fn = make_classifier)
You’ve defined the classifier using KerasClassifier, which expects a function through the build_fn parameter. You called KerasClassifier and passed it the make_classifier function that you created earlier.
You will now set the parameters that you wish to experiment with. Enter this code into a cell and run it:
params = {
    'batch_size': [20, 35],
    'epochs': [2, 3],
    'optimizer': ['adam', 'rmsprop']
}
Here you have added different batch sizes, numbers of epochs, and types of optimizer.
For a small dataset like yours, a batch size between 20 and 35 is a reasonable starting point. For large datasets it’s important to experiment with larger batch sizes. Using low numbers for the number of epochs ensures that you get results within a short period. However, you can experiment with bigger numbers, which will take longer to complete depending on the processing speed of your server. The adam and rmsprop optimizers from keras are a good choice for this type of neural network.
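To see how many settings this grid describes, you can enumerate the combinations yourself. The following is a minimal sketch in plain Python that mirrors the params dictionary above; it only lists the combinations and does not train anything. The names and combinations variables are introduced here for illustration.

```python
from itertools import product

params = {
    "batch_size": [20, 35],
    "epochs": [2, 3],
    "optimizer": ["adam", "rmsprop"],
}

# Build one dictionary per combination of the listed values.
names = list(params)
combinations = [dict(zip(names, values)) for values in product(*params.values())]

for combo in combinations:
    print(combo)
print(len(combinations), "combinations in total")
```

Two values for each of three parameters give 2 × 2 × 2 = 8 combinations, so adding more values to any list multiplies the amount of training the search has to do.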
Now you’re going to use the parameters you have defined to search for the best combination using GridSearchCV. Enter this into the next cell and run it:
grid_search = GridSearchCV(estimator=classifier,
                           param_grid=params,
                           scoring="accuracy",
                           cv=2)
The GridSearchCV class expects the following parameters:
estimator: the classifier that you’re using.
param_grid: the set of parameters that you’re going to test.
scoring: the metric you’re using to evaluate each combination.
cv: the number of cross-validation folds you’ll test on.
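The cv value determines how the training data is divided for each candidate. The sketch below is a simplified illustration in plain Python and not scikit-learn’s own splitter, which typically uses a stratified variant for classifiers. The k_fold_indices helper is hypothetical, and the sample count of 10,499 is an assumption inferred from the 5249 and 5250 sample counts that appear in the training log of this step.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


n_samples = 10_499  # assumption: 5249 + 5250, inferred from the training log
splits = list(k_fold_indices(n_samples, n_folds=2))
train_sizes = [len(train) for train, _ in splits]
print("samples trained on per fold:", train_sizes)

# Every candidate is trained once per fold.
n_candidates = 2 * 2 * 2  # batch sizes x epochs x optimizers
total_fits = n_candidates * len(splits)
print("cross-validation fits:", total_fits)
```

With two folds, each candidate is trained on one half of the data and scored on the other half, then the roles are swapped. A larger cv gives a steadier estimate of accuracy at the cost of more training runs.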
Next, you fit this grid_search to your training dataset:
grid_search = grid_search.fit(X_train,y_train)
The search takes a moment to complete. Your output will be similar to the following:
Output
Epoch 1/2
5249/5249 [==============================] - 1s 228us/step - loss: 0.5958 - acc: 0.7645
Epoch 2/2
5249/5249 [==============================] - 0s 82us/step - loss: 0.3962 - acc: 0.8510
Epoch 1/2
5250/5250 [==============================] - 1s 222us/step - loss: 0.5935 - acc: 0.7596
Epoch 2/2
5250/5250 [==============================] - 0s 85us/step - loss: 0.4080 - acc: 0.8029
Epoch 1/2
5249/5249 [==============================] - 1s 214us/step - loss: 0.5929 - acc: 0.7676
Epoch 2/2
5249/5249 [==============================] - 0s 82us/step - loss: 0.4261 - acc: 0.7864
Add the following code to a notebook cell to obtain the best parameters and the best score from this search using the best_params_ and best_score_ attributes:
best_param = grid_search.best_params_
best_accuracy = grid_search.best_score_
You can now check the best parameters for your model with the following code:
best_param
Your output shows that the best batch size is 20, the best number of epochs is 2, and the adam optimizer is the best for your model:
Output
{'batch_size': 20, 'epochs': 2, 'optimizer': 'adam'}
You can also check the best accuracy for your model. The best_accuracy value is the mean cross-validated accuracy that the best parameters achieved during the grid search:
best_accuracy
Your output will be similar to the following:
Output
0.8533193637489285
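The way a best score is chosen can be sketched in a few lines of plain Python. This is an illustration of the idea only: the candidate names and per-fold accuracies below are made-up placeholders, not results from this tutorial, and the variable names are hypothetical. Each candidate’s fold scores are averaged, and the candidate with the highest mean wins.

```python
from statistics import mean

# Made-up placeholder accuracies, two folds per candidate (not real results).
fold_scores = {
    "candidate_a": [0.80, 0.90],
    "candidate_b": [0.70, 0.80],
    "candidate_c": [0.85, 0.75],
}

# Average the fold scores for each candidate, then keep the highest mean.
mean_scores = {name: mean(scores) for name, scores in fold_scores.items()}
best_candidate = max(mean_scores, key=mean_scores.get)
best_score = mean_scores[best_candidate]

print(best_candidate, round(best_score, 4))
```

Because the reported number is an average over validation folds taken from the training set, it is an estimate of how the model generalizes and can differ from the accuracy you measure on a separate test set.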
You’ve used GridSearchCV to figure out the best parameters for your classifier. You have seen that the best batch_size is 20, the best optimizer is adam, and the best number of epochs is 2. You have also obtained a best cross-validated accuracy of about 85% for your classifier. You’ve built an employee retention model that is able to predict whether an employee stays or leaves with an accuracy of about 85%.
In this tutorial, you’ve used Keras to build an artificial neural network that predicts the probability that an employee will leave a company. You combined it with your previous machine learning knowledge of scikit-learn to achieve this. To further improve your model, you can try different activation functions or optimizers from keras. You could also experiment with a different number of folds, or even build a model with a different dataset.
For other tutorials in the machine learning field or on using TensorFlow, you can try building a neural network to recognize handwritten digits, or explore other DigitalOcean machine learning tutorials.
Translated from: https://www.digitalocean.com/community/tutorials/how-to-build-a-deep-learning-model-to-predict-employee-retention-using-keras-and-tensorflow