Data Cleaning and Preprocessing: Modelling Subscription for Bank Deposits
The exploration of data has always fascinated me. The kinds of insights and information that can be hidden in raw data are invigorating to discover and communicate. In this post, I chose to explore the bank marketing data from the UCI Machine Learning Repository to uncover insights that suggest whether a client will subscribe to a term deposit or not. So, yes! You guessed right! It is a classification problem. The data were already cleaned, at least to some extent, with no missing values, so there wasn't too much data cleaning required; hence my focus will be on Exploratory Data Analysis (EDA).
I outlined the steps I plan to follow below:
1. EDA
a. Univariate Analysis
b. Bivariate Analysis
c. Insights Exploration
2. Preprocessing
a. Data Transformation
b. Feature Engineering
3. Modelling
a. Model Development
b. Model Evaluation
c. Model Comparison
Step 1: Exploratory Data Analysis (EDA)
Data was sourced from the UCI Machine Learning Repository. It represents the results of the marketing campaigns (phone calls) of a Portuguese banking institution and comprises 41,188 observations (rows) and 21 features (columns), including client data such as age, job and education, as well as economic and social attributes such as the employment variation rate and the number of employees. The dependent variable (target) is "y", which states the outcome of the marketing campaign: whether the respondent subscribed to a deposit ("yes" or "no"). A detailed description of the features can be found here
Let's begin with the exploration. First, load all the libraries that will be used:
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Handle table-like data and matrices
import numpy as np
import pandas as pd

# Modelling Algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Modelling Helpers
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns

# Read in the data
df = pd.read_csv('data/bank-additional-full.csv', sep=';')
df.head()
Let’s look at the description of the data using the describe method from the pandas library.
df.select_dtypes(include=["int64", "float64"]).describe().T
From the data, we can observe that the mean age is 40 years, with a maximum of 98 years and a minimum of 17 years. Try to understand the rest of the data descriptions and think about what questions the summary statistics raise.
Univariate Analysis
To make this step and the next easier and reproducible, I created a class to visualize the data. The class has different functions that make it easy to communicate the exploration and its insights. Below is an illustration of how I did it.
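A minimal sketch of what such a class might look like (the class and method names here are illustrative, not the original code):

import matplotlib.pyplot as plt
import seaborn as sns

class EDAVisualizer:
    """Hypothetical reusable plotting helper for this dataset."""

    def __init__(self, df):
        self.df = df

    def plot_categorical(self, col, target=None):
        # Bar chart of a categorical feature, optionally split by the target
        plt.figure(figsize=(8, 4))
        sns.countplot(data=self.df, x=col, hue=target)
        plt.title(f'Distribution of {col}')
        plt.xticks(rotation=45)
        plt.show()

    def plot_numerical(self, col):
        # Histogram with a density estimate for a numerical feature
        plt.figure(figsize=(8, 4))
        sns.histplot(self.df[col], kde=True)
        plt.title(f'Distribution of {col}')
        plt.show()

For instance, EDAVisualizer(df).plot_categorical('y') would produce the target distribution discussed next.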
There seem to be a lot more clients who have not subscribed to a term deposit. This is certainly an imbalanced class problem.
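A quick way to quantify the imbalance, assuming df from above:

# Proportion of each class in the target column
print(df['y'].value_counts(normalize=True))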
1. How does this class imbalance problem impact your model performance?
2. Which data techniques could be useful to address it?
NB: More details on how to handle this kind of problem are provided in the transformation step.
More visualizations were generated to understand the distribution of each feature in the dataset.
Bivariate Analysis
Bivariate analysis compares the distribution of one feature with respect to another. For a classification task, the target is the primary feature, and any other feature can be plotted against it to observe their relationship.
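For example, a grouped count plot of education against the target y (a sketch, assuming df and the libraries loaded earlier):

import matplotlib.pyplot as plt
import seaborn as sns

# Count of each education level, split by subscription outcome
plt.figure(figsize=(10, 4))
sns.countplot(data=df, x='education', hue='y')
plt.xticks(rotation=45)
plt.title('Education vs. term deposit subscription')
plt.show()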
Insights Exploration
From the results of the Univariate and Bivariate Analysis above, it can be inferred that:
1. Some columns contain similar categories that can be merged, like divorced and single for marital status, and basic.4y, basic.6y and basic.9y for education, which can all be replaced with basic since they refer to the same thing.
2. The distribution of marital status is relatively balanced between the married and the single (single + divorced).
3. Approx. 90% of the target says "no" to a term subscription, leaving us with the task of balancing the data for a good model.
4. Most of the respondents don't have a personal loan.
5. Most of the respondents are university degree holders.
6. Calls were not put through to respondents on weekends, and a relatively equal number of calls was recorded on each weekday.
7. Approx. 65% of all respondents were contacted via a cellular phone.
8. Most calls occurred in May. The skew of previous campaigns' efforts towards the summer may impact the outcome of future campaigns.
Step 2: Preprocessing (Data Transformation)
To prepare the data for modelling, we will do the following:
1. Copy the original data so all transformations are done on a duplicate.
2. Specify the target column and drop it from the data.
3. Map similar categories into one for proper encoding.
4. Convert object-dtype columns to integers using one-hot encoding.
5. Scale the data using the robust scaler algorithm, chosen because it is less susceptible to outliers.
6. Fix the imbalanced part of the data (more information here).
7. Apply dimensionality reduction using PCA.
8. Add the target column back to the data.
9. Return the transformed data frame.
To make these steps easier, I used a class in which each step was implemented accordingly. Below is an illustration of how it was done.
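A minimal sketch of such a transformation pipeline, assuming the column names of this dataset (the function body below is illustrative rather than the original code):

import pandas as pd
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE

def transform(df, target_col='y'):
    data = df.copy()                                    # 1. work on a duplicate
    y = data.pop(target_col).map({'no': 0, 'yes': 1})   # 2. separate the target

    # 3. merge similar categories
    data['education'] = data['education'].replace(
        ['basic.4y', 'basic.6y', 'basic.9y'], 'basic')

    # 4. one-hot encode object-dtype columns
    data = pd.get_dummies(data)

    # 5. scale with the robust scaler (less sensitive to outliers)
    scaled = RobustScaler().fit_transform(data)

    # 6. oversample the minority class with SMOTE
    X_res, y_res = SMOTE(random_state=42).fit_resample(scaled, y)

    # 7. reduce dimensionality with PCA, keeping 95% of the variance
    X_pca = PCA(n_components=0.95).fit_transform(X_res)

    # 8-9. reattach the target and return the transformed frame
    out = pd.DataFrame(X_pca)
    out[target_col] = y_res
    return out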
Each step was carried out by a function to make it more reproducible. Similar categories of a column were mapped together to further improve the performance of the predictive model. The data imbalance was handled using the imblearn SMOTE module to oversample the minority class, in our case the respondents who said "yes" to a term deposit subscription. Dimensionality reduction was then performed to reduce the complexity of the data that arises from the one-hot encoding technique.
Feature Engineering
This is the process of adding more features to the already existing features in the dataset. It can be done through feature transformation and/or feature aggregation. An example is shown below.
In the first line, age, a numerical feature, was transformed into categories: young, middle-aged and old. Age was also aggregated by categorical columns like marital and education and transformed with the mean, yielding additional features such as the mean age per education level.
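A short sketch of both ideas (the bin edges and labels are illustrative assumptions):

import pandas as pd

# Feature transformation: bin the numerical age into categories
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 30, 60, 100],
                         labels=['young', 'middle-aged', 'old'])

# Feature aggregation: mean age per education level, broadcast back as a feature
df['mean_age_per_education'] = df.groupby('education')['age'].transform('mean')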
Step 3: Modelling
Model Development
At this stage, machine learning models were built. The simplest and most interpretable model for predicting the categorical variable "y" is Logistic Regression, so it was used as the baseline model to improve upon. In fitting the model, stratified 5-fold cross-validation was used because it samples each fold so that it is a good representative of the whole data. StratifiedKFold shuffles and splits the data once, so the test sets do not overlap, which makes it a good way to perform cross-validation.
Implementing a Logistic Regression Model
################ LOGISTIC REGRESSION #########################
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

def run_logistic_regression(self, X_train, X_val, Y_train, Y_val):
    model_name = '26-08-2020-20-32-31-00-log-reg.pkl'
    # initialize the stratified k-fold (the 5-fold CV described above)
    kfold, scores = StratifiedKFold(n_splits=5, shuffle=True, random_state=221), list()
    # split data indices into train and test folds
    for train, test in kfold.split(X_train, Y_train):
        # specify train and test sets
        x_train, x_test = X_train[train], X_train[test]
        y_train, y_test = Y_train[train], Y_train[test]
        # initialize the model
        model = LogisticRegression(random_state=27, solver='lbfgs')
        # train
        model.fit(x_train, y_train)
        # predict for evaluation
        preds = model.predict(x_test)
        # compute f1-score
        score = f1_score(y_test, preds)
        scores.append(score)
        test_pred = model.predict(X_val)
        print('f1-score: ', score)
    print("Average: ", sum(scores) / len(scores))
Other models that were built include XGBoost, Multilayer Perceptron, Support Vector Machine and Random Forest. These models were selected because of their sophisticated approaches to detecting hidden patterns and non-linear relationships in data.
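A sketch of how these models might be trained and compared on the resampled data (default hyperparameters; X_res and y_res are assumed outputs of the preprocessing step):

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

models = {
    'SVM': SVC(),
    'Random Forest': RandomForestClassifier(random_state=27),
    'MLP': MLPClassifier(max_iter=500, random_state=27),
    'XGBoost': XGBClassifier(random_state=27),
}

# Evaluate each model with 5-fold cross-validation on the f1 metric
for name, model in models.items():
    scores = cross_val_score(model, X_res, y_res, cv=5, scoring='f1')
    print(f'{name}: mean f1 = {scores.mean():.3f}')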
Model Evaluation
The evaluation metrics used for this data are recall_score, f1_score and the Receiver Operating Characteristic (ROC) curve. They were selected because:
1. The data are highly imbalanced and skewed towards the outcome that is negative for the company, i.e. loss, so the f1_score, the harmonic mean of precision and recall, is a more informative summary than accuracy.
2. A higher recall score is better than higher precision here: recall (a.k.a. sensitivity) is the fraction of all relevant instances that are retrieved, and in our case identifying the customers who will subscribe to a term deposit is much more important than avoiding false positives among those who may not.
3. The ROC curve shows how well each model separates the two classes; an area under the curve closer to 1 is preferred, and it should be above 0.5 to beat a random guess.
The code for evaluating the models
def eval_model(self, target_test_data, prediction):
    from sklearn.metrics import (accuracy_score, confusion_matrix, recall_score,
                                 f1_score, precision_score, classification_report)
    # use a local name so the confusion_matrix function is not shadowed
    cm = confusion_matrix(target_test_data, prediction)
    print('Accuracy Score: ', accuracy_score(target_test_data, prediction))
    print('F1-Score: ', f1_score(target_test_data, prediction))
    print('Recall: ', recall_score(target_test_data, prediction))
    print('Precision: ', precision_score(target_test_data, prediction))
    print(cm)
    print(classification_report(target_test_data, prediction))

def plot_auc_curve(self, model, model_name, test_data, target_test_data):
    from sklearn.metrics import roc_auc_score, roc_curve
    logit_roc_auc = roc_auc_score(target_test_data, model.predict(test_data))
    fpr, tpr, thresholds = roc_curve(target_test_data, model.predict_proba(test_data)[:, 1])
    plt.figure()
    plt.plot(fpr, tpr, label=f'{model_name} (area under curve = %0.2f)' % logit_roc_auc)
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'Receiver operating characteristic ({model_name})')
    plt.legend(loc="lower right")
    plt.savefig(f'{model_name}_ROC')
    plt.show()
Evaluating the models
Accuracy Score: 0.8460791454236465
F1-Score: 0.41941391941391937
Recall: 0.49353448275862066
Precision: 0.3646496815286624
The model's precision and recall for class 0 (respondents who didn't subscribe to a deposit) are high, but for class 1 (respondents who subscribed) they are low. The F1-score for class 0 is high, while for class 1 it is below 0.5. Overall, the model does not seem very good at predicting the respondents who subscribe to a term deposit. The area under the curve is 0.72, substantially higher than the 0.5 of random guessing. Given the results of the classification report above, it can be assumed that the biggest contribution to the area under the ROC curve comes from the correctly identified class 0.
Model Comparison
Conclusion
Support Vector Machine gave steady and reliable model performance on both the F1-score and the recall score. The high F1-score suggests that the model may be a good predictor of deposit subscription campaign success for existing customers. This was put together as part of the weekly challenge at 10 Academy training.
The notebook with the detailed code for this analysis can be found here
Translated from: https://towardsdatascience.com/data-cleaning-and-preprocessing-modelling-subscription-for-bank-deposits-e810bd1ab5da