



  1. 完成对数据集缺失值的处理
  2. 完成对数据集非数值变量的转换
  3. 完成对数据集的标准化
  4. 保存预处理后的数据集






  • 1 - age: 年龄 (数字)
  • 2 - job: 工作类型 。管理员(admin),蓝领(blue-collar),企业家(entrepreneur),家庭主妇(housemaid),管理者('management'),退休('retired'),个体经营('self-employed'),服务业('services'),学生('student'),技术人员('technician'),无业('unemployed'),未知('unknown')
  • 3 - marital : 婚姻状态,离婚('divorced'),结婚('married'),单身('single'),未知('unknown')。说明:离婚也包括寡居
  • 4 - education: 教育情况 : 基本4年('basic.4y'), 基本6年('basic.6y'),基本九年('basic.9y'),高中('high.school'),文盲('illiterate'),专业课程('professional.course'),大学学位('university.degree'),未知('unknown')
  • 5 - default: 是否有信用违约? ('no','yes','unknown')
  • 6 - housing: 是否有房贷 ( 'no','yes','unknown')
  • 7 - loan: 是否有个人贷款 (categorical: 'no','yes','unknown')


  • 8 - contact: 联系类型,手机( 'cellular'),电话:'telephone'
  • 9 - month: 年度最后一次联系的月份 (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
  • 10 - day_of_week: 最后一次联系的星期 (categorical: 'mon','tue','wed','thu','fri')
  • 11 - duration: 上一次联系的通话时长(秒). 重要提示:此属性高度影响输出目标(例如,如果持续时间=0,则y='no')。然而,在执行呼叫之前,持续时间还不知道。而且,在通话结束后,Y显然是已知的。因此,这个输入应该只包括在基准测试中,如果想要有一个实际的预测模型,就应该丢弃它。(预测时不知道会通话的时长)


  • 12 - campaign: 针对该客户,为了此次营销所发起联系的数量。(数字,包括最后一次联络)
  • 13 - pdays: 上次营销到现在已经过了多少天。(数字,如果是999表示这个客户还没有联系过)
  • 14 - previous: 在本次营销之前和客户联系过几次(数字)
  • 15 - poutcome: 上一次营销活动的结果 ( 'failure','nonexistent','success')


  • 16 - emp.var.rate: 就业变动率 -系度指标(numeric)
  • 17 - cons.price.idx: 消费物价指数-月度指标 (numeric)
  • 18 - cons.conf.idx: 消费者信心指数--月度指标(numeric)
  • 19 - euribor3m: 欧元同业拆借利率3个月 - 每日指标 (numeric)
  • 20 - nr.employed: 员工数量-季度指标 (numeric)


  • 21 - y -客户存钱了吗(被成功营销了吗)? (binary: 'yes','no')



1. 数据装载

  • 数据装载,使用head()观察数据
  • 为了方便后续处理,将分类变量和数值变量的列名分别存放在不同列表中
    categoryVar = [ ...]

import numpy as np
import pandas as pd
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
(41188, 21)



数据集的输入变量是20个特征量,分为数值变量(numeric)和分类(categorical)变量。从前期数据信息可以看出,数值型变量(int64和float64)没有缺失。非数值型变量可能存在unknown值。 本小节要求:

  1. 检查每个变量的缺失值占比情况
  2. 给出存在缺失值的变量中:高、中、低三类缺失情况

2.1 缺失值检查

  • 数据集的输入变量是20个特征量,分为数值变量(numeric)和分类(categorical)变量。
  • 使用df.isnull().any()观察缺失值情况,没有发现特征含有缺失值(NaN)。
  • 但是在本数据集中,缺失值是以其他的形式存在的。分类变量大部分的特征都是使用unknown来表示缺失值,而poutcome是使用nonexistent来表示;数值变量中只有pdays存在缺失值(以数字999形式存在)。 本步骤要求对所有存在缺失值的分类变量打印其缺失值占比


cols = categoryVar + ['pdays']
for col in cols:
    v = df[col].value_counts().to_dict()
    if 'unknown' in v.keys():
        unCount = v['unknown']
    elif 'nonexistent' in v.keys():
        unCount = v['nonexistent']
    elif '999' in v.keys():
        unCount = v['999']
    print ("%-10s: %5.1f%%"%(col,unCount/total*100))
job       :   0.8%
marital   :   0.2%
education :   4.2%
default   :  20.9%
housing   :   2.4%
loan      :   2.4%
poutcome  :  86.3%

2.2 高缺失比例的变量处理

  1. 通过直方图对pdays变量进行可视化,请给出分析,未缺失的pdays大概都在一个怎样的数值范围内?
  2. 通过pdays与poutcome的交叉表,观察这两个变量取值的关系,通过数据分析得到进一步结论


dfPdays=df.loc[df.pdays != 999, 'pdays']


# 对pdays绘制直方图
dfPdays = df.loc[df.pdays!=999,'pdays']
(array([ 15.,  26.,  61., 439., 118.,  46., 412.,  60.,  18.,   0.,  64.,
         52.,  28.,  58.,  36.,  20.,  24.,  11.,   8.,   0.,   7.,   3.,
          1.,   2.,   3.,   0.,   0.,   1.,   1.,   1.]),
 array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ,
         9.9, 10.8, 11.7, 12.6, 13.5, 14.4, 15.3, 16.2, 17.1, 18. , 18.9,
        19.8, 20.7, 21.6, 22.5, 23.4, 24.3, 25.2, 26.1, 27. ]),



pdaysDf = df['pdays'].apply(lambda x: int(x /5 )*5)
pd.crosstab(pdaysDf,df['poutcome']) #显示交叉表
poutcome failure nonexistent success
0 6 0 653
5 74 0 526
10 36 0 158
15 22 0 31
20 3 0 3
25 1 0 2
995 4110 35563 0

2.3 default(信用违约)缺失值分析和处理

default: 缺失值占比20.9%,考虑对缺失值进行分析和修补

  1. default的取值分布中有何启示?
  2. 对存在信用违约记录缺失的用户群体特征进行描述。(请在变量的用户信息中取出变量一一与default进行可视化)
  3. 说明最后对default的处理,为何采用unknown与yes记录合并的做法


no         32588
unknown     8597
yes            3
Name: default, dtype: int64


In [7]:
def defaultAsso(dataset, col):
    tab = pd.crosstab(dataset['default'],dataset[col]).apply(lambda x: x/x.sum() * 100)
    tab_pct = tab.transpose()
    x = tab_pct.index.values
    plt.plot(x, tab_pct['unknown'],color='green', label='unknown')
    plt.plot(x, tab_pct['yes'],color='blue', label='yes')
    plt.plot(x, tab_pct['no'],color='red', label='no')



In [11]:
def get_age_group(age):
    if age <30:
        return 2
    elif age>60:
        return 6
        return age//10
df['ageGroup'] =df['age'].apply(lambda x:get_age_group(x))#打印年龄组的取值是否正确
defaultAsso(df,'ageGroup') #对照defualt与年龄组
  age job marital education default housing loan contact month day_of_week ... campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 56 housemaid married basic.4y no no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
1 57 services married high.school unknown no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
2 37 services married high.school no yes no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
3 40 admin. married basic.6y no no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
4 56 services married high.school no no yes telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
5 45 services married basic.9y unknown no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
6 59 admin. married professional.course no no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
7 41 blue-collar married unknown unknown no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
8 24 technician single professional.course no yes no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
9 25 services single high.school no yes no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
10 41 blue-collar married unknown unknown no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
11 25 services single high.school no yes no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
12 29 blue-collar single high.school no no yes telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
13 57 housemaid divorced basic.4y no yes no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
14 35 blue-collar married basic.6y no yes no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
15 54 retired married basic.9y unknown yes yes telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
16 35 blue-collar married basic.6y no yes no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
17 46 blue-collar married basic.6y unknown yes yes telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
18 50 blue-collar married basic.9y no yes yes telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
19 39 management single basic.9y unknown no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
20 30 unemployed married high.school no no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
21 55 blue-collar married basic.4y unknown yes no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
22 55 retired single high.school no yes no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
23 41 technician single high.school no yes no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
24 37 admin. married high.school no yes no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
25 35 technician married university.degree no no yes telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
26 59 technician married unknown no yes no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
27 39 self-employed married basic.9y unknown no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
28 54 technician single university.degree unknown no no telephone may mon ... 2 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
29 55 unknown married university.degree unknown unknown unknown telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
41158 35 technician divorced basic.4y no no no cellular nov tue ... 1 999 0 nonexistent -1.1 94.767 -50.8 1.035 4963.6 yes
41159 35 technician divorced basic.4y no yes no cellular nov tue ... 1 9 4 success -1.1 94.767 -50.8 1.035 4963.6 yes
41160 33 admin. married university.degree no no no cellular nov tue ... 1 999 0 nonexistent -1.1 94.767 -50.8 1.035 4963.6 yes
41161 33 admin. married university.degree no yes no cellular nov tue ... 1 999 1 failure -1.1 94.767 -50.8 1.035 4963.6 no
41162 60 blue-collar married basic.4y no yes no cellular nov tue ... 2 4 1 success -1.1 94.767 -50.8 1.035 4963.6 no
41163 35 technician divorced basic.4y no yes no cellular nov tue ... 3 4 2 success -1.1 94.767 -50.8 1.035 4963.6 yes
41164 54 admin. married professional.course no no no cellular nov tue ... 2 10 1 success -1.1 94.767 -50.8 1.035 4963.6 yes
41165 38 housemaid divorced university.degree no no no cellular nov wed ... 2 999 0 nonexistent -1.1 94.767 -50.8 1.030 4963.6 yes
41166 32 admin. married university.degree no no no telephone nov wed ... 1 999 1 failure -1.1 94.767 -50.8 1.030 4963.6 yes
41167 32 admin. married university.degree no yes no cellular nov wed ... 3 999 0 nonexistent -1.1 94.767 -50.8 1.030 4963.6 no
41168 38 entrepreneur married university.degree no no no cellular nov wed ... 2 999 0 nonexistent -1.1 94.767 -50.8 1.030 4963.6 no
41169 62 services married high.school no yes no cellular nov wed ... 5 999 0 nonexistent -1.1 94.767 -50.8 1.030 4963.6 no
41170 40 management divorced university.degree no yes no cellular nov wed ... 2 999 4 failure -1.1 94.767 -50.8 1.030 4963.6 no
41171 33 student married professional.course no yes no telephone nov thu ... 1 999 0 nonexistent -1.1 94.767 -50.8 1.031 4963.6 yes
41172 31 admin. single university.degree no yes no cellular nov thu ... 1 999 0 nonexistent -1.1 94.767 -50.8 1.031 4963.6 yes
41173 62 retired married university.degree no yes no cellular nov thu ... 1 999 2 failure -1.1 94.767 -50.8 1.031 4963.6 yes
41174 62 retired married university.degree no yes no cellular nov thu ... 1 1 6 success -1.1 94.767 -50.8 1.031 4963.6 yes
41175 34 student single unknown no yes no cellular nov thu ... 1 999 2 failure -1.1 94.767 -50.8 1.031 4963.6 no
41176 38 housemaid divorced high.school no yes yes cellular nov thu ... 1 999 0 nonexistent -1.1 94.767 -50.8 1.031 4963.6 no
41177 57 retired married professional.course no yes no cellular nov thu ... 6 999 0 nonexistent -1.1 94.767 -50.8 1.031 4963.6 no
41178 62 retired married university.degree no no no cellular nov thu ... 2 6 3 success -1.1 94.767 -50.8 1.031 4963.6 yes
41179 64 retired divorced professional.course no yes no cellular nov fri ... 3 999 0 nonexistent -1.1 94.767 -50.8 1.028 4963.6 no
41180 36 admin. married university.degree no no no cellular nov fri ... 2 999 0 nonexistent -1.1 94.767 -50.8 1.028 4963.6 no
41181 37 admin. married university.degree no yes no cellular nov fri ... 1 999 0 nonexistent -1.1 94.767 -50.8 1.028 4963.6 yes
41182 29 unemployed single basic.4y no yes no cellular nov fri ... 1 9 1 success -1.1 94.767 -50.8 1.028 4963.6 no
41183 73 retired married professional.course no yes no cellular nov fri ... 1 999 0 nonexistent -1.1 94.767 -50.8 1.028 4963.6 yes
41184 46 blue-collar married professional.course no no no cellular nov fri ... 1 999 0 nonexistent -1.1 94.767 -50.8 1.028 4963.6 no
41185 56 retired married university.degree no yes no cellular nov fri ... 2 999 0 nonexistent -1.1 94.767 -50.8 1.028 4963.6 no
41186 44 technician married professional.course no no no cellular nov fri ... 1 999 0 nonexistent -1.1 94.767 -50.8 1.028 4963.6 yes
41187 74 retired married professional.course no yes no cellular nov fri ... 3 999 1 failure -1.1 94.767 -50.8 1.028 4963.6 no

41188 rows × 21 columns



df['default']=df['default'].map({'unknown':1 ,'yes':1,'no':0})
0    32588
1     8600
Name: default, dtype: int64

2.4 处理极少量缺失比例的变量


2.4.1 删除缺失记录

  • job和marital只有少量缺失,缺失值记录占比不到百分之一,这里要求将job和marital中取值为unknown的记录删除
  • 删除记录后,调用value_counts()检查缺失值是否真的已经去除 这里以job删除为例:
df.drop(df[df.job == 'unknown'].index,inplace = True,axis=0)
admin.           10422
blue-collar       9254
technician        6743
services          3969
management        2924
retired           1720
entrepreneur      1456
self-employed     1421
housemaid         1060
unemployed        1014
student            875
Name: job, dtype: int64
df.drop(df[df.marital == 'unknown'].index,inplace = True,axis=0)
married     24694
single      11494
divorced     4599
Name: marital, dtype: int64
marital divorced married single
admin. 1280 5253 3875
blue-collar 728 6687 1825
entrepreneur 179 1071 203
housemaid 161 777 119
management 331 2089 501
retired 348 1274 93
self-employed 133 904 379
services 532 2294 1137
student 9 41 824
technician 774 3670 2287
unemployed 124 634 251
yes        21376
no         18427
unknown      984
Name: housing, dtype: int64
no         33620
yes         6183
unknown      984
Name: loan, dtype: int64

2.4.2 处理关联的缺失值

  • 从热力图上看,除了housing,loan与education的关系最为密切。因此使用交叉表观察housing和loan的关系。
  • 删除housing的缺失记录
  • 针对housing和loan分别调用value_counts()观察缺失值是否已经去除
df.drop(df[df.housing == 'unknown'].index,inplace = True,axis=0)
yes    21376
no     18427
Name: housing, dtype: int64
no     33620
yes     6183
Name: loan, dtype: int64
loan no yes
no 15897 2530
yes 17723 3653
loan no yes
admin. 8472 1709
blue-collar 7636 1365
entrepreneur 1212 205
housemaid 874 154
management 2411 439
retired 1431 240
self-employed 1182 194
services 3263 599
student 709 142
technician 5596 988
unemployed 834 148
marital divorced married single
no 2086 11273 5068
yes 2392 12837 6147


3. 将分类变量转为数值

分类变量数值化 为了能使分类变量参与模型计算,我们需要将分类变量数值化,也就是编码。因此尚未被编码的分类变量(教育、工作、违约、联系方式、住房和贷款)都需要进一步被转换为数值变量。

3.1 只有两种取值的变量

二分类变量编码: 在本数据集中,变量y, default 、contact、housing 和loan 都是只有两种取值,即二分类变量,可对其进行0,1编码。Default在前面的步骤中取值已经被转为数字0和1。

  1. 使用map方法,将y 、contact、housing 和loan 的取值映射成数字0和1
  1. 使用df[['y','default','contact','housing','loan']].head(),观察以上变量已经被正确转换:
no     35316
yes     4487
Name: y, dtype: int64
df['y'] = df['y'].map({'no':0, 'yes':1})
df['housing'] = df['housing'].map({"no":0, "yes":1})
df['loan'] = df['loan'].map({"no":0, "yes":1})
0    35316
1     4487
Name: y, dtype: int64
  y default contact housing loan
0 0 0 1 0 0
1 0 1 1 0 0
2 0 0 1 1 0
3 0 0 1 0 0
4 0 0 1 0 1

3.2 有序分类变量编码

观察education的取值,可以根据学历高低,认为变量education是有序分类变量,影响大小排序为"illiterate", "basic.4y", "basic.6y", "basic.9y", "high.school", "professional.course", "university.degree", 变量影响由小到大的顺序编码为1、2、3、..., 但是由于缺失值的存在,unknown将无法进行排序。为了处理方便,我们在这里先将unknown设置为0,后续再重新对该值进行修正。


values = ["unknown","illiterate", "basic.4y", "basic.6y", "basic.9y", "high.school",  "professional.course", "university.degree"]
levels = range(0,len(values))
dict_levels = dict(zip(values, levels))
for v in values:
    df.loc[df['education'] == v, 'education'] = dict_levels[v]
7    11821
5     9244
4     5856
6     5100
2     4002
3     2204
0     1558
1       18
Name: education, dtype: int64

3.3 将无序分类变量转为虚拟变量



  1. 将本数据集中的无序分类变量(job,marital,poutcome,month,day_of_week)转为虚拟变量(one-hot编码)
  2. 调用df.info()观察转换后的变量变化
df = pd.get_dummies(df, columns = ['job','marital','poutcome','month','day_of_week'])

Int64Index: 39803 entries, 0 to 41187
Data columns (total 49 columns):
age                     39803 non-null int64
education               39803 non-null int64
default                 39803 non-null int64
housing                 39803 non-null int64
loan                    39803 non-null int64
contact                 39803 non-null int64
duration                39803 non-null int64
campaign                39803 non-null int64
pdays                   39803 non-null int64
previous                39803 non-null int64
emp.var.rate            39803 non-null float64
cons.price.idx          39803 non-null float64
cons.conf.idx           39803 non-null float64
euribor3m               39803 non-null float64
nr.employed             39803 non-null float64
y                       39803 non-null int64
ageGroup                39803 non-null int64
job_admin.              39803 non-null uint8
job_blue-collar         39803 non-null uint8
job_entrepreneur        39803 non-null uint8
job_housemaid           39803 non-null uint8
job_management          39803 non-null uint8
job_retired             39803 non-null uint8
job_self-employed       39803 non-null uint8
job_services            39803 non-null uint8
job_student             39803 non-null uint8
job_technician          39803 non-null uint8
job_unemployed          39803 non-null uint8
marital_divorced        39803 non-null uint8
marital_married         39803 non-null uint8
marital_single          39803 non-null uint8
poutcome_failure        39803 non-null uint8
poutcome_nonexistent    39803 non-null uint8
poutcome_success        39803 non-null uint8
month_apr               39803 non-null uint8
month_aug               39803 non-null uint8
month_dec               39803 non-null uint8
month_jul               39803 non-null uint8
month_jun               39803 non-null uint8
month_mar               39803 non-null uint8
month_may               39803 non-null uint8
month_nov               39803 non-null uint8
month_oct               39803 non-null uint8
month_sep               39803 non-null uint8
day_of_week_fri         39803 non-null uint8
day_of_week_mon         39803 non-null uint8
day_of_week_thu         39803 non-null uint8
day_of_week_tue         39803 non-null uint8
day_of_week_wed         39803 non-null uint8
dtypes: float64(5), int64(12), uint8(32)
memory usage: 6.7 MB

4. 通过随机森林补充缺失值


  1. 将数据集切分为训练集和测试集。其中无education缺失的记录归入训练集;education缺失的记录归入测试集。education作为预测目标(注意,这里与本数据集以营销成功与否作为目标是不同的)
  2. 使用机器学习在训练集上学习,并且将学习结果应用在测试集中


  • trainX 训练集输入变量
  • trainY 训练集目标值
  • testX 测试集输入变量
from sklearn.ensemble import RandomForestClassifier
def train_predict_unknown(trainX, trainY, testX):
    forest = RandomForestClassifier(n_estimators=100)
    forest = forest.fit(trainX, trainY)
    test_predictY = forest.predict(testX).astype(int)
    return pd.DataFrame(test_predictY,index=testX.index)
# 将education值已知的记录作为训练集,education的值未知(等于0)记录放入测试集
test_data = df[df['education'] == 0]#education等于0的记录作为测试集
train_data = df[df['education'] != 0] #education不等于0的记录作为训练集
# 将education变量作为目标变量,将训练集分为目标变量和输入变量两个dataframe
trainY =train_data['education'] # 将education列放入trainY
trainX = train_data.drop('education', axis=1)  # 将education列从train_data中删除
testX =test_data.drop('education', axis=1)#将education列从testX中删除 


test_data['education'] = train_predict_unknown(trainX, trainY, testX)


7    446
5    383
2    261
4    256
6    165
3     47
Name: education, dtype: int64


df = pd.concat([train_data, test_data])
(39803, 49)


  age education default housing loan contact duration campaign pdays previous ... month_mar month_may month_nov month_oct month_sep day_of_week_fri day_of_week_mon day_of_week_thu day_of_week_tue day_of_week_wed
0 56 2 0 0 0 1 261 1 999 0 ... 0 1 0 0 0 0 1 0 0 0
1 57 5 1 0 0 1 149 1 999 0 ... 0 1 0 0 0 0 1 0 0 0
2 37 5 0 1 0 1 226 1 999 0 ... 0 1 0 0 0 0 1 0 0 0
3 40 3 0 0 0 1 151 1 999 0 ... 0 1 0 0 0 0 1 0 0 0
4 56 5 0 0 1 1 307 1 999 0 ... 0 1 0 0 0 0 1 0 0 0

5 rows × 49 columns




from sklearn.preprocessing import StandardScaler
def scaleColumns(data, cols_to_scale):
    scaler = StandardScaler()
    idx = data.index.values
    for col in cols_to_scale:
        x = scaler.fit_transform(pd.DataFrame(data[col]))
        data[col] = pd.DataFrame(x,columns=['col'],index=idx)
    return data
df = scaleColumns(df,numberVar+['education'])
  age education default housing loan contact duration campaign pdays previous ... month_mar month_may month_nov month_oct month_sep day_of_week_fri day_of_week_mon day_of_week_thu day_of_week_tue day_of_week_wed
0 1.539987 -1.925742 0 0 0 1 0.009489 -0.566762 0.194855 -0.349299 ... 0 1 0 0 0 0 1 0 0 0
1 1.636117 -0.096859 1 0 0 1 -0.422339 -0.566762 0.194855 -0.349299 ... 0 1 0 0 0 0 1 0 0 0
2 -0.286490 -0.096859 0 1 0 1 -0.125457 -0.566762 0.194855 -0.349299 ... 0 1 0 0 0 0 1 0 0 0
3 0.001901 -1.316115 0 0 0 1 -0.414628 -0.566762 0.194855 -0.349299 ... 0 1 0 0 0 0 1 0 0 0
4 1.539987 -0.096859 0 0 1 1 0.186846 -0.566762 0.194855 -0.349299 ... 0 1 0 0 0 0 1 0 0 0

5 rows × 49 columns


6. 特征选择



  1. 删除duration这一列
  2. 使用shape、info方法观察数据集最终的变量数、记录

Int64Index: 39803 entries, 0 to 41175
Data columns (total 49 columns):
age                     39803 non-null float64
education               39803 non-null float64
default                 39803 non-null int64
housing                 39803 non-null int64
loan                    39803 non-null int64
contact                 39803 non-null int64
duration                39803 non-null float64
campaign                39803 non-null float64
pdays                   39803 non-null float64
previous                39803 non-null float64
emp.var.rate            39803 non-null float64
cons.price.idx          39803 non-null float64
cons.conf.idx           39803 non-null float64
euribor3m               39803 non-null float64
nr.employed             39803 non-null float64
y                       39803 non-null int64
ageGroup                39803 non-null int64
job_admin.              39803 non-null uint8
job_blue-collar         39803 non-null uint8
job_entrepreneur        39803 non-null uint8
job_housemaid           39803 non-null uint8
job_management          39803 non-null uint8
job_retired             39803 non-null uint8
job_self-employed       39803 non-null uint8
job_services            39803 non-null uint8
job_student             39803 non-null uint8
job_technician          39803 non-null uint8
job_unemployed          39803 non-null uint8
marital_divorced        39803 non-null uint8
marital_married         39803 non-null uint8
marital_single          39803 non-null uint8
poutcome_failure        39803 non-null uint8
poutcome_nonexistent    39803 non-null uint8
poutcome_success        39803 non-null uint8
month_apr               39803 non-null uint8
month_aug               39803 non-null uint8
month_dec               39803 non-null uint8
month_jul               39803 non-null uint8
month_jun               39803 non-null uint8
month_mar               39803 non-null uint8
month_may               39803 non-null uint8
month_nov               39803 non-null uint8
month_oct               39803 non-null uint8
month_sep               39803 non-null uint8
day_of_week_fri         39803 non-null uint8
day_of_week_mon         39803 non-null uint8
day_of_week_thu         39803 non-null uint8
day_of_week_tue         39803 non-null uint8
day_of_week_wed         39803 non-null uint8
dtypes: float64(11), int64(6), uint8(32)
memory usage: 6.7 MB

6. 保存预处理数据


  1. 由于原始数据集中,样本是按照时间顺序排列的,因此这里需要将其打乱,变成无序数据集,以免在训练过程中出现过拟合。
  2. 对数据集进行持久化(保存为.csv文件),index=False表示不保存索引
from sklearn.utils import shuffle
df = shuffle(df)
