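Big Data in Practice (3): Data Preprocessing for the Portuguese Bank Dataset

Objectives

Preprocess the dataset so that it can be used for machine learning later. This covers handling missing values in several ways, converting variables to numeric types, filling missing values with a machine learning model, standardizing the numeric variables, and shuffling and persisting the data.

Requirements

  1. Handle the missing values in the dataset
  2. Convert the non-numeric variables
  3. Standardize the dataset
  4. Save the preprocessed dataset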

Procedure


 

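Variables

Bank client information:

  • 1 - age: age (numeric)
  • 2 - job: type of job. Administrative ('admin.'), blue-collar ('blue-collar'), entrepreneur ('entrepreneur'), housemaid ('housemaid'), management ('management'), retired ('retired'), self-employed ('self-employed'), services ('services'), student ('student'), technician ('technician'), unemployed ('unemployed'), unknown ('unknown')
  • 3 - marital: marital status. Divorced ('divorced'), married ('married'), single ('single'), unknown ('unknown'). Note: divorced also covers widowed
  • 4 - education: education level. Basic 4 years ('basic.4y'), basic 6 years ('basic.6y'), basic 9 years ('basic.9y'), high school ('high.school'), illiterate ('illiterate'), professional course ('professional.course'), university degree ('university.degree'), unknown ('unknown')
  • 5 - default: has credit in default? ('no','yes','unknown')
  • 6 - housing: has a housing loan? ('no','yes','unknown')
  • 7 - loan: has a personal loan? (categorical: 'no','yes','unknown')

    Information about the contact:

  • 8 - contact: contact type, mobile ('cellular') or landline ('telephone')
  • 9 - month: month of the last contact in the year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
  • 10 - day_of_week: day of the week of the last contact (categorical: 'mon','tue','wed','thu','fri')
  • 11 - duration: duration of the last contact in seconds. Important: this attribute strongly affects the output target (for example, if duration=0 then y='no'). However, the duration is not known before a call is made, and once the call ends y is obviously known. This input should therefore only be included for benchmarking and should be discarded if the goal is a realistic predictive model (the call duration is unknown at prediction time).

    Other attributes:

  • 12 - campaign: number of contacts made to this client during the current campaign (numeric, includes the last contact)
  • 13 - pdays: number of days since the client was last contacted in a previous campaign (numeric; 999 means the client has not been contacted before)
  • 14 - previous: number of contacts made to this client before the current campaign (numeric)
  • 15 - poutcome: outcome of the previous campaign ('failure','nonexistent','success')

    Social and economic attributes:

  • 16 - emp.var.rate: employment variation rate, quarterly indicator (numeric)
  • 17 - cons.price.idx: consumer price index, monthly indicator (numeric)
  • 18 - cons.conf.idx: consumer confidence index, monthly indicator (numeric)
  • 19 - euribor3m: Euribor 3-month rate, daily indicator (numeric)
  • 20 - nr.employed: number of employees, quarterly indicator (numeric)

    Output variable (target):

  • 21 - y: did the client make a deposit (was the marketing successful)? (binary: 'yes','no')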
 

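Data preprocessing

 

1. Loading the data

  • Load the data and inspect it with head()
  • To simplify later processing, store the names of the categorical and numeric variables in separate lists
    numberVar=['age',...]
    categoryVar = [ ...]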

import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
df=pd.read_csv("bank-additional-full.csv",sep=';')
df.shape
(41188, 21)
numberVar=['age','duration','campaign','pdays','previous','emp.var.rate','cons.price.idx','cons.conf.idx','euribor3m','nr.employed']
categoryVar=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome','y']
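
As a cross-check on the hand-written lists, the column names can also be derived from the column dtypes. The following is a minimal sketch on a small made-up frame (the `toy` data and the names `number_var` and `category_var` are hypothetical, not taken from the bank file):

```python
import pandas as pd

# Hypothetical miniature of the bank data: two numeric and two text columns
toy = pd.DataFrame({
    "age": [56, 57, 37],
    "euribor3m": [4.857, 4.857, 4.857],
    "job": ["housemaid", "services", "services"],
    "y": ["no", "no", "yes"],
})

# Numeric columns are int/float; everything else is treated as categorical
number_var = toy.select_dtypes(include="number").columns.tolist()
category_var = toy.select_dtypes(exclude="number").columns.tolist()
print(number_var)
print(category_var)
```

On the real data this only works before any categorical column has been mapped to integers, which is one reason to build the lists right after loading.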

 

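2. Handling missing values

The dataset has 20 input features, split into numeric and categorical variables. The earlier data summary shows that the numeric variables (int64 and float64) have no missing entries, while the non-numeric variables may contain the value unknown. This section asks you to:

  1. Check the proportion of missing values in each variable
  2. Group the variables that have missing values into high, medium, and low missing rates

2.1 Checking for missing values

  • df.isnull().any() reports no feature with missing values (NaN).
  • In this dataset, however, missing values are encoded in other ways. Most categorical variables use unknown to mark a missing value, while poutcome uses nonexistent; among the numeric variables only pdays has missing values, encoded as the number 999. This step asks you to print the missing-value ratio of every variable that has missing values.

Check the missing-value ratio of all categorical variables plus pdays. Unlike the demo, three values (unknown, nonexistent, 999) all count as missing: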

 
cols = categoryVar + ['pdays']
total = df.shape[0]
for col in cols:
    v = df[col].value_counts().to_dict()
    if 'unknown' in v:
        unCount = v['unknown']
    elif 'nonexistent' in v:
        unCount = v['nonexistent']
    elif 999 in v:  # pdays is an integer column, so the sentinel is the int 999, not the string '999'
        unCount = v[999]
    else:
        continue
    print("%-10s: %5.1f%%" % (col, unCount/total*100))
job       :   0.8%
marital   :   0.2%
education :   4.2%
default   :  20.9%
housing   :   2.4%
loan      :   2.4%
poutcome  :  86.3%
pdays     :  96.3%
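
The same check can be written without the if/elif chain by stating the sentinel of each column explicitly. This is only a sketch on made-up data; the `sentinels` dictionary and the `toy` frame are assumptions for illustration:

```python
import pandas as pd

# Hypothetical toy frame; the sentinel per column mirrors the conventions described above
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "admin."],
    "poutcome": ["nonexistent", "nonexistent", "failure", "nonexistent"],
    "pdays": [999, 999, 6, 999],
})
sentinels = {"job": "unknown", "poutcome": "nonexistent", "pdays": 999}

# (column == sentinel) is a boolean Series; its mean is the share of missing entries
missing_ratio = {col: (toy[col] == s).mean() * 100 for col, s in sentinels.items()}
for col, pct in missing_ratio.items():
    print("%-10s: %5.1f%%" % (col, pct))
```

Naming the sentinel per column avoids the type mismatch that an integer marker such as 999 can cause when it is looked up as a string.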
 

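2.2 Variables with a high missing rate

 
  1. Plot a histogram of pdays and analyze it: roughly what range do the non-missing pdays values fall in?
  2. Use a crosstab of pdays and poutcome to see how the values of the two variables relate, and draw further conclusions from the data

Plot a histogram of the non-missing part of pdays:

dfPdays=df.loc[df.pdays != 999, 'pdays']

Plot the histogram from dfPdays and, together with .value_counts(), work out which time range most of the contact intervals fall in.

# Plot a histogram of pdays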
dfPdays = df.loc[df.pdays!=999,'pdays']
plt.hist(dfPdays,bins=30,rwidth=0.8)
(array([ 15.,  26.,  61., 439., 118.,  46., 412.,  60.,  18.,   0.,  64.,
         52.,  28.,  58.,  36.,  20.,  24.,  11.,   8.,   0.,   7.,   3.,
          1.,   2.,   3.,   0.,   0.,   1.,   1.,   1.]),
 array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ,
         9.9, 10.8, 11.7, 12.6, 13.5, 14.4, 15.3, 16.2, 17.1, 18. , 18.9,
        19.8, 20.7, 21.6, 22.5, 23.4, 24.3, 25.2, 26.1, 27. ]),
 )
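
To go with the histogram, value_counts() can answer the "which range" question numerically. A small sketch on invented values (the `pdays` Series below is hypothetical, not the real column):

```python
import pandas as pd

# Hypothetical pdays values; 999 is the "never contacted" sentinel
pdays = pd.Series([999, 3, 6, 3, 999, 6, 6, 12, 999, 3])

valid = pdays[pdays != 999]
counts = valid.value_counts().sort_index()   # frequency of each interval, ordered by day
cum_share = counts.cumsum() / counts.sum()   # share of contacts made within k days
print(counts)
print(cum_share)
```

Reading off the first day at which `cum_share` passes, say, 0.9 gives a concrete range for "most" of the intervals.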
 
 

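Although pdays and poutcome are mostly missing, the records that are present still carry useful information. The heat map from the earlier analysis shows that pdays (-0.31) and poutcome (-0.13) correlate with the marketing outcome more strongly than many other variables, so despite the high missing rate these columns are not dropped and are kept as they are.

Use a crosstab to examine the relationship between pdays and poutcome. To make the table easier to read, round pdays down to a multiple of 5 so that it becomes a time bucket (the same idea as age groups).

pdaysDf = df['pdays'].apply(lambda x: int(x /5 )*5)
pd.crosstab(pdaysDf,df['poutcome']) # show the crosstab
poutcome  failure  nonexistent  success
pdays
0               6            0      653
5              74            0      526
10             36            0      158
15             22            0       31
20              3            0        3
25              1            0        2
995          4110        35563        0

The 995 row is the bucket that holds the sentinel 999: every nonexistent record falls there, together with 4110 failure records.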
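
The bucketing step can also be done with integer floor division on the whole column at once. A minimal sketch on made-up records (the `toy` frame is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "pdays": [3, 6, 999, 12, 999, 4],
    "poutcome": ["success", "failure", "nonexistent", "success", "failure", "success"],
})

# Integer floor division buckets pdays into 5-day bins: 0-4 -> 0, 5-9 -> 5, ..., 999 -> 995
bins = toy["pdays"] // 5 * 5
table = pd.crosstab(bins, toy["poutcome"])
print(table)
```

For non-negative integers this gives the same buckets as int(x / 5) * 5 while avoiding a Python-level call per row.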
 

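2.3 Analyzing and handling missing values in default (credit default)

default: 20.9% of the values are missing, so the missing values are analyzed and repaired.
Requirements:

  1. What does the value distribution of default suggest?
  2. Describe the characteristics of the customers whose default record is missing. (Take the customer-information variables one at a time and visualize each against default.)
  3. Explain the final treatment of default: why are the unknown and yes records merged?
 

Before repairing default, look at its value counts (using value_counts()).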

df['default'].value_counts()
no         32588
unknown     8597
yes            3
Name: default, dtype: int64
 

Define the following function. The first argument is the DataFrame and the second is the column to compare against default.

def defaultAsso(dataset, col):
    tab = pd.crosstab(dataset['default'],dataset[col]).apply(lambda x: x/x.sum() * 100)
    tab_pct = tab.transpose()
    x = tab_pct.index.values
    plt.figure(figsize=(14,3))
    plt.plot(x, tab_pct['unknown'],color='green', label='unknown')
    plt.plot(x, tab_pct['yes'],color='blue', label='yes')
    plt.plot(x, tab_pct['no'],color='red', label='no')
    plt.legend() 
    plt.xlabel(col)
    plt.ylabel('rate')
    plt.show()

defaultAsso(df,'job')
defaultAsso(df,'education')
defaultAsso(df,'marital')
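
The percentages that defaultAsso plots come from dividing each crosstab column by its total. pandas can do the same normalization directly; the sketch below compares the two on invented data (the `toy` frame is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "default": ["no", "unknown", "no", "no", "unknown", "yes"],
    "job": ["admin.", "admin.", "services", "services", "services", "admin."],
})

# Manual version, as in defaultAsso: divide each column by its total
manual = pd.crosstab(toy["default"], toy["job"]).apply(lambda x: x / x.sum() * 100)
# Built-in version: normalize="columns" makes every column sum to 1
builtin = pd.crosstab(toy["default"], toy["job"], normalize="columns") * 100
print(builtin.round(1))
```

Each column then reads as "of the customers with this job, what percentage is no, unknown, or yes", which is what the line plots show.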
 

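Age has to be converted into age groups first:

def get_age_group(age):
    if age <30:
        return 2
    elif age>60:
        return 6
    else:
        return age//10
df['ageGroup'] =df['age'].apply(lambda x:get_age_group(x))  # compute the age group of each record
defaultAsso(df,'ageGroup')  # compare default against the age group
# Meant to remove the temporary ageGroup column. drop() returns a new DataFrame and the
# result is not assigned here, so df itself keeps ageGroup (it shows up in df.info() later).
df.drop('ageGroup',axis=1)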
       age          job  marital            education  default  housing  loan    contact  month  day_of_week  ...  campaign  pdays  previous     poutcome  emp.var.rate  cons.price.idx  cons.conf.idx  euribor3m  nr.employed    y
0       56    housemaid  married             basic.4y       no       no    no  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857       5191.0   no
1       57     services  married          high.school  unknown       no    no  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857       5191.0   no
2       37     services  married          high.school       no      yes    no  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857       5191.0   no
3       40       admin.  married             basic.6y       no       no    no  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857       5191.0   no
4       56     services  married          high.school       no       no   yes  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857       5191.0   no
...
41183   73      retired  married  professional.course       no      yes    no   cellular    nov          fri  ...         1    999         0  nonexistent          -1.1          94.767          -50.8      1.028       4963.6  yes
41184   46  blue-collar  married  professional.course       no       no    no   cellular    nov          fri  ...         1    999         0  nonexistent          -1.1          94.767          -50.8      1.028       4963.6   no
41185   56      retired  married    university.degree       no      yes    no   cellular    nov          fri  ...         2    999         0  nonexistent          -1.1          94.767          -50.8      1.028       4963.6   no
41186   44   technician  married  professional.course       no       no    no   cellular    nov          fri  ...         1    999         0  nonexistent          -1.1          94.767          -50.8      1.028       4963.6  yes
41187   74      retired  married  professional.course       no      yes    no   cellular    nov          fri  ...         3    999         1      failure          -1.1          94.767          -50.8      1.028       4963.6   no

41188 rows × 21 columns
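
The age-group rule (under 30 maps to 2, over 60 maps to 6, otherwise the tens digit) can also be expressed without a per-row function. The sketch below checks a vectorized form against the original function on a few invented ages; `vector_groups` is a hypothetical name:

```python
import pandas as pd

def get_age_group(age):
    # Same rule as in the text: under 30 -> 2, over 60 -> 6, otherwise the tens digit
    if age < 30:
        return 2
    elif age > 60:
        return 6
    return age // 10

ages = pd.Series([17, 29, 30, 45, 59, 60, 61, 98])
loop_groups = ages.apply(get_age_group)
# Vectorized equivalent: take the tens digit and clamp it to the range [2, 6]
vector_groups = (ages // 10).clip(lower=2, upper=6)
print(vector_groups.tolist())
```

The two agree because ages below 30 have a tens digit of at most 2 and ages of 60 or more have a tens digit of at least 6.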

 

Based on the analysis above, merge the unknown and yes records of default during preprocessing (use map to send unknown and yes to the same value), then check the result with value_counts().

df['default']=df['default'].map({'unknown':1 ,'yes':1,'no':0})
df['default'].value_counts()
0    32588
1     8600
Name: default, dtype: int64

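2.4 Variables with a very small missing rate

 

2.4.1 Dropping records with missing values

 
  • job and marital have only a few missing values, under one percent of the records, so drop the records where job or marital is unknown
  • After dropping, call value_counts() to confirm that the missing values are really gone. The job column is handled first as an example: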
df.drop(df[df.job == 'unknown'].index,inplace = True,axis=0)
df.job.value_counts()
admin.           10422
blue-collar       9254
technician        6743
services          3969
management        2924
retired           1720
entrepreneur      1456
self-employed     1421
housemaid         1060
unemployed        1014
student            875
Name: job, dtype: int64
df.drop(df[df.marital == 'unknown'].index,inplace = True,axis=0)
df.marital.value_counts()
married     24694
single      11494
divorced     4599
Name: marital, dtype: int64
pd.crosstab(df['job'],df['marital'])
 
marital        divorced  married  single
job
admin.             1280     5253    3875
blue-collar         728     6687    1825
entrepreneur        179     1071     203
housemaid           161      777     119
management          331     2089     501
retired             348     1274      93
self-employed       133      904     379
services            532     2294    1137
student               9       41     824
technician          774     3670    2287
unemployed          124      634     251
 
df['housing'].value_counts()
yes        21376
no         18427
unknown      984
Name: housing, dtype: int64
df['loan'].value_counts()
no         33620
yes         6183
unknown      984
Name: loan, dtype: int64
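
Dropping the unknown records of job and marital can also be written as a single boolean filter. A minimal sketch on made-up rows (the `toy` frame and the name `has_unknown` are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "student"],
    "marital": ["married", "single", "unknown", "single"],
})

# A row is dropped if job OR marital carries the 'unknown' marker
has_unknown = toy[["job", "marital"]].eq("unknown").any(axis=1)
clean = toy[~has_unknown]
print(clean)
```

Filtering returns a new frame and leaves the original untouched, which can make it easier to compare row counts before and after.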
 

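2.4.2 Missing values that occur together

 
  • According to the heat map, loan is most closely related to housing, with education next, so use a crosstab to examine housing against loan.
  • Drop the records where housing is missing. housing and loan each have exactly 984 unknown values, and once the housing unknowns are dropped loan has none left, which shows that the two are missing in the same records.
  • Call value_counts() on housing and on loan to check that the missing values are gone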
pd.crosstab(df['housing'],df['loan'])
df.drop(df[df.housing == 'unknown'].index,inplace = True,axis=0)
df['housing'].value_counts()
 
yes    21376
no     18427
Name: housing, dtype: int64
df['loan'].value_counts()
no     33620
yes     6183
Name: loan, dtype: int64
pd.crosstab(df['housing'],df['loan'])
 
loan        no   yes
housing
no       15897  2530
yes      17723  3653
pd.crosstab(df['job'],df['loan'])
loan             no   yes
job
admin.         8472  1709
blue-collar    7636  1365
entrepreneur   1212   205
housemaid       874   154
management     2411   439
retired        1431   240
self-employed  1182   194
services       3263   599
student         709   142
technician     5596   988
unemployed      834   148
pd.crosstab(df['housing'],df['marital'])
marital  divorced  married  single
housing
no           2086    11273    5068
yes          2392    12837    6147
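
Whether two columns are missing in exactly the same records can be tested directly before deleting anything. The following sketch uses invented data; `same_rows` is a hypothetical name:

```python
import pandas as pd

toy = pd.DataFrame({
    "housing": ["yes", "unknown", "no", "unknown", "yes"],
    "loan":    ["no",  "unknown", "no", "unknown", "yes"],
})

housing_missing = toy["housing"].eq("unknown")
loan_missing = toy["loan"].eq("unknown")
# True only if the two columns are missing in exactly the same rows
same_rows = bool((housing_missing == loan_missing).all())
print(same_rows)
print(pd.crosstab(housing_missing, loan_missing))
```

When the check is true, deleting the unknown rows of one column removes those of the other as well, which matches what the value_counts() outputs above show.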
 

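The only missing values left are those in education. About 1.5k records are affected, too many to drop outright, so they will be filled in with a random forest once all variables have been converted to numbers.

3. Converting categorical variables to numbers

For the categorical variables to take part in model computation they have to be converted to numbers, that is, encoded. Every categorical variable that has not been encoded yet (education, job, contact, housing, and loan among them) therefore still needs converting.
Categorical variables fall into binary, ordinal, and nominal (unordered) variables, and each kind is encoded differently.

3.1 Variables with only two values

Binary encoding: in this dataset y, default, contact, housing, and loan now each take only two values, so they can be encoded as 0 and 1. default was already converted to 0 and 1 in an earlier step.
Requirements:

  1. Use map to convert the values of y, contact, housing, and loan to 0 and 1
  2. Use df[['y','default','contact','housing','loan']].head() to check that these variables were converted correctly: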
 
df['y'].value_counts()
no     35316
yes     4487
Name: y, dtype: int64
df['y'] = df['y'].map({'no':0, 'yes':1})
df['contact']=df['contact'].map({'cellular':0,'telephone':1})
df['housing'] = df['housing'].map({"no":0, "yes":1})
df['loan'] = df['loan'].map({"no":0, "yes":1})
df.y.value_counts()  # check the target variable; no missing values found
0    35316
1     4487
Name: y, dtype: int64
df[['y','default','contact','housing','loan']].head()
   y  default  contact  housing  loan
0  0        0        1        0     0
1  0        1        1        0     0
2  0        0        1        1     0
3  0        0        1        0     0
4  0        0        1        0     1
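
One thing to keep in mind with map: any value that is not a key of the dictionary becomes NaN without a warning, which is why the unknown records had to be removed before this step. A small sketch on made-up data (the `toy` frame is hypothetical) shows how to list such values:

```python
import pandas as pd

toy = pd.DataFrame({"housing": ["no", "yes", "unknown", "yes"]})

# map() turns every value that is missing from the dictionary into NaN
encoded = toy["housing"].map({"no": 0, "yes": 1})
unmapped = toy.loc[encoded.isna(), "housing"].unique().tolist()
print(unmapped)
```

An empty `unmapped` list is a quick confirmation that the mapping covered every value in the column.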
 

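3.2 Encoding ordinal variables

Looking at the values of education, we can treat it as an ordinal variable ranked by level of schooling. From lowest to highest influence the order is "illiterate", "basic.4y", "basic.6y", "basic.9y", "high.school", "professional.course", "university.degree", encoded as 1, 2, 3, ... in that order. The missing value unknown has no place in this ranking, so for convenience it is set to 0 for now and corrected later.

After the conversion, call value_counts() to check that education was converted correctly.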

values = ["unknown","illiterate", "basic.4y", "basic.6y", "basic.9y", "high.school",  "professional.course", "university.degree"]
levels = range(0,len(values))
dict_levels = dict(zip(values, levels))
for v in values:
    df.loc[df['education'] == v, 'education'] = dict_levels[v]
df['education'].value_counts()
7    11821
5     9244
4     5856
6     5100
2     4002
3     2204
0     1558
1       18
Name: education, dtype: int64
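
The same ordinal encoding can be expressed as one map call with a dictionary built from the ordered list. This is a sketch on a few invented values, not the author's code:

```python
import pandas as pd

values = ["unknown", "illiterate", "basic.4y", "basic.6y", "basic.9y",
          "high.school", "professional.course", "university.degree"]
# Position in the list is the code: unknown -> 0, illiterate -> 1, ..., university.degree -> 7
dict_levels = {name: level for level, name in enumerate(values)}

toy = pd.Series(["basic.4y", "unknown", "university.degree", "high.school"])
encoded = toy.map(dict_levels).astype(int)
print(encoded.tolist())
```

Converting with astype(int) afterwards makes sure the column ends up with an integer dtype, which scikit-learn classifiers expect for class labels.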
 

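3.3 Converting nominal variables to dummy variables

From the variable descriptions above, job, marital, poutcome, month, and day_of_week can be treated as nominal (unordered) categorical variables. Note that month and day_of_week are ordered in time, but with respect to the target variable they are unordered. Nominal variables can be encoded with one-hot encoding.
One-hot encoding represents N states with an N-bit state register: each state has its own register bit, and exactly one bit is active at any time.

Requirements

  1. Convert the nominal variables in this dataset (job, marital, poutcome, month, day_of_week) to dummy variables (one-hot encoding)
  2. Call df.info() to see how the variables changed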
df = pd.get_dummies(df, columns = ['job','marital','poutcome','month','day_of_week'])
df.info()

Int64Index: 39803 entries, 0 to 41187
Data columns (total 49 columns):
age                     39803 non-null int64
education               39803 non-null int64
default                 39803 non-null int64
housing                 39803 non-null int64
loan                    39803 non-null int64
contact                 39803 non-null int64
duration                39803 non-null int64
campaign                39803 non-null int64
pdays                   39803 non-null int64
previous                39803 non-null int64
emp.var.rate            39803 non-null float64
cons.price.idx          39803 non-null float64
cons.conf.idx           39803 non-null float64
euribor3m               39803 non-null float64
nr.employed             39803 non-null float64
y                       39803 non-null int64
ageGroup                39803 non-null int64
job_admin.              39803 non-null uint8
job_blue-collar         39803 non-null uint8
job_entrepreneur        39803 non-null uint8
job_housemaid           39803 non-null uint8
job_management          39803 non-null uint8
job_retired             39803 non-null uint8
job_self-employed       39803 non-null uint8
job_services            39803 non-null uint8
job_student             39803 non-null uint8
job_technician          39803 non-null uint8
job_unemployed          39803 non-null uint8
marital_divorced        39803 non-null uint8
marital_married         39803 non-null uint8
marital_single          39803 non-null uint8
poutcome_failure        39803 non-null uint8
poutcome_nonexistent    39803 non-null uint8
poutcome_success        39803 non-null uint8
month_apr               39803 non-null uint8
month_aug               39803 non-null uint8
month_dec               39803 non-null uint8
month_jul               39803 non-null uint8
month_jun               39803 non-null uint8
month_mar               39803 non-null uint8
month_may               39803 non-null uint8
month_nov               39803 non-null uint8
month_oct               39803 non-null uint8
month_sep               39803 non-null uint8
day_of_week_fri         39803 non-null uint8
day_of_week_mon         39803 non-null uint8
day_of_week_thu         39803 non-null uint8
day_of_week_tue         39803 non-null uint8
day_of_week_wed         39803 non-null uint8
dtypes: float64(5), int64(12), uint8(32)
memory usage: 6.7 MB
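
To make the one-hot idea concrete, here is a minimal sketch on a four-row made-up frame (the `toy` data is hypothetical). Each category gets its own 0/1 column and exactly one of them is 1 per row:

```python
import pandas as pd

toy = pd.DataFrame({
    "marital": ["married", "single", "divorced", "married"],
    "age": [56, 25, 57, 40],
})

# Each category of marital becomes its own 0/1 column; age is left untouched
dummies = pd.get_dummies(toy, columns=["marital"], dtype=int)
print(dummies)
```

The new columns are named after the original column and the category, the same pattern as job_admin. or month_may in the df.info() output above.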
 

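4. Filling missing values with a random forest

The missing values of education are predicted with machine learning: the other variables are used to predict the most likely value of each missing entry.
Steps:

  1. Split the dataset into a training set and a test set. Records where education is known go into the training set, and records where it is missing go into the test set. education is the prediction target (note that this differs from the dataset's usual target, whether the marketing succeeded)
  2. Train a model on the training set and apply it to the test set

Parameters:

  • trainX: input variables of the training set
  • trainY: target values of the training set
  • testX: input variables of the test set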
from sklearn.ensemble import RandomForestClassifier
def train_predict_unknown(trainX, trainY, testX):
    forest = RandomForestClassifier(n_estimators=100)
    forest = forest.fit(trainX, trainY)
    test_predictY = forest.predict(testX).astype(int)
    # Return a Series aligned with testX so it can be assigned directly to a column
    return pd.Series(test_predictY, index=testX.index)
# Records with a known education form the training set; records where it is unknown (equal to 0) form the test set
test_data = df[df['education'] == 0].copy()   # education == 0: test set (copied so it can be modified safely)
train_data = df[df['education'] != 0].copy()  # education != 0: training set
# education is the target: split the training set into target and input variables
trainY = train_data['education'].astype(int)   # target column, forced to integer class labels
trainX = train_data.drop('education', axis=1)  # inputs: every column except education
testX = test_data.drop('education', axis=1)    # same for the test set

Predict the missing education values with the model:

test_data['education'] = train_predict_unknown(trainX, trainY, testX)

Use value_counts to look at education in test_data and check that every missing value has been filled:

test_data['education'].value_counts()
7    446
5    383
2    261
4    256
6    165
3     47
Name: education, dtype: int64
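
The whole imputation idea fits into a few lines on synthetic data. The sketch below is an illustration under stated assumptions, not the experiment itself: the `toy` frame, the `job_code` column, and the rule linking it to education are all invented so that the model has something to learn.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({"age": rng.integers(20, 70, n), "job_code": rng.integers(0, 5, n)})
# Hypothetical rule so there is something to learn: education level 1..3 depends on job_code
toy["education"] = toy["job_code"] % 3 + 1
toy.loc[toy.index[:20], "education"] = 0   # mark the first 20 rows as missing (0)

known = toy[toy["education"] != 0]
missing = toy[toy["education"] == 0]
model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(known.drop(columns="education"), known["education"])
# Write the predictions back into the original frame at the rows that were missing
toy.loc[missing.index, "education"] = model.predict(missing.drop(columns="education"))
print(toy["education"].value_counts().sort_index())
```

Because the classifier can only output labels it saw during training, the placeholder 0 cannot reappear after imputation, which is the property checked with value_counts() above.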
 

Merge the test set and the training set back into one table:

df = pd.concat([train_data, test_data])
df.shape
(39803, 49)
 

Check that education in the merged table now only takes values from 1 to 7 (the missing marker 0 is gone), and look at the whole table with df.head()

df['education'].value_counts()
df.head()
   age  education  default  housing  loan  contact  duration  campaign  pdays  previous  ...  month_mar  month_may  month_nov  month_oct  month_sep  day_of_week_fri  day_of_week_mon  day_of_week_thu  day_of_week_tue  day_of_week_wed
0   56          2        0        0     0        1       261         1    999         0  ...          0          1          0          0          0                0                1                0                0                0
1   57          5        1        0     0        1       149         1    999         0  ...          0          1          0          0          0                0                1                0                0                0
2   37          5        0        1     0        1       226         1    999         0  ...          0          1          0          0          0                0                1                0                0                0
3   40          3        0        0     0        1       151         1    999         0  ...          0          1          0          0          0                0                1                0                0                0
4   56          5        0        0     1        1       307         1    999         0  ...          0          1          0          0          0                0                1                0                0                0

5 rows × 49 columns

 

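5. Standardizing the numeric variables

Not every algorithm needs standardized numeric variables. Some are sensitive to it, for example logistic regression, support vector machines, and neural networks, while random forests and decision trees do not need it. To keep the choice of algorithm open, everything is standardized here.
In this example all numeric variables are standardized, and education, being an ordinal variable, is standardized as well.

from sklearn.preprocessing import StandardScaler
def scaleColumns(data, cols_to_scale):
    # StandardScaler standardizes each column independently, so all columns can be scaled in one call
    scaler = StandardScaler()
    data[cols_to_scale] = scaler.fit_transform(data[cols_to_scale])
    return data
df = scaleColumns(df,numberVar+['education'])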
df.head()
         age  education  default  housing  loan  contact   duration   campaign     pdays   previous  ...  month_mar  month_may  month_nov  month_oct  month_sep  day_of_week_fri  day_of_week_mon  day_of_week_thu  day_of_week_tue  day_of_week_wed
0   1.539987  -1.925742        0        0     0        1   0.009489  -0.566762  0.194855  -0.349299  ...          0          1          0          0          0                0                1                0                0                0
1   1.636117  -0.096859        1        0     0        1  -0.422339  -0.566762  0.194855  -0.349299  ...          0          1          0          0          0                0                1                0                0                0
2  -0.286490  -0.096859        0        1     0        1  -0.125457  -0.566762  0.194855  -0.349299  ...          0          1          0          0          0                0                1                0                0                0
3   0.001901  -1.316115        0        0     0        1  -0.414628  -0.566762  0.194855  -0.349299  ...          0          1          0          0          0                0                1                0                0                0
4   1.539987  -0.096859        0        0     1        1   0.186846  -0.566762  0.194855  -0.349299  ...          0          1          0          0          0                0                1                0                0                0

5 rows × 49 columns
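
Standardization itself is just z = (x - mean) / std applied per column. The sketch below does it by hand with pandas on made-up numbers (the `toy` frame is hypothetical); it uses the population standard deviation (ddof=0), which is the convention StandardScaler follows:

```python
import pandas as pd

toy = pd.DataFrame({"age": [56, 57, 37, 40, 56], "campaign": [1, 1, 2, 5, 1]})

# z = (x - mean) / std, with the population standard deviation (ddof=0)
scaled = (toy - toy.mean()) / toy.std(ddof=0)
print(scaled.round(3))
```

After the transformation every column has mean 0 and standard deviation 1, so variables measured on very different scales become comparable.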

 

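6. Feature selection

 

Raw data is sometimes very high-dimensional, and the higher the dimension, the more sparsely the data is spread along each feature, which is disastrous for most machine learning algorithms (the curse of dimensionality). When effective features cannot be hand-picked, algorithms such as PCA are needed to reduce the dimensionality so that statistical learning methods can be applied. If a small set of good features can be selected, however, dimensionality reduction such as PCA is not really necessary. In this experiment the features are already fairly representative and not too numerous, so no dimensionality reduction should be needed.
As discussed earlier, the value of duration (the length of the last call with the customer) is only known once the call has ended. The point of the prediction is to reduce the workload of the marketing staff, and predicting whether a customer should be contacted after the call has already been made is of no value. This variable should therefore not be an input to the prediction model.

  1. Delete the duration column
  2. Use shape and info to look at the final number of variables and records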
df = df.drop(['duration'],axis=1)  # drop() returns a new DataFrame, so assign the result back
df.info()

Int64Index: 39803 entries, 0 to 41175
Data columns (total 48 columns):
age                     39803 non-null float64
education               39803 non-null float64
default                 39803 non-null int64
housing                 39803 non-null int64
loan                    39803 non-null int64
contact                 39803 non-null int64
campaign                39803 non-null float64
pdays                   39803 non-null float64
previous                39803 non-null float64
emp.var.rate            39803 non-null float64
cons.price.idx          39803 non-null float64
cons.conf.idx           39803 non-null float64
euribor3m               39803 non-null float64
nr.employed             39803 non-null float64
y                       39803 non-null int64
ageGroup                39803 non-null int64
job_admin.              39803 non-null uint8
job_blue-collar         39803 non-null uint8
job_entrepreneur        39803 non-null uint8
job_housemaid           39803 non-null uint8
job_management          39803 non-null uint8
job_retired             39803 non-null uint8
job_self-employed       39803 non-null uint8
job_services            39803 non-null uint8
job_student             39803 non-null uint8
job_technician          39803 non-null uint8
job_unemployed          39803 non-null uint8
marital_divorced        39803 non-null uint8
marital_married         39803 non-null uint8
marital_single          39803 non-null uint8
poutcome_failure        39803 non-null uint8
poutcome_nonexistent    39803 non-null uint8
poutcome_success        39803 non-null uint8
month_apr               39803 non-null uint8
month_aug               39803 non-null uint8
month_dec               39803 non-null uint8
month_jul               39803 non-null uint8
month_jun               39803 non-null uint8
month_mar               39803 non-null uint8
month_may               39803 non-null uint8
month_nov               39803 non-null uint8
month_oct               39803 non-null uint8
month_sep               39803 non-null uint8
day_of_week_fri         39803 non-null uint8
day_of_week_mon         39803 non-null uint8
day_of_week_thu         39803 non-null uint8
day_of_week_tue         39803 non-null uint8
day_of_week_wed         39803 non-null uint8
dtypes: float64(10), int64(6), uint8(32)
memory usage: 6.4 MB
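
A point that is easy to miss when deleting a column: DataFrame.drop returns a new DataFrame by default and leaves the original untouched, so the result has to be assigned (or inplace=True passed) for the column to actually disappear. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"duration": [261, 149], "campaign": [1, 1], "y": [0, 0]})

toy.drop(["duration"], axis=1)          # result discarded: toy is unchanged
still_there = "duration" in toy.columns

toy = toy.drop(["duration"], axis=1)    # result assigned: the column is gone
gone = "duration" not in toy.columns
print(still_there, gone)
```

Checking the column list or df.shape right after the call is a cheap way to confirm that the deletion took effect.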

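7. Saving the preprocessed data

Save the preprocessed data so that later machine learning work can use it directly without repeating the preprocessing.
Requirements:

  1. The samples in the original dataset are ordered by time, so shuffle them into an unordered dataset; otherwise the time ordering can bias training (for example, overfitting to whichever period a batch or split happens to cover).
  2. Persist the dataset (save it as a .csv file); index=False means the index is not saved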
from sklearn.utils import shuffle
df = shuffle(df)
df.to_csv('bank-preprocess.csv',index=False)
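
For a repeatable shuffle, pandas alone is enough. The sketch below is an alternative under stated assumptions, not the code used above: it shuffles a made-up frame with a fixed seed and round-trips it through an in-memory buffer that stands in for the .csv file (the `toy` frame and the seed 42 are hypothetical):

```python
import io
import pandas as pd

toy = pd.DataFrame({"age": [56, 57, 37, 40], "y": [0, 0, 1, 0]})

# sample(frac=1) returns all rows in random order; random_state makes the order repeatable
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)

buffer = io.StringIO()                 # stands in for a file on disk
shuffled.to_csv(buffer, index=False)   # index=False: the row index is not written
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.equals(shuffled))
```

Fixing the seed means the saved file can be regenerated exactly, which helps when comparing models trained at different times.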
