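This data relates to the direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether a client will subscribe to a term deposit (variable y).
About the dataset: the marketing campaigns were based on phone calls. Often, more than one contact with the same client was required to determine whether the product (a bank term deposit) would be subscribed to ('yes') or not ('no'). The training-set features are as follows: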
import numpy as np
import pandas as pd
import random
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import GridSearchCV
train_data_file = "bank-full.csv"
df = pd.read_csv(train_data_file,sep=';')
Inspect the data
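The inspection step itself is not shown in this post. A minimal sketch of what it might look like, assuming the semicolon-separated layout of `bank-full.csv`; the three-row sample is made up for illustration so that the snippet runs without the file.

```python
import io
import pandas as pd

# Tiny made-up sample in the same semicolon-separated layout (illustration only).
sample_csv = io.StringIO(
    'age;job;marital;education;y\n'
    '58;management;married;tertiary;no\n'
    '44;technician;single;secondary;no\n'
    '33;unknown;married;unknown;yes\n'
)
sample = pd.read_csv(sample_csv, sep=';')

print(sample.head())    # first rows
print(sample.dtypes)    # column types
print(sample.shape)     # (rows, columns)

# Count how often the placeholder 'unknown' appears in each column.
unknown_counts = sample.eq('unknown').sum()
print(unknown_counts)
```

On the real data, the same calls would be applied to `df` right after `pd.read_csv`.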
3. Show the feature variables as bar charts
categorcial_variables = ['job', 'marital', 'education', 'default', 'loan', 'contact', 'month', 'poutcome','y']
for col in categorcial_variables:
    plt.figure(figsize=(10,4))
    # Number of rows per category, most frequent first
    counts = df[col].value_counts()
    # Pass x and y by keyword; recent seaborn releases do not accept them positionally
    sns.barplot(x=counts.values, y=counts.index)
    plt.title(col)
    plt.tight_layout()
categorcial_variables = ['job', 'marital', 'education', 'default', 'loan', 'contact', 'month', 'poutcome','y']
for col in categorcial_variables:
    plt.figure(figsize=(10,4))
    # Returns counts of unique values for each outcome for each feature.
    pos_counts = df.loc[df.y.values == 'yes', col].value_counts()
    neg_counts = df.loc[df.y.values == 'no', col].value_counts()
    all_counts = list(set(list(pos_counts.index) + list(neg_counts.index)))
    # Counts of how often each outcome was recorded.
    freq_pos = (df.y.values == 'yes').sum()
    freq_neg = (df.y.values == 'no').sum()
    pos_counts = pos_counts.to_dict()
    neg_counts = neg_counts.to_dict()
    all_index = list(all_counts)
    # Share of each category among 'yes' clients minus its share among 'no' clients
    all_counts = [pos_counts.get(k, 0) / freq_pos - neg_counts.get(k, 0) / freq_neg for k in all_counts]
    # Pass x and y by keyword; recent seaborn releases do not accept them positionally
    sns.barplot(x=all_counts, y=all_index)
    plt.title(col)
    plt.tight_layout()
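Another approach is to infer the unknown values from the other variables. This is a form of imputation: we use the other independent variables to infer the value of the missing one. It does not guarantee that every missing value will be resolved, but most of them will receive a reasonable value, which is useful for prediction. The variables with unknown/missing values are 'education', 'job', 'housing', 'loan', 'default' and 'marital'. The important ones are 'education', 'job', 'housing' and 'loan'; 'marital' has very few unknowns. Unknown values in 'default' are kept as 'unknown': the client may not have been willing to disclose this information to the bank representative, so 'unknown' in 'default' is effectively a separate category.
We therefore start by creating new variables that flag the unknown values in 'education', 'job', 'housing' and 'loan'. We do this to see whether these values are missing at random or whether there is a pattern in the missing values.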
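The code that creates these flag variables is not shown in this post, although names such as `education_un` appear later in the feature list. One possible way to build them, assuming missing values are stored as the string 'unknown'; the small frame below is made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration; the real data would come from the CSV.
demo = pd.DataFrame({
    'education': ['unknown', 'tertiary', 'secondary'],
    'job': ['management', 'unknown', 'services'],
    'housing': ['yes', 'no', 'unknown'],
    'loan': ['no', 'unknown', 'unknown'],
})

# One 0/1 flag per column, created before any imputation overwrites 'unknown'.
for col in ['education', 'job', 'housing', 'loan']:
    demo[col + '_un'] = (demo[col] == 'unknown').astype(int)

print(demo.filter(like='_un'))
```

The flags have to be created before the imputation steps, otherwise the information about which rows were originally unknown is lost.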
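Inferring education from job: the cross-tabulation shows that people in management jobs usually hold a university degree. So wherever 'job' = 'management' and 'education' = 'unknown', we replace 'education' with 'university.degree'. Similarly, 'job' = 'services' --> 'education' = 'high.school', and 'job' = 'housemaid' --> 'education' = 'basic.4y'.
Inferring job from education: if 'education' is 'basic.4y', 'basic.6y' or 'basic.9y', then 'job' is usually 'blue-collar'. If 'education' = 'professional.course', then 'job' = 'technician'. Inferring job from age: as we can see, if 'age' is above 60, then 'job' is 'retired', which makes sense. When imputing job and education, we kept in mind that a correlation should also make sense in the real world. If it does not, we do not replace the missing value.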
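The cross-tabulation behind these rules is not shown here. A sketch of how it could be produced with `pd.crosstab`, using made-up rows and the same category labels as the imputation code; the most frequent known education level per job can then be read off directly.

```python
import pandas as pd

# Made-up rows for illustration only.
demo = pd.DataFrame({
    'job': ['management', 'management', 'management',
            'services', 'services', 'housemaid'],
    'education': ['university.degree', 'university.degree', 'unknown',
                  'high.school', 'unknown', 'basic.4y'],
})

# Rows: job, columns: education, cells: number of clients.
table = pd.crosstab(demo['job'], demo['education'])
print(table)

# Most frequent known education level for each job.
known = table.drop(columns='unknown')
most_common = known.idxmax(axis=1)
print(most_common)
```

A rule is only worth adopting when the dominant category is clear and the relationship is plausible in practice.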
The code is as follows (example):
# Note: labels such as 'university.degree' and 'basic.4y' are the education categories of the
# bank-additional variant of this dataset; adjust them if your file uses different labels.
df.loc[(df['age']>60) & (df['job']=='unknown'), 'job'] = 'retired'
df.loc[(df['education']=='unknown') & (df['job']=='management'), 'education'] = 'university.degree'
df.loc[(df['education']=='unknown') & (df['job']=='services'), 'education'] = 'high.school'
df.loc[(df['education']=='unknown') & (df['job']=='housemaid'), 'education'] = 'basic.4y'
df.loc[(df['job'] == 'unknown') & (df['education']=='basic.4y'), 'job'] = 'blue-collar'
df.loc[(df['job'] == 'unknown') & (df['education']=='basic.6y'), 'job'] = 'blue-collar'
df.loc[(df['job'] == 'unknown') & (df['education']=='basic.9y'), 'job'] = 'blue-collar'
df.loc[(df['job']=='unknown') & (df['education']=='professional.course'), 'job'] = 'technician'
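# The cross_tab helper is not defined in this post; pd.crosstab builds the same kind of
# job-by-category count table, indexed by job with 'no' / 'yes' columns.
jobhousing = pd.crosstab(df['job'], df['housing'])
jobloan = pd.crosstab(df['job'], df['loan'])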
def fillhousing(df, jobhousing):
    """Function for imputation via cross-tabulation to fill missing values for the 'housing' categorical feature"""
    jobs = ['housemaid','services','admin.','blue-collar','technician','retired','management','unemployed','self-employed','entrepreneur','student']
    for j in jobs:
        # Rows with this job whose 'housing' value is unknown
        ind = df[np.logical_and(np.array(df['housing']=='unknown'), np.array(df['job']==j))].index
        # Draw 'no' with the share of 'no' observed for this job, otherwise 'yes'
        mask = np.random.rand(len(ind)) < ((jobhousing.loc[j]['no'])/(jobhousing.loc[j]['no']+jobhousing.loc[j]['yes']))
        ind1 = ind[mask]
        ind2 = ind[~mask]
        df.loc[ind1, "housing"] = 'no'
        df.loc[ind2, "housing"] = 'yes'
    return df

def fillloan(df, jobloan):
    """Function for imputation via cross-tabulation to fill missing values for the 'loan' categorical feature"""
    jobs = ['housemaid','services','admin.','blue-collar','technician','retired','management','unemployed','self-employed','entrepreneur','student']
    for j in jobs:
        # Rows with this job whose 'loan' value is unknown
        ind = df[np.logical_and(np.array(df['loan']=='unknown'), np.array(df['job']==j))].index
        # Draw 'no' with the share of 'no' observed for this job, otherwise 'yes'
        mask = np.random.rand(len(ind)) < ((jobloan.loc[j]['no'])/(jobloan.loc[j]['no']+jobloan.loc[j]['yes']))
        ind1 = ind[mask]
        ind2 = ind[~mask]
        df.loc[ind1, "loan"] = 'no'
        df.loc[ind2, "loan"] = 'yes'
    return df

df = fillhousing(df, jobhousing)
df = fillloan(df, jobloan)
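Because both functions draw from `np.random.rand`, each run fills the unknown values differently. A sketch of the same proportional idea with a seeded generator, so that the result is repeatable; `fill_by_ratio` and the sample data are hypothetical and not part of the original code.

```python
import numpy as np
import pandas as pd

def fill_by_ratio(df, target, counts, rng):
    """Replace 'unknown' in `target` with 'no'/'yes', drawn per job in the observed no/yes ratio."""
    df = df.copy()  # leave the caller's frame untouched
    for job in counts.index:
        idx = df.index[(df[target] == 'unknown') & (df['job'] == job)]
        p_no = counts.loc[job, 'no'] / (counts.loc[job, 'no'] + counts.loc[job, 'yes'])
        draw_no = rng.random(len(idx)) < p_no
        df.loc[idx[draw_no], target] = 'no'
        df.loc[idx[~draw_no], target] = 'yes'
    return df

# Made-up rows for illustration only.
demo = pd.DataFrame({
    'job': ['student'] * 6 + ['retired'] * 4,
    'housing': ['yes', 'yes', 'no', 'unknown', 'unknown', 'unknown',
                'no', 'no', 'no', 'unknown'],
})
counts = pd.crosstab(demo['job'], demo['housing'])  # also contains an 'unknown' column
rng = np.random.default_rng(0)                      # fixed seed for repeatable fills
filled = fill_by_ratio(demo, 'housing', counts, rng)
print(filled['housing'].value_counts())
```

Passing the generator in explicitly keeps the randomness in one place; the sketch assumes every job has at least one known 'no' or 'yes' value, otherwise the ratio is undefined.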
# Show the newly added features
df.head()
# For now, use the following features as variables
features_columns=['age', 'job', 'marital', 'education', 'balance', 'housing',
                  'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
                  'previous','education_un', 'job_un', 'housing_un',
                  'loan_un','y']
Encode the labels as numeric values:
# Encode the categorical data
for col in df.columns:
    if df[col].dtype==object:
        df[col]=df[col].astype('category')
        df[col]=df[col].cat.codes
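`cat.codes` numbers the categories in sorted order, so the integers only identify a category and carry no other meaning. A small made-up illustration, which also keeps the code-to-label mapping so that values can be decoded later:

```python
import pandas as pd

s = pd.Series(['no', 'yes', 'no', 'unknown'], name='loan')
cat = s.astype('category')

print(list(cat.cat.categories))   # categories in sorted order
print(cat.cat.codes.tolist())     # integer code for each row

# Keep the code -> label mapping so encoded values can be decoded later.
mapping = dict(enumerate(cat.cat.categories))
print(mapping)
```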
# Rescale data (between 0 and 1)
from sklearn.preprocessing import MinMaxScaler
features_columns_df = df[features_columns]
array = features_columns_df.values
# Separate the array into input and output components.
# 'y' is the last of the 19 entries in features_columns; the first 18 columns are the inputs.
X = array[:, :-1]
Y = array[:, -1]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# Summarize the transformed data
np.set_printoptions(precision=3)
print(rescaledX[0:5,:])
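With `feature_range=(0, 1)`, min-max scaling maps each column through x' = (x - min) / (max - min). A NumPy-only check of that formula on made-up numbers:

```python
import numpy as np

# Made-up values: two feature columns on very different scales.
X_demo = np.array([[20.0, 100.0],
                   [40.0, 300.0],
                   [60.0, 200.0]])

col_min = X_demo.min(axis=0)   # per-column minimum
col_max = X_demo.max(axis=0)   # per-column maximum
scaled = (X_demo - col_min) / (col_max - col_min)
print(scaled)
```

Each column now runs from 0 to 1, so no feature dominates purely because of its units.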
Summary