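Business background:
As China's market economy develops, competition between companies intensifies and market conditions change quickly, so broad, untargeted marketing no longer meets companies' growth needs.
Companies need precision marketing to acquire target customers more efficiently and to lower operating costs. Precision marketing campaigns raise customer satisfaction and loyalty, reduce customer acquisition costs, and increase the return on marketing investment, which directly improves company earnings.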
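# Bank customer information:
# 1 - age: age (numeric)
# 2 - job: type of job: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown'
# 3 - marital: marital status: 'divorced', 'married', 'single', 'unknown'. Note: 'divorced' also covers widowed
# 4 - education: education level: 'basic.4y' (4 years of basic schooling), 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown'
# 5 - default: has credit in default? ('no', 'yes', 'unknown')
# 6 - housing: has a housing loan? ('no', 'yes', 'unknown')
# 7 - loan: has a personal loan? (categorical: 'no', 'yes', 'unknown')
# Information related to the last contact:
# 8 - contact: contact type: 'cellular' (mobile phone), 'telephone' (landline)
# 9 - month: month of the last contact in the year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
# 10 - day_of_week: day of the week of the last contact (categorical: 'mon', 'tue', 'wed', 'thu', 'fri')
# 11 - duration: duration of the last contact in seconds. Important: this attribute strongly affects the output target (for example, if duration=0 then y='no'). However, the duration is not known before a call is made, and once the call ends y is obviously known. This input should therefore be included only for benchmarking and should be discarded if the goal is a realistic predictive model (the call duration is unknown at prediction time).
# Other attributes:
# 12 - campaign: number of contacts made to this customer during this campaign (numeric, includes the last contact)
# 13 - pdays: number of days since the customer was last contacted in a previous campaign (numeric; 999 means the customer was not contacted before)
# 14 - previous: number of contacts made to this customer before this campaign (numeric)
# 15 - poutcome: outcome of the previous marketing campaign ('failure', 'nonexistent', 'success')
# Social and economic attributes:
# 16 - emp.var.rate: employment variation rate, quarterly indicator (numeric)
# 17 - cons.price.idx: consumer price index, monthly indicator (numeric)
# 18 - cons.conf.idx: consumer confidence index, monthly indicator (numeric)
# 19 - euribor3m: Euribor 3-month rate, daily indicator (numeric)
# 20 - nr.employed: number of employees, quarterly indicator (numeric)
# Output variable (target):
# 21 - y: did the customer make a deposit (was the marketing successful)? (binary: 'yes', 'no')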
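This case study uses a bank product marketing dataset and a decision tree model to predict whether a customer will subscribe to a term deposit. The goal is to pick out the customers likely to buy the product, which lowers costs and makes marketing more efficient.
Note that the file loaded below (bank.csv) has 17 columns and differs slightly from the list above: it includes balance and day, names the target deposit, and marks customers who were not contacted before with pdays = -1.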
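The note on duration above suggests keeping two feature sets: one for benchmarking and one for a realistic model. A minimal sketch, assuming a pandas DataFrame that has a duration column; the helper name split_feature_sets and the toy rows are hypothetical and not part of the original analysis:

```python
import pandas as pd

def split_feature_sets(df, leaky_cols=("duration",)):
    """Return (benchmark, realistic) frames.

    The realistic frame drops columns that are only known after the call ends.
    """
    present = [c for c in leaky_cols if c in df.columns]
    return df.copy(), df.drop(columns=present)

# Hypothetical toy rows, only to show the shape of the result
toy = pd.DataFrame({"age": [59, 41], "duration": [1042, 0], "campaign": [1, 2]})
benchmark, realistic = split_feature_sets(toy)
print(list(realistic.columns))
```

The model trained later in this article keeps duration, so its scores are best read as benchmark figures.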
1. Getting the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from io import StringIO
from sklearn.tree import export_graphviz
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import metrics
get_ipython().run_line_magic('matplotlib', 'inline')
bank=pd.read_csv('./bank.csv')
bank.head()
   age         job  marital  education  default  balance  housing  loan  contact  day  month  duration  campaign  pdays  previous  poutcome  deposit
0   59      admin.  married  secondary       no     2343      yes    no  unknown    5    may      1042         1     -1         0   unknown      yes
1   56      admin.  married  secondary       no       45       no    no  unknown    5    may      1467         1     -1         0   unknown      yes
2   41  technician  married  secondary       no     1270      yes    no  unknown    5    may      1389         1     -1         0   unknown      yes
3   55    services  married  secondary       no     2476      yes    no  unknown    5    may       579         1     -1         0   unknown      yes
4   54      admin.  married   tertiary       no      184       no    no  unknown    5    may       673         2     -1         0   unknown      yes
2. Data exploration
bank.info()
RangeIndex: 11162 entries, 0 to 11161
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 11162 non-null int64
1 job 11162 non-null object
2 marital 11162 non-null object
3 education 11162 non-null object
4 default 11162 non-null object
5 balance 11162 non-null int64
6 housing 11162 non-null object
7 loan 11162 non-null object
8 contact 11162 non-null object
9 day 11162 non-null int64
10 month 11162 non-null object
11 duration 11162 non-null int64
12 campaign 11162 non-null int64
13 pdays 11162 non-null int64
14 previous 11162 non-null int64
15 poutcome 11162 non-null object
16 deposit 11162 non-null object
dtypes: int64(7), object(10)
memory usage: 1.4+ MB
# Count the missing values in each column
bank.isnull().sum()
age 0
job 0
marital 0
education 0
default 0
balance 0
housing 0
loan 0
contact 0
day 0
month 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
deposit 0
dtype: int64
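Per-column missing-value counts are usually read from isnull().sum(). Filtering rows with isnull().any(axis=1) and then calling count() reports something different: the number of non-null cells among the rows that contain at least one gap. The two agree when nothing is missing, as in this dataset, but not in general. A small sketch on made-up data (the frame toy is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two gaps in 'age', none in 'job'
toy = pd.DataFrame({
    "age": [30, np.nan, np.nan],
    "job": ["admin.", "services", "student"],
})

# Number of missing cells in each column
missing_per_column = toy.isnull().sum()

# Non-null cells per column, restricted to rows that have any gap
rows_with_gaps = toy[toy.isnull().any(axis=1)]
nonnull_in_gap_rows = rows_with_gaps.count()

print(missing_per_column)
print(nonnull_in_gap_rows)
```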
bank.describe().T
            count         mean          std      min    25%    50%      75%      max
age       11162.0    41.231948    11.913369     18.0   32.0   39.0    49.00     95.0
balance   11162.0  1528.538524  3225.413326  -6847.0  122.0  550.0  1708.00  81204.0
day       11162.0    15.658036     8.420740      1.0    8.0   15.0    22.00     31.0
duration  11162.0   371.993818   347.128386      2.0  138.0  255.0   496.00   3881.0
campaign  11162.0     2.508421     2.722077      1.0    1.0    2.0     3.00     63.0
pdays     11162.0    51.330407   108.758282     -1.0   -1.0   -1.0    20.75    854.0
previous  11162.0     0.832557     2.292007      0.0    0.0    0.0     1.00     58.0
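# Histogram of age
plt.hist(bank.age,bins=10)
plt.show()
# Box plots
sns.boxplot(x=bank["age"])
sns.boxplot(x=bank["duration"])
# sns.distplot is deprecated since seaborn 0.11; sns.histplot(bank.duration, bins=20, kde=True) is the suggested replacement
sns.distplot(bank.duration, bins=20)
plt.hist(bank.balance,bins=100)
plt.show()
sns.boxplot(x=bank["balance"])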
bank_data = bank.copy()
bank_data.poutcome.value_counts()
unknown 8326
failure 1228
success 1071
other 537
Name: poutcome, dtype: int64
bank_data.job.value_counts()
management 2566
blue-collar 1944
technician 1823
admin. 1334
services 923
retired 778
self-employed 405
student 360
unemployed 357
entrepreneur 328
housemaid 274
unknown 70
Name: job, dtype: int64
bank_data.contact.value_counts()
bank_data.education.value_counts()
bank_data.marital.value_counts()
married 6351
single 3518
divorced 1293
Name: marital, dtype: int64
3. Feature engineering
# Convert default to a 0/1 variable and drop the original column
bank_data['default_cat'] = bank_data['default'].map( {'yes':1, 'no':0} )
bank_data.drop('default', axis=1,inplace = True)
# Convert housing to a 0/1 variable and drop the original column
bank_data["housing_cat"]=bank_data['housing'].map({'yes':1, 'no':0})
bank_data.drop('housing', axis=1,inplace = True)
# Convert loan to a 0/1 variable and drop the original column
bank_data["loan_cat"] = bank_data['loan'].map({'yes':1, 'no':0})
bank_data.drop('loan', axis=1, inplace=True)
# Convert the target "deposit" to a 0/1 variable and drop the original column
bank_data["deposit_cat"] = bank_data['deposit'].map({'yes':1, 'no':0})
bank_data.drop('deposit', axis=1, inplace=True)
# Merge similar job categories into one class.
# Note: these three lines modify bank, not bank_data (the copy made earlier), so the
# merge does not reach the features used below; the outputs that follow still show
# the original 12 job categories.
bank['job'] = bank['job'].replace(['management','admin.'],'white-collar')
bank['job'] = bank['job'].replace(['housemaid','services'],'pink-collar')
bank['job'] = bank['job'].replace(['retired','student','unemployed','unknown'],'other')
# Recode 'other' in the previous-campaign outcome poutcome as 'unknown'
bank_data['poutcome'] = bank_data['poutcome'].replace(['other'] , 'unknown')
bank_data.head().T
                      0           1           2           3           4
age                  59          56          41          55          54
job              admin.      admin.  technician    services      admin.
marital         married     married     married     married     married
education     secondary   secondary   secondary   secondary    tertiary
balance            2343          45        1270        2476         184
contact         unknown     unknown     unknown     unknown     unknown
day                   5           5           5           5           5
month               may         may         may         may         may
duration           1042        1467        1389         579         673
campaign              1           1           1           1           2
pdays                -1          -1          -1          -1          -1
previous              0           0           0           0           0
poutcome        unknown     unknown     unknown     unknown     unknown
default_cat           0           0           0           0           0
housing_cat           1           0           1           1           0
loan_cat              0           0           0           0           0
deposit_cat           1           1           1           1           1
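The yes/no conversions above repeat the same pattern, and Series.map leaves NaN for any value that is missing from the mapping, so a check for unmapped values is worth having. The sketch below wraps the pattern in a helper and also shows the job grouping applied to the working copy instead of the original frame. The helper names and toy rows are hypothetical, and applying the grouping to the working copy is an assumption about the intent; the outputs shown in this article do not reflect it.

```python
import pandas as pd

YES_NO = {"yes": 1, "no": 0}

# Same grouping as in the text, written as a single lookup table
JOB_GROUPS = {
    "management": "white-collar", "admin.": "white-collar",
    "housemaid": "pink-collar", "services": "pink-collar",
    "retired": "other", "student": "other",
    "unemployed": "other", "unknown": "other",
}

def encode_yes_no(df, columns):
    """Replace each yes/no column with an integer <name>_cat column."""
    out = df.copy()
    for col in columns:
        encoded = out[col].map(YES_NO)
        if encoded.isnull().any():
            # map() yields NaN for values outside the mapping
            raise ValueError(f"unmapped values in {col!r}")
        out[col + "_cat"] = encoded.astype(int)
        out = out.drop(columns=col)
    return out

def group_jobs(df):
    """Merge similar job categories on a copy of the frame."""
    out = df.copy()
    out["job"] = out["job"].replace(JOB_GROUPS)
    return out

# Hypothetical toy rows
toy = pd.DataFrame({
    "job": ["admin.", "technician", "student"],
    "housing": ["yes", "no", "yes"],
    "loan": ["no", "no", "yes"],
})
prepared = group_jobs(encode_yes_no(toy, ["housing", "loan"]))
print(prepared)
```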
# Drop the 'month' and 'day' variables
bank_data.drop('month', axis=1, inplace=True)
bank_data.drop('day', axis=1, inplace=True)
bank_data.head()
   age         job  marital  education  balance  contact  duration  campaign  pdays  previous  poutcome  default_cat  housing_cat  loan_cat  deposit_cat
0   59      admin.  married  secondary     2343  unknown      1042         1     -1         0   unknown            0            1         0            1
1   56      admin.  married  secondary       45  unknown      1467         1     -1         0   unknown            0            0         0            1
2   41  technician  married  secondary     1270  unknown      1389         1     -1         0   unknown            0            1         0            1
3   55    services  married  secondary     2476  unknown       579         1     -1         0   unknown            0            1         0            1
4   54      admin.  married   tertiary      184  unknown       673         2     -1         0   unknown            0            0         0            1
print("Number of samples with pdays == -1:", len(bank_data[bank_data.pdays==-1]))
print("Maximum value of pdays:", bank_data['pdays'].max())
Number of samples with pdays == -1: 8324
Maximum value of pdays: 854
bank_data.loc[bank_data['pdays'] == -1, 'pdays'] = 10000
bank_data.head()
   age         job  marital  education  balance  contact  duration  campaign  pdays  previous  poutcome  default_cat  housing_cat  loan_cat  deposit_cat
0   59      admin.  married  secondary     2343  unknown      1042         1  10000         0   unknown            0            1         0            1
1   56      admin.  married  secondary       45  unknown      1467         1  10000         0   unknown            0            0         0            1
2   41  technician  married  secondary     1270  unknown      1389         1  10000         0   unknown            0            1         0            1
3   55    services  married  secondary     2476  unknown       579         1  10000         0   unknown            0            1         0            1
4   54      admin.  married   tertiary      184  unknown       673         2  10000         0   unknown            0            0         0            1
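Replacing the -1 sentinel with 10000 places customers who were never contacted beyond every observed value (the maximum is 854), so a tree can separate them with a single threshold on pdays. An alternative, shown here as an assumption and not as the approach used in this article, is to keep an explicit indicator column as well. The function name recode_pdays, the column name never_contacted, and the toy values are hypothetical.

```python
import pandas as pd

def recode_pdays(df, sentinel=-1, far_value=10000):
    """Flag never-contacted customers and move the sentinel far above real values."""
    out = df.copy()
    # 1 where the customer was never contacted before, else 0
    out["never_contacted"] = (out["pdays"] == sentinel).astype(int)
    out.loc[out["pdays"] == sentinel, "pdays"] = far_value
    return out

# Hypothetical toy values
toy = pd.DataFrame({"pdays": [-1, 20, 854, -1]})
recoded = recode_pdays(toy)
print(recoded)
```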
# One-hot (dummy) encoding of the categorical variables
bank_with_dummies = pd.get_dummies(data=bank_data,
                                   columns = ['job', 'marital', 'education', 'contact','poutcome'],
                                   prefix = ['job', 'marital', 'education','contact', 'poutcome'])
bank_with_dummies.info()
RangeIndex: 11162 entries, 0 to 11161
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 11162 non-null int64
1 balance 11162 non-null int64
2 duration 11162 non-null int64
3 campaign 11162 non-null int64
4 pdays 11162 non-null int64
5 previous 11162 non-null int64
6 default_cat 11162 non-null int64
7 housing_cat 11162 non-null int64
8 loan_cat 11162 non-null int64
9 deposit_cat 11162 non-null int64
10 job_admin. 11162 non-null uint8
11 job_blue-collar 11162 non-null uint8
12 job_entrepreneur 11162 non-null uint8
13 job_housemaid 11162 non-null uint8
14 job_management 11162 non-null uint8
15 job_retired 11162 non-null uint8
16 job_self-employed 11162 non-null uint8
17 job_services 11162 non-null uint8
18 job_student 11162 non-null uint8
19 job_technician 11162 non-null uint8
20 job_unemployed 11162 non-null uint8
21 job_unknown 11162 non-null uint8
22 marital_divorced 11162 non-null uint8
23 marital_married 11162 non-null uint8
24 marital_single 11162 non-null uint8
25 education_primary 11162 non-null uint8
26 education_secondary 11162 non-null uint8
27 education_tertiary 11162 non-null uint8
28 education_unknown 11162 non-null uint8
29 contact_cellular 11162 non-null uint8
30 contact_telephone 11162 non-null uint8
31 contact_unknown 11162 non-null uint8
32 poutcome_failure 11162 non-null uint8
33 poutcome_success 11162 non-null uint8
34 poutcome_unknown 11162 non-null uint8
dtypes: int64(10), uint8(25)
memory usage: 1.1 MB
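To see what the encoding step does on a small scale, here is a sketch on three made-up rows. Passing dtype=int is an assumption made here to get integer 0/1 columns regardless of the pandas version: recent pandas releases return boolean dummy columns by default, whereas the listing above shows uint8.

```python
import pandas as pd

# Hypothetical toy rows with two categorical columns
toy = pd.DataFrame({
    "age": [59, 41, 33],
    "marital": ["married", "single", "married"],
    "contact": ["unknown", "cellular", "cellular"],
})

# Each category becomes its own 0/1 column named <prefix>_<category>
encoded = pd.get_dummies(
    data=toy,
    columns=["marital", "contact"],
    prefix=["marital", "contact"],
    dtype=int,
)
print(encoded.columns.tolist())
```

Every row has exactly one 1 within each group of dummy columns, which is why the 5 categorical variables expand into 25 columns in the listing above.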
4. Model development and evaluation
corr = bank_with_dummies.corr()
plt.figure(figsize = (10,10))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values, cmap=cmap, vmax=.3, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .82})
plt.title('Correlation Matrix')
Text(0.5, 1, 'Correlation Matrix')
bank_with_dummies_drop_deposite=bank_with_dummies.drop('deposit_cat', axis=1)
label = bank_with_dummies.deposit_cat
data_train, data_test, label_train, label_test = train_test_split(bank_with_dummies_drop_deposite, label, test_size = 0.3, random_state = 50)
print(data_train.shape)
print(data_test.shape)
print(label_train.shape)
print(label_test.shape)
(7813, 34)
(3349, 34)
(7813,)
(3349,)
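The split above is purely random. When class proportions matter, train_test_split accepts a stratify argument that keeps the label ratio the same in both parts. A sketch on synthetic labels; the data is made up, and the 70/30 ratio and random_state simply mirror the call above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 40% positive labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 40 + [0] * 60)

# stratify=y preserves the 40/60 label ratio in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=50, stratify=y
)
print(X_train.shape, X_test.shape, y_train.mean(), y_test.mean())
```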
from sklearn.tree import DecisionTreeClassifier
k_plot=[]
t_plot=[]
for k in range(1,10,1):
    dt=DecisionTreeClassifier(max_depth=k,random_state=101)
    dt.fit(data_train,label_train)
    predict=dt.predict(data_test)
    accuracy_test=round(dt.score(data_test,label_test)*100,2)
    accuracy_train=round(dt.score(data_train,label_train)*100,2)
    #print(k)
    #print('train accuracy of decision tree classifier',accuracy_train)
    #print('test accuracy of decision tree classifier',accuracy_test)
    k_plot.append(accuracy_test)
    t_plot.append(accuracy_train)
fig,axes=plt.subplots(1,1,figsize=(12,8))
axes.set_xticks(range(1,10,1))
plt.title("accuracy of decision tree classifier")
plt.xlabel("max_depth", color = "purple")
plt.ylabel("accuracy", color = "green")
k=range(1,10,1)
plt.plot(k,k_plot,linewidth = 3.0, linestyle = '--',marker = "o")
plt.plot(k,t_plot,'r',marker = "o",markerfacecolor = 'white')
plt.legend(['accuracy_test','accuracy_train'])
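Because the loop reuses one variable, only the tree fitted last is left once it finishes. To evaluate a chosen depth, the tree has to be refit with that depth. The sketch below sweeps depths on synthetic data from make_classification, picks the depth with the highest test accuracy, and refits. The function name sweep_depths and the data are hypothetical, and selecting the depth on the test set is kept only to mirror the loop above; a validation split or cross-validation would be the more careful choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def sweep_depths(X_train, y_train, X_test, y_test, depths=range(1, 10)):
    """Return {depth: (train_accuracy, test_accuracy)} for each max_depth."""
    scores = {}
    for depth in depths:
        model = DecisionTreeClassifier(max_depth=depth, random_state=101)
        model.fit(X_train, y_train)
        scores[depth] = (model.score(X_train, y_train),
                         model.score(X_test, y_test))
    return scores

# Synthetic stand-in for the bank features
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=50)

scores = sweep_depths(X_tr, y_tr, X_te, y_te)
best_depth = max(scores, key=lambda d: scores[d][1])

# Refit explicitly so the evaluated model has the chosen depth
final_tree = DecisionTreeClassifier(max_depth=best_depth, random_state=101)
final_tree.fit(X_tr, y_tr)
print(best_depth, scores[best_depth])
```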
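# Make predictions on the test set.
# Note: as listed, dt is the last tree fitted in the loop above (max_depth=9);
# refit with max_depth=5 to match the depth named in the summary and the .dot file name.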
preds = dt.predict(data_test)
# Calculate accuracy
print("\nAccuracy score:\n{}".format(metrics.accuracy_score(label_test, preds)))
# Make predictions on the test set using predict_proba
probs = dt.predict_proba(data_test)[:,1]
# Calculate the AUC metric
print("\nArea Under Curve: \n{}".format(metrics.roc_auc_score(label_test, probs)))
Accuracy score:
0.7832188713048671
Area Under Curve:
0.8442580902727844
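Accuracy and AUC measure different things: accuracy scores the hard 0/1 predictions, while AUC scores how well the predicted probabilities rank positives above negatives. A small worked example with four made-up labels and scores; the confusion matrix is an extra view added here and is not part of the original analysis.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1])
probs = np.array([0.1, 0.4, 0.35, 0.8])

# Hard predictions at the usual 0.5 threshold
preds = (probs >= 0.5).astype(int)

accuracy = metrics.accuracy_score(y_true, preds)   # 3 of 4 correct
auc = metrics.roc_auc_score(y_true, probs)         # 3 of 4 (neg, pos) pairs ranked correctly
cm = metrics.confusion_matrix(y_true, preds)       # rows: true class, columns: predicted class

print(accuracy, auc)
print(cm)
```

Both metrics come out at 0.75 here, but for different reasons: one positive is missed at the 0.5 threshold, and one negative (0.4) is ranked above one positive (0.35).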
features = bank_with_dummies_drop_deposite.columns.tolist()
tree.export_graphviz(dt, out_file='./tree_depth_5.dot',feature_names=features)
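The .dot file written above needs Graphviz to be rendered as an image. For a quick look without extra tools, scikit-learn also provides tree.export_text, which returns the fitted rules as indented text. A sketch on synthetic data; the feature names are hypothetical stand-ins for the real columns.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with four features; names are placeholders
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ["age", "balance", "campaign", "pdays"]

small_tree = DecisionTreeClassifier(max_depth=2, random_state=101).fit(X, y)

# Text rendering of the split rules and leaf classes
rules = tree.export_text(small_tree, feature_names=feature_names)
print(rules)
```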
5. Summary
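The decision tree model with a maximum depth of 5 is reported to reach an accuracy of 78.32% and an AUC of 0.8443 on the test set.
Building an automated recommendation system on top of such a precision-marketing model, so that the most suitable products are recommended to customers more accurately, is well worth doing.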