Decision Tree Analysis: Financial Product Recommendation

Business background:

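As China's market economy develops, competition among companies intensifies and market conditions change rapidly, so broad, undifferentiated marketing no longer meets companies' growth needs.
Companies need precision marketing to reach target customers more efficiently and to lower operating costs. Precision marketing campaigns can raise customer satisfaction and strengthen loyalty to the company; they can also reduce customer acquisition costs and increase the return on marketing investment, which directly improves company earnings.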

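# Bank client information:
# 1 - age: age (numeric)
# 2 - job: type of job: administrative ('admin.'), blue-collar ('blue-collar'), entrepreneur ('entrepreneur'), housemaid ('housemaid'), management ('management'), retired ('retired'), self-employed ('self-employed'), services ('services'), student ('student'), technician ('technician'), unemployed ('unemployed'), unknown ('unknown')
# 3 - marital: marital status: 'divorced', 'married', 'single', 'unknown'. Note: 'divorced' also covers widowed
# 4 - education: education level: 4 years of basic education ('basic.4y'), 6 years ('basic.6y'), 9 years ('basic.9y'), high school ('high.school'), illiterate ('illiterate'), professional course ('professional.course'), university degree ('university.degree'), unknown ('unknown')
# 5 - default: has credit in default? ('no', 'yes', 'unknown')
# 6 - housing: has a housing loan? ('no', 'yes', 'unknown')
# 7 - loan: has a personal loan? (categorical: 'no', 'yes', 'unknown')
# Information related to the last contact:
# 8 - contact: contact type: mobile ('cellular') or landline ('telephone')
# 9 - month: month of the last contact in the year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
# 10 - day_of_week: day of the week of the last contact (categorical: 'mon', 'tue', 'wed', 'thu', 'fri')
# 11 - duration: duration of the last contact, in seconds. Important: this attribute strongly affects the target (for example, if duration = 0 then y = 'no'). However, the duration is not known before a call is made, and once the call ends y is obviously known. This input should therefore be included only for benchmarking and should be dropped if the goal is a realistic predictive model (the call duration is not known at prediction time).
# Other attributes:
# 12 - campaign: number of contacts made with this client during this campaign (numeric, includes the last contact)
# 13 - pdays: number of days since the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted)
# 14 - previous: number of contacts made with this client before this campaign (numeric)
# 15 - poutcome: outcome of the previous marketing campaign ('failure', 'nonexistent', 'success')
# Social and economic attributes:
# 16 - emp.var.rate: employment variation rate, quarterly indicator (numeric)
# 17 - cons.price.idx: consumer price index, monthly indicator (numeric)
# 18 - cons.conf.idx: consumer confidence index, monthly indicator (numeric)
# 19 - euribor3m: Euribor 3-month rate, daily indicator (numeric)
# 20 - nr.employed: number of employees, quarterly indicator (numeric)
# Output variable (target):
# 21 - y: did the client make a deposit (was the marketing successful)? (binary: 'yes', 'no')
# Note: this list describes the 21-column version of the dataset. The bank.csv file loaded below has 17 columns: it adds balance and day, omits day_of_week and the social and economic indicators, names the target deposit, and marks clients with no previous contact as pdays = -1.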
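
Following the note on `duration` in the feature list, a model meant to be used before a call is placed would leave that column out. The sketch below uses a tiny made-up frame (the values and the name `toy` are illustrative, not taken from the dataset) to show the column being dropped before training.

```python
import pandas as pd

# Toy rows for illustration only; column names follow the feature list above.
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "duration": [120, 0, 600],   # known only after the call has ended
    "campaign": [1, 3, 2],
    "y": ["no", "no", "yes"],
})

# Keep `duration` for benchmarking only; drop it for a model used before calling.
features_realistic = toy.drop(columns=["duration", "y"])
target = toy["y"]

print(features_realistic.columns.tolist())
```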

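This case study uses a bank product marketing dataset and a decision tree model to predict whether a customer will subscribe to a term deposit. The goal is to identify the customers likely to buy the product, which reduces costs and improves marketing efficiency.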

1. Getting the data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from io import StringIO
from sklearn.tree import export_graphviz
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import metrics
get_ipython().run_line_magic('matplotlib', 'inline')
bank=pd.read_csv('./bank.csv')
bank.head()


age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome deposit
0 59 admin. married secondary no 2343 yes no unknown 5 may 1042 1 -1 0 unknown yes
1 56 admin. married secondary no 45 no no unknown 5 may 1467 1 -1 0 unknown yes
2 41 technician married secondary no 1270 yes no unknown 5 may 1389 1 -1 0 unknown yes
3 55 services married secondary no 2476 yes no unknown 5 may 579 1 -1 0 unknown yes
4 54 admin. married tertiary no 184 no no unknown 5 may 673 2 -1 0 unknown yes

2. Data exploration

bank.info()

RangeIndex: 11162 entries, 0 to 11161
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        11162 non-null  int64 
 1   job        11162 non-null  object
 2   marital    11162 non-null  object
 3   education  11162 non-null  object
 4   default    11162 non-null  object
 5   balance    11162 non-null  int64 
 6   housing    11162 non-null  object
 7   loan       11162 non-null  object
 8   contact    11162 non-null  object
 9   day        11162 non-null  int64 
 10  month      11162 non-null  object
 11  duration   11162 non-null  int64 
 12  campaign   11162 non-null  int64 
 13  pdays      11162 non-null  int64 
 14  previous   11162 non-null  int64 
 15  poutcome   11162 non-null  object
 16  deposit    11162 non-null  object
dtypes: int64(7), object(10)
memory usage: 1.4+ MB


# Count the missing values in each variable
bank.isnull().sum()


age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
deposit      0
dtype: int64
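
A zero count of nulls does not mean every value is informative: the sample rows shown earlier contain the placeholder string `unknown`. A small sketch on made-up data counts those placeholders per text column (the `toy` frame and the column choice are hypothetical).

```python
import pandas as pd

toy = pd.DataFrame({
    "job": ["admin.", "unknown", "technician", "services"],
    "contact": ["unknown", "unknown", "cellular", "telephone"],
    "age": [59, 56, 41, 55],
})

# Count how often the placeholder 'unknown' appears in each text column.
text_cols = ["job", "contact"]
unknown_counts = (toy[text_cols] == "unknown").sum()

print(unknown_counts)
```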


bank.describe().T


count mean std min 25% 50% 75% max
age 11162.0 41.231948 11.913369 18.0 32.0 39.0 49.00 95.0
balance 11162.0 1528.538524 3225.413326 -6847.0 122.0 550.0 1708.00 81204.0
day 11162.0 15.658036 8.420740 1.0 8.0 15.0 22.00 31.0
duration 11162.0 371.993818 347.128386 2.0 138.0 255.0 496.00 3881.0
campaign 11162.0 2.508421 2.722077 1.0 1.0 2.0 3.00 63.0
pdays 11162.0 51.330407 108.758282 -1.0 -1.0 -1.0 20.75 854.0
previous 11162.0 0.832557 2.292007 0.0 0.0 0.0 1.00 58.0


# Histogram
plt.hist(bank.age,bins=10)
plt.show()
# Box plot
sns.boxplot(x=bank["age"])


output_10_0.png

output_10_2.png


sns.boxplot(x=bank["duration"])




output_11_1.png


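# distplot is deprecated in recent seaborn releases; histplot with a KDE curve draws the same kind of density plot
sns.histplot(bank.duration, bins=20, kde=True, stat="density")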



output_12_1.png


plt.hist(bank.balance,bins=100)
plt.show()


output_13_0.png


sns.boxplot(x=bank["balance"])



output_14_1.png
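
With seaborn's default whisker setting, the box plots above draw as outliers the points lying more than 1.5 times the interquartile range beyond the box. A minimal sketch of that rule on made-up balances follows; `iqr_bounds` is a helper name introduced here for illustration.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up balances with one extreme value.
balances = pd.Series([100, 200, 300, 400, 500, 600, 700, 800, 50000])
lower, upper = iqr_bounds(balances)

# Points outside the fences are the ones a box plot would draw individually.
outliers = balances[(balances < lower) | (balances > upper)]
print(lower, upper)
print(outliers.tolist())
```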


bank_data = bank.copy()
bank_data.poutcome.value_counts()


unknown    8326
failure    1228
success    1071
other       537
Name: poutcome, dtype: int64


bank_data.job.value_counts()


management       2566
blue-collar      1944
technician       1823
admin.           1334
services          923
retired           778
self-employed     405
student           360
unemployed        357
entrepreneur      328
housemaid         274
unknown            70
Name: job, dtype: int64


bank_data.contact.value_counts()
bank_data.education.value_counts()
bank_data.marital.value_counts()


married     6351
single      3518
divorced    1293
Name: marital, dtype: int64

3. Feature engineering



# Convert default to a 0/1 variable and drop the original default column
bank_data['default_cat'] = bank_data['default'].map( {'yes':1, 'no':0} )
bank_data.drop('default', axis=1,inplace = True)
# Convert housing to a 0/1 variable and drop the original housing column
bank_data["housing_cat"]=bank_data['housing'].map({'yes':1, 'no':0})
bank_data.drop('housing', axis=1,inplace = True)
# Convert loan to a 0/1 variable and drop the original loan column
bank_data["loan_cat"] = bank_data['loan'].map({'yes':1, 'no':0})
bank_data.drop('loan', axis=1, inplace=True)
# Convert the target "deposit" to a 0/1 variable and drop the original column
bank_data["deposit_cat"] = bank_data['deposit'].map({'yes':1, 'no':0})
bank_data.drop('deposit', axis=1, inplace=True)
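
`Series.map` with a dictionary returns NaN for any value missing from the dictionary, so a quick check after the conversion can catch unexpected categories. A sketch on made-up values (the `unknown` entry is added here on purpose to trigger the case):

```python
import pandas as pd

raw = pd.Series(["yes", "no", "unknown", "no"], name="housing")

# Values absent from the mapping (here 'unknown') become NaN.
encoded = raw.map({"yes": 1, "no": 0})
unmapped = raw[encoded.isna()].unique().tolist()

print(encoded.tolist())
print("unmapped values:", unmapped)
```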




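# Merge similar job values into one category
# Note: these three replacements act on bank, not on bank_data, so they do not reach
# the tables and dummy columns below, which still show the original job values
bank['job'] = bank['job'].replace(['management','admin.'],'white-collar')
bank['job'] = bank['job'].replace(['housemaid','services'],'pink-collar')
bank['job'] = bank['job'].replace(['retired','student','unemployed','unknown'],'other')
# Recode the value 'other' of the previous-campaign outcome poutcome as 'unknown'
bank_data['poutcome'] = bank_data['poutcome'].replace(['other'] , 'unknown')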




bank_data.head().T


0 1 2 3 4
age 59 56 41 55 54
job admin. admin. technician services admin.
marital married married married married married
education secondary secondary secondary secondary tertiary
balance 2343 45 1270 2476 184
contact unknown unknown unknown unknown unknown
day 5 5 5 5 5
month may may may may may
duration 1042 1467 1389 579 673
campaign 1 1 1 1 2
pdays -1 -1 -1 -1 -1
previous 0 0 0 0 0
poutcome unknown unknown unknown unknown unknown
default_cat 0 0 0 0 0
housing_cat 1 0 1 1 0
loan_cat 0 0 0 0 0
deposit_cat 1 1 1 1 1


# Drop the 'month' and 'day' variables
bank_data.drop('month', axis=1, inplace=True)
bank_data.drop('day', axis=1, inplace=True)




bank_data.head()


age job marital education balance contact duration campaign pdays previous poutcome default_cat housing_cat loan_cat deposit_cat
0 59 admin. married secondary 2343 unknown 1042 1 -1 0 unknown 0 1 0 1
1 56 admin. married secondary 45 unknown 1467 1 -1 0 unknown 0 0 0 1
2 41 technician married secondary 1270 unknown 1389 1 -1 0 unknown 0 1 0 1
3 55 services married secondary 2476 unknown 579 1 -1 0 unknown 0 1 0 1
4 54 admin. married tertiary 184 unknown 673 2 -1 0 unknown 0 0 0 1


print("Number of samples with pdays == -1:", len(bank_data[bank_data.pdays==-1]))
print("Maximum value of pdays:", bank_data['pdays'].max())


Number of samples with pdays == -1: 8324
Maximum value of pdays: 854


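# -1 marks clients with no previous contact; recode it as 10000, far above the
# observed maximum of 854, so these clients sit at the far end of the pdays scale
bank_data.loc[bank_data['pdays'] == -1, 'pdays'] = 10000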
bank_data.head()


age job marital education balance contact duration campaign pdays previous poutcome default_cat housing_cat loan_cat deposit_cat
0 59 admin. married secondary 2343 unknown 1042 1 10000 0 unknown 0 1 0 1
1 56 admin. married secondary 45 unknown 1467 1 10000 0 unknown 0 0 0 1
2 41 technician married secondary 1270 unknown 1389 1 10000 0 unknown 0 1 0 1
3 55 services married secondary 2476 unknown 579 1 10000 0 unknown 0 1 0 1
4 54 admin. married tertiary 184 unknown 673 2 10000 0 unknown 0 0 0 1
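
Replacing -1 with a large number keeps `pdays` numeric. One alternative, shown here only as a sketch, is to also record the "never contacted" case in a separate 0/1 column; the column name `previously_contacted` is invented for this example and does not appear in the original analysis.

```python
import pandas as pd

toy = pd.DataFrame({"pdays": [-1, 20, -1, 854]})

# Flag clients who were contacted in an earlier campaign (pdays != -1).
toy["previously_contacted"] = (toy["pdays"] != -1).astype(int)

# Recode the sentinel the same way as in the step above.
toy.loc[toy["pdays"] == -1, "pdays"] = 10000

print(toy)
```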

# Dummy (one-hot) encoding of the categorical variables
bank_with_dummies = pd.get_dummies(data=bank_data, 
                                   columns = ['job', 'marital', 'education', 'contact','poutcome'], 
                                   prefix = ['job', 'marital', 'education','contact', 'poutcome'])

bank_with_dummies.info()



RangeIndex: 11162 entries, 0 to 11161
Data columns (total 35 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   age                  11162 non-null  int64
 1   balance              11162 non-null  int64
 2   duration             11162 non-null  int64
 3   campaign             11162 non-null  int64
 4   pdays                11162 non-null  int64
 5   previous             11162 non-null  int64
 6   default_cat          11162 non-null  int64
 7   housing_cat          11162 non-null  int64
 8   loan_cat             11162 non-null  int64
 9   deposit_cat          11162 non-null  int64
 10  job_admin.           11162 non-null  uint8
 11  job_blue-collar      11162 non-null  uint8
 12  job_entrepreneur     11162 non-null  uint8
 13  job_housemaid        11162 non-null  uint8
 14  job_management       11162 non-null  uint8
 15  job_retired          11162 non-null  uint8
 16  job_self-employed    11162 non-null  uint8
 17  job_services         11162 non-null  uint8
 18  job_student          11162 non-null  uint8
 19  job_technician       11162 non-null  uint8
 20  job_unemployed       11162 non-null  uint8
 21  job_unknown          11162 non-null  uint8
 22  marital_divorced     11162 non-null  uint8
 23  marital_married      11162 non-null  uint8
 24  marital_single       11162 non-null  uint8
 25  education_primary    11162 non-null  uint8
 26  education_secondary  11162 non-null  uint8
 27  education_tertiary   11162 non-null  uint8
 28  education_unknown    11162 non-null  uint8
 29  contact_cellular     11162 non-null  uint8
 30  contact_telephone    11162 non-null  uint8
 31  contact_unknown      11162 non-null  uint8
 32  poutcome_failure     11162 non-null  uint8
 33  poutcome_success     11162 non-null  uint8
 34  poutcome_unknown     11162 non-null  uint8
dtypes: int64(10), uint8(25)
memory usage: 1.1 MB
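
To see what the dummy encoding does on a small scale, the sketch below encodes a single made-up `marital` column. Passing `dtype=int` is a choice made for this example so the result is 0/1 integers regardless of the pandas version.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [59, 41, 33],
    "marital": ["married", "single", "divorced"],
})

# One new column per category, named <prefix>_<category>; the original column is removed.
toy_dummies = pd.get_dummies(toy, columns=["marital"], prefix=["marital"], dtype=int)

print(toy_dummies.columns.tolist())
print(toy_dummies)
```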

4. Model development and evaluation



corr = bank_with_dummies.corr()
plt.figure(figsize = (10,10))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values, cmap=cmap, vmax=.3, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .82})
plt.title('Correlation Matrix')


Text(0.5, 1, 'Correlation Matrix')
output_28_1.png
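
A heatmap of 35 columns can be hard to read, so it may help to list the correlations with the target directly. The sketch uses synthetic data with invented column names; on the real frame the same idea would apply to `bank_with_dummies` and its `deposit_cat` column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
toy = pd.DataFrame({
    "signal": signal,
    "noise": rng.normal(size=n),
    # Target is 1 when the signal (plus a little noise) is positive.
    "target": (signal + 0.3 * rng.normal(size=n) > 0).astype(int),
})

# Correlation of every feature with the target, strongest first by absolute value.
corr_with_target = toy.corr()["target"].drop("target")
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)

print(ranked)
```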


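# Separate the features from the target; axis is passed by keyword because newer pandas versions no longer accept it positionally
bank_with_dummies_drop_deposite=bank_with_dummies.drop('deposit_cat', axis=1)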
label = bank_with_dummies.deposit_cat
data_train, data_test, label_train, label_test = train_test_split(bank_with_dummies_drop_deposite, label, test_size = 0.3, random_state = 50)
print(data_train.shape)
print(data_test.shape)
print(label_train.shape)
print(label_test.shape)


(7813, 34)
(3349, 34)
(7813,)
(3349,)
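
The split above is purely random. If the class ratio should be preserved in both parts, `train_test_split` accepts a `stratify` argument. A sketch on synthetic labels (the 30% positive rate is made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)
y = np.array([1] * 30 + [0] * 70)   # 30% positives

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=50, stratify=y
)

# With stratification, both parts keep the positive rate of the full data.
print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```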


from sklearn.tree import DecisionTreeClassifier

# Fit one tree per max_depth from 1 to 9 and record train/test accuracy (in %)
depths = range(1, 10)
k_plot = []  # test accuracy per depth
t_plot = []  # train accuracy per depth
for k in depths:
    dt = DecisionTreeClassifier(max_depth=k, random_state=101)
    dt.fit(data_train, label_train)
    k_plot.append(round(dt.score(data_test, label_test) * 100, 2))
    t_plot.append(round(dt.score(data_train, label_train) * 100, 2))

# Plot both accuracy curves against max_depth
fig, axes = plt.subplots(1, 1, figsize=(12, 8))
axes.set_xticks(depths)
plt.title("accuracy of decision tree classifier")
plt.xlabel("max_depth", color="purple")
plt.ylabel("accuracy", color="green")
plt.plot(depths, k_plot, linewidth=3.0, linestyle='--', marker="o")
plt.plot(depths, t_plot, 'r', marker="o", markerfacecolor='white')
plt.legend(['accuracy_test', 'accuracy_train'])



output_30_1.png
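
Picking `max_depth` by looking at test accuracy uses the test set for tuning. A common alternative is cross-validation on the training data only. The sketch below does this with `GridSearchCV` on a synthetic dataset from `make_classification`, so the depth it selects says nothing about the bank data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and labels.
X, y = make_classification(n_samples=600, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=50)

# Try max_depth 1..9 with 5-fold cross-validation on the training part only.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=101),
    param_grid={"max_depth": list(range(1, 10))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_depth = search.best_params_["max_depth"]
test_accuracy = search.score(X_test, y_test)
print("best max_depth:", best_depth)
print("held-out accuracy:", round(test_accuracy, 3))
```

The test set is touched only once, after the depth has been chosen.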


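# After the loop above, dt holds the max_depth=9 tree; refit at max_depth=5,
# the depth named in the export file and in the summary
dt = DecisionTreeClassifier(max_depth=5, random_state=101)
dt.fit(data_train, label_train)
# Make predictions on the test set
preds = dt.predict(data_test)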
# Calculate accuracy
print("\nAccuracy score:\n{}".format(metrics.accuracy_score(label_test, preds)))
# Make predictions on the test set using predict_proba
probs = dt.predict_proba(data_test)[:,1]
# Calculate the AUC metric
print("\nArea Under Curve: \n{}".format(metrics.roc_auc_score(label_test, probs)))


Accuracy score:
0.7832188713048671

Area Under Curve: 
0.8442580902727844
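
Accuracy and AUC do not show which kind of error the model makes. A confusion matrix splits the predictions into the four outcome types. Hand-written labels are used here so the counts can be checked by eye; they are not results from the bank model.

```python
from sklearn import metrics

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Rows are true classes, columns are predicted classes (class order: 0, 1).
cm = metrics.confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Precision: share of predicted positives that are correct.
# Recall: share of actual positives that were found.
precision = metrics.precision_score(y_true, y_pred)
recall = metrics.recall_score(y_true, y_pred)

print(cm)
print(precision, recall)
```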


features = bank_with_dummies_drop_deposite.columns.tolist()
tree.export_graphviz(dt, out_file='./tree_depth_5.dot',feature_names=features)
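
The `.dot` file needs Graphviz to be rendered. For a quick look without it, scikit-learn also provides `export_text`, which prints the splits as indented rules. A sketch on the iris data bundled with scikit-learn, not the bank data:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=101)
clf.fit(iris.data, iris.target)

# feature_names takes one name per input column.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```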


5. Summary

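The decision tree model with a maximum depth of 5 reaches an accuracy of 78.32% and an AUC of 0.8443 on the test set.
Building an automated recommendation system on top of a precision marketing model, so that each customer is more accurately offered the most suitable product, is well worth doing.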
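
One simple way such a model could feed a recommendation list is to rank customers by predicted probability and contact the top of the list first. This is a sketch on synthetic data; the cut-off of 20 customers is arbitrary and the names `top_k` and `top_idx` are introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for customer features and responses.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=50)

model = DecisionTreeClassifier(max_depth=5, random_state=101).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

# Indices of the 20 test customers with the highest predicted probability.
top_k = 20
top_idx = np.argsort(probs)[::-1][:top_k]

print("response rate in top 20:", y_test[top_idx].mean())
print("overall response rate:  ", y_test.mean())
```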
