07_Likelihood of Marketing Engagement in Marketing


      • Load the packages
      • Load the dataset
      • Variable Encoding
      • Training & Testing
      • Evaluating Models


  • Engagement Likelihood


  • Customer Lifetime Value


  • Product Recommendation


  • Customer Acquisition and Retention



  • Accuracy



  • Precision

  • Recall

  • ROC curve


  • AUC

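AUC is simply the total area under the ROC curve. It ranges from 0 to 1, and a higher AUC indicates better model performance. A random classifier has an AUC of 0.5, so any classifier with an AUC above 0.5 performs better than random prediction.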
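As a small illustration of the five metrics listed above, the sketch below computes them with scikit-learn on made-up labels. The arrays `y_true`, `y_pred`, and `y_score` are hypothetical toy values, not taken from the dataset used later.

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, roc_curve, auc
)

# Hypothetical toy data, made up purely for illustration.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]                      # actual classes
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]                      # hard 0/1 predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.3, 0.45]    # predicted P(class 1)

# Accuracy: share of all predictions that are correct.
accuracy = accuracy_score(y_true, y_pred)
# Precision: share of predicted positives that are truly positive.
precision = precision_score(y_true, y_pred)
# Recall: share of true positives that the model found.
recall = recall_score(y_true, y_pred)

# The ROC curve traces TPR against FPR over all score thresholds;
# AUC is the area under that curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

print('Accuracy: %0.4f' % accuracy)
print('Precision: %0.4f' % precision)
print('Recall: %0.4f' % recall)
print('AUC: %0.4f' % roc_auc)
```

Accuracy, precision, and recall are computed from hard class predictions, while the ROC curve and AUC need the predicted scores or probabilities.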

Below we again use the Kaggle dataset from part 06, WA_Fn-UseC_-Marketing-Customer-Value-Analysis.csv.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Load the packages

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import roc_curve, auc

%matplotlib inline

Load the dataset

df = pd.read_csv('../input/ibm-watson-marketing-customer-value-data/WA_Fn-UseC_-Marketing-Customer-Value-Analysis.csv')
 | Customer | State | Customer Lifetime Value | Response | Coverage | Education | Effective To Date | EmploymentStatus | Gender | Income | ... | Months Since Policy Inception | Number of Open Complaints | Number of Policies | Policy Type | Policy | Renew Offer Type | Sales Channel | Total Claim Amount | Vehicle Class | Vehicle Size
0 | BU79786 | Washington | 2763.519279 | No | Basic | Bachelor | 2/24/11 | Employed | F | 56274 | ... | 5 | 0 | 1 | Corporate Auto | Corporate L3 | Offer1 | Agent | 384.811147 | Two-Door Car | Medsize
1 | QZ44356 | Arizona | 6979.535903 | No | Extended | Bachelor | 1/31/11 | Unemployed | F | 0 | ... | 42 | 0 | 8 | Personal Auto | Personal L3 | Offer3 | Agent | 1131.464935 | Four-Door Car | Medsize
2 | AI49188 | Nevada | 12887.431650 | No | Premium | Bachelor | 2/19/11 | Employed | F | 48767 | ... | 38 | 0 | 2 | Personal Auto | Personal L3 | Offer1 | Agent | 566.472247 | Two-Door Car | Medsize

3 rows × 24 columns

Variable Encoding


df['Engaged'] = df['Response'].apply(lambda x: 1 if x == 'Yes' else 0)

Engagement rate is 14.32%
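The line above looks like the printed output of a cell whose code did not survive. Because `Engaged` is a 0/1 column, its mean is the engagement rate, which agrees with the `Engaged` mean of 0.143201 in the summary statistics. A minimal sketch of one way to compute it, using a hypothetical mini-frame `toy_df` in place of the real data:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the real dataset.
toy_df = pd.DataFrame(
    {'Response': ['No', 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'No']}
)

# Same encoding as in the text: 1 for 'Yes', 0 otherwise.
toy_df['Engaged'] = toy_df['Response'].apply(lambda x: 1 if x == 'Yes' else 0)

# The mean of a 0/1 column is the share of engaged customers.
engagement_rate = toy_df['Engaged'].mean()
print('Engagement rate is %0.2f%%' % (engagement_rate * 100))
```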


 | Customer Lifetime Value | Income | Monthly Premium Auto | Months Since Last Claim | Months Since Policy Inception | Number of Open Complaints | Number of Policies | Total Claim Amount | Engaged
count | 9134.000000 | 9134.000000 | 9134.000000 | 9134.000000 | 9134.000000 | 9134.000000 | 9134.000000 | 9134.000000 | 9134.000000
mean | 8004.940475 | 37657.380009 | 93.219291 | 15.097000 | 48.064594 | 0.384388 | 2.966170 | 434.088794 | 0.143201
std | 6870.967608 | 30379.904734 | 34.407967 | 10.073257 | 27.905991 | 0.910384 | 2.390182 | 290.500092 | 0.350297
min | 1898.007675 | 0.000000 | 61.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.099007 | 0.000000
25% | 3994.251794 | 0.000000 | 68.000000 | 6.000000 | 24.000000 | 0.000000 | 1.000000 | 272.258244 | 0.000000
50% | 5780.182197 | 33889.500000 | 83.000000 | 14.000000 | 48.000000 | 0.000000 | 2.000000 | 383.945434 | 0.000000
75% | 8962.167041 | 62320.000000 | 109.000000 | 23.000000 | 71.000000 | 0.000000 | 4.000000 | 547.514839 | 0.000000
max | 83325.381190 | 99981.000000 | 298.000000 | 35.000000 | 99.000000 | 5.000000 | 9.000000 | 2893.239678 | 1.000000

continuous_features = [
    'Customer Lifetime Value', 'Income', 'Monthly Premium Auto',
    'Months Since Last Claim', 'Months Since Policy Inception',
    'Number of Open Complaints', 'Number of Policies', 'Total Claim Amount'
]

Creating Dummy Variables

columns_to_encode = [
    'Sales Channel', 'Vehicle Size', 'Vehicle Class', 'Policy', 'Policy Type',
    'EmploymentStatus', 'Marital Status', 'Education', 'Coverage'
]

categorical_features = []
for col in columns_to_encode:
    encoded_df = pd.get_dummies(df[col])
    encoded_df.columns = [col.replace(' ', '.') + '.' + x for x in encoded_df.columns]
    categorical_features += list(encoded_df.columns)
    df = pd.concat([df, encoded_df], axis=1)
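To make the naming scheme of the loop above concrete, here is a minimal sketch on a hypothetical one-column frame (`toy_df` and its values are made up). It applies the same `pd.get_dummies` call and the same column renaming to a single column.

```python
import pandas as pd

# Hypothetical toy column; the real dataset has many more rows.
toy_df = pd.DataFrame(
    {'Sales Channel': ['Agent', 'Call Center', 'Web', 'Agent']}
)

col = 'Sales Channel'
# One indicator column per distinct category value.
encoded = pd.get_dummies(toy_df[col])
# Prefix each dummy with the source column name, spaces replaced by dots.
encoded.columns = [col.replace(' ', '.') + '.' + x for x in encoded.columns]

dummy_columns = list(encoded.columns)
print(dummy_columns)
```

Only the spaces in the column name are replaced at this point; a space inside a category value such as `Call Center` remains until the later step that rewrites all column names with `x.replace(' ', '.')`.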

Encoding Gender

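df['Is.Female'] = df['Gender'].apply(lambda x: 1 if x == 'F' else 0)
# Register the new column so that it ends up in all_features
# (Is.Female appears as feature 49 in the importance table).
categorical_features.append('Is.Female')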


All features together

all_features = continuous_features + categorical_features
response = 'Engaged'
sample_df = df[all_features + [response]]
sample_df.columns = [x.replace(' ', '.') for x in sample_df.columns]
all_features = [x.replace(' ', '.') for x in all_features]
 | Customer.Lifetime.Value | Income | Monthly.Premium.Auto | Months.Since.Last.Claim | Months.Since.Policy.Inception | Number.of.Open.Complaints | Number.of.Policies | Total.Claim.Amount | Sales.Channel.Agent | Sales.Channel.Branch | ... | Education.Bachelor | Education.College | Education.Doctor | Education.High.School.or.Below | Education.Master | Coverage.Basic | Coverage.Extended | Coverage.Premium | Is.Female | Engaged
0 | 2763.519279 | 56274 | 69 | 32 | 5 | 0 | 1 | 384.811147 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0
1 | 6979.535903 | 0 | 94 | 13 | 42 | 0 | 8 | 1131.464935 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0
2 | 12887.431650 | 48767 | 108 | 18 | 38 | 0 | 2 | 566.472247 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0

3 rows × 51 columns

Training & Testing

x_train, x_test, y_train, y_test = train_test_split(sample_df[all_features], sample_df[response], test_size=0.3)
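The split above is unseeded, so every run produces slightly different metrics. As an optional variant (not what the original notebook does), `train_test_split` accepts `random_state` for reproducibility and `stratify` to keep the class ratio equal in both parts, which can help when only about 14% of customers are engaged. A minimal sketch on hypothetical toy arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 20 positives out of 100 samples.
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 20 + [0] * 80)

# random_state fixes the shuffle; stratify=y preserves the 20% positive rate.
x_tr, x_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print('Train positives: %d of %d' % (y_tr.sum(), len(y_tr)))
print('Test positives: %d of %d' % (y_te.sum(), len(y_te)))
```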

Building RandomForest Model

# Hyperparameters taken from the fitted-model repr printed below
rf_model = RandomForestClassifier(n_estimators=300, max_depth=5)
rf_model.fit(X=x_train, y=y_train)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=5, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=300,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

Show 5 single trees

[DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                        max_depth=5, max_features='auto', max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, presort='deprecated',
                        random_state=838010407, splitter='best'),
 DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                        max_depth=5, max_features='auto', max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, presort='deprecated',
                        random_state=1246246316, splitter='best'),
 DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                        max_depth=5, max_features='auto', max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, presort='deprecated',
                        random_state=1694179371, splitter='best'),
 DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                        max_depth=5, max_features='auto', max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, presort='deprecated',
                        random_state=1705361911, splitter='best'),
 DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                        max_depth=5, max_features='auto', max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, presort='deprecated',
                        random_state=1699793500, splitter='best')]
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
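The code that produced the two outputs above is missing from this section. A fitted `RandomForestClassifier` exposes its individual trees through the `estimators_` attribute, so output of this shape could plausibly come from slicing that list and calling `predict` on a single tree; that is an assumption, not something the source confirms. A minimal sketch on synthetic data (`toy_model` and the generated `X`, `y` are hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; not the marketing dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

toy_model = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0)
toy_model.fit(X, y)

# estimators_ is the list of fitted DecisionTreeClassifier objects.
first_five = toy_model.estimators_[:5]
print(first_five)

# A single tree can predict on its own; the forest aggregates all trees.
single_tree_preds = first_five[0].predict(X[:10])
forest_preds = toy_model.predict(X[:10])
print(single_tree_preds)
print(forest_preds)
```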

Feature Importances

array([0.07259724, 0.08119994, 0.06495311, 0.03328749, 0.04340357,
       0.01579008, 0.02218859, 0.07436861, 0.03735399, 0.00565773,
       0.00674024, 0.00462383, 0.01171149, 0.00923856, 0.00981901,
       0.00457693, 0.00067447, 0.00525568, 0.0033229 , 0.00698852,
       0.00307466, 0.00102278, 0.00151386, 0.00164754, 0.00097357,
       0.00144482, 0.00133233, 0.00123385, 0.00062795, 0.00358802,
       0.00138201, 0.00207571, 0.00152493, 0.00453954, 0.02572955,
       0.00640002, 0.26177188, 0.02542252, 0.05939226, 0.01906953,
       0.01750097, 0.00354849, 0.00461108, 0.0036818 , 0.00537708,
       0.00786115, 0.00547227, 0.00413968, 0.00209991, 0.00818824])
feature_importance_df = pd.DataFrame(list(zip(rf_model.feature_importances_, all_features)))
feature_importance_df.columns = ['feature.importance', 'feature']

feature_importance_df.sort_values(by='feature.importance', ascending=False)
feature.importance feature
36 0.261772 EmploymentStatus.Retired
1 0.081200 Income
7 0.074369 Total.Claim.Amount
0 0.072597 Customer.Lifetime.Value
2 0.064953 Monthly.Premium.Auto
38 0.059392 Marital.Status.Divorced
4 0.043404 Months.Since.Policy.Inception
8 0.037354 Sales.Channel.Agent
3 0.033287 Months.Since.Last.Claim
34 0.025730 EmploymentStatus.Employed
37 0.025423 EmploymentStatus.Unemployed
6 0.022189 Number.of.Policies
39 0.019070 Marital.Status.Married
40 0.017501 Marital.Status.Single
5 0.015790 Number.of.Open.Complaints
12 0.011711 Vehicle.Size.Large
14 0.009819 Vehicle.Size.Small
13 0.009239 Vehicle.Size.Medsize
49 0.008188 Is.Female
45 0.007861 Education.Master
19 0.006989 Vehicle.Class.Sports.Car
10 0.006740 Sales.Channel.Call.Center
35 0.006400 EmploymentStatus.Medical.Leave
9 0.005658 Sales.Channel.Branch
46 0.005472 Coverage.Basic
44 0.005377 Education.High.School.or.Below
17 0.005256 Vehicle.Class.Luxury.SUV
11 0.004624 Sales.Channel.Web
42 0.004611 Education.College
15 0.004577 Vehicle.Class.Four-Door.Car
33 0.004540 EmploymentStatus.Disabled
47 0.004140 Coverage.Extended
43 0.003682 Education.Doctor
29 0.003588 Policy.Special.L3
41 0.003548 Education.Bachelor
18 0.003323 Vehicle.Class.SUV
20 0.003075 Vehicle.Class.Two-Door.Car
48 0.002100 Coverage.Premium
31 0.002076 Policy.Type.Personal.Auto
23 0.001648 Policy.Corporate.L3
32 0.001525 Policy.Type.Special.Auto
22 0.001514 Policy.Corporate.L2
25 0.001445 Policy.Personal.L2
30 0.001382 Policy.Type.Corporate.Auto
26 0.001332 Policy.Personal.L3
27 0.001234 Policy.Special.L1
21 0.001023 Policy.Corporate.L1
24 0.000974 Policy.Personal.L1
16 0.000674 Vehicle.Class.Luxury.Car
28 0.000628 Policy.Special.L2

Evaluating Models

Accuracy, Precision, and Recall

in_sample_preds = rf_model.predict(x_train)
out_sample_preds = rf_model.predict(x_test)
print('In-Sample Accuracy: %0.4f' % accuracy_score(y_train, in_sample_preds))
print('Out-of-Sample Accuracy: %0.4f' % accuracy_score(y_test, out_sample_preds))
In-Sample Accuracy: 0.8724
Out-of-Sample Accuracy: 0.8767
print('In-Sample Precision: %0.4f' % precision_score(y_train, in_sample_preds))
print('Out-of-Sample Precision: %0.4f' % precision_score(y_test, out_sample_preds))
In-Sample Precision: 0.9907
Out-of-Sample Precision: 1.0000
print('In-Sample Recall: %0.4f' % recall_score(y_train, in_sample_preds))
print('Out-of-Sample Recall: %0.4f' % recall_score(y_test, out_sample_preds))
In-Sample Recall: 0.1151
Out-of-Sample Recall: 0.1266
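Precision near 1.0 combined with recall near 0.12 is the pattern of a model that predicts the positive class rarely but is almost always right when it does. The sketch below reproduces that pattern on hypothetical toy labels and derives both metrics from the confusion matrix; the numbers are made up and only illustrate the arithmetic.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical toy labels: 8 engaged customers out of 20.
y_true = [1] * 8 + [0] * 12
# A very conservative model that flags only one customer, correctly.
y_pred = [1] + [0] * 19

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # correct among predicted positives
recall = tp / (tp + fn)      # found among actual positives

print('Precision: %0.4f' % precision)
print('Recall: %0.4f' % recall)
```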


in_sample_preds = rf_model.predict_proba(x_train)[:,1]
out_sample_preds = rf_model.predict_proba(x_test)[:,1]
in_sample_fpr, in_sample_tpr, in_sample_thresholds = roc_curve(y_train, in_sample_preds)
out_sample_fpr, out_sample_tpr, out_sample_thresholds = roc_curve(y_test, out_sample_preds)
in_sample_roc_auc = auc(in_sample_fpr, in_sample_tpr)
out_sample_roc_auc = auc(out_sample_fpr, out_sample_tpr)

print('In-Sample AUC: %0.4f' % in_sample_roc_auc)
print('Out-Sample AUC: %0.4f' % out_sample_roc_auc)
In-Sample AUC: 0.8768
Out-Sample AUC: 0.8352

# Each curve is labelled with its own AUC
plt.plot(
    out_sample_fpr, out_sample_tpr, color='darkorange', label='Out-Sample ROC curve (area = %0.4f)' % out_sample_roc_auc
)
plt.plot(
    in_sample_fpr, in_sample_tpr, color='navy', label='In-Sample ROC curve (area = %0.4f)' % in_sample_roc_auc
)
plt.plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('RandomForest Model ROC Curve')
plt.legend(loc="lower right")


