Predictive analytics is the process of analyzing and extracting information from historical data to identify patterns and predict future outcomes. It typically uses a wide range of statistical and machine learning models to find relationships between the attributes or features in a dataset and the target variable or behavior you want to predict. Predictive analytics can be used and applied across many different industries. For example, it is frequently used in the financial industry for fraud detection, where machine learning models are trained to detect and prevent potentially fraudulent transactions. The healthcare industry can also benefit from predictive analytics, helping physicians make decisions. In addition, various areas of marketing benefit from predictive analytics, such as customer acquisition, customer retention, upselling, and cross-selling. As mentioned, there are many ways to apply and leverage predictive analytics in marketing. In this post, we will discuss four popular examples of predictive analytics in marketing:
1. Engagement likelihood. Predictive analytics can help marketers predict how likely customers are to engage with their marketing. For example, if your campaigns run heavily over email, you can use predictive analytics to predict which customers are highly likely to open your marketing emails, and target those high-likelihood customers to maximize marketing effectiveness. As another example, if you display ads on social media, predictive analytics can help identify the types of customers who are likely to click on them.
2. Customer lifetime value. Predictive analytics can estimate a customer's expected lifetime value. Using historical transaction data, it can help identify high-value customers, so that you and your company can focus more on building relationships with them.
3. Recommendations. We can use data science and machine learning to predict which customers are likely to purchase a product or view a piece of content, and improve conversion rates by recommending the right products and content to each individual customer.
4. Acquisition and retention. Predictive analytics is also widely used for customer acquisition and retention. Based on profile data collected about prospects or leads, together with historical data on existing customers, we can apply predictive analytics to identify high-quality leads, or rank leads by their likelihood of converting into active customers. Conversely, churn data combined with existing customers' historical data can be used to build models that predict which customers are likely to leave or unsubscribe from a product.
Having covered these applications of predictive analytics, let's turn to model evaluation: when developing predictive models, it is important to know how to assess their performance. Here I introduce five commonly used metrics for evaluating classification models.
Accuracy is simply the percentage of correct predictions out of all predictions.
Precision is defined as the number of true positives divided by the total number of true positives and false positives. A true positive is a case the model correctly predicts as positive, while a false positive is a case the model predicts as positive although its true label is negative. Recall, which we will also use below, is the number of true positives divided by the sum of true positives and false negatives, where a false negative is an actual positive that the model misses; recall therefore measures how many of the real positives the model captures.
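To make these definitions concrete, here is a minimal sketch computing all three metrics from a hypothetical confusion matrix (the counts are made up purely for illustration):
tp, fp, fn, tn = 80, 20, 40, 860  # hypothetical counts: true/false positives, false negatives, true negatives
accuracy = (tp + tn) / (tp + fp + fn + tn)  # correct predictions over all predictions
precision = tp / (tp + fp)  # of all predicted positives, how many were truly positive
recall = tp / (tp + fn)  # of all actual positives, how many the model found
print('Accuracy: %0.4f, Precision: %0.4f, Recall: %0.4f' % (accuracy, precision, recall))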
The ROC (receiver operating characteristic) curve shows how the true positive rate and the false positive rate change as the classification threshold varies. We will plot an ROC curve for our own model at the end of this post.
AUC is simply the total area under the ROC curve. AUC ranges from 0 to 1, and a higher AUC indicates better model performance. A random classifier has an AUC of 0.5, so any classifier with an AUC above 0.5 performs better than random prediction.
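As a quick sanity check on that last claim, a classifier whose scores carry no information about the labels should land near an AUC of 0.5. Here is a minimal sketch using simulated random scores (the seed and sample size are arbitrary):
import numpy as np
from sklearn.metrics import roc_auc_score
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=10000)  # random binary labels
y_score = rng.random(size=10000)  # scores with no relation to the labels
print('Random classifier AUC: %0.4f' % roc_auc_score(y_true, y_score))  # expect a value close to 0.5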
Below, we again use the Kaggle dataset from 06, WA_Fn-UseC_-Marketing-Customer-Value-Analysis.csv:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/ibm-watson-marketing-customer-value-data/WA_Fn-UseC_-Marketing-Customer-Value-Analysis.csv
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import roc_curve, auc
%matplotlib inline
df = pd.read_csv('../input/ibm-watson-marketing-customer-value-data/WA_Fn-UseC_-Marketing-Customer-Value-Analysis.csv')
df.head(3)
 | Customer | State | Customer Lifetime Value | Response | Coverage | Education | Effective To Date | EmploymentStatus | Gender | Income | ... | Months Since Policy Inception | Number of Open Complaints | Number of Policies | Policy Type | Policy | Renew Offer Type | Sales Channel | Total Claim Amount | Vehicle Class | Vehicle Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | BU79786 | Washington | 2763.519279 | No | Basic | Bachelor | 2/24/11 | Employed | F | 56274 | ... | 5 | 0 | 1 | Corporate Auto | Corporate L3 | Offer1 | Agent | 384.811147 | Two-Door Car | Medsize |
1 | QZ44356 | Arizona | 6979.535903 | No | Extended | Bachelor | 1/31/11 | Unemployed | F | 0 | ... | 42 | 0 | 8 | Personal Auto | Personal L3 | Offer3 | Agent | 1131.464935 | Four-Door Car | Medsize |
2 | AI49188 | Nevada | 12887.431650 | No | Premium | Bachelor | 2/19/11 | Employed | F | 48767 | ... | 38 | 0 | 2 | Personal Auto | Personal L3 | Offer1 | Agent | 566.472247 | Two-Door Car | Medsize |
3 rows × 24 columns
Outcome
df['Engaged'] = df['Response'].apply(lambda x: 1 if x == 'Yes' else 0)
df['Engaged'].mean()
0.14320122618786948
The engagement rate is 14.32%.
Features
df.describe()
 | Customer Lifetime Value | Income | Monthly Premium Auto | Months Since Last Claim | Months Since Policy Inception | Number of Open Complaints | Number of Policies | Total Claim Amount | Engaged |
---|---|---|---|---|---|---|---|---|---|
count | 9134.000000 | 9134.000000 | 9134.000000 | 9134.000000 | 9134.000000 | 9134.000000 | 9134.000000 | 9134.000000 | 9134.000000 |
mean | 8004.940475 | 37657.380009 | 93.219291 | 15.097000 | 48.064594 | 0.384388 | 2.966170 | 434.088794 | 0.143201 |
std | 6870.967608 | 30379.904734 | 34.407967 | 10.073257 | 27.905991 | 0.910384 | 2.390182 | 290.500092 | 0.350297 |
min | 1898.007675 | 0.000000 | 61.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.099007 | 0.000000 |
25% | 3994.251794 | 0.000000 | 68.000000 | 6.000000 | 24.000000 | 0.000000 | 1.000000 | 272.258244 | 0.000000 |
50% | 5780.182197 | 33889.500000 | 83.000000 | 14.000000 | 48.000000 | 0.000000 | 2.000000 | 383.945434 | 0.000000 |
75% | 8962.167041 | 62320.000000 | 109.000000 | 23.000000 | 71.000000 | 0.000000 | 4.000000 | 547.514839 | 0.000000 |
max | 83325.381190 | 99981.000000 | 298.000000 | 35.000000 | 99.000000 | 5.000000 | 9.000000 | 2893.239678 | 1.000000 |
continuous_features = [
    'Customer Lifetime Value', 'Income', 'Monthly Premium Auto',
    'Months Since Last Claim', 'Months Since Policy Inception',
    'Number of Open Complaints', 'Number of Policies', 'Total Claim Amount'
]
Creating Dummy Variables
columns_to_encode = [
    'Sales Channel', 'Vehicle Size', 'Vehicle Class', 'Policy', 'Policy Type',
    'EmploymentStatus', 'Marital Status', 'Education', 'Coverage'
]
categorical_features = []
for col in columns_to_encode:
    encoded_df = pd.get_dummies(df[col])
    encoded_df.columns = [col.replace(' ', '.') + '.' + x for x in encoded_df.columns]
    categorical_features += list(encoded_df.columns)
    df = pd.concat([df, encoded_df], axis=1)
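To see what each iteration of this loop produces, note that get_dummies turns one categorical column into one indicator column per category. For the Coverage column, whose first three rows are Basic, Extended, and Premium as shown earlier, the output looks roughly like this (depending on your pandas version the values may display as 0/1 or True/False):
pd.get_dummies(df['Coverage']).head(3)
#    Basic  Extended  Premium
# 0      1         0        0
# 1      0         1        0
# 2      0         0        1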
Encoding Gender
df['Is.Female'] = df['Gender'].apply(lambda x: 1 if x == 'F' else 0)
categorical_features.append('Is.Female')
All features together
all_features = continuous_features + categorical_features
response = 'Engaged'
sample_df = df[all_features + [response]]
sample_df.columns = [x.replace(' ', '.') for x in sample_df.columns]
all_features = [x.replace(' ', '.') for x in all_features]
sample_df.head(3)
 | Customer.Lifetime.Value | Income | Monthly.Premium.Auto | Months.Since.Last.Claim | Months.Since.Policy.Inception | Number.of.Open.Complaints | Number.of.Policies | Total.Claim.Amount | Sales.Channel.Agent | Sales.Channel.Branch | ... | Education.Bachelor | Education.College | Education.Doctor | Education.High.School.or.Below | Education.Master | Coverage.Basic | Coverage.Extended | Coverage.Premium | Is.Female | Engaged |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2763.519279 | 56274 | 69 | 32 | 5 | 0 | 1 | 384.811147 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
1 | 6979.535903 | 0 | 94 | 13 | 42 | 0 | 8 | 1131.464935 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
2 | 12887.431650 | 48767 | 108 | 18 | 38 | 0 | 2 | 566.472247 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
3 rows × 51 columns
x_train, x_test, y_train, y_test = train_test_split(sample_df[all_features], sample_df[response], test_size=0.3)
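Note that train_test_split shuffles the data randomly, so the exact numbers you get below will differ from run to run. If you want a reproducible split, pass a fixed seed; a minimal variant (the seed value 42 is an arbitrary choice):
x_train, x_test, y_train, y_test = train_test_split(
    sample_df[all_features], sample_df[response],
    test_size=0.3,  # hold out 30% of the data for out-of-sample evaluation
    random_state=42  # any fixed seed makes the split reproducible
)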
Building RandomForest Model
rf_model = RandomForestClassifier(
    n_estimators=300,  # number of trees in the forest
    max_depth=5  # cap tree depth to limit overfitting
)
rf_model.fit(X=x_train, y=y_train)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=5, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=300,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
Showing the first 5 individual trees
rf_model.estimators_[:5]
[DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=5, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=838010407, splitter='best'),
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=5, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=1246246316, splitter='best'),
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=5, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=1694179371, splitter='best'),
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=5, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=1705361911, splitter='best'),
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=5, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=1699793500, splitter='best')]
rf_model.estimators_[0].predict(x_test)[:10]
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
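A single tree is just one vote. In scikit-learn, the forest's predicted probability is the mean of the individual trees' predicted probabilities, which we can verify with a short sketch (we pass x_test.values to the bare trees because they were fitted on raw arrays inside the forest):
tree_probs = np.stack([tree.predict_proba(x_test.values)[:, 1] for tree in rf_model.estimators_])
print(tree_probs.mean(axis=0)[:5])  # probabilities averaged across the 300 trees
print(rf_model.predict_proba(x_test)[:5, 1])  # the forest's own output; the two should match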
Feature Importances
rf_model.feature_importances_
array([0.07259724, 0.08119994, 0.06495311, 0.03328749, 0.04340357,
0.01579008, 0.02218859, 0.07436861, 0.03735399, 0.00565773,
0.00674024, 0.00462383, 0.01171149, 0.00923856, 0.00981901,
0.00457693, 0.00067447, 0.00525568, 0.0033229 , 0.00698852,
0.00307466, 0.00102278, 0.00151386, 0.00164754, 0.00097357,
0.00144482, 0.00133233, 0.00123385, 0.00062795, 0.00358802,
0.00138201, 0.00207571, 0.00152493, 0.00453954, 0.02572955,
0.00640002, 0.26177188, 0.02542252, 0.05939226, 0.01906953,
0.01750097, 0.00354849, 0.00461108, 0.0036818 , 0.00537708,
0.00786115, 0.00547227, 0.00413968, 0.00209991, 0.00818824])
feature_importance_df = pd.DataFrame(list(zip(rf_model.feature_importances_, all_features)))
feature_importance_df.columns = ['feature.importance', 'feature']
feature_importance_df.sort_values(by='feature.importance', ascending=False)
 | feature.importance | feature |
---|---|---|
36 | 0.261772 | EmploymentStatus.Retired |
1 | 0.081200 | Income |
7 | 0.074369 | Total.Claim.Amount |
0 | 0.072597 | Customer.Lifetime.Value |
2 | 0.064953 | Monthly.Premium.Auto |
38 | 0.059392 | Marital.Status.Divorced |
4 | 0.043404 | Months.Since.Policy.Inception |
8 | 0.037354 | Sales.Channel.Agent |
3 | 0.033287 | Months.Since.Last.Claim |
34 | 0.025730 | EmploymentStatus.Employed |
37 | 0.025423 | EmploymentStatus.Unemployed |
6 | 0.022189 | Number.of.Policies |
39 | 0.019070 | Marital.Status.Married |
40 | 0.017501 | Marital.Status.Single |
5 | 0.015790 | Number.of.Open.Complaints |
12 | 0.011711 | Vehicle.Size.Large |
14 | 0.009819 | Vehicle.Size.Small |
13 | 0.009239 | Vehicle.Size.Medsize |
49 | 0.008188 | Is.Female |
45 | 0.007861 | Education.Master |
19 | 0.006989 | Vehicle.Class.Sports.Car |
10 | 0.006740 | Sales.Channel.Call.Center |
35 | 0.006400 | EmploymentStatus.Medical.Leave |
9 | 0.005658 | Sales.Channel.Branch |
46 | 0.005472 | Coverage.Basic |
44 | 0.005377 | Education.High.School.or.Below |
17 | 0.005256 | Vehicle.Class.Luxury.SUV |
11 | 0.004624 | Sales.Channel.Web |
42 | 0.004611 | Education.College |
15 | 0.004577 | Vehicle.Class.Four-Door.Car |
33 | 0.004540 | EmploymentStatus.Disabled |
47 | 0.004140 | Coverage.Extended |
43 | 0.003682 | Education.Doctor |
29 | 0.003588 | Policy.Special.L3 |
41 | 0.003548 | Education.Bachelor |
18 | 0.003323 | Vehicle.Class.SUV |
20 | 0.003075 | Vehicle.Class.Two-Door.Car |
48 | 0.002100 | Coverage.Premium |
31 | 0.002076 | Policy.Type.Personal.Auto |
23 | 0.001648 | Policy.Corporate.L3 |
32 | 0.001525 | Policy.Type.Special.Auto |
22 | 0.001514 | Policy.Corporate.L2 |
25 | 0.001445 | Policy.Personal.L2 |
30 | 0.001382 | Policy.Type.Corporate.Auto |
26 | 0.001332 | Policy.Personal.L3 |
27 | 0.001234 | Policy.Special.L1 |
21 | 0.001023 | Policy.Corporate.L1 |
24 | 0.000974 | Policy.Personal.L1 |
16 | 0.000674 | Vehicle.Class.Luxury.Car |
28 | 0.000628 | Policy.Special.L2 |
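Since this table is long, a bar chart of the top features is often easier to read; here is a minimal matplotlib sketch (the figure size and cutoff of ten features are arbitrary choices):
top10 = feature_importance_df.sort_values(by='feature.importance', ascending=True).tail(10)
plt.figure(figsize=(8, 5))
plt.barh(top10['feature'], top10['feature.importance'])  # horizontal bars, most important feature on top
plt.xlabel('feature.importance')
plt.title('Top 10 Feature Importances')
plt.show()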
Accuracy, Precision, and Recall
in_sample_preds = rf_model.predict(x_train)
out_sample_preds = rf_model.predict(x_test)
print('In-Sample Accuracy: %0.4f' % accuracy_score(y_train, in_sample_preds))
print('Out-of-Sample Accuracy: %0.4f' % accuracy_score(y_test, out_sample_preds))
In-Sample Accuracy: 0.8724
Out-of-Sample Accuracy: 0.8767
print('In-Sample Precision: %0.4f' % precision_score(y_train, in_sample_preds))
print('Out-of-Sample Precision: %0.4f' % precision_score(y_test, out_sample_preds))
In-Sample Precision: 0.9907
Out-of-Sample Precision: 1.0000
print('In-Sample Recall: %0.4f' % recall_score(y_train, in_sample_preds))
print('Out-of-Sample Recall: %0.4f' % recall_score(y_test, out_sample_preds))
In-Sample Recall: 0.1151
Out-of-Sample Recall: 0.1266
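This pattern of near-perfect precision with low recall is typical of an imbalanced outcome: with an engagement rate of only 14.32%, the model is conservative about predicting the positive class at the default 0.5 threshold. If recall matters more for your campaign, you can lower the decision threshold; a minimal sketch with a hypothetical threshold of 0.3:
proba = rf_model.predict_proba(x_test)[:, 1]
preds_at_03 = (proba > 0.3).astype(int)  # predict positive whenever the engagement probability exceeds 0.3
print('Out-of-Sample Precision @0.3: %0.4f' % precision_score(y_test, preds_at_03))
print('Out-of-Sample Recall @0.3: %0.4f' % recall_score(y_test, preds_at_03))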
ROC & AUC
in_sample_preds = rf_model.predict_proba(x_train)[:,1]
out_sample_preds = rf_model.predict_proba(x_test)[:,1]
in_sample_fpr, in_sample_tpr, in_sample_thresholds = roc_curve(y_train, in_sample_preds)
out_sample_fpr, out_sample_tpr, out_sample_thresholds = roc_curve(y_test, out_sample_preds)
in_sample_roc_auc = auc(in_sample_fpr, in_sample_tpr)
out_sample_roc_auc = auc(out_sample_fpr, out_sample_tpr)
print('In-Sample AUC: %0.4f' % in_sample_roc_auc)
print('Out-Sample AUC: %0.4f' % out_sample_roc_auc)
In-Sample AUC: 0.8768
Out-Sample AUC: 0.8352
plt.figure(figsize=(10,7))
plt.plot(
    out_sample_fpr, out_sample_tpr, color='darkorange', label='Out-Sample ROC curve (area = %0.4f)' % out_sample_roc_auc
)
plt.plot(
    in_sample_fpr, in_sample_tpr, color='navy', label='In-Sample ROC curve (area = %0.4f)' % in_sample_roc_auc
)
plt.plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--')
plt.grid()
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('RandomForest Model ROC Curve')
plt.legend(loc="lower right")
plt.show()