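AUC is simply the total area under the ROC curve. It ranges from 0 to 1, and a higher value indicates better model performance. A random classifier has an AUC of 0.5, so any classifier with an AUC above 0.5 performs better than random guessing.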
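To make the 0.5 baseline concrete, AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python. The function name `pairwise_auc` and the toy labels and scores are illustrative assumptions, not part of the original notebook.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This O(n^2) form is for illustration only.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]

perfect = pairwise_auc(labels, [0.1, 0.2, 0.8, 0.9])   # every positive outranks every negative
constant = pairwise_auc(labels, [0.5, 0.5, 0.5, 0.5])  # uninformative scores, all ties
inverted = pairwise_auc(labels, [0.9, 0.8, 0.2, 0.1])  # ranking is exactly backwards

print(perfect, constant, inverted)
```

Scores that carry no information land at 0.5, and a value below 0.5 means the ranking is systematically reversed.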
Below we again use the Kaggle dataset from part 06: WA_Fn-UseC_-Marketing-Customer-Value-Analysis.csv
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import roc_curve, auc
%matplotlib inline
df = pd.read_csv('../input/ibm-watson-marketing-customer-value-data/WA_Fn-UseC_-Marketing-Customer-Value-Analysis.csv')
 | Customer | State | Customer Lifetime Value | Response | Coverage | Education | Effective To Date | EmploymentStatus | Gender | Income | ... | Months Since Policy Inception | Number of Open Complaints | Number of Policies | Policy Type | Policy | Renew Offer Type | Sales Channel | Total Claim Amount | Vehicle Class | Vehicle Size |
0 | BU79786 | Washington | 2763.519279 | No | Basic | Bachelor | 2/24/11 | Employed | F | 56274 | ... | 5 | 0 | 1 | Corporate Auto | Corporate L3 | Offer1 | Agent | 384.811147 | Two-Door Car | Medsize |
1 | QZ44356 | Arizona | 6979.535903 | No | Extended | Bachelor | 1/31/11 | Unemployed | F | 0 | ... | 42 | 0 | 8 | Personal Auto | Personal L3 | Offer3 | Agent | 1131.464935 | Four-Door Car | Medsize |
2 | AI49188 | Nevada | 12887.431650 | No | Premium | Bachelor | 2/19/11 | Employed | F | 48767 | ... | 38 | 0 | 2 | Personal Auto | Personal L3 | Offer1 | Agent | 566.472247 | Two-Door Car | Medsize |
3 rows × 24 columns
df['Engaged'] = df['Response'].apply(lambda x: 1 if x == 'Yes' else 0)
Engagement rate is 14.32%
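The code that printed the engagement rate is not shown. One way such a figure could be obtained is as the mean of the 0/1 `Engaged` flag. The sketch below uses a plain Python list of made-up responses rather than the real `df`, and `engagement_rate` is a hypothetical helper.

```python
def engagement_rate(responses):
    """Share of customers whose response is 'Yes', mirroring the Engaged encoding."""
    engaged = [1 if r == 'Yes' else 0 for r in responses]
    return sum(engaged) / len(engaged)


# Hypothetical sample: 1 'Yes' out of 8 responses.
toy_responses = ['No', 'No', 'Yes', 'No', 'No', 'No', 'No', 'No']
rate = engagement_rate(toy_responses)
print('Engagement rate is %0.2f%%' % (rate * 100))
```

On the real data the equivalent would presumably be `df['Engaged'].mean()`, which is consistent with the mean of 0.143201 reported for `Engaged` in the summary statistics that follow.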
 | Customer Lifetime Value | Income | Monthly Premium Auto | Months Since Last Claim | Months Since Policy Inception | Number of Open Complaints | Number of Policies | Total Claim Amount | Engaged |
count | 9134.000000 | 9134.000000 | 9134.000000 | 9134.000000 | 9134.000000 | 9134.000000 | 9134.000000 | 9134.000000 | 9134.000000 |
mean | 8004.940475 | 37657.380009 | 93.219291 | 15.097000 | 48.064594 | 0.384388 | 2.966170 | 434.088794 | 0.143201 |
std | 6870.967608 | 30379.904734 | 34.407967 | 10.073257 | 27.905991 | 0.910384 | 2.390182 | 290.500092 | 0.350297 |
min | 1898.007675 | 0.000000 | 61.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.099007 | 0.000000 |
25% | 3994.251794 | 0.000000 | 68.000000 | 6.000000 | 24.000000 | 0.000000 | 1.000000 | 272.258244 | 0.000000 |
50% | 5780.182197 | 33889.500000 | 83.000000 | 14.000000 | 48.000000 | 0.000000 | 2.000000 | 383.945434 | 0.000000 |
75% | 8962.167041 | 62320.000000 | 109.000000 | 23.000000 | 71.000000 | 0.000000 | 4.000000 | 547.514839 | 0.000000 |
max | 83325.381190 | 99981.000000 | 298.000000 | 35.000000 | 99.000000 | 5.000000 | 9.000000 | 2893.239678 | 1.000000 |
continuous_features = [
    'Customer Lifetime Value', 'Income', 'Monthly Premium Auto',
    'Months Since Last Claim', 'Months Since Policy Inception',
    'Number of Open Complaints', 'Number of Policies', 'Total Claim Amount'
]
Creating Dummy Variables
columns_to_encode = [
    'Sales Channel', 'Vehicle Size', 'Vehicle Class', 'Policy', 'Policy Type',
    'EmploymentStatus', 'Marital Status', 'Education', 'Coverage'
]
categorical_features = []
for col in columns_to_encode:
    encoded_df = pd.get_dummies(df[col])
    encoded_df.columns = [col.replace(' ', '.') + '.' + x for x in encoded_df.columns]
    categorical_features += list(encoded_df.columns)
    df = pd.concat([df, encoded_df], axis=1)
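To illustrate what the loop above produces for a single column, here is a minimal plain-Python sketch of one-hot encoding that follows the same naming convention (spaces in the column name become dots, then the category value is appended). `one_hot` and the three sample values are assumptions for illustration; the notebook itself relies on `pd.get_dummies`.

```python
def one_hot(column_name, values):
    """Return {dummy_column_name: [0/1, ...]} for one categorical column.

    Column names follow the notebook's convention: spaces in the source
    column name become dots, then '.' and the category value are appended.
    """
    prefix = column_name.replace(' ', '.')
    categories = sorted(set(values))  # sorted only to give a stable column order
    return {
        prefix + '.' + category: [1 if v == category else 0 for v in values]
        for category in categories
    }


# Hypothetical three-row sample of the 'Sales Channel' column.
encoded = one_hot('Sales Channel', ['Agent', 'Web', 'Agent'])
for name, flags in encoded.items():
    print(name, flags)
```

Each row has exactly one 1 across the dummy columns of a given source column, which is why the encoded columns can be fed to the model as numeric features.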
Encoding Gender
df['Is.Female'] = df['Gender'].apply(lambda x: 1 if x == 'F' else 0)
categorical_features.append('Is.Female')  # needed so Is.Female is included in all_features
All features together
all_features = continuous_features + categorical_features
response = 'Engaged'
sample_df = df[all_features + [response]]
sample_df.columns = [x.replace(' ', '.') for x in sample_df.columns]
all_features = [x.replace(' ', '.') for x in all_features]
 | Customer.Lifetime.Value | Income | Monthly.Premium.Auto | Months.Since.Last.Claim | Months.Since.Policy.Inception | Number.of.Open.Complaints | Number.of.Policies | Total.Claim.Amount | Sales.Channel.Agent | Sales.Channel.Branch | ... | Education.Bachelor | Education.College | Education.Doctor | Education.High.School.or.Below | Education.Master | Coverage.Basic | Coverage.Extended | Coverage.Premium | Is.Female | Engaged |
0 | 2763.519279 | 56274 | 69 | 32 | 5 | 0 | 1 | 384.811147 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
1 | 6979.535903 | 0 | 94 | 13 | 42 | 0 | 8 | 1131.464935 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
2 | 12887.431650 | 48767 | 108 | 18 | 38 | 0 | 2 | 566.472247 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
3 rows × 51 columns
x_train, x_test, y_train, y_test = train_test_split(sample_df[all_features], sample_df[response], test_size=0.3)
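`train_test_split` shuffles the rows and holds out 30% of them for testing. Because no `random_state` is passed, each run yields a different split, so the metrics reported later will vary slightly between runs. The plain-Python sketch below shows the idea of a seeded 70/30 shuffle split. `shuffle_split` is a hypothetical helper, not scikit-learn's implementation.

```python
import random


def shuffle_split(rows, test_size=0.3, seed=0):
    """Shuffle row indices with a fixed seed and hold out a test fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))
train_a, test_a = shuffle_split(rows, test_size=0.3, seed=42)
train_b, test_b = shuffle_split(rows, test_size=0.3, seed=42)

print(len(train_a), len(test_a))   # sizes of the two parts
print(test_a == test_b)            # same seed gives the same split
```

With scikit-learn, passing an integer `random_state` to `train_test_split` fixes the split in the same way.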
Building RandomForest Model
rf_model = RandomForestClassifier(
    # NOTE: the hyperparameters passed here are missing from the source
)
rf_model.fit(X=x_train, y=y_train)
Show 5 single trees
Feature Importances
feature_importance_df = pd.DataFrame(list(zip(rf_model.feature_importances_, all_features)))
feature_importance_df.columns = ['feature.importance', 'feature']
feature_importance_df.sort_values(by='feature.importance', ascending=False)
 | feature.importance | feature |
36 | 0.261772 | EmploymentStatus.Retired |
1 | 0.081200 | Income |
7 | 0.074369 | Total.Claim.Amount |
0 | 0.072597 | Customer.Lifetime.Value |
2 | 0.064953 | Monthly.Premium.Auto |
38 | 0.059392 | Marital.Status.Divorced |
4 | 0.043404 | Months.Since.Policy.Inception |
8 | 0.037354 | Sales.Channel.Agent |
3 | 0.033287 | Months.Since.Last.Claim |
34 | 0.025730 | EmploymentStatus.Employed |
37 | 0.025423 | EmploymentStatus.Unemployed |
6 | 0.022189 | Number.of.Policies |
39 | 0.019070 | Marital.Status.Married |
40 | 0.017501 | Marital.Status.Single |
5 | 0.015790 | Number.of.Open.Complaints |
12 | 0.011711 | Vehicle.Size.Large |
14 | 0.009819 | Vehicle.Size.Small |
13 | 0.009239 | Vehicle.Size.Medsize |
49 | 0.008188 | Is.Female |
45 | 0.007861 | Education.Master |
19 | 0.006989 | Vehicle.Class.Sports.Car |
10 | 0.006740 | Sales.Channel.Call.Center |
35 | 0.006400 | EmploymentStatus.Medical.Leave |
9 | 0.005658 | Sales.Channel.Branch |
46 | 0.005472 | Coverage.Basic |
44 | 0.005377 | Education.High.School.or.Below |
17 | 0.005256 | Vehicle.Class.Luxury.SUV |
11 | 0.004624 | Sales.Channel.Web |
42 | 0.004611 | Education.College |
15 | 0.004577 | Vehicle.Class.Four-Door.Car |
33 | 0.004540 | EmploymentStatus.Disabled |
47 | 0.004140 | Coverage.Extended |
43 | 0.003682 | Education.Doctor |
29 | 0.003588 | Policy.Special.L3 |
41 | 0.003548 | Education.Bachelor |
18 | 0.003323 | Vehicle.Class.SUV |
20 | 0.003075 | Vehicle.Class.Two-Door.Car |
48 | 0.002100 | Coverage.Premium |
31 | 0.002076 | Policy.Type.Personal.Auto |
23 | 0.001648 | Policy.Corporate.L3 |
32 | 0.001525 | Policy.Type.Special.Auto |
22 | 0.001514 | Policy.Corporate.L2 |
25 | 0.001445 | Policy.Personal.L2 |
30 | 0.001382 | Policy.Type.Corporate.Auto |
26 | 0.001332 | Policy.Personal.L3 |
27 | 0.001234 | Policy.Special.L1 |
21 | 0.001023 | Policy.Corporate.L1 |
24 | 0.000974 | Policy.Personal.L1 |
16 | 0.000674 | Vehicle.Class.Luxury.Car |
28 | 0.000628 | Policy.Special.L2 |
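Random forest importances in scikit-learn are normalized to sum to 1, so each value can be read as a share of the total. As a small worked example, the sketch below adds up the five largest values from the table above. The numbers are copied from that output, and the variable names are illustrative.

```python
# Top five rows of the feature importance table, copied from the output above.
top_features = [
    ('EmploymentStatus.Retired', 0.261772),
    ('Income', 0.081200),
    ('Total.Claim.Amount', 0.074369),
    ('Customer.Lifetime.Value', 0.072597),
    ('Monthly.Premium.Auto', 0.064953),
]

cumulative = 0.0
for name, importance in top_features:
    cumulative += importance
    print('%-28s %0.4f  (cumulative %0.4f)' % (name, importance, cumulative))
```

In this run the five features account for a bit over half of the total importance, with EmploymentStatus.Retired alone contributing about 26%.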
Accuracy, Precision, and Recall
in_sample_preds = rf_model.predict(x_train)
out_sample_preds = rf_model.predict(x_test)
print('In-Sample Accuracy: %0.4f' % accuracy_score(y_train, in_sample_preds))
print('Out-of-Sample Accuracy: %0.4f' % accuracy_score(y_test, out_sample_preds))
In-Sample Accuracy: 0.8724
Out-of-Sample Accuracy: 0.8767
print('In-Sample Precision: %0.4f' % precision_score(y_train, in_sample_preds))
print('Out-of-Sample Precision: %0.4f' % precision_score(y_test, out_sample_preds))
In-Sample Precision: 0.9907
Out-of-Sample Precision: 1.0000
print('In-Sample Recall: %0.4f' % recall_score(y_train, in_sample_preds))
print('Out-of-Sample Recall: %0.4f' % recall_score(y_test, out_sample_preds))
In-Sample Recall: 0.1151
Out-of-Sample Recall: 0.1266
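Near-perfect precision combined with recall around 12% means the model labels very few customers as engaged, but is almost always right when it does. The sketch below reproduces that pattern from confusion-matrix counts. The counts are invented for illustration and are not taken from the notebook's test set.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts: 1000 customers, 140 truly engaged,
# and a cautious model that flags only 20 of them, all correctly.
tp, fp, fn, tn = 20, 0, 120, 860
accuracy, precision, recall = classification_metrics(tp, fp, fn, tn)

print('Accuracy:  %0.4f' % accuracy)
print('Precision: %0.4f' % precision)
print('Recall:    %0.4f' % recall)
```

A model that predicted "not engaged" for everyone would already reach 86% accuracy on these counts, which is why accuracy alone says little on an imbalanced target.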
in_sample_preds = rf_model.predict_proba(x_train)[:,1]
out_sample_preds = rf_model.predict_proba(x_test)[:,1]
in_sample_fpr, in_sample_tpr, in_sample_thresholds = roc_curve(y_train, in_sample_preds)
out_sample_fpr, out_sample_tpr, out_sample_thresholds = roc_curve(y_test, out_sample_preds)
in_sample_roc_auc = auc(in_sample_fpr, in_sample_tpr)
out_sample_roc_auc = auc(out_sample_fpr, out_sample_tpr)
print('In-Sample AUC: %0.4f' % in_sample_roc_auc)
print('Out-Sample AUC: %0.4f' % out_sample_roc_auc)
In-Sample AUC: 0.8768
Out-Sample AUC: 0.8352
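`roc_curve` sweeps a decision threshold over the predicted probabilities and records the false positive rate and true positive rate at each step; `auc` then integrates the resulting curve with the trapezoidal rule. A minimal plain-Python sketch of that idea, on made-up scores, is shown below. The helper names `roc_points` and `trapezoid_auc` are hypothetical, and the code is not scikit-learn's implementation.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists, one point per distinct threshold, highest first."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    fpr, tpr = [0.0], [0.0]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return area


# Toy labels and predicted probabilities (illustrative only).
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

fpr, tpr = roc_points(labels, scores)
print(list(zip(fpr, tpr)))
print('AUC: %0.4f' % trapezoid_auc(fpr, tpr))
```

In this toy data, 8 of the 9 positive-negative pairs are ranked correctly, and the trapezoidal area comes out to the same 8/9.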
# each label reports the AUC of the curve it describes
plt.plot(out_sample_fpr, out_sample_tpr, color='darkorange', label='Out-Sample ROC curve (area = %0.4f)' % out_sample_roc_auc)
plt.plot(in_sample_fpr, in_sample_tpr, color='navy', label='In-Sample ROC curve (area = %0.4f)' % in_sample_roc_auc)
plt.plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('RandomForest Model ROC Curve')
plt.legend(loc="lower right")
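Each point on the ROC curve corresponds to one probability cutoff, whereas the precision and recall reported earlier come from `predict`, which for a two-class forest effectively applies a 0.5 cutoff to the averaged class probabilities. The sketch below, on invented probabilities, shows how lowering the cutoff trades precision for recall. `metrics_at` is a hypothetical helper.

```python
def metrics_at(labels, probabilities, cutoff):
    """Precision and recall when predicting 1 for probability >= cutoff."""
    preds = [1 if p >= cutoff else 0 for p in probabilities]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented labels and predicted probabilities of the positive class.
labels        = [1,    1,    1,    1,    0,    0,    0,    0,    0,    0]
probabilities = [0.80, 0.40, 0.30, 0.10, 0.35, 0.20, 0.15, 0.10, 0.05, 0.05]

for cutoff in (0.5, 0.3, 0.1):
    precision, recall = metrics_at(labels, probabilities, cutoff)
    print('cutoff %.1f  precision %.2f  recall %.2f' % (cutoff, precision, recall))
```

With the real model, the same comparison could be made by thresholding the `predict_proba` output at different cutoffs instead of calling `predict`.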