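Customer churn occurs when a customer decides to stop using a company's services, content, or products. In customer analytics, retaining an existing customer is generally much cheaper than acquiring a new one, and repeat customers typically bring in more revenue than new customers. In highly competitive industries, where a business faces many rivals, acquiring new customers costs even more, so retaining existing customers becomes increasingly important. Customers leave for many reasons. Common ones include poor customer service, not finding enough value in the product or service, lack of communication, and lack of customer loyalty. The first step toward retaining these customers is to monitor the churn rate over time. If the churn rate is consistently high or rising over time, it is worth devoting resources to improving customer retention.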
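Monitoring the churn rate over time can be as simple as dividing the customers lost in a period by the customers present at its start. The sketch below assumes a hypothetical monthly snapshot table; the column names and numbers are invented for illustration and do not come from the Telco data set used later.

```python
import pandas as pd

# Hypothetical monthly snapshot: customers active at the start of each month
# and how many of them cancelled during that month (made-up numbers).
snapshots = pd.DataFrame({
    "month": ["2020-01", "2020-02", "2020-03", "2020-04"],
    "customers_at_start": [1000, 980, 970, 950],
    "customers_lost": [40, 45, 52, 60],
})

# Churn rate = customers lost during the period / customers at its start.
snapshots["churn_rate"] = snapshots["customers_lost"] / snapshots["customers_at_start"]

# Flag a worsening trend: the rate rose in every consecutive month.
rising = snapshots["churn_rate"].diff().dropna().gt(0).all()

print(snapshots[["month", "churn_rate"]].round(4))
print("Churn rate rising every month:", rising)
```

A rate that climbs month after month, as in this toy table, is the kind of signal that would justify spending on retention.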
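To improve retention, the first priority is to understand customers better. We can survey customers who have already churned to learn why they left. We can also survey existing customers to learn their needs and pain points. For example, we can look at customers' web activity data to see where they spend the most time, whether the pages they view contain errors, or whether their searches fail to return useful content. We can also review customer service call logs to see how long customers waited, what they complained about, and how their issues were handled. Analyzing these data points in depth can reveal the problems a business faces in retaining its existing customers.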
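In this article, we build a machine learning model that predicts which customers are likely to churn, so that we can target and retain the specific customers at higher risk of churning. We will use a neural network. An artificial neural network (ANN) is a machine learning model inspired by how the human brain works. Recent successful applications of ANNs in image recognition, speech recognition, and robotics have demonstrated their predictive power and usefulness across industries. You may have heard the term "deep learning". It refers to ANN models with many layers between the input and output layers.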
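This figure shows the simple case of an ANN with one hidden layer. The circles represent artificial neurons, or nodes, which mimic the neurons in the human brain. The arrows show how signals travel from one neuron to another. As the figure suggests, an ANN learns by finding the patterns, or weights, of the signals passing from each input neuron to the neurons in the next layer that best predict the output.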
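The signal flow described above can be sketched in NumPy. This is a minimal illustration, not the Keras model built later: the layer sizes, the random weights, and the choice of ReLU and sigmoid activations are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """One hidden layer: weighted sum -> ReLU -> weighted sum -> sigmoid."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # signals arriving at the hidden neurons
    return sigmoid(hidden @ W2 + b2)       # signal arriving at the output neuron

rng = np.random.default_rng(0)
n_inputs, n_hidden = 3, 4                  # illustrative sizes only

# Untrained weights; training would adjust these to best predict the output.
W1 = rng.normal(size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, 1))
b2 = np.zeros(1)

x = np.array([[0.5, -1.2, 0.3],
              [1.0, 0.0, -0.7]])           # two example input rows
probs = forward(x, W1, b1, W2, b2)
print(probs.shape)
```

Each arrow in the figure corresponds to one entry of `W1` or `W2`; learning means adjusting those entries.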
Below we again use the Kaggle dataset WA_Fn-UseC_-Telco-Customer-Churn.csv and build a neural network with Keras.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import roc_curve, auc
%matplotlib inline
df = pd.read_csv('../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head(3)
| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
3 rows × 21 columns
df.shape
(7043, 21)
Encode the target variable: Churn
df['Churn'] = df['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)
df.Churn.mean()
0.2653698707936959
Clean TotalCharges: convert to float and drop rows with missing values
df['TotalCharges'] = df['TotalCharges'].replace(' ', np.nan).astype(float)
df = df.dropna()
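The cleaning step above turns blank strings into NaN before casting to float, and the row counts in the outputs (7043 before, 7032 after) indicate that 11 rows are dropped. A roughly equivalent approach, shown here on a small made-up frame rather than the real data, is `pd.to_numeric` with `errors="coerce"`, which also turns any other unparseable value into NaN. Counting the blanks first makes it explicit how many rows the drop will remove.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table; the values are illustrative only.
toy = pd.DataFrame({
    "customerID": ["a", "b", "c", "d"],
    "TotalCharges": ["29.85", " ", "1889.5", "108.15"],
})

# Count blank entries before converting, so the number of dropped rows is known.
n_blank = (toy["TotalCharges"].str.strip() == "").sum()

# Unparseable strings (including the blank one) become NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Drop only the rows where TotalCharges is missing.
clean = toy.dropna(subset=["TotalCharges"])

print(n_blank, len(clean), clean["TotalCharges"].dtype)
```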
Inspect the continuous variables
df[['tenure', 'MonthlyCharges', 'TotalCharges']].describe()
| | tenure | MonthlyCharges | TotalCharges |
|---|---|---|---|
| count | 7032.000000 | 7032.000000 | 7032.000000 |
| mean | 32.421786 | 64.798208 | 2283.300441 |
| std | 24.545260 | 30.085974 | 2266.771362 |
| min | 1.000000 | 18.250000 | 18.800000 |
| 25% | 9.000000 | 35.587500 | 401.450000 |
| 50% | 29.000000 | 70.350000 | 1397.475000 |
| 75% | 55.000000 | 89.862500 | 3794.737500 |
| max | 72.000000 | 118.750000 | 8684.800000 |
Normalize the continuous variables
df['MonthlyCharges'] = np.log(df['MonthlyCharges'])
df['MonthlyCharges'] = (df['MonthlyCharges'] - df['MonthlyCharges'].mean())/df['MonthlyCharges'].std()
df['TotalCharges'] = np.log(df['TotalCharges'])
df['TotalCharges'] = (df['TotalCharges'] - df['TotalCharges'].mean())/df['TotalCharges'].std()
df['tenure'] = (df['tenure'] - df['tenure'].mean())/df['tenure'].std()
df[['tenure', 'MonthlyCharges', 'TotalCharges']].describe()
| | tenure | MonthlyCharges | TotalCharges |
|---|---|---|---|
| count | 7.032000e+03 | 7.032000e+03 | 7.032000e+03 |
| mean | -1.028756e-16 | 4.688495e-14 | 7.150708e-15 |
| std | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 |
| min | -1.280157e+00 | -1.882268e+00 | -2.579056e+00 |
| 25% | -9.542285e-01 | -7.583727e-01 | -6.080585e-01 |
| 50% | -1.394072e-01 | 3.885103e-01 | 1.950521e-01 |
| 75% | 9.198605e-01 | 8.004829e-01 | 8.382338e-01 |
| max | 1.612459e+00 | 1.269576e+00 | 1.371323e+00 |
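The normalization above takes a log of the two charge columns and then standardizes each column to zero mean and unit standard deviation, using statistics from the whole data set. The sketch below wraps the same transform in two hypothetical helper functions (`fit_log_zscore` and `apply_log_zscore` are names invented here) so that the mean and standard deviation can be learned from one set of rows and reused on another. Computing them from training rows only is a common variation, not what the article's code does.

```python
import numpy as np

def fit_log_zscore(train_values):
    """Learn the mean and sample std of the log-transformed training values."""
    logged = np.log(train_values)
    return logged.mean(), logged.std(ddof=1)  # ddof=1 matches pandas' .std()

def apply_log_zscore(values, mean, std):
    """Log-transform, then standardize with previously learned statistics."""
    return (np.log(values) - mean) / std

# Illustrative monthly charges (positive values, as the log requires).
train = np.array([18.25, 35.59, 70.35, 89.86, 118.75])
new = np.array([29.85, 56.95])

mean, std = fit_log_zscore(train)
train_scaled = apply_log_zscore(train, mean, std)
new_scaled = apply_log_zscore(new, mean, std)  # reuses the training statistics

print(train_scaled.round(3))
print(new_scaled.round(3))
```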
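# describe() summarizes every numeric column, so the binary SeniorCitizen and Churn columns are listed too
continuous_vars = list(df.describe().columns)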
continuous_vars
['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges', 'Churn']
One-Hot Encoding
for col in list(df.columns):
    print(col, df[col].nunique())
customerID 7032
gender 2
SeniorCitizen 2
Partner 2
Dependents 2
tenure 72
PhoneService 2
MultipleLines 3
InternetService 3
OnlineSecurity 3
OnlineBackup 3
DeviceProtection 3
TechSupport 3
StreamingTV 3
StreamingMovies 3
Contract 3
PaperlessBilling 2
PaymentMethod 4
MonthlyCharges 1584
TotalCharges 6530
Churn 2
df.groupby('gender').count()['customerID'].plot(
    kind='bar', color='skyblue', grid=True, figsize=(8,6), title='Gender'
)
plt.show()
df.groupby('InternetService').count()['customerID'].plot(
    kind='bar', color='skyblue', grid=True, figsize=(8,6), title='Internet Service'
)
plt.show()
df.groupby('PaymentMethod').count()['customerID'].plot(
    kind='bar', color='skyblue', grid=True, figsize=(8,6), title='Payment Method'
)
plt.show()
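# Start from the continuous features and the target, then append the one-hot columns
sample_set = df[['tenure', 'MonthlyCharges', 'TotalCharges', 'Churn']].copy(deep=True)
for col in list(df.columns):
    # Treat every remaining column with fewer than 5 distinct values as categorical
    if col not in ['tenure', 'MonthlyCharges', 'TotalCharges', 'Churn'] and df[col].nunique() < 5:
        # dtype=int keeps the dummies as 0/1 (recent pandas versions default to bool)
        dummy_vars = pd.get_dummies(df[col], dtype=int)
        dummy_vars.columns = [col+str(x) for x in dummy_vars.columns]
        sample_set = pd.concat([sample_set, dummy_vars], axis=1)
sample_set.head()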
| | tenure | MonthlyCharges | TotalCharges | Churn | genderFemale | genderMale | SeniorCitizen0 | SeniorCitizen1 | PartnerNo | PartnerYes | ... | StreamingMoviesYes | ContractMonth-to-month | ContractOne year | ContractTwo year | PaperlessBillingNo | PaperlessBillingYes | PaymentMethodBank transfer (automatic) | PaymentMethodCredit card (automatic) | PaymentMethodElectronic check | PaymentMethodMailed check |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.280157 | -1.054244 | -2.281382 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 0.064298 | 0.032896 | 0.389269 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | -1.239416 | -0.061298 | -1.452520 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 0.512450 | -0.467578 | 0.372439 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | -1.239416 | 0.396862 | -1.234860 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
5 rows × 47 columns
list(sample_set.columns)
['tenure',
'MonthlyCharges',
'TotalCharges',
'Churn',
'genderFemale',
'genderMale',
'SeniorCitizen0',
'SeniorCitizen1',
'PartnerNo',
'PartnerYes',
'DependentsNo',
'DependentsYes',
'PhoneServiceNo',
'PhoneServiceYes',
'MultipleLinesNo',
'MultipleLinesNo phone service',
'MultipleLinesYes',
'InternetServiceDSL',
'InternetServiceFiber optic',
'InternetServiceNo',
'OnlineSecurityNo',
'OnlineSecurityNo internet service',
'OnlineSecurityYes',
'OnlineBackupNo',
'OnlineBackupNo internet service',
'OnlineBackupYes',
'DeviceProtectionNo',
'DeviceProtectionNo internet service',
'DeviceProtectionYes',
'TechSupportNo',
'TechSupportNo internet service',
'TechSupportYes',
'StreamingTVNo',
'StreamingTVNo internet service',
'StreamingTVYes',
'StreamingMoviesNo',
'StreamingMoviesNo internet service',
'StreamingMoviesYes',
'ContractMonth-to-month',
'ContractOne year',
'ContractTwo year',
'PaperlessBillingNo',
'PaperlessBillingYes',
'PaymentMethodBank transfer (automatic)',
'PaymentMethodCredit card (automatic)',
'PaymentMethodElectronic check',
'PaymentMethodMailed check']
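The column names listed above are the original column name glued to each category value. As a small sketch on a made-up `Contract` column, `pd.get_dummies` can produce the same naming directly through its `prefix` and `prefix_sep` arguments, which avoids renaming the columns by hand. The `dtype=int` argument is added here so the result is 0/1 rather than booleans.

```python
import pandas as pd

# Toy categorical column; the values mirror the Contract categories seen above.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
})

# One indicator column per category, named <column><value> with no separator.
dummies = pd.get_dummies(toy["Contract"], prefix="Contract", prefix_sep="", dtype=int)

print(list(dummies.columns))
# Exactly one indicator is set in every row.
print(dummies.sum(axis=1).tolist())
```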
target_var = 'Churn'
features = [x for x in list(sample_set.columns) if x != target_var]
model = Sequential()
model.add(Dense(16, input_dim=len(features), activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
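The network takes input_dim = len(features) inputs, followed by a hidden layer of 16 units with ReLU activation, a second hidden layer of 8 units with ReLU activation, and a single output unit with sigmoid activation that outputs the churn probability.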
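To get a feel for the size of this network, the number of trainable parameters can be worked out by hand: a dense layer has one weight per input-output pair plus one bias per output. The sketch below assumes 46 input features, which is the 47 columns of `sample_set` minus the `Churn` target; `dense_params` is a helper name invented for this example.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

n_features = 46                       # assumption: 47 columns minus the Churn target
layer_sizes = [n_features, 16, 8, 1]  # input -> Dense(16) -> Dense(8) -> Dense(1)

# Pair each layer's input size with its output size.
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)  # [752, 136, 9] 897
```

With fewer than a thousand parameters this is a very small network, which is consistent with the fast epochs in the training log.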
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
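The `binary_crossentropy` loss chosen above penalizes confident wrong probabilities heavily. The NumPy sketch below is a hand-written illustration of the formula, not Keras's own implementation; the clipping constant `eps` is an assumption added to avoid taking the log of zero.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(losses.mean())

y_true = np.array([1, 0, 0, 1])
confident = np.array([0.9, 0.1, 0.2, 0.8])  # mostly right and fairly sure
uninformed = np.full(4, 0.5)                # predicts 0.5 for everyone

loss_confident = binary_crossentropy(y_true, confident)
loss_uninformed = binary_crossentropy(y_true, uninformed)

print(round(loss_confident, 4), round(loss_uninformed, 4))
```

A model that outputs 0.5 for every customer scores ln 2, about 0.693, which is close to the loss reported in the first epoch of the training log before the weights have learned much.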
X_train, X_test, y_train, y_test = train_test_split(
    sample_set[features],
    sample_set[target_var],
    test_size=0.3
)
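The split above is random on every run, so the metrics reported later will vary slightly between executions. As an optional variation, sketched on synthetic data rather than the real `sample_set`, `train_test_split` accepts `random_state` for reproducibility and `stratify` to keep the share of churners similar in both parts. The seed value and the synthetic 27% positive rate are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the imbalanced churn target.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.27).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    stratify=y,        # keep the churn share similar in train and test
    random_state=42,   # make the split reproducible
)

print(len(X_train), len(X_test))
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```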
model.fit(X_train, y_train, epochs=50, batch_size=100)
Epoch 1/50
4922/4922 [==============================] - 0s 73us/step - loss: 0.6871 - accuracy: 0.5638
Epoch 2/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.5409 - accuracy: 0.7314
Epoch 3/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.5034 - accuracy: 0.7322
Epoch 4/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4717 - accuracy: 0.7452
Epoch 5/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4404 - accuracy: 0.7926
Epoch 6/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4225 - accuracy: 0.8037
Epoch 7/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4150 - accuracy: 0.8066
Epoch 8/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4113 - accuracy: 0.8070
Epoch 9/50
4922/4922 [==============================] - 0s 14us/step - loss: 0.4083 - accuracy: 0.8098
Epoch 10/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4063 - accuracy: 0.8090
Epoch 11/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4052 - accuracy: 0.8111
Epoch 12/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4037 - accuracy: 0.8090
Epoch 13/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4030 - accuracy: 0.8119
Epoch 14/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4021 - accuracy: 0.8127
Epoch 15/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4014 - accuracy: 0.8108
Epoch 16/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4009 - accuracy: 0.8104
Epoch 17/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4003 - accuracy: 0.8125
Epoch 18/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.4002 - accuracy: 0.8147
Epoch 19/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3987 - accuracy: 0.8133
Epoch 20/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3982 - accuracy: 0.8139
Epoch 21/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3979 - accuracy: 0.8155
Epoch 22/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3976 - accuracy: 0.8137
Epoch 23/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3974 - accuracy: 0.8139
Epoch 24/50
4922/4922 [==============================] - 0s 14us/step - loss: 0.3971 - accuracy: 0.8129
Epoch 25/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3969 - accuracy: 0.8143
Epoch 26/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3970 - accuracy: 0.8135
Epoch 27/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3963 - accuracy: 0.8123
Epoch 28/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3959 - accuracy: 0.8141
Epoch 29/50
4922/4922 [==============================] - 0s 14us/step - loss: 0.3952 - accuracy: 0.8149
Epoch 30/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3948 - accuracy: 0.8153
Epoch 31/50
4922/4922 [==============================] - 0s 14us/step - loss: 0.3954 - accuracy: 0.8153
Epoch 32/50
4922/4922 [==============================] - 0s 14us/step - loss: 0.3948 - accuracy: 0.8163
Epoch 33/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3944 - accuracy: 0.8159
Epoch 34/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3940 - accuracy: 0.8169
Epoch 35/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3941 - accuracy: 0.8178
Epoch 36/50
4922/4922 [==============================] - 0s 14us/step - loss: 0.3938 - accuracy: 0.8161
Epoch 37/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3936 - accuracy: 0.8151
Epoch 38/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3929 - accuracy: 0.8147
Epoch 39/50
4922/4922 [==============================] - 0s 14us/step - loss: 0.3927 - accuracy: 0.8169
Epoch 40/50
4922/4922 [==============================] - 0s 14us/step - loss: 0.3930 - accuracy: 0.8155
Epoch 41/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3922 - accuracy: 0.8169
Epoch 42/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3925 - accuracy: 0.8178
Epoch 43/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3921 - accuracy: 0.8155
Epoch 44/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3915 - accuracy: 0.8182
Epoch 45/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3911 - accuracy: 0.8163
Epoch 46/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3912 - accuracy: 0.8159
Epoch 47/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3909 - accuracy: 0.8178
Epoch 48/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3909 - accuracy: 0.8169
Epoch 49/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3910 - accuracy: 0.8174
Epoch 50/50
4922/4922 [==============================] - 0s 13us/step - loss: 0.3901 - accuracy: 0.8190
Accuracy, Precision, Recall
in_sample_preds = [round(x[0]) for x in model.predict(X_train)]
out_sample_preds = [round(x[0]) for x in model.predict(X_test)]
print('In-Sample Accuracy: %0.4f' % accuracy_score(y_train, in_sample_preds))
print('Out-of-Sample Accuracy: %0.4f' % accuracy_score(y_test, out_sample_preds))
print('\n')
print('In-Sample Precision: %0.4f' % precision_score(y_train, in_sample_preds))
print('Out-of-Sample Precision: %0.4f' % precision_score(y_test, out_sample_preds))
print('\n')
print('In-Sample Recall: %0.4f' % recall_score(y_train, in_sample_preds))
print('Out-of-Sample Recall: %0.4f' % recall_score(y_test, out_sample_preds))
In-Sample Accuracy: 0.8171
Out-of-Sample Accuracy: 0.7991
In-Sample Precision: 0.6946
Out-of-Sample Precision: 0.6440
In-Sample Recall: 0.5660
Out-of-Sample Recall: 0.5154
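Accuracy alone can be misleading here: with roughly 26.5% churners, always predicting "no churn" would already reach about 73% accuracy. Precision and recall look only at the churn class. The sketch below computes all three from the four confusion-matrix counts on a made-up set of ten labels, to show what the `sklearn` functions above measure.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # churners caught
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false alarms
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # churners missed
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # loyal customers left alone
    return tp, fp, fn, tn

# Made-up labels: 4 churners among 10 customers.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # of the customers flagged, how many really churned
recall = tp / (tp + fn)     # of the real churners, how many were flagged

print(tp, fp, fn, tn)
print(accuracy, round(precision, 4), recall)
```

Read this way, the out-of-sample recall of 0.5154 reported above means the model flags only about half of the customers who actually churn.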
ROC & AUC
in_sample_preds = [x[0] for x in model.predict(X_train)]
out_sample_preds = [x[0] for x in model.predict(X_test)]
in_sample_fpr, in_sample_tpr, in_sample_thresholds = roc_curve(y_train, in_sample_preds)
out_sample_fpr, out_sample_tpr, out_sample_thresholds = roc_curve(y_test, out_sample_preds)
in_sample_roc_auc = auc(in_sample_fpr, in_sample_tpr)
out_sample_roc_auc = auc(out_sample_fpr, out_sample_tpr)
print('In-Sample AUC: %0.4f' % in_sample_roc_auc)
print('Out-Sample AUC: %0.4f' % out_sample_roc_auc)
In-Sample AUC: 0.8691
Out-Sample AUC: 0.8314
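One way to read the AUC is as a ranking quality: it equals the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner, with ties counted as half. The sketch below checks this interpretation on six made-up scores by comparing every churner with every non-churner; `auc_by_ranking` is a name invented for this example and is a brute-force illustration, not how `sklearn` computes it.

```python
import numpy as np

def auc_by_ranking(y_true, scores):
    """Share of (positive, negative) pairs where the positive is scored higher."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # every positive vs every negative
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

# Made-up labels and predicted churn probabilities.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.2, 0.8]

pairwise_auc = auc_by_ranking(y_true, scores)
print(round(pairwise_auc, 4))  # 8 of the 9 pairs are ordered correctly
```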
plt.figure(figsize=(10,7))
plt.plot(
    out_sample_fpr, out_sample_tpr, color='darkorange', label='Out-Sample ROC curve (area = %0.4f)' % out_sample_roc_auc
)
plt.plot(
    in_sample_fpr, in_sample_tpr, color='navy', label='In-Sample ROC curve (area = %0.4f)' % in_sample_roc_auc
)
plt.plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--')
plt.grid()
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
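The stated goal of the model is to single out the customers at higher risk of churning. Once predicted probabilities are available, one possible way to do that is to keep the customers whose score exceeds a cut-off and sort them by risk. The sketch below uses a made-up scored table; the customer IDs, the probabilities, and the 0.5 threshold are assumptions, and in practice the threshold would be chosen from the precision and recall trade-off.

```python
import pandas as pd

# Hypothetical model output: one predicted churn probability per customer.
scored = pd.DataFrame({
    "customerID": ["c1", "c2", "c3", "c4", "c5"],
    "churn_prob": [0.12, 0.81, 0.47, 0.66, 0.05],
})

threshold = 0.5  # assumption: same cut-off that rounding the sigmoid output implies

# Keep customers at or above the threshold, highest risk first.
high_risk = (
    scored[scored["churn_prob"] >= threshold]
    .sort_values("churn_prob", ascending=False)
)

print(high_risk["customerID"].tolist())  # ['c2', 'c4']
```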