CDA Analytical Modeling: Building a Product Marketing Model and Making Predictions

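Background:

Company A offers an online service, product P, and markets it entirely through online channels. After a 30-day free trial of P, the company hopes customers will sign up for the paid service. However, Company A found that emailing marketing material to tens of thousands of customers every 1 to 2 days not only yields a low purchase rate but also triggers many complaints. In addition, the expected profit per customer is estimated from experience alone, with no quantitative or model-based tooling, so the company does not know whether it should send ads to everyone or target customers with a machine learning model.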

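This task is essentially the same as mock exam question 2. In addition, a profit matrix has to be derived from the confusion matrix, taking marketing cost and revenue into account.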

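The training and test data total 8,000 rows, of which 6,000 are training data. The meaning of each field is described on the official website.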

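Reading the data

Opening the files shows that many missing values are recorded as '?', and many columns are of object dtype, so they need to be converted.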


# -*- coding: UTF-8 -*-
# Keep the script compatible with both Python 2 and Python 3
from __future__ import print_function

import os   # builds the data file paths
import sys
import warnings

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split  # splits the data into training and validation sets
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn import metrics
from sklearn.feature_extraction import DictVectorizer  # feature transformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn import tree
from sklearn.model_selection import GridSearchCV

# Jupyter magic for inline plots; remove this line when running as a plain script
%matplotlib inline

# Suppress warnings
warnings.filterwarnings("ignore")

def readData(path):
    """
    Read the data with pandas.
    """
    data = pd.read_csv(path)
    cols = list(data.columns.values)
    return data[cols]


def visualData(data):
    """
    Plot histograms to get an intuitive view of the data.
    """
    data.hist(
        rwidth=0.9, grid=True, figsize=(8, 8), alpha=0.6, bins=10, color="blue")
    plt.show()

if __name__ == "__main__":
    # Set the display width
    pd.set_option('display.width', 1000)
    # In a notebook __file__ is undefined, so this literal string resolves to the working directory
    homePath = os.path.dirname(os.path.abspath('__file__'))
    # os.path.join picks the right separator on both Windows and Linux
    train = readData(os.path.join(homePath, "df_training.csv"))
    test = readData(os.path.join(homePath, "df_test.csv"))
    # Summary information
    train.info()
    print(train.head())
    # info() reports no nulls because missing values are stored as '?';
    # most columns are object dtype and need converting
    test.info()


Data cleaning and descriptive analysis

# print(set(train.columns) - set(['ID', "'Purchase or not'"]))  # lists the feature columns
features = ["'Product using score'", "'User area'", 'age', "'Point balance'", 'gender', "'Cumulative using time'",
            "'Active user'", "' Estimated salary'", "'Pay a monthly fee by credit card'", "'Product service usage'"]

for feature in features:
    na_count = train[train[feature] == '?'].shape[0]
    na_per = na_count / train[feature].count()
    na_per_buy = train[train[feature] == '?']["'Purchase or not'"].value_counts()[1] / na_count
    print('%s missing share: %.2f, purchase share among missing: %.2f' % (feature, na_per, na_per_buy))
    # Replace '?' with NaN; pd.cut and pd.crosstab skip NaN rows on their own
    train2 = train.copy()
    train2[feature] = train2[feature].replace('?', np.nan)
    # Convert non-categorical variables to numeric
    if feature not in ["'User area'", 'gender']:
        train2[feature] = pd.to_numeric(train2[feature])
    # Cross-tabulate against the target and plot the result
    # Bin the feature when it has 20 or more distinct values
    if train2[feature].nunique() >= 20:
        # Equal-width binning
        train2['cut'] = pd.cut(train2[feature], bins=10, include_lowest=True, right=False, precision=0)
        cross1 = pd.crosstab(train2['cut'], train["'Purchase or not'"])
    else:
        cross1 = pd.crosstab(train2[feature], train["'Purchase or not'"])
    print(cross1)
    cross1.plot(kind="bar", color=["blue", "0.45"], rot=0)
    plt.show()
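The equal-width binning that pd.cut performs in the loop can be illustrated without pandas. This is only a simplified stand-in, not pandas' actual implementation: the helper name `equal_width_bins` and the sample ages are made up for illustration, and it returns a bin index per value instead of labelled intervals.

```python
def equal_width_bins(values, bins=10):
    """Assign each value to one of `bins` equal-width, left-closed intervals.

    Hypothetical helper for illustration only; pd.cut additionally
    returns labelled intervals and handles missing values.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        # All values identical: everything falls in the first bin.
        return [0 for _ in values]
    indices = []
    for v in values:
        idx = int((v - lo) // width)
        # The maximum value would land in bin number `bins`; fold it into the last bin.
        indices.append(min(idx, bins - 1))
    return indices


ages = [18, 25, 31, 40, 47, 52, 60, 68, 74, 88]  # made-up sample
bin_index = equal_width_bins(ages, bins=7)
print(bin_index)
```

With equal-width bins, sparse regions such as the oldest ages end up in bins holding very few customers, so purchase rates read off those bins are noisy.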

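Overall, about 30% of the values in each feature are missing, and about 20% of the rows with a missing value are purchases. Because the training set is small and the missing share is high, missing values can be assigned to a category of their own instead of being dropped, which avoids losing useful information.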

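For Product using score, purchases are essentially zero below 410; above that the distribution is uneven but close to normal.

By User area, the purchase rate in Taichung is higher than in Tainan and Taipei.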


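age is roughly normally distributed. Purchases fall to nearly zero after 74, so the purchases recorded between 80 and 85 look like anomalous data.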


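For Point balance, the number of purchasers is especially high in two bins, [0.0, 22227.0) and [111134.0, 133361.0); the remaining bins are similar to one another.

By gender, the purchase rate is higher among women.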


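Product service usage consists mainly of integers from 0 to 4. The purchase rate is especially high for 3 and 4, about 30% for 1, and lowest for 2.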


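Active user and Pay a monthly fee by credit card are boolean flags taking the values 0 and 1. Non-active users purchase more often, while paying by credit card makes little difference to whether a customer buys.

Use box plots to examine the distribution of the continuous variables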

import matplotlib.pyplot as plt

feabox = ["'Product using score'", "' Estimated salary'"]
for feature in feabox:
    # Replace '?' with NaN, convert to numeric, and drop the missing rows
    # (plt.boxplot does not skip NaN by itself)
    values = pd.to_numeric(train[feature].replace('?', np.nan)).dropna()
    plt.grid(linestyle="--", alpha=0.3)
    plt.boxplot(x=values,  # data to plot
                patch_artist=True,  # fill the box with a custom color (white by default)
                showmeans=True,  # show the mean as a point
                boxprops={'color': 'black', 'facecolor': 'steelblue'},  # box edge and fill colors
                # outlier marker shape, fill color, and size
                flierprops={'marker': 'o', 'markerfacecolor': 'red', 'markersize': 3},
                # mean marker shape, fill color, and size
                meanprops={'marker': 'D', 'markerfacecolor': 'indianred', 'markersize': 4},
                # median line style and color
                medianprops={'linestyle': '--', 'color': 'orange'},
                labels=[' ']  # blank out the x tick label, which would otherwise read 1
                )
    # Add the plot title
    plt.title(feature)
    plt.show()
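The red outlier points in a box plot are the values lying beyond the whiskers. Assuming matplotlib's default whisker length of 1.5 times the interquartile range, the same rule can be sketched with the standard library. The helper name `iqr_outliers` and the sample scores are hypothetical.

```python
import statistics


def iqr_outliers(values, whisker=1.5):
    """Return (q1, q3, outliers) using the 1.5 * IQR box plot convention."""
    # method="inclusive" interpolates linearly between the sorted data points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return q1, q3, outliers


scores = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # made-up sample
q1, q3, outliers = iqr_outliers(scores)
print(q1, q3, outliers)
```

Listing the flagged values this way makes it easier to decide whether they are data errors or genuine extremes before any of them are removed.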


Data preprocessing and feature processing

# Encoding maps
alldata = pd.concat([train, test], axis=0, ignore_index=True)
# Build the lookup dictionaries
area_map = {'Taichung': 0, 'Tainan': 1, 'Taipei': 2, '?': 3}
gender_map = {'Female': 0, 'Male': 1, '?': 3}
# Apply the maps
alldata["'User area'"] = alldata["'User area'"].map(area_map)
alldata["gender"] = alldata["gender"].map(gender_map)
# Turn the remaining '?' into NaN
alldata = alldata.replace('?', np.nan)
# Fill these two columns with their mode; mode() returns a Series, so take its first value
for col in ["'Cumulative using time'", "'Product service usage'"]:
    alldata[col] = alldata[col].fillna(alldata[col].mode()[0])
# Assign every other missing value to a category of its own
alldata.fillna(value=-1, inplace=True)
# Convert to numeric
for feature in features:
    alldata[feature] = pd.to_numeric(alldata[feature])
# Split back into training and test sets
newtrain = alldata[alldata["'Purchase or not'"] != 'Withheld'].copy()
newtrain["'Purchase or not'"] = pd.to_numeric(newtrain["'Purchase or not'"])
newtest = alldata[alldata["'Purchase or not'"] == 'Withheld'].copy()
newtrain.info()
newtest.info()
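The preprocessing uses two different ways of handling '?': filling with the most frequent value, and assigning a sentinel so that "missing" becomes a category of its own. A dependency-free sketch of the difference follows; the helper names and the sample column are made up for illustration.

```python
from collections import Counter

MISSING = "?"


def fill_with_mode(column):
    """Replace '?' with the most frequent observed value."""
    observed = [v for v in column if v != MISSING]
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if v == MISSING else v for v in column]


def fill_with_sentinel(column, sentinel=-1):
    """Replace '?' with a sentinel so that 'missing' becomes its own category."""
    return [sentinel if v == MISSING else v for v in column]


usage = ["1", "2", "?", "1", "?", "3"]  # made-up column
mode_filled = fill_with_mode(usage)
sentinel_filled = fill_with_sentinel(usage)
print(mode_filled)
print(sentinel_filled)
```

Mode filling hides the fact that a value was missing, whereas the sentinel lets a tree-based model split on missingness itself, which matters here because roughly 30% of each feature is missing.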

Start with a random forest as a baseline

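# Quick prediction with a random forest
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Every column except the ID and the target is a predictor
predictors = [c for c in newtrain.columns if c not in ('ID', "'Purchase or not'")]
X_train, X_val, y_train, y_val = train_test_split(newtrain[predictors], newtrain["'Purchase or not'"],
                                                  test_size=0.2, random_state=1234)
rf = RandomForestClassifier(n_estimators=1000, min_samples_split=5, min_samples_leaf=3)
rf.fit(X_train, y_train)
# Both metrics take the true labels as the first argument
print(accuracy_score(y_val, rf.predict(X_val)))
# AUC is computed from the predicted probability of the positive class
print(roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1]))

The original run printed the values below. Its AUC was computed from hard class predictions passed as the first argument, so the call above will report a different AUC.

0.83

0.7924476371342856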
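sklearn's roc_auc_score takes the true labels first and the scores second, and it is most informative when the scores are probabilities, not 0/1 predictions. As a rough sketch of why, AUC can be written as the share of (buyer, non-buyer) pairs that the score ranks correctly. The function `auc` and all the numbers are made up for illustration.

```python
def auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 0, 1, 1, 0]                # made-up labels
proba = [0.1, 0.4, 0.45, 0.8, 0.42, 0.2]   # made-up predicted probabilities
hard = [1 if p >= 0.5 else 0 for p in proba]

auc_from_proba = auc(y_true, proba)
auc_from_hard = auc(y_true, hard)
print(auc_from_proba, auc_from_hard)
```

Thresholding collapses the scores into two values, so most pairs become ties and the ranking information that AUC is meant to measure is lost.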

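Gradient boosting training

Bagging and boosting are two common ensemble methods. Random forest belongs to the former; next we use GBDT, a boosting model, to improve on it. GBDT implementations include XGBoost and LightGBM, and LightGBM is reputed to be faster and more accurate, so it is time to turn into a hyperparameter-tuning warrior and start training.

step 1: set an initial learning rate and tune the number of boosting iterations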

import pandas as pd
import lightgbm as lgb

params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'nthread': 4,
    'learning_rate': 0.1,
    'num_leaves': 30,
    'max_depth': 6,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
}
data_train = lgb.Dataset(X_train, y_train)
# Written for LightGBM 3.x. In LightGBM 4.x, early stopping is passed as
# callbacks=[lgb.early_stopping(50)] and the result key is 'valid auc-mean'.
cv_results = lgb.cv(params, data_train, num_boost_round=1000, nfold=5, stratified=False, shuffle=True,
                    metrics='auc', early_stopping_rounds=50, seed=0)
print('best n_estimators:', len(cv_results['auc-mean']))
print('best cv score:', pd.Series(cv_results['auc-mean']).max())


best n_estimators: 44

best cv score: 0.7852211260765876
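The iteration count comes from early stopping: boosting continues until the cross-validated AUC has not improved for a set number of rounds, and the best round is kept. The sketch below shows the rule on its own. It is not LightGBM's internal code, and the function name and the AUC curve are hypothetical.

```python
def best_round_with_early_stopping(scores, patience):
    """Return (best_round, best_score), with rounds counted from 1.

    Scanning stops once `patience` consecutive rounds fail to beat the
    best score seen so far, which is the idea behind early stopping.
    """
    best_round, best_score = 0, float("-inf")
    for round_no, score in enumerate(scores, start=1):
        if score > best_score:
            best_round, best_score = round_no, score
        elif round_no - best_round >= patience:
            break
    return best_round, best_score


cv_auc = [0.70, 0.74, 0.77, 0.78, 0.779, 0.778, 0.781, 0.776]  # made-up curve
best_round, best_score = best_round_with_early_stopping(cv_auc, patience=3)
print(best_round, best_score)
```

A patience that is too small can stop on a temporary plateau: with patience=2 the same curve stops at round 4 and never reaches the better score at round 7.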

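step 2: with the number of iterations fixed, tune max_depth and num_leaves

These are the most important parameters for improving accuracy. Here we use sklearn's GridSearchCV() to search over them.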

params_test1 = {'max_depth': range(3, 8, 1), 'num_leaves': range(5, 100, 5)}
gsearch1 = GridSearchCV(
    estimator=lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metrics='auc', learning_rate=0.1,
                                 n_estimators=44, max_depth=6, bagging_fraction=0.8, feature_fraction=0.8),
    param_grid=params_test1, scoring='roc_auc', cv=5, n_jobs=-1)
gsearch1.fit(X_train, y_train)
means = gsearch1.cv_results_['mean_test_score']
params = gsearch1.cv_results_['params']
for mean, param in zip(means, params):
    print("%f  with:   %r" % (mean, param))
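Before running a grid search it helps to count how much work it implies. The sketch below enumerates the step 2 grid with the standard library. The filter on reachable leaf counts is an illustrative assumption based on the fact that a binary tree of depth d has at most 2**d leaves; it is not something the original search applies.

```python
from itertools import product

max_depths = range(3, 8, 1)
num_leaves_options = range(5, 100, 5)

# Every (max_depth, num_leaves) pair that the grid search will try.
grid = list(product(max_depths, num_leaves_options))
n_fits = len(grid) * 5  # cv=5 folds per candidate

# A tree of depth d has at most 2**d leaves, so larger num_leaves values
# cannot be reached under that depth limit.
reachable = [(d, n) for d, n in grid if n <= 2 ** d]

print(len(grid), n_fits, len(reachable))
```

The grid holds 95 candidates, which means 475 cross-validation fits, while only 41 of those candidates describe distinct tree shapes under the depth limit. Candidates outside that set are expected to score the same as their capped counterparts.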

step 3: tune min_data_in_leaf and max_bin

params_test2 = {'max_bin': range(5, 256, 10), 'min_data_in_leaf': range(1, 102, 10)}
gsearch2 = GridSearchCV(
    estimator=lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metrics='auc', learning_rate=0.1,
                                 n_estimators=44, max_depth=3, num_leaves=5,
                                 bagging_fraction=0.8, feature_fraction=0.8),
    param_grid=params_test2, scoring='roc_auc', cv=5, n_jobs=-1)
gsearch2.fit(X_train, y_train)
means = gsearch2.cv_results_['mean_test_score']
params = gsearch2.cv_results_['params']
for mean, param in zip(means, params):
    print("%f  with:   %r" % (mean, param))

step 4: tune feature_fraction, bagging_fraction, and bagging_freq

params_test3 = {'feature_fraction': [0.6, 0.7, 0.8, 0.9, 1.0],
                'bagging_fraction': [0.6, 0.7, 0.8, 0.9, 1.0],
                'bagging_freq': range(0, 81, 10)}
gsearch3 = GridSearchCV(
    estimator=lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metrics='auc', learning_rate=0.1,
                                 n_estimators=44, max_depth=3, num_leaves=5, max_bin=185, min_data_in_leaf=1),
    param_grid=params_test3, scoring='roc_auc', cv=5, n_jobs=-1)
gsearch3.fit(X_train, y_train)
means = gsearch3.cv_results_['mean_test_score']
params = gsearch3.cv_results_['params']
for mean, param in zip(means, params):
    print("%f  with:   %r" % (mean, param))

step 5: tune lambda_l1 and lambda_l2

params_test4 = {'lambda_l1': [1e-5, 1e-3, 1e-1, 0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0],
                'lambda_l2': [1e-5, 1e-3, 1e-1, 0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]}
gsearch4 = GridSearchCV(
    estimator=lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metrics='auc', learning_rate=0.1,
                                 n_estimators=44, max_depth=3, num_leaves=5, max_bin=185, min_data_in_leaf=1,
                                 bagging_fraction=0.9, bagging_freq=40, feature_fraction=0.7),
    param_grid=params_test4, scoring='roc_auc', cv=5, n_jobs=-1)
gsearch4.fit(X_train, y_train)
means = gsearch4.cv_results_['mean_test_score']
params = gsearch4.cv_results_['params']
for mean, param in zip(means, params):
    print("%f  with:   %r" % (mean, param))

step 6: tune min_split_gain

params_test5 = {'min_split_gain': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}
gsearch5 = GridSearchCV(
    estimator=lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metrics='auc', learning_rate=0.1,
                                 n_estimators=44, max_depth=3, num_leaves=5, max_bin=185, min_data_in_leaf=1,
                                 bagging_fraction=0.9, bagging_freq=40, feature_fraction=0.7,
                                 lambda_l1=1e-05, lambda_l2=0.001),
    param_grid=params_test5, scoring='roc_auc', cv=5, n_jobs=-1)
gsearch5.fit(X_train, y_train)
means = gsearch5.cv_results_['mean_test_score']
params = gsearch5.cv_results_['params']
for mean, param in zip(means, params):
    print("%f  with:   %r" % (mean, param))

step 7: lower the learning rate, increase the number of iterations, and validate the model

 

from sklearn.metrics import accuracy_score, roc_auc_score

X_train, X_val, y_train, y_val = train_test_split(newtrain[predictors], newtrain["'Purchase or not'"],
                                                  test_size=0.2, random_state=2019)
model = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metrics='auc', learning_rate=0.005,
                           n_estimators=2900, max_depth=3, num_leaves=5, max_bin=185, min_data_in_leaf=1,
                           bagging_fraction=0.9, bagging_freq=40, feature_fraction=0.7,
                           lambda_l1=1e-05, lambda_l2=0.001, min_split_gain=0)
model.fit(X_train, y_train)
# Predictions on the validation set, reused for the confusion matrix later on
pred = model.predict(X_val)
print(accuracy_score(y_val, pred))
# Predict the test set and save the submission file
newtest = newtest.copy()
newtest['Predicted_Results'] = model.predict(newtest[predictors])
newtest[['ID', 'Predicted_Results']].to_csv('results3.csv', index=False)

The result is

0.8341666666666666

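Computing the final profit

Based on the cost of each marketing copy, the other costs, and the revenue, compute the final profit matrix and the profit.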

from sklearn.metrics import confusion_matrix

# Compute the final profit
cm = confusion_matrix(y_val, pred)
print(cm)
# sklearn puts true labels on rows and predictions on columns, so for 0/1 labels
# ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()
profitA = tp * 1500 - fp * 500
print(profitA)
profitB = tp * 700 - fp * 500
print(profitB)
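The same profit formula can also answer the question from the background: is targeting by model better than sending to everyone? The sketch below assumes, reading off the formula, that each correctly targeted buyer brings in 1500 and each wrongly targeted customer costs 500. The labels, predictions, and helper names are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 means 'purchased'."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def campaign_profit(tp, fp, revenue_per_buyer, cost_per_miss=500):
    """Profit when only customers predicted as buyers are contacted."""
    return tp * revenue_per_buyer - fp * cost_per_miss


y_true = [1, 0, 0, 1, 0, 1, 0, 0]   # made-up outcomes
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]   # made-up model predictions

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
model_profit = campaign_profit(tp, fp, revenue_per_buyer=1500)

# Sending to everyone is the same as predicting 1 for every customer.
_, fp_all, _, tp_all = confusion_counts(y_true, [1] * len(y_true))
blanket_profit = campaign_profit(tp_all, fp_all, revenue_per_buyer=1500)

print(model_profit, blanket_profit)
```

In this toy data the model misses one buyer but avoids four wasted contacts, so it comes out ahead. With real data the comparison depends on the purchase rate and on the ratio of revenue to contact cost.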

