Tianchi Competition: O2O Coupon Usage Prediction with XGBoost (AUC 0.5379)

As a newcomer to algorithm competitions, I want to share this experience as a record of my own progress.

I started reading Tianchi competition code in February, and after almost two months I have more or less learned the basics and understand how a competition works. I fumbled my way through on my own, and the process was fairly rough: I was often stuck on a single bug for several days with nobody to ask, so my efficiency was low.

Anyway, back to the point. Here is the O2O competition I worked on recently.

After signing up for the Tianchi beginner contest, I first went to the technical forum to learn, and found a "100 lines of code" baseline for the O2O coupon usage prediction contest. I took it and ran it; it worked right away and the code was fairly simple. It did almost no feature engineering and used SGDClassifier; the final AUC was 0.5287, ranked 412/13500, a long way from the first place score of 0.81.

So I modified the code on top of that baseline, extracted a few more features, and switched to an XGBoost model. After submitting, the AUC was 0.5379 and the ranking improved by about 60 places, to 350/13500. Below is the code with a detailed walkthrough.

import os, sys, pickle

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib.dates as mdates

import seaborn as sns

from datetime import date

from sklearn.model_selection import KFold, train_test_split, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss, roc_auc_score, auc, roc_curve
from sklearn.preprocessing import MinMaxScaler
import xgboost as xgb
#import lightgbm as lgb

# display settings for this notebook

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Part 1: Load the data
Take a quick look at the datasets.

dfoff = pd.read_csv("../input/ccf_offline_stage1_train.csv", keep_default_na=False)
dftest = pd.read_csv("../input/ccf_offline_stage1_test_revised.csv", keep_default_na=False)

dfon = pd.read_csv("../input/ccf_online_stage1_train.csv", keep_default_na=False)
dfoff.head(5)   # show the first 5 rows of the training set
dftest.head(5)  # show the first 5 rows of the test set

As you can see, the training set has one extra column, Date, compared with the test set. Our job is to train on the training data and then predict, for each coupon in the test set, whether it will be redeemed.

The competition provides users' real online and offline consumption behaviour between 2016-01-01 and 2016-06-30, and asks us to predict whether coupons received in July 2016 will be used within 15 days of being received.

Data cleaning step 1: check for missing values. (Because keep_default_na=False was passed to read_csv, missing entries stay as the literal string 'null' instead of NaN, so isnull() reports zero for every column.)

dfoff.isnull().sum().sort_values(ascending=False).head(10)   # the result shows no NaN values

Date 0
Date_received 0
Distance 0
Discount_rate 0
Coupon_id 0
Merchant_id 0
User_id 0
dtype: int64

dfoff.info()   # check the column dtypes


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1754884 entries, 0 to 1754883
Data columns (total 7 columns):
User_id int64
Merchant_id int64
Coupon_id object
Discount_rate object
Distance object
Date_received object
Date object
dtypes: int64(2), object(5)
memory usage: 93.7+ MB

dfoff["Date_received"].unique()      #查看一下线下训练集中 Data_receive 的类型个数
dfoff["Date"].unique()
print("有优惠卷,购买商品:" ,dfoff[(dfoff["Date_received"] != "null") & (dfoff["Date"] != "null")].shape[0])
print('有优惠卷,未购商品:%d' % dfoff[(dfoff['Date_received'] != 'null') & (dfoff['Date'] == 'null')].shape[0])
print('无优惠卷,购买商品:%d' % dfoff[(dfoff['Date_received'] == 'null') & (dfoff['Date'] != 'null')].shape[0])
print('无优惠卷,未购商品:%d' % dfoff[(dfoff['Date_received'] == 'null') & (dfoff['Date'] == 'null')].shape[0])

Coupon received, purchase made: 75382
Coupon received, no purchase: 977900
No coupon, purchase made: 701602
No coupon, no purchase: 0

# users that appear in the test set but not in the training set
print('1. User_id in test set but not in training set:', set(dftest['User_id']) - set(dfoff['User_id']))
# merchants that appear in the test set but not in the training set
print('2. Merchant_id in test set but not in training set:', set(dftest['Merchant_id']) - set(dfoff['Merchant_id']))
  1. User_id in test set but not in training set: {2495873, 1286474}
  2. Merchant_id in test set but not in training set: {5920}

Part 2: Clean the data and extract features
Key point 1: unique() shows the distinct values a column takes.

Key point 2: to work on a single column, use the pattern dfoff["column_name"] followed by a method call, as in the examples below.

Feature 1: discount rate

The first feature that comes to mind is the discount rate: the bigger the discount, the more likely the user is to redeem the coupon. First use unique() to see how many discount formats there are.

print('Discount_rate types:', dfoff['Discount_rate'].unique())

Discount_rate types: ['null' '150:20' '20:1' '200:20' '30:5' '50:10' '10:5' '100:10' '200:30'
'20:5' '30:10' '50:5' '150:10' '100:30' '200:50' '100:50' '300:30'
'50:20' '0.9' '10:1' '30:1' '0.95' '100:5' '5:1' '100:20' '0.8' '50:1'
'200:10' '300:20' '100:1' '150:30' '300:50' '20:10' '0.85' '0.6' '150:50'
'0.75' '0.5' '200:5' '0.7' '30:20' '300:10' '0.2' '50:30' '200:100'
'150:5']
The printout shows three discount formats:

The first is 'null', meaning there is no coupon.

The second is like 150:20, meaning spend at least 150 and get 20 off.

The third is like 0.95, meaning the price is multiplied by 0.95 (a 5% discount).

We therefore build 4 functions to extract 4 features:

discount type: getDiscountType()

discount rate: convertRate()

"spend at least X" threshold: getDiscountMan()

"get Y off" reduction: getDiscountJian()

# convert Discount_rate and Distance

def getDiscountType(row):   # row is the raw Discount_rate string
    if row == 'null':
        return 'null'       # no coupon
    elif ':' in row:
        return 1            # "spend X get Y off" coupon
    else:
        return 0            # direct discount rate

def convertRate(row):
    """Convert Discount_rate to a discount rate"""
    if row == 'null':                                   # no coupon, treat as full price
        return 1.0
    elif ':' in row:
        rows = row.split(':')
        return 1.0 - float(rows[1])/float(rows[0])      # "spend X get Y off" -> rate 1 - Y/X
    else:
        return float(row)                               # already a rate

def getDiscountMan(row):
    if ':' in row:
        rows = row.split(':')
        return int(rows[0])                             # the "spend at least X" threshold
    else:
        return 0                                        # not a threshold coupon

def getDiscountJian(row):
    if ':' in row:
        rows = row.split(':')
        return int(rows[1])                             # the "get Y off" reduction
    else:
        return 0                                        # not a threshold coupon
    
def processData(df):                                  # apply the helpers above to one dataframe

    # convert Discount_rate
    df['discount_rate'] = df['Discount_rate'].apply(convertRate)   # pass each Discount_rate value to convertRate
    df['discount_man'] = df['Discount_rate'].apply(getDiscountMan)
    df['discount_jian'] = df['Discount_rate'].apply(getDiscountJian)
    df['discount_type'] = df['Discount_rate'].apply(getDiscountType)
    print(df['discount_rate'].unique())                 # check the converted discount_rate values

    # convert Distance: replace 'null' with -1 and cast to int
    df['distance'] = df['Distance'].replace('null', -1).astype(int)

    return df

dfoff = processData(dfoff)     # clean the offline training set
dftest = processData(dftest)   # clean the test set in the same way

[1. 0.86666667 0.95 0.9 0.83333333 0.8
0.5 0.85 0.75 0.66666667 0.93333333 0.7
0.6 0.96666667 0.98 0.99 0.975 0.33333333
0.2 0.4 ]
[0.83333333 0.9 0.96666667 0.8 0.95 0.75
0.98 0.5 0.86666667 0.6 0.66666667 0.7
0.85 0.33333333 0.94 0.93333333 0.975 0.99 ]
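As a quick sanity check (my own addition, not part of the original notebook), the converted values line up with the raw strings: '150:20' becomes 1 - 20/150 ≈ 0.8667, which is the 0.86666667 printed above.

print(convertRate('150:20'))     # 0.8666..., spend 150 to get 20 off
print(getDiscountMan('150:20'))  # 150
print(getDiscountJian('150:20')) # 20
print(getDiscountType('0.95'))   # 0, a direct discount rate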

# show the first two rows; the training set now has 4 extra columns:
# "discount_rate", "discount_man", "discount_jian", "discount_type"
dfoff.head(2)
#print('Distance types:', dfoff['Distance'].unique())

Data cleaning step 2: type conversions
Handle the "Distance" column.

#df['distance'] = df['Distance'].replace('null', -1).astype(int)  # replace 'null' with -1 and cast to int (done in processData above)
#print(df['distance'].unique())                                   # check the distinct distance values
dftest.head(2)     # the test set has been processed the same way
date_received = dfoff['Date_received'].unique()     # date_received is a numpy array here
date_received = sorted(date_received[date_received != 'null'])

date_buy = dfoff['Date'].unique()
date_buy = sorted(date_buy[date_buy != 'null'])

date_buy = sorted(dfoff[dfoff['Date'] != 'null']['Date'])
print('Coupons were received from', date_received[0], 'to', date_received[-1])
print('Purchases were made from', date_buy[0], 'to', date_buy[-1])

Coupons were received from 20160101 to 20160615
Purchases were made from 20160101 to 20160630

Feature 2: day-of-week features; whether a coupon is used is likely to depend on the day of the week it was received.

def getWeekday(row):         # row is the Date_received string
    if row == 'null':        # keep 'null' unchanged
        return row
    else:
        return date(int(row[0:4]), int(row[4:6]), int(row[6:8])).weekday() + 1

# cast Date_received to str, apply getWeekday, and store the result in a new weekday column
dfoff['weekday'] = dfoff['Date_received'].astype(str).apply(getWeekday)
dftest['weekday'] = dftest['Date_received'].astype(str).apply(getWeekday)

# weekday_type: 1 for Saturday and Sunday, 0 otherwise
dfoff['weekday_type'] = dfoff['weekday'].apply(lambda x : 1 if x in [6,7] else 0)
dftest['weekday_type'] = dftest['weekday'].apply(lambda x : 1 if x in [6,7] else 0)
dfoff.head()
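A quick check of the helper above (my own addition): 2016-01-01 was a Friday, so getWeekday maps it to 5 (Monday is 1), and the weekend flag only fires for 6 and 7.

print(getWeekday('20160101'))                          # 5 -> Friday
print(1 if getWeekday('20160102') in [6, 7] else 0)    # 2016-01-02 was a Saturday, so the flag is 1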

Data processing step 3: one-hot encoding

# change weekday to one-hot encoding
weekdaycols = ['weekday_' + str(i) for i in range(1,8)]
print(weekdaycols)                                                 # weekdaycols is a list of column names

tmpdf = pd.get_dummies(dfoff['weekday'].replace('null', np.nan))   # one-hot encode in one call
tmpdf.columns = weekdaycols   # rename the dummy columns
dfoff[weekdaycols] = tmpdf

tmpdf = pd.get_dummies(dftest['weekday'].replace('null', np.nan))  # one-hot encode in one call
tmpdf.columns = weekdaycols
dftest[weekdaycols] = tmpdf
dfoff.head()

['weekday_1', 'weekday_2', 'weekday_3', 'weekday_4', 'weekday_5', 'weekday_6', 'weekday_7']

With this simple feature extraction we now have 14 usable features (collected explicitly in the sketch after this list):

discount_rate
discount_type
discount_man
discount_jian
distance
weekday
weekday_type
weekday_1
weekday_2
weekday_3
weekday_4
weekday_5
weekday_6
weekday_7
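For reference (my own addition, not in the original code), the same 14 columns can be listed explicitly instead of relying on the drop() calls used later; this list is only a convenience and is not used elsewhere in the post.

feature_cols = (['discount_rate', 'discount_type', 'discount_man', 'discount_jian',
                 'distance', 'weekday', 'weekday_type']
                + ['weekday_' + str(i) for i in range(1, 8)])
print(len(feature_cols))   # 14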

Labelling (Label)

With the features ready, we still need to label the training samples, i.e. decide which rows are positive (y = 1) and which are negative (y = 0). What we predict is whether a user consumes within 15 days of receiving a coupon, so there are three cases:

1. Date_received == 'null':

no coupon was received, so the row is not considered: y = -1

2. (Date_received != 'null') & (Date != 'null') & (Date - Date_received <= 15):

the coupon was received and used within 15 days: positive sample, y = 1

3. (Date_received != 'null') & ((Date == 'null') | (Date - Date_received > 15)):

the coupon was received but not used within 15 days: negative sample, y = 0

With these rules in hand we can define the labelling function.

Positive and negative samples

def label(row):            # row is one dataframe row
    if row['Date_received'] == 'null':
        return -1                # no coupon received, not considered
    if row['Date'] != 'null':
        td = pd.to_datetime(row['Date'], format='%Y%m%d') - pd.to_datetime(row['Date_received'], format='%Y%m%d')
        if td <= pd.Timedelta(15, 'D'):               # coupon used within 15 days: positive sample
            return 1
    return 0                 # coupon not used within 15 days: negative sample
dfoff['label'] = dfoff.apply(label, axis = 1)
dfoff["label"].unique()  # check the distinct label values

array([-1, 0, 1], dtype=int64)

value_counts() counts how many rows take each value.

We can apply it to the labelled training set to see how many positive and negative samples there are:

print(dfoff['label'].value_counts())   # count positive and negative samples

0 988887
-1 701602
1 64395
Name: label, dtype: int64

Clearly, there are 64,395 positive samples and 988,887 negative samples; the classes are heavily imbalanced. That is exactly why AUC is used as the evaluation metric for this task.
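A minimal sketch (my own addition) of why accuracy would be misleading here: with roughly 94% negatives, always predicting 0 already scores about 0.94 accuracy, while the AUC of a useless random scorer stays near 0.5.

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.RandomState(0)
y_true = np.array([0] * 9889 + [1] * 644)              # roughly the 988887 : 64395 ratio
random_scores = rng.rand(len(y_true))                   # a model that knows nothing

print(accuracy_score(y_true, np.zeros_like(y_true)))    # ~0.94, looks good but is meaningless
print(roc_auc_score(y_true, random_scores))             # ~0.5, correctly exposes the random model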

dfoff.columns.tolist() lists the column names.

print('Existing columns:', dfoff.columns.tolist())     # check which columns we have so far

Existing columns: ['User_id', 'Merchant_id', 'Coupon_id', 'Discount_rate', 'Distance', 'Date_received', 'Date', 'discount_rate', 'discount_man', 'discount_jian', 'discount_type', 'distance', 'weekday', 'weekday_type', 'weekday_1', 'weekday_2', 'weekday_3', 'weekday_4', 'weekday_5', 'weekday_6', 'weekday_7', 'label']

dfoff.head(2)

Part 3: Build the model
Now for the main part: building the machine learning model. The features are the 14 extracted above. To evaluate the model we carve out a validation set by coupon-received date: training set 20160101-20160515, validation set 20160516-20160615. We use XGBoost.

XGBoost
1. Split training and validation sets
Note that model.predict() returns a score for each sample (the higher, the more likely it is to be positive); for a ranking metric like AUC that is all we need.

Finally, AUC on the validation set can be computed by calling sklearn's built-in AUC functions directly.

# data split
df = dfoff[dfoff['label'] != -1].copy()               # keep only rows that actually received a coupon (label 0 or 1)
train = df[(df['Date_received'] < '20160516')].copy()   # coupons received before 20160516 form the training set
valid = df[(df['Date_received'] >= '20160516') & (df['Date_received'] <= '20160615')].copy()  # coupons received from 20160516 to 20160615 form the validation set
print(train['label'].value_counts())               # class counts in the training set
print(valid['label'].value_counts())               # class counts in the validation set

0 759172
1 41524
Name: label, dtype: int64
0 229715
1 22871
Name: label, dtype: int64

#train.head(5)
#valid.head(5)
y = train.label
# drop() removes the listed columns
# keep only the feature columns for the training set; the dropped columns are not used as features
X = train.drop(["User_id","Merchant_id","Coupon_id","Discount_rate","Distance","Date","Date_received","label"],axis=1)
val_y = valid.label

# keep only the feature columns for the validation set
val_X = valid.drop(["User_id","Merchant_id","Coupon_id","Discount_rate","Distance","Date","Date_received","label"],axis=1)

# keep only the feature columns for the test set
tests = dftest.drop(["User_id","Merchant_id","Coupon_id","Discount_rate","Distance","Date_received"],axis=1)
val_X["weekday"].unique(),val_X["discount_type"].unique()
# check the dtypes: weekday and discount_type are object columns, so convert them below

(array([6, 1, 4, 3, 2, 7, 5], dtype=object), array([1, 0], dtype=object))

# astype() converts a column's dtype
# cast weekday and discount_type to int
X["weekday"] = X["weekday"].astype(int)
X["discount_type"] = X["discount_type"].astype(int)
val_X["weekday"] = val_X["weekday"].astype(int)
val_X["discount_type"] = val_X["discount_type"].astype(int)
tests["weekday"].unique()
array([2, 3, 5, 6, 7, 1, 4], dtype=int64)

val_X["weekday"].unique()   # check again after the conversion

 array([6, 1, 4, 3, 2, 7, 5], dtype=int64)
# build xgb DMatrix objects
xgb_val = xgb.DMatrix(val_X, label=val_y)  # validation features + labels
xgb_train = xgb.DMatrix(X, label=y)        # training features + labels
xgb_test = xgb.DMatrix(tests)              # test features
xgb_val_X = xgb.DMatrix(val_X)             # validation features without labels


# custom evaluation function: average AUC per Coupon_id (the competition metric)
def myauc(test):
    testgroup = test.groupby(["Coupon_id"])
    aucs = []
    for i in testgroup:                          # i is a (coupon_id, group) pair
        tmpdf = i[1]
        if len(tmpdf['label'].unique()) != 2:    # skip coupons with only one class, AUC is undefined there
            continue
        fpr, tpr, thresholds = roc_curve(tmpdf['label'], tmpdf['pred'], pos_label=1)
        aucs.append(auc(fpr, tpr))
    return np.average(aucs)

XGBoost training setup

params = {'booster': 'gbtree',
          #'objective': 'rank:pairwise',
          'eval_metric': 'auc',
          'gamma': 0.1,               # minimum loss reduction required to make a split
          'min_child_weight': 1.1,
          'max_depth': 5,
          'lambda': 10,               # L2 regularisation
          'subsample': 0.7,
          'colsample_bytree': 0.7,
          'colsample_bylevel': 0.7,
          'eta': 0.01,                # learning rate
          'tree_method': 'exact',
          'seed': 0,
          'nthread': 12
          }
watchlist = [(xgb_train, 'train')]   # only the training set is watched, so early stopping monitors train AUC
model = xgb.train(params, xgb_train, num_boost_round=1000, evals=watchlist, early_stopping_rounds=100)

model.save_model('C:/Users/Administrator/o2o.code/notebook/xgbmodel')   # save the trained model
model = xgb.Booster(params)
model.load_model('C:/Users/Administrator/o2o.code/notebook/xgbmodel')
val_X.head()
valid.head()
model = xgb.Booster()
model.load_model('C:/Users/Administrator/o2o.code/notebook/xgbmodel')   # load the saved model

temp = valid[["Coupon_id", "label"]].copy()   #复制"Coupon_id", "label"两列
temp['pred'] = model.predict(xgb_val)         #做出预测
temp.pred = MinMaxScaler(copy=True, feature_range=(0, 1)).fit_transform(temp['pred'].values.reshape(-1, 1))
print(myauc(temp))                            #调用自定义函数计算AUC值
temp.head()

0.5518357641394374
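For comparison (my own addition), the overall AUC on the validation split, without grouping by Coupon_id, can be computed directly with sklearn as mentioned earlier; this sketch assumes the model, xgb_val and val_y objects defined above. roc_auc_score only needs a ranking score, so no rescaling is required.

overall_scores = model.predict(xgb_val)                           # one raw score per validation row
print('overall validation AUC:', roc_auc_score(val_y, overall_scores))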

tests.head()
val_X.head()
# predict on the test set
y_test = dftest[['User_id','Coupon_id',"Date_received"]].copy()

y_test['label'] = model.predict(xgb_test)   # predicted score for each test coupon

# save the submission file
y_test.to_csv("C:/Users/Administrator/o2o.code/notebook/second.csv", index=None, header=None)
y_test.head()
