O2O Coupon Redemption: Model Prediction

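Contents

1. Project Background and Goals

2. Data Description

3. Problem Analysis

4. Data Exploration and Preprocessing

5. Feature Engineering (Feature Construction)

5.1 Feature Construction: Full Dataset

5.1.1 Time Features

5.1.2 Coupon Features

5.1.3 Constructing the Prediction Target

5.2 Data Split: Sliding Time Windows

5.3 Feature Construction: Sliding-Window Data

5.3.1 User Features

5.3.2 Merchant Features

5.3.3 Coupon Features

5.3.4 User-Coupon Joint Features

5.3.5 User-Merchant Joint Features

5.3.6 Merchant-Coupon Joint Features

5.3.7 User-Merchant-Coupon Joint Features

6. Model Building

7. Saving the Data

8. Takeaways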


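1. Project Background and Goals

The O2O (online-to-offline) industry reaches hundreds of millions of consumers, and apps of all kinds record more than ten billion user-behavior and location records every day, which makes O2O one of the best meeting points of big-data research and commercial operation. Using coupons to re-engage existing users or to attract new customers into stores is an important O2O marketing method. Randomly distributed coupons, however, are a pointless disturbance for most users. For merchants, indiscriminately issued coupons can damage brand reputation and make marketing costs hard to estimate. Personalized distribution is an important technique for raising the coupon redemption rate: it gives consumers with particular preferences real savings and gives merchants stronger marketing capability.

Goal: using users' real offline consumption records from January 1, 2016 to June 30, 2016, predict whether a user who receives a coupon in July 2016 will use it within 15 days of receiving it.

A preliminary data analysis was carried out before the prediction work: O2O Coupon Data Analysis Report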

2. Data Description

Programming language: Python
Data source: https://tianchi.aliyun.com/competition/entrance/231593/information
Data fields:

[Image 1: data field descriptions]

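3. Problem Analysis

Question 1: the prediction dataset has only 6 fields. How can feature engineering be built out thoroughly enough to represent each sample?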

[Image 2]

[Image 3]

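Question 2: how should the data be split so that historical data can predict future data?

Historical data → extract features → these capture a habit or inherent inertia → which does not change easily

So the months preceding July (the month to predict) can be used → extract the inherent features → predict on the basis of those features

Data split:

  • 2016.01.01-2016.04.30 predicts → 2016.05.01-2016.05.31 (dataset 1)
  • 2016.02.01-2016.05.31 predicts → 2016.06.01-2016.06.30 (dataset 2)
  • 2016.03.01-2016.06.30 predicts → 2016.07.01-2016.07.31 (dataset to predict)

Implementation:

- Extract features from the 2016.01.01-2016.04.30 data

- Apply the extracted features to the 2016.05.01-2016.05.31 data; the other two groups are handled the same way

- Merge the processed 2016.05.01-2016.05.31 and 2016.06.01-2016.06.30 data into one dataset

- Split the merged dataset into train and test sets for model training

- Use the trained model to predict 2016.07.01-2016.07.31.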
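The split described above can be sketched on a few made-up rows. This is only an illustration under assumptions: the helper name in_window and the toy data are hypothetical, and the rule "a row belongs to a window if either its receive date or its purchase date falls inside it" mirrors the filtering used in section 5.2.

```python
import pandas as pd

def in_window(df, start, end):
    """Rows whose receive date OR purchase date falls inside [start, end]."""
    received = df["Date_received"].between(start, end)
    bought = df["Date"].between(start, end)
    return df[received | bought].reset_index(drop=True)

# Made-up records; None stands for "no coupon" or "no purchase"
toy = pd.DataFrame({
    "User_id": [1, 2, 3, 4],
    "Date_received": pd.to_datetime(["2016-01-10", "2016-04-28", "2016-05-03", None]),
    "Date": pd.to_datetime(["2016-01-15", None, "2016-05-20", "2016-05-09"]),
})

feature_part = in_window(toy, "2016-01-01", "2016-04-30")  # history used for features
label_part = in_window(toy, "2016-05-01", "2016-05-31")    # month being predicted

print(feature_part["User_id"].tolist())
print(label_part["User_id"].tolist())
```

Missing dates compare as False, so a row with no coupon can still enter a window through its purchase date.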

4. Data Exploration and Preprocessing

import pandas as pd
import numpy as np
data_off = pd.read_csv("/项目准备/O2O优惠券使用预测/offline_train.csv")
off_test = pd.read_csv("/项目准备/O2O优惠券使用预测/offline_test.csv")
# Keep an untouched copy of the raw test set (plain assignment would only alias off_test)
off_test1 = off_test.copy()
off_test.head()
data_off.shape
data_off.info()
data_off.describe()
# Latest and earliest purchase dates
print(data_off['Date'].max(),data_off['Date'].min())
# Latest and earliest coupon-receipt dates
print(data_off['Date_received'].max(),data_off['Date_received'].min())

Output:
20160630.0 20160101.0
20160615.0 20160101.0
# Missing values
data_off.isnull().sum()
# When there is no coupon (Coupon_id is missing), Discount_rate and Date_received are missing as well
nan1 = data_off["Discount_rate"].isnull()
nan2 = data_off['Date_received'].isnull()
nan3 = data_off['Coupon_id'].isnull()
np.all(nan1==nan2),np.all(nan1==nan3)

Output:
(True, True)
# Drop duplicate rows
data_off.drop_duplicates(inplace=True)
data_off.info()
# Convert the numeric YYYYMMDD dates to datetime
data_off['Date'] = pd.to_datetime(data_off['Date'],format='%Y%m%d')
data_off['Date_received'] = pd.to_datetime(data_off['Date_received'],format='%Y%m%d')
off_test['Date_received'] = pd.to_datetime(off_test['Date_received'],format='%Y%m%d')
data_off.info()
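Because the date columns are read as float64 with NaN for missing entries, it can help to see the conversion on a tiny series. The sketch below is an alternative under assumptions, not the conversion used in this post: it routes only the non-missing values through integer and string form before parsing, so it does not rely on how a given pandas version treats floats combined with a format string. The series raw is made up.

```python
import numpy as np
import pandas as pd

# Made-up sample in the same YYYYMMDD float layout as the raw Date column
raw = pd.Series([20160630.0, np.nan, 20160101.0])

# Parse only the non-missing values, then restore the original index so
# that the missing entries come back as NaT
text = raw.dropna().astype(int).astype(str)
parsed = pd.to_datetime(text, format="%Y%m%d").reindex(raw.index)

print(parsed)
```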

5. Feature Engineering (Feature Construction)

5.1 Feature Construction: Full Dataset

5.1.1 Time Features

# Days from receiving the coupon to the purchase (NaN when either date is missing)
date_interval = data_off['Date']-data_off['Date_received']
data_off['date_interval'] = date_interval.dt.days
# Day of the week on which the coupon was received (1 = Monday ... 7 = Sunday)
data_off['receive_week'] = data_off['Date_received'].dt.weekday+1
off_test['receive_week'] = off_test['Date_received'].dt.weekday+1

# Whether the coupon was received on a weekend (missing dates count as 0)
data_off['receive_isWeekend'] = (data_off['receive_week']>5).astype(int)
off_test['receive_isWeekend'] = (off_test['receive_week']>5).astype(int)
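To make the weekday encoding concrete, here is a small check on made-up dates (2016-05-02 was a Monday). The variable names are illustrative only; the logic is the same 1-to-7 weekday number and weekend flag as above.

```python
import pandas as pd

# Monday, Saturday, Sunday, and a missing receive date
dates = pd.Series(pd.to_datetime(["2016-05-02", "2016-05-07", "2016-05-08", None]))

receive_week = dates.dt.weekday + 1             # 1 = Monday ... 7 = Sunday, NaN if missing
is_weekend = (receive_week > 5).astype(int)     # NaN > 5 is False, so missing maps to 0

print(receive_week.tolist())
print(is_weekend.tolist())
```

A missing receive date therefore ends up with weekend flag 0, which is worth keeping in mind when reading the feature later.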

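5.1.2 Coupon Features

# Discount rate
def deal_rate(x):
    if pd.isna(x):  # no coupon
        y = np.nan
    elif ":" in x:  # full-reduction coupon written as "threshold:reduction"
        a = float(x.split(":")[0])  # denominator: spending threshold
        b = a-float(x.split(":")[1])  # numerator: amount paid after the reduction
        y = np.round(b/a,2)
    else:  # direct-discount coupon, already stored as a rate
        y = float(x)
    return y
data_off['Discount_rate_%'] = data_off['Discount_rate'].map(deal_rate)
off_test['Discount_rate_%'] = off_test['Discount_rate'].map(deal_rate)

# Spending threshold
def deal_mk(x):
    if pd.isna(x):  # no coupon
        y = np.nan
    elif ":" in x:  # full-reduction coupon: the threshold is the part before ":"
        y = int(x.split(":")[0])
    else:  # direct-discount coupon: no threshold
        y = np.nan
    return y
# Series.apply takes only the function here; the stray positional 1 has been dropped
data_off['Discount_rate_mk'] = data_off['Discount_rate'].apply(deal_mk)
off_test['Discount_rate_mk'] = off_test['Discount_rate'].apply(deal_mk)
data_off.head()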
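A worked example may make the two parsers easier to follow. The sketch below folds both rules into one hypothetical helper, parse_discount, which is not part of the original code, and applies it to three made-up values covering the three branches: a full-reduction coupon, a direct discount, and a missing value.

```python
import numpy as np
import pandas as pd

def parse_discount(raw):
    """Return (discount rate, spending threshold) for one Discount_rate value."""
    if pd.isna(raw):                      # no coupon
        return np.nan, np.nan
    raw = str(raw)
    if ":" in raw:                        # "threshold:reduction"
        threshold, reduction = (float(p) for p in raw.split(":"))
        return round((threshold - reduction) / threshold, 2), threshold
    return float(raw), np.nan             # direct discount, no threshold

examples = ["150:20", "0.9", np.nan]
parsed = [parse_discount(v) for v in examples]
print(parsed)
```

For "150:20" the rate is (150 - 20) / 150, which rounds to 0.87, and the threshold is 150.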

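5.1.3 Constructing the Prediction Target

# Y = 1 if the coupon was redeemed within 15 days of receipt; otherwise 0
# (rows with a missing interval, i.e. no coupon or no purchase, also get 0)
data_off['Y'] = data_off['date_interval'].apply(lambda x:1 if x<=15 else 0)
data_off.head()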
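The label rule can be checked on four made-up rows: redeemed after 9 days, redeemed after 19 days, never redeemed, and a purchase without a coupon. This is a minimal sketch that reuses the column names Date_received, Date and Y; the data itself is invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date_received": pd.to_datetime(["2016-05-01", "2016-05-01", "2016-05-01", None]),
    "Date": pd.to_datetime(["2016-05-10", "2016-05-20", None, "2016-05-03"]),
})

# Days between receipt and purchase; NaN when either date is missing
interval = (toy["Date"] - toy["Date_received"]).dt.days
# Comparisons with NaN are False, so only a real interval of 15 days or fewer gives 1
toy["Y"] = (interval <= 15).astype(int)

labels = toy["Y"].tolist()
print(labels)
```

Only the first row is a positive sample; the purchase without a coupon is labeled 0 here and is filtered out later when rows without a coupon are dropped.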

5.2 Data Split: Sliding Time Windows

# Window 1: features from Jan-Apr, labels from May
feature1=data_off[((data_off['Date_received']>='2016-01-01')&(data_off['Date_received']<='2016-04-30')) | ((data_off['Date']>='2016-01-01')&(data_off['Date']<='2016-04-30'))]
feature1.reset_index(drop=True,inplace=True)
database1=data_off[((data_off['Date_received']>='2016-05-01')&(data_off['Date_received']<='2016-05-31')) | ((data_off['Date']>='2016-05-01')&(data_off['Date']<='2016-05-31'))]
database1.reset_index(drop=True,inplace=True)
print('Jan-Apr data: %i rows'%len(feature1))
print('May data: %i rows'%len(database1))
# Window 2: features from Feb-May, labels from June
feature2=data_off[((data_off['Date_received']>='2016-02-01')&(data_off['Date_received']<='2016-05-31')) | ((data_off['Date']>='2016-02-01')&(data_off['Date']<='2016-05-31'))]
feature2.reset_index(drop=True,inplace=True)
database2=data_off[((data_off['Date_received']>='2016-06-01')&(data_off['Date_received']<='2016-06-30')) | ((data_off['Date']>='2016-06-01')&(data_off['Date']<='2016-06-30'))]
database2.reset_index(drop=True,inplace=True)
print('Feb-May data: %i rows'%len(feature2))
print('June data: %i rows'%len(database2))
# Window 3: features from Mar-Jun, July test set to predict
feature3=data_off[((data_off['Date_received']>='2016-03-01')&(data_off['Date_received']<='2016-06-30')) | ((data_off['Date']>='2016-03-01')&(data_off['Date']<='2016-06-30'))]
feature3.reset_index(drop=True,inplace=True)
database3=off_test
print('Mar-Jun data: %i rows'%len(feature3))
print('July data: %i rows'%len(database3))

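5.3 Feature Construction: Sliding-Window Data

The same metrics are extracted separately from each dataset produced by the split.

5.3.1 User Features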

def user_feature(feature):
    # One row per user present in the feature window
    users = feature[['User_id']].drop_duplicates()
    # 1. Number of purchases per user (merchants not deduplicated)
    users_goods = feature[pd.notna(feature.Date)][['User_id','Merchant_id']]
    users_goods['Merchant_id']=1
    users_goods_nums = users_goods.groupby(by = 'User_id').sum()
    users_goods_nums.columns=['buy_num']
    users = pd.merge(users,users_goods_nums,on='User_id',how = 'left')
    # 2. Number of coupons received per user
    Coupon = feature[pd.notna(feature['Coupon_id'])][['User_id','Coupon_id']]
    Coupon['Coupon_id'] = 1
    Coupon_num = Coupon.groupby(by='User_id').sum()
    Coupon_num.columns = ['Coupon_get_num']
    users = pd.merge(users,Coupon_num,on='User_id',how='left')
    users['Coupon_get_num']=users['Coupon_get_num'].replace(np.nan,0)
    # 3. Number of purchases the user made with a coupon
    Used_Coupon = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['User_id','Coupon_id']]
    Used_Coupon['Coupon_id'] = 1
    Used_Coupon_num = Used_Coupon.groupby(by='User_id').sum()
    Used_Coupon_num.columns = ['Coupon_use_num']
    users = pd.merge(users,Used_Coupon_num,on='User_id',how='left')
    users['Coupon_use_num']=users['Coupon_use_num'].replace(np.nan,0)
    # 4. Share of the user's purchases that used a coupon
    users['yqgmgl'] = users['Coupon_use_num']/users['buy_num']
    # 5. User redemption rate (coupons used / coupons received)
    users['Coupon_use_rate'] = users['Coupon_use_num']/users['Coupon_get_num']
    # 6. Number of coupons each user redeemed within 15 days
    Used_Coupon = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))&(feature['date_interval']<=15)][['User_id','Coupon_id']]
    Used_Coupon['Coupon_id'] = 1
    Used_Coupon_num15 = Used_Coupon.groupby(by='User_id').sum()
    Used_Coupon_num15.columns = ['Coupon_use_num15']
    users = pd.merge(users,Used_Coupon_num15,on='User_id',how='left')
    users['Coupon_use_num15']=users['Coupon_use_num15'].replace(np.nan,0)
    # 7. Each user's 15-day redemption rate
    users['Coupon_use_rate15'] = users['Coupon_use_num15']/users['Coupon_get_num']
    # 8. Number of distinct merchants the user bought from (merchants deduplicated)
    users_goods = feature[pd.notna(feature.Date)][['User_id','Merchant_id']]
    users_goods = users_goods.drop_duplicates()
    users_goods['Merchant_id']=1
    users_goods_nums = users_goods.groupby(by = 'User_id').sum()
    users_goods_nums.columns=['buy_merchant_num']
    users = pd.merge(users,users_goods_nums,on='User_id',how = 'left')
    # 9. Days between receiving and using a coupon (minimum and mean)
    get_user_date = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['User_id','date_interval']]
    min_interval = get_user_date.groupby('User_id').min()
    min_interval.columns = ['user_min_interval']
    mean_interval = get_user_date.groupby('User_id').mean()
    mean_interval.columns = ['user_mean_interval']
    users = pd.merge(users,min_interval,on='User_id',how='left')
    users = pd.merge(users,mean_interval,on='User_id',how='left')
    # 10. User-to-merchant distance for coupon purchases (max/min/mean)
    distance = feature[(pd.notna(feature.Distance))&(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['User_id','Distance']]
    user_distance_max = distance.groupby(by='User_id').max()
    user_distance_max.columns = ['user_distance_max']
    user_distance_min = distance.groupby(by='User_id').min()
    user_distance_min.columns = ['user_distance_min']
    user_distance_mean = distance.groupby(by='User_id').mean()
    user_distance_mean.columns = ['user_distance_mean']
    users = pd.merge(users,user_distance_max,on='User_id',how='left')
    users = pd.merge(users,user_distance_mean,on='User_id',how='left')
    users = pd.merge(users,user_distance_min,on='User_id',how='left')

    # 11. Spending threshold of the user's full-reduction coupons (mean/min/max).
    #     Note: the filter keeps every coupon that has a threshold, redeemed or not.
    mk = feature[pd.notna(feature['Discount_rate_mk'])][['User_id','Discount_rate_mk']]
    user_Discount_mk_mean =mk.groupby(by='User_id').mean()
    user_Discount_mk_mean.columns = ['user_Discount_mk_mean']
    users = pd.merge(users,user_Discount_mk_mean,on='User_id',how='left')

    # min and max previously reused .mean(); they now use the matching aggregation
    user_Discount_mk_min =mk.groupby(by='User_id').min()
    user_Discount_mk_min.columns = ['user_Discount_mk_min']
    users = pd.merge(users,user_Discount_mk_min,on='User_id',how='left')

    user_Discount_mk_max =mk.groupby(by='User_id').max()
    user_Discount_mk_max.columns = ['user_Discount_mk_max']
    users = pd.merge(users,user_Discount_mk_max,on='User_id',how='left')

    # Users with no purchase in the window get a count of 0
    users.buy_num =users.buy_num.replace(np.nan,0)
    users.buy_merchant_num =users.buy_merchant_num.replace(np.nan,0)

    return users
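The count and rate features above all follow one recipe: flag the rows of interest, sum the flags per user, then divide. The sketch below shows that recipe in a single groupby pass on made-up data. The column names got, bought, used and use_rate are hypothetical, and the explicit replace(0, NaN) only makes the zero-denominator case visible; it is an illustration, not a replacement for user_feature.

```python
import numpy as np
import pandas as pd

# Made-up records: one row per coupon receipt and/or purchase
toy = pd.DataFrame({
    "User_id": [1, 1, 1, 2, 2, 3],
    "Date_received": pd.to_datetime(
        ["2016-01-05", "2016-01-20", None, "2016-02-01", "2016-02-03", None]),
    "Date": pd.to_datetime(
        ["2016-01-08", None, "2016-03-01", None, None, "2016-03-02"]),
})

# Boolean flags per row, summed per user to obtain the counts
flags = pd.DataFrame({
    "User_id": toy["User_id"],
    "got": toy["Date_received"].notna(),
    "bought": toy["Date"].notna(),
    "used": toy["Date_received"].notna() & toy["Date"].notna(),
})
stats = flags.groupby("User_id").sum().astype(int)

# Redemption rate; users who never received a coupon get NaN instead of 0/0
stats["use_rate"] = stats["used"] / stats["got"].replace(0, np.nan)
print(stats)
```

User 3 bought without ever receiving a coupon, so the rate stays NaN, which tree-based models can usually treat as "no history".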

5.3.2 Merchant Features

def Merchant_feature(feature):
    # One row per merchant present in the feature window
    Merchants = feature[['Merchant_id']].drop_duplicates()
    # 1. Total number of purchases at the merchant
    Merchant_sale = feature[pd.notna(feature['Date'])][['Merchant_id']]
    Merchant_sale['Merchant_sale_num'] = 1
    Merchant_sale_num = Merchant_sale.groupby(by='Merchant_id').sum()
    Merchants = pd.merge(Merchants,Merchant_sale_num,on='Merchant_id',how='left')
    Merchants['Merchant_sale_num']=Merchants['Merchant_sale_num'].replace(np.nan,0)
    # 2. Number of the merchant's coupons that were received
    Merchant_coupons = feature[pd.notna(feature['Date_received'])][['Merchant_id']]
    Merchant_coupons['Merchant_coupons_num'] = 1
    Merchant_coupons_num = Merchant_coupons.groupby(by='Merchant_id').sum()
    Merchants = pd.merge(Merchants,Merchant_coupons_num,on='Merchant_id',how='left')
    Merchants['Merchant_coupons_num']=Merchants['Merchant_coupons_num'].replace(np.nan,0)
    # 3. Number of purchases at the merchant made with a coupon
    Merchant_coupons_buy = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Date']))][['Merchant_id']]
    Merchant_coupons_buy['Merchant_coupons_buy_num'] = 1
    Merchant_coupons_buy_num = Merchant_coupons_buy.groupby(by='Merchant_id').sum()
    Merchants = pd.merge(Merchants,Merchant_coupons_buy_num,on='Merchant_id',how='left')
    Merchants['Merchant_coupons_buy_num']=Merchants['Merchant_coupons_buy_num'].replace(np.nan,0)
    # 4. Merchant coupon-usage rate (coupon purchases / all purchases)
    Merchants['Merchant_user_rate'] = Merchants['Merchant_coupons_buy_num']/Merchants['Merchant_sale_num']
    # 5. Merchant redemption rate (coupon purchases / coupons received)
    Merchants['Merchant_rate'] = Merchants['Merchant_coupons_buy_num']/Merchants['Merchant_coupons_num']
    # 6. Coupons redeemed within 15 days: count and rate
    Merchant_Coupon = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))&(feature['date_interval']<=15)][['Merchant_id','Coupon_id']]
    Merchant_Coupon['Coupon_id'] = 1
    Merchant_Coupon_num15 = Merchant_Coupon.groupby(by='Merchant_id').sum()
    Merchant_Coupon_num15.columns = ['Merchant_Coupon_use_num15']
    Merchants = pd.merge(Merchants,Merchant_Coupon_num15,on='Merchant_id',how='left')
    Merchants['Merchant_Coupon_use_num15']=Merchants['Merchant_Coupon_use_num15'].replace(np.nan,0)
    Merchants['Merchant_Coupon_use_rate15'] = Merchants['Merchant_Coupon_use_num15']/Merchants['Merchant_coupons_num']
    # 7. Merchant-to-customer distance over redeemed coupons (max/mean)
    Merchant_distance = feature[(pd.notna(feature['Distance']))&(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['Merchant_id','Distance']]
    Merchant_distance_max = Merchant_distance.groupby(by='Merchant_id').max()
    Merchant_distance_max.columns = ['Merchant_distance_max']
    Merchant_distance_mean = Merchant_distance.groupby(by='Merchant_id').mean()
    Merchant_distance_mean.columns = ['Merchant_distance_mean']
    Merchants = pd.merge(Merchants,Merchant_distance_mean,on='Merchant_id',how='left')
    Merchants = pd.merge(Merchants,Merchant_distance_max,on='Merchant_id',how='left')
    # 8. Spending threshold of the merchant's redeemed coupons (min/mean/max)
    Merchant_mk = feature[(pd.notna(feature['Discount_rate_mk']))&(pd.notna(feature['Date']))][['Discount_rate_mk','Merchant_id']]
    Merchant_mk_min = Merchant_mk.groupby(by='Merchant_id').min()
    Merchant_mk_min.columns = ['Merchant_mk_min']
    Merchant_mk_mean = Merchant_mk.groupby(by='Merchant_id').mean()
    Merchant_mk_mean.columns = ['Merchant_mk_mean']
    # max previously reused .mean(); it now uses .max()
    Merchant_mk_max = Merchant_mk.groupby(by='Merchant_id').max()
    Merchant_mk_max.columns = ['Merchant_mk_max']
    Merchants = pd.merge(Merchants,Merchant_mk_min,on='Merchant_id',how='left')
    Merchants = pd.merge(Merchants,Merchant_mk_max,on='Merchant_id',how='left')
    Merchants = pd.merge(Merchants,Merchant_mk_mean,on='Merchant_id',how='left')
    # 9. Days from receipt to redemption for the merchant's coupons (min/mean)
    Merchant_interval = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Date']))][['Merchant_id','date_interval']]
    min_interval = Merchant_interval.groupby('Merchant_id').min()
    min_interval.columns = ['Merchant_min_interval']
    mean_interval = Merchant_interval.groupby('Merchant_id').mean()
    mean_interval.columns = ['Merchant_mean_interval']
    Merchants = pd.merge(Merchants,min_interval,on='Merchant_id',how='left')
    Merchants = pd.merge(Merchants,mean_interval,on='Merchant_id',how='left')
    return Merchants

5.3.3 Coupon Features

def couponsType_feature(feature):
    # One row per coupon type (distinct Discount_rate value) in the feature window
    Coupons = feature[pd.notna(feature['Discount_rate'])][['Discount_rate']].drop_duplicates()
    # 1. Number of times each coupon type was received
    Coupons_Type_get = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Discount_rate']))][['Discount_rate']]
    Coupons_Type_get['Coupons_Type_get_num'] = 1
    Coupons_Type_get_num = Coupons_Type_get.groupby(by='Discount_rate').sum()
    Coupons = pd.merge(Coupons,Coupons_Type_get_num,on='Discount_rate',how='left')
    Coupons['Coupons_Type_get_num']=Coupons['Coupons_Type_get_num'].replace(np.nan,0)
    # 2. Number of times each coupon type was used
    Coupons_Type_use = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Date']))][['Discount_rate']]
    Coupons_Type_use['Coupons_Type_use_num'] = 1
    Coupons_Type_use_num = Coupons_Type_use.groupby(by='Discount_rate').sum()
    Coupons = pd.merge(Coupons,Coupons_Type_use_num,on='Discount_rate',how='left')
    Coupons['Coupons_Type_use_num']=Coupons['Coupons_Type_use_num'].replace(np.nan,0)
    # 3. Redemption rate of each coupon type
    Coupons['Coupons_Type_rate']=Coupons['Coupons_Type_use_num']/Coupons['Coupons_Type_get_num']
    # 4. Number of redemptions within 15 days for each coupon type
    Coupon15 = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))&(feature['date_interval']<=15)][['Discount_rate']]
    Coupon15['Coupon15_use_num'] = 1
    Coupon15_num15 = Coupon15.groupby(by='Discount_rate').sum()
    Coupons = pd.merge(Coupons,Coupon15_num15,on='Discount_rate',how='left')
    # Fill missing counts with 0, as the other count features do
    Coupons['Coupon15_use_num']=Coupons['Coupon15_use_num'].replace(np.nan,0)
    # 5. 15-day redemption rate of each coupon type
    Coupons['Coupons15_Type_rate']=Coupons['Coupon15_use_num']/Coupons['Coupons_Type_get_num']
    # 6. Distance at which each coupon type was used (max/mean)
    Coupon_distance = feature[(pd.notna(feature['Distance']))&(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['Discount_rate','Distance']]
    Coupon_distance_max = Coupon_distance.groupby(by='Discount_rate').max()
    Coupon_distance_max.columns = ['Coupons_Type_distance_max']
    Coupon_distance_mean = Coupon_distance.groupby(by='Discount_rate').mean()
    Coupon_distance_mean.columns = ['Coupons_Type_distance_mean']
    Coupons = pd.merge(Coupons,Coupon_distance_mean,on='Discount_rate',how='left')
    Coupons = pd.merge(Coupons,Coupon_distance_max,on='Discount_rate',how='left')
    # 7. Days from receipt to use for each coupon type (mean/min)
    Coupon_interval = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['Discount_rate','date_interval']]
    Coupon_interval_min = Coupon_interval.groupby(by='Discount_rate').min()
    Coupon_interval_min.columns = ['Coupons_Type_interval_min']
    Coupon_interval_mean = Coupon_interval.groupby(by='Discount_rate').mean()
    Coupon_interval_mean.columns = ['Coupons_Type_interval_mean']
    Coupons = pd.merge(Coupons,Coupon_interval_mean,on='Discount_rate',how='left')
    Coupons = pd.merge(Coupons,Coupon_interval_min,on='Discount_rate',how='left')
    return Coupons

5.3.4 User-Coupon Joint Features

def User_CouponsType_feature(feature):
    # One row per (user, coupon type) pair
    User_Coupons = feature[['User_id','Discount_rate']]
    User_Coupons = User_Coupons.drop_duplicates()
    # 1. Number of times the user received this coupon type
    User_CouponType_get = feature[pd.notna(feature['Date_received'])][['User_id','Discount_rate']]
    User_CouponType_get['User_CouponType_get_num'] = 1
    User_CouponType_get = User_CouponType_get.groupby(['User_id','Discount_rate']).sum()
    User_Coupons = pd.merge(User_Coupons,User_CouponType_get,on=['User_id','Discount_rate'],how='left')
    User_Coupons['User_CouponType_get_num']=User_Coupons['User_CouponType_get_num'].replace(np.nan,0)
    # 2. Number of times the user used this coupon type
    User_CouponType_use = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Date']))][['User_id','Discount_rate']]
    User_CouponType_use['User_CouponType_use_num'] = 1
    User_CouponType_use = User_CouponType_use.groupby(['User_id','Discount_rate']).sum()
    User_Coupons = pd.merge(User_Coupons,User_CouponType_use,on=['User_id','Discount_rate'],how='left')
    User_Coupons['User_CouponType_use_num']=User_Coupons['User_CouponType_use_num'].replace(np.nan,0)
    # 3. The user's redemption rate for this coupon type
    User_Coupons['User_Coupons_rate'] = User_Coupons['User_CouponType_use_num']/User_Coupons['User_CouponType_get_num']
    # 4. Number of redemptions within 15 days
    User_Coupon15 = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))&(feature['date_interval']<=15)][['User_id','Discount_rate']]
    User_Coupon15['User_Coupon15_use_num'] = 1
    User_Coupon15_num15 = User_Coupon15.groupby(['User_id','Discount_rate']).sum()
    User_Coupons = pd.merge(User_Coupons,User_Coupon15_num15,on=['User_id','Discount_rate'],how='left')
    User_Coupons['User_Coupon15_use_num']=User_Coupons['User_Coupon15_use_num'].replace(np.nan,0)
    # 5. 15-day redemption rate
    User_Coupons['User_Coupons_rate15'] = User_Coupons['User_Coupon15_use_num']/User_Coupons['User_CouponType_get_num']
    # 6. Days from receipt to use (min/mean)
    User_Coupon_interval = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['User_id','Discount_rate','date_interval']]
    User_Coupon_interval_min = User_Coupon_interval.groupby(['Discount_rate','User_id']).min()
    User_Coupon_interval_min.columns = ['User_Coupons_Type_interval_min']
    User_Coupon_interval_mean = User_Coupon_interval.groupby(['Discount_rate','User_id']).mean()
    User_Coupon_interval_mean.columns = ['User_Coupons_Type_interval_mean']
    User_Coupons = pd.merge(User_Coupons,User_Coupon_interval_mean,on=['Discount_rate','User_id'],how='left')
    User_Coupons = pd.merge(User_Coupons,User_Coupon_interval_min,on=['Discount_rate','User_id'],how='left')
    return User_Coupons

5.3.5 User-Merchant Joint Features

def User_Merchants_feature(feature):
    # One row per (user, merchant) pair
    User_Merchants = feature[['User_id','Merchant_id']]
    User_Merchants = User_Merchants.drop_duplicates()
    # 1. Number of purchases the user made at this merchant
    User_Merchant_buy = feature[pd.notna(feature['Date'])][['User_id','Merchant_id']]
    User_Merchant_buy['User_Merchant_buy_num'] = 1
    User_Merchant_buy = User_Merchant_buy.groupby(['User_id','Merchant_id']).sum()
    User_Merchants = pd.merge(User_Merchants,User_Merchant_buy,on=['User_id','Merchant_id'],how='left')
    User_Merchants['User_Merchant_buy_num']=User_Merchants['User_Merchant_buy_num'].replace(np.nan,0)
    # 2. Number of coupons the user received from this merchant
    User_Merchant_get = feature[pd.notna(feature['Date_received'])][['User_id','Merchant_id']]
    User_Merchant_get['User_Merchant_get_num'] = 1
    User_Merchant_get = User_Merchant_get.groupby(['User_id','Merchant_id']).sum()
    User_Merchants = pd.merge(User_Merchants,User_Merchant_get,on=['User_id','Merchant_id'],how='left')
    User_Merchants['User_Merchant_get_num']=User_Merchants['User_Merchant_get_num'].replace(np.nan,0)
    # 3. Number of coupons the user used at this merchant
    User_Merchant_use = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Date']))][['User_id','Merchant_id']]
    User_Merchant_use['User_Merchant_use_num'] = 1
    User_Merchant_use = User_Merchant_use.groupby(['User_id','Merchant_id']).sum()
    User_Merchants = pd.merge(User_Merchants,User_Merchant_use,on=['User_id','Merchant_id'],how='left')
    User_Merchants['User_Merchant_use_num']=User_Merchants['User_Merchant_use_num'].replace(np.nan,0)
    # 4. The user's redemption rate at this merchant (used / received)
    User_Merchants['User_Merchants_rate'] = User_Merchants['User_Merchant_use_num']/User_Merchants['User_Merchant_get_num']
    # 5. Coupon-usage rate (coupon purchases / all purchases)
    User_Merchants['User_Merchants_user_rate'] = User_Merchants['User_Merchant_use_num']/User_Merchants['User_Merchant_buy_num']
    # 6. Number of redemptions within 15 days
    User_Merchant15 = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))&(feature['date_interval']<=15)][['User_id','Merchant_id']]
    User_Merchant15['User_Merchant15_use_num'] = 1
    User_Merchant15_num15 = User_Merchant15.groupby(['User_id','Merchant_id']).sum()
    User_Merchants = pd.merge(User_Merchants,User_Merchant15_num15,on=['User_id','Merchant_id'],how='left')
    User_Merchants['User_Merchant15_use_num']=User_Merchants['User_Merchant15_use_num'].replace(np.nan,0)
    # 7. 15-day redemption rate
    User_Merchants['User_Merchant_rate15'] = User_Merchants['User_Merchant15_use_num']/User_Merchants['User_Merchant_get_num']
    # 8. Days from receipt to use (min/mean)
    User_Merchant_interval = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['User_id','Merchant_id','date_interval']]
    User_Merchant_interval_min = User_Merchant_interval.groupby(['User_id','Merchant_id']).min()
    User_Merchant_interval_min.columns = ['User_Merchants_Type_interval_min']
    User_Merchant_interval_mean = User_Merchant_interval.groupby(['User_id','Merchant_id']).mean()
    User_Merchant_interval_mean.columns = ['User_Merchants_Type_interval_mean']
    User_Merchants = pd.merge(User_Merchants,User_Merchant_interval_mean,on=['User_id','Merchant_id'],how='left')
    User_Merchants = pd.merge(User_Merchants,User_Merchant_interval_min,on=['User_id','Merchant_id'],how='left')
    return User_Merchants
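The joint-feature functions repeat one pattern: filter rows with a mask, count them per key combination, left-merge the counts onto the distinct keys, and fill the gaps with 0. The sketch below factors that pattern into a hypothetical helper, count_by, on made-up data. Neither the helper nor the column names get_num and use_num come from the original code; treat it as one possible way to shorten the functions rather than the author's method.

```python
import pandas as pd

def count_by(df, mask, keys, name):
    """Count the rows of df[mask] per key combination; returns keys + one count column."""
    return df[mask].groupby(keys).size().rename(name).reset_index()

# Made-up records: (user, merchant, receive date, purchase date)
toy = pd.DataFrame({
    "User_id": [1, 1, 1, 2],
    "Merchant_id": [10, 10, 20, 10],
    "Date_received": pd.to_datetime(["2016-01-05", "2016-01-20", None, "2016-02-01"]),
    "Date": pd.to_datetime(["2016-01-08", None, "2016-03-01", None]),
})

keys = ["User_id", "Merchant_id"]
received = toy["Date_received"].notna()
used = received & toy["Date"].notna()

out = toy[keys].drop_duplicates()
out = out.merge(count_by(toy, received, keys, "get_num"), on=keys, how="left")
out = out.merge(count_by(toy, used, keys, "use_num"), on=keys, how="left")
out = out.fillna({"get_num": 0, "use_num": 0})
print(out)
```

Using size() counts rows directly, so there is no need to overwrite an existing column with 1 before summing.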

5.3.6 Merchant-Coupon Joint Features

def Merchants_CouponsType_feature(feature):
    # One row per (merchant, coupon type) pair
    Merchants_Coupons = feature[['Merchant_id','Discount_rate']]
    Merchants_Coupons = Merchants_Coupons.drop_duplicates()
    # 1. Number of times this coupon type of the merchant was received
    Merchants_CouponType_get = feature[pd.notna(feature['Date_received'])][['Merchant_id','Discount_rate']]
    Merchants_CouponType_get['Merchants_CouponType_get_num'] = 1
    Merchants_CouponType_get = Merchants_CouponType_get.groupby(['Merchant_id','Discount_rate']).sum()
    Merchants_Coupons = pd.merge(Merchants_Coupons,Merchants_CouponType_get,on=['Merchant_id','Discount_rate'],how='left')
    Merchants_Coupons['Merchants_CouponType_get_num']=Merchants_Coupons['Merchants_CouponType_get_num'].replace(np.nan,0)
    # 2. Number of times this coupon type of the merchant was used
    Merchants_CouponType_use = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Date']))][['Merchant_id','Discount_rate']]
    Merchants_CouponType_use['Merchants_CouponType_use_num'] = 1
    Merchants_CouponType_use = Merchants_CouponType_use.groupby(['Merchant_id','Discount_rate']).sum()
    Merchants_Coupons = pd.merge(Merchants_Coupons,Merchants_CouponType_use,on=['Merchant_id','Discount_rate'],how='left')
    Merchants_Coupons['Merchants_CouponType_use_num']=Merchants_Coupons['Merchants_CouponType_use_num'].replace(np.nan,0)
    # 3. Redemption rate of this coupon type at the merchant
    Merchants_Coupons['Merchants_Coupons_rate'] = Merchants_Coupons['Merchants_CouponType_use_num']/Merchants_Coupons['Merchants_CouponType_get_num']

    # 4. Number of redemptions within 15 days
    Merchants_CouponsType15 = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))&(feature['date_interval']<=15)][['Discount_rate','Merchant_id']]
    Merchants_CouponsType15['Merchants_CouponsType15_use_num'] = 1
    Merchants_CouponsType15_num15 = Merchants_CouponsType15.groupby(['Discount_rate','Merchant_id']).sum()
    Merchants_Coupons = pd.merge(Merchants_Coupons,Merchants_CouponsType15_num15,on=['Discount_rate','Merchant_id'],how='left')
    Merchants_Coupons['Merchants_CouponsType15_use_num']=Merchants_Coupons['Merchants_CouponsType15_use_num'].replace(np.nan,0)
    # 5. 15-day redemption rate
    Merchants_Coupons['Merchants_CouponsType_rate15'] = Merchants_Coupons['Merchants_CouponsType15_use_num']/Merchants_Coupons['Merchants_CouponType_get_num']
    # 6. Days from receipt to use (min/mean)
    Merchants_Coupon_interval = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['Discount_rate','Merchant_id','date_interval']]
    Merchants_Coupon_interval_min = Merchants_Coupon_interval.groupby(['Discount_rate','Merchant_id']).min()
    Merchants_Coupon_interval_min.columns = ['Merchants_Coupons_Type_interval_min']
    Merchants_Coupon_interval_mean = Merchants_Coupon_interval.groupby(['Discount_rate','Merchant_id']).mean()
    Merchants_Coupon_interval_mean.columns = ['Merchants_Coupons_Type_interval_mean']
    Merchants_Coupons = pd.merge(Merchants_Coupons,Merchants_Coupon_interval_mean,on=['Discount_rate','Merchant_id'],how='left')
    Merchants_Coupons = pd.merge(Merchants_Coupons,Merchants_Coupon_interval_min,on=['Discount_rate','Merchant_id'],how='left')
    return Merchants_Coupons

5.3.7 User-Merchant-Coupon Joint Features

def M_C_UType_feature(feature):
    # One row per (merchant, coupon type, user) triple
    M_C_U = feature[['Merchant_id','Discount_rate','User_id']]
    M_C_U = M_C_U.drop_duplicates()
    # 1. User-merchant-coupon: number of times received
    M_C_U_get = feature[pd.notna(feature['Date_received'])][['Merchant_id','Discount_rate','User_id']]
    M_C_U_get['M_C_U_get_num'] = 1
    M_C_U_get = M_C_U_get.groupby(['Merchant_id','Discount_rate','User_id']).sum()
    M_C_U = pd.merge(M_C_U,M_C_U_get,on=['Merchant_id','Discount_rate','User_id'],how='left')
    M_C_U['M_C_U_get_num']=M_C_U['M_C_U_get_num'].replace(np.nan,0)
    # 2. User-merchant-coupon: number of times used
    M_C_U_use = feature[(pd.notna(feature['Date_received']))&(pd.notna(feature['Date']))][['Merchant_id','Discount_rate','User_id']]
    M_C_U_use['M_C_U_use_num'] = 1
    M_C_U_use = M_C_U_use.groupby(['Merchant_id','Discount_rate','User_id']).sum()
    M_C_U = pd.merge(M_C_U,M_C_U_use,on=['Merchant_id','Discount_rate','User_id'],how='left')
    M_C_U['M_C_U_use_num']=M_C_U['M_C_U_use_num'].replace(np.nan,0)
    # 3. User-merchant-coupon redemption rate
    M_C_U['M_C_U_rate'] = M_C_U['M_C_U_use_num']/M_C_U['M_C_U_get_num']

    # 4. Number of redemptions within 15 days
    M_C_U15 = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))&(feature['date_interval']<=15)][['Merchant_id','Discount_rate','User_id']]
    M_C_U15['M_C_U15_use_num'] = 1
    M_C_U15_num15 = M_C_U15.groupby(['Merchant_id','Discount_rate','User_id']).sum()
    M_C_U = pd.merge(M_C_U,M_C_U15_num15,on=['Merchant_id','Discount_rate','User_id'],how='left')
    M_C_U['M_C_U15_use_num']=M_C_U['M_C_U15_use_num'].replace(np.nan,0)
    # 5. 15-day redemption rate
    M_C_U['M_C_UType_rate15'] = M_C_U['M_C_U15_use_num']/M_C_U['M_C_U_get_num']

    # 6. Days from receipt to use (min/mean)
    M_C_U_interval = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['Merchant_id','Discount_rate','User_id','date_interval']]
    M_C_U_interval_min = M_C_U_interval.groupby(['Merchant_id','Discount_rate','User_id']).min()
    M_C_U_interval_min.columns = ['M_C_Us_Type_interval_min']
    M_C_U_interval_mean = M_C_U_interval.groupby(['Merchant_id','Discount_rate','User_id']).mean()
    M_C_U_interval_mean.columns = ['M_C_Us_Type_interval_mean']
    # Merge back into M_C_U itself; the merges previously went into an unused
    # variable (M_C_Us), so both interval columns were dropped from the result
    M_C_U = pd.merge(M_C_U,M_C_U_interval_mean,on=['Merchant_id','Discount_rate','User_id'],how='left')
    M_C_U = pd.merge(M_C_U,M_C_U_interval_min,on=['Merchant_id','Discount_rate','User_id'],how='left')
    return M_C_U
# "Leakage" features: statistics computed on the prediction-window data itself
def leakage(database3):
    # 1. Number of coupons each user received in this window
    #    (the column shares its name with the one from user_feature, so the later
    #    merge yields Coupon_get_num_x and Coupon_get_num_y)
    Coupon = database3[pd.notna(database3['Coupon_id'])][['User_id','Coupon_id']]
    Coupon['Coupon_id'] = 1
    Coupon_num = Coupon.groupby(by='User_id').sum()
    Coupon_num.columns = ['Coupon_get_num']
    database3 = pd.merge(database3,Coupon_num,on=['User_id'],how='left')
    # 2. Number of coupons of a given type the user received this month
    User_CouponType_get = database3[pd.notna(database3['Date_received'])][['User_id','Discount_rate']]
    User_CouponType_get['leakage_User_CouponType_get_num'] = 1
    User_CouponType_get = User_CouponType_get.groupby(['User_id','Discount_rate']).sum()
    database3 = pd.merge(database3,User_CouponType_get,on=['User_id','Discount_rate'],how='left')

    # 3. Number of coupons the user received from a given merchant
    User_Merchant_get = database3[pd.notna(database3['Date_received'])][['User_id','Merchant_id']]
    User_Merchant_get['leakage_User_Merchant_get_num'] = 1
    User_Merchant_get = User_Merchant_get.groupby(['User_id','Merchant_id']).sum()
    database3 = pd.merge(database3,User_Merchant_get,on=['User_id','Merchant_id'],how='left')
    # 4. Number of coupons the user received on the same day
    Coupon_day = database3[pd.notna(database3['Coupon_id'])][['User_id','Date_received']]
    Coupon_day['leakage_Coupon_dayget_num'] = 1
    Coupon_num_day = Coupon_day.groupby(['User_id','Date_received']).sum()
    database3 = pd.merge(database3,Coupon_num_day,on=['User_id','Date_received'],how='left')
    # 5. Number of coupons of a given type the user received on the same day
    Coupon_s_day = database3[pd.notna(database3['Coupon_id'])][['User_id','Date_received','Discount_rate']]
    Coupon_s_day['speleakage_Coupon_dayget_num'] = 1
    Coupon_num_s_day = Coupon_s_day.groupby(['User_id','Date_received','Discount_rate']).sum()
    database3 = pd.merge(database3,Coupon_num_s_day,on=['User_id','Date_received','Discount_rate'],how='left')
    # 6. Last and first day on which the user received this coupon type
    #    (only for types the user received more than once)
    lekge_user_SpeCouSum_maxday=database3[database3['leakage_User_CouponType_get_num']>1].groupby(['User_id','Discount_rate'])['Date_received'].max().reset_index().rename(columns={'Date_received':'lekge_user_SpeCouSum_maxday'})
    lekge_user_SpeCouSum_minday=database3[database3['leakage_User_CouponType_get_num']>1].groupby(['User_id','Discount_rate'])['Date_received'].min().reset_index().rename(columns={'Date_received':'lekge_user_SpeCouSum_minday'})
    database3=pd.merge(database3,lekge_user_SpeCouSum_maxday,how='left',on=['User_id','Discount_rate'])
    database3=pd.merge(database3,lekge_user_SpeCouSum_minday,how='left',on=['User_id','Discount_rate'])
    # 7. Whether this row is the first / last receipt of that coupon type:
    #    1 = yes, 0 = no, -1 = the type was received only once (no min/max day)
    database3['lekge_user_SpeCou_ifirst']=(database3['Date_received']-database3['lekge_user_SpeCouSum_minday']).apply(lambda x:1 if x.days==0 else 0 if x.days>0 else -1)
    database3['lekge_user_SpeCou_iflast']=(database3['lekge_user_SpeCouSum_maxday']-database3['Date_received']).apply(lambda x:1 if x.days==0 else 0 if x.days>0 else -1)
    return database3
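The first/last flags at the end of leakage can be hard to read in merge form. The sketch below reproduces the same idea with groupby(...).transform on made-up July receipts. The column names n_same_type, is_first and is_last are hypothetical; the -1 convention for coupon types a user received only once follows the function above.

```python
import pandas as pd

# Made-up July receipts: user 1 gets the same coupon type three times, user 2 once
toy = pd.DataFrame({
    "User_id": [1, 1, 1, 2],
    "Discount_rate": ["20:5", "20:5", "20:5", "0.9"],
    "Date_received": pd.to_datetime(
        ["2016-07-01", "2016-07-10", "2016-07-20", "2016-07-05"]),
})

grouped = toy.groupby(["User_id", "Discount_rate"])["Date_received"]
n_same_type = grouped.transform("count")   # receipts of this type by this user
first_day = grouped.transform("min")
last_day = grouped.transform("max")

repeated = n_same_type > 1                 # flags are only meaningful for repeats
toy["n_same_type"] = n_same_type
toy["is_first"] = (toy["Date_received"] == first_day).astype(int).where(repeated, -1)
toy["is_last"] = (toy["Date_received"] == last_day).astype(int).where(repeated, -1)
print(toy)
```

These features look at other receipts inside the prediction month, so they are only available when the whole month's receipts are known at prediction time, as they are in this competition's test set.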

Merge all of the features constructed above into one table:

def feature_all(feature3,y):
    # User features
    users = user_feature(feature3)
    # Merchant features
    Merchants = Merchant_feature(feature3)
    # Coupon features
    Coupons_type = couponsType_feature(feature3)
    # User-merchant features
    User_Merchants = User_Merchants_feature(feature3)
    # User-coupon features
    User_CouponsType = User_CouponsType_feature(feature3)
    # Merchant-coupon features
    Merchants_CouponsType = Merchants_CouponsType_feature(feature3)
    # Leakage features are computed on the label window itself
    y = leakage(y)
    feature_final = pd.merge(y,users,on='User_id',how='left')
    feature_final = pd.merge(feature_final,Merchants,on='Merchant_id',how='left')
    feature_final = pd.merge(feature_final,Coupons_type,on='Discount_rate',how='left')
    feature_final = pd.merge(feature_final,User_Merchants,on=['User_id','Merchant_id'],how='left')
    feature_final = pd.merge(feature_final,User_CouponsType,on=['User_id','Discount_rate'],how='left')
    feature_final = pd.merge(feature_final,Merchants_CouponsType,on=['Merchant_id','Discount_rate'],how='left')
    # Keep only rows that have a coupon (Discount_rate is not NaN)
    feature_final = feature_final[feature_final['Discount_rate'].notna()]
    feature_final['user_distance_max_interval'] = feature_final['Distance']-feature_final['user_distance_max']
    feature_final['user_distance_mean_interval'] = feature_final['Distance']-feature_final['user_distance_mean']
    feature_final['Merchant_distance_max_interval'] = feature_final['Distance']-feature_final['Merchant_distance_max']
    feature_final['Merchant_distance_mean_interval'] = feature_final['Distance']-feature_final['Merchant_distance_mean']
    feature_final['Coupons_Type_distance_mean_interval'] = feature_final['Distance']-feature_final['Coupons_Type_distance_mean']
    feature_final['Coupons_Type_distance_max_interval'] = feature_final['Distance']-feature_final['Coupons_Type_distance_max']
    feature_final['user_Discount_mk_mean_interval'] = feature_final['Discount_rate_mk']-feature_final['user_Discount_mk_mean']
    feature_final['user_Discount_mk_min_interval'] = feature_final['Discount_rate_mk']-feature_final['user_Discount_mk_min']
    feature_final['user_Discount_mk_max_interval'] = feature_final['Discount_rate_mk']-feature_final['user_Discount_mk_max']
    
#     feature_final = feature_final.replace(np.nan,-99999)
    return feature_final
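
The pattern used in `feature_all` (left-merge historical statistics onto the label-window samples, then subtract them from the current value) can be shown on a small example. This is a minimal sketch with made-up data; only the column names `Distance`, `user_distance_mean` and `user_distance_mean_interval` come from the code above.

```python
import pandas as pd

# Historical statistics per user, as a feature window would produce them (toy values)
hist = pd.DataFrame({'User_id': [1, 2], 'user_distance_mean': [2.0, 5.0]})

# Samples from the label window; user 3 has no history
samples = pd.DataFrame({'User_id': [1, 2, 3], 'Distance': [3.0, 1.0, 4.0]})

# how='left' keeps every sample; users without history get NaN
merged = samples.merge(hist, on='User_id', how='left')

# Interval feature: current distance minus the user's historical mean distance
merged['user_distance_mean_interval'] = merged['Distance'] - merged['user_distance_mean']

print(merged)
```

Rows for users without history end up as NaN, which XGBoost can handle as missing values; this is presumably why the `replace(np.nan,-99999)` line is left commented out.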

Dataset split

data3 = feature_all(feature3,database3)
data2 = feature_all(feature2,database2)
data1 = feature_all(feature1,database1)
data_train = pd.concat([data2,data1],axis =0)

y_train = data_train['Y'].values
x_train1 = data_train.drop(columns=['date_interval','Discount_rate','Date_received','Date','User_id','Merchant_id','Coupon_id','Y','lekge_user_SpeCouSum_maxday','lekge_user_SpeCouSum_minday'])
x_train = x_train1.values
print('Total: %i features'%len(x_train1.columns))
from sklearn.model_selection import train_test_split
# test_size=0.8 holds out 80% of the rows for evaluation, so the model is fitted on the remaining 20%
(train_x,test_x,train_y,test_y)=train_test_split(x_train, y_train,test_size=0.8,random_state=0)
x_pred1 = data3.drop(columns=['Discount_rate','Date_received','User_id','Merchant_id','Coupon_id','lekge_user_SpeCouSum_maxday','lekge_user_SpeCouSum_minday'])
x_pred = x_pred1.values
print('Total: %i features'%len(x_pred1.columns))

Output:
Total: 83 features
Total: 83 features
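
Equal feature counts do not by themselves guarantee that the training and prediction matrices have the same columns in the same order, and `.values` discards the column names. A small check like the one below can be run before converting to arrays. `align_features` is a hypothetical helper, not part of the original code, and the toy frames are made up.

```python
import pandas as pd

def align_features(train_df, pred_df):
    """Return pred_df with exactly the columns of train_df, in the same order."""
    missing = [c for c in train_df.columns if c not in pred_df.columns]
    extra = [c for c in pred_df.columns if c not in train_df.columns]
    if missing or extra:
        raise ValueError(f'missing={missing}, extra={extra}')
    return pred_df[list(train_df.columns)]

# Toy frames: same columns, different order
train_toy = pd.DataFrame({'Distance': [1.0], 'weekday': [3]})
pred_toy = pd.DataFrame({'weekday': [5], 'Distance': [2.0]})

aligned = align_features(train_toy, pred_toy)
print(list(aligned.columns))
```

In the post's setting this would be called as `align_features(x_train1, x_pred1)` before taking `.values`.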

6. Model Building

import xgboost as xgb
from sklearn.model_selection import train_test_split
from xgboost import plot_importance
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score,roc_curve,auc,recall_score,precision_score
from sklearn.preprocessing import MinMaxScaler
xgb_model = xgb.XGBClassifier(
 booster='gbtree',
 objective= 'binary:logistic', 
 eval_metric='auc',
 learning_rate =0.03, 
 n_estimators=1000, 
 max_depth=5, 
 min_child_weight=1.1, 
 gamma=0.1, 
 subsample=0.8, 
 colsample_bytree=0.8, 
 seed=10, 
 reg_alpha=0, 
 reg_lambda=0 
)
xgb_model.fit(train_x,train_y)
print('xgboost recall:',recall_score(test_y, xgb_model.predict(test_x)))
print('xgboost precision:',precision_score(test_y, xgb_model.predict(test_x)))
print('xgboost AUC:',roc_auc_score(test_y, xgb_model.predict_proba(test_x)[:,1]))
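
Recall and precision above are computed from `predict`, which returns hard class labels (for a binary classifier this amounts to a 0.5 probability cut-off), while AUC is computed from the probabilities and depends only on how the samples are ranked. On imbalanced data the two can disagree sharply. The sketch below uses hand-written toy probabilities, not output from the model above, to show the effect.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Toy labels and predicted probabilities (made up): positives are rare
# and every probability is below 0.5
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0])
proba = np.array([0.05, 0.10, 0.20, 0.30, 0.15, 0.45, 0.02, 0.08])

# AUC only looks at the ranking: both positives outrank every negative
auc_value = roc_auc_score(y_true, proba)

# With a 0.5 cut-off nothing is predicted positive
pred_default = (proba >= 0.5).astype(int)
recall_default = recall_score(y_true, pred_default, zero_division=0)
precision_default = precision_score(y_true, pred_default, zero_division=0)

# A lower cut-off recovers both positives
pred_low = (proba >= 0.25).astype(int)
recall_low = recall_score(y_true, pred_low, zero_division=0)
precision_low = precision_score(y_true, pred_low, zero_division=0)

print(auc_value, recall_default, precision_default, recall_low, precision_low)
```

A low recall at the default cut-off therefore does not necessarily mean the ranking is poor.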

7. Saving the Results

data33= pd.read_csv('/项目准备/O2O优惠券使用预测/offline_test.csv')
y_pred= xgb_model.predict_proba(x_pred)
print(len(y_pred))
a = pd.DataFrame(y_pred)[1].values
pred=pd.DataFrame({'User_id':data33['User_id'].values,"Coupon_id":data33['Coupon_id'].values,'Date_received':data33['Date_received'].values,'pred':a})
pred.to_csv('/项目准备/O2O优惠券使用预测/result16.csv',index=None,header=None)
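
The saving step pairs each test row with its predicted probability by position, so the two must have the same length. A minimal sketch of that step with a length check, using made-up rows and an in-memory buffer instead of a file (the names `test_rows`, `proba` and `submission` are hypothetical):

```python
import io
import pandas as pd

# Toy stand-ins for the test set and the predicted probabilities
test_rows = pd.DataFrame({
    'User_id': [1, 2],
    'Coupon_id': [11, 22],
    'Date_received': [20160701, 20160702],
})
proba = [0.12, 0.80]

# Rows and predictions are matched by position, so lengths must agree
if len(proba) != len(test_rows):
    raise ValueError('number of predictions does not match number of test rows')

submission = test_rows.assign(pred=proba)

# Same layout as the file written above: no index, no header
buf = io.StringIO()
submission.to_csv(buf, index=False, header=False)
print(buf.getvalue())
```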


After some simple parameter tuning, the submission scored 0.7882.


8. Takeaways

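When I first came across this dataset, I only intended to produce a data analysis report with Tableau. The competition turned out to be interesting, though, so I tried writing the code as well. Along the way I borrowed ideas from other bloggers, such as the sliding-window approach to splitting the data, but many of the features are still based on insights from my earlier data analysis report. I also learned a lot from this competition, for example:

  • Feature extraction is the essence of machine learning, and it tests your insight into the specific business. The earlier analysis showed that distance, the coupon's spending threshold, and the day of the week are all key factors in coupon redemption, so building these indicators into the feature set improves the model considerably.
  • Finding suitable features takes insight. For example, the feature "historical mean redemption distance of the coupon" on its own may have little effect on the predictions, whereas "distance of the coupon the user just received minus the coupon's historical mean redemption distance" has a marked effect. Finding suitable features therefore takes some business intuition.
  • Model tuning: parameter tuning has a limited effect on the model but takes a long time. I initially spent a great deal of time on tuning, but after improving the features I found that tuning changed little; feature engineering and the choice of prediction method often matter more. With enough computing power, grid search can be used to explore the parameters, but it takes a very long time, so this post only does simple manual tuning. For the specific steps, see the blog post on a general approach to xgboost parameter tuning.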
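
The grid search mentioned in the last point can be sketched as follows. This is an illustration under stated assumptions rather than the tuning actually used for the post: the data is synthetic, the grid is deliberately tiny, and scikit-learn's `GradientBoostingClassifier` stands in for `XGBClassifier` so that the example depends only on scikit-learn. Since `XGBClassifier` follows the scikit-learn estimator interface, the same `GridSearchCV` pattern applies to it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (stand-in for the real features)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# A deliberately small grid: 2 x 2 = 4 candidates, each evaluated with 3-fold CV
param_grid = {'max_depth': [2, 3], 'learning_rate': [0.05, 0.1]}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_grid=param_grid,
    scoring='roc_auc',  # select by AUC, the metric reported above
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
best_auc = search.best_score_
print(best_params, best_auc)
```

The number of fits grows multiplicatively with every parameter added to the grid, which is why a full search over the XGBoost parameters listed earlier becomes so slow.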
