Tmall Repeat Buyer Prediction Competition Notes

Tianchi: Tmall "Buy It Again" Repeat Buyer Prediction Challenge

Problem Description

Merchants sometimes run large promotions (e.g. discounts or cash coupons) on particular dates (e.g. Boxing Day, Black Friday, or Double 11) to attract large numbers of new buyers. Many of the buyers attracted this way are one-time deal hunters, so such promotions may have little long-term effect on sales. To mitigate this, merchants need to identify which buyers can be converted into repeat buyers. By targeting these potentially loyal customers, merchants can greatly reduce promotion costs and improve their return on investment (ROI).

It is well known that customer targeting in online advertising is extremely challenging, especially for brand-new buyers. With the long-term user behavior logs accumulated by Tmall.com, however, this problem may be tractable. In this challenge, we are given a set of merchants and the new buyers they acquired during the Double 11 promotion. The task is to predict, for a given merchant, whether its new buyers will become loyal customers; in other words, to predict the probability that these new buyers will purchase from the same merchant again within six months. A data set of roughly 200,000 users is provided for training, and a similarly sized one for testing. As in other competitions, you may extract any features and train with any external tools; you only need to submit prediction results for evaluation.

1. Data Processing

Data Description

The data set contains anonymized users' shopping logs for the six months before and after Double 11, together with labels indicating whether each user is a repeat buyer. For privacy reasons the data is sampled with an offset, so statistics computed on it deviate somewhat from Tmall.com's actual numbers; this does not affect the applicability of the solution. The training and test files can be found in "data_format2.zip", and the details of the data format are given in the table below.
The organizers provide the data in two formats. Format 1 consists of four files: the user behavior log, the user profile table, the training set, and the test set. The behavior log has seven fields: user ID, item ID, item category, merchant ID, item brand, timestamp, and action type. The user profile table contains user ID, age range, and gender. The training and test sets each contain user ID, merchant ID, and a repeat-buyer label; in the training set the label is 0/1, while in the test set it is empty and must be predicted. See the official problem and data page for details.

Problem Analysis

Much like an ad click-through-rate problem, the goal here is to use the user-merchant interaction records from the six months before Double 11, plus the interactions during Double 11 itself, to predict the probability that a merchant's new buyers will purchase from the same merchant again within the following six months.

Data Formatting

Compress the log file for storage, following the approach referenced in the docstring below.

import os
import pickle
import numpy as np
import pandas as pd

# Downcast each column to a smaller dtype and re-store the data in compressed form
def compressData(inputData):
    '''
    :parameters: inputData: pd.DataFrame
    :return: inputData: pd.DataFrame
    :Purpose: 
    Shrink a CSV-loaded frame by scanning each column's dtype and downcasting it.
    For int64: if the minimum is negative, downcast as signed, otherwise as
    unsigned, moving to the smallest int size that fits; floats are downcast likewise.
    Object columns are converted to category, which uses far less memory.
    Reference: https://www.jiqizhixin.com/articles/2018-03-07-3
    '''
    for eachType in set(inputData.dtypes.values):
        ## Check which dtype family the columns belong to
        if 'int' in str(eachType):
            ## Downcast every integer column
            for i in inputData.select_dtypes(eachType).columns.values:
                if inputData[i].min() < 0:
                    inputData[i] = pd.to_numeric(inputData[i],downcast='signed')
                else:
                    inputData[i] = pd.to_numeric(inputData[i],downcast='unsigned')      
        elif 'float' in str(eachType):
            for i in inputData.select_dtypes(eachType).columns.values:   
                inputData[i] = pd.to_numeric(inputData[i],downcast='float')
        elif 'object' in str(eachType):
            for i in inputData.select_dtypes(eachType).columns.values: 
                inputData[i] = inputData[i].astype('category')
    return inputData
 
userLog = pd.read_csv('d:/JulyCompetition/input/user_log_format1.csv')  # the behavior log, named accordingly
print('Before compression:')
userLog.info()
userLog = compressData(userLog)
print('After compression:')
userLog.info()
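info() only prints approximate memory usage; to quantify the savings exactly, memory_usage(deep=True) can be compared before and after compression. A minimal sketch:

# Sketch: report exact memory usage before and after compression.
raw = pd.read_csv('d:/JulyCompetition/input/user_log_format1.csv')
before_mb = raw.memory_usage(deep=True).sum() / 2**20
after_mb = compressData(raw).memory_usage(deep=True).sum() / 2**20
print('Memory usage: %.1f MB -> %.1f MB' % (before_mb, after_mb))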

Data Cleaning

Checking for missing values shows that brand_id has 91,015 missing entries.

userLog.isnull().sum()

Fill them with the modal brand of the corresponding merchant.

# Fill missing brand_id values with the modal brand_id of the corresponding seller_id
def get_Logs():
    '''
    :parameters: None: None
    :return: userLog: pd.DataFrame
    :Purpose: 
    Give other functions convenient access to the raw behavior log,
    with missing values already handled.
    Serialized with the pickle module for fast reads and writes.
    '''
    filePath = 'd:/JulyCompetition/features/Logs.pkl'
    if os.path.exists(filePath):
        userLog = pickle.load(open(filePath,'rb'))
    else:
        # column_types: the dtype mapping for the compressed log (assumed to be defined earlier)
        userLog = pd.read_csv('d:/JulyCompetition/input/user_log_format1.csv',dtype=column_types)
        print('Is null? \n',userLog.isnull().sum())
 
        ## Handle the missing brand_id values
        missingIndex = userLog[userLog.brand_id.isnull()].index
        ## Idea: compute every seller's modal brand_id, then fill each missing
        ## brand_id from the mode of its corresponding seller
        sellerMode = userLog.groupby(['seller_id']).apply(lambda x:x.brand_id.mode()[0]).reset_index()
        pickUP = userLog.loc[missingIndex]
        pickUP = pd.merge(pickUP,sellerMode,how='left',on=['seller_id'])[0].astype('float32')
        pickUP.index = missingIndex
        userLog.loc[missingIndex,'brand_id'] = pickUP
        del pickUP,sellerMode,missingIndex
        print('--------------------')
        print('Is null? \n',userLog.isnull().sum())
        pickle.dump(userLog,open(filePath,'wb'))
    return userLog
userLog = get_Logs()

2. Feature Engineering

2.1 User Features

User profile information

# User profile: age and gender (categorical features)
userInfo = pd.read_csv('d:/JulyCompetition/input/user_info_format1.csv')
userInfo.age_range.fillna(userInfo.age_range.median(),inplace=True)# fill age with the median
userInfo.gender.fillna(userInfo.gender.mode()[0],inplace=True)# fill gender with the mode
print('Check any missing value?\n',userInfo.isnull().any())# check for remaining missing values
df_age = pd.get_dummies(userInfo.age_range,prefix='age')# one-hot encode age
df_sex = pd.get_dummies(userInfo.gender)# one-hot encode gender and rename the columns
df_sex.rename(columns={0:'female',1:'male',2:'unknown'},inplace=True)
userInfo = pd.concat([userInfo.user_id, df_age, df_sex], axis=1)# assemble the user table
del df_age,df_sex
print(userInfo.info())

User behavior information

# Extract the raw behavior records...
totalActions = userLog[["user_id","action_type"]]
totalActions.head()

User interaction counts

  • Total number of interactions by the user (is the user active?)
  • Total number of item clicks by the user
  • Total number of add-to-cart actions by the user
  • Total number of purchases by the user
  • Total number of favorites by the user
# One-hot encode the action type: 0 = click, 1 = add-to-cart, 2 = purchase, 3 = favorite.
df = pd.get_dummies(totalActions['action_type'],prefix='userTotalAction')

# Total clicks, add-to-carts, purchases, and favorites per user in the log
totalActions = pd.concat([totalActions.user_id, df], axis=1).groupby(['user_id'], as_index=False).sum()
totalActions['userTotalAction'] = totalActions['userTotalAction_0']+totalActions['userTotalAction_1']+totalActions['userTotalAction_2']+totalActions['userTotalAction_3']
del df
totalActions.info()

Where the user's interaction count stands among all users

  • The user's interactions as a share of all users' interactions
  • Difference between the user's interaction count and the per-user average (absolute and relative)
  • Quantile of the user's interaction count among all users (not computed in the snippet; see the sketch below)
  • Rank of the user's interaction count among all users (not computed in the snippet; see the sketch below)
print('Total interactions across all users: '+str(userLog.shape[0]))
print('Total number of users: '+str(userLog['user_id'].nunique()))
print('Average interactions per user: '+str(userLog.shape[0]/userLog['user_id'].nunique()))
totalActions['userTotalActionRatio'] = totalActions['userTotalAction']/userLog.shape[0]
totalActions['userTotalActionDiff'] = totalActions['userTotalAction']-userLog.shape[0]/userLog['user_id'].nunique()
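The quantile and rank features from the bullet list above are not computed in this snippet; a minimal sketch using pandas rank() (the new column names are illustrative):

# Sketch: percentile and rank of each user's total interaction count.
totalActions['userTotalActionPct'] = totalActions['userTotalAction'].rank(pct=True)
totalActions['userTotalActionRank'] = totalActions['userTotalAction'].rank(ascending=False, method='min')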

Where the user's click count stands among all users

  • The user's clicks as a share of all users' clicks
  • Difference between the user's click count and the per-user average (absolute and relative)
  • Quantile of the user's click count among all users
  • Rank of the user's click count among all users
print('Total clicks across all users: '+str(userLog[userLog.action_type==0].shape[0]))
totalActions['userClickRatio'] = totalActions['userTotalAction_0']/userLog[userLog.action_type==0].shape[0]
print('Average clicks per user: '+str(userLog[userLog.action_type==0].shape[0]/userLog['user_id'].nunique()))
totalActions['userClickDiff'] = totalActions['userTotalAction_0']-userLog[userLog.action_type==0].shape[0]/userLog['user_id'].nunique()

Where the user's add-to-cart count stands among all users

  • The user's add-to-carts as a share of all users' add-to-carts
  • Difference between the user's add-to-cart count and the per-user average (absolute and relative)
  • Quantile of the user's add-to-cart count among all users
  • Rank of the user's add-to-cart count among all users
print('Total add-to-carts across all users: '+str(userLog[userLog.action_type==1].shape[0]))
totalActions['userAddRatio'] = totalActions['userTotalAction_1']/userLog[userLog.action_type==1].shape[0]
print('Average add-to-carts per user: '+str(userLog[userLog.action_type==1].shape[0]/userLog['user_id'].nunique()))
totalActions['userAddDiff'] = totalActions['userTotalAction_1']-userLog[userLog.action_type==1].shape[0]/userLog['user_id'].nunique()

Where the user's purchase count stands among all users

  • The user's purchases as a share of all users' purchases
  • Difference between the user's purchase count and the per-user average (absolute and relative)
  • Quantile of the user's purchase count among all users
  • Rank of the user's purchase count among all users
print('Total purchases across all users: '+str(userLog[userLog.action_type==2].shape[0]))
totalActions['userBuyRatio'] = totalActions['userTotalAction_2']/userLog[userLog.action_type==2].shape[0]
print('Average purchases per user: '+str(userLog[userLog.action_type==2].shape[0]/userLog['user_id'].nunique()))
totalActions['userBuyDiff'] = totalActions['userTotalAction_2']-userLog[userLog.action_type==2].shape[0]/userLog['user_id'].nunique()

Where the user's favorite count stands among all users

  • The user's favorites as a share of all users' favorites
  • Difference between the user's favorite count and the per-user average (absolute and relative)
  • Quantile of the user's favorite count among all users
  • Rank of the user's favorite count among all users
print('Total favorites across all users: '+str(userLog[userLog.action_type==3].shape[0]))
totalActions['userSaveRatio'] = totalActions['userTotalAction_3']/userLog[userLog.action_type==3].shape[0]
print('Average favorites per user: '+str(userLog[userLog.action_type==3].shape[0]/userLog['user_id'].nunique()))
totalActions['userSaveDiff'] = totalActions['userTotalAction_3']-userLog[userLog.action_type==3].shape[0]/userLog['user_id'].nunique()
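The four per-action blocks above repeat one ratio/diff pattern; a loop-based sketch that produces the same columns:

# Sketch: the per-action ratio and diff features in one loop.
n_users = userLog['user_id'].nunique()
for code, name in {0: 'Click', 1: 'Add', 2: 'Buy', 3: 'Save'}.items():
    total = userLog[userLog.action_type == code].shape[0]
    col = 'userTotalAction_%d' % code
    totalActions['user%sRatio' % name] = totalActions[col] / total
    totalActions['user%sDiff' % name] = totalActions[col] - total / n_users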

The user's behavior mix (within-user shares)

  • Clicks as a share of the user's total interactions
  • Add-to-carts as a share of the user's total interactions
  • Purchases as a share of the user's total interactions
  • Favorites as a share of the user's total interactions
# Shares of the user's total interactions made up by clicks, add-to-carts, purchases, and favorites
totalActions['userClick_ratio'] = totalActions['userTotalAction_0']/totalActions['userTotalAction']
totalActions['userAdd_ratio'] = totalActions['userTotalAction_1']/totalActions['userTotalAction']
totalActions['userBuy_ratio'] = totalActions['userTotalAction_2']/totalActions['userTotalAction']
totalActions['userSave_ratio'] = totalActions['userTotalAction_3']/totalActions['userTotalAction']

Purchase conversion rates from the user's clicks, add-to-carts, and favorites

  • The user's click-to-purchase conversion rate
  • Its difference from the average click conversion rate over all users
  • The user's cart-to-purchase conversion rate
  • Its difference from the average cart conversion rate over all users
  • The user's favorite-to-purchase conversion rate
  • Its difference from the average favorite conversion rate over all users
# Purchase conversion rates from the user's clicks, add-to-carts, and favorites.
# Note: log1p(a) - log1p(b) = log((1+a)/(1+b)), a smoothed log conversion rate
# that stays finite even when a count is zero.
totalActions['userTotalAction_0_ratio'] = np.log1p(totalActions['userTotalAction_2']) - np.log1p(totalActions['userTotalAction_0'])
totalActions['userTotalAction_0_ratio_diff'] = totalActions['userTotalAction_0_ratio'] - totalActions['userTotalAction_0_ratio'].mean()
totalActions['userTotalAction_1_ratio'] = np.log1p(totalActions['userTotalAction_2']) - np.log1p(totalActions['userTotalAction_1'])
totalActions['userTotalAction_1_ratio_diff'] = totalActions['userTotalAction_1_ratio'] - totalActions['userTotalAction_1_ratio'].mean()
totalActions['userTotalAction_3_ratio'] = np.log1p(totalActions['userTotalAction_2']) - np.log1p(totalActions['userTotalAction_3'])
totalActions['userTotalAction_3_ratio_diff'] = totalActions['userTotalAction_3_ratio'] - totalActions['userTotalAction_3_ratio'].mean()
totalActions.info()

Temporal information about the user's interactions (by day)

  • Total number of days the user interacted, compared with the all-user average
  • Interaction days per month, compared with the all-user monthly average (see the sketch below)
  • Month-over-month change in interaction days, compared with the all-user average change (see the sketch below)
days_cnt = userLog.groupby(['user_id'])['time_stamp'].nunique()
days_cnt_diff = days_cnt - userLog.groupby(['user_id'])['time_stamp'].nunique().mean()
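Only the total day count is computed here; the monthly day counts and their month-over-month change from the bullets above could be derived as follows. A sketch, assuming time_stamp is an integer mmdd value (e.g. 511 for May 11, 1111 for November 11):

# Sketch: per-month active days and their month-over-month change.
tmp = userLog[['user_id', 'time_stamp']].copy()
tmp['month'] = tmp['time_stamp'] // 100            # mmdd -> month number
monthly_days = (tmp.groupby(['user_id', 'month'])['time_stamp']
                   .nunique()
                   .unstack(fill_value=0))         # one column per month
monthly_days_delta = monthly_days.diff(axis=1)     # change between consecutive months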

Temporal information about the user's interactions (by count)

  • Total number of days the user interacted, compared with the all-user average
  • Interaction days per month, compared with the all-user monthly average
  • Month-over-month change in interaction days, compared with the all-user average change
  • Minimum and maximum gap in days between the user's consecutive interactions
  • The average minimum and maximum gaps over all users
  • Whether the user was already a repeat buyer before Double 11 (bought at least two items from one merchant)
  • Number of items the user repeat-purchased before Double 11
# Manually standardize the numeric features
numeric_cols = totalActions.columns[totalActions.dtypes == 'float64']
numeric_cols
numeric_col_means = totalActions.loc[:, numeric_cols].mean()
numeric_col_std = totalActions.loc[:, numeric_cols].std()
totalActions.loc[:, numeric_cols] = (totalActions.loc[:, numeric_cols] - numeric_col_means) / numeric_col_std
totalActions.head(5)
# Join the counts and conversion rates onto the user table
userInfo = pd.merge(userInfo,totalActions,how='left',on=['user_id'])
del totalActions
userInfo.info()
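The manual z-score block above is repeated almost verbatim for every feature table below; a small helper, as a sketch, captures the pattern:

# Sketch: reusable z-score standardization for a table's numeric columns.
def standardize(df, dtype='float64'):
    cols = df.columns[df.dtypes == dtype]
    df.loc[:, cols] = (df.loc[:, cols] - df[cols].mean()) / df[cols].std()
    return df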
  • Number of distinct items the user acted on in the six months (breadth of interaction)
  • Number of distinct categories the user acted on in the six months
  • Number of distinct shops the user acted on in the six months
  • Number of distinct brands the user acted on in the six months
  • Number of days the user was active in the six months (persistence of interaction)
# Number of distinct items the user acted on in the six months
item_cnt = userLog.groupby(['user_id'])['item_id'].nunique()
# Number of distinct categories the user acted on
cate_cnt = userLog.groupby(['user_id'])['cat_id'].nunique()
# Number of distinct shops the user acted on
seller_cnt = userLog.groupby(['user_id'])['seller_id'].nunique()
# Number of distinct brands the user acted on
brand_cnt = userLog.groupby(['user_id'])['brand_id'].nunique()
# Number of days the user was active
days_cnt = userLog.groupby(['user_id'])['time_stamp'].nunique()

typeCount_result = pd.concat([item_cnt,cate_cnt],axis=1)
typeCount_result = pd.concat([typeCount_result,seller_cnt],axis=1)
typeCount_result = pd.concat([typeCount_result,brand_cnt],axis=1)
typeCount_result = pd.concat([typeCount_result,days_cnt],axis=1)
typeCount_result.rename(columns={'item_id':'item_cnt','cat_id':'cat_cnt','seller_id':'seller_cnt','brand_id':'brand_counts','time_stamp':'active_days'},inplace=True)
typeCount_result.reset_index(inplace=True)
typeCount_result.info()
# Manually standardize the numeric features
numeric_cols = typeCount_result.columns[typeCount_result.dtypes == 'int64']
print(numeric_cols)
numeric_col_means = typeCount_result.loc[:, numeric_cols].mean()
numeric_col_std = typeCount_result.loc[:, numeric_cols].std()
typeCount_result.loc[:, numeric_cols] = (typeCount_result.loc[:, numeric_cols] - numeric_col_means) / numeric_col_std
typeCount_result.head(5)
## Join the counts onto the user table
userInfo = pd.merge(userInfo,typeCount_result,how='left',on=['user_id'])
del typeCount_result
userInfo.info()
  • Whether the user had repeat-purchase records before Double 11 (does the user lean toward shops already bought from?)
  • Number of merchants the user repeat-purchased from before Double 11 (the user's attachment to shops already bought from)
## Count the merchants each user repeat-purchased from before Double 11
### --------------------------------------------------------------------------
repeatSellerCount = userLog[["user_id","seller_id","time_stamp","action_type"]]
repeatSellerCount = repeatSellerCount[(repeatSellerCount.action_type == 2) & (repeatSellerCount.time_stamp < 1111)]
repeatSellerCount.drop_duplicates(inplace=True)
repeatSellerCount = repeatSellerCount.groupby(['user_id','seller_id'])['time_stamp'].count().reset_index()
repeatSellerCount = repeatSellerCount[repeatSellerCount.time_stamp > 1]
repeatSellerCount = repeatSellerCount.groupby(['user_id'])['seller_id'].count().reset_index()
repeatSellerCount.rename(columns={'seller_id':'repeat_seller_count'},inplace=True)
# Manually standardize the numeric features
numeric_cols = repeatSellerCount.columns[repeatSellerCount.dtypes == 'int64']
print(numeric_cols)
numeric_col_means = repeatSellerCount.loc[:, numeric_cols].mean()
numeric_col_std = repeatSellerCount.loc[:, numeric_cols].std()
repeatSellerCount.loc[:, numeric_cols] = (repeatSellerCount.loc[:, numeric_cols] - numeric_col_means) / numeric_col_std
repeatSellerCount.head(5)
userInfo = pd.merge(userInfo,repeatSellerCount,how='left',on=['user_id'])
# Fill users with no repeat purchases with 0 (?). Note: repeat_seller_count was
# standardized above, so 0 is not literally "zero purchases" any more; computing
# the binary flag before standardizing would be cleaner.
userInfo.repeat_seller_count.fillna(0,inplace=True)
userInfo['repeat_seller'] = userInfo['repeat_seller_count'].map(lambda x: 1 if x != 0 else 0)
del repeatSellerCount

Changes in the user's activity level

  • The user's clicks per month
  • The user's add-to-carts per month
  • The user's purchases per month
  • The user's favorites per month
  • The user's total interactions per month
# Total interaction counts and days per user
# Gaps between the user's interactions
# Count each month's clicks, add-to-carts, purchases, and favorites
### --------------------------------------------------------------------------
monthActionsCount = userLog[["user_id","time_stamp","action_type"]]
result = list()
for i in range(5,12):
    start = int(str(i)+'00')
    end = int(str(i)+'30')
    # Slice out month i's records (copy to avoid a SettingWithCopyWarning below)
    example = monthActionsCount[(monthActionsCount.time_stamp >= start) & (monthActionsCount.time_stamp < end)].copy()
    # One-hot encode month i's actions
    df = pd.get_dummies(example['action_type'],prefix='%d_Action'%i)
    df[str(i)+'_Action'] = df[str(i)+'_Action_0']+df[str(i)+'_Action_1']+df[str(i)+'_Action_2']+df[str(i)+'_Action_3']
    # Set example's time_stamp to the month number (5, 6, ..., 11)
    example.loc[:,'time_stamp'] = example.time_stamp.apply(lambda x: int(str(x)[0]) if len(str(x)) == 3 else int(str(x)[:2]))
    result.append(pd.concat([example, df], axis=1).groupby(['user_id','time_stamp'],as_index=False).sum())

for i in range(0,7):
    userInfo = pd.merge(userInfo,result[i],how='left',on=['user_id'])
    userInfo.fillna(0,inplace=True)
# Drop the leftover merge columns (time_stamp/action_type picked up _x/_y suffixes)
for col in ['time_stamp_x','action_type_x','time_stamp_y','action_type_y','time_stamp','action_type']:
    del userInfo[col]
for i in range(5,12):
    userInfo[str(i)+'_Action'] = userInfo[str(i)+'_Action_0']+userInfo[str(i)+'_Action_1']+userInfo[str(i)+'_Action_2']+userInfo[str(i)+'_Action_3']
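The month-by-month merge above leaves behind suffixed time_stamp/action_type columns that have to be deleted by hand; a pivot-based sketch that reaches the same monthly counts without that bookkeeping (an alternative to the loop, not meant to run in addition to it):

# Sketch: monthly action counts per user via a single pivot.
tmp = userLog[['user_id', 'time_stamp', 'action_type']].copy()
tmp['month'] = tmp['time_stamp'] // 100
monthly = (pd.get_dummies(tmp['action_type'], prefix='Action')
             .join(tmp[['user_id', 'month']])
             .groupby(['user_id', 'month']).sum()
             .unstack(fill_value=0))
monthly.columns = ['%d_%s' % (month, act) for act, month in monthly.columns]  # e.g. '5_Action_0'
monthly = monthly.reset_index()  # ready to merge onto userInfo by user_id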

Save the user features

filePath='d:/JulyCompetition/features/userInfo_Features.pkl'
pickle.dump(userInfo, open(filePath, 'wb'))

Load the user features

# Load the user features
filePath='d:/JulyCompetition/features/userInfo_Features.pkl'
if os.path.exists(filePath):
    userInfo = pickle.load(open(filePath,'rb'))
userInfo.info()

2.2 Merchant Features

The merchant features aim to capture how popular a merchant is in the current market and how strongly it attracts and retains loyal users.

  • The merchant's item, category, and brand totals and their shares of the overall totals (merchant scale)

  • Times the merchant was clicked, added to cart, purchased from, and favorited (merchant popularity)

  • Click-to-purchase, cart-to-purchase, and favorite-to-purchase conversion rates (merchant conversion)

  • Numbers of users who clicked, added to cart, and purchased (merchant reach)

  • Total interactions and their monthly counts, monthly average, monthly min/max, and monthly change (merchant popularity)

  • Number of users who favorited; total number of repeat buyers

  • Number of the merchant's items

  • Number of the merchant's item categories

  • Number of the merchant's brands

# Count each merchant's distinct items, categories, and brands, stored as frames
# with columns [seller_id, xx_number] for easy joining later (merchant scale)
itemNumber = userLog[['seller_id','item_id']].groupby(['seller_id'])['item_id'].nunique().reset_index()
catNumber = userLog[['seller_id','cat_id']].groupby(['seller_id'])['cat_id'].nunique().reset_index()
brandNumber = userLog[['seller_id','brand_id']].groupby(['seller_id'])['brand_id'].nunique().reset_index()
itemNumber.rename(columns={'item_id':'item_number'},inplace=True)
catNumber.rename(columns={'cat_id':'cat_number'},inplace=True)
brandNumber.rename(columns={'brand_id':'brand_number'},inplace=True)
  • Number of the merchant's repeat buyers before Double 11
 # Total repeat buyers per merchant (the merchant's ability to retain new users)
repeatPeoCount = userLog[(userLog.time_stamp < 1111) & (userLog.action_type == 2)]
repeatPeoCount = repeatPeoCount.groupby(['seller_id'])['user_id'].value_counts().to_frame()
repeatPeoCount.rename(columns={'user_id':'Buy_Number'},inplace=True)
repeatPeoCount.reset_index(inplace=True)
repeatPeoCount = repeatPeoCount[repeatPeoCount.Buy_Number > 1]
repeatPeoCount = repeatPeoCount.groupby(['seller_id']).apply(lambda x:len(x.user_id)).reset_index()
repeatPeoCount = pd.merge(pd.DataFrame({'seller_id':range(1, 4996 ,1)}),repeatPeoCount,how='left',on=['seller_id']).fillna(0)  # seller ids run from 1 to 4995 in this data set
repeatPeoCount.rename(columns={0:'repeatBuy_peopleNumber'},inplace=True)
  • Times the merchant was clicked
  • Times the merchant's items were added to carts
  • Times the merchant's items were purchased
  • Times the merchant's items were favorited
  • The merchant's click conversion rate
  • The merchant's add-to-cart conversion rate
  • The merchant's favorite conversion rate
## Counts of clicks, add-to-carts, purchases, and favorites received
### plus the click-, cart-, and favorite-to-purchase conversion rates
sellers = userLog[["seller_id","action_type"]]
df = pd.get_dummies(sellers['action_type'],prefix='seller')
sellers = pd.concat([sellers, df], axis=1).groupby(['seller_id'], as_index=False).sum()
sellers.drop("action_type", axis=1,inplace=True)
del df
# Build the conversion-rate fields (log1p differences, as in section 2.1)
sellers['seller_0_ratio'] = np.log1p(sellers['seller_2']) - np.log1p(sellers['seller_0'])
sellers['seller_1_ratio'] = np.log1p(sellers['seller_2']) - np.log1p(sellers['seller_1'])
sellers['seller_3_ratio'] = np.log1p(sellers['seller_2']) - np.log1p(sellers['seller_3'])
sellers.info()
  • Number of users who clicked the merchant
  • Number of users who added the merchant's items to carts
  • Number of users who purchased from the merchant
  • Number of users who favorited the merchant's items
### Count the distinct users who clicked, added to cart, purchased from, and favorited each merchant
peoCount = userLog[["user_id","seller_id","action_type"]]
df = pd.get_dummies(peoCount['action_type'],prefix='seller_peopleNumber')
peoCount = pd.concat([peoCount, df], axis=1)
peoCount.drop("action_type", axis=1,inplace=True)
peoCount.drop_duplicates(inplace=True)
df1 = peoCount.groupby(['seller_id']).apply(lambda x:x.seller_peopleNumber_0.sum())
df2 = peoCount.groupby(['seller_id']).apply(lambda x:x.seller_peopleNumber_1.sum())
df3 = peoCount.groupby(['seller_id']).apply(lambda x:x.seller_peopleNumber_2.sum())
df4 = peoCount.groupby(['seller_id']).apply(lambda x:x.seller_peopleNumber_3.sum())
peoCount = pd.concat([df1, df2,df3, df4], axis=1).reset_index()
del df1,df2,df3,df4
peoCount.rename(columns={0:'seller_peopleNum_0',1:'seller_peopleNum_1',2:'seller_peopleNum_2',3:'seller_peopleNum_3'},inplace=True)
peoCount.info()
### Join the count tables on seller_id
sellers = pd.merge(sellers,peoCount,on=['seller_id'])
sellers = pd.merge(sellers,itemNumber,on=['seller_id'])
sellers = pd.merge(sellers,catNumber,on=['seller_id'])
sellers = pd.merge(sellers,brandNumber,on=['seller_id'])
sellers = pd.merge(sellers,repeatPeoCount,on=['seller_id'])
del itemNumber,catNumber,brandNumber,peoCount,repeatPeoCount
sellers.info()
  • The merchant's items as a share of all items
  • The merchant's categories as a share of all categories
  • The merchant's brands as a share of all brands
# Each merchant's item, category, and brand counts as shares of the overall totals (merchant scale)
sellers['item_ratio'] = sellers['item_number']/userLog['item_id'].nunique()
sellers['cat_ratio'] = sellers['cat_number']/userLog['cat_id'].nunique()
sellers['brand_ratio'] = sellers['brand_number']/userLog['brand_id'].nunique()
  • Users who clicked this merchant as a share of all users who clicked
  • Users who added this merchant's items to carts as a share of all users who added to carts
  • Users who purchased from this merchant as a share of all users who purchased
  • Users who favorited this merchant as a share of all users who favorited
# Each merchant's clicking/adding/purchasing/favoriting user counts as shares of all users with that action
sellers['click_people_ratio'] = sellers['seller_peopleNum_0']/userLog[userLog['action_type'] == 0]['user_id'].nunique()
sellers['add_people_ratio'] = sellers['seller_peopleNum_1']/userLog[userLog['action_type'] == 1]['user_id'].nunique()
sellers['buy_people_ratio'] = sellers['seller_peopleNum_2']/userLog[userLog['action_type'] == 2]['user_id'].nunique()
sellers['save_people_ratio'] = sellers['seller_peopleNum_3']/userLog[userLog['action_type'] == 3]['user_id'].nunique()
# Manually standardize the numeric features
numeric_cols = sellers.columns[sellers.dtypes != 'uint64']
print(numeric_cols)
numeric_col_means = sellers.loc[:, numeric_cols].mean()
numeric_col_std = sellers.loc[:, numeric_cols].std()
sellers.loc[:, numeric_cols] = (sellers.loc[:, numeric_cols] - numeric_col_means) / numeric_col_std
sellers.head(5)

Save the merchant features

filePath='d:/JulyCompetition/features/sellerInfo_Features.pkl'
pickle.dump(sellers,open(filePath,'wb'))

Load the merchant features

# Load the merchant features
filePath='d:/JulyCompetition/features/sellerInfo_Features.pkl'
if os.path.exists(filePath):
    sellers = pickle.load(open(filePath,'rb'))

2.3 User-Merchant Features

The user-merchant features aim to characterize the relationship formed between a specific user and a specific merchant.

  • Total clicks, add-to-carts, purchases, and favorites
  • Click, add-to-cart, and favorite conversion rates
  • Total days with clicks, add-to-carts, purchases, and favorites
  • Number of items clicked, and its share of the merchant's item total
  • Number of items added to the cart, and its share of the merchant's item total
  • Number of items purchased, and its share of the merchant's item total
  • Number of items favorited, and its share of the merchant's item total
  • Number of categories clicked, and its share of the merchant's category total
  • Number of categories added to the cart, and its share of the merchant's category total
  • Number of categories purchased, and its share of the merchant's category total
  • Number of categories favorited, and its share of the merchant's category total
  • Number of brands clicked, and its share of the merchant's brand total
  • Number of brands added to the cart, and its share of the merchant's brand total
  • Number of brands purchased, and its share of the merchant's brand total
  • Number of brands favorited, and its share of the merchant's brand total
## Extract the behavior records for the (user, merchant) pairs to be predicted
trainData = pd.read_csv('d:/JulyCompetition/input/train_format1.csv')
trainData.rename(columns={'merchant_id':'seller_id'},inplace=True)
testData = pd.read_csv('d:/JulyCompetition/input/test_format1.csv')
testData.rename(columns={'merchant_id':'seller_id'},inplace=True)
targetIndex = pd.concat([trainData[['user_id', 'seller_id']],testData[['user_id', 'seller_id']]],ignore_index=True)
logs = pd.merge(targetIndex,userLog,on=['user_id', 'seller_id'])
del trainData,testData,targetIndex
logs.info()
  • Click, add-to-cart, purchase, and favorite counts per user-merchant pair
  • Click, add-to-cart, and favorite conversion rates per pair
### Behavior features of the user toward the target shop: total clicks, add-to-carts,
### purchases, and favorites, plus the corresponding conversion rates
df_result = logs[["user_id", "seller_id","action_type"]]
df = pd.get_dummies(df_result['action_type'],prefix='userSellerAction')
df_result = pd.concat([df_result, df], axis=1).groupby(['user_id', 'seller_id'], as_index=False).sum()
del df
df_result.drop("action_type", axis=1,inplace=True)
df_result['userSellerAction_0_ratio'] = np.log1p(df_result['userSellerAction_2']) - np.log1p(df_result['userSellerAction_0'])
df_result['userSellerAction_1_ratio'] = np.log1p(df_result['userSellerAction_2']) - np.log1p(df_result['userSellerAction_1'])
df_result['userSellerAction_3_ratio'] = np.log1p(df_result['userSellerAction_2']) - np.log1p(df_result['userSellerAction_3'])
df_result.info()
  • Total days with clicks, add-to-carts, purchases, and favorites per user-merchant pair
### Total days on which the user clicked the target shop
clickDays = logs[logs.action_type == 0]
clickDays = clickDays[["user_id", "seller_id","time_stamp","action_type"]]
clickDays = clickDays.groupby(['user_id', 'seller_id']).apply(lambda x:x.time_stamp.nunique()).reset_index()
clickDays.rename(columns={0:'click_days'},inplace=True)
df_result = pd.merge(df_result,clickDays,how='left',on=['user_id', 'seller_id'])
df_result.click_days.fillna(0,inplace=True)
del clickDays
### Total days on which the user added the target shop's items to the cart
addDays = logs[logs.action_type == 1]
addDays = addDays[["user_id", "seller_id","time_stamp","action_type"]]
addDays = addDays.groupby(['user_id', 'seller_id']).apply(lambda x:x.time_stamp.nunique()).reset_index()
addDays.rename(columns={0:'add_days'},inplace=True)
df_result = pd.merge(df_result,addDays,how='left',on=['user_id', 'seller_id'])
df_result.add_days.fillna(0,inplace=True)
del addDays
### Total days on which the user purchased from the target shop
buyDays = logs[logs.action_type == 2]
buyDays = buyDays[["user_id", "seller_id","time_stamp","action_type"]]
buyDays = buyDays.groupby(['user_id', 'seller_id']).apply(lambda x:x.time_stamp.nunique()).reset_index()
buyDays.rename(columns={0:'buy_days'},inplace=True)
df_result = pd.merge(df_result,buyDays,how='left',on=['user_id', 'seller_id'])
df_result.buy_days.fillna(0,inplace=True)
del buyDays
### Total days on which the user favorited the target shop's items
saveDays = logs[logs.action_type == 3]
saveDays = saveDays[["user_id", "seller_id","time_stamp","action_type"]]
saveDays = saveDays.groupby(['user_id', 'seller_id']).apply(lambda x:x.time_stamp.nunique()).reset_index()
saveDays.rename(columns={0:'save_days'},inplace=True)
df_result = pd.merge(df_result,saveDays,how='left',on=['user_id', 'seller_id'])
df_result.save_days.fillna(0,inplace=True)
del saveDays
  • Number of items clicked, and its share of the merchant's item total
  • Number of items added to the cart, and its share of the merchant's item total
  • Number of items purchased, and its share of the merchant's item total
  • Number of items favorited, and its share of the merchant's item total
itemCount = logs[["user_id", "seller_id","item_id","action_type"]]
# Number of distinct items clicked
itemCountClick = itemCount[itemCount.action_type == 0]
item_result = itemCountClick.groupby(['user_id', 'seller_id']).apply(lambda x:x.item_id.nunique()).reset_index()
item_result.rename(columns={0:'item_click_count'},inplace=True)
item_result.item_click_count.fillna(0,inplace=True)
df_result = pd.merge(df_result,item_result,how='left',on=['user_id', 'seller_id'])
del itemCountClick,item_result
# Number of distinct items added to the cart
itemCountAdd = itemCount[itemCount.action_type == 1]
item_result = itemCountAdd.groupby(['user_id', 'seller_id']).apply(lambda x:x.item_id.nunique()).reset_index()
item_result.rename(columns={0:'item_add_count'},inplace=True)
item_result.item_add_count.fillna(0,inplace=True)
df_result = pd.merge(df_result,item_result,how='left',on=['user_id', 'seller_id'])
del itemCountAdd,item_result
# Number of distinct items purchased
itemCountBuy = itemCount[itemCount.action_type == 2]
item_result = itemCountBuy.groupby(['user_id', 'seller_id']).apply(lambda x:x.item_id.nunique()).reset_index()
item_result.rename(columns={0:'item_buy_count'},inplace=True)
item_result.item_buy_count.fillna(0,inplace=True)
df_result = pd.merge(df_result,item_result,how='left',on=['user_id', 'seller_id'])
del itemCountBuy,item_result
# Number of distinct items favorited
itemCountSave = itemCount[itemCount.action_type == 3]
item_result = itemCountSave.groupby(['user_id', 'seller_id']).apply(lambda x:x.item_id.nunique()).reset_index()
item_result.rename(columns={0:'item_save_count'},inplace=True)
item_result.item_save_count.fillna(0,inplace=True)
df_result = pd.merge(df_result,item_result,how='left',on=['user_id', 'seller_id'])
del itemCountSave,item_result
  • Number of categories clicked, and its share of the merchant's category total
  • Number of categories added to the cart, and its share of the merchant's category total
  • Number of categories purchased, and its share of the merchant's category total
  • Number of categories favorited, and its share of the merchant's category total
catCount = logs[["user_id", "seller_id","cat_id","action_type"]]
# Number of distinct categories clicked
catCountClick = catCount[catCount.action_type == 0]
cat_result = catCountClick.groupby(['user_id', 'seller_id']).apply(lambda x:x.cat_id.nunique()).reset_index()
cat_result.rename(columns={0:'cat_click_count'},inplace=True)
cat_result.cat_click_count.fillna(0,inplace=True)
df_result = pd.merge(df_result,cat_result,how='left',on=['user_id', 'seller_id'])
del catCountClick,cat_result
# Number of distinct categories added to the cart
catCountAdd = catCount[catCount.action_type == 1]
cat_result = catCountAdd.groupby(['user_id', 'seller_id']).apply(lambda x:x.cat_id.nunique()).reset_index()
cat_result.rename(columns={0:'cat_add_count'},inplace=True)
cat_result.cat_add_count.fillna(0,inplace=True)
df_result = pd.merge(df_result,cat_result,how='left',on=['user_id', 'seller_id'])
del catCountAdd,cat_result
# Number of distinct categories purchased
catCountBuy = catCount[catCount.action_type == 2]
cat_result = catCountBuy.groupby(['user_id', 'seller_id']).apply(lambda x:x.cat_id.nunique()).reset_index()
cat_result.rename(columns={0:'cat_buy_count'},inplace=True)
cat_result.cat_buy_count.fillna(0,inplace=True)
df_result = pd.merge(df_result,cat_result,how='left',on=['user_id', 'seller_id'])
del catCountBuy,cat_result
# Number of distinct categories favorited
catCountSave = catCount[catCount.action_type == 3]
cat_result = catCountSave.groupby(['user_id', 'seller_id']).apply(lambda x:x.cat_id.nunique()).reset_index()
cat_result.rename(columns={0:'cat_save_count'},inplace=True)
cat_result.cat_save_count.fillna(0,inplace=True)
df_result = pd.merge(df_result,cat_result,how='left',on=['user_id', 'seller_id'])
del catCountSave,cat_result
  • Number of brands clicked, and its share of the merchant's brand total
  • Number of brands added to the cart, and its share of the merchant's brand total
  • Number of brands purchased, and its share of the merchant's brand total
  • Number of brands favorited, and its share of the merchant's brand total
brandCount = logs[["user_id", "seller_id","brand_id","action_type"]]
# Number of distinct brands clicked
brandCountClick = brandCount[brandCount.action_type == 0]
brand_result = brandCountClick.groupby(['user_id', 'seller_id']).apply(lambda x:x.brand_id.nunique()).reset_index()
brand_result.rename(columns={0:'brand_click_count'},inplace=True)
brand_result.brand_click_count.fillna(0,inplace=True)
df_result = pd.merge(df_result,brand_result,how='left',on=['user_id', 'seller_id'])
del brandCountClick,brand_result
# Number of distinct brands added to the cart
brandCountAdd = brandCount[brandCount.action_type == 1]
brand_result = brandCountAdd.groupby(['user_id', 'seller_id']).apply(lambda x:x.brand_id.nunique()).reset_index()
brand_result.rename(columns={0:'brand_add_count'},inplace=True)
brand_result.brand_add_count.fillna(0,inplace=True)
df_result = pd.merge(df_result,brand_result,how='left',on=['user_id', 'seller_id'])
del brandCountAdd,brand_result
# Number of distinct brands purchased
brandCountBuy = brandCount[brandCount.action_type == 2]
brand_result = brandCountBuy.groupby(['user_id', 'seller_id']).apply(lambda x:x.brand_id.nunique()).reset_index()
brand_result.rename(columns={0:'brand_buy_count'},inplace=True)
brand_result.brand_buy_count.fillna(0,inplace=True)
df_result = pd.merge(df_result,brand_result,how='left',on=['user_id', 'seller_id'])
del brandCountBuy,brand_result
# Number of distinct brands favorited
brandCountSave = brandCount[brandCount.action_type == 3]
brand_result = brandCountSave.groupby(['user_id', 'seller_id']).apply(lambda x:x.brand_id.nunique()).reset_index()
brand_result.rename(columns={0:'brand_save_count'},inplace=True)
brand_result.brand_save_count.fillna(0,inplace=True)
df_result = pd.merge(df_result,brand_result,how='left',on=['user_id', 'seller_id'])
del brandCountSave,brand_result
df_result.fillna(0,inplace=True)
# Cast the purchase day/count columns to float64 so they join the standardization below
for col in ['buy_days','item_buy_count','cat_buy_count','brand_buy_count']:
    df_result[col] = df_result[col].astype('float64')
# Manually standardize the numeric features
numeric_cols = df_result.columns[df_result.dtypes == 'float64']
print(numeric_cols)
numeric_col_means = df_result.loc[:, numeric_cols].mean()
numeric_col_std = df_result.loc[:, numeric_cols].std()
df_result.loc[:, numeric_cols] = (df_result.loc[:, numeric_cols] - numeric_col_means) / numeric_col_std
df_result.head(5)
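The twelve nunique blocks above (four action types crossed with items, categories, and brands) all follow one pattern; a loop-based sketch producing the same columns (an alternative to the blocks, not a supplement):

# Sketch: per-action distinct item/category/brand counts in one loop.
names = {0: 'click', 1: 'add', 2: 'buy', 3: 'save'}
for field, prefix in [('item_id', 'item'), ('cat_id', 'cat'), ('brand_id', 'brand')]:
    for code, name in names.items():
        cnt = (logs[logs.action_type == code]
                 .groupby(['user_id', 'seller_id'])[field]
                 .nunique()
                 .reset_index(name='%s_%s_count' % (prefix, name)))
        df_result = pd.merge(df_result, cnt, how='left', on=['user_id', 'seller_id'])
df_result.fillna(0, inplace=True)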

Save the user-merchant features

filePath='d:/JulyCompetition/features/userSellerActions.pkl'
pickle.dump(df_result,open(filePath,'wb'))
# Load the user-merchant features
filePath='d:/JulyCompetition/features/userSellerActions.pkl'
if os.path.exists(filePath):
    df_result = pickle.load(open(filePath,'rb'))

3. Modeling

3.1 Building the Training and Test Sets

# Build the training set
def make_train_set():
    filePath = 'd:/JulyCompetition/features/trainSetWithFeatures.pkl'
    if os.path.exists(filePath):
        trainSet = pickle.load(open(filePath,'rb'))
    else:     
        trainSet = pd.read_csv('d:/JulyCompetition/input/train_format1.csv')
        trainSet.rename(columns={'merchant_id':'seller_id'},inplace=True)
        userInfo = pickle.load(open('d:/JulyCompetition/features/userInfo_Features.pkl','rb'))
        trainSet = pd.merge(trainSet,userInfo,how='left',on=['user_id'])
        sellerInfo = pickle.load(open('d:/JulyCompetition/features/sellerInfo_Features.pkl','rb'))
        trainSet = pd.merge(trainSet,sellerInfo,how='left',on=['seller_id'])
        userSellers = pickle.load(open('d:/JulyCompetition/features/userSellerActions.pkl','rb'))
        trainSet = pd.merge(trainSet,userSellers,how='left',on=['user_id','seller_id'])
        del userInfo,sellerInfo,userSellers
        pickle.dump(trainSet,open(filePath,'wb'))
    return trainSet
trainSet = make_train_set()
trainSet.info()
# Build the test set
def make_test_set():
    filePath = 'd:/JulyCompetition/features/testSetWithFeatures.pkl'
    if os.path.exists(filePath):
        testSet = pickle.load(open(filePath,'rb'))
    else:     
        testSet = pd.read_csv('d:/JulyCompetition/input/test_format1.csv')
        testSet.rename(columns={'merchant_id':'seller_id'},inplace=True)
        userInfo = pickle.load(open('d:/JulyCompetition/features/userInfo_Features.pkl','rb'))
        testSet = pd.merge(testSet,userInfo,how='left',on=['user_id'])
        sellerInfo = pickle.load(open('d:/JulyCompetition/features/sellerInfo_Features.pkl','rb'))
        testSet = pd.merge(testSet,sellerInfo,how='left',on=['seller_id'])
        userSellers = pickle.load(open('d:/JulyCompetition/features/userSellerActions.pkl','rb'))
        testSet = pd.merge(testSet,userSellers,how='left',on=['user_id','seller_id'])
        del userInfo,sellerInfo,userSellers
        pickle.dump(testSet,open(filePath,'wb'))
    return testSet
testSet = make_test_set()
testSet.info()
## Extract the training feature set
from sklearn.model_selection import train_test_split
## and split it into training and hold-out parts
## (80 : 20 here, via test_size=0.2 below)

# dataSet = pickle.load(open('features/trainSetWithFeatures.pkl','rb'))
###  Split the training data into training and hold-out parts
x = trainSet.loc[:,trainSet.columns != 'label']
y = trainSet.loc[:,trainSet.columns == 'label']
X_train, X_test, y_train, y_test = train_test_split(x,y,test_size = 0.2, random_state = 2018)

del X_train['user_id']
del X_train['seller_id']
del X_test['user_id']
del X_test['seller_id']
print(X_train.shape,y_train.shape,X_test.shape,y_test.shape)

3.2 Model Training

LR

from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.linear_model import LogisticRegression
params=[
    {'penalty':['l1'],
    'C':[100,1000],
    'solver':['liblinear']},
    {'penalty':['l2'],
    'C':[100,1000],
    'solver':['lbfgs']}]
clf = LogisticRegression(random_state=2018, max_iter=1000,  verbose=2)
grid = GridSearchCV(clf, params, scoring='roc_auc',cv=10, verbose=2)
grid.fit(X_train, y_train.values.ravel())

print(grid.best_score_)    # best cross-validated score (roc_auc here)
print(grid.best_params_)   # best parameter combination
print(grid.cv_results_)
print(grid.best_estimator_) 
lr=grid.best_estimator_
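GridSearchCV refits the best estimator on the full training split, so the tuned model can be scored directly on the hold-out part from section 3.1. A minimal sketch:

# Sketch: evaluate the tuned LR on the 20% hold-out split.
from sklearn.metrics import roc_auc_score
lr_prob = lr.predict_proba(X_test)[:, 1]
print('LR hold-out AUC: %.4f' % roc_auc_score(y_test, lr_prob))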

xgboost

## Extract the training feature set,
## split it 0.85 : 0.15 into training and test parts,
## then give half of the test part to xgboost as a validation set
## to limit overfitting and preserve generalization
import pandas as pd
import numpy as np
import xgboost as xgb
import pickle
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Build the training set
dataSet = pickle.load(open('d:/JulyCompetition/features/trainSetWithFeatures.pkl','rb'))
###  Split the data into training, validation, and test parts
x = dataSet.loc[:,dataSet.columns != 'label']
y = dataSet.loc[:,dataSet.columns == 'label']
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.15, random_state = 0)
 
x_val = x_test.iloc[:int(x_test.shape[0]/2),:]
y_val = y_test.iloc[:int(y_test.shape[0]/2),:]
 
x_test = x_test.iloc[int(x_test.shape[0]/2):,:] 
y_test = y_test.iloc[int(y_test.shape[0]/2):,:]
 
del x_train['user_id'],x_train['seller_id'],x_val['user_id'],x_val['seller_id']
 
dtrain = xgb.DMatrix(x_train, label=y_train)
dtest = xgb.DMatrix(x_val, label=y_val)

## Quick train-and-eval: xgboost training
param = {'n_estimators': 500,   # sklearn-API name; the actual round count is set in xgb.train below
     'max_depth': 4, 
     'min_child_weight': 3,
     'gamma':0.3,
     'subsample': 0.8,
     'colsample_bytree': 0.8,  
     'eta': 0.125,
     'silent': 1, 
     'objective': 'binary:logistic',
     'eval_metric':'auc',
     'nthread':16
    }
plst = list(param.items())  # xgb.train expects a list of (key, value) pairs
evallist = [(dtrain, 'train'),(dtest,'eval')]
bst = xgb.train(plst, dtrain, 500, evallist, early_stopping_rounds=10)
 
## Rank the feature importances, print them, and save the feature map
def create_feature_map(features):
    outfile = open(r'd:/JulyCompetition/output/featureMap/firstXGB.fmap', 'w')
    i = 0
    for feat in features:
        outfile.write('{0}\t{1}\tq\n'.format(i, feat))
        i = i + 1
    outfile.close()
def feature_importance(bst_xgb):
    importance = bst_xgb.get_fscore(fmap=r'd:/JulyCompetition/output/featureMap/firstXGB.fmap')
    importance = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)  # sort by score
 
    df = pd.DataFrame(importance, columns=['feature', 'fscore'])
    df['fscore'] = df['fscore'] / df['fscore'].sum()
    return df
 
## Create the feature map
create_feature_map(list(x_train.columns[:]))
## Compute the feature importances from the map, then sort and display them
feat_imp = feature_importance(bst)  # keep a separate name so the helper function is not shadowed
feat_imp.sort_values("fscore", inplace=True, ascending=False)
feat_imp.head(20)
 
## Evaluate the model on the held-out test part
users = x_test[['user_id', 'seller_id']].copy()
del x_test['user_id']
del x_test['seller_id']
x_test_DMatrix = xgb.DMatrix(x_test)
y_pred = bst.predict(x_test_DMatrix)
 
## Compute the ROC-AUC score
roc_auc_score(y_test,y_pred)

Multiple Models

import lightgbm as lgb
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import  LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

def get_models(SEED=2018):
    """
    :parameters: SEED: int
    :return: models: Dict
    :Purpose:
    Declare the base models and put them in a dict for easy iteration.
    """
    lgm = lgb.LGBMClassifier(num_leaves=50,learning_rate=0.05,n_estimators=250,class_weight='balanced',random_state=SEED)
    xgbMo = xgb.XGBClassifier(max_depth=4,min_child_weight=2,learning_rate=0.15,n_estimators=150,nthread=4,gamma=0.2,subsample=0.9,colsample_bytree=0.7, random_state=SEED)
    knn = KNeighborsClassifier(n_neighbors=1250,weights='distance',n_jobs=-1)## even on 20% of the data, training took a long time
    lr = LogisticRegression(C=150,class_weight='balanced',solver='liblinear', random_state=SEED)
    nn = MLPClassifier(solver='lbfgs', activation = 'logistic',early_stopping=False,alpha=1e-3,hidden_layer_sizes=(100,5), random_state=SEED)
    gb = GradientBoostingClassifier(learning_rate=0.01,n_estimators=600,min_samples_split=1000,min_samples_leaf=60,max_depth=10,subsample=0.85,max_features='sqrt',random_state=SEED)
    rf = RandomForestClassifier(min_samples_leaf=30,min_samples_split=120,max_depth=16,n_estimators=400,n_jobs=2,max_features='sqrt',class_weight='balanced',random_state=SEED)

    models = {
              'knn': knn, # score is far too low and training is slow
              'xgb':xgbMo,
              'lgm':lgm,
              'mlp-nn': nn,
              'random forest': rf,
              'gbm': gb,
              'logistic': lr
              }

    return models
def train_predict(model_list):
    """
    :parameters: model_list: Dict
    :return: Preds_stacker: pd.DataFrame
    :Purpose:
    Iterate over the supplied dict of base models and train each one;
    lightgbm and xgboost are fitted with eval_metric='auc'.
    Return every model's predicted probabilities on the test split.
    """
    Preds_stacker = np.zeros((y_test.shape[0], len(model_list)))
    Preds_stacker = pd.DataFrame(Preds_stacker)

    print("Fitting models.")
    cols = list()
    for i, (name, m) in enumerate(models.items()):
        print("%s..." % name, end=" ", flush=False)
        if name == 'xgb' or name == 'lgm':
            m.fit(x_train,y_train.values.ravel(),eval_metric='auc')
        else:
            m.fit(x_train, y_train.values.ravel())
        Preds_stacker.iloc[:, i] = m.predict_proba(x_test)[:, 1]
        cols.append(name)
        print("done")

    Preds_stacker.columns = cols
    print("Done.\n")
    return Preds_stacker
def score_models(Preds_stacker, true_preds):
    """
    :parameters: Preds_stacker: pd.DataFrame   true_preds: pd.Series
    :return: None
    :Purpose:
    Report each model's AUC against the ground-truth labels.
    """
    print("Scoring models.")
    for m in Preds_stacker.columns:
        score = roc_auc_score(true_preds, Preds_stacker.loc[:, m])
        print("%-26s: %.3f" % (m, score))
    print("Done.\n")
    
models = get_models()
Preds = train_predict(models)
score_models(Preds, y_test)

3.3 Model Ensembling

def train_base_learners(base_learners, xTrain, yTrain, verbose=True):
    """
    :parameters: base_learners: Dict, xTrain: pd.DataFrame, yTrain: pd.DataFrame
    :return: None
    :Purpose:
    Train every base model in the supplied dict on the given training data.
    """
    if verbose: print("Fitting models.")
    for i, (name, m) in enumerate(base_learners.items()):
        if verbose: print("%s..." % name, end=" ", flush=False)
        if name == 'xgb' or name == 'lgm':
            m.fit(xTrain,yTrain.values.ravel(),eval_metric='auc')
        else:
            m.fit(xTrain, yTrain.values.ravel())
        if verbose: print("done")
 
def predict_base_learners(pred_base_learners, inp, verbose=True):
    """
    :parameters: pred_base_learners: Dict, inp: pd.DataFrame
    :return: P: np.ndarray
    :Purpose:
    Output every base model's predicted probabilities.
    """
    P = np.zeros((inp.shape[0], len(pred_base_learners)))
    if verbose: print("Generating base learner predictions.")
    for i, (name, m) in enumerate(pred_base_learners.items()):
        if verbose: print("%s..." % name, end=" ", flush=False)
        p = m.predict_proba(inp)
        # With two classes, need only predictions for one class
        P[:, i] = p[:, 1]
        if verbose: print("done")
    return P
  
def ensemble_predict(base_learners, meta_learner, inp, verbose=True):
    """
    :parameters: base_learners: Dict, meta_learner, inp
    :return: P_pred, P
    :Purpose:
    Given the trained base models and the trained meta model,
    output the ensemble's predictions.
    """
    P_pred = predict_base_learners(base_learners, inp, verbose=verbose)
    return P_pred, meta_learner.predict_proba(P_pred)[:, 1]
## 1. Define the base models
base_learners = get_models()
## 2. Define the meta model (the second-layer learner)
SEED = 2018  # seed for the splits and models below (not otherwise defined in this cell)
meta_learner = GradientBoostingClassifier(
    n_estimators=5000,
    loss="exponential",
    max_features=3,
    max_depth=4,
    subsample=0.8,
    learning_rate=0.0025, 
    random_state=SEED
)
 
## Split the training data in half: one half trains the base models,
## the other half generates the meta model's training inputs
xtrain_base, xpred_base, ytrain_base, ypred_base = train_test_split(
    x_train, y_train, test_size=0.5, random_state=SEED)
## 3. Train the base models
train_base_learners(base_learners, xtrain_base, ytrain_base)
## 4. Produce each trained base model's predictions on the held-out half
P_base = predict_base_learners(base_learners, xpred_base)
## 5. Train the meta model on those base-model predictions
meta_learner.fit(P_base, ypred_base.values.ravel())
## 6. Predict with the full ensemble
P_pred, p = ensemble_predict(base_learners, meta_learner, x_test)
print("\nEnsemble ROC-AUC score: %.3f" % roc_auc_score(y_test, p))
