This was my first Tianchi competition, and the task itself is straightforward: https://tianchi.aliyun.com/competition/entrance/231721/tab/158.
The training set consists of three files: a user behavior file, a user profile table, and an item table, described below.
user_behavior.csv
The user behavior file, with 4 comma-separated columns:
Column | Description |
---|---|
User ID | Positive integer identifying a user |
Item ID | Positive integer identifying an item |
Behavior type | Enum string, one of ('pv', 'buy', 'cart', 'fav') |
Timestamp | Integer in [0, 1382400), the offset in seconds between the behavior and 0:00:00 of a certain Friday |
user.csv
The user profile file, with 4 comma-separated columns:
Column | Description |
---|---|
User ID | Positive integer identifying a user |
Gender | Integer gender code: 0 = male, 1 = female, 2 = unknown |
Age | Positive integer, the user's age |
Purchasing power | Integer in [1, 9], the user's purchasing-power tier |
item.csv
The item table, with 4 comma-separated columns:
Column | Description |
---|---|
Item ID | Positive integer identifying an item |
Category ID | Positive integer, the category the item belongs to |
Shop ID | Positive integer, the shop the item belongs to |
Brand ID | Integer brand identifier; -1 means unknown |
Like the training set, the test set contains the same three files. For every user in the test set, contestants must predict the top 50 items the user may be interested in in the future. Concretely, under the definitions above, a user's true "future interest" means any of the four behaviors ('pv', 'buy', 'cart', 'fav') occurring within one day after timestamp 1382400. The candidate item pool for these predictions is the union of the item tables (the item.csv files) of the training and test sets.
Dataset download: https://pan.baidu.com/s/16rZgHtoG8aoJK3T9OMT3Zg (password: j77w)
My approach: if a user has bought an item and that item's repurchase rate among its buyers is high, treat it as an item of interest; if the repurchase rate is zero, don't recommend it. For items the user has not bought, use the weighted sum of the user's other actions on the item as the recommendation score. Finally, fill the remaining slots with items ranked, in descending order of weighted action sum, within the user's demographic group (age, gender, purchasing power).
The idea is simple, but the data files are large. I started in Python with pandas.DataFrame and it was far too slow (it basically ground to a halt). I had hit the same problem in an earlier Tianchi offline contest, and the Spark cluster I set up back then finally paid off. The workflow: put the CSV files into Hadoop, read them in pyspark and save them as Parquet; from then on everything is a pyspark.sql.dataframe.DataFrame and the approach above turns into DataFrame operations. The code:
import pyspark.sql.functions as F
import numpy as np
import pandas as pd
## csv -> parquet
user_behaviors=sqlContext.read.format('com.databricks.spark.csv').options(header='false',inferschema='true').load('hdfs://master:8020/item_recommend1_testB/user_behavior.csv')
user_behaviors=user_behaviors.withColumnRenamed('_c0','user_id').withColumnRenamed('_c1','item_id').withColumnRenamed('_c2','behavior_type').withColumnRenamed('_c3','time')
user_behaviors.write.format('parquet').mode('overwrite').save('/item_recommend1_testB/user_behaviors.parquet')
users=sqlContext.read.format('com.databricks.spark.csv').options(header='false',inferschema='true').load('hdfs://master:8020/item_recommend1_testB/user.csv')
users=users.withColumnRenamed('_c0','user_id').withColumnRenamed('_c1','gender').withColumnRenamed('_c2','age').withColumnRenamed('_c3','buy_cap')
users.write.format('parquet').mode('overwrite').save('/item_recommend1_testB/users.parquet')
items=sqlContext.read.format('com.databricks.spark.csv').options(header='false',inferschema='true').load('hdfs://master:8020/item_recommend1_testB/item.csv')
items=items.withColumnRenamed('_c0','item_id').withColumnRenamed('_c1','category_id').withColumnRenamed('_c2','shop_id').withColumnRenamed('_c3','brand_id')
items.write.format('parquet').mode('overwrite').save('/item_recommend1_testB/items.parquet')
## raw data read back from Parquet: test-B set and training set
users_test=spark.read.parquet('/item_recommend1_testB/users.parquet')
items_test=spark.read.parquet('/item_recommend1_testB/items.parquet')
user_behaviors_test=spark.read.parquet('/item_recommend1_testB/user_behaviors.parquet')
users=spark.read.parquet('/item_recommend1/users.parquet')
items=spark.read.parquet('/item_recommend1/items.parquet')
user_behaviors=spark.read.parquet('/item_recommend1/user_behaviors.parquet')
## combined data: union of the training and test tables
items_total=items.union(items_test).distinct()
users_total=users.union(users_test).distinct()
user_behaviors_total=user_behaviors.union(user_behaviors_test).distinct()
items_total.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/items.parquet')
users_total.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/users.parquet')
user_behaviors_total.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/user_behaviors.parquet')
## user_behaviors_allaction: weight each behavior type (pv=1, fav=2, cart=3, buy=4)
user_behaviors=spark.read.parquet('/item_recommend1_totalB/user_behaviors.parquet')
user_behaviors_allaction=user_behaviors.withColumn('behavior_value',F.when(user_behaviors['behavior_type']=='pv',1).when(user_behaviors['behavior_type']=='fav',2).when(user_behaviors['behavior_type']=='cart',3).when(user_behaviors['behavior_type']=='buy',4))
user_behaviors_allaction.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/user_behaviors_allaction.parquet')
user_behaviors_allaction=spark.read.parquet('/item_recommend1_totalB/user_behaviors_allaction.parquet')
## combined user and item tables
users=spark.read.parquet('/item_recommend1_totalB/users.parquet')
items=spark.read.parquet('/item_recommend1_totalB/items.parquet')
## all days: decay each action weight by how many days before the prediction day it happened
full_user_behaviors=user_behaviors_allaction.join(users,on='user_id').join(items,on='item_id')
full_user_behaviors=full_user_behaviors.select(['*',(full_user_behaviors.behavior_value/F.ceil(16-full_user_behaviors.time/86400)).alias('behavior_value_new')])
full_user_behaviors.write.format("parquet").mode("overwrite").save('/item_recommend1_totalB/full_user_behaviors.parquet')
full_user_behaviors=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors.parquet')
## aggregate per ('user_id', 'item_id')
full_user_behaviors_user_item=full_user_behaviors.groupBy(['user_id','item_id']).agg({'behavior_value_new':'sum','category_id':'first','shop_id':'first','brand_id':'first','*':'count'}) #,'behavior_type':'count_distinct'
full_user_behaviors_user_item=full_user_behaviors_user_item.withColumnRenamed('sum(behavior_value_new)','behavior_value_sum').withColumnRenamed('first(category_id)','category_id').withColumnRenamed('first(shop_id)','shop_id').withColumnRenamed('first(brand_id)','brand_id').withColumnRenamed('count(1)','count')
full_user_behaviors_user_item.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/full_user_behaviors_user_item.parquet')
full_user_behaviors_user_item=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors_user_item.parquet')
full_user_behaviors_user_item_user=users.join(full_user_behaviors_user_item,on='user_id')
full_user_behaviors_user_item_user_age_item=full_user_behaviors_user_item_user.groupBy(['age','gender','buy_cap','item_id']).agg({'behavior_value_sum':'sum','category_id':'first','shop_id':'first','brand_id':'first','*':'count'})
full_user_behaviors_user_item_user_age_item=full_user_behaviors_user_item_user_age_item.withColumnRenamed('sum(behavior_value_sum)','behavior_value_sum').withColumnRenamed('first(category_id)','category_id').withColumnRenamed('first(shop_id)','shop_id').withColumnRenamed('first(brand_id)','brand_id').withColumnRenamed('count(1)','count')
full_user_behaviors_user_item_user_age_item.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/full_user_behaviors_user_item_user_age_item.parquet')
full_user_behaviors_user_item_user_age_item=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors_user_item_user_age_item.parquet')
## recent window only (time > START_TIME); derived tables carry the _3 suffix
START_TIME=86400*8
full_user_behaviors_3=full_user_behaviors.filter('time>'+str(START_TIME))
full_user_behaviors_3.write.format("parquet").mode("overwrite").save('/item_recommend1_totalB/full_user_behaviors_3.parquet')
full_user_behaviors_3=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors_3.parquet')
## aggregate per ('user_id', 'item_id') within the recent window
# 'count' counts the grouped rows: actions per (user, item) here, roughly the number of interacting users after the demographic aggregation below
full_user_behaviors_user_item_3=full_user_behaviors_3.groupBy(['user_id','item_id']).agg({'behavior_value_new':'sum','category_id':'first','shop_id':'first','brand_id':'first','*':'count'}) #,'behavior_type':'count_distinct'
full_user_behaviors_user_item_3=full_user_behaviors_user_item_3.withColumnRenamed('sum(behavior_value_new)','behavior_value_sum').withColumnRenamed('first(category_id)','category_id').withColumnRenamed('first(shop_id)','shop_id').withColumnRenamed('first(brand_id)','brand_id').withColumnRenamed('count(1)','count')
full_user_behaviors_user_item_3.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/full_user_behaviors_user_item_3.parquet')
full_user_behaviors_user_item_3=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors_user_item_3.parquet')
full_user_behaviors_user_item_user_3=users.join(full_user_behaviors_user_item_3,on='user_id')
full_user_behaviors_user_item_user_age_item_3=full_user_behaviors_user_item_user_3.groupBy(['age','gender','buy_cap','item_id']).agg({'behavior_value_sum':'sum','category_id':'first','shop_id':'first','brand_id':'first','*':'count'})
full_user_behaviors_user_item_user_age_item_3=full_user_behaviors_user_item_user_age_item_3.withColumnRenamed('sum(behavior_value_sum)','behavior_value_sum').withColumnRenamed('first(category_id)','category_id').withColumnRenamed('first(shop_id)','shop_id').withColumnRenamed('first(brand_id)','brand_id').withColumnRenamed('count(1)','count')
full_user_behaviors_user_item_user_age_item_3.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/full_user_behaviors_user_item_user_age_item_3.parquet')
full_user_behaviors_user_item_user_age_item_3=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors_user_item_user_age_item_3.parquet')
## items with a high repurchase rate: average number of purchases per buying user, per item
dup_buyed_items=full_user_behaviors.filter('behavior_value==4').groupBy(['user_id','item_id']).count().groupBy('item_id').agg({'count':'avg'}).withColumnRenamed('avg(count)','count')
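# inspect the quantiles of the per-item average purchase count to pick the repurchase threshold (1.25) used below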
dup_buyed_items.approxQuantile('count',np.linspace(0,1,50).tolist(),0.01)
dup_buyed_items=dup_buyed_items.filter('count>1.25')
buyed_items=full_user_behaviors.join(users_test,how='left_semi',on='user_id')
full_user_behaviors_buy=buyed_items.filter('behavior_value==4')
full_user_behaviors_buy_dup=full_user_behaviors_buy.select(['user_id','item_id']).distinct().join(dup_buyed_items,how='inner',on='item_id')
# items bought more than once
# full_user_behaviors_buy_dup_count=full_user_behaviors_buy_dup.groupBy('user_id').count()
# full_user_behaviors_buy_dup_count.approxQuantile('count',np.linspace(0,1,50).tolist(),0.01)
# recent-window actions of the test users
full_user_behaviors_3_test=full_user_behaviors_3.join(users_test,how='left_semi',on='user_id')
# drop items the user has already bought
full_user_behaviors_3_notbuy=full_user_behaviors_3_test.join(full_user_behaviors_buy,how='left_anti',on=['user_id','item_id'])
full_user_behaviors_3_notbuy_group=full_user_behaviors_3_notbuy.groupBy(['user_id','item_id']).agg({'behavior_value_new':'sum','category_id':'first','shop_id':'first','brand_id':'first','*':'count'})
full_user_behaviors_3_notbuy_group=full_user_behaviors_3_notbuy_group.withColumnRenamed('sum(behavior_value_new)','behavior_value_sum').withColumnRenamed('first(category_id)','category_id').withColumnRenamed('first(shop_id)','shop_id').withColumnRenamed('first(brand_id)','brand_id').withColumnRenamed('count(1)','count')
full_user_behaviors_3_notbuy_group.approxQuantile('count',np.linspace(0,1,50).tolist(),0.01)
full_user_behaviors_3_notbuy_group.approxQuantile('behavior_value_sum',np.linspace(0,1,50).tolist(),0.01)
# among not-yet-bought items, keep those with a high enough weighted action score (threshold chosen from the quantiles above)
recommended_notbuy=full_user_behaviors_3_notbuy_group.filter('behavior_value_sum>16')
full_user_behaviors_buy_dup.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/full_user_behaviors_buy.parquet')
full_user_behaviors_buy_dup=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors_buy.parquet')
recommended_notbuy.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/recommended_notbuy.parquet')
recommended_notbuy=spark.read.parquet('/item_recommend1_totalB/recommended_notbuy.parquet')
recommended=recommended_notbuy.select(['user_id','item_id','count','behavior_value_sum']).union(full_user_behaviors_buy_dup.selectExpr(['user_id','item_id','count','10000 as behavior_value_sum']))
# recommended.groupBy('user_id').count().approxQuantile('count',np.linspace(0,1,50).tolist(),0.01)
recommended1=recommended.selectExpr(['user_id','item_id','count','behavior_value_sum'])
# full_user_behaviors_user_item_user_age_item_3v=full_user_behaviors_user_item_user_age_item_3.selectExpr(['age','gender','buy_cap','item_id','behavior_value_sum/count'])
# full_user_behaviors_user_item_user_age_item_3v.approxQuantile('(behavior_value_sum / count)',np.linspace(0,1,50).tolist(),0.01)
# full_user_behaviors_user_item_user_age_itemP=full_user_behaviors_user_item_user_age_item_3.toPandas()
# full_user_behaviors_user_item_user_age_itemP['ac']=full_user_behaviors_user_item_user_age_itemP['behavior_value_sum']/full_user_behaviors_user_item_user_age_itemP['count']
#
# full_user_behaviors_user_item.filter('behavior_value_sum>1').stat.approxQuantile('behavior_value_sum',np.linspace(0,1,50).tolist(),0.01)
#
# full_user_behaviors_user_item.filter('behavior_value_sum>1').stat.approxQuantile('count',np.linspace(0,1,50).tolist(),0.01)
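# GROUPED_MAP pandas UDF: within each (age, gender, buy_cap) group keep the 50 items with the highest behavior_value_sum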
@F.pandas_udf("age int,gender int,buy_cap int,item_id int,count int,behavior_value_sum double", F.PandasUDFType.GROUPED_MAP)
def trim(df):
return df.nlargest(50,'behavior_value_sum')
recommend_items_age=full_user_behaviors_user_item_user_age_item_3.select(['age','gender','buy_cap','item_id','count', 'behavior_value_sum']).groupby(['age','gender','buy_cap']).apply(trim)
recommend_items_user=users_test.join(recommend_items_age,on=['age','gender','buy_cap']).select(['user_id','item_id','count','behavior_value_sum'])
recommend_items_user.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/recommend_items_user.parquet')
recommend_items_user=spark.read.parquet('/item_recommend1_totalB/recommend_items_user.parquet')
recommend_items_user_all=recommend_items_user.join(recommended1,how='left_anti',on=['user_id','item_id']).union(recommended1)
recommend_items_user_df=recommend_items_user_all.toPandas()
def gen_itemids(r):
if 'user_id' not in r.columns:
return
user_id=r['user_id'].iloc[0]
l = [user_id]
r=r.sort_values(by='behavior_value_sum',ascending=False)
l.extend(list(r['item_id'])[:50])
return l
recommend_items_user_series=recommend_items_user_df.groupby('user_id').apply(gen_itemids)
notmatched_users=users_test.select('user_id').subtract(recommend_items_user.select('user_id').distinct()).collect()
for a in notmatched_users:
recommend_items_user_series[a.user_id]=[a.user_id]
need_more_recommends=recommend_items_user_series[recommend_items_user_series.apply(len)<51]
# if a user still has fewer than 50 recommendations, backfill from the all-days demographic aggregation, relaxing the demographic match step by step
for a in need_more_recommends:
print(a[0])
user=users_test.filter(' user_id='+str(a[0])).collect()[0]
j = 0
while len(a) < 51 and j < 4:
if j == 0:
pre_condition = ' age=%d and gender=%d and buy_cap=%d ' % (user.age, user.gender, user.buy_cap)
elif j == 1:
pre_condition = ' age between %d and %d and gender=%d and buy_cap=%d ' % (
user.age - 3, user.age + 3, user.gender, user.buy_cap)
elif j == 2:
pre_condition = ' age =%d and gender=%d and buy_cap between %d and %d ' % (
user.age, user.gender, user.buy_cap - 2, user.buy_cap + 2)
else:
pre_condition = ' age between %d and %d and gender=%d and buy_cap between %d and %d ' % (
user.age - 3, user.age + 3, user.gender, user.buy_cap - 2, user.buy_cap + 2)
condition = pre_condition
if len(a) > 1:
condition += (' and item_id not in (%s)' % ','.join([str(i) for i in a[1:]]) )
print(condition)
recommend_items = full_user_behaviors_user_item_user_age_item.filter(condition).orderBy(F.desc('behavior_value_sum')).limit(51-len(a)).collect()
for i in recommend_items:
if i.item_id not in a[1:]:
a.append(i.item_id)
if len(a) >= 51:
break
j=j+1
recommend_items_user_series[a[0]]=a
# need_more_recommends=recommend_items_user_series[recommend_items_user_series.apply(len)<51]
# for a in need_more_recommends:
# user=users_test.filter(' user_id='+str(a[0])).collect()[0]
# j = 1
# while len(a) < 51 and j < 4:
# if j == 0:
# pre_condition = ' age=%d and gender=%d and buy_cap=%d ' % (user.age, user.gender, user.buy_cap)
# elif j == 1:
# pre_condition = ' age between %d and %d and gender=%d and buy_cap=%d ' % (
# user.age - 3, user.age + 3, user.gender, user.buy_cap)
# elif j == 2:
# pre_condition = ' age =%d and gender=%d and buy_cap between %d and %d ' % (
# user.age, user.gender, user.buy_cap - 2, user.buy_cap + 2)
# else:
# pre_condition = ' age between %d and %d and gender=%d and buy_cap between %d and %d ' % (
# user.age - 3, user.age + 3, user.gender, user.buy_cap - 2, user.buy_cap + 2)
# condition = pre_condition
# if len(a) > 1:
# condition += (' and item_id not in (%s)' % ','.join([str(i) for i in a[1:]]) )
# print(condition)
# recommend_items = full_user_behaviors_user_item_user_age_item_3.filter(condition).orderBy(
# F.desc('count'),F.desc('behavior_value_sum')).limit(51-len(a)).collect()
# for i in recommend_items:
# if i.item_id not in a[1:]:
# a.append(i.item_id)
# if len(a) >= 51:
# break
# j=j+1
# recommend_items_user_series[a[0]]=a
df=pd.DataFrame(list(recommend_items_user_series.values),dtype=int)
df.to_csv('/Users/zhangyugu/Downloads/testb_result_081416.csv',float_format='%d',header=False,index=False)
That was the preliminary round, and this logic was enough to get into the semi-final, where submitting code is required. It got me thinking: all of these filtering rules were hand-crafted by me and are not necessarily the best recommendation policy, so how could the optimal policy be learned with machine learning instead?
The data available to me: user profiles (age, gender, purchasing power), item attributes (category, brand, shop), and the users' behavior history; the target is the user's degree of interest in an item. How interest is measured is up to me, and the simplest choice is the weighted sum of the user's historical actions on the item. The data above become the features, and the tricky part is extracting the features hidden in the behavior history. Here I use Spark's ALS matrix factorization to obtain the latent user and item features implied by that history:
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
hadoop_preffix="hdfs://master:8020/item_recommend1.db"
sc.setCheckpointDir(hadoop_preffix+'/item_recommend1_als_sc')
user_behaviors=spark.read.parquet('/item_recommend1_totalB/full_user_behaviors_user_item.parquet')
users_test=spark.read.parquet('/item_recommend1_testB/users.parquet')
user_behaviors_test=user_behaviors.join(users_test,how='left_semi',on='user_id')
users=spark.read.parquet('/item_recommend1/users.parquet')
user_behaviors_test_full=user_behaviors.join(users_test,how='inner',on='user_id')
user_behaviors_test=user_behaviors_test.select(['user_id','item_id',"behavior_value_sum"])
import pyspark.mllib.recommendation as rd
user_behaviors_test_rdd=user_behaviors_test.rdd
user_behaviors_test_rddRating=user_behaviors_test.rdd.map(lambda r:rd.Rating(r.user_id,r.item_id,r.behavior_value_sum))
user_behaviors_test_rddRating.checkpoint()
user_behaviors_test_rddRating.cache()
model=rd.ALS.trainImplicit(user_behaviors_test_rddRating,8,50,0.01)
userFeatures=model.userFeatures()
def feature_to_row(a):
l=list(a[1])
l.insert(0,a[0])
return l
userFeaturesRowed=userFeatures.map(feature_to_row)
productFeatures=model.productFeatures()
productFeaturesRowed=productFeatures.map(feature_to_row)
userFeaturesDf=sqlContext.createDataFrame(userFeaturesRowed,['user_id','feature_0','feature_1','feature_2','feature_3','feature_4','feature_5','feature_6','feature_7'])
itemFeaturesDf=sqlContext.createDataFrame(productFeaturesRowed,['item_id','item_feature_0','item_feature_1','item_feature_2','item_feature_3','item_feature_4','item_feature_5','item_feature_6','item_feature_7'])
userFeaturesDf.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/userFeaturesDf.parquet')
itemFeaturesDf.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/itemFeaturesDf.parquet')
user_behaviors_test_fullFeatured=user_behaviors_test_full.join(userFeaturesDf,how='inner',on='user_id').join(itemFeaturesDf,how='inner',on='item_id')
user_behaviors_test_fullFeatured.write.format('parquet').mode('overwrite').save('/item_recommend1_totalB/user_behaviors_test_fullFeatured.parquet')
With the feature data in hand, the next step is to map the features to a user's interest in a given item. This could be an ordinary regression algorithm such as logistic regression, or a more powerful function approximator such as a neural network.
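For reference, a minimal sketch of what the plain-regression option could look like, fitted with Spark ML on the joined feature table written above (the column names are the ones produced in that step; this is only an illustrative baseline):
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
featured=spark.read.parquet('/item_recommend1_totalB/user_behaviors_test_fullFeatured.parquet')
feature_cols=['age','gender','buy_cap']+['feature_%d'%i for i in range(8)]+['item_feature_%d'%i for i in range(8)]
# assemble the profile columns and the ALS latent factors into a single vector column
assembler=VectorAssembler(inputCols=feature_cols,outputCol='features')
train_df=assembler.transform(featured).select('features','behavior_value_sum')
# regress the weighted action sum (the interest score) on the assembled features
lr=LinearRegression(featuresCol='features',labelCol='behavior_value_sum',regParam=0.01)
lr_model=lr.fit(train_df)
print(lr_model.coefficients,lr_model.intercept)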
How can these features be combined with a neural network? This is when I found the DeepFM model; for an introduction see DeepFM模型理论和实践.
The following code trains with this model:
import numpy as np
import tensorflow as tf
import pandas as pd
from StratifiedKFoldByXColumnsYBin import StratifiedKFoldByXColumnsYBin
import sys
sys.path.append('../')
from DeepFM import DeepFM
# load the data
data_dir = '/Users/zhangyugu/Downloads/'
dfTrain=pd.read_csv(data_dir+'user_behaviors_test_full.csv')
# dfTrain['behavior_value_sum'].describe()
cols=['user_id','item_id','brand_id','shop_id','category_id','gender','age','buy_cap']
X_train=dfTrain[cols]
Y_train=dfTrain['behavior_value_sum'].values
X_test=X_train
# build the feature dictionary: each categorical value gets its own index (one-hot style), each numeric column a single index
feat_dim=0
feat_dict = {}
numeric_cols=['buy_cap','age']
for col in cols:
us=dfTrain[col].unique()
if col in numeric_cols:
feat_dim+=1
feat_dict[col]=feat_dim
else:
feat_dict[col] = dict(zip(us, range(feat_dim, len(us) + feat_dim)))
feat_dim+=len(us)
dfTrain_i=X_train.copy()
dfTrain_v=X_train.copy()
for col in cols:
if col not in numeric_cols:
dfTrain_i[col] = dfTrain_i[col].map(feat_dict[col])
dfTrain_v[col] = 1.
else:
dfTrain_i[col] = feat_dict[col]
dfTrain_i=dfTrain_i.values.tolist()
dfTrain_v=dfTrain_v.values.tolist()
# evaluation metric; despite the name gini_norm, this computes the mean squared error
def gini_norm(actual,predict):
return ((np.array(actual)-np.array(predict))**2).sum()/len(actual)
dfm_params={
'use_fm':True,
'use_deep':True,
'embedding_size':24,
'dropout_fm':[1.0,1.0],
'deep_layers':[48,16],
'dropout_deep':[0.5,0.5,0.5],
'deep_layers_activation':tf.nn.relu,
'epoch':30,
'batch_size':1024,
'learning_rate':0.001,
'optimizer_type':'adam',
'batch_norm':1,
'batch_norm_decay':0.995,
'l2_reg':0.01,
'verbose':True,
'eval_metric':gini_norm,
'random_seed':2017,
'greater_is_better':False,
'loss_type':'mse'
}
dfm_params["feature_size"] = feat_dim #特征数
dfm_params["field_size"] = len(cols) #字段数
folds= list(StratifiedKFoldByXColumnsYBin(columns=['gender','age','buy_cap'],n_splits=3,shuffle=True,random_state=2017).split(X_train,Y_train))
y_train_meta = np.zeros((dfTrain.shape[0],1),dtype=float) # out-of-fold predictions on the training data
y_test_meta = np.zeros((dfTrain.shape[0],1),dtype=float) # predictions on the "test" data; out of laziness the training data is reused as the test data here, to be fixed
gini_results_cv=np.zeros(len(folds),dtype=float) # per-fold metric on the validation fold
gini_results_epoch_train=np.zeros((len(folds),dfm_params['epoch']),dtype=float) # training metric per epoch, shape folds x epochs
gini_results_epoch_valid=np.zeros((len(folds),dfm_params['epoch']),dtype=float)
# for i in range(len(folds)):
# train_idx,valid_idx=train_test_split(range(len(dfTrain)),random_state=2017,train_size=2.0/3.0)
for i, (train_idx, valid_idx) in enumerate(folds):
    # slice out the feature indices, feature values and labels for this fold
_get = lambda x, l: [x[i] for i in l]
Xi_train_,Xv_train_,y_train_ = _get(dfTrain_i,train_idx), _get(dfTrain_v,train_idx),_get(Y_train,train_idx)
Xi_valid_, Xv_valid_, y_valid_ = _get(dfTrain_i, valid_idx), _get(dfTrain_v, valid_idx), _get(Y_train, valid_idx)
dfm=DeepFM(**dfm_params)
dfm.fit(Xi_train_,Xv_train_,y_train_,Xi_valid_,Xv_valid_,y_valid_)
y_train_meta[valid_idx,0]=dfm.predict(Xi_valid_,Xv_valid_)
    y_test_meta[:,0]+=dfm.predict(dfTrain_i,dfTrain_v) # accumulate, then averaged over the folds below
    gini_results_cv[i]=gini_norm(y_valid_,y_train_meta[valid_idx,0])
gini_results_epoch_train[i]=dfm.train_result
gini_results_epoch_valid[i]=dfm.valid_result
filename = data_dir+"DeepFm_Mean%.5f_Std%.5f.csv" % (gini_results_cv.mean(), gini_results_cv.std())
# average the test predictions over the folds
y_test_meta/=float(len(folds))
pd.DataFrame({"user_id": X_train['user_id'],"item_id":X_train['item_id'], "target": y_test_meta.flatten()}).to_csv(
filename, index=False, float_format="%.5f")
# cross-validated error
print("DeepFm: %.5f (%.5f)" % (gini_results_cv.mean(),gini_results_cv.std()))
import matplotlib.pyplot as plt
def _plot_fig(train_results, valid_results, filename):
colors = ["red", "blue", "green"]
xs = np.arange(1, train_results.shape[1]+1)
plt.figure()
legends = []
for i in range(train_results.shape[0]):
plt.plot(xs, train_results[i], color=colors[i], linestyle="solid", marker="o")
plt.plot(xs, valid_results[i], color=colors[i], linestyle="dashed", marker="o")
legends.append("train-%d"%(i+1))
legends.append("valid-%d"%(i+1))
plt.xlabel("Epoch")
plt.ylabel("Normalized Gini")
plt.legend(legends)
plt.savefig(filename)
plt.close()
_plot_fig(gini_results_epoch_train, gini_results_epoch_valid,data_dir+'DeepFm_ItemRecommend1.png')
Training on my own machine was very slow, so to get a GPU I tried Google Colab. Highly recommended: a handy environment for running machine-learning jobs on a remote GPU.
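A quick check that the Colab runtime really has a GPU attached (a standard TensorFlow call, nothing specific to this project):
import tensorflow as tf
# prints something like '/device:GPU:0'; an empty string means the runtime type is not set to GPU
print(tf.test.gpu_device_name())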
For repeated training runs the data has to be split into training and validation sets, and the sampling must be balanced: each fold should cover the different feature values as well as the different target values. The StratifiedKFold provided by scikit-learn only stratifies on the target y, so I implemented my own splitting class, StratifiedKFoldByXColumnsYBin.
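A minimal sketch of the idea behind such a splitter (bin the continuous target into quantiles, combine the bin with the chosen feature columns into a single stratification key, and delegate to scikit-learn's StratifiedKFold; the y_bins parameter and the key construction here are illustrative, the exact implementation may differ):
import pandas as pd
from sklearn.model_selection import StratifiedKFold
class StratifiedKFoldByXColumnsYBin:
    """K-fold splitter stratified on selected X columns and on a quantile-binned target."""
    def __init__(self, columns, n_splits=3, shuffle=True, random_state=None, y_bins=10):
        self.columns = columns  # feature columns to include in the stratification key
        self.y_bins = y_bins    # number of quantile bins for the continuous target
        self.skf = StratifiedKFold(n_splits=n_splits, shuffle=shuffle, random_state=random_state)
    def split(self, X, y):
        X = X.reset_index(drop=True)
        # bin the continuous target so its distribution is also balanced across folds
        key = pd.qcut(pd.Series(y), q=self.y_bins, labels=False, duplicates='drop').astype(str)
        # one stratification label per row: the selected feature values plus the target bin
        for col in self.columns:
            key = X[col].astype(str) + '_' + key
        # merge keys too rare to appear in every fold
        counts = key.value_counts()
        key = key.where(key.map(counts) >= self.skf.n_splits, other='rare')
        return self.skf.split(X, key)
The refactored training script that uses it: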
import numpy as np
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt
from StratifiedKFoldByXColumnsYBin import StratifiedKFoldByXColumnsYBin
import sys
sys.path.append('../')
from DeepFM import DeepFM
def data2index_values(dfTrain, feat_dict, cols, numeric_cols):
dfTrain_i = dfTrain.copy()
dfTrain_v = dfTrain.copy()
for col in cols:
if col not in numeric_cols:
dfTrain_i[col] = dfTrain_i[col].map(feat_dict[col])
dfTrain_v[col] = 1.
else:
dfTrain_i[col] = feat_dict[col]
dfTrain_i = dfTrain_i.values.tolist()
dfTrain_v = dfTrain_v.values.tolist()
return dfTrain_i, dfTrain_v
# build the feature dictionary: categorical values get one index each (one-hot style), numeric columns a single index
def data2sparse_matrix(X_train, X_test, numeric_cols):
feat_dim = 0
feat_dict = {}
cols = X_train.columns
for col in cols:
        us = X_train[col].unique()
if col in numeric_cols:
feat_dim += 1
feat_dict[col] = feat_dim
else:
feat_dict[col] = dict(zip(us, range(feat_dim, len(us) + feat_dim)))
feat_dim += len(us)
    xTrain_i, xTrain_v = data2index_values(X_train, feat_dict, cols, numeric_cols)
    xTest_i, xTest_v = data2index_values(X_test, feat_dict, cols, numeric_cols)
    return xTrain_i, xTrain_v, xTest_i, xTest_v, feat_dim
def train_and_predict(xTrain_i, xTrain_v, xTest_i, xTest_v, feat_dim):
EPOCHES = 30
folds = list(StratifiedKFoldByXColumnsYBin(columns=['gender', 'age', 'buy_cap'], n_splits=3, shuffle=True,
random_state=2017).split(X_train, Y_train))
    y_train_meta = np.zeros((len(xTrain_v), 1), dtype=float)  # out-of-fold predictions on the training data
    y_test_meta = np.zeros((len(xTest_v), 1), dtype=float)  # predictions on the test data
    loss_metrics_test = np.zeros(len(folds), dtype=float)  # per-fold prediction error
    epoch_loss_metrics_train = np.zeros((len(folds), EPOCHES), dtype=float)  # training error per epoch, shape folds x epochs
epoch_loss_metrics_valid = np.zeros((len(folds), EPOCHES), dtype=float)
for i, (train_idx, valid_idx) in enumerate(folds):
dfm_params = {
'use_fm': True,
'use_deep': True,
'embedding_size': 24,
'dropout_fm': [1.0, 1.0],
'deep_layers': [48, 16],
'dropout_deep': [0.5, 0.5, 0.5],
'deep_layers_activation': tf.nn.relu,
'epoch': 30,
'batch_size': 1024,
'learning_rate': 0.001,
'optimizer_type': 'adam',
'batch_norm': 1,
'batch_norm_decay': 0.995,
'l2_reg': 0.01,
'verbose': True,
'eval_metric': lambda actual, predict: ((np.array(actual) - np.array(predict)) ** 2).sum() / len(actual),
            # evaluation metric: mean squared error
'random_seed': 2017,
'greater_is_better': False,
'loss_type': 'mse',
"feature_size": feat_dim, # 特征数
"field_size": len(xTrain_v[0]) # 字段数
}
dfm = DeepFM(**dfm_params)
        # slice out the feature indices, feature values and labels for this fold
_get = lambda x, l: [x[i] for i in l]
Xi_train_, Xv_train_, y_train_ = _get(xTrain_i, train_idx), _get(xTrain_v, train_idx), _get(Y_train,
train_idx)
Xi_valid_, Xv_valid_, y_valid_ = _get(xTrain_i, valid_idx), _get(xTrain_v, valid_idx), _get(Y_train,
valid_idx)
dfm.fit(Xi_train_, Xv_train_, y_train_, Xi_valid_, Xv_valid_, y_valid_)
y_train_meta[valid_idx, 0] = dfm.predict(Xi_valid_, Xv_valid_)
y_test_meta[:, 0] += dfm.predict(xTest_i, xTest_v)
        loss_metrics_test[i] = dfm_params['eval_metric'](y_valid_, y_train_meta[valid_idx, 0])
        epoch_loss_metrics_train[i] = dfm.train_result  # vector with one value per epoch
epoch_loss_metrics_valid[i] = dfm.valid_result
    # average the test predictions over the folds
y_test_meta /= float(len(folds))
return loss_metrics_test, y_test_meta, epoch_loss_metrics_train, epoch_loss_metrics_valid
def _plot_fig(train_results, valid_results, filename):
colors = ["red", "blue", "green"]
xs = np.arange(1, train_results.shape[1]+1)
plt.figure()
legends = []
for i in range(train_results.shape[0]):
plt.plot(xs, train_results[i], color=colors[i], linestyle="solid", marker="o")
plt.plot(xs, valid_results[i], color=colors[i], linestyle="dashed", marker="o")
legends.append("train-%d" % (i+1))
legends.append("valid-%d" % (i+1))
plt.xlabel("Epoch")
plt.ylabel("Normalized Gini")
plt.legend(legends)
plt.savefig(filename)
plt.close()
# load the data
data_dir = '/Users/zhangyugu/Downloads/'
dfTrain = pd.read_csv(data_dir + 'user_behaviors_test_full.csv')
X_train = dfTrain[['user_id', 'item_id', 'brand_id', 'shop_id', 'category_id', 'gender', 'age', 'buy_cap']]
X_test = X_train[:100]
Y_train = dfTrain['behavior_value_sum'].values
xTrain_i,xTrain_v,xTest_i,xTest_v,feat_dim = data2sparse_matrix(X_train, X_test,['buy_cap', 'age'])
loss_metrics_test, y_test_meta, epoch_loss_metrics_train, epoch_loss_metrics_valid = train_and_predict(xTrain_i,xTrain_v,xTest_i,xTest_v,feat_dim)
pd.DataFrame({"user_id": X_train['user_id'],"item_id":X_train['item_id'], "target": y_test_meta.flatten()})\
.to_csv(data_dir + "DeepFm_Mean%.5f_Std%.5f.csv" % (loss_metrics_test.mean(), loss_metrics_test.std())
, index=False, float_format="%.5f")
# cross-validated error
print("DeepFm: %.5f (%.5f)" % (loss_metrics_test.mean(),loss_metrics_test.std()))
_plot_fig(epoch_loss_metrics_train, epoch_loss_metrics_valid,data_dir+'DeepFm_ItemRecommend1.png')
One problem remains: to predict the items a user is most likely to buy or view tomorrow, the candidate pool is huge, so how do we prune it using the feature values? Back to basics: a user buys an item because of a set of interest factors, and an item has its own score on each of those factors; in model terms, user and item each have a value vector over the same list of features. Can we then retrieve the needed items by filtering quickly over the item feature vectors? Alibaba's TDM (tree-based deep match) is a good answer to this. But concretely, inside the per-user ranking structure, how is each item's score computed? And could the scores and the tree structure be built during model training itself?
TDM builds the initial tree hierarchy from the items' own category structure, with items as the bottom-level leaves; it then groups the items by hierarchical clustering over their feature vectors, and every group carries a score (used for ranking) that is itself fed back into the learning procedure to be optimized. The scoring is implemented with an attention-like mechanism; the exact details are still on my to-study list.
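My rough mental model of the retrieval side is a layer-wise beam search over that tree: at each level only the top-k nodes under the learned user-node score are expanded, so only on the order of k·log(N) nodes ever get scored instead of the whole item pool. A toy sketch (the score function and the node structure are placeholders for whatever the trained model provides, not TDM's actual implementation):
def tree_retrieve(user, root, score, k=50):
    # beam search down the item tree; score(user, node) is the learned interest model
    frontier = [root]   # nodes kept at the current level
    leaves = []         # candidate items collected so far
    while frontier:
        # expand the beam one level down
        children = [c for node in frontier for c in node.children]
        if not children:
            break
        # keep only the k most promising nodes according to the model
        children.sort(key=lambda n: score(user, n), reverse=True)
        top = children[:k]
        leaves.extend(n.item_id for n in top if not n.children)  # leaf nodes are real items
        frontier = [n for n in top if n.children]                # internal nodes go one level deeper
    return leaves[:k]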