The previous post covered how to run the training example; this one walks through how the training dataset is loaded. First, download the dataset by running 0_download_raw.sh in the utils folder:
bash 0_download_raw.sh
After the download finishes, ../raw_data contains two files: reviews_Electronics_5.json, which holds the user review/click records, and meta_Electronics.json, which holds the product metadata.
Next, run the script
python 1_convert_pd.py
which converts the raw data into DataFrame files. Source code:
import pickle
import pandas as pd

def to_df(file_path):
    # Parse a file with one dict-like record per line into a DataFrame.
    with open(file_path, 'r') as fin:
        df = {}
        i = 0
        for line in fin:
            df[i] = eval(line)  # eval parses each dict-literal line into a dict
            i += 1
        df = pd.DataFrame.from_dict(df, orient='index')  # build a DataFrame from the dict of rows
        return df
reviews_df = to_df('../raw_data/reviews_Electronics_5.json')
# Fields of the user-review data:
# reviewerID - ID of the reviewer, e.g. A1RSDE90N6RSZF
# asin - ID of the product, e.g. 0000013714
# reviewerName - name of the reviewer
# helpful - helpfulness rating of the review, e.g. 2/3
# reviewText - text of the review
# overall - rating of the product
# summary - summary of the review
# unixReviewTime - time of the review (unix time)
# reviewTime - time of the review (raw)
with open('../raw_data/reviews.pkl', 'wb') as f:
    pickle.dump(reviews_df, f, pickle.HIGHEST_PROTOCOL)
meta_df = to_df('../raw_data/meta_Electronics.json')
# Fields of the product metadata:
# asin - ID of the product, e.g. 0000031852
# imUrl - url of the product image
# description - description of the product
# categories - list of categories the product belongs to
# title - name of the product
# price - price in US dollars (at time of crawl)
# salesRank - sales rank information
# related - related products (also bought, also viewed, bought together, buy after viewing)
# brand - brand name
meta_df = meta_df[meta_df['asin'].isin(reviews_df['asin'].unique())]  # keep only products that appear in the reviews
meta_df = meta_df.reset_index(drop=True)
with open('../raw_data/meta.pkl', 'wb') as f:
    pickle.dump(meta_df, f, pickle.HIGHEST_PROTOCOL)
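The reason to_df uses eval() rather than json.loads() is that the raw files (the meta file in particular) are written as Python dict literals with single-quoted strings, which a strict JSON parser rejects. A minimal sketch with a made-up line shaped like the fields listed above (the values are illustrative, not real data):

```python
import ast
import json

# A hypothetical line in the raw file's dict-literal style (not real data).
line = "{'reviewerID': 'A1RSDE90N6RSZF', 'asin': '0000013714', 'overall': 5.0}"

record = eval(line)      # parses the dict literal, as to_df does
print(record['asin'])    # prints 0000013714

try:
    json.loads(line)     # strict JSON requires double-quoted strings
except json.JSONDecodeError:
    print('json.loads rejects single-quoted keys')

# ast.literal_eval is a safer drop-in for eval on untrusted input:
record2 = ast.literal_eval(line)
assert record2 == record
```

ast.literal_eval only evaluates literals (dicts, lists, strings, numbers), so it cannot execute arbitrary code the way eval can.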
The data formats are annotated in detail in the source code above. Next, run
python 2_remap_id.py
which renumbers the product and user IDs as integer codes and stores the result. Source code:
import random
import pickle
import numpy as np
random.seed(1234)
with open('../raw_data/reviews.pkl', 'rb') as f:
    reviews_df = pickle.load(f)
reviews_df = reviews_df[['reviewerID', 'asin', 'unixReviewTime']]
# keep 3 fields: 'reviewerID', 'asin', 'unixReviewTime'
with open('../raw_data/meta.pkl', 'rb') as f:
    meta_df = pickle.load(f)
meta_df = meta_df[['asin', 'categories']]
# keep 2 fields: 'asin', 'categories'
meta_df['categories'] = meta_df['categories'].map(lambda x: x[-1][-1])
# keep only the last (most specific) category of the last category path;
# e.g. [['Electronics', 'GPS & Navigation', 'Vehicle GPS', 'Trucking GPS']]
# yields 'Trucking GPS'
def build_map(df, col_name):
    # Renumber df[col_name] as integer codes; return the value-to-code map
    # and the sorted, deduplicated original values.
    key = sorted(df[col_name].unique().tolist())
    m = dict(zip(key, range(len(key))))  # m maps each value to its code 0, 1, 2, ...
    df[col_name] = df[col_name].map(lambda x: m[x])  # store the column as codes
    return m, key
asin_map, asin_key = build_map(meta_df, 'asin')
# product-ID-to-code map, deduplicated product IDs
cate_map, cate_key = build_map(meta_df, 'categories')
# category-to-code map, deduplicated categories
revi_map, revi_key = build_map(reviews_df, 'reviewerID')
# user-ID-to-code map, deduplicated user IDs
user_count, item_count, cate_count, example_count =\
len(revi_map), len(asin_map), len(cate_map), reviews_df.shape[0]
print('user_count: %d\titem_count: %d\tcate_count: %d\texample_count: %d' %
(user_count, item_count, cate_count, example_count))
meta_df = meta_df.sort_values('asin')
meta_df = meta_df.reset_index(drop=True)  # final fields: 'asin', 'categories'
reviews_df['asin'] = reviews_df['asin'].map(lambda x: asin_map[x])
reviews_df = reviews_df.sort_values(['reviewerID', 'unixReviewTime'])
reviews_df = reviews_df.reset_index(drop=True)
reviews_df = reviews_df[['reviewerID', 'asin', 'unixReviewTime']]  # final fields: 'reviewerID', 'asin', 'unixReviewTime'
cate_list = [meta_df['categories'][i] for i in range(len(asin_map))]
cate_list = np.array(cate_list, dtype=np.int32)  # category code of each item, indexed by item code
with open('../raw_data/remap.pkl', 'wb') as f:
    pickle.dump(reviews_df, f, pickle.HIGHEST_PROTOCOL)  # user code, item code, timestamp
    pickle.dump(cate_list, f, pickle.HIGHEST_PROTOCOL)  # category code of each item
    pickle.dump((user_count, item_count, cate_count, example_count),
                f, pickle.HIGHEST_PROTOCOL)  # user, item, category, and example counts
    pickle.dump((asin_key, cate_key, revi_key), f, pickle.HIGHEST_PROTOCOL)  # original product IDs, categories, and user IDs, in code order
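To make build_map concrete, here is a dry run on a toy DataFrame (the IDs are made up):

```python
import pandas as pd

def build_map(df, col_name):
    # Same logic as in 2_remap_id.py: sort the unique values, assign codes.
    key = sorted(df[col_name].unique().tolist())
    m = dict(zip(key, range(len(key))))
    df[col_name] = df[col_name].map(lambda x: m[x])
    return m, key

toy = pd.DataFrame({'asin': ['B003', 'B001', 'B002', 'B001']})
asin_map, asin_key = build_map(toy, 'asin')

print(asin_map)              # {'B001': 0, 'B002': 1, 'B003': 2}
print(asin_key)              # ['B001', 'B002', 'B003']
print(toy['asin'].tolist())  # [2, 0, 1, 0]  -- the column now holds codes
```

Note that the mapping mutates the DataFrame in place: after the call, the 'asin' column holds integer codes, and asin_key lets you recover the original ID from a code.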
Next, run:
python build_dataset.py
which builds the training and test sets. Source code:
import random
import pickle
random.seed(1234)
with open('../raw_data/remap.pkl', 'rb') as f:
    reviews_df = pickle.load(f)  # user code, item code, timestamp
    cate_list = pickle.load(f)  # category code of each item
    user_count, item_count, cate_count, example_count = pickle.load(f)
train_set = []
test_set = []
for reviewerID, hist in reviews_df.groupby('reviewerID'):
    pos_list = hist['asin'].tolist()  # the items this user clicked are the positives

    def gen_neg():  # draw a random item the user never clicked as a negative
        neg = pos_list[0]
        while neg in pos_list:
            neg = random.randint(0, item_count - 1)
        return neg

    neg_list = [gen_neg() for i in range(len(pos_list))]  # one random negative per positive
    for i in range(1, len(pos_list)):  # walk the sequence in time order: history so far + next item
        hist = pos_list[:i]
        if i != len(pos_list) - 1:
            train_set.append((reviewerID, hist, pos_list[i], 1))
            train_set.append((reviewerID, hist, neg_list[i], 0))
        else:  # the last step of each user's sequence becomes a test sample
            label = (pos_list[i], neg_list[i])
            test_set.append((reviewerID, hist, label))
# train_set format: (user code, [previously clicked item codes], candidate item code, clicked or not (1/0))
# test_set format: (user code, [previously clicked item codes], (positive item code, negative item code))
random.shuffle(train_set)
random.shuffle(test_set)
assert len(test_set) == user_count  # exactly one test sample per user
# assert(len(test_set) + len(train_set) // 2 == reviews_df.shape[0])
with open('dataset.pkl', 'wb') as f:
    pickle.dump(train_set, f, pickle.HIGHEST_PROTOCOL)  # training samples
    pickle.dump(test_set, f, pickle.HIGHEST_PROTOCOL)  # test samples
    pickle.dump(cate_list, f, pickle.HIGHEST_PROTOCOL)  # category code of each item, indexed by item code
    pickle.dump((user_count, item_count, cate_count), f, pickle.HIGHEST_PROTOCOL)  # user, item, and category counts
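To see exactly what the per-user loop produces, here is a dry run on a toy click sequence, with the negatives fixed instead of random so the output is deterministic (the IDs are made up):

```python
pos_list = [10, 20, 30, 40]  # one user's clicked items, in time order
neg_list = [11, 21, 31, 41]  # stand-ins for the randomly drawn negatives

train_set, test_set = [], []
for i in range(1, len(pos_list)):
    hist = pos_list[:i]              # everything clicked before step i
    if i != len(pos_list) - 1:
        train_set.append(('u0', hist, pos_list[i], 1))  # positive sample
        train_set.append(('u0', hist, neg_list[i], 0))  # negative sample
    else:                            # last step goes to the test set
        test_set.append(('u0', hist, (pos_list[i], neg_list[i])))

print(train_set)
# [('u0', [10], 20, 1), ('u0', [10], 21, 0),
#  ('u0', [10, 20], 30, 1), ('u0', [10, 20], 31, 0)]
print(test_set)
# [('u0', [10, 20, 30], (40, 41))]
```

So a user with n clicks yields 2(n - 2) training samples (a positive/negative pair per intermediate step) and exactly one test sample holding a (positive, negative) pair.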
This produces dataset.pkl, which contains:
the training and test samples; training samples have the format (user code, [previously clicked item codes], candidate item code, clicked or not (1/0)), while each test sample carries a (positive item code, negative item code) pair instead of a single candidate and label;
the category list, whose index is an item code and whose value is that item's category code;
the user, item, and category counts.
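Since dataset.pkl stacks four pickle.dump calls in one file, a consumer must call pickle.load the same number of times, in the same order. A minimal round-trip sketch with toy stand-ins (not the real data), using an in-memory buffer instead of the actual file:

```python
import io
import pickle

buf = io.BytesIO()
pickle.dump([('u0', [1], 2, 1)], buf)       # train_set stand-in
pickle.dump([('u0', [1, 2], (3, 4))], buf)  # test_set stand-in
pickle.dump([0, 0, 1], buf)                 # cate_list stand-in
pickle.dump((1, 3, 2), buf)                 # (user_count, item_count, cate_count)

buf.seek(0)  # rewind, then load in the exact dump order
train_set = pickle.load(buf)
test_set = pickle.load(buf)
cate_list = pickle.load(buf)
user_count, item_count, cate_count = pickle.load(buf)
print(user_count, item_count, cate_count)  # prints 1 3 2
```

Loading out of order (or too few times) silently hands you the wrong object, so the training code must mirror this sequence when it opens dataset.pkl.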
This is all the data the subsequent model training needs; the next post walks through the model structure.