音乐频道推荐业务,支持各个产品业务和策略。这里我们先实现推荐部分,下面具体展开:
一、推荐系统流程图
CB,CF算法在召回阶段使用,推荐出来的item是粗排的,利用LR算法,可以将CB,CF召回回来的item进行精排,然后选择分数最高,给用户推荐出来。当然我们也可以使用矩阵分解技术,如ALS、LFM、SVD、SVD++,也可利用深度学习技术,如FM、FFM等。
二、推荐系统思路详解
代码思路:
1、数据预处理(用户画像数据、物品元数据、用户行为数据)
2、召回(CB、CF算法)
3、LR训练模型的数据准备,即用户特征数据,物品特征数据
4、模型准备,即通过LR算法训练模型数据得到w,b
5、推荐系统流程:
(1)解析请求:userid,itemid
(2)加载模型:加载排序模型(model.w,model.b)
(3)检索候选集合:利用cb,cf去redis里面检索数据库,得到候选集合
(4)获取用户特征:userid
(5)获取物品特征:itemid
(6)打分(逻辑回归,深度学习),排序
(7)top-n过滤
(8)数据包装(itemid->name),返回
三、推荐系统实现
3.1、数据预处理
(1)用户画像数据:user_profile.data
userid,性别,年龄,收入,地域
(2)物品(音乐)元数据:music_meta
itemid,name,desc,时长,地域,标签
(3)用户行为数据:user_watch_pref.sml
userid,itemid,该用户对该物品的收听时长,点击时间(小时)
首先,将3份数据融合到一份数据中
在pre_base_data目录中,执行python gen_base.py
# 将3份数据merge后的结果输出,供下游数据处理 #coding=utf-8 import sys user_action_data = '../data/user_watch_pref.sml' music_meta_data = '../data/music_meta' user_profile_data = '../data/user_profile.data' output_file = '../data/merge_base.data' # 将3份数据merge后的结果输出,供下游数据处理 ofile = open(output_file, 'w') # step 1. decode music meta data item_info_dict = {} with open(music_meta_data, 'r') as fd: for line in fd: ss = line.strip().split('\001') if len(ss) != 6: continue itemid, name, desc, total_timelen, location, tags = ss item_info_dict[itemid] = '\001'.join([name, desc, total_timelen, location, tags]) # step 2. decode user profile data user_profile_dict = {} with open(user_profile_data, 'r') as fd: for line in fd: ss = line.strip().split(',') if len(ss) != 5: continue userid, gender, age, salary, location = ss user_profile_dict[userid] = '\001'.join([gender, age, salary, location]) # step 3. decode user action data # output merge data with open(user_action_data, 'r') as fd: for line in fd: ss = line.strip().split('\001') if len(ss) != 4: continue userid, itemid, watch_len, hour = ss if userid not in user_profile_dict: continue if itemid not in item_info_dict: continue ofile.write('\001'.join([userid, itemid, watch_len, hour, \ user_profile_dict[userid], item_info_dict[itemid]])) ofile.write("\n") ofile.close()
- 01e3fdf415107cd6046a07481fbed499^A6470209102^A1635^A21^A男^A36-45^A20000-100000^A内蒙古^A黄家驹1993演唱会高清视频^A^A1969^A^A演唱会
3.2、【召回】CB算法
(1)以token itemid score形式整理训练数据
利用jieba分词,对item name进行中文分词
python gen_cb_train.py
#coding=utf-8 import sys sys.path.append('../') reload(sys) sys.setdefaultencoding('utf-8') import jieba import jieba.posseg import jieba.analyse input_file = "../data/merge_base.data" # 输出cb训练数据 output_file = '../data/cb_train.data' ofile = open(output_file, 'w') RATIO_FOR_NAME = 0.9 RATIO_FOR_DESC = 0.1 RATIO_FOR_TAGS = 0.05 # 这里是jieba的idf文件 idf_file = '../data/idf.txt' idf_dict = {} with open(idf_file, 'r') as fd: for line in fd: token, idf_score = line.strip().split(' ') idf_dict[token] = idf_score itemid_set = set() with open(input_file, 'r') as fd: for line in fd: ss = line.strip().split('\001') # 用户行为 userid = ss[0].strip() itemid = ss[1].strip() watch_len = ss[2].strip() hour = ss[3].strip() # 用户画像 gender = ss[4].strip() age = ss[5].strip() salary = ss[6].strip() user_location = ss[7].strip() # 物品元数据 name = ss[8].strip() desc = ss[9].strip() total_timelen = ss[10].strip() item_location = ss[11].strip() tags = ss[12].strip() # item因为做CB算法,所以不需要行为参数,只要拿到物品的名字就可以使用,所以进行去重复 if itemid not in itemid_set: itemid_set.add(itemid) else: continue # token for name, desc token_dict = {} for a in jieba.analyse.extract_tags(name, withWeight=True): token = a[0] score = float(a[1]) token_dict[token] = score * RATIO_FOR_NAME # token for desc for a in jieba.analyse.extract_tags(desc, withWeight=True): token = a[0] score = float(a[1]) if token in token_dict: token_dict[token] += score * RATIO_FOR_DESC else: token_dict[token] = score * RATIO_FOR_DESC # token for tags for tag in tags.strip().split(','): if tag not in idf_dict: continue else: if tag in token_dict: token_dict[tag] += float(idf_dict[tag]) * RATIO_FOR_TAGS else: token_dict[tag] = float(idf_dict[tag]) * RATIO_FOR_TAGS for k, v in token_dict.items(): token = k.strip() score = str(v) #print token, itemid, score ofile.write(','.join([token, itemid, score])) ofile.write("\n") ofile.close()
得到如下数据:
- 翻译,4090309101,0.561911164569(最后一个是一个不是传统的TF-IDF,因为分出的词在name,desc,tag里面他的重要性是不一样的)
(2)用协同过滤算法跑出item-item数据
最后得到基于cb的ii矩阵
(3)对数据格式化,item-> item list形式,整理出KV形式
python gen_reclist.py
#coding=utf-8 import sys infile = '../data/cb.result' outfile = '../data/cb_reclist.redis' ofile = open(outfile, 'w') MAX_RECLIST_SIZE = 100 PREFIX = 'CB_' rec_dict = {} with open(infile, 'r') as fd: for line in fd: itemid_A, itemid_B, sim_score = line.strip().split('\t') if itemid_A not in rec_dict: rec_dict[itemid_A] = [] rec_dict[itemid_A].append((itemid_B, sim_score)) for k, v in rec_dict.items(): key_item = PREFIX + k reclist_result = '_'.join([':'.join([tu[0], str(round(float(tu[1]), 6))]) \ for tu in sorted(v, key=lambda x: x[1], reverse=True)[:MAX_RECLIST_SIZE]]) ofile.write(' '.join(['SET', key_item, reclist_result])) ofile.write("\n") ofile.close()
类似如下数据:
- SET CB_5305109176 726100303:0.393048_953500302:0.393048_6193109237:0.348855
(4)灌库(redis)
下载redis-2.8.3.tar.gz安装包
进行源码编译(需要C编译yum install gcc-c++ ),执行make,然后会在src目录中,得到bin文件(redis-server 服务器,redis-cli 客户端)
启动redis server服务两种方法:
]# ./src/redis-server
]#后台方式启动 nohup ./redis-server &
然后换一个终端执行:]# ./src/redis-cli,连接服务
接下来灌数据(批量灌):
需要安装unix2dos(yum install unix2dos)(格式转换)
]# cat cb_reclist.redis | /usr/local/src/redis-2.8.3/src/redis-cli --pipe 这样是会报大量异常,所以需要用下面的方式去做,完了再使用管道插入(注意redis安装目录)
unix2dos cb_reclist.redis
cat cb_reclist.redis | /usr/local/src/redis/redis-2.8.3/src/redis-cli --pipe
验证:]# ./src/redis-cli
执行:
127.0.0.1:6379> get CB_5305109176
"726100303:0.393048_953500302:0.393048_6193109237:0.348855"
3.3、【召回】CF算法
(1)以userid itemid score形式整理训练数据
python gen_cf_train.py
#coding=utf-8 import sys input_file = "../data/merge_base.data" # 输出cf训练数据 output_file = '../data/cf_train.data' ofile = open(output_file, 'w') key_dict = {} with open(input_file, 'r') as fd: for line in fd: ss = line.strip().split('\001') # 用户行为 userid = ss[0].strip() itemid = ss[1].strip() watch_len = ss[2].strip() hour = ss[3].strip() # 用户画像 gender = ss[4].strip() age = ss[5].strip() salary = ss[6].strip() user_location = ss[7].strip() # 物品元数据 name = ss[8].strip() desc = ss[9].strip() total_timelen = ss[10].strip() item_location = ss[11].strip() tags = ss[12].strip() key = '_'.join([userid, itemid]) if key not in key_dict: key_dict[key] = [] key_dict[key].append((int(watch_len), int(total_timelen))) for k, v in key_dict.items(): t_finished = 0 t_all = 0 # 对为key进行分数聚合 for vv in v: t_finished += vv[0] t_all += vv[1] # 得到userid对item的最终分数 score = float(t_finished) / float(t_all) userid, itemid = k.strip().split('_') ofile.write(','.join([userid, itemid, str(score)])) ofile.write("\n") ofile.close()
(2)用协同过滤算法跑出item-item数据
最后得到基于cf的ii矩阵
(3)对数据格式化,item-> item list形式,整理出KV形式
python gen_reclist.py
#coding=utf-8 import sys infile = '../data/cf.result' outfile = '../data/cf_reclist.redis' ofile = open(outfile, 'w') MAX_RECLIST_SIZE = 100 PREFIX = 'CF_' rec_dict = {} with open(infile, 'r') as fd: for line in fd: itemid_A, itemid_B, sim_score = line.strip().split('\t') if itemid_A not in rec_dict: rec_dict[itemid_A] = [] rec_dict[itemid_A].append((itemid_B, sim_score)) for k, v in rec_dict.items(): key_item = PREFIX + k reclist_result = '_'.join([':'.join([tu[0], str(round(float(tu[1]), 6))]) \ for tu in sorted(v, key=lambda x: x[1], reverse=True)[:MAX_RECLIST_SIZE]]) ofile.write(' '.join(['SET', key_item, reclist_result])) ofile.write("\n") ofile.close()
类似如下数据:
(4)灌库
unix2dos cf_reclist.redis
cat cf_reclist.redis | /usr/local/src/redis-2.8.3/src/redis-cli --pipe
验证:
3.4、LR训练模型的数据准备
准备我们自己的训练数据
进入pre_data_for_rankmodel目录:
gen_samples.py
#coding=utf-8 import sys sys.path.append('../') reload(sys) sys.setdefaultencoding('utf-8') import jieba import jieba.analyse import jieba.posseg merge_base_infile = '../data/merge_base.data' output_file = '../data/samples.data' output_user_feature_file = '../data/user_feature.data' output_item_feature_file = '../data/item_feature.data' output_itemid_to_name_file = '../data/name_id.dict' def get_base_samples(infile): ret_samples_list = [] user_info_set = set() item_info_set = set() item_name2id = {} item_id2name = {} with open(infile, 'r') as fd: for line in fd: ss = line.strip().split('\001') if len(ss) != 13: continue userid = ss[0].strip() itemid = ss[1].strip() watch_time = ss[2].strip() total_time = ss[10].strip() # user info gender = ss[4].strip() age = ss[5].strip() user_feature = '\001'.join([userid, gender, age]) # item info name = ss[8].strip() item_feature = '\001'.join([itemid, name]) # label info label = float(watch_time) / float(total_time) final_label = '0' if label >= 0.82: final_label = '1' elif label <= 0.3: final_label = '0' else: continue # gen name2id dict for item feature item_name2id[name] = itemid # gen id2name dict item_id2name[itemid] = name # gen all samples list ret_samples_list.append([final_label, user_feature, item_feature]) # gen uniq userinfo user_info_set.add(user_feature) item_info_set.add(name) return ret_samples_list, user_info_set, item_info_set, item_name2id, item_id2name # step1. generate base samples(label, user feature, item feature) base_sample_list, user_info_set, item_info_set, item_name2id, item_id2name = \ get_base_samples(merge_base_infile) # step2. extract user feature user_fea_dict = {} for info in user_info_set: userid, gender, age = info.strip().split('\001') #gender idx = 0 # default 女 if gender == '男': idx = 1 gender_fea = ':'.join([str(idx), '1']) # age idx = 0 if age == '0-18': idx = 0 elif age == '19-25': idx = 1 elif age == '26-35': idx = 2 elif age == '36-45': idx = 3 else: idx = 4 idx += 2 age_fea = ':'.join([str(idx), '1']) user_fea_dict[userid] = ' '.join([gender_fea, age_fea]) # step3. extract item feature token_set = set() item_fs_dict = {} for name in item_info_set: token_score_list = [] for x, w in jieba.analyse.extract_tags(name, withWeight=True): token_score_list.append((x, w)) token_set.add(x) item_fs_dict[name] = token_score_list user_feature_offset = 10 # gen item id feature token_id_dict = {} for tu in enumerate(list(token_set)): # token -> token id token_id_dict[tu[1]] = tu[0] item_fea_dict = {} for name, fea in item_fs_dict.items(): tokenid_score_list = [] for (token, score) in fea: if token not in token_id_dict: continue token_id = token_id_dict[token] + user_feature_offset tokenid_score_list.append(':'.join([str(token_id), str(score)])) item_fea_dict[name] = ' '.join(tokenid_score_list) # step 4.generate final samples ofile = open(output_file, 'w') for (label, user_feature, item_feature) in base_sample_list: userid = user_feature.strip().split('\001')[0] item_name = item_feature.strip().split('\001')[1] if userid not in user_fea_dict: continue if item_name not in item_fea_dict: continue ofile.write(' '.join([label, user_fea_dict[userid], item_fea_dict[item_name]])) ofile.write('\n') ofile.close() # step 5. generate user feature mapping file o_u_file = open(output_user_feature_file, 'w') for userid, feature in user_fea_dict.items(): o_u_file.write('\t'.join([userid, feature])) o_u_file.write('\n') o_u_file.close() # step 6. generate item feature mapping file o_i_file = open(output_item_feature_file, 'w') for item_name, feature in item_fea_dict.items(): if item_name not in item_name2id: continue itemid = item_name2id[item_name] o_i_file.write('\t'.join([itemid, feature])) o_i_file.write('\n') o_i_file.close() # step 7. generate item id to name mapping file o_file = open(output_itemid_to_name_file, 'w') for itemid, item_name in item_id2name.items(): o_file.write('\t'.join([itemid, item_name])) o_file.write('\n') o_file.close()
3.5、模型准备
# -*- coding: UTF-8 -*- ''' 思路:这里我们要用到我们的数据,就需要我们自己写load_data的部分, 首先定义main,方法入口,然后进行load_data的编写 其次调用该方法的到x训练x测试,y训练,y测试,使用L1正则化或是L2正则化使得到结果更加可靠 输出wegiht,和b偏置 ''' import sys import numpy as np from scipy.sparse import csr_matrix from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression input_file = sys.argv[1] def load_data(): #由于在计算过程用到矩阵计算,这里我们需要根据我们的数据设置行,列,和训练的数据准备 #标签列表 target_list = [] #行数列表 fea_row_list = [] #特征列表 fea_col_list = [] #分数列表 data_list = [] #设置行号计数器 row_idx = 0 max_col = 0 with open(input_file,'r') as fd: for line in fd: ss = line.strip().split(' ') #标签 label = ss[0] #特征 fea = ss[1:] #将标签放入标签列表中 target_list.append(int(label)) #开始循环处理特征: for fea_score in fea: sss = fea_score.strip().split(':') if len(sss) != 2: continue feature, score = sss #增加行 fea_row_list.append(row_idx) #增加列 fea_col_list.append(int(feature)) #填充分数 data_list.append(float(score)) if int(feature) > max_col: max_col = int(feature) row_idx += 1 row = np.array(fea_row_list) col = np.array(fea_col_list) data = np.array(data_list) fea_datasets = csr_matrix((data, (row, col)), shape=(row_idx, max_col + 1)) x_train, x_test, y_train, y_test = train_test_split(fea_datasets, s, test_size=0.2, random_state=0) return x_train, x_test, y_train, y_test def main(): x_train,x_test,y_train,y_test = load_data() #用L2正则话防止过拟合 model = LogisticRegression(penalty='l2') #模型训练 model.fit(x_train,y_train) ff_w = open('model.w', 'w') ff_b = open('model.b', 'w') #写入训练出来的W for w_list in model.coef_: for w in w_list: print >> ff_w, "w: ", w # 写入训练出来的B for b in model.intercept_: print >> ff_b, "b: ", b print "precision: ", model.score(x_test, y_test) print "MSE: ", np.mean((model.predict(x_test) - y_test) ** 2) if __name__ == '__main__': main()
3.6、推荐系统实现
推荐系统demo流程
(1)解析请求:userid,itemid
(2)加载模型:加载排序模型(model.w,model.b)
(3)检索候选集合:利用cb,cf去redis里面检索数据库,得到候选集合
(4)获取用户特征:userid
(5)获取物品特征:itemid
(6)打分(逻辑回归,深度学习),排序
(7)top-n过滤
(8)数据包装(itemid->name),返回
验证:
192.168.150.10:9999/?userid=00370d83b51febe3e8ae395afa95c684&itemid=3880409156
下面附上一张完整的音乐推荐系统流程结构图: