Hi everyone, rookie Xiaoming here. The CTR of purely ID-based models has sunk to the bottom in our setting: the earlier ytb, SRGNN, LightGCN, and TAGNN all lose to FFM and DSSM, which seems unavoidable in practice. Whether the cause is a code problem or data sparsity, no amount of short-term tuning shows any lift. I already tried adding user-side features to ytb (net and recently watched categories), yet there was still no improvement online; even though the offline metrics got close to SRGNN, the measured CTR did not move, and since SRGNN itself is already underperforming, why keep grinding on it? [I also tried tuning other hyperparameters, to no avail.]
Paper here, code here.
1 - There are quite a few things in the actual code that need fixing. For example, the cate IDs in the item_cate file start from 1, while those in train/test/valid start from 0. After subtracting 1 from the cate IDs in item_cate, the test results were identical, so I am not sure whether this remapping is even correct. Both user and item IDs start from 1.
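For reference, a minimal sketch of that re-indexing, assuming an item_cate file laid out like book_item_cate.txt (the docpic file names are placeholders for my own data):

import pandas as pd

# Hypothetical file name; the layout (item_id,cate) follows book_item_cate.txt.
item_cate = pd.read_csv('docpic_item_cate.txt', header=None, names=['item_id', 'cate'])

# Shift the 1-based cate IDs down by one and sanity-check the result.
item_cate['cate'] = item_cate['cate'] - 1
assert item_cate['cate'].min() == 0

item_cate.to_csv('docpic_item_cate_0based.txt', header=False, index=False)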
2 - My own data split uses leave-one-out. The results are below and are quite bad: the validation metrics never change at all while the training loss keeps dropping. See my issue for details on this problem (a sketch of the split I used follows the log).
time interval: 157.9083 min
docpic_ComiRec-SA_b128_lr0.001_d64_len20_docpic
iter: 49000, train loss: 1.8495, valid recall: 0.000007, valid ndcg: 0.000007, valid hitrate: 0.000007
time interval: 161.3182 min
docpic_ComiRec-SA_b128_lr0.001_d64_len20_docpic
iter: 50000, train loss: 1.8340, valid recall: 0.000007, valid ndcg: 0.000007, valid hitrate: 0.000007
time interval: 164.7177 min
docpic_ComiRec-SA_b128_lr0.001_d64_len20_docpic
iter: 51000, train loss: 1.8256, valid recall: 0.000007, valid ndcg: 0.000007, valid hitrate: 0.000007
time interval: 168.0316 min
docpic_ComiRec-SA_b128_lr0.001_d64_len20_docpic
iter: 52000, train loss: 1.8214, valid recall: 0.000007, valid ndcg: 0.000007, valid hitrate: 0.000007
model restored from best_model/docpic_ComiRec-SA_b128_lr0.001_d64_len20_docpic/
valid recall: 0.000007, valid ndcg: 0.000007, valid hitrate: 0.000007, valid diversity: 0.000000
test recall: 0.000007, test ndcg: 0.000007, test hitrate: 0.000007, test diversity: 0.000000
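For reference, a minimal sketch of the leave-one-out split I used, assuming a click log that is already time-ordered within each user (the file and column names are my own placeholders):

import pandas as pd

clicks = pd.read_csv('docpic_clicks.txt', header=None,
                     names=['user_id', 'item_id', 'cate'])

# Number each user's clicks from the end: 0 = last click, 1 = second-to-last, ...
clicks['rank_from_end'] = clicks.groupby('user_id').cumcount(ascending=False)

test = clicks[clicks['rank_from_end'] == 0]    # last click per user
valid = clicks[clicks['rank_from_end'] == 1]   # second-to-last click per user
train = clicks[clicks['rank_from_end'] >= 2]   # everything earlier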
2-1 - The original data, by contrast, takes the clicks of a subset of users as the validation and test sets, so the users in train, valid, and test are completely disjoint (verified below). This split, which only considers item sequences, is understandable, but my leave-one-out split is perfectly reasonable too, and yet its validation results are simply bad (a rough sketch of the user-disjoint split follows the check).
>>> tr=book_train[book_train['user_id'].isin(book_test['user_id'])]
>>> tr
Empty DataFrame
Columns: [user_id, item_id, cate]
Index: []
>>> tr=book_train[book_train['user_id'].isin(book_valid['user_id'])]
>>> tr
Empty DataFrame
Columns: [user_id, item_id, cate]
Index: []
>>> tr=book_test[book_test['user_id'].isin(book_valid['user_id'])]
>>> tr
Empty DataFrame
Columns: [user_id, item_id, cate]
Index: []
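For comparison, a rough sketch of a user-disjoint split of that kind; the 10% ratios and the file name are my own assumptions, not the repo's exact preprocessing:

import numpy as np
import pandas as pd

clicks = pd.read_csv('book_clicks.txt', header=None,
                     names=['user_id', 'item_id', 'cate'])

rng = np.random.default_rng(0)
users = clicks['user_id'].unique()
rng.shuffle(users)

# Hold out roughly 10% of users each for validation and test.
n_hold = len(users) // 10
valid_users = set(users[:n_hold])
test_users = set(users[n_hold:2 * n_hold])

valid = clicks[clicks['user_id'].isin(valid_users)]
test = clicks[clicks['user_id'].isin(test_users)]
train = clicks[~clicks['user_id'].isin(valid_users | test_users)]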
2-2 - First align with the original data, or alternatively apply leave-one-out to the original data as well and see what happens, as follows:
Merge the data, re-split it, and verify that every item has a cate:
>>> new_data.shape
(8898041, 3)
>>> new_data2=new_data[new_data['item_id'].isin(item_cate['item_id'])]
>>> new_data2.shape
(8898041, 3)
The error below keeps showing up, which means some users clicked fewer than 4 items; such users can simply be dropped. To keep training runnable after the last click is held out as test, the minimum click count needs to be set to at least 6 (a filtering sketch follows the traceback).
training begin
Traceback (most recent call last):
  File "src/train.py", line 379, in
    model_type=args.model_type, lr=args.learning_rate, max_iter=args.max_iter, patience=args.patience)
  File "src/train.py", line 244, in train
    for src, tgt in train_data:
  File "/data/logs/xulm1/ComiRec/src/data_iterator.py", line 67, in __next__
    k = random.choice(range(4, len(item_list)))
  File "/data/logs/xulm1/myconda/lib/python3.7/random.py", line 261, in choice
    raise IndexError('Cannot choose from an empty sequence') from None
IndexError: Cannot choose from an empty sequence
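A minimal sketch of that filtering step; random.choice(range(4, len(item_list))) in data_iterator.py needs a non-empty range, and the threshold of 6 follows the reasoning above (the file name is a placeholder for my own data):

import pandas as pd

MIN_CLICKS = 6

clicks = pd.read_csv('docpic_clicks.txt', header=None,
                     names=['user_id', 'item_id', 'cate'])

# Drop every user whose total click count is below the threshold.
counts = clicks.groupby('user_id')['item_id'].transform('size')
clicks = clicks[counts >= MIN_CLICKS]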
From the runs so far, the code really should be changed as described in my issue, or else all the data should simply go into training; with HR and NDCG this tiny, what is there to test offline? [Note: nothing in my posts is aimed at anyone in particular; please don't take it personally.] Below is the training process and result after switching the book data to leave-one-out: the metrics barely change there either. Does that suggest my data has no problem other than sparsity?
time interval: 1.9249 min
book_ComiRec-SA_b128_lr0.001_d64_len20_leave1out
iter: 2000, train loss: 7.6902, valid recall: 0.006361, valid ndcg: 0.006361, valid hitrate: 0.006361
time interval: 3.5376 min
...
model restored from best_model/book_ComiRec-SA_b128_lr0.001_d64_len20_leave1out/
valid recall: 0.006361, valid ndcg: 0.006361, valid hitrate: 0.006361, valid diversity: 0.040000
test recall: 0.006361, test ndcg: 0.006361, test hitrate: 0.006361, test diversity: 0.040000
3 - Reprocessing the categories [20210428]
The earlier item cate data was a mess with all kinds of junk in it, so I mapped every garbage value to -1, especially the entries containing '?'. Someone upstream clearly did not clean the data properly, and the bugs just flow downstream. Many categories are also useless: a category containing a single item teaches the model nothing, and there are around 200 such single-item categories (listed below). For now I will fall back to the coarse categories, which are at least reasonably accurate.
[(272, 1), (13, 1), (269, 1), (276, 1), (781, 1), (266, 1), (261, 1), (51, 1), (18, 1), (53, 1), (273, 1), (779, 1), (270, 1), (254, 1), (307, 1), (252, 1), (518, 1), (257, 1), (268, 1), (12, 1), (250, 1), (308, 1), (248, 1), (277, 1), (289, 1), (302, 1), (46, 1), (243, 1), (287, 1), (301, 1), (799, 1), (288, 1), (300, 1), (44, 1), (290, 1), (249, 1), (35, 1), (260, 1), (292, 1), (293, 1), (298, 1), (42, 1), (294, 1), (295, 1), (41, 1), (558, 1), (769, 1), (259, 1), (47, 1), (278, 1), (790, 1), (279, 1), (281, 1), (793, 1), (306, 1), (50, 1), (282, 1), (1, 1), (509, 1), (40, 1), (795, 1), (258, 1), (304, 1), (48, 1), (247, 1), (3, 1), (286, 1), (798, 1), (251, 1), (96, 1), (496, 1), (408, 1), (146, 1), (402, 1), (658, 1), (147, 1), (660, 1), (149, 1), (661, 1), (150, 1), (406, 1), (407, 1), (152, 1), (153, 1), (420, 1), (154, 1), (155, 1), (411, 1), (232, 1), (156, 1), (412, 1), (158, 1), (231, 1), (160, 1), (417, 1), (163, 1), (401, 1), (145, 1), (144, 1), (655, 1), (132, 1), (389, 1), (645, 1), (134, 1), (390, 1), (391, 1), (647, 1), (136, 1), (392, 1), (233, 1), (137, 1), (393, 1), (138, 1), (394, 1), (650, 1), (139, 1), (395, 1), (651, 1), (396, 1), (397, 1), (142, 1), (398, 1), (143, 1), (164, 1), (421, 1), (387, 1), (224, 1), (185, 1), (697, 1), (186, 1), (698, 1), (699, 1), (188, 1), (700, 1), (193, 1), (227, 1), (196, 1), (225, 1), (199, 1), (166, 1), (202, 1), (204, 1), (205, 1), (206, 1), (207, 1), (222, 1), (210, 1), (212, 1), (214, 1), (218, 1), (216, 1), (440, 1), (184, 1), (695, 1), (439, 1), (423, 1), (424, 1), (425, 1), (170, 1), (426, 1), (428, 1), (429, 1), (430, 1), (431, 1), (176, 1), (432, 1), (177, 1), (178, 1), (434, 1), (179, 1), (435, 1), (180, 1), (436, 1), (181, 1), (229, 1), (182, 1), (438, 1), (228, 1), (643, 1), (641, 1), (311, 1), (337, 1), (75, 1), (76, 1), (332, 1), (77, 1), (589, 1), (78, 1), (334, 1), (236, 1), (79, 1), (335, 1), (591, 1), (593, 1), (346, 1), (82, 1), (338, 1), (83, 1), (339, 1), (84, 1), (341, 1), (87, 1), (343, 1), (599, 1), (89, 1), (345, 1), (586, 1), (330, 1), (329, 1), (73, 1), (312, 1), (240, 1), (824, 1), (57, 1), (314, 1), (59, 1), (315, 1), (60, 1), (61, 1), (317, 1), (63, 1), (319, 1), (238, 1), (322, 1), (324, 1), (69, 1), (325, 1), (70, 1), (326, 1), (71, 1), (72, 1), (328, 1), (584, 1), (235, 1), (91, 1), (385, 1), (375, 1), (366, 1), (111, 1), (368, 1), (114, 1), (370, 1), (115, 1), (511, 1), (628, 1), (373, 1), (629, 1), (118, 1), (376, 1), (603, 1), (632, 1), (378, 1), (634, 1), (123, 1), (124, 1), (637, 1), (126, 1), (382, 1), (127, 1), (128, 1), (384, 1), (621, 1), (365, 1), (620, 1), (362, 1), (92, 1), (348, 1), (604, 1), (606, 1), (351, 1), (607, 1), (217, 1), (97, 1), (98, 1), (354, 1), (610, 1), (355, 1), (612, 1), (101, 1), (357, 1), (613, 1), (614, 1), (103, 1), (359, 1), (104, 1), (490, 1), (361, 1), (106, 1), (371, 1), (602, 2), (64, 2), (65, 2), (818, 2), (344, 2), (493, 2), (68, 2), (491, 2), (242, 2), (340, 2), (58, 2), (313, 2)]
[Side note: I have been doing push-ups since winter, one set in the morning and one in the evening. I am up to 80 in one go now, aiming for 200. Standard form or not, it works the arms and helps me sleep; I may add sit-ups later, in which case I had better leave work before 9 pm, since staying too late is bad for the body.]
Even with the coarsest categories, a lot of items (about 1/5) still have no category at all, which means the content profiling or upstream data processing is lacking, and that is not a problem I should have to own. A few categories contain very few items; such categories are as useless as "no category" and teach the model nothing. After this adjustment the metrics dropped yet again, from about 7e-6 to 5e-6...
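A minimal sketch of the cleanup just described, assuming a raw item_cate table with free-form cate values (the file name and the minimum-size threshold are my own assumptions):

import pandas as pd

MIN_ITEMS_PER_CATE = 5   # assumed threshold for "very few items"

item_cate = pd.read_csv('docpic_item_cate_raw.txt', header=None,
                        names=['item_id', 'cate'])

# Map garbage values (missing, empty, '?') to -1.
garbage = item_cate['cate'].isna() | item_cate['cate'].isin(['', '?'])
item_cate.loc[garbage, 'cate'] = -1

# Merge categories with too few items into -1 as well; they carry no usable signal.
counts = item_cate.groupby('cate')['item_id'].transform('size')
item_cate.loc[(counts < MIN_ITEMS_PER_CATE) & (item_cate['cate'] != -1), 'cate'] = -1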
4 - How to do inference
What exactly does training give us? From the figure in the paper, each user is represented by several interest embeddings; each retrieves its own top-k candidates, which are then filtered and merged. I was relieved to find an output function in the code that returns exactly the item embeddings. As Section 4.2 of the paper describes, each user learns K interest embeddings, each retrieves top-N items, and the union is ranked by inner product, which is no different from MIND. I tried the self-attention model with K=4, but the test stage is painfully slow (about 20 minutes), so I just dumped the trained variables to see what is in there. Hmm, why is there only the item embedding?
>>> emb.shape
(367978, 64)
What I want are the interest embeddings... multi-interest, and by the end I had lost interest myself. Digging into the code, there is indeed a user embedding output, but what exactly goes in as its input? (A usage sketch follows the snippet below.)
def output_user(self, sess, inps):
    # inps[0]: padded item-ID click histories; inps[1]: the matching 0/1 mask
    user_embs = sess.run(self.user_eb, feed_dict={
        self.mid_his_batch_ph: inps[0],
        self.mask: inps[1]
    })
    return user_embs
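A rough sketch of how I call it to dump interest embeddings; the padding and masking here are my own (right-padded with zeros, mask of ones over the real clicks), and for ComiRec-SA with K=4 the returned array should be [batch, K, d]:

import numpy as np

def get_user_embs(sess, model, histories, maxlen=20):
    # histories: list of per-user click item-ID lists, already time-ordered.
    hist_batch = np.zeros((len(histories), maxlen), dtype=np.int64)
    mask_batch = np.zeros((len(histories), maxlen), dtype=np.float32)
    for i, hist in enumerate(histories):
        hist = hist[-maxlen:]                 # keep only the most recent maxlen clicks
        hist_batch[i, :len(hist)] = hist
        mask_batch[i, :len(hist)] = 1.0
    return model.output_user(sess, [hist_batch, mask_batch])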
Looking at the evaluation code answers this, though it is not entirely straightforward; the thing to get right is which dataset to feed in. With a leave-one-out split, every click except the last one serves as the input from which the user embedding is computed, so the data here should be the training data, which is the same as in SRGNN. Good catch, Xiaoming. Overall, ComiRec is similar to SRGNN: the latter learns a single user embedding, i.e. it represents the user's preference by the sequence of clicked items, while the former learns multiple embeddings and finally ranks by inner product. This is the general pattern of sequential recommendation: it does not depend on the user ID itself but on the user's clicks (though aren't the clicks ultimately about the user anyway?). In the end it is still a U2I-style retrieval, so the faiss deployment works the same way (a retrieval sketch follows the snippet below).
def evaluate_full(sess, test_data, model, model_path, batch_size, item_cate_map, save=True, coef=None):
    topN = args.topN
    item_embs = model.output_item(sess)      # embeddings of all items
    res = faiss.StandardGpuResources()       # GPU resources for the faiss index over items
    flat_config = faiss.GpuIndexFlatConfig()
    flat_config.device = 0
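For offline checks I reproduce the same idea with a plain CPU index. A minimal sketch of the multi-interest retrieval described above: each of a user's K interest embeddings searches the item index, the candidates are merged, and the union is ranked by inner product (topN=50 is an assumed default):

import faiss
import numpy as np

def multi_interest_topn(item_embs, interest_embs, topN=50):
    # item_embs: [num_items, d]; interest_embs: [K, d] for a single user.
    index = faiss.IndexFlatIP(item_embs.shape[1])          # exact inner-product index
    index.add(item_embs.astype(np.float32))
    scores, ids = index.search(interest_embs.astype(np.float32), topN)
    # Merge the K candidate lists, keeping the best score per item row.
    best = {}
    for score_row, id_row in zip(scores, ids):
        for s, i in zip(score_row, id_row):
            i = int(i)                                     # row position in item_embs
            if i not in best or s > best[i]:
                best[i] = float(s)
    return sorted(best, key=best.get, reverse=True)[:topN]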
Given the above, my earlier evaluation protocol stands; only then is the comparison truly aligned so the model can be pitted against the previous methods. The training and checkpointing need some changes, e.g. train for 10 passes and save only the final model, and skip the validation-set evaluation (since it never changes), which also saves time. The author's code is reasonably clean, with no obscure constructs (those would be hard to modify; my goal is to adapt it to multi-GPU).
The final metrics are below, using the original code's evaluation:
iter: 10000, train loss: 1.9007
time interval: 111.7582 s
WARNING:tensorflow:From src/mytrain.py:319: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.
model restored from best_model/docpic_ComiRec-SA_b128_lr0.005_d64_len20_test_docpic/
valid recall: 0.156470, valid ndcg: 0.283301, valid hitrate: 0.421946, valid diversity: 0.798395
iter: 1000000, train loss: 1.1756
time interval: 180.4199 min
WARNING:tensorflow:From src/mytrain.py:319: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.
model restored from best_model/docpic_ComiRec-SA_b128_lr0.005_d64_len20_TEST/
valid recall: 0.257123, valid ndcg: 0.449645, valid hitrate: 0.594076, valid diversity: 0.681665
One run took about 2 minutes, the other about 3 hours; the only difference is the number of iterations. Note that an iteration here is not an epoch but one training batch, so samples may be trained on repeatedly, and I don't know how many times. The reason recall and hitrate differ here is that this evaluation is not against the last item only, i.e. it is not leave-one-out. I have saved the user and item embeddings, so I can run faiss retrieval and check the metrics under the leave-one-out test method, as follows:
This is maddening... The three numbers are MAP, HR, and NDCG respectively, and that is the level they are at. If nothing else is wrong, this is just a very poor result (a sketch of the scoring follows the output):
>>> score=evaluate_score(preds,answer_dict)
answer length, 1239551
>>> score
array([7.9618017e-05, 9.6567225e-04, 2.4277368e-04], dtype=float32)
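evaluate_score is my own helper, not part of the repo; a rough sketch of what it computes under leave-one-out, where preds maps each user_id to a ranked candidate list and answer_dict maps each user_id to the single held-out item (with one relevant item, MAP reduces to MRR):

import numpy as np

def evaluate_score(preds, answer_dict, topN=50):
    ap, hr, ndcg = [], [], []
    for user_id, answer in answer_dict.items():
        ranked = preds.get(user_id, [])[:topN]
        if answer in ranked:
            rank = ranked.index(answer)              # 0-based position of the held-out item
            ap.append(1.0 / (rank + 1))
            hr.append(1.0)
            ndcg.append(1.0 / np.log2(rank + 2))
        else:
            ap.append(0.0)
            hr.append(0.0)
            ndcg.append(0.0)
    print('answer length,', len(answer_dict))
    return np.array([np.mean(ap), np.mean(hr), np.mean(ndcg)], dtype=np.float32)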
I suspect something is off. First, user_id may not correspond to user_emb, i.e. the user embeddings I dumped may not be in the order I assumed. Checking how the data iterator generates batches, the expected file format is user,item,time, as shown below:
def read(self, source):
    # Build each user's click sequence, sorted by the timestamp in the third column.
    self.graph = {}
    self.users = set()
    self.items = set()
    with open(source, 'r') as f:
        for line in f:
            conts = line.strip().split(',')
            user_id = int(conts[0])
            item_id = int(conts[1])
            time_stamp = int(conts[2])
            self.users.add(user_id)
            self.items.add(item_id)
            if user_id not in self.graph:
                self.graph[user_id] = []
            self.graph[user_id].append((item_id, time_stamp))
    for user_id, value in self.graph.items():
        value.sort(key=lambda x: x[1])
        self.graph[user_id] = [x[0] for x in value]
So my data format was wrong: I had user,item,cate, while the code expects user,item,time. Next I will change the data format and check again (alternatively, the author's code could be modified to drop the third time column entirely, since my data is already sorted).
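A minimal sketch of that format fix: replace the cate column with a synthetic, strictly increasing per-user timestamp so that read()'s sort becomes a no-op (file names are placeholders for my own data):

import pandas as pd

clicks = pd.read_csv('docpic_clicks.txt', header=None,
                     names=['user_id', 'item_id', 'cate'])

# Use each click's rank within its user as the timestamp; the rows are already time-ordered.
clicks['time'] = clicks.groupby('user_id').cumcount()
clicks[['user_id', 'item_id', 'time']].to_csv('docpic_clicks_uit.txt',
                                              header=False, index=False)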
Further evidence that the third column of train is time rather than cate: the maximum cate ID in item_cate is smaller than the maximum value in the third column of train (using the original book data):
>>> train=pd.read_csv('book_train.txt',header=None)
>>> train.columns=['user_id','item_id','time']
>>> train['time'].max()
23221
>>> item_cate=pd.read_csv('../book_item_cate.txt',header=None)
>>> item_cate.columns
Int64Index([0, 1], dtype='int64')
>>> item_cate.columns=['item_id','cate']
>>> item_cate['cate'].min()
0
>>> item_cate['cate'].max()
1599
[A side question: if cate is not in the training data at all, how is cate used for learning? At least I don't see it anywhere in the data iterator; from the evaluate_full signature above, item_cate_map appears to be used only for the diversity metric at evaluation time. No wonder re-indexing cate to start from 0 changed nothing.]
As for the users, I assumed they come out in ascending order after being added to the set, as in the toy example below. (Strictly speaking, a Python set does not guarantee any iteration order; small integers just happen to come out sorted in CPython, so it is safer to sort explicitly, as in the sketch after the example.)
>>> user=set()
>>> user.add(1)
>>> user.add(2)
>>> user
{1, 2}
>>> user.add(4)
>>> user.add(2)
>>> user.add(3)
>>> user
{1, 2, 3, 4}
>>> user.add(30)
>>> user.add(20)
>>> user
{1, 2, 3, 4, 20, 30}
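To be safe about the user_id-to-row alignment, a minimal sketch that fixes the order explicitly when dumping embeddings (user_histories would be something like DataIterator.graph, and get_user_embs is the helper sketched earlier):

import numpy as np

def dump_user_embs(sess, model, user_histories, batch_size=128):
    user_ids = sorted(user_histories)            # fix a deterministic user order explicitly
    all_embs = []
    for start in range(0, len(user_ids), batch_size):
        batch_ids = user_ids[start:start + batch_size]
        hists = [user_histories[u] for u in batch_ids]
        all_embs.append(get_user_embs(sess, model, hists))
    return user_ids, np.concatenate(all_embs, axis=0)   # IDs and rows now line up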
So the only remaining problem was that the click sequences were scrambled. With the sequences in the correct order, the results are as follows. This... still is not much of a result:
answer length, 1239551
score : [0.01636344 0.1160275 0.03573435]
The work ahead: 1) make proper use of epochs where possible, 2) tune the hyperparameters, 3) switch to multi-GPU if feasible.
Stay tuned for good news.
I ran it again with more iterations and embedding_size=32, but the results got worse. Do I need to adjust the learning rate?
iter: 399000, train loss: 0.8908
time interval: 54.3862 min
iter: 400000, train loss: 2.4162
time interval: 54.5183 min
answer length, 1239551
score : [0.0104962 0.10120842 0.0274579 ]
One more run, this time with the hidden size also set to 32 and the learning rate lowered from 5e-3 to 1e-3. The results are below: not great either, no clear improvement.
iter: 399000, train loss: 0.6971
time interval: 56.5206 min
iter: 400000, train loss: 2.1962
time interval: 56.6556 min
answer length, 1239551
score : [0.01636877 0.12692983 0.03753907]
Calling it a day.