hulu关于related show的演化可以由下面的一张图表现出来:
xiangliang在这次报到中提到了单纯使用CF会造成如下问题(直接从ppt中摘录)
CF works in most cases. However, given a fact that users who watch show A also watch B,
other than the relevance between A and B, there could be lots of reasons:
1. A and B release new videos in the same day;
2. B is very popular;
3. A and B are new released shows;
4. A and B do not have same life cycle;
5. A and B are often shown in homepage;
We need to remove such effects when calculating show relevance through CF.
有个关于相关性和CTR(点击率)的trade-off:
如果选择直接优化CTR,会造成推荐产品相关度不高,一般流行的items,或者在用户访问的那个页面上新的items会获得很高的CTR.
所以现在的CTR只是模型的一个因子而已。
Hulu他们还发现一些有关孩子的shows会和一些新闻的shows联系在一起,后来他们认为是因为父母和孩子共享了一个账号,
所以他们发明了一种新的topic model类型:基于用户行为的topic model.
他们发现show A 和 show B 的相关性受一下因素的影响:
1. Metadata of show A and show B
2. How much percentage of users watch both A and B
3. CTR of show B on show A's page
他们使用机器学习的方法来融合这所有的factor.
他们使用了如下一个框架:
使用CF和Content方法算出每两部电影的相似度,然后每个fator都有他们的权重,然后再由领域专家打个label(具体是不是没两部电影之间都要用专家打label就不知道了)
然后他们使用ensemble的方法来训练,ensemble的方法:
We have tried
Logistic Regression
Decision Tree (C4.5, Random Forest, Boosting Tree)
And
Decision Tree performs much better than Logistic Regression
Because
This is a non-linear learning problem