Reading notes on Programming Collective Intelligence

Chapter 2: Making Recommendations

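Almost every website we visit these days shows traces of a recommender system. I always found this mysterious, but after reading this chapter it turns out to be fairly simple (heh, although computing over large volumes of data is still a big problem).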

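This chapter covers how a typical recommender system is implemented, which is almost always done with collaborative filtering. Collaborative filtering ( http://en.wikipedia.org/wiki/Collaborative_filtering ) means finding the people or items whose tastes match yours, then inferring your preferences from what they already like. The author works from practical examples and explains things very well, which suits someone like me who came to this field midway.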

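The author starts with a movie-review example: from each user's reviews, compute the similarity between every pair of users, then use those similarities to recommend movies to each user. (There are many ways to measure the similarity between two users; see http://en.wikipedia.org/wiki/Metric_%28mathematics%29#Examples for examples.) That is finding similar users from their review data, i.e. User-Based Filtering. Next the author shows how to use the same data to find similar movies (finding similar movies from the users' ratings is really Item-Based Filtering). On the difference between the two approaches, the author says: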

This will probably work well for a few thousand people or items, but a very large site like Amazon has millions of customers and products; comparing a user with every other user and then comparing every product each user has rated can be very slow. The technique we have used thus far is called user-based collaborative filtering. An alternative is known as item-based collaborative filtering. In cases with very large datasets, item-based collaborative filtering can give better results, and it allows many of the calculations to be performed in advance so that a user needing recommendations can get them more quickly.

So the item-based approach has to be recomputed offline on a regular basis, to keep an up-to-date dictionary of item similarities.
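To make the user-based side concrete, here is a minimal sketch of my own, not the book's code. The `ratings` data is a made-up toy set, the names `sim_distance` and `recommend` are just illustrative, and the similarity 1 / (1 + Euclidean distance) is only one of the many possible measures mentioned above.

```python
from math import sqrt

# Hypothetical toy data: critic -> {movie: score}. Not the book's dataset.
ratings = {
    "Ann": {"Alien": 5.0, "Brazil": 3.0, "Casablanca": 4.0},
    "Bob": {"Alien": 4.0, "Brazil": 2.0, "Casablanca": 5.0, "Dune": 4.0},
    "Cat": {"Alien": 1.0, "Brazil": 5.0, "Dune": 2.0},
}

def sim_distance(prefs, a, b):
    """Similarity in (0, 1] from Euclidean distance; 0.0 if nothing is shared."""
    shared = [m for m in prefs[a] if m in prefs[b]]
    if not shared:
        return 0.0
    dist = sqrt(sum((prefs[a][m] - prefs[b][m]) ** 2 for m in shared))
    return 1.0 / (1.0 + dist)

def recommend(prefs, user, similarity=sim_distance):
    """Rank items the user has not rated by similarity-weighted average score."""
    totals, sim_sums = {}, {}
    for other in prefs:
        if other == user:
            continue
        sim = similarity(prefs, user, other)
        if sim <= 0:
            continue
        for item, score in prefs[other].items():
            if item not in prefs[user]:
                totals[item] = totals.get(item, 0.0) + score * sim
                sim_sums[item] = sim_sums.get(item, 0.0) + sim
    # Dividing by the summed similarities keeps scores on the rating scale.
    return sorted(((totals[i] / sim_sums[i], i) for i in totals), reverse=True)

print(recommend(ratings, "Ann"))
```

Note that every call compares the user against all other users, which is exactly the cost the quoted passage warns about on a large site.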

 

User-Based or Item-Based Filtering?
Item-based filtering is significantly faster than user-based when getting a list of recommendations for a large dataset, but it does have the additional overhead of maintaining the item similarity table. Also, there is a difference in accuracy that depends on how “sparse” the dataset is. In the movie example, since every critic has rated nearly every movie, the dataset is dense (not sparse). On the other hand, it would be unlikely to find two people with the same set of del.icio.us bookmarks; most bookmarks are saved by a small group of people, leading to a sparse dataset. Item-based filtering usually outperforms user-based filtering in sparse datasets, and the two perform about equally in dense datasets.
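The item similarity table mentioned above could be precomputed along the following lines. This is a rough sketch under my own assumptions: the toy `ratings` data and the helper names (`transpose`, `build_item_table`, `recommend_items`) are hypothetical, and the distance-based similarity is just one possible choice.

```python
from math import sqrt

# Hypothetical toy data: critic -> {movie: score}. Not the book's dataset.
ratings = {
    "Ann": {"Alien": 5.0, "Brazil": 3.0, "Casablanca": 4.0},
    "Bob": {"Alien": 4.0, "Brazil": 2.0, "Casablanca": 5.0, "Dune": 4.0},
    "Cat": {"Alien": 1.0, "Brazil": 5.0, "Dune": 2.0},
}

def transpose(prefs):
    """Flip user -> {item: score} into item -> {user: score}."""
    flipped = {}
    for user, items in prefs.items():
        for item, score in items.items():
            flipped.setdefault(item, {})[user] = score
    return flipped

def sim_distance(prefs, a, b):
    """Similarity in (0, 1] from Euclidean distance; 0.0 if nothing is shared."""
    shared = [k for k in prefs[a] if k in prefs[b]]
    if not shared:
        return 0.0
    dist = sqrt(sum((prefs[a][k] - prefs[b][k]) ** 2 for k in shared))
    return 1.0 / (1.0 + dist)

def build_item_table(prefs, n=10):
    """Offline step: for each item, keep its n most similar items."""
    item_prefs = transpose(prefs)
    table = {}
    for item in item_prefs:
        scores = [(sim_distance(item_prefs, item, other), other)
                  for other in item_prefs if other != item]
        table[item] = sorted(scores, reverse=True)[:n]
    return table

def recommend_items(prefs, table, user):
    """Online step: weight the user's own ratings by the stored similarities."""
    rated = prefs[user]
    totals, sim_sums = {}, {}
    for item, score in rated.items():
        for sim, other in table.get(item, []):
            if other in rated or sim <= 0:
                continue
            totals[other] = totals.get(other, 0.0) + sim * score
            sim_sums[other] = sim_sums.get(other, 0.0) + sim
    return sorted(((totals[i] / sim_sums[i], i) for i in totals), reverse=True)

table = build_item_table(ratings)  # rebuild periodically, offline
print(recommend_items(ratings, table, "Ann"))
```

The split is the point: `build_item_table` is the expensive part and can run on a schedule, while `recommend_items` only touches the items the user has already rated.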

 

 

Exercises:

1. Tanimoto score: http://en.wikipedia.org/wiki/Jaccard_index#Tanimoto_coefficient_.28extended_Jaccard_coefficient.29
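A first attempt at this exercise, assuming the simple set (binary) form of the score: the size of the intersection divided by the size of the union. The function name `tanimoto` and the bookmark sets are my own illustration, and returning 0.0 for two empty sets is a convention I picked, not something the book specifies.

```python
def tanimoto(a, b):
    """Set-based Tanimoto (Jaccard) score: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(a & b) / len(union)

# Hypothetical bookmark sets for two users.
alice = {"python.org", "wikipedia.org", "example.com"}
bob = {"wikipedia.org", "example.com", "example.org"}
print(tanimoto(alice, bob))
```

This fits data that is either present or absent, such as the del.icio.us bookmarks in the quoted passage, where there is no rating to measure a distance on.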
