#Studying Programming Collective Intelligence
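Summary: this post discusses some problems I ran into and methods I learned, and summarizes how a few of the algorithms are implemented.
Note: the edition I refer to was published in 2009. [The packages used in the book have changed since then, but the equivalent functionality can be found by consulting the documentation of the relevant packages.]
Goal: find a group of people whose tastes are close to ours (recommendation algorithm)
Related: because users rate things, our algorithm can use those ratings to build a filtering system, that is, it matches people with the same tastes and recommends the items they are interested in.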
Dictionary structure, database
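A minimal sketch of the nested-dictionary structure, assuming the layout person → item → score. The names and ratings below are made up for illustration and are not the book's dataset, and `shared_items` is a hypothetical helper.

```python
# Hypothetical ratings: outer key = person, inner key = item, value = score.
prefs = {
    'Alice': {'Movie A': 4.0, 'Movie B': 2.5, 'Movie C': 5.0},
    'Bob': {'Movie A': 3.5, 'Movie C': 4.5},
    'Carol': {'Movie B': 1.0, 'Movie C': 3.0, 'Movie D': 4.0},
}


def shared_items(prefs, p1, p2):
    """Return the items that both p1 and p2 have rated, sorted by name."""
    return sorted(item for item in prefs[p1] if item in prefs[p2])


print(shared_items(prefs, 'Alice', 'Bob'))
```

Every similarity measure in this post starts from this kind of lookup: only the items that both people rated can be compared.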
Euclidean distance, Pearson correlation
s = sqrt(pow(x1-x2, 2) + pow(y1-y2, 2) )
The smaller the distance, the greater the similarity, which gives the following conversion: sim_distance = 1 / (1+s)
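A minimal sketch of this conversion, generalized from two coordinates to every item that both people rated. The `prefs` layout (person → item → score) and the sample data are assumptions for illustration.

```python
from math import sqrt


def sim_distance(prefs, p1, p2):
    """Euclidean-distance similarity in (0, 1]; returns 0 if nothing is shared."""
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    if not shared:
        return 0
    # s is the Euclidean distance over the shared items
    s = sqrt(sum((prefs[p1][it] - prefs[p2][it]) ** 2 for it in shared))
    return 1 / (1 + s)


# Hypothetical ratings, not taken from the book
prefs = {
    'Alice': {'Movie A': 4.0, 'Movie B': 1.0},
    'Bob': {'Movie A': 1.0, 'Movie B': 5.0},
}
print(sim_distance(prefs, 'Alice', 'Bob'))
```

Adding 1 to the denominator keeps the score finite when two people agree exactly (s = 0 gives a similarity of 1).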
The source code for the Pearson correlation is as follows:

    from math import sqrt

    # Returns the Pearson correlation coefficient for p1 and p2
    def sim_pearson(prefs, p1, p2):
        # Get the list of mutually rated items
        si = {}
        for item in prefs[p1]:
            if item in prefs[p2]:
                si[item] = 1
        # If there are no ratings in common, return 0
        if len(si) == 0:
            return 0
        # Number of items rated by both people
        n = len(si)
        # Sums of all the preferences
        sum1 = sum([prefs[p1][it] for it in si])
        sum2 = sum([prefs[p2][it] for it in si])
        # Sums of the squares
        sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
        sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])
        # Sum of the products
        pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])
        # Calculate r (Pearson score)
        num = pSum - (sum1 * sum2 / n)
        den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
        # No variance in one of the rating lists: avoid dividing by zero
        if den == 0:
            return 0
        r = num / den
        return r

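The algorithm is related to covariance in probability theory: the Pearson coefficient is the covariance of the two rating lists divided by the product of their standard deviations.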
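To make the link to covariance concrete, here is a small sketch that computes the coefficient directly as covariance over the product of standard deviations. The function name `pearson_from_covariance` and the sample numbers are hypothetical. Multiplying its numerator and denominator by n gives the sum-based form used in sim_pearson.

```python
from math import sqrt


def pearson_from_covariance(xs, ys):
    """Pearson r = cov(x, y) / (std(x) * std(y)); 0 if either list has no variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Population covariance and variances (dividing by n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    if var_x == 0 or var_y == 0:
        return 0
    return cov / sqrt(var_x * var_y)


# Hypothetical ratings by two people for the same four items
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.5]
print(pearson_from_covariance(xs, ys))
```

Because the means are subtracted first, a person who consistently rates higher than another can still correlate perfectly with them, which the Euclidean score does not capture.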
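Jaccard coefficient
It compares the similarity of texts and is used for duplicate detection and deduplication; it can also measure the distance between objects, for data clustering and similar tasks.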
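A minimal sketch of the Jaccard coefficient on word sets. The helper name `jaccard`, the sample sentences, and the choice to return 1.0 for two empty sets are assumptions made for this example.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union, treating inputs as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # convention assumed here: two empty sets count as identical
    return len(set_a & set_b) / len(union)


# Hypothetical documents, split into words
doc1 = 'the quick brown fox'.split()
doc2 = 'the lazy brown dog'.split()
print(jaccard(doc1, doc2))  # 2 shared words out of 6 distinct words
```

A Jaccard distance for clustering can be taken as 1 minus this value.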
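Manhattan distance
It is the sum of the absolute differences along each axis between two points in a standard coordinate system.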
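As a short illustration, assuming points are given as equal-length sequences of numbers (the function name `manhattan` is hypothetical):

```python
def manhattan(p, q):
    """Sum of absolute per-axis differences between two points."""
    if len(p) != len(q):
        raise ValueError('points must have the same number of coordinates')
    return sum(abs(a - b) for a, b in zip(p, q))


print(manhattan((1, 2), (4, 6)))  # |1-4| + |2-6| = 3 + 4 = 7
```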
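Ranking by similarity
The implementation is as follows (see the reference source code):

    def topMatches(prefs, person, n=5, similarity=sim_pearson):
        # Score every other person against `person`
        scores = [(similarity(prefs, person, other), other)
                  for other in prefs if other != person]
        # Sort so that the highest scores come first
        scores.sort()
        scores.reverse()
        return scores[0:n]

    # critics is the book's sample ratings dictionary; n=3 matches the output shown
    print(topMatches(critics, 'Toby', n=3))
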
Output: [(0.9912407071619299, 'Lisa Rose'), (0.9244734516419049, 'Mick LaSalle'), (0.8934051474415647, 'Claudia Puig')]
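See the related code in the reference book.
Finally, we can score the movies a person has not seen (that is, predict their ratings) and use those predictions to recommend movies.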
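A sketch of how such a prediction can work, assuming a similarity-weighted average of other people's ratings. The function name `get_recommendations`, the use of the Euclidean similarity, and the sample data are choices made for this example, not necessarily what the book's code does.

```python
from math import sqrt


def sim_distance(prefs, p1, p2):
    """Euclidean-distance similarity, 1 / (1 + s), over shared items."""
    shared = [it for it in prefs[p1] if it in prefs[p2]]
    if not shared:
        return 0
    s = sqrt(sum((prefs[p1][it] - prefs[p2][it]) ** 2 for it in shared))
    return 1 / (1 + s)


def get_recommendations(prefs, person, similarity=sim_distance):
    """Predict scores for items `person` has not rated, best first."""
    totals = {}    # item -> sum of (rating * similarity)
    sim_sums = {}  # item -> sum of similarities of the people who rated it
    for other in prefs:
        if other == person:
            continue
        sim = similarity(prefs, person, other)
        if sim <= 0:
            continue  # ignore people with no positive similarity
        for item, rating in prefs[other].items():
            if item in prefs[person]:
                continue  # only score unseen items
            totals[item] = totals.get(item, 0.0) + rating * sim
            sim_sums[item] = sim_sums.get(item, 0.0) + sim
    # Dividing by the similarity sum normalizes for how many people rated it
    rankings = [(total / sim_sums[item], item) for item, total in totals.items()]
    rankings.sort(reverse=True)
    return rankings


# Hypothetical ratings, not taken from the book
prefs = {
    'Alice': {'Movie A': 4.0, 'Movie B': 2.0},
    'Bob': {'Movie A': 4.0, 'Movie B': 2.0, 'Movie C': 5.0},
    'Carol': {'Movie A': 1.0, 'Movie B': 5.0, 'Movie C': 1.0, 'Movie D': 3.0},
}
print(get_recommendations(prefs, 'Alice'))
```

Bob agrees with Alice exactly, so his rating dominates the prediction for 'Movie C', while Carol's dissimilar taste only pulls it down slightly.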
###5. Extending the algorithm (applying the ideas flexibly)
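I had trouble finding a list of data to work with; how to build the dataset we want to use is something we need to think about. I will relearn web scraping for this and cover it in a separate post.

    def loadMovieLens(path='C:\\'):
        # Remember to write \\ in the path: a single backslash starts an escape
        # sequence (for example, \t becomes a Tab)
        # Get movie titles
        movies = {}
        # encoding added here on the assumption that u.item is ISO-8859-1 encoded
        for line in open(path + 'u.item', encoding='latin-1'):
            (id, title) = line.split('|')[0:2]
            movies[id] = title
        # Load data
        prefs = {}
        for line in open(path + 'u.data'):
            (user, movieid, rating, ts) = line.split('\t')
            prefs.setdefault(user, {})
            prefs[user][movies[movieid]] = float(rating)
        return prefs

    prefs = loadMovieLens()
    print(prefs['45'])  # user id picked at random
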
Sample output:
{'Birdcage, The (1996)': 4.0, 'Mystery Science Theater 3000: The Movie (1996)': 5.0, 'Twister (1996)': 4.0, 'Happy Gilmore (1996)': 2.0, 'James and the Giant Peach (1996)': 3.0, 'Dragonheart (1996)': 3.0, 'Godfather, The (1972)': 5.0, 'Independence Day (ID4) (1996)': 4.0, 'Evening Star, The (1996)': 2.0, 'First Wives Club, The (1996)': 3.0, 'Leaving Las Vegas (1995)': 5.0, "Mr. Holland's Opus (1995)": 4.0, 'Fargo (1996)': 5.0, 'Jerry Maguire (1996)': 4.0, 'Blue in the Face (1995)': 4.0, 'Toy Story (1995)': 5.0, 'Muppet Treasure Island (1996)': 3.0, 'Adventures of Pinocchio, The (1996)': 3.0, 'Kids in the Hall: Brain Candy (1996)': 4.0, 'Mulholland Falls (1996)': 4.0, '101 Dalmatians (1996)': 4.0, 'Beautiful Girls (1996)': 4.0, "Preacher's Wife, The (1996)": 2.0, 'Space Jam (1996)': 4.0, 'Father of the Bride Part II (1995)': 2.0, 'Star Wars (1977)': 5.0, 'Hunchback of Notre Dame, The (1996)': 3.0, 'Nutty Professor, The (1996)': 3.0, 'Mighty Aphrodite (1995)': 5.0, 'Scream (1996)': 3.0, 'Eraser (1996)': 3.0, 'Men in Black (1997)': 5.0, "Don't Be a Menace to South Central While Drinking Your Juice in the Hood (1996)": 2.0, 'Time to Kill, A (1996)': 4.0, 'Stupids, The (1996)': 3.0, 'Twelve Monkeys (1995)': 3.0, 'Willy Wonka and the Chocolate Factory (1971)': 2.0, 'Down Periscope (1996)': 3.0, 'Truth About Cats & Dogs, The (1996)': 4.0, 'Bed of Roses (1996)': 3.0, 'Ransom (1996)': 4.0, 'Tin Cup (1996)': 4.0, 'That Thing You Do! (1996)': 4.0, 'Return of the Jedi (1983)': 4.0, 'Rumble in the Bronx (1995)': 3.0, 'Phantom, The (1996)': 3.0, 'If Lucy Fell (1996)': 4.0, 'Hercules (1997)': 4.0}
1.unhashable type: 'slice'
2.float division by zero
3.The output list shows ratings that are all zero (when building the item-based similarity dictionary)
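The post does not show the code that triggered these errors, so the following is only a guess at typical causes of the first two and one way to avoid each. All names and numbers here are hypothetical.

```python
# Hypothetical scores keyed by person
scores = {'Lisa': 0.99, 'Mick': 0.92, 'Claudia': 0.89}

# Error 1: slicing a dict as if it were a list
try:
    scores[0:2]
except (TypeError, KeyError) as err:
    # Older Python 3 versions raise TypeError: unhashable type: 'slice';
    # versions where slice objects are hashable raise KeyError instead.
    slice_error = type(err).__name__

# One way around it: turn the dict into a sorted list first, then slice
top_two = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[0:2]


# Error 2: dividing by a similarity sum that happens to be zero
def safe_ratio(total, sim_sum):
    """Return total / sim_sum, or 0.0 when the divisor is zero."""
    return total / sim_sum if sim_sum != 0 else 0.0


print(slice_error, top_two, safe_ratio(1.0, 0))
```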