Author: Jin Liang (金良) ([email protected]) CSDN blog: http://blog.csdn.net/u012176591
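The matrix in the figure below consists of a restaurant's dishes and different users' opinions of them. Users rate a dish with any integer from 1 to 5; if a user has not tried a dish, its rating is 0. Our recommender system does the following: for a given user, it picks, from the dishes the user has not tried, the few they are likely to enjoy most (those with the highest predicted rating).
How recommendation works: to predict whether a user will like a dish, the system first finds the dishes the user has not tried. For each such dish it then computes the similarity between that dish and the dishes the user has already rated. If the similarity to highly rated dishes is high, the algorithm concludes that the user is very likely to enjoy the dish and recommends it.
What characterizes collaborative filtering: it ignores the descriptive attributes of the dishes and computes similarity strictly from the opinions (ratings) of many users.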
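The mechanism described here amounts to a similarity-weighted average of the ratings the user has already given. The following is a minimal sketch of that idea only; the function name `weighted_estimate`, the ratings, and the similarity values are hypothetical and are not taken from the article's data.

```python
def weighted_estimate(ratings, similarities):
    """Estimate a rating for an untried dish as a similarity-weighted average.

    ratings:      the user's ratings of dishes they have tried (hypothetical values)
    similarities: similarity of each tried dish to the untried dish, same order
    """
    sim_total = sum(similarities)
    if sim_total == 0:
        return 0.0  # no tried dish is similar to the target, so no estimate
    weighted = sum(s * r for s, r in zip(similarities, ratings))
    return weighted / sim_total


# Hypothetical example: three tried dishes rated 5, 3 and 1, whose
# similarities to the untried dish are 0.5, 0.25 and 0.25.
estimate = weighted_estimate([5, 3, 1], [0.5, 0.25, 0.25])
print(estimate)  # (2.5 + 0.75 + 0.25) / 1.0 = 3.5
```

Dishes that resemble the target more closely pull the estimate toward their own rating, which is why a dish similar to the user's favorites ends up with a high predicted score.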
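For example, consider the similarity between sushi rice and roast beef in the figure above. Using Euclidean distance, the distance between them is:
Next, the Euclidean distance between sushi rice and Japanese fried chicken (note that our code uses cosine similarity instead):
The Euclidean distance between sushi rice and Japanese fried chicken is smaller than the distance between sushi rice and roast beef, so sushi rice is closer to Japanese fried chicken than to roast beef.
We can use the following formula to compute the similarity between two dishes: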
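The formula images from the original post are not reproduced in this text. As a hedged illustration, the sketch below computes the Euclidean distance between two rating columns and converts it to a similarity with 1 / (1 + distance), a common choice that maps distance 0 to similarity 1. That conversion is an assumption, not a formula confirmed by the article, and the rating vectors and function names are hypothetical.

```python
import numpy as np


def euclid_distance(a, b):
    """Euclidean distance between two equally long rating vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))


def euclid_sim(a, b):
    """Map a distance in [0, inf) to a similarity in (0, 1] (assumed form)."""
    return 1.0 / (1.0 + euclid_distance(a, b))


# Hypothetical ratings of three dishes by the same three users.
dish_a = [4, 4, 5]
dish_b = [4, 1, 1]   # differs from dish_a by (0, 3, 4): distance 5
dish_c = [4, 4, 2]   # differs from dish_a by (0, 0, 3): distance 3

print(euclid_distance(dish_a, dish_b), euclid_sim(dish_a, dish_b))
print(euclid_distance(dish_a, dish_c), euclid_sim(dish_a, dish_c))
```

The smaller distance between `dish_a` and `dish_c` yields the larger similarity, which is the comparison the paragraph above makes between the two pairs of dishes.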
Code for the basic collaborative filtering recommender:
# encoding=UTF-8
'''
Created on Mar 8, 2011

@author: jin
'''
from numpy import *
from numpy import linalg as la


def loadData():
    # Rows are users, columns are items; 0 means "not rated"
    return [[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
            [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
            [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
            [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
            [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
            [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
            [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
            [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
            [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
            [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
            [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]]


def cosSim(inA, inB):
    # Cosine similarity of two column vectors, rescaled from [-1, 1] to [0, 1]
    num = float((inA.T * inB)[0, 0])
    denom = la.norm(inA) * la.norm(inB)
    return 0.5 + 0.5 * (num / denom)


def standEst(dataMat, user, simMeas, item):
    n = shape(dataMat)[1]  # number of items (columns)
    simTotal = 0.0
    ratSimTotal = 0.0
    for j in range(n):
        userRating = dataMat[user, j]
        if userRating == 0:
            continue  # skip items this user has not rated
        # Indices of the users who rated both `item` and item j
        overLap = nonzero(logical_and(dataMat[:, item].A > 0,
                                      dataMat[:, j].A > 0))[0]
        if len(overLap) == 0:
            similarity = 0
        else:
            similarity = simMeas(dataMat[overLap, item],
                                 dataMat[overLap, j])
        print('the %d and %d similarity is: %f' % (item, j, similarity))
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0:
        return 0
    else:
        print("item: ", item, " expected average score: ", ratSimTotal / simTotal)
        return ratSimTotal / simTotal


def recommend(dataMat, user, N=2, simMeas=cosSim, estMethod=standEst):
    # N: number of items to recommend
    unratedItems = nonzero(dataMat[user, :].A == 0)[1]  # find unrated items
    # mat.A converts a matrix to an array: matrix([[0,0,0,0,4,0,0,1,0,4,0]])
    # becomes array([[0,0,0,0,4,0,0,1,0,4,0]]).
    # arr == 0 then gives [[ True True True True False True True False True False True]].
    # nonzero(cond) returns the positions where cond is True; for the row above,
    # nonzero(arr != 0) gives (array([0, 0, 0]), array([4, 7, 9])).
    if len(unratedItems) == 0:  # this user has rated every item
        return 'you rated everything'
    else:
        print("Unrated item indices: ", unratedItems)
        itemScores = []
        for item in unratedItems:
            estimatedScore = estMethod(dataMat, user, simMeas, item)
            itemScores.append((item, estimatedScore))
        print("Unrated items and their estimated average scores: \n", itemScores)
        # reverse=True sorts in descending order (False would be ascending);
        # [:N] keeps the first N entries
        return sorted(itemScores, key=lambda my: my[1], reverse=True)[:N]


def Main():
    myMat = matrix(loadData())
    recresult = recommend(myMat, 3)  # recommend items for user 3 (user indices start at 0)
    print("Top two items and their average scores: \n", recresult)


Main()

Printed output:
Unrated item indices: [ 3 4 5 6 9 10]
the 3 and 0 similarity is: 0.000000
the 3 and 1 similarity is: 0.000000
the 3 and 2 similarity is: 0.000000
the 3 and 7 similarity is: 0.000000
the 3 and 8 similarity is: 0.000000
the 4 and 0 similarity is: 0.000000
the 4 and 1 similarity is: 0.000000
the 4 and 2 similarity is: 0.000000
the 4 and 7 similarity is: 1.000000
the 4 and 8 similarity is: 0.000000
item: 4 expected average score: 2.0
the 5 and 0 similarity is: 0.000000
the 5 and 1 similarity is: 0.000000
the 5 and 2 similarity is: 0.000000
the 5 and 7 similarity is: 0.000000
the 5 and 8 similarity is: 0.000000
the 6 and 0 similarity is: 0.000000
the 6 and 1 similarity is: 0.000000
the 6 and 2 similarity is: 0.000000
the 6 and 7 similarity is: 0.000000
the 6 and 8 similarity is: 0.000000
the 9 and 0 similarity is: 0.000000
the 9 and 1 similarity is: 0.000000
the 9 and 2 similarity is: 0.000000
the 9 and 7 similarity is: 1.000000
the 9 and 8 similarity is: 0.000000
item: 9 expected average score: 2.0
the 10 and 0 similarity is: 1.000000
the 10 and 1 similarity is: 1.000000
the 10 and 2 similarity is: 1.000000
the 10 and 7 similarity is: 1.000000
the 10 and 8 similarity is: 1.000000
item: 10 expected average score: 2.8
Unrated items and their estimated average scores:
[(3, 0), (4, 2.0), (5, 0), (6, 0), (9, 2.0), (10, 2.7999999999999998)]
Top two items and their average scores:
[(10, 2.7999999999999998), (4, 2.0)]
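These scores can be checked by hand. User 3's row in the data matrix is [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0], so the rated items are 0, 1, 2, 7 and 8. In the notation below (introduced here for illustration, not used in the original post), s_{i,j} is the printed similarity between items i and j, and r_{3,j} is user 3's rating of item j.

```latex
\hat{r}_{3,10}
  = \frac{\sum_{j} s_{10,j}\, r_{3,j}}{\sum_{j} s_{10,j}}
  = \frac{1\cdot 3 + 1\cdot 3 + 1\cdot 4 + 1\cdot 2 + 1\cdot 2}{1+1+1+1+1}
  = \frac{14}{5} = 2.8
\qquad
\hat{r}_{3,4}
  = \frac{s_{4,7}\, r_{3,7}}{s_{4,7}}
  = \frac{1\cdot 2}{1} = 2.0
```

Every similarity of exactly 1.0 in this output comes from a pair of items that share a single common rater: the cosine of two one-element positive vectors is always 1, so the rescaled value 0.5 + 0.5 × 1 is 1.0 no matter how different the two ratings are. With data this sparse, the basic estimator therefore has little information to separate the candidates.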
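Code for the SVD-based collaborative filtering recommender:
Here you only need to replace the standEst(dataMat, user, simMeas, item) function of the basic recommender with the two functions below, and pass estMethod=svdEst when calling recommend: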
def getDim(Sigma, threshold):
    # Cumulative sums of the squared singular values (the "energy")
    SigmaSqr = []
    for i in range(len(Sigma)):
        if i == 0:
            SigmaSqr.append(Sigma[i]**2)
        else:
            SigmaSqr.append(SigmaSqr[i-1] + Sigma[i]**2)
    total = SigmaSqr[-1]
    # Return the smallest number of dimensions whose share of the
    # total energy reaches the threshold
    for i in range(len(SigmaSqr)):
        if SigmaSqr[i] / total >= threshold:
            return i + 1
    return len(SigmaSqr)


def svdEst(dataMat, user, simMeas, item):
    n = shape(dataMat)[1]  # number of items (columns)
    simTotal = 0.0
    ratSimTotal = 0.0
    U, Sigma, VT = la.svd(dataMat)
    dimN = getDim(Sigma, 0.95)
    print("dimN: ", dimN)
    SigN = matrix(eye(dimN) * Sigma[:dimN])  # arrange the first dimN singular values into a diagonal matrix
    xformedItems = dataMat.T * U[:, :dimN] * SigN.I  # create transformed items
    xformedItems = xformedItems.T  # reduced representation; mat.I is the inverse, mat.T the transpose
    print("xformedItems: ", xformedItems)
    for j in range(n):
        userRating = dataMat[user, j]
        if userRating == 0 or j == item:
            continue
        similarity = simMeas(xformedItems[:, item], xformedItems[:, j])
        print('the %d and %d similarity is: %f' % (item, j, similarity))
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0:
        return 0
    else:
        print("item: ", item, " expected average score: ", ratSimTotal / simTotal)
        return ratSimTotal / simTotal
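The getDim() function returns the number of dimensions to keep. It takes two arguments: the first is the list of singular values (the diagonal entries of the diagonal matrix produced by the decomposition), and the second is the minimum fraction of the energy to retain.
The svdEst(dataMat, user, simMeas, item) function follows essentially the same approach as the earlier standEst(dataMat, user, simMeas, item). The difference is that it uses SVD to reduce the dimensionality of the matrix and computes similarities between the reduced item vectors rather than over the users who rated both items. The key code for the reduction is: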
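As a sketch of the same rule, the retained-energy criterion can also be written with a cumulative sum. The function name `dims_for_energy` and the singular values below are hypothetical; only the rule itself (keep the fewest dimensions whose squared singular values reach the threshold fraction of the total) comes from the description above.

```python
import numpy as np


def dims_for_energy(sigma, threshold):
    """Smallest k such that the first k squared singular values hold
    at least `threshold` of the total energy."""
    energy = np.cumsum(np.asarray(sigma, dtype=float) ** 2)
    ratios = energy / energy[-1]          # non-decreasing, last entry is 1.0
    # First position whose cumulative ratio is >= threshold
    k = int(np.searchsorted(ratios, threshold, side="left")) + 1
    return min(k, len(ratios))            # cap at the number of singular values


# Hypothetical singular values; squares are 16, 4, 4, 1 (total 25),
# so the cumulative ratios are 0.64, 0.80, 0.96, 1.00.
sigma = [4.0, 2.0, 2.0, 1.0]
k95 = dims_for_energy(sigma, 0.95)
k50 = dims_for_energy(sigma, 0.50)
print(k95, k50)
```

With these values, three dimensions are needed to pass 95% of the energy, while a single dimension already exceeds 50%.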
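xformedItems = dataMat.T * U[:,:dimN] * SigN.I  # create transformed items
xformedItems = xformedItems.T  # after dimensionality reduction

This reduction converts the 11×11 matrix dataMat into the 4×11 matrix xformedItems. Recall that the rows of the matrix represent users; the reduction is possible because some users have similar tastes and can be grouped together.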
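The shape change can be verified directly. The sketch below uses plain NumPy arrays instead of the matrix class, copies the article's 11×11 data, and takes k = 4 from the dimN value reported in the article's output rather than recomputing it. It also checks a standard SVD identity: because the data matrix equals U · diag(Sigma) · VT, the projection dataᵀ · U_k · Sigma_k⁻¹ is just the first k rows of VT.

```python
import numpy as np

# The article's rating matrix: rows are users, columns are items.
data = np.array([
    [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
    [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
    [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
    [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
    [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
    [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
    [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
    [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
    [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
    [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
    [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0],
], dtype=float)

U, sigma, VT = np.linalg.svd(data)

k = 4  # assumption: the dimN value printed in the article's output
sig_k_inv = np.linalg.inv(np.diag(sigma[:k]))

# Each item (a column of data) is mapped to k coordinates.
items_k = (data.T @ U[:, :k] @ sig_k_inv).T

print(data.shape, "->", items_k.shape)
# The projection coincides with the leading k right singular vectors.
same_as_vt = np.allclose(items_k, VT[:k, :])
print(same_as_vt)
```

Each of the 11 items is now described by 4 numbers instead of 11 user ratings, and those 4-element columns are what the similarity measure compares in svdEst.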
Printed output:
Unrated item indices: [ 3 4 5 6 9 10]
dimN: 4
the 3 and 0 similarity is: 0.490950
the 3 and 1 similarity is: 0.491294
the 3 and 2 similarity is: 0.491573
the 3 and 7 similarity is: 0.482175
the 3 and 8 similarity is: 0.491307
item: 3 expected average score: 2.80308552617
dimN: 4
the 4 and 0 similarity is: 0.487100
the 4 and 1 similarity is: 0.485583
the 4 and 2 similarity is: 0.485739
the 4 and 7 similarity is: 0.542799
the 4 and 8 similarity is: 0.490037
item: 4 expected average score: 2.78039293924
dimN: 4
the 5 and 0 similarity is: 0.484274
the 5 and 1 similarity is: 0.481516
the 5 and 2 similarity is: 0.482346
the 5 and 7 similarity is: 0.494716
the 5 and 8 similarity is: 0.491228
item: 5 expected average score: 2.79310558538
dimN: 4
the 6 and 0 similarity is: 0.508677
the 6 and 1 similarity is: 0.513688
the 6 and 2 similarity is: 0.512869
the 6 and 7 similarity is: 0.479543
the 6 and 8 similarity is: 0.498249
item: 6 expected average score: 2.81499443018
dimN: 4
the 9 and 0 similarity is: 0.490280
the 9 and 1 similarity is: 0.490272
the 9 and 2 similarity is: 0.490180
the 9 and 7 similarity is: 0.536962
the 9 and 8 similarity is: 0.490078
item: 9 expected average score: 2.78506449633
dimN: 4
the 10 and 0 similarity is: 0.512755
the 10 and 1 similarity is: 0.509709
the 10 and 2 similarity is: 0.510584
the 10 and 7 similarity is: 0.524970
the 10 and 8 similarity is: 0.520290
item: 10 expected average score: 2.79262526188
Unrated items and their estimated average scores:
[(3, 2.803085526172937), (4, 2.7803929392359228), (5, 2.7931055853817748), (6, 2.8149944301764482), (9, 2.7850644963293152), (10, 2.7926252618769802)]
Top two items and their average scores:
[(6, 2.8149944301764482), (3, 2.803085526172937)]