Collaborative Filtering Recommender Systems and an Application of SVD

Author: 金良 ([email protected]); CSDN blog: http://blog.csdn.net/u012176591

The matrix in the figure below is built from a restaurant's dishes and different users' opinions of them. A user rates a dish with any integer from 1 to 5; a dish the user has not tried is recorded as 0.

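The goal of our recommender system: for a given user, recommend the few dishes he has not tried that he is likely to enjoy most (those with the highest predicted ratings).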

[Figure: rating matrix, rows are users and columns are dishes]

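How recommendation works: to predict which dishes a user will like, the system first finds the dishes the user has not yet tried, then computes the similarity between each such dish and the dishes the user has already rated. If the similarity is high, the algorithm concludes that the user is very likely to enjoy the dish and recommends it.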

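A defining trait of collaborative filtering: it ignores the descriptive attributes of the dishes and computes similarity strictly from the opinions (ratings) of many users.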

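For example, let us compute the similarity between sushi rice and roast beef in the figure above. Using Euclidean distance, the distance between the two is: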


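Now consider the Euclidean distance between sushi rice and Japanese fried chicken (our code uses cosine similarity rather than Euclidean distance):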


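The Euclidean distance from sushi rice to Japanese fried chicken is smaller than its distance to roast beef, so sushi rice is closer to Japanese fried chicken than to roast beef.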
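As a minimal sketch of the comparison just described: the three rating lists below are hypothetical values chosen only for illustration (they are not taken from the article's figure), and `euclid_dist` is an illustrative helper name.

```python
import math

def euclid_dist(a, b):
    """Euclidean distance between two equal-length rating lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical ratings given by the same three users (illustration only).
sushi_rice = [4, 3, 2]
fried_chicken = [4, 3, 1]
roast_beef = [2, 5, 2]

d_chicken = euclid_dist(sushi_rice, fried_chicken)
d_beef = euclid_dist(sushi_rice, roast_beef)

# The dish with the smaller distance is the more similar one.
closer = "fried chicken" if d_chicken < d_beef else "roast beef"
print(d_chicken, d_beef, closer)
```

With these made-up ratings, the smaller distance identifies fried chicken as the dish closer to sushi rice, which mirrors the kind of conclusion drawn above.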

We can use the following formula to compute the similarity between two dishes:
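A common way to turn a distance into a similarity in (0, 1] is 1 / (1 + distance). It is shown here as an assumption, since it may not be the exact formula intended above. The code in this article uses a different measure, cosine similarity rescaled to [0, 1] as 0.5 + 0.5 * cos, which the plain-Python sketch below also mirrors. The function names and ratings are illustrative.

```python
import math

def dist_to_sim(d):
    """Map a distance d >= 0 to a similarity in (0, 1] (assumed formula)."""
    return 1.0 / (1.0 + d)

def cos_sim01(a, b):
    """Cosine similarity rescaled from [-1, 1] to [0, 1]."""
    num = sum(x * y for x, y in zip(a, b))
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 0.5 + 0.5 * (num / denom)

# Hypothetical ratings of two dishes by the same three users.
a = [4, 3, 2]
b = [4, 3, 1]

d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
sim_from_dist = dist_to_sim(d)   # identical vectors would give 1.0
sim_from_cos = cos_sim01(a, b)   # same direction gives 1.0, orthogonal 0.5
print(sim_from_dist, sim_from_cos)
```

Both measures grow as the two rating vectors become more alike, but they are on different scales, so scores from one should not be compared with scores from the other.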



Code for the basic collaborative filtering recommender:

# encoding=UTF-8
'''
Created on Mar 8, 2011

@author: jin
'''
from numpy import *
from numpy import linalg as la


def loadData():
    # Rows are users, columns are items; 0 means "not rated"
    return [[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
            [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
            [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
            [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
            [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
            [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
            [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
            [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
            [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
            [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
            [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]]


def cosSim(inA, inB):
    # Cosine similarity of two column vectors, rescaled from [-1, 1] to [0, 1]
    num = float((inA.T * inB)[0, 0])
    denom = la.norm(inA) * la.norm(inB)
    return 0.5 + 0.5 * (num / denom)


def standEst(dataMat, user, simMeas, item):
    # Estimate the user's rating of `item` as the similarity-weighted
    # average of the ratings the user has already given
    n = shape(dataMat)[1]
    simTotal = 0.0
    ratSimTotal = 0.0
    for j in range(n):
        userRating = dataMat[user, j]
        if userRating == 0:
            continue
        # Users who have rated both `item` and item j
        overLap = nonzero(logical_and(dataMat[:, item].A > 0,
                                      dataMat[:, j].A > 0))[0]
        if len(overLap) == 0:
            similarity = 0
        else:
            similarity = simMeas(dataMat[overLap, item],
                                 dataMat[overLap, j])
        print('the %d and %d similarity is: %f' % (item, j, similarity))
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0:
        return 0
    else:
        print("item:  %d  expected average score:     %s" % (item, ratSimTotal / simTotal))
        return ratSimTotal / simTotal


def recommend(dataMat, user, N=2, simMeas=cosSim, estMethod=standEst):
    # N: number of items to recommend
    unratedItems = nonzero(dataMat[user, :].A == 0)[1]  # find unrated items
    # mat.A converts a matrix to an array: matrix([[0,0,0,0,4,0,0,1,0,4,0]]) becomes array([[0,0,0,0,4,0,0,1,0,4,0]])
    # mat == 0 gives [[ True  True  True  True False  True  True False  True False  True]]
    # nonzero(a == 0) returns the positions where a is 0; nonzero(a != 0) returns the positions where it is not, e.g. (array([0, 0, 0]), array([4, 7, 9]))
    if len(unratedItems) == 0:  # the user has rated every item
        return 'you rated everything'
    else:
        print("Unrated item indices:     %s" % (unratedItems,))
    itemScores = []
    for item in unratedItems:
        estimatedScore = estMethod(dataMat, user, simMeas, item)
        itemScores.append((item, estimatedScore))
    print("Unrated items and their expected average scores:\n%s" % (itemScores,))
    # reverse=True sorts in descending order (False would be ascending); [:N] keeps the first N elements
    return sorted(itemScores, key=lambda my: my[1], reverse=True)[:N]


def Main():
    myMat = mat(loadData())
    recresult = recommend(myMat, 3)  # recommend for user 3 (user indices start at 0)
    print("Top two items and their expected average scores:\n%s" % (recresult,))


Main()

Printed output:

Unrated item indices:     [ 3  4  5  6  9 10]
the 3 and 0 similarity is: 0.000000
the 3 and 1 similarity is: 0.000000
the 3 and 2 similarity is: 0.000000
the 3 and 7 similarity is: 0.000000
the 3 and 8 similarity is: 0.000000
the 4 and 0 similarity is: 0.000000
the 4 and 1 similarity is: 0.000000
the 4 and 2 similarity is: 0.000000
the 4 and 7 similarity is: 1.000000
the 4 and 8 similarity is: 0.000000
item:  4  expected average score:     2.0
the 5 and 0 similarity is: 0.000000
the 5 and 1 similarity is: 0.000000
the 5 and 2 similarity is: 0.000000
the 5 and 7 similarity is: 0.000000
the 5 and 8 similarity is: 0.000000
the 6 and 0 similarity is: 0.000000
the 6 and 1 similarity is: 0.000000
the 6 and 2 similarity is: 0.000000
the 6 and 7 similarity is: 0.000000
the 6 and 8 similarity is: 0.000000
the 9 and 0 similarity is: 0.000000
the 9 and 1 similarity is: 0.000000
the 9 and 2 similarity is: 0.000000
the 9 and 7 similarity is: 1.000000
the 9 and 8 similarity is: 0.000000
item:  9  expected average score:     2.0
the 10 and 0 similarity is: 1.000000
the 10 and 1 similarity is: 1.000000
the 10 and 2 similarity is: 1.000000
the 10 and 7 similarity is: 1.000000
the 10 and 8 similarity is: 1.000000
item:  10  expected average score:     2.8
Unrated items and their expected average scores:
[(3, 0), (4, 2.0), (5, 0), (6, 0), (9, 2.0), (10, 2.7999999999999998)]
Top two items and their expected average scores:
[(10, 2.7999999999999998), (4, 2.0)]
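
Two things in this output can be checked by hand. First, whenever only one user has rated both items, each overlap vector has a single element, so the cosine is 1 and the rescaled similarity is exactly 1.0; in this data set every nonzero similarity above comes from such a single-user overlap. Second, the score of item 10 is the similarity-weighted average of the five ratings user 3 has given. The plain-Python sketch below redoes that arithmetic; `cos_sim01` and `weighted_estimate` are illustrative names, not part of the article's code.

```python
import math

def cos_sim01(a, b):
    """Cosine similarity rescaled to [0, 1], as in cosSim."""
    num = sum(x * y for x, y in zip(a, b))
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 0.5 + 0.5 * (num / denom)

def weighted_estimate(sims, ratings):
    """Similarity-weighted average of ratings; 0 if all similarities are 0."""
    sim_total = sum(sims)
    if sim_total == 0:
        return 0
    return sum(s * r for s, r in zip(sims, ratings)) / sim_total

# Items 10 and 0 are both rated only by user 6 (ratings 1 and 4 in loadData()).
single_overlap_sim = cos_sim01([1], [4])

# User 3 rated items 0, 1, 2, 7, 8 with 3, 3, 4, 2, 2.
ratings_user3 = [3, 3, 4, 2, 2]
score_item10 = weighted_estimate([1.0] * 5, ratings_user3)      # all similarities 1.0
score_item4 = weighted_estimate([0, 0, 0, 1.0, 0], ratings_user3)  # only item 7 counts
print(single_overlap_sim, score_item10, score_item4)
```

Because a single shared rater always yields similarity 1.0, these scores say little about how alike the items really are; they mostly reflect how sparse the matrix is.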

Code for the collaborative filtering recommender using SVD:

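Here we only need to replace the standEst(dataMat, user, simMeas, item) function of the basic recommender with the two functions below (and pass estMethod=svdEst to recommend):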

def getDim(Sigma, threshold):
    # Return the smallest number of leading singular values whose cumulative
    # energy (sum of squares) reaches the fraction `threshold` of the total
    SigmaSqr = []
    for i in range(len(Sigma)):
        if i == 0:
            SigmaSqr.append(Sigma[i]**2)
        else:
            SigmaSqr.append(SigmaSqr[i-1] + Sigma[i]**2)
    total = SigmaSqr[-1]
    for i in range(len(SigmaSqr)):
        if SigmaSqr[i] / total >= threshold:
            return i + 1
    return len(SigmaSqr)


def svdEst(dataMat, user, simMeas, item):
    n = shape(dataMat)[1]
    simTotal = 0.0
    ratSimTotal = 0.0
    U, Sigma, VT = la.svd(dataMat)
    dimN = getDim(Sigma, 0.95)  # keep at least 95% of the energy
    print("dimN:  %d" % dimN)
    SigN = mat(eye(dimN) * Sigma[:dimN])  # arrange the first dimN singular values into a diagonal matrix
    xformedItems = dataMat.T * U[:, :dimN] * SigN.I  # create transformed items
    xformedItems = xformedItems.T  # reduced representation; mat.I is the inverse, mat.T the transpose
    # print("xformedItems: %s" % (xformedItems,))  # uncomment to inspect the reduced matrix
    for j in range(n):
        userRating = dataMat[user, j]
        if userRating == 0 or j == item:
            continue
        # Compare the two items in the reduced space (one column per item)
        similarity = simMeas(xformedItems[:, item], xformedItems[:, j])
        print('the %d and %d similarity is: %f' % (item, j, similarity))
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0:
        return 0
    else:
        print("item:  %d  expected average score:     %s" % (item, ratSimTotal / simTotal))
        return ratSimTotal / simTotal

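getDim() returns the number of dimensions to keep. It takes two arguments: the first is the list of singular values (the nonzero diagonal entries of the diagonal matrix produced by the decomposition), and the second is the minimum fraction of the energy to retain.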
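
To make the energy rule concrete, here is a small plain-Python sketch of the same selection logic. The singular values are hypothetical and `get_dim` is an illustrative re-implementation, not the article's function.

```python
def get_dim(sigma, threshold):
    """Smallest k such that the first k squared singular values
    account for at least `threshold` of the total energy."""
    energy = [s ** 2 for s in sigma]
    total = sum(energy)
    running = 0.0
    for k, e in enumerate(energy, start=1):
        running += e
        if running / total >= threshold:
            return k
    return len(sigma)

# Hypothetical singular values, sorted in descending order as SVD returns them.
sigma = [10.0, 5.0, 2.0, 1.0]
# Squared: 100, 25, 4, 1 (total 130).
# Cumulative fractions: 100/130, 125/130, 129/130, 130/130.
for threshold in (0.75, 0.95, 0.99):
    print(threshold, get_dim(sigma, threshold))
```

A higher threshold keeps more dimensions; with the 0.95 used in svdEst, these example values would keep two.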

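svdEst(dataMat, user, simMeas, item) follows essentially the same approach as standEst(dataMat, user, simMeas, item) above. The main difference is that it first uses SVD to reduce the dimensionality of the matrix, and then compares whole item vectors in the reduced space instead of only the co-rated entries. The key lines for the reduction are:

    xformedItems = dataMat.T * U[:,:dimN] * SigN.I  #create transformed items
    xformedItems = xformedItems.T  # after dimensionality reduction

The reduction turns the 11×11 matrix dataMat into the 4×11 matrix xformedItems. Recall that the rows of the matrix represent users; the reduction is possible because some users have similar tastes and can be grouped together.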
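
A short derivation, offered as a sketch, shows what the two reduction lines compute. Write the SVD as $A = U \Sigma V^{T}$ with $A$ standing for dataMat, let $U_k$ and $V_k$ be the first $k$ columns of $U$ and $V$ (with $k$ = dimN), and let $\Sigma_k$ be the leading $k \times k$ block of $\Sigma$, assumed invertible (the first $k$ singular values are nonzero). The symbols $A$, $k$, $U_k$, $V_k$ and $\Sigma_k$ are notation introduced here, not names from the code.

```latex
\begin{aligned}
A^{T} U_k \Sigma_k^{-1}
  &= V \Sigma^{T} U^{T} U_k \Sigma_k^{-1} \\
  &= V \Sigma^{T} \begin{bmatrix} I_k \\ 0 \end{bmatrix} \Sigma_k^{-1} \\
  &= V \begin{bmatrix} \Sigma_k \\ 0 \end{bmatrix} \Sigma_k^{-1} \\
  &= V \begin{bmatrix} I_k \\ 0 \end{bmatrix} = V_k .
\end{aligned}
```

So after the final transpose, xformedItems equals $V_k^{T}$, that is, the first dimN rows of VT (up to floating-point error). Each item is thus represented by a column of dimN coordinates, and the similarity is computed between those columns.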

The printed output is as follows:

Unrated item indices:     [ 3  4  5  6  9 10]
dimN:  4
the 3 and 0 similarity is: 0.490950
the 3 and 1 similarity is: 0.491294
the 3 and 2 similarity is: 0.491573
the 3 and 7 similarity is: 0.482175
the 3 and 8 similarity is: 0.491307
item:  3  expected average score:     2.80308552617
dimN:  4
the 4 and 0 similarity is: 0.487100
the 4 and 1 similarity is: 0.485583
the 4 and 2 similarity is: 0.485739
the 4 and 7 similarity is: 0.542799
the 4 and 8 similarity is: 0.490037
item:  4  expected average score:     2.78039293924
dimN:  4
the 5 and 0 similarity is: 0.484274
the 5 and 1 similarity is: 0.481516
the 5 and 2 similarity is: 0.482346
the 5 and 7 similarity is: 0.494716
the 5 and 8 similarity is: 0.491228
item:  5  expected average score:     2.79310558538
dimN:  4
the 6 and 0 similarity is: 0.508677
the 6 and 1 similarity is: 0.513688
the 6 and 2 similarity is: 0.512869
the 6 and 7 similarity is: 0.479543
the 6 and 8 similarity is: 0.498249
item:  6  expected average score:     2.81499443018
dimN:  4
the 9 and 0 similarity is: 0.490280
the 9 and 1 similarity is: 0.490272
the 9 and 2 similarity is: 0.490180
the 9 and 7 similarity is: 0.536962
the 9 and 8 similarity is: 0.490078
item:  9  expected average score:     2.78506449633
dimN:  4
the 10 and 0 similarity is: 0.512755
the 10 and 1 similarity is: 0.509709
the 10 and 2 similarity is: 0.510584
the 10 and 7 similarity is: 0.524970
the 10 and 8 similarity is: 0.520290
item:  10  expected average score:     2.79262526188
Unrated items and their expected average scores:
[(3, 2.803085526172937), (4, 2.7803929392359228), (5, 2.7931055853817748), (6, 2.8149944301764482), (9, 2.7850644963293152), (10, 2.7926252618769802)]
Top two items and their expected average scores:
[(6, 2.8149944301764482), (3, 2.803085526172937)]


