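Let A be an m×n real matrix. Then there exists a factorization A = UΣV^T, where U is an m×m orthogonal matrix, Σ is an m×n rectangular diagonal matrix whose diagonal entries are non-negative, and V^T is an n×n orthogonal matrix. This factorization is called the singular value decomposition of A. The diagonal entries Σi,i of Σ are the singular values of A.
By convention, the singular values are arranged in descending order. With this ordering, Σ is uniquely determined by A (U and V, however, are not).
The columns of U and the columns of V are the left and right singular vectors of A, respectively.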
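To make the definition concrete, here is a small sketch that checks a hand-worked decomposition of a 2×2 matrix using only the standard library. The matrix A and the helper names `matmul` and `transpose` are illustrative choices, not part of the original text. The factors come from the eigenvectors of A^T A, whose eigenvalues are 45 and 5.

```python
import math

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]

# Illustrative matrix (an assumption for this example).
A = [[3.0, 0.0], [4.0, 5.0]]

s10, s2 = math.sqrt(10), math.sqrt(2)
# Columns of U are the left singular vectors.
U = [[1 / s10, 3 / s10], [3 / s10, -1 / s10]]
# Singular values sqrt(45) and sqrt(5), sorted in descending order.
Sigma = [[math.sqrt(45), 0.0], [0.0, math.sqrt(5)]]
# Columns of V are the right singular vectors.
V = [[1 / s2, 1 / s2], [1 / s2, -1 / s2]]

# U * Sigma * V^T should reproduce A up to rounding error.
reconstructed = matmul(matmul(U, Sigma), transpose(V))
# U^T * U should be the identity because U is orthogonal.
UtU = matmul(transpose(U), U)

print(reconstructed)
print(UtU)
```

In practice a numerical library would compute the factors. The point here is only that the three matrices satisfy the properties listed above.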
The MovieLens dataset
MovieLens is a dataset that records users' ratings of movies.
See MovieLens for a detailed description.
The data format is as follows:
u.data -- The full u data set, 100000 ratings by 943 users on 1682 items. Each user has rated at least 20 movies. Users and items are numbered consecutively from 1. The data is randomly ordered. This is a tab separated list of user id | item id | rating | timestamp. The time stamps are unix seconds since 1/1/1970 UTC
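That is, 943 users gave 100,000 ratings (on a scale of 1 to 5) to 1,682 movies. These ratings can be arranged into a 943×1682 rating matrix R, most of whose entries are missing.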
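As a rough sketch of how such a rating matrix could be assembled from tab-separated lines in the u.data format, consider the snippet below. The function name `build_rating_matrix`, the use of 0 to mark a missing rating, and the three sample rows are all assumptions made for illustration; they are not taken from the dataset. With 100,000 ratings in a 943×1682 grid, only about 6.3% of the entries would be filled.

```python
def build_rating_matrix(lines, n_users, n_items):
    """Build a dense rating matrix from 'user\\titem\\trating\\ttimestamp' lines.

    Missing ratings are stored as 0 (an assumption of this sketch).
    """
    R = [[0] * n_items for _ in range(n_users)]
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        user_id, item_id, rating = line.split("\t")[:3]
        # ids in the file start at 1, list indices start at 0
        R[int(user_id) - 1][int(item_id) - 1] = int(rating)
    return R

# Made-up sample rows in the u.data layout.
sample = [
    "1\t2\t5\t881250949",
    "3\t1\t4\t891717742",
    "2\t3\t1\t878887116",
]
R = build_rating_matrix(sample, n_users=3, n_items=3)
for row in R:
    print(row)
```

A dense list of lists is fine at this scale, but because most entries are missing, storing only the observed (user, item, rating) triples, as the demo's Rating class does, uses far less memory.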
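The problem is defined as follows: given the dataset above, predict the rating that user 1 would give to movie M. User 1's rating of movie M does not appear in the training data, so the prediction can rely only on user 1's ratings of other movies and on other users' ratings of movie M.
The idea behind SVD (Singular Value Decomposition) here is to use the existing ratings to infer how much each user likes each factor and how much of each factor each movie contains, and then to use those results to predict ratings. The factors of a movie can be thought of as things like how funny it is, how melodramatic its love story is, how scary it is, and so on. More abstractly, the idea is to factor an N-row, M-column rating matrix R (where R[u][i] is the rating that user u gave to item i) into an N-row, F-column user factor matrix P (where P[u][k] is how much user u likes factor k) and an M-row, F-column item factor matrix Q (where Q[i][k] is how much of factor k item i contains). As a formula:
R ≈ P * T(Q) // T(Q) denotes the transpose of Q
To predict user 1's rating of movie M, then, take the dot product of user 1's preference for each factor with the amount of each factor that movie M contains. It is that simple.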
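A tiny worked example may help. The factor values below are invented for illustration (two factors, two users, three movies), and `predict_rating` is a hypothetical helper name. The sketch only shows that a predicted rating is the dot product of one row of P with one row of Q.

```python
def predict_rating(P, Q, u, i):
    """Predicted rating of item i by user u: dot product of P[u] and Q[i]."""
    return sum(p * q for p, q in zip(P[u], Q[i]))

# Invented factor matrices: rows of P are users, rows of Q are movies.
P = [[1.0, 0.5],
     [0.2, 2.0]]
Q = [[2.0, 1.0],
     [0.5, 1.5],
     [1.0, 0.0]]

# User 0, movie 1: 1.0 * 0.5 + 0.5 * 1.5 = 1.25
print(predict_rating(P, Q, 0, 1))

# The full predicted matrix is P times the transpose of Q.
R_hat = [[predict_rating(P, Q, u, i) for i in range(len(Q))]
         for u in range(len(P))]
print(R_hat)
```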
The key, then, is to find this decomposition of R. See the references for the derivation.
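For readers who want the update rule without working through the references, here is a sketch of the usual derivation for regularized matrix factorization trained by stochastic gradient descent. It agrees with the update in the demo's train method, but the symbols η (the learning rate, lrate in the demo) and λ (the regularizer), and the exact form of the objective, are assumptions chosen to match that code rather than anything stated in the text above. The constant factor 2 from the gradient is absorbed into η.

```latex
\begin{aligned}
\hat{r}_{ui} &= \sum_{k=1}^{F} P_{uk} Q_{ik}, \qquad e_{ui} = r_{ui} - \hat{r}_{ui} \\
L_{ui} &= e_{ui}^2 + \lambda \left( \sum_{k=1}^{F} P_{uk}^2 + \sum_{k=1}^{F} Q_{ik}^2 \right) \\
\frac{\partial L_{ui}}{\partial P_{uk}} &= -2\, e_{ui} Q_{ik} + 2 \lambda P_{uk}, \qquad
\frac{\partial L_{ui}}{\partial Q_{ik}} = -2\, e_{ui} P_{uk} + 2 \lambda Q_{ik} \\
P_{uk} &\leftarrow P_{uk} + \eta \left( e_{ui} Q_{ik} - \lambda P_{uk} \right) \\
Q_{ik} &\leftarrow Q_{ik} + \eta \left( e_{ui} P_{uk} - \lambda Q_{ik} \right)
\end{aligned}
```

Each observed rating contributes one term L_ui, and each update moves P and Q a small step against that term's gradient. The demo additionally clips the prediction to the range [1, 5] before computing the error, so its update is only approximately this gradient step.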
Below is a demo:
# coding: utf-8
# ======================================
# SvdMatrix:
# generates matrices U and V such that
# U * V^T closely approximates
# the original matrix (in this case, the utility
# matrix M)
# =======================================
import math
import time


class Rating:
    """
    Rating class.
    Stores every rating associated with a particular
    userid and movieid.
    """
    def __init__(self, userid, movieid, rating):
        # to accommodate zero-indexing for matrices
        self.uid = userid - 1
        self.mid = movieid - 1
        self.rat = rating


class SvdMatrix:
    """
    trainfile -> name of file to train data against
    nusers -> number of users in dataset
    nmovies -> number of movies in dataset
    r -> rank of approximation (for U and V)
    lrate -> learning rate
    regularizer -> regularizer
    typefile -> 0 if for smaller MovieLens dataset
                1 if for medium or larger MovieLens dataset
    """
    def __init__(self, trainfile, nusers, nmovies, r=30, lrate=0.035, regularizer=0.01, typefile=0):
        self.trainrats = []
        self.testrats = []
        self.nusers = nusers
        self.nmovies = nmovies
        if typefile == 0:
            self.readtrainsmaller(trainfile)
        elif typefile == 1:
            self.readtrainlarger(trainfile)
        # get average rating
        avg = self.averagerating()
        # set initial values in U, V using square root
        # of average/rank
        initval = math.sqrt(avg / r)
        # U matrix
        self.U = [[initval] * r for i in range(nusers)]
        # V matrix -- easier to store and compute than V^T
        self.V = [[initval] * r for i in range(nmovies)]
        self.r = r
        self.lrate = lrate
        self.regularizer = regularizer
        self.minimprov = 0.001
        self.maxepochs = 30

    def dotproduct(self, v1, v2):
        """Returns the dot product of v1 and v2."""
        return sum(a * b for a, b in zip(v1, v2))

    def calcrating(self, uid, mid):
        """
        Returns the estimated rating corresponding to userid for movieid.
        Ensures the returned rating is in the range [1,5].
        """
        p = self.dotproduct(self.U[uid], self.V[mid])
        if p > 5:
            p = 5
        elif p < 1:
            p = 1
        return p

    def averagerating(self):
        """Returns the average rating of the entire training set."""
        total = 0
        n = 0
        for crating in self.trainrats:
            total += crating.rat
            n += 1
        # convert before dividing so Python 2 does not truncate to an integer
        return float(total) / n

    def predict(self, i, j):
        """Predicts the estimated rating for user with id i for movie with id j."""
        return self.calcrating(i, j)

    def train(self, k):
        """
        Trains the kth column in U and the kth row in V^T
        with one pass over the training ratings.
        Returns the RMSE measured during that pass.
        """
        sse = 0.0
        n = 0
        for crating in self.trainrats:
            err = crating.rat - self.predict(crating.uid, crating.mid)
            sse += err ** 2
            n += 1
            # keep the old values so both updates use the same state
            uTemp = self.U[crating.uid][k]
            vTemp = self.V[crating.mid][k]
            self.U[crating.uid][k] += self.lrate * (err * vTemp - self.regularizer * uTemp)
            self.V[crating.mid][k] += self.lrate * (err * uTemp - self.regularizer * vTemp)
        return math.sqrt(sse / n)

    def trainratings(self):
        """Trains the entire U matrix and the entire V (and V^T) matrix."""
        # stub -- initial train error
        oldtrainerr = 1000000.0
        for k in range(self.r):
            print("k=%d" % k)
            for epoch in range(self.maxepochs):
                trainerr = self.train(k)
                # check if train error is still changing
                if abs(oldtrainerr - trainerr) < self.minimprov:
                    break
                oldtrainerr = trainerr
                print("epoch=%d; trainerr=%f" % (epoch, trainerr))

    def calcrmse(self, arr):
        """
        Calculates the RMSE between the ratings in arr
        and the estimated values in (U * V^T).
        """
        sse = 0.0
        total = 0
        for crating in arr:
            sse += (crating.rat - self.calcrating(crating.uid, crating.mid)) ** 2
            total += 1
        return math.sqrt(sse / total)

    def readinratings(self, fname, arr, splitter="\t"):
        """
        Reads the ratings from fname and appends them to arr.
        Uses splitter as the delimiter in fname.
        """
        # the with-block closes the file even if parsing fails
        with open(fname) as f:
            for line in f:
                # only the first three fields are needed; the timestamp is ignored
                userid, movieid, rating = [int(each) for each in line.split(splitter)[:3]]
                arr.append(Rating(userid, movieid, rating))
        # sort in place; rebinding arr to sorted(arr) would leave the caller's list unsorted
        arr.sort(key=lambda rating: (rating.uid, rating.mid))
        return len(arr)

    def readtrainsmaller(self, fname):
        """Reads in the smaller train dataset."""
        return self.readinratings(fname, self.trainrats, splitter="\t")

    def readtrainlarger(self, fname):
        """Reads in the larger train dataset."""
        return self.readinratings(fname, self.trainrats, splitter="::")

    def readtestsmaller(self, fname):
        """Reads in the smaller test dataset."""
        return self.readinratings(fname, self.testrats, splitter="\t")

    def readtestlarger(self, fname):
        """Reads in the larger test dataset."""
        return self.readinratings(fname, self.testrats, splitter="::")


if __name__ == "__main__":
    # ========= test SvdMatrix class on smallest MovieLENS dataset =========
    init = time.time()
    svd = SvdMatrix("ua.base", 943, 1682)
    svd.trainratings()
    print("rmsetrain: %f" % svd.calcrmse(svd.trainrats))
    svd.readtestsmaller("ua.test")
    print("rmsetest: %f" % svd.calcrmse(svd.testrats))
    print("time: %f" % (time.time() - init))
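The demo needs the ua.base and ua.test files to run. For readers who just want to watch the update rule work, the following is a minimal self-contained sketch on a made-up list of six ratings. The names `train_factors` and `rmse`, the toy data, and the hyperparameters are all assumptions for illustration. Unlike the demo, which trains one factor at a time, this sketch updates every factor for each rating and starts from small random values.

```python
import math
import random

def train_factors(ratings, n_users, n_items, rank=2, lrate=0.05,
                  reg=0.01, epochs=200, seed=0):
    """Fit user factors U and item factors V by stochastic gradient descent."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    U = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            # error of the current prediction for this rating
            err = r - sum(a * b for a, b in zip(U[u], V[i]))
            for k in range(rank):
                u_old, v_old = U[u][k], V[i][k]
                U[u][k] += lrate * (err * v_old - reg * u_old)
                V[i][k] += lrate * (err * u_old - reg * v_old)
    return U, V

def rmse(ratings, U, V):
    """Root mean squared error of the predictions over the given ratings."""
    sse = sum((r - sum(a * b for a, b in zip(U[u], V[i]))) ** 2
              for u, i, r in ratings)
    return math.sqrt(sse / len(ratings))

# Toy (user, item, rating) triples with zero-based ids.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]

U0, V0 = train_factors(ratings, 3, 3, epochs=0)  # untrained starting point
U, V = train_factors(ratings, 3, 3)
before = rmse(ratings, U0, V0)
after = rmse(ratings, U, V)
print("RMSE before training: %f" % before)
print("RMSE after training: %f" % after)
```

The training RMSE should fall well below its starting value. As in the demo, a low training error says nothing by itself about held-out ratings, which is why the demo also reports the RMSE on ua.test.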
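References
SVD原理及其应用导论 (An Introduction to the Principles and Applications of SVD)
SVD分解的并行实现 (A Parallel Implementation of SVD)
奇异值分解(SVD)原理详解及推导 (Singular Value Decomposition (SVD): Principles Explained and Derived)
SVD在推荐系统中的应用详解以及算法推导 (SVD in Recommender Systems: Detailed Application and Algorithm Derivation)
推荐系统相关算法(1):SVD (Recommender System Algorithms (1): SVD)
矩阵分解(MATRIX FACTORIZATION)在推荐系统中的应用 (Matrix Factorization in Recommender Systems)
SVD奇异值分解 (SVD: Singular Value Decomposition)
使用SVD方法实现电影推荐系统 (Implementing a Movie Recommender System with SVD)
SVD学习笔记 (SVD Study Notes)
非负矩阵分解(NMF) (Non-negative Matrix Factorization (NMF))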