Machine Learning in Action: Simplifying Data with SVD

With SVD (Singular Value Decomposition), we can represent the original dataset with a much smaller one. In doing so, we are effectively removing noise and redundant information.

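Singular value decomposition
Pros: simplifies data, removes noise, can improve algorithm results
Cons: the transformed data can be hard to interpret
Works with: numeric data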

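One of the earliest applications of SVD was information retrieval. The SVD-based approach is called Latent Semantic Indexing (LSI) or Latent Semantic Analysis (LSA).
Another application of SVD is recommender systems. SVD can be used to build a topic space from the data, and similarities are then computed in that space.
SVD is one type of matrix factorization, and matrix factorization is the process of decomposing a data matrix into several separate parts.
NumPy has a linear algebra toolbox called linalg.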

In [94]: from numpy import *

In [95]: U, Sigma,VT = linalg.svd([[1,1],[7,7]])

In [96]: U
Out[96]: 
array([[-0.14142136, -0.98994949],
       [-0.98994949,  0.14142136]])

In [97]: Sigma
Out[97]: array([  1.00000000e+01,   2.82797782e-16])

In [98]: VT
Out[98]: 
array([[-0.70710678, -0.70710678],
       [ 0.70710678, -0.70710678]])

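Sigma is returned as a row vector, array([10., 0.]) (the second value, 2.8e-16, is zero up to rounding), rather than as the matrix [[10,0],[0,0]]. Returning only the diagonal saves space.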
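As a quick check, the sketch below expands the returned vector back into a diagonal matrix and confirms that the three factors multiply back to the input. It is only an illustration: the names `Sigma_full` and `reconstructed` are assumptions made here, and it uses plain arrays with the Python 3 `@` operator instead of the `from numpy import *` style used in the session above.

```python
import numpy as np

data = np.array([[1, 1], [7, 7]], dtype=float)
U, Sigma, VT = np.linalg.svd(data)

# linalg.svd returns only the diagonal entries; np.diag expands them
# into the square diagonal matrix that appears in the factorization.
Sigma_full = np.diag(Sigma)

# Multiplying the three factors should give back the input
# (up to floating-point rounding).
reconstructed = U @ Sigma_full @ VT
print(np.allclose(reconstructed, data))
```

For a non-square input the middle matrix is rectangular, so `np.diag` alone is not enough; the truncated form used later in this post (first k columns of U, first k rows of VT) avoids that problem.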

Create a new file, svdRec.py:

def loadExData():
    return[[1, 1, 1, 0, 0],
           [2, 2, 2, 0, 0],
           [1, 1, 1, 0, 0],
           [5, 5, 5, 0, 0],
           [1, 1, 0, 0, 0],
           [0, 0, 0, 3, 3],
           [0, 0, 0, 1, 1]]

Next, run SVD on this matrix:

In [17]: import svdRec
    ...: Data = svdRec.loadExData()
    ...: U, Sigma,VT = linalg.svd(Data)
    ...: Sigma
    ...: 
Out[17]: 
array([  9.71302333e+00,   4.47213595e+00,   8.10664981e-01,
         1.62982155e-15,   8.33719667e-17])

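The last two values are so small that we can drop them.
We now try to reconstruct the original matrix. First we build a 3x3 matrix Sig3; as a result, we only need the first three columns of U and the first three rows of VT: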

In [18]: Sig3 = mat([[Sigma[0],0,0],[0,Sigma[1],0],[0,0,Sigma[2]]])
    ...: U[:,:3]*Sig3*VT[:3,:]
    ...: 
Out[18]: 
matrix([[  1.00000000e+00,   1.00000000e+00,   1.00000000e+00,
          -7.70210327e-33,  -7.70210327e-33],
        [  2.00000000e+00,   2.00000000e+00,   2.00000000e+00,
          -4.60081159e-17,  -4.60081159e-17],
        [  1.00000000e+00,   1.00000000e+00,   1.00000000e+00,
          -1.23532915e-17,  -1.23532915e-17],
        ..., 
        [  1.00000000e+00,   1.00000000e+00,   4.53492652e-16,
          -5.59432048e-34,  -5.59432048e-34],
        [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
           3.00000000e+00,   3.00000000e+00],
        [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
           1.00000000e+00,   1.00000000e+00]])
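
To make "small enough to drop" more concrete: the Frobenius norm of the difference between a matrix and its rank-k reconstruction equals the square root of the sum of the squared singular values that were dropped. The sketch below checks this on the same 7x5 matrix for k=2 and k=3. The helper name `rank_k_approx` is an assumption made for illustration, and the code uses plain arrays and Python 3 rather than `mat`.

```python
import numpy as np

data = np.array([[1, 1, 1, 0, 0],
                 [2, 2, 2, 0, 0],
                 [1, 1, 1, 0, 0],
                 [5, 5, 5, 0, 0],
                 [1, 1, 0, 0, 0],
                 [0, 0, 0, 3, 3],
                 [0, 0, 0, 1, 1]], dtype=float)
U, Sigma, VT = np.linalg.svd(data)

def rank_k_approx(k):
    """Rebuild the matrix from the first k singular values only."""
    return U[:, :k] @ np.diag(Sigma[:k]) @ VT[:k, :]

for k in (2, 3):
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm
    error = np.linalg.norm(data - rank_k_approx(k))
    dropped = np.sqrt(np.sum(Sigma[k:] ** 2))
    print(k, round(error, 6), round(dropped, 6))
```

With k=3 only the two near-zero values are dropped, so the reconstruction error is at rounding level; with k=2 the error is the size of the third singular value.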

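There are many heuristics for deciding how many singular values to keep. A typical one is to keep 90% of the energy in the matrix. To compute the total energy, we sum the squares of all the singular values, and then accumulate squared singular values until we reach 90% of that total. Another heuristic: when a matrix has tens of thousands of singular values, keep the first 2,000 to 3,000.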
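The 90% energy rule can be written as a small helper. This is a minimal sketch, not code from the book: the function name `num_sv_for_energy` and its `fraction` parameter are assumptions, and the sample singular values are made up for illustration.

```python
import numpy as np

def num_sv_for_energy(sigma, fraction=0.9):
    """Return the smallest k such that the first k squared singular
    values reach the given fraction of the total energy."""
    energy = np.asarray(sigma, dtype=float) ** 2
    cumulative = np.cumsum(energy)
    # searchsorted gives the first index where cumulative >= threshold
    k = int(np.searchsorted(cumulative, fraction * cumulative[-1])) + 1
    # guard against rounding pushing the threshold past the total
    return min(k, len(energy))

# Made-up singular values: energies are 100, 25, 1, 0.01
print(num_sv_for_energy([10.0, 5.0, 1.0, 0.1]))
```

In the made-up example the first value alone holds about 79% of the energy and the first two hold about 99%, so two singular values are kept.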
Now let's look at how to compute similarity. (The matrix above contains an error; it is corrected later in this post.)

from numpy import *
from numpy import linalg as la
    
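def ecludSim(inA,inB):  # similarity based on Euclidean distance
    return 1.0/(1.0 + la.norm(inA - inB))

def pearsSim(inA,inB):  # Pearson correlation, rescaled to [0, 1]
    if len(inA) < 3 : return 1.0  # too few points to correlate: treat the vectors as perfectly correlated
    return 0.5+0.5*corrcoef(inA, inB, rowvar = 0)[0][1]

def cosSim(inA,inB):  # cosine similarity, rescaled to [0, 1]
    num = float(inA.T*inB)
    denom = la.norm(inA)*la.norm(inB)
    return 0.5+0.5*(num/denom)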

Let's try these functions out:

In [9]: import svdRec
   ...: myMat = mat(svdRec.loadExData())
   ...: svdRec.ecludSim(myMat[:,0],myMat[:,4])
   ...: 
Out[9]: 0.13367660240019172

In [10]: svdRec.ecludSim(myMat[:,0],myMat[:,0])
Out[10]: 1.0

In [11]: svdRec.cosSim(myMat[:,0],myMat[:,4])
Out[11]: 0.54724555912615336

In [12]: svdRec.cosSim(myMat[:,0],myMat[:,0])
Out[12]: 0.99999999999999989

In [13]: svdRec.pearsSim(myMat[:,0],myMat[:,4])
Out[13]: 0.23768619407595815

In [14]: svdRec.pearsSim(myMat[:,0],myMat[:,0])
Out[14]: 1.0

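Using column vectors here implies that we will use item-based similarity. Whether to use item-based or user-based similarity depends on how many users and how many items there are.
How do we evaluate a recommendation engine? We remove some of the known ratings, predict them, and then measure the difference between the predicted and the actual values.
The metric usually used to evaluate recommendation engines is the root mean squared error (RMSE): compute the mean of the squared errors, then take its square root.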
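As a small illustration of the metric, here is one way RMSE could be computed for a set of held-out ratings. The function name `rmse` and the sample ratings are assumptions made up for this sketch; they do not come from the datasets in this post.

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Made-up held-out ratings and the corresponding predictions.
held_out = [4, 3, 5]
predictions = [3, 3, 4]
print(rmse(predictions, held_out))
```

Here the errors are 1, 0 and 1, so the mean squared error is 2/3 and the RMSE is about 0.82.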
Next, let's try an item-based similarity recommendation engine:

def standEst(dataMat, user, simMeas, item):  # estimate the user's rating for item, given a similarity measure
    n = shape(dataMat)[1]  # number of items
    simTotal = 0.0; ratSimTotal = 0.0
    for j in range(n):
        userRating = dataMat[user,j]
        if userRating == 0: continue  # the user has not rated item j, skip it
        # users (rows) who have rated both item and j
        overLap = nonzero(logical_and(dataMat[:,item].A>0,dataMat[:,j].A>0))[0]
        if len(overLap) == 0: similarity = 0
        else: similarity = simMeas(dataMat[overLap,item],dataMat[overLap,j])
        print('the %d and %d similarity is: %f' % (item, j, similarity))
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0: return 0
    else: return ratSimTotal/simTotal

def recommend(dataMat, user, N=3, simMeas=cosSim, estMethod=standEst):
    unratedItems = nonzero(dataMat[user,:].A==0)[1]  # items the user has not rated
    if len(unratedItems) == 0: return 'you rated everything'
    itemScores = []
    for item in unratedItems:
        estimatedScore = estMethod(dataMat, user, simMeas, item)
        itemScores.append((item, estimatedScore))
    # sort the (item, score) tuples by score (jj[1]) in descending order
    # and return the top N
    return sorted(itemScores, key=lambda jj: jj[1], reverse=True)[:N]

Now let's see how it works in practice. First, modify the matrix given earlier slightly:

def loadExData():
    return[[0, 0, 0, 2, 2],
           [0, 0, 0, 3, 3],
           [0, 0, 0, 1, 1],
           [1, 1, 1, 0, 0],
           [2, 2, 2, 0, 0],
           [5, 5, 5, 0, 0],
           [1, 1, 1, 0, 0]]

In [20]: import svdRec
    ...: myMat = mat(svdRec.loadExData())
    ...: myMat[0,1]=myMat[0,0]=myMat[1,0]=myMat[2,0]=4
    ...: myMat[3,3]=2
    ...: 

In [21]: myMat
Out[21]: 
matrix([[4, 4, 0, 2, 2],
        [4, 0, 0, 3, 3],
        [4, 0, 0, 1, 1],
        ..., 
        [2, 2, 2, 0, 0],
        [5, 5, 5, 0, 0],
        [1, 1, 1, 0, 0]])

Let's try some recommendations first:

In [22]: svdRec.recommend(myMat,2)
the 1 and 0 similarity is: 1.000000
the 1 and 3 similarity is: 0.928746
the 1 and 4 similarity is: 1.000000
the 2 and 0 similarity is: 1.000000
the 2 and 3 similarity is: 1.000000
the 2 and 4 similarity is: 0.000000
Out[22]: [(2, 2.5), (1, 2.0243290220056256)]

Now with the other similarity measures:

In [24]: svdRec.recommend(myMat,2,simMeas=svdRec.ecludSim)
the 1 and 0 similarity is: 1.000000
the 1 and 3 similarity is: 0.309017
the 1 and 4 similarity is: 0.333333
the 2 and 0 similarity is: 1.000000
the 2 and 3 similarity is: 0.500000
the 2 and 4 similarity is: 0.000000
Out[24]: [(2, 3.0), (1, 2.8266504712098603)]

In [25]: svdRec.recommend(myMat,2,simMeas=svdRec.pearsSim)
the 1 and 0 similarity is: 1.000000
the 1 and 3 similarity is: 1.000000
the 1 and 4 similarity is: 1.000000
the 2 and 0 similarity is: 1.000000
the 2 and 3 similarity is: 1.000000
the 2 and 4 similarity is: 0.000000
Out[25]: [(2, 2.5), (1, 2.0)]

Real datasets are much sparser than the myMat matrix we used to demonstrate recommend(). Let's load a new matrix:

def loadExData2():
    return[[2, 0, 0, 4, 4, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5],
           [0, 0, 0, 0, 0, 0, 0, 1, 0, 4, 0],
           [3, 3, 4, 0, 3, 0, 0, 2, 2, 0, 0],
           [5, 5, 5, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 5, 0, 0, 5, 0],
           [4, 0, 4, 0, 0, 0, 0, 0, 0, 0, 5],
           [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 4],
           [0, 0, 0, 0, 0, 0, 5, 0, 0, 5, 0],
           [0, 0, 0, 3, 0, 0, 0, 0, 4, 5, 0],
           [1, 1, 2, 1, 1, 2, 1, 0, 4, 5, 0]]

Next we compute the SVD of this matrix to find out how many dimensions it really needs:

In [39]: import svdRec
    ...: from numpy import linalg as la
    ...: U, Sigma, VT = la.svd(mat(svdRec.loadExData2()))
    ...: Sigma
    ...: 
Out[39]: 
array([  1.34342819e+01,   1.18190832e+01,   8.20176076e+00, ...,
         2.08702082e+00,   7.08715931e-01,   1.90990329e-16])

Now let's see how many singular values it takes to reach 90% of the total energy. First, square the values in Sigma:

In [40]: Sig2=Sigma**2

In [41]: Sig2
Out[41]: 
array([  1.80479931e+02,   1.39690727e+02,   6.72688795e+01, ...,
         4.35565591e+00,   5.02278271e-01,   3.64773057e-32])

In [42]: sum(Sig2)
Out[42]: 497.0

In [43]: sum(Sig2)*0.9
Out[43]: 447.30000000000001

In [44]: sum(Sig2[:2])
Out[44]: 320.17065834028847

In [45]: sum(Sig2[:3])
Out[45]: 387.43953785565782

In [46]: sum(Sig2[:4])
Out[46]: 434.62441339532074

In [47]: sum(Sig2[:5])
Out[47]: 462.61518152879415

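The first five singular values are the first to exceed 90% of the energy (462.6 against a threshold of 447.3), so the 11-dimensional matrix can be reduced to a 5-dimensional one.
Next we build a similarity-based estimation function for the transformed space. We use SVD to map all the dishes (items) into a lower-dimensional space: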

def svdEst(dataMat, user, simMeas, item):
    n = shape(dataMat)[1]  # number of items
    simTotal = 0.0; ratSimTotal = 0.0
    U,Sigma,VT = la.svd(dataMat)
    # diagonal matrix of the first 4 singular values
    # (the energy check above suggests 5 for this dataset)
    Sig4 = mat(eye(4)*Sigma[:4])
    xformedItems = dataMat.T * U[:,:4] * Sig4.I  # items projected into the 4-dimensional space
    for j in range(n):
        userRating = dataMat[user,j]
        if userRating == 0 or j==item: continue
        similarity = simMeas(xformedItems[item,:].T, xformedItems[j,:].T)
        print('the %d and %d similarity is: %f' % (item, j, similarity))
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0: return 0
    else: return ratSimTotal/simTotal
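
The line that builds `xformedItems` multiplies the transposed data by the leading columns of U and by the inverse of the diagonal matrix. Because the data matrix factors as U times Sigma times VT, that product works out to the first k columns of V, that is, `VT[:k, :]` transposed. The sketch below checks this numerically. The small ratings matrix and the value of `k` are made up for illustration, and the code uses plain arrays with the Python 3 `@` operator instead of `mat`.

```python
import numpy as np

# Small made-up ratings matrix (rows: users, columns: items).
data = np.array([[4., 4., 0., 2.],
                 [4., 0., 0., 3.],
                 [0., 1., 5., 0.],
                 [1., 0., 4., 1.],
                 [2., 2., 1., 0.]])
k = 2  # number of singular values kept (they must all be nonzero)

U, Sigma, VT = np.linalg.svd(data)

# Same projection as in svdEst: data.T * U[:, :k] * inverse(diag(Sigma[:k]))
sig_k_inv = np.diag(1.0 / Sigma[:k])
xformed_items = data.T @ U[:, :k] @ sig_k_inv

# Each row is one item in the k-dimensional space.
print(xformed_items.shape)
print(np.allclose(xformed_items, VT[:k, :].T))
```

One practical consequence is that the item coordinates can be read directly from VT, without the extra matrix products or the inverse.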

Then let's see how it does:

In [61]: myMat = mat(svdRec.loadExData2())

In [62]: svdRec.recommend(myMat, 1, estMethod=svdRec.svdEst)
the 0 and 10 similarity is: 0.584526
the 1 and 10 similarity is: 0.342595
the 2 and 10 similarity is: 0.553617
the 3 and 10 similarity is: 0.509334
the 4 and 10 similarity is: 0.478823
the 5 and 10 similarity is: 0.842470
the 6 and 10 similarity is: 0.512666
the 7 and 10 similarity is: 0.320211
the 8 and 10 similarity is: 0.456105
the 9 and 10 similarity is: 0.489873
Out[62]: [(8, 5.0000000000000009), (0, 5.0), (1, 5.0)]

Now try a different similarity measure:

In [63]: svdRec.recommend(myMat, 1, estMethod=svdRec.svdEst, simMeas=svdRec.pearsSim)
the 0 and 10 similarity is: 0.602364
the 1 and 10 similarity is: 0.303884
the 2 and 10 similarity is: 0.513270
the 3 and 10 similarity is: 0.787267
the 4 and 10 similarity is: 0.667888
the 5 and 10 similarity is: 0.833890
the 6 and 10 similarity is: 0.560256
the 7 and 10 similarity is: 0.371606
the 8 and 10 similarity is: 0.520289
the 9 and 10 similarity is: 0.604393
Out[63]: [(0, 5.0), (1, 5.0), (2, 5.0)]

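In large applications, the SVD is run once a day or less often, and it is run offline. The cold-start problem (how to make good recommendations when data is scarce) is also very hard to deal with.
Next we use SVD to compress an image. Add the following code to svdRec.py: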

def printMat(inMat, thresh=0.8):  # print 1 where a value exceeds thresh, else 0
    for i in range(32):
        row = ['1' if float(inMat[i,k]) > thresh else '0' for k in range(32)]
        print(' '.join(row))

def imgCompress(numSV=3, thresh=0.8):
    myl = []
    for line in open('0_5.txt').readlines():
        newRow = []
        for i in range(32):
            newRow.append(int(line[i]))
        myl.append(newRow)
    myMat = mat(myl)
    print("****original matrix******")
    printMat(myMat, thresh)
    U,Sigma,VT = la.svd(myMat)
    # build an all-zero matrix, then put the kept singular values on its diagonal
    SigRecon = mat(zeros((numSV, numSV)))
    for k in range(numSV):
        SigRecon[k,k] = Sigma[k]
    reconMat = U[:,:numSV]*SigRecon*VT[:numSV,:]
    print("****reconstructed matrix using %d singular values******" % numSV)
    printMat(reconMat, thresh)

Let's see the actual result:

In [81]: svdRec.imgCompress(2)
****original matrix******
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 
0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 
0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 
0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 
0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 
0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 
****reconstructed matrix using 2 singular values******
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

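As you can see, two singular values are enough to reconstruct the image quite accurately. The total number of values stored is 64+64+2=130 (two 32-element columns of U, two 32-element rows of VT, and two singular values), compared with the original 1024, which is a compression ratio of nearly 8 to 1.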
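The storage count generalizes to any matrix size and any number of kept singular values. The helper below is a minimal sketch of that arithmetic; the name `svd_storage` is an assumption made for illustration.

```python
def svd_storage(rows, cols, num_sv):
    """Numbers needed to store a truncated SVD: num_sv columns of U,
    num_sv singular values, and num_sv rows of VT."""
    return rows * num_sv + num_sv + num_sv * cols

original = 32 * 32
compressed = svd_storage(32, 32, 2)
print(compressed, original, round(original / compressed, 2))
```

For the 32x32 image with two singular values this gives 130 stored numbers against 1024, a ratio of about 7.88.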
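On large datasets, computing the SVD and producing recommendations can be a hard engineering problem. Running the SVD and the similarity calculations offline is one way to cut redundant computation and reduce the time needed at recommendation time. The next chapter of the book introduces some tools for machine learning on large datasets.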
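One way to picture the offline/online split is sketched below: the SVD and the full item-item similarity matrix are computed once, and each request then needs only lookups and a weighted average. This is an assumption-laden illustration rather than the book's code. The function names `precompute_item_similarity` and `estimate` and the small ratings matrix are made up here, the similarity is the same rescaled cosine used by cosSim, and the code uses plain arrays and Python 3.

```python
import numpy as np

def precompute_item_similarity(data, k):
    """Offline step: project items into k dimensions via SVD and return
    the item-item cosine similarity matrix, rescaled to [0, 1]."""
    U, Sigma, VT = np.linalg.svd(data, full_matrices=False)
    items = VT[:k, :].T                      # one row per item
    norms = np.linalg.norm(items, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                  # avoid division by zero
    unit = items / norms
    return 0.5 + 0.5 * (unit @ unit.T)

def estimate(data, sim, user, item):
    """Online step: similarity-weighted average of the user's ratings,
    using only lookups into the precomputed matrix."""
    rated = np.nonzero(data[user])[0]
    rated = rated[rated != item]
    weights = sim[item, rated]
    total = weights.sum()
    return float(weights @ data[user, rated] / total) if total > 0 else 0.0

# Made-up ratings matrix (rows: users, columns: items).
ratings = np.array([[4., 4., 0., 2.],
                    [4., 0., 0., 3.],
                    [0., 1., 5., 0.],
                    [1., 0., 4., 1.],
                    [2., 2., 1., 0.]])
sim = precompute_item_similarity(ratings, k=2)   # run offline and cache
print(estimate(ratings, sim, user=1, item=1))    # run per request
```

Unlike svdEst, which recomputes the SVD on every call, this arrangement pays the decomposition cost once per refresh of the ratings data.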
