Mahout Collaborative Filtering Source Code Analysis (3-2)--QR Decomposition Data Flow

Mahout version: 0.7, Hadoop version: 1.0.4, JDK: 1.7.0_25 64-bit.

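Continuing from the previous post: now that the data is prepared, we can trace its data flow.

The first thing to analyze is new QRDecomposition(Ai). This constructor alone does a lot of work, which is analyzed step by step below.

Here is the source first, followed by the analysis: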

public QRDecomposition(Matrix a) {

    // Initialize.
    qr = a.clone();
    originalRows = a.numRows();
    originalColumns = a.numCols();
    rDiag = new DenseVector(originalColumns);

    // precompute and cache some views to avoid regenerating them time and again
    Vector[] QRcolumnsPart = new Vector[originalColumns];
    for (int k = 0; k < originalColumns; k++) {
      QRcolumnsPart[k] = qr.viewColumn(k).viewPart(k, originalRows - k);
    }

    // Main loop.
    for (int k = 0; k < originalColumns; k++) {
      //DoubleMatrix1D QRcolk = QR.viewColumn(k).viewPart(k,m-k);
      // Compute 2-norm of k-th column without under/overflow.
      double nrm = 0;
      //if (k<m) nrm = QRcolumnsPart[k].aggregate(hypot,F.identity);

      for (int i = k; i < originalRows; i++) { // fixes bug reported by [email protected]
        nrm = Algebra.hypot(nrm, qr.getQuick(i, k));
      }


      if (nrm != 0.0) {
        // Form k-th Householder vector.
        if (qr.getQuick(k, k) < 0) {
          nrm = -nrm;
        }
        QRcolumnsPart[k].assign(Functions.div(nrm));
        /*
        for (int i = k; i < m; i++) {
           QR[i][k] /= nrm;
        }
        */

        qr.setQuick(k, k, qr.getQuick(k, k) + 1);

        // Apply transformation to remaining columns.
        for (int j = k + 1; j < originalColumns; j++) {
          Vector QRcolj = qr.viewColumn(j).viewPart(k, originalRows - k);
          double s = QRcolumnsPart[k].dot(QRcolj);
          /*
          // fixes bug reported by John Chambers
          DoubleMatrix1D QRcolj = QR.viewColumn(j).viewPart(k,m-k);
          double s = QRcolumnsPart[k].zDotProduct(QRcolumns[j]);
          double s = 0.0;
          for (int i = k; i < m; i++) {
            s += QR[i][k]*QR[i][j];
          }
          */
          s = -s / qr.getQuick(k, k);
          //QRcolumnsPart[j].assign(QRcolumns[k], F.plusMult(s));

          for (int i = k; i < originalRows; i++) {
            qr.setQuick(i, j, qr.getQuick(i, j) + s * qr.getQuick(i, k));
          }

        }
      }
      rDiag.setQuick(k, -nrm);
    }
  }
Initialize the qr matrix: it is simply a clone of Ai

[[31.678402777777777, 4.08661209859189, 4.573918596524476],
[4.08661209859189, 1.0203966547288652, 0.3987296589988406],
[4.573918596524476, 0.3987296589988406, 1.059026647737198]]
QRcolumnsPart: for each column k, a view of that column of qr from the diagonal down (rows k to the last row)

[{0:31.678402777777777,1:4.08661209859189,2:4.573918596524476},
{0:1.0203966547288652,1:0.3987296589988406}, 
{0:1.059026647737198}]


Next comes the main loop (Main loop). It runs once per column of Ai; k is the iteration index, starting from 0
+++++++++++++++++++++++++---------------------------------first time  k=0

nrm:

32.26673724322168
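
The nrm value above is accumulated one entry at a time with a hypot function instead of summing squares, which is what the source comment "without under/overflow" refers to. Below is a minimal sketch of that accumulation on a plain double[][]. It assumes java.lang.Math.hypot as a stand-in for Mahout's Algebra.hypot, and the class and method names are made up for illustration:

```java
// Illustrative sketch only: plain arrays instead of Mahout's Matrix,
// and Math.hypot standing in for Algebra.hypot (an assumption).
final class HypotNormSketch {

    // 2-norm of column k over rows k..end, accumulated with hypot so that
    // intermediate squares cannot overflow or underflow.
    static double columnNorm(double[][] a, int k) {
        double nrm = 0;
        for (int i = k; i < a.length; i++) {
            nrm = Math.hypot(nrm, a[i][k]);
        }
        return nrm;
    }

    // Naive version for comparison: the squares overflow for huge entries.
    static double naiveColumnNorm(double[][] a, int k) {
        double sum = 0;
        for (int i = k; i < a.length; i++) {
            sum += a[i][k] * a[i][k];
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[][] ai = {
            {31.678402777777777, 4.08661209859189, 4.573918596524476},
            {4.08661209859189, 1.0203966547288652, 0.3987296589988406},
            {4.573918596524476, 0.3987296589988406, 1.059026647737198}
        };
        System.out.println(columnNorm(ai, 0)); // close to the nrm traced for k=0

        double[][] huge = {{1e200}, {1e200}};
        System.out.println(columnNorm(huge, 0));      // stays finite
        System.out.println(naiveColumnNorm(huge, 0)); // Infinity
    }
}
```

For well-scaled data like Ai both versions agree; the hypot form only matters when entries are extremely large or small.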
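The assign call: QRcolumnsPart[k].assign(Functions.div(nrm)) updates not only QRcolumnsPart but also qr, because QRcolumnsPart[k] and the corresponding part of qr refer to the same storage, as the figure shows:

[Figure 1: QRcolumnsPart[k] and qr refer to the same underlying storage]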
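  For QRcolumnsPart, the update is as follows (div_num is the variable called nrm in the code):
    let div_num = sqrt(QRcolumnsPart[k][0]^2 + QRcolumnsPart[k][1]^2 + ... + QRcolumnsPart[k][size-1]^2), negated when qr[k][k] is negative
    QRcolumnsPart[k] = QRcolumnsPart[k] / div_num
    where QRcolumnsPart[k][i] is the i-th element of QRcolumnsPart[k] and size is the number of elements in QRcolumnsPart[k];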

    QRcolumnsPart becomes:

{0:0.9817665337214256,1:0.12665092438034994,2:0.14175336545641365}
  {0:1.0203966547288652,1:0.3987296589988406}
  {0:1.059026647737198}
   As for qr: the views in QRcolumnsPart point at the same underlying values as qr, so modifying QRcolumnsPart[k] also modifies qr:
    qr[i][k] = qr[i][k] / div_num, for i = k, ..., rows-1
    qr becomes:
[[0.9817665337214256, 4.08661209859189, 4.573918596524476],
  [0.12665092438034994, 1.0203966547288652, 0.3987296589988406],
  [0.14175336545641365, 0.3987296589988406, 1.059026647737198]]
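
The aliasing between QRcolumnsPart[k] and qr can be reproduced with a tiny hand-written view class. This is only a sketch of the idea of a view that shares storage with its matrix; it is not how Mahout implements viewColumn(k).viewPart(k, n), and ColumnPartViewSketch is a hypothetical name:

```java
// Illustrative sketch only: a hand-rolled column view over a double[][],
// not Mahout's view implementation.
final class ColumnPartViewSketch {
    private final double[][] backing; // the matrix the view reads and writes
    private final int column;         // which column is viewed
    private final int offset;         // first row covered by the view

    ColumnPartViewSketch(double[][] backing, int column, int offset) {
        this.backing = backing;
        this.column = column;
        this.offset = offset;
    }

    int size() {
        return backing.length - offset;
    }

    double get(int i) {
        return backing[offset + i][column];
    }

    // Divides every viewed entry in place; the backing matrix changes too,
    // because the view holds no copy of the data.
    void divideBy(double d) {
        for (int i = 0; i < size(); i++) {
            backing[offset + i][column] /= d;
        }
    }

    public static void main(String[] args) {
        double[][] qr = {
            {31.678402777777777, 4.08661209859189, 4.573918596524476},
            {4.08661209859189, 1.0203966547288652, 0.3987296589988406},
            {4.573918596524476, 0.3987296589988406, 1.059026647737198}
        };
        ColumnPartViewSketch part0 = new ColumnPartViewSketch(qr, 0, 0);
        part0.divideBy(32.26673724322168);
        System.out.println(part0.get(0) == qr[0][0]); // true: same storage
    }
}
```

Only column 0 changes here; the other columns of qr keep their original values, matching the matrix printed above.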
  Then the diagonal entry of qr (row = col = k) is incremented by 1:
    qr becomes:

[[1.9817665337214256, 4.08661209859189, 4.573918596524476],
  [0.12665092438034994, 1.0203966547288652, 0.3987296589988406],
  [0.14175336545641365, 0.3987296589988406, 1.059026647737198]]
QRcolumnsPart becomes:
[{0:1.9817665337214256,1:0.12665092438034994,2:0.14175336545641365},  -- value changed by the main loop: QRcolumnsPart[k]
  {0:1.0203966547288652,1:0.3987296589988406},
  {0:1.059026647737198}]
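
Dividing the column by the signed norm and then adding 1 to its leading entry produces a Householder vector v with the property v·v = 2·v[0]: the scaled column has unit length, so adding 1 to its first entry gives v·v = 1 + 2·u[0] + 1 = 2·(1 + u[0]). The reflector I - 2·v·vᵀ/(v·v) therefore simplifies to I - v·vᵀ/v[0], which is why the later step divides by qr.getQuick(k, k). A small plain-array sketch of that construction (hypothetical names, Math.hypot assumed in place of Algebra.hypot):

```java
// Illustrative sketch only: builds the Householder vector for one column
// the same way the traced steps do, on a plain double[].
final class HouseholderVectorSketch {

    // Scale x by its signed 2-norm, then add 1 to the leading entry.
    static double[] householderVector(double[] x) {
        double nrm = 0;
        for (double xi : x) {
            nrm = Math.hypot(nrm, xi);
        }
        if (nrm == 0.0) {
            return x.clone(); // zero column: nothing to do
        }
        if (x[0] < 0) {
            nrm = -nrm; // give nrm the sign of x[0], so x[0] / nrm is non-negative
        }
        double[] v = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            v[i] = x[i] / nrm;
        }
        v[0] += 1;
        return v;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i] * b[i];
        }
        return s;
    }

    public static void main(String[] args) {
        double[] col0 = {31.678402777777777, 4.08661209859189, 4.573918596524476};
        double[] v = householderVector(col0);
        // The two printed numbers agree up to rounding.
        System.out.println(dot(v, v));
        System.out.println(2 * v[0]);
    }
}
```

Matching the sign of nrm to the leading entry keeps v[0] at 1 or above, so the later division by qr.getQuick(k, k) never divides by a value near zero.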
         Inner for loop: j starts at k+1 and stops at the number of columns of Ai
      ------------------------------sub-first time j=k+1=1  ,k=0
       QRcolj: column j of qr, from row k down

{0:4.08661209859189,1:1.0203966547288652,2:0.3987296589988406}
              s: the dot product of QRcolumnsPart[k] and QRcolj, i.e. the sum of the element-wise products

8.28446654391689
                s is reassigned: s = -s / qr.getQuick(k, k);

-4.180344355881341
           qr: column j, from row k down, becomes its original value plus s times the corresponding entry of column k

[[1.9817665337214256, -4.197854445325, 4.573918596524476],
	  [0.12665092438034994, 0.4909521778283148, 0.3987296589988406], 
	  [0.14175336545641365, -0.19384822221406328, 1.059026647737198]]
           QRcolumnsPart becomes (its views cover the lower-left part of qr, so it always stays in sync with qr):

[{0:1.9817665337214256,1:0.12665092438034994,2:0.14175336545641365},
	  {0:0.4909521778283148,1:-0.19384822221406328},    -- value changed by the inner loop: QRcolumnsPart[k+1]
	  {0:1.059026647737198}]
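
The inner-loop step just traced (dot product, rescale s, then update column j) can be written as one small function on a plain double[][]. This is a sketch mirroring the steps above with made-up names, not Mahout code:

```java
// Illustrative sketch only: applies the Householder vector stored in
// column k (rows k..end) to column j of qr, in place.
final class ApplyReflectorSketch {

    // Returns the rescaled s so the intermediate value can be inspected.
    static double applyToColumn(double[][] qr, int k, int j) {
        // s = dot product of column k and column j over rows k..end
        double s = 0;
        for (int i = k; i < qr.length; i++) {
            s += qr[i][k] * qr[i][j];
        }
        // rescale by the diagonal entry of the Householder vector
        s = -s / qr[k][k];
        // column j += s * column k, rows k..end
        for (int i = k; i < qr.length; i++) {
            qr[i][j] += s * qr[i][k];
        }
        return s;
    }

    public static void main(String[] args) {
        // qr as it stands after the diagonal increment for k=0
        double[][] qr = {
            {1.9817665337214256, 4.08661209859189, 4.573918596524476},
            {0.12665092438034994, 1.0203966547288652, 0.3987296589988406},
            {0.14175336545641365, 0.3987296589988406, 1.059026647737198}
        };
        double s = applyToColumn(qr, 0, 1);
        System.out.println(s);
        System.out.println(java.util.Arrays.toString(
            new double[] {qr[0][1], qr[1][1], qr[2][1]}));
    }
}
```

Calling applyToColumn(qr, 0, 2) afterwards would correspond to the next pass of the inner loop (j=2).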
                   --------------------------------sub-second time j=j+1=2,k=0
         QRcolj: column j of qr, from row k down
{0:4.573918596524476,1:0.3987296589988406,2:1.059026647737198}
                 s: the dot product of QRcolumnsPart[k] and QRcolj, i.e. the sum of the element-wise products

9.265058873873116
             s is reassigned: s = -s / qr.getQuick(k, k);
 -4.675151545966863
             qr: column j, from row k down, becomes its original value plus s times the corresponding entry of column k

[[1.9817665337214256, -4.197854445325, -4.69114027734864],
	  [0.12665092438034994, 0.4909521778283148, -0.1933826059160847],
	  [0.14175336545641365, -0.19384822221406328, 0.39630818207763985]]
             QRcolumnsPart becomes (its views cover the lower-left part of qr, so it always stays in sync with qr):

 [{0:1.9817665337214256,1:0.12665092438034994,2:0.14175336545641365}, 
	  {0:0.4909521778283148,1:-0.19384822221406328},
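	  {0:0.39630818207763985}]     -- value changed by the inner loop: QRcolumnsPart[k+2]
             ---------------------------------------- inner loop ends; with k=0 the outer loop itself set the value of QRcolumnsPart[0],
     ---------------------------------------- and the inner loop in turn set the values of QRcolumnsPart[1], ..., QRcolumnsPart[originalColumns-1];
Set the variable rDiag: rDiag.setQuick(k, -nrm);, where nrm is the div_num from before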
{0:-32.26673724322168}

+++++++++++++++++++++++++++++++++---------------------------------------------------second time k=1
nrm:
0.527836313803738
QRcolumnsPart becomes (after the assign call, both QRcolumnsPart[k] and the part of qr it corresponds to change):
[{0:1.9817665337214256,1:0.12665092438034994,2:0.14175336545641365},
{0:0.9301220188705365,1:-0.36725063650346085},
{0:0.39630818207763985}]
qr:

[[1.9817665337214256, -4.197854445325, -4.69114027734864],
[0.12665092438034994, 0.9301220188705365, -0.1933826059160847],
[0.14175336545641365, -0.36725063650346085, 0.39630818207763985]]
Update qr: add 1 to the diagonal entry (row = col = k)
qr:

[[1.9817665337214256, -4.197854445325, -4.69114027734864],
[0.12665092438034994, 1.9301220188705366, -0.1933826059160847],
[0.14175336545641365, -0.36725063650346085, 0.39630818207763985]]
QRcolumnsPart:

[{0:1.9817665337214256,1:0.12665092438034994,2:0.14175336545641365},
{0:1.9301220188705366,1:-0.36725063650346085},
{0:0.39630818207763985}]

         Inner for loop: j starts at k+1 and stops at the number of columns of Ai
         ------------------------------sub-first time j=k+1=2  ,k=1
         QRcolj: column j of qr, from row k down

{0:-0.1933826059160847,1:0.39630818207763985}
               s: the dot product of QRcolumnsPart[k] and QRcolj, i.e. the sum of the element-wise products

-0.5187964578647415
            s is reassigned: s = -s / qr.getQuick(k, k);

0.2687894613876947
         qr: column j, from row k down, becomes its original value plus s times the corresponding entry of column k

[[1.9817665337214256, -4.197854445325, -4.69114027734864], 
	  [0.12665092438034994, 1.9301220188705366, 0.3254138519486568], 
	  [0.14175336545641365, -0.36725063650346085, 0.29759508129758655]]
           QRcolumnsPart becomes (its views cover the lower-left part of qr, so it always stays in sync with qr):

[{0:1.9817665337214256,1:0.12665092438034994,2:0.14175336545641365}, 
	  {0:1.9301220188705366,1:-0.36725063650346085}, 
	  {0:0.29759508129758655}]
           --------------------------------------------inner loop ends
 --------------------------------------------
Set the variable rDiag: rDiag.setQuick(k, -nrm);, where nrm is the div_num from before
{0:-32.26673724322168,1:-0.527836313803738}
++++++++++++++++++++++++++++++++---------------------------------------------------third time k=2
nrm:0.29759508129758655

After the assign call, both QRcolumnsPart[k] and the part of qr it corresponds to change.
qr:

[[1.9817665337214256, -4.197854445325, -4.69114027734864],
[0.12665092438034994, 1.9301220188705366, 0.3254138519486568],
[0.14175336545641365, -0.36725063650346085, 0.9999999999999999]]
QRcolumnsPart:

[{0:1.9817665337214256,1:0.12665092438034994,2:0.14175336545641365},
{0:1.9301220188705366,1:-0.36725063650346085}, 
{0:0.9999999999999999}]
Update qr: add 1 to the diagonal entry (row = col = k)
qr:
[[1.9817665337214256, -4.197854445325, -4.69114027734864],
[0.12665092438034994, 1.9301220188705366, 0.3254138519486568],
[0.14175336545641365, -0.36725063650346085, 2.0]]
QRcolumnsPart:

[{0:1.9817665337214256,1:0.12665092438034994,2:0.14175336545641365},
{0:1.9301220188705366,1:-0.36725063650346085}, 
{0:2.0}]
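        Inner for loop: j starts at k+1 and stops at the number of columns of Ai
------------------------------sub-first time j=k+1=3  ,k=2
-------------------------------- j=3 is not less than the column count, so the inner loop exits immediately
Set the variable rDiag: rDiag.setQuick(k, -nrm);, where nrm is the div_num from before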
{0:-32.26673724322168,1:-0.527836313803738,2:-0.29759508129758655}
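
To replay the three iterations end to end, the whole constructor can be mirrored on a plain double[][]. The sketch below follows the same steps as the source quoted earlier, but it is an illustration under assumptions: it uses Math.hypot where Mahout uses Algebra.hypot, plain array indexing instead of cached views, and the hypothetical names PlainQrSketch and decompose:

```java
// Illustrative sketch only: the constructor's main loop on plain arrays.
final class PlainQrSketch {

    // Overwrites qr with the packed Householder form (vectors on and below
    // the diagonal, transformed columns above it) and returns rDiag.
    static double[] decompose(double[][] qr) {
        int rows = qr.length;
        int cols = qr[0].length;
        double[] rDiag = new double[cols];

        for (int k = 0; k < cols; k++) {
            // 2-norm of column k, rows k..end
            double nrm = 0;
            for (int i = k; i < rows; i++) {
                nrm = Math.hypot(nrm, qr[i][k]);
            }
            if (nrm != 0.0) {
                // form the k-th Householder vector in place
                if (qr[k][k] < 0) {
                    nrm = -nrm;
                }
                for (int i = k; i < rows; i++) {
                    qr[i][k] /= nrm;
                }
                qr[k][k] += 1;

                // apply the transformation to the remaining columns
                for (int j = k + 1; j < cols; j++) {
                    double s = 0;
                    for (int i = k; i < rows; i++) {
                        s += qr[i][k] * qr[i][j];
                    }
                    s = -s / qr[k][k];
                    for (int i = k; i < rows; i++) {
                        qr[i][j] += s * qr[i][k];
                    }
                }
            }
            rDiag[k] = -nrm;
        }
        return rDiag;
    }

    public static void main(String[] args) {
        double[][] ai = {
            {31.678402777777777, 4.08661209859189, 4.573918596524476},
            {4.08661209859189, 1.0203966547288652, 0.3987296589988406},
            {4.573918596524476, 0.3987296589988406, 1.059026647737198}
        };
        double[] rDiag = decompose(ai);
        System.out.println(java.util.Arrays.toString(rDiag));
        System.out.println(java.util.Arrays.deepToString(ai));
    }
}
```

Run on the Ai used in this post, the printed rDiag and matrix should agree with the final values traced above, up to rounding in the last digits.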


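That completes the analysis of new QRDecomposition(Ai). The next post analyzes new QRDecomposition(Ai).solve(Vi).viewColumn(0). Honestly, I'm about ready to throw up too...

Share, grow, enjoy

If you repost this, please credit the blog: http://blog.csdn.net/fansy1990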

