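Mahout version: 0.7, Hadoop version: 1.0.4, JDK: 1.7.0_25 64-bit.
Continuing from the previous post: the eigen decomposition. Honestly, it is too complicated, and I was too restless to sit down and analyze it properly (I could say Java's support for matrix operations is lacking; fine, let's blame external factors).
1. Prelude:
The eigen decomposition is applied to the triDiag matrix. The value obtained for this matrix in the previous post was: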
    [[0.315642761491587, 0.9488780991876485, 0.0],
     [0.9488780991876485, 2.855117440373572, 0.0],
     [0.0, 0.0, 0.0]]
According to the source code:
    EigenDecomposition decomp = new EigenDecomposition(triDiag);
    Matrix eigenVects = decomp.getV();
    Vector eigenVals = decomp.getRealEigenvalues();
The eigenVects and eigenVals obtained here are the result of the eigen decomposition. In debug mode you can see the values of these two variables:
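An eigen decomposition can also be computed online at http://www.yunsuanzi.com/cgi-bin/symmetric_eig_decomp.py, which gives the following result:
At first glance the two results look the same, with only the column order differing. Well, some of the signs also seem to differ. On closer inspection they really are different, so what now? Let's try MATLAB. The MATLAB result is exactly the same as the one computed in Java:
So it looks like that web page is not up to the job and got it wrong (bearing in mind that eigenvectors are only determined up to sign and column order, so a difference in those alone would not be an error).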
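A disagreement like this can also be checked without a third tool, by testing the defining relation A v = λ v directly. The following is only a minimal sketch and is not part of Mahout: `EigenCheck`, `largestEigenvalue2x2`, `eigenvectorFor` and `residual` are hypothetical names, and it uses the closed-form eigenvalue of the nonzero 2x2 block of triDiag rather than a general solver. It also shows that negating an eigenvector leaves the residual unchanged.

```java
class EigenCheck {
    // Largest eigenvalue of the symmetric 2x2 block [[a, b], [b, d]] (closed form).
    static double largestEigenvalue2x2(double a, double b, double d) {
        double mean = (a + d) / 2.0;
        double diff = (a - d) / 2.0;
        return mean + Math.sqrt(diff * diff + b * b);
    }

    // An (unnormalized) eigenvector for lambda, padded with 0 for the third coordinate.
    // It solves the first row of (A - lambda * I) v = 0: (a - lambda) * x + b * y = 0.
    static double[] eigenvectorFor(double a, double b, double lambda) {
        return new double[] {b, lambda - a, 0.0};
    }

    // Largest absolute entry of A*v - lambda*v; near zero when (lambda, v) is an eigenpair.
    static double residual(double[][] a, double[] v, double lambda) {
        double worst = 0.0;
        for (int r = 0; r < a.length; r++) {
            double sum = 0.0;
            for (int c = 0; c < v.length; c++) {
                sum += a[r][c] * v[c];
            }
            worst = Math.max(worst, Math.abs(sum - lambda * v[r]));
        }
        return worst;
    }

    public static void main(String[] args) {
        // The triDiag matrix quoted earlier in this post.
        double[][] triDiag = {
            {0.315642761491587, 0.9488780991876485, 0.0},
            {0.9488780991876485, 2.855117440373572, 0.0},
            {0.0, 0.0, 0.0}
        };
        double lambda = largestEigenvalue2x2(triDiag[0][0], triDiag[0][1], triDiag[1][1]);
        double[] v = eigenvectorFor(triDiag[0][0], triDiag[0][1], lambda);
        double[] negated = {-v[0], -v[1], -v[2]};
        System.out.println("lambda = " + lambda);
        System.out.println("residual for v  = " + residual(triDiag, v, lambda));
        System.out.println("residual for -v = " + residual(triDiag, negated, lambda));
    }
}
```

Any candidate result (from a web page, MATLAB or Java) can be passed through `residual` column by column; a column whose residual is far from zero is genuinely wrong rather than just reordered or sign-flipped.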
Reading on:
    for (int row = 0; row < i; row++) {
      Vector realEigen = null;
      // the eigenvectors live as columns of V, in reverse order. Weird but true.
      Vector ejCol = eigenVects.viewColumn(i - row - 1);
      int size = Math.min(ejCol.size(), state.getBasisSize());
      for (int j = 0; j < size; j++) {
        double d = ejCol.get(j);
        Vector rowJ = state.getBasisVector(j);
        if (realEigen == null) {
          realEigen = rowJ.like();
        }
        realEigen.assign(rowJ, new PlusMult(d));
      }
      realEigen = realEigen.normalize();
      state.setRightSingularVector(row, realEigen);
      double e = eigenVals.get(row) * state.getScaleFactor();
      if (!isSymmetric) {
        e = Math.sqrt(e);
      }
      log.info("Eigenvector {} found with eigenvalue {}", row, e);
      state.setSingularValue(row, e);
    }
    log.info("LanczosSolver finished.");
    endTime(TimingSection.FINAL_EIGEN_CREATE);
    }
You can see that the value of realEigen (for row = 0) is the product of the transpose of column (rank-1-row) of eigenVects with the basis vectors, that is, a linear combination of the basis vectors weighted by the entries of that column. For example:
{0:0.01180448947054423,1:0.001703710024210367,2:0.002100735590662567,3:0.014221147454610283,4:0.09654151173375553,5:0.0025666815984826535,6:0.0026147055494762234,7:1.753144283209579E-4,8:0.0017595900141802873,9:0.0049406361794682024,10:7.881250692924197E-4,11:0.002873479530226361,12:0.9951286321096425}
The values computed in Excel are:
0.011804489 | 0.00170371 | 0.002100736 | 0.014221147 | 0.096541512 | 0.002566682 | 0.002614706 | 0.000175314 | 0.00175959 | 0.004940636 | 0.000788125 | 0.00287348 | 0.995128632 |
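The accumulation loop above (`realEigen.assign(rowJ, new PlusMult(d))`) can be reproduced with plain arrays, much like the Excel check. This is only a sketch: `BasisCombination` and `combine` are hypothetical names, plain `double[]` arrays stand in for Mahout's `Vector`, and the numbers in `main` are toy data, not values from this run.

```java
import java.util.Arrays;

class BasisCombination {
    // Returns sum over j of coeffs[j] * basis[j], i.e. the same accumulation the
    // solver performs with realEigen.assign(rowJ, new PlusMult(d)).
    static double[] combine(double[] coeffs, double[][] basis) {
        // Like the solver, only use as many terms as both inputs provide.
        int size = Math.min(coeffs.length, basis.length);
        double[] result = new double[basis[0].length];
        for (int j = 0; j < size; j++) {
            for (int k = 0; k < result.length; k++) {
                result[k] += coeffs[j] * basis[j][k];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy data (assumed for illustration): one eigenvector column and two basis vectors.
        double[] ejCol = {0.5, -2.0};
        double[][] basis = {
            {1.0, 0.0, 2.0},
            {0.0, 1.0, 1.0}
        };
        System.out.println(Arrays.toString(combine(ejCol, basis)));
    }
}
```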
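Next comes normalize. This call produces the updated value of realEigen by dividing the original values by the square root of the dot product of realEigen with itself (its Euclidean norm). After that comes the assignment: realEigen is stored in state's singularVectors. The value of e is even easier to follow: the corresponding value is taken from eigenVals and multiplied by scaleFactor, and then the square root is taken (the code only does this when isSymmetric is false). Finally e is stored in state's singularValues. Here are the definitions of singularValues and singularVectors in state: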
    protected final Map<Integer, Double> singularValues;
    protected Map<Integer, Vector> singularVectors;
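The normalize step and the computation of e described above can be sketched in a few lines. This is an illustration with plain arrays, not Mahout's implementation: `NormalizeAndScale`, `normalize` and `singularValue` are hypothetical names, and the inputs in `main` are made-up numbers.

```java
import java.util.Arrays;

class NormalizeAndScale {
    // Divides every entry by the Euclidean norm sqrt(v . v).
    // Assumes v is not the zero vector.
    static double[] normalize(double[] v) {
        double sumOfSquares = 0.0;
        for (double x : v) {
            sumOfSquares += x * x;
        }
        double norm = Math.sqrt(sumOfSquares);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / norm;
        }
        return out;
    }

    // Mirrors the solver: e = eigenVal * scaleFactor, with a square root
    // taken only when the input matrix is not symmetric.
    static double singularValue(double eigenVal, double scaleFactor, boolean isSymmetric) {
        double e = eigenVal * scaleFactor;
        return isSymmetric ? e : Math.sqrt(e);
    }

    public static void main(String[] args) {
        // Made-up inputs for illustration only.
        System.out.println(Arrays.toString(normalize(new double[] {3.0, 4.0})));
        System.out.println(singularValue(4.0, 9.0, false));
        System.out.println(singularValue(4.0, 9.0, true));
    }
}
```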
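Once the above has finished, execution returns to line 203 of DistributedLanczosSolver:
    Path outputEigenVectorPath = new Path(outputPath, RAW_EIGENVECTORS);
    serializeOutput(state, outputEigenVectorPath);
First an output path is initialized, then state is serialized to it; state was updated inside the solve function.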
Here is the definition of serializeOutput:
    public void serializeOutput(LanczosState state, Path outputPath) throws IOException {
      int numEigenVectors = state.getIterationNumber();
      log.info("Persisting {} eigenVectors and eigenValues to: {}", numEigenVectors, outputPath);
      Configuration conf = getConf() != null ? getConf() : new Configuration();
      FileSystem fs = FileSystem.get(outputPath.toUri(), conf);
      SequenceFile.Writer seqWriter =
          new SequenceFile.Writer(fs, conf, outputPath, IntWritable.class, VectorWritable.class);
      try {
        IntWritable iw = new IntWritable();
        for (int i = 0; i < numEigenVectors; i++) {
          // Persist eigenvectors sorted by eigenvalues in descending order
          NamedVector v = new NamedVector(state.getRightSingularVector(numEigenVectors - 1 - i),
              "eigenVector" + i + ", eigenvalue = " + state.getSingularValue(numEigenVectors - 1 - i));
          Writable vw = new VectorWritable(v);
          iw.set(i);
          seqWriter.append(iw, vw);
        }
      } finally {
        Closeables.closeQuietly(seqWriter);
      }
    }
The most important part of the above is:
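    NamedVector v = new NamedVector(state.getRightSingularVector(numEigenVectors - 1 - i),
        "eigenVector" + i + ", eigenvalue = " + state.getSingularValue(numEigenVectors - 1 - i));
This writes the singular values and singular vectors held in state to the file, reading them from index numEigenVectors - 1 - i so that they come out in reverse index order: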
singularVectors:
{0={0:0.01180448947054423,1:0.001703710024210367,2:0.002100735590662567,3:0.014221147454610283,4:0.09654151173375553,5:0.0025666815984826535,6:0.0026147055494762234,7:1.753144283209579E-4,8:0.0017595900141802873,9:0.0049406361794682024,10:7.881250692924197E-4,11:0.002873479530226361,12:0.9951286321096425},
1={0:-0.2883450858059115,1:-0.29170231535763447,2:-0.29157035465385267,3:-0.28754185317979386,4:-0.26018076078737895,5:-0.2914154866344813,6:-0.2913995247546756,7:-0.2922103132689348,8:-0.2916837423401091,9:-0.29062644748002026,10:-0.2920066313645422,11:-0.2913135151887795,12:0.03848561950058266},
2={0:0.01671441233225078,1:0.0935655369363106,2:0.09132650234523473,3:-0.0680324702834075,4:-0.9461123439509093,5:0.10210271255992123,6:0.10042714365337412,7:0.11137954332150339,8:0.10331974823993555,9:0.10621406378767596,10:0.10586960137353602,11:0.09262650242313884,12:0.09059904726143547}}
singularValue:
{0=0.0, 1=23.01314740985974, 2=2536.4018057098874}
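The index arithmetic in serializeOutput (numEigenVectors - 1 - i) is what turns the ascending values held in state into descending output. A minimal sketch of just that reordering follows; `DescendingOutput` and `outputOrder` are hypothetical names, and it assumes the number of entries in the map equals the iteration number, which holds for the three values shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DescendingOutput {
    // Returns the values in the order serializeOutput would write them:
    // output position i holds the entry stored under key (n - 1 - i).
    static List<Double> outputOrder(Map<Integer, Double> singularValues) {
        int n = singularValues.size(); // assumption: same as state.getIterationNumber()
        List<Double> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add(singularValues.get(n - 1 - i));
        }
        return out;
    }

    public static void main(String[] args) {
        // The singular values reported in this post, keyed as they are in state.
        Map<Integer, Double> singularValues = new TreeMap<>();
        singularValues.put(0, 0.0);
        singularValues.put(1, 23.01314740985974);
        singularValues.put(2, 2536.4018057098874);

        List<Double> ordered = outputOrder(singularValues);
        for (int i = 0; i < ordered.size(); i++) {
            System.out.println("eigenVector" + i + ", eigenvalue = " + ordered.get(i));
        }
    }
}
```

With the values from this run, the entry stored under key 2 (the largest) is written first and the 0.0 entry last.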
Execution then returns to line 153 of DistributedLanczosSolver and continues from there:
3. Job:
    Path rawEigenVectorPath = new Path(outputPath, RAW_EIGENVECTORS);
    return new EigenVerificationJob().run(inputPath, rawEigenVectorPath, outputPath, outputTmpPath,
        maxError, minEigenvalue, inMemory,
        getConf() != null ? new Configuration(getConf()) : new Configuration());
A path is initialized first, and then the run method of EigenVerificationJob is called directly, so the rest of the analysis moves over to EigenVerificationJob.
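Note: what is rawEigen? From the analysis above, rawEigen is nothing more than the singularVectors and singularValues held in state.
Share, grow, enjoy.
If you repost this, please cite the blog address: http://blog.csdn.net/fansy1990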