Mahout distributed Lanczos SVD: a summary based on the MAHOUT-180 comments

Source: http://issues.apache.org/jira/browse/MAHOUT-180
1. A Hadoop version of the Lanczos algorithm for performing SVD on sparse matrices. It performs well on sparse input.

2. The main work in parallelizing Lanczos is the parallelized multiplication of your input matrix (or its square) by vectors. The input matrix lives in HDFS, and the Lanczos SVD method simply leaves it there (the matrix stays stored in distributed form, so no additional data transfer is needed) and sends one vector at a time to perform the parallelized matrix*vector product.
The core operation is therefore matrix*vector, or in some cases (the square of the matrix)*vector: M^T M * vector.
The implementation avoids squaring the input matrix when it is symmetric: a symmetric matrix is used as is, while a non-symmetric matrix is squared for you first.
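
To make the M^T M * vector step concrete, here is a minimal sketch in plain Python. It is not Mahout's code: the function names, the dict-per-row storage and the tiny matrix are assumptions for illustration only. It shows that M^T (M v) can be accumulated one sparse row at a time as the sum of (r . v) * r, so the squared matrix never has to be materialized and only the vector travels.

```python
from collections import defaultdict

def row_contribution(row, v):
    """'Map' step: for one sparse row r (dict col -> value), return (r . v) * r."""
    dot = sum(val * v[col] for col, val in row.items())
    return {col: dot * val for col, val in row.items()}

def times_squared(rows, v):
    """'Reduce' step: sum the per-row contributions to obtain M^T (M v)."""
    out = defaultdict(float)
    for row in rows:
        for col, val in row_contribution(row, v).items():
            out[col] += val
    return [out[i] for i in range(len(v))]

# Hypothetical 3x2 sparse matrix, stored as one dict per row.
rows = [{0: 1.0, 1: 2.0}, {1: 3.0}, {0: 4.0}]
v = [1.0, 1.0]
result = times_squared(rows, v)
print(result)
```

Each row's contribution depends only on that row and on v, which is why the rows can stay where they are stored while the work is spread across mappers.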

3. The author's unit tests show that the Lanczos implementation is working well.
4. Convert the text to sparse vectors with SparseVectorsFromSequenceFiles:
$HADOOP_HOME/bin/hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.text.SparseVectorsFromSequenceFiles -i text_path -o corpus_as_vectors_path -seq true -w tfidf -chunk 1000 --minSupport 1 --minDF 5 --maxDFPercent 50 --norm 2

Then run DistributedLanczosSolver to compute the singular values:
$HADOOP_HOME/bin/hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver -i corpus_as_vectors_path -o corpus_svd_path -nr 1 -nc <numFeatures> --rank 100

Read the comment that contains these commands carefully, especially the part below it explaining what desiredRank means.

5. EigenVerificationJob can be used to discard bad eigenvalues.
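
The note above does not say how a "bad" eigenvalue is recognized. One plausible criterion, sketched below in plain Python, is to check how far A v is from being parallel to v. This is an assumption for illustration, not a description of EigenVerificationJob itself; the function names and the `max_error` threshold are hypothetical.

```python
import math

def eigen_error(matvec, v):
    """Return (estimated eigenvalue, 1 - |cos of the angle between v and A v|)."""
    av = matvec(v)
    dot = sum(a * b for a, b in zip(av, v))
    norm_v = math.sqrt(sum(x * x for x in v))
    norm_av = math.sqrt(sum(x * x for x in av))
    if norm_v == 0.0 or norm_av == 0.0:
        return 0.0, 1.0  # degenerate candidate: treat as maximally wrong
    return dot / (norm_v * norm_v), 1.0 - abs(dot) / (norm_v * norm_av)

def keep_good(matvec, candidates, max_error=0.05):
    """Keep (eigenvalue, vector) pairs whose error is at most max_error."""
    kept = []
    for v in candidates:
        value, error = eigen_error(matvec, v)
        if error <= max_error:
            kept.append((value, v))
    return kept

# Hypothetical symmetric matrix diag(2, 5), given only through its matvec.
def diag_matvec(v):
    return [2.0 * v[0], 5.0 * v[1]]

kept = keep_good(diag_matvec, [[1.0, 0.0], [1.0, 1.0]])
print(kept)
```

Here [1, 0] is a true eigenvector and survives, while [1, 1] is not and is dropped.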

6. Multiplication of a matrix (or the square of a matrix) by a vector is the primary operation of Lanczos, and it is done in one M/R iteration. If you want the top-k singular vectors, you make k passes over the data.
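
The "one pass per step" structure can be seen in a minimal sketch of the textbook Lanczos recurrence, written here in plain Python. It is not Mahout's implementation: the fixed start vector, the function names and the small test matrix are assumptions, and the `matvec` argument stands in for the M/R job that multiplies the matrix (or its square) by a vector.

```python
import math

def lanczos(matvec, n, k):
    """Run up to k Lanczos steps; return (alphas, betas, basis, passes)."""
    # Fixed start vector (an assumption; a random start is also common).
    v = [1.0 / math.sqrt(n)] * n
    v_prev = [0.0] * n
    beta = 0.0
    alphas, betas, basis = [], [], []
    passes = 0
    for _ in range(k):
        basis.append(v)
        w = matvec(v)  # the only step that touches the matrix
        passes += 1
        alpha = sum(wi * vi for wi, vi in zip(w, v))
        # Remove the components along the current and previous basis vectors.
        w = [wi - alpha * vi - beta * pi for wi, vi, pi in zip(w, v, v_prev)]
        alphas.append(alpha)
        beta = math.sqrt(sum(wi * wi for wi in w))
        if beta < 1e-12:  # Krylov space exhausted; stop early
            break
        betas.append(beta)
        v_prev, v = v, [wi / beta for wi in w]
    return alphas, betas, basis, passes

# Hypothetical symmetric matrix diag(1, 2, 3), given only through its matvec.
def diag_matvec(v):
    return [1.0 * v[0], 2.0 * v[1], 3.0 * v[2]]

alphas, betas, basis, passes = lanczos(diag_matvec, n=3, k=2)
print(passes, alphas, betas)
```

The alphas and betas form a small tridiagonal matrix whose eigenvalues approximate those of the input; that small problem is solved locally, so the distributed cost is the matrix*vector call made once per step.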

7. The code seems to be working fine and indeed produces the right number of dense (eigen?) vectors.
