1. Spark MLlib Core Fundamentals: Vectors and Matrices

1.1 Vector

1.1.1 dense vector


   * Creates a dense vector from its values.



  def dense(firstValue: Double, otherValues: Double*): Vector =

    new DenseVector((firstValue +: otherValues).toArray)


  // A dummy implicit is used to avoid signature collision with the one generated by @varargs.


   * Creates a dense vector from a double array.


  def dense(values: Array[Double]): Vector = new DenseVector(values)


scala>   import org.apache.spark.mllib.linalg.{Vectors, Matrices}

scala>   val A1 = (1 to 5).toArray.map { f => f.toDouble }

A1: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0)

scala>   val V1 = Vectors.dense(A1)

V1: org.apache.spark.mllib.linalg.Vector = [1.0,2.0,3.0,4.0,5.0]

scala>   val V2 = Vectors.dense(2.0, 2.0, 2.0, 2.0, 2.0, 2.0)

V2: org.apache.spark.mllib.linalg.Vector = [2.0,2.0,2.0,2.0,2.0,2.0]

1.1.2 sparse vector



   * Creates a sparse vector providing its index array and value array.


   * @param size vector size.

   * @param indices index array, must be strictly increasing.

   * @param values value array, must have the same length as indices.


  def sparse(size: Int, indices: Array[Int], values: Array[Double]): Vector =

    new SparseVector(size, indices, values)

  def sparse(size: Int, elements: Seq[(Int, Double)]): Vector

  def sparse(size: Int, elements: JavaIterable[(JavaInteger, JavaDouble)]): Vector


scala>   val S1 = Vectors.sparse(5, Array(0, 1, 2, 3, 4), Array(1.0, 2.0, 3.0, 4.0, 5.0))

S1: org.apache.spark.mllib.linalg.Vector = (5,[0,1,2,3,4],[1.0,2.0,3.0,4.0,5.0])

scala>   val S2 = Vectors.sparse(5, Seq((0, 1.0), (1, 2.0), (2,3.0), (3,4.0), (4,5.0)))

S2: org.apache.spark.mllib.linalg.Vector = (5,[0,1,2,3,4],[1.0,2.0,3.0,4.0,5.0])
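To make the (size, indices, values) layout concrete, here is a minimal plain-Scala sketch that does not depend on Spark. It expands such a triple into a dense array. The helper name `sparseToDense` is an assumption made for illustration and is not part of MLlib.

```scala
// Hypothetical helper (not an MLlib API): expand a sparse (size, indices, values)
// triple into a dense array; positions not listed in `indices` stay 0.0.
def sparseToDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
  require(indices.length == values.length, "indices and values must have the same length")
  val dense = new Array[Double](size)
  indices.zip(values).foreach { case (i, v) => dense(i) = v }
  dense
}

// A vector of size 5 with non-zero entries only at positions 1 and 3.
val expanded = sparseToDense(5, Array(1, 3), Array(2.0, 4.0))
```

The sparse form only pays off when most positions are zero; S1 and S2 above list every index, so they store more data than the equivalent dense vector.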

1.2 Matrix

1.2.1 dense matrix



   * Creates a column-major dense matrix.


   * @param numRows number of rows

   * @param numCols number of columns

   * @param values matrix entries in column major


  def dense(numRows: Int, numCols: Int, values: Array[Double]): Matrix = {

    new DenseMatrix(numRows, numCols, values)

  }



scala>   val A2 = (1 to 25).toArray.map { f => f.toDouble }

A2: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0)


scala>   val M1 = Matrices.dense(5, 5, A2)

M1: org.apache.spark.mllib.linalg.Matrix =

1.0  6.0   11.0  16.0  21.0 

2.0  7.0   12.0  17.0  22.0 

3.0  8.0   13.0  18.0  23.0 

4.0  9.0   14.0  19.0  24.0 

5.0  10.0  15.0  20.0  25.0 
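The printed matrix shows that the values 1.0 to 5.0 fill the first column, not the first row. A small plain-Scala sketch of the column-major index rule follows; the helper `entryAt` is hypothetical and only illustrates the layout.

```scala
// Hypothetical helper (not an MLlib API): read entry (i, j) from a
// column-major array that holds a matrix with numRows rows.
def entryAt(values: Array[Double], numRows: Int, i: Int, j: Int): Double =
  values(i + j * numRows)

// Same data as A2 above, laid out column by column for a 5 x 5 matrix.
val colMajor = (1 to 25).map(_.toDouble).toArray

// Row 0 picks every fifth element, matching the first printed row of M1.
val firstRow = (0 until 5).map(j => entryAt(colMajor, 5, 0, j))
```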


1.2.2 sparse matrix



   * Creates a column-major sparse matrix in Compressed Sparse Column (CSC) format.


   * @param numRows number of rows

   * @param numCols number of columns

   * @param colPtrs the index corresponding to the start of a new column

   * @param rowIndices the row index of the entry

   * @param values non-zero matrix entries in column major


  def sparse(

     numRows: Int,

     numCols: Int,

     colPtrs: Array[Int],

     rowIndices: Array[Int],

     values: Array[Double]): Matrix = {

    new SparseMatrix(numRows, numCols, colPtrs, rowIndices, values)

  }



   * Column-major sparse matrix.

   * The entry values are stored in Compressed Sparse Column (CSC) format.

   * For example, the following matrix

   * {{{

   *   1.0 0.0 4.0

   *   0.0 3.0 5.0

   *   2.0 0.0 6.0

   * }}}

   * is stored as `values: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]`,

   * `rowIndices=[0, 2, 1, 0, 1, 2]`, `colPointers=[0, 2, 3, 6]`.


scala>   val M2 = Matrices.sparse(3, 3, Array(0, 2, 3, 6), Array(0, 2, 1, 0, 1, 2), Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

M2: org.apache.spark.mllib.linalg.Matrix =

3 x 3 CSCMatrix

(0,0) 1.0

(2,0) 2.0

(1,1) 3.0

(0,2) 4.0

(1,2) 5.0

(2,2) 6.0
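The way the three CSC arrays fit together can be sketched in plain Scala. The helper `cscEntries` below is an assumption for illustration, not an MLlib function; it walks the column pointers and lists the (row, column, value) entries in storage order.

```scala
// Hypothetical helper (not an MLlib API): list the (row, col, value) entries
// encoded by CSC arrays. Column j owns positions colPtrs(j) until colPtrs(j + 1).
def cscEntries(
    colPtrs: Array[Int],
    rowIndices: Array[Int],
    values: Array[Double]): Seq[(Int, Int, Double)] =
  for {
    j <- 0 until colPtrs.length - 1
    k <- colPtrs(j) until colPtrs(j + 1)
  } yield (rowIndices(k), j, values(k))

// Same arrays as the M2 example above.
val decoded = cscEntries(
  Array(0, 2, 3, 6),
  Array(0, 2, 1, 0, 1, 2),
  Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
```

The decoded triples come out in the same order as the entries printed for M2.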


1.3 Distributed Matrix

1.3.1 RowMatrix



* @param rows rows stored as an RDD[Vector]

 * @param nRows number of rows. A non-positive value means unknown, and then the number of rows will be determined by the number of records in the RDD `rows`.

 * @param nCols number of columns. A non-positive value means unknown, and then the number of columns will be determined by the size of the first row.

new RowMatrix(rows: RDD[Vector])

new RowMatrix(rows: RDD[Vector], nRows: Long, nCols: Int)

scala>   val rdd1= sc.parallelize(Array(Array(1.0,2.0,3.0,4.0),Array(2.0,3.0,4.0,5.0),Array(3.0,4.0,5.0,6.0))).map(f => Vectors.dense(f))

rdd1: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[38] at map at <console>:31

scala>   import org.apache.spark.mllib.linalg.distributed.RowMatrix

scala>   val RM = new RowMatrix(rdd1)

RM: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@6196e877


1)columnSimilarities(threshold: Double): CoordinateMatrix


val simic1 = RM.columnSimilarities(0.5)


2)columnSimilarities(): CoordinateMatrix


val simic2 = RM.columnSimilarities()
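columnSimilarities returns cosine similarities between pairs of columns. For a matrix small enough to hold locally, the exact value for one pair can be sketched in plain Scala as below. The helper `columnCosine` is hypothetical, and the sketch ignores the sampling that the threshold variant uses.

```scala
// Hypothetical helper (not an MLlib API): exact cosine similarity of columns a and b
// of a small local matrix given as an array of rows.
def columnCosine(rows: Array[Array[Double]], a: Int, b: Int): Double = {
  val dot = rows.map(r => r(a) * r(b)).sum
  val normA = math.sqrt(rows.map(r => r(a) * r(a)).sum)
  val normB = math.sqrt(rows.map(r => r(b) * r(b)).sum)
  dot / (normA * normB)
}

// Same three rows as rdd1 above.
val sample = Array(
  Array(1.0, 2.0, 3.0, 4.0),
  Array(2.0, 3.0, 4.0, 5.0),
  Array(3.0, 4.0, 5.0, 6.0))

// Columns 0 and 1: dot product 20, norms sqrt(14) and sqrt(29).
val sim01 = columnCosine(sample, 0, 1)
```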


3)computeColumnSummaryStatistics(): MultivariateStatisticalSummary


val simic3 = RM.computeColumnSummaryStatistics()





4)computeCovariance(): Matrix


val cc1 = RM.computeCovariance

cc1: org.apache.spark.mllib.linalg.Matrix =

1.0  1.0  1.0  1.0 

1.0  1.0  1.0  1.0 

1.0  1.0  1.0  1.0 

1.0  1.0  1.0  1.0 
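The all-ones result can be checked by hand: every column of the example takes the values m-1, m, m+1 around its mean m, so each pair of columns has sample covariance (1 + 0 + 1) / (3 - 1) = 1. A plain-Scala sketch of that check follows; `sampleCov` is a hypothetical helper, and the division by n - 1 is inferred from the output above.

```scala
// Hypothetical helper (not an MLlib API): sample covariance of columns a and b,
// dividing by (n - 1).
def sampleCov(rows: Array[Array[Double]], a: Int, b: Int): Double = {
  val n = rows.length
  val meanA = rows.map(_(a)).sum / n
  val meanB = rows.map(_(b)).sum / n
  rows.map(r => (r(a) - meanA) * (r(b) - meanB)).sum / (n - 1)
}

// Same three rows as rdd1 above.
val data = Array(
  Array(1.0, 2.0, 3.0, 4.0),
  Array(2.0, 3.0, 4.0, 5.0),
  Array(3.0, 4.0, 5.0, 6.0))

// Full 4 x 4 covariance matrix of the columns.
val cov = Array.tabulate(4, 4)((a, b) => sampleCov(data, a, b))
```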


5)computeGramianMatrix(): Matrix

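Computes the Gramian matrix `A^T A`.

Given a real matrix A, the matrix A^T A is the Gramian matrix of the column vectors of A, while the matrix A A^T is the Gramian matrix of the row vectors of A.

val cg1 = RM.computeGramianMatrix()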

cg1: org.apache.spark.mllib.linalg.Matrix =

14.0  20.0  26.0  32.0 

20.0  29.0  38.0  47.0 

26.0  38.0  50.0  62.0 

32.0  47.0  62.0  77.0 
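The Gramian values printed above can be reproduced locally. The following plain-Scala sketch computes A^T A for the same three rows; `gramian` is a hypothetical helper, not the MLlib implementation.

```scala
// Hypothetical helper (not an MLlib API): Gramian A^T A of a small local matrix
// given as rows; entry (a, b) is the dot product of columns a and b.
def gramian(rows: Array[Array[Double]]): Array[Array[Double]] = {
  val n = rows.head.length
  Array.tabulate(n, n)((a, b) => rows.map(r => r(a) * r(b)).sum)
}

// Same three rows as rdd1 above.
val g = gramian(Array(
  Array(1.0, 2.0, 3.0, 4.0),
  Array(2.0, 3.0, 4.0, 5.0),
  Array(3.0, 4.0, 5.0, 6.0)))
```

For instance, the top-left entry is 1*1 + 2*2 + 3*3 = 14, the squared norm of the first column.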


6)computePrincipalComponents(k: Int): Matrix


val pc1 = RM.computePrincipalComponents(3)

pc1: org.apache.spark.mllib.linalg.Matrix =

-0.5000000000000002  0.8660254037844388    1.6653345369377348E-16 

-0.5000000000000002  -0.28867513459481275  0.8164965809277258     

-0.5000000000000002  -0.28867513459481287  -0.40824829046386296   

-0.5000000000000002  -0.28867513459481287  -0.40824829046386296  


7)computeSVD(k: Int, computeU: Boolean = false, rCond: Double = 1e-9): SingularValueDecomposition[RowMatrix, Matrix]



  val svd = RM.computeSVD(4, true)

  val U = svd.U


  val s = svd.s

  val V = svd.V


8)multiply(B: Matrix): RowMatrix


A.multiply(B) => A * B
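Since B is a local matrix, the product can be formed one row at a time: each row of A is multiplied by B independently. A plain-Scala sketch of that per-row step follows; `rowTimes` is a hypothetical helper and the small matrices are made up for illustration.

```scala
// Hypothetical helper (not an MLlib API): multiply one row vector by a local
// matrix B given as an array of rows, i.e. compute row * B.
def rowTimes(row: Array[Double], b: Array[Array[Double]]): Array[Double] =
  Array.tabulate(b.head.length)(j => row.indices.map(k => row(k) * b(k)(j)).sum)

val aRows = Array(Array(1.0, 2.0), Array(3.0, 4.0))
val bLocal = Array(Array(0.0, 1.0), Array(1.0, 0.0)) // permutation that swaps the two columns

// Apply the same local B to every row of A.
val product = aRows.map(r => rowTimes(r, bLocal))
```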

9)numCols(): Long


10)numRows(): Long


11)rows: RDD[Vector]



1.3.2 IndexedRowMatrix 

This is the second kind of distributed matrix. It is very similar to a RowMatrix, except that its row indices are meaningful. It is backed by an RDD of indexed rows, where each row is represented by its index (long-typed) and a local vector. An IndexedRowMatrix can be created from an RDD[IndexedRow] instance, where IndexedRow is a wrapper over (Long, Vector), and it can be converted to a RowMatrix by dropping its row indices.
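The relationship between indexed rows and plain rows can be modeled without Spark. In the sketch below, `SketchIndexedRow` is a hypothetical stand-in for MLlib's IndexedRow, and an ordinary Seq stands in for the RDD.

```scala
// Hypothetical stand-in for MLlib's IndexedRow: a Long row index plus a local vector.
case class SketchIndexedRow(index: Long, vector: Array[Double])

val indexedRows = Seq(
  SketchIndexedRow(0L, Array(1.0, 2.0)),
  SketchIndexedRow(2L, Array(3.0, 4.0)))

// Dropping the indices leaves only the vectors, which is the data a RowMatrix holds.
val plainRows: Seq[Array[Double]] = indexedRows.map(_.vector)
```

After the conversion the information about which row index each vector had is gone, so it cannot be undone.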



1.3.3 DistributedMatrix

Represents a distributively stored matrix backed by one or more RDDs.


1.3.4 CoordinateMatrix

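This is the third kind of distributed matrix. A CoordinateMatrix is also a distributed matrix backed by an RDD. As the name suggests, each entry is a tuple (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry value. A CoordinateMatrix is a good fit when both dimensions of the matrix are very large and the matrix is very sparse. It is created from an RDD[MatrixEntry] instance, where MatrixEntry is a wrapper over (Long, Long, Double). A CoordinateMatrix can be converted to an IndexedRowMatrix.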
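A coordinate layout and its regrouping by row can be sketched in plain Scala. `SketchEntry` is a hypothetical stand-in for MLlib's MatrixEntry, and an ordinary Seq stands in for the RDD; the grouping only illustrates the idea behind moving to an indexed-row layout.

```scala
// Hypothetical stand-in for MLlib's MatrixEntry: (row, column, value).
case class SketchEntry(i: Long, j: Long, value: Double)

// Only the non-zero entries of a sparse matrix are stored.
val coo = Seq(
  SketchEntry(0L, 1L, 2.0),
  SketchEntry(2L, 0L, 5.0),
  SketchEntry(2L, 2L, 7.0))

// Group entries by row index; each row keeps its (column, value) pairs.
val byRow: Map[Long, Seq[(Long, Double)]] =
  coo.groupBy(_.i).map { case (i, es) => i -> es.map(e => (e.j, e.value)) }
```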


1.3.5 BlockMatrix


 * :: Experimental ::


 * Represents a distributed matrix in blocks of local matrices.


 * @param blocks The RDD of sub-matrix blocks ((blockRowIndex, blockColIndex), sub-matrix) that

 *               form this distributed matrix. If multiple blocks with the same index exist, the

 *               results for operations like add and multiply will be unpredictable.

 * @param rowsPerBlock Number of rows that make up each block. The blocks forming the final

 *                     rows are not required to have the given number of rows

 * @param colsPerBlock Number of columns that make up each block. The blocks forming the final

 *                     columns are not required to have the given number of columns

 * @param nRows Number of rows of this matrix. If the supplied value is less than or equal to zero,

 *              the number of rows will be calculated when `numRows` is invoked.

 * @param nCols Number of columns of this matrix. If the supplied value is less than or equal to

 *              zero, the number of columns will be calculated when `numCols` is invoked.
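The rowsPerBlock and colsPerBlock parameters determine a grid of blocks, and the note that the final blocks may be smaller follows from ceiling division. A plain-Scala sketch of that arithmetic is below; both helpers are hypothetical and are not BlockMatrix methods.

```scala
// Hypothetical helpers (not MLlib APIs) for reasoning about a block layout.

// Number of blocks needed along one dimension (ceiling division).
def numBlocks(n: Long, perBlock: Int): Int =
  math.ceil(n.toDouble / perBlock).toInt

// Block coordinates (blockRowIndex, blockColIndex) of the block containing entry (i, j).
def blockOf(i: Long, j: Long, rowsPerBlock: Int, colsPerBlock: Int): (Int, Int) =
  ((i / rowsPerBlock).toInt, (j / colsPerBlock).toInt)

// A 5 x 4 matrix cut into 2 x 2 blocks: 3 block rows (the last holds 1 row), 2 block columns.
val grid = (numBlocks(5L, 2), numBlocks(4L, 2))

// The bottom-right entry (4, 3) lives in the last block of the grid.
val home = blockOf(4L, 3L, 2, 2)
```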




