Source definition:
/**
* Creates a dense vector from its values.
*/
@varargs
def dense(firstValue: Double, otherValues: Double*): Vector =
new DenseVector((firstValue +: otherValues).toArray)
// A dummy implicit is used to avoid signature collision with the one generated by @varargs.
/**
* Creates a dense vector from a double array.
*/
def dense(values: Array[Double]): Vector = new DenseVector(values)
Usage:
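The REPL examples in this and the following sections assume a spark-shell session. A minimal set of imports (a sketch; the distributed types used later are imported in their own sketches) would be:
import org.apache.spark.mllib.linalg.{Vector, Vectors, Matrix, Matrices}
import org.apache.spark.mllib.linalg.distributed.RowMatrix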
scala> val A1 = (1 to 5).toArray.map {f => f.toDouble}
A1: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0)
scala> val V1 = Vectors.dense(A1)
V1: org.apache.spark.mllib.linalg.Vector = [1.0,2.0,3.0,4.0,5.0]
scala> val V2 = Vectors.dense(2.0, 2.0, 2.0, 2.0, 2.0, 2.0)
V2: org.apache.spark.mllib.linalg.Vector = [2.0,2.0,2.0,2.0,2.0,2.0]
Source definition:
/**
* Creates a sparse vector providing its index array and value array.
*
* @param size vector size.
* @param indices index array, must be strictly increasing.
* @param values value array, must have the same length as indices.
*/
def sparse(size: Int, indices: Array[Int], values: Array[Double]): Vector =
new SparseVector(size, indices, values)
def sparse(size: Int, elements: Seq[(Int, Double)]): Vector = { ... }
def sparse(size: Int, elements: JavaIterable[(JavaInteger, JavaDouble)]): Vector = { ... }
Usage:
scala> val S1 = Vectors.sparse(5, Array(0, 1, 2, 3, 4), Array(1.0, 2.0, 3.0, 4.0, 5.0))
S1: org.apache.spark.mllib.linalg.Vector = (5,[0,1,2,3,4],[1.0,2.0,3.0,4.0,5.0])
scala> val S2 = Vectors.sparse(5, Seq((0, 1.0), (1, 2.0), (2,3.0), (3,4.0), (4,5.0)))
S2: org.apache.spark.mllib.linalg.Vector = (5,[0,1,2,3,4],[1.0,2.0,3.0,4.0,5.0])
Source definition:
/**
* Creates a column-major dense matrix.
*
* @param numRows number of rows
* @param numCols number of columns
* @param values matrix entries in column major
*/
def dense(numRows: Int, numCols: Int, values: Array[Double]): Matrix = {
new DenseMatrix(numRows, numCols, values)
}
Usage:
scala> val A2 = (1 to 25).toArray.map { f => f.toDouble }
A2: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0)
scala> val M1 = Matrices.dense(5, 5, A2)
M1: org.apache.spark.mllib.linalg.Matrix =
1.0 6.0 11.0 16.0 21.0
2.0 7.0 12.0 17.0 22.0
3.0 8.0 13.0 18.0 23.0
4.0 9.0 14.0 19.0 24.0
5.0 10.0 15.0 20.0 25.0
Source definition:
/**
* Creates a column-major sparse matrix in Compressed Sparse Column (CSC) format.
*
* @param numRows number of rows
* @param numCols number of columns
* @param colPtrs the index corresponding to the start of a new column
* @param rowIndices the row index of the entry
* @param values non-zero matrix entries in column major
*/
def sparse(
numRows: Int,
numCols: Int,
colPtrs: Array[Int],
rowIndices: Array[Int],
values: Array[Double]): Matrix = {
new SparseMatrix(numRows, numCols, colPtrs, rowIndices, values)
}
/**
* Column-major sparse matrix.
* The entry values are stored in Compressed Sparse Column (CSC) format.
* For example, the following matrix
* {{{
* 1.0 0.0 4.0
* 0.0 3.0 5.0
* 2.0 0.0 6.0
* }}}
* is stored as `values: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]`,
* `rowIndices=[0, 2, 1, 0, 1, 2]`, `colPointers=[0, 2, 3, 6]`.
*/
Usage:
scala> val M2 = Matrices.sparse(3, 3, Array(0, 2, 3, 6), Array(0, 2, 1, 0, 1, 2), Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
M2: org.apache.spark.mllib.linalg.Matrix =
3 x 3 CSCMatrix
(0,0) 1.0
(2,0) 2.0
(1,1) 3.0
(0,2) 4.0
(1,2) 5.0
(2,2) 6.0
A RowMatrix is a row-oriented distributed matrix: its rows are stored distributedly as an RDD, and each row of the matrix is a local vector. This is quite similar to the data matrix of multivariate statistics. Because each row is represented by a local vector, the number of columns is limited to the integer range, though in practice the column count is much smaller.
1. Creating a RowMatrix
* @param rows rows stored as an RDD[Vector]
* @param nRows number of rows. A non-positive value means unknown, and then the number of rows will be determined by the number of records in the RDD `rows`.
* @param nCols number of columns. A non-positive value means unknown, and then the number of columns will be determined by the size of the first row.
new RowMatrix(rows: RDD[Vector])
new RowMatrix(rows: RDD[Vector], nRows: Long, nCols: Int)
scala> val rdd1 = sc.parallelize(Array(Array(1.0,2.0,3.0,4.0),Array(2.0,3.0,4.0,5.0),Array(3.0,4.0,5.0,6.0))).map(f => Vectors.dense(f))
rdd1: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[38] at map at <console>:31
scala> val RM = new RowMatrix(rdd1)
RM: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@6196e877
2. RowMatrix methods
1)columnSimilarities(threshold: Double): CoordinateMatrix
Computes the cosine similarity between columns using a sampling-based approximation; the threshold parameter trades accuracy for computation cost.
val simic1 = RM.columnSimilarities(0.5)
2)columnSimilarities(): CoordinateMatrix
Computes the exact similarity between columns.
val simic2 = RM.columnSimilarities()
3)computeColumnSummaryStatistics(): MultivariateStatisticalSummary
Computes summary statistics (max, min, mean, etc.) for each column.
val simic3 = RM.computeColumnSummaryStatistics()
simic3.max
simic3.min
simic3.mean
4)computeCovariance(): Matrix
Computes the covariance between columns and returns the covariance matrix.
val cc1 = RM.computeCovariance
cc1: org.apache.spark.mllib.linalg.Matrix =
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
5)computeGramianMatrix(): Matrix
Computes the Gramian matrix `A^T A`.
Given a real matrix A, `A^T A` is the Gram matrix of the columns of A, while `A A^T` is the Gram matrix of the rows of A.
val cg1 = RM.computeGramianMatrix
cg1: org.apache.spark.mllib.linalg.Matrix =
14.0 20.0 26.0 32.0
20.0 29.0 38.0 47.0
26.0 38.0 50.0 62.0
32.0 47.0 62.0 77.0
6)computePrincipalComponents(k: Int): Matrix
Computes the top k principal components; rows are treated as observations and columns as variables.
val pc1 = RM.computePrincipalComponents(3)
pc1: org.apache.spark.mllib.linalg.Matrix =
-0.5000000000000002 0.8660254037844388 1.6653345369377348E-16
-0.5000000000000002 -0.28867513459481275 0.8164965809277258
-0.5000000000000002 -0.28867513459481287 -0.40824829046386296
-0.5000000000000002 -0.28867513459481287 -0.40824829046386296
7)computeSVD(k: Int, computeU: Boolean = false, rCond: Double = 1e-9):
SingularValueDecomposition[RowMatrix, Matrix]
Computes the singular value decomposition of the matrix, keeping the top k singular values.
val svd = RM.computeSVD(4, true)
val U = svd.U
U.rows.foreach(println)
val s = svd.s
val V = svd.V
8)multiply(B: Matrix): RowMatrix
Right-multiplies this matrix by a local matrix B: A.multiply(B) computes A * B. A usage sketch appears after this list of methods.
9)numCols(): Long
Returns the number of columns of the matrix.
10)numRows(): Long
Returns the number of rows of the matrix.
11)rows: RDD[Vector]
Returns the rows of the matrix as an RDD[Vector].
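A minimal sketch of the last four methods, continuing with RM defined above (the local matrix B below is made up for illustration):
// Right-multiply RM (3 x 4) by a local 4 x 2 dense matrix (column-major values).
val B = Matrices.dense(4, 2, Array(1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0))
val AB = RM.multiply(B)             // a 3 x 2 RowMatrix
RM.numRows()                        // 3
RM.numCols()                        // 4
RM.rows.collect().foreach(println)  // the underlying RDD[Vector]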
The IndexedRowMatrix is the second kind of distributed matrix. It is very similar to RowMatrix, except that it carries meaningful row indices. It is backed by an RDD of indexed rows, where each row is represented by its long-typed index and a local vector. An IndexedRowMatrix can be created from an RDD[IndexedRow], where IndexedRow is a wrapper over (Long, Vector), and it can be converted to a RowMatrix by dropping its row indices.
As with RowMatrix, this resembles the data matrix of multivariate statistics: because each row is a local vector, the number of columns is limited to the integer range, but in practice the column count is much smaller.
Creation and usage are similar to RowMatrix; a minimal sketch follows.
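A minimal sketch, reusing rdd1 from the RowMatrix section; zipWithIndex is used here only to manufacture row indices:
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
// Attach a long-typed index to each row vector and wrap it in an IndexedRow.
val indexedRows = rdd1.zipWithIndex.map { case (v, i) => IndexedRow(i, v) }
val IRM = new IndexedRowMatrix(indexedRows)
// Dropping the row indices converts it back to a RowMatrix.
val backToRowMatrix = IRM.toRowMatrix()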
Represents a distributively stored matrix backed by one or more RDDs.
The CoordinateMatrix is the third kind of distributed matrix; like the others it is backed by an RDD. As the name suggests, each entry is a tuple (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry value. A CoordinateMatrix is the right choice only when both dimensions of the matrix are huge and the matrix is very sparse. It is created from an RDD[MatrixEntry], where MatrixEntry is a wrapper over (Long, Long, Double), and it can be converted to an IndexedRowMatrix. A sketch follows.
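A minimal sketch; the entries below are made up for illustration:
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix}
// Each MatrixEntry is (row index, column index, value).
val entries = sc.parallelize(Seq(
  MatrixEntry(0L, 0L, 1.0),
  MatrixEntry(1L, 2L, 2.0),
  MatrixEntry(2L, 1L, 3.0)))
val CM = new CoordinateMatrix(entries)
CM.numRows()                             // 3
CM.numCols()                             // 3
val asIndexed = CM.toIndexedRowMatrix()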
/**
* :: Experimental ::
*
* Represents a distributed matrix in blocks of local matrices.
*
* @param blocks The RDD of sub-matrix blocks ((blockRowIndex, blockColIndex), sub-matrix) that
* form this distributed matrix. If multiple blocks with the same index exist, the
* results for operations like add and multiply will be unpredictable.
* @param rowsPerBlock Number of rows that make up each block. The blocks forming the final
* rows are not required to have the given number of rows
* @param colsPerBlock Number of columns that make up each block. The blocks forming the final
* columns are not required to have the given number of columns
* @param nRows Number of rows of this matrix. If the supplied value is less than or equal to zero,
* the number of rows will be calculated when `numRows` is invoked.
* @param nCols Number of columns of this matrix. If the supplied value is less than or equal to
* zero, the number of columns will be calculated when `numCols` is invoked.
*/
A distributed block matrix. See: http://de.wikipedia.org/wiki/Blockmatrix. A sketch of building one follows.
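A minimal sketch, converting the CoordinateMatrix CM from the previous sketch into a BlockMatrix (the 2 x 2 block size is an arbitrary choice for illustration):
import org.apache.spark.mllib.linalg.distributed.BlockMatrix
// Split CM into 2 x 2 sub-matrix blocks and cache the result.
val BM: BlockMatrix = CM.toBlockMatrix(2, 2).cache()
// validate() checks that the blocks are consistent with the declared dimensions.
BM.validate()
// Block matrices support distributed add and multiply.
val product = BM.multiply(BM.transpose)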