Matrices in Spark MLlib

  • Underlying components of Spark MLlib
  • Data storage in MLlib
  • Local
    • Local vector
    • Labeled point
    • Sparse data
    • Local matrix
  • Distributed matrix
    • RowMatrix
    • IndexedRowMatrix
    • CoordinateMatrix
    • BlockMatrix

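Underlying components of Spark MLlib

BLAS/LAPACK layer
LAPACK (Linear Algebra Package) is a library written in Fortran that, as the name suggests, solves general linear algebra problems. BLAS (Basic Linear Algebra Subprograms) is the library of basic routines, and LAPACK is itself built on top of BLAS.
Netlib-java
A Java interface layer that wraps BLAS/LAPACK.
Breeze
A numerical processing library written in Scala that provides APIs for vector and matrix operations.
Library dependencies
MLlib uses netlib-java underneath, which depends on Fortran routines. The gfortran runtime library therefore needs to be installed on the nodes in advance.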

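Data storage in MLlib

MLlib supports local vectors and local matrices, as well as distributed matrices backed by one or more RDDs.

In MLlib's supervised learning, a training example is called a labeled point.

The official documentation (1.6; 2.0 is the same) covers all of this in detail.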

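Local

Local vector

A local vector has integer-typed, 0-based indices and double-typed values, and is stored on a single machine.
MLlib supports two types of local vectors:
- Dense vector
  - Backed by a single double array that holds the value of every element
- Sparse vector
  - Backed by two parallel arrays: one holds the indices, the other holds the corresponding values

The base class of local vectors is Vector, with two implementations:
* DenseVector
* SparseVector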

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Create a dense vector (1.0, 0.0, 3.0).
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
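
To make the two-parallel-array layout of a sparse vector concrete, here is a minimal sketch in plain Scala that expands (size, indices, values) back into a dense array. The object SparseVectorSketch and its toDense helper are hypothetical names for illustration only; this is not MLlib's implementation.

```scala
object SparseVectorSketch {
  // Expands the two parallel arrays of a sparse vector into a dense array.
  // Positions that are not listed in `indices` stay at 0.0.
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    require(indices.length == values.length, "indices and values must have the same length")
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }
}
```

Calling SparseVectorSketch.toDense(3, Array(0, 2), Array(1.0, 3.0)) yields the same three values as the dense vector (1.0, 0.0, 3.0) above.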

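Labeled point

A labeled point is a local vector, either dense or sparse, associated with a label/response.

It is typically used in supervised learning.

The label is stored as a Double.

For binary classification, the label should be either 0 (negative) or 1 (positive).

For multiclass classification, labels should be class indices starting from 0.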

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

// Create a labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

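Sparse data

Sparse training data is very common in practice. MLlib supports reading training data in LIBSVM format, the default format used by LIBSVM and LIBLINEAR.

Each line of the file represents one labeled sparse feature vector.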

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
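
In the LIBSVM text format each line has the shape "label index1:value1 index2:value2 ...", with one-based indices in ascending order; after loading, MLlib works with zero-based indices. The sketch below parses a single line in plain Scala to show that shift. LibSvmLineSketch and parse are hypothetical names, and the code is an illustration rather than what MLUtils.loadLibSVMFile does internally.

```scala
object LibSvmLineSketch {
  // Parses "label index1:value1 index2:value2 ..." into
  // (label, zero-based indices, values).
  def parse(line: String): (Double, Array[Int], Array[Double]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val pairs = tokens.tail.map { item =>
      val parts = item.split(":")
      // LIBSVM indices are one-based, so shift them to zero-based.
      (parts(0).toInt - 1, parts(1).toDouble)
    }
    (label, pairs.map(_._1), pairs.map(_._2))
  }
}
```

For example, the line "1 1:1.0 3:3.0" parses to label 1.0 with indices (0, 2) and values (1.0, 3.0), which matches the sparse vector used in the earlier examples.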

Local matrix

A local matrix has integer-typed row and column indices and double-typed values, stored on a single machine. MLlib supports dense matrices, whose entry values are stored in a single double array in column-major order, and sparse matrices, whose non-zero entry values are stored in the Compressed Sparse Column (CSC) format in column-major order.

The base class of local matrices is Matrix, with two implementations:
- DenseMatrix
- SparseMatrix

Entries are stored in column-major order, as the following dense matrix illustrates:


(1.0,2.0)
(3.0,4.0)
(5.0,6.0)

The matrix above is stored in the one-dimensional array [1.0, 3.0, 5.0, 2.0, 4.0, 6.0] with matrix size (3, 2): 3 rows and 2 columns.
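
Column-major order implies a simple index formula: entry (i, j) of a matrix with numRows rows sits at position i + j * numRows of the backing array. A minimal sketch in plain Scala, where ColumnMajorSketch and at are hypothetical names used only for illustration:

```scala
object ColumnMajorSketch {
  // Looks up entry (i, j) in a column-major array: columns are laid out
  // one after another, each contributing numRows consecutive values.
  def at(values: Array[Double], numRows: Int, i: Int, j: Int): Double =
    values(i + j * numRows)
}
```

With the array [1.0, 3.0, 5.0, 2.0, 4.0, 6.0] and numRows = 3, entry (0, 1) is read from position 3 and entry (2, 0) from position 2.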

import org.apache.spark.mllib.linalg.{Matrix, Matrices}

// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))

// Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
val sm: Matrix = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9, 6, 8))
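
The three arrays passed to Matrices.sparse are the CSC column pointers, the row indices, and the values: the entries of column j occupy positions colPtrs(j) until colPtrs(j + 1) in the other two arrays. The plain-Scala sketch below decodes that layout into rows; CscSketch and toDense are hypothetical names, and this is an illustration of the format, not MLlib's code.

```scala
object CscSketch {
  // Expands CSC arrays into an array of rows.
  // Column j owns the slice colPtrs(j) until colPtrs(j + 1)
  // of both rowIndices and values.
  def toDense(numRows: Int, numCols: Int, colPtrs: Array[Int],
              rowIndices: Array[Int], values: Array[Double]): Array[Array[Double]] = {
    val dense = Array.fill(numRows, numCols)(0.0)
    for (j <- 0 until numCols; k <- colPtrs(j) until colPtrs(j + 1)) {
      dense(rowIndices(k))(j) = values(k)
    }
    dense
  }
}
```

Decoding (3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9.0, 6.0, 8.0)) gives the rows (9.0, 0.0), (0.0, 8.0), (0.0, 6.0), matching the comment on the sparse matrix above.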

Distributed matrix

A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs. It is very important to choose the right format to store large and distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive. Four types of distributed matrices have been implemented so far.

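In short: values are of type Double, row and column indices are of type Long, and the data is stored in a distributed fashion in one or more RDDs.

The key to storing a distributed matrix is choosing the right format, because converting it to a different distributed format may trigger a global shuffle, which is expensive.

The four distributed matrix implementations are: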

  • RowMatrix
  • IndexedRowMatrix
  • CoordinateMatrix
  • BlockMatrix

RowMatrix

A RowMatrix is a row-oriented distributed matrix without meaningful row indices, backed by an RDD of its rows, where each row is a local vector. Since each row is represented by a local vector, the number of columns is limited by the integer range but it should be much smaller in practice.

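Row-oriented, backed by an RDD whose elements are local vectors.

Because each row is a local vector, the number of columns is bounded by the Int range. The rows themselves carry no meaningful index.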

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

val rows: RDD[Vector] = ... // an RDD of local vectors
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// QR decomposition 
val qrResult = mat.tallSkinnyQR(true)
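
For a concrete starting point, the sketch below builds a small RowMatrix from toy data and reads back its dimensions. It assumes a throwaway local SparkContext created just for the example; in spark-shell the provided sc would be used instead. RowMatrixSketch and run are hypothetical names.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

object RowMatrixSketch {
  // Builds a 3 x 2 RowMatrix from toy data and returns (numRows, numCols).
  def run(): (Long, Long) = {
    // Assumption: a local context for illustration only.
    val conf = new SparkConf().setAppName("RowMatrixSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Each element of the RDD is one row of the matrix.
      val rows: RDD[Vector] = sc.parallelize(Seq(
        Vectors.dense(1.0, 2.0),
        Vectors.dense(3.0, 4.0),
        Vectors.dense(5.0, 6.0)))
      val mat = new RowMatrix(rows)
      (mat.numRows(), mat.numCols())
    } finally {
      sc.stop()
    }
  }
}
```

Note that both dimensions come back as Long even though each row is an Int-indexed local vector.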

IndexedRowMatrix

An IndexedRowMatrix is similar to a RowMatrix but with meaningful row indices. It is backed by an RDD of indexed rows, so that each row is represented by its index (long-typed) and a local vector.

An IndexedRowMatrix can be created from an RDD[IndexedRow] instance, where IndexedRow is a wrapper over (Long, Vector). An IndexedRowMatrix can be converted to a RowMatrix by dropping its row indices.

Backed by an RDD of indexed rows, where each row consists of a Long index and a local vector.

import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
import org.apache.spark.rdd.RDD

val rows: RDD[IndexedRow] = ... // an RDD of indexed rows
// Create an IndexedRowMatrix from an RDD[IndexedRow].
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// Drop its row indices.
val rowMat: RowMatrix = mat.toRowMatrix()

CoordinateMatrix

A CoordinateMatrix is a distributed matrix backed by an RDD of its entries. Each entry is a tuple of (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry value. A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse.

Backed by an RDD of entries, where each entry is a tuple (i: Long, j: Long, value: Double).

Use it only when both dimensions are huge and the matrix is very sparse.

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

val entries: RDD[MatrixEntry] = ... // an RDD of matrix entries
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val mat: CoordinateMatrix = new CoordinateMatrix(entries)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
val indexedRowMatrix = mat.toIndexedRowMatrix()
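
Conceptually, turning coordinate entries into indexed sparse rows means grouping the (i, j, value) triples by row index. The plain-Scala sketch below shows that idea on a local collection. CoordinateSketch, Entry and toSparseRows are hypothetical names, and this illustrates the data shape only, not Spark's distributed implementation.

```scala
object CoordinateSketch {
  // (i, j, value) triples, mirroring the fields of a MatrixEntry.
  type Entry = (Long, Long, Double)

  // Groups entries by row index and returns, per row, the column indices
  // (sorted ascending) with their values: the same information a sparse
  // row vector holds. Column indices are narrowed to Int because local
  // vectors are Int-indexed.
  def toSparseRows(entries: Seq[Entry]): Map[Long, Seq[(Int, Double)]] =
    entries
      .groupBy(_._1)
      .map { case (i, rowEntries) =>
        i -> rowEntries.map { case (_, j, v) => (j.toInt, v) }.sortBy(_._1)
      }
}
```

On a cluster the grouping step moves entries between partitions, which is one way to see why format conversions can be costly.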

BlockMatrix

A BlockMatrix is a distributed matrix backed by an RDD of MatrixBlocks, where a MatrixBlock is a tuple of ((Int, Int), Matrix), where the (Int, Int) is the index of the block, and Matrix is the sub-matrix at the given index with size rowsPerBlock x colsPerBlock. BlockMatrix supports methods such as add and multiply with another BlockMatrix. BlockMatrix also has a helper function validate which can be used to check whether the BlockMatrix is set up properly.

Backed by an RDD of MatrixBlocks, where each MatrixBlock is a tuple ((Int, Int), Matrix).

import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

val entries: RDD[MatrixEntry] = ... // an RDD of (i, j, v) matrix entries
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val coordMat: CoordinateMatrix = new CoordinateMatrix(entries)
// Transform the CoordinateMatrix to a BlockMatrix
val matA: BlockMatrix = coordMat.toBlockMatrix().cache()

// Validate whether the BlockMatrix is set up properly. Throws an Exception when it is not valid.
// Nothing happens if it is valid.
matA.validate()

// Calculate A^T A.
val ata = matA.transpose.multiply(matA)
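
To see how a global entry relates to the (Int, Int) block index, the sketch below maps a position (i, j) to its block and to its offset inside that block, given rowsPerBlock and colsPerBlock. It is a plain-Scala illustration with hypothetical names (BlockIndexSketch, locate). The 1024 x 1024 block size used in the usage note is the default of toBlockMatrix() when called without arguments.

```scala
object BlockIndexSketch {
  // Returns ((blockRow, blockCol), (rowInBlock, colInBlock)) for the
  // global position (i, j) under the given block dimensions.
  def locate(i: Long, j: Long, rowsPerBlock: Int, colsPerBlock: Int): ((Int, Int), (Int, Int)) = {
    val block = ((i / rowsPerBlock).toInt, (j / colsPerBlock).toInt)
    val offset = ((i % rowsPerBlock).toInt, (j % colsPerBlock).toInt)
    (block, offset)
  }
}
```

With 1024 x 1024 blocks, the entry at global position (1500, 10) falls in block (1, 0) at offset (476, 10).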
