Sparse Matrix Storage Formats
The coordinate format (COO) stores a matrix in three arrays: values, rows, and columns.
Array values: the real or complex non-zero elements of the matrix, in any order.
Array rows: the row index of each element in values.
Array columns: the column index of each element in values.
Parameter: nnz, the number of non-zero elements in the matrix. All three arrays have length nnz.
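The three COO arrays can be extracted from a dense matrix with NumPy. The following is a minimal sketch; the matrix `dense` is an arbitrary example chosen for illustration, and the names `values`, `rows`, and `columns` simply mirror the array names used in the text.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Illustrative dense matrix (not taken from the original text).
dense = np.array([[1, 0, 2],
                  [0, 0, 3],
                  [4, 0, 0]])

# Coordinates of the non-zero entries. np.nonzero returns them in
# row-major order, although COO itself allows any order.
rows, columns = np.nonzero(dense)
values = dense[rows, columns]
nnz = len(values)  # all three arrays have this length

# SciPy accepts the same triplet in the form (data, (row, col)).
coo = coo_matrix((values, (rows, columns)), shape=dense.shape)
print(values, rows, columns, nnz)
```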
If the non-zero elements of a sparse matrix are confined to a small number of diagonals, the diagonal storage format (DIA) reduces the amount of information needed to locate them. This format is particularly useful for matrices that arise from finite element or finite difference discretizations.
The Intel MKL diagonal storage format is specified by two arrays, values and distance, and two parameters: ndiag, the number of non-empty diagonals, and lval, the declared leading dimension in the calling (sub)programs.
values: A real or complex two-dimensional array dimensioned as lval by ndiag. Each column contains the non-zero elements of one diagonal of A. The key point of the storage is that each element in values retains its row number from the original matrix. To achieve this, diagonals in the lower triangular part of the matrix are padded from the top, and those in the upper triangular part are padded from the bottom. The absolute value of distance(i) is the number of elements padded for diagonal i.
distance: An integer array of dimension ndiag. Element i of distance is the distance between the i-th stored diagonal and the main diagonal. The distance is positive if the diagonal is above the main diagonal and negative if it is below. The main diagonal has a distance of zero.
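The padding rule is easier to see on a small case. The sketch below builds MKL-style values and distance arrays from a square dense matrix in plain NumPy, assuming lval equals the matrix order and zero-based indexing. The helper name `to_mkl_dia` and the example matrix are hypothetical and not part of any library.

```python
import numpy as np

def to_mkl_dia(a):
    """Return (values, distance) for a square dense matrix, padding with zeros."""
    n = a.shape[0]
    # Offsets of the diagonals that hold at least one non-zero element.
    distance = [d for d in range(-n + 1, n) if np.any(np.diagonal(a, d))]
    values = np.zeros((n, len(distance)), dtype=a.dtype)  # lval = n
    for k, d in enumerate(distance):
        for i in range(n):
            j = i + d  # column of the element that row i contributes to diagonal d
            if 0 <= j < n:
                values[i, k] = a[i, j]  # the element keeps its row number i
    return values, np.array(distance)

a = np.array([[1, 2, 0, 0],
              [3, 4, 5, 0],
              [0, 6, 7, 8],
              [0, 0, 9, 10]])
values, distance = to_mkl_dia(a)
print(distance)
print(values)
```

In the printed result the sub-diagonal column starts with one padding zero and the super-diagonal column ends with one, which is the top and bottom padding described above.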
The Intel MKL compressed sparse row (CSR) format is specified by four arrays: values, columns, pointerB, and pointerE. They are described below in terms of the values, row positions, and column positions of the non-zero elements of a sparse matrix A.
values: A real or complex array that contains the non-zero elements of A. The non-zero elements are mapped into the values array using row-major storage order.
columns: Element i of the integer array columns is the number of the column in A that contains the i-th value in the values array.
pointerB: Element j of this integer array gives the index of the element in the values array that is the first non-zero element in row j of A. Note that this index is equal to pointerB(j) - pointerB(1) + 1.
pointerE: An integer array such that pointerE(j) - pointerB(1) is the index of the element in the values array that is the last non-zero element in row j of A.
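SciPy stores CSR with a single `indptr` array, from which the four arrays described above can be read off. The sketch below uses zero-based indexing, so `pointerE[j]` is one past the last element of row j. The example matrix is an assumption made for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[1, 0, 2],
                  [0, 0, 3],
                  [4, 5, 0]])
csr = csr_matrix(dense)

values = csr.data            # non-zero elements, row by row
columns = csr.indices        # column of each stored value
pointerB = csr.indptr[:-1]   # position in values where each row starts
pointerE = csr.indptr[1:]    # position one past the end of each row

# The non-zero elements of row j are values[pointerB[j]:pointerE[j]].
for j in range(dense.shape[0]):
    print(j, values[pointerB[j]:pointerE[j]], columns[pointerB[j]:pointerE[j]])
```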
The compressed sparse column format (CSC) is similar to the CSR format, but columns are used instead of rows. In other words, the CSC format of a matrix A is identical to the CSR format of the transpose of A. The CSC format is specified by four arrays: values, rows, pointerB, and pointerE. They are described below in terms of the values, row positions, and column positions of the non-zero elements of a sparse matrix A.
values: A real or complex array that contains the non-zero elements of A. The non-zero elements are mapped into the values array using column-major storage order.
rows: Element i of the integer array rows is the number of the row in A that contains the i-th value in the values array.
pointerB: Element j of this integer array gives the index of the element in the values array that is the first non-zero element in column j of A. Note that this index is equal to pointerB(j) - pointerB(1) + 1.
pointerE: An integer array such that pointerE(j) - pointerB(1) is the index of the element in the values array that is the last non-zero element in column j of A.
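The statement that CSC of A matches CSR of the transpose can be checked directly with SciPy. This is a small sketch on an arbitrary example matrix; `sort_indices()` is called only to make sure both representations are in canonical order before comparing.

```python
import numpy as np
from scipy.sparse import csc_matrix, csr_matrix

dense = np.array([[1, 0, 2],
                  [0, 0, 3],
                  [4, 5, 0]])

csc = csc_matrix(dense)        # column-compressed A
csr_t = csr_matrix(dense.T)    # row-compressed transpose of A
csc.sort_indices()
csr_t.sort_indices()

same = (np.array_equal(csc.data, csr_t.data)
        and np.array_equal(csc.indices, csr_t.indices)
        and np.array_equal(csc.indptr, csr_t.indptr))
print(same)
```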
The skyline storage format is important for direct sparse solvers, and it is well suited to Cholesky or LU decomposition when no pivoting is required.
The skyline storage format accepted in Intel MKL can store only a triangular matrix or the triangular part of a matrix. The format is specified by two arrays: values and pointers.
values: A scalar array. For a lower triangular matrix it contains the set of elements from each row of the matrix, starting from the first non-zero element up to and including the diagonal element. For an upper triangular matrix it contains the set of elements from each column of the matrix, starting with the first non-zero element down to and including the diagonal element. Zero elements encountered along the way are included in the sets.
pointers: An integer array of dimension (m+1), where m is the number of rows for a lower triangle (columns for an upper triangle). pointers(i) - pointers(1) + 1 gives the index of the element in values that is the first non-zero element in row (column) i. The value of pointers(m+1) is set to nnz + pointers(1), where nnz is the number of elements in the array values.
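As an illustration of the lower-triangular case, the following NumPy sketch builds values and pointers with zero-based indexing. The helper `lower_skyline` and the example matrix are hypothetical. A row with no non-zero element is assumed to store only its diagonal entry, which is a choice made for this sketch and not something the text specifies.

```python
import numpy as np

def lower_skyline(a):
    """Skyline arrays (zero-based) for the lower triangle of a square matrix."""
    m = a.shape[0]
    values, pointers = [], [0]
    for i in range(m):
        nz = np.nonzero(a[i, :i + 1])[0]
        first = nz[0] if nz.size else i  # assumption: empty row keeps its diagonal
        # Everything from the first non-zero up to the diagonal, zeros included.
        values.extend(a[i, first:i + 1].tolist())
        pointers.append(len(values))
    return np.array(values), np.array(pointers)

a = np.array([[1, 0, 0, 0],
              [2, 3, 0, 0],
              [0, 0, 4, 0],
              [5, 0, 6, 7]])
values, pointers = lower_skyline(a)
print(values)
print(pointers)
```

The last entry of pointers equals the length of values plus pointers[0], matching the rule for pointers(m+1).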
values = (1 0 2 1 6 7 8 2 1 4 5 1 4 3 0 0 7 2 0 0)
columns = (0 1 1 1 2)
pointerB= (0 2 3)
pointerE= (2 3 5)
The Intel MKL block compressed sparse row (BSR) format for sparse matrices is specified by four arrays: values, columns, pointerB, and pointerE.
values: A real array that contains the elements of the non-zero blocks of a sparse matrix. The elements are stored block by block in row-major order. A non-zero block is a block that contains at least one non-zero element. All elements of non-zero blocks are stored, even if some of them are equal to zero. Within each non-zero block, elements are stored in column-major order for one-based indexing and in row-major order for zero-based indexing.
columns: Element i of the integer array columns is the number of the column in the block matrix that contains the i-th non-zero block.
pointerB: Element j of this integer array gives the index of the element in the columns array that is the first non-zero block in row j of the block matrix.
pointerE: Element j of this integer array gives the index of the element in the columns array that contains the last non-zero block in row j of the block matrix, plus 1.
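The zero-based example arrays listed before this section (20 values, 5 block columns, 3 block rows) can be expanded back into a dense matrix with SciPy. The sketch below assumes 2 x 2 blocks, a 6 x 6 matrix, and that the value list reads 1 0 2 1 6 7 8 2 ...; none of these is stated explicitly in the text, so treat them as assumptions.

```python
import numpy as np
from scipy.sparse import bsr_matrix

values = np.array([1, 0, 2, 1, 6, 7, 8, 2, 1, 4, 5, 1, 4, 3, 0, 0, 7, 2, 0, 0])
columns = np.array([0, 1, 1, 1, 2])
pointerB = np.array([0, 2, 3])
pointerE = np.array([2, 3, 5])

# Zero-based indexing: each block is stored in row-major order.
blocks = values.reshape(-1, 2, 2)
# SciPy uses one indptr array: the row starts followed by the final end.
indptr = np.append(pointerB, pointerE[-1])

bsr = bsr_matrix((blocks, columns, indptr), shape=(6, 6))
dense = bsr.toarray()
print(dense)
```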
bsr_matrix(arg1, shape=None, dtype=None, copy=False, blocksize=None) Block Sparse Row matrix (BSR):
The Block Compressed Row (BSR) format is very similar to the Compressed Sparse Row (CSR) format. BSR is appropriate for sparse matrices with dense sub-matrices. Block matrices often arise in vector-valued finite element discretizations. In such cases, BSR is considerably more efficient than CSR and CSC for many sparse arithmetic operations.
csc_matrix(arg1, shape=None, dtype=None, copy=False) Compressed Sparse Column matrix (CSC):
Advantages of the CSC format
•efficient arithmetic operations CSC + CSC, CSC * CSC, etc.
•efficient column slicing
•fast matrix vector products (CSR, BSR may be faster!)
Disadvantages of the CSC format
•slow row slicing operations (consider CSR)
•changes to the sparsity structure are expensive (consider LIL or DOK)
csr_matrix(arg1, shape=None, dtype=None, copy=False) Compressed Sparse Row matrix (CSR):
Advantages of the CSR format
•efficient arithmetic operations CSR + CSR, CSR * CSR, etc.
•efficient row slicing
•fast matrix vector products
Disadvantages of the CSR format
•slow column slicing operations (consider CSC)
•changes to the sparsity structure are expensive (consider LIL or DOK)
Advantages of the COO format
•facilitates fast conversion among sparse formats
•permits duplicate entries (see example)
•very fast conversion to and from CSR/CSC formats
COO is commonly used for reading and writing sparse matrices from files; Matrix Market, for example, uses the COO format.
tocsc() Return a copy of this matrix in Compressed Sparse Column format
tocsr() Return a copy of this matrix in Compressed Sparse Row format
todense([order, out]) Return a dense matrix representation of this matrix
dia_matrix(arg1, shape=None, dtype=None, copy=False) Sparse matrix with DIAgonal storage
dok_matrix(arg1, shape=None, dtype=None, copy=False) Dictionary Of Keys based sparse matrix
This is an efficient structure for constructing sparse matrices incrementally. It allows efficient O(1) access to individual elements. Duplicates are not allowed. It can be efficiently converted to a coo_matrix once constructed.
lil_matrix(arg1, shape=None, dtype=None, copy=False) Row-based linked list sparse matrix
This is an efficient structure for constructing sparse matrices incrementally.
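A typical pattern that follows from this is to fill a LIL or DOK matrix element by element and convert once at the end. The sketch below uses arbitrary positions and values chosen for illustration.

```python
from scipy.sparse import dok_matrix, lil_matrix

# Build incrementally in LIL, then convert to CSR for arithmetic.
lil = lil_matrix((3, 3))
lil[0, 1] = 2.0
lil[2, 0] = 4.0
lil[2, 2] = 5.0
csr = lil.tocsr()

# DOK works the same way and converts cheaply to COO.
dok = dok_matrix((3, 3))
dok[1, 1] = 7.0
coo = dok.tocoo()

print(csr.toarray())
print(coo.toarray())
```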
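2. Compared with DIA and ELL, the COO and CSR formats are more flexible and easier to manipulate;
3. ELL is fast and COO is flexible; the HYB format, which combines the two, is a good sparse matrix representation;
4. According to Nathan Bell's work:
CSR has the most stable number of bytes per non-zero entry when storing sparse matrices (about 8.5 for float and about 12.5 for double);
for unstructured meshes and random matrices, DIA uses more than ten times as many bytes as CSR;
5. In some linear algebra libraries, COO is commonly used for reading and writing sparse matrices from files (Matrix Market, for example, uses COO), while CSR is commonly used for sparse matrix computation once the data has been read.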
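The sparse matrix storage formats correspond to the following classes in the sparse module:
shape=(M, N): the created sparse matrix has shape (M, N); if unspecified, the shape is inferred from the index arrays.
blocksize (BSR only): the size of the blocks, which must divide the matrix shape (M, N) evenly. If unspecified, a heuristic is used to choose a suitable block size.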
Coordinate format (COO):
coo_matrix(arg1[, shape, dtype, copy])
Diagonal storage format (DIA):
dia_matrix(arg1[, shape, dtype, copy])
Compressed sparse row format (CSR):
csr_matrix(arg1[, shape, dtype, copy])
Compressed sparse column format (CSC):
csc_matrix(arg1[, shape, dtype, copy])
Block compressed sparse row format (BSR):
bsr_matrix(arg1[, shape, dtype, copy, blocksize])
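The effect of the shape, dtype, and blocksize parameters can be seen in a short sketch. The index arrays and the dense matrix are arbitrary examples chosen for illustration.

```python
import numpy as np
from scipy.sparse import bsr_matrix, coo_matrix

data = np.array([1, 2, 3])
row = np.array([0, 1, 1])
col = np.array([0, 2, 2])

# Without shape, the dimensions are inferred from the largest indices.
inferred = coo_matrix((data, (row, col)))
# With shape and dtype given explicitly.
explicit = coo_matrix((data, (row, col)), shape=(4, 4), dtype=float)

# BSR with an explicit block size; (2, 2) divides the (4, 4) shape evenly.
dense = np.arange(16).reshape(4, 4)
bsr = bsr_matrix(dense, blocksize=(2, 2))

print(inferred.shape, explicit.shape, explicit.dtype, bsr.blocksize)
```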
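dtype: data type of the matrix
shape (2-tuple): shape of the matrix
ndim (int): number of dimensions of the matrix
nnz: number of non-zero elements
row: COO only, the row index array
col: COO only, the column index array
has_sorted_indices: CSR, CSC, and BSR; whether the indices are sorted
indices: CSR, CSC, and BSR; the index array of the format
indptr: CSR, CSC, and BSR; the index pointer array of the format
blocksize: BSR only, the block size of the matrix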
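asformat(format): return this matrix in the given sparse format
astype(t): return this matrix with elements cast to the given type
diagonal(): return the main diagonal of the matrix
dot(other): ordinary dot product
getcol(j): return a copy of column j of the matrix, as an (m x 1) sparse matrix (column vector)
getrow(i): return a copy of row i of the matrix, as a (1 x n) sparse matrix (row vector)
max([axis]): return the maximum element of the matrix, or the maximum along the given axis
nonzero(): indices of the non-zero elements
todense([order, out]): return a dense matrix representation of this sparse matrix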
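A few of the attributes and methods listed above, applied to a small CSR matrix. This is only a sketch; the matrix is an arbitrary example.

```python
import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix([[1, 0, 2],
                [0, 0, 3]])

print(m.shape, m.ndim, m.nnz)      # basic attributes
rows, cols = m.nonzero()           # coordinates of the non-zero elements
diag = m.diagonal()                # main diagonal as a dense 1-D array
largest = m.max()                  # maximum over the whole matrix
product = m.dot(np.array([1, 1, 1]))  # matrix-vector product, dense result
as_coo = m.asformat("coo")         # same matrix in another format
as_float = m.astype(float)         # same matrix with another element type
```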
eye(m[, n, k, dtype, format])
Sparse matrix with ones on diagonal
identity(n[, dtype, format])
Identity matrix in sparse format
kron(A, B[, format])
kronecker product of sparse matrices A and B
kronsum(A, B[, format])
kronecker sum of sparse matrices A and B
diags(diagonals[, offsets, shape, format, dtype])
Construct a sparse matrix from diagonals.
spdiags(data, diags, m, n[, format])
Return a sparse matrix from diagonals.
block_diag(mats[, format, dtype])
Build a block diagonal sparse matrix from provided matrices.
tril(A[, k, format])
Return the lower triangular portion of a matrix in sparse format
triu(A[, k, format])
Return the upper triangular portion of a matrix in sparse format
bmat(blocks[, format, dtype])
Build a sparse matrix from sparse sub-blocks
hstack(blocks[, format, dtype])
Stack sparse matrices horizontally (column wise)
vstack(blocks[, format, dtype])
Stack sparse matrices vertically (row wise)
rand(m, n[, density, format, dtype, ...])
Generate a sparse matrix of the given shape and density with uniformly distributed values.
random(m, n[, density, format, dtype, ...])
Generate a sparse matrix of the given shape and density with randomly distributed values.
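Several of these construction functions compose naturally. As a sketch, the following builds a 1-D second-difference matrix with diags, combines it into a 2-D operator with kron and identity, and stacks two matrices with vstack. The size n = 3 and the choice of operator are assumptions made for illustration.

```python
from scipy.sparse import diags, identity, kron, vstack

n = 3
# Tridiagonal matrix with 2 on the main diagonal and -1 next to it.
# Scalar diagonals are broadcast because shape is given.
lap1d = diags([-1, 2, -1], offsets=[-1, 0, 1], shape=(n, n), format="csr")

# 2-D operator on an n x n grid as a sum of two Kronecker products.
lap2d = kron(identity(n), lap1d) + kron(lap1d, identity(n))

# Stack two matrices with the same number of columns vertically.
stacked = vstack([lap1d, identity(n)])

print(lap1d.toarray())
print(lap2d.shape, stacked.shape)
```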
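from numpy import array
from scipy.sparse import coo_matrix

A = coo_matrix([[1, 2], [3, 4]])
print(A)
# (0, 0) 1
# (0, 1) 2
# (1, 0) 3
# (1, 1) 4

# Duplicate (row, col) entries are permitted in COO.
row = array([0, 0, 0, 0, 1, 3, 1])
col = array([0, 0, 0, 2, 1, 3, 1])
data = array([1, 1, 1, 8, 1, 1, 1])
matrix = coo_matrix((data, (row, col)), shape=(4, 4))
print(matrix)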
from scipy.sparse import csr_matrix

csr = csr_matrix([[1, 5], [4, 0], [1, 3]])
print(csr.todense())
# [[1 5]
#  [4 0]
#  [1 3]]
print(csr.shape)     # (3, 2)
print(csr.shape[1])  # 2
print(csr)
# (0, 0) 1
# (0, 1) 5
# (1, 0) 4
# (2, 0) 1
# (2, 1) 3
print(csr[0])     # (0, 0) 1  (0, 1) 5
print(csr[1, 1])  # 0
print(csr[0, 0])  # 1
for c in csr:  # each iteration yields one row of csr
    print(type(c))
    print(c)
    break

csr_mat = csr_matrix([1, 5, 0])
print(csr_mat.todense())  # [[1 5 0]]
# print(type(csr_mat.nonzero()))
# nonzero() returns a (rows, cols) pair of index arrays, so zip them to get coordinates.
for row, col in zip(*csr_mat.nonzero()):
    print(row, col, csr_mat[row, col])
from scipy.sparse import csr_matrix, vstack

csr = csr_matrix([[1, 5, 5], [4, 0, 6], [1, 3, 7]])
print(csr.todense())
# [[1 5 5]
#  [4 0 6]
#  [1 3 7]]
csr2 = csr_matrix([[3, 0, 9]])
print(csr2.todense())
# [[3 0 9]]
print(vstack([csr, csr2]).todense())
# [[1 5 5]
#  [4 0 6]
#  [1 3 7]
#  [3 0 9]]
sub = matrix.getcol(1)  # 'coo_matrix' object does not support indexing, so matrix[1] cannot be used
print(sub)
# (1, 0) 2
sub = matrix.todense()[:, [1, 2]]  # ordinary dense matrix: select the given columns
print(sub)
A = csr_matrix([[1, 2, 0], [0, 0, 3]])
print(A.todense())
# [[1 2 0]
#  [0 0 3]]
v = A.T
print(v.todense())
# [[1 0]
#  [2 0]
#  [0 3]]
d =   # (right-hand side missing in the source)
print(d)
# (0, 0) 5
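A = lil_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
v = array([1, 0, -1])
s =   # (right-hand side missing in the source)
for i in range(100000):
    d =   # (right-hand side missing in the source); here v is an ndarray
print( - s)
Computation time: bsr: 0:00:01.666072, coo: 1.04, csc: 0.93, csr: 0.90, dia: 1.06, dok: 1.57, lil: 11.37. CSR is therefore recommended for computing dot products.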
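A comparison of this kind can be reproduced along the following lines. This is a sketch, not the original benchmark: it assumes the timed operation was a matrix-vector product via dot, uses time.perf_counter for timing, and runs far fewer iterations. Absolute numbers depend on the machine and the SciPy version.

```python
import time

import numpy as np
from scipy.sparse import lil_matrix

A = lil_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
v = np.array([1, 0, -1])  # dense vector (ndarray)

timings = {}
results = {}
for fmt in ["bsr", "coo", "csc", "csr", "dia", "dok", "lil"]:
    M = A.asformat(fmt)  # convert once, outside the timed loop
    start = time.perf_counter()
    for _ in range(1000):
        d = M.dot(v)  # assumed operation: sparse matrix times dense vector
    timings[fmt] = time.perf_counter() - start
    results[fmt] = d

for fmt, seconds in timings.items():
    print(f"{fmt}: {seconds:.4f} s")
```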
csr_mat1 = csr_matrix([1, 2, 0])
csr_mat2 = csr_matrix([1, 0, -1])
similar = (   # (expression missing in the source); here csr_mat2 is also a csr_matrix
print(type(similar))
print(similar)
# (0, 0) 1
print(similar[0, 0])
# 1
mmwrite(target, a[, comment, field, precision]) Writes the sparse or dense array a to a Matrix Market formatted file.
mmread(source) Reads the contents of a Matrix Market file ‘filename’ into a matrix.
mminfo(source) Queries the contents of the Matrix Market file ‘filename’ to extract size and storage.
from numpy import random
from scipy.io import mminfo, mmread, mmwrite
from scipy.sparse import csr_matrix

def save_csr_mat(item_item_sparse_mat_filename=r'.\datasets\lastfm-dataset-1K\item_item_csr_mat.mtx'):
    random.seed(10)
    raw_user_item_mat = random.randint(0, 6, (3, 2))
    d = csr_matrix(raw_user_item_mat)
    print(d.todense())
    print(d)
    mmwrite(item_item_sparse_mat_filename, d)
    print("item_item_sparse_mat_file information: ")
    print(mminfo(item_item_sparse_mat_filename))
    k = mmread(item_item_sparse_mat_filename)
    print(k.todense())

Output:
[[1 5]
 [4 0]
 [1 3]]
(0, 0) 1
(0, 1) 5
(1, 0) 4
(2, 0) 1
(2, 1) 3
item_item_sparse_mat_file information:
(3, 2, 5, 'coordinate', 'integer', 'general')
[[1 5]
 [4 0]
 [1 3]]

Contents of the saved file:
%%MatrixMarket matrix coordinate integer general
%
3 2 5
1 1 1
1 2 5
2 1 4
3 1 1
3 2 3

Note: the saved file should have the extension .mtx.
