csr_matrix sparse matrices

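CSR (Compressed Sparse Row) compresses a matrix row by row and represents it with three arrays.
scipy's csr_matrix can be built from three arrays in two ways.
The first way passes coordinate triplets (row, col, data), which scipy converts to the CSR layout internally: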

from scipy.sparse import csr_matrix

row =  [0,0,0,1,1,1,2,2,2]  # row indices
col =  [0,1,2,0,1,2,0,1,2]  # column indices
data = [1,0,1,0,1,1,1,1,0]  # corresponding values
t = csr_matrix((data,(row,col)),shape=(3,3))
print(t)
print(t.todense())
Output:
  (0, 0)	1
  (0, 1)	0
  (0, 2)	1
  (1, 0)	0
  (1, 1)	1
  (1, 2)	1
  (2, 0)	1
  (2, 1)	1
  (2, 2)	0
[[1 0 1]
 [0 1 1]
 [1 1 0]]

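This form is the easier one to understand: the three arrays hold the row index, the column index, and the value of each entry. Note that the zeros in data are stored explicitly, which is why print(t) lists all nine entries; in practice you would pass only the nonzero entries.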
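As a minimal sketch of that last point, the same matrix can be built from its six nonzero cells only. The names `t_all` and `t_nz` are chosen here just for illustration. Both matrices should convert to the same dense array, while `nnz` (the number of stored entries, explicit zeros included) differs:

```python
import numpy as np
from scipy.sparse import csr_matrix

# All nine cells, including the zeros (same input as the example above)
row = [0, 0, 0, 1, 1, 1, 2, 2, 2]
col = [0, 1, 2, 0, 1, 2, 0, 1, 2]
data = [1, 0, 1, 0, 1, 1, 1, 1, 0]
t_all = csr_matrix((data, (row, col)), shape=(3, 3))

# Only the six nonzero cells
nz_row = [0, 0, 1, 1, 2, 2]
nz_col = [0, 2, 1, 2, 0, 1]
nz_data = [1, 1, 1, 1, 1, 1]
t_nz = csr_matrix((nz_data, (nz_row, nz_col)), shape=(3, 3))

# Same dense matrix, different number of stored entries
same_dense = np.array_equal(t_all.toarray(), t_nz.toarray())
print(same_dense)
print(t_all.nnz, t_nz.nnz)
```

Passing only the nonzero cells is what makes the sparse format worthwhile; storing every cell uses more memory than a dense array would.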
Reference: "A summary of csr_matrix usage"
The second way passes the CSR arrays (data, indices, indptr) directly:

import numpy as np
from scipy import sparse

data = np.array([1, 2, 3, 4, 5, 6])         # all stored (nonzero) values, row by row
indices = np.array([0, 2, 2, 0, 1, 2])      # column index of each value
indptr = np.array([0, 2, 3, 6])             # row i's values are data[indptr[i]:indptr[i+1]]
mtx = sparse.csr_matrix((data,indices,indptr),shape=(3,3))
print(mtx.todense())

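The harder part to understand is indptr. Each entry indptr[i] is the position in data (and in indices) where row i begins, and indptr[i+1] is where it ends. So indptr[0]:indptr[1] selects the entries of the first row: data[indptr[0]:indptr[1]] gives its values and indices[indptr[0]:indptr[1]] gives their column indices, which together fix where each value sits. Here the first row is data[0:2] = [1, 2] at columns [0, 2]. The last entry of indptr equals the total number of stored values.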
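The slicing rule above can be sketched as a small loop that rebuilds the dense matrix by hand from the three arrays and compares it with what scipy produces. The names `dense`, `n_rows`, `n_cols` and `matches` are introduced here only for illustration:

```python
import numpy as np
from scipy import sparse

data = np.array([1, 2, 3, 4, 5, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
indptr = np.array([0, 2, 3, 6])

n_rows, n_cols = 3, 3
dense = np.zeros((n_rows, n_cols), dtype=data.dtype)
for i in range(n_rows):
    start, end = indptr[i], indptr[i + 1]   # slice of data/indices belonging to row i
    dense[i, indices[start:end]] = data[start:end]

print(dense)
# [[1 0 2]
#  [0 0 3]
#  [4 5 6]]

# Compare with the matrix scipy builds from the same arrays
mtx = sparse.csr_matrix((data, indices, indptr), shape=(n_rows, n_cols))
matches = np.array_equal(dense, mtx.toarray())
print(matches)
```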
Reference: "How to understand sparse.csr_matrix"

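Computing with CSR matrices is generally more efficient for sparse data: the format supports fast row slicing and fast matrix-vector and CSR-by-CSR products. The matrix multiplication in item-based collaborative filtering, for example, also uses csr_matrix. The snippet below builds a word co-occurrence matrix in the same way: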

import itertools
import numpy as np
from scipy.sparse import csr_matrix

# Assumes `documents` (a list of tokenized documents) and `word_to_id` (a dict mapping word -> integer id) are already defined
# Map each word to its id, dropping words that are not in word_to_id
documents_as_ids = [np.sort([word_to_id[w] for w in doc if w in word_to_id]).astype('uint32') for doc in documents]
# row_ind holds the document index of every word occurrence; col_ind holds the word id of that occurrence
row_ind, col_ind = zip(*itertools.chain(*[[(i, w) for w in doc] for i, doc in enumerate(documents_as_ids)]))
data = np.ones(len(row_ind), dtype='uint32')  # use unsigned int for better memory utilization
max_word_id = max(itertools.chain(*documents_as_ids)) + 1
docs_words_matrix = csr_matrix((data, (row_ind, col_ind)), shape=(len(documents_as_ids), max_word_id))  # efficient arithmetic operations with CSR * CSR
words_cooc_matrix = docs_words_matrix.T @ docs_words_matrix  # transpose times docs_words_matrix (matrix product) gives the co-occurrence matrix
words_cooc_matrix.setdiag(0)  # zero out each word's co-occurrence with itself
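To check the idea on something small, here is a sketch with toy inputs. The `documents` and `word_to_id` values below are hypothetical and made up for illustration, and the construction is simplified to plain lists. With three documents over the words a, b and c, each off-diagonal cell should count the documents in which both words appear:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy inputs, not taken from the original snippet
documents = [["a", "b"], ["b", "c"], ["a", "b", "c"]]
word_to_id = {"a": 0, "b": 1, "c": 2}

documents_as_ids = [[word_to_id[w] for w in doc if w in word_to_id] for doc in documents]
# One (document index, word id) pair per word occurrence
row_ind, col_ind = zip(*[(i, w) for i, doc in enumerate(documents_as_ids) for w in doc])
data = np.ones(len(row_ind), dtype="uint32")
docs_words = csr_matrix((data, (row_ind, col_ind)), shape=(len(documents), len(word_to_id)))

cooc = docs_words.T @ docs_words   # word-by-word co-occurrence counts
cooc.setdiag(0)                    # ignore each word's co-occurrence with itself
print(cooc.toarray())
# [[0 2 1]
#  [2 0 2]
#  [1 2 0]]
```

For instance, a and b appear together in the first and third documents, so their cell is 2. If a word repeats within a document, the duplicate (row, col) pairs are summed when the matrix is built, so the cells then hold products of counts rather than document counts.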
