CSR (Compressed Sparse Row) compresses a matrix row by row, representing the original matrix with three arrays.
These three arrays can be given in two forms.
First form
from scipy.sparse import csr_matrix
row = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # row indices
col = [0, 1, 2, 0, 1, 2, 0, 1, 2]   # column indices
data = [1, 0, 1, 0, 1, 1, 1, 1, 0]  # corresponding values
t = csr_matrix((data, (row, col)), shape=(3, 3))
print(t)
print(t.todense())
>>
(0, 0) 1
(0, 1) 0
(0, 2) 1
(1, 0) 0
(1, 1) 1
(1, 2) 1
(2, 0) 1
(2, 1) 1
(2, 2) 0
[[1 0 1]
[0 1 1]
[1 1 0]]
This form is easy to understand: the three arrays hold the row indices, the column indices, and the corresponding values. Note that the zeros passed in data are stored explicitly, as the print output above shows.
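Two subtleties of this constructor are worth knowing: duplicate (row, col) pairs are summed together, and explicitly stored zeros count toward nnz until they are removed. A minimal sketch, assuming scipy is installed:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Duplicate (row, col) pairs are summed when building from (data, (row, col))
row = [0, 0, 1]
col = [0, 0, 2]
data = [1, 2, 3]
m = csr_matrix((data, (row, col)), shape=(2, 3))
print(m.toarray())  # entry (0, 0) becomes 1 + 2 = 3

# Explicitly stored zeros count toward nnz until removed
z = csr_matrix(([1, 0], ([0, 1], [0, 1])), shape=(2, 2))
print(z.nnz)        # 2 stored entries, one of them an explicit zero
z.eliminate_zeros()
print(z.nnz)        # 1
```

This is why the printed matrix in the first example lists entries like (0, 1) 0: the zero was passed in data and is therefore stored.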
Second form
import numpy as np
from scipy import sparse

data = np.array([1, 2, 3, 4, 5, 6])     # all non-zero values
indices = np.array([0, 2, 2, 0, 1, 2])  # column index of each value
indptr = np.array([0, 2, 3, 6])         # row i's values are data[indptr[i]:indptr[i+1]]
mtx = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
print(mtx.todense())
The tricky part is indptr: consecutive entries of indptr delimit the rows. Slicing data[indptr[i]:indptr[i+1]] gives the values of row i (so data[indptr[0]:indptr[1]] is the first row's values), and the matching slice of indices gives each value's column, which pins down its position in the matrix.
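To make the indptr convention concrete, here is a small sketch (using the same data, indices, and indptr as above) that rebuilds the dense matrix by hand from the three arrays and checks the result against toarray():

```python
import numpy as np
from scipy import sparse

data = np.array([1, 2, 3, 4, 5, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
indptr = np.array([0, 2, 3, 6])
mtx = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))

# Rebuild the dense matrix row by row using only the three arrays
dense = np.zeros((3, 3), dtype=data.dtype)
for i in range(3):
    start, end = indptr[i], indptr[i + 1]
    dense[i, indices[start:end]] = data[start:end]

print(dense)
print(np.array_equal(dense, mtx.toarray()))  # True
```

Row 0 takes data[0:2] = [1, 2] at columns [0, 2], row 1 takes data[2:3] = [3] at column 2, and row 2 takes data[3:6] = [4, 5, 6] at columns [0, 1, 2].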
Computing with CSR matrices appears to be more efficient; item-based collaborative filtering, for instance, uses csr_matrix for its matrix multiplication:
import itertools
import numpy as np
from scipy.sparse import csr_matrix

# Map each word to its id
documents_as_ids = [np.sort([word_to_id[w] for w in doc if w in word_to_id]).astype('uint32') for doc in documents]
# row_ind: the document index of each word occurrence; col_ind: that word's id
row_ind, col_ind = zip(*itertools.chain(*[[(i, w) for w in doc] for i, doc in enumerate(documents_as_ids)]))
data = np.ones(len(row_ind), dtype='uint32')  # use unsigned int for better memory utilization
max_word_id = max(itertools.chain(*documents_as_ids)) + 1
docs_words_matrix = csr_matrix((data, (row_ind, col_ind)), shape=(len(documents_as_ids), max_word_id))  # efficient arithmetic operations with CSR * CSR
words_cooc_matrix = docs_words_matrix.T * docs_words_matrix  # multiplying docs_words_matrix by its transpose yields the co-occurrence matrix
words_cooc_matrix.setdiag(0)  # zero out self co-occurrence counts
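The snippet above assumes documents and word_to_id are defined elsewhere. A self-contained sketch with a hypothetical toy corpus (the concrete documents and word_to_id values below are illustrative, not from the original) shows the full pipeline end to end:

```python
import itertools
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy corpus and vocabulary
documents = [["apple", "banana"], ["banana", "cherry"], ["apple", "cherry"]]
word_to_id = {"apple": 0, "banana": 1, "cherry": 2}

documents_as_ids = [np.sort([word_to_id[w] for w in doc if w in word_to_id]).astype('uint32')
                    for doc in documents]
row_ind, col_ind = zip(*itertools.chain(*[[(i, w) for w in doc]
                                          for i, doc in enumerate(documents_as_ids)]))
data = np.ones(len(row_ind), dtype='uint32')
max_word_id = max(itertools.chain(*documents_as_ids)) + 1
# Rows are documents, columns are word ids; entry (i, j) counts word j in doc i
docs_words_matrix = csr_matrix((data, (row_ind, col_ind)),
                               shape=(len(documents_as_ids), max_word_id))
words_cooc_matrix = docs_words_matrix.T * docs_words_matrix
words_cooc_matrix.setdiag(0)
print(words_cooc_matrix.toarray())
```

Each off-diagonal entry (i, j) counts the documents in which words i and j appear together; here every pair co-occurs in exactly one document, so the result is all ones off the diagonal.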