First, let's look at part of the CountVectorizer source code.
class CountVectorizer(_VectorizerMixin, BaseEstimator):
    """Convert a collection of text documents to a matrix of token counts

    This implementation produces a sparse representation of the counts using
    scipy.sparse.csr_matrix.

    If you do not provide an a-priori dictionary and you do not use an analyzer
    that does some kind of feature selection then the number of features will
    be equal to the vocabulary size found by analyzing the data.

    Read more in the :ref:`User Guide`.
    """
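The first two sentences of the docstring state the two core facts about CountVectorizer.
Convert a collection of text documents to a matrix of token counts
CountVectorizer converts a collection of documents into a matrix of term counts, one row per document and one column per vocabulary term.
This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.
The resulting count matrix is stored in the csr_matrix sparse format.
Let's test this with a simple demo.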
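from sklearn.feature_extraction.text import CountVectorizer


def t1():
    cv = CountVectorizer()
    train = ["Chinese Beijing Chinese",
             "Chinese Chinese Shanghai",
             "Chinese Macao",
             "Tokyo Japan Chinese"]
    # Learn the vocabulary and build the sparse document-term count matrix.
    cv_fit = cv.fit_transform(train)
    # Note: get_feature_names() was removed in scikit-learn 1.2;
    # newer versions provide get_feature_names_out() instead.
    print(cv.get_feature_names())
    print(cv_fit)            # sparse view: (row, column) count
    print(cv_fit.toarray())  # dense view


t1()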
The output:
['beijing', 'chinese', 'japan', 'macao', 'shanghai', 'tokyo']
(0, 1) 2
(0, 0) 1
(1, 1) 2
(1, 4) 1
(2, 1) 1
(2, 3) 1
(3, 1) 1
(3, 5) 1
(3, 2) 1
[[1 2 0 0 0 0]
 [0 2 0 0 1 0]
 [0 1 0 1 0 0]
 [0 1 1 0 0 1]]
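The documents contain 6 distinct words in total, so get_feature_names returns a list of length 6.
cv_fit is stored as a csr_matrix, and printing it lists every non-zero entry as (row, column) value. For example, (0, 1) 2 refers to row 0 (the first document) and column 1 (the second word, chinese), and the 2 means that chinese appears twice in the first document.
Calling toarray converts the sparse representation into an ordinary dense matrix. Because the vocabulary has 6 words, each document row has 6 dimensions.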
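To make the counting step concrete, here is a minimal pure-Python sketch of what CountVectorizer appears to do with its default settings: lowercase the text, extract tokens of two or more word characters, sort the vocabulary alphabetically, and count. The helper name `count_matrix` is hypothetical and not part of scikit-learn, and the real implementation builds a sparse matrix rather than nested lists.

```python
import re
from collections import Counter

# Assumed to match CountVectorizer's default token_pattern: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def count_matrix(docs):
    """Hypothetical helper: return (sorted vocabulary, dense count rows)."""
    # Lowercase each document, then split it into tokens.
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    # The vocabulary is the alphabetically sorted set of all tokens.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    column = {token: j for j, token in enumerate(vocabulary)}
    rows = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for token, n in Counter(tokens).items():
            row[column[token]] = n
        rows.append(row)
    return vocabulary, rows


train = ["Chinese Beijing Chinese",
         "Chinese Chinese Shanghai",
         "Chinese Macao",
         "Tokyo Japan Chinese"]
vocabulary, rows = count_matrix(train)
print(vocabulary)
for row in rows:
    print(row)
```

On the four training sentences this should give the same vocabulary and the same counts as the dense matrix printed by the demo.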
class TfidfVectorizer(CountVectorizer):
    """Convert a collection of raw documents to a matrix of TF-IDF features.

    Equivalent to :class:`CountVectorizer` followed by
    :class:`TfidfTransformer`.

    Read more in the :ref:`User Guide`.
    """
The difference between TfidfVectorizer and CountVectorizer is this:
CountVectorizer returns raw term counts, while TfidfVectorizer returns tf-idf values.
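from sklearn.feature_extraction.text import TfidfVectorizer


def t2():
    # norm=None disables row normalization, so the raw tf * idf values are visible.
    tf = TfidfVectorizer(use_idf=True, smooth_idf=True, norm=None)
    train = ["Chinese Beijing Chinese",
             "Chinese Chinese Shanghai",
             "Chinese Macao",
             "Tokyo Japan Chinese"]
    tf_fit = tf.fit_transform(train)
    # Note: get_feature_names() was removed in scikit-learn 1.2;
    # newer versions provide get_feature_names_out() instead.
    print(tf.get_feature_names())
    print(tf_fit)            # sparse view: (row, column) tf-idf
    print(tf_fit.toarray())  # dense view


t2()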
['beijing', 'chinese', 'japan', 'macao', 'shanghai', 'tokyo']
(0, 0) 1.916290731874155
(0, 1) 2.0
(1, 4) 1.916290731874155
(1, 1) 2.0
(2, 3) 1.916290731874155
(2, 1) 1.0
(3, 2) 1.916290731874155
(3, 5) 1.916290731874155
(3, 1) 1.0
[[1.91629073 2.         0.         0.         0.         0.        ]
 [0.         2.         0.         0.         1.91629073 0.        ]
 [0.         1.         0.         1.91629073 0.         0.        ]
 [0.         1.         1.91629073 0.         0.         1.91629073]]
Inside TfidfVectorizer, the tf-idf computation is delegated to TfidfTransformer:
self._tfidf = TfidfTransformer(norm=norm, use_idf=use_idf,
                               smooth_idf=smooth_idf,
                               sublinear_tf=sublinear_tf)
Stepping into TfidfTransformer, the source shows the actual computation:
def __init__(self, norm='l2', use_idf=True, smooth_idf=True,
             sublinear_tf=False):
    self.norm = norm
    self.use_idf = use_idf
    self.smooth_idf = smooth_idf
    self.sublinear_tf = sublinear_tf

def fit(self, X, y=None):
    """Learn the idf vector (global term weights)

    Parameters
    ----------
    X : sparse matrix, [n_samples, n_features]
        a matrix of term/token counts
    """
    X = check_array(X, accept_sparse=('csr', 'csc'))
    if not sp.issparse(X):
        X = sp.csr_matrix(X)
    dtype = X.dtype if X.dtype in FLOAT_DTYPES else np.float64

    if self.use_idf:
        n_samples, n_features = X.shape
        df = _document_frequency(X)
        df = df.astype(dtype, **_astype_copy_false(df))

        # perform idf smoothing if required
        df += int(self.smooth_idf)
        n_samples += int(self.smooth_idf)

        # log+1 instead of log makes sure terms with zero idf don't get
        # suppressed entirely.
        idf = np.log(n_samples / df) + 1
        self._idf_diag = sp.diags(idf, offsets=0,
                                  shape=(n_features, n_features),
                                  format='csr',
                                  dtype=dtype)

    return self
From the code above, the idf is computed as follows (np.log is the natural logarithm).
When smooth_idf is True:
$$idf = \log \frac{1 + n_d}{1 + df} + 1$$
where $n_d$ is the total number of documents and $df$ is the number of documents that contain the term.
When smooth_idf is False:
$$idf = \log \frac{n_d}{df} + 1$$
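As a quick check of these two formulas, the sketch below recomputes the tf-idf values of the TfidfVectorizer demo by hand from the dense count matrix printed earlier. The helper name `idf` is hypothetical. It assumes norm=None, as in the demo, so that tf-idf is simply the raw count multiplied by the idf weight.

```python
import math


def idf(n_docs, doc_freq, smooth=True):
    """Hypothetical helper: idf = ln(n_d / df) + 1, with optional add-one smoothing."""
    if smooth:
        # Smoothing adds one to both the document count and the document frequency.
        n_docs += 1
        doc_freq += 1
    return math.log(n_docs / doc_freq) + 1


# Dense count matrix from the CountVectorizer demo (4 documents, 6 terms).
counts = [[1, 2, 0, 0, 0, 0],
          [0, 2, 0, 0, 1, 0],
          [0, 1, 0, 1, 0, 0],
          [0, 1, 1, 0, 0, 1]]
n_docs = len(counts)
# Document frequency: the number of documents in which each term appears.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(counts[0]))]
weights = [idf(n_docs, df) for df in doc_freq]
# With norm=None, each tf-idf entry is count * idf.
tfidf = [[tf * w for tf, w in zip(row, weights)] for row in counts]
for row in tfidf:
    print([round(v, 8) for v in row])
```

For beijing, $n_d = 4$ and $df = 1$, so the smoothed idf is $\ln(5/2) + 1 \approx 1.9163$. For chinese, which appears in every document, $df = 4$ and the idf is $\ln(5/5) + 1 = 1$, which is why its tf-idf equals its raw count.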
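Since csr_matrix came up earlier, let's briefly review how it works.
csr_matrix (Compressed Sparse Row matrix) is one way of representing a sparse matrix. Its counterpart is csc_matrix (Compressed Sparse Column matrix).
CSR compresses the matrix row by row and represents the original matrix with three arrays: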
import numpy as np
from scipy import sparse


def csr_data():
    data = np.array([1, 2, 3, 4, 5, 6])     # non-zero values, row by row
    indices = np.array([0, 2, 2, 0, 1, 2])  # column index of each value
    indptr = np.array([0, 2, 3, 6])         # row i owns data[indptr[i]:indptr[i+1]]
    matrix = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
    print(matrix)
    print()
    print(matrix.todense())


csr_data()
The result:
(0, 0) 1
(0, 2) 2
(1, 2) 3
(2, 0) 4
(2, 1) 5
(2, 2) 6
[[1 0 2]
[0 0 3]
[4 5 6]]
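Here, data holds all the non-zero values, stored row by row.
indices holds the column index of each non-zero value.
indptr holds the start and end offsets of each row's non-zero values: row i owns data[indptr[i]:indptr[i+1]], so indptr has one more entry than the number of rows.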
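The role of the three arrays can be sketched by expanding them back into a dense matrix by hand. The helper `csr_to_dense` below is hypothetical and uses plain Python lists. It is only meant to illustrate the layout, not how scipy implements the conversion.

```python
def csr_to_dense(data, indices, indptr, n_cols):
    """Hypothetical helper: expand CSR arrays into a list of dense rows."""
    dense = []
    for i in range(len(indptr) - 1):
        row = [0] * n_cols
        # Row i owns the slice data[indptr[i]:indptr[i + 1]].
        for k in range(indptr[i], indptr[i + 1]):
            # indices[k] is the column of the k-th stored value.
            row[indices[k]] = data[k]
        dense.append(row)
    return dense


data = [1, 2, 3, 4, 5, 6]
indices = [0, 2, 2, 0, 1, 2]
indptr = [0, 2, 3, 6]
dense = csr_to_dense(data, indices, indptr, n_cols=3)
for row in dense:
    print(row)
```

Row 0 takes data[0:2] in columns 0 and 2, row 1 takes data[2:3] in column 2, and row 2 takes data[3:6] in columns 0, 1 and 2, which matches the dense matrix printed above.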
import numpy as np
from scipy import sparse


def csc_data():
    data = np.array([1, 2, 3, 4, 5, 6])     # non-zero values, column by column
    indices = np.array([0, 2, 2, 0, 1, 2])  # row index of each value
    indptr = np.array([0, 2, 3, 6])         # column j owns data[indptr[j]:indptr[j+1]]
    matrix = sparse.csc_matrix((data, indices, indptr), shape=(3, 3))
    print(matrix)
    print()
    print(matrix.todense())


csc_data()
The result:
(0, 0) 1
(2, 0) 2
(2, 1) 3
(0, 2) 4
(1, 2) 5
(2, 2) 6
[[1 0 4]
[0 0 5]
[2 3 6]]
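The only difference between csc_matrix and csr_matrix is the direction of compression: in CSR, indptr delimits rows and indices holds column indices, whereas in CSC, indptr delimits columns and indices holds row indices.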
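One consequence of this symmetry is that the same three arrays, read as CSC instead of CSR, describe the transposed matrix. The sketch below illustrates this with a hypothetical pure-Python helper, `csc_to_dense`, and compares its output with the transpose of the CSR result shown earlier.

```python
def csc_to_dense(data, indices, indptr, n_rows):
    """Hypothetical helper: expand CSC arrays into a list of dense rows."""
    n_cols = len(indptr) - 1
    dense = [[0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        # Column j owns the slice data[indptr[j]:indptr[j + 1]].
        for k in range(indptr[j], indptr[j + 1]):
            # indices[k] is the row of the k-th stored value.
            dense[indices[k]][j] = data[k]
    return dense


data = [1, 2, 3, 4, 5, 6]
indices = [0, 2, 2, 0, 1, 2]
indptr = [0, 2, 3, 6]
csc_dense = csc_to_dense(data, indices, indptr, n_rows=3)

# Dense matrix obtained earlier by reading the same arrays as CSR.
csr_dense = [[1, 0, 2], [0, 0, 3], [4, 5, 6]]
transposed = [list(column) for column in zip(*csr_dense)]

for row in csc_dense:
    print(row)
print(csc_dense == transposed)
```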