A Detailed Look at TF-IDF Text Vectorization in Python with CountVectorizer and TfidfTransformer

I recently wanted to apply the TF-IDF algorithm to vectorize Chinese text, which is how I came across the CountVectorizer and TfidfTransformer functions, and I am recording what I learned here. For a first overview, these two posts are worth reading:
① sklearn CountVectorizer\TfidfVectorizer\TfidfTransformer函数详解
② 【机器学习】文本数据的向量化(TF-IDF)—样本集实例讲解+python实现
The Python source code contains a detailed docstring for TfidfTransformer:

   """**Transform a count matrix to a normalized tf or tf-idf representation.**
Tf means term-frequency while tf-idf means term-frequency times inverse
document-frequency. This is a common term weighting scheme in information
retrieval, that has also found good use in document classification.

The goal of using tf-idf instead of the raw frequencies of occurrence of a
token in a given document is to scale down the impact of tokens that occur
very frequently in a given corpus and that are hence empirically less
informative than features that occur in a small fraction of the training
corpus.

The formula that is used to compute the tf-idf for a term t of a document d
in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is
computed as idf(t) = log [ n / df(t) ] + 1 (if ``smooth_idf=False``), where
n is the total number of documents in the document set and df(t) is the
document frequency of t; the document frequency is the number of documents
in the document set that contain the term t. The effect of adding "1" to
the idf in the equation above is that terms with zero idf, i.e., terms
that occur in all documents in a training set, will not be entirely
ignored.
(Note that the idf formula above differs from the standard textbook
notation that defines the idf as
idf(t) = log [ n / (df(t) + 1) ]).
If ``smooth_idf=True`` (the default), the constant "1" is added to the
numerator and denominator of the idf as if an extra document was seen
containing every term in the collection exactly once, which prevents
zero divisions: idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1.
Furthermore, the formulas used to compute tf and idf depend
on parameter settings that correspond to the SMART notation used in IR
as follows:
Tf is "n" (natural) by default, "l" (logarithmic) when
``sublinear_tf=True``.
Idf is "t" when use_idf is given, "n" (none) otherwise.
Normalization is "c" (cosine) when ``norm='l2'``, "n" (none)
when ``norm=None``.

Read more in the scikit-learn User Guide.

Parameters
----------
norm : {'l1', 'l2'}, default='l2'
    Each output row will have unit norm, either:

    - 'l2': Sum of squares of vector elements is 1. The cosine
      similarity between two vectors is their dot product when l2 norm has
      been applied.
    - 'l1': Sum of absolute values of vector elements is 1.
      See :func:`preprocessing.normalize`.

use_idf : bool, default=True
    Enable inverse-document-frequency reweighting. If False, idf(t) = 1.

smooth_idf : bool, default=True
    Smooth idf weights by adding one to document frequencies, as if an
    extra document was seen containing every term in the collection
    exactly once. Prevents zero divisions.

sublinear_tf : bool, default=False
    Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
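All of these parameters are set when constructing the transformer. The first three are exercised in the example below; sublinear_tf is not, but switching it on simply replaces each tf value with 1 + log(tf) before the IDF weighting. A minimal sketch, just to show where the options go (the values here are the defaults):

from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)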

The algorithm itself is fairly straightforward. Here is an example (the sample comes from the second blog post mentioned above):

import numpy as np 
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer  
from sklearn.feature_extraction.text import TfidfTransformer 
tag_list = ['iphone guuci huawei watch huawei',
            'huawei watch iphone watch iphone guuci',
            'skirt skirt skirt flower',
            'watch watch huawei']
vectorizer = CountVectorizer()          # converts the words in the texts into a term-count matrix
X = vectorizer.fit_transform(tag_list)  # counts how many times each word occurs
print(X)
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)    # turns the count matrix X into TF-IDF values
print(tfidf.toarray())
(0, 3)	1
  (0, 1)	1
  (0, 2)	2
  (0, 5)	1
  (1, 3)	2
  (1, 1)	1
  (1, 2)	1
  (1, 5)	2
  (2, 4)	3
  (2, 0)	1
  (3, 2)	1
  (3, 5)	2
[[0.         0.43531168 0.70484465 0.43531168 0.         0.35242232]
 [0.         0.34758387 0.2813991  0.69516774 0.         0.5627982 ]
 [0.31622777 0.         0.         0.         0.9486833  0.        ]
 [0.         0.         0.4472136  0.         0.         0.89442719]]
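The first block of output is the sparse count matrix X printed in (row, column) value form, and the second is the TF-IDF array. To see the count matrix as an ordinary 4x6 array, a minimal sketch (note that get_feature_names_out needs a recent scikit-learn; older versions use get_feature_names instead):

print(vectorizer.get_feature_names_out())  # the learned vocabulary, in alphabetical order
print(X.toarray())                         # dense 4x6 count matrix (documents x terms)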

tag_list is treated as four documents (n_samples=4) with 6 features (n_features=6). We can directly construct a 4x6 TF matrix that stores, for every word in the vocabulary, its frequency in each document. The vocabulary here is:

words = np.array(['flower', 'guuci', 'huawei', 'iphone', 'skirt','watch'])

By default the words are ordered alphabetically. If you want a different order, you only need to pass the vocabulary parameter (your own dictionary) to CountVectorizer, as in the sketch below.
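For example, a minimal sketch (custom_vocab is just an illustrative ordering, not from the original example):

custom_vocab = ['watch', 'huawei', 'iphone', 'guuci', 'skirt', 'flower']
vectorizer2 = CountVectorizer(vocabulary=custom_vocab)
X2 = vectorizer2.fit_transform(tag_list)
print(X2.toarray())  # the columns now follow custom_vocab instead of alphabetical order

With the default alphabetical ordering, the TF matrix is: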

x1=np.array([[0/5, 1/5, 2/5, 1/5, 0/5, 1/5],
       [0, 1/6, 1/6, 2/6, 0, 2/6],
       [1/4, 0, 0, 0, 3/4, 0],
       [0, 0, 1/3, 0, 0, 2/3]])
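This matrix can also be derived from the counts by dividing each row by that document's length. As far as I can tell from the source, sklearn itself multiplies the raw counts by the IDF weights and relies on the final L2 normalization, but since that normalization is scale-invariant per row, dividing by the document length first gives the same normalized result. A small sketch:

counts = X.toarray().astype(float)
x1 = counts / counts.sum(axis=1, keepdims=True)  # term frequency: count / document length
print(x1)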

Next comes the key part: computing the IDF vector. TfidfTransformer involves three main IDF formulas:

  1. The simplest one, which the author of the second blog post above implements by hand. Its IDF formula is idf(t) = log [ n / (df(t) + 1) ], where t is a feature (word), n is the number of documents, and df(t) is the number of documents that contain the word. With the vocabulary order of words above, the corresponding IDF vector is (a code sketch of this computation follows after the matrix below):

idfs1 = np.array([0.6931472, 0.2876821, 0, 0.2876821, 0.6931472, 0])

Then np.multiply(x1, idfs1) gives the final TF-IDF matrix:
0.000000 0.057536 0.0 0.057536 0.00000 0.0
0.000000 0.047947 0.0 0.095894 0.00000 0.0
0.173287 0.000000 0.0 0.000000 0.51986 0.0
0.000000 0.000000 0.0 0.000000 0.00000 0.0
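Here is a minimal sketch of that computation, reusing X and x1 from above:

counts = X.toarray()
n = counts.shape[0]                # number of documents, here 4
df = (counts > 0).sum(axis=0)      # document frequency of each term
idfs1 = np.log(n / (df + 1))       # textbook formula: log[n / (df(t) + 1)]
print(idfs1)                       # ≈ [0.6931, 0.2877, 0, 0.2877, 0.6931, 0]
print(np.multiply(x1, idfs1))      # the matrix shown above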
2. As mentioned above, TfidfTransformer has a smooth_idf parameter that smooths the resulting IDF weights. If smooth_idf=True (the default), the IDF formula is idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1, which for words gives the IDF vector:

idfs2 = np.array([1.91629073, 1.51082562, 1.22314355, 1.51082562, 1.91629073, 1.22314355])

Again, np.multiply(x1, idfs2) gives the TF-IDF matrix:
0.000000 0.302165 0.489257 0.302165 0.000000 0.244629
0.000000 0.251804 0.203857 0.503609 0.000000 0.407715
0.479073 0.000000 0.000000 0.000000 1.437218 0.000000
0.000000 0.000000 0.407715 0.000000 0.000000 0.815429
But this is not the final result yet. Note that TfidfTransformer also has a norm parameter that normalizes the output; the default is L2 normalization, i.e. the squares of the entries of each row sum to 1. So the value 0.302165 is finally transformed into:

np.sqrt(0.302165**2/(0.302165**2+0.489257**2+0.302165**2+0.000000+0.244629**2))

The result is exactly 0.4353116941338476, the same as the value produced by the function!
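A small sketch verifying the whole matrix against TfidfTransformer's output, reusing n, df and x1 from the sketches above:

from sklearn.preprocessing import normalize

idfs2 = np.log((1 + n) / (1 + df)) + 1                # smoothed IDF
tfidf_manual = normalize(np.multiply(x1, idfs2), norm='l2')
print(tfidf_manual)                                   # matches transformer.fit_transform(X).toarray()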
3. If smooth_idf=False, the IDF formula is idf(t) = log [ n / df(t) ] + 1, which for the same words gives the IDF vector:

idfs3 = np.array([2.386294, 1.693147, 1.287682, 1.693147, 2.386294, 1.287682])

Again, np.multiply(x1, idfs3) gives:
0.000000 0.338629 0.515073 0.338629 0.000000 0.257536
0.000000 0.282191 0.214614 0.564382 0.000000 0.429227
0.596573 0.000000 0.000000 0.000000 1.789721 0.000000
0.000000 0.000000 0.429227 0.000000 0.000000 0.858455
Normalizing this in the same way yields the final result; a sketch of the check is below.
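A minimal sketch of this check, comparing the manual computation with TfidfTransformer(smooth_idf=False), reusing normalize, n, df and x1 from the sketches above:

idfs3 = np.log(n / df) + 1                                            # un-smoothed IDF
print(normalize(np.multiply(x1, idfs3), norm='l2'))                   # manual result
print(TfidfTransformer(smooth_idf=False).fit_transform(X).toarray())  # same values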
Of these three IDF formulas, sklearn's TfidfTransformer only implements the last two. This differs from the textbook definition, but it does not affect the final result when clustering the texts.
In addition, while reading the source code I ran into a few other interesting functions, which I also note here:
① scipy's csr_matrix: python函数之csr_matrix
② numpy.bincount(): python——numpy.bincount()的用法
③ sklearn's pipeline module: python中sklearn的pipeline模块
