CountVectorizer converts the words in a collection of texts into a term-frequency matrix through its fit_transform function.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer  # import the vectorizers
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?',
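]  # the text data
vectorizer = CountVectorizer()  # initialize with default settings
count = vectorizer.fit_transform(corpus)  # convert the words in the texts into a term-frequency matrix
print(vectorizer.get_feature_names_out().tolist())  # every term found in the texts; get_feature_names() was removed in scikit-learn 1.2
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
print(vectorizer.vocabulary_)  # the column index assigned to each term, as follows: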
{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
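The index in vocabulary_ is the column a term occupies in the count matrix. A small sketch of how that mapping might be used on a document the vectorizer has not seen; the names demo_vectorizer and new_doc are made up for this illustration, and the example assumes the default tokenizer, which drops single-character tokens and ignores words outside the fitted vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Fit a separate vectorizer so the variables used elsewhere are left alone.
demo_vectorizer = CountVectorizer()
demo_vectorizer.fit(corpus)

# 'a', 'completely' and 'new' are not in the fitted vocabulary, so they
# contribute nothing to the resulting row.
new_doc = ['This is a completely new second document.']
row = demo_vectorizer.transform(new_doc).toarray()[0]

# Look up the column that holds the count for 'second'.
second_col = demo_vectorizer.vocabulary_['second']
print(second_col, row[second_col])
print(row)
```

Only the columns for 'document', 'is', 'second' and 'this' should be non-zero in the printed row.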
print(count.toarray())  # show the resulting term-frequency matrix
[[0 1 1 1 0 0 1 0 1]
[0 1 0 1 0 2 1 0 1]
[1 0 0 0 1 0 1 1 0]
[0 1 1 1 0 0 1 0 1]]
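To make the matrix above less of a black box, here is a rough sketch that rebuilds it without scikit-learn. It assumes the default CountVectorizer behaviour (lowercasing, tokens of two or more word characters, vocabulary sorted alphabetically); the names tokenized, vocabulary and manual_counts are made up for this illustration.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Approximate the default analyzer: lowercase the text, then keep
# tokens made of at least two word characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in corpus]

# Vocabulary: unique tokens in sorted order, each mapped to a column index.
unique_terms = sorted({token for tokens in tokenized for token in tokens})
vocabulary = {term: idx for idx, term in enumerate(unique_terms)}

# One row per document, one column per vocabulary term.
manual_counts = []
for tokens in tokenized:
    freq = Counter(tokens)
    manual_counts.append([freq.get(term, 0) for term in unique_terms])

print(vocabulary)
for row in manual_counts:
    print(row)
```

Each row is just a per-document word count laid out in vocabulary order, which is why the second row has a 2 in the 'second' column.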
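Computing the tf-idf weight of each term counted by CountVectorizer
The main idea behind TF-IDF: if a word or phrase occurs frequently in one document (high TF) but rarely in the other documents, it is considered to discriminate well between classes and is well suited to classification. TF-IDF is simply the product TF * IDF.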
transformer = TfidfTransformer()
tfidf_matrix = transformer.fit_transform(count)
print(tfidf_matrix.toarray())
[[ 0. 0.43877674 0.54197657 0.43877674 0. 0. 0.35872874 0. 0.43877674]
[ 0. 0.27230147 0. 0.27230147 0. 0.85322574 0.22262429 0. 0.27230147]
[ 0.55280532 0. 0. 0. 0.55280532 0. 0.28847675 0.55280532 0. ]
[ 0. 0.43877674 0.54197657 0.43877674 0. 0. 0.35872874 0. 0.43877674]]
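The numbers above can be reproduced by hand, which shows what "TF * IDF" means in practice here. The sketch below assumes TfidfTransformer's default settings: the smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The variable names are made up for this illustration.

```python
import numpy as np

# The term-frequency matrix printed earlier (rows: documents, columns: terms).
counts = np.array([
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 2, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
], dtype=float)

n_docs = counts.shape[0]

# Document frequency: in how many documents each term appears.
df = (counts > 0).sum(axis=0)

# Smoothed idf: a term present in every document (such as 'the') gets idf = 1.
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the raw counts by idf, then scale each row to unit Euclidean length.
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.round(manual_tfidf, 8))
```

Because of the row normalization, the printed weights are not the raw TF * IDF products; only their relative sizes within a document are preserved.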
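CountVectorizer and TfidfTransformer can be combined into a single step that produces the tf-idf values directly.
Key parameters of TfidfVectorizer (the defaults are used here):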
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(corpus)
print(tfidf_vec.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
print(tfidf_vec.vocabulary_)
print(tfidf_matrix.toarray())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
[[ 0. 0.43877674 0.54197657 0.43877674 0. 0. 0.35872874 0. 0.43877674]
[ 0. 0.27230147 0. 0.27230147 0. 0.85322574 0.22262429 0. 0.27230147]
[ 0.55280532 0. 0. 0. 0.55280532 0. 0.28847675 0.55280532 0. ]
[ 0. 0.43877674 0.54197657 0.43877674 0. 0. 0.35872874 0. 0.43877674]]
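The two outputs look identical, and a quick check can confirm it numerically. This is a minimal sketch assuming default parameters for all three classes; the names two_step, one_step and same are made up for this illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Route 1: count the terms first, then reweight the counts.
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(corpus))

# Route 2: let TfidfVectorizer do both in one call.
one_step = TfidfVectorizer().fit_transform(corpus)

# Both results are sparse matrices; compare their dense forms.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

If the two routes are given different parameters (a different norm or smooth_idf setting, for example), the comparison would no longer be expected to hold.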
CountVectorizer text feature extraction