A Simple Look at CountVectorizer in Sklearn

Preface

There are plenty of explanations of CountVectorizer online; this post just records my own learning process and won't go into great detail.

The Basic Idea

CountVectorizer is a text feature-extraction method that turns text into a term-frequency matrix. It only considers how often each word occurs, not the order of the words (word order is what word2vec captures).
For example, take two simple sentences:

"王姐,去哪啊" ("Sister Wang, where are you going?")
"大铁棍子医院" ("Big Iron Rod Hospital")

After word segmentation, these two sentences contain the following tokens:

"王姐", "去哪", "啊", "大", "铁棍子", "医院"

CountVectorizer counts how often each of these tokens appears in the two sentences. It first builds a vocabulary that maps each token to an index, for example:

{"王姐": 0, "铁棍子": 1, "去哪": 2, "医院": 3, "啊": 4, "大": 5}

This vocabulary covers every token in the corpus. Each sentence is then represented by the frequency of each vocabulary entry, so the first sentence becomes:

[1, 0, 1, 0, 1, 0], meaning "王姐" appears once, "铁棍子" zero times, "去哪" once, and so on.

The second sentence becomes:

[0, 1, 0, 1, 0, 0], meaning "王姐" appears zero times, "铁棍子" once, "去哪" zero times, and so on.

So the term-frequency matrix that CountVectorizer finally returns is:

[[1, 0, 1, 0, 1, 0], [0, 1, 0, 1, 0, 0]]
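As a sanity check, here is a minimal sketch of this toy example. CountVectorizer cannot segment Chinese by itself, so the sentences are pre-segmented with spaces, and the vocabulary above is passed in explicitly so the column order matches the text; both the whitespace tokenizer and the fixed vocabulary are choices made purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer

# pre-segmented sentences (CountVectorizer cannot segment Chinese on its own)
docs = ['王姐 去哪 啊', '大 铁棍子 医院']

vec = CountVectorizer(
    vocabulary={'王姐': 0, '铁棍子': 1, '去哪': 2, '医院': 3, '啊': 4, '大': 5},
    tokenizer=str.split,   # split on whitespace instead of the default regex
    token_pattern=None,    # silence the "token_pattern is unused" warning
)
vec.fit_transform(docs).toarray()
# array([[1, 0, 1, 0, 1, 0],
#        [0, 1, 0, 1, 0, 0]])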

Parameters

Only a few key parameters are covered here:
ngram_range:

tuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.

Understanding: CountVectorizer's purpose is to count term frequencies, and ngram_range defines what counts as a term: a single word can be a term, and so can a run of two consecutive words. The parameter is a tuple whose elements are the lower and upper bounds; (1, 2), for example, means a term consists of at least one and at most two words. It is usually left at (1, 1); setting it too high on a large corpus produces far too many terms.

For example, (1, 1) would give us unigrams or 1-grams such as “whey” and “protein”, while (2, 2) would give us bigrams or 2-grams, such as “whey protein”.
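A minimal sketch of that quote (the sentence is invented for illustration):

from sklearn.feature_extraction.text import CountVectorizer

doc = ['whey protein is great']
CountVectorizer(ngram_range=(1, 1)).fit(doc).get_feature_names_out()
# array(['great', 'is', 'protein', 'whey'], dtype=object)
CountVectorizer(ngram_range=(2, 2)).fit(doc).get_feature_names_out()
# array(['is great', 'protein is', 'whey protein'], dtype=object)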

max_features:

default=None
If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.

Understanding: if there are too many terms, this parameter caps the vocabulary, keeping only the terms with the highest corpus-wide frequency and discarding the rest.
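A minimal sketch, using the four-sentence corpus that appears later in this post; with max_features=3, only the three terms with the highest total frequency are kept:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
CountVectorizer(max_features=3).fit(corpus).get_feature_names_out()
# array(['document', 'is', 'the'], dtype=object)
# 'this' also appears 4 times, but the tie is broken by vocabulary order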

stop_words:

{'english'}, list, default=None
If 'english', a built-in stop word list for English is used. There are several known issues with 'english' and you should consider an alternative.

Understanding: stop words are words that carry little meaning on their own, such as "ah", "um", and "oh" (or, in English, words like "the" and "is"). If they appear too often they can distort the analysis, so it is usually better to filter them out.
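Besides the built-in 'english' list, you can pass a custom list of words to drop; a minimal sketch:

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(stop_words=['the', 'is', 'this'])  # custom stop word list
vec.fit(['This is the first document.']).get_feature_names_out()
# array(['document', 'first'], dtype=object)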

Attributes

vocabulary_: dict, the mapping from each term in the vocabulary to its column index in the output matrix.

Methods

get_feature_names_out(): returns an array of all the terms in the learned vocabulary.

A borrowed example to illustrate usage

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]  # 4 sentences

vectorizer = CountVectorizer()
vector = vectorizer.fit(corpus)  # fit() returns the fitted vectorizer itself

Call fit first. Once the corpus is fitted, the vocabulary is built; let's take a look:

vector.get_feature_names_out()  # the learned vocabulary terms
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

The index of each term in the vocabulary:

vector.vocabulary_  # maps each term to its column index
{'this': 8,
 'is': 3,
 'the': 6,
 'first': 2,
 'document': 1,
 'second': 5,
 'and': 0,
 'third': 7,
 'one': 4}

After fit, calling transform (or doing both at once with fit_transform) produces the term-frequency matrix. The result is a sparse matrix, so it needs to be converted with toarray():

vector.transform(corpus).toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
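As mentioned above, fit_transform does both steps in one call; a quick check:

CountVectorizer().fit_transform(corpus).toarray()  # same matrix as above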

Another borrowed example

The original link for this example is here.
Let's write a function that extracts the n most frequent terms in a corpus, together with their frequencies.
This uses the ngram_range parameter.

def get_ngrams(text, ngram_from=2, ngram_to=2, n=None, max_features=20000, stop_words=None):
    # Build the vocabulary over the requested n-gram range
    vec = CountVectorizer(ngram_range=(ngram_from, ngram_to),
                          max_features=max_features,
                          stop_words=stop_words).fit(text)
    bag_of_words = vec.transform(text)
    # Total count of each n-gram across all documents (shape: 1 x n_features)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, i]) for word, i in vec.vocabulary_.items()]
    # Sort by frequency, highest first
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

Let's run it:

get_ngrams(corpus ,ngram_from=1, ngram_to=1,)
[('this', 4),
 ('is', 4),
 ('the', 4),
 ('document', 4),
 ('first', 2),
 ('second', 1),
 ('and', 1),
 ('third', 1),
 ('one', 1)]

Adjust the parameters, and the meaning of ngram_range becomes clear:

get_ngrams(corpus ,ngram_from=1, ngram_to=2,)
[('this', 4),
 ('is', 4),
 ('the', 4),
 ('document', 4),
 ('is the', 3),
 ('first', 2),
 ('this is', 2),
 ('the first', 2),
 ('first document', 2),
 ('second', 1),
 ('this document', 1),
 ('document is', 1),
 ('the second', 1),
 ('second document', 1),
 ('and', 1),
 ('third', 1),
 ('one', 1),
 ('and this', 1),
 ('the third', 1),
 ('third one', 1),
 ('is this', 1),
 ('this the', 1)]

Set stop_words, and a lot of meaningless terms are filtered out at once:

get_ngrams(corpus ,ngram_from=1, ngram_to=2,stop_words = 'english')
[('document', 4),
 ('second', 1),
 ('document second', 1),
 ('second document', 1)]
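Since pandas was imported at the top, the result can also be wrapped in a DataFrame for easier viewing; a small optional sketch (the column names are my own):

pd.DataFrame(get_ngrams(corpus, ngram_from=1, ngram_to=2, stop_words='english'),
             columns=['ngram', 'count'])  # two columns: the n-gram and its total count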
