Reimplementing the Basic Functionality of CountVectorizer

sklearn.feature_extraction.text provides four text feature extraction classes:

  • CountVectorizer
  • TfidfVectorizer
  • TfidfTransformer
  • HashingVectorizer

CountVectorizer converts the words in a text collection into a term-frequency matrix; its fit_transform method counts how many times each word appears in each document.

Parameters

Attributes

Attribute             Purpose
vocabulary_           the vocabulary; a dict mapping each term to its column index
get_feature_names()   all terms in the corpus, as a list
stop_words_           the stop words that were filtered out

Methods

Method                      Purpose
fit_transform(X)            fit the model and return the term-document matrix
fit(raw_documents[, y])     learn the vocabulary dictionary from the document collection

Getting Started Example

from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]  # each list element, e.g. "dog cat fish", is one document's string
cv = CountVectorizer()  # create the bag-of-words structure
cv_fit = cv.fit_transform(texts)
# The line above is equivalent to the two lines below:
# cv.fit(texts)
# cv_fit = cv.transform(texts)

print(cv.get_feature_names())    # ['bird', 'cat', 'dog', 'fish']  the generated vocabulary as a list
                                 # (renamed to get_feature_names_out in scikit-learn >= 1.0)

print(cv.vocabulary_)            # {'bird': 0, 'cat': 1, 'dog': 2, 'fish': 3}  dict: key = term,
                                 # value = that term's index, which is also its column in the tf matrix

print(cv_fit)
#(0,3)1   document 0, vocabulary index 3, count 1
#(0,1)1
#(0,2)1
#(1,1)2
#(1,2)1
#(2,0)1
#(2,3)1
#(3,0)1

print(cv_fit.toarray())  # .toarray() converts the sparse result into a dense NumPy array
#[[0 1 1 1]
# [0 2 1 0]
# [1 0 0 1]
# [1 0 0 0]]

print(cv_fit.toarray().sum(axis=0))  # each term's total count across all documents
#[2 3 2 2]
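Conceptually, fit builds a sorted vocabulary and transform counts tokens per document. A minimal, dependency-free sketch that reproduces the dense matrix above on the same corpus:

```python
texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

# fit: collect unique tokens and assign column indices in sorted order
vocab = {w: i for i, w in enumerate(sorted({w for t in texts for w in t.split()}))}

# transform: one row per document, one column per vocabulary term
matrix = []
for t in texts:
    row = [0] * len(vocab)
    for w in t.split():
        row[vocab[w]] += 1
    matrix.append(row)

print(vocab)   # {'bird': 0, 'cat': 1, 'dog': 2, 'fish': 3}
print(matrix)  # [[0, 1, 1, 1], [0, 2, 1, 0], [1, 0, 0, 1], [1, 0, 0, 0]]
```

This skips the lowercasing, punctuation stripping, and stop-word handling that the full reimplementation below takes care of.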

Reimplementation

Supported functionality:

  • text preprocessing (lowercasing, punctuation stripping, stop-word removal)
  • fit
  • transform
  • n-gram support

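The n-gram step listed above is just a sliding window over the token list; a minimal sketch of that one piece:

```python
def ngrams(tokens, n):
    # slide a window of size n over the token list and join with spaces
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["the", "cat", "sat", "down"], 2))
# ['the cat', 'cat sat', 'sat down']
```

With n=1 this degenerates to the plain token list, so unigram counting falls out of the same code path.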
import numpy as np

with open('data.txt', 'r', encoding='utf-8') as f:
    data = [i.strip() for i in f.readlines()]

class MyCountVectorizer(object):
    def __init__(self, n=1, remove_stop_words=False):
        self.n = n
        self.remove_stop_words = remove_stop_words
        # Keep state per instance: class-level mutable attributes
        # would be shared between all instances.
        self.vocabulary = {}
        self.corpus = []
        
    def clean(self, corpus):
        if self.remove_stop_words:
            # Load stopword list
            with open('stopwords.txt') as f:
                stop_words = [w.strip() for w in f.readlines()]
        for text in corpus:
            # Lower case
            text = text.lower()
            # Remove special punctuation
            for c in """!"'#$%&()*+,-./:;<=>?@[\\]^_`{|}~“”‘’""":
                text = text.replace(c, ' ')
            # Keep alphanumeric tokens longer than one character
            word_ls = [w for w in text.split() if w.isalnum() and len(w) > 1]
            if self.remove_stop_words:
                word_ls = [w for w in word_ls if w not in stop_words]
            # corpus: document size * vocabulary size
            n_gram_word_ls = []
            for idx in range(len(word_ls)):
                if idx + self.n > len(word_ls):
                    break
                n_gram_word = ' '.join(word_ls[idx: idx + self.n])
                n_gram_word_ls.append(n_gram_word)
            self.corpus.append(n_gram_word_ls)    
    
    def fit(self, corpus):
        # Create a dictionary of terms which map to columns of the term-frequency matrix.
        self.clean(corpus)
        for row in self.corpus:
            for word in row:
                if word not in self.vocabulary:
                    self.vocabulary[word] = len(self.vocabulary)
        return
    
    def transform(self):
        # Create a term-frequency matrix of appropriate size (document size * vocabulary size)
        tf_matrix = []
        size = len(self.vocabulary)
        for doc in self.corpus:
            # Count how often the word appears in the document
            word_count = {}
            for word in doc:
                word_count[word] = word_count.get(word, 0) + 1
            # Construct the term-frequency vector of the row
            row = [0 for i in range(size)]
            for word, value in word_count.items():
                row[self.vocabulary[word]] = value
            tf_matrix.append(row)
        tf_matrix = np.array(tf_matrix)
        return tf_matrix
    
    def get_vocab(self):
        # Returns the dictionary of terms
        return self.vocabulary
    
cv = MyCountVectorizer(1, True)
cv.fit(data)
print(cv.get_vocab())
term_frequency_matrix = cv.transform()
print(term_frequency_matrix.shape)
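As a sanity check of the n-gram path, the same fit/transform logic can be condensed and run on an inline corpus, so no data.txt or stopwords.txt is needed (the function name here is illustrative, not part of the class):

```python
import numpy as np

def fit_transform_ngrams(corpus, n=2):
    # tokenize each document and build its n-grams with a sliding window
    docs = []
    for text in corpus:
        tokens = text.lower().split()
        docs.append([' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)])
    # assign vocabulary indices in first-seen order, as the class does
    vocab = {}
    for doc in docs:
        for gram in doc:
            vocab.setdefault(gram, len(vocab))
    # one row per document, one column per n-gram
    tf = np.zeros((len(docs), len(vocab)), dtype=int)
    for r, doc in enumerate(docs):
        for gram in doc:
            tf[r, vocab[gram]] += 1
    return vocab, tf

vocab, tf = fit_transform_ngrams(["the cat sat", "the cat ran"], n=2)
print(vocab)  # {'the cat': 0, 'cat sat': 1, 'cat ran': 2}
print(tf)
# [[1 1 0]
#  [1 0 1]]
```

The shared bigram "the cat" lands in the same column of both rows, which is exactly the alignment the vocabulary dictionary exists to guarantee.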

References:
"sklearn——CountVectorizer详解" (CountVectorizer in detail), CSDN blog: https://blog.csdn.net/weixin_38278334/article/details/82320307
