Python text feature extraction example: understanding TfidfVectorizer in scikit-learn

I see several questions in this post.

How do the different arguments in TfidfVectorizer interact with one another?









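The stop word list that sklearn uses can be imported as follows (older releases also exposed it under sklearn.feature_extraction.stop_words, a module that has since been removed):

    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS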

The logic behind removing stop words is that these words carry little meaning and show up in most texts:

    [('the', 79808), ('of', 40024), ('and', 38311), ('to', 28765),
     ('in', 22020), ('a', 21124), ('that', 12512), ('he', 12401),
     ('was', 11410), ('it', 10681), ('his', 10034), ('is', 9773),
     ('with', 9739), ('as', 8064), ('i', 7679), ('had', 7383),
     ('for', 6938), ('at', 6789), ('by', 6735), ('on', 6639)]
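The post does not say which corpus produced the counts above. As a rough sketch of how such a ranking could be computed, the snippet below counts words in a short made-up string; the helper name top_words and the sample text are assumptions for illustration only.

```python
import re
from collections import Counter


def top_words(text, n=5):
    """Return the n most frequent lowercase words in text with their counts."""
    # Keep letters and apostrophes together so "don't" stays one word here.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


# Made-up sample text; a real corpus would be far larger.
sample = "the cat and the dog and the bird"
top = top_words(sample, 2)
print(top)  # [('the', 3), ('and', 2)]
```

On a large English corpus the top of such a list is dominated by function words, which is why they are the usual candidates for a stop list.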




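token_pattern defaults to the regular expression \b\w\w+\b, which means a token must be at least 2 characters long, so words like "I" and "a" and single digits such as 0-9 are dropped. You will also notice that it splits on apostrophes: "don't" becomes "don", and the leftover "t" is discarded for being too short.

What happens first, ngram generation or stop word removal?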

Let's do a small test.

    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import (
        TfidfVectorizer, CountVectorizer, ENGLISH_STOP_WORDS)

    docs = np.array(['what is tfidf',
                     'what does tfidf stand for',
                     'what is tfidf and what does it stand for',
                     'tfidf is what',
                     "why don't I use tfidf",
                     '1 in 10 people use tfidf'])

    # use_idf=False and norm=None make the vectorizer return raw term counts
    tfidf = TfidfVectorizer(use_idf=False, norm=None, ngram_range=(1, 1))
    matrix = tfidf.fit_transform(docs).toarray()
    # get_feature_names_out() replaces get_feature_names() in scikit-learn 1.0+
    df = pd.DataFrame(matrix, index=docs, columns=tfidf.get_feature_names_out())

    # Naive whitespace split, just to see which words the stop list would drop
    for doc in docs:
        print(' '.join(word for word in doc.split() if word not in ENGLISH_STOP_WORDS))


Output:

    tfidf
    does tfidf stand
    tfidf does stand
    tfidf
    don't I use tfidf
    1 10 people use tfidf

Now let's print df:

                                               10  and  does  don  for   in   is  \
    what is tfidf                             0.0  0.0   0.0  0.0  0.0  0.0  1.0
    what does tfidf stand for                 0.0  0.0   1.0  0.0  1.0  0.0  0.0
    what is tfidf and what does it stand for  0.0  1.0   1.0  0.0  1.0  0.0  1.0
    tfidf is what                             0.0  0.0   0.0  0.0  0.0  0.0  1.0
    why don't I use tfidf                     0.0  0.0   0.0  1.0  0.0  0.0  0.0
    1 in 10 people use tfidf                  1.0  0.0   0.0  0.0  0.0  1.0  0.0

                                               it  people  stand  tfidf  use  \
    what is tfidf                             0.0     0.0    0.0    1.0  0.0
    what does tfidf stand for                 0.0     0.0    1.0    1.0  0.0
    what is tfidf and what does it stand for  1.0     0.0    1.0    1.0  0.0
    tfidf is what                             0.0     0.0    0.0    1.0  0.0
    why don't I use tfidf                     0.0     0.0    0.0    1.0  1.0
    1 in 10 people use tfidf                  0.0     1.0    0.0    1.0  1.0

                                              what  why
    what is tfidf                              1.0  0.0
    what does tfidf stand for                  1.0  0.0
    what is tfidf and what does it stand for   2.0  0.0
    tfidf is what                              1.0  0.0
    why don't I use tfidf                      0.0  1.0
    1 in 10 people use tfidf                   0.0  0.0

Note: with use_idf=False and norm=None, TfidfVectorizer behaves like sklearn's CountVectorizer. It only returns counts.
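That equivalence is easy to check directly. The sketch below fits both vectorizers on a shortened, made-up document list and compares the results; it assumes default settings for every other argument (in particular sublinear_tf=False).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Shortened sample documents, for illustration only.
docs = ["what is tfidf", "what does tfidf stand for", "tfidf is what"]

count_vec = CountVectorizer()
tf_vec = TfidfVectorizer(use_idf=False, norm=None)

counts = count_vec.fit_transform(docs).toarray()
tf_only = tf_vec.fit_transform(docs).toarray()

# Both vectorizers learn the same vocabulary and the same cell values;
# the only difference is that TfidfVectorizer returns floats.
same_vocabulary = count_vec.vocabulary_ == tf_vec.vocabulary_
same_values = np.array_equal(counts, tf_only)
print(same_vocabulary, same_values)
```

Turning the IDF weighting and the normalisation back on is what makes the two outputs diverge.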



Let's remove the stop words and look at df again, this time with ngram_range=(1, 2) so that bigrams are included:

    tfidf = TfidfVectorizer(use_idf=False, norm=None, stop_words='english', ngram_range=(1, 2))

Output:

                                               10  10 people  does  does stand  \
    what is tfidf                             0.0        0.0   0.0         0.0
    what does tfidf stand for                 0.0        0.0   1.0         0.0
    what is tfidf and what does it stand for  0.0        0.0   1.0         1.0
    tfidf is what                             0.0        0.0   0.0         0.0
    why don't I use tfidf                     0.0        0.0   0.0         0.0
    1 in 10 people use tfidf                  1.0        1.0   0.0         0.0

                                              does tfidf  don  don use  people  \
    what is tfidf                                    0.0  0.0      0.0     0.0
    what does tfidf stand for                        1.0  0.0      0.0     0.0
    what is tfidf and what does it stand for         0.0  0.0      0.0     0.0
    tfidf is what                                    0.0  0.0      0.0     0.0
    why don't I use tfidf                            0.0  1.0      1.0     0.0
    1 in 10 people use tfidf                         0.0  0.0      0.0     1.0

                                              people use  stand  tfidf  \
    what is tfidf                                    0.0    0.0    1.0
    what does tfidf stand for                        0.0    1.0    1.0
    what is tfidf and what does it stand for         0.0    1.0    1.0
    tfidf is what                                    0.0    0.0    1.0
    why don't I use tfidf                            0.0    0.0    1.0
    1 in 10 people use tfidf                         1.0    0.0    1.0

                                              tfidf does  tfidf stand  use  \
    what is tfidf                                    0.0          0.0  0.0
    what does tfidf stand for                        0.0          1.0  0.0
    what is tfidf and what does it stand for         1.0          0.0  0.0
    tfidf is what                                    0.0          0.0  0.0
    why don't I use tfidf                            0.0          0.0  1.0
    1 in 10 people use tfidf                         0.0          0.0  1.0

                                              use tfidf
    what is tfidf                                   0.0
    what does tfidf stand for                       0.0
    what is tfidf and what does it stand for        0.0
    tfidf is what                                   0.0
    why don't I use tfidf                           1.0
    1 in 10 people use tfidf                        1.0

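Takeaway: the token "don use" appears because "don't I use" loses its 't, and "I" is dropped for being shorter than two characters, so the surviving words are joined into "don use". That bigram never occurs in the original sentence, so the features can misrepresent the structure of the text a little!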
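The "don use" bigram can be reproduced without scikit-learn. The following is a simplified sketch of the order of operations (lowercase, tokenize with the default pattern, drop stop words, then build ngrams), not the library's actual implementation; the analyze helper and the tiny STOP_WORDS set are assumptions standing in for the real analyzer and the full English stop list.

```python
import re

# Default token pattern: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Assumption: a tiny stand-in for the full English stop word list.
STOP_WORDS = {"what", "is", "why", "in", "for", "and", "it"}


def analyze(text, stop_words=STOP_WORDS, ngram_range=(1, 2)):
    """Lowercase, tokenize, drop stop words, then build word ngrams."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    # Stop words are removed BEFORE ngrams are built, so neighbours of a
    # removed word end up adjacent to each other.
    tokens = [t for t in tokens if t not in stop_words]
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


features = analyze("why don't I use tfidf")
print(features)  # ['don', 'use', 'tfidf', 'don use', 'use tfidf']
```

The single-character pieces "t" and "i" never become tokens, and "why" is filtered out as a stop word, which leaves "don" and "use" next to each other when the bigrams are formed.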

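Answer: short tokens are dropped during tokenization, stop words are removed, and only then are the ngrams generated, which can return unexpected results.

Does it make sense to use max_df/min_df arguments together with use_idf argument?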



