sklearn.feature_extraction.text
.TfidfVectorizer
class sklearn.feature_extraction.text.TfidfVectorizer(input=’content’, encoding=’utf-8’, decode_error=’strict’, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer=’word’, stop_words=None, token_pattern=’(?u)\b\w\w+\b’, ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=
, norm=’l2’, use_idf=True, smooth_idf=True, sublinear_tf=False)
Convert a collection of raw documents to a matrix of TF-IDF features.
Equivalent to CountVectorizer followed by TfidfTransformer.
Read more in the User Guide.
将原始文档的集合转换为TF-IDF功能的矩阵。
相当于CountVectorizer,后跟TfidfTransformer。
在“ 用户指南”中阅读更多内容。
Parameters:
- input : string {‘filename’, ‘file’, ‘content’}
If ‘filename’, the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze.
If ‘file’, the sequence items must have a ‘read’ method (file-like object) that is called to fetch the bytes in memory.
Otherwise the input is expected to be the sequence strings or bytes items are expected to be analyzed directly.
如果是'filename',序列作为参数传递给拟合器,预计为文件名列表,这需要读取原始内容进行分析
如果是'file',序列中元素必须有一个”read“的方法(类似文件的对象),被调用作为获取内存中的字节数
否则,输入预计为序列串,或字节数据项都预计可直接进行分析。
- encoding : string, ‘utf-8’ by default.
If bytes or files are given to analyze, this encoding is used to decode.
如果给出要解析的字节或文件,此编码将用于解码
- decode_error : {‘strict’, ‘ignore’, ‘replace’}
Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are ‘ignore’ and ‘replace’.
如果一个给出的字节序列包含的字符不是给定的编码,指示应该如何去做。默认情况下,它是'strict',这意味着的UnicodeDecodeError将会报错,其他值是'ignore'和'replace'
- strip_accents : {‘ascii’, ‘unicode’, None}
Remove accents during the preprocessing step. ‘ascii’ is a fast method that only works on characters that have an direct ASCII mapping. ‘unicode’ is a slightly slower method that works on any characters. None (default) does nothing.
在预处理步骤中去除编码规则(accents),”ASCII码“是一种快速的方法,仅适用于有一个直接的ASCII字符映射,"unicode"是一个稍慢一些的方法,None(默认)什么都不做
- analyzer : string, {‘word’, ‘char’} or callable
Whether the feature should be made of word or character n-grams.
If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.
定义特征为词(word)或n-gram字符,如果传递给它的调用被用于抽取未处理输入源文件的特征序列
- preprocessor : callable or None (default)
Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.
当保留令牌和”n-gram“生成步骤时,覆盖预处理(字符串变换)的阶段
- tokenizer : callable or None (default)
Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == 'word'.
当保留预处理和n-gram生成步骤时,覆盖字符串令牌步骤
- ngram_range : tuple (min_n, max_n)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
要提取的n-gram的n-values的下限和上限范围,在min_n <= n <= max_n区间的n的全部值
- stop_words : string {‘english’}, list, or None (default)
If a string, it is passed to _check_stop_list and the appropriate stop list is returned. ‘english’ is currently the only supported string value.
If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.
If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
如果未english,用于英语内建的停用词列表
如果未list,该列表被假定为包含停用词,列表中的所有词都将从令牌中删除
如果None,不使用停用词。max_df可以被设置为范围[0.7, 1.0)的值,基于内部预料词频来自动检测和过滤停用词
- lowercase : boolean, default True
Convert all characters to lowercase before tokenizing.
在令牌标记前转换所有的字符为小写
- token_pattern : string
Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
正则表达式显示了”token“的构成,仅当analyzer == ‘word’时才被使用。两个或多个字母数字字符的正则表达式(标点符号完全被忽略,始终被视为一个标记分隔符)。
- max_df : float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
当构建词汇表时,严格忽略高于给出阈值的文档频率的词条,语料指定的停用词。如果是浮点值,该参数代表文档的比例,如果是整型,代表绝对计数值。如果词汇表不为None,此参数被忽略。
- min_df : float in range [0.0, 1.0] or int, default=1
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
当构建词汇表时,严格忽略低于给出阈值的文档频率的词条,语料指定的停用词。如果是浮点值,该参数代表文档的比例,整型绝对计数值,如果词汇表不为None,此参数被忽略。
- max_features : int or None, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
This parameter is ignored if vocabulary is not None.
如果不为None,构建一个词汇表,仅考虑max_features--按语料词频排序,如果词汇表不为None,这个参数被忽略
- vocabulary : Mapping or iterable, optional
Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents.
也是一个映射(Map)(例如,字典),其中键是词条而值是在特征矩阵中索引,或词条中的迭代器。如果没有给出,词汇表被确定来自输入文件。在映射中索引不能有重复,并且不能在0到最大索引值之间有间断。
- binary : boolean, default=False
If True, all non-zero term counts are set to 1. This does not mean outputs will have only 0/1 values, only that the tf term in tf-idf is binary. (Set idf and normalization to False to get 0/1 outputs.)
如果未True,所有非零计数被设置为1,这对于离散概率模型是有用的,建立二元事件模型,而不是整型计数
- dtype : type, optional
Type of the matrix returned by fit_transform() or transform().
通过fit_transform()或transform()返回矩阵的类型
- norm : ‘l1’, ‘l2’ or None, optional
Norm used to normalize term vectors. None for no normalization.
范数用于标准化词条向量。None为不归一化
- use_idf : boolean, default=True
Enable inverse-document-frequency reweighting.
启动inverse-document-frequency重新计算权重
- smooth_idf : boolean, default=True
Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
通过加1到文档频率平滑idf权重,为防止除零,加入一个额外的文档
- sublinear_tf : boolean, default=False
Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
应用线性缩放TF,例如,使用1+log(tf)覆盖tf
Attributes:
- vocabulary_ : dict
A mapping of terms to feature indices.
- idf_ : array, shape = [n_features], or None
The learned idf vector (global term weights) when use_idf is set to True, None otherwise.
- stop_words_ : set
Terms that were ignored because they either:
occurred in too many documents (max_df)
occurred in too few documents (min_df)
were cut off by feature selection (max_features).
This is only available if no vocabulary was given.
参考资料
- http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
- https://blog.csdn.net/laobai1015/article/details/80451371