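The sklearn.feature_extraction module can be used to extract features in a format supported by machine learning algorithms from datasets consisting of formats such as text and images.
Note
Feature extraction is very different from feature selection: the former consists of transforming arbitrary data, such as text or images, into numerical features usable for machine learning. The latter is a machine learning technique applied to these features.
The class DictVectorizer can be used to convert feature arrays represented as lists of standard Python dict objects to the NumPy/SciPy representation used by scikit-learn estimators.
While not particularly fast to process, Python's dict has the advantages of being convenient to use, being sparse (absent features need not be stored), and storing feature names in addition to values.
DictVectorizer implements what is called one-of-K or "one-hot" coding for categorical (that is, discrete) features. Categorical features are "attribute-value" pairs where the value is restricted to a list of discrete possibilities without ordering (e.g. topic identifiers, types of objects, tags, names).
In the following example, "city" is a categorical attribute while "temperature" is a traditional numerical feature: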
>>> measurements = [
...     {'city': 'Dubai', 'temperature': 33.},
...     {'city': 'London', 'temperature': 12.},
...     {'city': 'San Francisco', 'temperature': 18.},
... ]
>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()
>>> vec.fit_transform(measurements).toarray()
array([[ 1., 0., 0., 33.],
       [ 0., 1., 0., 12.],
       [ 0., 0., 1., 18.]])
>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']
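To make the one-of-K idea concrete, here is a minimal pure-Python sketch of the same encoding. It is only an illustration under simple assumptions (string values are treated as categorical, everything else as numeric); the helper name one_hot_dicts is hypothetical and this is not how DictVectorizer is implemented.

```python
def one_hot_dicts(records):
    """Encode a list of dicts: string values become 'key=value' indicator
    columns, numeric values keep a column named after their key."""
    names = set()
    for record in records:
        for key, value in record.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    feature_names = sorted(names)
    column = {name: j for j, name in enumerate(feature_names)}

    rows = []
    for record in records:
        row = [0.0] * len(feature_names)  # absent features stay at zero
        for key, value in record.items():
            if isinstance(value, str):
                row[column[f"{key}={value}"]] = 1.0
            else:
                row[column[key]] = float(value)
        rows.append(row)
    return feature_names, rows


measurements = [
    {"city": "Dubai", "temperature": 33.0},
    {"city": "London", "temperature": 12.0},
    {"city": "San Francisco", "temperature": 18.0},
]
feature_names, rows = one_hot_dicts(measurements)
```

Each distinct city gets its own indicator column, while the numeric temperature is copied through unchanged.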
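DictVectorizer is also a useful representation transformation for training sequence classifiers in Natural Language Processing (NLP) models, which typically work by extracting feature windows around a particular word of interest.
For example, suppose that we have a first algorithm that extracts part-of-speech (PoS) tags that we want to use as complementary tags for training a sequence classifier (e.g. a chunker). The following dict could be such a window of features extracted around the word 'sat' in the sentence 'The cat sat on the mat.':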
>>> pos_window = [
... {
... 'word-2': 'the',
... 'pos-2': 'DT',
... 'word-1': 'cat',
... 'pos-1': 'NN',
... 'word+1': 'on',
... 'pos+1': 'PP',
... },
... # in a real application one would extract many such dictionaries
... ]
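This description can be vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier (possibly after being piped into a text.TfidfTransformer for normalization):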
>>> vec = DictVectorizer()
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> pos_vectorized
<1x6 sparse matrix of type '<... 'numpy.float64'>'
with 6 stored elements in Compressed Sparse ... format>
>>> pos_vectorized.toarray()
array([[ 1., 1., 1., 1., 1., 1.]])
>>> vec.get_feature_names()
['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']
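As you can imagine, if one extracts such a context around each individual word of a corpus of documents, the resulting matrix will be very wide (many one-hot features), with most of them being zero most of the time. So as to make the resulting data structure fit in memory, the DictVectorizer class uses a scipy.sparse matrix by default instead of a numpy.ndarray.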
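The class FeatureHasher is a high-speed, low-memory vectorizer that uses a technique known as feature hashing, or the "hashing trick". Instead of building a hash table of the features encountered in training, as the vectorizers do, instances of FeatureHasher apply a hash function to the features to determine their column index in sample matrices directly. The result is increased speed and reduced memory usage, at the expense of inspectability: the hasher does not remember what the input features looked like and has no inverse_transform method.
Since the hash function might cause collisions between (unrelated) features, a signed hash function is used, and the sign of the hash value determines the sign of the value stored in the output matrix for a feature. This way, collisions are likely to cancel out rather than accumulate error, and the expected mean of any output feature's value is zero.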
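The signed hashing idea can be sketched in a few lines of pure Python. This is only an illustration: it uses MD5 from hashlib as a stand-in hash (FeatureHasher itself uses MurmurHash3), the choice of which bit gives the sign is an assumption made for this sketch, and signed_hash_row is a hypothetical helper name.

```python
import hashlib


def signed_hash_row(features, n_features=8):
    """Map (name, value) pairs to a fixed-length row using a signed hash."""
    row = [0.0] * n_features
    for name, value in features:
        digest = hashlib.md5(name.encode("utf-8")).digest()
        h = int.from_bytes(digest[:4], "little")
        column = (h >> 1) % n_features   # remaining bits pick the column
        sign = 1.0 if h & 1 else -1.0    # lowest bit picks the sign
        row[column] += sign * value      # colliding features add up (or cancel)
    return row


row = signed_hash_row([("feat", 2), ("feat", 3.5), ("other", 1)])
```

No vocabulary is stored anywhere: the column of a feature depends only on its name, which is why the mapping cannot be inverted.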
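If non_negative=True is passed to the constructor, the absolute value is taken. This undoes some of the collision handling, but allows the output to be passed to estimators such as sklearn.naive_bayes.MultinomialNB or the sklearn.feature_selection.chi2 feature selector, which expect non-negative inputs.
FeatureHasher accepts either mappings (like Python's dict and its variants in the collections module), (feature, value) pairs, or strings, depending on the constructor parameter input_type. Mappings are treated as lists of (feature, value) pairs, while single strings have an implicit value of 1, so ['feat1', 'feat2', 'feat3'] is interpreted as [('feat1', 1), ('feat2', 1), ('feat3', 1)].
If a single feature occurs multiple times in a sample, the associated values are summed (so ('feat', 2) and ('feat', 3.5) become ('feat', 5.5)). The output from FeatureHasher is always a scipy.sparse matrix in the CSR format.
Feature hashing can be employed in document classification, but unlike text.CountVectorizer, FeatureHasher does not do word splitting or any other preprocessing except Unicode-to-UTF-8 encoding; see Vectorizing a large text corpus with the hashing trick, below, for a combined tokenizer/hasher.
As an example, consider a word-level natural language processing task that needs features extracted from (token, part_of_speech) pairs. One could use a Python generator function to extract features: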
def token_features(token, part_of_speech):
    if token.isdigit():
        yield "numeric"
    else:
        yield "token={}".format(token.lower())
        yield "token,pos={},{}".format(token, part_of_speech)
    if token[0].isupper():
        yield "uppercase_initial"
    if token.isupper():
        yield "all_uppercase"
    yield "pos={}".format(part_of_speech)
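The raw_X to be fed to FeatureHasher.transform can then be constructed using:
raw_X = (token_features(tok, pos_tagger(tok)) for tok in corpus)
and fed to a hasher with:
from sklearn.feature_extraction import FeatureHasher
hasher = FeatureHasher(input_type='string')
X = hasher.transform(raw_X)
to get a scipy.sparse matrix X.
Note the use of a generator comprehension, which introduces laziness into the feature extraction: tokens are only processed on demand from the hasher.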
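FeatureHasher uses a signed variant of MurmurHash3. As a result (and because of limitations in scipy.sparse), there is currently an upper limit on the number of features supported (the value is missing from this text).
The original formulation of the hashing trick by Weinberger et al. used two separate hash functions to determine the column index and the sign of a feature, respectively. The present implementation works under the assumption that the sign bit of MurmurHash3 is independent of its other bits.
Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two as the n_features parameter; otherwise the features will not be mapped evenly to the columns.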
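The effect of the modulo step can be seen on a toy example. The sketch below assumes a hypothetical 8-bit hash (values 0 to 255) purely to make the imbalance visible; with a real, much wider hash the imbalance is far smaller but arises in the same way. The helper name column_load is an assumption for illustration.

```python
from collections import Counter


def column_load(n_features, hash_range=256):
    """Count how many hash values in range(hash_range) land in each column."""
    return Counter(h % n_features for h in range(hash_range))


even = column_load(64)     # power of two: 256 / 64 = 4 hash values per column
uneven = column_load(100)  # columns 0..55 receive 3 values, 56..99 only 2
```

When n_features divides the size of the hash range, which is always the case for a power of two, every column receives the same number of hash values.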
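Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents with variable length.
In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content: tokenizing, counting, and normalizing.
In this scheme, each individual token occurrence frequency is treated as a feature, and the vector of all the token frequencies for a given document is considered a sample.
A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).
For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total, while each document will use 100 to 1000 unique words individually.
In order to be able to store such a matrix in memory, but also to speed up matrix / vector algebraic operations, implementations will typically use a sparse representation such as the implementations available in the scipy.sparse package.
CountVectorizer implements both tokenization and occurrence counting in a single class: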
>>> from sklearn.feature_extraction.text import CountVectorizer
This model has many parameters, however the default values are quite reasonable (please see the reference documentation for the details):
>>> vectorizer = CountVectorizer(min_df=1)
>>> vectorizer
CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict',
dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
Let's use it to tokenize and count the word occurrences of a minimalistic corpus of text documents:
>>> corpus = [
... 'This is the first document.',
... 'This is the second second document.',
... 'And the third one.',
... 'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X
<4x9 sparse matrix of type '<... 'numpy.int64'>'
with 19 stored elements in Compressed Sparse ... format>
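The default configuration tokenizes the string by extracting words of at least 2 letters. The specific function that does this step can be requested explicitly: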
>>> analyze = vectorizer.build_analyzer()
>>> analyze("This is a text document to analyze.") == (
... ['this', 'is', 'text', 'document', 'to', 'analyze'])
True
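Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. This interpretation of the columns can be retrieved as follows: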
>>> vectorizer.get_feature_names() == (
... ['and', 'document', 'first', 'is', 'one',
... 'second', 'the', 'third', 'this'])
True
>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
[0, 1, 0, 1, 0, 2, 1, 0, 1],
[1, 0, 0, 0, 1, 0, 1, 1, 0],
[0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
The converse mapping from feature name to column index is stored in the vocabulary_ attribute of the vectorizer:
>>> vectorizer.vocabulary_.get('document')
1
Hence words that were not seen in the training corpus will be completely ignored in future calls to the transform method:
>>> vectorizer.transform(['Something completely new.']).toarray()
...
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]]...)
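The tokenize, fit and transform steps described above can be sketched in pure Python. This is a simplified illustration, not CountVectorizer's implementation: it reuses the default token pattern shown earlier, returns dense lists instead of a sparse matrix, and the function names are hypothetical.

```python
import re
from collections import Counter

# Same pattern as the default token_pattern: words of at least two characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(docs):
    """Assign one column index to each distinct term, in sorted order."""
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: j for j, term in enumerate(terms)}


def transform(docs, vocabulary):
    """Count known terms per document; unseen terms are silently dropped."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[term] for term in vocabulary])
    return rows


corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]
vocabulary = fit_vocabulary(corpus)
X = transform(corpus, vocabulary)
unseen = transform(["Something completely new."], vocabulary)
```

Because transform only looks up terms already present in the vocabulary, a document made entirely of new words yields a row of zeros.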
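Note that in the previous corpus, the first and the last documents have exactly the same words and hence are encoded as equal vectors. In particular, we lose the information that the last document is an interrogative form. To preserve some of the local ordering information we can extract 2-grams of words in addition to the 1-grams (individual words):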
>>> bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
... token_pattern=r'\b\w+\b', min_df=1)
>>> analyze = bigram_vectorizer.build_analyzer()
>>> analyze('Bi-grams are cool!') == (
... ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool'])
True
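The vocabulary extracted by this vectorizer is hence much bigger and can now resolve ambiguities encoded in local positioning patterns: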
>>> X_2 = bigram_vectorizer.fit_transform(corpus).toarray()
>>> X_2
...
array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
[0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
[0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]]...)
In particular, the interrogative form "Is this" is only present in the last document:
>>> feature_index = bigram_vectorizer.vocabulary_.get('is this')
>>> X_2[:, feature_index]
array([0, 0, 0, 1]...)
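As a rough sketch of how word n-grams are derived from a token list, the hypothetical helper below emits all 1-grams first and then all 2-grams. It is only meant to illustrate why "is this" appears in the last document but not in the first.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all n-grams (joined by spaces) for min_n <= n <= max_n."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


grams = word_ngrams(["bi", "grams", "are", "cool"])
first_doc = word_ngrams(["this", "is", "the", "first", "document"])
last_doc = word_ngrams(["is", "this", "the", "first", "document"])
```

The two documents share every 1-gram, yet their 2-grams differ because 2-grams retain the order of adjacent words.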
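In a large text corpus, some words will be very present (e.g. "the", "a", "is" in English), hence carrying very little meaningful information about the actual contents of the document. If we were to feed the raw count data directly to a classifier, those very frequent terms would shadow the frequencies of rarer yet more interesting terms.
In order to re-weight the count features into floating point values suitable for usage by a classifier, it is very common to use the tf–idf transform.
Tf means term-frequency, while tf–idf means term-frequency times inverse document-frequency. This was originally a term weighting scheme developed for information retrieval (as a ranking function for search engine results) that has also found good use in document classification and clustering.
This normalization is implemented by the TfidfTransformer class: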
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer()
>>> transformer
TfidfTransformer(norm=...'l2', smooth_idf=True, sublinear_tf=False,
use_idf=True)
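See the reference documentation for the details on all the parameters.
Let's take an example with the following counts. The first term is present 100% of the time, hence not very interesting. The two other features are present in less than 50% of the documents, hence probably more representative of their content: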
>>> counts = [[3, 0, 1],
... [2, 0, 0],
... [3, 0, 0],
... [4, 0, 0],
... [3, 2, 0],
... [3, 0, 2]]
...
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf
<6x3 sparse matrix of type '<... 'numpy.float64'>'
with 9 stored elements in Compressed Sparse ... format>
>>> tfidf.toarray()
array([[ 0.85..., 0. ..., 0.52...],
[ 1. ..., 0. ..., 0. ...],
[ 1. ..., 0. ..., 0. ...],
[ 1. ..., 0. ..., 0. ...],
[ 0.55..., 0.83..., 0. ...],
[ 0.63..., 0. ..., 0.77...]])
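Each row is normalized to have unit Euclidean norm. The weights of each feature computed by the fit method call are stored in the idf_ model attribute: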
>>> transformer.idf_
array([ 1. ..., 2.25..., 1.84...])
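The numbers above can be reproduced by hand. The sketch below assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the term; this formula is an assumption chosen because it matches the idf_ values shown, and the function name tfidf is hypothetical.

```python
import math


def tfidf(counts):
    """Return (idf weights, l2-normalized tf-idf rows) for a count matrix."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of rows in which each term occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    rows = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        rows.append([v / norm for v in weighted])  # unit Euclidean norm
    return idf, rows


counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]
idf, rows = tfidf(counts)
```

The term present in every document gets the smallest weight (1.0), so after normalization the rarer terms contribute more to each row than their raw counts would suggest.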
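As tf–idf is very often used for text features, there is also another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model: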
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer(min_df=1)
>>> vectorizer.fit_transform(corpus)
...
<4x9 sparse matrix of type '<... 'numpy.float64'>'
with 19 stored elements in Compressed Sparse ... format>
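While the tf–idf normalization is often very useful, there might be cases where binary occurrence markers offer better features. This can be achieved by using the binary parameter of CountVectorizer. In particular, some estimators such as Bernoulli Naive Bayes explicitly model discrete boolean random variables. Also, very short texts are likely to have noisy tf–idf values, while the binary occurrence information is more stable.
As usual, the best way to adjust the feature extraction parameters is to use a cross-validated grid search, for instance by pipelining the feature extractor with a classifier: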
- Sample pipeline for text feature extraction and evaluation
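Text is made of characters, but files are made of bytes. These bytes represent characters according to some encoding. To work with text files in Python, their bytes must be decoded to a character set called Unicode. Common encodings are ASCII, Latin-1 (Western Europe), KOI8-R (Russian) and the universal encodings UTF-8 and UTF-16. Many others exist.
Note
An encoding can also be called a "character set", but this term is less accurate: several encodings can exist for a single character set.
The text feature extractors in scikit-learn know how to decode text files, but only if you tell them what encoding the files are in. The CountVectorizer takes an encoding parameter for this purpose. For modern text files, the correct encoding is probably UTF-8, which is therefore the default (encoding="utf-8").
If the text you are loading is not actually encoded with UTF-8, however, you will get a UnicodeDecodeError. The vectorizers can be told to be silent about decoding errors by setting the decode_error parameter to either "ignore" or "replace". See the documentation for the Python function bytes.decode for more details (type help(bytes.decode) at the Python prompt).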
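The behavior of the error handlers can be checked directly with bytes.decode. The snippet below is a small illustration on a Latin-1 byte string that is not valid UTF-8; the variable names are chosen for this sketch only.

```python
raw = b"holdselig sind deine Ger\xfcche"  # Latin-1 bytes, not valid UTF-8

try:
    raw.decode("utf-8")                   # strict decoding is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode("utf-8", errors="ignore")    # drops the offending byte
replaced = raw.decode("utf-8", errors="replace")  # inserts U+FFFD instead
latin1 = raw.decode("latin-1")                    # decodes with the actual encoding
```

Both "ignore" and "replace" avoid the exception, but only decoding with the right encoding recovers the original word, which is why these options may damage the resulting features.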
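If you are having trouble decoding text, here are some things to try:
- Find out what the encoding is using the UNIX command file. The Python chardet module comes with a script called chardetect.py that will guess the specific encoding, though you cannot rely on its guess being correct.
- Try UTF-8 and disregard the errors by setting decode_error='replace' in the vectorizer. This may damage the usefulness of your features.
- Decode the unknown text as latin-1 and then use ftfy to fix the errors.
- Fall back on a simple single-byte encoding such as latin-1. Some text may display incorrectly, but at least the same sequence of bytes will always represent the same feature.
For example, the following snippet uses chardet (not shipped with scikit-learn, must be installed separately) to figure out the encoding of three texts. It then vectorizes the texts and prints the learned vocabulary. The output is not shown here.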
>>> import chardet
>>> text1 = b"Sei mir gegr\xc3\xbc\xc3\x9ft mein Sauerkraut"
>>> text2 = b"holdselig sind deine Ger\xfcche"
>>> text3 = b"\xff\xfeA\x00u\x00f\x00 \x00F\x00l\x00\xfc\x00g\x00e\x00l\x00n\x00 \x00d\x00e\x00s\x00 \x00G\x00e\x00s\x00a\x00n\x00g\x00e\x00s\x00,\x00 \x00H\x00e\x00r\x00z\x00l\x00i\x00e\x00b\x00c\x00h\x00e\x00n\x00,\x00 \x00t\x00r\x00a\x00g\x00 \x00i\x00c\x00h\x00 \x00d\x00i\x00c\x00h\x00 \x00f\x00o\x00r\x00t\x00"
>>> decoded = [x.decode(chardet.detect(x)['encoding'])
... for x in (text1, text2, text3)]
>>> v = CountVectorizer().fit(decoded).vocabulary_
>>> for term in v: print(term)
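(Depending on the version of chardet, it might get the first one wrong.)
For an introduction to Unicode and character encodings in general, see Joel Spolsky's Absolute Minimum Every Software Developer Must Know About Unicode.
The bag of words representation is quite simplistic but surprisingly useful in practice.
In particular, in a supervised setting it can be successfully combined with fast and scalable linear models to train document classifiers, for instance:
- Classification of text documents using sparse features
In an unsupervised setting it can be used to group similar documents together by applying clustering algorithms such as K-means:
- Clustering text documents using k-means
Finally, it is possible to discover the main topics of a corpus by relaxing the hard assignment constraint of clustering, for instance by using Non-negative matrix factorization (NMF or NNMF):
- Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation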
A collection of unigrams (what bag of words is) cannot capture phrases and multi-word expressions, effectively disregarding any word order dependence. Additionally, the bag of words model doesn’t account for potential misspellings or word derivations.
N-grams to the rescue! Instead of building a simple collection of unigrams (n=1), one might prefer a collection of bigrams (n=2), where occurrences of pairs of consecutive words are counted.
One might alternatively consider a collection of character n-grams, a representation resilient against misspellings and derivations.
For example, let’s say we’re dealing with a corpus of two documents: ['words', 'wprds']. The second document contains a misspelling of the word ‘words’. A simple bag of words representation would consider these two as very distinct documents, differing in both of the two possible features. A character 2-gram representation, however, would find the documents matching in 4 out of 8 features, which may help the preferred classifier decide better:
>>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2), min_df=1)
>>> counts = ngram_vectorizer.fit_transform(['words', 'wprds'])
>>> ngram_vectorizer.get_feature_names() == (
... [' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp'])
True
>>> counts.toarray().astype(int)
array([[1, 1, 1, 0, 1, 1, 1, 0],
[1, 1, 0, 1, 1, 1, 0, 1]])
In the above example, the 'char_wb' analyzer is used, which creates n-grams only from characters inside word boundaries (padded with space on each side). The 'char' analyzer, alternatively, creates n-grams that span across words:
>>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(5, 5), min_df=1)
>>> ngram_vectorizer.fit_transform(['jumpy fox'])
...
<1x4 sparse matrix of type '<... 'numpy.int64'>'
with 4 stored elements in Compressed Sparse ... format>
>>> ngram_vectorizer.get_feature_names() == (
... [' fox ', ' jump', 'jumpy', 'umpy '])
True
>>> ngram_vectorizer = CountVectorizer(analyzer='char', ngram_range=(5, 5), min_df=1)
>>> ngram_vectorizer.fit_transform(['jumpy fox'])
...
<1x5 sparse matrix of type '<... 'numpy.int64'>'
with 5 stored elements in Compressed Sparse ... format>
>>> ngram_vectorizer.get_feature_names() == (
... ['jumpy', 'mpy f', 'py fo', 'umpy ', 'y fox'])
True
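The word-boundary behavior can be imitated in a few lines. The sketch below pads each whitespace-separated word with one space on each side and slides a window of n characters over it; it is a simplified illustration (words shorter than the window simply yield nothing here), and char_wb_ngrams is a hypothetical helper, not the library's analyzer.

```python
def char_wb_ngrams(text, n):
    """Character n-grams taken inside each space-padded word."""
    grams = []
    for word in text.split():
        padded = " " + word + " "
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


docs = ["words", "wprds"]
features = sorted({g for doc in docs for g in char_wb_ngrams(doc, 2)})
shared = set(char_wb_ngrams(docs[0], 2)) & set(char_wb_ngrams(docs[1], 2))
five_grams = sorted(char_wb_ngrams("jumpy fox", 5))
```

The misspelled word still shares four of the eight 2-grams with the correct one, and no n-gram ever crosses the space between "jumpy" and "fox".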
The word boundaries-aware variant char_wb is especially interesting for languages that use white-spaces for word separation, as it generates significantly less noisy features than the raw char variant in that case. For such languages it can increase both the predictive accuracy and convergence speed of classifiers trained using such features, while retaining the robustness with regards to misspellings and word derivations.
While some local positioning information can be preserved by extracting n-grams instead of individual words, bag of words and bag of n-grams destroy most of the inner structure of the document and hence most of the meaning carried by that internal structure.
In order to address the wider task of Natural Language Understanding, the local structure of sentences and paragraphs should thus be taken into account. Many such models will thus be cast as “Structured output” problems, which are currently outside of the scope of scikit-learn.
The above vectorization scheme is simple, but the fact that it holds an in-memory mapping from the string tokens to the integer feature indices (the vocabulary_ attribute) causes several problems when dealing with large datasets:
- pickling and un-pickling vectorizers with a large vocabulary_ can be very slow (typically much slower than pickling / un-pickling flat data structures such as a NumPy array of the same size),
- splitting the vectorization work into concurrent sub-tasks is not easily possible, as the vocabulary_ attribute would have to be a shared state with a fine grained synchronization barrier: the mapping from token string to feature index is dependent on ordering of the first occurrence of each token hence would have to be shared, potentially harming the concurrent workers’ performance to the point of making them slower than the sequential variant.
It is possible to overcome those limitations by combining the “hashing trick” (Feature hashing) implemented by the sklearn.feature_extraction.FeatureHasher class and the text preprocessing and tokenization features of the CountVectorizer.
This combination is implemented in HashingVectorizer, a transformer class that is mostly API compatible with CountVectorizer. HashingVectorizer is stateless, meaning that you don’t have to call fit on it:
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> hv = HashingVectorizer(n_features=10)
>>> hv.transform(corpus)
...
<4x10 sparse matrix of type '<... 'numpy.float64'>'
with 16 stored elements in Compressed Sparse ... format>
You can see that 16 non-zero feature tokens were extracted in the vector output: this is less than the 19 non-zeros extracted previously by the CountVectorizer on the same toy corpus. The discrepancy comes from hash function collisions because of the low value of the n_features parameter.
In a real world setting, the n_features parameter can be left to its default value of 2 ** 20 (roughly one million possible features). If memory or downstream model size is an issue, selecting a lower value such as 2 ** 18 might help without introducing too many additional collisions on typical text classification tasks.
Note that the dimensionality does not affect the CPU training time of algorithms which operate on CSR matrices (LinearSVC(dual=True), Perceptron, SGDClassifier, PassiveAggressive) but it does for algorithms that work with CSC matrices (LinearSVC(dual=False), Lasso(), etc).
Let’s try again with the default setting:
>>> hv = HashingVectorizer()
>>> hv.transform(corpus)
...
<4x1048576 sparse matrix of type '<... 'numpy.float64'>'
with 19 stored elements in Compressed Sparse ... format>
We no longer get the collisions, but this comes at the expense of a much larger dimensionality of the output space. Of course, other terms than the 19 used here might still collide with each other.
The HashingVectorizer also comes with the following limitations:
- it is not possible to invert the model (no inverse_transform method), nor to access the original string representation of the features, because of the one-way nature of the hash function that performs the mapping.
- it does not provide IDF weighting; a TfidfTransformer can be appended to it in a pipeline if required.
An interesting development of using a HashingVectorizer is the ability to perform out-of-core scaling. This means that we can learn from data that does not fit into the computer’s main memory.
A strategy to implement out-of-core scaling is to stream data to the estimator in mini-batches. Each mini-batch is vectorized using HashingVectorizer so as to guarantee that the input space of the estimator has always the same dimensionality. The amount of memory used at any time is thus bounded by the size of a mini-batch. Although there is no limit to the amount of data that can be ingested using such an approach, from a practical point of view the learning time is often limited by the CPU time one wants to spend on the task.
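The mini-batch strategy can be sketched without any library. The code below is a simplified stand-in, not HashingVectorizer: it uses zlib.crc32 as an assumed hash, whitespace tokenization, and dense rows, and both function names are hypothetical. In a real setting each batch would be handed to an estimator that supports incremental learning.

```python
import zlib
from itertools import islice


def hash_vectorize(doc, n_features=16):
    """Stateless vectorizer: no fit step, a column depends only on the token."""
    row = [0] * n_features
    for token in doc.lower().split():
        row[zlib.crc32(token.encode("utf-8")) % n_features] += 1
    return row


def iter_minibatches(doc_stream, batch_size):
    """Yield lists of at most batch_size vectorized documents."""
    doc_iter = iter(doc_stream)
    while True:
        batch = [hash_vectorize(doc) for doc in islice(doc_iter, batch_size)]
        if not batch:
            return
        yield batch  # only this batch needs to be held in memory


stream = (f"document number {i}" for i in range(5))  # stands in for a large corpus
batches = list(iter_minibatches(stream, batch_size=2))
```

Because the vectorizer keeps no state, every batch has the same number of columns regardless of which documents it contains.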
For a full-fledged example of out-of-core scaling in a text classification task see Out-of-core classification of text documents.
It is possible to customize the behavior by passing a callable to the vectorizer constructor:
>>> def my_tokenizer(s):
... return s.split()
...
>>> vectorizer = CountVectorizer(tokenizer=my_tokenizer)
>>> vectorizer.build_analyzer()(u"Some... punctuation!") == (
... ['some...', 'punctuation!'])
True
In particular we name:
- preprocessor: a callable that takes an entire document as input (as a single string), and returns a possibly transformed version of the document, still as an entire string. This can be used to remove HTML tags, lowercase the entire document, etc.
- tokenizer: a callable that takes the output from the preprocessor and splits it into tokens, then returns a list of these.
- analyzer: a callable that replaces the preprocessor and tokenizer. The default analyzers all call the preprocessor and tokenizer, but custom analyzers will skip this. N-gram extraction and stop word filtering take place at the analyzer level, so a custom analyzer may have to reproduce these steps.
(Lucene users might recognize these names, but be aware that scikit-learn concepts may not map one-to-one onto Lucene concepts.)
To make the preprocessor, tokenizer and analyzers aware of the model parameters, it is possible to derive from the class and override the build_preprocessor, build_tokenizer and build_analyzer factory methods instead of passing custom functions.
Some tips and tricks:
- If documents are pre-tokenized by an external package, then store them in files (or strings) with the tokens separated by whitespace and pass analyzer=str.split
- Fancy token-level analysis such as stemming, lemmatizing, compound splitting, filtering based on part-of-speech, etc. are not included in the scikit-learn codebase, but can be added by customizing either the tokenizer or the analyzer. Here’s a CountVectorizer with a tokenizer and lemmatizer using NLTK:
>>> from nltk import word_tokenize
>>> from nltk.stem import WordNetLemmatizer
>>> class LemmaTokenizer(object):
...     def __init__(self):
...         self.wnl = WordNetLemmatizer()
...     def __call__(self, doc):
...         return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]
...
>>> vect = CountVectorizer(tokenizer=LemmaTokenizer())
(Note that this will not filter out punctuation.)
Customizing the vectorizer can also be useful when handling Asian languages that do not use an explicit word separator such as whitespace.
The extract_patches_2d function extracts patches from an image stored as a two-dimensional array, or three-dimensional with color information along the third axis. For rebuilding an image from all its patches, use reconstruct_from_patches_2d. For example, let us generate a 4x4 pixel picture with 3 color channels (e.g. in RGB format):
>>> import numpy as np
>>> from sklearn.feature_extraction import image
>>> one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))
>>> one_image[:, :, 0] # R channel of a fake RGB picture
array([[ 0, 3, 6, 9],
[12, 15, 18, 21],
[24, 27, 30, 33],
[36, 39, 42, 45]])
>>> patches = image.extract_patches_2d(one_image, (2, 2), max_patches=2,
... random_state=0)
>>> patches.shape
(2, 2, 2, 3)
>>> patches[:, :, :, 0]
array([[[ 0, 3],
[12, 15]],
[[15, 18],
[27, 30]]])
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> patches.shape
(9, 2, 2, 3)
>>> patches[4, :, :, 0]
array([[15, 18],
[27, 30]])
Let us now try to reconstruct the original image from the patches by averaging on overlapping areas:
>>> reconstructed = image.reconstruct_from_patches_2d(patches, (4, 4, 3))
>>> np.testing.assert_array_equal(one_image, reconstructed)
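To see where the 9 patches and the contents of patch number 4 come from, here is a pure-Python sketch of exhaustive patch extraction on a single channel. It is an illustration only: it uses nested lists rather than NumPy arrays, and the row-then-column scan order is an assumption that happens to agree with the output shown above. The name extract_patches is hypothetical.

```python
def extract_patches(image, patch_h, patch_w):
    """Return every patch_h x patch_w patch, scanning rows then columns."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(height - patch_h + 1):
        for left in range(width - patch_w + 1):
            patches.append([row[left:left + patch_w]
                            for row in image[top:top + patch_h]])
    return patches


# R channel of the fake 4x4 RGB picture: values 0, 3, 6, ..., 45.
red = [[3 * (4 * i + j) for j in range(4)] for i in range(4)]
patches = extract_patches(red, 2, 2)
```

A 4x4 image contains (4 - 2 + 1) ** 2 = 9 overlapping 2x2 patches, and the patch at index 4 is the one whose top-left corner sits at row 1, column 1.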
The PatchExtractor class works in the same way as extract_patches_2d, only it supports multiple images as input. It is implemented as an estimator, so it can be used in pipelines. See:
>>> five_images = np.arange(5 * 4 * 4 * 3).reshape(5, 4, 4, 3)
>>> patches = image.PatchExtractor((2, 2)).transform(five_images)
>>> patches.shape
(45, 2, 2, 3)
Several estimators in scikit-learn can use connectivity information between features or samples. For instance, Ward clustering (Hierarchical clustering) can cluster together only neighboring pixels of an image, thus forming contiguous patches.
For this purpose, the estimators use a ‘connectivity’ matrix, giving which samples are connected.
The function img_to_graph returns such a matrix from a 2D or 3D image. Similarly, grid_to_graph builds a connectivity matrix for images given the shape of these images.
These matrices can be used to impose connectivity in estimators that use connectivity information, such as Ward clustering (Hierarchical clustering), but also to build precomputed kernels, or similarity matrices.
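What a pixel connectivity structure encodes can be sketched as a plain edge set. The code below assumes row-major flattening of the pixel grid and 4-neighbor connectivity; both are assumptions for illustration, and grid_edges is a hypothetical helper that returns index pairs rather than the sparse matrix produced by the scikit-learn functions.

```python
def grid_edges(n_rows, n_cols):
    """Return the set of (i, j) pairs, i < j, of 4-connected pixels."""
    edges = set()
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c               # row-major index of pixel (r, c)
            if c + 1 < n_cols:
                edges.add((i, i + 1))        # neighbor to the right
            if r + 1 < n_rows:
                edges.add((i, i + n_cols))   # neighbor below
    return edges


edges = grid_edges(2, 3)
```

Pixels 2 and 3 have consecutive indices but lie at the end of one row and the start of the next, so they are not connected; this is the kind of structure a connectivity matrix makes explicit.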