scikit-learn:4.2.3. Text feature extraction

http://scikit-learn.org/stable/modules/feature_extraction.html

Section 4.2 covers a lot of material, so text feature extraction gets its own write-up.


1、The bag of words representation

To represent raw text as numerical feature vectors of fixed length, scikit-learn provides three steps:

tokenizing: assign each token (character or word; the granularity is up to you) an integer index id

counting: count the occurrences of each token in each document

normalizing: normalize/weight the importance of each token according to how often it occurs across the samples/documents.
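
A minimal sketch of how these three steps map onto scikit-learn objects; the two toy documents below are made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ['the cat sat on the mat', 'the dog sat']   # assumed toy documents

count_vec = CountVectorizer()
tokens = count_vec.build_analyzer()(docs[0])       # tokenizing: list of word tokens
counts = count_vec.fit_transform(docs)             # counting: sparse document-term matrix
tfidf = TfidfTransformer().fit_transform(counts)   # normalizing: tf-idf weighted matrix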


Revisiting what a feature is and what a sample is:

  • each individual token occurrence frequency (normalized or not) is treated as a feature.
  • the vector of all the token frequencies for a given document is considered a multivariate sample.


Bag of Words or "Bag of n-grams" representation:

the general process (tokenization, counting and normalization) of turning a collection of text documents into numerical feature vectors, while completely ignoring the relative position information of the words in the document.


2、Sparsity

The words in any single document are only a tiny fraction of all the words in the corpus, which makes the feature vectors sparse (most values are 0). To keep storage and computation manageable, Python's scipy.sparse package is used.
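
A quick sketch of the sparse output on a tiny made-up corpus:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.', 'And this is the second one.']  # assumed
X = CountVectorizer().fit_transform(corpus)

print(type(X))      # a scipy.sparse matrix in CSR format
print(X.nnz)        # number of non-zero entries actually stored
print(X.toarray())  # dense view; only practical for small corpora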



3、Common vectorizer usage

CountVectorizer implements both tokenizing and counting.

It has many parameters, but the defaults are already reasonable and fit most situations; for details see: http://blog.csdn.net/mmc2015/article/details/46866537

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=1)
>>> vectorizer                     
CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict',
        dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
The example here illustrates how it is used:

http://blog.csdn.net/mmc2015/article/details/46857887

including fit_transform, transform, get_feature_names(), ngram_range=(min, max), vocabulary_.get(), and so on.
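
A sketch of that usage, on the four-document toy corpus from the scikit-learn documentation:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)           # learn the vocabulary and count tokens
print(vectorizer.get_feature_names())          # sorted list of the learned tokens
print(vectorizer.vocabulary_.get('document'))  # integer column index of 'document'

# transform() reuses the fitted vocabulary; words unseen during fit are ignored
print(vectorizer.transform(['Something completely new.']).toarray())

# count bigrams as well as unigrams
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=1)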


4、Tf-idf term weighting

This addresses the problem that some words (e.g. "the", "a", "is" in English) occur very frequently yet are not the words we actually care about.

The text.TfidfTransformer class implements this normalization:

>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer()
>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]
...
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf                         
<6x3 sparse matrix of type '<... 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse ... format>

>>> tfidf.toarray()                        
array([[ 0.85...,  0.  ...,  0.52...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.55...,  0.83...,  0.  ...],
       [ 0.63...,  0.  ...,  0.77...]])
>>> transformer.idf_  # idf_ stores the idf weights learned during fit
array([ 1. ...,  2.25...,  1.84...])

There is another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model:
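
A short sketch, on a made-up toy corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['the cat sat', 'the dog sat', 'the dog barked']  # assumed toy corpus

# tokenizing + counting + tf-idf weighting in a single object
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)   # sparse matrix of tf-idf values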

For binary occurrence features, it is better to set CountVectorizer's binary parameter, and Bernoulli Naive Bayes is then the more suitable estimator.


5、Decoding text files

Text is made of characters, but files are made of bytes, so to make scikit-learn work you first have to tell it the file's encoding, and CountVectorizer will then decode it automatically. The default encoding is UTF-8, and the decoded characters are Unicode. If the file you load is not UTF-8 encoded and you have not set the encoding parameter, a UnicodeDecodeError is raised.
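
A small sketch of these parameters; the Latin-1 byte strings below are made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer

raw_docs = [b'caf\xe9 au lait', b'cr\xe8me br\xfbl\xe9e']  # assumed Latin-1 bytes

# Telling the vectorizer the encoding lets it decode the bytes itself;
# decode_error controls what happens when decoding fails ('strict', 'ignore', 'replace').
vectorizer = CountVectorizer(encoding='latin-1', decode_error='strict')
X = vectorizer.fit_transform(raw_docs)
print(vectorizer.get_feature_names())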

If decoding fails, try the following:

  • Find out what the actual encoding of the text is. The file might come with a header or README that tells you the encoding, or there might be some standard encoding you can assume based on where the text comes from.
  • You may be able to find out what kind of encoding it is in general using the UNIX command file. The Python chardet module comes with a script called chardetect.py that will guess the specific encoding, though you cannot rely on its guess being correct.
  • You could try UTF-8 and disregard the errors. You can decode byte strings with bytes.decode(errors='replace') to replace all decoding errors with a meaningless character, or set decode_error='replace' in the vectorizer. This may damage the usefulness of your features.
  • Real text may come from a variety of sources that may have used different encodings, or even be sloppily decoded in a different encoding than the one it was encoded with. This is common in text retrieved from the Web. The Python package ftfy can automatically sort out some classes of decoding errors, so you could try decoding the unknown text as latin-1 and then using ftfy to fix errors.
  • If the text is in a mish-mash of encodings that is simply too hard to sort out (which is the case for the 20 Newsgroups dataset), you can fall back on a simple single-byte encoding such as latin-1. Some text may display incorrectly, but at least the same sequence of bytes will always represent the same feature.

For example, the following snippet uses chardet (not shipped with scikit-learn, must be installed separately) to figure out the encoding of three texts. It then vectorizes the texts and prints the learned vocabulary. The output is not shown here.

>>> import chardet    
>>> text1 = b"Sei mir gegr\xc3\xbc\xc3\x9ft mein Sauerkraut"
>>> text2 = b"holdselig sind deine Ger\xfcche"
>>> text3 = b"\xff\xfeA\x00u\x00f\x00 \x00F\x00l\x00\xfc\x00g\x00e\x00l\x00n\x00 \x00d\x00e\x00s\x00 \x00G\x00e\x00s\x00a\x00n\x00g\x00e\x00s\x00,\x00 \x00H\x00e\x00r\x00z\x00l\x00i\x00e\x00b\x00c\x00h\x00e\x00n\x00,\x00 \x00t\x00r\x00a\x00g\x00 \x00i\x00c\x00h\x00 \x00d\x00i\x00c\x00h\x00 \x00f\x00o\x00r\x00t\x00"
>>> decoded = [x.decode(chardet.detect(x)['encoding'])
...            for x in (text1, text2, text3)]        
>>> v = CountVectorizer().fit(decoded).vocabulary_    
>>> for term in v: print(term)

(Depending on the version of chardet, it might get the first one wrong.)



6、Applications and examples

I recommend taking a look at the third example in particular.

In particular in a supervised setting it can be successfully combined with fast and scalable linear models to train document classifiers, for instance:

  • Classification of text documents using sparse features

In an unsupervised setting it can be used to group similar documents together by applying clustering algorithms such as K-means:

  • Clustering text documents using k-means

Finally it is possible to discover the main topics of a corpus by relaxing the hard assignment constraint of clustering, for instance by using Non-negative matrix factorization (NMF or NNMF):

  • Topics extraction with Non-Negative Matrix Factorization

7、Limitations of the bag of words representation

Misspellings, word derivations, and word-order dependence: spelling variants (word / wprd / wrod), derived word forms (word / words, arrive / arriving), and the order of and dependencies between words are all ignored.


Use n-grams rather than only unigrams. In addition, you can apply the stemming approach mentioned here: http://blog.csdn.net/mmc2015/article/details/46730289.

Here is an example, using the char_wb analyzer:

>>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2), min_df=1)
>>> counts = ngram_vectorizer.fit_transform(['words', 'wprds'])
>>> ngram_vectorizer.get_feature_names() == (
...     [' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp'])
True
>>> counts.toarray().astype(int)
array([[1, 1, 1, 0, 1, 1, 1, 0],
       [1, 1, 0, 1, 1, 1, 0, 1]])



8、Vectorizing a large text corpus with the hashing trick

The vectorization approach described above is simple, but it relies on an in-memory mapping from the string tokens to the integer feature indices (the vocabulary_ attribute), which causes problems with large datasets: high memory use, slow access, and so on.

These problems can be solved by combining the hashing trick of the sklearn.feature_extraction.FeatureHasher class with CountVectorizer.


The product of combining the two is HashingVectorizer.

HashingVectorizer is stateless, meaning that you don't have to call fit on it; just call transform directly (corpus here is a list of text documents):

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> hv = HashingVectorizer(n_features=10)
>>> hv.transform(corpus)
...                                
<4x10 sparse matrix of type '<... 'numpy.float64'>'
    with 16 stored elements in Compressed Sparse ... format>
The default n_features is 2**20 (roughly one million features); if memory is a problem you can make it somewhat smaller, e.g. 2**18, without causing too many collisions.


HashingVectorizer has two drawbacks that you must keep in mind:

1) It does not provide IDF weighting, because it is stateless. If you need IDF weighting, you can append a TfidfTransformer to it in a pipeline (see the sketch below).

2) It does not provide an inverse_transform method, because of the one-way nature of hashing. That is, you can no longer recover the original string features, only their integer indices.
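
A sketch of point 1), appending a TfidfTransformer after the hashing step (the toy corpus is made up):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# The hashing step produces term counts without storing a vocabulary; the
# transformer then adds the idf weighting that HashingVectorizer itself cannot hold.
# In practice you may also want to disable the hashing step's own normalization and
# sign alternation (the relevant HashingVectorizer parameters differ between versions).
hashing_tfidf = make_pipeline(HashingVectorizer(n_features=2**18),
                              TfidfTransformer())

corpus = ['the cat sat', 'the dog sat', 'the dog barked']  # assumed toy corpus
X = hashing_tfidf.fit_transform(corpus)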


9、Performing out-of-core scaling with HashingVectorizer

HashingVectorizer also has an advantage: it supports out-of-core learning, which is very useful for datasets that do not fit in memory.

The strategy is to fit on mini-batches: each mini-batch is vectorized using HashingVectorizer so as to guarantee that the input space of the estimator always has the same dimensionality. The amount of memory used at any time is thus bounded by the size of a mini-batch.

There is an example here worth looking at: http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html#example-applications-plot-out-of-core-classification-py
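
A minimal out-of-core sketch with SGDClassifier.partial_fit; the mini-batches below are made up and would in practice be streamed from disk or the network:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Toy (texts, labels) mini-batches, assumed for illustration
minibatches = [
    (['good movie', 'great plot'], [1, 1]),
    (['boring film', 'bad acting'], [0, 0]),
]

vectorizer = HashingVectorizer(n_features=2**18)  # stateless: no fit, no vocabulary
clf = SGDClassifier()

for texts, labels in minibatches:
    X = vectorizer.transform(texts)              # same dimensionality for every batch
    clf.partial_fit(X, labels, classes=[0, 1])   # all classes must be declared up front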



10、Customizing the vectorizer classes

Customizing a vectorizer mainly comes down to how tokens are extracted:

>>> def my_tokenizer(s):
...     return s.split()
...
>>> vectorizer = CountVectorizer(tokenizer=my_tokenizer)
>>> vectorizer.build_analyzer()(u"Some... punctuation!") == (
...     ['some...', 'punctuation!'])
True
The following is quoted from the scikit-learn documentation:

In particular we name:

  • preprocessor: a callable that takes an entire document as input (as a single string), and returns a possibly transformed version of the document, still as an entire string. This can be used to remove HTML tags, lowercase the entire document, etc.
  • tokenizer: a callable that takes the output from the preprocessor and splits it into tokens, then returns a list of these.
  • analyzer: a callable that replaces the preprocessor and tokenizer. The default analyzers all call the preprocessor and tokenizer, but custom analyzers will skip this. N-gram extraction and stop word filtering take place at the analyzer level, so a custom analyzer may have to reproduce these steps.
To make these three hooks aware of the model parameters, it is better to derive from the vectorizer class and override the build_preprocessor, build_tokenizer and build_analyzer factory methods, rather than simply passing in custom functions (a sketch of this follows the NLTK example below). A few tips:

  • If documents are pre-tokenized by an external package, then store them in files (or strings) with the tokens separated by whitespace and pass analyzer=str.split

  • Fancy token-level analysis such as stemming, lemmatizing, compound splitting, filtering based on part-of-speech, etc. are not included in the scikit-learn codebase, but can be added by customizing either the tokenizer or the analyzer. Here’s a CountVectorizer with a tokenizer and lemmatizer using NLTK:

    >>> from nltk import word_tokenize          
    >>> from nltk.stem import WordNetLemmatizer 
    >>> class LemmaTokenizer(object):
    ...     def __init__(self):
    ...         self.wnl = WordNetLemmatizer()
    ...     def __call__(self, doc):
    ...         return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]
    ...
    >>> vect = CountVectorizer(tokenizer=LemmaTokenizer())  
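
And here is a sketch of the earlier point about overriding the factory methods instead of passing functions in, shown for build_tokenizer only:

from sklearn.feature_extraction.text import CountVectorizer

class WhitespaceVectorizer(CountVectorizer):
    def build_tokenizer(self):
        # Return a callable that splits a preprocessed document into tokens;
        # lowercasing and n-gram extraction still behave as configured.
        return lambda doc: doc.split()

print(WhitespaceVectorizer().build_analyzer()(u'Some... punctuation!'))
# ['some...', 'punctuation!']
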
Since Chinese is not separated by spaces, a custom vectorizer is essential for Chinese text (see the sketch below).
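
A sketch for Chinese, using the third-party jieba segmentation package as the tokenizer (jieba is not part of scikit-learn and is assumed to be installed):

import jieba  # third-party Chinese word segmentation library
from sklearn.feature_extraction.text import CountVectorizer

def chinese_tokenizer(doc):
    # jieba.cut returns a generator over the segmented words
    return [tok for tok in jieba.cut(doc) if tok.strip()]

vectorizer = CountVectorizer(tokenizer=chinese_tokenizer)
X = vectorizer.fit_transform([u'我爱文本特征提取', u'中文需要分词'])
print(vectorizer.get_feature_names())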


That wraps up text feature extraction.

