Text Feature Extraction

1. Dict (hash) structures

from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'city': 'dubai', 'temperature': 33},
    {'city': 'London', 'temperature': 12},
    {'city': 'San Fransisco', 'temperature': 18},
]
vec = DictVectorizer()
vec.fit_transform(measurements).toarray()

# output
array([[ 0.,  0.,  1., 33.],
       [ 1.,  0.,  0., 12.],
       [ 0.,  1.,  0., 18.]])

vec.get_feature_names()
>>> ['city=London', 'city=San Fransisco', 'city=dubai', 'temperature']
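
Once fitted, the vectorizer can be reused on new records: known categories map to the learned columns and unseen ones are simply ignored. A minimal sketch (the extra measurement below is illustrative, not part of the original example):

# reuse the fitted DictVectorizer on a new record
new_measurement = [{'city': 'London', 'temperature': 20}]
vec.transform(new_measurement).toarray()
>>> array([[ 1.,  0.,  0., 20.]])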

2. Bag-of-words model

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)  # keep terms that appear in at least one document
vectorizer

>>> CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X=vectorizer.fit_transform(corpus)
X

>>> <4x9 sparse matrix of type '<class 'numpy.int64'>'
    with 19 stored elements in Compressed Sparse Row format>


vectorizer.get_feature_names()
>>> ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

X.toarray()
>>> array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
           [0, 1, 0, 1, 0, 2, 1, 0, 1],
           [1, 0, 0, 0, 1, 0, 1, 1, 0],
           [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
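
Each column corresponds to one entry of the vocabulary above; the column index of a given word can be looked up via the fitted vocabulary_ attribute:

vectorizer.vocabulary_.get('document')
>>> 1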

This representation completely ignores the relative positions of the words in the document.

analyze("This is a text document to analyze.")
>>> ['this', 'is', 'text', 'document', 'to', 'analyze']

To preserve some of this ordering information, we can extract 2-grams of words in addition to 1-grams:

bigram_vectorizer=CountVectorizer(ngram_range=(1,2), token_pattern=r'\b\w+\b', min_df=1)
analyze=bigram_vectorizer.build_analyzer()
analyze('Bi-grams are cool!')

>>> ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool']

The vocabulary extracted by this vectorizer is therefore much larger and can resolve ambiguities encoded in local positioning patterns:
X_2 = bigram_vectorizer.fit_transform(corpus).toarray()
X_2
>>> array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
           [0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
           [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
           [0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]],
          dtype=int64)

bigram_vectorizer.get_feature_names()

>>> ['and',
 'and the',
 'document',
 'first',
 'first document',
 'is',
 'is the',
 'is this',
 'one',
 'second',
 'second document',
 'second second',
 'the',
 'the first',
 'the second',
 'the third',
 'third',
 'third one',
 'this',
 'this is',
 'this the']
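
For example, the interrogative form "Is this" appears only in the last document, so the bigram 'is this' alone is enough to separate it from the "This is ..." documents (using the fitted vocabulary_ to find its column):

feature_index = bigram_vectorizer.vocabulary_.get('is this')
X_2[:, feature_index]
>>> array([0, 0, 0, 1], dtype=int64)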

3. TF-IDF term weighting

tf stands for term frequency and idf for inverse document frequency; tf-idf is simply tf * idf. It was originally developed for information retrieval (ranking in search engines), but it has also proven useful for document classification and clustering. With the scikit-learn defaults (smooth_idf=True, norm='l2'), idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t, and each resulting row vector is L2-normalized.

from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()
transformer
>>> TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False,
                 use_idf=True)

counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]
tfidf = transformer.fit_transform(counts)
tfidf
>>> <6x3 sparse matrix of type '<class 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse Row format>

tfidf.toarray()
>>> array([[ 0.85...,  0.  ...,  0.52...],
           [ 1.  ...,  0.  ...,  0.  ...],
           [ 1.  ...,  0.  ...,  0.  ...],
           [ 1.  ...,  0.  ...,  0.  ...],
           [ 0.55...,  0.83...,  0.  ...],
           [ 0.63...,  0.  ...,  0.77...]])
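
In practice the counting and weighting steps are usually combined into a single TfidfVectorizer, which is equivalent to a CountVectorizer followed by a TfidfTransformer. A minimal sketch applied to the corpus defined above (only the resulting shape is shown here):

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df=1)
X_tfidf = tfidf_vectorizer.fit_transform(corpus)  # count + tf-idf weighting in one step
X_tfidf.shape
>>> (4, 9)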

4. Limitations of the BOW model

A collection of unigrams (BOW) cannot capture phrases or multi-word expressions, and it throws away word-order dependencies. In addition, the BOW model does not account for possible misspellings or word derivations.

ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2,2), min_df=1)
counts=ngram_vectorizer.fit_transform(['word', 'wprds'])
counts

>>> <2x9 sparse matrix of type '<class 'numpy.int64'>'
    with 11 stored elements in Compressed Sparse Row format>

ngram_vectorizer.get_feature_names()
>>> [' w', 'd ', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp']

With the 'char_wb' analyzer, character n-grams are created only from text inside word boundaries (padded with a space on each side), whereas the 'char' analyzer creates character n-grams that can span across words.
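
The difference is easiest to see with longer n-grams on a two-word string; the following sketch (the input string is illustrative) contrasts the two analyzers:

# 'char_wb': 5-grams never cross the word boundary
ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(5, 5), min_df=1)
ngram_vectorizer.fit_transform(['jumpy fox'])
ngram_vectorizer.get_feature_names()
>>> [' fox ', ' jump', 'jumpy', 'umpy ']

# 'char': 5-grams may span the space between words
ngram_vectorizer = CountVectorizer(analyzer='char', ngram_range=(5, 5), min_df=1)
ngram_vectorizer.fit_transform(['jumpy fox'])
ngram_vectorizer.get_feature_names()
>>> ['jumpy', 'mpy f', 'py fo', 'umpy ', 'y fox']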
