sklearn Study Notes 2: The feature_extraction Module

1. Converting dict-format data into features

Premise: the data is stored as a list of dicts. The DictVectorizer class converts it into a feature matrix; string-valued features are automatically expanded into multiple binary feature columns, much like the one-hot encoding mentioned earlier.

In [226]: measurements = [
     ...:      {'city': 'Dubai', 'temperature': 33.},
     ...:      {'city': 'London', 'temperature': 12.},
     ...:      {'city': 'San Fransisco', 'temperature': 18.},
     ...: ]

In [227]: from sklearn.feature_extraction import DictVectorizer

In [228]: vec=DictVectorizer()

In [229]: vec.fit_transform(measurements).toarray()
Out[229]:
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])

In [230]: vec.get_feature_names()
Out[230]: ['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']


You can also select features directly on the vectorizer through its restrict method.

In [247]: from sklearn.feature_selection import SelectKBest, chi2

In [248]: z = vec.fit_transform(measurements)

In [249]: support = SelectKBest(chi2, k=2).fit(z, [0, 1, 2])

In [250]: z.toarray()
Out[250]:
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])

In [251]: vec.get_feature_names()
Out[251]: ['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']

In [252]: vec.restrict(support.get_support())
Out[252]: DictVectorizer(dtype=<type 'numpy.float64'>, separator='=', sort=True, sparse=True)
In [253]: vec.get_feature_names()
Out[253]: ['city=San Fransisco', 'temperature']


You can also call inverse_transform to map the feature matrix back to the original dicts.
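A quick sketch of that round trip, re-fitting the same vectorizer for completeness. Note that inverse_transform returns the feature names as they appear in the matrix, so one-hot columns come back as 'city=Dubai': 1.0 rather than 'city': 'Dubai':

```python
from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Fransisco', 'temperature': 18.},
]
vec = DictVectorizer()
X = vec.fit_transform(measurements)

# inverse_transform maps each row back to a {feature_name: value} dict
print(vec.inverse_transform(X))
# first entry: {'city=Dubai': 1.0, 'temperature': 33.0}
```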

 

2. Feature hashing

When a categorical feature takes many distinct values, and several such features all need one-hot encoding, the feature matrix becomes very large and mostly zeros. Instead, a hash function can map each (feature name, value) pair into a matrix of fixed, user-chosen width. Because hashing is a one-way function, FeatureHasher has no inverse_transform method.

from sklearn.feature_extraction import FeatureHasher
h = FeatureHasher(n_features=10, input_type='dict')
D = [{'dog': 1, 'cat': 2, 'elephant': 4}, {'dog': 2, 'run': 5}]
f = h.transform(D)
f.toarray()
array([[ 0.,  0., -4., -1.,  0.,  0.,  0.,  0.,  0.,  2.],
       [ 0.,  0.,  0., -2., -5.,  0.,  0.,  0.,  0.,  0.]])
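FeatureHasher also accepts raw token lists. A small sketch, assuming input_type='string', where each token occurrence contributes a count of 1 (with a hashed sign) to its bucket:

```python
from sklearn.feature_extraction import FeatureHasher

# With input_type='string', each sample is an iterable of tokens;
# every occurrence adds +/-1 to the token's hashed column
h = FeatureHasher(n_features=8, input_type='string')
f = h.transform([['dog', 'cat', 'cat'], ['run']])
print(f.toarray())  # 2 rows, 8 columns, mostly zeros
```

The signed hashing is what lets colliding tokens partially cancel instead of always inflating a bucket.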


3. Text processing

(1) Counts

Each word in the corpus vocabulary becomes a feature, and the number of times it appears in a document is the feature value.

In [1]: from sklearn.feature_extraction.text import CountVectorizer

In [2]: vec=CountVectorizer()

In [3]: vec
Out[3]:
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [4]: corpus = [
   ...:     'This is the first document.',
   ...:     'This is the second second document.',
   ...:     'And the third one.',
   ...:     'Is this the first document?',
   ...: ]

In [5]: X=vec.fit_transform(corpus)

In [6]: X.toarray()
Out[6]:
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
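To see which column counts which word, the fitted vectorizer's vocabulary_ attribute maps each word to its column index (columns are sorted alphabetically). A short sketch re-fitting the same corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vec = CountVectorizer()
X = vec.fit_transform(corpus)

# vocabulary_ maps word -> column index, e.g. 'second' -> 5,
# which is why X[1, 5] == 2 in the array above
print(sorted(vec.vocabulary_.items(), key=lambda kv: kv[1]))
```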


 

You can also use n-grams as the bag-of-words units, specified through the ngram_range parameter.

In [21]: bigram_vec=CountVectorizer(ngram_range=(1,3),token_pattern=r'\b\w+\b',
    ...: min_df=1)

In [22]: bigram_vec.fit_transform(corpus).toarray()
Out[22]:
array([[0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
        0, 0, 0, 0, 1, 1, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 2, 1, 1, 1, 1, 0, 0, 1, 1,
        0, 0, 0, 0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        1, 1, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 1, 1]], dtype=int64)

In [23]: analyze=bigram_vec.build_analyzer()

In [24]: analyze('hello a b c')
Out[24]:
[u'hello',
 u'a',
 u'b',
 u'c',
 u'hello a',
 u'a b',
 u'b c',
 u'hello a b',
 u'a b c']


For detecting character encodings, see the chardet module.

(2) HashingVectorizer

HashingVectorizer combines the two ideas above: roughly, HashingVectorizer = CountVectorizer (tokenization) + FeatureHasher (the hashing trick).
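A minimal sketch of HashingVectorizer: it tokenizes text like CountVectorizer but hashes each token straight into a fixed number of columns, so it is stateless (no vocabulary_ is stored and transform needs no prior fit):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Stateless: no fit required, output width fixed by n_features
hv = HashingVectorizer(n_features=16)
X = hv.transform(['This is the first document.',
                  'Is this the first document?'])
print(X.shape)  # (2, 16)
```

This makes it suitable for streaming or very large vocabularies, at the cost of losing the ability to map columns back to words.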

 

References:

http://scikit-learn.org/stable/modules/feature_extraction.html#feature-extraction
