scikit-learn: 4.2. Feature extraction (not feature selection)

http://scikit-learn.org/stable/modules/feature_extraction.html

Writing this while sick, in an internet cafe... your support is appreciated.


1、First, let's clarify two concepts: feature extraction and feature selection (feature extraction is very different from feature selection).

The former consists of transforming arbitrary data, such as text or images, into numerical features usable for machine learning. The latter is a machine learning technique applied to these features (selecting the better features from those that have already been extracted).
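
A minimal sketch of the distinction, assuming scikit-learn is installed. The records are made up for illustration, and VarianceThreshold is only one possible selection step (my choice here, not something prescribed by the user guide):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import VarianceThreshold

# Illustrative records; "planet" has the same value everywhere.
records = [
    {"city": "Dubai", "temperature": 33.0, "planet": "Earth"},
    {"city": "London", "temperature": 12.0, "planet": "Earth"},
]

# Feature extraction: arbitrary data (dicts) -> numerical matrix.
extractor = DictVectorizer(sparse=False)
X = extractor.fit_transform(records)

# Feature selection: keep a subset of the already extracted columns.
# VarianceThreshold with its default threshold drops constant columns.
selector = VarianceThreshold()
X_selected = selector.fit_transform(X)

print(X.shape)           # 4 columns: two cities, planet=Earth, temperature
print(X_selected.shape)  # the constant planet=Earth column is gone
```

Extraction creates the numeric columns; selection only decides which of those columns to keep.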


The rest is organized into four main parts; the most important one is 4、text feature extraction

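2、loading features from dicts

The class DictVectorizer converts features stored as lists of Python dicts into the NumPy/SciPy representation used by scikit-learn estimators; categorical (string) values are one-hot encoded. One example is enough: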

>>> measurements = [
...     {'city': 'Dubai', 'temperature': 33.},
...     {'city': 'London', 'temperature': 12.},
...     {'city': 'San Francisco', 'temperature': 18.},
... ]
>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()
>>> vec.fit_transform(measurements).toarray()
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])
>>> vec.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
array(['city=Dubai', 'city=London', 'city=San Francisco', 'temperature'], ...)


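The class DictVectorizer is also very useful for extracting feature windows around a particular word. For example, suppose we have used an existing algorithm to extract the PoS (Part of Speech) features around the word 'sat' in the sentence 'The cat sat on the mat.', as follows: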

>>> pos_window = [
...     {
...         'word-2': 'the',
...         'pos-2': 'DT',
...         'word-1': 'cat',
...         'pos-1': 'NN',
...         'word+1': 'on',
...         'pos+1': 'PP',
...     },
...     # in a real application one would extract many such dictionaries
... ]

The PoS features above can then be vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier (maybe after being piped into a text.TfidfTransformer for normalization):

>>> vec = DictVectorizer()
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> pos_vectorized                
<1x6 sparse matrix of type '<... 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse ... format>
>>> pos_vectorized.toarray()
array([[ 1.,  1.,  1.,  1.,  1.,  1.]])
>>> vec.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
array(['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat',
       'word-2=the'], ...)
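
In a real application many such dictionaries are needed, one per word. The sketch below shows one way to build them; the hand-tagged sentence and the helper window_features are hypothetical (not part of scikit-learn), and unlike the example above it uses a symmetric window of two words on each side:

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical, hand-tagged sentence: (word, PoS tag) pairs.
tagged = [("the", "DT"), ("cat", "NN"), ("sat", "VBD"),
          ("on", "PP"), ("the", "DT"), ("mat", "NN")]

def window_features(tagged, i, size=2):
    """Collect word/PoS features around position i (hypothetical helper)."""
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        # Skip the center word itself and positions outside the sentence.
        if offset == 0 or not 0 <= j < len(tagged):
            continue
        word, pos = tagged[j]
        feats[f"word{offset:+d}"] = word   # e.g. 'word-1', 'word+1'
        feats[f"pos{offset:+d}"] = pos     # e.g. 'pos-1', 'pos+1'
    return feats

# One feature dict per word in the sentence.
windows = [window_features(tagged, i) for i in range(len(tagged))]

vec = DictVectorizer()
X = vec.fit_transform(windows)   # sparse matrix, one row per word
print(windows[2])                # the window around 'sat'
print(X.shape)
```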


3、feature hashing

The class FeatureHasher is a high-speed, low-memory vectorizer that uses a technique known as feature hashing, or the “hashing trick”. 

Because of the hashing, only the integer index of each feature is kept, not its original string name, so there is no inverse_transform method.


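FeatureHasher accepts dicts, (feature, value) pairs, or strings, depending on the constructor parameter input_type. The result is a scipy.sparse matrix. For strings, each occurrence has an implied value of 1, so ['feat1', 'feat2', 'feat2'] is interpreted as [('feat1', 1), ('feat2', 2)].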
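
A small sketch of both input types, assuming scikit-learn is installed. The value n_features=16 is chosen only to keep the output small, and alternate_sign=False is set so that hashed values are never negated, which makes the totals easy to check:

```python
from sklearn.feature_extraction import FeatureHasher

# String input: every occurrence of a string counts as 1.
string_hasher = FeatureHasher(n_features=16, input_type="string",
                              alternate_sign=False)
X_str = string_hasher.transform([["feat1", "feat2", "feat2"]])

# Dict input: keys are feature names, values are the feature values.
dict_hasher = FeatureHasher(n_features=16, input_type="dict",
                            alternate_sign=False)
X_dict = dict_hasher.transform([{"feat1": 1, "feat2": 2}])

print(X_str.shape, X_str.sum())    # one row, 16 hashed columns
print(X_dict.shape, X_dict.sum())

# Feature names are not stored, so the mapping cannot be inverted.
print(hasattr(string_hasher, "inverse_transform"))
```

Both calls describe the same sample, so the row totals match even if the two names happen to collide in the same column.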



4、text feature extraction

There is too much material here, so I wrote it up separately; see this blog post: http://blog.csdn.net/mmc2015/article/details/46997379



5、image feature extraction

Patch extraction:

The extract_patches_2d function extracts patches from an image stored as a two-dimensional array, or as a three-dimensional array with color information along the third axis. To rebuild the original image from all of its patches, use reconstruct_from_patches_2d:

>>> import numpy as np
>>> from sklearn.feature_extraction import image

>>> one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))
>>> one_image[:, :, 0]  # R channel of a fake RGB picture
array([[ 0,  3,  6,  9],
       [12, 15, 18, 21],
       [24, 27, 30, 33],
       [36, 39, 42, 45]])

>>> patches = image.extract_patches_2d(one_image, (2, 2), max_patches=2,
...     random_state=0)
>>> patches.shape
(2, 2, 2, 3)
>>> patches[:, :, :, 0]
array([[[ 0,  3],
        [12, 15]],

       [[15, 18],
        [27, 30]]])
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> patches.shape
(9, 2, 2, 3)
>>> patches[4, :, :, 0]
array([[15, 18],
       [27, 30]])
Reconstruction from the patches works as follows:

>>> reconstructed = image.reconstruct_from_patches_2d(patches, (4, 4, 3))
>>> np.testing.assert_array_equal(one_image, reconstructed)

The PatchExtractor class works in the same way as extract_patches_2d, except that it accepts multiple images as input:

>>> five_images = np.arange(5 * 4 * 4 * 3).reshape(5, 4, 4, 3)
>>> patches = image.PatchExtractor(patch_size=(2, 2)).transform(five_images)
>>> patches.shape
(45, 2, 2, 3)


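Connectivity graph of an image:


The idea is to build a connectivity matrix that records which pixels of an image are neighbors; the edges can be weighted by the difference (gradient) between the two pixels they join.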

The function img_to_graph returns such a matrix from a 2D or 3D image. Similarly, grid_to_graph builds a connectivity matrix for images given the shape of these images.
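
A minimal sketch of both functions on a tiny 3x3 image, assuming scikit-learn and NumPy are installed; the image values are made up for illustration:

```python
import numpy as np
from sklearn.feature_extraction.image import grid_to_graph, img_to_graph

# A fake 3x3 grayscale image.
img = np.arange(9, dtype=float).reshape(3, 3)

# Graph built from the pixel values: one node per pixel,
# edges between neighboring pixels weighted by their gradient.
graph = img_to_graph(img)

# Pure connectivity built from the grid shape only (no pixel values needed).
connectivity = grid_to_graph(3, 3)

print(graph.shape)          # one row and one column per pixel
print(connectivity.shape)
dense = connectivity.toarray()
print(dense[0, 1])          # pixel 0 and pixel 1 are neighbors
print(dense[0, 8])          # opposite corners are not connected
```

The connectivity matrix from grid_to_graph can be passed to estimators that accept a connectivity argument, such as the Ward clustering used in the example linked below.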

Here is an intuitive example: http://scikit-learn.org/stable/auto_examples/cluster/plot_lena_ward_segmentation.html#example-cluster-plot-lena-ward-segmentation-py



My head hurts... off to sleep...





