http://scikit-learn.org/stable/modules/feature_extraction.html
Writing this while sick at an Internet café...... please show some support...
1. First, let's clarify two concepts: feature extraction and feature selection (Feature extraction is very different from Feature selection). The former consists in transforming arbitrary data, such as text or images, into numerical features usable for machine learning. The latter is a machine learning technique applied on these features (i.e., selecting better features out of those already extracted).
The rest is split into four parts; the main one is still 4. text feature extraction.
2. loading features from dicts
The class to use here is DictVectorizer. An example says it all:
>>> measurements = [
...     {'city': 'Dubai', 'temperature': 33.},
...     {'city': 'London', 'temperature': 12.},
...     {'city': 'San Fransisco', 'temperature': 18.},
... ]

>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()

>>> vec.fit_transform(measurements).toarray()
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])

>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']
>>> pos_window = [
...     {
...         'word-2': 'the',
...         'pos-2': 'DT',
...         'word-1': 'cat',
...         'pos-1': 'NN',
...         'word+1': 'on',
...         'pos+1': 'PP',
...     },
...     # in a real application one would extract many such dictionaries
... ]
The PoS features above can then be vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier (maybe after being piped into a text.TfidfTransformer for normalization):
>>> vec = DictVectorizer()
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> pos_vectorized
<1x6 sparse matrix of type '<... 'numpy.float64'>'
with 6 stored elements in Compressed Sparse ... format>
>>> pos_vectorized.toarray()
array([[ 1., 1., 1., 1., 1., 1.]])
>>> vec.get_feature_names()
['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']
3. feature hashing
The class FeatureHasher is a high-speed, low-memory vectorizer that uses a technique known as feature hashing, or the “hashing trick”.
Because of the hashing, only the integer index of each feature is stored, never the feature's original string name, so there is no inverse_transform method.
FeatureHasher accepts either dicts, i.e. (feature, value) pairs, or strings, depending on the constructor parameter input_type; the result is a scipy.sparse matrix. If strings are passed, a value of 1 is implied, e.g. ['feat1', 'feat2', 'feat2'] is interpreted as [('feat1', 1), ('feat2', 2)]. A small sketch follows.
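A minimal sketch (mine, not from the scikit-learn docs) of both input types; the feature names and the small n_features value are made up for illustration:

>>> from sklearn.feature_extraction import FeatureHasher

>>> # default input_type='dict': an iterable of (feature, value) mappings
>>> hasher = FeatureHasher(n_features=8)
>>> X = hasher.transform([{'dog': 1, 'cat': 2}, {'dog': 2, 'run': 5}])
>>> X.shape
(2, 8)

>>> # input_type='string': each string counts as (feature, 1),
>>> # repeated strings are summed; no inverse_transform is available
>>> hasher = FeatureHasher(n_features=8, input_type='string')
>>> hasher.transform([['feat1', 'feat2', 'feat2']]).shape
(1, 8)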
4. text feature extraction
There is too much content for this part, so I wrote it up separately; see this post: http://blog.csdn.net/mmc2015/article/details/46997379
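Still, for completeness, here is a minimal sketch (my own toy corpus, not from the linked post) of the most basic text vectorizer, CountVectorizer, which tokenizes the documents and counts word occurrences:

>>> from sklearn.feature_extraction.text import CountVectorizer

>>> corpus = ['the cat sat on the mat', 'the dog sat']
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)   # sparse document-term matrix
>>> vectorizer.get_feature_names()
['cat', 'dog', 'mat', 'on', 'sat', 'the']
>>> X.toarray()
array([[1, 0, 1, 1, 1, 2],
       [0, 1, 0, 0, 1, 1]])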
5. image feature extraction
Patch extraction (extracting small parts of an image):
The extract_patches_2d function extracts patches from an image stored as a two-dimensional array, or three-dimensional with color information along the third axis. reconstruct_from_patches_2d can rebuild the original image from all of its patches:
>>> import numpy as np
>>> from sklearn.feature_extraction import image

>>> one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))
>>> one_image[:, :, 0]  # R channel of a fake RGB picture
array([[ 0,  3,  6,  9],
       [12, 15, 18, 21],
       [24, 27, 30, 33],
       [36, 39, 42, 45]])

>>> patches = image.extract_patches_2d(one_image, (2, 2), max_patches=2,
...     random_state=0)
>>> patches.shape
(2, 2, 2, 3)
>>> patches[:, :, :, 0]
array([[[ 0,  3],
        [12, 15]],

       [[15, 18],
        [27, 30]]])
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> patches.shape
(9, 2, 2, 3)
>>> patches[4, :, :, 0]
array([[15, 18],
       [27, 30]])

Reconstruction works as follows:
>>> reconstructed = image.reconstruct_from_patches_2d(patches, (4, 4, 3))
>>> np.testing.assert_array_equal(one_image, reconstructed)
The PatchExtractor class works exactly like extract_patches_2d, except that it can take multiple images as input at once:
>>> five_images = np.arange(5 * 4 * 4 * 3).reshape(5, 4, 4, 3)
>>> patches = image.PatchExtractor((2, 2)).transform(five_images)
>>> patches.shape
(45, 2, 2, 3)
Connectivity graph of an image (connecting the pixels of an image):
The idea is to use the differences between pixel values to decide whether each pair of neighboring pixels in the image is connected...
The function img_to_graph returns such a matrix from a 2D or 3D image. Similarly, grid_to_graph builds a connectivity matrix for images given only the shape of these images.
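A minimal sketch (the tiny 3x3 image below is made up for illustration): img_to_graph weights the edges with the gradient between neighboring pixels, while grid_to_graph only needs the image shape:

>>> import numpy as np
>>> from sklearn.feature_extraction.image import img_to_graph, grid_to_graph

>>> img = np.arange(3 * 3, dtype=float).reshape(3, 3)
>>> graph = img_to_graph(img)    # sparse matrix, edge weights = pixel gradient
>>> graph.shape                  # one node per pixel: 3 * 3 = 9
(9, 9)

>>> connectivity = grid_to_graph(n_x=3, n_y=3)   # structure only, from shape
>>> connectivity.shape
(9, 9)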
Here is an intuitive example: http://scikit-learn.org/stable/auto_examples/cluster/plot_lena_ward_segmentation.html#example-cluster-plot-lena-ward-segmentation-py
My head hurts.... off to sleep...