sklearn Feature Extraction

Feature extraction

Based on the official documentation: http://scikit-learn.org/stable/modules/feature_extraction.html

Note: feature extraction is different from feature selection. The former consists of transforming arbitrary data, such as text or images, into numerical features usable for machine learning; the latter is a machine learning technique applied on these features.

Loading features from dicts

The DictVectorizer class can be used to convert feature arrays represented as lists of standard Python dict objects into the NumPy/SciPy representation used by scikit-learn estimators.

While Python's dict is not especially fast to process, it has the advantages of being convenient to use, being sparse (absent features need not be stored), and storing features as key-value pairs.

measurements=[
    {'city':'Dubai','temperature':33},
    {'city':'London','temperature':12.},
    {'city':'San Francisco','temperature':18.},
]
from sklearn.feature_extraction import DictVectorizer
vec=DictVectorizer()
vec.fit_transform(measurements).toarray()  # string features are expanded one-hot style; numeric features are passed through unchanged
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])
vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']
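
By default fit_transform returns a scipy.sparse matrix rather than a dense array (hence the toarray() call above), and the mapping can be reversed with inverse_transform. A minimal sketch, reusing the measurements list above:

sparse_result = vec.fit_transform(measurements)  # scipy.sparse matrix unless DictVectorizer(sparse=False) is used
vec.inverse_transform(sparse_result)  # recovers one dict of non-zero features per row
# e.g. [{'city=Dubai': 1.0, 'temperature': 33.0}, ...]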

DictVectorizer is also very useful in natural language processing (NLP). Below are the features extracted around the word 'sat' in the sentence 'The cat sat on the mat.':

pos_window = [
    {
        'word-2': 'the',
        'pos-2': 'DT',
        'word-1': 'cat',
        'pos-1': 'NN',
        'word+1': 'on',
        'pos+1': 'PP',
    },
    # in a real application one would extract many such dictionaries
]

This description can be vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier:

vec=DictVectorizer()
pos_vectorized=vec.fit_transform(pos_window)
pos_vectorized
<1x6 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>
pos_vectorized.toarray()
array([[ 1.,  1.,  1.,  1.,  1.,  1.]])
vec.get_feature_names()
['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']

Feature Hashing

For example, consider an NLP task that needs features extracted from (token, part_of_speech) pairs. One could use a Python generator function to extract such features:

def token_features(token,part_of_speech):
    if token.isdigit():
        yield "numeric"
    else:
        yield "token={}".format(token.lower())
        yield "token,pos={},{}".format(token,part_of_speech)
        if token[0].isupper():
            yield "unppercase_initial"
        if token.isupper():
            yield "all_uppercase"
        yield "pos={}".format(part_of_speech)
# the raw_X to feed to FeatureHasher.transform can then be constructed as:
raw_X=(token_features(tok,pos_tagger(tok)) for tok in corpus)  # pos_tagger and corpus are assumed to be defined elsewhere
from sklearn.feature_extraction import FeatureHasher
hasher=FeatureHasher(input_type='string')
X=hasher.transform(raw_X)  # the output X is a scipy.sparse matrix
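
Since the snippet above depends on pos_tagger and corpus being defined elsewhere, here is a self-contained sketch of the hashing trick: FeatureHasher can also consume (feature, value) dicts directly, with n_features controlling the size of the hashed output space (the input dicts below are made-up examples):

from sklearn.feature_extraction import FeatureHasher
h = FeatureHasher(n_features=10)  # default input_type='dict'
D = [{'dog': 1, 'cat': 2, 'elephant': 4}, {'dog': 2, 'run': 5}]
f = h.transform(D)  # scipy.sparse matrix of shape (2, 10)
f.toarray()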

Text feature extraction

The Bag of Words representation

Raw text documents have varying lengths, so fixed-length numerical features must be extracted from the text content. This is done in three steps: tokenizing, counting, and normalizing.

Since each document uses only a small subset of the words in the corpus, the resulting matrix is highly sparse; the scipy.sparse package is best suited for handling such matrices.

from sklearn.feature_extraction.text import CountVectorizer  # this class combines tokenization and occurrence counting
vectorizer=CountVectorizer(min_df=1)
vectorizer
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X=vectorizer.fit_transform(corpus)
X
<4x9 sparse matrix of type '<class 'numpy.int64'>'
    with 19 stored elements in Compressed Sparse Row format>
X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
analyze=vectorizer.build_analyzer()  # the default configuration tokenizes the string by extracting words of at least 2 letters
analyze("This is a text document to analyze.") == (
    ['this', 'is', 'text', 'document', 'to', 'analyze'])
True
vectorizer.get_feature_names()
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
# the mapping from feature names to column indices is stored in the vectorizer's vocabulary_ attribute
vectorizer.vocabulary_.get('document')
1
# hence words that were not seen in the training corpus will be completely ignored in future calls to the transform method:
vectorizer.transform(['Something completely new.']).toarray()
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)
bigram_vectorizer=CountVectorizer(ngram_range=(1,2),token_pattern=r'\b\w+\b',min_df=1)
analyze=bigram_vectorizer.build_analyzer()  # different parameter settings give a different tokenization pattern
analyze('Bi-grams are cool!') == (
    ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool'])
True
X_2 = bigram_vectorizer.fit_transform(corpus).toarray()
X_2
array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
       [0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]], dtype=int64)
feature_index = bigram_vectorizer.vocabulary_.get('is this')
X_2[:,feature_index]
array([0, 0, 0, 1], dtype=int64)
bigram_vectorizer.get_feature_names()
['and',
 'and the',
 'document',
 'first',
 'first document',
 'is',
 'is the',
 'is this',
 'one',
 'second',
 'second document',
 'second second',
 'the',
 'the first',
 'the second',
 'the third',
 'third',
 'third one',
 'this',
 'this is',
 'this the']

TF-IDF

In a large corpus some words (e.g. "the", "is", "a") are very frequent but carry little information about document content; tf-idf re-weights the raw counts to diminish the impact of such tokens:

from sklearn.feature_extraction.text import TfidfTransformer
transformer=TfidfTransformer(smooth_idf=False)
transformer
TfidfTransformer(norm='l2', smooth_idf=False, sublinear_tf=False,
         use_idf=True)
counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]
tfidf = transformer.fit_transform(counts)
tfidf
<6x3 sparse matrix of type '<class 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse Row format>
tfidf.toarray()
array([[ 0.81940995,  0.        ,  0.57320793],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.47330339,  0.88089948,  0.        ],
       [ 0.58149261,  0.        ,  0.81355169]])

When smooth_idf=False, idf(t) = log(n_d / df(d, t)) + 1.
With the default smooth_idf=True, idf(t) = log((1 + n_d) / (1 + df(d, t))) + 1, which adds one to the numerator and denominator to prevent zero divisions.
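
As a sanity check on the first row of the smooth_idf=False output above, computed by hand with numpy: with n_d = 6 documents, term 1 occurs in all 6 documents (idf = log(6/6) + 1 = 1) and term 3 occurs in 2 of them (idf = log(6/2) + 1):

import numpy as np
row = np.array([3 * (np.log(6.0 / 6) + 1), 0.0, 1 * (np.log(6.0 / 2) + 1)])
row / np.linalg.norm(row)  # L2 normalization
# array([ 0.81940995,  0.        ,  0.57320793])  -- matches the first row above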

transformer=TfidfTransformer()  # note: smooth_idf=True is the default
transformer.fit_transform(counts).toarray()
array([[ 0.85151335,  0.        ,  0.52433293],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.55422893,  0.83236428,  0.        ],
       [ 0.63035731,  0.        ,  0.77630514]])
transformer.idf_  # the idf weight learned for each feature
array([ 1.        ,  2.25276297,  1.84729786])
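
These weights can be reproduced from the smoothed formula; a quick check with numpy, where df = [6, 1, 2] is the document frequency of each term in counts:

import numpy as np
df = np.array([6, 1, 2])
np.log((1.0 + 6) / (1 + df)) + 1  # the smooth_idf=True formula
# array([ 1.        ,  2.25276297,  1.84729786])  -- matches transformer.idf_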

Because tf-idf is applied so frequently to text features, there is another class, TfidfVectorizer, which combines all the options of CountVectorizer and TfidfTransformer in a single model:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer=TfidfVectorizer(min_df=1)
vectorizer.fit_transform(corpus)
<4x9 sparse matrix of type '<class 'numpy.float64'>'
    with 19 stored elements in Compressed Sparse Row format>
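
To see the equivalence, the two-step CountVectorizer + TfidfTransformer pipeline gives the same matrix; a sketch reusing corpus from above:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
two_step = TfidfTransformer().fit_transform(CountVectorizer(min_df=1).fit_transform(corpus))
one_step = TfidfVectorizer(min_df=1).fit_transform(corpus)
np.allclose(two_step.toarray(), one_step.toarray())  # True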

Decoding text files

Text is made of characters while files are made of bytes; when a file's encoding is unknown, the chardet package can be used to guess it before vectorizing:

import chardet
text1 = b"Sei mir gegr\xc3\xbc\xc3\x9ft mein Sauerkraut"
text2 = b"holdselig sind deine Ger\xfcche"
text3 = b"\xff\xfeA\x00u\x00f\x00 \x00F\x00l\x00\xfc\x00g\x00e\x00l\x00n\x00 \x00d\x00e\x00s" \
        b"\x00 \x00G\x00e\x00s\x00a\x00n\x00g\x00e\x00s\x00,\x00 \x00H\x00e\x00r\x00z\x00l\x00i" \
        b"\x00e\x00b\x00c\x00h\x00e\x00n\x00,\x00 \x00t\x00r\x00a\x00g\x00 \x00i\x00c\x00h\x00 \x00d" \
        b"\x00i\x00c\x00h\x00 \x00f\x00o\x00r\x00t\x00"
decoded=[x.decode(chardet.detect(x)['encoding']) for x in (text1,text2,text3)]
v=CountVectorizer().fit(decoded).vocabulary_
v
{'auf': 0,
 'deine': 1,
 'des': 2,
 'dich': 3,
 'flügeln': 4,
 'fort': 5,
 'gegrüßt': 6,
 'gerüche': 7,
 'gesanges': 8,
 'herzliebchen': 9,
 'holdselig': 10,
 'ich': 11,
 'mein': 12,
 'mir': 13,
 'sauerkraut': 14,
 'sei': 15,
 'sind': 16,
 'trag': 17}
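
For contrast, decoding one of these byte strings with the wrong codec fails, which is why the detection step matters (a quick check):

text2.decode('latin-1')  # 'holdselig sind deine Gerüche'
# text2.decode('utf-8') would raise UnicodeDecodeError: 0xfc is not valid UTF-8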

Image feature extraction

import numpy as np
from sklearn.feature_extraction import image
one_image=np.arange(4*4*3).reshape((4,4,3))
one_image[:,:,0]  # R channel of a fake RGB picture
array([[ 0,  3,  6,  9],
       [12, 15, 18, 21],
       [24, 27, 30, 33],
       [36, 39, 42, 45]])
patches=image.extract_patches_2d(one_image,(2,2),max_patches=2,random_state=0)
patches.shape
(2, 2, 2, 3)
patches[:,:,:,0]
array([[[ 0,  3],
        [12, 15]],

       [[15, 18],
        [27, 30]]])
patches=image.extract_patches_2d(one_image,(2,2))
patches.shape  # extracting all 2x2 patches from a 4x4 image yields 9 patches
(9, 2, 2, 3)
patches[4,:,:,0]
array([[15, 18],
       [27, 30]])
# now let's try to reconstruct the original image from the patches by averaging over the overlapping regions:
reconstructed=image.reconstruct_from_patches_2d(patches,(4,4,3))
np.testing.assert_array_equal(one_image,reconstructed)

The PatchExtractor class works in the same way as extract_patches_2d, except that it supports multiple images as input:

five_images = np.arange(5 * 4 * 4 * 3).reshape(5, 4, 4, 3)
patches=image.PatchExtractor((2,2)).transform(five_images)
patches.shape
(45, 2, 2, 3)
