张博208

scikit-learn feature extraction study notes

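The sklearn.feature_extraction module can be used to extract features, in a format supported by machine learning algorithms, from datasets that come in formats such as text and images.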

Note

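Feature extraction is very different from feature selection: the former consists of transforming arbitrary data, such as text or images, into numerical features usable for machine learning. The latter is a machine learning technique applied to these features.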

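4.2.1. Loading features from dicts

The class DictVectorizer can be used to convert feature arrays represented as lists of standard Python dict objects to the NumPy/SciPy representation used by scikit-learn estimators.

While not particularly fast to process, Python's dict has the advantages of being convenient to use, being sparse (absent features need not be stored) and storing feature names in addition to values.

DictVectorizer implements what is called one-of-K or "one-hot" coding for categorical (i.e. discrete) features. Categorical features are "attribute-value" pairs where the value is restricted to a list of discrete possibilities without ordering (e.g. topic identifiers, types of objects, tags, names).

In the following, "city" is a categorical attribute while "temperature" is a traditional numerical feature:

>>> measurements = [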
...     {'city': 'Dubai', 'temperature': 33.},
...     {'city': 'London', 'temperature': 12.},
...     {'city': 'San Fransisco', 'temperature': 18.},
... ]

>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()

>>> vec.fit_transform(measurements).toarray()
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])

>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']
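
To make the one-of-K idea concrete, here is a minimal pure-Python sketch. It is not how DictVectorizer is implemented; the helper name one_hot_dicts is hypothetical, and it assumes that string values are categorical and everything else is numeric.

```python
def one_hot_dicts(records):
    """Toy one-of-K encoder for a list of dicts (illustration only)."""
    # String values become "key=value" columns; numeric values keep their key.
    names = sorted({
        f"{k}={v}" if isinstance(v, str) else k
        for rec in records for k, v in rec.items()
    })
    index = {name: j for j, name in enumerate(names)}
    rows = []
    for rec in records:
        row = [0.0] * len(names)  # absent features stay at zero
        for k, v in rec.items():
            if isinstance(v, str):
                row[index[f"{k}={v}"]] = 1.0
            else:
                row[index[k]] = float(v)
        rows.append(row)
    return names, rows


measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Fransisco', 'temperature': 18.},
]
names, rows = one_hot_dicts(measurements)
print(names)
print(rows)
```

On this input the sketch yields the same column names and values as the DictVectorizer output shown above, only as dense Python lists.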
 
    

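DictVectorizer is also a useful representation transformation for training sequence classifiers in Natural Language Processing (NLP) models, which typically work by extracting feature windows around a particular word of interest.

For example, suppose that we have a first algorithm that extracts Part of Speech (PoS) tags that we want to use as complementary tags for training a sequence classifier (e.g. a chunker). The following dict could be such a window of features extracted around the word 'sat' in the sentence 'The cat sat on the mat.':

>>> pos_window = [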
...     {
...         'word-2': 'the',
...         'pos-2': 'DT',
...         'word-1': 'cat',
...         'pos-1': 'NN',
...         'word+1': 'on',
...         'pos+1': 'PP',
...     },
...     # in a real application one would extract many such dictionaries
... ]
 
    

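This description can be vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier (maybe after being piped into a text.TfidfTransformer for normalization):

>>> vec = DictVectorizer()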
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> pos_vectorized                
<1x6 sparse matrix of type '<... 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse ... format>
>>> pos_vectorized.toarray()
array([[ 1.,  1.,  1.,  1.,  1.,  1.]])
>>> vec.get_feature_names()
['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']
 
    

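As you can imagine, if one extracts such a context around each individual word of a corpus of documents, the resulting matrix will be very wide (many one-hot features), with most of them being zero most of the time. So that the resulting data structure can fit in memory, the DictVectorizer class uses a scipy.sparse matrix by default instead of a numpy.ndarray.

4.2.2. Feature hashing

The class FeatureHasher is a high-speed, low-memory vectorizer that uses a technique known as feature hashing, or the "hashing trick". Instead of building a hash table of the features encountered in training, as the vectorizers do, instances of FeatureHasher apply a hash function to the features to determine their column index in sample matrices directly. The result is increased speed and reduced memory usage, at the expense of inspectability: the hasher does not remember what the input features looked like and has no inverse_transform method.

Since the hash function might cause collisions between (unrelated) features, a signed hash function is used, and the sign of the hash value determines the sign of the value stored in the output matrix for a feature. This way, collisions are likely to cancel out rather than accumulate error, and the expected mean of any output feature's value is zero.

If non_negative=True is passed to the constructor, the absolute value is taken. This undoes some of the collision handling, but allows the output to be passed to estimators such as sklearn.naive_bayes.MultinomialNB or the sklearn.feature_selection.chi2 feature selector, which expect non-negative inputs.

FeatureHasher accepts either mappings (like Python's dict and its variants in the collections module), (feature, value) pairs, or strings, depending on the constructor parameter input_type. Mappings are treated as lists of (feature, value) pairs, while single strings have an implicit value of 1, so ['feat1', 'feat2', 'feat3'] is interpreted as [('feat1', 1), ('feat2', 1), ('feat3', 1)].

If a single feature occurs multiple times in a sample, the associated values are summed (so ('feat', 2) and ('feat', 3.5) become ('feat', 5.5)). The output from FeatureHasher is always a scipy.sparse matrix in the CSR format.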
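
The mechanics described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses MD5 from hashlib as a stand-in for MurmurHash3, and the names signed_hash and hash_sample are hypothetical, not part of scikit-learn.

```python
import hashlib


def signed_hash(feature):
    """Stand-in 32-bit signed hash (MD5-based, NOT MurmurHash3)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    h = int.from_bytes(digest[:4], "little")
    return h - 2**32 if h >= 2**31 else h


def hash_sample(pairs, n_features=16):
    """Map (feature, value) pairs to a sparse {column: value} dict."""
    row = {}
    for feature, value in pairs:
        h = signed_hash(feature)
        col = abs(h) % n_features      # column index by simple modulo
        sign = 1 if h >= 0 else -1     # sign of the hash sets the sign of the value
        row[col] = row.get(col, 0.0) + sign * value
    return row


# Repeated features are summed: ('feat', 2) and ('feat', 3.5) act like ('feat', 5.5).
a = hash_sample([("feat", 2), ("feat", 3.5)])
b = hash_sample([("feat", 5.5)])
print(a == b)

# Bare strings carry an implicit value of 1.
c = hash_sample((f, 1) for f in ["feat1", "feat2", "feat3"])
print(c)
```

No vocabulary is stored anywhere, which is why the mapping cannot be inverted.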

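Feature hashing can be employed in document classification, but unlike text.CountVectorizer, FeatureHasher does not do word splitting or any other preprocessing except Unicode-to-UTF-8 encoding; see Vectorizing a large text corpus with the hashing trick, below, for a combined tokenizer/hasher.

As an example, consider a word-level natural language processing task that needs features extracted from (token, part_of_speech) pairs. One could use a Python generator function to extract features:

def token_features(token, part_of_speech):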
    if token.isdigit():
        yield "numeric"
    else:
        yield "token={}".format(token.lower())
        yield "token,pos={},{}".format(token, part_of_speech)
    if token[0].isupper():
        yield "uppercase_initial"
    if token.isupper():
        yield "all_uppercase"
    yield "pos={}".format(part_of_speech)
 
    

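Then, the raw_X to be fed to FeatureHasher.transform can be constructed using:

# corpus and pos_tagger are assumed to be defined elsewhere
raw_X = (token_features(tok, pos_tagger(tok)) for tok in corpus)

and fed to a hasher with:

from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(input_type='string')
X = hasher.transform(raw_X)

to get a scipy.sparse matrix X.

Note the use of a generator comprehension, which introduces laziness into the feature extraction: tokens are only processed on demand from the hasher.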

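4.2.2.1. Implementation details

FeatureHasher uses the signed 32-bit variant of MurmurHash3. As a result (and because of limitations in scipy.sparse), the maximum number of features supported is currently 2^31 - 1.

The original formulation of the hashing trick by Weinberger et al. used two separate hash functions to determine the column index and the sign of a feature, respectively. The present implementation works under the assumption that the sign bit of MurmurHash3 is independent of its other bits.

Since a simple modulo is used to transform the hash value into a column index, it is advisable to use a power of two as the n_features parameter; otherwise the features will not be mapped evenly to the columns.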
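
The effect of the modulo can be checked on a toy hash space. The sketch below assumes an idealised 8-bit hash whose 256 values are all equally likely, and counts how many of them land in each column; column_load is a hypothetical helper.

```python
from collections import Counter


def column_load(n_features, hash_bits=8):
    """Count how many of the 2**hash_bits hash values land in each column."""
    return Counter(h % n_features for h in range(2 ** hash_bits))


even = column_load(64)     # power of two: 256 / 64 = 4 hash values per column
uneven = column_load(100)  # not a power of two

print(set(even.values()))    # {4}
print(set(uneven.values()))  # {2, 3}
```

With 100 columns, columns 0 to 55 receive three hash values each and columns 56 to 99 only two, so some columns are 50% more likely to be hit than others.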

References:

Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola and Josh Attenberg (2009). Feature hashing for large scale multitask learning. Proc. ICML.
MurmurHash3.

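4.2.3. Text feature extraction

4.2.3.1. The Bag of Words representation

Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents with variable length.

To address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:

tokenizing strings and giving an integer id to each possible token, for instance by using white-space and punctuation as token separators.
counting the occurrences of tokens in each document.
normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents.

In this scheme, features and samples are defined as follows:

each individual token occurrence frequency (normalized or not) is treated as a feature.
the vector of all the token frequencies for a given document is considered a multivariate sample.

A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.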
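
The tokenizing and counting steps can be sketched with the standard library alone. This is an illustration rather than the scikit-learn implementation: bag_of_words is a hypothetical helper, and the regular expression assumes that a token is a run of at least two word characters.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words of at least two characters


def bag_of_words(docs):
    """Return (vocabulary, count matrix) with one row per document."""
    counts = [Counter(TOKEN.findall(d.lower())) for d in docs]
    vocab = sorted(set().union(*counts))            # one column per distinct token
    matrix = [[c[t] for t in vocab] for c in counts]
    return vocab, matrix


docs = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vocab, matrix = bag_of_words(docs)
print(vocab)
for row in matrix:
    print(row)
```

The first and last rows come out identical, which shows directly that word order is discarded.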

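4.2.3.2. Sparsity

As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).

For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary on the order of 100,000 unique words in total, while each document will use 100 to 1000 unique words individually.

To be able to store such a matrix in memory, and also to speed up matrix / vector algebraic operations, implementations typically use a sparse representation such as the ones available in the scipy.sparse package.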
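
The arithmetic behind the "more than 99%" figure is worth spelling out. The sketch below takes the upper end of the per-document range as a worst case; the sparse_row dictionary is a hypothetical example of storing only non-zero counts, not the scipy.sparse layout.

```python
n_docs, vocab_size = 10_000, 100_000
max_words_per_doc = 1_000  # upper end of the 100-1000 range

dense_cells = n_docs * vocab_size
nonzero_cells = n_docs * max_words_per_doc
density = nonzero_cells / dense_cells
print(f"{dense_cells:,} cells, at most {nonzero_cells:,} non-zero "
      f"({density:.0%}), so at least {1 - density:.0%} zeros")

# Dictionary-of-keys storage keeps only the non-zero counts.
sparse_row = {3: 1, 17: 2, 42: 1}  # hypothetical {column: count} for one document
print(sparse_row.get(5, 0))        # a column that is not stored reads as 0
```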

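4.2.3.3. Common Vectorizer usage

CountVectorizer implements both tokenization and occurrence counting in a single class:

>>> from sklearn.feature_extraction.text import CountVectorizer

This model has many parameters, however the default values are quite reasonable (please see the reference documentation for the details):

>>> vectorizer = CountVectorizer(min_df=1)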
>>> vectorizer                     
CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict',
        dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
 
     

Let's use it to tokenize and count the word occurrences of a minimalistic corpus of text documents:

>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X                              
<4x9 sparse matrix of type '<... 'numpy.int64'>'
    with 19 stored elements in Compressed Sparse ... format>
 
     

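The default configuration tokenizes the string by extracting words of at least 2 letters. The specific function that does this step can be requested explicitly:

>>> analyze = vectorizer.build_analyzer()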
>>> analyze("This is a text document to analyze.") == (
...     ['this', 'is', 'text', 'document', 'to', 'analyze'])
True

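Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. This interpretation of the columns can be retrieved as follows:

>>> vectorizer.get_feature_names() == (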
...     ['and', 'document', 'first', 'is', 'one',
...      'second', 'the', 'third', 'this'])
True

>>> X.toarray()           
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
 
     

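The converse mapping from feature name to column index is stored in the vocabulary_ attribute of the vectorizer:

>>> vectorizer.vocabulary_.get('document')
1

Hence words that were not seen in the training corpus will be completely ignored in future calls to the transform method:

>>> vectorizer.transform(['Something completely new.']).toarray()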
...                           
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]]...)

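Note that in the previous corpus, the first and the last documents have exactly the same words and are hence encoded as equal vectors. In particular we lose the information that the last document is an interrogative form. To preserve some of the local ordering information we can extract 2-grams of words in addition to the 1-grams (individual words):

>>> bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),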
...                                     token_pattern=r'\b\w+\b', min_df=1)
>>> analyze = bigram_vectorizer.build_analyzer()
>>> analyze('Bi-grams are cool!') == (
...     ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool'])
True
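
For readers who want to see what n-gram extraction does, here is a small standard-library sketch. The function word_ngrams is hypothetical; it assumes lowercasing and the same \b\w+\b token pattern as in the example above.

```python
import re


def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams for n in the inclusive ngram_range."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        # Slide a window of n consecutive tokens over the token list.
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


print(word_ngrams('Bi-grams are cool!'))
# ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool']
```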
 
     

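The vocabulary extracted by this vectorizer is hence much bigger and can now resolve ambiguities encoded in local positioning patterns:

>>> X_2 = bigram_vectorizer.fit_transform(corpus).toarray()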
>>> X_2
...                           
array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
       [0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]]...)
 
     

In particular the interrogative form "Is this" is only present in the last document:

>>> feature_index = bigram_vectorizer.vocabulary_.get('is this')
>>> X_2[:, feature_index]     
array([0, 0, 0, 1]...)

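4.2.3.4. Tf–idf term weighting

In a large text corpus, some words will be very frequent (e.g. "the", "a", "is" in English) and hence carry very little meaningful information about the actual contents of the document. If we were to feed the raw count data directly to a classifier, those very frequent terms would shadow the frequencies of rarer yet more interesting terms.

To re-weight the count features into floating point values suitable for use by a classifier, it is very common to use the tf–idf transform.

Tf means term-frequency, while tf–idf means term-frequency times inverse document-frequency. This was originally a term weighting scheme developed for information retrieval (as a ranking function for search engine results) that has also found good use in document classification and clustering.

This normalization is implemented by the TfidfTransformer class:

>>> from sklearn.feature_extraction.text import TfidfTransformer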
>>> transformer = TfidfTransformer()
>>> transformer   
TfidfTransformer(norm=...'l2', smooth_idf=True, sublinear_tf=False,
                 use_idf=True)
 
     

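Again please see the reference documentation for the details on all the parameters.

Let's take an example with the following counts. The first term is present 100% of the time and hence not very interesting. The two other features appear in less than 50% of the documents and are hence probably more representative of their content:

>>> counts = [[3, 0, 1],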
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]
...
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf                         
<6x3 sparse matrix of type '<... 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse ... format>

>>> tfidf.toarray()                        
array([[ 0.85...,  0.  ...,  0.52...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.55...,  0.83...,  0.  ...],
       [ 0.63...,  0.  ...,  0.77...]])
 
     

Each row is normalized to have unit Euclidean norm. The weights of each feature computed by the fit method call are stored in a model attribute:

>>> transformer.idf_
array([ 1. ...,  2.25...,  1.84...])
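
The numbers above can be reproduced by hand. The sketch below assumes the smoothed weighting idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing term t, followed by l2 row normalization; the function tfidf is a hypothetical helper, not the library code.

```python
import math


def tfidf(counts):
    """Smoothed idf weighting followed by l2 normalization of each row."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of rows in which each term occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    rows = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        rows.append([x / norm for x in weighted])
    return idf, rows


counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]
idf, rows = tfidf(counts)
print([round(w, 4) for w in idf])
print([round(x, 4) for x in rows[0]])
```

Under that assumption, the term that occurs in every document gets weight 1, and the rarer the term, the larger its weight.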

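As tf–idf is very often used for text features, there is also another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model:

>>> from sklearn.feature_extraction.text import TfidfVectorizer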
>>> vectorizer = TfidfVectorizer(min_df=1)
>>> vectorizer.fit_transform(corpus)
...                                
<4x9 sparse matrix of type '<... 'numpy.float64'>'
    with 19 stored elements in Compressed Sparse ... format>
 
     

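While the tf–idf normalization is often very useful, there might be cases where binary occurrence markers offer better features. This can be achieved by using the binary parameter of CountVectorizer. In particular, some estimators such as Bernoulli Naive Bayes explicitly model discrete boolean random variables. Also, very short texts are likely to have noisy tf–idf values, while the binary occurrence information is more stable.

As usual, the best way to adjust the feature extraction parameters is to use a cross-validated grid search, for instance by pipelining the feature extractor with a classifier: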

Sample pipeline for text feature extraction and evaluation

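4.2.3.5. Decoding text files

Text is made of characters, but files are made of bytes. These bytes represent characters according to some encoding. To work with text files in Python, their bytes must be decoded to a character set called Unicode. Common encodings are ASCII, Latin-1 (Western Europe), KOI8-R (Russian) and the universal encodings UTF-8 and UTF-16. Many others exist.

Note: an encoding can also be called a "character set", but this term is less accurate: several encodings can exist for a single character set.

The text feature extractors in scikit-learn know how to decode text files, but only if you tell them what encoding the files are in.

CountVectorizer takes an encoding parameter for this purpose. For modern text files, the correct encoding is probably UTF-8, which is therefore the default (encoding="utf-8").

If the text you are loading is not actually encoded with UTF-8, however, you will get a UnicodeDecodeError. The vectorizers can be told to be silent about decoding errors by setting the decode_error parameter to either "ignore" or "replace". See the documentation for the Python function bytes.decode for more details (type help(bytes.decode) at the Python prompt).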

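If you are having trouble decoding text, here are some things to try:

Find out what the actual encoding of the text is. The file might come with a header or README that tells you the encoding, or there might be some standard encoding you can assume based on where the text comes from.
You may be able to find out what kind of encoding it is in general using the UNIX command file. The Python chardet module comes with a script called chardetect.py that will guess the specific encoding, though you cannot rely on its guess being correct.
You could try UTF-8 and disregard the errors. You can decode byte strings with bytes.decode(errors='replace') to replace all decoding errors with a meaningless character, or set decode_error='replace' in the vectorizer. This may damage the usefulness of your features.
Real text may come from a variety of sources that may have used different encodings, or even be sloppily decoded in a different encoding than the one it was encoded with. This is common in text retrieved from the Web. The Python package ftfy can automatically sort out some classes of decoding errors, so you could try decoding the unknown text as latin-1 and then using ftfy to fix errors.
If the text is in a mish-mash of encodings that is simply too hard to sort out (which is the case for the 20 Newsgroups dataset), you can fall back on a simple single-byte encoding such as latin-1. Some text may display incorrectly, but at least the same sequence of bytes will always represent the same feature.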
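
The options in the list above behave as follows on a byte string that is Latin-1 rather than UTF-8. The sketch uses only the built-in bytes.decode; the sample bytes are a short German phrase chosen for illustration.

```python
raw = b"holdselig sind deine Ger\xfcche"  # Latin-1 bytes, not valid UTF-8

try:
    raw.decode("utf-8")                   # strict decoding is the default
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)

replaced = raw.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD
ignored = raw.decode("utf-8", errors="ignore")    # bad byte is dropped
latin1 = raw.decode("latin-1")                    # single-byte fallback never fails

print(repr(replaced))
print(repr(ignored))
print(repr(latin1))
```

Both "replace" and "ignore" lose the original character, which is why they can damage features, whereas latin-1 always yields some character for every byte.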

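For example, the following snippet uses chardet (not shipped with scikit-learn, must be installed separately) to figure out the encoding of three texts. It then vectorizes the texts and prints the learned vocabulary. The output is not shown here.

>>> import chardet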
>>> text1 = b"Sei mir gegr\xc3\xbc\xc3\x9ft mein Sauerkraut"
>>> text2 = b"holdselig sind deine Ger\xfcche"
>>> text3 = b"\xff\xfeA\x00u\x00f\x00 \x00F\x00l\x00\xfc\x00g\x00e\x00l\x00n\x00 \x00d\x00e\x00s\x00 \x00G\x00e\x00s\x00a\x00n\x00g\x00e\x00s\x00,\x00 \x00H\x00e\x00r\x00z\x00l\x00i\x00e\x00b\x00c\x00h\x00e\x00n\x00,\x00 \x00t\x00r\x00a\x00g\x00 \x00i\x00c\x00h\x00 \x00d\x00i\x00c\x00h\x00 \x00f\x00o\x00r\x00t\x00"
>>> decoded = [x.decode(chardet.detect(x)['encoding'])
...            for x in (text1, text2, text3)]        
>>> v = CountVectorizer().fit(decoded).vocabulary_    
>>> for term in v: print(term)
 
     

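(Depending on the version of chardet, it might get the first one wrong.)

For an introduction to Unicode and character encodings in general, see Joel Spolsky's Absolute Minimum Every Software Developer Must Know About Unicode.

4.2.3.6. Applications and examples

The bag of words representation is quite simplistic but surprisingly useful in practice.

In particular, in a supervised setting it can be successfully combined with fast and scalable linear models to train document classifiers, for instance:

Classification of text documents using sparse features

In an unsupervised setting it can be used to group similar documents together by applying clustering algorithms such as K-means:

Clustering text documents using k-means

Finally, it is possible to discover the main topics of a corpus by relaxing the hard assignment constraint of clustering, for instance by using Non-negative matrix factorization (NMF or NNMF):

Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation

4.2.3.7. Limitations of the Bag of Words representation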

A collection of unigrams (what bag of words is) cannot capture phrases and multi-word expressions, effectively disregarding any word order dependence. Additionally, the bag of words model doesn’t account for potential misspellings or word derivations.

N-grams to the rescue! Instead of building a simple collection of unigrams (n=1), one might prefer a collection of bigrams (n=2), where occurrences of pairs of consecutive words are counted.

One might alternatively consider a collection of character n-grams, a representation resilient against misspellings and derivations.

For example, let’s say we’re dealing with a corpus of two documents: ['words', 'wprds']. The second document contains a misspelling of the word ‘words’. A simple bag of words representation would consider these two as very distinct documents, differing in both of the two possible features. A character 2-gram representation, however, would find the documents matching in 4 out of 8 features, which may help the preferred classifier decide better:

 
>>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2), min_df=1)
>>> counts = ngram_vectorizer.fit_transform(['words', 'wprds'])
>>> ngram_vectorizer.get_feature_names() == (
...     [' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp'])
True
>>> counts.toarray().astype(int)
array([[1, 1, 1, 0, 1, 1, 1, 0],
       [1, 1, 0, 1, 1, 1, 0, 1]])
 
     

In the above example, the 'char_wb' analyzer is used, which creates n-grams only from characters inside word boundaries (padded with space on each side). The 'char' analyzer, alternatively, creates n-grams that span across words:

>>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(5, 5), min_df=1)
>>> ngram_vectorizer.fit_transform(['jumpy fox'])
...                                
<1x4 sparse matrix of type '<... 'numpy.int64'>'
   with 4 stored elements in Compressed Sparse ... format>
>>> ngram_vectorizer.get_feature_names() == (
...     [' fox ', ' jump', 'jumpy', 'umpy '])
True

>>> ngram_vectorizer = CountVectorizer(analyzer='char', ngram_range=(5, 5), min_df=1)
>>> ngram_vectorizer.fit_transform(['jumpy fox'])
...                                
<1x5 sparse matrix of type '<... 'numpy.int64'>'
    with 5 stored elements in Compressed Sparse ... format>
>>> ngram_vectorizer.get_feature_names() == (
...     ['jumpy', 'mpy f', 'py fo', 'umpy ', 'y fox'])
True
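
The difference between the two analyzers comes down to where the sliding window is applied. The following is a simplified sketch with hypothetical helper names; in particular, it simply emits nothing for a padded word shorter than n, which may differ from how the library treats very short words.

```python
def char_ngrams(text, n):
    """All character n-grams of the whole string, spanning word boundaries."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def char_wb_ngrams(text, n):
    """Character n-grams taken inside each space-padded word separately."""
    grams = []
    for word in text.split():
        grams += char_ngrams(f" {word} ", n)
    return grams


print(sorted(set(char_wb_ngrams('jumpy fox', 5))))
# [' fox ', ' jump', 'jumpy', 'umpy ']
print(sorted(set(char_ngrams('jumpy fox', 5))))
# ['jumpy', 'mpy f', 'py fo', 'umpy ', 'y fox']

# The misspelling example: the two documents share 4 of the 8 bigram features.
words, wprds = set(char_wb_ngrams('words', 2)), set(char_wb_ngrams('wprds', 2))
print(sorted(words & wprds), len(words | wprds))
```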
 
     

The word boundaries-aware variant char_wb is especially interesting for languages that use white-spaces for word separation as it generates significantly less noisy features than the raw char variant in that case. For such languages it can increase both the predictive accuracy and convergence speed of classifiers trained using such features while retaining the robustness with regards to misspellings and word derivations.

While some local positioning information can be preserved by extracting n-grams instead of individual words, bag of words and bag of n-grams destroy most of the inner structure of the document and hence most of the meaning carried by that internal structure.

In order to address the wider task of Natural Language Understanding, the local structure of sentences and paragraphs should thus be taken into account. Many such models will thus be cast as "Structured output" problems which are currently outside of the scope of scikit-learn.

4.2.3.8. Vectorizing a large text corpus with the hashing trick

The above vectorization scheme is simple but the fact that it holds an in-memory mapping from the string tokens to the integer feature indices (the vocabulary_ attribute) causes several problems when dealing with large datasets:

the larger the corpus, the larger the vocabulary will grow and hence the memory use too,
fitting requires the allocation of intermediate data structures of size proportional to that of the original dataset,
building the word-mapping requires a full pass over the dataset, hence it is not possible to fit text classifiers in a strictly online manner,
pickling and un-pickling vectorizers with a large vocabulary_ can be very slow (typically much slower than pickling / un-pickling flat data structures such as a NumPy array of the same size),
it is not easily possible to split the vectorization work into concurrent sub tasks as the vocabulary_ attribute would have to be a shared state with a fine grained synchronization barrier: the mapping from token string to feature index is dependent on ordering of the first occurrence of each token hence would have to be shared, potentially harming the concurrent workers’ performance to the point of making them slower than the sequential variant.

It is possible to overcome those limitations by combining the "hashing trick" (Feature hashing) implemented by the sklearn.feature_extraction.FeatureHasher class and the text preprocessing and tokenization features of the CountVectorizer.

This combination is implemented in HashingVectorizer, a transformer class that is mostly API compatible with CountVectorizer. HashingVectorizer is stateless, meaning that you don't have to call fit on it:

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> hv = HashingVectorizer(n_features=10)
>>> hv.transform(corpus)
...                                
<4x10 sparse matrix of type '<... 'numpy.float64'>'
    with 16 stored elements in Compressed Sparse ... format>
 
     

You can see that 16 non-zero feature tokens were extracted in the vector output: this is less than the 19 non-zeros extracted previously by the CountVectorizer on the same toy corpus. The discrepancy comes from hash function collisions because of the low value of the n_features parameter.

In a real world setting, the n_features parameter can be left to its default value of 2 ** 20 (roughly one million possible features). If memory or downstream models size is an issue selecting a lower value such as 2 ** 18 might help without introducing too many additional collisions on typical text classification tasks.

Note that the dimensionality does not affect the CPU training time of algorithms which operate on CSR matrices (LinearSVC(dual=True), Perceptron, SGDClassifier, PassiveAggressive) but it does for algorithms that work with CSC matrices (LinearSVC(dual=False), Lasso(), etc).

Let’s try again with the default setting:

 
>>> hv = HashingVectorizer()
>>> hv.transform(corpus)
...                               
<4x1048576 sparse matrix of type '<... 'numpy.float64'>'
    with 19 stored elements in Compressed Sparse ... format>
 
     

We no longer get the collisions, but this comes at the expense of a much larger dimensionality of the output space. Of course, other terms than the 19 used here might still collide with each other.
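
Why a small n_features forces collisions is a pigeonhole argument, and it can be checked without scikit-learn. The sketch below uses an MD5-based stand-in hash (an assumption, not the hash HashingVectorizer uses) and a deliberately tiny table of 4 columns; the column helper is hypothetical.

```python
import hashlib


def column(token, n_features):
    """Column index of a token under a stand-in hash (MD5, not MurmurHash3)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % n_features


vocab = ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
small = {column(t, 4) for t in vocab}        # 9 tokens cannot fit in 4 columns
large = {column(t, 2 ** 20) for t in vocab}  # collisions unlikely, not impossible

print(len(vocab), "tokens ->", len(small), "distinct columns with n_features=4")
print(len(vocab), "tokens ->", len(large), "distinct columns with n_features=2**20")
```

With nine tokens and four columns at least two tokens must share a column; with about a million columns a collision among nine tokens is improbable but never ruled out.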

The HashingVectorizer also comes with the following limitations:

it is not possible to invert the model (no inverse_transform method), nor to access the original string representation of the features, because of the one-way nature of the hash function that performs the mapping.
it does not provide IDF weighting as that would introduce statefulness in the model. A TfidfTransformer can be appended to it in a pipeline if required.

4.2.3.9. Performing out-of-core scaling with HashingVectorizer

An interesting development of using a HashingVectorizer is the ability to perform out-of-core scaling. This means that we can learn from data that does not fit into the computer’s main memory.

A strategy to implement out-of-core scaling is to stream data to the estimator in mini-batches. Each mini-batch is vectorized using HashingVectorizer so as to guarantee that the input space of the estimator has always the same dimensionality. The amount of memory used at any time is thus bounded by the size of a mini-batch. Although there is no limit to the amount of data that can be ingested using such an approach, from a practical point of view the learning time is often limited by the CPU time one wants to spend on the task.
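
The streaming strategy can be outlined as follows. Everything here is a hypothetical sketch: document_stream stands in for reading from disk, hash_doc is a toy stateless vectorizer built on MD5, and the comment about partial_fit only indicates where an incremental estimator would be updated.

```python
import hashlib
from itertools import islice

N_FEATURES = 2 ** 10  # hypothetical fixed output width


def hash_doc(doc):
    """Stateless toy vectorizer: sparse {column: count} for one document."""
    row = {}
    for token in doc.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        col = int.from_bytes(digest[:4], "little") % N_FEATURES
        row[col] = row.get(col, 0) + 1
    return row


def mini_batches(stream, size):
    """Yield vectorized batches of at most `size` documents from an iterator."""
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield [hash_doc(d) for d in batch]


def document_stream():
    # Hypothetical stand-in for reading documents from disk or a socket.
    for i in range(10):
        yield f"document number {i} about topic {i % 3}"


for X_batch in mini_batches(document_stream(), size=4):
    # A real pipeline would update an incremental estimator here,
    # for example via a partial_fit call.
    print(len(X_batch), "rows, every column index <", N_FEATURES)
```

Only one batch is held in memory at a time, and because the hashing step has no state, every batch lives in the same feature space.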

For a full-fledged example of out-of-core scaling in a text classification task see Out-of-core classification of text documents.

4.2.3.10. Customizing the vectorizer classes

It is possible to customize the behavior by passing a callable to the vectorizer constructor:

 
>>> def my_tokenizer(s):
...     return s.split()
...
>>> vectorizer = CountVectorizer(tokenizer=my_tokenizer)
>>> vectorizer.build_analyzer()(u"Some... punctuation!") == (
...     ['some...', 'punctuation!'])
True
 
     

In particular we name:

preprocessor: a callable that takes an entire document as input (as a single string), and returns a possibly transformed version of the document, still as an entire string. This can be used to remove HTML tags, lowercase the entire document, etc.

tokenizer: a callable that takes the output from the preprocessor and splits it into tokens, then returns a list of these.

analyzer: a callable that replaces the preprocessor and tokenizer. The default analyzers all call the preprocessor and tokenizer, but custom analyzers will skip this. N-gram extraction and stop word filtering take place at the analyzer level, so a custom analyzer may have to reproduce these steps.

(Lucene users might recognize these names, but be aware that scikit-learn concepts may not map one-to-one onto Lucene concepts.)

To make the preprocessor, tokenizer and analyzers aware of the model parameters it is possible to derive from the class and override the build_preprocessor, build_tokenizer and build_analyzer factory methods instead of passing custom functions.
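
How the three stages relate to each other can be shown with plain functions. This is a conceptual sketch only: make_analyzer is a hypothetical helper, not a scikit-learn API, and the tag-stripping regular expression is a simplification.

```python
import re


def preprocessor(doc):
    """Whole document in, whole document out: strip tags and lowercase."""
    return re.sub(r"<[^>]+>", " ", doc).lower()


def tokenizer(doc):
    """Split the preprocessed document into a list of tokens."""
    return doc.split()


def make_analyzer(preprocess, tokenize, stop_words=()):
    """Hypothetical helper chaining the stages, with stop word filtering."""
    def analyze(doc):
        return [t for t in tokenize(preprocess(doc)) if t not in stop_words]
    return analyze


analyze = make_analyzer(preprocessor, tokenizer, stop_words={"the"})
print(analyze("<b>Some...</b> punctuation in the TEXT!"))
# ['some...', 'punctuation', 'in', 'text!']
```

A custom analyzer callable replaces this whole chain, which is why it has to redo any n-gram extraction or stop word filtering it still needs.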

Some tips and tricks:

If documents are pre-tokenized by an external package, then store them in files (or strings) with the tokens separated by whitespace and pass analyzer=str.split
Fancy token-level analysis such as stemming, lemmatizing, compound splitting, filtering based on part-of-speech, etc. are not included in the scikit-learn codebase, but can be added by customizing either the tokenizer or the analyzer. Here’s a CountVectorizer with a tokenizer and lemmatizer using NLTK:
>>> from nltk import word_tokenize          
>>> from nltk.stem import WordNetLemmatizer 
>>> class LemmaTokenizer(object):
...     def __init__(self):
...         self.wnl = WordNetLemmatizer()
...     def __call__(self, doc):
...         return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]
...
>>> vect = CountVectorizer(tokenizer=LemmaTokenizer())  
(Note that this will not filter out punctuation.)

Customizing the vectorizer can also be useful when handling Asian languages that do not use an explicit word separator such as whitespace.

4.2.4. Image feature extraction

4.2.4.1. Patch extraction

The extract_patches_2d function extracts patches from an image stored as a two-dimensional array, or three-dimensional with color information along the third axis. For rebuilding an image from all its patches, use reconstruct_from_patches_2d. For example let us generate a 4x4 pixel picture with 3 color channels (e.g. in RGB format):

>>> import numpy as np
>>> from sklearn.feature_extraction import image

>>> one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))
>>> one_image[:, :, 0]  # R channel of a fake RGB picture
array([[ 0,  3,  6,  9],
       [12, 15, 18, 21],
       [24, 27, 30, 33],
       [36, 39, 42, 45]])

>>> patches = image.extract_patches_2d(one_image, (2, 2), max_patches=2,
...     random_state=0)
>>> patches.shape
(2, 2, 2, 3)
>>> patches[:, :, :, 0]
array([[[ 0,  3],
        [12, 15]],

       [[15, 18],
        [27, 30]]])
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> patches.shape
(9, 2, 2, 3)
>>> patches[4, :, :, 0]
array([[15, 18],
       [27, 30]])
 
     

Let us now try to reconstruct the original image from the patches by averaging on overlapping areas:

 
>>> reconstructed = image.reconstruct_from_patches_2d(patches, (4, 4, 3))
>>> np.testing.assert_array_equal(one_image, reconstructed)
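
The extraction order and the averaging step can be traced in plain Python on a single channel. The helpers below are hypothetical and work on nested lists; they assume patches are enumerated row by row by their top-left corner, which matches the index 4 patch shown above.

```python
def extract_patches(img, ph, pw):
    """All ph x pw patches of a 2D list, ordered by top-left corner."""
    h, w = len(img), len(img[0])
    return [
        [row[j:j + pw] for row in img[i:i + ph]]
        for i in range(h - ph + 1)
        for j in range(w - pw + 1)
    ]


def reconstruct(patches, h, w):
    """Rebuild the image by averaging every pixel over the patches covering it."""
    ph, pw = len(patches[0]), len(patches[0][0])
    total = [[0.0] * w for _ in range(h)]
    hits = [[0] * w for _ in range(h)]
    k = 0
    for i in range(h - ph + 1):
        for j in range(w - pw + 1):
            for di in range(ph):
                for dj in range(pw):
                    total[i + di][j + dj] += patches[k][di][dj]
                    hits[i + di][j + dj] += 1
            k += 1
    return [[total[r][c] / hits[r][c] for c in range(w)] for r in range(h)]


# Same values as the R channel of the fake RGB picture above.
red = [[3 * (4 * r + c) for c in range(4)] for r in range(4)]
patches = extract_patches(red, 2, 2)
print(len(patches))   # 9 patches: 3 positions per axis
print(patches[4])     # [[15, 18], [27, 30]]
print(reconstruct(patches, 4, 4) == red)
```

A 4x4 image has 3 x 3 = 9 positions for a 2x2 window, which is also why five such images give 45 patches.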

The PatchExtractor class works in the same way as extract_patches_2d, only it supports multiple images as input. It is implemented as an estimator, so it can be used in pipelines. See:

 
>>> five_images = np.arange(5 * 4 * 4 * 3).reshape(5, 4, 4, 3)
>>> patches = image.PatchExtractor((2, 2)).transform(five_images)
>>> patches.shape
(45, 2, 2, 3)

4.2.4.2. Connectivity graph of an image

Several estimators in scikit-learn can use connectivity information between features or samples. For instance Ward clustering (Hierarchical clustering) can cluster together only neighboring pixels of an image, thus forming contiguous patches.

For this purpose, the estimators use a 'connectivity' matrix, giving which samples are connected.

The function img_to_graph returns such a matrix from a 2D or 3D image. Similarly, grid_to_graph builds a connectivity matrix for images given the shape of these images.

These matrices can be used to impose connectivity in estimators that use connectivity information, such as Ward clustering (Hierarchical clustering), but also to build precomputed kernels, or similarity matrices.
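
What "connectivity" means for a pixel grid can be made explicit with a short sketch. The function grid_edges is hypothetical; it only lists pairs of horizontally and vertically adjacent pixels, numbered row by row, and does not reproduce the sparse matrix format that the library functions return.

```python
def grid_edges(n_rows, n_cols):
    """Edges between horizontally and vertically adjacent pixels."""
    edges = []
    for r in range(n_rows):
        for c in range(n_cols):
            node = r * n_cols + c            # row-major pixel index
            if c + 1 < n_cols:
                edges.append((node, node + 1))       # right neighbour
            if r + 1 < n_rows:
                edges.append((node, node + n_cols))  # neighbour below
    return edges


print(grid_edges(2, 2))       # [(0, 1), (0, 2), (1, 3), (2, 3)]
print(len(grid_edges(4, 4)))  # 24 = 4 * 3 + 3 * 4
```

Each edge corresponds to a pair of symmetric non-zero entries in a connectivity matrix, so a clustering algorithm constrained by it can only merge pixels that touch.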

Note

Examples

A demo of structured Ward hierarchical clustering on Lena image
Spectral clustering for image segmentation
Feature agglomeration vs. univariate selection

windows 启动mongodb 编写bat文件， mongod --dbpath D:\software\MongoDBDATA mongod --help 查询各种配置配置在mongob 打开批处理，即可启动，27017原生端口，shell操作监控端口扩展28017，web端操作端口启动配置文件配置，数据更灵活
大型高并发高负载网站的系统架构 bijian1013 高并发负载均衡
扩展Web应用程序一.概念简单的来说，如果一个系统可扩展，那么你可以通过扩展来提供系统的性能。这代表着系统能够容纳更高的负载、更大的数据集，并且系统是可维护的。扩展和语言、某项具体的技术都是无关的。扩展可以分为两种： 1.
DISPLAY变量和xhost(原创) czmmiao display
DISPLAY 在Linux/Unix类操作系统上, DISPLAY用来设置将图形显示到何处. 直接登陆图形界面或者登陆命令行界面后使用startx启动图形, DISPLAY环境变量将自动设置为:0:0, 此时可以打开终端, 输出图形程序的名称(比如xclock)来启动程序, 图形将显示在本地窗口上, 在终端上输入printenv查看当前环境变量, 输出结果中有如下内容:DISPLAY=:0.0
获取B/S客户端IP 周凡杨 java 编程 jsp Web 浏览器
最近想写个B/S架构的聊天系统，因为以前做过C/S架构的QQ聊天系统，所以对于Socket通信编程只是一个巩固。对于C/S架构的聊天系统，由于存在客户端Java应用，所以直接在代码中获取客户端的IP，应用的方法为： String ip = InetAddress.getLocalHost().getHostAddress(); 然而对于WEB
浅谈类和对象朱辉辉33 编程
类是对一类事物的总称，对象是描述一个物体的特征，类是对象的抽象。简单来说，类是抽象的，不占用内存，对象是具体的，占用存储空间。类是由属性和方法构成的，基本格式是public class 类名{ //定义属性 private/public 数据类型属性名； //定义方法 publ
android activity与viewpager+fragment的生命周期问题肆无忌惮_ viewpager
有一个Activity里面是ViewPager，ViewPager里面放了两个Fragment。第一次进入这个Activity。开启了服务，并在onResume方法中绑定服务后，对Service进行了一定的初始化，其中调用了Fragment中的一个属性。 super.onResume(); bindService(intent, conn, BIND_AUTO_CREATE);
base64Encode对图片进行编码 843977358 base64 图片 encoder
/** * 对图片进行base64encoder编码 * * @author mrZhang * @param path * @return */ public static String encodeImage(String path) { BASE64Encoder encoder = null; byte[] b = null; I
Request Header简介 aigo servlet
当一个客户端(通常是浏览器)向Web服务器发送一个请求是，它要发送一个请求的命令行，一般是GET或POST命令，当发送POST命令时，它还必须向服务器发送一个叫“Content-Length”的请求头(Request Header) 用以指明请求数据的长度，除了Content-Length之外，它还可以向服务器发送其它一些Headers，如：
HttpClient4.3 创建SSL协议的HttpClient对象 alleni123 httpclient 爬虫 ssl
public class HttpClientUtils { public static CloseableHttpClient createSSLClientDefault(CookieStore cookies){ SSLContext sslContext=null; try { sslContext=new SSLContextBuilder().l
java取反 -右移-左移-无符号右移的探讨百合不是茶位运算符位移
取反：在二进制中第一位，1表示符数，0表示正数 byte a = -1; 原码：10000001 反码：11111110 补码：11111111 //异或: 00000000 byte b = -2; 原码：10000010 反码：11111101 补码：11111110 //异或: 00000001
java多线程join的作用与用法 bijian1013 java 多线程
对于JAVA的join，JDK 是这样说的：join public final void join （long millis ）throws InterruptedException Waits at most millis milliseconds for this thread to die. A timeout of 0 means t
Java发送http请求(get 与post方法请求) bijian1013 java spring
PostRequest.java package com.bijian.study; import java.io.BufferedReader; import java.io.DataOutputStream; import java.io.IOException; import java.io.InputStreamReader; import java.net.HttpURL
【Struts2二】struts.xml中package下的action配置项默认值 bit1129 struts.xml
在第一部份，定义了struts.xml文件，如下所示： <!DOCTYPE struts PUBLIC "-//Apache Software Foundation//DTD Struts Configuration 2.3//EN" "http://struts.apache.org/dtds/struts
【Kafka十三】Kafka Simple Consumer bit1129 simple
代码中关于Host和Port是割裂开的，这会导致单机环境下的伪分布式Kafka集群环境下，这个例子没法运行。实际情况是需要将host和port绑定到一起， package kafka.examples.lowlevel; import kafka.api.FetchRequest; import kafka.api.FetchRequestBuilder; impo
nodejs学习api ronin47 nodejs api
NodeJS基础什么是NodeJS JS是脚本语言，脚本语言都需要一个解析器才能运行。对于写在HTML页面里的JS，浏览器充当了解析器的角色。而对于需要独立运行的JS，NodeJS就是一个解析器。每一种解析器都是一个运行环境，不但允许JS定义各种数据结构，进行各种计算，还允许JS使用运行环境提供的内置对象和方法做一些事情。例如运行在浏览器中的JS的用途是操作DOM，浏览器就提供了docum
java-64.寻找第N个丑数 bylijinnan java
public class UglyNumber { /** * 64.查找第N个丑数具体思路可参考 [url] http://zhedahht.blog.163.com/blog/static/2541117420094245366965/[/url] * 题目：我们把只包含因子 2、3和5的数称作丑数（Ugly Number）。例如6、8都是丑数，但14
二维数组（矩阵）对角线输出 bylijinnan 二维数组
/** 二维数组对角线输出两个方向例如对于数组： { 1, 2, 3, 4 }, { 5, 6, 7, 8 }, { 9, 10, 11, 12 }, { 13, 14, 15, 16 }, slash方向输出： 1 5 2 9 6 3 13 10 7 4 14 11 8 15 12 16 backslash输出： 4 3
[JWFD开源工作流设计]工作流跳跃模式开发关键点(今日更新) comsci 工作流
既然是做开源软件的,我们的宗旨就是给大家分享设计和代码,那么现在我就用很简单扼要的语言来透露这个跳跃模式的设计原理大家如果用过JWFD的ARC-自动运行控制器,或者看过代码,应该知道在ARC算法模块中有一个函数叫做SAN(),这个函数就是ARC的核心控制器,要实现跳跃模式,在SAN函数中一定要对LN链表数据结构进行操作,首先写一段代码,把
redis常见使用 cuityang redis 常见使用
redis 通常被认为是一个数据结构服务器，主要是因为其有着丰富的数据结构 strings、map、 list、sets、 sorted sets 引入jar包 jedis-2.1.0.jar (本文下方提供下载) package redistest; import redis.clients.jedis.Jedis; public class Listtest
配置多个redis dalan_123 redis
配置多个redis客户端 <?xml version="1.0" encoding="UTF-8"?><beans xmlns="http://www.springframework.org/schema/beans" xmlns:xsi=&quo
attrib命令 dcj3sjt126com attr
attrib指令用于修改文件的属性.文件的常见属性有:只读.存档.隐藏和系统. 只读属性是指文件只可以做读的操作.不能对文件进行写的操作.就是文件的写保护. 存档属性是用来标记文件改动的.即在上一次备份后文件有所改动.一些备份软件在备份的时候会只去备份带有存档属性的文件.
Yii使用公共函数 dcj3sjt126com yii
在网站项目中，没必要把公用的函数写成一个工具类，有时候面向过程其实更方便。在入口文件index.php里添加 require_once('protected/function.php'); 即可对其引用，成为公用的函数集合。 function.php如下： <?php /** * This is the shortcut to D
linux 系统资源的查看（free、uname、uptime、netstat） eksliang netstat linux uname linux uptime linux free
linux 系统资源的查看转载请出自出处：http://eksliang.iteye.com/blog/2167081 http://eksliang.iteye.com 一、free查看内存的使用情况语法如下： free [-b][-k][-m][-g] [-t] 参数含义 -b:直接输入free时，显示的单位是kb我们可以使用b(bytes),m
JAVA的位操作符 greemranqq 位运算 JAVA位移 <<>>>
最近几种进制，加上各种位操作符，发现都比较模糊，不能完全掌握，这里就再熟悉熟悉。 1.按位操作符：按位操作符是用来操作基本数据类型中的单个bit,即二进制位，会对两个参数执行布尔代数运算，获得结果。与（&）运算： 1&1 = 1, 1&0 = 0, 0&0 &
Web前段学习网站 ihuning Web
Web前段学习网站菜鸟学习：http://www.w3cschool.cc/ JQuery中文网：http://www.jquerycn.cn/ 内存溢出：http://outofmemory.cn/#csdn.blog http://www.icoolxue.com/ http://www.jikexue
强强联合：FluxBB 作者加盟 Flarum justjavac r
原文：FluxBB Joins Forces With Flarum作者：Toby Zerner译文：强强联合：FluxBB 作者加盟 Flarum译者：justjavac FluxBB 是一个快速、轻量级论坛软件，它的开发者是一名德国的 PHP 天才 Franz Liedke。FluxBB 的下一个版本(2.0)将被完全重写，并已经开发了一段时间。FluxBB 看起来非常有前途的，
java统计在线人数（session存储信息的） macroli java Web
这篇日志是我写的第三次了前两次都发布失败！郁闷极了！由于在web开发中常常用到这一部分所以在此记录一下，呵呵，就到备忘录了！我对于登录信息时使用session存储的，所以我这里是通过实现HttpSessionAttributeListener这个接口完成的。 1、实现接口类，在web.xml文件中配置监听类，从而可以使该类完成其工作。 public class Ses
bootstrp carousel初体验快速构建图片播放 qiaolevip 每天进步一点点学习永无止境 bootstrap 纵观千象
img{ border: 1px solid white; box-shadow: 2px 2px 12px #333; _width: expression(this.width > 600 ? "600px" : this.width + "px"); _height: expression(this.width &
SparkSQL读取HBase数据，通过自定义外部数据源 superlxw1234 spark sparksql sparksql读取hbase sparksql外部数据源
关键字：SparkSQL读取HBase、SparkSQL自定义外部数据源前面文章介绍了SparSQL通过Hive操作HBase表。 SparkSQL从1.2开始支持自定义外部数据源(External DataSource)，这样就可以通过API接口来实现自己的外部数据源。这里基于Spark1.4.0，简单介绍SparkSQL自定义外部数据源，访
Spring Boot 1.3.0.M1发布 wiselyman spring boot
Spring Boot 1.3.0.M1于6.12日发布，现在可以从Spring milestone repository下载。这个版本是基于Spring Framework 4.2.0.RC1,并在Spring Boot 1.2之上提供了大量的新特性improvements and new features。主要包含以下： 1.提供一个新的sprin

scikit-learn feature extraction: study notes

4.2.1. Loading features from dicts

4.2.2. Feature hashing

4.2.2.1. Implementation details

4.2.3. Text feature extraction

4.2.3.1. The Bag of Words representation

4.2.3.2. Sparsity

4.2.3.3. Common Vectorizer usage

4.2.3.4. Tf–idf term weighting

4.2.3.5. Decoding text files

4.2.3.6. Applications and examples

4.2.3.7. Limitations of the Bag of Words representation

4.2.3.8. Vectorizing a large text corpus with the hashing trick

4.2.3.9. Performing out-of-core scaling with HashingVectorizer

4.2.3.10. Customizing the vectorizer classes

4.2.4. Image feature extraction

4.2.4.1. Patch extraction

4.2.4.2. Connectivity graph of an image
