scikit-learn: 0.4 Using a "Pipeline" to chain vectorizer => transformer => classifier, and tuning parameters with grid search


http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html


<strong>1. Using a "Pipeline" to chain vectorizer => transformer => classifier</strong>
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])
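The snippets that follow reference rawData, docs_new, and y_new_target, which are never defined in this post. A minimal sketch of the assumed setup, guessing that the corpus is one folder per class loaded with load_files (the paths and variable names here are hypothetical):

from sklearn.datasets import load_files

# Hypothetical training corpus: one sub-folder per class,
# e.g. data/train/positive folder/, data/train/negative folder/
rawData = load_files('data/train')

# Hypothetical held-out documents (raw text) and their true labels
newData = load_files('data/test')
docs_new, y_new_target = newData.data, newData.target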

text_clf = text_clf.fit(rawData.data, rawData.target)
predicted = text_clf.predict(docs_new)
<strong># Note: pass the raw, unprocessed documents here, not X_new_tfidf; otherwise you get the error shown below.</strong>

import numpy as np
np.mean(predicted == y_new_target)
Out[51]: 0.5

predicted = text_clf.predict(X_new_tfidf)
Traceback (most recent call last):

  File "<ipython-input-52-20002e79f960>", line 1, in <module>
    predicted = text_clf.predict(X_new_tfidf)

  File "D:\Anaconda\lib\site-packages\sklearn\pipeline.py", line 149, in predict
    Xt = transform.transform(Xt)

  File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 867, in transform
    _, X = self._count_vocab(raw_documents, fixed_vocab=True)

  File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 748, in _count_vocab
    for feature in analyze(doc):

  File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 234, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)

  File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 200, in <lambda>
    return lambda x: strip_accents(x.lower())

  File "D:\Anaconda\lib\site-packages\scipy\sparse\base.py", line 499, in __getattr__
    raise AttributeError(attr + " not found")

AttributeError: lower not found
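The traceback shows why: the pipeline's first step, CountVectorizer.transform, tries to lowercase each input document, but X_new_tfidf is already a sparse tf-idf matrix and has no lower attribute. A fitted pipeline runs the whole chain itself, so calling text_clf.predict(docs_new) on raw text is equivalent to applying the fitted steps by hand, sketched here via named_steps:

# Pull the fitted steps back out of the pipeline
vect = text_clf.named_steps['vect']
tfidf = text_clf.named_steps['tfidf']
clf = text_clf.named_steps['clf']

# What text_clf.predict(docs_new) does internally:
X_new_counts = vect.transform(docs_new)      # raw text -> term counts
X_new_tfidf = tfidf.transform(X_new_counts)  # counts -> tf-idf weights
predicted = clf.predict(X_new_tfidf)         # only the final step sees the tf-idf matrix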


<strong>2. Tuning parameters with grid search</strong>

from sklearn.grid_search import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
}
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
# n_jobs=-1 tells the grid search to detect how many cores the machine has and use all of them to run the search in parallel.
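The keys in parameters follow the pipeline's naming convention of step name, double underscore, parameter name, which is how the grid search knows which step each setting belongs to. The same names work when configuring the pipeline directly, for example:

# Equivalent to passing these arguments to the individual steps:
text_clf.set_params(vect__ngram_range=(1, 2),  # CountVectorizer(ngram_range=(1, 2))
                    tfidf__use_idf=False,      # TfidfTransformer(use_idf=False)
                    clf__alpha=1e-2)           # MultinomialNB(alpha=1e-2)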

gs_clf = gs_clf.fit(rawData.data, rawData.target)
# predict returns an array, so take element [0] before indexing target_names
rawData.target_names[gs_clf.predict(['i love this book'])[0]]
'positive folder'

Print the best-scoring parameters:
>>> best_parameters, score, _ = max(gs_clf.grid_scores_, key=lambda x: x[1])
>>> for param_name in sorted(parameters.keys()):
...     print("%s: %r" % (param_name, best_parameters[param_name]))
...
clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)

>>> score                                              
1.000
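Note that this grid_scores_ code only works on older scikit-learn: as of 0.18, GridSearchCV lives in sklearn.model_selection, and grid_scores_ was removed in 0.20 in favor of best_params_, best_score_, and cv_results_. The equivalent on a current version:

from sklearn.model_selection import GridSearchCV

gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(rawData.data, rawData.target)

# Best parameter combination and its mean cross-validated score
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))
print(gs_clf.best_score_)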





