http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
<strong>1. Using a Pipeline to chain vectorizer => transformer => classifier</strong>

>>> from sklearn.pipeline import Pipeline
>>> text_clf = Pipeline([('vect', CountVectorizer()),
...                      ('tfidf', TfidfTransformer()),
...                      ('clf', MultinomialNB()),
... ])
>>> text_clf = text_clf.fit(rawData.data, rawData.target)
>>> predicted = text_clf.predict(docs_new)

<strong>Note: predict() must be given the raw, unprocessed documents here, not X_new_tfidf; otherwise the error below is raised.</strong>

>>> np.mean(predicted == y_new_target)
0.5
>>> predicted = text_clf.predict(X_new_tfidf)
Traceback (most recent call last):
  File "<ipython-input-52-20002e79f960>", line 1, in <module>
    predicted = text_clf.predict(X_new_tfidf)
  File "D:\Anaconda\lib\site-packages\sklearn\pipeline.py", line 149, in predict
    Xt = transform.transform(Xt)
  File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 867, in transform
    _, X = self._count_vocab(raw_documents, fixed_vocab=True)
  File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 748, in _count_vocab
    for feature in analyze(doc):
  File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 234, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 200, in <lambda>
    return lambda x: strip_accents(x.lower())
  File "D:\Anaconda\lib\site-packages\scipy\sparse\base.py", line 499, in __getattr__
    raise AttributeError(attr + " not found")
AttributeError: lower not found

The pipeline already applies CountVectorizer and TfidfTransformer internally, so passing an already-transformed sparse matrix makes CountVectorizer try to call .lower() on it as if it were a string, hence the AttributeError.

<strong>2. Tuning hyperparameters with grid search</strong>

>>> from sklearn.grid_search import GridSearchCV
>>> parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
...               'tfidf__use_idf': (True, False),
...               'clf__alpha': (1e-2, 1e-3),
... }
>>> gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)

Here n_jobs=-1 tells grid search to detect how many cores the machine has and use all of them to run in parallel.

>>> gs_clf = gs_clf.fit(rawData.data, rawData.target)
>>> rawData.target_names[gs_clf.predict(['i love this book'])]
'positive folder'
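The transcript above relies on rawData, docs_new, and other variables defined earlier in the author's session. A minimal self-contained sketch of the same pipeline, using a hypothetical toy corpus in place of rawData (the documents and labels below are invented for illustration):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in for rawData.data / rawData.target.
train_docs = ["i love this book", "great read, loved it",
              "terrible book", "i hate this story"]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# One Pipeline object chains vectorizer => transformer => classifier.
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf = text_clf.fit(train_docs, train_labels)

# predict() takes raw strings: the pipeline re-applies the
# vectorizer and tf-idf transform internally before classifying.
predicted = text_clf.predict(["what a great book", "i hate it"])
print(predicted)
```

Because the pipeline owns the whole transformation chain, the same fitted vocabulary and idf weights are automatically reused at predict time, which is exactly what passing a pre-built X_new_tfidf would bypass.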
Print the best-performing parameters:

>>> best_parameters, score, _ = max(gs_clf.grid_scores_, key=lambda x: x[1])
>>> for param_name in sorted(parameters.keys()):
...     print("%s: %r" % (param_name, best_parameters[param_name]))
...
clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)
>>> score
1.000
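Note that sklearn.grid_search was deprecated in scikit-learn 0.18 and removed in 0.20; GridSearchCV now lives in sklearn.model_selection, and the grid_scores_ attribute was replaced by cv_results_ plus the best_params_ / best_score_ shortcuts. A sketch of the equivalent modern code, again on a hypothetical toy corpus standing in for rawData:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in for rawData.data / rawData.target.
train_docs = ["i love this book", "great read, loved it",
              "best story ever", "wonderful characters",
              "terrible book", "i hate this story",
              "worst read ever", "awful characters"]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3)}

# cv=2 keeps stratified folds valid on this tiny corpus;
# on real data the default cross-validation is fine.
gs_clf = GridSearchCV(text_clf, parameters, cv=2)
gs_clf.fit(train_docs, train_labels)

# best_params_ / best_score_ replace the old scan over grid_scores_.
for name in sorted(parameters):
    print("%s: %r" % (name, gs_clf.best_params_[name]))
print("best CV score: %.3f" % gs_clf.best_score_)
```

The double-underscore names (vect__ngram_range, clf__alpha) still address nested pipeline steps the same way as in the old API; only the import path and result attributes changed.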