I like to run the following workflow:
Pick a model for text vectorization
Define a list of parameters
Apply a pipeline with GridSearchCV over those parameters, using LogisticRegression() as a baseline, to find the best model parameters
Save the best model (parameters)
Load the best model parameters so that a range of other classifiers can be applied on top of the model defined here.
Here is code you can reproduce:
GridSearch:
%%time
import numpy as np
import pandas as pd
import joblib  # 'sklearn.externals.joblib' is deprecated in newer scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess
np.random.seed(0)
data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
X_train, X_test, y_train, y_test = train_test_split([simple_preprocess(doc) for doc in data.text],
data.label, random_state=0)
# Find best Tfidf model using LR
pipeline = Pipeline([
('tfidf', TfidfVectorizer(preprocessor=' '.join, tokenizer=None)),
('clf', LogisticRegression())
])
parameters = {
'tfidf__max_df': [0.25, 0.5, 0.75, 1.0],
'tfidf__smooth_idf': (True, False),
'tfidf__norm': ('l1', 'l2', None),
}
grid = GridSearchCV(pipeline, parameters, cv=2, verbose=1)
grid.fit(X_train, y_train)
print(grid.best_params_)
# Save model
#joblib.dump(grid.best_estimator_, 'best_tfidf.pkl', compress = 1) # this unfortunately includes the LogReg
joblib.dump(grid.best_params_, 'best_tfidf.pkl', compress = 1) # Only best parameters
Fitting 2 folds for each of 24 candidates, totalling 48 fits
{'tfidf__smooth_idf': True, 'tfidf__norm': 'l2', 'tfidf__max_df': 0.25}
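An alternative worth noting at this point: instead of saving `best_params_`, you can dump only the fitted `'tfidf'` step of the best pipeline, which gives you the vectorizer itself without the LogisticRegression. A minimal sketch, using a toy corpus and a temp file in place of the CSV and filename above (both are assumptions for the sketch):

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus and labels standing in for the CSV data (assumption)
docs = ['good movie', 'bad movie', 'great film', 'awful film'] * 3
labels = [1, 0, 1, 0] * 3

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression()),
])
grid = GridSearchCV(pipe, {'tfidf__norm': ['l1', 'l2']}, cv=2)
grid.fit(docs, labels)

# Dump only the fitted vectorizer step, not the whole pipeline
path = os.path.join(tempfile.gettempdir(), 'best_tfidf_vec.pkl')
joblib.dump(grid.best_estimator_.named_steps['tfidf'], path)

vec = joblib.load(path)  # a ready-to-use, fitted TfidfVectorizer
```

The loaded object is already fitted, so it can be dropped straight into a new pipeline or used to transform text directly.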
Load the model with the best parameters:
from sklearn.model_selection import GridSearchCV
# Load best parameters
tfidf_params = joblib.load('best_tfidf.pkl')
pipeline = Pipeline([
('vec', TfidfVectorizer(preprocessor=' '.join, tokenizer=None).set_params(**tfidf_params)), # here is the issue?
('clf', LogisticRegression())
])
cval = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=5)
print("Cross-Validation Score: %s" % (np.mean(cval)))
ValueError: Invalid parameter tfidf for estimator TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2',
        preprocessor=<built-in method join of str object at 0x...>,
        smooth_idf=True, stop_words=None, strip_accents=None,
        sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, use_idf=True, vocabulary=None). Check the list of available parameters with `estimator.get_params().keys()`.
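The error's own hint points at the cause: the saved dict's keys carry the pipeline step prefix (e.g. 'tfidf__norm'), while a bare TfidfVectorizer only accepts unprefixed parameter names. A quick check (a sketch, not from the original post):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

keys = TfidfVectorizer().get_params().keys()
print('norm' in keys)         # the bare vectorizer knows 'norm'
print('tfidf__norm' in keys)  # prefixed keys only exist on the Pipeline
```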
Question:
How can I load the best parameters into the Tfidf model?
Answer:
This line:
joblib.dump(grid.best_params_, 'best_tfidf.pkl', compress = 1) # Only best parameters
saves the parameters of the pipeline, not of the TfidfVectorizer, so every key carries the step prefix ('tfidf__...'). Do this instead:
pipeline = Pipeline([
# Keep the step name identical to the one used in the grid search
('tfidf', TfidfVectorizer(preprocessor=' '.join, tokenizer=None)),
('clf', LogisticRegression())
])
pipeline.set_params(**tfidf_params)
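This works because `Pipeline.set_params` resolves each `step__param` key against the step named before the `__`, so the step name must match the one used during the grid search. A runnable sketch, with the parameter dict hard-coded in place of the joblib load (an assumption), also showing the alternative of stripping the prefix to configure a standalone vectorizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hard-coded stand-in for joblib.load('best_tfidf.pkl') (assumption)
tfidf_params = {'tfidf__smooth_idf': True, 'tfidf__norm': 'l2', 'tfidf__max_df': 0.25}

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),  # step name matches the 'tfidf__' prefix
    ('clf', LogisticRegression()),
])
pipeline.set_params(**tfidf_params)
print(pipeline.named_steps['tfidf'].max_df)  # 0.25

# Alternative: strip the step prefix to configure a bare vectorizer
vec_params = {k.split('__', 1)[1]: v for k, v in tfidf_params.items()}
vec = TfidfVectorizer(**vec_params)
print(vec.norm)  # l2
```

Either form leaves the TfidfVectorizer configured with the best parameters found by the grid search, ready to be combined with other classifiers.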