Saving models and parameters in Python: GridSearch for the best model, saving and loading parameters - python

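I would like to run the following workflow:

Select a model for text vectorization

Define a list of parameters

Apply a pipeline with GridSearchCV over the parameters, using LogisticRegression() as a baseline, to find the best model parameters

Save the best model (parameters)

Load the best model parameters so that a range of other classifiers can be applied on top of this defined model.

Here is code you can reproduce: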

GridSearch:

%%time
import numpy as np
import pandas as pd
import joblib  # sklearn.externals.joblib has been removed from recent scikit-learn releases
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess

np.random.seed(0)

data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
# Each document becomes a list of tokens; the vectorizer joins them back below
X_train, X_test, y_train, y_test = train_test_split(
    [simple_preprocess(doc) for doc in data.text],
    data.label, random_state=0)

# Find best Tfidf model using LR
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(preprocessor=' '.join, tokenizer=None)),
    ('clf', LogisticRegression())
])

parameters = {
    'tfidf__max_df': [0.25, 0.5, 0.75, 1.0],
    'tfidf__smooth_idf': (True, False),
    'tfidf__norm': ('l1', 'l2', None),
}

grid = GridSearchCV(pipeline, parameters, cv=2, verbose=1)
grid.fit(X_train, y_train)
print(grid.best_params_)

# Save model
#joblib.dump(grid.best_estimator_, 'best_tfidf.pkl', compress = 1) # this unfortunately includes the LogReg
joblib.dump(grid.best_params_, 'best_tfidf.pkl', compress = 1) # Only best parameters

Fitting 2 folds for each of 24 candidates, totalling 48 fits

{'tfidf__smooth_idf': True, 'tfidf__norm': 'l2', 'tfidf__max_df': 0.25}
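Since the best parameters printed above form a plain dictionary of booleans, strings and floats, they could also be stored as JSON rather than as a pickle. The following is only a sketch of that alternative, not the approach used in the question: the file name best_tfidf.json and the helpers save_params and load_params are hypothetical, and the dictionary is a hand-copied stand-in for grid.best_params_. It assumes every value is JSON-serializable, which would not hold for NumPy scalar types.

```python
import json
import os
import tempfile

# Hypothetical stand-in for grid.best_params_, copied from the output above.
best_params = {
    "tfidf__smooth_idf": True,
    "tfidf__norm": "l2",
    "tfidf__max_df": 0.25,
}


def save_params(params, path):
    """Write a parameter dictionary to a human-readable JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(params, fh, indent=2, sort_keys=True)


def load_params(path):
    """Read a parameter dictionary back from a JSON file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Hypothetical file location; a temporary directory keeps the sketch side-effect free.
params_path = os.path.join(tempfile.gettempdir(), "best_tfidf.json")
save_params(best_params, params_path)
restored = load_params(params_path)
print(restored)
```

A JSON file can be inspected and edited by hand, and a None value (such as norm=None) survives the round trip as null.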

Load the model with the best parameters:

from sklearn.model_selection import GridSearchCV

# Load best parameters
tfidf_params = joblib.load('best_tfidf.pkl')

pipeline = Pipeline([
    ('vec', TfidfVectorizer(preprocessor=' '.join, tokenizer=None).set_params(**tfidf_params)), # here is the issue?
    ('clf', LogisticRegression())
])

cval = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=5)
print("Cross-Validation Score: %s" % (np.mean(cval)))

ValueError: Invalid parameter tfidf for estimator TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2',
        preprocessor=,
        smooth_idf=True, stop_words=None, strip_accents=None,
        sublinear_tf=False, token_pattern='(?u)\b\w\w+\b',
        tokenizer=None, use_idf=True, vocabulary=None). Check the list of available parameters with estimator.get_params().keys().

Question:

How can I load the best parameters for the Tfidf model?

Answer

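This line:

joblib.dump(grid.best_params_, 'best_tfidf.pkl', compress = 1) # Only best parameters

saves the parameters of the pipeline, not those of the TfidfVectorizer. The saved keys carry the tfidf__ step prefix, which only a pipeline containing a step named 'tfidf' can resolve; the vectorizer itself has no parameter called tfidf, hence the ValueError. Set the parameters on the pipeline instead:

pipeline = Pipeline([
    # Change the name to be same as before
    ('tfidf', TfidfVectorizer(preprocessor=' '.join, tokenizer=None)),
    ('clf', LogisticRegression())
])

pipeline.set_params(**tfidf_params)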
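If the goal is to reuse the vectorizer settings outside the original pipeline, for example with a differently named step, one option is to strip the step prefix from the saved keys first. The sketch below is an illustration under assumptions rather than part of the original answer: the helper name params_for_step is hypothetical, the dictionary is a hand-copied stand-in for the loaded tfidf_params, and the clf__C entry is invented purely to show that parameters of other steps are filtered out.

```python
def params_for_step(pipeline_params, step):
    """Return the parameters of one pipeline step with the 'step__' prefix removed."""
    prefix = step + "__"
    return {
        key[len(prefix):]: value
        for key, value in pipeline_params.items()
        if key.startswith(prefix)
    }


# Hypothetical stand-in for the dictionary loaded from best_tfidf.pkl.
# 'clf__C' is not in the original output; it only demonstrates the filtering.
best_params = {
    "tfidf__smooth_idf": True,
    "tfidf__norm": "l2",
    "tfidf__max_df": 0.25,
    "clf__C": 1.0,
}

vec_params = params_for_step(best_params, "tfidf")
print(vec_params)  # {'smooth_idf': True, 'norm': 'l2', 'max_df': 0.25}
```

The resulting dictionary uses the vectorizer's own parameter names, so it could be passed as TfidfVectorizer(preprocessor=' '.join, **vec_params) or to set_params on the vectorizer directly, whatever the step is called in the new pipeline.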
