In a previous article, "Using cross-validation to tune the optimal parameter, with the K-nearest-neighbors algorithm as an example", we used a validation curve (validation_curve) to find the optimal parameter K for K-nearest neighbors. A validation curve has one drawback, however: it can only tune one parameter at a time. If several parameters need to be tuned simultaneously, another method is required. For this multi-parameter tuning scenario, this article introduces a new solution: grid search (GridSearchCV).
The logic of grid search is simple and brute-force: enumerate every combination of the candidate parameter values, compute a performance metric for each combination, and return the combination with the best metric to the user.
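This exhaustive enumeration can be sketched in a few lines of plain Python. The function and parameter names below are illustrative only; the scoring function is a toy stand-in for cross-validated model evaluation:

```python
from itertools import product

def exhaustive_grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the best one.

    param_grid: dict mapping parameter name -> iterable of candidate values.
    score_fn: callable taking a dict of parameters and returning a score
              (higher is better). Both names are placeholders.
    """
    names = list(param_grid)
    best_params, best_score = None, float('-inf')
    # product(...) yields every combination of candidate values
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy example: the "score" peaks at n_neighbors=3, weights='distance'
grid = {'n_neighbors': [1, 3, 5], 'weights': ['uniform', 'distance']}
toy_score = lambda p: -abs(p['n_neighbors'] - 3) + (1 if p['weights'] == 'distance' else 0)
print(exhaustive_grid_search(grid, toy_score))
# -> ({'n_neighbors': 3, 'weights': 'distance'}, 1)
```

In real use, score_fn would run cross-validation for the given parameters, which is exactly what GridSearchCV automates below.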
Suppose we want to determine, for K-nearest neighbors, the best value of K (1 ≤ K ≤ 30) together with the best weight function (uniform or distance; see sklearn.neighbors.KNeighborsClassifier for their exact meaning). A direct approach is: set the weight function to uniform and to distance in turn, use a validation curve to find the best K for each, then compare the two results to obtain the best parameter combination overall.
The following code implements this approach.
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

def grid_search_by_self(X, y, cv):
    # Start with weights='uniform' and find its best K via validation_curve
    best_weights = 'uniform'
    model_1 = KNeighborsClassifier(weights='uniform')
    train_scores, test_scores = validation_curve(model_1, X, y, cv=cv,
        param_name='n_neighbors', param_range=range(1, 31), scoring='f1')
    test_scores_mean = np.mean(test_scores, axis=1)
    best_k = np.argmax(test_scores_mean) + 1
    best_score = np.max(test_scores_mean)
    # Repeat for weights='distance'
    model_2 = KNeighborsClassifier(weights='distance')
    train_scores, test_scores = validation_curve(model_2, X, y, cv=cv,
        param_name='n_neighbors', param_range=range(1, 31), scoring='f1')
    test_scores_mean = np.mean(test_scores, axis=1)
    best_score_2 = np.max(test_scores_mean)
    # Compare the best scores for distance and uniform, and keep the better parameters
    if best_score_2 > best_score:
        best_k = np.argmax(test_scores_mean) + 1
        best_score = best_score_2
        best_weights = 'distance'
    return best_k, best_weights, best_score
With the sklearn toolkit, the same grid search can be done directly with GridSearchCV.
from sklearn.model_selection import GridSearchCV

def grid_search_by_sklearn(X, y, model, param_grid, cv):
    grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=cv, scoring='f1')  # use f1 as the performance metric
    grid.fit(X, y)
    print('grid search - best score:', grid.best_score_)    # best metric value
    print('grid search - best params:', grid.best_params_)  # parameters achieving that value
    return grid.best_params_, grid.best_score_
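Beyond best_score_ and best_params_, a fitted GridSearchCV also exposes the full cross-validation table (cv_results_) and a model refit on all the data (best_estimator_). A minimal self-contained sketch on the iris toy dataset, with arbitrary candidate values:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)  # default scoring: accuracy
grid.fit(X, y)

print(grid.best_params_)   # the winning combination
print(grid.best_score_)    # its mean cross-validation accuracy
print(grid.cv_results_['mean_test_score'])  # one mean score per combination (6 here)
# best_estimator_ is already refit on the full dataset and ready to predict
print(grid.best_estimator_.predict(X[:3]))
```

cv_results_ is handy for diagnosing how sensitive the model is to each parameter, not just which combination won.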
This section uses an example to compare the results of the hand-written code and the sklearn version, on the breast cancer dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier

if __name__ == '__main__':
    X, y = load_breast_cancer(return_X_y=True)
    k_range = range(1, 31)  # candidate range for the parameter K
    weight_options = ['uniform', 'distance']  # candidate weight functions: uniform weights all neighbors equally, distance weights by inverse distance
    param_grid = {'n_neighbors': k_range, 'weights': weight_options}  # parameter dictionary; its keys must be parameter names of the classifier
    knn = KNeighborsClassifier(n_neighbors=5)
    cv = 5
    best_params_by_sklearn, scores_by_sklearn = grid_search_by_sklearn(X, y, knn, param_grid, cv)
    print('best_params_by_sklearn: {}, best_weights_by_sklearn: {}, scores_by_sklearn: {}'.
          format(best_params_by_sklearn['n_neighbors'], best_params_by_sklearn['weights'], scores_by_sklearn))
    best_k_by_self, best_weights_by_self, scores_by_self = grid_search_by_self(X, y, cv)
    print('best_k_by_self: {}, best_weights_by_self: {}, scores_by_self: {}'.
          format(best_k_by_self, best_weights_by_self, scores_by_self))
The program's output is shown below. Clearly, the two implementations give identical results.
best_params_by_sklearn: 13, best_weights_by_sklearn: uniform, scores_by_sklearn: 0.9483476865509534
best_k_by_self: 13, best_weights_by_self: uniform, scores_by_self: 0.9483476865509534