Choosing Model Hyperparameters: Grid Search and Cross-Validation

Goal: choose the model's hyperparameters.
Idea:
  • 1. Based on experience, propose several candidate hyperparameter combinations.
  • 2. For each combination, carve a validation set out of the training set, train on the remainder, and score the model on the validation set (see the sketch after this list).
  • 3. Once every combination has a score, pick the combination with the highest score.
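
A minimal sketch of step 2, assuming a KNeighborsClassifier on the digits data; the split ratio and the single (n_neighbors, p) combination are arbitrary choices for illustration.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X, y = digits.data, digits.target

# Carve a validation set out of the training data (step 2 above)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=666)

# Score one hyperparameter combination on the held-out validation set
knn = KNeighborsClassifier(n_neighbors=3, p=2)
knn.fit(X_train, y_train)
print(knn.score(X_val, y_val))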
Refining step 1: grid search

Purpose of grid search: enumerate every hyperparameter combination to try.
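
As a minimal sketch of what this enumeration looks like, sklearn's ParameterGrid performs the same expansion that GridSearchCV uses internally:

from sklearn.model_selection import ParameterGrid

grid = {'n_neighbors': [2, 3, 4], 'p': [1, 2]}
for params in ParameterGrid(grid):
    print(params)  # 3 * 2 = 6 combinations, e.g. {'n_neighbors': 2, 'p': 1}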

Refining step 2: cross-validation

Purpose of cross-validation: reduce the randomness introduced by any single choice of validation set.
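
A minimal sketch of 5-fold cross-validation, assuming the same KNN/digits setup as above; this hand-rolled loop is essentially what cross_val_score does for you:

import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X, y = digits.data, digits.target

scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X):
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X[train_idx], y[train_idx])
    scores.append(knn.score(X[val_idx], y[val_idx]))  # one score per fold

print(np.mean(scores))  # averaging over folds smooths out split randomness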

sklearn implementation: cross-validation only
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score  # cross-validation helper

# 1. Prepare the data
digits = datasets.load_digits()
X = digits.data[:100, :]
y = digits.target[:100]

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=666)

# 3. Search for hyperparameters with cross-validation
best_k, best_p, best_score = 0, 0, 0
for k in range(2, 11):  # candidate range given by guess/experience
    for p in range(1, 6):  # candidate range given by guess/experience
        knn_clf = KNeighborsClassifier(weights="distance", n_neighbors=k, p=p)
        # since scikit-learn 0.22, cv defaults to 5; scoring defaults to the estimator's own score (accuracy for classifiers)
        scores = cross_val_score(knn_clf, X_train, y_train, cv=5, scoring='accuracy')
        score = np.mean(scores)
        if score > best_score:
            best_k, best_p, best_score = k, p, score

# 4. Train the final model with the best hyperparameter combination
best_knn_clf = KNeighborsClassifier(weights="distance", n_neighbors=best_k, p=best_p)
best_knn_clf.fit(X_train, y_train)
print(best_knn_clf.score(X_test, y_test))
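
Note that the test set never participates in the search; it is used exactly once at the end, so the final score is an honest estimate of how the chosen model generalizes.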

sklearn implementation: cross-validation + grid search
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV  # GridSearchCV = grid search with built-in cross-validation
from sklearn.neighbors import KNeighborsClassifier

# 1. Prepare the data
digits = datasets.load_digits()
X = digits.data[:500, :]
y = digits.target[:500]

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=666)

# 3. Search for hyperparameters: grid search + cross-validation
knn_clf = KNeighborsClassifier()
param_grid = [
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(2, 11)],
        'p': [i for i in range(1, 6)]
    }
]
grid_search = GridSearchCV(knn_clf, param_grid, verbose=1, cv=5)
grid_search.fit(X_train, y_train)  # fits (11-2) * (6-1) * cv = 9 * 5 * 5 = 225 models
print(grid_search.best_score_)   # best cross-validation score
print(grid_search.best_params_)  # best hyperparameter combination

# 4. Retrieve the trained model
best_knn_clf = grid_search.best_estimator_  # best estimator, already refit on the full training set
print(best_knn_clf.score(X_test, y_test))
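
To inspect how every combination fared rather than only the winner, GridSearchCV exposes a cv_results_ dict; a minimal sketch (assuming pandas is available):

import pandas as pd

results = pd.DataFrame(grid_search.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score')
      .head())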

The author is a beginner and this post is for reference only; if anything is wrong, corrections are gratefully welcome.
