在学习网格搜索,交叉验证之后,对模型优化的效果进行图形化的评价的两种工具:
验证曲线: 由于训练集的评分已经被用于参数调优,因此该评分用于评估效果已经不再客观,需要使用验证集的评分用于评估。
sklearn.model_ selection.validation_ curve(
estimator, X, y
param_ name : string, 在模型搜索中需要改变的参数
param_ range : array-like, shape (n_ values,), 相应参数的具体值列表
groups = None : array-like, with shape (n_ samples, )
拆分为训练/测试集时用到的样本分组标签
cv = None : int, 交互验证时的分组方法,为None时按照三组拆分
scoring = None, n_ jobs = 1,pre_ dispatch = ‘all’, verbose = 0
)返回:
train_ scores : array, shape (n_ ticks, n_ cv folds), 测试集的模型评分
test_ scores : array, shape (n_ ticks, n_ cv folds), 验证集的模型评分
学习曲线: 学习曲线用于评估多大的样本量用于训练才能达到最佳效果.
sklearn.model_ selection.learning_ curve(
estimator, X, y, groups = None
train_ sizes = array([ 0.1,0.325, 0.55, 0.775, 1. ]) :
模型拟合时用于训练集的相对/绝对样本数,用整数表示绝对样本数
cv = None, scoring = None
exploit_ incremental_ learning = False :是否使用增量学习策略
n_ jobs = 1, pre_ dispatch = ‘all’, verbose = 0
shuffle = False, random state = None
)返回: .
train_ sizes_ abs : array, shape = (n_ unique ticks,), 训练集大小
train_ scores : array, shape (n_ ticks, n_ cv_ folds), 训练集评分
test_ scores : array, shape (n_ ticks, n cv folds),验证集评分
1、导入验证曲线函数:
from sklearn.model_selection import validation_curve
2、导入波士顿房价数据集并实例化:
from sklearn.datasets import load_boston
实例化:
boston = load_boston()
3、导入sklearn的岭回归模块:
from sklearn.linear_model import Ridge
4、将原始数据打乱为随机顺序:
#导入numpy库并将原始波士顿房价数据集打乱
import numpy as np
np.random.seed(666)
X,y = boston.data, boston.target
indices = np.arange(y.shape[0])
np.random.shuffle(indices)
X,y = X[indices],y[indices]
5、返回评价结果:
train_scores,test_scores = validation_curve(Ridge(),
X,y,
"alpha",
np.logspace(-10,10,200)
)
6、验证集的模型评分:
print(test_scores)
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(np.logspace(-10,10,200),np.mean(test_scores,axis = 1))
用于评估多大的样本量用于训练才能达到最佳效果
在初期,增加样本量会使得模型评分持续改善,但改善速度逐渐放缓
如果在增加训练集大小时,训练分值和验证分值都已经收敛到一个很低的
平稳值,则继续增加训练数据并不会改善模型效果
如果训练样本最大化时,训练分值仍然明显高于验证分值,说明模型存在
过拟合,此时增加训练样本很可能会改善模型的泛化能力
1、导入学习曲线模块:
from sklearn.model_selection import learning_curve
2、设置训练集大小:
size = np.linspace(0.1,1,10)
3、返回评价结果:
train_sizes,train_scores,test_scores = learning_curve(Ridge(),
X,y,
train_sizes = size,
cv = 10)
4、训练集大小:
print(train_sizes)
结果为:
[ 45 91 136 182 227 273 318 364 409 455]
5、训练集评分:
print(train_scores)
print(test_scores)
plt.scatter(train_sizes,np.mean(train_scores,axis = 1))
plt.scatter(train_sizes,np.mean(test_scores,axis = 1))
8、更改训练集大小进行验证:
size = np.linspace(0.1,1,100)
plt.scatter(train_sizes,np.mean(train_scores,axis = 1),s = 2)
plt.scatter(train_sizes,np.mean(test_scores,axis = 1),s = 2)
size = np.linspace(0.1,1,1000)
plt.scatter(train_sizes,np.mean(train_scores,axis = 1),s = 2)
plt.scatter(train_sizes,np.mean(test_scores,axis = 1),s = 2)