Taking polynomial regression as an example, we explore how the cost function relates to the size of the training dataset.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
n_dots = 200
# Create the dataset: y = sqrt(x), plus uniform noise in [-0.1, 0.1]
X=np.linspace(0,1,n_dots)
y=np.sqrt(X)+0.2*np.random.rand(n_dots)-0.1
# The sklearn API expects an n_samples x n_features matrix, so reshape to 200x1
X=X.reshape(-1,1)
y=y.reshape(-1,1)
# In sklearn, a polynomial model is built with a Pipeline
# A Pipeline chains several steps: the output of each step is passed on to the next
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# Define a function that builds a polynomial model; degree is the degree of the polynomial
def polynominal_model(degree=1):
    polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)
    linear_regression = LinearRegression()
    # Pipeline: first expand the polynomial features, then fit with linear regression
    pipeline = Pipeline([('polynomial_features', polynomial_features),
                         ('linear_regression', linear_regression)])
    return pipeline
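To see what the first pipeline step does, here is a minimal sketch (illustrative values, not from the book) of how `PolynomialFeatures` expands a single feature into its powers:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A single sample with one feature, x = 2
X_demo = np.array([[2.0]])
poly = PolynomialFeatures(degree=3, include_bias=False)
# Expands x into [x, x^2, x^3]
print(poly.fit_transform(X_demo))  # [[2. 4. 8.]]
```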
# Use learning_curve from sklearn.model_selection to plot learning curves
# learning_curve automatically increases the number of training samples according to a
# predefined schedule and records the model score at each training-set size
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1,
                        train_sizes=np.linspace(0.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator: object type that implements the 'fit' and 'predict' methods
    title: string
        Title for the chart
    X: array-like, shape (n_samples, n_features)
        Training vector
    y: array-like, shape (n_samples) or (n_samples, n_features)
        Target relative to X for classification or regression;
        None for unsupervised learning
    ylim: tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y values plotted
    cv: int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 3-fold cross-validation
          - integer, to specify the number of folds
          - An object to be used as a cross-validation generator
          - An iterable yielding train/test splits
        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` is used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.
    n_jobs: integer, optional
        Number of jobs to run in parallel
    """
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    # Because the training samples are assigned randomly, the scores differ from
    # run to run; plt.fill_between() shades one standard deviation around the
    # mean score.
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color='r')
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color='g')
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r', label='Training score')
    plt.plot(train_sizes, test_scores_mean, 'o-', color='g', label='Cross-validation score')
    plt.legend(loc='best')
    return plt
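As a sanity check on what `learning_curve` returns, here is a small self-contained sketch on synthetic data (names like `X_demo` are illustrative, not from the book):

```python
import numpy as np
from sklearn.model_selection import learning_curve, ShuffleSplit
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X_demo = rng.rand(100, 1)
y_demo = 2 * X_demo.ravel() + 0.1 * rng.randn(100)

cv_demo = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
sizes, tr_scores, te_scores = learning_curve(
    LinearRegression(), X_demo, y_demo, cv=cv_demo,
    train_sizes=np.linspace(0.1, 1.0, 5))
# sizes holds 5 absolute training-set sizes, up to 80 (= 100 * (1 - 0.2));
# tr_scores and te_scores have shape (5, 10): one score per size per CV split
print(sizes, tr_scores.shape, te_scores.shape)
```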
# Use polynominal_model() to build three models -- first-, third-, and tenth-degree
# polynomials -- and draw a learning curve for each
# To make the learning curves smoother, score over 10 cross-validation splits
cv = ShuffleSplit(n_splits=10,test_size=0.2,random_state=0)
titles=['Learning Curve(Under Fitting)','Learning Curve','Learning Curve(Over Fitting)']
degrees=[1,3,10]
plt.figure(figsize=(18,4),dpi=200)
for i in range(len(degrees)):
    plt.subplot(1, 3, i + 1)
    plot_learning_curve(polynominal_model(degrees[i]), titles[i], X, y,
                        ylim=(0.75, 1.01), cv=cv)
plt.show()
In the figures above, the left panel is the first-degree polynomial, the middle the third-degree, and the right the tenth-degree; the red curve is the training score and the green curve the cross-validation score.
We can observe the following:
1. Left panel: when the model underfits, the cross-validation score rises as the training set grows and gradually approaches the training score, but overall accuracy stays low: the cross-validation score converges around 0.88 and the training score around 0.9. This is the signature of underfitting; with high bias, adding more training samples does not noticeably improve accuracy.
2. Right panel: when the model overfits, the cross-validation score also rises with the number of training samples and approaches the training score, but a large gap remains between the two. The training score is the highest, converging around 0.95, while the cross-validation score is lower, converging near 0.91.
3. Middle panel: a good fit. The training and cross-validation scores stay close together, with the cross-validation score converging around 0.93 and the training score around 0.94. This model has the best accuracy.
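The train/cross-validation gaps described above can also be checked numerically. Here is a sketch using `cross_validate` (not part of the original code) on the same kind of synthetic sqrt data:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate, ShuffleSplit

rng = np.random.RandomState(0)
X_demo = np.linspace(0, 1, 200).reshape(-1, 1)
y_demo = np.sqrt(X_demo.ravel()) + 0.2 * rng.rand(200) - 0.1

cv_demo = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
results = {}
for degree in (1, 3, 10):
    model = Pipeline([
        ('polynomial_features', PolynomialFeatures(degree=degree, include_bias=False)),
        ('linear_regression', LinearRegression()),
    ])
    scores = cross_validate(model, X_demo, y_demo, cv=cv_demo, return_train_score=True)
    # Mean training score and mean cross-validation score per degree
    results[degree] = (scores['train_score'].mean(), scores['test_score'].mean())
    print(degree, results[degree])
# degree 1 underfits (both scores low); degree 10 tends to show the widest train/CV gap
```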
When improving a learning algorithm, plot its learning curve to decide whether the problem is high bias or high variance.
If the algorithm does not predict new data well, first determine whether the model is underfitting or overfitting.
If it is overfitting, possible remedies include gathering more training data, reducing the number of input features, and increasing the regularization strength.
If it is underfitting, the model is too simple and its complexity should be increased, for example by adding input features or raising the polynomial degree.
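As one illustration of the regularization remedy, here is a hypothetical sketch (not from the book) that swaps `LinearRegression` for `Ridge` in the same kind of pipeline; the L2 penalty shrinks the coefficients of the degree-10 model:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X_demo = np.linspace(0, 1, 200).reshape(-1, 1)
y_demo = np.sqrt(X_demo.ravel()) + 0.2 * rng.rand(200) - 0.1

def degree10_pipeline(regressor):
    # Hypothetical helper: same polynomial expansion, pluggable regressor
    return Pipeline([('poly', PolynomialFeatures(degree=10, include_bias=False)),
                     ('reg', regressor)]).fit(X_demo, y_demo)

ols = degree10_pipeline(LinearRegression())
ridge = degree10_pipeline(Ridge(alpha=0.01))
# The penalized fit keeps the high-degree coefficients much smaller in magnitude
print(np.linalg.norm(ols.named_steps['reg'].coef_),
      np.linalg.norm(ridge.named_steps['reg'].coef_))
```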
Reference: 《scikit-learn机器学习》 by 黄永昌