Cost Function Learning Curves

Taking polynomial regression (linear regression over polynomial features) as an example, let's explore how the cost function relates to the size of the training dataset.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

n_dots = 200
# Create the dataset: y = sqrt(x) plus uniform noise drawn from [-0.1, 0.1)
X = np.linspace(0, 1, n_dots)
y = np.sqrt(X) + 0.2 * np.random.rand(n_dots) - 0.1

# scikit-learn estimators expect an n_samples x n_features matrix,
# so reshape the arrays into 200x1 matrices
X = X.reshape(-1, 1)
y = y.reshape(-1, 1)

# In scikit-learn, a polynomial model is built with a Pipeline
# A Pipeline chains several steps: the output of one step is passed on to the next
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Define a function that builds a polynomial regression model,
# where degree is the degree of the polynomial (the highest power of x)
def polynomial_model(degree=1):
    polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)
    linear_regression = LinearRegression()
    # Pipeline: first generate the polynomial features, then fit them with linear regression
    pipeline = Pipeline([('polynomial_features', polynomial_features),
                         ('linear_regression', linear_regression)])
    return pipeline
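
As a quick sanity check (a sketch added here for illustration, not part of the original code), we can fit the degree-3 pipeline and look at the expanded feature matrix that PolynomialFeatures hands to the linear regression step:

# Fit the degree-3 pipeline on the dataset above (X and y are already in scope)
model = polynomial_model(degree=3)
model.fit(X, y)

# PolynomialFeatures expands the single column x into [x, x^2, x^3],
# so the linear regression step sees 3 features per sample
expanded = model.named_steps['polynomial_features'].transform(X)
print(expanded.shape)     # expected: (200, 3)
print(model.score(X, y))  # R^2 score of the fitted pipeline on the training data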

# Use learning_curve from sklearn.model_selection to plot learning curves
# learning_curve automatically increases the number of training samples according
# to the given schedule, then reports the model's score at each training-set size
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1,
                        train_sizes=np.linspace(0.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curves.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods

    title : string
        Title for the chart

    X : array-like, shape (n_samples, n_features)
        Training vector

    y : array-like, shape (n_samples) or (n_samples, n_features)
        Target relative to X for classification or regression;
        None for unsupervised learning

    ylim : tuple, shape (ymin, ymax), optional
        Defines the minimum and maximum y-values plotted

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 3-fold cross-validation
          - integer, to specify the number of folds
          - An object to be used as a cross-validation generator
          - An iterable yielding train/test splits

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` is used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

    n_jobs : integer, optional
        Number of jobs to run in parallel
    """
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    # Because training samples are assigned randomly, the scores differ from run to run;
    # plt.fill_between() shades one standard deviation above and below the mean score
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color='r')
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color='g')
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r', label='Training score')
    plt.plot(train_sizes, test_scores_mean, 'o-', color='g', label='Cross-validation score')
    plt.legend(loc='best')
    return plt
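
Before plotting, it may help to see the raw arrays that learning_curve returns. The sketch below is an addition for illustration: with 200 samples and ShuffleSplit(test_size=0.2), each training fold holds 160 samples, so the fractions np.linspace(0.1, 1.0, 5) should map to absolute sizes of roughly [16, 52, 88, 124, 160].

# Sketch: inspect learning_curve's return values directly
cv_demo = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    polynomial_model(3), X, y, cv=cv_demo, train_sizes=np.linspace(0.1, 1.0, 5))
print(sizes)               # absolute training-set sizes, e.g. [ 16  52  88 124 160]
print(train_scores.shape)  # (5, 10): one row per size, one column per CV split
print(test_scores.shape)   # (5, 10)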

# Use polynomial_model() to build three models — a degree-1, a degree-3, and
# a degree-10 polynomial — and draw one learning curve for each
# To make the curves smoother, score each model on 10 cross-validation splits
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
titles = ['Learning Curve (Under Fitting)', 'Learning Curve', 'Learning Curve (Over Fitting)']
degrees = [1, 3, 10]
plt.figure(figsize=(18, 4), dpi=200)
for i in range(len(degrees)):
    plt.subplot(1, 3, i + 1)
    plot_learning_curve(polynomial_model(degrees[i]), titles[i], X, y,
                        ylim=(0.75, 1.01), cv=cv)

plt.show()

[Figure 1: cost function learning curves for the degree-1, degree-3, and degree-10 models]

In the figure above, the left panel is the degree-1 polynomial, the middle panel the degree-3 polynomial, and the right panel the degree-10 polynomial; the red curve shows the training score and the green curve the cross-validation score.

We can observe the following (a numeric check of the quoted scores is sketched after this list):

1. Left panel: when the model underfits, the cross-validation score rises as the training set grows and gradually approaches the training score, but the overall score stays low: the cross-validation score converges around 0.88 and the training score around 0.9. This is the signature of underfitting (high bias); when bias is high, adding more training samples will not improve the score much.

2. Right panel: when the model overfits, the cross-validation score also rises with the training-set size and moves toward the training score, but a wide gap remains between the two: the training score is the highest, converging around 0.95, while the cross-validation score is lower, converging around 0.91.

3. Middle panel: a good fit. The training and cross-validation scores stay close together, with the cross-validation score converging around 0.93 and the training score around 0.94; this model scores best.
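
The convergence values quoted above can be read off the plots; as a rough programmatic check (an added sketch, with exact numbers depending on the random noise in y), one can print the mean scores at the largest training size:

# Sketch: print the final (largest-training-set) mean scores for each degree
for degree in degrees:
    sizes, tr, te = learning_curve(polynomial_model(degree), X, y, cv=cv,
                                   train_sizes=np.linspace(0.1, 1.0, 5))
    print(degree, tr.mean(axis=1)[-1], te.mean(axis=1)[-1])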

Summary:

When a learning algorithm needs improvement, plot its learning curves to judge whether the problem is high bias or high variance.

If the algorithm fails to predict new data well, first determine whether the model is underfitting or overfitting.

If it is overfitting, the following measures can help:

  1. Gather more training data: as the learning curves show, more data mitigates overfitting;
  2. Reduce the number of input features: this lowers both the computational cost and the complexity of the model, easing overfitting (one way to do this in scikit-learn is sketched after this list).
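
The book does not show how to cut features; one common scikit-learn approach (sketched below as an illustration, not taken from the text) is univariate feature selection, here applied to the degree-10 polynomial features:

# Sketch: expand x into 10 polynomial features, then keep the 3 most informative
from sklearn.feature_selection import SelectKBest, f_regression

X10 = PolynomialFeatures(degree=10, include_bias=False).fit_transform(X)
print(X10.shape)  # (200, 10)
X3 = SelectKBest(score_func=f_regression, k=3).fit_transform(X10, y.ravel())
print(X3.shape)   # (200, 3)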

If it is underfitting, the model is too simple and needs more complexity:

  1. Add valuable features: reinterpret the training data;
  2. Add polynomial features: derive them mathematically from the existing ones, for example expanding the features x1, x2 into x1, x2, x1*x2, x1^2, x2^2, or raising the degree further (see the sketch below).
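
PolynomialFeatures performs exactly this expansion; a small sketch (added for illustration):

# One sample with features x1 = 2, x2 = 3
demo = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2, include_bias=False)
# Output columns are x1, x2, x1^2, x1*x2, x2^2
print(poly.fit_transform(demo))  # [[2. 3. 4. 6. 9.]]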


Reference: "scikit-learn机器学习" (scikit-learn Machine Learning), by 黄永昌

