This article is largely organized from the TensorFlow tutorials. It covers overfitting and underfitting in model training, along with several commonly used regularization techniques.
Overfitting and underfitting are both common during training. There are many ways to deal with overfitting; the simplest is to train on more (ideally complete) data, which lets the model fully learn the underlying patterns. In practice, however, a complete dataset is rarely available, so this article focuses on regularization techniques instead:
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import regularizers
print(tf.__version__)
!pip install git+https://github.com/tensorflow/docs
import tensorflow_docs as tfdocs
import tensorflow_docs.modeling
import tensorflow_docs.plots
from IPython import display
from matplotlib import pyplot as plt
import numpy as np
import pathlib
import shutil
import tempfile
logdir = pathlib.Path(tempfile.mkdtemp())/"tensorboard_logs"
shutil.rmtree(logdir, ignore_errors=True)
gz = tf.keras.utils.get_file('HIGGS.csv.gz', 'http://mlphysics.ics.uci.edu/data/higgs/HIGGS.csv.gz')
FEATURES = 28
# tf.data.experimental.CsvDataset can read the gzipped file directly, with no intermediate decompression step.
ds = tf.data.experimental.CsvDataset(gz,[float(),]*(FEATURES+1), compression_type="GZIP")
# CsvDataset yields a list of scalars for each record rather than a DataFrame, so we repack each record into a (features, label) pair.
def pack_row(*row):
    label = row[0]
    features = tf.stack(row[1:], 1)
    return features, label
# TensorFlow is most efficient when operating on large batches of records, so batch first, map pack_row over each batch, then unbatch.
packed_ds = ds.batch(10000).map(pack_row).unbatch()
# Inspect the distribution of the features; the histogram is shown below.
for features, label in packed_ds.batch(1000).take(1):
    print(features[0])
    plt.hist(features.numpy().flatten(), bins=101)
The code above prepares the data used in the experiments. The dataset contains 11,000,000 samples, each with 28 features and a single binary label, i.e. a binary classification dataset.
# Use the first 1,000 samples as the validation set and the following 10,000 samples as the training set.
N_VALIDATION = int(1e3)
N_TRAIN = int(1e4)
BUFFER_SIZE = int(1e4)
BATCH_SIZE = 500
STEPS_PER_EPOCH = N_TRAIN//BATCH_SIZE
validate_ds = packed_ds.take(N_VALIDATION).cache()
train_ds = packed_ds.skip(N_VALIDATION).take(N_TRAIN).cache()
validate_ds = validate_ds.batch(BATCH_SIZE)
train_ds = train_ds.shuffle(BUFFER_SIZE).repeat().batch(BATCH_SIZE)
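As a quick sanity check, the following minimal sketch pulls a single batch from the pipeline and prints its shapes:
# Each training batch should contain BATCH_SIZE samples with FEATURES columns and one label per sample.
for features, labels in train_ds.take(1):
    print(features.shape)  # expected: (500, 28)
    print(labels.shape)    # expected: (500,)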
We demonstrate overfitting and underfitting with models of different capacities. A small model, with few learnable parameters, is unlikely to overfit but can easily underfit. Intuitively, a large model with many learnable parameters can readily capture the patterns in the training data, but is prone to overfitting. We therefore need to control model size to strike a balance between underfitting and overfitting. The models below are presented from simple to complex.
Gradually reducing the learning rate during training often yields better models. We use tf.keras.optimizers.schedules.InverseTimeDecay to define the learning-rate schedule.
lr_schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
    0.001,
    decay_steps=STEPS_PER_EPOCH*1000,
    decay_rate=1,
    staircase=False)
def get_optimizer():
    return tf.keras.optimizers.Adam(lr_schedule)
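InverseTimeDecay applies a hyperbolic decay, lr = initial_lr / (1 + decay_rate * step / decay_steps); with the settings above the learning rate is halved after 1000 epochs, reduced to 1/3 after 2000 epochs, and so on. A small sketch (not part of the original code) to verify this:
# Evaluate the schedule at a few epoch marks; printed values are approximate.
for epoch in [0, 1000, 2000]:
    print(epoch, float(lr_schedule(epoch * STEPS_PER_EPOCH)))  # roughly 0.001, 0.0005, 0.00033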
step = np.linspace(0,100000)
lr = lr_schedule(step)
plt.figure(figsize = (8,6))
plt.plot(step/STEPS_PER_EPOCH, lr)
plt.ylim([0,max(plt.ylim())])
plt.xlabel('Epoch')
_ = plt.ylabel('Learning Rate')
The plot produced by the code above shows how the learning rate decays with the training epoch, as in the figure below:
Every model uses the same training configuration, with identical model.compile and model.fit settings.
def get_callbacks(name):
    return [
        tfdocs.modeling.EpochDots(),
        tf.keras.callbacks.EarlyStopping(monitor='val_binary_crossentropy', patience=200),
        tf.keras.callbacks.TensorBoard(logdir/name),
    ]
def compile_and_fit(model, name, optimizer=None, max_epochs=10000):
    if optimizer is None:
        optimizer = get_optimizer()
    model.compile(optimizer=optimizer,
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                  metrics=[
                      tf.keras.losses.BinaryCrossentropy(
                          from_logits=True, name='binary_crossentropy'),
                      'accuracy'])
    model.summary()
    history = model.fit(
        train_ds,
        steps_per_epoch=STEPS_PER_EPOCH,
        epochs=max_epochs,
        validation_data=validate_ds,
        callbacks=get_callbacks(name),
        verbose=0)
    return history
We build four models of increasing capacity: a tiny model, a small model, a medium model, and a large model. Each is trained with the same procedure, and the training results are visualized at the end.
size_histories = {}
tiny_model = tf.keras.Sequential([
    layers.Dense(16, activation='elu', input_shape=(FEATURES,)),
    layers.Dense(1)
])
size_histories['Tiny'] = compile_and_fit(tiny_model, 'sizes/Tiny')
small_model = tf.keras.Sequential([
    # `input_shape` is only required here so that `.summary` works.
    layers.Dense(16, activation='elu', input_shape=(FEATURES,)),
    layers.Dense(16, activation='elu'),
    layers.Dense(1)
])
size_histories['Small'] = compile_and_fit(small_model, 'sizes/Small')
medium_model = tf.keras.Sequential([
    layers.Dense(64, activation='elu', input_shape=(FEATURES,)),
    layers.Dense(64, activation='elu'),
    layers.Dense(64, activation='elu'),
    layers.Dense(1)
])
size_histories['Medium'] = compile_and_fit(medium_model, "sizes/Medium")
large_model = tf.keras.Sequential([
    layers.Dense(512, activation='elu', input_shape=(FEATURES,)),
    layers.Dense(512, activation='elu'),
    layers.Dense(512, activation='elu'),
    layers.Dense(512, activation='elu'),
    layers.Dense(1)
])
size_histories['large'] = compile_and_fit(large_model, "sizes/large")
plotter = tfdocs.plots.HistoryPlotter(metric='binary_crossentropy', smoothing_std=10)
plotter.plot(size_histories)
a = plt.xscale('log')
plt.xlim([5, max(plt.xlim())])
plt.ylim([0.5, 0.7])
plt.xlabel("Epochs [Log Scale]")
The results show that after a certain number of epochs, the validation loss of the medium and large models stops decreasing and starts to rise again, a clear sign of severe overfitting.
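Since every run also writes TensorBoard logs via the TensorBoard callback above, the training curves can alternatively be inspected interactively. A sketch assuming a Jupyter/Colab notebook environment:
# Load the TensorBoard notebook extension and point it at the size-comparison logs.
%load_ext tensorboard
%tensorboard --logdir {logdir}/sizes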
Save the training logs of the Tiny model above, the best-behaved one, as a baseline for comparison.
shutil.rmtree(logdir/'regularizers/Tiny', ignore_errors=True)
shutil.copytree(logdir/'sizes/Tiny', logdir/'regularizers/Tiny')
regularizer_histories = {}
regularizer_histories['Tiny'] = size_histories['Tiny']
Methods to prevent overfitting:
1. Add weight regularization
From the results above we know that simpler models are less prone to overfitting. Adding weight regularization places a constraint on a complex model, forcing its weights to take only small values and making the weight distribution more "regular". With L2 regularization, the penalty added to the loss is proportional to the sum of the squares of the weights.
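Concretely, regularizers.l2(0.001) adds 0.001 * sum(w ** 2) over a layer's kernel weights to the total loss. A minimal sketch illustrating this with a toy layer (the layer here is only for demonstration):
# A toy Dense layer with an L2 kernel regularizer; layer.losses holds the penalty it contributes.
toy_layer = layers.Dense(4, kernel_regularizer=regularizers.l2(0.001))
_ = toy_layer(tf.ones((1, 2)))  # call once so the kernel gets built
print(toy_layer.losses)  # [0.001 * sum(kernel ** 2)]
print(0.001 * tf.reduce_sum(tf.square(toy_layer.kernel)))  # same value computed by hand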
l2_model = tf.keras.Sequential([
    layers.Dense(512, activation='elu',
                 kernel_regularizer=regularizers.l2(0.001),
                 input_shape=(FEATURES,)),
    layers.Dense(512, activation='elu',
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(512, activation='elu',
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(512, activation='elu',
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(1)
])
regularizer_histories['l2'] = compile_and_fit(l2_model, "regularizers/l2")
2. Add dropout
Dropout is one of the most effective and most widely used regularization techniques for neural networks. Applied to a layer, it randomly "drops out" (sets to zero) a fraction of that layer's output features during training.
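The toy sketch below (not part of the model) shows what a Dropout layer does: in training mode it zeroes a fraction rate of its inputs and scales the remaining values by 1/(1 - rate), so the expected sum is unchanged; in inference mode it does nothing.
# Dropout behaves differently in training and inference mode.
drop = layers.Dropout(0.5)
data = tf.ones((1, 10))
print(drop(data, training=True))   # about half the entries are 0, the rest are scaled to 2.0
print(drop(data, training=False))  # identical to the input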
dropout_model = tf.keras.Sequential([
    layers.Dense(512, activation='elu', input_shape=(FEATURES,)),
    layers.Dropout(0.5),
    layers.Dense(512, activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(512, activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(512, activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(1)
])
regularizer_histories['dropout'] = compile_and_fit(dropout_model, "regularizers/dropout")
3. Combine L2 regularization and dropout
combined_model = tf.keras.Sequential([
    layers.Dense(512, kernel_regularizer=regularizers.l2(0.0001),
                 activation='elu', input_shape=(FEATURES,)),
    layers.Dropout(0.5),
    layers.Dense(512, kernel_regularizer=regularizers.l2(0.0001),
                 activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(512, kernel_regularizer=regularizers.l2(0.0001),
                 activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(512, kernel_regularizer=regularizers.l2(0.0001),
                 activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(1)
])
regularizer_histories['combined'] = compile_and_fit(combined_model, "regularizers/combined")
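To compare the regularized runs against the Tiny baseline, the same plotter used for the size comparison can be reused; a sketch mirroring the earlier plotting code:
# Plot validation binary cross-entropy for the baseline and all regularized models.
plotter.plot(regularizer_histories)
plt.ylim([0.5, 0.7])
plt.xlabel('Epochs')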
As the resulting plot shows, the model combining L2 regularization and dropout achieves the best validation loss.
Summary
The most common ways to prevent overfitting are:
1. Get more training data.
2. Reduce the capacity of the network.
3. Add weight regularization.
4. Add dropout.