11_Training Deep Neural Networks_VarianceScaling_leaky relu_PReLU_SELU _Batch Normalization_Reusing
https://blog.csdn.net/Linli522362242/article/details/106935910
11_Training Deep Neural Networks_2_transfer learning_RBMs_Momentum_Nesterov Accelerated Gra_AdaGrad_RMSProp
https://blog.csdn.net/Linli522362242/article/details/106982127
11_Training Deep Neural Networks_3_Adam_Learning Rate Scheduling_Decay_np.argmax(」)_lambda语句_Regular
https://blog.csdn.net/Linli522362242/article/details/107086444
Dropout is one of the most popular regularization techniques for deep neural networks. It was proposed in a paper by Geoffrey Hinton in 2012 and further detailed in a 2014 paper by Nitish Srivastava et al., and it has proven to be highly successful: even the state-of-the-art neural networks get a 1–2% accuracy boost simply by adding dropout. This may not sound like a lot, but when a model already has 95% accuracy, getting a 2% accuracy boost means dropping the error rate by almost 40% (going from 5% error to roughly 3%).
It is a fairly simple algorithm: at every training step, every neuron (including the input neurons, but always excluding the output neurons) has a probability p of being temporarily “dropped out,” meaning it will be entirely ignored during this training step, but it may be active during the next step (see Figure 11-9). The hyperparameter p is called the dropout rate, and it is typically set between 10% and 50%: closer to 20–30% in recurrent neural nets (see Chapter 15), and closer to 40–50% in convolutional neural networks (see Chapter 14). After training, neurons don’t get dropped anymore. And that’s all (except for a technical detail we will discuss momentarily).
Figure 11-9. With dropout regularization, at each training iteration a random subset of all neurons in one or more layers—except the output layer—are “dropped out”; these neurons output 0 at this iteration (represented by the dashed arrows)
@tf_export("nn.dropout", v1=[])
def dropout_v2(x, rate, noise_shape=None, seed=None, name=None):
  """Computes dropout: randomly sets elements to zero to prevent overfitting.

  Note: The behavior of dropout has changed between TensorFlow 1.x and 2.x.
  When converting 1.x code, please use named arguments to ensure behavior stays
  consistent.

  See also: `tf.keras.layers.Dropout` for a dropout layer.

  [Dropout](https://arxiv.org/abs/1207.0580) is useful for regularizing DNN
  models. Input elements are randomly set to zero (and the other elements are
  rescaled). This encourages each node to be independently useful, as it cannot
  rely on the output of other nodes.

  More precisely: With probability `rate` elements of `x` are set to `0`.
  The remaining elements are scaled up by `1.0 / (1 - rate)`, so that the
  expected value is preserved.

  >>> tf.random.set_seed(0)
  >>> x = tf.ones([3,5])
  >>> tf.nn.dropout(x, rate = 0.5, seed = 1).numpy()  # kept values: x * 1/(1-0.5)
  array([[2., 0., 0., 2., 2.],
         [2., 2., 2., 2., 2.],
         [2., 0., 2., 0., 2.]], dtype=float32)

  >>> tf.random.set_seed(0)
  >>> x = tf.ones([3,5])
  >>> tf.nn.dropout(x, rate = 0.8, seed = 1).numpy()  # kept values: x * 1/(1-0.8)
  array([[0., 0., 0., 5., 5.],
         [0., 5., 0., 5., 0.],
         [5., 0., 5., 0., 5.]], dtype=float32)

  >>> tf.nn.dropout(x, rate = 0.0) == x

  By default, each element is kept or dropped independently. If `noise_shape`
  is specified, it must be
  [broadcastable](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
  to the shape of `x`, and only dimensions with `noise_shape[i] == shape(x)[i]`
  will make independent decisions. This is useful for dropping whole
  channels from an image or sequence. For example:

  >>> tf.random.set_seed(0)
  >>> x = tf.ones([3,10])
  >>> tf.nn.dropout(x, rate = 2/3, noise_shape=[1,10], seed=1).numpy()  # same mask for all 3 rows
  array([[0., 0., 0., 3., 3., 0., 3., 3., 3., 0.],
         [0., 0., 0., 3., 3., 0., 3., 3., 3., 0.],
         [0., 0., 0., 3., 3., 0., 3., 3., 3., 0.]], dtype=float32)

  Args:
    x: A floating point tensor.
    rate: A scalar `Tensor` with the same type as x. The probability
      that each element is dropped. For example, setting rate=0.1 would drop
      10% of input elements.
    noise_shape: A 1-D `Tensor` of type `int32`, representing the
      shape for randomly generated keep/drop flags.
    seed: A Python integer. Used to create random seeds. See
      `tf.random.set_seed` for behavior.
    name: A name for this operation (optional).

  Returns:
    A Tensor of the same shape of `x`.

  Raises:
    ValueError: If `rate` is not in `[0, 1)` or if `x` is not a floating point
      tensor. `rate=1` is disallowed, because the output would be all zeros,
      which is likely not what was intended.
  """
  with ops.name_scope(name, "dropout", [x]) as name:
    is_rate_number = isinstance(rate, numbers.Real)
    if is_rate_number and (rate < 0 or rate >= 1):
      raise ValueError("rate must be a scalar tensor or a float in the "
                       "range [0, 1), got %g" % rate)
    x = ops.convert_to_tensor(x, name="x")
    x_dtype = x.dtype
    if not x_dtype.is_floating:
      raise ValueError("x has to be a floating point tensor since it's going "
                       "to be scaled. Got a %s tensor instead." % x_dtype)
    is_executing_eagerly = context.executing_eagerly()
    if not tensor_util.is_tensor(rate):
      if is_rate_number:
        keep_prob = 1 - rate  # keep probability
        scale = 1 / keep_prob
        scale = ops.convert_to_tensor(scale, dtype=x_dtype)
        ret = gen_math_ops.mul(x, scale)  # x / (1 - dropout rate)
      else:
        raise ValueError("rate is neither scalar nor scalar tensor %r" % rate)
    else:
      rate.get_shape().assert_has_rank(0)
      rate_dtype = rate.dtype
      if rate_dtype != x_dtype:
        if not rate_dtype.is_compatible_with(x_dtype):
          raise ValueError(
              "Tensor dtype %s is incompatible with Tensor dtype %s: %r" %
              (x_dtype.name, rate_dtype.name, rate))
        rate = gen_math_ops.cast(rate, x_dtype, name="rate")
      one_tensor = constant_op.constant(1, dtype=x_dtype)
      ret = gen_math_ops.real_div(x, gen_math_ops.sub(one_tensor, rate))

    noise_shape = _get_noise_shape(x, noise_shape)
    # Sample a uniform distribution on [0.0, 1.0) and select values larger
    # than rate.
    #
    # NOTE: Random uniform can only generate 2^23 floats on [1.0, 2.0)
    # and subtract 1.0.
    random_tensor = random_ops.random_uniform(
        noise_shape, seed=seed, dtype=x_dtype)
    # NOTE: if (1.0 + rate) - 1 is equal to rate, then that float is selected,
    # hence a >= comparison is used.
    keep_mask = random_tensor >= rate
    ret = gen_math_ops.mul(ret, gen_math_ops.cast(keep_mask, x_dtype))
    if not is_executing_eagerly:
      ret.set_shape(x.get_shape())
    return ret
It’s surprising at first that this destructive technique works at all. Would a company perform better if its employees were told to toss a coin every morning to decide whether or not to go to work? Well, who knows; perhaps it would! The company would be forced to adapt its organization; it could not rely on any single person to work the coffee machine or perform any other critical tasks, so this expertise would have to be spread across several people. Employees would have to learn to cooperate with many of their coworkers, not just a handful of them. The company would become much more resilient. If one person quit, it wouldn’t make much of a difference. It’s unclear whether this idea would actually work for companies, but it certainly does for neural networks. Neurons trained with dropout cannot co-adapt with their neighboring neurons; they have to be as useful as possible on their own. They also cannot rely excessively on just a few input neurons; they must pay attention to each of their input neurons. They end up being less sensitive to slight changes in the inputs. In the end, you get a more robust network that generalizes better.
Another way to understand the power of dropout is to realize that a unique neural network is generated at each training step. Since each neuron can be either present or absent, there are a total of 2^N possible networks (where N is the total number of droppable neurons). This is such a huge number that it is virtually impossible for the same neural network to be sampled twice. Once you have run 10,000 training steps, you have essentially trained 10,000 different neural networks (each with just one training instance). These neural networks are obviously not independent because they share many of their weights, but they are nevertheless all different. The resulting neural network can be seen as an averaging ensemble of all these smaller neural networks.
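To get a feel for how huge this number is, here is a quick back-of-the-envelope computation for the small Fashion MNIST model built below (784 input neurons plus 300 + 100 hidden neurons are droppable; this assumes dropout on every layer, as in the code that follows):

import math

N = 28 * 28 + 300 + 100   # droppable neurons in the model built below
print(f"2**{N} is about 10**{int(N * math.log10(2))} possible sub-networks")
# 2**1184 ≈ 10**356, vastly more than the number of atoms in the observable universe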
In practice, you can usually apply dropout only to the neurons in the top one to three layers (excluding the output layer).
There is one small but important technical detail. Suppose p = 50%, in which case during testing a neuron would be connected to twice as many input neurons as it would be (on average) during training. To compensate for this fact, we need to multiply each neuron’s input connection weights by 0.5 after training. If we don’t, each neuron will get a total input signal roughly twice as large as what the network was trained on and will be unlikely to perform well. More generally, we need to multiply each input connection weight by the keep probability (1 – p) after training. Alternatively, we can divide each neuron’s output by the keep probability during training (these alternatives are not perfectly equivalent, but they work equally well).
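To make the equivalence concrete, here is a minimal NumPy sketch of both compensation schemes (the activations and rate are made up for illustration; TensorFlow and Keras actually use the second, "inverted" scheme, scaling during training):

import numpy as np

rng = np.random.default_rng(42)
p = 0.5                                   # dropout rate; keep probability = 1 - p
activations = rng.random(100_000)         # some layer's outputs during training
mask = rng.random(activations.shape) >= p # True for neurons that are kept

# Scheme 1: no scaling during training; after training, the next layer's
# incoming weights would be multiplied by the keep probability (1 - p).
train_out_1 = activations * mask

# Scheme 2 ("inverted" dropout, used by tf.nn.dropout and keras.layers.Dropout):
# divide the surviving outputs by the keep probability during training.
train_out_2 = activations * mask / (1 - p)

print(activations.mean())   # ≈ 0.5
print(train_out_1.mean())   # ≈ 0.25: the next layer sees half the signal...
print(train_out_2.mean())   # ≈ 0.5 : ...unless we rescale during training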
To implement dropout using Keras, you can use the keras.layers.Dropout layer. During training, it randomly drops some inputs (setting them to 0) and divides the remaining inputs by the keep probability (i.e., remaining input / (1 - rate)). After training, it does nothing at all; it just passes the inputs on to the next layer.
The following code applies dropout regularization before every Dense layer, using a dropout rate of 0.2:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 2
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid))
Since dropout is only active during training, comparing the training loss and the validation loss can be misleading. In particular, a model may be overfitting the training set and yet have similar training and validation losses. So make sure to evaluate the training loss without dropout (e.g., after training).
If you observe that the model is overfitting, you can increase the dropout rate. Conversely, you should try decreasing the dropout rate if the model underfits the training set. It can also help to increase the dropout rate for large layers, and reduce it for small ones. Moreover, many state-of-the-art architectures only use dropout after the last hidden layer, so you may want to try this if full dropout is too strong.
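For instance, here is a minimal sketch of the "dropout only after the last hidden layer" variant mentioned above (same architecture as before; the rate of 0.2 is illustrative):

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),   # dropout only after the last hidden layer
    keras.layers.Dense(10, activation="softmax")
])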
If you want to regularize a self-normalizing network based on the SELU activation function (as discussed earlier), you should use alpha dropout: this is a variant of dropout that preserves the mean and standard deviation of its inputs. It was introduced in the same paper as SELU, as regular dropout would break self-normalization (see https://blog.csdn.net/Linli522362242/article/details/106935910).
Alpha Dropout is a Dropout that keeps the mean and variance of its inputs at their original values, in order to ensure the self-normalizing property even after this dropout. Alpha Dropout fits well with Scaled Exponential Linear Units by randomly setting activations to the negative saturation value. Its rate argument is a float, the drop probability (as with Dropout); the multiplicative noise will have standard deviation sqrt(rate / (1 - rate)).
@keras_export('keras.layers.AlphaDropout')
class AlphaDropout(Layer):
  """Applies Alpha Dropout to the input.

  Alpha Dropout is a `Dropout` that keeps mean and variance of inputs
  to their original values, in order to ensure the self-normalizing property
  even after this dropout.
  Alpha Dropout fits well to Scaled Exponential Linear Units
  by randomly setting activations to the negative saturation value.

  Arguments:
    rate: float, drop probability (as with `Dropout`).
      The multiplicative noise will have
      standard deviation `sqrt(rate / (1 - rate))`.
    seed: A Python integer to use as random seed.

  Call arguments:
    inputs: Input tensor (of any rank).
    training: Python boolean indicating whether the layer should behave in
      training mode (adding dropout) or in inference mode (doing nothing).

  Input shape:
    Arbitrary. Use the keyword argument `input_shape`
    (tuple of integers, does not include the samples axis)
    when using this layer as the first layer in a model.

  Output shape:
    Same shape as input.
  """

  def __init__(self, rate, noise_shape=None, seed=None, **kwargs):
    super(AlphaDropout, self).__init__(**kwargs)
    self.rate = rate
    self.noise_shape = noise_shape
    self.seed = seed
    self.supports_masking = True

  def _get_noise_shape(self, inputs):
    return self.noise_shape if self.noise_shape else array_ops.shape(inputs)

  def call(self, inputs, training=None):
    if 0. < self.rate < 1.:
      noise_shape = self._get_noise_shape(inputs)

      def dropped_inputs(inputs=inputs, rate=self.rate, seed=self.seed):  # pylint: disable=missing-docstring
        # SELU's fixed-point constants; alpha_p is the negative saturation value
        alpha = 1.6732632423543772848170429916717
        scale = 1.0507009873554804934193349852946
        alpha_p = -alpha * scale

        kept_idx = math_ops.greater_equal(
            K.random_uniform(noise_shape, seed=seed), rate)
        kept_idx = math_ops.cast(kept_idx, inputs.dtype)

        # Get affine transformation params (chosen to restore mean and variance)
        a = ((1 - rate) * (1 + rate * alpha_p**2))**-0.5
        b = -a * alpha_p * rate

        # Apply mask: dropped activations are set to the negative saturation value
        x = inputs * kept_idx + alpha_p * (1 - kept_idx)

        # Do affine transformation
        return a * x + b

      return K.in_train_phase(dropped_inputs, inputs, training=training)
    return inputs
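A quick numerical check (a sketch, not from the book) shows that AlphaDropout indeed preserves the mean and standard deviation of standardized inputs:

import tensorflow as tf
from tensorflow import keras
import numpy as np

tf.random.set_seed(42)
z = tf.random.normal([10_000, 100])            # standardized inputs: mean ≈ 0, std ≈ 1
alpha_drop = keras.layers.AlphaDropout(rate=0.2)
z_drop = alpha_drop(z, training=True)          # force training mode so dropout is applied
print(np.mean(z.numpy()), np.std(z.numpy()))            # ≈ 0.0, ≈ 1.0
print(np.mean(z_drop.numpy()), np.std(z_drop.numpy()))  # still ≈ 0.0, ≈ 1.0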
import tensorflow as tf
from tensorflow import keras
import numpy as np

tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.AlphaDropout(rate=0.2),
    keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.AlphaDropout(rate=0.2),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.AlphaDropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])
optimizer = keras.optimizers.SGD(lr=0.01, momentum=0.9, nesterov=True)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
n_epochs = 20
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid))
... ...
Since dropout is only active during training, comparing the training loss and the validation loss can be misleading: here train loss > validation loss, which looks like underfitting. Don't be misled; let's evaluate with dropout turned off:

model.evaluate(X_test_scaled, y_test)
model.evaluate(X_train_scaled, y_train)

Now train loss < test loss, so the model is actually overfitting the training set.

history = model.fit(X_train_scaled, y_train)
In 2016, a paper by Yarin Gal and Zoubin Ghahramani added a few more good reasons to use dropout: the paper established a profound connection between dropout networks and approximate Bayesian inference, and it introduced a powerful technique called MC Dropout, which can boost the performance of any trained dropout model without retraining it, and which provides a much better measure of the model's uncertainty.
If this all sounds like a “one weird trick” advertisement, then take a look at the following code. It is the full implementation of MC Dropout, boosting the dropout model we trained earlier without retraining it:
# We reuse the alpha-dropout model trained above, without retraining it:
y_probas = np.stack([model(X_test_scaled, training=True) for sample in range(100)])
We just make 100 predictions over the test set (100 predictions for each instance), setting training=True to ensure that the Dropout layer is active, and stack the predictions. Since dropout is active, all the predictions will be different. Recall that predict() returns a matrix with one row per instance and one column per class. Because there are 10,000 instances in the test set and 10 classes, this is a matrix of shape [10000, 10]. We stack 100 such matrices, so y_probas is an array of shape [100, 10000, 10].
Once we average over the first dimension (axis=0), we get y_proba, an array of shape [10000, 10], like we would get
with a single prediction. That’s all! Averaging over multiple predictions with dropout on gives us a Monte Carlo estimate that is generally more reliable than the result of a single prediction with dropout off.
y_proba = y_probas.mean(axis=0)
y_std = y_probas.std(axis=0)
For example, let’s look at the model’s prediction for the first instance in the Fashion MNIST test set, with dropout off:
np.round( model.predict(X_test_scaled[:1]),2)
y_test[:1]
The model seems almost certain that this image belongs to class 9 (ankle boot). Should you trust it? Is there really so little room for doubt? Compare this with the predictions made when dropout is activated:
# y_probas = np.stack([ model(X_test_scaled, training=True) for sample in range(100)])######
np.round(y_probas[:,:1],2)
... ...
This tells a very different story: apparently, when we activate dropout, the model is not sure anymore. It still seems to prefer class 9, but sometimes it hesitates with classes 5 (sandal) and 7 (sneaker), which makes sense given they’re all footwear.
y_probas.shape
# number of predictions=100 on the same instance, 10000 instances, 10 class(0~9)
y_proba.shape #y_proba = y_probas.mean(axis=0)
y_test.shape
Once we average over the first dimension, we get the following MC Dropout predictions:
np.round( y_proba[:1],2) # #y_proba = y_probas.mean(axis=0)
The model still thinks this image belongs to class 9, but only with 83% confidence, which seems much more reasonable than the 100% we got with dropout off (np.round(model.predict(X_test_scaled[:1]), 2)). Plus it's useful to know exactly which other classes it thinks are likely. And you can also take a look at the standard deviation of the probability estimates:
y_std = y_probas.std(axis=0)
np.round(y_std[:1],2)
Apparently there's quite a lot of variance in the probability estimates: if you were building a risk-sensitive system (e.g., a medical or financial system), you should probably treat such an uncertain prediction with extreme caution. You definitely would not treat it like a 100% confident prediction. Moreover, the model's accuracy is 85.8%:
y_pred = np.argmax( y_proba, axis=1)
accuracy = np.sum(y_pred == y_test)/len(y_test)
accuracy
The number of Monte Carlo samples you use (100 in this example) is a hyperparameter you can tweak. The higher it is, the more accurate the predictions and their uncertainty estimates will be. However, if you double it, inference time will also be doubled. Moreover, above a certain number of samples, you will notice little improvement. So your job is to find the right trade-off between latency and accuracy, depending on your application.
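As a rough sketch of this trade-off (reusing the model and test set from above; the sample counts are illustrative), you can time a few sample counts and compare accuracy and latency:

import time

for n_samples in (10, 50, 100):
    t0 = time.time()
    y_probas = np.stack([model(X_test_scaled, training=True)
                         for _ in range(n_samples)])
    y_pred = np.argmax(y_probas.mean(axis=0), axis=1)
    acc = np.sum(y_pred == y_test) / len(y_test)
    print(n_samples, "samples: accuracy =", round(acc, 4),
          "time =", round(time.time() - t0, 1), "s")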
If your model contains other layers that behave in a special way during training (such as BatchNormalization layers), then you should not force training mode like we just did. Instead, you should replace the Dropout layers with the following MCDropout class:
class MCDropout(keras.layers.Dropout):
    def call(self, inputs):
        return super().call(inputs, training=True)  # force training mode: dropout stays active
Here, we just subclass the Dropout layer and override the call() method to force its training argument to True (see Chapter 12).
Similarly, you could define an MCAlphaDropout class by subclassing AlphaDropout instead.
If you are creating a model from scratch, it's just a matter of using MCDropout rather than Dropout. But if you have a model that was already trained using Dropout, you need to create a new model that's identical to the existing model, except replacing the Dropout layers with MCDropout, then copy the existing model's weights to your new model.
tf.random.set_seed(42)
np.random.seed(42)

class MCAlphaDropout(keras.layers.AlphaDropout):
    def call(self, inputs):
        return super().call(inputs, training=True)

# model = keras.models.Sequential([
#     keras.layers.Flatten(input_shape=[28, 28]),
#     keras.layers.AlphaDropout(rate=0.2),
#     keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
#     keras.layers.AlphaDropout(rate=0.2),
#     keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
#     keras.layers.AlphaDropout(rate=0.2),
#     keras.layers.Dense(10, activation="softmax")
# ])
mc_model = keras.models.Sequential([
    MCAlphaDropout(layer.rate) if isinstance(layer, keras.layers.AlphaDropout) else layer
    for layer in model.layers
])
mc_model.summary()

optimizer = keras.optimizers.SGD(lr=0.01, momentum=0.9, nesterov=True)
mc_model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
                 metrics=["accuracy"])
mc_model.set_weights(model.get_weights())

# len(model.get_weights()) : 6 (a kernel and a bias for each of the 3 Dense layers)
len(model.get_weights()[0]), len(model.get_weights()[1]), len(model.get_weights()[2])
len(model.get_weights()[3]), len(model.get_weights()[4]), len(model.get_weights()[5])
Now we can use the model with MC Dropout:
np.round(np.mean([mc_model.predict(X_test_scaled[:1]) for sample in range(100)], axis=0), 2)
In short, MC Dropout is a fantastic technique that boosts dropout models and provides better uncertainty estimates. And of course, since it is just regular dropout during training, it also acts like a regularizer.
Another regularization technique that is popular for neural networks is called max-norm regularization: for each neuron, it constrains the weights w of the incoming connections such that ∥w∥₂ ≤ r, where r is the max-norm hyperparameter and ∥·∥₂ is the ℓ₂ norm.
Max-norm regularization does not add a regularization loss term to the overall loss function. Instead, it is typically implemented by computing ∥w∥₂ after each training step and rescaling w if needed (w ← w r / ∥w∥₂).
Reducing r increases the amount of regularization and helps reduce overfitting. Max-norm regularization can also help alleviate the unstable gradients problems (i.e., the vanishing/exploding gradients problems during training; see https://blog.csdn.net/Linli522362242/article/details/106935910) if you are not using Batch Normalization (which zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer: one for scaling, the other for shifting).
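Here is the rescaling rule as a minimal NumPy sketch (the weight vector is made up for illustration):

import numpy as np

r = 1.0                          # max-norm hyperparameter
w = np.array([0.6, -0.8, 1.2])   # one neuron's incoming weights
norm = np.linalg.norm(w)         # ||w||2 ≈ 1.56 > r
if norm > r:
    w = w * r / norm             # w <- w * r / ||w||2
print(w, np.linalg.norm(w))      # the norm is now exactly r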
To implement max-norm regularization in Keras, set the kernel_constraint argument of each hidden layer to a max_norm() constraint with the appropriate max value, like this:
layer = keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal",
                           kernel_constraint=keras.constraints.max_norm(1.))
After each training iteration, the model’s fit() method will call the object returned by max_norm(), passing it the layer’s weights and getting rescaled weights in return, which then replace the layer’s weights. As you’ll see in Chapter 12, you can define your own custom constraint function if necessary and use it as the kernel_constraint. You can also constrain the bias terms by setting the bias_constraint argument.
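For example, here is a sketch of a custom constraint; my_weight_clip is a hypothetical function (not part of Keras), but any callable that maps a weight tensor to a weight tensor will work:

import tensorflow as tf
from tensorflow import keras

def my_weight_clip(weights):  # called by fit() after each training step
    return tf.clip_by_value(weights, -1.0, 1.0)

layer = keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal",
                           kernel_constraint=my_weight_clip)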
The max_norm() function has an axis argument that defaults to 0. A Dense layer usually has weights of shape [number of inputs, number of neurons], so using axis=0 means that the max-norm constraint will apply independently to each neuron's weight vector. If you want to use max-norm with convolutional layers (see Chapter 14), make sure to set the max_norm() constraint's axis argument appropriately (usually axis=[0, 1, 2]).
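For instance, a Conv2D kernel has shape [kernel height, kernel width, input channels, output channels], so a sketch of constraining each output channel's filter independently looks like this:

conv = keras.layers.Conv2D(64, 3, activation="relu", padding="same",
                           kernel_constraint=keras.constraints.max_norm(1., axis=[0, 1, 2]))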
from functools import partial

MaxNormDense = partial(keras.layers.Dense,
                       activation="selu", kernel_initializer="lecun_normal",
                       kernel_constraint=keras.constraints.max_norm(1.))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    # Dense weights shape: [number of inputs == 784, number of neurons == 300]
    MaxNormDense(300),
    MaxNormDense(100),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 2
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid))
In this chapter we have covered a wide range of techniques, and you may be wondering which ones you should use. This depends on the task, and there is no clear consensus yet, but I have found the configuration in Table 11-3 to work fine in most cases, without requiring much hyperparameter tuning. That said, please do not consider these defaults as hard rules!
Table 11-3. Default DNN configuration
Kernel initializer: He initialization
Activation function: ELU
Normalization: None if shallow; Batch Norm if deep
Regularization: Early stopping (+ ℓ2 reg. if needed)
Optimizer: Momentum optimization (or RMSProp or Nadam)
Learning rate schedule: 1cycle
If the network is a simple stack of dense layers, then it can self-normalize, and you should use the configuration in Table 11-4 instead.
Table 11-4. DNN configuration for a self-normalizing net
Kernel initializer: LeCun initialization
Activation function: SELU
Normalization: None (self-normalization)
Regularization: Alpha dropout if needed
Optimizer: Momentum optimization (or RMSProp or Nadam)
Learning rate schedule: 1cycle
Don’t forget to normalize the input features! You should also try to reuse parts of a pretrained neural network if you can find one that solves a similar problem, or use unsupervised pretraining if you have a lot of unlabeled data, or use pretraining on an auxiliary task if you have a lot of labeled data for a similar task.
While the previous guidelines should cover most cases, here are some exceptions:
The SELU activation function is a good default.
If you need the neural network to be as fast as possible, you can use one of the leaky ReLU variants instead (e.g., a simple leaky ReLU using the default hyperparameter value).
The simplicity of the ReLU activation function makes it many people’s preferred option, despite the fact that it is generally outperformed by SELU and leaky ReLU. However, the ReLU activation function’s ability to output precisely zero can be useful in some cases (e.g., see Chapter 17). Moreover, it can sometimes benefit from optimized implementation as well as from hardware acceleration.
The hyperbolic tangent (tanh) can be useful in the output layer if you need to output a number between –1 and 1, but nowadays it is not used much in hidden layers (except in recurrent nets).
The logistic activation function is also useful in the output layer when you need to estimate a probability (e.g., for binary classification), but is rarely used in hidden layers (there are exceptions, for example the coding layer of variational autoencoders; see Chapter 17).
Finally, the softmax activation function is useful in the output layer to output probabilities for mutually exclusive classes, but it is rarely (if ever) used in hidden layers.
With these guidelines, you are now ready to train very deep nets! I hope you are now convinced that you can go quite a long way using just Keras. There may come a time, however, when you need to have even more control; for example, to write a custom loss function or to tweak the training algorithm. For such cases you will need to use TensorFlow’s lower-level API, as you will see in the next chapter.
import tensorflow as tf
from tensorflow import keras
import numpy as np

keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))  # 32 × 32-pixel color images
for _ in range(20):
    model.add(keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"))
model.add(keras.layers.Dense(10, activation="softmax"))

optimizer = keras.optimizers.Nadam(lr=5e-5)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=['accuracy'])
Let's load the CIFAR10 dataset. We also want to use early stopping, so we need a validation set. Let's use the first 5,000 images of the original training set as the validation set (early stopping will interrupt training when it measures no progress on the validation set for a number of epochs, defined by the patience argument, i.e., the number of epochs with no improvement after which training will be stopped). You can load the dataset with keras.datasets.cifar10.load_data(). It is composed of 60,000 32 × 32-pixel color images (50,000 for training, 10,000 for testing) with 10 classes.
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.cifar10.load_data()

X_train = X_train_full[5000:]
y_train = y_train_full[5000:]
X_valid = X_train_full[:5000]
y_valid = y_train_full[:5000]

early_stopping_cb = keras.callbacks.EarlyStopping(patience=20)
# https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint
# mode="auto", monitor='val_loss' (defaults)
model_checkpoint_cb = keras.callbacks.ModelCheckpoint("my_cifar10_model.h5", save_best_only=True)

run_index = 1  # increment every time you train the model
import os
run_logdir = os.path.join(os.curdir, "my_cifar10_logs", "run_{:03d}".format(run_index))
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)

callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]
model.fit(X_train, y_train, epochs=100,
          validation_data=(X_valid, y_valid),
          callbacks=callbacks)

model = keras.models.load_model("my_cifar10_model.h5")
model.evaluate(X_valid, y_valid)
K = keras.backend

class ExponentialLearningRate(keras.callbacks.Callback):
    def __init__(self, factor):
        self.factor = factor
        self.rates = []
        self.losses = []
    def on_batch_end(self, batch, logs):
        self.rates.append(K.get_value(self.model.optimizer.lr))
        self.losses.append(logs['loss'])
        # grow the learning rate exponentially after each batch
        K.set_value(self.model.optimizer.lr, self.model.optimizer.lr * self.factor)

def find_learning_rate(model, X, y, epochs=1, batch_size=32, min_rate=10**-5, max_rate=10):
    init_weights = model.get_weights()
    iterations = len(X) // batch_size * epochs
    factor = np.exp(np.log(max_rate / min_rate) / iterations)  # per-batch multiplicative step
    init_lr = K.get_value(model.optimizer.lr)  # save the initial learning rate
    K.set_value(model.optimizer.lr, min_rate)  # start from min_rate
    exp_lr = ExponentialLearningRate(factor)   # the callback updates the rate at each batch
    history = model.fit(X, y, epochs=epochs, batch_size=batch_size, callbacks=[exp_lr])
    K.set_value(model.optimizer.lr, init_lr)   # restore the initial learning rate
    model.set_weights(init_weights)            # restore the initial weights
    return exp_lr.rates, exp_lr.losses

import matplotlib.pyplot as plt

def plot_lr_vs_loss(rates, losses):
    plt.plot(rates, losses)
    plt.gca().set_xscale("log")
    plt.hlines(min(losses), min(rates), max(rates))
    plt.axis([min(rates), max(rates), min(losses), (losses[0] + min(losses)) / 2])
    plt.xlabel("Learning rate")
    plt.ylabel("Loss")

rates, losses = find_learning_rate(model, X_train, y_train, epochs=1)  # default batch_size=32
plot_lr_vs_loss(rates, losses)
(Suggestion: do not choose the learning rate at which the loss reaches its minimum, but a bit earlier; see "How Do You Find A Good Learning Rate": https://sgugger.github.io/how-do-you-find-a-good-learning-rate.html. In practice, what matters most when picking a learning rate is getting the order of magnitude right, e.g., 1e-3 vs 1e-2.)

c. Now try adding Batch Normalization and compare the learning curves: Is it converging faster than before? Does it produce a better model? How does it affect training speed?
The code below is very similar to the code above, with a few changes:
* I added a BN layer after every Dense layer (before the activation function; recall the general preference SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic), except for the output layer. I also added a BN layer before the first hidden layer (Machine Learning algorithms don't perform well when the input numerical attributes have very different scales, and standardization is much less affected by outliers; see https://blog.csdn.net/Linli522362242/article/details/106582512).
* I changed the learning rate to 5e-4. I experimented with 1e-5, 3e-5, 5e-5, 1e-4, 3e-4, 5e-4, 1e-3 and 3e-3, and I chose the one with the best validation performance after 20 epochs.
* I renamed the run directories to run_bn_* and the model file name to my_cifar10_bn_model.h5.
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
model.add(keras.layers.BatchNormalization())         # BN before the first hidden layer
for _ in range(20):
    model.add(keras.layers.Dense(100, kernel_initializer="he_normal"))
    model.add(keras.layers.BatchNormalization())     # BN before the activation
    model.add(keras.layers.Activation("elu"))
model.add(keras.layers.Dense(10, activation="softmax"))

optimizer = keras.optimizers.Nadam(lr=5e-4)          # larger learning rate
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=['accuracy'])

early_stopping_cb = keras.callbacks.EarlyStopping(patience=20)
model_checkpoint_cb = keras.callbacks.ModelCheckpoint("my_cifar10_bn_model.h5", save_best_only=True)
run_index = 1  # increment every time you train the model
run_logdir = os.path.join(os.curdir, "my_cifar10_logs", "run_bn_{:03d}".format(run_index))
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]

model.fit(X_train, y_train, epochs=100,
          validation_data=(X_valid, y_valid),
          callbacks=callbacks)

model = keras.models.load_model("my_cifar10_bn_model.h5")
model.evaluate(X_valid, y_valid)
... ...
... ...
Is the model converging faster than before?
Much faster! The previous model took 36/37 epochs to reach the lowest validation loss, while the new model with BN took 18 epochs, roughly twice as fast. The BN layers stabilized training and allowed us to use a much larger learning rate, so convergence was faster.
Does BN produce a better model?
Yes! The final model is also much better, with 54.2% accuracy instead of 47%. It's still not a very good model, but at least it's much better than before (a Convolutional Neural Network would do much better, but that's a different topic, see chapter 14).
How does BN affect training speed?
Although the model converged about twice as fast, each epoch took more time because of the extra computations required by the BN layers. So overall, even though the number of epochs was reduced by 50%, the total training time (wall time) was still shortened, which is pretty significant!
rates, losses = find_learning_rate(model, X_train, y_train, epochs=1)
plot_lr_vs_loss(rates, losses)
d. Try replacing Batch Normalization with SELU, and make the necessary adjustments to ensure the network self-normalizes (i.e., standardize the input features, use LeCun normal initialization, make sure the DNN contains only a sequence of dense layers, etc.).
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    # LeCun normal initialization is required for SELU to self-normalize
    model.add(keras.layers.Dense(100, kernel_initializer="lecun_normal", activation="selu"))
model.add(keras.layers.Dense(10, activation="softmax"))

optimizer = keras.optimizers.Nadam(lr=5e-4)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=['accuracy'])

early_stopping_cb = keras.callbacks.EarlyStopping(patience=20)
model_checkpoint_cb = keras.callbacks.ModelCheckpoint("my_cifar10_selu_model.h5", save_best_only=True)
run_index = 1  # increment every time you train the model
run_logdir = os.path.join(os.curdir, "my_cifar10_logs", "run_selu_{:03d}".format(run_index))
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]

# Standardize the inputs (SELU expects standardized inputs):
X_means = X_train.mean(axis=0)  # per-pixel mean over the training instances
X_stds = X_train.std(axis=0)
X_train_scaled = (X_train - X_means) / X_stds
X_valid_scaled = (X_valid - X_means) / X_stds
X_test_scaled = (X_test - X_means) / X_stds

model.fit(X_train_scaled, y_train, epochs=100,
          validation_data=(X_valid_scaled, y_valid),
          callbacks=callbacks)

model = keras.models.load_model("my_cifar10_selu_model.h5")
model.evaluate(X_valid_scaled, y_valid)
... ...
... ...
With optimizer = keras.optimizers.Nadam(lr=5e-4), we get 50.12% accuracy, which is better than the original model, but not quite as good as the model using batch normalization. Moreover, it took only 10 epochs to reach the best model, which is much faster than both the original model and the BN model. So it's by far the fastest model to train (both in terms of epochs and wall time).
# optimizer = keras.optimizers.Nadam(lr=7e-4)
model = keras.models.load_model("my_cifar10_selu_model.h5")
model.evaluate(X_valid_scaled, y_valid)
After comparing the two loss curves obtained with different learning rates, I believe lr=5e-4 is better than lr=7e-4, given the convergence speed and the lower losses (the difference is not too large).

# optimizer = keras.optimizers.Nadam(lr=7e-4)
rates, losses = find_learning_rate(model, X_train_scaled, y_train, epochs=1)
plot_lr_vs_loss(rates, losses)
# optimizer = keras.optimizers.Nadam(lr=9e-4)
model = keras.models.load_model("my_cifar10_selu_model.h5")
model.evaluate(X_valid_scaled, y_valid)

rates, losses = find_learning_rate(model, X_train_scaled, y_train, epochs=1)
plot_lr_vs_loss(rates, losses)
This confirms that what matters most when choosing the learning rate is getting the order of magnitude right (e.g., 1e-3 vs 1e-2): with the right order of magnitude, training converges; after that, fine-tune the factor (here lr=5e-4) for convergence speed.
##############################################
e. Try regularizing the model with alpha dropout. Then, without retraining your model, see if you can achieve better accuracy using MC Dropout.
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(keras.layers.Dense(100,
                                 kernel_initializer="lecun_normal",
                                 activation="selu"))
model.add(keras.layers.AlphaDropout(rate=0.1))  # alpha dropout after the last hidden layer
model.add(keras.layers.Dense(10, activation="softmax"))

optimizer = keras.optimizers.Nadam(lr=5e-4)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

early_stopping_cb = keras.callbacks.EarlyStopping(patience=20)
model_checkpoint_cb = keras.callbacks.ModelCheckpoint("my_cifar10_alpha_dropout_model.h5",
                                                      save_best_only=True)
run_index = 1  # increment every time you train the model
run_logdir = os.path.join(os.curdir, "my_cifar10_logs",
                          "run_alpha_dropout_{:03d}".format(run_index))
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]

X_means = X_train.mean(axis=0)
X_stds = X_train.std(axis=0)
X_train_scaled = (X_train - X_means) / X_stds
X_valid_scaled = (X_valid - X_means) / X_stds
X_test_scaled = (X_test - X_means) / X_stds

model.fit(X_train_scaled, y_train, epochs=100,
          validation_data=(X_valid_scaled, y_valid),
          callbacks=callbacks)

model = keras.models.load_model("my_cifar10_alpha_dropout_model.h5")
model.evaluate(X_valid_scaled, y_valid)
... ...
... ...
The model reaches 48.66% accuracy on the validation set. That's very slightly worse than without dropout (50.12%). With an extensive hyperparameter search, it might be possible to do better (I tried dropout rates of 5%, 10%, 20% and 40%, and learning rates 1e-4, 3e-4, 5e-4, and 1e-3), but probably not much better in this case.
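A sketch of that small search could look like the following; build_and_train_model is a hypothetical helper (not defined above) that would build the alpha-dropout model with the given hyperparameters, train it, and return the best validation accuracy:

best_acc, best_params = 0.0, None
for rate in (0.05, 0.1, 0.2, 0.4):
    for lr in (1e-4, 3e-4, 5e-4, 1e-3):
        val_acc = build_and_train_model(alpha_dropout_rate=rate, learning_rate=lr)
        if val_acc > best_acc:
            best_acc, best_params = val_acc, (rate, lr)
print(best_params, best_acc)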
Let's use MC Dropout now. We will need the MCAlphaDropout class we used earlier, so let's just copy it here for convenience:
class MCAlphaDropout(keras.layers.AlphaDropout):
    def call(self, inputs):
        return super().call(inputs, training=True)  # force training mode: dropout stays active

# Now let's create a new model, identical to the one we just trained (with the same weights),
# but with MCAlphaDropout layers instead of AlphaDropout layers:
mc_model = keras.models.Sequential([
    MCAlphaDropout(layer.rate) if isinstance(layer, keras.layers.AlphaDropout) else layer
    for layer in model.layers
])
# We don't need to compile mc_model: we only use it for inference (predictions),
# not for further training.
# optimizer = keras.optimizers.Nadam(lr=5e-4)
# mc_model.compile(loss="sparse_categorical_crossentropy",
#                  optimizer=optimizer,
#                  metrics=["accuracy"])
# Then let's add a couple of utility functions. The first runs the model many times
# (10 by default) and returns the mean predicted class probabilities:
def mc_dropout_predict_probas(mc_model, X, n_samples=10):
    # Each element of Y_probas is one stochastic prediction over X (dropout active),
    # i.e., an array of shape [len(X), 10]; there are n_samples of them.
    Y_probas = [mc_model.predict(X) for sample in range(n_samples)]
    return np.mean(Y_probas, axis=0)  # shape [len(X), 10]

# The second uses these mean probabilities to predict the most likely class
# for each instance:
def mc_dropout_predict_classes(mc_model, X, n_samples=10):
    Y_probas = mc_dropout_predict_probas(mc_model, X, n_samples)
    return np.argmax(Y_probas, axis=1)  # shape [len(X)], e.g. [0, 9, ..., 8]
# Now let's make predictions for all the instances in the validation set
# and compute the accuracy:
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

# No fit() here: we reuse the alpha-dropout model that has already been trained.
y_pred = mc_dropout_predict_classes(mc_model, X_valid_scaled)
accuracy = np.mean(y_pred == y_valid[:, 0])  # y_valid.shape == (5000, 1)
accuracy
We get virtually no accuracy improvement in this case (in fact a tiny drop, from 48.66% to 48.62%).
So the best model we got in this exercise is the Batch Normalization model (54.12%).
##############################################
f. Retrain your model using 1cycle scheduling and see if it improves training speed and model accuracy.
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(keras.layers.Dense(100, kernel_initializer="lecun_normal", activation="selu"))
model.add(keras.layers.AlphaDropout(rate=0.1))
model.add(keras.layers.Dense(10, activation="softmax"))

optimizer = keras.optimizers.SGD(lr=1e-3)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=['accuracy'])
# Reuse the ExponentialLearningRate callback and the find_learning_rate()
# and plot_lr_vs_loss() functions defined earlier.
batch_size = 128
rates, losses = find_learning_rate(model, X_train_scaled, y_train, epochs=1, batch_size=batch_size)
plot_lr_vs_loss(rates, losses)
plt.axis([min(rates), max(rates),
          min(losses), (losses[0] + min(losses)) / 1.4])
From the learning rate vs. loss curve, a good learning rate lies roughly between 1e-4 and 1e-3.
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(keras.layers.Dense(100,
                                 kernel_initializer="lecun_normal",
                                 activation="selu"))
model.add(keras.layers.AlphaDropout(rate=0.1))
model.add(keras.layers.Dense(10, activation="softmax"))

optimizer = keras.optimizers.SGD(lr=1e-3)  # from the learning rate vs. loss curve
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])
class OneCycleScheduler(keras.callbacks.Callback):
    def __init__(self, iterations, max_rate, start_rate=None,
                 last_iterations=None, last_rate=None):
        self.iterations = iterations  # total number of iterations (batches)
        self.max_rate = max_rate
        self.start_rate = start_rate or max_rate / 10
        self.last_iterations = last_iterations or iterations // 10 + 1
        self.half_iteration_pos = (iterations - self.last_iterations) // 2
        # finish the last few iterations by dropping the rate down
        # by several orders of magnitude
        self.last_rate = last_rate or self.start_rate / 1000
        self.iteration_pos = 0

    def _interpolate(self, iter1, iter2, rate1, rate2):
        # linear interpolation: rate = slope * (iteration_pos - iter1) + rate1,
        # with slope = (rate2 - rate1) / (iter2 - iter1)
        return ((rate2 - rate1) * (self.iteration_pos - iter1)
                / (iter2 - iter1) + rate1)

    def on_batch_begin(self, batch, logs):
        if self.iteration_pos < self.half_iteration_pos:
            # first half of the cycle: ramp up from start_rate to max_rate
            rate = self._interpolate(0, self.half_iteration_pos,
                                     self.start_rate, self.max_rate)
        elif self.iteration_pos < 2 * self.half_iteration_pos:
            # second half of the cycle: ramp back down to start_rate
            rate = self._interpolate(self.half_iteration_pos, 2 * self.half_iteration_pos,
                                     self.max_rate, self.start_rate)
        else:
            # last few iterations: drop down to last_rate
            rate = self._interpolate(2 * self.half_iteration_pos, self.iterations,
                                     self.start_rate, self.last_rate)
        self.iteration_pos += 1
        K.set_value(self.model.optimizer.lr, rate)  # update the learning rate
n_epochs = 15
# max_rate=0.02 was picked from the learning rate vs. loss curve
onecycle = OneCycleScheduler(len(X_train_scaled) // batch_size * n_epochs, max_rate=0.02)
history = model.fit(X_train_scaled, y_train, epochs=n_epochs, batch_size=batch_size,
                    validation_data=(X_valid_scaled, y_valid),
                    callbacks=[onecycle])
model.evaluate(X_valid_scaled, y_valid)
... ...
One cycle allowed us to train the model in just 15 epochs, each taking only 75 seconds (thanks to the larger batch size). This is over 3 times faster than the fastest model we trained so far. Moreover, we improved the model's performance (from 48.66% to 51%). The batch normalized model reaches a slightly better performance, but it's much slower to train.