11_Training Deep Neural Networks_VarianceScaling_leaky relu_PReLU_SELU _Batch Normalization_Reusing
https://blog.csdn.net/Linli522362242/article/details/106935910
11_Training Deep Neural Networks_2_transfer learning_RBMs_Momentum_Nesterov Accelerated Gra_AdaGrad_RMSProp
https://blog.csdn.net/Linli522362242/article/details/106982127
11_Training Deep Neural Networks_3_Adam_Learning Rate Scheduling_Decay_np.argmax(」)_lambda语句_Regular
https://blog.csdn.net/Linli522362242/article/details/107086444
Dropout is one of the most popular regularization techniques for deep neural networks. It was proposed in a paper by Geoffrey Hinton in 2012 and further detailed in a 2014 paper by Nitish Srivastava et al., and it has proven to be highly successful: even the state-of-the-art neural networks get a 1–2% accuracy boost simply by adding dropout. This may not sound like a lot, but when a model already has 95% accuracy, getting a 2% accuracy boost means dropping the error rate by almost 40% (going from 5% error to roughly 3%).
It is a fairly simple algorithm: at every training step, every neuron (including the input neurons, but always excluding the output neurons) has a probability p of being temporarily “dropped out,” meaning it will be entirely ignored during this training step, but it may be active during the next step (see Figure 11-9). The hyperparameter p is called the dropout rate, and it is typically set between 10% and 50%: closer to 20–30% in recurrent neural nets (see Chapter 15), and closer to 40–50% in convolutional neural networks (see Chapter 14). After training, neurons don’t get dropped anymore. And that’s all (except for a technical detail we will discuss momentarily).
Figure 11-9. With dropout regularization, at each training iteration a random subset of all neurons in one or more layers—except the output layer—are “dropped out”; these neurons output 0 at this iteration (represented by the dashed arrows)
@tf_export("nn.dropout", v1=[])
def dropout_v2(x, rate, noise_shape=None, seed=None, name=None):
  """Computes dropout: randomly sets elements to zero to prevent overfitting.

  Note: The behavior of dropout has changed between TensorFlow 1.x and 2.x.
  When converting 1.x code, please use named arguments to ensure behavior stays
  consistent.

  See also: `tf.keras.layers.Dropout` for a dropout layer.

  [Dropout](https://arxiv.org/abs/1207.0580) is useful for regularizing DNN
  models. Input elements are randomly set to zero (and the other elements are
  rescaled). This encourages each node to be independently useful, as it cannot
  rely on the output of other nodes.

  More precisely: With probability `rate` elements of `x` are set to `0`.
  The remaining elements are scaled up by `1.0 / (1 - rate)`, so that the
  expected value is preserved.

  >>> tf.random.set_seed(0)
  >>> x = tf.ones([3,5])
  >>> tf.nn.dropout(x, rate = 0.5, seed = 1).numpy()  # kept values: x * 1/(1-0.5)
  array([[2., 0., 0., 2., 2.],
         [2., 2., 2., 2., 2.],
         [2., 0., 2., 0., 2.]], dtype=float32)

  >>> tf.random.set_seed(0)
  >>> x = tf.ones([3,5])
  >>> tf.nn.dropout(x, rate = 0.8, seed = 1).numpy()  # kept values: x * 1/(1-0.8)
  array([[0., 0., 0., 5., 5.],
         [0., 5., 0., 5., 0.],
         [5., 0., 5., 0., 5.]], dtype=float32)

  >>> tf.nn.dropout(x, rate = 0.0) == x

  By default, each element is kept or dropped independently. If `noise_shape`
  is specified, it must be
  [broadcastable](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)
  to the shape of `x`, and only dimensions with `noise_shape[i] == shape(x)[i]`
  will make independent decisions. This is useful for dropping whole
  channels from an image or sequence. For example:

  >>> tf.random.set_seed(0)
  >>> x = tf.ones([3,10])
  >>> tf.nn.dropout(x, rate = 2/3, noise_shape=[1,10], seed=1).numpy()  # same mask for all 3 rows
  array([[0., 0., 0., 3., 3., 0., 3., 3., 3., 0.],
         [0., 0., 0., 3., 3., 0., 3., 3., 3., 0.],
         [0., 0., 0., 3., 3., 0., 3., 3., 3., 0.]], dtype=float32)

  Args:
    x: A floating point tensor.
    rate: A scalar `Tensor` with the same type as x. The probability
      that each element is dropped. For example, setting rate=0.1 would drop
      10% of input elements.
    noise_shape: A 1-D `Tensor` of type `int32`, representing the
      shape for randomly generated keep/drop flags.
    seed: A Python integer. Used to create random seeds. See
      `tf.random.set_seed` for behavior.
    name: A name for this operation (optional).

  Returns:
    A Tensor of the same shape of `x`.

  Raises:
    ValueError: If `rate` is not in `[0, 1)` or if `x` is not a floating point
      tensor. `rate=1` is disallowed, because the output would be all zeros,
      which is likely not what was intended.
  """
  with ops.name_scope(name, "dropout", [x]) as name:
    is_rate_number = isinstance(rate, numbers.Real)
    if is_rate_number and (rate < 0 or rate >= 1):
      raise ValueError("rate must be a scalar tensor or a float in the "
                       "range [0, 1), got %g" % rate)
    x = ops.convert_to_tensor(x, name="x")
    x_dtype = x.dtype
    if not x_dtype.is_floating:
      raise ValueError("x has to be a floating point tensor since it's going "
                       "to be scaled. Got a %s tensor instead." % x_dtype)
    is_executing_eagerly = context.executing_eagerly()
    if not tensor_util.is_tensor(rate):
      if is_rate_number:
        keep_prob = 1 - rate  # keep probability
        scale = 1 / keep_prob
        scale = ops.convert_to_tensor(scale, dtype=x_dtype)
        ret = gen_math_ops.mul(x, scale)  # x / (1 - dropout rate)
      else:
        raise ValueError("rate is neither scalar nor scalar tensor %r" % rate)
    else:
      rate.get_shape().assert_has_rank(0)
      rate_dtype = rate.dtype
      if rate_dtype != x_dtype:
        if not rate_dtype.is_compatible_with(x_dtype):
          raise ValueError(
              "Tensor dtype %s is incompatible with Tensor dtype %s: %r" %
              (x_dtype.name, rate_dtype.name, rate))
        rate = gen_math_ops.cast(rate, x_dtype, name="rate")
      one_tensor = constant_op.constant(1, dtype=x_dtype)
      ret = gen_math_ops.real_div(x, gen_math_ops.sub(one_tensor, rate))

    noise_shape = _get_noise_shape(x, noise_shape)
    # Sample a uniform distribution on [0.0, 1.0) and select values larger
    # than rate.
    #
    # NOTE: Random uniform can only generate 2^23 floats on [1.0, 2.0)
    # and subtract 1.0.
    random_tensor = random_ops.random_uniform(
        noise_shape, seed=seed, dtype=x_dtype)
    # NOTE: if (1.0 + rate) - 1 is equal to rate, then that float is selected,
    # hence a >= comparison is used.
    keep_mask = random_tensor >= rate
    ret = gen_math_ops.mul(ret, gen_math_ops.cast(keep_mask, x_dtype))
    if not is_executing_eagerly:
      ret.set_shape(x.get_shape())
    return ret
It’s surprising at first that this destructive technique works at all. Would a company perform better if its employees were told to toss a coin every morning to decide whether or not to go to work? Well, who knows; perhaps it would! The company would be forced to adapt its organization; it could not rely on any single person to work the coffee machine or perform any other critical tasks, so this expertise would have to be spread across several people. Employees would have to learn to cooperate with many of their coworkers, not just a handful of them. The company would become much more resilient. If one person quit, it wouldn’t make much of a difference. It’s unclear whether this idea would actually work for companies, but it certainly does for neural networks. Neurons trained with dropout cannot co-adapt with their neighboring neurons; they have to be as useful as possible on their own. They also cannot rely excessively on just a few input neurons; they must pay attention to each of their input neurons. They end up being less sensitive to slight changes in the inputs. In the end, you get a more robust network that generalizes better.
Another way to understand the power of dropout is to realize that a unique neural network is generated at each training step. Since each neuron can be either present or absent, there are a total of 2^N possible networks (where N is the total number of droppable neurons). This is such a huge number that it is virtually impossible for the same neural network to be sampled twice. Once you have run 10,000 training steps, you have essentially trained 10,000 different neural networks (each with just one training instance). These neural networks are obviously not independent because they share many of their weights, but they are nevertheless all different. The resulting neural network can be seen as an averaging ensemble of all these smaller neural networks.
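To get a feel for how huge this number is, here is a quick back-of-the-envelope computation for the small Fashion MNIST model built below (784 input neurons plus 300 + 100 hidden neurons are droppable; this assumes dropout on every layer, as in the code that follows):

import math

N = 28 * 28 + 300 + 100   # droppable neurons in the model built below
print(f"2**{N} is about 10**{int(N * math.log10(2))} possible sub-networks")
# 2**1184 ≈ 10**356, vastly more than the number of atoms in the observable universe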
In practice, you can usually apply dropout only to the neurons in the top one to three layers (excluding the output layer).
There is one small but important technical detail. Suppose p = 50%, in which case during testing a neuron would be connected to twice as many input neurons as it would be (on average) during training. To compensate for this fact, we need to multiply each neuron’s input connection weights by 0.5 after training. If we don’t, each neuron will get a total input signal roughly twice as large as what the network was trained on and will be unlikely to perform well. More generally, we need to multiply each input connection weight by the keep probability (1 – p) after training. Alternatively, we can divide each neuron’s output by the keep probability during training (these alternatives are not perfectly equivalent, but they work equally well).
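To make the equivalence concrete, here is a minimal NumPy sketch of both compensation schemes (the activations and rate are made up for illustration; TensorFlow and Keras actually use the second, "inverted" scheme, scaling during training):

import numpy as np

rng = np.random.default_rng(42)
p = 0.5                                   # dropout rate; keep probability = 1 - p
activations = rng.random(100_000)         # some layer's outputs during training
mask = rng.random(activations.shape) >= p # True for neurons that are kept

# Scheme 1: no scaling during training; after training, the next layer's
# incoming weights would be multiplied by the keep probability (1 - p).
train_out_1 = activations * mask

# Scheme 2 ("inverted" dropout, used by tf.nn.dropout and keras.layers.Dropout):
# divide the surviving outputs by the keep probability during training.
train_out_2 = activations * mask / (1 - p)

print(activations.mean())   # ≈ 0.5
print(train_out_1.mean())   # ≈ 0.25: the next layer sees half the signal...
print(train_out_2.mean())   # ≈ 0.5 : ...unless we rescale during training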
To implement dropout using Keras, you can use the keras.layers.Dropout layer. During training, it randomly drops some inputs (setting them to 0) and divides the remaining inputs by the keep probability (i.e., remaining input / (1 - rate)). After training, it does nothing at all; it just passes the inputs on to the next layer.
The following code applies dropout regularization before every Dense layer, using a dropout rate of 0.2:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 2
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid))
Since dropout is only active during training, comparing the training loss and the validation loss can be misleading. In particular, a model may be overfitting the training set and yet have similar training and validation losses. So make sure to evaluate the training loss without dropout (e.g., after training).
If you observe that the model is overfitting, you can increase the dropout rate. Conversely, you should try decreasing the dropout rate if the model underfits the training set. It can also help to increase the dropout rate for large layers, and reduce it for small ones. Moreover, many state-of-the-art architectures only use dropout after the last hidden layer, so you may want to try this if full dropout is too strong.
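For instance, here is a minimal sketch of the "dropout only after the last hidden layer" variant mentioned above (same architecture as before; the rate of 0.2 is illustrative):

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),   # dropout only after the last hidden layer
    keras.layers.Dense(10, activation="softmax")
])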
If you want to regularize a self-normalizing network based on the SELU activation function (as discussed earlier), you should use alpha dropout: this is a variant of dropout that preserves the mean and standard deviation of its inputs. It was introduced in the same paper as SELU, as regular dropout would break self-normalization (see https://blog.csdn.net/Linli522362242/article/details/106935910).
Alpha Dropout is a Dropout that keeps the mean and variance of its inputs at their original values, in order to ensure the self-normalizing property even after this dropout. Alpha Dropout fits well with Scaled Exponential Linear Units by randomly setting activations to the negative saturation value. Its rate argument is a float, the drop probability (as with Dropout); the multiplicative noise will have standard deviation sqrt(rate / (1 - rate)).
@keras_export('keras.layers.AlphaDropout')
class AlphaDropout(Layer):
  """Applies Alpha Dropout to the input.

  Alpha Dropout is a `Dropout` that keeps mean and variance of inputs
  to their original values, in order to ensure the self-normalizing property
  even after this dropout.
  Alpha Dropout fits well to Scaled Exponential Linear Units
  by randomly setting activations to the negative saturation value.

  Arguments:
    rate: float, drop probability (as with `Dropout`).
      The multiplicative noise will have
      standard deviation `sqrt(rate / (1 - rate))`.
    seed: A Python integer to use as random seed.

  Call arguments:
    inputs: Input tensor (of any rank).
    training: Python boolean indicating whether the layer should behave in
      training mode (adding dropout) or in inference mode (doing nothing).

  Input shape:
    Arbitrary. Use the keyword argument `input_shape`
    (tuple of integers, does not include the samples axis)
    when using this layer as the first layer in a model.

  Output shape:
    Same shape as input.
  """

  def __init__(self, rate, noise_shape=None, seed=None, **kwargs):
    super(AlphaDropout, self).__init__(**kwargs)
    self.rate = rate
    self.noise_shape = noise_shape
    self.seed = seed
    self.supports_masking = True

  def _get_noise_shape(self, inputs):
    return self.noise_shape if self.noise_shape else array_ops.shape(inputs)

  def call(self, inputs, training=None):
    if 0. < self.rate < 1.:
      noise_shape = self._get_noise_shape(inputs)

      def dropped_inputs(inputs=inputs, rate=self.rate, seed=self.seed):  # pylint: disable=missing-docstring
        # SELU's fixed-point constants; alpha_p is the negative saturation value
        alpha = 1.6732632423543772848170429916717
        scale = 1.0507009873554804934193349852946
        alpha_p = -alpha * scale

        kept_idx = math_ops.greater_equal(
            K.random_uniform(noise_shape, seed=seed), rate)
        kept_idx = math_ops.cast(kept_idx, inputs.dtype)

        # Get affine transformation params (chosen to restore mean and variance)
        a = ((1 - rate) * (1 + rate * alpha_p**2))**-0.5
        b = -a * alpha_p * rate

        # Apply mask: dropped activations are set to the negative saturation value
        x = inputs * kept_idx + alpha_p * (1 - kept_idx)

        # Do affine transformation
        return a * x + b

      return K.in_train_phase(dropped_inputs, inputs, training=training)
    return inputs
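A quick numerical check (a sketch, not from the book) shows that AlphaDropout indeed preserves the mean and standard deviation of standardized inputs:

import tensorflow as tf
from tensorflow import keras
import numpy as np

tf.random.set_seed(42)
z = tf.random.normal([10_000, 100])            # standardized inputs: mean ≈ 0, std ≈ 1
alpha_drop = keras.layers.AlphaDropout(rate=0.2)
z_drop = alpha_drop(z, training=True)          # force training mode so dropout is applied
print(np.mean(z.numpy()), np.std(z.numpy()))            # ≈ 0.0, ≈ 1.0
print(np.mean(z_drop.numpy()), np.std(z_drop.numpy()))  # still ≈ 0.0, ≈ 1.0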
import tensorflow as tf
from tensorflow import keras
import numpy as np

tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.AlphaDropout(rate=0.2),
    keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.AlphaDropout(rate=0.2),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.AlphaDropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])
optimizer = keras.optimizers.SGD(lr=0.01, momentum=0.9, nesterov=True)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
n_epochs = 20
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid))
... ...
Since dropout is only active during training, comparing the training loss and the validation loss can be misleading: here train loss > validation loss, which looks like underfitting. Don't be misled; let's evaluate with dropout turned off:

model.evaluate(X_test_scaled, y_test)
model.evaluate(X_train_scaled, y_train)

Now train loss < test loss, so the model is actually overfitting the training set.

history = model.fit(X_train_scaled, y_train)
In 2016, a paper by Yarin Gal and Zoubin Ghahramani added a few more good reasons to use dropout: the paper established a profound connection between dropout networks and approximate Bayesian inference, and it introduced a powerful technique called MC Dropout, which can boost the performance of any trained dropout model without retraining it, and which provides a much better measure of the model's uncertainty.
If this all sounds like a “one weird trick” advertisement, then take a look at the following code. It is the full implementation of MC Dropout, boosting the dropout model we trained earlier without retraining it:
# We reuse the alpha-dropout model trained above, without retraining it:
y_probas = np.stack([model(X_test_scaled, training=True) for sample in range(100)])
We just make 100 predictions over the test set (100 predictions for each instance), setting training=True to ensure that the Dropout layer is active, and stack the predictions. Since dropout is active, all the predictions will be different. Recall that predict() returns a matrix with one row per instance and one column per class. Because there are 10,000 instances in the test set and 10 classes, this is a matrix of shape [10000, 10]. We stack 100 such matrices, so y_probas is an array of shape [100, 10000, 10].
Once we average over the first dimension (axis=0), we get y_proba, an array of shape [10000, 10], like we would get
with a single prediction. That’s all! Averaging over multiple predictions with dropout on gives us a Monte Carlo estimate that is generally more reliable than the result of a single prediction with dropout off.
y_proba = y_probas.mean(axis=0)
y_std = y_probas.std(axis=0)
For example, let’s look at the model’s prediction for the first instance in the Fashion MNIST test set, with dropout off:
np.round( model.predict(X_test_scaled[:1]),2)
y_test[:1]
The model seems almost certain that this image belongs to class 9 (ankle boot). Should you trust it? Is there really so little room for doubt? Compare this with the predictions made when dropout is activated:
# y_probas = np.stack([ model(X_test_scaled, training=True) for sample in range(100)])######
np.round(y_probas[:,:1],2)
... ...
This tells a very different story: apparently, when we activate dropout, the model is not sure anymore. It still seems to prefer class 9, but sometimes it hesitates with classes 5 (sandal) and 7 (sneaker), which makes sense given they’re all footwear.
y_probas.shape
# number of predictions=100 on the same instance, 10000 instances, 10 class(0~9)
y_proba.shape #y_proba = y_probas.mean(axis=0)
y_test.shape
Once we average over the first dimension, we get the following MC Dropout predictions:
np.round( y_proba[:1],2) # #y_proba = y_probas.mean(axis=0)
The model still thinks this image belongs to class 9, but only with 83% confidence, which seems much more reasonable than the 100% we got with dropout off (np.round(model.predict(X_test_scaled[:1]), 2)). Plus it's useful to know exactly which other classes it thinks are likely. And you can also take a look at the standard deviation of the probability estimates:
y_std = y_probas.std(axis=0)
np.round(y_std[:1],2)
Apparently there's quite a lot of variance in the probability estimates: if you were building a risk-sensitive system (e.g., a medical or financial system), you should probably treat such an uncertain prediction with extreme caution. You definitely would not treat it like a 100% confident prediction. Moreover, the model's accuracy is 85.8%:
y_pred = np.argmax( y_proba, axis=1)
accuracy = np.sum(y_pred == y_test)/len(y_test)
accuracy
The number of Monte Carlo samples you use (100 in this example) is a hyperparameter you can tweak. The higher it is, the more accurate the predictions and their uncertainty estimates will be. However, if you double it, inference time will also be doubled. Moreover, above a certain number of samples, you will notice little improvement. So your job is to find the right trade-off between latency and accuracy, depending on your application.
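As a rough sketch of this trade-off (reusing the model and test set from above; the sample counts are illustrative), you can time a few sample counts and compare accuracy and latency:

import time

for n_samples in (10, 50, 100):
    t0 = time.time()
    y_probas = np.stack([model(X_test_scaled, training=True)
                         for _ in range(n_samples)])
    y_pred = np.argmax(y_probas.mean(axis=0), axis=1)
    acc = np.sum(y_pred == y_test) / len(y_test)
    print(n_samples, "samples: accuracy =", round(acc, 4),
          "time =", round(time.time() - t0, 1), "s")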
If your model contains other layers that behave in a special way during training (such as BatchNormalization layers), then you should not force training mode like we just did. Instead, you should replace the Dropout layers with the following MCDropout class:
class MCDropout(keras.layers.Dropout):
    def call(self, inputs):
        return super().call(inputs, training=True)  # force training mode: dropout stays active
Here, we just subclass the Dropout layer and override the call() method to force its training argument to True (see Chapter 12).
Similarly, you could define an MCAlphaDropout class by subclassing AlphaDropout instead.
If you are creating a model from scratch, it's just a matter of using MCDropout rather than Dropout. But if you have a model that was already trained using Dropout, you need to create a new model that's identical to the existing model, except replacing the Dropout layers with MCDropout, then copy the existing model's weights to your new model.
tf.random.set_seed(42)
np.random.seed(42)

class MCAlphaDropout(keras.layers.AlphaDropout):
    def call(self, inputs):
        return super().call(inputs, training=True)

# model = keras.models.Sequential([
#     keras.layers.Flatten(input_shape=[28, 28]),
#     keras.layers.AlphaDropout(rate=0.2),
#     keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
#     keras.layers.AlphaDropout(rate=0.2),
#     keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
#     keras.layers.AlphaDropout(rate=0.2),
#     keras.layers.Dense(10, activation="softmax")
# ])
mc_model = keras.models.Sequential([
    MCAlphaDropout(layer.rate) if isinstance(layer, keras.layers.AlphaDropout) else layer
    for layer in model.layers
])
mc_model.summary()

optimizer = keras.optimizers.SGD(lr=0.01, momentum=0.9, nesterov=True)
mc_model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
                 metrics=["accuracy"])
mc_model.set_weights(model.get_weights())

# len(model.get_weights()) : 6 (a kernel and a bias for each of the 3 Dense layers)
len(model.get_weights()[0]), len(model.get_weights()[1]), len(model.get_weights()[2])
len(model.get_weights()[3]), len(model.get_weights()[4]), len(model.get_weights()[5])
Now we can use the model with MC Dropout:
np.round(np.mean([mc_model.predict(X_test_scaled[:1]) for sample in range(100)], axis=0), 2)
In short, MC Dropout is a fantastic technique that boosts dropout models and provides better uncertainty estimates. And of course, since it is just regular dropout during training, it also acts like a regularizer.
Another regularization technique that is popular for neural networks is called max-norm regularization: for each neuron, it constrains the weights w of the incoming connections such that ∥w∥₂ ≤ r, where r is the max-norm hyperparameter and ∥·∥₂ is the ℓ₂ norm.
Max-norm regularization does not add a regularization loss term to the overall loss function. Instead, it is typically implemented by computing ∥w∥₂ after each training step and rescaling w if needed (w ← w r / ∥w∥₂).
Reducing r increases the amount of regularization and helps reduce overfitting. Max-norm regularization can also help alleviate the unstable gradients problems (i.e., the vanishing/exploding gradients problems during training; see https://blog.csdn.net/Linli522362242/article/details/106935910) if you are not using Batch Normalization (which zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer: one for scaling, the other for shifting).
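Here is the rescaling rule as a minimal NumPy sketch (the weight vector is made up for illustration):

import numpy as np

r = 1.0                          # max-norm hyperparameter
w = np.array([0.6, -0.8, 1.2])   # one neuron's incoming weights
norm = np.linalg.norm(w)         # ||w||2 ≈ 1.56 > r
if norm > r:
    w = w * r / norm             # w <- w * r / ||w||2
print(w, np.linalg.norm(w))      # the norm is now exactly r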
To implement max-norm regularization in Keras, set the kernel_constraint argument of each hidden layer to a max_norm() constraint with the appropriate max value, like this:
layer = keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal",
                           kernel_constraint=keras.constraints.max_norm(1.))
After each training iteration, the model’s fit() method will call the object returned by max_norm(), passing it the layer’s weights and getting rescaled weights in return, which then replace the layer’s weights. As you’ll see in Chapter 12, you can define your own custom constraint function if necessary and use it as the kernel_constraint. You can also constrain the bias terms by setting the bias_constraint argument.
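For example, here is a sketch of a custom constraint; my_weight_clip is a hypothetical function (not part of Keras), but any callable that maps a weight tensor to a weight tensor will work:

import tensorflow as tf
from tensorflow import keras

def my_weight_clip(weights):  # called by fit() after each training step
    return tf.clip_by_value(weights, -1.0, 1.0)

layer = keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal",
                           kernel_constraint=my_weight_clip)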
The max_norm() function has an axis argument that defaults to 0. A Dense layer usually has weights of shape [number of inputs, number of neurons], so using axis=0 means that the max-norm constraint will apply independently to each neuron's weight vector. If you want to use max-norm with convolutional layers (see Chapter 14), make sure to set the max_norm() constraint's axis argument appropriately (usually axis=[0, 1, 2]).
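For instance, a Conv2D kernel has shape [kernel height, kernel width, input channels, output channels], so a sketch of constraining each output channel's filter independently looks like this:

conv = keras.layers.Conv2D(64, 3, activation="relu", padding="same",
                           kernel_constraint=keras.constraints.max_norm(1., axis=[0, 1, 2]))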
from functools import partial

MaxNormDense = partial(keras.layers.Dense,
                       activation="selu", kernel_initializer="lecun_normal",
                       kernel_constraint=keras.constraints.max_norm(1.))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    # Dense weights shape: [number of inputs == 784, number of neurons == 300]
    MaxNormDense(300),
    MaxNormDense(100),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 2
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid))
In this chapter we have covered a wide range of techniques, and you may be wondering which ones you should use. This depends on the task, and there is no clear consensus yet, but I have found the configuration in Table 11-3 to work fine in most cases, without requiring much hyperparameter tuning. That said, please do not consider these defaults as hard rules!
Table 11-3. Default DNN configuration
Kernel initializer: He initialization
Activation function: ELU
Normalization: None if shallow; Batch Norm if deep
Regularization: Early stopping (+ ℓ2 reg. if needed)
Optimizer: Momentum optimization (or RMSProp or Nadam)
Learning rate schedule: 1cycle
If the network is a simple stack of dense layers, then it can self-normalize, and you should use the configuration in Table 11-4 instead.
Table 11-4. DNN configuration for a self-normalizing net
Kernel initializer: LeCun initialization
Activation function: SELU
Normalization: None (self-normalization)
Regularization: Alpha dropout if needed
Optimizer: Momentum optimization (or RMSProp or Nadam)
Learning rate schedule: 1cycle
Don’t forget to normalize the input features! You should also try to reuse parts of a pretrained neural network if you can find one that solves a similar problem, or use unsupervised pretraining if you have a lot of unlabeled data, or use pretraining on an auxiliary task if you have a lot of labeled data for a similar task.
While the previous guidelines should cover most cases, here are some exceptions:
The SELU activation function is a good default.
If you need the neural network to be as fast as possible, you can use one of the leaky ReLU variants instead (e.g., a simple leaky ReLU using the default hyperparameter value).
The simplicity of the ReLU activation function makes it many people’s preferred option, despite the fact that it is generally outperformed by SELU and leaky ReLU. However, the ReLU activation function’s ability to output precisely zero can be useful in some cases (e.g., see Chapter 17). Moreover, it can sometimes benefit from optimized implementation as well as from hardware acceleration.
The hyperbolic tangent (tanh) can be useful in the output layer if you need to output a number between –1 and 1, but nowadays it is not used much in hidden layers (except in recurrent nets).
The logistic activation function is also useful in the output layer when you need to estimate a probability (e.g., for binary classification), but is rarely used in hidden layers (there are exceptions, for example the coding layer of variational autoencoders; see Chapter 17).
Finally, the softmax activation function is useful in the output layer to output probabilities for mutually exclusive classes, but it is rarely (if ever) used in hidden layers.
With these guidelines, you are now ready to train very deep nets! I hope you are now convinced that you can go quite a long way using just Keras. There may come a time, however, when you need to have even more control; for example, to write a custom loss function or to tweak the training algorithm. For such cases you will need to use TensorFlow’s lower-level API, as you will see in the next chapter.
import tensorflow as tf
from tensorflow import keras
import numpy as np

keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))  # 32 × 32-pixel color images
for _ in range(20):
    model.add(keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"))
model.add(keras.layers.Dense(10, activation="softmax"))

optimizer = keras.optimizers.Nadam(lr=5e-5)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=['accuracy'])
Let's load the CIFAR10 dataset. We also want to use early stopping, so we need a validation set. Let's use the first 5,000 images of the original training set as the validation set (early stopping will interrupt training when it measures no progress on the validation set for a number of epochs, defined by the patience argument, i.e., the number of epochs with no improvement after which training will be stopped). You can load the dataset with keras.datasets.cifar10.load_data(). It is composed of 60,000 32 × 32-pixel color images (50,000 for training, 10,000 for testing) with 10 classes.
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.cifar10.load_data()

X_train = X_train_full[5000:]
y_train = y_train_full[5000:]
X_valid = X_train_full[:5000]
y_valid = y_train_full[:5000]

early_stopping_cb = keras.callbacks.EarlyStopping(patience=20)
# https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint
# mode="auto", monitor='val_loss' (defaults)
model_checkpoint_cb = keras.callbacks.ModelCheckpoint("my_cifar10_model.h5", save_best_only=True)

run_index = 1  # increment every time you train the model
import os
run_logdir = os.path.join(os.curdir, "my_cifar10_logs", "run_{:03d}".format(run_index))
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)

callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]
model.fit(X_train, y_train, epochs=100,
          validation_data=(X_valid, y_valid),
          callbacks=callbacks)

model = keras.models.load_model("my_cifar10_model.h5")
model.evaluate(X_valid, y_valid)
K = keras.backend

class ExponentialLearningRate(keras.callbacks.Callback):
    def __init__(self, factor):
        self.factor = factor
        self.rates = []
        self.losses = []
    def on_batch_end(self, batch, logs):
        self.rates.append(K.get_value(self.model.optimizer.lr))
        self.losses.append(logs['loss'])
        # grow the learning rate exponentially after each batch
        K.set_value(self.model.optimizer.lr, self.model.optimizer.lr * self.factor)

def find_learning_rate(model, X, y, epochs=1, batch_size=32, min_rate=10**-5, max_rate=10):
    init_weights = model.get_weights()
    iterations = len(X) // batch_size * epochs
    factor = np.exp(np.log(max_rate / min_rate) / iterations)  # per-batch multiplicative step
    init_lr = K.get_value(model.optimizer.lr)  # save the initial learning rate
    K.set_value(model.optimizer.lr, min_rate)  # start from min_rate
    exp_lr = ExponentialLearningRate(factor)   # the callback updates the rate at each batch
    history = model.fit(X, y, epochs=epochs, batch_size=batch_size, callbacks=[exp_lr])
    K.set_value(model.optimizer.lr, init_lr)   # restore the initial learning rate
    model.set_weights(init_weights)            # restore the initial weights
    return exp_lr.rates, exp_lr.losses

import matplotlib.pyplot as plt

def plot_lr_vs_loss(rates, losses):
    plt.plot(rates, losses)
    plt.gca().set_xscale("log")
    plt.hlines(min(losses), min(rates), max(rates))
    plt.axis([min(rates), max(rates), min(losses), (losses[0] + min(losses)) / 2])
    plt.xlabel("Learning rate")
    plt.ylabel("Loss")

rates, losses = find_learning_rate(model, X_train, y_train, epochs=1)  # default batch_size=32
plot_lr_vs_loss(rates, losses)
(Suggestion: do not choose the learning rate at which the loss reaches its minimum, but a bit earlier; see "How Do You Find A Good Learning Rate": https://sgugger.github.io/how-do-you-find-a-good-learning-rate.html. In practice, what matters most when picking a learning rate is getting the order of magnitude right, e.g., 1e-3 vs 1e-2.)

c. Now try adding Batch Normalization and compare the learning curves: Is it converging faster than before? Does it produce a better model? How does it affect training speed?
The code below is very similar to the code above, with a few changes:
* I added a BN layer after every Dense layer (before the activation function; recall the general preference SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic), except for the output layer. I also added a BN layer before the first hidden layer (Machine Learning algorithms don't perform well when the input numerical attributes have very different scales, and standardization is much less affected by outliers; see https://blog.csdn.net/Linli522362242/article/details/106582512).
* I changed the learning rate to 5e-4. I experimented with 1e-5, 3e-5, 5e-5, 1e-4, 3e-4, 5e-4, 1e-3 and 3e-3, and I chose the one with the best validation performance after 20 epochs.
* I renamed the run directories to run_bn_* and the model file name to my_cifar10_bn_model.h5.
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
model.add(keras.layers.BatchNormalization())         # BN before the first hidden layer
for _ in range(20):
    model.add(keras.layers.Dense(100, kernel_initializer="he_normal"))
    model.add(keras.layers.BatchNormalization())     # BN before the activation
    model.add(keras.layers.Activation("elu"))
model.add(keras.layers.Dense(10, activation="softmax"))

optimizer = keras.optimizers.Nadam(lr=5e-4)          # larger learning rate
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=['accuracy'])

early_stopping_cb = keras.callbacks.EarlyStopping(patience=20)
model_checkpoint_cb = keras.callbacks.ModelCheckpoint("my_cifar10_bn_model.h5", save_best_only=True)
run_index = 1  # increment every time you train the model
run_logdir = os.path.join(os.curdir, "my_cifar10_logs", "run_bn_{:03d}".format(run_index))
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]

model.fit(X_train, y_train, epochs=100,
          validation_data=(X_valid, y_valid),
          callbacks=callbacks)

model = keras.models.load_model("my_cifar10_bn_model.h5")
model.evaluate(X_valid, y_valid)
... ...
... ...
Is the model converging faster than before?
Much faster! The previous model took 36/37 epochs to reach the lowest validation loss, while the new model with BN took 18 epochs, roughly twice as fast. The BN layers stabilized training and allowed us to use a much larger learning rate, so convergence was faster.
Does BN produce a better model?
Yes! The final model is also much better, with 54.2% accuracy instead of 47%. It's still not a very good model, but at least it's much better than before (a Convolutional Neural Network would do much better, but that's a different topic, see chapter 14).
How does BN affect training speed?
Although the model converged about twice as fast, each epoch took more time because of the extra computations required by the BN layers. So overall, even though the number of epochs was reduced by 50%, the total training time (wall time) was still shortened, which is pretty significant!
rates, losses = find_learning_rate(model, X_train, y_train, epochs=1)
plot_lr_vs_loss(rates, losses)
d. Try replacing Batch Normalization with SELU, and make the necessary adjustments to ensure the network self-normalizes (i.e., standardize the input features, use LeCun normal initialization, make sure the DNN contains only a sequence of dense layers, etc.).
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    # LeCun normal initialization is required for SELU to self-normalize
    model.add(keras.layers.Dense(100, kernel_initializer="lecun_normal", activation="selu"))
model.add(keras.layers.Dense(10, activation="softmax"))

optimizer = keras.optimizers.Nadam(lr=5e-4)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=['accuracy'])

early_stopping_cb = keras.callbacks.EarlyStopping(patience=20)
model_checkpoint_cb = keras.callbacks.ModelCheckpoint("my_cifar10_selu_model.h5", save_best_only=True)
run_index = 1  # increment every time you train the model
run_logdir = os.path.join(os.curdir, "my_cifar10_logs", "run_selu_{:03d}".format(run_index))
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]

# Standardize the inputs (SELU expects standardized inputs):
X_means = X_train.mean(axis=0)  # per-pixel mean over the training instances
X_stds = X_train.std(axis=0)
X_train_scaled = (X_train - X_means) / X_stds
X_valid_scaled = (X_valid - X_means) / X_stds
X_test_scaled = (X_test - X_means) / X_stds

model.fit(X_train_scaled, y_train, epochs=100,
          validation_data=(X_valid_scaled, y_valid),
          callbacks=callbacks)

model = keras.models.load_model("my_cifar10_selu_model.h5")
model.evaluate(X_valid_scaled, y_valid)
... ...
... ...
With optimizer = keras.optimizers.Nadam(lr=5e-4), we get 50.12% accuracy, which is better than the original model, but not quite as good as the model using batch normalization. Moreover, it took only 10 epochs to reach the best model, which is much faster than both the original model and the BN model. So it's by far the fastest model to train (both in terms of epochs and wall time).
# optimizer = keras.optimizers.Nadam(lr=7e-4)
model = keras.models.load_model("my_cifar10_selu_model.h5")
model.evaluate(X_valid_scaled, y_valid)
After comparing the two loss curves obtained with different learning rates, I believe lr=5e-4 is better than lr=7e-4, given the convergence speed and the lower losses (the difference is not too large).

# optimizer = keras.optimizers.Nadam(lr=7e-4)
rates, losses = find_learning_rate(model, X_train_scaled, y_train, epochs=1)
plot_lr_vs_loss(rates, losses)
# optimizer = keras.optimizers.Nadam(lr=9e-4)
model = keras.models.load_model("my_cifar10_selu_model.h5")
model.evaluate(X_valid_scaled, y_valid)

rates, losses = find_learning_rate(model, X_train_scaled, y_train, epochs=1)
plot_lr_vs_loss(rates, losses)
This confirms that what matters most when choosing the learning rate is getting the order of magnitude right (e.g., 1e-3 vs 1e-2): with the right order of magnitude, training converges; after that, fine-tune the factor (here lr=5e-4) for convergence speed.
##############################################
e. Try regularizing the model with alpha dropout. Then, without retraining your model, see if you can achieve better accuracy using MC Dropout.
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(keras.layers.Dense(100,
                                 kernel_initializer="lecun_normal",
                                 activation="selu"))
model.add(keras.layers.AlphaDropout(rate=0.1))  # alpha dropout after the last hidden layer
model.add(keras.layers.Dense(10, activation="softmax"))

optimizer = keras.optimizers.Nadam(lr=5e-4)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

early_stopping_cb = keras.callbacks.EarlyStopping(patience=20)
model_checkpoint_cb = keras.callbacks.ModelCheckpoint("my_cifar10_alpha_dropout_model.h5",
                                                      save_best_only=True)
run_index = 1  # increment every time you train the model
run_logdir = os.path.join(os.curdir, "my_cifar10_logs",
                          "run_alpha_dropout_{:03d}".format(run_index))
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]

X_means = X_train.mean(axis=0)
X_stds = X_train.std(axis=0)
X_train_scaled = (X_train - X_means) / X_stds
X_valid_scaled = (X_valid - X_means) / X_stds
X_test_scaled = (X_test - X_means) / X_stds

model.fit(X_train_scaled, y_train, epochs=100,
          validation_data=(X_valid_scaled, y_valid),
          callbacks=callbacks)

model = keras.models.load_model("my_cifar10_alpha_dropout_model.h5")
model.evaluate(X_valid_scaled, y_valid)
... ...
... ...
The model reaches 48.66% accuracy on the validation set. That's very slightly worse than without dropout (50.12%). With an extensive hyperparameter search, it might be possible to do better (I tried dropout rates of 5%, 10%, 20% and 40%, and learning rates 1e-4, 3e-4, 5e-4, and 1e-3), but probably not much better in this case.
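A sketch of that small search could look like the following; build_and_train_model is a hypothetical helper (not defined above) that would build the alpha-dropout model with the given hyperparameters, train it, and return the best validation accuracy:

best_acc, best_params = 0.0, None
for rate in (0.05, 0.1, 0.2, 0.4):
    for lr in (1e-4, 3e-4, 5e-4, 1e-3):
        val_acc = build_and_train_model(alpha_dropout_rate=rate, learning_rate=lr)
        if val_acc > best_acc:
            best_acc, best_params = val_acc, (rate, lr)
print(best_params, best_acc)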
Let's use MC Dropout now. We will need the MCAlphaDropout class we used earlier, so let's just copy it here for convenience:
class MCAlphaDropout(keras.layers.AlphaDropout):
    def call(self, inputs):
        return super().call(inputs, training=True)  # force training mode: dropout stays active

# Now let's create a new model, identical to the one we just trained (with the same weights),
# but with MCAlphaDropout layers instead of AlphaDropout layers:
mc_model = keras.models.Sequential([
    MCAlphaDropout(layer.rate) if isinstance(layer, keras.layers.AlphaDropout) else layer
    for layer in model.layers
])
# We don't need to compile mc_model: we only use it for inference (predictions),
# not for further training.
# optimizer = keras.optimizers.Nadam(lr=5e-4)
# mc_model.compile(loss="sparse_categorical_crossentropy",
#                  optimizer=optimizer,
#                  metrics=["accuracy"])
# Then let's add a couple of utility functions. The first runs the model many times
# (10 by default) and returns the mean predicted class probabilities:
def mc_dropout_predict_probas(mc_model, X, n_samples=10):
    # Each element of Y_probas is one stochastic prediction over X (dropout active),
    # i.e., an array of shape [len(X), 10]; there are n_samples of them.
    Y_probas = [mc_model.predict(X) for sample in range(n_samples)]
    return np.mean(Y_probas, axis=0)  # shape [len(X), 10]

# The second uses these mean probabilities to predict the most likely class
# for each instance:
def mc_dropout_predict_classes(mc_model, X, n_samples=10):
    Y_probas = mc_dropout_predict_probas(mc_model, X, n_samples)
    return np.argmax(Y_probas, axis=1)  # shape [len(X)], e.g. [0, 9, ..., 8]
# Now let's make predictions for all the instances in the validation set
# and compute the accuracy:
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

# No fit() here: we reuse the alpha-dropout model that has already been trained.
y_pred = mc_dropout_predict_classes(mc_model, X_valid_scaled)
accuracy = np.mean(y_pred == y_valid[:, 0])  # y_valid.shape == (5000, 1)
accuracy
We get virtually no accuracy improvement in this case (in fact a tiny drop, from 48.66% to 48.62%).
So the best model we got in this exercise is the Batch Normalization model (54.12%).
##############################################
f. Retrain your model using 1cycle scheduling and see if it improves training speed and model accuracy.
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(keras.layers.Dense(100, kernel_initializer="lecun_normal", activation="selu"))
model.add(keras.layers.AlphaDropout(rate=0.1))
model.add(keras.layers.Dense(10, activation="softmax"))

optimizer = keras.optimizers.SGD(lr=1e-3)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=['accuracy'])
# Reuse the ExponentialLearningRate callback and the find_learning_rate()
# and plot_lr_vs_loss() functions defined earlier.
batch_size = 128
rates, losses = find_learning_rate(model, X_train_scaled, y_train, epochs=1, batch_size=batch_size)
plot_lr_vs_loss(rates, losses)
plt.axis([min(rates), max(rates),
          min(losses), (losses[0] + min(losses)) / 1.4])
From the learning rate vs. loss curve, a good learning rate lies roughly between 1e-4 and 1e-3.
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(keras.layers.Dense(100,
                                 kernel_initializer="lecun_normal",
                                 activation="selu"))
model.add(keras.layers.AlphaDropout(rate=0.1))
model.add(keras.layers.Dense(10, activation="softmax"))

optimizer = keras.optimizers.SGD(lr=1e-3)  # from the learning rate vs. loss curve
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])
class OneCycleScheduler(keras.callbacks.Callback):
    def __init__(self, iterations, max_rate, start_rate=None,
                 last_iterations=None, last_rate=None):
        self.iterations = iterations  # total number of iterations (batches)
        self.max_rate = max_rate
        self.start_rate = start_rate or max_rate / 10
        self.last_iterations = last_iterations or iterations // 10 + 1
        self.half_iteration_pos = (iterations - self.last_iterations) // 2
        # finish the last few iterations by dropping the rate down
        # by several orders of magnitude
        self.last_rate = last_rate or self.start_rate / 1000
        self.iteration_pos = 0

    def _interpolate(self, iter1, iter2, rate1, rate2):
        # linear interpolation: rate = slope * (iteration_pos - iter1) + rate1,
        # with slope = (rate2 - rate1) / (iter2 - iter1)
        return ((rate2 - rate1) * (self.iteration_pos - iter1)
                / (iter2 - iter1) + rate1)

    def on_batch_begin(self, batch, logs):
        if self.iteration_pos < self.half_iteration_pos:
            # first half of the cycle: ramp up from start_rate to max_rate
            rate = self._interpolate(0, self.half_iteration_pos,
                                     self.start_rate, self.max_rate)
        elif self.iteration_pos < 2 * self.half_iteration_pos:
            # second half of the cycle: ramp back down to start_rate
            rate = self._interpolate(self.half_iteration_pos, 2 * self.half_iteration_pos,
                                     self.max_rate, self.start_rate)
        else:
            # last few iterations: drop down to last_rate
            rate = self._interpolate(2 * self.half_iteration_pos, self.iterations,
                                     self.start_rate, self.last_rate)
        self.iteration_pos += 1
        K.set_value(self.model.optimizer.lr, rate)  # update the learning rate
n_epochs = 15
# max_rate=0.02 was picked from the learning rate vs. loss curve
onecycle = OneCycleScheduler(len(X_train_scaled) // batch_size * n_epochs, max_rate=0.02)
history = model.fit(X_train_scaled, y_train, epochs=n_epochs, batch_size=batch_size,
                    validation_data=(X_valid_scaled, y_valid),
                    callbacks=[onecycle])
model.evaluate(X_valid_scaled, y_valid)
... ...
One cycle allowed us to train the model in just 15 epochs, each taking only 75 seconds (thanks to the larger batch size). This is over 3 times faster than the fastest model we trained so far. Moreover, we improved the model's performance (from 48.66% to 51%). The batch normalized model reaches a slightly better performance, but it's much slower to train.