11_Training Deep Neural Networks_VarianceScaling_leaky relu_PReLU_SELU _Batch Normalization_Reusing
https://blog.csdn.net/Linli522362242/article/details/106935910
11_Training Deep Neural Networks_2_transfer learning_RBMs_Momentum_Nesterov Accelerated Gra_AdaGrad_RMSProp
https://blog.csdn.net/Linli522362242/article/details/106982127
Adam, which stands for adaptive moment estimation, combines the ideas of momentum optimization and RMSProp: just like momentum optimization, it keeps track of an exponentially decaying average of past gradients; and just like RMSProp, it keeps track of an exponentially decaying average of past squared gradients (see Equation 11-8).
Equation 11-8. Adam algorithm
1. m ← β1·m − (1 − β1)·∇θJ(θ)
2. s ← β2·s + (1 − β2)·∇θJ(θ) ⊗ ∇θJ(θ)
3. m̂ ← m / (1 − β1^t)
4. ŝ ← s / (1 − β2^t)
5. θ ← θ + η·m̂ ⊘ √(ŝ + ε)
(t represents the iteration number, starting at 1; ⊗ and ⊘ denote element-wise multiplication and division)
As you can see, estimating the moments of the gradients directly requires no extra memory, and the estimates adapt dynamically to the gradients, which places a dynamic, clearly bounded constraint on the effective learning rate.
If you just look at steps 1, 2, and 5, you will notice Adam’s close similarity to both momentum optimization and RMSProp. The only difference is that step 1 computes an exponentially decaying average rather than an exponentially decaying sum, but these are actually equivalent except for a constant factor (the decaying average is just 1 − β1 times the decaying sum). Steps 3 and 4 are somewhat of a technical detail: since m and s are initialized at 0, they will be biased toward 0 at the beginning of training, so these two steps will help boost m and s at the beginning of training.
The momentum decay hyperparameter β1 is typically initialized to 0.9, while the scaling decay hyperparameter β2 is often initialized to 0.999. As earlier, the smoothing term ε is usually initialized to a tiny number such as 10^-7. These are the default values for the Adam class (to be precise, epsilon defaults to None, which tells Keras to use keras.backend.epsilon(), which defaults to 10^-7; you can change it using keras.backend.set_epsilon()). Here is how to create an Adam optimizer using Keras:
optimizer = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999)
Since Adam is an adaptive learning rate algorithm (like AdaGrad and RMSProp), it requires less tuning of the learning rate hyperparameter η. You can often use the default value η = 0.001, making Adam even easier to use than Gradient Descent.
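To make Equation 11-8 concrete, here is a minimal NumPy sketch of a single Adam update; the function name adam_step and its signature are illustrative (not a library API), and it uses the usual minimization sign convention (equivalent to the equation above, which folds the minus sign into m):

import numpy as np

def adam_step(theta, grad, m, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    # step 1: exponentially decaying average of past gradients
    m = beta1 * m + (1 - beta1) * grad
    # step 2: exponentially decaying average of past squared gradients
    s = beta2 * s + (1 - beta2) * grad**2
    # steps 3 and 4: bias correction (m and s start at 0, so boost them early on)
    m_hat = m / (1 - beta1**t) # t is the iteration number, starting at 1
    s_hat = s / (1 - beta2**t)
    # step 5: parameter update
    theta = theta - lr * m_hat / (np.sqrt(s_hat) + eps)
    return theta, m, s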
If you are starting to feel overwhelmed by all these different techniques and are wondering how to choose the right ones for your task, don’t worry: some practical guidelines are provided at the end of this chapter.
Finally, two variants of Adam are worth mentioning:
AdaMax: Adam scales down the parameter updates by the ℓ2 norm of the time-decayed gradients; AdaMax replaces this ℓ2 norm with the ℓ∞ norm (i.e., the max of the absolute time-decayed gradients). This can make AdaMax more stable than Adam on some datasets, though in practice Adam generally performs better.
Nadam: Adam optimization plus the Nesterov trick, so it will often converge slightly faster than Adam.
As you can see, Nadam imposes a stronger constraint on the learning rate and has a more direct influence on the gradient updates. In general, wherever you would use RMSProp with momentum, or Adam, you can usually substitute Nadam and get better results.
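Both variants are available as Keras optimizer classes; for example (the arguments shown are the default hyperparameter values):

optimizer = keras.optimizers.Adamax(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
optimizer = keras.optimizers.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)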
Adaptive optimization methods (including RMSProp, Adam, and Nadam optimization) are often great, converging fast to a good solution. However, a 2017 paper by Ashia C. Wilson et al. (“The Marginal Value of Adaptive Gradient Methods in Machine Learning,” Advances in Neural Information Processing Systems 30 (2017): 4148–4158) showed that they can lead to solutions that generalize poorly on some datasets. So when you are disappointed by your model’s performance, try using plain Nesterov Accelerated Gradient instead: your dataset may just be allergic to adaptive gradients. Also check out the latest research, because it’s moving fast.
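Trying plain Nesterov Accelerated Gradient in Keras is a one-liner (SGD with momentum plus nesterov=True):

optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)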
All the optimization techniques discussed so far only rely on the first-order partial derivatives (Jacobians). The optimization literature also contains amazing algorithms based on the second-order partial derivatives (the Hessians, which are the partial derivatives of the Jacobians). Unfortunately, these algorithms are very hard to apply to deep neural networks because there are n² Hessians per output (where n is the number of parameters), as opposed to just n Jacobians per output. Since DNNs typically have tens of thousands of parameters, the second-order optimization algorithms often don’t even fit in memory, and even when they do, computing the Hessians is just too slow.
#######################################
All the optimization algorithms just presented produce dense models, meaning that most parameters will be nonzero. If you need a blazingly fast model at runtime, or if you need it to take up less memory, you may prefer to end up with a sparse model instead.
One easy way to achieve this is to train the model as usual, then get rid of the tiny weights (set them to zero). Note that this will typically not lead to a very sparse model, and it may degrade the model’s performance.
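As a minimal sketch of this first approach (the threshold below is purely illustrative and would need tuning):

import numpy as np

threshold = 1e-3 # hypothetical cutoff
for layer in model.layers:
    weights = layer.get_weights() # list of arrays (e.g., kernel and bias)
    layer.set_weights([np.where(np.abs(w) < threshold, 0.0, w) for w in weights])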
A better option is to apply strong ℓ1 regularization during training (we will see how later in this chapter), as it pushes the optimizer to zero out as many weights as it can (as discussed in “Lasso Regression” on page 137 in Chapter 4, https://blog.csdn.net/Linli522362242/article/details/104070847): Lasso Regression tends to completely eliminate the weights of the least important features (i.e., set them to zero), so all the weights for the high-degree polynomial features end up equal to zero. In other words, Lasso Regression automatically performs feature selection and outputs a sparse model (i.e., with few nonzero feature weights).
If these techniques remain insufficient, check out the TensorFlow Model Optimization Toolkit (TF-MOT), which provides a pruning API capable of iteratively removing connections during training based on their magnitude.
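For example, here is a sketch of magnitude-based pruning with TF-MOT (assuming the toolkit is installed and imported as tensorflow_model_optimization, and assuming a compiled Keras model plus the Fashion MNIST data used later in this chapter; the sparsity target and step counts are illustrative):

import tensorflow_model_optimization as tfmot

pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.8, # gradually prune up to 80% of the weights
    begin_step=0, end_step=10000)
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=pruning_schedule)
pruned_model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
# the UpdatePruningStep callback keeps the pruning step counter in sync during training
pruned_model.fit(X_train_scaled, y_train, epochs=2,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])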
#######################################
Table 11-2 compares all the optimizers we’ve discussed so far (* is bad, ** is average, and *** is good).
Table 11-2. Optimizer comparison
Class                               Convergence speed   Convergence quality
SGD                                 *                   ***
SGD(momentum=...)                   **                  ***
SGD(momentum=..., nesterov=True)    **                  ***
Adagrad                             ***                 * (stops too early)
RMSprop                             ***                 ** or ***
Adam                                ***                 ** or ***
Nadam                               ***                 ** or ***
AdaMax                              ***                 ** or ***
Finding a good learning rate is very important. If you set it much too high, training may diverge (as we discussed in “Gradient Descent” on page 118). If you set it too low, training will eventually converge to the optimum, but it will take a very long time. If you set it slightly too high, it will make progress very quickly at first, but it will end up dancing around the optimum, never really settling down. If you have a limited computing budget, you may have to interrupt training before it has converged properly, yielding a suboptimal solution (see Figure 11-8).
Figure 11-8. Learning curves for various learning rates η
As we discussed in Chapter 10 (https://blog.csdn.net/Linli522362242/article/details/106849041), one way to find a good learning rate is to train the model for a few hundred iterations, starting with a very low learning rate (e.g., 10^-5) and gradually increasing it up to a very large value (e.g., 10). This is done by multiplying the learning rate by a constant factor at each iteration (e.g., by exp(log(10^6)/500) ≈ 1.028 to go from 10^-5 to 10 in 500 iterations). Then look at the learning curve and pick a learning rate slightly lower than the point at which the loss starts to climb back up (the optimal learning rate is typically about 10 times lower than that turning point). You can then reinitialize your model and train it with that learning rate; the find_learning_rate() function later in this chapter implements exactly this.
But you can do better than a constant learning rate: if you start with a large learning rate and then reduce it once training stops making fast progress, you can reach a good solution faster than with the optimal constant learning rate. There are many different strategies to reduce the learning rate during training. It can also be beneficial to start with a low learning rate, increase it, then drop it again. These strategies are called learning schedules (we briefly introduced this concept in Chapter 4). The most commonly used learning schedules, all covered below, are power scheduling, exponential scheduling, piecewise constant scheduling, performance scheduling, and 1cycle scheduling.
where alpha and scale are predefined constants (alpha ≈ 1.67326324 and scale ≈ 1.05070098).
import numpy as np
from scipy.special import erfc

# alpha and scale to self-normalize with mean 0 and standard deviation 1
# (see equation 14 in the paper https://arxiv.org/pdf/1706.02515.pdf):
alpha_0_1 = -np.sqrt(2/np.pi) / (erfc(1/np.sqrt(2)) * np.exp(1/2) - 1) # ≈ 1.6732632423543778
scale_0_1 = ((1 - erfc(1/np.sqrt(2)) * np.sqrt(np.e)) * np.sqrt(2*np.pi)
             * (2 * erfc(np.sqrt(2)) * np.e**2
                + np.pi * erfc(1/np.sqrt(2))**2 * np.e
                - 2*(2 + np.pi) * erfc(1/np.sqrt(2)) * np.sqrt(np.e)
                + np.pi + 2)**(-1/2)) # ≈ 1.0507009873554805

def elu(z, alpha=1):
    # exponential linear unit, needed by selu() below
    return np.where(z < 0, alpha * (np.exp(z) - 1), z)

def selu(z, scale=scale_0_1, alpha=alpha_0_1):
    return scale * elu(z, alpha)
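A quick sanity check of this implementation (illustrative input values): selu(0) should be 0, and for large negative inputs the function saturates near -scale_0_1 * alpha_0_1 ≈ -1.758:

z = np.array([-5.0, -1.0, 0.0, 1.0])
print(selu(z)) # ≈ [-1.746, -1.111, 0., 1.051]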
https://blog.csdn.net/Linli522362242/article/details/106935910
https://towardsdatascience.com/selu-make-fnns-great-again-snn-8d61526802a9
https://www.tensorflow.org/api_docs/python/tf/keras/activations/selu?hl=ru&authuser=19
from tensorflow import keras
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
# scale the pixel intensities down to the 0–1 range by dividing them by 255.0
#(this also converts them to floats)
X_train_full = X_train_full/255.0
X_test = X_test/255.0
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
# for using Scaled ELU (SELU) activation function
# The input features must be standardized (mean 0 and standard deviation 1).
pixel_means = X_train.mean(axis=0, keepdims=True) # axis=0 for all instances # pixel_means.shape : (1, 28, 28)
pixel_stds = X_train.std(axis=0, keepdims=True) # pixel_stds.shape : (1, 28, 28)
X_train_scaled = (X_train-pixel_means)/pixel_stds
X_valid_scaled = (X_valid-pixel_means)/pixel_stds
X_test_scaled = (X_test - pixel_means)/pixel_stds
momentum: a float between 0 (high friction) and 1 (no friction). The update rule for θ with gradient g when momentum is 0.0:
θ = θ - learning_rate * g
The update rule when momentum is larger than 0.0 (β > 0):
velocity = momentum * velocity - learning_rate * g
θ = θ + velocity
The velocity term is negative when descending (the gradient points downhill, i.e., is negative in the descent direction), but the implementation works with a positive learning_rate and subtracts, which is easier to compute.
If nesterov is False, the gradient is evaluated at θ. If nesterov is True, the gradient is evaluated at θ + momentum * velocity, and the variables always store θ + momentum * velocity instead of θ.
@keras_export("keras.optimizers.schedules.InverseTimeDecay")
class InverseTimeDecay(LearningRateSchedule):
"""A LearningRateSchedule that uses an inverse time decay schedule."""
def __init__(
self,
initial_learning_rate,
decay_steps, #default =1
decay_rate,
staircase=False,
name=None):
"""Applies inverse time decay to the initial learning rate.
```python
def decayed_learning_rate(step):
return initial_learning_rate / (1 + decay_rate * steps / decay_steps)
```
or, if `staircase` is `True`, as:
```python
def decayed_learning_rate(step):
return initial_learning_rate / (1 + decay_rate * floor(steps / decay_steps))
```
Args:
initial_learning_rate: A scalar `float32` or `float64` `Tensor` or a
Python number. The initial learning rate.
decay_steps: How often to apply decay.
decay_rate: A Python number. The decay rate.
staircase: Whether to apply decay in a discrete staircase, as opposed to
continuous, fashion.
name: String. Optional name of the operation. Defaults to
'InverseTimeDecay'.
"""
super(InverseTimeDecay, self).__init__()
self.initial_learning_rate = initial_learning_rate
self.decay_steps = decay_steps
self.decay_rate = decay_rate
self.staircase = staircase
self.name = name
def __call__(self, step):
with ops.name_scope_v2(self.name or "InverseTimeDecay") as name:
initial_learning_rate = ops.convert_to_tensor_v2(
self.initial_learning_rate, name="initial_learning_rate")
dtype = initial_learning_rate.dtype
decay_steps = math_ops.cast(self.decay_steps, dtype)
decay_rate = math_ops.cast(self.decay_rate, dtype)
#initial_learning_rate / (1 + decay_rate * step / decay_step)###################
global_step_recomp = math_ops.cast(step, dtype)
p = global_step_recomp / decay_steps # steps / decay_steps
if self.staircase:
p = math_ops.floor(p)
const = math_ops.cast(constant_op.constant(1), dtype) # 1
denom = math_ops.add(const, math_ops.multiply(decay_rate, p)) # (1 + decay_rate * step / decay_steps)
return math_ops.divide(initial_learning_rate, denom, name=name)# initial_learning_rate / denom
decayed_learning_rate: initial_learning_rate / (1 + decay_rate * step / decay_steps)
This is power scheduling, lr = lr0 / (1 + step/s)**c, where Keras uses c = 1 and s = decay_steps/decay; with decay_steps = 1:
lr = lr0 / (1 + step / (decay_steps/decay))**1
   = lr0 / (1 + step / (1/decay))
   = lr0 / (1 + decay * step)
   = initial_learning_rate / (1 + decay_rate * step / 1)
Implementing power scheduling in Keras is the easiest option: just set the decay hyperparameter when creating an optimizer. The decay is the inverse of s (the number of steps it takes to divide the learning rate by one more unit), and Keras assumes that c is equal to 1:
#class SGD(tensorflow.python.keras.optimizer_v2.optimizer_v2.OptimizerV2)
# | SGD(learning_rate=0.01, momentum=0.0, nesterov=False, name='SGD', **kwargs)
optimizer = keras.optimizers.SGD( lr=0.01, decay=1e-4)
import tensorflow as tf
import numpy as np
tf.random.set_seed(42)
np.random.seed(42)
model = keras.models.Sequential([
keras.layers.Flatten(input_shape=[28,28]), # 1D array: 28*28
keras.layers.Dense( 300, activation="selu", kernel_initializer="lecun_normal" ),#Scaled ELU
keras.layers.Dense( 100, activation="selu", kernel_initializer="lecun_normal" ),
keras.layers.Dense( 10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
n_epochs=25
history = model.fit( X_train_scaled, y_train, epochs=n_epochs, validation_data=(X_valid_scaled, y_valid) )
import matplotlib.pyplot as plt
learning_rate = 0.01
decay = 1e-4
batch_size=32
n_steps_per_epoch = len(X_train) //batch_size
epochs = np.arange(n_epochs)
lrs = learning_rate / (1 + decay* epochs*n_steps_per_epoch )
plt.plot( epochs, lrs, "o-")
plt.axis([0, n_epochs-1, 0, 0.01])
plt.xlabel("Epoch")
plt.ylabel("Learning Rate")
plt.title("Power Scheduling", fontsize=14)
plt.grid(True)
plt.show()
# class SGD(tensorflow.python.keras.optimizer_v2.optimizer_v2.OptimizerV2)
# | SGD(learning_rate=0.01, momentum=0.0, nesterov=False, name='SGD', **kwargs)
# optimizer = keras.optimizers.SGD( lr=0.01, decay=1e-4)
initial_learning_rate = 0.01
decay = 1e-4
decay_steps = 1
learning_rate_fn = keras.optimizers.schedules.InverseTimeDecay( initial_learning_rate, decay_steps, decay )
import tensorflow as tf
import numpy as np
tf.random.set_seed(42)
np.random.seed(42)
model = keras.models.Sequential([
keras.layers.Flatten(input_shape=[28,28]), # 1D array: 28*28
keras.layers.Dense( 300, activation="selu", kernel_initializer="lecun_normal" ),
keras.layers.Dense( 100, activation="selu", kernel_initializer="lecun_normal" ),
keras.layers.Dense( 10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy",
optimizer=keras.optimizers.SGD(learning_rate=learning_rate_fn),
metrics=["accuracy"])
n_epochs=25
history = model.fit( X_train_scaled, y_train, epochs=n_epochs, validation_data=(X_valid_scaled, y_valid) )
import matplotlib.pyplot as plt
learning_rate = 0.01
decay = 1e-4
batch_size=32
n_steps_per_epoch = len(X_train) //batch_size
epochs = np.arange(n_epochs)
lrs = learning_rate / (1 + decay* epochs*n_steps_per_epoch )
plt.plot( epochs, lrs, "o-")
plt.axis([0, n_epochs-1, 0, 0.01])
plt.xlabel("Epoch")
plt.ylabel("Learning Rate")
plt.title("Power Scheduling", fontsize=14)
plt.grid(True)
plt.show()
@keras_export("keras.optimizers.schedules.ExponentialDecay")
class ExponentialDecay(LearningRateSchedule):
"""A LearningRateSchedule that uses an exponential decay schedule."""
def __init__(
self,
initial_learning_rate,
decay_steps,
decay_rate,
staircase=False,
name=None):
"""Applies exponential decay to the learning rate.
```python
def decayed_learning_rate(step):
return initial_learning_rate * decay_rate ^ (step / decay_steps)
```
```python
initial_learning_rate = 0.1
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
initial_learning_rate,
decay_steps=100000,
decay_rate=0.96,
staircase=True)
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr_schedule),
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
model.fit(data, labels, epochs=5)
```
Args:
initial_learning_rate: A scalar `float32` or `float64` `Tensor` or a
Python number. The initial learning rate.
decay_steps: A scalar `int32` or `int64` `Tensor` or a Python number.
Must be positive. See the decay computation above.
decay_rate: A scalar `float32` or `float64` `Tensor` or a
Python number. The decay rate.
staircase: Boolean. If `True` decay the learning rate at discrete
intervals
name: String. Optional name of the operation. Defaults to
'ExponentialDecay'.
"""
super(ExponentialDecay, self).__init__()
self.initial_learning_rate = initial_learning_rate
self.decay_steps = decay_steps
self.decay_rate = decay_rate
self.staircase = staircase
self.name = name
def __call__(self, step):
with ops.name_scope_v2(self.name or "ExponentialDecay") as name:
initial_learning_rate = ops.convert_to_tensor_v2(
self.initial_learning_rate, name="initial_learning_rate") # initial_learning_rate
dtype = initial_learning_rate.dtype
decay_steps = math_ops.cast(self.decay_steps, dtype)
decay_rate = math_ops.cast(self.decay_rate, dtype) # 0.1
global_step_recomp = math_ops.cast(step, dtype)
p = global_step_recomp / decay_steps # t/s=step /decay_steps
if self.staircase:
p = math_ops.floor(p)
return math_ops.multiply(
initial_learning_rate, math_ops.pow(decay_rate, p), name=name)#initial_learning_rate*decay_rate^(t/s)
lr = lr0 * 0.1**(epoch / s)
Exponential scheduling and piecewise scheduling are quite simple too. You first need to define a function that takes the current epoch and returns the learning rate. For example, let’s implement exponential scheduling:
# initial_learning_rate = 0.01
# lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
# initial_learning_rate,
# decay_steps=20,
# decay_rate=0.1,
# staircase=True
# )
# You first need to define a function that takes the current epoch and returns the
# learning rate. For example, let’s implement exponential scheduling:
# def exponential_decay_fn(epoch): #epoch is global_step_recomp or step or 't'
# return 0.01 * 0.1**(epoch/20)
def exponential_decay(lr0, s): # e.g. exponential_decay(lr0=0.01, s=20)
    def exponential_decay_fn(epoch): # epoch plays the role of global_step_recomp, step, or 't'
        return lr0 * 0.1**(epoch/s)
    return exponential_decay_fn # no parentheses: return the function object, not a function call
exponential_decay_fn = exponential_decay(lr0=0.01, s=20)
model = keras.models.Sequential([
keras.layers.Flatten( input_shape=[28,28]),
keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
keras.layers.Dense(10, activation="softmax")
])
model.compile( loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 25
Next, create a LearningRateScheduler callback, giving it the schedule function, and pass this callback to the fit() method:
lr_scheduler = keras.callbacks.LearningRateScheduler( exponential_decay_fn )
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
validation_data=(X_valid_scaled, y_valid),
callbacks=[lr_scheduler])
The LearningRateScheduler will update the optimizer’s learning_rate attribute at the beginning of each epoch. Updating the learning rate once per epoch is usually enough, but if you want it to be updated more often, for example at every step, you can always write your own callback (see the “Exponential Scheduling” section of the notebook for an example). Updating the learning rate at every step makes sense if there are many steps per epoch. Alternatively, you can use the keras.optimizers.schedules approach, described shortly.
history.history.keys()
plt.plot(history.epoch, history.history["lr"], "o-")
plt.axis([0, n_epochs-1, 0, 0.011])
plt.xlabel("Epoch")
plt.ylabel("Learning Rate")
plt.title("Exponential Scheduling", fontsize=14)
plt.grid(True)
plt.show()
The schedule function can take the current learning rate as a second argument. For example, the following schedule function multiplies the previous learning rate by 0.1**(1/20), which results in the same exponential decay (except the decay now starts at the beginning of epoch 0 instead of epoch 1):
def exponential_decay_fn(epoch, current_lr):
    # decay_steps=20, decay_rate=0.1; the epoch argument itself is ignored
    return current_lr * 0.1**(1/20)
When you save a model, the optimizer and its learning rate get saved along with it. This means that with this new schedule function, you could just load a trained model and continue training where it left off, no problem. Things are not so simple if your schedule function uses the epoch argument, however: the epoch does not get saved, and it gets reset to 0 every time you call the fit() method. If you were to continue training a model where it left off, this could lead to a very large learning rate, which would likely damage your model’s weights. One solution is to manually set the fit() method’s initial_epoch argument so the epoch starts at the right value.
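For example, here is a sketch of resuming a model that was already trained for 10 epochs, so that the schedule function receives the right epoch numbers (the epoch counts are illustrative):

history = model.fit(X_train_scaled, y_train, epochs=25, initial_epoch=10,
                    validation_data=(X_valid_scaled, y_valid),
                    callbacks=[lr_scheduler])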
If you want to update the learning rate at each iteration rather than at each epoch, you must write your own callback class:
K = keras.backend

class ExponentialDecay(keras.callbacks.Callback):
    def __init__(self, s=40000): # s: decay_steps
        super().__init__()
        self.s = s

    def on_batch_begin(self, batch, logs=None):
        # `batch` is the index of the batch within the current epoch (each batch
        # holds batch_size=32 instances), and it is reset at each epoch.
        # Unlike LearningRateScheduler, which updates the rate once per epoch,
        # here the learning rate is updated at each batch:
        lr = K.get_value(self.model.optimizer.lr)
        # print('\nbatch: ', batch, ' learning rate: ', lr, '\n')
        K.set_value(self.model.optimizer.lr, lr * 0.1**(1/self.s)) # self.s: decay_steps, e.g. 20*len(X_train)//32

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        logs['lr'] = K.get_value(self.model.optimizer.lr)
model = keras.models.Sequential([
keras.layers.Flatten(input_shape=[28,28]),
keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
keras.layers.Dense(10, activation="softmax")
])
lr0=0.01
optimizer = keras.optimizers.Nadam(lr=lr0)
model.compile( loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=['accuracy'])
n_epochs=25
s = 20*len(X_train)//32 # number of steps in 20 epochs (batch size = 32)
exp_decay = ExponentialDecay(s)
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
validation_data = (X_valid_scaled, y_valid),
callbacks=[exp_decay])
n_steps = n_epochs * len(X_train) //32 #n_epochs=25
steps = np.arange(n_steps)
lrs = lr0 * 0.1**(steps/s) #s = 20*len(X_train)//32
plt.plot(steps, lrs, "-", linewidth=2)
plt.axis([0, n_steps-1, 0, lr0 * 1.1])
plt.xlabel("Batch")
plt.ylabel("Learning Rate")
plt.title("Exponential Scheduling (per batch)", fontsize=14)
plt.grid(True)
plt.show()
def fun(x): # similar to: lambda x: x
    return x
fun(3)

is equivalent to:

fun = lambda x: x # the expression after the colon is the return value
fun(3)

l = [lambda: n for n in range(5)]
for x in l:
    print(x()) # each element of l is a lambda, so we call it with "()"

This prints 4 five times: all five lambdas close over the same variable n, whose final value is 4. It is equivalent to the following form:

l = []
# lambda: n for n in range(5)
def fun():
    for n in range(5):
        n = n
    return n
for n in range(5): # the lambda sits in the list and gets called 5 times
    l.append(fun) # append the function object to the list l
for x in l:
    print(x())

To capture the current loop value instead, bind it as a default argument:

l = [lambda n=i: n for i in range(5)] # OR: l = [lambda n=n: n for n in range(5)]
for x in l:
    print(x()) # now prints 0, 1, 2, 3, 4

l = []
for i in range(5):
    def fun(n=i): # lambda n=i: n
        return n
    l.append(fun) # the default argument captures the current value of i
for x in l:
    print(x())

(This default-argument trick is exactly what the PiecewiseConstantDecay source below uses: lambda v=v: v.)
@keras_export("keras.optimizers.schedules.PiecewiseConstantDecay")
class PiecewiseConstantDecay(LearningRateSchedule):
"""A LearningRateSchedule that uses a piecewise constant decay schedule."""
def __init__(
self,
boundaries,
values,
name=None):
"""Piecewise constant from boundaries and interval values.
Example: use a learning rate that's
1.0 for the first 100001 steps,
0.5 for the next 10000 steps, and
0.1 for any additional steps.
```python
step = tf.Variable(0, trainable=False)
boundaries = [100000, 110000]
values = [1.0, 0.5, 0.1]
learning_rate_fn = keras.optimizers.schedules.PiecewiseConstantDecay(
boundaries, values)
# Later, whenever we perform an optimization step, we pass in the step.
learning_rate = learning_rate_fn(step)
```
Args:
boundaries: A list of `Tensor`s or `int`s or `float`s with strictly
increasing entries, and with all elements having the same type as the
optimizer step.
values: A list of `Tensor`s or `float`s or `int`s that specifies the
values for the intervals defined by `boundaries`. It should have one
more element than `boundaries`, and all elements should have the same
type.
name: A string. Optional name of the operation. Defaults to
'PiecewiseConstant'.
Returns:
The output of the 1-arg function that takes the `step`
is `values[0]` when `step <= boundaries[0]`,
`values[1]` when `step > boundaries[0]` and `step <= boundaries[1]`, ...,
and values[-1] when `step > boundaries[-1]`.
Raises:
ValueError: if the number of elements in the lists do not match.
"""
super(PiecewiseConstantDecay, self).__init__()
if len(boundaries) != len(values) - 1:
raise ValueError(
"The length of boundaries should be 1 less than the length of values")
self.boundaries = boundaries
self.values = values
self.name = name
def __call__(self, step):
with ops.name_scope_v2(self.name or "PiecewiseConstant"):
boundaries = ops.convert_n_to_tensor(self.boundaries)
values = ops.convert_n_to_tensor(self.values)
x_recomp = ops.convert_to_tensor_v2(step)
for i, b in enumerate(boundaries):
if b.dtype.base_dtype != x_recomp.dtype.base_dtype:
# We cast the boundaries to have the same type as the step
b = math_ops.cast(b, x_recomp.dtype.base_dtype)
boundaries[i] = b
pred_fn_pairs = []
pred_fn_pairs.append((x_recomp <= boundaries[0], lambda: values[0]))
pred_fn_pairs.append((x_recomp > boundaries[-1], lambda: values[-1]))
for low, high, v in zip(boundaries[:-1], boundaries[1:], values[1:-1]):
# Need to bind v here; can do this with lambda v=v: ...
pred = (x_recomp > low) & (x_recomp <= high)
pred_fn_pairs.append((pred, lambda v=v: v))#中间的v是引用当前for中的v值,并保存
############################lambda(v=v): return v
# The default isn't needed here because our conditions are mutually
# exclusive and exhaustive, but tf.case requires it.
default = lambda: values[0]
return control_flow_ops.case(pred_fn_pairs, default, exclusive=True)
def piecewise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005
    else:
        return 0.001
def piecewise_constant(boundaries, values): # values: the learning rates
    boundaries = np.array([0] + boundaries) # e.g. array([0, 5, 15])
    values = np.array(values)
    def piecewise_constant_fn(epoch):
        # np.argmax(boundaries > epoch) is the index of the first boundary greater
        # than epoch; the value just before that boundary is the current rate
        return values[np.argmax(boundaries > epoch) - 1]
    return piecewise_constant_fn # return the function object

piecewise_constant_fn = piecewise_constant([5, 15], [0.01, 0.005, 0.001])
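A quick check that the factory version behaves like the hardcoded piecewise_constant_fn() above (the epoch values are chosen to straddle the boundaries):

for epoch in [0, 4, 5, 14, 15, 24]:
    print(epoch, piecewise_constant_fn(epoch)) # 0.01 for epochs 0-4, 0.005 for 5-14, 0.001 from 15 on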
lr_scheduler = keras.callbacks.LearningRateScheduler(piecewise_constant_fn)
model = keras.models.Sequential([
keras.layers.Flatten( input_shape=[28,28] ),
keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
keras.layers.Dense(10, activation="softmax")
])
model.compile( loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=['accuracy'])
n_epochs=25
history=model.fit(X_train_scaled, y_train, epochs=n_epochs,
validation_data=(X_valid_scaled, y_valid),
callbacks=[lr_scheduler])
plt.plot(history.epoch, [piecewise_constant_fn(epoch) for epoch in history.epoch], "o-")
plt.axis([0, n_epochs-1, 0, 0.011])
plt.xlabel("Epoch")
plt.ylabel("Learning Rate")
plt.title("Piecewise Constant Scheduling", fontsize=14)
plt.grid(True)
plt.show()
tf.random.set_seed(42)
np.random.seed(42)
# factor: factor by which the learning rate will be reduced. new_lr = lr * factor
# patience: number of epochs with no improvement after which learning rate will be reduced.
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
model = keras.models.Sequential([
keras.layers.Flatten( input_shape=[28,28] ),
keras.layers.Dense( 300, activation="selu", kernel_initializer="lecun_normal" ),
keras.layers.Dense( 100, activation="selu", kernel_initializer="lecun_normal" ),
keras.layers.Dense( 10, activation="softmax")
])
optimizer = keras.optimizers.SGD( lr=0.02, momentum=0.9 )
model.compile( loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=['accuracy'])
n_epochs = 25
history = model.fit( X_train_scaled, y_train, epochs=n_epochs,
validation_data=(X_valid_scaled, y_valid),
callbacks = [lr_scheduler])
plt.plot(history.epoch, history.history['lr'], "bo-")
plt.xlabel("Epoch")
plt.ylabel("Learning Rate", color="b")
plt.tick_params('y', colors="b")
plt.gca().set_xlim(0, n_epochs-1)
plt.grid(True)
ax2 = plt.gca().twinx()
ax2.plot(history.epoch, history.history['val_loss'], "r^-")
ax2.set_ylabel("Validation Loss", color='r')
ax2.tick_params('y', color='r')
plt.title("Reduce LR on Plateau", fontsize=14)
plt.show()
Measure the validation error every N steps (just like for early stopping), and reduce the learning rate by a factor of λ when the error stops dropping.
Lastly, tf.keras offers an alternative way to implement learning rate scheduling: define the learning rate using one of the schedules available in keras.optimizers.schedules, then pass this learning rate to any optimizer. This approach updates the learning rate at each step rather than at each epoch.
For example, here is how to implement
the same exponential schedule as the exponential_decay_fn() function we defined earlier:
model = keras.models.Sequential([
keras.layers.Flatten( input_shape=[28,28]),
keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
keras.layers.Dense(10, activation="softmax")
])
s= 20*len(X_train)//32 # number of steps in 20 epochs (batch size = 32) #decay_steps
# ExponentialDecay( initial_learning_rate, decay_steps, decay_rate, staircase=False, name=None )
# The learning rate will be divided by 10 (decay_rate=0.1) every s steps (decay_steps).
learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, s, 0.1)
optimizer = keras.optimizers.SGD( learning_rate )
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=['accuracy'])
n_epochs = 25
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
validation_data=(X_valid_scaled, y_valid))
This is nice and simple, plus when you save the model, the learning rate and its
schedule (including its state) get saved as well. This approach, however, is not part of
the Keras API; it is specific to tf.keras.
... ...
For piecewise constant scheduling, try this:
learning_rate = keras.optimizers.schedules.PiecewiseConstantDecay(
boundaries=[5. * n_steps_per_epoch, 15. * n_steps_per_epoch],
values=[0.01, 0.005, 0.001])
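Since the schedule object is a callable mapping the optimizer's step counter to a learning rate, you can sanity-check it directly (this assumes n_steps_per_epoch as defined earlier):

for step in [0, 4 * n_steps_per_epoch, 10 * n_steps_per_epoch, 20 * n_steps_per_epoch]:
    print(step, float(learning_rate(step))) # 0.01 before epoch 5, then 0.005, then 0.001 after epoch 15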
Contrary to the other approaches, 1cycle (introduced in a 2018 paper by Leslie Smith) starts by increasing the initial learning rate η0, growing linearly up to η1 halfway through training. Then it decreases the learning rate linearly down to η0 again during the second half of training, finishing the last few epochs by dropping the rate down by several orders of magnitude (still linearly). The maximum learning rate η1 is chosen using the same approach we used to find the optimal learning rate, and the initial learning rate η0 is chosen to be roughly 10 times lower. When using momentum, we start with a high momentum first (e.g., 0.95), then drop it down to a lower momentum during the first half of training (e.g., down to 0.85, linearly), and then bring it back up to the maximum value (e.g., 0.95) during the second half of training, finishing the last few epochs with that maximum value. Smith did many experiments showing that this approach was often able to speed up training considerably and reach better performance. For example, on the popular CIFAR10 image dataset, this approach reached 91.9% validation accuracy in just 100 epochs, instead of 90.3% accuracy in 800 epochs through a standard approach (with the same neural network architecture).
K = keras.backend

class ExponentialLearningRate(keras.callbacks.Callback):
    def __init__(self, factor):
        self.factor = factor
        self.rates = []
        self.losses = []

    def on_batch_end(self, batch, logs):
        self.rates.append(K.get_value(self.model.optimizer.lr))
        self.losses.append(logs['loss'])
        K.set_value(self.model.optimizer.lr, self.model.optimizer.lr * self.factor) # update learning rate
def find_learning_rate(model, X, y, epochs=1, batch_size=32, min_rate=10**-5, max_rate=10):
    init_weights = model.get_weights()
    iterations = len(X) // batch_size * epochs
    factor = np.exp(np.log(max_rate / min_rate) / iterations) # constant per-batch growth factor
    init_lr = K.get_value(model.optimizer.lr) # save the initial learning rate
    K.set_value(model.optimizer.lr, min_rate) # start training from min_rate
    exp_lr = ExponentialLearningRate(factor) # multiplies the rate by `factor` after each batch
    history = model.fit(X, y, epochs=epochs, batch_size=batch_size, callbacks=[exp_lr])
    K.set_value(model.optimizer.lr, init_lr) # restore the initial learning rate
    model.set_weights(init_weights) # restore the initial weights
    return exp_lr.rates, exp_lr.losses
def plot_lr_vs_loss(rates, losses):
    plt.plot(rates, losses)
    plt.gca().set_xscale("log")
    plt.hlines(min(losses), min(rates), max(rates))
    plt.axis([min(rates), max(rates), min(losses), (losses[0] + min(losses)) / 2])
    plt.xlabel("Learning rate")
    plt.ylabel("Loss")
tf.random.set_seed(42)
np.random.seed(42)
model = keras.models.Sequential([
keras.layers.Flatten(input_shape=[28,28]),
keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
keras.layers.Dense(10, activation="softmax")
])
model.compile( loss="sparse_categorical_crossentropy",
optimizer=keras.optimizers.SGD(lr=1e-3),
metrics=["accuracy"])
batch_size=128
rates, losses = find_learning_rate(model, X_train_scaled, y_train, epochs=1, batch_size=batch_size)
plot_lr_vs_loss(rates, losses)
To sum up, exponential decay, performance scheduling, and 1cycle can considerably speed up convergence, so give them a try!
class OneCycleScheduler(keras.callbacks.Callback):
    def __init__(self, iterations, max_rate, start_rate=None,
                 last_iterations=None, last_rate=None):
        self.iterations = iterations # total number of iterations (batches)
        self.max_rate = max_rate
        self.start_rate = start_rate or max_rate / 10
        self.last_iterations = last_iterations or iterations // 10 + 1
        self.half_iteration_pos = (iterations - self.last_iterations) // 2
        # finish the last few epochs by dropping the rate down by several orders of magnitude
        self.last_rate = last_rate or self.start_rate / 1000
        self.iteration_pos = 0

    def _interpolate(self, iter1, iter2, rate1, rate2):
        # linear interpolation y = a_slope * x + b, with
        #   a_slope = (rate2 - rate1) / (iter2 - iter1)
        #   x = self.iteration_pos - iter1, b = rate1
        return ((rate2 - rate1) * (self.iteration_pos - iter1)
                / (iter2 - iter1) + rate1)

    def on_batch_begin(self, batch, logs=None):
        if self.iteration_pos < self.half_iteration_pos:
            # first half: grow linearly from start_rate to max_rate
            rate = self._interpolate(0, self.half_iteration_pos,
                                     self.start_rate, self.max_rate)
        elif self.iteration_pos < 2 * self.half_iteration_pos:
            # second half: decrease linearly back down to start_rate
            rate = self._interpolate(self.half_iteration_pos, 2 * self.half_iteration_pos,
                                     self.max_rate, self.start_rate)
        else:
            # last few iterations: drop linearly down to last_rate
            rate = self._interpolate(2 * self.half_iteration_pos, self.iterations,
                                     self.start_rate, self.last_rate)
        self.iteration_pos += 1
        K.set_value(self.model.optimizer.lr, rate) # update the learning rate
n_epochs = 25
# total iterations = steps per epoch * n_epochs; max_rate=0.05 was picked from the
# learning-rate/loss plot above (slightly below the rate where the loss shot up)
onecycle = OneCycleScheduler(len(X_train)//batch_size * n_epochs, max_rate=0.05)
history = model.fit(X_train_scaled, y_train, epochs=n_epochs, batch_size=batch_size,
validation_data=(X_valid_scaled, y_valid),
callbacks=[onecycle])
... ...
A 2013 paper by Andrew Senior et al. compared the performance of some of the most popular learning schedules when using momentum optimization to train deep neural networks for speech recognition. The authors concluded that, in this setting,
both performance scheduling and exponential scheduling performed well. They favored exponential scheduling because it was easy to tune and it converged slightly faster to the optimal solution (they also mentioned that it was easier to implement than performance scheduling, but in Keras both options are easy). That said, the 1cycle approach seems to perform even better.
With thousands of parameters, you can fit the whole zoo. Deep neural networks typically have tens of thousands of parameters, sometimes even millions. This gives them an incredible amount of freedom and means they can fit a huge variety of complex datasets. But this great flexibility also makes the network prone to overfitting the training set. We need regularization.
We already implemented one of the best regularization techniques in Chapter 10: early stopping (cp4:https://blog.csdn.net/Linli522362242/article/details/104124771, cp10: https://blog.csdn.net/Linli522362242/article/details/106582512). Moreover, even though Batch Normalization was designed to solve the unstable gradients problems, it also acts like a pretty good regularizer. In this section we will examine other popular regularization techniques for neural networks: ℓ1 and ℓ2 regularization, dropout, and max-norm regularization.
Just like you did in Chapter 4 for simple linear models (https://blog.csdn.net/Linli522362242/article/details/104070847), you can use ℓ2 regularization to constrain a neural network’s connection weights, and/or ℓ1 regularization if you want a sparse model (with many weights equal to 0).
Here is how to apply ℓ2 regularization to a Keras layer’s connection weights, using a regularization factor of 0.01:
from tensorflow import keras
layer = keras.layers.Dense(100, activation="elu", kernel_initializer = "he_normal",
kernel_regularizer = keras.regularizers.l2(0.01))
# or l1(0.1) for ℓ1 regularization with a factor of 0.1
# or l1_l2(0.1, 0.01) for both ℓ1 and ℓ2 regularization, with factors 0.1 and 0.01 respectively
The l2() function returns a regularizer that will be called at each step during training to compute the regularization loss. This is then added to the final loss. As you might expect, you can just use keras.regularizers.l1() if you want ℓ1 regularization; if
you want both ℓ1 and ℓ2 regularization, use keras.regularizers.l1_l2() (specifying both regularization factors).
model = keras.models.Sequential([
keras.layers.Flatten(input_shape=[28,28]), #exponential linear unit (ELU)
keras.layers.Dense( 300, activation="elu", kernel_regularizer=keras.regularizers.l2(0.01) ),
keras.layers.Dense( 100, activation="elu", kernel_regularizer=keras.regularizers.l2(0.01) ),
keras.layers.Dense( 10, activation="softmax", kernel_regularizer=keras.regularizers.l2(0.01) )
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs=2
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
validation_data = (X_valid_scaled, y_valid))
l2(0.01) means every coefficient in the weight matrix of the layer will add 0.01 * weight_coefficient_value**2 to the total loss of the network. Note that because this penalty is only added at training time, the loss for this network will be much higher at training than at test time.
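You can see this penalty directly, since a Keras regularizer is just a callable that maps a weight tensor to a scalar loss (the weight values below are illustrative):

import tensorflow as tf

reg = keras.regularizers.l2(0.01)
w = tf.constant([[1.0, -2.0], [3.0, 0.5]])
print(reg(w).numpy()) # 0.01 * (1**2 + 2**2 + 3**2 + 0.5**2) = 0.1425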
Since you will typically want to apply the same regularizer to all layers in your network, as well as using the same activation function and the same initialization strategy in all hidden layers, you may find yourself repeating the same arguments. This makes the code ugly and error-prone. To avoid this, you can try refactoring your code to use loops. Another option is to use Python’s functools.partial() function, which lets you create a thin wrapper for any callable, with some default argument values:
from functools import partial
RegularizedDense = partial( keras.layers.Dense,
activation="elu",
kernel_initializer = "he_normal",
kernel_regularizer = keras.regularizers.l2(0.01)
)
model = keras.Sequential([
keras.layers.Flatten( input_shape=[28,28]),
RegularizedDense(300),
RegularizedDense(100),
RegularizedDense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 2
history = model.fit(X_train_scaled, y_train, epochs = n_epochs,
validation_data=(X_valid_scaled, y_valid))
https://blog.csdn.net/Linli522362242/article/details/107164478