ZJ_Improve

Assignment | 02-week2 -Optimization Methods

该系列仅在原课程基础上课后作业部分添加个人学习笔记，或相关推导补充等。如有错误，还请批评指教。在学习了 Andrew Ng 课程的基础上，为了更方便的查阅复习，将其整理成文字。因本人一直在学习英语，所以该系列以英文为主，同时也建议读者以英文为主，中文辅助，以便后期进阶时，为学习相关领域的学术论文做铺垫。- ZJ

Coursera 课程 |deeplearning.ai |网易云课堂

转载请注明作者和出处：ZJ 微信公众号-「SelfImprovementLab」

知乎：https://zhuanlan.zhihu.com/c_147249273

CSDN：http://blog.csdn.net/junjun_zhao/article/details/79125104

Until now, you’ve always used Gradient Descent to update the parameters and minimize the cost. In this notebook, you will learn more advanced optimization methods that can speed up learning and perhaps even get you to a better final value for the cost function. Having a good optimization algorithm can be the difference between waiting days vs. just a few hours to get a good result.

Gradient descent goes “downhill” on a cost function J . Think of it as trying to do this:

Figure 1 : Minimizing the cost is like finding the lowest point in a hilly landscape
At each step of the training, you update your parameters following a certain direction to try to get to the lowest possible point.

Notations: As usual, ∂J∂a= da for any variable a.

To get started, run the following code to import the libraries you will need.

import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import math
import sklearn
import sklearn.datasets

from opt_utils import load_params_and_grads, initialize_parameters, forward_propagation, backward_propagation
from opt_utils import compute_cost, predict, predict_dec, plot_decision_boundary, load_dataset
from testCases import *

%matplotlib inline
plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

1 - Gradient Descent

A simple optimization method in machine learning is gradient descent (GD). When you take gradient steps with respect to all m examples on each step, it is also called Batch Gradient Descent.

Warm-up exercise: Implement the gradient descent update rule. The gradient descent rule is, for l=1,...,L:

W [l] = W [l] - α d W [l] (1)

$b [l] = b [l] - α d b [l] (2)$

where L is the number of layers and α is the learning rate. All parameters should be stored in the parameters dictionary. Note that the iterator l starts at 0 in the for loop while the first parameters are W[1] and b[1] . You need to shift l to l+1 when coding.

# GRADED FUNCTION: update_parameters_with_gd def update_parameters_with_gd(parameters, grads, learning_rate): """ Update parameters using one step of gradient descent Arguments: parameters -- python dictionary containing your parameters to be updated: parameters['W' + str(l)] = Wl parameters['b' + str(l)] = bl grads -- python dictionary containing your gradients to update each parameters: grads['dW' + str(l)] = dWl grads['db' + str(l)] = dbl learning_rate -- the learning rate, scalar. Returns: parameters -- python dictionary containing your updated parameters """ L = len(parameters) // 2 # number of layers in the neural networks # Update rule for each parameter for l in range(L): ### START CODE HERE ### (approx. 2 lines) parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * grads['dW' + str(l+1)] parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * grads['db' + str(l+1)] ### END CODE HERE ### return parameters # 错误点：parameters["W" + str(l+1)] - learning_rate * grads['dW' + str(l+1)] # 没有 W0

parameters, grads, learning_rate = update_parameters_with_gd_test_case() parameters = update_parameters_with_gd(parameters, grads, learning_rate) print("W1 = " + str(parameters["W1"])) print("b1 = " + str(parameters["b1"])) print("W2 = " + str(parameters["W2"])) print("b2 = " + str(parameters["b2"]))

W1 = [[ 1.63535156 -0.62320365 -0.53718766] [-1.07799357 0.85639907 -2.29470142]] b1 = [[ 1.74604067] [-0.75184921]] W2 = [[ 0.32171798 -0.25467393 1.46902454] [-2.05617317 -0.31554548 -0.3756023 ] [ 1.1404819 -1.09976462 -0.1612551 ]] b2 = [[-0.88020257] [ 0.02561572] [ 0.57539477]]

Expected Output:

**W1** [[ 1.63535156 -0.62320365 -0.53718766] [-1.07799357 0.85639907 -2.29470142]]

**b1** [[ 1.74604067] [-0.75184921]]

**W2** [[ 0.32171798 -0.25467393 1.46902454] [-2.05617317 -0.31554548 -0.3756023 ] [ 1.1404819 -1.09976462 -0.1612551 ]]

**b2** [[-0.88020257] [ 0.02561572] [ 0.57539477]]

A variant of this is Stochastic Gradient Descent (SGD) 随机梯度, which is equivalent 等同于 to mini-batch gradient descent where each mini-batch has just 1 example. The update rule that you have just implemented does not change. What changes is that you would be computing gradients on just one training example at a time, rather than on the whole training set. The code examples below illustrate the difference between stochastic gradient descent and (batch) gradient descent.

(Batch) Gradient Descent: 一次处理所有数据只有一个for循环迭代次数

X = data_input Y = labels parameters = initialize_parameters(layers_dims) for i in range(0, num_iterations): # Forward propagation a, caches = forward_propagation(X, parameters) # Compute cost. cost = compute_cost(a, Y) # Backward propagation. grads = backward_propagation(a, caches, parameters) # Update parameters. parameters = update_parameters(parameters, grads)

Stochastic Gradient Descent: SGD 随机梯度下降一次处理一个数据，等同于 mini-batch has just 1 example 三层 for 循环 m 个样本和迭代次数和层数 1- l

X = data_input Y = labels parameters = initialize_parameters(layers_dims) for i in range(0, num_iterations): for j in range(0, m): # Forward propagation a, caches = forward_propagation(X[:,j], parameters) # Compute cost cost = compute_cost(a, Y[:,j]) # Backward propagation grads = backward_propagation(a, caches, parameters) # Update parameters. parameters = update_parameters(parameters, grads)

In Stochastic Gradient Descent, you use only 1 training example before updating the gradients. When the training set is large, SGD can be faster. But the parameters will “oscillate” 震荡波动大 toward the minimum rather than converge smoothly.而不是收敛的相对平滑 Here is an illustration of this:

Figure 1 : SGD vs GD
“+” denotes a minimum of the cost. SGD leads to many oscillations to reach convergence. But each step is a lot faster to compute for SGD than for GD, as it uses only one training example (vs. the whole batch for GD).

Note also that implementing SGD requires 3 for-loops in total:
1. Over the number of iterations
2. Over the m training examples
3. Over the layers (to update all parameters, from (W[1],b[1]) to (W[L],b[L]) )

In practice, you’ll often get faster results if you do not use neither the whole training set, nor only one training example, to perform each update. Mini-batch gradient descent uses an intermediate number of examples for each step. With mini-batch gradient descent, you loop over the mini-batches instead of looping over individual training examples.

Figure 2 : SGD vs Mini-Batch GD
“+” denotes a minimum of the cost. Using mini-batches in your optimization algorithm often leads to faster optimization.

What you should remember:
- The difference between gradient descent, mini-batch gradient descent and stochastic gradient descent is the number of examples you use to perform one update step. GD mini-batch GD 和 SGD 之间的区别是执行更新时，样本数目的不同
- You have to tune a learning rate hyperparameter α .
- With a well-turned mini-batch size, usually it outperforms either gradient descent or stochastic gradient descent (particularly when the training set is large). 一个调整很好的 mini-batch 大小的参数，通常比其他的 GD or SGD 都表现的效果梗好，尤其是在训练集非常大的情况下。

2 - Mini-Batch Gradient descent

Let’s learn how to build mini-batches from the training set (X, Y).

There are two steps:
- Shuffle: Create a shuffled version of the training set (X, Y) as shown below. Each column of X and Y represents a training example. Note that the random shuffling is done synchronously between X and Y. Such that after the shuffling the ith column of X is the example corresponding to the ith label in Y. The shuffling step ensures that examples will be split randomly into different mini-batches.

Partition: Partition the shuffled (X, Y) into mini-batches of size mini_batch_size (here 64). Note that the number of training examples is not always divisible by mini_batch_size. The last mini batch might be smaller, but you don’t need to worry about this. When the final mini-batch is smaller than the full mini_batch_size, it will look like this:

Exercise: Implement random_mini_batches. We coded the shuffling part for you. To help you with the partitioning step, we give you the following code that selects the indexes for the 1st and 2nd mini-batches:

first_mini_batch_X = shuffled_X[:, 0 : mini_batch_size] second_mini_batch_X = shuffled_X[:, mini_batch_size : 2 * mini_batch_size] ...

Note that the last mini-batch might end up smaller than mini_batch_size=64. Let ⌊s⌋ represents s rounded down to the nearest integer (this is math.floor(s) in Python). If the total number of examples is not a multiple of mini_batch_size=64 then there will be ⌊mmini_batch_size⌋ mini-batches with a full 64 examples, and the number of examples in the final mini-batch will be ( m−mini_batch_size×⌊mmini_batch_size⌋ ).

# GRADED FUNCTION: random_mini_batches def random_mini_batches(X, Y, mini_batch_size = 64, seed = 0): """ Creates a list of random minibatches from (X, Y) Arguments: X -- input data, of shape (input size, number of examples) Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples) mini_batch_size -- size of the mini-batches, integer Returns: mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y) """ np.random.seed(seed) # To make your "random" minibatches the same as ours m = X.shape[1] # number of training examples mini_batches = [] # Step 1: Shuffle (X, Y) permutation = list(np.random.permutation(m)) shuffled_X = X[:, permutation] shuffled_Y = Y[:, permutation].reshape((1,m)) # Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case. # math.floor 下取整 num_complete_minibatches = math.floor(m/mini_batch_size) # number of mini batches of size mini_batch_size in your partitionning for k in range(0, num_complete_minibatches): ### START CODE HERE ### (approx. 2 lines) mini_batch_X = shuffled_X[:, k * mini_batch_size : (k+1) * mini_batch_size] mini_batch_Y = shuffled_Y[:, k * mini_batch_size : (k+1) * mini_batch_size] ### END CODE HERE ### mini_batch = (mini_batch_X, mini_batch_Y) mini_batches.append(mini_batch) # Handling the end case (last mini-batch < mini_batch_size) if m % mini_batch_size != 0: ### START CODE HERE ### (approx. 2 lines) mini_batch_X = shuffled_X[:, mini_batch_size * num_complete_minibatches : m ] mini_batch_Y = shuffled_Y[:, mini_batch_size * num_complete_minibatches : m ] ### END CODE HERE ### mini_batch = (mini_batch_X, mini_batch_Y) mini_batches.append(mini_batch) return mini_batches

X_assess, Y_assess, mini_batch_size = random_mini_batches_test_case() mini_batches = random_mini_batches(X_assess, Y_assess, mini_batch_size) print ("shape of the 1st mini_batch_X: " + str(mini_batches[0][0].shape)) print ("shape of the 2nd mini_batch_X: " + str(mini_batches[1][0].shape)) print ("shape of the 3rd mini_batch_X: " + str(mini_batches[2][0].shape)) print ("shape of the 1st mini_batch_Y: " + str(mini_batches[0][1].shape)) print ("shape of the 2nd mini_batch_Y: " + str(mini_batches[1][1].shape)) print ("shape of the 3rd mini_batch_Y: " + str(mini_batches[2][1].shape)) print ("mini batch sanity check: " + str(mini_batches[0][0][0][0:3]))

shape of the 1st mini_batch_X: (12288, 64) shape of the 2nd mini_batch_X: (12288, 64) shape of the 3rd mini_batch_X: (12288, 20) shape of the 1st mini_batch_Y: (1, 64) shape of the 2nd mini_batch_Y: (1, 64) shape of the 3rd mini_batch_Y: (1, 20) mini batch sanity check: [ 0.90085595 -0.7612069 0.2344157 ]

Expected Output:

**shape of the 1st mini_batch_X** (12288, 64)

**shape of the 2nd mini_batch_X** (12288, 64)

**shape of the 3rd mini_batch_X** (12288, 20)

**shape of the 1st mini_batch_Y** (1, 64)

**shape of the 2nd mini_batch_Y** (1, 64)

**shape of the 3rd mini_batch_Y** (1, 20)

**mini batch sanity check** [ 0.90085595 -0.7612069 0.2344157 ]

What you should remember:
- Shuffling and Partitioning are the two steps required to build mini-batches Shuffling and Partitioning 是建立 mini-batches 所需要的两个步骤
- Powers of two are often chosen to be the mini-batch size, e.g., 16, 32, 64, 128. 大小通常是 2 的次方

3 - Momentum

Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of the update has some variance, and so the path taken by mini-batch gradient descent will “oscillate” toward convergence.收敛 Using momentum can reduce these oscillations.振荡

Momentum takes into account the past gradients to smooth out the update. We will store the ‘direction’ of the previous gradients in the variable v . Formally, this will be the exponentially weighted average of the gradient on previous steps. You can also think of v as the “velocity” 速度 of a ball rolling downhill, building up speed (and momentum) 加速度according to the direction of the gradient/slope of the hill.

Figure 3 : The red arrows shows the direction taken by one step of mini-batch gradient descent with momentum. The blue points show the direction of the gradient (with respect to the current mini-batch) on each step. Rather than just following the gradient, we let the gradient influence v and then take a step in the direction of v.

Exercise: Initialize the velocity. The velocity, v , is a python dictionary that needs to be initialized with arrays of zeros. Its keys are the same as those in the grads dictionary, that is:
for l=1,...,L:

v["dW" + str(l+1)] = ... #(numpy array of zeros with the same shape as parameters["W" + str(l+1)]) v["db" + str(l+1)] = ... #(numpy array of zeros with the same shape as parameters["b" + str(l+1)])

Note that the iterator l starts at 0 in the for loop while the first parameters are v[“dW1”] and v[“db1”] (that’s a “one” on the superscript). This is why we are shifting l to l+1 in the for loop.

# GRADED FUNCTION: initialize_velocity def initialize_velocity(parameters): """ Initialize the velocity. The velocity, vv , is a python dictionary that needs to be initialized with arrays of zeros. 初始化速度，0 矩阵形状和 w 的形状是一样的 Initializes the velocity as a python dictionary with: - keys: "dW1", "db1", ..., "dWL", "dbL" - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters. Arguments: parameters -- python dictionary containing your parameters. parameters['W' + str(l)] = Wl parameters['b' + str(l)] = bl Returns: v -- python dictionary containing the current velocity. v['dW' + str(l)] = velocity of dWl v['db' + str(l)] = velocity of dbl """ L = len(parameters) // 2 # number of layers in the neural networks v = {} # Initialize velocity for l in range(L): ### START CODE HERE ### (approx. 2 lines) v["dW" + str(l+1)] = np.zeros(parameters['W' + str(l+1)].shape) v["db" + str(l+1)] = np.zeros(parameters['b' + str(l+1)].shape) ### END CODE HERE ### return v

parameters = initialize_velocity_test_case() v = initialize_velocity(parameters) print("v[\"dW1\"] = " + str(v["dW1"])) print("v[\"db1\"] = " + str(v["db1"])) print("v[\"dW2\"] = " + str(v["dW2"])) print("v[\"db2\"] = " + str(v["db2"]))

v["dW1"] = [[0. 0. 0.] [0. 0. 0.]] v["db1"] = [[0.] [0.]] v["dW2"] = [[0. 0. 0.] [0. 0. 0.] [0. 0. 0.]] v["db2"] = [[0.] [0.] [0.]]

Expected Output:

**v[“dW1”]** [[ 0. 0. 0.] [ 0. 0. 0.]]

**v[“db1”]** [[ 0.] [ 0.]]

**v[“dW2”]** [[ 0. 0. 0.] [ 0. 0. 0.] [ 0. 0. 0.]]

**v[“db2”]** [[ 0.] [ 0.] [ 0.]]

Exercise: Now, implement the parameters update with momentum. The momentum update rule is, for l=1,...,L :

${v d W [l] = β v d W [l] + (1 - β) d W [l] W [l] = W [l] - α v d W [l] (3)$

${v d b [l] = β v d b [l] + (1 - β) d b [l] b [l] = b [l] - α v d b [l] (4)$

where L is the number of layers, β is the momentum 动量 and α is the learning rate. All parameters should be stored in the parameters dictionary. Note that the iterator l starts at 0 in the for loop while the first parameters are W[1] and b[1] (that’s a “one” on the superscript). So you will need to shift l to l+1 when coding.

# GRADED FUNCTION: update_parameters_with_momentum def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate): """ Update parameters using Momentum Arguments: parameters -- python dictionary containing your parameters: parameters['W' + str(l)] = Wl parameters['b' + str(l)] = bl grads -- python dictionary containing your gradients for each parameters: grads['dW' + str(l)] = dWl grads['db' + str(l)] = dbl v -- python dictionary containing the current velocity: v['dW' + str(l)] = ... v['db' + str(l)] = ... beta -- the momentum hyperparameter, scalar learning_rate -- the learning rate, scalar Returns: parameters -- python dictionary containing your updated parameters v -- python dictionary containing your updated velocities """ L = len(parameters) // 2 # number of layers in the neural networks # Momentum update for each parameter for l in range(L): ### START CODE HERE ### (approx. 4 lines) # compute velocities v["dW" + str(l+1)] = beta * v['dW' + str(l+1)] + (1 - beta)* grads['dW' + str(l+1)] v["db" + str(l+1)] = beta * v['db' + str(l+1)] + (1 - beta)* grads['db' + str(l+1)] # update parameters parameters["W" + str(l+1)] = parameters['W' + str(l+1)] - learning_rate * v["dW" + str(l+1)] parameters["b" + str(l+1)] = parameters['b' + str(l+1)] - learning_rate * v["db" + str(l+1)] ### END CODE HERE ### return parameters, v

parameters, grads, v = update_parameters_with_momentum_test_case() parameters, v = update_parameters_with_momentum(parameters, grads, v, beta = 0.9, learning_rate = 0.01) print("W1 = " + str(parameters["W1"])) print("b1 = " + str(parameters["b1"])) print("W2 = " + str(parameters["W2"])) print("b2 = " + str(parameters["b2"])) print("v[\"dW1\"] = " + str(v["dW1"])) print("v[\"db1\"] = " + str(v["db1"])) print("v[\"dW2\"] = " + str(v["dW2"])) print("v[\"db2\"] = " + str(v["db2"]))

W1 = [[ 1.62544598 -0.61290114 -0.52907334] [-1.07347112 0.86450677 -2.30085497]] b1 = [[ 1.74493465] [-0.76027113]] W2 = [[ 0.31930698 -0.24990073 1.4627996 ] [-2.05974396 -0.32173003 -0.38320915] [ 1.13444069 -1.0998786 -0.1713109 ]] b2 = [[-0.87809283] [ 0.04055394] [ 0.58207317]] v["dW1"] = [[-0.11006192 0.11447237 0.09015907] [ 0.05024943 0.09008559 -0.06837279]] v["db1"] = [[-0.01228902] [-0.09357694]] v["dW2"] = [[-0.02678881 0.05303555 -0.06916608] [-0.03967535 -0.06871727 -0.08452056] [-0.06712461 -0.00126646 -0.11173103]] v["db2"] = [[0.02344157] [0.16598022] [0.07420442]]

Expected Output:

**W1** [[ 1.62544598 -0.61290114 -0.52907334] [-1.07347112 0.86450677 -2.30085497]]

**b1** [[ 1.74493465] [-0.76027113]]

**W2** [[ 0.31930698 -0.24990073 1.4627996 ] [-2.05974396 -0.32173003 -0.38320915] [ 1.13444069 -1.0998786 -0.1713109 ]]

**b2** [[-0.87809283] [ 0.04055394] [ 0.58207317]]

**v[“dW1”]** [[-0.11006192 0.11447237 0.09015907] [ 0.05024943 0.09008559 -0.06837279]]

**v[“db1”]** [[-0.01228902] [-0.09357694]]

**v[“dW2”]** [[-0.02678881 0.05303555 -0.06916608] [-0.03967535 -0.06871727 -0.08452056] [-0.06712461 -0.00126646 -0.11173103]]

**v[“db2”]** [[ 0.02344157] [ 0.16598022] [ 0.07420442]]

Note that:
- The velocity is initialized with zeros. So the algorithm will take a few iterations to “build up” velocity and start to take bigger steps.
- If β=0 , then this just becomes standard gradient descent without momentum.

How do you choose β ?

The larger the momentum β is, the smoother the update because the more we take the past gradients into account. But if β is too big, it could also smooth out the updates too much.

Common values for β range from 0.8 to 0.999. If you don’t feel inclined to tune this, β=0.9 is often a reasonable default.

Tuning the optimal β for your model might need trying several values to see what works best in term of reducing the value of the cost function J .

What you should remember:
- Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
- You have to tune a momentum hyperparameter β and a learning rate α .

4 - Adam

Adam is one of the most effective optimization algorithms for training neural networks. It combines ideas from RMSProp (described in lecture) and Momentum.

How does Adam work?
1. It calculates an exponentially weighted average of past gradients, and stores it in variables v (before bias correction) and vcorrected (with bias correction).
2. It calculates an exponentially weighted average of the squares of the past gradients, and stores it in variables s (before bias correction) and scorrected (with bias correction).
3. It updates parameters in a direction based on combining information from “1” and “2”.

The update rule is, for l=1,...,L :

$⎧ ⎩ ⎨ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ v d W [l] = β 1 v d W [l] + (1 - β 1) \partial J \partial W [ l ] v c o r r e c t e d d W [l] = v d W [ l ] 1 - ( β 1 ) t s d W [l] = β 2 s d W [l] + (1 - β 2) (\partial J \partial W [ l ]) 2 s c o r r e c t e d d W [l] = s d W [ l ] 1 - ( β 1 ) t W [l] = W [l] - α v c o r r e c t e d d W [ l ] s c o r r e c t e d d W [ l ] \sqrt + ε$
where:
- t counts the number of steps taken of Adam
- L is the number of layers
- β1 and β2 are hyperparameters that control the two exponentially weighted averages.
- α is the learning rate
- ε is a very small number to avoid dividing by zero

As usual, we will store all parameters in the parameters dictionary

Exercise: Initialize the Adam variables v,s which keep track of the past information.

Instruction: The variables v,s are python dictionaries that need to be initialized with arrays of zeros. Their keys are the same as for grads, that is:
for l=1,...,L :

v["dW" + str(l+1)] = ... #(numpy array of zeros with the same shape as parameters["W" + str(l+1)]) v["db" + str(l+1)] = ... #(numpy array of zeros with the same shape as parameters["b" + str(l+1)]) s["dW" + str(l+1)] = ... #(numpy array of zeros with the same shape as parameters["W" + str(l+1)]) s["db" + str(l+1)] = ... #(numpy array of zeros with the same shape as parameters["b" + str(l+1)])

# GRADED FUNCTION: initialize_adam def initialize_adam(parameters) : """ Initializes v and s as two python dictionaries with: - keys: "dW1", "db1", ..., "dWL", "dbL" - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters. Arguments: parameters -- python dictionary containing your parameters. parameters["W" + str(l)] = Wl parameters["b" + str(l)] = bl Returns: v -- python dictionary that will contain the exponentially weighted average of the gradient. v["dW" + str(l)] = ... v["db" + str(l)] = ... s -- python dictionary that will contain the exponentially weighted average of the squared gradient. s["dW" + str(l)] = ... s["db" + str(l)] = ... """ L = len(parameters) // 2 # number of layers in the neural networks v = {} s = {} # Initialize v, s. Input: "parameters". Outputs: "v, s". for l in range(L): ### START CODE HERE ### (approx. 4 lines) v["dW" + str(l+1)] = np.zeros((parameters['W' + str(l+1)].shape[0], parameters['W' + str(l+1)].shape[1])) v["db" + str(l+1)] = np.zeros((parameters['b' + str(l+1)].shape[0], parameters['b' + str(l+1)].shape[1])) s["dW" + str(l+1)] = np.zeros((parameters['W' + str(l+1)].shape[0], parameters['W' + str(l+1)].shape[1])) s["db" + str(l+1)] = np.zeros((parameters['b' + str(l+1)].shape[0], parameters['b' + str(l+1)].shape[1])) ### END CODE HERE ### return v, s

parameters = initialize_adam_test_case() v, s = initialize_adam(parameters) print("v[\"dW1\"] = " + str(v["dW1"])) print("v[\"db1\"] = " + str(v["db1"])) print("v[\"dW2\"] = " + str(v["dW2"])) print("v[\"db2\"] = " + str(v["db2"])) print("s[\"dW1\"] = " + str(s["dW1"])) print("s[\"db1\"] = " + str(s["db1"])) print("s[\"dW2\"] = " + str(s["dW2"])) print("s[\"db2\"] = " + str(s["db2"]))

v["dW1"] = [[0. 0. 0.] [0. 0. 0.]] v["db1"] = [[0.] [0.]] v["dW2"] = [[0. 0. 0.] [0. 0. 0.] [0. 0. 0.]] v["db2"] = [[0.] [0.] [0.]] s["dW1"] = [[0. 0. 0.] [0. 0. 0.]] s["db1"] = [[0.] [0.]] s["dW2"] = [[0. 0. 0.] [0. 0. 0.] [0. 0. 0.]] s["db2"] = [[0.] [0.] [0.]]

Expected Output:

**v[“dW1”]** [[ 0. 0. 0.] [ 0. 0. 0.]]

**v[“db1”]** [[ 0.] [ 0.]]

**v[“dW2”]** [[ 0. 0. 0.] [ 0. 0. 0.] [ 0. 0. 0.]]

**v[“db2”]** [[ 0.] [ 0.] [ 0.]]

**s[“dW1”]** [[ 0. 0. 0.] [ 0. 0. 0.]]

**s[“db1”]** [[ 0.] [ 0.]]

**s[“dW2”]** [[ 0. 0. 0.] [ 0. 0. 0.] [ 0. 0. 0.]]

**s[“db2”]** [[ 0.] [ 0.] [ 0.]]

Exercise: Now, implement the parameters update with Adam. Recall the general update rule is, for l=1,...,L :

$⎧ ⎩ ⎨ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ v W [l] = β 1 v W [l] + (1 - β 1) \partial J \partial W [ l ] v c o r r e c t e d W [l] = v W [ l ] 1 - ( β 1 ) t s W [l] = β 2 s W [l] + (1 - β 2) (\partial J \partial W [ l ]) 2 s c o r r e c t e d W [l] = s W [ l ] 1 - ( β 2 ) t W [l] = W [l] - α v c o r r e c t e d W [ l ] s c o r r e c t e d W [ l ] \sqrt + ε$

Note that the iterator l starts at 0 in the for loop while the first parameters are W[1] and b[1] . You need to shift l to l+1 when coding.

# GRADED FUNCTION: update_parameters_with_adam def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate = 0.01, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8): """ Update parameters using Adam Arguments: parameters -- python dictionary containing your parameters: parameters['W' + str(l)] = Wl parameters['b' + str(l)] = bl grads -- python dictionary containing your gradients for each parameters: grads['dW' + str(l)] = dWl grads['db' + str(l)] = dbl v -- Adam variable, moving average of the first gradient, python dictionary s -- Adam variable, moving average of the squared gradient, python dictionary learning_rate -- the learning rate, scalar. beta1 -- Exponential decay hyperparameter for the first moment estimates beta2 -- Exponential decay hyperparameter for the second moment estimates epsilon -- hyperparameter preventing division by zero in Adam updates Returns: parameters -- python dictionary containing your updated parameters v -- Adam variable, moving average of the first gradient, python dictionary s -- Adam variable, moving average of the squared gradient, python dictionary """ L = len(parameters) // 2 # number of layers in the neural networks v_corrected = {} # Initializing first moment estimate, python dictionary s_corrected = {} # Initializing second moment estimate, python dictionary # Perform Adam update on all parameters for l in range(L): # Moving average of the gradients. Inputs: "v, grads, beta1". Output: "v". ### START CODE HERE ### (approx. 2 lines) v["dW" + str(l+1)] = beta1 * v["dW" + str(l+1)] + (1-beta1) * grads['dW' + str(l+1)] v["db" + str(l+1)] = beta1 * v["db" + str(l+1)] + (1-beta1) * grads['db' + str(l+1)] ### END CODE HERE ### # Compute bias-corrected first moment estimate. Inputs: "v, beta1, t". Output: "v_corrected". ### START CODE HERE ### (approx. 2 lines) v_corrected["dW" + str(l+1)] = v["dW" + str(l+1)]/(1-np.power(beta1,t)) v_corrected["db" + str(l+1)] = v["db" + str(l+1)]/(1-np.power(beta1,t)) ### END CODE HERE ### # Moving average of the squared gradients. Inputs: "s, grads, beta2". Output: "s". ### START CODE HERE ### (approx. 2 lines) s["dW" + str(l+1)] = beta2 * s["dW" + str(l+1)] + (1-beta2) * np.power(grads['dW' + str(l+1)],2) s["db" + str(l+1)] = beta2 * s["db" + str(l+1)] + (1-beta2) * np.power(grads['db' + str(l+1)],2) ### END CODE HERE ### # Compute bias-corrected second raw moment estimate. Inputs: "s, beta2, t". Output: "s_corrected". ### START CODE HERE ### (approx. 2 lines) s_corrected["dW" + str(l+1)] = s["dW" + str(l+1)]/(1-np.power(beta2,t)) s_corrected["db" + str(l+1)] = s["db" + str(l+1)]/(1-np.power(beta2,t)) ### END CODE HERE ### # Update parameters. Inputs: "parameters, learning_rate, v_corrected, s_corrected, epsilon". Output: "parameters". ### START CODE HERE ### (approx. 2 lines) parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * v_corrected["dW" + str(l+1)]/(np.sqrt(s_corrected["dW" + str(l+1)])+epsilon) parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * v_corrected["db" + str(l+1)]/(np.sqrt(s_corrected["db" + str(l+1)])+epsilon) ### END CODE HERE ### # 错误点： +epsilon 因为没有加 +epsilon 所以一直 nan nan nan return parameters, v, s

parameters, grads, v, s = update_parameters_with_adam_test_case() parameters, v, s = update_parameters_with_adam(parameters, grads, v, s, t = 2) print("W1 = " + str(parameters["W1"])) print("b1 = " + str(parameters["b1"])) print("W2 = " + str(parameters["W2"])) print("b2 = " + str(parameters["b2"])) print("v[\"dW1\"] = " + str(v["dW1"])) print("v[\"db1\"] = " + str(v["db1"])) print("v[\"dW2\"] = " + str(v["dW2"])) print("v[\"db2\"] = " + str(v["db2"])) print("s[\"dW1\"] = " + str(s["dW1"])) print("s[\"db1\"] = " + str(s["db1"])) print("s[\"dW2\"] = " + str(s["dW2"])) print("s[\"db2\"] = " + str(s["db2"]))

W1 = [[ 1.63178673 -0.61919778 -0.53561312] [-1.08040999 0.85796626 -2.29409733]] b1 = [[ 1.75225313] [-0.75376553]] W2 = [[ 0.32648046 -0.25681174 1.46954931] [-2.05269934 -0.31497584 -0.37661299] [ 1.14121081 -1.09244991 -0.16498684]] b2 = [[-0.88529979] [ 0.03477238] [ 0.57537385]] v["dW1"] = [[-0.11006192 0.11447237 0.09015907] [ 0.05024943 0.09008559 -0.06837279]] v["db1"] = [[-0.01228902] [-0.09357694]] v["dW2"] = [[-0.02678881 0.05303555 -0.06916608] [-0.03967535 -0.06871727 -0.08452056] [-0.06712461 -0.00126646 -0.11173103]] v["db2"] = [[0.02344157] [0.16598022] [0.07420442]] s["dW1"] = [[0.00121136 0.00131039 0.00081287] [0.0002525 0.00081154 0.00046748]] s["db1"] = [[1.51020075e-05] [8.75664434e-04]] s["dW2"] = [[7.17640232e-05 2.81276921e-04 4.78394595e-04] [1.57413361e-04 4.72206320e-04 7.14372576e-04] [4.50571368e-04 1.60392066e-07 1.24838242e-03]] s["db2"] = [[5.49507194e-05] [2.75494327e-03] [5.50629536e-04]]

Expected Output:

**W1** [[ 1.63178673 -0.61919778 -0.53561312] [-1.08040999 0.85796626 -2.29409733]]

**b1** [[ 1.75225313] [-0.75376553]]

**W2** [[ 0.32648046 -0.25681174 1.46954931] [-2.05269934 -0.31497584 -0.37661299] [ 1.14121081 -1.09245036 -0.16498684]]

**b2** [[-0.88529978] [ 0.03477238] [ 0.57537385]]

**v[“dW1”]** [[-0.11006192 0.11447237 0.09015907] [ 0.05024943 0.09008559 -0.06837279]]

**v[“db1”]** [[-0.01228902] [-0.09357694]]

**v[“dW2”]** [[-0.02678881 0.05303555 -0.06916608] [-0.03967535 -0.06871727 -0.08452056] [-0.06712461 -0.00126646 -0.11173103]]

**v[“db2”]** [[ 0.02344157] [ 0.16598022] [ 0.07420442]]

**s[“dW1”]** [[ 0.00121136 0.00131039 0.00081287] [ 0.0002525 0.00081154 0.00046748]]

**s[“db1”]** [[ 1.51020075e-05] [ 8.75664434e-04]]

**s[“dW2”]** [[ 7.17640232e-05 2.81276921e-04 4.78394595e-04] [ 1.57413361e-04 4.72206320e-04 7.14372576e-04] [ 4.50571368e-04 1.60392066e-07 1.24838242e-03]]

**s[“db2”]** [[ 5.49507194e-05] [ 2.75494327e-03] [ 5.50629536e-04]]

You now have three working optimization algorithms (mini-batch gradient descent, Momentum, Adam). Let’s implement a model with each of these optimizers and observe the difference.

5 - Model with different optimization algorithms

Lets use the following “moons” dataset to test the different optimization methods. (The dataset is named “moons” because the data from each of the two classes looks a bit like a crescent-shaped moon.)

train_X, train_Y = load_dataset()

We have already implemented a 3-layer neural network. You will train it with:
- Mini-batch Gradient Descent: it will call your function:
- update_parameters_with_gd()
- Mini-batch Momentum: it will call your functions:
- initialize_velocity() and update_parameters_with_momentum()
- Mini-batch Adam: it will call your functions:
- initialize_adam() and update_parameters_with_adam()

def model(X, Y, layers_dims, optimizer, learning_rate = 0.0007, mini_batch_size = 64, beta = 0.9, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8, num_epochs = 10000, print_cost = True): """ 3-layer neural network model which can be run in different optimizer modes. Arguments: X -- input data, of shape (2, number of examples) Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples) layers_dims -- python list, containing the size of each layer learning_rate -- the learning rate, scalar. mini_batch_size -- the size of a mini batch beta -- Momentum hyperparameter beta1 -- Exponential decay hyperparameter for the past gradients estimates beta2 -- Exponential decay hyperparameter for the past squared gradients estimates epsilon -- hyperparameter preventing division by zero in Adam updates num_epochs -- number of epochs print_cost -- True to print the cost every 1000 epochs Returns: parameters -- python dictionary containing your updated parameters """ L = len(layers_dims) # number of layers in the neural networks costs = [] # to keep track of the cost t = 0 # initializing the counter required for Adam update seed = 10 # For grading purposes, so that your "random" minibatches are the same as ours # Initialize parameters parameters = initialize_parameters(layers_dims) # Initialize the optimizer if optimizer == "gd": pass # no initialization required for gradient descent elif optimizer == "momentum": v = initialize_velocity(parameters) elif optimizer == "adam": v, s = initialize_adam(parameters) # Optimization loop for i in range(num_epochs): # Define the random minibatches. We increment the seed to reshuffle differently the dataset after each epoch seed = seed + 1 minibatches = random_mini_batches(X, Y, mini_batch_size, seed) for minibatch in minibatches: # Select a minibatch (minibatch_X, minibatch_Y) = minibatch # Forward propagation a3, caches = forward_propagation(minibatch_X, parameters) # Compute cost cost = compute_cost(a3, minibatch_Y) # Backward propagation grads = backward_propagation(minibatch_X, minibatch_Y, caches) # Update parameters if optimizer == "gd": parameters = update_parameters_with_gd(parameters, grads, learning_rate) elif optimizer == "momentum": parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate) elif optimizer == "adam": t = t + 1 # Adam counter parameters, v, s = update_parameters_with_adam(parameters, grads, v, s, t, learning_rate, beta1, beta2, epsilon) # Print the cost every 1000 epoch if print_cost and i % 1000 == 0: print ("Cost after epoch %i: %f" %(i, cost)) if print_cost and i % 100 == 0: costs.append(cost) # plot the cost plt.plot(costs) plt.ylabel('cost') plt.xlabel('epochs (per 100)') plt.title("Learning rate = " + str(learning_rate)) plt.show() return parameters

You will now run this 3 layer neural network with each of the 3 optimization methods.

5.1 - Mini-batch Gradient descent

Run the following code to see how the model does with mini-batch gradient descent.

# train 3-layer model layers_dims = [train_X.shape[0], 5, 2, 1] parameters = model(train_X, train_Y, layers_dims, optimizer = "gd") # Predict predictions = predict(train_X, train_Y, parameters) # Plot decision boundary plt.title("Model with Gradient Descent optimization") axes = plt.gca() axes.set_xlim([-1.5,2.5]) axes.set_ylim([-1,1.5]) plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

Cost after epoch 0: 0.690736 Cost after epoch 1000: 0.685273 Cost after epoch 2000: 0.647072 Cost after epoch 3000: 0.619525 Cost after epoch 4000: 0.576584 Cost after epoch 5000: 0.607243 Cost after epoch 6000: 0.529403 Cost after epoch 7000: 0.460768 Cost after epoch 8000: 0.465586 Cost after epoch 9000: 0.464518

Accuracy: 0.7966666666666666

5.2 - Mini-batch gradient descent with momentum

Run the following code to see how the model does with momentum. Because this example is relatively simple, the gains from using momemtum are small; but for more complex problems you might see bigger gains.

# train 3-layer model layers_dims = [train_X.shape[0], 5, 2, 1] parameters = model(train_X, train_Y, layers_dims, beta = 0.9, optimizer = "momentum") # Predict predictions = predict(train_X, train_Y, parameters) # Plot decision boundary plt.title("Model with Momentum optimization") axes = plt.gca() axes.set_xlim([-1.5,2.5]) axes.set_ylim([-1,1.5]) plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

Cost after epoch 0: 0.690741 Cost after epoch 1000: 0.685341 Cost after epoch 2000: 0.647145 Cost after epoch 3000: 0.619594 Cost after epoch 4000: 0.576665 Cost after epoch 5000: 0.607324 Cost after epoch 6000: 0.529476 Cost after epoch 7000: 0.460936 Cost after epoch 8000: 0.465780 Cost after epoch 9000: 0.464740

Accuracy: 0.7966666666666666

5.3 - Mini-batch with Adam mode

Run the following code to see how the model does with Adam.

# train 3-layer model layers_dims = [train_X.shape[0], 5, 2, 1] parameters = model(train_X, train_Y, layers_dims, beta = 0.9, optimizer = "adam") # Predict predictions = predict(train_X, train_Y, parameters) # Plot decision boundary plt.title("Model with Adam optimization") axes = plt.gca() axes.set_xlim([-1.5,2.5]) axes.set_ylim([-1,1.5]) plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

Cost after epoch 0: 0.690552 Cost after epoch 1000: 0.185567 Cost after epoch 2000: 0.150852 Cost after epoch 3000: 0.074454 Cost after epoch 4000: 0.125936 Cost after epoch 5000: 0.104235 Cost after epoch 6000: 0.100552 Cost after epoch 7000: 0.031601 Cost after epoch 8000: 0.111709 Cost after epoch 9000: 0.197648

Accuracy: 0.94

5.4 - Summary

**optimization method** **accuracy** **cost shape**

Gradient descent 79.7% oscillations

Momentum 79.7% oscillations

Adam 94% smoother

Momentum usually helps, but given the small learning rate and the simplistic dataset, its impact is almost negligeable. Also, the huge oscillations you see in the cost come from the fact that some minibatches are more difficult thans others for the optimization algorithm.

Adam on the other hand, clearly outperforms mini-batch gradient descent and Momentum. If you run the model for more epochs on this simple dataset, all three methods will lead to very good results. However, you’ve seen that Adam converges a lot faster.

Momentum一般都是有助于提升速度，但是当学习率较小，数据集相对简单的时候，其性能的优越性没有太明显。我们在优化算法中看到的那些较大的震荡是由于一些 minibatches 相对更加复杂所造成的。

从运行结果可以看出，Adam 算法比mini-batch gradient descent 和 Momentum都要显得优越。对于model如果在简单数据集上，迭代次数更多的话，这三种优化算法都会产生较好的结果，但是我们也可以看出，Adam算法收敛得更快些。

Adam算法的优点:

内存要求低 (尽管比gradient descent 和 gradient descent with momentum要高些)
一般微调超参数就可以获得较好的结果(除了α)

Some advantages of Adam include:
- Relatively low memory requirements (though higher than gradient descent and gradient descent with momentum)
- Usually works well even with little tuning of hyperparameters (except α )

补两个 gif

References:

Adam paper: https://arxiv.org/pdf/1412.6980.pdf

PS: 欢迎扫码关注公众号：「SelfImprovementLab」！专注「深度学习」，「机器学习」，「人工智能」。以及「早起」，「阅读」，「运动」，「英语」「其他」不定期建群打卡互助活动。

你可能感兴趣的:(深度学习,吴恩达-,Assignment,汇总,深度学习,吴恩达,深度学习,momentum,mini-batch,adam)

Android 开源组件和第三方库汇总 gyyzzr Android Android 开源框架
转载1、github排名https://github.com/trending,github搜索：https://github.com/search2、https://github.com/wasabeef/awesome-android-ui目录UIUI卫星菜单节选器下拉刷新模糊效果HUD与Toast进度条UI其它动画网络相关响应式编程地图数据库图像浏览及处理视频音频处理测试及调试动态更新热更新
PyTorch & TensorFlow速成复习：从基础语法到模型部署实战（附FPGA移植衔接）阿牛的药铺算法移植部署 pytorch tensorflow fpga开发
PyTorch&TensorFlow速成复习：从基础语法到模型部署实战（附FPGA移植衔接）引言：为什么算法移植工程师必须掌握框架基础？针对光学类产品算法FPGA移植岗位需求（如可见光/红外图像处理），深度学习框架是算法落地的"桥梁"——既要用PyTorch/TensorFlow验证算法可行性，又要将训练好的模型（如CNN、目标检测）转换为FPGA可部署的格式（ONNX、TFLite）。本文采用"
深度学习模型表征提取全解析 ZhangJiQun&MXP 教学 2024大模型以及算力 2021 AI python 深度学习人工智能 python embedding 语言模型
模型内部进行表征提取的方法在自然语言处理（NLP）中，“表征（Representation）”指将文本（词、短语、句子、文档等）转化为计算机可理解的数值形式（如向量、矩阵），核心目标是捕捉语言的语义、语法、上下文依赖等信息。自然语言表征技术可按“静态/动态”“有无上下文”“是否融入知识”等维度划分一、传统静态表征（无上下文，词级为主）这类方法为每个词分配固定向量，不考虑其在具体语境中的含义（无法解
【Qualcomm】高通SNPE框架简介、下载与使用 Jackilina_Stone 人工智能 Qualcomm SNPE
目录一高通SNPE框架1SNPE简介2QNN与SNPE3Capabilities4工作流程二SNPE的安装与使用1下载2Setup3SNPE的使用概述一高通SNPE框架1SNPE简介SNPE（SnapdragonNeuralProcessingEngine），是高通公司推出的面向移动端和物联网设备的深度学习推理框架。SNPE提供了一套完整的深度学习推理框架，能够支持多种深度学习模型，包括Pytor
深度学习篇---昇腾NPU&CANN 工具包 Atticus-Orion 上位机知识篇图像处理篇深度学习篇深度学习人工智能 NPU 昇腾 CANN
介绍昇腾NPU是华为推出的神经网络处理器，具有强大的AI计算能力，而CANN工具包则是面向AI场景的异构计算架构，用于发挥昇腾NPU的性能优势。以下是详细介绍：昇腾NPU架构设计：采用达芬奇架构，是一个片上系统，主要由特制的计算单元、大容量的存储单元和相应的控制单元组成。集成了多个CPU核心，包括控制CPU和AICPU，前者用于控制处理器整体运行，后者承担非矩阵类复杂计算。此外，还拥有AICore
深度学习图像分类数据集—桃子识别分类 AI街潜水的八角深度学习图像数据集深度学习分类人工智能
该数据集为图像分类数据集，适用于ResNet、VGG等卷积神经网络，SENet、CBAM等注意力机制相关算法，VisionTransformer等Transformer相关算法。数据集信息介绍：桃子识别分类：['B1','M2','R0','S3']训练数据集总共有6637张图片，每个文件夹单独放一种数据各子文件夹图片统计:·B1:1601张图片·M2:1800张图片·R0:1601张图片·S3:
小学计算机基础知识汇总,电脑基础知识：内存条知识大全，看完小学生都了解...
一、基础知识1、定义、作用内存条又叫随机存取存储器，是一种存储技术，但是和硬盘存储不同，内存条一断电，那么所有数据都会丢失。由于CPU处理器速度很快，而硬盘读写速度完全跟不上CPU的速度，即使是固态硬盘也一样，所以一个急着用，一个慢吞吞，因此就需要一个中间者来帮忙，这就是内存条，硬盘中的数据可以先传输到内存条保存着，如果CPU需要，那么可以直接从内存条中快速读取，相反的，CPU快速处理完后，先放到
NumPy-@运算符详解 GG不是gg numpy numpy
NumPy-@运算符详解一、@运算符的起源与设计目标1.从数学到代码：符号的统一2.设计目标二、@运算符的核心语法与运算规则1.基础用法：二维矩阵乘法2.一维向量的矩阵语义3.高维数组：批次矩阵运算4.广播机制：灵活的形状匹配三、@运算符与其他乘法方式的核心区别1.对比`np.dot()`2.对比元素级乘法`*`3.对比`np.matrix`的`*`运算符四、典型应用场景：从基础到高阶1.深度学习
【C#】依赖注入知识点汇总 Mike_Wuzy c#
在C#中实现依赖注入（DependencyInjection,DI）可以帮助你创建更解耦、可维护和易于测试的软件系统。以下是一些关于依赖注入的关键知识点及其示例代码。1.基本概念容器(Container)容器负责管理对象实例以及它们之间的依赖关系。IoC容器（InversionofControlContainer）是实现依赖注入的核心工具，常见的DI框架包括Unity、Autofac、Castle
NLP_知识图谱_大模型——个人学习记录 macken9999 自然语言处理知识图谱大模型自然语言处理知识图谱学习
1.自然语言处理、知识图谱、对话系统三大技术研究与应用https://github.com/lihanghang/NLP-Knowledge-Graph深度学习-自然语言处理(NLP)-知识图谱：知识图谱构建流程【本体构建、知识抽取（实体抽取、关系抽取、属性抽取）、知识表示、知识融合、知识存储】-元気森林-博客园https://www.cnblogs.com/-402/p/16529422.htm
解决 Python 包安装失败问题：以 accelerate 为例
在使用Python开发项目时，我们经常会遇到依赖包安装失败的问题。今天，我们就以accelerate包为例，详细探讨一下可能的原因以及解决方法。通过这篇文章，你将了解到Python包安装失败的常见原因、如何切换镜像源、如何手动安装包，以及一些实用的注意事项。一、问题背景在开发一个深度学习项目时，我需要安装accelerate包来优化模型的训练过程。然而，当我运行以下命令时：bash复制pipins
从RNN循环神经网络到Transformer注意力机制：解析神经网络架构的华丽蜕变熊猫钓鱼>_> 神经网络 rnn transformer
1.引言在自然语言处理和序列建模领域，神经网络架构经历了显著的演变。从早期的循环神经网络（RNN）到现代的Transformer架构，这一演变代表了深度学习方法在处理序列数据方面的重大进步。本文将深入比较这两种架构，分析它们的工作原理、优缺点，并通过实验结果展示它们在实际应用中的性能差异。2.循环神经网络（RNN）2.1基本原理循环神经网络是专门为处理序列数据而设计的神经网络架构。RNN的核心思想
cyvcf2 常用知识点 Bio Coder Python VCF cyvcf2 vcf 数据分析
以下是cyvcf2常用的操作汇总，涵盖加载文件、解析变异、访问基因型、筛选变异、修改文件等核心功能，附带简洁的代码示例。内容按功能模块组织，力求简明实用，方便快速参考。假设用户已熟悉cyvcf2的基本背景（如VCF/BCF文件解析），本文直接聚焦操作。1.加载VCF/BCF文件基本加载：打开VCF或BCF文件，支持.vcf、.vcf.gz和.bcf格式。fromcyvcf2importVCFvcf
跳转漏洞检测工具汇总（重定向漏洞）墨痕诉清风渗透工具安全
目录简单介绍绕过方式及更多介绍工具介绍Oralyzer介绍主要功能使用缺点下载地址简单介绍URL跳转漏洞是指后台服务器在告知浏览器跳转时，未对客户端传入的重定向地址进行合法性校验，导致用户浏览器跳转到钓鱼页面的一种漏洞。访问http://www.abc.com?url=http://www.xxx.com直接跳转到http://www.xxx.com说明存在URL重定向漏洞绕过方式及更多介绍htt
数据分析框架和方法 XiaoQiong.Zhang 人工智能
一、核心分析框架(TheBigPictureFrameworks)描述性分析(WhatHappened?)目的：了解过去发生了什么，描述现状，监控业务健康。核心工作：汇总、聚合、计算基础指标(KPI)，生成报表和仪表盘。常用方法/指标：计数/求和/平均值/中位数：DAU/MAU，总销售额，客单价等。比率：转化率，点击率，流失率，毛利率等。分布：用户活跃度分布、订单金额分布、地域分布等。常用于理解群
如何使用Python实现交通工具识别
如何使用Python实现交通工具识别文章目录技术架构功能流程识别逻辑用户界面增强特性依赖项主要类别内容展示该系统是一个基于深度学习的交通工具识别工具，具备以下核心功能与特点：技术架构使用预训练的ResNet50卷积神经网络模型（来自ImageNet数据集）集成图像增强预处理技术（随机裁剪、旋转、翻转等）采用多数投票机制提升预测稳定性基于置信度评分的结果筛选策略功能流程用户通过GUI界面选择待识别图
Oracle数据库不同场景批量插入数据的方式汇总 Favor_Yang SQL调优及高级SQL语法编写 oracle 数据库
批量数据插入是数据库操作中的常见需求，Oracle数据库提供了多种高效的数据批量加载方法。不同方法适用于不同场景，从少量数据到海量数据迁移均可找到合适的解决方案。传统单条INSERT语句最基本的插入方式是通过单条INSERT语句逐行插入数据。这种方法语法简单直观，适用于少量数据插入场景。然而当数据量较大时，频繁的SQL解析和网络往返会显著降低性能。示例代码：INSERTINTOemployees(
Python OpenCV教程从入门到精通的全面指南【文末送书】一键难忘 python opencv 开发语言
文章目录PythonOpenCV从入门到精通1.安装OpenCV2.基本操作2.1读取和显示图像2.2图像基本操作3.图像处理3.1图像转换3.2图像阈值处理3.3图像平滑4.边缘检测和轮廓4.1Canny边缘检测4.2轮廓检测5.高级操作5.1特征检测5.2目标跟踪5.3深度学习与OpenCVPythonOpenCV从入门到精通【文末送书】PythonOpenCV从入门到精通OpenCV(Ope
第八周 tensorflow实现猫狗识别降花绘 365天深度学习 tensorflow系列 tensorflow 深度学习人工智能
本文为365天深度学习训练营内部限免文章（版权归K同学啊所有）**参考文章地址：[TensorFlow入门实战｜365天深度学习训练营-第8周：猫狗识别（训练营内部成员可读）]**作者：K同学啊文章目录一、本周学习内容:1、自己搭建VGG16网络2、了解model.train_on_batch（）3、了解tqdm，并使用tqdm实现可视化进度条二、前言三、电脑环境四、前期准备1、导入相关依赖项2、
深度学习实战-使用TensorFlow与Keras构建智能模型程序员Gloria Python超入门 TensorFlow python
深度学习实战-使用TensorFlow与Keras构建智能模型深度学习已经成为现代人工智能的重要组成部分，而Python则是实现深度学习的主要编程语言之一。本文将探讨如何使用TensorFlow和Keras构建深度学习模型，包括必要的代码实例和详细的解析。1.深度学习简介深度学习是机器学习的一个分支，使用多层神经网络来学习和表示数据中的复杂模式。其广泛应用于图像识别、自然语言处理、推荐系统等领域。
AI在垂直领域的深度应用：医疗、金融与自动驾驶的革新之路
AI在垂直领域的深度应用：医疗、金融与自动驾驶的革新之路一、医疗领域：AI驱动的精准诊疗与效率提升1.医学影像诊断AI算法通过深度学习技术，已实现对X光、CT、MRI等影像的快速分析，辅助医生检测癌症、骨折等疾病。例如，GoogleDeepMind的AI系统在乳腺癌筛查中，误检率比人类专家低9.4%；中国的推想医疗AI系统可在20秒内完成肺部CT扫描分析，为急诊救治争取黄金时间。2.药物研发传统药
专题：2025云计算与AI技术研究趋势报告|附200+份报告PDF、原数据表汇总下载
原文链接：https://tecdat.cn/?p=42935关键词：2025,云计算，AI技术，市场趋势，深度学习，公有云，研究报告云计算和AI技术正以肉眼可见的速度重塑商业世界。过去十年，全球云服务收入激增8倍，中国云计算市场规模突破6000亿元，而深度学习算法的应用量更是暴涨400倍。这些数字背后，是企业从“自建机房”到“云原生开发”的转型，是AI从“实验室”走向“产业级应用”的跨越。本报告
【深度学习解惑】在实践中如何发现和修正RNN训练过程中的数值不稳定？云博士的AI课堂大模型技术开发与实践哈佛博后带你玩转机器学习深度学习深度学习 rnn 人工智能 tensorflow pytorch 神经网络机器学习
在实践中发现和修正RNN训练过程中的数值不稳定目录引言与背景介绍原理解释代码说明与实现应用场景与案例分析实验设计与结果分析性能分析与技术对比常见问题与解决方案创新性与差异性说明局限性与挑战未来建议和进一步研究扩展阅读与资源推荐图示与交互性内容语言风格与通俗化表达互动交流1.引言与背景介绍循环神经网络(RNN)在处理序列数据时表现出色，但训练过程中常面临梯度消失和梯度爆炸问题，导致数值不稳定。当网络
专题：2025供应链数智化与效率提升报告|附100+份报告PDF、原数据表汇总下载拓端研究室 php 开发语言
全文链接：https://tecdat.cn/?p=42926在全球产业链重构与数字技术革命的双重驱动下，供应链正经历从传统经验驱动向数据智能驱动的范式变革。从快消品产能区域化布局到垂类折扣企业的效率竞赛，从人形机器人的成本优化到供应链金融对中小企业的赋能，技术创新与模式重构正在重塑行业价值网络。本报告洞察基于《灼识咨询：2025中国供应链金融科技行业蓝皮书》《中国银河证券：折扣业态供应链效率深度
【深度学习实战】当前三个最佳图像分类模型的代码详解云博士的AI课堂大模型技术开发与实践哈佛博后带你玩转机器学习深度学习深度学习人工智能分类模型机器学习 Transformer EfficientNet ConvNeXt
下面给出三个在当前图像分类任务中精度表现突出的模型示例，分别基于SwinTransformer、EfficientNet与ConvNeXt。每个模型均包含：训练代码（使用PyTorch）从预训练权重开始微调（也可注释掉预训练选项，从头训练）数据集目录结构：└──dataset_root├──buy#第一类图像└──nobuy#第二类图像随机拆分：80%训练，20%验证每个Epoch输出一次loss
第35周—————糖尿病预测模型优化探索
目录目录前言1.检查GPU2.查看数据编辑3.划分数据集4.创建模型与编译训练5.编译及训练模型6.结果可视化7.总结前言本文为365天深度学习训练营中的学习记录博客原作者：K同学啊1.检查GPUimporttorch.nnasnnimporttorch.nn.functionalasFimporttorchvision,torch#设置硬件设备，如果有GPU则使用，没有则使用cpudevice=
计算机领域顶级会议汇总 hongyanee parallel performance processing 分布式计算 networking security
转自ustcxjt的专栏：http://blog.csdn.net/ustcxjt/article/details/7075534COREComputerScienceConferenceRankingsAcronymStandardNameRankAAAINationalConferenceoftheAmericanAssociationforArtificialIntelligenceA+AA
Hcia知识汇总小鱼快快游服务器网络 php
一.什么是HCIAHCIA—华为体系下的初级网络工程师二.网络的概念网络就是利用传输介质将世界不同位置的计算机连接在一起，就形成了一张网----可以实现信息传递和资源共享。三.网络基础计算机——电脑：处理电流信号--数字信号,实现电信号到数学信号的转换。1.应用层：抽象语言，电脑不认识，会转换成编码，软件是加到应用层的2.表示层：将编码转成二进制，所以表示层之下都是二进制应用层，表示层都是将各种类
2026-软件工程-《软件质量测试与保证》-期末复习—习题汇总海宁不掉头发习题软件工程软件质量测试与保证期末复习
单选题1.下面哪项对验收测试的描述不正确?(C)A.不仅仅要验收程序,还要验收相关的文档B.测试人员多由客户方担任,也可以客户委托第三方来进行验收测试C.由资深的开发和测试人员来进行测试D.与系统测试不同的是以客户业务需求为标准来进行测试2.白盒测试设计测试用例的依据是©。A.用户使用场景B.需求规格说明书C.代码逻辑结构D.代码注释说明3.下列属于用户体验UE测试的工作是(D)。A.检查界面是否
Github 2025-01-07Python开源项目日报 Top10 老孙正经胡说 github 开源 Github趋势分析开源项目 Python Golang
根据GithubTrendings的统计，今日(2025-01-07统计)共有10个项目上榜。根据开发语言中项目的数量，汇总情况如下：开发语言项目数量Python项目10TypeScript项目1C++项目1OpenHands:人工智能驱动的软件开发代理平台创建周期：195天开发语言：Python协议类型：MITLicenseStar数量：31753个Fork数量：3660次关注人数：31753人
ios内付费 374016526 ios 内付费
近年来写了很多IOS的程序，内付费也用到不少，使用IOS的内付费实现起来比较麻烦，这里我写了一个简单的内付费包，希望对大家有帮助。具体使用如下: 这里的sender其实就是调用者，这里主要是为了回调使用。 [KuroStoreApi kuroStoreProductId:@"产品ID" storeSender:self storeFinishCallBa
20 款优秀的 Linux 终端仿真器 brotherlamp linux linux视频 linux资料 linux自学 linux教程
终端仿真器是一款用其它显示架构重现可视终端的计算机程序。换句话说就是终端仿真器能使哑终端看似像一台连接上了服务器的客户机。终端仿真器允许最终用户用文本用户界面和命令行来访问控制台和应用程序。（LCTT 译注：终端仿真器原意指对大型机-哑终端方式的模拟，不过在当今的 Linux 环境中，常指通过远程或本地方式连接的伪终端，俗称“终端”。）你能从开源世界中找到大量的终端仿真器，它们
Solr Deep Paging(solr 深分页) eksliang solr深分页 solr分页性能问题
转载请出自出处：http://eksliang.iteye.com/blog/2148370 作者：eksliang(ickes) blg:http://eksliang.iteye.com/ 概述长期以来，我们一直有一个深分页问题。如果直接跳到很靠后的页数，查询速度会比较慢。这是因为Solr的需要为查询从开始遍历所有数据。直到Solr的4.7这个问题一直没有一个很好的解决方案。直到solr
数据库面试题 18289753290 面试题数据库
1.union ,union all 网络搜索出的最佳答案： union和union all的区别是,union会自动压缩多个结果集合中的重复结果，而union all则将所有的结果全部显示出来，不管是不是重复。 Union：对两个结果集进行并集操作，不包括重复行，同时进行默认规则的排序； Union All：对两个结果集进行并集操作，包括重复行，不进行排序； 2.索引有哪些分类？作用是
Android TV屏幕适配酷的飞上天空 android
先说下现在市面上TV分辨率的大概情况两种分辨率为主 1.720标清，分辨率为1280x720. 屏幕尺寸以32寸为主，部分电视为42寸 2.1080p全高清，分辨率为1920x1080 屏幕尺寸以42寸为主，此分辨率电视屏幕从32寸到50寸都有适配遇到问题，已1080p尺寸为例：分辨率固定不变，屏幕尺寸变化较大。如：效果图尺寸为1920x1080，如果使用d
Timer定时器与ActionListener联合应用永夜-极光 java
功能:在控制台每秒输出一次代码: package Main; import javax.swing.Timer; import java.awt.event.*; public class T { private static int count = 0; public static void main(String[] args){
Ubuntu14.04系统Tab键不能自动补全问题解决随便小屋 Ubuntu 14.04
Unbuntu 14.4安装之后就在终端中使用Tab键不能自动补全，解决办法如下： 1、利用vi编辑器打开/etc/bash.bashrc文件（需要root权限） sudo vi /etc/bash.bashrc 接下来会提示输入密码 2、找到文件中的下列代码 #enable bash completion in interactive shells #if
学会人际关系三招轻松走职场 aijuans 职场
要想成功，仅有专业能力是不够的，处理好与老板、同事及下属的人际关系也是门大学问。如何才能在职场如鱼得水、游刃有余呢？在此，教您简单实用的三个窍门。　　第一，多汇报最近，管理学又提出了一个新名词“追随力”。它告诉我们，做下属最关键的就是要多请示汇报，让上司随时了解你的工作进度，有了新想法也要及时建议。不知不觉，你就有了“追随力”，上司会越来越了解和信任你。　　第二，勤沟通团队的力
《O2O：移动互联网时代的商业革命》读书笔记 aoyouzi 读书笔记
移动互联网的未来：碎片化内容+碎片化渠道=各式精准、互动的新型社会化营销。 O2O：Online to OffLine 线上线下活动 O2O就是在移动互联网时代，生活消费领域通过线上和线下互动的一种新型商业模式。手机二维码本质：O2O商务行为从线下现实世界到线上虚拟世界的入口。线上虚拟世界创造的本意是打破信息鸿沟，让不同地域、不同需求的人
js实现图片随鼠标滚动的效果百合不是茶 JavaScript 滚动属性的获取图片滚动属性获取页面加载
1,获取样式属性值 top 与顶部的距离 left 与左边的距离 right 与右边的距离 bottom 与下边的距离 zIndex 层叠层次例子:获取左边的宽度,当css写在body标签中时 <div id="adver" style="position:absolute;top:50px;left:1000p
ajax同步异步参数async bijian1013 jquery Ajax async
开发项目开发过程中，需要将ajax的返回值赋到全局变量中，然后在该页面其他地方引用，因为ajax异步的原因一直无法成功，需将async:false，使其变成同步的。格式： $.ajax({ type: 'POST', ur
Webx3框架（1） Bill_chen eclipse spring maven 框架 ibatis
Webx是淘宝开发的一套Web开发框架，Webx3是其第三个升级版本；采用Eclipse的开发环境，现在支持java开发；采用turbine原型的MVC框架，扩展了Spring容器，利用Maven进行项目的构建管理，灵活的ibatis持久层支持，总的来说，还是一套很不错的Web框架。 Webx3遵循turbine风格，velocity的模板被分为layout/screen/control三部
【MongoDB学习笔记五】MongoDB概述 bit1129 mongodb
MongoDB是面向文档的NoSQL数据库，尽量业界还对MongoDB存在一些质疑的声音，比如性能尤其是查询性能、数据一致性的支持没有想象的那么好，但是MongoDB用户群确实已经够多。MongoDB的亮点不在于它的性能，而是它处理非结构化数据的能力以及内置对分布式的支持(复制、分片达到的高可用、高可伸缩)，同时它提供的近似于SQL的查询能力，也是在做NoSQL技术选型时，考虑的一个重要因素。Mo
spring/hibernate/struts2常见异常总结白糖_ Hibernate
Spring ①ClassNotFoundException: org.aspectj.weaver.reflect.ReflectionWorld$ReflectionWorldException 缺少aspectjweaver.jar，该jar包常用于spring aop中 ②java.lang.ClassNotFoundException: org.sprin
jquery easyui表单重置(reset)扩展思路 bozch form jquery easyui reset
在jquery easyui表单中尚未提供表单重置的功能，这就需要自己对其进行扩展。扩展的时候要考虑的控件有： combo,combobox,combogrid,combotree,datebox,datetimebox 需要对其添加reset方法，reset方法就是把初始化的值赋值给当前的组件，这就需要在组件的初始化时将值保存下来。在所有的reset方法添加完毕之后，就需要对fo
编程之美-烙饼排序 bylijinnan 编程之美
package beautyOfCoding; import java.util.Arrays; /* *《编程之美》的思路是：搜索+剪枝。有点像是写下棋程序：当前情况下，把所有可能的下一步都做一遍；在这每一遍操作里面，计算出如果按这一步走的话，能不能赢（得出最优结果）。 *《编程之美》上代码有很多错误，且每个变量的含义令人费解。因此我按我的理解写了以下代码： */
Struts1.X 源码分析之ActionForm赋值原理 chenbowen00 struts
struts1在处理请求参数之前，首先会根据配置文件action节点的name属性创建对应的ActionForm。如果配置了name属性，却找不到对应的ActionForm类也不会报错，只是不会处理本次请求的请求参数。如果找到了对应的ActionForm类，则先判断是否已经存在ActionForm的实例，如果不存在则创建实例，并将其存放在对应的作用域中。作用域由配置文件action节点的s
[空天防御与经济]在获得充足的外部资源之前,太空投资需有限度 comsci 资源
这里有一个常识性的问题: 地球的资源,人类的资金是有限的,而太空是无限的..... 就算全人类联合起来,要在太空中修建大型空间站,也不一定能够成功,因为资源和资金,技术有客观的限制.... &
ORACLE临时表—ON COMMIT PRESERVE ROWS daizj oracle 临时表
ORACLE临时表转临时表：像普通表一样，有结构，但是对数据的管理上不一样，临时表存储事务或会话的中间结果集，临时表中保存的数据只对当前会话可见，所有会话都看不到其他会话的数据，即使其他会话提交了，也看不到。临时表不存在并发行为，因为他们对于当前会话都是独立的。创建临时表时，ORACLE只创建了表的结构（在数据字典中定义），并没有初始化内存空间，当某一会话使用临时表时，ORALCE会
基于Nginx XSendfile+SpringMVC进行文件下载 denger 应用服务器 Web nginx 网络应用 lighttpd
在平常我们实现文件下载通常是通过普通 read-write方式，如下代码所示。 @RequestMapping("/courseware/{id}") public void download(@PathVariable("id") String courseID, HttpServletResp
scanf接受char类型的字符 dcj3sjt126com c
/* 2013年3月11日22:35:54 目的：学习char只接受一个字符 */ # include <stdio.h> int main(void) { int i; char ch; scanf("%d", &i); printf("i = %d\n", i); scanf("%
学编程的价值 dcj3sjt126com 编程
发一个人会编程, 想想以后可以教儿女, 是多么美好的事啊, 不管儿女将来从事什么样的职业, 教一教, 对他思维的开拓大有帮助像这位朋友学习: http://blog.sina.com.cn/s/articlelist_2584320772_0_1.html VirtualGS教程 (By @林泰前): 几十年的老程序员，资深的
二维数组（矩阵）对角线输出飞天奔月二维数组
今天在BBS里面看到这样的面试题目, 1，二维数组（N*N），沿对角线方向，从右上角打印到左下角如N=4： 4*4二维数组 { 1 2 3 4 } { 5 6 7 8 } { 9 10 11 12 } {13 14 15 16 } 打印顺序 4 3 8 2 7 12 1 6 11 16 5 10 15 9 14 13 要
Ehcache（08）——可阻塞的Cache——BlockingCache 234390216 并发 ehcache BlockingCache 阻塞
可阻塞的Cache—BlockingCache 在上一节我们提到了显示使用Ehcache锁的问题，其实我们还可以隐式的来使用Ehcache的锁，那就是通过BlockingCache。BlockingCache是Ehcache的一个封装类，可以让我们对Ehcache进行并发操作。其内部的锁机制是使用的net.
mysqldiff对数据库间进行差异比较 jackyrong mysqld
mysqldiff该工具是官方mysql-utilities工具集的一个脚本，可以用来对比不同数据库之间的表结构，或者同个数据库间的表结构如果在windows下，直接下载mysql-utilities安装就可以了，然后运行后，会跑到命令行下： 1）基本用法 mysqldiff --server1=admin:12345
spring data jpa 方法中可用的关键字 lawrence.li java spring
spring data jpa 支持以方法名进行查询/删除/统计。查询的关键字为find 删除的关键字为delete/remove (>=1.7.x) 统计的关键字为count (>=1.7.x) 修改需要使用@Modifying注解 @Modifying @Query("update User u set u.firstna
Spring的ModelAndView类 nicegege spring
项目中controller的方法跳转的到ModelAndView类，一直很好奇spring怎么实现的？ /* * Copyright 2002-2010 the original author or authors. * * Licensed under the Apache License, Version 2.0 (the "License"); * yo
搭建 CentOS 6 服务器(13) - rsync、Amanda rensanning centos
（一）rsync Server端 # yum install rsync # vi /etc/xinetd.d/rsync service rsync { disable = no flags = IPv6 socket_type = stream wait
Learn Nodejs 02 toknowme nodejs
（1）npm是什么 npm is the package manager for node 官方网站：https://www.npmjs.com/ npm上有很多优秀的nodejs包，来解决常见的一些问题，比如用node-mysql，就可以方便通过nodejs链接到mysql，进行数据库的操作在开发过程往往会需要用到其他的包，使用npm就可以下载这些包来供程序调用 &nb
Spring MVC 拦截器 xp9802 spring mvc
Controller层的拦截器继承于HandlerInterceptorAdapter HandlerInterceptorAdapter.java 1 public abstract class HandlerInterceptorAdapter implements HandlerIntercep

W1	[[ 1.63535156 -0.62320365 -0.53718766] [-1.07799357 0.85639907 -2.29470142]]
b1	[[ 1.74604067] [-0.75184921]]
W2	[[ 0.32171798 -0.25467393 1.46902454] [-2.05617317 -0.31554548 -0.3756023 ] [ 1.1404819 -1.09976462 -0.1612551 ]]
b2	[[-0.88020257] [ 0.02561572] [ 0.57539477]]

shape of the 1st mini_batch_X	(12288, 64)
shape of the 2nd mini_batch_X	(12288, 64)
shape of the 3rd mini_batch_X	(12288, 20)
shape of the 1st mini_batch_Y	(1, 64)
shape of the 2nd mini_batch_Y	(1, 64)
shape of the 3rd mini_batch_Y	(1, 20)
mini batch sanity check	[ 0.90085595 -0.7612069 0.2344157 ]

v[“dW1”]	[[ 0. 0. 0.] [ 0. 0. 0.]]
v[“db1”]	[[ 0.] [ 0.]]
v[“dW2”]	[[ 0. 0. 0.] [ 0. 0. 0.] [ 0. 0. 0.]]
v[“db2”]	[[ 0.] [ 0.] [ 0.]]

W1	[[ 1.62544598 -0.61290114 -0.52907334] [-1.07347112 0.86450677 -2.30085497]]
b1	[[ 1.74493465] [-0.76027113]]
W2	[[ 0.31930698 -0.24990073 1.4627996 ] [-2.05974396 -0.32173003 -0.38320915] [ 1.13444069 -1.0998786 -0.1713109 ]]
b2	[[-0.87809283] [ 0.04055394] [ 0.58207317]]
v[“dW1”]	[[-0.11006192 0.11447237 0.09015907] [ 0.05024943 0.09008559 -0.06837279]]
v[“db1”]	[[-0.01228902] [-0.09357694]]
v[“dW2”]	[[-0.02678881 0.05303555 -0.06916608] [-0.03967535 -0.06871727 -0.08452056] [-0.06712461 -0.00126646 -0.11173103]]
v[“db2”]	[[ 0.02344157] [ 0.16598022] [ 0.07420442]]

W1	[[ 1.63178673 -0.61919778 -0.53561312] [-1.08040999 0.85796626 -2.29409733]]
b1	[[ 1.75225313] [-0.75376553]]
W2	[[ 0.32648046 -0.25681174 1.46954931] [-2.05269934 -0.31497584 -0.37661299] [ 1.14121081 -1.09245036 -0.16498684]]
b2	[[-0.88529978] [ 0.03477238] [ 0.57537385]]
v[“dW1”]	[[-0.11006192 0.11447237 0.09015907] [ 0.05024943 0.09008559 -0.06837279]]
v[“db1”]	[[-0.01228902] [-0.09357694]]
v[“dW2”]	[[-0.02678881 0.05303555 -0.06916608] [-0.03967535 -0.06871727 -0.08452056] [-0.06712461 -0.00126646 -0.11173103]]
v[“db2”]	[[ 0.02344157] [ 0.16598022] [ 0.07420442]]
s[“dW1”]	[[ 0.00121136 0.00131039 0.00081287] [ 0.0002525 0.00081154 0.00046748]]
s[“db1”]	[[ 1.51020075e-05] [ 8.75664434e-04]]
s[“dW2”]	[[ 7.17640232e-05 2.81276921e-04 4.78394595e-04] [ 1.57413361e-04 4.72206320e-04 7.14372576e-04] [ 4.50571368e-04 1.60392066e-07 1.24838242e-03]]
s[“db2”]	[[ 5.49507194e-05] [ 2.75494327e-03] [ 5.50629536e-04]]

optimization method	accuracy	cost shape
Gradient descent	79.7%	oscillations
Momentum	79.7%	oscillations
Adam	94%	smoother