The core problem studied in this section is as follows: we are given some function f(x), where x is a vector of inputs, and we are interested in computing the gradient of f at x (i.e. \nabla f(x)).
In the specific case of Neural Networks, f(x) will
correspond to the loss function (L) and the inputs x will consist of the training data and the neural network weights. The training data is given and fixed, while the parameters (e.g. W and b) are the variables we control. We need to compute the gradient with respect to all of the parameters so that we can use it to perform a parameter update.
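As a reminder of how that gradient is used, here is a minimal sketch of a vanilla gradient-descent update (the scalar parameter and learning rate are illustrative assumptions, not from these notes):

# hypothetical scalar parameter for illustration
w = 0.5
dw = 0.12                 # gradient of the loss with respect to w (would come from backprop)
learning_rate = 1e-3
w -= learning_rate * dw   # vanilla gradient descent: step against the gradient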
Q1: What is an add gate?
Q2: What is a max gate?
Q3: What is a mul gate?
A1: An add gate is a gradient distributor: it passes the incoming gradient unchanged to all of its inputs.
A2: A max gate is a gradient router: it routes the full incoming gradient to the input that had the maximum value, and gives zero to the rest.
A3: A mul gate is a gradient switcher: each input receives the incoming gradient scaled by the value of the other input (the inputs are "switched").
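As a concrete illustration (a minimal sketch, not part of the original notes; the function names are assumptions), these three gate behaviors look like this for scalar inputs:

def add_backward(x, y, dout):
    # add gate: distributes the incoming gradient dout unchanged to both inputs
    return dout, dout

def max_backward(x, y, dout):
    # max gate: routes the full gradient to the input that was larger, zero to the other
    return (dout, 0.0) if x >= y else (0.0, dout)

def mul_backward(x, y, dout):
    # mul gate: each input receives the other input's value times the incoming gradient
    return y * dout, x * dout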
For example:
f(x,y) = \frac{x + \sigma(y)}{\sigma(x) + (x+y)^2}, where \sigma(x) = \frac{1}{1+e^{-x}}
import math

x = 3 # example values
y = -4
# forward pass
sigy = 1.0 / (1 + math.exp(-y)) # sigmoid in the numerator #(1)
num = x + sigy # numerator #(2)
sigx = 1.0 / (1 + math.exp(-x)) # sigmoid in the denominator #(3)
xpy = x + y #(4)
xpysqr = xpy**2 #(5)
den = sigx + xpysqr # denominator #(6)
invden = 1.0 / den #(7)
f = num * invden # done! #(8)
# backprop f = num * invden
dnum = invden # gradient of the numerator #(8)
dinvden = num #(8)
# backprop invden = 1.0 / den
dden = (-1.0 / (den**2)) * dinvden #(7)
# backprop den = sigx + xpysqr
dsigx = (1) * dden #(6)
dxpysqr = (1) * dden #(6)
# backprop xpysqr = xpy**2
dxpy = (2 * xpy) * dxpysqr #(5)
# backprop xpy = x + y
dx = (1) * dxpy #(4)
dy = (1) * dxpy #(4)
# backprop sigx = 1.0 / (1 + math.exp(-x))
dx += ((1 - sigx) * sigx) * dsigx # Notice += !! See notes below #(3)
# backprop num = x + sigy
dx += (1) * dnum #(2)
dsigy = (1) * dnum #(2)
# backprop sigy = 1.0 / (1 + math.exp(-y))
dy += ((1 - sigy) * sigy) * dsigy #(1)
# done!
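To sanity-check the hand-derived gradients above, they can be compared against a numerical gradient. A minimal sketch (the helper f_xy below is an assumed repackaging of the forward pass, not part of the original code):

import math

def f_xy(x, y):
    # same forward pass as above, wrapped in a function for the check
    sigy = 1.0 / (1 + math.exp(-y))
    sigx = 1.0 / (1 + math.exp(-x))
    return (x + sigy) / (sigx + (x + y)**2)

h = 1e-5
x, y = 3, -4
dx_num = (f_xy(x + h, y) - f_xy(x - h, y)) / (2 * h)  # centered difference in x
dy_num = (f_xy(x, y + h) - f_xy(x, y - h)) / (2 * h)  # centered difference in y
# dx_num and dy_num should closely match the analytic dx and dy computed above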
Computing \frac{\partial C}{\partial w} can be decomposed into two steps:
1. \frac{\partial z}{\partial w}: this is the forward pass (it is actually the last factor of the derivative and can be obtained directly while moving forward through the network).
2. \frac{\partial C}{\partial z}: this is the backward pass (it cannot be computed directly; it must be computed recursively, from the output back toward the input).
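As a concrete (assumed) example, for a single neuron with pre-activation z = wx + b, the forward-pass factor is \frac{\partial z}{\partial w} = x, which is known as soon as the input arrives, while \frac{\partial C}{\partial z} has to be propagated back from the layers above, giving \frac{\partial C}{\partial w} = \frac{\partial z}{\partial w} \cdot \frac{\partial C}{\partial z} = x \cdot \frac{\partial C}{\partial z}.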
Neural networks can be very large: writing down the gradient formula for every parameter by hand is impractical.
backpropagation = recursively applying the chain rule along the computational graph to compute the gradients of all inputs / parameters / intermediates.
Implementations maintain a graph structure in which each node implements a forward() / backward() API:
forward(): compute the result of the operation and save in memory any intermediates needed for the gradient computation.
backward(): apply the chain rule to compute the gradient of the loss function with respect to the node's inputs.
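A minimal sketch of a node following this forward()/backward() pattern (the class name MultiplyGate and its exact interface are illustrative assumptions, not a real library API):

class MultiplyGate:
    # one node of the computational graph: z = x * y
    def forward(self, x, y):
        # compute the result and cache the inputs needed later for the gradient
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # chain rule: local gradients times the gradient dz flowing in from above
        dx = self.y * dz
        dy = self.x * dz
        return dx, dy

# usage
gate = MultiplyGate()
z = gate.forward(3.0, -4.0)     # forward pass: z = -12.0
dx, dy = gate.backward(1.0)     # backward pass: dx = -4.0, dy = 3.0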