Personal study notes
Date: 2023.01.06
Reference: the official CS231n course notes
The expression is as follows:
$$f(x,y)=xy \quad\to\quad \frac{\partial f}{\partial x}=y,\ \frac{\partial f}{\partial y}=x$$
From
$$\frac{\mathrm{d}f(x)}{\mathrm{d}x}=\lim_{h \to 0}\frac{f(x+h)-f(x)}{h}$$
we get, to first order,
$$f(x+h) \approx f(x)+h\,\frac{\mathrm{d}f(x)}{\mathrm{d}x}$$
An example: if $x=4,\ y=-3$, the partial derivative with respect to $x$ is $-3$. This means that if $x$ increases by a tiny amount $h$, then $f$ decreases, and it decreases by $3h$.
The gradient is the vector of partial derivatives, i.e. $\nabla f =[\frac{\partial f}{\partial x},\frac{\partial f}{\partial y}]$. So "the gradient on $x$" just means the partial derivative of $f$ with respect to $x$.
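To make this concrete, here is a minimal numerical-gradient sketch for $f(x,y)=xy$ (the helper numerical_grad_x and the step size h=1e-5 are my own illustrative choices, not from the notes):

def f(x, y):
    return x * y

def numerical_grad_x(x, y, h=1e-5):
    # finite-difference approximation (f(x+h) - f(x)) / h
    return (f(x + h, y) - f(x, y)) / h

print(numerical_grad_x(4, -3))  # ~ -3.0, matching df/dx = y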
A common worked example:
Suppose we have the following expression
$$f(x,y,z) = (x+y)z$$
Let $q=x+y$, so that $f=qz$; then we have the following partial derivatives:
$$\frac{\partial f}{\partial q}=z,\quad \frac{\partial f}{\partial z}=q,\quad\frac{\partial q}{\partial x}=1,\quad\frac{\partial q}{\partial y}=1$$
What we actually want are the partial derivatives of $f$ with respect to $x,y,z$. By the chain rule, $\frac{\partial f}{\partial x}=\frac{\partial f}{\partial q}\frac{\partial q}{\partial x}=z \cdot 1=z$.
The backpropagation mechanism is as follows.
Backpropagation just multiplies the value flowing back from the step above by the local partial derivative, which is fairly intuitive. Explanation:
The partial derivative of $f$ with respect to $x$ is $-4$, i.e. the gradient on $x$ is $-4$ (the worked values here are $x=-2$, $y=5$, $z=-4$). If the input $x$ increases, say from $-2$ to $-1$, then $q$ becomes $4$ and $f$ becomes $-16$: the output $f$ gets smaller, and it shrinks by $4$ times the amount by which $x$ grew (hence the $-4$).
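Here is a short sketch of this example in code (the values match the numbers above; the variable names dfdx, dfdq, etc. are my own choices):

# set some inputs
x = -2; y = 5; z = -4
# forward pass
q = x + y      # q becomes 3
f = q * z      # f becomes -12
# backward pass, in reverse order
dfdz = q       # df/dz = q = 3
dfdq = z       # df/dq = z = -4
dfdx = 1.0 * dfdq  # chain rule: df/dx = df/dq * dq/dx = -4
dfdy = 1.0 * dfdq  # chain rule: df/dy = df/dq * dq/dy = -4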
$$f(w,x) = \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$$
Its backpropagation graph is as follows (not explained in detail).
Here the inputs are [x0, x1], and [w0, w1, w2] are the learnable weights of the network; more on this later.
Here we turn this into a template (a single sigmoid gate); let
$$\sigma(x)=\frac{1}{1+e^{-x}} \quad\to\quad \frac{\mathrm{d}\sigma(x)}{\mathrm{d}x}=(1-\sigma(x))\,\sigma(x)$$
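This derivative can be verified by differentiating directly:
$$\frac{\mathrm{d}\sigma(x)}{\mathrm{d}x} = \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1+e^{-x}-1}{1+e^{-x}} \cdot \frac{1}{1+e^{-x}} = (1-\sigma(x))\,\sigma(x)$$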
When the input is $x=1$, the output is $\sigma(x)=0.73$, so the local gradient is $(1-0.73)\times 0.73 \approx 0.2$. This simplifies things considerably; in practice, complicated sub-expressions are often templated in this way.
In the following program, we introduce an intermediate variable dot and call the output f. The gradient on the intermediate variable is then (1-f)*f, stored as ddot, and the gradient dx becomes [w[0]*ddot, w[1]*ddot].
import math

w = [2,-3,-3] # assume some random weights and data
x = [-1, -2]
# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]
f = 1.0 / (1 + math.exp(-dot)) # sigmoid function
# backward pass through the neuron (backpropagation)
ddot = (1 - f) * f # gradient on dot variable, using the sigmoid gradient derivation
dx = [w[0] * ddot, w[1] * ddot] # backprop into x
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot] # backprop into w
# we're done! we have the gradients on the inputs to the circuit
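As a quick sanity check (my own addition, not part of the original code), dw[0] can be compared against a finite-difference estimate:

h = 1e-5
dot_h = (w[0] + h) * x[0] + w[1] * x[1] + w[2]   # perturb w[0] by h
f_h = 1.0 / (1 + math.exp(-dot_h))
print((f_h - f) / h, dw[0])  # the two numbers should be close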
$$f(x,y) = \frac{x+\sigma(y)}{\sigma(x)+(x+y)^2}$$
First we break it into templated stages; the code is as follows
import math

x = 3 # example values
y = -4
# forward pass
sigy = 1.0 / (1 + math.exp(-y)) # sigmoid in numerator #(1)
num = x + sigy # numerator #(2)
sigx = 1.0 / (1 + math.exp(-x)) # sigmoid in denominator #(3)
xpy = x + y #(4)
xpysqr = xpy**2 #(5)
den = sigx + xpysqr # denominator #(6)
invden = 1.0 / den #(7)
f = num * invden # done! #(8)
After the forward pass, we simply backpropagate in reverse order; note which values in the code we need gradients for.
# backprop f = num * invden
dnum = invden # gradient on numerator #(8)
dinvden = num #(8)
# backprop invden = 1.0 / den
dden = (-1.0 / (den**2)) * dinvden #(7)
# backprop den = sigx + xpysqr
dsigx = (1) * dden #(6)
dxpysqr = (1) * dden #(6)
# backprop xpysqr = xpy**2
dxpy = (2 * xpy) * dxpysqr #(5)
# backprop xpy = x + y
dx = (1) * dxpy #(4)
dy = (1) * dxpy #(4)
# backprop sigx = 1.0 / (1 + math.exp(-x))
dx += ((1 - sigx) * sigx) * dsigx # Notice += !! See notes below #(3)
# backprop num = x + sigy
dx += (1) * dnum #(2)
dsigy = (1) * dnum #(2)
# backprop sigy = 1.0 / (1 + math.exp(-y))
dy += ((1 - sigy) * sigy) * dsigy #(1)
# done! phew
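The += on dx and dy is needed because x and y each feed into the expression through several branches (num, sigx, and xpy); by the multivariable chain rule, the gradient contributions from the different paths add up:
$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial \mathrm{num}}\frac{\partial \mathrm{num}}{\partial x} + \frac{\partial f}{\partial \mathrm{sigx}}\frac{\partial \mathrm{sigx}}{\partial x} + \frac{\partial f}{\partial \mathrm{xpy}}\frac{\partial \mathrm{xpy}}{\partial x}$$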
Here we look at how to work out the matrix multiplications in the backward pass purely from the dimensions.
import numpy as np

# forward pass
W = np.random.randn(5, 10)
X = np.random.randn(10, 3)
D = W.dot(X)
# now suppose we had the gradient on D from above in the circuit
dD = np.random.randn(*D.shape) # same shape as D
dW = dD.dot(X.T) #.T gives the transpose of the matrix
dX = W.T.dot(dD)
dW and dX should have the same shape as W and X, respectively.
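Continuing the snippet above, the shapes can be checked directly (the assert lines are my own addition): dD has shape (5, 3) like D, so the only product that yields a (5, 10) gradient for W is dD.dot(X.T), and the only one that yields (10, 3) for X is W.T.dot(dD).

assert dW.shape == W.shape  # (5, 10) = (5, 3) dot (3, 10)
assert dX.shape == X.shape  # (10, 3) = (10, 5) dot (5, 3)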