CS224d Assignment 1 Answers, Part (2/4)

I have split the Assignment 1 answers into 4 parts, covering problems 1, 2, 3, and 4 respectively. This part contains the answer to problem 2.

2. Neural Network Basics (30 points)

(a). (3 points) Derive the gradients of the sigmoid function and show that it can be rewritten as a function of the function value (i.e. in some expression where only $\sigma(x)$, but not $x$, is present). Assume that the input $x$ is a scalar for this question. Recall, the sigmoid function is

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{2}$$

Solution:

$$\sigma'(x) = -(1 + e^{-x})^{-2} \cdot (-e^{-x}) = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}}\left(1 - \frac{1}{1 + e^{-x}}\right) = \sigma(x)\left[1 - \sigma(x)\right]$$
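As a quick sanity check, here is a minimal numpy sketch (the names and test point are my own, not the assignment's code) comparing the analytic gradient $\sigma(x)[1 - \sigma(x)]$ against a centered finite difference:

```python
# Minimal sketch: verify sigma'(x) = sigma(x) * (1 - sigma(x)) numerically.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(s):
    # The argument is the function value s = sigmoid(x), not x itself.
    return s * (1.0 - s)

x = 0.7                                            # arbitrary test point
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(sigmoid_grad(sigmoid(x)), numeric)           # the two values should agree closely
```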


(b). (3 points) Derive the gradient with regard to the inputs of a softmax function when cross entropy loss is used for evaluation, i.e. find the gradients with respect to the softmax input vector $\theta$, when the prediction is made by $\hat{y} = \mathrm{softmax}(\theta)$. Remember the cross entropy function is

$$CE(y, \hat{y}) = -\sum_i y_i \log(\hat{y}_i) \tag{3}$$

where $y$ is the one-hot label vector, and $\hat{y}$ is the predicted probability vector for all classes. (Hint: you might want to consider the fact that many elements of $y$ are zeros, and assume that only the k-th dimension of $y$ is one.)

Solution: Following the hint, assume that the k-th element of $y$ is 1 and all other elements are 0, i.e. $y_k = 1$. Then:

$$CE(y, \hat{y}) = -y_k \log(\hat{y}_k) = -\log(\hat{y}_k)$$

For the $i$-th element $\theta_i$ of $\theta$:
$$\frac{\partial CE(y, \hat{y})}{\partial \theta_i} = -\frac{\partial}{\partial \theta_i}\log\frac{e^{\theta_k}}{\sum_j e^{\theta_j}} = -\frac{\partial}{\partial \theta_i}\left(\theta_k - \log\sum_j e^{\theta_j}\right) = \frac{\partial}{\partial \theta_i}\log\sum_j e^{\theta_j} - \frac{\partial \theta_k}{\partial \theta_i} = \begin{cases}\hat{y}_i, & i \ne k \\ \hat{y}_i - 1, & i = k\end{cases}$$

Therefore
$$\frac{\partial CE(y, \hat{y})}{\partial \theta} = \hat{y} - y$$
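A small numpy sketch (the variable names and test values are made up for illustration) that checks the $\hat{y} - y$ result against finite differences:

```python
# Minimal sketch: gradient of softmax + cross entropy w.r.t. theta is y_hat - y.
import numpy as np

def softmax(theta):
    e = np.exp(theta - np.max(theta))              # shift for numerical stability
    return e / np.sum(e)

def cross_entropy(y, theta):
    return -np.sum(y * np.log(softmax(theta)))

theta = np.array([0.5, -1.2, 2.0])
y = np.array([0.0, 0.0, 1.0])                      # one-hot label, k = 2

analytic = softmax(theta) - y                      # the result derived above

eps = 1e-6
numeric = np.zeros_like(theta)
for i in range(theta.size):
    d = np.zeros_like(theta); d[i] = eps
    numeric[i] = (cross_entropy(y, theta + d) - cross_entropy(y, theta - d)) / (2 * eps)

print(analytic)
print(numeric)                                     # should match to several decimal places
```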


(c). (6 points) Derive the gradients with respect to the inputs $x$ to a one-hidden-layer neural network (that is, find $\frac{\partial J}{\partial x}$ where $J$ is the cost function for the neural network). The neural network employs sigmoid activation function for the hidden layer, and softmax for the output layer. Assume the one-hot label vector is $y$, and cross entropy cost is used. (Feel free to use $\sigma'(x)$ as the shorthand for sigmoid gradient, and feel free to define any variables whenever you see fit.)
Recall that the forward propagation is as follows

$$h = \mathrm{sigmoid}(xW_1 + b_1)$$
$$\hat{y} = \mathrm{softmax}(hW_2 + b_2)$$

Note that here we’re assuming that the input vector (thus the hidden variables and output probabilities) is a row vector to be consistent with the programming assignment. When we apply the sigmoid function to a vector, we are applying it to each of the elements of that vector. $W_i$ and $b_i$ ($i = 1, 2$) are the weights and biases, respectively, of the two layers.

Solution: Let the k-th element of $y$ be 1 and all other elements be 0, i.e. $y_k = 1$. Then:

$$J = -y_k \log(\hat{y}_k) = -\log(\hat{y}_k)$$

Let $\theta_2 = hW_2 + b_2$, so that $\hat{y} = \mathrm{softmax}(\theta_2)$. Denote the $i$-th element of $\theta_2$ by $\theta^{(2)}_i$, and the element in row $i$, column $j$ of $W_2$ by $W^{(2)}_{ij}$. Then:
$$\frac{\partial J}{\partial h_i} = \sum_j \frac{\partial J}{\partial \theta^{(2)}_j}\frac{\partial \theta^{(2)}_j}{\partial h_i} = \sum_j (\hat{y}_j - y_j) W^{(2)}_{ij} = \left[(\hat{y} - y) W_2^T\right]\Big|_i$$

Here $\frac{\partial \theta^{(2)}_j}{\partial h_i} = W^{(2)}_{ij}$; in fact, using the Einstein summation convention we have $\theta^{(2)}_j = h_i W^{(2)}_{ij} + b^{(2)}_j$, from which $\frac{\partial \theta^{(2)}_j}{\partial h_i} = W^{(2)}_{ij}$ follows. Moreover, $\frac{\partial J}{\partial \theta_2} = \hat{y} - y$ follows from part (b).

Similarly, let $\theta_1 = xW_1 + b_1$, so that $h = \sigma(\theta_1)$. Denote the $i$-th element of $\theta_1$ by $\theta^{(1)}_i$, and the element in row $i$, column $j$ of $W_1$ by $W^{(1)}_{ij}$. Then:

$$\frac{\partial J}{\partial \theta^{(1)}_i} = \sum_j \frac{\partial J}{\partial h_j}\frac{\partial h_j}{\partial \theta^{(1)}_i} = \frac{\partial J}{\partial h_i}\frac{\partial h_i}{\partial \theta^{(1)}_i} = \left[(\hat{y} - y) W_2^T\right]\Big|_i \cdot \sigma'(\theta_1)\Big|_i$$

And therefore:
$$\frac{\partial J}{\partial x_i} = \sum_j \frac{\partial J}{\partial \theta^{(1)}_j}\frac{\partial \theta^{(1)}_j}{\partial x_i} = \sum_j \frac{\partial J}{\partial \theta^{(1)}_j} W^{(1)}_{ij} = \left[\left((\hat{y} - y) W_2^T \circ \sigma'(\theta_1)\right) W_1^T\right]\Big|_i$$

where $\circ$ denotes the elementwise product. (A small gripe: such a long derivation for only 6 points is rather stingy.)
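The whole derivation can be expressed in a few lines of numpy. This is a minimal sketch under the row-vector convention used above (the shapes and names are my own illustration, not the assignment's solution code):

```python
# Minimal sketch: forward pass and backprop to dJ/dx for the one-hidden-layer net.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

Dx, H, Dy = 4, 3, 5
rng = np.random.RandomState(0)
x = rng.randn(1, Dx)
W1, b1 = rng.randn(Dx, H), rng.randn(1, H)
W2, b2 = rng.randn(H, Dy), rng.randn(1, Dy)
y = np.zeros((1, Dy)); y[0, 2] = 1.0               # one-hot label

# Forward: h = sigmoid(x W1 + b1), y_hat = softmax(h W2 + b2)
theta1 = x.dot(W1) + b1
h = sigmoid(theta1)
theta2 = h.dot(W2) + b2
e = np.exp(theta2 - np.max(theta2))
y_hat = e / np.sum(e)
J = -np.sum(y * np.log(y_hat))

# Backward, following the chain-rule steps above.
dtheta2 = y_hat - y                                # dJ/dtheta2, from part (b)
dh = dtheta2.dot(W2.T)                             # dJ/dh
dtheta1 = dh * sigmoid(theta1) * (1 - sigmoid(theta1))   # elementwise sigma'(theta1)
dx = dtheta1.dot(W1.T)                             # dJ/dx = ((y_hat - y) W2^T o sigma'(theta1)) W1^T
print(dx.shape)                                    # (1, Dx)
```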


(d). (2 points) How many parameters are there in this neural network, assuming the input is $D_x$-dimensional, the output is $D_y$-dimensional, and there are $H$ hidden units?
Solution: $W_1$ has shape $D_x \times H$, $b_1$ has shape $1 \times H$, $W_2$ has shape $H \times D_y$, and $b_2$ has shape $1 \times D_y$. So there are $D_x H + H + H D_y + D_y$ parameters in total.
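Equivalently, $(D_x + 1)H + (H + 1)D_y$. A quick check with made-up sizes:

```python
# Illustrative only: the concrete sizes below are made up.
Dx, H, Dy = 10, 5, 10
n_params = (Dx + 1) * H + (H + 1) * Dy             # = Dx*H + H + H*Dy + Dy
print(n_params)                                    # 115
```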


(e)(f)(g). See the code; omitted here.
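For reference, the core of part (f) is a numerical gradient checker. Below is a minimal sketch of the idea (a simplified version written from scratch, not the assignment's implementation; the function names are mine):

```python
# Minimal sketch of a centered-difference gradient check.
import numpy as np

def grad_check(f, x, eps=1e-4, tol=1e-5):
    """f(x) returns (cost, analytic_gradient); x is a numpy array of parameters."""
    _, grad = f(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + eps; fxp, _ = f(x)
        x[ix] = old - eps; fxm, _ = f(x)
        x[ix] = old                                # restore the parameter
        numeric = (fxp - fxm) / (2 * eps)
        if abs(numeric - grad[ix]) > tol * max(1.0, abs(numeric), abs(grad[ix])):
            print("Gradient check failed at", ix)
            return False
        it.iternext()
    print("Gradient check passed!")
    return True

# Example: f(x) = sum(x^2), whose gradient is 2x.
quad = lambda x: (np.sum(x ** 2), 2 * x)
grad_check(quad, np.random.randn(4, 5))
```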

