Reference link
1、Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th minibatch?
In the notation a^{[i]{j}(k)}, the square brackets [i] index the layer, the curly braces {j} index the mini-batch, and the parentheses (k) index the example within that mini-batch, so the answer here is a^{[3]{8}(7)}.
See the reference link for the notation conventions used with mini-batches.
===============================================================
2、Which of these statements about mini-batch gradient descent do you agree with?
Which of these statements about mini-batch gradient descent is correct?
See the reference link: with batch gradient descent, one pass through the training set lets you take only one gradient descent step, whereas with mini-batch gradient descent, one pass through the training set lets you take (in the lecture's example) 5000 gradient descent steps. Mini-batch gradient descent therefore runs faster than batch gradient descent, so almost everyone working in deep learning uses it when training on huge datasets.
Vectorization is not for computing several mini-batches at the same time; it speeds up the computation within a single mini-batch.
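A minimal Python/numpy sketch of the mini-batch loop described above; `compute_grads` is an assumed helper that returns the gradients for one mini-batch, and the batch size and learning rate are illustrative:

```python
import numpy as np

def minibatch_indices(m, batch_size, seed=0):
    """Shuffle the m examples and yield index slices of size batch_size."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)
    for start in range(0, m, batch_size):
        yield perm[start:start + batch_size]

def train(params, X, Y, compute_grads, learning_rate=0.01, batch_size=64, epochs=10):
    """X has shape (n_features, m), Y has shape (1, m);
    compute_grads(params, X_batch, Y_batch) is an assumed helper returning a dict of gradients."""
    m = X.shape[1]
    for epoch in range(epochs):
        for idx in minibatch_indices(m, batch_size, seed=epoch):
            grads = compute_grads(params, X[:, idx], Y[:, idx])
            # One gradient-descent step per mini-batch: a single pass over the
            # training set makes m / batch_size updates instead of just one.
            for key in params:
                params[key] -= learning_rate * grads[key]
    return params
```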
================================================================
3、Why is the best mini-batch size usually not 1 and not m, but instead something in-between?
If the mini-batch size is m, that is equivalent to not splitting the training set at all, so every gradient step has to process the whole training set before making any progress. If the size is 1, the parameters are updated after every single example, which loses the speed-up from vectorization and makes the updates noisier and more random, so training is less efficient.
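A small numpy sketch of the vectorization point: with a mini-batch of, say, 256 examples, a single matrix multiply replaces 256 separate matrix-vector products (shapes and sizes here are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 512))   # weights of one layer (illustrative shape)
X = rng.standard_normal((512, 256))   # a mini-batch of 256 examples, one per column

# Vectorized: one matrix multiply processes the whole mini-batch at once.
Z_vec = W @ X

# Size-1 "mini-batches": one matrix-vector product per example (much slower in practice).
Z_loop = np.stack([W @ X[:, i] for i in range(X.shape[1])], axis=1)

assert np.allclose(Z_vec, Z_loop)     # same result, very different speed
```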
=================================================================
4、Suppose your learning algorithm's cost J, plotted as a function of the number of iterations, looks like this:
Which of the following is correct?
See the reference link for the cost curve of mini-batch gradient descent: it trends downward but is noisier. It does not matter that J fails to decrease on every single iteration; the overall trend should still be downward.
There will be some oscillations when you're using mini-batch gradient descent since there could be some noisy data examples in the batches. However, batch gradient descent always guarantees a lower J before reaching the optimum.
==================================================================
5、Suppose the temperature in Casablanca over the first three days of January is the same:
Jan 1st: θ_1 = 10
Jan 2nd: θ_2 = 10
Say you use an exponentially weighted average with β = 0.5 to track the temperature: v_0 = 0, v_t = βv_t−1 + (1 − β)θ_t. If v_2 is the value computed after day 2 without bias correction, and v^corrected_2 is the value you compute with bias correction. What are these values?
- v_2 = 7.5, v^{corrected}_2 = 10 (correct)
See the reference link for the calculation:
v_0 = 0
v_1 = βv_0 + (1 − β)θ_1 = 0.5 × 0 + 0.5 × 10 = 5
v_2 = βv_1 + (1 − β)θ_2 = 0.5 × 5 + 0.5 × 10 = 2.5 + 5 = 7.5
v^{corrected}_2 = v_2 / (1 − β^2) = 7.5 / (1 − 0.25) = 10
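A few lines of Python reproducing the arithmetic above:

```python
beta = 0.5
theta_1, theta_2 = 10, 10                # Jan 1st and Jan 2nd temperatures

v0 = 0.0
v1 = beta * v0 + (1 - beta) * theta_1    # 0.5 * 0 + 0.5 * 10 = 5.0
v2 = beta * v1 + (1 - beta) * theta_2    # 0.5 * 5 + 0.5 * 10 = 7.5
v2_corrected = v2 / (1 - beta ** 2)      # 7.5 / 0.75 = 10.0
print(v2, v2_corrected)                  # 7.5 10.0
```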
===============================================================
6、Which of these is NOT a good learning rate decay scheme? Here, t is the epoch number.
See the reference link: a good decay scheme makes the learning rate decrease as t grows.
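A short Python sketch of decay schedules of the kind discussed in the course (the constants alpha0, decay_rate and k are illustrative); anything that makes α grow with t instead of shrink is not a good decay scheme:

```python
import numpy as np

alpha0 = 0.2        # initial learning rate (illustrative)
decay_rate = 1.0    # illustrative hyperparameters
k = 1.0

def decay_inverse(t):       # alpha = alpha0 / (1 + decay_rate * t)
    return alpha0 / (1 + decay_rate * t)

def decay_exponential(t):   # alpha = 0.95**t * alpha0
    return 0.95 ** t * alpha0

def decay_sqrt(t):          # alpha = k / sqrt(t) * alpha0
    return k / np.sqrt(t) * alpha0

for t in (1, 5, 10):        # all three shrink as the epoch number t grows
    print(t, decay_inverse(t), decay_exponential(t), decay_sqrt(t))
```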
================================================================
7、You use an exponentially weighted average on the London temperature dataset. You use the following to track the temperature: v_t = βv_t−1 + (1 − β)θ_t. The red line below was computed using β = 0.9. What would happen to your red curve as you vary β? (Check the two that apply)
See the reference link: increasing β makes the red line smoother and shifts it slightly to the right, while decreasing β creates more oscillation within the red line.
================================================================
8、These plots were generated with gradient descent; with gradient descent with momentum (β = 0.5); and with gradient descent with momentum (β = 0.9). Which curve corresponds to which algorithm?
See the reference link for gradient descent with momentum.
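A minimal sketch of the momentum update being compared in these plots, assuming dW is the gradient computed on the current mini-batch; β = 0 recovers plain gradient descent, and a larger β (e.g. 0.9) averages over more past gradients, giving smaller oscillations than β = 0.5:

```python
def momentum_step(W, dW, v, beta=0.9, learning_rate=0.01):
    """One gradient-descent-with-momentum update.

    v is the exponentially weighted average of past gradients;
    beta = 0 recovers plain gradient descent.
    """
    v = beta * v + (1 - beta) * dW
    W = W - learning_rate * v
    return W, v
```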
=============================================================
9、Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function J(W[1],b[1],…,W[L],b[L]). Which of the following techniques could help find parameter values that attain a small value for J? (Check all that apply)
See the reference link.
Note that initializing all the weights of the network to 0 does not help: gradient descent then fails because all the hidden units are symmetric, and no matter how long you run it they keep computing exactly the same function, which is useless.
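A small numpy sketch of the symmetry problem: with all-zero weights every hidden unit computes the same value, while a random (here He-style) initialization breaks the symmetry. The layer sizes are illustrative only:

```python
import numpy as np

n_x, n_h = 4, 3                            # illustrative layer sizes
rng = np.random.default_rng(0)
x = rng.standard_normal((n_x, 1))

# All-zero initialization: every hidden unit computes exactly the same thing.
W1_zero = np.zeros((n_h, n_x))
print(np.maximum(0, W1_zero @ x).ravel())  # identical activations (all 0)

# Random (He-style) initialization breaks the symmetry.
W1_he = rng.standard_normal((n_h, n_x)) * np.sqrt(2.0 / n_x)
print(np.maximum(0, W1_he @ x).ravel())    # activations differ across hidden units
```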
=================================================================
10、Which of the following statements about Adam is False?
Adam can be used with both batch gradient descent and mini-batch gradient descent.
See the reference link.
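A minimal sketch of one Adam update for a single parameter array, with the usual defaults β1 = 0.9 and β2 = 0.999 (the function name and hyperparameter values are illustrative); it can be applied to gradients from either a mini-batch or the full batch:

```python
import numpy as np

def adam_step(W, dW, v, s, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update (t is the 1-based update count, used for bias correction)."""
    v = beta1 * v + (1 - beta1) * dW           # momentum term (first moment)
    s = beta2 * s + (1 - beta2) * (dW ** 2)    # RMSprop term (second moment)
    v_corrected = v / (1 - beta1 ** t)         # bias correction
    s_corrected = s / (1 - beta2 ** t)
    W = W - learning_rate * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return W, v, s
```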