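deeplearning.ai Quiz Notes

The courses can be watched for free on NetEase Cloud Classroom. The quiz questions are recorded below; answers are in bold.

Neural Networks and Deep Learning

Week 1: Introduction to Deep Learning

10 multiple-choice questions; the originals are on GitHub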

  1. What does the analogy “AI is the new electricity” refer to?
    1. Similar to electricity starting about 100 years ago, AI is transforming multiple industries.
    2. Through the “smart grid”, AI is delivering a new wave of electricity.
    3. AI runs on computers and is thus powered by electricity, but it is letting computers do things not possible before.
    4. AI is powering personal devices in our homes and offices, similar to electricity.
  2. Which of these are reasons for Deep Learning recently taking off? (Check the three options that apply.)
    1. Deep learning has resulted in significant improvements in important applications such as online advertising, speech recognition, and image recognition.
    2. We have access to a lot more data.
    3. We have access to a lot more computational power.
    4. Neural Networks are a brand new field.
  3. Recall this diagram of iterating over different ML ideas. Which of the statements below are true? (Check all that apply.)
    [deeplearning.ai Quiz Notes: image 1]
    1. Being able to try out ideas quickly allows deep learning engineers to iterate more quickly.
    2. Faster computation can help speed up how long a team takes to iterate to a good idea.
    3. It is faster to train on a big dataset than a small dataset.
    4. Recent progress in deep learning algorithms has allowed us to train good models faster (even without changing the CPU/GPU hardware).
  4. When an experienced deep learning engineer works on a new problem, they can usually use insight from previous problems to train a good model on the first try, without needing to iterate multiple times through different models. True/False?
    1. True
    2. False
  5. Which one of these plots represents a ReLU activation function?
    1. Figure 1:
      [deeplearning.ai Quiz Notes: image 2]
    2. Figure 2:
      [deeplearning.ai Quiz Notes: image 3]
    3. Figure 3:
      [deeplearning.ai Quiz Notes: image 4]
    4. Figure 4:
      [deeplearning.ai Quiz Notes: image 5]
  6. Images for cat recognition is an example of “structured” data, because it is represented as a structured array in a computer. True/False?
    1. True
    2. False
  7. A demographic dataset with statistics on different cities’ population, GDP per capita, economic growth is an example of “unstructured” data because it contains data coming from different sources. True/False?
    1. True
    2. False
  8. Why is an RNN (Recurrent Neural Network) used for machine translation, say translating English to French? (Check all that apply.)
    1. It can be trained as a supervised learning problem.
    2. It is strictly more powerful than a Convolutional Neural Network (CNN).
    3. It is applicable when the input/output is a sequence (e.g., a sequence of words).
    4. RNNs represent the recurrent process of Idea->Code->Experiment->Idea->….
  9. In this diagram which we hand-drew in lecture, what do the horizontal axis (x-axis) and vertical axis (y-axis) represent?
    [deeplearning.ai Quiz Notes: image 6]
    1. x-axis is the performance of the algorithm; y-axis (vertical axis) is the amount of data.
    2. x-axis is the input to the algorithm; y-axis is outputs.
    3. x-axis is the amount of data; y-axis is the size of the model you train.
    4. x-axis is the amount of data; y-axis (vertical axis) is the performance of the algorithm.
  10. Assuming the trends described in the previous question’s figure are accurate (and hoping you got the axis labels right), which of the following are true? (Check all that apply.)
    1. Decreasing the training set size generally does not hurt an algorithm’s performance, and it may help significantly.
    2. Decreasing the size of a neural network generally does not hurt an algorithm’s performance, and it may help significantly.
    3. Increasing the training set size generally does not hurt an algorithm’s performance, and it may help significantly.
    4. Increasing the size of a neural network generally does not hurt an algorithm’s performance, and it may help significantly.
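
Since the plots for question 5 are not reproduced here, a minimal sketch of the ReLU function itself may help when checking that question. The helper name `relu` and the sample inputs are illustrative choices, not part of the quiz.

```python
import numpy as np


def relu(z):
    """Rectified linear unit: element-wise max(0, z)."""
    return np.maximum(0, z)


# Negative inputs map to 0; non-negative inputs pass through unchanged.
z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
a = relu(z)
print(a)
```

Plotted against `z`, this is flat at zero for negative inputs and a straight line of slope 1 for positive inputs.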

Week 2: Logistic Regression

Neural-Network-Basics

10 multiple-choice questions; the originals are on GitHub

  1. What does a neuron compute?
    1. A neuron computes an activation function followed by a linear function (z = Wx + b)
    2. A neuron computes a linear function (z = Wx + b) followed by an activation function
    3. A neuron computes a function g that scales the input x linearly (Wx + b)
    4. A neuron computes the mean of all features before applying the output to an activation function
  2. Which of these is the “Logistic Loss”?
    1. $\mathcal{L}^{(i)}(\hat{y}^{(i)}, y^{(i)}) = -\left( y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right)$ True
    2. $\mathcal{L}^{(i)}(\hat{y}^{(i)}, y^{(i)}) = |y^{(i)} - \hat{y}^{(i)}|$
    3. $\mathcal{L}^{(i)}(\hat{y}^{(i)}, y^{(i)}) = \max(0, y^{(i)} - \hat{y}^{(i)})$
    4. $\mathcal{L}^{(i)}(\hat{y}^{(i)}, y^{(i)}) = |y^{(i)} - \hat{y}^{(i)}|^2$
  3. Suppose img is a (32,32,3) array, representing a 32x32 image with 3 color channels red, green and blue. How do you reshape this into a column vector?
    1. x = img.reshape((32*32*3,1))
    2. x = img.reshape((32*32,3))
    3. x = img.reshape((1,32*32,*3))
    4. x = img.reshape((3,32*32))
  4. Consider the two following random arrays “a” and “b”:
    python
    a = np.random.randn(2, 3) # a.shape = (2, 3)
    b = np.random.randn(2, 1) # b.shape = (2, 1)
    c = a + b

    What will be the shape of “c”?
    1. c.shape = (2, 1)
    2. c.shape = (3, 2)
    3. c.shape = (2, 3)
    4. The computation cannot happen because the sizes don’t match. It’s going to be “Error”!
  5. Consider the two following random arrays “a” and “b”:
    python
    a = np.random.rand(4, 3) # a.shape = (4, 3)
    b = np.random.rand(3, 2) # b.shape = (3, 2)
    c = a*b

    What will be the shape of “c”?
    1. c.shape = (4, 2)
    2. c.shape = (4, 3)
    3. The computation cannot happen because the sizes don’t match. It’s going to be “Error”!
    4. c.shape = (3, 3)
  6. Suppose you have $n_x$ input features per example. Recall that $X = [x^{(1)} \; x^{(2)} \; \ldots \; x^{(m)}]$. What is the dimension of X?
    1. (m, 1)
    2. ($n_x$, m)
    3. (m, $n_x$)
    4. (1, m)
  7. Recall that “np.dot(a,b)” performs a matrix multiplication on a and b, whereas “a*b” performs an element-wise multiplication.
    Consider the two following random arrays “a” and “b”:
    python
    a = np.random.randn(12288, 150) # a.shape = (12288, 150)
    b = np.random.randn(150, 45) # b.shape = (150, 45)
    c = np.dot(a, b)

    What is the shape of c?
    1. The computation cannot happen because the sizes don’t match. It’s going to be “Error”!
    2. c.shape = (12288, 150)
    3. c.shape = (12288, 45)
    4. c.shape = (150,150)
  8. Consider the following code snippet:

    
    # a.shape = (3, 4)
    # b.shape = (4, 1)
    for i in range(3):
        for j in range(4):
            c[i][j] = a[i][j] + b[j]

    How do you vectorize this?

    1. c = a.T + b
    2. c = a.T + b.T
    3. c = a + b.T
    4. c = a + b
  9. Consider the following code:
    python
    a = np.random.randn(3, 3)
    b = np.random.randn(3, 1)
    c = a * b

    What will be c? (If you’re not sure, feel free to run this in python to find out).
    1. This will invoke broadcasting, so b is copied three times to become (3,3), and ∗ is an element-wise product so c.shape will be (3, 3)
    2. This will invoke broadcasting, so b is copied three times to become (3, 3), and ∗ invokes a matrix multiplication operation of two 3x3 matrices so c.shape will be (3, 3)
    3. This will multiply a 3x3 matrix a with a 3x1 vector, thus resulting in a 3x1 vector. That is, c.shape = (3,1).
    4. It will lead to an error since you cannot use “*” to operate on these two matrices. You need to instead use np.dot(a,b)
  10. Consider the following computation graph.
    [deeplearning.ai Quiz Notes: image 7]
    What is the output J?
    1. J = (c - 1)*(b + a)
    2. J = (a - 1) * (b + c)
    3. J = a*b + b*c + a*c
    4. J = (b - 1) * (c + a)
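
Several of the questions above invite you to run the snippet in Python. The following is a rough sketch of how the shape questions (3, 4, 5, 7, 8 and 9) could be checked with NumPy. The variable names `elementwise_failed`, `d`, `c_loop` and `e` are illustrative, and the loop in question 8 indexes `b[j, 0]` instead of `b[j]` so that a scalar is assigned.

```python
import numpy as np

# Question 3: flatten a (32, 32, 3) image into a column vector.
img = np.zeros((32, 32, 3))
x = img.reshape((32 * 32 * 3, 1))
print(x.shape)  # (3072, 1)

# Question 4: (2, 3) + (2, 1) broadcasts b across the columns of a.
a = np.random.randn(2, 3)
b = np.random.randn(2, 1)
print((a + b).shape)  # (2, 3)

# Question 5: element-wise * on (4, 3) and (3, 2) cannot be broadcast.
try:
    np.random.rand(4, 3) * np.random.rand(3, 2)
    elementwise_failed = False
except ValueError:
    elementwise_failed = True
print(elementwise_failed)  # True

# Question 7: np.dot on (12288, 150) and (150, 45) is a matrix product.
d = np.dot(np.random.randn(12288, 150), np.random.randn(150, 45))
print(d.shape)  # (12288, 45)

# Question 8: the double loop adds b[j] to column j, which equals a + b.T.
a = np.random.randn(3, 4)
b = np.random.randn(4, 1)
c_loop = np.zeros((3, 4))
for i in range(3):
    for j in range(4):
        c_loop[i, j] = a[i, j] + b[j, 0]
print(np.allclose(c_loop, a + b.T))  # True

# Question 9: (3, 3) * (3, 1) broadcasts b and multiplies element-wise.
e = np.random.randn(3, 3) * np.random.randn(3, 1)
print(e.shape)  # (3, 3)
```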

Logistic-Regression-with-a-Neural-Network-mindset

See GitHub for the related datasets and outputs

Week 3: Shallow Neural Networks

Shallow Neural Networks

10 multiple-choice questions; the originals are on GitHub

  1. Which of the following are true? (Check all that apply.)
    1. $a^{[2]}$ denotes the activation vector of the 2nd layer.
    2. $a^{[2](12)}$ denotes the activation vector of the 2nd layer for the 12th training example.
    3. $X$ is a matrix in which each row is one training example.
    4. $a^{[2](12)}$ denotes activation vector of the 12th layer on the 2nd training example.
    5. $X$ is a matrix in which each column is one training example.
    6. $a^{[2]}_4$ is the activation output by the 4th neuron of the 2nd layer
    7. $a^{[2]}_4$ is the activation output of the 2nd layer for the 4th training example
  2. The tanh activation usually works better than sigmoid activation function for hidden units because the mean of its output is closer to zero, and so it centers the data better for the next layer. True/False?
    1. True
    2. False
  3. Which of these is a correct vectorized implementation of forward propagation for layer $l$, where $1 \le l \le L$?
    1. $Z^{[l]} = W^{[l]} A^{[l]} + b^{[l]}$; $A^{[l+1]} = g^{[l+1]}(Z^{[l]})$
    2. $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$; $A^{[l]} = g^{[l]}(Z^{[l]})$ True
    3. $Z^{[l]} = W^{[l]} A^{[l]} + b^{[l]}$; $A^{[l+1]} = g^{[l]}(Z^{[l]})$
    4. $Z^{[l]} = W^{[l-1]} A^{[l]} + b^{[l-1]}$; $A^{[l]} = g^{[l]}(Z^{[l]})$
  4. You are building a binary classifier for recognizing cucumbers (y=1) vs. watermelons (y=0). Which one of these activation functions would you recommend using for the output layer?
    1. ReLU
    2. Leaky ReLU
    3. sigmoid
    4. tanh
  5. Consider the following code:
    python
    A = np.random.randn(4, 3)
    B = np.sum(A, axis = 1, keepdims = True)

    What will be B.shape? (If you’re not sure, feel free to run this in python to find out).
    1. (4, 1)
    2. (4, )
    3. (1, 3)
    4. (, 3)
  6. Suppose you have built a neural network. You decide to initialize the weights and biases to be zero. Which of the following statements is true?
    1. Each neuron in the first hidden layer will perform the same computation. So even after multiple iterations of gradient descent each neuron in the layer will be computing the same thing as other neurons.
    2. Each neuron in the first hidden layer will perform the same computation in the first iteration. But after one iteration of gradient descent they will learn to compute different things because we have “broken symmetry”.
    3. Each neuron in the first hidden layer will compute the same thing, but neurons in different layers will compute different things, thus we have accomplished “symmetry breaking” as described in lecture.
    4. The first hidden layer’s neurons will perform different computations from each other even in the first iteration; their parameters will thus keep evolving in their own way.
  7. Logistic regression’s weights w should be initialized randomly rather than to all zeros, because if you initialize to all zeros, then logistic regression will fail to learn a useful decision boundary because it will fail to “break symmetry”, True/False?
    1. True
    2. False
  8. You have built a network using the tanh activation for all the hidden units. You initialize the weights to relatively large values, using np.random.randn(..,..)*1000. What will happen?
    1. This will cause the inputs of the tanh to also be very large, thus causing gradients to be close to zero. The optimization algorithm will thus become slow.
    2. This will cause the inputs of the tanh to also be very large, causing the units to be “highly activated” and thus speed up learning compared to if the weights had to start from small values.
    3. It doesn’t matter. So long as you initialize the weights randomly gradient descent is not affected by whether the weights are large or small.
    4. This will cause the inputs of the tanh to also be very large, thus causing gradients to also become large. You therefore have to set α to be very small to prevent divergence; this will slow down learning.
  9. Consider the following 1 hidden layer neural network:
    [deeplearning.ai Quiz Notes: image 8]
    Which of the following statements are True? (Check all that apply).
    1. $W^{[1]}$ will have shape (2, 4)
    2. $b^{[1]}$ will have shape (4, 1)
    3. $W^{[1]}$ will have shape (4, 2)
    4. $b^{[1]}$ will have shape (2, 1)
    5. $W^{[2]}$ will have shape (1, 4)
    6. $b^{[2]}$ will have shape (4, 1)
    7. $W^{[2]}$ will have shape (4, 1)
    8. $b^{[2]}$ will have shape (1, 1)
  10. In the same network as the previous question, what are the dimensions of $Z^{[1]}$ and $A^{[1]}$?
    1. $Z^{[1]}$ and $A^{[1]}$ are (1, 4)
    2. $Z^{[1]}$ and $A^{[1]}$ are (4, 2)
    3. $Z^{[1]}$ and $A^{[1]}$ are (4, m)
    4. $Z^{[1]}$ and $A^{[1]}$ are (4, 1)
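
The shape questions in this quiz (5, 8, 9 and 10) can be explored with a short NumPy sketch. Because the network figure is not reproduced here, the layer sizes below (2 inputs, 4 hidden units, 1 output) are an assumption inferred from the answer options, and `m = 5` is an arbitrary number of examples.

```python
import numpy as np

# Question 5: keepdims=True keeps the summed axis with size 1.
A = np.random.randn(4, 3)
B = np.sum(A, axis=1, keepdims=True)
print(B.shape)  # (4, 1)

# Questions 9 and 10 (assumed sizes: 2 inputs, 4 hidden units, 1 output).
n_x, n_h, n_y, m = 2, 4, 1, 5
rng = np.random.default_rng(0)
X = rng.standard_normal((n_x, m))             # each column is one example
W1 = rng.standard_normal((n_h, n_x)) * 0.01   # shape (4, 2)
b1 = np.zeros((n_h, 1))                       # shape (4, 1)
W2 = rng.standard_normal((n_y, n_h)) * 0.01   # shape (1, 4)
b2 = np.zeros((n_y, 1))                       # shape (1, 1)

Z1 = W1 @ X + b1              # b1 is broadcast across the m columns
A1 = np.tanh(Z1)              # tanh hidden activation
Z2 = W2 @ A1 + b2
A2 = 1 / (1 + np.exp(-Z2))    # sigmoid output for binary classification
print(Z1.shape, A1.shape, A2.shape)  # (4, 5) (4, 5) (1, 5)

# Question 8: the tanh derivative 1 - tanh(z)**2 vanishes for large |z|.
saturated_grad = 1 - np.tanh(50.0) ** 2
print(saturated_grad)
```

With `m` examples stacked as columns, `Z1` and `A1` come out as `(n_h, m)`, which is the pattern question 10 asks about.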

Planar data classification with one hidden layer

See GitHub for the related datasets and outputs

Week 4: Deep Neural Networks

Key concepts on Deep Neural Network

  1. What is the “cache” used for in our implementation of forward propagation and backward propagation?
    1. We use it to pass variables computed during forward propagation to the corresponding backward propagation step. It contains useful values for backward propagation to compute derivatives.
    2. We use it to pass variables computed during backward propagation to the corresponding forward propagation step. It contains useful values for forward propagation to compute activations.
    3. It is used to cache the intermediate values of the cost function during training.
    4. It is used to keep track of the hyperparameters that we are searching over, to speed up computation.
  2. Among the following, which ones are “hyperparameters”? (Check all that apply.)
    1. size of the hidden layers $n^{[l]}$
    2. number of layers $L$ in the neural network
    3. learning rate $\alpha$
    4. activation values $a^{[l]}$
    5. number of iterations
    6. weight matrices $W^{[l]}$
    7. bias vectors $b^{[l]}$
  3. Which of the following statements is true?
    1. The deeper layers of a neural network are typically computing more complex features of the input than the earlier layers.
    2. The earlier layers of a neural network are typically computing more complex features of the input than the deeper layers.
  4. Vectorization allows you to compute forward propagation in an $L$-layer neural network without an explicit for-loop (or any other explicit iterative loop) over the layers l=1, 2, …,L. True/False?
    1. True
    2. False
  5. Assume we store the values for $n^{[l]}$ in an array called layers, as follows: layer_dims = [$n_x$, 4,3,2,1]. So layer 1 has four hidden units, layer 2 has 3 hidden units and so on. Which of the following for-loops will allow you to initialize the parameters for the model?
    1. Code1
      python
      for (i in range(1, len(layer_dims)/2)):
          parameter['w' + str(i)] = np.random.randn(layers[i], layers[i-1]) * 0.01
          parameter['b' + str(i)] = np.random.randn(layers[i], 1) * 0.01
    2. Code2
      python
      for (i in range(1, len(layer_dims)/2)):
          parameter['w' + str(i)] = np.random.randn(layers[i], layers[i-1]) * 0.01
          parameter['b' + str(i)] = np.random.randn(layers[i-1], 1) * 0.01
    3. Code3
      python
      for (i in range(1, len(layer_dims))):
          parameter['w' + str(i)] = np.random.randn(layers[i-1], layers[i]) * 0.01
          parameter['b' + str(i)] = np.random.randn(layers[i], 1) * 0.01
    4. Code4
      python
      for (i in range(1, len(layer_dims))):
          parameter['w' + str(i)] = np.random.randn(layers[i], layers[i-1]) * 0.01
          parameter['b' + str(i)] = np.random.randn(layers[i], 1) * 0.01
  6. Consider the following neural network.
    [Image unavailable: upload failed in the source]
    How many layers does this network have?
    1. The number of layers $L$ is 4. The number of hidden layers is 3.
    2. The number of layers $L$ is 3. The number of hidden layers is 3.
    3. The number of layers $L$ is 4. The number of hidden layers is 4.
    4. The number of layers $L$ is 5. The number of hidden layers is 4.
  7. During forward propagation, in the forward function for a layer $l$ you need to know what is the activation function in a layer (Sigmoid, tanh, ReLU, etc.). During backpropagation, the corresponding backward function also needs to know what is the activation function for layer $l$, since the gradient depends on it. True/False?
    1. True
    2. False
  8. There are certain functions with the following properties:
    (i) To compute the function using a shallow network circuit, you will need a large network (where we measure size by the number of logic gates in the network), but (ii) To compute it using a deep network circuit, you need only an exponentially smaller network. True/False?
    1. True
    2. False
  9. Consider the following 2 hidden layer neural network:
    [deeplearning.ai Quiz Notes: image 9]
    Which of the following statements are True? (Check all that apply).
    1. W[1] will have shape (4, 4)
    2. b[1] will have shape (4, 1)
    3. W[1] will have shape (3, 4)
    4. b[1] will have shape (3, 1)
    5. W[2] will have shape (3, 4)
    6. b[2] will have shape (1, 1)
    7. W[2] will have shape (3, 1)
    8. b[2] will have shape (3, 1)
    9. W[3] will have shape (3, 1)
    10. b[3] will have shape (1, 1)
    11. W[3] will have shape (1, 3)
    12. b[3] will have shape (3, 1)
  10. Whereas the previous question used a specific network, in the general case what is the dimension of $W^{[l]}$, the weight matrix associated with layer $l$?
    1. W[l] has shape (n[l],n[l−1])
    2. W[l] has shape (n[l+1],n[l])
    3. W[l] has shape (n[l],n[l+1])
    4. W[l] has shape (n[l−1],n[l])
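
The initialization loop from question 5 and the general shape rule from question 10 can be tried out with a runnable sketch like the one below. It follows the pattern of the Code4 option but uses a single list name, a plain `for` loop, and a seeded generator. The function name `initialize_parameters` and the input size `n_x = 5` are hypothetical choices for illustration.

```python
import numpy as np


def initialize_parameters(layer_dims, seed=0):
    """Create small random W and b for every layer l = 1 .. L.

    W[l] has shape (n[l], n[l-1]) and b[l] has shape (n[l], 1).
    """
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters['W' + str(l)] = rng.standard_normal(
            (layer_dims[l], layer_dims[l - 1])) * 0.01
        parameters['b' + str(l)] = rng.standard_normal(
            (layer_dims[l], 1)) * 0.01
    return parameters


# n_x = 5 is a hypothetical input size; the rest matches [n_x, 4, 3, 2, 1].
layer_dims = [5, 4, 3, 2, 1]
params = initialize_parameters(layer_dims)
for name, value in params.items():
    print(name, value.shape)
```

The loop starts at 1 because index 0 of `layer_dims` holds the input size, which has no weights of its own.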

Building your Deep Neural Network - Step by Step

ipynb

Deep Neural Network - Application

ipynb

Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization

Link

Week 1: Practical Aspects of Deep Learning

Practical aspects of deep learning

  1. If you have 10,000,000 examples, how would you split the train/dev/test set?
    1. 60% train, 20% dev, 20% test
    2. 98% train, 1% dev, 1% test
    3. 33% train, 33% dev, 33% test
  2. The dev and test set should:
    1. Come from the same distribution
    2. Come from different distributions
    3. Be identical to each other (same (x,y) pairs)
    4. Have the same number of examples
  3. If your Neural Network model seems to have high variance, which of the following would be promising things to try?
    1. Add regularization
    2. Make the Neural Network deeper
    3. Increase the number of units in each hidden layer
    4. Get more test data
    5. Get more training data
  4. You are working on an automated check-out kiosk for a supermarket, and are building a classifier for apples, bananas, and oranges. Suppose your classifier obtains a training set error of 0.5%, and a dev set error of 7%. Which of the following are promising things to try to improve your classifier? (Check all that apply.)
    1. Increase the regularization parameter lambda
    2. Decrease the regularization parameter lambda
    3. Get more training data
    4. Use a bigger neural network
  5. What is weight decay?
    1. Gradual corruption of the weights in the neural network if it is trained on noisy data.
    2. A technique to avoid vanishing gradient by imposing a ceiling on the values of the weights.
    3. A regularization technique (such as L2 regularization) that results in gradient descent shrinking the weights on every iteration.
    4. The process of gradually decreasing the learning rate during training.
  6. What happens when you increase the regularization hyperparameter lambda?
    1. Weights are pushed toward becoming smaller (closer to 0)
    2. Weights are pushed toward becoming bigger (further from 0)
    3. Doubling lambda should roughly result in doubling the weights
    4. Gradient descent taking bigger steps with each iteration (proportional to lambda)
  7. With the inverted dropout technique, at test time:
    1. You apply dropout (randomly eliminating units) and do not keep the 1/keep_prob factor in the calculations used in training
    2. You do not apply dropout (do not randomly eliminate units), but keep the 1/keep_prob factor in the calculations used in training.
    3. You do not apply dropout (do not randomly eliminate units) and do not keep the 1/keep_prob factor in the calculations used in training
    4. You apply dropout (randomly eliminating units) but keep the 1/keep_prob factor in the calculations used in training.
  8. Increasing the parameter keep_prob from (say) 0.5 to 0.6 will likely cause the following: (Check the two that apply)
    1. Increasing the regularization effect
    2. Reducing the regularization effect
    3. Causing the neural network to end up with a higher training set error
    4. Causing the neural network to end up with a lower training set error
  9. Which of these techniques are useful for reducing variance (reducing overfitting)? (Check all that apply.)
    1. Data augmentation
    2. Gradient Checking
    3. Exploding gradient
    4. L2 regularization
    5. Dropout
    6. Vanishing gradient
    7. Xavier initialization
  10. Why do we normalize the inputs x?
    1. It makes it easier to visualize the data
    2. It makes the parameter initialization faster
    3. Normalization is another word for regularization–It helps to reduce variance
    4. It makes the cost function faster to optimize
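
Questions 7 and 8 concern inverted dropout. Below is a minimal sketch of the idea, assuming a hypothetical helper `dropout_forward` that takes the activations, `keep_prob`, and a `training` flag. It is an illustration of the technique, not the course's assignment code.

```python
import numpy as np


def dropout_forward(a, keep_prob, training, rng):
    """Inverted dropout on activations `a`.

    Training: zero each unit with probability 1 - keep_prob, then divide
    by keep_prob so the expected activation is unchanged.
    Test time: return the activations untouched (no mask, no scaling).
    """
    if not training:
        return a
    mask = rng.random(a.shape) < keep_prob
    return a * mask / keep_prob


rng = np.random.default_rng(0)
a = np.ones((1000, 100))
a_train = dropout_forward(a, keep_prob=0.8, training=True, rng=rng)
a_test = dropout_forward(a, keep_prob=0.8, training=False, rng=rng)

print(a_train.mean())  # close to 1.0 thanks to the 1/keep_prob scaling
print(a_test.mean())   # 1.0
```

Raising `keep_prob` toward 1 drops fewer units, so the regularization effect gets weaker.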

Regularization

ipynb

Initialization

ipynb

Gradient_Checking

ipynb

Week 2: Optimization Algorithms

Optimization algorithms

  1. Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th minibatch?
    1. a[8]{3}(7)
    2. a[3]{7}(8)
    3. a[8]{7}(3)
    4. a[3]{8}(7)
  2. Which of these statements about mini-batch gradient descent do you agree with?
    1. Training one epoch (one pass through the training set) using minibatch gradient descent is faster than training one epoch using batch gradient descent.
    2. One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent.
    3. You should implement mini-batch gradient descent without an explicit for-loop over different mini-batches, so that the algorithm processes all mini-batches at the same time(vectorization).
  3. Why is the best mini-batch size usually not 1 and not m, but instead something in-between?
    1. If the mini-batch size is m, you end up with batch gradient descent, which has to process the whole training set before making progress.
    2. If the mini-batch size is m, you end up with stochastic gradient descent, which is usually slower than mini-batch gradient descent.
    3. If the mini-batch size is 1, you lose the benefits of vectorization across examples in the mini-batch.
    4. If the mini-batch size is 1, you end up having to process the entire training set before making any progress.
  4. Suppose your learning algorithm’s cost $J$, plotted as a function of the number of iterations, looks like this:
    [deeplearning.ai Quiz Notes: image 10]
    Which of the following do you agree with?
    1. Whether you’re using batch gradient descent or mini-batch gradient descent, this looks acceptable.
    2. If you’re using mini-batch gradient descent, this looks acceptable. But if you’re using batch gradient descent, something is wrong.
    3. Whether you’re using batch gradient descent or mini-batch gradient descent, something is wrong.
    4. If you’re using mini-batch gradient descent, something is wrong. But if you’re using batch gradient descent, this looks acceptable.
  5. Suppose the temperature in Casablanca over the first three days of January are the same:
    Jan 1st: $\theta_1 = 10^{\circ}C$
    Jan 2nd: $\theta_2 = 10^{\circ}C$
    (We used Fahrenheit in lecture, so will use Celsius here in honor of the metric world.)
    Say you use an exponentially weighted average with $\beta = 0.5$ to track the temperature: $v_0 = 0$, $v_t = \beta v_{t-1} + (1 - \beta)\theta_t$. If $v_2$ is the value computed after day 2 without bias correction, and $v_2^{corrected}$ is the value you compute with bias correction, what are these values? (You might be able to do this without a calculator, but you don’t actually need one. Remember what bias correction is doing.)
    1. $v_2 = 7.5$, $v_2^{corrected} = 7.5$
    2. $v_2 = 10$, $v_2^{corrected} = 7.5$
    3. $v_2 = 10$, $v_2^{corrected} = 10$
    4. $v_2 = 7.5$, $v_2^{corrected} = 10$ True
  6. Which of these is NOT a good learning rate decay scheme? Here, t is the epoch number.
    1. $\alpha = \frac{1}{1 + 2t} \alpha_0$
    2. $\alpha = e^{t} \alpha_0$ True
    3. $\alpha = 0.95^{t} \alpha_0$
    4. $\alpha = \frac{1}{\sqrt{t}} \alpha_0$
  7. You use an exponentially weighted average on the London temperature dataset. You use the following to track the temperature: $v_t = \beta v_{t-1} + (1 - \beta)\theta_t$. The red line below was computed using $\beta = 0.9$. What would happen to your red curve as you vary $\beta$? (Check the two that apply)
    [deeplearning.ai Quiz Notes: image 11]
    1. Decreasing $\beta$ will shift the red line slightly to the right.
    2. Increasing $\beta$ will shift the red line slightly to the right.
    3. Decreasing $\beta$ will create more oscillation within the red line.
    4. Increasing $\beta$ will create more oscillations within the red line.
  8. Consider this figure:
    [deeplearning.ai Quiz Notes: image 12]
    These plots were generated with gradient descent; with gradient descent with momentum ($\beta$ = 0.5) and gradient descent with momentum ($\beta$ = 0.9). Which curve corresponds to which algorithm?
    1. (1) is gradient descent with momentum (small $\beta$). (2) is gradient descent. (3) is gradient descent with momentum (large $\beta$)
    2. (1) is gradient descent. (2) is gradient descent with momentum (large $\beta$). (3) is gradient descent with momentum (small $\beta$)
    3. (1) is gradient descent. (2) is gradient descent with momentum (small $\beta$). (3) is gradient descent with momentum (large $\beta$)
    4. (1) is gradient descent with momentum (small $\beta$), (2) is gradient descent with momentum (small $\beta$), (3) is gradient descent
  9. Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function $J(W^{[1]}, b^{[1]}, \ldots, W^{[L]}, b^{[L]})$. Which of the following techniques could help find parameter values that attain a small value for $J$? (Check all that apply)
    1. Try initializing all the weights to zero
    2. Try tuning the learning rate $\alpha$
    3. Try better random initialization for the weights
    4. Try using Adam
    5. Try mini-batch gradient descent
  10. Which of the following statements about Adam is False?
    1. Adam should be used with batch gradient computations, not with mini-batches.
    2. We usually use “default” values for the hyperparameters $\beta_1$, $\beta_2$ and $\varepsilon$ in Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$)
    3. The learning rate hyperparameter $\alpha$ in Adam usually needs to be tuned.
    4. Adam combines the advantages of RMSProp and momentum
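
The arithmetic in question 5 can be verified with a few lines of Python. This sketch applies the recurrence from the question and the usual bias correction $v_t / (1 - \beta^t)$. The function name `ewa` is an illustrative choice.

```python
def ewa(thetas, beta):
    """Exponentially weighted average with and without bias correction.

    Returns a list of (v_t, v_t_corrected) pairs, starting from v_0 = 0.
    """
    v = 0.0
    results = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        results.append((v, v / (1 - beta ** t)))
    return results


# Two days at 10 degrees Celsius with beta = 0.5, as in question 5.
history = ewa([10.0, 10.0], beta=0.5)
v2, v2_corrected = history[-1]
print(v2, v2_corrected)  # 7.5 10.0
```

Step by step: $v_1 = 0.5 \cdot 0 + 0.5 \cdot 10 = 5$, $v_2 = 0.5 \cdot 5 + 0.5 \cdot 10 = 7.5$, and the corrected value is $7.5 / (1 - 0.5^2) = 10$.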

Week 3: Hyperparameter Tuning, Batch Normalization and Programming Frameworks

Hyperparameter tuning, Batch Normalization, Programming Frameworks

  1. If searching among a large number of hyperparameters, you should try values in a grid rather than random values, so that you can carry out the search more systematically and not rely on chance. True or False?
    1. True
    2. False
  2. Every hyperparameter, if set poorly, can have a huge negative impact on training, and so all hyperparameters are about equally important to tune well. True or False?
    1. True
    2. False
  3. During hyperparameter search, whether you try to babysit one model (“Panda” strategy) or train a lot of models in parallel (“Caviar”) is largely determined by:
    1. Whether you use batch or mini-batch optimization
    2. The presence of local minima (and saddle points) in your neural network
    3. The amount of computational power you can access
    4. The number of hyperparameters you have to tune
  4. If you think $\beta$ (hyperparameter for momentum) is between 0.9 and 0.99, which of the following is the recommended way to sample a value for beta?
    1. Code1
      python
      r = np.random.rand()
      beta = r*0.09 + 0.9
    2. Code2
      python
      r = np.random.rand()
      beta = 1-10**(- r - 1)
    3. Code3
      python
      r = np.random.rand()
      beta = 1-10**(- r + 1)
    4. Code4
      python
      r = np.random.rand()
      beta = r*0.9 + 0.09
  5. Finding good hyperparameter values is very time-consuming. So typically you should do it once at the start of the project, and try to find very good hyperparameters so that you don’t ever have to revisit tuning them again. True or false?
    1. True
    2. False
  6. In batch normalization as presented in the videos, if you apply it on the $l$th layer of your neural network, what are you normalizing?
    1. $z^{[l]}$ True
    2. $W^{[l]}$
    3. $a^{[l]}$
    4. $b^{[l]}$
  7. In the normalization formula $z_{norm}^{(i)} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}$, why do we use epsilon?
    1. To have a more accurate normalization
    2. To avoid division by zero
    3. In case μ is too small
    4. To speed up convergence
  8. Which of the following statements about $\gamma$ and $\beta$ in Batch Norm are true?
    1. They can be learned using Adam, Gradient descent with momentum, or RMSprop, not just with gradient descent.
    2. $\beta$ and $\gamma$ are hyperparameters of the algorithm, which we tune via random sampling.
    3. They set the mean and variance of the linear variable $z^{[l]}$ of a given layer.
    4. There is one global value of $\gamma \in \mathbb{R}$ and one global value of $\beta \in \mathbb{R}$ for each layer, and applies to all the hidden units in that layer.
    5. The optimal values are $\gamma = \sqrt{\sigma^2 + \varepsilon}$, and $\beta = \mu$
  9. After training a neural network with Batch Norm, at test time, to evaluate the neural network on a new example you should:
    1. Use the most recent mini-batch’s value of $\mu$ and $\sigma^2$ to perform the needed normalizations
    2. If you implemented Batch Norm on mini-batches of (say) 256 examples, then to evaluate on one test example, duplicate that example 256 times so that you’re working with a mini-batch the same size as during training.
    3. Perform the needed normalizations, use $\mu$ and $\sigma^2$ estimated using an exponentially weighted average across mini-batches seen during training.
    4. Skip the step where you normalize using $\mu$ and $\sigma^2$ since a single test example cannot be normalized.
  10. Which of these statements about deep learning programming frameworks are true? (Check all that apply)
    1. Even if a project is currently open source, good governance of the project helps ensure that it remains open even in the long term, rather than become closed or modified to benefit only one company.
    2. A programming framework allows you to code up deep learning algorithms with typically fewer lines of code than a lower-level language such as Python.
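
To see what the sampling expression `beta = 1 - 10**(-r - 1)` from question 4 produces, here is a small sketch that draws many values at once. The sample size of 10,000 and the seeded generator are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.random(10000)        # uniform samples in [0, 1)

# -r - 1 lies in (-2, -1], so 1 - beta = 10**(-r - 1) lies in (0.01, 0.1],
# which spreads 1 - beta uniformly on a log scale.
beta = 1 - 10 ** (-r - 1)

print(beta.min(), beta.max())

# Fraction of samples below 0.95; a uniform draw on [0.9, 0.99] would
# put about 56% of samples here, the log-scale draw puts about 30%.
frac_below = np.mean(beta < 0.95)
print(frac_below)
```

Sampling this way spends more of the search budget near 0.99, where small changes in `beta` change the effective averaging window the most.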

TensorFlow Tutorial

ipynb

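Structuring Machine Learning Projects

Machine Learning (ML) Strategy 1

Machine Learning (ML) Strategy 2

Convolutional Neural Networks

Week 1: Convolutional Neural Networks

Week 2: Deep Convolutional Neural Networks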
