mikelkl

deeplearning.ai course notes: key points

Neural Networks and Deep Learning

Introduction to deep learning

Neural Networks Basics

Logistic Regression as a Neural Network

Computation graph

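A neural network's computation consists of a forward propagation step, which computes the network's output, followed by a back propagation step, which computes the gradients (derivatives). The computation graph explains why the computation is organized this way.

A computation graph is a convenient way to lay out the layer-by-layer computation of a neural network.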
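As a small illustration, here is a sketch of a forward and backward pass through a tiny graph. The function J = 3(a + bc) and all variable names are assumptions chosen for this example, not something taken from the notes above.

```python
# Forward pass through a tiny computation graph: J = 3 * (a + b * c)
a, b, c = 5.0, 3.0, 2.0
u = b * c          # first node
v = a + u          # second node
J = 3 * v          # output node

# Backward pass: apply the chain rule from the output back to the inputs
dJ_dv = 3.0              # because J = 3v
dJ_da = dJ_dv * 1.0      # because v = a + u
dJ_du = dJ_dv * 1.0
dJ_db = dJ_du * c        # because u = b * c
dJ_dc = dJ_du * b

print(J, dJ_da, dJ_db, dJ_dc)  # 33.0 3.0 6.0 9.0
```

Each backward quantity reuses a value already computed further to the right in the graph, which is why the derivatives are computed right to left.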

Shallow neural networks

Learn to build a neural network with one hidden layer, using forward propagation and backpropagation.

Learning Objectives

Understand hidden units and hidden layers
Be able to apply a variety of activation functions in a neural network.
Build your first forward and backward propagation with a hidden layer
Apply random initialization to your neural network
Become fluent with Deep Learning notations and Neural Network Representations
Build and train a neural network with one hidden layer.

Activation functions

Pros and cons of activation functions
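One drawback of both the sigmoid and tanh functions is that when z is very large or very small, the derivative (slope) of the function becomes very small. The slope approaches zero at both extremes, which slows down gradient descent.

ReLU is the most widely used activation function today, although tanh is still used sometimes. One drawback of ReLU is that its derivative is 0 when z is negative, but in practice this is not a problem. The advantage shared by ReLU and Leaky ReLU is that, over much of the range of z, the derivative (slope) of the activation function stays far from 0. In practice a network using ReLU therefore usually learns faster than one using tanh or sigmoid, mainly because the slope approaches 0, the effect that slows learning, less often. ReLU's slope is 0 for half the range of z, but in practice most hidden units have z greater than 0, so learning can still be fast.

Pros and cons of the different activation functions:

Do not use the sigmoid activation function, except in the output layer when you are solving a binary classification problem.
The tanh function is much better than sigmoid.
The default, most commonly used activation function is ReLU.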
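The saturation argument can be checked numerically. The sketch below evaluates the derivative of each activation at a few values of z; the function names and the Leaky ReLU slope of 0.01 are assumptions made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2

def d_relu(z):
    return 1.0 if z > 0 else 0.0

def d_leaky_relu(z, slope=0.01):
    # slope=0.01 is an assumed value for the negative side
    return 1.0 if z > 0 else slope

# For large |z| the sigmoid and tanh slopes collapse towards 0,
# while the ReLU slope stays at 1 for every positive z.
for z in (-10.0, 0.5, 10.0):
    print(z, d_sigmoid(z), d_tanh(z), d_relu(z), d_leaky_relu(z))
```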

Why do you need non-linear activation functions?

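If you use a linear activation function (also called the identity activation function), the output of the neural network is just a linear function of the input.

If you use linear activation functions, or equivalently no activation function, then no matter how many layers the network has, all it does is compute a linear function, so you might as well remove all the hidden layers. A linear hidden layer is useless because the composition of two linear functions is itself a linear function. Unless you introduce a non-linearity, the network cannot compute anything more interesting, no matter how many hidden layers it has. The one place where a linear activation g(z) = z is used is when solving a regression problem with machine learning.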
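The collapse of stacked linear layers can be verified directly. This is a minimal NumPy sketch; the layer sizes and the random values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))          # one input with 3 features

# Two "layers" that use the identity activation g(z) = z
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))
two_layer_output = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer_output = W_combined @ x + b_combined

print(np.allclose(two_layer_output, one_layer_output))  # True
```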

Deep Neural Networks

Improving Deep Neural Networks

About this Course
This course will teach you the “magic” of getting deep learning to work well. Rather than the deep learning process being a black box, you will understand what drives performance, and be able to more systematically get good results. You will also learn TensorFlow.

After 3 weeks, you will:

Understand industry best-practices for building deep learning applications.
Be able to effectively use the common neural network “tricks”, including initialization, L2 and dropout regularization, Batch normalization, and gradient checking.
Be able to implement and apply a variety of optimization algorithms, such as mini-batch gradient descent, Momentum, RMSprop and Adam, and check for their convergence.
Understand new best-practices for the deep learning era of how to set up train/dev/test sets and analyze bias/variance
Be able to implement a neural network in TensorFlow.

This is the second course of the Deep Learning Specialization.

Practical aspects of Deep Learning

Learning Objectives
Recall that different types of initializations lead to different results
Recognize the importance of initialization in complex neural networks.
Recognize the difference between train/dev/test sets
Diagnose the bias and variance issues in your model
Learn when and how to use regularization methods such as dropout or L2 regularization.
Understand experimental issues in deep learning such as Vanishing or Exploding gradients and learn how to deal with them
Use gradient checking to verify the correctness of your backpropagation implementation

Regularizing your neural network

What we want you to remember from this module:
- Regularization will help you reduce overfitting.
- Regularization will drive your weights to lower values.
- L2 regularization and Dropout are two very effective regularization techniques.

Regularization

Deep Learning models have so much flexibility and capacity that overfitting can be a serious problem, if the training dataset is not big enough. Sure it does well on the training set, but the learned network doesn’t generalize to new examples that it has never seen!

The standard way to avoid overfitting is called L2 regularization. It consists of appropriately modifying your cost function, from:

$$J = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log\left(a^{[L](i)}\right) + (1 - y^{(i)}) \log\left(1 - a^{[L](i)}\right) \right) \tag{1}$$

To:

$$J_{regularized} = \underbrace{-\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log\left(a^{[L](i)}\right) + (1 - y^{(i)}) \log\left(1 - a^{[L](i)}\right) \right)}_{\text{cross-entropy cost}} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum_{l} \sum_{k} \sum_{j} W_{k,j}^{[l]2}}_{\text{L2 regularization cost}} \tag{2}$$

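Why regularize only the parameter w? Why not add a term for b as well? You can, but it is usually omitted. w is typically a very high-dimensional parameter vector, especially when there is a high-variance problem: w may have so many parameters that the model cannot fit them all well, whereas b is a single number. Almost all the parameters are in w rather than b, so adding the b term makes little difference in practice, because b is just one parameter among a very large number. People usually do not bother to include it, though you can if you want.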
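The L2-regularized cost, with the penalty applied to the weight matrices only and not to the biases, can be sketched as follows. The function names, the `lambd` parameter name and the toy values are assumptions for illustration.

```python
import numpy as np

def cross_entropy_cost(AL, Y):
    # AL: predicted probabilities, Y: 0/1 labels, both of shape (1, m)
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m)

def l2_regularized_cost(AL, Y, weights, lambd):
    # weights: list of the W matrices of every layer; biases are left out
    m = Y.shape[1]
    l2_term = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost(AL, Y) + float(l2_term)

AL = np.array([[0.8, 0.1, 0.6]])
Y = np.array([[1, 0, 1]])
weights = [np.array([[1.0, -2.0], [0.5, 0.0]]), np.array([[3.0, -1.0]])]
print(cross_entropy_cost(AL, Y), l2_regularized_cost(AL, Y, weights, lambd=0.7))
```

Setting `lambd` to 0 recovers the plain cross-entropy cost.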

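L1 regularization vs. L2 regularization

L1 regularization does not use the L2 norm (the Euclidean norm, commonly used to measure the length of a vector: the square root of the sum of the squared elements). Instead, it adds λ/m times the sum of the absolute values of the elements of w, which is called the L1 norm of the parameter vector w and is written with a subscript 1. Whether you put m or 2m in the denominator, it is just a scaling constant. With L1 regularization, w ends up sparse, meaning the w vector contains many zeros. Some people argue this helps compress the model, because with some parameters equal to 0 less memory is needed to store it. In practice, however, making the model sparse with L1 regularization helps only a little, so at least for the goal of compressing the model it is not very useful.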

Why regularization reduces overfitting?

L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the square values of the weights in the cost function you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes.

Dropout Regularization

Dropout is a widely used regularization technique that is specific to deep learning.
It randomly shuts down some neurons in each iteration.

Figure 2 : Drop-out on the second hidden layer.
At each iteration, you shut down (= set to zero) each neuron of a layer with probability 1 − keep_prob, or keep it with probability keep_prob (50% here). The dropped neurons don’t contribute to the training in either the forward or the backward propagation of the iteration.

Figure 3 : Drop-out on the first and third hidden layers.

1st layer: we shut down on average 40% of the neurons.

3rd layer: we shut down on average 20% of the neurons.

When you shut some neurons down, you actually modify your model. The idea behind drop-out is that at each iteration, you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time.

What you should remember about dropout:
- Dropout is a regularization technique.
- You only use dropout during training. Don’t use dropout (randomly eliminate nodes) during test time.
- Apply dropout both during forward and backward propagation.
- During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.
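The points above can be sketched as an inverted-dropout step in NumPy. The function names and the way the mask `D` is passed to the backward step are assumptions for illustration, not the course's exact code.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    # Keep each unit with probability keep_prob, then rescale the survivors
    D = rng.random(A.shape) < keep_prob   # True = keep, False = shut down
    A_dropped = A * D / keep_prob         # dividing keeps the expected value
    return A_dropped, D

def dropout_backward(dA, D, keep_prob):
    # Apply the same mask and the same scaling to the gradients
    return dA * D / keep_prob

rng = np.random.default_rng(1)
A = np.ones((5, 10000))
A_dropped, D = dropout_forward(A, keep_prob=0.8, rng=rng)
print(D.mean(), A_dropped.mean())   # roughly 0.8 and roughly 1.0
```

At test time neither function is called; the activations are used as they are.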

Optimization algorithms

Learning Objectives
Remember different optimization methods such as (Stochastic) Gradient Descent, Momentum, RMSProp and Adam
Use random minibatches to accelerate the convergence and improve the optimization
Know the benefits of learning rate decay and apply it to your optimization

Mini-batch gradient descent

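Applying machine learning is a highly empirical, highly iterative process. You have to train many models to find one that really works well, so being able to train models quickly is a real advantage. What makes it harder is that deep learning tends to be used in the regime of big data: we can train neural networks on huge datasets, but training on large datasets is slow. So fast, good optimization algorithms can greatly improve the efficiency of you and your team.

As we saw earlier, vectorization lets you compute efficiently over all m examples and process the whole training set without an explicit for loop. But if m is very large, this can still be slow. For example, if m is 5 million or 50 million or more, then with gradient descent on the whole training set you have to process the entire training set before taking one small step of gradient descent, and then process all 5 million examples again before taking another small step. The algorithm gets faster if you let gradient descent start making progress before it has finished processing the entire 5-million-example training set. Concretely, you split the training set into smaller training sets called mini-batches.

An epoch is one pass through the training set. In batch gradient descent, one pass through the training set yields only one gradient descent step, whereas in mini-batch gradient descent one pass (one epoch) yields 5,000 gradient descent steps (for example, 5 million examples split into mini-batches of 1,000).

When you have a large training set, mini-batch gradient descent runs much faster than batch gradient descent. It is what almost everyone doing deep learning uses when training on a large dataset.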
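Splitting a training set into mini-batches can be sketched as below. The function name, the column-per-example layout and the batch size are assumptions for this example.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size, rng):
    # X: (n_features, m), Y: (1, m); examples are stored as columns
    m = X.shape[1]
    perm = rng.permutation(m)                  # shuffle X and Y the same way
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
            for k in range(0, m, batch_size)]  # last batch may be smaller

rng = np.random.default_rng(0)
X = np.arange(20).reshape(2, 10)   # 10 toy examples
Y = np.arange(10).reshape(1, 10)
batches = random_mini_batches(X, Y, batch_size=4, rng=rng)
print([xb.shape[1] for xb, _ in batches])  # [4, 4, 2]
```

One epoch then consists of one gradient step per element of `batches`.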

Understanding mini-batch gradient descent

A variant is Stochastic Gradient Descent (SGD), which is equivalent to mini-batch gradient descent where each mini-batch has just 1 example. In Stochastic Gradient Descent, you use only 1 training example before updating the parameters. When the training set is large, SGD can be faster. But the parameters will “oscillate” toward the minimum rather than converge smoothly. Here is an illustration of this:

In practice, you’ll often get faster results if you use neither the whole training set nor only one training example to perform each update. Mini-batch gradient descent uses an intermediate number of examples for each step. With mini-batch gradient descent, you loop over the mini-batches instead of looping over individual training examples.

The difference between gradient descent, mini-batch gradient descent and stochastic gradient descent is the number of examples you use to perform one update step.
You have to tune a learning rate hyperparameter α .
With a well-tuned mini-batch size, it usually outperforms both gradient descent and stochastic gradient descent (particularly when the training set is large).

Exponentially weighted averages

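A few points to note. When β is large, the curve you get is smoother, because you are averaging over more days of temperature, so the curve fluctuates less. On the other hand, the curve shifts to the right, because you are averaging over a larger window. By averaging over a larger window, the exponentially weighted average formula adapts more slowly when the temperature changes, which introduces some latency. The reason is that when β = 0.98, the previous value gets a lot of weight and the current value gets very little, only 0.02. So when the temperature goes up or down, the exponentially weighted average adapts more slowly when β is large.

Now try the other extreme, for example β = 0.5. By the formula, this amounts to averaging over only about two days; plotted, it gives the yellow line. Because it averages only two days of temperature, that is, over a very small window, the result is noisier and more susceptible to outliers, but it adapts to temperature changes much more quickly. This formula implements an exponentially weighted average. In statistics it is called an exponentially weighted moving average; we call it an exponentially weighted average for short.

Understanding exponentially weighted averages

One of the advantages of this exponentially weighted average formula is that it takes very little memory. You need to keep just one real number in computer memory, and you keep overwriting it with this formula based on the latest value you got. It is really this efficiency that matters: it takes up basically one line of code, and storage for a single real number, to compute this exponentially weighted average.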
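The single-number update can be sketched in a few lines. The function name and the toy temperature values are assumptions; the update itself is v = βv + (1 − β)θ applied once per new value θ.

```python
def exponentially_weighted_average(values, beta):
    v = 0.0                      # the only state that has to be stored
    averages = []
    for theta in values:
        v = beta * v + (1 - beta) * theta   # overwrite v with the new average
        averages.append(v)
    return averages

temps = [10.0, 12.0, 11.0, 15.0, 14.0]
print(exponentially_weighted_average(temps, beta=0.5))
# [5.0, 8.5, 9.75, 12.375, 13.1875]
```

The first few outputs are biased towards 0 because v starts at 0; a bias-correction factor can compensate for this.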

Gradient descent with momentum

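The main idea of momentum, or gradient descent with momentum, is to compute an exponentially weighted average of the gradients and then use that average to update the weights.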

Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of the update has some variance, and so the path taken by mini-batch gradient descent will “oscillate” toward convergence. Using momentum can reduce these oscillations.

Momentum takes into account the past gradients to smooth out the update. We will store the ‘direction’ of the previous gradients in the variable v . Formally, this will be the exponentially weighted average of the gradient on previous steps. You can also think of v as the “velocity” of a ball rolling downhill, building up speed (and momentum) according to the direction of the gradient/slope of the hill.

The momentum update rule is, for l=1,...,L :

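$$\begin{cases} v_{dW^{[l]}} = \beta v_{dW^{[l]}} + (1 - \beta)\, dW^{[l]} \\ W^{[l]} = W^{[l]} - \alpha v_{dW^{[l]}} \end{cases}$$

$$\begin{cases} v_{db^{[l]}} = \beta v_{db^{[l]}} + (1 - \beta)\, db^{[l]} \\ b^{[l]} = b^{[l]} - \alpha v_{db^{[l]}} \end{cases}$$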

where L is the number of layers, β is the momentum and α is the learning rate.

How do you choose β ?

The larger the momentum β is, the smoother the update because the more we take the past gradients into account. But if β is too big, it could also smooth out the updates too much.
Common values for β range from 0.8 to 0.999. If you don’t feel inclined to tune this, β=0.9 is often a reasonable default.
Tuning the optimal β for your model might require trying several values to see what works best in terms of reducing the value of the cost function J .
Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
You have to tune a momentum hyperparameter β and a learning rate α .
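A minimal sketch of a momentum step follows, assuming the parameters and gradients are stored in dictionaries keyed by names such as "W1" and "b1" (the dictionary layout and function names are assumptions for this example).

```python
import numpy as np

def initialize_velocity(params):
    # One zero array per parameter, with the same shape
    return {name: np.zeros_like(value) for name, value in params.items()}

def update_with_momentum(params, grads, v, beta=0.9, learning_rate=0.01):
    for name in params:
        # Exponentially weighted average of the gradients
        v[name] = beta * v[name] + (1 - beta) * grads[name]
        # Step along the averaged direction instead of the raw gradient
        params[name] = params[name] - learning_rate * v[name]
    return params, v

params = {"W1": np.array([[1.0, 2.0]]), "b1": np.array([[0.5]])}
grads = {"W1": np.array([[1.0, 1.0]]), "b1": np.array([[1.0]])}
v = initialize_velocity(params)
params, v = update_with_momentum(params, grads, v)
print(v["W1"], params["W1"])
```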

RMSprop

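You have seen how momentum can speed up gradient descent. Another algorithm, RMSprop, short for Root Mean Square prop, can also speed up gradient descent. Let's see how it works. Recall the earlier example: when running gradient descent, you may get large oscillations in the vertical direction even while it is trying to make progress in the horizontal direction. To illustrate, suppose the vertical axis is the parameter b and the horizontal axis is the parameter W. These could just as well be W1 and W2 or other parameters; b and W are used to make the intuition easier. You want to slow down learning in the b direction, the vertical direction, and speed up, or at least not slow down, learning in the horizontal direction. This is what RMSprop does. Another benefit is that you can use a larger learning rate α and learn faster without diverging in the vertical direction.

In the horizontal direction, the W direction in this example, we want learning to go fast, while in the vertical direction, the b direction, we want to damp the oscillations. For the two terms S_dW and S_db, we want S_dW to be relatively small, so that the W update is divided by a small number, and S_db to be relatively large, so that the b update is divided by a large number, which slows the updates in the vertical direction. Indeed, if you look at the derivatives, they are much larger in the vertical direction than in the horizontal direction: the slope in the b direction is steep, so db is large and dW is relatively small. Therefore db squared is relatively large and S_db is relatively large, while dW squared is small and S_dW is small. The net effect is that updates in the vertical direction are divided by a larger number, which damps the oscillations, while updates in the horizontal direction are divided by a smaller number.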
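The effect described above can be sketched with one RMSprop step on two scalar parameters. The dictionary layout, the small constant `eps` added to the denominator, and the toy gradients (a small dW and a large db) are assumptions for this example.

```python
import numpy as np

def rmsprop_update(params, grads, s, beta=0.9, learning_rate=0.01, eps=1e-8):
    for name in params:
        # Exponentially weighted average of the squared gradients
        s[name] = beta * s[name] + (1 - beta) * grads[name] ** 2
        # Divide each step by the root of that average
        params[name] = params[name] - learning_rate * grads[name] / (np.sqrt(s[name]) + eps)
    return params, s

params = {"W": np.array([0.0]), "b": np.array([0.0])}
grads = {"W": np.array([0.1]), "b": np.array([10.0])}   # shallow in W, steep in b
s = {"W": np.zeros(1), "b": np.zeros(1)}
params, s = rmsprop_update(params, grads, s)
print(s, params)
```

Although db is 100 times larger than dW, S_db is also much larger than S_dW, so after the division the two parameters move by the same amount in this first step.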

Adam optimization algorithm

Adam is one of the most effective optimization algorithms for training neural networks. It combines ideas from RMSProp (described in lecture) and Momentum.

How does Adam work?

1. It calculates an exponentially weighted average of past gradients, and stores it in variables v (before bias correction) and v^corrected (with bias correction).
2. It calculates an exponentially weighted average of the squares of the past gradients, and stores it in variables s (before bias correction) and s^corrected (with bias correction).
3. It updates parameters in a direction based on combining information from “1” and “2”.

The update rule is, for l=1,...,L :
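$$\begin{cases} v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1)\, dW^{[l]} \\ v^{corrected}_{dW^{[l]}} = \dfrac{v_{dW^{[l]}}}{1 - (\beta_1)^t} \\ s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2) \left(dW^{[l]}\right)^2 \\ s^{corrected}_{dW^{[l]}} = \dfrac{s_{dW^{[l]}}}{1 - (\beta_2)^t} \\ W^{[l]} = W^{[l]} - \alpha \dfrac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon} \end{cases}$$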

where:
- t counts the number of Adam steps taken
- L is the number of layers
- β1 and β2 are hyperparameters that control the two exponentially weighted averages.
- α is the learning rate
- ε is a very small number to avoid dividing by zero
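The three steps can be sketched as one Adam update. The dictionary layout and function name are assumptions, and the default values β1 = 0.9, β2 = 0.999 and ε = 1e-8 are commonly used settings rather than values stated in these notes.

```python
import numpy as np

def adam_update(params, grads, v, s, t, learning_rate=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    for name in params:
        # 1. Exponentially weighted average of past gradients
        v[name] = beta1 * v[name] + (1 - beta1) * grads[name]
        # 2. Exponentially weighted average of past squared gradients
        s[name] = beta2 * s[name] + (1 - beta2) * grads[name] ** 2
        # Bias correction; t is the step count, starting at 1
        v_corrected = v[name] / (1 - beta1 ** t)
        s_corrected = s[name] / (1 - beta2 ** t)
        # 3. Combine both to update the parameter
        params[name] = params[name] - learning_rate * v_corrected / (np.sqrt(s_corrected) + eps)
    return params, v, s

params = {"W1": np.array([[1.0, -2.0]])}
grads = {"W1": np.array([[0.5, -4.0]])}
v = {"W1": np.zeros((1, 2))}
s = {"W1": np.zeros((1, 2))}
params, v, s = adam_update(params, grads, v, s, t=1)
print(params["W1"])
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by about one learning rate regardless of the gradient's magnitude.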

Hyperparameter tuning, Batch Normalization and Programming Frameworks

Hyperparameter tuning

Tuning process

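Hyperparameters are not equally important. For example, the learning rate α matters more than β1 in Adam.
In earlier machine learning practice, with relatively few hyperparameters, it was common to tune them by trying points on a grid.
In deep learning, with many hyperparameters, sample points at random rather than on a regular grid. The reason is that you do not know in advance which hyperparameters matter most for your problem, so random sampling tries more distinct values of each hyperparameter and explores the important ones more thoroughly.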

Using an appropriate scale to pick hyperparameters

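When choosing the scale for a hyperparameter, sample uniformly at random within each order of magnitude, e.g. within the ranges 0.0001~0.001, 0.001~0.01, 0.01~0.1 and 0.1~1.
In general, to sample on a log scale between 10^a and 10^b, draw r uniformly from [a, b] and set α = 10^r.
Likewise, the hyperparameter β used in exponentially weighted averages should be chosen with this method.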
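Log-scale sampling can be sketched with the standard library. The function names and ranges are assumptions; for β the sketch samples 1 − β on a log scale, which is one common way to apply the same idea.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    # Draw r uniformly in [log10(low), log10(high)] and return 10**r
    r = rng.uniform(math.log10(low), math.log10(high))
    return 10 ** r

def sample_beta(low, high, rng):
    # Sample 1 - beta on a log scale so values near 1 are explored finely
    return 1 - sample_log_uniform(1 - high, 1 - low, rng)

rng = random.Random(0)
alphas = [sample_log_uniform(1e-4, 1.0, rng) for _ in range(1000)]
betas = [sample_beta(0.9, 0.999, rng) for _ in range(1000)]

# About half of the learning rates fall below 0.01, the midpoint on a log scale
share_small = sum(a < 1e-2 for a in alphas) / len(alphas)
print(share_small, min(betas), max(betas))
```

With uniform sampling on a linear scale, about 99% of the draws between 0.0001 and 1 would land above 0.01, leaving the small values almost unexplored.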

Hyperparameters tuning in practice: Pandas vs. Caviar

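When computing resources are limited, use the first approach (Panda): babysit a single model and keep improving it day by day.
When computing resources are plentiful, use the second approach (Caviar): train many models in parallel and pick the best one.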

Batch Normalization

Normalizing activations in a network

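One of the most important innovations in the rise of deep learning is an algorithm called Batch Normalization, proposed by Sergey Ioffe and Christian Szegedy. It makes the hyperparameter search much easier and makes the neural network more robust: the network becomes less sensitive to the choice of hyperparameters, and very deep networks become easier to train.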

Implementing Batch Norm

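Here γ and β are learnable parameters of the model, so you update them with gradient descent or a similar algorithm, such as gradient descent with momentum or Adam, just as you update the weights of the neural network.

What Batch Norm does is apply normalization not only to the input layer but also to hidden layers, normalizing the values z of some hidden units. One difference between normalizing the inputs and normalizing the hidden units is that you do not necessarily want the hidden unit values to have mean 0 and variance 1. For example, with a sigmoid activation function you may not want all the normalized values clustered around 0; you might want a larger variance to make better use of the non-linearity of the sigmoid, rather than having all values fall in the near-linear middle region. That is why γ and β are there: by setting them you control the range of z(i). What they really do is give your hidden units a controllable mean and variance, and the algorithm is free to set these two parameters. The result is a standardized mean and variance, which can be mean 0 and variance 1 or any other values controlled by γ and β.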
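A Batch Norm forward step over one mini-batch can be sketched as follows. The function name, the `eps` constant and the toy shapes are assumptions for this example.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    # Z: (n_units, batch_size); statistics are taken over the mini-batch
    mu = np.mean(Z, axis=1, keepdims=True)
    var = np.var(Z, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)   # mean 0, variance 1 per unit
    Z_tilde = gamma * Z_norm + beta          # learnable scale and shift
    return Z_tilde, mu, var

rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, scale=3.0, size=(2, 64))   # 2 hidden units, batch of 64
gamma = np.array([[1.0], [2.0]])
beta = np.array([[0.0], [3.0]])
Z_tilde, mu, var = batch_norm_forward(Z, gamma, beta)
print(Z_tilde.mean(axis=1), Z_tilde.std(axis=1))   # approximately [0, 3] and [1, 2]
```

The first unit ends up with mean 0 and standard deviation 1, while the second has the mean and standard deviation set by its own β and γ.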

Batch Norm at test time

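The usual approach is to keep an exponentially weighted average of the mean and variance across the mini-batches of the training set during training. When training finishes, the resulting averaged mean and variance are plugged directly into the Batch Norm formula to make predictions on test examples.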
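This bookkeeping can be sketched as below. The function names, the averaging coefficient of 0.9 and the simulated mini-batches are assumptions for illustration.

```python
import numpy as np

def update_running_stats(running_mu, running_var, Z, momentum=0.9):
    # Exponentially weighted average of the per-mini-batch statistics
    mu = np.mean(Z, axis=1, keepdims=True)
    var = np.var(Z, axis=1, keepdims=True)
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return running_mu, running_var

def batch_norm_predict(z, running_mu, running_var, gamma, beta, eps=1e-8):
    # At test time, use the stored averages instead of batch statistics
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta

rng = np.random.default_rng(0)
running_mu, running_var = np.zeros((1, 1)), np.ones((1, 1))
for _ in range(200):                                  # simulated training mini-batches
    Z = rng.normal(loc=5.0, scale=3.0, size=(1, 64))
    running_mu, running_var = update_running_stats(running_mu, running_var, Z)

z_test = np.array([[5.0]])                            # a single test example
print(batch_norm_predict(z_test, running_mu, running_var, gamma=1.0, beta=0.0))
```

A single test example has no meaningful batch mean or variance of its own, which is why the stored averages are needed.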

Multi-class classification

Softmax Regression

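Softmax regression is a generalization of logistic regression. It lets you predict among multiple classes rather than just two.

To summarize the computation from z[L] to a[L]: exponentiate to get a temporary variable, then normalize.
This whole process can be summarized as the softmax activation function. Suppose a[L] = g(z[L]). What is unusual about this activation function g is that it takes a 4×1 vector as input and outputs a 4×1 vector. Previously our activation functions took a single real number as input: sigmoid and ReLU, for instance, take a real number and output a real number. Softmax differs in that it normalizes its output, and both its input and its output are vectors.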
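The exponentiate-then-normalize computation can be sketched as follows. The input values are illustrative, and subtracting the maximum before exponentiating is a numerical-stability habit added here, not part of the description above; it does not change the result.

```python
import numpy as np

def softmax(z):
    # z: (n_classes, m); each column is one example
    t = np.exp(z - np.max(z, axis=0, keepdims=True))  # temporary variable
    return t / np.sum(t, axis=0, keepdims=True)       # normalize to sum to 1

z = np.array([[5.0], [2.0], [-1.0], [3.0]])   # a 4x1 input vector
a = softmax(z)
print(a.ravel(), a.sum())
```

The output is again a 4×1 vector, and its entries can be read as class probabilities.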

Train a Softmax Classifier

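In short, the loss function looks at the ground-truth class of each training example and tries to make the corresponding predicted probability as high as possible (a form of maximum likelihood estimation).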
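For a single example with a one-hot label, this loss can be sketched as below. The function name and the toy values are assumptions for illustration.

```python
import numpy as np

def softmax_cross_entropy(a, y):
    # a: predicted probabilities, y: one-hot label, both of shape (n_classes, 1)
    return float(-np.sum(y * np.log(a)))

y = np.array([[0], [1], [0], [0]])            # the true class is the second one
a = np.array([[0.3], [0.2], [0.1], [0.4]])
print(softmax_cross_entropy(a, y))            # -log(0.2), about 1.609
```

Only the probability assigned to the true class contributes, so the loss goes down exactly when that probability goes up.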

Structuring Machine Learning Projects

About this Course

You will learn how to build a successful machine learning project. If you aspire to be a technical leader in AI, and know how to set direction for your team’s work, this course will show you how.

Much of this content has never been taught elsewhere, and is drawn from my experience building and shipping many deep learning products. This course also has two “flight simulators” that let you practice decision-making as a machine learning project leader. This provides “industry experience” that you might otherwise get only after years of ML work experience.

After 2 weeks, you will:

Understand how to diagnose errors in a machine learning system, and
Be able to prioritize the most promising directions for reducing error
Understand complex ML settings, such as mismatched training/test sets, and comparing to and/or surpassing human-level performance
Know how to apply end-to-end learning, transfer learning, and multi-task learning

ML Strategy (1)

Learning Objectives

Understand why Machine Learning strategy is important

Apply satisficing and optimizing metrics to set up your goal for ML projects

Choose a correct train/dev/test split of your dataset

Understand how to define human-level performance

Use human-level performance to define your key priorities in ML projects

Take the correct ML Strategic decision based on observations of performances and dataset

Introduction to ML Strategy

Why ML Strategy

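Ways to improve an ML system:

The ML Strategy course covers:
1. A number of strategies, that is, ways of analyzing a machine learning problem that will point you in the direction of the most promising things to try.
2. Andrew Ng's own experience building and shipping a large number of deep learning products.

Andrew Ng points out that machine learning strategy is changing in the era of deep learning, because what you can do with deep learning algorithms now differs from what was possible with the previous generation of machine learning algorithms.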

Orthogonalization

Background: One of the challenges with building machine learning systems is that there are so many things you could try, so many things you could change, including, for example, so many hyperparameters you could tune.

Orthogonalization or orthogonality is a system design property that assures that modifying an instruction or a component of an algorithm will not create or propagate side effects to other components of the system. It becomes easier to verify the algorithms independently from one another, it reduces testing and development time.

When a supervised learning system is designed, these are the 4 assumptions that need to be true and orthogonal.

Fit training set well on cost function

If it doesn’t fit well, the use of a bigger neural network or switching to a better optimization algorithm might help.

Fit development set well on cost function

If it doesn’t fit well, regularization or using bigger training set might help.

Fit test set well on cost function

If it doesn’t fit well, the use of a bigger development set might help

Performs well in real world

If it doesn’t perform well, the development set or test set is not set correctly, or the cost function is not evaluating the right thing.

What orthogonalization means in machine learning: Figure out exactly what’s wrong, and then have exactly one knob, or a specific set of knobs, that helps to solve just the problem that is limiting the performance of the machine learning system.

Setting up your goal

Single number evaluation metric

The benefit of a single number evaluation metric: it lets you quickly tell if the new thing you just tried is working better or worse than your last idea.

Examples of evaluation metrics:
Precision
Of all the images we predicted y=1, what fraction actually have cats?
Recall
Of all the images that actually have cats, what fraction did we correctly identify as having cats?

The problem with using precision and recall as the evaluation metric is that with two numbers you cannot always tell which classifier is better, for example when one has better precision and the other has better recall. The F1-score, the harmonic mean of precision and recall, combines both into a single number.

$$F_1 = \frac{2}{\frac{1}{p} + \frac{1}{r}}$$

F1-score is not the only evaluation metric that can be used; a simple average, for example, could also indicate which classifier to use.
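The F1-score is a one-line computation. The precision and recall values below are illustrative assumptions, not results from these notes.

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall
    return 2 / (1 / precision + 1 / recall)

# Two hypothetical classifiers: A has better recall, B has better precision
f1_a = f1_score(precision=0.95, recall=0.90)
f1_b = f1_score(precision=0.98, recall=0.85)
print(round(f1_a, 4), round(f1_b, 4))
```

With a single number per classifier the comparison is immediate: here A scores slightly higher than B.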

Satisficing and Optimizing metric

There are different metrics to evaluate the performance of a classifier; they are called evaluation metrics. They can be categorized as satisficing and optimizing metrics. It is important to note that these evaluation metrics must be evaluated on a training set, a development set or the test set.

Example: Cat vs Non-cat

Classifier	Accuracy	Running time
A	90%	80 ms
B	92%	95 ms
C	95%	1,500 ms

In this case, accuracy and running time are the evaluation metrics. Accuracy is the optimizing metric, because you want the classifier to detect cat images as accurately as possible. The running time, which is required to be under 100 ms in this example, is the satisficing metric, which means that the metric only has to meet the expectation set.

The general rule is:

$$N_{metric}: \begin{cases} 1 & \text{optimizing metric} \\ N_{metric} - 1 & \text{satisficing metrics} \end{cases}$$

Summary: If there are multiple things you care about, pick one as the optimizing metric that you want to do as well as possible on, and one or more as satisficing metrics, where you are satisfied as long as the classifier does better than some threshold. You then have an almost automatic way of quickly looking at multiple classifiers and picking the “best” one.
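Applied to the table above, the rule can be sketched as a filter on the satisficing metric followed by a maximum over the optimizing metric. The data layout and function name are assumptions for this example.

```python
classifiers = [
    {"name": "A", "accuracy": 0.90, "running_time_ms": 80},
    {"name": "B", "accuracy": 0.92, "running_time_ms": 95},
    {"name": "C", "accuracy": 0.95, "running_time_ms": 1500},
]

def pick_best(candidates, max_running_time_ms=100):
    # Satisficing metric: keep only classifiers that run under the threshold
    feasible = [c for c in candidates if c["running_time_ms"] < max_running_time_ms]
    # Optimizing metric: among those, take the most accurate one
    return max(feasible, key=lambda c: c["accuracy"]) if feasible else None

best = pick_best(classifiers)
print(best["name"])  # B
```

Classifier C is the most accurate but fails the running-time threshold, so it is never considered.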

Train /dev /test distributions

Setting up the training, development and test sets has a huge impact on productivity. It is important to choose the development and test sets from the same distribution, and they must be taken randomly from all the data. However, it is not a problem for the training and dev sets to have different distributions.

Guideline

Choose a development set and test set to reflect data you expect to get in the future and consider important to do well on.

Size of the dev and test sets

Old way of splitting data
We had smaller data sets, so we had to use a greater percentage of the data to develop and test ideas and models.

Modern era – Big data
Now, because a large amount of data is available, we don’t have to compromise as much and can use a greater portion to train the model.

Guidelines

Set up the size of the test set to give a high confidence in the overall performance of the system.
The test set helps evaluate the performance of the final classifier, and it can be less than 30% of the whole data set.
The development set has to be big enough to evaluate different ideas.

When to change dev /test sets and metrics

If doing well on your metric + dev/test set does not correspond to doing well on your application, change your metric and/or dev/test set.

Guideline

Define correctly an evaluation metric that helps better rank order classifiers
Optimize the evaluation metric

Why human-level performance?

Today, machine learning algorithms can compete with human-level performance, since they have become more productive and more feasible in a lot of applications. Also, the workflow of designing and building a machine learning system is much more efficient than before.

Moreover, some of the tasks that humans do are close to “perfection”, which is why machine learning tries to mimic human-level performance.

The graph below shows the performance of humans and machine learning over time.

Machine learning progress slows down once it surpasses human-level performance. One reason is that human-level performance can be close to Bayes optimal error, especially for natural perception problems.

Bayes optimal error is defined as the best possible error. In other words, it means that any function mapping from x to y can’t surpass a certain level of accuracy.

Also, when the performance of machine learning is worse than the performance of humans, you can improve it with different tools. They are harder to use once it surpasses human-level performance.

These tools are:

Get labeled data from humans
Gain insight from manual error analysis: Why did a person get this right?
Better analysis of bias/variance.

Avoidable bias

By knowing what the human-level performance is, it is possible to tell when a training set is performing well or not.

Example: Cat vs Non-Cat

	Classification error (%)
	Scenario A	Scenario B
Humans	1	7.5
Training error	8	8
Development error	10	10

In this case, the human-level error is used as a proxy for Bayes error, since humans are good at identifying images. You can try to improve performance on the training set, but you can’t do better than the Bayes error unless you overfit the training set. By knowing the Bayes error, it is easier to decide whether bias or variance reduction tactics will improve the performance of the model.

Scenario A
There is a 7% gap between the training error and the human-level error. It means that the algorithm isn’t fitting the training set well, since the target is around 1%. To resolve the issue, we use bias reduction techniques such as training a bigger neural network or training longer.

Scenario B
The training set is doing well, since there is only a 0.5% difference from the human-level error. The difference between the training error and the human-level error is called avoidable bias. The focus here is to reduce the variance, since the difference between the training error and the development error is 2%. To resolve the issue, we use variance reduction techniques such as regularization or a bigger training set.

Understanding human-level performance

Summary of bias/variance with human-level performance

Human-level error is a proxy for Bayes error.
If the difference between human-level error and the training error is bigger than the difference between the training error and the development error, the focus should be on bias reduction techniques.
If the difference between the training error and the development error is bigger than the difference between the human-level error and the training error, the focus should be on variance reduction techniques.
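The comparison can be sketched as a small helper, applied here to the two scenarios from the Cat vs Non-Cat table. The function name and the returned labels are assumptions for this example.

```python
def diagnose(human_error, train_error, dev_error):
    # Human-level error stands in for Bayes error
    avoidable_bias = train_error - human_error
    variance = dev_error - train_error
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus

print(diagnose(1.0, 8.0, 10.0))   # Scenario A: (7.0, 2.0, 'bias')
print(diagnose(7.5, 8.0, 10.0))   # Scenario B: (0.5, 2.0, 'variance')
```

The training and development errors are identical in both scenarios; only the human-level error changes which gap is larger.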

Surpassing human-level performance

There are many problems where machine learning significantly surpasses human-level performance, especially with structured data:

Online advertising
Product recommendations
Logistics (predicting transit time)
Loan approvals

Improving your model performance

The two fundamental assumptions of supervised learning:

There are 2 fundamental assumptions of supervised learning. The first one is to have a low avoidable bias, which means that the model fits the training set well. The second one is to have a low or acceptable variance, which means that the training set performance generalizes well to the development set and test set.

If the difference between human-level error and the training error is bigger than the difference between the training error and the development error, the focus should be on bias reduction techniques: training a bigger model, training longer, changing the neural network architecture, or running a hyperparameter search.

If the difference between the training error and the development error is bigger than the difference between the human-level error and the training error, the focus should be on variance reduction techniques: a bigger data set, regularization, changing the neural network architecture, or running a hyperparameter search.

Summary

ML Strategy (2)

Learning Objectives

Understand what multi-task learning and transfer learning are

Recognize bias, variance and data-mismatch by looking at the performances of your algorithm on train/dev/test sets

Error Analysis

Carrying out error analysis

Summary
To carry out error analysis, you should find a set of mislabeled examples in your development set, look at them for false positives and false negatives, and count up the number of errors that fall into various different categories.

During this process, you might be inspired to generate new categories of errors. But by counting up the fraction of examples that are mislabeled in different ways, often this will help you prioritize. Or give you inspiration for new directions to go in.

Example

Create a table for manual error inspection to make the error analysis clearer and better organized.

Cleaning up incorrectly labeled data

DL algorithms are quite robust to random errors in the training set, but less robust to systematic errors.

Guideline

Apply same process to your dev and test sets to make sure they continue to come from the same distribution.
Consider examining examples your algorithm got right as well as ones it got wrong.
Train and dev/test data may now come from slightly different distributions.

Build your first system quickly, then iterate

Depending on the area of application, the guideline below will help you prioritize when you build your system.

Guideline

Set up development/ test set and metrics
- Set up a target
Build an initial system quickly
- Train training set quickly: Fit the parameters
- Development set: Tune the parameters
- Test set: Assess the performance
Use Bias/Variance analysis & Error analysis to prioritize next steps

Mismatched training and dev/test set

Example: Cat vs Non-cat
In this example, we want to create a mobile application that will classify and recognize pictures of cats taken and uploaded by users.

There are two sources of data used to develop the mobile app. The first data distribution is small: 10 000 pictures uploaded from the mobile application. Since they come from amateur users, the pictures are not professionally shot, not well framed, and blurrier. The second source is the web: you downloaded 200 000 pictures in which the cats are professionally framed and in high resolution.

The problem is that you have a different distribution:

small data set from pictures uploaded by users. This distribution is important for the mobile app.
bigger data set from the web.

The guideline used is that you have to choose a development set and test set to reflect data you expect to get in the future and consider important to do well on.

The data is split as follows:

The advantage of this way of splitting up is that the target is well defined.

The disadvantage is that the training distribution is different from the development and test set distributions. However, this way of splitting the data gives better performance in the long term.

Bias and Variance with mismatched data distributions

When the training set is from a different distribution than the development and test sets, the method to analyze bias and variance changes.

Scenario A
If the development data came from the same distribution as the training set, this would indicate a large variance problem: the algorithm is not generalizing well from the training set.

However, since the training data and the development data come from different distributions, this conclusion cannot be drawn. There isn’t necessarily a variance problem. The problem might be that the development set contains images that are more difficult to classify accurately.

When the training set and the development and test sets come from different distributions, two things change at the same time. First, the algorithm saw the data in the training set but not the data in the development set. Second, the distribution of data in the development set is different.

It’s difficult to know which of these two changes produces this 9% increase in error between the training set and the development set. To resolve this issue, we define a new subset called the training-development set. This new subset has the same distribution as the training set, but it is not used for training the neural network.

Scenario B
The error gap between the training set and the training-development set is 8%. In this case, since the training set and training-development set come from the same distribution, the only difference between them is that the neural network saw the data in the training set and not the data in the training-development set. The neural network is not generalizing well to data from the same distribution that it hadn’t seen before.

Therefore, we really have a variance problem.

Scenario C
In this case, we have a data mismatch problem, since the 2 data sets come from different distributions.

Scenario D
In this case, the avoidable bias is high since the difference between Bayes error and training error is 10 %.

Scenario E
In this case, there are 2 problems. The first one is that the avoidable bias is high since the difference between Bayes error and training error is 10 % and the second one is a data mismatched problem.

Scenario F
Development should never be done on the test set. However, the difference between the development set and the test set gives the degree of overfitting to the development set.
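The gaps discussed in the scenarios can be sketched as successive differences between error rates. The function name and the error values below are hypothetical, chosen only to illustrate the arithmetic.

```python
def error_gaps(human, train, train_dev, dev, test):
    # Each gap isolates one problem by changing one thing at a time
    return {
        "avoidable_bias": train - human,       # Bayes proxy -> training set
        "variance": train_dev - train,         # seen -> unseen, same distribution
        "data_mismatch": dev - train_dev,      # training -> dev distribution
        "overfitting_to_dev": test - dev,      # dev set -> test set
    }

# Hypothetical error rates in percent
gaps = error_gaps(human=0, train=1, train_dev=9, dev=10, test=10)
largest = max(gaps, key=gaps.get)
print(gaps, largest)
```

With these numbers the variance gap (8) is the largest, so variance reduction would be the first thing to try.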

General formulation

Addressing data mismatch

This is a general guideline to address data mismatch:

Perform manual error analysis to understand the differences between the training set and the development/test sets. Development should never be done on the test set, to avoid overfitting.
Make the training data more similar to the development and test sets, or collect more data similar to them. One way to make the training data more similar to your development set is artificial data synthesis. However, you might accidentally be simulating data from only a tiny subset of the space of all possible examples.

Learning from multiple tasks

Transfer learning

Transfer learning refers to reusing the knowledge a neural network learned on one task for another application.

When to use transfer learning
• Task A and B have the same input x
• A lot more data for Task A than Task B
• Low level features from Task A could be helpful for Task B

Example 1: Cat recognition - radiology diagnosis
The following neural network is trained for cat recognition, but we want to adapt it for radiology diagnosis. The neural network will learn about the structure and the nature of images. This initial phase of training on image recognition is called pre-training, since it will pre-initialize the weights of the neural network. Updating all the weights afterwards is called fine-tuning.

For cat recognition
Input x : image
Output y – 1: cat, 0: no cat

Radiology diagnosis
Input x : Radiology images – CT Scan, X-rays
Output y : Radiology diagnosis – 1: malignant tumor, 0: benign tumor

Guideline
• Delete last layer of neural network
• Delete weights feeding into the last output layer of the neural network
• Create a new set of randomly initialized weights for the last layer only
• New data set (x,y)
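The guideline above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the network is represented as a hypothetical list of `(W, b)` pairs, one per layer, and `replace_output_layer` is a made-up helper name, not part of any library.

```python
import numpy as np

def replace_output_layer(params, n_out_new, rng):
    """Drop the last layer's weights and create a new randomly initialized one.

    params: list of (W, b) tuples, W of shape (n_out, n_in), b of shape (n_out, 1).
    Earlier layers are reused as-is (their pre-trained weights are kept).
    """
    *hidden, (W_last, _) = params            # discard the old output layer
    n_in = W_last.shape[1]                   # the new layer sees the same inputs
    W_new = rng.standard_normal((n_out_new, n_in)) * 0.01
    b_new = np.zeros((n_out_new, 1))
    return hidden + [(W_new, b_new)]

rng = np.random.default_rng(0)
# Stand-in for a network pre-trained on cat recognition (random numbers here).
pretrained = [(rng.standard_normal((5, 8)), np.zeros((5, 1))),
              (rng.standard_normal((1, 5)), np.zeros((1, 1)))]
adapted = replace_output_layer(pretrained, n_out_new=1, rng=rng)
```

After this step the network would be trained on the new data set (x, y), either updating only the new layer or fine-tuning all the weights.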

Multi-task learning

Multi-task learning refers to having one neural network perform several tasks simultaneously.

When to use multi-task learning

Training on a set of tasks that could benefit from having shared lower-level features
Usually: Amount of data you have for each task is quite similar
Can train a big enough neural network to do well on all tasks

Example: Simplified autonomous vehicle
The vehicle has to detect several things simultaneously: pedestrians, cars, road signs, traffic lights, cyclists, etc. We could train four separate neural networks instead of training one to do four tasks. However, in this case, the system performs better when one neural network is trained to do four tasks than when four separate neural networks are trained, since some of the earlier features in the neural network can be shared between the different types of objects.

The input x(i) is an image with multiple labels.
The output y(i) has 4 labels, one per type of object to detect.

Also, the cost can be computed so that it is not influenced by the fact that some entries are not labeled.
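One way to compute such a cost is to sum the logistic loss only over the entries that actually have a label. The sketch below assumes missing labels are stored as `np.nan`; that encoding and the name `masked_multitask_cost` are illustrative choices, not something the course prescribes.

```python
import numpy as np

def masked_multitask_cost(y_hat, y):
    """Logistic loss summed over labeled entries only, averaged over examples.

    y_hat: (m, n_tasks) predicted probabilities in (0, 1).
    y:     (m, n_tasks) labels in {0, 1}, with np.nan where a label is missing.
    """
    labeled = ~np.isnan(y)                       # True where a label exists
    y_filled = np.where(labeled, y, 0.0)         # placeholder for missing labels
    eps = 1e-12                                  # avoids log(0)
    loss = -(y_filled * np.log(y_hat + eps)
             + (1.0 - y_filled) * np.log(1.0 - y_hat + eps))
    return np.sum(loss * labeled) / y.shape[0]   # unlabeled entries contribute 0

# Two images, four object types; some entries were never labeled.
y = np.array([[1.0, 0.0, np.nan, 1.0],
              [0.0, np.nan, np.nan, 0.0]])
y_hat = np.full(y.shape, 0.5)
cost = masked_multitask_cost(y_hat, y)
```

Changing the prediction at an unlabeled position leaves the cost unchanged, which is the property described above.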

End-to-end deep learning

What is end-to-end deep learning?

End-to-end deep learning is the simplification of a processing or learning system with multiple stages into a single neural network.

Example - Speech recognition model

End-to-end deep learning cannot be used for every problem since it needs a lot of labeled data. It is used mainly in audio transcription, image captioning, image synthesis, machine translation, steering in self-driving cars, etc.

Whether to use end-to-end deep learning

Before applying end-to-end deep learning, you need to ask yourself the following question: Do you have enough data to learn a function of the complexity needed to map x to y?

Pro:

Let the data speak
- By having a pure machine learning approach, the neural network will learn from x to y. It will be able to find which statistics are in the data, rather than being forced to reflect human preconceptions.
Less hand-designing of components needed
- It simplifies the design work flow.

Cons:

Large amount of labeled data
- It cannot be used for every problem as it needs a lot of labeled data.
Excludes potentially useful hand-designed component
- Data and hand-designed components or features are the two main sources of knowledge for a learning algorithm. If the data set is small, then a hand-designed system is a way to inject manual knowledge into the algorithm.

Convolutional Neural Networks

About this Course
This course will teach you how to build convolutional neural networks and apply them to image data. Thanks to deep learning, computer vision is working far better than just two years ago, and this is enabling numerous exciting applications ranging from safe autonomous driving, to accurate face recognition, to automatic reading of radiology images.

You will:
- Understand how to build a convolutional neural network, including recent variations such as residual networks.
- Know how to apply convolutional networks to visual detection and recognition tasks.
- Know how to use neural style transfer to generate art.
- Be able to apply these algorithms to a variety of image, video, and other 2D or 3D data.

This is the fourth course of the Deep Learning Specialization.

Foundations of Convolutional Neural Networks

Learn to implement the foundational layers of CNNs (pooling, convolutions) and to stack them properly in a deep network to solve multi-class image classification problems.

Learning Objectives
Understand the convolution operation
Understand the pooling operation
Remember the vocabulary used in convolutional neural network (padding, stride, filter, …)
Build a convolutional neural network for image multi-class classification

Computer Vision

Computer vision is one of the areas that’s been advancing rapidly thanks to deep learning. Two reasons make people excited about deep learning for computer vision:

Rapid advances in computer vision are enabling brand new applications that were impossible just a few years ago.
The computer vision research community has been so creative and inventive in coming up with new neural network architectures and algorithms that its ideas inspire a lot of cross-fertilization into other areas as well.

Edge Detection Example

The convolution operation is one of the fundamental building blocks of a convolutional neural network.

Using edge detection as the motivating example in this module:

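In the example in the figure above, as the 3x3 filter moves over the regions marked by the red and blue circles, the convolution produces the result shown in the lower-right figure. The lighter region in the middle of the output indicates that the vertical edge was successfully detected.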

The convolution operation gives you a convenient way to specify how to find these vertical edges in an image.
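A small NumPy sketch of this vertical edge detection follows. The 6x6 image (bright left half, dark right half) and the filter values are assumptions standing in for the missing figure, and `conv2d_valid` is a hypothetical helper written for this example.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a square kernel over a 2D image (stride 1, no padding)."""
    n_h, n_w = image.shape
    f = kernel.shape[0]
    out = np.zeros((n_h - f + 1, n_w - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise product of the f x f window with the kernel, then sum.
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

# Left half bright (10), right half dark (0): one vertical edge in the middle.
image = np.hstack([np.full((6, 3), 10.0), np.zeros((6, 3))])
vertical_filter = np.array([[1.0, 0.0, -1.0]] * 3)
edges = conv2d_valid(image, vertical_filter)
# Every row of `edges` is [0, 30, 30, 0]: the bright band marks the edge.
```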

More Edge Detection

In this module, you’ll learn the difference between positive and negative edges, that is, the difference between light to dark versus dark to light edge transitions. And you’ll also see other types of edge detectors, as well as how to have an algorithm learn, rather than have us hand code an edge detector as we’ve been doing so far.

Different filters allow you to find vertical and horizontal edges:

With the rise of deep learning, one of the things we learned is that when you really want to detect edges in some complicated image, maybe you don’t need to have computer vision researchers handpick these nine numbers. Maybe you can just learn them and treat the nine numbers of this matrix as parameters, which you can then learn using back propagation. And the goal is to learn nine parameters so that when you take the image, the six by six image, and convolve it with your three by three filter, that this gives you a good edge detector.

Rather than just vertical and horizontal edges, maybe deep learning can learn to detect edges that are at 45 degrees or 70 degrees or 73 degrees or at whatever orientation it chooses. By letting all of these numbers be parameters and learning them automatically from data, we find that neural networks can actually learn low-level features such as edges even more robustly than computer vision researchers are generally able to code them up by hand. But underlying all these computations is still the convolution operation, which allows back propagation to learn whatever three by three filter it wants and then apply it throughout the entire image, position by position, in order to output whatever feature it’s trying to detect, be it vertical edges, horizontal edges, edges at some other angle, or even some other filter that we might not have a name for in English.

The idea that you can treat these nine numbers as parameters to be learned has been one of the most powerful ideas in computer vision.

Padding

In order to build deep neural networks one modification to the basic convolutional operation that you need to really use is padding.

Drawbacks of the basic convolution operation:
1. Every time you apply a convolutional operator, your image shrinks, for example from six by six down to four by four. You can only do this a few times before your image gets really small, maybe shrinking down to one by one, and you may not want your image to shrink every time you detect edges or other features on it.
2. A pixel at the corner of the image is used in only one of the outputs, because it touches only one three by three region. A pixel in the middle, by contrast, is covered by many overlapping three by three regions. So pixels on the corners or edges are used much less in the output, and you’re throwing away a lot of the information near the edge of the image.

In order to fix both of these problems, before applying the convolution operation you can pad the image. In this case, let’s say you pad the image with an additional border of one pixel all around the edges.

The main benefits of padding are the following:

It allows you to use a CONV layer without necessarily shrinking the height and width of the volumes. This is important for building deeper networks, since otherwise the height/width would shrink as you go to deeper layers. An important special case is the “same” convolution, in which the height/width is exactly preserved after one layer.
It helps us keep more of the information at the border of an image. Without padding, very few values at the next layer would be affected by pixels at the edges of an image.

In terms of how much to pad, there are two common choices, called Valid convolutions and Same convolutions.
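As a rough illustration of the two choices: a valid convolution uses no padding, so an n x n input shrinks to n - f + 1, while a same convolution pads with p = (f - 1) / 2 so the output size equals the input size at stride 1. The helper names below are hypothetical.

```python
import numpy as np

def same_padding(f):
    """Padding that preserves height/width for an odd filter size f at stride 1."""
    return (f - 1) // 2

def zero_pad(image, pad):
    """Pad a 2D image with `pad` rows/columns of zeros on every side."""
    return np.pad(image, pad, mode="constant", constant_values=0)

image = np.ones((6, 6))
f = 3
padded = zero_pad(image, same_padding(f))   # one-pixel border of zeros
valid_out = image.shape[0] - f + 1          # "valid": no padding, output shrinks
same_out = padded.shape[0] - f + 1          # "same": output matches the input size
```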

By convention in computer vision, f is usually odd. There are two reasons for that:

If f were even, you would need asymmetric padding.
When you have an odd-dimension filter, such as three by three or five by five, it has a central position. In computer vision it’s sometimes nice to have a distinguished pixel that you can call the central pixel, so you can talk about the position of the filter.

Strided Convolutions

Strided convolution is another basic building block of convolutions as used in Convolutional Neural Networks.

Example:
Let’s say you want to convolve this seven by seven image with this three by three filter, except that instead of doing the usual way, we are going to do it with a stride of two. What that means is instead of stepping the blue box over by one step, we are going to step over by two steps.

Summary of convolutions:

Reminder:
The formulas relating the output shape of the convolution to the input shape are:

$$ n_H = \left\lfloor \frac{n_{H_{prev}} - f + 2 \times pad}{stride} \right\rfloor + 1 $$

$$ n_W = \left\lfloor \frac{n_{W_{prev}} - f + 2 \times pad}{stride} \right\rfloor + 1 $$

$$ n_C = \text{number of filters used in the convolution} $$
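The output-shape formulas can be checked with a few lines of Python. The function name `conv_output_shape` is an illustrative choice; integer floor division plays the role of the floor brackets.

```python
def conv_output_shape(n_h_prev, n_w_prev, f, pad, stride, n_filters):
    """Output (height, width, channels) of a convolution layer."""
    n_h = (n_h_prev - f + 2 * pad) // stride + 1   # floor division = floor
    n_w = (n_w_prev - f + 2 * pad) // stride + 1
    return n_h, n_w, n_filters

# The strided example above: 7x7 image, 3x3 filter, stride 2, no padding.
strided = conv_output_shape(7, 7, f=3, pad=0, stride=2, n_filters=1)
# A "same" convolution: 6x6 image, 3x3 filter, padding 1, stride 1.
same = conv_output_shape(6, 6, f=3, pad=1, stride=1, n_filters=1)
```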

Technical note on cross-correlation vs. convolution

Convolution in math textbook:

In math or signal processing textbooks, convolution is defined slightly differently, which is one possible inconsistency in notation. Before taking the element-wise product and summing, there is one extra step when you convolve this six by six matrix with this three by three filter: you first flip the three by three filter on both the horizontal and the vertical axis. Then you apply the flipped filter to the target matrix.

To summarize, by convention in machine learning, we usually do not bother with this flipping operation and technically, this operation is maybe better called cross-correlation but most of the deep learning literature just calls it the convolution operator.
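The difference between the two definitions is just the flip, as the following sketch shows. The helper names are hypothetical, and the random 6x6 input is only a stand-in for the matrix in the lecture.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """What deep learning calls 'convolution': no flipping (stride 1, no padding)."""
    n_h, n_w = image.shape
    f = kernel.shape[0]
    out = np.zeros((n_h - f + 1, n_w - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

def textbook_convolve2d(image, kernel):
    """Math-textbook convolution: flip the kernel on both axes, then slide it."""
    flipped = kernel[::-1, ::-1]
    return cross_correlate2d(image, flipped)

rng = np.random.default_rng(0)
image = rng.random((6, 6))
kernel = np.array([[1.0, 0.0, -1.0]] * 3)   # vertical edge filter
no_flip = cross_correlate2d(image, kernel)
with_flip = textbook_convolve2d(image, kernel)
# For this filter, flipping swaps the +1 and -1 columns, so the sign is reversed.
```

Because the filter values are learned anyway, skipping the flip changes nothing about what the network can represent.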

Convolutions Over Volume

Convolution can be implemented not only over just 2D images, but over three dimensional volumes.

The three by three by three filter has 27 numbers, or 27 parameters; that’s three cubed. What you do is take each of these 27 numbers and multiply it with the corresponding number from the red, green, and blue channels of the image: the first nine numbers from the red channel, then the nine beneath them from the green channel, then the nine beneath them from the blue channel, multiplied with the corresponding 27 numbers covered by the yellow cube shown on the left. Then add up all those numbers, which gives you the first number in the output. To compute the next output, you take this cube and slide it over by one.

Multiple filters

The idea of convolution over volumes turns out to be really powerful. Only a small part of it is that you can now operate directly on RGB images with three channels. Even more important is that you can now detect two features, like vertical and horizontal edges, or maybe several hundred different features. The output will then have a number of channels equal to the number of filters you are using.
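A minimal NumPy sketch of convolution over a volume with several filters follows. The shapes (a 6x6x3 image and two 3x3x3 filters) and the name `conv_volume` are assumptions made for the example; the values are random.

```python
import numpy as np

def conv_volume(image, filters):
    """Convolve an (n_h, n_w, n_c) image with filters of shape (f, f, n_c, n_filters).

    Stride 1, no padding. Each filter yields one output channel.
    """
    n_h, n_w, _ = image.shape
    f, _, _, n_filters = filters.shape
    out = np.zeros((n_h - f + 1, n_w - f + 1, n_filters))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + f, j:j + f, :]        # f x f x n_c cube
            for k in range(n_filters):
                # Multiply the f*f*n_c numbers element-wise and add them up.
                out[i, j, k] = np.sum(patch * filters[:, :, :, k])
    return out

rng = np.random.default_rng(0)
image = rng.random((6, 6, 3))                  # RGB image
filters = rng.standard_normal((3, 3, 3, 2))    # two 3x3x3 filters (27 numbers each)
out = conv_volume(image, filters)              # one output channel per filter
```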

One Layer of a Convolutional Network

Summary of notation
If layer l is a convolution layer:

Simple Convolutional Network Example

Types of layer in a convolutional network

Convolution
Pooling
Fully connected

Pooling Layers

Other than convolutional layers, ConvNets often also use pooling layers to reduce the size of the representation, to speed up the computation, and to make some of the features they detect a bit more robust.

The pooling (POOL) layer reduces the height and width of the input. It helps reduce computation, and it also helps make feature detectors more invariant to their position in the input. The two types of pooling layers are:

Max-pooling layer: slides an ( f,f ) window over the input and stores the max value of the window in the output.
Average-pooling layer: slides an ( f,f ) window over the input and stores the average value of the window in the output.

These pooling layers have no parameters for backpropagation to train. However, they have hyperparameters such as the window size f. This specifies the height and width of the f × f window you would compute a max or average over.
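Both pooling types can be sketched with one small function. The 4x4 input values, the stride of 2, and the name `pool2d` are assumptions for illustration.

```python
import numpy as np

def pool2d(x, f, stride, mode="max"):
    """Slide an (f, f) window over a 2D input and keep the max or the average."""
    n_h, n_w = x.shape
    out_h = (n_h - f) // stride + 1
    out_w = (n_w - f) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + f, j * stride:j * stride + f]
            out[i, j] = reduce_fn(window)
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]], dtype=float)
max_pooled = pool2d(x, f=2, stride=2, mode="max")       # [[9, 2], [6, 3]]
avg_pooled = pool2d(x, f=2, stride=2, mode="average")   # [[3.75, 1.25], [3.75, 2.0]]
```

Note that the function has no weights at all: f and the stride are the only things to choose.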

The intuition behind max pooling:
If the feature is detected anywhere in the filter window, then keep a high number. But if the feature is not detected, so maybe this feature doesn’t exist in the upper right-hand quadrant, then the max of all those numbers is still itself quite small.

Example of max pooling:

If you have a 3D input, the output has the same number of channels: pooling is applied to each channel independently.

There is another type of pooling that isn’t used very often, but is worth mentioning briefly: average pooling.

Instead of taking the max within each filter window, average pooling takes the average.

Summary of pooling

CNN Example

Why Convolutions?

Two main advantages of convolutional layers over just using fully connected layers:

parameter sharing
A feature detector (such as a vertical edge detector) that’s useful in one part of the image is probably useful in another part of the image.
sparsity of connections
In each layer, each output value depends only on a small number of inputs.
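To get a feel for how much parameter sharing and sparse connections save, the sketch below counts parameters for a hypothetical layer mapping a 32x32x3 input to a 28x28x6 output, once with six 5x5 filters and once fully connected. The layer sizes are illustrative assumptions, not figures quoted in these notes.

```python
def conv_params(f, n_c_prev, n_filters):
    """Each filter has f*f*n_c_prev weights plus one bias."""
    return (f * f * n_c_prev + 1) * n_filters

def fc_params(n_in, n_out):
    """A fully connected layer has one weight per input-output pair plus biases."""
    return n_in * n_out + n_out

n_in = 32 * 32 * 3      # 3072 input values
n_out = 28 * 28 * 6     # 4704 output values
conv_count = conv_params(f=5, n_c_prev=3, n_filters=6)
fc_count = fc_params(n_in, n_out)
```

Under these assumptions the convolutional layer needs a few hundred parameters, while the fully connected layer needs over fourteen million.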

Deep convolutional models: case studies

Learn about the practical tricks and methods used in deep CNNs straight from the research papers.

Learning Objectives

Understand multiple foundational papers of convolutional neural networks
Analyze the dimensionality reduction of a volume in a very deep network
Understand and Implement a Residual network
Build a deep neural network using Keras
Implement a skip-connection in your network
Clone a repository from github and use transfer learning

Case studies

Why look at case studies?

A good way to get intuition on how to build conv nets is to read or to see other examples of effective conv nets.
A neural network architecture that works well on one computer vision task often works well on other tasks.

Outline

Classic networks:

LeNet-5
AlexNet
VGG

ResNet
Inception

Classic Networks

LeNet - 5

The goal of LeNet-5 was to recognize handwritten digits.

AlexNet

AlexNet convinced a lot of the computer vision community to take a serious look at deep learning and showed that deep learning really works in computer vision. It then went on to have a huge impact not just in computer vision but beyond it as well.

VGG - 16

A remarkable thing about the VGG-16 net is that, instead of having so many hyperparameters, it uses a much simpler architecture. The architecture is really quite uniform.

ResNets

The problem of very deep neural networks

Last week, you built your first convolutional neural network. In recent years, neural networks have become deeper, with state-of-the-art networks going from just a few layers (e.g., AlexNet) to over a hundred layers.

The main benefit of a very deep network is that it can represent very complex functions. It can also learn features at many different levels of abstraction, from edges (at the lower layers) to very complex features (at the deeper layers). However, using a deeper network doesn’t always help. A huge barrier to training them is vanishing gradients: very deep networks often have a gradient signal that goes to zero quickly, thus making gradient descent unbearably slow. More specifically, during gradient descent, as you backprop from the final layer back to the first layer, you are multiplying by the weight matrix on each step, and thus the gradient can decrease exponentially quickly to zero (or, in rare cases, grow exponentially quickly and “explode” to take very large values).

During training, you might therefore see the magnitude (or norm) of the gradient for the earlier layers decrease to zero very rapidly as training proceeds:

You are now going to solve this problem by building a Residual Network!

Building a Residual Network
In ResNets, a “shortcut” or a “skip connection” allows the gradient to be directly backpropagated to earlier layers:

The image on the left shows the “main path” through the network. The image on the right adds a shortcut to the main path. By stacking these ResNet blocks on top of each other, you can form a very deep network.

We also saw in lecture that having ResNet blocks with the shortcut also makes it very easy for one of the blocks to learn an identity function. This means that you can stack on additional ResNet blocks with little risk of harming training set performance. (There is also some evidence that the ease of learning an identity function–even more than skip connections helping with vanishing gradients–accounts for ResNets’ remarkable performance.)

Two main types of blocks are used in a ResNet, depending mainly on whether the input/output dimensions are the same or different.

The identity block is the standard block used in ResNets, and corresponds to the case where the input activation (say a[l] ) has the same dimension as the output activation (say a[l+2] ). To flesh out the different steps of what happens in a ResNet’s identity block, here is an alternative diagram showing the individual steps:

The upper path is the “shortcut path.” The lower path is the “main path.” In this diagram, we have also made explicit the CONV2D and ReLU steps in each layer. To speed up training we have also added a BatchNorm step.

The ResNet “convolutional block” is the other type of block. You can use this type of block when the input and output dimensions don’t match up. The difference with the identity block is that there is a CONV2D layer in the shortcut path:

The CONV2D layer in the shortcut path is used to resize the input x to a different dimension, so that the dimensions match up in the final addition needed to add the shortcut value back to the main path. (This plays a similar role to the matrix Ws discussed in lecture.) For example, to reduce the activation dimensions’ height and width by a factor of 2, you can use a 1x1 convolution with a stride of 2. The CONV2D layer on the shortcut path does not use any non-linear activation function. Its main role is to just apply a (learned) linear function that reduces the dimension of the input, so that the dimensions match up for the later addition step.

What you should remember:
- Very deep “plain” networks don’t work in practice because they are hard to train due to vanishing gradients.
- The skip-connections help to address the Vanishing Gradient problem. They also make it easy for a ResNet block to learn an identity function.
- There are two main types of blocks: the identity block and the convolutional block.
- Very deep Residual Networks are built by stacking these blocks together.

Why ResNets Work

Why do ResNets work so well?
Doing well on the training set is usually a prerequisite to doing well on your hold-out, dev, or test sets. So being able to at least train a ResNet to do well on the training set is a good first step toward that.

If you make a plain network deeper, it can hurt your ability to train the network to do well on the training set. This is not true, or at least is much less true, when you are training a ResNet.

Example:

The activation at the end of the residual block is a[l+2] = g(z[l+2] + a[l]) = g(w[l+2] a[l+1] + b[l+2] + a[l]). The weight w[l+2] is really the key term to pay attention to here. If w[l+2] is equal to zero, and let’s say b[l+2] is also equal to zero, then these terms go away and a[l+2] = g(a[l]). This is just equal to a[l], because we assumed we’re using the ReLU activation function: all of the activations are non-negative, so g applied to a non-negative quantity gives back a[l].

What this shows is that the identity function is easy for a residual block to learn: it’s easy to get a[l+2] equal to a[l] because of the skip connection. That means adding these two layers doesn’t really hurt your neural network’s ability to do as well as the simpler network without them, because it’s quite easy for it to learn the identity function and just copy a[l] to a[l+2] despite the addition of these two layers. This is why adding a residual block somewhere in the middle or at the end of a big neural network doesn’t hurt performance.

But of course our goal is not just to not hurt performance but to help it. You can imagine that if these hidden units actually learn something useful, then you can do even better than learning the identity function. What goes wrong in very deep plain nets, without residual or skip connections, is that as you make the network deeper and deeper, it’s actually very difficult for it to choose parameters that learn even the identity function, which is why adding a lot of layers ends up making your result worse rather than better.

The main reason the residual network works is that it’s so easy for these extra layers to learn the identity function that you’re kind of guaranteed that it doesn’t hurt performance, and then a lot of the time you maybe get lucky and it even helps performance.
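The identity argument can be checked numerically. The sketch below uses fully connected layers instead of convolutions to keep it short, and the names (`residual_block`, `W1`, `W2`, ...) are assumptions for the example; it only illustrates that zero weights in the second layer make the block copy its input.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def residual_block(a_l, W1, b1, W2, b2):
    """Two-layer block with a skip connection: a[l+2] = g(z[l+2] + a[l])."""
    a_l1 = relu(W1 @ a_l + b1)      # a[l+1]
    z_l2 = W2 @ a_l1 + b2           # z[l+2]
    return relu(z_l2 + a_l)         # shortcut added before the final ReLU

rng = np.random.default_rng(0)
n = 4
a_l = relu(rng.standard_normal(n))            # activations are non-negative
W1, b1 = rng.standard_normal((n, n)), rng.standard_normal(n)
W2, b2 = np.zeros((n, n)), np.zeros(n)        # w[l+2] = 0 and b[l+2] = 0
a_l2 = residual_block(a_l, W1, b1, W2, b2)    # equals a_l: the identity function
```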

Networks in Networks and 1x1 Convolutions

What does a 1 × 1 convolution do?

The 1 × 1 convolution looks at each of the 36 different positions here, takes the element-wise product between the 32 numbers on the left and the 32 numbers in the filter, sums them, and then applies a ReLU non-linearity.

This idea is often called a 1 x 1 convolution, but it’s sometimes also called Network in Network.

Using 1×1 convolutions

1 x 1 convolution is a way to shrink nC, whereas pooling layers are to shrink nH and nW, the height and width.
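Seen this way, a 1 x 1 convolution is a small fully connected layer applied independently at every position. The sketch below assumes a 6x6x32 input and 8 filters; the name `conv1x1` and the shapes of `W` and `b` are illustrative choices.

```python
import numpy as np

def conv1x1(x, W, b):
    """1x1 convolution followed by ReLU.

    x: (n_h, n_w, n_c_in) input volume.
    W: (n_c_in, n_c_out), one column of n_c_in weights per filter.
    b: (n_c_out,) biases.
    """
    z = x @ W + b               # matmul is applied at each of the n_h * n_w positions
    return np.maximum(z, 0)

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 32))     # 36 positions, 32 channels each
W = rng.standard_normal((32, 8))        # 8 filters shrink n_C from 32 to 8
b = np.zeros(8)
out = conv1x1(x, W, b)                  # height and width are unchanged
```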

Inception Network Motivation

Motivation for inception network

The problem of computational cost

To summarize, if you are building a layer of a neural network and you don’t want to have to decide whether to use a 1 by 1, 3 by 3, or 5 by 5 convolution or a pooling layer, the inception module lets you say: let’s do them all and concatenate the results. Then we run into the problem of computational cost. What you saw here was how, using a 1 by 1 convolution, you can create a bottleneck layer and thereby reduce the computational cost significantly. Now you might be wondering whether shrinking down the representation size so dramatically hurts the performance of your neural network. It turns out that, so long as you implement the bottleneck layer within reason, you can shrink down the representation size significantly without seeming to hurt performance, while saving a lot of computation. These are the key ideas of the inception module.
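The size of the saving can be estimated by counting multiplications. The layer sizes below (a 28x28x192 input, 32 filters of size 5x5, and a 16-channel bottleneck) are illustrative assumptions, since the figure with the original numbers is not reproduced in these notes.

```python
def conv_multiplications(n_h_out, n_w_out, n_filters, f, n_c_in):
    """Each output value needs f * f * n_c_in multiplications."""
    return n_h_out * n_w_out * n_filters * f * f * n_c_in

# Direct 5x5 "same" convolution: 28x28x192 -> 28x28x32.
direct = conv_multiplications(28, 28, n_filters=32, f=5, n_c_in=192)

# Bottleneck: 1x1 convolution down to 16 channels, then the 5x5 convolution.
bottleneck = (conv_multiplications(28, 28, n_filters=16, f=1, n_c_in=192)
              + conv_multiplications(28, 28, n_filters=32, f=5, n_c_in=16))

ratio = direct / bottleneck   # roughly a tenfold reduction under these assumptions
```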

Inception Network

Inception module

Inception network (GoogleNet)

To summarize, if you understand the Inception module, then you understand the Inception network, which is largely the Inception module repeated a bunch of times throughout the network.

Practical advice for using ConvNets

Using Open-Source Implementation

It turns out that a lot of these neural networks are difficult or finicky to replicate, because many details about tuning the hyperparameters, such as learning rate decay, make some difference to the performance.

Therefore, it’s sometimes difficult to replicate someone else’s published work just from reading their paper. Fortunately, a lot of deep learning researchers routinely open source their work on the Internet, such as on GitHub.

One of the advantages of doing so also is that sometimes these networks take a long time to train, and someone else might have used multiple GPUs and a very large dataset to pretrain some of these networks. And that allows you to do transfer learning using these networks.

Transfer Learning

If you’re building a computer vision application, rather than training the weights from scratch, from random initialization, you often make much faster progress if you download weights that someone else has already trained on the network architecture, use them as pre-training, and transfer them to the new task you are interested in.

In practice, because the open data sets on the internet are so big, and the weights you can download that someone else has spent weeks training have learned from so much data, you find that for a lot of computer vision applications you just do much better if you download someone else’s open source weights and use them as initialization for your problem. Among all the different applications of deep learning, computer vision is one where transfer learning is something you should almost always do, unless you have an exceptionally large data set and a very large computation budget to train everything from scratch yourself.

Data Augmentation

Most computer vision tasks could use more data. And so data augmentation is one of the techniques that is often used to improve the performance of computer vision systems.

Common augmentation method

Color shifting

Implementing distortions during training

Similar to other parts of training a deep neural network, the data augmentation process also has a few hyperparameters, such as how much color shifting you implement and exactly what parameters you use for random cropping. So, similar to elsewhere in computer vision, a good place to get started might be to use someone else’s open source implementation of data augmentation. But of course, if you want to capture more invariances than you think someone else’s open source implementation does, it might be reasonable to tune the hyperparameters yourself.
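A rough NumPy sketch of three simple distortions follows: mirroring, random cropping, and color shifting. Mirroring is included as a common choice rather than something these notes list explicitly, and the function names, crop size, and shift values are all hypothetical hyperparameters.

```python
import numpy as np

def mirror(image):
    """Flip an (n_h, n_w, 3) image left to right."""
    return image[:, ::-1, :]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a random (crop_h, crop_w) window out of the image."""
    n_h, n_w, _ = image.shape
    top = rng.integers(0, n_h - crop_h + 1)
    left = rng.integers(0, n_w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w, :]

def color_shift(image, shift):
    """Add per-channel offsets (R, G, B) and clip back to the valid 0..255 range."""
    shifted = image.astype(np.int32) + np.asarray(shift)
    return np.clip(shifted, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)   # stand-in image
augmented = color_shift(random_crop(mirror(image), 24, 24, rng), shift=(20, -20, 20))
```

During training, distortions like these would typically be applied on the fly to each mini-batch rather than stored on disk.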

State of Computer Vision

Deep learning has been successfully applied to computer vision, natural language processing, speech recognition, online advertising, logistics, many, many, many problems. There are a few things that are unique about the application of deep learning to computer vision, about the status of computer vision. In this video, I will share with you some of my observations about deep learning for computer vision and I hope that that will help you better navigate the literature, and the set of ideas out there, and how you build these systems yourself for computer vision.

Data vs. hand-engineering

Tips for doing well on benchmarks/winning competitions

Use open source code

Object detection

Learn how to apply your knowledge of CNNs to one of the toughest but hottest fields of computer vision: Object detection.

Learning Objectives

Understand the challenges of Object Localization, Object Detection and Landmark Finding
Understand and implement non-max suppression
Understand and implement intersection over union
Understand how we label a dataset for an object detection application
Remember the vocabulary of object detection (landmark, anchor, bounding box, grid, …)

Object Localization

Object detection is one of the areas of computer vision that’s just exploding and is working so much better than just a couple of years ago. In order to build up to object detection, you first learn about object localization.

The problem discussed here is classification with localization, which means that not only do you have to label this as, say, a car, but the algorithm is also responsible for putting a bounding box, or drawing a red rectangle, around the position of the car in the image. The term localization refers to figuring out where in the picture the car you’ve detected is.

The above loss function uses squared error throughout just for simplicity. In practice you could use a log-likelihood loss for C1, C2, C3 with a softmax output, squared error or something similar for the bounding box coordinates, and something like the logistic regression loss for Pc. Although even if you use squared error everywhere, it’ll probably work okay.
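Since the loss itself is not reproduced in these notes, here is a sketch of the simplified squared-error version under an assumed target layout y = [Pc, bx, by, bh, bw, C1, C2, C3]. The layout, the function name, and the example values are assumptions for illustration.

```python
import numpy as np

def localization_loss(y_hat, y):
    """Squared-error loss for classification with localization.

    If an object is present (Pc = 1), every component counts.
    If not (Pc = 0), only the Pc prediction counts; the rest are "don't care".
    """
    if y[0] == 1:
        return float(np.sum((y_hat - y) ** 2))
    return float((y_hat[0] - y[0]) ** 2)

# Object present: car (C2) with a bounding box; every output is off by 0.1.
y_obj = np.array([1.0, 0.5, 0.5, 0.2, 0.3, 0.0, 1.0, 0.0])
loss_obj = localization_loss(y_obj + 0.1, y_obj)

# No object: only the first component is compared.
y_bg = np.zeros(8)
y_hat_bg = np.array([0.2, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9])
loss_bg = localization_loss(y_hat_bg, y_bg)
```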

Landmark Detection

Landmarks are the important points in an image that you want the neural network to recognize; the network outputs their X and Y coordinates.

In order to train a network to detect landmarks, you will need a labeled training set. The labels have to be consistent across different images. But if you can hire labelers, or label a big enough data set yourself, then a neural network can output all of these landmarks, which can then be used for other interesting effects such as estimating the pose of a person, recognizing someone’s emotion from a picture, and so on.

Object Detection

Special applications: Face recognition & Neural style transfer

Discover how CNNs can be applied to multiple fields, including art generation and face recognition. Implement your own algorithm to generate art and recognize faces!

Face Recognition

What you should remember:
- Face verification solves an easier 1:1 matching problem; face recognition addresses a harder 1:K matching problem.
- The triplet loss is an effective loss function for training a neural network to learn an encoding of a face image.
- The same encoding can be used for verification and recognition. Measuring distances between two images’ encodings allows you to determine whether they are pictures of the same person.

What is face recognition?

Face verification vs. face recognition

Verification

Input image, name/ID
Output whether the input image is that of the claimed person
1:1 matching problem.

Recognition

Has a database of K persons
Get an input image
Output ID if the image is any of the K persons (or “not recognized”)
1:K matching problem

One Shot Learning

You need to be able to recognize a person even though you may have only one sample of them in your database.

You can’t train a CNN with a softmax output (one class per person) because:

You don’t have enough samples
If a new person joins, you need to retrain the network

Siamese Network

Siamese network is a good way to input two faces and tell you how similar or how different they are.

By using a 128-neuron fully connected layer as its last layer, the model ensures that the output is an encoding vector of size 128. You then use the encodings to compare two face images as follows:

Figure 2:
By computing a distance between two encodings and thresholding, you can determine if the two pictures represent the same person
So, an encoding is a good one if:

The encodings of two images of the same person are quite similar to each other
The encodings of two images of different persons are very different
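The distance-and-threshold check described above can be sketched as follows. This is a minimal illustration, not the course's implementation: the function name, the 4-dimensional toy encodings, and the threshold value of 0.7 are all assumptions, and in practice the threshold would be tuned on a dev set.

```python
import numpy as np

def is_same_person(enc_a, enc_b, threshold=0.7):
    """Return (distance, decision) for two face encodings.

    threshold is a hypothetical value chosen for illustration only.
    """
    distance = float(np.linalg.norm(np.asarray(enc_a) - np.asarray(enc_b)))
    return distance, distance < threshold

# Toy 4-dimensional encodings (the model above would produce 128 entries).
anchor = np.array([0.1, 0.9, 0.2, 0.4])
same = np.array([0.1, 0.8, 0.3, 0.4])
other = np.array([0.9, 0.1, 0.7, 0.0])

print(is_same_person(anchor, same))
print(is_same_person(anchor, other))
```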

Triplet Loss

The triplet loss function formalizes this, and tries to “push” the encodings of two images of the same person (Anchor and Positive) closer together, while “pulling” the encodings of two images of different persons (Anchor, Negative) further apart.

Figure 3:
In the next part, we will call the pictures from left to right: Anchor (A), Positive (P), Negative (N)

For an image x , we denote its encoding f(x) , where f is the function computed by the neural network.

Training will use triplets of images (A,P,N) :

A is an “Anchor” image–a picture of a person.
P is a “Positive” image–a picture of the same person as the Anchor image.
N is a “Negative” image–a picture of a different person than the Anchor image.

These triplets are picked from our training dataset. We will write (A(i),P(i),N(i)) to denote the i -th training example.

You’d like to make sure that an image A(i) of an individual is closer to the Positive P(i) than to the Negative image N(i) by at least a margin α :

$\lVert f(A^{(i)}) - f(P^{(i)}) \rVert_2^2 + \alpha < \lVert f(A^{(i)}) - f(N^{(i)}) \rVert_2^2$

You would thus like to minimize the following “triplet cost”:

$J = \sum_{i=1}^{m} \Big[ \underbrace{\lVert f(A^{(i)}) - f(P^{(i)}) \rVert_2^2}_{(1)} - \underbrace{\lVert f(A^{(i)}) - f(N^{(i)}) \rVert_2^2}_{(2)} + \alpha \Big]_+ \quad (3)$

Here, we are using the notation “ [z]+ ” to denote max(z,0) .

Notes:
- The term (1) is the squared distance between the anchor “A” and the positive “P” for a given triplet; you want this to be small.
- The term (2) is the squared distance between the anchor “A” and the negative “N” for a given triplet, you want this to be relatively large, so it thus makes sense to have a minus sign preceding it.
- α is called the margin. It is a hyperparameter that you should pick manually. We will use α=0.2 .

Most implementations also normalize the encoding vectors to have norm equal to one (i.e., $\lVert f(img) \rVert_2 = 1$).
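As a rough sketch, the triplet cost can be written in NumPy as below. The function names and the array layout (one row per training triplet) are assumptions made for this example; the margin default follows the α = 0.2 mentioned in the notes.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit L2 norm."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Triplet cost summed over m triplets.

    f_a, f_p, f_n: arrays of shape (m, d) holding the encodings of the
    anchor, positive and negative images.
    """
    pos_dist = np.sum((f_a - f_p) ** 2, axis=-1)   # term (1)
    neg_dist = np.sum((f_a - f_n) ** 2, axis=-1)   # term (2)
    return float(np.sum(np.maximum(pos_dist - neg_dist + alpha, 0.0)))
```

A triplet whose negative is already further away than the positive by more than α contributes zero, so only triplets that violate the margin drive the gradient.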

How do we choose triplets to train on?

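If A/P are very similar and A/N are very different, the margin constraint is satisfied trivially and the network learns little from that triplet.
To train a good net, select “hard” triplets, where A/N are pretty similar, so that the two distances in the constraint are close to each other.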

Some big companies have already trained networks on large numbers of photos, so you may just want to reuse their weights.

Face Verification and Binary Classification

The triplet loss is one good way to learn the parameters of a ConvNet for face recognition. Another way is to take the pair of networks in a Siamese network, have them both compute embeddings, and feed these into a logistic regression unit that makes a prediction. The target output is one if both images are of the same person and zero if they are of different persons. This treats face recognition as a binary classification problem.

Rather than just feed in the encoding, the input of the final logistic regression unit will be the differences between the encodings. So, this will be one pretty useful way to learn to predict zero or one whether these are the same person or different persons.
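A minimal sketch of that final unit is shown below. It assumes the "difference" is the element-wise absolute difference of the two encodings, which is one common choice; the function name and the weights w and b are hypothetical and would normally be learned by backpropagation.

```python
import numpy as np

def same_person_probability(f_i, f_j, w, b):
    """Logistic regression unit applied to |f_i - f_j|, element-wise."""
    z = np.dot(w, np.abs(f_i - f_j)) + b
    return 1.0 / (1.0 + np.exp(-z))
```

With negative weights and a positive bias, identical encodings give a probability above 0.5 and distant encodings give a probability near 0.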

One computational trick can help deployment significantly. Instead of computing the embedding of each database image every single time, you can pre-compute it. When a new employee walks in, you use the upper network to compute the encoding of the new image, compare it to the pre-computed encodings, and make a prediction. This means you don’t need to store the raw images, and if you have a very large database of employees you don’t need to recompute every employee’s encoding on each query. Pre-computing these encodings can save significant computation. It works both for the Siamese architecture where face recognition is treated as binary classification and for encodings learned with the triplet loss described above.

To treat face verification as supervised learning, you create a training set of pairs of images, where the target label is one when the pair shows the same person and zero when it shows different persons, and you train the Siamese network on these pairs using backpropagation.

Sequence Models

About the Course
This course will teach you how to build models for natural language, audio, and other sequence data. Thanks to deep learning, sequence algorithms are working far better than just two years ago, and this is enabling numerous exciting applications in speech recognition, music synthesis, chatbots, machine translation, natural language understanding, and many others.

You will:

Understand how to build and train Recurrent Neural Networks (RNNs), and commonly-used variants such as GRUs and LSTMs.
Be able to apply sequence models to natural language problems, including text synthesis.
Be able to apply sequence models to audio applications, including speech recognition and music synthesis.

Recurrent Neural Networks

Recurrent Neural Networks (RNN) are very effective for Natural Language Processing and other sequence tasks because they have “memory”. They can read inputs x⟨t⟩ (such as words) one at a time, and remember some information/context through the hidden layer activations that get passed from one time-step to the next. This allows a uni-directional RNN to take information from the past to process later inputs. A bidirectional RNN can take context from both the past and the future.

Why sequence models

Notation

Superscript [l] denotes an object associated with the lth layer.
- Example: a[4] is the 4th layer activation. W[5] and b[5] are the 5th layer parameters.
Superscript (i) denotes an object associated with the ith example.
- Example: x(i) is the ith training example input.
Superscript ⟨t⟩ denotes an object at the tth time-step.
- Example: x⟨t⟩ is the input x at the tth time-step. x(i)⟨t⟩ is the input at the tth timestep of example i .
Subscript i denotes the ith entry of a vector.
- Example: a[l]i denotes the ith entry of the activations in layer l .

Recurrent Neural Network Model

Uni-directional RNN architecture

With the example of sentence input:

the network reads the sentence from left to right, one word per time-step.
at each time-step, the recurrent neural network passes on its activation to the next time-step for it to use.
the same parameters Wax, Waa, Wya are used at every time-step.
one weakness of this RNN is that it only uses information from earlier in the sequence to make a prediction.
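The points above can be sketched as a single RNN cell applied repeatedly with shared parameters. This is an illustrative NumPy version, not the course's code: it assumes a tanh hidden activation and a softmax output, column-vector inputs, and a parameter dictionary in which the output weight matrix is called Wya.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

def rnn_cell_forward(x_t, a_prev, params):
    """One time-step: new activation and softmax prediction."""
    a_next = np.tanh(params["Waa"] @ a_prev + params["Wax"] @ x_t + params["ba"])
    y_pred = softmax(params["Wya"] @ a_next + params["by"])
    return a_next, y_pred

def rnn_forward(xs, a0, params):
    """Run the cell left to right, reusing the same parameters at every step."""
    a, outputs = a0, []
    for x_t in xs:
        a, y = rnn_cell_forward(x_t, a, params)
        outputs.append(y)
    return a, outputs
```

Because the prediction at step t only sees the activation carried over from steps 1..t, nothing later in the sequence can influence it, which is the weakness noted in the last point.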

RNN Cell

RNN forward pass

Backpropagation through time

Backpropagation requires doing computations, or passing messages, in the opposite direction to forward propagation.

RNN cell’s backward pass: just like in a fully-connected neural network, the derivative of the cost function J backpropagates through the RNN by following the chain rule from calculus. The chain rule is also used to calculate (∂J/∂Wax, ∂J/∂Waa, ∂J/∂ba) to update the parameters (Wax, Waa, ba) .

Different types of RNNs

Language model and sequence generation

A language model estimates the probability of a particular sequence of words.

Build language model:
1. Tokenize sentence
2. Index tokenized words
3. Build RNN to model the probability of different sequences.

Terminology:

EOS: end of sentence, a token that marks where a sentence ends.
UNK: unknown word, a token that stands for any word not in the vocabulary.

set the inputs x<t>=y<t−1>
make softmax prediction to predict the probability of any word in the dictionary.
compute the chance of y<t> given y<1>,y<2>,...,y<t−1>

Sampling novel sequences

Generate randomly chosen sentence from RNN language model:

sample from predicted softmax distribution to generate novel sequences of words.
- Use the probabilities output by the RNN to randomly sample a chosen word for that time-step as y^<t>
  E.g. use the numpy command np.random.choice to sample the first word according to distribution defined by output vector probabilities.
- Then pass this selected word to the next time-step.
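The sampling loop can be sketched as below. The function `next_word_probs` is a hypothetical stand-in for one forward step of a trained RNN, and the four-word vocabulary with a deterministic toy model is made up for the example. The sketch uses NumPy's `Generator.choice`, the newer equivalent of `np.random.choice`.

```python
import numpy as np

def sample_sequence(next_word_probs, eos_index, max_len=20, seed=0):
    """Sample word indices one step at a time until EOS or max_len."""
    rng = np.random.default_rng(seed)
    sequence = []
    prev = None  # stands for the all-zero first input x<1>
    for _ in range(max_len):
        probs = next_word_probs(prev)              # softmax output for this step
        idx = int(rng.choice(len(probs), p=probs)) # sample y^<t>
        sequence.append(idx)
        if idx == eos_index:
            break
        prev = idx                                 # x<t+1> = y^<t>
    return sequence

# Toy stand-in for the RNN: each word is followed by exactly one word.
vocab = ["the", "cat", "sat", "<EOS>"]

def toy_model(prev):
    table = {None: [1, 0, 0, 0], 0: [0, 1, 0, 0], 1: [0, 0, 1, 0], 2: [0, 0, 0, 1]}
    return np.array(table[prev], dtype=float)

print([vocab[i] for i in sample_sequence(toy_model, eos_index=3)])
```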

Depending on the application, a character-level RNN can also be built. In that case, the vocabulary is the set of characters (letters, digits, punctuation).
Pros:

no need to worry about unknown word tokens.

Cons:

end up with much longer sequences
character-level language models are not as good as word-level language models at capturing long-range dependencies, i.e. how the earlier parts of a sentence affect the later parts.
more computationally expensive to train.

Vanishing gradients with RNNs

The vanishing gradient problem is that, in some cases, the gradient becomes vanishingly small, effectively preventing the weights from changing their values. In the worst case, this may completely stop the neural network from training further.

The basic RNN works well enough for some applications, but it suffers from vanishing gradients. So it works best when each output y⟨t⟩ can be estimated using mainly “local” context (meaning information from inputs x⟨t′⟩ where t′ is not too far from t ).

Exploding gradients can be addressed with gradient clipping, e.g. every element of the gradient vector is clipped to lie in some range [-N, N], but vanishing gradients take more work to address.
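Element-wise gradient clipping is short enough to show directly. In this sketch the gradients are assumed to live in a dictionary keyed by parameter name; the names used in any call are hypothetical.

```python
import numpy as np

def clip_gradients(gradients, max_value):
    """Clip every element of every gradient array to [-max_value, max_value]."""
    return {name: np.clip(g, -max_value, max_value) for name, g in gradients.items()}
```

Clipping bounds the size of each update step but leaves small gradients untouched, which is why it helps with exploding gradients and does nothing for vanishing ones.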

Gated Recurrent Unit (GRU)

The Gated Recurrent Unit is a modification to the RNN hidden layer that makes it much better at capturing long-range connections and helps a lot with the vanishing gradient problem.

The cell ( c ), or memory cell, provides a bit of memory.
E.g. whether “cat” was singular or plural, so that much further into the sentence the network can still take into account whether the subject was singular or plural. At time t the memory cell has some value c<t> .
- For the GRU, c<t> is equal to the output activation a<t>
- At each time-step, consider overwriting the memory cell with a value c˜<t> , which is a candidate for replacing c<t> .
The gate ( Γu ) decides when to update the cell value.
The GRU helps significantly with the vanishing gradient problem, and therefore allows a neural network to learn very long-range dependencies.
E.g. “cat” and “was” stay related even if they’re separated by a lot of words in the middle.
- When Γu is close to 0, c<t> essentially equals c<t−1> , so the value of the cell is maintained almost exactly across many time-steps.
c<t> can be a vector.
c˜<t> and Γu then have the same dimension.
- The element-wise multiplication tells the GRU which dimensions of the memory cell vector to update at each time-step.
  It can keep some bits constant while updating other bits.

The full GRU adds one more gate, Γr , which stands for relevance.
Γr tells how relevant c<t−1> is to computing the next candidate for c<t> .
Γr is computed with a new parameter matrix Wr , a new bias br , and the same input x<t> .
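One step of the full GRU, as described above, might look like the following in NumPy. The parameter dictionary keys (Wu, Wr, Wc and the matching biases) and the column-vector layout are assumptions for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, c_prev, p):
    """One full-GRU step; returns c<t>, which is also the activation a<t>."""
    concat = np.concatenate([c_prev, x_t], axis=0)
    gamma_u = sigmoid(p["Wu"] @ concat + p["bu"])            # update gate
    gamma_r = sigmoid(p["Wr"] @ concat + p["br"])            # relevance gate
    gated = np.concatenate([gamma_r * c_prev, x_t], axis=0)
    c_tilde = np.tanh(p["Wc"] @ gated + p["bc"])             # candidate value
    # Element-wise mix: where gamma_u is near 0 the old memory is kept.
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
```

When the update gate saturates near 0, the step returns c<t−1> almost unchanged, which is how the memory survives many time-steps.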

Long Short Term Memory (LSTM)

About the gates

Forget gate
For the sake of this illustration, let’s assume we are reading words in a piece of text, and want to use an LSTM to keep track of grammatical structures, such as whether the subject is singular or plural. If the subject changes from a singular word to a plural word, we need to find a way to get rid of our previously stored memory value of the singular/plural state. In an LSTM, the forget gate lets us do this:
$\Gamma_f^{\langle t \rangle} = \sigma(W_f[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f) \quad (1)$
Here, Wf are weights that govern the forget gate’s behavior. We concatenate [a⟨t−1⟩,x⟨t⟩] and multiply by Wf . The equation above results in a vector Γ⟨t⟩f with values between 0 and 1. This forget gate vector will be multiplied element-wise by the previous cell state c⟨t−1⟩ . So if one of the values of Γ⟨t⟩f is 0 (or close to 0) then it means that the LSTM should remove that piece of information (e.g. the singular subject) in the corresponding component of c⟨t−1⟩ . If one of the values is 1, then it will keep the information.
Update gate
Once we forget that the subject being discussed is singular, we need to find a way to update it to reflect that the new subject is now plural. Here is the formula for the update gate:

$\Gamma_u^{\langle t \rangle} = \sigma(W_u[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u) \quad (2)$
Similar to the forget gate, here Γ⟨t⟩u is again a vector of values between 0 and 1. This will be multiplied element-wise with c~⟨t⟩ , in order to compute c⟨t⟩ .
Updating the cell
To update the new subject we need to create a new vector of numbers that we can add to our previous cell state. The equation we use is:

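$\tilde{c}^{\langle t \rangle} = \tanh(W_c[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c) \quad (3)$
Finally, the new cell state is:

$c^{\langle t \rangle} = \Gamma_f^{\langle t \rangle} * c^{\langle t-1 \rangle} + \Gamma_u^{\langle t \rangle} * \tilde{c}^{\langle t \rangle} \quad (4)$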

Output gate
To decide which outputs we will use, we will use the following two formulas:
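$\Gamma_o^{\langle t \rangle} = \sigma(W_o[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o) \quad (5)$
$a^{\langle t \rangle} = \Gamma_o^{\langle t \rangle} * \tanh(c^{\langle t \rangle}) \quad (6)$
In equation 5 you decide what to output using a sigmoid function, and in equation 6 you multiply that by the tanh of the new cell state c⟨t⟩ computed in equation 4.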
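Equations (1) to (6) translate almost line for line into a single LSTM step. The sketch below assumes column vectors and a parameter dictionary with keys Wf, Wu, Wc, Wo and matching biases; those names are chosen here to mirror the symbols in the equations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, a_prev, c_prev, p):
    """One LSTM step following equations (1)-(6)."""
    concat = np.concatenate([a_prev, x_t], axis=0)    # [a<t-1>, x<t>]
    gamma_f = sigmoid(p["Wf"] @ concat + p["bf"])     # (1) forget gate
    gamma_u = sigmoid(p["Wu"] @ concat + p["bu"])     # (2) update gate
    c_tilde = np.tanh(p["Wc"] @ concat + p["bc"])     # (3) candidate
    c_next = gamma_f * c_prev + gamma_u * c_tilde     # (4) new cell state
    gamma_o = sigmoid(p["Wo"] @ concat + p["bo"])     # (5) output gate
    a_next = gamma_o * np.tanh(c_next)                # (6) new activation
    return a_next, c_next
```

Unlike the GRU, the forget and update gates are separate, so the cell can keep the old value and add the candidate at the same time.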

Bidirectional RNN

Bidirectional RNNs can take information from both earlier and later in the sequence.

Forward propagation runs in two directions, left to right and right to left:

The bidirectional RNN is a modification that can be applied to the basic RNN architecture, the GRU, or the LSTM.
- This change makes it possible to make predictions anywhere, even in the middle of a sequence, by taking into account information potentially from the entire sequence.

Cons:
- Need the entire sequence of data before making predictions anywhere.

Deep RNNs

Recurrent layers are stacked on top of each other
Deep RNNs are computationally expensive to train.
For RNNs, three layers is already quite a lot.
Because of the temporal dimension, these networks can already get quite big even if you have just a small handful of layers.

Natural Language Processing & Word Embeddings

Natural language processing with deep learning is an important combination. Using word vector representations and embedding layers you can train recurrent neural networks with outstanding performances in a wide variety of industries. Examples of applications are sentiment analysis, named entity recognition and machine translation.

What you should remember:
- If you have an NLP task where the training set is small, using word embeddings can help your algorithm significantly. Word embeddings allow your model to work on words in the test set that may not even have appeared in your training set.
- Training sequence models in Keras (and in most other deep learning frameworks) requires a few important details:
- To use mini-batches, the sequences need to be padded so that all the examples in a mini-batch have the same length.
- An Embedding() layer can be initialized with pretrained values. These values can be either fixed or trained further on your dataset. If however your labeled dataset is small, it’s usually not worth trying to train a large pre-trained set of embeddings.
- LSTM() has a flag called return_sequences that decides whether to return every hidden state or only the last one.
- You can use Dropout() right after LSTM() to regularize your network.
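To make the padding requirement concrete, here is a framework-free sketch. It is not Keras's own utility: the function below is written for this example, pads at the end of each sequence, and uses 0 as the padding index, all of which are assumptions.

```python
def pad_sequences(sequences, pad_value=0, max_len=None):
    """Pad (or truncate) lists of word indices to a common length."""
    if max_len is None:
        max_len = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        s = list(s)[:max_len]                        # truncate if too long
        padded.append(s + [pad_value] * (max_len - len(s)))
    return padded

batch = pad_sequences([[1, 2, 3], [4], [5, 6]])
print(batch)
```

After padding, the mini-batch can be stored as one rectangular array, which is what an embedding layer followed by an LSTM expects as input.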

Introduction to Word Embeddings

Word embeddings are a way of representing words that lets your algorithms automatically understand analogies such as “man is to woman as king is to queen”, and many other examples. With word embeddings you will be able to build NLP applications even with relatively small labeled training sets.

Word Representation

One of the weaknesses of the one-hot representation is that it treats each word as a thing unto itself, so it doesn’t allow an algorithm to generalize easily across words.

So, instead of a one-hot representation, we can learn a featurized representation: for each word, a set of features and their values.

To get word embeddings, we learn high-dimensional (but dense) feature vectors like these, which generalize much better than one-hot vectors for representing different words.

One common algorithm for visualizing high-dimensional data is t-SNE. Plotting the embeddings this way, you can see similar words grouped together.

Using word embeddings

Embedding matrix

Learning Word Embeddings: Word2vec & GloVe

Learning word embeddings

An early but successful algorithm:

Example: predict the next word in the sequence
Terms:

o: one-hot encoding vector
E: embedding matrix (the parameters to learn)
e: embedding vector

Steps:

look up the embedding of each context word: E⋅o=e
feed the embedding vectors into a neural network.
feed the output of this network to a softmax to predict the target word

The parameters of this model include the matrix E , and the same matrix E is used for all the words.
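The product E⋅o simply selects one column of E, which a short NumPy check makes visible. The vocabulary size, embedding dimension and random matrix below are toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 10, 4          # toy sizes, chosen for illustration
E = rng.normal(size=(emb_dim, vocab_size))

def one_hot(index, size):
    o = np.zeros(size)
    o[index] = 1.0
    return o

word_index = 7
e_matmul = E @ one_hot(word_index, vocab_size)   # E . o = e
e_lookup = E[:, word_index]                      # direct column lookup
print(np.allclose(e_matmul, e_lookup))
```

Because the two results agree, implementations usually skip the matrix product and index into E directly, which avoids multiplying by a vector that is almost entirely zeros.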

Use backpropagation to perform gradient descent, maximizing the likelihood of the training set: repeatedly predict, given four words in a sequence, what the next word in the text corpus is.

Summary:
The language modeling problem poses a machine learning problem in which you input the context (like the last four words) and predict some target word. Posing the problem this way allows you to learn good word embeddings.

Word2Vec

Word2Vec algorithm is a simpler and computationally more efficient way to learn word embeddings.

In the skip-gram model, we come up with a few context-to-target pairs to create a supervised learning problem: randomly pick a word to be the context word, then randomly pick another word within some window of it (e.g. plus or minus five or ten words) to be the target word.

The supervised learning problem is then: given the context word, predict a randomly chosen word within, say, a plus or minus ten word window of it.

This is called the skip-gram model because it takes as input one word, like orange, and then tries to predict some word skipping a few words to the left or to the right, i.e. what comes a little before or a little after the context word.
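Generating one context/target pair can be sketched with the standard library alone. The function name and the example sentence are assumptions for illustration, and the context word is drawn uniformly here, which is the simplest possible choice rather than a recommended one.

```python
import random

def sample_skipgram_pair(tokens, window=5, rng=random):
    """Pick a context word, then a target within +/- window positions of it."""
    c = rng.randrange(len(tokens))
    lo, hi = max(0, c - window), min(len(tokens) - 1, c + window)
    t = rng.choice([i for i in range(lo, hi + 1) if i != c])
    return tokens[c], tokens[t]

sentence = "I want a glass of orange juice to go along with my cereal".split()
print(sample_skipgram_pair(sentence, window=5, rng=random.Random(0)))
```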

Cons:

Computational speed: the softmax has to sum over the whole vocabulary for every prediction.
A hierarchical softmax classifier, i.e. a binary tree of classifiers, can be used to mitigate this.

To sample context, in practice the distribution of words isn’t taken just entirely uniformly at random for the training set purpose, but instead there are different heuristics that you could use in order to balance out something from the common words together with the less common words.

Negative Sampling

Negative sampling lets you do something similar to the skip-gram model, but with a much more efficient learning algorithm.

Generate dataset:
Pick a context word and then pick a target word; that gives a positive example. Then, k times, take the same context word and pick a random word from the dictionary; those are the negative examples.

Then we’re going to create a supervised learning problem where the learning algorithm inputs x which is a pair of words, and predict the target label.

So the problem is really given a pair of words like orange and juice, do you think they appear together?

Instead of having one giant softmax over, say, a 10,000-word vocabulary, which is very expensive to compute, we turn it into 10,000 binary classification problems, each of which is quite cheap to compute.

On every iteration, we only train k + 1 of them: k negative examples and one positive example. This is why the computation cost of this algorithm is much lower: you update k + 1 binary classifiers on every iteration, rather than a giant softmax classifier.

To choose the negative words, an empirically good heuristic is to sample in proportion to the observed word frequency raised to the power of three-fourths. This is somewhere in between the extreme of sampling from a uniform distribution and the other extreme of sampling from whatever distribution was observed in your training set.
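The sampling heuristic and the k + 1 examples per positive pair can be sketched as follows. The function names, the toy word counts and the default k = 4 are assumptions for this example; a sampled negative may occasionally coincide with the true target, which this sketch does not filter out.

```python
import numpy as np

def negative_sampling_distribution(counts, power=0.75):
    """Word probabilities proportional to count ** (3/4)."""
    weights = np.asarray(counts, dtype=float) ** power
    return weights / weights.sum()

def make_examples(context, target, vocab_size, probs, k=4, seed=0):
    """One positive (label 1) plus k sampled negatives (label 0)."""
    rng = np.random.default_rng(seed)
    examples = [(context, target, 1)]
    negatives = rng.choice(vocab_size, size=k, p=probs)
    examples += [(context, int(w), 0) for w in negatives]
    return examples

probs = negative_sampling_distribution([100, 10, 1])
print(probs)
```

For the toy counts above, the most frequent word gets less probability than its observed frequency but more than a uniform share, and the rarest word gets the reverse.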

GloVe word vectors

GloVe stands for global vectors for word representation.

Xij is a count of how often words i and j appear with each other, or close to each other.

It measures how related words i and j are by how often they occur together.

Solve for parameters θ and e using gradient descent to minimize the weighted sum of squared differences over all word pairs, so that the learned vectors’ inner product is a good predictor of (the log of) how often the two words occur together.

Functions of the weighting term f :

f is 0 when Xij=0 , so pairs that never co-occur are skipped and log 0 (negative infinity) never enters the sum.
f is chosen heuristically so that it neither gives stopwords too much weight nor gives infrequent words too little weight.
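A sketch of a GloVe-style objective with such a weighting term is given below. Several details are assumptions, since the notes do not spell them out: the cost is taken to be the sum over i, j of f(Xij)·(θi·ej + bi + b′j − log Xij)², and the particular f used here (a capped power law with x_max = 100 and exponent 0.75) is just one heuristic choice that is 0 at Xij = 0.

```python
import numpy as np

def weighting(x, x_max=100.0, power=0.75):
    """Heuristic f: 0 at x = 0, grows sub-linearly, capped at 1 (an assumption)."""
    return np.where(x < x_max, (x / x_max) ** power, 1.0)

def glove_cost(X, theta, e, b_theta, b_e):
    """Weighted squared error between theta_i . e_j (+ biases) and log X_ij."""
    f = weighting(X)
    # Where X is 0 the log is replaced by log(1) = 0; f is 0 there anyway.
    log_x = np.log(np.where(X > 0, X, 1.0))
    pred = theta @ e.T + b_theta[:, None] + b_e[None, :]
    return float(np.sum(f * (pred - log_x) ** 2))
```

Here X is a (V, V) co-occurrence matrix and theta and e are (V, d) arrays with one row per word, so `theta @ e.T` holds every inner product θi·ej at once.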

In the GloVe algorithm, the roles of θ and e are completely symmetric: swapping them gives the same optimization objective. One way to train the algorithm is to initialize θ and e both uniformly at random, run gradient descent to minimize the objective, and then, for every word, take the average of its θ and e vectors as the final embedding.

The individual components of the embeddings cannot be guaranteed interpretable.
In particular, the first feature might be a combination of gender, and royal, and age, and food, and cost, and size, is it a noun or an action verb, and all the other features. It’s very difficult to look at individual components, individual rows of the embedding matrix and assign the human interpretation to that.

Applications using Word Embeddings

Sentiment Classification

Sentiment classification is the task of looking at a piece of text and telling if someone likes or dislikes the thing they’re talking about. It is one of the most important building blocks in NLP and is used in many applications.

One of the challenges of sentiment classification is that you might not have a huge labeled training set for it. But with word embeddings, you’re able to build good sentiment classifiers even with only modest-size labeled training sets.

Cons of a simple model that averages the word embeddings:

it ignores word order

Debiasing word embeddings

Reducing or eliminating bias in learning algorithms is a very important problem because these algorithms are being asked to help with or to make more and more important decisions in society.

Sequence models & Attention mechanism

Sequence models can be augmented using an attention mechanism. This algorithm will help your model understand where it should focus its attention given a sequence of inputs.

Various sequence to sequence architectures

Here’s what you should remember from this notebook:

Machine translation models can be used to map from one sequence to another. They are useful not just for translating human languages (like French->English) but also for tasks like date format translation.
An attention mechanism allows a network to focus on the most relevant parts of the input when producing a specific part of the output.
A network using an attention mechanism can translate from inputs of length Tx to outputs of length Ty , where Tx and Ty can be different.
You can visualize attention weights α⟨t,t′⟩ to see what the network is paying attention to while generating each output.

Basic Models

E.g. translate French to English:

An encoder network finds an encoding of the input French sentence, and a decoder network then generates the corresponding English translation.

The encoder network is built as an RNN, e.g. a GRU or LSTM, and is fed the input French words one at a time. After ingesting the input sequence, the RNN outputs a vector that represents the input sentence.
The decoder network takes as input the encoding output by the encoder network, and can be trained to output the translation one word at a time until it eventually outputs the end-of-sequence (end-of-sentence) token, at which point the decoder stops.

The same idea works for image captioning. If you get rid of the final softmax unit, a pre-trained AlexNet gives a 4096-dimensional feature vector that represents, say, a picture of a cat. This pre-trained network is the encoder for the image. You can then feed the 4096-dimensional vector to an RNN, whose job is to generate the caption one word at a time.

One key difference between generating a sequence this way and sampling from a language model is that you don’t want a randomly chosen translation or caption; you want the best, most likely one.

Picking the most likely sentence

A machine translation model is very similar to a language model, except that instead of always starting from a vector of all zeros, it has an encoder network that computes a representation of the input sentence, and the decoder network starts from that representation.
That is why it is called a conditional language model: instead of modeling the probability of any sentence, it models the probability of the output English translation conditioned on some input French sentence.

So, when you use this model for machine translation, you’re not trying to sample at random from the distribution. Instead, you would like to find the English sentence, y , that maximizes that conditional probability.

Greedy search isn’t good enough, since picking the single most likely word at each step does not always give the most likely sentence.
An approximate search algorithm tries to pick the sentence, y , that maximizes the conditional probability. Even though it is not guaranteed to find the maximizing y , it usually does a good enough job.

Summary
One major difference between machine translation and the language modeling problem is that, rather than generating a sentence at random, you want to find the most likely English translation. But the set of all English sentences of a certain length is too large to enumerate exhaustively, so we have to resort to a search algorithm.

Beam Search

Beam search is the most widely used algorithm for tasks like outputting the best, most likely English translation.

Whereas greedy search picks only the single most likely word and moves on, beam search considers multiple alternatives.

Steps:
1. Run the input French sentence through the encoder network, then take the first decoder step, which gives a softmax output over all (say) 10,000 possible first words. Keep in memory the top B of them, where B is the beam width.

2. For each of the B kept first words, evaluate all possible second words, and keep the top B two-word prefixes overall.

3. Repeat the previous step until the end-of-sentence token is produced.

Instead of multiplying probabilities, sum their logs to get a more numerically stable algorithm that is less prone to rounding errors.

The original objective has an undesirable effect: it tends to prefer unnaturally short outputs, because the probability of a sentence is a product of numbers less than 1, and a short sentence multiplies fewer of them. Use a length-normalized log probability objective to reduce this effect.
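The steps above, together with log probabilities and length normalization, can be sketched in plain Python. Everything named here is an assumption for illustration: `next_log_probs` stands in for one decoder step of a trained model, the tiny toy model is made up so that greedy search and beam search disagree, and the normalization exponent alpha = 0.7 is just a commonly used heuristic value.

```python
import math

def beam_search(next_log_probs, eos, beam_width=3, max_len=10, alpha=0.7):
    """Return (sequence, summed log probability) of the best hypothesis found."""
    beams = [((), 0.0)]        # (prefix, sum of log probabilities)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:   # keep only the top B
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)     # hypotheses cut off at max_len
    # Length-normalized log probability: score / (length ** alpha)
    return max(finished, key=lambda c: c[1] / len(c[0]) ** alpha)

def toy_model(prefix):
    """Made-up conditional distribution over the next token."""
    if len(prefix) == 0:
        probs = {"a": 0.6, "b": 0.4}
    elif len(prefix) == 1:
        probs = {"c": 0.5, "d": 0.5} if prefix[0] == "a" else {"c": 0.9, "d": 0.1}
    else:
        probs = {"</s>": 1.0}
    return {tok: math.log(p) for tok, p in probs.items()}

print(beam_search(toy_model, "</s>", beam_width=1))
print(beam_search(toy_model, "</s>", beam_width=2))
```

With a beam width of 1 this reduces to greedy search, which commits to "a" and ends with total probability 0.3; with a width of 2 the search keeps "b" alive and finds the sequence with probability 0.36.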

Error analysis in beam search

Error analysis can help focus time on doing the most useful work for project. Now, beam search is an approximate search algorithm, also called a heuristic search algorithm. And so it doesn’t always output the most likely sentence. It’s only keeping track of B equals 3 or 10 or 100 top possibilities. In this module, you’ll learn how error analysis interacts with beam search and how you can figure out whether it is the beam search algorithm that’s causing problems and worth spending time on. Or whether it might be your RNN model that is causing problems and worth spending time on.

To decide whether a problem lies in the search algorithm or in the RNN model, use your RNN model to compute P(y∗|x) for the human translation y∗ and P(y^|x) for the beam search output y^ , and see which of the two is bigger.
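The comparison leads to a simple attribution rule: if the model assigns the human translation a higher probability than the beam search output, the search failed to find it; otherwise the model itself prefers the worse translation. A small sketch, with hypothetical function names:

```python
def attribute_error(p_star, p_hat):
    """p_star = P(y*|x) for the human translation,
    p_hat = P(y^|x) for the beam search output."""
    return "beam search" if p_star > p_hat else "RNN model"

def error_fractions(dev_examples):
    """Fraction of dev-set mistakes attributed to each component."""
    counts = {"beam search": 0, "RNN model": 0}
    for p_star, p_hat in dev_examples:
        counts[attribute_error(p_star, p_hat)] += 1
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}
```

Each element of `dev_examples` is a pair (P(y∗|x), P(y^|x)) for one dev-set sentence where the system's output was clearly worse than the human translation.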

During error analysis process. You go through the development set and find the mistakes that the algorithm made in the development set. Through this process, you can then carry out error analysis to figure out what fraction of errors are due to beam search versus the RNN model. And with an error analysis process, for every example in your dev sets, where the algorithm gives a much worse output than the human translation, you can try to ascribe the error to either the search algorithm or to the objective function, or to the RNN model that generates the objective function that beam search is supposed to be maximizing. Through this, you can try to figure out which of these two components is responsible for more errors.

If you find that beam search is responsible for a lot of errors, then maybe it’s worth working hard to increase the beam width.

If you find that the RNN model is at fault, then you could do a deeper layer of analysis to try to figure out if you want to add regularization, or get more training data, or try a different network architecture, or something else.

Bleu Score (optional)

One of the challenges of machine translation is that, given a French sentence, there could be multiple English translations that are equally good translations of that French sentence. The Bleu score is an evaluation metric designed to cope with this.

BLEU, or the Bilingual Evaluation Understudy, is a score for comparing a candidate translation of text to one or more reference translations.

Although developed for translation, it can be used to evaluate text generated for a suite of natural language processing tasks.

The clipped count of a word (or n-gram) is its count in the candidate translation, capped at the maximum number of times it appears in any single reference translation.
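Clipped counts give the modified n-gram precision that Bleu is built from. The sketch below covers only that precision, not the full Bleu score with its brevity penalty, and the function names are chosen for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n=1):
    """Sum of clipped n-gram counts divided by the number of candidate n-grams."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(modified_precision("the the the the the the the".split(), refs, n=1))
```

The degenerate candidate made of seven copies of "the" scores 2/7 rather than 7/7, because "the" appears at most twice in any single reference.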

Attention Model Intuition

It’s difficult for encoder-decoder network to memorize a super long sentence. If you had to translate a book’s paragraph from French to English, you would not read the whole paragraph, then close the book and translate. Even during the translation process, you would read/re-read and focus on the parts of the French paragraph corresponding to the parts of the English you are writing down.

The attention mechanism tells a Neural Machine Translation model where it should pay attention to at any step.

Attention was originally developed for machine translation, but it has spread to many other application areas as well.

The attention model computes a set of attention weights.

Attention Model

The diagram on the left shows the attention model. The diagram on the right shows what one “Attention” step does to calculate the attention variables α⟨t,t′⟩ , which are used to compute the context variable context⟨t⟩ for each timestep in the output ( t=1,…,Ty ).

Figure 1: Neural machine translation with attention

There are two separate LSTMs in this model (see diagram on the left). Because the one at the bottom of the picture is a Bi-directional LSTM and comes before the attention mechanism, we will call it pre-attention Bi-LSTM. The LSTM at the top of the diagram comes after the attention mechanism, so we will call it the post-attention LSTM. The pre-attention Bi-LSTM goes through Tx time steps; the post-attention LSTM goes through Ty time steps.
The post-attention LSTM passes s⟨t⟩,c⟨t⟩ from one time step to the next. In the lecture videos, we were using only a basic RNN for the post-attention sequence model, so the state was captured by the RNN output activations s⟨t⟩ . But since we are using an LSTM here, the LSTM has both the output activation s⟨t⟩ and the hidden cell state c⟨t⟩ . However, unlike previous text generation examples (such as Dinosaurus in week 1), in this model the post-attention LSTM at time t will not take the specific generated y⟨t−1⟩ as input; it only takes s⟨t⟩ and c⟨t⟩ as input. We have designed the model this way, because (unlike language generation where adjacent characters are highly correlated) there isn’t as strong a dependency between the previous character and the next character in a YYYY-MM-DD date.
We use a⟨t⟩=[a→⟨t⟩;a←⟨t⟩] to represent the concatenation of the activations of both the forward and backward directions of the pre-attention Bi-LSTM.
The diagram on the right uses a RepeatVector node to copy s⟨t−1⟩ ’s value Tx times, and then Concatenation to concatenate s⟨t−1⟩ and a⟨t⟩ to compute e⟨t,t′⟩ , which is then passed through a softmax to compute α⟨t,t′⟩ .
a⟨t⟩ is a vector and α⟨t,t′⟩ is a scalar, so the attention layer ends up with a context⟨t⟩ of dimension 1×n for each output time step t=1,…,Ty .
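
The single attention step described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the course's Keras implementation: `attention_step`, `softmax` and `toy_score` are hypothetical names, and `toy_score` stands in for the small dense network that would normally map the concatenation of s⟨t−1⟩ and an encoder activation to an energy e⟨t,t′⟩.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_score(concatenated):
    """Hypothetical stand-in for the learned dense layers that produce e<t,t'>."""
    return sum(concatenated)


def attention_step(s_prev, a, score_fn=toy_score):
    """One attention step.

    s_prev: previous post-attention state s<t-1> (list of floats)
    a:      list of Tx encoder activation vectors a<t'>
    Returns the attention weights alpha<t,t'> and the context<t> vector.
    """
    # Pair s_prev with every a<t'> (the RepeatVector + Concatenate step),
    # then score each pair to get one energy per input position.
    energies = [score_fn(s_prev + a_t) for a_t in a]
    alphas = softmax(energies)  # one scalar weight per input position

    # context<t> is the alpha-weighted sum of the encoder activations.
    n = len(a[0])
    context = [sum(alphas[t] * a[t][i] for t in range(len(a)))
               for i in range(n)]
    return alphas, context


alphas, context = attention_step(s_prev=[0.0], a=[[1.0, 0.0], [0.0, 1.0]])
print(alphas, context)
```

The weights always sum to 1 over the Tx input positions, and the context vector has the same length as each a⟨t′⟩.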

Speech recognition - Audio data

Speech recognition

From audio recordings to spectrograms

What really is an audio recording? A microphone records little variations in air pressure over time, and it is these little variations in air pressure that your ear also perceives as sound. You can think of an audio recording as a long list of numbers measuring the little air pressure changes detected by the microphone. We will use audio sampled at 44100 Hz (or 44100 Hertz). This means the microphone gives us 44100 numbers per second. Thus, a 10 second audio clip is represented by 441000 numbers (= 10×44100).

It is quite difficult to figure out from this “raw” representation of audio whether the word “activate” was said. In order to help your sequence model more easily learn to detect trigger words, we will compute a spectrogram of the audio. The spectrogram tells us how strongly different frequencies are present in an audio clip at each moment in time.

(If you’ve ever taken an advanced class on signal processing or on Fourier transforms: a spectrogram is computed by sliding a window over the raw audio signal and calculating the most active frequencies in each window using a Fourier transform. If you don’t understand the previous sentence, don’t worry about it.)
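
For readers who want to see the sliding-window idea concretely, here is a minimal pure-Python sketch. It uses a naive discrete Fourier transform for clarity; the function name `spectrogram` and the parameters `window_size` and `hop` are assumptions made for this illustration, and practical implementations would typically use an FFT routine and a tapered window instead.

```python
import cmath
import math


def spectrogram(samples, window_size, hop):
    """Magnitude spectrogram via a sliding window and a naive DFT.

    Returns one list of frequency-bin magnitudes per window position.
    """
    frames = []
    for start in range(0, len(samples) - window_size + 1, hop):
        window = samples[start:start + window_size]
        magnitudes = []
        # For real-valued input only bins 0..window_size/2 carry new information.
        for k in range(window_size // 2 + 1):
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / window_size)
                        for n, x in enumerate(window))
            magnitudes.append(abs(coeff))
        frames.append(magnitudes)
    return frames


# A pure tone that completes exactly 4 cycles in every 32-sample window.
tone = [math.sin(2 * math.pi * 4 * n / 32) for n in range(128)]
frames = spectrogram(tone, window_size=32, hop=16)

# The strongest bin in each frame identifies the dominant frequency.
print([max(range(len(f)), key=lambda k: f[k]) for f in frames])
```

Each row of the result corresponds to one moment in time and each column to one frequency bin, which is the representation a trigger word model consumes instead of the raw samples.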

Trigger Word Detection

Trigger word detection is the technology that allows devices like Amazon Alexa, Google Home, Apple Siri, and Baidu DuerOS to wake up upon hearing a certain word.