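A neural network's computation is organized into a forward propagation step, which computes the network's output, and a back propagation step, which computes the gradients (derivatives). The computation graph explains why it is organized this way.
A computation graph is a convenient way to show the layer-by-layer computation of a neural network.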
Learn to build a neural network with one hidden layer, using forward propagation and backpropagation.
Learning Objectives
Pros and cons of activation functions
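One drawback of both the sigmoid and tanh functions is that when z is very large or very small, the gradient (slope) of the function becomes very small, approaching zero, which slows down gradient descent.
ReLU is currently the most widely used choice, although tanh is still used sometimes. One drawback of ReLU is that its derivative is 0 when z is negative, but in practice this is not a problem. An advantage shared by ReLU and Leaky ReLU is that over much of the range of z, the derivative (slope) of the activation function stays far from 0. As a result, a network using ReLU usually learns faster than one using tanh or sigmoid, mainly because the slope approaches 0, which slows learning, less often. ReLU does have a slope of 0 over half of the range of z, but in practice most hidden units have z greater than 0, so learning can still be fast.
Pros and cons of different activation functions:
If you use a linear activation function, also called the identity activation function, the output of the network is just a linear function of the input.
If you use a linear activation function, or equivalently no activation function, then no matter how many layers the network has, all it computes is a linear function of the input, so you might as well have no hidden layers at all. A linear hidden layer is useless because the composition of two linear functions is itself linear; unless you introduce a non-linearity, the network cannot compute more interesting functions however many hidden layers it has. The one place where a linear activation, g(z) = z, is used is the output layer when solving a regression problem.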
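As a quick numerical check of the claim that stacked linear layers collapse into one, here is a minimal NumPy sketch. The layer sizes and random weights are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with a linear (identity) activation, g(z) = z.
# Shapes are hypothetical: 3 inputs, 4 hidden units, 1 output.
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))

def two_linear_layers(x):
    a1 = W1 @ x + b1        # hidden layer without a non-linearity
    return W2 @ a1 + b2     # output layer

# The same function written as a single linear layer.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2

def one_linear_layer(x):
    return W_combined @ x + b_combined

x = rng.standard_normal((3, 5))   # 5 examples with 3 features each
same = np.allclose(two_linear_layers(x), one_linear_layer(x))
print(same)
```

Replacing the hidden layer's identity activation with a non-linearity such as ReLU breaks this equivalence, which is what gives the hidden layer its expressive power.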
About this Course
This course will teach you the “magic” of getting deep learning to work well. Rather than the deep learning process being a black box, you will understand what drives performance, and be able to more systematically get good results. You will also learn TensorFlow.
After 3 weeks, you will:
This is the second course of the Deep Learning Specialization.
Learning Objectives
Recall that different types of initializations lead to different results
Recognize the importance of initialization in complex neural networks.
Recognize the difference between train/dev/test sets
Diagnose the bias and variance issues in your model
Learn when and how to use regularization methods such as dropout or L2 regularization.
Understand experimental issues in deep learning such as Vanishing or Exploding gradients and learn how to deal with them
Use gradient checking to verify the correctness of your backpropagation implementation
What we want you to remember from this module:
- Regularization will help you reduce overfitting.
- Regularization will drive your weights to lower values.
- L2 regularization and Dropout are two very effective regularization techniques.
Deep Learning models have so much flexibility and capacity that overfitting can be a serious problem, if the training dataset is not big enough. Sure it does well on the training set, but the learned network doesn’t generalize to new examples that it has never seen!
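The standard way to avoid overfitting is called L2 regularization. It consists of appropriately modifying your cost function, from:
$$J = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log a^{[L](i)} + \left(1 - y^{(i)}\right) \log\left(1 - a^{[L](i)}\right) \right)$$
to:
$$J_{regularized} = J + \frac{1}{m} \frac{\lambda}{2} \sum_{l} \sum_{k} \sum_{j} \left(W_{k,j}^{[l]}\right)^2$$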
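Why regularize only the parameter w? Why not add a term for b as well? You can, but it is usually omitted. w is typically a very high-dimensional parameter vector, especially when there is a high-variance problem, and the model may not fit all of those parameters well, whereas b is a single number. Almost all of the parameters are in w rather than b, so adding the term for b makes little difference in practice, because b is just one parameter among a very large number. In practice people usually do not bother to include it, but you can if you want to.
L1 regularization vs. L2 regularization
L1 regularization replaces the L2 norm (the Euclidean norm, commonly used to compute vector length: the square root of the sum of the squared elements) with λ/m times the sum of the absolute values of the elements of w. This sum is called the L1 norm of the parameter vector w, written with a subscript 1. Whether you use m or 2m in the denominator, it is only a scaling constant. With L1 regularization, w ends up sparse, meaning the vector w contains many zeros. Some people argue that this helps compress the model, since parameters that are zero need less memory to store. In practice, however, making the model sparse through L1 regularization helps only a little, so at least for the goal of compressing the model it is not very useful.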
L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the squared values of the weights in the cost function, you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes.
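The penalty on squared weights can be sketched in a few lines of NumPy. This sketch assumes the penalty has the form (λ / 2m) times the sum of squared weights over all layers, with the biases left out; the function names and the example matrices are hypothetical.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """(lambd / (2 * m)) * sum of squared entries of every weight matrix."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def l2_regularized_cost(unregularized_cost, weights, lambd, m):
    """Add the L2 penalty to an already computed cost (e.g. cross-entropy)."""
    return unregularized_cost + l2_penalty(weights, lambd, m)

# Hypothetical weights for a two-layer network.
W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([[3.0, 1.0]])

penalty = l2_penalty([W1, W2], lambd=0.1, m=5)
print(penalty)
```

Because the penalty grows with the square of each weight, halving every weight cuts the penalty to a quarter, which is why gradient descent on the regularized cost pushes weights toward smaller values.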
Dropout is a widely used regularization technique that is specific to deep learning.
It randomly shuts down some neurons in each iteration. Watch these two videos to see what this means!
When you shut some neurons down, you actually modify your model. The idea behind drop-out is that at each iteration, you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time.
What you should remember about dropout:
- Dropout is a regularization technique.
- You only use dropout during training. Don’t use dropout (randomly eliminate nodes) during test time.
- Apply dropout both during forward and backward propagation.
- During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.
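The points above can be sketched as inverted dropout in NumPy. The function names, the explicit random generator argument, and the example shapes are assumptions made for illustration.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Inverted dropout on activations A (training time only)."""
    D = rng.random(A.shape) < keep_prob   # mask: True means the unit is kept
    A_dropped = A * D / keep_prob         # zero the dropped units, rescale the rest
    return A_dropped, D

def dropout_backward(dA, D, keep_prob):
    """Apply the same mask and the same scaling to the gradient."""
    return dA * D / keep_prob

rng = np.random.default_rng(1)
A = np.ones((4, 10000))
A_dropped, D = dropout_forward(A, keep_prob=0.8, rng=rng)
dA_dropped = dropout_backward(np.ones_like(A), D, keep_prob=0.8)

# The mean stays close to the mean of A because of the division by keep_prob.
print(A_dropped.mean())
```

At test time neither function is called: the activations are used as they are, which works because the division by `keep_prob` already kept their expected value unchanged during training.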
Learning Objectives
Remember different optimization methods such as (Stochastic) Gradient Descent, Momentum, RMSProp and Adam
Use random minibatches to accelerate the convergence and improve the optimization
Know the benefits of learning rate decay and apply it to your optimization
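Applying machine learning is a highly empirical, iterative process. You have to train many models to find one that works well, so being able to train models quickly really helps. What makes this harder is that deep learning tends to work best in the big-data regime: we can train neural networks on very large datasets, but training on large datasets is slow. Fast, well-chosen optimization algorithms can therefore greatly improve the efficiency of you and your team.
As we saw earlier, vectorization lets you compute efficiently over all m examples, processing the whole training set without an explicit for loop. But if m is very large, this can still be slow. For example, if m is 5 million or 50 million or more, batch gradient descent has to process the entire training set before it can take one small step, and then process all 5 million examples again before taking the next one. The algorithm gets faster if you let gradient descent start making progress before it has finished processing the entire 5-million-example training set. Specifically, you can split the training set into smaller training sets called mini-batches.
An epoch is a single pass through the training set. With batch gradient descent, one pass through the training set gives you only one gradient descent step, whereas with mini-batch gradient descent, one pass through the training set (one epoch) gives you 5,000 gradient descent steps, for example when 5 million examples are split into mini-batches of 1,000.
When you have a large training set, mini-batch gradient descent runs much faster than batch gradient descent. It is what almost everyone in deep learning uses when training on a large dataset.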
A variant is Stochastic Gradient Descent (SGD), which is equivalent to mini-batch gradient descent where each mini-batch has just 1 example. In Stochastic Gradient Descent, you use only 1 training example before updating the parameters. When the training set is large, SGD can be faster. But the parameters will “oscillate” toward the minimum rather than converge smoothly. Here is an illustration of this:
In practice, you’ll often get faster results if you use neither the whole training set nor only one training example to perform each update. Mini-batch gradient descent uses an intermediate number of examples for each step. With mini-batch gradient descent, you loop over the mini-batches instead of looping over individual training examples.
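Splitting a training set into shuffled mini-batches might look like the sketch below. It assumes examples are stored as columns of `X` and `Y`; the function name and the toy data are hypothetical.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size, rng):
    """Shuffle the examples (columns) of X and Y together, then slice them."""
    m = X.shape[1]
    perm = rng.permutation(m)
    X_shuffled, Y_shuffled = X[:, perm], Y[:, perm]
    return [
        (X_shuffled[:, k:k + batch_size], Y_shuffled[:, k:k + batch_size])
        for k in range(0, m, batch_size)
    ]

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 10))          # 10 examples, 3 features
Y = rng.integers(0, 2, size=(1, 10))      # binary labels
batches = random_mini_batches(X, Y, batch_size=4, rng=rng)

sizes = [xb.shape[1] for xb, _ in batches]
print(sizes)   # [4, 4, 2]: the last mini-batch holds the remainder
```

One epoch then consists of one parameter update per element of `batches`; setting `batch_size=1` gives SGD and `batch_size=m` gives batch gradient descent.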
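There are a few things to note. When β is large, the curve you get is smoother, because you are averaging over more days of temperature, so the curve fluctuates less. On the other hand, the curve shifts to the right, because you are averaging over a larger window. With a larger window, the exponentially weighted average adapts more slowly when the temperature changes, which introduces some latency. The reason is that when β = 0.98, the previous value gets a large weight and the current value a very small one, only 0.02, so when the temperature goes up or down, the exponentially weighted average adapts more slowly for larger β. Now take the other extreme, say β = 0.5. By the approximation that the average covers about 1/(1 − β) days, this averages over only two days. Plotted, this gives the yellow line. Because it averages over only two days of temperature, a very small window, the result is much noisier and more susceptible to outliers, but it adapts much more quickly to temperature changes. This formula implements the exponentially weighted average. In statistics it is called the exponentially weighted moving average; we call it the exponentially weighted average for short.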
One of the advantages of this exponentially weighted average formula is that it takes very little memory. You need to keep just one real number in computer memory, and you keep overwriting it with this formula based on the latest value you got. This efficiency is the real reason to use it: it takes basically one line of code and storage for a single real number to compute the exponentially weighted average.
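The single-number update can be sketched as follows, assuming the usual form v = β·v + (1 − β)·θ with v starting at 0 and no bias correction. The function name and the input values are hypothetical.

```python
def exponentially_weighted_average(values, beta):
    """Return the running exponentially weighted average after each value."""
    v = 0.0                      # the only state that has to be stored
    averages = []
    for theta in values:
        v = beta * v + (1 - beta) * theta   # overwrite the stored number
        averages.append(v)
    return averages

temperatures = [10.0, 20.0, 30.0]
smoothed = exponentially_weighted_average(temperatures, beta=0.5)
print(smoothed)   # [5.0, 12.5, 21.25]
```

The list `averages` is kept here only so the curve can be inspected; in an optimizer only `v` itself is stored.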
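The main idea of momentum, or gradient descent with momentum, is to compute an exponentially weighted average of the gradients and then use that average, rather than the raw gradient, to update the weights.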
Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of the update has some variance, and so the path taken by mini-batch gradient descent will “oscillate” toward convergence. Using momentum can reduce these oscillations.
Momentum takes into account the past gradients to smooth out the update. We will store the "direction" of the previous gradients in the variable $v$. Formally, this will be the exponentially weighted average of the gradient on previous steps. You can also think of $v$ as the "velocity" of a ball rolling downhill, building up speed (and momentum) according to the direction of the gradient/slope of the hill.
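The momentum update rule is, for $l = 1, \dots, L$:
$$v_{dW^{[l]}} = \beta v_{dW^{[l]}} + (1 - \beta) \, dW^{[l]}, \qquad W^{[l]} = W^{[l]} - \alpha v_{dW^{[l]}}$$
$$v_{db^{[l]}} = \beta v_{db^{[l]}} + (1 - \beta) \, db^{[l]}, \qquad b^{[l]} = b^{[l]} - \alpha v_{db^{[l]}}$$
where L is the number of layers, $\beta$ is the momentum and $\alpha$ is the learning rate.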
How do you choose $\beta$?
The larger the momentum $\beta$ is, the smoother the update, because the more we take the past gradients into account. But if $\beta$ is too big, it could also smooth out the updates too much.
Common values for $\beta$ range from 0.8 to 0.999. If you don’t feel inclined to tune this, $\beta = 0.9$ is often a reasonable default.
Tuning the optimal $\beta$ for your model might need trying several values to see what works best in terms of reducing the value of the cost function $J$.
Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
You have to tune a momentum hyperparameter $\beta$ and a learning rate $\alpha$.
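One step of gradient descent with momentum can be sketched as below. It assumes the velocity is an exponentially weighted average of the gradients, initialized to zero, and that parameters, gradients and velocities live in dictionaries with matching keys; these names and the example values are hypothetical.

```python
import numpy as np

def momentum_update(params, grads, v, beta=0.9, learning_rate=0.01):
    """Update every parameter using the smoothed gradient (velocity)."""
    for key in params:
        v[key] = beta * v[key] + (1 - beta) * grads[key]     # smooth the gradient
        params[key] = params[key] - learning_rate * v[key]   # step along the velocity
    return params, v

params = {"W1": np.array([[1.0]]), "b1": np.array([[0.0]])}
grads = {"W1": np.array([[2.0]]), "b1": np.array([[-1.0]])}
v = {key: np.zeros_like(value) for key, value in params.items()}

params, v = momentum_update(params, grads, v)
print(params["W1"], v["W1"])
```

After the first step the velocity is only (1 − β) times the gradient, so early steps are small; the velocity builds up as more gradients pointing the same way are averaged in.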
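You have seen how momentum can speed up gradient descent. Another algorithm, RMSprop, short for root mean square prop, can also speed up gradient descent. Let's see how it works. Recall the earlier example: when running gradient descent, you may get large oscillations in the vertical direction even while trying to make progress in the horizontal direction. To illustrate, suppose the vertical axis is the parameter b and the horizontal axis is the parameter W; these could equally be W1 and W2 or other parameters, and b and W are used only to make the example easier to follow. You want to slow down learning in the b direction, the vertical direction, and speed up, or at least not slow down, learning in the horizontal direction. This is what RMSprop does. A further benefit is that you can use a larger learning rate α and learn faster without diverging in the vertical direction.
In the horizontal direction, the W direction in this example, we want learning to be fast, while in the vertical direction, the b direction, we want to damp the oscillations. For the two terms S_dW and S_db, we want S_dW to be relatively small, so that the W update is divided by a small number, and S_db to be relatively large, so that the b update is divided by a large number, which slows the updates in the vertical direction. Indeed, if you look at the derivatives, they are much larger in the vertical direction than in the horizontal direction: the function is steeper in the b direction than in the W direction, so db is large and dW is relatively small. The square of db is therefore relatively large, so S_db is relatively large, whereas dW, and hence its square, is smaller, so S_dW is smaller. The net effect is that the update in the vertical direction is divided by a larger number, which damps the oscillations, while the update in the horizontal direction is divided by a smaller number.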
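The effect described above can be sketched numerically. The sketch assumes the usual RMSprop form, S = β·S + (1 − β)·(gradient)² followed by a step of α·gradient / (√S + ε); the small constant ε and the default hyperparameter values are assumptions, and the gradients are made-up numbers.

```python
import numpy as np

def rmsprop_update(param, grad, s, beta=0.9, learning_rate=0.01, epsilon=1e-8):
    """One RMSprop step for a single parameter (scalar or array)."""
    s = beta * s + (1 - beta) * grad ** 2                        # running average of squared gradients
    param = param - learning_rate * grad / (np.sqrt(s) + epsilon)
    return param, s

# A shallow direction (W, small gradient) and a steep direction (b, large gradient).
W, s_dW = rmsprop_update(param=0.0, grad=0.1, s=0.0)
b, s_db = rmsprop_update(param=0.0, grad=10.0, s=0.0)

print(s_dW, s_db)   # s_db is far larger, so the b update is divided by a larger number
print(W, b)         # the two steps end up with comparable size
```

Although the gradient in the b direction is 100 times larger, both parameters move by almost the same amount, which is the damping of the steep direction that the text describes.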
Adam is one of the most effective optimization algorithms for training neural networks. It combines ideas from RMSProp (described in lecture) and Momentum.
How does Adam work?
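The update rule is, for $l = 1, \dots, L$:
$$\begin{cases}
v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) \, dW^{[l]} \\
v^{corrected}_{dW^{[l]}} = \dfrac{v_{dW^{[l]}}}{1 - \beta_1^t} \\
s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2) \left(dW^{[l]}\right)^2 \\
s^{corrected}_{dW^{[l]}} = \dfrac{s_{dW^{[l]}}}{1 - \beta_2^t} \\
W^{[l]} = W^{[l]} - \alpha \dfrac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon}
\end{cases}$$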
where:
- $t$ counts the number of Adam steps taken
- $L$ is the number of layers
- $\beta_1$ and $\beta_2$ are hyperparameters that control the two exponentially weighted averages
- $\alpha$ is the learning rate
- $\varepsilon$ is a very small number to avoid dividing by zero
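A single Adam step combining the two moving averages with bias correction can be sketched as below. The default values for `beta1`, `beta2`, `learning_rate` and `epsilon` are commonly used settings taken here as assumptions, and the function name is hypothetical.

```python
import numpy as np

def adam_update(param, grad, v, s, t, beta1=0.9, beta2=0.999,
                learning_rate=0.001, epsilon=1e-8):
    """One Adam step; t is the 1-based count of steps taken so far."""
    v = beta1 * v + (1 - beta1) * grad          # momentum-style average of gradients
    s = beta2 * s + (1 - beta2) * grad ** 2     # RMSprop-style average of squared gradients
    v_corrected = v / (1 - beta1 ** t)          # bias correction for the zero initialization
    s_corrected = s / (1 - beta2 ** t)
    param = param - learning_rate * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return param, v, s

param, v, s = adam_update(param=1.0, grad=2.0, v=0.0, s=0.0, t=1)
print(param)   # close to 0.999: the first step has size of about learning_rate
```

With bias correction, the first step has magnitude close to the learning rate regardless of the scale of the gradient, since the corrected averages reduce to the gradient and its square.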
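When choosing the scale on which to search a hyperparameter, sample uniformly at random within each order of magnitude, for example within 0.0001 to 0.001, 0.001 to 0.01, 0.01 to 0.1 and 0.1 to 1.
In general, to sample on a log scale between $10^a$ and $10^b$, draw $r$ uniformly at random from $[a, b]$ and set $\alpha = 10^r$.
Likewise, the hyperparameter β used in exponentially weighted averages should be chosen with this method.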
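Sampling on a log scale can be sketched with NumPy as follows. The ranges and the helper name are assumptions; for β the sketch samples 1 − β on a log scale, which is the usual way to apply the idea to values close to 1.

```python
import numpy as np

def sample_log_uniform(low, high, size, rng):
    """Draw r uniformly from [log10(low), log10(high)) and return 10**r."""
    a, b = np.log10(low), np.log10(high)
    r = rng.uniform(a, b, size=size)
    return 10.0 ** r

rng = np.random.default_rng(0)

# Learning rate between 0.0001 and 1: each decade gets about a quarter of the samples.
alphas = sample_log_uniform(1e-4, 1.0, size=1000, rng=rng)
fraction_smallest_decade = np.mean(alphas < 1e-3)

# beta between 0.9 and 0.999: sample 1 - beta between 0.001 and 0.1.
betas = 1.0 - sample_log_uniform(1e-3, 1e-1, size=1000, rng=rng)

print(fraction_smallest_decade, betas.min(), betas.max())
```

Sampling uniformly on a linear scale instead would put about 90% of the learning rates between 0.1 and 1, leaving the smaller orders of magnitude almost unexplored.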
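When computing resources are limited, use the first approach: tune a single model and keep improving it day by day;
When computing resources are plentiful, use the second approach: tune many models in parallel and pick the best one.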
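One of the most important innovations in the rise of deep learning is an algorithm called Batch Normalization, proposed by Sergey Ioffe and Christian Szegedy. It makes the hyperparameter search much easier, makes the network more robust and less sensitive to the choice of hyperparameters, and makes it easier to train very deep networks.
The values of $\gamma$ and $\beta$ are learned by the model: you update $\gamma$ and $\beta$ with gradient descent or a similar algorithm, such as gradient descent with momentum or Adam, just as you update the weights of the network.
What Batch Norm does is normalize not only the input layer but also the values z of some hidden units. There is one difference between normalizing the inputs and normalizing the hidden layers: after normalization, the hidden unit values do not have to have mean 0 and variance 1. For example, with a sigmoid activation you do not want the normalized values all clustered in the middle, near-linear part of the curve; you may want them to have a larger variance so as to make better use of the sigmoid's non-linearity. This is why the parameters $\gamma$ and $\beta$ are there: they let you control the range of $z^{(i)}$, or more precisely they give the hidden unit values a controllable mean and variance. The two parameters can be set freely by the learning algorithm, so the resulting mean and variance can be 0 and 1 or any other values determined by $\gamma$ and $\beta$.
The usual approach is to keep an exponentially weighted average of the mean and variance across the mini-batches seen during training. When training ends, these averaged values of the mean and variance are used directly in the Batch Norm formula to make predictions on test examples.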
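The training-time and test-time behaviour of Batch Norm can be sketched as below. The function names, the `momentum` value used for the running averages, and the small constant `epsilon` are assumptions; examples are stored as columns of `Z`.

```python
import numpy as np

def batchnorm_train(Z, gamma, beta, running_mean, running_var,
                    momentum=0.9, epsilon=1e-8):
    """Normalize Z over the mini-batch, then rescale with gamma and shift with beta."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + epsilon)
    Z_tilde = gamma * Z_norm + beta
    # Exponentially weighted averages, kept for use at test time.
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return Z_tilde, running_mean, running_var

def batchnorm_test(Z, gamma, beta, running_mean, running_var, epsilon=1e-8):
    """Use the averaged statistics instead of the statistics of the test batch."""
    Z_norm = (Z - running_mean) / np.sqrt(running_var + epsilon)
    return gamma * Z_norm + beta

rng = np.random.default_rng(0)
Z = rng.standard_normal((3, 64)) * 5.0 + 2.0     # 3 hidden units, 64 examples
gamma, beta = np.full((3, 1), 2.0), np.full((3, 1), 0.5)
running_mean, running_var = np.zeros((3, 1)), np.ones((3, 1))

Z_tilde, running_mean, running_var = batchnorm_train(
    Z, gamma, beta, running_mean, running_var)
Z_test = batchnorm_test(Z[:, :1], gamma, beta, running_mean, running_var)

print(Z_tilde.mean(axis=1), Z_tilde.std(axis=1))   # about 0.5 and 2.0 per unit
```

The output of each unit has mean β and standard deviation γ rather than 0 and 1, and the test-time function works even on a single example because it does not need batch statistics.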
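Softmax regression is a generalization of logistic regression. It lets you predict one of multiple classes rather than just two.
To summarize the computation from $z^{[L]}$ to $a^{[L]}$: exponentiate element-wise to get a temporary variable, then normalize.
We can summarize this process as a softmax activation function. Suppose $a^{[L]} = g(z^{[L]})$. What is unusual about this activation function $g$ is that it takes a 4×1 vector as input and outputs a 4×1 vector. Previously, our activation functions took a single real number as input: sigmoid and ReLU, for example, take a real number and output a real number. The softmax function differs in that it has to normalize its outputs, so both its input and output are vectors.
In short, the loss function looks at the true class of each training example and tries to make the probability assigned to that class as high as possible (maximum likelihood estimation).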
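The softmax computation and its loss can be sketched for a single 4×1 example as follows. Subtracting the maximum before exponentiating is a numerical-stability step added here, not something stated in the notes; the input values and the one-hot label are made up.

```python
import numpy as np

def softmax(z):
    """Exponentiate element-wise, then normalize so each column sums to 1."""
    t = np.exp(z - np.max(z, axis=0, keepdims=True))   # temporary variable
    return t / np.sum(t, axis=0, keepdims=True)

def cross_entropy_loss(a, y):
    """Loss for one-hot y: minus the log-probability of the true class."""
    return float(-np.sum(y * np.log(a)))

z = np.array([[5.0], [2.0], [-1.0], [3.0]])   # 4x1 input vector
y = np.array([[1.0], [0.0], [0.0], [0.0]])    # the true class is the first one

a = softmax(z)
loss = cross_entropy_loss(a, y)
print(a.ravel(), loss)
```

Because only the true class contributes to the sum, minimizing the loss is the same as pushing the probability of that class toward 1.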
About this Course
You will learn how to build a successful machine learning project. If you aspire to be a technical leader in AI, and know how to set direction for your team’s work, this course will show you how.
Much of this content has never been taught elsewhere, and is drawn from my experience building and shipping many deep learning products. This course also has two “flight simulators” that let you practice decision-making as a machine learning project leader. This provides “industry experience” that you might otherwise get only after years of ML work experience.
After 2 weeks, you will:
Learning Objectives
Understand why Machine Learning strategy is important
Apply satisficing and optimizing metrics to set up your goal for ML projects
Choose a correct train/dev/test split of your dataset
Understand how to define human-level performance
Use human-level performance to define your key priorities in ML projects
Make the correct ML strategic decision based on observations of performance and dataset
Ways to improve an ML system:
The ML Strategy course covers:
1. A number of strategies, that is, ways of analyzing a machine learning problem that will point you in the direction of the most promising things to try.
2. Andrew Ng's own experience of building and shipping a large number of deep learning products.
Andrew Ng points out that machine learning strategy is changing in the era of deep learning, because what you can do with deep learning algorithms differs from what was possible with the previous generation of machine learning algorithms.
Background: One of the challenges with building machine learning systems is that there are so many things you could try and so many things you could change, including, for example, so many hyperparameters you could tune.
Orthogonalization or orthogonality is a system design property that assures that modifying an instruction or a component of an algorithm will not create or propagate side effects to other components of the system. It becomes easier to verify the algorithms independently from one another, which reduces testing and development time.
When a supervised learning system is designed, these are the 4 assumptions that need to be true and orthogonal.
- Fit training set well on cost function
- If it doesn’t fit well, the use of a bigger neural network or switching to a better optimization algorithm might help.
- Fit development set well on cost function
- If it doesn’t fit well, regularization or using bigger training set might help.
- Fit test set well on cost function
- If it doesn’t fit well, the use of a bigger development set might help
- Performs well in real world
- If it doesn’t perform well, the development/test set is not set correctly or the cost function is not evaluating the right thing.
What orthogonalization means in machine learning: figure out exactly what’s wrong, and then have exactly one knob, or a specific set of knobs, that helps solve just the problem that is limiting the performance of the machine learning system.
The benefit of a single-number evaluation metric: it lets you quickly tell whether the new thing you just tried is working better or worse than your last idea.
Examples of evaluation metrics:
Precision
Of all the images we predicted y=1, what fraction actually have cats?
Recall
Of all the images that actually have cats, what fraction did we correctly identify as having cats?
The problem with using precision/recall as the evaluation metric is that you are not sure which classifier is better, since in this case both of them have good precision and recall. The F1-score, the harmonic mean of precision and recall, combines both.
The F1-score is not the only evaluation metric that can be used; the average, for example, could also be an indicator of which classifier to use.
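Precision, recall and the F1-score can be computed from predictions as in the sketch below. The function name and the toy labels are hypothetical; 1 means "cat" and 0 means "not cat".

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and their harmonic mean for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)   # 0.75 0.75 0.75
```

The harmonic mean is dominated by the smaller of the two values, so a classifier cannot reach a high F1-score by excelling at precision while neglecting recall, or the reverse.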
There are different metrics to evaluate the performance of a classifier; they are called evaluation metrics. They can be categorized as satisficing and optimizing metrics. It is important to note that these evaluation metrics must be evaluated on a training set, a development set or on the test set.
Example: Cat vs Non-cat
Classifier | Accuracy | Running time |
---|---|---|
A | 90% | 80 ms |
B | 92% | 95 ms |
C | 95% | 1,500 ms |
In this case, accuracy and running time are the evaluation metrics. Accuracy is the optimizing metric, because you want the classifier to detect cat images as accurately as possible. The running time, which is required to be under 100 ms in this example, is the satisficing metric, which means that the metric only has to meet the expectation that was set.
The general rule is:
Summary: If there are multiple things you care about, pick one as the optimizing metric that you want to do as well as possible on, and one or more as satisficing metrics, where you are satisfied as long as the classifier does better than some threshold. You then have an almost automatic way of quickly looking at multiple classifiers and picking the best one.
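Using the numbers from the cat classifier table, the selection rule can be sketched as: filter by the satisficing metric, then maximize the optimizing metric. The dictionary layout and function name are hypothetical.

```python
classifiers = [
    {"name": "A", "accuracy": 0.90, "running_time_ms": 80},
    {"name": "B", "accuracy": 0.92, "running_time_ms": 95},
    {"name": "C", "accuracy": 0.95, "running_time_ms": 1500},
]

def pick_best(candidates, max_running_time_ms=100):
    """Keep candidates under the running-time threshold, then take the most accurate."""
    feasible = [c for c in candidates if c["running_time_ms"] < max_running_time_ms]
    if not feasible:
        return None   # nothing meets the satisficing metric
    return max(feasible, key=lambda c: c["accuracy"])

best = pick_best(classifiers)
print(best["name"])   # B
```

Classifier C is the most accurate but is excluded by the running-time threshold, so B wins even though it is slower than A.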
Setting up the training, development and test sets has a huge impact on productivity. It is important to choose the development and test sets from the same distribution, and they must be drawn randomly from all the data. However, it is not a problem for the training set to come from a different distribution than the development set.
Guideline
Choose a development set and test set to reflect data you expect to get in the future and consider important to do well on.
Old way of splitting data
We had smaller data sets, so we had to use a greater percentage of the data to develop and test ideas and models.
Modern era – Big data
Now, because a large amount of data is available, we don’t have to compromise as much and can use a greater portion to train the model.
Guidelines
If doing well on your metric + dev/test set does not correspond to doing well on your application, change your metric and/or dev/test set.
Guideline
Today, machine learning algorithms can compete with human-level performance, since they have become more productive and more feasible in a lot of applications. Also, the workflow of designing and building a machine learning system is much more efficient than before.
Moreover, some of the tasks that humans do are close to "perfection", which is why machine learning tries to mimic human-level performance.
The graph below shows the performance of humans and machine learning over time.
Machine learning progress slows down once it surpasses human-level performance. One of the reasons is that human-level performance can be close to Bayes optimal error, especially for natural perception problems.
Bayes optimal error is defined as the best possible error. In other words, it means that any function mapping from x to y can’t surpass a certain level of accuracy.
Also, when the performance of machine learning is worse than the performance of humans, you can improve it with different tools. These tools are harder to use once it surpasses human-level performance.
These tools are:
By knowing what the human-level performance is, it is possible to tell whether the algorithm is performing well on the training set or not.
Example: Cat vs Non-Cat
Classification error (%) | Scenario A | Scenario B |
---|---|---|
Humans | 1 | 7.5 |
Training error | 8 | 8 |
Development error | 10 | 10 |
In this case, the human-level error is used as a proxy for Bayes error, since humans are good at identifying images. You may want to improve performance on the training set, but you cannot do better than the Bayes error unless you are overfitting the training set. Knowing the Bayes error makes it easier to decide whether bias reduction or variance reduction tactics will improve the performance of the model.
Scenario A
There is a 7% gap between the training error and the human-level error. It means that the algorithm isn’t fitting the training set well, since the target is around 1%. To resolve the issue, we use bias reduction techniques such as training a bigger neural network or training longer.
Scenario B
The algorithm is doing well on the training set, since there is only a 0.5% difference from the human-level error. The difference between the training error and the human-level error is called avoidable bias. The focus here is to reduce the variance, since the difference between the training error and the development error is 2%. To resolve the issue, we use variance reduction techniques such as regularization or a bigger training set.
Summary of bias/variance with human-level performance
There are many problems where machine learning significantly surpasses human-level performance, especially with structured data:
The two fundamental assumptions of supervised learning:
There are 2 fundamental assumptions of supervised learning. The first one is to have a low avoidable bias, which means that the model fits the training set well. The second one is to have a low or acceptable variance, which means that the training set performance generalizes well to the development set and test set.
If the difference between the human-level error and the training error is bigger than the difference between the training error and the development error, the focus should be on bias reduction techniques: training a bigger model, training longer, changing the neural network architecture, or trying a hyperparameter search.
If the difference between the training error and the development error is bigger than the difference between the human-level error and the training error, the focus should be on variance reduction techniques: a bigger data set, regularization, changing the neural network architecture, or trying a hyperparameter search.
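The comparison of the two gaps can be sketched with the error rates from Scenarios A and B. The function name and the tie-breaking choice (ties go to "variance") are assumptions.

```python
def bias_variance_focus(human_error, training_error, dev_error):
    """Return (avoidable bias, variance, which of the two to work on first)."""
    avoidable_bias = training_error - human_error
    variance = dev_error - training_error
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus

# Errors in percent, taken from the Cat vs Non-Cat example.
scenario_a = bias_variance_focus(human_error=1.0, training_error=8.0, dev_error=10.0)
scenario_b = bias_variance_focus(human_error=7.5, training_error=8.0, dev_error=10.0)

print(scenario_a)   # (7.0, 2.0, 'bias')
print(scenario_b)   # (0.5, 2.0, 'variance')
```

The training and development errors are identical in both scenarios; only the estimate of Bayes error changes which gap is larger, and with it the recommended tactic.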
Learning Objectives
Understand what multi-task learning and transfer learning are
Recognize bias, variance and data-mismatch by looking at the performances of your algorithm on train/dev/test sets
Summary
To carry out error analysis, find a set of mislabeled examples in your development set and look at them for false positives and false negatives. Then count up the number of errors that fall into each of the various categories.
During this process, you might be inspired to generate new categories of errors. By counting up the fraction of examples that are mislabeled in different ways, you will often be able to prioritize, or get inspiration for new directions to go in.
Creating a table for manual error analysis makes the work clearer and better organized.
DL algorithms are quite robust to random errors in the training set, but less so to systematic errors.
Guideline
Depending on the area of application, the guideline below will help you prioritize when you build your system.
Guideline
Example: Cat vs Non-cat
In this example, we want to create a mobile application that will classify and recognize pictures of cats taken and uploaded by users.
There are two sources of data used to develop the mobile app. The first source is small: 10,000 pictures uploaded from the mobile application. Since they come from amateur users, the pictures are not professionally shot, not well framed and blurrier. The second source is the web: you downloaded 200,000 pictures in which the cats are professionally framed and in high resolution.
The problem is that you have a different distribution:
The guideline used is that you have to choose a development set and test set to reflect data you expect to get in the future and consider important to do well on.
The advantage of this way of splitting up is that the target is well defined.
The disadvantage is that the training distribution is different from the development and test set distributions. However, this way of splitting the data gives better performance in the long term.
When the training set is from a different distribution than the development and test sets, the method to analyze bias and variance changes.
Scenario A
If the development data comes from the same distribution as the training set, then there is a large variance problem and the algorithm is not generalizing well from the training set.
However, since the training data and the development data come from different distributions, this conclusion cannot be drawn. There isn’t necessarily a variance problem. The problem might be that the development set contains images that are more difficult to classify accurately.
When the training set and the development and test sets come from different distributions, two things change at the same time. First, the algorithm saw the data in the training set but not the data in the development set. Second, the distribution of data in the development set is different.
It’s difficult to know which of these two changes produces the 9% increase in error between the training set and the development set. To resolve this issue, we define a new subset called the training-development set. This new subset has the same distribution as the training set, but it is not used for training the neural network.
Scenario B
The difference in error between the training set and the training-development set is 8%. In this case, since the training set and the training-development set come from the same distribution, the only difference between them is that the neural network saw the data in the training set and not the data in the training-development set. The neural network is not generalizing well to data from the same distribution that it hadn’t seen before.
Therefore, we really have a variance problem.
Scenario C
In this case, we have a data mismatch problem, since the 2 data sets come from different distributions.
Scenario D
In this case, the avoidable bias is high since the difference between Bayes error and training error is 10 %.
Scenario E
In this case, there are 2 problems. The first one is that the avoidable bias is high, since the difference between Bayes error and training error is 10%, and the second one is a data mismatch problem.
Scenario F
Development should never be done on the test set. However, the difference between the development set and the test set gives the degree of overfitting to the development set.
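The chain of comparisons used in the scenarios above can be sketched as a table of gaps between consecutive error rates. The function name is hypothetical; the training, training-development and development errors follow the 8% and 9% gaps mentioned in the text, while the Bayes and test errors are made-up values for illustration.

```python
def error_gaps(bayes, train, train_dev, dev, test):
    """Map each kind of problem to the gap (in % error) that measures it."""
    return {
        "avoidable bias": train - bayes,
        "variance": train_dev - train,
        "data mismatch": dev - train_dev,
        "overfitting to dev set": test - dev,
    }

# Training error 1%, training-development error 9%, development error 10%.
gaps = error_gaps(bayes=0.0, train=1.0, train_dev=9.0, dev=10.0, test=10.0)
largest = max(gaps, key=gaps.get)

print(gaps)
print(largest)   # variance
```

Reading the gaps from top to bottom attributes each jump in error to a single cause, which is the purpose of introducing the training-development set.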
This is a general guideline to address data mismatch:
Transfer learning refers to using the knowledge a neural network has learned on one task for another application.
When to use transfer learning
• Task A and B have the same input $x$
• A lot more data for Task A than Task B
• Low level features from Task A could be helpful for Task B
Example 1: Cat recognition - radiology diagnosis
The following neural network is trained for cat recognition, but we want to adapt it for radiology diagnosis. The neural network will learn about the structure and the nature of images. This initial phase of training on image recognition is called pre-training, since it will pre-initialize the weights of the neural network. Updating all the weights afterwards is called fine-tuning.
For cat recognition
Input $x$: image
Output $y$: 1: cat, 0: no cat
Radiology diagnosis
Input $x$: Radiology images (CT scans, X-rays)
Output $y$: Radiology diagnosis, 1: malignant tumor, 0: benign tumor
Guideline
• Delete last layer of neural network
• Delete weights feeding into the last output layer of the neural network
• Create a new set of randomly initialized weights for the last layer only
• New data set $(x, y)$
Multi-task learning refers to having one neural network do several tasks simultaneously.
When to use multi-task learning
Example: Simplified autonomous vehicle
The vehicle has to detect several things simultaneously: pedestrians, cars, road signs, traffic lights, cyclists, etc. We could have trained four separate neural networks instead of training one to do four tasks. However, in this case, the performance of the system is better when one neural network is trained to do four tasks than when four separate neural networks are trained, since some of the earlier features in the neural network can be shared between the different types of objects.
The input $x^{(i)}$ is an image with multiple labels
The output $y^{(i)}$ has 4 labels, which represent:
Also, the cost can be computed so that it is not influenced by the fact that some entries are not labeled.
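A cost that skips unlabeled entries can be sketched as below. Marking missing labels with `np.nan`, using a per-label logistic loss, and dividing by the number of examples are assumptions made for this sketch; the function name and toy values are hypothetical.

```python
import numpy as np

def multitask_cost(Y_hat, Y):
    """Sum the logistic loss over labeled entries only; Y holds 1, 0 or np.nan."""
    labeled = ~np.isnan(Y)                       # True where a label exists
    Y_filled = np.where(labeled, Y, 0.0)         # placeholder value, masked out below
    losses = -(Y_filled * np.log(Y_hat) + (1 - Y_filled) * np.log(1 - Y_hat))
    return float(np.sum(losses * labeled) / Y.shape[1])

# 2 tasks (rows) and 2 examples (columns); one label is missing.
Y = np.array([[1.0, np.nan],
              [0.0, 1.0]])
Y_hat = np.array([[0.5, 0.9],
                  [0.5, 0.5]])

cost = multitask_cost(Y_hat, Y)
print(cost)   # 1.5 * ln(2), from the three labeled entries
```

Changing the prediction at the unlabeled position leaves the cost unchanged, so the network is trained only on the labels that actually exist.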
End-to-end deep learning is the simplification of a processing or learning system into one neural network.
Example - Speech recognition model
End-to-end deep learning cannot be used for every problem, since it needs a lot of labeled data. It is used mainly in audio transcription, image captioning, image synthesis, machine translation, steering in self-driving cars, etc.
Before applying end-to-end deep learning, you need to ask yourself the following question: Do you have enough data to learn a function of the complexity needed to map x to y?
Pro:
Cons:
About this Course
This course will teach you how to build convolutional neural networks and apply them to image data. Thanks to deep learning, computer vision is working far better than just two years ago, and this is enabling numerous exciting applications ranging from safe autonomous driving, to accurate face recognition, to automatic reading of radiology images.
You will:
- Understand how to build a convolutional neural network, including recent variations such as residual networks.
- Know how to apply convolutional networks to visual detection and recognition tasks.
- Know how to use neural style transfer to generate art.
- Be able to apply these algorithms to a variety of image, video, and other 2D or 3D data.
This is the fourth course of the Deep Learning Specialization.
Learn to implement the foundational layers of CNNs (pooling, convolutions) and to stack them properly in a deep network to solve multi-class image classification problems.
Learning Objectives
Understand the convolution operation
Understand the pooling operation
Remember the vocabulary used in convolutional neural networks (padding, stride, filter, …)
Build a convolutional neural network for image multi-class classification
Computer vision is one of the areas that’s been advancing rapidly thanks to deep learning. Two reasons make people excited about deep learning for computer vision:
The convolution operation is one of the fundamental building blocks of a convolutional neural network.
Using edge detection as the motivating example in this module:
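In the example in the figure above, as the 3x3 filter moves over the positions marked by the red and blue circles, the convolution produces the result shown in the bottom-right image: a lighter region appears in the middle of the output, which indicates that the vertical edge has been detected.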
The convolution operation gives you a convenient way to specify how to find these vertical edges in an image.
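Vertical edge detection with a 3x3 filter can be sketched in NumPy as below. The image (bright left half, dark right half) and the filter values are a commonly used illustration taken here as assumptions, and the helper name is hypothetical. As is usual in deep learning, the "convolution" is implemented as cross-correlation, without flipping the filter.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image with stride 1 and no padding."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out

# 6x6 image: bright (10) on the left, dark (0) on the right.
image = np.hstack([np.full((6, 3), 10.0), np.zeros((6, 3))])

# 3x3 vertical edge filter: positive left column, negative right column.
vertical_filter = np.array([[1.0, 0.0, -1.0]] * 3)

edges = conv2d_valid(image, vertical_filter)
print(edges)   # every row is [0, 30, 30, 0]
```

The large values in the middle columns of the 4x4 output mark the vertical edge; flipping the image left to right would give −30 instead, which distinguishes light-to-dark from dark-to-light transitions.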
In this module, you’ll learn the difference between positive and negative edges, that is, the difference between light to dark versus dark to light edge transitions. And you’ll also see other types of edge detectors, as well as how to have an algorithm learn, rather than have us hand code an edge detector as we’ve been doing so far.
Different filters allow you to find vertical and horizontal edges:
With the rise of deep learning, one of the things we learned is that when you really want to detect edges in some complicated image, maybe you don’t need to have computer vision researchers handpick these nine numbers. Maybe you can just learn them and treat the nine numbers of this matrix as parameters, which you can then learn using back propagation. And the goal is to learn nine parameters so that when you take the image, the six by six image, and convolve it with your three by three filter, that this gives you a good edge detector.
Rather than just vertical and horizontal edges, maybe deep learning can learn to detect edges that are at 45 degrees or 70 degrees or 73 degrees or at whatever orientation it chooses. By letting all of these numbers be parameters and learning them automatically from data, we find that neural networks can actually learn low-level features such as edges even more robustly than computer vision researchers are generally able to code them up by hand. But underlying all these computations is still the convolution operation, which allows back propagation to learn whatever three by three filter it wants and then apply it at every position throughout the image in order to output whatever feature it is trying to detect, be it vertical edges, horizontal edges, edges at some other angle, or even some other filter that we might not have a name for in English.
The idea you can treat these nine numbers as parameters to be learned has been one of the most powerful ideas in computer vision.
In order to build deep neural networks one modification to the basic convolutional operation that you need to really use is padding.
Drawbacks of the basic convolution operation:
1. If your image shrinks every time you apply a convolutional operator, going for example from six by six down to four by four, you can only do this a few times before the image gets really small, maybe shrinking down to one by one. You probably don’t want your image to shrink every time you detect edges or other features on it.
2. A pixel at the corner or the edge of the image is used in only one of the outputs, because it touches only one three by three region. A pixel in the middle, by contrast, is overlapped by many three by three regions. So pixels on the corners or on the edges are used much less in the output, and you are throwing away a lot of the information near the edge of the image.
To fix both of these problems, you can pad the image before applying the convolution operation. In this case, let’s say you pad the image with an additional border of one pixel all around the edges.
The main benefits of padding are the following:
It allows you to use a CONV layer without necessarily shrinking the height and width of the volumes. This is important for building deeper networks, since otherwise the height/width would shrink as you go to deeper layers. An important special case is the “same” convolution, in which the height/width is exactly preserved after one layer.
It helps us keep more of the information at the border of an image. Without padding, very few values at the next layer would be affected by pixels at the edges of an image.
In terms of how much to pad, it turns out there are two common choices, called Valid convolutions and Same convolutions.
By convention in computer vision, f is usually odd. There are two reasons for that:
Strided convolution is another basic building block of convolutions as used in Convolutional Neural Networks.
Example:
Let’s say you want to convolve this seven by seven image with this three by three filter, except that instead of doing the usual way, we are going to do it with a stride of two. What that means is instead of stepping the blue box over by one step, we are going to step over by two steps.
Reminder:
The formula relating the output shape of the convolution to the input shape is:
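The relationship can be sketched in code. The snippet below assumes the usual per-dimension rule, output = floor((n + 2p - f) / s) + 1, where n is the input height or width, f the filter size, p the padding and s the stride; the function and argument names are illustrative and not taken from these notes.

```python
def conv_output_size(n_prev: int, f: int, pad: int = 0, stride: int = 1) -> int:
    """Output height (or width) of a convolution along one dimension."""
    # Integer floor division implements the floor in the formula.
    return (n_prev + 2 * pad - f) // stride + 1


def same_padding(f: int) -> int:
    """Padding that preserves the size for stride 1 and an odd filter size f."""
    return (f - 1) // 2


valid_size = conv_output_size(6, 3)                       # "valid": no padding
same_size = conv_output_size(6, 3, pad=same_padding(3))   # "same": size preserved
strided_size = conv_output_size(7, 3, stride=2)           # 7x7 image, 3x3 filter, stride 2
print(valid_size, same_size, strided_size)
```

The three calls correspond to the six by six valid convolution, the same convolution, and the seven by seven strided example discussed above.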
Technical note on cross-correlation vs. convolution
In some math or signal processing textbooks, there is one other possible inconsistency in the notation, which is the way that convolution is defined. Before doing the element-wise product and summing, there is one other step you take first when convolving this six by six matrix with this three by three filter: you take the three by three filter and flip it on the horizontal as well as the vertical axis. Then you apply the flipped filter to the target matrix.
To summarize, by convention in machine learning, we usually do not bother with this flipping operation and technically, this operation is maybe better called cross-correlation but most of the deep learning literature just calls it the convolution operator.
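A minimal NumPy sketch of the difference, assuming a valid, stride-1 operation. The function names and the vertical edge filter values are illustrative choices, not something these notes define.

```python
import numpy as np


def cross_correlate2d(image, kernel):
    """Slide the kernel over the image WITHOUT flipping (what deep learning calls convolution)."""
    f_h, f_w = kernel.shape
    out_h = image.shape[0] - f_h + 1
    out_w = image.shape[1] - f_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the window and the kernel, then sum.
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out


def convolve2d_textbook(image, kernel):
    """Textbook convolution: flip the kernel on both axes, then cross-correlate."""
    return cross_correlate2d(image, np.flip(kernel))


rng = np.random.default_rng(0)
image = rng.integers(0, 10, size=(6, 6)).astype(float)
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])   # assumed vertical edge filter

cc = cross_correlate2d(image, kernel)
conv = convolve2d_textbook(image, kernel)
```

For this particular filter, flipping on both axes just negates it, so the two results differ only in sign. Since the filter values are learned anyway, skipping the flip costs nothing in practice.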
Convolution can be implemented not only over just 2D images, but over three dimensional volumes.
The three by three by three filter has 27 numbers, or 27 parameters; that’s three cubed. What you do is take each of these 27 numbers and multiply it with the corresponding number from the red, green, and blue channels of the image: take the first nine numbers from the red channel, then the nine beneath them from the green channel, then the nine beneath those from the blue channel, and multiply them with the corresponding 27 numbers covered by the yellow cube shown on the left. Then add up all those numbers, and this gives you the first number in the output. To compute the next output, you take this cube and slide it over by one.
Multiple filters
The idea of convolution on volumes, turns out to be really powerful. Only a small part of it is that you can now operate directly on RGB images with three channels. But even more important is that you can now detect two features, like vertical, horizontal edges, or maybe several hundreds of different features. And the output will then have a number of channels equal to the number of filters you are detecting.
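The following sketch shows convolution over a volume with several filters, assuming a valid, stride-1 convolution and a channels-last layout. The array shapes and names are illustrative assumptions.

```python
import numpy as np


def conv_volume(image, filters):
    """Valid, stride-1 convolution.

    image:   (n_H, n_W, n_C)
    filters: (f, f, n_C, n_filters)
    returns: (n_H - f + 1, n_W - f + 1, n_filters)
    """
    n_h, n_w, n_c = image.shape
    f, _, n_c_filter, n_filters = filters.shape
    assert n_c == n_c_filter, "filter depth must match the number of image channels"
    out = np.zeros((n_h - f + 1, n_w - f + 1, n_filters))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + f, j:j + f, :]            # f x f x n_C cube of the input
            for k in range(n_filters):
                # Multiply the f*f*n_C numbers element-wise and add them up.
                out[i, j, k] = np.sum(patch * filters[:, :, :, k])
    return out


rng = np.random.default_rng(1)
rgb = rng.standard_normal((6, 6, 3))          # 6 x 6 RGB image
filters = rng.standard_normal((3, 3, 3, 2))   # two 3 x 3 x 3 filters
out = conv_volume(rgb, filters)
print(out.shape)
```

Each filter contributes one channel of the output, so two filters applied to a 6 x 6 x 3 input give a 4 x 4 x 2 output.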
Summary of notation
If layer l is a convolution layer:
Types of layer in a convolutional network
Other than convolutional layers, ConvNets often also use pooling layers to reduce the size of the representation, to speed up computation, and to make some of the features they detect a bit more robust.
The pooling (POOL) layer reduces the height and width of the input. It helps reduce computation, as well as helps make feature detectors more invariant to its position in the input. The two types of pooling layers are:
Max-pooling layer: slides an $(f, f)$ window over the input and stores the max value of the window in the output.
Average-pooling layer: slides an $(f, f)$ window over the input and stores the average value of the window in the output.
These pooling layers have no parameters for backpropagation to train. However, they have hyperparameters such as the window size $f$. This specifies the height and width of the $f \times f$ window you would compute a max or average over.
The intuition behind max pooling:
If the feature is detected anywhere in the filter’s window, then keep a high number. But if the feature is not detected, so maybe this feature doesn’t exist in the upper right-hand quadrant, then the max of all those numbers is still itself quite small.
Example of max pooling:
If you have a 3D input, then the output will have the same number of channels, because pooling is applied to each channel independently.
There is another type of pooling that isn’t used very often, but that is worth mentioning briefly: average pooling.
Instead of taking the max within each filter window, average pooling takes the average.
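Both pooling types can be sketched with one small NumPy function. The window size, stride and the 4 x 4 input values below are illustrative assumptions.

```python
import numpy as np


def pool2d(x, f=2, stride=2, mode="max"):
    """Pool an (n_H, n_W, n_C) input; each channel is pooled independently."""
    n_h, n_w, n_c = x.shape
    out_h = (n_h - f) // stride + 1
    out_w = (n_w - f) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w, n_c))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + f, j * stride:j * stride + f, :]
            out[i, j, :] = reduce_fn(window, axis=(0, 1))   # reduce over height and width only
    return out


x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]], dtype=float)[:, :, None]   # add a channel axis: (4, 4, 1)

max_out = pool2d(x, mode="max")[:, :, 0]
avg_out = pool2d(x, mode="average")[:, :, 0]
print(max_out)
print(avg_out)
```

There is nothing to learn here: `f` and `stride` are hyperparameters, and the function has no weights.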
Two main advantages of convolutional layers over just using fully connected layers:
Learn about the practical tricks and methods used in deep CNNs straight from the research papers.
Learning Objectives
Why look at case studies?
Outline
Classic networks:
ResNet
Inception
LeNet - 5
The goal of LeNet-5 was to recognize handwritten digits.
AlexNet
AlexNet convinced a lot of the computer vision community to take a serious look at deep learning and showed that deep learning really works in computer vision. It then went on to have a huge impact not just in computer vision but beyond it as well.
VGG - 16
A remarkable thing about VGG-16 is that, instead of having so many hyperparameters, the network uses a much simpler architecture. The architecture is really quite uniform.
The problem of very deep neural networks
Last week, you built your first convolutional neural network. In recent years, neural networks have become deeper, with state-of-the-art networks going from just a few layers (e.g., AlexNet) to over a hundred layers.
The main benefit of a very deep network is that it can represent very complex functions. It can also learn features at many different levels of abstraction, from edges (at the lower layers) to very complex features (at the deeper layers). However, using a deeper network doesn’t always help. A huge barrier to training them is vanishing gradients: very deep networks often have a gradient signal that goes to zero quickly, thus making gradient descent unbearably slow. More specifically, during gradient descent, as you backprop from the final layer back to the first layer, you are multiplying by the weight matrix on each step, and thus the gradient can decrease exponentially quickly to zero (or, in rare cases, grow exponentially quickly and “explode” to take very large values).
During training, you might therefore see the magnitude (or norm) of the gradient for the earlier layers decrease to zero very rapidly as training proceeds:
You are now going to solve this problem by building a Residual Network!
Building a Residual Network
In ResNets, a “shortcut” or a “skip connection” allows the gradient to be directly backpropagated to earlier layers:
The image on the left shows the “main path” through the network. The image on the right adds a shortcut to the main path. By stacking these ResNet blocks on top of each other, you can form a very deep network.
We also saw in lecture that having ResNet blocks with the shortcut also makes it very easy for one of the blocks to learn an identity function. This means that you can stack on additional ResNet blocks with little risk of harming training set performance. (There is also some evidence that the ease of learning an identity function–even more than skip connections helping with vanishing gradients–accounts for ResNets’ remarkable performance.)
Two main types of blocks are used in a ResNet, depending mainly on whether the input/output dimensions are same or different.
The identity block is the standard block used in ResNets, and corresponds to the case where the input activation (say $a^{[l]}$) has the same dimension as the output activation (say $a^{[l+2]}$). To flesh out the different steps of what happens in a ResNet’s identity block, here is an alternative diagram showing the individual steps:
The upper path is the “shortcut path.” The lower path is the “main path.” In this diagram, we have also made explicit the CONV2D and ReLU steps in each layer. To speed up training we have also added a BatchNorm step.
The ResNet “convolutional block” is the other type of block. You can use this type of block when the input and output dimensions don’t match up. The difference with the identity block is that there is a CONV2D layer in the shortcut path:
The CONV2D layer in the shortcut path is used to resize the input $x$ to a different dimension, so that the dimensions match up in the final addition needed to add the shortcut value back to the main path. (This plays a similar role as the matrix $W_s$ discussed in lecture.) For example, to reduce the height and width of the activations by a factor of 2, you can use a 1x1 convolution with a stride of 2. The CONV2D layer on the shortcut path does not use any non-linear activation function. Its main role is to just apply a (learned) linear function that reduces the dimension of the input, so that the dimensions match up for the later addition step.
What you should remember:
- Very deep “plain” networks don’t work in practice because they are hard to train due to vanishing gradients.
- The skip-connections help to address the Vanishing Gradient problem. They also make it easy for a ResNet block to learn an identity function.
- There are two main types of blocks: The identity block and the convolutional block.
- Very deep Residual Networks are built by stacking these blocks together.
Why do ResNets work so well?
Doing well on the training set is usually a prerequisite to doing well on your hold-out, dev, or test sets. So being able to at least train a ResNet to do well on the training set is a good first step toward that.
If you make a network deeper, it can hurt your ability to train the network to do well on the training set. But this is not true, or at least is much less true, when you are training a ResNet.
$W^{[l+2]}$ is really the key term to pay attention to here. If $W^{[l+2]}$ is equal to zero, and let’s say the bias $b^{[l+2]}$ is also equal to zero, then these terms go away, and $a^{[l+2]} = g(a^{[l]})$, which is just equal to $a^{[l]}$ because we assumed the ReLU activation function: all of the activations are non-negative, so $g(a^{[l]})$ is the ReLU applied to a non-negative quantity and you just get back $a^{[l]}$. What this shows is that the identity function is easy for a residual block to learn: it’s easy to get $a^{[l+2]} = a^{[l]}$ because of the skip connection. That means adding these two layers doesn’t really hurt your neural network’s ability to do as well as the simpler network without them, because it’s quite easy for it to learn the identity function and just copy $a^{[l]}$ to $a^{[l+2]}$ despite the addition of these two layers. This is why adding a residual block somewhere in the middle or at the end of a big neural network doesn’t hurt performance. But of course our goal is not just to not hurt performance but to help it, and you can imagine that if these hidden units actually learn something useful, then you can do even better than learning the identity function. What goes wrong in very deep plain nets, without the residual or skip connections, is that as you make the network deeper and deeper, it’s actually very difficult for it to choose parameters that learn even the identity function, which is why a lot of layers end up making your result worse rather than better.
The main reason the residual network works is that it’s so easy for these extra layers to learn the identity function that you’re more or less guaranteed they don’t hurt performance, and a lot of the time you get lucky and they even help performance.
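The identity argument can be checked numerically. This is a minimal sketch that assumes a residual block made of two fully connected layers, $a^{[l+2]} = g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]})$ with $g$ = ReLU; the variable names and sizes are illustrative.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0)


def residual_block(a_l, W1, b1, W2, b2):
    """Two layers with a skip connection from a_l to the second activation."""
    a_l1 = relu(W1 @ a_l + b1)     # a^{[l+1]}
    z_l2 = W2 @ a_l1 + b2          # z^{[l+2]}
    return relu(z_l2 + a_l)        # skip connection: add a_l before the ReLU


rng = np.random.default_rng(0)
n = 4
a_l = relu(rng.standard_normal(n))              # output of a ReLU layer, so non-negative
W1, b1 = rng.standard_normal((n, n)), rng.standard_normal(n)
W2, b2 = np.zeros((n, n)), np.zeros(n)          # second layer set to zero for illustration

a_l2 = residual_block(a_l, W1, b1, W2, b2)
print(np.allclose(a_l2, a_l))
```

Whatever the first layer computes, zeroing the second layer’s weights and bias makes the block copy its input, which is the sense in which the identity function is easy to learn.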
What does a 1 × 1 convolution do?
The 1 × 1 convolution will look at each of the 36 different positions here, and it will take the element wise product between 32 numbers on the left and 32 numbers in the filter. And then apply a ReLU non-linearity to it after that.
This idea is often called a 1 x 1 convolution, but it’s sometimes also called Network in Network.
Using 1×1 convolutions
A 1 x 1 convolution is a way to shrink $n_C$, whereas pooling layers shrink $n_H$ and $n_W$, the height and width.
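A 1 x 1 convolution can be written as the same small fully connected layer applied at every spatial position. The sketch below assumes a channels-last 6 x 6 x 32 input and 8 output filters; the shapes and names are illustrative.

```python
import numpy as np


def conv1x1(x, W, b):
    """1 x 1 convolution followed by ReLU.

    x: (n_H, n_W, n_C_in), W: (n_C_in, n_C_out), b: (n_C_out,)
    """
    # Contract the channel axis of x with the input axis of W at every position.
    z = np.tensordot(x, W, axes=([2], [0])) + b
    return np.maximum(z, 0)


rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6, 32))   # 36 positions, 32 channels each
W = rng.standard_normal((32, 8))      # 8 filters, each of shape 1 x 1 x 32
b = np.zeros(8)

y = conv1x1(x, W, b)
print(y.shape)
```

The height and width are unchanged while the number of channels drops from 32 to 8.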
Motivation for inception network
The problem of computational cost
To summarize, if you are building a layer of a neural network and you don’t want to have to decide whether to use a 1 by 1, a 3 by 3, a 5 by 5, or a pooling layer, the inception module lets you say: let’s do them all, and let’s concatenate the results. Then we run into the problem of computational cost. What you saw here is how, using a 1 by 1 convolution, you can create a bottleneck layer and thereby reduce the computational cost significantly. Now you might be wondering whether shrinking the representation size so dramatically hurts the performance of your neural network. It turns out that so long as you implement the bottleneck layer within reason, you can shrink the representation size significantly without seeming to hurt performance, while saving a lot of computation. So these are the key ideas of the inception module.
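To make the saving concrete, here is a rough multiplication count. The shapes are assumptions chosen for illustration (a 28 x 28 x 192 input, 32 output channels from 5 x 5 same convolutions, and a 16-channel bottleneck); they are not stated in these notes.

```python
def conv_multiplies(n_h, n_w, n_c_in, f, n_c_out):
    """Multiplications for a stride-1 'same' convolution.

    Each of the n_h * n_w * n_c_out output values needs one
    f * f * n_c_in dot product.
    """
    return n_h * n_w * n_c_out * f * f * n_c_in


# Direct 5 x 5 convolution: 28 x 28 x 192 -> 28 x 28 x 32
direct = conv_multiplies(28, 28, 192, 5, 32)

# Bottleneck: 1 x 1 convolution down to 16 channels, then the 5 x 5 convolution
bottleneck = (conv_multiplies(28, 28, 192, 1, 16)
              + conv_multiplies(28, 28, 16, 5, 32))

print(direct, bottleneck, direct / bottleneck)
```

Under these assumed shapes the bottleneck version needs roughly a tenth of the multiplications of the direct convolution.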
To summarize, if you understand the Inception module, then you understand the Inception network, which is largely the Inception module repeated a bunch of times throughout the network.
It turns out that a lot of these neural networks are difficult or finicky to replicate, because many details about the tuning of hyperparameters, such as learning rate decay, make some difference to the performance.
Therefore, it’s sometimes difficult to replicate someone else’s published work just from reading their paper. Fortunately, a lot of deep learning researchers routinely open source their work on the Internet, such as on GitHub.
One of the advantages of doing so also is that sometimes these networks take a long time to train, and someone else might have used multiple GPUs and a very large dataset to pretrain some of these networks. And that allows you to do transfer learning using these networks.
If you’re building a computer vision application, rather than training the weights from scratch, from random initialization, you often make much faster progress if you download weights that someone else has already trained on the network architecture, use that as pre-training, and transfer it to the new task that you are interested in.
In practice, because the open datasets on the internet are so big, and the weights you can download, which someone else has spent weeks training, have learned from so much data, you find that for a lot of computer vision applications you do much better if you download someone else’s open source weights and use them as initialization for your problem. Across all the different applications of deep learning, I think computer vision is one where transfer learning is something you should almost always do, unless you have an exceptionally large dataset and a very large computation budget to train everything from scratch yourself.
Most computer vision tasks could use more data, so data augmentation is one of the techniques often used to improve the performance of computer vision systems.
Implementing distortions during training
Similar to other parts of training a deep neural network, the data augmentation process also has a few hyperparameters, such as how much color shifting you implement and exactly what parameters you use for random cropping. So, similar to elsewhere in computer vision, a good place to get started might be to use someone else’s open source implementation of data augmentation. But of course, if you want to capture more invariances than you think someone else’s open source implementation does, it might be reasonable to tune these hyperparameters yourself.
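A minimal NumPy sketch of distortions applied on the fly during training. Random cropping and color shifting are mentioned above; horizontal mirroring, the crop size and the shift range are added assumptions, and the function name is hypothetical.

```python
import numpy as np


def augment(image, rng, crop=24, max_shift=20):
    """Random mirror, random crop and per-channel color shift for an (H, W, 3) uint8 image."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                          # horizontal mirroring
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    image = image[top:top + crop, left:left + crop, :]     # random crop
    shift = rng.integers(-max_shift, max_shift + 1, size=3)   # one offset per RGB channel
    shifted = image.astype(np.int16) + shift               # widen first to avoid uint8 wrap-around
    return np.clip(shifted, 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)   # stand-in for a real photo
augmented = augment(image, rng)
print(augmented.shape, augmented.dtype)
```

Here `crop` and `max_shift` are exactly the kind of augmentation hyperparameters the paragraph above refers to.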
Deep learning has been successfully applied to computer vision, natural language processing, speech recognition, online advertising, logistics, many, many, many problems. There are a few things that are unique about the application of deep learning to computer vision, about the status of computer vision. In this video, I will share with you some of my observations about deep learning for computer vision and I hope that that will help you better navigate the literature, and the set of ideas out there, and how you build these systems yourself for computer vision.
Tips for doing well on benchmarks/winning competitions
Learn how to apply your knowledge of CNNs to one of the toughest but hottest fields of computer vision: object detection.
Learning Objectives
Object detection is one of the areas of computer vision that’s just exploding and is working so much better than just a couple of years ago. In order to build up to object detection, you first learn about object localization.
The problem discussed here is classification with localization. This means that not only do you have to label the image as, say, a car, but the algorithm is also responsible for putting a bounding box, or drawing a red rectangle, around the position of the car in the image. The term localization refers to figuring out where in the picture the car you’ve detected is.
The loss function above uses squared error for simplicity. In practice you could use a log-likelihood loss for $C_1, C_2, C_3$ with a softmax output, squared error or something like it for the bounding box coordinates, and something like the logistic regression loss for $P_c$. Even if you use squared error throughout, it will probably work okay.
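A sketch of the simple squared-error version. It assumes the label vector is laid out as [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3] and that the bounding box and class entries are ignored when no object is present; this layout is an assumption made for illustration.

```python
import numpy as np


def localization_loss(y_hat, y):
    """Squared-error loss for labels laid out as [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    if y[0] == 1:
        # Object present: every component of the output matters.
        return float(np.sum((y_hat - y) ** 2))
    # No object: only the confidence p_c is penalized, the rest is "don't care".
    return float((y_hat[0] - y[0]) ** 2)


y_car = [1, 0.5, 0.5, 0.2, 0.3, 0, 1, 0]        # object present, class 2
y_background = [0, 0, 0, 0, 0, 0, 0, 0]          # no object
y_hat = [0.3, 0.4, 0.6, 0.2, 0.3, 0.1, 0.8, 0.1]

print(localization_loss(y_hat, y_car))
print(localization_loss(y_hat, y_background))
```

Swapping the squared error on the class and confidence entries for the log-likelihood and logistic losses described above would change only the terms inside the two branches.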
Landmarks are the important points in an image that you want the neural network to recognize; the network outputs their X and Y coordinates.
In order to train a network to detect landmarks, you need a labeled training set. The labels have to be consistent across different images. If you can hire labelers, or label a big enough dataset yourself, then a neural network can output all of these landmarks, which can then be used for other interesting effects, such as estimating the pose of a person or recognizing someone’s emotion from a picture.
Discover how CNNs can be applied to multiple fields, including art generation and face recognition. Implement your own algorithm to generate art and recognize faces!
What you should remember:
- Face verification solves an easier 1:1 matching problem; face recognition addresses a harder 1:K matching problem.
- The triplet loss is an effective loss function for training a neural network to learn an encoding of a face image.
- The same encoding can be used for verification and recognition. Measuring distances between two images’ encodings allows you to determine whether they are pictures of the same person.
Face verification vs. face recognition
Verification
Recognition
Need to be able to recognize a person even though you can only have one sample in your DB.
You can’t train a CNN with a softmax(each person) because:
Siamese network is a good way to input two faces and tell you how similar or how different they are.
By using a 128-neuron fully connected layer as its last layer, the model ensures that the output is an encoding vector of size 128. You then use the encodings to compare two face images as follows:
The triplet loss function formalizes this, and tries to “push” the encodings of two images of the same person (Anchor and Positive) closer together, while “pulling” the encodings of two images of different persons (Anchor, Negative) further apart.
For an image $x$, we denote its encoding $f(x)$, where $f$ is the function computed by the neural network.
Training will use triplets of images $(A, P, N)$:
These triplets are picked from our training dataset. We will write $(A^{(i)}, P^{(i)}, N^{(i)})$ to denote the $i$-th training example.
You’d like to make sure that an image $A^{(i)}$ of an individual is closer to the Positive $P^{(i)}$ than to the Negative image $N^{(i)}$ by at least a margin $\alpha$:
You would thus like to minimize the following “triplet cost”:
Here, we are using the notation “$[z]_+$” to denote $\max(z, 0)$.
Notes:
- The term (1) is the squared distance between the anchor “A” and the positive “P” for a given triplet; you want this to be small.
- The term (2) is the squared distance between the anchor “A” and the negative “N” for a given triplet, you want this to be relatively large, so it thus makes sense to have a minus sign preceding it.
- $\alpha$ is called the margin. It is a hyperparameter that you should pick manually. We will use $\alpha = 0.2$.
Most implementations also normalize the encoding vectors to have norm equal to one (i.e., $\lVert f(img) \rVert_2 = 1$).
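The triplet cost described by the notes above can be sketched in NumPy. This assumes the cost is the sum over triplets of $[\lVert f(A) - f(P) \rVert_2^2 - \lVert f(A) - f(N) \rVert_2^2 + \alpha]_+$, with encodings stored as rows; the function names are illustrative.

```python
import numpy as np


def l2_normalize(v):
    """Scale each encoding (row) to unit L2 norm."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Sum over the batch of [ ||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha ]_+ ."""
    pos_dist = np.sum((f_a - f_p) ** 2, axis=-1)   # term (1): anchor to positive
    neg_dist = np.sum((f_a - f_n) ** 2, axis=-1)   # term (2): anchor to negative
    return float(np.sum(np.maximum(pos_dist - neg_dist + alpha, 0.0)))


# Easy triplet: the positive equals the anchor and the negative is far away.
anchor = np.array([[1.0, 0.0]])
easy = triplet_loss(anchor, np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]]))

# Degenerate triplet: the negative is encoded exactly like the anchor.
hard = triplet_loss(anchor, np.array([[1.0, 0.0]]), np.array([[1.0, 0.0]]))

# Random 128-dimensional encodings, normalized as described above.
rng = np.random.default_rng(0)
f_a, f_p, f_n = (l2_normalize(rng.standard_normal((5, 128))) for _ in range(3))
batch = triplet_loss(f_a, f_p, f_n)
print(easy, hard, batch)
```

The easy triplet already satisfies the margin and contributes nothing, while the degenerate one is penalized by exactly the margin.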
How do we choose triplets to train on?
Some big companies have already trained networks on large amount of photos so you may just want to reuse their weights.
The triplet loss is one good way to learn the parameters of a ConvNet for face recognition. There’s another way to learn these parameters: take this pair of neural networks, this Siamese network, have them both compute these embeddings, and then feed the embeddings to a logistic regression unit that makes a prediction. The target output will be one if both images are of the same person, and zero if they are of different persons. So this is a way to treat face recognition as just a binary classification problem.
Rather than just feed in the encoding, the input of the final logistic regression unit will be the differences between the encodings. So, this will be one pretty useful way to learn to predict zero or one whether these are the same person or different persons.
One computational trick can help deployment significantly. If this is the new image, then instead of computing the embedding of the database image every single time, you can pre-compute it. When a new employee walks in, you use the upper network to compute the encoding of the new image, compare it to your pre-computed encoding, and use that to make a prediction. This means you don’t need to store the raw images, and if you have a very large database of employees, you don’t need to compute these encodings every single time for every employee in the database. This idea of pre-computing some of these encodings can save significant computation. This type of pre-computation works both for the Siamese network architecture, where you treat face recognition as a binary classification problem, and for learning encodings with the triplet loss function described earlier.
To treat face verification as supervised learning, you create a training set of pairs of images where the target label is one when the pair shows the same person and zero when the pictures are of different persons, and you use different pairs to train the Siamese network using back propagation.
About the Course
This course will teach you how to build models for natural language, audio, and other sequence data. Thanks to deep learning, sequence algorithms are working far better than just two years ago, and this is enabling numerous exciting applications in speech recognition, music synthesis, chatbots, machine translation, natural language understanding, and many others.
You will:
Recurrent Neural Networks (RNN) are very effective for Natural Language Processing and other sequence tasks because they have “memory”. They can read inputs $x^{\langle t \rangle}$ (such as words) one at a time, and remember some information/context through the hidden layer activations that get passed from one time-step to the next. This allows a uni-directional RNN to take information from the past to process later inputs. A bidirectional RNN can take context from both the past and the future.
Superscript $[l]$ denotes an object associated with the $l^{th}$ layer.
Superscript $(i)$ denotes an object associated with the $i^{th}$ example.
Superscript $\langle t \rangle$ denotes an object at the $t^{th}$ time-step.
Subscript $i$ denotes the $i^{th}$ entry of a vector.
Uni-directional RNN architecture
With the example of sentence input:
Back propagation requires doing computations or passing messages in the opposite direction.
RNN-cell’s backward pass. Just like in a fully-connected neural network, the derivative of the cost function $J$ backpropagates through the RNN by following the chain rule from calculus. The chain rule is also used to calculate $\left(\frac{\partial J}{\partial W_{ax}}, \frac{\partial J}{\partial W_{aa}}, \frac{\partial J}{\partial b}\right)$ to update the parameters $(W_{ax}, W_{aa}, b_a)$.
A language model estimates the probability of a particular sequence of words.
Build language model:
1. Tokenize sentence
2. Index tokenized words
3. Build RNN to model the probability of different sequences.
Terminology:
Generate a randomly chosen sentence from an RNN language model:
Use np.random.choice to sample the first word according to the distribution defined by the output vector of probabilities.
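The sampling loop can be sketched as follows. A real RNN is replaced here by a toy table of logits, so `next_word_probs`, `logit_table` and the five-word vocabulary are hypothetical stand-ins; only the use of np.random.choice comes from the notes.

```python
import numpy as np

vocab = ["<EOS>", "cats", "dogs", "sleep", "bark"]


def next_word_probs(prev_index, logit_table):
    """Stand-in for the RNN softmax output given the previous word."""
    logits = logit_table[prev_index]
    exp = np.exp(logits - logits.max())     # numerically stable softmax
    return exp / exp.sum()


def sample_sentence(logit_table, max_len=10, seed=0):
    np.random.seed(seed)
    prev, words = 0, []                     # index 0 also serves as the start token here
    for _ in range(max_len):
        probs = next_word_probs(prev, logit_table)
        # Sample from the distribution instead of taking the most likely word.
        prev = np.random.choice(len(vocab), p=probs)
        if prev == 0:                       # sampled <EOS>: the sentence is finished
            break
        words.append(vocab[prev])
    return words


logit_table = np.random.default_rng(42).standard_normal((len(vocab), len(vocab)))
sentence = sample_sentence(logit_table)
print(" ".join(sentence))
```

Each sampled word is fed back in as the input for the next step, and generation stops at the end-of-sentence token or after `max_len` words.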
Depending on the application, a character-level RNN can also be built. In that case, the vocabulary will be the individual characters.
Pros:
Cons:
The vanishing gradient problem is that in some cases, the gradient will be vanishingly small, effectively preventing the weight from changing its value. In the worst case, this may completely stop the neural network from further training.
This RNN will work well enough for some applications, but it suffers from vanishing gradient problems. So it works best when each output $y^{\langle t \rangle}$ can be estimated using mainly “local” context (meaning information from inputs $x^{\langle t' \rangle}$ where $t'$ is not too far from $t$).
Exploding gradients can be addressed by gradient clipping, e.g. every element of the gradient vector is clipped to lie within some range [-N, N], but vanishing gradients take more work to address.
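Element-wise clipping can be sketched in a few lines of NumPy. The dictionary of gradients and its key names are illustrative assumptions.

```python
import numpy as np


def clip_gradients(gradients, max_value):
    """Clip every element of each gradient array to the range [-max_value, max_value]."""
    return {name: np.clip(grad, -max_value, max_value)
            for name, grad in gradients.items()}


grads = {
    "dWaa": np.array([[12.0, -0.5], [-30.0, 3.0]]),
    "db": np.array([7.0, -7.0]),
}
clipped = clip_gradients(grads, max_value=5.0)
print(clipped["dWaa"])
print(clipped["db"])
```

Values already inside the range pass through unchanged; only the exploding entries are capped.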
The Gated Recurrent Unit (GRU) is a modification to the RNN hidden layer that makes it much better at capturing long-range connections and helps a lot with the vanishing gradient problem.
About the gates
Update gate
Once we forget that the subject being discussed is singular, we need to find a way to update it to reflect that the new subject is now plural. Here is the formula for the update gate:
Updating the cell
To update the new subject we need to create a new vector of numbers that we can add to our previous cell state. The equation we use is:
Bidirectional RNNs let you take information from both earlier and later in the sequence.
Two forward propagations in different directions:
The bidirectional RNN is a modification that can be applied to the basic RNN architecture, the GRU, or the LSTM.
- This change lets you make predictions anywhere, even in the middle of a sequence, by taking into account information potentially from the entire sequence.
Cons:
- Need the entire sequence of data before making predictions anywhere.
Natural language processing with deep learning is an important combination. Using word vector representations and embedding layers you can train recurrent neural networks with outstanding performances in a wide variety of industries. Examples of applications are sentiment analysis, named entity recognition and machine translation.
What you should remember:
- If you have an NLP task where the training set is small, using word embeddings can help your algorithm significantly. Word embeddings allow your model to work on words in the test set that may not even have appeared in your training set.
- Training sequence models in Keras (and in most other deep learning frameworks) requires a few important details:
- To use mini-batches, the sequences need to be padded so that all the examples in a mini-batch have the same length.
- An Embedding() layer can be initialized with pretrained values. These values can be either fixed or trained further on your dataset. If however your labeled dataset is small, it’s usually not worth trying to train a large pre-trained set of embeddings.
- LSTM() has a flag called return_sequences to decide if you would like to return every hidden state or only the last one.
- You can use Dropout() right after LSTM() to regularize your network.
Word embeddings are a way of representing words that lets your algorithms automatically understand analogies such as man is to woman as king is to queen, and many other examples. Through these ideas of word embeddings, you’ll be able to build NLP applications even with relatively small labeled training sets.
One of the weaknesses of the one-hot representation is that it treats each word as a thing unto itself, and it doesn’t allow an algorithm to easily generalize across words.
So, instead of a one-hot representation, we can learn a featurized representation: for each of these words, we learn a set of features and their values.
To get word embeddings, we just need to learn high-dimensional feature vectors like these, which can generalize much better than one-hot vectors for representing different words.
One common algorithm for visualizing high-dimensional data is the t-SNE algorithm. In the resulting plot, similar words tend to be grouped together.
An early but successful algorithm:
Example: predict the next word in the sequence
Term:
Steps:
The parameters of this model will be the matrix $E$, and the same matrix $E$ is used for all the words.
Use backpropagation to perform gradient descent, maximizing the likelihood of the training set: repeatedly predict, given four words in a sequence, what the next word in the text corpus is.
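How the shared matrix $E$ is used can be sketched as follows. The sizes (a 10,000-word vocabulary and 300-dimensional embeddings), the one-column-per-word layout, and the word indices are assumptions for illustration, and the random values stand in for learned parameters.

```python
import numpy as np

vocab_size, embed_dim = 10_000, 300
rng = np.random.default_rng(0)
E = rng.standard_normal((embed_dim, vocab_size))   # one column per word


def one_hot(index, size):
    o = np.zeros(size)
    o[index] = 1.0
    return o


j = 6257                                     # hypothetical index of some word
e_via_matmul = E @ one_hot(j, vocab_size)    # multiply E by the one-hot vector
e_via_lookup = E[:, j]                       # same vector, obtained by a column lookup

# A four-word context: look up each word in the SAME matrix E and concatenate.
context_indices = [4343, 9665, 1, 3852]      # hypothetical word indices
context = np.concatenate([E[:, i] for i in context_indices])
print(e_via_lookup.shape, context.shape)
```

In practice the lookup form is what gets used, since multiplying by a mostly-zero one-hot vector is wasteful; `context` is what would be fed to the layers that predict the next word.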
Summary:
The language modeling problem poses a machine learning problem in which you input a context, such as the last four words, and predict a target word; posing the problem this way allows you to learn a good word embedding.
Word2Vec algorithm is a simpler and computationally more efficient way to learn word embeddings.
In the skip-gram model, we need to come up with a few context-to-target pairs to create a supervised learning problem: randomly pick a word to be the context word, then randomly pick another word within some window, e.g. plus or minus five or ten words of the context word, to be the target word.
We’ll set up a supervised learning problem where given the context word, you’re asked to predict what is a randomly chosen word within say, a plus minus ten word window of that input context word.
This is called the skip-gram model because is taking as input one word like orange and then trying to predict some words skipping a few words from the left or the right side. To predict what comes little bit before little bit after the context words.
To sample context, in practice the distribution of words isn’t taken just entirely uniformly at random for the training set purpose, but instead there are different heuristics that you could use in order to balance out something from the common words together with the less common words.
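The pair generation described above can be sketched like this. The function name, the window of five, and the uniform choice of the context position are assumptions for illustration; as noted, real implementations rebalance frequent and rare context words.

```python
import random

def skipgram_pairs(tokens, window=5, num_pairs=8, seed=0):
    """Sample (context, target) pairs; the target lies within +/- window of the context."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        c = rng.randrange(len(tokens))  # random context position (uniform, a simplification)
        lo, hi = max(0, c - window), min(len(tokens) - 1, c + window)
        candidates = [p for p in range(lo, hi + 1) if p != c]
        t = rng.choice(candidates)      # random target position inside the window
        pairs.append((tokens[c], tokens[t]))
    return pairs

sentence = "I want a glass of orange juice to go along with my cereal".split()
pairs = skipgram_pairs(sentence, window=5)
for context, target in pairs:
    print(context, "->", target)
```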
Negative sampling allows to do something similar to the Skip-Gram model, but with a much more efficient learning algorithm.
Generate dataset:
Pick a context word and then pick a target word; that gives us a positive example. Then, $k$ times, take the same context word and pick a random word from the dictionary; those are the negative examples.
Then we create a supervised learning problem where the learning algorithm takes as input x, which is a pair of words, and predicts the target label.
So the problem is really given a pair of words like orange and juice, do you think they appear together?
Instead of having one giant Softmax, which is very expensive to compute, we turn it into 10,000 binary classification problems, each of which is quite cheap to compute.
On every iteration, we only train $k + 1$ of them: $k$ negative examples and one positive example. This is why the computation cost of this algorithm is much lower: you’re updating $k + 1$ binary classification problems, which is relatively cheap to do on every iteration, rather than updating a giant softmax classifier.
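As a rough sketch of the dataset generation, the helper below builds the $k + 1$ labeled examples for one context word. The function name and the uniform choice of noise words are assumptions for illustration; a sampled noise word may occasionally coincide with the true target, which this sketch does not filter out.

```python
import random

def negative_sampling_examples(context, target, vocab, k=4, seed=0):
    """Return one positive example and k negative examples for a context word."""
    rng = random.Random(seed)
    examples = [((context, target), 1)]       # observed pair: label 1
    for _ in range(k):
        noise_word = rng.choice(vocab)        # random dictionary word: label 0
        examples.append(((context, noise_word), 0))
    return examples

vocab = ["a", "book", "juice", "king", "of", "orange", "the"]
examples = negative_sampling_examples("orange", "juice", vocab, k=4)
for pair, label in examples:
    print(pair, label)
```

Only these five binary classifiers would be updated for this training step, instead of a softmax over the whole vocabulary.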
Empirically, sampling negative words in proportion to their observed frequency raised to the power of three-fourths works well. This is somewhere in between the extreme of the uniform distribution and the other extreme of whatever distribution was observed in your training set.
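The three-fourths heuristic corresponds to sampling word $w_i$ with probability $f(w_i)^{3/4} / \sum_j f(w_j)^{3/4}$, where $f$ is the observed frequency. A small sketch, with made-up word counts:

```python
def negative_sampling_distribution(counts, power=0.75):
    """Raise each count to the given power and renormalize to a distribution."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

counts = {"the": 1000, "orange": 10, "durian": 1}  # made-up counts
observed = {w: c / sum(counts.values()) for w, c in counts.items()}
smoothed = negative_sampling_distribution(counts)
uniform = 1 / len(counts)

for w in counts:
    print(w, round(observed[w], 4), round(smoothed[w], 4), round(uniform, 4))
```

Frequent words are sampled less often than their raw frequency suggests and rare words more often, without going all the way to uniform.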
GloVe stands for global vectors for word representation.
$X_{ij}$ is a count that captures how often words $i$ and $j$ appear with each other, or close to each other.
It measures how related words $i$ and $j$ are by how often they occur with each other.
Solve for the parameters $\theta$ and $e$ using gradient descent, minimizing the sum over the training set, to learn vectors whose inner product is a good predictor of how often the two words occur together.
Functions of the weighting term $f$:
In the GloVe algorithm, the roles of $\theta$ and $e$ are completely symmetric: they end up with the same optimization objective. One way to train the algorithm is to initialize $\theta$ and $e$ both uniformly at random, run gradient descent to minimize the objective, and then, for every word, take the average of the two vectors.
The individual components of the embeddings are not guaranteed to be interpretable.
In particular, the first feature might be a combination of gender, and royal, and age, and food, and cost, and size, is it a noun or an action verb, and all the other features. It’s very difficult to look at individual components, individual rows of the embedding matrix, and assign a human interpretation to them.
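A minimal sketch of the GloVe objective, $\sum_{i,j} f(X_{ij}) (\theta_i^T e_j + b_i + b'_j - \log X_{ij})^2$, is given below. The cutoff `x_max=100` and exponent `alpha=0.75` of the weighting term, the toy count matrix, and the parameter sizes are assumptions for illustration.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting term f: zero when X_ij = 0, capped at 1 for very frequent pairs."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, theta, e, b_theta, b_e):
    """Weighted squared error between theta_i . e_j (+ biases) and log X_ij."""
    pred = theta @ e.T + b_theta[:, None] + b_e[None, :]
    # log(0) is undefined; those entries get weight f(0) = 0, so mask them out.
    log_X = np.log(np.where(X > 0, X, 1.0))
    return float(np.sum(glove_weight(X) * (pred - log_X) ** 2))

rng = np.random.default_rng(0)
X = np.array([[0.0, 3.0, 1.0], [3.0, 0.0, 8.0], [1.0, 8.0, 0.0]])  # toy co-occurrence counts
theta = rng.uniform(-0.5, 0.5, size=(3, 2))
e = rng.uniform(-0.5, 0.5, size=(3, 2))
loss = glove_loss(X, theta, e, np.zeros(3), np.zeros(3))

# Because theta and e play symmetric roles, a common final embedding is their average.
e_final = (e + theta) / 2
```

The weighting term keeps pairs that never co-occur out of the sum and stops very frequent pairs from dominating it.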
Sentiment classification is the task of looking at a piece of text and telling if someone likes or dislikes the thing they’re talking about. It is one of the most important building blocks in NLP and is used in many applications.
One of the challenges of sentiment classification is that you might not have a huge labeled training set for it. But with word embeddings, you’re able to build good sentiment classifiers even with only modest-size labeled training sets.
Reducing or eliminating bias in learning algorithms is a very important problem because these algorithms are being asked to help with or to make more and more important decisions in society.
Sequence models can be augmented using an attention mechanism. This algorithm will help your model understand where it should focus its attention given a sequence of inputs.
Here’s what you should remember from this notebook:
E.g. translate French to English:
The encoder network finds an encoding of the input French sentence, and a decoder network then generates the corresponding English translation.
If you get rid of AlexNet’s final softmax unit, the pre-trained AlexNet gives a 4096-dimensional feature vector that represents this picture of a cat. So this pre-trained network can serve as the encoder network for the image. You can then feed the 4096-dimensional vector to an RNN, whose job it is to generate the caption one word at a time.
Generating a sequence differs from language modeling. One of the key differences is that you don’t want a randomly chosen translation but the most likely translation, and you don’t want a randomly chosen caption but the best and most likely caption.
The machine translation model is very similar to the language model, except that instead of always starting with a vector of all zeros, it has an encoder network that computes a representation of the input sentence, and the decoder network starts from that representation rather than from all zeros.
So, that’s why I call this a conditional language model: instead of modeling the probability of any sentence, it models the probability of the output English translation conditioned on some input French sentence. In other words, you’re trying to estimate the probability of an English translation given the French input.
So, when you use this model for machine translation, you’re not trying to sample at random from this distribution. Instead, you would like to find the English sentence $y$ that maximizes that conditional probability.
Summary:
One major difference between machine translation and the language modeling problems is rather than wanting to generate a sentence at random, you may want to try to find the most likely English translation. But the set of all English sentences of a certain length is too large to exhaustively enumerate. So, we have to resort to a search algorithm.
Beam search is the most widely used algorithm for tasks like outputting the best, most likely English translation.
Whereas greedy search picks only the single most likely word at each step and moves on, beam search considers multiple alternatives.
Steps:
1. Run the input French sentence through the encoder network; the first decoder step then produces a softmax output over all 10,000 possible first words. Keep in memory the top $B$ of those 10,000 outputs.
2. For each of the $B$ kept first words, evaluate all possible second words, then select and keep the top $B$ most likely pairs of first two words.
3. Repeat the previous step until the end of the sentence.
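The steps above can be sketched as follows. The function `toy_next_probs` is a hypothetical stand-in for the decoder's softmax output, with made-up probabilities chosen so that greedy search (beam width 1) misses the most likely sequence. Scores are sums of log-probabilities, and `<eos>` is an assumed end-of-sentence token.

```python
import math

def beam_search(next_probs, beam_width=3, max_len=10, eos="<eos>"):
    """Keep the beam_width most probable partial sequences at every step."""
    beams = [((), 0.0)]  # (tokens so far, sum of log-probabilities)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, p in next_probs(tokens).items():
                if p > 0:
                    candidates.append((tokens + (tok,), score + math.log(p)))
        # Keep only the top B candidates across all expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

def toy_next_probs(prefix):
    """Hypothetical decoder output: next-token probabilities given a prefix."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "<eos>": 0.1},
    }
    return table.get(prefix, {"<eos>": 1.0})

greedy = beam_search(toy_next_probs, beam_width=1)
wide = beam_search(toy_next_probs, beam_width=2)
print(greedy[0], math.exp(greedy[1]))
print(wide[0], math.exp(wide[1]))
```

Greedy search commits to "a" because it is the most likely first word, while a beam of width 2 keeps "b" alive and finds the sequence with the higher overall probability.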
Instead of multiplying probabilities, sum their $\log$ values to get a more numerically stable algorithm that is less prone to numerical rounding errors.
The original objective function has an undesirable effect: it tends to prefer unnaturally short outputs, because the probability of a sentence is a product of numbers less than 1, and a short sentence multiplies fewer of them. Use a length-normalized log-probability objective to solve this problem.
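A small sketch of the effect of length normalization: the log-probability is divided by $T_y^{\alpha}$, where $T_y$ is the output length. The exponent `alpha=0.7` is a heuristic value assumed here, and the per-step probabilities are made up.

```python
import math

def log_prob(step_probs):
    """Sum of log-probabilities, equivalent to the log of their product."""
    return sum(math.log(p) for p in step_probs)

def normalized_log_prob(step_probs, alpha=0.7):
    """Divide by Ty**alpha to reduce the penalty on longer outputs."""
    return log_prob(step_probs) / (len(step_probs) ** alpha)

short = [0.4, 0.4]            # made-up per-step probabilities, 2 words
long = [0.6, 0.6, 0.6, 0.6]   # higher per-step confidence, 4 words

print(log_prob(short), log_prob(long))
print(normalized_log_prob(short), normalized_log_prob(long))
```

The unnormalized objective ranks the short output higher even though each of its steps is less confident; after normalization the longer output wins.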
Error analysis can help you focus your time on the most useful work for your project. Beam search is an approximate search algorithm, also called a heuristic search algorithm, so it doesn’t always output the most likely sentence: it only keeps track of the top $B$ = 3, 10, or 100 possibilities. In this module, you’ll learn how error analysis interacts with beam search, and how to figure out whether it is the beam search algorithm or your RNN model that is causing problems and is worth spending time on.
To decide whether the problem lies in the search algorithm or in the RNN model, use your RNN model to compute $P(y^*|x)$, the probability of the human translation, and $P(\hat{y}|x)$, the probability of the beam search output, and see which of the two is bigger. If $P(y^*|x) > P(\hat{y}|x)$, beam search failed to find the more likely sentence and is at fault; otherwise the RNN model is at fault.
During error analysis, you go through the development set and find the mistakes the algorithm made. For every example in your dev set where the algorithm gives a much worse output than the human translation, you try to ascribe the error either to the search algorithm or to the RNN model that generates the objective function beam search is supposed to be maximizing. Through this process, you can figure out what fraction of errors are due to beam search versus the RNN model, and which of these two components is responsible for more errors.
If you find that beam search is responsible for a lot of errors, then maybe it is worth working hard to increase the beam width.
If you find that the RNN model is at fault, then you could do a deeper layer of analysis to try to figure out if you want to add regularization, or get more training data, or try a different network architecture, or something else.
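The attribution rule can be sketched as a tally over dev-set mistakes. The probabilities below are made up; in practice both values come from scoring the human translation $y^*$ and the beam search output $\hat{y}$ with the same RNN model.

```python
from collections import Counter

def attribute_error(p_human, p_beam):
    """p_human = P(y*|x), p_beam = P(y_hat|x), both computed by the RNN model."""
    if p_human > p_beam:
        # The model prefers the human translation, but search did not find it.
        return "beam search"
    # The model itself scores the worse translation at least as high.
    return "RNN model"

# Made-up (P(y*|x), P(y_hat|x)) pairs for three dev-set mistakes.
dev_errors = [(2e-10, 1e-10), (1e-11, 3e-10), (5e-12, 4e-11)]
tally = Counter(attribute_error(h, b) for h, b in dev_errors)
print(tally)
```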
One of the challenges of machine translation is that, given a French sentence, there could be multiple English translations that are equally good translations of that French sentence. The BLEU score is a measure that addresses this problem.
BLEU, or the Bilingual Evaluation Understudy, is a score for comparing a candidate translation of text to one or more reference translations.
Although developed for translation, it can be used to evaluate text generated for a suite of natural language processing tasks.
The clipped count of an n-gram is its count in the candidate translation, capped at the maximum number of times it appears in any single reference translation.
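A sketch of the clipped (modified) n-gram precision follows. It covers only this one ingredient of BLEU; the full score also combines several n-gram orders and applies a brevity penalty, which are omitted here. The function names are chosen for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Sum of clipped n-gram counts divided by the total candidate n-gram count."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
degenerate = "the the the the the the the".split()
candidate = "the cat the cat on the mat".split()

print(modified_precision(degenerate, references, 1))  # "the" is clipped to 2 of 7
print(modified_precision(candidate, references, 2))   # 4 of 6 bigrams survive clipping
```

Without clipping, the degenerate candidate would get a unigram precision of 7/7, because every one of its words appears in a reference.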
It’s difficult for encoder-decoder network to memorize a super long sentence. If you had to translate a book’s paragraph from French to English, you would not read the whole paragraph, then close the book and translate. Even during the translation process, you would read/re-read and focus on the parts of the French paragraph corresponding to the parts of the English you are writing down.
The attention mechanism tells a Neural Machine Translation model where it should pay attention to at any step.
Attention was originally developed for machine translation, but it has spread to many other application areas as well.
The attention model computes a set of attention weights.
The diagram on the left shows the attention model. The diagram on the right shows what one “Attention” step does to calculate the attention variables $\alpha^{\langle t, t' \rangle}$, which are used to compute the context variable $context^{\langle t \rangle}$ for each timestep in the output ($t = 1, \ldots, T_y$).
There are two separate LSTMs in this model (see diagram on the left). Because the one at the bottom of the picture is a Bi-directional LSTM and comes before the attention mechanism, we will call it pre-attention Bi-LSTM. The LSTM at the top of the diagram comes after the attention mechanism, so we will call it the post-attention LSTM. The pre-attention Bi-LSTM goes through $T_x$ time steps; the post-attention LSTM goes through $T_y$ time steps.
The post-attention LSTM passes $s^{\langle t \rangle}, c^{\langle t \rangle}$ from one time step to the next. In the lecture videos, we were using only a basic RNN for the post-attention sequence model, so the state was captured by the RNN output activations $s^{\langle t \rangle}$. But since we are using an LSTM here, the LSTM has both the output activation $s^{\langle t \rangle}$ and the hidden cell state $c^{\langle t \rangle}$. However, unlike previous text generation examples (such as Dinosaurus in week 1), in this model the post-attention LSTM at time $t$ does not take the specific generated $y^{\langle t-1 \rangle}$ as input; it only takes $s^{\langle t \rangle}$ and $c^{\langle t \rangle}$ as input. We have designed the model this way because (unlike language generation, where adjacent characters are highly correlated) there isn’t as strong a dependency between the previous character and the next character in a YYYY-MM-DD date.
We use $a^{\langle t \rangle} = [\overrightarrow{a}^{\langle t \rangle}; \overleftarrow{a}^{\langle t \rangle}]$ to represent the concatenation of the activations of both the forward and backward directions of the pre-attention Bi-LSTM.
The diagram on the right uses a RepeatVector node to copy the value of $s^{\langle t-1 \rangle}$ $T_x$ times, and then Concatenation to concatenate $s^{\langle t-1 \rangle}$ and $a^{\langle t \rangle}$ to compute $e^{\langle t, t' \rangle}$, which is then passed through a softmax to compute $\alpha^{\langle t, t' \rangle}$.
$a^{\langle t \rangle}$ is a vector and $\alpha^{\langle t, t' \rangle}$ is a scalar, so the attention layer ends up with a $context^{\langle t \rangle}$ of dimension $1 \times n$ for each $t$ in $1, \ldots, T_y$.
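One attention step can be sketched in NumPy as below. The one-hidden-layer scoring network (`W1`, `b1`, `w2`), the toy sizes, and the random values are assumptions for illustration; the repeat, concatenate, softmax, and weighted-sum steps follow the description above.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - x.max())
    return z / z.sum()

def attention_step(a, s_prev, W1, b1, w2):
    """a: (Tx, 2*n_a) Bi-LSTM activations; s_prev: (n_s,) previous post-attention state."""
    Tx = a.shape[0]
    s_rep = np.tile(s_prev, (Tx, 1))             # RepeatVector: copy s_prev Tx times
    concat = np.concatenate([a, s_rep], axis=1)  # Concatenate along the feature axis
    e = np.tanh(concat @ W1 + b1) @ w2           # energies e<t,t'>, shape (Tx,)
    alphas = softmax(e)                          # attention weights, sum to 1 over t'
    context = alphas @ a                         # weighted sum of activations, shape (2*n_a,)
    return context, alphas

rng = np.random.default_rng(0)
Tx, n_a, n_s, hidden = 5, 3, 4, 10               # toy sizes (assumed)
a = rng.normal(size=(Tx, 2 * n_a))
s_prev = rng.normal(size=n_s)
W1 = rng.normal(size=(2 * n_a + n_s, hidden))
b1 = np.zeros(hidden)
w2 = rng.normal(size=hidden)

context, alphas = attention_step(a, s_prev, W1, b1, w2)
print(alphas, context.shape)
```

Because the weights are non-negative and sum to 1, the context vector is a convex combination of the Bi-LSTM activations and has the same dimension as each of them.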
From audio recordings to spectrograms
What really is an audio recording? A microphone records little variations in air pressure over time, and it is these little variations in air pressure that your ear also perceives as sound. You can think of an audio recording as a long list of numbers measuring the little air pressure changes detected by the microphone. We will use audio sampled at 44100 Hz (or 44100 Hertz). This means the microphone gives us 44100 numbers per second. Thus, a 10 second audio clip is represented by 441000 numbers (= 10 × 44100).
It is quite difficult to figure out from this “raw” representation of audio whether the word “activate” was said. In order to help your sequence model more easily learn to detect trigger words, we will compute a spectrogram of the audio. The spectrogram tells us how much different frequencies are present in an audio clip at a moment in time.
(If you’ve ever taken an advanced class on signal processing or on Fourier transforms, a spectrogram is computed by sliding a window over the raw audio signal and calculating the most active frequencies in each window using a Fourier transform. If you don’t understand the previous sentence, don’t worry about it.)
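For readers who do want to see the sliding-window idea, here is a minimal NumPy sketch. The window size, hop length, Hann window, and the synthetic 440 Hz test tone are assumptions for illustration, not the settings used in the course notebook.

```python
import numpy as np

def spectrogram(signal, window_size=1024, hop=512):
    """Return magnitudes with shape (num_windows, window_size // 2 + 1)."""
    window = np.hanning(window_size)
    frames = [
        signal[start:start + window_size] * window
        for start in range(0, len(signal) - window_size + 1, hop)
    ]
    # One Fourier transform per window: rows are time steps, columns are frequencies.
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

sample_rate = 44100
t = np.arange(sample_rate) / sample_rate   # one second of audio
tone = np.sin(2 * np.pi * 440.0 * t)       # synthetic 440 Hz tone
spec = spectrogram(tone)

# Map frequency bins back to Hz and find the most active frequency on average.
freqs = np.fft.rfftfreq(1024, d=1 / sample_rate)
peak_hz = freqs[spec.mean(axis=0).argmax()]
print(spec.shape, peak_hz)
```

The strongest frequency bin lands within one bin width (about 43 Hz at these settings) of the 440 Hz tone, which is the kind of structure a sequence model can pick up far more easily than from raw samples.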
Trigger word detection is the technology that allows devices like Amazon Alexa, Google Home, Apple Siri, and Baidu DuerOS to wake up upon hearing a certain word.