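This series adds personal study notes and supplementary derivations to selected points of the original course. If you find any errors, corrections are welcome. After studying Andrew Ng's course, I organized it into text so that it is easier to look up and review. Because I have been learning English, this series is written mainly in English, and I recommend that readers also rely mainly on the English with the Chinese as a supplement, which lays the groundwork for reading academic papers in the field later on. - ZJ
Coursera course | deeplearning.ai | NetEase Cloud Classroom (网易云课堂)
When reposting, please credit the author and source: ZJ, WeChat official account 「SelfImprovementLab」
Zhihu: https://zhuanlan.zhihu.com/c_147249273
CSDN: http://blog.csdn.net/junjun_zhao/article/details/79182929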
2.8 Multi-task learning (多任务学习)
(Subtitle source: NetEase Cloud Classroom)
So whereas in transfer learning you have a sequential process, where you learn from task A and then transfer that to task B, in multi-task learning you start off simultaneously, trying to have one neural network do several things at the same time. And then each of these tasks hopefully helps all of the other tasks. Let's look at an example. Let's say you're building an autonomous vehicle, building a self-driving car. Then your self-driving car would need to detect several different things, such as pedestrians, other cars, stop signs, traffic lights, and also other things. So for example, in this example on the left, there is a stop sign in this image and there is a car in this image, but there aren't any pedestrians or traffic lights.
在迁移学习中 你的步骤是串行的,你从任务 A 里学到知识 然后迁移到任务 B,在多任务学习中 你是同时开始学习的,试图让单个神经网络同时做几件事情,然后希望这里每个任务都能帮到其他所有任务,我们来看一个例子,假设你在研发无人驾驶车辆,那么你的无人驾驶车可能需要同时检测不同的物体,比如检测行人 车辆 停车标志,还有交通灯 各种其他东西,比如在左边这个例子中 图像里有个停车标志,然后图像中有辆车 但没有行人也没有交通灯。
So if this image is an input example $x^{(i)}$, then instead of having one label $y^{(i)}$, you would actually have four labels. In this example, there are no pedestrians, there is a car, there is a stop sign and there are no traffic lights. And if you try to detect other things, then maybe $y^{(i)}$ has even more dimensions. But for now let's stick with these four. So $y^{(i)}$ is a $4 \times 1$ vector. And if you look at the training set labels as a whole, then similar to before, we'll stack the training data's labels horizontally as follows, $y^{(1)}$ up to $y^{(m)}$. Except that now $y^{(i)}$ is a $4 \times 1$ vector, so each of these is a tall column vector. And so this matrix $Y$ is now a $4 \times m$ matrix, whereas previously, when $y$ was a single real number, this would have been a $1 \times m$ matrix.
如果这是输入图像 $x^{(i)}$,那么这里不再是一个标签 $y^{(i)}$ 而是有 4 个标签,在这个例子中 没有行人 有一辆车,有一个停车标志 没有交通灯,然后如果你尝试检测其他物体,也许 $y^{(i)}$ 的维数会更高,现在我们就先用 4 个吧,所以 $y^{(i)}$ 是个 $4 \times 1$ 向量,如果你从整体来看这个训练集标签,和以前类似 我们将训练集的标签水平堆叠起来,像这样 $y^{(1)}$ 一直到 $y^{(m)}$,不过现在 $y^{(i)}$ 是 $4 \times 1$ 向量 所以这些都是竖向的列向量,所以这个矩阵 $Y$ 现在变成 $4 \times m$ 矩阵 而之前,当 $y$ 是单实数时 这就是 $1 \times m$ 矩阵。
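To make the shape of the label matrix concrete, here is a minimal sketch in plain Python. The label order (pedestrian, car, stop sign, traffic light) follows the lecture; the second and third examples and the helper name `stack_labels` are made up for illustration.

```python
# Hypothetical per-image labels, ordered as
# [pedestrian, car, stop sign, traffic light].
labels = [
    [0, 1, 1, 0],  # y^(1): a car and a stop sign, as in the lecture's image
    [1, 0, 0, 1],  # y^(2): made-up example
    [0, 0, 0, 0],  # y^(3): made-up example
]


def stack_labels(per_example):
    """Stack m label vectors (each of length 4) as the columns of a 4 x m matrix."""
    n_tasks = len(per_example[0])
    return [[y[j] for y in per_example] for j in range(n_tasks)]


Y = stack_labels(labels)
print(len(Y), len(Y[0]))  # number of rows (tasks), number of columns (examples)
```

Each column of `Y` is one $y^{(i)}$, so with three examples the matrix has 4 rows and 3 columns.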
So what you can do now is train a neural network to predict these values of $y$. So you can have a neural network that takes input $x$ and now outputs a four-dimensional value for $y$. Notice that for the output I've drawn four nodes. The first node tries to predict whether there is a pedestrian in this picture. The second output predicts whether there is a car, the third predicts whether there is a stop sign, and the fourth predicts whether there is a traffic light. So $\hat{y}$ here is four-dimensional. To train this neural network, you now need to define the loss for the neural network. Given a predicted output $\hat{y}^{(i)}$, which is $4 \times 1$ dimensional, the loss averaged over your entire training set would be 1 over $m$, sum from $i = 1$ through $m$, sum from $j = 1$ through 4 of the losses of the individual predictions. So it's just summing over the four components: pedestrian, car, stop sign, traffic light. And this script $\mathcal{L}$ is the usual logistic loss. So just to write this out, this is $-y_j^{(i)} \log \hat{y}_j^{(i)} - (1 - y_j^{(i)}) \log(1 - \hat{y}_j^{(i)})$.
那么你现在可以做的是训练一个神经网络 来预测这些 $y$ 值,你就得到这样的神经网络 输入 $x$,现在输出是一个四维向量 $y$,请注意 这里输出我画了四个节点,所以第一个节点,就是我们想预测图中有没有行人,然后第二个输出节点预测的是有没有车,这里预测有没有停车标志 这里预测有没有交通灯,所以这里 $\hat{y}$ 是四维的,要训练这个神经网络,你现在需要定义神经网络的损失函数,对于一个输出 $\hat{y}$ 是个 4 维向量,对于整个训练集的平均损失,就是 1 除以 $m$ 对 $i=1$ 到 $m$ 求和,从 $j=1$ 到 4 求和 这些单个预测的损失,所以这就是对四个分量的求和,行人 车 停车标志 交通灯,而这个标志 $\mathcal{L}$ 指的是 logistic 损失,我们就这么写,这是 $-y_j^{(i)} \log \hat{y}_j^{(i)} - (1 - y_j^{(i)}) \log(1 - \hat{y}_j^{(i)})$。
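The cost described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: predictions are assumed to lie strictly between 0 and 1, examples are stored as rows (one list of four values per example) rather than as columns of $Y$, and the function names are hypothetical.

```python
import math


def logistic_loss(y_hat, y):
    """Logistic loss for one prediction y_hat in (0, 1) and one label y in {0, 1}."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)


def multitask_cost(Y_hat, Y):
    """Average over m examples of the per-task logistic losses summed over the 4 tasks.

    Y_hat and Y are lists of m examples; each example is a list of 4 values.
    """
    m = len(Y)
    total = 0.0
    for y_hat_i, y_i in zip(Y_hat, Y):
        for y_hat_ij, y_ij in zip(y_hat_i, y_i):
            total += logistic_loss(y_hat_ij, y_ij)
    return total / m


# One example with labels [pedestrian, car, stop sign, traffic light]
# and a completely uninformative prediction of 0.5 for every task.
cost = multitask_cost([[0.5, 0.5, 0.5, 0.5]], [[0, 1, 1, 0]])
```

With every prediction at 0.5, each of the four terms contributes $\log 2$, so the cost for this single example is $4 \log 2$.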
And the main difference compared to the earlier cat classification examples is that you're now summing over $j$ equals 1 through 4. And the main difference between this and softmax regression is that, unlike softmax regression, which assigns a single label to a single example, here one image can have multiple labels. So you're not saying that each image is either a picture of a pedestrian, or a picture of a car, or a picture of a stop sign, or a picture of a traffic light. You're asking, for each picture, does it have a pedestrian, or a car, or a stop sign, or a traffic light, and multiple objects could appear in the same image. In fact, in the example on the previous slide, we had both a car and a stop sign in that image, but no pedestrians or traffic lights. So you're not assigning a single label to an image; you're going through the different classes and asking, for each of the classes, does that type of object appear in the image? So that's why I'm saying that with this setting, one image can have multiple labels. If you train a neural network to minimize this cost function, you are carrying out multi-task learning, because what you're doing is building a single neural network that is looking at each image and basically solving four problems. It's trying to tell you whether each image has each of these four objects in it. One other thing you could have done is just train four separate neural networks, instead of training one network to do four things. But if some of the earlier features in the neural network can be shared between these different types of objects, then you find that training one neural network to do four things results in better performance than training four completely separate neural networks to do the four tasks separately. So that's the power of multi-task learning.
和之前分类猫的例子主要区别在于,现在你要对 j=1 到 4 求和,这与 softmax 回归的主要区别在于,与 softmax 回归不同 softmax 将单个标签分配给单个样本,而这张图可以有很多不同的标签,所以不是说每张图都只是一张行人图片,汽车图片 停车标志图片 或者交通灯图片,你要知道每张照片是否有行人,或汽车 停车标志或交通灯,多个物体可能同时出现在一张图里,实际上 在上一张幻灯片中,那张图同时有车和停车标志,但没有行人和交通灯,所以你不是只给图片一个标签,而是需要遍历不同类型,然后看看每个类型,那类物体有没有出现在图中,所以我就说在这个场合,一张图可以有多个标签,如果你训练了一个神经网络 试图最小化这个成本函数,你做的就是多任务学习,因为你现在做的是建立单个神经网络,观察每张图 然后解决四个问题,系统试图告诉你 每张图里面有没有这四个物体,另外你也可以训练四个不同的神经网络,而不是训练一个网络做四件事情,但神经网络一些早期特征,在识别不同物体时都会用到,然后你发现,训练一个神经网络做四件事情,会比训练四个完全独立的神经网络分别做四件事性能要更好,这就是多任务学习的力量。
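The difference from softmax can be seen on a single set of output scores. The sketch below is illustrative only: the logits are made-up numbers, and the 0.5 decision threshold is an assumption rather than something stated in the lecture.

```python
import math


def sigmoid(z):
    """Independent probability that one object type is present."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    """Probabilities over mutually exclusive classes; they sum to 1."""
    shift = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up output scores for [pedestrian, car, stop sign, traffic light].
logits = [-2.0, 3.0, 2.5, -1.0]

independent = [sigmoid(z) for z in logits]  # multi-task: one yes/no per object
exclusive = softmax(logits)                 # softmax: exactly one label wins

present = [p > 0.5 for p in independent]
```

With independent sigmoid outputs, both the car and the stop sign can be reported as present in the same image, whereas the softmax probabilities compete with each other and must sum to 1.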
And one other detail: so far I've described this algorithm as if every image had every single label. It turns out that multi-task learning also works even if some of the images are labeled with only some of the objects. So for the first training example, let's say your labeler had told you there's a pedestrian, there's no car, but they didn't bother to label whether or not there's a stop sign or whether or not there's a traffic light. And maybe for the second example, there is a pedestrian, there is a car, but again the labeler, when they looked at that image, just didn't label whether it had a stop sign or whether it had a traffic light, and so on. And maybe some examples are fully labeled, and maybe for some examples they were just labeling the presence or absence of cars, so there are some question marks, and so on. So with a data set like this, you can still train your learning algorithm to do four tasks at the same time, even when some images have only a subset of the labels and the others are question marks or don't-cares.
另一个细节,到目前为止 我是这么描述算法的 好像每张图都有全部标签,事实证明 多任务学习,也可以处理图像只有部分物体被标记的情况,所以第一个训练样本 我们说有人,给数据贴标签的人告诉你里面有一个行人,没有车 但他们没有标记,是否有停车标志,或者是否有交通灯,也许第二个例子中 有行人 有车,,但是 当标记人看着那张图片时 他们没有加标签,没有标记是否有停车标志 是否有交通灯 等等,也许有些样本都有标记 但也许有些样本,他们只标记了有没有车,然后还有一些是问号,即使是这样的数据集 你也可以在上面训练算法,同时做四个任务 即使一些图像,只有一小部分标签,其他是问号 或者不管是什么。
And the way you train your algorithm, even when some of these labels are question marks or really unlabeled, is that in this sum over $j$ from 1 to 4, you would sum only over values of $j$ with a 0 or 1 label. So whenever there's a question mark, you just omit that term from the summation and sum over only the values where there is a label. And so that allows you to use data sets like this as well.
然后你训练算法的方式,即使这里有些标签是问号 或者没有标记,这就是对 j 从 1 到 4 求和,你就只对带 0 和 1 标签的j值求和,所以当有问号的时候,你就在求和时忽略那个项,这样只对有标签的值求和,于是你就能利用这样的数据集,那么多任务学习什么时候有意义呢?
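Omitting the unlabeled terms from the sum can be sketched as follows. Representing a question mark as `None` is an assumption made for this example, as is the function name; the division by $m$ follows the cost formula from the lecture.

```python
import math


def masked_multitask_cost(Y_hat, Y):
    """Multi-task logistic cost that skips unlabeled entries.

    Y_hat: m examples, each a list of 4 predictions in (0, 1).
    Y:     m examples, each a list of 4 labels in {0, 1, None},
           where None stands for a "?" (no label given).
    """
    m = len(Y)
    total = 0.0
    for y_hat_i, y_i in zip(Y_hat, Y):
        for y_hat_ij, y_ij in zip(y_hat_i, y_i):
            if y_ij is None:
                continue  # question mark: leave this term out of the sum
            total += -y_ij * math.log(y_hat_ij) - (1 - y_ij) * math.log(1 - y_hat_ij)
    return total / m


# Labels ordered [pedestrian, car, stop sign, traffic light]; the stop sign
# and traffic light were not labeled in either image.
Y = [[1, 0, None, None],
     [1, 1, None, None]]
Y_hat = [[0.5, 0.5, 0.5, 0.5],
         [0.5, 0.5, 0.5, 0.5]]

cost = masked_multitask_cost(Y_hat, Y)
```

Only the four labeled entries contribute here, so the predictions for the unlabeled tasks have no effect on the cost for these two images.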
So when does multi-task learning make sense? I'd say it usually makes sense when three things hold true. One is if you're training on a set of tasks that could benefit from having shared low-level features. So for the autonomous driving example, it makes sense that recognizing traffic lights and cars and pedestrians should involve similar features that could also help you recognize stop signs, because these are all features of roads. Second, and this is less of a hard and fast rule, so it isn't always true, what I see in a lot of successful multi-task learning settings is that the amount of data you have for each task is quite similar. If you recall from transfer learning, you learn from some task A and transfer it to some task B. So if you have a million examples for task A and 1,000 examples for task B, then all the knowledge you learned from that million examples could really help augment the much smaller data set you have for task B. Well, how about multi-task learning? In multi-task learning you usually have a lot more tasks than just two. Previously we had 4 tasks, but let's say you have 100 tasks, and you're going to do multi-task learning to try to recognize 100 different types of objects at the same time. What you may find is that you have 1,000 examples per task. So if you focus on the performance of just one task, let's say the 100th task, which you can call A100: if you were trying to do this final task in isolation, you would have had just a thousand examples to train this one task. But by training on the 99 other tasks, which in aggregate have 99,000 training examples, you could get a big boost, because they could give a lot of knowledge to augment the otherwise relatively small 1,000-example training set that you have for task A100. And symmetrically, every one of the other 99 tasks can provide some data or some knowledge that helps every one of the other tasks in this list of 100 tasks.
多任务学习什么时候有意义?*当三件事为真时 它就是有意义的,第一 如果你训练的一组任务,可以共用低层次特征,对于无人驾驶的例子,同时识别交通灯 汽车和行人是有道理的,这些物体有相似的特征,也许能帮你识别停车标志,因为这些都是道路上的特征,第二* 这个准则没有那么绝对 所以不一定是对的,但我从很多成功的多任务学习案例中看到,如果每个任务的数据量很接近,你还记得迁移学习时 你从A 任务学到知识,然后迁移到 B 任务,所以如果任务 A 有 1 百万个样本,任务 B 只有 1000 个样本,那么你从这 1 百万个样本学到的知识,真的可以帮你增强对更小数据集任务 B 的训练,那么多任务学习又怎么样呢?在多任务学习中 你通常有更多任务而不仅仅是两个,所以也许你有 以前我们有 4 个任务 但比如说你要完成 100 个任务,而你要做多任务学习,尝试同时识别 100 种不同类型的物体,你可能会发现 每个任务大概有 1000 个样本,所以如果你专注加强单个任务的性能,比如我们专注加强第 100 个任务的表现 我们用 A100表示,如果你试图单独去做这个最后的任务,你只有 1000 个样本去训练这个任务,这是 100 项任务之一,而通过在其他 99 项任务的训练,这些加起来可以一共有99000个样本 这可能大幅提升算法性能,可以提供很多知识来增强这个任务的性能,不然对于任务 A100 只有 1000 个样本的训练集 效果可能会很差,如果有对称性 这其他 99 个任务 也许能提供一些数据,或提供一些知识,来帮到这 100 个任务中的每一个任务。
So the second bullet isn't a hard and fast rule, but what I tend to look at is this: if you focus on any one task, for that task to get a big boost from multi-task learning, the other tasks in aggregate need to have quite a lot more data than that one task. One way to satisfy that is if you have a lot of tasks, like we have in this example on the right, and if the amount of data you have for each task is quite similar. But the key really is that if you already have 1,000 examples for one task, then for all of the other tasks you had better have a lot more than 1,000 examples in total if those other tasks are meant to help you do better on this final task. And finally, multi-task learning tends to make more sense when you can train a big enough neural network to do well on all the tasks. The alternative to multi-task learning would be to train a separate neural network for each task. So rather than training one neural network for pedestrian, car, stop sign, and traffic light detection, you could have trained one neural network for pedestrian detection, one neural network for car detection, one neural network for stop sign detection, and one neural network for traffic light detection. What a researcher, Rich Caruana, found many years ago was that the only times multi-task learning hurts performance compared to training separate neural networks is if your neural network isn't big enough. But if you can train a big enough neural network, then multi-task learning certainly should not, or should very rarely, hurt performance. And hopefully it will actually help performance compared to training neural networks to do these different tasks in isolation.
所以第二点不是绝对正确的准则 但我通常会看的是,如果你专注于单项任务 如果想要从多任务学习得到很大性能提升,那么其他任务加起来,必须要有比单个任务大得多的数据量,要满足这个条件 其中一种方法是,比如右边这个例子这样,或者如果每个任务中的数据量很相近,但关键在于 如果对于单个任务你已经有 1000 个样本了,那么对于所有其他任务 你最好有超过 1000 个样本,这样其他任务的知识才能帮你改善这个任务的性能,最后多任务学习往往在以下场合更有意义,当你可以训练一个足够大的神经网络,同时做好所有的工作,所以多任务学习的替代方法是,为每个任务训练一个单独的神经网络,所以 不是训练单个神经网络同时处理行人 汽车 停车标志和交通灯检测,你可以训练,一个用于行人检测的神经网络,一个用于汽车检测的神经网络,一个用于停车标志检测的神经网络,和一个用于交通信号灯检测的神经网络,那么研究员 Rich Caruana 几年前发现的是什么呢?多任务学习会降低性能的唯一情况,和训练单个神经网络相比性能更低的情况,就是你的神经网络还不够大,但如果你可以训练一个足够大的神经网络 那么多任务学习,肯定不会 或者很少会降低性能,我们都希望它可以提升性能,比单独训练神经网络来单独完成各个任务性能要更好,所以这就是多任务学习,在实践中 多任务学习的使用频率要低于迁移学习,我看到很多迁移学习的应用,你需要解决一个问题 但你的训练数据很少。
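The idea of one network with shared early features and four outputs, as opposed to four separate networks, can be sketched as a forward pass in plain Python. Everything concrete here is an assumption for illustration: the layer sizes, the single ReLU hidden layer, the random untrained weights, and the helper names. The input vector is a stand-in for image features, not a real image.

```python
import math
import random


def dense(x, W, b):
    """Affine layer: one output per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (untrained, for illustration only)."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


rng = random.Random(0)
n_x, n_h, n_tasks = 6, 5, 4  # hypothetical sizes; 4 tasks as in the lecture

W1, b1 = init_layer(n_h, n_x, rng)      # shared layer used by every task
W2, b2 = init_layer(n_tasks, n_h, rng)  # output layer: one unit per task


def predict(x):
    h = relu(dense(x, W1, b1))        # low-level features shared across tasks
    return sigmoid(dense(h, W2, b2))  # [pedestrian, car, stop sign, traffic light]


x = [rng.random() for _ in range(n_x)]
y_hat = predict(x)
```

Training four separate networks would instead give each task its own copy of the first layer, so nothing learned for one task could help the others.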
So that's it for multi-task learning. In practice, multi-task learning is used much less often than transfer learning. I see a lot of applications of transfer learning where you have a problem you want to solve with a small amount of data, so you find a related problem with a lot of data to learn something and transfer that to this new problem. But it is just rarer in multi-task learning that you have a huge set of tasks you want to do well on and can train all of those tasks at the same time. Maybe the one example is computer vision. In object detection I see more applications of multi-task learning, where one neural network trying to detect a whole bunch of objects at the same time works better than different neural networks trained separately to detect objects. But I would say that on average transfer learning is used much more today than multi-task learning, though both are useful tools to have in your arsenal.
所以你需要找一个数据很多的相关问题 来预先学习,并将知识迁移到这个新问题上,但多任务学习比较少见,就是你需要同时处理很多任务 都要做好,你可以同时训练所有这些任务,也许计算机视觉是一个例子,在物体检测中 我们看到更多使用多任务学习的应用,其中一个神经网络尝试检测一大堆物体,比分别训练不同的神经网络检测物体更好,但我说 平均来说 目前迁移学习使用频率更高,比多任务学习频率要高 但两者都可以成为你的强力工具。
So to summarize, multi-task learning enables you to train one neural network to do many tasks, and this can give you better performance than if you were to do the tasks in isolation. Now one note of caution: in practice I see that transfer learning is used much more often than multi-task learning. I do see a lot of tasks where, if you want to solve a machine learning problem but you have a relatively small data set, transfer learning can really help. If you find a related problem where you have a much bigger data set, you can train your neural network on that and then transfer it to the problem where you have very little data. So transfer learning is used a lot today. There are some applications of multi-task learning as well, but multi-task learning I think is used much less often than transfer learning. And maybe the one exception is computer vision object detection, where I do see a lot of applications of training a neural network to detect lots of different objects.
所以总结一下,多任务学习能让你训练一个神经网络来执行许多任务,这可以给你更高的性能 比单独完成各个任务更高的性能,但要注意 实际上迁移学习,比多任务学习使用频率更高,我看到很多任务都是 如果你想解决一个机器学习问题,但你的数据集相对较小 那么迁移学习真的能帮到你,就是如果你找到一个相关问题 其中数据量要大得多,你就能以它为基础训练你的神经网络,然后迁移到这个数据量很少的任务上来,今天我们学到了很多和迁移学习有关的问题,还有一些迁移学习和多任务学习的应用,但多任务学习 我觉得使用频率比迁移学习要少得多,也许其中一个例外是计算机视觉物体检测,在那些任务中 人们经常训练一个神经网络,同时检测很多不同物体,这比训练单独的神经网络,来检测视觉物体要更好。
And that works better than training separate neural networks to detect the visual objects. But on average I think that even though transfer learning and multi-task learning are often presented in a similar way, in practice I've seen a lot more applications of transfer learning than of multi-task learning. I think that is because it's often just difficult to set up or to find so many different tasks that you would actually want to train a single neural network for. Again, computer vision object detection examples are the most notable exception. So that's it for multi-task learning. Multi-task learning and transfer learning are both important tools to have in your tool bag. And finally, I'd like to move on to discuss end-to-end deep learning. So let's go on to the next video to discuss end-to-end learning.
但平均而言 我认为即使迁移学习和,多任务学习工作方式类似 实际上,我看到用迁移学习比多任务学习要更多,我觉得这是因为你很难找到那么多相似且数据量对等的任务,可以用单一神经网络训练,再次 在计算机视觉领域,物体检测这个例子是最显著的例外情况,所以这就是多任务学习,多任务学习和迁移学习,都是你的工具包中的重要工具,最后 我想继续讨论端到端深度学习,所以我们来看下一个视频来讨论端到端学习。
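Multi-task learning
Unlike the sequential learning of transfer learning, in multi-task learning several tasks are learned in parallel, in the hope that each task helps all the others.
Autonomous driving example
In the autonomous driving example, there are many objects we need to detect, such as pedestrians, cars, stop signs and traffic lights.
For this task, the target becomes a vector, where each entry indicates whether a pedestrian, a car, a stop sign or a traffic light was detected, so one image carries multiple labels.
Network architecture (figure omitted): the network takes $x$ as input and outputs a four-dimensional $\hat{y}$, with one output node per object type.
The loss function for this problem: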
$$loss = \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{4} \mathcal{L}\left(\hat{y}_j^{(i)}, y_j^{(i)}\right) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{4}\left(y_j^{(i)} \log \hat{y}_j^{(i)} + \left(1 - y_j^{(i)}\right) \log\left(1 - \hat{y}_j^{(i)}\right)\right)$$
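With a problem set up this way, we are doing multi-task learning, because we build a single neural network to solve several problems.
For some problems, including our example, the data set may be only partially labeled. For instance, one image may be labeled only for pedestrians, with the car and traffic light labels missing. We can still train the model with multi-task learning on such a data set, but when summing the loss function we sum only over the values of $j$ that carry a 0 or 1 label.
When multi-task learning makes sense
It tends to make sense when three things hold: the tasks can benefit from shared low-level features; the amount of data for each task is similar, so that the other tasks in aggregate offer much more data than any single task; and you can train a neural network big enough to do well on all the tasks.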
References:
[1] 大树先生. 吴恩达 Coursera 深度学习课程 DeepLearning.ai 提炼笔记(3-2): 机器学习策略(2)