Paper title: Deep Learning
Source: Deep Learning, Nature, 2015
Translator: 莫陌莫墨
Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
深度学习允许由多个处理层组成的计算模型学习具有多个抽象级别的数据表示。这些方法极大地提升了语音识别、视觉目标识别、目标检测以及许多其他领域的最新技术,例如药物发现和基因组学。深度学习通过使用反向传播算法来指示机器应如何更新其内部参数(从上一层的表示形式计算每一层的表示形式),从而发现大型数据集中的复杂结构。深层卷积网络在处理图像、视频、语音和音频方面带来了突破,而递归网络则对诸如文本和语音之类的顺序数据有所启发。
Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts or products with users’ interests, and select relevant results of search. Increasingly, these applications make use of a class of techniques called deep learning.
Conventional machine-learning techniques were limited in their ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector from which the learning subsystem, often a classifier, could detect or classify patterns in the input.
Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations. An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts. The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.
Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government. In addition to beating records in image recognition and speech recognition, it has beaten other machine-learning techniques at predicting the activity of potential drug molecules, analysing particle accelerator data, reconstructing brain circuits, and predicting the effects of mutations in non-coding DNA on gene expression and disease. Perhaps more surprisingly, deep learning has produced extremely promising results for various tasks in natural language understanding, particularly topic classification, sentiment analysis, question answering and language translation.
We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data. New learning algorithms and architectures that are currently being developed for deep neural networks will only accelerate this progress.
机器学习技术为现代社会的各个方面提供了强大的支持:从网络搜索到社交网络上的内容过滤再到电子商务网站上的推荐,并且它越来越多地出现在诸如相机和智能手机之类的消费产品中。机器学习系统用于识别图像中的目标,语音转录为文本,新闻标题、帖子或具有用户兴趣的产品匹配,以及选择相关的搜索结果。这些应用程序越来越多地使用一类称为深度学习的技术。
传统的机器学习技术在处理原始格式的自然数据方面的能力受到限制。几十年来,构建模式识别或机器学习系统需要认真的工程设计和相当多的领域专业知识,才能设计特征提取器,以将原始数据(例如图像的像素值)转换为合适的内部表示或特征向量,学习子系统(通常是分类器)可以对输入的图片进行检测或分类。
表示学习是一组方法,这些方法允许向机器提供原始数据并自动发现检测或分类所需的表示。深度学习方法是具有表示形式的多层次的表示学习方法,它是通过组合简单但非线性的模块而获得的,每个模块都将一个级别(从原始输入开始)的表示形式转换为更高、更抽象的级别的表示形式。有了足够多的此类转换,就可以学习非常复杂的功能。对于分类任务,较高的表示层会放大输入中对区分非常重要的方面,并抑制不相关的变化。例如,图像以像素值序列的形式出现,并且在表示的第一层中学习的特征通常表示图像中特定方向和位置上是否存在边缘。第二层通常通过发现边缘的特定布置来检测图案,而与边缘位置的微小变化无关。第三层可以将图案组装成与熟悉的对象的各个部分相对应的较大组合,并且随后的层将这些部分的组合作为目标进行检测。深度学习的关键在于每层的功能不是由人类工程师设计的,而是通用训练过程从数据中学习的。
深度学习在解决多年来抵制人工智能界最大尝试的问题方面取得了重大进展。事实证明,它非常善于发现高维数据中的复杂结构,因此适用于科学、商业和政府的许多领域。除了打破图像识别和语音识别中的记录,它在预测潜在药物分子的活性、分析粒子加速器数据,重建脑回路和预测非编码DNA突变对基因表达和疾病的影响方面还优于其他机器学习技术。更令人惊讶的是,深度学习在自然语言理解中的各种任务上产生了非常有希望的结果,尤其是主题分类、情感分析、问答系统和语言翻译。
由于深度学习只需要极少的人工操作,我们认为其在不久的将来会取得更多的成功,因此可以轻松地利用增加的可用计算量和数据量的优势。目前正在为深度神经网络开发的新学习算法和体系结构只会加速这一进展。
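Summary: The paper opens with the wide range of machine-learning applications, then points out the limitation of conventional machine learning: raw data must first be processed into features, and that processing is a craft that demands considerable experience and algorithmic knowledge. This motivates representation learning.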
The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify images as containing, say, a house, a car, a person or a pet. We first collect a large data set of images of houses, cars, people and pets, each labelled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the desired category to have the highest score of all categories, but this is unlikely to happen before training. We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The machine then modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as ‘knobs’ that define the input–output function of the machine. In a typical deep-learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labelled examples with which to train the machine.
To properly adjust the weight vector, the learning algorithm computes a gradient vector that, for each weight, indicates by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector.
The objective function, averaged over all the training examples, can be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking it closer to a minimum, where the output error is low on average.
In practice, most practitioners use a procedure called stochastic gradient descent (SGD). This consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques. After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine — its ability to produce sensible answers on new inputs that it has never seen during training.
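The SGD loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's own code: the model (a one-dimensional linear fit), the squared-error objective, the function name `sgd_linear_fit`, and the learning rate, batch size and toy data are all assumptions chosen to keep the example small.

```python
import random


def sgd_linear_fit(data, lr=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w * x + b by mini-batch SGD on a squared-error objective.

    All hyperparameters here are illustrative assumptions.
    """
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the training set in a new random order
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average gradient of 0.5 * (w*x + b - y)^2 over the small batch:
            # a noisy estimate of the gradient over the whole training set.
            grad_w = sum((w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum((w * x + b - y) for x, y in batch) / len(batch)
            # Adjust the weights in the direction opposite to the gradient.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Toy, noise-free training set drawn from y = 2x + 1 (hypothetical data).
train = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(-10, 11)]
w, b = sgd_linear_fit(train)
```

In a real system the stopping rule would watch the average objective rather than run a fixed number of passes, and performance would be measured on a separate test set.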
Many of the current practical applications of machine learning use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category.
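As a small sketch of the two-class linear classifier just described (the function name, feature values, weights and threshold are hypothetical, chosen only for illustration):

```python
def linear_classify(features, weights, threshold):
    """Return True if the weighted sum of the features exceeds the threshold."""
    score = sum(w * x for w, x in zip(weights, features))
    return score > threshold


# Hypothetical two-component feature vectors and weights.
in_category = linear_classify([1.0, 2.0], [0.5, 0.25], threshold=0.9)
not_in_category = linear_classify([0.0, 0.0], [0.5, 0.25], threshold=0.9)
```

The decision boundary `score == threshold` is a hyperplane in feature space, which is why such a classifier can only split its input space into two half-spaces.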
Since the 1960s we have known that linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane. But problems such as image and speech recognition require the input–output function to be insensitive to irrelevant variations of the input, such as variations in position, orientation or illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a breed of wolf-like white dog called a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other. A linear classifier, or any other ‘shallow’ classifier operating on raw pixels, could not possibly distinguish the latter two, while putting the former two in the same category. This is why shallow classifiers require a good feature extractor that solves the selectivity–invariance dilemma: one that produces representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects such as the pose of the animal. To make classifiers more powerful, one can use generic non-linear features, as with kernel methods, but generic features such as those arising with the Gaussian kernel do not allow the learner to generalize well far from the training examples. The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.
A deep-learning architecture is a multilayer stack of simple modules, all (or most) of which are subject to learning, and many of which compute non-linear input–output mappings. Each module in the stack transforms its input to increase both the selectivity and the invariance of the representation. With multiple non-linear layers, say a depth of 5 to 20, a system can implement extremely intricate functions of its inputs that are simultaneously sensitive to minute details (distinguishing Samoyeds from white wolves) and insensitive to large irrelevant variations such as the background, pose, lighting and surrounding objects.
不论深度与否,机器学习最常见的形式都是监督学习。想象一下,我们想建立一个可以将图像分类为包含房屋、汽车、人或宠物的系统。我们首先收集大量的房屋、汽车、人和宠物的图像数据集,每个图像均标有类别。在训练过程中,机器将显示一张图像,并输出一个分数向量,每个类别一个。我们希望所需的类别在所有类别中得分最高,但这不太可能在训练之前发生。我们计算一个目标函数,该函数测量输出得分与期望得分模式之间的误差(或距离)。然后机器修改其内部可更新参数以减少此误差。这些可更新的参数(通常称为权重)是实数,可以看作是定义机器输入输出功能的“旋钮”。在典型的深度学习系统中,可能会有数以亿计的可更新权重,以及数亿个带有标签的实例,用于训练模型。
为了适当地更新权重向量,学习算法计算一个梯度向量,针对每个权重,该梯度向量表明,如果权重增加很小的量,误差将增加或减少的相应的量。然后沿与梯度向量相反的方向更新权重向量。
在所有训练示例中平均的目标函数可以在权重值的高维空间中被视为一种丘陵地形。负梯度矢量指示此地形中最陡下降的方向,使其更接近最小值,其中输出误差平均较低。
在实践中,大多数从业者使用一种称为随机梯度下降(SGD)的算法。这包括显示几个示例的输入向量,计算输出和误差,计算这些示例的平均梯度以及相应地更新权重。对训练集中的许多小样本示例重复此过程,直到目标函数的平均值停止下降。之所以称其为随机的,是因为每个小的示例集都会给出所有示例中平均梯度的噪声估计。与更复杂的优化技术相比[18],这种简单的过程通常会出乎意料地快速找到一组良好的权重。训练后,系统的性能将在称为测试集的不同示例集上进行测量。这用于测试机器的泛化能力:机器在新的输入数据上产生好的效果的能力,这些输入数据在训练集上是没有的。
机器学习的许多当前实际应用都在人工设计的基础上使用线性分类器。两类别线性分类器计算特征向量分量的加权和。如果加权和大于阈值,则将输入分为特定类别。
自二十世纪六十年代以来,我们就知道线性分类器只能将其输入空间划分为非常简单的区域,即由超平面分隔的对半空间。但是,诸如图像和语音识别之类的问题要求输入输出功能对输入的不相关变化不敏感,例如目标的位置、方向或照明的变化,或语音的音高或口音的变化。对特定的微小变化敏感(例如,白狼与萨摩耶之间的差异,萨摩耶是很像狼的白狗)。在像素级别,两幅处于不同姿势和不同环境中的萨摩耶图像可能差别很大,而两幅位于相同位置且背景相似的萨摩耶和狼的图像可能非常相似。线性分类器或其他任何在其上运行的“浅”分类器无法区分后两幅图片,而将前两幅图像归为同一类别。这就是为什么浅分类器需要一个好的特征提取器来解决选择性不变性难题的原因。提取器可以产生对图像中对于辨别重要的方面具有选择性但对不相关方面(例如动物的姿态)不变的表示形式。为了使分类器更强大,可以使用通用的非线性特征,如核方法,但是诸如高斯核所产生的那些通用特征,使学习者无法从训练示例中很好地概括。传统的选择是人工设计好的特征提取器,这需要大量的工程技术和领域专业知识。但是,如果可以使用通用学习过程自动学习好的功能,则可以避免所有这些情况。这是深度学习的关键优势。
深度学习架构是简单模块的多层堆叠,所有模块(或大多数模块)都需要学习,并且其中许多模块都会计算非线性的输入-输出映射。堆叠中的每个模块都会转换其输入,以增加表示的选择性和不变性。系统具有多个非线性层(例如深度为5到20),可以实现极为复杂的输入功能,这些功能同时对细小的细节敏感(区分萨摩耶犬与白狼),并且对不相关的大变化不敏感,例如背景功能、姿势、灯光和周围物体。
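2. Supervised learning: the network is given pictures of dogs (the training data set) and told that each one is a dog (the label). After many repetitions, when shown a new picture of a dog, the trained network responds on its own with the answer: this is a dog. How is it actually trained? We define an objective function that measures the distance between the prediction and the label. The task of learning is to reduce the value of this objective function, that is, to move the predictions ever closer to the true values. The objective is a function of the weights. The paper then briefly describes one optimization method, stochastic gradient descent (SGD): at each step, pick a small random set of samples (in the simplest case a single one), compute the partial derivatives (the gradient) of the objective with respect to the weights, and move the weights in the direction of the negative gradient, which is the direction in which the objective, and hence the error, decreases. Repeat this many times; once the objective stops changing or becomes very small, stop iterating. The weights at that point are what the network has learned. The result is a trained architecture with its own learned representation, which can judge whether a new object is one it has learned before and give an answer.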
From the earliest days of pattern recognition, the aim of researchers has been to replace hand-engineered features with trainable multilayer networks, but despite its simplicity, the solution was not widely understood until the mid 1980s. As it turns out, multilayer architectures can be trained by simple stochastic gradient descent. As long as the modules are relatively smooth functions of their inputs and of their internal weights, one can compute gradients using the backpropagation procedure. The idea that this could be done, and that it worked, was discovered independently by several different groups during the 1970s and 1980s.
The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives. The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module) (Fig. 1). The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module.
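To make the chain-rule argument concrete, here is a minimal sketch for a stack of two scalar modules, `h = tanh(w1 * x)` followed by `y = w2 * h`, with a squared-error objective. The network shape, the names and the numeric values are assumptions for illustration; the backward pass is checked against a finite-difference estimate.

```python
import math


def forward_backward(x, target, w1, w2):
    """Return the loss and its gradients with respect to w1 and w2."""
    # Forward pass through two stacked modules.
    z = w1 * x
    h = math.tanh(z)
    y = w2 * h
    loss = 0.5 * (y - target) ** 2

    # Backward pass: start from the gradient at the output and work down.
    dloss_dy = y - target                  # gradient w.r.t. the top output
    dloss_dw2 = dloss_dy * h               # weight gradient of the top module
    dloss_dh = dloss_dy * w2               # gradient w.r.t. the top module's input
    dloss_dz = dloss_dh * (1.0 - h * h)    # through tanh: d tanh(z)/dz = 1 - tanh(z)^2
    dloss_dw1 = dloss_dz * x               # weight gradient of the bottom module
    return loss, dloss_dw1, dloss_dw2


def numeric_grad(f, w, eps=1e-6):
    """Central finite-difference estimate of df/dw."""
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)


# Hypothetical input, target and weights.
x, target, w1, w2 = 0.5, 1.0, 0.8, -1.3
loss, g1, g2 = forward_backward(x, target, w1, w2)
g1_check = numeric_grad(lambda w: forward_backward(x, target, w, w2)[0], w1)
g2_check = numeric_grad(lambda w: forward_backward(x, target, w1, w)[0], w2)
```

Each backward line reuses the gradient computed on the line before it, which is the repeated application of the backpropagation equation that the paragraph describes.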
Many applications of deep learning use feedforward neural network architectures (Fig. 1), which learn to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each of several categories). To go from one layer to the next, a set of units compute a weighted sum of their inputs from the previous layer and pass the result through a non-linear function. At present, the most popular non-linear function is the rectified linear unit (ReLU), which is simply the half-wave rectifier $f(z)=\max(0,z)$. In past decades, neural nets used smoother non-linearities, such as $\tanh(z)$ or $1/(1+\exp(-z))$, but the ReLU typically learns much faster in networks with many layers, allowing training of a deep supervised network without unsupervised pre-training. Units that are not in the input or output layer are conventionally called hidden units. The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer (Fig. 1).
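The non-linearities named above, and one layer-to-layer step, can be written as follows. This is an illustrative sketch: the function names and the example weights are assumptions, and bias terms are omitted because the text does not mention them.

```python
import math


def relu(z):
    """Rectified linear unit: f(z) = max(0, z)."""
    return max(0.0, z)


def sigmoid(z):
    """Logistic function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, activation=relu):
    """Each unit takes a weighted sum of the previous layer, then a non-linearity.

    `weights` holds one row of input weights per unit in this layer.
    """
    return [activation(sum(w * x for w, x in zip(row, inputs))) for row in weights]


# Two hypothetical units reading a two-component input.
hidden = layer_forward([1.0, -2.0], [[1.0, 1.0], [2.0, 0.5]])
```

Passing `activation=math.tanh` or `activation=sigmoid` gives the smoother non-linearities used in earlier decades.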
In the late 1990s, neural nets and backpropagation were largely forsaken by the machine-learning community and ignored by the computer-vision and speech-recognition communities. It was widely thought that learning useful, multistage, feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima, that is, weight configurations for which no small change would reduce the average error.
In practice, poor local minima are rarely a problem with large networks. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero, and the surface curves up in most dimensions and curves down in the remainder. The analysis seems to show that saddle points with only a few downward curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.
Interest in deep feedforward networks was revived around 2006 by a group of researchers brought together by the Canadian Institute for Advanced Research (CIFAR). The researchers introduced unsupervised learning procedures that could create layers of feature detectors without requiring labelled data. The objective in learning each layer of feature detectors was to be able to reconstruct or model the activities of feature detectors (or raw inputs) in the layer below. By ‘pre-training’ several layers of progressively more complex feature detectors using this reconstruction objective, the weights of a deep network could be initialized to sensible values. A final layer of output units could then be added to the top of the network and the whole deep system could be fine-tuned using standard backpropagation. This worked remarkably well for recognizing handwritten digits or for detecting pedestrians, especially when the amount of labelled data was very limited [36].
The first major application of this pre-training approach was in speech recognition, and it was made possible by the advent of fast graphics processing units (GPUs) that were convenient to program and allowed researchers to train networks 10 or 20 times faster. In 2009, the approach was used to map short temporal windows of coefficients extracted from a sound wave to a set of probabilities for the various fragments of speech that might be represented by the frame in the centre of the window. It achieved record-breaking results on a standard speech recognition benchmark that used a small vocabulary and was quickly developed to give record-breaking results on a large vocabulary task. By 2012, versions of the deep net from 2009 were being developed by many of the major speech groups and were already being deployed in Android phones. For smaller data sets, unsupervised pre-training helps to prevent overfitting, leading to significantly better generalization when the number of labelled examples is small, or in a transfer setting where we have lots of examples for some ‘source’ tasks but very few for some ‘target’ tasks. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.
There was, however, one particular type of deep, feedforward network that was much easier to train and generalized much better than networks with full connectivity between adjacent layers. This was the convolutional neural network (ConvNet). It achieved many practical successes during the period when neural networks were out of favour and it has recently been widely adopted by the computer-vision community.
从模式识别的早期开始,研究人员的目的一直是用可训练的多层网络代替手工设计的功能,但是尽管它很简单,但直到二十世纪八十年代中期才广泛了解该解决方案。事实证明,可以通过简单的随机梯度下降来训练多层体系结构。只要模块是其输入及其内部权重的相对平滑函数,就可以使用反向传播过程来计算梯度。 二十世纪七十年代和二十世纪八十年代,几个不同的小组独立地发现了可以做到这一点并且起作用的想法。
反向传播程序用于计算目标函数相对于模块多层堆叠权重的梯度,无非是导数链规则的实际应用。关键的见解是,相对于模块输入的目标的导数(或梯度)可以通过相对于该模块的输出(或后续模块的输入)的梯度进行反运算来计算(图1)。反向传播方程式可以反复应用,以通过所有模块传播梯度,从顶部的输出(网络产生其预测)一直到底部的输出(外部输入被馈送)。一旦计算出这些梯度,就可以相对于每个模块的权重来计算梯度。
深度学习的许多应用都使用前馈神经网络体系结构(图1),该体系结构会将固定大小的输入(例如图像)映射到固定大小的输出(例如几个类别中的每一个的概率)。为了从一层到下一层,一组单元计算它们来自上一层的输入的加权和,并将结果传递给非线性函数。目前,最流行的非线性函数是整流线性单元(ReLU),即半波整流器 $f(z)=\max(0,z)$。在过去的几十年中,神经网络使用了更平滑的非线性,例如 $\tanh(z)$ 或 $1/(1+\exp(-z))$,但ReLU通常在具有多个层的网络中学习得更快,从而可以在无需无监督预训练的情况下训练深度监督网络。不在输入或输出层中的单元通常称为隐藏单元。隐藏的层可以被视为以非线性方式使输入失真,以便类别可以由最后一层实现线性可分(图1)。
在二十世纪九十年代后期,神经网络和反向传播在很大程度上被机器学习领域抛弃,而被计算机视觉和语音识别领域所忽略。人们普遍认为,在没有先验知识的情况下学习有用的多阶段特征提取器是不可行的。特别是,通常认为简单的梯度下降会陷入不良的局部极小值——权重配置,对其进行很小的变化将减少平均误差。
实际上,较差的局部最小值在大型网络中很少出现问题。不管初始条件如何,该系统几乎总是能获得效果非常相似的解决方案。最近的理论和经验结果强烈表明,局部极小值通常不是一个严重的问题。取而代之的是,景观中堆积了许多鞍点,其中梯度为零,并且曲面在大多数维度上都向上弯曲,而在其余维度上则向下弯曲。分析似乎表明,只有少数几个向下弯曲方向的鞍点存在很多,但几乎所有鞍点的目标函数值都非常相似。因此,算法陷入这些鞍点中的哪一个都没关系。
加拿大高级研究所(CIFAR)召集的一组研究人员在2006年左右恢复了对深层前馈网络的兴趣。研究人员介绍了无需监督的学习程序,这些程序可以创建特征检测器层,而无需标记数据。学习特征检测器每一层的目的是能够在下一层中重建或建模特征检测器(或原始输入)的活动。通过使用此重建目标“预训练”几层逐渐复杂的特征检测器,可以将深度网络的权重初始化为合理的值。然后可以将输出单元的最后一层添加到网络的顶部,并且可以使用标准反向传播对整个深度系统进行微调。这对于识别手写数字或检测行人非常有效,特别是在标记数据量非常有限的情况下。
这种预训练方法的第一个主要应用是语音识别,而快速图形处理单元(GPU)的出现使编程成为可能,并且使研究人员训练网络的速度提高了10或20倍,从而使之成为可能。在2009年,该方法用于将从声波提取的系数的短暂时间窗口映射到可能由窗口中心的帧表示的各种语音片段的一组概率。它在使用少量词汇的标准语音识别基准上取得了创纪录的结果,并迅速发展为大型词汇任上取得了创纪录的结果。到2012年,许多主要的语音组织都在开发2009年以来的深度网络版本,并且已经在Android手机中进行了部署。对于较小的数据集,无监督的预训练有助于防止过拟合,从而在标记的示例数量较少时或在转移设置中,对于一些“源”任务,我们有很多示例,而对于某些“源”任务却很少,这会导致泛化效果更好“目标”任务。恢复深度学习后,事实证明,仅对于小型数据集才需要进行预训练。
但是,存在一种特定类型的深层前馈网络,它比相邻层之间具有完全连接的网络更容易训练和推广。这就是卷积神经网络(ConvNet)。在神经网络未受关注期间,它取得了许多实际的成功,并且最近被计算机视觉界广泛采用。
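3. The BP algorithm (Backpropagation to train multilayer architectures): the core idea is the chain rule for derivatives. The paper then recounts how neural networks and backpropagation were neglected by the machine-learning community until, around 2006, a group of researchers introduced unsupervised learning procedures and revived interest. As the name suggests, an unsupervised learning procedure needs no supervision: it can build the layers of a network from unlabelled data. Its strength is pre-training, which initializes the weights to sensible values; for small data sets, unsupervised pre-training also helps to prevent overfitting.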
ConvNets are designed to process data that come in the form of multiple arrays, for example a colour image composed of three 2D arrays containing pixel intensities in the three colour channels. Many data modalities are in the form of multiple arrays: 1D for signals and sequences, including language; 2D for images or audio spectrograms; and 3D for video or volumetric images. There are four key ideas behind ConvNets that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.
The architecture of a typical ConvNet (Fig. 2) is structured as a series of stages. The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. The result of this local weighted sum is then passed through a non-linearity such as a ReLU. All units in a feature map share the same filter bank. Different feature maps in a layer use different filter banks. The reason for this architecture is twofold. First, in array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected. Second, the local statistics of images and other signals are invariant to location. In other words, if a motif can appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name.
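A one-dimensional sketch of a single feature map follows: every output unit applies the same small filter to a local patch of the input. The function name, the filter and the signals are assumptions for illustration. The sketch slides the filter without flipping it (cross-correlation), which is how most deep-learning libraries implement the operation; a strict discrete convolution would reverse the filter first.

```python
def feature_map_1d(signal, kernel):
    """Local weighted sums with one shared filter (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# A hypothetical edge-detecting filter applied to a step pattern,
# then to the same pattern shifted one position to the right.
edge_filter = [-1, 1]
response = feature_map_1d([0, 0, 1, 1, 0, 0], edge_filter)
shifted_response = feature_map_1d([0, 0, 0, 1, 1, 0], edge_filter)
```

Because the weights are shared, the shifted input produces the same response pattern shifted by one position: the motif is detected wherever it appears.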
Although the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one. Because the relative positions of the features forming a motif can vary somewhat, reliably detecting the motif can be done by coarse-graining the position of each feature. A typical pooling unit computes the maximum of a local patch of units in one feature map (or in a few feature maps). Neighbouring pooling units take input from patches that are shifted by more than one row or column, thereby reducing the dimension of the representation and creating an invariance to small shifts and distortions. Two or three stages of convolution, non-linearity and pooling are stacked, followed by more convolutional and fully-connected layers. Backpropagating gradients through a ConvNet is as simple as through a regular deep network, allowing all the weights in all the filter banks to be trained.
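The pooling step can be sketched in the same one-dimensional setting. The window size, stride and input values below are illustrative assumptions.

```python
def max_pool_1d(values, size=2, stride=2):
    """Maximum over local patches; a stride above 1 shrinks the representation."""
    return [
        max(values[i:i + size])
        for i in range(0, len(values) - size + 1, stride)
    ]


# Two hypothetical feature maps whose active units sit at slightly
# different positions inside the same pooling windows.
pooled = max_pool_1d([0, 1, 0, 0, 3, 2])
pooled_after_small_shift = max_pool_1d([1, 0, 0, 0, 2, 3])
```

Six inputs become three outputs, and the two inputs pool to the same result, which is the invariance to small shifts described above.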
Deep neural networks exploit the property that many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones. In images, local combinations of edges form motifs, motifs assemble into parts, and parts form objects. Similar hierarchies exist in speech and text from sounds to phones, phonemes, syllables, words and sentences. The pooling allows representations to vary very little when elements in the previous layer vary in position and appearance.
The convolutional and pooling layers in ConvNets are directly inspired by the classic notions of simple cells and complex cells in visual neuroscience, and the overall architecture is reminiscent of the LGN–V1–V2–V4–IT hierarchy in the visual cortex ventral pathway. When ConvNet models and monkeys are shown the same picture, the activations of high-level units in the ConvNet explain half of the variance of random sets of 160 neurons in the monkey’s inferotemporal cortex. ConvNets have their roots in the neocognitron, the architecture of which was somewhat similar, but did not have an end-to-end supervised-learning algorithm such as backpropagation. A primitive 1D ConvNet called a time-delay neural net was used for the recognition of phonemes and simple words.
There have been numerous applications of convolutional networks going back to the early 1990s, starting with time-delay neural networks for speech recognition and document reading. The document reading system used a ConvNet trained jointly with a probabilistic model that implemented language constraints. By the late 1990s this system was reading over 10% of all the cheques in the United States. A number of ConvNet-based optical character recognition and handwriting recognition systems were later deployed by Microsoft. ConvNets were also experimented with in the early 1990s for object detection in natural images, including faces and hands and for face recognition.
ConvNets被设计为处理以多个阵列形式出现的数据,例如,由三个二维通道组成的彩色图像,其中三个二维通道在三个彩色通道中包含像素强度。许多数据形式以多个数组的形式出现:一维用于信号和序列,包括语言;2D用于图像或音频频谱图;和3D视频或体积图像。ConvNets有四个利用自然信号属性的关键思想:局部连接,共享权重,池化和多层使用。
典型的ConvNet的体系结构(图2)由一系列阶段构成。前几个阶段由两种类型的层组成:卷积层和池化层。卷积层中的单元组织在特征图中,其中每个单元通过称为滤波器组的一组权重连接到上一层特征图中的局部块。然后,该局部加权和的结果将通过非线性(如ReLU)传递。特征图中的所有单元共享相同的过滤器组。图层中的不同要素图使用不同的滤镜库。这种体系结构的原因有两个。首先,在诸如图像的阵列数据中,局部的值通常高度相关,从而形成易于检测的独特局部图案。其次,图像和其他信号的局部统计量对于位置是不变的。换句话说,如果图形可以出现在图像的一部分中,则它可以出现在任何位置,因此,位于不同位置的单元在数组的不同部分共享相同的权重并检测相同的图案。在数学上,由特征图执行的过滤操作是离散卷积,因此得名。
尽管卷积层的作用是检测上一层的特征的局部连接,但池化层的作用是将语义相似的要素合并为一个。由于形成图案的特征的相对位置可能会略有变化,因此可以通过对每个特征的位置进行粗粒度来可靠地检测图案。一个典型的池化单元计算一个特征图中(或几个特征图中)的局部块的最大值。相邻的池化单元从移动了不止一个行或一列的色块中获取输入,从而减小了表示的尺寸,并为小幅度的移位和失真创建了不变性。卷积、非线性和池化的两个或三个阶段被堆叠,随后是更多卷积和全连接的层。通过ConvNet进行反向传播的梯度与通过常规深度网络一样简单,从而可以训练所有滤波器组中的所有权重。
深度神经网络利用了许多自然信号是成分层次结构的特性,其中通过组合较低层的特征获得较高层的特征。在图像中,边缘的局部组合形成图案,图案组装成零件,而零件形成对象。从声音到电话,音素,音节,单词和句子,语音和文本中也存在类似的层次结构。当上一层中的元素的位置和外观变化时,池化使表示形式的变化很小。
卷积网络中的卷积和池化层直接受到视觉神经科学中简单细胞和复杂细胞的经典概念的启发,整个架构让人联想到视觉皮层腹侧通路中的LGN-V1-V2-V4-IT层次结构。当ConvNet模型和猴子显示相同的图片时,ConvNet中高层单元的激活解释了猴子下颞叶皮层中160个神经元随机集合的一半方差。 ConvNets的根源是新认知器,其架构有些相似,但没有反向传播等端到端监督学习算法。称为时延神经网络的原始一维ConvNet用于识别音素和简单单词。
卷积网络的大量应用可以追溯到二十世纪九十年代初,首先是用于语音识别和文档阅读的时延神经网络。该文档阅读系统使用了一个ConvNet,并与一个实现语言约束的概率模型一起进行了培训。到二十世纪九十年代后期,该系统已读取了美国所有支票的10%以上。 Microsoft随后部署了许多基于ConvNet的光学字符识别和手写识别系统。在二十世纪九十年代初,还对ConvNets进行试验,以检测自然图像中的物体,包括面部和手部,以及面部识别。
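Convolutional neural networks: the four key ideas behind ConvNets are:
Local connections: a neuron does not need to perceive the whole image. It only needs to perceive a local region; combining the local information at higher layers then yields the global information.
Shared weights: weight sharing (that is, the convolution operation) reduces the number of weights and the complexity of the network, and can be seen as a way of extracting features. The underlying principle is that the statistics of one part of an image are the same as those of any other part. A feature learned on one part can therefore be used on another, so the same learned feature can be applied at every position in the image.
Pooling: once features have been obtained by convolution, the next step is to use them for classification. One could train a classifier, for example a softmax classifier, on all of the extracted features, but this is computationally expensive and prone to overfitting. To describe a large image, the features at different positions can instead be aggregated, for example by taking their mean or their maximum, which gives mean-pooling and max-pooling.
The use of many layers.
The paper then describes the structure of a typical ConvNet: convolution layers, non-linearity and pooling. Stacking this structure several times forms the hidden layers of the ConvNet. It then explains the design inspiration behind the convolutional and pooling layers.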
Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation and recognition of objects and regions in images. These were all tasks in which labelled data was relatively abundant, such as traffic sign recognition, the segmentation of biological images particularly for connectomics, and the detection of faces, text, pedestrians and human bodies in natural images. A major recent practical success of ConvNets is face recognition.
Importantly, images can be labelled at the pixel level, which will have applications in technology, including autonomous mobile robots and self-driving cars. Companies such as Mobileye and NVIDIA are using such ConvNet-based methods in their upcoming vision systems for cars. Other applications gaining importance involve natural language understanding and speech recognition.
Despite these successes, ConvNets were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012. When deep convolutional networks were applied to a data set of about a million images from the web that contained 1,000 different classes, they achieved spectacular results, almost halving the error rates of the best competing approaches. This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout, and techniques to generate more training examples by deforming the existing ones. This success has brought about a revolution in computer vision; ConvNets are now the dominant approach for almost all recognition and detection tasks and approach human performance on some tasks. A recent stunning demonstration combines ConvNets and recurrent net modules for the generation of image captions (Fig. 3).
Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units. Whereas training such large networks could have taken weeks only two years ago, progress in hardware, software and algorithm parallelization have reduced training times to a few hours.
The performance of ConvNet-based vision systems has caused most major technology companies, including Google, Facebook, Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a quickly growing number of start-ups, to initiate research and development projects and to deploy ConvNet-based image understanding products and services.
ConvNets are easily amenable to efficient hardware implementations in chips or field-programmable gate arrays. A number of companies such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.
自二十一世纪初以来,ConvNets已成功应用于图像中对象和区域的检测、分割和识别。这些都是标记数据相对丰富的任务,例如交通标志识别、生物图像分割、尤其是用于连接组学,以及在自然图像中检测人脸、文字、行人和人体。ConvNets最近在实践中取得的主要成功是面部识别[59]。
重要的是,可以在像素级别标记图像,这将在技术中得到应用,包括自动驾驶机器人和自动驾驶汽车。 Mobileye和NVIDIA等公司正在其即将推出的汽车视觉系统中使用基于ConvNet的方法。其他日益重要的应用包括自然语言理解和语音识别。
尽管取得了这些成功,但ConvNet在很大程度上被主流计算机视觉和机器学习领域弃用,直到2012年ImageNet竞赛为止。当深度卷积网络应用于来自网络的大约一百万个图像的数据集时,其中包含1000个不同的类别,取得了惊人的成绩,几乎使最佳竞争方法的错误率降低了一半。成功的原因是有效利用了GPU、ReLU、一种称为dropout的新正则化技术,以及通过使现有示例变形而生成更多训练示例的技术。这一成功带来了计算机视觉的一场革命。现在,ConvNets是几乎所有识别和检测任务的主要方法,并且在某些任务上达到了人类水平。最近的一次令人震惊的演示结合了ConvNets和递归网络模块以生成图像字幕(图3)。
最新的ConvNet架构具有10到20层ReLU、数亿个权重以及单元之间的数十亿个连接。就在两年前,训练如此大型的网络可能还需要数周时间,而硬件、软件和算法并行化方面的进步已将训练时间缩短到几个小时。
基于ConvNet的视觉系统的出色性能,已促使大多数主要科技公司(包括Google、Facebook、Microsoft、IBM、Yahoo!、Twitter和Adobe)以及数量迅速增长的初创公司启动研发项目,并部署基于ConvNet的图像理解产品和服务。
卷积网络很容易适应芯片或现场可编程门阵列中的高效硬件实现。 NVIDIA、Mobileye、英特尔、高通和三星等多家公司正在开发ConvNet芯片,以支持智能手机、相机、机器人和自动驾驶汽车中的实时视觉应用。
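Image understanding with deep convolutional networks: this section contains little technical detail; it mainly describes the impressive results that the major technology companies have achieved with deep neural networks.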
Deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations. Both of these advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure. First, learning distributed representations enables generalization to new combinations of the values of learned features beyond those seen during training (for example, $2^n$ combinations are possible with $n$ binary features). Second, composing layers of representation in a deep net brings the potential for another exponential advantage (exponential in the depth).
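The first advantage can be illustrated by counting codes. The sketch below is only an illustration, not part of the original paper: the function names `distributed_codes` and `local_codes` are hypothetical, and it simply contrasts the $2^n$ configurations of $n$ binary features with the $n$ configurations available when exactly one unit may be active.

```python
from itertools import product

def distributed_codes(n):
    """All configurations of n binary features: there are 2**n of them."""
    return list(product([0, 1], repeat=n))

def local_codes(n):
    """One-of-n codes: exactly one unit is active, so there are only n."""
    return [tuple(1 if i == j else 0 for i in range(n)) for j in range(n)]

n = 4
print(len(distributed_codes(n)))  # 16
print(len(local_codes(n)))        # 4
```

With the same number of units, the distributed code can distinguish exponentially more feature combinations than the one-of-n code.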
The hidden layers of a multilayer neural network learn to represent the network’s inputs in a way that makes it easy to predict the target outputs. This is nicely demonstrated by training a multilayer neural network to predict the next word in a sequence from a local context of earlier words. Each word in the context is presented to the network as a one-of-N vector, that is, one component has a value of 1 and the rest are 0. In the first layer, each word creates a different pattern of activations, or word vectors (Fig. 4). In a language model, the other layers of the network learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability for any word in the vocabulary to appear as the next word. The network learns word vectors that contain many active components each of which can be interpreted as a separate feature of the word, as was first demonstrated in the context of learning distributed representations for symbols. These semantic features were not explicitly present in the input. They were discovered by the learning procedure as a good way of factorizing the structured relationships between the input and output symbols into multiple ‘micro-rules’. Learning word vectors turned out to also work very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable. When trained to predict the next word in a news story, for example, the learned word vectors for Tuesday and Wednesday are very similar, as are the word vectors for Sweden and Norway. Such representations are called distributed representations because their elements (the features) are not mutually exclusive and their many configurations correspond to the variations seen in the observed data. These word vectors are composed of learned features that were not determined ahead of time by experts, but automatically discovered by the neural network. Vector representations of words learned from text are now very widely used in natural language applications.
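The one-of-N input and the first-layer word vectors can be sketched in a few lines. This is a toy illustration under stated assumptions: the four-word vocabulary, the weight matrix `W` and its values are made up by hand (in a real model they would be learned), and cosine similarity is just one common way to compare vectors, not something the paper prescribes.

```python
import math

vocab = ["tuesday", "wednesday", "sweden", "norway"]

# Hypothetical first-layer weights: one row of 3 features per word.
# These values are invented for illustration, not learned.
W = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
]

def one_of_n(word):
    """One-of-N vector: 1 at the word's index, 0 everywhere else."""
    return [1.0 if w == word else 0.0 for w in vocab]

def word_vector(word):
    """Multiply the one-of-N vector by W; this selects one row of W."""
    x = one_of_n(word)
    return [sum(x[i] * W[i][j] for i in range(len(vocab)))
            for j in range(len(W[0]))]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

print(cosine(word_vector("tuesday"), word_vector("wednesday")))
print(cosine(word_vector("tuesday"), word_vector("sweden")))
```

With these invented weights, the two weekday vectors are far closer to each other than a weekday vector is to a country vector, which mirrors the Tuesday/Wednesday and Sweden/Norway observation in the text.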
The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use; and to reason with symbols, they must be bound to the variables in judiciously chosen rules of inference. By contrast, neural networks just use big activity vectors, big weight matrices and scalar non-linearities to perform the type of fast ‘intuitive’ inference that underpins effortless commonsense reasoning.
Before the introduction of neural language models, the standard approach to statistical modelling of language did not exploit distributed representations: it was based on counting frequencies of occurrences of short symbol sequences of length up to $N$ (called $N$-grams). The number of possible $N$-grams is on the order of $V^N$, where $V$ is the vocabulary size, so taking into account a context of more than a handful of words would require very large training corpora. $N$-grams treat each word as an atomic unit, so they cannot generalize across semantically related sequences of words, whereas neural language models can because they associate each word with a vector of real-valued features, and semantically related words end up close to each other in that vector space (Fig. 4).
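A small sketch may make both points concrete: the $V^N$ growth in the number of possible sequences, and the fact that a count-based model knows nothing about a sequence it has not seen, however similar it is to one it has. The helper names `possible_ngrams` and `count_ngrams` and the example sentence are assumptions made for illustration.

```python
from collections import Counter

def possible_ngrams(vocab_size, n):
    """Number of distinct length-n sequences over a vocabulary: V**n."""
    return vocab_size ** n

def count_ngrams(tokens, n):
    """Count every contiguous length-n sequence in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the cat sat on the mat".split()
bigrams = count_ngrams(tokens, 2)

print(possible_ngrams(10_000, 3))   # 1000000000000
print(bigrams[("the", "cat")])      # 1
print(bigrams[("the", "dog")])      # 0: unseen, even though "dog" resembles "cat"
```

Each word is treated as an opaque symbol here, so the count for ("the", "dog") gains nothing from having seen ("the", "cat").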
深度学习理论表明,与不使用分布式表示的经典学习算法相比,深网具有两个不同的指数优势。这两个优点都来自于组合的力量,并取决于具有适当组件结构的底层数据生成分布。首先,学习分布式表示可以将学习到的特征值的新组合推广到训练期间看不到的那些新组合(例如使用$n$个二进制特征可以进行$2^n$个组合)。其次,在一个深层网络中构成表示层会带来另一个指数优势(深度指数)。
多层神经网络的隐藏层学习以易于预测目标输出的方式来表示网络的输入。通过训练多层神经网络从较早单词的局部上下文中预测序列中的下一个单词,可以很好地证明这一点。上下文中的每个单词都以one-of-N向量的形式呈现给网络,也就是说,一个分量的值为1,其余均为0。在第一层中,每个单词都会创建不同的激活模式,即词向量(图4)。在语言模型中,网络的其他层学习将输入的词向量转换为预测的下一个单词的输出词向量,这可用于预测词汇表中任何单词作为下一个单词出现的概率。网络学到的词向量包含许多活跃分量,每个分量都可以解释为单词的一个独立特征,正如最初在学习符号的分布式表示时所证明的那样。这些语义特征并未在输入中明确出现。它们是由学习过程发现的,是将输入和输出符号之间的结构化关系分解为多个“微规则”的好方法。当单词序列来自大量的真实文本并且单个微规则不可靠时,学习词向量也同样非常有效。例如,在训练预测新闻故事中的下一个单词时,学到的Tuesday和Wednesday的词向量非常相似,Sweden和Norway的词向量也是如此。这样的表示称为分布式表示,因为它们的元素(特征)不是互斥的,并且它们的许多配置对应于在观察到的数据中看到的变化。这些词向量由学习到的特征组成,这些特征不是由专家事先确定的,而是由神经网络自动发现的。从文本中学到的单词向量表示现已在自然语言应用中得到非常广泛的使用。
表示问题是逻辑启发和神经网络启发的认知范式之间争论的核心。在逻辑启发范式中,符号实例是某些事物,其唯一属性是它与其他符号实例相同或不同。它没有与其使用相关的内部结构;为了用符号进行推理,必须将它们绑定到明智选择的推理规则中的变量。相比之下,神经网络仅使用较大的活动矢量,较大的权重矩阵和标量非线性来执行快速的“直觉”推断类型,从而支持毫不费力的常识推理。
在引入神经语言模型之前,语言统计建模的标准方法并未利用分布式表示形式:它是基于对长度不超过$N$的短符号序列(称为$N$-grams)的出现频率进行计数。可能的$N$-grams的数量在$V^N$的数量级上,其中$V$是词汇量,因此考虑超过少数几个单词的上下文将需要非常大的训练语料库。$N$-grams将每个单词视为一个原子单元,因此它们无法在语义上相关的单词序列之间进行泛化,而神经语言模型则可以,因为它们将每个单词与一个实值特征向量相关联,而语义相关的单词最终在该向量空间中彼此靠近(图4)。
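Distributed representations and language processing: deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations. Both arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure. The section then describes distributed representations and how their applications have developed.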
When backpropagation was first introduced, its most exciting use was for training recurrent neural networks (RNNs). For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs (Fig. 5). RNNs process an input sequence one element at a time, maintaining in their hidden units a ‘state vector’ that implicitly contains information about the history of all the past elements of the sequence. When we consider the outputs of the hidden units at different discrete time steps as if they were the outputs of different neurons in a deep multilayer network (Fig. 5, right), it becomes clear how we can apply backpropagation to train RNNs.
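The idea of a state vector updated one element at a time can be sketched as follows. This is a minimal illustration, assuming the common update $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b)$, which the paper does not spell out; the names `rnn_step` and `run_rnn` and all weight values are hypothetical.

```python
import math

def rnn_step(x, h, W_xh, W_hh, b):
    """One time step: h_new = tanh(W_xh x + W_hh h + b)."""
    h_new = []
    for i in range(len(h)):
        total = b[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][k] * h[k] for k in range(len(h)))
        h_new.append(math.tanh(total))
    return h_new

def run_rnn(inputs, h0, W_xh, W_hh, b):
    """Process a sequence one element at a time, reusing the same weights."""
    h = h0
    states = []
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b)
        states.append(h)
    return states

# Toy weights, chosen by hand rather than trained.
W_xh = [[0.5], [-0.3]]
W_hh = [[0.1, 0.2], [0.0, 0.4]]
b = [0.0, 0.0]

states_a = run_rnn([[1.0], [0.0], [0.0]], [0.0, 0.0], W_xh, W_hh, b)
states_b = run_rnn([[0.0], [0.0], [0.0]], [0.0, 0.0], W_xh, W_hh, b)
print(states_a[-1], states_b[-1])
```

The two runs differ only in their first input, yet their final states differ: the state vector implicitly carries information about earlier elements. The list `states` is also the unfolded view, one "layer" per time step with shared weights.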
RNNs are very powerful dynamic systems, but training them has proved to be problematic because the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish.
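A rough sketch of why this happens, under a strong simplifying assumption: for a scalar linear recurrence $h_t = w\,h_{t-1}$, the gradient of the final state with respect to the initial state is $w$ multiplied by itself once per time step. The function name `backprop_factor` is hypothetical, and real RNNs involve matrices and non-linearities, so this only shows the basic mechanism.

```python
def backprop_factor(w, steps):
    """Gradient of h_T with respect to h_0 when h_t = w * h_(t-1)."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # each time step multiplies the gradient by w
    return grad

for w in (0.5, 1.0, 1.5):
    print(w, backprop_factor(w, 50))
```

With a factor below one the gradient shrinks towards zero over 50 steps, with a factor above one it grows very large, and only a factor of exactly one leaves it unchanged.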
Thanks to advances in their architecture and ways of training them, RNNs have been found to be very good at predicting the next character in the text or the next word in a sequence, but they can also be used for more complex tasks. For example, after reading an English sentence one word at a time, an English ‘encoder’ network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence. This thought vector can then be used as the initial hidden state of (or as extra input to) a jointly trained French ‘decoder’ network, which outputs a probability distribution for the first word of the French translation. If a particular first word is chosen from this distribution and provided as input to the decoder network it will then output a probability distribution for the second word of the translation and so on until a full stop is chosen. Overall, this process generates sequences of French words according to a probability distribution that depends on the English sentence. This rather naive way of performing machine translation has quickly become competitive with the state-of-the-art, and this raises serious doubts about whether understanding a sentence requires anything like the internal symbolic expressions that are manipulated by using inference rules. It is more compatible with the view that everyday reasoning involves many simultaneous analogies that each contribute plausibility to a conclusion.
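The word-by-word generation loop described above can be sketched as follows. Everything here is a toy stand-in: `toy_decoder` is a hypothetical lookup table rather than a trained network, it ignores the thought vector and any hidden state, and picking the most probable word at each step (greedy decoding) is only one assumed way of choosing a word from the distribution.

```python
def toy_decoder(thought_vector, prev_word):
    """Hypothetical stand-in for a trained decoder: returns a
    probability distribution over the next word given the previous one."""
    table = {
        None: {"le": 0.7, "un": 0.3},
        "le": {"chat": 0.6, "chien": 0.4},
        "chat": {"dort": 0.8, ".": 0.2},
        "dort": {".": 0.9, "le": 0.1},
    }
    return table[prev_word]

def greedy_decode(decoder, thought_vector, max_len=20, stop="."):
    """Choose a word, feed it back in, and repeat until a full stop."""
    words = []
    prev = None
    for _ in range(max_len):
        dist = decoder(thought_vector, prev)
        prev = max(dist, key=dist.get)  # most probable next word
        words.append(prev)
        if prev == stop:
            break
    return words

print(greedy_decode(toy_decoder, thought_vector=[0.1, 0.2]))
```

In a real system the distribution at each step would depend on the thought vector and on the decoder's own hidden state, not only on the previous word.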
Instead of translating the meaning of a French sentence into an English sentence, one can learn to ‘translate’ the meaning of an image into an English sentence (Fig. 3). The encoder here is a deep ConvNet that converts the pixels into an activity vector in its last hidden layer. The decoder is an RNN similar to the ones used for machine translation and neural language modelling. There has been a surge of interest in such systems recently.
RNNs, once unfolded in time (Fig. 5), can be seen as very deep feedforward networks in which all the layers share the same weights. Although their main purpose is to learn long-term dependencies, theoretical and empirical evidence shows that it is difficult to learn to store information for very long.
To correct for that, one idea is to augment the network with an explicit memory. The first proposal of this kind is the long short-term memory (LSTM) networks that use special hidden units, the natural behaviour of which is to remember inputs for a long time. A special unit called the memory cell acts like an accumulator or a gated leaky neuron: it has a connection to itself at the next time step that has a weight of one, so it copies its own real-valued state and accumulates the external signal, but this self-connection is multiplicatively gated by another unit that learns to decide when to clear the content of the memory.
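The accumulator behaviour of the memory cell can be sketched directly from this description. It is a deliberately reduced illustration: the names `memory_cell_step` and `run_cell` are hypothetical, the gate values are set by hand (in an LSTM they would be computed by learned units), and the other gates of a full LSTM are left out.

```python
def memory_cell_step(c_prev, external, keep_gate):
    """Self-connection of weight one, multiplicatively gated:
    keep_gate = 1 copies the old state, keep_gate = 0 clears it."""
    self_weight = 1.0
    return keep_gate * self_weight * c_prev + external

def run_cell(signals, gates):
    """Accumulate external signals over time under the given gate values."""
    c = 0.0
    history = []
    for s, g in zip(signals, gates):
        c = memory_cell_step(c, s, g)
        history.append(c)
    return history

print(run_cell([1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))  # [1.0, 3.0, 6.0]
print(run_cell([1.0, 2.0, 3.0], [1.0, 1.0, 0.0]))  # [1.0, 3.0, 3.0]
```

With the gate held open the cell keeps summing its inputs; closing the gate at the last step discards the stored total before the new signal is added.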
LSTM networks have subsequently proved to be more effective than conventional RNNs, especially when they have several layers for each time step, enabling an entire speech recognition system that goes all the way from acoustics to the sequence of characters in the transcription. LSTM networks or related forms of gated units are also currently used for the encoder and decoder networks that perform so well at machine translation.
Over the past year, several authors have made different proposals to augment RNNs with a memory module. Proposals include the Neural Turing Machine in which the network is augmented by a ‘tape-like’ memory that the RNN can choose to read from or write to, and memory networks, in which a regular network is augmented by a kind of associative memory. Memory networks have yielded excellent performance on standard question-answering benchmarks. The memory is used to remember the story about which the network is later asked to answer questions.
Beyond simple memorization, neural Turing machines and memory networks are being used for tasks that would normally require reasoning and symbol manipulation. Neural Turing machines can be taught ‘algorithms’. Among other things, they can learn to output a sorted list of symbols when their input consists of an unsorted sequence in which each symbol is accompanied by a real value that indicates its priority in the list. Memory networks can be trained to keep track of the state of the world in a setting similar to a text adventure game and after reading a story, they can answer questions that require complex inference. In one test example, the network is shown a 15-sentence version of The Lord of the Rings and correctly answers questions such as “where is Frodo now?”.
首次引入反向传播时,其最令人兴奋的用途是训练循环神经网络(RNN)。对于涉及顺序输入的任务,例如语音和语言,通常最好使用RNN(图5)。 RNN一次处理一个输入序列的一个元素,在其隐藏的单元中维护一个“状态向量”,该“状态向量”隐式包含有关该序列的所有过去元素的历史信息。当我们将隐藏单位在不同离散时间步长的输出视为是深层多层网络中不同神经元的输出时(图5右),显然我们可以如何应用反向传播来训练RNN。
RNN是非常强大的动态系统,但是事实证明,训练它们是有问题的,因为反向传播的梯度在每个时间步长都会增大或缩小,因此在许多时间步长上它们通常会爆炸或消失。
由于其结构和训练方法的进步,人们发现RNN非常擅长预测文本中的下一个字符或序列中的下一个单词,但它们也可以用于更复杂的任务。例如,一次读一个单词的英语句子后,可以训练英语的“编码器”网络,使其隐藏单元的最终状态向量很好地表示了该句子表达的思想。然后,可以将此思想向量用作联合训练的法语“解码器”网络的初始隐藏状态(或作为其额外输入),该网络将输出法语翻译的第一个单词的概率分布。如果从该分布中选择了一个特定的第一个单词,并将其作为输入提供给解码器网络,则它将输出翻译的第二个单词的概率分布,依此类推,直到选择了句号。总体而言,此过程根据取决于英语句子的概率分布生成法语单词序列。这种相当幼稚的执行机器翻译的方式已迅速与最新技术竞争,这引起了人们对理解句子是否需要诸如通过使用推理规则操纵的内部符号表达式之类的严重质疑。日常推理涉及许多同时进行的类比,每个类比都为结论提供了合理性,这一观点与观点更加兼容。
与其将法语句子的含义翻译成英语句子,不如学习将图像的含义“翻译”成英语句子(图3)。这里的编码器是一个深层的ConvNet,可将像素转换为其最后一个隐藏层中的活动矢量。解码器是一个RNN,类似于用于机器翻译和神经语言建模的RNN。近年来,对此类系统的兴趣激增。
RNNs随时间展开(图5),可以看作是非常深的前馈网络,其中所有层共享相同的权重。尽管它们的主要目的是学习长期依赖关系,但理论和经验证据表明,很难长期存储信息。
为了解决这个问题,一个想法是用显式内存扩展网络。此类第一种建议是使用特殊隐藏单元的长短期记忆(LSTM)网络,其自然行为是长时间记住输入。称为存储单元的特殊单元的作用类似于累加器或门控泄漏神经元:它在下一时间步与其自身具有连接,其权重为1,因此它复制自己的实值状态并累积外部信号,但是此自连接是由另一个单元乘法控制的,该单元学会确定何时清除内存内容。
LSTM网络随后被证明比常规RNN更有效,特别是当它们在每个时间步都有多层时,使整个语音识别系统从声学到转录中的字符序列都一路走来。LSTM网络或相关形式的门控单元目前也用于编码器和解码器网络,它们在机器翻译方面表现出色。
在过去的一年中,几位作者提出了不同的建议,以使用内存模块扩展RNN。建议包括神经图灵机,其中网络由RNN可以选择读取或写入的“像带”存储器来增强,以及存储网络,其中常规网络由一种关联性存储器来增强。内存网络在标准问答基准方面已表现出出色的性能。存储器用于记住故事,有关该故事后来被要求网络回答问题。
除了简单的记忆外,神经图灵机和记忆网络还被用于通常需要推理和符号操作的任务。神经图灵机可以被教会“算法”。例如,当输入是一个未排序的序列,且其中每个符号都带有一个表示其在列表中优先级的实数值时,它们可以学会输出已排序的符号列表。可以训练记忆网络,使其在类似于文字冒险游戏的环境中跟踪世界状态,阅读故事后,它们可以回答需要复杂推理的问题。在一个测试示例中,向该网络展示了15句话版本的《指环王》,它正确回答了诸如“Frodo现在在哪里?”之类的问题。
7. Recurrent neural networks: sections 6 and 7 both cover the development of deep learning in text and language processing.
Unsupervised learning had a catalytic effect in reviving interest in deep learning, but has since been overshadowed by the successes of purely supervised learning. Although we have not focused on it in this Review, we expect unsupervised learning to become far more important in the longer term. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object.
Human vision is an active process that sequentially samples the optic array in an intelligent, task-specific way using a small, high-resolution fovea with a large, low-resolution surround. We expect much of the future progress in vision to come from systems that are trained end-to-end and combine ConvNets with RNNs that use reinforcement learning to decide where to look. Systems combining deep learning and reinforcement learning are in their infancy, but they already outperform passive vision systems at classification tasks and produce impressive results in learning to play many different video games.
Natural language understanding is another area in which deep learning is poised to make a large impact over the next few years. We expect systems that use RNNs to understand sentences or whole documents will become much better when they learn strategies for selectively attending to one part at a time.
Ultimately, major progress in artificial intelligence will come about through systems that combine representation learning with complex reasoning. Although deep learning and simple reasoning have been used for speech and handwriting recognition for a long time, new paradigms are needed to replace rule-based manipulation of symbolic expressions by operations on large vectors.
无监督学习在恢复对深度学习的兴趣方面起了催化作用,但此后被监督学习的成功所掩盖。尽管我们在本评论中并未对此进行关注,但我们希望从长远来看,无监督学习将变得越来越重要。人类和动物的学习在很大程度上不受监督:我们通过观察来发现世界的结构,而不是通过告知每个物体的名称来发现世界的结构。
人的视觉是一个主动的过程,它使用小而高分辨率的中央凹及其周围大而低分辨率的区域,以智能的、针对特定任务的方式对光学阵列进行顺序采样。我们预计视觉领域未来的许多进步将来自端到端训练的系统,这些系统将ConvNets与RNN结合起来,并使用强化学习来决定看向哪里。结合了深度学习和强化学习的系统尚处于起步阶段,但在分类任务上它们已经超过了被动视觉系统,并且在学习玩许多不同的视频游戏方面产生了令人印象深刻的结果。
自然语言理解是深度学习必将在未来几年产生巨大影响的另一个领域。我们希望使用RNN理解句子或整个文档的系统在学习一次选择性地关注一部分的策略时会变得更好。
最终,人工智能的重大进步将通过将表示学习与复杂推理相结合的系统来实现。尽管长期以来,深度学习和简单推理已被用于语音和手写识别,但仍需要新的范例来通过对大向量进行运算来代替基于规则的符号表达操纵。
The future of deep learning: this section mainly offers an optimistic outlook on the development of unsupervised learning.