Posted on by Tim Dettmers No Comments Tagged cuDNN, Deep Learning, Deep Neural Networks, Machine Learning
This series of blog posts aims to provide an intuitive and gentle introduction to deep learning that does not rely heavily on math or theoretical constructs. The first part in this series provided an overview of the field of deep learning, covering fundamental and core concepts. The third part of the series covers sequence learning topics such as recurrent neural networks and LSTM.
Be sure to read Part 1: Deep Learning in a Nutshell: Core Concepts.
I wrote this series in a glossary style so it can also be used as a reference for deep learning concepts.
The earliest deep-learning-like algorithms that had multiple layers of non-linear features can be traced back to Ivakhnenko and Lapa in 1965 (Figure 1), who used thin but deep models with polynomial activation functions which they analyzed with statistical methods. In each layer, they selected the best features through statistical methods and forwarded them to the next layer. They did not use backpropagation to train their network end-to-end but used layer-by-layer least squares fitting, where previous layers were fitted independently of later layers.
The earliest convolutional networks were used by Fukushima in 1979. Fukushima’s networks had multiple convolutional and pooling layers similar to modern networks, but the network was trained by using a reinforcement scheme where a trail of strong activation in multiple layers was increased over time. Additionally, one would assign important features of each image by hand by increasing the weight on certain connections.
Backpropagation of errors to train deep models was lacking at this point. Backpropagation was derived already in the early 1960s but in an inefficient and incomplete form. The modern form was derived first by Linnainmaa in his 1970 master’s thesis that included FORTRAN code for backpropagation but did not mention its application to neural networks. Even at this point, backpropagation was relatively unknown and very few documented applications of backpropagation existed in the early 1980s (e.g. Werbos in 1982). Rumelhart, Hinton, and Williams showed in 1985 that backpropagation in neural networks could yield interesting distributed representations. At this time, this was an important result in cognitive psychology where the question was whether human cognition can be thought of as relying on distributed representations (connectionism) or symbolic logic (computationalism).
The first true, practical application of backpropagation came about through the work of LeCun in 1989 at Bell Labs. He used convolutional networks in combination with backpropagation to classify handwritten digits (MNIST) and this system was later used to read large numbers of handwritten checks in the United States. A video from 1993 shows Yann LeCun demonstrating digit classification using the “LeNet” network.
Despite these successes, funding for research into neural networks was scarce. The term artificial intelligence dropped to near pseudoscience status during the AI winter and the field still needed some time to recover. Some important advances were made in this time, for example, the long short-term memory (LSTM) for recurrent neural networks by Hochreiter and Schmidhuber in 1997, but these advances went mostly unnoticed until later as they were overshadowed by the support vector machine developed by Cortes and Vapnik in 1995.
The next big shift occurred just by waiting for computers to get faster, and then later by the introduction of graphics processing units (GPUs). Waiting for faster computers and GPUs alone increased the computational speed by a factor of 1000 over a span of 10 years. In this period, neural networks slowly began to rival support vector machines. Neural networks can be slow when compared to support vector machines, but they reach much better results with the same amount of data. Unlike simpler algorithms, neural networks continue to improve with more training data.
The main hurdle at this point was to train big, deep networks, which suffered from the vanishing gradient problem, where features in early layers could not be learned because no learning signal reached these layers.
The first solution to this problem was layer-by-layer pretraining, where the model is built in a layer-by-layer fashion by using unsupervised learning so that the features in early layers are already initialized or “pretrained” with some suitable features (weights). Pretrained features in early layers only need to be adjusted slightly during supervised learning to achieve good results. The first pretraining approaches were developed for recurrent neural networks by Schmidhuber in 1992, and for feed-forward networks by Hinton and Salakhutdinov in 2006. Another solution for the vanishing gradient problem in recurrent neural networks was long short-term memory in 1997.
As the speed of GPUs increased rapidly, it was soon possible to train deep networks such as convolutional networks without the help of pretraining, as demonstrated by Ciresan and colleagues in 2011 and 2012, who won character recognition, traffic sign, and medical imaging competitions with their convolutional network architecture. Krizhevsky, Sutskever, and Hinton used a similar architecture in 2012 that also features rectified linear activation functions and dropout for regularization. They achieved outstanding results in the ILSVRC-2012 ImageNet competition, which marked the abandonment of feature engineering and the adoption of feature learning in the form of deep learning. Google, Facebook, and Microsoft noticed this trend and made major acquisitions of deep learning startups and research teams between 2012 and 2014. From here, research in deep learning accelerated rapidly.
Additional material: Deep Learning in Neural Networks: An Overview
A perceptron contains only a single linear or nonlinear unit. Geometrically, a perceptron with a nonlinear unit trained with the delta rule can find the nonlinear plane separating data points of two different classes (if the separation plane exists). If no such separation plane exists, the perceptron will often still produce separation planes that provide good classification accuracy. The good performance of the perceptron led to a hype of artificial intelligence. In 1969, however, it was shown that a perceptron may fail to separate seemingly simple patterns such as the points provided by the XOR function. The fall from grace of the perceptron was one of the main reasons for the occurrence of the first AI winter. While neural networks with hidden layers do not suffer from the typical problems of the perceptron, neural networks were still associated with the perceptron and therefore also suffered an image problem during the AI winter.
Despite this, and despite the success of deep learning, perceptrons still find widespread use in the realm of big data, where the simplicity of the perceptron allows for successful application to very large data sets.
Rapid advances in machine learning and other approaches of inference led to a hype of artificial intelligence (similar to the buzz around deep learning today). Researchers made promises that these advances would continue and would lead to strong AI and, in turn, AI research received lots of funding.
In the 1970s it became clear that those promises could not be kept; funding was cut dramatically and the field of artificial intelligence dropped to near pseudoscience status. Research became very difficult (little funding; publications almost never made it through peer review), but nevertheless, a few researchers continued further down this path and their research soon led to the reinvigoration of the field and the creation of the field of deep learning.
This is why excessive deep learning hype is dangerous and researchers typically avoid making predictions about the future: AI researchers want to avoid another AI winter.
AlexNet is a convolutional network architecture named after Alex Krizhevsky, who along with Ilya Sutskever under the supervision of Geoffrey Hinton applied this architecture to the ILSVRC-2012 competition that featured the ImageNet dataset. They improved the convolutional network architecture developed by Ciresan and colleagues, which had won multiple international competitions in 2011 and 2012, by using rectified linear units for enhanced speed and dropout for improved generalization. Their results stood in stark contrast to feature engineering methods, which immediately created a great rift between deep learning and feature engineering methods for computer vision. From here it was apparent that deep learning would take over computer vision and that other methods would not be able to catch up. AlexNet heralded the mainstream usage and the hype of deep learning.
ImageNet Classification with Deep Convolutional Neural Networks.
The process of training a deep learning architecture is similar to how toddlers start to make sense of the world around them. When a toddler encounters a new animal, say a monkey, he or she will not know what it is. But then an adult points with a finger at the monkey and says: “That is a monkey!” The toddler will then be able to associate the image he or she sees with the label “monkey”.
A single image, however, might not be sufficient to label an animal correctly when it is encountered the next time. For example, the toddler might mistake a sloth for a monkey or a monkey for a sloth, or might simply forget the name of a certain animal. For reliable recall and labeling, a toddler needs to see many different monkeys and similar animals and needs to know each time whether or not it is really a monkey: feedback is essential for learning. After some time, if the toddler encounters enough animals paired with their names, the toddler will have learned to distinguish between different animals.
The deep learning process is similar. We present the neural network with images or other data, such as the image of a monkey. The deep neural network predicts a certain outcome, for example, the label of the object in an image (“monkey”). We then supply the network with feedback. For example, if the network predicted that the image showed a monkey with 30% probability and a sloth with 70% probability, then all the outputs in favor of the sloth class made an error! We use this error to adjust the parameters of the neural network using the backpropagation of errors algorithm.
Usually, we randomly initialize the parameters of a deep network so the network initially outputs random predictions. This means for ImageNet, which consists of 1000 classes, we will achieve an average classification accuracy of just 0.1% for any image after initializing the neural network. To improve the performance we need to adjust the parameters so that the classification performance increases over time. But this is inherently difficult: If we adjust one parameter to improve performance on one class, this change might decrease the classification performance for another class. Only if we find parameter changes that work for all classes can we achieve good classification performance.
If you imagine a neural network with only 2 parameters (e.g. -0.37 and 1.14), then you can imagine a mountain landscape, where the height of the landscape represents the classification error and the two directions, north-south (x-axis) and east-west (y-axis), represent the directions in which we can change the two parameters (negative-positive direction). The task is to find the lowest altitude point in the mountain landscape: we want to find the minimum.
The problem with this is that the entire mountain landscape is unknown to us at the beginning. It is as if the whole mountain range is covered in fog. We only know our current position (the initial random parameters) and our height (the current classification error). How can we find the minimum quickly when we have so little information about the landscape?
Imagine you stand on top of a mountain with skis strapped to your feet. You want to get down to the valley as quickly as possible, but there is fog and you can only see your immediate surroundings. How can you get down the mountain as quickly as possible? You look around and identify the steepest path down, go down that path for a bit, again look around and find the new steepest path, go down that path, and repeat. This is exactly what gradient descent does.
While gradient descent is equivalent to stopping every 10 meters and measuring the steepness of your surroundings with a measuring tape (you measure your gradient according to the whole data set), stochastic gradient descent is the equivalent of quickly estimating the steepness with a short glance (just a few hundred data points are used to estimate the steepness).
In terms of stochastic gradient descent, we go down the steepest path (the negative gradient or first derivative) on the landscape of the error function to find a local minimum, that is, the point that yields a low error for our task. We do this in tiny steps so that we do not get trapped in half-pipe-like obstacles (if we are too fast, we never get out of these half-pipes and we may even be “catapulted” up the mountain).
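The difference between the two ways of measuring steepness can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy data set, the squared-error landscape, the learning rate, and the helper names `error`, `gradient`, and `descend` are all hypothetical and not taken from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: targets follow y = 2*x0 - 3*x1 plus a little noise.
X = rng.normal(size=(1000, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w + 0.01 * rng.normal(size=1000)

def error(w, X, y):
    """Mean squared error: the 'height' of the landscape at position w."""
    return np.mean((X @ w - y) ** 2)

def gradient(w, X, y):
    """Slope of the error landscape at w (it points uphill)."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def descend(w, steps, learning_rate, batch_size=None):
    for _ in range(steps):
        if batch_size is None:
            # Gradient descent: measure the steepness on the whole data set.
            X_batch, y_batch = X, y
        else:
            # Stochastic gradient descent: a quick glance at a random mini-batch.
            idx = rng.choice(len(y), size=batch_size, replace=False)
            X_batch, y_batch = X[idx], y[idx]
        # Take a small step in the direction of the negative gradient.
        w = w - learning_rate * gradient(w, X_batch, y_batch)
    return w

w_start = np.array([-0.37, 1.14])  # the two example parameters from the text
w_gd = descend(w_start, steps=200, learning_rate=0.1)
w_sgd = descend(w_start, steps=200, learning_rate=0.1, batch_size=100)
```

Both runs should end up close to `true_w`; the stochastic version looks at only 100 examples per step, so each step is cheaper but noisier.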
While our ski-landscape is 3D, typical error landscapes may have millions of dimensions. In such a space we have many valleys so it is easy to find a good solution, but we also have many saddle points, which makes matters very difficult.
Saddle points are points at which the surroundings are almost entirely flat, yet which may have dramatic descents at one end or the other (saddle points are like plateaus that slightly bend and may lead to a cliff). Most difficulties in finding good solutions on an error landscape with many dimensions stem from navigating saddle points (because these plateaus have almost no steepness, progress is very slow near saddle points) rather than finding the minimum itself (there are many minima, which are almost all of the same quality).
Additional material: Coursera: Neural Networks for Machine Learning: Optimization – How to Make the Learning Go Faster
Backpropagation of errors, or often simply backpropagation, is a method for finding the gradient of the error with respect to weights over a neural network. The gradient signifies how the error of the network changes with changes to the network’s weights. The gradient is used to perform gradient descent and thus find a set of weights that minimize the error of the network.
There are three good ways to teach backpropagation: (1) using a visual representation, (2) using a mathematical representation, (3) using a rule-based representation. The bonus material at the end of this section uses a mathematical representation. Here I’ll use a rule-based representation as it requires little math and is easy to understand.
Imagine a neural network with 100 layers. We can imagine a forward pass in which a matrix (dimensions: number of examples x number of input nodes) is input to the network and propagated through it, where we always have the order (1) input nodes, (2) weight matrix (dimensions: input nodes x output nodes), and (3) output nodes, which usually also have a non-linear activation function (dimensions: examples x output nodes). How can we imagine these matrices?
The input matrix represents the following: For every input node we have one input value, for example, pixels (three input values = three pixels in Figure 1), and we take this times our number of examples, such as the number of images. So for 128 3-pixel images, we have a 128×3 input matrix.
The weight matrix represents the connections between input and output nodes. The value passed to an input node (a pixel) is weighted by the weight matrix values and it “flows” to each output node through these connections. This flow is a result of multiplying the input value by the value of each weight between the input node and output nodes. The output matrix is the accumulated “flow” of all input nodes at all output nodes.
So for each input, we multiply by all weights, and add up all those contributions at the output nodes, or more easily we take the matrix product of the input matrix times the weight matrix. In our example, this would be our 128×3 input matrix multiplied by the 3×5 weight matrix (see Figure 1). We thus receive our output matrix as a result, which in this example is of size 128×5. We then use this output matrix, apply the non-linear activation function and treat our resulting output matrix as the input matrix to the next layer. We repeat these steps until we reach the error function. We then apply the error function to see how far the predictions are from the correct values. We can formulate this whole process of the forward pass, and equivalently the backward pass, by defining simple rules (see Figure 1).
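The shapes in this example can be checked with a short NumPy sketch. The random values and the choice of a rectified linear activation are assumptions made only for illustration; the 128×3, 3×5, and 128×5 dimensions come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

inputs = rng.random((128, 3))       # 128 examples x 3 input nodes (pixels)
weights = rng.normal(size=(3, 5))   # 3 input nodes x 5 output nodes

# Matrix product: every input value is multiplied by its weights and the
# contributions are summed up at each output node.
outputs = inputs @ weights          # shape (128, 5)

# Non-linear activation (rectified linear, chosen here as an assumption).
# The result is the input matrix of the next layer.
activations = np.maximum(0.0, outputs)

# The same "flow" written out by hand for one example and one output node.
manual = sum(inputs[0, i] * weights[i, 2] for i in range(3))
```

Here `manual` is the accumulated flow into output node 2 for the first image, and it matches `outputs[0, 2]` from the matrix product.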
For the forward pass with given input data we go from the first to the last layer according to these rules:
1. When we encounter a weight matrix, we matrix multiply by this weight and propagate the result.
2. If we encounter a function, we put our current result into the function and propagate the function output as our result.
3. We treat outputs of the previous layer as inputs into the next layer.
4. When we encounter the error function, we apply it and thus generate the error for our backward pass.
The backward pass for a given error is similar but proceeds from the last to the first layer, where the error generated in rule 4 in the forward pass represents the “inputs” to the last layer. We then go backward through the network and follow these rules:
1. When we encounter a weight matrix, we matrix multiply by the transpose of the matrix and propagate the result.
2. If we encounter a function, we multiply (element-wise) by the derivative of that function with respect to the inputs that this function received from the forward pass (see Figure 1).
3. We treat errors of the previous layer as inputs (errors) into the next layer.
To calculate the gradients, we use each intermediate result obtained after executing rule 2 in the backward pass and matrix multiply this intermediate result by the value of rule 2 from the forward pass from the previous layer (see Figure 1).
Additional material: Coursera: Neural Networks for Machine Learning: The Backpropagation Learning Procedure
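The forward and backward rules can be traced in a minimal NumPy sketch for a network with one hidden layer. Several choices here are assumptions made for illustration and are not taken from the post: the tiny layer sizes, a tanh hidden activation, an identity output function, and a squared-error function. The finite-difference check at the end is just a sanity test of one gradient entry.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))    # 4 examples x 3 input nodes
T = rng.normal(size=(4, 2))    # correct values for 2 output nodes
W1 = rng.normal(size=(3, 5))   # input nodes x hidden nodes
W2 = rng.normal(size=(5, 2))   # hidden nodes x output nodes

def forward(W1, W2):
    a1 = np.tanh(X @ W1)       # forward rules 1 and 2
    a2 = a1 @ W2               # forward rules 3 and 1 (identity output function)
    return a1, a2

def loss(W1, W2):
    _, a2 = forward(W1, W2)
    return 0.5 * np.sum((a2 - T) ** 2)

# Forward pass, ending with rule 4: apply the error function.
a1, a2 = forward(W1, W2)
error = a2 - T                 # derivative of the squared error w.r.t. a2

# Backward pass.
delta2 = error                              # rule 2: identity has derivative 1
delta1 = (delta2 @ W2.T) * (1.0 - a1 ** 2)  # rule 1 (transpose), then rule 2 (tanh')

# Gradients: backward intermediate result times the forward value
# that was fed into the respective weight matrix.
grad_W2 = a1.T @ delta2
grad_W1 = X.T @ delta1

# Sanity check of one entry with a central finite difference.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss(W1_plus, W2) - loss(W1_minus, W2)) / (2 * eps)
```

If the rules are applied correctly, `numeric` and `grad_W1[0, 0]` agree to several decimal places, and each gradient has the same shape as its weight matrix.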
The rectified linear function is a simple non-linearity: It evaluates to 0 for negative inputs, and positive values remain untouched (f(x) = max(0,x)). The gradient of the rectified linear function is 1 for all positive values and 0 for negative values. This means that during backpropagation, a rectified linear unit that received a negative input passes back a gradient of zero, so no error signal flows through it to update the weights that feed into it.
However, because we have a gradient of 1 for any positive value we have much better training speed when compared to other non-linear functions due to the good gradient flow. For example, the logistic sigmoid function has very tiny gradients for large positive and negative values so that learning nearly stops in these regions (this behavior is similar to a saddle point).
Despite the fact that no gradient propagates through rectified linear functions for negative inputs (the gradient is zero there), the constant gradient of 1 for positive values is very powerful and ensures fast training regardless of how large the input to the unit is. Once these benefits were discovered, rectified linear functions and similar activation functions with large gradients became the activation functions of choice for deep networks.
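The contrast between the two gradients is easy to see numerically. The following sketch uses hypothetical helper names and a handful of arbitrary sample inputs; only the formulas f(x) = max(0, x) and the logistic sigmoid come from the text.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise.
    return (x > 0).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -1.0, 0.5, 10.0])
relu_gradients = relu_grad(x)        # stays at 1 for every positive input
sigmoid_gradients = sigmoid_grad(x)  # shrinks toward 0 at both extremes
```

The sigmoid gradient never exceeds 0.25 and is close to zero at x = ±10, while the rectified linear gradient is exactly 1 wherever the input is positive.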
Momentum uses the idea that the gradient zigzags every now and then but generally follows a rather straight line towards a local minimum. As such, if we move faster in this general direction and disregard the zigzag directions we will, in general, arrive faster at the local minimum.
To realize this behavior we keep track of a running momentum matrix, which is the weighted running sum of the gradient, and we add that momentum matrix value to the gradient. The size of this momentum matrix is kept in check by attenuating it on every update (multiply by a momentum value between 0.7 and 0.99). Over time, the zigzag dimensions will be smoothed out in our running momentum matrix: A zig in one direction and a zag in the exact opposite direction cancel out and yield a straight line towards the general direction of the local minimum. In the beginning, the general direction towards the local minimum is not strongly established (a sequence of zags with no zigs, or vice versa), and the momentum matrix needs to be attenuated more strongly or the values for the momentum increasingly emphasize zigzagging directions, which in turn can lead to unstable learning. Thus, the momentum value should be kept small (0.5-0.7) in the beginning when no general direction towards a local minimum has been established. Later the momentum value can be increased rapidly (0.9-0.999).
Usually, the gradient update is applied first, and then the jump into the momentum direction follows. However, Nesterov showed that it is better to first jump into the momentum direction and then correct this direction with a gradient update; this procedure is known as “Nesterov’s accelerated gradient” (sometimes “Nesterov momentum”) and yields faster convergence to a local minimum.
Additional material: Coursera: Neural Networks for Machine Learning: 3. The Momentum Method
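A rough sketch of both variants, compared against plain gradient descent, might look as follows. The elongated toy error surface, the starting point, the learning rate of 0.03, and the momentum value of 0.9 are assumptions chosen for illustration, and the update is written in one common formulation (a velocity that is attenuated and then updated with the gradient), which may differ in detail from other descriptions.

```python
import numpy as np

# Toy error surface 0.5 * (1*w0^2 + 25*w1^2): a narrow valley that makes
# plain gradient descent slow along the flat direction.
CURVATURE = np.array([1.0, 25.0])

def gradient(w):
    return CURVATURE * w

def run(method, steps=200, learning_rate=0.03, momentum=0.9):
    w = np.array([10.0, 1.0])
    velocity = np.zeros_like(w)   # the running "momentum matrix"
    for _ in range(steps):
        if method == "plain":
            w = w - learning_rate * gradient(w)
            continue
        if method == "momentum":
            g = gradient(w)                        # gradient at the current position
        elif method == "nesterov":
            g = gradient(w + momentum * velocity)  # jump first, then measure
        else:
            raise ValueError(f"unknown method: {method}")
        # Attenuate the old velocity and add the new (negative) gradient step.
        velocity = momentum * velocity - learning_rate * g
        w = w + velocity
    return w

final_distance = {
    method: float(np.linalg.norm(run(method)))
    for method in ("plain", "momentum", "nesterov")
}
```

On this surface the minimum is at the origin, so `final_distance` measures how far each method still is from it after the same number of steps; both momentum variants should end up closer than plain gradient descent.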
RMSprop keeps track of the weighted running mean of the squared gradient and then divides each calculated gradient by the square root of this weighted running mean (it essentially normalizes the gradient by dividing by the magnitude of recent gradients). The consequence is that when a plateau in the error surface is encountered and the gradient is very small, the updates take greater steps, ensuring faster learning (a small update: 0.00001, the square root of the weighted average: 0.00005, update size: 0.2). On the other hand, RMSprop protects against exploding gradients (a large update: 100, the square root of the weighted average: 25, update size: 4) and is thus used frequently in recurrent neural networks and LSTMs to protect against both vanishing and exploding gradients.
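The normalization can be written as a small NumPy function. The names `rmsprop_step` and `settled_update`, the decay of 0.9, and the small `epsilon` that guards against division by zero are assumptions for this sketch; the two worked examples reuse the numbers from the paragraph above.

```python
import numpy as np

def rmsprop_step(gradient, mean_square, decay=0.9, epsilon=1e-8):
    # Weighted running mean of the squared gradient.
    mean_square = decay * mean_square + (1.0 - decay) * gradient ** 2
    # Divide the gradient by the root of that running mean.
    update = gradient / (np.sqrt(mean_square) + epsilon)
    return update, mean_square

def settled_update(gradient_value, steps=100):
    """Normalized update after seeing the same gradient many times."""
    gradient = np.array([gradient_value])
    mean_square = np.zeros(1)
    for _ in range(steps):
        update, mean_square = rmsprop_step(gradient, mean_square)
    return float(update[0])

# The two worked examples from the text.
plateau_example = 0.00001 / 0.00005   # tiny gradient, update size 0.2
exploding_example = 100 / 25          # huge gradient, update size 4

# With a steady gradient the normalized update approaches 1, whether the
# raw gradient is tiny (plateau) or huge (exploding).
tiny = settled_update(1e-5)
huge = settled_update(100.0)
```

In practice the normalized update would still be multiplied by a learning rate before it is subtracted from the weights.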
gradient descent on a saddle point. Saddle points are thought to be the main difficulty in optimizing deep networks. Image by Alec Radford.
Additional material:
Coursera: Neural Networks for Machine Learning: RMSProp
Additional animations comparing different optimization problems.
Imagine you (a unit in a convolutional network) are preparing for an exam (a classification task) and you know that during the exam you are permitted to copy answers from your peers (other units). Will you study for the exam? The answer to this question is probably yes or no depending on whether at least some students in your class have studied for the exam.
Let’s say you know that there are two students (units) in your class (convolutional net) who have the reputation of studying for every exam they take (every image that is presented). So you do not study for the exam and just copy from these students (you weight the input from a single “elite” unit in the previous layer highly).
Now we introduce an infectious flu (dropout) that affects 50% of all students. Now there is a high chance that these two students who actually studied for the exam will not be present, so relying on copying their answers is no longer a good strategy. So this time you have to learn by yourself (make choices which take into account all units in a layer and not just the elite units).
In other words, dropout decouples the information processing of units so that they cannot rely on some unit “superstars” which always seem to have the right answer (these superstars detect features which are more important than the features that other units detect).
This in turn democratizes the classification process so that every unit makes computations that are largely independent of strong influencers, and thus reduces bias by ensuring less extreme opinions (there are no mainstream opinions). This decoupling of units in turn leads to strong regularization and better generalization (wisdom of the crowd).
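In code, the “flu” is simply a random mask over the units of a layer. The sketch below uses the 50% rate from the analogy; the function name `dropout` and the rescaling by the keep probability (often called inverted dropout, a common convention that the text does not describe) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, drop_prob=0.5, training=True):
    if not training:
        # At test time every unit is present.
        return activations
    keep_prob = 1.0 - drop_prob
    # Each unit independently "catches the flu" with probability drop_prob.
    mask = rng.random(activations.shape) < keep_prob
    # Rescale the survivors so the expected activation stays the same
    # (the inverted-dropout convention, an assumption of this sketch).
    return activations * mask / keep_prob

layer_output = np.ones((1000, 100))
dropped = dropout(layer_output)
```

Roughly half of the entries in `dropped` are zero and the rest are scaled to 2, so the next layer cannot count on any particular unit being available.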
L1 and L2 regularization penalizes the size of the weights of a network so that large output values that signify strong confidence can no longer be achieved from a single large weight, but instead require several medium-sized weights. Since many units have to agree to achieve a large value, it is less likely that the output will be biased by the opinion of a single unit. Conceptually, it penalizes strong opinions from single units and encourages taking into account the opinion of multiple units, thus reducing bias.
The L1 regularization penalizes the absolute size of the weight, while the L2 penalizes the squared size of the weight. This penalty is added to the error function value, thus increasing the error if larger weights are used. As a result, the network is driven to solve the problem with small weights.
Since even small weights produce a sizeable L1 penalty, the L1 penalty has the effect that most weights will be set to zero while a few medium-to-large weights remain. Because fewer non-zero weights exist, the network must be highly confident about its results to achieve good predictive performance.
The L2 penalty encourages very small non-zero weights (large weight = very large error). Here the prediction is made by almost all weights, thus reducing the bias (there are no influencers that can turn around outcomes by themselves).
Additional material: Coursera: Neural Networks for Machine Learning: 2. Limiting the Size of the Weights
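The two penalties are short enough to write out directly. In this sketch the function names, the `strength` parameter, and the two example weight vectors are hypothetical; both vectors produce the same output of 4 for an input of all ones, so only the penalty distinguishes them.

```python
import numpy as np

def l1_penalty(weights, strength=1.0):
    # Penalizes the absolute size of each weight.
    return strength * np.sum(np.abs(weights))

def l2_penalty(weights, strength=1.0):
    # Penalizes the squared size of each weight.
    return strength * np.sum(weights ** 2)

one_large = np.array([4.0, 0.0, 0.0, 0.0])       # a single strong opinion
several_medium = np.array([1.0, 1.0, 1.0, 1.0])  # many units have to agree

l1_large, l1_medium = l1_penalty(one_large), l1_penalty(several_medium)
l2_large, l2_medium = l2_penalty(one_large), l2_penalty(several_medium)

# Gradients of the penalties: L1 pushes every non-zero weight toward zero
# with the same force, L2 pushes in proportion to the size of the weight.
l1_gradient = np.sign(several_medium)
l2_gradient = 2.0 * one_large
```

The L2 penalty is four times larger for the single large weight than for the spread-out weights, while the L1 penalty is the same for both vectors and is indifferent to how the total is distributed.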
This concludes part 2 of this crash course on deep learning. Please check back soon for the next part of the series. In part 3, I’ll provide some details on learning algorithms, unsupervised learning, sequence learning, and natural language processing, and in part 4 I’ll go into reinforcement learning. In case you missed it, be sure to check out part 1 of the series.
Meanwhile, you might be interested in learning about cuDNN, DIGITS, Computer Vision with Caffe, Natural Language Processing with Torch, Neural Machine Translation, the Mocha.jl deep learning framework for Julia, or other Parallel Forall posts on deep learning.
Tim Dettmers is a master’s student in informatics at the University of Lugano where he works on deep learning research. Before that he studied applied mathematics and worked for three years as a software engineer in the automation industry. He runs a blog about deep learning and takes part in Kaggle data science competitions where he has reached a world rank of 63. Follow @Tim_Dettmers on Twitter
Posted on by Tim Dettmers 8 Comments Tagged cuDNN, Deep Learning, Deep Neural Networks, Machine Learning, Neural Networks
This post is the first in a series I’ll be writing for Parallel Forall that aims to provide an intuitive and gentle introduction to deep learning. It covers the most important deep learning concepts and aims to provide an understanding of each concept rather than its mathematical and theoretical details. While the mathematical terminology is sometimes necessary and can further understanding, these posts use analogies and images whenever possible to provide easily digestible bits comprising an intuitive overview of the field of deep learning.
I wrote this series in a glossary style so it can also be used as a reference for deep learning concepts.
Part 1 focuses on introducing the main concepts of deep learning. Part 2 provides historical background and delves into the training procedures, algorithms and practical tricks that are used in training for deep learning. Part 3 covers sequence learning, including recurrent neural networks, LSTMs, and encoder-decoder systems for neural machine translation. Part 4 covers reinforcement learning.
In machine learning we (1) take some data, (2) train a model on that data, and (3) use the trained model to make predictions on new data. The process of training a model can be seen as a learning process where the model is exposed to new, unfamiliar data step by step. At each step, the model makes predictions and gets feedback about how accurate its generated predictions were. This feedback, which is provided in terms of an error according to some measure (for example distance from the correct solution), is used to correct the errors made in prediction.
The learning process is often a game of back-and-forth in the parameter space: If you tweak a parameter of the model to get a prediction right, the model may have changed in such a way that it gets a previously correct prediction wrong. It may take many iterations to train a model with good predictive performance. This iterative predict-and-adjust process continues until the predictions of the model no longer improve.
Feature engineering is the art of extracting useful patterns from data that will make it easier for machine learning models to distinguish between classes. For example, you might take the number of greenish vs. bluish pixels as an indicator of whether a land or water animal is in some picture. This feature is helpful for a machine learning model because it limits the number of classes that need to be considered for a good classification.
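A hand-engineered feature of this kind could be computed as in the sketch below. The function name `greenish_fraction`, the RGB array layout (height x width x 3), and the rule that a pixel counts as greenish when its green channel exceeds its blue channel are all assumptions made for illustration.

```python
import numpy as np

def greenish_fraction(image):
    """Fraction of pixels whose green channel exceeds the blue channel.

    `image` is assumed to be an array of shape (height, width, 3) in RGB order.
    """
    green = image[..., 1]
    blue = image[..., 2]
    return float(np.mean(green > blue))

# Two hypothetical 4x4 test images: a meadow-like one and a water-like one.
meadow = np.zeros((4, 4, 3))
meadow[..., 1] = 200.0   # mostly green
water = np.zeros((4, 4, 3))
water[..., 2] = 200.0    # mostly blue

meadow_feature = greenish_fraction(meadow)
water_feature = greenish_fraction(water)
```

A classifier that receives this single number instead of all the raw pixels has a much easier time separating land animals from water animals, which is the point of the feature.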
Feature engineering is the most important skill when you want to achieve good results for most prediction tasks. However, it is difficult to learn and master since different data sets and different kinds of data require different feature engineering approaches. Only crude guidelines exist, which makes feature engineering more of an art than a science. Features that are usable for one data set often are not usable for other data sets (for example the next image data set only contains land animals). The difficulty of feature engineering and the effort involved is the main reason to seek algorithms that can learn features; that is, algorithms that automatically engineer features.
While many tasks can be automated by feature learning (like object and speech recognition), feature engineering remains the single most effective technique to do well in difficult tasks (like most tasks in Kaggle machine learning competitions).
Feature learning algorithms find the common patterns that are important to distinguish between classes and extract them automatically to be used in a classification or regression process. Feature learning can be thought of as feature engineering done automatically by algorithms. In deep learning, convolutional layers are exceptionally good at finding good features in images and passing them to the next layer to form a hierarchy of nonlinear features that grow in complexity (e.g. blobs, edges -> noses, eyes, cheeks -> faces). The final layer(s) use all these generated features for classification or regression (the last layer in a convolutional net is, essentially, multinomial logistic regression).
Figure 1: Learned hierarchical features from a deep learning algorithm. Each feature can be thought of as a filter, which filters the input image for that feature (a nose). If the feature is found, the responsible unit or units generate large activations, which can be picked up by the later classifier stages as a good indicator that the class is present. Image by Honglak Lee and colleagues (2011) as published in “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”.
Figure 1 shows features generated by a deep learning algorithm that generates easily interpretable features. This is rather unusual. Features are normally difficult to interpret, especially in deep networks like recurrent neural networks and LSTMs or very deep convolutional networks.
In hierarchical feature learning, we extract multiple layers of non-linear features and pass them to a classifier that combines all the features to make predictions. We are interested in stacking such very deep hierarchies of non-linear features because we cannot learn complex features from a few layers. It can be shown mathematically that for images the best features for a single layer are edges and blobs because they contain the most information that we can extract from a single non-linear transformation. To generate features that contain more information we cannot operate on the inputs directly, but we need to transform our first features (edges and blobs) again to get more complex features that contain more information to distinguish between classes.
Ithas been shown that the human brain does exactly the same thing: Thefirst hierarchy of neurons that receives information in the visualcortex are sensitive to specific edges and blobs while brainregions further down the visual pipeline are sensitive to morecomplex structures such as faces.
Whilehierarchical feature learning was used before the field deep learningexisted, these architectures suffered from major problems such as thevanishing gradientproblem where the gradients became too small to provide a learningsignal for very deep layers, thus making these architectures performpoorly when compared to shallow learning algorithms (such as supportvector machines).
The term deep learning originated from new methods and strategies designed to generate these deep hierarchies of non-linear features by overcoming the problems with vanishing gradients so that we can train architectures with dozens of layers of non-linear hierarchical features. In the early 2010s, it was shown that combining GPUs with activation functions that offered better gradient flow was sufficient to train deep architectures without major difficulties. From here the interest in deep learning grew steadily.
Deep learning is not associated just with learning deep non-linear hierarchical features, but also with learning to detect very long non-linear time dependencies in sequential data. While most other algorithms that work on sequential data only have a memory of the last 10 time steps, long short-term memory (LSTM) recurrent neural networks (invented by Sepp Hochreiter and Jürgen Schmidhuber in 1997) allow the network to pick up on activity hundreds of time steps in the past to make accurate predictions. While LSTM networks have been mostly ignored in the past 10 years, their usage has grown rapidly since 2013 and together with convolutional nets they form one of two major success stories of deep learning.
Regression analysis estimates the relationship between statistical input variables in order to predict an outcome variable. Logistic regression is a regression model that uses input variables to predict a categorical outcome variable that can take on one of a limited set of class values, for example “cancer” / “no cancer”, or an image category such as “bird” / “car” / “dog” / “cat” / “horse”.
Logistic regression applies the logistic sigmoid function (see Figure 2) to weighted input values to generate a prediction of which of two classes the input data belongs to (or in case of multinomial logistic regression, which of multiple classes).
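To make the "sigmoid applied to weighted inputs" idea concrete, here is a minimal sketch in Python with NumPy. The weights, bias, and input values are hypothetical and chosen only for illustration, and the `softmax` function stands in for the multinomial case.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, weights, bias):
    """Probability that input x belongs to the positive class."""
    return sigmoid(np.dot(weights, x) + bias)

def softmax(z):
    """Multinomial case: turns one score per class into class probabilities."""
    shifted = z - np.max(z)  # subtract the maximum for numerical stability
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

# Hypothetical weights and input, chosen only for illustration.
weights = np.array([0.8, -0.4])
bias = 0.1
x = np.array([2.0, 1.0])

p = predict_proba(x, weights, bias)  # weighted sum is 1.6 - 0.4 + 0.1 = 1.3
class_probs = softmax(np.array([2.0, 1.0, 0.1]))  # scores for three classes
```

A prediction is then made by thresholding `p` (for example at 0.5) or, in the multinomial case, by picking the class with the largest probability.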
Logistic regression is similar to a non-linear perceptron or a neural network without hidden layers. The main difference from other basic models is that logistic regression is easy to interpret and reliable if some statistical properties for the input variables hold. If these statistical properties hold one can produce a very reliable model with very little input data. This makes logistic regression valuable for areas where data are scarce, like the medical and social sciences where logistic regression is used to analyze and interpret results from experiments. Because it is simple and fast it is also used for very large data sets.
In deep learning, the final layer of a neural network used for classification can often be interpreted as a logistic regression. In this context, one can see a deep learning algorithm as multiple feature learning stages, which then pass their features into a logistic regression that classifies an input.
An artificial neural network (1) takes some input data, (2) transforms this input data by calculating a weighted sum over the inputs, and (3) applies a non-linear function to this transformation to calculate an intermediate state. The three steps above constitute what is known as a layer, and the transformative function is often referred to as a unit. The intermediate states, often termed features, are used as the input into another layer.
Through repetition of these steps, the artificial neural network learns multiple layers of non-linear features, which it then combines in a final layer to create a prediction.
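The three steps and their repetition can be sketched in a few lines of Python with NumPy. This is only an illustration: the layer sizes, the random weights, and the choice of `tanh` as the non-linear function are assumptions, and no training happens here.

```python
import numpy as np

def layer(inputs, weights, bias):
    """One layer: weighted sum of the inputs followed by a non-linear function."""
    weighted_sum = weights @ inputs + bias  # step (2): weighted sum
    return np.tanh(weighted_sum)            # step (3): non-linearity

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 inputs -> 3 hidden features -> 2 outputs.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)

x = np.array([0.5, -1.0, 2.0, 0.0])  # step (1): input data
features = layer(x, W1, b1)          # intermediate state ("features")
output = layer(features, W2, b2)     # the features are the input to the next layer
```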
The neural network learns by generating an error signal that measures the difference between the predictions of the network and the desired values and then using this error signal to change the weights (or parameters) so that predictions get more accurate.
A unit often refers to the activation function in a layer by which the inputs are transformed via a nonlinear activation function (for example by the logistic sigmoid function). Usually, a unit has several incoming connections and several outgoing connections. However, units can also be more complex, like long short-term memory (LSTM) units, which have multiple activation functions with a distinct layout of connections to the nonlinear activation functions, or maxout units, which compute the final output over an array of nonlinearly transformed input values. Pooling, convolution, and other input transforming functions are usually not referred to as units.
The term artificial neuron (or most often just neuron) is an equivalent term to unit, but implies a close connection to neurobiology and the human brain while deep learning has very little to do with the brain (for example, it is now thought that biological neurons are more similar to entire multilayer perceptrons rather than a single unit in a neural network). The term neuron was encouraged after the last AI winter to differentiate the more successful neural network from the failing and abandoned perceptron. However, since the wild successes of deep learning after 2012, the media often picked up on the term “neuron” and sought to explain deep learning as mimicry of the human brain, which is very misleading and potentially dangerous for the perception of the field of deep learning. Now the term neuron is discouraged and the more descriptive term unit should be used instead.
An activation function takes in weighted data (matrix multiplication between input data and weights) and outputs a non-linear transformation of the data. For example, f(x) = max(0, x) is the rectified linear activation function (essentially, set all negative values to zero). The difference between units and activation functions is that units can be more complex, that is, a unit can have multiple activation functions (for example LSTM units) or a slightly more complex structure (for example maxout units).
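As a small sketch of the rectified linear activation function in Python with NumPy (the input values are made up for illustration):

```python
import numpy as np

def relu(x):
    """Rectified linear activation: negative values become zero."""
    return np.maximum(0.0, x)

weighted_input = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
activated = relu(weighted_input)  # -> [0.0, 0.0, 0.0, 1.5, 3.0]
```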
The difference between linear and non-linear activation functions can be shown with the relationship of some weighted values: Imagine the four points A1, A2, B1 and B2. The pairs A1 / A2, and B1 / B2 lie close to each other, but A1 is distant from B1 and B2, and vice versa; the same for A2.
With a linear transformation the relationship between pairs might change. For example A1 and A2 might be far apart, but this implies that B1 and B2 are also far apart. The distance between the pairs might shrink, but if it does, then both B1 and B2 will be close to A1 and A2 at the same time. We can apply many linear transformations, but the relationship between A1 / A2 and B1 / B2 will always be similar.
In contrast, with a non-linear activation function we can increase the distance between A1 and A2 while we decrease the distance between B1 and B2. We can make B1 close to A1, but B2 distant from A1. By applying non-linear functions, we create new relationships between the points. With every new non-linear transformation we can increase the complexity of the relationships. In deep learning, using non-linear activation functions creates increasingly complex features with every layer.
In contrast, the features of 1000 layers of pure linear transformations can be reproduced by a single layer (because a chain of matrix multiplication can always be represented by a single matrix multiplication). This is why non-linear activation functions are so important in deep learning.
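The collapse of stacked linear layers into one can be checked numerically. The sketch below uses three small, hypothetical weight matrices in place of 1000 layers, and a rectified linear function as the example non-linearity.

```python
import numpy as np

x = np.array([1.0, 1.0])

# Three purely linear "layers" (hypothetical weight matrices).
W1 = np.array([[1.0, 2.0], [3.0, -4.0]])
W2 = np.array([[0.0, 1.0], [-1.0, 2.0]])
W3 = np.array([[2.0, 0.0], [1.0, 1.0]])

stacked = W3 @ (W2 @ (W1 @ x))  # apply the layers one after another
W_single = W3 @ W2 @ W1         # collapse them into a single matrix
collapsed = W_single @ x        # one layer gives the same result: [-2, -6]

def relu(v):
    return np.maximum(0.0, v)

# With a non-linearity between the layers, the single matrix no longer
# reproduces the output (here the result is [0, 0]).
nonlinear = W3 @ relu(W2 @ relu(W1 @ x))
```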
A layer is the highest-level building block in deep learning. A layer is a container that usually receives weighted input, transforms it with a set of mostly non-linear functions and then passes these values as output to the next layer. A layer is usually uniform, that is, it only contains one type of activation function, pooling, convolution etc. so that it can be easily compared to other parts of the network. The first and last layers in a network are called input and output layers, respectively, and all layers in between are called hidden layers.
Convolution is a mathematical operation which describes a rule of how to mix two functions or pieces of information: (1) The feature map (or input data) and (2) the convolution kernel mix together to form (3) a transformed feature map. Convolution is often interpreted as a filter, where the kernel filters the feature map for information of a certain kind (for example one kernel might filter for edges and discard other information).
Convolution is important in physics and mathematics as it defines a bridge between the spatial and time domains (pixel with intensity 147 at position (0,30)) and the frequency domain (amplitude of 0.3, at 30Hz, with 60-degree phase) through the convolution theorem. This bridge is defined by the use of Fourier transforms: When you use a Fourier transform on both the kernel and the feature map, then the convolution operation is simplified significantly (integration becomes mere multiplication). Some of the fastest GPU implementations of convolutions (for example some implementations in the NVIDIA cuDNN library) currently make use of Fourier transforms.
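The convolution theorem can be verified on a tiny one-dimensional example. This sketch uses NumPy's FFT routines and made-up signal and kernel values; it illustrates the principle only and says nothing about how cuDNN implements it.

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([0.5, 0.25, 0.25])

# Direct convolution in the spatial/time domain.
direct = np.convolve(signal, kernel)  # output length 4 + 3 - 1 = 6

# Same result via the frequency domain: transform, multiply, transform back.
n = len(signal) + len(kernel) - 1     # zero-pad both to the full output length
via_fft = np.fft.irfft(np.fft.rfft(signal, n) * np.fft.rfft(kernel, n), n)

agree = np.allclose(direct, via_fft)
```

For inputs this small the direct method is faster; the frequency-domain route tends to pay off for large kernels and feature maps.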
Convolution can describe the diffusion of information, for example, the diffusion that takes place if you put milk into your coffee and do not stir can be accurately modeled by a convolution operation (pixels diffuse towards contours in an image). In quantum mechanics, it describes the probability of a quantum particle being in a certain place when you measure the particle’s position (average probability for a pixel’s position is highest at contours). In probability theory, it describes cross-correlation, which is the degree of similarity for two sequences that overlap (similarity high if the pixels of a feature (e.g. nose) overlap in an image (e.g. face)). In statistics, it describes a weighted moving average over a normalized sequence of input (large weights for contours, small weights for everything else). Many other interpretations exist.
While it is unknown which interpretation of convolution is correct for deep learning, the cross-correlation interpretation is currently the most useful: convolutional filters can be interpreted as feature detectors, that is, the input (feature map) is filtered for a certain feature (the kernel) and the output is large if the feature is detected in the image. This is exactly how you interpret cross-correlation for an image.
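The feature-detector reading can be illustrated with a naive cross-correlation in Python with NumPy. The toy image and the hand-written edge kernel are assumptions made for this sketch; in a real convolutional net the kernel values are learned. Note that a strict mathematical convolution would additionally flip the kernel before sliding it.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products
    ("valid" mode: the kernel always stays fully inside the image)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# Toy 5x5 image: dark on the left (0), bright on the right (1).
image = np.array([[0, 0, 0, 1, 1]] * 5, dtype=float)

# Hypothetical kernel that responds to dark-to-bright vertical edges.
kernel = np.array([[-1.0, 1.0],
                   [-1.0, 1.0]])

response = cross_correlate2d(image, kernel)
# Each row of the response is [0, 0, 2, 0]: large only where the edge is.
```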
Steven Smith’s excellent free online book about digital signal processing.
Additional material: Understanding Convolution in Deep Learning
Pooling is a procedure that takes input over a certain area and reduces that to a single value (subsampling). In convolutional neural networks, this concentration of information has the useful property that outgoing connections usually receive similar information (the information is “funneled” into the right place for the input feature map of the next convolutional layer). This provides basic invariance to rotations and translations. For example, if the face on an image patch is not in the center of the image but slightly translated, it should still work fine because the information is funneled into the right place by the pooling operation so that the convolutional filters can detect the face.
The larger the size of the pooling area, the more information is condensed, which leads to slim networks that fit more easily into GPU memory. However, if the pooling area is too large, too much information is thrown away and predictive performance decreases.
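A minimal sketch of max pooling, one common pooling variant, in Python with NumPy. The feature map values are made up, and non-overlapping square pooling areas are an assumption of this example.

```python
import numpy as np

def max_pool2d(feature_map, size):
    """Non-overlapping max pooling: keep the largest value in each size x size area."""
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    trimmed = feature_map[:h_out * size, :w_out * size]  # drop rows/columns that do not fit
    blocks = trimmed.reshape(h_out, size, w_out, size)
    return blocks.max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 0, 5, 6],
                        [1, 2, 7, 8]], dtype=float)

pooled_2x2 = max_pool2d(feature_map, 2)  # -> [[4, 2], [2, 8]]
pooled_4x4 = max_pool2d(feature_map, 4)  # -> [[8]]
```

The 2×2 pooling keeps a quarter of the values, while the 4×4 pooling condenses the whole map into one number and discards where the large activations were.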
Additional material: Neural networks [9.5]: Computer vision – pooling and subsampling
A convolutional neural network, or preferably convolutional network or convolutional net (the term neural is misleading; see also artificial neuron), uses convolutional layers (see convolution) that filter inputs for useful information. These convolutional layers have parameters that are learned so that these filters are adjusted automatically to extract the most useful information for the task at hand (see Feature Learning). For example, in a general object recognition task it might be most useful to filter information about the shape of an object (objects usually have very different shapes) while for a bird recognition task it might be more suitable to extract information about the color of the bird (most birds have a similar shape, but different colors; here color is more useful to distinguish between birds). Convolutional networks adjust automatically to find the best feature for these tasks.
Usually, multiple convolutional layers are used that filter images for more and more abstract information after each layer (see hierarchical features).
Convolutional networks usually also use pooling layers (see pooling) for limited translation and rotation invariance (detect the object even if it appears at some unusual place). Pooling also reduces the memory consumption and thus allows for the usage of more convolutional layers.
More recent convolutional networks use inception modules (see inception) which use 1×1 convolutional kernels to reduce the memory consumption further while speeding up the computation (and thus training).
Maurice Peemen.
Additional material: Coursera: Neural Networks for Machine Learning: Object Recognition with Neural Nets.
Inception modules in convolutional networks were designed to allow for deeper and larger convolutional layers while at the same time allowing for more efficient computation. This is done by using 1×1 convolutions with small feature map size, for example, 192 28×28 sized feature maps can be reduced to 64 28×28 feature maps through 64 1×1 convolutions. Because of the reduced size, these 1×1 convolutions can be followed up with larger convolutions of size 3×3 and 5×5. In addition to 1×1 convolution, max pooling may also be used to reduce dimensionality.
In the output of an inception module, all the large convolutions are concatenated into a big feature map which is then fed into the next layer (or inception module).
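The reduction from 192 to 64 feature maps through 1×1 convolutions can be sketched in Python with NumPy. The feature map and kernel values are random placeholders, and the 32 output maps of the follow-up 5×5 convolution are a hypothetical choice used only to show the savings in weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# 192 feature maps of size 28x28 (random values for illustration).
feature_maps = rng.normal(size=(192, 28, 28))

# 64 kernels of size 1x1: each is just a vector of 192 weights, one per input map.
kernels_1x1 = rng.normal(size=(64, 192))

# A 1x1 convolution mixes the 192 input maps at every pixel independently.
reduced = np.einsum('oc,chw->ohw', kernels_1x1, feature_maps)  # shape (64, 28, 28)

# Weights needed by a following 5x5 convolution that produces 32 maps
# (32 is a hypothetical number of output maps).
weights_without_reduction = 5 * 5 * 192 * 32  # 153600
weights_with_reduction = 5 * 5 * 64 * 32      # 51200
```

The spatial size stays at 28×28; only the number of feature maps shrinks, which is what makes the larger convolutions that follow cheaper.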
Additional material: Going Deeper with Convolutions
This concludes part one of this crash course on deep learning. Please check back soon for the next two parts of the series. In part 2, I’ll provide a brief historical overview followed by an introduction to training deep neural networks.
Meanwhile, you might be interested in learning about cuDNN, DIGITS, Computer Vision with Caffe, Natural Language Processing with Torch, Neural Machine Translation, the Mocha.jl deep learning framework for Julia, or other Parallel Forall posts on deep learning.
Posted by Tim Dettmers. Tagged: cuDNN, Debugging, Deep Learning, Deep Neural Networks, LSTM, Machine Learning, Natural Language Processing, RNN
This series of blog posts aims to provide an intuitive and gentle introduction to deep learning that does not rely heavily on math or theoretical constructs.
Be sure to check out the other Deep Learning in a Nutshell posts: Part 1, Part 2.
Klaus Greff and colleagues as published in LSTM: A Search Space Odyssey.
Everything in life depends on time and therefore represents a sequence. To perform machine learning with sequential data (text, speech, video, etc.) we could use a regular neural network and feed it the entire sequence, but the input size of our data would be fixed, which is quite limiting. Other problems with this approach occur if important events in a sequence lie just outside of the input window. What we need is (1) a network to which we can feed sequences of arbitrary length, one element of the sequence per time step (for example a video is just a sequence of images; we feed the network one image at a time); and (2) a network which has some kind of memory to remember important events which happened many time steps in the past. These problems and requirements have led to a variety of different recurrent neural networks.
If we want a regular neural network to solve the problem of adding two numbers, then we could just input the two numbers and train the network to predict the sum of these inputs. If we now have three instead of two numbers which we want to add we could (1) extend our network with additional input and additional weights and retrain it, or (2) feed the output, that is the sum, of the first two numbers along with the third number back into the network. Approach (2) is clearly better because we can hope that the network will perform well without retraining the entire network (the network already “knows” how to add two numbers, so it should know how to add a sum of two numbers and another number). If we are instead given the task of first adding two numbers and then subtracting two different numbers, this approach would not work anymore. Even if we use an additional weight from the output we cannot guarantee correct outputs: it is equivalent to trying to approximate subtraction of two numbers with an addition and a multiplication by a sum, which will generally not work! Instead, we could try to “change the program” of the network from “addition” to “subtraction”. We can do this by weighting the output of the hidden layer and feeding it back into the hidden layer: a recurrent weight (see Figure 2)! With this approach we change the internal dynamics of the network with every new input; that is, at every new number. The network will learn to change the program from “addition” to “subtraction” after the first two numbers and thus will be able to solve the problem (albeit with some errors in accuracy).
We can even generalize this approach and feed the network with two numbers, one by one, and then feed in a “special” number that represents the mathematical operation “addition”, “subtraction”, “multiplication” and so forth. Although this would not work perfectly in practice, we could get results which are in the ballpark of the correct result. However, the main point here is not getting the correct result, but that we can train recurrent neural networks to learn very specific outputs for an arbitrary sequence of inputs, which is very powerful.
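The recurrent weight that feeds the hidden layer's output back into the hidden layer can be sketched as follows in Python with NumPy. The sizes, the random (untrained) weights, and the `tanh` non-linearity are assumptions for illustration; the sketch shows only how one network processes sequences of different lengths, not how it is trained to add or subtract.

```python
import numpy as np

def rnn_forward(sequence, W_in, W_rec, b):
    """Process a sequence one element per time step, carrying a hidden state."""
    hidden = np.zeros(W_rec.shape[0])
    for x_t in sequence:
        # The recurrent weight W_rec feeds the previous hidden state back in,
        # so every new input changes the internal dynamics of the network.
        hidden = np.tanh(W_in @ x_t + W_rec @ hidden + b)
    return hidden

rng = np.random.default_rng(1)
input_size, hidden_size = 1, 8  # hypothetical sizes
W_in = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_rec = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

# The same network accepts sequences of different lengths.
short = rnn_forward(np.array([[2.0], [3.0]]), W_in, W_rec, b)
longer = rnn_forward(np.array([[2.0], [3.0], [5.0], [1.0]]), W_in, W_rec, b)
```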
As an example, we can teach recurrent networks to learn sequences of words. Soumith Chintala and Wojciech Zaremba wrote an excellent Parallel Forall blog post about natural language understanding using RNNs. RNNs can also be used to generate sequences. Andrej Karpathy wrote a fascinating and entertaining blog post in which he demonstrated character-level RNNs that can generate imitations of everything from Shakespeare to Linux source code, to baby names.
Long short-term memory (LSTM) units use a linear unit with a self-connection with a constant weight of 1.0. This allows a value (forward pass) or gradient (backward pass) that flows into this self-recurrent unit to be preserved indefinitely (inputs or errors multiplied by 1.0 still have the same value; thus, the output or error of the previous time step is the same as the output for the next time step) so that the value or gradient can be retrieved exactly at the time step when it is needed most. This self-recurrent unit, the memory cell, provides a kind of memory which can store information which lies dozens of time steps in the past. This is very powerful for many tasks, for example for text data, an LSTM unit can store information contained in the previous paragraph and apply this information to a sentence in the current paragraph.
Additionally, a common problem in deep networks is the “vanishing gradient” problem, where the gradient gets smaller and smaller with each layer until it is too small to affect the deepest layers. With the memory cell in LSTMs, we have continuous gradient flow (errors maintain their value) which thus eliminates the vanishing gradient problem and enables learning from sequences which are hundreds of time steps long.
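The arithmetic behind the constant weight of 1.0 is easy to check. In this small Python sketch the factors 0.9 and 1.1 are hypothetical stand-ins for what an error signal might be multiplied by at each time step in a network without such a memory cell.

```python
steps = 100

# Error signal passed back through a self-connection with weight 1.0.
preserved = 1.0 ** steps

# The same signal if each step scaled it by 0.9 instead (hypothetical factor).
shrunk = 0.9 ** steps    # roughly 2.7e-5: the signal has all but vanished

# With a factor above 1 the signal explodes instead of vanishing.
exploded = 1.1 ** steps  # roughly 1.4e4
```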
However, sometimes we want to throw away information in the memory cell and replace it with newer, more relevant information. At the same time, we do not want to confuse the rest of the network by releasing unnecessary information into the network. To solve this problem, the LSTM unit has a forget gate which deletes the information in the self-recurrent unit without releasing the information into the network (see Figure 1). In this way, we can throw away information without causing confusion and make room for a new memory. The forget gate does this by multiplying the value of the memory cell by a number between 0 (delete) and 1 (keep everything). The exact value is determined by the current input and the LSTM unit output of the previous time step.
At other times, the memory cell contains a value that needs to be kept intact for many time steps. To do this LSTM adds another gate, the input or write gate, which can be closed so that no new information flows into the memory cell (see Figure 1). This way the data in the memory cell is protected until it is needed.
Another gate manipulates the output from the memory cell by multiplying the output of the memory cell by a number between 0 (no outputs) and 1 (preserve output) (see Figure 1). This gate may be useful if multiple memories compete against each other: A memory cell might say “My memory is very important right now! So I release it now!” but the network might say: “Your memory is important, true, but other memory cells have much more important memories than you do! So I send small values to your output gate which will turn you off and large values to the other output gates so that the more important memories win!”
The connectivity of an LSTM unit may seem a bit complicated at first, and you will need some time to understand it. However, if you isolate all the parts you can see that the structure is essentially the same as in normal recurrent neural networks, in which inputs and recurrent weights flow to all gates, which in turn are connected to the self-recurrent memory cell.
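To see how the three gates and the memory cell fit together, here is a minimal sketch of one LSTM time step in Python with NumPy. It follows a commonly used formulation without peephole connections, which is an assumption on my part and may differ in detail from the unit shown in Figure 1; the sizes and the random, untrained weights are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of an LSTM unit with forget, input and output gates."""
    z = np.concatenate([x, h_prev])  # gates see the current input and the previous output
    f = sigmoid(params["W_f"] @ z + params["b_f"])  # forget gate: 0 = delete, 1 = keep
    i = sigmoid(params["W_i"] @ z + params["b_i"])  # input (write) gate
    o = sigmoid(params["W_o"] @ z + params["b_o"])  # output gate
    candidate = np.tanh(params["W_c"] @ z + params["b_c"])  # proposed new content
    c = f * c_prev + i * candidate  # memory cell: old value scaled by f, plus gated new content
    h = o * np.tanh(c)              # gated output released to the rest of the network
    return h, c

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4  # hypothetical sizes
params = {}
for name in ("f", "i", "o", "c"):
    params["W_" + name] = rng.normal(scale=0.1, size=(n_hidden, n_in + n_hidden))
    params["b_" + name] = np.zeros(n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x in rng.normal(size=(5, n_in)):  # a sequence of 5 inputs
    h, c = lstm_step(x, h, c, params)
```

If the forget gate is fully open (f = 1) and the input gate closed (i = 0), the memory cell simply carries its previous value forward, which is the self-connection with weight 1.0 in action.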
To dive deeper into LSTM and make sense of the whole architecture I recommend reading LSTM: A Search Space Odyssey and the original LSTM paper.
Imagine the word “cat” and all other words which closely relate to the word “cat”. You might think about words like “kitty”, “feline”. Now think about words which are a bit dissimilar to the word “cat” but which are more similar to “cat” than, say, “car”. You might come up with nouns like “lion”, “tiger”, “dog”, “animal” or verbs like “purring”, “mewing”, “sleeping” and so forth.
Now imagine a three-dimensional space where we put the word “cat” in the middle of that space. The words mentioned above that are more similar to the word “cat” map to locations closer to the location of “cat” in the space; for example, “kitty” and “feline” are close; “tiger” and “lion” are a bit further away; “dog” further still; and “car” is very, very far away. See Figure 3 for an example of words in cooking recipes and their word embedding space in two dimensions.
If we use vectors that point to each word in this space, then each vector will consist of 3 coordinates which give the position in this space, e.g. if “cat” is (0,0,0) then “kitty” might have the coordinates (0.1,0.2,-0.3) while “car” has the coordinates (10,0,-15).
This space is then a word embedding space for our vocabulary (words) above and each vector with 3 coordinates is a word embedding which we can use as input data for our algorithms.
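Using the example coordinates from the text, "close" and "far away" can be made concrete with a small Python sketch. Euclidean distance is one simple choice of similarity measure and is my assumption here; other measures, such as cosine similarity, are also common.

```python
import numpy as np

# Coordinates taken from the example in the text.
embeddings = {
    "cat":   np.array([0.0, 0.0, 0.0]),
    "kitty": np.array([0.1, 0.2, -0.3]),
    "car":   np.array([10.0, 0.0, -15.0]),
}

def distance(word_a, word_b):
    """Euclidean distance between two word embeddings."""
    return np.linalg.norm(embeddings[word_a] - embeddings[word_b])

cat_to_kitty = distance("cat", "kitty")  # sqrt(0.14), about 0.37
cat_to_car = distance("cat", "car")      # sqrt(325), about 18.0
```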
Typically, an embedding space contains thousands of words and more than a hundred dimensions. This space is difficult to comprehend for humans, but the property that similar words lie close together in the embedding space still holds. For machines, this is an excellent representation for words which improves results for many tasks that deal with language.
If you want to learn more about word embeddings and see how we can apply them to create models which can “understand” language (at least to a certain degree), I recommend that you read Understanding Natural Language with Deep Neural Networks Using Torch by Soumith Chintala and Wojciech Zaremba.
Stop for a moment and imagine a tomato. Now think about ingredients or dishes that go well with tomatoes. If your thinking is similar to the most common recipes that you find online, then you should have thought about ingredients such as cheese and salami (pizza); parmesan, basil, macaroni; and other ingredients such as olive oil, thyme, and parsley. These ingredients are mostly associated with Italian and Mediterranean cuisines.
Now think about the same tomato but in terms of Mexican cuisine. You might associate the tomato with beans, corn (maize), chilies, cilantro, or avocados.
What you just did was to transform the representation of the word “tomato” into a new representation in terms of “tomato in Mexican cuisine”.
You can think of the encoder doing the same thing. It takes input word by word and transforms it into a new “thought vector” by transforming the representation of all words accordingly (just like adding the context “Mexican cuisine” to “tomato”). This is the first step of the encoder-decoder architecture.
The second step in the encoder-decoder architecture exploits the fact that representations of two different languages have similar geometry in the word embedding space even though they use completely different words for a certain thing. For example, the word cat in German is “Katze” and the word dog is “Hund”, which are of course different words, but the underlying relationships between these words will be nearly the same, that is, Katze relates to Hund as cat relates to dog, such that the difference in “thought vectors” between Katze and Hund, and between cat and dog, will be quite similar. Or in other words, even though the words are different they describe a similar “thought vector”. There are some words which cannot really be expressed in another language, but these cases are rare and in general the relationships between words should be similar.
With these ideas, we can now construct a decoder network. We feed our generated “thought vectors” from our English encoder word by word into the German decoder. The German decoder reinterprets these “thought vectors” or “transformations of relationships” as belonging to the “German word vector space”, and thus will produce a sentence which captures the same relationships as the English sentence, but in terms of German words. So we essentially construct a network that can translate languages. This idea is still developing in current research; results are not perfect, but they are rapidly improving and soon this method might be the best way to translate languages.
If you would like to dive deeper into the encoder-decoder architecture and Neural Machine Translation, I recommend you read this comprehensive Parallel Forall blog post about the encoder-decoder architecture by Kyunghyun Cho.
We have seen that we can use recurrent neural networks to solve difficult tasks which deal with sequences of arbitrary length. We have also seen that memory of past input is crucial for successful sequence learning and that LSTMs provide improved performance in this case and alleviate the vanishing gradient problem. We delved into word embeddings and how we can use them to train recurrent architectures to acquire a certain sense of language understanding.
I hope you enjoyed this blog post. Stay tuned for the next post in this series which will deal with reinforcement learning. If you have any questions please post them in the comments below and I will answer them. Be sure to read part 1, part 2, and part 4 of the series to learn about fundamental and core deep learning concepts, history and training algorithms, and reinforcement learning!
To learn even more about deep neural networks, come to the 2016 GPU Technology Conference (April 4-7 in San Jose, CA) and learn from the experts. Register now to save 20% with code “MHarris2603”!