In recent years, there’s been a resurgence in the field of Artificial Intelligence. It’s spread beyond the academic world with major players like Google, Microsoft, and Facebook creating their own research teams and making some impressive acquisitions.
Some of this can be attributed to the abundance of raw data generated by social network users, much of which needs to be analyzed, as well as to the cheap computational power available via GPGPUs.
But beyond these phenomena, this resurgence has been powered in no small part by a new trend in AI, specifically in machine learning, known as “Deep Learning”. In this tutorial, I’ll introduce you to the key concepts and algorithms behind deep learning, beginning with the simplest unit of composition and building to the concepts of machine learning in Java.
The examples in this article are implemented using a Java deep learning library.
The general procedure is as follows:
This setting is incredibly general: your data could be symptoms and your labels illnesses; or your data could be images of handwritten characters and your labels the actual characters they represent.
One of the earliest supervised training algorithms is that of the perceptron, a basic neural network building block.
Say we have n points in the plane, labeled ‘0’ and ‘1’. We’re given a new point and we want to guess its label (this is akin to the “Dog” and “Not dog” scenario above). How do we do it?
One approach might be to look at the closest neighbor and return that point’s label. But a slightly more intelligent way of going about it would be to pick a line that best separates the labeled data and use that as your classifier.
In this case, each piece of input data would be represented as a vector x = (x_1, x_2) and our function would be something like “‘0’ if below the line, ‘1’ if above”.
To represent this mathematically, let our separator be defined by a vector of weights w and a vertical offset (or bias) b. Then, our function would combine the inputs and weights with a weighted sum transfer function: f(x) = w · x + b.
The result of this transfer function would then be fed into an activation function to produce a labeling. In the example above, our activation function was a threshold cutoff (e.g., 1 if greater than some value): h(x) = 1 if f(x) > 0, and 0 otherwise.
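To make the transfer and activation steps concrete, here is a minimal sketch in Java. The class and method names are hypothetical (they are not taken from the library used in this article), and the cutoff of 0 is an assumption; any other cutoff can be folded into the bias b.

```java
/** Sketch of a perceptron's forward pass: weighted-sum transfer, then a threshold activation. */
final class PerceptronForward {

    /** Transfer function: dot product of weights and inputs, plus the bias. */
    static double weightedSum(double[] w, double[] x, double b) {
        double sum = b;
        for (int i = 0; i < w.length; i++) {
            sum += w[i] * x[i];
        }
        return sum;
    }

    /** Threshold activation: label 1 if the weighted sum is positive, else 0. */
    static int classify(double[] w, double[] x, double b) {
        return weightedSum(w, x, b) > 0 ? 1 : 0;
    }

    public static void main(String[] args) {
        // Hypothetical separator: the line x_2 = x_1, written as -x_1 + x_2 + 0 = 0.
        double[] w = {-1.0, 1.0};
        double b = 0.0;
        System.out.println("(1, 3) -> " + classify(w, new double[] {1.0, 3.0}, b)); // above the line
        System.out.println("(3, 1) -> " + classify(w, new double[] {3.0, 1.0}, b)); // below the line
    }
}
```

Points above the line get label 1 and points below it get label 0, which is exactly the "above or below the line" rule described earlier.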
The training of the perceptron consists of feeding it multiple training samples and calculating the output for each of them. After each sample, the weights w are adjusted so as to minimize the output error, defined as the difference between the desired (target) and the actual outputs. There are other error functions, like the mean squared error, but the basic principle of training remains the same.
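As a rough illustration of this training loop, the sketch below applies the classic perceptron learning rule (each weight moves by learning rate × error × input) to the logical AND function, which is linearly separable. The names, the zero initialization, and the choice of AND as a data set are all assumptions made for the example.

```java
/** Sketch of perceptron training with the classic perceptron learning rule. */
final class PerceptronTraining {

    /** params holds the weights followed by the bias as its last element. */
    static int predict(double[] params, double[] x) {
        double sum = params[x.length]; // bias
        for (int i = 0; i < x.length; i++) {
            sum += params[i] * x[i];
        }
        return sum > 0 ? 1 : 0;
    }

    /** Adjusts weights after every misclassified sample; stops early once an epoch has no mistakes. */
    static double[] train(double[][] inputs, int[] targets, double learningRate, int maxEpochs) {
        int n = inputs[0].length;
        double[] params = new double[n + 1]; // all weights and the bias start at zero
        for (int epoch = 0; epoch < maxEpochs; epoch++) {
            int mistakes = 0;
            for (int s = 0; s < inputs.length; s++) {
                int error = targets[s] - predict(params, inputs[s]); // target minus actual output
                if (error != 0) {
                    mistakes++;
                    for (int i = 0; i < n; i++) {
                        params[i] += learningRate * error * inputs[s][i];
                    }
                    params[n] += learningRate * error;
                }
            }
            if (mistakes == 0) {
                break;
            }
        }
        return params;
    }

    public static void main(String[] args) {
        double[][] inputs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        int[] targets = {0, 0, 0, 1}; // logical AND
        double[] params = train(inputs, targets, 1.0, 100);
        for (double[] x : inputs) {
            System.out.println((int) x[0] + " AND " + (int) x[1] + " -> " + predict(params, x));
        }
    }
}
```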
The single perceptron approach to deep learning has one major drawback: it can only learn linearly separable functions. How major is this drawback? Take XOR, a relatively simple function, and notice that it can’t be classified by a linear separator (notice the failed attempt, below):
To address this problem, we’ll need to use a multilayer perceptron, also known as a feedforward neural network: in effect, we’ll compose a bunch of these perceptrons together to create a more powerful mechanism for learning.
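To see why composition helps, here is a small sketch of a 2-2-1 network of threshold perceptrons that computes XOR. The weights are set by hand rather than learned, and the particular decomposition (an OR unit and a NAND unit feeding an AND unit) is just one of several that work.

```java
/** Sketch: XOR built from three threshold perceptrons with hand-picked weights. */
final class XorNetwork {

    static int step(double z) {
        return z > 0 ? 1 : 0;
    }

    static int xor(int x1, int x2) {
        int h1 = step(x1 + x2 - 0.5);   // hidden unit 1 acts as OR
        int h2 = step(-x1 - x2 + 1.5);  // hidden unit 2 acts as NAND
        return step(h1 + h2 - 1.5);     // output unit acts as AND of the two hidden units
    }

    public static void main(String[] args) {
        for (int x1 = 0; x1 <= 1; x1++) {
            for (int x2 = 0; x2 <= 1; x2++) {
                System.out.println(x1 + " XOR " + x2 + " -> " + xor(x1, x2));
            }
        }
    }
}
```

Each hidden unit is still a linear separator; it is the second layer that combines the two half-planes into a region no single line could carve out.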
A neural network is really just a composition of perceptrons, connected in different ways and operating on different activation functions.
For starters, we’ll look at the feedforward neural network, which has the following properties:
What if each of our perceptrons is only allowed to use a linear activation function? Then, the final output of our network will still be some linear function of the inputs, just adjusted with a ton of different weights that it’s collected throughout the network. In other words, a linear composition of a bunch of linear functions is still just a linear function. If we’re restricted to linear activation functions, then the feedforward neural network is no more powerful than the perceptron, no matter how many layers it has.
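The collapse of stacked linear layers can be checked numerically. In the sketch below (hypothetical names, arbitrary example weights), two linear layers W1, b1 and W2, b2 are compared with the single layer whose weights are W2·W1 and whose bias is W2·b1 + b2.

```java
import java.util.Arrays;

/** Sketch: two stacked linear layers are equivalent to one linear layer. */
final class LinearCollapse {

    /** Computes w * x + b for a weight matrix w indexed [output][input]. */
    static double[] affine(double[][] w, double[] b, double[] x) {
        double[] y = new double[w.length];
        for (int r = 0; r < w.length; r++) {
            y[r] = b[r];
            for (int c = 0; c < x.length; c++) {
                y[r] += w[r][c] * x[c];
            }
        }
        return y;
    }

    /** Plain matrix product a * b. */
    static double[][] multiply(double[][] a, double[][] b) {
        double[][] out = new double[a.length][b[0].length];
        for (int r = 0; r < a.length; r++) {
            for (int c = 0; c < b[0].length; c++) {
                for (int k = 0; k < b.length; k++) {
                    out[r][c] += a[r][k] * b[k][c];
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] w1 = {{1, 2}, {3, 4}, {5, 6}};
        double[] b1 = {1, 0, -1};
        double[][] w2 = {{1, 0, 2}, {0, 1, -1}};
        double[] b2 = {0.5, -0.5};
        double[] x = {2, -1};

        double[] twoLayers = affine(w2, b2, affine(w1, b1, x));
        // Collapsed layer: weights W2*W1, bias W2*b1 + b2.
        double[] oneLayer = affine(multiply(w2, w1), affine(w2, b2, b1), x);

        System.out.println(Arrays.toString(twoLayers));
        System.out.println(Arrays.toString(oneLayer));
    }
}
```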
Because of this, most neural networks use non-linear activation functions like the logistic, tanh, binary, or rectifier. Without them the network can only learn functions which are linear combinations of its inputs.
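For reference, the four activation functions named above can be sketched in a few lines of Java. The method names are my own, and the binary function is assumed to be a step with its cutoff at 0.

```java
/** Sketch of common non-linear activation functions. */
final class Activations {

    /** Logistic (sigmoid): squashes any input into the range (0, 1). */
    static double logistic(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Hyperbolic tangent: squashes any input into the range (-1, 1). */
    static double tanh(double z) {
        return Math.tanh(z);
    }

    /** Binary step: 1 for positive input, 0 otherwise (cutoff of 0 is an assumption). */
    static double binaryStep(double z) {
        return z > 0 ? 1.0 : 0.0;
    }

    /** Rectifier: passes positive input through and clips negative input to 0. */
    static double rectifier(double z) {
        return Math.max(0.0, z);
    }

    public static void main(String[] args) {
        double[] samples = {-2.0, 0.0, 2.0};
        for (double z : samples) {
            System.out.println(z + ": logistic=" + logistic(z) + " tanh=" + tanh(z)
                    + " step=" + binaryStep(z) + " rectifier=" + rectifier(z));
        }
    }
}
```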
The most common deep learning algorithm for supervised training of multilayer perceptrons is known as backpropagation. The basic procedure:
The output error is calculated, typically the mean squared error:
Where t is the target value and y is the actual network output. Other error calculations are also acceptable, but the MSE is a good choice.
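As a small sketch, the mean squared error over a vector of outputs can be computed as below. This version takes the plain mean of the squared differences between t and y; some texts add a factor of 1/2 to simplify the derivative, so treat the exact scaling as an assumption.

```java
/** Sketch: mean squared error between target values t and network outputs y. */
final class ErrorFunctions {

    static double meanSquaredError(double[] target, double[] output) {
        double sum = 0.0;
        for (int i = 0; i < target.length; i++) {
            double diff = target[i] - output[i];
            sum += diff * diff;
        }
        return sum / target.length;
    }

    public static void main(String[] args) {
        double[] t = {1.0, 0.0};
        double[] y = {0.5, 0.5};
        System.out.println("MSE = " + meanSquaredError(t, y));
    }
}
```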
Network error is minimized using a method called stochastic gradient descent.
Gradient descent is universal, but in the case of neural networks, this would be a graph of the training error as a function of the input parameters. The optimal value for each weight is that at which the error achieves a global minimum. During the training phase, the weights are updated in small steps (after each training sample or a mini-batch of several samples) in such a way that they are always trying to reach the global minimum—but this is no easy task, as you often end up in local minima, like the one on the right. For example, if the weight has a value of 0.6, it needs to be changed towards 0.4.
This figure represents the simplest case, that in which error depends on a single parameter. However, network error depends on every network weight and the error function is much, much more complex.
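The single-parameter case is easy to simulate. The sketch below assumes a hypothetical error curve E(w) = (w − 0.4)², which has its only minimum at w = 0.4, and starts the weight at 0.6. A real error surface is not this well behaved and can have local minima.

```java
/** Sketch: gradient descent on a hypothetical one-parameter error curve E(w) = (w - 0.4)^2. */
final class GradientDescent1D {

    static double error(double w) {
        return (w - 0.4) * (w - 0.4);
    }

    /** Derivative of the error with respect to the weight. */
    static double gradient(double w) {
        return 2.0 * (w - 0.4);
    }

    /** Takes the given number of small steps against the gradient. */
    static double descend(double w, double learningRate, int steps) {
        for (int i = 0; i < steps; i++) {
            w -= learningRate * gradient(w);
        }
        return w;
    }

    public static void main(String[] args) {
        double start = 0.6;
        System.out.println("after 1 step:    " + descend(start, 0.1, 1));
        System.out.println("after 100 steps: " + descend(start, 0.1, 100));
    }
}
```

Each step shrinks the distance to the minimum by a constant factor here, so the weight slides from 0.6 toward 0.4 without overshooting.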
Thankfully, backpropagation provides a method for updating each weight between two neurons with respect to the output error. The derivation itself is quite complicated, but the weight update for a given node has the following (simple) form:
Where E is the output error, and w_i is the weight of input i to the neuron.
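In code, an update of this kind for a single neuron might look like the sketch below. It assumes the common form Δw_i = −α · ∂E/∂w_i with a learning rate α, a logistic activation, and the error E = ½(t − y)²; none of these specifics are fixed by the text above, and the bias is left out for brevity.

```java
/** Sketch: one gradient-descent weight update for a single logistic neuron. */
final class NeuronUpdate {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    static double output(double[] w, double[] x) {
        double sum = 0.0;
        for (int i = 0; i < w.length; i++) {
            sum += w[i] * x[i];
        }
        return sigmoid(sum);
    }

    /** Assumed error function: E = 1/2 * (t - y)^2. */
    static double error(double[] w, double[] x, double t) {
        double y = output(w, x);
        return 0.5 * (t - y) * (t - y);
    }

    /** Returns new weights w_i - alpha * dE/dw_i, where dE/dw_i = -(t - y) * y * (1 - y) * x_i. */
    static double[] step(double[] w, double[] x, double t, double alpha) {
        double y = output(w, x);
        double[] updated = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            double gradient = -(t - y) * y * (1.0 - y) * x[i];
            updated[i] = w[i] - alpha * gradient;
        }
        return updated;
    }

    public static void main(String[] args) {
        double[] w = {0.2, -0.1};
        double[] x = {1.0, 0.5};
        System.out.println("error before: " + error(w, x, 1.0));
        System.out.println("error after:  " + error(step(w, x, 1.0, 0.5), x, 1.0));
    }
}
```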
Essentially, the goal is to move against the gradient of the error with respect to weight i, i.e., in the direction that reduces the error. The key term is, of course, the derivative of the error, which isn’t always easy to calculate: how would you find this derivative for a random weight of a random hidden node in the middle of a large network?
The answer: through backpropagation. The errors are first calculated at the output units where the formula is quite simple (based on the difference between the target and predicted values), and then propagated back through the network in a clever fashion, allowing us to efficiently update our weights during training and (hopefully) reach a minimum.
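To make the "propagated back" step less abstract, here is a sketch of backpropagation for a tiny 2-2-1 network with logistic units and the error E = ½(t − y)². The parameter layout and names are hypothetical. The main method compares the backpropagated gradients with slow finite-difference estimates, a common sanity check.

```java
/** Sketch: backpropagation for a 2-2-1 logistic network, checked against numerical gradients. */
final class BackpropCheck {
    // Parameter layout (hypothetical): p[2*j + i] = weight from input i to hidden unit j,
    // p[4 + j] = bias of hidden unit j, p[6 + j] = weight from hidden unit j to the output, p[8] = output bias.

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    static double[] hidden(double[] p, double[] x) {
        double[] h = new double[2];
        for (int j = 0; j < 2; j++) {
            h[j] = sigmoid(p[2 * j] * x[0] + p[2 * j + 1] * x[1] + p[4 + j]);
        }
        return h;
    }

    static double output(double[] p, double[] h) {
        return sigmoid(p[6] * h[0] + p[7] * h[1] + p[8]);
    }

    static double error(double[] p, double[] x, double t) {
        double y = output(p, hidden(p, x));
        return 0.5 * (t - y) * (t - y);
    }

    /** Error derivatives for all nine parameters, computed from the output layer backwards. */
    static double[] backprop(double[] p, double[] x, double t) {
        double[] h = hidden(p, x);
        double y = output(p, h);
        double[] g = new double[9];
        double deltaOut = (y - t) * y * (1.0 - y); // error signal at the output unit
        for (int j = 0; j < 2; j++) {
            g[6 + j] = deltaOut * h[j];
            // Error signal propagated back to hidden unit j through its outgoing weight.
            double deltaHidden = deltaOut * p[6 + j] * h[j] * (1.0 - h[j]);
            g[2 * j] = deltaHidden * x[0];
            g[2 * j + 1] = deltaHidden * x[1];
            g[4 + j] = deltaHidden;
        }
        g[8] = deltaOut;
        return g;
    }

    /** Central finite differences: slow, but independent of the backpropagation formulas. */
    static double[] numericalGradient(double[] p, double[] x, double t) {
        double eps = 1e-6;
        double[] g = new double[p.length];
        for (int k = 0; k < p.length; k++) {
            double[] plus = p.clone();
            double[] minus = p.clone();
            plus[k] += eps;
            minus[k] -= eps;
            g[k] = (error(plus, x, t) - error(minus, x, t)) / (2.0 * eps);
        }
        return g;
    }

    public static void main(String[] args) {
        double[] p = {0.1, -0.2, 0.3, 0.4, 0.05, -0.05, 0.5, -0.6, 0.1};
        double[] x = {1.0, 0.5};
        double[] analytic = backprop(p, x, 1.0);
        double[] numeric = numericalGradient(p, x, 1.0);
        double maxDiff = 0.0;
        for (int k = 0; k < p.length; k++) {
            maxDiff = Math.max(maxDiff, Math.abs(analytic[k] - numeric[k]));
        }
        System.out.println("largest difference between the two gradients: " + maxDiff);
    }
}
```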
The hidden layer is of particular interest. By the universal approximation theorem, a single hidden layer network with a finite number of neurons can approximate any continuous function on a compact input domain to arbitrary accuracy. In other words, a single hidden layer is, in principle, expressive enough to represent almost any function we care about. That said, we often learn better in practice with multiple hidden layers (i.e., deeper nets).
The hidden layer is where the network stores its internal abstract representation of the training data, similar to the way that a human brain (greatly simplified analogy) has an internal representation of the real world. Going forward in the tutorial, we’ll look at different ways to play around with the hidden layer.
You can see a simple (4-2-3 layer) feedforward neural network that classifies the IRIS dataset implemented in Java here through the testMLPSigmoidBP method. The dataset contains three classes of iris plants with features like sepal length, petal length, etc. The network is provided 50 samples per class. The features are clamped to the input units, while each output unit corresponds to a single class of the dataset (“1/0/0” indicates that the plant is of class Setosa, “0/1/0” indicates Versicolour, and “0/0/1” indicates Virginica). The classification error is 2/150 (i.e., it misclassifies 2 samples out of 150).
A neural network can have more than one hidden layer: in that case, the higher layers are “building” new abstractions on top of previous layers. And as we mentioned before, you can often learn better in practice with larger networks.
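The “1/0/0” style of target is often called one-hot encoding. As a sketch (with hypothetical helper names, independent of the library's own test code), encoding a class index and decoding the network's three outputs back into a class could look like this:

```java
import java.util.Arrays;

/** Sketch: one-hot targets for a three-class problem and decoding by largest output. */
final class OneHot {

    /** Returns a vector with 1.0 at classIndex and 0.0 elsewhere. */
    static double[] encode(int classIndex, int classCount) {
        double[] target = new double[classCount];
        target[classIndex] = 1.0;
        return target;
    }

    /** Returns the index of the output unit with the highest activation. */
    static int decode(double[] outputs) {
        int best = 0;
        for (int i = 1; i < outputs.length; i++) {
            if (outputs[i] > outputs[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] classes = {"Setosa", "Versicolour", "Virginica"};
        System.out.println(Arrays.toString(encode(1, classes.length)));
        // Made-up output activations, for illustration only.
        System.out.println(classes[decode(new double[] {0.1, 0.2, 0.7})]);
    }
}
```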
However, increasing the number of hidden layers leads to two known issues: vanishing gradients and overfitting.
Let’s look at some deep learning algorithms to address these issues.
Most introductory machine learning classes tend to stop with feedforward neural networks. But the space of possible nets is far richer—so let’s continue.
An autoencoder is typically a feedforward neural network which aims to learn a compressed, distributed representation (encoding) of a dataset.
Conceptually, the network is trained to “recreate” the input, i.e., the input and the target data are the same. In other words: you’re trying to output the same thing you were given as input, but compressed in some way. This is a confusing approach, so let’s look at an example.
Say that the training data consists of 28x28 grayscale images and the value of each pixel is clamped to one input layer neuron (i.e., the input layer will have 784 neurons). Then, the output layer would have the same number of units (784) as the input layer and the target value for each output unit would be the grayscale value of one pixel of the image.
The intuition behind this architecture is that the network will not learn a “mapping” between the training data and its labels, but will instead learn the internal structure and features of the data itself. (Because of this, the hidden layer is also called a feature detector.) Usually, the number of hidden units is smaller than the input/output layers, which forces the network to learn only the most important features and achieves a dimensionality reduction.
In effect, we want a few small nodes in the middle to really learn the data at a conceptual level, producing a compact representation that in some way captures the core features of our input.
To further demonstrate autoencoders, let’s look at one more application.
In this case, we’ll use a simple dataset consisting of flu symptoms (credit to this blog post for the idea). If you’re interested, the code for this example can be found in the testAEBackpropagation method.
Here’s how the data set breaks down:
We’ll consider a patient to be sick when he or she has at least two of the first three features and healthy if he or she has at least two of the second three (with ties breaking in favor of the healthy patients), e.g.:
We’ll train an autoencoder (using backpropagation) with six input and six output units, but only two hidden units.
After several hundred iterations, we observe that when each of the “sick” samples is presented to the machine learning network, one of the two hidden units (the same unit for each “sick” sample) always exhibits a higher activation value than the other. Conversely, when a “healthy” sample is presented, the other hidden unit has a higher activation.
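For readers who want to experiment without the library, below is a self-contained sketch of a 6-2-6 autoencoder trained with plain backpropagation. It is not the library's testAEBackpropagation code: the eight sample patterns, the seed, the learning rate, and all names are assumptions, and whether the two hidden units split cleanly into “sick” and “healthy” detectors can depend on the random initialization.

```java
import java.util.Arrays;
import java.util.Random;

/** Sketch of a 6-2-6 autoencoder trained with backpropagation; the target is the input itself. */
final class TinyAutoencoder {
    // Hypothetical samples: the first four are "sick" patterns, the last four "healthy" ones.
    static final double[][] SAMPLES = {
        {1, 1, 1, 0, 0, 0}, {1, 0, 1, 0, 0, 0}, {1, 1, 0, 0, 0, 0}, {0, 1, 1, 0, 0, 0},
        {0, 0, 0, 1, 1, 1}, {0, 0, 0, 1, 0, 1}, {0, 0, 0, 0, 1, 1}, {0, 0, 0, 1, 1, 0}
    };

    final double[][] encoderWeights; // indexed [hidden][input]
    final double[][] decoderWeights; // indexed [output][hidden]
    final double[] hiddenBias;
    final double[] outputBias;

    TinyAutoencoder(int inputs, int hidden, long seed) {
        Random random = new Random(seed);
        encoderWeights = new double[hidden][inputs];
        decoderWeights = new double[inputs][hidden];
        hiddenBias = new double[hidden];
        outputBias = new double[inputs];
        for (double[] row : encoderWeights) fill(row, random);
        for (double[] row : decoderWeights) fill(row, random);
    }

    private static void fill(double[] row, Random random) {
        for (int i = 0; i < row.length; i++) row[i] = random.nextDouble() - 0.5;
    }

    /** One fully connected layer with a logistic activation. */
    static double[] layer(double[][] weights, double[] bias, double[] in) {
        double[] out = new double[weights.length];
        for (int r = 0; r < weights.length; r++) {
            double sum = bias[r];
            for (int c = 0; c < in.length; c++) sum += weights[r][c] * in[c];
            out[r] = 1.0 / (1.0 + Math.exp(-sum));
        }
        return out;
    }

    double[] encode(double[] x) { return layer(encoderWeights, hiddenBias, x); }

    double[] decode(double[] h) { return layer(decoderWeights, outputBias, h); }

    /** Average over samples of 1/2 * squared reconstruction error. */
    double meanError(double[][] data) {
        double total = 0.0;
        for (double[] x : data) {
            double[] y = decode(encode(x));
            for (int k = 0; k < x.length; k++) total += 0.5 * (x[k] - y[k]) * (x[k] - y[k]);
        }
        return total / data.length;
    }

    /** Online backpropagation: weights are updated after every sample. */
    void train(double[][] data, int epochs, double rate) {
        for (int e = 0; e < epochs; e++) {
            for (double[] x : data) {
                double[] h = encode(x);
                double[] y = decode(h);
                double[] outDelta = new double[y.length];
                double[] hidDelta = new double[h.length];
                for (int k = 0; k < y.length; k++) {
                    outDelta[k] = (y[k] - x[k]) * y[k] * (1 - y[k]); // the target is x itself
                    for (int j = 0; j < h.length; j++) hidDelta[j] += outDelta[k] * decoderWeights[k][j];
                }
                for (int j = 0; j < h.length; j++) hidDelta[j] *= h[j] * (1 - h[j]);
                for (int k = 0; k < y.length; k++) {
                    outputBias[k] -= rate * outDelta[k];
                    for (int j = 0; j < h.length; j++) decoderWeights[k][j] -= rate * outDelta[k] * h[j];
                }
                for (int j = 0; j < h.length; j++) {
                    hiddenBias[j] -= rate * hidDelta[j];
                    for (int i = 0; i < x.length; i++) encoderWeights[j][i] -= rate * hidDelta[j] * x[i];
                }
            }
        }
    }

    public static void main(String[] args) {
        TinyAutoencoder net = new TinyAutoencoder(6, 2, 42L);
        System.out.println("error before training: " + net.meanError(SAMPLES));
        net.train(SAMPLES, 2000, 0.5);
        System.out.println("error after training:  " + net.meanError(SAMPLES));
        for (double[] x : SAMPLES) {
            System.out.println(Arrays.toString(x) + " -> hidden " + Arrays.toString(net.encode(x)));
        }
    }
}
```

Printing the hidden activations per sample, as main does, is the quickest way to inspect how the two hidden units respond to each group.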
Essentially, our two hidden units have learned a compact representation of the flu symptom data set. To see how this relates to learning, we return to the problem of overfitting. By training our net to learn a compact representation of the data, we’re favoring a simpler representation rather than a highly complex hypothesis that overfits the training data.
In a way, by favoring these simpler representations, we’re attempting to learn the data in a truer sense.
The next logical step is to look at restricted Boltzmann machines (RBMs), generative stochastic neural networks that can learn a probability distribution over their set of inputs.
RBMs are composed of a hidden, visible, and bias layer. Unlike the feedforward networks, the connections between the visible and hidden layers are undirected (the values can be propagated in both the visible-to-hidden and hidden-to-visible directions) and fully connected (each unit from a given layer is connected to each unit in the next). If we also allowed units within the same layer to connect to one another, we’d have a Boltzmann machine rather than a restricted Boltzmann machine.
The standard RBM has binary hidden and visible units: that is, the unit activation is 0 or 1 under a Bernoulli distribution, but there are variants with other non-linearities.
While researchers have known about RBMs for some time now, the recent introduction of the contrastive divergence unsupervised training algorithm has renewed interest.
The single-step contrastive divergence algorithm (CD-1) works like this:
Weight update:
Where a is the learning rate and v, v’, h, h’, and w are vectors.
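Here is a sketch of a single CD-1 step in Java, assuming the standard update w ← w + a(v·hᵀ − v’·h’ᵀ), with the weights stored as a matrix indexed [visible][hidden]. It follows one common variant (sampled binary states for h and v’, probabilities for h’) and leaves the biases fixed; the names are hypothetical and this is not the library's implementation.

```java
import java.util.Arrays;
import java.util.Random;

/** Sketch of single-step contrastive divergence (CD-1) for a binary RBM. */
final class ContrastiveDivergence {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Probability that each hidden unit is on, given visible states v. */
    static double[] hiddenProbabilities(double[][] w, double[] hiddenBias, double[] v) {
        double[] p = new double[hiddenBias.length];
        for (int j = 0; j < p.length; j++) {
            double sum = hiddenBias[j];
            for (int i = 0; i < v.length; i++) sum += v[i] * w[i][j];
            p[j] = sigmoid(sum);
        }
        return p;
    }

    /** Probability that each visible unit is on, given hidden states h. */
    static double[] visibleProbabilities(double[][] w, double[] visibleBias, double[] h) {
        double[] p = new double[visibleBias.length];
        for (int i = 0; i < p.length; i++) {
            double sum = visibleBias[i];
            for (int j = 0; j < h.length; j++) sum += w[i][j] * h[j];
            p[i] = sigmoid(sum);
        }
        return p;
    }

    /** Draws a 0/1 state for each unit from its Bernoulli probability. */
    static double[] sample(double[] probabilities, Random random) {
        double[] states = new double[probabilities.length];
        for (int i = 0; i < states.length; i++) {
            states[i] = random.nextDouble() < probabilities[i] ? 1.0 : 0.0;
        }
        return states;
    }

    /** w[i][j] += a * (v[i] * h[j] - vRecon[i] * hRecon[j]) */
    static void updateWeights(double[][] w, double[] v, double[] h,
                              double[] vRecon, double[] hRecon, double a) {
        for (int i = 0; i < v.length; i++) {
            for (int j = 0; j < h.length; j++) {
                w[i][j] += a * (v[i] * h[j] - vRecon[i] * hRecon[j]);
            }
        }
    }

    static void cd1(double[][] w, double[] visibleBias, double[] hiddenBias,
                    double[] v, double a, Random random) {
        double[] h = sample(hiddenProbabilities(w, hiddenBias, v), random);        // positive phase: h given v
        double[] vRecon = sample(visibleProbabilities(w, visibleBias, h), random); // negative phase: v' given h
        double[] hRecon = hiddenProbabilities(w, hiddenBias, vRecon);              // h' given v'
        updateWeights(w, v, h, vRecon, hRecon, a);
    }

    public static void main(String[] args) {
        double[][] w = new double[6][2]; // six visible units, two hidden units
        Random random = new Random(7L);
        double[] v = {1, 1, 0, 0, 0, 0}; // hypothetical training sample
        for (int step = 0; step < 100; step++) {
            cd1(w, new double[6], new double[2], v, 0.1, random);
        }
        System.out.println(Arrays.deepToString(w));
    }
}
```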
The intuition behind the algorithm is that the positive phase (h given v) reflects the network’s internal representation of the real world data. Meanwhile, the negative phase represents an attempt to recreate the data based on this internal representation (v’ given h). The main goal is for the generated data to be as close as possible to the real world and this is reflected in the weight update formula.
In other words, the net has some perception of how the input data can be represented, so it tries to reproduce the data based on this perception. If its reproduction isn’t close enough to reality, it makes an adjustment and tries again.
To demonstrate contrastive divergence, we’ll use the same symptoms data set as before. The test network is an RBM with six visible and two hidden units. We’ll train the network using contrastive divergence with the symptoms v clamped to the visible layer. During testing, the symptoms are again presented to the visible layer; then, the data is propagated to the hidden layer. The hidden units represent the sick/healthy state, a very similar architecture to the autoencoder (propagating data from the visible to the hidden layer).
After several hundred iterations, we can observe the same result as with autoencoders: one of the hidden units has a higher activation value when any of the “sick” samples is presented, while the other is always more active for the “healthy” samples.
You can see this example in action in the testContrastiveDivergence method.
We’ve now demonstrated that the hidden layers of autoencoders and RBMs act as effective feature detectors; but it’s rare that we can use these features directly. In fact, the data set above is more an exception than a rule. Instead, we need to find some way to use these detected features indirectly.
Luckily, it was discovered that these structures can be stacked to form deep networks. These networks can be trained greedily, one layer at a time, to help overcome the vanishing gradient and overfitting problems associated with classic backpropagation.
The resulting structures are often quite powerful, producing impressive results. Take, for example, Google’s famous “cat” paper in which they use a special kind of deep autoencoders to “learn” human and cat face detection based on unlabeled data.
Let’s take a closer look.
As the name suggests, this network consists of multiple stacked autoencoders.
The hidden layer of autoencoder t acts as an input layer to autoencoder t + 1. The input layer of the first autoencoder is the input layer for the whole network. The greedy layer-wise training procedure works like this:
Stacked autoencoders, then, are all about providing an effective pre-training method for initializing the weights of a network, leaving you with a complex, multi-layer perceptron that’s ready to train (or fine-tune).
As with autoencoders, we can also stack restricted Boltzmann machines to create a class known as deep belief networks (DBNs).
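The data flow of greedy layer-wise pre-training can be sketched independently of how each autoencoder is trained. In the sketch below, AutoencoderTrainer is a hypothetical hook that trains one autoencoder and hands back its encoder (input to hidden layer); the stand-in trainers in main do no learning at all and exist only to show how each layer's hidden output becomes the next layer's training data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Sketch of the data flow in greedy layer-wise pre-training of stacked autoencoders. */
final class GreedyPretraining {

    /** Hypothetical hook: trains one autoencoder on the data and returns its encoder. */
    interface AutoencoderTrainer {
        Function<double[], double[]> trainEncoder(List<double[]> data);
    }

    /** Trains the layers one at a time; layer t+1 sees the hidden output of layer t. */
    static List<Function<double[], double[]>> pretrain(List<AutoencoderTrainer> trainers,
                                                       List<double[]> inputs) {
        List<Function<double[], double[]>> encoders = new ArrayList<>();
        List<double[]> current = inputs;
        for (AutoencoderTrainer trainer : trainers) {
            Function<double[], double[]> encoder = trainer.trainEncoder(current);
            encoders.add(encoder);
            // Push the data through the freshly trained layer to get the next layer's inputs.
            List<double[]> next = new ArrayList<>();
            for (double[] sample : current) {
                next.add(encoder.apply(sample));
            }
            current = next;
        }
        return encoders;
    }

    public static void main(String[] args) {
        // Stand-in trainer that only shrinks the vector; a real one would run backpropagation.
        AutoencoderTrainer halve = data -> x -> Arrays.copyOf(x, Math.max(1, x.length / 2));
        List<double[]> inputs = new ArrayList<>();
        inputs.add(new double[] {1, 2, 3, 4});
        List<Function<double[], double[]>> encoders = pretrain(Arrays.asList(halve, halve), inputs);
        System.out.println("pre-trained layers: " + encoders.size());
    }
}
```

After this loop, the encoders would be stacked into one network and fine-tuned as a whole.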
In this case, the hidden layer of RBM t acts as a visible layer for RBM t+1. The input layer of the first RBM is the input layer for the whole network, and the greedy layer-wise pre-training works like this:
This procedure is akin to that of stacked autoencoders, but with the autoencoders replaced by RBMs and backpropagation replaced with the contrastive divergence algorithm.
(Note: for more on constructing and training stacked autoencoders or deep belief networks, check out the sample code here.)
As a final deep learning architecture, let’s take a look at convolutional networks, a particularly interesting and special class of feedforward networks that are very well-suited to image recognition.
Before we look at the actual structure of convolutional networks, we first define an image filter, or a square region with associated weights. A filter is applied across an entire input image, and you will often apply multiple filters. For example, you could apply four 6x6 filters to a given input image. Then, the output pixel with coordinates 1,1 is the weighted sum of a 6x6 square of input pixels with top left corner 1,1 and the weights of the filter (which is also a 6x6 square). Output pixel 2,1 is the result of the input square with top left corner 2,1 and so on.
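Applying one filter is just a pair of nested weighted sums. The sketch below assumes a stride of one and no padding, so the output is smaller than the input; it uses zero-based [row][column] indices rather than the 1-based coordinates in the text, and a 2x2 filter to keep the example readable.

```java
import java.util.Arrays;

/** Sketch: applying a square image filter (stride 1, no padding). */
final class ImageFilter {

    static double[][] apply(double[][] image, double[][] filter) {
        int size = filter.length; // the filter is assumed to be square
        int outRows = image.length - size + 1;
        int outCols = image[0].length - size + 1;
        double[][] out = new double[outRows][outCols];
        for (int r = 0; r < outRows; r++) {
            for (int c = 0; c < outCols; c++) {
                // Weighted sum of the input square whose top-left corner is (r, c).
                double sum = 0.0;
                for (int i = 0; i < size; i++) {
                    for (int j = 0; j < size; j++) {
                        sum += filter[i][j] * image[r + i][c + j];
                    }
                }
                out[r][c] = sum;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] image = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
        double[][] filter = {{1, 0}, {0, 1}}; // adds each pixel to its lower-right neighbor
        System.out.println(Arrays.deepToString(apply(image, filter)));
    }
}
```

Note that the same filter weights are reused at every position, which is what keeps the number of parameters small.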
With that covered, these networks are defined by the following properties:
You can see several examples of convolutional networks trained (with backpropagation) on the MNIST data set (grayscale images of handwritten digits) here, specifically in the testLeNet* methods (I would recommend testLeNetTiny2 as it achieves a low error rate of about 2% in a relatively short period of time). There’s also a nice JavaScript visualization of a similar network here.
Now that we’ve covered the most common neural network variants, I thought I’d write a bit about the challenges posed during implementation of these deep learning structures.
Broadly speaking, my goal in creating a Deep Learning library was (and still is) to build a neural network-based framework that satisfied the following criteria:
To satisfy these requirements, I took a tiered (or modular) approach to the design of the software.
Let’s start with the basics:
This structure is agile enough to be used for classic feedforward networks, as well as for RBMs and more complex architectures like those used for ImageNet classification.
It also allows a layer to be part of more than one network. For example, the layers in a Deep Belief Network are also layers in their corresponding RBMs.
In addition, this architecture allows a DBN to be viewed as a list of stacked RBMs during the pre-training phase and a feedforward network during the fine-tuning phase, which is both intuitively nice and programmatically convenient.
The next module takes care of propagating data through the network, a two-step process:
As I mentioned earlier, one of the reasons that neural networks have made a resurgence in recent years is that their training methods are highly conducive to parallelism, allowing you to speed up training significantly with the use of a GPGPU. In this case, I chose to work with the Aparapi library to add GPU support.
Aparapi imposes some important restrictions on the connection calculators:
As such, most of the data (weights, input, and output arrays) is stored in Matrix instances, which use one-dimensional float arrays internally. All Aparapi connection calculators use either AparapiWeightedSum (for fully connected layers and weighted sum input functions), AparapiSubsampling2D (for subsampling layers), or AparapiConv2D (for convolutional layers). Some of these limitations can be overcome with the introduction of Heterogeneous System Architecture. Aparapi also allows the same code to run on both CPU and GPU.
The training module implements various training algorithms. It relies on the previous two modules. For example, BackPropagationTrainer (all the trainers use the Trainer base class) uses a feedforward layer calculator for the feedforward phase and a special breadth-first layer calculator for propagating the error and updating the weights.
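To illustrate what storing a matrix in a one-dimensional float array means in practice, here is a sketch of a row-major flat matrix and a weighted sum computed directly on the flat array. FlatMatrix is a hypothetical stand-in written for this example; it is not the library's Matrix class, and the row-major layout is an assumption.

```java
import java.util.Arrays;

/** Sketch: a matrix backed by a one-dimensional float array in row-major order. */
final class FlatMatrix {
    final float[] elements;
    final int rows;
    final int columns;

    FlatMatrix(int rows, int columns) {
        this.rows = rows;
        this.columns = columns;
        this.elements = new float[rows * columns];
    }

    float get(int row, int column) {
        return elements[row * columns + column];
    }

    void set(int row, int column, float value) {
        elements[row * columns + column] = value;
    }

    /** Weighted sum for a fully connected layer: output[r] = sum over c of W[r][c] * input[c]. */
    static float[] weightedSum(FlatMatrix weights, float[] input) {
        float[] output = new float[weights.rows];
        for (int r = 0; r < weights.rows; r++) {
            float sum = 0f;
            for (int c = 0; c < weights.columns; c++) {
                sum += weights.elements[r * weights.columns + c] * input[c];
            }
            output[r] = sum;
        }
        return output;
    }

    public static void main(String[] args) {
        FlatMatrix weights = new FlatMatrix(2, 3);
        float value = 1f;
        for (int r = 0; r < 2; r++) {
            for (int c = 0; c < 3; c++) {
                weights.set(r, c, value++);
            }
        }
        System.out.println(Arrays.toString(weightedSum(weights, new float[] {1f, 1f, 1f})));
    }
}
```

Because the loop body touches only primitive arrays and index arithmetic, it is the kind of code that maps naturally onto a GPU kernel.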
My latest work is on Java 8 support and some other improvements, which are available in this branch and will soon be merged into master.
The aim of this Java deep learning tutorial was to give you a brief introduction to the field of deep learning algorithms, beginning with the most basic unit of composition (the perceptron) and progressing through various effective and popular architectures, like that of the restricted Boltzmann machine.
The ideas behind neural networks have been around for a long time; but today, you can’t set foot in the machine learning community without hearing about deep networks or some other take on deep learning. Hype shouldn’t be mistaken for justification, but with the advances of GPGPU computing and the impressive progress made by researchers like Geoffrey Hinton, Yoshua Bengio, Yann LeCun and Andrew Ng, the field certainly shows a lot of promise. There’s no better time than the present to get familiar and get involved.
