The Data Science team in Boston is working on leveraging cutting edge tools and algorithms to optimize business actions based on insights hidden in user data. Data Science heavily exploits machine algorithms that can help us identify and exploit patterns in the data. Obtaining insights from Internet scale data is a challenging undertaking; hence being able to run the algorithms at scale is a crucial requirement. With the explosion of data and accompanying thousand-machine clusters, we need to adapt the algorithms to be able to operate on such distributed environments. Running machine learning algorithms in general purpose distributed computing environment possesses its own set of challenges.
Here we discuss how we have implemented and deployed Deep Learning, a cutting edge machine-learning framework, in one of the Hadoop clusters. We provide the details on how the algorithm was adapted to run in a distributed setting. We also present results of running the algorithm on a standard dataset.
Deep Belief Networks
Deep Belief Networks (DBN) are graphical models that are obtained by stacking and training Restricted Boltzmann Machines (RBM) in a greedy, unsupervised manner [1]. DBNs are trained to extract deep hierarchical representation of the training data by modeling the joint distribution between observed vectors x and the l hidden layers hk as follows, where distribution for each hidden layer is conditioned on a layer immediately preceding it [4]:
Equation 1: DBN Distribution
The relationship between the input layer and the hidden layers can be observed in the figure below. At a high level, the first layer is trained as an RBM that models the raw input x. An input is a sparse binary vector representing the data to be classified, for e.g. a binary image of a digit. Subsequent layers are trained using the transformed data (sample or mean activations) as training examples from the previous layers. Number of layers can be determined empirically to obtain the best model performance, and DBNs support arbitrary many layers.
Figure 1: DBN layers
The following code snippet shows the training that goes into an RBM. For the input data supplied to the RBM, there are a multiple number of predefined epochs. The input data is divided into small batches and the weights, activations, and deltas are computed for each layer:
After all layers are trained, the parameters of the Deep Network are fine-tuned using a supervised training criterion. The supervised training criterion, for instance, can be framed as a classification problem, which then allows using the deep network to solve a classification problem. More complex supervised criterion can be employed which can provide interesting results such as scene interpretation, for instance explaining what objects are present in a picture.
Infrastructure
Deep learning has received large attention not only because of the fact that it can deliver results superior to some of the other learning algorithms, but also because it can be run on a distributed setting, allowing processing of large scale datasets. Deep networks can be parallelized in two major levels – at the layer level, and at the data level [6]. For layer level parallelization, many implementations use GPU arrays to compute layer activations in parallel and frequently synchronize them. However, this approach is not suitable for clusters where data can reside across multiple machines connected by a network, because of high network costs. Data level parallelization, in which the training is parallelized to subsets of data, is more suitable for these settings. Most of the data at Paypal is stored in Hadoop clusters; hence being able to run the algorithms in those clusters is our priority. Dedicated cluster maintenance and support is also an important factor for us to consider. However, since deep learning is inherently iterative in nature, a paradigm such as MapReduce is not well suited for running these algorithms. But with the advent of Hadoop 2.0 and Yarn based resource management, we can write iterative applications as we can finely control the resources the application is using. We adapted IterativeReduce [7], a simple abstraction for writing iterative algorithms in Hadoop YARN, and were able to deploy it in one of the PayPal clusters running Hadoop 2.4.1.
Methodology
We implemented the core deep learning algorithm by Hinton, reference in [2]. Since distributing the algorithm for running on multi-machine cluster is our requirement, we adapted their algorithm for such a setting. For distributing the algorithm across multiple machines, we followed the guidelines proposed by Grazia, et al. [6]. A high level summary of our implementation is given below:
The figure below describes a single dataset epoch (steps 3-5) while running the deep learning algorithm. We note that this paradigm can be leveraged to implement a host of machine learning algorithms that are iterative in nature.
Figure 2: Single dataset epoch for training
The following code snippet shows the steps involved in training a DBN in a single machine. The dataset is first divided into multiple batches. Then multiple RBM layers are initialized and trained sequentially. After the RBMs are trained, they are passed through a fine-tune phase which uses error back propagation.
We adapted the IterativeReduce [7] implementation for much of the YARN “plumbing”. We did a major overhaul of the implementation to make it useable for our deep learning implementation. The IterativeReduce implementation was written for Cloudera Hadoop distribution, which we re-platformed to adapt it to standard Apache Hadoop distribution. We also rewrote the implementation to use the standard programming models described in [8]. In particular, we used YarnClient API for communication between the client application and the ResourceManager. We also used the AMRMClient and AMNMClient for interaction between ApplicationMaster and ResourceManager and NodeManager.
We first submit an application to the YARN resource manager using the YarnClient API:
After the application is submitted, YARN resource manager launches the application master. The application master is responsible for allocating and releasing the worker containers as necessary. The application master uses AMRMClient to communicate with the resource manager.
The application master uses the NMClient API to run commands in the containers it received from the application master.
Once the application master launches the worker containers it requires, it sets up a port to communicate with the workers. For our deep learning implementation, we added the methods required for parameter initialization, layer-by-layer training, and fine-tune signaling to the original IterativeReduce interface. IterativeReduce uses Apache Avro IPC for Master-Worker communication.
The following code snippets shows the series of steps involved in Master-worker nodes for distributed training. The master sends the initial parameters to the workers, and then the worker trains its RBM on its portion of the data. After worker is done training, it sends back the results to master, which then combines the results. After the iterations are completed, master completes the process by starting the back propagation fine-tune phase.
Results
We evaluated the performance of the deep learning implementation using the MNIST handwritten digit recognition [3]. The dataset contains manually labeled hand written digits ranging from 0-9. The training set consists of 60,000 images and the test set consists of 10,000 images.
In order to measure the performance, the DBN was first pre-trained and then fine-tuned on the 60,000 training images. After the above steps, the DBN was then evaluated on the 10,000 test images. No pre-processing was done on the images during training or evaluation. The error rate was obtained as a ratio between total number of misclassified images and the total number of images on the test set.
We were able to achieve the best classification error rate of 1.66% when using RBM with 500-500-2000 hidden units in each RBM, and using a 10-node distributed cluster setting. The error rate is comparable with the error rate of 1.2% reported by authors of the original algorithm (with 500-500-2000 hidden units) [2], and with some of the results with similar settings reported in [3]. We note that original implementation was on a single machine, and our implementation is on a distributed setting. The parameter-averaging step contributes to slight reduction in performance, although the benefit of distributing the algorithm over multiple machines far outweighs the reduction. The table below summarizes the error rate variation per the number of hidden units in each layer while running on a 10-node cluster.
Table 1: MNIST performance evaluation
Further thoughts
We had a successful deployment of a distributed deep learning system, which we believe will prove useful in solving some of the machine learning problems. Furthermore, the iterative reduce abstraction can be leveraged to distribute any other suitable machine learning algorithm. Being able to utilize the general-purpose Hadoop cluster will prove highly beneficial for running scalable machine-learning algorithms on large datasets. We note that there are several improvements we would like to make to the current framework, chiefly around reducing the network latency and having more advanced resource management. Additionally we’d like to optimize the DBN framework so that we can minimize inter-node communication. With fine-grained control of cluster resources, Hadoop Yarn framework provides us the flexibility to do so.
References