3. The DeepDog Architecture
Our final architecture was spurred in large part by the publication on April 17 of Google’s MobileNets paper, promising a new neural architecture with Inception-like accuracy on simple problems like ours, with only 4M or so parameters. This meant it sat in an interesting sweet spot between a SqueezeNet that had maybe been overly simplistic for our purposes, and the possibly overwrought elephant-trying-to-squeeze-in-a-tutu of using Inception or VGG on Mobile. The paper introduced some capacity to tune the size & complexity of network specifically to trade memory/CPU consumption against accuracy, which was very much top of mind for us at the time.
With less than a month to go before the app had to launch we endeavored to reproduce the paper’s results. This was entirely anticlimactic as within a day of the paper being published a Keras implementation was already offered publicly on GitHub by Refik Can Malli, a student at Istanbul Technical University, whose work we had already benefitted from when we took inspiration from his excellent Keras SqueezeNet implementation. The depth & openness of the deep learning community, and the presence of talented minds like R.C. is what makes deep learning viable for applications today — but they also make working in this field more thrilling than any tech trend we’ve been involved with.
Our final architecture ended up making significant departures from the MobileNets architecture or from convention, in particular:
- We do not use Batch Normalization & Activation between depthwise and pointwise convolutions, because the XCeption paper (which discussed depthwise convolutions in detail) seemed to indicate it would actually lead to less accuracy in architecture of this type (as helpfully pointed out by the author of the QuickNet paper on Reddit). This also has the benefit of reducing the network size.
- We use ELU instead of ReLU. Just like with our SqueezeNet experiments, it provided superior convergence speed & final accuracy when compared to ReLU
- We did not use PELU. While promising, this activation function seemed to fall into a binary state whenever we tried to use it. Instead of gradually improving, our network’s accuracy would alternate between ~0% and ~100% from one batch to the next. It’s unclear why this happened, and might just come down to an implementation error or user error. Fusing the width/height axes of our images had no effect.
- We did not use SELU. A short investigation between the iOS & Android release led to results very similar to PELU. It’s our suspicion that SELU should not be used in isolation as a sort of activation function silver bullet, but rather — as the paper’s title implies — as part of a narrowly-defined SNN architecture.
- We maintain the use of Batch Normalization with ELU. There are many indications that this should be unnecessary, however, every experiment we ran without Batch Normalization completely failed to converge. This could be due to the small size of our architecture.
- We used Batch Normalization before the activation. While this is a subject of some debate these days, our experiments placing BN after activation on small networks failed to converge as well.
- To optimize the network we used Cyclical Learning Rates and (fellow student) Brad Kenstler’s excellent Keras implementation. CLRs take the guessing game out of finding the optimal learning rate for your training. Even more importantly by adjusting the learning rate both up & down throughout your training, they help achieve a final accuracy that’s in our experience better than a traditional optimizer. For both of these reasons, we can’t conceive using anything else thant CLRs to train a neural network in the future.
- For what it’s worth, we saw no need to adjust the α or ρ values from the MobileNets architecture. Our model was small enough for our purposes at α = 1, and computation was fast enough at ρ = 1, and we preferred to focus on achieving maximum accuracy. However, this could be helpful when attempting to run on older mobile devices, or embedded platforms.
So how does this stack work exactly? Deep Learning often gets a bad rap for being a “black box”, and while it’s true many components of it can be mysterious, the networks we use often leak information about how some of their magic work. We can look at the layers of this stack and how they activate on specific input images, giving us a sense of each layer’s ability to recognize sausage, buns, or other particularly salient hotdog features.
Data quality was of the utmost importance. A neural network can only be as good as the data that trained it, and improving training set quality was probably one of the top 3 things we spent time on during this project. The key things we did to improve this were:
- Sourcing more images, and more varied images (height/width, background, lighting conditions, cultural differences, perspective, composition, etc.)
- Matching image types to expected production inputs. Our guess was people would mostly try to photograph actual hotdogs, other foods, or would sometimes try to trick the system with random objects, so our dataset reflected that.
- Give lots of examples of things that are similar that may trip your network. Some of the things that look most similar to hotdogs are other foods (such as hamburgers, sandwiches, or in the case of naked hotdogs, baby carrots or even cooked cherry tomatoes). Our dataset reflected that.
- Expect distortions: in mobile situations, most photos will be worse than the “average” picture taken with a DLSR or in perfect lighting conditions. Mobile photos are dim, noisy, taken at an angle. Aggressive data augmentation was key to counter this.
- Additionally we figured that users may lack access to real hotdogs, so may try photographing hotdogs from Google search results, which led to its own types of distortion (skewing if photo is taken at angle, flash reflection on the screen visible moiré effect caused by taking a picture of an LCD screen with a mobile camera). These specific distortion had an almost uncanny ability to trick our network, not unlike recently-published papers on Convolutional Network’s (lack of) resistance to noise. Using Keras’ channel shift feature resolved most of these issues.
Example distortion introduced by moiré and a flash. Original photo: Wikimedia Commons.
- Some edge cases were hard to catch. In particular, images of hotdogs taken with a soft focus or with lots of bokeh in the background would sometimes trick our neural network. This was hard to defend against as a) there just aren’t that many photographs of hotdogs in soft focus (we get hungry just thinking about it) and b) it could be damaging to spend too much of our network’s capacity training for soft focus, when realistically most images taken with a mobile phone will not have that feature. We chose to leave this largely unaddressed as a result.
The final composition of our dataset was 150k images, of which only 3k were hotdogs: there are only so many hotdogs you can look at, but there are many not hotdogs to look at. The 49:1 imbalance was dealt with by saying a Keras class weight of 49:1 in favor of hotdogs. Of the remaining 147k images, most were of food, with just 3k photos of non-food items, to help the network generalize a bit more and not get tricked into seeing a hotdog if presented with an image of a human in a red outfit.
Our data augmentation rules were as follows:
- We applied rotations within ±135 degrees — significantly more than average, because we coded the application to disregard phone orientation.
- Height and width shifts of 20%
- Shear range of 30%
- Zoom range of 10%
- Channel shifts of 20%
- Random horizontal flips to help the network generalize
These numbers were derived intuitively, based on experiments and our understanding of the real-life usage of our app, as opposed to careful experimentation.
The final key to our data pipeline was using Patrick Rodriguez’s multiprocess image data generator for Keras. While Keras does have a built-in multi-threaded and multiprocess implementation, we found Patrick’s library to be consistently faster in our experiments, for reasons we did not have time to investigate. This library cut our training time to a third of what it used to be.
The network was trained using a 2015 MacBook Pro and attached external GPU (eGPU), specifically an Nvidia GTX 980 Ti (we’d probably buy a 1080 Ti if we were starting today). We were able to train the network on batches of 128 images at a time. The network was trained for a total of 240 epochs, meaning we ran all 150k images through the network 240 times. This took about 80 hours.
We trained the network in 3 phases:
- Phase 1 ran for 112 epochs (7 full CLR cycles with a step size of 8 epochs), with a learning rate between 0.005 and 0.03, on a triangular 2 policy (meaning the max learning rate was halved every 16 epochs).
- Phase 2 ran for 64 more epochs (4 CLR cycles with a step size of 8 epochs), with a learning rate between 0.0004 and 0.0045, on a triangular 2 policy.
- Phase 3 ran for 64 more epochs (4 CLR cycles with a step size of 8 epochs), with a learning rate between 0.000015 and 0.0002, on a triangular 2 policy.
UPDATED: a previous version of this chart contained inaccurate learning rates.
While learning rates were identified by running the linear experiment recommended by the CLR paper, they seem to intuitively make sense, in that the max for each phase is within a factor of 2 of the previous minimum, which is aligned with the industry standard recommendation of halving your learning rate if your accuracy plateaus during training.
In the interest of time we performed some training runs on a PaperspaceP5000 instance running Ubuntu. In those cases, we were able to double the batch size, and found that optimal learning rates for each phase were roughly double as well.
Running Neural Networks on Mobile Phones
Even having designed a relatively compact neural architecture, and having trained it to handle situations it may find in a mobile context, we had a lot of work left to make it run properly. Trying to run a top-of-the-line neural net architecture out of the box can quickly burns hundreds megabytes of RAM, which few mobile devices can spare today. Beyond network optimizations, it turns out the way you handle images or even load TensorFlow itself can have a huge impact on how quickly your network runs, how little RAM it uses, and how crash-free the experience will be for your users.
This was maybe the most mysterious part of this project. Relatively little information can be found about it, possibly due to the dearth of production deep learning applications running on mobile devices as of today. However, we must commend the Tensorflow team, and particularly Pete Warden, Andrew Harp and Chad Whipkey for the existing documentation and their kindness in answering our inquiries.
- Rounding the weights of our network helped compressed the network to ~25% of its size. Essentially instead of using the arbitrary stock values derived from your training, this optimization picks the N most common values and sets all parameters in your network to these values, which drastically reduces the size of your network when zipped. This however has no impact on the uncompressed app size, or memory usage. We did not ship this improvement to production as the network was small enough for our purposes, and we did not have time to quantify how much of a hit the rounding would have on the accuracy of the app.
- Optimize the TensorFlow lib by compiling it for production with -Os
- Removing unnecessary ops from the TensorFlow lib: TensorFlow is in some respect a virtual machine, able to interpret a number or arbitrary TensorFlow operations: addition, multiplications, concatenations, etc. You can get significant weight (and memory) savings by removing unnecessary ops from the TensorFlow library you compile for ios.
- Other improvements might be possible. For example unrelated work by the author yielded 1MB improvement in Android binary size with a relatively simple trick, so there may be more areas of TensorFlow’s iOS code that can be optimized for your purposes.
Instead of using TensorFlow on iOS, we looked at using Apple’s built-in deep learning libraries instead (BNNS, MPSCNN and later on, CoreML). We would have designed the network in Keras, trained it with TensorFlow, exported all the weight values, re-implemented the network with BNNS or MPSCNN (or imported it via CoreML), and loaded the parameters into that new implementation. However, the biggest obstacle was that these new Apple libraries are only available on iOS 10+, and we wanted to support older versions of iOS. As iOS 10+ adoption and these frameworks continue to improve, there may not be a case for using TensorFlow on device in the near future.
Changing App Behavior by Injecting Neural Networks on The fly
If you think injecting JavaScript into your app on the fly is cool, try injecting neural nets into your app! The last production trick we used was to leverage CodePush and Apple’s relatively permissive terms of service, to live-inject new versions of our neural networks after submission to the app store. While this was mostly done to help us quickly deliver accuracy improvements to our users after release, you could conceivably use this approach to drastically expand or alter the feature set of your app without going through an app store review again.
There are a lot of things that didn’t work or we didn’t have time to do, and these are the ideas we’d investigate in the future:
- More carefully tune our data-augmentation parameters.
- Measure accuracy end-to-end, i.e. the final determination made by the app abstracting things like whether our app has 2 or many more categories, what the final threshold for hotdog recognition is (we ended up having the app say “hotdog” if recognition is above 0.90 as opposed to the default of 0.5), after weights are rounded, etc.
- Building a feedback mechanism into the app — to let users vent frustration if results are erroneous, or actively improve the neural network.
- Use a larger resolution for image recognition than 224 x 224 pixels — essentially using a MobileNets ρ value > 1.0