Technical Perspective: XNOR-Networks – Powerful but Tricky

You can now run computations on your phone that would have been unthinkable a few years ago. But as small devices get smarter, we discover new uses for them that overwhelm their resources. If you want your phone to recognize a picture of your face (image classification) or to find faces in pictures (object detection), you want it to run a convolutional neural net (CNN).

Modern computer vision applications are mostly built using CNNs. This is because vision applications tend to have a classifier at their heart—so, for example, one builds an object detector by building one classifier that tells whether locations in an image could contain an object, then another that determines what the object is. CNNs consist of a sequence of layers. Each takes a block of data which has two spatial dimensions (like an image) and one feature dimension (for an image, red, green, and blue), and then makes another such block, which it passes to the next layer. Most layers apply a convolution and a nonlinearity to their input, typically increasing the feature dimension. Some layers reduce the spatial dimension by pooling spatial windows in the input block. So, one might pool by replacing 2×2 non-overlapping windows with the largest value in that window. The final layer is usually classification by logistic regression.
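The pooling step described above can be made concrete with a minimal sketch. It assumes a single feature channel stored as a list of rows; the function name `max_pool_2x2` and the sample values are made up for illustration, and a real CNN library would apply the same operation to every feature channel at once.

```python
def max_pool_2x2(channel):
    """Replace each non-overlapping 2x2 window with its largest value.

    `channel` is a list of equal-length rows (one feature channel).
    A trailing odd row or column is dropped in this sketch.
    """
    rows, cols = len(channel), len(channel[0])
    return [
        [
            max(channel[r][c], channel[r][c + 1],
                channel[r + 1][c], channel[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


# Illustrative 4x4 channel (values are invented).
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(feature_map)  # spatial size halves: 4x4 becomes 2x2
```

Each spatial dimension is halved, which is why pooling layers are the ones that shrink the block as it moves through the network.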

CNNs yield excellent classifiers, because the training process chooses image features that are useful for the particular classification task in hand. For this to work, some layers must have quite large feature dimensions, and the network needs to have many layers (yielding a "deep" network). Deep networks produce image features that have very wide spatial support and are complicated composites. One should think of a single convolutional layer as a pattern detector; a deep network detects patterns of patterns of patterns.

All this means that CNNs tend to have a very large number of floating-point parameters, so running a CNN has traditionally required a GPU (or patience!). Building networks with few parameters tends to result in classifiers that aren't accurate. But a CNN's parameters are redundant. For example, once a CNN has been trained, some procedures for compressing its parameters don't significantly affect its accuracy. At the same time, CNNs can respond badly to apparently minor changes. For example, changing from single precision to double precision arithmetic can significantly affect accuracy.

How, then, to produce a CNN that is small enough to run on a mobile device, and accurate enough to be worth using? The strategies in the following paper are the best known to date. One builds a CNN where every parameter is represented by a single bit, yielding a very large reduction in size and a speedup (a binary weight network or BWN). In a BWN, layers apply binary weights to real-valued inputs to get real-valued outputs. Even greater improvements in size and speed can be obtained by insisting that layers accept and produce single-bit data blocks (an XNOR-network). Multiplying data by weights in an XNOR-network is particularly efficient, so very significant speedups are available.
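A rough sketch may help show why one-bit weights and one-bit data are cheap to multiply. It rests on several assumptions that are not stated in the text above: values are signs (+1 or -1), a sign is stored as one bit (+1 as 1, -1 as 0), and binary weights carry a single real scale factor taken as the mean absolute weight. All function names here are hypothetical, and the accompanying paper should be consulted for the actual formulation.

```python
def binarize(weights):
    """Approximate real weights by one scale factor times a sign per weight.

    Assumption for this sketch: the scale is the mean absolute weight.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return scale, signs


def bwn_dot(scale, signs, inputs):
    """Binary weights, real inputs: only additions and subtractions,
    followed by one multiplication by the scale."""
    total = 0.0
    for s, x in zip(signs, inputs):
        total = total + x if s > 0 else total - x
    return scale * total


def pack_signs(values):
    """Pack +1/-1 values into an int, one bit per value (+1 -> 1, -1 -> 0)."""
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits


def xnor_dot(a_bits, b_bits, n):
    """Dot product of two length-n sign vectors using XNOR and a bit count."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: 1 wherever the signs match
    matches = bin(agree).count("1")     # population count
    return 2 * matches - n              # matches add +1, mismatches add -1


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
fast = xnor_dot(pack_signs(a), pack_signs(b), len(a))
slow = sum(x * y for x, y in zip(a, b))  # ordinary dot product for comparison
```

On real hardware the XNOR and the bit count each cover a whole machine word of signs in one instruction, which is where the speedup comes from.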

Producing a useful XNOR-network requires a variety of tricks. Pooling binary values loses more information than pooling real values, so pooling layers must be adjusted. Batch normalization layers must be moved around. Training a conventional CNN and then quantizing the weights produces a relatively poor classifier. It is better to train the CNN so it "knows" the weights will be quantized, using a series of clever tricks described in the paper. It helps to adjust the labels used for training with a measure of image similarity; these refinements are adjusted dynamically throughout the training process.
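One way to picture training a network that "knows" its weights will be quantized is the following toy sketch. It assumes a single linear unit with squared-error loss, a forward pass that uses the quantized weights, and an update applied to the stored real-valued weights (often called a straight-through update). Whether this matches the paper's tricks in detail is an assumption; the function names and numbers are hypothetical, and the gradient treats the scale factor as a constant for simplicity.

```python
def binarize(weights):
    """Quantize to (scale, signs); the scale is the mean absolute weight (assumed)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1.0 if w >= 0 else -1.0 for w in weights]
    return scale, signs


def train_step(real_weights, inputs, target, lr=0.1):
    """One quantization-aware update for a single linear unit.

    Forward: predict with the quantized weights.
    Backward: take the gradient of 0.5 * error**2 with respect to the
    quantized weights and apply it to the real-valued weights, so small
    updates accumulate until a sign eventually flips.
    """
    scale, signs = binarize(real_weights)
    prediction = scale * sum(s * x for s, x in zip(signs, inputs))
    error = prediction - target
    new_weights = [w - lr * error * x for w, x in zip(real_weights, inputs)]
    return new_weights, 0.5 * error * error


# Toy example with invented numbers.
weights = [0.5, -0.5]
weights, loss = train_step(weights, inputs=[1.0, 1.0], target=1.0)
```

Only the quantized weights would be shipped to the device; the real-valued copy exists solely so that training has something smooth to adjust.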

These tricks result in a compression procedure that can be applied to any network architecture, with weights learned on any dataset. But compression produces a loss of accuracy. The ideal way to evaluate this procedure is to find others that produce networks of the same size and speed on the same dataset. Then the compressor that produces the smallest loss in accuracy wins. It's hard to match size and speed, but a compressor that produces the smallest loss of accuracy with acceptable size and speed is the standard to beat.
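The comparison rule in the last sentence can be written down as a small selection procedure. This is only a sketch: the `Candidate` fields, the budget parameters, and every number in the example are invented for illustration and are not measurements of any real method.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    size_mb: float
    latency_ms: float
    accuracy: float


def best_compressor(candidates, baseline_accuracy, max_size_mb, max_latency_ms):
    """Among candidates within the size and speed budget, return the one
    with the smallest loss of accuracy relative to the uncompressed network,
    or None if no candidate fits the budget."""
    feasible = [
        c for c in candidates
        if c.size_mb <= max_size_mb and c.latency_ms <= max_latency_ms
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda c: baseline_accuracy - c.accuracy)


# Hypothetical candidates; all values are made up.
candidates = [
    Candidate("method_a", size_mb=8.0, latency_ms=40.0, accuracy=0.62),
    Candidate("method_b", size_mb=4.0, latency_ms=25.0, accuracy=0.58),
    Candidate("method_c", size_mb=30.0, latency_ms=120.0, accuracy=0.69),
]
winner = best_compressor(candidates, baseline_accuracy=0.70,
                         max_size_mb=10.0, max_latency_ms=50.0)
```

Here the most accurate candidate is excluded because it misses the budget, which is the sense in which "acceptable size and speed" comes first and accuracy loss decides among what remains.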

The procedures described result in accuracies much higher than is achievable with comparable methods. This work has roots in a paper that appeared in the Proceedings of the 2016 European Conference on Computer Vision. Since then, xnor.ai, a company built around some of the technologies in this paper, has flourished.

The technologies described mean you can run accurate modern computer vision methods on apparently quite unpromising devices (for example, a Raspberry Pi Zero). There is an SDK and a set of tutorials for this technology at https://ai2go.xnor.ai/getting-started/python. Savings in space and computation turn into savings in energy, too. An extreme example, a device that can run accurate detectors and classifiers using only solar power, was just announced (https://www.xnor.ai/blog/ai-powered-by-solar).

Footnotes

To view the accompanying paper, visit doi.acm.org/10.1145/3429945