In recent years we have seen a growing number of edge devices adopted by consumers, in their homes (e.g., smart cameras and doorbells), in their cars (e.g., driver-assistance systems), and even on their persons (e.g., smart watches and rings). Similar growth is reported in industries including aerospace, agriculture, healthcare, transport, and manufacturing. At the same time that devices are getting smaller, the Deep Neural Networks (DNN) that power most forms of artificial intelligence are getting larger, requiring more compute power, memory, and bandwidth. This creates a growing disconnect between advances in artificial intelligence and the ability to develop smart devices at the edge. In this paper, we present a novel approach to running state-of-the-art AI algorithms at the edge. We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks (BWN) and XNOR-Networks. In BWN, the filters are approximated with binary values, resulting in 32x memory savings. In XNOR-Networks, both the filters and the inputs to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58x faster convolutional operations (in terms of the number of high precision operations) and 32x memory savings. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real time. Our binary networks are simple, accurate, efficient, and work on challenging visual tasks. We evaluate our approach on the ImageNet classification task. The classification accuracy with a BWN version of AlexNet is the same as the full-precision AlexNet. Our code is available at: http://allenai.org/plato/xnornet.

### 1. Introduction

In recent years, the approach of using Deep Neural Networks (DNN) to create artificial intelligence has been highly successful in teaching computers to recognize^{8,11,17,18} and detect^{4,5,16} objects, read text, and understand speech.^{7} Such capabilities could have significant impacts on industries such as healthcare, agriculture, aerospace, transport, and manufacturing, yet to date there are limited real-world applications of DNNs and Convolutional Neural Networks (CNN). While there has been some progress made with virtual reality (VR by Oculus),^{13} augmented reality (AR by HoloLens),^{6} and smart wearable devices, the majority of applications rely on edge devices that have limited or no bandwidth, are low powered, and require the data to be stored locally for privacy and security reasons. These constraints are at odds with the current state-of-the-art CNNs, which require large amounts of compute power and are therefore currently limited to the cloud.

CNN-based recognition systems need large amounts of memory and computational power. While they perform well on expensive, GPU-based machines, they are often unsuitable for smaller devices like cell phones and embedded electronics. For example, AlexNet,^{11} one of the most well-known DNN architectures for image classification, has 61M parameters (249MB of memory) and performs 1.5B high precision operations to classify one image. These numbers are even higher for deeper CNNs, for example, VGG^{17} (see Section 3.1). These models quickly overtax the limited storage, battery power, and compute capabilities of smaller devices like cell phones.

In this paper, we introduce simple, efficient, and accurate approximations to CNNs by binarizing the weights and even the intermediate representations in convolutional neural networks. Our binarization method aims at finding the best approximations of the convolutions using binary operations. We demonstrate that our way of binarizing neural networks results in ImageNet classification accuracy numbers that are comparable to standard full precision networks while requiring significantly less memory and fewer floating point operations.

We study two approximations: Binary-Weight-Networks (BWN) and XNOR-Networks. In **BWN**, all the weight values are approximated with binary values. A convolutional neural network with binary weights is significantly smaller (~32x) than an equivalent network with single-precision weight values. In addition, when weight values are binary, convolutions can be estimated by only addition and subtraction (without multiplication), resulting in a ~2x speedup. Binary-weight approximations of large CNNs can fit into the memory of even small, portable devices while maintaining the same level of accuracy (see Sections 3.1 and 3.2).

To take this idea further, we introduce **XNOR-Networks**, where both the weights and the inputs to the convolutional and fully connected layers are approximated with binary values. Binary weights and binary inputs allow an efficient way of implementing convolutional operations. If all of the operands of the convolutions are binary, then the convolutions can be estimated by XNOR and bit-counting operations.^{2} XNOR-Nets result in an accurate approximation of CNNs while offering a ~58x speedup on CPUs (in terms of the number of high precision operations). This means that XNOR-Nets can enable real-time inference in devices with small memory and no GPUs (inference in XNOR-Nets can be done very efficiently on CPUs).^{a}

To the best of our knowledge, this paper is the first attempt to present an evaluation of binary neural networks on large-scale datasets like ImageNet. Our experimental results show that our proposed method for binarizing convolutional neural networks outperforms the state-of-the-art network binarization method of Ref.^{2} by a large margin (16.3%) on top-1 image classification in the ImageNet challenge ILSVRC2012. Our contribution is two-fold: First, we introduce a new way of binarizing the weight values in convolutional neural networks and show the advantage of our solution compared to state-of-the-art solutions. Second, we introduce XNOR-Nets, a DNN model with binary weights and binary inputs, and show that XNOR-Nets can obtain similar classification accuracies compared to standard networks while being significantly more efficient. Our code is available at: http://allenai.org/plato/xnornet.

### 2. Binary Convolutional Neural Network

To process an image for a variety of computer vision tasks, we need to pass the image through a multi-layer convolutional neural network (see Figure 1). The major computational bottleneck is in the convolutional operations, which are combinations of simple floating point arithmetic operations. In state-of-the-art CNN models, the number of floating point operations is on the order of billions. This is the main reason that processing images with state-of-the-art CNN models requires GPU servers: GPUs can parallelize this huge number of floating point operations, but they are expensive and consume considerable power. In this paper, we question the need for floating point arithmetic in CNNs. We show that it is possible to reduce the precision of the parameters and the activation values of the neurons from 32 bits all the way down to a single bit. Reducing the precision saves both memory and computation, and single-bit precision makes it possible to use logical operations instead of floating point operations. Mathematically, we represent binary values in {-1, +1}, so the arithmetic operations translate to logical operations on {0, 1}. As shown in Figure 2, multiplication translates to an XNOR operation, and addition and subtraction translate to popcount operations. These operations are natively available in most commodity CPUs in edge devices and can be parallelized inside the CPU, which eliminates the need for a GPU for fast computation.

**Figure 1. We propose two efficient variations of convolutional neural networks: Binary-Weight-Networks, where the weight filters contain binary values, and XNOR-Networks, where both the weights and the inputs have binary values. These networks are very efficient in terms of memory and computation, while being very accurate in natural image classification. This offers the possibility of using accurate vision techniques in portable devices with limited resources.**

**Figure 2. Mathematically, we represent binary values in {-1, +1}, so the arithmetic operations translate to logical operations on {0, 1}. Multiplication translates to an XNOR operation, and addition and subtraction translate to popcount operations.**
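The mapping from arithmetic on {-1, +1} to logic on {0, 1} can be illustrated with a small sketch. It assumes a bit value of 1 encodes +1 and 0 encodes -1, and the helper names `pack` and `binary_dot` are hypothetical, chosen only for this illustration. A production kernel would operate on 64-bit machine words rather than Python integers.

```python
def pack(values):
    """Pack a sequence of +1/-1 values into an integer (element i -> bit i)."""
    bits = 0
    for i, v in enumerate(values):
        if v > 0:          # assumption: bit 1 encodes +1, bit 0 encodes -1
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n {-1,+1} vectors using XNOR and popcount."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask   # bit is 1 where the two signs agree
    agreements = bin(xnor).count("1")  # popcount
    # Each agreement contributes +1 and each disagreement contributes -1.
    return 2 * agreements - n


a = [+1, -1, +1, +1, -1, -1, +1, -1]
b = [+1, +1, -1, +1, -1, +1, +1, -1]

reference = sum(x * y for x, y in zip(a, b))    # ordinary multiply-accumulate
fast = binary_dot(pack(a), pack(b), len(a))     # XNOR + popcount
print(reference, fast)
```

Both paths give the same value because the product of two signs is +1 exactly when the corresponding bits are equal, which is what XNOR computes.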

**2.1. Binary-Weight-Networks**

In order to constrain a convolutional neural network to have binary weights, we estimate the real-value weight filter **W** ∈ R^{c×w×h} using a binary filter **B** ∈ {+1, -1}^{c×w×h}. The most direct approximation is easy to find: the sign values of the elements in **W**. However, this approximation on its own introduces a large quantization error. To compensate for this quantization error, we introduce a scaling factor α ∈ R^{+} such that **W** ≈ α**B**. A convolutional operation can then be approximated by:

**I** ∗ **W** ≈ (**I** ⊕ **B**)α

where **I** is the input tensor and ⊕ indicates a convolution without any multiplication. Since the weight values are binary, we can implement the convolution with additions and subtractions. The binary weight filters reduce memory usage by a factor of ~32x compared to single-precision filters. In Ref.^{14} we found a closed-form optimal estimate for **W** ≈ α**B** by solving a constrained optimization problem. The optimal binary weight filter is obtained simply by taking the sign of the weight values, and the optimal scaling factor is the average of the absolute weight values.
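The closed-form estimate described above can be checked numerically. The following NumPy sketch binarizes one random filter with the sign function and the mean absolute value, then compares the squared error against slightly perturbed scaling factors. The function name `binarize_filter` and the filter shape are illustrative assumptions, not part of the original implementation.

```python
import numpy as np


def binarize_filter(W):
    """Return (alpha, B) with B = sign(W) in {-1,+1} and alpha = mean(|W|)."""
    B = np.where(W >= 0, 1.0, -1.0)   # zeros are mapped to +1 by convention
    alpha = np.abs(W).mean()
    return alpha, B


rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3, 3))    # one c x w x h filter (illustrative size)
alpha, B = binarize_filter(W)


def sq_error(scale):
    """Squared approximation error ||W - scale * B||^2."""
    return float(np.sum((W - scale * B) ** 2))


err_opt = sq_error(alpha)
err_lo = sq_error(0.9 * alpha)
err_hi = sq_error(1.1 * alpha)
print(err_opt, err_lo, err_hi)
```

With **B** fixed to the signs, the error is a quadratic in the scale whose minimum sits at the mean absolute weight, so both perturbed scales give a larger error than `alpha`.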

**Training Binary-Weight-Networks.** A naive solution for training a BWN would be to first train a full precision model and then simply quantize the weight values as described above. This approach does not work. Figure 3 shows an image classification experiment on the ImageNet dataset using a ResNet model with 50 layers. The second bar from the left shows the top-1 accuracy of this approach in comparison with the full precision model (first bar from the left). The naive quantization destroys all the information in the parameters of the network. The main challenge here is to find a set of real-value weight filters such that, if we quantize them, we can still reliably classify the categories of objects in an image. To find this set of weight filters, we adopt a modified version of the gradient backpropagation algorithm.

**Figure 3. This bar chart compares the top-1 accuracies in image classification on the ImageNet challenge ILSVRC2012 using a Residual Network model with 50 layers. From the left, the first bar shows the accuracy of the full precision model, the second bar shows the accuracy when the model parameters are binarized with the naive approach discussed in Section 2.1, the third bar shows the accuracy when only the weights are binarized, the fourth bar shows the accuracy when both weights and inputs are binarized, and the last bar shows the accuracy when the XNOR-Net model is trained with Label Refinery (see Section 2.3).**

**Algorithm 1:** Training a CNN with binary weights:

**Input:** A set of training images **X**

**Output:** Model parameters **W**

- Randomly initialize **W**
- **for** *iter* = 1 to *N* **do**
  - Load a random image **X** from the train set
  - Quantize the model parameters **W** as described above
  - Forward pass the image **X** using the quantized parameters
  - Compute the loss function (cross-entropy for classification)
  - Backward pass to compute gradients with respect to the quantized parameters
  - Update the real-value weights **W** using the gradients with a proper learning rate

Each iteration of training a CNN involves three steps: forward pass, backward pass, and parameter update. To train a CNN with binary weights (in convolutional layers), we only quantize the weights during the forward pass and backward propagation. To compute the gradient for the sign function, we follow the same approach as Ref.^{2} For updating the parameters, we use the high precision (real-value) weights. This is because the parameter changes in gradient descent are tiny, so quantizing after each update would discard these changes and the training objective could not be improved. References^{2,3} also employed this strategy to train a binary network. Algorithm 1 gives a high-level outline of our procedure for training a CNN with binary weights. Once training is finished, there is no need to keep the real-value weights, because at inference we only perform forward propagation with the binarized weights. In Figure 3, the third bar from the left shows the accuracy of the binary weight network trained with the proposed algorithm. As can be seen, the top-1 accuracy is as high as that of the full precision model while the model size is about 32x smaller.
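The training procedure can be sketched on a toy problem. The example below is not the authors' implementation: it replaces the CNN with a single linear layer trained on synthetic data, uses a logistic loss, and passes the gradient with respect to the quantized weights straight to the real-value weights (a simplified stand-in for the gradient treatment of the sign function). All names, sizes, and hyperparameters are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: labels come from a hidden +/-1 weight vector (assumption).
n_features, n_samples = 16, 512
w_true = rng.choice([-1.0, 1.0], size=n_features)
X = rng.standard_normal((n_samples, n_features))
y = (X @ w_true > 0).astype(float)


def quantize(W):
    """Binary weights scaled by alpha = mean(|W|)."""
    return np.abs(W).mean() * np.where(W >= 0, 1.0, -1.0)


def loss_and_grad(Wq, Xb, yb):
    """Logistic loss and its gradient with respect to the quantized weights."""
    p = 1.0 / (1.0 + np.exp(-(Xb @ Wq)))
    eps = 1e-12
    loss = -np.mean(yb * np.log(p + eps) + (1 - yb) * np.log(1 - p + eps))
    grad = Xb.T @ (p - yb) / len(yb)
    return loss, grad


W = 0.01 * rng.standard_normal(n_features)   # real-value weights kept for updates
lr = 0.1
for it in range(400):
    idx = rng.integers(0, n_samples, size=64)     # load a random mini-batch
    Wq = quantize(W)                              # quantize for the forward pass
    loss, grad = loss_and_grad(Wq, X[idx], y[idx])
    W -= lr * grad                                # update the real-value weights

W_final = quantize(W)                             # only this is needed at inference
accuracy = np.mean((X @ W_final > 0) == (y > 0.5))
print(accuracy)
```

The real-value copy `W` accumulates many small updates until a sign flips, which is the reason the update step is not applied to the quantized weights directly.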

**2.2. XNOR-Networks**

So far, we have found binary weight filters for a CNN model, but the inputs to the convolutional layers are still real-value tensors. Now, we explain how to quantize both the weight filters and the input tensors, so that convolutions can be implemented efficiently using XNOR and bitcounting operations. This is the key element of our XNOR-Networks. In order to constrain a convolutional neural network to have binary weights and binary inputs, we need to enforce binary operands at each step of the convolutional operation. A convolution consists of repeating a shift operation and a dot product. The shift operation moves the weight filter over the input, and the dot product performs element-wise multiplications between the values of the weight filter and the corresponding part of the input. If we express the dot product in terms of binary operations, the convolution can be approximated using binary operations. The dot product between two binary vectors can be implemented by XNOR-bitcounting operations.^{2} In Ref.^{14} we explain how to approximate the dot product between two vectors in R^{n} by a dot product between two vectors in {+1, -1}^{n}. Similar to the binary weight approximation, we introduce a scaling factor for the quantized input tensor and find the optimal solution by solving a constrained optimization problem that has a closed-form solution for the weight filters and the input tensors. The optimal binary estimates of a weight filter and an input tensor are obtained simply by taking the sign of their values, and the optimal scaling factors are the averages of their absolute values.

Next, we demonstrate how to use this approximation for estimating a convolutional operation between two tensors.

Using this approximation, we can perform the convolution between an input **I** and a weight filter **W** mainly using binary operations:

**I** ∗ **W** ≈ (sign(**I**) ⊛ sign(**W**)) ⊙ **K**α

where ⊛ indicates a convolutional operation using XNOR and bitcount operations, ⊙ is element-wise multiplication, α is the scaling factor for the weight filter, and **K** is a matrix of scaling factors for all of the spatial sections of the input tensor in the convolution. Note that the number of non-binary operations is very small compared to the number of binary operations.
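As a rough illustration of this approximation, the NumPy sketch below compares an exact convolution with a binarized one. It makes several assumptions that the text does not spell out: **K** is built by averaging the absolute input over channels and then over each filter-sized window, the convolution is a stride-1 "valid" cross-correlation, and the binary convolution is emulated with floating point ±1 values rather than real XNOR and bitcount instructions. The function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def correlate(I, W):
    """Stride-1 valid cross-correlation of a (c, H, W) input with a (c, h, w) filter."""
    windows = sliding_window_view(I, W.shape)[0]      # (H-h+1, W-w+1, c, h, w)
    return np.einsum("ijchw,chw->ij", windows, W)


def sign(x):
    return np.where(x >= 0, 1.0, -1.0)


def xnor_conv(I, W):
    """Approximate convolution: binary correlation rescaled by K and alpha."""
    alpha = np.abs(W).mean()                          # weight scaling factor
    A = np.abs(I).mean(axis=0, keepdims=True)         # channel-wise mean of |I|
    k = np.full((1,) + W.shape[1:], 1.0 / (W.shape[1] * W.shape[2]))
    K = correlate(A, k)                               # one scale per spatial window
    binary = correlate(sign(I), sign(W))              # stands in for XNOR + bitcount
    return binary * K * alpha


rng = np.random.default_rng(0)
I = rng.standard_normal((8, 20, 20))
W = rng.standard_normal((8, 3, 3))

exact = correlate(I, W)
approx = xnor_conv(I, W)
corr = np.corrcoef(exact.ravel(), approx.ravel())[0, 1]
print(exact.shape, corr)
```

On random data the approximation is only loosely correlated with the exact result, which is expected: the networks are trained with the binarization in the loop rather than binarized after the fact.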

**Training XNOR-Networks:** A typical block in a CNN contains several different layers. Figure 4 (left) illustrates such a block, which has four layers in the following order: 1-Convolutional, 2-Batch Normalization, 3-Activation, and 4-Pooling. The batch normalization layer^{9} normalizes the input batch by its mean and variance. The activation is an element-wise non-linear function (e.g., Sigmoid, ReLU). The pooling layer applies any type of pooling (e.g., max, min, or average) on the input batch. For a binarized convolution, the activation layer is the sign function.

With the typical CNN block structure, binarization does not work. Applying pooling on binary input results in a significant loss of information. For example, max-pooling on binary input returns a tensor in which most of the elements are equal to +1. Moreover, in the backward pass we often observe more than one maximum per window, which leads to uncertainty in the penalization. One may assume that swapping the activation and the pooling layers will solve this issue. In this case, the input to the activation layer is real-valued, and max pooling will often give us a positive tensor. The activation then turns it into a tensor in which most of the values are +1, which means we again lose the information for the next layer. However, this configuration does not have the penalization problem in the backward pass, because the pooling here usually has one maximum per window.

The XNOR-Net block configuration shown in Figure 4 (right) starts with batch normalization and activation, then a convolution, and at the end the pooling. This configuration passes a binary input to the convolution, which generates a real-value tensor, followed by a pooling that produces a tensor with mostly positive values. This tensor goes to the batch normalization in the next block, where the mean centering generates negative values, so that the activation produces a proper binary tensor to pass to the next convolution.

**Figure 4. This figure contrasts the block structure in our XNOR-Network (right) with a typical CNN (left).**

Therefore, we put the pooling layer after the convolution. To further decrease the information loss due to binarization, we normalize the input before binarization. This ensures that the data has zero mean, so thresholding at zero leads to less quantization error. The order of layers in a block of a binary CNN is shown in Figure 4 (right).
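The effect of layer order can be seen on random data. The sketch below uses a standard normal array as a stand-in for a real-value feature map and a hypothetical 2 x 2 max-pooling helper, then measures the fraction of +1 values when pooling follows the sign activation versus when pooled values are mean-centered before the sign is taken. It is only an illustration of the argument, not a reproduction of the network blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64))      # stand-in for a real-value feature map


def sign(t):
    return np.where(t >= 0, 1.0, -1.0)


def max_pool_2x2(t):
    """Non-overlapping 2 x 2 max pooling on a 2D array with even sides."""
    h, w = t.shape
    return t.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


# Activation then pooling: a window is -1 only if all four entries are -1.
frac_plus_after_pool = np.mean(max_pool_2x2(sign(x)) == 1.0)

# Pooling on real values, then normalization, then the sign activation.
pooled = max_pool_2x2(x)
frac_pooled_positive = np.mean(pooled > 0)
normalized = (pooled - pooled.mean()) / pooled.std()
frac_plus_after_norm = np.mean(sign(normalized) == 1.0)

print(frac_plus_after_pool, frac_pooled_positive, frac_plus_after_norm)
```

For independent symmetric inputs, roughly 1 - 0.5^4 of the pooled binary values are +1, whereas normalizing before the sign gives a far more balanced binary tensor.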

Once we have the binary CNN structure, the training algorithm would be the same as Algorithm 1.

**Binary Gradient:** The computational bottleneck in the backward pass at each layer is computing a convolution between the weight filters and the gradients with respect to the inputs. Similar to binarization in the forward pass, we can binarize the gradients in the backward pass. This leads to a very efficient training procedure using binary operations. Note that if we use the same mechanism to compute the scaling factor for the quantized gradient, the direction of maximum change for SGD would be diminished. To preserve the maximum change in all dimensions, we use the maximum of the absolute values in the gradients as the scaling factor.

***k*-bit Quantization:** So far, we have shown 1-bit quantization of weights and inputs using the sign(*x*) function. One can easily extend the quantization level to *k* bits by using *q*_{k}(*x*) = 2([(2^{k} - 1)((*x* + 1)/2)] / (2^{k} - 1) - 1/2) instead of the sign function, where [.] indicates the rounding operation and *x* ∈ [-1, 1].
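Both ideas in this passage are easy to sketch in NumPy. The quantizer below assumes the *k*-bit form q_k(x) = 2([(2^k - 1)(x + 1)/2] / (2^k - 1) - 1/2), which reduces to the sign function at *k* = 1, and it implements the rounding as round-half-up (an assumption, since the text does not specify tie-breaking). The gradient helper uses the maximum absolute value as the scale, as described above. The function names are hypothetical.

```python
import numpy as np


def quantize_k_bits(x, k):
    """Map x in [-1, 1] onto 2**k evenly spaced levels in [-1, 1]."""
    levels = 2 ** k - 1
    # floor(v + 0.5) rounds halves up; np.round would round halves to even.
    rounded = np.floor(levels * (x + 1.0) / 2.0 + 0.5)
    return 2.0 * (rounded / levels - 0.5)


def binarize_gradient(g):
    """Sign of the gradient scaled by its maximum absolute value."""
    return np.abs(g).max() * np.where(g >= 0, 1.0, -1.0)


x = np.array([-1.0, -0.3, 0.2, 1.0])
q1 = quantize_k_bits(x, 1)       # two levels: -1 and +1
q2 = quantize_k_bits(x, 2)       # four levels: -1, -1/3, 1/3, 1

g = np.array([0.5, -2.0, 0.1])
g_bin = binarize_gradient(g)
print(q1, q2, g_bin)
```

Using the maximum rather than the mean as the gradient scale keeps the largest component at its original magnitude, so the strongest update direction is not shrunk.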

**2.3. Improving accuracy using Label Refinery**

To further improve the accuracy of XNOR-Networks, we introduced an iterative training method in Ref.^{1} to update ground truth labels using a visual model trained on the entire dataset. The Label Refinery produces soft, multi-category, dynamically generated labels consistent with the visual signal. The training images are originally labelled with a single category. After a few iterations of label refining, the labels from which the final model is trained are informative, unambiguous, and smooth. This results in major improvements in model accuracy during successive stages of refinement as well as improved model generalization. The last bar from the left in Figure 3 shows the top-1 accuracy improvement for the XNOR-Network trained with Label Refinery.

### 3. Experiments

We evaluate our method by analyzing its efficiency and accuracy. We measure efficiency by computing the computational speedup (in terms of the number of high precision operations) achieved by our binary convolution vs. standard convolution. To measure accuracy, we perform image classification on the large-scale ImageNet dataset. This paper is the first work that evaluates binary neural networks on the ImageNet dataset. Our binarization technique is general, so it can be applied to any CNN architecture. We evaluate AlexNet^{11} and two deeper architectures in our experiments. We compare our method with two recent works on binarizing neural networks: BinaryConnect (BC)^{3} and BinaryNet (BNN).^{2} Our BWN version of AlexNet is as accurate as the full precision version of AlexNet. This classification accuracy outperforms competing binary neural network methods by a large margin. We also present an ablation study, where we evaluate the key elements of our proposed method: computing scaling factors and our block structure for binary CNNs. We show that our method of computing the scaling factors is important for reaching high accuracy.

**3.1. Efficiency analysis**

In a standard convolution, the total number of operations is *cN*_{w}*N*_{1}, where *c* is the number of channels, *N*_{w} = *wh* and *N*_{1} = *w*_{in}*h*_{in}. Note that some modern CPUs can fuse the multiplication and addition into a single-cycle operation. On those CPUs, BWN does not deliver a speedup. Our binary approximation of convolution has *cN*_{w}*N*_{1} binary operations and *N*_{1} non-binary operations. With the current generation of CPUs, we can perform 64 binary operations in one CPU clock cycle, therefore the speedup can be computed by *S* = *cN*_{w}*N*_{1} / ((1/64)*cN*_{w}*N*_{1} + *N*_{1}) = 64*cN*_{w} / (*cN*_{w} + 64).

The speedup depends on the channel size and filter size but not the input size. In Figure 5(b) and (c), we illustrate the speedup achieved by changing the number of channels and the filter size. While changing one parameter, we fix the other parameters as follows: *c* = 256, *N*_{1} = 14^{2} and *N*_{w} = 3^{2} (the majority of convolutions in the ResNet^{8} architecture have this structure). Using our approximation of convolution we gain a 62.27x theoretical speedup, but in our CPU implementation with all of the overheads, we achieve a 58x speedup in one convolution (excluding the process for memory allocation and memory access). With a small channel size (*c* = 3) and filter size (*N*_{w} = 1 x 1) the speedup is not considerably high. This motivates us to avoid binarization at the first and last layers of a CNN: in the first layer the channel size is three and in the last layer the filter size is 1 x 1. A similar strategy was used in Ref.^{2} Figure 5(a) shows the required memory for three different CNN architectures (AlexNet, VGG-19, ResNet-18) with binary and double precision weights. BWN are so small that they can easily fit into portable devices. BNN^{2} is in the same order of memory and computation efficiency as our method. In Figure 5, we show an analysis of computation and memory cost for a binary convolution. The same analysis is valid for BNN and BC. The key difference of our method is the use of a scaling factor, which does not change the order of efficiency while providing a significant improvement in accuracy.
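The theoretical figure quoted above can be reproduced from the operation counts in this section. The sketch assumes the speedup is the ratio of standard operations to clock-equivalent binary plus non-binary operations, with 64 binary operations per clock, so that the input size cancels. The function name `theoretical_speedup` is hypothetical.

```python
def theoretical_speedup(c, n_w, binary_ops_per_clock=64):
    """Ratio c*n_w*n_i / (c*n_w*n_i / 64 + n_i); the input size n_i cancels."""
    return binary_ops_per_clock * c * n_w / (c * n_w + binary_ops_per_clock)


typical = theoretical_speedup(c=256, n_w=3 * 3)   # common ResNet convolution
small = theoretical_speedup(c=3, n_w=1 * 1)       # small channel and filter size
print(round(typical, 2), round(small, 2))
```

The second case shows why binarizing layers with very few channels or 1 x 1 filters pays off little: the fixed non-binary cost per output dominates.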

**Figure 5. This figure shows the efficiency of binary convolutions in terms of memory (a) and computation (b and c). (a) contrasts the required memory for binary and double precision weights in three different architectures (AlexNet, ResNet-18, and VGG-19). (b, c) show the speedup gained by binary convolution under (b) different numbers of channels and (c) different filter sizes.**

**3.2. Image classification**

We evaluate the performance of our proposed approach on the task of natural image classification. So far, in the literature, binary neural network methods have presented their evaluations on either limited-domain or simplified datasets, for example, CIFAR-10, MNIST, and SVHN. To compare with state-of-the-art vision, we evaluate our method on ImageNet (ILSVRC2012). ImageNet has ~1.2M training images from 1K categories and 50K validation images. The images in this dataset are natural images with reasonably high resolution compared to the CIFAR and MNIST datasets, which have relatively small images. We report our classification performance using top-1 and top-5 accuracies. We adopt three different CNN architectures as our base architectures for binarization: AlexNet,^{11} Residual Networks (known as ResNet),^{8} and a variant of GoogLenet.^{18} We compare our **BWN** with **BC**^{3} and our XNOR-Networks (**XNOR-Net**) with BinaryNeuralNet (**BNN**).^{2} BC is a method for training a DNN with binary weights during forward and backward propagation. Similar to our approach, it keeps the real-value weights during the parameter update step. Our binarization is different from BC. The binarization in BC can be either deterministic or stochastic. We use the deterministic binarization for BC in our comparisons because the stochastic binarization is not efficient. The same evaluation settings have been used and discussed in Ref.^{2} BNN^{2} is a neural network with binary weights and activations during inference and gradient computation in training. In concept, this is a similar approach to our XNOR-Network, but the binarization method and the network structure in BNN are different from ours. Their training algorithm is similar to BC, and they used deterministic binarization in their evaluations.

**CIFAR-10:** BC and BNN showed near state-of-the-art performance on the CIFAR-10, MNIST, and SVHN datasets. BWN and XNOR-Net on CIFAR-10, using the same network architecture as BC and BNN, achieve error rates of 9.88% and 10.17%, respectively. In this paper, we explore the possibility of obtaining near state-of-the-art results on a much larger and more challenging dataset (ImageNet).

**AlexNet:** AlexNet^{11} is a CNN architecture with five convolutional layers and two fully-connected layers. It was the first CNN architecture shown to be successful on the ImageNet classification task. This network has 61M parameters. We use AlexNet coupled with batch normalization layers.^{9}

*Train:* In each iteration of training, images are resized to have 256 pixels at their smaller dimension and then a random crop of 224 x 224 is selected for training. We run the training algorithm for 16 epochs with batch size equal to 512. We use negative-log-likelihood over the soft-max of the outputs as our classification loss function. In our implementation of AlexNet we do not use the Local-Response-Normalization (LRN) layer. We use SGD with momentum = 0.9 for updating parameters in BWN and BC. For XNOR-Net and BNN we use ADAM.^{10} ADAM converges faster and usually achieves better accuracy for binary inputs.^{2} The learning rate starts at 0.1 and we apply a learning-rate-decay = 0.01 every four epochs.

*Test:* At inference time, we use the 224 x 224 center crop for forward propagation.

Figure 6 shows the classification accuracy for training and inference along the training epochs for top-1 and top-5 scores. The dashed lines represent training accuracy and the solid lines show validation accuracy. In all of the epochs our method outperforms BC and BNN by a large margin (~17%). Table 1 compares our final accuracy with BC and BNN. We found that the scaling factors for the weights (α) are much more effective than the scaling factors for the inputs (β). Removing β reduces the accuracy by a small margin (less than 1% top-1 on AlexNet).

**Figure 6. This figure compares the ImageNet classification accuracy on top-1 and top-5 across training epochs. Our approaches BWN and XNOR-Net outperform BinaryConnect (BC) and BinaryNet (BNN) in all the epochs by a large margin (~17%).**

**Table 1. This table compares the final accuracies (top-1-top-5) of the full precision network with our binary precision networks; Binary-Weight-Networks (BWN) and XNOR-Networks (XNOR-Net) and the competitor methods; BinaryConnect (BC) and BinaryNet (BNN).**

*Binary Gradient:* Using XNOR-Net with binary gradients, the top-1 accuracy drops by only 1.4%.

**Residual Net:** We use the ResNet-18 proposed in^{8} with short-cut type B.

*Train:* In each training iteration, images are resized randomly between 256 and 480 pixels on the smaller dimension and then a random crop of 224 x 224 is selected for training. We run the training algorithm for 58 epochs with batch size equal to 256 images. The learning rate starts at 0.1 and we use a learning-rate-decay equal to 0.01 at epochs 30 and 40.

*Test:* At inference time, we use the 224 x 224 center crop for forward propagation.

Figure 7 demonstrates the classification accuracy (top-1 and top-5) along the epochs for training and inference. The dashed lines represent training and the solid lines represent inference. Table 2 shows our final accuracy by BWN and XNOR-Net.^{b}

**Figure 7. This figure shows the classification accuracy; (a) top-1 and (b) top-5 measures across the training epochs on ImageNet dataset by Binary-Weight-Network and XNOR-Network using ResNet-18.**

**Table 2. This table compares the final classification accuracy achieved by our binary precision networks with the full precision network in ResNet-18 and GoogLenet architectures.**

**GoogLenet variant:** We experiment with a variant of GoogLenet^{18} that uses a similar number of parameters and connections but only straightforward convolutions, no branching. It has 21 convolutional layers with filter sizes alternating between 1 x 1 and 3 x 3.

*Train:* Images are resized randomly between 256 and 320 pixels on the smaller dimension and then a random crop of 224 x 224 is selected for training. We run the training algorithm for 80 epochs with a batch size of 128. The learning rate starts at 0.1 and we use polynomial rate decay, β = 4.

*Test:* At inference time, we use a center crop of 224 x 224.

**3.3. Ablation studies**

There are two key differences between our method and the previous network binarization methods: the binarization technique and the block structure in our binary CNN. For binarization, we find the optimal scaling factors at each iteration of training. For the block structure, we order the layers in a block in a way that decreases the quantization loss for training XNOR-Net. Here, we evaluate the effect of each of these elements on the performance of the binary networks. Instead of computing the scaling factor α, one can consider α as a network parameter. In other words, a layer after the binary convolution multiplies the output of the convolution by a scalar parameter for each filter. This is similar to computing the affine parameters in batch normalization. Table 3(a) compares the performance of a binary network with the two ways of computing the scaling factors. As we mentioned in Section 2.2, the typical block structure in a CNN is not suitable for binarization. Table 3(b) compares the standard block structure C-B-A-P (Convolution, Batch Normalization, Activation, and Pooling) with our structure B-A-C-P (where A is the binary activation).

**Table 3. In this table, we evaluate two key elements of our approach: computing the optimal scaling factors and specifying the right order for layers in a block of a CNN with binary input. (a) demonstrates the importance of the scaling factor in training Binary-Weight-Networks and (b) shows that our way of ordering the layers in a block of a CNN is crucial for training XNOR-Networks. C, B, A, P stand for Convolutional, BatchNormalization, Activation function (here binary activation), and Pooling, respectively.**

### 4. Conclusion^{c}

We introduce simple, efficient, and accurate binary approximations for neural networks. We train a neural network that learns to find binary values for weights, which reduces the size of the network by ~32x and provides the possibility of loading very deep neural networks into portable devices with limited memory. We also propose an architecture, XNOR-Net, that uses mostly bitwise operations to approximate convolutions. This provides a ~58x speedup and enables the possibility of running the inference of state-of-the-art DNNs on CPUs (rather than GPUs) in real time.

### References

1. Bagherinezhad, H., Horton, M., Rastegari, M., Farhadi, A. Label refinery: Improving imagenet classification through label progression. arXiv preprint arXiv:1805.02641 (2018).

2. Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830 (2016).

3. Courbariaux, M., Bengio, Y., David, J.-P. Binaryconnect: Training deep neural networks with binary weights during propagations. In *Advances in Neural Information Processing Systems* (2015), 3105–3113.

4. Girshick, R. Fast R-CNN. In *Proceedings of the IEEE International Conference on Computer Vision* (2015), 1440–1448.

5. Girshick, R., Donahue, J., Darrell, T., Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition* (2014), 580–587.

6. Gottmer, M. *Merging Reality and Virtuality with Microsoft Hololens*, 2015.

7. Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A., et al. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567 (2014).

8. He, K., Zhang, X., Ren, S., Sun, J. Deep residual learning for image recognition. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition* (2016), 770–778.

9. Ioffe, S., Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).

10. Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

11. Krizhevsky, A., Sutskever, I., Hinton, G.E. Imagenet classification with deep convolutional neural networks. In *Advances in Neural Information Processing Systems* (2012), 1097–1105.

12. Long, J., Shelhamer, E., Darrell, T. Fully convolutional networks for semantic segmentation. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition* (2015), 3431–3440.

13. Oculus, V. Oculus rift-virtual reality headset for 3D gaming, 2012. URL: http://www.oculusvr.com.

14. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A. XNOR-Net: Imagenet classification using binary convolutional neural networks. In *European Conference on Computer Vision* (2016), Springer.

15. Redmon, J. Darknet: Open source neural networks in C, 2013-2016. http://pjreddie.com/darknet/.

16. Ren, S., He, K., Girshick, R., Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In *Advances in Neural Information Processing Systems* (2015), 91–99.

17. Simonyan, K., Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).

18. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A. Going deeper with convolutions. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition* (2015), 1–9.

### Footnotes

a. Fully connected layers can be implemented by convolution, therefore, in the rest of the paper, we refer to them also as convolutional layers.^{12}

b. Our implementation follows https://gist.github.com/szagoruyko/dd032c529048492630fc.

c. We used the Darknet^{15} implementation: http://pjreddie.com/darknet/imagenet/#extraction.

The original version of this paper is entitled *"XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks"* and was published in the European Conference on Computer Vision (ECCV) 2016.

**©2020 ACM 0001-0782/20/12**

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected] or fax (212) 869-0481.
