{"title": "Memory-Driven Mixed Low Precision Quantization for Enabling Deep Network Inference on Microcontrollers", "book": "Proceedings of Machine Learning and Systems", "page_first": 326, "page_last": 335, "abstract": "This paper presents a novel end-to-end methodology for enabling the deployment of high-accuracy deep networks on microcontrollers. To fit within the memory and computational limitations of resource-constrained edge-devices, we exploit mixed low-bitwidth compression, featuring 8, 4 or 2-bit uniform quantization, and we model the inference graph with integer-only operations. \nOur approach aims at determining the minimum bit precision of every activation and weight tensor given the memory constraints of a device. This is achieved through a rule-based iterative procedure, which cuts the number of bits of the most memory-demanding layers, aiming at meeting the memory constraints. After a quantization-aware retraining step, the fake-quantized graph is converted into an inference integer-only model by inserting the Integer Channel-Normalization (ICN) layers, which introduce a negligible loss as demonstrated on INT4 MobilenetV1 models. We report the latency-accuracy evaluation of mixed-precision MobilenetV1 family networks on a STM32H7 microcontroller. Our experimental results demonstrate an end-to-end deployment of an integer-only Mobilenet network with Top1 accuracy of 68% on a device with only 2MB of FLASH memory and 512kB of RAM, improving by 8% the Top1 accuracy with respect to previously published 8 bit implementations for microcontrollers.", "full_text": " MEMORY-DRIVENMIXEDLOWPRECISIONQUANTIZATIONFORENABLING\r\n DEEPNETWORKINFERENCEONMICROCONTROLLERS\r\n ManueleRusci1 AlessandroCapotondi2 LucaBenini13\r\n ABSTRACT\r\n This paper presents a novel end-to-end methodology for enabling the deployment of high-accuracy deep networks\r\n onmicrocontrollers. 
To fit within the memory and computational limitations of resource-constrained edge-devices, we exploit mixed low-bitwidth compression, featuring 8, 4 or 2-bit uniform quantization, and we model the inference graph with integer-only operations. Our approach aims at determining the minimum bit precision of every activation and weight tensor given the memory constraints of a device. This is achieved through a rule-based iterative procedure, which cuts the number of bits of the most memory-demanding layers, aiming at meeting the memory constraints. After a quantization-aware retraining step, the fake-quantized graph is converted into an inference integer-only model by inserting the Integer Channel-Normalization (ICN) layers, which introduce a negligible loss as demonstrated on INT4 MobilenetV1 models. We report the latency-accuracy evaluation of mixed-precision MobilenetV1 family networks on a STM32H7 microcontroller. Our experimental results demonstrate an end-to-end deployment of an integer-only Mobilenet network with Top1 accuracy of 68% on a device with only 2MB of FLASH memory and 512kB of RAM, improving by 8% the Top1 accuracy with respect to previously published 8 bit implementations for microcontrollers.
 1 INTRODUCTION
 Enabling machine learning on extreme-edge devices is challenging due to their tight memory and computing power constraints. When envisioning smart sensors operating on batteries, the target power envelope must be below tens of mW to guarantee a battery lifetime of years. This requirement impacts the system architecture design: adding computational units (e.g. floating-point units) or memory banks contributes to increasing the complexity and the power cost, and hence the energy, of a system.
 Nowadays, microcontroller units (MCUs), such as STMicroelectronics STM32 devices, feature an energy consumption compliant with the requirements of smart autonomous sensors and include energy-efficient computational units for running machine learning workloads. However, the typical size of the embedded memory cuts is limited to a few MB (a STM32H7 MCU features 2MB of FLASH memory) and the computation core (commonly a single ARM Cortex-M CPU) runs up to a few hundred MHz. To boost the performance of this class of MCUs while leveraging the high flexibility of software programmability, ARM recently released a software library, CMSIS-NN (Lai et al., 2018), which enabled the efficient computation of deep networks on tiny microcontrollers. The optimized routines composing the library realize convolutional operations in fixed-point representations, to exploit instruction-level parallelism. Unfortunately, due to memory constraints, only a small set of relatively complex networks has been ported to the microcontroller domain yet (Zhang et al., 2017). For what concerns deep inference models tailored for complex tasks, e.g. 1000-class image classification, the deployment on memory-constrained MCUs is still an open problem.
 To address this, recent works focused on designing novel network topologies optimized not only in terms of accuracy but also for computational and memory costs (Howard et al., 2017; Ma et al., 2018; Wu et al., 2018). In addition, a variety of compression techniques can be applied to further shrink a trained model. Among these, the quantization of either activation values or parameters to a low-bitwidth format, i.e. 8 bit or less, is extremely effective because, besides reducing the memory footprint, it allows operating with low-precision integer operations, which can be efficiently mapped on the limited instruction set of tiny microcontrollers.
 1 DEI, Universita' di Bologna, Bologna, Italy 2 Universita' di Modena e Reggio Emilia, Italy 3 D-ITET, ETH Zurich, Switzerland. Correspondence to: Manuele Rusci .
 Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. Copyright 2020 by the author(s).
 Figure 1 highlights a typical development flow to deploy a deep network design into a resource-constrained device. A pretrained network f(x) is quantized by means of an initial device-aware fine-tuning process, which can also include a re-training step. The resultant fake-quantized model g(x), emulating quantized values during the forward pass, is turned into an integer-only deployment model g'(x) by means of an additional optimization step. Ideally, loss(g'(x)) ≈ loss(g(x)) ≈ loss(f(x)).
 Figure 1. Design flow to bring deep neural networks into tiny microcontrollers.
 State-of-the-art quantization approaches lead to almost-zero accuracy loss when approximating a deep model with 8 bit arithmetic (Jacob et al., 2018). This compression level is however not sufficient to bring deep models with high accuracy into memory-constrained microcontrollers. As an example, an 8 bit MobilenetV1 (Howard et al., 2017) with the highest accuracy requires more than 4 MB of embedded memory, which is prohibitive for the majority of microcontroller devices available today. If homogeneously lowering the number of bits below 8 bits on a per-network base, the accuracy degradation becomes not negligible (Krishnamoorthi, 2018). To keep the accuracy level high, the bit precision of individual tensors should be tuned so as i) to fit the memory constraints and ii) to minimize the reduction of the bitwidth (Dong et al., 2019). These needs motivate our proposed heterogeneous sub-byte quantization approach, denoted as Mixed Low Precision Quantization, which finely controls the per-tensor bit precision in accordance with the memory budget. Moreover, the compression scheme must be combined with novel techniques for deriving integer-only inference models, required to accelerate deep learning workloads on microcontrollers.
 In this work we present a methodology for quantizing deep networks based on a mixed-precision scheme. The selection of the bit precision of every individual tensor is automated so as to satisfy the memory limitations of a given device. Moreover, we improve the methodology of (Jacob et al., 2018) for integer-only inference networks by supporting sub-byte per-channel quantization. Our experimental evaluation is conducted over the MobilenetV1 family networks on the 1000-class Imagenet classification task (Howard et al., 2017). We argue that this is a representative problem for tiny microcontrollers, not yet solved (Jain et al., 2019), and much harder than quantizing over-parameterized networks (Choi et al., 2018).
 This paper makes the following contributions:
 • We introduce the Integer Channel-Normalization (ICN) activation layer to achieve an efficient conversion of the fake-quantized graph into an integer-only deployment graph, also supporting per-channel quantization and quantization-aware training strategies.
 • We present a mixed-precision quantization methodology driven by the memory constraints of a target architecture, which aims at selecting the bit precision of every weight and activation tensor of an integer-only network.
 • We study the latency-accuracy tradeoff on iso-memory mixed-precision networks belonging to the MobilenetV1 family when running on a STM32H7 microcontroller device.
 Our methodology demonstrates, for the very first time, an integer-only deployment of a MobilenetV1 network on a STM32H7 microcontroller, featuring only 2MB of FLASH memory and 512kB of RAM, with 68% Top1 accuracy, which is 8% higher than previously reported 8 bit integer-only implementations fitting the same memory constraints (Jacob et al., 2018).
 2 RELATED WORK
 Quantized Neural Networks. Early works on quantization of deep networks targeted 16 bit fixed-point implementations (Lin et al., 2016), which result in an almost lossless approximation of full-precision trained networks, or extreme binarized networks, which, despite the fascinating low computational and memory requirements, showed major accuracy losses when applied on image classification benchmarks (Courbariaux et al., 2016; Rastegari et al., 2016). Several studies demonstrated that 8 bit quantization of weights and activations results in a good trade-off between latency, compression and a near-zero accuracy degradation, also if applied to efficient Imagenet classification networks (Jacob et al., 2018; Migacz, 2017; Jain et al., 2019). Among the employed methodologies, TensorRT (Migacz, 2017) approximates the parameter tensors by the minimization of the KL divergence metric between quantized and full-precision values. On the contrary, (Jacob et al., 2018) quantizes values within a range defined by the tensor min and max values. Concerning activations, the PACT approach (Choi et al., 2018) demonstrated the highest efficiency by leveraging backpropagation to learn the quantization ranges. Recently, to fit stringent memory requirements, more aggressive sub-byte precision quantization approaches, i.e. less than 8 bit, are under investigation (Choukroun et al., 2019; Jain et al., 2019; Esser et al., 2019; Krishnamoorthi, 2018; Liu & Mattina, 2019). The works (Jain et al., 2019; Esser et al., 2019) exploit learning-based approaches for determining the quantization ranges of activations and weights at low-bitwidth precision. State-of-the-art accuracy on the efficient MobilenetV1 model has been reported by (Krishnamoorthi, 2018; Liu & Mattina, 2019), by making use of per-channel quantization when moving to 4 bit precision. It is also worth mentioning that non-uniform quantizers have resulted the best approximators when reducing the bit precision (Zhang et al., 2018; Wang et al., 2018; Han et al., 2015). However, a high-precision (floating point) arithmetic is needed on uncompressed values within the datapath, hence these methods result not suitable for the microcontroller domain. In this work, we leverage existing techniques and show the insights, concerning both computational and memory aspects, of bringing fake-quantized networks to the integer-only arithmetic domain, which is not taken into consideration by this class of works.
 Mixed Low Precision Quantization. Mixed-precision techniques make use of multiple bit precisions throughout a quantized network, motivated by the fact that a lossy and aggressive linear cut is not necessary to reach a given compression rate. The method of (Fromm et al., 2018) targeted per-pixel binarization based on a defined tensor mask. Despite achieving an extreme quantization level, a per-pixel quantization cannot be efficiently handled on a microcontroller, due to the control-based nature of the required dataflow. The HAWQ (Dong et al., 2019) method relies on a second-order Hessian metric to prioritize the tensors whose bit precision is to be reduced, but without choosing the optimal per-tensor quantization level. In the same direction, HAQ (Wang et al., 2018) dynamically explores multiple low-bitwidth precisions at training time by means of reinforcement learning. When optimizing for memory constraints, a non-uniform quantization is used. Compared to this, our methodology for bit precision selection applies statically, before quantization-aware retraining, and it is based on a rule-based iterative procedure. Both (Dong et al., 2019) and (Wang et al., 2018) report superior accuracy than ours when compressing networks to a 1MB memory footprint, but they rely on a non-uniform clustering quantization of floating-point parameters, therefore they are not fully comparable with our work in terms of microcontroller readiness, as current MCUs are not equipped with the hardware needed for manipulation and computation on these data formats.
 Deep networks for resource-constrained devices. To bridge the gap between the complexity of deep networks and the limitations of resource-constrained devices, device-aware optimization strategies have also been presented. The work (Blott et al., 2018) introduced FINN-R to quantize and deploy a generic model into constrained FPGA architectures. Their quantization approach makes use of integer thresholds (Umuroglu & Jahre, 2017; Gao et al., 2018; Rusci et al., 2018) for data compression. This method enables a lossless integer representation of fake-quantized networks, but demands a larger memory footprint with respect to our proposed method. In contrast, the integer-only deployment in (Jacob et al., 2018) presented a compact fixed-point 8 bit quantization strategy, which performs the folding of batch-normalization and scaling factors into weights before applying a uniform quantizer. Additionally, per-layer fixed-point parameters are needed for adapting the dynamic range when passing data from a layer to the next one. In contrast with this work, our methodology generalizes the deployment process when a more effective quantization strategy is used, i.e. per-channel mixed-precision quantization.
 3 BACKGROUND ON LOW-BITWIDTH QUANTIZATION
 The quantization process aims at quantizing both the network parameters and the activation values, i.e. the temporary input and output values of the network layers. While the parameters can be quantized just before the inference (forward) pass (Migacz, 2017), the quantization of the activations requires the insertion of fake-quantized activation layers within the network graph. These additional layers are responsible for recording the activation range statistics, optionally via backpropagation (Choi et al., 2018), and apply quantization during the forward pass depending on the collected statistics. Because of the injected quantization noise, the original full-precision network f is approximated with the correspondent fake-quantized function g. A quantization-aware retraining of a fake-quantized model is essential to recover accuracy, especially when low-bitwidth precision is employed (Jacob et al., 2018).
 In the remainder of the paper we only focus on uniform quantization because its arithmetic is naturally supported by the instruction set of general-purpose programmable MCUs. Hence, without loss of generality, any tensor t ∈ R^N,
 Table 1.
Memory Requirements of a Quantized Convolutional Layer
 Label | Z_x | Weights | Z_w | B_q | M_0 | N_0 | Z_y | Thresholds
 PL+FB (Jacob et al., 2018) | 1 | c_O · k_w · k_h · c_I | 1 | c_O | 1 | 1 | 1 | -
 PL+ICN (our) | 1 | c_O · k_w · k_h · c_I | 1 | c_O | c_O | c_O | 1 | -
 PC+ICN (our) | 1 | c_O · k_w · k_h · c_I | c_O | c_O | c_O | c_O | 1 | -
 PC+Thresholds (Umuroglu & Jahre, 2017) | 1 | c_O · k_w · k_h · c_I | c_O | - | - | - | 1 | c_O · 2^Q
 either representing weights or activations or only a subset of them, can be quantized across the range [a, b] with a given number of Q bits (Jacob et al., 2018) as:
 T · S_t = quant(t) = round(clamp(t, a, b) / S_t) · S_t   (1)
 where S_t = (b - a) / (2^Q - 1) is a real scaling parameter and T is an integer tensor. Equation (1) derives from the mapping:
 t = S_t · (T_q - Z_t)   (2)
 where Z_t is a bias parameter required to shift the numeric domain of the quantized tensor T_q into the [0, 2^Q - 1] or [-2^(Q-1), 2^(Q-1) - 1] ranges, representative of the UINT-Q and INT-Q datatypes. If a = -b, b > 0, the quantization range is symmetric and Z_t is zero.
 In the case of weights, the parameters a and b can be computed as the min and max values of a tensor (Jacob et al., 2018), by means of more sophisticated statistical analysis (Migacz, 2017) or via backpropagation (Choi et al., 2018). A Per-Layer (PL) quantization exploits single values a and b for the whole full-precision tensor, hence Equation 1 is applied layer-wise. A Per-Channel (PC) procedure results more effective by independently approximating a given tensor along the outer dimension (Krishnamoorthi, 2018). This corresponds to computing the a and b parameters in correspondence of any output channel of the tensor.
 To determine the quantization range of the activation values, statistics can be collected at training time during the forward pass, or against a specific calibration dataset. The PACT strategy demonstrated the effectiveness of learning b via backpropagation while fixing a = 0 to reproduce the non-linearity of the ReLU function. In our implementation, the round() of Equation 1 is replaced by floor() because of the lighter software implementation (the operand gets simply truncated, i.e. a shift operation), becoming:
 quant_act(x) = floor(clamp(x, 0, b) / S_x) · S_x,   S_x = b / (2^Q - 1).
 4 INTEGER-ONLY INFERENCE
 Previous work (Jacob et al., 2018) discussed the training and integer-only deployment of a fake-quantized network with 8 bit per-layer quantization. The weight quantization is applied after folding the batch-norm parameters into the convolutional weights. However, when reducing the bit precision below 8 bit using per-layer quantization, the folding process itself can lead to an accuracy drop because it can drastically affect the range of the parameters to quantize. As a reference, Table 2 shows the collapse of the training process for INT4 MobilenetV1 with the folding of the batch-norm parameters enabled.
 With the aim of an integer-only deployment, we extend (Jacob et al., 2018) to a) prevent the folding of batch-normalization parameters into convolutional weights and b) support per-channel low-bitwidth weight quantization. We observe that any fake-quantized network sub-graph composed of a convolutional layer, a batch-normalization layer and a fake-quantizer activation module can be modeled by the transfer function:
 y = quant_act((phi - mu) / sigma · gamma + beta)   (3)
 where phi = sum(x · w) is the output of a full-precision convolution and mu, sigma, gamma, beta are channel-wise full-precision parameters of a batch-normalization layer. It is worth noting that this kind of formulation holds for any feature-wise or layer-wise scaling factor applied to the convolution's output tensor.
 When applying a per-layer quantization of input/output activations and weights, the mapping (2) is injected into Equation 3, which becomes:
 Y = Z_y + quant_act((S_i · S_w · gamma) / (S_o · sigma) · (PHI + 1 / (S_i · S_w) · (B - mu + beta · sigma / gamma)))   (4)
 where PHI = sum((X - Z_x) · (W - Z_w)) is the integer output of a low-bitwidth convolution. We define the arrays B_q = round(1 / (S_i · S_w) · (B - mu + beta · sigma / gamma)), i.e. the quantized bias, and M = (S_i · S_w · gamma) / (S_o · sigma). As done by (Jacob et al., 2018), each element m_i of M can be decomposed as m_i = m0_i · 2^(n0_i), where m0_i is a signed fractional fixed-point number with 0.5 <= abs(m0_i) < 1.0. For the sake of notation, we indicate as M_0 and N_0 the two vectors such that M = M_0 · 2^(N_0). Given this, Equation 4 can be rewritten as:
 Y = Z_y + clamp(floor(M_0 · 2^(N_0) · (PHI + B_q)), 0, 2^Q - 1)   (5)
 Algorithm 1 Cut Activation Bits
 Require: a fake-quantized network g of L stacked quantized convolutional layers, a M_RW memory constraint, a Q_a,min minimum quantization level
 Ensure: the bit precisions Q_x^i, Q_y^i, i = 0, .., L-1 satisfy (7)
 1: Q_y^i ≡ Q_x^(i+1) <- 8, i = 0, .., L-1   (Initialization)
 2: while (7) is not True for every layer do   (Stop Condition)
 3:   for i = 0 to L-2 do   (Forward pass)
 4:     while mem(x_i, Q_x^i) + mem(y_i, Q_y^i) > M_RW AND CutBits(x_i, Q_x^i, y_i, Q_y^i) do
 5:       Q_y^i and Q_x^(i+1) are decremented by one step
 6:     end while
 7:   end for
 8:   for i = L-1 to 1 do   (Backward pass)
 9:     while mem(x_i, Q_x^i) + mem(y_i, Q_y^i) > M_RW AND CutBits(y_i, Q_y^i, x_i, Q_x^i) do
 10:      Q_x^i and Q_y^(i-1) are decremented by one step
 11:    end while
 12:  end for
 13: end while
 14: function CutBits(x_1, Q_x1, x_2, Q_x2)   (Return True if Q_x2 has to be decremented)
 15:   if Q_x2 > Q_a,min then
 16:     if Q_x2 > Q_x1 OR (Q_x2 == Q_x1 AND mem(x_2, Q_x2) > mem(x_1, Q_x1)) then
 17:       return True
 18:     end if
 19:   end if
 20:   return False
 21: end function
 Note that every value in Equation 5 is an integer or a fixed-point value, so that a quantized convolutional layer can be computed with an integer-only arithmetic.
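As a concrete illustration, the integer-only arithmetic of Equation 5 can be sketched in a few lines of plain Python. The snippet below is our minimal sketch, not the paper's released implementation: the function names and the 15 bit mantissa width are illustrative assumptions, and Python's arbitrary-precision integers stand in for the INT32 accumulators of the MCU kernels (a right shift on a Python integer floors the result, matching the floor() of Equation 5).

```python
import math

def split_scale(m, frac_bits=15):
    """Decompose a positive float scale M into (m0_fx, n0) such that
    M = m0 * 2**n0 with 0.5 <= m0 < 1, and m0_fx = round(m0 * 2**frac_bits)."""
    n0 = math.floor(math.log2(m)) + 1
    return round((m / 2.0 ** n0) * 2 ** frac_bits), n0

def icn_activation(phi, b_q, m0_fx, n0, z_y, q_bits, frac_bits=15):
    """Integer Channel-Normalization activation: a sketch of Equation 5.

    phi    -- per-channel lists of integer convolution outputs (the PHI term)
    b_q    -- per-channel quantized bias (INT32 on the target MCU)
    m0_fx  -- per-channel fixed-point mantissa from split_scale()
    n0     -- per-channel exponent (assumed n0 <= frac_bits)
    z_y    -- output zero-point Z_y; q_bits -- output precision Q
    """
    out = []
    for c, channel in enumerate(phi):
        shift = frac_bits - n0[c]  # floor-divide by 2**(frac_bits - n0)
        row = []
        for v in channel:
            # floor(M0 * 2**N0 * (PHI + B_q)) via one integer multiply and shift
            y = ((v + b_q[c]) * m0_fx[c]) >> shift
            # clamp to [0, 2**Q - 1], then add the output zero-point
            row.append(z_y + max(0, min(y, 2 ** q_bits - 1)))
        out.append(row)
    return out
```

For instance, a channel with scale M = 0.25 (m0_fx = 16384, n0 = -1 at 15 fractional bits) maps an accumulator value of 100 to floor(0.25 · 100) = 25, while a large accumulator saturates at 2^Q - 1.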
Since the static parameters M_0, N_0, B_q vary along the channel dimension, we name this activation function (Equation 5) the Integer Channel-Normalization activation, indicated as ICN. If the weight parameters get quantized per-channel (PC), i.e. every output channel weight bank has its own S_w and Z_w values, Equation (5) still holds after deriving the B_q, M_0 and N_0 vector parameters accordingly.
 4.1 Memory Requirement
 Table 1 schematizes the memory requirements to compute the transfer function (5), considering both per-layer (PL) and per-channel (PC) quantization with the ICN layer. The table reports the amount of parameters of a convolution operation with a k_w x k_h receptive field, c_I input channels and c_O output channels. The weight parameters are stored in memory as UINT-Q, where Q denotes the number of bits, so that the represented numeric domain corresponds to [0, 2^Q - 1]. Z_x, Z_w and Z_y are in a UINT8 format (Z_w as INT16 if PC is applied), B_q and M_0 are stored as INT32 and N_0 is an INT8 array. For comparison purposes, Table 1 also reports the higher memory requirement of a quantized convolutional layer if using the thresholding method proposed by (Umuroglu & Jahre, 2017; Gao et al., 2018), which exponentially increases with Q.
 5 MEMORY-DRIVEN MIXED LOW PRECISION METHODOLOGY FOR MCU DEPLOYMENT
 To run deep networks on microcontrollers, the memory footprint is a stringent constraint. Given common microcontroller architectures (Zhang et al., 2017), we distinguish:
 • Read-Only (RO) Memory, to store frozen inference parameters, i.e. parameters that will not change during the lifetime of a smart device.
 • Read-Write (RW) Memory, to store temporary values, i.e. the input and output of any quantized convolutional layer that depend on the current sensor data.
 At any step of the inference pass, a pair of temporary activation tensors, i.e. the input and output of a layer, and the whole set of fixed parameters must be present in memory. If considering a network of L stacked quantized convolutional layers and a device with M_RO and M_RW memory budgets (expressed in bytes), the above requirement is translated as:
 sum_{i=0..L-1} ( mem(w_i, Q_w^i) + M_TA^i ) <= M_RO   (6)
 where i indicates the i-th quantized convolutional layer and mem(t, Q) returns the memory footprint of a tensor t with bit precision Q. M_TA^i is the memory footprint of the additional set of the layer's static parameters (see Table 1) with the datatypes detailed in Section 4.1. Concerning activation values:
 mem(x_i, Q_x^i) + mem(y_i, Q_y^i) <= M_RW,   i = 0, .., L-1   (7)
 ensures that the input and output of any block fit the available memory. Our methodology aims at determining the bit precisions Q_w^i, Q_x^i, Q_y^i of the input x_i, output y_i and weight w_i tensors of the i-th layer, to match the memory constraints (6) and (7). Only the values Q = {2, 4, 8} are admissible solutions; Q_x^0 is fixed to 8. Note that y_i ≡ x_(i+1), hence fixing Q_y^i is equivalent to setting Q_x^(i+1). Initially, the bit precision of every tensor is set to Q = 8. Algorithm 1 and Algorithm 2 report the pseudo-code of the procedures to cut the bit precision of, respectively, activations and weights, under the hypothesis that there exists a solution satisfying (6) and (7). The procedure in Algorithm 1 iterates over the L quantized convolutional layers in a forward and backward fashion: the bit precisions of the output tensors Q_y^i ≡ Q_x^(i+1) are cut during the forward pass, reductions of the input tensors' precision Q_x^i ≡ Q_y^(i-1) are applied during the backward pass. Any cut consists of reducing the bit precision by a single step, i.e. from 8 to 4 or from 4 to 2 bits, and it is applied if the number of bits of the intended tensor (output during forward or input during backward) is lower or equal, but with a higher footprint, than the other activation tensor of the i-th layer.
 Algorithm 2 Cut Weights Bits
 Require: a fake-quantized network g of L stacked quantized convolutional layers, a M_RO memory constraint, a Q_w,min minimum quantization level, a delta margin
 Ensure: the bit precision Q_w^i, i = 0, .., L-1 to satisfy (6)
 1: Q_w^i <- 8   (Initialization)
 2: while sum_{i=0..L-1} mem(w_i, Q_w^i) + M_TA^i > M_RO do
 3:   Compute r_i = mem(w_i, Q_w^i) / sum_{i=0..L-1} mem(w_i, Q_w^i) for every layer with Q_w^i > Q_w,min
 4:   Find R = max_i r_i
 5:   Among the layers with r_i > (R - delta), select the k-th with the smallest index i
 6:   Q_w^k is decremented by one step
 7: end while
 Algorithm 2 details the iterative procedure for cutting bits of the weight parameters. At any iteration, a layer score r_i is computed as the ratio between the footprint of the i-th layer and the total occupation. Among the highest scores r_i within a delta margin, the layer with the lowest layer index is selected for the cut. This heuristic rule is intended to balance the quantization level between the central layers and the last layers, which are more subject to aggressive cuts due to their typically higher number of parameters.
 6 EXPERIMENTAL RESULTS
 We run experiments on the MobilenetV1 family networks (Howard et al., 2017) on Imagenet using the PyTorch framework. In the following, any model of the MobilenetV1 family is marked with a label x_y, where x = {128, 160, 192, 224} is the spatial resolution of the input data and y = {0.25, 0.5, 0.75, 1.0} refers to the width channel multiplier. The quantization-aware retraining starts from pre-trained weights 1. Every training session executes on a compute node equipped with 4 NVIDIA Tesla P100 GPUs for 8 hours. ADAM is chosen as optimizer with an initial learning rate of 1e-4, which is decreased in a fixed schedule to 5e-5 and 1e-5 at, respectively, the 5th and 8th epochs. Running statistics and learned parameters of batch-normalization layers are frozen after the first training epoch. Batch size is 128. An asymmetric uniform quantization is applied on weights: the PACT method is used in case of PL quantization while min/max statistics are employed in case of PC quantization. PPQ (Liu & Mattina, 2019) is applied for refining pre-trained weights before the quantization-aware retraining. Folding of batch-normalization parameters into weights, when applied layer-wise, starts from the 2nd training epoch. Activations are quantized with the PACT strategy. The code to reproduce our experiments is open-source 2.
 1 Pretrained weights are downloaded from https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md
 2 https://github.com/mrusci/training-mixed-precision-quantized-networks
 Table 2. Integer-Only MobilenetV1 224_1.0
 Quantization Method | Top1 Accuracy | Weight Memory Footprint
 Full-precision (Jacob et al., 2018) | 70.9% | 16.27 MB
 PL+FB INT8 (Jacob et al., 2018) | 70.1% | 4.06 MB
 PL+FB INT4 (our) | 0.1% | 2.05 MB
 PL+ICN INT4 (our) | 61.75% | 2.10 MB
 PC+ICN INT4 (our) | 66.41% | 2.12 MB
 PC W4A4 (Liu & Mattina, 2019) | 64.3% | -
 PC W4A8 (Krishnamoorthi, 2018) | 65% | -
 PC+Thresholds INT4 (our) | 66.46% | 2.35 MB
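To make the weight-cutting procedure of Algorithm 2 concrete, here is a minimal Python sketch of its greedy rule. It is an illustrative reimplementation under simplifying assumptions (byte-aligned packing in mem(), the static-parameter footprints passed in as constants), not the authors' released code; the function and argument names are ours.

```python
def cut_weight_bits(n_params, m_ta, m_ro, q_min=2, delta=0.05):
    """Rule-based weight bit-width cutting in the spirit of Algorithm 2.

    n_params -- number of weight parameters of each layer
    m_ta     -- per-layer footprint (bytes) of the static ICN parameters
    m_ro     -- read-only memory budget M_RO in bytes
    Bit precisions step down 8 -> 4 -> 2 until constraint (6) holds.
    """
    steps = [8, 4, 2]
    q = [8] * len(n_params)

    def mem(i):
        # byte footprint of layer i's weights at its current precision
        return n_params[i] * q[i] // 8

    layers = range(len(q))
    while sum(mem(i) + m_ta[i] for i in layers) > m_ro:
        total = sum(mem(i) for i in layers)
        # score only the layers that can still be cut (Q_w^i > Q_w,min)
        scores = {i: mem(i) / total for i in layers if q[i] > q_min}
        if not scores:
            raise ValueError("no admissible solution under m_ro")
        r_max = max(scores.values())
        # among layers within delta of the top score, cut the smallest index
        k = min(i for i, r in scores.items() if r > r_max - delta)
        q[k] = steps[steps.index(q[k]) + 1]
    return q
```

Algorithm 1 applies the same single-step cuts to the activation tensors against the M_RW budget, sweeping the layers in a forward and backward fashion instead of scoring them.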
Top1 Accuracy of INT4 integer-only MobilenetV1 models, compared with full-precision, INT8 integer-only and INT4 fake-\r\n quantized models.\r\n Figure 3. Accuracy-latency tradeoff of Mixed-Precision MobilenetV1 networks running on a STM32H7 device with MRO = 2MB and\r\n M =512kB.\r\n RW\r\n To prove the effectiveness of the ICN layers, we quantize based methodology.\r\n weights and activations of every layers of a MobilenetV1 More in details, Figure 2 shows the Top1 accuracy of the\r\n 224 1.0 model to 4 bits and we measure the accuracy family of INT4 integer-only PC+ICN Mobilenets. Com-\r\n achieved in case of integer-only approximation. Table 2 pared with the related INT4 fake-quantized models, using\r\n reports the accuracies for the following con\ufb01gurations: ICN activations results into negligible loss. Only for the\r\n PL+FB stands for per-layer quantization and folding of 160 0.75 case a relevant accuracy drop was observed. To\r\n batch-norm parameters into weights, PL+ICN indicates per- recover it, we found effective to change the datatype of the\r\n layer quantization with ICN layers and PC+ICN refers to quantized bias parameters to Q30.2, hence paying only an\r\n per-channel quantization with ICN layers. First we can note additional shift operation on the accumulator before of the\r\n that only thanks to the proposed ICN layers, the folding of bias addition.\r\n the batch-norm parameters, which causes the collapse of the\r\n training process (PL+FB INT4), can be avoided, therefore After validating the ICN solution, we evaluate our proposed\r\n enablingtheconvergenceofthetrainingalgorithm(PL+ICN memory-driven methodology for the deployment of deep\r\n INT4 and PC+ICN INT4). Secondly, the insertion of the networks on microcontrollers. 
To this end, we apply our mixed-precision technique to all the Mobilenet configurations after setting the memory constraints M_RO = 2MB and M_RW = 512kB, corresponding to the memory characteristics of an STM32H7 device. The trained integer-only models are deployed and benchmarked on the STM32H7 MCU running at 400MHz, to assess the implications for inference implementations. To this aim, we leverage an extended version of the ARM CMSIS-NN (Lai et al., 2018) library, featuring an output-stationary dataflow, and we measure latency in terms of clock cycles. Figure 3 plots the accuracy-latency tradeoff measured on two configurations. MixQ-PL indicates per-layer quantization with either the folding of batch-norm parameters or ICN for layers with Q_y < 8 or Q_w < 8. On the contrary, MixQ-PC-ICN indicates integer-only models with per-channel quantization and ICN as activation layers. Every curve represents a group of Mobilenet models with the same input resolution. Increasing the width multiplier causes a longer latency because of the increasing amount of MAC operations. When applying our mixed-precision method under these memory constraints, Mobilenet models with width multipliers of 0.25 and 0.5, with the exception of 224 0.5, feature no cuts of bit precision. Hence, under the MixQ-PL configuration, these points correspond to the 8 bit integer-only models described in (Jacob et al., 2018).

Figure 4 details the individual tensors' bit precision for the larger MobilenetV1 models after applying memory-driven mixed-precision quantization (Algorithms 1 and 2) under both the MixQ-PL and MixQ-PC-ICN configurations. Models with a higher number of parameters or activations are more affected by the bit-reduction procedure. Typically, the first layers feature large spatial maps but weight tensors with a low number of parameters. On the contrary, the last layers feature small activation tensors but a high number of weight parameters, with the exception of the depthwise layers.

Figure 4. Bit precision (on the radial axes) of weights (blue curves) and activation output tensors (orange curves) for every layer (numbered from 0 to 27) of the MobilenetV1 models with input size 224 or 192, after setting the memory constraints M_RO = 2MB and M_RW = 512kB. Top1 accuracies on Imagenet are reported in the green boxes for Per-Layer (PL) or Per-Channel (PC) mixed-precision quantization.

The Pareto frontiers of Figure 3 are mostly populated by MixQ-PC-ICN configurations. The most accurate model, PC+ICN 224 0.75, scores 68% Top1 accuracy by featuring 4 bit weights on the last pointwise convolutional and linear layers, in addition to Q_y^1, Q_y^2, Q_y^5 = 4, as determined by the memory-driven procedure of Section 5. This score is 8% higher than the most accurate INT8 Mobilenet (192 0.5) fitting into the same device. Note that all the configurations featuring width multiplier 1.0 suffer a dramatic accuracy degradation with respect to the full-precision settings (from 2% to 15%) due to the aggressive quantization required to fit into the memory constraints. On the latency side, the fastest inference model (128 0.25 MixQ-PL), which features a homogeneous 8 bit quantization, runs at 10fps, 20× faster than the most accurate configuration (224 0.75 PC+ICN), but only achieves 43% Top1 accuracy. We observe that the MixQ-PC-ICN quantization introduces a latency overhead of approx. 20% with respect to the MixQ-PL setting, due to the additional subtractions of the Z_w biases within the inner loop of the convolution. On the other hand, MixQ-PC-ICN provides up to 4% higher classification accuracy.

To further test our proposed mixed-precision method, we set the memory constraint to M_RO = 1MB and compare against other mixed-precision methodologies in Table 3.

Table 3. Comparison with state-of-the-art mixed-precision models when M_RO is 1MB

Model                                       Quantization Method   Top1 Accuracy     Memory
MobilenetV1 224 0.5                         MixQ-PC-ICN           62.9%             1MB M_RO + 512kB M_RW
MobilenetV1 192 0.5                         MixQ-PC-ICN           60.2%             1MB M_RO + 256kB M_RW
MobilenetV1 224 0.5 (Jacob et al., 2018)    INT8 PL+FB            60.7%             1.34 MB
MobilenetV1 224 0.25 (Jacob et al., 2018)   INT8 PL+FB            48.0%             0.47 MB
MobilenetV1 (Wang et al., 2018)             MIX not-uniform       57.14% / 67.66%   1.09 / 1.58 MB
MobileNetV2 (Wang et al., 2018)             MIX not-uniform       66.75% / 70.90%   0.95 / 1.38 MB
SqueezeNext (Dong et al., 2019)             MIX not-uniform       68.02%            1.09 MB

Our best models feature up to 7% lower accuracy with respect to (Wang et al., 2018) but, in contrast with this and similar works, we only use integer operations, thanks to the exploited uniform quantization.
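The approx. 20% latency overhead of MixQ-PC-ICN noted above comes from subtracting the per-channel weight zero-points Z_w inside the convolution inner loop: one extra integer operation per MAC, with a subtrahend that changes per output channel. The toy sketch below (plain Python, not CMSIS-NN code; all names are illustrative) shows that extra work on a 1x1 convolution reduced to dot products, together with the algebraic identity that lets the subtraction be hoisted out of the inner loop at the cost of one extra pass over the input.

```python
# Toy illustration (not CMSIS-NN code) of the per-channel zero-point
# subtraction; all names are made up for this sketch.

def mac_with_zero_point(x, w, z_w):
    # Inner loop with one extra subtraction per MAC: acc += x * (w - Z_w).
    return sum(xi * (wi - z_w) for xi, wi in zip(x, w))

def mac_hoisted(x, w, z_w):
    # Algebraically identical: sum(x*w) - Z_w * sum(x); the correction
    # term Z_w * sum(x) needs sum(x) only once per input window.
    return sum(xi * wi for xi, wi in zip(x, w)) - z_w * sum(x)

def pointwise_conv_per_channel(x, w, z_w):
    # 1x1 convolution at one pixel: one dot product per output channel c,
    # each with its own zero-point z_w[c] (the per-channel case).
    return [mac_with_zero_point(x, w[c], z_w[c]) for c in range(len(w))]
```

In the per-layer case a single Z_w (often zero for symmetric weights) applies to the whole tensor, so the correction can be folded away entirely, which is consistent with MixQ-PL showing no such overhead.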
Moreover, our solution features a 2% higher accuracy than INT8 models with a comparable memory footprint that are likewise tailored for integer-only deployments.

7 CONCLUSION

By mixing quantization methodologies, it is possible to execute complex deep neural networks such as MobilenetV1 on memory-constrained MCU edge devices. To pursue this objective, in this work we introduced a mixed-precision quantization technique tailored for memory-constrained microcontroller devices, leveraging the formulation of a quantized activation layer, i.e. the Integer Channel-Normalization activation, to enable sub-byte integer-only deployments. The experimental results show a MobilenetV1 network running on a microcontroller equipped with 2MB of FLASH and 512kB of RAM and featuring a Top1 accuracy of 68%, which is 8% higher than state-of-the-art integer-only 8 bit implementations fitting the same memory constraints.

REFERENCES

Blott, M., Preußer, T. B., Fraser, N. J., Gambardella, G., O'Brien, K., Umuroglu, Y., Leeser, M., and Vissers, K. FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 11(3):16, 2018.

Choi, J., Wang, Z., Venkataramani, S., Chuang, P. I.-J., Srinivasan, V., and Gopalakrishnan, K. PACT: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085, 2018.

Choukroun, Y., Kravchik, E., and Kisilev, P. Low-bit quantization of neural networks for efficient inference. arXiv preprint arXiv:1902.06822, 2019.

Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.

Dong, Z., Yao, Z., Gholami, A., Mahoney, M., and Keutzer, K. HAWQ: Hessian aware quantization of neural networks with mixed-precision. arXiv preprint arXiv:1905.03696, 2019.

Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S. Learned step size quantization. arXiv preprint arXiv:1902.08153, 2019.

Fromm, J., Patel, S., and Philipose, M. Heterogeneous bitwidth binarization in convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 4006–4015, 2018.

Gao, H., Tao, W., Wen, D., Chen, T.-W., Osa, K., and Kato, M. IFQ-Net: Integrated fixed-point quantization networks for embedded vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 607–615, 2018.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., and Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713, 2018.

Jain, S. R., Gural, A., Wu, M., and Dick, C. Trained uniform quantization for accurate and efficient neural network inference on fixed-point hardware. arXiv preprint arXiv:1903.08066, 2019.

Krishnamoorthi, R. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018.

Lai, L., Suda, N., and Chandra, V. CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs. arXiv preprint arXiv:1801.06601, 2018.

Lin, D., Talathi, S., and Annapureddy, S. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pp. 2849–2858, 2016.

Liu, Z.-G. and Mattina, M. Learning low-precision neural networks without straight-through estimator (STE). arXiv preprint arXiv:1903.01061, 2019.

Ma, N., Zhang, X., Zheng, H.-T., and Sun, J. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131, 2018.

Migacz, S. 8-bit inference with TensorRT. In GPU Technology Conference, volume 2, pp. 7, 2017.

Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Springer, 2016.

Rusci, M., Capotondi, A., Conti, F., and Benini, L. Work-in-progress: Quantized NNs as the definitive solution for inference on low-power ARM MCUs? In 2018 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pp. 1–2. IEEE, 2018.

Umuroglu, Y. and Jahre, M. Streamlined deployment for quantized neural networks. arXiv preprint arXiv:1709.04060, 2017.

Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-aware automated quantization. arXiv preprint arXiv:1811.08886, 2018.

Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., Tian, Y., Vajda, P., Jia, Y., and Keutzer, K. FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. arXiv preprint arXiv:1812.03443, 2018.

Zhang, D., Yang, J., Ye, D., and Hua, G. LQ-Nets: Learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 365–382, 2018.

Zhang, Y., Suda, N., Lai, L., and Chandra, V. Hello Edge: Keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128, 2017.