{"title": "Trained Quantization Thresholds for Accurate and Efficient Fixed-Point Inference of Deep Neural Networks", "book": "Proceedings of Machine Learning and Systems", "page_first": 112, "page_last": 128, "abstract": "We propose a method of training quantization thresholds (TQT) for uniform symmetric quantizers using standard backpropagation and gradient descent. Contrary to prior work, we show that a careful analysis of the straight-through estimator for threshold gradients allows for a natural range-precision trade-off leading to better optima. Our quantizers are constrained to use power-of-2 scale-factors and per-tensor scaling of weights and activations to make it amenable for hardware implementations. We present analytical support for the general robustness of our methods and empirically validate them on various CNNs for ImageNet classification. We are able to achieve near-floating-point accuracy on traditionally difficult networks such as MobileNets with less than 5 epochs of quantized (8-bit) retraining. Finally, we present Graffitist, a framework that enables automatic quantization of TensorFlow graphs for TQT.", "full_text": " TRAINED QUANTIZATION THRESHOLDS FOR ACCURATE AND EFFICIENT\r\n FIXED-POINT INFERENCE OF DEEP NEURAL NETWORKS\r\n SambhavR.Jain*1 AlbertGural*2 MichaelWu1 ChrisH.Dick1\r\n ABSTRACT\r\n Weproposeamethodoftrainingquantization thresholds (TQT) for uniform symmetric quantizers using standard\r\n backpropagation and gradient descent. Contrary to prior work, we show that a careful analysis of the straight-\r\n through estimator for threshold gradients allows for a natural range-precision trade-off leading to better optima.\r\n Ourquantizers are constrained to use power-of-2 scale-factors and per-tensor scaling of weights and activations\r\n to make it amenable for hardware implementations. 
We present analytical support for the general robustness of our methods and empirically validate them on various CNNs for ImageNet classification. We are able to achieve near-floating-point accuracy on traditionally difficult networks such as MobileNets with less than 5 epochs of quantized (8-bit) retraining. Finally, we present Graffitist, a framework that enables automatic quantization of TensorFlow graphs for TQT (available at github.com/Xilinx/graffitist).

1 INTRODUCTION

Low-precision quantization (such as uniform quantization between two clipping thresholds) is an important technique enabling low-power and high-throughput DNN inference. However, this reduced precision leads to commensurate reductions in accuracy.

Retraining weights with quantization-in-the-loop is a useful technique to regain some lost accuracy. However, the quantization thresholds are typically fixed after initial calibration, leading to (a) an inability to adapt to changing weight and activation distributions during training, and (b) calibration based on local quantization errors that is agnostic to the final network loss. We address these two issues by treating thresholds as learnable parameters, trained using standard backpropagation and gradient descent. Therefore, during quantized training, (a) our thresholds can be trained simultaneously with the weights, and (b) the gradients are computed on the overall loss, meaning the learned thresholds are more optimal for the network as a whole.

We propose a general method for training quantization thresholds (TQT) using accurate gradients in Section 3. With thresholds that automatically train to achieve a range-precision trade-off, this work enables hardware-amenable per-tensor and power-of-2 scaling constraints with minimal loss in accuracy. We provide an easy-to-implement and fast-convergence training scheme, which trains thresholds in the log-domain with an adaptive optimizer. In Section 4 we present a framework for automatic quantization and retraining of TensorFlow graphs using our methods. We demonstrate that our implementation and hyperparameter recommendations are robust, through experiments in Section 5 and analytical discussion in Appendix B. Finally, we present insights from TQT in Section 6.

*Equal contribution. 1Xilinx Inc., San Jose, California, USA. 2Stanford University, Stanford, California, USA. Correspondence to: Sambhav R. Jain, Albert Gural. Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. Copyright 2020 by the author(s).

2 RELATED WORK

Network quantization became popular with BinaryNet (Courbariaux et al., 2016), which quantized weights and activations to +1 and -1 and trained weights using the straight-through estimator (STE) (Bengio et al., 2013). Other works looked at similar low-bitwidth networks, such as XNOR-Nets (Rastegari et al., 2016), ternary networks (Li et al., 2016; Zhu et al., 2016), and TTQ (Zhu et al., 2016). To achieve higher accuracies, researchers started examining higher-bitwidth quantization such as in DoReFa-Net (Zhou et al., 2016), WRPN (Mishra et al., 2017), HWGQ (Cai et al., 2017), LQ-Nets (Zhang et al., 2018) and QIL (Jung et al., 2018).

More recent work in DNN quantization has focused on practical considerations for hardware implementations, with research advertising one or more of the following: uniform quantization to allow integer arithmetic, per-tensor quantization to increase homogeneity of compute requirements, power-of-2 scale factors to allow scaling with efficient bit-shifts, and symmetric quantization to avoid cross-terms
with each computation arising from a zero-point (Krishnamoorthi, 2018). Work in this area includes NVIDIA's TensorRT (Migacz, 2017), Google's Quantization-Aware Training (QAT) (Jacob et al., 2017; TensorFlow, 2017a), IBM's FAQ (McKinstry et al., 2018), PACT (Choi et al., 2018), NICE (Baskin et al., 2018) and FAT (Goncharenko et al., 2018). TensorRT uses local Kullback-Leibler (KL) divergence minimization to calibrate quantization thresholds and shows good performance for traditional CNNs, but uses floating-point scale-factors and does not explore retraining. FAQ uses percentile initialization to determine clipping thresholds, but does not train them. PACT introduced the idea of training not only the weights but also the clipping parameter α for clipped ReLU using gradient descent and STE:

∂y_q(x; α)/∂α = 0   if x ∈ (−∞, α),
                1   if x ∈ [α, +∞).        (1)

Both QAT and FAT support training quantization thresholds using a gradient similar to (1); likewise NICE trains a clamping parameter c_a, initialized α standard deviations from the mean of the input distribution, using a gradient similar to (1). However, we show in Section 3.5 that these formulations of clipped threshold gradients do not balance range and precision, resulting in poor 8-bit quantization performance for difficult networks such as MobileNets (Howard et al., 2017; Sandler et al., 2018), shown in Table 1.

Table 1. Comparison of MobileNet 8-bit quantization performance between Google's QAT (from Table 4 of (Krishnamoorthi, 2018)) and ours (TQT). Our quantization scheme is strictly more constrained, yet achieves better top-1 accuracy (%) on ImageNet.

Method | Precision | Quantization Scheme                    | Top-1
MobileNet v1 1.0 224
QAT    | FP32      |                                        | 70.9
QAT    | INT8      | per-channel, symmetric, real scaling   | 70.7
QAT    | INT8      | per-tensor, asymmetric, real scaling   | 70.0
TQT    | FP32      |                                        | 71.1
TQT    | INT8      | per-tensor, symmetric, p-of-2 scaling  | 71.1
MobileNet v2 1.0 224
QAT    | FP32      |                                        | 71.9
QAT    | INT8      | per-channel, symmetric, real scaling   | 71.1
QAT    | INT8      | per-tensor, asymmetric, real scaling   | 70.9
TQT    | FP32      |                                        | 71.7
TQT    | INT8      | per-tensor, symmetric, p-of-2 scaling  | 71.8

In contrast, and independently of our work, IBM's LSQ (Esser et al., 2019) found a gradient definition that is similar to ours. However, direct comparisons of our results are not possible due to the large differences between our experiments and applications. For instance, LSQ learns the scale-factors directly, which leads to stability issues, requiring careful fine-tuning of hyperparameters and consequent retraining for 90 epochs. We address this issue in Section 3 with a gradient formulation to train log-thresholds instead, which we show in Appendix B to have better stability guarantees and faster convergence. Secondly, LSQ does not constrain scale-factors to power-of-2 and uses higher precision in the first and last layers to retain performance, incurring additional implementation complexity. Lastly, LSQ does not explore quantization on difficult networks such as MobileNets, which from our experiments are seen to benefit the most from training quantization thresholds.

3 TRAINED QUANTIZATION THRESHOLDS

A simple design choice for a uniform quantizer is one that uses an affine mapping between the real domain r and the quantized domain q, such as

r = s · (q − z)        (2)

where constants s (scale-factor) and z (zero-point) are the quantization parameters. Generally, s is a positive real number, and z is a quantized value that maps to the real zero¹.

¹This formulation satisfies the domain-specific constraint that the real zero be exactly representable (Jacob et al., 2016b; 2017; Krishnamoorthi, 2018).

3.1 Quantizer Constraints

While the affine quantizer allows for a direct mapping from floating-point values to integers (without the need for lookup tables), there is added cost due to special handling of zero-points and real-valued scale-factors, as illustrated in Appendix A. For efficient fixed-point implementations, we constrain our quantization scheme to use:

1. Symmetric: By setting z = 0, the affine quantizer in (2) reduces to a symmetric quantizer:

r = s · q        (3)

Thus we can drop the cross-terms from a matrix multiplication or convolution operation involving zero-points (see Appendix A.1).

2. Per-tensor scaling: All elements in a given weight or activation tensor are quantized using a single scale-factor s. While it is common practice to use per-channel scaling for networks with depthwise convolutions such as MobileNets, we find that per-tensor scaling combined with 8-bit TQT is sufficient.

3. Power-of-2 scaling: Scale-factors are constrained to the form s = 2^(−f) (where f is an integer denoting the fractional length; f can be positive or negative). This enables scaling using simple bit-shifts without the overhead of a fixed-point multiply operation (see Appendix A.2).

3.2 Linear Quantizer - Forward Pass

The quantization function q(x; s) for a tensor x is parameterized only by its scale-factor s, which depends on threshold t and bit-width b of the tensor². q(x; s) performs the following point-wise operations:

²We fix b for each tensor based on the footprint of the fixed-point hardware it maps to (albeit configurable), and allow t (hence s) to be trained with backpropagation.

Scale: Tensor elements are scaled such that the lowest power-of-2 larger than raw threshold t (i.e., 2^⌈log2(t)⌉, where ⌈.⌉ denotes ceil³) is mapped to the largest value supported in the quantized domain (i.e., 2^(b−1) if signed, or 2^b if unsigned). Naturally, elements that fall outside the saturation threshold 2^⌈log2(t)⌉ in either direction would be clipped.

Round: The scaled tensor elements are rounded to nearest integers using bankers rounding (round-half-to-even), denoted by ⌊.⌉. This prevents an overall upward or downward bias, which is known to impact end-to-end inference accuracy.

Saturate: Once scaled and rounded, elements in the tensor that exceed the largest supported value in the quantized domain are clipped: clip(x; n, p) = min(max(x, n), p). Since we apply clipping to the scaled tensor, the clipping limits (n, p) are independent of the real bounds. A signed tensor is clipped to (−2^(b−1), 2^(b−1) − 1) and an unsigned tensor to (0, 2^b − 1).

De-quant: The last step undoes the scaling step. Therefore, we emulate the effect of quantization while retaining the original scale of the input tensor.

Putting together the point-wise operations from above, the quantization function q(x; s) can be formally written as:

q(x; s) := clip(⌊x/s⌉; n, p) · s,        (4)

where n = −2^(b−1), p = 2^(b−1) − 1 and s = 2^⌈log2 t⌉ / 2^(b−1) for signed data; n = 0, p = 2^b − 1 and s = 2^⌈log2 t⌉ / 2^b for unsigned data.

Considering the three cases of how ⌊x/s⌉ compares to n and p, we re-write (4) as:

q(x; s) := ⌊x/s⌉ · s   if n ≤ ⌊x/s⌉ ≤ p,
           n · s       if ⌊x/s⌉ < n,
           p · s       if ⌊x/s⌉ > p.        (5)

3.3 Linear Quantizer - Backward Pass

To train the weights and thresholds of the quantized network with gradient descent, we derive the local gradients of our quantizer q(x; s) with respect to input x and scale-factor s. We carefully use the STE to approximate gradients of round/ceil to 1, without approximating round/ceil to be identity in the backward pass. Specifically, we define ∂⌊x⌉/∂x = ∂⌈x⌉/∂x = 1, but ⌊x⌉ ≠ x and ⌈x⌉ ≠ x.

The local gradient with respect to scale-factor s is:

∇_s q(x; s) := ⌊x/s⌉ − x/s   if n ≤ ⌊x/s⌉ ≤ p,
               n             if ⌊x/s⌉ < n,
               p             if ⌊x/s⌉ > p.        (6)

Noting that ∇_(log2 t) s = s ln(2),

∇_(log2 t) q(x; s) := s ln(2) ·
               ⌊x/s⌉ − x/s   if n ≤ ⌊x/s⌉ ≤ p,
               n             if ⌊x/s⌉ < n,
               p             if ⌊x/s⌉ > p.        (7)

The choice to train thresholds in the log-domain is simple yet effective for various stability reasons discussed in detail in Appendix B.

Similarly, the local gradient with respect to input x is:

∇_x q(x; s) := 1   if n ≤ ⌊x/s⌉ ≤ p,
               0   otherwise.        (8)

3.4 Interpretation of Gradients

To qualitatively understand the role of threshold gradient ∇_(log2 t) q(x; s) and input gradient ∇_x q(x; s) during backpropagation, let us consider the following toy problem: a single quantizer optimized using least-square-error loss L = (q(x; s) − x)²/2. The overall gradients of L are:

∇_(log2 t) L = (q(x; s) − x) · ∇_(log2 t) q(x; s)        (9)

∇_x L = (q(x; s) − x) · (∇_x q(x; s) − 1)        (10)

Figure 1 shows the forward and backward pass transfer curves for our quantizer. As noted, the exact clipping thresholds of x in the real domain are x_n = s · (n − 0.5) and x_p = s · (p + 0.5).

Role of threshold gradients: As seen from the plots of ∇_(log2 t) L vs. x in Figure 2, threshold gradients are positive for x within clipping thresholds (x_n, x_p) and negative otherwise. When most of the input distribution⁴ falls within (x_n, x_p), the cumulative threshold gradient is positive, causing log2 t to decrease⁵. In other words, the limits (x_n, x_p) get pulled inward in favor of larger precision.
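As a concrete reference, the forward pass (4)-(5) and the local gradients (7)-(8) can be written out in a few lines of NumPy. This is a schematic sketch (function and variable names are ours, not from the paper's released code), relying on the fact that np.round implements the same round-half-to-even as the quantizer:

```python
import numpy as np

def tqt_quantize(x, log2_t, b=8, signed=True):
    """Forward pass q(x; s) of Eq. (4): scale, round (half-to-even),
    saturate, de-quant, with a power-of-2 scale-factor from ceil(log2 t)."""
    n = -2.0 ** (b - 1) if signed else 0.0
    p = 2.0 ** (b - 1) - 1.0 if signed else 2.0 ** b - 1.0
    s = 2.0 ** np.ceil(log2_t) / (2.0 ** (b - 1) if signed else 2.0 ** b)
    return np.clip(np.round(x / s), n, p) * s

def tqt_local_grads(x, log2_t, b=8, signed=True):
    """Local gradients of Eq. (7) (w.r.t. log2 t) and Eq. (8) (w.r.t. x),
    with the STE treating round/ceil as pass-through."""
    n = -2.0 ** (b - 1) if signed else 0.0
    p = 2.0 ** (b - 1) - 1.0 if signed else 2.0 ** b - 1.0
    s = 2.0 ** np.ceil(log2_t) / (2.0 ** (b - 1) if signed else 2.0 ** b)
    xs = x / s
    r = np.round(xs)
    inside = (r >= n) & (r <= p)
    # Eq. (7): (r - xs) inside the clipping range, n or p when saturated.
    g_log2t = s * np.log(2.0) * np.where(inside, r - xs, np.where(r < n, n, p))
    # Eq. (8): straight-through inside the range, zero when saturated.
    g_x = inside.astype(x.dtype)
    return g_log2t, g_x
```

For example, with b = 8 (signed) and log2_t = 0 the scale is s = 1/128; an input of 2.0 saturates to 127/128 and receives zero input gradient but a positive threshold gradient, which (through the loss) can push log2 t up to expand the range.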
³The ceil function ensures a power-of-2 scale-factor that is initially biased in the direction of having more elements within the clipping range.
⁴Gaussian in this example, but the analysis holds in general.
⁵From the update rule log2 t := log2 t − α · ∇_(log2 t) L, where α is the learning rate.

Figure 1. Forward pass (blue) and backward pass (red) transfer curves of our quantizer for (a) signed and (b) unsigned data. Local gradients shown in the top rows, and overall gradients of L2-loss in the bottom rows. We pick bit-width b = 3 and raw threshold t = 1.0 in this example.

Similarly, when most of the input distribution falls outside (x_n, x_p), the cumulative threshold gradient is negative, log2 t increases, and the limits (x_n, x_p) get pushed outward in favor of larger dynamic range. This technique is naturally robust to distributions with long tails or outliers, achieving range-precision trade-off through gradient-based optimization.

Figure 2. Trained quantization thresholds move inward (left) or outward (center) to achieve range-precision trade-off. When converged (right), the positive gradients from x within (x_n, x_p) cancel the negative gradients from x outside (x_n, x_p).

Role of input gradients: Using a similar analysis as for threshold gradients, we see that the input gradients ∇_x L are non-zero for values of x that fall outside (x_n, x_p), biased to keep them from getting clipped. This encourages the weight and activation distributions to be tighter.

To summarize, threshold gradients help train optimal thresholds for clipping weights and activations, whereas input gradients nudge the weights and activations to tighter bounds. By simultaneously training clipping thresholds and weights of the quantized network through backpropagation, we adopt a joint (mutual) optimization over a global loss.

3.5 Comparison to Clipped Threshold Gradients

In contrast, certain quantizer implementations define threshold gradients by simply clipping the upstream gradients at the saturation thresholds. For example, TensorFlow's FakeQuant (used for QAT) defines gradients with respect to min/max thresholds as a clip function.

In the forward pass, the FakeQuant operation (TensorFlow, 2016a) is mathematically equivalent to our formulation (except with zero-point), defined as:

q(x; n, p) := ⌊ (clip(x; n, p) − n) / ((p − n)/(2^b − 1)) ⌉ · (p − n)/(2^b − 1) + n,        (11)

However, in the backward pass they treat the round function in (11) to be identity, reducing (11) to a clip function with clipped gradients. That is, gradients with respect to thresholds (n, p) are trivially clipped to zero for x within (n, p), as seen in FakeQuant's transfer curves in Figure 3 and its kernel definition (TensorFlow, 2016b). As a result, the overall gradients only push the limits (n, p) outward, training to the min/max of the input distributions and strictly favoring range over precision. We believe this behavior can be corrected to allow effective range-precision trade-off, as seen in Figure 2 with the toy L2 model, by carefully using the STE such that ∂⌊x⌉/∂x = 1, but ⌊x⌉ ≠ x in the backward pass. While the actual loss landscape is non-trivial, we empirically observe similar qualitative behavior to our toy L2 model, in Section 5.3.

Figure 3. Forward pass (blue) and backward pass (red) transfer curves of TensorFlow's FakeQuant for signed data. Local gradients shown in the top rows, and overall gradients of L2-loss in the bottom rows. We pick bit-width b = 3 and clipping thresholds n = −1.125, p = 0.875 to match with our example.

Another popular clipping-threshold method (applicable to ReLU activations) is PACT, which has similar behavior to TensorFlow's FakeQuant. As seen in (1), the gradient with respect to clipping threshold α takes a value of either 0 or 1 depending on whether the quantizer input x lies to the left or right of α. This results in a tendency of α to train to the max limits of the distribution of x. To combat this tendency, a regularizer on the magnitude of α is applied to the loss function. However, this requires an additional parameter λ_α to be tuned manually and has no awareness of the loss landscape or the quantization bitwidth.

4 FRAMEWORK FOR TQT

We released Graffitist⁶, an end-to-end software stack built on top of TensorFlow, to quantize and retrain deep neural networks (DNNs) using TQT for accurate and efficient inference on fixed-point hardware. Fundamentally, Graffitist is a flexible and scalable framework to process low-level graph descriptions of DNNs, comprising a (growing) library of transforms to implement various neural net optimizations. Each graph transform consists of unique pattern matching and manipulation algorithms that, when run sequentially, produce an optimized output graph. It is still in experimental stages as we continue to add support for more operation types, layer topologies, network styles, graph optimizations, and compression techniques. Graffitist stands on the shoulders of giants and the interface is inspired in part by earlier tools from TensorFlow (TensorFlow, 2016c; 2017a).

⁶Available at github.com/Xilinx/graffitist.

4.1 Graph Optimizations

Graffitist applies several optimizations to the input graph prior to quantization, for example folding batch normalization layers into the weights of preceding convolutional, fully connected, or depthwise convolutional layers. We adopt the following best practices from (Jacob et al., 2017; Krishnamoorthi, 2018; TensorFlow, 2017a): (a) ensure folded batch norms in training and inference graphs are mathematically equivalent (i.e., distributions seen during training match those during inference); (b) apply batch norm corrections for switching between batch and moving-average statistics to reduce jitter in training folded weights due to noisy batch updates; (c) freeze batch norm moving mean and variance updates post convergence for improved accuracy. Other optimizations include collapsing concat-of-concat layers into a single concat, splicing identity nodes not involved in control edges, transforming average-pool layers into depthwise conv layers with a reciprocal⁷ multiplier as weights, and explicitly merging input scales for scale-preserving ops such as concat, bias-add, eltwise-add, and maximum (for leaky relu).

⁷Reciprocal being 1/F² where F is the kernel size.

4.2 Quantization Modes

Graffitist allows for quantization in either static or retrain modes.

Static Mode. Quantization thresholds (hence scale-factors) are determined based on statistics of weights and activations derived from a calibration dataset. Specifically, weight thresholds (per-tensor) are set to the maximum absolute value (Table 2), and activation thresholds (per-tensor) are chosen so as to minimize the symmetric Kullback-Leibler-J distance (D'Alberto & Dasdan, 2009) for each quantization layer locally. This is done in a strictly topological order to ensure inputs to a layer are quantized (and fixed) prior to quantizing the current layer. The entire optimization and calibration process is automated and only requires a single API call to Graffitist.

Table 2. Threshold initialization scheme using MAX or 3SD initialization for weights and KL-J distance calibrated for activations.

Mode    | Trained | Weights | Activations
Static  | -       | MAX     | KL-J
Retrain | wt      | MAX     | KL-J
Retrain | wt,th   | 3SD     | KL-J

Retrain Mode. Quantization thresholds and weights are simultaneously trained on a global loss. Recovery is achieved within 5 epochs of TQT retraining. This requires two separate API calls to Graffitist - first to generate a quantized training graph that can be trained with native TensorFlow on GPU, and second to generate an equivalent quantized inference graph that accurately models the target fixed-point implementation. The benefit of a hardware-accurate inference graph is twofold: (i) much before deployment, one can quickly validate the inference accuracy of the quantized network using CPU/GPU, and (ii) scale-factors and quantized weights from TQT can be ported directly onto the target of choice. On tests across several networks, we found that our inference graphs run on the CPU were bit-accurate to our fixed-point implementation on the FPGA.

4.3 Layer Precisions

While Graffitist supports configurable bit-widths for weights and activations, for the scope of this paper we use two modes: INT8 with 8/8 (W/A) and INT4 with 4/8 (W/A). The choice of 4/8 as opposed to 4/4 is primarily guided by the availability of 4x8 multipliers; even in the absence of this, the INT4 mode still allows for 50% weight compression (double packing weights per byte) and a reduced memory footprint for fetching weights. The internal precisions for different layer topologies are defined below. Quantization layers marked as q′ indicate that their scale-factors are explicitly merged / shared. To avoid double quantization, input tensors are assumed to be already quantized by the previous layer, with the exception of the primary input (placeholder), which is explicitly quantized.

• Compute layers (e.g., conv, matmul, depthwise conv) are quantized as:

q8( q′16( Σ (q8/4(w) · q8(x)) + q′16(b) ) ),

where x is the input tensor, w is the weight tensor, and b is the bias tensor. If followed by a ReLU or ReLU6 activation function, the last q8() stage is delayed until after ReLU/ReLU6, and uses an unsigned datatype to utilize the extra sign bit.

• Eltwise-add layer is quantized as:

q8( q′8(x) + q′8(y) ),

where x and y are the input tensors. Similar to the compute-layer case, the last q8() stage is delayed and uses an unsigned datatype if followed by ReLU/ReLU6.

• Leaky ReLU is quantized as:

q8( max( q′16(x), q′16( q16(α) · q′16(x) ) ) ),

where x is the input tensor, and α is the slope of the activation function for negative inputs. The last q8() stage on the previous compute layer is skipped when it is followed by Leaky ReLU. Instead, a q16() stage is used to retain high internal precision for the α-multiply op.

• Average pool is quantized as:

q8( Σ (q8(r) · q8(x)) ),

where x is the input tensor, and r is the reciprocal.

Figure 4. Illustration of the unfused quantization layer using the STE on threshold and input gradient paths. During backpropagation, the round and ceil functions are hidden by tf.stop_gradient.
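To make the power-of-2 constraint concrete: with s = 2^(−f), the re-quantization between a compute layer's wide accumulator and its 8-bit output is a pure bit-shift. Below is a schematic NumPy emulation of the compute-layer scheme in Section 4.3 (helper names and the fraction-length arguments are ours; bias, ReLU, and the intermediate q′16 stage are omitted for brevity):

```python
import numpy as np

def quantize_fixed(x, f, b=8):
    """Quantize to signed b-bit fixed-point with power-of-2 scale s = 2^-f."""
    n, p = -2 ** (b - 1), 2 ** (b - 1) - 1
    return np.clip(np.round(x * 2.0 ** f), n, p).astype(np.int64)

def int8_matmul(x, w, fx, fw, fy, b=8):
    """Emulate q8(sum(q8(w) * q8(x))): int8 operands, exact integer
    accumulation at scale 2^-(fx+fw), then a single power-of-2 rescale
    (a right-shift in hardware) down to the output format 2^-fy."""
    acc = quantize_fixed(x, fx, b) @ quantize_fixed(w, fw, b)
    shift = fx + fw - fy                              # bit-shift amount
    n, p = -2 ** (b - 1), 2 ** (b - 1) - 1
    yq = np.clip(np.round(acc / 2.0 ** shift), n, p)  # re-quantize
    return yq * 2.0 ** (-fy)                          # de-quantize
```

With fx = fw = fy = 6, an input of 0.5 times a weight of 0.5 accumulates to 1024 at scale 2^-12 and shifts right by 6, giving exactly 0.25; no fixed-point multiply is needed for the rescale.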
• Concat is not quantized because the input scales are merged explicitly, and hence it is lossless:

concat( q′8(x), q′8(y), q′8(z) ),

where x, y, and z are input tensors.

4.4 Fused Kernel Implementation

The quantization layer defined in (4) and (6) may be trivially implemented using native TensorFlow ops and tf.stop_gradient as depicted in Figure 4. However, this low-level implementation has a large memory footprint during training due to the need for storing intermediate tensors for gradient computation in the backward pass. This impacts the maximum batch size that can fit on a single GPU. To overcome this, Graffitist is packaged with fused quantization kernels that are pre-compiled for CPU/GPU. The fused implementation is efficient, helps avoid memory overhead, and allows training using larger batch sizes compared to the native implementation.

5 EXPERIMENTS

We evaluate TQT on variants of five classes of CNNs trained and validated on the ImageNet (ILSVRC14) classification dataset (Russakovsky et al., 2015). The networks include VGG {16, 19} (Simonyan & Zisserman, 2014), Inception v{1, 2, 3, 4} (Szegedy et al., 2014; Ioffe & Szegedy, 2015; Szegedy et al., 2015; 2016), ResNet v1 {50, 101, 152} (He et al., 2015), MobileNet v{1, 2} 1.0 224 (Howard et al., 2017; Sandler et al., 2018), and DarkNet 19 (Redmon & Farhadi, 2016). We obtained the models, pre-trained weights (FP32) and pre-processing for each of these networks from the TF-Slim model zoo (TensorFlow, 2017b), except for DarkNet 19, which was converted to TensorFlow using DW2TF (Hao & Jain, 2018).

We are interested in a scalable and production-ready approach to INT8/INT4 quantization that maps well onto generic fixed-point hardware. While our simplifying constraints (from Section 3.1) may not be ideal for lower bit-widths, the fundamentals of TQT are more generally applicable even without these constraints. To limit the scope of this paper to the least-common-denominator fixed-point quantization, we do not make comparisons with other state-of-the-art low-bitwidth quantization schemes. Instead we draw comparisons of TQT (wt+th) retraining to static quantization and wt-only retraining. We can derive many interesting insights from this analysis.

5.1 Threshold Initializations

Calibration sets are prepared for each network using a batch of 50 unlabeled images, randomly sampled from the validation set, with applied pre-processing. This is used for initializing the thresholds in both static and retrain modes. When thresholds are not trained, they are initialized to MAX for weights, and KL-J distance calibrated for activations. However, when training thresholds, we find it useful to initialize the weight thresholds based on n standard deviations or a percentile of the weight distribution rather than MAX. Table 2 summarizes the threshold initialization scheme we used for all our experiments.

5.2 Implementation Details

Before exporting the models to TensorFlow protocol buffers (.pb) for Graffitist to absorb, we make the following synthetic modifications: (i) replace tf.reduce_mean with tf.nn.avg_pool (if any), (ii) remove auxiliary logit layers (if any), and (iii) remove dropouts (if any). Additionally, we disable data augmentation (e.g., random flip / crop) during retraining. These modifications are done keeping in mind that TQT focuses primarily on learning thresholds through backpropagation, while allowing previously trained weights to be fine-tuned using a relatively small learning rate. As expected, most of the recovery is achieved within a fraction of an epoch due to thresholds converging, and the rest of it (up to 5 epochs) is just weights adjusting to the new thresholds. Because the overall training steps required with TQT are so few compared to from-scratch training, and pre-trained weight distributions are not allowed to change wildly (overfit), we find it best to disable data augmentation and dropout regularization.

Based on the stability analysis and hyperparameter recommendations in Appendix B.2 and B.3, we use the Adam optimizer with parameters β1 = 0.9 and β2 = 0.999 for training thresholds and weights in all our experiments. The initial learning rate is set to 1e-2 for thresholds and 1e-6 for weights. Learning rates are decayed exponentially (with staircase enabled) by a factor of 0.94 every 3000 · (24/N) steps for weights and by a factor of 0.5 every 1000 · (24/N) steps for thresholds, where N is the batch size. We use a batch size of 24 for all networks except for ResNet v1 152 and Inception v4, for which a batch of 16 is used. Softmax cross-entropy loss is used to compute quantization threshold gradients, and this loss, together with weight regularization (if any), is used to compute weight gradients. Batch norm moving means and variances are frozen after 1 epoch.

In Appendix B.3, we discussed the post-convergence oscillations of thresholds around the critical integer threshold log2 t* due to our power-of-2 scaling constraint. When thresholds cross this integer level, it can change the distributions of downstream activations, requiring weights and thresholds of the following layers to adapt to it. To minimize this effect, we incrementally freeze thresholds starting at 1000 · (24/N) steps, once every 50 steps, in order of increasing absolute gradient magnitude, if they are on the correct side of log2 t* (determined using an EMA). This is automatically handled by the training scripts packaged with Graffitist.

5.3 Results

Table 3 reports the single-crop ImageNet validation accuracy for 12 networks. Default image sizes are used: 299×299 for Inception v{3, 4}, 256×256 for Darknet 19, and 224×224 for all other networks. Standard pre-processing for each network is applied to center crop, resize, and normalize the input data. The different trials include a pre-trained FP32 baseline, a static INT8 run, and 4 retrain runs - FP32 wt-only, INT8 wt-only, INT8 wt+th and INT4 wt+th. Here, INT8 is 8/8 (W/A) and INT4 is 4/8 (W/A). FP32 baseline numbers are reported as validated on our end. For an unbiased comparison, we train the FP32 weights using the same procedure (optimizers, learning rates, decay, BN freeze, etc.) as with our quantized weight retraining. This FP32 wt-only retraining serves as a fair baseline to our INT8 and INT4 retrain results. That said, we do not use the retrained FP32 weights to initialize any of our INT8/INT4 retraining runs; they always start from pre-trained FP32 weights. This is done to keep the overhead of retraining to a minimum.

6 DISCUSSION

The validation accuracy and epoch count corresponding to the best checkpoint are noted in Table 3. As we see, all the networks converge within 5 epochs. Variance on the reported accuracy stems from a few sources (in decreasing order): (a) best rather than mean validation (our findings in Appendix D suggest this variance is within 0.2%), (b) non-determinism due to inexact floating-point math (empirically within 0.1%), (c) rounding to one decimal (bounded by 0.05%). Keeping these variance bounds on accuracy in mind, we can draw interesting insights into the benefits of TQT.

6.1 Insights from TQT

Our experiments demonstrate floating-point accuracy for 8-bit quantization and near-floating-point accuracy for 4-bit quantization for most networks. We see that static quantization incurs a higher loss than retrained methods. This is expected because (a) weights are not trained to adapt to the quantized network, and (b) quantization thresholds are picked using local statistics instead of being optimized on a global loss. For networks that are easier to quantize to INT8 (e.g., VGGs, Inceptions, ResNets), we find that retraining weights alone while fixing thresholds to their pre-calibrated values (based on Table 2) is sufficient. In such cases, TQT (wt+th) retraining shows no added benefit. However, for networks known to be difficult to quantize (e.g., MobileNets, DarkNets), TQT (wt+th) retraining yields up to 4% higher top-1 accuracy compared to wt-only training for INT8, and can match FP32 accuracy even with per-tensor, uniform symmetric, power-of-2 scaling constraints. This demonstrates the range-precision trade-off through trained thresholds in action. For lower precisions such as INT4,

Table 3. Quantization accuracy achieved on different ImageNet CNNs for static quantization, weight-only quantized retraining, and weight+threshold quantized retraining (TQT). Training is run until validation accuracy plateaus (max 5 epochs).
We also compare to\r\n \ufb02oating-point retraining to isolate the impact of our quantization methods from our training setup.\r\n Mode Precision Bit-width Accuracy (%) Epochs Mode Precision Bit-width Accuracy (%) Epochs\r\n (W/A) top-1 top-5 (W/A) top-1 top-5\r\n VGG16 MobileNet v1 1.0 224\r\n FP32 32/32 70.9 89.8 FP32 32/32 71.0 90.0\r\n Static INT8 8/8 70.4 89.7 Static INT8 8/8 0.6 3.6\r\n wt FP32 32/32 71.9 90.5 1.0 wt FP32 32/32 71.1 90.0 3.4\r\n wt INT8 8/8 71.8 90.5 1.0 wt INT8 8/8 67.0 87.9 4.6\r\n Retrainwt,thINT8 8/8 71.7 90.4 0.9 Retrainwt,thINT8 8/8 71.1 90.0 2.1\r\n wt,th INT4 4/8 71.5 90.3 4.0 wt,th INT4 4/8 \u2013 \u2013\r\n VGG19 MobileNet v2 1.0 224\r\n FP32 32/32 71.0 89.8 FP32 32/32 70.1 89.5\r\n Static INT8 8/8 70.4 89.7 Static INT8 8/8 0.3 1.2\r\n wt FP32 32/32 71.8 90.4 1.0 wt FP32 32/32 71.7 90.7 3.2\r\n wt INT8 8/8 71.7 90.4 1.0 wt INT8 8/8 68.2 89.0 2.7\r\n Retrainwt,thINT8 8/8 71.7 90.4 1.0 Retrainwt,thINT8 8/8 71.8 90.6 2.2\r\n wt,th INT4 4/8 71.2 90.1 2.0 wt,th INT4 4/8 \u2013 \u2013\r\n Inception v1 DarkNet19\r\n FP32 32/32 69.8 89.6 FP32 32/32 73.0 91.4\r\n Static INT8 8/8 68.6 88.9 Static INT8 8/8 68.7 89.7\r\n wt FP32 32/32 70.3 90.0 2.8 wt FP32 32/32 74.4 92.3 3.1\r\n wt INT8 8/8 70.6 90.3 3.5 wt INT8 8/8 72.9 91.6 3.8\r\n Retrainwt,thINT8 8/8 70.7 90.2 2.4 Retrainwt,thINT8 8/8 74.5 92.3 1.8\r\n wt,th INT4 4/8 67.2 88.2 4.0 wt,th INT4 4/8 73.2 91.6 2.8\r\n Inception v2 ResNetv150\r\n FP32 32/32 74.0 91.8 FP32 32/32 75.2 92.2\r\n Static INT8 8/8 73.1 91.3 Static INT8 8/8 74.3 91.7\r\n wt FP32 32/32 74.3 92.2 3.3 wt FP32 32/32 75.4 92.5 3.7\r\n wt INT8 8/8 74.4 92.3 4.7 wt INT8 8/8 75.3 92.3 1.0\r\n Retrainwt,thINT8 8/8 74.4 92.4 2.5 Retrainwt,thINT8 8/8 75.4 92.3 1.9\r\n wt,th INT4 4/8 71.9 90.8 4.8 wt,th INT4 4/8 74.4 91.7 2.0\r\n Inception v3 ResNetv1101\r\n FP32 32/32 78.0 93.9 FP32 32/32 76.4 92.9\r\n Static INT8 8/8 76.8 93.3 Static INT8 8/8 74.8 92.0\r\n wt FP32 32/32 78.3 94.2 2.1 wt FP32 32/32 76.6 93.2 1.2\r\n wt INT8 8/8 
78.2 94.1 2.0 wt INT8 8/8 76.3 93.0 1.0\r\n Retrainwt,thINT8 8/8 78.3 94.3 1.2 Retrainwt,thINT8 8/8 76.4 93.1 0.9\r\n wt,th INT4 4/8 76.4 93.1 4.4 wt,th INT4 4/8 75.7 92.5 2.0\r\n Inception v4 ResNetv1152\r\n FP32 32/32 80.2 95.2 FP32 32/32 76.8 93.2\r\n Static INT8 8/8 79.4 94.6 Static INT8 8/8 76.2 93.0\r\n wt FP32 32/32 80.2 95.2 0.0 wt FP32 32/32 76.8 93.3 1.0\r\n wt INT8 8/8 80.1 95.3 1.7 wt INT8 8/8 76.7 93.3 1.5\r\n Retrainwt,thINT8 8/8 80.1 95.2 1.5 Retrainwt,thINT8 8/8 76.7 93.3 1.4\r\n wt,th INT4 4/8 78.9 94.7 4.2 wt,th INT4 4/8 76.0 93.0 1.9\r\n we\ufb01ndthatwt-onlytraining does not recover, and so TQT 6.2 MobileNet Comparisons\r\n (wt+th) retraining is necessary. The INT4 accuracy falls For more dif\ufb01cult networks such as MobileNets, it is well\r\n short of FP32, and we believe this maybe due to (a) our knownthat symmetric, per-tensor quantization done post-\r\n quantization constraints in Section 3.1, and (b) the \ufb01rst/last training or through calibrate-only methods is detrimental\r\n layers not retaining full precision8.\r\n (Krishnamoorthi, 2018; Goncharenko et al., 2018). We be-\r\n 8Wequantize \ufb01rst/last layers to a minimum of INT8, so that lieve this is true, in particular due to the use of depthwise\r\n they can be mapped on the same \ufb01xed-point hardware used for convolutions with irregular weight distributions and widely\r\n other layers. varying ranges between channels. With wt-only retraining\r\n weareonlyable to recover to within 4% of \ufb02oating-point\r\n accuracy. However, with TQT (wt+th) retraining, our re-\r\n Trained Quantization Thresholds (TQT)\r\n Figure 5. Selected weight and activation distributions of MobileNet v1 before (black) and after (red) quantized TQT (wt+th) retraining\r\n for thresholds that changed by non-zero integer amount in log-domain. Initial thresholds (cyan) and trained thresholds (blue) are also\r\n plotted. These are the raw thresholds t. 
Also indicated above each plot are bit-width b and threshold deviation d := \u2206\u2308log2 t\u2309 for the\r\n quantized layer. A positive deviation indicates preference for range over precision, and a negative deviation indicates otherwise. We note\r\n that depthwise convolutions\u2019 weights have unique threshold training behavior with a strong preference for precision compared to range.\r\n sults for 8-bit are the highest we have seen using symmetric,\r\n power-of-2 scaled, per-tensor quantization, even matching\r\n \ufb02oating-point accuracy with no loss. We draw a few compar-\r\n isons with Google\u2019s QAT results for MobileNets in Table 1\r\n and observe that we incur no loss with INT8 quantization\r\n even with stricter constraints. We believe this is due to the\r\n fact that our threshold gradient formulation is in fact able to\r\n balance range-precision effectively.\r\n In Figure 5 we analyze the retrained distributions for a\r\n few quantized layers in MobileNet v1, highlighting the im-\r\n (a) INT8 portance of range-precision trade-off. As seen with the\r\n depthwise convolutional layers\u2019 weights, the trained thresh-\r\n olds move-in from their initialized values by up to 3 integer\r\n bins in the log-domain, favoring precision over dynamic\r\n range. For some other layers, the thresholds move-out from\r\n their initialized values, favoring range over precision. For\r\n more such layers with non-zero threshold deviations, see\r\n Figure 10 in Appendix.\r\n (b) INT4 Figure 6 shows a histogram of deviations of trained thresh-\r\n olds for different networks under 8-bit and 4-bit quantized\r\n Figure 6. Threshold deviations during TQT training. For each retraining. We \ufb01nd that larger positive deviations are seen\r\n network, the left plot shows the value of each of the thresholds over in the 8-bit case compared to the 4-bit case. 
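To make this range-precision trade-off concrete, here is a minimal NumPy sketch of a uniform symmetric quantizer with a power-of-2 scale-factor in the spirit of our constraints (a toy model with hypothetical values, not Graffitist's implementation):

```python
import numpy as np

def quantize(x, log2_t, bits=8):
    """Toy uniform symmetric quantizer with a power-of-2 scale-factor.

    The threshold t is kept in log-domain; ceil() snaps the scale to
    the power of 2 covering the clipping range.
    """
    n = 2 ** (bits - 1)                      # e.g. 128 for INT8
    s = 2.0 ** np.ceil(log2_t) / n           # power-of-2 scale-factor
    q = np.clip(np.round(x / s), -n, n - 1)  # round, then saturate
    return q * s                             # dequantized value

x = np.array([-1.3, -0.02, 0.5, 2.7])
# Raising the threshold widens the representable range but coarsens
# the step size; lowering it does the opposite (range vs. precision).
wide = quantize(x, log2_t=2.0)     # t = 4:   step 1/32, covers 2.7
narrow = quantize(x, log2_t=-1.0)  # t = 0.5: step 1/256, clips 2.7
```

A trained deviation d = Δ⌈log2 t⌉ simply shifts s by powers of 2: each positive integer bin doubles the step size in exchange for dynamic range, and each negative bin halves it.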
This intuitively makes sense: the method favors range when more bits of precision are available, but cuts back on range when only a few bits of precision are available.

7 CONCLUSION

In Section 3, we proposed a general method for training quantization thresholds (TQT), amenable to most generic fixed-point hardware by constraining our method to uniform, symmetric, power-of-2 scaled, per-tensor quantization. We showed that our quantizer's gradient formulation allows a unique range-precision trade-off, essential for high-accuracy quantized networks. We demonstrated a robust, fast-convergence training scheme for TQT utilizing log-domain threshold training with an adaptive optimizer. In Section 4, we presented Graffitist, a framework for automatic quantization and retraining of TensorFlow graphs with our methods. In Section 5, we empirically validated our methods on a suite of standard CNNs trained on ImageNet. Finally, in Section 6, we provided insights from TQT and state-of-the-art results for 8-bit MobileNet quantization.

Our work and results demonstrate the effectiveness of our techniques for high-accuracy quantization of neural networks for fixed-point inference. While our work covers a major use case for quantization, there are many other quantization flavors we could explore in future work. For example, it would be useful to see how well the techniques we designed for strict power-of-2 scaling generalize to non-power-of-2 scale-factors. Additional relaxations of our constraints we could explore include per-channel rather than per-tensor quantization, which could potentially allow more aggressive bit-widths on difficult networks like MobileNets, and non-symmetric or even non-uniform quantization schemes, where threshold training via backpropagation and gradient descent has been tried with mild success. We would not be surprised to see our methods and analysis techniques have broader applicability to more general classes of quantizers and to problems beyond ImageNet.

REFERENCES

Baskin, C., Liss, N., Chai, Y., Zheltonozhskii, E., Schwartz, E., Giryes, R., Mendelson, A., and Bronstein, A. M. NICE: Noise injection and clamping estimation for neural network quantization. arXiv preprint arXiv:1810.00162, 2018.
Bengio, Y., Léonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
Cai, Z., He, X., Sun, J., and Vasconcelos, N. Deep learning with low precision by half-wave Gaussian quantization. arXiv preprint arXiv:1702.00953, 2017.
Choi, J., Wang, Z., Venkataramani, S., Chuang, P. I.-J., Srinivasan, V., and Gopalakrishnan, K. PACT: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085, 2018.
Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
D'Alberto, P. and Dasdan, A. Non-parametric information-theoretic measures of one-dimensional distribution functions from continuous time series. In Proceedings of the 2009 SIAM International Conference on Data Mining, pp. 685-696. SIAM, 2009.
Esser, S. K., McKinstry, J. L., Bablani, D., Appuswamy, R., and Modha, D. S. Learned step size quantization. arXiv preprint arXiv:1902.08153, 2019.
Goncharenko, A., Denisov, A., Alyamkin, S., and Terentev, E. Fast adjustable threshold for uniform neural network quantization. arXiv preprint arXiv:1812.07872, 2018.
Hao, Y. and Jain, S. R. Darknet to TensorFlow (DW2TF). https://github.com/jinyu121/DW2TF/releases/tag/v1.2, 2018.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
Hinton, G., Srivastava, N., and Swersky, K. Lecture 6a: Overview of mini-batch gradient descent. Coursera lecture slides, https://class.coursera.org/neuralnets-2012-001/lecture, 2012.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., and Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. arXiv preprint arXiv:1712.05877, 2017.
Jacob, B. et al. Gemmlowp: Efficient handling of offsets. https://github.com/google/gemmlowp/blob/master/doc/low-precision.md#efficient-handling-of-offsets, 2016a.
Jacob, B. et al. Gemmlowp: Building a quantization paradigm from first principles. https://github.com/google/gemmlowp/blob/master/doc/quantization.md, 2016b.
Jung, S., Son, C., Lee, S., Son, J., Kwak, Y., Han, J.-J., Hwang, S. J., and Choi, C. Learning to quantize deep networks by optimizing quantization intervals with task loss. arXiv preprint arXiv:1808.05779, 2018.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Krishnamoorthi, R. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018.
Li, F., Zhang, B., and Liu, B. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
McKinstry, J. L., Esser, S. K., Appuswamy, R., Bablani, D., Arthur, J. V., Yildiz, I. B., and Modha, D. S. Discovering low-precision networks close to full-precision networks for efficient embedded inference. arXiv preprint arXiv:1809.04191, 2018.
Migacz, S. 8-bit inference with TensorRT. In GPU Technology Conference, 2017.
Mishra, A., Nurvitadhi, E., Cook, J. J., and Marr, D. WRPN: Wide reduced-precision networks. arXiv preprint arXiv:1709.01134, 2017.
Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. arXiv preprint arXiv:1603.05279, 2016.
Redmon, J. and Farhadi, A. YOLO9000: Better, faster, stronger. arXiv preprint arXiv:1612.08242, 2016.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015. doi: 10.1007/s11263-015-0816-y.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. arXiv preprint arXiv:1801.04381, 2018.
Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the Inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016.
TensorFlow. FakeQuant API. https://www.tensorflow.org/versions/r1.13/api_docs/python/tf/quantization/fake_quant_with_min_max_vars, 2016a.
TensorFlow. FakeQuant threshold gradients. https://github.com/tensorflow/tensorflow/blob/v1.13.1/tensorflow/core/kernels/fake_quant_ops_functor.h#L179-L187, 2016b.
TensorFlow. Graph transform tool. https://github.com/tensorflow/tensorflow/blob/v1.13.1/tensorflow/tools/graph_transforms/README.md, 2016c.
TensorFlow. Quantization-aware training. https://github.com/tensorflow/tensorflow/blob/v1.13.1/tensorflow/contrib/quantize/README.md, 2017a.
TensorFlow. TF-Slim pre-trained models. https://github.com/tensorflow/models/blob/v1.13.0/research/slim/README.md#pre-trained-models, 2017b.
Zhang, D., Yang, J., Ye, D., and Hua, G. LQ-Nets: Learned quantization for highly accurate and compact deep neural networks. arXiv preprint arXiv:1807.10029, 2018.
Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
Zhu, C., Han, S., Mao, H., and Dally, W. J. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.

A COST OF AFFINE QUANTIZER

A.1 Cross-terms due to zero-points

Consider two real numbers r1 and r2 and their product r3 = r1 · r2. Using the affine mapping from (2) to represent this, we get:
If this happens even once, the network as\r\n this, we get: a whole will break. An easy solution is to train log t as\r\n 2\r\n s (q \u2212z ) = s (q \u2212z )\u00b7s (q \u2212z ), (12) opposedtotitself, since its domain is log2 t \u2208 R. Using log\r\n 3 3 3 1 1 1 2 2 2 thresholds is convenient because it already appears in the\r\n which can be expressed as expression for s(t). However, the most important bene\ufb01t\r\n s1s2\u0002 \u0003 is described in Section B.2, where the log representation\r\n q =z + q q \u2212q z \u2212q z +z z . (13)\r\n 3 3 s 1 2 1 2 2 1 1 2 makesensuring scale invariance very easy.\r\n 3\r\n The cross-terms in (13) add complexity and often require B.2 Scale Invariance\r\n special handling to remain ef\ufb01cient. While the added cost\r\n can be amortized over several accumulations of a matrix For a given input distribution we prefer that the threshold\r\n multiplicationorconvolutionoperation,itwouldstillrequire gradients have similar magnitudes regardless of the position\r\n 9 of the threshold itself. This threshold scale invariance is\r\n optimizations , both algorithmic and kernel-level.\r\n Byeliminating zero-points, the cross-terms vanish and the useful for making sure training is not too slow when the\r\n operation simpli\ufb01es to: thresholds are far from their optimal values. Similarly, the\r\n \u0002 \u0003 properties of our threshold gradients should not depend on\r\n s s\r\n q = 1 2 q q . (14) the scale of the input distribution. This input scale invari-\r\n 3 s 1 2\r\n 3 ance is important because it ensures that quantized training\r\n A.2 Real-valued scale-factors behaves the same way for the different weights and activa-\r\n tions in the network, even if the variance of their distribu-\r\n With positive real scale-factors, the constant multiplier tions vary over many orders of magnitude.\r\n s s /s in (14), empirically found to be in the interval\r\n 1 2 3 Unfortunately, neither of these scale invariances hold. 
Far\r\n (0, 1) (Jacob et al., 2017), can be expressed in the normal- from improving, Figure 7 shows that in moving from raw\r\n \u2212n\r\n ized form 2 s0 where n is a non-negative integer and s0 threshold training (left) to log threshold training (middle),\r\n is in the interval [0.5, 1). In other words, the accumulator both scale invariance properties of the threshold gradients\r\n (storing q1q2) needs to be scaled by a \ufb01xed-point multi- actually degrade.\r\n plier that approximates s0 and right-shifted by n bits (with\r\n round-to-nearest): Raw Grad \u2207tL Log Grad \u2207log2tL Desired Log Grad \u2207log2tL\r\n 106 \u22122 106 106\r\n \u0002 \u0003 \u03c3 = 10\r\n \u2212n 103 \u03c3 = 10\u22121 103 103\r\n q =2 s q q . (15) \u03c3 = 10+0\r\n 3 0 1 2 100 \u03c3 = 10+1 100 100\r\n \u03c3 = 10+2\r\n 0 0 0\r\n However, by constraining scale-factors s ,s ,s to strict \u2212100 \u2212100 \u2212100\r\n 1 2 3 \u2212103 \u2212103 \u2212103\r\n power-of-2, the scaling operation reduces to a rather simple \u2212106 \u2212106 \u2212106\r\n \u221210 \u22125 0 5 10 \u221210 \u22125 0 5 10 \u221210 \u22125 0 5 10\r\n bit-shift (with round-to-nearest): log2t\r\n \u2212f\u0002 \u0003\r\n q =2 q q . (16)\r\n 3 1 2\r\n Figure 7. Gradients of L -loss with respect to raw threshold\r\n B LOGTHRESHOLDTRAINING 2\r\n (left) or log threshold (middle, right) versus log threshold, for\r\n Initially, it may seem that with the de\ufb01nition of a gradi- Gaussian(\u03c3) inputs of varying \u03c3. Desired (normed) gradients for\r\n ent with respect to the raw threshold, backpropagation and the log threshold case are shown on the right.\r\n gradient descent could be immediately used to train it. 
How- Threshold scale invariance: Updates to the log threshold\r\n ever, just as training weights in a vanilla neural network wouldbethreshold scale invariant if the gradients on both\r\n requires care in the choice of optimizer and learning rate, sides of the negative-to-positive jump were \ufb02at, as seen in\r\n here too care must be taken to ensure training stability and the right plot of Figure 7. However, this is not the case for\r\n convergence. There are three main properties we would log threshold gradients (center plot of Figure 7). On the\r\n like our training procedure to satisfy: numerical stability, left-of-jump side, as log t decreases, gradients of (hence\r\n scale invariance, and convergence. We discuss each of these 2\r\n issues and the engineering tweaks used to solve them here.\r\n 9Some of which are covered in (Jacob et al., 2016a; 2017;\r\n Krishnamoorthi, 2018).\r\n Trained Quantization Thresholds (TQT)\r\n b = 4, \u03c3 = 10\u22122, r = 1 b = 4, \u03c3 = 10\u22121, r = 5 b = 4, \u03c3 = 100, r = 13 b = 4, \u03c3 = 101, r = 1 b = 4, \u03c3 = 102, r = 4\r\n 1 g 1 g 3 g 7 g 10 g\r\n 0 6 9\r\n \u22121 0 2 5 8\r\n 7\r\n t2\u22122 \u22121 4 6\r\n \u22123 1 3 5\r\n log\u22124 \u22122 2 4\r\n 3\r\n \u22125 \u22123 0 1 2\r\n \u22126 0 1\r\n 0\r\n \u22127 \u22122 \u22124 \u22121 \u22121 0 \u22121 1 \u22121 2\r\n b = 8, \u03c3 = 10 , r = 244 b = 8, \u03c3 = 10 , r = 16 b = 8, \u03c3 = 10 , r = 78 b = 8, \u03c3 = 10 , r = 173 b = 8, \u03c3 = 10 , r = 24\r\n 1 g 1 g 4 g 7 g 11 g\r\n 0 6 10\r\n 3 9\r\n \u22121 0 5 8\r\n t \u22122 2 4 7\r\n 2 6\r\n \u22123 \u22121 3 5 Raw Grad - SGD\r\n log\u22124 1 2 4 Log Grad - SGD\r\n 3\r\n \u22125 \u22122 0 1 2 Norm Log Grad - SGD\r\n \u22126 0 1 Log Grad - Adam\r\n 0\r\n \u22127 0 1000 2000 \u22123 0 1000 2000 \u22121 0 1000 2000 \u22121 0 1000 2000 \u22121 0 1000 2000\r\n Training Steps\r\n Figure 8. Raw, log and normed log threshold training on L -loss for 2000 steps with learning rate \u03b1 = 0.1. 
We compare different\r\n 2\r\n bit-widths - 4 (top) and 8 (bottom), and Gaussian(\u03c3) inputs of varying \u03c3 - smallest (left) to largest (right). The empirical value of r is\r\n g\r\n estimated from the last few hundred steps of Adam.\r\n updates to) log2 t get exponentially smaller, meaning it will normed gradient cases. This is important for the conver-\r\n converge very slowly to lower optimal values (see the log gence dynamics of the system discussed in Section B.3. In\r\n grad SGD case in the left plots of Figure 8). Similarly, dynamic situations, the gradient normalization solution (17)\r\n on the right-of-jump side, as log2 t increases, updates to approximates this feature as well.\r\n log2 t increase exponentially, meaning it will converge very\r\n v \u2190\u03b2v +(1\u2212\u03b2)g2\r\n quickly and possibly unstably to higher optimal values (see i i\u22121 i\r\n v\r\n the log grad SGD case in the right plots of Figure 8). In the v\u02c6 \u2190 i\r\n raw threshold domain, we would like gradients of (hence i 1\u2212\u03b2i\r\n updates to) t to scale proportional to t. This is also not the g\u02dc \u2190 \u221a gi (17)\r\n case for the left-of-jump side of raw threshold gradients (left i v\u02c6 +\u01eb\r\n i \u0012 \u0013\r\n plot of Figure 7). In other words, the raw and log threshold gi\r\n g\u02dc \u2190 tanh \u221a (18)\r\n gradients are swapped from what we would prefer on the i v\u02c6 +\u01eb\r\n left-of-jump sides. i\r\n Input scale invariance: Updates to the log threshold are Figure 8 shows training curves on the toy L quantization\r\n 2\r\n input scale invariant if the gradients are threshold scale error problem across various bit-widths, input scales, and\r\n invariant and x-axis shifted copies for varying input scales, optimization algorithms. Raw gradient with SGD fails for\r\n as seen in the right plot of Figure 7. 
However, this is not large \u03c3 and converges too slowly for small \u03c3, as we would\r\n the case for log threshold gradients (center plot of Figure 7) expect from Sections B.1 and B.2. Additionally, they have\r\n as the gradient magnitudes depend on the scale of the input. b,\u03c3-dependent stability once converged. Switching from\r\n In fact when accounting for the threshold scale dependence, raw to log threshold gradients, we see that log gradient with\r\n the gradient magnitudes depend quadratically on the scale Adamperformswell,yetloggradient with SGD performs\r\n of the input. poorly, with weak convergence rates for small \u03c3 and di-\r\n Normed gradients: While neither raw or log threshold vergence for large \u03c3. However, after performing gradient\r\n gradients have the desired properties of scale invariance, normalization(18), normedloggradientwithSGDperforms\r\n only minimal modi\ufb01cations to our log threshold gradient well, demonstrating that lack of proper gradient norming is\r\n is needed to get these properties to hold (see desired log the main issue preventing convergence using standard gradi-\r\n threshold gradient on the right of Figure 7). In particular, if ent descent. Besides the differing convergence rates, another\r\n wenormalizethegradient g by its bias-corrected moving characteristic becomes immediately obvious - stability after\r\n i convergence. For example, raw gradient method tends to os-\r\n average variance, we achieve a close approximation of the cillate wildly between multiple integer-level log thresholds,\r\n desired gradients g\u02dc, shown in (17). 
To improve stability,\r\n i whereas normed log gradient method is better behaved and\r\n wecanencapsulate (17) in a clipping function to guarantee tends to stay within a single integer log threshold band.\r\n nolarge gradients, shown in (18).\r\n Yet another desired property highlighted in Figure 7 is that Adamoptimizer: Whilegradientnorming(18)ledtogood\r\n near the jump, the ratio of the gradient magnitudes to either results with SGD, we note that Adam without this gradient\r\n side of the jump is to be preserved between the original and norming also works quite well. It is easy to see why this is -\r\n Trained Quantization Thresholds (TQT)\r\n Adamhasbuilt-in gradient norming (Kingma & Ba, 2014). ents m \u2190 \u03b2 m +(1\u2212\u03b2 )g andamoving variance\r\n i 1 i\u22121 2 1 i\r\n Thus we can avoid rede\ufb01ning the gradients by simply using v \u2190 \u03b2 v +(1\u2212\u03b2 )g before applying update rule\r\n i 1 i\u22121 \u221a1 i\r\n an optimizer that includes adaptive gradients, such as Adam \u03b8 \u2190 \u03b8 \u2212\u03b1\u00b7m/ v. Inpractice, bias correction is\r\n i i\u22121 i i\r\n or RMSprop (Hinton et al., 2012). While RMSprop appears used to get m\u02c6 ,v\u02c6, but when considering settling dynamics\r\n i i\r\n to super\ufb01cially resemble (18) more closely than Adam, we for i \u2192 \u221e, this bias correction is insigni\ufb01cant. Typical\r\n suspect Adam has better behavior in the absence of gradient values are \u03b1 \u2248 10\u22123,\u03b2 \u2248 0.9,\u03b2 \u2248 0.999.\r\n 1 2\r\n clipping due to its use of moments to smooth the gradients. In AppendixC,adetailedanalysisofconvergenceforAdam\r\n TouseAdamsafely,wederiveroughboundsonthelearning is carried out. From this analysis a simple set of guidelines\r\n rate and momentum parameters to ensure the oscillations emerge. First, the learning rate is set to guarantee \u03b1 <\r\n seen in Figure 8 for log gradient with Adam do not exceed \u221a\r\n 0.1/ p. 
Next, we ensure 1/e < \u03b2 < 1 to satisfy the\r\n a single integer bin. This is important because if they move 1\r\n across bins often, the network may have more trouble adapt- limits of our analysis. Finally, we make sure rg \u2248 p \u226a\r\n 1/(1\u2212\u03b2 ) \u21d21\u2212\u03b2 \u226a1/p.Theseresultsaresummarized\r\n ing to the changing distributions from a given quantized 2 2\r\n in Table 4. For simplicity, we use \u03b1 = 0.01,\u03b2 = 0.9,\u03b2 =\r\n layer, in an effect that may be similar to the motivation for 1 2\r\n batch normalization (Ioffe & Szegedy, 2015). 0.999 for all of our training.\r\n B.3 Convergence Table 4. Guidelines for log threshold training with Adam, assum-\r\n ing b = 2b\u22121 \u2212 1 for signed data.\r\n Oneprimarycauseofthesharpgradient jumps seen in Fig- Bit-width b 4 8\r\n ure 7 is our insistence on power-of-2 scaling. In the forward\r\n pass, features downstream from the quantized layer are \u03b1 \u2264 \u221a 0.1 \u22640.035 \u22640.009\r\n completely unaware of intermediate non-power-of-2 scale- 2b\u22121\u22121\r\n factors so there are sharp jumps at integral log2 t, similar to \u03b2 \u22651/e \u22651/e \u22651/e\r\n what might be observed when using the STE for traditional 1\r\n \u03b2 \u22651\u2212 0.1 \u22650.99 \u22650.999\r\n 2 b\u22121\r\n quantization. The net effect is a bang-bang like operation. 2 \u22121\r\n In more detail, for a given input distribution there is some Steps \u2248\u03b1\u22121+(1\u2212\u03b22)\u22121 \u2248100 \u22481000\r\n critical integer threshold log2 t\u2217 before which the gradients\r\n are negative (causing positive threshold updates) and after C ANALYSISOFADAMCONVERGENCE\r\n which the gradients are positive. This negative feedback\r\n will force the threshold to oscillate around log2 t\u2217. The Let T be the period of oscillations at convergence. 
However, in our experiments we used the implementationally simpler approach of unnormed log gradients with the Adam optimizer. While simpler to implement, the analysis is more complicated due to the second-order nature of the optimizer. Adam has three key hyperparameters $\alpha, \beta_1, \beta_2$, and operates by keeping track of a moving mean of the gradients $m_i \leftarrow \beta_1 m_{i-1} + (1-\beta_1)\,g_i$ and a moving variance $v_i \leftarrow \beta_2 v_{i-1} + (1-\beta_2)\,g_i^2$ before applying the update rule $\theta_i \leftarrow \theta_{i-1} - \alpha\, m_i/\sqrt{v_i}$. In practice, bias correction is used to obtain $\hat{m}_i, \hat{v}_i$, but when considering settling dynamics for $i \to \infty$, this bias correction is insignificant. Typical values are $\alpha \approx 10^{-3}$, $\beta_1 \approx 0.9$, $\beta_2 \approx 0.999$.

In Appendix C, a detailed analysis of convergence for Adam is carried out. From this analysis, a simple set of guidelines emerges. First, the learning rate is set to guarantee $\alpha < 0.1/\sqrt{p}$. Next, we ensure $1/e < \beta_1 < 1$ to satisfy the limits of our analysis. Finally, we make sure $r_g \approx p \ll 1/(1-\beta_2)$, i.e. $1-\beta_2 \ll 1/p$. These results are summarized in Table 4. For simplicity, we use $\alpha = 0.01$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ for all of our training.

Table 4. Guidelines for log threshold training with Adam, assuming $p = 2^{b-1} - 1$ for signed data.

    Constraint                                        | b = 4   | b = 8
    $\alpha \le 0.1/\sqrt{2^{b-1}-1}$                 | ≤ 0.035 | ≤ 0.009
    $\beta_1 \ge 1/e$                                 | ≥ 1/e   | ≥ 1/e
    $\beta_2 \ge 1 - 0.1/(2^{b-1}-1)$                 | ≥ 0.99  | ≥ 0.999
    Steps $\approx \alpha^{-1} + (1-\beta_2)^{-1}$    | ≈ 100   | ≈ 1000
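The Table 4 entries follow mechanically from the three guidelines with $p = 2^{b-1}-1$. A small helper (the `adam_guidelines` name is hypothetical, for checking the arithmetic only):

```python
import math

def adam_guidelines(b):
    """Guideline bounds for b-bit signed data, with p = 2**(b-1) - 1."""
    p = 2 ** (b - 1) - 1
    alpha_max = 0.1 / math.sqrt(p)   # alpha <= 0.1 / sqrt(p)
    beta1_min = 1.0 / math.e         # 1/e < beta1 < 1
    beta2_min = 1.0 - 0.1 / p        # 1 - beta2 << 1/p (10x margin)
    steps = 1.0 / alpha_max + 1.0 / (1.0 - beta2_min)
    return p, alpha_max, beta1_min, beta2_min, steps

p, a, b1, b2, steps = adam_guidelines(8)
print(p, round(a, 3), round(b2, 4))  # 127 0.009 0.9992
```

For $b = 8$ this reproduces the $\alpha \le 0.009$ and $\beta_2 \ge 0.999$ entries of Table 4.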
\u03b2T(1\u2212\u03b21)\u2212(1\u2212\u03b2T)/rg\r\n m = 1 1 g (19)\r\n Howeverinourexperiments, we used the implementation- 0 1\u2212\u03b2T+1 l\r\n 1\r\n ally simpler approach of unnormed log gradients with the m \u03b2T(1\u2212\u03b2 )\u2212(1\u2212\u03b2T)/r\r\n i = \u03b2i+1 1 1 1 g\r\n Adamoptimizer. While simpler to implement, the analysis g 1 1\u2212\u03b2T+1\r\n is more complicated due to the second-order nature of the l 1\r\n optimizer. Adam has three key hyperparameters: \u03b1,\u03b2 ,\u03b2 +\u03b2i(1\u2212\u03b2 + 1 )\u2212 1 (20)\r\n 1 2 1 1 r r\r\n and operates by keeping track of a moving mean of gradi- g g\r\n Trained Quantization Thresholds (TQT)\r\n b = 8, \u03c3 = 10\u22122, rg = 272 b = 8, \u03c3 = 10\u22121, rg = 18 b = 8, \u03c3 = 100, rg = 52\r\n log2t for Log Grad - Adam \u22120.8\r\n t2\u22124.5 2.02\r\n log \u22121.0\r\n \u22125.0 \u22121.2 2.00\r\n 10\u22125 10\u22123 10\u22121\r\n 10\u22126 10\u22124 10\u22122\r\n Lt 0 0 0\r\n 2\u221210\u22126\r\n log\u221210\u22125 \u221210\u22124 \u221210\u22122\r\n \u2207 \u221210\u22124 \u221210\u22123 \u221210\u22121\r\n \u22123 \u22122 \u2212100\r\n \u221210 Loss Gradient \u221210 1\r\n 1500 1600 1700 1800 1900 2000 1800 1850 1900 1950 2000 \u221210 1800 1850 1900 1950 2000\r\n Training Steps\r\n Figure 9. Close up of Figure 8 for the Adam-trained log threshold gradient on a few select settings.\r\n \u221a\r\n Adamupdateslooklike\u03b8 \u2190 \u03b8 \u2212\u03b1\u00b7m/ v or\u03b8 \u2190 where we replace the large expression in (23) with c in\r\n Pi \u221a i i\u22121 i i i 1\r\n \u03b8 \u2212\u03b1 m/ v .WecansolveforT by\ufb01ndingwhen (24). Wenowsolveforthecriticalpointof\u2206t\u03b8 todetermine\r\n 0 j=0 j j\r\n P \u221a\r\n T tmax = argmax \u2206t\u03b8.\r\n \u03b8 =\u03b8 or m/ v =0. 
Adam updates look like $\theta_i \leftarrow \theta_{i-1} - \alpha\, m_i/\sqrt{v_i}$, or $\theta_t \leftarrow \theta_0 - \alpha \sum_{i=0}^{t} m_i/\sqrt{v_i}$. We can solve for $T$ by finding when $\theta_T = \theta_0$, or $\sum_{i=0}^{T} m_i/\sqrt{v_i} = 0$. As an intermediate step, we find the accumulated update $\Delta_t\theta$ (in units of $\alpha$, normalized by $|g_l|$):

$$\Delta_t\theta = \sum_{i=0}^{t}\frac{m_i}{\sqrt{v_i}} = \frac{1}{\sqrt{1/r_g^2 + 1/T}}\left[\frac{1-\beta_1^{t+1}}{1-\beta_1}\left(\beta_1\,\frac{\beta_1^T(1-\beta_1)-(1-\beta_1^T)/r_g}{1-\beta_1^{T+1}} + 1 - \beta_1 + \frac{1}{r_g}\right) - \frac{t+1}{r_g}\right] \qquad (21)$$

Now, we set $\Delta_T\theta = 0$:

$$0 = \frac{1-\beta_1^{T+1}}{1-\beta_1}\left(\beta_1\,\frac{\beta_1^T(1-\beta_1)-(1-\beta_1^T)/r_g}{1-\beta_1^{T+1}} + 1 - \beta_1 + \frac{1}{r_g}\right) - \frac{T+1}{r_g}$$
$$= \beta_1^{T+1} - \frac{\beta_1(1-\beta_1^T)}{r_g(1-\beta_1)} + 1 - \beta_1^{T+1} + \frac{1-\beta_1^{T+1}}{r_g(1-\beta_1)} - \frac{T+1}{r_g} = 1 - \frac{T}{r_g}$$

$$T = r_g \qquad (22)$$

The worst case happens when $r_g$ is large, so if we substitute $T \leftarrow r_g$ and assume $r_g \gg 1$, we get:

$$\Delta_t\theta \approx \sqrt{r_g}\left[\frac{1-\beta_1^{t+1}}{1-\beta_1}\left(\beta_1\,\frac{\beta_1^{r_g}(1-\beta_1)-(1-\beta_1^{r_g})/r_g}{1-\beta_1^{r_g+1}} + 1 - \beta_1 + \frac{1}{r_g}\right) - \frac{t+1}{r_g}\right] \qquad (23)$$
$$= \sqrt{r_g}\left[\frac{1-\beta_1^{t+1}}{1-\beta_1}\, c_1 - \frac{t+1}{r_g}\right] \qquad (24)$$

where we replace the large expression in (23) with $c_1$ in (24). We now solve for the critical point of $\Delta_t\theta$ to determine $t_{\max} = \arg\max_t \Delta_t\theta$:

$$0 = \frac{d}{dt}\Delta_t\theta = \sqrt{r_g}\left[\frac{\ln(\beta_1^{-1})\,\beta_1^{t_{\max}+1}}{1-\beta_1}\, c_1 - \frac{1}{r_g}\right]$$

$$\beta_1^{t_{\max}+1} = \frac{1-\beta_1}{\ln(\beta_1^{-1})\, r_g\, c_1} = \frac{1}{\ln(\beta_1^{-1})\,(1+r_g)} \qquad (25)$$

$$t_{\max} = \log_{\beta_1}\!\left(\frac{1}{\ln(\beta_1^{-1})\,(1+r_g)}\right) - 1 \qquad (26)$$

where the second equality in (25) uses $c_1 \approx (1-\beta_1)(r_g+1)/r_g$, valid since $\beta_1^{r_g} \approx 0$ for $r_g \gg 1$. Plugging (25) and (26) into (24),

$$\Delta_{t_{\max}}\theta \approx \sqrt{r_g}\left[\frac{c_1}{1-\beta_1} - \frac{1}{r_g \ln(\beta_1^{-1})} - \frac{1}{r_g}\log_{\beta_1}\!\left(\frac{1}{\ln(\beta_1^{-1})\,(1+r_g)}\right)\right] \qquad (27)$$

To simplify this expression, note that $\beta_1 < 1$ and $r_g \gg 1$, so $1-\beta_1^{r_g} \approx 1$.
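The fixed point $T = r_g$ in (22) can be double-checked by evaluating the sum in (21) directly from (20) and the constant-variance approximation, rather than trusting the algebra. A numerical sketch under the same periodic-gradient assumptions:

```python
import math

def cycle_sum(T, beta1, r_g):
    # Delta_T(theta): sum of m_i/sqrt(v) over one cycle (eqs. 20-21),
    # in units of g_l, with g_h = -g_l/r_g and constant
    # v = g_l^2 * (1/r_g^2 + 1/T).
    c0 = (beta1**T * (1 - beta1) - (1 - beta1**T) / r_g) / (1 - beta1**(T + 1))
    sqrt_v = math.sqrt(1.0 / r_g**2 + 1.0 / T)
    total = 0.0
    for i in range(T + 1):
        m_i = beta1**(i + 1) * c0 + beta1**i * (1 - beta1 + 1 / r_g) - 1 / r_g
        total += m_i / sqrt_v
    return total

beta1, r_g = 0.9, 100.0
# The accumulated update changes sign at T = r_g, consistent with eq. (22).
assert cycle_sum(90, beta1, r_g) > 0 > cycle_sum(110, beta1, r_g)
```

Algebraically the bracketed sum telescopes to $1 - T/r_g$, so the zero crossing sits exactly at $T = r_g$.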
Thenc /(1\u2212\u03b2 ) \u2248 1+1/r \u22481and:\r\n T =r (22) 1 1 1 g\r\n g\r\n Theworstcasehappenswhenrg islarge, so if we substitute \u221a \u0014 1+ln(rgln\u03b2\u22121)\u0015\r\n T \u2190r andassumer \u226b1,weget: \u2206 \u03b8 \u2248 r 1+ 1 (28)\r\n g g t g\r\n max r ln\u03b2\r\n g 1\r\n \u221a \"1\u2212\u03b2t+1 \u03b2rg(1\u2212\u03b2 )\u2212(1\u2212\u03b2rg)/r\r\n \u2206\u03b8\u2248 r 1 \u03b2 1 1 1 g Further, if 1/e < \u03b21 < 1, then the right term is negative\r\n t g 1\u2212\u03b2 1 rg+1 and the expression has a simple upper bound:\r\n 1 1\u2212\u03b2\r\n \u0013 \u0015 1\r\n +1\u2212\u03b2 + 1 \u2212t+1 (23)\r\n 1 r r\r\n \u0014 g g \u0015\r\n \u221a 1\u2212\u03b2t+1 t +1\r\n = r 1 c \u2212 (24)\r\n g 1\u2212\u03b2 1 r\r\n 1 g\r\n Trained Quantization Thresholds (TQT)\r\n higher, as is usually the case when \u2206\u2308log2 t\u2309 < 0, as seen in the\r\n \u221a small \u03c3 plots of Figure 8.\r\n \u2206t \u03b8 < rg (29) whythis is the case - the log threshold spends far more than\r\n max\r\n one step in the lower threshold bin per period, violating our\r\n In practice, we notice that sometimes noise can can cause \u03b8 one-step assumption. This violation can be explained by\r\n to stay on the high-gradient side of the threshold boundary looking at the gradients, which show that the lower thresh-\r\n for multiple steps, causing the momentum to build up. Thus, old bin sometimes has positive gradients, depending on the\r\n \u221a randomnessoftheinputGaussianvector. These phenomena\r\n to be safe, we recommend designing for \u2206 \u03b8 < 10 r .\r\n tmax g\r\n Arough estimate for the number of steps needed for con- motivate our suggestion to over-design by 10\u00d7. The cost in\r\n vergence is O(\u2206\u2308log t\u2309/(\u03b1|g\u02dc|)). 
A rough estimate of the number of steps needed for convergence is $O(\Delta\lceil\log_2 t\rceil/(\alpha|\tilde{g}|))$. Because of adaptive gradients, $|\tilde{g}|$ should be close to 1, provided we allow enough time for the historical variance to decay, i.e. $O(1/(1-\beta_2))$ steps.¹⁰ Thus, the overall number of steps would be $O(\Delta\lceil\log_2 t\rceil/\alpha + \Delta\lceil\log_2 t\rceil/(1-\beta_2))$. Assuming calibration is used, $\Delta\lceil\log_2 t\rceil$ should be close to 1, giving the simplified expression $O(1/\alpha + 1/(1-\beta_2))$ steps.

Finally, we address how to approximate $r_g$. The operation of crossing a threshold boundary moves some fraction $f$ of the inputs $\{x_i\}$ from the $n \le \lfloor x/s\rceil \le p$ case to the $\lfloor x/s\rceil < n$ or $\lfloor x/s\rceil > p$ cases (assume only $\lfloor x/s\rceil > p$ from here on, for simplicity). Using the toy L2-loss model (9),

$$\nabla_{(\log_2 t)} L = s^2 \ln 2 \cdot \begin{cases} \left(\lfloor x/s\rceil - x/s\right)^2 & \text{if } n \le \lfloor x/s\rceil \le p,\\ n\,(n - x/s) & \text{if } \lfloor x/s\rceil < n,\\ p\,(p - x/s) & \text{if } \lfloor x/s\rceil > p, \end{cases} \qquad (30)$$

we see that for any given $x_i$, the ratio $r_{g_i}$ between the gradients in the outer and inner cases is $p(p - x_i/s)/(\lfloor x_i/s\rceil - x_i/s)^2$. But since $x_i$ recently switched cases, $(p - x_i/s) < 1$. As a rough estimate, we might expect $r_{g_i} \approx (p/2)/(1/12) \approx 6p$. Averaged over the entire input, $r_g \approx 6fp$. The $10\times$ over-design helps address some uncertainty in this measure as well.

Figure 9 shows a re-run of Figure 8 for the case of Adam optimization on log threshold gradients. These plots allow us to validate our Adam convergence analysis above. First, we note that $p = 2^{8-1} - 1 = 127$, which is an approximate upper bound on $r_g$ and well within the $10\times$ over-design principle. Next, notice that $T \approx r_g$: for example, in the $\sigma = 10^{-2}$ case, $T \approx 280$ while $r_g \approx 272$.

Most importantly, we expect the max log-threshold deviation to be upper-bounded by $\alpha\sqrt{r_g} = (1.6, 0.4, 0.7)$ from left to right, if our original assumptions hold; namely, that we visit the lower threshold bin for one step and stay in the upper bin for $T-1$ steps. While the bound holds for all $\sigma$, it is close to not holding for $\sigma = 10^{-1}$. A brief inspection reveals why: the log threshold spends far more than one step in the lower threshold bin per period, violating our one-step assumption. This violation can be explained by looking at the gradients, which show that the lower threshold bin sometimes has positive gradients, depending on the randomness of the input Gaussian vector. These phenomena motivate our suggestion to over-design by $10\times$; the cost in additional steps needed to reach convergence seems like a worthwhile trade-off.

¹⁰ This is a problem when historical gradient magnitudes were higher, as is usually the case when $\Delta\lceil\log_2 t\rceil < 0$, as seen in the small-$\sigma$ plots of Figure 8.
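The bang-bang sign flip and the rough magnitude of $r_g$ can be reproduced from the toy L2-loss gradient (30) on a synthetic Gaussian input. The sketch below (the `grad_log2t` helper and the clipping of the input support are illustrative assumptions, not the paper's code) evaluates the mean gradient just below and just above an integer $\log_2 t$ boundary:

```python
import numpy as np

def grad_log2t(x, log2_t, b=8):
    # Mean toy-L2-model gradient w.r.t. log2(t), per eq. (30), for signed
    # data: n = -2**(b-1), p = 2**(b-1) - 1, scale s = t / p.
    n, p = -2 ** (b - 1), 2 ** (b - 1) - 1
    s = 2.0 ** log2_t / p
    q = np.round(x / s)
    inner = (q - x / s) ** 2
    g = np.where(q > p, p * (p - x / s),
                 np.where(q < n, n * (n - x / s), inner))
    return s**2 * np.log(2) * g.mean()

rng = np.random.default_rng(0)
# Gaussian input, support clipped so the upper bin incurs no saturation.
x = np.clip(rng.normal(0.0, 0.1, size=100_000), -0.45, 0.45)
hi = float(np.ceil(np.log2(np.abs(x).max())))
g_h = grad_log2t(x, hi)        # rounding-dominated: positive gradient
g_l = grad_log2t(x, hi - 1.0)  # clipping-dominated: negative gradient
r_g = -g_l / g_h
assert g_l < 0 < g_h and 5 < r_g < 10 * 127  # within 10x over-design of p
```

The measured ratio lands in the tens for this input, comfortably inside the $10\times$-over-designed bound around $p = 127$.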
D BEST OR MEAN VALIDATION

We run validation every 1000 training steps and save the best top-1 score checkpoint. This approach was initially driven by a desire to better understand the convergence and stability properties of our method, but we continued using it since intermediate validation was not too expensive for 5 epochs of retraining. However, a valid concern is that this intermediate validation introduces a positive bias to our results through cherry-picking. To quantify this, we compare the positively-biased validation method with simply taking the average of validation scores at fixed intervals: 20%, 40%, 60%, 80% and 100% of the fifth epoch. As noted in Table 5, the difference between these methods amounts to only a minor positive bias in our reported accuracy.

Table 5. Best validation (cherry-picked) compared to the average of five validations (at pre-determined steps) in the last epoch, for two networks.

    MobileNet v1 1.0 224   top-1 (%)  top-5 (%)  Epochs
                           70.982     89.886     4.2
                           70.986     89.860     4.4
                           71.076     89.930     4.6
                           71.000     89.870     4.8
                           71.022     89.944     5.0
    Mean                   71.0       89.9
    Best                   71.1       90.0       2.1

    VGG16                  top-1 (%)  top-5 (%)  Epochs
                           71.448     90.438     4.2
                           71.462     90.456     4.4
                           71.434     90.436     4.6
                           71.500     90.426     4.8
                           71.458     90.456     5.0
    Mean                   71.5       90.4
    Best                   71.7       90.4       0.9

Figure 10.
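The "Mean" rows in Table 5 are just the average of the five listed checkpoint scores; for the top-1 columns (values copied from Table 5):

```python
# Top-1 scores at 20/40/60/80/100% of the fifth epoch, from Table 5.
mobilenet_top1 = [70.982, 70.986, 71.076, 71.000, 71.022]
vgg16_top1 = [71.448, 71.462, 71.434, 71.500, 71.458]

mean_mn = sum(mobilenet_top1) / len(mobilenet_top1)
mean_vgg = sum(vgg16_top1) / len(vgg16_top1)
print(round(mean_mn, 1), round(mean_vgg, 1))  # 71.0 71.5
# The best-checkpoint scores (71.1 and 71.7) exceed these means by only
# about 0.1-0.2%, which quantifies the cherry-picking bias.
```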
Weight and activation distributions of MobileNet v1 before (black) and after (red) quantized TQT (wt+th) retraining, for thresholds that changed by a non-zero integer amount in the log-domain. Initial thresholds (cyan) and trained thresholds (blue) are also plotted. These are the raw thresholds $t$. Also indicated above each plot are the bit-width $b$ and the threshold deviation $d := \Delta\lceil\log_2 t\rceil$ for the quantized layer. A positive deviation indicates a preference for range over precision, and a negative deviation indicates the opposite.