{"title": "OPTIMUS: OPTImized matrix MUltiplication Structure for Transformer neural network accelerator", "book": "Proceedings of Machine Learning and Systems", "page_first": 363, "page_last": 378, "abstract": "We present a high-performance Transformer neural network inference accelerator named OPTIMUS. Optimus has several features for performance enhancement such as the redundant computation skipping method to accelerate the decoding process and the Set-Associative RCSC (SA-RCSC) sparse matrix format to maintain high utilization even when a large number of MACs are used in hardware.  OPTIMUS also has a flexible hardware architecture to support diverse matrix multiplications and it keeps all the intermediate computation values fully local and completely eliminates the DRAM access to achieve exceptionally fast single batch inference.  It also reduces the data transfer overhead by carefully matching the data compute and load cycles.  The simulation using the WMT15 (EN-DE) dataset shows that latency of OPTIMUS is 41.62\u00d7, 24.23\u00d7, 16.01\u00d7 smaller than that of Intel(R) i7 6900K CPU, NVIDIA Titan Xp GPU, and the baseline custom hardware, respectively. In addition, the throughput of OPTIMUS is 43.35\u00d7, 25.45\u00d7 and 19.00\u00d7 higher and the energy efficiency of OPTIMUS is 2393.85\u00d7, 1464\u00d7 and 19.01\u00d7 better than that of CPU, GPU and the baseline custom hardware, respectively.", "full_text": "                    OPTIMUS:OPTIMIZEDMATRIXMULTIPLICATIONSTRUCTUREFOR\r\n                                    TRANSFORMERNEURALNETWORKACCELERATOR\r\n                                Junki Park1 HyunsungYoon1 DaehyunAhn1 JungwookChoi2 Jae-JoonKim1\r\n                                                                       ABSTRACT\r\n                    Wepresent a high-performance Transformer neural network inference accelerator named OPTIMUS. OPTIMUS\r\n                    hasseveralfeaturesforperformanceenhancementsuchastheredundantcomputationskippingmethodtoaccelerate\r\n                    the decoding process and the Set-Associative RCSC (SA-RCSC) sparse matrix format to maintain high utilization\r\n                    even when large number of MACs are used in hardware. OPTIMUS also has a \ufb02exible hardware architecture\r\n                    to support diverse matrix multiplications and it keeps all the intermediate computation values fully local and\r\n                    completely eliminate the DRAM access to achieve exceptionally fast single batch inference. It also reduces the\r\n                    data transfer overhead by carefully matching the data compute and load cycles. The simulation using the WMT15\r\n                    (EN-DE) dataset shows that latency of OPTIMUS is 41.62\u00d7, 24.23\u00d7, 16.01\u00d7 smaller than that of Intel(R) i7\r\n                    6900KCPU,NVIDIATitanXpGPU,andthebaselinecustomhardware,respectively. In addition, the throughput\r\n                    of OPTIMUSis43.35\u00d7,25.45\u00d7and19.00\u00d7higherandtheenergyef\ufb01ciencyofOPTIMUSis2393.85\u00d7,1464\u00d7\r\n                    and 19.01\u00d7 better than that of CPU, GPU and the baseline custom hardware, respectively.\r\n               1    INTRODUCTION                                                 ware to accelerate the inference of the Transformer despite\r\n               In recent years, neural machine translation based on deep         having better performance than RNN and LSTM. There\r\n               learning has been widely used. 
Recurrent neural network           are several challenges in designing a transformer inference\r\n               (RNN), and long short-term memory (LSTM) have been                engine. First, the overhead of DRAM access is large be-\r\n               popular choices for machine translation (Sutskever et al.,        cause of the large amount of data. A well-known technique\r\n               2014; Cho et al., 2014; Bahdanau et al., 2015). However,          called pruning can be applied to reduce the memory require-\r\n               RNN/LSTMareknowntohavesomeproblems;itishardto                     ment (Han et al., 2015a; 2017). Second, when large number\r\n               parallelize the computation due to sequential characteristics     of multiplier and accumulators (MAC) are embedded in the\r\n               (Wuet al., 2016a) and the accuracy drops when the input           accelerator to increase the parallelism and the performance,\r\n               sentence is very long (Cho et al., 2014). The attention           MACutilization is reduced. This problem is exacerbated\r\n               mechanismimprovestheaccuracybyallowingthedecoding                 whenthedenseweightmatrixbecomessparseafter pruning.\r\n               process to focus on the input part which is the most relevant     Third, the computation \ufb02ow of encoding and decoding in\r\n               to the current decoding step (Bahdanau et al., 2015). In          the Transformer is very different and the excessive com-\r\n               particular, the Transformer neural network which consists         putational overhead in decoding should be addressed. In\r\n               of attention mechanisms only is known to have much more           the encoding process, all the word vectors in a sentence are\r\n               parallelism and improved translation quality (Vaswani et al.,     computed in parallel as a matrix form. However, only one\r\n               2017).                                                            wordvector is translated for each decoding iteration. Since\r\n                                                                                 all previously decoded word vectors need to be used as an\r\n               While various inference hardware accelerators for RNN             input to the decoder at the next decoding step, the amount\r\n               and LSTM have been proposed (Han et al., 2017; Gao                of computation increases quadratically during the iterations.\r\n               et al., 2018; Wang et al., 2018; Park et al., 2018; Park et al.,  This paper presents a high-performance and \ufb02exible hard-\r\n               2019; Cao et al., 2019), there is a lack of research on hard-     ware architecture, OPTIMUS, for the transformer algorithm\r\n                  1Department of Creative IT Engineering, Pohang University of   inference. The main contributions of the paper can be sum-\r\n               Science and Technology (POSTECH), Pohang, Republic of Korea       marized as follows:\r\n               2Department of Electronics and Computer Engineering, Hanyang\r\n               University, Seoul, Republic of Korea. Correspondence to: Jae-     1. We analyze the computation process of the Trans-\r\n               Joon Kim <jaejoon@postech.ac.kr>.                                 former network and improve the performance by skip-\r\n               Proceedings of the 3rd MLSys Conference, Austin, TX, USA,         ping redundant computations. 
It is shown that sequential\r\n               2020. Copyright 2020 by the author(s).                            generation of words in the Transformer decoder is the bot-\r\n                                                                                 tleneck in terms of performance and skipping redundant\r\n                                                      OPTIMUS:OPTImizedmatrixMUltiplicationStructureforTransformerneuralnetworkaccelerator\r\n                                                                              Weight Matrix                                                                   the time-varying feature of the input sentence and decoder\r\n                                           Step1: count the number of nonzero elements in each                                                                exploits it to predict a sentence, a word at a time. There\r\n                                           row to analyze the computational load                                                                              have been various approaches for constructing encoder and\r\n                                                                                                                                                              decoder. LSTM-based layer structures such as Google\u2019s\r\n                                           Step2: evenly distribute computation loads to PEs                                                                  Neural Machine Translation (Wu et al., 2016b) have been\r\n                                                                                                                                                              popular for their superior translation performance, but they\r\n                                           Step3: rearranges the matrix rows so that the PE                                                                   suffer restricted parallelism inherent in LSTM computation.\r\n                                           index can be extracted by a simple decoding (modulo                                                                Recently, a network based primarily on the attention mech-\r\n                                           operation).\r\n                                             i_gate/h0               c_gate/h1               f_gate/h2              o_gate/h3                                 anism, the Transformer (Vaswani et al., 2017), has been\r\n                                          w           w                 w w                           w                 w           w                         introduced to increase parallelism in computation.\r\n                                            0,0         0,2              0,5   0,6                    0,10              0,13        0,15\r\n                                                w w                                w            w w               w\r\n                                                  1,1   1,2                          1,7         1,9  1,10        1,12                                        The state-of-the-art NMT models are often composed of\r\n                                                                   Rearranged Weight Matrix                                                                   multiple layers with large weight matrices.                                                        
Therefore,\r\n                                            Step4                                                                                                             modelcompressionsuchaspruning(Hanetal.,2016;2017)\r\n                                              Value w w w w w w w w w w w w w\r\n                                                          0,0   1,12  1,1    0,5   1,9  0,13   0,2  1,2    0,6  0,10  1,10   1,7 0,15                         is commonly used to alleviate the memory access overhead\r\n                                             Row_id 0           1     1     0     1     0      0     1     0     0     1     1     0\r\n                                              Col_id      0     4     8 12 1 5 9 13 2 6 10 14 3                                                               for loading weights. After weight elements with small im-\r\n                                             Col_len 1          0     0     1     1     1     1     1      2     1     2     0     0                          portance are pruned to zero, a dense matrix becomes sparse.\r\n                                              Col_id      7 11 15              Conventional RCSC Format                                                       Toeliminate the overhead of fetching unnecessary zero ele-\r\n                                             Col_len 1          0     1                                                                                       ments, a pruned weight is stored using a sparse matrix for-\r\n                                                                                                                                                              matsuchasacompressedsparsecolumn(CSC)format(Han\r\n                                           Step5: performs a network transformation so that the                                                               et al., 2017), which consists of the non-zero values, row in-\r\n                                           dot product sequence changed by Step 3 does not                                                                    dices and column pointers of non-zero elements.\r\n                                           affect the result\r\n                             Figure 1. The process of generating the conventional RCSC format.                                                                However, two major problems arise when the CSC format\r\n                             It solves the problemsofloadimbalanceandinputloadmisscaused                                                                      is used for the sparse matrix computation in the custom\r\n                             byasparse matrix.                                                                                                                hardware accelerators. First, since the computation load is\r\n                                                                                                                                                              unevenly assigned to each PE, the overall PE utilization is\r\n                             computations reduces the overhead signi\ufb01cantly. We also                                                                          reduced. 
Second, since the input vector element is loaded\r\n                             showthat skipping redundant computations is much more                                                                            from the input buffer in irregular access pattern, the miss\r\n                             effective in custom hardware design than in GPU.                                                                                 rate of the input is high. If the corresponding element is not\r\n                             2. We propose a Set-Associative RCSC (SA-RCSC) for-                                                                              loaded from the input buffer due to a miss, the PE is stalled\r\n                             mattoenablelarge-scaleMACstomaintainhighutilization.                                                                             until the corresponding input element is loaded. There have\r\n                             Theproposed sparse matrix format signi\ufb01cantly reduces the                                                                        been several studies to solve these load imbalance problem\r\n                             input miss rate by allowing multiple PEs to handle a matrix                                                                      and input load miss problem (Han et al., 2017; Park et al.,\r\n                             row. As a result, the MAC utilization is improved by \u223c 2X                                                                        2018; Rizakis et al., 2018; Park et al., 2019). Among them,\r\n                             compared to the conventional RCSC format case.                                                                                   only the rearranged compressed sparse column (RCSC) for-\r\n                                                                                                                                                              matproposed in (Park et al., 2019) addresses both issues.\r\n                             3. We design the OPTIMUS, a custom hardware accel-\r\n                             erator for the Transformer neural network which has a                                                                            2.2         RearrangedCompressedSparseColumnFormat\r\n                             \ufb02exibility to support various types of matrix multiplications.                                                                               (RCSC)\r\n                             While it outperforms generic computing platforms by sig-\r\n                             ni\ufb01cant margin in general, OPTIMUS shows a particularly                                                                          TheRCSCformat(Parketal.,2019)utilizes the character-\r\n                             goodperformance for a single batch inference by keeping                                                                          istics of LSTM to improve the hit rate of the input vector in\r\n                             all the intermediate computation values fully local and elim-                                                                    the local buffer as well as balancing the computation loads\r\n                             inating DRAM access. It also has an optimized control \ufb02ow                                                                        betweenPEs. 
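For reference, the CSC layout described above can be written out in a few lines of Python. The sketch below is a simplified software illustration only (not the accelerator's datapath): it stores the non-zero values, row indices and column pointers of a pruned matrix, performs the column-wise sparse matrix-vector product, and prints the per-row non-zero counts, which are exactly the quantity that makes a static row-to-PE mapping unbalanced.

import numpy as np

def to_csc(w):
    """Compress a pruned (mostly-zero) matrix into CSC arrays:
    non-zero values, their row indices, and column pointers."""
    values, row_idx, col_ptr = [], [], [0]
    for j in range(w.shape[1]):              # column-major scan
        for i in range(w.shape[0]):
            if w[i, j] != 0.0:
                values.append(w[i, j])
                row_idx.append(i)
        col_ptr.append(len(values))          # start of the next column
    return np.array(values), np.array(row_idx), np.array(col_ptr)

def csc_matvec(values, row_idx, col_ptr, x, n_rows):
    """y = W @ x using the CSC arrays; each input element x[j] is
    broadcast to every non-zero of column j (input reuse)."""
    y = np.zeros(n_rows)
    for j in range(len(col_ptr) - 1):
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += values[k] * x[j]
    return y

# Toy pruned matrix: row 0 keeps several weights, row 3 keeps none, so
# PEs statically bound to one row each would see very unequal work.
w = np.array([[0.5, 0.0, 0.2, 0.1],
              [0.0, 0.0, 0.3, 0.0],
              [0.0, 0.4, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]])
vals, rows, cols = to_csc(w)
x = np.arange(4, dtype=float)
print(csc_matvec(vals, rows, cols, x, 4))    # matches w @ x
print(np.count_nonzero(w, axis=1))           # per-row load: [3 1 1 0]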
This format was introduced as a sparse matrix\r\n                             to hide the data transfer overhead from the computation.                                                                         format targeted for LSTM, but is applicable to all networks\r\n                                                                                                                                                              in which an input vector is multiplied to multiple sparse\r\n                             2         BACKGROUNDANDRELATEDWORK                                                                                               weight matrices.\r\n                             2.1        Sparse Neural Machine Translation                                                                                     TheRCSCformatisgeneratedthrougha\ufb01ve-stepprocess\r\n                                                                                                                                                              (Fig. 1) (Park et al., 2019). The \ufb01rst step (Step 1) is to\r\n                             Neural machine translation (NMT) is to map a sequence                                                                            analyze the computation load for each PE by counting the\r\n                             of words in one language to one in another language us-                                                                          numberofnonzeroelementsineachrow. Thesecondstep\r\n                             ing a neural network based sequence to sequence model.                                                                           (Step 2) is to assign a PE for each row. The computation\r\n                             In general, the sequence to sequence model consists of                                                                           load is evenly distributed to each PE in this step. The third\r\n                             two parts, encoder and decoder, where encoder extracts                                                                           step (Step 3) is to sort the matrix rows in circular order based\r\n                                       OPTIMUS:OPTImizedmatrixMUltiplicationStructureforTransformerneuralnetworkaccelerator\r\n                             TRANSFORMER                                    I love you                                      Table 1. The Computation Type of the Transformer\r\n                                                                        Linear & Softmax                                        1. EMBEDDING AND POSITIONAL ENCODING\r\n                                            To Multi-Head     x 6 Decoder\r\n                                         Attention of Decoder                      dmodel   t                        EM/PE           E=Embedding(X)+PE(X)\r\n                                             E (d    x t )             Add & Layer Norm\r\n                                                 model E                                                                                    2. 
MULTI-HEAD ATTENTION\r\n                             x 6 Encoder\r\n                                                 d       t                    Feed\r\n                                                  model  E                  Forward                                  COM1           [Q,K,V]=[WQ,WK,WV]\u00b7Y                                WEIGHT(sM)\r\n                                     Add & Layer Norm                                                                COM2                          P =KT \u00b7Q                             WEIGHT(dM)\r\n                                            Feed                       Add & Layer Norm                                                                              \u221a\r\n                                                                                                                     COM3                  S =Softmax(P/ d )\r\n                                          Forward                                                 seq                                                                     k\r\n                                                                           Multi-Head             to                 COM4                         Z0\u22127 = V \u00b7S                           WEIGHT(dM)\r\n                                                                            Attention             seq                                               O                   0\u22127\r\n                                     Add & Layer Norm                             E (d    x t )                      COM5                Z =W \u00b7Concat(Z                      )          WEIGHT(sM)\r\n                                                                                      model E\r\n                                         Multi-Head                      M(dmodel   [1,2,3 \u2026])                            3. RESIDUAL ADDITION AND LAYER NORMALIZATION\r\n                                          Attention                    Add & Layer Norm\r\n                                                 d       t                                                           LN                  Z =\u03b3Norm(Y +Z)+\u03b2\r\n                                                  model  E                Masked Multi-\r\n                                        Embedding &                      Head Attention                                                4. POSITION-WISE FEED FORWARD\r\n                                    Positional Encoding                      (dmodel  [1,2,3 \u2026])\r\n                                                                                                                                                             F1            F1\r\n                                                                                                                     FF1               Z =ReLU(W                 \u00b7 Z +b        )        WEIGHT(sM)\r\n                                        ichliebedich                      Embedding &                                FF2                     Z =WF2\u00b7Z+bF2                               WEIGHT(sM)\r\n                                                                       Positional Encoding\r\n                                Figure 2. Model architecture of the Transformer.                                   following the previous word. This process is repeated until\r\n                     on the PE index assigned to compute each row. 
In Step 4,                                      the end of the sentence (EOS) is decoded.\r\n                     the \ufb01rst columns of weight matrices for the 4 LSTM gates                                      Here we brie\ufb02y explain the computation patterns in the\r\n                     are encoded before the second columns are encoded. This                                       Transformer. Each encoder layer is composed of two sub-\r\n                     encoding order increases the probability of having non-zero                                   layers: multi-head self attention layer and position-wise\r\n                     weightvaluesinadjacentcolumns. IntheTransformerarchi-                                         fully-connected feed forward layer. Each decoder layer\r\n                     tecture, a similar approach can be applied to the multi-head                                  has one more sub-layer: masked multi-head attention. The\r\n                     attention case, in which an input is multiplied by multiple                                   masking ensures that the prediction of output word depends\r\n                     weight matrices. The \ufb01fth step (Step 5) is to transform the                                   on the previous output words only. All these layers are\r\n                     weight matrix so that rearrangement does not affect the out-                                  followed by the residual connection and layer normalization.\r\n                     come. HowtheRCSCformatisappliedtotheTransformer                                               Multi-head attention is the structure to measure the relation-\r\n                     is described in detail in Section 5.                                                          ship among words in the sentence. This process is divided\r\n                     2.3      Transformer Neural Network                                                           into \ufb01ve computations (COM1\u223c5) in Table 1. COM1 is a\r\n                                                                                                                   matrix-matrix multiplication that computes query (Q), key\r\n                     TheTransformer is one of the most popular neural machine                                      (K), and value (V ). COM2 is to compute the score which\r\n                     translation methods thanks to its superior performance and                                    represents how relevant each word is to other words. COM3\r\n                     the improved parallelism. Yet there is limited study on its                                   is to scale down the value in order to stabilize gradients\r\n                     computation patterns to design customized accelerators. In                                    during training (Vaswani et al., 2017). COM4 is to multiply\r\n                     this section, we provide a brief explanation of the com-                                      the result of COM3 by value (V ). COM5 is to concatenate\r\n                     putational characteristic of the Transformer, with the key                                    the results (Z0 - Z7) of each head and multiply the concate-\r\n                     computations summarized in Table 1. (We ask readers to                                        nated results by the weight matrix (WO) to mix them. In the\r\n                     refer Appendix A for more details.)                                       
                    position-wise feed forward network of each layer, two linear\r\n                     TheTransformer has the form of encoder-decoder (Fig. 2).                                      transformations are executed, which the \ufb01rst one involves\r\n                     One sentence composed of t                      words is represented by a                     Recti\ufb01ed Linear Unit (ReLU) activation. Residual addition\r\n                                                                 E                                                 and layer normalization are inserted after each (masked)\r\n                     d          \u00d7t matrix when the embedding and positional\r\n                       model          E                                                                            multi-head attention and feed forward network.\r\n                     encoding are \ufb01nished. The matrix of these symbol repre-\r\n                     sentations is computed over six encoder layers. When the                                      3      CHALLENGESFORTRANSFORMER\r\n                     encoding is \ufb01nished, the output containing the encoding                                              ACCELERATION\r\n                     information becomes a key-value pair of the multi-head at-\r\n                     tention in the decoder layers. While a whole input sentence                                   3.1     Limited Parallelism in Decoder\r\n                     is processed in parallel in the encoding layers, decoding of\r\n                     an output sentence is done word by word as the decoding                                       In the Transformer, the computation pattern in the encoding\r\n                     of each word requires the previously decoded words as the                                     stage is vastly different from the decoding stage. In the\r\n                     input. Thus, decoding for an encoded sentence requires                                        encodingstage, all the words in an input can be processed in\r\n                     repeated computations of all decoder layers. 
The output                                       parallel thanks to the attention-based layer structure \u2013 there\r\n                     from each decoding iteration is the probability of the word                                   is no dependency via hidden states across the time-steps\r\n                            OPTIMUS:OPTImizedmatrixMUltiplicationStructureforTransformerneuralnetworkaccelerator\r\n                 e6 Encoding                   40 Decoding                                100                       CSC         RCSC\r\n                mti5                           35                                         90\r\n                -n      CPU                    30                                         %]\r\n                u4      GPU                    25     CPU                                  [80\r\n                R                                     GPU                                 on\r\n                d3                             20                                         ati70\r\n                zeli                           15                                         iz                                   - 35% \r\n                a2                                                                        til60\r\n                mr                             10                                          U\r\n                o1                             5                                          C50\r\n                N0                             0                                          A\r\n                   4     18     32    46     60  4     18    32      46    60             M40\r\n                               (a)      Number of Words      (b)\r\n               Figure 3. The CPUandGPUprocessingtimefordifferentnumbers                   30   32     64    128    256    512   1024\r\n                                                                                                          Number of MACs\r\n               of words. (a) In encoding process, all words are computed in     Figure 4. Average MAC utilization for Transformer. The MAC\r\n               parallel. (b) In decoding process, word is sequentially decoded  utilization degrades signi\ufb01cantly as the number of MAC increases\r\n               one by one.                                                      in both CSC and RCSC formats.\r\n               in encoder. Therefore, one can exploit parallelism in the        In order to improve latency and throughput, accelerators\r\n               time-step dimension to accelerate the processing speed. For      need to have large number of MACs. However, as the\r\n               example, one can stack word vectors into an input matrix         number of MAC increases, the load imbalance and input\r\n               and employ matrix-matrix multiplication to reuse weight          load miss problems caused by the sparse matrix become\r\n               matrix and perform computation in parallel across the time-      more serious. Although the RCSC format mitigates these\r\n               step. Since the decoder shares the similar layer structure       problems somewhat, low MAC utilization still limits the\r\n               with the encoder, there is no hidden state dependency in it.     maximumperformanceinhardwareaccelerator when many\r\n               However, the decoder still suffers limited parallelism since     MACsareused(Fig.4). 
This paper proposes an extension\r\n               in the decoding stage the computation for the prediction at      to the RCSC format to maintain high utilization even when\r\n               one time-step depends on the prediction of all the previous      a large number of MACs is used. The detailed explanation\r\n               time-steps. Such dependency requires a feedback structure        will be given in Section 5.\r\n               in the computation along the time-step dimension, leading\r\n               to repetitive load of weight for each time-step and slow         4    SKIPPING REDUNDANT DECODING\r\n               speed even with the parallel processing units.\r\n               The challenge of the limited parallelism in the decoding              COMPUTATIONS\r\n               stage is demonstrated in Fig. 3, where the processing time       As discussed in Section 3, the computational complexity\r\n               for the encoding and decoding stages is compared for CPU         of decoding layers increases over time-step due to the feed-\r\n               (multi-thread) and GPU. In the encoding stage, the process-      back structure of the network. Note that in the decoding\r\n               ing time increases as the sentence length grows for CPU          stage the output word in the previous time-step comes in\r\n               while it is almost constant for GPU. This implies that the       as a new input token to the network, which is stacked into\r\n               amountofcomputation needed for more number of words              a input matrix. Input word or output word becomes to-\r\n               is fully parallelized using GPU once the weight is loaded.       ken as expressed as a vector that becomes the input of\r\n               Therefore, GPU can achieve high speedup over CPU when            encoder or decoder after the process of embedding and\r\n               encoding long sentences. In contrast, the speedup of GPU         positional encoding. Fig. 5a shows the detail computation\r\n               over CPU is much lower in the decoding stage. This indi-         procedure in Masked Multi-Head Attention layer in the de-\r\n               cates that the overhead of repetitive load of weight due to      coding stage (cf. Fig. B.1 for Multi-Head Attention layer).\r\n               the limited parallelism in decoder shows up as the number        Note that the input token at time-step t is being stacked to\r\n               of words increases and limits the effectiveness of the GPU       Y = [y ,y ,...,y ]. This stacking is necessary since the\r\n               implementation.                                                           1  2      t\r\n                                                                                correlation between K and Q is computed over the entire\r\n                                                                                time steps in COM2. Due to the stacking, the computational\r\n               3.2   LowMACUtilization                                          complexityaswellastheamountofdataneededforthecom-\r\n               In real-time applications, latency is a very important design    putation increase linearly as the time-step increases. This\r\n               speci\ufb01cation. 
For example, when the machine translation          results in quadratic increase of the total decoding operations\r\n               is applied to the simultaneous interpretation, the translation   as well as the data elements, as demonstrated in Fig. 6 for\r\n               latency of each sentence (batch size = 1) must be very short.    various sentence length.\r\n               Ontheotherhand,whenmultipleusersperformtranslations              However, if we carefully investigate the computation proce-\r\n               (batch size > 1) via a server at the same time, throughput       dure in the decoding stage, it can be noticed that the unique\r\n               for multiple batches becomes an important speci\ufb01cation.          information added at each time step is constant except for\r\n               In summary, reducing latency when processing a single            COM2andCOM4,ashighlightedinFig.5b. Morespeci\ufb01-\r\n               batch and increasing throughput when processing multiple         cally, if we maintain K and V for all the previous time-steps,\r\n               batchesareoneofthekeydesignissuesinacceleratordesign.            wecancomputeCOM2andCOM4withoutperformingre-\r\n                                                       OPTIMUS:OPTImizedmatrixMUltiplicationStructureforTransformerneuralnetworkaccelerator\r\n                                 Masked                      COM5Concatenate                                                                                        Masked                       COM5Concatenate\r\n                                 Multi-Head Attention                                                                                                               Multi-Head Attention \r\n                                                                                                                                                                    (Skipping \r\n                                                                                                                                                                    Redundant                                                                        \r\n                                                              Wo(d        x d      )                                 Z (d       x t)                                Computations)                Wo(d        x d      )                              Z (d       x 1)\r\n                                                                    model    model                                        model                                                                         model    model                                    model\r\n                                head0 -7                                                                                                                            head0 -7\r\n                                  COM2                   q1 q2 q3          Masked           COM3            COM4               Masked                                 COM2                   q3                                COM3             COM4\r\n                                    k1                                                        Divide                                                                   k1                                                         Divide \r\n                                    k2                                                        by                                                                       k2                    
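A minimal NumPy sketch of the skipping scheme in Fig. 5b is given below, under simplified shapes of our own choosing: K and V of the previous time-steps stay in a local buffer, only q_t, k_t and v_t are computed for the newly decoded token (COM1), and COM2/COM4 reduce to matrix-vector products against the cached K and V.

import numpy as np

d_model, d_k, d_v = 16, 4, 4
rng = np.random.default_rng(1)
WQ, WK, WV = (rng.standard_normal((d, d_model)) for d in (d_k, d_k, d_v))

def masked_attention_step(y_t, K_cache, V_cache):
    """One decoding time-step with redundant computation skipped:
    only the current token is projected (COM1), and COM2/COM4 shrink
    to matrix-vector products against the cached K and V."""
    q_t, k_t, v_t = WQ @ y_t, WK @ y_t, WV @ y_t           # COM1, one column
    K_cache = np.concatenate([K_cache, k_t[:, None]], axis=1)
    V_cache = np.concatenate([V_cache, v_t[:, None]], axis=1)
    p = K_cache.T @ q_t                                     # COM2: t scores
    s = np.exp(p / np.sqrt(d_k)); s /= s.sum()              # COM3
    z_t = V_cache @ s                                       # COM4: (d_v,)
    return z_t, K_cache, V_cache

K_cache, V_cache = np.zeros((d_k, 0)), np.zeros((d_v, 0))   # kept locally
for step in range(3):                                       # time-steps 1..3
    y_t = rng.standard_normal(d_model)
    z_t, K_cache, V_cache = masked_attention_step(y_t, K_cache, V_cache)
    print(step, z_t.shape, K_cache.shape)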
Figure 5. Comparison of the computing flows for the masked multi-head attention between (a) the conventional flow with on-the-fly iterative computations and (b) the proposed flow with redundant computation skipping.
the storage needed for intermediate activation is (almost) independent of t (i.e., the size of the buffer needed for keeping Z0[dv x 1] is independent of t and P is typically smaller than Z0). 
This implies that one can assign a fixed buffer size to keep all the intermediate activation locally and avoid DRAM memory access.
Figure 6. (a) Comparison of the number of decoding operations depending on skipping computation. (b) Comparison of partial data size depending on skipping computation. (The annotations in the figure indicate reductions of 86.11% in decoding operations and 85.45% in partial data size when skipping is applied.)
The third implication is that the computation pattern in decoding is changed from Matrix-Matrix to Matrix-Vector multiplication. This change becomes a serious issue for GPU. As demonstrated in Section 7, GPU cannot exploit the benefit of skipping redundant decoding computation as it suffers from seriously low utilization for Matrix-Vector computation. In contrast, custom hardware tends to maintain the utilization rate for Matrix-Vector computation as well, and thus the reduced computational complexity from skipping redundant decoding computation can be fully exploited. Also, note that the use of sparse matrix for computation in hard-
                                         ware can further reduce the overhead of weight load and\r\n                              dundant computation of re-creating them in COM1. Note                                                                               makeMatrix-Vector multiplication more ef\ufb01cient.\r\n                              that computation in COM1, COM3 and COM5 takes time-                                                                                 Wenotice that OpenNMT (Klein et al., 2017) also employs\r\n                              step as an independent dimension. Therefore, once K and                                                                             the concept of skipping redundant decoding computation\r\n                              V of the previous time-step are loaded, token vectors only                                                                          in its Pytorch implementation. But the performance gain\r\n                              for the current time-step, i.e., q ,k ,v , need to be newly\r\n                                                                                                 t      t      t                                                  is limited for the reason we discussed above. In Section 7,\r\n                              computed to produce z , which will be used as the new\r\n                                                                                t                                                                                 weshowthattheimpactofredundant computation skipping\r\n                              token for the next layer.                                                                                                           is much larger in the proposed custom accelerator than in\r\n                              This change allows us to skip redundant decoding computa-                                                                           GPU.\r\n                              tion, and there are three implications with it. First, since K\r\n                              and V of previous time steps are loaded (rather than com-                                                                           5         SET-ASSOCIATIVE RCSC (SA-RCSC)\r\n                              puted on the \ufb02y), it increases memory load overhead. But\r\n                              its overhead is much smaller compared to loading weights,                                                                           Asexplained in Section 2.2, the RCSC format (Park et al.,\r\n                              since the typical size of K and V (e.g., K[d \u00d7 t]) is much\r\n                                                                                                                         k                                        2019) is a sparse matrix format that mitigates problems\r\n                              smaller than the weight (e.g., WK[d \u00d7 d                                                                  ]) where\r\n                                                                                                              k            model                                  with sparse matrix-vector multiplication (sM\u00d7dV) such as\r\n                              t < dmodel(= 512). Furthermore, there are savings as we                                                                             PEload imbalance and input load miss. 
While the RCSC format was originally proposed to increase the PE utilization for LSTM by exploiting unique characteristics of LSTM, it can actually be applied to any neural network in which an input is multiplied by multiple weight matrices. Since the Transformer also has such characteristics, we extend

need to keep the input token for the next layer Y just for one time-step. Therefore, the overall increase of memory load overhead is small. Second, this change opens up the possibility of keeping intermediate activation fully local. As shown in Fig. 5b,

[Figure: example of SA-RCSC format generation from an original weight matrix, comparing SA = 1 (the conventional RCSC format) with SA = 2 (the Set-Associative RCSC format). The figure shows the per-row non-zero counts, the row-to-PE (or row-to-set) assignment, the rearranged weight matrix, the encoded Value / Re_R_id / Col_id / Col_len arrays, the network transformation step, and the resulting weight assignment to PEs.]
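The remainder of the SA-RCSC description continues beyond this excerpt, so the sketch below only illustrates the load-balancing idea stated earlier (counting non-zeros per row and distributing rows evenly, extended so that a set of PEs can share one row). The greedy assignment rule is our own simplification for illustration, not the exact encoding used by the format.

import numpy as np

def assign_rows(weight, num_pes, set_size=1):
    """Greedy balancing of row workloads (non-zero counts) over
    num_pes // set_size sets; with set_size > 1 several PEs share one
    row, which is the set-associative idea behind SA-RCSC."""
    num_sets = num_pes // set_size
    nnz_per_row = np.count_nonzero(weight, axis=1)   # Step 1: row loads
    order = np.argsort(nnz_per_row)[::-1]            # heaviest rows first
    load = np.zeros(num_sets)
    row_to_set = {}
    for r in order:                                   # Step 2: distribute
        s = int(np.argmin(load))                      # least-loaded set
        row_to_set[int(r)] = s
        load[s] += nnz_per_row[r]
    return row_to_set, load

rng = np.random.default_rng(2)
w = rng.standard_normal((8, 8)) * (rng.random((8, 8)) < 0.3)  # ~70% pruned
print(assign_rows(w, num_pes=4, set_size=1))   # conventional: one PE per row
print(assign_rows(w, num_pes=4, set_size=2))   # set-associative: 2 PEs per row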
                                                                                                                                                                                  7,7 Step5                                               PE addr\r\n                                                                                                                                            0    2   0      1    7                                                                                                 3     w w w w\r\n                                                                               Layer0 (decoder)             Layer1 (decoder)                                            Rearranged Weight Matrix                      Network Transformation                              3,4  0,5 3,6  7,7\r\n                                                                                 (a)                                                                                                                      (c)\r\n                             Figure 7. (a) The process of concatenating weights to apply the RCSC format. (b) The process of generating the conventional RCSC\r\n                             format (SA = 1). (c) The process of generating the proposed SA-RCSC format (SA = 2).\r\n                             the RCSCformattoexpress the sparse weight matrices of                                                                             relatively high probability to share the same input vector.\r\n                             the Transformer. We also propose the SA-RCSC format to                                                                            Let us show an example using a simple LSTM accelerator\r\n                             improvethePEutilization rate which tends to degrade when                                                                          with four PEs (Fig. 7). Step 1 for the SA-RCSC is same as\r\n                             the original RCSC format is used for large number of PEs.                                                                         that of the conventional RCSC. The number of nonzero ele-\r\n                                                                                                                                                               mentsineachrowiscountedtoassessthecomputationload.\r\n                             5.1         Generalizing RCSC for Transformer                                                                                     In step 2, the procedures for conventional RCSC and the\r\n                             Theprocess of generating the RCSC format has two main                                                                             proposed SA-RCSCstart to differ. In conventional RCSC,\r\n                             goals. The \ufb01rst goal is to assign the non-zero values to                                                                          four PE indices are sequentially assigned to the rows sorted\r\n                             the PEs evenly, so that the computational load of the PE is                                                                       in descending order of computation load (Step 2 in Fig. 7b).\r\n                             similar to each other (Step 2 in Fig. 1). 
The second goal is                                                                      Ontheother hand, in SA-RCSC, only two set indices are\r\n                             to reduce the input load miss by successively encoding the                                                                        sequentially assigned to the rows if the set associativity (SA)\r\n                             samecolumnsoftheweightmatrices for all the gates which                                                                            is 2 (Step 2 in Fig. 7c). If the SA is 4, only one set index\r\n                             share the same input vector (Step 4 in Fig. 1). Note that                                                                         would be assigned in the step 2. After the set indices (num-\r\n                             the weight matrix (WQ, WK, WV)of(masked)multi-head                                                                                ber of PEs in the accelerator / SA) are sequentially assigned\r\n                             attention in the Transformer is also multiplied by the same                                                                       from top rows, the next row with the largest number of\r\n                             input vector and there are 8 heads which share the same                                                                           non-zero values is assigned to the set index with the least\r\n                             input, so that the locality of the loaded input vector is higher                                                                  computation load. In step 3 of the SA-RCSC, the pair of set\r\n                             than that of LSTM.                                                                                                                index and row index is sorted so that the set index is to be in\r\n                                                                                                                                                               circular order to easily decode the set index assigned in the\r\n                                                                                                                                                               row. In step 4, the \ufb01rst column of eight heads of WQ, WK,\r\n                             5.2         SA-RCSCforLarge-ScalePEs                                                                                              WV issuccessively encoded, and then the second column is\r\n                             IntheconventionalRCSCformat,onePEisassignedtoeach                                                                                 sequentially generated in RCSC format. In step 5, network\r\n                             row. If the number of PEs is much larger than the number                                                                          transformation is performed to keep the same output results\r\n                             of rows in the matrix, the number of rows processed by one                                                                        regardless of rearrangement the weight matrix in step 3.\r\n                             PEbecomes smaller. 
If the number of rows processed by                                                                             Conventional RCSC and SA-RCSC formats are clearly dis-\r\n                             one PE is too small, the locality of the input vector tends to                                                                    tinguished when non-zero elements are assigned to PEs. In\r\n                             become low as the locality becomes more sensitive to the                                                                          conventional RCSC, non-zero elements are assigned to PE\r\n                             distribution of non-zero elements in the row.                                                                                     according to the decoded PE index by modulo operation. In\r\n                                                                                                                                                               the table showing weight assignment to PE in Fig. 7b, w4,0,\r\n                             To mitigate this problem, we propose the SA-RCSC, in                                                                              w4,4, w4,2, w4,6 are assigned to PE0. On the other hand, in\r\n                             whichasetofPEsinsteadofonePEisassignedtoeachrow.                                                                                  SA-RCSC,non-zero elements with a decoded set index 0\r\n                             With the proposed concept, the number of rows per set can                                                                         are assigned to PE0 and PE2 alternately as they are in the\r\n                             be made relatively large so that the locality of input vector                                                                     sameset. In the table showing weight assignment to PE in\r\n                             for the sets becomes higher. And, by assigning the weights                                                                        Fig. 7c, w               , w         , w         , w          are assigned to PE0 and w                                 ,\r\n                             to the PEs in a set alternately, the PEs in a set can have                                                                                            4,0        1,1         4,2         5,3                                                        4,4\r\n                                                                                                                                                               w1,5, w4,6, w5,7 are assigned to PE2. 
Similarly, a non-zero\r\n                                                     OPTIMUS:OPTImizedmatrixMUltiplicationStructureforTransformerneuralnetworkaccelerator\r\n                                                                             OFF-CHIP DRAM                                                           PE ARRAY                                                      weight data, K, V\r\n                                                                                                                                                                                                    input data (dense mode)                    MAC            R\r\n                                                                                                                                                                g_buf3                                                                                        DE\r\n                                                                                                             SA-RCSC                   input data                                                               i_reg             R C V                       D\r\n                                                                                                                                    (sparse mode)          x    g_buf2            _a                                                                          A\r\n                                        OPTIMUS                                                                Format                                      mu                     ux                               R C V          R C V w_fifo\r\n                                                                                                                                                           ed   g_buf1            m                            b                 c\r\n                                                                                                                                                                                                                                     Mult     Add\r\n                                             FORMAT DECODER                              WEIGHT MEM                                                             g_buf0                                         x_                _      W\r\n                                                                                             ]      ]              ]                                                                                           u                 mux\r\n                                                                                                                   KB      N             32                  (g_buf0:          comp                            m\r\n                                           POSITION MEM [80 KB]                              5 KB   5 KB           75      LE                           input (R,C,V) x 4)      _out \r\n                                                                                             7.     .7             [9.     
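To make the five steps above concrete, the short Python sketch below builds an SA-RCSC-style assignment for a small sparse matrix. It is a minimal illustration of the load-balancing and set-interleaving ideas only, under simplifying assumptions: the function name build_sa_rcsc and the tie-breaking details are ours, and the network transformation of step 5 is omitted.

import numpy as np

def build_sa_rcsc(W, num_pes=4, sa=2):
    # Toy SA-RCSC-style assignment: rows are mapped to sets in a
    # load-balanced way, then the non-zeros of each column alternate
    # between the PEs of the owning set (steps 3-4 simplified).
    num_sets = num_pes // sa
    nnz_per_row = (W != 0).sum(axis=1)            # step 1: per-row computation load
    order = np.argsort(-nnz_per_row)              # rows sorted by descending load
    set_of_row = np.empty(W.shape[0], dtype=int)
    set_load = np.zeros(num_sets, dtype=int)
    for k, r in enumerate(order):
        # step 2: the first num_sets rows take sets 0..num_sets-1 in order;
        # every later row goes to the currently least-loaded set.
        s = k if k < num_sets else int(np.argmin(set_load))
        set_of_row[r] = s
        set_load[s] += nnz_per_row[r]
    assignment = {pe: [] for pe in range(num_pes)}
    turn = np.zeros(num_sets, dtype=int)
    for c in range(W.shape[1]):                   # column-major traversal
        for r in range(W.shape[0]):
            if W[r, c] != 0:
                s = set_of_row[r]
                pe = s + num_sets * (turn[s] % sa)   # PEs {s, s + num_sets, ...} form set s
                assignment[pe].append((r, c, W[r, c]))
                turn[s] += 1
    return assignment

# Example: an 8x8 matrix at roughly 70% sparsity, four PEs, SA = 2.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) * (rng.random((8, 8)) > 0.7)
for pe, elems in build_sa_rcsc(W).items():
    print(pe, [(r, c) for r, c, _ in elems])

With SA = 1 the same sketch roughly degenerates to the conventional per-PE assignment, which is one way to see that SA-RCSC changes only how rows are grouped, not what is computed.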
Figure 8. (a) The overall architecture of OPTIMUS, a high-performance Transformer inference engine. (b) The control flow of OPTIMUS. Dense matrix multiplication is colored in green, and sparse matrix multiplication is colored in blue.

6 PROPOSED HARDWARE ARCHITECTURE

6.1 Overall Architecture of OPTIMUS

The overall architecture of OPTIMUS, a customized system for high-performance Transformer inference, is shown in Fig. 8a.

The PE array consists of N = 1024 PEs, each of which is equipped with a MAC unit as well as internal buffers for temporarily staging weight, input, and partial-sum data. A PE has two data paths to support matrix computation for both sparse and dense weights. In the case of a sparse weight (sparse mode), the hierarchical input buffer (g_buf and i_buf) (Park et al., 2019) is used to widen the search window for the input vector, thereby reducing the input load miss rate caused by indexing sparse weights. In the case of a dense weight (dense mode), however, the hierarchical buffer is inefficient since it incurs unnecessary delay to fill it with the shared input. Therefore, in dense mode the input streams into i_reg (instead of i_buf) to be multiplied directly with the dense weight. To support SA-RCSC, the partial sums of the PEs within a set are added via an adder tree. This across-PE accumulation is not needed for the conventional RCSC. See Appendix C for a detailed explanation of how OPTIMUS handles sparse and dense matrix multiplication.

OPTIMUS is also equipped with shared data buffers for inputs and weights. The WEIGHT MEM of 1.2MB (multiple banks of 4.8KB) is used to double-buffer the weights as well as the K, V matrices used for skipping redundant decoding computation. Thanks to pruning, the WEIGHT MEM requirement for double-buffering the entire weights of a layer is reduced to 30% of the dense weight matrix (4MB). INPUT MEM also consists of multi-bank SRAMs to separately buffer inputs and partial sums. Its size is set to stage in at most 4 copies of inputs and partial sums, specifically targeting the single-batch use case of the decoder: four beams of inputs and partial sums can be kept entirely in INPUT MEM, so the overhead of accessing DRAM to load/store them is avoided. This results in remarkable inference performance for the Transformer, as demonstrated in Section 7.

6.2 Supporting Diverse Matrix Computations

OPTIMUS is designed to achieve high performance for all kinds of matrix multiplications in the Transformer. In particular, OPTIMUS can achieve near-peak utilization both for the matrix-matrix multiplication in the encoder and for the matrix-vector multiplication in the decoder with redundant computation skipping. In the case of matrix-vector multiplication, SA-RCSC enables balanced parallelization of the dot-product computations across the rows of the weights, achieving high utilization even with a large number of PEs (N = 1024).
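As a rough functional model of this set-based parallelization (not of the actual RTL), the sketch below spreads the non-zeros of each weight row over the PEs of one set, lets every PE accumulate a private partial sum, and then merges the partial sums the way the adder tree described in Section 6.1 would. The function name spmv_set_parallel and the sizes are illustrative assumptions.

import numpy as np

def spmv_set_parallel(W, x, sa=8):
    # Toy model of SA-RCSC execution for a matrix-vector product: the
    # non-zeros of a row are taken in turn by the sa PEs of its set,
    # and an adder tree reduces their partial sums into one output element.
    y = np.zeros(W.shape[0])
    for r in range(W.shape[0]):
        cols = np.nonzero(W[r])[0]
        partial = np.zeros(sa)                     # one accumulator per PE in the set
        for k, c in enumerate(cols):
            partial[k % sa] += W[r, c] * x[c]      # alternate non-zeros across the set
        y[r] = partial.sum()                       # across-PE reduction (adder tree)
    return y

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 16)) * (rng.random((16, 16)) > 0.7)
x = rng.standard_normal(16)
assert np.allclose(spmv_set_parallel(W, x, sa=4), W @ x)

Functionally the result is identical to an ordinary matrix-vector product; the benefit of the set-based assignment is purely in how evenly the work and input accesses are spread across PEs.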
In the case of matrix-matrix multiplication, OPTIMUS utilizes a customized dataflow to maximize weight reuse: the weights loaded into WEIGHT MEM are fully reused over all the partial sums 1) across the samples in a batch, 2) across the time steps (in the encoder), and 3) across the beams (in the decoder) via INPUT MEM and the partial-sum buffers. Please refer to Fig. C.1 for more details on the dataflow.

This increase in weight reuse comes at the cost of increased DRAM access overhead for loading/storing inputs and partial sums. However, such overhead is relatively small compared to loading the weights and the K, V matrices. Note that the size of the K and V matrices also increases with the increased weight reuse, but they are double-buffered along with the weight load, hiding their overhead behind the computation cycles. Together with the dedicated data paths for supporting sparse and dense weight matrices (as discussed in the previous section), OPTIMUS can achieve high utilization for the four different matrix computations of the Transformer.

6.3 Control Flow for Hiding Data Transfer Overhead

One of the key challenges in achieving high performance for Transformer inference is hiding the DRAM access overhead for its large model data. In OPTIMUS, we carefully designed a control flow for double buffering (via finite state machines) to match the computation and data load cycles. As an example, Fig. 8b illustrates the weight fetch scheduling for a multi-head attention layer. The computation sequence is grouped into 6 states, where each state is associated with a set of computations along with the weights to be prefetched during it. Note that the computation and the data load cycles can be estimated given a word length; i.e., the computation cycle count for COM1 is (d_model^2 × t) / (#MAC × effective PE utilization), whereas the data transfer cycle count for WO is (d_model^2 × sparsity (dense = 1.0)) / bandwidth. By employing this cycle estimation and by considering the data dependency between the prefetched weights and the computations, we balanced the weight prefetch cycles and the compute cycles for all the states. As a result, we measured that the spill-over cycles due to non-overlapped weight double-buffering were only 4.7% of the total computation.
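The cycle estimate above can be made concrete with a short calculation. The sketch below uses illustrative assumptions for the model dimension, effective PE utilization, and DRAM bandwidth (they are not measured OPTIMUS parameters); only the MAC count (1024), the average sentence length (t = 27), and the roughly 77% pruning rate come from this paper.

# Rough cycle-balance estimate for one state of the control flow in Fig. 8b.
# The concrete numbers below are illustrative assumptions, not OPTIMUS specs.
d_model = 512           # assumed Transformer model dimension
t = 27                  # average number of words per sentence
num_mac = 1024          # number of MACs in the PE array
pe_util = 0.85          # assumed effective PE utilization
sparsity = 0.23         # fraction of weights kept after ~77% pruning (dense = 1.0)
words_per_cycle = 64    # assumed weight words delivered from DRAM per cycle

compute_cycles = (d_model ** 2 * t) / (num_mac * pe_util)       # computation cycles for one state (COM1)
transfer_cycles = (d_model ** 2 * sparsity) / words_per_cycle   # cycles to prefetch the next weight (WO)

print(f"compute  ~ {compute_cycles:,.0f} cycles")
print(f"transfer ~ {transfer_cycles:,.0f} cycles")
# If transfer_cycles <= compute_cycles, prefetching the next state's weights
# hides completely behind the current state's computation.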
7 EXPERIMENTAL RESULTS

7.1 Experimental Setup

To evaluate the performance of OPTIMUS, WMT15 (EN-DE) (Sebastien Jean & Bengio, 2015), a representative benchmark dataset for the Transformer, was used. For the evaluation of the accuracy degradation due to pruning, the bilingual evaluation understudy (BLEU) score (Papineni et al., 2002) is used. We evaluated the latency and throughput of OPTIMUS as the average over 3200 sentences of different lengths. Since it takes too long to run such experiments in RTL simulation, we devised a cycle-accurate simulation model whose cycle-by-cycle behavior is validated against the RTL simulation for the core PE block (including SA-RCSC-based data fetch, MAC operation, and partial-sum reduction). The precision of all the data used in the MAC/layer-norm/softmax units is 16-bit fixed point, except for the accumulation in the MAC (32-bit, then rounded). The row index for SA-RCSC is 11 bits.

The weight matrix trained with PyTorch on the GPU was pruned using the well-known magnitude-based pruning to reduce the amount of data (Han et al., 2015b). The average pruning rate over all layers is 77.25%, which makes the amount of weight data stored in the SA-RCSC format 71.65% smaller than that of the dense matrix. The accuracy in terms of BLEU decreased by 1.92% after the pruning. The detailed layer-by-layer description of the Transformer model and its pruned network is given in Appendix D.

The hardware setup for running inference of the Transformer is as follows. The CPU result is measured from inference using an Intel(R) i7-6900K CPU @ 3.20GHz, and the GPU result is measured using an NVIDIA Titan Xp with the latest CUDA kernel. The Neural Machine Translation Toolkit (Klein et al., 2017) is used for both the CPU and GPU experiments. To the best of our knowledge, hardware accelerators dedicated to the Transformer neural network have not been reported yet. Thus, we designed a custom Transformer hardware baseline and applied the CSC, RCSC, and SA-RCSC formats to its weight data to see the effects of the different sparse matrix formats. In addition, the redundant computation skipping is intentionally disabled/enabled to see its impact.
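For reference, magnitude-based pruning of the kind cited above (Han et al., 2015b) can be sketched in a few lines of Python. The 77.25% target matches the average rate reported here, but the single global threshold below is a generic illustration, not the exact per-layer procedure used for OPTIMUS.

import numpy as np

def magnitude_prune(weights, prune_rate=0.7725):
    # Zero out the smallest-magnitude entries so that roughly prune_rate
    # of the weights are removed (generic sketch of magnitude pruning).
    flat = np.abs(weights).ravel()
    k = int(prune_rate * flat.size)
    threshold = np.partition(flat, k)[k]      # k-th smallest magnitude
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

rng = np.random.default_rng(2)
W = rng.standard_normal((512, 512))
W_pruned, mask = magnitude_prune(W)
print(f"kept {mask.mean():.2%} of the weights")   # about 22.75% remain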
7.2 MAC Utilization

In accelerators which consist of a large number of MACs, it is important to maintain high MAC utilization for low latency and high throughput. However, as mentioned in Section 3.2, a sparse matrix encoded in the CSC and RCSC formats suffers from low utilization when the number of MACs is large. Simulation results confirm that the proposed SA-RCSC format maintains much higher MAC utilization when the number of MACs is large (Fig. 9). Note that, with 1024 MACs, the SA-RCSC format with SA = 8 shows an almost twice higher MAC utilization rate than the conventional RCSC (the SA = 1 case). The MAC utilization increases as SA increases, but it saturates when SA > 8 because the number of non-zero elements assigned to each PE is already relatively even at that point.

Figure 9. MAC utilization for various numbers of MACs and set associativities (SA). The proposed SA-RCSC maintains a very high MAC utilization rate even with a large number of MACs.

7.3 Latency

In real-time processing applications, the latency of single-batch processing is one of the most important design parameters. As mentioned in Section 3, most of the computation time is spent on decoding because of the sequence-to-sequence structure (Fig. 10). The decoding processing time can be reduced by skipping redundant computations. The effect of redundant computation skipping varies from one hardware platform to another, as mentioned in Section 4. With the skipping, the inference latency becomes 16.01× smaller in the custom hardware, but the latency reductions are only 2.54× and 1.09× on the CPU and GPU, respectively (Fig. 10). In addition to the redundant computation skipping, the proposed SA-RCSC format gives an additional 1.62× reduction in latency thanks to the higher MAC utilization.

Figure 10. The inference latency of various hardware. The latency is measured for the average number of words (t = 27) for a batch size of 1 and a beam size of 4.

For encoding, the GPU processing time can be shorter than the OPTIMUS processing time when the number of words in one sentence is very large, because GPU utilization can be maximized in the parallel encoding process (Fig. 11a). However, most of the computation time is spent on decoding, where the performance of OPTIMUS is significantly better than that of the GPU and CPU (Fig. 11b). In the decoding process, the processing time increases with the number of words on any hardware platform because of the iterative decoding characteristics. The performance gap between OPTIMUS and the CPU/GPU grows as the number of words increases, thanks to the efficient vector-matrix multiplication in the custom hardware, which boosts the effectiveness of redundant computation skipping.

Figure 11. The processing time for (a) encoding and (b) decoding depending on the number of words on various hardware.

7.4 Throughput

In server systems or multi-user scenarios, the throughput analysis is important for batch sizes greater than 1. Fig. 12 shows the comparison of the throughput among the CPU, GPU, and the proposed hardware. Here, the throughput is defined as the number of translated sentences per second (sentences/s), which is calculated by dividing the number of translated sentences by the processing time including DRAM access. Thanks to the combination of weight pruning, SA-RCSC, and computation skipping, the processing time becomes very short, so the throughput of OPTIMUS is much higher than that of the CPU and GPU for any batch size.

Figure 12. The throughput of various hardware for the batch size from 1 to 32.

The throughput of the GPU increases with the number of batches because the MAC utilization increases and the weight data are reused as the batch size increases. On the other hand, the increase in throughput is relatively small for OPTIMUS because its MAC utilization rate remains almost the same regardless of the batch size. The modest throughput increase of OPTIMUS with the increased batch size mostly comes from the weight reuse in the multi-batch scenario.

Note that OPTIMUS shows exceptionally high performance in the single-batch case because we designed the hardware to keep all intermediate computation results local so that time-consuming DRAM access can be completely eliminated. This unique feature makes OPTIMUS an excellent candidate for real-time applications, where the latency of single-batch inference is very important.

Meanwhile, the more widely used effective throughput (OPS) (Gao et al., 2018) is defined as the total number of operations to fully encode and decode a sentence divided by the processing time. The effective throughput of OPTIMUS is 500.05 GOPS, but we could not measure the OPS for the CPU and GPU, so a direct comparison is not possible, unlike with the sentences/s metric.
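The two throughput metrics reduce to simple arithmetic. In the sketch below, the sentence count, processing time, and per-sentence operation count are placeholders chosen only to show the calculation; they are not measured results.

def sentences_per_second(num_sentences, processing_time_s):
    # Throughput as defined above: translated sentences divided by the
    # processing time (which includes DRAM access).
    return num_sentences / processing_time_s

def effective_gops(ops_per_sentence, num_sentences, processing_time_s):
    # Effective throughput (OPS): total operations to fully encode and
    # decode the sentences, divided by the processing time.
    return ops_per_sentence * num_sentences / processing_time_s / 1e9

# Hypothetical example: 3200 test sentences in 30 s, ~5 GOP per sentence.
print(sentences_per_second(3200, 30.0))     # ~106.7 sentences/s
print(effective_gops(5e9, 3200, 30.0))      # ~533 GOPS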
7.5 Power Consumption and Energy Efficiency

For the power analysis, we synthesized OPTIMUS in a 28nm CMOS technology running at 200MHz with 1.0V. The area and power consumption of the on-chip components in OPTIMUS were extracted using the Synopsys Design Compiler and are shown in Table 2. While the memory part occupies the largest area, the power consumption is dominated by the MACs, as expected.

Table 2. Area and power consumption of OPTIMUS core blocks.

  COMPONENT        AREA [µm^2]              POWER [mW]
  TOP CONTROL            54348  (1.05%)      10.43  (1.42%)
  MEMORY               2759794 (53.21%)      57.24  (7.82%)
  G_BUF                   1577  (0.03%)       0.35  (0.05%)
  PERIPHERAL             23244  (0.45%)       9.57  (1.31%)
  1024 PEs
    CONTROL             325758  (6.28%)      34.13  (4.66%)
    MACS               1930187 (37.22%)     598.16 (81.73%)
    I_BUF                91294  (1.76%)      21.96  (3.00%)
  TOTAL              5186201.4   (100%)     731.84   (100%)

The CPU power measured by the likwid power meter (Treibig et al., 2010) is 50.46W, the GPU power measured by NVIDIA-SMI is 53.4W, the custom hardware consumes 731.84mW, and the DRAM power (196.3mW) was adopted from the Micron power calculator (Micron Technology, 2017). The total energy accounts for both the accelerator and DRAM energy consumption. The energy consumed by the DRAM is calculated by multiplying the total amount of DRAM data accessed by the energy per unit bit (39 pJ/bit (Pawlowski, 2011)). There is an orders-of-magnitude difference between the energy consumption of OPTIMUS and that of the CPU/GPU (Fig. 13). This is because OPTIMUS finishes the inference operations much faster with lower power. As the batch size increases, the energy tends to decrease on all hardware due to weight data reuse (Fig. 13). Although the largest energy reduction with increased batch size is achieved on the GPU, OPTIMUS consumes the smallest energy for any batch size. Thanks to the high throughput and small energy consumption, OPTIMUS shows 1464× and 155× higher energy efficiency (sentences/J) than the GPU for the single-batch case and for a multi-batch case with batch size = 32, respectively (Fig. 14).

Figure 13. The energy consumption expressed in log scale to process the test set (3200 sentences) for the batch size from 1 to 32.

Figure 14. The energy efficiency of processing the test set (3200 sentences) for the batch size from 1 to 32.
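The energy bookkeeping described above amounts to a few multiplications. The sketch below uses the accelerator power and the 39 pJ/bit DRAM energy figure quoted in this section, but the runtime and DRAM traffic are placeholders, so the printed numbers are illustrative only.

def total_energy_joules(runtime_s, dram_bits_accessed,
                        core_power_w=0.73184, dram_energy_per_bit_j=39e-12):
    # Accelerator energy (power x time) plus DRAM energy (bits x 39 pJ/bit).
    return core_power_w * runtime_s + dram_bits_accessed * dram_energy_per_bit_j

def sentences_per_joule(num_sentences, energy_j):
    # Energy efficiency as plotted in Fig. 14.
    return num_sentences / energy_j

# Hypothetical run: 3200 sentences in 30 s with 10 Gbit of DRAM traffic.
energy = total_energy_joules(30.0, 10e9)
print(f"{energy:.2f} J, {sentences_per_joule(3200, energy):.1f} sentences/J")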
In addition, a SA-RCSC\r\n                       CMOStechnologyrunningat200MHzwith1.0V.Thearea                                                          format was proposed to maintain high MAC utilization even\r\n                       andpowerconsumptionoftheon-chipcomponentsinOPTI-                                                      whenalargenumberofMACsaredesignedintheaccelera-\r\n                       MUSareextracted using Synopsys design compiler and the                                                 tor. These make latency, throughput, and energy ef\ufb01ciency\r\n                       data are shown in Table 2. While the memory part occupies                                              of OPTIMUSmuchbetterthanCPU,GPUandconventional\r\n                       the largest area, power consumption is dominated by MACs                                               custom hardware.\r\n                       as expected.\r\n                       The CPU power measured by the likwid power me-                                                         ACKNOWLEDGEMENTS\r\n                       ter (Treibig et al., 2010) is 50.46W, GPU power measured                                              This research was supported by the MSIT(Ministry of Sci-\r\n                       by the NVIDIA-SMI is 53.4W and the custom hardware                                                     ence and ICT), Korea, under the ICT Consilience Cre-\r\n                       consumes 731.84mWandDRAMpower(196.3mW)was                                                              ative program (IITP-2019-2011-1-00783) supervised by the\r\n                       adopted from the Micron power calculator (Micron Tech-                                                 IITP(Institute for Information & communications Technol-\r\n                       nology, 2017). Total energy accounts for both acceler-                                                 ogyPromotion).\r\n                       ator and DRAM energy consumption. The energy con-\r\n                       sumed by a DRAM is calculated by multiplying the total                                                 REFERENCES\r\n                       amount of DRAMdataaccess by the energy per unit bit (39\r\n                       pJ/bit (Pawlowski, 2011)). There is orders-of-magnitude                                                Bahdanau, D., Cho, K., and Bengio, Y. Neural machine\r\n                       difference between the energy consumption in the OPTI-                                                     translation by jointly learning to align and translate. Inter-\r\n                            OPTIMUS:OPTImizedmatrixMUltiplicationStructureforTransformerneuralnetworkaccelerator\r\n                 nationalConferenceonLearningRepresentations(ICLR),             Park, J., Kung, J., Yi, W., and Kim, J.-J. Maximizing system\r\n                 2015.                                                             performance by balancing computation loads in LSTM\r\n               Cao,S., Zhang, C., Yao, Z., Xiao, W., Nie, L., Zhan, D., Liu,       accelerators. In Design, Automation Test in Europe Con-\r\n                 Y., Wu, M., and Zhang, L. Ef\ufb01cient and effective sparse           ference Exhibition (DATE), 2018. ISBN 978-3-9819263-\r\n                 lstm on fpga with bank-balanced sparsity. In Proceed-             0-9.\r\n                 ings of the 2019 ACM/SIGDA International Symposium             Park, J., Yi, W., Ahn, D., Kung, J., and Kim, J. 
REFERENCES

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations (ICLR), 2015.

Cao, S., Zhang, C., Yao, Z., Xiao, W., Nie, L., Zhan, D., Liu, Y., Wu, M., and Zhang, L. Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 63-72. ACM, 2019.

Cho, K., Van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.

Gao, C. et al. DeltaRNN: A power-efficient recurrent neural network accelerator. In International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 21-30. ACM, 2018.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. CoRR, abs/1510.00149, 2015a. URL http://arxiv.org/abs/1510.00149.

Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (NIPS), pp. 1135-1143, 2015b.

Han, S. et al. EIE: Efficient inference engine on compressed deep neural network. In International Symposium on Computer Architecture (ISCA), pp. 243-254, 2016. ISBN 978-1-4673-8947-1.

Han, S. et al. ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 75-84, 2017. ISBN 978-1-4503-4354-1. doi: 10.1145/3020078.3021745.

Jean, S., Firat, O., Cho, K., Memisevic, R., and Bengio, Y. Montreal neural machine translation systems for WMT'15. In Proceedings of the Tenth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2015. URL https://www.aclweb.org/anthology/W15-3014.

Klein, G., Kim, Y., Deng, Y., Senellart, J., and Rush, A. M. OpenNMT: Open-source toolkit for neural machine translation. In Proc. ACL, 2017. doi: 10.18653/v1/P17-4012. URL https://doi.org/10.18653/v1/P17-4012.

Micron Technology, Inc. Calculating memory power for DDR4 SDRAM. Tech. Rep. TN-40-07, 2017.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In Association for Computational Linguistics (ACL), pp. 311-318. Association for Computational Linguistics, 2002.

Park, J., Kung, J., Yi, W., and Kim, J.-J. Maximizing system performance by balancing computation loads in LSTM accelerators. In Design, Automation Test in Europe Conference Exhibition (DATE), 2018. ISBN 978-3-9819263-0-9.

Park, J., Yi, W., Ahn, D., Kung, J., and Kim, J. Balancing computation loads and optimizing input vector loading in LSTM accelerators. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019. ISSN 0278-0070. doi: 10.1109/TCAD.2019.2926482.

Pawlowski, J. T. Hybrid memory cube (HMC). In 2011 IEEE Hot Chips Symposium (HCS), pp. 1-24, 2011.

Rizakis, M. et al. Approximate FPGA-based LSTMs under computation time constraints. In International Symposium in Applied Reconfigurable Computing (ARC), 2018.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104-3112, 2014.

Treibig, J., Hager, G., and Wellein, G. Likwid: A lightweight performance-oriented tool suite for x86 multicore environments. In 2010 39th International Conference on Parallel Processing Workshops, pp. 207-216. IEEE, 2010.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008, 2017.

Wang, S., Li, Z., Ding, C., Yuan, B., Qiu, Q., Wang, Y., and Liang, Y. C-LSTM: Enabling efficient LSTM using structured compression techniques on FPGAs. In International Symposium on Field-Programmable Gate Arrays, pp. 11-20. ACM, 2018.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016a.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016b.
Figure A.1. The process of embedding and positional encoding.

Figure A.2. The process of multi-head attention. This process is divided into five computations (COM1~5).

A TRANSFORMER COMPUTATION BREAKDOWN

A.1 Embedding & Positional Encoding
The first step of the Transformer is the word embedding (Fig. A.1). The words in a sentence are converted into vectors of size dmodel through the embedding process. For example, the dmodel-sized vector representing 'ich' is the result of multiplying the dmodel x k embedding matrix by the one-hot vector of size k representing 'ich', where k is the number of words that the embedding matrix can represent. Since the multiplied vector is one-hot, the embedded word vectors can be obtained by reading only the corresponding entries of the embedding matrix from memory, without any multiplication.
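As a minimal illustration of this observation, the sketch below contrasts the one-hot matrix-vector product with a direct read of one column of the embedding matrix; the sizes and variable names are placeholders, not the ones used in OPTIMUS.

import numpy as np

d_model, k = 8, 100                   # hypothetical embedding width and vocabulary size
emb = np.random.randn(d_model, k)     # embedding matrix (d_model x k)

word_id = 42                          # index of 'ich' in the vocabulary (placeholder)
one_hot = np.zeros(k)
one_hot[word_id] = 1.0

x_matmul = emb @ one_hot              # full matrix-vector multiplication
x_lookup = emb[:, word_id]            # equivalent: read one column, no multiplication

assert np.allclose(x_matmul, x_lookup)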
Next, information about the relative or absolute position of each word should be injected into the embedded word vectors, which is called positional encoding. The positional information is expressed through sine and cosine functions. The values of those functions are fixed, depending only on the position of each element within a vector and the position of each vector in the sentence, so the positional information can be read from a lookup table instead of being computed every time. The vectors (x1, x2, ..., xtE), which are the summation of the embedded word vectors and the positional information, are used as the input matrix (dmodel x tE) of the encoder. This embedding and positional encoding process is also applied to the output words when decoding.
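The table-lookup view can be sketched as below; the sinusoidal definition follows the standard formulation of Vaswani et al. (2017), and the sizes are illustrative placeholders.

import numpy as np

def positional_table(max_len, d_model):
    """Precompute the sinusoidal positional encodings once (lookup table)."""
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / np.power(10000.0, i / d_model)
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

d_model, max_len = 8, 512             # hypothetical sizes
PE = positional_table(max_len, d_model)

# x_t is the embedded word vector at sentence position t; the encoder input is x_t + PE[t].
x_t = np.random.randn(d_model)
encoder_input_t = x_t + PE[3]         # position t = 3, read from the table, no trig at run time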
A.2 Multi-Head Attention

Multi-head attention is the structure that measures the relationships among the words of two sentences, which may be the same sentence or different ones (Fig. A.2). This process is divided into five computations (COM1~5). All computations except COM5 proceed separately in the h heads, which guarantees diverse attention maps for better translation quality.

The first computation (COM1) is a matrix-matrix multiplication that computes the query (Q), key (K), and value (V). The size of each weight matrix (WQ, WK, WV) is (dq, dk, dv) x dmodel, where dq, dk, dv = dmodel/h. For COM1 in the multi-head attention of the encoder and in the masked multi-head attention of the decoder, the same input matrix is multiplied by WQ, WK, and WV to compute Q, K, and V. On the other hand, for COM1 in the multi-head attention of the decoder, K and V are computed by multiplying the final output of the encoder by WK and WV, while Q is computed by multiplying the output of the masked multi-head attention of the decoder by WQ. If WQ, WK, and WV are pruned, COM1 becomes a sparse-matrix by dense-matrix multiplication (sM x dM).
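A per-head projection corresponding to COM1 can be sketched as follows; the head count and dimensions are illustrative placeholders, and pruning is emulated simply by zeroing part of the weights rather than by any particular sparse format.

import numpy as np

d_model, h, t = 512, 8, 3            # hypothetical model width, head count, sentence length
d_q = d_k = d_v = d_model // h       # per-head dimensions, as in the text

X = np.random.randn(d_model, t)      # input matrix (d_model x t), one column per word

# COM1 for one head: project the same input with three weight matrices.
W_Q = np.random.randn(d_q, d_model)
W_K = np.random.randn(d_k, d_model)
W_V = np.random.randn(d_v, d_model)

# Emulate pruning: a pruned weight matrix is mostly zeros, so COM1 becomes sM x dM.
W_Q[np.random.rand(*W_Q.shape) < 0.8] = 0.0

Q = W_Q @ X                          # (d_q x t)
K = W_K @ X                          # (d_k x t)
V = W_V @ X                          # (d_v x t)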
The second computation (COM2) is to compute the score. A score is computed as the inner product of K and Q, and it represents how strongly the words relate to each other. COM2 is always a multiplication of two dense matrices (dM x dM) because no pruned weight is used in COM2.

The third computation (COM3) is to divide the result of COM2 by the square root of the key vector size (dk). This process scales down the values and stabilizes the gradients during training (Vaswani et al., 2017). Through the softmax computation, all of these values become positive and their element-wise sum in the query direction always becomes one.

The fourth computation (COM4) is to multiply the result of COM3 by the value (V). This process reduces the information of unrelated words with low scores and increases that of the words which need to be focused on. For the same reason as COM2, COM4 consists of dM x dM.

The final, fifth computation (COM5) is to concatenate the results of COM4 (Z0-Z7) from each head and to multiply the concatenated result by the weight matrix (WO) to mix them. If WO is pruned, COM5 consists of sM x dM. After the five computations in multi-head attention, the output matrix still maintains the same size as that of the input matrix.
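COM2 through COM5 for all heads, followed by the concatenation and output projection, can be sketched as below; this is a software illustration with placeholder sizes and names, not the OPTIMUS dataflow.

import numpy as np

def softmax_cols(S):
    """Softmax along the query direction (each query column sums to one)."""
    e = np.exp(S - S.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

d_model, h, t = 512, 8, 3
d_k = d_v = d_model // h

heads = []
for _ in range(h):
    Q = np.random.randn(d_k, t)              # outputs of COM1 for this head (stand-ins)
    K = np.random.randn(d_k, t)
    V = np.random.randn(d_v, t)
    P = K.T @ Q                               # COM2: score matrix (t x t)
    P = softmax_cols(P / np.sqrt(d_k))        # COM3: scale and softmax
    Z = V @ P                                 # COM4: weighted sum of values (d_v x t)
    heads.append(Z)

W_O = np.random.randn(d_model, d_model)
Z_concat = np.vstack(heads)                   # COM5: concatenate Z0..Z7 -> (d_model x t)
out = W_O @ Z_concat                          # output keeps the input size (d_model x t)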
Figure A.3. The process of the residual connection around each of the sub-layers, followed by layer normalization.

A.3 Residual Add & Layer Normalization

The output of the sub-layers in each encoder and decoder is added to their input, and then the summation result is normalized in the layer-normalization process (Fig. A.3). The mean (mu_t) and standard deviation (sigma_t) for the layer normalization are computed for each vector in the word direction. The normalized output is scaled by gamma and shifted by beta, where gamma and beta are trained parameters. This computation amount is much smaller than that of the multi-head attention or the position-wise feed forward network (0.72% of the total computations).
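A compact sketch of the residual add and the per-word-vector normalization follows; the sizes are placeholders and gamma/beta are given trivial initial values only for illustration.

import numpy as np

d_model, t = 512, 3
X = np.random.randn(d_model, t)       # sub-layer input (from embedding or the previous layer)
Z = np.random.randn(d_model, t)       # sub-layer output (e.g., multi-head attention)

gamma = np.ones((d_model, 1))         # trained scale (placeholder initialization)
beta = np.zeros((d_model, 1))         # trained shift

S = X + Z                             # residual add
mu = S.mean(axis=0, keepdims=True)    # per-word-vector mean (word direction)
sigma = S.std(axis=0, keepdims=True)  # per-word-vector standard deviation
out = gamma * (S - mu) / (sigma + 1e-6) + beta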
Figure A.4. The process of the position-wise feed forward network.

A.4 Position-wise Feed Forward

Each layer of the encoder and decoder has a fully connected feed-forward network. In this network, the input matrix is first linearly transformed by multiplying it by WF1 [df x dmodel] and adding bF1 [df], where df is the inner-layer dimension size. The first transformation result passes through the Rectified Linear Unit (ReLU) activation, and the rectified result is linearly transformed again in a similar way to the first linear transformation. To keep the output dimension at dmodel, the sizes of the weight WF2 and the bias bF2 used in the second linear transformation should be dmodel x df and dmodel, respectively. After WF1 and WF2 are pruned, the first transformation consists of sM x dM. On the other hand, the second one becomes a multiplication between two sparse matrices (sM x sM), because its input matrix also has many zero values after passing through ReLU.
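The two transformations and the sparsity introduced by ReLU can be sketched as follows; the sizes are placeholders and pruning is emulated by zeroing weights.

import numpy as np

d_model, d_f, t = 512, 2048, 3
X = np.random.randn(d_model, t)

W_F1 = np.random.randn(d_f, d_model); b_F1 = np.random.randn(d_f, 1)
W_F2 = np.random.randn(d_model, d_f); b_F2 = np.random.randn(d_model, 1)

# Emulate pruning of both weight matrices.
W_F1[np.random.rand(*W_F1.shape) < 0.8] = 0.0
W_F2[np.random.rand(*W_F2.shape) < 0.8] = 0.0

H = np.maximum(W_F1 @ X + b_F1, 0.0)      # first transform + ReLU: sM x dM, sparse output
Y = W_F2 @ H + b_F2                        # second transform: sparse weight x sparse input (sM x sM)

print("fraction of zeros after ReLU:", np.mean(H == 0.0))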
Figure A.5. The different computations (COM2 and COM4) of masked multi-head attention.

A.5 Masked Multi-Head Attention

Masked multi-head attention is additionally performed only in the decoder. This process is the same as the multi-head attention computation except for the computations of COM2 and COM4 (Fig. A.5). Whereas the correlation among all the words in a sentence is computed in the encoder, only the correlation between each word and its previous words is computed in the masked multi-head attention. Therefore, after the correlation among all words is computed in COM2, the multiplication results between the queries of previous words and the later keys, such as k2 x q1 and k3 x q2, are masked with a negative infinity value so that those masked values converge to zero after COM3.
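The masking step can be sketched by overwriting the future-key entries of the score matrix before the softmax; the sizes are placeholders and the column-wise softmax is the same as in the earlier sketch.

import numpy as np

t, d_k = 3, 64
Q = np.random.randn(d_k, t)
K = np.random.randn(d_k, t)

P = K.T @ Q / np.sqrt(d_k)            # COM2 + scaling: P[i, j] pairs key k_(i+1) with query q_(j+1)

# Mask entries where the key position is later than the query position
# (e.g., k2 x q1 and k3 x q2), so they vanish after the softmax.
key_pos = np.arange(t)[:, None]
query_pos = np.arange(t)[None, :]
P[key_pos > query_pos] = -np.inf

e = np.exp(P - P.max(axis=0, keepdims=True))
P_soft = e / e.sum(axis=0, keepdims=True)     # COM3: masked scores become exactly zero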
A.6 Linear & Softmax

The result of the multi-layer decoder process is converted into probabilities over all k words through a linear and a softmax layer. The linear layer, a fully-connected neural network, projects the final output of the decoder into k dimensions. Note that k varies from dataset to dataset and is usually as large as tens of thousands. Since the weight matrix of the linear layer (k x dmodel) is very large, it is important to reduce the memory requirement of this weight matrix using pruning, which also reduces the amount of computation.

The softmax layer converts the output of the linear layer into a probability matrix over all k words. The word with the highest probability is selected as the final result of that decoding step. In the inference process, because only the word with the highest score is selected, the softmax process can be skipped.
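The reason the softmax can be dropped in inference is that argmax is invariant under the monotonic softmax; a one-line check with placeholder logits:

import numpy as np

logits = np.random.randn(30000)                       # linear-layer output over a hypothetical vocabulary
softmax = np.exp(logits - logits.max())
softmax /= softmax.sum()

# The selected word is identical with or without the softmax.
assert np.argmax(logits) == np.argmax(softmax)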
A.7 Beam Search

The most common way to search for a target sentence is to select the word with the highest probability at every decoding step. This greedy algorithm, however, is not guaranteed to always generate the best target sentence. Beam search complements this limitation of the greedy search. In the beam search method, the sentences whose cumulative probabilities fall within the top-n are kept at each decoding step, where n is the beam size. Note that beam search is identical to the greedy search algorithm when n = 1. Beam search increases the translation performance of a neural machine translation model; however, more resources and computation power are required because the input size of the model is increased by n.

Figure B.1. Analysis of redundant decoding computations of multi-head attention.

B SKIPPING REDUNDANT COMPUTATIONS OF MULTI-HEAD ATTENTION IN DECODERS

As mentioned in Section A.2, K and V in the multi-head attention of the decoder are computed from the final output of the encoder (Fig. B.1). That is, K and V are fixed matrices once they are computed at the first decoding time-step. We can therefore skip the computations of K and V for the other decoding time-steps by storing and reloading the computed K and V. Furthermore, because K and V are fixed, zt, the vector element of Z at time-step t, depends only on qt, the query at time-step t. This property allows the skipping of redundant decoding computations to be applied even to the multi-head attention in the decoder layers.
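The skipping idea can be sketched in software as caching K and V once and computing only the new column of Z for each new query; the sizes and names are placeholders, and this illustrates the redundancy argument, not the OPTIMUS hardware schedule.

import numpy as np

d_k = d_v = 64
t_enc = 5                                    # encoder output length (placeholder)

enc_out = np.random.randn(d_k * 8, t_enc)    # final encoder output (stand-in)
W_K = np.random.randn(d_k, enc_out.shape[0])
W_V = np.random.randn(d_v, enc_out.shape[0])

# Computed once at the first decoding time-step, then reused (stored/reloaded, not recomputed).
K = W_K @ enc_out
V = W_V @ enc_out

def decode_step(q_t):
    """z_t depends only on q_t because K and V are fixed."""
    p_t = K.T @ q_t / np.sqrt(d_k)           # scores against the cached keys
    p_t = np.exp(p_t - p_t.max())
    p_t /= p_t.sum()
    return V @ p_t                           # the new column z_t of Z

for _ in range(3):                           # each decoding time-step adds one column to Z
    z_t = decode_step(np.random.randn(d_k))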
Similarly,\r\n                                                           search method, the sentences where their cumulative proba-                                                                                                                                                                                                                                                              0,0                                                                                                                                                 2,1\r\n                                                                                                                                                                                                                                                                                                                                       the column index of w2,1 is compared with the row of i buf.\r\n                                                           bility for each word falls within top-n are selected for each                                                                                                                                                                                                               Whena                                          is matched, the value in the red region in i buf is\r\n                                                           decoding-step, where n is the beam size. Note that the beam                                                                                                                                                                                                                                                  1,1\r\n                                                                                                                                                                                                                                                                                                                                       shifted to the blue region and two input elements are newly\r\n                                                           search is as the same method as the greedy search algorithm                                                                                                                                                                                                                 loaded from g buf. This control method minimizes the\r\n                                                           when n = 1. 
This beam search increases the translation                                                                                                                                                                                                                      occurrenceofstallsbecausethelargersearchwindowallows\r\n                                                           performance of a neural machine translation model, how-                                                                                                                                                                                                                     the input elements to be prepared even if the address of the\r\n                                                           ever, more resources and computation power are required                                                                                                                                                                                                                     requested input elements is irregular due to sparse weight.\r\n                                                           because the input size of the model is increased by n.                                                                                                                                                                                                                      After the sixth computation shown in the computation order\r\n                                                                                                                                                                                                                                                                                                                                       in the Fig. C.1, the MAC computations for t0 and t1 are\r\n                                                            B SKIPPINGREDUNDANTCOMPUTATIONS                                                                                                                                                                                                                                            completed. 
B    SKIPPING REDUNDANT COMPUTATIONS OF MULTI-HEAD ATTENTION IN DECODERS

As mentioned in Section A.2, K and V in the multi-head attention of the decoder are computed from the final output of the encoder (Fig. B.1). That is, K and V are fixed matrices once they have been computed at the first decoding time-step. We can therefore skip the computations of K and V at the remaining decoding time-steps by storing the computed K and V and loading them when needed. Furthermore, because K and V are fixed, z_t, the vector element of Z at time-step t, depends only on q_t, the query at time-step t. This property allows the skipping of redundant decoding computations to be applied even to the multi-head attention in the decoder layers. In summary, only the vector of the output word from the previous decoding time-step is required as the decoder input at each decoding time-step.
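A minimal software analogue of this skipping is sketched below for a single attention head; the projection matrices W_q, W_k, and W_v are illustrative numpy arrays rather than OPTIMUS data structures. K and V are computed from the encoder output once, at the first decoding time-step, and every later step computes only q_t.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    class EncoderDecoderAttention:
        """Caches the K and V projected from the encoder output at the first decoding step."""

        def __init__(self, W_q, W_k, W_v):
            self.W_q, self.W_k, self.W_v = W_q, W_k, W_v
            self.K = None  # fixed after the first decoding time-step
            self.V = None

        def decode_step(self, decoder_input_t, encoder_output):
            if self.K is None:  # first decoding time-step: compute and store K, V
                self.K = encoder_output @ self.W_k
                self.V = encoder_output @ self.W_v
            # Later time-steps reuse the cached K and V; only q_t is computed.
            q_t = decoder_input_t @ self.W_q
            scores = softmax(q_t @ self.K.T / np.sqrt(self.K.shape[-1]))
            return scores @ self.V  # z_t depends only on q_t once K and V are fixed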
C    SPARSE/DENSE MATRIX COMPUTATION FLOWS IN OPTIMUS

In this section, we describe the details of the computation flows in OPTIMUS, focusing on the matrix multiplications (sM×sM and dM×dM) performed inside a PE, as illustrated in Fig. C.1.

[Figure C.1 appears here. Its panels show the sparse weight/input matrices stored in the SA-RCSC format and the dense weight/input matrices, the PE datapath (WEIGHT_MEM, INPUT_MEM, g_buf, i_buf, i_reg, w_fifo, multiplier, adder, and P_SUM buffer), and the order of the MAC computations; the computed partial sums for t0 and t1 are transferred to INPUT_MEM, and the weight matrix is reloaded from WEIGHT_MEM to compute t2 and t3.]

Figure C.1. Detailed description of how sM×sM and dM×dM are computed inside a PE of OPTIMUS.

The column index of w2,1 is compared with the row indices of the input elements in i_buf. When a1,1 is matched, the value in the red region of i_buf is shifted to the blue region and two input elements are newly loaded from g_buf. This control method minimizes the occurrence of stalls because the larger search window allows the input elements to be prepared even when the addresses of the requested input elements are irregular due to the sparse weights. After the sixth computation in the computation order shown in Fig. C.1, the MAC computations for t0 and t1 are completed. The value stored in the P_SUM buffer is then added to the value in the P_SUM buffer of another PE with the same SA number, and the result is stored in INPUT_MEM. If there are no tokens to be computed other than t0 and t1, the tokens are directly used as the input of the next computation. However, if the word length exceeds the internal P_SUM buffer size, the values for t0 and t1 are stored in DRAM and the weight matrix must be reloaded to compute t2 and t3.
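The sketch below is a greatly simplified software analogue of this sparse flow; the (row, col, value) weight stream stands in for the SA-RCSC format, the i_buf/g_buf windowing is omitted, and the function names are illustrative only. Each PE accumulates partial sums per output row for one input column, and the partial sums of two PEs holding the same SA number are added before the result is written back to INPUT_MEM.

    from collections import defaultdict

    def pe_sparse_mac(weight_stream, input_column):
        """Accumulate partial sums per output row for one input column (e.g. token t0).

        weight_stream: iterable of (row, col, value) nonzero weight entries
        input_column: dict mapping row index -> input value
        """
        p_sum = defaultdict(float)
        for row, col, val in weight_stream:
            p_sum[row] += val * input_column[col]
        return p_sum

    def reduce_same_sa(p_sum_pe_a, p_sum_pe_b):
        """Partial sums of two PEs with the same SA number are added before
        the result is stored in INPUT_MEM for the next computation."""
        merged = dict(p_sum_pe_a)
        for row, val in p_sum_pe_b.items():
            merged[row] = merged.get(row, 0.0) + val
        return merged

    # Example: two PEs each hold part of the nonzero weights of the same output rows.
    x = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}                    # one input column
    pe0 = pe_sparse_mac([(0, 0, 0.5), (2, 1, -1.0)], x)
    pe1 = pe_sparse_mac([(0, 2, 2.0), (2, 3, 0.25)], x)
    input_mem_entry = reduce_same_sa(pe0, pe1)               # written to INPUT_MEM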
The process for multiplying a dense weight matrix by a dense input matrix is simpler than the sparse matrix computation. The order of input matrix loading is the same as in the sparse input case; however, the input elements are loaded through i_reg rather than through g_buf and i_buf. Unlike the sparse matrix computation, in which all PEs are loaded with the same input data, different input values are loaded into each PE in the dense matrix computation. The hierarchical buffer structure, in which the input vector elements are shared by all PEs, is therefore not suitable for loading a separate input vector element into each PE. The dense matrix multiplications are performed in the COM2 and COM4 processes of the masked multi-head attention and the multi-head attention. In these processes, the row size of the weight matrix is t or d_model (Fig. 5).
This row size is smaller than that of the weight matrices used in the sparse matrix computation, so one PE processes fewer rows than in the sparse matrix multiplication case and the partial sums for more columns of the input matrix can be accumulated in the P_SUM buffer. As a result, high reuse of the weight data is achieved. Since the weight matrix is not sparse, its values are transferred to the w_fifo in the column direction without using the sparse matrix format. The pointer of the w_fifo is shifted every cycle, and the calculated P_SUM buffer values are transferred to INPUT_MEM as in the sparse matrix multiplication case.
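A minimal numpy sketch of this weight-reuse pattern is given below, under the simplifying assumptions that one PE holds an entire dense weight tile and that the partial sums for all of its assigned tokens fit in the P_SUM buffer at once; the function name is illustrative.

    import numpy as np

    def pe_dense_matmul(weight, inputs):
        """Stream the weight column by column (as through w_fifo) and reuse each
        column across every input column whose partial sums are kept in P_SUM.

        weight: (rows, cols) dense weight tile held by one PE
        inputs: (cols, n_tokens) input columns t0, t1, ... assigned to this PE
        """
        p_sum = np.zeros((weight.shape[0], inputs.shape[1]))  # partial sums for all tokens
        for c in range(weight.shape[1]):                      # one weight column per step
            # Each weight column is multiplied with the c-th element of every token,
            # so the weight data are loaded once and reused across all tokens.
            p_sum += np.outer(weight[:, c], inputs[c, :])
        return p_sum                                          # transferred to INPUT_MEM

    # Sanity check against a plain matrix product.
    W = np.arange(12, dtype=float).reshape(3, 4)
    X = np.ones((4, 2))
    assert np.allclose(pe_dense_matmul(W, X), W @ X)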
D    PRUNING RESULTS OF THE TRANSFORMER MODEL

We first trained a 6-layer Transformer model with h = 8, d_model = 512, d_f = 2048, and n = 36549 on the WMT English-to-German (EN-DE) dataset (Sebastien Jean & Bengio, 2015) under the same training conditions as suggested in (Klein et al., 2017) and (Vaswani et al., 2017). After training, we pruned the weights of the Transformer model at the pruning rates shown in Table D.1 using the magnitude-based pruning method (Han et al., 2015b). We then retrained the pruned model while maintaining the above training conditions except for the learning rate schedule, which we scaled by 1.25 compared to the original one. The weights of the Transformer model are removed by 77.25% on average, while the BLEU score degrades by only about 0.6 on the WMT15 EN-DE dataset (Table D.1).
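For reference, a minimal numpy sketch of the magnitude-based pruning step is shown below; the reapplication of the pruning mask during retraining is our reading of the standard recipe from (Han et al., 2015b), not a detail specified here.

    import numpy as np

    def magnitude_prune(weight, pruning_rate):
        """Zero out the smallest-magnitude entries so that roughly pruning_rate
        percent of the weights are removed."""
        k = int(weight.size * pruning_rate / 100.0)
        if k == 0:
            return weight.copy(), np.ones_like(weight, dtype=bool)
        threshold = np.sort(np.abs(weight), axis=None)[k - 1]
        mask = np.abs(weight) > threshold
        return weight * mask, mask

    # Example: prune a 512x512 attention projection at the rate reported for ENCODER0 MHA.
    W = np.random.randn(512, 512).astype(np.float32)
    W_pruned, mask = magnitude_prune(W, 77.93)
    # During retraining, the mask would be reapplied after every update so that
    # pruned weights remain zero.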
Table D.1. The sparsity of the pruned Transformer model and BLEU evaluation results on WMT15.

LAYER      SUB LAYER   MATRIX SIZE   PRUNING RATE [%]   DATA SIZE (DENSE) [KB]   DATA SIZE (PRUNED) [KB]
ENCODER0   MHA         512x512            77.93                 2048                    567.27
           FF          2048x512           73.39                 4096                   1368.21
ENCODER1   MHA         512x512            77.89                 2048                    586.14
           FF          2048x512           75.12                 4096                   1279.68
ENCODER2   MHA         512x512            77.92                 2048                    567.56
           FF          2048x512           75.18                 4096                   1276.77
ENCODER3   MHA         512x512            78.02                 2048                    565.00
           FF          2048x512           75.26                 4096                   1272.52
ENCODER4   MHA         512x512            77.97                 2048                    566.15
           FF          2048x512           75.31                 4096                   1270.09
ENCODER5   MHA         512x512            77.91                 2048                    567.73
           FF          2048x512           75.17                 4096                   1277.11
DECODER0   MMHA        512x512            78.09                 2048                    563.04
           MHA         512x512            77.99                 2048                    565.68
           FF          2048x512           75.08                 4096                   1281.80
DECODER1   MMHA        512x512            77.99                 2048                    565.59
           MHA         512x512            78.09                 2048                    563.06
           FF          2048x512           75.06                 4096                   1282.88
DECODER2   MMHA        512x512            78.01                 2048                    565.77
           MHA         512x512            77.97                 2048                    566.28
           FF          2048x512           74.99                 4096                   1286.58
DECODER3   MMHA        512x512            78.00                 2048                    565.52
           MHA         512x512            77.95                 2048                    566.77
           FF          2048x512           74.96                 4096                   1288.27
DECODER4   MMHA        512x512            78.02                 2048                    564.87
           MHA         512x512            77.97                 2048                    566.10
           FF          2048x512           75.02                 4096                   1284.88
DECODER5   MMHA        512x512            77.99                 2048                    565.60
           MHA         512x512            77.90                 2048                    567.95
           FF          2048x512           75.04                 4096                   1284.03
LINEAR     -           36549x512          79.77                36549                   9104.25

BLEU (WMT15 EN-DE): 32.29 before pruning, 31.67 after pruning.
MMHA: masked multi-head attention, MHA: multi-head attention, FF: position-wise feed-forward.