Junki Park, Hyunsung Yoon, Daehyun Ahn, Jungwook Choi, Jae-Joon Kim
We present a high-performance Transformer neural network inference accelerator named OPTIMUS. Optimus has several features for performance enhancement such as the redundant computation skipping method to accelerate the decoding process and the Set-Associative RCSC (SA-RCSC) sparse matrix format to maintain high utilization even when a large number of MACs are used in hardware. OPTIMUS also has a flexible hardware architecture to support diverse matrix multiplications and it keeps all the intermediate computation values fully local and completely eliminates the DRAM access to achieve exceptionally fast single batch inference. It also reduces the data transfer overhead by carefully matching the data compute and load cycles. The simulation using the WMT15 (EN-DE) dataset shows that latency of OPTIMUS is 41.62×, 24.23×, 16.01× smaller than that of Intel(R) i7 6900K CPU, NVIDIA Titan Xp GPU, and the baseline custom hardware, respectively. In addition, the throughput of OPTIMUS is 43.35×, 25.45× and 19.00× higher and the energy efficiency of OPTIMUS is 2393.85×, 1464× and 19.01× better than that of CPU, GPU and the baseline custom hardware, respectively.