Torch2Chip: An End-to-end Customizable Deep Neural Network Compression and Deployment Toolkit for Prototype Hardware Accelerator Design

Part of Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference


Authors

Jian Meng, Yuan Liao, Anupreetham Anupreetham, Ahmed Hasssan, Shixing Yu, Han-sok Suh, Xiaofeng Hu, Jae-sun Seo

Abstract

Deep neural network (DNN) compression (e.g., quantization, pruning) has been widely investigated in various deep learning tasks (e.g., vision and language). The development of model compression is continuously motivated by the evolution of various neural network accelerator designs with ASIC or FPGA. On the algorithm side, the ultimate goal of quantization or pruning is accelerating the expensive DNN computations on low-power hardware. However, such a “design-and-deploy” workflow faces under-explored challenges in the current hardware-algorithm co-design community due to several unavoidable flaws. First, although state-of-the-art quantization algorithms can achieve ultra-low precision with negligible degradation of accuracy, the latest deep learning frameworks (e.g., PyTorch) support only a non-customizable 8-bit precision, data format, and parameter extraction workflow for CNNs. Second, the ultimate goal of quantization is enabling computation with low-precision data (e.g., 4-bit integers). However, current SoTA algorithms treat the quantized integers as an intermediate result, while the final output of the quantizer is the “discretized” floating-point values, ignoring the practical needs and adding additional workload for hardware designers in integer parameter extraction and layer fusion. Finally, the compression toolkits designed by industry are constrained to their in-house products or a handful of algorithms. The limited degree of freedom in current toolkits and the under-explored customization hinder prototype ASIC- or FPGA-based accelerator design. To resolve these challenges, we propose Torch2Chip, an open-sourced, fully customizable, and high-performance toolkit that supports user-designed compression algorithms followed by automatic model fusion and parameter extraction. Torch2Chip incorporates a hierarchical design workflow, and the user-customized compression algorithm is directly packed into a deployment-ready format for prototype chip verification with either CNNs or vision transformers (ViT). Furthermore, Torch2Chip covers a wide range of training methods to achieve high performance, from basic supervised learning to state-of-the-art (SoTA) lightweight self-supervised learning (SSL). The Torch2Chip toolkit and source code will be released soon.
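To make the “discretized floating-point” issue concrete, the following is a minimal sketch, not Torch2Chip’s API, of the symmetric uniform quantizer typically used by SoTA algorithms (all names here are illustrative). The true low-precision integers exist only as an intermediate value; what the quantizer returns to the rest of the model is the dequantized floating-point tensor, so a hardware designer must re-derive the integers themselves.

```python
# Minimal sketch (illustrative, not the Torch2Chip API) of "fake quantization":
# the quantizer computes true low-precision integers internally but outputs
# dequantized floating-point values, which is the flaw described above.
import torch

def fake_quantize(w: torch.Tensor, n_bits: int = 4):
    """Symmetric uniform quantizer of the kind used by typical SoTA algorithms."""
    qmax = 2 ** (n_bits - 1) - 1                            # e.g., 7 for signed 4-bit
    scale = w.abs().max() / qmax                            # per-tensor scaling factor
    w_int = torch.round(w / scale).clamp(-qmax - 1, qmax)   # the integers hardware needs
    w_fake = w_int * scale                                  # "discretized" floats: the
                                                            # quantizer's actual output
    return w_fake, w_int, scale

w = torch.randn(8, 8)
w_fake, w_int, scale = fake_quantize(w)
print(w_int.unique())   # low-precision integer levels, discarded downstream
print(w_fake.dtype)     # torch.float32: what the rest of the model consumes
```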
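Likewise, the layer fusion step mentioned above can be illustrated with standard Conv + BatchNorm folding. This is a hedged sketch of the common technique, not Torch2Chip’s implementation: the BN statistics and affine parameters are absorbed into the preceding convolution, so only one fused set of weights and biases needs to be extracted for deployment.

```python
# Sketch of standard Conv + BatchNorm folding (illustrative, not Torch2Chip's
# code): BN parameters are absorbed into the convolution so the deployed
# model carries a single fused weight tensor and bias per layer.
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride,
                      conv.padding, bias=True)
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std                                  # per-channel BN scale
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused
```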