MegaBlocks: Efficient Sparse Training with Mixture-of-Experts

Part of Proceedings of Machine Learning and Systems 5 pre-proceedings (MLSys 2023) mlsys2023


Bibtek download is not available in the pre-proceeding


Trevor Gale, Deepak Narayanan, Cliff Young, Matei Zaharia


We present MegaBlocks, a system for efficient Mixture-of-Experts (MoE) training on GPUs. Our system ismotivated by the limitations of current frameworks, which restrict the dynamic routing in MoE layers to satisfythe constraints of existing software and hardware. These formulations force a tradeoff between model quality andhardware efficiency, as users must choose between dropping tokens from the computation or wasting computationand memory on padding. To address these limitations, we reformulate MoE computation in terms of block-sparseoperations and develop new block-sparse GPU kernels that efficiently handle the dynamism present in MoEs. Ourapproach never drops tokens and maps efficiently to modern hardware, enabling end-to-end training speedups ofup to 40% over MoEs trained with the state-of-the-art Tutel library and 2.4× over dense DNNs trained with thehighly-optimized Megatron-LM framework.