Ningning Xie, Tamara Norman, Dominik Grewe, Dimitrios Vytiniotis
We present a novel characterization of the mapping of multiple parallelism forms (e.g., data and model parallelism) onto hierarchical accelerator systems that is hierarchy-aware and greatly reduces the space of software-to-hardware mappings. We experimentally verify the substantial effect of these mappings on all-reduce performance (up to 448x). We offer a novel syntax-guided program synthesis framework that decomposes reductions over one or more parallelism axes into sequences of collectives in a hierarchy- and mapping-aware way. For 69% of parallelism placements and user-requested reductions, our framework synthesizes programs that outperform the default all-reduce implementation when evaluated on different GPU hierarchies (max 2.04x, average 1.27x). We complement our synthesis tool with a simulator exceeding 90% top-10 accuracy, which reduces the need for massive evaluations of synthesis results to determine a small set of optimal programs and mappings.
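
To make the idea of decomposing a reduction into a sequence of collectives concrete, the following is a minimal sketch in plain Python. It is not the synthesis framework described above; it only simulates one well-known decomposition (reduce-scatter within a node, all-reduce across nodes, all-gather within a node) and checks that it agrees with a flat all-reduce. The two-level node/GPU hierarchy, the function names, and the list-based device model are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]
# Hypothetical device model: data[n][g] is the vector held by GPU g on node n.
Cluster = List[List[Vector]]


def flat_all_reduce(data: Cluster) -> Cluster:
    """Reference: every device ends with the elementwise sum over all devices."""
    vectors = [v for node in data for v in node]
    total = [sum(xs) for xs in zip(*vectors)]
    return [[list(total) for _ in node] for node in data]


def hierarchical_all_reduce(data: Cluster) -> Cluster:
    """Same result, expressed as three collectives that follow the hierarchy."""
    gpus = len(data[0])
    length = len(data[0][0])
    assert length % gpus == 0, "sketch assumes the vector splits evenly across GPUs"
    shard = length // gpus

    # Step 1: reduce-scatter inside each node.
    # Local GPU j keeps only shard j of its node's elementwise sum.
    scattered = []
    for node in data:
        node_sum = [sum(xs) for xs in zip(*node)]
        scattered.append([node_sum[j * shard:(j + 1) * shard] for j in range(gpus)])

    # Step 2: all-reduce across nodes, between GPUs that share a local index.
    # After this step, shard j is fully reduced over the whole cluster.
    reduced = [
        [sum(xs) for xs in zip(*(node[j] for node in scattered))]
        for j in range(gpus)
    ]

    # Step 3: all-gather inside each node, so every GPU holds the full vector.
    full = [x for piece in reduced for x in piece]
    return [[list(full) for _ in range(gpus)] for _ in data]


# Two nodes with two GPUs each; every GPU holds a length-4 vector.
cluster = [
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    [[10, 20, 30, 40], [50, 60, 70, 80]],
]
same = hierarchical_all_reduce(cluster) == flat_all_reduce(cluster)
```

In this sketch only step 2 crosses node boundaries, and it moves shards rather than full vectors. That is the kind of structure a hierarchy-aware decomposition can exploit, although which sequence is actually fastest depends on the hardware and on the parallelism placement.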