HORIZONTALLY FUSED TRAINING ARRAY: AN EFFECTIVE HARDWARE UTILIZATION SQUEEZER FOR TRAINING NOVEL DEEP LEARNING MODELS

Shang Wang 1 2 Peiming Yang 3 2 Yuxuan Zheng 4 Xin Li 3 2 Gennady Pekhimenko 5 2

ABSTRACT
Driven by the tremendous effort in researching novel deep learning (DL) algorithms, the training cost of developing new models has increased staggeringly in recent years. We analyze GPU cluster usage statistics from a top research institute for more insights into the hardware efficiency achieved by typical DL training jobs. Our study reveals that single-accelerator training jobs can dominate the cluster-wide resource consumption when launched repetitively (e.g., for hyper-parameter tuning) while severely under-utilizing the hardware. Fortunately, we observe that such workloads have the following unique characteristics: (i) the models among jobs often have the same types of operators with the same shapes, and (ii) the inter-model horizontal fusion of such operators is mathematically equivalent to other already well-optimized operators. Thus, to help DL researchers and practitioners effectively improve the hardware utilization of their novel DL training workloads, we propose Horizontally Fused Training Array (HFTA). HFTA is a new DL framework extension library that horizontally fuses the models from different repetitive jobs deeply down to operators and trains them simultaneously on a shared accelerator. To show the generality of our solution, we apply HFTA to six DL models training on state-of-the-art accelerators (GPUs and TPUs). Our results indicate that HFTA is highly effective in improving hardware utilization and achieves up to 15.1× higher training throughput vs. the standard practice of running each job on a separate accelerator.

1 INTRODUCTION
Deep Learning (DL) algorithms have facilitated tremendous progress in a range of domains, including natural language translation (Wu et al., 2016), recommendation systems (Nakov et al., 2019), magnetic resonance imaging segmentation (Akkus et al., 2017), video game bots (OpenAI, 2018), real-time high-resolution rendering (NVIDIA, 2020e), and very-large-scale integrated circuit placement (Lin et al., 2019). This is driven by the abundant and continuous efforts in researching and developing novel DL models by academia and industry in recent years. Developing these models is computationally intensive, requiring an army of expensive, specialized accelerators such as GPUs and TPUs (Jouppi et al., 2017), leading to staggeringly high training costs (Coleman et al., 2017; Amodei et al., 2018; Zhu et al., 2018; Mattson et al., 2020). To reduce this training cost and optimize the cluster-wide hardware resource usage, we analyze GPU usage statistics over two consecutive months on a large GPU cluster from the Vector Institute (Vector Institute, 2021). We observe that, despite significant attention on optimizing DL training workloads from the computer system and architecture communities, especially on distributed training optimizations (Appleyard et al., 2016; Chen et al., 2016; Lin et al., 2018; Rajbhandari et al., 2019; Mattson et al., 2020), single-accelerator (e.g., single-GPU) training jobs, often launched repetitively by DL researchers (to perform hyper-parameter tuning, model architecture search or convergence stability tests), can (i) dominate the cluster-wide hardware resource consumption (e.g., 46.2% in our study) while (ii) having extremely low hardware utilization (Sections 2.1 and 5.3).

The root cause of this phenomenon is manifold. DL researchers and practitioners often lack the expertise to optimize their training workloads independently. As a result, basic techniques, such as increasing the batch size, often become the only approach at their disposal to improve hardware utilization. However, this technique can be impractical for many reasons, including the generalization gap (Keskar et al., 2017), the batch size scaling limit (Shallue et al., 2019), and GAN training instability (Odena, 2019). On the other hand, accelerators (e.g., GPUs and TPUs) evolve towards more computing power and larger memory capacities (Tables 2 and 3). This trend amplifies the severity of the hardware under-utilization caused by the inability of such training workloads to scale their performance well.

Thus, this phenomenon motivates hardware sharing approaches. To the best of our knowledge, the only widely used hardware-based sharing solutions applicable to DL training are the MPS (NVIDIA, 2020b) and MIG (NVIDIA, 2020g) features on NVIDIA GPUs. However, as we later show in Section 2.2, these generic GPU sharing features that aim at arbitrary workloads are far from the “silver bullets” to effectively improve the hardware utilization in the case of repetitive single-GPU training workloads. The situation is even worse for emerging DL accelerators (e.g., TPUs) that currently do not have any hardware-based sharing features.

------
1 Equal contribution
2 NVIDIA
3 Vector Institute
4 Department of Computer Science and Engineering, Shanghai Jiao Tong University
5 Intel

Proceedings of the 4th MLSys Conference, San Jose, CA, USA, 2021. Copyright 2021 by the author(s).

To address such hardware under-utilization on various accelerators, we make two key observations based on the unique characteristics of these workloads. First, the models across jobs belonging to the same workload (e.g., hyper-parameter tuning) often have the same types of operators with the same shapes. Second, if these operators are horizontally fused across the models, the outcome is mathematically equivalent to other well-optimized operators found in existing DL framework stacks and accelerators (e.g., fusing multiple convolution operators can be realized using grouped convolutions). Inspired by these key observations, we propose to horizontally merge multiple training jobs with the same or similar DL models by deeply fusing most, if not all, operators in those models. These models’ training is then performed collectively on the same shared accelerator (instead of training each model separately on its own accelerator). Our proposed idea of inter-model horizontal fusion is drastically different from and also more effective than major related prior work, as it better exercises the full potential of modern accelerators while (i) not relying on the generic sharing primitives (e.g., CUDA streams) that are ineffective for repetitive single-GPU workloads, and (ii) avoiding limited fusion techniques that, for example, support only stateless operators or require the weights across models to be the same (Narayanan et al., 2018a).

We leverage this novel idea to build a new DL framework extension library for DL researchers and practitioners, called Horizontally Fused Training Array (HFTA) \(^1\), that significantly simplifies the adoption of our proposed inter-model horizontal fusion technique. In summary, this work makes the following major contributions.

- To understand the nature of the jobs running on modern DL accelerator clusters, we collect and study GPU cluster usage statistics, including 51K jobs running for 472K GPU hours in total, from real research workloads. The results of this study demonstrate that repetitive single-accelerator training jobs (i) dominate the hardware resource usage (i.e., 46.2%) and (ii) have extremely low hardware utilization.

- Motivated by this study, we make two key observations about these jobs that our proposal is built upon: (1) The models often have the same types of operators with the same shapes. (2) The inter-model horizontal fusion of such operators is mathematically equivalent to other existing and well-optimized operators.

- We develop HFTA, a new library that helps DL researchers and practitioners (even with limited computer system and architecture expertise) to easily extract better performance from their hardware when training novel DL models. While doing so, we avoid (i) introducing any additional device-specific operator implementations that would limit the generality of our idea across different accelerators and (ii) any effect on individual models’ convergence, as the speedup is achieved only through mathematically equivalent transformations. HFTA applies to a wide variety of models and can run on any hardware backend supported by existing DL frameworks. Furthermore, we propose a simple yet effective method to integrate HFTA into existing tuning algorithms and develop a lightweight tuning framework named Horizontally Fused Hyper-parameter Tuning (HFHT).

- We evaluate HFTA on six highly impactful DL models, covering a wide range of tasks, in the machine learning (ML) community. On modern GPUs (V100, RTX6000 and A100), HFTA achieves 2.42 \(\times\) to 11.50 \(\times\) higher training throughput than running the training jobs without sharing, which is commonly employed by hyper-parameter tuning frameworks (Li, 2020), 1.25 \(\times\) to 4.72 \(\times\) higher than MPS, and 1.33 \(\times\) to 4.88 \(\times\) higher than MIG. HFTA can also fit 1.50 \(\times\) to 9.43 \(\times\) more training jobs on the same GPU than MPS. On TPUs, HFTA achieves 2.98 \(\times\) to 15.13 \(\times\) higher training throughput, which demonstrates HFTA’s ability to significantly improve performance across different hardware backends. Finally, when HFTA is integrated into two tuning algorithms via HFHT, we reduce the cost of total GPU hours by up to 5.10 \(\times\) among four end-to-end hyper-parameter tuning workloads.

\(^1\)https://github.com/UofT-EcoSystem/hfta

2 Background and Motivation

2.1 Inefficiency in Repetitive Training Jobs

As DL research has continued to evolve in recent years, the accompanying training cost has increased dramatically. For example, Amodei et al. (2018) show that the amount of compute for training SOTA DL models doubles every 3.4 months, outpacing even Moore’s Law (Schaller, 1997). Motivated by the practical goal of reducing cluster-wide training cost, using the methodology detailed in Appendix A, we collect and study the GPU usage statistics of real research workloads for two consecutive months on a large GPU cluster from the Vector Institute (Vector Institute, 2021). To our surprise, we find that single-accelerator (e.g., single-GPU) training jobs dominate the cluster-wide hardware resource consumption when these jobs are launched repetitively in groups, and the aggregated cost of these jobs can even outweigh that of distributed training (the primary focus of many research efforts from the computer system and architecture communities (Lin et al., 2018; Jayarajan et al., 2019; Rajbhandari et al., 2019; Mattson et al., 2020; Li et al., 2020)).

Potential reasons for these repetitive jobs include (but are not limited to) hyper-parameter tuning (Strubell et al., 2019) and convergence stability testing.

**Background** Hyper-parameter tuning finds the optimal set of hyper-parameters, which are unknown a priori and are usually necessary for building accurate models targeting a previously unexplored problem (Bergstra et al., 2011; Bergstra & Bengio, 2012). Typical hyper-parameters include learning rates, the choices of weight initializers, and optimizer settings. Model architecture search (Elsken et al., 2019) is a subset of hyper-parameter tuning where the hyper-parameters directly impact the model architecture (e.g., the number of layers). Convergence stability testing trains the same model many times with different random seeds to verify the final accuracy results.

In our study, we classify the jobs into four main categories: (1) multi-node or single-node distributed training, (2) repetitive single-GPU training, (3) isolated single-GPU training, and (4) others (meaning the jobs that do not belong to the first three categories or cannot be identified). Table 1 shows the GPU hour usage distribution among these categories, from which we can observe that the repetitive single-GPU training jobs consume as much as 46.2% of the cluster-wide total GPU hours. Furthermore, those repetitive single-accelerator training jobs often have low hardware utilization, as we show in Appendix A. The causes of this phenomenon are manifold:

- Improving the hardware utilization for DL training jobs can be very challenging. DL researchers and practitioners often lack the system and architecture expertise to optimize their training workloads independently. Increasing the batch size, which is the naive and often the only approach at their disposal to increase hardware utilization, is not universally applicable. For instance, large batch sizes can lead to training instability for generative adversarial networks (GANs) (Odena, 2019; Brock et al., 2019), a generalization gap (Keskar et al., 2017), and diminishing returns due to the batch size scaling limit (Shallue et al., 2019). Even with help from computer system and architecture experts, applying various advanced optimization techniques (e.g., kernel fusion (Appleyard et al., 2016) or checkpointing (Chen et al., 2016; Zheng et al., 2020)) to each new model requires an enormous amount of engineering effort (Mattson et al., 2020). Meanwhile, novel DL models have been proposed at an exponential pace in recent years (Charrez, 2019).
- As DL research progresses, accelerators (e.g., GPUs and TPUs (Jouppi et al., 2017)) evolve towards more compute power (e.g., more streaming multiprocessors (SMs) and the introduction of specialized compute units for fast matrix multiplications in GPUs called tensor cores (TCs) (Markidis et al., 2018)) and larger memory capacity/bandwidth. We can observe this trend from Tables 2 and 3 that list the specifications of the most recent NVIDIA data center GPUs and Google Cloud TPUs, where the largest accelerators suffer from under-utilization the most.

The fast development of both new DL models and accelerators together exacerbates the hardware under-utilization from repetitive single-accelerator training jobs, which motivates hardware sharing methods discussed below.

### 2.2 Hardware-based Sharing

The most well-known and (to the best of our knowledge) the only widely-used hardware-based sharing solutions applicable to DL training workloads are the Multi-Process Service (MPS) (NVIDIA, 2020h) and Multi-Instance GPU (MIG) (NVIDIA, 2020g) on NVIDIA GPUs. MPS allows CUDA kernels from different processes to potentially run concurrently on the same GPU via a hardware feature called Hyper-Q (Bradley, 2012). MIG, which is currently only available on the most recent A100 GPUs (NVIDIA, 2020a), partitions a single GPU into multiple (up to 7) isolated GPU instances (GIs), where each job now runs on a single GI.

However, as we quantitatively demonstrate in Section 5.1, both MPS and MIG still leave significant training performance potential unharnessed for the following reasons. First, both MPS and MIG duplicate the runtime overhead among kernels from different training jobs, including kernel launches (Lustig & Martonosi, 2013), GEMM setups and teardowns (NVIDIA, 2020j), and/or memory format conversions (related explicitly to TCs) (NVIDIA, 2020). Thus, they cannot effectively improve the SM and TC utilization. Second, both MPS and MIG require running training jobs as separate processes, which duplicates the GPU memory overhead reserved by the DL framework stack (Gross et al., 2019) and leads to a higher overall GPU memory footprint. Therefore, we can fit fewer training jobs into the same GPU. Finally, MIG’s partitioning granularity can be too coarse for many training workloads. Even

---

### Table 1. GPU hour usage breakdown for two consecutive months of a large GPU cluster from the Vector Institute.

<table>
<thead>
<tr>
<th>Training Jobs</th>
<th>Repetitive Single-GPU</th>
<th>Isolated Single-GPU</th>
<th>Distributed</th>
<th>Other</th>
</tr>
</thead>
<tbody>
<tr>
<td>GPU Hours</td>
<td>218K (46.2%)</td>
<td>17K (3.5%)</td>
<td>113K (24.0%)</td>
<td>124K (26.3%)</td>
</tr>
</tbody>
</table>

---

### Table 2. Cloud TPU Core Specifications (Google, 2020c)

<table>
<thead>
<tr>
<th></th>
<th>TPU v2</th>
<th>TPU v3</th>
<th>TPU v4</th>
</tr>
</thead>
<tbody>
<tr>
<td>MXUs</td>
<td>1</td>
<td>2</td>
<td>≥ 4†</td>
</tr>
<tr>
<td>Memory (HBM)</td>
<td>8 GB</td>
<td>16 GB</td>
<td>≥ 4? GB</td>
</tr>
</tbody>
</table>

† TPU v4 is expected to double the FLOPs of TPU v3 along with other enhancements (Kumar, 2020).

### Table 3. NVIDIA Data Center GPU Specifications

<table>
<thead>
<tr>
<th>GPU</th>
<th>SMs</th>
<th>HBM (GB)</th>
<th>HBM Bandwidth</th>
<th>TC Types</th>
</tr>
</thead>
<tbody>
<tr>
<td>P100 (2016)</td>
<td>56</td>
<td>12/16</td>
<td>549/732 GB/s</td>
<td>-</td>
</tr>
<tr>
<td>V100 (2018)</td>
<td>80</td>
<td>16/32</td>
<td>900 GB/s</td>
<td>FP16</td>
</tr>
<tr>
<td>A100 (2020)</td>
<td>108</td>
<td>40</td>
<td>1.6 TB/s</td>
<td>TF32 &amp; FP16</td>
</tr>
</tbody>
</table>

---

AMD GPUs also have a hardware-based sharing feature called CU-mask (Otterness & Anderson, 2020); however, we skip its discussion due to their irrelevance in mainstream training workloads.
with the finest granularity of MIG (7 GIs), each job can still under-utilize a single GI, as we show in Section 5.1.
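The memory argument above (separate processes each pay the GPU memory reserved by the DL framework stack) can be illustrated with a toy capacity model. The sketch below is not a measurement: the function names and all numbers (device size, per-process framework overhead, per-model footprint) are hypothetical, and it ignores fragmentation and any activation or workspace memory that does not scale linearly.

```python
def max_jobs_separate_processes(device_gb, framework_overhead_gb, per_model_gb):
    """Each process (as with MPS or MIG) reserves its own framework overhead."""
    return int(device_gb // (framework_overhead_gb + per_model_gb))


def max_models_single_process(device_gb, framework_overhead_gb, per_model_gb):
    """One shared process pays the framework overhead once for all models."""
    return int((device_gb - framework_overhead_gb) // per_model_gb)


# Hypothetical numbers chosen only to show the effect of duplicated overhead.
DEVICE_GB = 16.0
FRAMEWORK_OVERHEAD_GB = 1.0
PER_MODEL_GB = 0.5

separate = max_jobs_separate_processes(DEVICE_GB, FRAMEWORK_OVERHEAD_GB, PER_MODEL_GB)
shared = max_models_single_process(DEVICE_GB, FRAMEWORK_OVERHEAD_GB, PER_MODEL_GB)
print(separate, shared)  # prints "10 30"
```

Under these assumed numbers, the smaller the per-model footprint is relative to the duplicated overhead, the larger the gap between the two counts becomes.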

3 Our Proposal: HFTA

To address the challenge of improving hardware utilization for novel repetitive training workloads on a variety of accelerators, we make the following two key observations on the unique characteristics of these workloads:

- When launched repetitively (such as during hyper-parameter tuning or convergence stability testing), the models used across these jobs often have the same types of operators with the same shapes.

- Horizontally fusing the same types of operators with the same shapes often results in other mathematically equivalent operators that already exist in many SOTA DL models and thus have been optimized in most DL framework stacks on different accelerators.

Figure 1 explains the above observations with a concrete example of hyper-parameter tuning where the goal is to determine which weight initializer and learning rate work the best. Regardless of which weight initializer or learning rate is used, the first operators in both models are Conv2d of the same shape; the horizontal fusion of many Conv2d operators is mathematically equivalent to a grouped Conv2d which is already used in the ResNeXt (Xie et al., 2017) and MobileNets (Howard et al., 2017) models and supported by cuDNN (NVIDIA, 2020c) on NVIDIA GPUs and XLA (Google, 2020e) on TPUs.
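The equivalence between horizontally fused convolutions and a grouped convolution can be checked numerically. The following is a minimal sketch in pure Python, assuming a naive 1D convolution (stride 1, no padding, no bias) for brevity rather than Conv2d; the helper `conv1d` and all values are hypothetical and are not part of HFTA. In PyTorch, the corresponding mechanism is the `groups` argument of `torch.nn.Conv2d`.

```python
def conv1d(x, w, groups=1):
    """Naive 1D cross-correlation with stride 1 and no padding.

    x: input as [C_in][L]; w: filters as [C_out][C_in // groups][K].
    """
    c_in, c_out, k = len(x), len(w), len(w[0][0])
    in_per_group, out_per_group = c_in // groups, c_out // groups
    length = len(x[0]) - k + 1
    out = []
    for o in range(c_out):
        # First input channel visible to output channel o's group.
        first = (o // out_per_group) * in_per_group
        out.append([
            sum(w[o][c][j] * x[first + c][t + j]
                for c in range(in_per_group) for j in range(k))
            for t in range(length)
        ])
    return out


# Two hypothetical models: same operator shape, different weights and inputs.
x_a = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0]]
x_b = [[-1.0, 0.0, 1.0, 2.0], [3.0, 1.0, -2.0, 0.5]]
w_a = [[[1.0, 0.0], [0.5, -0.5]], [[-1.0, 2.0], [0.0, 1.0]]]
w_b = [[[0.25, 0.75], [1.0, 1.0]], [[2.0, -2.0], [0.5, 0.0]]]

# Running the two convolutions separately, one per model.
separate = conv1d(x_a, w_a) + conv1d(x_b, w_b)

# Horizontal fusion: concatenate along the channel dimension, use two groups.
fused = conv1d(x_a + x_b, w_a + w_b, groups=2)

assert fused == separate
```

Each group only sees its own model's input channels and filters, so the fused operator computes exactly the two original convolutions in a single call.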

Inspired by the above observations, instead of the common practice (Li, 2020) of running each job with a single model on a separate accelerator, we propose to better utilize existing hardware by deeply fusing the same (class of) models across multiple jobs together. Most, if not all, operators of these models can be horizontally fused, and we train these models simultaneously on the same accelerator. Thus, as depicted in Figure 1, we can fuse many training jobs into a single one, without the need to implement any new device-specific operator from scratch, which is both time-consuming and error-prone. Moreover, this approach easily generalizes to any hardware backend that the DL frameworks already support (e.g., all NVIDIA GPUs and Google TPUs in the case of PyTorch). Since horizontal operator fusion can be performed for both single-accelerator and distributed training, our approach applies to both use cases.

However, manually implementing or porting existing training workloads to the fused ones from scratch can be challenging for DL researchers and practitioners. To greatly simplify the associated engineering effort, we develop a new DL framework library called Horizontally Fused Training Array (HFTA). Even though we choose PyTorch (Paszke et al., 2019) as our prototyping DL framework due to its user-friendliness and increasing popularity within the ML community (He, 2019), the same idea can be implemented on top of other DL frameworks (e.g., TensorFlow (Abadi et al., 2016) and MXNet (Chen et al., 2015)). Also, HFTA is carefully designed to accommodate computer system and architecture “novices”. It can be used seamlessly with PyTorch-native training scripts and only requires changing very few lines of code. As an illustrative example, Figure 2 shows how to enable HFTA for AlexNet (Krizhevsky et al., 2012). We can observe that the model definition is kept the same, with only a few extra lines of code (highlighted in the red box) to update PyTorch’s operator classes.

We now discuss HFTA’s individual components:

**HFTA Operators** To relieve DL researchers and practitioners from the need to implement any horizontally fused operators themselves, HFTA covers most common operators, including the (de)convolution family (e.g., Conv2d or ConvTranspose1d). For example, two Conv2d operators can be fused into a single grouped Conv2d with two groups.

**HFTA Optimizers and Learning Rate Schedulers** In addition, HFTA supports inter-model horizontally fused optimizers (e.g., Adam (Kingma & Ba, 2015) and Adadelta (Zeiler, 2012)) and learning rate schedulers (e.g., StepLR (Senior et al., 2013)). This is because (1) hyper-parameter tuning is a common use case in repetitive training workloads, and (2) learning rates, learning rate schedules, and optimizer settings (e.g., momentum (Qian, 1999; Sutskever et al., 2013)) are common hyper-parameters that require tuning for many DL models. The scalar-vector operations (e.g., multiplying a learning rate under tuning with the gradients) in the original implementations are now replaced by broadcasted vector-vector operations (e.g., multiplying a vector of learning rates with the concatenated gradients of all models) in HFTA’s implementations (as we show in Figure 1). We also plan to continue improving HFTA’s coverage to support more operators, optimizers, and learning rate schedulers beyond this work’s publication.
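The replacement of scalar-vector operations by broadcasted vector-vector operations can be sketched with plain SGD. This is a minimal illustration in pure Python, not HFTA's optimizer code: the function names, the flat parameter layout, and the numbers are assumptions made for the example.

```python
def sgd_step(weights, grads, lr):
    """Unfused update for one model: a scalar learning rate times a gradient vector."""
    return [w - lr * g for w, g in zip(weights, grads)]


def fused_sgd_step(flat_weights, flat_grads, lrs, params_per_model):
    """Fused update for B models at once.

    flat_weights and flat_grads concatenate all models' parameters; lrs holds
    one learning rate per model and is broadcast over that model's parameters.
    """
    lr_per_param = [lr for lr in lrs for _ in range(params_per_model)]
    return [w - lr * g for w, lr, g in zip(flat_weights, lr_per_param, flat_grads)]


# Three hypothetical models, each with two parameters and its own learning rate.
lrs = [0.1, 0.01, 0.001]
weights = [[1.0, -2.0], [0.5, 0.25], [3.0, 4.0]]
grads = [[0.2, -0.4], [1.0, 2.0], [-0.5, 0.5]]

# Baseline: one optimizer step per model.
separate = [sgd_step(w, g, lr) for w, g, lr in zip(weights, grads, lrs)]

# Fused: a single step over the concatenated parameters and gradients.
flat_w = [w for model in weights for w in model]
flat_g = [g for model in grads for g in model]
fused = fused_sgd_step(flat_w, flat_g, lrs, params_per_model=2)

assert fused == [w for model in separate for w in model]
```

Because every parameter still sees the learning rate of its own model, the fused step produces the same updated weights as the separate steps.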

**Loss Scaling** The loss functions across multiple models can be fused as well. If the loss is reduced by averaging over the mini-batch, we scale the fused loss value by the number of models under fusion to reconstruct mathematically equivalent gradients. If the loss is not reduced or is reduced by sum, this scaling is not needed. We provide detailed derivations in Appendix C.
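The scaling rule for mean-reduced losses can be checked on a toy problem. The sketch below assumes a hypothetical one-parameter model per job (prediction `w * x`), a mean squared error loss, and central finite differences in place of automatic differentiation; none of these names or values come from HFTA, and the full derivation remains the one in Appendix C.

```python
import math


def mse_mean(w, xs, ys):
    """Mean-reduced squared error of a single model."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fused_mse_mean(ws, data):
    """Mean over every element of all models' outputs, as a fused loss would do."""
    errs = [(w * x - y) ** 2
            for w, (xs, ys) in zip(ws, data) for x, y in zip(xs, ys)]
    return sum(errs) / len(errs)


def grad(f, ws, i, eps=1e-4):
    """Central finite-difference derivative of f with respect to ws[i]."""
    up = ws[:i] + [ws[i] + eps] + ws[i + 1:]
    down = ws[:i] + [ws[i] - eps] + ws[i + 1:]
    return (f(up) - f(down)) / (2 * eps)


# Two hypothetical models with equally sized mini-batches.
data = [
    ([1.0, 2.0, 3.0], [2.0, 3.0, 5.0]),
    ([0.5, -1.0, 2.0], [1.0, 0.0, -1.0]),
]
ws = [0.7, -0.3]
B = len(ws)

# Gradient each model would get when trained on its own.
own = [grad(lambda v, b=b: mse_mean(v[0], *data[b]), [ws[b]], 0) for b in range(B)]
# Gradient from the fused mean-reduced loss, without and with scaling by B.
unscaled = [grad(lambda v: fused_mse_mean(v, data), ws, b) for b in range(B)]
scaled = [grad(lambda v: B * fused_mse_mean(v, data), ws, b) for b in range(B)]

for b in range(B):
    assert math.isclose(unscaled[b] * B, own[b], rel_tol=1e-6, abs_tol=1e-8)
    assert math.isclose(scaled[b], own[b], rel_tol=1e-6, abs_tol=1e-8)
```

Averaging over the fused batch divides each model's gradient by the number of models, so multiplying the fused loss by that number restores the original gradients in this example.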

**Convergence** Since the fusion of all components is achieved through mathematically equivalent transformations, HFTA theoretically does not have any effect on the models’ original convergence. We also provide the related empirical validation in Appendix D.

**Integration with Tuning Algorithms** During hyper-parameter tuning, as a tuning algorithm proposes different sets of hyper-parameters to try, we can leverage HFTA to improve the overall hardware utilization by partitioning these sets and fusing each partition as a single training job.

To prototype this idea, we develop a lightweight hyper-parameter tuning framework called Horizontally Fused Hyper-parameter Tuning (HFHT). HFHT currently supports (1) two hyper-parameter tuning algorithms (Hyperband (Li et al., 2018) and random search (Bergstra & Bengio, 2012)), and (2) not only sharing accelerators via HFTA but also sharing GPUs via MPS (to compare with HFTA). We discuss HFHT’s detailed design in Appendix E.
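The partition-and-fuse idea can be sketched as follows. This is only an illustration and not HFHT's actual design (which is described in Appendix E): the function `partition_for_fusion`, the notion of `shape_keys` (hyper-parameters assumed to change operator shapes and therefore to prevent fusion), and the `max_models` cap are all hypothetical.

```python
from collections import defaultdict


def partition_for_fusion(configs, shape_keys, max_models):
    """Split proposed hyper-parameter sets into partitions that could be fused.

    Sets that agree on every shape-affecting hyper-parameter land in the same
    group; each group is then cut into chunks of at most max_models sets, and
    each chunk would be launched as one fused training job.
    """
    groups = defaultdict(list)
    for cfg in configs:
        groups[tuple(cfg[k] for k in shape_keys)].append(cfg)
    partitions = []
    for members in groups.values():
        for start in range(0, len(members), max_models):
            partitions.append(members[start:start + max_models])
    return partitions


# Hypothetical sets proposed by a tuning algorithm.
proposed = [
    {"lr": 0.1, "momentum": 0.9, "num_layers": 4},
    {"lr": 0.01, "momentum": 0.9, "num_layers": 4},
    {"lr": 0.001, "momentum": 0.5, "num_layers": 4},
    {"lr": 0.1, "momentum": 0.9, "num_layers": 8},
    {"lr": 0.01, "momentum": 0.5, "num_layers": 8},
]

partitions = partition_for_fusion(proposed, shape_keys=["num_layers"], max_models=2)
print([len(p) for p in partitions])  # prints "[2, 1, 2]"
```

In this sketch, learning rates and momentum vary freely within a partition because fused optimizers accept one value per model, while the assumed shape-affecting hyper-parameter keeps incompatible models apart.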

### 4 Methodology

**Workloads** Our major benchmarks are carefully selected based on the following three criteria. First, our workloads should represent important models in their corresponding DL sub-fields, making sure that HFTA is effective in improving the hardware utilization for important DL models. Second, we select models that have not yet received much attention from the computer system and architecture communities and are not over-optimized. This is a much more realistic scenario for DL researchers and practitioners who typically lack the expertise to apply advanced optimization techniques. Third, we would like to cover both compute-bound and memory-bound DL models. Based on the aforementioned criteria, two classes of models (three different workloads) are selected as our major benchmarks.

PointNet (Qi et al., 2017) is a memory-bound neural network that performs (i) object classification and (ii) segmentation tasks on 3D point clouds. The models for both tasks are trained on the ShapeNet part dataset (Yi et al., 2016). We leverage a third-party PyTorch implementation of PointNet (Xia, 2019) endorsed by Qi et al. (Qi, 2017).

DCGAN (Radford et al., 2016) is a compute-bound generative adversarial network (GAN) that synthesizes natural-looking images. The model is trained on the LSUN dataset (Yu et al., 2015). We leverage an implementation of DCGAN from PyTorch official examples (PyTorch, 2020).

To emulate the hardware usage habits of DL researchers and practitioners without influence from computer system and architecture experts, the batch sizes used in both benchmarks are kept the same as reported in their corresponding publications.

To empirically validate that HFTA does not affect convergence and to demonstrate that HFTA can also improve the hardware utilization for conventional models, we further include ResNet-18 (He et al., 2016), MobileNetV3-Large (Howard et al., 2019), Transformer (Vaswani et al., 2017), and BERT-Medium (Turc et al., 2019) as our secondary benchmarks. To evaluate HFHT, for each of the PointNet and MobileNet classification tasks, we prepare two end-to-end hyper-parameter tuning workloads using different tuning algorithms (random search and Hyperband) as our benchmarks. All four workloads aim at maximizing the validation accuracy on their corresponding datasets. Each workload tunes eight independent hyper-parameters. We discuss the detailed setup of these benchmarks in Appendix H.

---

HFHT also supports MIG. HFHT’s evaluation on MIG is not included due to space constraints.

**Experimental Setup** Our experiments are performed on two types of ML accelerators (NVIDIA GPUs and Google TPUs), including the three most recent generations of GPUs and the latest available generation of TPUs: (i) Volta-based V100 (NVIDIA, 2020k), (ii) Turing-based RTX6000 (NVIDIA, 2020), (iii) the very recent Ampere-based A100 (NVIDIA, 2020a), and (iv) TPU v3 (Google, 2020a). We provide the detailed specifications in Table 4.

**Baselines** We use hyper-parameter tuning (including learning rate, learning rate schedule, and optimizer settings) as the use case for our repetitive single-accelerator training jobs under experimentation. We compare HFTA with the following four SOTA baselines. (1) **Serial**: each training job is executed on a single accelerator. This scheme is employed by most hyper-parameter tuning frameworks (Weights&Biases, 2020; Li, 2020). (2) **Concurrent**: multiple training jobs are executed as independent processes on the same GPU. In this case, the kernels from different processes are time-multiplexed, but cannot execute concurrently on the same GPU (without the help of MPS or other hardware features). This scheme is used when MPS is not preferable due to reasons related to infrastructure and/or security (e.g., custom-built infrastructure or CUPTI tools that are not compatible with MPS). (3) **MPS**: similar to concurrent, except the independent processes are executed via MPS. (4) **MIG**: similar to concurrent, except the independent processes are executed via MIG. This scheme is currently only available on the A100 GPUs. We use concurrent, MPS, and MIG only on GPUs since TPUs do not support running concurrent processes as of now. We do not evaluate the major related prior work, HiveMind (Narayanan et al., 2018a), since it is both closed-source and implemented on a different ML framework (TensorFlow). We provide the detailed qualitative comparison against HiveMind in Section 6.

**Metrics** We use the per-device training throughput as our key performance metric to compare HFTA against our baselines since HFTA has no impact on the model convergence. We calculate this throughput by measuring the end-to-end training latency of (i) 10 epochs for both classification and segmentation tasks on PointNet; and (ii) 5 epochs, 1000 iterations per epoch, on DCGAN (enough for these workloads to enter the execution steady state). We skip the first epoch on GPUs and the first two epochs on TPUs to warm up the hardware properly before making any measurements. We repeat each experiment three times and report the average, minimum, and maximum per experiment.
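The measurement procedure above can be sketched in a few lines. The sketch assumes that throughput is counted in training samples per second summed over all models sharing a device, and that results are normalized by a serial baseline; the function names and the timing values are made up for illustration and are not results from this work.

```python
from statistics import mean


def per_device_throughput(num_models, samples_per_epoch, epoch_seconds, warmup_epochs):
    """Samples per second on one device, ignoring the warm-up epochs."""
    measured = epoch_seconds[warmup_epochs:]
    return num_models * samples_per_epoch * len(measured) / sum(measured)


def summarize(values):
    """Average, minimum, and maximum over repeated runs."""
    return mean(values), min(values), max(values)


# Hypothetical per-epoch latencies (seconds) from three repeated runs each.
serial_epochs = [[30.0, 10.0, 10.0], [28.0, 10.0, 10.0], [31.0, 10.0, 10.0]]
fused_epochs = [[60.0, 20.0, 20.0], [55.0, 16.0, 16.0], [62.0, 25.0, 25.0]]

# One model per device for the serial baseline, skipping the first epoch.
serial = [per_device_throughput(1, 1000, t, warmup_epochs=1) for t in serial_epochs]
# Four models sharing one device in the fused runs.
fused = [per_device_throughput(4, 1000, t, warmup_epochs=1) for t in fused_epochs]

baseline = mean(serial)
normalized = [value / baseline for value in fused]
avg, lo, hi = summarize(normalized)
print(f"normalized throughput: avg={avg:.2f} min={lo:.2f} max={hi:.2f}")
```

Excluding the warm-up epochs keeps one-time costs such as initial kernel selection or graph compilation out of the steady-state figure.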

To measure the effect of each technique on the hardware utilization, we use the sm_active and sm_occupancy performance counters that represent the SM temporal and spatial utilization, respectively. We also use the tensor_active performance counter to measure the TC temporal utilization (NVIDIA, 2020d). Details on these performance counters can be found in Appendix F.

### Table 4. Specifications of our experiment platforms. Dev. Mem. and VM/Host Mem. stand for device memory and VM/host memory, respectively, in GB. CSP stands for cloud service provider.

<table>
<thead>
<tr>
<th>Accelerator</th>
<th>Dev. Mem.</th>
<th>CSP</th>
<th>VM Instance</th>
<th>CUDA</th>
<th>cuDNN</th>
<th>GPU Driver</th>
<th>PyTorch</th>
<th>PyTorch/XLA</th>
<th>(v)CPUs</th>
<th>VM/Host Mem.</th>
</tr>
</thead>
<tbody>
<tr>
<td>V100</td>
<td>16</td>
<td>AWS</td>
<td>p3.2xlarge</td>
<td>10.2</td>
<td>7.6.5</td>
<td>450.51.05</td>
<td>1.6.0</td>
<td>-</td>
<td>8</td>
<td>61</td>
</tr>
<tr>
<td>RTX6000</td>
<td>24</td>
<td>-</td>
<td>-</td>
<td>10.2</td>
<td>7.6.5</td>
<td>450.66</td>
<td>1.6.0</td>
<td>-</td>
<td>8</td>
<td>16</td>
</tr>
<tr>
<td>A100</td>
<td>40</td>
<td>GCP</td>
<td>a2-highgpu-1g</td>
<td>11.0.3</td>
<td>8.0.2</td>
<td>450.51.06</td>
<td>1.7.0a+8deb4fc</td>
<td>-</td>
<td>12</td>
<td>85</td>
</tr>
<tr>
<td>TPU v3</td>
<td>16</td>
<td>GCP</td>
<td>n1-highmem-8</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>1.7.0a+862e410</td>
<td>1.6+8a57fb</td>
<td>8</td>
<td>52</td>
</tr>
</tbody>
</table>

### 5 Evaluation

Our evaluation results on the major benchmarks (i.e., PointNet classification task, PointNet segmentation task, and DCGAN) are thoroughly analyzed here, including end-to-end training performance on GPUs (Section 5.1) and TPUs (Section 5.2). We also analyze key GPU hardware performance counters to explain why HFTA achieves significantly better training performance (Section 5.3). We discuss the results on our secondary benchmarks (i.e., ResNet-18, MobileNetV3-Large, Transformer, and BERT-Medium) in Appendix I. Finally, we showcase HFTA’s potential to significantly improve the hardware utilization for existing hyper-parameter tuning algorithms (Section 5.4).

#### 5.1 End-to-end Training Performance on GPUs

**V100 Results** To compare HFTA’s end-to-end training performance with other alternatives (i.e., serial, concurrent, MPS), Figures 4a, 4b, and 4c plot the per-GPU normalized training throughput on the V100 GPUs (Volta architecture (NVIDIA, 2017)) with the PointNet classification task, PointNet segmentation task, and DCGAN, respectively. We normalize the throughput for each experiment by the respective FP32 serial baseline. We show both FP32 and AMP (Huang et al., 2020) training results for each experiment. Each curve grows as we increase the number of models that either co-run together (for the concurrent and MPS baselines) or run in the fused form with HFTA. Each curve “stops” when it reaches the maximum number of models before the GPU runs out of memory. Based on these figures, we make several major observations:

First, **HFTA achieves significantly higher peak throughput than all baselines**; specifically, $4.29\times$ to $10\times$ over serial, $2.01\times$ to $4.87\times$ over concurrent, and $2.03\times$ to $4.50\times$ over MPS. The significant throughput improvement is due to a much higher achieved utilization in both compute cores (details in Section 5.3) and GPU memory (discussed in the next observation).

Figure 4. The normalized training throughput as we increase the number of models sharing the same GPU. Panels: (a) PointNet Classification on V100; (b) PointNet Segmentation on V100; (c) DCGAN on V100; (d) PointNet Classification on RTX6000; (e) PointNet Segmentation on RTX6000; (f) DCGAN on RTX6000; (g) PointNet Classification on A100; (h) PointNet Segmentation on A100; (i) DCGAN on A100.

Second, HFTA enables more models to share the same GPU than MPS and concurrent; specifically, up to 1.80× on the PointNet classification task, up to 1.60× on the segmentation task and up to 7.57× on DCGAN. This is because HFTA does not duplicate the GPU memory overhead as we explain in Section 5.3.

Third, as we increase the number of models sharing the same GPU, the throughput of HFTA scales up and, in some cases, plateaus eventually. This is because using HFTA, the SM and TC utilization increases with the number of co-executing models (as we explain in Section 5.3). In contrast, MPS and concurrent either (i) plateau at a smaller number of models with a lower throughput, as we observe in Figure 4a and 4b, or (ii) even experience performance degradation as we observe in Figure 4c due to host resource (e.g., CPUs, disk I/O bandwidth, and/or memory) contention among many training processes.

Fourth, even with the same number of models sharing the same GPU, HFTA often achieves higher throughput than all baselines. The maximum speedups range from 1.62× to 3.41× over concurrent and 1.17× to 3.05× over MPS.

Fifth, HFTA can better exploit computing power from advanced hardware features such as TCs used during AMP training compared to the baselines. Specifically, the maximum speedup of AMP training over FP32 is 2.65× with HFTA, but only 1.00× for serial, 1.07× for concurrent, and 1.06× for MPS.

Therefore, we conclude that HFTA can significantly outperform major hardware-based sharing alternatives in improving hardware utilization and, as a result, improve the throughput of emerging ML models during repetitive single-accelerator training.

**RTX6000 and A100 Results** To check whether HFTA’s significant performance gains are general across different GPU architectures (e.g., Turing (NVIDIA, 2018) and Ampere (NVIDIA, 2020b)), we conduct the same set of experiments on the RTX6000 (Figure 4d, 4e and 4f respectively) and the A100 (Figure 4g, 4h and 4i respectively) while adding the extra MIG baseline for the A100. The general trends in these figures are similar to those we observe for V100. To simplify the comparison, for each workload on each GPU, Table 5 presents the peak throughput speedups of HFTA over the baselines, while Appendix G presents (i) the maximum throughput speedups of HFTA over the baselines given a fixed number of models, and (ii) the maximum AMP training throughput speedups over FP32 for both HFTA and the baselines. In addition, we make the following new observations:

First, both RTX6000 and A100 have higher GPU memory (HBM) capacities than V100 (24 GB and 40 GB vs. 16 GB); therefore, both HFTA and the baselines can co-run more models on the same RTX6000/A100 compared with V100. For example, AMP training of the PointNet classification task via HFTA can run up to 15/25 models on RTX6000/A100 vs. 9 on V100.

Second, since A100 has more compute capability and a larger GPU memory capacity than V100, the comparison of Figure 4g vs. 4a and 4h vs. 4b reveals that HFTA not only fits more models on the same hardware, but also achieves a higher peak throughput speedup over the baselines on A100 than on V100 (e.g., for PointNet segmentation task, the peak throughput speedup over serial is as high as 9.48× on A100 vs. 4.29× on V100).

Third, we observe one anomaly in DCGAN training on A100 (Figure 4i) where HFTA’s FP32 throughput is higher than that of AMP. After profiling the AMP run of this experiment via the PyProf (Agrawal & Kolodziej, 2020) tool, we pinpoint a few suspicious cuDNN-related FP32 kernels (which are supposed to be replaced by the equivalent TC kernels) in the backward pass. Since the Ampere architecture and the corresponding versions of cuDNN/PyTorch are very recently released, and we do not observe similar problems on older cuDNN/PyTorch versions for V100 and RTX6000, we believe that this issue is temporary due to the insufficient optimization in some of the new cuDNN kernels for A100. We hope it will be addressed in future cuDNN releases.

Fourth, we notice that on A100, the MIG partitioning (only up to 7 GPU instances) can be too coarse-grained. As we observe in Figure 4g, 4h and 4i, both MPS and concurrent could often share the A100 with more than seven models.

Therefore, we conclude that HFTA’s performance generally scales well with the compute and memory capabilities of modern GPUs. We observe higher performance benefits in the newer GPU architectures that would otherwise suffer more significantly from the hardware under-utilization when training without HFTA (as we qualitatively discuss in Section 2.1 and empirically show in Appendix G).

#### 5.2 End-to-end Training Performance on TPUs

As we aim to build a general solution that works for different ML accelerators, we also evaluate HFTA on an entirely different type of accelerator: Google TPU v3. Figure 5 plots the per-core training throughput for serial vs. HFTA on the PointNet classification and DCGAN experiments on TPU v3, normalized by the throughput of the respective serial baseline. Similar to previous results on GPUs, each HFTA curve shows how the normalized throughput increases with the number of models sharing the same TPU (until the fused models cannot fit into the TPU HBM memory). We make three major observations from these figures.

First, HFTA achieves 4.93× / 15.13× higher peak throughput than serial on the PointNet classification / DCGAN.

Second, we observe that for DCGAN, HFTA can sometimes achieve “super-linear” speedups. Our current investigation concludes that the most likely cause of such a behaviour is the tensor padding added in the serial baseline by the XLA (Google, 2020b) compiler (Google, 2020d), making this baseline weaker than it should be otherwise.

Additionally, we investigate HFTA’s potential on the PointNet segmentation task. Unfortunately, HFTA currently achieves a less impressive 1.20× speedup over the serial baseline, which we attribute to the PointNet segmentation variant having many non-GEMM-based operators that the XLA compiler cannot map well to systolic arrays. Deeper analysis, however, is limited because the recently released xprof (Google, 2020f) tool does not directly support PyTorch/XLA. We will perform deeper analysis of this problem and research potential solutions as soon as a proper version of the profiler is released.

#### 5.3 In-depth Performance Analysis

Using the PointNet classification task as a case study, we perform deeper analysis by profiling GPU hardware performance counters to explain why HFTA can share the same GPU with more training workloads and achieve higher training throughput than the baselines.

---

**Table 5.** The peak training throughput speedups of HFTA over the baselines. For each experiment, the higher throughput between FP32 and AMP is used in the calculation. The detailed breakdown between FP32 and AMP is included in Appendix G.

<table>
<thead>
<tr><th>GPU</th><th>Baseline</th><th>PointNet Classification</th><th>PointNet Segmentation</th><th>DCGAN</th></tr>
</thead>
<tbody>
<tr><td>V100</td><td>serial</td><td>5.02</td><td>4.29</td><td>4.59</td></tr>
<tr><td>V100</td><td>concurrent</td><td>4.87</td><td>4.24</td><td>2.01</td></tr>
<tr><td>V100</td><td>MPS</td><td>4.50</td><td>3.03</td><td>2.03</td></tr>
<tr><td>RTX6000</td><td>serial</td><td>4.36</td><td>3.63</td><td>6.29</td></tr>
<tr><td>RTX6000</td><td>concurrent</td><td>4.26</td><td>3.54</td><td>1.72</td></tr>
<tr><td>RTX6000</td><td>MPS</td><td>3.79</td><td>2.54</td><td>1.82</td></tr>
<tr><td>A100</td><td>serial</td><td>11.50</td><td>9.48</td><td>4.41</td></tr>
<tr><td>A100</td><td>concurrent</td><td>12.98</td><td>10.26</td><td>1.29</td></tr>
<tr><td>A100</td><td>MPS</td><td>4.72</td><td>2.93</td><td>1.33</td></tr>
<tr><td>A100</td><td>MIG</td><td>4.88</td><td>3.02</td><td>1.33</td></tr>
</tbody>
</table>

Figure 5. The normalized training throughput as we increase the number of models sharing (via HFTA) the same TPU v3 core.

Figure 6. GPU Memory Footprints of MPS and HFTA for PointNet classification task as we increase the number of models sharing the same V100.

Figure 7. The hardware performance counters for PointNet classification task as we increase the number of models sharing the same A100.

#### 5.4 Overall Cost Saving for Tuning Algorithms

To demonstrate that HFTA can be efficient in improving the hardware utilization for existing hyper-parameter tuning algorithms, Figure 8 lists the total GPU hour cost of tuning eight hyper-parameters via HFHT for the PointNet and MobileNet classification tasks on the V100 GPU using two tuning algorithms (random search (Bergstra & Bengio, 2012) and Hyperband (Li et al., 2018)) and four job schedulers (serial, concurrent, MPS, and HFTA). We make two major observations from this figure.

First, HFTA can reduce the total GPU hour cost by up to $5.10 \times$ and lead to significantly better hardware utilization than all other baselines.

Second, as we theoretically discuss in Appendix E, random search benefits more from HFTA than Hyperband. This is because, during certain iterations, Hyperband proposes to run many epochs on just a few sets of hyper-parameters. Thus, within such iterations, Hyperband does not generate enough parallel jobs to either share the GPU with MPS or to be fused by HFTA. Therefore, such iterations become the bottleneck of the total cost when tuning is conducted with a job scheduler based on hardware sharing.
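The bottleneck described above can be illustrated with a successive-halving schedule, the building block of Hyperband. The sketch below is a simplified illustration rather than the exact schedule used in our experiments: the function names, the bracket of 81 configurations, the elimination factor of 3, and the capacity of 8 fused models per GPU are all assumptions chosen for readability.

```python
def successive_halving_rounds(num_configs, min_epochs, eta):
    """Return (configs, epochs per config) for each round of one bracket.

    Each round keeps 1/eta of the configurations and trains the
    survivors eta times longer.
    """
    rounds = []
    n, r = num_configs, min_epochs
    while n >= 1:
        rounds.append((n, r))
        n, r = n // eta, r * eta
    return rounds


def slot_usage(num_configs, capacity):
    """Fraction of model slots filled when num_configs jobs are packed
    into fused jobs that each hold up to `capacity` models (assumed)."""
    fused_jobs = -(-num_configs // capacity)  # ceiling division
    return num_configs / (fused_jobs * capacity)


rounds = successive_halving_rounds(num_configs=81, min_epochs=1, eta=3)
for n, r in rounds:
    print(f"{n:3d} configs x {r:3d} epochs -> slot usage {slot_usage(n, 8):.2f}")
```

In the last rounds of this toy bracket, a few configurations run for many epochs, so most of the assumed eight slots stay empty; this is the situation in which neither MPS nor HFTA has enough parallel jobs to share the GPU.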

Figure 8. Total GPU hours of four hyper-parameter tuning workloads on V100. For each algorithm and scheduler, we show the lower total GPU hours between the FP32 and AMP training.

---

5The trends on RTX6000 and A100 are consistent with V100.
6V100 results are similar and shown in Appendix G.

### 6 Related Works

Major prior works on DL job fusion (Liu et al., 2020; Narayanan et al., 2018b;a) suffer from three key weaknesses: (i) avoiding directly addressing hardware under-utilization, (ii) strongly depending on the CUDA stream primitive (Harris, 2015), a generic GPU-sharing method but inefficient for repetitive training jobs, and (iii) employing very restricted fusion schemes that are ineffective in practice. We discuss these prior works in detail below.

pack (Liu et al., 2020) merges TensorFlow (Abadi et al., 2016) graphs from multiple training jobs into a single graph to amortize only the IO and data preprocessing cost, but does not address the hardware under-utilization from the model forward and backward passes.

Furthermore, ModelBatch (Narayanan et al., 2018b) attempts to parallelize the kernel launches from multiple training jobs via CUDA streams (the CUDA programming interface of Hyper-Q (Bradley, 2007)), which suffers from similar pitfalls of runtime overhead duplication as MPS.

Although intra-model vertical and horizontal fusion of DL operators has been studied extensively by many prior works (Appleyard et al., 2016; Gray et al., 2017; Vasilache et al., 2018; Rotem et al., 2018; Chen et al., 2018; Jia et al., 2019), inter-model horizontal fusion has only been explored in extremely limited depth: HiveMind (Narayanan et al., 2018a) proposes fusion schemes for 1) non-stateful operators with the same shapes, 2) stateful operators that share the same weights, and 3) stateful operators that share the same shapes and inputs. Unfortunately, condition 2) is rarely applicable to training workloads since each individual model has its own weights, while condition 3) usually only applies to the first operator in a DL model since the following operators will have different inputs, leaving most of the fusion opportunities completely untapped. Besides, HiveMind does not demonstrate any performance improvement over MPS as it also relies on CUDA streams to extract utilization when its fusion scheme becomes ineffective. Therefore, the HiveMind approach is hard to generalize to accelerators with no hardware-specific sharing features (e.g., TPUs).

In contrast, HFTA can fuse any operators of the same types and shapes across training jobs, which generally leads to full inter-model fusions. Moreover, HFTA demonstrates significant performance improvement against the existing widely-adopted generic hardware-based sharing approaches (e.g., MPS and MIG) since operator fusion does not possess the same shortcomings of those approaches, as we show in Section 2.2. Finally, HFTA requires no hardware or DL framework stack modifications and applies to any existing hardware backends, including GPUs, TPUs, and any other accelerators that the major DL frameworks support.

In parallel with our work, Rammer (Ma et al., 2020) proposes a data flow graph compiler that enables hardware sharing via operator and accelerator abstractions, which requires special operator implementations (e.g., using the Persistent Thread programming model (Gupta et al., 2012) for CUDA), and focuses mostly on intra-model horizontal fusion and inference workloads. Retiarii (Zhang et al., 2020) hints that observations similar to ours are possible, but does not focus on exploring this idea in depth.

### 7 Conclusion

In this work, we learn from the GPU cluster usage analysis that repetitive single-accelerator training jobs (e.g., for hyper-parameter tuning) often dominate cluster-wide hardware resource usage and can severely under-utilize the hardware. To address this challenge, we observe that these jobs possess unique characteristics which enable inter-model horizontal fusion. Therefore, we propose the HFTA (DL framework extension) library that horizontally fuses the models deeply down to operators with minimal extra effort from DL researchers and practitioners, significantly improving the hardware utilization of these workloads by simultaneously training many models on the same accelerator. On six highly impactful DL models, HFTA achieves up to 15.13× higher training throughput than running each job on a separate accelerator, a common practice employed by hyper-parameter tuning frameworks. We continue to expand the coverage of HFTA with more operators, optimizers, and learning rate schedulers, and to investigate how existing DL frameworks, hyper-parameter tuning and model architecture search algorithms can be adjusted to extract more performance from hardware sharing via HFTA. We hope our work inspires future research on assisting ML researchers and developers with limited optimization experience to better utilize the hardware for their novel DL models.

Acknowledgements

We want to especially thank Suvinay Subramanian for TPU-related issues and discussions. We want to thank Xiaodan (Serina) Tan, Suvinay Subramanian, James Gleeson, Anand Jayarajan, and Garth Gibson for their constructive feedback during the development of this work. We want to thank Google for TPU credits and early accesses to the GCP A2 Alpha version instances. This project was supported in part by the Canada Foundation for Innovation JELF grant, NSERC Discovery grant, AWS Machine Learning Research Award, and Facebook Faculty Research Award.
REFERENCES



### SUMMARY OF APPENDICES

These appendices cover the following content that does not fit into the paper’s main text due to space constraints.

Appendix A describes the methodology that we use to collect the GPU cluster usage statistics from the Vector Institute. It also provides empirical evidence to support our observation that the dominating single-GPU training jobs often have low hardware utilization.

Appendix B lists the operators that HFTA currently supports as well as their corresponding horizontally-fused counterparts.

Appendix C describes how fused loss functions are handled in order to reconstruct mathematically equivalent gradients.

Appendix D provides empirical evidence that HFTA has no impact on the model’s original convergence.

Appendix E discusses our approach to integrate HFTA with existing hyper-parameter tuning algorithms via HFHT in detail.

Appendix F shows how we collect the GPU hardware performance counters and provides the related references.

Appendix G provides additional statistics and insights that can help clarify our observations and conclusions in Section 5.

Appendix H includes additional evaluation methodology regarding our secondary benchmarks (i.e., ResNet-18, MobileNetV3-Large, Transformer and BERT-Medium), convergence validation, HFHT and partial fusion.

Appendix I shows the additional evaluation results on our secondary benchmarks and partially fused ResNet-18.

### A “REAL-WORLD” GPU CLUSTER USAGE STATISTICS

We analyzed the job submissions and execution logs for a two-month period (July 1st to Sept. 1st, 2020) from a large GPU cluster belonging to the Vector Institute, an independent, not-for-profit corporation dedicated to research in the field of artificial intelligence and machine learning (Vector Institute, 2021). The cluster services a variety of DL training workloads from the Vector Institute’s community. The community consists of 501 faculty, postdoc and student researchers who have published 263 conference and journal papers from April 2019 to March 2020, including 61 papers in NeurIPS, ICLR, CVPR and ICML.

The cluster includes 4 GPU partitions, V1a (200 P100 GPUs), V1b (40 T4 GPUs), V2 (480 T4 GPUs) and V3 (240 RTX6000 GPUs), where V3 came online in the last few days of the collection period. V2 was recorded for the entire period, and the other three partitions were recorded for the last 11 days. V2 is distinguished as the largest partition with the least powerful GPUs. The data contains information on 51338 jobs. The total number of GPU hours spent in these two months amounts to 471768 (equivalent to ~317 GPU days per day).
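The per-day figure follows directly from the totals above. The short check below assumes that the collection period spans July 1st to Sept. 1st, 2020 and is only a sanity check of the arithmetic; the variable names are ours.

```python
from datetime import date

total_gpu_hours = 471768
period_days = (date(2020, 9, 1) - date(2020, 7, 1)).days  # length of the period

# Convert GPU hours to GPU days, then average over the collection period.
gpu_days_per_day = total_gpu_hours / 24 / period_days
print(period_days, round(gpu_days_per_day))
```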

We classify the submitted jobs as “repetitive single-GPU training jobs” if they contain the following submission and execution patterns:

1. Each job only requests a single GPU despite the availability of multiple GPUs on the same node (i.e., not single-node distributed training). The job also does not specify which node the GPU should reside on (i.e., not multi-node distributed training). Therefore, it can only be a single-GPU training job.

2. Within a short time period (60 seconds), a batch of such single-GPU jobs are submitted from the same user, which means that the submission of these jobs is automated, and possibly contains the same code/program with varying parameters.7

3. The job names are very similar within the batch for such a short time period. We determine the similarity by calculating the normalized Levenshtein distance (Levenshtein, 1966) among job names with a threshold of 0.9. As a reference, the distance score between two job names ranges from 0 to 1, where 1 represents being completely identical and 0 represents being different. This filter further verifies that these jobs are repetitive single-GPU jobs since the job names are very similar. Afterwards, a manual inspection of the job names within the batches indicates that those names usually contain small variations such as learning rate value or optimizer choices and settings.
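The three filters above can be sketched as follows. This is an illustrative reconstruction, not the script used for our analysis: the job record fields (`user`, `submit_time`, `num_gpus`, `name`), the helper names, and the choice of normalizing by the length of the longer name are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalized score in [0, 1]; 1 means the names are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def is_repetitive_batch(jobs, window=60, threshold=0.9):
    """jobs: list of dicts with the hypothetical keys described above."""
    if len(jobs) < 2:
        return False
    single_gpu = all(j["num_gpus"] == 1 for j in jobs)
    same_user = len({j["user"] for j in jobs}) == 1
    times = [j["submit_time"] for j in jobs]
    in_window = max(times) - min(times) <= window
    first = jobs[0]["name"]
    similar = all(similarity(first, j["name"]) >= threshold for j in jobs[1:])
    return single_gpu and same_user and in_window and similar


jobs = [
    {"user": "alice", "submit_time": 0, "num_gpus": 1, "name": "pointnet_lr0.001_adam"},
    {"user": "alice", "submit_time": 12, "num_gpus": 1, "name": "pointnet_lr0.002_adam"},
]
print(is_repetitive_batch(jobs))
```

The two hypothetical job names differ in a single character, so they pass the 0.9 threshold; a batch that mixes unrelated names or spans more than 60 seconds would be rejected.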

We further reached out to individual users to confirm our conclusion. We interviewed 11 active (i.e., most frequent) users of the GPU cluster: (1) 7 users responded that more than 50% of their jobs are repetitive single-GPU training for purposes including hyper-parameter tuning; and (2) 4 of those 7 users submitted over 95% of their jobs for repetitive single-GPU training. The GPU hour usage distribution is plotted in Figure 9.

Since the cluster does not actively monitor GPU hardware performance counters, we randomly sampled several jobs tagged as repetitive single-GPU training jobs and manually gathered the performance counters. Based on the sm_active and sm_occupancy (explained in Section 4 and elaborated in Appendix F) metrics from our samples, we observe that many of the repetitive single-GPU training jobs can severely under-utilize the GPUs both temporally and spatially (as shown in Figure 10a and Figure 10b, respectively). The maximum sm_active among the sampled jobs is 24%, and the maximum sm_occupancy is 14%.

7The exact code for each job was not available to us due to security/IP concerns.

### B HFTA OPERATOR FUSION RULES

We list the horizontal operator fusion rules for 17 PyTorch operators in Table 6. The left column contains the original operators, and the right column gives the operator that yields the mathematically equivalent horizontally-fused version of $B$ original operators. These operators are commonly used in DL research and development, and are sufficient to implement a wide range of state-of-the-art DL models. Building on top of these fusion rules, we further develop the fused multihead attention layer and the fused Transformer encoder layer to support models that are based on attention mechanisms or transformers.
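To make the grouped-convolution rule concrete, the sketch below checks in plain Python that $B$ independent pointwise (kernel size 1) convolutions with $g$ groups each give the same outputs as one grouped convolution over channel-concatenated inputs with $B \times g$ groups. It is a toy model for a single spatial position, not HFTA's implementation; the function names and the integer test values are assumptions for illustration.

```python
def grouped_pointwise_conv(x, weight, bias, groups):
    """Kernel-size-1 grouped convolution at one spatial position.

    x: C_in input channels; weight: C_out rows of C_in // groups values;
    bias: C_out values. Output channel o only reads its own group's inputs.
    """
    in_per_group = len(x) // groups
    out_per_group = len(weight) // groups
    y = []
    for o, (row, b) in enumerate(zip(weight, bias)):
        start = (o // out_per_group) * in_per_group
        window = x[start:start + in_per_group]
        y.append(sum(w * v for w, v in zip(row, window)) + b)
    return y


def fuse(models):
    """Concatenate B models' weights and biases along the output-channel axis."""
    weight = [row for m in models for row in m["weight"]]
    bias = [b for m in models for b in m["bias"]]
    return weight, bias


g = 2  # groups used by each original operator
models = [
    {"weight": [[1, 2], [3, 4], [5, 6], [7, 8]], "bias": [1, 0, -1, 2]},
    {"weight": [[2, 0], [0, 2], [1, 1], [-1, 1]], "bias": [0, 0, 3, 3]},
]
inputs = [[1, 2, 3, 4], [5, 6, 7, 8]]  # one C_in = 4 input per model

# Run each model on its own, then run the single fused operator.
separate = [grouped_pointwise_conv(x, m["weight"], m["bias"], g)
            for x, m in zip(inputs, models)]
fused_weight, fused_bias = fuse(models)
fused_input = [v for x in inputs for v in x]  # channels become B * C_in
fused = grouped_pointwise_conv(fused_input, fused_weight, fused_bias,
                               groups=len(models) * g)
print(fused == separate[0] + separate[1])
```

Because every group reads only its own slice of input channels, the fused operator never mixes values from different models, which is why the outputs match those of the separate operators.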

### C FUSED LOSS

In the following, we show how loss fusion is handled in order to reconstruct mathematically equivalent gradients. The inter-model horizontally fused loss with mean reduction is shown as:

$$\mathcal{L} = \frac{1}{B} \sum_{b=0}^{B-1} \ell_b \tag{1}$$

where $\ell_b$ is the loss of the $b$-th model, and there are $B$ models in total contributing to the fused loss $\mathcal{L}$. Taking the gradients on both sides of Equation 1 with respect to the parameters $\theta_\beta$ of a specific model $\beta$ results in:

$$\nabla_{\theta_\beta} \mathcal{L} = \nabla_{\theta_\beta} \frac{1}{B} \sum_{b=0}^{B-1} \ell_b = \frac{1}{B} \sum_{b=0}^{B-1} \nabla_{\theta_\beta} \ell_b = \frac{1}{B} \nabla_{\theta_\beta} \ell_\beta \tag{2}$$

because $\nabla_{\theta_\beta} \ell_b = 0$ if $b \neq \beta$. We can rearrange Equation 2 into:

$$\nabla_{\theta_\beta} \ell_\beta = B \nabla_{\theta_\beta} \mathcal{L} = \nabla_{\theta_\beta} (B \mathcal{L}) \tag{3}$$

We can recognize that the expression on the left-hand side of Equation 3 is precisely the gradients for model $\beta$ if each model were trained independently. Therefore, in order to reconstruct exactly the same gradients when training via HFTA, the final fused loss $\mathcal{L}$ needs to be scaled by $B$. Similarly for the fused loss with sum reduction:

$$\mathcal{L} = \sum_{b=0}^{B-1} \ell_b \tag{4}$$

we can derive that such scaling is no longer needed:

$$\nabla_{\theta_\beta} \mathcal{L} = \nabla_{\theta_\beta} \sum_{b=0}^{B-1} \ell_b = \sum_{b=0}^{B-1} \nabla_{\theta_\beta} \ell_b = \nabla_{\theta_\beta} \ell_\beta \tag{5}$$

**Table 6.** The horizontal fusion rules for the operators that HFTA currently supports. “ConvT” stands for “ConvTranspose” (a.k.a., deconvolution). $x$, $y$, $\delta$, and $\tilde{b}$ represent the input, output, weight and bias tensors, respectively. $N$, $C$, $H$, $W$ and $L$ represent the batch sizes, channel sizes, heights, widths, and signal lengths of the tensors used in convolutions, deconvolution, batch-norms, MaxPool2d, and Dropout2d. $G$ represents the numbers of groups used in the convolutions and deconvolution. $F$ represents the feature sizes of the tensors used in linear layers. $D$ represents arbitrary unmodified dimensions in LayerNorm and embedding layers. $E$ represents the dimensions for normalization in LayerNorm. $\epsilon$ and $\xi$ represent the number of embedding vectors and the size of each embedding vector for the embedding layer.

<table>
<thead>
<tr>
<th>PyTorch Operator (Tensors: Shapes, Other Parameters = Arguments)</th>
<th>HFTA Horizontally Fused Operator (Tensors: Shapes, Other Parameters = Arguments)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Conv2d($x$: $[N, C_x, H_x, W_x]$, $\delta$: $[C_y, C_z, H_y, W_y]$, $\tilde{b}$: $[C_z]$, $G = g$, $\ast$)</td>
<td>Conv2d($x$: $[N, B \times C_x, H_y, W_y]$, $\delta$: $[C_y, B \times C_z, H_y, W_y]$, $\tilde{b}$: $[B \times C_z]$, $G = B \times g$, $\ast$)</td>
</tr>
<tr>
<td>Conv1d($x$: $[N, C_x, L_x]$, $\delta$: $[C_y, L_y]$, $\tilde{b}$: $[C_y]$, $G = g$, $\ast$)</td>
<td>Conv1d($x$: $[N, B \times C_x, L_y]$, $\delta$: $[C_y, B \times L_y]$, $\tilde{b}$: $[B \times C_y]$, $G = B \times g$, $\ast$)</td>
</tr>
<tr>
<td>Conv2d($x$: $[N, C_x, H_x, W_x]$, $\delta$: $[C_y, C_z, H_y, W_y]$, $\tilde{b}$: $[C_z]$, $G = g$, $\ast$)</td>
<td>Conv2d($x$: $[N, B \times C_x, H_y, W_y]$, $\delta$: $[C_y, B \times C_z, H_y, W_y]$, $\tilde{b}$: $[B \times C_z]$, $G = B \times g$, $\ast$)</td>
</tr>
<tr>
<td>Linear($x$: $[N, F_x]$, $\delta$: $[F_y]$)</td>
<td>BatchNorm1d($x$: $[N, C_x]$, $\delta$: $[C_y]$)</td>
</tr>
<tr>
<td>BatchNorm1d($x$: $[N, C_x, L_x]$, $\delta$: $[C_y, L_y]$, $\tilde{b}$: $[C_y]$, $G = g$, $\ast$)</td>
<td>BatchNorm2d($x$: $[N, C_x, H_x, W_x]$, $\delta$: $[C_y, C_z, H_y, W_y]$, $\tilde{b}$: $[C_z]$, $G = g$, $\ast$)</td>
</tr>
<tr>
<td>LayerNorm($x$: $[N, D_1, \ldots, D_n, E_1, \ldots, E_n]$, $\delta$: $[E_1, \ldots, E_n]$, $\tilde{b}$: $[E_1, \ldots, E_n]$, $\epsilon$)</td>
<td>LayerNorm($x$: $[B, N, D_1, \ldots, D_n, E_1, \ldots, E_n]$, $\delta$: $[B \times E_1, \ldots, B \times E_n]$, $\tilde{b}$: $[B \times E_1, \ldots, B \times E_n]$, $\epsilon$)</td>
</tr>
<tr>
<td>MaxPool2d($x$: $[N, C_x, H_x, W_x]$, $\ast$)</td>
<td>MaxPool2d($x$: $[N, B \times C_x, H_y, W_y]$, $\ast$)</td>
</tr>
<tr>
<td>AdaptiveAvgPool2d($x$: $[N, C_x, H_x, W_x]$, $\ast$)</td>
<td>AdaptiveAvgPool2d($x$: $[N, B \times C_x, H_y, W_y]$, $\ast$)</td>
</tr>
<tr>
<td>Dropout2d($x$: $[N, C_x, H_x, W_x]$, $\ast$)</td>
<td>Dropout2d($x$: $[N, B \times C_x, H_y, W_y]$, $\ast$)</td>
</tr>
<tr>
<td>LeakyReLU($x$: $[\ast, \ast, \ast]$, $\epsilon$)</td>
<td>LeakyReLU($x$: $[\ast, B, \ast, \ast]$, $\epsilon$)</td>
</tr>
<tr>
<td>ReLU($x$: $[\ast, \ast, \ast]$)</td>
<td>ReLU($x$: $[\ast, B, \ast, \ast]$)</td>
</tr>
<tr>
<td>$\tanh(x)$: $[\ast, \ast, \ast]$</td>
<td>$\tanh(x)$: $[\ast, B, \ast, \ast]$</td>
</tr>
</tbody>
</table>

**Figure 11.** Training loss per iteration when training ResNet-18 on CIFAR-10. LR represents the learning rate. Serial represents training each model separately, and HFTA represents our method.

In these derivations, no assumption is made on the exact formula of $\ell_b$, which means such scaling rules are universal to any type of loss function, including regularization.
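The scaling rules for the fused loss can also be checked numerically. The sketch below uses a toy setup that is not part of HFTA: each of the $B$ models is a single scalar parameter with a squared-error loss, and gradients are estimated with central finite differences; all names and values are illustrative assumptions.

```python
def model_loss(theta, x, y):
    """Squared error of a one-parameter model."""
    return (theta * x - y) ** 2


def fused_loss(thetas, xs, ys, reduction):
    """Fused loss over all models with mean or sum reduction."""
    losses = [model_loss(t, x, y) for t, x, y in zip(thetas, xs, ys)]
    return sum(losses) / len(losses) if reduction == "mean" else sum(losses)


def grad(f, params, i, eps=1e-6):
    """Central finite-difference estimate of df/dparams[i]."""
    up, down = list(params), list(params)
    up[i] += eps
    down[i] -= eps
    return (f(up) - f(down)) / (2 * eps)


thetas, xs, ys = [0.5, -1.0, 2.0], [1.0, 2.0, 3.0], [2.0, 0.5, -1.0]
B = len(thetas)

independent, scaled_mean, plain_sum = [], [], []
for beta in range(B):
    # Gradient of model beta's own loss, as if it were trained alone.
    independent.append(
        grad(lambda p: model_loss(p[beta], xs[beta], ys[beta]), thetas, beta))
    # Mean-reduced fused loss, scaled by B.
    scaled_mean.append(
        grad(lambda p: B * fused_loss(p, xs, ys, "mean"), thetas, beta))
    # Sum-reduced fused loss, no scaling.
    plain_sum.append(
        grad(lambda p: fused_loss(p, xs, ys, "sum"), thetas, beta))

print(independent, scaled_mean, plain_sum)
```

Up to floating-point error, the three lists agree entry by entry, which matches the statement that the mean-reduced fused loss needs the factor $B$ while the sum-reduced one does not.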

### D CONVERGENCE

Even though HFTA reconstructs the mathematically equivalent gradients for each independently trained model, minor numerical differences can still exist since the order of computations in fused operators can be different from the original ones. To demonstrate empirically that such numerical differences do not affect the models’ original convergence, we train a well-known ResNet-18 (He et al., 2016) model on the CIFAR-10 (Krizhevsky, 2009) dataset with three different learning rates. Figure 11 shows the training-loss-per-iteration curves for training each model independently (solid curves) and collectively as a horizontally fused job via HFTA (dotted curves). Since the dotted curves overlap entirely with the solid ones, we conclude that HFTA-based training maintains the same convergence as independent model training.

### E HFHT DESIGN

Many hyper-parameter tuning algorithms and frameworks (e.g., Ray Tune (Liaw et al., 2018)) are built based on the paradigm shown in Algorithm 1: (i) an algorithm (e.g., Hyperband (Li et al., 2018)) proposes a batch of sets of hyper-parameters (e.g., $H$ in Algorithm 1) to a job scheduler that schedules and runs the evaluations (i.e., the training jobs) of these sets of hyper-parameters; (ii) the algorithm then updates (and hopefully improves) itself using the evaluation results (e.g., $R$ in Algorithm 1, such as validation losses or accuracy) gathered from the output artifacts after those jobs finish. This routine repeats until a certain terminating condition is met while tracking a globally optimal set of hyper-parameters (e.g., $\hat{h}$ in Algorithm 1).

---

8We provide the detailed methodology behind this and other experiments on our secondary benchmarks in Appendix H.

Therefore, to integrate HFTA with existing hyper-parameter tuning algorithms, we need to modify the interface between the algorithm and the job scheduler (e.g., line 12 in Algorithm 1). When the algorithm proposes a batch of hyper-parameter sets, we should partition these sets before passing the partitions to the scheduler, and ask the scheduler to run each partition as a fused job (e.g., `partition_and_fuse()` on line 8 in Algorithm 1). When the fused jobs finish, we then scatter/unfuse the results into their original orders before sending them back to the algorithm (e.g., `unfuse_and_reorder()` on line 8 in Algorithm 1).

The aforementioned hyper-parameter set partition can be conducted in different ways. In HFHT, we leverage the most straightforward approach. Given a bag of different hyper-parameters, we divide them into two categories: fusible and infusible. Being fusible means that different values of such hyper-parameters can be co-evaluated in a fused job (e.g., different learning rates or weight initializers), whereas being infusible means that different values of such hyper-parameters can lead to significant difference in the operator types and shapes (e.g., batch size) or change the model architecture completely (e.g., MobileNet V2 vs. V3-Large). Once the users specify which hyper-parameters are fusible and infusible, as we show in Figure 12, HFHT can utilize this information to partition arbitrary sets of hyper-parameters. Afterwards, each partition only has a single value for each infusible hyper-parameter.
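A minimal sketch of this partitioning follows, assuming that each hyper-parameter set is a dictionary and that the user supplies the names of the infusible hyper-parameters. The function names mirror those in Algorithm 1, but the signatures and data layout are assumptions and not HFHT's actual API.

```python
from collections import defaultdict


def partition_and_fuse(hparam_sets, infusible):
    """Group sets that agree on every infusible hyper-parameter.

    Returns a list of partitions; each partition is a list of
    (original_index, hparams) pairs that could run as one fused job.
    """
    partitions = defaultdict(list)
    for index, hparams in enumerate(hparam_sets):
        key = tuple(hparams[name] for name in infusible)
        partitions[key].append((index, hparams))
    return list(partitions.values())


def unfuse_and_reorder(partitions, fused_results):
    """Scatter per-partition results back into the original proposal order."""
    total = sum(len(p) for p in partitions)
    results = [None] * total
    for partition, partition_results in zip(partitions, fused_results):
        for (index, _), result in zip(partition, partition_results):
            results[index] = result
    return results


H = [
    {"lr": 1e-3, "batch_size": 32},
    {"lr": 1e-2, "batch_size": 64},
    {"lr": 3e-3, "batch_size": 32},
]
partitions = partition_and_fuse(H, infusible=["batch_size"])
# Stand-in for training: report each set's learning rate as its "result".
fused_results = [[h["lr"] for _, h in p] for p in partitions]
R = unfuse_and_reorder(partitions, fused_results)
print(R)
```

Keeping the original index next to each set is one simple way to restore the proposal order afterwards, so the tuning algorithm does not need to know that the jobs were fused.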

To integrate our hardware-based sharing baselines (i.e., concurrent, MPS, and MIG) with existing hyper-parameter tuning algorithms, since these hardware features treat each job as a separate process, there is no scheduling constraint on the values of the hyper-parameters. Therefore, we simply ask the scheduler to run these jobs as concurrently as the GPU allows.

By default, the tuning algorithms are only concerned with the impact of each hyper-parameter on the quality of the training results (e.g., accuracy). However, with the introduction of HFTA, each hyper-parameter is now additionally correlated with the hardware utilization of the underlying tuning jobs in terms of how well the fusion opportunities are exposed. Thus, for a specific iteration during tuning, if the algorithm either (1) only proposes a single set of hyper-parameters, or (2) proposes sets that do not share any common value for the infusible hyper-parameters, there would be few opportunities for HFTA to exploit. In other words, the exact design of the tuning algorithm itself can be either HFTA-“friendly” or “unfriendly”. For example, as we show in Section 5.4, random search is a more HFTA-“friendly” algorithm than Hyperband. Therefore, as one direction for our future work, we will investigate how to adjust existing tuning algorithms to extract the most benefits offered by HFTA.

### Algorithm 1 Integrating HFTA, concurrent, MPS and MIG with a General Hyper-parameter Tuning Routine

```plaintext
ĥ ← {The best set of hyper-parameters.}
r̂ ← {The best training result.}
H ← ∅  {A batch of arbitrary sets of hyper-parameters.}
R ← ∅  {Training results corresponding to H.}
while terminating condition do
    if HFTA enabled then
        R ← unfuse_and_reorder(partition_and_fuse(H))
    else if concurrent or MPS or MIG enabled then
        R ← scheduler.schedule_and_run_in_parallel(H)
    else
        R ← scheduler.schedule_and_run(H)
    end if
    r̂, ĥ ← select_best(R, H, (r̂, ĥ))
    algorithm.update(H, R)
end while
```
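The HFTA branch of Algorithm 1 can be sketched as a toy Python loop. Everything here is an assumption for illustration: `toy_train` is a made-up scoring function standing in for real training, `propose` is a stand-in for the tuning algorithm (a simple random proposer), and the bodies of `partition_and_fuse` and `unfuse_and_reorder` only mimic the data flow suggested by their names. The point of the sketch is that results computed per fused partition must be scattered back into the original order of H before the tuning algorithm consumes them.

```python
import random
from collections import defaultdict

INFUSIBLE = ("batch_size",)  # assumption: declared infusible by the user


def toy_train(hp):
    """Stand-in for real training; returns a made-up score (higher is better)."""
    return -abs(hp["lr"] - 0.003) - 0.0001 * hp["batch_size"]


def partition_and_fuse(H):
    """Group indices of H by infusible values; each group models one fused job."""
    groups = defaultdict(list)
    for i, hp in enumerate(H):
        groups[tuple(hp[k] for k in INFUSIBLE)].append(i)
    # A real fused job would train all members at once; here we just score them.
    return [(idxs, [toy_train(H[i]) for i in idxs]) for idxs in groups.values()]


def unfuse_and_reorder(fused_results):
    """Scatter per-model results back into the original order of H."""
    n = sum(len(idxs) for idxs, _ in fused_results)
    R = [None] * n
    for idxs, results in fused_results:
        for i, r in zip(idxs, results):
            R[i] = r
    return R


def propose(rng, k):
    """Hypothetical random proposer standing in for the tuning algorithm."""
    return [
        {"lr": rng.choice([0.001, 0.003, 0.01]),
         "batch_size": rng.choice([8, 16, 32])}
        for _ in range(k)
    ]


def tune(iterations=3, batch=6, seed=0):
    rng = random.Random(seed)
    best_r, best_h = float("-inf"), None
    for _ in range(iterations):
        H = propose(rng, batch)
        R = unfuse_and_reorder(partition_and_fuse(H))
        for r, h in zip(R, H):  # plays the role of select_best
            if r > best_r:
                best_r, best_h = r, h
    return best_r, best_h


best_r, best_h = tune()
print(best_h, best_r)
```

Because `R[i]` always corresponds to `H[i]` after reordering, the tuning algorithm does not need to know whether fusion took place.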

### Table 7. DCGM Metric Names, Field Identifier Macros and IDs

<table>
<thead>
<tr>
<th>Name</th>
<th>Field Identifier Macro</th>
<th>ID</th>
</tr>
</thead>
<tbody>
<tr>
<td>sm_active</td>
<td>DCGM_FI_PROF_SM_ACTIVE</td>
<td>1002</td>
</tr>
<tr>
<td>sm_occupancy</td>
<td>DCGM_FI_PROF_SM_OCCUPANCY</td>
<td>1003</td>
</tr>
<tr>
<td>tensor_active</td>
<td>DCGM_FI_PROF_PIPE_TENSOR_ACTIVE</td>
<td>1004</td>
</tr>
<tr>
<td>GPU Utilization</td>
<td>DCGM_FI_DEV_GPU_UTIL</td>
<td>203</td>
</tr>
</tbody>
</table>

The sm_active, sm_occupancy and tensor_active performance counters are measured through DCGM (Kukanur, 2016). Their field identifier macros and IDs are listed in Table 7. Please refer to the DCGM Library API Reference Manual (NVIDIA, 2020d) for their precise definitions.

Table 8. The peak training throughput speedups of HFTA over the baselines.

<table>
<thead>
<tr>
<th>Benchmark</th>
<th>PointNet</th>
<th>PointNet</th>
<th>DCGAN</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Cls.</td>
<td>Seg.</td>
<td></td>
</tr>
<tr>
<td>FP32 V100</td>
<td>serial</td>
<td>2.62</td>
<td>1.62</td>
</tr>
<tr>
<td></td>
<td>concurrent</td>
<td>2.54</td>
<td>1.62</td>
</tr>
<tr>
<td></td>
<td>serial</td>
<td>2.36</td>
<td>1.17</td>
</tr>
<tr>
<td>AMP</td>
<td>serial</td>
<td>5.02</td>
<td>4.29</td>
</tr>
<tr>
<td></td>
<td>concurrent</td>
<td>5.02</td>
<td>4.24</td>
</tr>
<tr>
<td></td>
<td>serial</td>
<td>4.50</td>
<td>3.03</td>
</tr>
<tr>
<td>RTX 6000</td>
<td>serial</td>
<td>2.46</td>
<td>1.97</td>
</tr>
<tr>
<td></td>
<td>concurrent</td>
<td>2.46</td>
<td>1.95</td>
</tr>
<tr>
<td></td>
<td>serial</td>
<td>2.07</td>
<td>1.22</td>
</tr>
<tr>
<td>AMP</td>
<td>serial</td>
<td>4.36</td>
<td>3.63</td>
</tr>
<tr>
<td></td>
<td>concurrent</td>
<td>4.26</td>
<td>3.54</td>
</tr>
<tr>
<td></td>
<td>serial</td>
<td>3.79</td>
<td>2.54</td>
</tr>
<tr>
<td>A100</td>
<td>serial</td>
<td>5.47</td>
<td>4.56</td>
</tr>
<tr>
<td></td>
<td>concurrent</td>
<td>5.47</td>
<td>4.56</td>
</tr>
<tr>
<td></td>
<td>serial</td>
<td>2.05</td>
<td>1.31</td>
</tr>
<tr>
<td></td>
<td>concurrent</td>
<td>2.00</td>
<td>1.35</td>
</tr>
<tr>
<td></td>
<td>serial</td>
<td>12.98</td>
<td>9.48</td>
</tr>
<tr>
<td></td>
<td>concurrent</td>
<td>11.30</td>
<td>9.48</td>
</tr>
<tr>
<td></td>
<td>serial</td>
<td>4.72</td>
<td>2.93</td>
</tr>
<tr>
<td></td>
<td>concurrent</td>
<td>4.88</td>
<td>3.02</td>
</tr>
</tbody>
</table>

Table 9. The maximum training throughput speedups of HFTA over the baselines given the same number of models sharing one GPU.

<table>
<thead>
<tr>
<th>Benchmark</th>
<th>PointNet</th>
<th>PointNet</th>
<th>DCGAN</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Cls.</td>
<td>Seg.</td>
<td></td>
</tr>
<tr>
<td>FP32 V100</td>
<td>serial</td>
<td>1.77</td>
<td>1.62</td>
</tr>
<tr>
<td></td>
<td>concurrent</td>
<td>1.65</td>
<td>1.17</td>
</tr>
<tr>
<td></td>
<td>serial</td>
<td>3.41</td>
<td>3.12</td>
</tr>
<tr>
<td>AMP</td>
<td>serial</td>
<td>3.05</td>
<td>2.23</td>
</tr>
<tr>
<td></td>
<td>concurrent</td>
<td>2.33</td>
<td>1.99</td>
</tr>
<tr>
<td>RTX 6000</td>
<td>serial</td>
<td>1.92</td>
<td>1.22</td>
</tr>
<tr>
<td></td>
<td>concurrent</td>
<td>4.14</td>
<td>2.21</td>
</tr>
<tr>
<td></td>
<td>serial</td>
<td>4.05</td>
<td>3.95</td>
</tr>
<tr>
<td>A100</td>
<td>serial</td>
<td>1.64</td>
<td>1.04</td>
</tr>
<tr>
<td></td>
<td>concurrent</td>
<td>1.51</td>
<td>1.07</td>
</tr>
<tr>
<td></td>
<td>serial</td>
<td>9.18</td>
<td>7.86</td>
</tr>
<tr>
<td></td>
<td>concurrent</td>
<td>9.18</td>
<td>7.86</td>
</tr>
<tr>
<td></td>
<td>serial</td>
<td>3.18</td>
<td>2.13</td>
</tr>
<tr>
<td></td>
<td>concurrent</td>
<td>3.18</td>
<td>2.13</td>
</tr>
</tbody>
</table>

Table 10. The maximum speedups of AMP training over FP32.

<table>
<thead>
<tr>
<th>Benchmark</th>
<th>PointNet Classification</th>
<th>PointNet Segmentation</th>
<th>DCGAN</th>
</tr>
</thead>
<tbody>
<tr>
<td>V100</td>
<td>serial</td>
<td>1.00</td>
<td>1.00</td>
</tr>
<tr>
<td></td>
<td>concurrent</td>
<td>0.97</td>
<td>1.01</td>
</tr>
<tr>
<td></td>
<td>MPS</td>
<td>1.01</td>
<td>1.03</td>
</tr>
<tr>
<td></td>
<td>HFTA</td>
<td>1.92</td>
<td>2.65</td>
</tr>
<tr>
<td>RTX 6000</td>
<td>serial</td>
<td>1.06</td>
<td>1.19</td>
</tr>
<tr>
<td></td>
<td>concurrent</td>
<td>1.09</td>
<td>1.22</td>
</tr>
<tr>
<td></td>
<td>MPS</td>
<td>1.03</td>
<td>1.05</td>
</tr>
<tr>
<td></td>
<td>HFTA</td>
<td>1.88</td>
<td>2.20</td>
</tr>
<tr>
<td>A100</td>
<td>serial</td>
<td>1.13</td>
<td>1.13</td>
</tr>
<tr>
<td></td>
<td>concurrent</td>
<td>1.00</td>
<td>1.05</td>
</tr>
<tr>
<td></td>
<td>MPS</td>
<td>1.03</td>
<td>1.06</td>
</tr>
<tr>
<td></td>
<td>MIG</td>
<td>1.02</td>
<td>1.05</td>
</tr>
<tr>
<td></td>
<td>HFTA</td>
<td>2.37</td>
<td>2.36</td>
</tr>
</tbody>
</table>

G Additional Evaluation Statistics

To facilitate the reading of the results from our GPU experiments in Figure 4, we summarize the comparison from different angles between HFTA and the baselines into three tables.

Table 8 shows the peak training throughput comparison between HFTA and the baselines. It is important to highlight that, for both MPS and concurrent, the training throughput could decrease as we increase the number of models sharing the same GPU (due to host resource contention). Therefore, the “peak” is determined by the highest possible throughput instead of the largest number of models that the GPU can fit (which might or might not lead to the highest throughput). Unlike Table 5, the results here are split between FP32 and AMP to demonstrate how well HFTA performs for each training type.

Table 9 shows the maximum training throughput speedups of HFTA over the baselines, given the same number of models sharing the same GPU. The maximum is picked by varying the number of models sharing the same GPU and finding the largest performance gap between HFTA and the baselines. This helps to isolate the benefits of better SM and TC utilization from the benefits of better memory utilization when training via HFTA.
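The difference between the two metrics can be made concrete with a short sketch. The helper names and the throughput numbers below are hypothetical and do not come from our measurements; the sketch only reflects our reading of the two definitions. The Table 8 metric compares the best throughput of each approach regardless of how many models achieve it, whereas the Table 9 metric compares the two approaches only at matching model counts.

```python
def peak_speedup(hfta, baseline):
    """Table 8 style: highest HFTA throughput over highest baseline throughput."""
    return max(hfta.values()) / max(baseline.values())


def max_matched_speedup(hfta, baseline):
    """Table 9 style: largest ratio at the same number of models per GPU."""
    common = sorted(set(hfta) & set(baseline))
    return max(hfta[n] / baseline[n] for n in common)


# Hypothetical throughput (samples/s), keyed by the number of models per GPU.
# HFTA fits more models, so its curve extends further than the baseline's.
hfta = {1: 100.0, 2: 190.0, 4: 360.0, 8: 600.0}
concurrent = {1: 100.0, 2: 150.0, 4: 200.0}

print(peak_speedup(hfta, concurrent))         # 600 / 200
print(max_matched_speedup(hfta, concurrent))  # 360 / 200, at 4 models
```

In this toy example, the gap between the two numbers comes from HFTA fitting eight models where the baseline fits only four, which is the memory-utilization effect that the Table 9 metric factors out.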

Table 10 shows the maximum training throughput speedups of AMP over FP32 for both HFTA and the baselines. The maximum here is also picked by varying the number of models (except for serial, which always only runs one model per GPU) and finding the largest performance gap between FP32 and AMP. This helps to demonstrate that HFTA is more efficient in utilizing advanced hardware compute units such as TCs.

Similar to Figure 7a, 7b and 7c, Figure 13 plots the nvidia-smi-defined “GPU utilization” (NVIDIA, 2016) for the PointNet classification task training on the A100 GPU. Contrary to popular belief (Elangovan, 2020; fastai, 2020), we observe that the nvidia-smi-defined “GPU utilization” can sometimes be a weak utilization indicator, since the curves in Figure 13 appear rather noisy and do not follow the trends of throughput improvements in Figure 4g or any hardware counters’ trend in Figure 7a, 7b or 7c.

Similar to Figure 7, Figure 14 plots the sm_active, sm_occupancy, and tensor_active of HFTA and the baselines as we increase the number of models sharing the same V100 GPU. In addition to the observations we already present in Section 5.3, we also observe that the serial baselines’ hardware utilization is lower on A100 than on V100. Therefore, Figure 14 provides empirical evidence to support our argument in Section 2.1 and Section 5.1 that newer GPU generations suffer more significantly from the hardware under-utilization of repetitive single-accelerator training workloads.

H Additional Methodology Details

H.1 Secondary Benchmarks

To evaluate HFTA’s general effectiveness in improving hardware utilization for conventional models, we further include the following models and tasks as our secondary benchmarks.

ResNet (He et al., 2016) and MobileNet (Howard et al., 2017; Sandler et al., 2018; Howard et al., 2019) are two classes of convolutional neural networks that are generally used (or as backbones) in computer vision (CV) tasks. Both classes contain many variants of the models. As our benchmarks, we train ResNet-18 and MobileNetV3-Large (Howard et al., 2019) to perform image classification tasks, though HFTA can be applied on the other variants and tasks as well. We leverage an implementation of ResNet-18 from PyTorch’s official examples (PyTorch, 2020), and an open-sourced PyTorch re-implementation (Wang, 2019) of MobileNetV3-Large. Both models are trained on the CIFAR-10 (Krizhevsky, 2009) dataset. To evaluate HFTA with different training configurations, ResNet-18 is trained using the Adadelta optimizer with a batch size of 128, whereas MobileNetV3-Large is trained using the Adam (Kingma & Ba, 2015) optimizer with a batch size of 1024.

Transformer (Vaswani et al., 2017) and BERT (Devlin et al., 2019; Turc et al., 2019) are two classes of attention-based language models that are generally used in natural language processing (NLP) tasks. For Transformer, we leverage an implementation from PyTorch’s official examples (PyTorch, 2020) and configure our variant to have 2 encoder layers with 2 attention heads and a hidden size of 128 (similar to BERT-Tiny (Turc et al., 2019) in parameter size). We train our Transformer variant for the language modeling (LM) task. For BERT, we leverage an open-sourced PyTorch re-implementation (Kim, 2018) and select the BERT-Medium (Turc et al., 2019) variant to perform the masked LM task. Both models are trained on the WikiText-2 (Merity et al., 2017) dataset using the Adadelta optimizer with a batch size and sequence length of 32.

We perform the experiments of all four secondary benchmarks on two ML accelerators: the V100 GPU and TPU v3. We list the detailed specifications of our experimental setup in Table 4. We provide the corresponding evaluation results in Appendix I.

H.2 Convergence

To empirically verify that HFTA does not affect convergence, we train ResNet-18 on V100 with the CIFAR-10 dataset using the Adadelta optimizer with a batch size of 1000 and three different learning rates (0.0005, 0.001, 0.002). The corresponding validation results are provided in Appendix D.

H.3 HFHT

To demonstrate that HFTA can significantly improve the hardware utilization for existing hyper-parameter tuning algorithms via HFHT, we construct the following four end-to-end hyper-parameter tuning workloads: for each of the PointNet and MobileNet classification tasks, we tune eight different hyper-parameters using each of the Hyperband (Li et al., 2018) and random search (Bergstra & Bengio, 2012) algorithms. We leverage an open-sourced implementation of the Hyperband algorithm (Zajac, 2017) and adapt it to HFHT, and we implement the random search algorithm from scratch. We evaluate HFHT across the different algorithm settings listed in Table 11, and we list the hyper-parameters that we tune for each task in Table 12. We provide the corresponding evaluation results in Section 5.4.

H.4 Partial Fusion

Even when there are slight differences in the architectures or the operator types/shapes among the models in repet-

Figure 14. The hardware performance counters for PointNet classification task as we increase the number of models sharing the same V100.

Table 12. The list of hyper-parameters that are tuned for the PointNet and MobileNet classification tasks. “[ ]” represents a continuous closed interval, whereas “{}” represents a discrete set.

<table>
<thead>
<tr>
<th>Hyper-parameter</th>
<th>Fusible</th>
<th>Range</th>
<th>Task(s)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Learning Rate</td>
<td>Yes</td>
<td>[0.0001, 0.01]</td>
<td>Both</td>
</tr>
<tr>
<td>Adam’s $\beta_1$</td>
<td>Yes</td>
<td>[0.001, 0.999]</td>
<td>Both</td>
</tr>
<tr>
<td>Adam’s $\beta_2$</td>
<td>Yes</td>
<td>[0.001, 0.999]</td>
<td>Both</td>
</tr>
<tr>
<td>Weight Decay</td>
<td>Yes</td>
<td>[0.0, 0.5]</td>
<td>Both</td>
</tr>
<tr>
<td>Factor of Learning Rate Decay</td>
<td>Yes</td>
<td>[0.1, 0.9]</td>
<td>Both</td>
</tr>
<tr>
<td>Period of Learning Rate Decay</td>
<td>Yes</td>
<td>{5, 10, 20, 40}</td>
<td>Both</td>
</tr>
<tr>
<td>Batch Size</td>
<td>No</td>
<td>PointNet: {8, 16, 32}; MobileNet: {1024, 2048}</td>
<td>Both</td>
</tr>
<tr>
<td>Feature Transformation</td>
<td>No</td>
<td>{True, False}</td>
<td>PointNet</td>
</tr>
<tr>
<td>Version</td>
<td>No</td>
<td>{V2, V3-Large}</td>
<td>MobileNet</td>
</tr>
</tbody>
</table>

To conduct such a study, we add partially fused optimizers and learning rate schedulers to HFTA, and we extend the ResNet-18 benchmark such that we can individually configure whether each of the 8 basic blocks, the first convolutional block and the last linear block is horizontally fused. We then fix the number of jobs that share the same accelerator (V100) as a constant (30), and measure the per-device training throughput as we incrementally turn off the horizontal fusion of each block. We provide the corresponding evaluation results in Appendix I.
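The sweep over fusion configurations can be sketched as follows. The block names and the front-to-back order in which fusion is turned off are assumptions made for illustration; the sketch only shows how ten individually configurable blocks give rise to eleven configurations, from fully fused to fully unfused.

```python
# Hypothetical block names: the first convolutional block, the 8 basic
# blocks, and the last linear block of ResNet-18.
BLOCKS = ["conv1"] + [f"basic_block_{i}" for i in range(1, 9)] + ["fc"]


def fusion_sweep(blocks):
    """Return configs that turn fusion off for one more block at each step.

    Each config maps a block name to True (horizontally fused) or False.
    The order of turning blocks off (front to back) is an assumption.
    """
    configs = []
    for n_unfused in range(len(blocks) + 1):
        configs.append({b: i >= n_unfused for i, b in enumerate(blocks)})
    return configs


configs = fusion_sweep(BLOCKS)
for cfg in configs:
    fused = sum(cfg.values())
    print(f"{fused:2d} of {len(BLOCKS)} blocks fused")
```

Each configuration would then be run with the same fixed number of jobs on the accelerator, so that any throughput difference is attributable to the degree of fusion alone.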

I Evaluation: Secondary Benchmarks

Similar to Section 5.1 and Section 5.2, to compare HFTA’s end-to-end training performance with our baselines (i.e., serial, concurrent, and MPS) for our secondary benchmarks (i.e., ResNet-18, MobileNetV3-Large, Transformer, and BERT-Medium), Figure 15 and Figure 16 plot the per-device normalized training throughput on the V100 GPU and the TPU v3 respectively. We normalize the throughput for each experiment by the serial (FP32 for V100) baseline. We show both FP32 and AMP training results for the experiments on V100. Each curve grows as we increase the number of models that share the same device. Each curve “stops” when it reaches the maximum number of models before an out-of-memory error occurs, except that the curve for ResNet-18 on TPU stops when the throughput starts to degrade.

These figures are consistent with the trends in our main observations from Section 5.1 and Section 5.2. Across these secondary benchmarks, on V100, HFTA achieves $2.42 \times$ to $3.94 \times$ higher peak training throughput than serial, $1.67 \times$ to $3.02 \times$ higher than concurrent and $1.25 \times$ to $2.24 \times$ higher than MPS; on TPU v3, HFTA achieves $2.98 \times$ to $6.43 \times$ higher peak training throughput than serial. Therefore, we conclude that HFTA can also be effective in improving the hardware utilization for conventional models during repetitive single-accelerator training.

To evaluate the partial fusion’s performance sensitivity, Figure 17 plots the normalized training throughput when 30 ResNet-18 models share the same V100 GPU as we incrementally turn off the horizontal fusion for each block. We make two observations from this figure. First, a higher degree of fusion does lead to better performance. Thus, even if the models among repetitive training jobs cannot be fully fused, every little bit of fusion can still be helpful. Second, the fusion of different blocks (thus, different operators) contributes differently to the overall performance improvement. This is because some operators within a model suffer more from hardware under-utilization than others.

10 When we check the TPU usage monitoring dashboard, we observe that the TPU memory usage surpasses its per-core memory capacity when we fuse too many models. Therefore, we hypothesize that the performance degradation could be due to certain memory optimizations (e.g., swapping or rematerialization) employed by the TPU’s memory management system or the XLA compiler.

Figure 15. The normalized training throughput as we increase the number of models sharing the same V100 GPU.

Figure 16. The normalized training throughput as we increase the number of models sharing the same TPU v3 core.

Figure 17. ResNet-18’s normalized training throughput as we incrementally turn off the horizontal fusion for each block.