Hotline Profiler: Automatic Annotation and A Multi-Scale Timeline for Visualizing Time-Use in DNN Training

Part of Proceedings of Machine Learning and Systems 5 (MLSys 2023)

Authors

Daniel Snider, Fanny Chevalier, Gennady Pekhimenko

Abstract

Profiling is a standard practice used to investigate the efficiency of software and hardware operation at runtime and is a crucial part of proving new concepts, debugging problems, and optimizing performance. However, most machine learning (ML) developers find profiling secondary to their goal of improving model accuracy or just too difficult (especially with existing ML tools). As a result, profiling is frequently an afterthought, and so many ML developers rely on opaque metrics such as iteration time and GPU utilization, which give little insight into why ML training may be slow. This leads developers to spend excessive time investigating performance issues. In this work, we aim to provide better tools to the large group of ML developers who currently do not profile their deep neural network (DNN) training workloads or are not happy with existing tools. To help ML developers investigate and understand time-use in DNN training, we propose Hotline, a novel profiler designed specifically for runtime bottleneck identification. Hotline is the first profiler to automatically annotate a standard data format for program runtime traces with DNN concepts that most ML developers are familiar with, i.e., the DNN training loop and model architecture. Hotline does so without modifying DNN libraries or making use of vendor-specific tools and introduces no additional overhead on measurements. We further introduce noise reduction techniques and a multi-scale timeline visualization to make the presentation of DNN runtime data more insightful, familiar, and easy to navigate. We demonstrate Hotline’s utility through in-depth case studies of finding bottlenecks in real-world DNN applications, and we report on a user study with 17 software developers in which most participants were able to perform common performance investigation tasks in under 30 seconds (avg = 26 sec) and further commented that Hotline’s visualization “takes less time to find insights compared to existing approaches”.
Source code: https://github.com/UofT-EcoSystem/hotline