Part of Proceedings of Machine Learning and Systems 4 (MLSys 2022)
Ankur Mallick, Kevin Hsieh, Behnaz Arzani, Gauri Joshi
Today's data centers rely increasingly on machine learning (ML) in their deployed systems. However, these systems are vulnerable to the data drift problem, that is, a mismatch between training and test data, which can lead to significant performance degradation and system inefficiencies. In this paper, we demonstrate the impact of data drift in production by studying two real-world deployments in a leading cloud provider. Our study shows that, despite frequent model retraining, these deployed models experience major accuracy drops (up to 40%) and high accuracy variation, which lead to a drastic increase in operational costs. None of the current solutions to the data drift problem are designed for large-scale deployments, which must address real-world issues such as scale, ground truth latency, and mixed types of data drift. We propose Matchmaker, the first scalable, adaptive, and flexible solution to the data drift problem in large-scale production systems. Matchmaker finds the training data batch most similar to each test point and uses the ML model trained on that batch for inference. As part of Matchmaker, we introduce a novel similarity metric that addresses multiple types of data drift while incurring only limited overhead. Experiments on our two real-world ML deployments show that Matchmaker significantly improves model accuracy (up to 14% and 2%), which saves 18% and 1% in operational costs. At the same time, Matchmaker provides 8x and 4x faster predictions than a state-of-the-art ML data drift solution, AUE.
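The following is a minimal sketch (not the paper's implementation) of the per-point batch-matching idea described above: one model is kept per training data batch, and each test point is routed to the model whose batch appears most similar. The similarity used here, Euclidean distance to each batch's feature centroid, is a hypothetical stand-in for Matchmaker's actual similarity metric, and the class and parameter names are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


class BatchMatchedEnsemble:
    """Sketch: route each test point to the model of its most similar batch."""

    def __init__(self):
        self.models = []     # one model per training data batch
        self.centroids = []  # per-batch feature centroid (stand-in similarity)

    def add_batch(self, X, y):
        """Train a model on a new batch and remember the batch's centroid."""
        model = RandomForestClassifier(n_estimators=50).fit(X, y)
        self.models.append(model)
        self.centroids.append(X.mean(axis=0))

    def predict(self, X):
        """For each test point, predict with the model of the closest batch."""
        centroids = np.stack(self.centroids)                    # (n_batches, n_features)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        best = dists.argmin(axis=1)                             # closest batch per point
        preds = np.empty(len(X), dtype=self.models[0].classes_.dtype)
        for b in np.unique(best):
            idx = np.where(best == b)[0]
            preds[idx] = self.models[b].predict(X[idx])
        return preds
```

In this sketch, retraining on a new batch only appends a model rather than replacing the old ones, which mirrors the abstract's point that per-point matching, rather than always using the latest model, is what mitigates mixed types of data drift.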