Attention-based Learning for Missing Data Imputation in HoloClean

Part of Proceedings of Machine Learning and Systems 2 (MLSys 2020)

Authors

Richard Wu, Aoqian Zhang, Ihab Ilyas, Theodoros Rekatsinas

Abstract

We study the problem of missing data imputation, a data validation task that machine learning researchers and practitioners confront regularly. We focus on mixed (discrete and continuous) data and introduce AimNet, an attention-based learning network for missing data imputation. AimNet uses a variation of the dot product attention mechanism to learn interpretable, structural properties of the mixed data distribution and relies on the learned structure to perform imputation. We perform an extensive experimental study over 14 real-world data sets to understand the role of attention and structure in data imputation. We find that the simple attention-based architecture of AimNet outperforms state-of-the-art baselines, such as ensemble tree models and deep learning architectures (e.g., generative adversarial networks), by up to 43% in accuracy on discrete values and up to 26.7% in normalized RMS error on continuous values. A key finding of our study is that, by learning the structure of the underlying distribution, the attention mechanism generalizes better on systematically missing data, where imputation requires reasoning about functional relationships between attributes.
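
To make the dot product attention idea concrete, the following is a minimal sketch of generic dot-product attention over per-attribute vectors. It is not AimNet's architecture, which the abstract describes only as "a variation" of this mechanism. The attribute names (`zip`, `state`, `age`), the query, and all key and value vectors are hypothetical values chosen for illustration; in a trained model they would be learned.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Generic dot-product attention.

    Scores each key by its dot product with the query, normalizes the
    scores with softmax, and returns the weights together with the
    weighted sum of the value vectors.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context

# Hypothetical setup: the target attribute is missing and three other
# attributes are observed. All vectors below are made up for illustration.
query = [1.0, 0.0]  # assumed query vector for the missing target attribute
keys = {"zip": [4.0, 0.0], "state": [2.0, 0.0], "age": [0.0, 1.0]}
values = {"zip": [1.0, 0.0], "state": [0.5, 0.5], "age": [0.0, 1.0]}

names = list(keys)
weights, context = attend(
    query,
    [keys[n] for n in names],
    [values[n] for n in names],
)

# The weights indicate how much each observed attribute contributes to the
# context vector that a downstream predictor would use for imputation.
for name, w in zip(names, weights):
    print(f"{name}: {w:.3f}")
```

In this toy setup the attention weights sum to one and rank the observed attributes by how strongly their keys align with the query, which is the sense in which such weights can be read as a rough indication of which attributes the imputation relies on.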