Part of Proceedings of Machine Learning and Systems 4 (MLSys 2022)
Zichang Liu, Zhaozhuo Xu, Alan Ji, Junyan Zhang, Jonathan Li, Beidi Chen, Anshumali Shrivastava
Efficient inference in large output spaces is an essential yet challenging task in large-scale machine learning. Previous approaches reduce this problem to Approximate Maximum Inner Product Search (AMIPS), based on the observation that a model's prediction corresponds to the logit with the largest value. However, models are not perfectly accurate, and successfully retrieving the largest logit does not guarantee a correct prediction. We argue that AMIPS approaches are sub-optimal because they are tailored to retrieving the class with the largest inner product rather than the correct class. Moreover, the logits generated by neural networks with large output spaces make it harder for AMIPS methods to achieve a high recall rate within the computation budget of efficient inference. In this paper, we propose HALOS, which reduces inference to sub-linear computation by selectively activating a small set of output-layer neurons that are likely to correspond to the correct classes rather than to yield the largest logit. Our extensive evaluations show that HALOS matches or even outperforms the accuracy of the given models with a 21x speedup and an 87% energy reduction.
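To make the reduction concrete, the following is a minimal NumPy sketch, not the HALOS implementation. It assumes a linear output layer with weight matrix `W`, bias `b`, and input embedding `h`; the function names and the way the candidate set is chosen are hypothetical and used only for illustration. Dense inference scores every class and takes the argmax, which is exactly a maximum inner product search over the rows of `W`. Selective inference scores only a candidate subset of output neurons, so its cost grows with the number of candidates rather than the number of classes.

```python
import numpy as np


def full_prediction(W, b, h):
    """Dense inference: score every class, then take the largest logit."""
    logits = W @ h + b  # cost is O(num_classes * dim)
    return int(np.argmax(logits))


def subset_prediction(W, b, h, candidates):
    """Selective inference: score only the candidate output neurons."""
    candidates = np.asarray(candidates)
    logits = W[candidates] @ h + b[candidates]  # cost is O(len(candidates) * dim)
    return int(candidates[np.argmax(logits)])


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
num_classes, dim = 10_000, 64
W = rng.standard_normal((num_classes, dim))
b = np.zeros(num_classes)
h = rng.standard_normal(dim)

top = full_prediction(W, b, h)

# Hypothetical candidate set: 100 random classes plus the top-logit class.
# A real system would obtain candidates from a learned or hashed index.
candidates = np.unique(
    np.append(rng.choice(num_classes, size=100, replace=False), top)
)
assert subset_prediction(W, b, h, candidates) == top
```

The subset prediction agrees with the dense prediction whenever the top-logit class is among the candidates. The abstract's point is that this is the wrong target when the model itself is wrong: a candidate set that contains the correct class but omits the top-logit class can yield a correct prediction where dense inference does not.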