Georgios Damaskinos, El-Mahdi El-Mhamdi, Rachid Guerraoui, Arsany Guirguis, Sébastien Rouault
We present AGGREGATHOR, a framework that implements state-of-the-art robust (Byzantine-resilient) distributed stochastic gradient descent. Following the standard parameter server model, we assume that a minority of worker machines can be controlled by an adversary and behave arbitrarily. Such a setting has been theoretically studied with several of the existing approaches using a robust aggregation of the workers’ gradient estimations. Yet, the question is whether a Byzantine-resilient aggregation can leverage more workers to speed-up learning. We answer this theoretical question, and implement these state-of-the-art theoretical approaches on AGGREGATHOR, to assess their practical costs. We built AGGREGATHOR around TensorFlow and introduce modifications for vanilla TensorFlow towards making it usable in an actual Byzantine setting. AGGREGATHOR also permits the use of unreliable gradient transfer over UDP to provide further speed-up (without losing the accuracy) over the native communication protocols (TCP-based) of TensorFlow in saturated networks. We quantify the overhead of Byzantine resilience of AGGREGATHOR to 19% and 43% (to ensure weak and strong Byzantine resilience respectively) compared to vanilla TensorFlow.