{"title": "Federated Optimization in Heterogeneous Networks", "book": "Proceedings of Machine Learning and Systems", "page_first": 429, "page_last": 450, "abstract": "Federated Learning is a distributed learning paradigm with two key challenges that differentiate it from traditional distributed optimization: (1) significant variability in terms of the systems characteristics on each device in the network (systems heterogeneity), and (2) non-identically distributed data across the network (statistical heterogeneity). In this work, we introduce a framework, FedProx, to tackle heterogeneity in federated networks. FedProx can be viewed as a generalization and re-parametrization of FedAvg, the current state-of-the-art method for federated learning. While this re-parameterization makes only minor modifications to the method itself, these modifications have important ramifications both in theory and in practice. Theoretically, we provide convergence guarantees for our framework when learning over data from non-identical distributions (statistical heterogeneity), and while adhering to device-level systems constraints by allowing each participating device to perform a variable amount of work (systems heterogeneity). Practically, we demonstrate that FedProx allows for more robust convergence than FedAvg across a suite of realistic federated datasets. 
In particular, in highly heterogeneous settings, FedProx demonstrates significantly more stable and accurate convergence behavior relative to FedAvg—improving absolute test accuracy by 22% on average.", "full_text": "FEDERATED OPTIMIZATION IN HETEROGENEOUS NETWORKS

Tian Li 1, Anit Kumar Sahu 2, Manzil Zaheer 3, Maziar Sanjabi 4, Ameet Talwalkar 1,5, Virginia Smith 1

ABSTRACT

Federated Learning is a distributed learning paradigm with two key challenges that differentiate it from traditional distributed optimization: (1) significant variability in terms of the systems characteristics on each device in the network (systems heterogeneity), and (2) non-identically distributed data across the network (statistical heterogeneity). In this work, we introduce a framework, FedProx, to tackle heterogeneity in federated networks. FedProx can be viewed as a generalization and re-parametrization of FedAvg, the current state-of-the-art method for federated learning. While this re-parameterization makes only minor modifications to the method itself, these modifications have important ramifications both in theory and in practice. Theoretically, we provide convergence guarantees for our framework when learning over data from non-identical distributions (statistical heterogeneity), and while adhering to device-level systems constraints by allowing each participating device to perform a variable amount of work (systems heterogeneity). Practically, we demonstrate that FedProx allows for more robust convergence than FedAvg across a suite of realistic federated datasets.
In particular, in highly heterogeneous settings, FedProx demonstrates significantly more stable and accurate convergence behavior relative to FedAvg—improving absolute test accuracy by 22% on average.

1 INTRODUCTION

Federated learning has emerged as an attractive paradigm for distributing training of machine learning models in networks of remote devices. While there is a wealth of work on distributed optimization in the context of machine learning, two key challenges distinguish federated learning from traditional distributed optimization: high degrees of systems and statistical heterogeneity [1] (McMahan et al., 2017; Li et al., 2019).

[1] Privacy is a third key challenge in the federated setting. While not the focus of this work, standard privacy-preserving approaches such as differential privacy and secure multiparty communication can naturally be combined with the methods proposed herein—particularly since our framework proposes only lightweight algorithmic modifications to prior work.

1 Carnegie Mellon University, 2 Bosch Center for Artificial Intelligence, 3 Google Research, 4 Facebook AI, 5 Determined AI. Correspondence to: Tian Li.

Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. Copyright 2020 by the author(s).

In an attempt to handle heterogeneity and tackle high communication costs, optimization methods that allow for local updating and low participation are a popular approach for federated learning (McMahan et al., 2017; Smith et al., 2017). In particular, FedAvg (McMahan et al., 2017) is an iterative method that has emerged as the de facto optimization method in the federated setting. At each iteration, FedAvg first locally performs E epochs of stochastic gradient descent (SGD) on K devices—where E is a small constant and K is a small fraction of the total devices in the network. The devices then communicate their model updates to a central server, where they are averaged.

While FedAvg has demonstrated empirical success in heterogeneous settings, it does not fully address the underlying challenges associated with heterogeneity. In the context of systems heterogeneity, FedAvg does not allow participating devices to perform variable amounts of local work based on their underlying systems constraints; instead it is common to simply drop devices that fail to compute E epochs within a specified time window (Bonawitz et al., 2019). From a statistical perspective, FedAvg has been shown to diverge empirically in settings where the data is non-identically distributed across devices (e.g., McMahan et al., 2017, Sec 3). Unfortunately, FedAvg is difficult to analyze theoretically in such realistic scenarios and thus lacks convergence guarantees to characterize its behavior (see Section 2 for additional details).

In this work, we propose FedProx, a federated optimization algorithm that addresses the challenges of heterogeneity both theoretically and empirically. A key insight we have in developing FedProx is that an interplay exists between systems and statistical heterogeneity in federated learning. Indeed, both dropping stragglers (as in FedAvg) or naively incorporating partial information from stragglers (as in FedProx with the proximal term set to 0) implicitly increases statistical heterogeneity and can adversely impact convergence behavior. To mitigate this issue, we propose adding a proximal term to the objective that helps to improve the stability of the method. This term provides a principled way for the server to account for heterogeneity associated with partial information. Theoretically, these modifications allow us to provide convergence guarantees for our method and to analyze the effect of heterogeneity. Empirically, we demonstrate that the modifications improve the stability and overall accuracy of federated learning in heterogeneous networks—improving the absolute testing accuracy by 22% on average in highly heterogeneous settings.

The remainder of this paper is organized as follows. In Section 2, we provide background on federated learning and an overview of related work. We then present our proposed framework, FedProx, in Section 3, and derive convergence guarantees for the framework accounting for both statistical and systems heterogeneity in Section 4. Finally, in Section 5, we provide a thorough empirical evaluation of FedProx on a suite of synthetic and real-world federated datasets. Our empirical results help to illustrate and validate our theoretical analysis, and demonstrate the practical improvements of FedProx over FedAvg in heterogeneous networks.

2 BACKGROUND AND RELATED WORK

Large-scale machine learning, particularly in data center settings, has motivated the development of numerous distributed optimization methods in the past decade (see, e.g., Boyd et al., 2010; Dekel et al., 2012; Dean et al., 2012; Zhang et al., 2013; Li et al., 2014a; Shamir et al., 2014; Reddi et al., 2016; Zhang et al., 2015; Richtárik & Takáč, 2016; Smith et al., 2018). However, as computing substrates such as phones, sensors, and wearable devices grow both in power and in popularity, it is increasingly attractive to learn statistical models locally in networks of distributed devices, in contrast to moving the data to the data center. This problem, known as federated learning, requires tackling novel challenges with privacy, heterogeneous data and devices, and massively distributed networks (Li et al., 2019).

Recent optimization methods have been proposed that are tailored to the specific challenges in the federated setting. These methods have shown significant improvements over traditional distributed approaches such as ADMM (Boyd et al., 2010) or mini-batch methods (Dekel et al., 2012) by allowing both for inexact local updating in order to balance communication vs. computation in large networks, and for a small subset of devices to be active at any communication round (McMahan et al., 2017; Smith et al., 2017). For example, Smith et al. (2017) propose a communication-efficient primal-dual optimization method that learns separate but related models for each device through a multi-task learning framework. Despite the theoretical guarantees and practical efficiency of the proposed method, such an approach is not generalizable to non-convex problems, e.g., deep learning, where strong duality is no longer guaranteed. In the non-convex setting, Federated Averaging (FedAvg), a heuristic method based on averaging local Stochastic Gradient Descent (SGD) updates in the primal, has instead been shown to work well empirically (McMahan et al., 2017).

Unfortunately, FedAvg is quite challenging to analyze due to its local updating scheme, the fact that few devices are active at each round, and the issue that data is frequently distributed in a heterogeneous nature in the network. In particular, as each device generates its own local data, statistical heterogeneity is common with data being non-identically distributed between devices. Several works have made steps towards analyzing FedAvg in simpler, non-federated settings. For instance, parallel SGD and related variants (Zhang et al., 2015; Shamir et al., 2014; Reddi et al., 2016; Zhou & Cong, 2018; Stich, 2019; Wang & Joshi, 2018; Woodworth et al., 2018; Lin et al., 2020), which make local updates similar to FedAvg, have been studied in the IID setting. However, the results rely on the premise that each local solver is a copy of the same stochastic process (due to the IID assumption). This line of reasoning does not apply to the heterogeneous setting.

Although some recent works (Yu et al., 2018; Wang et al., 2019; Hao et al., 2019; Jiang & Agrawal, 2018) have explored convergence guarantees in statistically heterogeneous settings, they make the limiting assumption that all devices participate in each round of communication, which is often infeasible in realistic federated networks (McMahan et al., 2017). Further, they rely on specific solvers to be used on each device (either SGD or GD), as compared to the solver-agnostic framework proposed herein, and add additional assumptions of convexity (Wang et al., 2019) or uniformly bounded gradients (Yu et al., 2018) to their analyses. There are also heuristic approaches that aim to tackle statistical heterogeneity by sharing the local device data or server-side proxy data (Jeong et al., 2018; Zhao et al., 2018; Huang et al., 2018). However, these methods may be unrealistic: in addition to imposing burdens on network bandwidth, sending local data to the server (Jeong et al., 2018) violates the key privacy assumption of federated learning, and sending globally-shared proxy data to all devices (Zhao et al., 2018; Huang et al., 2018) requires effort to carefully generate or collect such auxiliary data.

Beyond statistical heterogeneity, systems heterogeneity is also a critical concern in federated networks. The storage, computational, and communication capabilities of each device in federated networks may differ due to variability in hardware (CPU, memory), network connectivity (3G, 4G, 5G, wifi), and power (battery level). These system-level characteristics dramatically exacerbate challenges such as straggler mitigation and fault tolerance. One strategy used in practice is to ignore the more constrained devices failing to complete a certain amount of training (Bonawitz et al., 2019). However (as we demonstrate in Section 5), this can have negative effects on convergence as it limits the number of effective devices contributing to training, and may induce bias in the device sampling procedure if the dropped devices have specific data characteristics.

In this work, inspired by FedAvg, we explore a broader framework, FedProx, that is capable of handling heterogeneous federated environments while maintaining similar privacy and computational benefits. We analyze the convergence behavior of the framework through a statistical dissimilarity characterization between local functions, while also taking into account practical systems constraints. Our dissimilarity characterization is inspired by the randomized Kaczmarz method for solving linear systems of equations (Kaczmarz, 1993; Strohmer & Vershynin, 2009), a similar assumption of which has been used to analyze variants of SGD in other settings (see, e.g., Schmidt & Roux, 2013; Vaswani et al., 2019; Yin et al., 2018). Our proposed framework provides improved robustness and stability for optimization in heterogeneous federated networks.

Finally, in terms of related work, we note that two aspects of our proposed work—the proximal term in FedProx and the bounded dissimilarity assumption used in our analysis—have been previously studied in the optimization literature, though often with very different motivations and in non-federated settings. For completeness, we provide a further discussion in Appendix B on this background work.

3 FEDERATED OPTIMIZATION: METHODS

In this section, we introduce the key ingredients behind recent methods for federated learning, including FedAvg, and then outline our proposed framework, FedProx. Federated learning methods (e.g., McMahan et al., 2017; Smith et al., 2017) are designed to handle multiple devices collecting data and a central server coordinating the global learning objective across the network. In particular, the aim is to minimize:

min_w f(w) = Σ_{k=1}^{N} p_k F_k(w) = E_k[F_k(w)],   (1)

where N is the number of devices, p_k ≥ 0, and Σ_k p_k = 1. In general, the local objectives measure the local empirical risk over possibly differing data distributions D_k, i.e., F_k(w) := E_{x_k ∼ D_k}[f_k(w; x_k)], with n_k samples available at each device k. Hence, we can set p_k = n_k/n, where n = Σ_k n_k is the total number of data points. In this work, we consider F_k(w) to be possibly non-convex.

To reduce communication, a common technique in federated optimization is that on each device, a local objective function based on the device's data is used as a surrogate for the global objective function. At each outer iteration, a subset of the devices are selected and local solvers are used to optimize the local objective functions on each of the selected devices. The devices then communicate their local model updates to the central server, which aggregates them and updates the global model accordingly. The key to allowing flexible performance in this scenario is that each of the local objectives can be solved inexactly. This allows the amount of local computation vs. communication to be tuned based on the number of local iterations that are performed (with additional local iterations corresponding to more exact local solutions). We introduce this notion formally below, as it will be utilized throughout the paper.

Definition 1 (γ-inexact solution). For a function h(w; w_0) = F(w) + (μ/2)‖w − w_0‖², and γ ∈ [0, 1], we say w* is a γ-inexact solution of min_w h(w; w_0) if ‖∇h(w*; w_0)‖ ≤ γ‖∇h(w_0; w_0)‖, where ∇h(w; w_0) = ∇F(w) + μ(w − w_0). Note that a smaller γ corresponds to higher accuracy.

We use γ-inexactness in our analysis (Section 4) to measure the amount of local computation from the local solver at each round. As discussed earlier, different devices are likely to make different progress towards solving the local subproblems due to variable systems conditions, and it is therefore important to allow γ to vary both by device and by iteration. This is one of the motivations for our proposed framework discussed in the next sections. For ease of notation, we first derive our main convergence results assuming a uniform γ as defined here (Section 4), and then provide results with variable γ's in Corollary 9.

3.1 Federated Averaging (FedAvg)

In Federated Averaging (FedAvg) (McMahan et al., 2017), the local surrogate of the global objective function at device k is F_k(·), and the local solver is stochastic gradient descent (SGD), with the same learning rate and number of local epochs used on each device. At each round, a subset K ≪ N of the total devices are selected and run SGD locally for E number of epochs, and then the resulting model updates are averaged. The details of FedAvg are summarized in Algorithm 1.
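To make the objective in Equation (1) and the round structure just described concrete, here is a minimal, self-contained sketch (our own illustration, not the authors' code). The local losses are hypothetical quadratics F_k(w) = ½‖w − c_k‖², so each device's minimizer c_k plays the role of its heterogeneous local optimum:

```python
import numpy as np

def global_objective(w, local_losses, p):
    # Eq. (1): f(w) = sum_k p_k F_k(w), with p_k >= 0 and sum_k p_k = 1.
    return sum(p_k * F_k(w) for p_k, F_k in zip(p, local_losses))

def fedavg_round(w, local_grads, p, K, eta=0.1, E=5, rng=None):
    # One FedAvg round: sample K devices with probability p_k,
    # run E local (full-batch) gradient steps on each, then average.
    rng = rng or np.random.default_rng(0)
    chosen = rng.choice(len(p), size=K, replace=False, p=np.asarray(p))
    local_models = []
    for k in chosen:
        w_k = w.copy()
        for _ in range(E):
            w_k -= eta * local_grads[k](w_k)  # E epochs of SGD on F_k
        local_models.append(w_k)
    return np.mean(local_models, axis=0)  # simple average of the updates

# Hypothetical heterogeneous devices: F_k(w) = 0.5 * ||w - c_k||^2,
# whose minimizers c_k differ across devices (statistical heterogeneity).
centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
           np.array([3.0, 3.0]), np.array([0.0, 0.0])]
losses = [lambda w, c=c: 0.5 * np.dot(w - c, w - c) for c in centers]
grads = [lambda w, c=c: w - c for c in centers]
p = [0.25] * 4

w = np.zeros(2)
for _ in range(100):
    w = fedavg_round(w, grads, p, K=4)  # full participation for the demo
```

With full participation and these quadratics, the rounds contract toward the mean of the c_k; sampling K < N heterogeneous devices instead makes the iterates fluctuate with the chosen subsets, a toy version of the instability discussed above.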
Algorithm 1 Federated Averaging (FedAvg)
  Input: K, T, η, E, w^0, N, p_k, k = 1, ..., N
  for t = 0, ..., T − 1 do
    Server selects a subset S_t of K devices at random (each device k is chosen with probability p_k)
    Server sends w^t to all chosen devices
    Each device k ∈ S_t updates w^t for E epochs of SGD on F_k with step-size η to obtain w_k^{t+1}
    Each device k ∈ S_t sends w_k^{t+1} back to the server
    Server aggregates the w's as w^{t+1} = (1/K) Σ_{k ∈ S_t} w_k^{t+1}
  end for

McMahan et al. (2017) show empirically that it is crucial to tune the optimization hyperparameters of FedAvg properly. In particular, the number of local epochs in FedAvg plays an important role in convergence. On one hand, performing more local epochs allows for more local computation and potentially reduced communication, which can greatly improve the overall convergence speed in communication-constrained networks. On the other hand, with dissimilar (heterogeneous) local objectives F_k, a larger number of local epochs may lead each device towards the optima of its local objective as opposed to the global objective—potentially hurting convergence or even causing the method to diverge. Further, in federated networks with heterogeneous systems resources, setting the number of local epochs to be high may increase the risk that devices do not complete training within a given communication round and must therefore drop out of the procedure (Bonawitz et al., 2019).

In practice, it is therefore important to find a way to set the local epochs to be high (to reduce communication) while also allowing for robust convergence. More fundamentally, we note that the 'best' setting for the number of local epochs is likely to change at each iteration and on each device—as a function of both the local data and available systems resources. Indeed, a more natural approach than mandating a fixed number of local epochs is to allow the epochs to vary according to the characteristics of the network, and to carefully merge solutions by accounting for this heterogeneity. We formalize this strategy in FedProx, introduced below.

3.2 Proposed Framework: FedProx

Our proposed framework, FedProx (Algorithm 2), is similar to FedAvg in that a subset of devices are selected at each round, local updates are performed, and these updates are then averaged to form a global update. However, FedProx makes the following simple yet critical modifications, which result in significant empirical improvements and also allow us to provide convergence guarantees for the method.

Tolerating partial work. As previously discussed, different devices in federated networks often have different resource constraints in terms of the computing hardware, network connections, and battery levels. Therefore, it is unrealistic to force each device to perform a uniform amount of work (i.e., running the same number of local epochs, E), as in FedAvg. In FedProx, we generalize FedAvg by allowing for variable amounts of work to be performed locally across devices based on their available systems resources, and then aggregate the partial solutions sent from the stragglers (as compared to dropping these devices). In other words, instead of assuming a uniform γ for all devices throughout the training process, FedProx implicitly accommodates variable γ's for different devices and at different iterations. We formally define γ_k^t-inexactness for device k at iteration t below, which is a natural extension from Definition 1.

Definition 2 (γ_k^t-inexact solution). For a function h_k(w; w_t) = F_k(w) + (μ/2)‖w − w_t‖², and γ_k^t ∈ [0, 1], we say w* is a γ_k^t-inexact solution of min_w h_k(w; w_t) if ‖∇h_k(w*; w_t)‖ ≤ γ_k^t ‖∇h_k(w_t; w_t)‖, where ∇h_k(w; w_t) = ∇F_k(w) + μ(w − w_t). Note that a smaller γ_k^t corresponds to higher accuracy.

Analogous to Definition 1, γ_k^t measures how much local computation is performed to solve the local subproblem on device k at the t-th round. The variable number of local iterations can be viewed as a proxy of γ_k^t. Utilizing the more flexible γ_k^t-inexactness, we can readily extend the convergence results under Definition 1 (Theorem 4) to consider issues related to systems heterogeneity such as stragglers (see Corollary 9).

Proximal term. As mentioned in Section 3.1, while tolerating nonuniform amounts of work to be performed across devices can help alleviate negative impacts of systems heterogeneity, too many local updates may still (potentially) cause the methods to diverge due to the underlying heterogeneous data. We propose to add a proximal term to the local subproblem to effectively limit the impact of variable local updates. In particular, instead of just minimizing the local function F_k(·), device k uses its local solver of choice to approximately minimize the following objective h_k:

min_w h_k(w; w^t) = F_k(w) + (μ/2)‖w − w^t‖².   (2)

The proximal term is beneficial in two aspects: (1) It addresses the issue of statistical heterogeneity by restricting the local updates to be closer to the initial (global) model without any need to manually set the number of local epochs. (2) It allows for safely incorporating variable amounts of local work resulting from systems heterogeneity. We summarize the steps of FedProx in Algorithm 2.

Algorithm 2 FedProx (Proposed Framework)
  Input: K, T, μ, γ, w^0, N, p_k, k = 1, ..., N
  for t = 0, ..., T − 1 do
    Server selects a subset S_t of K devices at random (each device k is chosen with probability p_k)
    Server sends w^t to all chosen devices
    Each chosen device k ∈ S_t finds a w_k^{t+1} which is a γ_k^t-inexact minimizer of: w_k^{t+1} ≈ argmin_w h_k(w; w^t) = F_k(w) + (μ/2)‖w − w^t‖²
    Each device k ∈ S_t sends w_k^{t+1} back to the server
    Server aggregates the w's as w^{t+1} = (1/K) Σ_{k ∈ S_t} w_k^{t+1}
  end for

We note that proximal terms such as the one above are a popular tool utilized throughout the optimization literature; for completeness, we provide a more detailed discussion on this in Appendix B. An important distinction of the proposed usage is that we suggest, explore, and analyze such a term for the purpose of tackling heterogeneity in federated networks.
Our analysis (Section 4) is also unique in considering solving such an objective in a distributed setting with: (1) non-IID partitioned data, (2) the use of any local solver, (3) variable inexact updates across devices, and (4) a subset of devices being active at each round. These assumptions are critical to providing a characterization of such a framework in realistic federated scenarios.

In our experiments (Section 5), we demonstrate that tolerating partial work is beneficial in the presence of systems heterogeneity and our modified local subproblem in FedProx results in more robust and stable convergence compared to vanilla FedAvg for heterogeneous datasets. In Section 4, we also see that the usage of the proximal term makes FedProx more amenable to theoretical analysis (i.e., the local objective may be more well-behaved). In particular, if μ is chosen accordingly, the Hessian of h_k may be positive semi-definite. Hence, when F_k is non-convex, h_k will be convex, and when F_k is convex, it becomes μ-strongly convex.

Finally, we note that since FedProx makes only lightweight modifications to FedAvg, this allows us to reason about the behavior of the widely-used FedAvg method, and enables easy integration of FedProx into existing packages/systems, such as TensorFlow Federated and LEAF (TFF; Caldas et al., 2018). In particular, we note that FedAvg is a special case of FedProx with (1) μ = 0, (2) the local solver specifically chosen to be SGD, and (3) a constant γ (corresponding to the number of local epochs) across devices and updating rounds (i.e., no notion of systems heterogeneity). FedProx is in fact much more general in this regard, as it allows for partial work to be performed across devices and any local (possibly non-iterative) solver to be used on each device.

4 FEDPROX: CONVERGENCE ANALYSIS

FedAvg and FedProx are stochastic algorithms by nature: in each round, only a fraction of the devices are sampled to perform the update, and the updates performed on each device may be inexact. It is well known that in order for stochastic methods to converge to a stationary point, a decreasing step-size is required. This is in contrast to non-stochastic methods, e.g., gradient descent, that can find a stationary point by employing a constant step-size. In order to analyze the convergence behavior of methods with constant step-size (as is usually implemented in practice), we need to quantify the degree of dissimilarity among the local objective functions. This could be achieved by assuming the data to be IID, i.e., homogeneous across devices. Unfortunately, in realistic federated networks, this assumption is impractical. Thus, we first propose a metric that specifically measures the dissimilarity among local functions (Section 4.1), and then analyze FedProx under this assumption while allowing for variable γ's (Section 4.2).

4.1 Local dissimilarity

Here we introduce a measure of dissimilarity between the devices in a federated network, which is sufficient to prove convergence. This can also be satisfied via a simpler and more restrictive bounded variance assumption of the gradients (Corollary 10), which we explore in our experiments in Section 5. Interestingly, similar assumptions (e.g., Schmidt & Roux, 2013; Vaswani et al., 2019; Yin et al., 2018) have been explored elsewhere but for differing purposes; we provide a discussion of these works in Appendix B.

Definition 3 (B-local dissimilarity). The local functions F_k are B-locally dissimilar at w if E_k[‖∇F_k(w)‖²] ≤ ‖∇f(w)‖² B². We further define B(w) = sqrt( E_k[‖∇F_k(w)‖²] / ‖∇f(w)‖² ) for ‖∇f(w)‖ ≠ 0.

Here E_k[·] denotes the expectation over devices with masses p_k = n_k/n and Σ_{k=1}^{N} p_k = 1 (as in Equation 1). Definition 3 can be seen as a generalization of the IID assumption with bounded dissimilarity, while allowing for statistical heterogeneity. As a sanity check, when all the local functions are the same, we have B(w) = 1 for all w. However, in the federated setting, the data distributions are often heterogeneous and B > 1 due to sampling discrepancies even if the samples are assumed to be IID. Let us also consider the case where F_k(·)'s are associated with empirical risk objectives. If the samples on all the devices are homogeneous, i.e., they are sampled in an IID fashion, then as min_k n_k → ∞, it follows that B(w) → 1 for every w as all the local functions converge to the same expected risk function in the large sample limit.
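When the local gradients are available, the quantity B(w) from Definition 3 can be evaluated directly. A small sketch (our own illustration with made-up two-device gradients) that also reproduces the sanity check that identical local functions give B(w) = 1:

```python
import numpy as np

def dissimilarity(w, local_grads, p):
    # Definition 3: B(w)^2 = E_k[||grad F_k(w)||^2] / ||grad f(w)||^2,
    # with E_k weighted by p_k and grad f(w) = sum_k p_k grad F_k(w).
    gs = [g(w) for g in local_grads]
    num = sum(p_k * np.dot(g_k, g_k) for p_k, g_k in zip(p, gs))
    g_f = sum(p_k * g_k for p_k, g_k in zip(p, gs))
    denom = np.dot(g_f, g_f)
    if denom == 0.0:
        return 1.0  # stationary point that all local functions agree on
    return float(np.sqrt(num / denom))

w = np.zeros(2)
same = [lambda w: w - np.array([1.0, 0.0])] * 2          # identical devices
mixed = [lambda w: w - np.array([1.0, 0.0]),
         lambda w: w - np.array([0.0, 1.0])]             # heterogeneous
B_same = dissimilarity(w, same, [0.5, 0.5])    # -> 1.0
B_mixed = dissimilarity(w, mixed, [0.5, 0.5])  # -> sqrt(2) > 1
```

In the heterogeneous case the local gradients partially cancel in ∇f(w) while their squared norms do not, which is exactly why B(w) exceeds 1.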
Thus, B(w) ≥ 1, and the larger the value of B(w), the larger is the dissimilarity among the local functions.

Using Definition 3, we now state our formal dissimilarity assumption, which we use in our convergence analysis. This simply requires that the dissimilarity defined in Definition 3 is bounded. As discussed later, our convergence rate is a function of the statistical heterogeneity/device dissimilarity in the network.

Assumption 1 (Bounded dissimilarity). For some ε > 0, there exists a B_ε such that for all the points w ∈ S_ε^c = {w : ‖∇f(w)‖² > ε}, B(w) ≤ B_ε.

As an exception we define B(w) = 1 when E_k[‖∇F_k(w)‖²] = ‖∇f(w)‖², i.e., w is a stationary solution that all the local functions F_k agree on.

For most practical machine learning problems, there is no need to solve the problem to highly accurate stationary solutions, i.e., ε is typically not very small. Indeed, it is well-known that solving the problem beyond some threshold may even hurt generalization performance due to overfitting (Yao et al., 2007). Although in practical federated learning problems the samples are not IID, they are still sampled from distributions that are not entirely unrelated (if this were the case, e.g., fitting a single global model w across devices would be ill-advised). Thus, it is reasonable to assume that the dissimilarity between local functions remains bounded throughout the training process. We also measure the dissimilarity metric empirically on real and synthetic datasets in Section 5.3.3 and show that this metric captures real-world statistical heterogeneity and is correlated with practical performance (the smaller the dissimilarity, the better the convergence).

4.2 FedProx Analysis

Using the bounded dissimilarity assumption (Assumption 1), we now analyze the amount of expected decrease in the objective when one step of FedProx is performed. Our convergence rate (Theorem 6) can be directly derived from the results of the expected decrease per updating round. We assume the same γ_k^t for any k, t for ease of notation in the following analyses.

Theorem 4 (Non-convex FedProx convergence: B-local dissimilarity). Let Assumption 1 hold. Assume the functions F_k are non-convex, L-Lipschitz smooth, and there exists L_− > 0, such that ∇²F_k ≽ −L_− I, with μ̄ := μ − L_− > 0. Suppose that w^t is not a stationary solution and the local functions F_k are B-dissimilar, i.e., B(w^t) ≤ B. If μ, K, and γ in Algorithm 2 are chosen such that

ρ = 1/μ − γB/μ − B(1+γ)√2/(μ̄√K) − LB(1+γ)/(μ̄μ) − L(1+γ)²B²/(2μ̄²) − (LB²(1+γ)²/(μ̄²K))(2√(2K) + 2) > 0,

then at iteration t of Algorithm 2, we have the following expected decrease in the global objective:

E_{S_t}[f(w^{t+1})] ≤ f(w^t) − ρ ‖∇f(w^t)‖²,

where S_t is the set of K devices chosen at iteration t.

We direct the reader to Appendix A.1 for a detailed proof. The key steps include applying our notion of γ-inexactness (Definition 1) for each subproblem and using the bounded dissimilarity assumption, while allowing for only K devices to be active at each round. This last step in particular introduces E_{S_t}, an expectation with respect to the choice of devices, S_t, in round t.

Note that the conditions of Theorem 4 require μ̄ > 0, which is a sufficient but not necessary condition for FedProx to converge. Hence, it is possible that some other μ (not necessarily satisfying μ̄ > 0) can also enable convergence, as we explore empirically (Section 5).

Theorem 4 uses the dissimilarity in Definition 3 to identify sufficient decrease of the objective value at each iteration for FedProx. In Appendix A.2, we provide a corollary characterizing the performance with a more common (though slightly more restrictive) bounded variance assumption. This assumption is commonly employed, e.g., when analyzing methods such as SGD. We next provide sufficient (but not necessary) conditions that ensure ρ > 0 in Theorem 4 such that sufficient decrease is attainable after each round.

Remark 5. For ρ in Theorem 4 to be positive, we need γB < 1 and B/√K < 1. These conditions help to quantify the trade-off between dissimilarity (B) and the algorithm parameters (γ, K).

Finally, we can use the above sufficient decrease to characterize the rate of convergence to the set of approximate stationary solutions S_s = {w : E[‖∇f(w)‖²] ≤ ε} under the bounded dissimilarity assumption, Assumption 1. Note that these results hold for general non-convex F_k(·).

Theorem 6 (Convergence rate: FedProx). Given some ε > 0, assume that for B ≥ B_ε, μ, γ, and K the assumptions of Theorem 4 hold at each iteration of FedProx. Moreover, f(w^0) − f* = Δ. Then, after T = O(Δ/(ρε)) iterations of FedProx, we have (1/T) Σ_{t=0}^{T−1} E[‖∇f(w^t)‖²] ≤ ε.

While the results thus far hold for non-convex F_k(·), we can also characterize the convergence for the special case of convex loss functions with exact minimization in terms of local objectives (Corollary 7). A proof is provided in Appendix A.3.

Corollary 7 (Convergence: Convex case). Let the assertions of Theorem 4 hold. In addition, let F_k(·)'s be convex and γ_k^t = 0 for any k, t, i.e., all the local problems are solved exactly. If 1 ≪ B ≤ 0.5√K, then we can choose μ ≈ 6LB², from which it follows that ρ ≈ 1/(24LB²).

Note that small ε in Assumption 1 translates to larger B_ε. Corollary 7 suggests that, in order to solve the problem with increasingly higher accuracies using FedProx, one needs to increase μ appropriately. We empirically verify that μ > 0 leads to more stable convergence in Section 5.3. Moreover, in Corollary 7, if we plug in the upper bound for B_ε, under a bounded variance assumption (Corollary 10), the number of required steps to achieve accuracy ε is O(LΔ/ε + LΔσ²/ε²). Our analysis helps to characterize the performance of FedProx and similar methods when local
We note that in our theory, we functions are dissimilar.\r\n Federated Optimization in Heterogeneous Networks\r\n Remark 8 (Comparison with SGD). We note that 5 EXPERIMENTS\r\n FedProxachievesthesameasymptoticconvergence guar- We now present empirical results for the generalized\r\n antee as SGD: Under the bounded variance assumption, for FedProxframework. InSection5.2, we demonstrate the\r\n small \u01eb, if we replace B\u01eb with its upper-bound in Corollary improved performance of FedProx tolerating partial solu-\r\n 10andchoose\u00b5largeenough,theiteration complexity of tions in the face of systems heterogeneity. In Section 5.3,\r\n FedProx when the subproblems are solved exactly and\r\n F (\u00b7)\u2019s are convex is O(L\u2206 + L\u2206\u03c32), the same as SGD weshowtheeffectiveness of FedProx in the settings with\r\n k \u01eb \u01eb2 statistical heterogeneity (regardless of systems heterogene-\r\n (Ghadimi & Lan, 2013). ity). We also study the effects of statistical heterogeneity\r\n Toprovide context for the rate in Theorem 6, we compare on convergence (Section 5.3.1) and show how empirical\r\n it with SGD in the convex case in Remark 8. In general, convergence is related to our theoretical bounded dissimilar-\r\n our analysis of FedProx does not yield convergence rates ity assumption (Assumption 1) (Section 5.3.3). We provide\r\n that improve upon classical distributed SGD (without local thoroughdetailsoftheexperimentalsetupinSection5.1and\r\n updating)\u2014even though FedProxpossibly performs more Appendix C. All code, data, and experiments are publicly\r\n work locally at each communication round. In fact, when available at: github.com/litian96/FedProx.\r\n data are generated in a non-identically distributed fashion,\r\n it is possible for local updating schemes such as FedProx 5.1 Experimental Details\r\n to perform worse than distributed SGD. 
Therefore, our theo- Weevaluate FedProxondiversetasks, models, and real-\r\n retical results do not necessarily demonstrate the superiority world federated datasets. In order to better characterize\r\n of FedProx over distributed SGD; rather, they provide statistical heterogeneity and study its effect on convergence,\r\n suf\ufb01cient (but not necessary) conditions for FedProx to we also evaluate on a set of synthetic data, which allows\r\n converge. Ouranalysis is the \ufb01rst we are aware of to analyze for more precise manipulation of statistical heterogeneity.\r\n any federated (i.e., with local-updating schemes and low Wesimulate systems heterogeneity by assigning different\r\n device participation) optimization method for Problem (1) amounts of local work to different devices.\r\n in heterogeneous settings.\r\n Finally, we note that the previous analyses assume no sys- Synthetic data. To generate synthetic data, we follow\r\n tems heterogeneity and use the same \u03b3 for all devices and it- a similar setup to that in Shamir et al. (2014), addition-\r\n erations. However, wecanextendthemtoallowfor\u03b3 tovary ally imposing heterogeneity among devices. In particular,\r\n for each device k, we generate samples (X ,Y ) accord-\r\n by device and by iteration (as in De\ufb01nition 2), which cor- k k\r\n responds to allowing devices to perform variable amounts ing to the model y = argmax(softmax(Wx + b)), x \u2208\r\n 60 10\u00d760 10\r\n R ,W \u2208 R , b \u2208 R . We model W \u223c N(u ,1),\r\n of work as determined by the local systems conditions. We k k\r\n b \u223cN(u ,1),u \u223cN(0,\u03b1);x \u223cN(v ,\u03a3),wherethe\r\n provide convergence results with variable \u03b3\u2019s below. k k k k k\r\n covariance matrix \u03a3 is diagonal with \u03a3 =j\u22121.2. Eachel-\r\n Corollary 9 (Convergence: Variable \u03b3\u2019s). 
Assume the func- j,j\r\n ementinthemeanvectorv isdrawnfromN(B ,1),B \u223c\r\n tions F are non-convex, L-Lipschitz smooth, and there ex- k k k\r\n k N(0,\u03b2). Therefore, \u03b1 controls how much local models dif-\r\n ists L >0,suchthat\u22072F \u0017\u2212L I,with\u00b5\u00af := \u00b5\u2212L >\r\n \u2212 t k \u2212 \u2212 fer from each other and \u03b2 controls how much the local data\r\n 0. Suppose that w is not a stationary solution and the local at each device differs from that of other devices. We vary\r\n functions F are B-dissimilar, i.e. B(wt) \u2264 B. If \u00b5, K,\r\n k \u03b1,\u03b2 to generate three heterogeneous distributed datasets,\r\n and\u03b3t in Algorithm 2 are chosen such that denoted Synthetic (\u03b1,\u03b2), as shown in Figure 2. We also\r\n k\r\n t t \u221a t generate one IID dataset by setting the same W,b on all\r\n t 1 \u03b3 B B(1+\u03b3 ) 2 LB(1+\u03b3 ) devices and setting X to follow the same distribution. Our\r\n \u03c1 = \u00b5\u2212 \u00b5 \u2212 \u221a \u2212 \u00b5\u00b5\u00af k\r\n \u00b5\u00af K goal is to learn a global W and b. Full details are given in\r\n t 2 2 2 t 2\u0012 \u221a \u0013\u0013 Appendix C.1.\r\n \u2212L(1+\u03b3 ) B \u2212LB (1+\u03b3 ) 2 2K+2 >0,\r\n 2\u00b5\u00af2 \u00b5\u00af2K Real data. We also explore four real datasets; statistics are\r\n then at iteration t of Algorithm 2, we have the following summarized in Table 1. These datasets are curated from\r\n expected decrease in the global objective: prior work in federated learning as well as recent feder-\r\n \u0002 \u0003 ated learning benchmarks (McMahan et al., 2017; Caldas\r\n t+1 t t t 2\r\n E f(w ) \u2264f(w )\u2212\u03c1 k\u2207f(w )k ,\r\n St et al., 2018). We study a convex classi\ufb01cation problem with\r\n where S is the set of K devices chosen at iteration t and MNIST(LeCunetal.,1998)usingmultinomiallogistic re-\r\n t gression. 
To impose statistical heterogeneity, we distribute\r\n \u03b3 =max \u03b3t.\r\n t k\u2208S\r\n t k the data among 1,000 devices such that each device has\r\n The proof can be easily extended from the proof for The- samples of only two digits and the number of samples per\r\n orem 4 , noting the fact that E [(1 + \u03b3t)k\u2207F (wt)k] \u2264 device follows a power law. We then study a more com-\r\n k k k\r\n (1+max \u03b3t)E [k\u2207F (wt)k]. plex 62-class Federated Extended MNIST (Cohen et al.,\r\n k\u2208S k k\r\n t k\r\n Federated Optimization in Heterogeneous Networks\r\n 2017; Caldas et al., 2018) (FEMNIST) dataset using the 5.2 Systems Heterogeneity: Tolerating Partial Work\r\n samemodel. For the non-convex setting, we consider a text In order to measure the effect of allowing for partial so-\r\n sentiment analysis task on tweets from Sentiment140 (Go lutions to be sent to handle systems heterogeneity with\r\n et al., 2009) (Sent140) with an LSTM classi\ufb01er, where each FedProx,wesimulatefederated settings with varying sys-\r\n twitter account corresponds to a device. We also investigate temheterogeneity, as described below.\r\n the task of next-character prediction on the dataset of The\r\n Complete Works of William Shakespeare (McMahan et al., Systems heterogeneity simulations. We assume that there\r\n 2017) (Shakespeare). Each speaking role in the plays is as- exists a global clock during training, and each participating\r\n sociated with a different device. Details of datasets, models, device determines the amount of local work as a function of\r\n and workloads are provided in Appendix C.1. this clock cycle and its systems constraints. This speci\ufb01ed\r\n amountoflocal computation corresponds to some implicit\r\n value \u03b3t for device k at the t-th iteration. In our simulations,\r\n k\r\n Table 1. Statistics of four real federated datasets. 
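The systems heterogeneity simulation of Section 5.2, in which each selected device is forced to complete a device-specific number of local epochs, can be sketched as follows. This is a minimal illustration with hypothetical helper names; the released code at github.com/litian96/FedProx is the authoritative implementation.

```python
import random

def assign_local_epochs(selected_devices, max_epochs=20, straggler_frac=0.9, seed=0):
    """Simulate systems heterogeneity: a straggler_frac fraction of the
    selected devices finish only a random number of epochs in [1, max_epochs],
    while the remaining devices complete all max_epochs local epochs."""
    rng = random.Random(seed)
    epochs = {}
    for device in selected_devices:
        if rng.random() < straggler_frac:
            # Straggler: systems constraint -> partial work (implicit gamma_k^t > 0).
            epochs[device] = rng.randint(1, max_epochs)
        else:
            epochs[device] = max_epochs
    return epochs
```

Under this sketch, FedAvg would drop any device with `epochs[device] < max_epochs` at the global clock cycle, whereas FedProx would average in the partial updates from those devices.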
In our simulations, we fix a global number of epochs $E$ and force some devices to perform fewer than $E$ epochs of updates given their current systems constraints. In particular, for the varying heterogeneous settings, at each round we assign $x$ epochs of work (chosen uniformly at random from $[1, E]$) to 0%, 50%, or 90% of the selected devices, respectively. Settings where 0% of the devices perform fewer than $E$ epochs of work correspond to environments without systems heterogeneity, while 90% of the devices sending their partial solutions corresponds to highly heterogeneous environments. FedAvg simply drops these 0%, 50%, or 90% stragglers upon reaching the global clock cycle, whereas FedProx incorporates the partial updates from these devices.

Table 1. Statistics of four real federated datasets.

Dataset       Devices   Samples   Samples/device (mean)   Samples/device (stdev)
MNIST         1,000     69,035    69                      106
FEMNIST       200       18,345    92                      159
Shakespeare   143       517,106   3,616                   6,808
Sent140       772       40,783    53                      32

Implementation. We implement FedAvg (Algorithm 1) and FedProx (Algorithm 2) in TensorFlow (Abadi et al., 2016). In order to draw a fair comparison with FedAvg, we employ SGD as the local solver for FedProx, and adopt a slightly different device sampling scheme than that in Algorithms 1 and 2: we sample devices uniformly and then average the updates with weights proportional to the number of local data points (as originally proposed in McMahan et al. (2017)). While this sampling scheme is not supported by our analysis, we observe similar relative behavior of FedProx vs. FedAvg whether or not it is employed. Interestingly, we also observe that the sampling scheme proposed herein in fact results in more stable performance for both methods (see Appendix C.3.4, Figure 12). This suggests an additional benefit of the proposed framework. Full details are provided in Appendix C.2.

Hyperparameters & evaluation metrics. For each dataset, we tune the learning rate on FedAvg (with $E = 1$ and without systems heterogeneity) and use the same learning rate for all experiments on that dataset. We set the number of selected devices to 10 for all experiments on all datasets. For each comparison, we fix the randomly selected devices, the stragglers, and the mini-batch orders across all runs. We report all metrics based on the global objective $f(w)$. Note that in our simulations (see Section 5.2 for details), we assume that each communication round corresponds to a specific aggregation time stamp (measured in real-world global wall-clock time); we therefore report results in terms of rounds rather than FLOPs or wall-clock time. See details of the hyper-parameters in Appendix C.2.

In Figure 1, we set $E$ to 20 and study the effects of aggregating partial work from the otherwise dropped devices. The synthetic dataset here is taken from Synthetic (1,1) in Figure 2. We see that on all the datasets, systems heterogeneity has negative effects on convergence, and larger heterogeneity results in worse convergence (FedAvg). Compared with dropping the more constrained devices (FedAvg), incorporating variable amounts of work (FedProx, $\mu = 0$) is beneficial and leads to more stable and faster convergence. We also observe that setting $\mu > 0$ in FedProx can further improve convergence, as we discuss in Section 5.3.

Figure 1. FedProx results in significant convergence improvements relative to FedAvg in heterogeneous networks. We simulate different levels of systems heterogeneity by forcing 0%, 50%, and 90% of devices to be stragglers (dropped by FedAvg). (1) Comparing FedAvg and FedProx ($\mu = 0$), we see that allowing for variable amounts of work to be performed can help convergence in the presence of systems heterogeneity. (2) Comparing FedProx ($\mu = 0$) with FedProx ($\mu > 0$), we show the benefits of our added proximal term. FedProx with $\mu > 0$ leads to more stable convergence and enables otherwise divergent methods to converge, both in the presence of systems heterogeneity (50% and 90% stragglers) and without systems heterogeneity (0% stragglers). Note that FedProx with $\mu = 0$ and without systems heterogeneity (no stragglers) corresponds to FedAvg.

We additionally investigate two less heterogeneous settings. First, we limit the capability of all devices by setting $E$ to 1 (i.e., all devices run at most one local epoch) and impose systems heterogeneity in a similar way. We show training loss in Figure 9 and testing accuracy in Figure 10 in the appendix. Even in these settings, allowing for partial work can improve convergence compared with FedAvg. Second, we explore a setting without any statistical heterogeneity, using an identically distributed synthetic dataset (Synthetic IID). In this IID setting, as shown in Figure 5 in Appendix C.3.2, FedAvg is rather robust under device failure, and tolerating variable amounts of local work may not lead to major improvements. This serves as additional motivation to rigorously study the effect of statistical heterogeneity on new methods designed for federated learning, as simply relying on IID data (a setting unlikely to occur in practice) may not tell a complete story.
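The local solver used for FedProx throughout these experiments is SGD applied to the proximal subproblem $\min_w F_k(w) + \frac{\mu}{2}\|w - w^t\|^2$, followed by sample-weighted averaging at the server. A minimal sketch, where the function names, hyperparameter values, and the `numpy`-based setting are illustrative assumptions rather than the paper's released implementation:

```python
import numpy as np

def local_fedprox_sgd(grad_fk, w_global, mu=0.1, lr=0.01, num_steps=100):
    """One device's (inexact) solve of the FedProx subproblem
        min_w  F_k(w) + (mu/2) * ||w - w_global||^2
    via plain SGD; grad_fk(w) returns a (stochastic) gradient of F_k."""
    w = w_global.copy()
    for _ in range(num_steps):
        # Gradient of the proximal objective: grad F_k(w) + mu * (w - w_global).
        w -= lr * (grad_fk(w) + mu * (w - w_global))
    return w

def aggregate(updates, num_samples):
    """Server step: average device solutions with weights proportional
    to each device's local sample count."""
    weights = np.asarray(num_samples, dtype=float)
    weights /= weights.sum()
    return sum(a * wk for a, wk in zip(weights, updates))
```

Setting `mu=0.0` recovers the FedAvg local update, which is how the $\mu = 0$ baselines above correspond to FedAvg.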
We also report testing accuracy in Figure 7, Appendix C.3.2, and show that FedProx improves the test accuracy on all datasets.

5.3 Statistical Heterogeneity: Proximal Term

To better understand how the proximal term can be beneficial in heterogeneous settings, we first show that convergence can become worse as statistical heterogeneity increases.

5.3.1 Effects of Statistical Heterogeneity

In Figure 2 (the first row), we study how statistical heterogeneity affects convergence using four synthetic datasets without the presence of systems heterogeneity (fixing $E$ to be 20). From left to right, as the data become more heterogeneous, convergence becomes worse for FedProx with $\mu = 0$ (i.e., FedAvg). Though it may slow convergence for IID data, we see that setting $\mu > 0$ is particularly useful in heterogeneous settings. This indicates that the modified subproblem introduced in FedProx can benefit practical federated settings with varying statistical heterogeneity. For perfectly IID data, some heuristics, such as decreasing $\mu$ if the loss continues to decrease, may help avoid the deceleration of convergence (see Figure 11 in Appendix C.3.3). In the sections to follow, we see similar results in our non-synthetic experiments.

5.3.2 Effects of $\mu > 0$

The key parameters of FedProx that affect performance are the amount of local work (as parameterized by the number of local epochs, $E$) and the proximal term scaled by $\mu$. Intuitively, a large $E$ may cause local models to drift too far away from the initial starting point, leading to potential divergence (McMahan et al., 2017). Therefore, to handle the divergence or instability of FedAvg with non-IID data, it is helpful to tune $E$ carefully. However, $E$ is constrained by the underlying systems environments on the devices, and it is difficult to determine an appropriate uniform $E$ for all devices. Alternatively, it is beneficial to allow for device-specific $E$'s (variable $\gamma$'s) and to tune a best $\mu$ (a parameter that can be viewed as a re-parameterization of $E$) to prevent divergence and improve the stability of methods. A proper $\mu$ can restrict the trajectory of the iterates by constraining them to be closer to the global model, thus incorporating variable amounts of updates and guaranteeing convergence (Theorem 6).

We show the effects of the proximal term in FedProx ($\mu > 0$) in Figure 1. For each experiment, we compare the results of FedProx with $\mu = 0$ and FedProx with a best $\mu$ (see the discussion of choosing $\mu$ below). For all datasets, we observe that an appropriate $\mu$ can increase the stability of unstable methods and can force divergent methods to converge. This holds both when there is systems heterogeneity (50% and 90% stragglers) and when there is no systems heterogeneity (0% stragglers). Setting $\mu > 0$ also increases the accuracy in most cases (see Figure 6 and Figure 7 in Appendix C.3.2). In particular, FedProx improves absolute testing accuracy relative to FedAvg by 22% on average in highly heterogeneous environments (90% stragglers) (see Figure 7).

Figure 2. Effect of data heterogeneity on convergence. We remove the effects of systems heterogeneity by forcing each device to run the same number of epochs. In this setting, FedProx with $\mu = 0$ reduces to FedAvg. (1) Top row: We show training loss (see results on testing accuracy in Appendix C.3, Figure 6) on four synthetic datasets (Synthetic-IID, Synthetic (0,0), Synthetic (0.5,0.5), and Synthetic (1,1)) whose statistical heterogeneity increases from left to right. Note that the method with $\mu = 0$ corresponds to FedAvg. Increasing heterogeneity leads to worse convergence, but setting $\mu > 0$ can help to combat this. (2) Bottom row: We show the corresponding dissimilarity measurement (variance of local gradients) on the four synthetic datasets. This metric captures statistical heterogeneity and is consistent with training loss: smaller dissimilarity indicates better convergence.

5.3.3 Dissimilarity Measurement and Divergence

Finally, in Figure 2 (the bottom row), we demonstrate that our B-local dissimilarity measurement in Definition 3 captures the heterogeneity of datasets and is therefore an appropriate proxy for performance. In particular, we track the variance of the gradients on each device, $\mathbb{E}_k\big[\|\nabla F_k(w) - \nabla f(w)\|^2\big]$, which is lower bounded by $B_\epsilon$ (see the Bounded Variance Equivalence Corollary 10). Empirically, we observe that increasing $\mu$ leads to smaller dissimilarity among the local functions $F_k$, and that the dissimilarity metric is consistent with the training loss. Therefore, smaller dissimilarity indicates better convergence, which can be enforced by setting $\mu$ appropriately. We also show the dissimilarity metric on real federated data in Appendix C.3.2.

Choosing $\mu$. One natural question is how to set the penalty constant $\mu$ in the proximal term. A large $\mu$ may potentially slow convergence by forcing the updates to be close to the starting point, while a small $\mu$ may not make any difference. In all experiments, we tune the best $\mu$ from the limited candidate set $\{0.001, 0.01, 0.1, 1\}$. For the five federated datasets in Figure 1, the best $\mu$ values are 1, 1, 1, 0.001, and 0.01, respectively. While automatically tuning $\mu$ is difficult to instantiate directly from our theoretical results, in practice $\mu$ can be chosen adaptively based on the current performance of the model. For example, one simple heuristic is to increase $\mu$ when the loss increases and decrease $\mu$ when the loss decreases. In Figure 3, we demonstrate the effectiveness of this heuristic using two synthetic datasets. Note that we start from initial $\mu$ values that are adversarial to our methods. We provide full results showing the competitive performance of this approach in Appendix C.3.3.

Figure 3. Effectiveness of setting $\mu$ adaptively based on the current model performance, shown on Synthetic-IID and Synthetic (1,1). We increase $\mu$ by 0.1 whenever the loss increases, and decrease it by 0.1 whenever the loss decreases for 5 consecutive rounds. We initialize $\mu$ to 1 for Synthetic IID (in order to be adversarial to our methods) and to 0 for Synthetic (1,1). This simple heuristic works well empirically.

6 CONCLUSION

In this work, we have proposed FedProx, an optimization framework that tackles the systems and statistical heterogeneity inherent in federated networks. FedProx allows for variable amounts of work to be performed locally across devices and relies on a proximal term to help stabilize the method. We provide convergence guarantees for FedProx in realistic federated settings under a device dissimilarity assumption, while also accounting for practical issues such as stragglers. Our empirical evaluation across a suite of federated datasets has validated our theoretical analysis and demonstrated that the FedProx framework can significantly improve the convergence behavior of federated learning in realistic heterogeneous networks.
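The adaptive-$\mu$ heuristic evaluated in Figure 3 (raise $\mu$ by 0.1 when the loss rises, lower it by 0.1 after 5 consecutive rounds of decreasing loss) can be sketched as follows. The class name, the floor at zero, and the exact bookkeeping are illustrative assumptions, not the paper's released implementation:

```python
class AdaptiveMu:
    """Adjust the FedProx penalty mu from the observed global training loss,
    mirroring the simple heuristic evaluated in Figure 3 (illustrative sketch)."""

    def __init__(self, mu_init=1.0, step=0.1, patience=5):
        self.mu = mu_init
        self.step = step
        self.patience = patience      # consecutive decreases required to relax mu
        self.decrease_streak = 0
        self.prev_loss = None

    def update(self, loss):
        if self.prev_loss is not None:
            if loss > self.prev_loss:
                # Loss went up: strengthen the proximal term.
                self.mu += self.step
                self.decrease_streak = 0
            else:
                self.decrease_streak += 1
                if self.decrease_streak >= self.patience:
                    # Loss fell for `patience` rounds in a row: relax the proximal term.
                    self.mu = max(0.0, self.mu - self.step)
                    self.decrease_streak = 0
        self.prev_loss = loss
        return self.mu
```

The server would call `update` once per communication round with the current global training loss and pass the returned $\mu$ to the devices' local subproblems for the next round.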
Future work includes developing methods to automatically tune the penalty parameter $\mu$ for heterogeneous datasets, based, e.g., on the theoretical groundwork provided here.

ACKNOWLEDGEMENTS

We thank Sebastian Caldas, Jakub Konečný, Brendan McMahan, Nathan Srebro, and Jianyu Wang for their helpful discussions. AT and VS are supported in part by DARPA FA875017C0141, National Science Foundation grants IIS1705121 and IIS1838017, an Okawa Grant, a Google Faculty Award, an Amazon Web Services Award, a JP Morgan A.I. Research Faculty Award, a Carnegie Bosch Institute Research Award, and the CONIX Research Center, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, the National Science Foundation, or any other funding agency.

REFERENCES

Tensorflow federated: Machine learning on decentralized data. URL https://www.tensorflow.org/federated.

Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 2012.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 2013.

Go, A., Bhayani, R., and Huang, L. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 2009.

Goldblum, M., Reich, S., Fowl, L., Ni, R., Cherepanova, V., and Goldstein, T. Unraveling meta-learning: Understanding feature representations for few-shot tasks. arXiv preprint arXiv:2002.06753, 2020.

Hao, Y., Rong, J., and Sen, Y. On the linear speedup analysis
of communication-efficient momentum SGD for distributed non-convex optimization. In International Conference on Machine Learning, 2019.

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M. K., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. Tensorflow: A system for large-scale machine learning. In Operating Systems Design and Implementation, 2016.

Allen-Zhu, Z. How to make the gradients small stochastically: Even faster convex and nonconvex SGD. In Advances in Neural Information Processing Systems, 2018.

Bonawitz, K., Eichner, H., Grieskamp, W., Huba, D., Ingerman, A., Ivanov, V., Kiddon, C., Konečný, J., Mazzocchi, S., McMahan, H. B., Overveldt, T. V., Petrou, D., Ramage, D., and Roselander, J. Towards federated learning at scale: System design. In Conference on Machine Learning and Systems, 2019.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 2010.

Caldas, S., Wu, P., Li, T., Konečný, J., McMahan, H. B., Smith, V., and Talwalkar, A. Leaf: A benchmark for federated settings. arXiv preprint arXiv:1812.01097, 2018.

Cohen, G., Afshar, S., Tapson, J., and van Schaik, A. EMNIST: An extension of MNIST to handwritten letters. arXiv preprint arXiv:1702.05373, 2017.

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Le, Q. V., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., and Ng, A. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, 2012.

Huang, L., Yin, Y., Fu, Z., Zhang, S., Deng, H., and Liu, D. LoAdaBoost: Loss-based AdaBoost federated machine learning on medical data. arXiv preprint arXiv:1811.12629, 2018.

Jeong, E., Oh, S., Kim, H., Park, J., Bennis, M., and Kim, S.-L. Communication-efficient on-device machine learning: Federated distillation and augmentation under non-IID private data. arXiv preprint arXiv:1811.11479, 2018.

Jiang, P. and Agrawal, G. A linear speedup analysis of distributed deep learning with sparse and quantized communication. In Advances in Neural Information Processing Systems, 2018.

Kaczmarz, S. Approximate solution of systems of linear equations. International Journal of Control, 1993.

Khodak, M., Balcan, M.-F. F., and Talwalkar, A. S. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, 2019.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

Li, M., Andersen, D. G., Smola, A. J., and Yu, K. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, 2014a.

Li, M., Zhang, T., Chen, Y., and Smola, A. J. Efficient mini-batch training for stochastic optimization. In Conference on Knowledge Discovery and Data Mining, 2014b.

Li, T., Sahu, A., Talwalkar, A., and Smith, V. Federated learning: Challenges, methods, and future directions. arXiv preprint arXiv:1908.07873, 2019.

Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. FedDANE: A federated Newton-type method. arXiv preprint arXiv:2001.01920, 2020.

Lin, T., Stich, S. U., and Jaggi, M. Don't use large mini-batches, use local SGD. In International Conference on Learning Representations, 2020.

McMahan, H. B., Moore, E., Ramage, D., Hampson, S., and Arcas, B. A. y. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics, 2017.

Pennington, J., Socher, R., and Manning, C. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing, 2014.

Reddi, S. J., Konečný, J., Richtárik, P., Póczos, B., and Smola, A. AIDE: Fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879, 2016.

Richtárik, P. and Takáč, M. Distributed coordinate descent method for learning with big data. Journal of Machine Learning Research, 2016.

Schmidt, M. and Roux, N. L. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370, 2013.

Shamir, O., Srebro, N., and Zhang, T. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning, 2014.

Smith, V., Chiang, C.-K., Sanjabi, M., and Talwalkar, A. S. Federated multi-task learning. In Advances in Neural Information Processing Systems, 2017.

Smith, V., Forte, S., Ma, C., Takáč, M., Jordan, M. I., and Jaggi, M. CoCoA: A general framework for communication-efficient distributed optimization. Journal of Machine Learning Research, 2018.

Stich, S. U. Local SGD converges fast and communicates little. In International Conference on Learning Representations, 2019.

Strohmer, T. and Vershynin, R. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 2009.

Wang, J. and Joshi, G. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint arXiv:1808.07576, 2018.

Wang, S., Tuor, T., Salonidis, T., Leung, K. K., Makaya, C., He, T., and Chan, K. Adaptive federated learning in resource constrained edge computing systems. IEEE Journal on Selected Areas in Communications, 2019.

Woodworth, B. E., Wang, J., Smith, A., McMahan, B., and Srebro, N. Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. In Advances in Neural Information Processing Systems, 2018.

Yao, Y., Rosasco, L., and Caponnetto, A. On early stopping in gradient descent learning. Constructive Approximation, 2007.

Yin, D., Pananjady, A., Lam, M., Papailiopoulos, D., Ramchandran, K., and Bartlett, P. Gradient diversity: A key ingredient for scalable distributed learning. In International Conference on Artificial Intelligence and Statistics, 2018.

Yu, H., Yang, S., and Zhu, S. Parallel restarted SGD for non-convex optimization with faster convergence and less communication. In AAAI Conference on Artificial Intelligence, 2018.

Zhang, S., Choromanska, A. E., and LeCun, Y. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems, 2015.

Zhang, Y., Duchi, J. C., and Wainwright, M. J. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 2013.

Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., and Chandra, V. Federated learning with non-IID data. arXiv preprint arXiv:1806.00582, 2018.

Zhou, F. and Cong, G. On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization. In International Joint Conference on Artificial Intelligence, 2018.

Zhou, P., Yuan, X., Xu, H., Yan, S., and Feng, J. Efficient meta learning via minibatch proximal update. In Advances in Neural Information Processing Systems, 2019.

Vaswani, S., Bach, F., and Schmidt, M. Fast and faster convergence of SGD for over-parameterized models (and an accelerated perceptron).
In International Conference on Artificial Intelligence and Statistics, 2019.

A COMPLETE PROOFS

A.1 Proof of Theorem 4

Proof. Using our notion of $\gamma$-inexactness for each local solver (Definition 1), we can define $e_k^{t+1}$ such that:
\[
\nabla F_k(w_k^{t+1}) + \mu (w_k^{t+1} - w^t) - e_k^{t+1} = 0, \qquad \|e_k^{t+1}\| \le \gamma \|\nabla F_k(w^t)\|. \tag{3}
\]
Now let us define $\bar{w}^{t+1} = \mathbb{E}_k\big[w_k^{t+1}\big]$. Based on this definition, we know
\[
\bar{w}^{t+1} - w^t = -\frac{1}{\mu}\,\mathbb{E}_k\big[\nabla F_k(w_k^{t+1})\big] + \frac{1}{\mu}\,\mathbb{E}_k\big[e_k^{t+1}\big]. \tag{4}
\]
Let us define $\bar{\mu} = \mu - L_- > 0$ and $\hat{w}_k^{t+1} = \arg\min_w h_k(w; w^t)$. Then, due to the $\bar{\mu}$-strong convexity of $h_k$, we have
\[
\|\hat{w}_k^{t+1} - w_k^{t+1}\| \le \frac{\gamma}{\bar{\mu}} \|\nabla F_k(w^t)\|. \tag{5}
\]
Note that once again, due to the $\bar{\mu}$-strong convexity of $h_k$, we know that $\|\hat{w}_k^{t+1} - w^t\| \le \frac{1}{\bar{\mu}}\|\nabla F_k(w^t)\|$. Now we can use the triangle inequality to get
\[
\|w_k^{t+1} - w^t\| \le \frac{1+\gamma}{\bar{\mu}} \|\nabla F_k(w^t)\|. \tag{6}
\]
Therefore,
\[
\|\bar{w}^{t+1} - w^t\| \le \mathbb{E}_k\big[\|w_k^{t+1} - w^t\|\big] \le \frac{1+\gamma}{\bar{\mu}}\,\mathbb{E}_k\big[\|\nabla F_k(w^t)\|\big] \le \frac{1+\gamma}{\bar{\mu}} \sqrt{\mathbb{E}_k\big[\|\nabla F_k(w^t)\|^2\big]} \le \frac{B(1+\gamma)}{\bar{\mu}} \|\nabla f(w^t)\|,
\]
where the last inequality is due to the bounded dissimilarity assumption.

Now let us define $M_{t+1}$ such that $\bar{w}^{t+1} - w^t = -\frac{1}{\mu}\big(\nabla f(w^t) + M_{t+1}\big)$, i.e., $M_{t+1} = \mathbb{E}_k\big[\nabla F_k(w_k^{t+1}) - \nabla F_k(w^t) - e_k^{t+1}\big]$. We can bound $\|M_{t+1}\|$:
\[
\|M_{t+1}\| \le \mathbb{E}_k\big[L \|w_k^{t+1} - w^t\| + \|e_k^{t+1}\|\big] \le \left(\frac{L(1+\gamma)}{\bar{\mu}} + \gamma\right) \mathbb{E}_k\big[\|\nabla F_k(w^t)\|\big] \le \left(\frac{L(1+\gamma)}{\bar{\mu}} + \gamma\right) B \|\nabla f(w^t)\|, \tag{7}
\]
where the last inequality is also due to the bounded dissimilarity assumption. Based on the $L$-Lipschitz smoothness of $f$ and a Taylor expansion, we have
\[
\begin{aligned}
f(\bar{w}^{t+1}) &\le f(w^t) + \langle \nabla f(w^t), \bar{w}^{t+1} - w^t \rangle + \frac{L}{2} \|\bar{w}^{t+1} - w^t\|^2 \\
&\le f(w^t) - \frac{1}{\mu}\|\nabla f(w^t)\|^2 - \frac{1}{\mu}\langle \nabla f(w^t), M_{t+1} \rangle + \frac{L(1+\gamma)^2 B^2}{2\bar{\mu}^2}\|\nabla f(w^t)\|^2 \\
&\le f(w^t) - \left(\frac{1-\gamma B}{\mu} - \frac{LB(1+\gamma)}{\mu\bar{\mu}} - \frac{L(1+\gamma)^2 B^2}{2\bar{\mu}^2}\right) \|\nabla f(w^t)\|^2.
\end{aligned} \tag{8}
\]
From the above inequality it follows that if we set the penalty parameter $\mu$ large enough, we can get a decrease in the objective value of $f(\bar{w}^{t+1}) - f(w^t)$ which is proportional to $\|\nabla f(w^t)\|^2$. However, this is not the way the algorithm works. In the algorithm, we only use $K$ devices that are chosen randomly to approximate $\bar{w}^{t+1}$. So, in order to bound $\mathbb{E}\big[f(w^{t+1})\big]$, we use the local Lipschitz continuity of the function $f$:
\[
f(w^{t+1}) \le f(\bar{w}^{t+1}) + L_0 \|w^{t+1} - \bar{w}^{t+1}\|, \tag{9}
\]
where $L_0$ is the local Lipschitz continuity constant of $f$, and we have
\[
L_0 \le \|\nabla f(w^t)\| + L \max\big(\|\bar{w}^{t+1} - w^t\|, \|w^{t+1} - w^t\|\big) \le \|\nabla f(w^t)\| + L\big(\|\bar{w}^{t+1} - w^t\| + \|w^{t+1} - w^t\|\big).
\]
Therefore, if we take the expectation with respect to the choice of devices in round $t$, we need to bound
\[
\mathbb{E}_{S_t}\big[f(w^{t+1})\big] \le f(\bar{w}^{t+1}) + Q_t, \tag{10}
\]
where $Q_t = \mathbb{E}_{S_t}\big[L_0 \|w^{t+1} - \bar{w}^{t+1}\|\big]$. Note that the expectation is taken over the random choice of devices to update.
\[
\begin{aligned}
Q_t &\le \mathbb{E}_{S_t}\Big[\big(\|\nabla f(w^t)\| + L(\|\bar{w}^{t+1} - w^t\| + \|w^{t+1} - w^t\|)\big) \|w^{t+1} - \bar{w}^{t+1}\|\Big] \\
&\le \big(\|\nabla f(w^t)\| + L\|\bar{w}^{t+1} - w^t\|\big)\,\mathbb{E}_{S_t}\big[\|w^{t+1} - \bar{w}^{t+1}\|\big] + L\,\mathbb{E}_{S_t}\big[\|w^{t+1} - w^t\| \cdot \|w^{t+1} - \bar{w}^{t+1}\|\big] \\
&\le \big(\|\nabla f(w^t)\| + 2L\|\bar{w}^{t+1} - w^t\|\big)\,\mathbb{E}_{S_t}\big[\|w^{t+1} - \bar{w}^{t+1}\|\big] + L\,\mathbb{E}_{S_t}\big[\|w^{t+1} - \bar{w}^{t+1}\|^2\big].
\end{aligned} \tag{11}
\]
From the bound above, we have that $\|\bar{w}^{t+1} - w^t\| \le \frac{B(1+\gamma)}{\bar{\mu}}\|\nabla f(w^t)\|$. Moreover,
\[
\mathbb{E}_{S_t}\big[\|w^{t+1} - \bar{w}^{t+1}\|\big] \le \sqrt{\mathbb{E}_{S_t}\big[\|w^{t+1} - \bar{w}^{t+1}\|^2\big]} \tag{12}
\]
and
\[
\begin{aligned}
\mathbb{E}_{S_t}\big[\|w^{t+1} - \bar{w}^{t+1}\|^2\big] &\le \frac{1}{K}\,\mathbb{E}_k\big[\|w_k^{t+1} - \bar{w}^{t+1}\|^2\big] \\
&\le \frac{2}{K}\,\mathbb{E}_k\big[\|w_k^{t+1} - w^t\|^2\big] \quad \big(\text{as } \bar{w}^{t+1} = \mathbb{E}_k\big[w_k^{t+1}\big]\big) \\
&\le \frac{2}{K}\,\frac{(1+\gamma)^2}{\bar{\mu}^2}\,\mathbb{E}_k\big[\|\nabla F_k(w^t)\|^2\big] \quad (\text{from } (6)) \\
&\le \frac{2B^2(1+\gamma)^2}{K\bar{\mu}^2}\,\|\nabla f(w^t)\|^2,
\end{aligned} \tag{13}
\]
where the first inequality is a result of the $K$ devices being chosen randomly to get $w^{t+1}$, and the last inequality is due to the bounded dissimilarity assumption. If we replace these bounds in (11) we get
\[
Q_t \le \left(\frac{B(1+\gamma)}{\bar{\mu}}\sqrt{\frac{2}{K}} + \frac{LB^2(1+\gamma)^2}{\bar{\mu}^2 K}\big(2\sqrt{2K} + 2\big)\right) \|\nabla f(w^t)\|^2. \tag{14}
\]
Combining (8), (9), (10) and (14), we get
\[
\mathbb{E}_{S_t}\big[f(w^{t+1})\big] \le f(w^t) - \left(\frac{1}{\mu} - \frac{\gamma B}{\mu} - \frac{B(1+\gamma)\sqrt{2}}{\bar{\mu}\sqrt{K}} - \frac{LB(1+\gamma)}{\mu\bar{\mu}} - \frac{L(1+\gamma)^2 B^2}{2\bar{\mu}^2} - \frac{LB^2(1+\gamma)^2}{\bar{\mu}^2 K}\big(2\sqrt{2K} + 2\big)\right) \|\nabla f(w^t)\|^2.
\]

A.2 Proof for Bounded Variance

Corollary 10 (Bounded variance equivalence). Let Assumption 1 hold. Then, in the case of bounded variance, i.e., $\mathbb{E}_k\big[\|\nabla F_k(w) - \nabla f(w)\|^2\big] \le \sigma^2$, for any $\epsilon > 0$ it follows that $B_\epsilon \le \sqrt{1 + \frac{\sigma^2}{\epsilon}}$.

Proof. We have
\[
\begin{aligned}
&\mathbb{E}_k\big[\|\nabla F_k(w) - \nabla f(w)\|^2\big] = \mathbb{E}_k\big[\|\nabla F_k(w)\|^2\big] - \|\nabla f(w)\|^2 \le \sigma^2 \\
&\Rightarrow \mathbb{E}_k\big[\|\nabla F_k(w)\|^2\big] \le \sigma^2 + \|\nabla f(w)\|^2 \\
&\Rightarrow B_\epsilon = \sqrt{\frac{\mathbb{E}_k\big[\|\nabla F_k(w)\|^2\big]}{\|\nabla f(w)\|^2}} \le \sqrt{1 + \frac{\sigma^2}{\epsilon}}.
\end{aligned}
\]

With Corollary 10 in place, we can restate the main result in Theorem 4 in terms of the bounded variance assumption.

Theorem 11 (Non-convex FedProx convergence: Bounded variance). Let the assertions of Theorem 4 hold. In addition, let the iterate $w^t$ be such that $\|\nabla f(w^t)\|^2 \ge \epsilon$, and let $\mathbb{E}_k\big[\|\nabla F_k(w) - \nabla f(w)\|^2\big] \le \sigma^2$ hold instead of the dissimilarity condition. If $\mu$, $K$ and $\gamma$ in Algorithm 2 are chosen such that
\[
\rho = \frac{1}{\mu} - \left(\frac{\gamma}{\mu} + \frac{(1+\gamma)\sqrt{2}}{\bar{\mu}\sqrt{K}} + \frac{L(1+\gamma)}{\mu\bar{\mu}}\right)\sqrt{1 + \frac{\sigma^2}{\epsilon}} - \left(\frac{L(1+\gamma)^2}{2\bar{\mu}^2} + \frac{L(1+\gamma)^2}{\bar{\mu}^2 K}\big(2\sqrt{2K} + 2\big)\right)\left(1 + \frac{\sigma^2}{\epsilon}\right) > 0,
\]
then at iteration $t$ of Algorithm 2, we have the following expected decrease in the global objective:
\[
\mathbb{E}_{S_t}\big[f(w^{t+1})\big] \le f(w^t) - \rho \|\nabla f(w^t)\|^2,
\]
where $S_t$ is the set of $K$ devices chosen at iteration $t$.

The proof of Theorem 11 follows from the proof of Theorem 4 by noting the relationship between the bounded variance assumption and the dissimilarity assumption as portrayed by Corollary 10.

A.3 Proof of Corollary 7

In the convex case, where $L_- = 0$ and $\bar{\mu} = \mu$, if $\gamma = 0$, i.e., all subproblems are solved accurately, we can get a decrease proportional to $\|\nabla f(w^t)\|^2$ if $B < \sqrt{K}$. In such a case, if we assume $1 \ll B \le 0.5\sqrt{K}$, then we can write
\[
\mathbb{E}_{S_t}\big[f(w^{t+1})\big] \lessapprox f(w^t) - \frac{1}{2\mu}\|\nabla f(w^t)\|^2 + \frac{3LB^2}{2\mu^2}\|\nabla f(w^t)\|^2. \tag{15}
\]
In this case, if we choose $\mu \approx 6LB^2$ we get
\[
\mathbb{E}_{S_t}\big[f(w^{t+1})\big] \lessapprox f(w^t) - \frac{1}{24LB^2}\|\nabla f(w^t)\|^2. \tag{16}
\]
Note that the expectation in (16) is a conditional expectation conditioned on the previous iterate.
Taking expectation of both sides and telescoping, we have that the number of iterations to generate at least one solution with squared norm of gradient less than $\epsilon$ is $O\big(\frac{LB^2\Delta}{\epsilon}\big)$.

B CONNECTIONS TO OTHER SINGLE-MACHINE AND DISTRIBUTED METHODS

Two aspects of the proposed work, the proximal term in FedProx and the bounded dissimilarity assumption used in our analysis, have been previously studied in the optimization literature, but with very different motivations. For completeness, we provide a discussion below on our relation to these prior works.

Proximal term. The proposed modified objective in FedProx shares a connection with elastic averaging SGD (EASGD) (Zhang et al., 2015), which was proposed as a way to train deep networks in the data center setting and uses a similar proximal term in its objective. While the intuition is similar to EASGD (this term helps to prevent large deviations on each device/machine), EASGD employs a more complex moving average to update parameters, is limited to using SGD as a local solver, and has only been analyzed for simple quadratic problems. The proximal term we introduce has also been explored in previous optimization literature with different purposes, such as in Allen-Zhu (2018) to speed up (mini-batch) SGD training on a single machine, and in Li et al. (2014b) for efficient SGD training in both single-machine and distributed settings. However, the analysis in Li et al. (2014b) is limited to a single-machine setting with different assumptions (e.g., IID data and solving the subproblem exactly at each round).

In addition, DANE (Shamir et al., 2014) and AIDE (Reddi et al., 2016), distributed methods designed for the data center setting, propose a similar proximal term in the local objective function, but also augment this with an additional gradient correction term.
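The proximal subproblem at the heart of this comparison, $h_k(w; w^t) = F_k(w) + \frac{\mu}{2}\|w - w^t\|^2$ (without any gradient correction term), can be sketched in a few lines of NumPy. The quadratic device loss $F_k$ below is a hypothetical stand-in, not part of the paper's benchmark suite:

```python
import numpy as np

def proximal_local_solve(grad_Fk, w_global, mu, lr=0.01, steps=200):
    """Gradient descent on the FedProx local subproblem
    h_k(w; w^t) = F_k(w) + (mu/2) * ||w - w^t||^2."""
    w = w_global.copy()
    for _ in range(steps):
        # grad h_k(w) = grad F_k(w) + mu * (w - w^t)
        w -= lr * (grad_Fk(w) + mu * (w - w_global))
    return w

# Hypothetical quadratic device loss F_k(w) = 0.5 * ||A w - b||^2.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
grad_Fk = lambda w: A.T @ (A @ w - b)

w_t = np.zeros(5)
w_no_prox = proximal_local_solve(grad_Fk, w_t, mu=0.0)
w_prox = proximal_local_solve(grad_Fk, w_t, mu=1.0)
# A larger mu keeps the local solution closer to the global iterate w^t.
assert np.linalg.norm(w_prox - w_t) < np.linalg.norm(w_no_prox - w_t)
```

The final assertion illustrates the role the proximal term plays throughout Appendix A: it limits how far a local update can drift from the current global model.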
Both methods assume that all devices participate at each communication round, which is impractical in federated settings. Indeed, due to the inexact estimation of full gradients (i.e., $\nabla\phi(w^{(t-1)})$ in Shamir et al. (2014, Eq. (13))) under device subsampling schemes, and the staleness of the gradient correction term (Shamir et al., 2014, Eq. (13)), these methods are not directly applicable to our setting. Regardless of this, we explore a variant of such an approach in federated settings and see that the gradient direction term does not help in this scenario, performing uniformly worse than the proposed FedProx framework for heterogeneous datasets, despite the extra computation required (see Figure 4). We refer interested readers to Li et al. (2020) for more detailed discussions.

Finally, we note that there is an interesting connection between meta-learning methods and federated optimization methods (Khodak et al., 2019), and similar proximal terms have recently been investigated in the context of meta-learning for improved performance on few-shot learning tasks (Goldblum et al., 2020; Zhou et al., 2019).

[Figure 4 plots: training loss vs. # rounds for FedProx and FedDane on Synthetic-IID, Synthetic (0,0), Synthetic (0.5,0.5), and Synthetic (1,1).]

Figure 4. DANE and AIDE (Shamir et al., 2014; Reddi et al., 2016) are methods proposed in the data center setting that use a similar proximal term as FedProx as well as an additional gradient correction term. We modify DANE to apply to federated settings by allowing for local updating and low participation of devices. We show the convergence of this modified method, which we call FedDane, on synthetic datasets. In the top figures, we subsample 10 devices out of 30 on all datasets for both FedProx and FedDane. While FedDane performs similarly to FedProx on the IID data, it suffers from poor convergence on the non-IID datasets. In the bottom figures, we show the results of FedDane when we increase the number of selected devices in order to narrow the gap between our estimated full gradient and the real full gradient (in the gradient correction term). Note that communicating with all (or most of the) devices is already unrealistic in practical settings. We observe that although sampling more devices per round might help to some extent, FedDane is still unstable and tends to diverge. This serves as additional motivation for the specific subproblem we propose in FedProx.

Bounded dissimilarity assumption. The bounded dissimilarity assumption we discuss in Assumption 1 has appeared in different forms, for example in Schmidt & Roux (2013); Yin et al. (2018); Vaswani et al. (2019). In Yin et al. (2018), the bounded dissimilarity assumption is used in the context of asserting gradient diversity and quantifying the benefit in terms of scaling of the mean squared error for mini-batch SGD on IID data. In Schmidt & Roux (2013); Vaswani et al.
(2019), the authors use a similar assumption, called the strong growth condition, which is a stronger version of Assumption 1 with $\epsilon = 0$. They prove that some interesting practical problems satisfy such a condition. They also use this assumption to prove optimal and faster convergence rates for SGD with constant step-sizes. Note that this is different from our approach, as the algorithm we are analyzing is not SGD, and our analysis is different in spite of the similarity in the assumptions.

C SIMULATION DETAILS AND ADDITIONAL EXPERIMENTS

C.1 Datasets and Models

Here we provide full details on the datasets and models used in our experiments. We curate a diverse set of non-synthetic datasets, including those used in prior work on federated learning (McMahan et al., 2017), and some proposed in LEAF, a benchmark for federated settings (Caldas et al., 2018). We also create synthetic data to directly test the effect of heterogeneity on convergence, as in Section 5.1.

• Synthetic: We set $(\alpha, \beta) = (0,0)$, $(0.5, 0.5)$ and $(1,1)$ respectively to generate three non-identically distributed datasets (Figure 2). In the IID data (Figure 5), we set the same $W, b \sim N(0,1)$ on all devices and let $X_k$ follow the same distribution $N(v, \Sigma)$, where each element in the mean vector $v$ is zero and $\Sigma$ is diagonal with $\Sigma_{j,j} = j^{-1.2}$. For all synthetic datasets, there are 30 devices in total and the number of samples on each device follows a power law.

• MNIST: We study image classification of handwritten digits 0-9 in MNIST (LeCun et al., 1998) using multinomial logistic regression. To simulate a heterogeneous setting, we distribute the data among 1000 devices such that each device has samples of only 2 digits and the number of samples per device follows a power law.
The input of the model is a flattened 784-dimensional (28 × 28) image, and the output is a class label between 0 and 9.

• FEMNIST: We study an image classification problem on the 62-class EMNIST dataset (Cohen et al., 2017) using multinomial logistic regression. To generate heterogeneous data partitions, we subsample 10 lower-case characters ('a'-'j') from EMNIST and distribute only 5 classes to each device. We call this federated version of EMNIST FEMNIST. There are 200 devices in total. The input of the model is a flattened 784-dimensional (28 × 28) image, and the output is a class label between 0 and 9.

• Shakespeare: This is a dataset built from The Complete Works of William Shakespeare (McMahan et al., 2017). Each speaking role in a play represents a different device. We use a two-layer LSTM classifier containing 100 hidden units with an 8D embedding layer. The task is next-character prediction, and there are 80 classes of characters in total. The model takes as input a sequence of 80 characters, embeds each of the characters into a learned 8-dimensional space, and outputs one character per training sample after 2 LSTM layers and a densely-connected layer.

• Sent140: In non-convex settings, we consider a text sentiment analysis task on tweets from Sentiment140 (Go et al., 2009) (Sent140) with a two-layer LSTM binary classifier containing 256 hidden units and pretrained 300D GloVe embeddings (Pennington et al., 2014). Each Twitter account corresponds to a device. The model takes as input a sequence of 25 words, embeds each word into a 300-dimensional space by looking up the GloVe table, and outputs a binary sentiment label after 2 LSTM layers and a densely-connected layer.

C.2 Implementation Details

(Implementation) In order to draw a fair comparison with FedAvg, we use SGD as a local solver for FedProx, and adopt a slightly different device sampling scheme than that in Algorithms 1 and 2: sampling devices uniformly and averaging updates with weights proportional to the number of local data points (as originally proposed in McMahan et al. (2017)). While this sampling scheme is not supported by our analysis, we observe similar relative behavior of FedProx vs. FedAvg whether or not it is employed (Figure 12). Interestingly, we also observe that the sampling scheme proposed herein results in more stable performance for both methods. This suggests an added benefit of the proposed framework.

(Machines) We simulate the federated learning setup (1 server and N devices) on a commodity machine with 2 Intel Xeon E5-2650 v4 CPUs and 8 NVIDIA 1080Ti GPUs.

(Hyperparameters) We randomly split the data on each local device into an 80% training set and a 20% testing set. We fix the number of selected devices per round to be 10 for all experiments on all datasets. We also do a grid search on the learning rate based on FedAvg. We do not decay the learning rate through all rounds. For all synthetic data experiments, the learning rate is 0.01. For MNIST, FEMNIST, Shakespeare, and Sent140, we use learning rates of 0.03, 0.003, 0.8, and 0.3. We use a batch size of 10 for all experiments.

(Libraries) All code is implemented in TensorFlow Version 1.10.1 (Abadi et al., 2016).
Please see github.com/litian96/FedProx for full details.

C.3 Additional Experiments and Full Results

C.3.1 Effects of Systems Heterogeneity on IID Data

We show the effects of allowing for partial work on perfectly IID synthetic data (Synthetic IID).

[Figure 5 plots: training loss (top) and testing accuracy (bottom) of FedAvg and FedProx (μ = 0) vs. # rounds on Synthetic IID with 0%, 10%, 50%, and 90% stragglers.]

Figure 5. FedAvg is robust to device failure with IID data. In this case, incorporating partial solutions from the stragglers does not have much effect on convergence.

C.3.2 Complete Results

In Figure 6, we present testing accuracy on four synthetic datasets associated with the experiments shown in Figure 2.

[Figure 6 plots: training loss, testing accuracy, and variance of local gradients vs. # rounds for FedAvg (FedProx, μ = 0) and FedProx (μ > 0) on Synthetic-IID, Synthetic (0,0), Synthetic (0.5,0.5), and Synthetic (1,1).]

Figure 6. Training loss, test accuracy, and dissimilarity measurement for the experiments described in Fig. 2.

In Figure 7, we show the testing accuracy associated with the experiments described in Figure 1. We calculate the accuracy improvement numbers by identifying the accuracies of FedProx and FedAvg when they have either converged, started to diverge, or run a sufficient number of rounds (e.g., 1000 rounds), whichever comes earlier. We consider the methods to converge when the loss difference in two consecutive rounds $|f_t - f_{t-1}|$ is smaller than 0.0001, and consider the methods to diverge when we see $f_t - f_{t-10}$ greater than 1.
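These convergence and divergence criteria translate directly into code; a small sketch, with the per-round loss history as a plain list and the thresholds taken from the text:

```python
def has_converged(losses, tol=1e-4):
    """Converged when |f_t - f_{t-1}| < tol for the latest round."""
    return len(losses) >= 2 and abs(losses[-1] - losses[-2]) < tol

def has_diverged(losses, window=10, jump=1.0):
    """Diverged when f_t - f_{t-window} > jump for the latest round."""
    return len(losses) > window and losses[-1] - losses[-1 - window] > jump

assert has_converged([0.51, 0.50001, 0.50002])
assert not has_converged([0.6, 0.5])
assert has_diverged([0.5] * 10 + [0.4, 1.6])
assert not has_diverged([0.5] * 12)
```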
[Figure 7 plots: testing accuracy vs. # rounds with 0%, 50%, and 90% stragglers.]

Figure 7. The testing accuracy of the experiments in Figure 1. FedProx achieves on average 22% improvement in terms of testing accuracy in highly heterogeneous settings (90% stragglers).

In Figure 8, we report the dissimilarity measurement on five datasets (including four real datasets) described in Figure 1. Again, the dissimilarity characterization is consistent with the real performance (the loss).

[Figure 8 plots: variance of local gradients vs. # rounds for FedAvg (FedProx, μ = 0) and FedProx (μ > 0) on Synthetic, MNIST, FEMNIST, Shakespeare, and Sent140.]

Figure 8. The dissimilarity metric on five datasets in Figure 1. We remove systems heterogeneity by only considering the case when no participating devices drop out of the network. Our dissimilarity assumption captures the data heterogeneity and is consistent with practical performance (see training loss in Figure 1).

In Figure 9 and Figure 10, we show the effects (both loss and testing accuracy) of allowing for partial solutions under systems heterogeneity when E = 1 (i.e., the statistical heterogeneity is less likely to affect convergence negatively).

[Figure 9 plots: training loss with 0%, 50%, and 90% stragglers.]

Figure 9. The loss of FedAvg and FedProx under various systems heterogeneity settings when each device can run at most 1 epoch at each iteration (E = 1). Since local updates will not deviate too much from the global model compared with the deviation under large E's, it is less likely that the statistical heterogeneity will affect convergence negatively.
Tolerating partial solutions sent to the central server (FedProx, μ = 0) still performs better than dropping the stragglers (FedAvg).

[Figure 10 plots: testing accuracy with 0%, 50%, and 90% stragglers.]

Figure 10. The testing accuracy of the experiments shown in Figure 9.

C.3.3 Adaptively Setting μ

One of the key parameters of FedProx is μ. We provide the complete results of a simple heuristic for adaptively setting μ on four synthetic datasets in Figure 11. For the IID dataset (Synthetic-IID), μ starts from 1, and for the other non-IID datasets, μ starts from 0. Such initialization is adversarial to our methods. We decrease μ by 0.1 when the loss continues to decrease for 5 rounds and increase μ by 0.1 when we see the loss increase. This heuristic allows for competitive performance. It could also alleviate the potential issue that μ > 0 might slow down convergence on IID data, which rarely occurs in real federated settings.

[Figure 11 plots: training loss vs. # rounds for FedAvg (FedProx, μ = 0), FedProx with dynamic μ, and FedProx (μ > 0) on the four synthetic datasets.]

Figure 11. Full results of choosing μ adaptively on all the synthetic datasets. We increase μ by 0.1 whenever the loss increases and decrease it by 0.1 whenever the loss decreases for 5 consecutive rounds. We initialize μ to 1 for the IID data (Synthetic-IID) (in order to be adversarial to our methods), and initialize it to 0 for the other three non-IID datasets.
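A minimal sketch of this adaptive heuristic, assuming the loss is recorded once per round (the exact window bookkeeping and the clamp at zero are our assumptions, not spelled out in the text):

```python
def adapt_mu(mu, loss_history, step=0.1):
    """One mu update per round: raise mu after a loss increase, lower it
    after 5 consecutive decreases (clamped at zero, an assumption)."""
    if len(loss_history) < 2:
        return mu
    if loss_history[-1] > loss_history[-2]:
        return mu + step
    last_six = loss_history[-6:]  # 5 consecutive decreases need 6 points
    if len(last_six) == 6 and all(a > b for a, b in zip(last_six, last_six[1:])):
        return max(0.0, mu - step)
    return mu

mu = 1.0  # adversarial initialization for the IID dataset, per the text
losses = [2.0, 1.9, 1.8, 1.7, 1.6, 1.5]
for t in range(2, len(losses) + 1):
    mu = adapt_mu(mu, losses[:t])
assert abs(mu - 0.9) < 1e-9  # lowered once, after 5 straight decreases
```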
We observe that this simple heuristic works well in practice.

C.3.4 Comparing Two Device Sampling Schemes

We show the training loss, testing accuracy, and dissimilarity measurement of FedProx on a set of synthetic data using two different device sampling schemes in Figure 12. Since our goal is to compare these two sampling schemes, we let each device perform a uniform amount of work (E = 20) for both methods.

[Figure 12 plots: training loss, testing accuracy, and variance of local gradients vs. # rounds on the four synthetic datasets, for μ = 0 and μ = 1 under uniform sampling with weighted averaging and under weighted sampling with simple averaging.]

Figure 12. Differences between the two sampling schemes in terms of training loss, testing accuracy, and dissimilarity measurement. Sampling devices with a probability proportional to the number of local data points and then simply averaging local models performs slightly better than uniformly sampling devices and averaging the local models with weights proportional to the number of local data points. Under either sampling scheme, the settings with μ = 1 demonstrate more stable performance than settings with μ = 0.
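The two sampling schemes compared in Figure 12 differ only in where the local data counts $n_k$ enter the aggregation. A NumPy sketch with mocked local updates (the with-replacement sampling in the second scheme and the specific sizes are assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
num_devices, K = 30, 10
n_k = rng.integers(10, 1000, size=num_devices)  # local data counts
# Mocked local model updates, one 5-dimensional vector per device.
updates = rng.normal(size=(num_devices, 5))

# Scheme 1: sample devices uniformly, then average their updates with
# weights proportional to the number of local data points.
idx1 = rng.choice(num_devices, size=K, replace=False)
w1 = np.average(updates[idx1], axis=0, weights=n_k[idx1])

# Scheme 2: sample devices with probability proportional to n_k
# (with replacement, an assumption of this sketch), then take a
# simple unweighted average.
p = n_k / n_k.sum()
idx2 = rng.choice(num_devices, size=K, replace=True, p=p)
w2 = updates[idx2].mean(axis=0)

assert w1.shape == w2.shape == (5,)
```

Both schemes produce an unbiased-style aggregate dominated by data-rich devices; they differ only in whether that weighting happens at sampling time or at averaging time.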