Distributed Stochastic Gradient Descent with Event-Triggered Communication



Abstract

We develop a Distributed Event-Triggered Stochastic GRAdient Descent (DETSGRAD) algorithm for solving non-convex optimization problems typically encountered in distributed deep learning. We propose a novel communication triggering mechanism that allows the networked agents to update their model parameters aperiodically, and we provide sufficient conditions on the algorithm step-sizes that guarantee asymptotic mean-square convergence. The algorithm is applied to a distributed supervised-learning problem, in which a set of networked agents collaboratively train their individual neural networks to perform image classification, while aperiodically sharing the model parameters with their one-hop neighbors. Results indicate that all agents report similar performance that is also comparable to the performance of a centrally trained neural network, while the event-triggered communication provides a significant reduction in inter-agent communication. Results also show that the proposed algorithm allows the individual agents to classify the images even though the training data corresponding to all the classes are not locally available to each agent.

Introduction

With the advent of smart devices, there has been an exponential growth in the amount of data collected and stored locally on individual devices. Applying machine learning to extract value from such massive data to provide data-driven insights, decisions, and predictions has been a popular research topic as well as the focus of numerous businesses. However, porting these vast amounts of data to a data center to conduct traditional machine learning has raised two main issues: (i) the communication challenge associated with transferring vast amounts of data from a large number of devices to a central location and (ii) the privacy issues associated with sharing raw data. Distributed machine learning techniques based on the server-client architecture (Li et al. 2014a; 2014b; Zhang, Alqahtani, and Demirbas 2017) have been proposed as solutions to this problem. On one extreme end of this architecture, we have the parameter server approach, where a server or group of servers initiate distributed learning by pushing the current model to a set of client nodes that host the data. The client nodes compute the local gradients or parameter updates and communicate them to the server nodes. Server nodes aggregate these values and update the current model (Zhang et al. 2018; Li et al. 2014b). On the other extreme, we have federated learning, where each client node obtains a local solution to the learning problem and the server node computes a global model by averaging the local models (Konečný et al. 2016; McMahan et al. 2017). Besides the server-client architecture, a shared-memory (multicore/multi-GPU) architecture, where different processors independently compute the gradients and update the global model parameter using a shared memory, has also been proposed as a solution to the distributed machine learning problem (Recht et al. 2011; De Sa et al. 2015; Chaturapruek, Duchi, and Ré 2015; Feyzmahdavian, Aytekin, and Johansson 2016). However, none of the above-mentioned learning techniques are truly distributed since they follow a master-slave architecture and do not involve any peer-to-peer communication. Furthermore, these techniques are not always robust and they are rendered useless if the master/server node or the shared memory fails. Therefore, we aim to develop a fully distributed machine learning architecture enabled by client-to-client interaction.

For large-scale machine learning, stochastic gradient descent (SGD) methods are often preferred over batch gradient methods (Bottou, Curtis, and Nocedal 2018) because (i) in many large-scale problems, there is a good deal of redundancy in data and therefore it is inefficient to use all the data in every optimization iteration, (ii) the computational cost involved in computing the batch gradient is much higher than that of the stochastic gradient, and (iii) stochastic methods are more suitable for online learning where data are arriving sequentially. Since most machine learning problems are non-convex, there is a need for distributed stochastic gradient methods for non-convex problems. Therefore, here we present a communication efficient, distributed stochastic gradient algorithm for non-convex problems and demonstrate its utility for distributed machine learning.
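To make the per-iteration cost difference concrete, the sketch below (ours, not from the paper; the model, loss function, and data-loader objects are placeholders) contrasts a single stochastic-gradient update, which touches only one mini-batch, with a full batch-gradient update, which must sweep the entire data set before the parameters move once.

```python
import torch

def sgd_step(model, loss_fn, batch, lr=0.01):
    """One stochastic gradient step: the gradient comes from a single mini-batch."""
    inputs, targets = batch
    model.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()                          # cost scales with the mini-batch size
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= lr * p.grad
    return loss.item()

def batch_gradient_step(model, loss_fn, full_loader, lr=0.01):
    """One batch gradient step: the gradient is accumulated over the whole data set."""
    model.zero_grad()
    n_batches = 0
    for inputs, targets in full_loader:      # cost scales with the data set size
        loss = loss_fn(model(inputs), targets)
        loss.backward()                      # gradients accumulate in p.grad
        n_batches += 1
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= lr * p.grad / n_batches
```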


Related work

Distributed Non-Convex Optimization: A few early examples of (non-stochastic or deterministic) distributed non-convex optimization algorithms include the Distributed Approximate Dual Subgradient (DADS) algorithm (Zhu and Martínez 2013), the NonconvEx primal-dual SpliTTing (NESTT) algorithm (Hajinezhad et al. 2016), and the Proximal Primal-Dual Algorithm (Prox-PDA) (Hong, Hajinezhad, and Zhao 2017). More recently, a non-convex version of the accelerated distributed augmented Lagrangians (ADAL) algorithm is presented in Chatzipanagiotis and Zavlanos (2017), and successive convex approximation (SCA)-based algorithms such as the iNner cOnVex Approximation (NOVA) and in-Network succEssive conveX approximaTion (NEXT) algorithms are given in Scutari, Facchinei, and Lampariello (2017) and Lorenzo and Scutari (2016), respectively. References (Hong 2018; Guo, Hug, and Tonguz 2017; Hong, Luo, and Razaviyayn 2016) provide several distributed alternating direction method of multipliers (ADMM) based non-convex optimization algorithms. Non-convex versions of Decentralized Gradient Descent (DGD) and Proximal Decentralized Gradient Descent (Prox-DGD) are given in Zeng and Yin (2018). Finally, Zeroth-Order NonconvEx (ZONE) optimization algorithms for mesh networks (ZONE-M) and star networks (ZONE-S) are presented in Hajinezhad, Hong, and Garcia (2019). However, almost all of the aforementioned consensus optimization algorithms focus on non-stochastic problems and are extremely communication heavy because they require constant communication among the agents.


Distributed Convex SGD: Within the consensus optimization literature, there exist several works on distributed stochastic gradient methods, but mainly for strongly convex optimization problems. These include the stochastic subgradient-push method for distributed optimization over time-varying directed graphs given in Nedić and Olshevsky (2016), distributed stochastic optimization over random networks given in Jakovetic et al. (2018), the Stochastic Unbiased Curvature-aided Gradient (SUCAG) method given in Wai et al. (2018), and the distributed stochastic gradient tracking methods of Pu and Nedić (2018). There are very few works on distributed stochastic gradient methods for non-convex optimization (Tatarenko and Touri 2017; Bianchi and Jakubowicz 2013); however, the push-sum algorithm given in Tatarenko and Touri (2017) assumes there are no saddle points, and it often requires up to 3 times as many internal variables as the proposed algorithm. Compared to Bianchi and Jakubowicz (2013) and Tatarenko and Touri (2017), the proposed algorithm provides an explicit consensus rate and allows the parallel execution of the consensus communication and gradient computation steps.

Parallel SGD: There exist numerous asynchronous SGD algorithms aimed at parallelizing data-intensive machine learning tasks. The two popular asynchronous parallel implementations of SGD are the computer network implementation originally proposed in Agarwal and Duchi (2011) and the shared memory implementation introduced in Recht et al. (2011). The computer network implementation follows the master-slave architecture, and Agarwal and Duchi (2011) showed that for smooth convex problems, the delays due to asynchrony are asymptotically negligible. Feyzmahdavian, Aytekin, and Johansson (2016) extend the results in Agarwal and Duchi (2011) to regularized SGD. Extensions of the computer network implementation of asynchronous SGD with variance reduction and polynomially growing delays are given in Huo and Huang (2016) and Zhou et al. (2018), respectively. Recht et al. (2011) proposed a lock-free asynchronous parallel implementation of SGD on a shared memory system and proved a sublinear convergence rate for strongly convex smooth objectives. The lock-free algorithm, HOGWILD!, proposed in Recht et al. (2011) has been applied to PageRank approximation (Mitliagkas et al. 2015), deep learning (Noel and Osindero 2014), and recommender systems (Yu et al. 2012).

In Duchi, Jordan, and McMahan (2013), the authors extended the HOGWILD! algorithm to a dual averaging algorithm that works for non-smooth, non-strongly convex problems with sparse gradients. An extension of HOGWILD! called BUCKWILD! is introduced in De Sa et al. (2015) to account for quantization errors introduced by fixed-point arithmetic. In Chaturapruek, Duchi, and Ré (2015), the authors show that because of the noise inherent to the sampling process within SGD, the errors introduced by asynchrony in the shared memory implementation are asymptotically negligible. Recently, several parallel SGD works focus on adjusting the worker-server interaction period or frequency as a way to decrease the communication overhead. For example, (Yu, Jin, and Yang 2019) and (Yu, Yang, and Zhu 2019) used a fixed period, while (Yu and Jin 2019) and (Lin, Stich, and Jaggi 2018) propose an increasing period as a way to reduce communication. A detailed comparison of both the computer network and shared memory implementations is given in Lian et al. (2015). Again, the aforementioned asynchronous algorithms are not distributed since they rely on a shared memory or central coordinator.


Decentralized SGD:

Recently, numerous decentralized SGD algorithms for non-convex optimization have been proposed as a solution to the communication bottleneck often encountered in the server-client architecture (Lian et al. 2017; Jiang et al. 2017; Tang et al. 2018; Lian et al. 2018; Wang and Joshi 2018; Haddadpour et al. 2019; Assran et al. 2019; Wang et al. 2019). However, almost all of these works primarily focus on the performance of the algorithm during a fixed time interval, and the constant algorithm step-size, which often depends on the final time, is selected to speed up the convergence rate. These SGD algorithms with constant step-size can only guarantee convergence to some ε-ball around a stationary point. Furthermore, most of the aforementioned decentralized SGD algorithms provide convergence rates in terms of the average of all local estimates of the global minimizer without ever proving a similar or faster consensus rate. In fact, most decentralized SGD algorithms can only provide bounded consensus, and they require a centralized averaging step after running the algorithm until the final time (Lian et al. 2017; Tang et al. 2018; Lian et al. 2018; Haddadpour et al. 2019; Wang et al. 2019). Finally, most applications of decentralized SGD focus on distributed learning scenarios where the data are distributed identically across all agents.


Contribution:

Currently, there exists no distributed SGD algorithm for non-convex problems that does not require constant or periodic communication among the agents. In fact, the algorithms in (Lian et al. 2017; Jiang et al. 2017; Tang et al. 2018; Lian et al. 2018; Wang and Joshi 2018; Haddadpour et al. 2019; Assran et al. 2019; Wang et al. 2019) all rely on periodic communication even when the local model has not changed from the previously communicated model. This is a waste of resources, especially in wireless settings, and therefore we propose an approach that allows a node to transmit only if its local model has significantly changed from the previously communicated model. The contributions of this paper are three-fold: (i) we propose a fully distributed machine learning architecture; (ii) we present a distributed SGD algorithm built on a novel communication triggering mechanism, and provide sufficient conditions on step-sizes such that the algorithm is mean-square convergent; and (iii) we demonstrate the efficacy of the proposed event-triggered SGD algorithm for distributed supervised learning with i.i.d. and, more importantly, non-i.i.d. data.
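The following is a minimal sketch of the send-on-change idea described above; it is our illustration, not the paper's triggering condition (which is given in equation (10)). The threshold, the choice of norm, and the `send_fn` callback are placeholders.

```python
import torch

def maybe_broadcast(w_current, w_last_broadcast, threshold, send_fn):
    """Broadcast only if the local model changed enough since the last broadcast.

    w_current / w_last_broadcast: flattened parameter tensors.
    threshold: trigger threshold at the current iteration (in the paper this is
               a decaying sequence tied to the step-size; here it is just a number).
    send_fn:   callback that transmits the parameters to the one-hop neighbors.
    """
    drift = torch.norm(w_current - w_last_broadcast, p=1)  # deviation from last sent model
    if drift >= threshold:
        send_fn(w_current)
        return w_current.clone(), True    # neighbors now hold w_current
    return w_last_broadcast, False        # stay silent; neighbors keep the old copy
```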


Distributed machine learning

Our problem formulation closely follows the centralized machine learning problem discussed in Bottou, Curtis, and Nocedal (2018). Consider a networked set of n agents, each with access only to its locally stored data.

The total expected risk across all networked agents is then the sum of the agents' local expected risks.

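Since the formulation equations appear only as images in the source, the following is a hedged reconstruction of the standard form such a setup takes (notation ours): each agent holds a local expected risk defined over its own data, and the network minimizes the sum.

```latex
% Standard distributed risk-minimization form (our reconstruction, notation ours):
% agent i has a local loss f_i(w; \xi_i) with data sample \xi_i drawn from its own distribution.
\begin{align}
  F_i(w) &= \mathbb{E}_{\xi_i}\!\left[ f_i(w;\xi_i) \right]
  && \text{(local expected risk of agent } i\text{)} \\
  \min_{w \in \mathbb{R}^d} \; F(w) &= \sum_{i=1}^{n} F_i(w)
  && \text{(total expected risk across all } n \text{ agents)}
\end{align}
```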

Distributed event-triggered SGD

[The DETSGRAD update rule, the event-triggering condition, and the step-size conditions, equations (6)-(10) referenced below, were rendered as images in the source and are not reproduced here.]
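Based only on the textual description of the algorithm (a mixing step using the neighbors' last-broadcast parameters, a local stochastic gradient step, and the event-triggered broadcast rule), one agent's iteration might look like the schematic sketch below. This is our sketch, not a reproduction of equations (6)-(10); the mixing weights `a`, step-size `alpha_k`, threshold `eps_k`, and the `agent.stochastic_gradient()` / `agent.broadcast()` helpers are all assumed placeholders.

```python
import torch

def detsgrad_iteration(agent, neighbor_models, a, alpha_k, eps_k):
    """One schematic DETSGRAD-style iteration for a single agent (illustrative only).

    agent.w:          current local parameters (flat tensor)
    agent.w_hat:      parameters this agent last broadcast
    neighbor_models:  dict {j: w_hat_j} of the one-hop neighbors' last-broadcast models
    a:                mixing weights over the agent and its neighbors (placeholder)
    alpha_k, eps_k:   step-size and trigger threshold at iteration k (placeholders)
    """
    # 1) Mixing step: combine the agent's own model with the models its
    #    neighbors most recently *broadcast* (not their true current models).
    mixed = a[agent.id] * agent.w
    for j, w_hat_j in neighbor_models.items():
        mixed = mixed + a[j] * w_hat_j

    # 2) Local stochastic gradient step on a freshly sampled mini-batch.
    g = agent.stochastic_gradient()       # flat tensor, same shape as agent.w
    agent.w = mixed - alpha_k * g

    # 3) Event-triggered broadcast: transmit only if the local model has
    #    drifted far enough from the last model this agent sent out.
    if torch.norm(agent.w - agent.w_hat, p=1) >= eps_k:
        agent.broadcast(agent.w)          # one-hop neighbors refresh their copy
        agent.w_hat = agent.w.clone()
```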

Convergence analysis

Our strategy for proving the convergence of the proposed distributed event-triggered SGD algorithm to a critical point is as follows. First, we show that the consensus error among the agents diminishes at the rate established in Theorem 1. Asymptotic convergence of the algorithm is then proved in Theorem 3. Theorem 4 then establishes that the weighted expected average gradient norm is a summable sequence. The convergence rate of the algorithm in the typical weak sense is given in Theorem 5. Finally, Theorem 6 proves the asymptotic mean-square convergence of the algorithm to a critical point.

Theorem 1. Consider the event-triggered SGD algorithm (6) under Assumptions 1-7. Then, there holds:

[The bound of Theorem 1 was rendered as an image in the source and is not reproduced here.]
Theorem 4 establishes results about the weighted sum of the expected average gradient norms, and the key takeaway from this result is that, for the distributed SGD in (8) or (6) with appropriate step-sizes, the expected average gradient norms cannot stay bounded away from zero (see Theorem 9 of Bottou, Curtis, and Nocedal (2018)).

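The statement of Theorem 4 is only available as an image in the source, but the standard form of this kind of result (following Theorem 9 of Bottou, Curtis, and Nocedal 2018) is a weighted summability bound of the type sketched below; the notation is ours and the statement is a paraphrase, not the paper's exact theorem.

```latex
% Generic form of a weighted-gradient-norm result (our paraphrase, not the paper's statement):
% with step-sizes \alpha_k satisfying \sum_k \alpha_k = \infty,
\begin{align}
  \sum_{k=0}^{\infty} \alpha_k \, \mathbb{E}\!\left[ \big\| \nabla F(\bar{w}_k) \big\|^2 \right] < \infty
  \quad \Longrightarrow \quad
  \liminf_{k \to \infty} \mathbb{E}\!\left[ \big\| \nabla F(\bar{w}_k) \big\|^2 \right] = 0 ,
\end{align}
% i.e., the expected gradient norms cannot remain bounded away from zero.
```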

Finally, we present the following result to illustrate that stronger convergence results follow from the continuity assumption on the Hessian, which has not been utilized in our analysis so far.

[The statement of this result was rendered as an image in the source and is not reproduced here.]
Similar to the centralized SGD (Bottou, Curtis, and Nocedal 2018), the analysis given here shows the mean-square convergence of the distributed algorithm to a critical point, which includes saddle points. Though SGD has been shown to escape saddle points efficiently (Lee et al. 2017; Fang, Lin, and Zhang 2019; Jin et al. 2019), extensions of such results to distributed SGD are currently nonexistent and are a topic for future research.


Application to distributed supervised learning

We apply the proposed algorithm to distributedly train neural network agents for an image classification task. We present extensive results on two different datasets: MNIST and CIFAR-10.

MNIST

The MNIST data set is a handwritten digit recognition data set containing 60000 grayscale images of 10 digits (0-9) for training, and 10000 images are used for testing. We distributedly train 10 agents that are connected in an undirected, unweighted ring topology. The 10-node ring was selected only because it is one of the least connected networks (besides the path) and MNIST contains 10 classes; the proposed algorithm works for any undirected graph as long as it is connected.
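The consensus weights for the ring are not spelled out in this section (they appear in the omitted equations), but a common choice for an undirected, unweighted ring, given here purely as an illustrative assumption, is a doubly stochastic mixing matrix such as the Metropolis weights below.

```python
import numpy as np

def ring_metropolis_weights(n):
    """Doubly stochastic mixing matrix for an undirected n-agent ring (illustrative choice).

    Every agent on a ring has degree 2, so the Metropolis rule assigns weight
    1/3 to each neighbor and keeps weight 1/3 for the agent itself.
    """
    W = np.zeros((n, n))
    for i in range(n):
        left, right = (i - 1) % n, (i + 1) % n   # one-hop neighbors on the ring
        W[i, left] = W[i, right] = 1.0 / 3.0
        W[i, i] = 1.0 / 3.0
    return W

W = ring_metropolis_weights(10)
assert np.allclose(W.sum(axis=0), 1.0) and np.allclose(W.sum(axis=1), 1.0)  # doubly stochastic
```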
Each agent aims to train its own neural network, which is a randomly initialized LeNet-5 (LeCun et al. 1998). During training, each agent broadcasts its weights to its neighbors at every iteration or aperiodically, as described in the proposed algorithm. Here we conduct the following five experiments: (i) Centralized SGD, where a centralized version of the SGD is implemented by a central node having access to all 60000 training images from all classes; (ii) Distributed SGD-r, where all the agents broadcast their respective weights at every iteration, and each agent has access to 6000 training images, randomly sampled from the entire training set, which forms the i.i.d. case; (iii) Distributed SGD-s, where all the agents broadcast their weights at every iteration, and each agent has access to the images corresponding to a single class, which forms the non-i.i.d. case; (iv) DETSGRAD-r, where the agents aperiodically broadcast their weights using the triggering mechanism in (10), and each agent has access to 6000 training images, randomly sampled from the entire training set, i.e., the i.i.d. case; (v) DETSGRAD-s, where the agents aperiodically broadcast their weights using the triggering mechanism in (10), and each agent has access to the images corresponding to a single class, i.e., the non-i.i.d. case. In the single-class case, for ease of programming, we set the number of training images available for each agent to 5421 (the minimum number of training images available in a single class, which is digit 5 in the MNIST data set). Here we select the algorithm step-sizes and event-triggering thresholds so that they satisfy the conditions required by the convergence analysis.

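A sketch of how the i.i.d. and non-i.i.d. (single-class) data assignments described above could be produced with torchvision is given below; this is our illustration, not the paper's data-loading code. In the single-class case each of the 10 agents is capped at 5421 images, as stated above.

```python
import torch
from torchvision import datasets, transforms

mnist = datasets.MNIST(root="./data", train=True, download=True,
                       transform=transforms.ToTensor())

def iid_split(dataset, n_agents=10, per_agent=6000, seed=0):
    """Each agent gets `per_agent` images sampled at random from the full training set."""
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(len(dataset), generator=g)
    return [torch.utils.data.Subset(dataset, perm[i * per_agent:(i + 1) * per_agent].tolist())
            for i in range(n_agents)]

def single_class_split(dataset, n_agents=10, per_agent=5421):
    """Agent i only sees images of digit i (non-i.i.d. case), capped at `per_agent` images."""
    labels = dataset.targets
    subsets = []
    for digit in range(n_agents):
        idx = (labels == digit).nonzero(as_tuple=True)[0][:per_agent]
        subsets.append(torch.utils.data.Subset(dataset, idx.tolist()))
    return subsets
```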
The plots of the empirical risk vs. the iterations (parameter update steps), illustrated in Figure 1, show the convergence of the proposed algorithm. The final test accuracies of the 10 agents after 40 training epochs using the different algorithms and training settings are shown in Table 1. The results obtained here indicate that regardless of how the data are distributed (random or single class), the agents are able to train their networks, and the distributedly trained networks yield performance similar to that of a centrally trained network. More importantly, in the single-class case, agents were able to recognize images from all 10 classes even though they had access to data corresponding only to a single class during the training phase. This result has numerous implications for the machine learning community, specifically for federated multi-task learning under information flow constraints.


The total numbers of event-triggered parameter broadcast events for the 10 agents using the DETSGRAD algorithm are shown in Table 2. In the random sampling case, by employing the broadcast event-triggering mechanism, we are able to reduce the inter-agent communications from 240000 to an average of 61702 over 40 epochs, leading to a reduction of 74.2% in network communications. In the single-class case, the agents broadcast the parameters continuously for the first 4 epochs, after which the event-trigger mechanism is started. Here, we are able to reduce the parameter broadcasts for each agent from 216840 to an average of 71933 over 40 epochs, leading to a reduction of 66.8% in network communications. Yet, as can be seen in Table 1, DETSGRAD gives classification performance similar to that of distributed SGD with continuous parameter sharing, with a significant reduction in network communications.

The fractions of the broadcast events for the 10 agents over 40 epochs are presented in Figure 2. As expected, the number of broadcast events reduces with the increase in epoch number as the agents converge to a critical point of the empirical risk function.
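As a quick consistency check on the communication savings reported above, using only the broadcast counts quoted in the text, the reduction percentages follow directly from the counts:

```python
# Reduction percentages implied by the MNIST broadcast counts reported above (40 epochs).
cases = {
    "DETSGRAD-r (random sampling)": (240000, 61702),   # per-iteration baseline vs. event-triggered average
    "DETSGRAD-s (single class)":    (216840, 71933),
}
for name, (baseline, triggered) in cases.items():
    reduction = 100.0 * (1.0 - triggered / baseline)
    # Prints roughly 74.3% and 66.8%, matching the figures reported above up to rounding.
    print(f"{name}: {reduction:.1f}% fewer broadcasts")
```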

CIFAR-10

The CIFAR-10 data set is an image classification data set containing 50000 color images of 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck) for training, and 10000 images are used for testing. We distributedly train 8 agents that are connected in an undirected, unweighted ring topology. Each agent trains its own neural network, which is a randomly initialized ResNet-20 (He et al. 2016; implementation from https://github.com/akamaster/pytorch_resnet_cifar10). We conducted the following three experiments: (i) Centralized SGD, where a centralized version of the SGD is implemented by a central node having access to all 50000 training images from all classes; (ii) Distributed SGD-r, where all the agents broadcast their respective weights at every iteration, and each agent has access to 6250 training images, randomly sampled from the entire training set; (iii) DETSGRAD-r, where the agents aperiodically broadcast their weights using the triggering mechanism in (10), and each agent has access to 6250 training images, randomly sampled from the entire training set.

The plots of the empirical risk vs. epochs, illustrated in Figure 3, show the convergence of the proposed algorithm. The final test accuracies of the 8 agents after 200 training epochs using the two different algorithms are shown in Table 3. Similar to the previous case, the results obtained here indicate that the distributedly trained networks are able to yield performance similar to that of a centrally trained network. The total numbers of event-triggered parameter broadcast events for the 8 agents using the DETSGRAD algorithm are shown in Table 4. By employing the broadcast event-triggering mechanism, we are able to reduce the inter-agent communications from 9800 to an average of 5482 over 200 epochs, leading to a reduction of 44.1% in network communications. Yet, as can be seen in Table 3, DETSGRAD gives classification performance similar to that of distributed SGD with continuous parameter sharing, with a significant reduction in network communications.


Conclusion

This paper presented the development of a distributed stochastic gradient descent algorithm with an event-triggered communication mechanism for solving non-convex optimization problems. We presented a novel communication triggering mechanism, which allowed the agents to markedly reduce the communication overhead by communicating only when the local model has significantly changed from the previously communicated model. We presented sufficient conditions on the algorithm step-sizes to guarantee asymptotic mean-square convergence of the proposed algorithm to a critical point and provided the convergence rate of the proposed algorithm. We applied the developed algorithm to a distributed supervised learning problem, in which a set of networked agents collaboratively train their individual neural networks to perform image classification. Results indicate that the distributedly trained networks are able to yield performance similar to that of a centrally trained network. Numerical results also show that the proposed event-triggered communication mechanism significantly reduced the inter-agent communication while yielding performance similar to that of a distributedly trained network with constant communication.


References

Agarwal, A., and Duchi, J. C. 2011. Distributed delayed stochastic optimization. In NIPS, 873–881.
Assran, M.; Loizou, N.; Ballas, N.; and Rabbat, M. 2019. Stochastic gradient push for distributed deep learning. In ICML, 344–353.
Bianchi, P., and Jakubowicz, J. 2013. Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization. IEEE TAC 58(2):391–405.
Bottou, L.; Curtis, F.; and Nocedal, J. 2018. Optimization methods for large-scale machine learning. SIAM Review 60(2):223–311.
Chaturapruek, S.; Duchi, J. C.; and Ré, C. 2015. Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care. In NIPS, 1531–1539.
Chatzipanagiotis, N., and Zavlanos, M. M. 2017. On the convergence of a distributed augmented Lagrangian method for nonconvex optimization. IEEE TAC 62(9):4405–4420.
De Sa, C. M.; Zhang, C.; Olukotun, K.; and Ré, C. 2015. Taming the wild: A unified analysis of Hogwild-style algorithms. In NIPS, 2674–2682.
Duchi, J.; Jordan, M. I.; and McMahan, B. 2013. Estimation, optimization, and parallelism when data is sparse. In NIPS, 2832–2840.
Fang, C.; Lin, Z.; and Zhang, T. 2019. Sharp analysis for nonconvex SGD escaping from saddle points. arXiv:1902.00247.
Feyzmahdavian, H. R.; Aytekin, A.; and Johansson, M. 2016. An asynchronous mini-batch algorithm for regularized stochastic optimization. IEEE TAC 61(12):3740–3754.
George, J., and Gurram, P. 2019. Distributed deep learning with event-triggered communication. arXiv:1909.05020.
Guo, J.; Hug, G.; and Tonguz, O. K. 2017. A case for nonconvex distributed optimization in large-scale power systems. IEEE TPS 32(5):3842–3851.
Haddadpour, F.; Kamani, M. M.; Mahdavi, M.; and Cadambe, V. 2019. Trading redundancy for communication: Speeding up distributed SGD for non-convex optimization. In ICML, 2545–2554.
Hajinezhad, D.; Hong, M.; Zhao, T.; and Wang, Z. 2016. NESTT: A nonconvex primal-dual splitting method for distributed and stochastic optimization. In NIPS, 3215–3223.
Hajinezhad, D.; Hong, M.; and Garcia, A. 2019. ZONE: Zeroth-order nonconvex multi-agent optimization over networks. IEEE TAC.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In IEEE CVPR, 770–778.
Hong, M.; Hajinezhad, D.; and Zhao, M.-M. 2017. Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks. In ICML, 1529–1538.
Hong, M.; Luo, Z.; and Razaviyayn, M. 2016. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM JO 26(1):337–364.
Hong, M. 2018. A distributed, asynchronous, and incremental algorithm for nonconvex optimization: An ADMM approach. IEEE TCNS 5(3):935–945.
Huo, Z., and Huang, H. 2016. Asynchronous stochastic gradient descent with variance reduction for non-convex optimization. arXiv:1604.03584.
Jakovetic, D.; Bajovic, D.; Sahu, A. K.; and Kar, S. 2018. Convergence rates for distributed stochastic optimization over random networks. In IEEE CDC, 4238–4245.
Jiang, Z.; Balu, A.; Hegde, C.; and Sarkar, S. 2017. Collaborative deep learning in fixed topology networks. In NIPS, 5904–5914.
Jin, C.; Netrapalli, P.; Ge, R.; Kakade, S. M.; and Jordan, M. I. 2019. Stochastic gradient descent escapes saddle points efficiently. arXiv:1902.04811.
Konečný, J.; McMahan, H. B.; Yu, F. X.; Richtárik, P.; Suresh, A. T.; and Bacon, D. 2016. Federated learning: Strategies for improving communication efficiency. In NIPSW.
LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P.; et al. 1998. Gradient-based learning applied to document recognition. Proc. of the IEEE 86(11):2278–2324.
Lee, J. D.; Panageas, I.; Piliouras, G.; Simchowitz, M.; Jordan, M. I.; and Recht, B. 2017. First-order methods almost always avoid saddle points. arXiv:1710.07406.
Li, M.; Andersen, D. G.; Park, J. W.; Smola, A. J.; Ahmed, A.; Josifovski, V.; Long, J.; Shekita, E. J.; and Su, B.-Y. 2014a. Scaling distributed machine learning with the parameter server. In USENIX OSDI, 583–598.
Li, M.; Andersen, D. G.; Smola, A. J.; and Yu, K. 2014b. Communication efficient distributed machine learning with the parameter server. In NIPS, 19–27.
Lian, X.; Huang, Y.; Li, Y.; and Liu, J. 2015. Asynchronous parallel stochastic gradient for nonconvex optimization. arXiv:1506.08272.
Lian, X.; Zhang, C.; Zhang, H.; Hsieh, C.-J.; Zhang, W.; and Liu, J. 2017. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In NIPS, 5330–5340.
Lian, X.; Zhang, W.; Zhang, C.; and Liu, J. 2018. Asynchronous decentralized parallel stochastic gradient descent. In ICML, 3043–3052.
Lin, T.; Stich, S. U.; and Jaggi, M. 2018. Don't use large mini-batches, use local SGD. arXiv:1808.07217.
Lorenzo, P. D., and Scutari, G. 2016. NEXT: In-network nonconvex optimization. IEEE TSIPN 2(2):120–136.
McMahan, H. B.; Moore, E.; Ramage, D.; Hampson, S.; and y Arcas, B. A. 2017. Communication-efficient learning of deep networks from decentralized data. In AISTATS.
Mitliagkas, I.; Borokhovich, M.; Dimakis, A. G.; and Caramanis, C. 2015. FrogWild!: Fast PageRank approximations on graph engines. Proc. VLDB Endow. 8(8):874–885.
Nedić, A., and Olshevsky, A. 2016. Stochastic gradient-push for strongly convex functions on time-varying directed graphs. IEEE TAC 61(12):3936–3947.
Noel, C., and Osindero, S. 2014. Dogwild!: Distributed Hogwild for CPU & GPU. In NIPSW.
Pu, S., and Nedić, A. 2018. Distributed stochastic gradient tracking methods. arXiv:1805.11454.
Recht, B.; Ré, C.; Wright, S.; and Niu, F. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, 693–701.
Scutari, G.; Facchinei, F.; and Lampariello, L. 2017. Parallel and distributed methods for constrained nonconvex optimization. Part I: Theory. IEEE TSP 65(8):1929–1944.
Tang, H.; Lian, X.; Yan, M.; Zhang, C.; and Liu, J. 2018. D2: Decentralized training over decentralized data. In ICML, 4848–4856.
Tatarenko, T., and Touri, B. 2017. Non-convex distributed optimization. IEEE TAC 62(8):3744–3757.
Wai, H.; Freris, N. M.; Nedić, A.; and Scaglione, A. 2018. SUCAG: Stochastic unbiased curvature-aided gradient method for distributed optimization. In IEEE CDC, 1751–1756.
Wang, J., and Joshi, G. 2018. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv:1808.07576.
Wang, J.; Sahu, A. K.; Yang, Z.; Joshi, G.; and Kar, S. 2019. MATCHA: Speeding up decentralized SGD via matching decomposition sampling. arXiv:1905.09435.
Yu, H., and Jin, R. 2019. On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic non-convex optimization. In ICML, 7174–7183.
Yu, H.; Hsieh, C.; Si, S.; and Dhillon, I. 2012. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In IEEE ICDM, 765–774.
Yu, H.; Jin, R.; and Yang, S. 2019. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In ICML, 7184–7193.
Yu, H.; Yang, S.; and Zhu, S. 2019. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In AAAI-19, 5693–5700.
Zeng, J., and Yin, W. 2018. On nonconvex decentralized gradient descent. IEEE TSP 66(11):2834–2848.
Zhang, K.; Alqahtani, S.; and Demirbas, M. 2017. A comparison of distributed machine learning platforms. In ICCCN, 1–9.
Zhang, J.; Tu, H.; Ren, Y.; Wan, J.; Zhou, L.; Li, M.; and Wang, J. 2018. An adaptive synchronous parallel strategy for distributed machine learning. IEEE Access 6:19222–19230.
Zhou, Z.; Mertikopoulos, P.; Bambos, N.; Glynn, P.; Ye, Y.; Li, L.-J.; and Fei-Fei, L. 2018. Distributed asynchronous optimization with unbounded delays: How slow can you go? In ICML, 5970–5979.
Zhu, M., and Martínez, S. 2013. An approximate dual subgradient algorithm for multi-agent non-convex optimization. IEEE TAC 58(6):1534–1539.
