Federated Learning Notes: "Communication-Efficient Learning of Deep Networks from Decentralized Data"

Abstract:

Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data center and training there using conventional approaches. We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning. We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets. These experiments demonstrate the approach is robust to the unbalanced and non-IID data distributions that are a defining characteristic of this setting. Communication costs are the principal constraint, and we show a reduction in required communication rounds by 10–100× as compared to synchronized stochastic gradient descent.

In the abstract, there were several terms I did not understand: iterative model averaging, non-IID (not independent and identically distributed), robustness (robust), and synchronized stochastic gradient descent.

After some study, I now have a basic understanding of these terms:

Iterative model averaging: (placeholder, to be filled in later)

Non-IID (not independent and identically distributed): in probability theory and statistics, a collection of random variables is independent and identically distributed (IID) when every variable has the same probability distribution and the variables are mutually independent; for example, with unchanged experimental conditions, a sequence of coin-flip outcomes is IID. Non-IID data violates one or both of these conditions, which in this setting means each client's local data follows a different distribution from the overall population.

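To make the contrast concrete for myself, here is a small sketch (my own toy example in Python/NumPy, not from the paper) comparing an IID partition of a labeled dataset across clients with a label-skewed partition, which is one common way to simulate the non-IID setting: each client ends up seeing only a couple of classes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labeled dataset: 1000 examples, 10 classes.
labels = rng.integers(0, 10, size=1000)
indices = np.arange(len(labels))
num_clients = 10

# IID partition: shuffle, then split evenly -- every client sees every class.
iid_parts = np.array_split(rng.permutation(indices), num_clients)

# Non-IID (label-skewed) partition: sort by label, cut into shards, and give
# each client two shards -- so each client sees only a few classes and its
# local distribution differs from the population distribution.
sorted_idx = indices[np.argsort(labels)]
shards = np.array_split(sorted_idx, 2 * num_clients)
shard_ids = rng.permutation(2 * num_clients)
non_iid_parts = [np.concatenate([shards[shard_ids[2 * k]], shards[shard_ids[2 * k + 1]]])
                 for k in range(num_clients)]

for name, parts in [("IID", iid_parts), ("non-IID", non_iid_parts)]:
    print(name, "client 0 label counts:", np.bincount(labels[parts[0]], minlength=10))
```
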
Robustness (robust): robustness is the sturdiness or strength of a system, that is, its ability to keep working under abnormal or adverse conditions. For example, whether a piece of software can avoid hanging or crashing when faced with invalid input, disk failures, network overload, or deliberate attacks is a measure of that software's robustness.

Synchronized stochastic gradient descent (synchronized SGD): gradient descent is an optimization algorithm, often called steepest descent, commonly used in machine learning and artificial intelligence to iteratively approach the model with minimum error; stochastic gradient descent (SGD) is the most widely used optimizer in deep learning. A classic analogy: suppose you are on a mountain and want to get down as quickly as possible. With good visibility you can see your position and the slope where you stand, so you simply walk downhill and eventually reach the bottom. Blindfolded, you can only judge the slope by how the ground feels under your feet, which is far less accurate; what feels like a downhill slope may not be one, and you may wander back and forth for a long time before getting down. Batch gradient descent is like descending with clear sight, while stochastic gradient descent is like descending blindfolded. (Analogy taken from https://blog.csdn.net/qq_44614524/article/details/114241259.) In the synchronized, distributed form, several workers compute gradients on their own data in parallel, and the server waits for all of them before applying a combined update at each step.

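A minimal toy sketch of the difference between full-batch gradient descent and SGD (again my own illustration, fitting a one-parameter linear model with squared loss): the batch version uses the exact gradient over all the data at every step, while SGD takes cheap, noisy steps based on a single example.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 3.0 * x + 0.1 * rng.normal(size=200)   # data from a model with true slope 3

def grad(w, xb, yb):
    # Gradient of the mean squared error 0.5 * (w * x - y)^2 with respect to w.
    return np.mean((w * xb - yb) * xb)

w_gd, w_sgd, lr = 0.0, 0.0, 0.1
for step in range(200):
    w_gd -= lr * grad(w_gd, x, y)                   # full batch: exact gradient, costly per step
    i = rng.integers(len(x))
    w_sgd -= lr * grad(w_sgd, x[i:i+1], y[i:i+1])   # stochastic: noisy gradient from one example

print(f"batch GD: {w_gd:.3f}  SGD: {w_sgd:.3f}  (true slope 3.0)")
```
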
Main text:

Increasingly, phones and tablets are the primary computing devices for many people [30, 2]. The powerful sensors on these devices (including cameras, microphones, and GPS), combined with the fact they are frequently carried, means they have access to an unprecedented amount of data, much of it private in nature. Models learned on such data hold the promise of greatly improving usability by powering more intelligent applications, but the sensitive nature of the data means there are risks and responsibilities to storing it in a centralized location.

We investigate a learning technique that allows users to collectively reap the benefits of shared models trained from this rich data, without the need to centrally store it. We term our approach Federated Learning, since the learning task is solved by a loose federation of participating devices (which we refer to as clients) which are coordinated by a central server. Each client has a local training dataset which is never uploaded to the server. Instead, each client computes an update to the current global model maintained by the server, and only this update is communicated. This is a direct application of the principle of focused collection or data minimization proposed by the 2012 White House report on privacy of consumer data [39]. Since these updates are specific to improving the current model, there is no reason to store them once they have been applied.

A principal advantage of this approach is the decoupling of model training from the need for direct access to the raw training data. Clearly, some trust of the server coordinating the training is still required. However, for applications where the training objective can be specified on the basis of data available on each client, federated learning can significantly reduce privacy and security risks by limiting the attack surface to only the device, rather than the device and the cloud.

Our primary contributions are 1) the identification of the problem of training on decentralized data from mobile devices as an important research direction; 2) the selection of a straightforward and practical algorithm that can be applied to this setting; and 3) an extensive empirical evaluation of the proposed approach. More concretely, we introduce the Federated Averaging algorithm, which combines local stochastic gradient descent (SGD) on each client with a server that performs model averaging. We perform extensive experiments on this algorithm, demonstrating it is robust to unbalanced and non-IID data distributions, and can reduce the rounds of communication needed to train a deep network on decentralized data by orders of magnitude.

Federated Learning  Ideal problems for federated learning have the following properties: 1) Training on real-world data from mobile devices provides a distinct advantage over training on proxy data that is generally available in the data center. 2) This data is privacy sensitive or large in size (compared to the size of the model), so it is preferable not to log it to the data center purely for the purpose of model training (in service of the focused collection principle). 3) For supervised tasks, labels on the data can be inferred naturally from user interaction.

Many models that power intelligent behavior on mobile devices fit the above criteria. As two examples, we consider image classification, for example predicting which photos are most likely to be viewed multiple times in the future, or shared; and language models, which can be used to improve voice recognition and text entry on touch-screen keyboards by improving decoding, next-word-prediction, and even predicting whole replies [10]. The potential training data for both these tasks (all the photos a user takes and everything they type on their mobile keyboard, including passwords, URLs, messages, etc.) can be privacy sensitive. The distributions from which these examples are drawn are also likely to differ substantially from easily available proxy datasets: the use of language in chat and text messages is generally much different than standard language corpora, e.g., Wikipedia and other web documents; the photos people take on their phone are likely quite different than typical Flickr photos. And finally, the labels for these problems are directly available: entered text is self-labeled for learning a language model, and photo labels can be defined by natural user interaction with their photo app (which photos are deleted, shared, or viewed).

Both of these tasks are well-suited to learning a neural network. For image classification feed-forward deep networks, and in particular convolutional networks, are well-known to provide state-of-the-art results [26, 25]. For language modeling tasks recurrent neural networks, and in particular LSTMs, have achieved state-of-the-art results [20, 5, 22].

Privacy  Federated learning has distinct privacy advantages compared to data center training on persisted data. Holding even an “anonymized” dataset can still put user privacy at risk via joins with other data [37]. In contrast, the information transmitted for federated learning is the minimal update necessary to improve a particular model (naturally, the strength of the privacy benefit depends on the content of the updates). The updates themselves can (and should) be ephemeral. They will never contain more information than the raw training data (by the data processing inequality), and will generally contain much less. Further, the source of the updates is not needed by the aggregation algorithm, so updates can be transmitted without identifying meta-data over a mix network such as Tor [7] or via a trusted third party. We briefly discuss the possibility of combining federated learning with secure multiparty computation and differential privacy at the end of the paper.

Federated Optimization We refer to the optimization problem implicit in federated learning as federated optimization, drawing a connection (and contrast) to distributed optimization. Federated optimization has several key properties that differentiate it from a typical distributed optimization problem: 
• Non-IID The training data on a given client is typically based on the usage of the mobile device by a particular user, and hence any particular user's local dataset will not be representative of the population distribution.
• Unbalanced Similarly, some users will make much heavier use of the service or app than others, leading to varying amounts of local training data.
• Massively distributed We expect the number of clients participating in an optimization to be much larger than the average number of examples per client.
• Limited communication Mobile devices are frequently offline or on slow or expensive connections.
In this work, our emphasis is on the non-IID and unbalanced properties of the optimization, as well as the critical nature of the communication constraints. A deployed federated optimization system must also address a myriad of practical issues: client datasets that change as data is added and deleted; client availability that correlates with the local data distribution in complex ways (e.g., phones from speakers of American English will likely be plugged in at different times than speakers of British English); and clients that never respond or send corrupted updates.

These issues are beyond the scope of the current work; instead, we use a controlled environment that is suitable for experiments, but still addresses the key issues of client availability and unbalanced and non-IID data. We assume a synchronous update scheme that proceeds in rounds of communication. There is a fixed set of K clients, each with a fixed local dataset. At the beginning of each round, a random fraction C of clients is selected, and the server sends the current global algorithm state to each of these clients (e.g., the current model parameters). We only select a fraction of clients for efficiency, as our experiments show diminishing returns for adding more clients beyond a certain point. Each selected client then performs local computation based on the global state and its local dataset, and sends an update to the server. The server then applies these updates to its global state, and the process repeats.

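To keep the structure of one such round straight, here is a schematic sketch (my own Python pseudocode, not the paper's Algorithm 1). The `client_update` routine is left abstract, and weighting the aggregation by the local dataset sizes n_k is an assumption on my part, chosen to match the objective decomposition given below.

```python
import numpy as np

def run_round(global_w, clients, C, client_update, rng):
    """One synchronous round of the scheme described above.

    clients: list of (local_data, n_k) pairs;
    client_update(w, local_data) -> new local model parameters.
    """
    K = len(clients)
    m = max(int(C * K), 1)                           # select a fraction C of the K clients
    selected = rng.choice(K, size=m, replace=False)

    new_weights, sizes = [], []
    for k in selected:
        local_data, n_k = clients[k]
        new_weights.append(client_update(np.copy(global_w), local_data))  # local computation
        sizes.append(n_k)

    # Server applies the updates to its global state: here, a weighted average
    # of the returned models, with weights proportional to the local dataset sizes.
    sizes = np.asarray(sizes, dtype=float)
    return sum(w * (s / sizes.sum()) for w, s in zip(new_weights, sizes))
```

Calling `run_round` repeatedly with the returned weights gives the iterative scheme; plugging in different `client_update` routines yields variants such as the FedSGD baseline described later in these notes.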

While we focus on non-convex neural network objectives, the algorithms we consider are applicable to any finite-sum objective of the form

$$\min_{w \in \mathbb{R}^d} f(w), \quad \text{where} \quad f(w) := \frac{1}{n}\sum_{i=1}^{n} f_i(w). \qquad (1)$$

For a machine learning problem, we typically take f_i(w) = l(x_i, y_i; w), that is, the loss of the prediction on example (x_i, y_i) made with model parameters w. We assume the data is partitioned over K clients, with P_k the set of indices of the data points on client k and n_k = |P_k|. Objective (1) can then be rewritten as

$$f(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w), \quad \text{where} \quad F_k(w) = \frac{1}{n_k}\sum_{i \in P_k} f_i(w). \qquad (2)$$

If the partition P_k were formed by distributing the training examples over the clients uniformly at random, we would have E_{P_k}[F_k(w)] = f(w), where the expectation is over the set of examples assigned to a fixed client k. This is the IID assumption typically made by distributed optimization algorithms; we refer to the case where it does not hold (that is, F_k could be an arbitrarily bad approximation to f) as the non-IID setting.

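A quick numerical sanity check of the rewriting in (2), using random numbers as stand-ins for the per-example losses f_i(w) at some fixed w (my own illustration): the n_k/n-weighted average of the per-client averages F_k(w) equals the global average f(w) for any partition, balanced or not.

```python
import numpy as np

rng = np.random.default_rng(2)
per_example_loss = rng.random(1000)      # stand-ins for f_i(w) at a fixed w
n = len(per_example_loss)

# An arbitrary, unbalanced partition of the indices into K = 4 clients.
parts = np.array_split(np.arange(n), [100, 350, 900])

f_global = per_example_loss.mean()                                           # f(w) as in (1)
f_decomposed = sum(len(P) / n * per_example_loss[P].mean() for P in parts)   # sum_k (n_k / n) * F_k(w)
print(np.isclose(f_global, f_decomposed))  # True
```
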
In data center optimization, communication costs are relatively small, and computational costs dominate, with much of the recent emphasis being on using GPUs to lower these costs. In contrast, in federated optimization communication costs dominate — we will typically be limited by an upload bandwidth of 1 MB/s or less. Further, clients will typically only volunteer to participate in the optimization when they are charged, plugged-in, and on an unmetered wi-fi connection. Further, we expect each client will only participate in a small number of update rounds per day. On the other hand, since any single on-device dataset is small compared to the total dataset size, and modern smartphones have relatively fast processors (including GPUs), computation becomes essentially free compared to communication costs for many model types. Thus, our goal is to use additional computation in order to decrease the number of rounds of communication needed to train a model. There are two primary ways we can add computation: 1) increased parallelism, where we use more clients working independently between each communication round; and, 2) increased computation on each client, where rather than performing a simple computation like a gradient calculation, each client performs a more complex calculation between each communication round. We investigate both of these approaches, but the speedups we achieve are due primarily to adding more computation on each client, once a minimum level of parallelism over clients is used.

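As a sketch of option 2) above, here is a hypothetical `client_update` that spends extra local computation by running several epochs of minibatch SGD before sending anything back; the `epochs` and `batch_size` knobs and the `grad` callback are my own illustrative choices, not parameters defined in this part of the paper.

```python
import numpy as np

def client_update(w, local_x, local_y, grad, lr=0.1, epochs=5, batch_size=32, seed=0):
    """Run `epochs` passes of minibatch SGD over this client's data only.

    More local epochs (or smaller batches) means more computation between
    communication rounds; only the final weights are sent back to the server.
    """
    rng = np.random.default_rng(seed)
    n_k = len(local_x)
    for _ in range(epochs):
        order = rng.permutation(n_k)
        for start in range(0, n_k, batch_size):
            batch = order[start:start + batch_size]
            w = w - lr * grad(w, local_x[batch], local_y[batch])
    return w
```

With `epochs=1` and `batch_size=len(local_x)`, this collapses back to a single full-batch gradient step per round, i.e., essentially no added local computation.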

Related Work  Distributed training by iteratively averaging locally trained models has been studied by McDonald et al. [28] for the perceptron and Povey et al. [31] for speech recognition DNNs. Zhang et al. [42] studies an asynchronous approach with “soft” averaging. These works only consider the cluster / data center setting (at most 16 workers, wall-clock time based on fast networks), and do not consider datasets that are unbalanced and non-IID, properties that are essential to the federated learning setting. We adapt this style of algorithm to the federated setting and perform the appropriate empirical evaluation, which asks different questions than those relevant in the data center setting, and requires different methodology.

Using similar motivation to ours, Neverova et al. [29] also discusses the advantages of keeping sensitive user data on device. The work of Shokri and Shmatikov [35] is related in several ways: they focus on training deep networks, emphasize the importance of privacy, and address communication costs by only sharing a subset of the parameters during each round of communication; however, they also do not consider unbalanced and non-IID data, and the empirical evaluation is limited.

In the convex setting, the problem of distributed optimization and estimation has received significant attention [4, 15, 33], and some algorithms do focus specifically on communication efficiency [45, 34, 40, 27, 43]. In addition to assuming convexity, this existing work generally requires that the number of clients is much smaller than the number of examples per client, that the data is distributed across the clients in IID fashion, and that each node has an identical number of data points — all of these assumptions are violated in the federated optimization setting. Asynchronous distributed forms of SGD have also been applied to training neural networks, e.g., Dean et al. [12], but these approaches require a prohibitive number of updates in the federated setting. Distributed consensus algorithms (e.g., [41]) relax the IID assumption, but are still not a good fit for communication-constrained optimization over very many clients.

One endpoint of the (parameterized) algorithm family we consider is simple one-shot averaging, where each client solves for the model that minimizes (possibly regularized) loss on their local data, and these models are averaged to produce the final global model. This approach has been studied extensively in the convex case with IID data, and it is known that in the worst-case, the global model produced is no better than training a model on a single client [44, 3, 46].

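For contrast with the iterative schemes, one-shot averaging is easy to sketch (my own illustration): each client fits a model on its local data alone, with no communication during training, and the resulting parameter vectors are averaged a single time (here weighted by local dataset size, which is one natural choice).

```python
import numpy as np

def one_shot_average(clients, fit_local_model):
    """clients: list of (local_x, local_y); fit_local_model trains to (near-)convergence locally."""
    local_models = [fit_local_model(x, y) for x, y in clients]   # no communication while training
    sizes = np.array([len(x) for x, _ in clients], dtype=float)
    return sum(w * (s / sizes.sum()) for w, s in zip(local_models, sizes))  # one averaging step
```
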

2 The FederatedAveraging Algorithm

The recent multitude of successful applications of deep learning have almost exclusively relied on variants of stochastic gradient descent (SGD) for optimization; in fact, many advances can be understood as adapting the structure of the model (and hence the loss function) to be more amenable to optimization by simple gradient-based methods [16]. Thus, it is natural that we build algorithms for federated optimization by starting from SGD.

SGD can be applied naively to the federated optimization problem, where a single batch gradient calculation (say on a randomly selected client) is done per round of communication. This approach is computationally efficient, but requires very large numbers of rounds of training to produce good models (e.g., even using an advanced approach like batch normalization, Ioffe and Szegedy [21] trained MNIST for 50000 steps on minibatches of size 60). We consider this baseline in our CIFAR-10 experiments.

In the federated setting, there is little cost in wall-clock time to involving more clients, and so for our baseline we use large-batch synchronous SGD; experiments by Chen et al. [8] show this approach is state-of-the-art in the data center setting, where it outperforms asynchronous approaches. To apply this approach in the federated setting, we select a C-fraction of clients on each round, and compute the gradient of the loss over all the data held by these clients. Thus, C controls the global batch size, with C = 1 corresponding to full-batch (non-stochastic) gradient descent. We refer to this baseline algorithm as FederatedSGD (or FedSGD).

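A schematic sketch of this baseline, reusing the round structure sketched earlier (my own illustration, not the paper's listing): each selected client computes a full-batch gradient over all of its local data, and the server takes a single step with the size-weighted combination of these gradients.

```python
import numpy as np

def fedsgd_round(w, selected_clients, grad, lr=0.1):
    """One FedSGD round; selected_clients is a list of (local_x, local_y) arrays.

    Each client computes the gradient of its local loss over all of its data;
    the server applies one gradient step using the size-weighted combination,
    so with C = 1 (all clients selected) the step equals full-batch gradient
    descent over the combined data.
    """
    grads = [grad(w, x, y) for x, y in selected_clients]
    sizes = np.array([len(x) for x, _ in selected_clients], dtype=float)
    combined = sum(g * (s / sizes.sum()) for g, s in zip(grads, sizes))
    return w - lr * combined
```
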

------------------------------------------ Progress as of 2021-10-10 ------------------------------------------

Tip:

The translation of the paper draws on: Communication-Efficient Learning of Deep Networks from Decentralized Data - 穷酸秀才大艹包 - 博客园 (cnblogs.com)

My understanding of the paper also draws on:

联邦学习学习笔记——论文理解《Communication-Efficient Learning of Deep Networks from Decentralized Data》_biongbiongdou的博客-CSDN博客
