I recently read two papers on federated learning, "Communication-Efficient Learning of Deep Networks from Decentralized Data" and "FedMD: Heterogenous Federated Learning via Model Distillation", which propose two algorithms: FedAvg and FedMD.
Communication-Efficient Learning of Deep Networks from Decentralized Data
Research Background
Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data center and training there using conventional approaches.
For data that is privacy sensitive or very large and cannot be uploaded directly to the data center, conventional centralized training may not be possible.
Main Research Content
We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning.
An alternative model training approach is proposed: Federated Learning.
We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets. These experiments demonstrate the approach is robust to the unbalanced and non-IID data distributions that are a defining characteristic of this setting. Communication costs are the principal constraint, and we show a reduction in required communication rounds by 10–100× as compared to synchronized stochastic gradient descent.
The paper presents an algorithm based on iterative model averaging that remains effective on non-IID data and reduces the required communication rounds by 10-100x compared to synchronized stochastic gradient descent.
More concretely, we introduce the FederatedAveraging algorithm, which combines local stochastic gradient descent (SGD) on each client with a server that performs model averaging. We perform extensive experiments on this algorithm, demonstrating it is robust to unbalanced and non-IID data distributions, and can reduce the rounds of communication needed to train a deep network on decentralized data by orders of magnitude.
The FederatedAveraging (FedAvg) algorithm combines local stochastic gradient descent (SGD) on each client with a server that performs model averaging. It is robust to unbalanced and non-IID data distributions and reduces the number of communication rounds needed to train a deep network on decentralized data.
Federated Learning
Federated Learning Ideal problems for federated learning have the following properties: 1) Training on real-world data from mobile devices provides a distinct advantage over training on proxy data that is generally available in the data center. 2) This data is privacy sensitive or large in size (compared to the size of the model), so it is preferable not to log it to the data center purely for the purpose of model training (in service of the focused collection principle). 3) For supervised tasks, labels on the data can be inferred naturally from user interaction.
Ideal problems for federated learning have the following properties:
- Training on real-world data from mobile devices offers a distinct advantage over training on proxy data that is generally available in the data center.
- This data is privacy sensitive or large in size (compared to the size of the model), so it is preferable not to log it to the data center purely for the purpose of model training (in service of the focused collection principle).
- For supervised tasks, labels on the data can be inferred naturally from user interaction.
Privacy
Privacy Federated learning has distinct privacy advantages compared to data center training on persisted data. Holding even an “anonymized” dataset can still put user privacy at risk via joins with other data (Sweeney, 2000). In contrast, the information transmitted for federated learning is the minimal update necessary to improve a particular model (naturally, the strength of the privacy benefit depends on the content of the updates.) The updates themselves can (and should) be ephemeral. They will never contain more information than the raw training data (by the data processing inequality), and will generally contain much less. Further, the source of the updates is not needed by the aggregation algorithm, so updates can be transmitted without identifying meta-data over a mix network such as Tor (Chaum, 1981) or via a trusted third party. We briefly discuss the possibility of combining federated learning with secure multiparty computation and differential privacy at the end of the paper.
- The information transmitted for federated learning is the minimal update necessary to improve a particular model (the strength of the privacy benefit depends on the content of the updates);
- The updates themselves are ephemeral and never contain more information than the raw training data (by the data processing inequality), and generally contain much less;
- The aggregation algorithm does not need the source of the updates (i.e., it does not need to know which user they came from), so updates can be transmitted without identifying metadata over a mix network such as Tor or via a trusted third party;
- Federated learning can further be combined with secure multiparty computation and differential privacy.
Federated Optimization
The optimization problem that arises in federated learning is referred to as federated optimization.
Key properties:
Non-IID The training data on a given client is typically based on the usage of the mobile device by a particular user, and hence any particular user’s local dataset will not be representative of the population distribution.
Unbalanced Similarly, some users will make much heavier use of the service or app than others, leading to varying amounts of local training data.
Massively distributed We expect the number of clients participating in an optimization to be much larger than the average number of examples per client.
Limited communication Mobile devices are frequently offline or on slow or expensive connections.
- Non-IID: the training data on a given client is based on a particular user's usage, so it is not representative of the population distribution;
- Unbalanced: some users use the service or app much more heavily than others, leading to varying amounts of local training data;
- Massively distributed: the number of clients participating in an optimization is much larger than the average number of examples per client;
- Limited communication: mobile devices are frequently offline or on slow or expensive connections.
The most important challenges in practice are the non-IID, unbalanced data and the communication constraints.
Practical issues that a deployed system must also address:
A deployed federated optimization system must also address a myriad of practical issues: client datasets that change as data is added and deleted; client availability that correlates with the local data distribution in complex ways; and clients that never respond or send corrupted updates.
- client datasets that change as data is added and deleted;
- client availability that correlates with the local data distribution in complex ways;
- clients that never respond or that send corrupted updates.
These issues are left out of scope in this paper.
Optimization Approach
Execution Scheme
We assume a synchronous update scheme that proceeds in rounds of communication. There is a fixed set of K clients, each with a fixed local dataset. At the beginning of each round, a random fraction C of clients is selected, and the server sends the current global algorithm state to each of these clients (e.g., the current model parameters). Each client then performs local computation based on the global state and its local dataset, and sends an update to the server. The server then applies these updates to its global state, and the process repeats.
We assume a synchronous update scheme that proceeds in rounds of communication, with a fixed set of K clients, each holding a fixed local dataset.
- At the beginning of each round, a random fraction C of the clients is selected (C ≤ 1);
- The server sends the current global algorithm state (e.g., the current model parameters) to each of these clients;
- Each client performs local computation based on the global state and its local dataset, and sends an update to the server;
- Finally, the server applies these updates to its global state, and the process repeats (sketched in the code below).
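A minimal Python sketch of this synchronous round protocol, with the client-side computation and the server-side aggregation left as abstract callables (`client_update` and `aggregate` are placeholders introduced here, not names from the paper):

```python
# Sketch of the synchronous federated round protocol described above.
# `global_state` stands for the current model parameters; `client_update`
# and `aggregate` are abstract placeholders supplied by the caller.
import random

def run_round(global_state, clients, C, client_update, aggregate):
    K = len(clients)
    m = max(int(C * K), 1)                 # select a fraction C of the K clients
    selected = random.sample(clients, m)   # random selection each round
    # Server broadcasts the global state; each selected client computes an update locally.
    updates = [client_update(global_state, client) for client in selected]
    # Server applies the collected updates to its global state.
    return aggregate(global_state, updates)
```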
Objective Function for Non-Convex Neural Networks
While we focus on non-convex neural network objectives, the algorithm we consider is applicable to any finite-sum objective of the form
For a machine learning problem, we typically take fi(w) = l(xi, yi; w), that is, the loss of the prediction on example (xi, yi) made with model parameters w. We assume there are K clients over which the data is partitioned, with Pk the set of indexes of data points on client k, with nk = |Pk|.
Thus, we can re-write the objective (1) as
For a machine learning problem, we typically take f_i(w) = ℓ(x_i, y_i; w), i.e., the loss of the prediction on example (x_i, y_i) made with model parameters w. Assuming the data is partitioned over K clients, with P_k the set of indices of data points on client k and n_k = |P_k|, the objective can be rewritten as a weighted sum of per-client objectives F_k (see the equations below).
If the partition P_k had been formed by distributing the training examples over the clients uniformly at random, then the objective f(w) would equal the expectation of F_k(w) over the random partition. This is exactly the IID assumption made by traditional distributed optimization; federated optimization drops this assumption.
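The equations referenced above (numbered (1) and (2) in the paper), restated here in LaTeX for reference:

```latex
% Objective (1): finite-sum form over all n training examples
\min_{w \in \mathbb{R}^d} f(w), \qquad
f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w), \qquad
f_i(w) = \ell(x_i, y_i; w)

% Objective (2): rewritten over the K clients, with P_k the index set held
% by client k and n_k = |P_k|
f(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w), \qquad
F_k(w) = \frac{1}{n_k} \sum_{i \in P_k} f_i(w)

% IID case: if P_k were a uniform random sample of the training data, then
% \mathbb{E}_{P_k}[F_k(w)] = f(w)
```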
Communication Costs vs. Computation Costs
**Thus, our goal is to use additional computation in order to decrease the number of rounds of communication needed to train a model.** There are two primary ways we can add computation: **1) increased parallelism,** where we use more clients working independently between each communication round; and, **2) increased computation on each client,** where rather than performing a simple computation like a gradient calculation, each client performs a more complex calculation between each communication round. We investigate both of these approaches, but the speedups we achieve are due primarily to adding more computation on each client, once a minimum level of parallelism over clients is used.
In data center optimization, communication costs are relatively small and computational costs dominate, with much of the recent emphasis on using GPUs to lower those costs.
In federated optimization, communication costs dominate:
- upload bandwidth is typically limited to 1 MB/s or less;
- clients usually only volunteer to participate in optimization when they are charging, plugged in, and on an unmetered Wi-Fi connection;
- each client is expected to participate in only a small number of update rounds per day.
Computational costs, by contrast, are relatively small:
- the dataset on any single device is small compared to the total dataset size;
- modern smartphones have relatively fast processors (including GPUs).
Therefore, the goal is to use additional computation in order to decrease the number of communication rounds needed to train a model. Two ways of adding computation are considered:
- increased parallelism: use more clients working independently between communication rounds;
- increased computation per client: rather than a simple computation such as one gradient calculation, each client performs a more complex computation between communication rounds.
Once a minimum level of parallelism over clients is used, the speedups achieved come primarily from adding more computation on each client.
Asynchronous distributed forms of SGD have also been applied to training neural networks, e.g., Dean et al. (2012), but these approaches require a prohibitive number of updates in the federated setting. One endpoint of the (parameterized) algorithm family we consider is simple one-shot averaging, where each client solves for the model that minimizes (possibly regularized) loss on their local data, and these models are averaged to produce the final global model. This approach has been studied extensively in the convex case with IID data, and it is known that in the worst-case, the global model produced is no better than training a model on a single client (Zhang et al., 2012; Arjevani and Shamir, 2015; Zinkevich et al., 2010).
Asynchronous distributed forms of SGD have also been applied to training neural networks, but these approaches require a prohibitive number of updates in the federated setting.
One endpoint of the (parameterized) algorithm family considered here is simple one-shot averaging, where each client solves for the model that minimizes (possibly regularized) loss on its local data, and these models are averaged once to produce the final global model. This approach has been studied extensively in the convex case with IID data, and it is known that in the worst case the resulting global model is no better than a model trained on a single client.
Federated Averaging Algorithm
The recent multitude of successful applications of deep learning have almost exclusively relied on variants of stochastic gradient descent (SGD) for optimization; in fact, many advances can be understood as adapting the structure of the model (and hence the loss function) to be more amenable to optimization by simple gradient-based methods (Goodfellow et al., 2016). Thus, it is natural that we build algorithms for federated optimization by starting from SGD.
- Recent successful applications of deep learning have relied almost exclusively on variants of stochastic gradient descent (SGD) for optimization;
- In fact, many advances can be understood as adapting the structure of the model (and hence the loss function) to be more amenable to optimization by simple gradient-based methods.
Baseline algorithm: FederatedSGD
SGD can be applied naively to the federated optimization problem, where a single batch gradient calculation (say on a randomly selected client) is done per round of communication. This approach is computationally efficient, but requires very large numbers of rounds of training to produce good models (e.g., even using an advanced approach like batch normalization, Ioffe and Szegedy (2015) trained MNIST for 50000 steps on minibatches of size 60). We consider this baseline in our CIFAR-10 experiments.
Issue: this is computationally efficient, but requires a very large number of rounds of training to produce good models.
In the federated setting, there is little cost in wall-clock time to involving more clients, and so for our baseline we use large-batch synchronous SGD; experiments by Chen et al. (2016) show this approach is state-of-the-art in the data center setting, where it outperforms asynchronous approaches. To apply this approach in the federated setting, we select a C-fraction of clients on each round, and compute the gradient of the loss over all the data held by these clients. Thus, C controls the global batch size, with C = 1 corresponding to full-batch (non-stochastic) gradient descent. We refer to this baseline algorithm as FederatedSGD (or FedSGD).
SGD can be applied naively to federated optimization: a single batch gradient calculation (say, on one randomly selected client) is done per round of communication.
The baseline used here is large-batch synchronous SGD (state-of-the-art in the data center setting, where it outperforms asynchronous approaches).
Federated form: on each round, select a C-fraction of the clients and compute the gradient of the loss over all the data held by these clients.
The parameter C thus controls the global batch size, with C = 1 corresponding to full-batch (non-stochastic) gradient descent. The resulting update is written out below.
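With learning rate η, one round of this baseline can be restated as follows (following the paper's Section 3: each selected client computes a full-batch gradient on its local data and the server takes a single weighted-average step):

```latex
% Each selected client k computes the gradient of its local objective
g_k = \nabla F_k(w_t)

% The server aggregates the gradients into one step; for C = 1 this is
% equivalent to a full-batch gradient step on f(w)
w_{t+1} \leftarrow w_t - \eta \sum_{k} \frac{n_k}{n} \, g_k
```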
FederatedAveraging (FedAvg)
The amount of computation is controlled by three key parameters: C, the fraction of clients that perform computation on each round; E, the number of training passes each client makes over its local dataset on each round; and B, the local minibatch size used for the client updates. We write B = ∞ to indicate that the full local dataset is treated as a single minibatch. Thus, at one endpoint of this algorithm family, we can take B = ∞ and E = 1 which corresponds exactly to FedSGD. Complete pseudo-code is given in Algorithm 1.
In my view, FedAvg can be seen as FedSGD with multiple local gradient updates performed on each client before averaging.
Key parameters: C (fraction of clients per round), B (local minibatch size), E (number of local epochs).
With B = ∞ and E = 1, the local batch is the full local dataset and only a single local pass is made, which recovers FedSGD; a sketch of the local update is given below.
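A minimal sketch of the local ClientUpdate step controlled by E and B, which plugs into the round protocol sketched earlier (weights are plain NumPy arrays and `local_gradient` is a hypothetical helper, not code from the paper):

```python
# Sketch of FedAvg's ClientUpdate(k, w): E local epochs of minibatch SGD
# with batch size B, starting from the current global weights w.
# `local_gradient(w, batch)` is a hypothetical helper returning the gradient
# of the loss on `batch` at weights `w`, as a NumPy array of the same shape.
import random
import numpy as np

def client_update(w, local_data, E, B, lr):
    w = np.array(w, copy=True)
    if B is None:                      # B = "infinity": the full local dataset is one batch
        B = len(local_data)
    for _ in range(E):                 # E passes over the local data
        random.shuffle(local_data)
        for i in range(0, len(local_data), B):
            batch = local_data[i:i + B]
            w -= lr * local_gradient(w, batch)
    return w

# With B = None (i.e. B = infinity) and E = 1 this reduces to a single
# full-batch gradient step, i.e. FedSGD. The server then forms the next
# global model as the weighted average of the returned weights, with
# coefficients n_k / n over the selected clients.
```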
Analysis of Model Averaging
For general non-convex objectives, averaging models in parameter space can produce an arbitrarily bad model. This is exactly what happens when two MNIST digit-recognition models trained from different random initializations are averaged (Figure 1, left).
Recent empirical work shows, however, that if two models start from the same random initialization and are then trained independently on different subsets of the data, naive parameter averaging works surprisingly well (Figure 1, right).
The success of dropout training also provides some intuition for the success of our model averaging scheme; dropout training can be interpreted as averaging models of different architectures which share parameters, and the inference-time scaling of the model parameters is analogous to the model averaging used in FedAvg (Srivastava et al., 2014).
Dropout training can be interpreted as averaging models of different architectures that share parameters, and the inference-time scaling of the model parameters is analogous to the model averaging used in FedAvg.
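Concretely, the averaging experiment in Figure 1 evaluates the loss of a convex combination of two trained parameter vectors w and w' for a varying mixing weight θ (a sketch of the quantity being plotted; the exact range of θ shown in the figure may differ):

```latex
% Loss of the mixed model as a function of the mixing weight \theta
\ell\big(\theta\, w + (1 - \theta)\, w'\big), \qquad \theta \in [0, 1]
```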
Experimental Results
Experiments were run on image classification and language modeling tasks, under both IID and non-IID data partitions.
Summary
For all three model classes, FedAvg converges to a higher level of test-set accuracy than the baseline FedSGD models. This trend continues even if the lines are extended beyond the plotted ranges. For example, for the CNN the B = ∞, E = 1 FedSGD model eventually reaches 99.22% accuracy after 1200 rounds (and had not improved further after 6000 rounds), while the B = 10, E = 20 FedAvg model reaches an accuracy of 99.44% after 300 rounds. We conjecture that in addition to lowering communication costs, model averaging produces a regularization benefit similar to that achieved by dropout (Srivastava et al., 2014).
We are primarily concerned with generalization performance, but FedAvg is effective at optimizing the training loss as well, even beyond the point where test-set accuracy plateaus. We observed similar behavior for all three model classes, and present plots for the MNIST CNN in Figure 6 in Appendix A
For all three model classes, FedAvg converges to a higher test-set accuracy than the baseline FedSGD models, and the trend continues even beyond the plotted ranges. For example, for the CNN the B = ∞, E = 1 FedSGD model eventually reaches 99.22% accuracy after 1200 rounds (with no further improvement after 6000 rounds), while the B = 10, E = 20 FedAvg model reaches 99.44% accuracy after only 300 rounds.
The authors therefore conjecture that, in addition to lowering communication costs, model averaging produces a regularization benefit similar to that achieved by dropout.
Although generalization performance is the primary concern, FedAvg is also effective at optimizing the training loss, even beyond the point where test-set accuracy plateaus.
That is, we would expect that while one round of averaging might produce a reasonable model, additional rounds of communication (and averaging) would not produce further improvements.
For large E, the current global model parameters only influence the optimization performed in each ClientUpdate through initialization. As E → ∞, at least for convex problems, the global minimum would eventually be reached regardless of initialization; for non-convex problems, the algorithm would converge to the same local minimum as long as the initializations lie in the same basin.
For some models, especially in the later stages of convergence, it may be useful to decay the amount of local computation per round (moving to smaller E or larger B) in the same way that decaying the learning rate can be useful; however, even for large values of E, no significant degradation in the convergence rate was observed.
Other Findings
Running SGD and FedAvg with a minibatch size of B = 50 lets accuracy be viewed as a function of the number of minibatch gradient computations. One might expect SGD to do better here, because a sequential step is taken after every minibatch computation. However, as Figure 9 shows, for suitable values of C and E, FedAvg makes similar progress per minibatch computation. Furthermore, when both SGD and FedAvg use only one client per round (green curves), accuracy fluctuates significantly, whereas averaging over more clients smooths this out (yellow curves).
FedAvg vs. FedSGD
The paper shows monotonic learning curves for the best learning rates. FedSGD with η = 0.4 requires 310 rounds to reach 8.1% accuracy, while FedAvg with η = 18.0 reaches 8.5% accuracy after only 20 rounds (roughly 15x fewer rounds than FedSGD).
FedAvg's accuracy is also far less sensitive to the choice of learning rate.
Outlook
FedAvg can train high-quality models using relatively few rounds of communication, demonstrating that federated learning is practical.
Although federated learning already offers many privacy advantages, combining it with differential privacy, secure multiparty computation, or both is a direction for future work.
FedMD: Heterogenous Federated Learning via Model Distillation
Approach
Compared with FedAvg, this approach looks relatively simple and builds on the idea of model distillation.
It allows multiple clients to train different model architectures locally.
The key ingredient is the existence of a public dataset.
Before a participant starts the collaboration phase, its model must first undergo the entire transfer learning process. It will be trained fully on the public dataset and then on its own private data. Therefore any future improvements are compared to this baseline.
Before collaborative training begins, every client first completes a transfer-learning pre-training phase: its model is trained fully on the public dataset and then on its own private data. Any future improvement is measured against this baseline.
We re-purpose the public dataset D0 as the basis of communication between models, which is realized using knowledge distillation. Each learner f_k expresses its knowledge by sharing the class scores f_k(x_i^0) computed on the public dataset D0. The central server collects these class scores and computes their average f̃(x_i^0). Each party then trains f_k to approach the consensus f̃(x_i^0). In this way, the knowledge of one participant can be understood by others without explicitly sharing its private data or model architecture. Using the entire large public dataset can cause a large communication burden. In practice, the server may randomly select a much smaller subset d_j ⊂ D0 at each round as the basis of communication. In this way, the cost is under control and does not scale with the complexity of participating models.
After pre-training, collaborative training proceeds in rounds. In each round, the server takes a small subset of the public dataset as the basis of communication.
Each client then computes class scores (the raw output logits, before the final activation) on this subset with its own model and uploads them to the server.
The server averages these score vectors into a consensus for all clients and sends the consensus back to each client; each client then adjusts its weights so that its outputs move toward the consensus, as sketched below.
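A minimal Python sketch of one such collaboration round under these assumptions (the `Client` interface with `predict_logits`, `digest`, and `revisit_private` methods is hypothetical; the server only ever sees class scores on the public subset, never private data or model weights):

```python
# Sketch of one FedMD collaboration round. Each client may use a different
# architecture; the only shared interface is scoring the public subset and
# training toward the consensus. The Client method names are hypothetical.
import numpy as np

def fedmd_round(clients, x_public, subset_size, rng=None):
    rng = rng if rng is not None else np.random.default_rng()

    # Server picks a small random subset d_j of the public dataset D0.
    idx = rng.choice(len(x_public), size=subset_size, replace=False)
    x_subset = x_public[idx]

    # Communicate: every client shares its class scores (logits) on the subset.
    scores = np.stack([c.predict_logits(x_subset) for c in clients])

    # Aggregate: the server averages the scores into a consensus.
    consensus = scores.mean(axis=0)

    # Digest + Revisit: each client fits its outputs to the consensus on the
    # public subset, then trains for a few epochs on its own private data.
    for c in clients:
        c.digest(x_subset, consensus)
        c.revisit_private()
```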
Experimental Results
Figure 2: FedMD improves the test accuracy of participating models beyond their baselines. A dashed line (on the left) represents the test accuracy of a model after full transfer learning with the public dataset and its own small private dataset. This baseline is our starting point and overlaps with the beginning of the corresponding learning curve. A dash-dot line (on the right) represents the would-be performance of a model if private datasets from all participants were declassified and made available to every participant of the group.
FedMD improves the test accuracy of the participating models beyond their baselines. The dashed line (on the left) is the test accuracy of a model after full transfer learning with the public dataset and its own small private dataset; this baseline is the starting point and overlaps with the beginning of the corresponding learning curve. The dash-dot line (on the right) is the would-be performance of a model if the private datasets of all participants were declassified and made available to every participant in the group.