《Communication-Efficient Learning of Deep Networks from Decentralized Data》
Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data center and training there using conventional approaches. We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning.
We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets. These experiments demonstrate the approach is robust to the unbalanced and non-IID data distributions that are a defining characteristic of this setting. Communication costs are the principal constraint, and we show a reduction in required communication rounds by 10–100× as compared to synchronized stochastic gradient descent.
We investigate a learning technique that allows users to collectively reap the benefits of shared models trained from this rich data, without the need to centrally store it. We term our approach Federated Learning, since the learning task is solved by a loose federation of participating devices (which we refer to as clients) which are coordinated by a central server. Each client has a local training dataset which is never uploaded to the server. Instead, each client computes an update to the current global model maintained by the server, and only this update is communicated. This is a direct application of the principle of focused collection or data minimization proposed by the 2012 White House report on privacy of consumer data [39]. Since these updates are specific to improving the current model, there is no reason to store them once they have been applied.
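Google proposed a model-learning paradigm that does not require centralizing mobile-device data in a data center for training, which protects user privacy. The server sends the global model to the clients; each client trains on its local dataset and uploads the trained weights to the server, which uses them to update the global model.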
Federated Learning:
Ideal problems for federated learning have the following properties:
- Training on real-world data from mobile devices provides a distinct advantage over training on proxy data that is generally available in the data center.
- This data is privacy sensitive or large in size (compared to the size of the model), so it is preferable not to log it to the data center purely for the purpose of model training (in service of the focused collection principle).
- For supervised tasks, labels on the data can be inferred naturally from user interaction.
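Federated Optimization
3) Massively distributed: the number of clients participating in the optimization is larger than the average number of examples per client (I am not sure I have understood this point correctly);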
We assume a synchronous update scheme that proceeds in rounds of communication. There is a fixed set of K clients, each with a fixed local dataset. At the beginning of each round, a random fraction C of clients is selected, and the server sends the current global algorithm state to each of these clients (e.g., the current model parameters). We only select a fraction of clients for efficiency, as our experiments show diminishing returns for adding more clients beyond a certain point. Each selected client then performs local computation based on the global state and its local dataset, and sends an update to the server. The server then applies these updates to its global state, and the process repeats.
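The round structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`select_clients`, `run_round`, `toy_local_update`, `weighted_average`) are hypothetical, the "model" is a single float, and the server is assumed to aggregate with a data-size-weighted average.

```python
import random

def select_clients(num_clients, fraction, rng):
    """Pick max(C*K, 1) distinct client indices uniformly at random."""
    # round() guards against floating-point error in C*K (an assumption of this sketch).
    m = max(round(fraction * num_clients), 1)
    return rng.sample(range(num_clients), m)

def run_round(global_state, client_datasets, fraction, local_update, aggregate, rng):
    """One synchronous round: select clients, run local computation, aggregate."""
    selected = select_clients(len(client_datasets), fraction, rng)
    # Each selected client receives the global state and returns (n_k, update).
    updates = [(len(client_datasets[k]), local_update(global_state, client_datasets[k]))
               for k in selected]
    return aggregate(global_state, updates), selected

def toy_local_update(state, data):
    """Hypothetical local computation: move halfway toward the local mean."""
    local_mean = sum(data) / len(data)
    return state + 0.5 * (local_mean - state)

def weighted_average(state, updates):
    """Hypothetical server step: average client results weighted by n_k."""
    total = sum(n for n, _ in updates)
    return sum(n * w for n, w in updates) / total

rng = random.Random(0)
datasets = [[1.0, 1.0], [3.0]]  # two toy clients; raw data never leaves them
new_state, chosen = run_round(0.0, datasets, 1.0, toy_local_update, weighted_average, rng)
```

Only the client results and their sizes reach `aggregate`; the raw `datasets` entries are read solely inside `local_update`, mirroring the "only this update is communicated" idea.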
$$\min_{w \in \mathbb{R}^{d}} f(w) \quad \text{where} \quad f(w) \stackrel{\text{def}}{=} \frac{1}{n} \sum_{i=1}^{n} f_{i}(w).$$
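For a machine learning problem, we typically take $f_{i}(w)=\ell(x_{i}, y_{i}; w)$, that is, the loss incurred when the model with parameters $w$ predicts on the example $(x_{i}, y_{i})$.
Now assume there are $K$ clients, where $\mathcal{P}_{k}$ is the set of indices of the data points held by client $k$ and $n_{k}=|\mathcal{P}_{k}|$ is the number of examples on that client. The objective above can then be rewritten as: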
$$f(w)=\sum_{k=1}^{K} \frac{n_{k}}{n} F_{k}(w) \quad \text{where} \quad F_{k}(w)=\frac{1}{n_{k}} \sum_{i \in \mathcal{P}_{k}} f_{i}(w)$$
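A quick numerical check of this decomposition may help. The sketch below uses made-up per-example losses and a made-up, unbalanced partition (both are assumptions for illustration) to show that the $n_k/n$-weighted sum of the client objectives $F_k$ recovers the global objective $f$, while a plain unweighted mean of the $F_k$ does not.

```python
def global_objective(losses):
    """f(w): mean of the per-example losses f_i(w) over all n examples."""
    return sum(losses) / len(losses)

def client_objective(losses, indices):
    """F_k(w): mean loss over the examples held by one client."""
    return sum(losses[i] for i in indices) / len(indices)

def federated_objective(losses, partition):
    """sum_k (n_k / n) * F_k(w) over all clients."""
    n = len(losses)
    return sum(len(p) / n * client_objective(losses, p) for p in partition)

# Hypothetical values of f_i(w) at some fixed w, and an unbalanced partition P_k.
losses = [0.5, 1.5, 2.0, 4.0, 1.0, 6.0]
partition = [[0, 1, 2, 3], [4], [5]]

f_global = global_objective(losses)                 # 15 / 6 = 2.5
f_federated = federated_objective(losses, partition)
f_unweighted = sum(client_objective(losses, p) for p in partition) / len(partition)
```

Here `f_federated` matches `f_global` up to floating-point rounding, whereas `f_unweighted` is 3.0, which is why the weights $n_k/n$ matter when client datasets are unbalanced.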
If the partition $\mathcal{P}_{k}$ is formed by distributing the training examples over the clients uniformly at random, we call this the IID setting, and in that case:
$$\mathbb{E}_{\mathcal{P}_{k}}\left[F_{k}(w)\right]=f(w)$$
C, the fraction of clients that perform computation on each round;
E, the number of training passes each client makes over its local dataset on each round;
B, the local minibatch size used for the client updates.
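B is the minibatch size a client uses for its local computation within a round; in FedSGD, $B=\infty$, meaning the full local dataset is treated as a single minibatch.
FedSGD is just a special case of FedAvg: with $E=1$ and $B=\infty$, FedAvg is equivalent to FedSGD.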
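To make the roles of E and B concrete, here is a minimal sketch on a toy one-dimensional least-squares model. The model, the learning rate, the data, and all function names are assumptions chosen for illustration, and every client participates in the round (C = 1) to keep the comparison simple. It checks that one FedAvg round with E = 1 and B = ∞ gives the same result as one FedSGD step.

```python
import math
import random

def grad(w, batch):
    """Mean gradient of 0.5 * (w*x - y)**2 over a batch (toy 1-D model)."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def client_update(w, data, epochs, batch_size, lr, rng):
    """Run E local epochs of minibatch SGD; batch_size=math.inf means full batch."""
    b = len(data) if batch_size == math.inf else int(batch_size)
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)  # reshuffle local data every epoch
        for start in range(0, len(order), b):
            w = w - lr * grad(w, order[start:start + b])
    return w

def fedavg_round(w, client_data, epochs, batch_size, lr, rng):
    """Average the locally trained weights, weighted by n_k / n."""
    n = sum(len(d) for d in client_data)
    return sum(len(d) / n * client_update(w, d, epochs, batch_size, lr, rng)
               for d in client_data)

def fedsgd_round(w, client_data, lr):
    """One global step on the n_k / n weighted average of client gradients."""
    n = sum(len(d) for d in client_data)
    g = sum(len(d) / n * grad(w, d) for d in client_data)
    return w - lr * g

# Hypothetical (x, y) pairs on two clients of different sizes.
client_data = [[(1.0, 2.0), (2.0, 4.1)], [(3.0, 5.9)]]
w_fedavg = fedavg_round(0.0, client_data, 1, math.inf, 0.05, random.Random(0))
w_fedsgd = fedsgd_round(0.0, client_data, 0.05)
```

The two results agree up to floating-point rounding because, with a single full-batch step per client, averaging the updated weights is the same as applying the averaged gradient. Raising `epochs` or lowering `batch_size` adds local computation per round without adding communication.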
- MNIST_2NN: a multilayer perceptron with two hidden layers of 200 units each, using ReLU activations (199,210 total parameters).
- CNN: a CNN with two 5x5 convolution layers (the first with 32 channels, the second with 64, each followed by 2x2 max pooling), a fully connected layer with 512 units and ReLU activation, and a final softmax output layer (1,663,370 total parameters).
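The two parameter counts can be reproduced with simple arithmetic. The sketch below assumes 28x28 single-channel MNIST inputs, 10 output classes, a bias term in every layer, and, for the CNN, padding that preserves spatial size so that each 2x2 pooling halves it (28 to 14 to 7). These details are not spelled out in the text above and are assumptions of this example.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def conv_params(kernel, c_in, c_out):
    """Weights plus biases of a square-kernel convolution layer."""
    return kernel * kernel * c_in * c_out + c_out

# MNIST_2NN: 784 -> 200 -> 200 -> 10
mnist_2nn = dense_params(784, 200) + dense_params(200, 200) + dense_params(200, 10)

# CNN: conv5x5(1->32), pool, conv5x5(32->64), pool, dense(7*7*64 -> 512), dense(512 -> 10)
flat = 7 * 7 * 64  # assumes size-preserving padding, so 28 -> 14 -> 7 after pooling
cnn = (conv_params(5, 1, 32) + conv_params(5, 32, 64)
       + dense_params(flat, 512) + dense_params(512, 10))

print(mnist_2nn)  # 199210
print(cnn)        # 1663370
```

Both totals match the figures quoted in the list, which suggests the stated assumptions about padding and biases are consistent with the models described.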
Two ways of partitioning the MNIST data over clients are studied: IID, where the data is shuffled, and then partitioned into 100 clients each receiving 600 examples, and Non-IID, where we first sort the data by digit label, divide it into 200 shards of size 300, and assign each of 100 clients 2 shards.
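The two partitioning schemes can be sketched as follows. To stay self-contained the example uses synthetic, perfectly balanced labels (6,000 per digit) rather than the real MNIST labels, and the function names are hypothetical; with real data the class counts differ slightly, so a few shards would straddle two digits.

```python
import random

def partition_iid(num_examples, num_clients, rng):
    """Shuffle all indices, then give each client an equal-sized slice."""
    idx = list(range(num_examples))
    rng.shuffle(idx)
    size = num_examples // num_clients
    return [idx[k * size:(k + 1) * size] for k in range(num_clients)]

def partition_non_iid(labels, num_clients, shards_per_client, rng):
    """Sort by label, cut into equal shards, give each client a few shards."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    num_shards = num_clients * shards_per_client
    shard_size = len(labels) // num_shards
    shards = [order[s * shard_size:(s + 1) * shard_size] for s in range(num_shards)]
    rng.shuffle(shards)  # random assignment of shards to clients
    return [sum(shards[k * shards_per_client:(k + 1) * shards_per_client], [])
            for k in range(num_clients)]

# Synthetic stand-in for the 60,000 MNIST training labels (assumption: balanced).
labels = [i % 10 for i in range(60000)]
rng = random.Random(0)
iid_clients = partition_iid(len(labels), 100, rng)
non_iid_clients = partition_non_iid(labels, 100, 2, rng)

# Largest number of distinct digits seen by any single non-IID client.
digits_per_client = max(len({labels[i] for i in c}) for c in non_iid_clients)
```

With these settings every client holds 600 examples in both schemes, but a non-IID client sees at most two digits, which is what makes this partition a demanding test of the averaging approach.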
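With $B=\infty$, increasing the number of participating clients brings only a small advantage. With $B=10$, convergence is noticeably faster once $C \geq 0.1$, so the subsequent experiments use $C=0.1$, which strikes a balance between computational efficiency and convergence speed.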
The expected number of local updates per client per round is:
$$u=\left(\mathbb{E}\left[n_{k}\right] / B\right) E=n E /(K B)$$
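As a worked example, the snippet below evaluates $u$ for the MNIST setup described earlier ($n = 60{,}000$ examples over $K = 100$ clients). The helper name is hypothetical, and treating $B=\infty$ as one full-batch step per epoch (so $u = E$) is this sketch's reading of the notation rather than something the formula states literally.

```python
import math

def updates_per_round(n, K, E, B):
    """Expected local updates per client per round, u = n*E / (K*B)."""
    if B == math.inf:
        # Assumption: B = inf means one full-batch step per local epoch.
        return float(E)
    return n * E / (K * B)

n, K = 60000, 100
u_fedsgd = updates_per_round(n, K, 1, math.inf)  # 1.0
u_e1_b50 = updates_per_round(n, K, 1, 50)        # 12.0
u_e1_b10 = updates_per_round(n, K, 1, 10)        # 60.0
u_e5_b10 = updates_per_round(n, K, 5, 10)        # 300.0
u_e20_b10 = updates_per_round(n, K, 20, 10)      # 1200.0
```

Increasing E or decreasing B both raise $u$, that is, they add local computation per communication round.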