TensorFlow Training - Distributed Training

https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/distributed.md

http://sharkdtu.com/posts/dist-tf-evolution.html

http://download.tensorflow.org/paper/whitepaper2015.pdf

https://segmentfault.com/a/1190000008376957

PS-Worker: in-graph vs. between-graph replication
All-Reduce: naive -> half-ps (one worker acting as the gather/broadcast node) -> butterfly -> ring all-reduce
MirroredStrategy
MultiWorkerMirroredStrategy
ParameterServerStrategy

Comparison of all-reduce and PS-worker

https://zhuanlan.zhihu.com/p/50116885

1 Different distribution strategies

All-reduce

Sum the parameters (gradients) from all workers, then synchronize the result back to every machine.
https://zhuanlan.zhihu.com/p/79030485
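
A minimal NumPy sketch of the ring all-reduce variant mentioned above (worker count, chunking, and values are made up for illustration; the network exchange is simulated in memory):

```python
import numpy as np

def ring_allreduce(chunks):
    """Simulate ring all-reduce: chunks[i][c] is chunk c of the gradient
    held by worker i. Each step, every worker passes one chunk to its
    right-hand neighbour (emulated here with in-memory copies)."""
    n = len(chunks)
    # Phase 1: scatter-reduce. After n-1 steps, worker i holds the full
    # sum of chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]
    # Phase 2: all-gather. Circulate the reduced chunks so every worker
    # ends up with every summed chunk.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            chunks[(i + 1) % n][c] = chunks[i][c].copy()
    return chunks

# Toy example: 3 workers, gradient split into 3 chunks of one element each.
workers = [[np.array([float(w + c)]) for c in range(3)] for w in range(3)]
ring_allreduce(workers)  # every worker now holds [3.], [6.], [9.]
```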

MirroredStrategy

Supports synchronous distributed training on multiple GPUs on one machine.
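
A minimal MirroredStrategy sketch (TF 2.x Keras API; the toy model and random data are made up, only the strategy-scope pattern matters):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs by default
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created inside the scope are mirrored on every device.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="sgd", loss="mse")

x = tf.random.normal((256, 10))
y = tf.random.normal((256, 1))
model.fit(x, y, batch_size=64, epochs=1)  # gradients are all-reduced across replicas
```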

MultiWorkerMirroredStrategy

It implements synchronous distributed training across multiple workers, each with potentially multiple GPUs.
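
A hedged multi-worker sketch (TF 2.x; the host addresses and TF_CONFIG values are illustrative, and the exact strategy path varies between TF releases):

```python
import json
import os
import tensorflow as tf

# Each worker runs this same script with its own task index (0 here, 1 on
# the second machine); the addresses are hypothetical.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},
    "task": {"type": "worker", "index": 0},
})

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="sgd", loss="mse")
# model.fit(...) then trains synchronously; gradients are combined with a
# collective all-reduce across all GPUs of all workers.
```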

ParameterServerStrategy

Supports parameter-server training on multiple machines. In this setup, some machines are designated as workers and some as parameter servers. Each variable of the model is placed on one parameter server. Computation is replicated across all GPUs of all the workers.
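
A hedged TF 2.x sketch of parameter-server training (the API has changed across releases; the cluster is assumed to be described by TF_CONFIG, and the worker/ps tasks are assumed to already be running tf.distribute.Server processes):

```python
import tensorflow as tf

# Coordinator/chief process: resolve the cluster from TF_CONFIG and build
# the strategy. Each model variable lands on one ps task; computation is
# replicated across the workers' GPUs.
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="sgd", loss="mse")
# model.fit(...) dispatches training steps to the workers, which read and
# update the variables stored on the parameter servers.
```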

2 Evolution

http://sharkdtu.com/posts/dist-tf-evolution.html

1 Basic components

client/master/worker
A server (host:port) corresponds one-to-one to a task. A cluster is composed of servers, and a group of related tasks is called a job. Each server runs two services, a master service and a worker service; a client connects through a session to the master service of any server in the cluster, which then partitions the graph and dispatches tasks.
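
A TF 1.x-style sketch of these components (host names and ports are hypothetical): the ClusterSpec maps job names to task addresses, and each task is backed by one Server exposing a master service and a worker service.

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Run on each machine with its own job_name/task_index; a client can then
# open a session against any server's master service via "grpc://host:port".
server = tf.train.Server(cluster, job_name="worker", task_index=0)
server.join()  # block and serve (ps tasks typically only ever join())
```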

2 PS-based distributed TensorFlow programming model

The set of parameter server tasks forms the ps job
The set of worker tasks forms the worker job

Low-level distributed programming model
  • Startup
    Start 3 processes (two workers, one ps)
    Define a cluster
    Launch three servers
  • In-graph replication
    Model parameters live on the ps; replicas of different parts of the model live on different workers
    A single client is created to run the steps above; this suits single-machine multi-GPU setups, not large-scale multi-machine distributed training
  • Between-graph replication
    Each worker creates its own client; every client builds the same graph, and parameters are still placed on the ps. If one worker node dies, the rest of the system keeps running (see the sketch after this list)
    https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/distributed.md
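
A hedged TF 1.x sketch of the between-graph pattern from the linked guide (cluster addresses and the toy model are made up; each worker runs this script with its own job_name/task_index):

```python
import tensorflow as tf  # TF 1.x-style graph API, matching the linked guide

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0:2222"],
    "worker": ["worker0:2222", "worker1:2222"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# replica_device_setter pins variables to ps tasks (round-robin) and ops to
# this worker, so each client builds the same graph independently.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    w = tf.get_variable("w", shape=[10, 1])
    x = tf.placeholder(tf.float32, shape=[None, 10])
    y = tf.placeholder(tf.float32, shape=[None, 1])
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
    global_step = tf.train.get_or_create_global_step()
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=global_step)

# MonitoredTrainingSession handles init/recovery, so losing one worker does
# not stop training on the others.
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=True) as sess:
    pass  # sess.run(train_op, feed_dict=...) in a real training loop
```
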
High-level distributed programming model

Use the high-level Estimator and Dataset APIs
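
A hedged sketch of the Estimator/Dataset path (with TF_CONFIG set on every machine, train_and_evaluate wires up the ps/worker roles; the model_fn, input_fn, and model_dir below are made-up placeholders):

```python
import tensorflow as tf

def input_fn():
    ds = tf.data.Dataset.from_tensor_slices(
        (tf.random.normal([256, 10]), tf.random.normal([256, 1])))
    return ds.repeat().batch(32)

def model_fn(features, labels, mode):
    preds = tf.keras.layers.Dense(1)(features)
    loss = tf.reduce_mean(tf.square(preds - labels))
    optimizer = tf.compat.v1.train.GradientDescentOptimizer(0.01)
    train_op = optimizer.minimize(
        loss, global_step=tf.compat.v1.train.get_or_create_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir="/tmp/dist_model")
tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(input_fn=input_fn, max_steps=1000),
    tf.estimator.EvalSpec(input_fn=input_fn, steps=10))
```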

3 All-Reduce-based distributed TensorFlow architecture
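
This heading points at the collective-ops architecture behind the mirrored strategies; a hedged sketch of choosing the all-reduce implementation explicitly (the experimental enum and constructor arguments differ between TF 2.x releases):

```python
import tensorflow as tf

# Multi-machine: pick RING (gRPC ring all-reduce) or NCCL as the collective
# implementation used by MultiWorkerMirroredStrategy.
multi = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.RING)

# Single-machine, multi-GPU: choose the cross-device all-reduce algorithm.
mirrored = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
```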
