Today I'll write a bit about TensorFlow distributed training with the PS-worker architecture.
PS: runs the model-related jobs, including storing, distributing, aggregating, and updating the model parameters.
Worker: runs the training-related jobs, including the forward (inference) computation and the gradient computation.
The workflow of distributed training under the PS-worker architecture:
1. pull: each worker pulls the latest model parameters from the PS according to the topology of the dataflow graph
2. feed: each worker feeds in a different batch of data
3. compute: each worker computes gradients with the same model parameters but its own batch, so the workers produce different gradient values
4. push: each worker uploads its computed gradients to the PS
5. update: the PS collects the gradients from all workers, averages them, and updates the model parameters.
Writing the distributed program involves three parts:
1. Create the cluster; if GPUs are available, place the ops on them.
2. Build the distributed dataflow graph, including the forward and backward graphs, the model parameters and placeholders, and the synchronous optimizer with its sync token queue (an asynchronous optimizer is also an option).
3. Use a Supervisor to manage the distributed session.
Below is a walkthrough of the distributed code:
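The snippets below rely on a handful of command-line flags (FLAGS.job_name, FLAGS.task_index, FLAGS.num_gpus, FLAGS.learning_rate, FLAGS.replicas_to_aggregate, FLAGS.sync_replicas) as well as the helper values num_workers and is_chief, none of which the original snippets show. A minimal sketch of how they might be defined (the defaults are placeholders, not values from the original):
import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string("job_name", "", "either 'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "index of the task within its job")
flags.DEFINE_integer("num_gpus", 0, "number of GPUs on this worker machine")
flags.DEFINE_float("learning_rate", 0.01, "learning rate")
flags.DEFINE_integer("replicas_to_aggregate", None,
                     "gradients to aggregate before updating (defaults to the number of workers)")
flags.DEFINE_boolean("sync_replicas", True, "use synchronous replica training")
FLAGS = flags.FLAGS

num_workers = 2                     # len(workers) in the cluster spec below
is_chief = (FLAGS.task_index == 0)  # worker task 0 acts as the chief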
1. Create the cluster and set up the GPUs
1.1 Create the cluster
ps_spec = ["192.168.56.101:20003"]
workers = ["192.168.56.102:20004", "192.168.56.103:20005"]
cluster = tf.train.ClusterSpec({"ps": ps_spec, "worker": workers})
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)
if FLAGS.job_name == "ps":
    server.join()
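server.join() blocks forever, so the PS process does nothing but serve variables, while the worker processes fall through and build the training graph. One process is started per address in the cluster spec; assuming the script is saved as trainer.py (the file name is illustrative), the three processes could be launched like this:
# on 192.168.56.101
python trainer.py --job_name=ps --task_index=0
# on 192.168.56.102
python trainer.py --job_name=worker --task_index=0
# on 192.168.56.103
python trainer.py --job_name=worker --task_index=1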
1.2 Set the GPU device on each worker
if FLAGS.num_gpus > 0:
    gpu = (FLAGS.task_index % FLAGS.num_gpus)  # pick a GPU device id for this task
    worker_device = "/job:worker/task:%d/gpu:%d" % (FLAGS.task_index, gpu)
elif FLAGS.num_gpus == 0:
    cpu = 0
    worker_device = "/job:worker/task:%d/cpu:%d" % (FLAGS.task_index, cpu)
with tf.device(
        tf.train.replica_device_setter(
            worker_device=worker_device,
            ps_device="/job:ps/cpu:0",
            cluster=cluster)):
.....
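tf.train.replica_device_setter returns a device function for this with block: every tf.Variable created inside it is placed on the ps job, while the remaining ops stay on worker_device. Each worker therefore builds an identical graph but shares a single copy of the parameters stored on the PS, which is what makes the pull/push steps described above work. A tiny illustration (the names v and doubled are made up, not from the original):
    v = tf.Variable(tf.zeros([10]), name="v")      # a variable -> placed on /job:ps/cpu:0
    doubled = tf.multiply(v, 2.0, name="doubled")  # an ordinary op -> placed on worker_device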
2. Build the dataflow graph
# Everything below goes inside the with tf.device(replica_device_setter(...)) scope from section 1.
# trainable=False means global_step is not updated by training
global_step = tf.Variable(0, name="global_step", trainable=False)
....build the model here
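# (A hypothetical minimal model, shown only so that the cross_entropy used by the
#  optimizer below has a concrete definition; your real model replaces the "...." above.)
x = tf.placeholder(tf.float32, [None, 784], name="x")
y_ = tf.placeholder(tf.float32, [None, 10], name="y_")
w = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
logits = tf.matmul(x, w) + b
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))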
# build the optimizer
opt = tf.train.AdamOptimizer(FLAGS.learning_rate)
# set up the synchronous optimizer
# 1. set the number of gradient replicas to aggregate (replicas_to_aggregate)
if FLAGS.replicas_to_aggregate is None:
    replicas_to_aggregate = num_workers
else:
    replicas_to_aggregate = FLAGS.replicas_to_aggregate
# wrap the optimizer in a synchronous-replicas optimizer
opt = tf.train.SyncReplicasOptimizer(
    opt,
    replicas_to_aggregate=replicas_to_aggregate,
    total_num_replicas=num_workers,
    name="sync_replicas")
train_step = opt.minimize(cross_entropy, global_step=global_step)
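# As mentioned in step 2 above, asynchronous training is also possible: skip the
# SyncReplicasOptimizer wrapper and call minimize on the plain AdamOptimizer instead.
# Each worker then pushes its gradients to the PS as soon as they are ready, without
# waiting for the other replicas, and the token queue below is not needed. Sketch:
#   train_step = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize(
#       cross_entropy, global_step=global_step)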
# prepare the sync token queue
chief_queue_runner = opt.get_chief_queue_runner()
sync_init_op = opt.get_init_tokens_op()
# prepare the init ops for the Supervisor
local_init_op = opt.local_step_init_op
if is_chief:
    local_init_op = opt.chief_init_op
ready_for_local_init_op = opt.ready_for_local_init_op
init_op = tf.global_variables_initializer()
3. The session layer
sv = tf.train.Supervisor(
    is_chief=is_chief,
    logdir=train_dir,
    init_op=init_op,
    local_init_op=local_init_op,
    ready_for_local_init_op=ready_for_local_init_op,
    recovery_wait_secs=1,
    global_step=global_step)
sess_config = tf.ConfigProto(
    allow_soft_placement=True,
    log_device_placement=False,
    device_filters=["/job:ps",
                    "/job:worker/task:%d" % FLAGS.task_index])
sess = sv.prepare_or_wait_for_session(server.target, config=sess_config)
# worker 0 (the chief) initializes the sync tokens and starts the sync token queue
if FLAGS.sync_replicas and is_chief:
    sess.run(sync_init_op)
    sv.start_queue_runners(sess, [chief_queue_runner])
local_step = 0
....from here on you can simply call sess.run to train
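For completeness, a minimal training-loop sketch; the batch source mnist.train.next_batch, the placeholders x and y_ (from the hypothetical model above), and the stopping condition are assumptions, not part of the original post:
train_steps = 10000  # assumed total number of global steps
while True:
    batch_xs, batch_ys = mnist.train.next_batch(100)
    _, step = sess.run([train_step, global_step],
                       feed_dict={x: batch_xs, y_: batch_ys})
    local_step += 1
    if step >= train_steps:
        break
sv.stop()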