TensorFlow is an easy-to-use deep learning tool that can handle the vast majority of our day-to-day tasks. Generally speaking, a deep learning program goes through the following steps (a minimal single-machine sketch follows this list):
1. Build the model
2. Define the update operation train_op
3. Prepare the data
4. Run train_op on the specified device
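For reference, here is a minimal single-machine sketch of these four steps in TF 1.x style. The toy linear model and random data are made up purely for illustration; the rest of this post is about spreading the same steps across machines.

# coding:utf-8
import numpy as np
import tensorflow as tf

# 1. Build the model: a single linear layer
x = tf.placeholder(tf.float32, [None, 3], name='x')
y = tf.placeholder(tf.float32, [None, 1], name='y')
w = tf.Variable(tf.random_normal([3, 1], stddev=0.01), name='w')
b = tf.Variable(tf.zeros([1]), name='b')
pred = tf.matmul(x, w) + b

# 2. Define the update op
cost = tf.reduce_mean(tf.square(pred - y))
train_op = tf.train.AdamOptimizer(0.01).minimize(cost)

# 3. Prepare the data (toy data, just for the sketch)
data_x = np.random.rand(64, 3).astype(np.float32)
data_y = data_x.sum(axis=1, keepdims=True)

# 4. Run train_op on the chosen device (here simply the default device)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        _, cost_value = sess.run([train_op, cost], {x: data_x, y: data_y})
    print(cost_value)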
However, there are scenarios where a single machine is no longer enough; when you run into them, it may be time to try distributed TensorFlow.
Below is a brief introduction to distributed TensorFlow, with a small demo I implemented at the end.
On April 14, 2016, Google released distributed TensorFlow, and distributed machine learning has gradually become more widely known and used since then. On June 8, 2017, Facebook published a paper showing that, with 256 GPUs across 32 servers, they cut the training time of a convolutional neural network (ResNet-50) from two weeks to one hour. In their distributed implementation, the learning rate is scaled in proportion to the mini-batch size.
The future of AI lies in computation that scales elastically.
— Rich Sutton, the father of reinforcement learning
A distributed system generally involves the concepts of Cluster, Job, and Task.
A Cluster is the set of machines used for a given training job.
A Cluster contains one or more machines, and each machine can run one or more Jobs. Jobs come in two types: Parameter Server (PS) and Worker. A PS stores the model parameters and applies the gradient updates, while the Workers run the actual computation and produce the gradients.
Each Job in turn is made up of one or more Tasks; a Task is a single server process, identified by its job name and task index.
Below is a Python snippet that describes a Cluster:
cluster = tf.train.ClusterSpec({
    'ps': [
        '10.13.75.134:5000',  # /job:ps/task:0
        '10.13.75.134:5001',  # /job:ps/task:1
    ],
    'worker': [
        '10.13.75.134:5002',  # /job:worker/task:0
        '10.13.75.134:5003',  # /job:worker/task:1
    ]
})
In-graph mode: split the model into different parts and run them on different machines
In In-graph mode, the computation is spread across multiple GPU nodes: a server is started on each node and exposed to the other machines on the same LAN. When building the graph, you then only need to specify which machine an operation should run on:
with tf.device('/job:worker/task:n'):
    output = tf.nn.xw_plus_b(x, w, b)
The usage is similar to multi-GPU training on a single machine, but the data is still dispatched from a single node, which can noticeably slow down training when the data volume is large.
Below is a simple demo. The first script just starts the servers:
# coding:utf-8
import tensorflow as tf

ps_hosts = ['10.13.75.67:5000', '10.13.75.67:5001']
worker_hosts = ['10.13.75.67:5002', '10.13.75.67:5003']
cluster = tf.train.ClusterSpec({'ps': ps_hosts, 'worker': worker_hosts})

tf.app.flags.DEFINE_string('job_name', 'worker', 'one of ps, worker')
tf.app.flags.DEFINE_integer('task_index', 0, 'index of task within the job')
FLAGS = tf.app.flags.FLAGS


def main(_):
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    # Start the server for this task and block: in In-graph mode every
    # task (ps or worker) just waits for a client to drive the computation.
    server = tf.train.Server(cluster,
                             job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index,
                             config=config)
    server.join()


if __name__ == '__main__':
    tf.app.run()
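Each task is then started in its own process. For example, assuming the script above is saved as start_server.py (the file name is only for illustration), the first PS would be launched with python start_server.py --job_name=ps --task_index=0 and the second worker with python start_server.py --job_name=worker --task_index=1. All four servers then just sit and wait; the next script is the client that connects to the cluster and drives the actual computation: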
# coding:utf-8
import tensorflow as tf
import numpy as np
import pickle


def build_net(input_x):
    """Train a simple linear transformation from x to y.

    :param input_x: a (3, 3) input tensor
    :return: a (4, 4) output tensor
    """
    # Pin the whole model to worker 0; in a real model different parts
    # could be pinned to different workers.
    with tf.device('/job:worker/task:0'):
        w_1 = tf.Variable(tf.random_normal((3, 4), stddev=0.01), name='w1')
        b_1 = tf.Variable(tf.constant(0., shape=(3, 4)), name='b1')
        x = tf.matmul(input_x, w_1) + b_1
        w_2 = tf.Variable(tf.random_normal((4, 3), stddev=0.01), name='w2')
        b_2 = tf.Variable(tf.constant(0., shape=(4, 4)), name='b2')
        x = tf.matmul(w_2, x) + b_2
    return x


def main():
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    # The client connects to one of the running servers and drives the graph.
    with tf.Session('grpc://10.13.75.67:5000', config=config) as sess:
        x = tf.placeholder(tf.float32, (3, 3), name='input_x')
        y = tf.placeholder(tf.float32, (4, 4), name='label_y')
        output_ = build_net(x)
        cost = tf.reduce_mean(tf.square(output_ - y))
        train_op = tf.train.AdamOptimizer().minimize(cost)

        input_x = np.arange(90).reshape((10, 3, 3))
        with open('label.pkl', 'rb') as f:
            output_y = pickle.load(f)

        sess.run(tf.global_variables_initializer())
        for i in range(10):
            cost_value, _ = sess.run([cost, train_op],
                                     {x: input_x[i, :, :].reshape((3, 3)),
                                      y: output_y[i, :, :].reshape((4, 4))})
            print(cost_value)


if __name__ == '__main__':
    main()
Between-graph mode: data parallelism, with every machine running exactly the same computation graph
In this mode, the graph on every node is identical, but each node can process different data. There is no longer a single node responsible for dispatching the data, which makes this approach well suited to training on large datasets.
The core idea of distributed TF is to compute the gradients on multiple nodes and then pass them to the parameter server; synchronous and asynchronous updates differ only in how the gradients computed by the worker nodes are handed to the parameter server. Both In-graph and Between-graph modes support both update modes.
devices = {
    'ps': ['10.13.75.134:5000', '10.13.75.134:5001'],
    'worker': ['10.13.75.134:5002', '10.13.75.134:5003']
}
cluster = tf.train.ClusterSpec(devices)

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index, config=config)
Here we define the server. A server corresponds to a task, and a task is uniquely identified by its job_name and task_index.
For a parameter server (or for any server in In-graph mode), once the server is defined we simply block on it:
server.join()
After that it just waits passively to be called.
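Putting the two roles together, each task typically branches on its job_name. A minimal sketch follows; build_graph_and_train() is just a placeholder name for the worker-side code shown next:

if job_name == 'ps':
    server.join()  # parameter servers only hold variables and wait
else:
    # workers build the (identical) graph and run the training loop
    build_graph_and_train()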
with tf.device(tf.train.replica_device_setter(worker_device=device, cluster=cluster)):
    input_ = tf.placeholder(tf.float32, [None, 28, 28, 3], 'input_name')
    label_ = tf.placeholder(tf.float32, [None, 10], 'label_data')
    output_ = le_net(input_, 10, 'worker')
    acc = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(output_, 1), tf.argmax(label_, 1)), tf.float32), name='acc')
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=label_, logits=output_), name='cost')
    global_step = tf.train.get_or_create_global_step()
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
    train_op = optimizer.minimize(cost, global_step)
Here tf.train.replica_device_setter() places the model parameters defined inside the block on the PS tasks, while keeping the graph operations on the worker. Asynchronous updates are used, so AdamOptimizer() is applied directly.
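For completeness, here is a hedged sketch of what the synchronous variant would typically look like (not the code I ran): TensorFlow 1.x provides tf.train.SyncReplicasOptimizer, which wraps an ordinary optimizer, aggregates the gradients from all workers before applying them, and supplies a hook to pass to MonitoredTrainingSession. The num_workers value is an assumption for the sketch.

num_workers = 2  # assumed number of worker replicas
optimizer = tf.train.SyncReplicasOptimizer(
    tf.train.AdamOptimizer(learning_rate=0.001),
    replicas_to_aggregate=num_workers,   # wait for this many gradients per step
    total_num_replicas=num_workers)
train_op = optimizer.minimize(cost, global_step=global_step)
# This hook initialises the synchronisation queues; add it to the hooks list
# given to MonitoredTrainingSession below.
sync_hook = optimizer.make_session_run_hook(is_chief=(task_index == 0))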
hooks = [tf.train.StopAtStepHook(num_steps=804)]
with tf.train.MonitoredTrainingSession(master=server.target,
                                       # checkpoint_dir='checkpoint',
                                       hooks=hooks,
                                       # save_checkpoint_secs=600,
                                       config=config,
                                       is_chief=True) as sess:
    print('start task_index:%d' % task_index)
    x_data, y_data, x_test, y_test, batch_num = get_data(task_index, 128, lock_)
    train_acc = []
    train_cost = []
    val_acc = []
    val_cost = []
    total_step = 0
    while not sess.should_stop():
        x_minibatch = x_data[total_step % batch_num]
        y_minibatch = y_data[total_step % batch_num]
        acc_value, cost_value, _ = sess.run([acc, cost, train_op],
                                            {input_: x_minibatch, label_: y_minibatch})
        train_acc.append((total_step, acc_value))
        train_cost.append((total_step, cost_value))
        print('task_index %d, step %d, train_acc %4f, train_cost %4f' %
              (task_index, total_step, acc_value, cost_value))
        if total_step % 100 == 0:
            val_acc_value, val_cost_value = sess.run([acc, cost],
                                                     {input_: x_test[:256], label_: y_test[:256]})
            val_acc.append((total_step, val_acc_value))
            val_cost.append((total_step, val_cost_value))
            print('task_index %d, step %d, train_acc %4f, train_cost %4f, val_acc %4f, val_cost %4f' %
                  (task_index, total_step, acc_value, cost_value, val_acc_value, val_cost_value))
        total_step += 1
The session is created with tf.train.MonitoredTrainingSession(), and the usual computation is then run inside it.
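The checkpoint-related arguments are commented out in the snippet above. If you wanted them, a plausible configuration, under the assumption that only task 0 acts as chief and writes checkpoints, would look like this:

with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=(task_index == 0),   # only the chief saves
                                       checkpoint_dir='checkpoint',  # assumed path
                                       save_checkpoint_secs=600,
                                       hooks=hooks,
                                       config=config) as sess:
    ...  # same training loop as above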
Below is a simple demo I wrote that uses multiple threads to simulate the cluster:
# coding:utf-8
import tensorflow as tf
import h5py
import numpy as np
import threading
import time
import pickle
import os
import matplotlib.pyplot as plt
from model import le_net
def get_data(task_index, batch_size, lock_):
    """Load the slice of the training data that belongs to this task.

    :param task_index: index of the worker task
    :param batch_size: mini-batch size
    :param lock_: threading lock guarding the shared HDF5 file
    :return: mini-batches, test data, and the number of batches
    """
    with lock_:
        with h5py.File('data_util/mnist.h5', 'r') as f:
            x_train = f['x_train'].value.astype(np.float32)
            y_train = f['y_train'].value
            x_test = f['x_test'].value.astype(np.float32)
            y_test = f['y_test'].value
    interval_len = 60000 // 6
    task_x = x_train[interval_len*task_index: interval_len*(task_index+1)]
    task_y = y_train[interval_len*task_index: interval_len*(task_index+1)]
    batch_num = task_x.shape[0] // batch_size
    x_mini_batch = [task_x[batch_size*i: batch_size*(i+1)] for i in range(batch_num)]
    y_mini_batch = [task_y[batch_size*i: batch_size*(i+1)] for i in range(batch_num)]
    return x_mini_batch, y_mini_batch, x_test, y_test, batch_num
def work(devices, job_name, task_index, lock_, worker_num):
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    cluster = tf.train.ClusterSpec(devices)
    server = tf.train.Server(cluster, job_name=job_name, task_index=task_index, config=config)
    if job_name == 'ps':
        # Parameter servers only hold the variables; block and wait to be used.
        server.join()
    else:
        device = '/job:worker/task:' + str(task_index)
        # Variables go to the PS tasks, ops stay on this worker.
        with tf.device(tf.train.replica_device_setter(worker_device=device, cluster=cluster)):
            input_ = tf.placeholder(tf.float32, [None, 28, 28, 3], 'input_name')
            label_ = tf.placeholder(tf.float32, [None, 10], 'label_data')
            output_ = le_net(input_, 10, 'worker')
            acc = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(output_, 1), tf.argmax(label_, 1)), tf.float32), name='acc')
            cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=label_, logits=output_), name='cost')
            global_step = tf.train.get_or_create_global_step()
            optimizer = tf.train.AdamOptimizer(learning_rate=0.0002)
            train_op = optimizer.minimize(cost, global_step)
        hooks = [tf.train.StopAtStepHook(num_steps=804)]
        with tf.train.MonitoredTrainingSession(master=server.target,
                                               # checkpoint_dir='checkpoint',
                                               hooks=hooks,
                                               # save_checkpoint_secs=600,
                                               config=config,
                                               is_chief=True) as sess:
            print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
            print('start task_index:%d' % task_index)
            x_data, y_data, x_test, y_test, batch_num = get_data(task_index, 128, lock_)
            train_acc = []
            train_cost = []
            val_acc = []
            val_cost = []
            total_step = 0
            while not sess.should_stop():
                x_minibatch = x_data[total_step % batch_num]
                y_minibatch = y_data[total_step % batch_num]
                acc_value, cost_value, _ = sess.run([acc, cost, train_op],
                                                    {input_: x_minibatch, label_: y_minibatch})
                train_acc.append((total_step, acc_value))
                train_cost.append((total_step, cost_value))
                print('task_index %d, step %d, train_acc %4f, train_cost %4f' %
                      (task_index, total_step, acc_value, cost_value))
                if total_step % 100 == 0:
                    val_acc_value, val_cost_value = sess.run([acc, cost],
                                                             {input_: x_test[:256], label_: y_test[:256]})
                    val_acc.append((total_step, val_acc_value))
                    val_cost.append((total_step, val_cost_value))
                    print('task_index %d, step %d, train_acc %4f, train_cost %4f, val_acc %4f, val_cost %4f' %
                          (task_index, total_step, acc_value, cost_value, val_acc_value, val_cost_value))
                total_step += 1
                # Dump the metrics collected on worker 0 after 200 steps.
                if task_index == 0 and total_step == 200:
                    current_time = int(time.time())
                    with open('result/task_index-%d-worker_num-%d-time-%d-train-acc.pkl' % (task_index, worker_num, current_time), 'wb') as f:
                        pickle.dump(train_acc, f)
                    with open('result/task_index-%d-worker_num-%d-time-%d-train-lost.pkl' % (task_index, worker_num, current_time), 'wb') as f:
                        pickle.dump(train_cost, f)
                    with open('result/task_index-%d-worker_num-%d-time-%d-val-acc.pkl' % (task_index, worker_num, current_time), 'wb') as f:
                        pickle.dump(val_acc, f)
                    with open('result/task_index-%d-worker_num-%d-time-%d-val-lost.pkl' % (task_index, worker_num, current_time), 'wb') as f:
                        pickle.dump(val_cost, f)
if __name__ == '__main__':
    lock = threading.Lock()
    devices_ = {
        'ps': ['10.13.75.116:5000', '10.13.75.116:5001'],
        'worker': ['10.13.75.116:5002', '10.13.75.116:5003', '10.13.75.116:5004', '10.13.75.116:5005']
    }
    task = [('ps', 0), ('ps', 1), ('worker', 0), ('worker', 1), ('worker', 2), ('worker', 3)]
    worker_num = 1
    if (worker_num + 2) < 6:
        task = task[:worker_num + 2]
        devices_['worker'] = devices_['worker'][:worker_num]
    # One thread per task; the PS threads never return, so only join the workers.
    ps = [threading.Thread(target=work, args=(devices_, i, j, lock, worker_num)) for i, j in task]
    [p.start() for p in ps]
    [p.join() for p in ps[2:]]
I have not yet managed to get the synchronous update mode working; in asynchronous mode, I trained on the MNIST data with worker_num set to 1, 2, 3 and 4 and compared the results.
References:
https://blog.csdn.net/hjimce/article/details/61197190
https://www.oreilly.com.cn/ideas/?p=1395