Chapter 12 Distributing TensorFlow Across Devices and Servers

Reading notes for O'Reilly's Hands-On Machine Learning with Scikit-Learn and TensorFlow

12.1 Multiple Devices on a Single Machine

12.1.1 Installation

Check GPU compatibility: https://developer.nvidia.com/cuda-gpus

Detailed instructions for setting up TensorFlow on an Amazon AWS GPU instance are available in Žiga Avsec’s helpful blog post.

Google also released a cloud service called Cloud Machine Learning to run TensorFlow graphs.

Tim Dettmers wrote a great blog post to help you choose, and he updates it fairly regularly.

You must then download and install the appropriate version of the CUDA (Compute Unified Device Architecture) and cuDNN (CUDA Deep Neural Network) libraries, and set a few environment variables so TensorFlow knows where to find CUDA and cuDNN.

You can use the nvidia-smi command to check that CUDA is properly installed. It lists the GPU cards, as well as processes running on each card:

$ nvidia-smi

Create an isolated environment using virtualenv (if you have not done so already), and activate it:

$ cd $ML_PATH # Your ML working directory (e.g., $HOME/ml)
$ source env/bin/activate

Install the GPU-enabled version of TensorFlow:

$ pip3 install --upgrade tensorflow-gpu

Now you can open up a Python shell and check that TensorFlow detects and uses CUDA and cuDNN properly by importing TensorFlow and creating a session:

>>> import tensorflow as tf
>>> sess=tf.Session()
2019-03-13 13:44:36.870279: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-03-13 13:44:38.591268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:03:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2019-03-13 13:44:38.591331: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-03-13 13:45:08.050299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-13 13:45:08.050344: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-03-13 13:45:08.050354: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-03-13 13:45:08.050730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10404 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)

12.1.2 Managing the GPU RAM

By default TensorFlow automatically grabs all the RAM in all available GPUs the first time you run a graph, so you will not be able to start a second TensorFlow program while the first one is still running.

To run each process on different GPU cards, the simplest option is to set the CUDA_VISIBLE_DEVICES environment variable so that each process only sees the appropriate GPU cards.

$ CUDA_VISIBLE_DEVICES=0,1 python3 program_1.py
# and in another terminal:
$ CUDA_VISIBLE_DEVICES=3,2 python3 program_2.py

Another option is to tell TensorFlow to grab only a fraction of the memory. For example, to make TensorFlow grab only 40% of each GPU’s memory, you must create a ConfigProto object, set its gpu_options.per_process_gpu_memory_fraction option to 0.4, and create the session using this configuration:

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config) 
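Yet another option (not shown in the notes above, but part of the standard ConfigProto API) is to let TensorFlow grab GPU memory only when it actually needs it, by setting gpu_options.allow_growth. A minimal sketch:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand
session = tf.Session(config=config)
# Note: once TensorFlow has grabbed memory it does not release it (to avoid fragmentation).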

12.1.3 Placing Operations on Devices

The TensorFlow whitepaper presents a friendly dynamic placer algorithm that automagically distributes operations across all available devices, taking into account things like the measured computation time in previous runs of the graph, estimations of the size of the input and output tensors to each operation, the amount of RAM available in each device, communication delay when transferring data in and out of devices, hints and constraints from the user, and more.

Simple placement

The simple placer respects the following rules:

  • If a node was already placed on a device in a previous run of the graph, it is left on that device.
  • Else, if the user pinned a node to a device (described next), the placer places it on that device.
  • Else, it defaults to GPU #0, or the CPU if there is no GPU.

To pin nodes onto a device, you must create a device block using the device() function. For
example, the following code pins the variable a and the constant b on the CPU, but the multiplication node c is not pinned on any device, so it will be placed on the default device:

with tf.device("/cpu:0"):
    a=tf.Variable(3.0)
    b=tf.constant(4.0)
c=a*b

Logging placements

You can set the log_device_placement option to True to tell the placer to log a message whenever it places a node. For example:

config = tf.ConfigProto()
config.log_device_placement = True
sess = tf.Session(config=config)
init=tf.global_variables_initializer()
init.run(session=sess)
sess.run(c)

The lines starting with "I" for Info are the log messages. When we create a session,
TensorFlow logs a message to tell us that it has found a GPU card. Then the first time we run the graph (in this case when initializing the variable a), the simple placer is run and places each node on the device it was assigned to. As expected, the log messages show that all nodes are placed on "/cpu:0" except the multiplication node, which ends up on the default device "/gpu:0". Notice that the second time we run the graph (to compute c), the placer is not used since all the nodes TensorFlow needs to compute c are already placed.

Dynamic placement function

When you create a device block, you can specify a function instead of a device name.
TensorFlow will call this function for each operation it needs to place in the device block, and the function must return the name of the device to pin the operation on. For example, the following code pins all the variable nodes to "/cpu:0" (in this case just the variable a) and all other nodes to "/gpu:0":

def variables_on_cpu(op):
    if op.type=="Variable":
        return "/cpu:0"
    else:
        return "/gpu:0"
with tf.device(variables_on_cpu):
    a=tf.Variable(3.0)
    b=tf.constant(4.0)
    c=a*b

Operations and kernels

For a TensorFlow operation to run on a device, it needs to have an implementation for that device; this is called a kernel. Many operations have kernels for both CPUs and GPUs, but not all of them. For example, TensorFlow does not have a GPU kernel for integer variables, so the following code will fail when TensorFlow tries to place the variable i on GPU #0:

with tf.device("/gpu:0"):
    i = tf.Variable(3)
sess.run(i.initializer)

Soft placement

By default, if you try to pin an operation on a device for which the operation has no kernel, you get the exception shown earlier when TensorFlow tries to place the operation on the device. If you prefer TensorFlow to fall back to the CPU instead, you can set the allow_soft_placement configuration option to True:

with tf.device("/gpu:0"):
    i=tf.Variable(3)
config=tf.ConfigProto()
config.allow_soft_placement = True
sess=tf.Session(config=config)
sess.run(i.initializer)# the placer runs and falls back to /cpu:0 

12.1.4 Parallel Execution

When TensorFlow runs a graph, it starts by finding out the list of nodes that need to be evaluated, and it counts how many dependencies each of them has. TensorFlow then starts evaluating the nodes with zero dependencies (i.e., source nodes). If these nodes are placed on separate devices, they obviously get evaluated in parallel. If they are placed on the same device, they get evaluated in different threads, so they may run in parallel too (in separate GPU threads or CPU cores).

TensorFlow manages a thread pool on each device to parallelize operations (see Figure 12-5). These are called the inter-op thread pools. Some operations have multi‐threaded kernels: they can use other thread pools (one per device) called the intra-op thread pools.

Operations D and E depend on C. As soon as operation C finishes, the dependency counters of operations D and E will be decremented and will both reach 0, so both operations will be sent to the inter-op thread pool to be executed.

You can control the number of threads per inter-op pool by setting the inter_op_parallelism_threads option. Note that the first session you start creates the inter-op thread pools. All other sessions will just reuse them unless you set the use_per_session_threads option to True. You can control the number of threads per intra-op pool by setting the intra_op_parallelism_threads option.
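As a rough sketch (using the ConfigProto options just described), you can set these thread-pool sizes when creating a session:

config = tf.ConfigProto()
config.inter_op_parallelism_threads = 4   # threads used to run independent operations in parallel
config.intra_op_parallelism_threads = 8   # threads available to multithreaded kernels
config.use_per_session_threads = True     # give this session its own inter-op thread pool
sess = tf.Session(config=config)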

12.1.5 Control Dependencies

To postpone evaluation of some nodes, a simple solution is to add control dependencies. For example, the following code tells TensorFlow to evaluate x and y only after a and b have been evaluated:

a=tf.constant(1.0)
b=a+2.0

with tf.control_dependencies([a,b]):
    x=tf.constant(3.0)
    y=tf.constant(4.0)
z=x+y   

12.2 Multiple Devices Across Multiple Servers

To run a graph across multiple servers, you first need to define a cluster. A cluster is composed of one or more TensorFlow servers, called tasks, typically spread across several machines. Each task belongs to a job. A job is just a named group of tasks that typically have a common role, such as keeping track of the model parameters (such a job is usually named "ps" for parameter server), or performing computations (such a job is usually named "worker").

cluster_spec=tf.train.ClusterSpec({
    "ps":[
        "machine-a.example.com:2221",# /job:ps/task:0
    ],
    "worker":[
        "machine-a.example.com:2222",# /job:worker/task:0
        "machine-b.example.com:2222",# /job:worker/task:1
    ]
})

To start a TensorFlow server, you must create a Server object, passing it the cluster
specification (so it can communicate with other servers) and its own job name and task number. For example, to start the first worker task, you would run the following code on machine A:

server = tf.train.Server(cluster_spec, job_name="worker", task_index=0)

It is usually simpler to just run one task per machine, but the previous example demonstrates that TensorFlow allows you to run multiple tasks on the same machine if you want. If you have several servers on one machine, you will need to ensure that they don’t all try to grab all the RAM of every GPU, as explained earlier. For example, in Figure 12-6 the “ps” task does not see the GPU devices, since presumably its process was launched with CUDA_VISIBLE_DEVICES="". Note that the CPU is shared by all tasks located on the same machine.

If you want the process to do nothing other than run the TensorFlow server, you can block the main thread by telling it to wait for the server to finish using the join() method (otherwise the server will be killed as soon as your main thread exits). Since there is currently no way to stop the server, this will actually block forever:

server.join() # blocks until the server stops (i.e., never) 
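Putting these pieces together, here is a hypothetical launcher for the "ps" task on machine A (the cluster spec is the one defined earlier). Hiding the GPUs must happen before TensorFlow initializes CUDA, which is why the environment variable is set before the import:

# ps_task.py (hypothetical launcher for /job:ps/task:0)
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # this process sees no GPUs, only the shared CPU

import tensorflow as tf

cluster_spec = tf.train.ClusterSpec({
    "ps": ["machine-a.example.com:2221"],
    "worker": ["machine-a.example.com:2222", "machine-b.example.com:2222"],
})
server = tf.train.Server(cluster_spec, job_name="ps", task_index=0)
server.join()  # serve forever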

12.2.1 Opening a Session

Once all the tasks are up and running (doing nothing yet), you can open a session on any of the servers, from a client located in any process on any machine (even from a process running one of the tasks), and use that session like a regular local session. For example:

a=tf.constant(1.0)
b=a+2
c=a*3
with tf.Session("grpc://machine-b.example.com:2222") as sess:
    print(c.eval()) # 3.0

12.2.2 The Master and Worker Services

The client uses the gRPC protocol (Google Remote Procedure Call) to communicate with the server. Data is transmitted in the form of protocol buffers, another open source Google technology. This is a lightweight binary data interchange format.

Every TensorFlow server provides two services: the master service and the worker service. The master service allows clients to open sessions and use them to run graphs. It coordinates the computations across tasks, relying on the worker service to actually execute computations on other tasks and get their results.

This architecture gives you a lot of flexibility. One client can connect to multiple servers by opening multiple sessions in different threads. One server can handle multiple sessions simultaneously from one or more clients. You can run one client per task (typically within the same process), or just one client to control all tasks. All options are open.

12.2.3 Pinning Operations Across Tasks

You can use device blocks to pin operations on any device managed by any task, by specifying the job name, task index, device type, and device index. For example, the following code pins a to the CPU of the first task in the “ps” job (that’s the CPU on machine A), and it pins b to the second GPU managed by the first task of the “worker” job (that’s GPU #1 on machine A). Finally, c is not pinned to any device, so the master places it on its own default device (machine B’s GPU #0 device).

with tf.device("/job:ps/task:0/cpu:0"):
    a=tf.constant(1.0)

with tf.device("job:worker/task:0/gpu:1"):
    b=a+2

c=a+b

12.2.4 Sharding Variables Across Multiple Parameter Servers

TensorFlow provides the replica_device_setter() function, which distributes variables across all the "ps" tasks in a round-robin fashion. For example, the following code pins five
variables to two parameter servers:

with tf.device(tf.train.replica_device_setter(ps_tasks=2)):
    v1 = tf.Variable(1.0) # pinned to /job:ps/task:0
    v2 = tf.Variable(2.0) # pinned to /job:ps/task:1
    v3 = tf.Variable(3.0) # pinned to /job:ps/task:0
    v4 = tf.Variable(4.0) # pinned to /job:ps/task:1
    v5 = tf.Variable(5.0) # pinned to /job:ps/task:0

Instead of passing the number of ps_tasks, you can pass the cluster spec cluster=cluster_spec and TensorFlow will simply count the number of tasks in the "ps"
job.
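A minimal sketch of this variant, reusing the cluster_spec defined earlier:

with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
    v1 = tf.Variable(1.0)  # still distributed round-robin across the "ps" tasks
    v2 = tf.Variable(2.0)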

If you create other operations in the block, beyond just variables, TensorFlow automatically pins them to "/job:worker", which will default to the first device managed by the first task in the "worker" job. You can pin them to another device by setting the worker_device parameter, but a better approach is to use embedded device blocks. An inner device block can override the job, task, or device defined in an outer block. For example:

with tf.device(tf.train.replica_device_setter(ps_tasks=2)):
    v1=tf.Variable(1.0) # pinned to /job:ps/task:0 (+ defaults to /cpu:0)
    v2=tf.Variable(2.0) # pinned to /job:ps/task:1 (+ defaults to /cpu:0)
    v3=tf.Variable(3.0) # pinned to /job:ps/task:0 (+ defaults to /cpu:0)
    s=v1+v2             # pinned to /job:worker (+ defaults to task:0/gpu:0)   
    with tf.device("/gpu:1"):
        p1=2*s          # pinned to /job:worker/gpu:1 (+ defaults to /task:0)
        with tf.device("/task:1"):
            p2=3*s      # pinned to /job:worker/task:1/gpu:1

12.2.5 Sharing State Across Sessions Using Resource Containers

When you are using distributed sessions, variable state is managed by resource containers located on the cluster itself, not by the sessions. So if you create a variable named x using one client session, it will automatically be available to any other session on the same cluster (even if the two sessions are connected to different servers). For example, consider the following client code:

#simple_client.py
import tensorflow as tf
import sys

x=tf.Variable(0.0,name="x")
increment_x=tf.assign(x,x+1)

with tf.Session(sys.argv[1]) as sess:
    if sys.argv[2:] == ["init"]:
        sess.run(x.initializer)
    sess.run(increment_x)
    print(x.eval())

$ python3 simple_client.py grpc://machine-a.example.com:2222 init
1.0
$ python3 simple_client.py grpc://machine-b.example.com:2222
2.0

If you want to run completely independent computations on the same cluster you will have to be careful not to use the same variable names by accident. One way to ensure that you won’t have name clashes is to wrap all of your construction phase inside a variable scope with a unique name for each computation, for example:

with tf.variable_scope("my_problem_1"):
    [...] # Construction phase of problem 1

A better option is to use a container block:

with tf.container("my_problem_1"):
    [...] # Construction phase of problem 1
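As a hypothetical sketch of how this keeps independent jobs from colliding, each job could wrap its whole construction phase in its own container (the script mirrors simple_client.py above; the container name and server address are passed on the command line):

# container_client.py (hypothetical)
import sys
import tensorflow as tf

container_name = sys.argv[1]   # e.g., "my_problem_1"
server_target = sys.argv[2]    # e.g., "grpc://machine-a.example.com:2222"

with tf.container(container_name):
    x = tf.Variable(0.0, name="x")        # lives in this job's own resource container
    increment_x = tf.assign(x, x + 1)

with tf.Session(server_target) as sess:
    sess.run(x.initializer)
    print(sess.run(increment_x))  # jobs using different containers never share this x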

The following command will connect to the server on machine A and ask it to reset the container named “my_problem_1”, which will free all the resources this container used (and also close all sessions open on the server). Any variable managed by this container must be initialized before you can use it again:

tf.Session.reset("grpc://machine-a.example.com:2222", ["my_problem_1"]) 

12.2.6 Asynchronous Communication Using TensorFlow Queues

Queues are another great way to exchange data between multiple sessions; for example, one common use case is to have a client create a graph that loads the training data and pushes it into a queue, while another client creates a graph that pulls the data from the queue and trains a model (see Figure 12-8). This can speed up training considerably because the training operations don’t have to wait for the next mini-batch at every step.

The following code creates a FIFO queue that can store up to 10 tensors containing two float values each:

q=tf.FIFOQueue(capacity=10,dtypes=[tf.float32],shapes=[[2]],
               name="q",shared_name="shared_q")

Enqueueing data

#training_data_loader.py
import tensorflow as tf
with tf.container("sharedqueue"):
    # note: the author reports an error when "shared_queue" is used as the container name
    q=tf.FIFOQueue(capacity=10,dtypes=[tf.float32],shapes=[[2]],
                  name="q",shared_name="shared_q")
    training_instance=tf.placeholder(tf.float32,shape=(2))
    enqueue =q.enqueue([training_instance])

with tf.container("sharedqueue"):
    with tf.Session("grpc://127.0.0.1:2222") as sess:
        sess.run(enqueue, feed_dict={training_instance: [1., 2.]})
        sess.run(enqueue, feed_dict={training_instance: [3., 4.]})
        sess.run(enqueue, feed_dict={training_instance: [5., 6.]})        

Instead of enqueuing instances one by one, you can enqueue several at a time using an enqueue_many operation:

[...]
training_instances = tf.placeholder(tf.float32, shape=(None, 2))
enqueue_many = q.enqueue_many([training_instances])
with tf.container("sharedqueue"):
    with tf.Session("grpc://127.0.0.1:2222") as sess:
        sess.run(enqueue_many,feed_dict={training_instances: [[1., 2.], [3., 4.], [5., 6.]]})

Dequeuing data

# trainer.py
import tensorflow as tf
with tf.container("sharedqueue"):
    # re-create the same queue (same shared_name) so this graph can access it
    q = tf.FIFOQueue(capacity=10, dtypes=[tf.float32], shapes=[[2]],
                     name="q", shared_name="shared_q")
    dequeue = q.dequeue()
    with tf.Session("grpc://127.0.0.1:2222") as sess:
        print(sess.run(dequeue)) # [1., 2.]
        print(sess.run(dequeue)) # [3., 4.]
        print(sess.run(dequeue)) # [5., 6.]

or

[...]
with tf.container("sharedqueue"):
    batch_size = 2
    dequeue_mini_batch= q.dequeue_many(batch_size)
    with tf.Session("grpc://127.0.0.1:2222") as sess:
        print(sess.run(dequeue_mini_batch)) # [[1., 2.], [3., 4.]]
        print(sess.run(dequeue_mini_batch)) # blocked waiting for another instance

Queues of tuples

Each item in a queue can be a tuple of tensors (of various types and shapes) instead of
just a single tensor. For example, the following queue stores pairs of tensors, one of
type int32 and shape (), and the other of type float32 and shape [3,2]:

q=tf.FIFOQueue(capacity=10,dtypes=[tf.int32,tf.float32],shapes=[[],[3,2]],name="q",shared_name="shared_q")

The enqueue operation must be given pairs of tensors (note that each pair represents only one item in the queue):

a=tf.placeholder(tf.int32,shape=())
b=tf.placeholder(tf.float32,shape=(3,2))
enqueue=q.enqueue((a,b))

with tf.Session("grpc://127.0.0.1:2221") as sess:
    sess.run(enqueue,feed_dict={a:10,b:[[1.,2.],[3.,4.],[5.,6.]]})
    sess.run(enqueue, feed_dict={a: 11, b:[[2., 4.], [6., 8.], [0., 2.]]})
    sess.run(enqueue, feed_dict={a: 12, b:[[3., 6.], [9., 2.], [5., 8.]]})
dequeue_a, dequeue_b = q.dequeue()
with tf.Session("grpc://127.0.0.1:2222") as sess:
    a_val, b_val = sess.run([dequeue_a, dequeue_b])
    print(a_val) # 10
    print(b_val) # [[1., 2.], [3., 4.], [5., 6.]]
batch_size = 2
dequeue_as, dequeue_bs = q.dequeue_many(batch_size)
with tf.Session("grpc://127.0.0.1:2222") as sess:
    a, b = sess.run([dequeue_as, dequeue_bs])
    print(a) # [10, 11]
    print(b) # [[[1., 2.], [3., 4.], [5., 6.]], [[2., 4.], [6., 8.], [0., 2.]]]
    a, b = sess.run([dequeue_as, dequeue_bs]) # blocked waiting for another pair

Closing a queue

You can close a queue to signal to the other sessions that no more data will be enqueued; pending or subsequent dequeue requests will then fail with an OutOfRangeError once the queue no longer holds enough items:

close_q = q.close()
with tf.Session("grpc://127.0.0.1:2222") as sess:
    [...]
    sess.run(close_q)

RandomShuffleQueue

import tensorflow as tf
tf.reset_default_graph()
q = tf.RandomShuffleQueue(capacity=50, min_after_dequeue=10,
                          dtypes=[tf.float32], shapes=[()],
                          name="q", shared_name="shared_q")
x = tf.placeholder(dtype=tf.float32, shape=())
enqueue_instance = q.enqueue([x])
dequeue = q.dequeue_many(5)
with tf.Session() as sess:
    for i in range(22):
        sess.run(enqueue_instance, feed_dict={x: i})
    print(sess.run(dequeue)) # e.g., [ 20. 15. 11. 12. 4.] (17 items left)
    print(sess.run(dequeue)) # e.g., [ 5. 13. 6. 0. 17.] (12 items left)
    print(sess.run(dequeue)) # 12 - 5 < 10: blocked waiting for 3 more instances

PaddingFIFOQueue

A PaddingFIFOQueue accepts tensors of variable sizes along any dimension (but with a fixed rank).

q = tf.PaddingFIFOQueue(capacity=50, dtypes=[tf.float32], shapes=[(None, None)],name="q", shared_name="shared_q")
v = tf.placeholder(tf.float32, shape=(None, None))
enqueue = q.enqueue([v])
with tf.Session("grpc://127.0.0.1:2222") as sess:
    sess.run(enqueue, feed_dict={v: [[1., 2.], [3., 4.], [5., 6.]]}) # 3x2
    sess.run(enqueue, feed_dict={v: [[1.]]}) # 1x1
    sess.run(enqueue, feed_dict={v: [[7., 8., 9., 5.], [6., 7., 8., 9.]]}) # 2x4
dequeue = q.dequeue_many(3)
with tf.Session("grpc://127.0.0.1:2222") as sess:
    print(sess.run(dequeue))
# output
[[[ 1.  2.  0.  0.]
  [ 3.  4.  0.  0.]
  [ 5.  6.  0.  0.]]

 [[ 1.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]

 [[ 7.  8.  9.  5.]
  [ 6.  7.  8.  9.]
  [ 0.  0.  0.  0.]]]

12.2.7 Loading Data Directly from the Graph

Preload the data into a variable

For datasets that can fit in memory, a better option is to load the training data once and assign it to a variable, then just use that variable in your graph. This is called preloading the training set.

tf.reset_default_graph()
n_features=2
training_set_init=tf.placeholder(tf.float32,shape=(None,n_features))
training_set=tf.Variable(training_set_init,trainable=False,
                         collections=[],name="training_set",
                        validate_shape=False)
with tf.Session() as sess:
    data=[[1,2],[3,4],[5,6],[7.,8.],[9.,10.]]
    sess.run(training_set.initializer,feed_dict={training_set_init:data})

This example assumes that all of your training set (including the labels) consists only of float32 values. If that’s not the case, you will need one variable per type.

Reading the training data directly from the graph

Reader operations: operations capable of reading data directly from the filesystem. This way the training data never needs to flow through the clients at all. TensorFlow provides readers for various file formats:

  • CSV
  • Fixed-length binary records
  • TensorFlow’s own TFRecords format, based on protocol buffers

Let’s look at a simple example reading from a CSV file. Suppose you have a file named my_test.csv that contains training instances, and you want to create operations to read it.

x1, x2, target
1. , 2. , 0
4. , 5 , 1
7. , , 0

First, let’s create a TextLineReader to read this file. A TextLineReader opens a file (once we tell it which one to open) and reads lines one by one. It is a stateful operation, like variables and queues: it preserves its state across multiple runs of the graph, keeping track of which file it is currently reading and what its current position is in this file.

reader=tf.TextLineReader(skip_header_lines=1)

Next, we create a queue that the reader will pull from to know which file to read next. We also create an enqueue operation and a placeholder to push any filename we want to the queue, and we create an operation to close the queue once we have no more files to read:

#enqueue filenames
filename_queue = tf.FIFOQueue(capacity=10, dtypes=[tf.string], shapes=[()])
filename=tf.placeholder(tf.string)
enqueue_filename= filename_queue.enqueue([filename])
close_filename_queue=filename_queue.close()

Create a read operation that will read one record (i.e., a line) at a time and return a key/value pair. The key is the record’s unique identifier—a string composed of the filename, a colon (:), and the line number—and the value is simply a string containing the content of the line:

# read each line of the files whose names are contained in filename_queue
key,value =reader.read(filename_queue)

Parse this string to get the features and target:

x1,x2,target= tf.decode_csv(value,record_defaults=[[-1.],[-1.],[-1]])
features =tf.stack([x1,x2])

Finally, we can push this training instance and its target to a RandomShuffleQueue that we will share with the training graph (so it can pull mini-batches from it), and we create an operation to close that queue when we are done pushing instances to it:

# enqueue each line into instance_queue
instance_queue = tf.RandomShuffleQueue(
    capacity=10, min_after_dequeue=2, dtypes=[tf.float32, tf.int32],
    shapes=[[2], []], name="instance_q", shared_name="shared_instance_q")
enqueue_instance = instance_queue.enqueue([features, target])
close_instance_queue = instance_queue.close()

Run the graph:

minibatch_instances, minibatch_targets=instance_queue.dequeue_up_to(2)
with tf.Session() as sess:
    sess.run(enqueue_filename,feed_dict={filename:"my_test.csv"})
    sess.run(close_filename_queue)
    try:
        while True:
            sess.run(enqueue_instance)
    except tf.errors.OutOfRangeError as ex:
        print("No more files to read")
    sess.run(close_instance_queue)
    try:
        while True:
            print(sess.run([minibatch_instances,minibatch_targets]))
    except tf.errors.OutOfRangeError as ex:
        print("No more training instances")

Multithreaded readers using a Coordinator and a QueueRunner

minibatch_instances, minibatch_targets=instance_queue.dequeue_up_to(2)

n_threads=5
queue_runner = tf.train.QueueRunner(instance_queue, [enqueue_instance] * n_threads)
coord= tf.train.Coordinator()

with tf.Session() as sess:
    sess.run(enqueue_filename,feed_dict={filename:"my_test.csv"})
    sess.run(close_filename_queue)
    enqueue_threads= queue_runner.create_threads(sess,coord=coord,start=True)
    try:
        while True:
            print(sess.run([minibatch_instances,minibatch_targets]))
    except tf.errors.OutOfRangeError as ex:
        print("No more training instances")

Note that QueueRunner and these queue-based input pipelines are deprecated in later TensorFlow versions; use the tf.data API (tf.data.Dataset) instead.

Reading simultaneously from multiple files

def read_and_push_instance(filename_queue, instance_queue):
    reader = tf.TextLineReader(skip_header_lines=1)
    key, value = reader.read(filename_queue)
    x1, x2, target = tf.decode_csv(value, record_defaults=[[-1.], [-1.], [-1]])
    features = tf.stack([x1, x2])
    enqueue_instance = instance_queue.enqueue([features, target])
    return enqueue_instance
filename_queue=tf.FIFOQueue(capacity=10,dtypes=[tf.string],shapes=[()])
filename=tf.placeholder(tf.string)
enqueue_filename=filename_queue.enqueue([filename])
close_filename_queue=filename_queue.close()

instance_queue=tf.RandomShuffleQueue(
    capacity=10, min_after_dequeue=2,dtypes=[tf.float32,tf.int32],
    shapes=[[2],[]],name="instance_q",shared_name="shared_instance_q")

read_and_enqueue_ops=[
    read_and_push_instance(filename_queue,instance_queue)
    for i in range(5)]
queue_runner=tf.train.QueueRunner(instance_queue,read_and_enqueue_ops)

minibatch_instances,minibatch_targets=instance_queue.dequeue_up_to(2)

with tf.Session() as sess:
    sess.run(enqueue_filename,feed_dict={filename:"my_test.csv"})
    sess.run(close_filename_queue)
    coord=tf.train.Coordinator()
    enqueue_threads=queue_runner.create_threads(sess,coord=coord,start=True)
    try:
        while True:
            print(sess.run([minibatch_instances,minibatch_targets]))
    except tf.errors.OutOfRangeError as ex:
        print("No more training instances")

Other convenience functions

The string_input_producer() function takes a 1D tensor containing a list of filenames, creates a thread that pushes one filename at a time to the filename queue, and then closes the queue. If you specify a number of epochs, it will cycle through the filenames once per epoch before closing the queue. By default, it shuffles the filenames at each epoch. It creates a QueueRunner to manage its thread, and adds it to the GraphKeys.QUEUE_RUNNERS collection. To start every QueueRunner in that collection, you can call the tf.train.start_queue_runners() function. Note that if you forget to start the QueueRunners, the filename queue will be open and empty, and your readers will be blocked forever.
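A minimal sketch of this, reusing the my_test.csv file and TextLineReader from above (num_epochs is stored in a local variable, hence the local-variables initializer):

import tensorflow as tf

filename_queue = tf.train.string_input_producer(
    ["my_test.csv"], num_epochs=1, shuffle=True)  # creates the queue + its QueueRunner
reader = tf.TextLineReader(skip_header_lines=1)
key, value = reader.read(filename_queue)

with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)  # starts the producer's thread
    try:
        while True:
            print(sess.run(value))  # one CSV line per run
    except tf.errors.OutOfRangeError:
        print("No more lines")
    coord.request_stop()
    coord.join(threads)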

There are a few other producer functions that similarly create a queue and a corresponding QueueRunner for running an enqueue operation (e.g., input_producer(), range_input_producer(), and slice_input_producer()).

The shuffle_batch() function takes a list of tensors (e.g., [features, target]) and
creates:

  • A RandomShuffleQueue
  • A QueueRunner to enqueue the tensors to the queue (added to the GraphKeys.QUEUE_RUNNERS collection)
  • A dequeue_many operation to extract a mini-batch from the queue

This makes it easy to manage in a single process a multithreaded input pipeline feeding a queue and a training pipeline reading mini-batches from that queue. Also check out the batch(), batch_join(), and shuffle_batch_join() functions that provide similar functionality.
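Here is a rough sketch of how these pieces could fit together (parameter values are arbitrary; the CSV parsing mirrors the earlier examples):

import tensorflow as tf

filename_queue = tf.train.string_input_producer(["my_test.csv"], num_epochs=1)
reader = tf.TextLineReader(skip_header_lines=1)
key, value = reader.read(filename_queue)
x1, x2, target = tf.decode_csv(value, record_defaults=[[-1.], [-1.], [-1]])
features = tf.stack([x1, x2])

# shuffle_batch() creates the RandomShuffleQueue, its QueueRunner, and the dequeue op.
minibatch_instances, minibatch_targets = tf.train.shuffle_batch(
    [features, target], batch_size=2, capacity=10,
    min_after_dequeue=2, num_threads=5)

with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while True:
            print(sess.run([minibatch_instances, minibatch_targets]))
    except tf.errors.OutOfRangeError:
        print("No more training instances")
    coord.request_stop()
    coord.join(threads)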

12.3 Parallelizing Neural Networks on a TensorFlow Cluster

12.3.1 One Neural Network per Device

The most trivial way to train and run neural networks on a TensorFlow cluster is to take the exact same code you would use for a single device on a single machine, and specify the master server’s address when creating the session. You can change the device that will run your graph simply by putting your code’s construction phase within a device block.
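A hedged sketch of this idea (the device name and server address are assumptions reusing the earlier cluster; the tiny network is only for illustration):

import tensorflow as tf

n_inputs = 2

# Same construction code as on a single machine, wrapped in a device block.
with tf.device("/job:worker/task:1/gpu:0"):
    X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
    hidden = tf.layers.dense(X, 10, activation=tf.nn.relu)
    output = tf.layers.dense(hidden, 1)

# Open the session against the cluster's master instead of a local session.
with tf.Session("grpc://machine-b.example.com:2222") as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(output, feed_dict={X: [[1., 2.]]}))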

Another option is to serve your neural networks using [TensorFlow Serving](https://tensorflow.github.io/serving/).

12.3.2 In-Graph Versus Between-Graph Replication

You can also parallelize the training of a large ensemble of neural networks by simply placing every neural network on a different device.

There are two major approaches to handling a neural network ensemble:

  • You can create one big graph, containing every neural network, each pinned to a
    different device, plus the computations needed to aggregate the individual predictions from all the neural networks (see Figure 12-12). Then you just create one session to any server in the cluster and let it take care of everything (including waiting for all individual predictions to be available before aggregating them). This approach is called in-graph replication.
  • Alternatively, you can create one separate graph for each neural network and handle synchronization between these graphs yourself. This approach is called between-graph replication. One typical implementation is to coordinate the execution of these graphs using queues (see Figure 12-13). A set of clients handles one neural network each, reading from its dedicated input queue, and writing to its dedicated prediction queue. Another client is in charge of reading the inputs and pushing them to all the input queues (copying all inputs to every queue). Finally, one last client is in charge of reading one prediction from each prediction queue and aggregating them to produce the ensemble’s prediction.
With between-graph replication, a dequeue operation blocks as long as its queue is empty, so if one client crashes the others may wait forever. One way to avoid this is to set the operation_timeout_in_ms configuration option, so that a blocked operation fails with a DeadlineExceededError instead. For example:

tf.reset_default_graph()

q=tf.FIFOQueue(capacity=10,dtypes=[tf.float32],shapes=[()])
v=tf.placeholder(tf.float32)
enqueue=q.enqueue([v])
dequeue=q.dequeue()
output=dequeue+1

config=tf.ConfigProto()
config.operation_timeout_in_ms=1000

with tf.Session(config=config) as sess:
    sess.run(enqueue,feed_dict={v:1.0})
    sess.run(enqueue,feed_dict={v:2.0})
    sess.run(enqueue,feed_dict={v:3.0})
    print(sess.run(output))
    print(sess.run(output,feed_dict={dequeue:5}))
    print(sess.run(output))
    print(sess.run(output))
    try:
        print(sess.run(output))
    except tf.errors.DeadlineExceededError as ex:
        print("Timed out while dequeuing")    

12.3.3 Model Parallelism

Model Parallelism: chopping your model into separate chunks and running each chunk on a different device.
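A minimal sketch (assuming a machine with two GPUs; the layer sizes are arbitrary): the lower layer runs on one GPU and its output is transferred to the second GPU for the upper layers.

import tensorflow as tf

X = tf.placeholder(tf.float32, shape=(None, 20), name="X")

with tf.device("/gpu:0"):
    hidden1 = tf.layers.dense(X, 100, activation=tf.nn.relu)

with tf.device("/gpu:1"):
    hidden2 = tf.layers.dense(hidden1, 50, activation=tf.nn.relu)  # hidden1 is copied across devices
    outputs = tf.layers.dense(hidden2, 10)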

12.3.4 Data Parallelism

Another way to parallelize the training of a neural network is to replicate it on each device, run a training step simultaneously on all replicas using a different mini-batch for each, and then aggregate the gradients to update the model parameters. This is called data parallelism.

There are two variants of this approach: synchronous updates and asynchronous updates.

Synchronous updates

With synchronous updates, the aggregator waits for all gradients to be available before
computing the average and applying the result.

Asynchronous updates

With asynchronous updates, whenever a replica has finished computing the gradients, it immediately uses them to update the model parameters. There is no aggregation (remove the “mean” step in Figure 12-17), and no synchronization. Replicas just work independently of the other replicas. Since there is no waiting for the other replicas, this approach runs more training steps per minute. Moreover, although the parameters still need to be copied to every device at every step, this happens at different times for each replica so the risk of bandwidth saturation is reduced.

By the time a replica has finished computing the gradients based on some parameter values, these parameters will have been updated several times by other replicas (on average N – 1 times if there are N replicas) and there is no guarantee that the computed gradients will still be pointing in the right direction (see Figure 12-18). When gradients are severely out-of-date, they are called stale gradients: they can slow down convergence, introducing noise and wobble effects (the learning curve may contain temporary oscillations), or they can even make the training algorithm diverge.

There are a few ways to reduce the effect of stale gradients:

  • Reduce the learning rate.
  • Drop stale gradients or scale them down.
  • Adjust the mini-batch size.
  • Start the first few epochs using just one replica (this is called the warmup phase). Stale gradients tend to be more damaging at the beginning of training, when gradients are typically large and the parameters have not settled into a valley of the cost function yet, so different replicas may push the parameters in quite different directions.

A paper published by the Google Brain team in April 2016 benchmarked various approaches and found that data parallelism with synchronous updates using a few spare replicas was the most efficient, not only converging faster but also producing a better model. However, this is still an active area of research, so you should not rule out asynchronous updates quite yet.

Bandwidth saturation

For some models, typically relatively small and trained on a very large training set, you are often better off training the model on a single machine with a single GPU.

Here are a few simple steps you can take to reduce the saturation problem:

  • Group your GPUs on a few servers rather than scattering them across many servers. This will avoid unnecessary network hops.
  • Shard the parameters across multiple parameter servers (as discussed earlier).
  • Drop the model parameters’ float precision from 32 bits (tf.float32) to 16 bits (tf.bfloat16). This will cut in half the amount of data to transfer, without much impact on the convergence rate or the model’s performance.

You can actually drop down to 8-bit precision after training to reduce the size of the model and speed up computations. This is called quantizing the neural network. It is particularly useful for deploying and running pretrained models on mobile phones. See Pete Warden’s great post on the subject.

TensorFlow implementation

To implement data parallelism using TensorFlow, you first need to choose whether you want in-graph replication or between-graph replication, and whether you want synchronous updates or asynchronous updates.

With in-graph replication + synchronous updates, you build one big graph containing all the model replicas (placed on different devices), and a few nodes to aggregate all their gradients and feed them to an optimizer. Your code opens a session to the cluster and simply runs the training operation repeatedly.

With in-graph replication + asynchronous updates, you also create one big graph, but with one optimizer per replica, and you run one thread per replica, repeatedly running the replica’s optimizer.

With between-graph replication + asynchronous updates, you run multiple independent clients (typically in separate processes), each training the model replica as if it were alone in the world, but the parameters are actually shared with other replicas (using a resource container).

With between-graph replication + synchronous updates, once again you run multiple clients, each training a model replica based on shared parameters, but this time you wrap the optimizer (e.g., a MomentumOptimizer) within a SyncReplicasOptimizer. Each replica uses this optimizer as it would use any other optimizer, but under the hood this optimizer sends the gradients to a set of queues (one per variable), which is read by one of the replica’s SyncReplicasOptimizer, called the chief. The chief aggregates the gradients and applies them, then writes a token to a token queue for each replica, signaling it that it can go ahead and compute the next gradients. This approach supports having spare replicas.
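A hedged sketch of that last combination (the toy model and the replica counts are assumptions): each replica wraps its optimizer in a SyncReplicasOptimizer, and the chief additionally creates the hook that manages the token queue.

import tensorflow as tf

# hypothetical toy model replica
X = tf.placeholder(tf.float32, shape=(None, 2))
y = tf.placeholder(tf.float32, shape=(None, 1))
y_pred = tf.layers.dense(X, 1)
loss = tf.reduce_mean(tf.square(y_pred - y))

global_step = tf.train.get_or_create_global_step()
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
sync_optimizer = tf.train.SyncReplicasOptimizer(
    optimizer,
    replicas_to_aggregate=3,   # gradients aggregated before each update
    total_num_replicas=4)      # i.e., one spare replica
training_op = sync_optimizer.minimize(loss, global_step=global_step)

# On the chief replica (typically used with a MonitoredTrainingSession):
sync_hook = sync_optimizer.make_session_run_hook(is_chief=True)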

To sum up, a cluster is a set of TensorFlow servers, called tasks. A job is a named group of tasks that have a common role. A machine may contain several devices, including CPUs and GPUs, and may run several tasks, each of which can grab all or part of the RAM of every GPU. Every TensorFlow server provides two services: the master service and the worker service. The master service allows clients to open sessions and use them to run graphs. It coordinates the computations across tasks, relying on the worker service to actually execute computations on other tasks and get their results. In a distributed environment, an operation can be pinned to a device. You can open a session on any of the servers, from a client located in any process on any machine (even from a process running one of the tasks), and use the session like a regular local session. One client can connect to multiple servers by opening multiple sessions in different threads. One server can handle multiple sessions simultaneously from one or more clients. You can run one client per task (typically within the same process), or just one client to control all tasks. If you create a variable named x using one client session, it will automatically be available to any other session on the same cluster (even if the two sessions are connected to different servers).
