Literature Notes (8) (OSDI 2016)

Table of contents

  • 1 abstract & introduction
  • 2 background & motivation
    • 2.1 previous system: DistBelief
    • 2.2 design principles
      • 2.2.1 dataflow graphs of primitive operators
      • 2.2.2 deferred execution
      • 2.2.3 common abstraction for heterogeneous accelerators
    • 2.3 related work
      • 2.3.1 single-machine frameworks
      • 2.3.2 batch dataflow systems
      • 2.3.3 parameter servers
  • 3 TensorFlow execution model
    • 3.1 dataflow graph elements
    • 3.2 partial and concurrent execution
    • 3.3 distributed execution
    • 3.4 dynamic control flow

  • Title: TensorFlow: A system for large-scale machine learning
  • Year: 2016
  • Venue: OSDI (Operating Systems Design and Implementation)
  • Institution: Google

1 abstract & introduction

TensorFlow represents computation as a dataflow graph, together with shared state and the operations that mutate that state. It maps the nodes of the dataflow graph across many machines in a cluster, and within a machine across multiple computational devices such as CPUs, GPUs, and ASICs.

2 background&motivation

2.1 previous system: DistBelief

TensorFlow's predecessor was DistBelief, which used the parameter server architecture.
In the parameter server architecture, a job comprises two disjoint sets of processes: stateless worker processes that perform the bulk of the computation when training a model, and stateful parameter server processes that maintain the current version of the model parameters.
Three kinds of flexibility that users need:

  • adding new network layers
  • refining the training algorithm: currently this means stochastic gradient descent, but new optimization algorithms are also needed
  • defining new training algorithms: e.g., RNNs, reinforcement learning, adversarial networks

2.2 design principles

2.2.1 dataflow graphs of primitive operators

The TensorFlow model represents individual mathematical operators (such as matrix multiplication, convolution, etc.) as nodes in the dataflow graph.
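
To make this concrete, here is a toy sketch in plain Python of representing primitive operators as nodes in a dataflow graph; the `Node` class and operator names are illustrative assumptions, not the actual TensorFlow API.

```python
# Toy dataflow graph: each primitive operator is a node whose inputs
# are other nodes. (Illustrative sketch, not TensorFlow code.)

class Node:
    def __init__(self, op, inputs=(), value=None):
        self.op = op          # operator name, e.g. "add", "mul"
        self.inputs = list(inputs)
        self.value = value    # only leaf "const" nodes carry a value

def const(v):
    return Node("const", value=v)

def add(a, b):
    return Node("add", [a, b])

def mul(a, b):
    return Node("mul", [a, b])

def evaluate(node):
    """Recursively evaluate a node by first evaluating its inputs."""
    if node.op == "const":
        return node.value
    args = [evaluate(i) for i in node.inputs]
    if node.op == "add":
        return args[0] + args[1]
    if node.op == "mul":
        return args[0] * args[1]
    raise ValueError(f"unknown op {node.op}")

# y = (2 * 3) + 4, built as a graph of operator nodes
y = add(mul(const(2), const(3)), const(4))
print(evaluate(y))  # 10
```

Composing a small set of primitive operators like this is what lets users express new layers and training algorithms without modifying the system core.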

2.2.2 deferred execution

A TensorFlow application has two phases:

  • define the nodes of the dataflow graph
  • execute the dataflow graph
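
The two phases above can be sketched in plain Python; the `Graph`/`run` names are illustrative assumptions standing in for TensorFlow's graph and session API.

```python
# Deferred execution sketch: defining the graph does no arithmetic;
# all computation happens only when run() is called.
# (Illustrative, not the real TensorFlow API.)

class Graph:
    def __init__(self):
        self.nodes = []   # (fn, input_ids) pairs, in definition order

    def op(self, fn, *input_ids):
        self.nodes.append((fn, input_ids))
        return len(self.nodes) - 1      # node id is a symbolic handle

    def run(self, fetch_id):
        """Execute the whole graph and return the fetched node's value."""
        results = []
        for fn, input_ids in self.nodes:
            results.append(fn(*[results[i] for i in input_ids]))
        return results[fetch_id]

g = Graph()

# Phase 1: define the dataflow graph (nothing is computed yet)
a = g.op(lambda: 2)
b = g.op(lambda: 3)
c = g.op(lambda x, y: x * y, a, b)

# Phase 2: execute the graph
print(g.run(c))  # 6
```

Deferring execution this way gives the runtime the whole graph up front, so it can optimize and place the computation before anything runs.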

2.2.3 common abstraction for heterogeneous accelerators

Definition of a device: at a minimum, a device must support

  • issuing a kernel for execution
  • allocating memory for inputs and outputs
  • transferring buffers to and from host memory

On a cluster, we deploy TensorFlow as a set of tasks (processes that can communicate over a network), each of which exports the same graph execution API and contains one or more devices. Tasks come in two kinds:

  • PS (parameter server) tasks
  • worker tasks
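
The minimal device interface listed above can be sketched as an abstract class; the class and method names here are illustrative assumptions, not TensorFlow's actual internal C++ interface.

```python
# Sketch of the minimal device abstraction: issue kernels, allocate
# memory, and transfer buffers to/from host memory.
from abc import ABC, abstractmethod

class Device(ABC):
    @abstractmethod
    def issue_kernel(self, kernel, inputs, outputs):
        """Issue a kernel for execution on this device."""

    @abstractmethod
    def allocate(self, nbytes):
        """Allocate device memory for inputs and outputs."""

    @abstractmethod
    def transfer(self, buffer):
        """Copy a buffer between host and device memory."""

class CPUDevice(Device):
    # Trivial host "device": memory is a bytearray and kernels are
    # plain Python callables.
    def issue_kernel(self, kernel, inputs, outputs):
        kernel(inputs, outputs)

    def allocate(self, nbytes):
        return bytearray(nbytes)

    def transfer(self, buffer):
        return bytearray(buffer)  # copy in either direction

cpu = CPUDevice()
buf = cpu.allocate(4)
cpu.issue_kernel(lambda ins, outs: outs[0].__setitem__(0, ins[0]),
                 [7], [buf])
print(buf[0])  # 7
```

Because every device implements the same narrow interface, the rest of the runtime stays agnostic to whether a node runs on a CPU, GPU, or ASIC.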

2.3 related work

2.3.1 single-machine frameworks

Many users carry out their work on a single, often GPU-equipped, computer, and several single-machine frameworks support this scenario:

  • Caffe
  • Theano
  • Torch

2.3.2 batch dataflow systems

In these systems, each model update step must process larger batches, slowing convergence.

2.3.3 parameter servers

A parameter server architecture uses a set of servers to manage shared state that is updated by a set of parallel workers.
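
The parameter-server division of labor can be sketched as a toy single-process loop; the update rule and function names below are illustrative assumptions, not any real system's API.

```python
# Toy parameter-server pattern: the "server" holds shared state, and
# "workers" pull parameters, compute gradients, and push updates.

params = [0.0, 0.0]          # shared state held by the parameter server

def push_update(grads, lr=0.1):
    """Server side: apply a worker's gradient to the shared state."""
    for i, g in enumerate(grads):
        params[i] -= lr * g

def worker_step():
    """Worker side: pull params, compute a (toy) gradient, push it."""
    local = list(params)               # pull current parameters
    grads = [p - 1.0 for p in local]   # gradient of (p - 1)^2 / 2
    push_update(grads)

for _ in range(100):
    worker_step()

print(params)  # each parameter converges toward 1.0
```

The key property is that workers are stateless and interchangeable, while all durable state lives on the servers.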

3 TensorFlow execution model

The dataflow graph expresses the communication between subcomputations explicitly, making it easy to execute independent computations in parallel and to partition computations across multiple devices.

3.1 dataflow graph elements

  • tensors
  • operations
  • stateful operations: variables
  • stateful operations: queues
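
The four elements listed above can be illustrated in plain Python: a tensor is the value flowing along an edge, an operation transforms tensors, and variables and queues are stateful operations whose state survives across executions. The class names below are illustrative, not the TF API.

```python
# Sketch of the stateful graph elements: a Variable holds a mutable
# tensor across steps; a FIFOQueue hands tensors between steps.
from collections import deque

class Variable:
    """Stateful node: holds a mutable tensor across graph executions."""
    def __init__(self, value):
        self.value = value

    def assign_add(self, delta):
        self.value += delta
        return self.value

class FIFOQueue:
    """Stateful node: enqueue/dequeue tensors across executions."""
    def __init__(self):
        self._q = deque()

    def enqueue(self, tensor):
        self._q.append(tensor)

    def dequeue(self):
        return self._q.popleft()

step = Variable(0)           # e.g. a global training-step counter
inputs = FIFOQueue()         # e.g. a prefetch queue of input batches

inputs.enqueue([1.0, 2.0])   # a "tensor" here is just a list of floats
batch = inputs.dequeue()
print(step.assign_add(1), batch)  # 1 [1.0, 2.0]
```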

3.2 partial and concurrent execution

Many concurrent steps of the training subgraph update the model based on different input batches, to implement data-parallel training.
By default, concurrent executions of a TensorFlow subgraph run asynchronously with respect to one another.
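
A minimal sketch of this asynchronous, data-parallel pattern using plain Python threads; the thread counts, lock, and update rule are illustrative assumptions, not how TensorFlow is implemented internally.

```python
# Several concurrent "training steps" update the same shared model
# state from different input batches, without coordinating steps.
import threading

weight = [0.0]
lock = threading.Lock()   # protects each read-modify-write only

def training_step(batch):
    grad = sum(batch) / len(batch)     # toy gradient from this batch
    with lock:
        weight[0] += 0.01 * grad       # update shared model state

batches = [[float(i)] * 4 for i in range(8)]
threads = [threading.Thread(target=training_step, args=(b,))
           for b in batches]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(round(weight[0], 2))  # 0.28: all 8 updates applied, in some order
```

Steps interleave in an arbitrary order, which is exactly the asynchrony the text describes: no step waits for another to finish.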

3.3 distributed execution

  • map operations in the dataflow graph to devices
  • partition the operations into per-device subgraphs

A client session maintains the mapping from step definitions to cached subgraphs.
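
The partitioning step can be sketched as grouping operations by their assigned device; the operation names and device strings below are illustrative assumptions (the device-name format mimics TensorFlow's `/job:.../task:.../device:...` convention).

```python
# Partition a dataflow graph into per-device subgraphs, given a
# placement mapping from operation to device.

ops = ["read_batch", "matmul", "softmax", "apply_gradients"]

placement = {                 # result of the placement step
    "read_batch": "/job:worker/task:0/device:CPU:0",
    "matmul": "/job:worker/task:0/device:GPU:0",
    "softmax": "/job:worker/task:0/device:GPU:0",
    "apply_gradients": "/job:ps/task:0/device:CPU:0",
}

def partition(ops, placement):
    """Group operations into one subgraph per device."""
    subgraphs = {}
    for op in ops:
        subgraphs.setdefault(placement[op], []).append(op)
    return subgraphs

for device, subgraph in partition(ops, placement).items():
    print(device, subgraph)
```

In the real system, edges that cross a partition boundary are replaced by send/receive communication between the subgraphs; this sketch shows only the grouping itself.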

3.4 dynamic control flow
