TensorFlow Serving Batching Guide

  • Introduction
  • Simple Batching
    • BatchingSession
    • BasicBatchScheduler
  • Batch Scheduling Parameters and Tuning
    • Performance Tuning
  • Servers with Multiple Models, Model Versions or Subtasks
  • Mixed CPU/GPU/IO Workloads


Introduction

While serving a TensorFlow model, batching individual model inference requests together can be important for performance. In particular, batching is necessary to unlock the high throughput promised by hardware accelerators such as GPUs. This is a library for batching requests and scheduling the batches. The library is not tied to GPUs, per se, and can be used for any situation that benefits from processing groups of small tasks in tandem (but this document assumes GPUs to simplify exposition). It offers a specific TensorFlow Session API, as well as lower-level APIs that can be used to batch at other granularities.



The library is currently split across two locations: (1) core/kernels/batching_util (core API and implementation), and (2) tensorflow_serving/batching (higher-level and experimental code).


The library offers several alternative classes to choose from. The reason for the choices is that there are many reasonable ways to perform batching. No single "best" approach dominates because different use-cases have different requirements, e.g.:

  • API preferences: Tensor API vs. general API; synchronous vs. asynchronous.
  • Does the model have significant CPU components, in addition to GPU?
  • Does the server need to interleave requests to multiple models (or versions)?
  • Is this for online serving or bulk processing (e.g. map-reduce jobs)?







Furthermore, whereas some deployments need advanced capabilities to squeeze out maximal performance, others just want something simple that performs reasonably well.


This document gives a tour of the batching library, including when to use which class, and some best practices.


Simple Batching

If you are new to the batching library and/or you have only basic requirements, you can focus just on BatchingSession and/or BasicBatchScheduler.




BatchingSession

BatchingSession adds batching to a standard tensorflow::Session, and lets you call Session::Run() with individual (non-batched) tensors while getting the benefits of batching "under the covers".



This abstraction works well if your application uses TensorFlow (naturally), and can accommodate Session::Run()'s synchronous API -- request threads make Session::Run() calls that block while awaiting other calls to group into the same batch.



To achieve good throughput with this synchronous API, it is recommended to set the number of client threads to roughly twice the maximum batch size.



BatchingSession can be used with any of the library's batch schedulers including BasicBatchScheduler, which offers a way to bound how long each Session::Run() call blocks.



The simplest way to use BatchingSession is via CreateRetryingBasicBatchingSession(), which gives you a tensorflow::Session object that uses a BasicBatchScheduler underneath, and also handles retrying requests that overflow the scheduler's queue.



You will supply some key parameters governing the scheduling and execution of batched requests that are passed to the underlying BasicBatchScheduler; see below for details.



BasicBatchScheduler has a bounded-size queue; you can set parameters that govern whether Session::Run() should fail upon finding a full queue, or retry some number of times with a delay; again, see below.



A final configuration parameter is allowed_batch_sizes. This parameter is optional.


If unset, then batch sizes can vary freely between 1 and the maximum allowed size, say 1024.


Depending on your environment, having a large number of possible batch sizes may cause problems.


The allowed_batch_sizes parameter lets you limit the batch sizes to a fixed set, say 128, 256, 512, 1024.



BatchingSession adheres to this restriction by padding invalid-size batches with dummy data to round up to the next valid size.
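The rounding rule described above can be sketched in a few lines. This is only an illustration of the arithmetic, not code from the library: the helper names RoundUpToAllowedSize and PaddingNeeded are hypothetical, and the sketch assumes the allowed sizes are sorted in ascending order with the last entry acting as the maximum batch size.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not part of the batching library): returns the
// smallest allowed size that can hold `batch_size` tasks. Assumes
// `allowed_sizes` is sorted ascending and its last entry is the maximum.
int RoundUpToAllowedSize(int batch_size, const std::vector<int>& allowed_sizes) {
  assert(!allowed_sizes.empty());
  assert(batch_size >= 1 && batch_size <= allowed_sizes.back());
  // lower_bound finds the first entry that is >= batch_size.
  return *std::lower_bound(allowed_sizes.begin(), allowed_sizes.end(),
                           batch_size);
}

// Number of dummy entries needed to pad a batch up to the next allowed size.
int PaddingNeeded(int batch_size, const std::vector<int>& allowed_sizes) {
  return RoundUpToAllowedSize(batch_size, allowed_sizes) - batch_size;
}
```

With the allowed sizes 128, 256, 512, 1024 from the text, a batch of 300 tasks would be padded with 212 dummy entries to reach 512, which shows the cost side of restricting batch sizes.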




BasicBatchScheduler

BasicBatchScheduler is a lower-level abstraction than BatchingSession.


It is not tied to tensors/TensorFlow per se, making it quite flexible.


It is suitable for servers that handle homogeneous requests (see basic_batch_scheduler.h for a precise characterization of that restriction).



BasicBatchScheduler offers an asynchronous API that it shares with its less basic cousins (discussed below), called BatchScheduler.



The API is templatized by a BatchTask class that encapsulates a unit of work to be batched.



A non-blocking Schedule() method is used to enqueue a task for processing.


Once a batch of tasks is ready to be processed, a callback is invoked on a separate thread to process the batch.


A good illustration of how to use this API is found in the implementation of BatchingSession in batching_session.cc.
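For readers who want the shape of this pattern before opening batching_session.cc, here is a toy sketch using only the C++ standard library. It is not the library's BatchScheduler API: the class ToyBatchScheduler and its members are hypothetical, it has no timeout and no queue bound, and it dispatches a batch only when max_batch_size tasks are pending or the scheduler is being destroyed.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <iterator>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Toy illustration only; the real BatchScheduler API differs in its details.
template <typename TaskType>
class ToyBatchScheduler {
 public:
  using BatchCallback = std::function<void(std::vector<TaskType>)>;

  ToyBatchScheduler(std::size_t max_batch_size, BatchCallback callback)
      : max_batch_size_(max_batch_size), callback_(std::move(callback)) {
    assert(max_batch_size_ > 0);
    worker_ = std::thread([this] { WorkerLoop(); });
  }

  // Flushes any remaining tasks, then joins the worker thread.
  ~ToyBatchScheduler() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      shutting_down_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }

  // Non-blocking: enqueues the task and returns immediately.
  void Schedule(TaskType task) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      pending_.push_back(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void WorkerLoop() {
    std::unique_lock<std::mutex> lock(mu_);
    while (true) {
      // Sleep until a full batch is available or shutdown begins.
      cv_.wait(lock, [this] {
        return shutting_down_ || pending_.size() >= max_batch_size_;
      });
      if (pending_.empty()) return;  // Only reachable during shutdown.
      const std::size_t n = std::min(pending_.size(), max_batch_size_);
      const auto end = pending_.begin() + static_cast<std::ptrdiff_t>(n);
      std::vector<TaskType> batch(std::make_move_iterator(pending_.begin()),
                                  std::make_move_iterator(end));
      pending_.erase(pending_.begin(), end);
      lock.unlock();
      callback_(std::move(batch));  // Runs on the worker thread.
      lock.lock();
    }
  }

  const std::size_t max_batch_size_;
  const BatchCallback callback_;
  std::mutex mu_;
  std::condition_variable cv_;
  std::vector<TaskType> pending_;
  bool shutting_down_ = false;
  std::thread worker_;
};
```

Request threads would call Schedule() and return at once, while the callback passed to the constructor receives whole batches on the worker thread. A production scheduler also needs the timeout and queue-capacity behavior described in the next section.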


Batch Scheduling Parameters and Tuning


The parameters that govern batch scheduling (e.g. in BasicBatchScheduler::Options) are:

  • max_batch_size: The maximum size of any batch. This parameter governs the throughput/latency tradeoff, and also avoids having batches that are so large they exceed some resource constraint (e.g. GPU memory to hold a batch's data).
  • batch_timeout_micros: The maximum amount of time to wait before executing a batch (even if it hasn't reached max_batch_size). Used to rein in tail latency. (See basic_batch_scheduler.h for the exact latency contract.)
  • num_batch_threads: The degree of parallelism, i.e. the maximum number of batches processed concurrently.
  • max_enqueued_batches: The number of batches worth of tasks that can be enqueued to the scheduler. Used to bound queueing delay, by turning away requests that would take a long time to get to, rather than building up a large backlog.
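As a rough illustration of how these four parameters interact, the sketch below models the two decisions they drive: when a pending batch is run, and when a new request is turned away. The struct ToySchedulingOptions and both functions are hypothetical stand-ins written for this guide; see basic_batch_scheduler.h for the real Options definition and the exact latency contract.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the scheduling options described above.
struct ToySchedulingOptions {
  int max_batch_size;
  std::int64_t batch_timeout_micros;
  int num_batch_threads;  // Listed for completeness; unused in this sketch.
  int max_enqueued_batches;
};

// A pending batch is run once it is full, or once its oldest task has
// waited for at least batch_timeout_micros.
bool ShouldDispatch(const ToySchedulingOptions& opts, int batch_size,
                    std::int64_t batch_age_micros) {
  assert(batch_size >= 0 && batch_age_micros >= 0);
  if (batch_size == 0) return false;  // Nothing to run.
  return batch_size >= opts.max_batch_size ||
         batch_age_micros >= opts.batch_timeout_micros;
}

// New work is accepted only while fewer than max_enqueued_batches batches
// are waiting; otherwise the request is turned away.
bool HasQueueCapacity(const ToySchedulingOptions& opts, int enqueued_batches) {
  assert(enqueued_batches >= 0);
  return enqueued_batches < opts.max_enqueued_batches;
}
```

Note that a batch_timeout_micros of 0 makes ShouldDispatch return true for any non-empty batch, which is why zero is a legitimate setting rather than a degenerate one.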




Performance Tuning

The best values to use for the batch scheduling parameters depend on your model, system and environment, as well as your throughput and latency goals.


Choosing good values is best done via experiments. Here are some guidelines that may be helpful in selecting values to experiment with.


Overall Guidelines

First of all, while experimenting you should temporarily set max_enqueued_batches to infinity.


Later, for your production setup, set it as follows:

  • If you are performing online serving, depending on the policy used to (re-)route requests to server instances, consider setting max_enqueued_batches equal to num_batch_threads to minimize queueing delay at a given server while keeping it busy.
  • For bulk processing jobs, set max_enqueued_batches to a generous value, but low enough to avoid out-of-memory crashes.




Second, if for system architecture reasons you need to constrain the set of possible batch sizes (e.g. just 100, 200 or 400, rather than any value between 1 and 400): If you are using BatchingSession you can set the allowed_batch_sizes parameter. Otherwise, you can arrange for your callback code to pad the batches with dummy elements.




CPU-only: One Approach

If your system is CPU-only (no GPU), then consider starting with the following values: num_batch_threads equal to the number of CPU cores; max_batch_size to infinity; batch_timeout_micros to 0. Then experiment with batch_timeout_micros values in the 1-10 millisecond (1000-10000 microsecond) range, while keeping in mind that 0 may be the optimal value.
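The CPU-only starting point can be written down as a small sketch. The struct and function names are hypothetical, "infinity" is approximated with the largest int, and the list of timeout candidates is just one arbitrary way to sample the 1-10 millisecond range mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <thread>
#include <vector>

// Hypothetical stand-in for the scheduling options; not the library's type.
struct ToySchedulingOptions {
  int max_batch_size;
  std::int64_t batch_timeout_micros;
  int num_batch_threads;
  int max_enqueued_batches;
};

// Starting values for a CPU-only system, following the guideline above.
ToySchedulingOptions CpuOnlyStartingPoint() {
  // hardware_concurrency() may return 0 when the core count is unknown.
  const unsigned cores = std::max(1u, std::thread::hardware_concurrency());
  ToySchedulingOptions opts;
  opts.max_batch_size = std::numeric_limits<int>::max();  // "Infinity".
  opts.batch_timeout_micros = 0;
  opts.num_batch_threads = static_cast<int>(cores);
  // Effectively unbounded while experimenting; tighten for production.
  opts.max_enqueued_batches = std::numeric_limits<int>::max();
  assert(opts.num_batch_threads >= 1);
  return opts;
}

// Arbitrary sample of timeouts to try, from 0 up to 10 milliseconds.
std::vector<std::int64_t> TimeoutCandidatesMicros() {
  return {0, 1000, 2000, 5000, 10000};
}
```

An experiment would then measure throughput and latency for each candidate timeout while holding the other values fixed.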



GPU: One Approach

If your model uses a GPU device for part or all of its inference work, consider the following approach:

  1. Set num_batch_threads to the number of CPU cores.
  2. Temporarily set batch_timeout_micros to infinity while you tune max_batch_size to achieve the desired balance between throughput and average latency. Consider values in the hundreds or thousands.
  3. For online serving, tune batch_timeout_micros to rein in tail latency. The idea is that batches normally get filled to max_batch_size, but occasionally when there is a lapse in incoming requests, to avoid introducing a latency spike it makes sense to process whatever's in the queue even if it represents an underfull batch. The best value for batch_timeout_micros is typically a few milliseconds, and depends on your context and goals. Zero is a value to consider; it works well for some workloads. (For bulk processing jobs, choose a large value, perhaps a few seconds, to ensure good throughput but not wait too long for the final (and likely underfull) batch.)










Servers with Multiple Models, Model Versions or Subtasks


Some server instances service multiple request types (e.g. multiple models, or multiple versions of a model offered concurrently).

In another scenario, a single request gets broken down into sub-requests involving multiple distinct servables (e.g. a recommender system might have a triggering model that decides whether to formulate a recommendation, followed by a model that selects the actual recommendation).

A third scenario is bucketizing sequence model requests to batch together requests of similar length, to minimize padding.
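To see why bucketizing by length reduces padding, consider the sketch below. It is a hypothetical illustration, not library code: it assumes each batch is padded to the longest sequence it contains, and that buckets are fixed-width ranges of sequence length.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

// Padding needed to bring every sequence up to the longest one in the batch.
int PaddingToLongest(const std::vector<int>& lengths) {
  if (lengths.empty()) return 0;
  const int longest = *std::max_element(lengths.begin(), lengths.end());
  int padding = 0;
  for (int len : lengths) padding += longest - len;
  return padding;
}

// Groups sequence lengths into buckets of `bucket_width`. The map key is the
// bucket's upper bound, e.g. lengths 1..5 share key 5 when the width is 5.
std::map<int, std::vector<int>> BucketizeByLength(
    const std::vector<int>& lengths, int bucket_width) {
  assert(bucket_width > 0);
  std::map<int, std::vector<int>> buckets;
  for (int len : lengths) {
    assert(len > 0);
    const int upper_bound =
        ((len + bucket_width - 1) / bucket_width) * bucket_width;
    buckets[upper_bound].push_back(len);
  }
  return buckets;
}

// Total padding when each bucket is batched (and padded) separately.
int TotalBucketedPadding(const std::vector<int>& lengths, int bucket_width) {
  int padding = 0;
  for (const auto& entry : BucketizeByLength(lengths, bucket_width)) {
    padding += PaddingToLongest(entry.second);
  }
  return padding;
}
```

For sequence lengths 3, 4, 18 and 20, a single batch needs 35 padding elements, whereas buckets of width 5 need only 3. In a server, each bucket would correspond to its own queue of tasks.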






Generally speaking, using a separate batch scheduler for each kind of request or sub-task does not work well if they share a common underlying compute resource -- each scheduler would run its own threads that compete with the others' threads to access the resource.

It is better to have a single scheduler with a single thread pool, that is aware of multiple distinct types of tasks and is able to interleave batches of one kind of task with batches of another.



That is what SharedBatchScheduler does. It presents an abstraction of queues, in which each queue accepts requests to schedule a particular kind of task.

Each batch contains tasks of just one type, i.e. from one queue. The scheduler ensures fairness by interleaving the different types of batches.



The queues implement the BatchScheduler API, so they can be used anywhere a simpler (non-shared) scheduler can be used, including with BatchingSession.

Queues can be added and removed over time, which is useful e.g. for transitioning to new model versions in environments in which clients specify a specific version: while clients learn about the new version, the server will have to process requests for both versions, and SharedBatchScheduler takes care of interleaving batches of both kinds of requests.
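The interleaving idea can be illustrated with a single-threaded toy. The class ToySharedScheduler below is hypothetical, and round-robin selection is an assumption made for the sketch; the text above only states that the scheduler interleaves batch types fairly, not which policy SharedBatchScheduler uses.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Toy model of one scheduler serving several queues. Each queue holds
// already-formed batches, represented here by a label.
class ToySharedScheduler {
 public:
  // Adds a queue (e.g. one per model version) and returns its id.
  std::size_t AddQueue() {
    queues_.emplace_back();
    return queues_.size() - 1;
  }

  void EnqueueBatch(std::size_t queue_id, std::string batch_label) {
    assert(queue_id < queues_.size());
    queues_[queue_id].push_back(std::move(batch_label));
  }

  // Picks the next batch in round-robin order over non-empty queues.
  // Returns false if every queue is empty.
  bool NextBatch(std::string* batch_label) {
    assert(batch_label != nullptr);
    for (std::size_t i = 0; i < queues_.size(); ++i) {
      const std::size_t q = (next_queue_ + i) % queues_.size();
      if (!queues_[q].empty()) {
        *batch_label = std::move(queues_[q].front());
        queues_[q].pop_front();
        next_queue_ = (q + 1) % queues_.size();
        return true;
      }
    }
    return false;
  }

 private:
  std::vector<std::deque<std::string>> queues_;
  std::size_t next_queue_ = 0;
};
```

With one queue per model version, a backlog of requests for the old version cannot starve the new version: the shared thread pool alternates between the two as long as both have batches waiting.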





Mixed CPU/GPU/IO Workloads

Some models perform nontrivial CPU work, in addition to their main GPU work. While the core matrix operations may run well on a GPU, peripheral operations may take place on a CPU, e.g. embedding lookup, vocabulary lookup, quantization/dequantization.

Depending on how the GPU is managed, batching the entire sequence of CPU and GPU steps as a unit can underutilize the GPU.




Non-GPU pre- and post-processing can be performed in the request threads, with the batch scheduler used only for the GPU portion of the work.


Alternatively, the non-GPU work can be done in the batch threads, in the callback the batch scheduler calls.

To allow the callback to perform non-batched work on tasks before a batch is fully formed, you can use StreamingBatchScheduler.

It is designed for servers that control latency very precisely, and need fine control over each stage of the pipeline.





StreamingBatchScheduler will reject a task if the scheduler currently has no capacity to process it. If you want to automatically retry tasks that are rejected for that reason you can layer a BatchSchedulerRetrier on top of the batch scheduler. There is a convenience function for creating a streaming scheduler coupled with a retrier: CreateRetryingStreamingBatchScheduler().
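The retry layering can be pictured with the sketch below. It does not reproduce BatchSchedulerRetrier: the function ScheduleWithRetries and its parameters are hypothetical, and the scheduler is reduced to a callable that returns false when it has no capacity.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical retry wrapper. `try_schedule` returns true if the task was
// accepted and false if it was rejected for lack of capacity. Makes up to
// max_retries additional attempts, sleeping retry_delay between attempts.
bool ScheduleWithRetries(const std::function<bool()>& try_schedule,
                         int max_retries,
                         std::chrono::microseconds retry_delay) {
  assert(max_retries >= 0);
  for (int attempt = 0;; ++attempt) {
    if (try_schedule()) return true;           // Accepted.
    if (attempt == max_retries) return false;  // Out of retries.
    std::this_thread::sleep_for(retry_delay);
  }
}
```

The delay and retry count together bound how long a caller can block before the rejection is surfaced, so they should be chosen with the request's latency budget in mind.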



When splitting model inference logic into multiple distinct phases to optimize latency or utilization, keep in mind that for a given request, every phase should use the same version of the model.


A good way to ensure this property is to coordinate which ServableHandle object(s) get used in each phase, across the threads.


Lastly, I/O-intensive phases of inference, e.g. lookups to disk or remote servers, may benefit from batching to hide their latency. You can use two batch scheduler instances: one to batch these lookups, and a separate one to batch the GPU work.




