This article is based on Boost 1.69. Where source code is shown, some blocks that are deprecated or belong to configurations unrelated to this topic have been trimmed.
Introduction
This installment discusses the concurrent programming practices involved in Asio, again analyzed through the source code.
Multithreading techniques
Multithreaded scheduling in scheduler
scheduler
The operation queue inevitably has to deal with multithreading: the relationship between the operation queue and threads, the thread safety of the operation queue itself, and the execution of operations in a multithreaded environment.
Utility classes
`call_stack` and `context`. Reading the source shows that `call_stack` has a static data member `top_` of type `tss_ptr`, where `tss_ptr` is a thread-specific-storage pointer; on Unix platforms it binds an address to a thread-specific key through the `::pthread_xxxxxx` interfaces. `context` is a nested class of `call_stack`; interestingly, `context`'s constructor is a push operation and its destructor is a pop operation, both acting on `top_`.
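To make the push-on-construct / pop-on-destruct idea concrete, here is a minimal, hypothetical sketch (not Asio's actual code: C++11 `thread_local` stands in for the pthread-based `tss_ptr`, and `contains` is placed on `context` for brevity):

```cpp
#include <cassert>

// Hypothetical simplification; Asio's real call_stack<Key, Value> is
// templated and stores a key/value pair per stack frame.
class call_stack
{
public:
  class context
  {
  public:
    explicit context(void* key) : key_(key), next_(top())
    {
      top() = this;             // push: this frame becomes the new top
    }
    ~context()
    {
      top() = next_;            // pop: restore the previous top frame
    }
    static bool contains(void* key)
    {
      for (context* c = top(); c != nullptr; c = c->next_)
        if (c->key_ == key)
          return true;
      return false;
    }
  private:
    void* key_;
    context* next_;
    // One slot per thread: the moral equivalent of tss_ptr<context> top_.
    static context*& top()
    {
      static thread_local context* top_ = nullptr;
      return top_;
    }
  };
};

int main()
{
  int scheduler_object = 0;
  assert(!call_stack::context::contains(&scheduler_object));
  {
    call_stack::context frame(&scheduler_object);
    assert(call_stack::context::contains(&scheduler_object));
  }
  assert(!call_stack::context::contains(&scheduler_object));
}
```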
`conditionally_enabled_mutex` and `conditionally_enabled_event`. Built on `std::condition_variable` (or a similar implementation), these provide common thread-control functionality. `conditionally_enabled_mutex` additionally wraps a data member `enabled_`; when `enabled_` is false, the corresponding locking operations are skipped.
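A minimal sketch of that conditional-locking idea, assuming a shape like the following (simplified; the real class also provides a nested `scoped_lock` and pairs with `conditionally_enabled_event`):

```cpp
#include <mutex>

// Sketch: a mutex wrapper that degenerates to a no-op when disabled,
// e.g. when the concurrency hint promises single-threaded use.
class conditionally_enabled_mutex
{
public:
  explicit conditionally_enabled_mutex(bool enabled) : enabled_(enabled) {}

  void lock()
  {
    if (enabled_)       // when enabled_ is false, skip the real lock
      mutex_.lock();
  }

  void unlock()
  {
    if (enabled_)
      mutex_.unlock();
  }

private:
  std::mutex mutex_;
  const bool enabled_;
};
```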
Walkthrough of the scheduling process
The scheduling process can be analyzed from two angles: users submitting tasks (production) and the `io_context`'s event processing loop (consumption and production).
The two typical internal interfaces through which Asio submits tasks are the function `scheduler::post_immediate_completion` (for general tasks, as can be seen in the `boost::asio::post` source) and the method `reactor::start_op` (for I/O-related tasks, as can be seen in the `basic_stream_socket` source). In `scheduler::post_immediate_completion`, the concurrency-related part is simple: lock, push the task onto the `scheduler` data member `op_queue_`, then unlock and wake one thread.
// file:
...
void scheduler::post_immediate_completion(
    scheduler::operation* op, bool is_continuation)
{
#if defined(BOOST_ASIO_HAS_THREADS)
  if (one_thread_ || is_continuation)
  {
    if (thread_info_base* this_thread = thread_call_stack::contains(this))
    {
      ++static_cast<thread_info*>(this_thread)->private_outstanding_work;
      static_cast<thread_info*>(this_thread)->private_op_queue.push(op);
      return;
    }
  }
#else // defined(BOOST_ASIO_HAS_THREADS)
  (void)is_continuation;
#endif // defined(BOOST_ASIO_HAS_THREADS)

  work_started();
  mutex::scoped_lock lock(mutex_);
  op_queue_.push(op);
  wake_one_thread_and_unlock(lock);
}
...
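For orientation, the user-facing call that ends up in `post_immediate_completion` is `boost::asio::post`; a minimal example (standard Boost.Asio usage):

```cpp
#include <boost/asio.hpp>
#include <iostream>

int main()
{
  boost::asio::io_context ioc;

  // post() wraps the lambda in an operation and eventually hands it to
  // scheduler::post_immediate_completion, which pushes it onto op_queue_.
  boost::asio::post(ioc, []{ std::cout << "posted task\n"; });

  // run() drives the event processing loop that consumes op_queue_.
  ioc.run();
}
```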
Now look at the `reactor::start_op` source. Note that the RAII-style mutex wrapper `descriptor_lock` acquires the lock of one particular descriptor, so `reactor::start_op` calls on different sockets can run in parallel. Since this article's theme is concurrency, the body of `reactor::start_op` is not examined in depth here. Note `scheduler_.work_started` at the end, which merely performs `++outstanding_work_`.
// file:
...
void epoll_reactor::start_op(int op_type, socket_type descriptor,
    epoll_reactor::per_descriptor_data& descriptor_data, reactor_op* op,
    bool is_continuation, bool allow_speculative)
{
  if (!descriptor_data)
  {
    op->ec_ = boost::asio::error::bad_descriptor;
    post_immediate_completion(op, is_continuation);
    return;
  }

  mutex::scoped_lock descriptor_lock(descriptor_data->mutex_);

  if (descriptor_data->shutdown_)
  {
    post_immediate_completion(op, is_continuation);
    return;
  }

  if (descriptor_data->op_queue_[op_type].empty())
  {
    if (allow_speculative
        && (op_type != read_op
          || descriptor_data->op_queue_[except_op].empty()))
    {
      if (descriptor_data->try_speculative_[op_type])
      {
        if (reactor_op::status status = op->perform())
        {
          if (status == reactor_op::done_and_exhausted)
            if (descriptor_data->registered_events_ != 0)
              descriptor_data->try_speculative_[op_type] = false;
          descriptor_lock.unlock();
          scheduler_.post_immediate_completion(op, is_continuation);
          return;
        }
      }

      if (descriptor_data->registered_events_ == 0)
      {
        op->ec_ = boost::asio::error::operation_not_supported;
        scheduler_.post_immediate_completion(op, is_continuation);
        return;
      }

      if (op_type == write_op)
      {
        if ((descriptor_data->registered_events_ & EPOLLOUT) == 0)
        {
          epoll_event ev = { 0, { 0 } };
          ev.events = descriptor_data->registered_events_ | EPOLLOUT;
          ev.data.ptr = descriptor_data;
          if (epoll_ctl(epoll_fd_, EPOLL_CTL_MOD, descriptor, &ev) == 0)
          {
            descriptor_data->registered_events_ |= ev.events;
          }
          else
          {
            op->ec_ = boost::system::error_code(errno,
                boost::asio::error::get_system_category());
            scheduler_.post_immediate_completion(op, is_continuation);
            return;
          }
        }
      }
    }
    else if (descriptor_data->registered_events_ == 0)
    {
      op->ec_ = boost::asio::error::operation_not_supported;
      scheduler_.post_immediate_completion(op, is_continuation);
      return;
    }
    else
    {
      if (op_type == write_op)
      {
        descriptor_data->registered_events_ |= EPOLLOUT;
      }
      epoll_event ev = { 0, { 0 } };
      ev.events = descriptor_data->registered_events_;
      ev.data.ptr = descriptor_data;
      epoll_ctl(epoll_fd_, EPOLL_CTL_MOD, descriptor, &ev);
    }
  }

  descriptor_data->op_queue_[op_type].push(op);
  scheduler_.work_started();
}
...
Next is the `io_context` event processing loop, which handles the "consume and produce" side. The loop mainly calls `scheduler::run` on its member scheduler. Starting from `scheduler::run`, the flow around the operation queue is:
- Declare the local variable `this_thread` (whose members include `private_op_queue`)
- Push the `scheduler` address and the address of the local variable `this_thread` onto the call stack
- Lock `mutex_`, a data member of `scheduler`
- Call `do_run_one`, re-lock `mutex_`, and loop
- Via RAII, pop the `scheduler` address and the `this_thread` address off the call stack
`scheduler::do_run_one`. Now let's analyze its execution:
- When the `scheduler`'s operation queue `op_queue_` is not empty:
  - Copy the front element of `op_queue_` into `o` and pop `op_queue_`
  - If `o` equals `&task_operation_`:
    - If more tasks remain and there are multiple threads, `unlock_and_signal_one`; otherwise just `unlock`. The remainder can then run concurrently:
    - Initialize a `task_cleanup` instance
    - Run `reactor::run`, passing the thread-private queue as the operation queue
    - The `task_cleanup` instance is destroyed, performing cleanup (analyzed below)
  - If `o` does not equal `&task_operation_`:
    - If more tasks remain and there are multiple threads, `unlock_and_signal_one`; otherwise just `unlock`. The remainder can then run concurrently:
    - Initialize a `work_cleanup` instance
    - Run `o->complete` (completing the operation at the front of the queue)
    - The `work_cleanup` instance is destroyed, performing cleanup
- When the `scheduler`'s operation queue `op_queue_` is empty:
  - `wakeup_event_` clear and wait, waiting for another thread to wake this one
A word on the `task_cleanup` class: its only member function is its destructor, which also implements its main functionality:
- Increment the (atomic) `scheduler_->outstanding_work_` by the (non-atomic) `this_thread_->private_outstanding_work`.
- Under the lock, perform `scheduler_->op_queue_.push(this_thread_->private_op_queue)` and related operations.
`work_cleanup` is slightly different; a paraphrased sketch follows the listing below.
// file:
...
std::size_t scheduler::run(boost::system::error_code& ec)
{
  ec = boost::system::error_code();
  if (outstanding_work_ == 0)
  {
    stop();
    return 0;
  }

  thread_info this_thread;
  this_thread.private_outstanding_work = 0;
  thread_call_stack::context ctx(this, this_thread);

  mutex::scoped_lock lock(mutex_);

  std::size_t n = 0;
  for (; do_run_one(lock, this_thread, ec); lock.lock())
    if (n != (std::numeric_limits<std::size_t>::max)())
      ++n;
  return n;
}
...
std::size_t scheduler::do_run_one(mutex::scoped_lock& lock,
    scheduler::thread_info& this_thread,
    const boost::system::error_code& ec)
{
  while (!stopped_)
  {
    if (!op_queue_.empty())
    {
      // Prepare to execute first handler from queue.
      operation* o = op_queue_.front();
      op_queue_.pop();
      bool more_handlers = (!op_queue_.empty());

      if (o == &task_operation_)
      {
        task_interrupted_ = more_handlers;

        if (more_handlers && !one_thread_)
          wakeup_event_.unlock_and_signal_one(lock);
        else
          lock.unlock();

        task_cleanup on_exit = { this, &lock, &this_thread };
        (void)on_exit;

        // Run the task. May throw an exception. Only block if the operation
        // queue is empty and we're not polling, otherwise we want to return
        // as soon as possible.
        task_->run(more_handlers ? 0 : -1, this_thread.private_op_queue);
      }
      else
      {
        std::size_t task_result = o->task_result_;

        if (more_handlers && !one_thread_)
          wake_one_thread_and_unlock(lock);
        else
          lock.unlock();

        // Ensure the count of outstanding work is decremented on block exit.
        work_cleanup on_exit = { this, &lock, &this_thread };
        (void)on_exit;

        // Complete the operation. May throw an exception. Deletes the object.
        o->complete(this, ec, task_result);

        return 1;
      }
    }
    else
    {
      wakeup_event_.clear(lock);
      wakeup_event_.wait(lock);
    }
  }

  return 0;
}
...
~task_cleanup()
{
  if (this_thread_->private_outstanding_work > 0)
  {
    boost::asio::detail::increment(
        scheduler_->outstanding_work_,
        this_thread_->private_outstanding_work);
  }
  this_thread_->private_outstanding_work = 0;

  // Enqueue the completed operations and reinsert the task at the end of
  // the operation queue.
  lock_->lock();
  scheduler_->task_interrupted_ = true;
  scheduler_->op_queue_.push(this_thread_->private_op_queue);
  scheduler_->op_queue_.push(&scheduler_->task_operation_);
}
...
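As for `work_cleanup`, a paraphrased sketch of its destructor is shown below (simplified from the Boost 1.69 source, with the thread-support `#if` guards omitted; consult the real code for the exact details). Unlike `task_cleanup`, it also has to account for the one unit of outstanding work consumed by the handler that just completed:

```cpp
// Paraphrased sketch of scheduler::work_cleanup::~work_cleanup().
~work_cleanup()
{
  // The completed handler consumed one unit of work, so reconcile the
  // thread-private counter against that single unit.
  if (this_thread_->private_outstanding_work > 1)
  {
    boost::asio::detail::increment(
        scheduler_->outstanding_work_,
        this_thread_->private_outstanding_work - 1);
  }
  else if (this_thread_->private_outstanding_work < 1)
    scheduler_->work_finished();
  this_thread_->private_outstanding_work = 0;

  // Merge any operations produced by the handler back into the shared
  // queue, under the scheduler lock.
  if (!this_thread_->private_op_queue.empty())
  {
    lock_->lock();
    scheduler_->op_queue_.push(this_thread_->private_op_queue);
  }
}
```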
Summary
Studying the `scheduler` source reveals the following concurrency characteristics:
- All operations on the `scheduler` data member `op_queue_` must be performed while holding the `scheduler`'s own lock, so they cannot run concurrently
- Operations on the (atomic) `scheduler` data member `outstanding_work_` are atomic
- The queue argument to `reactor::run` is a thread-private queue, and the `epoll_wait` inside it runs concurrently
- `reactor::start_op` must acquire the descriptor's lock; calls on different descriptors can run concurrently
Notably, almost all operations on `op_queue_` must be carried out under the mutex, which does not sound very "concurrent". Boost does have a lock-free queue implementation; while it avoids locks, such algorithms often perform worse in practice than lock-based ones. Moreover, the `scheduler` lock is held only for the short span of fetching an element (a pointer) from `op_queue_` and popping it; executing the user's operation requires no lock. All things considered, the concurrency is not bad. A minimal multithreaded usage sketch follows.
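Several threads calling `io_context::run` all enter `scheduler::run` and compete to pop operations from `op_queue_`; a minimal example (standard Boost.Asio usage):

```cpp
#include <boost/asio.hpp>
#include <thread>
#include <vector>

int main()
{
  boost::asio::io_context ioc;

  // Keep run() from returning while op_queue_ is momentarily empty.
  auto work = boost::asio::make_work_guard(ioc);

  for (int i = 0; i < 100; ++i)
    boost::asio::post(ioc, []{ /* may run on any of the threads below */ });

  // Each thread enters scheduler::run and pops operations under the
  // scheduler lock; the handlers themselves execute with no lock held.
  std::vector<std::thread> threads;
  for (int i = 0; i < 4; ++i)
    threads.emplace_back([&ioc]{ ioc.run(); });

  work.reset(); // let run() exit once the queue drains
  for (auto& t : threads)
    t.join();
}
```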
strand
When we need several user operations to be mutually exclusive, we can use a strand. Mutually exclusive operations can be submitted via `strand::dispatch`, implemented by `detail::strand_executor_service::dispatch`, which proceeds as follows:
- Check whether we are already inside the strand; if so, run the operation directly and return
- Wrap the operation and call `strand_executor_service::enqueue`, saving the return value in `first`
- If `first` is true, dispatch the strand implementation wrapped in an `invoker` object
// file:
...
template <typename Function, typename Allocator>
void dispatch(BOOST_ASIO_MOVE_ARG(Function) f, const Allocator& a) const
{
  detail::strand_executor_service::dispatch(impl_,
      executor_, BOOST_ASIO_MOVE_CAST(Function)(f), a);
}
...
// file:
...
template <typename Executor, typename Function, typename Allocator>
void strand_executor_service::dispatch(const implementation_type& impl,
    Executor& ex, BOOST_ASIO_MOVE_ARG(Function) function, const Allocator& a)
{
  typedef typename decay<Function>::type function_type;

  // If we are already in the strand then the function can run immediately.
  if (call_stack<strand_impl>::contains(impl.get()))
  {
    // Make a local, non-const copy of the function.
    function_type tmp(BOOST_ASIO_MOVE_CAST(Function)(function));

    fenced_block b(fenced_block::full);
    boost_asio_handler_invoke_helpers::invoke(tmp, tmp);
    return;
  }

  // Allocate and construct an operation to wrap the function.
  typedef executor_op<function_type, Allocator> op;
  typename op::ptr p = { detail::addressof(a), op::ptr::allocate(a), 0 };
  p.p = new (p.v) op(BOOST_ASIO_MOVE_CAST(Function)(function), a);

  BOOST_ASIO_HANDLER_CREATION((impl->service_->context(), *p.p,
        "strand_executor", impl.get(), 0, "dispatch"));

  // Add the function to the strand and schedule the strand if required.
  bool first = enqueue(impl, p.p);
  p.v = p.p = 0;
  if (first)
    ex.dispatch(invoker<Executor>(impl, ex), a);
}
...
Continuing from above, the key functions are `strand_executor_service::enqueue` and `invoker::operator()`. Specifically:
- `strand_executor_service::enqueue` pushes the operation onto a queue under the lock, and uses the test and assignment of a bool variable to decide whether this caller is the first to acquire the strand "lock"
- `invoker::operator()`:
  - Pushes `strand_impl` onto `call_stack`.
  - Executes all operations in `ready_queue_` in order. Note that, thanks to `call_stack`, if an operation calls `dispatch` on the same `strand_impl` while executing, the dispatched operation runs immediately
  - Invokes the `on_invoker_exit` destructor:
    - Acquire the lock
    - Move the members of `waiting_queue_` into `ready_queue_`
    - If `ready_queue_` is empty, clear `locked_` (indicating that, as the "current first" thread to acquire the strand lock, its work is done)
    - Release the lock
    - If `ready_queue_` was found non-empty (while the lock was held), post the `invoker`
// file:
...
bool strand_executor_service::enqueue(const implementation_type& impl,
    scheduler_operation* op)
{
  impl->mutex_->lock();
  if (impl->shutdown_)
  {
    impl->mutex_->unlock();
    op->destroy();
    return false;
  }
  else if (impl->locked_)
  {
    // Some other function already holds the strand lock. Enqueue for later.
    impl->waiting_queue_.push(op);
    impl->mutex_->unlock();
    return false;
  }
  else
  {
    // The function is acquiring the strand lock and so is responsible for
    // scheduling the strand.
    impl->locked_ = true;
    impl->mutex_->unlock();
    impl->ready_queue_.push(op);
    return true;
  }
}
...
// file:
~on_invoker_exit()
{
  this_->impl_->mutex_->lock();
  this_->impl_->ready_queue_.push(this_->impl_->waiting_queue_);
  bool more_handlers = this_->impl_->locked_ =
    !this_->impl_->ready_queue_.empty();
  this_->impl_->mutex_->unlock();

  if (more_handlers)
  {
    Executor ex(this_->work_.get_executor());
    recycling_allocator<void> allocator;
    ex.post(BOOST_ASIO_MOVE_CAST(invoker)(*this_), allocator);
  }
}
...
void operator()()
{
  // Indicate that this strand is executing on the current thread.
  call_stack<strand_impl>::context ctx(impl_.get());

  // Ensure the next handler, if any, is scheduled on block exit.
  on_invoker_exit on_exit = { this };
  (void)on_exit;

  // Run all ready handlers. No lock is required since the ready queue is
  // accessed only within the strand.
  boost::system::error_code ec;
  while (scheduler_operation* o = impl_->ready_queue_.front())
  {
    impl_->ready_queue_.pop();
    o->complete(impl_.get(), ec, 0);
  }
}
...
Summary
In short, with a strand, multiple threads acquire the strand lock and push operations onto a queue, and some single thread dispatches the strand (as an op that contains ops). Let's look at the strand's effect on concurrency:
- Operations running inside the strand (i.e., stored in the strand's queues) are, obviously, executed in order.
- Locking. Running a strand inevitably adds extra lock operations; because the strand's multithreaded logic involves two queues ("ready" and "waiting"), the lock is held slightly longer, but the main pattern is the same as above: the lock is taken only while manipulating the queues (whose members are pointers, so the cost is small), and no lock is held while an operation executes. A usage sketch follows.
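For instance, serializing access to shared state from two `run` threads with an executor-based strand (standard Boost.Asio usage; this is the `strand<Executor>` whose `dispatch` was analyzed above):

```cpp
#include <boost/asio.hpp>
#include <iostream>
#include <thread>

int main()
{
  boost::asio::io_context ioc;
  auto work = boost::asio::make_work_guard(ioc);

  // Executor-based strand; its dispatch/post go through
  // detail::strand_executor_service, analyzed above.
  boost::asio::strand<boost::asio::io_context::executor_type>
      strand(ioc.get_executor());

  int counter = 0; // no mutex: all access is serialized by the strand

  for (int i = 0; i < 1000; ++i)
    boost::asio::post(strand, [&counter]{ ++counter; });

  std::thread t1([&ioc]{ ioc.run(); });
  std::thread t2([&ioc]{ ioc.run(); });

  work.reset();
  t1.join();
  t2.join();

  std::cout << counter << "\n"; // always 1000
}
```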
memory_order
(todo: since the author's understanding of Asio is not yet deep enough, this part remains unfinished)
Recall the source of `executor_op::do_complete`: before invoking `handler`, it constructs a `fenced_block` instance, and this is concurrency-related code. The std version of `fenced_block` is shown below. The class is fairly simple; it mainly calls (or does not call) `std::atomic_thread_fence` in its constructor and destructor. That function establishes memory synchronization ordering. Searching the Asio sources shows that the various `xxxxxxx_op` classes construct `fenced_block b(fenced_block::half)` before executing `complete`, while the `dispatch` member functions of `io_context::executor_type`, `strand_service`, and `thread_pool` may execute an operation directly, constructing `fenced_block b(fenced_block::full)` before doing so. Abstractly speaking, a fence places certain restrictions on the ordering of memory operations before and after it, given that the CPU or the compiler may reorder them for optimization; as a result, other threads observe this thread's side effects on memory in a certain order.
To explain the role of fences in Asio, some related background is needed first.
Implicit strand, quoting the official documentation:
Where there is a single chain of asynchronous operations associated with a connection (e.g. in a half duplex protocol implementation like HTTP) there is no possibility of concurrent execution of the handlers. This is an implicit strand.
Concurrency hints. When the concurrency hint is BOOST_ASIO_CONCURRENCY_HINT_UNSAFE, some mutexes are not used, yet multithreaded execution is still possible; the user then needs extra measures to keep the io_context's internal state safe. From the official documentation:
BOOST_ASIO_CONCURRENCY_HINT_UNSAFE: This special concurrency hint disables locking in both the scheduler and reactor I/O. This hint has the following restrictions:
— Care must be taken to ensure that all operations on the io_context and any of its associated I/O objects (such as sockets and timers) occur in only one thread at a time.
— Asynchronous resolve operations fail with operation_not_supported.
— If a signal_set is used with the io_context, signal_set objects cannot be used with any other io_context in the program.
When the scheduler and reactor have their mutexes disabled, and the user's operations follow the implicit strand pattern, how does Asio guarantee that data written by handler A on thread 1 is visible to handler B that subsequently runs on thread 2? Consider a possible execution of handler A and handler B:
- [thread 1] `fenced_block b(fenced_block::half);`
- [thread 1] handler A writes variable x
- [thread 1] handler A submits an asynchronous read with handler B as its callback
- [thread 1] the fence is destroyed: `std::atomic_thread_fence(std::memory_order_release);`
- [thread 1] the `xxxx_cleanup` destructor runs (which contains atomic operations)
- [thread 1] .....
- [thread 2] it has just entered `io_context::run` and reads the atomic object `outstanding_work_`; or it runs the `xxxx_cleanup` destructor after its previous task
- [thread 2] handler B reads the variable x written by handler A
Note the ordering: handler A's write -> fence destruction -> atomic write; atomic read -> handler B's read. This is exactly fence-atomic synchronization, which guarantees that B sees the data A wrote into x. A standalone illustration follows.
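As an illustration of this fence-atomic synchronization pattern (plain C++, not Asio code; `guard` stands in for an atomic such as `outstanding_work_`):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int x = 0;                 // plain non-atomic data, playing "variable x"
std::atomic<int> guard{0}; // stands in for an atomic like outstanding_work_

void thread1() // handler A's thread
{
  x = 42;                                              // handler A writes x
  std::atomic_thread_fence(std::memory_order_release); // ~fenced_block (half)
  guard.store(1, std::memory_order_relaxed);           // atomic write (cleanup)
}

void thread2() // handler B's thread
{
  while (guard.load(std::memory_order_acquire) == 0)   // atomic read
    ;                                                  // spin until published
  assert(x == 42); // fence-atomic synchronization makes A's write visible
}

int main()
{
  std::thread t1(thread1), t2(thread2);
  t1.join();
  t2.join();
}
```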
(todo) We would next consider the use of `fenced_block b(fenced_block::full);` in Asio.
// file:
...
static void do_complete(void* owner, Operation* base,
    const boost::system::error_code& /*ec*/,
    std::size_t /*bytes_transferred*/)
{
  ...
  // Make the upcall if required.
  if (owner)
  {
    fenced_block b(fenced_block::half);
    BOOST_ASIO_HANDLER_INVOCATION_BEGIN(());
    boost_asio_handler_invoke_helpers::invoke(handler, handler);
    BOOST_ASIO_HANDLER_INVOCATION_END;
  }
}
...
// file:
...
class std_fenced_block
  : private noncopyable
{
public:
  enum half_t { half };
  enum full_t { full };

  // Constructor for a half fenced block.
  explicit std_fenced_block(half_t)
  {
  }

  // Constructor for a full fenced block.
  explicit std_fenced_block(full_t)
  {
    std::atomic_thread_fence(std::memory_order_acquire);
  }

  // Destructor.
  ~std_fenced_block()
  {
    std::atomic_thread_fence(std::memory_order_release);
  }
};
...
// file:
template <typename Function, typename Allocator>
void io_context::executor_type::dispatch(
    BOOST_ASIO_MOVE_ARG(Function) f, const Allocator& a) const
{
  typedef typename decay<Function>::type function_type;

  // Invoke immediately if we are already inside the thread pool.
  if (io_context_.impl_.can_dispatch())
  {
    // Make a local, non-const copy of the function.
    function_type tmp(BOOST_ASIO_MOVE_CAST(Function)(f));

    detail::fenced_block b(detail::fenced_block::full);
    boost_asio_handler_invoke_helpers::invoke(tmp, tmp);
    return;
  }

  // Allocate and construct an operation to wrap the function.
  ...
}
...
Concurrency Hints
The `io_context` constructor accepts a parameter called the concurrency hint, which affects the concurrency characteristics of the `io_context`. See the official documentation for the details. From this we can summarize how responsibility for thread safety is divided:
- Asio guarantees:
  - By default, the thread safety of the `io_context`'s internal state; in the other configurations, documenting how the user can ensure that safety
  - The implementation of strands (including implicit strands)
- The user is responsible for:
  - Ensuring that operations are safe under parallel execution, or using a "strand" to keep operations from running in parallel
  - In some configurations, accepting extra restrictions to keep the `io_context`'s internal state safe
A brief construction example follows.
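For illustration, the hint is passed as the constructor's integer argument (both forms below are part of the documented API):

```cpp
#include <boost/asio.hpp>

int main()
{
  // Hint that at most one thread will run this io_context; Asio may then
  // skip some internal locking (the conditionally_enabled_mutex machinery).
  boost::asio::io_context single_threaded(1);

  // Disable locking in both scheduler and reactor; the restrictions quoted
  // above apply, and safety becomes entirely the user's responsibility.
  boost::asio::io_context unsafe(BOOST_ASIO_CONCURRENCY_HINT_UNSAFE);

  boost::asio::post(single_threaded, []{});
  single_threaded.run();
  unsafe.run();
}
```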