I have been analyzing brpc's network implementation on and off lately, as mentioned before. The framework I use at work handles all receive/send logic in a single network thread, which only forwards data and does nothing else. After a production issue caused noticeable stalls, I suspected that when network traffic is heavy, or when one socket has too much data to write, other sockets may get starved, so clients fail to receive their responses or broadcast messages in time — for example, several monsters freezing in place for a few seconds. Client requests are small, so the server is unlikely to be stuck reading, but continuously writing to one socket can delay the read/write events of every other socket on the same thread, which is what prompted a closer look.
The open-source frameworks I have analyzed target different workloads, so their implementations and the problems they run into differ. Redis is single-threaded — network and business logic share one thread — yet performs very well. skynet, analyzed earlier, is multi-threaded underneath but has a single network thread that handles the read/write events of all socket connections. phxrpc is multi-threaded: an accept thread checks the load of the worker units for each new connection and hands it to the least-loaded one; each unit consists of several worker threads plus one network thread running an event loop. Connections are balanced across units, but each unit still has essentially one network thread, so if one unit is busy while others sit idle, the pending sockets in the busy unit are still held up by whichever socket that unit is currently reading or writing.
Then comes brpc, which brings a different implementation. Quoting the original documentation:
“brpc uses one or several EventDispatchers (EDISP for short) to wait for events on any fd. Unlike the common 'IO thread' model, EDISP is not responsible for reading. The problem with IO threads is that one thread can only read one fd at a time; when several busy fds gather in one IO thread, some reads get delayed. Multi-tenancy, complex load-splitting algorithms, Streaming RPC and similar features aggravate the problem. An occasional slow read under high load drags down the reads of all fds in that IO thread, which hurts availability considerably.”
Next is an overall walkthrough of the socket-related implementation. The Socket class is fairly complex, so some details are skipped and only the key parts are analyzed, in simplified form.
When the server starts it calls StartInternal, which creates the listening socket used to accept connections:
967 if (_am == NULL) {
968 _am = BuildAcceptor();
969 //check...
973 }
974 //more code...
981 // Pass ownership of `sockfd' to `_am'
982 if (_am->StartAccept(sockfd, _options.idle_timeout_sec,
983 _default_ssl_ctx) != 0) {
985 return -1;
986 }
As analyzed in the earlier post [brpc之消息处理流程], BuildAcceptor initializes the handlers for each protocol.
35 // Accept connections from a specific port and then
36 // process messages from which it reads
37 class Acceptor : public InputMessenger {
38 public:
39     typedef butil::FlatMap<SocketId, ConnectStatistics> SocketMap;
40
41 enum Status {
42 UNINITIALIZED = 0,
43 READY = 1,
44 RUNNING = 2,
45 STOPPING = 3,
46 };
47
48 public:
83 private:
84 // Accept connections.
85 static void OnNewConnectionsUntilEAGAIN(Socket* m);
86 static void OnNewConnections(Socket* m);
87
96 bthread_keytable_pool_t* _keytable_pool; // owned by Server
99 bthread_t _close_idle_tid;
107
108 // The map containing all the accepted sockets
109 SocketMap _socket_map;
112 };
Acceptor inherits from InputMessenger and manages all accepted connections. Next, StartAccept:
50 int Acceptor::StartAccept(int listened_fd, int idle_timeout_sec,
51                           const std::shared_ptr<SocketSSLContext>& ssl_ctx) {
52 //more code...
69 if (idle_timeout_sec > 0) {
70 if (bthread_start_background(&_close_idle_tid, NULL,
71 CloseIdleConnections, this) != 0) {
73 return -1;
74 }
75 }
79 // Creation of _acception_id is inside lock so that OnNewConnections
80 // (which may run immediately) should see sane fields set below.
81 SocketOptions options;
82 options.fd = listened_fd;
83 options.user = this;
84 options.on_edge_triggered_events = OnNewConnections;
85 if (Socket::Create(options, &_acception_id) != 0) {
86 // Close-idle-socket thread will be stopped inside destructor
88 return -1;
89 }
90
91 _listened_fd = listened_fd;
92 _status = RUNNING;
93 return 0;
94 }
StartAccept runs a bthread in the background executing CloseIdleConnections, which wakes up periodically and checks every live connection, closing those that have not sent or received data for a configured time — a fairly simple job. The member void (*on_edge_triggered_events)(Socket*) is the callback invoked when an event arrives on the fd; for the listening socket it is OnNewConnections, analyzed later. Next, creating the Socket object:
585 // SocketId = 32-bit version + 32-bit slot.
586 // version: from version part of _versioned_nref, must be an EVEN number.
587 // slot: designated by ResourcePool.
588 int Socket::Create(const SocketOptions& options, SocketId* id) {
589     butil::ResourceId<Socket> slot;
590 Socket* const m = butil::get_resource(&slot, Forbidden());
591 if (m == NULL) {
593 return -1;
594 }
601 m->_on_edge_triggered_events = options.on_edge_triggered_events;
609 m->_this_id = MakeSocketId(
610 VersionOfVRef(m->_versioned_ref.fetch_add(
611 1, butil::memory_order_release)), slot);
612 m->_preferred_index = -1;
652 CHECK(NULL == m->_write_head.load(butil::memory_order_relaxed));
653 // Must be last one! Internal fields of this Socket may be access
654 // just after calling ResetFileDescriptor.
655 if (m->ResetFileDescriptor(options.fd) != 0) {
656 //error...
661 }
662 *id = m->_this_id;
663 return 0;
664 }
The listing above keeps only the operations on the key data members; the rest matters too but is not the focus here. In socket.h the declaration of class Socket runs over 640 lines including comments, so it is not pasted in full — relevant parts are quoted as the corresponding operations come up. Next, ResetFileDescriptor:
522 int Socket::ResetFileDescriptor(int fd) {
523 // Reset message sizes when fd is changed.
524 _last_msg_size = 0;
525 _avg_msg_size = 0;
526 // MUST store `_fd' before adding itself into epoll device to avoid
527 // race conditions with the callback function inside epoll
528 _fd.store(fd, butil::memory_order_release);
543 // Make the fd non-blocking.
544 if (butil::make_non_blocking(fd) != 0) {
546 return -1;
547 }
550 butil::make_no_delay(fd);
574 if (_on_edge_triggered_events) {
575 if (GetGlobalEventDispatcher(fd).AddConsumer(id(), fd) != 0) {
576 //error...
580 }
581 }
582 return 0;
583 }
ResetFileDescriptor makes the fd non-blocking, enables TCP_NODELAY, sets the fd's SO_SNDBUF and SO_RCVBUF sizes, and finally adds the fd into epoll to wait for events. EventDispatcher is the wrapper around the event dispatcher and is analyzed next.
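For reference, butil::make_non_blocking and butil::make_no_delay are presumably thin wrappers over the standard fcntl/setsockopt calls; a minimal sketch of what they amount to (my own sketch, not brpc's actual code):

#include <fcntl.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Roughly what butil::make_non_blocking presumably does: add O_NONBLOCK to the fd flags.
int make_non_blocking_sketch(int fd) {
    const int flags = fcntl(fd, F_GETFL, 0);
    return (flags < 0) ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

// Roughly what butil::make_no_delay presumably does: disable Nagle's algorithm.
int make_no_delay_sketch(int fd) {
    int on = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}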
The number of dispatchers is configurable, so there may be multiple EventDispatcher instances:
352 static void StopAndJoinGlobalDispatchers() {
353 for (int i = 0; i < FLAGS_event_dispatcher_num; ++i) {
354 g_edisp[i].Stop();
355 g_edisp[i].Join();
356 }
357 }
358 void InitializeGlobalDispatchers() {
359 g_edisp = new EventDispatcher[FLAGS_event_dispatcher_num];
360 for (int i = 0; i < FLAGS_event_dispatcher_num; ++i) {
361 const bthread_attr_t attr = FLAGS_usercode_in_pthread ?
362 BTHREAD_ATTR_PTHREAD : BTHREAD_ATTR_NORMAL;
363 CHECK_EQ(0, g_edisp[i].Start(&attr));
364 }
365 // This atexit is will be run before g_task_control.stop() because above
366 // Start() initializes g_task_control by creating bthread (to run epoll/kqueue).
367 CHECK_EQ(0, atexit(StopAndJoinGlobalDispatchers));
368 }
369
370 EventDispatcher& GetGlobalEventDispatcher(int fd) {
371 pthread_once(&g_edisp_once, InitializeGlobalDispatchers);
372 if (FLAGS_event_dispatcher_num == 1) {
373 return g_edisp[0];
374 }
375 int index = butil::fmix32(fd) % FLAGS_event_dispatcher_num;
376 return g_edisp[index];
377 }
The code above initializes the dispatchers and picks one by hashing the fd (fmix32(fd) modulo the dispatcher count). EventDispatcher monitors events in edge-triggered (ET) mode:
89 int EventDispatcher::Start(const bthread_attr_t* consumer_thread_attr) {
115 int rc = bthread_start_background(
116 &_tid, &_consumer_thread_attr, RunThis, this);
117 if (rc) {
119 return -1;
120 }
121 return 0;
122 }
273 void* EventDispatcher::RunThis(void* arg) {
274 ((EventDispatcher*)arg)->Run();
275 return NULL;
276 }
278 void EventDispatcher::Run() {
279 while (!_stop) {
282 epoll_event e[32];
290 const int n = epoll_wait(_epfd, e, ARRAY_SIZE(e), -1);
296 if (_stop) {
297 // epoll_ctl/epoll_wait should have some sort of memory fencing
298 // guaranteeing that we(after epoll_wait) see _stop set before
299 // epoll_ctl.
300 break;
301 }
302 if (n < 0) {
303 if (EINTR == errno) {
304 // We've checked _stop, no wake-up will be missed.
305 continue;
306 }
312 break;
313 }
314 for (int i = 0; i < n; ++i) {
316 if (e[i].events & (EPOLLIN | EPOLLERR | EPOLLHUP)) {
321 // We don't care about the return value.
322 Socket::StartInputEvent(e[i].data.u64, e[i].events,
323 _consumer_thread_attr);
324 }
332 }
333 for (int i = 0; i < n; ++i) {
335 if (e[i].events & (EPOLLOUT | EPOLLERR | EPOLLHUP)) {
336 // We don't care about the return value.
337 Socket::HandleEpollOut(e[i].data.u64);
338 }
345 }
346 }
347 }
Run handles IN and OUT (plus error/hangup) events separately, dispatching to Socket::StartInputEvent and Socket::HandleEpollOut respectively. Adding an fd to the dispatcher:
225 int EventDispatcher::AddConsumer(SocketId socket_id, int fd) {
226 if (_epfd < 0) {
227 errno = EINVAL;
228 return -1;
229 }
231 epoll_event evt;
232 evt.events = EPOLLIN | EPOLLET;
233 evt.data.u64 = socket_id;
237 return epoll_ctl(_epfd, EPOLL_CTL_ADD, fd, &evt);
245 }
For the listening socket, a new connection arriving shows up as a readable event, which triggers StartInputEvent:
1924 int Socket::StartInputEvent(SocketId id, uint32_t events,
1925 const bthread_attr_t& thread_attr) {
1926 SocketUniquePtr s;
1927 if (Address(id, &s) < 0) {
1928 return -1;
1929 }
1930 if (NULL == s->_on_edge_triggered_events) {
1931 // Callback can be NULL when receiving error epoll events
1932 // (Added into epoll by `WaitConnected')
1933 return 0;
1934 }
1935 if (s->fd() < 0) {
1941 return -1;
1942 }
1947 // Passing e[i].events causes complex visibility issues and
1948 // requires stronger memory fences, since reading the fd returns
1949 // error as well, we don't pass the events.
1950 if (s->_nevent.fetch_add(1, butil::memory_order_acq_rel) == 0) {
1956 bthread_t tid;
1957 // transfer ownership as well, don't use s anymore!
1958 Socket* const p = s.release();
1959
1960 bthread_attr_t attr = thread_attr;
1961 attr.keytable_pool = p->_keytable_pool;
1962 if (bthread_start_urgent(&tid, &attr, ProcessEvent, p) != 0) {
1964 ProcessEvent(p);
1965 }
1966 }
1967 return 0;
1968 }
This looks up the Socket object by socket_id; if it cannot be addressed, the socket has been removed, which is allowed. It then checks whether a callback is registered and returns if not. The atomic _nevent controls whether a bthread is already handling read events on this socket. The benefit, quoting the documentation:
"EDISP uses the edge-triggered mode. When an event arrives, EDISP atomically adds 1 to a counter, and only starts a bthread to process the data on the fd when the value before the increment was 0. Behind the scenes, EDISP yields its pthread to the newly created bthread so that it gets better cache locality and can read the fd as soon as possible, while the bthread EDISP was running in is stolen by another pthread and keeps running — this is bthread's work-stealing scheduling."
As analyzed earlier, bthread_start_urgent starts a bthread immediately, which runs ProcessEvent:
1017 void* Socket::ProcessEvent(void* arg) {
1018 // the enclosed Socket is valid and free to access inside this function.
1019     SocketUniquePtr s(static_cast<Socket*>(arg));
1020 s->_on_edge_triggered_events(s.get());
1021 return NULL;
1022 }
ProcessEvent invokes the callback, here OnNewConnections:
317 void Acceptor::OnNewConnections(Socket* acception) {
318 int progress = Socket::PROGRESS_INIT;
319 do {
320 OnNewConnectionsUntilEAGAIN(acception);
321 if (acception->Failed()) {
322 return;
323 }
324 } while (acception->MoreReadEvents(&progress));
325 }
243 void Acceptor::OnNewConnectionsUntilEAGAIN(Socket* acception) {
244 while (1) {
245 struct sockaddr in_addr;
246 socklen_t in_len = sizeof(in_addr);
247 butil::fd_guard in_fd(accept(acception->fd(), &in_addr, &in_len));
248 if (in_fd < 0) {
249 // no EINTR because listened fd is non-blocking.
250 if (errno == EAGAIN) {
251 return;
252 }
253 //error...
260 continue;
261 }
270 SocketId socket_id;
271 SocketOptions options;
272 options.keytable_pool = am->_keytable_pool;
273 options.fd = in_fd;
274 options.remote_side = butil::EndPoint(*(sockaddr_in*)&in_addr);
275 options.user = acception->user();
276 options.on_edge_triggered_events = InputMessenger::OnNewMessages;
277 options.initial_ssl_ctx = am->_ssl_ctx;
278 if (Socket::Create(options, &socket_id) != 0) {
280 continue;
281 }
282 in_fd.release(); // transfer ownership to socket_id
284 // There's a funny race condition here. After Socket::Create, messages
285 // from the socket are already handled and a RPC is possibly done
286 // before the socket is added into _socket_map below. This is found in
287 // ChannelTest.skip_parallel in test/brpc_channel_unittest.cpp (running
288 // on machines with few cores) where the _messenger.ConnectionCount()
289 // may surprisingly be 0 even if the RPC is already done.
290
291 SocketUniquePtr sock;
292 if (Socket::AddressFailedAsWell(socket_id, &sock) >= 0) {
293 bool is_running = true;
294 {
295 BAIDU_SCOPED_LOCK(am->_map_mutex);
296 is_running = (am->status() == RUNNING);
297 // Always add this socket into `_socket_map' whether it
298 // has been `SetFailed' or not, whether `Acceptor' is
299 // running or not. Otherwise, `Acceptor::BeforeRecycle'
300 // may be called (inside Socket::OnRecycle) after `Acceptor'
301 // has been destroyed
302 am->_socket_map.insert(socket_id, ConnectStatistics());
303 }
304 if (!is_running) {
305 //error...
310 return;
311 }
312 } // else: The socket has already been destroyed, Don't add its id
313 // into _socket_map
314 }
315 }
A new connection is accepted here and associated with the OnNewMessages callback. The comment and the AddressFailedAsWell code are kept because they are worth thinking about: as the comment says, once Socket::Create registers the fd with epoll, a readable event may fire immediately, be handled by a bthread on another thread, and the whole RPC may finish — and the socket may even be released — before the id is inserted into _socket_map below, so the object referred to by socket_id here may already be invalid. AddressFailedAsWell itself is fairly involved and is analyzed later. The OnNewMessages callback was covered in the earlier post [brpc之消息处理流程] and is not repeated.
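This is also why the code holds a SocketId rather than a raw Socket*: per the comment above Socket::Create, the id packs a version (taken from _versioned_nref) together with the ResourcePool slot, so Address()/AddressFailedAsWell can presumably detect that a slot has been recycled and refuse to hand out a stale object. A rough sketch of the packing (the helper names and the assumption that the version sits in the high 32 bits are mine, not brpc's exact code):

#include <stdint.h>

typedef uint64_t SocketId;

// Sketch of "SocketId = 32-bit version + 32-bit slot".
inline SocketId MakeSocketIdSketch(uint32_t version, uint32_t slot) {
    return (static_cast<uint64_t>(version) << 32) | slot;
}
inline uint32_t VersionOfId(SocketId id) { return static_cast<uint32_t>(id >> 32); }
inline uint32_t SlotOfId(SocketId id)    { return static_cast<uint32_t>(id); }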
Since the accept logic is a while(1) loop on an edge-triggered fd, EAGAIN means there is temporarily no new connection to accept and the function returns. Next:
227 inline bool Socket::MoreReadEvents(int* progress) {
228 // Fail to CAS means that new events arrived.
229 return !_nevent.compare_exchange_strong(
230 *progress, 0, butil::memory_order_release,
231 butil::memory_order_acquire);
232 }
This is needed because StartInputEvent may fire again while OnNewConnections is running: if _nevent still equals *progress (initially 1), it is swapped back to 0 and the loop exits; otherwise the current value is recorded into *progress and accepting continues.
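The pattern in isolation looks roughly like this (a minimal sketch with std::atomic, not brpc's code; StartConsumerBthread and ConsumeEventsUntilEAGAIN are hypothetical stand-ins for bthread_start_urgent(ProcessEvent) and OnNewConnectionsUntilEAGAIN / the message parsers):

#include <atomic>

// Hypothetical hooks, declared only for the sketch.
void StartConsumerBthread();
void ConsumeEventsUntilEAGAIN();

static std::atomic<int> nevent{0};

// Producer side (what StartInputEvent does): only the 0 -> 1 transition
// starts a consumer; later events just bump the counter.
void OnEvent() {
    if (nevent.fetch_add(1, std::memory_order_acq_rel) == 0) {
        StartConsumerBthread();
    }
}

// Consumer side (the OnNewConnections / MoreReadEvents loop): after draining,
// try to swap the counter back to 0; a failed CAS means new events arrived
// (the current count is loaded into `progress'), so drain again.
void Consumer() {
    int progress = 1;  // PROGRESS_INIT: the event that started us
    do {
        ConsumeEventsUntilEAGAIN();
    } while (!nevent.compare_exchange_strong(progress, 0,
                                             std::memory_order_release,
                                             std::memory_order_acquire));
}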
When a writable (or error/hangup) event occurs on a socket:
1258 int Socket::HandleEpollOut(SocketId id) {
1259 SocketUniquePtr s;
1260 // Since Sockets might have been `SetFailed' before they were
1261 // added into epoll, these sockets miss the signal inside
1262 // `SetFailed' and therefore must be signalled here using
1263 // `AddressFailedAsWell' to prevent waiting forever
1264 if (Socket::AddressFailedAsWell(id, &s) < 0) {
1265 // Ignore recycled sockets
1266 return -1;
1267 }
1268
1269     EpollOutRequest* req = dynamic_cast<EpollOutRequest*>(s->user());
1270 if (req != NULL) {
1271 return s->HandleEpollOutRequest(0, req);
1272 }
1273
1274 // Currently `WaitEpollOut' needs `_epollout_butex'
1275 // TODO(jiangrujie): Remove this in the future
1276 s->_epollout_butex->fetch_add(1, butil::memory_order_relaxed);
1277 bthread::butex_wake_except(s->_epollout_butex, 0);
1278 return 0;
1279 }
1295 int Socket::HandleEpollOutRequest(int error_code, EpollOutRequest* req) {
1296 // Only one thread can `SetFailed' this `Socket' successfully
1297 // Also after this `req' will be destroyed when its reference
1298 // hits zero
1299 if (SetFailed() != 0) {
1300 return -1;
1301 }
1302 // We've got the right to call user callback
1303 // The timer will be removed inside destructor of EpollOutRequest
1304 GetGlobalEventDispatcher(req->fd).RemoveEpollOut(id(), req->fd, false);
1305 return req->on_epollout_event(req->fd, error_code, req->data);
1306 }
There is not much to dissect in these few lines on their own; they are tied together below. Let's look specifically at what happens when data has to be written.
Accepted connections are registered with EPOLLIN | EPOLLET only, without EPOLLOUT; write interest is added later by calling AddEpollOut when it is actually needed.
For example, when the server writes back an RPC response, the baidu_std protocol does:
253 Socket::WriteOptions wopt;
254 wopt.ignore_eovercrowded = true;
255 if (sock->Write(&res_buf, &wopt) != 0) {
256 const int errcode = errno;
258 cntl->SetFailed(errcode, "Fail to write into %s",
259 sock->description().c_str());
260 return;
261 }
1432 int Socket::Write(butil::IOBuf* data, const WriteOptions* options_in) {
1433 //more code...
1456     WriteRequest* req = butil::get_object<WriteRequest>();
1457 if (!req) {
1458 return SetError(opt.id_wait, ENOMEM);
1459 }
1460
1461 req->data.swap(*data);
1462 // Set `req->next' to UNCONNECTED so that the KeepWrite thread will
1463 // wait until it points to a valid WriteRequest or NULL.
1464 req->next = WriteRequest::UNCONNECTED;
1465 req->id_wait = opt.id_wait;
1466 req->set_pipelined_count_and_user_message(
1467 opt.pipelined_count, DUMMY_USER_MESSAGE, opt.with_auth);
1468 return StartWrite(req, opt);
1469 }
303 struct BAIDU_CACHELINE_ALIGNMENT Socket::WriteRequest {
304 static WriteRequest* const UNCONNECTED;
305
306 butil::IOBuf data;
307 WriteRequest* next;
308 bthread_id_t id_wait;
309 Socket* socket;
54 };
So at the start of a write, a WriteRequest is allocated, the data to write is swapped into it, and set_pipelined_count_and_user_message is called (its exact purpose is analyzed later). Then StartWrite:
1506 int Socket::StartWrite(WriteRequest* req, const WriteOptions& opt) {
1507 // Release fence makes sure the thread getting request sees *req
1508 WriteRequest* const prev_head =
1509 _write_head.exchange(req, butil::memory_order_release);
1510 if (prev_head != NULL) {
1511 // Someone is writing to the fd. The KeepWrite thread may spin
1512 // until req->next to be non-UNCONNECTED. This process is not
1513 // lock-free, but the duration is so short(1~2 instructions,
1514 // depending on compiler) that the spin rarely occurs in practice
1515 // (I've not seen any spin in highly contended tests).
1516 req->next = prev_head;
1517 return 0;
1518 }
StartWrite atomically exchanges _write_head to point to req. If another bthread is already writing (prev_head != NULL), prev_head is hung on req->next and the call returns — note that the list therefore grows in reverse order. Two threads must never write the same socket simultaneously, otherwise the data would interleave:
1520 int saved_errno = 0;
1521 bthread_t th;
1522 SocketUniquePtr ptr_for_keep_write;
1523 ssize_t nw = 0;
1524
1525 // We've got the right to write.
1526 req->next = NULL;
1527
1528 // Connect to remote_side() if not.
1529 int ret = ConnectIfNot(opt.abstime, req);
1530 //ret >=0
1540 // NOTE: Setup() MUST be called after Connect which may call app_connect,
1541 // which is assumed to run before any SocketMessage.AppendAndDestroySelf()
1542 // in some protocols(namely RTMP).
1543 req->Setup(this);
1544
1545 if (ssl_state() != SSL_OFF) {
1546 // Writing into SSL may block the current bthread, always write
1547 // in the background.
1548 goto KEEPWRITE_IN_BACKGROUND;
1549 }
1551 // Write once in the calling thread. If the write is not complete,
1552 // continue it in KeepWrite thread.
1553 if (_conn) {
1554 //
1556 } else {
1557 nw = req->data.cut_into_file_descriptor(fd());
1558 }
1559 if (nw < 0) {
1560 //error...
1569 } else {
1570 AddOutputBytes(nw);
1571 }
1572 if (IsWriteComplete(req, true, NULL)) {
1573 ReturnSuccessfulWriteRequest(req);
1574 return 0;
1575 }
1577 KEEPWRITE_IN_BACKGROUND:
1578 ReAddress(&ptr_for_keep_write);
1579 req->socket = ptr_for_keep_write.release();
1580 if (bthread_start_background(&th, &BTHREAD_ATTR_NORMAL,
1581 KeepWrite, req) != 0) {
1583 KeepWrite(req);
1584 }
1585 return 0;
1586
1587 FAIL_TO_WRITE:
1588 // `SetFailed' before `ReturnFailedWriteRequest' (which will calls
1589 // `on_reset' callback inside the id object) so that we immediately
1590 // know this socket has failed inside the `on_reset' callback
1591 ReleaseAllFailedWriteRequests(req);
1592 errno = saved_errno;
1593 return -1;
1594 }
If this thread wins the right to write, it first calls ConnectIfNot (assume the connection already exists; the unconnected path is analyzed later) and, leaving SSL aside, writes once via req->data.cut_into_file_descriptor(fd()), accounting the bytes actually written with AddOutputBytes. It then checks whether the write is complete: since this thread holds the write right, even if its own req is fully written there may be requests from other bthreads that could not write and hung their reqs on the head — the req->next = prev_head code above. Assuming everything has been written:
471 void Socket::ReturnSuccessfulWriteRequest(Socket::WriteRequest* p) {
472 DCHECK(p->data.empty());
473 AddOutputMessages(1);
474 const bthread_id_t id_wait = p->id_wait;
475 butil::return_object(p);
476 if (id_wait != INVALID_BTHREAD_ID) {
477 NotifyOnFailed(id_wait);
478 }
479 }
If the write is not complete, here is IsWriteComplete:
1024 // Check if there're new requests appended.
1025 // If yes, point old_head to to reversed new requests and return false;
1026 // If no:
1027 // old_head is fully written, set _write_head to NULL and return true;
1028 // old_head is not written yet, keep _write_head unchanged and return false;
1029 // `old_head' is last new_head got from this function or (in another word)
1030 // tail of current writing list.
1031 // `singular_node' is true iff `old_head' is the only node in its list.
1032 bool Socket::IsWriteComplete(Socket::WriteRequest* old_head,
1033 bool singular_node,
1034 Socket::WriteRequest** new_tail) {
1035 CHECK(NULL == old_head->next);
1036 // Try to set _write_head to NULL to mark that the write is done.
1037 WriteRequest* new_head = old_head;
1038 WriteRequest* desired = NULL;
1039 bool return_when_no_more = true;
1040 if (!old_head->data.empty() || !singular_node) {
1041 desired = old_head;
1042 // Write is obviously not complete if old_head is not fully written.
1043 return_when_no_more = false;
1044 }
1045 if (_write_head.compare_exchange_strong(
1046 new_head, desired, butil::memory_order_acquire)) {
1047 // No one added new requests.
1048 if (new_tail) {
1049 *new_tail = old_head;
1050 }
1051 return return_when_no_more;
1052 }
If old_head->data.empty() is false, the write is obviously not complete. The CAS on _write_head then tells whether anyone appended new requests: if not, the function returns (true only when old_head is fully written and is the only node, i.e. the write right can be released); otherwise new requests arrived and the write list has to be reversed:
1057 // Someone added new requests.
1058 // Reverse the list until old_head.
1059 WriteRequest* tail = NULL;
1060 WriteRequest* p = new_head;
1061 do {
1062 while (p->next == WriteRequest::UNCONNECTED) {
1063 // TODO(gejun): elaborate this
1064 sched_yield();
1065 }
1066 WriteRequest* const saved_next = p->next;
1067 p->next = tail;
1068 tail = p;
1069 p = saved_next;
1070 CHECK(p != NULL);
1071 } while (p != old_head);
1072
1073 // Link old list with new list.
1074 old_head->next = tail;
1075 // Call Setup() from oldest to newest, notice that the calling sequence
1076 // matters for protocols using pipelined_count, this is why we don't
1077 // calling Setup in above loop which is from newest to oldest.
1078 for (WriteRequest* q = tail; q; q = q->next) {
1079 q->Setup(this);
1080 }
1081 if (new_tail) {
1082 *new_tail = new_head;
1083 }
1084 return false;
The logic is fairly simple: reverse the singly linked list by head insertion, turning the original a->b->c->old_head into old_head->c->b->a. Note this part:
1062 while (p->next == WriteRequest::UNCONNECTED) {
1063 // TODO(gejun): elaborate this
1064 sched_yield();
1065 }
“It may be blocked by a node whose next is still UNCONNECTED (this requires the writing thread to be switched out by the OS right after the atomic exchange and before setting the next pointer — a window of just one instruction), but this rarely happens in practice.”
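Stepping back, the write path is essentially a wait-free multi-producer single-consumer queue built on a single atomic head pointer. The sketch below is my own condensed restatement of the mechanism with illustrative names (it omits the case where old_head itself is not fully written and the Setup() ordering), not brpc's code:

#include <atomic>
#include <sched.h>

struct Node {
    Node* next;
    // payload omitted
};

// Sentinel meaning "next not linked yet", like WriteRequest::UNCONNECTED.
static Node* const UNCONNECTED = reinterpret_cast<Node*>(-1);
static std::atomic<Node*> write_head{nullptr};

// Producers: one atomic exchange; whoever sees NULL as the previous head
// becomes the single writer of the fd.
bool push_and_try_own(Node* req) {
    req->next = UNCONNECTED;
    Node* prev = write_head.exchange(req, std::memory_order_release);
    if (prev != nullptr) {
        req->next = prev;      // hand the request over to the current writer
        return false;
    }
    req->next = nullptr;
    return true;               // we own the fd: write once or start KeepWrite
}

// Single writer: if nothing was pushed after old_head, release ownership by
// CAS-ing the head back to NULL; otherwise reverse the newly pushed nodes so
// they can be written oldest-first (like Socket::IsWriteComplete).
Node* grab_more_or_release(Node* old_head) {
    Node* expected = old_head;
    if (write_head.compare_exchange_strong(expected, nullptr,
                                           std::memory_order_acquire)) {
        return nullptr;        // nothing more to write
    }
    Node* tail = nullptr;
    Node* p = expected;        // expected now holds the real (newest) head
    do {
        while (p->next == UNCONNECTED) {
            sched_yield();     // producer preempted between exchange and linking
        }
        Node* const saved = p->next;
        p->next = tail;
        tail = p;
        p = saved;
    } while (p != old_head);
    old_head->next = tail;
    return expected;           // new tail of the writing list
}

A producer only ever executes one atomic exchange plus one pointer store, so contended writers never block; only the current owner (or the KeepWrite bthread) touches the fd.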
Next, a bthread is started to continue writing in the background, KeepWrite:
1598 void* Socket::KeepWrite(void* void_arg) {
1599 g_vars->nkeepwrite << 1;
1600     WriteRequest* req = static_cast<WriteRequest*>(void_arg);
1601 SocketUniquePtr s(req->socket);
1602
1603 // When error occurs, spin until there's no more requests instead of
1604 // returning directly otherwise _write_head is permantly non-NULL which
1605 // makes later Write() abnormal.
1606 WriteRequest* cur_tail = NULL;
1607 do {
1608 // req was written, skip it.
1609 if (req->next != NULL && req->data.empty()) {
1610 WriteRequest* const saved_req = req;
1611 req = req->next;
1612 s->ReturnSuccessfulWriteRequest(saved_req);
1613 }
1614 const ssize_t nw = s->DoWrite(req);
1615 if (nw < 0) {
1616 //error...
1623 } else {
1624 s->AddOutputBytes(nw);
1625 }
The loop above first skips requests that have already been fully written, returning each via ReturnSuccessfulWriteRequest, and then calls DoWrite on the current one:
1674 ssize_t Socket::DoWrite(WriteRequest* req) {
1675 // Group butil::IOBuf in the list into a batch array.
1676 butil::IOBuf* data_list[DATA_LIST_MAX];
1677 size_t ndata = 0;
1678 for (WriteRequest* p = req; p != NULL && ndata < DATA_LIST_MAX;
1679 p = p->next) {
1680 data_list[ndata++] = &p->data;
1681 }
1682
1683 if (ssl_state() == SSL_OFF) {
1684 // Write IOBuf in the batch array into the fd.
1685 if (_conn) {
1686 return _conn->CutMessageIntoFileDescriptor(fd(), data_list, ndata);
1687 } else {
1688 ssize_t nw = butil::IOBuf::cut_multiple_into_file_descriptor(
1689 fd(), data_list, ndata);
1690 return nw;
1691 }
1692 }
1729 }
DoWrite writes in batches of up to DATA_LIST_MAX IOBufs, ultimately via writev. Back in KeepWrite:
1626 // Release WriteRequest until non-empty data or last request.
1627 while (req->next != NULL && req->data.empty()) {
1628 WriteRequest* const saved_req = req;
1629 req = req->next;
1630 s->ReturnSuccessfulWriteRequest(saved_req);
1631 }
1632 // TODO(gejun): wait for epollout when we actually have written
1633 // all the data. This weird heuristic reduces 30us delay...
1634 // Update(12/22/2015): seem not working. better switch to correct code.
1635 // Update(1/8/2016, r31823): Still working.
1636 // Update(8/15/2017): Not working, performance downgraded.
1637 //if (nw <= 0 || req->data.empty()/*note*/) {
1638 if (nw <= 0) {
1639 g_vars->nwaitepollout << 1;
1640 bool pollin = (s->_on_edge_triggered_events != NULL);
1641 // NOTE: Waiting epollout within timeout is a must to force
1642 // KeepWrite to check and setup pending WriteRequests periodically,
1643 // which may turn on _overcrowded to stop pending requests from
1644 // growing infinitely.
1645 const timespec duetime =
1646 butil::milliseconds_from_now(WAIT_EPOLLOUT_TIMEOUT_MS);
1647 const int rc = s->WaitEpollOut(s->fd(), pollin, &duetime);
1648 if (rc < 0 && errno != ETIMEDOUT) {
1649 //error...
1653 break;
1654 }
1655 }
1656 if (NULL == cur_tail) {
1657 for (cur_tail = req; cur_tail->next != NULL;
1658 cur_tail = cur_tail->next);
1659 }
1660 // Return when there's no more WriteRequests and req is completely
1661 // written.
1662 if (s->IsWriteComplete(cur_tail, (req == cur_tail), &cur_tail)) {
1663 CHECK_EQ(cur_tail, req);
1664 s->ReturnSuccessfulWriteRequest(req);
1665 return NULL;
1666 }
1667 } while (1);
1668
1669 // Error occurred, release all requests until no new requests.
1670 s->ReleaseAllFailedWriteRequests(req);
1671 return NULL;
1672 }
Because writes are batched, fully written requests are removed from the list via ReturnSuccessfulWriteRequest; if nothing could be written, KeepWrite calls WaitEpollOut and waits up to a timeout:
1087 int Socket::WaitEpollOut(int fd, bool pollin, const timespec* abstime) {
1088 if (!ValidFileDescriptor(fd)) {
1089 return 0;
1090 }
1091 // Do not need to check addressable since it will be called by
1092 // health checker which called `SetFailed' before
1093 const int expected_val = _epollout_butex->load(butil::memory_order_relaxed);
1094 EventDispatcher& edisp = GetGlobalEventDispatcher(fd);
1095 if (edisp.AddEpollOut(id(), fd, pollin) != 0) {
1096 return -1;
1097 }
1098
1099 int rc = bthread::butex_wait(_epollout_butex, expected_val, abstime);
1100 const int saved_errno = errno;
1101 if (rc < 0 && errno == EWOULDBLOCK) {
1102 // Could be writable or spurious wakeup
1103 rc = 0;
1104 }
1105 // Ignore return value since `fd' might have been removed
1106 // by `RemoveConsumer' in `SetFailed'
1107 butil::ignore_result(edisp.RemoveEpollOut(id(), fd, pollin));
1108 errno = saved_errno;
1109 // Could be writable or spurious wakeup (by former epollout)
1110 return rc;
1111 }
151 int EventDispatcher::AddEpollOut(SocketId socket_id, int fd, bool pollin) {
152 if (_epfd < 0) {
153 errno = EINVAL;
154 return -1;
155 }
158 epoll_event evt;
159 evt.data.u64 = socket_id;
160 evt.events = EPOLLOUT | EPOLLET;
164 if (pollin) {
165 evt.events |= EPOLLIN;
166 if (epoll_ctl(_epfd, EPOLL_CTL_MOD, fd, &evt) < 0) {
167 // This fd has been removed from epoll via `RemoveConsumer',
168 // in which case errno will be ENOENT
169 return -1;
170 }
171 } else {
172 if (epoll_ctl(_epfd, EPOLL_CTL_ADD, fd, &evt) < 0) {
173 return -1;
174 }
175 }
192 return 0;
193 }
The implementation is straightforward: register interest in EPOLLOUT, butex_wait on _epollout_butex, RemoveEpollOut on the way out, and return to the KeepWrite logic above.
Back to ConnectIfNot: if the socket is not connected yet, it calls Connect and associates the KeepWriteIfConnected callback:
1236 int Socket::ConnectIfNot(const timespec* abstime, WriteRequest* req) {
1237 if (_fd.load(butil::memory_order_consume) >= 0) {
1238 return 0;
1239 }
1240
1241 // Have to hold a reference for `req'
1242 SocketUniquePtr s;
1243 ReAddress(&s);
1244 req->socket = s.get();
1245 if (_conn) {
1246 //
1249 } else {
1250 if (Connect(abstime, KeepWriteIfConnected, req) < 0) {
1251 return -1;
1252 }
1253 }
1254 s.release();
1255 return 1;
1256 }
1113 int Socket::Connect(const timespec* abstime,
1114 int (*on_connect)(int, int, void*), void* data) {
1134 const int rc = ::connect(
1135 sockfd, (struct sockaddr*)&serv_addr, sizeof(serv_addr));
1136 if (rc != 0 && errno != EINPROGRESS) {
1138 return -1;
1139 }
1140 if (on_connect) {
1141 EpollOutRequest* req = new(std::nothrow) EpollOutRequest;
1142 if (req == NULL) {
1144 return -1;
1145 }
1146 req->fd = sockfd;
1147 req->timer_id = 0;
1148 req->on_epollout_event = on_connect;
1149 req->data = data;
1150 // A temporary Socket to hold `EpollOutRequest', which will
1151 // be added into epoll device soon
1152 SocketId connect_id;
1153 SocketOptions options;
1154 options.user = req;
1155 if (Socket::Create(options, &connect_id) != 0) {
1157 delete req;
1158 return -1;
1159 }
1160 // From now on, ownership of `req' has been transferred to
1161 // `connect_id'. We hold an additional reference here to
1162 // ensure `req' to be valid in this scope
1163 SocketUniquePtr s;
1164 CHECK_EQ(0, Socket::Address(connect_id, &s));
1165
1166 // Add `sockfd' into epoll so that `HandleEpollOutRequest' will
1167 // be called with `req' when epoll event reaches
1168 if (GetGlobalEventDispatcher(sockfd).
1169 AddEpollOut(connect_id, sockfd, false) != 0) {
1170 //
1174 return -1;
1175 }
1177 // Register a timer for EpollOutRequest. Note that the timeout
1178 // callback has no race with the one above as both of them try
1179 // to `SetFailed' `connect_id' while only one of them can succeed
1180 // It also work when `HandleEpollOutRequest' has already been
1181 // called before adding the timer since it will be removed
1182 // inside destructor of `EpollOutRequest' after leaving this scope
1183 if (abstime) {
1184 int rc = bthread_timer_add(&req->timer_id, *abstime,
1185 HandleEpollOutTimeout,
1186 (void*)connect_id);
1187 //
1192 }
1193
1194 } else {
1195 //
1202 }
1203 return sockfd.release();
1204 }
Connect roughly creates a socket and connects it to the target address. If on_connect is given, it news an EpollOutRequest, attaches it as the user data of a temporary Socket object, registers EPOLLOUT via AddEpollOut, and adds a timer whose timeout handler is HandleEpollOutTimeout:
1281 void Socket::HandleEpollOutTimeout(void* arg) {
1282 SocketId id = (SocketId)arg;
1283 SocketUniquePtr s;
1284 if (Socket::Address(id, &s) != 0) {
1285 return;
1286 }
1287     EpollOutRequest* req = dynamic_cast<EpollOutRequest*>(s->user());
1288 if (req == NULL) {
1290 return;
1291 }
1292 s->HandleEpollOutRequest(ETIMEDOUT, req);
1293 }
1294
1295 int Socket::HandleEpollOutRequest(int error_code, EpollOutRequest* req) {
1296 // Only one thread can `SetFailed' this `Socket' successfully
1297 // Also after this `req' will be destroyed when its reference
1298 // hits zero
1299 if (SetFailed() != 0) {
1300 return -1;
1301 }
1302 // We've got the right to call user callback
1303 // The timer will be removed inside destructor of EpollOutRequest
1304 GetGlobalEventDispatcher(req->fd).RemoveEpollOut(id(), req->fd, false);
1305 return req->on_epollout_event(req->fd, error_code, req->data);
1306 }
When the connection succeeds, i.e. a writable event fires, the chain HandleEpollOut -> HandleEpollOutRequest -> KeepWriteIfConnected runs:
1351 int Socket::KeepWriteIfConnected(int fd, int err, void* data) {
1352     WriteRequest* req = static_cast<WriteRequest*>(data);
1353 Socket* s = req->socket;
1354 if (err == 0 && s->ssl_state() == SSL_CONNECTING) {
1355 //more code..
1367 }
1368 CheckConnectedAndKeepWrite(fd, err, data);
1369 return 0;
1370 }
1372 void Socket::CheckConnectedAndKeepWrite(int fd, int err, void* data) {
1373 butil::fd_guard sockfd(fd);
1374     WriteRequest* req = static_cast<WriteRequest*>(data);
1375 Socket* s = req->socket;
1376 CHECK_GE(sockfd, 0);
1377 if (err == 0 && s->CheckConnected(sockfd) == 0
1378 && s->ResetFileDescriptor(sockfd) == 0) {
1379 if (s->_app_connect) {
1380 //
1381 } else {
1382 // Successfully created a connection
1383 AfterAppConnected(0, req);
1384 }
1385 // Release this socket for KeepWrite
1386 sockfd.release();
1387 } else {
1388 //
1392 }
1393 }
1308 void Socket::AfterAppConnected(int err, void* data) {
1309     WriteRequest* req = static_cast<WriteRequest*>(data);
1310 if (err == 0) {
1316 // requests are not setup yet. check the comment on Setup() in Write()
1317 req->Setup(s);
1318 bthread_t th;
1319 if (bthread_start_background(
1320 &th, &BTHREAD_ATTR_NORMAL, KeepWrite, req) != 0) {
1322 KeepWrite(req);
1323 }
1324 } else {
1325 //more code...
1342 }
1343 }
So eventually it still ends up in KeepWrite. Back to Socket::HandleEpollOut at the beginning: for sockets with no EpollOutRequest attached, it simply wakes up whoever is waiting for the epollout event on _epollout_butex via butex_wake_except. butex_wake_except is also triggered when a writable event fires or the socket is being closed (see the earlier bthread analysis).
The design idea, as stated in the comment above Socket::Write:
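In CheckConnectedAndKeepWrite above, CheckConnected verifies that the asynchronous connect actually succeeded once the fd becomes writable. The usual way to do this — and presumably what CheckConnected amounts to, though this is a sketch rather than brpc's code — is to read the pending error with SO_ERROR:

#include <errno.h>
#include <sys/socket.h>

// Returns 0 if the non-blocking connect() completed successfully, -1 otherwise
// (with errno set to the deferred connect error).
int check_connected_sketch(int sockfd) {
    int err = 0;
    socklen_t len = sizeof(err);
    if (getsockopt(sockfd, SOL_SOCKET, SO_ERROR, &err, &len) != 0) {
        return -1;   // getsockopt itself failed
    }
    if (err != 0) {
        errno = err; // connect failed asynchronously
        return -1;
    }
    return 0;        // connection is established
}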
229 // Write `msg' into this Socket and clear it. The `msg' should be an
230 // intact request or response. To prevent messages from interleaving
231 // with other messages, the internal file descriptor is written by one
232 // thread at any time. Namely when only one thread tries to write, the
233 // message is written once directly in the calling thread. If the message
234 // is not completely written, a KeepWrite thread is created to continue
235 // the writing. When other threads want to write simultaneously (thread
236 // contention), they append WriteRequests to the KeepWrite thread in a
237 // wait-free manner rather than writing to the file descriptor directly.
238 // KeepWrite will not quit until all WriteRequests are complete.
239 // Key properties:
240 // - all threads have similar opportunities to write, no one is starved.
241 // - Write once when uncontended(most cases).
242 // - Wait-free when contended.
Socket is complex and many details have not been covered thoroughly here; the design deserves further thought, and IOBuf will be analyzed next.