I have been analyzing brpc's network implementation on and off lately, as mentioned before. The framework I use at work handles all receive/send logic in a single network thread, which only forwards data and does nothing else. After a production issue caused noticeable stalls, I suspected that when network traffic is heavy, or when one socket has too much data to write, other sockets may get starved, so clients fail to receive their responses or broadcast messages in time — for example, several monsters freezing in place for a few seconds. Client requests are small, so the server is unlikely to be stuck reading, but continuously writing to one socket can delay the read/write events of every other socket on the same thread, which is what prompted a closer look.
The open-source frameworks I have analyzed target different workloads, so their implementations and the problems they run into differ. Redis is single-threaded — network and business logic share one thread — yet performs very well. skynet, analyzed earlier, is multi-threaded underneath but has a single network thread that handles the read/write events of all socket connections. phxrpc is multi-threaded: an accept thread checks the load of the worker units for each new connection and hands it to the least-loaded one; each unit consists of several worker threads plus one network thread running an event loop. Connections are balanced across units, but each unit still has essentially one network thread, so if one unit is busy while others sit idle, the pending sockets in the busy unit are still held up by whichever socket that unit is currently reading or writing.
Then comes brpc, which brings a different implementation. Quoting the original documentation:
“brpc uses one or several EventDispatchers (EDISP for short) to wait for events on any fd. Unlike the common 'IO thread' model, EDISP is not responsible for reading. The problem with IO threads is that one thread can only read one fd at a time; when several busy fds gather in one IO thread, some reads get delayed. Multi-tenancy, complex load-splitting algorithms, Streaming RPC and similar features aggravate the problem. An occasional slow read under high load drags down the reads of all fds in that IO thread, which hurts availability considerably.”
Next is an overall walkthrough of the socket-related implementation. The Socket class is fairly complex, so some details are skipped and only the key parts are analyzed, in simplified form.
When the server starts it calls StartInternal, which creates the listening socket used to accept connections:
967 if (_am == NULL) {
968 _am = BuildAcceptor();
969 //check...
973 }
974 //more code...
981 // Pass ownership of `sockfd' to `_am'
982 if (_am->StartAccept(sockfd, _options.idle_timeout_sec,
983 _default_ssl_ctx) != 0) {
985 return -1;
986 }
As analyzed in the earlier post [brpc之消息处理流程], BuildAcceptor initializes the handlers for each protocol.
35 // Accept connections from a specific port and then
36 // process messages from which it reads
37 class Acceptor : public InputMessenger {
38 public:
39     typedef butil::FlatMap<SocketId, ConnectStatistics> SocketMap;
40
41 enum Status {
42 UNINITIALIZED = 0,
43 READY = 1,
44 RUNNING = 2,
45 STOPPING = 3,
46 };
47
48 public:
83 private:
84 // Accept connections.
85 static void OnNewConnectionsUntilEAGAIN(Socket* m);
86 static void OnNewConnections(Socket* m);
87
96 bthread_keytable_pool_t* _keytable_pool; // owned by Server
99 bthread_t _close_idle_tid;
107
108 // The map containing all the accepted sockets
109 SocketMap _socket_map;
112 };
Acceptor inherits from InputMessenger and manages all accepted connections. Next, StartAccept:
50 int Acceptor::StartAccept(int listened_fd, int idle_timeout_sec,
51                           const std::shared_ptr<SocketSSLContext>& ssl_ctx) {
52 //more code...
69 if (idle_timeout_sec > 0) {
70 if (bthread_start_background(&_close_idle_tid, NULL,
71 CloseIdleConnections, this) != 0) {
73 return -1;
74 }
75 }
79 // Creation of _acception_id is inside lock so that OnNewConnections
80 // (which may run immediately) should see sane fields set below.
81 SocketOptions options;
82 options.fd = listened_fd;
83 options.user = this;
84 options.on_edge_triggered_events = OnNewConnections;
85 if (Socket::Create(options, &_acception_id) != 0) {
86 // Close-idle-socket thread will be stopped inside destructor
88 return -1;
89 }
90
91 _listened_fd = listened_fd;
92 _status = RUNNING;
93 return 0;
94 }
StartAccept runs a bthread in the background executing CloseIdleConnections, which wakes up periodically and checks every live connection, closing those that have not sent or received data for a configured time — a fairly simple job. The member void (*on_edge_triggered_events)(Socket*) is the callback invoked when an event arrives on the fd; for the listening socket it is OnNewConnections, analyzed later. Next, creating the Socket object:
585 // SocketId = 32-bit version + 32-bit slot.
586 // version: from version part of _versioned_nref, must be an EVEN number.
587 // slot: designated by ResourcePool.
588 int Socket::Create(const SocketOptions& options, SocketId* id) {
589     butil::ResourceId<Socket> slot;
590 Socket* const m = butil::get_resource(&slot, Forbidden());
591 if (m == NULL) {
593 return -1;
594 }
601 m->_on_edge_triggered_events = options.on_edge_triggered_events;
609 m->_this_id = MakeSocketId(
610 VersionOfVRef(m->_versioned_ref.fetch_add(
611 1, butil::memory_order_release)), slot);
612 m->_preferred_index = -1;
652 CHECK(NULL == m->_write_head.load(butil::memory_order_relaxed));
653 // Must be last one! Internal fields of this Socket may be access
654 // just after calling ResetFileDescriptor.
655 if (m->ResetFileDescriptor(options.fd) != 0) {
656 //error...
661 }
662 *id = m->_this_id;
663 return 0;
664 }
The listing above keeps only the operations on the key data members; the rest matters too but is not the focus here. In socket.h the declaration of class Socket runs over 640 lines including comments, so it is not pasted in full — relevant parts are quoted as the corresponding operations come up. Next, ResetFileDescriptor:
522 int Socket::ResetFileDescriptor(int fd) {
523 // Reset message sizes when fd is changed.
524 _last_msg_size = 0;
525 _avg_msg_size = 0;
526 // MUST store `_fd' before adding itself into epoll device to avoid
527 // race conditions with the callback function inside epoll
528 _fd.store(fd, butil::memory_order_release);
543 // Make the fd non-blocking.
544 if (butil::make_non_blocking(fd) != 0) {
546 return -1;
547 }
550 butil::make_no_delay(fd);
574 if (_on_edge_triggered_events) {
575 if (GetGlobalEventDispatcher(fd).AddConsumer(id(), fd) != 0) {
576 //error...
580 }
581 }
582 return 0;
583 }
ResetFileDescriptor makes the fd non-blocking, enables TCP_NODELAY, sets the fd's SO_SNDBUF and SO_RCVBUF sizes, and finally adds the fd into epoll to wait for events. EventDispatcher is the wrapper around the event dispatcher and is analyzed next.
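For reference, butil::make_non_blocking and butil::make_no_delay are presumably thin wrappers over the standard fcntl/setsockopt calls; a minimal sketch of what they amount to (my own sketch, not brpc's actual code):

#include <fcntl.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Roughly what butil::make_non_blocking presumably does: add O_NONBLOCK to the fd flags.
int make_non_blocking_sketch(int fd) {
    const int flags = fcntl(fd, F_GETFL, 0);
    return (flags < 0) ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

// Roughly what butil::make_no_delay presumably does: disable Nagle's algorithm.
int make_no_delay_sketch(int fd) {
    int on = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}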
The number of dispatchers is configurable, so there may be multiple EventDispatcher instances:
352 static void StopAndJoinGlobalDispatchers() {
353 for (int i = 0; i < FLAGS_event_dispatcher_num; ++i) {
354 g_edisp[i].Stop();
355 g_edisp[i].Join();
356 }
357 }
358 void InitializeGlobalDispatchers() {
359 g_edisp = new EventDispatcher[FLAGS_event_dispatcher_num];
360 for (int i = 0; i < FLAGS_event_dispatcher_num; ++i) {
361 const bthread_attr_t attr = FLAGS_usercode_in_pthread ?
362 BTHREAD_ATTR_PTHREAD : BTHREAD_ATTR_NORMAL;
363 CHECK_EQ(0, g_edisp[i].Start(&attr));
364 }
365 // This atexit is will be run before g_task_control.stop() because above
366 // Start() initializes g_task_control by creating bthread (to run epoll/kqueue).
367 CHECK_EQ(0, atexit(StopAndJoinGlobalDispatchers));
368 }
369
370 EventDispatcher& GetGlobalEventDispatcher(int fd) {
371 pthread_once(&g_edisp_once, InitializeGlobalDispatchers);
372 if (FLAGS_event_dispatcher_num == 1) {
373 return g_edisp[0];
374 }
375 int index = butil::fmix32(fd) % FLAGS_event_dispatcher_num;
376 return g_edisp[index];
377 }
The code above initializes the dispatchers and picks one by hashing the fd (fmix32(fd) modulo the dispatcher count). EventDispatcher monitors events in edge-triggered (ET) mode:
89 int EventDispatcher::Start(const bthread_attr_t* consumer_thread_attr) {
115 int rc = bthread_start_background(
116 &_tid, &_consumer_thread_attr, RunThis, this);
117 if (rc) {
119 return -1;
120 }
121 return 0;
122 }
273 void* EventDispatcher::RunThis(void* arg) {
274 ((EventDispatcher*)arg)->Run();
275 return NULL;
276 }
278 void EventDispatcher::Run() {
279 while (!_stop) {
282 epoll_event e[32];
290 const int n = epoll_wait(_epfd, e, ARRAY_SIZE(e), -1);
296 if (_stop) {
297 // epoll_ctl/epoll_wait should have some sort of memory fencing
298 // guaranteeing that we(after epoll_wait) see _stop set before
299 // epoll_ctl.
300 break;
301 }
302 if (n < 0) {
303 if (EINTR == errno) {
304 // We've checked _stop, no wake-up will be missed.
305 continue;
306 }
312 break;
313 }
314 for (int i = 0; i < n; ++i) {
316 if (e[i].events & (EPOLLIN | EPOLLERR | EPOLLHUP)) {
321 // We don't care about the return value.
322 Socket::StartInputEvent(e[i].data.u64, e[i].events,
323 _consumer_thread_attr);
324 }
332 }
333 for (int i = 0; i < n; ++i) {
335 if (e[i].events & (EPOLLOUT | EPOLLERR | EPOLLHUP)) {
336 // We don't care about the return value.
337 Socket::HandleEpollOut(e[i].data.u64);
338 }
345 }
346 }
347 }
Run handles IN and OUT (plus error/hangup) events separately, dispatching to Socket::StartInputEvent and Socket::HandleEpollOut respectively. Adding an fd to the dispatcher:
225 int EventDispatcher::AddConsumer(SocketId socket_id, int fd) {
226 if (_epfd < 0) {
227 errno = EINVAL;
228 return -1;
229 }
231 epoll_event evt;
232 evt.events = EPOLLIN | EPOLLET;
233 evt.data.u64 = socket_id;
237 return epoll_ctl(_epfd, EPOLL_CTL_ADD, fd, &evt);
245 }
For the listening socket, a new connection arriving shows up as a readable event, which triggers StartInputEvent:
1924 int Socket::StartInputEvent(SocketId id, uint32_t events,
1925 const bthread_attr_t& thread_attr) {
1926 SocketUniquePtr s;
1927 if (Address(id, &s) < 0) {
1928 return -1;
1929 }
1930 if (NULL == s->_on_edge_triggered_events) {
1931 // Callback can be NULL when receiving error epoll events
1932 // (Added into epoll by `WaitConnected')
1933 return 0;
1934 }
1935 if (s->fd() < 0) {
1941 return -1;
1942 }
1947 // Passing e[i].events causes complex visibility issues and
1948 // requires stronger memory fences, since reading the fd returns
1949 // error as well, we don't pass the events.
1950 if (s->_nevent.fetch_add(1, butil::memory_order_acq_rel) == 0) {
1956 bthread_t tid;
1957 // transfer ownership as well, don't use s anymore!
1958 Socket* const p = s.release();
1959
1960 bthread_attr_t attr = thread_attr;
1961 attr.keytable_pool = p->_keytable_pool;
1962 if (bthread_start_urgent(&tid, &attr, ProcessEvent, p) != 0) {
1964 ProcessEvent(p);
1965 }
1966 }
1967 return 0;
1968 }
This looks up the Socket object by socket_id; if it cannot be addressed, the socket has been removed, which is allowed. It then checks whether a callback is registered and returns if not. The atomic _nevent controls whether a bthread is already handling read events on this socket. The benefit, quoting the documentation:
"EDISP uses the edge-triggered mode. When an event arrives, EDISP atomically adds 1 to a counter, and only starts a bthread to process the data on the fd when the value before the increment was 0. Behind the scenes, EDISP yields its pthread to the newly created bthread so that it gets better cache locality and can read the fd as soon as possible, while the bthread EDISP was running in is stolen by another pthread and keeps running — this is bthread's work-stealing scheduling."
As analyzed earlier, bthread_start_urgent starts a bthread immediately, which runs ProcessEvent:
1017 void* Socket::ProcessEvent(void* arg) {
1018 // the enclosed Socket is valid and free to access inside this function.
1019     SocketUniquePtr s(static_cast<Socket*>(arg));
1020 s->_on_edge_triggered_events(s.get());
1021 return NULL;
1022 }
ProcessEvent invokes the callback, here OnNewConnections:
317 void Acceptor::OnNewConnections(Socket* acception) {
318 int progress = Socket::PROGRESS_INIT;
319 do {
320 OnNewConnectionsUntilEAGAIN(acception);
321 if (acception->Failed()) {
322 return;
323 }
324 } while (acception->MoreReadEvents(&progress));
325 }
243 void Acceptor::OnNewConnectionsUntilEAGAIN(Socket* acception) {
244 while (1) {
245 struct sockaddr in_addr;
246 socklen_t in_len = sizeof(in_addr);
247 butil::fd_guard in_fd(accept(acception->fd(), &in_addr, &in_len));
248 if (in_fd < 0) {
249 // no EINTR because listened fd is non-blocking.
250 if (errno == EAGAIN) {
251 return;
252 }
253 //error...
260 continue;
261 }
270 SocketId socket_id;
271 SocketOptions options;
272 options.keytable_pool = am->_keytable_pool;
273 options.fd = in_fd;
274 options.remote_side = butil::EndPoint(*(sockaddr_in*)&in_addr);
275 options.user = acception->user();
276 options.on_edge_triggered_events = InputMessenger::OnNewMessages;
277 options.initial_ssl_ctx = am->_ssl_ctx;
278 if (Socket::Create(options, &socket_id) != 0) {
280 continue;
281 }
282 in_fd.release(); // transfer ownership to socket_id
284 // There's a funny race condition here. After Socket::Create, messages
285 // from the socket are already handled and a RPC is possibly done
286 // before the socket is added into _socket_map below. This is found in
287 // ChannelTest.skip_parallel in test/brpc_channel_unittest.cpp (running
288 // on machines with few cores) where the _messenger.ConnectionCount()
289 // may surprisingly be 0 even if the RPC is already done.
290
291 SocketUniquePtr sock;
292 if (Socket::AddressFailedAsWell(socket_id, &sock) >= 0) {
293 bool is_running = true;
294 {
295 BAIDU_SCOPED_LOCK(am->_map_mutex);
296 is_running = (am->status() == RUNNING);
297 // Always add this socket into `_socket_map' whether it
298 // has been `SetFailed' or not, whether `Acceptor' is
299 // running or not. Otherwise, `Acceptor::BeforeRecycle'
300 // may be called (inside Socket::OnRecycle) after `Acceptor'
301 // has been destroyed
302 am->_socket_map.insert(socket_id, ConnectStatistics());
303 }
304 if (!is_running) {
305 //error...
310 return;
311 }
312 } // else: The socket has already been destroyed, Don't add its id
313 // into _socket_map
314 }
315 }
A new connection is accepted here and associated with the OnNewMessages callback. The comment and the AddressFailedAsWell code are kept because they are worth thinking about: as the comment says, once Socket::Create registers the fd with epoll, a readable event may fire immediately, be handled by a bthread on another thread, and the whole RPC may finish — and the socket may even be released — before the id is inserted into _socket_map below, so the object referred to by socket_id here may already be invalid. AddressFailedAsWell itself is fairly involved and is analyzed later. The OnNewMessages callback was covered in the earlier post [brpc之消息处理流程] and is not repeated.
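This is also why the code holds a SocketId rather than a raw Socket*: per the comment above Socket::Create, the id packs a version (taken from _versioned_nref) together with the ResourcePool slot, so Address()/AddressFailedAsWell can presumably detect that a slot has been recycled and refuse to hand out a stale object. A rough sketch of the packing (the helper names and the assumption that the version sits in the high 32 bits are mine, not brpc's exact code):

#include <stdint.h>

typedef uint64_t SocketId;

// Sketch of "SocketId = 32-bit version + 32-bit slot".
inline SocketId MakeSocketIdSketch(uint32_t version, uint32_t slot) {
    return (static_cast<uint64_t>(version) << 32) | slot;
}
inline uint32_t VersionOfId(SocketId id) { return static_cast<uint32_t>(id >> 32); }
inline uint32_t SlotOfId(SocketId id)    { return static_cast<uint32_t>(id); }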
Since the accept logic is a while(1) loop on an edge-triggered fd, EAGAIN means there is temporarily no new connection to accept and the function returns. Next:
227 inline bool Socket::MoreReadEvents(int* progress) {
228 // Fail to CAS means that new events arrived.
229 return !_nevent.compare_exchange_strong(
230 *progress, 0, butil::memory_order_release,
231 butil::memory_order_acquire);
232 }
This is needed because StartInputEvent may fire again while OnNewConnections is running: if _nevent still equals *progress (initially 1), it is swapped back to 0 and the loop exits; otherwise the current value is recorded into *progress and accepting continues.
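The pattern in isolation looks roughly like this (a minimal sketch with std::atomic, not brpc's code; StartConsumerBthread and ConsumeEventsUntilEAGAIN are hypothetical stand-ins for bthread_start_urgent(ProcessEvent) and OnNewConnectionsUntilEAGAIN / the message parsers):

#include <atomic>

// Hypothetical hooks, declared only for the sketch.
void StartConsumerBthread();
void ConsumeEventsUntilEAGAIN();

static std::atomic<int> nevent{0};

// Producer side (what StartInputEvent does): only the 0 -> 1 transition
// starts a consumer; later events just bump the counter.
void OnEvent() {
    if (nevent.fetch_add(1, std::memory_order_acq_rel) == 0) {
        StartConsumerBthread();
    }
}

// Consumer side (the OnNewConnections / MoreReadEvents loop): after draining,
// try to swap the counter back to 0; a failed CAS means new events arrived
// (the current count is loaded into `progress'), so drain again.
void Consumer() {
    int progress = 1;  // PROGRESS_INIT: the event that started us
    do {
        ConsumeEventsUntilEAGAIN();
    } while (!nevent.compare_exchange_strong(progress, 0,
                                             std::memory_order_release,
                                             std::memory_order_acquire));
}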
When a writable (or error/hangup) event occurs on a socket:
1258 int Socket::HandleEpollOut(SocketId id) {
1259 SocketUniquePtr s;
1260 // Since Sockets might have been `SetFailed' before they were
1261 // added into epoll, these sockets miss the signal inside
1262 // `SetFailed' and therefore must be signalled here using
1263 // `AddressFailedAsWell' to prevent waiting forever
1264 if (Socket::AddressFailedAsWell(id, &s) < 0) {
1265 // Ignore recycled sockets
1266 return -1;
1267 }
1268
1269     EpollOutRequest* req = dynamic_cast<EpollOutRequest*>(s->user());
1270 if (req != NULL) {
1271 return s->HandleEpollOutRequest(0, req);
1272 }
1273
1274 // Currently `WaitEpollOut' needs `_epollout_butex'
1275 // TODO(jiangrujie): Remove this in the future
1276 s->_epollout_butex->fetch_add(1, butil::memory_order_relaxed);
1277 bthread::butex_wake_except(s->_epollout_butex, 0);
1278 return 0;
1279 }
1295 int Socket::HandleEpollOutRequest(int error_code, EpollOutRequest* req) {
1296 // Only one thread can `SetFailed' this `Socket' successfully
1297 // Also after this `req' will be destroyed when its reference
1298 // hits zero
1299 if (SetFailed() != 0) {
1300 return -1;
1301 }
1302 // We've got the right to call user callback
1303 // The timer will be removed inside destructor of EpollOutRequest
1304 GetGlobalEventDispatcher(req->fd).RemoveEpollOut(id(), req->fd, false);
1305 return req->on_epollout_event(req->fd, error_code, req->data);
1306 }
There is not much to dissect in these few lines on their own; they are tied together below. Let's look specifically at what happens when data has to be written.
Accepted connections are registered with EPOLLIN | EPOLLET only, without EPOLLOUT; write interest is added later by calling AddEpollOut when it is actually needed.
For example, when the server writes back an RPC response, the baidu_std protocol does:
253 Socket::WriteOptions wopt;
254 wopt.ignore_eovercrowded = true;
255 if (sock->Write(&res_buf, &wopt) != 0) {
256 const int errcode = errno;
258 cntl->SetFailed(errcode, "Fail to write into %s",
259 sock->description().c_str());
260 return;
261 }
1432 int Socket::Write(butil::IOBuf* data, const WriteOptions* options_in) {
1433 //more code...
1456     WriteRequest* req = butil::get_object<WriteRequest>();
1457 if (!req) {
1458 return SetError(opt.id_wait, ENOMEM);
1459 }
1460
1461 req->data.swap(*data);
1462 // Set `req->next' to UNCONNECTED so that the KeepWrite thread will
1463 // wait until it points to a valid WriteRequest or NULL.
1464 req->next = WriteRequest::UNCONNECTED;
1465 req->id_wait = opt.id_wait;
1466 req->set_pipelined_count_and_user_message(
1467 opt.pipelined_count, DUMMY_USER_MESSAGE, opt.with_auth);
1468 return StartWrite(req, opt);
1469 }
303 struct BAIDU_CACHELINE_ALIGNMENT Socket::WriteRequest {
304 static WriteRequest* const UNCONNECTED;
305
306 butil::IOBuf data;
307 WriteRequest* next;
308 bthread_id_t id_wait;
309 Socket* socket;
54 };
So at the start of a write, a WriteRequest is allocated, the data to write is swapped into it, and set_pipelined_count_and_user_message is called (its exact purpose is analyzed later). Then StartWrite:
1506 int Socket::StartWrite(WriteRequest* req, const WriteOptions& opt) {
1507 // Release fence makes sure the thread getting request sees *req
1508 WriteRequest* const prev_head =
1509 _write_head.exchange(req, butil::memory_order_release);
1510 if (prev_head != NULL) {
1511 // Someone is writing to the fd. The KeepWrite thread may spin
1512 // until req->next to be non-UNCONNECTED. This process is not
1513 // lock-free, but the duration is so short(1~2 instructions,
1514 // depending on compiler) that the spin rarely occurs in practice
1515 // (I've not seen any spin in highly contended tests).
1516 req->next = prev_head;
1517 return 0;
1518 }
StartWrite atomically exchanges _write_head to point to req. If another bthread is already writing (prev_head != NULL), prev_head is hung on req->next and the call returns — note that the list therefore grows in reverse order. Two threads must never write the same socket simultaneously, otherwise the data would interleave:
1520 int saved_errno = 0;
1521 bthread_t th;
1522 SocketUniquePtr ptr_for_keep_write;
1523 ssize_t nw = 0;
1524
1525 // We've got the right to write.
1526 req->next = NULL;
1527
1528 // Connect to remote_side() if not.
1529 int ret = ConnectIfNot(opt.abstime, req);
1530 //ret >=0
1540 // NOTE: Setup() MUST be called after Connect which may call app_connect,
1541 // which is assumed to run before any SocketMessage.AppendAndDestroySelf()
1542 // in some protocols(namely RTMP).
1543 req->Setup(this);
1544
1545 if (ssl_state() != SSL_OFF) {
1546 // Writing into SSL may block the current bthread, always write
1547 // in the background.
1548 goto KEEPWRITE_IN_BACKGROUND;
1549 }
1551 // Write once in the calling thread. If the write is not complete,
1552 // continue it in KeepWrite thread.
1553 if (_conn) {
1554 //
1556 } else {
1557 nw = req->data.cut_into_file_descriptor(fd());
1558 }
1559 if (nw < 0) {
1560 //error...
1569 } else {
1570 AddOutputBytes(nw);
1571 }
1572 if (IsWriteComplete(req, true, NULL)) {
1573 ReturnSuccessfulWriteRequest(req);
1574 return 0;
1575 }
1577 KEEPWRITE_IN_BACKGROUND:
1578 ReAddress(&ptr_for_keep_write);
1579 req->socket = ptr_for_keep_write.release();
1580 if (bthread_start_background(&th, &BTHREAD_ATTR_NORMAL,
1581 KeepWrite, req) != 0) {
1583 KeepWrite(req);
1584 }
1585 return 0;
1586
1587 FAIL_TO_WRITE:
1588 // `SetFailed' before `ReturnFailedWriteRequest' (which will calls
1589 // `on_reset' callback inside the id object) so that we immediately
1590 // know this socket has failed inside the `on_reset' callback
1591 ReleaseAllFailedWriteRequests(req);
1592 errno = saved_errno;
1593 return -1;
1594 }
If this thread wins the right to write, it first calls ConnectIfNot (assume the connection already exists; the unconnected path is analyzed later) and, leaving SSL aside, writes once via req->data.cut_into_file_descriptor(fd()), accounting the bytes actually written with AddOutputBytes. It then checks whether the write is complete: since this thread holds the write right, even if its own req is fully written there may be requests from other bthreads that could not write and hung their reqs on the head — the req->next = prev_head code above. Assuming everything has been written:
471 void Socket::ReturnSuccessfulWriteRequest(Socket::WriteRequest* p) {
472 DCHECK(p->data.empty());
473 AddOutputMessages(1);
474 const bthread_id_t id_wait = p->id_wait;
475 butil::return_object(p);
476 if (id_wait != INVALID_BTHREAD_ID) {
477 NotifyOnFailed(id_wait);
478 }
479 }
If the write is not complete, here is IsWriteComplete:
1024 // Check if there're new requests appended.
1025 // If yes, point old_head to to reversed new requests and return false;
1026 // If no:
1027 // old_head is fully written, set _write_head to NULL and return true;
1028 // old_head is not written yet, keep _write_head unchanged and return false;
1029 // `old_head' is last new_head got from this function or (in another word)
1030 // tail of current writing list.
1031 // `singular_node' is true iff `old_head' is the only node in its list.
1032 bool Socket::IsWriteComplete(Socket::WriteRequest* old_head,
1033 bool singular_node,
1034 Socket::WriteRequest** new_tail) {
1035 CHECK(NULL == old_head->next);
1036 // Try to set _write_head to NULL to mark that the write is done.
1037 WriteRequest* new_head = old_head;
1038 WriteRequest* desired = NULL;
1039 bool return_when_no_more = true;
1040 if (!old_head->data.empty() || !singular_node) {
1041 desired = old_head;
1042 // Write is obviously not complete if old_head is not fully written.
1043 return_when_no_more = false;
1044 }
1045 if (_write_head.compare_exchange_strong(
1046 new_head, desired, butil::memory_order_acquire)) {
1047 // No one added new requests.
1048 if (new_tail) {
1049 *new_tail = old_head;
1050 }
1051 return return_when_no_more;
1052 }
If old_head->data.empty() is false, the write is obviously not complete. The CAS on _write_head then tells whether anyone appended new requests: if not, the function returns (true only when old_head is fully written and is the only node, i.e. the write right can be released); otherwise new requests arrived and the write list has to be reversed:
1057 // Someone added new requests.
1058 // Reverse the list until old_head.
1059 WriteRequest* tail = NULL;
1060 WriteRequest* p = new_head;
1061 do {
1062 while (p->next == WriteRequest::UNCONNECTED) {
1063 // TODO(gejun): elaborate this
1064 sched_yield();
1065 }
1066 WriteRequest* const saved_next = p->next;
1067 p->next = tail;
1068 tail = p;
1069 p = saved_next;
1070 CHECK(p != NULL);
1071 } while (p != old_head);
1072
1073 // Link old list with new list.
1074 old_head->next = tail;
1075 // Call Setup() from oldest to newest, notice that the calling sequence
1076 // matters for protocols using pipelined_count, this is why we don't
1077 // calling Setup in above loop which is from newest to oldest.
1078 for (WriteRequest* q = tail; q; q = q->next) {
1079 q->Setup(this);
1080 }
1081 if (new_tail) {
1082 *new_tail = new_head;
1083 }
1084 return false;
The logic is fairly simple: reverse the singly linked list by head insertion, turning the original a->b->c->old_head into old_head->c->b->a. Note this part:
1062 while (p->next == WriteRequest::UNCONNECTED) {
1063 // TODO(gejun): elaborate this
1064 sched_yield();
1065 }
“It may be blocked by a node whose next is still UNCONNECTED (this requires the writing thread to be switched out by the OS right after the atomic exchange and before setting the next pointer — a window of just one instruction), but this rarely happens in practice.”
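Stepping back, the write path is essentially a wait-free multi-producer single-consumer queue built on a single atomic head pointer. The sketch below is my own condensed restatement of the mechanism with illustrative names (it omits the case where old_head itself is not fully written and the Setup() ordering), not brpc's code:

#include <atomic>
#include <sched.h>

struct Node {
    Node* next;
    // payload omitted
};

// Sentinel meaning "next not linked yet", like WriteRequest::UNCONNECTED.
static Node* const UNCONNECTED = reinterpret_cast<Node*>(-1);
static std::atomic<Node*> write_head{nullptr};

// Producers: one atomic exchange; whoever sees NULL as the previous head
// becomes the single writer of the fd.
bool push_and_try_own(Node* req) {
    req->next = UNCONNECTED;
    Node* prev = write_head.exchange(req, std::memory_order_release);
    if (prev != nullptr) {
        req->next = prev;      // hand the request over to the current writer
        return false;
    }
    req->next = nullptr;
    return true;               // we own the fd: write once or start KeepWrite
}

// Single writer: if nothing was pushed after old_head, release ownership by
// CAS-ing the head back to NULL; otherwise reverse the newly pushed nodes so
// they can be written oldest-first (like Socket::IsWriteComplete).
Node* grab_more_or_release(Node* old_head) {
    Node* expected = old_head;
    if (write_head.compare_exchange_strong(expected, nullptr,
                                           std::memory_order_acquire)) {
        return nullptr;        // nothing more to write
    }
    Node* tail = nullptr;
    Node* p = expected;        // expected now holds the real (newest) head
    do {
        while (p->next == UNCONNECTED) {
            sched_yield();     // producer preempted between exchange and linking
        }
        Node* const saved = p->next;
        p->next = tail;
        tail = p;
        p = saved;
    } while (p != old_head);
    old_head->next = tail;
    return expected;           // new tail of the writing list
}

A producer only ever executes one atomic exchange plus one pointer store, so contended writers never block; only the current owner (or the KeepWrite bthread) touches the fd.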
Next, a bthread is started to continue writing in the background, KeepWrite:
1598 void* Socket::KeepWrite(void* void_arg) {
1599 g_vars->nkeepwrite << 1;
1600     WriteRequest* req = static_cast<WriteRequest*>(void_arg);
1601 SocketUniquePtr s(req->socket);
1602
1603 // When error occurs, spin until there's no more requests instead of
1604 // returning directly otherwise _write_head is permantly non-NULL which
1605 // makes later Write() abnormal.
1606 WriteRequest* cur_tail = NULL;
1607 do {
1608 // req was written, skip it.
1609 if (req->next != NULL && req->data.empty()) {
1610 WriteRequest* const saved_req = req;
1611 req = req->next;
1612 s->ReturnSuccessfulWriteRequest(saved_req);
1613 }
1614 const ssize_t nw = s->DoWrite(req);
1615 if (nw < 0) {
1616 //error...
1623 } else {
1624 s->AddOutputBytes(nw);
1625 }
The loop above first skips requests that have already been fully written, returning each via ReturnSuccessfulWriteRequest, and then calls DoWrite on the current one:
1674 ssize_t Socket::DoWrite(WriteRequest* req) {
1675 // Group butil::IOBuf in the list into a batch array.
1676 butil::IOBuf* data_list[DATA_LIST_MAX];
1677 size_t ndata = 0;
1678 for (WriteRequest* p = req; p != NULL && ndata < DATA_LIST_MAX;
1679 p = p->next) {
1680 data_list[ndata++] = &p->data;
1681 }
1682
1683 if (ssl_state() == SSL_OFF) {
1684 // Write IOBuf in the batch array into the fd.
1685 if (_conn) {
1686 return _conn->CutMessageIntoFileDescriptor(fd(), data_list, ndata);
1687 } else {
1688 ssize_t nw = butil::IOBuf::cut_multiple_into_file_descriptor(
1689 fd(), data_list, ndata);
1690 return nw;
1691 }
1692 }
1729 }
DoWrite writes in batches of up to DATA_LIST_MAX IOBufs, ultimately via writev. Back in KeepWrite:
1626 // Release WriteRequest until non-empty data or last request.
1627 while (req->next != NULL && req->data.empty()) {
1628 WriteRequest* const saved_req = req;
1629 req = req->next;
1630 s->ReturnSuccessfulWriteRequest(saved_req);
1631 }
1632 // TODO(gejun): wait for epollout when we actually have written
1633 // all the data. This weird heuristic reduces 30us delay...
1634 // Update(12/22/2015): seem not working. better switch to correct code.
1635 // Update(1/8/2016, r31823): Still working.
1636 // Update(8/15/2017): Not working, performance downgraded.
1637 //if (nw <= 0 || req->data.empty()/*note*/) {
1638 if (nw <= 0) {
1639 g_vars->nwaitepollout << 1;
1640 bool pollin = (s->_on_edge_triggered_events != NULL);
1641 // NOTE: Waiting epollout within timeout is a must to force
1642 // KeepWrite to check and setup pending WriteRequests periodically,
1643 // which may turn on _overcrowded to stop pending requests from
1644 // growing infinitely.
1645 const timespec duetime =
1646 butil::milliseconds_from_now(WAIT_EPOLLOUT_TIMEOUT_MS);
1647 const int rc = s->WaitEpollOut(s->fd(), pollin, &duetime);
1648 if (rc < 0 && errno != ETIMEDOUT) {
1649 //error...
1653 break;
1654 }
1655 }
1656 if (NULL == cur_tail) {
1657 for (cur_tail = req; cur_tail->next != NULL;
1658 cur_tail = cur_tail->next);
1659 }
1660 // Return when there's no more WriteRequests and req is completely
1661 // written.
1662 if (s->IsWriteComplete(cur_tail, (req == cur_tail), &cur_tail)) {
1663 CHECK_EQ(cur_tail, req);
1664 s->ReturnSuccessfulWriteRequest(req);
1665 return NULL;
1666 }
1667 } while (1);
1668
1669 // Error occurred, release all requests until no new requests.
1670 s->ReleaseAllFailedWriteRequests(req);
1671 return NULL;
1672 }
Because writes are batched, fully written requests are removed from the list via ReturnSuccessfulWriteRequest; if nothing could be written, KeepWrite calls WaitEpollOut and waits up to a timeout:
1087 int Socket::WaitEpollOut(int fd, bool pollin, const timespec* abstime) {
1088 if (!ValidFileDescriptor(fd)) {
1089 return 0;
1090 }
1091 // Do not need to check addressable since it will be called by
1092 // health checker which called `SetFailed' before
1093 const int expected_val = _epollout_butex->load(butil::memory_order_relaxed);
1094 EventDispatcher& edisp = GetGlobalEventDispatcher(fd);
1095 if (edisp.AddEpollOut(id(), fd, pollin) != 0) {
1096 return -1;
1097 }
1098
1099 int rc = bthread::butex_wait(_epollout_butex, expected_val, abstime);
1100 const int saved_errno = errno;
1101 if (rc < 0 && errno == EWOULDBLOCK) {
1102 // Could be writable or spurious wakeup
1103 rc = 0;
1104 }
1105 // Ignore return value since `fd' might have been removed
1106 // by `RemoveConsumer' in `SetFailed'
1107 butil::ignore_result(edisp.RemoveEpollOut(id(), fd, pollin));
1108 errno = saved_errno;
1109 // Could be writable or spurious wakeup (by former epollout)
1110 return rc;
1111 }
151 int EventDispatcher::AddEpollOut(SocketId socket_id, int fd, bool pollin) {
152 if (_epfd < 0) {
153 errno = EINVAL;
154 return -1;
155 }
158 epoll_event evt;
159 evt.data.u64 = socket_id;
160 evt.events = EPOLLOUT | EPOLLET;
164 if (pollin) {
165 evt.events |= EPOLLIN;
166 if (epoll_ctl(_epfd, EPOLL_CTL_MOD, fd, &evt) < 0) {
167 // This fd has been removed from epoll via `RemoveConsumer',
168 // in which case errno will be ENOENT
169 return -1;
170 }
171 } else {
172 if (epoll_ctl(_epfd, EPOLL_CTL_ADD, fd, &evt) < 0) {
173 return -1;
174 }
175 }
192 return 0;
193 }
The implementation is straightforward: register interest in EPOLLOUT, butex_wait on _epollout_butex, RemoveEpollOut on the way out, and return to the KeepWrite logic above.
Back to ConnectIfNot: if the socket is not connected yet, it calls Connect and associates the KeepWriteIfConnected callback:
1236 int Socket::ConnectIfNot(const timespec* abstime, WriteRequest* req) {
1237 if (_fd.load(butil::memory_order_consume) >= 0) {
1238 return 0;
1239 }
1240
1241 // Have to hold a reference for `req'
1242 SocketUniquePtr s;
1243 ReAddress(&s);
1244 req->socket = s.get();
1245 if (_conn) {
1246 //
1249 } else {
1250 if (Connect(abstime, KeepWriteIfConnected, req) < 0) {
1251 return -1;
1252 }
1253 }
1254 s.release();
1255 return 1;
1256 }
1113 int Socket::Connect(const timespec* abstime,
1114 int (*on_connect)(int, int, void*), void* data) {
1134 const int rc = ::connect(
1135 sockfd, (struct sockaddr*)&serv_addr, sizeof(serv_addr));
1136 if (rc != 0 && errno != EINPROGRESS) {
1138 return -1;
1139 }
1140 if (on_connect) {
1141 EpollOutRequest* req = new(std::nothrow) EpollOutRequest;
1142 if (req == NULL) {
1144 return -1;
1145 }
1146 req->fd = sockfd;
1147 req->timer_id = 0;
1148 req->on_epollout_event = on_connect;
1149 req->data = data;
1150 // A temporary Socket to hold `EpollOutRequest', which will
1151 // be added into epoll device soon
1152 SocketId connect_id;
1153 SocketOptions options;
1154 options.user = req;
1155 if (Socket::Create(options, &connect_id) != 0) {
1157 delete req;
1158 return -1;
1159 }
1160 // From now on, ownership of `req' has been transferred to
1161 // `connect_id'. We hold an additional reference here to
1162 // ensure `req' to be valid in this scope
1163 SocketUniquePtr s;
1164 CHECK_EQ(0, Socket::Address(connect_id, &s));
1165
1166 // Add `sockfd' into epoll so that `HandleEpollOutRequest' will
1167 // be called with `req' when epoll event reaches
1168 if (GetGlobalEventDispatcher(sockfd).
1169 AddEpollOut(connect_id, sockfd, false) != 0) {
1170 //
1174 return -1;
1175 }
1177 // Register a timer for EpollOutRequest. Note that the timeout
1178 // callback has no race with the one above as both of them try
1179 // to `SetFailed' `connect_id' while only one of them can succeed
1180 // It also work when `HandleEpollOutRequest' has already been
1181 // called before adding the timer since it will be removed
1182 // inside destructor of `EpollOutRequest' after leaving this scope
1183 if (abstime) {
1184 int rc = bthread_timer_add(&req->timer_id, *abstime,
1185 HandleEpollOutTimeout,
1186 (void*)connect_id);
1187 //
1192 }
1193
1194 } else {
1195 //
1202 }
1203 return sockfd.release();
1204 }
Connect roughly creates a socket and connects it to the target address. If on_connect is given, it news an EpollOutRequest, attaches it as the user data of a temporary Socket object, registers EPOLLOUT via AddEpollOut, and adds a timer whose timeout handler is HandleEpollOutTimeout:
1281 void Socket::HandleEpollOutTimeout(void* arg) {
1282 SocketId id = (SocketId)arg;
1283 SocketUniquePtr s;
1284 if (Socket::Address(id, &s) != 0) {
1285 return;
1286 }
1287     EpollOutRequest* req = dynamic_cast<EpollOutRequest*>(s->user());
1288 if (req == NULL) {
1290 return;
1291 }
1292 s->HandleEpollOutRequest(ETIMEDOUT, req);
1293 }
1294
1295 int Socket::HandleEpollOutRequest(int error_code, EpollOutRequest* req) {
1296 // Only one thread can `SetFailed' this `Socket' successfully
1297 // Also after this `req' will be destroyed when its reference
1298 // hits zero
1299 if (SetFailed() != 0) {
1300 return -1;
1301 }
1302 // We've got the right to call user callback
1303 // The timer will be removed inside destructor of EpollOutRequest
1304 GetGlobalEventDispatcher(req->fd).RemoveEpollOut(id(), req->fd, false);
1305 return req->on_epollout_event(req->fd, error_code, req->data);
1306 }
When the connection succeeds, i.e. a writable event fires, the chain HandleEpollOut -> HandleEpollOutRequest -> KeepWriteIfConnected runs:
1351 int Socket::KeepWriteIfConnected(int fd, int err, void* data) {
1352     WriteRequest* req = static_cast<WriteRequest*>(data);
1353 Socket* s = req->socket;
1354 if (err == 0 && s->ssl_state() == SSL_CONNECTING) {
1355 //more code..
1367 }
1368 CheckConnectedAndKeepWrite(fd, err, data);
1369 return 0;
1370 }
1372 void Socket::CheckConnectedAndKeepWrite(int fd, int err, void* data) {
1373 butil::fd_guard sockfd(fd);
1374     WriteRequest* req = static_cast<WriteRequest*>(data);
1375 Socket* s = req->socket;
1376 CHECK_GE(sockfd, 0);
1377 if (err == 0 && s->CheckConnected(sockfd) == 0
1378 && s->ResetFileDescriptor(sockfd) == 0) {
1379 if (s->_app_connect) {
1380 //
1381 } else {
1382 // Successfully created a connection
1383 AfterAppConnected(0, req);
1384 }
1385 // Release this socket for KeepWrite
1386 sockfd.release();
1387 } else {
1388 //
1392 }
1393 }
1308 void Socket::AfterAppConnected(int err, void* data) {
1309     WriteRequest* req = static_cast<WriteRequest*>(data);
1310 if (err == 0) {
1316 // requests are not setup yet. check the comment on Setup() in Write()
1317 req->Setup(s);
1318 bthread_t th;
1319 if (bthread_start_background(
1320 &th, &BTHREAD_ATTR_NORMAL, KeepWrite, req) != 0) {
1322 KeepWrite(req);
1323 }
1324 } else {
1325 //more code...
1342 }
1343 }
So eventually it still ends up in KeepWrite. Back to Socket::HandleEpollOut at the beginning: for sockets with no EpollOutRequest attached, it simply wakes up whoever is waiting for the epollout event on _epollout_butex via butex_wake_except. butex_wake_except is also triggered when a writable event fires or the socket is being closed (see the earlier bthread analysis).
The design idea, as stated in the comment above Socket::Write:
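In CheckConnectedAndKeepWrite above, CheckConnected verifies that the asynchronous connect actually succeeded once the fd becomes writable. The usual way to do this — and presumably what CheckConnected amounts to, though this is a sketch rather than brpc's code — is to read the pending error with SO_ERROR:

#include <errno.h>
#include <sys/socket.h>

// Returns 0 if the non-blocking connect() completed successfully, -1 otherwise
// (with errno set to the deferred connect error).
int check_connected_sketch(int sockfd) {
    int err = 0;
    socklen_t len = sizeof(err);
    if (getsockopt(sockfd, SOL_SOCKET, SO_ERROR, &err, &len) != 0) {
        return -1;   // getsockopt itself failed
    }
    if (err != 0) {
        errno = err; // connect failed asynchronously
        return -1;
    }
    return 0;        // connection is established
}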
229 // Write `msg' into this Socket and clear it. The `msg' should be an
230 // intact request or response. To prevent messages from interleaving
231 // with other messages, the internal file descriptor is written by one
232 // thread at any time. Namely when only one thread tries to write, the
233 // message is written once directly in the calling thread. If the message
234 // is not completely written, a KeepWrite thread is created to continue
235 // the writing. When other threads want to write simultaneously (thread
236 // contention), they append WriteRequests to the KeepWrite thread in a
237 // wait-free manner rather than writing to the file descriptor directly.
238 // KeepWrite will not quit until all WriteRequests are complete.
239 // Key properties:
240 // - all threads have similar opportunities to write, no one is starved.
241 // - Write once when uncontended(most cases).
242 // - Wait-free when contended.
Socket is complex and many details have not been covered thoroughly here; the design deserves further thought, and IOBuf will be analyzed next.