Note: reposted from a colleague's work notes.
The following summarizes the relevant flows and how to handle socket error codes correctly.
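I. Main problems encountered with Socket/Epoll:
(1) On non-blocking sockets, the receive path (recv/recvfrom) treated EINTR/EAGAIN/EWOULDBLOCK as fatal errors, causing frequent disconnects.
(2) On EPOLLERR/EPOLLHUP events, the socket exception handler was called directly, causing frequent disconnects.
(3) A UDP socket that received a datagram of size 0 was handled as an exception, causing the socket to be closed.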
II. Summary of correct socket error-code handling in the main Socket/Epoll flows:
1. send/sendto and recv/recvfrom must not treat EINTR/EAGAIN/EWOULDBLOCK as fatal.
EINTR 4 - ("Interrupted system call")
"The receive was interrupted by delivery of a signal before any data were available"
The send/receive call was interrupted by a signal before any data was transferred; the call can simply be retried.
EAGAIN 11 - ("Try again")
EWOULDBLOCK 11 - ("Resource temporarily unavailable")
"The socket is marked nonblocking and the receive operation would block, or a receive timeout had been set and the timeout expired before data was received. POSIX.1-2001 allows either error to be returned for this case, and does not require these constants to have the same value, so a portable application should check for both possibilities."
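Sending: in non-blocking mode, send/sendto only copies data into the protocol stack's send buffer. When the send buffer has no free space, the call returns -1 with errno set to EAGAIN/EWOULDBLOCK. On a blocking socket with a send timeout (SO_SNDTIMEO) set, the call returns -1 with errno set to EAGAIN/EWOULDBLOCK once the timeout expires.
Receiving: in non-blocking mode, when no data is available, recv/recvfrom does not block until data becomes ready; it returns -1 with errno set to EAGAIN/EWOULDBLOCK, telling the process to try again later. On a blocking socket with a receive timeout (SO_RCVTIMEO) set, the call returns -1 with errno set to EAGAIN/EWOULDBLOCK once the timeout expires.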
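The send and receive rules above can be sketched in C roughly as follows. This is a minimal sketch, assuming a non-blocking stream socket; `io_status`, `set_nonblocking`, `recv_once`, and `send_once` are hypothetical names invented for illustration and do not come from the original notes.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical result codes used only by this sketch. */
enum io_status { IO_OK, IO_RETRY_LATER, IO_PEER_CLOSED, IO_FATAL };

/* Puts fd into non-blocking mode; returns 0 on success, -1 on failure. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* One receive attempt on a non-blocking TCP socket. */
static enum io_status recv_once(int fd, void *buf, size_t cap, size_t *got)
{
    *got = 0;
    for (;;) {
        ssize_t n = recv(fd, buf, cap, 0);
        if (n > 0) {
            *got = (size_t)n;
            return IO_OK;
        }
        if (n == 0)
            return IO_PEER_CLOSED;      /* TCP: the peer closed the connection */
        if (errno == EINTR)
            continue;                   /* interrupted by a signal: retry now */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return IO_RETRY_LATER;      /* no data yet: wait for readability */
        return IO_FATAL;                /* anything else is a real error */
    }
}

/* One send attempt; *sent may be smaller than len (partial write). */
static enum io_status send_once(int fd, const void *buf, size_t len,
                                size_t *sent)
{
    *sent = 0;
    for (;;) {
        ssize_t n = send(fd, buf, len, 0);
        if (n >= 0) {
            *sent = (size_t)n;
            return IO_OK;
        }
        if (errno == EINTR)
            continue;                   /* interrupted by a signal: retry now */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return IO_RETRY_LATER;      /* send buffer full: wait for writability */
        return IO_FATAL;
    }
}
```

Both EAGAIN and EWOULDBLOCK are checked because, as the quoted man page text says, they are not required to have the same value on every platform.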
2. connect must not treat EINTR/EINPROGRESS/EAGAIN as fatal.
EINPROGRESS 115 - ("Operation now in progress")
The socket is nonblocking and the connection cannot be completed immediately. It is possible to select(2) or poll(2) for completion by selecting the socket for writing. After select(2) indicates writability, use getsockopt(2) to read the SO_ERROR option at level SOL_SOCKET to determine whether connect() completed successfully (SO_ERROR is zero) or unsuccessfully (SO_ERROR is one of the usual error codes listed here, explaining the reason for the failure).
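When the client socket is non-blocking, connect returns immediately while the three-way handshake is still in progress, so it returns -1 with errno set to EINPROGRESS. This is not a failure and should be ignored at that point; the final success/failure result is obtained later via getsockopt with SO_ERROR.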
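The EINPROGRESS flow described above might look like this in C. It is a sketch under stated assumptions: the socket is already non-blocking, and `start_connect`, `wait_writable`, `finish_connect`, and the `CONNECT_*` codes are hypothetical names introduced here for illustration. EINTR is treated like EINPROGRESS because the connection attempt continues in the background after the interruption.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/* Hypothetical result codes used only by this sketch. */
enum { CONNECT_FAILED = -1, CONNECT_DONE = 0, CONNECT_PENDING = 1 };

/* Starts a connect on a non-blocking socket. */
static int start_connect(int fd, const struct sockaddr *addr, socklen_t len)
{
    if (connect(fd, addr, len) == 0)
        return CONNECT_DONE;            /* connected immediately */
    if (errno == EINPROGRESS || errno == EINTR)
        return CONNECT_PENDING;         /* handshake continues in background */
    return CONNECT_FAILED;
}

/* Waits until fd is writable.
 * Returns 1 when ready, 0 on timeout, -1 on a poll error.
 * Note: after EINTR this simple version restarts the full timeout. */
static int wait_writable(int fd, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLOUT };
    int rc;
    do {
        rc = poll(&p, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);
    return rc;
}

/* Call after the socket became writable.
 * Returns 0 if the connection succeeded, otherwise the errno-style code. */
static int finish_connect(int fd)
{
    int err = 0;
    socklen_t len = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return errno;                   /* the query itself failed */
    return err;
}
```

In an epoll-based program the `wait_writable` step would typically be replaced by registering the socket for EPOLLOUT and calling `finish_connect` when that event fires.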
EAGAIN 11 - ("Try again")
"No more free local ports or insufficient entries in the routing cache. For AF_INET see the description of /proc/sys/net/ipv4/ip_local_port_range ip(7) for information on how to increase the number of local ports."
Here connect returns -1 with EAGAIN because of a resource shortage; it is advisable to retry several times before reporting a fatal error.
3. accept must not treat EINTR/ECONNABORTED/EPROTO as fatal.
ECONNABORTED 103 - ("Software caused connection abort")
A connection has been aborted.
EPROTO 71 - ("Protocol error")
Protocol error
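Both errors occur when an established TCP connection is aborted by the client before the server has accepted it. Berkeley-derived implementations handle the aborted connection entirely in the kernel, and the server process never sees it. Most SVR4 implementations, however, return an error to the process from accept, and which error is returned depends on the implementation: SVR4 implementations return EPROTO, while POSIX specifies ECONNABORTED.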
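A sketch of an accept wrapper that follows this rule is shown below. The names `accept_errno_is_ignorable`, `accept_one`, and the `ACCEPT_*` codes are hypothetical. The EAGAIN/EWOULDBLOCK branch is an addition not listed in the notes: it covers a non-blocking listening socket with no pending connection.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical return codes for this sketch (real descriptors are >= 0). */
enum { ACCEPT_NONE_PENDING = -1, ACCEPT_FATAL = -2 };

/* Returns nonzero if this accept() failure only affects the single
 * attempt and leaves the listening socket usable. */
static int accept_errno_is_ignorable(int e)
{
    return e == EINTR || e == ECONNABORTED || e == EPROTO;
}

/* Accepts one connection.
 * Returns the new descriptor, ACCEPT_NONE_PENDING, or ACCEPT_FATAL. */
static int accept_one(int listen_fd)
{
    for (;;) {
        int fd = accept(listen_fd, NULL, NULL);
        if (fd >= 0)
            return fd;
        if (accept_errno_is_ignorable(errno))
            continue;                   /* skip the aborted attempt, try again */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return ACCEPT_NONE_PENDING; /* non-blocking listener, queue empty */
        return ACCEPT_FATAL;
    }
}
```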
4. Handling EPOLLERR/EPOLLHUP events and socket errors reported by epoll_wait.
EPOLLERR
Error condition happened on the associated file descriptor. epoll_wait(2) will always wait for this event; it is not necessary to set it in events.
EPOLLHUP
Hang up happened on the associated file descriptor. epoll_wait(2) will always wait for this event; it is not necessary to set it in events.
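Most systems call the error handler directly on EPOLLERR/EPOLLHUP, the read path on EPOLLIN, and the send path on EPOLLOUT; but this has potential problems.
When a socket exception is surfaced by epoll_wait as an EPOLLERR/EPOLLHUP event rather than discovered in the read/write path, the connection is likewise torn down by mistake. So ignoring the relevant error codes only in the read/send paths is not enough: when epoll_wait reports a socket error (EINTR/EAGAIN/EWOULDBLOCK…), it is still treated as a fatal error. (The front-end servers ran into a similar problem.)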
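Recommended approach:
(1) Handle non-fatal errors (EINTR/EAGAIN/EWOULDBLOCK…) properly in the read/send paths.
(2) On EPOLLERR/EPOLLHUP events, there are two options:
(2.1) Do not call the error handler; instead call the read path, as for EPOLLIN, and let the read path confirm and handle the actual error.
(2.2) Obtain the specific error code via getsockopt with SO_ERROR and filter out non-fatal errors.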
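The two options above can be combined in one event classifier, sketched below for Linux. This is only an illustration: `pending_socket_error`, `is_nonfatal_errno`, `classify_events`, and the `ACT_*` flags are hypothetical names, and the set of errors treated as non-fatal simply follows the list in these notes.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>

/* Hypothetical action flags returned by classify_events(). */
enum { ACT_NONE = 0, ACT_READ = 1, ACT_WRITE = 2, ACT_CLOSE = 4 };

/* Reads (and clears) the pending error on a socket; 0 means no error. */
static int pending_socket_error(int fd)
{
    int err = 0;
    socklen_t len = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return errno;                   /* the query itself failed, e.g. EBADF */
    return err;
}

/* Errors that should not tear the connection down. */
static int is_nonfatal_errno(int e)
{
    return e == 0 || e == EINTR || e == EAGAIN || e == EWOULDBLOCK;
}

/* Maps an epoll event mask to the actions the caller should take. */
static int classify_events(int fd, uint32_t events)
{
    int act = ACT_NONE;
    if (events & (EPOLLERR | EPOLLHUP)) {
        /* Option (2.2): look at the real error code first. */
        if (!is_nonfatal_errno(pending_socket_error(fd)))
            return ACT_CLOSE;
        /* Option (2.1): otherwise let the read path decide
         * (it will see EOF or the actual error). */
        act |= ACT_READ;
    }
    if (events & EPOLLIN)
        act |= ACT_READ;
    if (events & EPOLLOUT)
        act |= ACT_WRITE;
    return act;
}
```

Inside an epoll_wait loop, the caller would pass each returned event's descriptor and `events` mask to `classify_events` and then run the read path, the send path, or the close path accordingly.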
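5. A zero-length result from recv/recvfrom must be interpreted according to the socket type.
(1) A TCP recv returning size 0 means the peer has closed the connection, and the corresponding close handling is required.
(2) A UDP recvfrom returning size 0 is normal behavior and must not be treated as an exception, because UDP datagrams with an empty (size 0) payload are allowed.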
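The distinction between the two cases can be written down as a small helper. This is a hedged sketch: `socket_is_stream`, `interpret_recv`, and the `RX_*` codes are hypothetical names, and the helper assumes the caller passes the return value of recv/recvfrom together with the errno observed right after the call.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical result codes used only by this sketch. */
enum rx_result { RX_DATA, RX_EMPTY_DATAGRAM, RX_PEER_CLOSED, RX_RETRY, RX_FATAL };

/* Returns 1 for a SOCK_STREAM socket, 0 for other types, -1 on error. */
static int socket_is_stream(int fd)
{
    int type = 0;
    socklen_t len = sizeof(type);
    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) < 0)
        return -1;
    return type == SOCK_STREAM;
}

/* Interprets the result n of recv()/recvfrom(); err is the errno value
 * captured immediately after the call (only used when n < 0). */
static enum rx_result interpret_recv(int is_stream, ssize_t n, int err)
{
    if (n > 0)
        return RX_DATA;
    if (n == 0)                         /* TCP: EOF; UDP: empty payload */
        return is_stream ? RX_PEER_CLOSED : RX_EMPTY_DATAGRAM;
    if (err == EINTR || err == EAGAIN || err == EWOULDBLOCK)
        return RX_RETRY;
    return RX_FATAL;
}
```

With this split, only `RX_PEER_CLOSED` and `RX_FATAL` lead to closing the socket; `RX_EMPTY_DATAGRAM` is delivered to the application as an ordinary zero-length message.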