TCP Operation
TCP Handshake - SYN, SYN-ACK
During the TCP connection establishment 3-way handshake, Onload negotiates the MSS, window scale, SACK permitted, ECN, PAWS and RTTM timestamps.
For TCP connections, Onload negotiates an MSS appropriate to the MTU configured on the interface. However, when using jumbo frames, Onload currently negotiates an MSS of at most 2048 bytes minus the number of bytes required for packet headers. This is because the buffers passed to the Solarflare network interface card are 2048 bytes in size, and the Onload stack cannot currently handle fragmented packets on its TCP receive path.
The TCP options advertised during the handshake can be selected using the EF_TCP_SYN_OPTS environment variable.
Refer to Parameter Reference on page 210 for details of environment variables.
TCP SYN Cookies
The Onload environment variable EF_TCP_SYNCOOKIES can be enabled on a per-stack basis to force the use of SYN cookies, providing a degree of protection against SYN flood Denial of Service (DoS) attacks. EF_TCP_SYNCOOKIES is disabled by default.
Refer to Parameter Reference on page 210 for details of environment variables.
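As a sketch of how these per-stack variables are typically supplied - in the environment of the process launched under Onload - consider the following; the numeric values are placeholders only, and the exact EF_TCP_SYN_OPTS bitmask should be taken from the Parameter Reference:

```python
import os
import subprocess

# Build an environment enabling SYN cookies and selecting SYN options for
# the Onload stack of the launched process. The values shown here are
# illustrative placeholders - consult the Parameter Reference for your
# Onload version.
env = dict(os.environ,
           EF_TCP_SYN_OPTS="7",    # bitmask of TCP options to advertise
           EF_TCP_SYNCOOKIES="1")  # force SYN cookies on this stack

# Typical launch (commented out here - requires Onload to be installed):
# subprocess.run(["onload", "./tcp_server"], env=env)
```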
TCP Socket Options
Onload TCP supports the following socket options, which can be used in the setsockopt() and getsockopt() function calls.
SO_PROTOCOL - retrieve the socket protocol as an integer.
SO_ACCEPTCONN - determine whether the socket can accept incoming connections - true for listening sockets. (Only valid as a getsockopt().)
SO_BINDTODEVICE - bind this socket to a particular network interface. See SO_BINDTODEVICE on page 83.
SO_CONNECT_TIME - the number of seconds a connection has been established. (Only valid as a getsockopt().)
SO_DEBUG - enable protocol debugging.
SO_ERROR - the errno value of the last error occurring on the socket. (Only valid as a getsockopt().)
SO_EXCLUSIVEADDRUSE - prevent other sockets from using the SO_REUSEADDR option to bind to the same address and port.
SO_KEEPALIVE - enable sending of keep-alive messages on connection-oriented sockets.
SO_LINGER - when enabled, a close() or shutdown() will not return until all queued messages for the socket have been successfully sent or the linger timeout has been reached. Otherwise the close() or shutdown() returns immediately and the socket is closed in the background.
SO_OOBINLINE - indicates that out-of-band data should be returned in-line with regular data. This option is only valid for connection-oriented protocols that support out-of-band data.
SO_PRIORITY - set the priority for all packets sent on this socket. Packets with a higher priority may be processed first, depending on the selected device queuing discipline.
SO_RCVBUF - set or get the maximum socket receive buffer in bytes. Note that EF_TCP_RCVBUF and EF_TCP_RCVBUF_ESTABLISHED_DEFAULT can both override this value. Setting SO_RCVBUF to a value smaller than the MTU can result in poorer performance and is not recommended.
SO_RCVLOWAT - set the minimum number of bytes to process for socket input operations.
SO_RCVTIMEO - set the timeout for blocking receive calls to complete.
SO_REUSEADDR - allow local port numbers to be reused, i.e. another socket can bind to the same port except when there is an active listening socket bound to the port.
SO_REUSEPORT - allow multiple sockets to bind to the same port.
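A short sketch of a few of the options above, using the standard sockets API (shown here via Python's socket module; the same calls apply unchanged to a C program, and run identically under Onload):

```python
import socket
import struct

# Create a TCP socket; under Onload (e.g. via LD_PRELOAD) these same calls
# operate on an accelerated socket.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# SO_REUSEADDR: allow the local port to be reused by another socket.
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

# SO_LINGER: make close() block until queued data is sent or the
# timeout (here 5 seconds) expires. The struct is (onoff, linger).
s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 5))

# SO_ERROR is get-only: returns and clears the pending error (0 = none).
err = s.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)

# SO_ACCEPTCONN is get-only: nonzero once the socket is listening.
s.bind(("127.0.0.1", 0))
s.listen(8)
listening = s.getsockopt(socket.SOL_SOCKET, socket.SO_ACCEPTCONN)
s.close()
```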
TCP Level Options
TCP_CORK - do not send segments smaller than MSS size until the connection is uncorked.
TCP_DEFER_ACCEPT - a connection becomes ESTABLISHED once the handshake is complete, instead of being left in SYN-RECV until the first real data packet arrives. The connection is placed in the accept queue when the first data packet arrives.
TCP_INFO - populate an internal data structure with TCP statistic values.
TCP_KEEPALIVE_ABORT_THRESHOLD - how long to try to produce a successful keepalive before giving up.
TCP_KEEPALIVE_THRESHOLD - specifies the idle time for keepalive timers.
TCP_KEEPCNT - the number of keepalives to send before giving up.
TCP_KEEPIDLE - the idle time before keepalives are sent.
TCP_KEEPINTVL - the time between keepalives.
TCP_MAXSEG - get the MSS size for this connection.
TCP_NODELAY - disable Nagle's algorithm: small segments are sent without delay and without waiting for previous segments to be acknowledged.
TCP_QUICKACK - when enabled, an ACK is sent immediately following reception of the next data packet. This flag is reset to zero after every use, i.e. it is a one-time option. New connections start in a mode where all packets are acknowledged, so this value initially defaults to 1.
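A brief sketch of setting TCP-level options through the standard API (Python's socket module is used here for illustration; the keepalive values are arbitrary examples):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# TCP_NODELAY: disable Nagle's algorithm so small segments go out
# immediately.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Keepalive tuning: enable keepalives, then set idle time, probe
# interval and probe count (values here are illustrative).
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # between probes
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # probes before giving up

# Read back two of the values just set.
nodelay = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
keepidle = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)
s.close()
```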
TCP File Descriptor Control
SOCK_CLOEXEC - supported in socket() and accept(). Sets the close-on-exec (FD_CLOEXEC) flag on the new file descriptor, saving extra calls to fcntl(2) to achieve the same result.
SOCK_NONBLOCK - supported in accept(). Sets the O_NONBLOCK file status flag on the new open file descriptor.
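Both flags can be requested at socket creation, avoiding the separate fcntl(2) round trips; a minimal sketch (Python exposes the same flags as the C API):

```python
import fcntl
import os
import socket

# Request non-blocking and close-on-exec behavior at creation time,
# rather than via two follow-up fcntl() calls.
s = socket.socket(
    socket.AF_INET,
    socket.SOCK_STREAM | socket.SOCK_NONBLOCK | socket.SOCK_CLOEXEC)

# Confirm both flags took effect on the new descriptor.
nonblocking = bool(fcntl.fcntl(s.fileno(), fcntl.F_GETFL) & os.O_NONBLOCK)
cloexec = bool(fcntl.fcntl(s.fileno(), fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
s.close()
```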
TCP Congestion Control
Onload TCP implements congestion control in accordance with RFC 3465, employing the NewReno algorithm with extensions for Appropriate Byte Counting (ABC). On new or idle connections, and those experiencing loss, Onload employs a fast start algorithm in which delayed acknowledgments are disabled, thereby creating more ACKs and so 'growing' the congestion window rapidly. Two environment variables, EF_TCP_FASTSTART_INIT and EF_TCP_FASTSTART_LOSS, are associated with fast start - refer to Parameter Reference on page 210 for details.
During slow start, the congestion window is initially set to 2 x the maximum segment size (MSS). As each ACK is received, the congestion window is increased by the number of bytes acknowledged, up to a maximum of 2 x MSS bytes per ACK. Onload may then transmit the minimum of the congestion window and the receiver's advertised window, i.e.

transmission window (bytes) = min(CWND, receiver advertised window size)
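The growth rule above can be sketched as follows; the values are illustrative and this is not Onload's internal implementation:

```python
MSS = 1460          # maximum segment size (illustrative)
rwnd = 65535        # receiver advertised window (illustrative)
cwnd = 2 * MSS      # initial congestion window during slow start

def on_ack(bytes_acked):
    """Appropriate Byte Counting: grow cwnd by the number of bytes
    acknowledged, capped at 2 x MSS per ACK."""
    global cwnd
    cwnd += min(bytes_acked, 2 * MSS)

# The transmit window is the smaller of cwnd and the advertised window.
tx_window = min(cwnd, rwnd)

on_ack(MSS)         # one full segment acknowledged grows cwnd by MSS
```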
If loss is detected - either by retransmission timeout (RTO) or by the reception of duplicate ACKs - Onload adopts a congestion avoidance algorithm to slow the transmission rate. In congestion avoidance the transmission window is halved from its current size, but will not fall below 2 x MSS. If congestion avoidance was triggered by an RTO timeout, the slow start algorithm is again used to restore the transmit rate. If it was triggered by duplicate ACKs, Onload employs a fast retransmit and fast recovery algorithm. The reception of 3 duplicate ACKs indicates that a segment has been lost - rather than just received out of order - and causes the immediate retransmission of the lost segment (fast retransmit). The continued reception of duplicate ACKs indicates that traffic still flows within the network, so Onload follows fast retransmit with fast recovery.
TCP Loopback Acceleration
Onload supports the acceleration of TCP loopback connections, providing an accelerated mechanism through which two processes on the same host can communicate. Accelerated TCP loopback connections do not invoke system calls, reduce the overheads of read/write operations and offer improved latency over the kernel implementation.
The server and client processes that want to communicate over an accelerated TCP loopback connection do not need to be configured to share an Onload stack. However, the server and client TCP loopback sockets can only be accelerated if they are in the same Onload stack; Onload can move a TCP loopback socket between stacks to achieve this.
TCP loopback acceleration is configured via the environment variables EF_TCP_CLIENT_LOOPBACK and EF_TCP_SERVER_LOOPBACK. As well as enabling TCP loopback acceleration, these environment variables control Onload's behavior when the server and client sockets do not originate in the same Onload stack. This gives the user greater flexibility and control when establishing loopback on TCP sockets, either from the listening (server) socket or from the connecting (client) socket. The connecting socket can use any local address or specify the loopback address.
The client loopback option EF_TCP_CLIENT_LOOPBACK=4, when used with the server loopback option EF_TCP_SERVER_LOOPBACK=2, differs from the other loopback options: rather than moving sockets between existing stacks, Onload creates an additional stack and moves the sockets from both ends of the TCP connection into this new stack. This avoids the possibility of many loopback sockets sharing, and contending for, the resources of a single stack.
When the client and server do not run under the same UID, set the EF_SHARE_WITH environment variable to allow both processes to share the newly created stack.
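A sketch of the environment for such a client/server pair; the loopback option values follow the text above, while the EF_SHARE_WITH value is illustrative and should be checked against the Parameter Reference:

```python
import os
import subprocess

# Environment for processes whose loopback sockets should be moved into a
# new dedicated stack. EF_SHARE_WITH's value here is an assumption - see
# the Parameter Reference for the accepted values on your Onload version.
env = dict(os.environ,
           EF_TCP_CLIENT_LOOPBACK="4",  # client: move both ends to a new stack
           EF_TCP_SERVER_LOOPBACK="2",  # server: paired option for the above
           EF_SHARE_WITH="-1")          # permit stack sharing across UIDs

# Typical launch (commented out - requires Onload to be installed):
# subprocess.run(["onload", "./proxy"], env=env)
```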
Listen/Accept Sockets
TCP sockets accepted from a listening socket share a wildcard filter with the parent socket. The following Onload module options can be used to control the behavior when the parent socket is closed:
oof_shared_keep_thresh (default 100) - the number of accepted sockets sharing a wildcard filter that will cause the filter to persist after the listening socket has closed.
oof_shared_steal_thresh (default 200) - the number of sockets sharing a wildcard filter that will cause the filter to persist even when a new listening socket needs the filter.
If the listening socket is closed, the behavior depends on the number of remaining accepted sockets:
More than oof_shared_keep_thresh but fewer than oof_shared_steal_thresh - the wildcard filter shared by all accepted sockets is retained. If a new listening socket requires the filter, Onload will install a full-match filter for each accepted socket, allowing the listening socket to use the wildcard filter.
More than oof_shared_steal_thresh - the wildcard filter shared by all accepted sockets is retained. A new listening socket can be created, but a filter cannot be installed, meaning the socket will receive no traffic until the number of accepted connections is reduced.
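These are module options rather than per-process environment variables; on a running system their current values can typically be inspected under sysfs. The path below follows the standard Linux module-parameter convention and is an assumption here - verify it on your system:

```python
from pathlib import Path

# Assumed standard sysfs location for onload module parameters.
PARAMS = Path("/sys/module/onload/parameters")

def read_module_param(name, default):
    """Return the current value of an onload module parameter, falling
    back to the documented default if the module is not loaded."""
    p = PARAMS / name
    return int(p.read_text()) if p.exists() else default

keep_thresh = read_module_param("oof_shared_keep_thresh", 100)
steal_thresh = read_module_param("oof_shared_steal_thresh", 200)
```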
Socket Caching
Shared local ports
Transparent Reverse Proxy Modes
Transparent Reverse Proxy on Multiple CPUs
Performance in lossy network environments