onload-tcp

TCP Operation

TCP Handshake - SYN, SYN-ACK

During the TCP connection establishment 3-way handshake, Onload negotiates the MSS, Window Scale, SACK permitted, ECN, PAWS and RTTM timestamp options.

For TCP connections Onload will negotiate an MSS appropriate to the MTU configured on the interface. However, when jumbo frames are in use, Onload will currently negotiate an MSS up to a maximum of 2048 bytes minus the number of bytes required for packet headers. This is because the buffers passed to the Solarflare network interface card are 2048 bytes in size, and the Onload stack cannot currently handle fragmented packets on its TCP receive path.

The TCP options advertised during the handshake can be selected using the EF_TCP_SYN_OPTS environment variable. Refer to Parameter Reference on page 210 for details of environment variables.
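
As an illustration, the MSS actually negotiated for a connection can be read back with the TCP_MAXSEG option described under TCP Level Options below. A minimal sketch in C (the helper name print_mss is illustrative), assuming an already-connected socket in a process run under Onload:

    #include <stdio.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Print the MSS in effect on a connected TCP socket. */
    int print_mss(int sock)
    {
        int mss;
        socklen_t len = sizeof(mss);
        if (getsockopt(sock, IPPROTO_TCP, TCP_MAXSEG, &mss, &len) < 0)
            return -1;
        printf("negotiated MSS: %d bytes\n", mss);
        return 0;
    }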

TCP SYN Cookies

The Onload environment variable EF_TCP_SYNCOOKIES can be enabled on a per-stack basis to force the use of SYN cookies, thereby providing a degree of protection against SYN flood Denial of Service (DoS) attacks.
EF_TCP_SYNCOOKIES is disabled by default. 
Refer to Parameter Reference on page 210 for details of environment variables.
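
For example, to enable SYN cookies for a single application's stack at launch (the binary name here is illustrative):

    EF_TCP_SYNCOOKIES=1 onload ./tcp_server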

TCP Socket Options

Onload TCP supports the following socket options, which can be used in the setsockopt() and getsockopt() function calls.
SO_PROTOCOL            Retrieves the socket protocol as an integer.
SO_ACCEPTCONN          Determines whether the socket can accept incoming connections; true for listening sockets. (Only valid as a getsockopt().)
SO_BINDTODEVICE        Binds this socket to a particular network interface. See SO_BINDTODEVICE on page 83.
SO_CONNECT_TIME        The number of seconds a connection has been established. (Only valid as a getsockopt().)
SO_DEBUG               Enables protocol debugging.
SO_ERROR               The errno value of the last error occurring on the socket. (Only valid as a getsockopt().)
SO_EXCLUSIVEADDRUSE    Prevents other sockets from using the SO_REUSEADDR option to bind to the same address and port.
SO_KEEPALIVE           Enables sending of keep-alive messages on connection-oriented sockets.
SO_LINGER              When enabled, a close() or shutdown() will not return until all queued messages for the socket have been successfully sent or the linger timeout has been reached. Otherwise the call returns immediately and sockets are closed in the background.
SO_OOBINLINE           Indicates that out-of-band data should be returned in-line with regular data. This option is only valid for connection-oriented protocols that support out-of-band data.
SO_PRIORITY            Sets the priority for all packets sent on this socket. Packets with a higher priority may be processed first, depending on the selected device queuing discipline.
SO_RCVBUF              Sets or gets the maximum socket receive buffer in bytes. Note that the EF_TCP_RCVBUF and EF_TCP_RCVBUF_ESTABLISHED_DEFAULT environment variables can override this value. Setting SO_RCVBUF to a value smaller than the MTU can result in poorer performance and is not recommended.
SO_RCVLOWAT            Sets the minimum number of bytes to process for socket input operations.
SO_RCVTIMEO            Sets the timeout for blocking receive calls to complete.
SO_REUSEADDR           Allows local port numbers to be reused: another socket can bind to the same port, except when there is an active listening socket bound to the port.
SO_REUSEPORT           Allows multiple sockets to bind to the same port.
...
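
To make the list concrete, here is a minimal sketch in C (standard sockets API; run the process under Onload for the options to be handled by the Onload stack) that applies a few of the options above to a listening TCP socket. The function name make_listener is illustrative:

    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <string.h>
    #include <unistd.h>

    int make_listener(unsigned short port)
    {
        int one = 1;
        struct linger lg = { .l_onoff = 1, .l_linger = 5 };  /* linger up to 5s on close() */
        struct sockaddr_in addr;
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        if (sock < 0)
            return -1;
        /* Allow the port to be reused across restarts and by multiple sockets. */
        setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
        setsockopt(sock, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
        /* Block in close() until queued data is sent or 5 seconds elapse. */
        setsockopt(sock, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);
        if (bind(sock, (struct sockaddr*)&addr, sizeof(addr)) < 0 ||
            listen(sock, 128) < 0) {
            close(sock);
            return -1;
        }
        return sock;
    }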

TCP Level Options

TCP_CORK                        Prevents segments smaller than MSS size from being sent until the connection is uncorked.
TCP_DEFER_ACCEPT                A connection is ESTABLISHED after the handshake is complete, instead of being left in SYN-RECV until the first real data packet arrives. The connection is placed in the accept queue when the first data packet arrives.
TCP_INFO                        Populates an internal data structure with TCP statistics values.
TCP_KEEPALIVE_ABORT_THRESHOLD   How long to try to produce a successful keepalive before giving up.
TCP_KEEPALIVE_THRESHOLD         Specifies the idle time for keepalive timers.
TCP_KEEPCNT                     The number of keepalives before giving up.
TCP_KEEPIDLE                    The idle time before keepalives are sent.
TCP_KEEPINTVL                   The time between keepalives.
TCP_MAXSEG                      Gets the MSS size for this connection.
TCP_NODELAY                     Disables Nagle's Algorithm: small segments are sent without delay and without waiting for previous segments to be acknowledged.
TCP_QUICKACK                    When enabled, ACK messages are sent immediately following reception of the next data packet. This flag is reset to zero after every use, i.e. it is a one-time option. New connections start in a mode where all packets are acknowledged, so this value initially defaults to 1.
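
For low-latency request/response traffic, TCP_NODELAY and TCP_QUICKACK are often combined; a minimal sketch in C (the helper name tune_for_latency is illustrative):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Disable Nagle so small writes are sent immediately, and enable
     * quick ACKs. TCP_QUICKACK is a one-time option, so it must be
     * re-armed after receive calls to keep ACKs immediate. */
    void tune_for_latency(int sock)
    {
        int one = 1;
        setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
        setsockopt(sock, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
    }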

TCP File Descriptor Control

SOCK_CLOEXEC    Supported in socket() and accept(). Sets the close-on-exec (FD_CLOEXEC) flag on the new file descriptor.
SOCK_NONBLOCK   Supported in socket() and accept(). Sets the O_NONBLOCK file status flag on the new open file descriptor, saving extra calls to fcntl(2) to achieve the same result.
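
A minimal sketch in C showing both flags; on Linux, accept4() is the variant of accept() that takes these flags (the helper names are illustrative):

    #define _GNU_SOURCE
    #include <sys/socket.h>

    /* Create a non-blocking, close-on-exec TCP socket. */
    int make_socket(void)
    {
        return socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC, 0);
    }

    /* Accept connections whose descriptors carry the same flags,
     * avoiding separate fcntl(2) calls on each new descriptor. */
    int accept_nb_cloexec(int listener)
    {
        return accept4(listener, NULL, NULL, SOCK_NONBLOCK | SOCK_CLOEXEC);
    }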

TCP Congestion Control

	Onload TCP implements congestion control 
	in accordance with RFC3465 
	and employs the NewReno algorithm with extensions for Appropriate Byte Counting (ABC).
	On new or idle connections 
	and those experiencing loss, 
	Onload employs a Fast Start algorithm 
	in which delayed acknowledgments are disabled, 
	thereby creating more ACKs 
	and subsequently ‘growing’ the congestion window rapidly. 
	Two environment variables; 
	EF_TCP_FASTSTART_INIT and EF_TCP_FASTSTART_LOSS are associated 
	with the fast start 
	‐ Refer to Parameter Reference on page 210 for details.
	During Slow Start, 
	the congestion window is initially set to 2 x maximum segment size (MSS) value. 
	As each ACK is received the congestion window size is increased 
	by the number of bytes acknowledged up to a maximum 2 x MSS bytes. 
	This allows Onload to transmit the minimum of the congestion window 
	and advertised window size 
	i.e.transmission window (bytes) = min(CWND, receiver advertised window size)
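
As a worked example (illustrative numbers): with MSS = 1460 bytes, the congestion window starts at CWND = 2 x 1460 = 2920 bytes. If an ACK then acknowledges 2920 bytes, CWND grows by min(2920, 2 x MSS) = 2920 bytes to 5840 bytes. If the receiver is advertising a 4096 byte window at that point, the transmission window is min(5840, 4096) = 4096 bytes.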
If loss is detected, either by a retransmission timeout (RTO) or by the reception of duplicate ACKs, Onload will adopt a congestion avoidance algorithm to slow the transmission rate. In congestion avoidance the transmission window is halved from its current size, but will not be less than 2 x MSS. If congestion avoidance was triggered by an RTO, the Slow Start algorithm is again used to restore the transmit rate. If it was triggered by duplicate ACKs, Onload employs the Fast Retransmit and Fast Recovery algorithms.

If Onload TCP receives 3 duplicate ACKs, this indicates that a segment has been lost rather than just received out of order, and causes the immediate retransmission of the lost segment (Fast Retransmit). The continued reception of duplicate ACKs is an indication that traffic still flows within the network, and Onload will follow Fast Retransmit with Fast Recovery.

TCP Loopback Acceleration

Onload supports the acceleration of TCP loopback connections, providing an accelerated mechanism through which two processes on the same host can communicate. Accelerated TCP loopback connections do not invoke system calls, reduce the overheads for read/write operations and offer improved latency over the kernel implementation.

The server and client processes that want to communicate using an accelerated TCP loopback connection do not need to be configured to share an Onload stack. However, the server and client TCP loopback sockets can only be accelerated if they are in the same Onload stack; Onload is able to move a TCP loopback socket between Onload stacks to achieve this.

TCP loopback acceleration is configured via the environment variables EF_TCP_CLIENT_LOOPBACK and EF_TCP_SERVER_LOOPBACK. As well as enabling TCP loopback acceleration, these environment variables control Onload's behavior when the server and client sockets do not originate in the same Onload stack. This gives the user greater flexibility and control when establishing loopback on TCP sockets, either from the listening (server) socket or from the connecting (client) socket. The connecting socket can use any local address or specify the loopback address.

The client loopback option EF_TCP_CLIENT_LOOPBACK=4, when used with the server loopback option EF_TCP_SERVER_LOOPBACK=2, differs from the other loopback options in that, rather than moving sockets between existing stacks, it creates an additional stack and moves the sockets from both ends of the TCP connection into this new stack. This avoids the possibility of many loopback sockets sharing, and contending for, the resources of a single stack. When the client and server do not run as the same UID, set the environment variable EF_SHARE_WITH to allow both processes to share the newly created stack.
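
For example (a hypothetical launch; the binary names are illustrative), a server and client on the same host could be started as follows, so that both connection endpoints are moved into a dedicated new stack:

    EF_TCP_SERVER_LOOPBACK=2 onload ./server
    EF_TCP_CLIENT_LOOPBACK=4 onload ./client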

Listen/Accept Sockets

TCP sockets accepted from a listening socket share a wildcard filter with the parent socket. The following Onload module options can be used to control behavior when the parent socket is closed:

oof_shared_keep_thresh (default 100) - the number of accepted sockets sharing a wildcard filter that will cause the filter to persist after the listening socket has closed.

oof_shared_steal_thresh (default 200) - the number of sockets sharing a wildcard filter that will cause the filter to persist even when a new listening socket needs the filter.

If the listening socket is closed, the behavior depends on the number of remaining accepted sockets, as follows:
More than oof_shared_keep_thresh but fewer than oof_shared_steal_thresh: the wildcard filter shared by all accepted sockets is retained. If a new listening socket requires the filter, Onload will install a full-match filter for each accepted socket, allowing the listening socket to use the wildcard filter.

More than oof_shared_steal_thresh: the wildcard filter shared by all accepted sockets is retained. A new listening socket can be created, but a filter cannot be installed for it, meaning the socket will receive no traffic until the number of accepted connections is reduced.
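
These are Onload kernel module options; a minimal sketch, assuming they are set via a modprobe configuration file (the path and values are illustrative):

    # /etc/modprobe.d/onload.conf
    options onload oof_shared_keep_thresh=150 oof_shared_steal_thresh=300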

Socket Caching

Shared local ports

Transparent Reverse Proxy Modes

Transparent Reverse Proxy on Multiple CPUs

Performance in lossy network environments
