Original article: http://baus.net/on-tcp_cork/
April 6, 2005
I previously mentioned the leakiness of Unix's file metaphor. The leak often becomes a gushing torrent when trying to bump up performance. TCP_CORK is yet another example.
Before I get into the details of TCP_CORK and the problem it addresses, I want to point out that this is a Linux-only option, although variants exist on other *nix flavors -- for instance TCP_NOPUSH on FreeBSD and Mac OS X (although from what I've read the OS X implementation is buggy). This is one of the unfortunate aspects of modern Unix programming. While most of the APIs are identical between Unix-like OSes, if the functionality isn't specified by POSIX, none of the major *nixes can seem to agree on an implementation.
The root of the abstraction leak derives from the semantics of the write() function when applied to TCP/IP. Historically (and any Unix experts in the crowd should feel free to correct me here if this is not accurate) the write() function resulted in a physical, non-buffered write to the device. With TCP/IP the "device" is a network packet, but the implementors were forced to define what a physical write meant given Unix's file semantics, so a TCP/IP write() was defined as follows:
Any data that has been sent to the kernel with write() is placed into one or more packets and immediately sent onto the wire.
The resulting behavior is what application programmers expected. When they called write() the data would be sent and available to the host on the other side of the wire. But it didn't take long to realize that this resulted in some interesting performance problems, which were addressed by Nagle's algorithm.
In the early 1980's John Nagle found that the networks at Ford Aerospace were becoming congested with packets containing only a single character's worth of data. Basically, every time a user struck a key in a telnet-like console app an entire packet was put onto the network. As Nagle pointed out, this resulted in about 4000% overhead (the total amount of data sent vs. the actual application data). Nagle's solution was simple: wait for the peer to acknowledge the previously sent packet before sending any partial packets. This gives the OS time to coalesce multiple calls to write() from the application into larger packets before forwarding the data to the peer.
Nagle's algorithm is transparent to application developers, and it effectively sticks a fat finger in the abstraction leak. Calls to write() guarantee that data is delivered to the peer. Nagle also has the side benefit of providing additional rudimentary flow control.
While Nagle's algorithm is an excellent compromise for many applications, and it is the default behavior for most TCP/IP implementations including Linux's, it isn't without drawbacks. The Nagle algorithm is most effective if TCP/IP traffic is generated sporadically by user input, not by applications using stream-oriented protocols. It works great for Telnet, but it is less than optimal for HTTP. For example, if an application needs to send 1 1/2 packets of data to complete a message, the second packet is delayed until an ACK is received for the previous packet, thereby needlessly increasing latency when the application doesn't expect to send more data.
It also requires the peer to process more packets when network latency is low. This can affect the responsiveness of the peer by causing it to needlessly consume resources.
Unfortunately, as is often the case, the file abstraction must be violated to improve performance. The application must instruct the OS not to send any packets unless they are full, or the application signals the OS to send all pending data. This is the effect of TCP_CORK.
The application must tell the OS where the boundaries of the application-layer messages are. For instance, multiple HTTP messages can be passed on one connection using HTTP pipelining. When a message is complete, the application should signal the OS to send any outstanding data. If the application fails to do so, the peer will hang waiting for the remainder of the message.
In my HTTP implementation, I use the flush metaphor, which is common with streams but not usually associated with calls to write(), which are supposed to be physical. I set the TCP_CORK option when the socket is created, and then "flush" the socket at message boundaries.
If you need to write multiple buffers that are currently in memory, you should prefer the gather function writev() before considering TCP_CORK with multiple calls to write(). writev() allows multiple non-contiguous buffers to be written with one system call. The kernel can then efficiently coalesce the buffers into packet structures before writing them to the network. It also reduces the number of system calls required to send the data, and hence improves performance.
writev() should be combined with either the TCP_NODELAY or the TCP_CORK option. TCP_NODELAY disables the Nagle algorithm and ensures that the data will be written immediately. Using TCP_CORK with writev() allows the kernel to buffer and align packets between multiple calls to write() or writev(), but you must remember to remove the cork option to write out the remaining data, as described below.
TCP_NODELAY is set on a socket as follows:
/* requires <sys/socket.h> and <netinet/tcp.h> */
int state = 1;
setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &state, sizeof(state));
The drawback of writev() is that it is difficult to use with non-blocking I/O, where the call may return before all the data is written. A post-call operation must be performed to determine how much data was written and to realign the buffers for subsequent calls. This is an area where auxiliary library functionality would help. The behavior of writev() with non-blocking I/O also isn't well documented.
If you need the kernel to align and buffer packet data beyond the lifespan of your buffers (hence the inability to use writev()), then TCP_CORK should be considered. TCP_CORK is set on a socket file descriptor using the setsockopt() function. When the TCP_CORK option is set, only full packets are sent, until the TCP_CORK option is removed. This is important. To ensure all waiting data is sent, the TCP_CORK option MUST be removed.

Herein lies the beauty of the Nagle algorithm: it doesn't require any intervention from the application programmer. But once you set TCP_CORK, you have to be prepared to remove it when there is no more data to send. I can't stress this enough, as TCP_CORK can cause subtle bugs if the cork isn't pulled at the appropriate times.
To set TCP_CORK use the following:
int state = 1;
setsockopt(fd, IPPROTO_TCP, TCP_CORK, &state, sizeof(state));
The cork can be removed, and any partial-packet data sent, with:
int state = 0;
setsockopt(fd, IPPROTO_TCP, TCP_CORK, &state, sizeof(state));
As I mentioned, I use the flush paradigm, which involves the somewhat awkward removing and reapplying of the TCP_CORK option. This can be done as follows:
int state = 0;
setsockopt(fd, IPPROTO_TCP, TCP_CORK, &state, sizeof(state)); /* uncork: flush pending data */
state = 1;
setsockopt(fd, IPPROTO_TCP, TCP_CORK, &state, sizeof(state)); /* re-cork for the next message */
User-mode buffered streams are another solution to the problem. User-mode buffering is implemented as follows: instead of calling write() directly, the application stores data in a write buffer. When the write buffer is full, all the data is then sent with a call to write().
Even with buffered streams, the application must be able to instruct the OS to forward all pending data when the stream has been flushed, for optimal performance. The application does not know where packet boundaries reside, hence buffer flushes might not align on packet boundaries. TCP_CORK can pack data more effectively, because it has direct access to the TCP/IP layer.
Also, application buffering requires gratuitous memory copies, which many high-performance servers attempt to minimize. Memory bus contention and latency often limit a server's throughput.
If you do use an application buffering and streaming mechanism (as Apache does), I highly recommend applying the TCP_NODELAY socket option, which disables Nagle's algorithm. All calls to write() will then result in the immediate transfer of data.