By Stephen Cleary | 20 Jun 2009
Excerpted from: http://www.codeproject.com/Articles/37490/Detection-of-Half-Open-Dropped-TCP-IP-Socket-Conne/?fid=1542585&df=90&mpp=25&noise=3&prof=False&sort=Position&view=Quick&fr=3
(This post is part of the TCP/IP .NET Sockets FAQ.)
There is a three-way handshake to open a TCP/IP connection, and a four-way handshake to close it. However, once the connection has been established, if neither side sends any data, then no packets are sent over the connection. TCP is an "idle" protocol, happy to assume that the connection is active until proven otherwise.
TCP was designed this way for resiliency and efficiency. This design enables a graceful recovery from unplugged network cables and router crashes. For example, a client may connect to a server, an intermediate router may be rebooted, and after the router comes back up, the original connection still exists (unless data was sent across the connection while the router was down). This design is also efficient, since no "polling" packets are sent across the network just to check whether the connection is still OK, which reduces unnecessary network traffic.
TCP does have acknowledgments for data, so when one side sends data to the other side, it will receive an acknowledgment if the connection is still active (or an error if it is not). Thus, broken connections can be detected by sending out data. It is important to note that the act of receiving data is completely passive in TCP; a socket that only reads cannot detect a dropped connection.
This leads to a scenario known as a "half-open connection". At any given point in most protocols, one side is expected to send a message and the other side is expecting to receive it. Consider what happens if an intermediate router is suddenly rebooted at that point: the receiving side will continue waiting for the message to arrive; the sending side will send its data, and receive an error indicating the connection was lost. Since broken connections can only be detected by sending data, the receiving side will wait forever. This scenario is called a "half-open connection" because one side realizes the connection was lost but the other side believes it is still active.
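The reader's predicament can be reproduced locally. In the Python sketch below, Python's socket module stands in for the .NET Socket class, and the names wait_for_message and demo are hypothetical. The client connects to a loopback listener that accepts but never sends. That peer is actually alive, but from the reader's side it looks exactly like one that has silently vanished: recv() just waits. The only thing a read-only socket can do on its own is bound that wait with a timeout.

```python
import socket
from typing import Optional


def wait_for_message(sock: socket.socket, timeout_s: float) -> Optional[bytes]:
    """Wait up to timeout_s for data; return None if nothing arrived in time.

    Without the timeout, recv() would block forever, whether the peer is
    merely silent or has disappeared without sending a FIN or RST.
    """
    sock.settimeout(timeout_s)
    try:
        return sock.recv(4096)
    except socket.timeout:
        return None


def demo(timeout_s: float = 0.2) -> Optional[bytes]:
    # Hypothetical setup: a loopback listener that accepts the connection
    # but never sends anything, standing in for a peer that has vanished.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    silent_peer, _ = listener.accept()
    try:
        # The reader cannot tell "silent" from "gone"; it can only time out.
        return wait_for_message(client, timeout_s)
    finally:
        client.close()
        silent_peer.close()
        listener.close()


if __name__ == "__main__":
    print(demo())  # None: the wait ended only because of our own timeout
```

A timeout only tells the reader that nothing arrived; whether that means the connection is dead is a decision the application protocol has to make.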
Terminology alert: "half-open" is completely different from "half-closed". A half-closed connection is one where one side has performed a Shutdown operation on its socket, shutting down only the sending (outgoing) stream. See Socket Operations for more details on the Shutdown operation.
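To make the distinction concrete, here is a minimal Python sketch of a half-closed (not half-open) connection over loopback TCP. Python's socket.shutdown(socket.SHUT_WR) plays the role of the Shutdown operation; the helper names tcp_pair, read_to_end and half_close_demo are hypothetical. After one side shuts down its sending stream, the peer's recv() returns an empty result (end of stream), yet the peer can still send data back.

```python
import socket
from typing import Tuple


def tcp_pair() -> Tuple[socket.socket, socket.socket]:
    """Create a connected pair of TCP sockets over loopback."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    listener.close()
    return client, server


def read_to_end(sock: socket.socket) -> bytes:
    """Read until the peer shuts down its sending side (recv returns b"")."""
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            return b"".join(chunks)
        chunks.append(chunk)


def half_close_demo() -> Tuple[bytes, bytes]:
    client, server = tcp_pair()
    try:
        client.sendall(b"request")
        client.shutdown(socket.SHUT_WR)  # client -> server direction is now closed
        received = read_to_end(server)   # server sees the data, then end of stream

        # The server -> client direction is still open, so the server can reply.
        server.sendall(b"response")
        server.shutdown(socket.SHUT_WR)
        reply = read_to_end(client)
        return received, reply
    finally:
        client.close()
        server.close()


if __name__ == "__main__":
    print(half_close_demo())  # (b'request', b'response')
```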
Half-open connections are in that annoying list of problems that one seldom sees in a test environment but that commonly happen in the real world. This is because if the socket is shut down with the normal four-way handshake (or even if it is abruptly closed), the half-open problem will not occur. Some of the common causes of a half-open connection are described below:
In all of the situations above, it is possible that one side may be aware of the loss of connection, while the other side is not.
There are some situations in which detection is not necessary. Even when it is necessary, one must consider how quickly the process needs to detect dropped connections.
The need for half-open detection falls roughly into three categories:
The necessity of detection must be considered separately for each side of the communication. For example, if the protocol is based on a polling scheme, then the side doing the polling does not need explicit keepalive handling, but the side responding to the polling likely does need explicit keepalive handling.
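One way for the responding side of a polling protocol to do its own detection is an idle deadline: if no poll arrives within some multiple of the polling interval, treat the connection as dropped and close it. Below is a minimal Python sketch of such a watchdog. IdleWatchdog, its parameters, and the 10-second poll / 30-second timeout figures are illustrative assumptions, not part of any library; the clock is injectable so the logic can be exercised without real waiting.

```python
import time
from typing import Callable


class IdleWatchdog:
    """Hypothetical helper: treat a connection as dropped after too much silence.

    Call touch() whenever anything arrives (a poll, a request, a keepalive
    message), and call expired() periodically, e.g. from a timer.
    """

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_activity = clock()

    def touch(self) -> None:
        self._last_activity = self._clock()

    def expired(self) -> bool:
        return self._clock() - self._last_activity > self._timeout_s


if __name__ == "__main__":
    # Fake clock so the example runs instantly. Assume the poller sends a
    # poll every 10 s; allow about three missed polls (30 s) before giving up.
    now = [0.0]
    dog = IdleWatchdog(timeout_s=30.0, clock=lambda: now[0])
    now[0] = 10.0
    dog.touch()                # a poll arrived
    now[0] = 35.0
    print(dog.expired())       # False: only 25 s since the last poll
    now[0] = 41.0
    print(dog.expired())       # True: 31 s of silence
```

In a real responder, touch() would be called from the receive path and expired() from a periodic timer; on expiry the socket should be closed so that whatever it holds (such as the single connection slot in the story below) is freed.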
True Story: I once had to write software to control a serial device that operated through a "bridge" device that exposed the serial port over TCP/IP. The company that developed the bridge implemented a simple protocol: they listened for a single TCP/IP connection (from anywhere) and, once the connection was established, sent any data received from the TCP/IP connection to the serial port, and any data received from the serial port to the TCP/IP connection. Of course, they only allowed one TCP/IP connection (otherwise, there could be contention over the serial port), so other connections were refused as long as there was an established connection.
The problem? No keepalives. If the bridge ever ended up in a half-open situation, it would never recover; any connection requests would be rejected because the bridge would believe the original connection was still active. Usually, the bridge was deployed to a stationary device on a physical network; presumably, if the device ever stopped working, someone would walk over and perform a power cycle. However, we were deploying the bridge onto mobile devices on a wireless network, and it was normal for our devices to pass out of and back into access point coverage. Furthermore, this was part of an automated system, and people weren't near the devices to perform a power cycle. Of course, the bridge failed during our prototyping; when we brought the root cause to the other company's attention, they were unable to implement a keepalive (the embedded TCP/IP stack didn't support it), so they worked with us in developing a method of remotely resetting the bridge.
It's important to note that we did have keepalive testing on our side of the connection (via a timer), but this was insufficient. It is necessary to have keepalive testing on both sides of the connection.
This bridge was in full production, and had been for some time. The company that made this error was a billion-dollar global corporation centered around networking products. The company I worked for had four programmers at the time. This just goes to show that even the big guys can make mistakes.
There are a couple of wrong methods to detect dropped connections. Beginning socket programmers often come up with these incorrect solutions to the half-open problem. They are listed here only for reference, along with a brief description of why they are wrong.
There are several correct solutions to the half-open problem. Each one has its pros and cons, depending on the problem domain. This list is in order from best solution to worst solution (IMO):
Add a "null" message to the application protocol: a message that the receiver should just ignore.
Enable TCP keepalive packets by setting SocketOptionName.KeepAlive. The MSDN documentation isn't clear that this uses a 2-hour timeout, which is too long for most applications. This can be changed (system-wide) through a registry key, but changing this system-wide (i.e., for all other applications) is greatly frowned upon. This is the old-fashioned way to enable keepalive packets.
Each side of the application protocol may employ different keepalive solutions, and even different keepalive solutions at different states in the protocol; however, one of the solutions above should always be used. For example, the client side of a request/response style protocol may choose to send "null" requests when there is not a request pending, and switch to a timeout solution while waiting for a response.
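The client-side policy just described can be sketched as a small decision function, together with a toy framing that includes a "null" message type. Everything here is an illustrative assumption: the 4-byte big-endian length prefix, the MSG_* type codes, and the names encode, decode_all, Action and client_action are hypothetical, chosen only to show the idea. The receiver drops NULL messages; the client sends one only when it has been idle with no request outstanding, and relies on a response timeout otherwise.

```python
import struct
from enum import Enum
from typing import List, Optional, Tuple

# Hypothetical message types for a toy length-prefixed protocol.
MSG_NULL = 0      # carries nothing; the receiver ignores it
MSG_REQUEST = 1


def encode(msg_type: int, payload: bytes = b"") -> bytes:
    """Frame = 4-byte big-endian length of (type + payload), type byte, payload."""
    body = bytes([msg_type]) + payload
    return struct.pack(">I", len(body)) + body


def decode_all(data: bytes) -> List[Tuple[int, bytes]]:
    """Decode a buffer of complete, well-formed frames, dropping NULL messages."""
    messages = []
    offset = 0
    while offset + 4 <= len(data):
        (length,) = struct.unpack_from(">I", data, offset)
        body = data[offset + 4:offset + 4 + length]
        offset += 4 + length
        if body[0] != MSG_NULL:  # NULL messages only exercise the connection
            messages.append((body[0], body[1:]))
    return messages


class Action(Enum):
    NOTHING = "nothing"
    SEND_NULL = "send_null"
    CONNECTION_DEAD = "connection_dead"


def client_action(now: float, last_sent: float,
                  request_sent_at: Optional[float],
                  idle_interval: float, response_timeout: float) -> Action:
    """Decide what the client should do at time `now` (all times in seconds)."""
    if request_sent_at is not None:
        # A request is pending: rely on a response timeout, not NULL messages.
        if now - request_sent_at > response_timeout:
            return Action.CONNECTION_DEAD
        return Action.NOTHING
    # Nothing pending: if we have been quiet too long, send a NULL message so
    # that a dropped connection eventually surfaces as a send error.
    if now - last_sent >= idle_interval:
        return Action.SEND_NULL
    return Action.NOTHING


if __name__ == "__main__":
    stream = encode(MSG_NULL) + encode(MSG_REQUEST, b"get") + encode(MSG_NULL)
    print(decode_all(stream))  # [(1, b'get')]
    print(client_action(20, 0, None, idle_interval=15, response_timeout=30))
    print(client_action(40, 35, 5, idle_interval=15, response_timeout=30))
```

The value of the NULL message is in the act of sending it: if the connection has been dropped, that send (or a later one) eventually fails, which is how the client discovers the problem.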
However, when designing a new protocol, it is best to employ one of the solutions consistently.
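If TCP-level keepalives are used anyway, most current platforms allow the timing to be set per socket rather than system-wide, which avoids the registry change discussed above. The following Python sketch is an illustration under assumptions, not the .NET API: it sets SO_KEEPALIVE (the socket-level option that SocketOptionName.KeepAlive corresponds to) and then, where the platform's socket module exposes them, TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT, falling back to the Windows-only SIO_KEEPALIVE_VALS ioctl. Which constants exist varies by operating system and Python version, hence the hasattr checks; enable_tcp_keepalive and its default values are hypothetical.

```python
import socket


def enable_tcp_keepalive(sock: socket.socket, idle_s: int = 60,
                         interval_s: int = 10, count: int = 5) -> None:
    """Hypothetical helper: enable TCP keepalives on this socket only."""
    # Socket-level switch; the counterpart of .NET's SocketOptionName.KeepAlive.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
        if hasattr(socket, "TCP_KEEPINTVL"):
            # Seconds between unanswered probes.
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
        if hasattr(socket, "TCP_KEEPCNT"):
            # Unanswered probes before the connection is declared dead.
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    elif hasattr(socket, "SIO_KEEPALIVE_VALS"):
        # Windows: (on/off, idle time in ms, probe interval in ms);
        # the probe count is not configurable through this call.
        sock.ioctl(socket.SIO_KEEPALIVE_VALS, (1, idle_s * 1000, interval_s * 1000))


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_tcp_keepalive(s)
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)  # True
    s.close()
```

With these values on Linux, a peer that silently disappears would be noticed after roughly 60 + 5 × 10 = 110 seconds of idleness, instead of after more than two hours with the default settings.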
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)