Learning to write network-aware applications has never been considered easy. In reality, though, there are just a few principles to master: creating and connecting a socket, accepting a connection, and sending and receiving data. The real difficulty is writing network applications that scale from a single connection to many thousands of connections. This article will discuss the development of scalable Windows NT® and Windows 2000-based applications that use Windows® Sockets 2.0 (Winsock2). The primary focus will be the server side of the client/server model; however, many of the topics discussed apply to both. Because the notion of writing a scalable Winsock application implies a server application, the following discussion is pertinent to applications running on Windows NT 4.0 and Windows 2000. We're not including Windows NT 3.x because this solution relies on the features of Winsock2 that are available only on Windows NT 4.0 and newer.
APIs and Scalability The overlapped I/O mechanism in Win32® allows an application to initiate an operation and receive notification of its completion later. This is especially useful for operations that take a long time to complete. The thread that initiates the overlapped operation is then free to do other things while the overlapped request completes behind the scenes. The only I/O model that provides true scalability on Windows NT and Windows 2000 is overlapped I/O using completion ports for notification. Mechanisms like the WSAAsyncSelect and select functions are provided for easier porting from Windows 3.1 and Unix respectively, but are not designed to scale. The completion port mechanism is optimized for the operating system's internal workings.
Completion Ports A completion port is a queue into which the operating system puts notifications of completed overlapped I/O requests. Once the operation completes, a notification is sent to a worker thread that can process the result. A socket may be associated with a completion port at any point after creation. Typically an application will also create a number of worker threads to process these notifications. The number of worker threads depends on the specific needs of the application. The ideal number is one per processor, but that implies that none of these threads should execute a blocking operation such as a synchronous read/write or a wait on an event. Each thread is given a certain amount of CPU time, known as the quantum, for which it can execute before another thread is allowed to grab a time slice. If a thread performs a blocking operation, the operating system will throw away its unused time slice and let other threads execute instead. Thus, the first thread has not fully utilized its quantum, and the application should therefore have other threads ready to run and utilize that time slot. Using a completion port is a two-step process. First, the completion port is created, as shown in the following code:
HANDLE hIocp;

hIocp = CreateIoCompletionPort(
    INVALID_HANDLE_VALUE,
    NULL,
    (ULONG_PTR)0,
    0);
if (hIocp == NULL) {
    // Error
}
Once the completion port is created, each socket that wants to use the completion port must be associated with it. This is done by calling CreateIoCompletionPort again, this time setting the first parameter, FileHandle, to the socket handle to be associated, and setting ExistingCompletionPort to the handle of the completion port you just created. The following code creates a socket and associates it with the completion port created earlier:
SOCKET s;

s = socket(AF_INET, SOCK_STREAM, 0);
if (s == INVALID_SOCKET) {
    // Error: the socket could not be created
} else if (CreateIoCompletionPort((HANDLE)s,
                                  hIocp,
                                  (ULONG_PTR)0,
                                  0) == NULL) {
    // Error: the socket could not be associated with the port
}
// ...
At this point, the socket s is associated with the completion port. Any overlapped operations performed on the socket will use the completion port for notification. Note that the third parameter of CreateIoCompletionPort allows a completion key to be specified along with the socket handle to be associated. This can be used to pass context information that is associated with the socket. Each time a completion notification arrives, this context information can be retrieved. Once the completion port has been created and sockets have been associated with it, one or more threads are needed to process the completion notifications. Each thread will sit in a loop that calls GetQueuedCompletionStatus each time through and returns completion notifications. Before illustrating what a typical worker thread looks like, we need to address the ways in which an application keeps track of its overlapped operations. When an overlapped call is made, a pointer to an overlapped structure is passed as a parameter. GetQueuedCompletionStatus will return the same pointer when the operation completes. With this structure alone, however, an application can't tell which operation just completed. In order to keep track of the operations that have completed, it's useful to define your own OVERLAPPED structure that contains any extra information about each operation queued to the completion port (see Figure 1). Whenever an overlapped operation is performed, an OVERLAPPEDPLUS structure is passed as the lpOverlapped parameter (as in WSASend, WSARecv, and so on). This allows you to set operation state information for each overlapped call. When the operation completes, the OVERLAPPED pointer returned from GetQueuedCompletionStatus will now point to your extended structure. Note that the OVERLAPPED field within the extended structure does not necessarily have to be the first field. 
After the pointer to the OVERLAPPED structure is returned, the CONTAINING_RECORD macro can be used to obtain a pointer to the extended structure. Take a look at the example worker thread in Figure 2. The PerHandleKey variable will return anything that was passed as the CompletionKey parameter to CreateIoCompletionPort when associating a given socket handle. The Overlap parameter returns a pointer to the OVERLAPPEDPLUS structure that was used to initiate the overlapped operation. Keep in mind that if an overlapped operation fails immediately (that is, returns SOCKET_ERROR and the error is not WSA_IO_PENDING), then no completion notification will be posted to the queue. Conversely, if the overlapped call succeeds or fails with WSA_IO_PENDING, a completion notification will always be posted to the completion port. For more information on using completion ports with Winsock, take a look at the Microsoft® Platform SDK, which includes a Winsock completion port sample (under the Winsock section in the iocp directory). Additionally, consult Network Programming for Microsoft Windows by Anthony Jones and Jim Ohlund (Microsoft Press, 1999), which includes samples for completion ports as well as the other I/O models.
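Figures 1 and 2 are not reproduced in this text, so what follows is only a minimal sketch of the idea described above, not the article's original listings. The fields of OVERLAPPEDPLUS, the OP_* codes, and the FromOverlapped and WorkerThread names are assumptions made for illustration. The stand-in definitions in the #else branch exist only so that the portable part compiles off Windows; on Windows the real OVERLAPPED and CONTAINING_RECORD come from the SDK headers.

```c
#include <stddef.h>

#ifdef _WIN32
#include <winsock2.h>
#include <windows.h>
#else
/* Stand-ins so the portable part of this sketch compiles off Windows. */
typedef struct _OVERLAPPED {
    void *Internal, *InternalHigh, *Pointer, *hEvent;
} OVERLAPPED;
#define CONTAINING_RECORD(addr, type, field) \
    ((type *)((char *)(addr) - offsetof(type, field)))
#endif

/* Hypothetical operation codes (illustrative, not taken from Figure 1). */
typedef enum { OP_READ, OP_WRITE, OP_ACCEPT } IO_OP;

/* Hypothetical extended structure. The OVERLAPPED member is deliberately
   not the first field, to show that CONTAINING_RECORD still recovers the
   outer structure. */
typedef struct _OVERLAPPEDPLUS {
    IO_OP      opCode;     /* which kind of operation was queued */
    char       data[256];  /* buffer used by the operation */
    OVERLAPPED ol;         /* passed as lpOverlapped to WSASend, WSARecv, ... */
} OVERLAPPEDPLUS;

/* Map the OVERLAPPED pointer handed back on completion to its owner. */
OVERLAPPEDPLUS *FromOverlapped(OVERLAPPED *pol)
{
    return CONTAINING_RECORD(pol, OVERLAPPEDPLUS, ol);
}

#ifdef _WIN32
/* Worker loop. lpParam is assumed to carry the completion port handle. */
DWORD WINAPI WorkerThread(LPVOID lpParam)
{
    HANDLE hIocp = (HANDLE)lpParam;
    for (;;) {
        DWORD           bytes = 0;
        ULONG_PTR       perHandleKey = 0; /* CompletionKey set at association */
        OVERLAPPED     *pol = NULL;
        OVERLAPPEDPLUS *op;

        BOOL ok = GetQueuedCompletionStatus(hIocp, &bytes, &perHandleKey,
                                            &pol, INFINITE);
        if (pol == NULL) {
            break;          /* the dequeue itself failed; leave the loop */
        }
        op = FromOverlapped(pol);
        if (!ok) {
            /* The I/O completed with an error: release op, close the socket. */
            continue;
        }
        switch (op->opCode) {
        case OP_READ:   /* 'bytes' bytes arrived in op->data */  break;
        case OP_WRITE:  /* 'bytes' bytes were sent */            break;
        case OP_ACCEPT: /* a connection was accepted */          break;
        }
    }
    return 0;
}
#endif
```

In this sketch each overlapped call would allocate one OVERLAPPEDPLUS, fill in opCode, and pass the address of its ol member to the Winsock call; the worker then needs nothing but the returned pointer to find out what finished.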
The Windows NT and Windows 2000 Sockets Architecture A basic understanding of the sockets architecture of Windows NT and Windows 2000 is helpful in fully comprehending the principles of scalability. Figure 3 illustrates the current implementation of Winsock in Windows 2000. An application should not depend on the specific details mentioned here (names of drivers, DLLs, and so on), as these may change in a future release of the operating system.
Figure 3 Socket Architecture
The Windows Sockets 2.0 specification allows for a variety of protocols and their related providers. These user-mode service providers can be layered on top of existing providers in order to extend their functionality. For example, a proxy layered service provider (LSP) may install itself on top of the existing TCP/IP provider. This allows the proxy LSP to intercept and redirect or log calls to the base provider. Unlike some other operating systems, the Windows NT and Windows 2000 transport protocols do not have a sockets-style interface which applications can use to talk to them directly. Instead, they implement a much more general API called the Transport Driver Interface (TDI). The generality of this API keeps the subsystems of Windows NT from being tied to a particular flavor-of-the-decade network programming interface. The Winsock kernel mode driver provides the sockets emulation (currently implemented in AFD.SYS). This driver is responsible for the connection and buffer management needed to provide a sockets-style interface to an application. AFD.SYS, in turn, uses TDI to talk to the transport protocol driver.
Who Manages the Buffers? As just mentioned, AFD.SYS handles buffer management for applications that use Winsock to talk to the transport protocol drivers. This means that when an application calls the send or WSASend function to send data, the data gets copied by AFD.SYS to its internal buffers (up to the SO_SNDBUF setting) and the send or WSASend function returns immediately. The data is then sent by AFD.SYS behind the application's back, so to speak. Of course, if the application wants to issue a send for a buffer larger than the SO_SNDBUF setting, the WSASend call blocks until all the data is sent. Similarly, on receiving data from the remote client, AFD.SYS will copy the data to its own buffers as long as there is no outstanding data to receive from the application, and as long as the SO_RCVBUF setting is not exceeded. When the application calls recv or WSARecv, the data is copied from AFD.SYS's buffers to the application-provided buffer. In most cases, this architecture works very well. This is especially true for applications that use traditional socket paradigms with nonoverlapped sends and receives. Before going apoplectic over the buffer copying that's involved in sending and receiving data, a programmer should take great care to understand the consequences of turning off the buffering in AFD.SYS, which can be done by setting the SO_SNDBUF and SO_RCVBUF values to 0 using the setsockopt API. Consider, for example, an application that turns off buffering by setting SO_SNDBUF to 0 and issues a blocking send. In this case, the application's buffer is locked into memory by the kernel and the send API does not return until the other end of the connection acknowledges the entire buffer. That may seem like a neat way to determine whether all your data has actually been received by the other side, but in fact it is a bad thing to do. 
For one thing, even acknowledgment by the remote TCP is no guarantee that the data will be delivered to the client application, as there may be out-of-resource conditions that prevent it from copying the data from AFD.SYS. An even more significant problem with this approach is that your application can only do one send at a time in each thread. This is extremely inefficient, to say the least. Turning off receive buffering in AFD.SYS by setting SO_RCVBUF to 0 offers no real performance gains. Setting the receive buffer to 0 forces received data to be buffered at a lower layer than Winsock. Again, this leads to buffer copying when you actually post a receive, which defeats your purpose in turning off AFD's buffering. It should be clear by now that turning off buffering is a really bad idea for most applications. Turning off receive buffering is not usually necessary, as long as the application takes care to always have a few overlapped WSARecvs outstanding on a connection. The availability of posted application buffers removes the need for AFD to buffer incoming data. However, a high-performance server application can turn off the send buffering, yet not lose performance. Such an application must, however, take great care to ensure that it posts multiple overlapped sends, instead of waiting for one overlapped send to complete before posting another. If the application posts overlapped sends in a sequential manner, it wastes the time window between one send completion and the posting of the next send. If it had another buffer already posted, the transport would be able to use that buffer immediately and not wait for the application's next send operation.
Resource Constraints A major design goal of any server application is robustness. That is, you want your server application to ride out any transient problems that might occur, such as a spike in the number of client requests, temporary lack of available memory, or other relatively short-lived phenomena. To handle these incidents gracefully, the application developer should be aware of the resource constraints on typical Windows NT and Windows 2000-based systems. The most basic resource that you have direct control over is the bandwidth of the network on which the application is sending data. It's a fair assumption that an application that uses the User Datagram Protocol (UDP) is probably already aware of this limitation, since such a server would want to minimize packet loss. However, even with TCP connections, a server should take great care to never overrun the network for extended periods of time. Otherwise, there will be a lot of retransmissions and aborted connections. The specifics of the bandwidth management method are application-dependent and are beyond the scope of this article. Virtual memory used by the application also needs careful management. Conservative memory allocations and frees, perhaps using lookaside lists (a cache) to reuse previous allocations, will keep the server application's footprint smaller and allow the system to keep more of the application address space in memory all the time. An application can also use the SetProcessWorkingSetSize Win32 API to increase the amount of physical memory the operating system will let it use. There are two other resource constraints that an application indirectly encounters when using Winsock. The first one is the locked page limit. Whenever an application posts a send or receive, and AFD.SYS's buffering is disabled, all pages in the buffer are locked into physical memory. They need to be locked because the memory will be accessed by kernel-mode drivers and cannot be paged out for the duration of the access.
This would not be a problem in most circumstances, except that the operating system must make sure that there is always some pageable memory available to other applications. The goal is to prevent an ill-behaved application from locking up all of the physical RAM and bringing down the system. This means that your application must be conscious of hitting a system-defined limit on the number of pages locked in memory. The limit on locked memory in Windows NT and Windows 2000 is about one-eighth the physical RAM for all applications combined. This is a rough estimate and should not be used as an exact figure on which to base calculations. Just be aware that an overlapped operation may occasionally fail with ERROR_INSUFFICIENT_RESOURCES, and this limitation is a likely cause if there are too many send/receives pending. The application should take care not to have an excessive amount of memory locked in this fashion. Also note that all pages containing your buffer(s) will be locked, so it pays to have buffers that are aligned on page boundaries. The other resource limitation that an application will run into somewhere in its lifetime is the system non-paged pool limit. The Windows NT and Windows 2000 drivers have the ability to allocate memory from a special non-paged pool. The memory allocated from this region is never paged out. It is intended to store information that can be accessed by various kernel-mode components, some of which may not be able to access a location in memory that is paged out. Whenever an application creates a socket (or opens a file, for that matter), some amount of non-paged pool is allocated. In addition, the act of binding and/or connecting a socket also results in additional non-paged pool allocations. 
Add to this the fact that an outstanding I/O request, such as a send or a receive, allocates a little more non-paged pool (a small structure is required to keep track of pending I/O operations), and you can see that eventually there will be a problem. The operating system therefore limits the amount of non-pageable memory. The exact amount of non-paged pool allocated per connection is different for Windows NT 4.0 and Windows 2000 and will likely be different again for future versions of Windows. In the interests of your application's longevity, you should not calculate the exact amount of non-paged pool you need. However, the application must take care to avoid hitting the non-paged limit. When the system runs low on non-paged pool memory, you expose yourself to the risk that some driver that's completely unrelated to your application will throw a fit because it cannot allocate a non-paged pool at that particular time. In the worst case, this can lead to a system crash. This is especially likely (but impossible to predict in advance) in the presence of third-party devices and drivers on a system. You must also remember that there might be other server applications running on the same machine that consume non-paged pool memory. It is best to be very conservative in your resource estimation, and design the application accordingly. Handling the resource constraints is complicated by the fact that there is no special error code returned when either of the conditions is encountered. The application will get generic WSAENOBUFS or ERROR_INSUFFICIENT_RESOURCES errors from various calls. To handle these errors, first increase the working set of the application to some reasonable maximum. (For more information on adjusting your working set, see the Bugslayer column by John Robbins in this issue of MSDN Magazine.) Then, if you still continue to get these errors, check the possibility that you may be exceeding the bandwidth of the medium. 
Once you have done that, make sure you don't have too many sends or receives outstanding. Finally, if you still receive out-of-resource errors, you're most probably running into non-paged pool limits. To free up non-paged pool memory, the application must close a good portion of its outstanding connections and wait for the transient situation to correct itself.
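The remark above that every page containing a buffer is locked, and that page-aligned buffers therefore pay off, can be made concrete with a little arithmetic. The helper below is a hypothetical illustration (PagesSpanned is not a Windows API), and the 4096-byte page size used in the comments is an assumption; a real application would query the page size from the system rather than hard-code it.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of pages touched by a buffer that starts at address addr and is
   len bytes long. pageSize is the system page size in bytes. */
size_t PagesSpanned(uintptr_t addr, size_t len, size_t pageSize)
{
    uintptr_t first, last;

    if (len == 0) {
        return 0;                        /* an empty buffer locks nothing */
    }
    first = addr / pageSize;             /* page holding the first byte */
    last  = (addr + len - 1) / pageSize; /* page holding the last byte */
    return (size_t)(last - first + 1);
}
```

With 4096-byte pages, a 4096-byte buffer that starts on a page boundary touches one page, while the same buffer starting 16 bytes later touches two. Across thousands of pending overlapped operations, that difference can double the amount of memory counted against the locked page limit.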
Accepting Connections One of the most common things a server does is accept connections from clients. The AcceptEx function is the only Winsock API capable of using overlapped I/O to accept connections on a socket. The interesting thing about AcceptEx is that it requires an additional socket as one of the parameters to the API. In a normal, synchronous accept function call, the new socket is the return value from the API. However, since AcceptEx is an overlapped operation, the accepted socket must be created (but not bound or connected) in advance, and passed to the API. A typical pseudocode snippet that uses AcceptEx might look like the following:
do {
    - Wait for a previous AcceptEx to complete
    - Create a new socket and associate it with the completion port
    - Allocate context structure, etc.
    - Post an AcceptEx request
} while (TRUE);
A responsive server must always have enough AcceptEx calls outstanding so that any client connection can be immediately handled. The number of posted AcceptEx operations will depend on the type of traffic your server expects. A high incoming connection rate (because of short-lived connections or spurts in traffic) requires more outstanding AcceptEx calls than an application where the clients connect infrequently. It may be wise to let the number of posted AcceptEx operations vary between application-specific low and high watermarks, and avoid deciding on one fixed number as the magic figure. On Windows 2000, Winsock provides some help in determining if the application is falling behind on posting AcceptEx requests. When creating the listening socket, associate it with an event by using the WSAEventSelect API and registering for an FD_ACCEPT notification. If there are no accept operations pending, the event will be signaled by an incoming connection. This event can thus be used as an indication that you need to post more AcceptEx requests or detect a possible misbehaving remote entity, as we'll describe shortly. This mechanism is not available on Windows NT 4.0. A significant benefit to using the AcceptEx call is the ability to receive data and accept a client connection in one call via the lpOutputBuffer parameter. This means that if a client connects and immediately sends data, AcceptEx will complete only after the connection is established and the client sends data. This can be very useful, but it can also lead to problems since the AcceptEx call will not return until data is received, even if a connection has been established. This is because an AcceptEx call with an output buffer is not one atomic operation, but a two-step process consisting of accepting a connection and waiting for incoming data. However, the application is not notified that a connection has been accepted before data is received. 
That means a client could connect to your server and not send any data. With enough of these connections, your server will start to refuse connections to legitimate clients because it has no more accepts pending. This is a common method of waging a denial of service attack. To prevent malicious attacks or stale connections, the accepting thread should occasionally check the sockets outstanding in AcceptEx by calling getsockopt with the SO_CONNECT_TIME option. The option value is set to the length of time, in seconds, that the socket has been connected, or -1 if it is still unconnected. The WSAEventSelect feature serves as an excellent indicator that the sockets that are outstanding in AcceptEx need their connection times checked. Any connections that have existed for a while without receiving data from the client should be terminated by closing the socket supplied to AcceptEx. An application should not, under most noncritical circumstances, close a socket that is outstanding in AcceptEx but not yet connected. For performance reasons, the kernel-mode data structures created for and associated with such an AcceptEx request will not be cleaned up until a new connection comes in or the listening socket itself is closed. It may seem that the logical thread to post AcceptEx requests is one of the worker threads that is associated with the completion port and involved in processing other I/O completion notifications. However, recall that a worker thread should not execute a blocking or high-latency system call if such an action can be avoided. One of the side effects of the layered architecture of Winsock2 is that the overhead of a socket or WSASocket API call may be significant. Every AcceptEx call requires the creation of a new socket, so it is best to have a separate thread that posts AcceptEx and is not involved in other I/O processing. You may also choose to use this thread for performing other tasks such as event logging.
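The two housekeeping decisions the accept thread makes in the discussion above (how many AcceptEx calls to post, and which pending sockets to close) can be sketched as small policy functions. This is one possible policy under stated assumptions, not the article's implementation: ACCEPT_POLICY, AcceptsToPost, ShouldCloseStale, QueryConnectTime, and NOT_CONNECTED are hypothetical names, and refilling all the way to the high watermark is just one way to let the count vary between the two watermarks. Only the Windows-only helper at the end touches Winsock.

```c
#include <stdint.h>

#ifdef _WIN32
#include <winsock2.h>
#include <mswsock.h>   /* defines SO_CONNECT_TIME */
#endif

/* SO_CONNECT_TIME reports -1 (all bits set) for an unconnected socket. */
#define NOT_CONNECTED 0xFFFFFFFFu

/* Hypothetical tunables; real values should come from performance tests. */
typedef struct _ACCEPT_POLICY {
    int      lowWatermark;   /* replenish when outstanding accepts drop below */
    int      highWatermark;  /* never keep more than this many posted */
    uint32_t maxIdleSeconds; /* connected-but-silent time before closing */
} ACCEPT_POLICY;

/* How many new AcceptEx calls the accept thread should post right now. */
int AcceptsToPost(const ACCEPT_POLICY *p, int outstanding)
{
    if (outstanding >= p->lowWatermark) {
        return 0;                        /* still enough accepts pending */
    }
    return p->highWatermark - outstanding;
}

/* Whether a socket pending in AcceptEx should be closed, given the value
   read from SO_CONNECT_TIME. Unconnected sockets are left alone. */
int ShouldCloseStale(const ACCEPT_POLICY *p, uint32_t connectSeconds)
{
    if (connectSeconds == NOT_CONNECTED) {
        return 0;
    }
    return connectSeconds > p->maxIdleSeconds;
}

#ifdef _WIN32
/* Read the connect time of a socket that is outstanding in AcceptEx.
   A failed query is treated as "not connected" in this sketch. */
uint32_t QueryConnectTime(SOCKET s)
{
    int seconds = -1;
    int len = sizeof(seconds);

    if (getsockopt(s, SOL_SOCKET, SO_CONNECT_TIME,
                   (char *)&seconds, &len) == SOCKET_ERROR) {
        return NOT_CONNECTED;
    }
    return (uint32_t)seconds;
}
#endif
```

The accept thread would run both checks whenever the FD_ACCEPT event is signaled or on a periodic timer, closing any socket for which ShouldCloseStale returns nonzero and then posting AcceptsToPost new requests.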
One last thing to note about AcceptEx is that a Winsock2 implementation from another vendor is not required to implement these APIs. This also applies to the other APIs that are specific to Microsoft, such as TransmitFile, GetAcceptExSockAddrs, and any others that Microsoft may add in a later version of Windows. On systems running Windows NT and Windows 2000, these APIs are implemented in the Microsoft provider DLL (mswsock.dll), and can be invoked by linking with mswsock.lib, or dynamically loading the function pointers via WSAIoctl SIO_GET_EXTENSION_FUNCTION_POINTER. Calling the function without previously obtaining a function pointer (that is, by linking with mswsock.lib and calling AcceptEx directly) is costly because AcceptEx sits outside the layered architecture of Winsock2. AcceptEx must request a function pointer using WSAIoctl for every call on the off chance that the application is actually trying to invoke AcceptEx from a provider layered on top of mswsock (see Figure 3). To avoid this significant performance penalty on each call, an application that intends to use these APIs should obtain the pointers to these functions directly from the layered provider by calling WSAIoctl.
TransmitFile and TransmitPackets Winsock offers two functions for transmitting data that are optimized for file and memory transfers. The TransmitFile API is present on both Windows NT 4.0 and Windows 2000, while TransmitPackets is a new Microsoft extension function that is expected to be available in a future release of Windows. TransmitFile allows the contents of a file to be transferred on a socket. Normally, if an application were to send the contents of a file over a socket, it would have to call CreateFile to open the file and then loop on ReadFile and WSASend until the entire file was read. This is very inefficient because each ReadFile and WSASend call requires a transition from user mode to kernel-mode. TransmitFile simply requires an open handle to the file to transmit and the number of bytes to transfer. The overhead is incurred when opening the file via CreateFile, followed by a single kernel-mode transition. If your app sends the contents of files over sockets, this is the API to use. The TransmitPackets API takes the TransmitFile API a step further by allowing the caller to specify multiple file handles and memory buffers to be transmitted in a single call. The function prototype looks like this:
BOOL TransmitPackets(
    SOCKET                     hSocket,
    LPTRANSMIT_PACKETS_ELEMENT lpPacketArray,
    DWORD                      nElementCount,
    DWORD                      nSendSize,
    LPOVERLAPPED               lpOverlapped,
    DWORD                      dwFlags
);
The lpPacketArray is an array of structures. Each entry can specify either a file handle or a memory buffer to be transmitted. The structure is defined as:
typedef struct _TRANSMIT_PACKETS_ELEMENT {
    DWORD dwElFlags;
    DWORD cLength;
    union {
        struct {
            LARGE_INTEGER nFileOffset;
            HANDLE        hFile;
        };
        PVOID pBuffer;
    };
} TRANSMIT_PACKETS_ELEMENT;
The fields are self-explanatory. The dwElFlags field identifies whether the current element specifies a file handle or memory buffer via the constants TF_ELEMENT_FILE and TF_ELEMENT_MEMORY. The cLength field dictates how many bytes to send from the given data source (a zero indicates the entire file in the case of a file element). The unnamed union then contains the memory buffer or file handle (and possible offset) of the data to be sent. Another benefit of using these two APIs is that you can reuse the socket handle by specifying the TF_REUSE_SOCKET flag in addition to the TF_DISCONNECT flag. Once the API completes the data transfer, a transport-level disconnect is initiated. The socket can then be reused in an AcceptEx call. Using this optimization would lessen the overhead associated with creating sockets in the separate accept thread, as discussed earlier. The only caveat of using either of these two extension APIs is that on Windows NT Workstation or Windows 2000 Professional only two requests will be processed at a time. You must be running on Windows NT Server, Windows 2000 Server, Windows 2000 Advanced Server, or Windows 2000 Datacenter Server to get full usage of these specialized APIs.
Putting it Together In the preceding sections, we covered the APIs and methods necessary for high-performance, scalable applications, as well as the resource bottlenecks that may be encountered. What does this mean to you? Well, that depends on how your server and client are structured. The more control you have over the design of both the client and server, the better you can avoid bottlenecks. Let's look at a sample scenario. In this situation we'll design a server that handles clients that connect, send a request, receive data from the server, and then disconnect. In this situation, the server will create a listening socket and associate it with a completion port, creating a worker thread for each CPU. Another thread will post the AcceptEx calls. Since you know the client will connect and immediately send data, supplying a receive buffer can make things substantially easier. Of course, you shouldn't forget to occasionally poll the client sockets used in the AcceptEx calls, using the SO_CONNECT_TIME option to make sure there are no stale connections. An important issue in this design is to determine how many outstanding AcceptEx calls are allowed. Because a receive buffer is being posted with each accept call, a significant number of pages could be locked in memory. (Remember each overlapped operation consumes a small portion of non-paged pool and also locks any data buffers into memory.) There is no real answer or concrete formula for determining how many accept calls should be allowed. The best solution is to make this number tunable so that performance tests may be run to determine the best value for the typical environment that the server will be running in. Now that you have determined how the server will accept connections, the next step is sending data. An important factor in deciding how to send data is the number of concurrent connections you expect the server to handle. 
In general, the server should limit the number of concurrent connections, as well as the number of outstanding send calls. More established connections mean more non-paged pool usage. The number of concurrent send calls should be limited to prevent reaching the locked pages limit. Again, both of these limits should be tunable. In this situation it is not necessary to disable the per-socket receive buffers, since the only receive that occurs is in the AcceptEx call. Of course, it wouldn't hurt for you to guarantee that each connection has a receive buffer posted. Now, if the client/server interaction changes so that the client sends additional data after the initial request, disabling the receive buffer would be a bad idea unless you guarantee that an overlapped receive is posted on each connection to receive these additional requests.
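The two tunable limits described above can be kept in a small bookkeeping structure that is consulted before a connection is admitted or a send is posted. The sketch below is a hypothetical illustration: SERVER_LIMITS, SERVER_COUNTERS, and the four functions are invented names, and the counters are updated without any synchronization, so a server with several worker threads would need to protect them (for example with a lock or interlocked operations).

```c
/* Hypothetical tunable limits, typically read from configuration. */
typedef struct _SERVER_LIMITS {
    long maxConnections;      /* bounds non-paged pool usage */
    long maxOutstandingSends; /* bounds pages locked by pending sends */
} SERVER_LIMITS;

/* Current usage. Not thread-safe as written; see the note above. */
typedef struct _SERVER_COUNTERS {
    long connections;
    long outstandingSends;
} SERVER_COUNTERS;

/* Returns 1 and counts the connection if there is room, otherwise 0. */
int TryAddConnection(const SERVER_LIMITS *lim, SERVER_COUNTERS *c)
{
    if (c->connections >= lim->maxConnections) {
        return 0;
    }
    c->connections++;
    return 1;
}

/* Call when a connection is closed. */
void RemoveConnection(SERVER_COUNTERS *c)
{
    if (c->connections > 0) {
        c->connections--;
    }
}

/* Returns 1 and counts the send if another one may be posted, otherwise 0. */
int TryBeginSend(const SERVER_LIMITS *lim, SERVER_COUNTERS *c)
{
    if (c->outstandingSends >= lim->maxOutstandingSends) {
        return 0;
    }
    c->outstandingSends++;
    return 1;
}

/* Call when the completion notification for a send is processed. */
void EndSend(SERVER_COUNTERS *c)
{
    if (c->outstandingSends > 0) {
        c->outstandingSends--;
    }
}
```

When TryBeginSend returns 0, the data would be queued by the application and posted from the worker thread once EndSend has made room, which keeps the amount of locked memory under the configured bound.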
Conclusion Developing a scalable Winsock server is not terribly difficult. It's a matter of setting up a listening socket, accepting connections, and making overlapped send and receive calls. The main challenge lies in managing resources by placing limits on the number of outstanding overlapped calls so that the non-paged pool is not exhausted. Following the guidelines we covered here will allow you to create high-performance, scalable server applications.