VoIP in-depth: An introduction to the SIP protocol, Part 1

http://arstechnica.com/business/news/2010/01/voip-in-depth-an-introduction-to-the-sip-protocol-part-1.ars/2

In our last VoIP installment, we looked at the main reasons why SIP has become a widely adopted protocol, but we left details of the protocol's inner workings fairly vague. This article will drill down into the way the Session Initiation Protocol (SIP) works, and it should serve as a good starting point for really learning SIP. If you haven't already done so, you are encouraged to read the previous article, although it's not a prerequisite. This introduction also covers the latest SIP extensions and changes, so it gives a complete view of the protocol's current state, rather than just the basic, underlying RFC.

Session Initiation Protocol (SIP) is a VoIP signaling protocol. As its name suggests, it has everything to do with setting up sessions, which means it has the responsibility for starting a session after you dial a number (or double-click, in some cases). As such, SIP's role also includes maintaining user registrations with a server, defining session routing, handling various error scenarios, and, of course, modifying and tearing down sessions.

We'll present this introduction in two parts. In the first part, we'll focus on the SIP foundation layers. These layers allow creating a network of SIP servers. In the next article, we will go through the way a phone communicates with the rest of the world using this server network, based on the same foundation layers.


SIP Foundations

Message Structure

SIP shares the same message structure as HTTP and RTSP. As a result, most of the text description here would fit these protocols as well. Each message is either a request or a response. All messages are text-based and have the following form:

  • First line
  • Headers
  • Empty line
  • Optional body

The first line is different for a request and a response. A request's first line uses the following format:

<method (request type)> <URI (address/resource)> <version>

For example, a SIP request's first line can be:

REGISTER sip:arstechnica.com SIP/2.0

A response comes in the form of:

<version> <code> <reason phrase>

The version is the same in all messages. The code is a 3-digit value, and the reason phrase could be any text describing the nature of the response. A proper first line in the response could be:

SIP/2.0 200 OK

It's easy to scan the first characters of a message to detect whether it's a request or a response ("/" is not a valid character for the method field, therefore even a generic parser could differentiate a request from a response).
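As a minimal sketch of that detection rule (the function name and the returned dictionary layout are illustrative, not part of any SIP library), a parser only needs to check whether the first token contains a "/":

```python
def classify_start_line(line: str) -> dict:
    """Classify a SIP first line as a request or a response."""
    first, _, rest = line.strip().partition(" ")
    if "/" in first:
        # Responses begin with the version token, e.g. "SIP/2.0".
        code, _, reason = rest.partition(" ")
        return {"kind": "response", "version": first,
                "code": int(code), "reason": reason}
    # Requests begin with the method, which never contains "/".
    uri, _, version = rest.rpartition(" ")
    return {"kind": "request", "method": first,
            "uri": uri, "version": version}


print(classify_start_line("REGISTER sip:arstechnica.com SIP/2.0"))
print(classify_start_line("SIP/2.0 200 OK"))
```

A real parser would also validate the version string and reject malformed lines; this sketch assumes well-formed input.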

Next come the headers of the message. Some headers contain vital information and thus are mandatory, but many headers are optional. Each header has a name and a value, and some have parameters. For example:

Contact: sip:[email protected];Expires=2000

Contact is the header name, sip:[email protected] is the value, and Expires=2000 is a parameter. Other parameters may appear separated by semicolons.

Some headers may appear just once in a message, and some may appear multiple times. An equivalent approach to using multiple headers is to separate each value-parameter set with a comma. Some headers have a compact form for which the header name is shorter. For example, you can use "m" instead of "Contact".
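The name, value, and parameter split described above can be sketched roughly as follows. The address sip:alice@example.com is a made-up placeholder, and the function is a simplification: it assumes a single value per line and does not handle angle-bracketed URIs or quoted strings, which a real SIP parser must.

```python
# A few of the compact header names defined by the SIP RFC.
COMPACT_NAMES = {
    "m": "Contact", "v": "Via", "f": "From", "t": "To",
    "i": "Call-ID", "l": "Content-Length", "c": "Content-Type",
}


def parse_header(line: str):
    """Split 'Name: value;param=x' into (name, value, params)."""
    name, _, rest = line.partition(":")
    name = name.strip()
    # Expand a compact name such as "m" to its long form.
    name = COMPACT_NAMES.get(name.lower(), name)
    value, *raw_params = [part.strip() for part in rest.split(";")]
    params = {}
    for raw in raw_params:
        key, _, val = raw.partition("=")
        params[key.lower()] = val  # parameter names are case-insensitive
    return name, value, params


print(parse_header("m: sip:alice@example.com;expires=2000"))
```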

After the last header, there's an empty line followed by the body of the message. The body could be anything, even binary information. The header Content-Type specifies the type of the message body. In order to know the length of the body, you have to look at the Content-Length header. (If the transport type is UDP, then the Content-Length header is not mandatory. It is, however, mandatory for TCP). One common use of the body is to encapsulate the media negotiation protocol within the SIP message.
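To illustrate how the empty line and Content-Length work together, here is a small sketch that splits a raw message into its first line, headers, and body. The function name and sample message are hypothetical, and for brevity the header dictionary keeps only the last occurrence of a repeated header.

```python
def split_message(raw: bytes):
    """Split a raw SIP message into (first_line, headers, body)."""
    head, sep, rest = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no empty line after the headers")
    lines = head.decode("utf-8").split("\r\n")
    first_line, header_lines = lines[0], lines[1:]
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    # "l" is the compact form of Content-Length.
    length = headers.get("content-length", headers.get("l"))
    # On a stream transport the body ends after exactly `length` bytes;
    # without the header (allowed over UDP) the rest of the datagram is the body.
    body = rest if length is None else rest[:int(length)]
    return first_line, headers, body


raw = (b"SIP/2.0 200 OK\r\n"
       b"Content-Type: text/plain\r\n"
       b"Content-Length: 5\r\n"
       b"\r\n"
       b"helloEXTRA")
print(split_message(raw))
```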

That's pretty much what you need to know in order to understand the basic structure of SIP. It is rather straightforward, and anyone who is already familiar with protocols that have similar structure will feel right at home. Of course, we still have to understand what this text means and what a SIP device should do with the messages it sends and receives.

Transport Types

By default, SIP messages are sent on port 5060 if they're unencrypted. Encrypted messages are sent over port 5061. One can specify a port other than the default within the SIP address and override this default value.

SIP mandates support for both UDP and TCP, but it can successfully operate on practically any transport type. It defines different behavior per transport only when the characteristics of the specific transport require it to do so. For example, UDP does not guarantee delivery, so SIP retransmits the messages. In TCP, such retransmission is really unnecessary (and confusing), so no one should retransmit a message. For the most part, other than the transaction layer (detailed in the next sub-section), almost no other component changes its behavior due to the transport.

In fact, because SIP operates hop-by-hop (clients do not usually communicate directly, but rather use proxies along the signaling path to send and receive messages), each hop could change the transport type. So a client may receive a message over TCP even if the original message was sent over UDP. SIP's transport type independence enables the possibility of defining new transport types that were not originally included when SIP itself was first defined. For example, RFC 4168 is a very short RFC that defines SIP over SCTP.

For connection-based transports (e.g., TCP and SCTP), the state of the connections is maintained. Connections are kept open and reused to save time and resources. The recommendation is to keep a connection open for at least 32 seconds after the last message, but in practice it's application-defined. Because there is no defined limit for the number of different SIP messages that one can send on a connection, two devices such as proxies usually have one or very few connections between them.

NAT vs. VoIP

One of the main challenges that VoIP protocols have encountered, and SIP is no exception, is the existence of NAT devices. NATs usually aggregate several IP addresses (in many cases within a private network) to a few external IP addresses, mapping different traffic from different IP addresses to different ports. (This is a very simplistic description of NATs; there are in fact several ways NATs can work, but this is a common one that's easy to describe). In order to map different addresses to IP and ports, NATs usually maintain a dynamic mapping table. If traffic from 172.16.1.1 with port 5060 maps to a public IP address with port 10000, the NAT will keep this mapping as long as it has a flow of packets from that address. If the flow stops, the NAT will remove this mapping after a configurable amount of time to allow another internal IP and port to use the external IP and port combination.

VoIP's problem arises because signaling protocols are as minimalistic as possible in terms of traffic. In order to allow the rest of the world to locate a device, the device first registers by sending a REGISTER request. A response from the registrar accepting this registration will usually tell the device that its registration is valid for a long time, most commonly an hour.

Now, suppose a NAT is located between the client and the server. When the REGISTER request is sent, the NAT maps the device's internal IP and port to an external one. When the response is sent back, the NAT has the proper mapping to the original IP and port. If the client does not send any packets to the server (for example, does not make a call), the NAT may remove the mapping it created during the registration. The outcome of this scenario is that incoming calls cannot reach the client. The proxy server receiving the request to the registered client cannot reach the internal IP address without the NAT mapping.
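The failure described above can be reproduced with a toy model of a NAT mapping table. Everything here is an assumption made for illustration: the class name, the 60-second idle timeout, the public address 203.0.113.5, and the rule that only outbound packets refresh a mapping. Real NATs differ in all of these details.

```python
class NatTable:
    """Toy NAT: maps (internal ip, port) to a public port with an idle timeout."""

    def __init__(self, public_ip, idle_timeout, first_port=10000):
        self.public_ip = public_ip
        self.idle_timeout = idle_timeout
        self.next_port = first_port
        self.mappings = {}  # (ip, port) -> [public_port, last_seen]

    def expire(self, now):
        # Drop mappings that have been idle longer than the timeout.
        stale = [key for key, (_, seen) in self.mappings.items()
                 if now - seen > self.idle_timeout]
        for key in stale:
            del self.mappings[key]

    def outbound(self, internal_ip, internal_port, now):
        # A packet leaving the private network creates or refreshes a mapping.
        self.expire(now)
        key = (internal_ip, internal_port)
        if key not in self.mappings:
            self.mappings[key] = [self.next_port, now]
            self.next_port += 1
        self.mappings[key][1] = now
        return self.public_ip, self.mappings[key][0]

    def inbound(self, public_port, now):
        # A packet from outside is delivered only if a mapping still exists.
        self.expire(now)
        for key, (port, _) in self.mappings.items():
            if port == public_port:
                return key
        return None


nat = NatTable("203.0.113.5", idle_timeout=60)
print(nat.outbound("172.16.1.1", 5060, now=0))  # REGISTER leaves the network
print(nat.inbound(10000, now=30))               # response arrives in time
print(nat.inbound(10000, now=3600))             # call an hour later is dropped
```

The last call returns None: the registration is still valid on the server, but the NAT no longer knows where to deliver the incoming request.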

It's because of this NAT issue that RFC 5626 was introduced. This RFC defines the techniques that a client can use to maintain the NAT mapping. It introduces the concept of a flow that should be maintained by the registering client. The client sends two empty lines (carriage return and line feed, or CRLF) on connection-oriented flow (TCP and SCTP) and expects to receive from the server a single empty line as a response. For connection-less transports, the client maintains the flow by using STUN (defined by RFC 5389).
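For connection-oriented flows, the keep-alive exchange is simple enough to sketch in a few lines. The function names below are illustrative, and the sketch leaves out the socket handling and the timers that decide when to send the next ping.

```python
PING = b"\r\n\r\n"  # client sends two empty lines (double CRLF)
PONG = b"\r\n"      # server answers with a single empty line


def handle_idle_bytes(data: bytes) -> bytes:
    """Server side: answer a keep-alive ping received between SIP messages."""
    return PONG if data == PING else b""


def flow_alive(reply: bytes) -> bool:
    """Client side: the flow is considered alive if the pong came back."""
    return reply == PONG


reply = handle_idle_bytes(PING)
print(flow_alive(reply))
```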

Transaction layer

The SIP RFC divides the architecture into layers. We actually went through two of the layers in the discussion above: the first was the syntax and encoding layer that defines the message structure, and the second was the transport layer. Now it's time to inspect the contents of the SIP message by taking a look at the transaction layer.

[Figure: The SIP layers]

Every SIP message is associated with a single transaction. Similar to HTTP, messages are either requests or responses, but unlike HTTP, matching responses to requests is not simple. HTTP uses TCP as its transport, so you can match a response based on the order of the requests. But a SIP transaction can have more than a single response, and, in some cases, more than one request. When a SIP device sends a request, it acts as a user agent client (UAC). The recipient of the request, the one that sends the response, acts as a user agent server (UAS). The layer above the transaction layer is named "transaction user" or TU. Let's look at a SIP request that a UAC can initiate:

REGISTER sip:arstechnica.com SIP/2.0
Via: SIP/2.0/UDP home.mynetwork.org;branch=z9hG4bKmq0Tgb
To: sip:[email protected]
From: sip:[email protected];tag=m25caI4
Call-ID: [email protected]
CSeq: 153 REGISTER
Contact: sip:[email protected]
Max-Forwards: 70

We have already seen that REGISTER is the method (type of request), sip:arstechnica.com is the request-URI, and SIP/2.0 is the version. All the headers above are mandatory. At this point, we'll cover the headers that are important to the transaction layer, and we'll cover the rest of them when we get to the way proxies and registrars work. First, let's examine the Via header.

The Via header has a parameter called "branch" with an odd value. The first 7 letters (z9hG4bK) are fixed, and they help identify this as a SIP transaction based on RFC 3261. These letters, often referred to as the "magic cookie", would not appear in a request that uses the previous SIP RFC, which has different transaction matching rules. We'll only look at the cases that have the magic cookie because it's very rare today to encounter an implementation that has not caught up with the latest spec.

After the seven letters, the rest is just a random string. Every time you see a different branch value it's a different transaction; conversely, if both messages have the same branch value, then they should be the same transaction. One exception to this rule is if the method of the CSeq header is different. This is because the CANCEL method uses the same branch value to identify which transaction you should cancel. So, to fully match two messages to the same transaction, both the branch and CSeq method have to match. Naturally, this means that responses to a request will have matching values.

Before moving on, one final note on the Via header. When we refer to this header, we actually refer to the first, or topmost, Via header. Via is one of those headers that can appear multiple times within a message. The reason for this will become clear in the proxy section, but it's important to note that you always match the transaction based on the first Via header and ignore the rest.
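The matching rule described above (branch of the topmost Via plus the CSeq method) might be sketched as follows. The function names are illustrative, and the sketch deliberately follows the simplified rule from the text; a full implementation has further checks that are omitted here.

```python
MAGIC_COOKIE = "z9hG4bK"


def transaction_key(top_via: str, cseq: str):
    """Build a (branch, method) key from the topmost Via and the CSeq header values."""
    params = {}
    for part in top_via.split(";")[1:]:
        key, _, val = part.strip().partition("=")
        params[key.lower()] = val
    branch = params.get("branch", "")
    if not branch.startswith(MAGIC_COOKIE):
        raise ValueError("branch lacks the magic cookie; older matching rules apply")
    method = cseq.split()[1]  # CSeq looks like "153 REGISTER"
    return branch, method


request = transaction_key(
    "SIP/2.0/UDP home.mynetwork.org;branch=z9hG4bKmq0Tgb", "153 REGISTER")
response = transaction_key(
    "SIP/2.0/UDP home.mynetwork.org;branch=z9hG4bKmq0Tgb;received=172.16.75.2",
    "153 REGISTER")
cancel = transaction_key(
    "SIP/2.0/UDP home.mynetwork.org;branch=z9hG4bKmq0Tgb", "153 CANCEL")

print(request == response)  # same transaction
print(request == cancel)    # same branch, different method: different transaction
```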

The UAS sends a response to an incoming request. SIP responses are divided into 6 different classes, and the first digit of the 3-digit response code identifies each class. A 1xx response means any response in the range of 100 to 199. The response types are:

  • 1xx - Provisional response, which indicates that the request is handled, but without a final response yet. For example, 180 Ringing is a common provisional response.
  • 2xx - Successful response. The most common one is 200 OK.
  • 3xx - Redirect response. A client receiving this response would know the user moved to a different location. For example, a phone may redirect all its calls to a different address by responding back with a 302 Moved Temporarily.
  • 4xx - Client error, which means that the request cannot be fulfilled and the sender should modify its request. For example, you can send 401 Unauthorized if the request does not contain the correct user credentials.
  • 5xx - Server error, which usually indicates that the error is not related to the request, but to the state of the server or the server capabilities. For example, you would send 501 Not Implemented when receiving an unknown method.
  • 6xx - Global error, which indicates the request cannot be fulfilled by any server. It would be rare to receive such responses, as it requires having global knowledge of the network.
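Because the class is carried by the first digit, classifying a response code takes very little code. The names below are illustrative:

```python
RESPONSE_CLASSES = {
    1: "Provisional",
    2: "Successful",
    3: "Redirect",
    4: "Client error",
    5: "Server error",
    6: "Global error",
}


def response_class(code: int) -> str:
    """Return the class name for a SIP response code."""
    if not 100 <= code <= 699:
        raise ValueError(f"not a valid SIP response code: {code}")
    return RESPONSE_CLASSES[code // 100]


def is_final(code: int) -> bool:
    """Anything other than a 1xx provisional response is a final response."""
    return code >= 200


for code in (180, 200, 302, 401, 501):
    print(code, response_class(code), is_final(code))
```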

SIP dedicates special attention to making sure the response is sent back to the same source IP that sent the request. This is, in fact, one of the roles of the transport layer, not the transaction layer. The transport layer does this by adding a "received" parameter to the top Via header of the request. Later, RFC 3581 defined a new parameter called "rport" to ensure that the response is sent back to the same originating port. Both of these additions were aimed at making SIP work over NAT. SIP's default behavior is to send the response back on the same connection of the request, but in case it fails to do so, it will attempt to open a new connection. Therefore, none of the layers can assume a single transaction uses a single connection. A possible SIP response to the request above is:

SIP/2.0 200 OK
Via: SIP/2.0/UDP home.mynetwork.org;branch=z9hG4bKmq0Tgb;received=172.16.75.2
To: sip:[email protected];tag=q8K2f1zv
From: sip:[email protected];tag=m25caI4
Call-ID: [email protected]
CSeq: 153 REGISTER
Contact: sip:[email protected];Expires=3600
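The "received" and "rport" handling described before the response can be sketched like this. The function name is hypothetical, and the logic is a simplified reading of the rules: add "received" when the sender's address differs from what the Via header claims, and fill in an empty "rport" with the source port. IPv6 literals are not handled.

```python
def stamp_via(top_via: str, source_ip: str, source_port: int) -> str:
    """Record where a request really came from in its topmost Via value."""
    head, *params = [part.strip() for part in top_via.split(";")]
    # head looks like "SIP/2.0/UDP host" or "SIP/2.0/UDP host:port".
    sent_by_host = head.split()[-1].rsplit(":", 1)[0]
    stamped = []
    has_rport = False
    for param in params:
        if param.lower() == "rport":
            # The client asked for the response to go to the source port.
            stamped.append(f"rport={source_port}")
            has_rport = True
        else:
            stamped.append(param)
    if has_rport or sent_by_host != source_ip:
        stamped.append(f"received={source_ip}")
    return ";".join([head] + stamped)


print(stamp_via("SIP/2.0/UDP home.mynetwork.org;branch=z9hG4bKmq0Tgb",
                "172.16.75.2", 5060))
print(stamp_via("SIP/2.0/UDP home.mynetwork.org;rport;branch=z9hG4bKmq0Tgb",
                "172.16.75.2", 10000))
```

The first call reproduces the Via header shown in the response above.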

The response above is a successful one, but a UAS may choose to send an error response, such as the well-known "404 Not Found," when the user is not known. Both the UAC and UAS maintain a state machine for each transaction, and each state machine has timers. Timers are necessary in case the other side does not respond in time, and they're also required in case the layer above the transaction layer does not send a proper event and leaves the transaction open.

Ultimately, SIP has built each of its layers to be as decoupled as possible from the other layers, and an error in any one layer has minimal impact on the rest. This separation makes it easy for programmers to separate their software into smaller components.

The protocol distinguishes between 4 types of transactions, so it has 4 different types of state machines: client INVITE, client non-INVITE, server INVITE and server non-INVITE. We haven't mentioned the INVITE method yet, and for a good reason. INVITE is a method used to generate a call, and these lower layers do not maintain the call state. However, this transaction is different because calls have a 3-way handshake that affects the state-machine. Let's start with a diagram of the client non-INVITE transaction state-machine:

[Figure: The Non-INVITE client transaction]

Most of the timers are for retransmissions in UDP, and they are disabled in TCP. An additional timeout timer exists in case no response is received. Transactions normally exist for 32 seconds until they time out. The equivalent server state-machine is quite similar; it receives a request, sends it to the TU, sends the response back, and handles retransmissions if required. It should be noted that some of the non-INVITE transaction definitions were updated by RFC 4320.
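To make the timers concrete, the sketch below computes when a non-INVITE request would be retransmitted over UDP if no response ever arrives. It assumes the default values from RFC 3261, which the text above does not spell out: T1 = 500 ms, T2 = 4 s, a retransmission interval that doubles from T1 up to T2, and an overall timeout of 64 × T1 (the 32 seconds mentioned above). The function name is illustrative.

```python
T1 = 0.5  # seconds; default round-trip time estimate (RFC 3261)
T2 = 4.0  # seconds; cap on the retransmission interval (RFC 3261)


def non_invite_retransmit_times(t1: float = T1, t2: float = T2):
    """Return (retransmission times, timeout) in seconds after the first send."""
    timeout = 64 * t1
    times = []
    elapsed = 0.0
    interval = t1
    while elapsed + interval < timeout:
        elapsed += interval
        times.append(elapsed)
        interval = min(2 * interval, t2)  # double, but never beyond T2
    return times, timeout


times, timeout = non_invite_retransmit_times()
print(times)
print(timeout)
```

Over TCP, none of these retransmissions happen; only the overall timeout remains.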

Let's cover the 3-way handshake. The UAC sending the INVITE waits for a response, but this time to complete the handshake it sends an ACK request back to the server. ACK has no response, as it's the 3rd message in the handshake. This fact forces ACK to be an exception to many of the rules.

When a client receives a successful (2xx) response, a call has been created, and the client sends the ACK in a new transaction. After a failure response (300-699), the ACK is sent as part of the same transaction as the INVITE. The reason for this lies in the behavior of the upper layers. We will see that proxies are not aware of a call state, and those that are stateful maintain just the transaction state. There are scenarios in which a proxy would need to ACK a failed response, but it cannot ACK a successful response because that would require understanding call-related information. The INVITE client state machine is as follows:

[Figure: The INVITE client transaction]
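The rule for which transaction the ACK belongs to can be expressed as a choice of branch value. This is a sketch under the assumptions stated in the text; the helper names are hypothetical, and the random branch suffix is generated with Python's secrets module purely for illustration.

```python
import secrets

MAGIC_COOKIE = "z9hG4bK"


def new_branch() -> str:
    """Generate a fresh branch value, which starts a new transaction."""
    return MAGIC_COOKIE + secrets.token_hex(8)


def ack_branch(invite_branch: str, final_code: int) -> str:
    """Pick the branch for the ACK that follows a final response to an INVITE."""
    if 200 <= final_code <= 299:
        # Success: the call exists, and the ACK is sent in a new transaction.
        return new_branch()
    if 300 <= final_code <= 699:
        # Failure: the ACK stays on the INVITE's own transaction.
        return invite_branch
    raise ValueError("ACK is only sent after a final response (200-699)")


invite_branch = new_branch()
print(ack_branch(invite_branch, 486) == invite_branch)  # 486 Busy Here
print(ack_branch(invite_branch, 200) == invite_branch)
```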
