《TCP/IP 卷一》笔记、ping和traceroute 的实现思路

一、TCP协议相关笔记

Normally  TCP does not send an ACK the instant it receives data. Instead, it  delays the ACK hoping to have data going in the same direction as the ACK, so the ACK can be sent  along with the data. (This is sometimes called having the ACK  piggyback  with the data.)  Most implementations use a 200-ms delay-that is, TCP will delay an ACK up to 200 ms  to see if there is data to send with the ACK.

The  Nagle algorithm   says that a TCP connection can have only one outstanding small segment  that has not yet been acknowledged. No additional small segments can be sent until the  acknowledgment is received. Instead, small amounts of data are collected by TCP and  sent in a single segment when the acknowledgment arrives. The beauty of this algorithm  is that it is self-clocking: the faster the ACKs come back, the faster the data is sent. But  on a slow WAN, where it is desired to reduce the number of tinygrams, fewer segments  are sent.

This shows that TCP can acknowledge  received data before the application has read and processed that data. The TCP
acknowledgment just means TCP has correctly received the data. We also have an  indication that the server process has not read these 3 bytes of data because the advertised  window in the final segment is 8189, not 8192.

There are times when the Nagle algorithm needs to be turned off. The classic example is  the X Window System server  : small messages (mouse movements) must  be delivered without delay to provide real-time feedback for interactive users doing  certain operations.
The sockets API uses the  TCP_NODELAY  socket option to disable the Nagle algorithm.

Here we'll show another example that's easier to demonstrate-typing one of the terminal's  special function keys during an interactive login. The function keys normally generate  multiple bytes of data, often beginning with the ASCII escape character. If TCP gets the  data 1 byte at a time, it's possible for it to send the first byte (the ASCII ESC) and then  hold the remaining bytes of the sequence waiting for the ACK of this byte. But when the  server receives this first byte it doesn't generate an echo until the remaining bytes are  received. This often triggers the  delayed ACK algorithm  on the server, meaning that the  remaining bytes aren't sent for up to 200 ms. This can lead to noticeable delays to the  interactive user.

**************************************************************************************
With TCP's sliding-window protocol the receiver does not have to acknowledge  every received packet. With TCP, the ACKs are cumulative-they acknowledge that the  receiver has correctly received all bytes up through the acknowledged sequence number  minus one.

The ordering of  the packets that we see on the wire depends on many factors, most of which we have no  control over: the sending TCP implementation, the receiving TCP implementation, the  reading of data by the receiving process (which depends on the process scheduling by the  operating system), and the dynamics of the network (i.e., Ethernet collisions and backoffs).  There is no single correct way for two TCPs to exchange a given amount of data.

The sockets API allows a process to set the sizes of the send buffer and the receive buffer. The size of the  receive buffer is the maximum size of the  advertised window (也就是Tcp头部的滑动窗口大小) for that connection. Some applications change  the socket buffer sizes to increase performance.

*****************************************************************************************************
In all the examples we've seen so far in this chapter, the sender starts off by injecting  multiple segments into the network, up to the window size advertised by the receiver. While  this is OK when the two hosts are on the same LAN, if there are routers and slower links  between the sender and the receiver, problems can arise. Some intermediate router must  queue the packets, and it's possible for that router to run out of space.

TCP is now required to support an algorithm called slow start. It operates by observing that  the rate at which new packets should be injected into the network is the rate at which the  acknowledgments are returned by the other end.

Slow start adds another window to the sender's TCP: the congestion window, called cwnd.  When a new connection is established with a host on another network, the congestion  window is initialized to one segment (i.e., the segment size announced by the other end). Each time an ACK is received, the congestion window is increased by one segment,  (cwnd  is maintained in bytes, but slow start always increments it by the segment size.) The sender  can transmit up to the  minimum  of the congestion window and the advertised window. The  congestion window is flow control imposed by the sender, while the advertised window is flow control imposed by the receiver .

The sender starts by transmitting one segment and waiting for its ACK. When that ACK is  received, the congestion window is incremented from one to two, and two segments can be  sent. When each of those two segments is acknowledged, the congestion window is  increased to four. This provides an exponential increase.  At some point the capacity of the internet can be reached, and an intermediate router will  start discarding packets. This tells the sender that its congestion window   we'll see how this is handled, and what happens to the congestion window. For now, let's  watch slow start in action.
********************************************************************
Congestion  can occur when data arrives on a big pipe (a fast LAN) and gets sent out a  smaller pipe (a slower WAN). Congestion can also occur when multiple input streams arrive  at a router whose output capacity is less than the sum of the inputs.
*********************************************************************
TCP itself says little more about  urgent data . There is no way to specify where the urgent  data starts in the data stream. The only information sent across the connection by TCP is that  urgent mode has begun (the URG bit in the TCP header) and the pointer to the last byte of  urgent data. Everything else is left to the application.

Unfortunately many implementations incorrectly call TCP's urgent mode out-of-band data.  If an application really wants a separate out-of-band channel, a second TCP connection is  the easiest way to accomplish this. (Some transport layers do provide what most people  consider true out-of-band data: a logically separate data path using the same connection as  the normal data path. This is not what TCP provides.)

The confusion between TCP's urgent mode and out-of-band data is also because the predominant  programming interface, the sockets API, maps TCP's urgent mode into what sockets calls out-of-band data.
******************************************************************************
TCP manages four different timers  for each connection.
1. A retransmission timer is used when expecting an acknowledgment from the other  end
2. A persist timer keeps window size information flowing even if the other end closes its  receive window.
3. A keepalive timer detects when the other end on an otherwise idle connection crashes  or reboots. 
4. A 2MSL timer measures the time a connection has been in the TIME_WAIT state. 
*****************************************************************************************
TCP calculates the round-trip time and then uses these measurements to keep track of a  smoothed  RTT estimator and a smoothed mean deviation estimator. These two estimators are  then used to calculate the next retransmission timeout value. Many  implementations only  measure a single RTT per window. Karn's algorithm removes the retransmission ambiguity  problem by preventing us from measuring the RTT when a packet is lost.
*****************************************************************************************
Indeed, Berkeley-derived  implementations count the number of duplicate ACKs received, and when the third one is  received, assume that a segment has been lost and retransmit only one segment, starting with  that sequence number. This is Jacobson's  fast retransmit  algorithm, which is followed by his  fast recovery  algorithm.
*****************************************************************************************
Slow start,   is the way to initiate data flow across a  connection. But at some point we'll reach the limit of an intervening router, and packets can  be dropped. Congestion avoidance is a way to deal with lost packets. 

The assumption of the algorithm is that packet loss caused by damage is very small (much  less than 1%), therefore the loss of a packet signals congestion somewhere in the network  between the source and destination. There are two indications of packet loss: a timeout  occurring and the receipt of duplicate ACKs. (  If we are  using a timeout as an indication of congestion, we can see the need for a good RTT  algorithm( retransmission timeout (RTO)) .)

Congestion avoidance and slow start are independent algorithms with different objectives.  But when congestion occurs we want to slow down the transmission rate of packets into the  network, and then invoke slow start to get things going again. In practice they are  implemented together.

Congestion avoidance and slow start require that two variables be maintained for each  connection:  a congestion window, cwnd ,  and  a slow start threshold size, ssthresh .  The  combined algorithm operates as follows:

1. Initialization for a given connection sets cwnd to one segment and ssthresh to 65535  bytes.
2. The TCP output routine never sends more than the minimum of cwnd and the  receiver's advertised window.

Congestion avoidance is flow control imposed by the sender, while the advertised  window is flow control imposed by the receiver. The former is based on the sender's  assessment of perceived network congestion; the latter is related to the amount of  available buffer space at the receiver for this connection.

3. When congestion occurs (indicated by a timeout or the reception of duplicate ACKs) one-half of the current window size (the minimum of  cwnd  and the receiver's  advertised window, but at least two segments) is saved in  ssthresh.  Additionally, if the  congestion is indicated by a timeout,  cwnd  is set to one segment (i.e., slow start).

4. When new data is acknowledged by the other end, we increase cwnd, but the way it  increases depends on whether we're performing slow start or congestion avoidance.  If  cwnd  is less than or equal to  ssthresh,  we're doing slow start; otherwise we're doing  congestion avoidance. Slow start continues until we're halfway to where we were  when congestion occurred (since we recorded half of the window size that got us into  trouble in step 2), and then congestion avoidance takes over.

Slow start has cwnd start at one segment, and be incremented by one segment every  time an ACK is received. This opens the window  exponentially: send one segment, then two, then four, and so on.

Congestion avoidance dictates that cwnd be incremented by  1/ cwnd each time an  ACK is received. This is an additive increase, compared to slow start's exponential  increase. We want to increase  cwnd  by at most one segment each round-trip time  (regardless how many ACKs are received in that RTT), whereas slow start will  increment  cwnd  by the number of ACKs received in a round-trip time.

Before describing the change, realize that TCP is required to generate an immediate  acknowledgment (a duplicate ACK) when an out-of-order segment is received. This  duplicate ACK should not be delayed. The purpose of this duplicate ACK is to let the other  end know that a segment was received out of order, and to tell it what sequence number is  expected.
*****************************************************************************************
Since we don't know whether a duplicate ACK is caused by a lost segment or just a  reordering of segments, we wait for a small number of duplicate ACKs to be received. It is  assumed that if there is just a reordering of the segments, there will be only one or two  duplicate ACKs before the reordered segment is processed, which will then generate a new  ACK. If three or more duplicate ACKs are received in a row, it is a strong indication that a  segment has been lost.   We then perform a retransmission of  what appears to be the missing segment, without waiting for a retransmission timer to expire.  This is the  fast retransmit  algorithm. Next, congestion avoidance, but not slow start is  performed. This is the  fast recovery  algorithm.

1. When the third duplicate ACK is received, set ssthresh to one-half the current  congestion window,  cwnd.  Retransmit the missing segment. Set  cwnd  to  ssthresh  plus 3 times the segment size.

2. Each time another duplicate ACK arrives, increment cwnd by the segment size and  transmit a packet (if allowed by the new value of  cwnd).

3. When the next ACK arrives that acknowledges new data, set cwnd to ssthresh (the  value set in step 1). This should be the ACK of the retransmission from step 1, one  round-trip time after the retransmission. Additionally, this ACK should acknowledge  all the intermediate segments sent between the lost packet and the receipt of the first  duplicate ACK. This step is congestion avoidance, since we're slowing down to onehalf  the rate we were at when the packet was lost.
*****************************************************************************************
If an acknowledgment is lost, we could end up with both sides waiting for the other: the  receiver waiting to receive data (since it provided the sender with a nonzero window) and  the sender waiting to receive the window update allowing it to send. To prevent this form  of deadlock from occurring the sender uses a  persist timer  that causes it to query the  receiver periodically, to find out if the window has been increased. These segments from  the sender are called window probes.

TCP's persist timer is set by one end of a connection when it has data to send, but has  been stopped because the other end has advertised a zero-sized window. The sender  keeps probing the closed window using a retransmission interval.  This probing of the closed window continues indefinitely.
*****************************************************************************************
There are times, however, when a server wants to know if the client's host has either  crashed and is down, or crashed and rebooted.  The  keepalive timer a feature of many  implementations, provides this capability.

If there is no activity on a given connection for 2 hours, the server sends a probe segment to the client.
The client host must be in one of four states.

1. The client host is still up and running and reachable from the server. The client's  TCP responds normally and the server knows that the other end is still up. The  server's TCP will reset the keepalive timer for 2 hours in the future. If there is  application traffic across the connection before the next 2-hour timer expires, the  timer is reset for 2 hours in the future, following the exchange of data.

2. The client's host has crashed and is either down or in the process of rebooting. In  either case, its TCP is not responding. The server will not receive a response to its  probe and it times out after 75 seconds. The server sends a total of 10 of these  probes, 75 seconds apart, and if it doesn't receive a response, the server considers  the client's host as down and terminates the connection.

3. The client's host has crashed and rebooted. Here the server will receive a response  to its keepalive probe, but the response will be a reset, causing the server to  terminate the connection.

4. The client's host is up and running, but unreachable from the server. This is the  same as scenario 2, because TCP can't distinguish between the two. All it can tell  is that no replies are received to its probes.
*****************************************************************************************************
二、ping和traceroute的实现思路

(一)、ping 的实现

1. 相关ICMP协议概述
  这里只讲解与ping有关的ICMP消息类型,主机发送回送消息(Type =8),被请求主机回送响应消息(Type= 0),基本格式如下:
  回送消息[ECHO]


回送响应消息[ECHOREPLY]


   其中•Code = 0
•CheckSum为校验和,重点注意从ICMP的头部(Type开始),到data结束(即到整个数据包结束),具体计算见代码。
•Identifier为标识符,由主机设定,一般设置为进程号,回送响应消息与回送消息中identifier保持一致
•Sequence Number为序列号,由主机设定,一般设为由0递增的序列,回送响应消息与回送消息中Sequence Number保持一致
•data为数据,由主机设定,回送响应消息与回送消息中data保持一致


2. Ping流程
Ping实际上利用的就是ICMP ECHOICMP ECHOREPLY包来探测主机是否存在,所以Ping程序的流程十分简单:发送ICMP ECHO---->接收ICMP ECHOREPLY
  发送ICMP ECHO包时填充Identifier为进程IDSequence Number为从0递增计数,data填充为发送时间
  接收ICMP ECHOREPLY包时检查Identifier,Sequence Number是否正确,通过IP报头的源地址字段获得回送报文的主机地址是否正确

 

3. 模拟Ping实现pingy
Ping的基本流程已讲解完

  由于要自己构造ICMP包,因此创建需要创建原始套接字(即需要自己填充报头)
sockfd= socket(AF_INET, SOCK_RAW, IPPROTO_ICMP)
SOCK_RAW用于直接访问网络层,应用程序负责构造自己的协议首部;IPPROTO_ICMP表示ICMP报头由程序构造
  构造ICMP报头,注意各个字段的填充,特别是校验和(可以参照icmp的结构定义)
icmp->icmp_type
icmp->icmp_code
icmp->icmp_cksum
icmp->icmp_id
icmp->icmp_seq
icmp->icmp_data;

  而完成系统的Ping命令还需添加信息统计,如发送字节数,收到字节数,发送包,接收包,发送时间,TTL等;另外,添加信号处理,在用户使用Control^C等退出时打印即时信息

程序实现主要参考  http://blog.csdn.net/qy532846454/article/details/5429700,修改了部分代码,现在可以ping 域名。
下载地址:http://download.csdn.net/detail/simba888888/6432977

(二)、traceroute的实现

1. 相关ICMP协议概述
   
这里只讲解与tracert有关的ICMP消息类型,网关发送超时报文(type = 11),主机发送目标不可达报文(type = 3),基本格式如下:
   
超时报文

         

        其中code = 0,表示由网关发送

这里提一下,当code = 1时表示是IP层组装分片超时。

The one we've been describing is generated when the TTL reaches 0, and is specified by a code of 0.

It's also possible for a host to send an ICMP "time exceeded during reassembly" when it timesout during the reassembly of a fragmented datagram. This error is specified by a code of 1.


   
目标主机不可达报文

         

        其中code = 3,表示在目的主机,端口不可用

 

2. Traceroute流程
    
首先明确TTLIP报头中的字段,TTL表示了数据包的time to live,即还能经由多少跳,所以TTL = 1表示数据包将在下个路由或主机被丢弃,并发送超时报文;
    
其次为了明确已到达主机,发送时目的端口设为非法端口(58127),这样主机收到报文后会发送目标不可达报文。
下面是tracert的流程:
     1>
构造UDP数据包,设置TTL = 1
     2>
发送UDP数据包,记录发送时间t1
     3>
接收ICMP差错包,如果是超时报文,则是经过的中间路由,记录路由信息,记录接收时间t2,计算时间(t2 - t1);如果是目标不可达报文,则抵达目的主机,记录接收时间t2,打印信息,退出
     4>
构造UDP数据包,设置TTL += 1,返回第二步
其中,TTL的每个数值(TTL = 1)发送3UDP包,即重复2~33次;
   接收超时,打印"*"表示报文丢失


程序实现主要参考http://blog.csdn.net/qy532846454/article/details/5443718, 但测试发现有点问题,故自己修改了部分代码,对返回的ICMP报文增加了一些类型判断,比如type为3的不可达报文实际上有16种情况,即使不是目的端口不可达code=3,其他不可达情况也得退出循环打印提示,否则就卡在那边了。如果不是ICMP类型报文则继续接收直到recvfrom超时,打印'*'。
下载地址:http://download.csdn.net/detail/simba888888/6432977

参考:
《TCP/IP Illustrated Volume 1: The protocols》
http://blog.csdn.net/qy532846454


你可能感兴趣的:(读书笔记,linux,环境网络编程)