Communication between application processes (end-to-end communication):
When two hosts communicate, it is really the application processes on the two hosts that communicate with each other;
Transport-layer ports
A process running on a computer is identified by a process identifier (PID);
Application processes at the application layer, however, should not be identified by the process identifiers assigned by the operating system:
For application processes on computers running different operating systems to communicate, a uniform way of identifying application processes in the TCP/IP architecture is needed;
Problems to solve:
Because processes are created and destroyed dynamically, a sender can hardly identify a process on another machine
Sometimes the process that receives messages is replaced, without needing to notify every sender
We often want to identify the endpoint by the function the destination host provides, without knowing which process implements that function
The solution is to use protocol port numbers, usually just called ports, at the transport layer;
Although the real endpoints of communication are application processes, we can think of the port as the endpoint: as long as the message is delivered to an appropriate destination port on the destination host, the remaining work (final delivery to the destination process) is done by TCP;
Software ports and hardware ports
Three classes of ports
TCP connections
TCP takes the connection as its most fundamental abstraction;
Every TCP connection has two endpoints;
The endpoint of a TCP connection is not the host, not the host's IP address, not the application process, and not the transport-layer protocol port:
A port number concatenated with an IP address forms a socket;
socket = (IP Address: Port Number)
Each TCP connection is uniquely determined by the two endpoints (i.e., the two sockets) at its two ends; a small sketch follows below.
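A minimal Python sketch of the "socket = (IP address, port number)" idea; the loopback address and port used here are illustrative only:

```python
# A socket's transport-layer identity is the (IP address, port number) pair it is bound to.
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("127.0.0.1", 5050))      # illustrative local IP address and port
print(s.getsockname())           # -> ('127.0.0.1', 5050)
s.close()
```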
Multiplexing:
Demultiplexing:
Transport-layer multiplexing and demultiplexing
Connectionless Multiplexing and Demultiplexing
When a UDP socket is created in this manner, the transport layer automatically assigns a port number to the socket.
When a host receives a UDP segment:
UDP socket identified by two-tuple: (dest IP address, dest port number)
Connectionless demux: what is the purpose of the source port number? (See the sketch below.)
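The following minimal sketch (with made-up loopback addresses and port numbers) illustrates connectionless demultiplexing: datagrams from two different source ports, aimed at the same destination (IP, port), are all delivered to the one UDP socket bound to that pair, and the reported source port serves only as a "return address":

```python
import socket

# The UDP socket is identified only by the destination (IP address, port number).
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 9999))

client_a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client_b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client_a.sendto(b"from A", ("127.0.0.1", 9999))
client_b.sendto(b"from B", ("127.0.0.1", 9999))

for _ in range(2):
    data, (src_ip, src_port) = server.recvfrom(1024)
    # Both datagrams arrive on the same socket; the source port is only a return address.
    print(data, "arrived from source port", src_port)

for s in (server, client_a, client_b):
    s.close()
```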
Connection-Oriented Multiplexing and Demultiplexing
TCP socket identified by four-tuple: (source IP address, source port number, dest IP address, dest port number):
Server host may support many simultaneous TCP sockets:
Web servers have different sockets for each connecting client
Non-persistent HTTP will have a different socket for each request
Connection-oriented demux: multi-process Web server
Connection-oriented demux: multi-threaded Web server (see the sketch below)
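A minimal sketch of connection-oriented demultiplexing in the style of the multi-threaded Web server mentioned above (the port number is illustrative): accept() hands back a new socket per client, and each such socket corresponds to one four-tuple.

```python
import socket
import threading

def handle(conn, addr):
    # conn is dedicated to one (source IP, source port, dest IP, dest port) four-tuple.
    print("serving", addr, "on local endpoint", conn.getsockname())
    conn.close()

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("0.0.0.0", 8080))    # all connections arrive at dest port 8080
listener.listen()
while True:
    conn, addr = listener.accept()  # a new socket for every connecting client
    threading.Thread(target=handle, args=(conn, addr)).start()
```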
No-frills, bare-bones transport protocol.
“Best-effort” service, UDP segments may be:
With UDP there is no handshaking between sending and receiving transport-layer entities before sending a segment.
Why do some applications use UDP?
Often used for streaming multimedia apps (e.g., movies)
Other UDP uses: DNS (Domain Name System) and SNMP (Simple Network Management Protocol) run over UDP.
Reliable transfer over UDP: add reliability at the application layer (application-specific error recovery)
UDP is message-oriented
UDP Segment Structure
The application data occupies the data field of the UDP segment.
The port numbers allow the destination host to pass the application data to the correct process running on the destination end system(that is, to perform the demultiplexing function).
The length field specifies the number of bytes in the UDP segment (header plus data).
The checksum is used by the receiving host to check whether errors have been introduced into the segment.
UDP checksum
As an example, suppose that we have the following three 16-bit words:
0110011001100000
0101010101010101
1000111100001100
The sum of the first two of these 16-bit words is
0110011001100000
0101010101010101
1011101110110101
Adding the third word to the above sum gives
1011101110110101
1000111100001100
0100101011000001 (overflow of 1; wrap around by adding 0000000000000001)
0000000000000001
0100101011000010 Note that this last addition had overflow, which was wrapped around (the carried-out high-order 1 is added back into the low-order 16 bits).
1's complement of the final sum (the checksum): 1011010100111101
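The same end-around-carry arithmetic, written as a small Python sketch (the function name is mine, not part of any standard library):

```python
def internet_checksum(words):
    """1's-complement sum of 16-bit words, with end-around carry."""
    total = 0
    for w in words:
        total += w
        total = (total & 0xFFFF) + (total >> 16)  # wrap any overflow back into the sum
    return ~total & 0xFFFF                        # 1's complement of the final sum

words = [0b0110011001100000, 0b0101010101010101, 0b1000111100001100]
print(format(internet_checksum(words), "016b"))   # -> 1011010100111101
```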
Although UDP provides error checking, it does not do anything to recover from an error.
The service abstraction provided to the upper-layer entities is that of a reliable channel through which data can be transferred.
It is the responsibility of a reliable data transfer protocol(可靠数据传输协议) to implement this service abstraction (This task is made difficult by the fact that the layer below the reliable data transfer protocol may be unreliable).
The core performance problem of rdt 3.0 is that it is a stop-and-wait protocol
How to handle lost, corrupted, and excessively delayed packets.
Stop-and-wait ARQ protocol:
After sending a packet, the sender must temporarily keep a copy of the packet it has sent;
Both packets and acknowledgment packets must be numbered;
The retransmission time of the timeout timer should be somewhat longer than the average round-trip time of packet transmission;
Lost acknowledgments (ACKs) and late acknowledgments:
Achieving reliable communication:
1. With the acknowledgment and retransmission mechanisms above, reliable communication can be achieved over an unreliable transmission network;
2. This kind of reliable transfer protocol is commonly called ARQ (Automatic Repeat reQuest);
3. ARQ means that retransmission requests are issued automatically; the receiver does not need to ask the sender to retransmit a corrupted packet;
Channel utilization:
Pipelined protocols: continuous ARQ
The sender may send multiple packets in succession, without stopping to wait for an acknowledgment after each packet;
Because data flows on the channel without interruption, this transmission mode achieves very high channel utilization;
Go-Back-N (also called a sliding-window protocol) is one approach to error recovery in pipelining; another approach is Selective Repeat (SR).
In the Go-Back-N (GBN) protocol, the sender is allowed to transmit multiple packets (when available) without waiting for acknowledgments, but the number of unacknowledged packets in the pipeline may not exceed some maximum allowable number N.
The figure shows the operation of the GBN protocol for the case of a window size of four packets. Because of this window size limitation (N = 4), the sender sends packets 0 through 3 but then must wait for one or more of these packets to be acknowledged before proceeding. As each successive ACK (for example, ACK0 and ACK1) is received, the window slides forward and the sender can transmit one new packet (pkt4 and pkt5, respectively). On the receiver side, packet 2 is lost and thus packets 3, 4, and 5 are found to be out of order and are discarded.
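A minimal sketch of the GBN sender's window bookkeeping described above (not a full protocol: `deliver_to_network` is a stand-in for handing a packet to the unreliable channel, and timers are reduced to an explicit `on_timeout` call):

```python
def deliver_to_network(seq, packet):
    print(f"send pkt{seq}: {packet}")   # stand-in for the unreliable channel

class GBNSender:
    def __init__(self, window_size=4):
        self.N = window_size
        self.base = 0        # oldest unacknowledged sequence number
        self.next_seq = 0    # next sequence number to assign
        self.unacked = {}    # seq -> packet, kept until cumulatively ACKed

    def send(self, packet):
        if self.next_seq < self.base + self.N:    # room left in the window?
            self.unacked[self.next_seq] = packet
            deliver_to_network(self.next_seq, packet)
            self.next_seq += 1
            return True
        return False                              # window full: caller must wait

    def on_ack(self, ack_num):
        # Cumulative ACK: everything up to and including ack_num is confirmed.
        for seq in range(self.base, ack_num + 1):
            self.unacked.pop(seq, None)
        self.base = max(self.base, ack_num + 1)   # slide the window forward

    def on_timeout(self):
        # Go back N: retransmit every packet that is still unacknowledged.
        for seq in range(self.base, self.next_seq):
            deliver_to_network(seq, self.unacked[seq])

sender = GBNSender(window_size=4)
for i in range(6):
    sender.send(f"data{i}")    # only pkt0..pkt3 go out; 4 and 5 are refused
sender.on_ack(1)               # ACK1 slides the window, so pkt4 and pkt5 may now be sent
```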
TCP is said to be connection-oriented because before one application process can begin to send data to another, the two processes must first “handshake” with each other;
A TCP connection provides a full-duplex service;
A TCP connection is also always point-to-point, that is, between a single sender and a single receiver ("multicasting" isn't possible with TCP);
Connection-establishment procedure: three-way handshake
Once a TCP connection is established, the two application processes can send data to each other:
Source and destination port numbers:
Sequence number field and acknowledgment number field:
Both are 32 bits (4 bytes);
Sequence number:
Acknowledgment number:
The acknowledgment number that Host A puts in its segment is the sequence number of the next byte Host A is expecting from Host B
Suppose that Host A has received all bytes numbered 0 through 535 from B and suppose that it is about to send a segment to Host B. Host A is waiting for byte 536 and all the subsequent bytes in Host B’s data stream. So Host A puts 536 in the acknowledgment number field of the segment it sends to B.
Used by the TCP sender and receiver in implementing a reliable data transfer service;
What does a host do when it receives out-of-order segments in a TCP connection? Interestingly, the TCP RFCs do not impose any rules here and leave the decision up to the programmers implementing a TCP implementation. There are basically two choices: either (1) the receiver immediately discards out-of-order segments (which, as we discussed earlier, can simplify receiver design), or (2) the receiver keeps the out-of-order bytes and waits for the missing bytes to fill in the gaps. Clearly, the latter choice is more efficient in terms of network bandwidth, and is the approach taken in practice.
Receive window: 16 bits
Header length field: 4 bits
Flag field: 6 bits
The ACK bit is used to indicate that the value carried in the acknowledgment field is valid (That is, the segment contains an acknowledgment for a segment that has been successfully received)
The RST (reset), SYN, and FIN bits are used for connection setup and teardown.
- RST = 1 indicates that a serious error has occurred on the TCP connection (e.g., because a host crashed); the connection must be released and then re-established;
- SYN = 1 indicates a connection-request or connection-accept segment;
- FIN = 1 indicates that the sender of this segment has finished sending its data and requests that the transport connection be released;
The PSH (push) bit indicates whether the receiver should pass the data to the upper layer immediately.
The URG bit is used to indicate that there is data in this segment that the sending-side upper-layer entity has marked as "urgent" (the location of the last byte of this urgent data is indicated by the 16-bit urgent data pointer field).
The CWR and ECE bits are used in explicit congestion notification.
Checksum field:
- The checksum covers both the header and the data;
- When computing the checksum, a 12-byte pseudo-header is prepended to the TCP segment;
Options field: optional and variable-length
Cumulative acknowledgments:
Selective ACK
Selective acknowledgment:
RFC 2018 specifies:
TCP uses a timeout/retransmit mechanism to recover from lost segments. Although this is conceptually simple, many subtle issues arise when we implement a timeout/retransmit mechanism in an actual protocol such as TCP. Perhaps the most obvious question is the length of the timeout intervals (超时间隔). Clearly, the timeout should be larger than the connection’s round-trip time (RTT), that is, the time from when a segment is sent until it is acknowledged. Otherwise, unnecessary retransmissions would be sent.
The round-trip delay has a large variance: because the layer below TCP is an internetwork environment, the routes chosen by IP datagrams vary greatly, so the variance of the transport-layer round-trip time is also large;
Estimating the Round-Trip Time
The sample RTT, denoted SampleRTT, for a segment is the amount of time between when the segment is sent (that is, passed to IP) and when an acknowledgment for the segment is received.
Obviously, the SampleRTT values will fluctuate from segment to segment due to congestion in the routers and to the varying load on the end systems. Because of this fluctuation, any given SampleRTT value may be atypical.
In order to estimate a typical RTT, it is therefore natural to take some sort of average of the SampleRTT values. TCP maintains an average, called EstimatedRTT, of the SampleRTT values.
Upon obtaining a new SampleRTT, TCP updates EstimatedRTT according to the following formula (this weighted average is also called the smoothed round-trip time, RTT_s):
EstimatedRTT = (1 − α)·EstimatedRTT + α·SampleRTT
The recommended value of α is 0.125 (that is, 1/8).
EstimatedRTT is a weighted average of the SampleRTT values. This weighted average puts more weight on recent samples than on old samples, as the more recent samples better reflect the current congestion in the network. In statistics, such an average is called an exponential weighted moving average (EWMA).
In addition to having an estimate of the RTT, it is also valuable to have a measure of the variability of the RTT. The RTT variation, DevRTT (the weighted average of the RTT deviation, RTT_D), is an estimate of how much SampleRTT typically deviates from EstimatedRTT:
DevRTT = (1 − β)·DevRTT + β·|SampleRTT − EstimatedRTT|
Note that DevRTT is an EWMA of the difference between SampleRTT and EstimatedRTT. If the SampleRTT values have little fluctuation, then DevRTT will be small; on the other hand, if there is a lot of fluctuation, DevRTT will be large. The recommended value of β is 0.25.
Setting and Managing the Retransmission Timeout Interval
Given values of EstimatedRTT and DevRTT , what value should be used for TCP’s timeout interval?
Clearly, the interval should be greater than or equal to EstimatedRTT , or unnecessary retransmissions would be sent. But the timeout interval should not be too much larger than EstimatedRTT ; otherwise, when a segment is lost, TCP would not quickly retransmit the segment, leading to large data transfer delays.
It is therefore desirable to set the timeout equal to the EstimatedRTT plus some margin. The margin should be large when there is a lot of fluctuation in the SampleRTT values; it should be small when there is little fluctuation. The value of DevRTT should thus come into play here.
All of these considerations are taken into account in TCP's method for determining the retransmission timeout interval (Retransmission Time-Out, RTO):
TimeoutInterval = EstimatedRTT + 4·DevRTT
An initial TimeoutInterval value of 1 second is recommended.
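A minimal sketch of the EWMA formulas above, with the recommended α = 1/8 and β = 1/4 and an initial TimeoutInterval of 1 second (the class name and the sample values are made up for illustration):

```python
class RttEstimator:
    def __init__(self, alpha=0.125, beta=0.25):
        self.alpha, self.beta = alpha, beta
        self.estimated_rtt = None
        self.dev_rtt = 0.0
        self.timeout_interval = 1.0        # recommended initial value: 1 second

    def update(self, sample_rtt):
        if self.estimated_rtt is None:
            self.estimated_rtt = sample_rtt            # first measurement
        else:
            self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                            + self.beta * abs(sample_rtt - self.estimated_rtt))
            self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                                  + self.alpha * sample_rtt)
        self.timeout_interval = self.estimated_rtt + 4 * self.dev_rtt
        return self.timeout_interval

est = RttEstimator()
for sample in [0.106, 0.120, 0.095, 0.250]:   # hypothetical SampleRTT values, in seconds
    print(round(est.update(sample), 3))
```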
Karn's algorithm:
Implementing reliable communication in TCP
Send buffer and receive buffer
The roles of the send buffer and the receive buffer
Emphasis:
TCP provides a flow-control service to its applications to eliminate the possibility of the sender overflowing the receiver's buffer.
Flow control is a speed-matching service—matching the rate at which the sender is sending against the rate at which the receiving application is reading.
TCP provides flow control by having the sender maintain a variable called the receive window.
Suppose that Host A is sending a large file to Host B over a TCP connection:
Host B allocates a receive buffer to this connection; denote its size by RcvBuffer;
The application process in Host B reads from the buffer; define LastByteRead (the last byte read from the buffer by the application) and LastByteRcvd (the last byte that has arrived from the network and been placed in the buffer);
Because TCP is not permitted to overflow the allocated buffer, we must have: LastByteRcvd − LastByteRead ≤ RcvBuffer;
The receive window, denoted rwnd, is set to the amount of spare room in the buffer: rwnd = RcvBuffer − (LastByteRcvd − LastByteRead);
How does the connection use the variable rwnd to provide the flow-control service? (See the sketch below.)
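A minimal numeric sketch of the receive-window bookkeeping just described; the variable names follow the definitions above and the byte counts are made up:

```python
RcvBuffer = 4096        # size of Host B's receive buffer, in bytes

LastByteRcvd = 2000     # last byte placed into the buffer from the network
LastByteRead = 1200     # last byte the application has read from the buffer

# TCP must never overflow the allocated buffer:
assert LastByteRcvd - LastByteRead <= RcvBuffer

# Host B advertises the spare room as rwnd in the segments it sends to Host A;
# Host A then keeps its amount of unacknowledged in-flight data at most rwnd.
rwnd = RcvBuffer - (LastByteRcvd - LastByteRead)
print("advertised rwnd =", rwnd)   # -> 3296
```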
Persistence timer
Transmission efficiency must also be considered. The mechanisms that actually control when TCP sends a segment:
- First, TCP maintains a variable equal to the maximum segment size (MSS); as soon as the data held in the buffer reaches MSS bytes, it is assembled into a TCP segment and sent;
- Second, the sending application process can explicitly ask for a segment to be sent, i.e., the push operation supported by TCP;
- Third, when a timer at the sender expires, the data currently in the buffer is put into a segment (no longer than MSS) and sent;
Connection establishment: three-way handshake
Suppose a process running in one host (client) wants to initiate a connection with another process in another host (server).
The client application process first informs the client TCP that it wants to establish a connection to a process in the server.
The TCP in the client then proceeds to establish a TCP connection with the TCP in the server in the following manner (three-way handshake):
Step 1:
- The client-side TCP first sends a special TCP segment to the server-side TCP.
- This special segment contains no application-layer data. But one of the flag bits in the segment’s header, the SYN bit, is set to 1(For this reason, this special segment is referred to as a SYN segment).
- In addition, the client randomly chooses an initial sequence number (client_isn) and puts this number in the sequence number field of the initial TCP SYN segment (This segment is encapsulated within an IP datagram and sent to the server).
- There has been considerable interest in properly randomizing the choice of the client_isn in order to avoid certain security attacks.
Step 2:
Once the IP datagram containing the TCP SYN segment arrives at the server host, the server extracts the TCP SYN segment from the datagram, allocates the TCP buffers and variables to the connection, and sends a connection-granted segment to the client TCP. This connection-granted segment also contains no application-layer data (The connection-granted segment is referred to as a SYNACK segment).
However, it does contain three important pieces of information in the segment header:
First: the SYN bit is set to 1;
Second: the acknowledgment field of the TCP segment header is set to client_isn+1;
Finally: the server chooses its own initial sequence number (server_isn) and
puts this value in the sequence number field of the TCP segment header;
Step 3:
Upon receiving the SYNACK segment, the client also allocates buffers and variables to the connection. The client host then sends the server yet another segment; this last segment acknowledges the server’s connection-granted segment (the client does so by putting the value server_isn+1 in the acknowledgment field of the TCP segment header). The SYN bit is set to zero, since the connection is established. This third stage of the three-way handshake may carry client-to-server data in the segment payload.
Once these three steps have been completed, the client and server hosts can send segments containing data to each other (In each of these future segments, the SYN bit will be set to zero).
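A minimal loopback sketch (the port number is illustrative): the operating system's TCP carries out the three-way handshake when connect() and accept() are called, before any application data is exchanged.

```python
import socket
import threading

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 6000))
srv.listen()                         # server is now ready to receive SYN segments

def accept_one():
    conn, addr = srv.accept()        # returns once the handshake completes
    print("server: connection established with", addr)
    conn.close()

t = threading.Thread(target=accept_one)
t.start()

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", 6000))     # SYN -> SYNACK -> ACK happens here
print("client: connected from", cli.getsockname())
cli.close()
t.join()
srv.close()
```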
Why a three-way handshake (rather than two) is needed:
Connection release: four-way handshake
When a connection ends, the “resources” (that is, the buffers and variables) in the hosts are deallocated.
Suppose the client decides to close the connection:
The client must wait for a period of 2MSL (twice the Maximum Segment Lifetime):
- First, to ensure that the last ACK segment it sends can reach the server;
- Second, to prevent "invalidated connection-request segments" from appearing in this connection: after A sends the last ACK segment and another 2MSL elapses, all segments generated during the lifetime of this connection will have disappeared from the network, so such stale connection-request segments cannot show up in the next new connection;
During the life of a TCP connection:
The TCP protocol running in each host makes transitions through various TCP states.
A typical sequence of TCP states visited by a client TCP:
A typical sequence of TCP states visited by a server-side TCP:
Congestion:
General principles of congestion control:
(Omitted.)
At the highest level, we can distinguish among congestion-control approaches by whether the network layer provides explicit assistance to the transport layer for congestion-control purposes:
End-to-end congestion control
Network-assisted congestion control
Relationship between congestion control and flow control
Open-loop control and closed-loop control
TCP Congestion Control
- The sender maintains a state variable called the congestion window, cwnd; the size of the congestion window depends on the degree of congestion in the network and changes dynamically.
- The sender sets its send window equal to the congestion window; if the receiver's receiving capacity is also taken into account, the send window may be smaller than the congestion window.
- The principle by which the sender controls the congestion window: as long as the network is not congested, the congestion window is increased a bit more so that more packets can be sent; as soon as congestion appears, the congestion window is decreased to reduce the number of packets injected into the network.
Several congestion-control methods (important):
Slow Start and Congestion Avoidance:
The idea of the slow-start algorithm:
When a host just starts sending segments, it may first set the congestion window cwnd = 1, i.e., the value of one maximum segment size (MSS);
Each time an acknowledgment for a new segment is received, the congestion window is increased by 1, i.e., by one MSS;
Gradually increasing the sender's congestion window cwnd in this way makes the rate at which packets are injected into the network more reasonable;
Transmission round:
- With the slow-start algorithm, the congestion window cwnd doubles after each transmission round;
- The time taken by one transmission round is in fact the round-trip time RTT;
- "Transmission round" emphasizes sending, back to back, all the segments the congestion window cwnd allows and then receiving the acknowledgment for the last byte sent;
- For example, with a congestion window cwnd = 4, the RTT of a round is the total time for the sender to send 4 segments back to back and receive the acknowledgments for these 4 segments;
The idea of the congestion-avoidance algorithm:
Maintain a slow-start threshold state variable (ssthresh)
When congestion occurs in the network:
Whether in the slow-start phase or the congestion-avoidance phase, as soon as the sender judges that the network is congested (the evidence being that an acknowledgment is not received in time), it sets the slow-start threshold ssthresh to half the sender's window value at the time congestion occurred (but not less than 2);
It then resets the congestion window cwnd to 1 and runs the slow-start algorithm;
The purpose is to quickly reduce the number of packets the host sends into the network, so that the congested routers have enough time to finish processing the packets backed up in their queues;
(Assume the receive window is large enough.)
Multiplicative decrease
"Multiplicative decrease" means that, whether in the slow-start phase or the congestion-avoidance phase, whenever a timeout occurs (i.e., network congestion occurs once), the slow-start threshold ssthresh is set to 0.5 times the current congestion window;
When the network becomes congested frequently, ssthresh drops quickly, greatly reducing the number of packets injected into the network;
Additive increase
"Additive increase" means that, once the congestion-avoidance algorithm is running, the congestion window cwnd is increased by one MSS after acknowledgments for all outstanding segments have been received (i.e., after one round-trip time), so that the window grows slowly and the network does not become congested prematurely; a small simulation of these rules is sketched below.
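A toy simulation of these rules, counting cwnd in units of MSS per transmission round (the loss round and initial ssthresh are made-up inputs; real TCP reacts per ACK rather than per round):

```python
def simulate(rounds, loss_rounds, ssthresh=16):
    cwnd = 1                                    # slow start begins at 1 MSS
    history = []
    for r in range(rounds):
        history.append(cwnd)
        if r in loss_rounds:                    # timeout: congestion detected
            ssthresh = max(cwnd // 2, 2)        # multiplicative decrease
            cwnd = 1                            # back to slow start
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)      # slow start: double each round, up to ssthresh
        else:
            cwnd += 1                           # congestion avoidance: +1 MSS each round
    return history

print(simulate(rounds=16, loss_rounds={8}))
# -> [1, 2, 4, 8, 16, 17, 18, 19, 20, 1, 2, 4, 8, 10, 11, 12]
```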
Emphasis:
Fast Retransmit and Fast Recovery:
Fast retransmit:
Fast recovery:
Upper bound on the send window:
- The upper bound on the sender's send window should be the smaller of the receiver window rwnd and the congestion window cwnd (see the sketch below):
- When rwnd < cwnd, the receiver's receiving capacity limits the maximum value of the send window;
- When cwnd < rwnd, congestion in the network limits the maximum value of the send window;
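A one-line sketch of this rule, with made-up byte counts:

```python
rwnd = 8 * 1460          # bytes the receiver can still buffer (receiver window)
cwnd = 4 * 1460          # bytes the network is currently judged able to absorb
send_window = min(rwnd, cwnd)
print(send_window)       # -> 5840: here cwnd < rwnd, so congestion is the limit
```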