4.2 TCP Segmentation Offload (TSO)

  TSO (TCP Segmentation Offload) is a technique that lets the NIC, rather than the CPU, segment large packets, reducing CPU load; when the packets being offloaded are specifically TCP, the feature is called TSO, and it requires NIC support. With TSO the protocol stack can push a large buffer down to the NIC and let the hardware carve it into segments, relieving the CPU; in essence, segmentation is deferred for as long as possible. The generalized, software form of this idea in Linux is called GSO (Generic Segmentation Offload), and it works even without hardware segmentation support. For hardware that supports TSO, a packet still goes through the GSO path first and the NIC then performs the actual segmentation; for hardware that does not, segmentation is done in software just before the data is handed to the NIC, i.e. right before the driver's ndo_start_xmit callback is invoked.
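
  Before walking through the kernel code, it is often useful to check whether TSO/GSO/SG are actually enabled on an interface. The sketch below queries them from user space through the ethtool ioctl interface (the interface name "eth0" is only an example); the same information is available from the command line via "ethtool -k eth0".

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/* Query one boolean offload setting (ETHTOOL_GTSO, ETHTOOL_GGSO, ETHTOOL_GSG). */
static int get_offload(int fd, const char *ifname, __u32 cmd)
{
    struct ethtool_value eval = { .cmd = cmd };
    struct ifreq ifr;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_data = (void *)&eval;

    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
        return -1;                      /* query failed */
    return eval.data;                   /* 1 = enabled, 0 = disabled */
}

int main(void)
{
    const char *ifname = "eth0";        /* example interface name */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return 1;
    printf("%s: TSO=%d GSO=%d SG=%d\n", ifname,
           get_offload(fd, ifname, ETHTOOL_GTSO),
           get_offload(fd, ifname, ETHTOOL_GGSO),
           get_offload(fd, ifname, ETHTOOL_GSG));
    close(fd);
    return 0;
}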

    GSO/TSO support is set up while the TCP connection is being established. When the SYN is sent:

142 int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 143 {
...
 234     sk->sk_gso_type = SKB_GSO_TCPV4;
 235     sk_setup_caps(sk, &rt->dst);
...
   When the three-way handshake completes (server side):
 1642 struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
1643                   struct request_sock *req,      
1644                   struct dst_entry *dst)         
1645 {
...
1662     newsk->sk_gso_type = SKB_GSO_TCPV4;
...
1689     sk_setup_caps(newsk, dst);
...
   The sk_setup_caps function:
1507 void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
1508 {  
1509     __sk_dst_set(sk, dst);
1510     sk->sk_route_caps = dst->dev->features;//take over the device's feature flags
1511     if (sk->sk_route_caps & NETIF_F_GSO)//the device supports GSO
1512         sk->sk_route_caps |= NETIF_F_GSO_SOFTWARE;//add all software GSO feature flags
1513     sk->sk_route_caps &= ~sk->sk_route_nocaps;//clear features that have been disabled for this socket
1514     if (sk_can_gso(sk)) {//the device can perform GSO for this socket's gso_type
1515         if (dst->header_len) {//the dst needs extra header space (e.g. tunneling)
1516             sk->sk_route_caps &= ~NETIF_F_GSO_MASK;//disable all GSO-related features
1517         } else {
1518             sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM;//allow scatter-gather and hardware checksumming on this socket
1519             sk->sk_gso_max_size = dst->dev->gso_max_size;
1520             sk->sk_gso_max_segs = dst->dev->gso_max_segs;
1521         }
1522     }
1523 }
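
  The sk_can_gso check at line 1514 simply tests whether the device feature bits cached in sk->sk_route_caps cover the socket's gso_type. For reference, in kernels of this generation it looks roughly like this (net_gso_ok in include/linux/netdevice.h, sk_can_gso in include/net/sock.h):

static inline bool net_gso_ok(netdev_features_t features, int gso_type)
{
    netdev_features_t feature = gso_type << NETIF_F_GSO_SHIFT;

    return (features & feature) == feature;
}

static inline bool sk_can_gso(const struct sock *sk)
{
    return net_gso_ok(sk->sk_route_caps, sk->sk_gso_type);
}
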
  Now let us revisit the transmit path, starting with tcp_sendmsg:
 1016 int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
1017         size_t size)
1018 {   
1019     struct iovec *iov;
1020     struct tcp_sock *tp = tcp_sk(sk);
1021     struct sk_buff *skb;
1022     int iovlen, flags, err, copied = 0;
1023     int mss_now = 0, size_goal, copied_syn = 0, offset = 0;
1024     bool sg;
...
1067     mss_now = tcp_send_mss(sk, &size_goal, flags);//compute the current MSS, and store in size_goal the largest amount of data worth handing to the NIC in one skb
...
1078     sg = !!(sk->sk_route_caps & NETIF_F_SG);//sg is true if scatter-gather I/O is available
1079
1080     while (--iovlen >= 0) {
1081         size_t seglen = iov->iov_len;
...
1095         while (seglen > 0) {
1096             int copy = 0;
1097             int max = size_goal;
1098
1099             skb = tcp_write_queue_tail(sk);
1100             if (tcp_send_head(sk)) {
1101                 if (skb->ip_summed == CHECKSUM_NONE)//hardware checksum offload is not being used for this skb
1102                     max = mss_now;//so GSO cannot be used either; limit this skb to one MSS
1103                 copy = max - skb->len;
1104             }
1105
1106             if (copy <= 0) {
...
1114                 skb = sk_stream_alloc_skb(sk,
1115                               select_size(sk, sg),
1116                               sk->sk_allocation);//allocate an skb whose linear size is chosen by select_size()
1117                 if (!skb)
1118                     goto wait_for_memory;
...
1127                 /*
1128                  * Check whether we can use HW checksum.
1129                  */
1130                 if (sk->sk_route_caps & NETIF_F_ALL_CSUM)
1131                     skb->ip_summed = CHECKSUM_PARTIAL;//the device can finish the checksum in hardware
... 
1142             /* Where to copy to? */
1143             if (skb_availroom(skb) > 0) {//the linear area still has room
1144                 /* We have some space in skb head. Superb! */
1145                 copy = min_t(int, copy, skb_availroom(skb));
1146                 err = skb_add_data_nocache(sk, skb, from, copy);//copy data from user space into the skb's linear area
1147                 if (err)
1148                     goto do_fault;
1149             } else {
1150                 bool merge = true;
1151                 int i = skb_shinfo(skb)->nr_frags;//number of page fragments already attached to this skb
1152                 struct page_frag *pfrag = sk_page_frag(sk);
1153 
1154                 if (!sk_page_frag_refill(sk, pfrag))//make sure the per-socket page fragment has room; allocate a new page if not, and wait for memory if that fails
1155                     goto wait_for_memory;
1156 
1157                 if (!skb_can_coalesce(skb, i, pfrag->page,
1158                               pfrag->offset)) {//can the new data be merged into the last entry of skb_shinfo(skb)->frags (same page, contiguous offset)?
1159                     if (i == MAX_SKB_FRAGS || !sg) {
1160                         tcp_mark_push(tp, skb);
1161                         goto new_segment;
1162                     }
1163                     merge = false;//it cannot be merged; a new frags[] entry will be filled in below
1164                 }
1165 
1166                 copy = min_t(int, copy, pfrag->size - pfrag->offset);
1167 
1168                 if (!sk_wmem_schedule(sk, copy))//check the socket's send-buffer memory limits
1169                     goto wait_for_memory;
1170 
1171                 err = skb_copy_to_page_nocache(sk, from, skb,
1172                                    pfrag->page,
1173                                    pfrag->offset,
1174                                    copy);//copy data from user space into the page fragment
1175                 if (err)
1176                     goto do_error;
1177 
1178                 /* Update the skb. */
1179                 if (merge) {//data was appended to the existing last fragment
1180                     skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);//just grow that fragment's size
1181                 } else {
1182                     skb_fill_page_desc(skb, i, pfrag->page,
1183                                pfrag->offset, copy);//record the page as a new entry in skb_shinfo(skb)->frags
1184                     get_page(pfrag->page);
1185                 }
1186                 pfrag->offset += copy;
1187             }
...

  Lines 1150-1163: the linear area is full, but data may still be written into this skb, either because the MSS has grown or because the NIC supports scatter-gather I/O; in that case the data has to go into the non-linear area (page fragments). If the NIC does not support scatter-gather I/O, the stack will linearize the non-linear data before handing the skb to the driver.
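
  The skb_can_coalesce test at line 1157 is what decides between these two cases: it returns true only when the new data would land on the same page as the last fragment and directly behind it. For reference, its definition in this kernel generation is roughly (include/linux/skbuff.h):

static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
                                    const struct page *page, int off)
{
    if (i) {
        const struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i - 1];

        /* same page, and the new offset continues right after the fragment */
        return page == skb_frag_page(frag) &&
               off == frag->page_offset + skb_frag_size(frag);
    }
    return false;
}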

  The tcp_send_mss function:

780 static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 781                        int large_allowed)             
 782 {
 783     struct tcp_sock *tp = tcp_sk(sk);
 784     u32 xmit_size_goal, old_size_goal;
 785
 786     xmit_size_goal = mss_now;
 787 //large_allowed is true unless the data is urgent (MSG_OOB); urgent data must go out as soon as possible and must not wait to be batched into a large packet
 788     if (large_allowed && sk_can_gso(sk)) {
 789         xmit_size_goal = ((sk->sk_gso_max_size - 1) - //maximum GSO packet size, inherited from dst->dev->gso_max_size in sk_setup_caps
 790                   inet_csk(sk)->icsk_af_ops->net_header_len -//minus the IP header length
 791                   inet_csk(sk)->icsk_ext_hdr_len -//minus the length of IP options/extension headers
 792                   tp->tcp_header_len);    //minus the TCP header length
 793
 794         /* TSQ : try to have two TSO segments in flight */
 795         xmit_size_goal = min_t(u32, xmit_size_goal,
 796                        sysctl_tcp_limit_output_bytes >> 1);
 797
 798         xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);//bound by half of the largest window this end has seen
 799
 800         /* We try hard to avoid divides here */
 801         old_size_goal = tp->xmit_size_goal_segs * mss_now;
 802
 803         if (likely(old_size_goal <= xmit_size_goal &&
 804                old_size_goal + mss_now > xmit_size_goal)) {//the new goal is no smaller than the old one, but differs from it by less than one MSS
 805             xmit_size_goal = old_size_goal;//keep the old value
 806         } else {//otherwise recompute size_goal
 807             tp->xmit_size_goal_segs =
 808                 min_t(u16, xmit_size_goal / mss_now,
 809                       sk->sk_gso_max_segs);
 810             xmit_size_goal = tp->xmit_size_goal_segs * mss_now;//round size_goal down to a whole number of MSS
 811         }
 812     }
 813
 814     return max(xmit_size_goal, mss_now);
 815 }
 816
 817 static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 818 {
 819     int mss_now;
 820
 821     mss_now = tcp_current_mss(sk);
 822     *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
 823
 824     return mss_now;
 825 }
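
  As a rough worked example (the exact numbers depend on the connection and on sysctl settings): assume mss_now = 1448, sk_gso_max_size = 65536, a 20-byte IPv4 header, no extension headers and a 32-byte TCP header (timestamps enabled). Then xmit_size_goal starts at 65535 - 20 - 0 - 32 = 65483; if neither the TSQ limit nor the half-window bound is smaller, xmit_size_goal_segs becomes 65483 / 1448 = 45 and the final size_goal is 45 * 1448 = 65160 bytes, i.e. tcp_sendmsg will try to fill each skb with roughly 45 MSS worth of data.
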
   When an skb is allocated, the select_size function decides how large its linear area should be:
 961 static inline int select_size(const struct sock *sk, bool sg)
 962 {
 963     const struct tcp_sock *tp = tcp_sk(sk);
 964     int tmp = tp->mss_cache;
 965
 966     if (sg) {//scatter-gather I/O is supported
 967         if (sk_can_gso(sk)) {//GSO will be used
 968             /* Small frames wont use a full page:
 969              * Payload will immediately follow tcp header.
 970              */
 971             tmp = SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);//linear area: the data that fits in a 2048-byte allocation after the maximum TCP/IP header and skb overhead
 972         } else {
 973             int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER);//data that fits in one page alongside a maximum-size TCP/IP header
 974
 975             if (tmp >= pgbreak &&   //MSS exceeds what one page can hold
 976                 tmp <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE)//but not what the page-fragment array can hold
 977                 tmp = pgbreak;//linear area: one page's worth of data
 978         }//if the MSS exceeds even the space the page fragments can provide, return the MSS itself so that all data lives in the linear area and the non-linear area is not needed
 979     }
 980
 981     return tmp;
 982 }
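
  The two macros used above hide the accounting for struct skb_shared_info, which lives at the end of every skb data buffer; their definitions are roughly (include/linux/skbuff.h):

#define SKB_DATA_ALIGN(X)       ALIGN(X, SMP_CACHE_BYTES)
#define SKB_WITH_OVERHEAD(X)    \
        ((X) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
#define SKB_MAX_ORDER(X, ORDER) \
        SKB_WITH_OVERHEAD((PAGE_SIZE << (ORDER)) - (X))
#define SKB_MAX_HEAD(X)         (SKB_MAX_ORDER((X), 0))

  So SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER) is the payload that fits in a 2048-byte block once the maximum header and the shared-info structure are subtracted, and SKB_MAX_HEAD(MAX_TCP_HEADER) is the payload that fits in a single page under the same constraints.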

  If the NIC supports scatter-gather I/O, memory can be allocated preferentially as non-contiguous pages, which makes allocation more likely to succeed; at transmit time the NIC gathers the data from those non-contiguous pages itself, so the corresponding work is carried by the hardware rather than the CPU. A minimal, hypothetical sketch of what that looks like on the driver side follows.
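
  In the sketch below, an SG-capable driver's transmit routine walks the skb: the linear area first, then one buffer per page fragment. The demo_* names are made up for illustration; a real driver would also set up DMA mappings and hardware descriptors here.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct demo_dev { int dummy; };   /* hypothetical per-device state */

/* Hypothetical helper: hand one buffer to the (made-up) hardware. */
static void demo_hw_queue_buf(struct demo_dev *hw, void *vaddr, unsigned int len)
{
    /* a real driver would map vaddr for DMA and fill a TX descriptor here */
}

static netdev_tx_t demo_ndo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct demo_dev *hw = netdev_priv(dev);
    int i;

    /* Linear part: headers plus whatever payload lives in skb->data. */
    demo_hw_queue_buf(hw, skb->data, skb_headlen(skb));

    /* Non-linear part: one buffer per page fragment. */
    for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
        const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

        demo_hw_queue_buf(hw, skb_frag_address(frag), skb_frag_size(frag));
    }

    return NETDEV_TX_OK;
}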

  When tcp_write_xmit is called to actually transmit data, it also has to decide whether transmission should be deferred:

1811 static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
1812                int push_one, gfp_t gfp)
1813 {
1814     struct tcp_sock *tp = tcp_sk(sk);
1815     struct sk_buff *skb;
...
1836         tso_segs = tcp_init_tso_segs(sk, skb, mss_now); //number of segments this skb will be split into
1837         BUG_ON(!tso_segs);
...
1854         if (tso_segs == 1) {
1855             if (unlikely(!tcp_nagle_test(tp, skb, mss_now,
1856                              (tcp_skb_is_last(sk, skb) ?
1857                               nonagle : TCP_NAGLE_PUSH))))
1858                 break;
1859         } else { //more than one segment
1860             if (!push_one && tcp_tso_should_defer(sk, skb)) //not in push-one mode and deferring looks worthwhile
1861                 break; //do not send now
1862         }
...
1871         limit = mss_now;
1872         if (tso_segs > 1 && !tcp_urg_mode(tp))    //more than one TSO segment and no urgent data pending
1873             limit = tcp_mss_split_point(sk, skb, mss_now,
1874                             min_t(unsigned int,
1875                               cwnd_quota,
1876                               sk->sk_gso_max_segs));
1877 
1878         if (skb->len > limit &&    //the skb is larger than the limit
1879             unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))    //split the skb
1880             break;
...
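
  tcp_init_tso_segs, called at line 1836, returns how many MSS-sized segments the skb currently represents, recomputing the cached count if the MSS has changed; for reference it is roughly (net/ipv4/tcp_output.c):

static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
                             unsigned int mss_now)
{
    int tso_segs = tcp_skb_pcount(skb);

    if (!tso_segs || (tso_segs > 1 && tcp_skb_mss(skb) != mss_now)) {
        tcp_set_skb_tso_segs(sk, skb, mss_now);
        tso_segs = tcp_skb_pcount(skb);
    }
    return tso_segs;
}
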
  tcp_tso_should_defer decides whether sending a TSO packet should be deferred:
1595 static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
1596 {
1597     struct tcp_sock *tp = tcp_sk(sk);
1598     const struct inet_connection_sock *icsk = inet_csk(sk);
1599     u32 send_win, cong_win, limit, in_flight;
1600     int win_divisor;
1601 
1602     if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
1603         goto send_now;
1604 
1605     if (icsk->icsk_ca_state != TCP_CA_Open)
1606         goto send_now;
1607 
1608     /* Defer for less than two clock ticks. */
1609     if (tp->tso_deferred &&
1610         (((u32)jiffies << 1) >> 1) - (tp->tso_deferred >> 1) > 1) //if the previous deferral started more than one jiffy ago, stop deferring; i.e. a TSO skb is never held back for two or more clock ticks
1611         goto send_now;
1612 
1613     in_flight = tcp_packets_in_flight(tp);
1614 
1615     BUG_ON(tcp_skb_pcount(skb) <= 1 || (tp->snd_cwnd <= in_flight));
1616  
1617     send_win = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq; //remaining send window
1618 
1619     /* From in_flight test above, we know that cwnd > in_flight.  */
1620     cong_win = (tp->snd_cwnd - in_flight) * tp->mss_cache; //remaining congestion window, in bytes
1621 
1622     limit = min(send_win, cong_win); //the smaller of the two is the limit
1623 
1624     /* If a full-sized TSO skb can be sent, do it. */
1625     if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
1626                sk->sk_gso_max_segs * tp->mss_cache))
1627         goto send_now;
1628 
1629     /* Middle in queue won't get any more data, full sendable already? */
1630     if ((skb != tcp_write_queue_tail(sk)) && (limit >= skb->len)) //an skb in the middle of the queue will not receive more data, so send it if it already fits
1631         goto send_now;
1632 
1633     win_divisor = ACCESS_ONCE(sysctl_tcp_tso_win_divisor); //sysctl_tcp_tso_win_divisor is the fraction of the congestion window a single TSO frame may consume
1634     if (win_divisor) {
1635         u32 chunk = min(tp->snd_wnd, tp->snd_cwnd * tp->mss_cache);
1636 
1637         /* If at least some fraction of a window is available,
1638          * just use it.
1639          */
1640         chunk /= win_divisor;
1641         if (limit >= chunk)
1642             goto send_now;
1643     } else {
1644         /* Different approach, try not to defer past a single
1645          * ACK.  Receiver should ACK every other full sized
1646          * frame, so if we have space for more than 3 frames
1647          * then send now.
1648          */
1649         if (limit > tcp_max_tso_deferred_mss(tp) * tp->mss_cache)
1650             goto send_now;
1651     }
1652 
1653     /* Ok, it looks like it is advisable to defer.
1654      * Do not rearm the timer if already set to not break TCP ACK clocking.
1655      */
1656     if (!tp->tso_deferred) 
1657         tp->tso_deferred = 1 | (jiffies << 1); //record when this deferral started
1658 
1659     return true;
1660 
1661 send_now:
1662     tp->tso_deferred = 0;
1663     return false;
1664 }
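
  A quick numeric example of this decision (illustrative values only): suppose mss_cache = 1448, snd_cwnd = 10, in_flight = 7, a 20000-byte send window and tcp_tso_win_divisor at its default of 3. Then cong_win = (10 - 7) * 1448 = 4344 and limit = min(20000, 4344) = 4344, well below the full-sized-TSO threshold; chunk = min(20000, 10 * 1448) / 3 = 4826, and since limit < chunk the function returns true and the skb is deferred, in the hope that returning ACKs will soon open enough window to send a larger burst in one go.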

     Packets that are not deferred are handed down towards the NIC. Before a TCP packet reaches the driver, dev_hard_start_xmit is called:
 2513 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
2514             struct netdev_queue *txq)
2515 {
2516     const struct net_device_ops *ops = dev->netdev_ops;
2517     int rc = NETDEV_TX_OK;
2518     unsigned int skb_len;
2519
2520     if (likely(!skb->next)) {//a single skb; an skb list here means GSO processing has already been done and the segments only need to be sent
...
2549         if (netif_needs_gso(skb, features)) {//the skb is a GSO skb and the device cannot segment it itself, or its checksum state is not suitable for offload
2550             if (unlikely(dev_gso_segment(skb, features)))//segment the packet in software on the NIC's behalf, preparing the partial checksum first if needed
2551                 goto out_kfree_skb;
2552             if (skb->next)
2553                 goto gso;
2554         } else {
2555             if (skb_needs_linearize(skb, features) &&  //the skb holds non-contiguous data but the device lacks scatter-gather support
2556                 __skb_linearize(skb))//pull all data into the linear area
2557                 goto out_kfree_skb;
2558
2559             /* If packet is not checksummed and device does not
2560              * support checksumming for this protocol, complete
2561              * checksumming here.
2562              */
2563             if (skb->ip_summed == CHECKSUM_PARTIAL) {
2564                 if (skb->encapsulation)
2565                     skb_set_inner_transport_header(skb,
2566                         skb_checksum_start_offset(skb));
2567                 else
2568                     skb_set_transport_header(skb,
2569                         skb_checksum_start_offset(skb));
2570                 if (!(features & NETIF_F_ALL_CSUM) &&
2571                      skb_checksum_help(skb))
2572                     goto out_kfree_skb;
2573             }
2574         }
2575
2576         if (!list_empty(&ptype_all))
2577             dev_queue_xmit_nit(skb, dev);
2578
2579         skb_len = skb->len;
2580         rc = ops->ndo_start_xmit(skb, dev);//hand the packet to the driver's transmit routine
2581         trace_net_dev_xmit(skb, rc, dev, skb_len);
2582         if (rc == NETDEV_TX_OK)
2583             txq_trans_update(txq);
2584         return rc;
2585     }
2586
2587 gso:  //loop over the skb list produced by software GSO and transmit each segment
2588     do {
2589         struct sk_buff *nskb = skb->next;
2590
2591         skb->next = nskb->next;
2592         nskb->next = NULL;
2593
2594         if (!list_empty(&ptype_all))
2595             dev_queue_xmit_nit(nskb, dev);
2596
2597         skb_len = nskb->len;
2598         rc = ops->ndo_start_xmit(nskb, dev);
2599         trace_net_dev_xmit(nskb, rc, dev, skb_len);
2600         if (unlikely(rc != NETDEV_TX_OK)) {
2601             if (rc & ~NETDEV_TX_MASK)
2602                 goto out_kfree_gso_skb;
2603             nskb->next = skb->next;
2604             skb->next = nskb;
2605             return rc;
2606         }
2607         txq_trans_update(txq);
2608         if (unlikely(netif_xmit_stopped(txq) && skb->next))
2609             return NETDEV_TX_BUSY;
2610     } while (skb->next);
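
  The netif_needs_gso check at line 2549 is what routes an skb into the software GSO path: it is true when the skb is a GSO skb but the device either cannot segment it itself or the checksum state is not offload-friendly. For reference, in this kernel generation it reads roughly (include/linux/netdevice.h):

static inline bool skb_gso_ok(struct sk_buff *skb, netdev_features_t features)
{
    return net_gso_ok(features, skb_shinfo(skb)->gso_type) &&
           (!skb_has_frag_list(skb) || (features & NETIF_F_FRAGLIST));
}

static inline bool netif_needs_gso(struct sk_buff *skb,
                                   netdev_features_t features)
{
    return skb_is_gso(skb) && (!skb_gso_ok(skb, features) ||
        unlikely((skb->ip_summed != CHECKSUM_PARTIAL) &&
                 (skb->ip_summed != CHECKSUM_UNNECESSARY)));
}
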
  The dev_gso_segment function:
2436 static int dev_gso_segment(struct sk_buff *skb, netdev_features_t features)
2437 {
2438     struct sk_buff *segs;
2439
2440     segs = skb_gso_segment(skb, features);//split the skb into multiple segments
2441
2442     /* Verifying header integrity only. */
2443     if (!segs)
2444         return 0;
2445
2446     if (IS_ERR(segs))
2447         return PTR_ERR(segs);
2448
2449     skb->next = segs;
2450     DEV_GSO_CB(skb)->destructor = skb->destructor;
2451     skb->destructor = dev_gso_skb_destructor;
2452
2453     return 0;
2454 }

  dev_gso_segment splits one skb into several skbs linked into a list, and points the original skb's next pointer at the head of that list. At transmit time only the skbs reachable through skb->next are sent; the original skb itself no longer needs to be transmitted.

  skb_gso_segment is a wrapper around __skb_gso_segment, which calls skb_mac_gso_segment to do the splitting:

2280 struct sk_buff *skb_mac_gso_segment(struct sk_buff *skb,
2281                     netdev_features_t features)
2282 {
2283     struct sk_buff *segs = ERR_PTR(-EPROTONOSUPPORT);
2284     struct packet_offload *ptype;  
2285     __be16 type = skb_network_protocol(skb);
2286
2287     if (unlikely(!type))
2288         return ERR_PTR(-EINVAL);
2289
2290     __skb_pull(skb, skb->mac_len);
2291
2292     rcu_read_lock();     
2293     list_for_each_entry_rcu(ptype, &offload_base, list) {
2294         if (ptype->type == type && ptype->callbacks.gso_segment) {
2295             if (unlikely(skb->ip_summed != CHECKSUM_PARTIAL)) {//the skb is not in the "partial checksum prepared by software, hardware finishes the rest" state
2296                 int err;
2297
2298                 err = ptype->callbacks.gso_send_check(skb);
2299                 segs = ERR_PTR(err);           
2300                 if (err || skb_gso_ok(skb, features))//an error occurred, or the NIC can now perform GSO itself; the latter happens when the hardware supports GSO but software had to prepare the partial checksum first
2301                     break;
2302                 __skb_push(skb, (skb->data -
2303                          skb_network_header(skb)));
2304             }
2305             segs = ptype->callbacks.gso_segment(skb, features);//call inet_gso_segment or ipv6_gso_segment to split the skb and chain the resulting skbs together
2306             break;
2307         }
2308     }
2309     rcu_read_unlock();
2310
2311     __skb_push(skb, skb->data - skb_mac_header(skb));
2312
2313     return segs;
2314 }
  Line 2298: inet_gso_send_check or ipv6_gso_send_check is called to verify that there is enough linear data for the header pulls (linearizing non-contiguous data if necessary), and then tcp_v4_gso_send_check or tcp_v6_gso_send_check computes the partial (pseudo-header) checksum.
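
  For IPv4/TCP that preparation amounts to zeroing th->check, marking the skb CHECKSUM_PARTIAL and filling in the pseudo-header checksum; roughly (net/ipv4/tcp_ipv4.c):

static int tcp_v4_gso_send_check(struct sk_buff *skb)
{
    const struct iphdr *iph;
    struct tcphdr *th;

    if (!pskb_may_pull(skb, sizeof(*th)))
        return -EINVAL;

    iph = ip_hdr(skb);
    th = tcp_hdr(skb);

    th->check = 0;
    skb->ip_summed = CHECKSUM_PARTIAL;
    __tcp_v4_send_check(skb, iph->saddr, iph->daddr);
    return 0;
}
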
  The inet_gso_segment function:

1278 static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
1279     netdev_features_t features)
1280 {
1281     struct sk_buff *segs = ERR_PTR(-EINVAL);
1282     const struct net_offload *ops;
1283     struct iphdr *iph;   
1284     int proto;
1285     int ihl;
1286     int id;
1287     unsigned int offset = 0;
1288     bool tunnel;
1289
1290     if (unlikely(skb_shinfo(skb)->gso_type &
1291              ~(SKB_GSO_TCPV4 |
1292                SKB_GSO_UDP |
1293                SKB_GSO_DODGY |
1294                SKB_GSO_TCP_ECN |              
1295                SKB_GSO_GRE |
1296                SKB_GSO_TCPV6 |
1297                SKB_GSO_UDP_TUNNEL |           
1298                0)))
1299         goto out;
1300
1301     if (unlikely(!pskb_may_pull(skb, sizeof(*iph)))) //make sure at least a basic IP header can be pulled, linearizing non-contiguous data into the linear area if necessary
1302         goto out;
1303
1304     iph = ip_hdr(skb);
1305     ihl = iph->ihl * 4;
1306     if (ihl < sizeof(*iph))
1307         goto out;
1308
1309     if (unlikely(!pskb_may_pull(skb, ihl)))  //make sure the full IP header (including options) can be pulled, linearizing if necessary
1310         goto out;
1311
1312     tunnel = !!skb->encapsulation;
1313
1314     __skb_pull(skb, ihl);  //skb->data now points at the TCP header
1315     skb_reset_transport_header(skb);
1316     iph = ip_hdr(skb);
1317     id = ntohs(iph->id);
1318     proto = iph->protocol;
1319     segs = ERR_PTR(-EPROTONOSUPPORT);
1320
1321     rcu_read_lock();
1322     ops = rcu_dereference(inet_offloads[proto]);
1323     if (likely(ops && ops->callbacks.gso_segment))
1324         segs = ops->callbacks.gso_segment(skb, features);//calls tcp_tso_segment to split the skb
1325     rcu_read_unlock();
1326
1327     if (IS_ERR_OR_NULL(segs))
1328         goto out;
1329
1330     skb = segs;
1331     do {//fix up the IP header of every resulting segment
1332         iph = ip_hdr(skb);
1333         if (!tunnel && proto == IPPROTO_UDP) {
1334             iph->id = htons(id);
1335             iph->frag_off = htons(offset >> 3);
1336             if (skb->next != NULL)
1337                 iph->frag_off |= htons(IP_MF);
1338             offset += (skb->len - skb->mac_len - iph->ihl * 4);
1339         } else  {
1340             iph->id = htons(id++);
1341         }
1342         iph->tot_len = htons(skb->len - skb->mac_len);
1343         iph->check = 0;
1344         iph->check = ip_fast_csum(skb_network_header(skb), iph->ihl);//recompute the IP header checksum
1345     } while ((skb = skb->next));
1346
1347 out:
1348     return segs;
1349 }
   The tcp_tso_segment function:
2885 struct sk_buff *tcp_tso_segment(struct sk_buff *skb,
2886     netdev_features_t features)
2887 {       
2888     struct sk_buff *segs = ERR_PTR(-EINVAL);
2889     struct tcphdr *th;
2890     unsigned int thlen;
2891     unsigned int seq;
2892     __be32 delta;
2893     unsigned int oldlen;
2894     unsigned int mss;
2895     struct sk_buff *gso_skb = skb;
2896     __sum16 newcheck;
2897     bool ooo_okay, copy_destructor;
2898
2899     if (!pskb_may_pull(skb, sizeof(*th)))//make sure a basic TCP header can be pulled, linearizing non-contiguous data if necessary
2900         goto out;
2901         
2902     th = tcp_hdr(skb);
2903     thlen = th->doff * 4;
2904     if (thlen < sizeof(*th))
2905         goto out;
2906     
2907     if (!pskb_may_pull(skb, thlen))//make sure the full TCP header can be pulled, linearizing if necessary
2908         goto out;
2909
2910     oldlen = (u16)~skb->len;
2911     __skb_pull(skb, thlen);  //skb->data now points at the payload following the TCP header
2912
2913     mss = skb_shinfo(skb)->gso_size;
2914     if (unlikely(skb->len <= mss))//the skb holds no more than one MSS, no segmentation needed
2915         goto out;
2916
2917     if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
2918         /* Packet is from an untrusted source, reset gso_segs. */
2919         int type = skb_shinfo(skb)->gso_type;
2920
2921         if (unlikely(type &
2922                  ~(SKB_GSO_TCPV4 |
2923                    SKB_GSO_DODGY |
2924                    SKB_GSO_TCP_ECN |
2925                    SKB_GSO_TCPV6 |
2926                    SKB_GSO_GRE |
2927                    SKB_GSO_UDP_TUNNEL |
2928                    0) ||
2929                  !(type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))))
2930             goto out;
2931
2932         skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss);
2933
2934         segs = NULL;
2935         goto out;
2936     }
2937
2938     copy_destructor = gso_skb->destructor == tcp_wfree;
2939     ooo_okay = gso_skb->ooo_okay;
2940     /* All segments but the first should have ooo_okay cleared */
2941     skb->ooo_okay = 0;
2942
2943     segs = skb_segment(skb, features);//split the skb and chain the resulting segments into a list
2944     if (IS_ERR(segs))
2945         goto out;
2946
2947     /* Only first segment might have ooo_okay set */
2948     segs->ooo_okay = ooo_okay;
2949
2950     delta = htonl(oldlen + (thlen + mss));
2951
2952     skb = segs;
2953     th = tcp_hdr(skb);
2954     seq = ntohl(th->seq);
2955
2956     newcheck = ~csum_fold((__force __wsum)((__force u32)th->check +
2957                            (__force u32)delta));
2958
2959     do {//fix up the TCP header of every segment
2960         th->fin = th->psh = 0;
2961         th->check = newcheck;
2962
2963         if (skb->ip_summed != CHECKSUM_PARTIAL)
2964             th->check =
2965                  csum_fold(csum_partial(skb_transport_header(skb),
2966                             thlen, skb->csum));
2967
2968         seq += mss;
2969         if (copy_destructor) {
2970             skb->destructor = gso_skb->destructor;
2971             skb->sk = gso_skb->sk;
2972             /* {tcp|sock}_wfree() use exact truesize accounting :
2973              * sum(skb->truesize) MUST be exactly be gso_skb->truesize
2974              * So we account mss bytes of 'true size' for each segment.
2975              * The last segment will contain the remaining.
2976              */
2977             skb->truesize = mss;
2978             gso_skb->truesize -= mss;
2979         }
2980         skb = skb->next;
2981         th = tcp_hdr(skb);
2982
2983         th->seq = htonl(seq);
2984         th->cwr = 0;
2985     } while (skb->next);
2986
2987     /* Following permits TCP Small Queues to work well with GSO :
2988      * The callback to TCP stack will be called at the time last frag
2989      * is freed at TX completion, and not right now when gso_skb
2990      * is freed by GSO engine
2991      */
2992     if (copy_destructor) {
2993         swap(gso_skb->sk, skb->sk);
2994         swap(gso_skb->destructor, skb->destructor);
2995         swap(gso_skb->truesize, skb->truesize);
2996     }
2997
2998     delta = htonl(oldlen + (skb->tail - skb->transport_header) +
2999               skb->data_len);
3000     th->check = ~csum_fold((__force __wsum)((__force u32)th->check +
3001                 (__force u32)delta));
3002     if (skb->ip_summed != CHECKSUM_PARTIAL)
3003         th->check = csum_fold(csum_partial(skb_transport_header(skb),
3004                            thlen, skb->csum));
3005
3006 out:
3007     return segs;
3008 }
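
  The arithmetic around oldlen and delta is worth spelling out: the TCP checksum covers a pseudo-header that contains the TCP length, so when a large segment of skb->len bytes is cut into pieces of thlen + mss bytes, each piece's checksum has to be adjusted by the length difference. oldlen = (u16)~skb->len is the one's-complement of the old length, delta = htonl(oldlen + (thlen + mss)) folds in the new length, and adding delta to th->check (via csum_fold) yields the checksum every full-sized segment should carry; the last, possibly shorter segment gets its own delta from its actual data length at lines 2998-3001. If the hardware cannot finish the checksum (ip_summed != CHECKSUM_PARTIAL), the checksum over the segment is instead computed fully in software.
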
  skb_segment is where the actual splitting takes place:
2762 struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features)
2763 {
2764     struct sk_buff *segs = NULL;
2765     struct sk_buff *tail = NULL;
2766     struct sk_buff *fskb = skb_shinfo(skb)->frag_list;
2767     unsigned int mss = skb_shinfo(skb)->gso_size;
2768     unsigned int doffset = skb->data - skb_mac_header(skb); //mac header + ip header + tcp header
2769     unsigned int offset = doffset;
2770     unsigned int tnl_hlen = skb_tnl_header_len(skb); //length of the tunnel header, if any
2771     unsigned int headroom;
2772     unsigned int len;
2773     __be16 proto;
2774     bool csum;
2775     int sg = !!(features & NETIF_F_SG);  //scatter-gather I/O is supported
2776     int nfrags = skb_shinfo(skb)->nr_frags;
2777     int err = -ENOMEM;
2778     int i = 0;
2779     int pos;
2780 
2781     proto = skb_network_protocol(skb);
2782     if (unlikely(!proto))
2783         return ERR_PTR(-EINVAL);
2784 
2785     csum = !!can_checksum_protocol(features, proto);
2786     __skb_push(skb, doffset);
2787     headroom = skb_headroom(skb);
2788     pos = skb_headlen(skb);  //amount of data in the skb's linear area
2789 
2790     do {
2791         struct sk_buff *nskb;
2792         skb_frag_t *frag;
2793         int hsize;
2794         int size;
2795 
2796         len = skb->len - offset;
2797         if (len > mss)
2798             len = mss;
2799 
2800         hsize = skb_headlen(skb) - offset;
2801         if (hsize < 0)
2802             hsize = 0;
2803         if (hsize > len || !sg)
2804             hsize = len;
2805 
2806         if (!hsize && i >= nfrags) {
2807             BUG_ON(fskb->len != len);
2808 
2809             pos += len;
2810             nskb = skb_clone(fskb, GFP_ATOMIC);
2811             fskb = fskb->next;
2812 
2813             if (unlikely(!nskb))
2814                 goto err;
2815 
2816             hsize = skb_end_offset(nskb);
2817             if (skb_cow_head(nskb, doffset + headroom)) {
2818                 kfree_skb(nskb);
2819                 goto err;
2820             }
2821 
2822             nskb->truesize += skb_end_offset(nskb) - hsize;
2823             skb_release_head_state(nskb);
2824             __skb_push(nskb, doffset);
2825         } else {
2826             nskb = __alloc_skb(hsize + doffset + headroom,
2827                        GFP_ATOMIC, skb_alloc_rx_flag(skb),
2828                        NUMA_NO_NODE);
2829 
2830             if (unlikely(!nskb))
2831                 goto err;
2832 
2833             skb_reserve(nskb, headroom);
2834             __skb_put(nskb, doffset);
2835         }
2836 
2837         if (segs)
2838             tail->next = nskb;
2839         else
2840             segs = nskb;
2841         tail = nskb;
2842 
2843         __copy_skb_header(nskb, skb);
2844         nskb->mac_len = skb->mac_len;
2845 
2846         /* nskb and skb might have different headroom */
2847         if (nskb->ip_summed == CHECKSUM_PARTIAL)
2848             nskb->csum_start += skb_headroom(nskb) - headroom;
2849 
2850         skb_reset_mac_header(nskb);
2851         skb_set_network_header(nskb, skb->mac_len);
2852         nskb->transport_header = (nskb->network_header +
2853                       skb_network_header_len(skb));
2854 
2855         skb_copy_from_linear_data_offset(skb, -tnl_hlen,
2856                          nskb->data - tnl_hlen,
2857                          doffset + tnl_hlen);
2858 
2859         if (fskb != skb_shinfo(skb)->frag_list)
2860             continue;
2861 
2862         if (!sg) {
2863             nskb->ip_summed = CHECKSUM_NONE;
2864             nskb->csum = skb_copy_and_csum_bits(skb, offset,
2865                                 skb_put(nskb, len),
2866                                 len, 0);
2867             continue;
2868         }
2869 
2870         frag = skb_shinfo(nskb)->frags;
2871 
2872         skb_copy_from_linear_data_offset(skb, offset,
2873                          skb_put(nskb, hsize), hsize);
2874 
2875         skb_shinfo(nskb)->tx_flags = skb_shinfo(skb)->tx_flags & SKBTX_SHARED_FRAG;
2876 
2877         while (pos < offset + len && i < nfrags) {  
2878             *frag = skb_shinfo(skb)->frags[i];  //attach the i-th frag page of skb to nskb's frags array
2879             __skb_frag_ref(frag);  //take a reference on the frag page
2880             size = skb_frag_size(frag);
2881 
2882             if (pos < offset) {
2883                 frag->page_offset += offset - pos;
2884                 skb_frag_size_sub(frag, offset - pos);
2885             }
2886 
2887             skb_shinfo(nskb)->nr_frags++;
2888 
2889             if (pos + size <= offset + len) {
2890                 i++;
2891                 pos += size;
2892             } else {
2893                 skb_frag_size_sub(frag, pos + size - (offset + len));
2894                 goto skip_fraglist;  //this MSS-sized segment is full; no need to touch the frag_list
2895             }
2896 
2897             frag++;
2898         }
2899 
2900         if (pos < offset + len) {
2901             struct sk_buff *fskb2 = fskb;
2902 
2903             BUG_ON(pos + fskb->len != offset + len);
2904 
2905             pos += fskb->len;
2906             fskb = fskb->next;
2907 
2908             if (fskb2->next) {
2909                 fskb2 = skb_clone(fskb2, GFP_ATOMIC);
2910                 if (!fskb2)
2911                     goto err;
2912             } else
2913                 skb_get(fskb2);
2914 
2915             SKB_FRAG_ASSERT(nskb);
2916             skb_shinfo(nskb)->frag_list = fskb2;
2917         }
2918 
2919 skip_fraglist:
2920         nskb->data_len = len - hsize;
2921         nskb->len += nskb->data_len;
2922         nskb->truesize += nskb->data_len;
2923 
2924         if (!csum) {
2925             nskb->csum = skb_checksum(nskb, doffset,
2926                           nskb->len - doffset, 0);
2927             nskb->ip_summed = CHECKSUM_NONE;
2928         }
2929     } while ((offset += len) < skb->len);
2930 
2931     return segs;
2932 
2933 err:
2934     while ((skb = segs)) {
2935         segs = skb->next;
2936         kfree_skb(skb);
2937     }
2938     return ERR_PTR(err);
2939 }

  Line 2775: scatter-gather I/O is a NIC capability that lets data stored in an skb's non-contiguous space (the frag pages) be transmitted directly.

  Lines 2796-2798: len is skb->len minus everything up to offset. skb->data points at the MAC header, and skb->len covers MAC header + IP header + TCP header + payload. Initially offset equals the length of MAC header + IP header + TCP header (i.e. doffset), so len is the length of the TCP payload. As segments are produced, offset grows by mss on each iteration, so len is the payload length of the current segment (the last segment's payload may be shorter than one MSS).

  Lines 2800-2802: hsize is skb_headlen minus offset, where skb_headlen is skb->len minus the amount of data in the non-linear area, i.e. the amount of payload still in the linear area. A negative hsize means the remaining payload lives in the skb's frags or frag_list. Since offset keeps growing, hsize will eventually stay negative, unless the skb is fully linearized.

  Lines 2803-2804: if hsize is larger than len, or scatter-gather is not supported, hsize is capped at len; in the !sg case the whole payload of this segment will end up in the new skb's linear area.

  Lines 2806-2824: if the linear area has no data left and the frag pages are exhausted, the remaining data lives in skbs on the frag_list; those skbs are cloned and used directly as segments. Only their data is reused, though; the protocol headers still have to be copied over from the original skb (see lines 2855-2857).

  Lines 2826-2834: if there is still data in the linear area or usable frag pages, a new skb is allocated instead; headroom bytes are reserved between skb->head and skb->data, and space for MAC header + IP header + TCP header + hsize bytes is reserved between skb->data and skb->tail.

  Lines 2837-2841: link the new skb onto the segment list.

  Lines 2855-2857: copy doffset bytes starting at skb->data into nskb->data, i.e. copy the MAC, IP and TCP headers over.

  Lines 2859-2860: once the data in the linear area and the frag pages has been used up and segments are being taken from the frag_list, the frag_list skbs themselves become the segments, so the copy steps below are skipped.

  Lines 2862-2867: if the NIC does not support scatter-gather I/O, the len bytes starting at offset (wherever they live: linear area, frag pages or frag_list) are copied into nskb's linear area with the checksum computed on the fly, and the frag-page/frag_list handling below is skipped.

  Lines 2872-2873: if hsize is non-zero, copy hsize bytes into nskb's linear area.

  Lines 2877-2894: if the data taken from the linear area is not enough to fill an MSS-sized segment and there is data in the frag pages, attach the skb's frag pages to nskb's frags array, adjusting page offsets and fragment sizes as needed.

  Lines 2900-2916: if even the frag pages cannot satisfy the MSS, the corresponding skb from the original skb's frag_list is placed on nskb's frag_list.

  Line 2929: after one nskb is complete, move on to the next segment, until all of the skb's data has been distributed over the segment list; each nskb carries MSS bytes of data, except possibly the last one.

  In short, skb_segment splits an skb by pouring the data from its linear area, frag pages and frag_list, in that order, into a series of nskbs whose data length is fixed at one MSS. If the NIC does not support scatter-gather I/O, each nskb's data is placed entirely in its linear area; if it does, then once the original linear data has been copied into an nskb's linear area, parts of the original frag pages can be attached to the nskb's frag pages and parts of the original frag_list to the nskb's frag_list, but the total of an nskb's linear, frag-page and frag_list data still never exceeds one MSS.

  So dev_gso_segment implements in software exactly the segmentation work that a TSO-capable NIC performs in hardware. If the NIC supports TSO, none of this has to be done by software, and the CPU is relieved of a significant amount of work.

  For TCP, TSO is used as follows: when a send system call reaches tcp_sendmsg, the function allocates skbs whose linear area can hold more than one MSS and also places data into the skb's frag pages, so a single skb can carry far more than one MSS of data. When the lower layers transmit it, the skb is split into multiple skbs each carrying one MSS and then sent. If that splitting is done by the NIC, TCP performance improves accordingly.
