TCP/IP Protocol Stack Analysis: IP Fragmentation

 http://hi.chinaunix.net/?uid-21365544-action-viewspace-itemid-42476

 

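I have been studying the Linux TCP/IP source code for more than half a year. Right now I am working through the IP layer, and I have just taken a close look at fragmentation and reassembly.

That part of the analysis is mostly done, and I plan to write a user-space program that simulates IP fragmentation and reassembly. If you have ideas, or if you spot a mistake in my reading of the function, please leave a comment and let's discuss. O(∩_∩)O~

Below is my analysis of the main fragmentation function, ip_fragment(). It is supposedly one of the most troublesome functions in TCP/IP (that is what my teacher said at first, but he changed his mind once he got to the TCP code, heh).

Kernel source version: 2.6.27.5

Reference books: Understanding Linux Network Internals

                 TCP/IP Illustrated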

 

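/*
 *  This IP datagram is too large to be sent in one piece.  Break it up into
 *  smaller pieces (each of size equal to IP header plus
 *  a block of the data of the original IP data part) that will yet fit in a
 *  single device frame, and queue such a frame for sending.
 */
/*
 * Started analyzing this function on 2009-04-15.
 * If a packet (forwarded or not) fits within the MTU, ip_finish_output passes
 * it straight on for transmission and ip_fragment is never called.
 * Two cases have to be considered when reading this function:
 * 1. Forwarded packets, i.e. data not generated by a local upper layer.
 *    Normally no skbs hang off frag_list, so any further fragmentation can
 *    only happen on the slow path.
 * 2. Locally generated data, which splits into three sub-cases:
 *    a. The upper layer pre-split the data and the fast-path checks pass,
 *       so fragmentation only has to add an IP header to each piece.
 *    b. The upper layer pre-split the data but a fast-path check fails, so
 *       we jump to the slow path, which re-fragments the data and adds headers.
 *    c. The upper layer did no pre-splitting at all, so the packet goes
 *       straight to the slow path.
 *                                                       ---- cdc, 2009-05-21
 */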
/* On both the fast path and the slow path, skb already carries an initialized IP header. */
int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff*))
{
    struct iphdr *iph;      /* pointer to the IP header */
    int raw = 0;
    int ptr;
    struct net_device *dev;
    struct sk_buff *skb2;   /* destination skb on the slow path: each fragment's data is
                             * copied into the buffer skb2 points to, and skb2 is what
                             * gets sent */
    unsigned int mtu, hlen, left, len, ll_rs, pad;
    int offset;             /* byte offset of the current fragment within the original datagram */
    __be16 not_last_frag;   /* non-zero if the packet being split is not the last fragment
                             * of its datagram */
    struct rtable *rt = skb->rtable;    /* the output route */
    int err = 0;

    dev = rt->u.dst.dev;    /* the output device of the route */

    /*
     *  Point into the IP datagram header.
     */

    iph = ip_hdr(skb);      /* get a pointer to the IP header */

    /* If the input IP packet cannot be fragmented because the source
     * has set the DF flag, ip_fragment sends an ICMP packet back to
     * the source to notify it of the problem, and then drops the packet.
     *
     * unlikely() tells the compiler that the condition is rarely true.
     * IP_DF = 0x4000. The local_df flag shown in the if condition is set
     * mainly by the Virtual Server code when it does not want the
     * condition just described to generate an ICMP message.
     */
    if (unlikely((iph->frag_off & htons(IP_DF)) && !skb->local_df)) {
        IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
        /* cannot fragment: send an ICMP notification back to the sender */
        icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
              htonl(ip_skb_dst_mtu(skb)));
        kfree_skb(skb);     /* cannot fragment: free the sk_buff */
        return -EMSGSIZE;
    }

    /*
     *  Setup starting values.
     */

    hlen = iph->ihl * 4;    /* L3 header length in bytes. ihl is the 4-bit header-length
                             * field, counted in 4-byte words; ihl is at most 15, so
                             * hlen is at most 60 */
    mtu = dst_mtu(&rt->u.dst) - hlen;   /* Size of data space */
    /*
     * The IPCB(skb) macro is defined as ((struct inet_skb_parm*)((skb)->cb)):
     * the cb area of an sk_buff is interpreted as a struct inet_skb_parm.
     * Every skb of a datagram that has been through fragmentation gets
     * IPSKB_FRAG_COMPLETE set in its cb flags.
     */
    IPCB(skb)->flags |= IPSKB_FRAG_COMPLETE;

    /* When frag_list is given, use it. First, check its validity:
     * some transformers could create wrong frag_list or break existing
     * one, it is not prohibited. In this case fall back to copying.
     *
     * LATER: this step can be merged to real generation of fragments,
     * we can switch to copy when see the first bad fragment.
     */
    /* The skb_shinfo(skb) macro returns a pointer to the skb_shared_info structure.
     * The skb has no direct pointer to that structure, so the macro must be used. */
    /* Fast path */
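    if (skb_shinfo(skb)->frag_list) {
        struct sk_buff *frag;
        /*
         * Note what the next line is for: on the fast path, first_len is the
         * length of the head skb alone, i.e. its linear buffer plus the page
         * fragments referenced by frags[], without the skbs chained on
         * frag_list. After ip_push_pending_frames, skb->len covers the whole
         * datagram; it is left untouched here in case we fall back to the
         * slow path.
         */
        int first_len = skb_pagelen(skb);
        int truesizes = 0;      /* running sum of the truesize (buffer size plus
                                 * sizeof(struct sk_buff)) of the frag_list skbs */

        if (first_len - hlen > mtu ||       /* the head's data area exceeds the MTU */
            /* Not a multiple of 8 bytes: take the slow path, because every
             * fragment except the last must carry a multiple of 8 bytes. */
            ((first_len - hlen) & 7) ||
            /*
             * For a locally generated datagram, ip_push_pending_frames leaves
             * MF and the offset bits clear (frag_off is 0 or IP_DF = 0x4000),
             * and IP_MF|IP_OFFSET = 0x3FFF, so this test is false.
             *
             * A forwarded packet may itself be a fragment. In that case MF or
             * the offset is non-zero and we take the slow path, where the real
             * byte offset is recovered from iph->frag_off with << 3 and used
             * to compute frag_off for each new fragment.
             */
            (iph->frag_off & htons(IP_MF|IP_OFFSET)) ||
            /*
             * skb_cloned() returns 1 if the skb's data buffer is shared with a
             * clone and 0 otherwise. A shared buffer must not be modified, so
             * IP headers cannot be written in place. The slow path is still
             * allowed, because it only copies the data out into new skbs and
             * never modifies the original buffer.
             */
            skb_cloned(skb))
            goto slow_path;
        /*
         * Walk the skb list built on frag_list.
         *
         * The main job of ip_append_data is just to create the socket buffers
         * (skbs) for the outgoing data. Using the MTU of the output device found
         * by the route lookup, it splits application data longer than the MTU
         * across several skbs, one chunk per skb, and places them on the socket's
         * send queue, sk_write_queue. It does not build a network-layer header
         * for any of them. Later, ip_push_pending_frames chains all the skbs of
         * the send queue onto the frag_list of the struct skb_shared_info that
         * sits at the end pointer of the first skb, and adds a network-layer
         * header to the first skb only. So the whole application data is already
         * spread over the skbs; ip_append_data does this just to prepare for the
         * real IP fragmentation that follows. The list here amounts to a single
         * packet whose parts are stored in many skbs.
         */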
        for (frag = skb_shinfo(skb)->frag_list; frag; frag = frag->next) {
            /* Correct geometry. */
            if (frag->len > mtu ||
                /* every fragment except the last must be a multiple of 8 bytes long */
                ((frag->len & 7) && frag->next) ||
                /*
                 * skb_headroom() returns the free space between head and data.
                 * If it cannot hold the L3 header, fall back to the slow path.
                 * This should rarely happen, because the buffers are allocated
                 * with worst-case headroom.
                 */
                skb_headroom(frag) < hlen)
                goto slow_path;

            /* Partially cloned skb? */
            /* If this skb is shared (more than one user holds a reference), it
             * must not be modified, so the IP header cannot be added in place.
             */
            if (skb_shared(frag))
                goto slow_path;

            BUG_ON(frag->sk);   /* effectively an assertion */
            if (skb->sk) {
                /*
                 * sock_hold, Grab socket reference count. This operation is valid only
                 * when sk is ALREADY grabbed f.e. it is found in hash table
                 * or a list and the lookup is made under lock preventing hash table
                 * modifications.
                 */
                sock_hold(skb->sk);
                /*
                 * Make frag->sk point to the socket that owns the head skb, so
                 * that the socket owns frag as well.
                 * struct sock *sk: This is a pointer to a sock data structure of the
                 * socket that owns this buffer. This pointer is needed when data is
                 * either locally generated or being received by a local process,
                 * because the data and socket-related information is used by L4 (TCP
                 * or UDP) and by the user application. When a buffer is merely being
                 * forwarded (that is, neither the source nor the destination is on the
                 * local machine), this pointer is NULL.
                 */
                frag->sk = skb->sk;
                /*
                 * Set the destructor callback.
                 * void (*destructor): When the buffer belongs to a socket, it is
                 * usually set to sock_rfree or sock_wfree (by the skb_set_owner_r and
                 * skb_set_owner_w initialization functions, respectively). The two
                 * sock_xxx routines are used to update the amount of memory held by
                 * the socket in its queues.
                 */
                frag->destructor = sock_wfree;
                truesizes += frag->truesize;    /* accumulate the truesize of every skb except the head */
            }
        }   /* end of the walk */

        /* Everything is OK. Generate! */

        err = 0;
        offset = 0;
        frag = skb_shinfo(skb)->frag_list;
        skb_shinfo(skb)->frag_list = NULL;  /* detach the chained skbs from frag_list; frag now points to them */
        /*
         * Shrink data_len so that it covers only the head skb's page fragments.
         * Before this, data_len = the head's page-fragment size plus the sum of
         * the len of all chained skbs.
         */
        skb->data_len = first_len - skb_headlen(skb);
        /*
         * Leave the head skb with its own truesize only. After
         * ip_push_pending_frames, the head's truesize was the sum over all the
         * skbs, much like len and data_len above.
         */
        skb->truesize -= truesizes;
        skb->len = first_len;               /* len now covers the head skb only, not the chained skbs */
        iph->tot_len = htons(first_len);    /* multi-byte header fields go out in network byte order */
        /*
         * IP_MF is 0x2000 and the offset bits stay 0, so this marks the head
         * as the first fragment (offset 0) with more fragments to follow.
         */
        iph->frag_off = htons(IP_MF);
        ip_send_check(iph);                 /* compute the IP header checksum of the first fragment */
        /* add an IP header to each follow-on fragment */
        for (;;) {
            /* Prepare header of the next frame,
             * before previous one went down. */
            if (frag) {
                frag->ip_summed = CHECKSUM_NONE;    /* ip_summed holds the checksum status flag; CHECKSUM_NONE = 0 */
                skb_reset_transport_header(frag);   /* transport_header now marks the current skb->data,
                                                     * i.e. where the L4 data starts */
                __skb_push(frag, hlen);     /* move data back by hlen and do skb->len += hlen, which
                                             * opens room in front of the L4 data for the IP header */
                skb_reset_network_header(frag);     /* like skb_reset_transport_header: network_header
                                                     * now marks the start of the IP header */
                /*
                 * memcpy copies count bytes and returns dest. A generic C version:
                 *
                 * void *memcpy(void *dest, const void *src, size_t count)
                 * {
                 *     char *tmp = dest;
                 *     const char *s = src;
                 *
                 *     while (count--)
                 *         *tmp++ = *s++;
                 *     return dest;
                 * }
                 */
                /* Copy the IP header into this skb. skb_network_header() returns
                 * skb->head + skb->network_header, i.e. a pointer to the L3 (IP) header.
                 */
                memcpy(skb_network_header(frag), iph, hlen);
                iph = ip_hdr(frag);                 /* iph now points to the IP header of frag */
                iph->tot_len = htons(frag->len);    /* host to network byte order */
                ip_copy_metadata(frag, skb);        /* copy skb bookkeeping fields such as protocol and
                                                     * priority; they do not affect the fragmentation itself */
                /*
                 * Runs once, for the first follow-on fragment: options that must not
                 * be copied into every fragment are overwritten with no-ops, so the
                 * later fragments can reuse this fixed header.
                 */
                if (offset == 0)
                    ip_options_fragment(frag);
                offset += skb->len - hlen;          /* advance by the payload of the fragment before this one */
                iph->frag_off = htons(offset>>3);   /* the fragment-offset field is 13 bits wide and counts
                                                     * 8-byte units; offset is in bytes, a multiple of 8 */
                if (frag->next != NULL)
                    iph->frag_off |= htons(IP_MF);
                /* Ready, complete checksum */
                ip_send_check(iph);
            }

            err = output(skb);  /* output() returns 0 on success and a non-zero error code on failure */

            if (!err)           /* sent successfully: count one fragment created */
                IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGCREATES);
            if (err || !frag)   /* stop on error or when no fragment is left */
                break;

            skb = frag;         /* the fragment just prepared becomes the one to send */
            frag = skb->next;   /* frag moves on to the next skb */
            skb->next = NULL;   /* unlink skb from the rest of the chain */
        }   /* end of header insertion */

        if (err == 0) {     /* fast path succeeded: update the statistics and return */
            IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGOKS);
            return 0;
        }

        while (frag) {      /* fast path failed */
            skb = frag->next;
            kfree_skb(frag);    /* free every fragment that was not sent */
            frag = skb;
        }
        IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
        return err;     /* return the error */
    }   /* end of fast path */

slow_path:
    left = skb->len - hlen;     /* Space per frame: the length of the payload still to be sent */
    ptr = raw + hlen;           /* Where to start from: initially the start of the payload,
                                 * afterwards the start of each successive fragment */

    /* for bridged IP traffic encapsulated inside f.e. a vlan header,
     * we need to make room for the encapsulating header
     */
    /* nf_bridge_pad: This is called by the IP fragmenting code and it ensures there is
     * enough room for the encapsulating header (if there is one)
     */
    pad = nf_bridge_pad(skb);   /* pad is the room needed for an encapsulating header such as VLAN or PPPoE */
    ll_rs = LL_RESERVED_SPACE_EXTRA(rt->u.dst.dev, pad);    /* ll_rs: Link Layer Reserved Space */
    /* room for L4 data should exclude the pad from mtu */
    mtu -= pad;

    /*
     *  Fragment the datagram.
     */

    /* Recover the byte offset of this packet within the original datagram;
     * the field counts 8-byte units. #define IP_OFFSET 0x1FFF */
    offset = (ntohs(iph->frag_off) & IP_OFFSET) << 3;
    not_last_frag = iph->frag_off & htons(IP_MF);   /* is true when more data is supposed to follow
                                                     * the current fragment in the packet */

    /*
     *  Keep copying data until we run out.
     */

    while (left > 0) {
        len = left;     /* start from all of the remaining payload */
        /* IF: it doesn't fit, use 'mtu' - the data space left */
        /* if the remaining payload exceeds the MTU, cap len at the MTU; otherwise len stays equal to left */
        if (len > mtu)
            len = mtu;
        /* IF: we are not sending upto and including the packet end
           then align the next start on an eight byte boundary */
        /* if this is not the last fragment, round len down to a multiple of 8 bytes */
        if (len < left) {
            len &= ~7;
        }
        /*
         *  Allocate buffer.
         *  The size of the buffer allocated to hold a fragment is the sum of:
         *
         *  The size of the IP payload   ->       len
         *
         *  The size of the IP header    ->       hlen
         *
         *  The size of the L2 header    ->       ll_rs
         *
         *
         */

        /*
         * The allocation size must be right, especially for the last fragment:
         * the buffer has to hold all of the remaining data, otherwise the copy
         * below fails.
         */
        if ((skb2 = alloc_skb(len+hlen+ll_rs, GFP_ATOMIC)) == NULL) {
            NETDEBUG(KERN_INFO "IP: frag: no memory for new fragment!\n");
            err = -ENOMEM;
            goto fail;
        }

        /*
         *  Set up data on packet
         */

        ip_copy_metadata(skb2, skb);    /* copy skb bookkeeping fields such as protocol and priority into skb2 */
        /* skb_reserve: Increase the headroom of an empty &sk_buff by reducing the tail
         * room. This is only allowed for an empty buffer.
         */
        skb_reserve(skb2, ll_rs);       /* reserve headroom for the L2 header in the empty buffer */
        skb_put(skb2, len + hlen);      /* then extend the data area to hold the IP header and the IP payload */
        /*  skb2->nh.raw = skb2->data;
         *  skb2->h.raw = skb2->data + hlen;
         *  The two statements below are equivalent to the two above: they
         *  initialize the pointers to the IP header and to the L4 header. */
        skb_reset_network_header(skb2);     /* network_header marks the start of the IP header */
        skb2->transport_header = skb2->network_header + hlen;   /* transport_header marks the L4 data, hlen bytes later */

        /*
         *  Charge the memory for the fragment to any owner
         *  it might possess
         */

        /* skb_set_owner_w ties an sk_buff to the given sock. The newly allocated
         * buffer is associated with the socket attempting the transmission, if any.
         */
        if (skb->sk)
            skb_set_owner_w(skb2, skb->sk);

        /*
         *  Copy the packet header into the new buffer.
         */

        skb_copy_from_linear_data(skb, skb_network_header(skb2), hlen);     /* copy the IP header into the fragment */

        /*
         *  Copy a block of the IP datagram.
         */
        /*
         * offset: the fragment offset, i.e. where each fragment sits in the original datagram.
         * ptr: the offset of the region being copied within the packet currently being fragmented.
         */
        /* copy the payload into the new skb (analysis of this function finished 2009-05-19) */
        if (skb_copy_bits(skb, ptr, skb_transport_header(skb2), len))
            BUG();
        left -= len;    /* remaining payload length */

        /*
         *  Fill in the new header fields.
         */
        iph = ip_hdr(skb2);
        iph->frag_off = htons((offset >> 3));

        /* ANK: dirty, but effective trick. Upgrade options only if
         * the segment to be fragmented was THE FIRST (otherwise,
         * options are already fixed) and make it ONCE
         * on the initial skb, so that all the following fragments
         * will inherit fixed options.
         */
        /*
         * The first fragment (where offset is 0) is special from the IP options
         * point of view because it is the only one that includes a full copy of
         * the options from the original IP packet. Not all the options have to be
         * replicated into all of the fragments; only the first fragment will
         * include all of them.
         * At this point the first fragment already holds its full copy of the
         * header, so ip_options_fragment(skb) rewrites the options in the
         * original skb and the following fragments inherit the fixed version.
         */
        if (offset == 0)
            ip_options_fragment(skb);

        /*
         *  Added AC : If we are fragmenting a fragment that's not the
         *         last fragment then keep MF on each bit
         */
        if (left > 0 || not_last_frag)
            iph->frag_off |= htons(IP_MF);
        /*
         * The following two statements update two offsets. It is easy to confuse
         * the two. offset is maintained because the packet currently being
         * fragmented may be a fragment of a larger packet; if so, offset
         * represents the offset of the current fragment within the original
         * packet (otherwise, it is simply 0). ptr is an offset within the packet
         * we are fragmenting and changes as the loop progresses. The two
         * variables have the same value in two cases: where the packet we are
         * fragmenting is not a fragment itself, and where this fragment is the
         * very first fragment.
         */
        ptr += len;
        offset += len;

        /*
         *  Put this fragment into the sending queue.
         */
        iph->tot_len = htons(len + hlen);   /* total length of the fragment: payload plus IP header */

        ip_send_check(iph);     /* compute the IP header checksum */

        err = output(skb2);     /* send the fragment */
        if (err)                /* error handling */
            goto fail;

        IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGCREATES);    /* no error: count one fragment created */
    }   /* end of slow path */
    kfree_skb(skb);     /* slow path succeeded: free the skb that was passed in */
    IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGOKS);    /* count one successful fragmentation */
    return err;
/* error handling */
fail:
    kfree_skb(skb);
    IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
    return err;
}   /* end of ip_fragment, 2009-05-19 */    // note this date, heh~
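
Since the plan is to follow this up with a user-space simulation, here is a minimal sketch of the slow-path arithmetic alone, written as ordinary C. It is not kernel code: struct frag_plan and plan_fragments() are hypothetical names made up for illustration, all values are kept in host byte order, no buffers are allocated or copied, and the bridge padding (pad) is assumed to be 0. It only reproduces how len, ptr, offset and the MF bit evolve in the while (left > 0) loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IP_MF     0x2000u   /* "more fragments" flag */
#define IP_OFFSET 0x1FFFu   /* mask for the 13-bit fragment offset */

/* Hypothetical record describing one planned fragment (not a kernel type). */
struct frag_plan {
    unsigned int ptr;       /* where the payload starts inside the packet being split */
    unsigned int len;       /* payload bytes carried by this fragment */
    uint16_t frag_off;      /* flags + offset field, host byte order */
    uint16_t tot_len;       /* IP header + payload, host byte order */
};

/*
 * Plan the fragments of a packet of total length pkt_len (header included),
 * with an hlen-byte IP header and an existing frag_off field in_frag_off
 * (host byte order). Returns the number of entries written to out, or 0 if
 * out is too small.
 */
static size_t plan_fragments(unsigned int pkt_len, unsigned int hlen,
                             unsigned int mtu, uint16_t in_frag_off,
                             struct frag_plan *out, size_t max_out)
{
    unsigned int left, ptr, offset, space, len;
    int not_last_frag;
    size_t n = 0;

    assert(pkt_len >= hlen);
    assert(mtu >= hlen + 8);    /* room for at least one 8-byte block */

    left = pkt_len - hlen;      /* payload still to send */
    ptr = hlen;                 /* read position in the input packet */
    space = mtu - hlen;         /* payload room per fragment */
    offset = (in_frag_off & IP_OFFSET) << 3;    /* byte offset in the original datagram */
    not_last_frag = (in_frag_off & IP_MF) != 0;

    while (left > 0) {
        if (n == max_out)
            return 0;
        len = left > space ? space : left;
        if (len < left)
            len &= ~7u;         /* every piece but the last: multiple of 8 */

        out[n].ptr = ptr;
        out[n].len = len;
        out[n].tot_len = (uint16_t)(len + hlen);
        out[n].frag_off = (uint16_t)(offset >> 3);
        left -= len;
        if (left > 0 || not_last_frag)
            out[n].frag_off |= IP_MF;   /* keep MF when splitting a non-last fragment */
        ptr += len;
        offset += len;
        n++;
    }
    return n;
}
```

For a 4020-byte packet with a 20-byte header and an MTU of 1500, this plans three fragments carrying 1480, 1480 and 1040 payload bytes, with offset fields 0, 185 and 370 and MF set on the first two. Feeding it a packet that is itself a non-last fragment keeps MF on every piece, which matches the not_last_frag handling above.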
