Linux kernel network protocol stack architecture: a full end-to-end walkthrough

Analysis of the Linux Kernel Protocol Stack


Contents

1      Overview
2      TCP protocol
2.1       Layering
2.2       TCP/IP layering
2.3       Internet addresses
2.4       Encapsulation
2.5       Demultiplexing
3      Packet formats
3.1       ethhdr
3.2       iphdr
3.3       udphdr
4      Data structures
4.1       Layered structure of the kernel protocol stack
4.2       msghdr
4.3       iovec
4.4       file
4.5       file_operations
4.6       socket
4.7       sock
4.8       sock_common
4.9       inet_sock
4.10     udp_sock
4.11     proto_ops
4.12     proto
4.13     net_proto_family
4.14     softnet_data
4.15     sk_buff
4.16     sk_buff_head
4.17     net_device
4.18     inet_protosw
4.19     inetsw_array
4.20     sock_type
4.21     IPPROTO
4.22     net_protocol
4.23     packet_type
4.24     rtable
4.25     dst_entry
4.26     napi_struct
5      Class diagram of the data structures
6      Protocol stack registration flow
6.1       Kernel boot flow
6.2       Protocol stack initialization flow
7      Protocol stack receive flow
7.1       Driver receive flow
7.2       Application-layer receive flow
8      Protocol stack transmit flow
9      Summary
10        References

 


 

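1      Overview

This document is based on linux-2.6.32. It is meant to help readers who already know some networking basics follow how a packet travels through the protocol stack: roughly which steps it passes through between being received by the network card and being delivered to the application layer, and what each step does. The figures are attached at the end.

Prerequisites: basic C, callback functions in C, basic UML modeling, object-oriented encapsulation as in C++, and TCP/IP or general networking fundamentals.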

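2      TCP protocol

This chapter is excerpted from Chapter 1 of [TCP/IP Illustrated, Volume 1].

2.1 Layering

Network protocols are usually developed in layers, with each layer responsible for a different aspect of communication. A protocol suite such as TCP/IP is a combination of protocols at several layers. TCP/IP is normally considered a four-layer system, as shown in Figure 1-1. Each layer has its own responsibility:

1) The link layer, sometimes called the data-link layer or network interface layer, normally includes the device driver in the operating system and the corresponding network interface card in the computer. Together they handle the details of physically interfacing with the cable (or whatever transmission medium is used).

2) The network layer, sometimes called the internet layer, handles the movement of packets around the network, for example packet routing. In the TCP/IP suite the network layer consists of IP (Internet Protocol), ICMP (Internet Control Message Protocol), and IGMP (Internet Group Management Protocol).

3) The transport layer provides end-to-end communication between applications on two hosts. The TCP/IP suite has two very different transport protocols: TCP (Transmission Control Protocol) and UDP (User Datagram Protocol). TCP provides highly reliable data communication between two hosts. Its work includes dividing the data handed to it by the application into appropriately sized chunks for the network layer below, acknowledging received packets, and setting timeouts to make sure the other end acknowledges the packets that were sent. Because the transport layer provides this reliable end-to-end communication, the application layer can ignore all these details. UDP, on the other hand, provides a much simpler service to the application layer. It just sends packets called datagrams from one host to another, with no guarantee that a datagram reaches the other end. Any required reliability must be provided by the application layer.

Each of these two transport protocols has its uses in different applications, as will be seen later.

4) The application layer handles the details of a particular application. Almost every TCP/IP implementation provides these common applications:

• Telnet: remote login.
• FTP: File Transfer Protocol.
• SMTP: Simple Mail Transfer Protocol.
• SNMP: Simple Network Management Protocol.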

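2.2 TCP/IP layering

The TCP/IP suite contains many protocols. Figure 1-4 shows the additional protocols discussed in the book.

TCP and UDP are the two best-known transport-layer protocols; both use IP as the network layer.

Although TCP uses the unreliable service of IP, it provides a reliable transport-layer service. Chapters 17 through 22 of the book examine the internal operation of TCP in detail. The book then covers some TCP applications: Telnet and Rlogin in Chapter 26, FTP in Chapter 27, and SMTP in Chapter 28. These applications are normally user processes.

UDP sends and receives datagrams for applications. A datagram is a unit of information (for example, a certain number of bytes specified by the sender) that travels from the sender to the receiver. Unlike TCP, however, UDP is unreliable: it cannot guarantee that a datagram ever reaches its final destination. Chapter 11 of the book discusses UDP, and Chapter 14 (DNS: the Domain Name System), Chapter 15 (TFTP: the Trivial File Transfer Protocol), and Chapter 16 (BOOTP: the Bootstrap Protocol) present applications that use UDP. SNMP also uses UDP, but because it deals with many other protocols as well, the book leaves it until Chapter 25.

IP is the main protocol at the network layer and is used by both TCP and UDP. Every piece of TCP and UDP data travels across an internet through the IP layer of the end systems and of every intermediate router. Figure 1-4 also shows an application that accesses IP directly. This is rare but possible (some older routing protocols were implemented this way, and a new transport-layer protocol might do the same). Chapter 3 covers IP, although some details are deferred to later chapters where they are more relevant. Chapters 9 and 10 discuss how IP performs routing.

ICMP is an adjunct to IP. The IP layer uses it to exchange error messages and other vital information with other hosts or routers. Chapter 6 covers ICMP in more detail. Although ICMP is used mainly by IP, applications can also access it. Two popular diagnostic tools, Ping and Traceroute (Chapters 7 and 8), both use ICMP.

IGMP is the Internet Group Management Protocol. It is used to multicast a UDP datagram to multiple hosts. Chapter 12 describes the general properties of broadcasting (sending a UDP datagram to every host on a specified network) and multicasting, and Chapter 13 describes IGMP itself.

ARP (Address Resolution Protocol) and RARP (Reverse Address Resolution Protocol) are specialized protocols used by certain network interfaces (such as Ethernet and token ring) to convert between the addresses used by the IP layer and those used by the network interface. Chapters 4 and 5 examine these two protocols.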

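2.3 Internet addresses

Every interface on an internet must have a unique Internet address (also called an IP address). An IP address is 32 bits long. Internet addresses do not use a flat address space such as 1, 2, 3, and so on. An IP address has structure; Figure 1-5 shows the five classes of Internet address formats.

These 32-bit addresses are normally written as four decimal numbers, one for each byte of the address. This is called dotted decimal notation. For example, the author's system has a class B address, written as 140.252.13.33.

The easiest way to tell the classes apart is to look at the first decimal number. Figure 1-6 lists the address ranges of each class, with the first number shown in bold.

It is worth repeating that a multihomed host has multiple IP addresses, one per interface.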
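The two rules above (one decimal number per byte, class decided by the first number) can be sketched in a few lines of C. This is only an illustration: the function names `ip_class` and `ip_to_dotted` are made up for this example, and the class boundaries are the classic classful ranges (A below 128, B below 192, C below 224, D below 240, E above that).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: classify an IPv4 address given in host byte
 * order by looking only at its first decimal number (top byte). */
char ip_class(uint32_t addr)
{
    uint8_t first = (uint8_t)(addr >> 24);

    if (first < 128) return 'A';   /* leading bit  0    */
    if (first < 192) return 'B';   /* leading bits 10   */
    if (first < 224) return 'C';   /* leading bits 110  */
    if (first < 240) return 'D';   /* leading bits 1110, multicast */
    return 'E';                    /* leading bits 1111, reserved  */
}

/* Hypothetical helper: write the address in dotted decimal notation.
 * buf must hold at least 16 bytes ("255.255.255.255" plus NUL). */
char *ip_to_dotted(uint32_t addr, char *buf, size_t len)
{
    assert(buf != NULL && len >= 16);
    snprintf(buf, len, "%u.%u.%u.%u",
             (unsigned)((addr >> 24) & 0xff),
             (unsigned)((addr >> 16) & 0xff),
             (unsigned)((addr >> 8) & 0xff),
             (unsigned)(addr & 0xff));
    return buf;
}
```

Calling `ip_class(0x8CFC0D21)` on the address 140.252.13.33 from the text yields 'B', because 140 falls between 128 and 191.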

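Because every interface on an internet must have a unique IP address, there must be a central authority that allocates IP addresses to networks connected to the Internet. That authority is the Internet Network Information Center, called the InterNIC. The InterNIC assigns only network IDs; assigning host IDs is up to the system administrator.

Registration services for the Internet (IP addresses and DNS domain names) used to be handled by the NIC, at nic.ddn.mil. The InterNIC was created on April 1, 1993. Since then the NIC has handled registration requests only for the Defense Data Network, and all other Internet users use the InterNIC registration services, at rs.internic.net.

The InterNIC actually consists of three parts: registration services (rs.internic.net), directory and database services (ds.internic.net), and information services (is.internic.net). See Exercise 1.8 for more information about the InterNIC.

There are three types of IP addresses: unicast (destined for a single host), broadcast (destined for all hosts on a given network), and multicast (destined for all hosts in the same group). Chapters 12 and 13 look at broadcasting and multicasting in more detail.

Section 3.4 returns to the concept of subnets after describing IP routing. Figure 3-9 shows several special IP addresses: those whose host ID or network ID is all zero bits or all one bits.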

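2.4 Encapsulation

When an application sends data using TCP, the data is passed down the protocol stack, through each layer in turn, until it is sent onto the network as a stream of bits. Each layer adds a header (and sometimes a trailer) to the data it receives, as shown in Figure 1-7. The unit of data that TCP passes to IP is called a TCP segment. The unit of data that IP passes to the network interface is called an IP datagram. The stream of bits that flows across the Ethernet is called a frame. The numbers under the headers and the trailer of the frame in Figure 1-7 are their typical sizes in bytes.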
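The encapsulation steps can be pictured as repeatedly prepending a header in front of the data already in a buffer. The sketch below assumes the typical sizes without options (14-byte Ethernet header, 20-byte IP header, 20-byte TCP header, 4-byte Ethernet trailer) and leaves the header contents zero-filled, since only the layout is of interest here. The names `push_header` and `encapsulate` are invented for this illustration; the idea is loosely similar to how the kernel reserves headroom in a packet buffer and pushes headers into it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum {
    ETH_HDR_LEN = 14,   /* destination MAC + source MAC + type */
    IP_HDR_LEN  = 20,   /* IPv4 header without options */
    TCP_HDR_LEN = 20,   /* TCP header without options */
    ETH_TRL_LEN = 4     /* Ethernet trailer (CRC) */
};

/* Prepend a zero-filled header of hdr_len bytes in front of the data
 * that currently starts at buf + *offset. */
static void push_header(unsigned char *buf, size_t *offset, size_t hdr_len)
{
    assert(*offset >= hdr_len);
    *offset -= hdr_len;
    memset(buf + *offset, 0, hdr_len);
}

/* Lay out application data as an Ethernet frame carrying IP and TCP.
 * Returns the frame length, or 0 if the frame buffer is too small. */
size_t encapsulate(const void *data, size_t data_len,
                   unsigned char *frame, size_t frame_cap)
{
    size_t headroom = ETH_HDR_LEN + IP_HDR_LEN + TCP_HDR_LEN;
    size_t total = headroom + data_len + ETH_TRL_LEN;
    size_t offset = headroom;

    if (total > frame_cap)
        return 0;
    memcpy(frame + offset, data, data_len);        /* application data */
    push_header(frame, &offset, TCP_HDR_LEN);      /* now a TCP segment */
    push_header(frame, &offset, IP_HDR_LEN);       /* now an IP datagram */
    push_header(frame, &offset, ETH_HDR_LEN);      /* now an Ethernet frame */
    memset(frame + headroom + data_len, 0, ETH_TRL_LEN);  /* trailer */
    return total;
}
```

With 5 bytes of application data the resulting frame is 14 + 20 + 20 + 5 + 4 = 63 bytes long, and the data starts at offset 54.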

 

 

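2.5 Demultiplexing

When an Ethernet frame arrives at the destination host, it starts its way up the protocol stack, and each layer strips off the header that the corresponding layer added. Each protocol box looks at the protocol identifier in its header to determine which upper-layer protocol receives the data. This is called demultiplexing; Figure 1-8 shows how it takes place. [TCP/IP Illustrated, Volume 1]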
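As a rough sketch of demultiplexing, the function below inspects a raw Ethernet frame, reads the 2-byte type field at offset 12, and, for IP, the 1-byte protocol field at offset 9 of the IP header. The function name `demux` and the returned labels are made up for this example; the numeric values (0x0800 for IP, 0x0806 for ARP, 0x8035 for RARP, and 1, 2, 6, 17 for ICMP, IGMP, TCP, UDP) are the standard identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: name the protocol that should receive a frame. */
const char *demux(const uint8_t *frame, size_t len)
{
    uint16_t ethertype;

    assert(frame != NULL);
    if (len < 14)
        return "short";

    /* The Ethernet type field is big-endian and follows the two MACs. */
    ethertype = (uint16_t)((frame[12] << 8) | frame[13]);
    switch (ethertype) {
    case 0x0806: return "ARP";
    case 0x8035: return "RARP";
    case 0x0800: break;             /* IP: look one layer further up */
    default:     return "unknown";
    }

    if (len < 14 + 20)
        return "short";

    /* Byte 9 of the IP header identifies the next protocol. */
    switch (frame[14 + 9]) {
    case 1:  return "ICMP";
    case 2:  return "IGMP";
    case 6:  return "TCP";
    case 17: return "UDP";
    default: return "unknown";
    }
}
```

A real stack dispatches to handler functions rather than returning strings, but the two-stage lookup (frame type first, then IP protocol) is the same idea.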

3      Packet formats

3.1 ethhdr

Describes the Ethernet header.

/*

 *  This is an Ethernet frame header.

 */

 

struct ethhdr {

    unsigned char h_dest[ETH_ALEN];   /* destination eth addr */

    unsigned char h_source[ETH_ALEN]; /* source ether addr */

    __be16     h_proto;     /* packet type ID field  */

} __attribute__((packed));

 

 

/*

 *  These are the defined Ethernet Protocol ID's.

 */

 

#define ETH_P_LOOP   0x0060    /* Ethernet Loopback packet */

#define ETH_P_PUP 0x0200     /* Xerox PUP packet      */

#define ETH_P_PUPAT  0x0201    /* Xerox PUP Addr Trans packet  */

#define ETH_P_IP  0x0800     /* Internet Protocol packet */

#define ETH_P_X25 0x0805     /* CCITT X.25        */

#define ETH_P_ARP 0x0806     /* Address Resolution packet    */

#define ETH_P_BPQ  0x08FF    /* G8BPQ AX.25 Ethernet Packet  [ NOT AN OFFICIALLY REGISTERED ID ] */

#define ETH_P_IEEEPUP    0x0a00    /* Xerox IEEE802.3 PUP packet */

#define ETH_P_IEEEPUPAT  0x0a01    /* Xerox IEEE802.3 PUP Addr Trans packet */

#define ETH_P_DEC       0x6000         /* DEC Assigned proto           */

#define ETH_P_DNA_DL    0x6001         /* DEC DNA Dump/Load            */

#define ETH_P_DNA_RC    0x6002         /* DEC DNA Remote Console       */

#define ETH_P_DNA_RT    0x6003         /* DEC DNA Routing              */

#define ETH_P_LAT       0x6004         /* DEC LAT                      */

#define ETH_P_DIAG      0x6005         /* DEC Diagnostics              */

#define ETH_P_CUST      0x6006         /* DEC Customer use             */

#define ETH_P_SCA       0x6007         /* DEC Systems Comms Arch       */

#define ETH_P_TEB 0x6558     /* Trans Ether Bridging     */

#define ETH_P_RARP      0x8035      /* Reverse Addr Res packet  */

#define ETH_P_ATALK  0x809B    /* Appletalk DDP     */

#define ETH_P_AARP   0x80F3    /* Appletalk AARP    */

#define ETH_P_8021Q  0x8100         /* 802.1Q VLAN Extended Header  */

#define ETH_P_IPX 0x8137     /* IPX over DIX          */

#define ETH_P_IPV6   0x86DD    /* IPv6 over bluebook       */

#define ETH_P_PAUSE  0x8808    /* IEEE Pause frames. See 802.3 31B */

#define ETH_P_SLOW   0x8809    /* Slow Protocol. See 802.3ad 43B */

#define ETH_P_WCCP   0x883E    /* Web-cache coordination protocol

                   * defined in draft-wilson-wrec-wccp-v2-00.txt*/

#define ETH_P_PPP_DISC   0x8863    /* PPPoE discovery messages     */

#define ETH_P_PPP_SES    0x8864    /* PPPoE session messages   */

#define ETH_P_MPLS_UC    0x8847    /* MPLS Unicast traffic     */

#define ETH_P_MPLS_MC    0x8848    /* MPLS Multicast traffic   */

#define ETH_P_ATMMPOA    0x884c    /* MultiProtocol Over ATM   */

#define ETH_P_ATMFATE    0x8884    /* Frame-based ATM Transport

                   * over Ethernet

                   */

#define ETH_P_PAE 0x888E     /* Port Access Entity (IEEE 802.1X) */

#define ETH_P_AOE 0x88A2     /* ATA over Ethernet     */

#define ETH_P_TIPC   0x88CA    /* TIPC           */

#define ETH_P_1588   0x88F7    /* IEEE 1588 Timesync */

#define ETH_P_FCOE   0x8906    /* Fibre Channel over Ethernet  */

#define ETH_P_TDLS   0x890D    /* TDLS */

#define ETH_P_FIP 0x8914     /* FCoE Initialization Protocol */

#define ETH_P_EDSA   0xDADA    /* Ethertype DSA [ NOT AN OFFICIALLY REGISTERED ID] */

#define ETH_P_AF_IUCV   0xFBFB     /* IBM af_iucv [ NOT AN OFFICIALLY REGISTERED ID ]*/

 

 

3.2 iphdr

Describes the IP header.

struct iphdr {

#if defined(__LITTLE_ENDIAN_BITFIELD)

    __u8   ihl:4,

       version:4;

#elif defined (__BIG_ENDIAN_BITFIELD)

    __u8   version:4,

        ihl:4;

#else

#error "Please fix"

#endif

    __u8   tos;

    __be16 tot_len;

    __be16 id;

    __be16 frag_off;

    __u8   ttl;

    __u8   protocol;

    __sum16    check;

    __be32 saddr;

    __be32 daddr;

    /*The options start here. */

};
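Two fields of this header are easy to misread: ihl counts 32-bit words, not bytes, and check is the one's-complement Internet checksum over the header. The sketch below works on raw bytes, which also sidesteps the bitfield ordering that the #if in the struct has to handle. The function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Header length in bytes: the low nibble of byte 0 (ihl) counts
 * 32-bit words, so a header without options has ihl == 5. */
size_t ip_header_len(const uint8_t *hdr)
{
    assert(hdr != NULL);
    return (size_t)(hdr[0] & 0x0f) * 4;
}

/* Internet checksum: one's-complement sum of big-endian 16-bit words,
 * folded to 16 bits and inverted. Computed with the check field set
 * to zero it gives the value to store; computed over a header that
 * already holds a valid checksum it gives 0. */
uint16_t ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    assert(hdr != NULL && len % 2 == 0);
    for (i = 0; i < len; i += 2)
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
    while (sum >> 16)                       /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

A receiver can therefore validate a header by checking that `ip_checksum(hdr, ip_header_len(hdr))` is zero.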

 

3.3 udphdr

Describes the UDP header.

struct udphdr {

    __be16 source;

    __be16 dest;

    __be16 len;

    __sum16    check;

};
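All four fields are 16-bit big-endian values, and len covers the 8-byte header plus the data. A minimal sketch of reading them from raw bytes follows; `struct udp_fields`, `udp_parse` and `udp_payload_len` are hypothetical names used only here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-byte-order view of the four udphdr fields. */
struct udp_fields {
    uint16_t source, dest, len, check;
};

static uint16_t get_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Decode the 8-byte UDP header found at hdr. */
struct udp_fields udp_parse(const uint8_t *hdr)
{
    struct udp_fields f;

    assert(hdr != NULL);
    f.source = get_be16(hdr);
    f.dest   = get_be16(hdr + 2);
    f.len    = get_be16(hdr + 4);
    f.check  = get_be16(hdr + 6);
    return f;
}

/* Data bytes carried: len includes the 8-byte header itself.
 * Returns 0 for a malformed length below 8. */
size_t udp_payload_len(const struct udp_fields *f)
{
    return f->len >= 8 ? (size_t)(f->len - 8) : 0;
}
```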

 

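4      Data structures

The kernel protocol stack involves many intricately connected data structures; this chapter simply reproduces the source of the ones involved. In the original document, source code and comments are set in 10-point type with highlighting, and important members and methods are marked in bold 11-point type.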

 

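4.1 Layered structure of the kernel protocol stack

Figure 4-1 Layered structure of the kernel protocol stack

Physical device hardware: the actual physical devices. Corresponds to the physical layer.

Device agnostic interface: the device-independent layer. Corresponds to the link layer.

Network protocols: the network protocols. Corresponds to the IP layer and the transport layer.

Protocol agnostic interface: the protocol-independent layer. Adapts the system call layer and hides the details of the protocols.

System call interface: the system call layer. Provides system calls to the application layer and hides the details of socket operations.

BSD socket: the BSD socket layer. Provides a uniform interface for socket operations; closely tied to the socket structure.

Inet socket: the inet socket layer. The uniform interface for calling IP-layer protocols; closely tied to the sock structure.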

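4.2 msghdr

Describes the format of a message passed down from the application layer. It carries important information such as user-space addresses and message flags.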

/*

 *  As we do 4.4BSD message passing we use a 4.4BSD message passing

 *  system, not 4.3. Thus msg_accrights(len) are now missing. They

 *  belong in an obscure libc emulation or the bin.

 */

 

struct msghdr {

    void   *   msg_name; /* Socket name           */

    int    msg_namelen; /* Length of name    */

    struct iovec *   msg_iov;  /* Data blocks           */

    __kernel_size_t   msg_iovlen;  /* Number of blocks      */

    void   *   msg_control; /* Per protocol magic (eg BSD file descriptor passing) */

    __kernel_size_t   msg_controllen;  /* Length of cmsg list */

    unsigned   msg_flags;

};

4.3 iovec

Describes one buffer in user space: its start address and its length.

/*

 *  Berkeley style UIO structures   -   Alan Cox 1994.

 *

 *     This program is free software; you can redistribute it and/or

 *     modify it under the terms of the GNU General Public License

 *     as published by the Free Software Foundation; either version

 *     2 of the License, or (at your option) any later version.

 */

 

struct iovec {

    void __user *iov_base;  /* BSD uses caddr_t (1003.1g requires void *) */

    __kernel_size_t iov_len; /* Must be size_t (1003.1g) */

};
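To see how msghdr and iovec fit together, it may help to look at them from the application side. The sketch below uses the user-space versions of the two structures (declared in <sys/socket.h> and <sys/uio.h>, with slightly different field types from the kernel listing above) to send two separate buffers as one datagram over a local socket pair. The function name `gather_roundtrip` and the sample strings are made up for this example.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send two buffers as a single datagram with sendmsg(), then read the
 * datagram back into out. Returns the number of bytes received, or -1
 * on error. */
ssize_t gather_roundtrip(char *out, size_t out_len)
{
    int sv[2];
    char part1[] = "hello, ";
    char part2[] = "world";
    struct iovec iov[2];
    struct msghdr msg;
    ssize_t n;

    assert(out != NULL);
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;

    /* Each iovec entry names one block: start address and length. */
    iov[0].iov_base = part1;
    iov[0].iov_len  = strlen(part1);
    iov[1].iov_base = part2;
    iov[1].iov_len  = strlen(part2);

    /* msghdr points at the block array; no address or control data. */
    memset(&msg, 0, sizeof msg);
    msg.msg_iov    = iov;
    msg.msg_iovlen = 2;

    n = sendmsg(sv[0], &msg, 0);
    if (n >= 0)
        n = recv(sv[1], out, out_len, 0);

    close(sv[0]);
    close(sv[1]);
    return n;
}
```

The receiver sees one 12-byte message, "hello, world", even though the sender never copied the two parts into a contiguous buffer.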

 

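4.4 file

The structure that describes an open file. Each file descriptor refers to exactly one struct file (several descriptors can share the same one, for example after dup()).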

 

struct file {

    /*

     * fu_list becomes invalid after file_free is called and queued via

     * fu_rcuhead for RCU freeing

     */

    union {

       struct list_head  fu_list;

       struct rcu_head   fu_rcuhead;

    } f_u;

    struct path       f_path;

#define f_dentry  f_path.dentry

#define f_vfsmnt  f_path.mnt

    const struct file_operations   *f_op;

    spinlock_t    f_lock; /* f_ep_links, f_flags, no IRQ */

    atomic_long_t     f_count;

    unsigned int      f_flags;

    fmode_t           f_mode;

    loff_t        f_pos;

    struct fown_struct   f_owner;

    const struct cred *f_cred;

    struct file_ra_state f_ra;

 

    u64        f_version;

#ifdef CONFIG_SECURITY

    void          *f_security;

#endif

    /* needed for tty driver, and maybe others */

   void        *private_data;

 

#ifdef CONFIG_EPOLL

    /* Used by fs/eventpoll.c to link all the hooks to this file */

    struct list_head  f_ep_links;

#endif /* #ifdef CONFIG_EPOLL */

    struct address_space *f_mapping;

#ifdef CONFIG_DEBUG_WRITECOUNT

    unsigned long f_mnt_write_state;

#endif

};

4.5 file_operations

The structure that holds the file operations, including read(), write(), open(), ioctl() and others.

 

/*

 * NOTE:

 * read, write, poll, fsync, readv, writev, unlocked_ioctl and compat_ioctl

 * can be called without the big kernel lock held in all filesystems.

 */

struct file_operations {
    struct module *owner;
    loff_t (*llseek) (struct file *, loff_t, int);
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
    ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
    ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
    ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
    int (*readdir) (struct file *, void *, filldir_t);
    unsigned int (*poll) (struct file *, struct poll_table_struct *);
    int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
    long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
    long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
    int (*mmap) (struct file *, struct vm_area_struct *);
    int (*open) (struct inode *, struct file *);
    int (*flush) (struct file *, fl_owner_t id);
    int (*release) (struct inode *, struct file *);
    int (*fsync) (struct file *, struct dentry *, int datasync);
    int (*aio_fsync) (struct kiocb *, int datasync);
    int (*fasync) (int, struct file *, int);
    int (*lock) (struct file *, int, struct file_lock *);
    ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
    unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
    int (*check_flags)(int);
    int (*flock) (struct file *, int, struct file_lock *);
    ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
    ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
    int (*setlease)(struct file *, long, struct file_lock **);
};
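The point of a table like file_operations is that generic code calls through function pointers without knowing what kind of object sits behind the file. The following miniature shows the callback pattern in plain C. Everything in it (`my_file`, `my_file_ops`, `my_vfs_read`, `demo_read`) is a simplified stand-in invented for this example, not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct my_file;

/* Miniature stand-in for struct file_operations. */
struct my_file_ops {
    long (*read)(struct my_file *f, char *buf, size_t len);
    long (*write)(struct my_file *f, const char *buf, size_t len);
};

/* Miniature stand-in for struct file. */
struct my_file {
    const struct my_file_ops *f_op;   /* chosen when the file is opened */
    const void *private_data;         /* owned by the implementation */
};

/* One concrete implementation: read from a NUL-terminated string. */
static long mem_read(struct my_file *f, char *buf, size_t len)
{
    const char *src = f->private_data;
    size_t n = strlen(src);

    if (n > len)
        n = len;
    memcpy(buf, src, n);
    return (long)n;
}

static const struct my_file_ops mem_ops = { .read = mem_read, .write = NULL };

/* Generic entry point: dispatches through the table, knows nothing
 * about the backing object. Returns -1 if the operation is missing. */
long my_vfs_read(struct my_file *f, char *buf, size_t len)
{
    assert(f != NULL);
    if (!f->f_op || !f->f_op->read)
        return -1;
    return f->f_op->read(f, buf, len);
}

/* Usage: open a "file" backed by the string "packet" and read it. */
long demo_read(char *buf, size_t len)
{
    struct my_file f = { &mem_ops, "packet" };

    return my_vfs_read(&f, buf, len);
}
```

A socket presumably plugs into the same mechanism: its struct file carries a table whose read and write entries lead into the socket layer.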

 

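4.6 socket

The BSD socket structure exposed to the application layer. It is protocol-independent, and its main job is to give applications a uniform set of socket operations. BSD: Berkeley Software Distribution.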

 

/**

 * struct socket - general BSD socket

 * @state: socket state (%SS_CONNECTED, etc)

 * @type: socket type (%SOCK_STREAM, etc)

 * @flags: socket flags (%SOCK_ASYNC_NOSPACE, etc)

 * @ops: protocol specific socket operations

 * @fasync_list: Asynchronous wake up list

 * @file: File back pointer for gc

 * @sk: internal networking protocol agnostic socket representation

 * @wait: wait queue for several uses

 */

struct socket {

   socket_state    state;

 

    kmemcheck_bitfield_begin(type);

    short         type;

    kmemcheck_bitfield_end(type);

 

    unsigned long     flags;

    /*

     * Please keep fasync_list & wait fields in the same cache line

     */

    struct fasync_struct*fasync_list;

    wait_queue_head_t wait;

 

    struct file    *file;

   struct sock    *sk;

   const struct proto_ops   *ops;

};

 

typedef enum {

    SS_FREE = 0,         /* not allocated     */

    SS_UNCONNECTED,         /* unconnected to any socket    */

    SS_CONNECTING,          /* in process of connecting */

    SS_CONNECTED,       /* connected to socket      */

    SS_DISCONNECTING     /* in process of disconnecting  */

} socket_state;

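4.7 sock

The network-layer sock (think of it as a C++ base class). It defines protocol-independent operations and is the common structure of the network layer; the transport layer builds inet_sock on top of it (think of it as a C++ derived class).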

/**
 * struct sock - network layer representation of sockets
 * @__sk_common: shared layout with inet_timewait_sock
 * @sk_shutdown: mask of %SEND_SHUTDOWN and/or %RCV_SHUTDOWN
 * @sk_userlocks: %SO_SNDBUF and %SO_RCVBUF settings
 * @sk_lock: synchronizer
 * @sk_rcvbuf: size of receive buffer in bytes
 * @sk_sleep: sock wait queue
 * @sk_dst_cache: destination cache
 * @sk_dst_lock: destination cache lock
 * @sk_policy: flow policy
 * @sk_rmem_alloc: receive queue bytes committed
 * @sk_receive_queue: incoming packets
 * @sk_wmem_alloc: transmit queue bytes committed
 * @sk_write_queue: Packet sending queue
 * @sk_async_wait_queue: DMA copied packets
 * @sk_omem_alloc: "o" is "option" or "other"
 * @sk_wmem_queued: persistent queue size
 * @sk_forward_alloc: space allocated forward
 * @sk_allocation: allocation mode
 * @sk_sndbuf: size of send buffer in bytes
 * @sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
 *        %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
 * @sk_no_check: %SO_NO_CHECK setting, whether or not to checksum packets
 * @sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
 * @sk_gso_type: GSO type (e.g. %SKB_GSO_TCPV4)
 * @sk_gso_max_size: Maximum GSO segment size to build
 * @sk_lingertime: %SO_LINGER l_linger setting
 * @sk_backlog: always used with the per-socket spinlock held
 * @sk_callback_lock: used with the callbacks in the end of this struct
 * @sk_error_queue: rarely used
 * @sk_prot_creator: sk_prot of original sock creator (see ipv6_setsockopt,
 *          IPV6_ADDRFORM for instance)
 * @sk_err: last error
 * @sk_err_soft: errors that don't cause failure but are the cause of a
 *           persistent failure not just 'timed out'
 * @sk_drops: raw/udp drops counter
 * @sk_ack_backlog: current listen backlog
 * @sk_max_ack_backlog: listen backlog set in listen()
 * @sk_priority: %SO_PRIORITY setting
 * @sk_type: socket type (%SOCK_STREAM, etc)
 * @sk_protocol: which protocol this socket belongs in this network family
 * @sk_peercred: %SO_PEERCRED setting
 * @sk_rcvlowat: %SO_RCVLOWAT setting
 * @sk_rcvtimeo: %SO_RCVTIMEO setting
 * @sk_sndtimeo: %SO_SNDTIMEO setting
 * @sk_filter: socket filtering instructions
 * @sk_protinfo: private area, net family specific, when not using slab
 * @sk_timer: sock cleanup timer
 * @sk_stamp: time stamp of last packet received
 * @sk_socket: Identd and reporting IO signals
 * @sk_user_data: RPC layer private data
 * @sk_sndmsg_page: cached page for sendmsg
 * @sk_sndmsg_off: cached offset for sendmsg
 * @sk_send_head: front of stuff to transmit
 * @sk_security: used by security modules
 * @sk_mark: generic packet mark
 * @sk_write_pending: a write to stream socket waits to start
 * @sk_state_change: callback to indicate change in the state of the sock
 * @sk_data_ready: callback to indicate there is data to be processed
 * @sk_write_space: callback to indicate there is bf sending space available
 * @sk_error_report: callback to indicate errors (e.g. %MSG_ERRQUEUE)
 * @sk_backlog_rcv: callback to process the backlog
 * @sk_destruct: called at sock freeing time, i.e. when all refcnt == 0
 */

struct sock {

    /*

     * Now struct inet_timewait_sock also uses sock_common, so please just

     * don't add nothing before this first member (__sk_common) --acme

     */

    struct sock_common   __sk_common;

#define sk_node          __sk_common.skc_node

#define sk_nulls_node       __sk_common.skc_nulls_node

#define sk_refcnt    __sk_common.skc_refcnt

 

#define sk_copy_start       __sk_common.skc_hash

#define sk_hash          __sk_common.skc_hash

#define sk_family    __sk_common.skc_family

#define sk_state     __sk_common.skc_state

#define sk_reuse     __sk_common.skc_reuse

#define sk_bound_dev_if     __sk_common.skc_bound_dev_if

#define sk_bind_node     __sk_common.skc_bind_node

#define sk_prot          __sk_common.skc_prot

#define sk_net           __sk_common.skc_net

    kmemcheck_bitfield_begin(flags);

    unsigned int      sk_shutdown  : 2,

              sk_no_check  :2,

              sk_userlocks :4,

              sk_protocol  :8,

              sk_type      :16;

    kmemcheck_bitfield_end(flags);

    int        sk_rcvbuf;

    socket_lock_t     sk_lock;

    /*

     * The backlog queue is special, it is always used with

     * the per-socket spinlock held and requires low latency

     * access. Therefore we special case its implementation.

     */

    struct {

       struct sk_buff *head;

       struct sk_buff *tail;

    } sk_backlog;

    wait_queue_head_t *sk_sleep;

    struct dst_entry  *sk_dst_cache;

#ifdef CONFIG_XFRM

    struct xfrm_policy  *sk_policy[2];

#endif

    rwlock_t      sk_dst_lock;

    atomic_t       sk_rmem_alloc;

    atomic_t      sk_wmem_alloc;

    atomic_t      sk_omem_alloc;

    int        sk_sndbuf;

    struct sk_buff_head  sk_receive_queue;

    struct sk_buff_head  sk_write_queue;

#ifdef CONFIG_NET_DMA

    struct sk_buff_head  sk_async_wait_queue;

#endif

    int        sk_wmem_queued;

    int        sk_forward_alloc;

    gfp_t         sk_allocation;

    int        sk_route_caps;

    int        sk_gso_type;

    unsigned int      sk_gso_max_size;

    int        sk_rcvlowat;

    unsigned long     sk_flags;

    unsigned long        sk_lingertime;

    struct sk_buff_head  sk_error_queue;

    struct proto      *sk_prot_creator;

    rwlock_t      sk_callback_lock;

    int        sk_err,

              sk_err_soft;

    atomic_t      sk_drops;

    unsigned short       sk_ack_backlog;

    unsigned short       sk_max_ack_backlog;

    __u32         sk_priority;

    struct ucred      sk_peercred;

    long          sk_rcvtimeo;

    long          sk_sndtimeo;

    struct sk_filter     *sk_filter;

    void          *sk_protinfo;

    struct timer_list sk_timer;

    ktime_t           sk_stamp;

    struct socket     *sk_socket;

    void          *sk_user_data;

    struct page       *sk_sndmsg_page;

    struct sk_buff      *sk_send_head;

    __u32         sk_sndmsg_off;

    int        sk_write_pending;

#ifdef CONFIG_SECURITY

    void          *sk_security;

#endif

    __u32         sk_mark;

    u32        sk_classid;

    void          (*sk_state_change)(struct sock*sk);

    void          (*sk_data_ready)(struct sock*sk,int bytes);

    void          (*sk_write_space)(struct sock*sk);

    void          (*sk_error_report)(struct sock*sk);

    int        (*sk_backlog_rcv)(struct sock*sk,

                       struct sk_buff*skb); 

    void                   (*sk_destruct)(struct sock*sk);

};

 

4.8 sock_common

The minimal network-layer representation of a socket.

 

/**
 *  struct sock_common - minimal network layer representation of sockets
 *  @skc_node: main hash linkage for various protocol lookup tables
 *  @skc_nulls_node: main hash linkage for UDP/UDP-Lite protocol
 *  @skc_refcnt: reference count
 *  @skc_hash: hash value used with various protocol lookup tables
 *  @skc_family: network address family
 *  @skc_state: Connection state
 *  @skc_reuse: %SO_REUSEADDR setting
 *  @skc_bound_dev_if: bound device index if != 0
 *  @skc_bind_node: bind hash linkage for various protocol lookup tables
 *  @skc_prot: protocol handlers inside a network family
 *  @skc_net: reference to the network namespace of this socket
 *
 *  This is the minimal network layer representation of sockets, the header
 *  for struct sock and struct inet_timewait_sock.
 */

struct sock_common {

    /*

     * first fields are not copied in sock_copy()

     */

    union {

       struct hlist_node skc_node;

       struct hlist_nulls_node skc_nulls_node;

    };

    atomic_t      skc_refcnt;

 

    unsigned int      skc_hash;

    unsigned short       skc_family;

    volatile unsigned char   skc_state;

    unsigned char     skc_reuse;

    int        skc_bound_dev_if;

    struct hlist_node skc_bind_node;

    struct proto     *skc_prot;

#ifdef CONFIG_NET_NS

    struct net    *skc_net;

#endif

};

 

 

4.9 inet_sock

inet_sock is an extension of sock. It is the common transport-layer structure of the inet protocol family, sitting above the network-layer sock.

/** struct inet_sock - representation of INET sockets
 *
 * @sk - ancestor class
 * @pinet6 - pointer to IPv6 control block
 * @daddr - Foreign IPv4 addr
 * @rcv_saddr - Bound local IPv4 addr
 * @dport - Destination port
 * @num - Local port
 * @saddr - Sending source
 * @uc_ttl - Unicast TTL
 * @sport - Source port
 * @id - ID counter for DF pkts
 * @tos - TOS
 * @mc_ttl - Multicasting TTL
 * @is_icsk - is this an inet_connection_sock?
 * @mc_index - Multicast device index
 * @mc_list - Group array
 * @cork - info to build ip hdr on each ip frag while socket is corked
 */
struct inet_sock {
    /* sk and pinet6 has to be the first two members of inet_sock */

    struct sock       sk;

#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)

    struct ipv6_pinfo *pinet6;

#endif

    /* Socket demultiplex comparisons on incoming packets. */

    __be32        daddr;

    __be32        rcv_saddr;

    __be16        dport;

    __u16         num;

    __be32        saddr;

    __s16         uc_ttl;

    __u16         cmsg_flags;

    struct ip_options *opt;

    __be16        sport;

    __u16         id;

    __u8          tos;

    __u8          mc_ttl;

    __u8          pmtudisc;

    __u8          recverr:1,

              is_icsk:1,

              freebind:1,

              hdrincl:1,

              mc_loop:1,

              transparent:1,

              mc_all:1;

    int        mc_index;

    __be32        mc_addr;

    struct ip_mc_socklist   *mc_list;

    struct {

       unsigned int      flags;

       unsigned int      fragsize;

       struct ip_options *opt;

       struct dst_entry *dst;

       int        length; /* Total length of all frames */

       __be32        addr;

       struct flowi      fl;

    } cork;

};

 

 

4.10  udp_sock

The sock structure specific to the UDP transport protocol; it extends the transport-layer inet_sock.

struct udp_sock {
    /* inet_sock has to be the first member */
    struct inet_sock inet;
    int    pending;      /* Any pending frames ? */
    unsigned int  corkflag;   /* Cork is required */
    __u16      encap_type;  /* Is this an Encapsulation socket? */
    /*
     * Following member retains the information to create a UDP header
     * when the socket is uncorked.
     */
    __u16      len;     /* total length of pending frames */
    /*
     * Fields specific to UDP-Lite.
     */
    __u16      pcslen;
    __u16      pcrlen;
/* indicator bits used by pcflag: */
#define UDPLITE_BIT      0x1   /* set by udplite proto init function */
#define UDPLITE_SEND_CC  0x2   /* set via udplite setsockopt         */
#define UDPLITE_RECV_CC  0x4   /* set via udplite setsockopt         */
    __u8       pcflag;       /* marks socket as UDP-Lite if > 0    */
    __u8       unused[3];
    /*
     * For encapsulation sockets.
     */
    int (*encap_rcv)(struct sock *sk, struct sk_buff *skb);
};
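The "base class / derived class" relationship between sock, inet_sock and udp_sock rests on one C rule: a pointer to a structure also points to its first member. That is why the comments insist that sk and inet stay first. The sketch below reproduces the idea with three tiny made-up structures (`base_sock`, `mini_inet`, `mini_udp`); the kernel helpers inet_sk() and udp_sk() perform casts of the same kind.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for sock, inet_sock and udp_sock. */
struct base_sock { int sk_type; };
struct mini_inet { struct base_sock sk;   uint16_t num; };   /* sk first   */
struct mini_udp  { struct mini_inet inet; int pending;  };   /* inet first */

/* "Downcasts": valid only because each parent is the first member. */
static struct mini_inet *to_inet(struct base_sock *sk)
{
    return (struct mini_inet *)sk;
}

static struct mini_udp *to_udp(struct base_sock *sk)
{
    return (struct mini_udp *)sk;
}

/* Generic code holds only a base_sock pointer, yet protocol code can
 * reach the inet and UDP specific fields through it. */
int demo_local_port(void)
{
    struct mini_udp up = { { { 2 }, 5353 }, 0 };
    struct base_sock *sk = &up.inet.sk;

    assert(offsetof(struct mini_udp, inet) == 0);
    assert(offsetof(struct mini_inet, sk) == 0);

    to_udp(sk)->pending = 1;
    return to_inet(sk)->num + to_udp(sk)->pending;   /* 5353 + 1 */
}
```

If a member were ever inserted before the embedded parent, the casts would silently read the wrong memory, which is exactly what the "has to be the first member" comments guard against.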

 

4.11  proto_ops

The interface from the BSD socket layer to the inet_sock layer; it operates mainly on the socket structure.

struct proto_ops {
    int    family;
    struct module *owner;
    int    (*release)   (struct socket *sock);
    int    (*bind)      (struct socket *sock,
                    struct sockaddr *myaddr,
                    int sockaddr_len);
    int    (*connect)   (struct socket *sock,
                    struct sockaddr *vaddr,
                    int sockaddr_len, int flags);
    int    (*socketpair)(struct socket *sock1,
                    struct socket *sock2);
    int    (*accept)    (struct socket *sock,
                    struct socket *newsock, int flags);
    int    (*getname)   (struct socket *sock,
                    struct sockaddr *addr,
                    int *sockaddr_len, int peer);
    unsigned int (*poll) (struct file *file, struct socket *sock,
                    struct poll_table_struct *wait);
    int    (*ioctl)     (struct socket *sock, unsigned int cmd,
                    unsigned long arg);
    int    (*compat_ioctl) (struct socket *sock, unsigned int cmd,
                    unsigned long arg);
    int    (*listen)    (struct socket *sock, int len);
    int    (*shutdown)  (struct socket *sock, int flags);
    int    (*setsockopt)(struct socket *sock, int level,
                    int optname, char __user *optval, unsigned int optlen);
    int    (*getsockopt)(struct socket *sock, int level,
                    int optname, char __user *optval, int __user *optlen);
    int    (*compat_setsockopt)(struct socket *sock, int level,
                    int optname, char __user *optval, unsigned int optlen);
    int    (*compat_getsockopt)(struct socket *sock, int level,
                    int optname, char __user *optval, int __user *optlen);
    int    (*sendmsg)   (struct kiocb *iocb, struct socket *sock,
                    struct msghdr *m, size_t total_len);
    int    (*recvmsg)   (struct kiocb *iocb, struct socket *sock,
                    struct msghdr *m, size_t total_len,
                    int flags);
    int    (*mmap)      (struct file *file, struct socket *sock,
                    struct vm_area_struct *vma);
    ssize_t (*sendpage) (struct socket *sock, struct page *page,
                    int offset, size_t size, int flags);
    ssize_t (*splice_read)(struct socket *sock, loff_t *ppos,
                    struct pipe_inode_info *pipe, size_t len, unsigned int flags);
};

4.12  proto

The uniform interface from the inet_sock layer to the transport layer; it operates mainly on the sock structure.

/* Networking protocol blocks we attach to sockets.

 * socket layer -> transport layer interface

 * transport -> network interface is defined by struct inet_proto

 */

struct proto {

    void          (*close)(struct sock*sk,

                  long timeout);

    int        (*connect)(struct sock*sk,

                      struct sockaddr*uaddr,

                  int addr_len);

    int        (*disconnect)(struct sock*sk,int flags);

 

    struct sock *     (*accept)(struct sock*sk,int flags,int*err);

 

    int        (*ioctl)(struct sock *sk, int cmd,

                   unsigned long arg);

    int        (*init)(struct sock*sk);

    void          (*destroy)(struct sock*sk);

    void          (*shutdown)(struct sock*sk,int how);

    int        (*setsockopt)(struct sock *sk, int level,

                  int optname, char __user *optval,

                  unsigned int optlen);

    int        (*getsockopt)(struct sock *sk, int level,

                  int optname, char __user *optval,

                  int __user *option);

#ifdef CONFIG_COMPAT

    int        (*compat_setsockopt)(struct sock *sk,

                  int level,

                  int optname, char __user *optval,

                  unsigned int optlen);

    int        (*compat_getsockopt)(struct sock *sk,

                  int level,

                  int optname, char __user *optval,

                  int __user *option);

#endif

    int        (*sendmsg)(struct kiocb*iocb,struct sock *sk,

                     struct msghdr*msg, size_t len);

    int        (*recvmsg)(struct kiocb*iocb,struct sock *sk,

                     struct msghdr*msg,

                  size_t len,int noblock,int flags,

                  int *addr_len);

    int        (*sendpage)(struct sock*sk,struct page *page,

                  int offset, size_t size,int flags);

    int        (*bind)(struct sock*sk,

                  struct sockaddr*uaddr,int addr_len);

 

    int        (*backlog_rcv)(struct sock*sk,

                     struct sk_buff*skb);

 

    /* Keeping track of sk's, looking them up, and port selection methods. */

    void          (*hash)(struct sock *sk);

    void          (*unhash)(struct sock *sk);

    int        (*get_port)(struct sock *sk, unsigned short snum);

 

    /* Keeping track of sockets in use */

#ifdef CONFIG_PROC_FS

    unsigned int      inuse_idx;

#endif

 

    /* Memory pressure */

    void          (*enter_memory_pressure)(struct sock*sk);

    atomic_t      *memory_allocated;  /* Current allocated memory. */

    struct percpu_counter   *sockets_allocated; /* Current number of sockets. */

    /*

     * Pressure flag: try to collapse.

     * Technical note: it is used by multiple contexts non atomically.

     * All the __sk_mem_schedule() is of this nature: accounting

     * is strict, actions are advisory and have some latency.

     */

    int        *memory_pressure;

    int        *sysctl_mem;

    int        *sysctl_wmem;

    int        *sysctl_rmem;

    int        max_header;

 

    struct kmem_cache *slab;

    unsigned int      obj_size;

    int        slab_flags;

 

    struct percpu_counter   *orphan_count;

 

    struct request_sock_ops *rsk_prot;

    struct timewait_sock_ops*twsk_prot;

 

    union {

       struct inet_hashinfo*hashinfo;

       struct udp_table *udp_table;

       struct raw_hashinfo *raw_hash;

    } h;

 

    struct module     *owner;

 

    char          name[32];

 

    struct list_head  node;

#ifdef SOCK_REFCNT_DEBUG

    atomic_t      socks;

#endif

};

 

 

4.13  net_proto_family

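Identifies and registers a protocol family. Common protocol families include IPv4 and IPv6.

Protocol family: a set of protocols that together implement a particular function.

struct net_proto_family {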

    int    family;

    int    (*create)(struct net*net,struct socket *sock,

                int protocol,int kern);

    struct module *owner;

};

 

The kernel declares a large number of protocol families, but a given kernel does not necessarily support all of them.

/* Supported address families. */

#define AF_UNSPEC 0

#define AF_UNIX      1   /* Unix domain sockets      */

#define AF_LOCAL  1   /* POSIX name for AF_UNIX   */

#define AF_INET      2   /* Internet IP Protocol */

#define AF_AX25      3   /* Amateur Radio AX.25      */

#define AF_IPX       4   /* Novell IPX        */

#define AF_APPLETALK 5   /* AppleTalk DDP     */

#define AF_NETROM 6   /* Amateur Radio NET/ROM    */

#define AF_BRIDGE 7   /* Multiprotocol bridge */

#define AF_ATMPVC 8   /* ATM PVCs          */

#define AF_X25       9   /* Reserved for X.25 project    */

#define AF_INET6  10  /* IP version 6          */

#define AF_ROSE      11  /* Amateur Radio X.25 PLP   */

#define AF_DECnet 12  /* Reserved for DECnet project  */

#define AF_NETBEUI   13  /* Reserved for 802.2LLC project*/

#define AF_SECURITY  14  /* Security callback pseudo AF */

#define AF_KEY       15      /* PF_KEY key management API */

#define AF_NETLINK   16

#define AF_ROUTE  AF_NETLINK /* Alias to emulate 4.4BSD */

#define AF_PACKET 17  /* Packet family     */

#define AF_ASH       18  /* Ash            */

#define AF_ECONET 19  /* Acorn Econet          */

#define AF_ATMSVC 20  /* ATM SVCs          */

#define AF_RDS       21  /* RDS sockets           */

#define AF_SNA       22  /* Linux SNA Project (nutters!) */

#define AF_IRDA      23  /* IRDA sockets          */

#define AF_PPPOX  24  /* PPPoX sockets     */

#define AF_WANPIPE   25  /* Wanpipe API Sockets */

#define AF_LLC       26  /* Linux LLC         */

#define AF_CAN       29  /* Controller Area Network      */

#define AF_TIPC      30  /* TIPC sockets          */

#define AF_BLUETOOTH 31  /* Bluetooth sockets     */

#define AF_IUCV      32  /* IUCV sockets          */

#define AF_RXRPC  33  /* RxRPC sockets     */

#define AF_ISDN      34  /* mISDN sockets     */

#define AF_PHONET 35  /* Phonet sockets    */

#define AF_IEEE802154    36  /* IEEE802154 sockets       */

#define AF_MAX       37  /* For now.. */

 

static const struct net_proto_family *net_families[NPROTO];
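
The relationship between net_proto_family and the net_families[] array can be sketched in user space as follows. This is a minimal sketch, not kernel code: struct socket is reduced to three fields, create() takes a shortened signature, and the names sock_register_sketch, sock_create_sketch, inet_create_sketch and inet_family_sketch are hypothetical. It only shows the idea that a family registers itself in a slot indexed by its family number, and that socket creation is delegated to that family's create() callback.

```c
#include <stddef.h>

#define NPROTO   37   /* same value as AF_MAX in the listing above */
#define AF_INET  2

/* Simplified stand-in for the kernel's struct socket (assumption). */
struct socket { int family; int type; int protocol; };

struct net_proto_family {
    int family;
    int (*create)(struct socket *sock, int protocol); /* reduced signature */
};

static const struct net_proto_family *net_families[NPROTO];

/* Register a family in its slot; fail if out of range or already taken. */
static int sock_register_sketch(const struct net_proto_family *fam)
{
    if (fam->family < 0 || fam->family >= NPROTO)
        return -1;
    if (net_families[fam->family] != NULL)
        return -2;
    net_families[fam->family] = fam;
    return 0;
}

/* Look up the family and delegate construction to its create() callback. */
static int sock_create_sketch(int family, int type, int protocol,
                              struct socket *sock)
{
    if (family < 0 || family >= NPROTO || net_families[family] == NULL)
        return -3; /* declared, but nobody registered it */
    sock->family = family;
    sock->type = type;
    return net_families[family]->create(sock, protocol);
}

/* Hypothetical create() callback for the IPv4 family. */
static int inet_create_sketch(struct socket *sock, int protocol)
{
    sock->protocol = protocol;
    return 0;
}

static const struct net_proto_family inet_family_sketch = {
    .family = AF_INET,
    .create = inet_create_sketch,
};
```

Calling sock_create_sketch() for a family before it has been registered fails, which mirrors the remark above that a declared family is not necessarily a supported one.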

 

4.14  softnet_data

The kernel allocates one softnet_data instance for each CPU.

Each CPU therefore has its own queue for receiving packets.

/*

 * Incoming packets are placed on per-cpu queues so that

 * no locking is needed.

 */

struct softnet_data {

    struct Qdisc      *output_queue;

    struct list_head  poll_list;

    struct sk_buff      *completion_queue;

 

    /* Elements below can be accessed between CPUs for RPS */

    struct call_single_data  csd ____cacheline_aligned_in_smp;

    unsigned int            input_queue_head;

    struct sk_buff_head  input_pkt_queue;

    struct napi_struct   backlog;

};

 

 

4.15  sk_buff

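Describes the attributes of a single frame: the owning socket, arrival time, arrival device, the header position of each layer, the next-hop routing entry, the frame length, the checksum, and so on.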

/**

 *  struct sk_buff - socket buffer

 *  @next: Next buffer in list

 *  @prev: Previous buffer in list

 *  @sk: Socket we are owned by

 *  @tstamp: Time we arrived

 *  @dev: Device we arrived on/are leaving by

 *  @transport_header: Transport layer header

 *  @network_header: Network layer header

 *  @mac_header: Link layer header

 *  @_skb_dst: destination entry

 *  @sp: the security path, used for xfrm

 *  @cb: Control buffer. Free for use by every layer. Put private vars here

 *  @len: Length of actual data

 *  @data_len: Data length

 *  @mac_len: Length of link layer header

 *  @hdr_len: writable header length of cloned skb

 *  @csum: Checksum (must include start/offset pair)

 *  @csum_start: Offset from skb->head where checksumming should start

 *  @csum_offset: Offset from csum_start where checksum should be stored

 *  @local_df: allow local fragmentation

 *  @cloned: Head may be cloned (check refcnt to be sure)

 *  @nohdr: Payload reference only, must not modify header

 *  @pkt_type: Packet class

 *  @fclone: skbuff clone status

 *  @ip_summed: Driver fed us an IP checksum

 *  @priority: Packet queueing priority

 *  @users: User count - see {datagram,tcp}.c

 *  @protocol: Packet protocol from driver

 *  @truesize: Buffer size

 *  @head: Head of buffer

 *  @data: Data head pointer

 *  @tail: Tail pointer

 *  @end: End pointer

 *  @destructor: Destruct function

 *  @mark: Generic packet mark

 *  @nfct: Associated connection, if any

 *  @ipvs_property: skbuff is owned by ipvs

 *  @peeked: this packet has been seen already, so stats have been

 *     done for it, don't do them again

 *  @nf_trace: netfilter packet trace flag

 *  @nfctinfo: Relationship of this skb to the connection

 *  @nfct_reasm: netfilter conntrack re-assembly pointer

 *  @nf_bridge: Saved data about a bridged frame - see br_netfilter.c

 *  @iif: ifindex of device we arrived on

 *  @queue_mapping: Queue mapping for multiqueue devices

 *  @tc_index: Traffic control index

 *  @tc_verd: traffic control verdict

 *  @ndisc_nodetype: router type (from link layer)

 *  @dma_cookie: a cookie to one of several possible DMA operations

 *     done by skb DMA functions

 *  @secmark: security marking

 *  @vlan_tci: vlan tag control information

 */

struct sk_buff {

    /* These two members must be first. */

    struct sk_buff      *next;

    struct sk_buff      *prev;

 

    struct sock      *sk;

    ktime_t           tstamp;

    struct net_device*dev;

 

    unsigned long     _skb_dst;

#ifdef CONFIG_XFRM

    struct sec_path   *sp;

#endif

    /*

     * This is the control buffer. It is free to use for every

     * layer. Please put your private variables there. If you

     * want to keep them across layers you have to do a skb_clone()

     * first. This is owned by whoever has the skb queued ATM.

     */

    char          cb[48];

 

    unsigned int      len,

              data_len;

    __u16         mac_len,

              hdr_len;

    union {

       __wsum     csum;

       struct {

           __u16  csum_start;

           __u16  csum_offset;

       };

    };

    __u32         priority;

    kmemcheck_bitfield_begin(flags1);

    __u8          local_df:1,

              cloned:1,

              ip_summed:2,

              nohdr:1,

              nfctinfo:3;

    __u8          pkt_type:3,

               fclone:2,

              ipvs_property:1,

              peeked:1,

              nf_trace:1;

    __be16        protocol:16;

    kmemcheck_bitfield_end(flags1);

 

    void          (*destructor)(struct sk_buff*skb);

#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)

    struct nf_conntrack *nfct;

    struct sk_buff      *nfct_reasm;

#endif

#ifdef CONFIG_BRIDGE_NETFILTER

    struct nf_bridge_info   *nf_bridge;

#endif

 

    int        iif;

#ifdef CONFIG_NET_SCHED

    __u16         tc_index; /* traffic control index */

#ifdef CONFIG_NET_CLS_ACT

    __u16         tc_verd;  /* traffic control verdict */

#endif

#endif

 

    kmemcheck_bitfield_begin(flags2);

    __u16         queue_mapping:16;

#ifdef CONFIG_IPV6_NDISC_NODETYPE

    __u8          ndisc_nodetype:2,

              deliver_no_wcard:1;

#else

    __u8          deliver_no_wcard:1;

#endif

#ifndef __GENKSYMS__

    __u8          ooo_okay:1;

#endif

    kmemcheck_bitfield_end(flags2);

 

    /* 0/13 bit hole */

 

#ifdef CONFIG_NET_DMA

    dma_cookie_t      dma_cookie;

#endif

#ifdef CONFIG_NETWORK_SECMARK

    __u32         secmark;

#endif

    union {

       __u32      mark;

       __u32      dropcount;

    };

 

    __u16         vlan_tci;

#ifndef __GENKSYMS__

    __u16         rxhash;

#endif

    sk_buff_data_t       transport_header;

    sk_buff_data_t       network_header;

    sk_buff_data_t       mac_header;

    /* These elements must be at the end, see alloc_skb() for details.  */

    sk_buff_data_t       tail;

    sk_buff_data_t       end;

    unsigned char     *head,

              *data;

    unsigned int      truesize;

    atomic_t      users;

};
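
How the head, data, tail and end members cooperate as headers are added and removed can be illustrated with a small user-space sketch. The structure and the skb_sketch_* helpers below are hypothetical simplifications: they use plain pointers for all four members (in the listing above, tail and end are of type sk_buff_data_t), ignore cloning and fragments, and only imitate the general behaviour of reserving headroom, appending payload, prepending a header and stripping a header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct skb_sketch {
    unsigned char *head, *data, *tail, *end;
    unsigned int len;              /* bytes between data and tail */
};

static int skb_sketch_alloc(struct skb_sketch *skb, unsigned int size)
{
    skb->head = malloc(size);
    if (skb->head == NULL)
        return -1;
    skb->data = skb->tail = skb->head;
    skb->end = skb->head + size;
    skb->len = 0;
    return 0;
}

/* Leave n bytes of headroom in an empty buffer for headers added later. */
static void skb_sketch_reserve(struct skb_sketch *skb, unsigned int n)
{
    assert(skb->len == 0 && (size_t)(skb->end - skb->tail) >= n);
    skb->data += n;
    skb->tail += n;
}

/* Append n bytes at the tail; returns where the caller should write them. */
static unsigned char *skb_sketch_put(struct skb_sketch *skb, unsigned int n)
{
    unsigned char *old_tail = skb->tail;
    assert((size_t)(skb->end - skb->tail) >= n);
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Prepend n bytes (a header) in front of the current data. */
static unsigned char *skb_sketch_push(struct skb_sketch *skb, unsigned int n)
{
    assert((size_t)(skb->data - skb->head) >= n);
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* Strip n bytes from the front once a layer has consumed its header. */
static unsigned char *skb_sketch_pull(struct skb_sketch *skb, unsigned int n)
{
    assert(n <= skb->len);
    skb->data += n;
    skb->len -= n;
    return skb->data;
}
```

On the transmit path a sender would typically reserve room for all headers, put the payload, and then push the UDP, IP and Ethernet headers in that order; on the receive path each layer pulls its own header before handing the buffer upward. No data is copied in either direction, only the data pointer moves.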

4.16  sk_buff_head

Packet queue structure.

struct sk_buff_head {

    /* These two members must be first.*/

    struct sk_buff    *next;

    struct sk_buff    *prev;

 

    __u32      qlen;

    spinlock_t lock;

};
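
The comment "These two members must be first" is what allows the queue head and the buffers to sit on one circular doubly linked list. The following user-space sketch shows the idea under simplifying assumptions: the shared next/prev pair is factored into a hypothetical struct link_sketch, the spinlock is left out, and the names pkt_sketch, pkt_queue_sketch, pkt_queue_init, pkt_queue_tail and pkt_dequeue are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct link_sketch { struct link_sketch *next, *prev; };

struct pkt_sketch {
    struct link_sketch link;   /* must be first so a link pointer is a packet pointer */
    int id;
};

struct pkt_queue_sketch {
    struct link_sketch link;   /* must be first: the head is itself a list node */
    unsigned int qlen;
};

/* An empty queue points at itself in both directions. */
static void pkt_queue_init(struct pkt_queue_sketch *q)
{
    q->link.next = q->link.prev = &q->link;
    q->qlen = 0;
}

/* Append a packet at the tail of the queue. */
static void pkt_queue_tail(struct pkt_queue_sketch *q, struct pkt_sketch *p)
{
    assert(p != NULL);
    p->link.next = &q->link;
    p->link.prev = q->link.prev;
    q->link.prev->next = &p->link;
    q->link.prev = &p->link;
    q->qlen++;
}

/* Remove and return the packet at the front, or NULL if the queue is empty. */
static struct pkt_sketch *pkt_dequeue(struct pkt_queue_sketch *q)
{
    struct link_sketch *l = q->link.next;
    if (l == &q->link)
        return NULL;
    q->link.next = l->next;
    l->next->prev = &q->link;
    l->next = l->prev = NULL;
    q->qlen--;
    return (struct pkt_sketch *)l;  /* valid because link is the first member */
}
```

Because the head participates in the ring, insertion and removal need no special case for an empty list; the only check is whether the next element is the head itself.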

4.17  net_device

This very large structure describes all the attributes and data of a network device.

/*

 *  The DEVICE structure.

 *  Actually, this whole structure is a big mistake.  It mixes I/O

 *  data with strictly "high-level" data, and it has to know about

 *  almost every data structure used in the INET module.

 *

 *  FIXME: cleanup struct net_device such that network protocol info

 *  moves out.

 */

 

struct net_device

{

 

    /*

     * This is the first field of the "visible" part of this structure

     * (i.e. as seen by users in the "Space.c" file).  It is the name

     * of the interface.

     */

    char          name[IFNAMSIZ];

    /* device name hash chain */

    struct hlist_node name_hlist;

    /* snmp alias */

    char          *ifalias;

 

    /*

     *  I/O specific fields

     *  FIXME: Merge these and struct ifmap into one

     */

    unsigned long     mem_end;   /* shared mem end */

    unsigned long     mem_start; /* shared mem start  */

    unsigned long     base_addr; /* device I/O address    */

    unsigned int      irq;       /* device IRQ number */

 

    /*

     *  Some hardware also needs these fields, but they are not

     *  part of the usual set specified in Space.c.

     */

 

    unsigned char     if_port;   /* Selectable AUI,TP,..*/

    unsigned char     dma;       /* DMA channel       */

 

    unsigned long     state;

 

    struct list_head  dev_list;

    struct list_head  napi_list;

 

    /* Net device features */

    unsigned long     features;

#define NETIF_F_SG       1   /* Scatter/gather IO. */

#define NETIF_F_IP_CSUM     2  /* Can checksum TCP/UDP over IPv4. */

#define NETIF_F_NO_CSUM     4  /* Does not require checksum. F.e. loopack. */

#define NETIF_F_HW_CSUM     8  /* Can checksum all the packets. */

#define NETIF_F_IPV6_CSUM   16 /* Can checksum TCP/UDP over IPV6 */

#define NETIF_F_HIGHDMA     32 /* Can DMA to high memory. */

#define NETIF_F_FRAGLIST 64  /* Scatter/gather IO. */

#define NETIF_F_HW_VLAN_TX  128/* Transmit VLAN hw acceleration */

#define NETIF_F_HW_VLAN_RX  256/* Receive VLAN hw acceleration */

#define NETIF_F_HW_VLAN_FILTER  512/* Receive filtering on VLAN */

#define NETIF_F_VLAN_CHALLENGED 1024  /* Device cannot handle VLAN packets */

#define NETIF_F_GSO      2048  /* Enable software GSO. */

#define NETIF_F_LLTX     4096  /* LockLess TX - deprecated. Please */

                  /* do not use LLTX in new drivers */

#define NETIF_F_NETNS_LOCAL 8192  /* Does not change network namespaces */

#define NETIF_F_GRO      16384 /* Generic receive offload */

#define NETIF_F_LRO      32768 /* large receive offload */

 

/* the GSO_MASK reserves bits 16 through 23 */

#define NETIF_F_FCOE_CRC    (1 << 24) /* FCoE CRC32 */

#define NETIF_F_SCTP_CSUM   (1 << 25) /* SCTP checksum offload */

#define NETIF_F_FCOE_MTU    (1 << 26) /* Supports max FCoE MTU, 2158 bytes */

#define NETIF_F_NTUPLE      (1 << 27) /* N-tuple filters supported */

#define NETIF_F_RXHASH      (1 << 28) /* Receive hashing offload */

#define NETIF_F_RXCSUM      (1 << 29) /* Receive checksumming offload */

 

    /* Segmentation offload features */

#define NETIF_F_GSO_SHIFT   16

#define NETIF_F_GSO_MASK 0x00ff0000

#define NETIF_F_TSO      (SKB_GSO_TCPV4<< NETIF_F_GSO_SHIFT)

#define NETIF_F_UFO      (SKB_GSO_UDP<< NETIF_F_GSO_SHIFT)

#define NETIF_F_GSO_ROBUST  (SKB_GSO_DODGY<< NETIF_F_GSO_SHIFT)

#define NETIF_F_TSO_ECN     (SKB_GSO_TCP_ECN<< NETIF_F_GSO_SHIFT)

#define NETIF_F_TSO6     (SKB_GSO_TCPV6<< NETIF_F_GSO_SHIFT)

#define NETIF_F_FSO      (SKB_GSO_FCOE<< NETIF_F_GSO_SHIFT)

#define NETIF_F_ALL_TSO (NETIF_F_TSO| NETIF_F_TSO6 | NETIF_F_TSO_ECN)

 

    /* List of features with software fallbacks. */

#define NETIF_F_GSO_SOFTWARE    (NETIF_F_TSO| NETIF_F_TSO_ECN | \

               NETIF_F_TSO6 | NETIF_F_UFO)

 

 

#define NETIF_F_GEN_CSUM (NETIF_F_NO_CSUM| NETIF_F_HW_CSUM)

#define NETIF_F_V4_CSUM     (NETIF_F_GEN_CSUM| NETIF_F_IP_CSUM)

#define NETIF_F_V6_CSUM     (NETIF_F_GEN_CSUM| NETIF_F_IPV6_CSUM)

#define NETIF_F_ALL_CSUM (NETIF_F_V4_CSUM| NETIF_F_V6_CSUM)

 

    /*

     * If one device supports one of these features, then enable them

     * for all in netdev_increment_features.

     */

#define NETIF_F_ONE_FOR_ALL (NETIF_F_GSO_SOFTWARE| NETIF_F_GSO_ROBUST | \

               NETIF_F_SG | NETIF_F_HIGHDMA |    \

               NETIF_F_FRAGLIST)

 

    /* Interface index. Unique device identifier  */

    int        ifindex;

    int        iflink;

 

    struct net_device_stats  stats;

 

#ifdef CONFIG_WIRELESS_EXT

    /* List of functions to handle Wireless Extensions (instead of ioctl).

     * See for details. Jean II */

    const struct iw_handler_def *wireless_handlers;

    /* Instance data managed by the core of Wireless Extensions. */

    struct iw_public_data *wireless_data;

#endif

    /* Management operations */

    const struct net_device_ops *netdev_ops;

    const struct ethtool_ops *ethtool_ops;

 

    /* Hardware header description */

    const struct header_ops *header_ops;

 

    unsigned int      flags; /* interface flags (a la BSD)   */

    unsigned short       gflags;

    unsigned short       priv_flags; /* Like 'flags' but invisible to userspace. */

    unsigned short       padded;    /* How much padding added by alloc_netdev() */

 

    unsigned char     operstate; /* RFC2863 operstate */

    unsigned char     link_mode; /* mapping policy to operstate */

 

    unsigned      mtu;  /* interface MTU value      */

    unsigned short       type;  /* interface hardware type  */

    unsigned short       hard_header_len; /* hardware hdr length   */

 

    /* extra head- and tailroom the hardware may need, but not in all cases

     * can this be guaranteed, especially tailroom. Some cases also use

     * LL_MAX_HEADER instead to allocate the skb.

     */

    unsigned short       needed_headroom;

    unsigned short       needed_tailroom;

 

    struct net_device *master; /* Pointer to master device of a group,

                    * which this device is member of.

                    */

 

    /* Interface address info. */

    unsigned char     perm_addr[MAX_ADDR_LEN]; /* permanent hw address */

    unsigned char     addr_assign_type; /* hw address assignment type */

    unsigned char     addr_len;  /* hardware address length  */

    unsigned short    dev_id;    /* for shared network cards */

 

    struct netdev_hw_addr_list  uc;/* Secondary unicast

                        mac addresses */

    int        uc_promisc;

    spinlock_t    addr_list_lock;

    struct dev_addr_list*mc_list; /* Multicast mac addresses  */

    int        mc_count; /* Number of installed mcasts   */

    unsigned int      promiscuity;

    unsigned int      allmulti;

 

 

    /* Protocol specific pointers */

   

#ifdef CONFIG_NET_DSA

    void          *dsa_ptr; /* dsa specific data */

#endif

    void          *atalk_ptr;  /* AppleTalk link    */

    void          *ip_ptr;  /* IPv4 specific data    */

    void                   *dn_ptr;       /* DECnet specific data */

    void          *ip6_ptr; /* IPv6 specific data */

    void          *ec_ptr;  /* Econet specific data  */

    void          *ax25_ptr;/* AX.25 specific data

                        also used by openvswitch */

    struct wireless_dev *ieee80211_ptr;  /* IEEE 802.11 specific data,

                        assign before registering */

 

/*

 * Cache line mostly used on receive path (including eth_type_trans())

 */

    unsigned long     last_rx;   /* Time of last Rx   */

    /* Interface address info used in eth_type_trans() */

    unsigned char     *dev_addr; /* hw address, (before bcast

                        because most packets are

                        unicast) */

 

    struct netdev_hw_addr_list  dev_addrs;/* list of device

                           hw addresses */

 

    unsigned char     broadcast[MAX_ADDR_LEN];/* hw bcast add   */

 

    struct netdev_queue  rx_queue;

 

    struct netdev_queue *_tx ____cacheline_aligned_in_smp;

 

    /* Number of TX queues allocated at alloc_netdev_mq() time  */

    unsigned int      num_tx_queues;

 

    /* Number of TX queues currently active in device  */

    unsigned int      real_num_tx_queues;

 

    /* root qdisc from userspace point of view */

    struct Qdisc      *qdisc;

 

    unsigned long     tx_queue_len; /* Max frames per queue allowed */

    spinlock_t    tx_global_lock;

/*

 * One part is mostly used on xmit path (device)

 */

    /* These may be needed for future network-power-down code. */

 

    /*

     * trans_start here is expensive for high speed devices on SMP,

     * please use netdev_queue->trans_start instead.

     */

    unsigned long     trans_start;  /* Time (in jiffies) of last Tx */

 

    int        watchdog_timeo; /* used by dev_watchdog() */

    struct timer_list watchdog_timer;

 

    /* Number of references to this device */

    atomic_t      refcnt ____cacheline_aligned_in_smp;

 

    /* delayed register/unregister */

    struct list_head  todo_list;

    /* device index hash chain */

    struct hlist_node index_hlist;

 

    struct net_device *link_watch_next;

 

    /* register/unregister state machine */

    enum { NETREG_UNINITIALIZED=0,

           NETREG_REGISTERED,    /* completed register_netdevice */

           NETREG_UNREGISTERING, /* called unregister_netdevice */

           NETREG_UNREGISTERED,  /* completed unregister todo */

           NETREG_RELEASED,      /* called free_netdev */

           NETREG_DUMMY,         /* dummy device for NAPI poll */

    } reg_state;

 

    /* Called from unregister, can be used to call free_netdev */

    void (*destructor)(struct net_device *dev);

 

#ifdef CONFIG_NETPOLL

    struct netpoll_info *npinfo;

#endif

 

#ifdef CONFIG_NET_NS

    /* Network namespace this network device is inside */

    struct net    *nd_net;

#endif

 

    /* mid-layer private */

    void          *ml_priv;

 

    /* bridge stuff */

    struct net_bridge_port  *br_port;

    /* macvlan */

    struct macvlan_port *macvlan_port;

    /* GARP */

    struct garp_port  *garp_port;

 

    /* class/net/name entry */

    struct device     dev;

    /* space for optional statistics and wireless sysfs groups */

    const struct attribute_group *sysfs_groups[3];

 

    /* rtnetlink link ops */

    const struct rtnl_link_ops *rtnl_link_ops;

 

    /* VLAN feature mask */

    unsigned long vlan_features;

 

    /* for setting kernel sock attribute on TCP connection setup */

#define GSO_MAX_SIZE     65536

    unsigned int      gso_max_size;

 

#ifdef CONFIG_DCB

    /* Data Center Bridging netlink ops */

    const struct dcbnl_rtnl_ops *dcbnl_ops;

#endif

 

#if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)

    /* max exchange id for FCoE LRO by ddp */

    unsigned int      fcoe_ddp_xid;

#endif

};
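
The features field is a plain bit mask, and the composite macros such as NETIF_F_V4_CSUM are just ORs of the basic bits. As a small illustration, the sketch below copies the checksum-related values from the listing above and tests them the way a transmit path might before leaving the checksum to hardware. The helper can_offload_csum and the enum l3_sketch are hypothetical names introduced for this example; the kernel's own decision logic takes more inputs into account.

```c
/* Values copied from the NETIF_F_* definitions in the listing above. */
#define NETIF_F_SG          1
#define NETIF_F_IP_CSUM     2
#define NETIF_F_NO_CSUM     4
#define NETIF_F_HW_CSUM     8
#define NETIF_F_IPV6_CSUM   16

#define NETIF_F_GEN_CSUM    (NETIF_F_NO_CSUM | NETIF_F_HW_CSUM)
#define NETIF_F_V4_CSUM     (NETIF_F_GEN_CSUM | NETIF_F_IP_CSUM)
#define NETIF_F_V6_CSUM     (NETIF_F_GEN_CSUM | NETIF_F_IPV6_CSUM)

/* Hypothetical tag for the network-layer protocol of an outgoing packet. */
enum l3_sketch { L3_IPV4, L3_IPV6, L3_OTHER };

/*
 * Return 1 if a device with the given feature bits does not need a
 * software checksum for this kind of packet, 0 otherwise.
 */
static int can_offload_csum(unsigned long features, enum l3_sketch l3)
{
    if (features & NETIF_F_GEN_CSUM)
        return 1;                  /* generic: any protocol, or none needed */
    if (l3 == L3_IPV4)
        return (features & NETIF_F_IP_CSUM) != 0;
    if (l3 == L3_IPV6)
        return (features & NETIF_F_IPV6_CSUM) != 0;
    return 0;
}
```

A device that advertises only NETIF_F_IP_CSUM can therefore offload TCP/UDP checksums over IPv4 but not over IPv6, while NETIF_F_HW_CSUM covers every protocol.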

 

 

4.18  inet_protosw

Registers the socket-layer operation interfaces with the IP layer.

/* This is used to register socket interfaces for IP protocols.  */

struct inet_protosw {

    struct list_head list;

 

    /* These two fields form the lookup key.  */

    unsigned short    type;     /* This is the 2nd argument to socket(2). */

    unsigned short    protocol; /* This is the L4 protocol number.  */

 

    struct proto *prot;

    const struct proto_ops *ops;

 

    char             no_check;  /* checksum on rcv/xmit/none? */

    unsigned char flags;     /* See INET_PROTOSW_* below.  */

};

 

4.19  inetsw_array

All the interfaces through which the socket layer calls into the IP layer are registered in this array.

/* Upon startup we insert all the elements in inetsw_array[] into

 * the linked list inetsw.

 */

static struct inet_protosw inetsw_array[] =

{

    {

       .type =      SOCK_STREAM,

       .protocol=  IPPROTO_TCP,

       .prot =      &tcp_prot,

       .ops =        &inet_stream_ops,

       .no_check=  0,

       .flags =     INET_PROTOSW_PERMANENT |

                 INET_PROTOSW_ICSK,

    },

 

    {

       .type =      SOCK_DGRAM,

       .protocol =  IPPROTO_UDP,

       .prot =       &udp_prot,

       .ops =        &inet_dgram_ops,

       .no_check=  UDP_CSUM_DEFAULT,

       .flags =     INET_PROTOSW_PERMANENT,

       },

 

       {

       .type =      SOCK_DGRAM,

       .protocol=  IPPROTO_ICMP,

       .prot =      &ping_prot,

       .ops =        &inet_dgram_ops,

       .no_check=  UDP_CSUM_DEFAULT,

       .flags =     INET_PROTOSW_REUSE,

       },

 

       {

           .type=      SOCK_RAW,

           .protocol=  IPPROTO_IP, /* wild card */

           .prot=      &raw_prot,

           .ops =        &inet_sockraw_ops,

           .no_check=  UDP_CSUM_DEFAULT,

           .flags=     INET_PROTOSW_REUSE,

       }

};
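
Since type and protocol together form the lookup key, creating a socket amounts to finding the matching entry in this table. The sketch below is a user-space approximation of that matching, including the wildcard role of IPPROTO_IP (value 0). It is loosely modelled on the matching rules used when an IPv4 socket is created, but the table inetsw_sketch, the struct protosw_sketch and the function protosw_lookup are hypothetical, and the per-type linked lists are flattened into one array.

```c
#include <stddef.h>

enum { SOCK_STREAM = 1, SOCK_DGRAM = 2, SOCK_RAW = 3 };
enum { IPPROTO_IP = 0, IPPROTO_ICMP = 1, IPPROTO_TCP = 6, IPPROTO_UDP = 17 };

struct protosw_sketch {
    unsigned short type;
    unsigned short protocol;
    const char *prot_name;       /* stands in for the prot/ops pointers */
};

/* Same four entries, in the same order, as inetsw_array above. */
static const struct protosw_sketch inetsw_sketch[] = {
    { SOCK_STREAM, IPPROTO_TCP,  "tcp_prot"  },
    { SOCK_DGRAM,  IPPROTO_UDP,  "udp_prot"  },
    { SOCK_DGRAM,  IPPROTO_ICMP, "ping_prot" },
    { SOCK_RAW,    IPPROTO_IP,   "raw_prot"  },
};

static const struct protosw_sketch *protosw_lookup(int type, int protocol)
{
    size_t n = sizeof(inetsw_sketch) / sizeof(inetsw_sketch[0]);

    for (size_t i = 0; i < n; i++) {
        const struct protosw_sketch *e = &inetsw_sketch[i];

        if (e->type != type)
            continue;
        if (protocol == e->protocol) {
            if (protocol != IPPROTO_IP)
                return e;        /* exact, non-wildcard match */
        } else {
            if (protocol == IPPROTO_IP)
                return e;        /* caller passed 0: first entry of this type */
            if (e->protocol == IPPROTO_IP)
                return e;        /* entry is a wildcard: accepts any protocol */
        }
    }
    return NULL;                 /* no such type/protocol combination */
}
```

With this table, a SOCK_DGRAM request with protocol 0 resolves to the UDP entry, while a SOCK_RAW request with any non-zero protocol resolves to the wildcard raw entry.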

4.20  sock_type

Socket types.

/**

 * enum sock_type - Socket types

 * @SOCK_STREAM: stream (connection) socket

 * @SOCK_DGRAM: datagram (conn.less) socket

 * @SOCK_RAW: raw socket

 * @SOCK_RDM: reliably-delivered message

 * @SOCK_SEQPACKET: sequential packet socket

 * @SOCK_DCCP: Datagram Congestion Control Protocol socket

 * @SOCK_PACKET: linux specific way of getting packets at the dev level.

 *       For writing rarp and other similar things on the user level.

 *

 * When adding some new socket type please

 * grep ARCH_HAS_SOCKET_TYPE include/asm-* /socket.h, at least MIPS

 * overrides this enum for binary compat reasons.

 */

enum sock_type {

    SOCK_STREAM   =1,

    SOCK_DGRAM = 2,

    SOCK_RAW   =3,

    SOCK_RDM   =4,

    SOCK_SEQPACKET    =5,

    SOCK_DCCP  =6,

    SOCK_PACKET   =10,

};

4.21  IPPROTO

Transport-layer protocol type IDs.

/* Standard well-defined IP protocols. */

enum {

  IPPROTO_IP = 0,     /* Dummy protocol for TCP       */

  IPPROTO_ICMP =1,     /* Internet Control Message Protocol   */

  IPPROTO_IGMP =2,     /* Internet Group Management Protocol  */

  IPPROTO_IPIP =4,     /* IPIP tunnels (older KA9Q tunnels use 94) */

  IPPROTO_TCP = 6,       /* Transmission Control Protocol   */

  IPPROTO_EGP = 8,       /* Exterior Gateway Protocol       */

  IPPROTO_PUP = 12,      /* PUP protocol             */

  IPPROTO_UDP = 17,      /* User Datagram Protocol       */

  IPPROTO_IDP = 22,      /* XNS IDP protocol         */

  IPPROTO_DCCP =33,    /* Datagram Congestion Control Protocol */

  IPPROTO_RSVP =46,    /* RSVP protocol         */

  IPPROTO_GRE = 47,      /* Cisco GRE tunnels (rfc 1701,1702)   */

 

  IPPROTO_IPV6 = 41,     /* IPv6-in-IPv4 tunnelling      */

 

  IPPROTO_ESP = 50,      /* Encapsulation Security Payload protocol */

  IPPROTO_AH = 51,       /* Authentication Header protocol       */

  IPPROTO_BEETPH = 94,   /* IP option pseudo header for BEET */

  IPPROTO_PIM    = 103,  /* Protocol Independent Multicast  */

 

  IPPROTO_COMP   = 108,  /* Compression Header protocol */

  IPPROTO_SCTP   = 132,  /* Stream Control Transport Protocol   */

  IPPROTO_UDPLITE = 136, /* UDP-Lite (RFC 3828)          */

 

  IPPROTO_RAW  =255,   /* Raw IP packets        */

  IPPROTO_MAX

};

 

/* The inetsw table contains everything that inet_create needs to

 * build a new socket.

 */

static struct list_head inetsw[SOCK_MAX];

 

 

4.22  net_protocol

Used by transport-layer protocols to register their packet-receive interfaces with the IP layer.

/* This is used to register protocols. */

struct net_protocol {

    int        (*handler)(struct sk_buff *skb);

    void          (*err_handler)(struct sk_buff*skb, u32 info);

    int        (*gso_send_check)(struct sk_buff*skb);

    struct sk_buff          *(*gso_segment)(struct sk_buff*skb,

                         int features);

    struct sk_buff         **(*gro_receive)(struct sk_buff**head,

                         struct sk_buff*skb);

    int        (*gro_complete)(struct sk_buff*skb);

    unsigned int      no_policy:1,

              netns_ok:1;

};

 

Example: the interface that UDP registers with the IP layer.

static const struct net_protocol udp_protocol = {

    .handler = udp_rcv,

    .err_handler=    udp_err,

    .gso_send_check= udp4_ufo_send_check,

    .gso_segment= udp4_ufo_fragment,

    .no_policy =  1,

    .netns_ok =   1,

};

 

All the receive interfaces that the IP layer delivers to are registered in this array.

extern const struct net_protocol *inet_protos[MAX_INET_PROTOS];
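
The inet_protos[] array works as a dispatch table indexed by the protocol number carried in the IP header. The user-space sketch below shows that idea only. All names ending in _sketch, as well as add_protocol and local_deliver, are hypothetical; the packet is reduced to a protocol byte and a marker field; MAX_INET_PROTOS is assumed to be 256; and locking and RCU are ignored.

```c
#include <stddef.h>

#define MAX_INET_PROTOS 256      /* assumed table size */

struct pkt_sketch {
    unsigned char protocol;      /* protocol field of the IP header */
    int delivered_to;            /* set by the handler, for demonstration */
};

struct net_protocol_sketch {
    int (*handler)(struct pkt_sketch *pkt);
};

static const struct net_protocol_sketch *inet_protos_sketch[MAX_INET_PROTOS];

/* Register a transport protocol; fails if the slot is already occupied. */
static int add_protocol(const struct net_protocol_sketch *prot,
                        unsigned char protocol)
{
    unsigned int hash = protocol & (MAX_INET_PROTOS - 1);

    if (inet_protos_sketch[hash] != NULL)
        return -1;
    inet_protos_sketch[hash] = prot;
    return 0;
}

/* What the IP layer does once a packet is known to be for this host. */
static int local_deliver(struct pkt_sketch *pkt)
{
    unsigned int hash = pkt->protocol & (MAX_INET_PROTOS - 1);
    const struct net_protocol_sketch *prot = inet_protos_sketch[hash];

    if (prot == NULL)
        return -1;               /* no transport protocol registered */
    return prot->handler(pkt);
}

/* Hypothetical stand-in for udp_rcv(). */
static int udp_rcv_sketch(struct pkt_sketch *pkt)
{
    pkt->delivered_to = 17;
    return 0;
}

static const struct net_protocol_sketch udp_protocol_sketch = {
    .handler = udp_rcv_sketch,
};
```

Registering udp_protocol_sketch under number 17 and then delivering a packet whose protocol field is 17 ends up in udp_rcv_sketch(), which is the same shape as UDP registering udp_rcv as its handler in the example above.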

4.23  packet_type

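Describes a handler for Ethernet packets, including the Ethernet frame type it accepts and the function that processes such packets.

struct packet_type {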

    __be16        type;  /* This is really htons(ether_type). */

    struct net_device *dev; /* NULL is wildcarded here       */

    int        (*func) (struct sk_buff*,

                   struct net_device*,

                   struct packet_type*,

                   struct net_device*);

    struct sk_buff      *(*gso_segment)(struct sk_buff*skb,

                     int features);

    int        (*gso_send_check)(struct sk_buff*skb);

    struct sk_buff      **(*gro_receive)(struct sk_buff**head,

                         struct sk_buff*skb);

    int        (*gro_complete)(struct sk_buff*skb);

    void          *af_packet_priv;

    struct list_head  list;

};

 

The packet-handling interface that the IP protocol registers with the link layer.

/*

 *  IP protocol layer initialiser

 */

static struct packet_type ip_packet_type = {

    .type = cpu_to_be16(ETH_P_IP),

    .func = ip_rcv,

    .gso_send_check= inet_gso_send_check,

    .gso_segment= inet_gso_segment,

    .gro_receive= inet_gro_receive,

    .gro_complete= inet_gro_complete,

};

/*

 *  These are the defined Ethernet Protocol ID's.

 */

 

#define ETH_P_LOOP   0x0060    /* Ethernet Loopback packet */

#define ETH_P_PUP 0x0200     /* Xerox PUP packet      */

#define ETH_P_PUPAT  0x0201    /* Xerox PUP Addr Trans packet  */

#define ETH_P_IP  0x0800    /* Internet Protocol packet */

#define ETH_P_X25 0x0805     /* CCITT X.25        */

#define ETH_P_ARP 0x0806     /* Address Resolution packet    */

#define ETH_P_BPQ 0x08FF     /* G8BPQ AX.25 Ethernet Packet  [ NOT AN OFFICIALLY REGISTERED ID ] */

#define ETH_P_IEEEPUP    0x0a00    /* Xerox IEEE802.3 PUP packet */

#define ETH_P_IEEEPUPAT  0x0a01    /* Xerox IEEE802.3 PUP Addr Trans packet */

#define ETH_P_DEC       0x6000         /* DEC Assigned proto           */

#define ETH_P_DNA_DL    0x6001         /* DEC DNA Dump/Load            */

#define ETH_P_DNA_RC    0x6002         /* DEC DNA Remote Console       */

#define ETH_P_DNA_RT    0x6003         /* DEC DNA Routing              */

#define ETH_P_LAT       0x6004         /* DEC LAT                      */

#define ETH_P_DIAG      0x6005         /* DEC Diagnostics              */

#define ETH_P_CUST      0x6006         /* DEC Customer use             */

#define ETH_P_SCA       0x6007         /* DEC Systems Comms Arch       */

#define ETH_P_TEB 0x6558     /* Trans Ether Bridging     */

#define ETH_P_RARP      0x8035      /* Reverse Addr Res packet  */

#define ETH_P_ATALK  0x809B    /* Appletalk DDP     */

#define ETH_P_AARP   0x80F3    /* Appletalk AARP    */

#define ETH_P_8021Q  0x8100         /* 802.1Q VLAN Extended Header  */

#define ETH_P_IPX 0x8137     /* IPX over DIX          */

#define ETH_P_IPV6   0x86DD    /* IPv6 over bluebook       */

#define ETH_P_PAUSE  0x8808    /* IEEE Pause frames. See 802.3 31B */

#define ETH_P_SLOW   0x8809    /* Slow Protocol. See 802.3ad 43B */

#define ETH_P_WCCP   0x883E    /* Web-cache coordination protocol

                   * defined in draft-wilson-wrec-wccp-v2-00.txt*/

#define ETH_P_PPP_DISC   0x8863    /* PPPoE discovery messages     */

#define ETH_P_PPP_SES    0x8864    /* PPPoE session messages   */

#define ETH_P_MPLS_UC    0x8847    /* MPLS Unicast traffic     */

#define ETH_P_MPLS_MC    0x8848    /* MPLS Multicast traffic   */

#define ETH_P_ATMMPOA    0x884c    /* MultiProtocol Over ATM   */

#define ETH_P_ATMFATE    0x8884    /* Frame-based ATM Transport

                   * over Ethernet

                   */

#define ETH_P_PAE 0x888E     /* Port Access Entity (IEEE 802.1X) */

#define ETH_P_AOE 0x88A2     /* ATA over Ethernet     */

#define ETH_P_TIPC   0x88CA    /* TIPC           */

#define ETH_P_1588   0x88F7    /* IEEE 1588 Timesync */

#define ETH_P_FCOE   0x8906    /* Fibre Channel over Ethernet  */

#define ETH_P_TDLS   0x890D    /* TDLS */

#define ETH_P_FIP 0x8914     /* FCoE Initialization Protocol */

#define ETH_P_EDSA   0xDADA    /* Ethertype DSA [ NOT AN OFFICIALLY REGISTERED ID] */

#define ETH_P_AF_IUCV   0xFBFB     /* IBM af_iucv [ NOT AN OFFICIALLY REGISTERED ID ]*/


 

The handler functions that the network layer registers with the link layer are collected in this data structure.

static struct list_head ptype_base[PTYPE_HASH_SIZE];
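
ptype_base is a small hash table of handler lists keyed by the Ethernet type. The sketch below shows, in user space, how a received frame could be matched against it. It is only an approximation: the names ending in _sketch, add_pack and deliver are hypothetical; the type is kept in host byte order (the kernel stores htons(ether_type)); the device is represented by an interface index where 0 plays the role of the NULL wildcard; PTYPE_HASH_SIZE is assumed to be 16; and singly linked chains replace list_head.

```c
#include <stddef.h>

#define PTYPE_HASH_SIZE 16       /* assumed table size */
#define PTYPE_HASH_MASK (PTYPE_HASH_SIZE - 1)
#define ETH_P_IP   0x0800
#define ETH_P_ARP  0x0806

struct ptype_sketch {
    unsigned short type;         /* host byte order in this sketch */
    int ifindex;                 /* 0 = any device (wildcard) */
    int (*func)(int ifindex);
    struct ptype_sketch *next;
};

static struct ptype_sketch *ptype_base_sketch[PTYPE_HASH_SIZE];

/* Register a handler on the chain selected by the hash of its type. */
static void add_pack(struct ptype_sketch *pt)
{
    unsigned int h = pt->type & PTYPE_HASH_MASK;

    pt->next = ptype_base_sketch[h];
    ptype_base_sketch[h] = pt;
}

/*
 * Hand a frame to every handler whose type matches and whose device is
 * either the wildcard or the receiving device. Returns the match count.
 */
static int deliver(unsigned short type, int ifindex)
{
    int matched = 0;
    struct ptype_sketch *pt;

    for (pt = ptype_base_sketch[type & PTYPE_HASH_MASK]; pt; pt = pt->next) {
        if (pt->type == type && (pt->ifindex == 0 || pt->ifindex == ifindex)) {
            pt->func(ifindex);
            matched++;
        }
    }
    return matched;
}

static int ip_hits, arp_hits;    /* counters used only for demonstration */
static int ip_rcv_sketch(int ifindex)  { (void)ifindex; return ++ip_hits; }
static int arp_rcv_sketch(int ifindex) { (void)ifindex; return ++arp_hits; }
```

Note that 0x0800 (IP) and 0x8100 (802.1Q) fall into the same bucket under this mask, which is why the full type is compared again while walking the chain.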

4.24  rtable

Routing table structure; each rtable describes one route in full.

struct rtable {

    union

    {

       struct dst_entry  dst;

    } u;

 

    /* Cache lookup keys */

    struct flowi      fl;

 

    struct in_device  *idev;

   

    int        rt_genid;

    unsigned      rt_flags;

    __u16         rt_type;

 

    __be32        rt_dst;   /* Path destination  */

    __be32        rt_src;   /* Path source       */

    int        rt_iif;

 

    /* Info on neighbour */

    __be32        rt_gateway;

 

    /* Miscellaneous cached information*/

    __be32        rt_spec_dst;/* RFC1122 specific destination */

    struct inet_peer  *peer; /* long-living peer info */

};

 

4.25  rt_hash_bucket

Routing table cache.

/*

 * Route cache.

 */

 

/* The locking scheme is rather straight forward:

 *

 * 1) Read-Copy Update protects the buckets of the central route hash.

 * 2) Only writers remove entries, and they hold the lock

 *   as they look at rtable reference counts.

 * 3) Only readers acquire references to rtable entries,

 *   they do so with atomic increments and with the

 *   lock held.

 */

 

struct rt_hash_bucket {

    struct rtable *chain;

};
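
Each rt_hash_bucket is simply the head of a chain of rtable entries that hashed to the same slot. A user-space sketch of insertion and lookup on such a table might look like the following. Everything here is an assumption made for illustration: the table size, the hash function (a plain XOR, not the kernel's), the reduced set of key fields, and all names ending in _sketch as well as rt_cache_insert and rt_cache_lookup. RCU and locking are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RT_HASH_SIZE_SKETCH 8    /* hypothetical table size */

struct rtable_sketch {
    uint32_t rt_dst, rt_src;     /* lookup key: destination and source */
    int rt_iif;                  /* lookup key: input interface */
    uint32_t rt_gateway;         /* cached result: next hop */
    struct rtable_sketch *rt_next;
};

struct rt_hash_bucket_sketch {
    struct rtable_sketch *chain;
};

static struct rt_hash_bucket_sketch rt_table_sketch[RT_HASH_SIZE_SKETCH];

/* Placeholder hash; the kernel uses a stronger, randomized hash. */
static unsigned int rt_hash_sketch(uint32_t dst, uint32_t src, int iif)
{
    return (dst ^ src ^ (uint32_t)iif) % RT_HASH_SIZE_SKETCH;
}

/* Put a new entry at the front of its bucket's chain. */
static void rt_cache_insert(struct rtable_sketch *rt)
{
    unsigned int h = rt_hash_sketch(rt->rt_dst, rt->rt_src, rt->rt_iif);

    assert(rt->rt_next == NULL);
    rt->rt_next = rt_table_sketch[h].chain;
    rt_table_sketch[h].chain = rt;
}

/* Walk one chain and compare the full key; NULL means a cache miss. */
static struct rtable_sketch *rt_cache_lookup(uint32_t dst, uint32_t src, int iif)
{
    unsigned int h = rt_hash_sketch(dst, src, iif);
    struct rtable_sketch *rt;

    for (rt = rt_table_sketch[h].chain; rt != NULL; rt = rt->rt_next)
        if (rt->rt_dst == dst && rt->rt_src == src && rt->rt_iif == iif)
            return rt;
    return NULL;
}
```

On a miss, the caller would fall back to a full routing table lookup and insert the result, so that later packets of the same flow are served from the chain.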

 

4.26  dst_entry

The interface that decides where a packet goes; it describes key routing information such as whether the packet is kept or passed on, and its next hop.

/* Each dst_entry has reference count and sits in some parent list(s).

 * When it is removed from parent list, it is "freed" (dst_free).

 * After this it enters dead state (dst->obsolete > 0) and if its refcnt

 * is zero, it can be destroyed immediately, otherwise it is added

 * to gc list and garbage collector periodically checks the refcnt.

 */

struct dst_entry

{

    struct rcu_head      rcu_head;

    struct dst_entry  *child;

    struct net_device      *dev;

    short         error;

    short         obsolete;

    int        flags;

#define DST_HOST     1

#define DST_NOXFRM       2

#define DST_NOPOLICY     4

#define DST_NOHASH       8

    unsigned long     expires;

 

    unsigned short       header_len;  /* more space at head required */

    unsigned short       trailer_len; /* space to reserve at tail */

 

    unsigned int      rate_tokens;

    unsigned long     rate_last; /* rate limiting for ICMP */

 

    struct dst_entry  *path;

 

    struct neighbour  *neighbour;

    struct hh_cache     *hh;

#ifdef CONFIG_XFRM

    struct xfrm_state *xfrm;

#else

    void          *__pad1;

#endif

    int        (*input)(struct sk_buff*);

    int        (*output)(struct sk_buff*);

 

    struct dst_ops          *ops;

 

/* This Red Hat kABI workaround will shift tclassid 32 bit, while we

 * still keep the original size of dst_entry and assures alignment

 * (see further down).

 */

#ifdef __GENKSYMS__

    u32        metrics[RTAX_MAX_ORIG];

#else

    u32        metrics[RTAX_MAX];

#endif

 

#ifdef CONFIG_NET_CLS_ROUTE

    __u32         tclassid;

#else

    __u32         __pad2;

#endif

 

 

    /*

     * Align __refcnt to a 64 bytes alignment

     * (L1_CACHE_SIZE would be too much)

     */

/* Red Hat kABI workaround to assure aligning __refcnt, while

 * consuming 32 bit of padding for our metrics expansion above.

 * On 32bit archs not padding remains.

 */

#ifdef __GENKSYMS__

#ifdef CONFIG_64BIT

    long          __pad_to_align_refcnt[2];

#else

    long          __pad_to_align_refcnt[1];

#endif

#else  /* __GENKSYMS__ */

#ifdef CONFIG_64BIT

    u32        __pad_hole_in_struct;

    long          __pad_to_align_refcnt[1];

#endif

#endif /*__GENKSYMS__ */

 

    /*

     * __refcnt wants to be on a different cacheline from

     * input/output/ops or performance tanks badly

     */

    atomic_t      __refcnt; /* client references */

    int        __use;

    unsigned long     lastuse;

    union {

       struct dst_entry *next;

       struct rtable   *rt_next;

       struct rt6_info  *rt6_next;

       struct dn_route *dn_next;

    };

};
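The `input` and `output` function pointers are the part of `dst_entry` that the packet path actually calls: once a route has been attached to a packet, the IP layer invokes the handler through the pointer rather than deciding again where the packet goes. The following is a minimal user-space sketch of that dispatch. All `toy_*` names and the simplified structures are assumptions made for illustration; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for sk_buff and dst_entry (hypothetical, user space only). */
struct toy_skb;

struct toy_dst {
    int (*input)(struct toy_skb *);   /* handler for packets arriving via this route */
    int (*output)(struct toy_skb *);  /* handler for packets leaving via this route */
    int refcnt;                       /* stands in for the atomic __refcnt */
};

struct toy_skb {
    struct toy_dst *dst;              /* routing decision cached on the packet */
    int delivered_locally;
    int forwarded;
};

/* Two possible input handlers: deliver to this host, or forward. */
int toy_local_deliver(struct toy_skb *skb) { skb->delivered_locally = 1; return 0; }
int toy_forward(struct toy_skb *skb)       { skb->forwarded = 1; return 0; }
int toy_xmit(struct toy_skb *skb)          { (void)skb; return 0; }

/* Attach a route to a packet and take a reference on it. */
void toy_skb_set_dst(struct toy_skb *skb, struct toy_dst *dst)
{
    assert(skb != NULL && dst != NULL);
    dst->refcnt++;
    skb->dst = dst;
}

/* Receive path: call whichever input handler the route lookup selected. */
int toy_dst_input(struct toy_skb *skb)
{
    return skb->dst->input(skb);
}
```

In this sketch the caller of `toy_dst_input()` never checks whether the packet is local or forwarded; that decision was made once, when the route was chosen, and is carried by the function pointer.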

 

4.27  napi_struct

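The structure used for NAPI scheduling.

NAPI: NAPI is a technique Linux uses to make network processing more efficient. Its core idea is to avoid reading every packet from an interrupt: an interrupt first wakes the receive service, which then fetches the data by polling (poll). NAPI is well suited to handling short packets arriving at a high rate.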

/*

 * Structure for NAPI scheduling similar to tasklet but with weighting

 */

struct napi_struct {

    /* The poll_list must only be managed by the entity which

     * changes the state of the NAPI_STATE_SCHED bit.  This means

     * whoever atomically sets that bit can add this napi_struct

     * to the per-cpu poll_list, and whoever clears that bit

     * can remove from the list right before clearing the bit.

     */

    struct list_head  poll_list;

 

    unsigned long     state;

    int        weight;

    int        (*poll)(struct napi_struct *, int);

#ifdef CONFIG_NETPOLL

    spinlock_t    poll_lock;

    int        poll_owner;

#endif

 

    unsigned int      gro_count;

 

    struct net_device *dev;

    struct list_head  dev_list;

    struct sk_buff      *gro_list;

    struct sk_buff      *skb;

};
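The interaction between the interrupt, the `weight` field and the `poll` callback can be illustrated with a small user-space model. This is only a sketch under simplifying assumptions: the `toy_*` names are hypothetical, the receive ring is reduced to a counter, and the per-CPU poll list is reduced to a single context.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model of a NIC receive ring and its NAPI context. */
struct toy_napi {
    int weight;        /* maximum packets one poll call may process */
    int scheduled;     /* stands in for the NAPI_STATE_SCHED bit */
    int irq_enabled;   /* state of the device's receive interrupt */
    int ring_pending;  /* packets waiting in the device ring */
    int (*poll)(struct toy_napi *, int budget);
};

/* Interrupt handler: mask further receive interrupts and schedule polling. */
void toy_rx_interrupt(struct toy_napi *n)
{
    if (!n->scheduled) {
        n->irq_enabled = 0;
        n->scheduled = 1;
    }
}

/* Driver poll callback: process at most `budget` packets. */
int toy_poll(struct toy_napi *n, int budget)
{
    int done = 0;

    while (done < budget && n->ring_pending > 0) {
        n->ring_pending--;   /* "receive" one packet */
        done++;
    }
    if (done < budget) {     /* ring drained: stop polling, re-enable the interrupt */
        n->scheduled = 0;
        n->irq_enabled = 1;
    }
    return done;
}

/* Softirq-like loop: keep polling while the context stays scheduled. */
int toy_net_rx_action(struct toy_napi *n)
{
    int rounds = 0;

    assert(n != NULL && n->poll != NULL);
    while (n->scheduled) {
        n->poll(n, n->weight);
        rounds++;
    }
    return rounds;
}
```

With a weight of 16 and 40 pending packets, a single interrupt is followed by three poll rounds (16, 16 and 8 packets); the receive interrupt is re-enabled only after the ring is empty.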

5       Data structure class diagram

Figure 2  Data structures

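6       Protocol stack registration flow

6.1 Kernel boot flow

Once the kernel has finished decompressing itself, kernel startup begins. It starts in arch/mips/kernel/head.S, which initializes the BSS data area, the page tables and the registers (the interrupt descriptor table (IDT) and global descriptor table (GDT) that are also set up at this stage exist only on x86). This file defines the kernel entry function kernel_entry(), which is architecture-specific assembly code. It first initializes the kernel stack in preparation for creating the first process in the system, then uses a loop to zero the uninitialized data segment of the kernel image, and finally jumps to start_kernel(), which initializes the hardware-related code and completes the setup of the Linux core environment.

 

start_kernel() is defined in init/main.c, and the real kernel initialization begins here. start_kernel() calls a series of initialization functions that set up the various parts of the kernel itself, such as interrupts, memory management, process management, signals and the file system, with the goal of building a basically complete Linux core environment.

 

The main functions called from start_kernel() and their call relationships are as follows: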

 

start_kernel
    setup_arch
        cpu_probe
        prom_init
        cpu_report
        arch_mem_init
        resource_init
    sched_init
    init_IRQ
    proc_root_init
    mm_init
    console_init
    rest_init
        kernel_init
            do_basic_setup
                init_tmpfs
                driver_init
                do_initcalls
            init_post
        cpu_idle
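do_initcalls is the step that reaches the networking code: it walks a table of init function pointers and calls them in level order. In the kernel that table is assembled by the linker from macros such as core_initcall(), subsys_initcall() and fs_initcall(), which is how sock_init, net_dev_init and inet_init end up running in that order. The sketch below imitates the mechanism in user space with a hand-written table. Every `toy_*` name is hypothetical and the bodies only record the call order.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*toy_initcall_t)(void);

#define TOY_MAX_CALLS 8

int toy_order[TOY_MAX_CALLS];  /* records which init function ran, in order */
int toy_count;

/* Stand-ins for three networking init functions; each only logs its id. */
int toy_sock_init(void)    { toy_order[toy_count++] = 1; return 0; }
int toy_net_dev_init(void) { toy_order[toy_count++] = 2; return 0; }
int toy_inet_init(void)    { toy_order[toy_count++] = 3; return 0; }

/* In the kernel the linker builds this table; here it is written by hand. */
toy_initcall_t toy_initcalls[] = {
    toy_sock_init,     /* core level */
    toy_net_dev_init,  /* subsystem level */
    toy_inet_init,     /* later level */
};

/* Call every registered init function in table order; return the failure count. */
int toy_do_initcalls(void)
{
    size_t n = sizeof(toy_initcalls) / sizeof(toy_initcalls[0]);
    int failures = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        assert(toy_initcalls[i] != NULL);
        if (toy_initcalls[i]() != 0)
            failures++;
    }
    return failures;
}
```

The ordering matters because later levels depend on earlier ones: the socket layer and the device layer must already exist by the time the inet protocol family registers itself with them.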

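6.2 Protocol stack initialization flow

sock_init:  Initializes the sk_buff SLAB cache and registers the socket file system.

net_inuse_init:         Allocates per-CPU storage (used for the protocol in-use counters).

proto_init:        Creates the protocols file under /proc/net and registers the related file operations.

net_dev_init:   Sets up the netdevice data structures under /proc and /sys, and registers the network transmit and receive softirqs (NET_TX_SOFTIRQ and NET_RX_SOFTIRQ).

It also initializes a packet receive queue (softnet_data) for each CPU together with the packet receive callback, registers the loopback device operations, and registers the default network device operations. (Driver layer)

inet_init:   Registers the socket creation method of the inet protocol family and the basic receive handlers for tcp, udp, icmp and igmp. Creates the proc files for the IPv4 protocol family.

This function is the main registration function of the protocol stack:

1.        rc = proto_register(&udp_prot, 1); Registers the UDP protocol with the inet layer and allocates a slab cache for it.

2.        (void)sock_register(&inet_family_ops); Registers the operation set of the inet protocol family (mainly the creation of inet sockets) in static const struct net_proto_family *net_families[NPROTO];. (INET socket layer)

3.        inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0,  Registers the operation set of the transport-layer UDP protocol in extern const struct net_protocol *inet_protos[MAX_INET_PROTOS];. (Network layer)

4.        static struct list_head inetsw[SOCK_MAX];    for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r)    INIT_LIST_HEAD(r);   Initializes the socket type array. This is an array of lists: each element is a list that links the protocols, and their operation sets, that use the same socket type.

5.        for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q)

a)        inet_register_protosw(q);

Registers each protocol's operation set with the sock layer. (BSD socket layer to INET socket layer)

6.        arp_init(); Enables ARP protocol support.

7.        ip_init(); Enables IP protocol support.

8.        udp_init(); Enables UDP protocol support.

9.        dev_add_pack(&ip_packet_type);  Registers the operation set of the IP protocol in ptype_base[PTYPE_HASH_SIZE];. (Protocol-independent layer)

10.    System call layer:  the system call interfaces provided in socket.c.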
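Most of the steps above follow the same pattern: a layer owns a table, an upper-layer protocol registers a structure of callbacks in it, and at receive time the lower layer looks the handler up by a key taken from the packet header. The following user-space sketch shows that pattern for a table keyed by IP protocol number, in the spirit of inet_protos. The `toy_*` names, the structure layout and the return values are assumptions made for illustration, not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PROTOS   256
#define TOY_IPPROTO_ICMP 1
#define TOY_IPPROTO_UDP  17

/* Callback set that a transport protocol hands to the network layer. */
struct toy_net_protocol {
    int (*handler)(const unsigned char *payload, size_t len);
};

/* Registration table, indexed by the protocol field of the IP header. */
const struct toy_net_protocol *toy_inet_protos[TOY_MAX_PROTOS];

/* Register a protocol; returns 0 on success, -1 if the slot is already taken. */
int toy_inet_add_protocol(const struct toy_net_protocol *prot, unsigned char num)
{
    assert(prot != NULL && prot->handler != NULL);
    if (toy_inet_protos[num] != NULL)
        return -1;
    toy_inet_protos[num] = prot;
    return 0;
}

/* Toy UDP receive handler: reports how many payload bytes it was given. */
int toy_udp_rcv(const unsigned char *payload, size_t len)
{
    (void)payload;
    return (int)len;
}

const struct toy_net_protocol toy_udp_protocol = { toy_udp_rcv };

/* Network-layer delivery: dispatch on the protocol number from the IP header. */
int toy_ip_local_deliver(unsigned char protocol,
                         const unsigned char *payload, size_t len)
{
    const struct toy_net_protocol *p = toy_inet_protos[protocol];

    if (p == NULL)
        return -1;   /* no transport protocol registered for this number */
    return p->handler(payload, len);
}
```

The network layer in this sketch knows nothing about UDP itself; it only knows the table. The same decoupling applies to net_families (keyed by address family), inetsw (keyed by socket type) and ptype_base (keyed by link-layer protocol type).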

 

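7      Socket creation flow

         This chapter describes the socket creation flow: how the parameters of fd = socket(family, type, protocol); are passed down, and how the data structures are organized in memory once the socket has been created.

 

Figure 3  socket creation flow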
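From user space the whole creation flow is triggered by a single call. The sketch below creates a UDP socket and asks the kernel which type it recorded for it. The helper name `udp_socket_type` is made up for this example; the system calls are the standard POSIX socket API.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a UDP socket and return the socket type the kernel stored for it,
 * or -1 on failure.
 */
int udp_socket_type(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    int type = -1;
    socklen_t len = sizeof(type);

    if (fd < 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) < 0)
        type = -1;
    else
        assert(len == sizeof(type));
    close(fd);
    return type;
}
```

The three arguments map onto the registration tables from the previous chapter: family (AF_INET) selects the entry in net_families, while type (SOCK_DGRAM) and protocol (IPPROTO_UDP) select the matching entry in inetsw.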

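8      Protocol stack receive flow

Using the UDP protocol as an example.

Figure: send and receive flow

 

8.1 Driver receive flow

Using the UDP protocol as an example.

 

Figure: kernel receive flow

8.2 Application-layer receive flow

Using the UDP protocol as an example.

 

Figure: application-layer receive flow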

9      Protocol stack send flow

Using the UDP protocol as an example.

 

Figure: UDP send flow
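To watch both directions from user space, a program can send a UDP datagram to itself over the loopback interface: sendto() enters the send flow and recvfrom() enters the application-layer receive flow. The following is a minimal sketch that assumes a loopback interface is available. The helper name `udp_loopback_echo` and the one-second receive timeout are choices made for this example.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Send `msg` to ourselves over loopback UDP and read it back into `out`.
 * Returns the number of bytes received, or -1 on error.
 */
long udp_loopback_echo(const char *msg, char *out, size_t outlen)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    struct timeval tv = { 1, 0 };   /* do not block forever if nothing arrives */
    long n = -1;
    int rx, tx;

    assert(msg != NULL && out != NULL);
    rx = socket(AF_INET, SOCK_DGRAM, 0);
    tx = socket(AF_INET, SOCK_DGRAM, 0);
    if (rx < 0 || tx < 0)
        goto done;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;              /* let the kernel pick a free port */
    if (bind(rx, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto done;
    if (getsockname(rx, (struct sockaddr *)&addr, &alen) < 0)
        goto done;
    if (setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
        goto done;

    /* Send path: the datagram goes down the stack and out via loopback. */
    if (sendto(tx, msg, strlen(msg), 0,
               (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto done;

    /* Receive path: the datagram is taken from the socket's receive queue. */
    n = (long)recvfrom(rx, out, outlen, 0, NULL, NULL);

done:
    if (rx >= 0)
        close(rx);
    if (tx >= 0)
        close(tx);
    return n;
}
```

Because the destination is the loopback address, no physical network card is involved; the datagram still passes through the socket, UDP and IP layers in both directions.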

 

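10            Summary

This document gives only a rough analysis of the protocol stack flows; many of the technical ideas involved cannot be conveyed here. For a deeper understanding, first consult the protocol stack articles by the CSDN blogger yming0221 at http://blog.csdn.net/column/details/linux-kernel-net.html, or read the Linux kernel protocol stack source code directly.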

11            References

1.       TCP/IP Illustrated, Volume 1

2.       Blog: http://blog.csdn.net/column/details/linux-kernel-net.html


Figure 1    Initialization flow





Figure 2   Layered data structures



Figure 3  socket creation flow



Figure 4  Send and receive flow



Figure 5  Kernel receive flow in detail (interrupt-driven receive)




Figure 6  Application-layer receive flow



Figure 7  UDP send flow



