http://blog.chinaunix.net/uid-24148050-id-464587.html
http://blog.csdn.net/zhangskd/article/details/21469399
http://blog.csdn.net/weixiuc/article/details/2955565
http://blog.csdn.net/zhangskd/article/details/21627963 ---- NAPI
http://blog.csdn.net/qy532846454/article/details/6993695 ------- L4 processing
http://blog.chinaunix.net/uid-15014334-id-4411101.html --- prose overview
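NAPI is Linux's newer API for processing NIC data. Reportedly nobody could find a better name, so it is simply called NAPI (New API). It was introduced during the 2.5 development series.
In short, NAPI combines interrupt-driven and polled reception.
Interrupts respond promptly, and at low data rates they cost little CPU time. The drawback is that at high data rates they fire too often, and because every interrupt costs a fair amount of CPU time, efficiency ends up worse than polling. Polling is the opposite: it suits heavy traffic because each poll costs little CPU time, but it keeps consuming CPU time even when little or no data arrives.
NAPI combines the two: interrupts at low data rates, polling at high data rates. Normally the device runs in interrupt mode. When data arrives, the interrupt handler runs, disables the device's receive interrupt and schedules polling; the poll itself runs in the receive softirq (NET_RX_SOFTIRQ). Data that arrives in the meantime does not need to raise another interrupt, because the poll loop will pick it up. Interrupts are re-enabled only once no new data remains.
Clearly, at very low and very high data rates NAPI gets the benefit of interrupts and of polling respectively, and performs well. If the rate is unstable and sits somewhere in between, NAPI spends noticeable time switching between the two modes and is somewhat less efficient.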
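The interrupt/poll switch described above can be sketched as a small user-space simulation. This is only an illustration under stated assumptions: `struct toy_nic`, `toy_packet_arrives` and `toy_poll` are hypothetical names, not kernel APIs, and a real driver would mask the interrupt in hardware and call the kernel's NAPI helpers instead.

```c
#include <stdbool.h>

/* Hypothetical toy device; none of these names exist in the kernel. */
struct toy_nic {
    int  rx_pending;     /* packets waiting in the device ring */
    bool irq_enabled;    /* is the receive interrupt unmasked? */
    bool poll_scheduled; /* is the device waiting to be polled? */
    int  irq_count;      /* interrupts actually taken */
};

/* A packet arrives: an interrupt fires only while interrupts are enabled. */
static void toy_packet_arrives(struct toy_nic *nic)
{
    nic->rx_pending++;
    if (nic->irq_enabled) {
        nic->irq_count++;
        nic->irq_enabled = false;   /* the handler masks further interrupts */
        nic->poll_scheduled = true; /* and defers the real work to polling */
    }
}

/* One poll pass: handle at most `budget` packets and return how many were
 * handled. Interrupts are re-enabled only when the ring is empty. */
static int toy_poll(struct toy_nic *nic, int budget)
{
    int done = 0;

    while (done < budget && nic->rx_pending > 0) {
        nic->rx_pending--;
        done++;
    }
    if (nic->rx_pending == 0) {
        nic->poll_scheduled = false;
        nic->irq_enabled = true;
    }
    return done;
}
```

With ten back-to-back arrivals only the first one raises an interrupt; the remaining nine are picked up by successive `toy_poll` calls, which is the saving NAPI aims for under load.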
At boot, Linux registers the following handlers.
In net_dev_init():
for_each_possible_cpu(i) {
	struct softnet_data *sd = &per_cpu(softnet_data, i);
	sd->backlog.poll = process_backlog; /* called from the softirq to process packets */
}
open_softirq(NET_TX_SOFTIRQ, net_tx_action);
open_softirq(NET_RX_SOFTIRQ, net_rx_action); /* registers the receive softirq */
In inet_init():
static struct packet_type ip_packet_type __read_mostly = {
	.type = cpu_to_be16(ETH_P_IP),
	.func = ip_rcv,
};
dev_add_pack(&ip_packet_type); /* registers the handler for ETH_P_IP frames */
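To make the registration above concrete, here is a minimal user-space sketch of a protocol-handler table keyed by EtherType. All `toy_` names are hypothetical stand-ins; the kernel keeps its handlers in hashed lists and stores the type in network byte order, which this sketch skips for brevity.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_ETH_P_IP   0x0800
#define TOY_ETH_P_ARP  0x0806
#define TOY_MAX_PTYPES 8

typedef int (*toy_rx_fn)(const uint8_t *payload, size_t len);

struct toy_packet_type {
    uint16_t  type; /* EtherType, host byte order in this sketch */
    toy_rx_fn func; /* handler called for matching frames */
};

static struct toy_packet_type toy_ptypes[TOY_MAX_PTYPES];
static size_t toy_ptype_count;
static int toy_ip_rcv_calls; /* lets a caller observe that the handler ran */

/* Stand-in for ip_rcv: only counts invocations. */
static int toy_ip_rcv(const uint8_t *payload, size_t len)
{
    (void)payload;
    (void)len;
    toy_ip_rcv_calls++;
    return 0;
}

/* Register a handler; returns 0 on success, -1 if the table is full. */
static int toy_add_pack(uint16_t type, toy_rx_fn func)
{
    if (toy_ptype_count == TOY_MAX_PTYPES)
        return -1;
    toy_ptypes[toy_ptype_count].type = type;
    toy_ptypes[toy_ptype_count].func = func;
    toy_ptype_count++;
    return 0;
}

/* Call every handler registered for `type`; returns how many were called. */
static int toy_deliver(uint16_t type, const uint8_t *payload, size_t len)
{
    int delivered = 0;
    size_t i;

    for (i = 0; i < toy_ptype_count; i++) {
        if (toy_ptypes[i].type == type) {
            toy_ptypes[i].func(payload, len);
            delivered++;
        }
    }
    return delivered;
}
```

After `toy_add_pack(TOY_ETH_P_IP, toy_ip_rcv)`, a frame with type `TOY_ETH_P_IP` reaches the handler, while a `TOY_ETH_P_ARP` frame finds no handler and is delivered to nobody.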
Full list of protocol IDs (from if_ether.h):
#define ETH_P_LOOP 0x0060/* Ethernet Loopback packet*/
#define ETH_P_PUP 0x0200/* Xerox PUP packet*/
#define ETH_P_PUPAT 0x0201/* Xerox PUP Addr Trans packet*/
#define ETH_P_IP 0x0800/* Internet Protocol packet*/
#define ETH_P_X25 0x0805/* CCITT X.25*/
#define ETH_P_ARP 0x0806/* Address Resolution packet*/
#define ETH_P_BPQ 0x08FF /* G8BPQ AX.25 Ethernet Packet [ NOT AN OFFICIALLY REGISTERED ID ] */
#define ETH_P_IEEEPUP 0x0a00/* Xerox IEEE802.3 PUP packet */
#define ETH_P_IEEEPUPAT 0x0a01/* Xerox IEEE802.3 PUP Addr Trans packet */
#define ETH_P_BATMAN 0x4305/* B.A.T.M.A.N.-Advanced packet [ NOT AN OFFICIALLY REGISTERED ID ] */
#define ETH_P_DEC 0x6000 /* DEC Assigned proto */
#define ETH_P_DNA_DL 0x6001 /* DEC DNA Dump/Load */
#define ETH_P_DNA_RC 0x6002 /* DEC DNA Remote Console */
#define ETH_P_DNA_RT 0x6003 /* DEC DNA Routing */
#define ETH_P_LAT 0x6004 /* DEC LAT */
#define ETH_P_DIAG 0x6005 /* DEC Diagnostics */
#define ETH_P_CUST 0x6006 /* DEC Customer use */
#define ETH_P_SCA 0x6007 /* DEC Systems Comms Arch */
#define ETH_P_TEB 0x6558/* Trans Ether Bridging*/
#define ETH_P_RARP 0x8035 /* Reverse Addr Res packet*/
#define ETH_P_ATALK 0x809B/* Appletalk DDP*/
#define ETH_P_AARP 0x80F3/* Appletalk AARP*/
#define ETH_P_8021Q 0x8100 /* 802.1Q VLAN Extended Header */
#define ETH_P_IPX 0x8137/* IPX over DIX*/
#define ETH_P_IPV6 0x86DD/* IPv6 over bluebook*/
#define ETH_P_PAUSE 0x8808/* IEEE Pause frames. See 802.3 31B */
#define ETH_P_SLOW 0x8809/* Slow Protocol. See 802.3ad 43B */
#define ETH_P_WCCP 0x883E/* Web-cache coordination protocol
* defined in draft-wilson-wrec-wccp-v2-00.txt */
#define ETH_P_PPP_DISC 0x8863/* PPPoE discovery messages */
#define ETH_P_PPP_SES 0x8864/* PPPoE session messages*/
#define ETH_P_MPLS_UC 0x8847/* MPLS Unicast traffic*/
#define ETH_P_MPLS_MC 0x8848/* MPLS Multicast traffic*/
#define ETH_P_ATMMPOA 0x884c/* MultiProtocol Over ATM*/
#define ETH_P_LINK_CTL 0x886c/* HPNA, wlan link local tunnel */
#define ETH_P_ATMFATE 0x8884/* Frame-based ATM Transport
* over Ethernet
*/
#define ETH_P_PAE 0x888E/* Port Access Entity (IEEE 802.1X) */
#define ETH_P_AOE 0x88A2/* ATA over Ethernet*/
#define ETH_P_8021AD 0x88A8 /* 802.1ad Service VLAN*/
#define ETH_P_802_EX1 0x88B5/* 802.1 Local Experimental 1. */
#define ETH_P_TIPC 0x88CA/* TIPC*/
#define ETH_P_8021AH 0x88E7 /* 802.1ah Backbone Service Tag */
#define ETH_P_MVRP 0x88F5 /* 802.1Q MVRP */
#define ETH_P_1588 0x88F7/* IEEE 1588 Timesync */
#define ETH_P_FCOE 0x8906/* Fibre Channel over Ethernet */
#define ETH_P_TDLS 0x890D /* TDLS */
#define ETH_P_FIP 0x8914/* FCoE Initialization Protocol */
#define ETH_P_QINQ1 0x9100/* deprecated QinQ VLAN [ NOT AN OFFICIALLY REGISTERED ID ] */
#define ETH_P_QINQ2 0x9200/* deprecated QinQ VLAN [ NOT AN OFFICIALLY REGISTERED ID ] */
#define ETH_P_QINQ3 0x9300/* deprecated QinQ VLAN [ NOT AN OFFICIALLY REGISTERED ID ] */
#define ETH_P_EDSA 0xDADA/* Ethertype DSA [ NOT AN OFFICIALLY REGISTERED ID ] */
#define ETH_P_AF_IUCV 0xFBFB /* IBM af_iucv [ NOT AN OFFICIALLY REGISTERED ID ] */
#define ETH_P_802_3_MIN 0x0600 /* If the value in the ethernet type field is greater than or equal to this value
 * then the frame is Ethernet II. Else it is 802.3 */
/*
* Non DIX types. Won't clash for 1500 types.
*/
#define ETH_P_802_3 0x0001/* Dummy type for 802.3 frames */
#define ETH_P_AX25 0x0002/* Dummy protocol id for AX.25 */
#define ETH_P_ALL 0x0003/* Every packet (be careful!!!) */
#define ETH_P_802_2 0x0004/* 802.2 frames*/
#define ETH_P_SNAP 0x0005/* Internal only*/
#define ETH_P_DDCMP 0x0006 /* DEC DDCMP: Internal only */
#define ETH_P_WAN_PPP 0x0007 /* Dummy type for WAN PPP frames*/
#define ETH_P_PPP_MP 0x0008 /* Dummy type for PPP MP frames */
#define ETH_P_LOCALTALK 0x0009 /* Localtalk pseudo type*/
#define ETH_P_CAN 0x000C/* CAN: Controller Area Network */
#define ETH_P_CANFD 0x000D/* CANFD: CAN flexible data rate*/
#define ETH_P_PPPTALK 0x0010/* Dummy type for Atalk over PPP*/
#define ETH_P_TR_802_2 0x0011/* 802.2 frames*/
#define ETH_P_MOBITEX 0x0015 /* Mobitex */
#define ETH_P_CONTROL 0x0016/* Card specific control frames */
#define ETH_P_IRDA 0x0017/* Linux-IrDA*/
#define ETH_P_ECONET 0x0018/* Acorn Econet*/
#define ETH_P_HDLC 0x0019/* HDLC frames*/
#define ETH_P_ARCNET 0x001A/* 1A for ArcNet :-) */
#define ETH_P_DSA 0x001B/* Distributed Switch Arch.*/
#define ETH_P_TRAILER 0x001C/* Trailer switch tagging*/
#define ETH_P_PHONET 0x00F5/* Nokia Phonet frames */
#define ETH_P_IEEE802154 0x00F6 /* IEEE802.15.4 frame*/
#define ETH_P_CAIF 0x00F7/* ST-Ericsson CAIF protocol*/
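The role of ETH_P_802_3_MIN in the list above can be shown with a small classifier for the 16-bit type/length field of an Ethernet header. This is a sketch with hypothetical `toy_` names; the constants are redefined locally so the example compiles without kernel headers.

```c
#include <stdint.h>

#define TOY_ETH_P_802_3_MIN 0x0600 /* same value as ETH_P_802_3_MIN */
#define TOY_ETH_DATA_LEN    1500   /* largest standard 802.3 payload length */

enum toy_frame_kind {
    TOY_FRAME_8023_LENGTH, /* field is a payload length (802.3) */
    TOY_FRAME_ETHERNET_II, /* field is an EtherType such as 0x0800 */
    TOY_FRAME_UNDEFINED    /* 1501..1535: neither a length nor a type */
};

/* Classify the type/length field, given in host byte order. */
static enum toy_frame_kind toy_classify(uint16_t type_or_len)
{
    if (type_or_len >= TOY_ETH_P_802_3_MIN)
        return TOY_FRAME_ETHERNET_II;
    if (type_or_len <= TOY_ETH_DATA_LEN)
        return TOY_FRAME_8023_LENGTH;
    return TOY_FRAME_UNDEFINED;
}

/* Read the big-endian field at byte offset 12 of an Ethernet header. */
static uint16_t toy_read_type(const uint8_t *frame)
{
    return (uint16_t)((frame[12] << 8) | frame[13]);
}
```

The low-numbered "non DIX" IDs such as ETH_P_802_3 or ETH_P_ALL never appear on the wire as EtherTypes; they fall in the length range, which is why they cannot clash with real types.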
Also in inet_init():
if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0)
pr_crit("%s: Cannot add ICMP protocol\n", __func__);
if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0)
pr_crit("%s: Cannot add UDP protocol\n", __func__);
if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
pr_crit("%s: Cannot add TCP protocol\n", __func__);
#ifdef CONFIG_IP_MULTICAST
if (inet_add_protocol(&igmp_protocol, IPPROTO_IGMP) < 0)
pr_crit("%s: Cannot add IGMP protocol\n", __func__);
#endif
enum {
IPPROTO_IP = 0, /* Dummy protocol for TCP*/
IPPROTO_ICMP = 1, /* Internet Control Message Protocol*/
IPPROTO_IGMP = 2, /* Internet Group Management Protocol*/
IPPROTO_IPIP = 4, /* IPIP tunnels (older KA9Q tunnels use 94) */
IPPROTO_TCP = 6, /* Transmission Control Protocol*/
IPPROTO_EGP = 8, /* Exterior Gateway Protocol*/
IPPROTO_PUP = 12, /* PUP protocol*/
IPPROTO_UDP = 17, /* User Datagram Protocol*/
IPPROTO_IDP = 22, /* XNS IDP protocol*/
IPPROTO_DCCP = 33, /* Datagram Congestion Control Protocol */
IPPROTO_RSVP = 46, /* RSVP protocol*/
IPPROTO_GRE = 47, /* Cisco GRE tunnels (rfc 1701,1702)*/
IPPROTO_IPV6 = 41,/* IPv6-in-IPv4 tunnelling*/
IPPROTO_ESP = 50, /* Encapsulation Security Payload protocol */
IPPROTO_AH = 51, /* Authentication Header protocol */
IPPROTO_BEETPH = 94, /* IP option pseudo header for BEET */
IPPROTO_PIM = 103, /* Protocol Independent Multicast*/
IPPROTO_COMP = 108, /* Compression Header protocol */
IPPROTO_SCTP = 132, /* Stream Control Transport Protocol*/
IPPROTO_UDPLITE = 136, /* UDP-Lite (RFC 3828)*/
IPPROTO_RAW = 255,/* Raw IP packets*/
IPPROTO_MAX
};
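The inet_add_protocol() calls and the IPPROTO_* numbers above amount to a table indexed by the IP header's protocol field. A minimal sketch, assuming a plain 256-entry array and hypothetical `toy_` names (the kernel's own table is updated atomically and protected by RCU, which is omitted here):

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_INET_PROTOS 256

typedef int (*toy_l4_handler)(const uint8_t *segment, size_t len);

static toy_l4_handler toy_inet_protos[TOY_MAX_INET_PROTOS];

/* Stand-ins for udp_rcv and tcp_v4_rcv: return their protocol number. */
static int toy_udp_handler(const uint8_t *segment, size_t len)
{
    (void)segment;
    (void)len;
    return 17;
}

static int toy_tcp_handler(const uint8_t *segment, size_t len)
{
    (void)segment;
    (void)len;
    return 6;
}

/* Claim a slot; returns 0 on success, -1 if the slot is already taken. */
static int toy_inet_add_protocol(toy_l4_handler handler, uint8_t protocol)
{
    if (toy_inet_protos[protocol] != NULL)
        return -1;
    toy_inet_protos[protocol] = handler;
    return 0;
}

/* Dispatch on the protocol field; returns -1 when no handler is registered. */
static int toy_ip_deliver(uint8_t protocol, const uint8_t *segment, size_t len)
{
    toy_l4_handler handler = toy_inet_protos[protocol];

    if (handler == NULL)
        return -1;
    return handler(segment, len);
}
```

Registering the same protocol number twice fails, which mirrors the `< 0` checks around inet_add_protocol() in inet_init().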
struct sk_buff { /* abridged; the two annotated fields at the bottom are shown out of their real position */
	struct sock *sk; /* owning socket */
	struct net_device *dev; /* owning device */
	unsigned int len,
		data_len;
	__u16 mac_len,
		hdr_len;
	__be16 inner_protocol;
	__u16 inner_transport_header;
	__u16 inner_network_header;
	__u16 inner_mac_header;
	__u16 transport_header;
	__u16 network_header;
	__u16 mac_header;
	/* These elements must be at the end, see alloc_skb() for details. */
	sk_buff_data_t tail;
	sk_buff_data_t end;
	unsigned char *head,
		*data;
	unsigned int truesize;
	atomic_t users;
	__u8 pkt_type:3; /* packet class: PACKET_HOST, PACKET_BROADCAST, PACKET_MULTICAST, etc. */
	__be16 protocol; /* protocol ID, e.g. ETH_P_802_3 */
};
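How head, data, tail and end relate is easiest to see with a toy buffer. The sketch below uses plain pointers and hypothetical `toy_skb_*` helpers that imitate the behaviour of skb_reserve, skb_put, skb_push and skb_pull; it is an assumption-laden model and not the kernel implementation (there tail and end may be stored as offsets, and overruns trigger a panic rather than a NULL return).

```c
#include <stddef.h>

struct toy_skb {
    unsigned char *head; /* start of the allocated buffer */
    unsigned char *data; /* start of the current header/payload */
    unsigned char *tail; /* end of the valid data */
    unsigned char *end;  /* end of the allocated buffer */
    unsigned int   len;  /* tail - data */
};

static void toy_skb_init(struct toy_skb *skb, unsigned char *buf, size_t size)
{
    skb->head = skb->data = skb->tail = buf;
    skb->end = buf + size;
    skb->len = 0;
}

/* Leave `n` bytes of headroom; only meaningful on an empty buffer. */
static void toy_skb_reserve(struct toy_skb *skb, size_t n)
{
    skb->data += n;
    skb->tail += n;
}

/* Append `n` bytes at the tail; returns the old tail or NULL if no room. */
static unsigned char *toy_skb_put(struct toy_skb *skb, size_t n)
{
    unsigned char *old_tail = skb->tail;

    if ((size_t)(skb->end - skb->tail) < n)
        return NULL;
    skb->tail += n;
    skb->len += (unsigned int)n;
    return old_tail;
}

/* Prepend `n` bytes (add a header); returns new data or NULL if no headroom. */
static unsigned char *toy_skb_push(struct toy_skb *skb, size_t n)
{
    if ((size_t)(skb->data - skb->head) < n)
        return NULL;
    skb->data -= n;
    skb->len += (unsigned int)n;
    return skb->data;
}

/* Strip `n` bytes from the front (consume a header); NULL if too short. */
static unsigned char *toy_skb_pull(struct toy_skb *skb, size_t n)
{
    if (skb->len < n)
        return NULL;
    skb->data += n;
    skb->len -= (unsigned int)n;
    return skb->data;
}
```

On receive, each layer pulls its own header so that data points at the next layer; on transmit, each layer pushes its header into the headroom reserved up front.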
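Non-NAPI case:
Top-half flow:
The driver calls netif_rx(), which runs ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail); to put the packet on the backlog queue.
The driver normally calls eth_type_trans() first, which must perform the following (this part has caused bugs before!):
skb_reset_mac_header(skb);
skb_pull_inline(skb, ETH_HLEN); /* moves data to point at the L3 header; the function also determines the packet type, etc. */
When copying a packet, consider skb_copy_expand(). To attach the skb to a socket, use skb_set_owner_w().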
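The two eth_type_trans() steps noted above (remember where the MAC header starts, then advance data past it) are sketched below together with the packet-type decision. This is a simplified user-space model with hypothetical `toy_` names; it ignores 802.3 length frames and other special cases that the real function handles.

```c
#include <stdint.h>
#include <string.h>

#define TOY_ETH_ALEN 6  /* bytes in a MAC address */
#define TOY_ETH_HLEN 14 /* destination + source + type */

enum toy_pkt_type {
    TOY_PACKET_HOST,
    TOY_PACKET_BROADCAST,
    TOY_PACKET_MULTICAST,
    TOY_PACKET_OTHERHOST
};

struct toy_frame {
    uint8_t *head;        /* start of the buffer */
    uint8_t *data;        /* current header */
    unsigned int len;     /* bytes from data to the end of the frame */
    uint16_t mac_header;  /* offset of the MAC header from head */
    enum toy_pkt_type pkt_type;
};

/* Returns the EtherType in host byte order, or 0 if the frame is too short. */
static uint16_t toy_eth_type_trans(struct toy_frame *f,
                                   const uint8_t dev_addr[TOY_ETH_ALEN])
{
    static const uint8_t bcast[TOY_ETH_ALEN] =
        { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
    const uint8_t *eth = f->data;

    if (f->len < TOY_ETH_HLEN)
        return 0;

    f->mac_header = (uint16_t)(f->data - f->head); /* like skb_reset_mac_header */
    f->data += TOY_ETH_HLEN;                       /* like skb_pull_inline */
    f->len -= TOY_ETH_HLEN;

    if (memcmp(eth, bcast, TOY_ETH_ALEN) == 0)
        f->pkt_type = TOY_PACKET_BROADCAST;
    else if (eth[0] & 0x01)                        /* group bit set */
        f->pkt_type = TOY_PACKET_MULTICAST;
    else if (memcmp(eth, dev_addr, TOY_ETH_ALEN) != 0)
        f->pkt_type = TOY_PACKET_OTHERHOST;
    else
        f->pkt_type = TOY_PACKET_HOST;

    return (uint16_t)((eth[12] << 8) | eth[13]);
}
```

Forgetting either step leaves later layers reading the Ethernet header as if it were the IP header, which is presumably the kind of bug the note above warns about.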
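Bottom-half flow (see also http://blog.csdn.net/weixiuc/article/details/2955569):
Without NAPI, the bottom-half receive path is:
net_rx_action // softirq
|--> process_backlog() // default poll
|--> __netif_receive_skb (calls skb_reset_network_header) // L2 handling. L2 information may still be inspected here, e.g. by br_handle_frame, although the sk_buff's data already points at the L3 header. Walks all L3 entry functions
|--> ip_rcv() // L3 entry; runs the Netfilter hook functions internally
|---> tcp_v4_rcv etc. // L4 entry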
net_rx_action invokes the poll function via work = n->poll(n, weight);
If the NIC driver does not support NAPI, the default napi_struct->poll() function is process_backlog(). process_backlog calls __netif_receive_skb, which calls __netif_receive_skb_core.
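The split between the top half (enqueue) and the bottom half (process_backlog as the default poll) can be modelled with a bounded queue. This is a sketch under assumptions: the `toy_` names are hypothetical, packets are plain integers, and the capacity of 4 merely stands in for the kernel's configurable backlog limit.

```c
#include <stddef.h>

#define TOY_MAX_BACKLOG 4 /* stand-in for the backlog length limit */

struct toy_backlog {
    int pkts[TOY_MAX_BACKLOG]; /* packet ids instead of sk_buff pointers */
    size_t head;               /* index of the oldest packet */
    size_t count;              /* packets currently queued */
    unsigned long dropped;     /* packets refused because the queue was full */
};

static int toy_rx_sum; /* lets a caller observe which packets were delivered */

/* Stand-in for __netif_receive_skb: just records the packet id. */
static void toy_receive(int pkt)
{
    toy_rx_sum += pkt;
}

/* Top half: queue the packet, or drop it when the backlog is full. */
static int toy_enqueue_to_backlog(struct toy_backlog *q, int pkt)
{
    if (q->count == TOY_MAX_BACKLOG) {
        q->dropped++;
        return -1;
    }
    q->pkts[(q->head + q->count) % TOY_MAX_BACKLOG] = pkt;
    q->count++;
    return 0;
}

/* Bottom half: deliver at most `quota` packets; returns the work done. */
static int toy_process_backlog(struct toy_backlog *q, int quota,
                               void (*rx)(int pkt))
{
    int work = 0;

    while (work < quota && q->count > 0) {
        int pkt = q->pkts[q->head];

        q->head = (q->head + 1) % TOY_MAX_BACKLOG;
        q->count--;
        rx(pkt);
        work++;
    }
    return work;
}
```

The quota plays the role of the `weight` argument passed to n->poll(): it bounds how long one device may hold the softirq before others get a turn.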
If the interface is a bridge port, i.e. br_add_if has run err = netdev_rx_handler_register(dev, br_handle_frame, p);, then dev->rx_handler is br_handle_frame. Packet processing then goes through __netif_receive_skb_core and br_handle_frame in turn, applying first the bridge-layer and then the IP-layer filter functions.
This is the ebtables functionality (http://ebtables.sourceforge.net/misc/ebtables-man.html http://www.cnblogs.com/peteryj/archive/2011/07/24/2115602.html).
The L3 handler is selected by protocol type and invoked via ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);, e.g. ip_rcv.
ip_rcv calls NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL, ip_rcv_finish);, which runs the Netfilter hooks. These hooks are all registered through nf_register_hooks or nf_register_hook, for example from br_netfilter_init. Netfilter mainly relies on four techniques: connection tracking, packet filtering, network address translation (NAT) and packet mangling. This is the iptables functionality; ebtables, arptables and the like are also implemented on top of Netfilter.
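The idea behind NF_HOOK, running every registered hook and calling the continuation only if none of them drops the packet, can be sketched as follows. The `toy_` names, the integer packets and the -1 return on drop are assumptions made for illustration; real hooks take an sk_buff and richer state, and support more verdicts than accept and drop.

```c
#include <stddef.h>

enum toy_verdict { TOY_NF_DROP = 0, TOY_NF_ACCEPT = 1 };

typedef enum toy_verdict (*toy_hookfn)(int pkt);

struct toy_hook {
    toy_hookfn fn;
    int priority; /* lower values run first */
};

#define TOY_MAX_HOOKS 8
static struct toy_hook toy_hooks[TOY_MAX_HOOKS];
static size_t toy_hook_count;

/* Example hook: drop packets with a negative id. */
static enum toy_verdict toy_drop_negative(int pkt)
{
    return pkt < 0 ? TOY_NF_DROP : TOY_NF_ACCEPT;
}

/* Example continuation, standing in for ip_rcv_finish. */
static int toy_finish(int pkt)
{
    return pkt;
}

/* Insert a hook, keeping the table sorted by ascending priority. */
static int toy_register_hook(toy_hookfn fn, int priority)
{
    size_t i;

    if (toy_hook_count == TOY_MAX_HOOKS)
        return -1;
    i = toy_hook_count;
    while (i > 0 && toy_hooks[i - 1].priority > priority) {
        toy_hooks[i] = toy_hooks[i - 1];
        i--;
    }
    toy_hooks[i].fn = fn;
    toy_hooks[i].priority = priority;
    toy_hook_count++;
    return 0;
}

/* Run all hooks in order; call okfn only if every hook accepts.
 * Returns okfn's result, or -1 if some hook dropped the packet. */
static int toy_nf_hook(int pkt, int (*okfn)(int pkt))
{
    size_t i;

    for (i = 0; i < toy_hook_count; i++) {
        if (toy_hooks[i].fn(pkt) != TOY_NF_ACCEPT)
            return -1;
    }
    return okfn(pkt);
}
```

With `toy_drop_negative` registered, `toy_nf_hook(5, toy_finish)` reaches the continuation while `toy_nf_hook(-3, toy_finish)` stops at the hook, the same shape as a filter rule preventing ip_rcv_finish from ever seeing a packet.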
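Finally ip_rcv_finish runs, followed by ip_route_input_noref and so on, then ip_local_deliver_finish, which calls ipprot->handler(skb); according to the L4 protocol.
For UDP, for example, L4 runs udp_rcv, __udp4_lib_rcv, udp_queue_rcv_skb, __udp_queue_rcv_skb, sock_queue_rcv_skb, sk->sk_data_ready and sock_def_readable: the socket is looked up by port, the matching receive function is called, and the process that owns the socket is woken up.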
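The last step, finding the socket by destination port, queuing the datagram and signalling the owner, is sketched below. Everything here is a hypothetical simplification: `struct toy_sock` and the `toy_` functions are invented for illustration, the lookup is a linear scan instead of the kernel's hash tables, and the wakeup is only a counter rather than a real wait-queue wake.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_sock {
    uint16_t port; /* local port the socket is bound to */
    int queued;    /* datagrams sitting in the receive queue */
    int wakeups;   /* times the owning process was signalled */
};

#define TOY_MAX_SOCKS 4
static struct toy_sock *toy_socks[TOY_MAX_SOCKS];

/* Bind a socket to a port; -1 if the port is taken or the table is full. */
static int toy_bind(struct toy_sock *sk, uint16_t port)
{
    size_t i, free_slot = TOY_MAX_SOCKS;

    for (i = 0; i < TOY_MAX_SOCKS; i++) {
        if (toy_socks[i] == NULL) {
            if (free_slot == TOY_MAX_SOCKS)
                free_slot = i;
        } else if (toy_socks[i]->port == port) {
            return -1;
        }
    }
    if (free_slot == TOY_MAX_SOCKS)
        return -1;
    sk->port = port;
    sk->queued = 0;
    sk->wakeups = 0;
    toy_socks[free_slot] = sk;
    return 0;
}

/* Find the socket bound to the destination port, or NULL. */
static struct toy_sock *toy_lookup(uint16_t dport)
{
    size_t i;

    for (i = 0; i < TOY_MAX_SOCKS; i++) {
        if (toy_socks[i] != NULL && toy_socks[i]->port == dport)
            return toy_socks[i];
    }
    return NULL;
}

/* Stand-in for sk->sk_data_ready: record that the owner was woken. */
static void toy_data_ready(struct toy_sock *sk)
{
    sk->wakeups++;
}

/* Deliver one datagram: look up, queue, wake. -1 if no socket listens. */
static int toy_udp_deliver(uint16_t dport)
{
    struct toy_sock *sk = toy_lookup(dport);

    if (sk == NULL)
        return -1;
    sk->queued++;
    toy_data_ready(sk);
    return 0;
}
```

A datagram for a port nobody has bound is rejected at the lookup step, before anything is queued or woken.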