Packet receive flow:
The traditional and NAPI receive paths differ, as shown in the figure.
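In the traditional path, every received packet raises an interrupt. Once the driver has handled it, it calls netif_rx to hand the packet to the kernel, which appends the skb to the input_pkt_queue of the current CPU's softnet_data structure. To handle traditional and NAPI devices uniformly, the kernel provides a virtual device, called the backlog device, for all drivers that do not use NAPI. There is one backlog device per CPU, held in softnet_data->backlog_dev (softnet_data->backlog in newer kernels, where it is a napi_struct). input_pkt_queue is that device's backlog queue for storing skbs; it is a doubly linked list organized as shown below. The interrupt top half only enqueues the packet and puts the backlog instance on poll_list; the bottom-half softirq then polls poll_list (net_rx_action-->process_backlog) to process the packets further.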
input_pkt_queue structure
+-------------------------------------------------------------+
|                                                             |
|    sk_buff_head         sk_buff             sk_buff         |
|     __________       __________          __________         |
+--->| next     |---->| next     |------->| next     |--------+
+----| prev     |<----| prev     |<-------| prev     |<-------+
|    |_qlen=2___|     |__________|        |__________|        |
|                                                             |
+-------------------------------------------------------------+
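The circular layout above can be sketched in user-space C. This is only a simplified model, not the kernel's sk_buff code: struct pkt, struct pkt_queue, pkt_id and the function names are hypothetical stand-ins, and there is no locking. It shows the two operations the text describes: the top half appends at the tail, and the softirq side removes from the head.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for sk_buff: only the list linkage is modeled. */
struct pkt {
    struct pkt *next;
    struct pkt *prev;
    int pkt_id;              /* illustrative payload, not a kernel field */
};

/* Hypothetical stand-in for sk_buff_head: a sentinel node plus a length. */
struct pkt_queue {
    struct pkt head;         /* head.next = first packet, head.prev = last */
    unsigned int qlen;
};

static void queue_init(struct pkt_queue *q)
{
    q->head.next = &q->head; /* an empty queue points back at itself */
    q->head.prev = &q->head;
    q->qlen = 0;
}

/* Append at the tail, as the interrupt top half does when enqueuing. */
static void queue_tail(struct pkt_queue *q, struct pkt *p)
{
    assert(p != NULL);
    p->next = &q->head;
    p->prev = q->head.prev;
    q->head.prev->next = p;
    q->head.prev = p;
    q->qlen++;
}

/* Remove from the head, as the softirq side does; NULL when empty. */
static struct pkt *dequeue(struct pkt_queue *q)
{
    struct pkt *p = q->head.next;

    if (p == &q->head)
        return NULL;
    q->head.next = p->next;
    p->next->prev = &q->head;
    p->next = NULL;
    p->prev = NULL;
    q->qlen--;
    return p;
}
```

After two queue_tail calls the queue has qlen = 2 and the last packet's next points back at the head, which is the state drawn in the diagram.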
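Because the traditional path raises an interrupt for every packet, a high packet rate makes interrupts so frequent that the CPU spends all its time servicing them and other tasks cannot get scheduled. NAPI (New API) was introduced to address this: it receives packets with a combination of interrupts and polling to improve throughput.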
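To make the difference concrete, here is a toy model in C. It is a sketch under strong assumptions, not driver code: it assumes a burst of packets is already sitting in the ring when the first poll runs, that the NAPI interrupt handler masks further receive interrupts until polling finishes, and the names rx_stats, traditional_burst and napi_burst are made up for illustration.

```c
#include <assert.h>

struct rx_stats {
    unsigned int interrupts; /* hardware interrupts taken */
    unsigned int polls;      /* poll() calls made (NAPI only) */
};

/* Traditional path: one interrupt per packet, no polling. */
static struct rx_stats traditional_burst(unsigned int pending)
{
    struct rx_stats st = { pending, 0 };
    return st;
}

/* NAPI path (toy model): the first packet raises one interrupt, the
 * handler masks further rx interrupts, and poll() then drains up to
 * `weight` packets per call until a call handles fewer than `weight`. */
static struct rx_stats napi_burst(unsigned int pending, unsigned int weight)
{
    struct rx_stats st = { 0, 0 };

    assert(weight > 0);
    if (pending == 0)
        return st;

    st.interrupts = 1;
    for (;;) {
        unsigned int work = pending < weight ? pending : weight;

        pending -= work;
        st.polls++;
        if (work < weight)   /* ring drained: complete and unmask */
            break;
    }
    return st;
}
```

In this model a burst of 100 packets costs 100 interrupts on the traditional path, but one interrupt and two polls with a weight of 64.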
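NAPI receive requires support from the NIC driver, for example the Intel e1000 family. In the receive interrupt, e1000_intr_msix_rx adds the NIC's napi instance to the poll_list of softnet_data and then sets the NET_RX_SOFTIRQ softirq flag; the softirq code later sees the flag and runs net_rx_action. When does the softirq run? There are two occasions:
1. do_IRQ-->irq_exit-->do_softirq-->call_softirq-->__do_softirq: when the interrupt top half exits, the softirq handler net_rx_action is called.
2. __do_softirq restarts softirq processing up to MAX_SOFTIRQ_RESTART = 10 times; if packets are still pending after that, wakeup_softirqd wakes the ksoftirqd kernel thread, which runs run_ksoftirqd-->__do_softirq-->net_rx_action to receive them.
net_rx_action walks the NICs on poll_list; the function is listed below (kernel version 3.2.x).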
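The restart limit in the second case can be sketched as a small user-space model in C. Only MAX_SOFTIRQ_RESTART comes from the text above; struct softirq_sim, its fields and do_softirq_sim are hypothetical names, and "passes_needed" is an assumed way of saying how many handler passes it would take to clear all pending work.

```c
#include <assert.h>

#define MAX_SOFTIRQ_RESTART 10

struct softirq_sim {
    unsigned int passes_needed; /* assumed: passes left until nothing is pending */
    unsigned int handler_runs;  /* how many times the handler was run inline */
    int ksoftirqd_woken;        /* set when the rest is deferred to ksoftirqd */
};

/* Model of the restart loop: run the handler, and restart while work is
 * still pending, but at most MAX_SOFTIRQ_RESTART passes in total. */
static void do_softirq_sim(struct softirq_sim *s)
{
    int max_restart = MAX_SOFTIRQ_RESTART;
    int pending;

    assert(s->passes_needed > 0); /* only entered when something is pending */
    do {
        s->handler_runs++;        /* one net_rx_action-like pass */
        s->passes_needed--;
        pending = s->passes_needed > 0;
    } while (pending && --max_restart);

    if (pending)
        s->ksoftirqd_woken = 1;   /* stand-in for wakeup_softirqd() */
}
```

With 3 passes of work the loop finishes inline; with 25 it stops after 10 passes and leaves the remaining 15 to the ksoftirqd thread.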
static void net_rx_action(struct softirq_action *h)
{
    struct softnet_data *sd = &__get_cpu_var(softnet_data);
    unsigned long time_limit = jiffies + 2;
    /* Max number of skbs handled in one softirq run; the system default
     * is 300, matching net.core.netdev_budget = 300. */
    int budget = netdev_budget;
    void *have;

    local_irq_disable(); /* disable interrupts while accessing softnet_data */

    while (!list_empty(&sd->poll_list)) {
        struct napi_struct *n;
        int work, weight;

        /* If softirq window is exhausted then punt.
         * Allow this to run for 2 jiffies since which will allow
         * an average latency of 1.5/HZ.
         */
        /* Do not poll for more than 2 jiffies and do not handle more
         * skbs than the budget of 300. */
        if (unlikely(budget <= 0 || time_after_eq(jiffies, time_limit)))
            goto softnet_break;

        local_irq_enable();

        /* Even though interrupts have been re-enabled, this
         * access is safe because interrupts can only add new
         * entries to the tail of this list, and only ->poll()
         * calls can remove this head entry from the list.
         */
        /* Take the head of poll_list, i.e. some NIC's napi instance. */
        n = list_first_entry(&sd->poll_list, struct napi_struct, poll_list);

        have = netpoll_poll_lock(n);

        /* Max packets this NIC handles in one poll, typically 64. */
        weight = n->weight;

        /* This NAPI_STATE_SCHED test is for avoiding a race
         * with netpoll's poll_napi(). Only the entity which
         * obtains the lock and sees NAPI_STATE_SCHED set will
         * actually make the ->poll() call. Therefore we avoid
         * accidentally calling ->poll() when NAPI is not scheduled.
         */
        work = 0;
        if (test_bit(NAPI_STATE_SCHED, &n->state)) {
            /* Call the device-specific poll function. If poll handles
             * fewer than weight packets, the driver calls napi_complete(),
             * which removes the device from poll_list. For non-NAPI
             * drivers the poll function is process_backlog. */
            work = n->poll(n, weight);
            trace_napi_poll(n);
        }

        WARN_ON_ONCE(work > weight);

        budget -= work;

        local_irq_disable();

        /* Drivers must not modify the NAPI state if they
         * consume the entire weight. In such cases this code
         * still "owns" the NAPI instance and therefore can
         * move the instance around on the list at-will.
         */
        /* If the whole weight was consumed, the device probably needs
         * more polling, so move its napi to the tail of poll_list. If
         * packets are still held by GRO, flush them into the protocol
         * stack instead of waiting any longer. */
        if (unlikely(work == weight)) {
            if (unlikely(napi_disable_pending(n))) {
                local_irq_enable();
                napi_complete(n);
                local_irq_disable();
            } else {
                if (n->gro_list) {
                    /* flush too old packets
                     * If HZ < 1000, flush all packets.
                     */
                    local_irq_enable();
                    napi_gro_flush(n, HZ >= 1000);
                    local_irq_disable();
                }
                list_move_tail(&n->poll_list, &sd->poll_list);
            }
        }

        netpoll_poll_unlock(have);
    }
out:
    net_rps_action_and_irq_enable(sd);

#ifdef CONFIG_NET_DMA
    /*
     * There may not be any more sk_buffs coming right now, so push
     * any pending DMA copies to hardware
     */
    dma_issue_pending_all();
#endif

    return;

softnet_break:
    sd->time_squeeze++;
    /* This round did not finish: set the softirq flag again so that the
     * next softirq run calls net_rx_action to continue. */
    __raise_softirq_irqoff(NET_RX_SOFTIRQ);
    goto out;
}
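The interplay of budget and weight in the loop above can be followed with a small user-space simulation in C. It is a sketch with several assumptions: rx_action_sim and MAX_DEVS are hypothetical names, each device is reduced to a count of pending packets, every device starts on the poll list, and the 2-jiffies time limit, GRO and locking are not modeled.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEVS 8 /* arbitrary cap for this sketch */

/* Simulate one net_rx_action-like run over devices whose pending packet
 * counts are in pending[0..ndev-1]. Returns the packets processed and
 * sets *squeezed when the budget ran out with devices still queued. */
static unsigned int rx_action_sim(unsigned int *pending, size_t ndev,
                                  int budget, unsigned int weight,
                                  int *squeezed)
{
    size_t list[MAX_DEVS]; /* poll list as device indices, head at [0] */
    size_t count = ndev, i;
    unsigned int total = 0;

    assert(ndev <= MAX_DEVS && weight > 0);
    for (i = 0; i < ndev; i++)
        list[i] = i;
    *squeezed = 0;

    while (count > 0) {
        size_t n;
        unsigned int work;

        if (budget <= 0) {          /* checked before polling, as above */
            *squeezed = 1;          /* counterpart of time_squeeze++ */
            break;
        }

        n = list[0];
        work = pending[n] < weight ? pending[n] : weight; /* ->poll() */
        pending[n] -= work;
        budget -= (int)work;
        total += work;

        for (i = 1; i < count; i++) /* take the head off the list */
            list[i - 1] = list[i];
        if (work == weight)
            list[count - 1] = n;    /* full weight used: move to tail */
        else
            count--;                /* drained: device leaves the list */
    }
    return total;
}
```

For two devices with 500 and 10 pending packets, a budget of 300 and a weight of 64, the run processes 330 packets and is squeezed with 180 left on the first device. The total exceeds 300 because the budget is only checked before each poll.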
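After the softirq, packets enter the kernel protocol stack for processing. That processing also involves netfilter, xfrm (ipsec) and so on, which will be analyzed in detail later.
An IP packet is processed as follows: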
Hard interrupt (schedule NAPI):
hardware interrupt-->do_IRQ-->handle_irq-->e1000_intr_msix_rx-->__napi_schedule(&adapter->napi)-->
____napi_schedule-->__raise_softirq_irqoff(NET_RX_SOFTIRQ)
Softirq (poll the NIC and hand the packet to IP):
do_IRQ-->irq_exit-->do_softirq-->call_softirq-->__do_softirq-->
net_rx_action-->e1000e_poll-->e1000_receive_skb-->napi_gro_receive-->
netif_receive_skb-->__netif_receive_skb-->__netif_receive_skb_core-->
deliver_skb-->ip_rcv-->NF_HOOK(NF_INET_PRE_ROUTING)-->
ip_rcv_finish-->dst_input
Local delivery (from dst_input):
ip_local_deliver-->
NF_HOOK(NF_INET_LOCAL_IN)-->ip_local_deliver_finish-->ipprot->handler()
Forwarding (from dst_input):
ip_forward-->NF_HOOK(NF_INET_FORWARD)-->ip_forward_finish-->
dst_output-->dst->output-->ip_output-->NF_HOOK_COND(NF_INET_POST_ROUTING)-->
ip_finish_output-->ip_finish_output2-->__ipv4_neigh_lookup_noref-->
dst_neigh_output-->neigh_hh_output-->dev_queue_xmit-->dev_hard_start_xmit-->ndo_start_xmit
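The point where the chain splits into local delivery and forwarding is dst_input, which calls the input function pointer stored in the packet's route entry. The C sketch below models only that dispatch. All sim_* names, the enum values and the single-local-address routing rule are hypothetical simplifications; the real route lookup is far more involved.

```c
#include <assert.h>

struct sim_skb;

/* Stand-in for the route entry: it carries the input handler chosen
 * by the route lookup. */
struct sim_dst {
    int (*input)(struct sim_skb *skb);
};

struct sim_skb {
    unsigned int daddr;      /* destination address of the packet */
    struct sim_dst *dst;     /* set by the (simulated) route lookup */
};

enum { SIM_DELIVERED_LOCALLY = 1, SIM_FORWARDED = 2 };

/* Stand-in for ip_local_deliver. */
static int sim_local_deliver(struct sim_skb *skb)
{
    (void)skb;
    return SIM_DELIVERED_LOCALLY;
}

/* Stand-in for ip_forward. */
static int sim_forward(struct sim_skb *skb)
{
    (void)skb;
    return SIM_FORWARDED;
}

static struct sim_dst sim_local_dst = { sim_local_deliver };
static struct sim_dst sim_forward_dst = { sim_forward };

/* Assumed routing rule: packets for local_addr are delivered locally,
 * everything else is forwarded. */
static void sim_route_input(struct sim_skb *skb, unsigned int local_addr)
{
    skb->dst = (skb->daddr == local_addr) ? &sim_local_dst
                                          : &sim_forward_dst;
}

/* Stand-in for dst_input: dispatch through the route's input pointer. */
static int sim_dst_input(struct sim_skb *skb)
{
    assert(skb->dst != 0);
    return skb->dst->input(skb);
}
```

Because the choice is made once by the route lookup and stored as a function pointer, the receive path itself needs no branch to decide between the two continuations.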
I found a very good diagram of the protocol stack's send and receive flow online; thanks to the original author.
References: http://zgykill.lofter.com/post/19b38e_a26bb1
http://blog.csdn.net/hui6075/article/details/51196056
http://www.cnblogs.com/super-king/p/3296201.html
https://yq.aliyun.com/articles/5002