The journey of a packet through the Linux 2.6.10 network stack
http://svn.gnumonks.org/trunk/doc/packet-journey-2.6.xml
Author: Harald Welte netfilter core team
Email: [email protected]
Date: Sep 14, 2004
Revision: 1.4
This document describes the journey of a network packet inside the Linux kernel 2.6.x. This has changed quite a bit since 2.2 because the globally serialized bottom half was abandoned in favor of the new softirq system.
Preface
I have to apologize for my ignorance, but this document has a strong focus on the "default case": x86 architecture and IP packets which get forwarded. If you want to contribute your favourite part, feel free to send me a patch.
While I've been working on netfilter/iptables for quite some time, I am definitely no core networking guru and the information provided by this document may be wrong. So don't expect too much; I'll always appreciate your comments and bugfixes.
The document tries to reflect the latest kernel at the time of its writing, which is 2.6.10-rc2. If you are working on an earlier or later kernel, parts of the network stack might already have changed again.
Receiving the packet
The receive interrupt
If the network card receives an ethernet frame which matches the local MAC address, an address programmed into the multicast filter, or the link-layer broadcast address, it issues an interrupt.
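The acceptance rule above can be sketched in plain user-space C. This is only an illustration, not driver or hardware code: struct nic_filter and nic_accepts() are hypothetical names, and the multicast filter is modelled as an exact-match list, whereas real cards often use a hash filter instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical model of a card's receive filter state. */
struct nic_filter {
    uint8_t local_mac[ETH_ALEN];
    const uint8_t (*mcast)[ETH_ALEN]; /* exact-match multicast list (assumption) */
    size_t mcast_count;
};

/* Returns true if the card would accept the frame and raise an interrupt. */
bool nic_accepts(const struct nic_filter *f, const uint8_t dst[ETH_ALEN])
{
    static const uint8_t bcast[ETH_ALEN] = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};

    assert(f != NULL && dst != NULL);

    if (memcmp(dst, f->local_mac, ETH_ALEN) == 0)
        return true;                      /* unicast to our own MAC address */
    if (memcmp(dst, bcast, ETH_ALEN) == 0)
        return true;                      /* link-layer broadcast */
    for (size_t i = 0; i < f->mcast_count; i++)
        if (memcmp(dst, f->mcast[i], ETH_ALEN) == 0)
            return true;                  /* address programmed into the filter */
    return false;                         /* silently ignored, no interrupt */
}
```

Frames for which nic_accepts() returns false never reach the driver at all, which is why nothing in the rest of this document has to deal with them.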
The network driver for this particular card handles the interrupt and fetches the packet data via DMA / PIO / whatever into RAM. It then allocates an skb and calls a function of the protocol independent device support routines: net/core/dev.c:netif_rx(skb).
Please note that in Linux 2.6.x, drivers can also be written to support the so-called NAPI (New API). NAPI tries to prevent DoS attacks caused by packet floods that make the CPU spin in the hardirq handler. Instead of using netif_rx() the way described above, NAPI drivers disable interrupt generation on the card and schedule polling by calling the function include/linux/netdevice.h:netif_rx_schedule(dev).
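To make the NAPI idea a bit more tangible, here is a small user-space simulation. It is a sketch under my own assumptions: struct fake_nic, fake_nic_frame_arrives() and fake_nic_poll() are invented for illustration and do not correspond to real kernel structures or driver entry points. It only shows the principle: the first frame causes one interrupt, the handler masks further interrupts and schedules polling, and the poll routine re-enables interrupts once the ring is empty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model; none of these fields exist in the kernel. */
struct fake_nic {
    int  rx_pending;      /* frames waiting in the RX ring */
    bool irq_enabled;     /* may the card raise RX interrupts? */
    bool poll_scheduled;  /* has polling been scheduled for this device? */
    int  irq_count;       /* number of hard interrupts taken so far */
};

/* A frame arrives: the card interrupts only while interrupts are enabled. */
void fake_nic_frame_arrives(struct fake_nic *nic)
{
    assert(nic != NULL);
    nic->rx_pending++;
    if (nic->irq_enabled) {
        nic->irq_count++;
        /* NAPI-style hardirq handler: mask RX interrupts, schedule polling. */
        nic->irq_enabled = false;
        nic->poll_scheduled = true;
    }
}

/* Softirq-time poll: handle at most 'budget' frames, return how many. */
int fake_nic_poll(struct fake_nic *nic, int budget)
{
    int done = 0;

    assert(nic != NULL && budget > 0);
    while (nic->rx_pending > 0 && done < budget) {
        nic->rx_pending--;
        done++;
    }
    if (nic->rx_pending == 0) {
        /* Ring drained: stop polling and go back to interrupt mode. */
        nic->poll_scheduled = false;
        nic->irq_enabled = true;
    }
    return done;
}
```

With this model, a burst of 100 frames costs one hard interrupt instead of 100, and the remaining work happens in bounded chunks outside the hardirq handler.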
netif_rx()
At this early time, the kernel checks whether there are any users of netpoll registered. Netpoll is a low-level mechanism for network access to incoming packets used by code that wants to avoid using the full network stack, like netconsole.
If the driver didn't already timestamp the skb, and some piece of code inside the kernel requested timestamps by asserting netstamp_needed, the kernel timestamps the skb now by calling include/net/sock.h:net_timestamp().
Afterwards the skb gets enqueued in the appropriate queue for the processor handling this packet. If the queue backlog is full the packet is dropped at this place. After enqueuing the skb the receive softirq is marked for execution via include/linux/netdevice.h:netif_rx_schedule(). The attentive reader will have noticed that this function was previously mentioned in relation to NAPI drivers. And yes, indeed, this is the point where the two codepaths rejoin and continue their common way through the rest of the stack.
If the queue is already full (queue->throttle != 0), then the packet is dropped rather than enqueued.
netif_rx() returns the queue congestion level to give some feedback to the driver. The congestion level can be either NET_RX_SUCCESS, NET_RX_CN_LOW, NET_RX_CN_MOD, NET_RX_CN_HIGH or NET_RX_DROP.
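The enqueue-or-drop decision can be sketched as a tiny user-space model. Please treat it as a rough approximation only: struct backlog_queue, backlog_enqueue(), backlog_dequeue() and MAX_BACKLOG are hypothetical names, the queue only counts packets instead of storing skbs, the intermediate congestion levels are left out, and the rule that the throttle flag is cleared only once the queue has drained completely is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BACKLOG 4   /* hypothetical limit, kept tiny for demonstration */

/* Stand-ins for NET_RX_SUCCESS and NET_RX_DROP. */
enum rx_result { RX_SUCCESS, RX_DROP };

/* Hypothetical per-CPU backlog; it only counts packets. */
struct backlog_queue {
    int  qlen;            /* packets currently queued */
    bool throttle;        /* set once the queue overflowed */
    bool softirq_raised;  /* was the receive softirq marked for execution? */
    int  dropped;         /* packets dropped so far */
};

/* Interrupt side: try to enqueue one packet. */
enum rx_result backlog_enqueue(struct backlog_queue *q)
{
    assert(q != NULL);

    if (q->qlen < MAX_BACKLOG) {
        if (q->qlen == 0) {
            /* Queue was empty: leave throttled state, kick the softirq. */
            q->throttle = false;
            q->softirq_raised = true;
        } else if (q->throttle) {
            /* Still throttled: keep dropping until the queue has drained. */
            q->dropped++;
            return RX_DROP;
        }
        q->qlen++;
        return RX_SUCCESS;
    }

    /* Backlog full: enter throttled state and drop the packet. */
    q->throttle = true;
    q->dropped++;
    return RX_DROP;
}

/* Softirq side: take one packet off the queue; false if it was empty. */
bool backlog_dequeue(struct backlog_queue *q)
{
    assert(q != NULL);
    if (q->qlen == 0)
        return false;
    q->qlen--;
    return true;
}
```

The point of the throttle flag in this model is hysteresis: after an overflow, the queue does not accept new packets again as soon as one slot frees up, but only after the softirq has caught up.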
The interrupt handler now exits and all interrupts are re-enabled.
The network RX softirq
Like in Linux 2.4, the whole network stack is running in softirq context. Softirqs have the major advantage that they may run on more than one CPU simultaneously (as opposed to the old "bottom halves" in Linux 2.2.x).
Our network receive softirq is registered in net/core/dev.c:net_dev_init() using the function kernel/softirq.c:open_softirq() provided by the softirq subsystem.
Further handling of our packet is done in the network receive softirq (NET_RX_SOFTIRQ), which is called from kernel/softirq.c:__do_softirq() via kernel/softirq.c:do_softirq(). do_softirq() itself is called from four places within the kernel:
from kernel/irq/handle.c:irq_exit() , which is called by architecture-specific code after the hardware interrupt handler has finished.
from kernel/softirq.c:ksoftirqd() , that is the kernel softirq daemon.
from kernel/softirq.c:local_bh_enable() , that is FIXME.
from net/core/dev.c:netif_rx_ni() , which is a special version of netif_rx(), used by bluetooth bnep and the tun driver.
So if execution passes one of these points, __do_softirq() is called; it detects that NET_RX_SOFTIRQ is marked and calls net/core/dev.c:net_rx_action(). Here the skbs are dequeued from the local CPU's backlog queue (net/core/dev.c:process_backlog()) using a weighting scheme between the different incoming devices.
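The register / mark / dispatch cycle of a softirq can be illustrated with a few lines of user-space C. This is a heavily simplified sketch: all sketch_* names are hypothetical, there is only one global pending mask instead of one per CPU, and there is no locking or restart logic. It merely shows that marking a softirq is cheap and idempotent, and that the handler runs later when the dispatcher is entered.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical softirq numbers for this sketch. */
enum { SKETCH_NET_TX, SKETCH_NET_RX, SKETCH_NR_SOFTIRQS };

typedef void (*softirq_fn)(void);

static softirq_fn   sketch_handlers[SKETCH_NR_SOFTIRQS];
static unsigned int sketch_pending;   /* one bit per marked softirq */

/* Registration, in the spirit of open_softirq(). */
void sketch_open_softirq(int nr, softirq_fn fn)
{
    assert(nr >= 0 && nr < SKETCH_NR_SOFTIRQS);
    sketch_handlers[nr] = fn;
}

/* Mark a softirq for execution; marking twice has no extra effect. */
void sketch_raise_softirq(int nr)
{
    assert(nr >= 0 && nr < SKETCH_NR_SOFTIRQS);
    sketch_pending |= 1u << nr;
}

/* Dispatcher: run every marked handler once, return how many ran. */
int sketch_do_softirq(void)
{
    unsigned int todo = sketch_pending;
    int ran = 0;

    sketch_pending = 0;   /* handlers may mark themselves again */
    for (int nr = 0; todo != 0; nr++, todo >>= 1) {
        if ((todo & 1u) && sketch_handlers[nr] != NULL) {
            sketch_handlers[nr]();
            ran++;
        }
    }
    return ran;
}

/* Example handler standing in for net_rx_action(); it just counts runs. */
int sketch_rx_runs;

void sketch_net_rx_action(void)
{
    sketch_rx_runs++;
}
```

A caller would register the handler once with sketch_open_softirq(SKETCH_NET_RX, sketch_net_rx_action), mark it with sketch_raise_softirq(SKETCH_NET_RX) whenever a packet was queued, and call sketch_do_softirq() at each of the exit points listed above.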
netif_receive_skb()
The next function is net/core/dev.c:netif_receive_skb() , which is the main input function for the receive softirq.
First there is again a check for any netpoll users via netpoll_rx(). If there is no timestamp, and timestamps have been requested somewhere in the kernel, net_timestamp() is called.
In case the incoming interface is part of a group of bonded interfaces, skb_bond() saves skb->dev to skb->real_dev and changes skb->dev to point to the master device structure.
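The device swap done for bonded interfaces is small enough to sketch directly. The structures below are stripped-down stand-ins, not the kernel's struct sk_buff and struct net_device; in particular, the master pointer on the device is an assumption of this sketch about how a slave finds its master.

```c
#include <assert.h>
#include <stddef.h>

/* Stripped-down stand-ins for the kernel structures (hypothetical). */
struct sketch_dev {
    const char        *name;
    struct sketch_dev *master;   /* NULL unless the device is enslaved */
};

struct sketch_skb {
    struct sketch_dev *dev;       /* device the stack believes it came from */
    struct sketch_dev *real_dev;  /* physical device, set only when bonded */
};

/* Remember the physical device and make the packet appear on the master. */
void sketch_skb_bond(struct sketch_skb *skb)
{
    struct sketch_dev *dev;

    assert(skb != NULL && skb->dev != NULL);
    dev = skb->dev;
    if (dev->master != NULL) {
        skb->real_dev = dev;
        skb->dev = dev->master;
    }
}
```

After this step, everything further up the stack sees the packet as having arrived on the master device, while the physical interface stays available in real_dev for code that cares.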
Now the packet is delivered to all layer 3 protocol handlers that have registered for all packets (such as PF_PACKET sockets) by calling deliver_skb().
If the kernel supports 'tc actions' (i.e. it was compiled with CONFIG_NET_CLS_ACT enabled), the ingress filter is now run via ing_filter(). If the filter verdict is TC_ACT_SHOT or TC_ACT_STOLEN, the skb is dropped by kfree_skb() and thus all further processing of the packet is stopped.
Next, it is checked (include/linux/divert.h:handle_diverter()) if somebody uses the packet diverter, another obscure feature of the Linux kernel. If yes, processing continues at net/core/dv.c:divert_frame().
If the kernel has support for ethernet bridging (i.e. CONFIG_BRIDGE is enabled), it is handled via net/core/dev.c:handle_bridge() and br_handle_frame_hook().
Finally, the regular layer 3 packet handlers are called by a lookup in the ptype hash and a successive call to net/core/dev.c:deliver_skb() .
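The two-stage delivery described in this section (first the handlers registered for all packets, finally the handlers found via the ptype hash) can be sketched as follows. Everything here is a simplification under my own assumptions: the sketch_* names, the hash size of 16 and the SKETCH_P_ALL wildcard value are made up for illustration, protocol numbers are plain host-order integers, and the handlers only count calls instead of receiving an skb.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HASH_SIZE 16       /* hypothetical number of hash buckets */
#define SKETCH_P_ALL     0x0003   /* hypothetical "all protocols" wildcard */

/* Hypothetical registration record for one protocol handler. */
struct sketch_ptype {
    uint16_t type;                    /* protocol number or SKETCH_P_ALL */
    int (*func)(uint16_t proto);      /* receive function */
    struct sketch_ptype *next;
};

static struct sketch_ptype *sketch_ptype_all;
static struct sketch_ptype *sketch_ptype_base[SKETCH_HASH_SIZE];

/* Register a handler on the wildcard list or in its hash bucket. */
void sketch_add_pack(struct sketch_ptype *pt)
{
    struct sketch_ptype **head;

    assert(pt != NULL && pt->func != NULL);
    head = (pt->type == SKETCH_P_ALL)
         ? &sketch_ptype_all
         : &sketch_ptype_base[pt->type & (SKETCH_HASH_SIZE - 1)];
    pt->next = *head;
    *head = pt;
}

/* Deliver one packet; returns the number of handlers that were called. */
int sketch_receive(uint16_t proto)
{
    struct sketch_ptype *pt;
    int delivered = 0;

    /* Stage 1: handlers that registered for all packets. */
    for (pt = sketch_ptype_all; pt != NULL; pt = pt->next) {
        pt->func(proto);
        delivered++;
    }
    /* Stage 2: hash lookup; buckets may hold several protocols. */
    for (pt = sketch_ptype_base[proto & (SKETCH_HASH_SIZE - 1)];
         pt != NULL; pt = pt->next) {
        if (pt->type == proto) {
            pt->func(proto);
            delivered++;
        }
    }
    return delivered;
}

/* Two example handlers that merely count how often they were called. */
int sketch_tap_hits, sketch_ip_hits;

int sketch_tap_rcv(uint16_t proto) { (void)proto; return ++sketch_tap_hits; }
int sketch_ip_rcv(uint16_t proto)  { (void)proto; return ++sketch_ip_hits; }
```

Registering sketch_tap_rcv with SKETCH_P_ALL and sketch_ip_rcv with 0x0800 means an IP packet reaches both handlers, while an ARP packet (0x0806) only reaches the tap, which is the behaviour a packet sniffer relies on.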