http://info.iet.unipi.it/~luigi/netmap/
News
- 20120608 new source code for Linux and FreeBSD including support for the VALE Virtual Local Ethernet
- 20120430 openvswitch patches to make openvswitch work on FreeBSD/libpcap and get good performance in userspace.
- 20120322 a pre-built TinyCore image (20120322) with a fixed "igb" driver.
- 20120217 announcement and first release of the Linux version
- 20120214 the main code has been imported in FreeBSD stable/9 and FreeBSD stable/8
- 20110117 initial import into FreeBSD head
Project summary
The high packet rates of today's high speed interfaces (up to 14.8 Mpps on 10 GigE interfaces) make it very difficult to do software packet processing at wire rate. An important reason is that the APIs and software architecture we use are the same we had 20-30 years ago, when "fast" was 1000 times slower.
netmap is a very efficient framework for line-rate raw packet I/O from user space, capable of supporting 14.88 Mpps on an ordinary PC and OS. Netmap integrates some known ideas into a novel, robust and easy-to-use framework that is available on FreeBSD and Linux without the need for special hardware or proprietary software.
With netmap, it takes as little as 60-65 clock cycles to move one packet between the user program and the wire. As an example, a single core running at 900 MHz can generate the 14.8 Mpps that saturate a 10 GigE interface. This is a 10-20x improvement over the use of a standard device driver.
The rest of this page gives a high level description of the project.
Useful Links
- Netmap is now part of FreeBSD HEAD, so the most recent version of the code can be found there. The core support is also present in stable/9 and stable/8.
- a Linux version has recently been completed and should be released later in February 2012. Performance is very similar to that of the FreeBSD version.
- netmap sources (20120806). Kernel code/patches for FreeBSD HEAD, plus a couple of example applications (bridge, pkt-gen) and a libpcap over netmap wrapper good enough for trafshow and a few other pcap-based apps.
- picobsd image (20110803) containing netmap and a few device drivers (ixgbe, em, re), plus two applications (pkt-gen, bridge) and libpcap, tcpdump, trafshow, and a Click version that can use netmap. This may be slightly outdated; it is only provided for a quick evaluation of the software.
Papers and presentations
Here are a few papers and submissions describing netmap and applications using it:
- May 2012: netmap: a novel framework for fast packet I/O (Usenix ATC'12, June 2012), with details on the design and architecture, and performance data updated to May 2012. An older version of this paper is available as netmap: fast and safe access to network adapters for user programs.
- Mar. 2012: Transparent acceleration of software packet forwarding using netmap (Infocom 2012, March 2012). Describes how we ported OpenvSwitch and Click to netmap (with huge speedups);
- Jan. 2012: Revisiting network I/O APIs: The netmap Framework, a recent article in ACM Queue.
- Aug. 2011: netmap: memory mapped access to network devices a SIGCOMM 2011 poster describing the idea and initial implementation. The poster won the Best Poster Award at the conference.
- Aug. 2011: slides for a series of talks I gave in the Bay Area, Aug. 8-13 (the slides are written using sttp);
- Related to high-speed packet processing, you might also be interested in qfq, a fast O(1) packet scheduler.
Note that we are actively developing and improving the code. The performance data below refers to the July 2011 version; since then we have achieved further speedups (10-20%, and more in some cases).
Architecture
netmap uses some well-known performance-boosting techniques, such as memory-mapping the card's packet buffers, I/O batching, and modeling the send and receive queues as circular buffers to match what the hardware implements. Unlike other systems, applications using netmap cannot crash the OS, because they run in user space and have no direct access to critical resources (device registers, kernel memory pointers, etc.). The programming model is extremely simple (circular rings of fixed-size buffers), and applications use only standard system calls: non-blocking ioctl() to synchronize with the hardware, and poll()-able file descriptors to wait for packet receptions or transmissions on individual queues.
Performance
netmap can generate traffic at line rate (14.88 Mpps) on a 10 GigE link with just a single core running at 900 MHz. This is about 60-65 clock cycles per packet, and scales well with cores and clock frequency (with 4 cores, line rate is achieved at less than 450 MHz). Similar rates are reached on the receive side. In the graph below, the two top curves (green and red) indicate the performance of netmap on FreeBSD with 1 and 4 cores, respectively (Intel 82599 10 Gbit card). The blue curve is the fastest available packet generator on Linux (pktgen, which works entirely in the kernel), while the purple curve at the bottom shows the performance of a user-space generator on FreeBSD using UDP sockets.
netmap scales well to multicore systems: individual file descriptors can be associated with different cards, or with different queues of a multi-queue card, and applications can move packets between queues without needing to synchronize with each other.
Operation
netmap implements a special device, /dev/netmap, which is the gateway to switch one or more network cards to netmap mode, where the card's datapath is disconnected from the operating system. open("/dev/netmap") returns a file descriptor that can be used with ioctl(fd, NIOCREG, ...) to switch an interface to netmap mode. A subsequent mmap() exports to userspace a replica of the TX and RX rings of the card, together with the actual packet buffers. Each "shadow" ring indicates the number of available buffers, the current read or write index, and the address and length of each buffer (buffers have a fixed size and are preallocated by the kernel).
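For reference, here is a simplified sketch of how the exported ring structures might be laid out. The field names follow the netmap headers (net/netmap.h), but the exact layout differs between versions; treat this as an illustration rather than the authoritative definition.

#include <stdint.h>

struct netmap_slot {            /* one descriptor in a "shadow" ring */
    uint32_t buf_idx;           /* index of the associated packet buffer */
    uint16_t len;               /* length of the packet in the buffer */
    uint16_t flags;             /* per-slot flags */
};

struct netmap_ring {            /* one "shadow" TX or RX ring */
    const uint32_t num_slots;   /* number of slots in the ring */
    uint32_t       avail;       /* slots currently available to userspace */
    uint32_t       cur;         /* current read (RX) or write (TX) index */
    const uint16_t nr_buf_size; /* size of each preallocated buffer */
    struct netmap_slot slot[0]; /* the array of slots follows, as in the header */
};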
Two ioctl()s synchronize the state of the rings between kernel and userspace: ioctl(fd, NIOCRXSYNC) tells the kernel which buffers have been read by userspace, and informs userspace of any newly received packets. On the TX side, ioctl(fd, NIOCTXSYNC) tells the kernel about new packets to transmit, and reports to userspace how many free slots are available.
The file descriptor returned by open() can be used to poll() one or all queues of a card, so that blocking operation can be integrated seamlessly in existing programs.
Data movement
Receiving a packet is as simple as reading from the buffer in the mmapped region; eventually, ioctl(fd, NIOCRXSYNC) is used to release one or more buffers at once. Writing to the network requires filling one or more buffers with data, setting the lengths, and eventually invoking ioctl(fd, NIOCTXSYNC) to issue the appropriate commands to the card.
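As a rough illustration of the transmit path, the sketch below uses the same macro naming and setup (fd, nifp) as the receive example in the "Example of use" section further down; build_packet() is a hypothetical helper that writes one frame into the buffer and returns its length.

struct netmap_ring *txring = NETMAP_TX_RING(nifp, 0);
int i, to_send = 100;                         /* e.g. send a burst of 100 packets */
char *buf;

while (to_send > 0 && txring->avail > 0) {
    i = txring->cur;
    buf = NETMAP_BUF(txring, i);
    txring->slot[i].len = build_packet(buf);  /* fill the payload, set the length */
    txring->cur = NETMAP_NEXT(txring, i);
    txring->avail--;
    to_send--;
}
ioctl(fd, NIOCTXSYNC, NULL);                  /* tell the kernel about the new packets */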
The memory mapped region contains all rings and buffers of all cards in netmap mode, so it is trivial to implement packet forwarding between interfaces. Zero-copy operation is also possible, by simply writing the address of the received buffer into the transmit ring.
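A minimal sketch of this idiom, assuming the slot layout sketched earlier and an NS_BUF_CHANGED slot flag (an assumption about this netmap version) that tells the kernel a slot now points to a different buffer:

/* rxring and txring belong to two interfaces in netmap mode */
struct netmap_slot *rs = &rxring->slot[rxring->cur];
struct netmap_slot *ts = &txring->slot[txring->cur];
uint32_t tmp = ts->buf_idx;

ts->buf_idx = rs->buf_idx;      /* transmit the buffer we just received */
ts->len     = rs->len;
rs->buf_idx = tmp;              /* recycle the old TX buffer for reception */
ts->flags  |= NS_BUF_CHANGED;   /* tell the kernel the buffers changed */
rs->flags  |= NS_BUF_CHANGED;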
Talking to the host stack
In addition to the "hardware" rings, each card in netmap mode exposes two additional rings that connect to the host stack. Packets coming from the stack are put in an RX ring where they can be processed in the same way as those coming from the network. Similarly, packets written to the additional TX ring are passed up to the host stack when the ioctl(fd, NIOCTXSYNC) is invoked. Zero-copy bridging between the host stack and the card is then possible in the same way as between two cards. In terms of performance, using the card in netmap mode and bridging in software is often more efficient than using standard mode, because the driver uses simpler and faster code paths.
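For illustration, here is a sketch of reading the packets that the host stack has queued on the additional RX ring. num_hw_rings is a hypothetical variable holding the number of hardware ring pairs of the card, and the assumption that the host rings follow the hardware rings in the ring array depends on the netmap version.

struct netmap_ring *host_rx = NETMAP_RX_RING(nifp, num_hw_rings);
int i;
char *buf;

for ( ; host_rx->avail > 0; host_rx->avail--) {
    i = host_rx->cur;
    buf = NETMAP_BUF(host_rx, i);
    use_data(buf, host_rx->slot[i].len);   /* packet generated by the host stack */
    host_rx->cur = NETMAP_NEXT(host_rx, i);
}
ioctl(fd, NIOCRXSYNC, NULL);               /* report the consumed slots to the kernel */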
Device independence
Programs using netmap do not need any special library or knowledge of the inner details of the network controller. Not only is the ring and buffer format independent of the card itself, but any operation that requires programming the card is done entirely within the kernel.
Status
netmap is currently available for FreeBSD 8 and HEAD, and supports the ixgbe (Intel 10GigE), e1000 (Intel) and re (Realtek) 1GigE drivers. Support for other cards is coming, as well as a Linux port.
The code consists of about 2000 lines for the kernel module, plus a 400-500 line diff for each individual driver (mostly mechanical modifications). The simplicity of the programming model makes it possible to use netmap without any userspace library.
Example of use
Below is a code snippet that sets a device in netmap mode and reads packets from it. Macros are used to compute pointers because the shared memory region contains kernel virtual addresses.
#include <fcntl.h>              /* open() */
#include <poll.h>               /* poll() */
#include <string.h>             /* strcpy(), memset() */
#include <sys/ioctl.h>          /* ioctl() */
#include <sys/mman.h>           /* mmap() */
#include <net/netmap.h>
#include <net/netmap_user.h>

struct netmap_if *nifp;
struct nmreq req;
int fd, i;
char *buf, *mem;

fd = open("/dev/netmap", O_RDWR);
memset(&req, 0, sizeof(req));
strcpy(req.nr_name, "ix0");                  // the interface to register
ioctl(fd, NIOCREG, &req);                    // switch it to netmap mode;
                                             // fills in nr_offset and nr_memsize
mem = mmap(NULL, req.nr_memsize,
           PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);  // map the shared region
nifp = NETMAP_IF(mem, req.nr_offset);        // the netmap_if for this interface
for (;;) {
    struct pollfd x[1];
    struct netmap_ring *ring = NETMAP_RX_RING(nifp, 0);

    x[0].fd = fd;
    x[0].events = POLLIN;
    poll(x, 1, 1000);                        // wait for incoming packets
    for ( ; ring->avail > 0 ; ring->avail--) {
        i = ring->cur;
        buf = NETMAP_BUF(ring, i);
        use_data(buf, ring->slot[i].len);    // process the payload
        ring->cur = NETMAP_NEXT(ring, i);
    }
}
Detailed performance data
When talking about performance it is important to understand what the relevant metrics are. I won't enter into a long discussion here; please have a look at the netmap paper for a more detailed discussion.
In short:
- if you only care about "system" overheads (i.e. the time to move a packet between the wire and the application), then the packet generator and receiver should be as simple as possible (maybe not even touching the data). The program pkt-gen that you find in the distribution implements exactly that -- outgoing packets are prepared once so they can be sent without being regenerated every time, and incoming packets are counted but not read;
- if you have a more complex application, it might need to send and receive at the same time, look at the payload, etc. There are two "bridge" applications in the distribution that transparently pass traffic between interfaces. The one called bridge uses the native netmap API and can do zero-copy forwarding; the other, called testpcap, uses a wrapper library that implements pcap calls on top of netmap. Among other things it copies each packet to the outgoing link, and it reads a timestamp (in the kernel) at each system call, because certain pcap clients expect packets to be timestamped.
Transmit and receive speed is shown in the previous section, and is relatively uninteresting since we reach line rate even with a severely underclocked CPU.
More interesting is what happens when you touch the data.
Below are some preliminary performance data for the two bridge utilities using batches of different sizes (beware that in the 'bridge' case the batch size is per-queue, and the test was run on 4 queues, so up to 4*batch_size packets are handled per system call).
Tests were run on an Asus motherboard with an i7-870 CPU @ 2.93 GHz and dual-port 10G Intel cards (82599-based).
BRIDGING SPEED in Millions of Packets Per Second (Mpps)

batch      bridge     bridge     testpcap
size       (no ts)    (do ts)    (with ts)
   1        9.59       8.57        0.75
   2       10.05       8.96        1.37
  16                                4.86
1024       10.66       9.42        7.50
As a comparison, native packet forwarding using the in-kernel bridge does about 700 Kpps on the same hardware. The comparison is a bit unfair because our bridge and testpcap do not do address lookups; however, we have some real forwarding code (a modified version of openvswitch) that does almost 3 Mpps using netmap.