The purpose behind BPF is to let an application specify a filtering function to select only the network packets that it wants to see. An early BPF user was the tcpdump utility, which used BPF to implement the filtering behind its complex command-line syntax. Other packet capture programs also make use of it. On Linux, there is another interesting application of BPF: the "socket filter" mechanism allows an application to filter incoming packets on any type of socket with BPF. In this mode, it can function as a sort of per-application firewall, eliminating packets before the application ever sees them.
The original BPF distribution came in the form of a user-space library, but the BPF interface quickly found its way into the kernel. When network traffic is high, there is a lot of value in filtering unwanted packets before they are copied into user space. Obviously, it is also important that BPF filters run quickly; every bit of per-packet overhead is going to hurt in a high-traffic situation. BPF was designed to allow a wide variety of filters while keeping speed in mind, but that does not mean that it cannot be made faster.
BPF defines a virtual machine which is almost Turing-machine-like in its simplicity. There are two registers: an accumulator and an index register. The machine also has a small scratch memory area, an implicit array containing the packet in question, and a small set of arithmetic, logical, and jump instructions. The accumulator is used for arithmetic operations, while the index register provides offsets into the packet or into the scratch memory area. A very simple BPF program (taken from the 1993 USENIX paper [PDF]) might be:
        ldh [12]
        jeq #ETHERTYPE_IP, l1, l2
    l1: ret #TRUE
    l2: ret #0
The first instruction loads a 16-bit quantity from offset 12 in the packet to the accumulator; that value is the Ethernet protocol type field. It then compares the value to see if the packet is an IP packet or not; IP packets are accepted, while anything else is rejected. Naturally, filter programs get more complicated quickly. Header length can vary, so the program will have to calculate the offsets of (for example) TCP header values; that is where the index register comes into play. Scratch memory (which is the only place a BPF program can store to) is used when intermediate results must be kept.
The Linux BPF implementation can be found in net/core/filter.c; it provides "standard" BPF along with a number of Linux-specific ancillary instructions which can test whether a packet is marked, which CPU the filter is running on, which interface the packet arrived on, and more. It is, at its core, a long switch statement designed to run the BPF instructions quickly. This code has seen a number of enhancements and speed improvements over the years, but there has not been any fundamental change for a long time.
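To make the "long switch statement" idea concrete, here is a much-reduced sketch of a switch-based interpreter. It is not the code from net/core/filter.c: the opcode names, struct bpf_insn_sketch, load_half() and run_filter() are all hypothetical, only five instructions are handled, and scratch memory is left out. It does show the accumulator (A), the index register (X), and the rule that a load past the end of the packet rejects the packet.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for this sketch; the kernel uses the classic BPF
 * encoding instead. */
enum bpf_op {
    OP_LDH_ABS,  /* A = 16-bit value at packet offset k */
    OP_LDH_IND,  /* A = 16-bit value at packet offset X + k */
    OP_LDX_MSH,  /* X = 4 * (low nibble of byte at offset k) */
    OP_JEQ_K,    /* skip jt instructions if A == k, else skip jf */
    OP_RET_K,    /* return the constant k */
};

struct bpf_insn_sketch {
    enum bpf_op op;
    uint8_t     jt, jf;  /* relative jump offsets */
    uint32_t    k;       /* constant or packet offset */
};

/* Load a big-endian 16-bit value; returns 0 if it would run off the packet. */
static int load_half(const uint8_t *pkt, size_t len, uint64_t off, uint32_t *out)
{
    if (off > len || len - off < 2)
        return 0;
    *out = (uint32_t)pkt[off] << 8 | pkt[off + 1];
    return 1;
}

/* Run prog over the packet; a return value of 0 means "reject". */
uint32_t run_filter(const struct bpf_insn_sketch *prog, size_t n,
                    const uint8_t *pkt, size_t len)
{
    uint32_t A = 0, X = 0;  /* accumulator and index register */

    for (size_t pc = 0; pc < n; pc++) {
        const struct bpf_insn_sketch *in = &prog[pc];

        switch (in->op) {
        case OP_LDH_ABS:
            if (!load_half(pkt, len, in->k, &A))
                return 0;
            break;
        case OP_LDH_IND:
            if (!load_half(pkt, len, (uint64_t)X + in->k, &A))
                return 0;
            break;
        case OP_LDX_MSH:
            if (in->k >= len)
                return 0;
            X = 4u * (pkt[in->k] & 0x0f);
            break;
        case OP_JEQ_K:
            pc += (A == in->k) ? in->jt : in->jf;
            break;
        case OP_RET_K:
            return in->k;
        default:
            return 0;  /* unknown instruction: reject */
        }
    }
    return 0;  /* ran off the end of the program: reject */
}

/* The IP-only filter from the text, in this sketch's encoding. */
static const struct bpf_insn_sketch ip_only[] = {
    { OP_LDH_ABS, 0, 0, 12 },
    { OP_JEQ_K,   0, 1, 0x0800 },
    { OP_RET_K,   0, 0, 0xFFFF },
    { OP_RET_K,   0, 0, 0 },
};
```

Every instruction pays for the loop test, the fetch of the instruction structure, and the switch dispatch before any useful work is done; that per-instruction overhead is what a faster implementation has to attack.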
Eric Dumazet's patch is a fundamental change: it puts a just-in-time compiler into the kernel to translate BPF code directly into the host system's assembly code. The simplicity of the BPF machine makes the JIT translation relatively simple; every BPF instruction maps to a straightforward x86 instruction sequence. There are a few assembly-language helpers which implement the virtual machine's semantics; the accumulator and index are just stored in the processor's registers. The resulting program is placed in a bit of vmalloc() space and run directly when a packet is to be tested. A simple benchmark shows a 50ns savings for each invocation of a simple filter; that may seem small, but, when multiplied by the number of packets going through a system, that difference can add up quickly.
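The flavor of such a translation can be sketched in a few lines of C. This is not the code from the patch: the instruction set, struct jit_insn, emit_op_imm32() and jit_compile() are invented for the example, keeping the accumulator in eax is an assumption of this sketch, and the output is only written into a byte buffer rather than placed in executable memory and run. Packet loads, jumps, and bounds checks, which a real translator must handle, are left out entirely.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mini instruction set for this sketch. */
enum jit_op {
    J_LD_IMM,  /* A = k */
    J_ADD_K,   /* A += k */
    J_AND_K,   /* A &= k */
    J_RET_K,   /* return k */
    J_RET_A,   /* return A */
};

struct jit_insn {
    enum jit_op op;
    uint32_t    k;
};

/* Emit one opcode byte followed by a little-endian 32-bit immediate. */
static size_t emit_op_imm32(uint8_t *out, uint8_t opcode, uint32_t imm)
{
    out[0] = opcode;
    out[1] = (uint8_t)(imm & 0xff);
    out[2] = (uint8_t)((imm >> 8) & 0xff);
    out[3] = (uint8_t)((imm >> 16) & 0xff);
    out[4] = (uint8_t)((imm >> 24) & 0xff);
    return 5;
}

/* Translate prog into x86 machine code, with A assumed to live in eax.
 * Returns the number of bytes written, or 0 if the buffer is too small
 * or an unknown instruction is seen. */
size_t jit_compile(const struct jit_insn *prog, size_t n,
                   uint8_t *out, size_t cap)
{
    size_t used = 0;

    for (size_t i = 0; i < n; i++) {
        if (cap - used < 6)  /* longest sequence below is 6 bytes */
            return 0;

        switch (prog[i].op) {
        case J_LD_IMM:  /* mov eax, imm32 */
            used += emit_op_imm32(out + used, 0xB8, prog[i].k);
            break;
        case J_ADD_K:   /* add eax, imm32 */
            used += emit_op_imm32(out + used, 0x05, prog[i].k);
            break;
        case J_AND_K:   /* and eax, imm32 */
            used += emit_op_imm32(out + used, 0x25, prog[i].k);
            break;
        case J_RET_K:   /* mov eax, imm32; ret */
            used += emit_op_imm32(out + used, 0xB8, prog[i].k);
            out[used++] = 0xC3;
            break;
        case J_RET_A:   /* ret (the result is already in eax) */
            out[used++] = 0xC3;
            break;
        default:
            return 0;
        }
    }
    return used;
}
```

Even in this toy form, two sources of the speedup are visible: the dispatch loop disappears from the generated code, and each constant k ends up as an immediate operand inside the instruction stream instead of being read from an instruction structure at run time.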
The current implementation is limited to the x86-64 architecture; indeed, that architecture is wired deeply into the code, which is littered with hard-coded x86 instruction opcodes. Should anybody want to add a second architecture, they will be faced with the choice of simply replicating the whole thing (it is not huge) or trying to add a generalized opcode generator to the existing JIT code.
An obvious question is: can this same approach be applied to iptables, which is more heavily used than BPF? The answer may be "yes," but it might also make more sense to bring back the nftables idea, which is built on a BPF-like virtual machine of its own. Given that there has been some talk of using nftables in other contexts (internal packet classification for packet scheduling, for example), the value of a JIT-translated nftables could be even higher. Nftables is a job for another day, though; meanwhile, we have a proof of the concept for BPF that appears to get the job done nicely.
A JIT for packet filters
Posted Apr 14, 2011 6:57 UTC (Thu) by jengelh (subscriber, #33263) [Link]
Since xtables modules are already handcrafted for a specific task, any interpreter module for arbitrary expressions (such as xt_u32 and nft) has a tendency to naturally run slower. But, if BPF can be JITed, it would not seem impossible to extend xt_u32.
A JIT for packet filters
Posted Apr 15, 2011 14:01 UTC (Fri) by Nelson (subscriber, #21712) [Link]
Someone beat me to the punch.. I have been toying with code that does this for a few months. Congrats, I hope the community accepts the patch. I did some research on this for work about 2 years ago, with something like BPF there is a dramatic gain due to the nature of the interpreter. You can cut a lot of cruft out with JIT. As for generic firewall rules? It's not as dramatic but you can get a pretty general across the board improvement and on some architectures it's definitely in the interesting range (maybe 30%, maybe more, I guess it depends how far you go). If you simply recode firewall rules as binary, think about it this way: the firewall has a linked list of instructions, all the linked list code goes away (it's not much, but some memory reads, a few instructions here and there) and then as you execute the instructions the JIT ones basically turn memory reads into literals so you can dump the loads and some other machinery. It's not warp speed but a nice bump, maybe in the 30%ish generally, it depends on the architecture and there are a lot of variables. Like I said, it's an interesting and noticeable improvement.
Now where it can be interesting is if you coded up a more advanced compiler to optimize the rules. (tcpdump does this, it's ugly but check it out, look at the optimized output some time.) A typical stream of rules might have 10 rules that all apply to TCP packets and then check various IP ranges and ports. xtables currently would execute each "instruction" until it reached a result (is packet TCP? does packet src IP match range Y from rule.. okay go to the next one, is the packet TCP?... it would check TCP 10 times). A good compiler can invert that logic and figure out better ways to do it (if packet TCP? yes, then see if it's in the ranges of IPs from these 10 rules and maybe those rules can be compressed into just checking a couple bits because all the IPs are similar... no it's not TCP, then skip all ten rules and look at the next batch). I wouldn't hazard a guess as to how much faster this makes the firewall but the potential is HUGE. So we could maybe replace iptables with some sort of LLVM based compiler that generated a bytecode "program" that contained the whole firewall.
Whether it's worth the complexity, the difficulty in debugging and porting is a different question.
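The "check TCP once instead of ten times" idea can be shown with a toy example. This is only a sketch under invented assumptions: the packet and rule structures, the proto_tests counter, and both matching functions are hypothetical and have nothing to do with the real xtables data structures; match_grouped() further assumes that every rule in the batch shares the protocol of the first one.

```c
#include <stddef.h>
#include <stdint.h>

struct packet { uint8_t proto; uint32_t src; uint16_t dport; };
struct rule   { uint8_t proto; uint32_t src_lo, src_hi; uint16_t dport; };

static unsigned proto_tests;  /* counts protocol comparisons, for illustration */

static int proto_is(const struct packet *p, uint8_t proto)
{
    proto_tests++;
    return p->proto == proto;
}

static int rest_matches(const struct rule *r, const struct packet *p)
{
    return p->src >= r->src_lo && p->src <= r->src_hi && p->dport == r->dport;
}

/* Rule-at-a-time evaluation: the protocol is tested again for every rule.
 * Returns the index of the first matching rule, or -1. */
int match_naive(const struct rule *rules, size_t n, const struct packet *p)
{
    for (size_t i = 0; i < n; i++)
        if (proto_is(p, rules[i].proto) && rest_matches(&rules[i], p))
            return (int)i;
    return -1;
}

/* Grouped evaluation: assumes all n rules share rules[0].proto, so the
 * protocol is tested once and a mismatch skips the whole batch. */
int match_grouped(const struct rule *rules, size_t n, const struct packet *p)
{
    if (n == 0 || !proto_is(p, rules[0].proto))
        return -1;
    for (size_t i = 0; i < n; i++)
        if (rest_matches(&rules[i], p))
            return (int)i;
    return -1;
}
```

With ten TCP rules and a UDP packet, the first function performs ten protocol comparisons and the second performs one; both report the same result. Compressing the address ranges into a few bit tests, as suggested above, would be a further step that this sketch does not attempt.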
A JIT for packet filters
Posted Apr 16, 2011 2:55 UTC (Sat) by wahern (subscriber, #37304) [Link]
For fair comparison with a JIT compiler, the interpreter would instead jump directly from one instruction to the next using jump tables, indexing into a table of labels constructed using GCC's label address-of operator, &&.
On my own VM I can dramatically improve performance on many programs merely by threading the interpreter. If doing this gives the same performance, which it very well could given that BPF might be data bound and the ops are so simple, then it would be far preferable rather than adding hundreds of lines of new code for each architecture (or, conversely, having some architectures needlessly disadvantaged).
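For readers who have not seen the technique, a threaded interpreter built on GCC's labels-as-values extension looks roughly like the sketch below. The opcode set, struct t_insn and run_threaded() are hypothetical, the code needs GCC or a compatible compiler such as Clang, and it assumes the program has already been validated (every path ends in a return and no jump leaves the program), since there are no checks here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for this sketch. */
enum t_op { T_LD_IMM, T_ADD_K, T_JEQ_K, T_RET_A, T_RET_K };

struct t_insn {
    enum t_op op;
    uint8_t   jt, jf;  /* relative jump offsets */
    uint32_t  k;
};

/* Threaded dispatch: each handler jumps straight to the next handler
 * through a table of label addresses, with no central loop or switch. */
uint32_t run_threaded(const struct t_insn *prog)
{
    /* Table of handler addresses, built with the && operator. */
    static const void *const dispatch[] = {
        [T_LD_IMM] = &&do_ld_imm,
        [T_ADD_K]  = &&do_add_k,
        [T_JEQ_K]  = &&do_jeq_k,
        [T_RET_A]  = &&do_ret_a,
        [T_RET_K]  = &&do_ret_k,
    };
    const struct t_insn *pc = prog;
    uint32_t A = 0;  /* accumulator */

#define NEXT() goto *dispatch[pc->op]

    NEXT();

do_ld_imm:
    A = pc->k;
    pc++;
    NEXT();
do_add_k:
    A += pc->k;
    pc++;
    NEXT();
do_jeq_k:
    pc += 1 + ((A == pc->k) ? pc->jt : pc->jf);
    NEXT();
do_ret_a:
    return A;
do_ret_k:
    return pc->k;

#undef NEXT
}
```

Compared with a switch inside a loop, each handler ends in its own indirect jump, which tends to be friendlier to branch prediction; the instruction structures are still read from memory at run time, though, so the constants do not become immediates as they would in generated code.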
A JIT for packet filters
Posted Apr 18, 2011 18:33 UTC (Mon) by Nelson (subscriber, #21712) [Link]
That's a fair criticism, you can make the BPF VM more efficient, it's still a comparison of what's there to a JIT though. Even with those improvements, you can get a fairly consistent boost with a JIT, just from turning the loads into literals. It might not be worth the complexity but if there was a more generic JIT framework such that the platform support was there it is an interesting optimization if you rely upon BPF stuff a lot.
A JIT for packet filters
Posted Apr 14, 2011 10:24 UTC (Thu) by Cyberax (subscriber, #52523) [Link]
A JIT for packet filters
Posted Apr 14, 2011 17:48 UTC (Thu) by fuhchee (subscriber, #40059) [Link]
A JIT for packet filters
Posted Apr 14, 2011 15:13 UTC (Thu) by trasz (subscriber, #45786) [Link]
A JIT for packet filters
Posted Apr 14, 2011 16:28 UTC (Thu) by wahern (subscriber, #37304) [Link]
I'd be interested in this because I'm looking for a tiny JIT compiler. MyJIT is the best I can find so far, but it can't recover from OOM errors and it's quite large, which means I'm too lazy to fix it.
A JIT for packet filters
Posted Apr 14, 2011 16:37 UTC (Thu) by trasz (subscriber, #45786) [Link]
It was introduced six years ago, with this commit:
r153151 | jkim | 2005-12-06 03:58:12 +0100 (Tue, 06 Dec 2005) | 17 lines
Add experimental BPF Just-In-Time compiler for amd64 and i386.
Use the following kernel configuration option to enable:
options BPF_JITTER
If you want to use bpf_filter() instead (e. g., debugging), do:
sysctl net.bpf.jitter.enable=0
to turn it off.
Currently BIOCSETWF and bpf_mtap2() are unsupported, and bpf_mtap() is
partially supported because 1) no need, 2) avoid expensive m_copydata(9).
Obtained from: WinPcap 3.1 (for i386)
A JIT for packet filters
Posted Apr 14, 2011 16:39 UTC (Thu) by wahern (subscriber, #37304) [Link]
Nevermind
A JIT for packet filters
Posted Apr 15, 2011 9:56 UTC (Fri) by rilder (subscriber, #59804) [Link]
A JIT for packet filters
Posted Apr 17, 2011 1:40 UTC (Sun) by jzbiciak (✭ supporter ✭, #5246) [Link]
I agree that doing this in userspace seems to make much more sense than doing it in the kernel if optimized performance is your main concern, since you can bring more resources to bear on the problem without bloating the kernel. It then comes down to managing the potential security issues, and trusting the correctness of the translator since you lose any sandboxing the interpreter might have offered.
(Yes, the translator can insert the required bounds checks, but nothing requires it to if you're loading raw machine code into the kernel.)
A JIT for packet filters
Posted Apr 17, 2011 21:08 UTC (Sun) by rilder (subscriber, #59804) [Link]
A JIT for packet filters
Posted Apr 26, 2011 2:16 UTC (Tue) by welinder (guest, #4699) [Link]
A JIT for packet filters
Posted May 21, 2011 11:56 UTC (Sat) by snemarch (guest, #75085) [Link]
why does this need a JIT compiler?
Posted Apr 28, 2011 6:07 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]
filters change very infrequently, so why do you need a JIT compiler instead of a normal compiler?
am I missing something on the definition here? or are they misusing the term JIT? or are they using a JIT setup when they could just as easily use a normal compiler?
why does this need a JIT compiler?
Posted May 21, 2011 12:00 UTC (Sat) by snemarch (guest, #75085) [Link]
JIT isn't a super precisely defined term.
In this context, the "just in time" means the code is not compiled into the kernel (or as an LKM), but generated from user data. You don't need the Java/.NET style "interpret until determined hotspot, then generate native" behavior in order to call something JIT :-)
The original link: https://lwn.net/Articles/437981/