vlan
# ip link add link eth0 name eth0.100 type vlan id 100
Veth
ip link add name n1.0.1 type veth peer name n1.0
Can
# ip link set can0 up type can bitrate 125000
Bridge
$ /sbin/ip li add br0 type bridge
$ /sbin/ip li set dev eth0 master br0
dummy
$ /sbin/ip li add dummy0 type dummy
Virtual CAN (vcan)
sudo modprobe can
sudo modprobe can_raw
sudo modprobe vcan
sudo ip link add dev vcan0 type vcan
ifb
sudo modprobe ifb
sudo ip link set dev ifb0 up txqueuelen 1000
$ ip link add type veth help
Usage: ip link <options> type veth [peer <options>]
To see the available options, type 'ip link add help'
$ ip link add type vlan help
Usage: ... vlan id VLANID [ FLAG-LIST ]
[ ingress-qos-map QOS-MAP ] [ egress-qos-map QOS-MAP ]
VLANID := 0-4095
FLAG-LIST := [ FLAG-LIST ] FLAG
FLAG := [ reorder_hdr { on | off } ] [ gvrp { on | off } ]
[ loose_binding { on | off } ]
QOS-MAP := [ QOS-MAP ] QOS-MAPPING
QOS-MAPPING := FROM:TO
Filter match types:
1. basic
2. flow
3. fw
4. u32
5. route
6. rsvp
7. cgroup
8. tcindex
(This is some notes I'm jotting down while I'm working on this, I intend to come back and clean this up later)
- u32 is dumb: http://lists.openwall.net/netdev/2007/08/15/65
- no one knows what "tc action" does.
- basic match meta filter module looks awesome http://lwn.net/Articles/119536/
- maybe use basic match cmp() to match on ethertype, mac addresses and other layer 2 miscellanea.
- for debugging filters you can use actions (untested). eg:
ifup dummy0
tc filter add dev $DEV parent ffff: protocol ip filter-rule flowid 1:2 action mirred egress mirror dev dummy0
tcpdump -i dummy0
- HTB: "quantum of class XXXXYYYY is big. Consider r2q change" means class XXXX:YYYY has a massive quantum. By default the quantum is the rate of the class divided by "r2q". http://www.docum.org/docum.org/faq/cache/31.html
- filter...protocol specifies which skb->protocol you're talking about; normally skb->protocol == ethertype. If you don't care, you /must/ in some circumstances specify "protocol all": in some situations a specific protocol ethertype works, in others it gives an "invalid argument".
- When a packet doesn't match, the sfq's internal classifier always puts it in bucket 0, so those users get abysmal performance under any kind of load. The "flow" external classifier instead drops packets that don't match altogether, so those users get 100% packet loss, with or without load. At least the second problem is obvious during testing; the first is often only discovered after users complain.
- ifb (Intermediate Functional Block) is a replacement for IMQ. ifb is in the kernel. ref
- We were getting errors with two rules at the same priority. I think two u32 rules at the same priority are merged into one hash table; if this is not possible you get an "invalid argument". Consider using a unique priority/preference for one of the rules and see if that solves the issue.
To match PPPoE discovery ethertype:
$TC filter add dev $DEV \
pref 10 parent $Q_ROOT: \
protocol all \
basic match "cmp(u16 at 12 layer 2 eq $ETH_P_PPPOED)" \
flowid $Q_ROOT:$C_PPPoE
- If you want things to be perfectly fair:
tc qdisc add dev $DEV \
root handle 1: \
sfq
tc filter add dev $DEV \
pref 1 parent 1: handle 100 \
protocol all \
flow hash keys dst divisor 1024
This will be fair across all destination IP addresses. We have a set of patches to allow this across src/dst mac addresses.
filters
basic
./tc filter add dev eth1 basic match help
basic is anything but. It allows complicated matches to be built up from boolean operations on various criteria (called extended matches, or "ematches"). The syntax is "criteria(arguments)". You can use brackets to force precedence, as well as "and", "or" and "not" to combine criteria. Supported extended match modules are "cmp", "meta", "nbyte" and "u32". For suggested syntaxes see tc filter add dev lo basic match 'cmp(help)', tc filter add dev lo basic match 'meta(help)', tc filter add dev lo basic match 'meta(list)', tc filter add dev lo basic match 'nbyte(help)' and tc filter add dev lo basic match 'u32(help)' (the quotes stop the shell from interpreting the parentheses).
The cmp extended match appears to be the recommended way to match on layer 2 fields (ref).
cmp ematch
This ematch module lets you match on various 8, 16 or 32 bit quantities relative to the layer 2, layer 3 or transport headers.
An example (we didn't have time to get it working properly, but it shows valid syntax): this should match IP packets inside PPPoE sourced from 192.0.2.0/24:
$TC filter add dev $DEV \
parent 1: prio 10 \
protocol all \
basic match "cmp(u16 at 12 layer 2 eq 0x8864) and cmp(u32 at 34 layer 2 mask 0xFFFFFF00 eq 0xC0000200)" \
flowid 1:10
meta ematch
This ematch module lets you match on various attributes of the system (such as load average), or metadata about the packet (such as the firewall mark). tc filter add dev lo basic match 'meta(list)' lists all the possible attributes.
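For instance, a minimal sketch (the attribute names nf_mark and loadavg_0 are from memory, so confirm them against the meta(list) output on your system first):
# classify on the firewall mark (assumed attribute name: nf_mark)
tc filter add dev eth0 parent 1: protocol all \
    basic match 'meta(nf_mark eq 1)' \
    flowid 1:10
# or on the first load-average sample (assumed attribute name: loadavg_0)
tc filter add dev eth0 parent 1: protocol all \
    basic match 'meta(loadavg_0 gt 2)' \
    flowid 1:20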
nbyte ematch
When you want to match on a string inside a packet, nbyte is the module for you.
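A hedged sketch (untested; the syntax is my recollection of the nbyte(help) output, with a quoted needle at a byte offset relative to a header):
tc filter add dev eth0 parent 1: protocol ip \
    basic match 'nbyte("GET" at 0 layer transport)' \
    flowid 1:30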
u32 ematch
u32 is the same as the normal u32 match. Being an ematch, it allows for lt, gt or eq matches as well as the usual matches. You can also use the "basic" system to combine it with other ematches in one single rule.
u32
The u32 match appears to be the most frequently used match. Multiple u32 matches on the same "prio" will be "stacked" into a single hash table; errors can occur if they can't be stacked, so try giving them unique prios. u32 always matches from the start of the "network" (IP/IPv6) header. To get at the link-layer header, negative offsets can be used as a hack, as sketched below.
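A sketch of the negative-offset hack on Ethernet (untested here; the 14-byte Ethernet header sits just before the network header, so the ethertype u16 is at offset -2, in the spirit of the LARTC examples):
tc filter add dev eth0 parent 1: prio 20 protocol all u32 \
    match u16 0x0800 0xffff at -2 \
    flowid 1:1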
tcindex
This matches on the skb->tc_index. I don't know what this is used for. (ref)
rsvp
Match on RSVP flow labels. (ref)
flow
This is an extremely useful classifier that allows classifying packets into queues inside an SFQ.
Example:
$TC filter add dev $DEV \
parent 1: prio 1 \
handle 2 \
protocol all \
flow hash keys dst divisor 1024
This rule replaces the SFQ's internal classifier with one that hashes only on destination address. This shares bandwidth fairly between destination IPs, instead of between 5-tuple flows.
One caveat discovered with this classifier: if a packet doesn't match, it gets dropped by the sfq, whereas the default behaviour of the SFQ's internal hashing algorithm is to place packets it can't classify into bucket 0. The external classifier makes this obvious during testing (100% packet loss); the internal classifier only shows horrible performance once the sfq is under load (and many other buckets are in use).
The divisor is the divisor of a modulo operation. It must be equal to or smaller than the hash size configured in the SFQ this is classifying for. The SFQ hash size is defined at compile time, by default 1,024 elements, so set the divisor to 1024.
The flow keys can be src (source IP), dst (destination IP), proto (IP protocol), proto-src (transport protocol source port), proto-dst (transport protocol destination port), iif (input interface), priority (?), mark (firewall mark), nfct (netfilter conntrack?), nfct-src (original netfilter source), nfct-dst (original netfilter destination), nfct-proto-src (original netfilter conntrack transport protocol source), nfct-proto-dst (and so on), rt-classid (?), sk-uid (uid from the skbuff), sk-gid (gid from the skbuff), vlan-tag. At WAND we have extended this to include mac-src, mac-dst, mac-proto.
This also supports or/and/xor/rshift/append NUM. I don't know why this is here; possibly to allow you to attach multiple classifiers to the same sfq and then limit them to different parts of the hash table?
fw
This match module matches only on the fwmark. It uses the "handle" to select which firewall mark to match. Internally this uses a hash table, so multiple fwmarks at the same prio appear to be able to "stack".
Example:
$TC filter add dev $DEV \
parent 1: prio 2 \
protocol ip \
handle $FWMARK \
fw \
flowid 1:10
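Something has to set the mark before this filter can match it; with iptables that might look like the following sketch (the mangle table OUTPUT chain is an assumption; pick the chain that matches where you classify):
iptables -t mangle -A OUTPUT -p tcp --dport 80 -j MARK --set-mark $FWMARK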
route
This match module allows matching on "realms". Realms are tags that can be applied to routes. It supports matching on from realm, fromif tag, and to realm. I've not experimented with this match, but several other people have. This seems to be the easiest way to match routes learned from quagga (e.g. national vs international), as sketched below.
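A minimal sketch following the LARTC realm examples (the addresses and realm number are illustrative): tag a route with a realm, then classify on it:
ip route add 198.51.100.0/24 via 10.0.0.1 realm 2
tc filter add dev eth0 parent 1: protocol ip prio 100 route to 2 flowid 1:10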
So what are network namespaces? Generally speaking, an installation of Linux shares a single set of network interfaces and routing table entries. You can modify the routing table entries using policy routing (here’s an introduction I wrote and here’s a write-up on a potential use case for policy routing), but that doesn’t fundamentally change the fact that the set of network interfaces and routing tables/entries are shared across the entire OS. Network namespaces change that fundamental assumption. With network namespaces, you can have different and separate instances of network interfaces and routing tables that operate independently of each other.
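A quick way to see the separation (ip netns is part of iproute2; the namespace name is arbitrary):
ip netns add blue
ip netns exec blue ip link list   # shows only the namespace's own loopback
ip netns exec blue ip route list  # a separate, initially empty routing table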
Connecting Network Namespaces to the Physical Network
This part of it threw me for a while. I can’t really explain why, but it did. Once I’d figured it out, it was obvious. To connect a network namespace to the physical network, just use a bridge. In my case, I used an Open vSwitch (OVS) bridge, but a standard Linux bridge would work as well (see the sketch below). Place one or more physical interfaces as well as one of the veth interfaces in the bridge, and—bam!—there you go. Naturally, if you had different namespaces, you’d probably want/need to connect them to different physical networks or different VLANs on the physical network.
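A minimal sketch with a standard Linux bridge (interface names and addresses are illustrative):
ip netns add blue
ip link add veth0 type veth peer name veth1
ip link set veth1 netns blue
ip link add br0 type bridge
ip link set eth0 master br0
ip link set veth0 master br0
ip link set br0 up
ip link set veth0 up
ip netns exec blue ip link set veth1 up
ip netns exec blue ip addr add 192.0.2.10/24 dev veth1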
The Intermediate Functional Block device is the successor to the IMQ iptables module that was never integrated.
Advantages over IMQ: cleaner, in particular under SMP, with a lot less code. The old dummy-device functionality is preserved, while the new behaviour only kicks in if you use actions.
IFB Usage
As far as I know, the reasons listed below are why people use IMQ. It would be nice to know of anything else that I missed.
- qdiscs/policies that are per device as opposed to system wide. IMQ allows for sharing.
- Allows for queueing incoming traffic for shaping instead of dropping. I am not aware of any study that shows policing is worse than shaping in achieving the end goal of rate control; I would be interested if anyone is experimenting. (Re shaping vs policing: the desire for shaping comes more from the need to have complex rules, as with htb.)
- A very interesting use: if you are serving p2p you may want to give preference to your own locally originated traffic (when responses come back) vs someone using your system to do bittorrent. So QoS based on connection state comes in as the solution. What people did to achieve this was stick the IMQ somewhere around the prerouting/local hook. I think this is a pretty neat feature to have in Linux in general (i.e. not just for IMQ).
But I won't go back to putting netfilter hooks in the device to satisfy this. I also don't think it's worth hacking ifb some more to be aware of, say, L3 info and play ip rule tricks to achieve this.
Instead the plan is to have a conntrack-related action. This action will selectively either query or create conntrack state on incoming packets. Packets could then be redirected to ifb based on what happens (e.g. on incoming packets): if we find they are of known state we could send them to a different queue than one which didn't have existing state. All of this, however, depends on whatever rules the admin enters.
At the moment this function does not exist yet. I have decided that instead of sitting on the patch I will release it, and then if there's pressure I will add this feature.
What you can do with ifb currently, using actions:
Let's say you are policing packets from alias 192.168.200.200/32; you don't want those to exceed 100kbps going out.
tc filter add dev eth0 parent 1: protocol ip prio 10 u32 \
match ip src 192.168.200.200/32 flowid 1:2 \
action police rate 100kbit burst 90k drop
If you run tcpdump on eth0 you will see all packets going out with src 192.168.200.200/32, dropped or not.
Extend the rule a little to see only the ones that made it out:
tc filter add dev eth0 parent 1: protocol ip prio 10 u32 \
match ip src 192.168.200.200/32 flowid 1:2 \
action police rate 100kbit burst 90k drop \
action mirred egress mirror dev ifb0
Now fire up tcpdump on ifb0 to see only those packets:
tcpdump -n -i ifb0 -x -e -t
Essentially, a good debugging/logging interface.
If you replace mirror with redirect, those packets will be blackholed and will never make it out. This redirect behaviour changes with the new patch (but not the mirror).
Typical Usage
Below is what you can do with the patch to provide the functionality that most people use IMQ for:
export TC="/sbin/tc"
$TC qdisc add dev ifb0 root handle 1: prio
$TC qdisc add dev ifb0 parent 1:1 handle 10: sfq
$TC qdisc add dev ifb0 parent 1:2 handle 20: tbf rate 20kbit buffer 1600 limit 3000
$TC qdisc add dev ifb0 parent 1:3 handle 30: sfq
$TC filter add dev ifb0 protocol ip pref 1 parent 1: handle 1 fw classid 1:1
$TC filter add dev ifb0 protocol ip pref 2 parent 1: handle 2 fw classid 1:2
ifconfig ifb0 up
$TC qdisc add dev eth0 ingress
# redirect all IP packets arriving in eth0 to ifb0
# use mark 1 --> puts them onto class 1:1
$TC filter add dev eth0 parent ffff: protocol ip prio 10 u32 \
match u32 0 0 flowid 1:1 \
action ipt -j MARK --set-mark 1 \
action mirred egress redirect dev ifb0
Run a little test
From another machine, ping so that you have packets going into the box:
[root@jzny action-tests]# ping 10.22
PING 10.22 (10.0.0.22): 56 data bytes
64 bytes from 10.0.0.22: icmp_seq=0 ttl=64 time=2.8 ms
64 bytes from 10.0.0.22: icmp_seq=1 ttl=64 time=0.6 ms
64 bytes from 10.0.0.22: icmp_seq=2 ttl=64 time=0.6 ms
--- 10.22 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.6/1.3/2.8 ms
[root@jzny action-tests]#
Now look at some stats:
[root@jmandrake]:~# $TC -s filter show parent ffff: dev eth0
filter protocol ip pref 10 u32
filter protocol ip pref 10 u32 fh 800: ht divisor 1
filter protocol ip pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1
match 00000000/00000000 at 0
action order 1: tablename: mangle hook: NF_IP_PRE_ROUTING
target MARK set 0x1
index 1 ref 1 bind 1 installed 4195sec used 27sec
Sent 252 bytes 3 pkts (dropped 0, overlimits 0)
action order 2: mirred (Egress Redirect to device ifb0) stolen
index 1 ref 1 bind 1 installed 165 sec used 27 sec
Sent 252 bytes 3 pkts (dropped 0, overlimits 0)
[root@jmandrake]:~# $TC -s qdisc
qdisc sfq 30: dev ifb0 limit 128p quantum 1514b
Sent 0 bytes 0 pkts (dropped 0, overlimits 0)
qdisc tbf 20: dev ifb0 rate 20Kbit burst 1575b lat 2147.5s
Sent 210 bytes 3 pkts (dropped 0, overlimits 0)
qdisc sfq 10: dev ifb0 limit 128p quantum 1514b
Sent 294 bytes 3 pkts (dropped 0, overlimits 0)
qdisc prio 1: dev ifb0 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 504 bytes 6 pkts (dropped 0, overlimits 0)
qdisc ingress ffff: dev eth0 ----------------
Sent 308 bytes 5 pkts (dropped 0, overlimits 0)
[root@jmandrake]:~# ifconfig ifb0
ifb0 Link encap:Ethernet HWaddr 00:00:00:00:00:00
inet6 addr: fe80::200:ff:fe00:0/64 Scope:Link
UP BROADCAST RUNNING NOARP MTU:1500 Metric:1
RX packets:6 errors:0 dropped:3 overruns:0 frame:0
TX packets:3 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:32
RX bytes:504 (504.0 b) TX bytes:252 (252.0 b)
The dummy functionality continues to behave like it always did: if you send ifb0 any packet that did not originate from the actions, it will drop it. In this case the three dropped packets were IPv6 ndisc.
IFB Example
Many readers have found this page unhelpful in explaining how IFB is useful and how it should be used.
These examples are taken from a posting by Jamal at http://www.mail-archive.com/[email protected]/msg04900.html
What this script will demonstrate is the following sequence:
- any packet going out on eth0 to 10.0.0.229 is classified as class 1:10 and redirected to ifb0.
- on reaching ifb0, the packet is classified as class 1:2
- it is subjected to token bucket shaping at a rate of 20kbit/s
- it is sent back to eth0
- on coming back to eth0, the classification 1:10 is still valid and the packet is put through an HTB class which limits the rate to 256kbit/s
export TC="/root/tc"
$TC qdisc del dev ifb0 root handle 1: prio
$TC qdisc add dev ifb0 root handle 1: prio
$TC qdisc add dev ifb0 parent 1:1 handle 10: sfq
$TC qdisc add dev ifb0 parent 1:2 handle 20: tbf \
rate 20kbit buffer 1600 limit 3000
$TC qdisc add dev ifb0 parent 1:3 handle 30: sfq
$TC filter add dev ifb0 parent 1: protocol ip prio 1 u32 \
match ip dst 11.0.0.0/24 flowid 1:1
$TC filter add dev ifb0 parent 1: protocol ip prio 2 u32 \
match ip dst 10.0.0.0/24 flowid 1:2
ifconfig ifb0 up
$TC qdisc del dev eth0 root handle 1: htb default 2
$TC qdisc add dev eth0 root handle 1: htb default 2
$TC class add dev eth0 parent 1: classid 1:1 htb rate 800Kbit
$TC class add dev eth0 parent 1: classid 1:2 htb rate 800Kbit
$TC class add dev eth0 parent 1:1 classid 1:10 htb rate 256kbit ceil 384kbit
$TC class add dev eth0 parent 1:1 classid 1:20 htb rate 512kbit ceil 648kbit
$TC filter add dev eth0 parent 1: protocol ip prio 1 u32 \
match ip dst 10.0.0.229/32 flowid 1:10 \
action mirred egress redirect dev ifb0
A little test (be careful if you are sshed in and are classifying on that IP; the counters may not be easy to follow).
A ping
mambo:~# ping -c2 10.0.0.229
First, look at ifb0; observe the second filter being hit successfully twice:
mambo:~# $TC -s filter show dev ifb0 parent 1:
filter protocol ip pref 1 u32
filter protocol ip pref 1 u32 fh 800: ht divisor 1
filter protocol ip pref 1 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid
1:1 (rule hit 2 success 0)
match 0b000000/ffffff00 at 16 (success 0 )
filter protocol ip pref 2 u32
filter protocol ip pref 2 u32 fh 801: ht divisor 1
filter protocol ip pref 2 u32 fh 801::800 order 2048 key ht 801 bkt 0 flowid
1:2 (rule hit 2 success 2)
match 0a000000/ffffff00 at 16 (success 2 )
Next the qdisc numbers; observe that 1:2 has 2 packets:
mambo:~# $TC -s qdisc show dev ifb0
qdisc prio 1: bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 196 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 10: parent 1:1 limit 128p quantum 1514b
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
qdisc tbf 20: parent 1:2 rate 20000bit burst 1599b lat 546.9ms
Sent 196 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 30: parent 1:3 limit 128p quantum 1514b
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
Next look at eth0; observe class 1:10, which is where the pings went through after they came back from the ifb0 device:
mambo:~# $TC -s class show dev eth0
class htb 1:1 root rate 800000bit ceil 800000bit burst 1699b cburst 1699b
Sent 196 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
lended: 0 borrowed: 0 giants: 0
tokens: 16425 ctokens: 16425
class htb 1:10 parent 1:1 prio 0 rate 256000bit ceil 384000bit burst 1631b
cburst 1647b
Sent 196 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
lended: 2 borrowed: 0 giants: 0
tokens: 49152 ctokens: 33110
class htb 1:2 root prio 0 rate 800000bit ceil 800000bit burst 1699b cburst 1699b
Sent 47714 bytes 321 pkt (dropped 0, overlimits 0 requeues 0)
rate 3920bit 3pps backlog 0b 0p requeues 0
lended: 321 borrowed: 0 giants: 0
tokens: 16262 ctokens: 16262
class htb 1:20 parent 1:1 prio 0 rate 512000bit ceil 648000bit burst 1663b
cburst 1680b
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
lended: 0 borrowed: 0 giants: 0
tokens: 26624 ctokens: 21251
And now...
mambo:~# $TC -s filter show dev eth0 parent 1:
filter protocol ip pref 1 u32
filter protocol ip pref 1 u32 fh 800: ht divisor 1
filter protocol ip pref 1 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid
1:10 (rule hit 235 success 4)
match 0a0000e5/ffffffff at 16 (success 4 )
action order 1: mirred (Egress Redirect to device ifb0) stolen
index 2 ref 1 bind 1 installed 114 sec used 100 sec
Action statistics:
Sent 196 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
IFB requirements
In order to use ifb you need:
- Support for ifb in the kernel (2.6.20 works OK)
- Menu option: Device drivers -> Network device support -> Intermediate Functional Block support
- Module name: ifb
- tc from iproute2 with support for "actions" (2.6.20-20070313 works OK; the package from Debian etch is outdated). You can download it from: http://developer.osdl.org/dev/iproute2/download/
IFB Example
export netif="ifb1" prio="2"
ip link set dev $netif up
# ifb7 (mirred flowid 1:7) default class in ifb1 ($netif)
ip link set dev ifb7 up
interface eth0
$TC qdisc del dev eth0 ingress 2>/dev/null
$TC qdisc add dev eth0 ingress
$TC filter add dev eth0 parent ffff: protocol ip prio 10 u32 \
match u32 0 0 flowid 1:1 \
action mirred egress redirect dev $netif
interface ifb1
$TC qdisc del root dev $netif 2>/dev/null
####
$TC qdisc add dev $netif root handle 1:0 hfsc default 7
#### main class
$TC class add dev $netif parent 1:0 classid 1:1 hfsc rt m2 10240kbit
## Class
# Admin
$TC class add dev $netif parent 1:1 classid 1:2 hfsc rt m2 2048kbit
$TC qdisc add dev $netif parent 1:2 handle 2 sfq perturb 10
# all user
$TC class add dev $netif parent 1:1 classid 1:4 hfsc rt m2 9144kbit
# user
$TC class add dev $netif parent 1:4 classid 1:5 hfsc rt m2 9144kbit
# default --> bin
$TC class add dev $netif parent 1:4 classid 1:7 hfsc rt m2 256kbit
$TC qdisc add dev $netif parent 1:7 handle 7 sfq perturb 10
filters
# Admin ip
$TC filter add dev $netif protocol ip parent 1:0 prio $prio u32 ht 800:: \
match ip src 172.1.0.0/16 flowid 1:2
# users ip
$TC filter add dev $netif protocol ip parent 1:0 prio $prio u32 ht 800:: \
match ip src 10.1.1.0/24 flowid 1:5
$TC filter add dev $netif protocol ip parent 1:0 prio $prio u32 ht 800:: \
match ip src 10.2.1.0/24 flowid 1:5
# default
$TC filter add dev $netif protocol ip parent 1:0 prio $prio u32 ht 800:: \
match ip src 0.0.0.0/0 at 12 flowid 1:7 \
action mirred egress mirror dev ifb7
# ok,
# show traffic in ifb1 and ifb7 (default class)
tcpdump -i ifb1 -n
tcpdump -i ifb7 -n
# show
tc -s filter show parent ffff: dev eth0
tc -s filter show dev ifb1 | grep "flowid 1:7"
tc -s filter show dev ifb1 | grep mirred -A3 -B3
# del qdisc
tc qdisc del dev eth0 handle ffff: ingress
Limiting a user's network bandwidth on Linux
Approach: cgroup + TC. The former tags the user's packets; the latter applies network I/O limits to packets carrying the given tag.
network packets -> cgroup/iptables (label) -> tc (class) -> tc (queue)
TC
TC is the Linux kernel's tool for controlling network traffic; for the details see the man pages or chapter 9 of the detailed Chinese handbook. Here we only use it in a simple way.
The basic idea of TC is that the kernel first hands packets to be sent to a TC queue, where TC queues them; the kernel then takes them back out of the TC queue and sends them out through the NIC driver. TC has three core concepts:
- qdisc (queueing discipline): the queueing rules; qdiscs can form a tree
- class: attached to a qdisc
- filter: maps packets carrying a given tag to a class
qdiscs and classes are named "major number : minor number"; a qdisc takes the major number, a class the minor number.
A simple experiment: on interface eth1, rate-limit the packets tagged by cgroup. First clear any existing qdiscs on eth1, then create a new qdisc with major number 10.
tc qdisc del dev eth1 root
tc qdisc add dev eth1 root handle 10: htb
Add a class under the qdisc and set its bandwidth to 400Mbit, then create a filter that classifies by cgroup tag:
tc class add dev eth1 parent 10: classid 10:1 htb rate 400mbit
tc filter add dev eth1 parent 10: protocol ip prio 10 handle 1: cgroup
To change the bandwidth limit later:
tc class change dev eth1 parent 10: classid 10:1 htb rate 200mbit
cgroup
First, install the cgroup tools on CentOS 6:
# yum install libcgroup
cgroups can be configured dynamically or through configuration files. Here I write the configuration files directly. cgroup has two configuration files:
- cgconfig.conf: defines groups and mounts.
- cgrules.conf: maps users or groups to cgroups.
The cgroup subsystem that manages network resources is called net_cls. We first add a net_cls group. The classid we created with TC in the previous section was 10:1; in cgroup this id is written as 0xAAAABBBB, where AAAA is the major number and BBBB is the minor number. So we first run:
cgcreate -g net_cls:test_bw
echo 0x100001 > /cgroup/net_cls/test_bw/net_cls.classid
Then write the generated configuration over the configuration file:
cgsnapshot -s > /etc/cgconfig.conf
Mine looks like this:
mount {
cpuset = /cgroup/cpuset;
cpu = /cgroup/cpu;
cpuacct = /cgroup/cpuacct;
memory = /cgroup/memory;
devices = /cgroup/devices;
freezer = /cgroup/freezer;
net_cls = /cgroup/net_cls;
blkio = /cgroup/blkio;
}
group test_bw {
net_cls {
net_cls.classid="1048577";
}
}
Then add the user (or group) whose bandwidth should be limited to this cgroup, by adding one line to /etc/cgrules.conf:
jack net_cls test_bw/
This puts jack into the group. Restart the services:
service cgconfig restart
service cgred restart
It should now take effect.
Limiting a daemon's bandwidth
For example, to limit an NFS server's bandwidth, just add the nfsd daemon to the rules:
*:nfsd net_cls test_bw/
then restart cgred.
References
- How to priotize packets using tc and cgroups
- Redhat: Introduction to Control Groups
- Limiting a single process's bandwidth
- Server traffic control with TC on Linux
- Using cgroups to control disk IO bandwidth
- Redhat: Prioritizing Network Traffic
2.10 The TCINDEX classifier
The tcindex classifier was specifically designed to implement the Differentiated Services architecture on Linux. It is explained in Differentiated Services on Linux [10] and the Linux Advanced Routing & Traffic Control HOWTO [7], but in both documents, in my modest opinion, the explanation is highly technical and a little confusing, leaving the reader with even more questions and doubts when the reading is finished.
The tcindex classifier bases its behaviour on the skb->tc_index field located in the packet's sk_buff structure, which is created for every packet entering or being created by the Linux box. Because of this, the tcindex classifier must be used only with queuing disciplines that can recognize and set the skb->tc_index field: the GRED, DSMARK and INGRESS queuing disciplines.
I think it is easier to approach the study of the tcindex classifier by analyzing which qdisc/class/filter writes and which reads the skb->tc_index field. Let's start by copying figure 2.9.3 from the previous section, renumbered as 2.10.1 here, to be used as reference:
[Figure 2.10.1]
Next we have to go to the C code (sorry, but it's better) to poke around. We will use this procedure:
|
- We state one assertion.
- We present the C code that supports it.
- We relate it to the figure above.
- We present the tc command required to get that behavior.
Assertion: the skb->tc_index value is set by the dsmark queuing discipline when the set_tc_index parameter is set. The skb->iph->tos field, which contains the packet's DS field value, is copied onto the skb->tc_index field.
In the figure, this process is represented by the big red vertical line going from top (the skb->iph->tos field) to bottom (the skb->tc_index field) at the dsmark entrance. As an example, the next tc command is used to get this behavior:
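(The command itself was lost in conversion; this is the standard form used throughout the Differentiated Services on Linux examples:)
tc qdisc add dev eth0 handle 1:0 root dsmark indices 64 set_tc_index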
Assertion: the skb->tc_index field value is read by the tcindex classifier; the filter then applies a bitwise operation to a copy of the value (the original value is not modified); the final value obtained from this operation is passed down to the filter elements to look for a match. On a match, the class identifier corresponding to that filter element is returned and passed to the queuing discipline as the resulting class identifier.
Okay, the classifier lookup is done by first applying the following bitwise operation to the skb->tc_index field:
( skb->tc_index & p.mask ) >> p.shift
mask and shift are integer values we have to define in the main filter. Let's see a set of commands to understand this complicated part better:
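(The original listing was lost in conversion. The following is a reconstruction consistent with the walkthrough that follows; the handles 10/12 and classid 1:112 are taken from the text, while handle 14 and classids 1:110/1:114 are my guesses for the remaining elements.)
tc qdisc add dev eth0 handle 1:0 root dsmark indices 64 set_tc_index
tc filter add dev eth0 parent 1:0 protocol ip prio 1 tcindex mask 0xfc shift 2 pass_on
tc filter add dev eth0 parent 1:0 protocol ip prio 1 handle 10 tcindex classid 1:110
tc filter add dev eth0 parent 1:0 protocol ip prio 1 handle 12 tcindex classid 1:112
tc filter add dev eth0 parent 1:0 protocol ip prio 1 handle 14 tcindex classid 1:114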
The first command sets up the DSMARK queuing discipline. Because set_tc_index is set, the packet's DS field value is copied onto the skb->tc_index field when the packet enters the qdisc. The second command is the main filter; for this example it has three elements (the next three commands). In the main filter we define mask=0xfc and shift=2. This filter reads the skb->tc_index value (containing the packet's DS field value), applies the bitwise operation, and passes the resulting value down to the elements to look for a match. pass_on means: if a match is not found in this element, continue with the next one.
Let's suppose a packet with its DS field marked as 0x30 (corresponding to the AF12 class) enters the qdisc. The value 0x30 is copied by dsmark from the packet's DS field onto the skb->tc_index field. Next the classifier reads this field and applies the bitwise operation to the value. What happens?
( skb->tc_index & p.mask ) >> p.shift = ( 0x30 & 0xfc ) >> 2
= ( 00110000 & 11111100 ) >> 2 = 00110000 >> 2 = 00001100 = 0xc
The final value after the bitwise operation is 0xc, which corresponds to decimal value 12. This value is passed down to the filter elements. The first element doesn't match, because it matches decimal value 10 (handle 10 tcindex).
The next element matches, because it matches decimal value 12 (handle 12 tcindex); the class identifier returned to the queuing discipline is therefore 1:112 (classid 1:112). In the figure, this process is represented by the big blue vertical line going from the bottom (the skb->tc_index field) to the green filter elements (to get a class identifier), and then from the green filter elements to the yellow class identifier returned from the classifier to the queuing discipline. Now let's see what the dsmark queuing discipline is going to do with the returned class identifier.
Assertion: the minor part of the class identifier returned to the DSMARK queuing discipline by the tcindex classifier is copied back by the queuing discipline onto the skb->tc_index field.
Well, fellows, the class identifier finally travels back (it likes to travel, doesn't it?) to the skb->tc_index field. But be careful: the value copied back is the class identifier's minor value, i.e. 112 from classid 1:112. It is very important to interpret this ubiquitous value correctly. 112 doesn't mean decimal 112: each of these digits is a nibble (4 bits), so 112 is really 000100010010, or with the nibbles separated, 0001-0001-0010. This is the new value contained in the skb->tc_index field.
In the figure, this process is represented by the big green vertical line going from the yellow classid rectangle down to the skb->tc_index field.
Assertion: on the DSMARK queuing discipline, the skb->tc_index value is used as an index into the internal table of mask-value pairs to select the pair to be used. The selected pair is then used, at dequeue time, to modify the packet's DS field value using a combined and-or bitwise operation.
We saw this already in the previous section.
The commands above are taken from the afcbq example of the Differentiated Services on Linux distribution (we will see every example in the DS on Linux distribution in detail later on). For now, to explain this part, we will use a different set of commands:
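(This listing was also lost; here is a plausible reconstruction consistent with the walkthrough below: dsmark with set_tc_index, three classes sharing mask=0x1f and value=0x40, and a tcindex main filter with mask 0xfc and shift 2 plus three elements.)
tc qdisc add dev eth0 handle 1:0 root dsmark indices 64 set_tc_index
tc class change dev eth0 classid 1:1 dsmark mask 0x1f value 0x40
tc class change dev eth0 classid 1:2 dsmark mask 0x1f value 0x40
tc class change dev eth0 classid 1:3 dsmark mask 0x1f value 0x40
tc filter add dev eth0 parent 1:0 protocol ip prio 1 tcindex mask 0xfc shift 2 pass_on
tc filter add dev eth0 parent 1:0 protocol ip prio 1 handle 10 tcindex classid 1:1
tc filter add dev eth0 parent 1:0 protocol ip prio 1 handle 12 tcindex classid 1:2
tc filter add dev eth0 parent 1:0 protocol ip prio 1 handle 14 tcindex classid 1:3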
This example is not as intelligent as it should be, but for what we are trying to explain it is good enough. The first command sets up the dsmark queuing discipline 1:0. The next three commands define the classes of the discipline. Now it is the main filter's turn. As we saw above, this filter reads the skb->tc_index field containing the packet's DS field value and, after doing a bitwise operation on a copy of it, passes the result down to the filter elements.
These commands are in fact changing the AF class of packets marked AF1x to AF2x. This is what is called re-marking in differentiated services terminology. The class is changed while preserving the rest of the bits (drop precedence and ECN bits). When an AF1x packet enters, its DS field is copied onto the skb->tc_index field by the dsmark queuing discipline, because the set_tc_index parameter is set. Next the main filter is invoked. Let's suppose an AF12 packet is entering: after the copy, the new skb->tc_index value will be 0x30.
The main filter takes a copy of this value (0x30) and applies its bitwise operation with mask=0xfc and shift=2; then we have:
(0x30 & 0xfc) >> 2 = (00110000 & 11111100) >> 2 = 00110000 >> 2 = 00001100 = 0xc = 12
Great!! The final value is decimal 12. This value is passed down to the filter elements. The second element matches, and the class id value 1:2 is returned to the dsmark queuing discipline. As we saw above, dsmark immediately strips the class id's major value and copies the minor value back onto the skb->tc_index field. The new value of skb->tc_index is now decimal 2.
Now it is the dsmark queuing discipline's turn again, when the packet leaves the discipline. The discipline reads the skb->tc_index field in the packet's buffer; the value is decimal 2. With this value it indexes its own internal table, which was built for us by the three commands following the queuing discipline creation. At index 2, the table contains mask=0x1f and value=0x40. The example is a bit silly because all classes have the same mask-value pair, but it is good enough to explain how this stuff does its work.
Finally, the dsmark queuing discipline performs the following operation on the AF12-marked packet:
(0x30 & 0x1f) | 0x40 = (00110000 & 00011111) | (01000000) = (00010000 | 01000000) = 01010000 = 0x50
Okay, 0x50 is the value that corresponds to class AF22. The packet enters with class-drop precedence AF12 and departs with class-drop precedence AF22.
The really important thing to understand here is that dsmark reads the skb->tc_index value to select a class, i.e. an index into the internal table of mask-value pairs, to get the pair used later on to update the DS field of the dequeuing packet. This entire process is represented by the big purple lines and arrows and the internal dsmark table representation to the right of figure 2.10.1 above.
Assertion: the 4 rightmost bits of the skb->tc_index field are used by the GRED queuing discipline to select a RED virtual queue (VQ) for the packet entering the discipline. If the value (the 4 rightmost bits) is out of range of the number of virtual queues, the skb->tc_index field is set (it shouldn't be) to the number of the default virtual queue by the GRED queuing discipline.
In this case we don't have to explicitly put the packet into a virtual queue using a filter. It is good enough to set the skb->tc_index field of the packet's buffer to the number of the virtual queue we want to select. To set the skb->tc_index field we can use a dsmark qdisc and its attached filter, or the ingress queuing discipline, as will be explained later on. Let's see an example of this configuration:
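(Another lost listing; this reconstruction matches the description below. The gred parameter values are illustrative, not from the original.)
tc qdisc add dev eth0 handle 1:0 root dsmark indices 64 set_tc_index
tc filter add dev eth0 parent 1:0 protocol ip prio 1 tcindex mask 0xfc shift 2 pass_on
tc filter add dev eth0 parent 1:0 protocol ip prio 1 handle 10 tcindex classid 1:1
tc filter add dev eth0 parent 1:0 protocol ip prio 1 handle 12 tcindex classid 1:2
tc filter add dev eth0 parent 1:0 protocol ip prio 1 handle 14 tcindex classid 1:3
tc qdisc add dev eth0 parent 1:0 handle 2:0 gred setup DPs 3 default 3 grio
tc qdisc change dev eth0 handle 2:0 gred limit 60KB min 15KB max 45KB \
    burst 20 avpkt 1000 bandwidth 10Mbit DP 1 probability 0.02 prio 2
# repeat the "tc qdisc change ... gred ..." line with DP 2 and DP 3 for the other two VQs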
These commands show a GRED configuration using DSMARK to select the virtual queue. The first command creates the dsmark queuing discipline; the packet's DS field will be copied onto the skb->tc_index field on entrance. The next command sets the main filter. Packets with DS field values corresponding to classes AF11, AF12 and AF13 will generate the values 10, 12 and 14 respectively, after the (DS field & 0xfc) >> 2 bitwise operation is applied.
These values are passed down to the filter elements, which are set up by the next three commands. Class ids 1:1, 1:2 and 1:3 are returned for classes AF11, AF12 and AF13 respectively. When the dsmark queuing discipline receives the returned class ids, it sets skb->tc_index to their minor values. This way, skb->tc_index is set to 1, 2 or 3 for packets of class AF11, AF12 or AF13 respectively. It's great!! We have already set the skb->tc_index field for the gred queuing discipline.
The next command sets up the main gred queuing discipline, with the dsmark queuing discipline as its parent. The last three commands set up gred virtual queues number 1, 2 and 3 respectively. We don't have to worry about how to put packets into the gred virtual queues: GRED does that work itself, by reading the skb->tc_index value and placing the packets into the corresponding virtual queues.
Our last assertion: when using the INGRESS queuing discipline, the skb->tc_index field is set to the minor part of the class identifier returned by the attached filter.
The ingress queuing discipline is not a queue at all. It just invokes the attached classifier and, when the class identifier is returned, extracts the minor part from it and copies the result onto the skb->tc_index field.
The ingress qdisc's classifier could be a u32 classifier or a fw classifier. The tcindex classifier cannot be used, because it requires that the skb->tc_index field is already set, and because the setting is done by the ingress queuing discipline itself, the initial skb->tc_index value will be zero. Excluding the tcindex classifier, I suppose we can use any kind of classifier attached to the ingress queuing discipline. Whether u32 or fw is the classifier used, in both cases you can also police the entering flows at the same time by implementing a policer in the classifier. Because this is especially important for the Differentiated Services architecture, we are going to explain a little more about policing in the next section. For now, we are going to show two examples, using the fw classifier and the u32 classifier.
In this example we use the fw classifier. Traffic enters through the eth1 interface and leaves the router through the eth0 interface. The ingress queuing discipline is configured on interface eth1. On this interface we previously set iptables to mark any entering flow with fw mark=2, and then flows from network 10.2.0.0/24 with fw mark=1.
Using two filter elements, we set the skb->tc_index field to 1 (flowid :1) for packets with fw set to 1 (handle 1 fw), and to 2 (flowid :2) for packets with fw set to 2 (handle 2 fw).
Finally we configure a dsmark queuing discipline on the outgoing interface eth0. Packets leaving the router with their skb->tc_index field set to 1 (classid 1:1) are marked in their DS field by applying the bitwise operation ((DS & mask) | value). So packets from network 10.2.0.0/24 (identified by skb->tc_index=1) are marked as 0x88 (which corresponds to DS class AF41), and the rest of the traffic (identified by skb->tc_index=2) is marked as 0x90 (which corresponds to DS class AF42).
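(The commands for this example were lost as well; this is a reconstruction from the description above. The iptables chains and the dsmark mask 0x3, which preserves the ECN bits, are my assumptions.)
iptables -t mangle -A PREROUTING -i eth1 -j MARK --set-mark 2
iptables -t mangle -A PREROUTING -i eth1 -s 10.2.0.0/24 -j MARK --set-mark 1
tc qdisc add dev eth1 handle ffff: ingress
tc filter add dev eth1 parent ffff: protocol ip prio 1 handle 1 fw classid :1
tc filter add dev eth1 parent ffff: protocol ip prio 2 handle 2 fw classid :2
tc qdisc add dev eth0 handle 1:0 root dsmark indices 4
tc class change dev eth0 classid 1:1 dsmark mask 0x3 value 0x88
tc class change dev eth0 classid 1:2 dsmark mask 0x3 value 0x90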
The u32 classifier is used in a similar way, but we don't need iptables in this case. For example:
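(Reconstructed from the description below; again the dsmark mask 0x3, preserving the ECN bits, is my assumption.)
tc qdisc add dev eth1 handle ffff: ingress
tc filter add dev eth1 parent ffff: protocol ip prio 1 u32 \
    match ip tos 0x28 0xfc flowid :1
tc filter add dev eth1 parent ffff: protocol ip prio 1 u32 \
    match ip tos 0x30 0xfc flowid :2
tc qdisc add dev eth0 handle 1:0 root dsmark indices 4
tc class change dev eth0 classid 1:1 dsmark mask 0x3 value 0xb8
tc class change dev eth0 classid 1:2 dsmark mask 0x3 value 0x28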
As you can see, this configuration is even simpler than the one using the fw classifier. We configure the ingress queuing discipline; then, using two filter elements attached to it, we set the skb->tc_index field to 1 (flowid :1) for packets with the DS field set to 0x28 (match ip tos 0x28 0xfc), preserving the ECN bits, and to 2 (flowid :2) for packets with the DS field set to 0x30 (match ip tos 0x30 0xfc), again preserving the ECN bits. These packets happen to be the differentiated services classes AF11 and AF12, respectively.
Our setup is a kind of "packet promotion" configuration. The dsmark queuing discipline marks packets leaving the router with their skb->tc_index field set to 1 (classid 1:1), i.e. AF11 class packets, as 0xb8 (which corresponds to DS class EF), and packets leaving the router with their skb->tc_index field set to 2 (classid 1:2), i.e. AF12 class packets, as 0x28 (which corresponds to DS class AF11). So DS AF11 class packets are promoted to DS EF, and DS AF12 class packets are promoted to DS AF11.
Well, fellows, with this explanation we finish the TCINDEX classifier. The next section will be dedicated to exploring a little of the filter's policing capability.
The u32 filter
Overview
The u32 filter allows you to match on any bit field within a packet, so it is in some ways the most powerful filter provided by the Linux traffic control engine. It is also the most complex, and by far the hardest to use. To explain it I will start with a bit of a tutorial.
Matching
The base operation of the u32 filter is actually very simple.
It extracts a bit field from a 32 bit word in the packet, and if it is equal to a value supplied by you, it has a match. The 32 bit word must lie on a 32 bit boundary. The syntax in tc is:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32 \
classid 1:1 \
match u32 0xc0a80800 0xffffff00 at 12
The first line uses the same syntax shared by all filters, so I will ignore it for now. The second line just says that if the filter matches assign the packet to class 1:1. The third line is the interesting one; this is what it means:
match u32 This keyword introduces a match condition. The u32
is the type of match. It must be followed by a value
and mask. A u32 match extracts a 32 bit word out of
the header, masks it and compares the result to the
supplied value. This is in fact the only type of
match the kernel can do. Tc "compiles" all other
types of matches into this one.
0xc0a80800 This is the value to compare the masked 32 bit word
to. If it is equal to the masked word the match is
successful.
0xffffff00 This is the mask. The word extracted from the
packet is bit-wise and'ed with this mask before
comparison.
at 12 This keyword tells the kernel where the 32 bit word
lives in the packet. It is an offset, in bytes,
from the start of the packet. So in this case
we are loading the 32 bit word that is 12 bytes from
the start of the packet. The offset is optional.
If not supplied it defaults to 0 which is generally
not what you want.
Now if you look at rfc791 you will see that the source address is stored at offset 12 in an IP packet. So the match condition could be read as: "match if the packet was sent from the network 192.168.8.0/24". To use the u32 filter you do have to be familiar with the fields in the IP and TCP, UDP and ICMP headers. But you don't have to remember the offsets of the individual fields - tc has some syntactic sugar for that. This command does the same thing as the one above. The syntax is different, but the filter submitted to the kernel is identical:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32 \
classid 1:1 \
match ip src 192.168.8.0/24
A u32 filter item can logically "and" several matches together, succeeding only if all matches succeed. This example will succeed only if the packet was sent from network 192.168.8.0/24, and has a TOS of 10 hex:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32 \
classid 1:1 \
match ip src 192.168.8.0/24 \
match ip tos 0x10 1e
You can have as many match conditions on the one line as you want. All must be successful for the filter item to score a match.
If you enter several tc filter commands the filters are tried in turn until one matches. For example:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32 \
classid 1:1 \
match ip src 192.168.8.0/24 \
match ip tos 0x10 1e
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32 \
classid 1:2 \
match ip src 192.168.4.0/24 \
match ip tos 0x08 1e
The first filter item checks if the packet is from network 192.168.8.0/24 and has a TOS of 10 hex. If so the packet is assigned to class 1:1. If not the second filter item is tried. It checks if the packet is from network 192.168.4.0/24 and has a TOS of 08 hex, and if so it will assign the packet to class 1:2. If not the next filter item would be tried. But there is none, so the u32 filter fails to classify the packet.
Now it is time to discuss u32 handles. A u32 handle is actually 3 numbers, written like this: 800:0:3. They are all in hex. For now we are only interested in the last one. This last number identifies the filter items we have been adding. Because we did not specify a number for the filter item, the kernel allocated one for us. In fact it allocated the handles 800:0:800 and 800:0:801. The handle it generates is one bigger than the largest handle used so far, with a minimum value of 800 hex. Valid filter item handles range from 1 to ffe hex. Like all filter handles, the complete handle (as in 800:0:801) must be unique. We can force a particular handle to be used for a filter item by using the "handle" option of "tc filter", like this:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32 \
classid 1:1 \
match ip src 192.168.8.0/24 \
match ip tos 0x10 1e
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 handle ::1 u32 \
classid 1:2 \
match ip src 192.168.4.0/24 \
match ip tos 0x08 1e
These tc commands are almost identical to the previous example. In fact the tc command creating the first item is identical, so it will be allocated the same handle as before, 800:0:800. The second command only differs from the previous example in that it specifies that item handle 1 is to be used. (The rest of the numbers in the handle are not specified, so the defaults are used.) The full handle created for the second filter item will be 800:0:1. The kernel evaluates filter items in handle order, with lower handle numbers being checked first. So the effect of doing this is to reverse the order in which the two filter items are evaluated by the kernel, compared to the previous example.
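To see which handles were actually allocated, you can list the filters (the output format varies between iproute2 versions):
# tc filter show dev eth0 parent 999:0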
Linking
Before proceeding we need a new concept. In effect, filter items that share the same prefix in their handle (800:0 in the above examples) form a numbered list. The number of each entry is the filter item number, i.e. the last number in the handle. In the last example above we had a two item list with these handles:
list 800:0:
1 [src=192.168.4.0/24, tos=0x08] -> return classid 1:2
800 [src=192.168.8.0/24, tos=0x10] -> return classid 1:1
I will call this a u32 filter list, or just a filter list for short. The prefix (800:0 in this case) can be used as a handle to identify the list. In the section above I described how the kernel "executes" such a list. To recap, it does this by running through the list in filter item number order, checking each filter item in turn to see if it matches. If a filter item matches it can classify the packet, in which case the u32 filter stops and returns the classified packet. But when a u32 filter item matches a packet there is one other thing it can do besides classifying the packet. It can "link" to another u32 filter list. For example:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32 \
link 1:0: \
match ip src 192.168.8.0/24
If this filter item matches it will "link" to filter list 1:0:, meaning the kernel will now execute filter list 1:0:. If a filter item in that list matches and classifies the packet then the u32 stops and returns the classified packet. If that does not happen, i.e. if no filter item in the list classifies the packet, then the kernel resumes executing the original list. Execution continues at the next filter item in the original list, i.e. the one after the filter item that did the "link". A linked list can in turn link to other lists. You can nest up to 7 link commands.
If you specify a "link" command for a filter item any attempt to classify a packet in the same filter item will be ignored. Another way of saying this is the "classid" option and its aliases won't work if you put "link" on the command line.
This linking is not in itself very useful. It is usually faster to use one big list, and it is always easier to do it that way. But there are two commands you can combine with the "link" command, and in fact neither can be used without it.
Hashing
The filter lists we have been discussing are actually part of much larger structures called hash tables. A hash table is just an array of things that I will call buckets. A bucket contains one thing: a filter list. This will all become clear shortly, I hope.
We can now look at the meanings of the other two numbers in a u32 filter handle. One handle in the examples above was 800:0:1. Well, the 800 identifies the hash table, and the 0 is the bucket within that hash table. So 10:20:30 means: filter item 30, which is located in bucket 20, which is located in hash table 10.
Hash table 800 is special. It is called the root. When you create a u32 filter the root hash table gets created for you automatically. It always has exactly one bucket, numbered 0. This means the root hash table also has exactly one filter list associated with it. When the u32 filter starts execution it always executes this filter list. In other words, a u32 filter does its thing by executing filter list 800:0. If filter list 800:0 does not classify the packet (implying that none of the lists it linked to classified it either) then the u32 filter returns the packet unclassified.
Not surprisingly, you can't delete the root hash table. Actually you can't delete any other hash table either (as of 2.4.9), but that is because of a bug in the kernel u32 filter's reference counting code. The only way to get rid of a hash table in 2.4.9 or earlier is to delete the entire u32 filter.
Hash tables other than the root must be created before you can add filter items that link to them. Use this tc command to create a hash table:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 handle 1: u32 \
divisor 256
This creates a hash table with 256 buckets. The buckets are numbered 0 through to 255. So we have effectively created 256 filter lists with handles 1:0, 1:1, ... 1:255. A hash table can have 1, 2, 4, 8, 16, 32, 64, 128 or 256 buckets. Other values are possible but can be very inefficient. The kernel has a bug that will allow you to have 257 buckets, but doing that may cause an oops.
If you omit the "handle" option the kernel will allocate you a new handle. Currently (2.4.9) the kernel has a bug - the handle allocation routine will loop forever rather than return failure in the very unlikely circumstance that all hash table handles are in use.
The way the tc "link" option is written it might appear that you can link to any bucket. You can't. The link option only allows you to specify bucket 0 (implying that "link 1:1" is illegal). To select a bucket other than 0 you must use the "hashkey" option:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32 \
link 1: hashkey mask ffffff00 at 12 \
match ip src 192.168.8.0/24
The hashkey option causes the kernel to calculate the bucket number of the filter list to link to from data in the packet. You get to specify what data. This operation is usually called hashing. In this case the hash is a particularly fast but primitive one. In the example above this is what happens, in detail, if the match succeeds:
- The kernel reads the 32 bit word at offset 12 of the packet being sent.
- This word is masked with ffffff00
- The word is right shifted, and then masked with 0xff. The amount of shift is calculated from the mask - it is the number of bits the mask has to be shifted right so the first 1 bit lands in the least significant bit. This is the "hashing function". It changed between 2.4 and 2.6. In 2.4 the 4 bytes in the word are xor'ed together. From what I have seen, the 2.4 version did a better job on real data.
- The result of the hash, which is a number in the range 0..255, is then masked with (number of buckets - 1).
- The result is a bucket number, which is then combined with the hash table in the link option to form a filter list handle.
- That filter list is then executed.
If you look at rfc791 you will see the hash in the example is selecting the sender's network address. Tc offers no syntactic sugar to help you this time, i.e. there is no "hashkey ip src 0.0.0.0/24" or similar. You have to do it the hard way and look up the RFCs.
Why would you hash on the source network rather than testing for it in a match option? It's only useful if you want to classify packets based on a lot of different source networks. If there are only one or two source networks you are better off using match, as doing a couple of matches is faster than doing a hash. But the amount of time required to test all the matches grows as the number of source networks grows. Hashing on the other hand takes a fixed amount of time regardless of whether there is 1 or 100 source networks, so if there are thousands of source networks hashing is going to be literally hundreds of times faster than testing them one by one using matching.
I mention this because there is an example from Alexey's "README.iproute2+tc" that selected the TCP protocol (among others) using hashing. As an example of how to use hashing it is good, but it has been cut and pasted by every man + dog, altered to only select the TCP protocol, and then quoted as the way to do it. Wrong. A simple "link" without hashing would be better in that case.
We have dealt with one side of hashing - how the filter list to be executed (hash table, bucket) is selected. There is a second side to it - adding items to the selected filter list. The problem is really quite simple - which list is it? You know the hash table number; it is the bucket number that is the problem. You could use the description of the hashing algorithm above and manually calculate the bucket number. That is a bad option for two reasons. Firstly, it is hard work in the general case. Secondly, it is fragile, because the hashing algorithm in the kernel can and has changed. Tc can calculate the hash for you, and it is better and easier to let it do so. Letting tc do this does not affect the time it takes the kernel to execute the filter. Here is how you do it:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32 \
classid 1:2 \
ht 1: sample ip src 192.168.8.0/24 \
match ip src 192.168.8.0/24 \
match ip tos 0x08 1e
Caveat: as of 07/Feb/2006, the hashing algorithm in tc is still at the 2.4 version(!). Ergo, for 2.6 tc ends up with the wrong answer, so this example above won't work until this bug is fixed.
The line in question is the third one - the rest we have seen before. The "ht 1:" says the filter item is to be inserted into hash table 1:. The "sample ..." says what value we want to calculate the bucket number for, i.e. we want that value to be hashed. Tc will apply the same hashing algorithm used by the kernel to calculate the bucket number. The "..." can be anything that could legally follow a "match" option, so all the syntactic sugar for calculating IP offsets is available to you.
There are, unfortunately, three bugs in the current version of "tc" (i.e. tc up until cvs 2006-02-09) which render "sample" useless. Firstly, "sample" assumes the target hash table has 256 buckets. If it doesn't, you are out of luck - you must use the "ht" option instead. Secondly, the "sample" option always uses the 2.4 kernel hashing function, i.e. it doesn't work on 2.6 kernels. Finally, the "sample" parsing code in "tc" has a bug (a missing memset()) which causes tc to get segmentation violations. This last bug renders it completely useless.
Now for some random points. First of all, why did I not use the "handle" option of tc to specify the hash table, as is done everywhere else? Answer: because you must give the "ht" option. You can also give the "handle" option, but if you do, the hash table number in it must be blank (as in ::1), or be equal to the hash table given in the "ht" option. Is there a good reason why tc and the kernel work like this? No, not that I can see.
Secondly, why is the fourth line required in the command? First of all, perhaps it isn't obvious why it might not be required. It may not be needed because the "sample" option has already selected the filter list for this source network. If no other source networks hash to this same bucket there is indeed no reason for the match option. But if several source networks hash to the same bucket it is required - the filter won't work without it. If you are hashing for a good reason, i.e. to speed up the process of selecting among many possibilities, and you are being conservative, i.e. you assume you don't know the internals of the hashing algorithm, then you can never be sure that each bucket will only have one filter item. So this match line should always be present.
Thirdly, there are many examples on the net that hash on the IP protocol, then select protocol 6 directly using "ht 1:6:" rather than using the "sample" option. Should I copy that? Answer: no. This example should sound familiar. It is the same cut & paste (aka hack, because they always try to improve the example) from Alexey's "README.iproute2+tc" file I referred to earlier. In that example Alexey assumed he knew how the hashing algorithm worked. It probably sounded like a reasonable assumption to him - he designed and coded the algorithm. But it is not a good assumption for the rest of us. He did this because under the current hashing algorithm the value is trivial to calculate under some circumstances. If you are selecting one byte from the packet on a byte boundary, and use a hash table 256 elements long, then the byte always hashes to itself. The IP protocol byte meets those conditions.
Fourthly, should I allocate my own handles to filter items in a hash bucket? Answer: avoid it if possible. You can manually allocate filter item numbers using the handle option, as in "handle ::1". If you do so, be sure to allocate a unique filter item number to each filter item in the hash table (as opposed to unique to just the bucket the filter item lives in). You have to do this if you assume (as you should) that you don't know what bucket the filter item is going to hash to. But, as I said earlier, avoid it if possible. You would not be hashing if there weren't a lot of filter items to choose from. And if there are a lot of them, doing your own filter item numbering will be painful.
Header Offsets
The IP header (and other headers) is variable length. This creates a problem if you are trying to use "match" to look at a value in a header that follows - you don't know where it is. It is not an impossible problem, because every header in an IP packet contains a length field. The "header offsets" feature of u32 allows you to extract that length from the packet and add it to the offset specified in the "match" option.
Here is how it works. Recall that the match option looks like this:
match u32 VALUE MASK at OFFSET
I said earlier that OFFSET tells the kernel which word in the packet to compare to VALUE. That statement was a simplification. Two other values can be added to OFFSET to determine which word to use. Both those values start off as 0, but they can be modified when a "link" option calls another filter list. Any modification made only applies while the called filter list is being executed, as the old values are restored if the called filter list fails to classify the packet. Here are the two values and the names I call them:
permoff This value is unconditionally added to every OFFSET
that is done in the destination link, ie the one
that is called. This includes calculations of new
permoff's and tempoff's. Permoff's are cumulative,
in that if the destination link calls another link
and calculates a new permoff, the result is added to
this one.
tempoff A "match" option in the destintaion link can optionally
add this value its OFFSET. Tempoff's are temporary, in
that it does not apply to any links the destination link
calls. It also does not effect the calculation of
OFFSET's for new permoff's and tempoff's.
Time for an example. Consider this command:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32 \
link 1: offset at 0 mask 0f00 shift 6 plus 0 eat \
match ip protocol 6 ff
The match expression selects tcp packets (which is IP protocol 6). If we have protocol 6 we execute filter 1:0. Now for the rest of it:
offset This signals that we want to modify permoff or tempoff
if the link is executed. If this is not present,
neither permoff nor tempoff is affected - in other
words the target of the link inherits the current
permoff and tempoff.
at 0 This says the 16 bit word that contains the value we
are going to use to calculate permoff or tempoff lives
at offset 0 in the IP packet - ie at the start of the
packet. This offset must be even. If not specified 0
is used.
mask 0f00 This mask (which is in hex) is bit-wise anded with the
16 bit word extracted from the packet header. It
isolates the header length from the rest of the
information in the word. If not specified 0 is used
for the extracted value.
shift 6 This says the word extracted is to be shifted right
6 bits (ie divided by 64) after being masked. If not
present the value is not shifted.
plus 0 After extracting the word, masking it and shifting it,
this value is added to the result. If not present it
is assumed to be 0.
eat If this is present we are calculating permoff, and the
result of the calculation above is added to it. Tempoff
is set to 0 in this case. If this is not present we are
calculating tempoff, and the result of the calculation
becomes tempoff's new value. Permoff is not altered in
this case.
If you don't understand this then accept at face value that it does calculate the position of the second header in an IP packet, and copy & paste it into your scripts. I am not going to try and explain it further. You should of course dig out rfc791 and verify it for yourself. That way you will be able to apply it to headers beyond the second one.
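As a quick sanity check, consider a typical IP header with no options: the first 16 bit word is 0x4500 (version 4, header length 5 words, TOS 0). Masking gives 0x4500 & 0x0f00 = 0x0500 = 1280 decimal, and shifting right 6 bits gives 1280 >> 6 = 20 - which is indeed the length in bytes of a 5 word IP header, and hence the offset of the second header.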
Having calculated your offset you can now add entries to the destination filter list that depend on it. Here is an example entry:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32 \
classid 1:4 \
ht 1:0 \
match u32 0x140000 ffff0000 at nexthdr+0
We have seen almost all of this before. "ht 1:0" inserts this filter item into hash table 1, bucket 0. "classid 1:4" classifies the packet if the filter matches. The "match" selects source port 14 hex (which is 20 decimal - ftp data). The "at nexthdr+0" is the only new bit, or at least the "nexthdr+" is new. The "0" sort of means the same thing as it always did - that the 32 bit word that contains the TCP port is at offset 0. But it is offset 0 from the TCP header, because either permoff or tempoff has been set to point to that header. As for "nexthdr+", recall that adding "tempoff" was optional. If you add "nexthdr+" it gets added. If you don't it doesn't.
Tc does supply syntactic sugar for this as well. I could have written it this way, and generated an identical filter item:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32 \
classid 1:4 \
ht 1:0 \
match tcp src 0x14 ffff
Recall that I said modifications made to permoff and tempoff only apply while the called filter list is being executed, as the old values are restored if the called filter list fails to classify the packet. This was a lie. Permoff is restored, but tempoff isn't. This can make for subtle surprises in the way a u32 filter executes, because you tend to assume that during the execution of a filter list permoff and tempoff never change. But if you link to another list tempoff may change. I recommend always using permoff's (ie, always specify "eat", and never use "nexthdr+") to avoid this.
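As an untested sketch of that recommendation, the tcp example above can be redone in the eat-only style. Because "eat" makes the offset permanent, the linked filter item can use a plain "at 0" to refer to the start of the TCP header, and "nexthdr+" is never needed:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32 \
    link 1: offset at 0 mask 0f00 shift 6 eat \
    match ip protocol 6 ff
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32 \
    classid 1:4 \
    ht 1:0 \
    match u32 0x140000 ffff0000 at 0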
Reference
Handles
The u32 filter uses 3 numbers for its handle. These numbers are written: H:B:I, eg 1:1:2. All are in hex. The first number, H, identifies a hash table. The second number, B, identifies a bucket within the hash table, and the third number, I, identifies the filter item within the bucket. The combination must be unique.
Hash table numbers must lie between 001 and fff hex. The traffic control engine will generate a hash table number for you if you don't supply one. Generated numbers are 800 or above. The hash table number in the handle is not used when creating or changing a hash table item. Instead the hash table specified by the "ht" option is used, and the hash table in the handle must be unspecified, 0, or equal to the hash table in the "ht" option.
A bucket number can range from 0 to 1 less than the number of buckets in the parent hash table. If no bucket is specified (as in 1::2), then 0 is assumed.
Filter item numbers must lie between 001 and fff hex. The traffic control engine will generate a filter item number for you if you don't supply one. The generated number is the larger of 800 and one bigger than the current largest item number in the bucket.
Execution
Each "tc filter add ... u32" item adds either a hash table, or adds a filter item to a bucket within a hash table. When a u32 filter is created the root hash table, whose handle is 800::, is automatically created. It has one bucket. The u32 starts by checking each filter item in bucket 0 of the root hash table. Filter items within a bucket are always checked in filter item number order. As soon as a filter item classifies the packet the u32 filter stops execution. Filter items may use the "link" option to execute a filter item list held by a bucket in another hash table.
Options
classid CLASSID | flowid CLASSID
If all the match options succeed then this will classify
the packet and the u32 filter will stop execution. Ignored
if the "link" option is given.
divisor DIVISOR
If supplied this parameter must appear on its own, without
any other arguments. It creates a new hash table. DIVISOR
specifies the number of buckets in the hash table. It can
range from 1 to 256, and should be a power of 2. The hash
table number is taken from the handle supplied. If no handle
is supplied a new hash table number is generated.
hashkey mask MASK at OFFSET
If the link specified by the "link" option is taken then this
option specifies the bucket within the hash table to use. This
is how the bucket number is calculated:
1. The 32 bit word at offset OFFSET is read from the packet.
2. The 32 bit word is masked with MASK. MASK is in hex.
3. The 4 bytes in the result are xor'ed together.
4. The result is bit-wise anded with (the number of buckets
in the hash table linked to - 1).
A sketch of this calculation in C appears after this list of
options.
ht HANDLE
This option specifies the hash table and bucket of the filter
item being added or changed. The filter item number in HANDLE
must be unspecified or 0 - it can only be specified by the
"handle" option. The bucket specified may be overridden by the
"sample" option.
link HANDLE
If all match options succeed in this filter item the "link"
option causes the filter items in another hash table's bucket to
be checked. If none of the filter items in the linked to bucket
classify the packet then the u32 filter continues checking filter
items in the current bucket. The "link"ed to bucket may link to
yet another bucket, to a maximum level of 7 such calls (in 2.4.9
.. 2.6.15). The HANDLE specifies the hash table to link to. The
bucket and filter item numbers in that handle must both be
unspecified or blank. Bucket 0 will be used unless overridden
by the "hashkey" option. The "offset" option can be used to
alter packet offsets in the linked to bucket.
match SELECTOR
This option checks if a field in the packet has a particular
value. A filter item may contain more than one "match" option.
All match options must be satisfied before the filter item
considers it has a match. What the filter item does when it
has a match is specified by the "link", "classid"/"flowid", and
"police" options. If none are specified the filter item does
nothing when it matches. Selectors are described below.
offset mask MASK at OFFSET shift SHIFT plus PLUS eat
If the link specified by the "link" option is taken then the
position of the values extracted from the packet by the hash
table linked to will be offset by this specification. This is
how the offset is evaluated & implemented:
1. The 16 bit word at offset OFFSET is read from the packet.
If OFFSET is not present the 16 bit word is read from
offset 0.
2. The 16 bit word is masked with MASK. If no MASK is
specified 0 is assumed.
3. The masked 16 bit word is divided by 2**SHIFT.
4. The resulting value has PLUS added to it. If not
specified PLUS defaults to 0.
5. If none of MASK, OFFSET, SHIFT, nor PLUS are specified
then the current temporary offset is used.
6. If "eat" is specified the offset is permanent, and is added
to the current permanent offset. The permanent offset is
unconditionally added to the OFFSET values in "match",
"offset" and "hashkey" options in the hash table linked to,
and any nested links. If "eat" is not specified the offset
is temporary. Temporary offsets are only added to the "at
nexthdr+OFFSET" values in "match" options, and do not affect
any other values.
7. If the filter list linked to does not classify the packet,
and hence execution resumes at the next filter item, then
the permanent offset calculated here is discarded. The
temporary offset, however, remains in effect.
8. When the u32 filter starts executing, both the permanent
and temporary offsets are initialised to 0.
A sketch of these rules in C appears after this list of
options.
police
If all the match options succeed then this will police the
packet and the u32 filter will stop execution. Ignored if the
"link" option is given.
sample SELECTOR
This option computes the bucket for the filter item being added
or changed from the SELECTOR passed. The packet offset and
mask parts of the selector are ignored if given. When
calculating the hash bucket, the divisor of the target hash
table is assumed to be 256. There is no way of altering
this. If the divisor isn't 256, use the "ht" option instead.
Selectors are described below.
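To make the "hashkey" and "offset" rules above concrete, here are two sketches in C, in the same spirit as the kernel fragment shown under Selectors below. They reflect my reading of the descriptions above, not the kernel's actual code, and the variable names are mine:
/* "hashkey mask MASK at OFFSET" - computing the bucket number */
u32 word = *(u32*)((char*)packet + offset + permoff);
word &= mask;                                   /* step 2: isolate bytes */
word = word ^ (word >> 8) ^ (word >> 16) ^ (word >> 24); /* step 3: fold */
bucket = word & (divisor - 1);    /* step 4: divisor = buckets in target */

/* "offset mask MASK at OFFSET shift SHIFT plus PLUS [eat]" */
u16 hword = *(u16*)((char*)packet + at + permoff);   /* step 1 */
int off = ((hword & mask) >> shift) + plus;          /* steps 2-4 */
if (eat) {
    permoff += off;   /* step 6: permanent - applies to everything in the
                         linked table and any nested links */
    tempoff = 0;
} else {
    tempoff = off;    /* temporary - only added to "at nexthdr+" matches */
}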
Selectors
Selectors are used by the match option to extract information from the packet and compare it to a value. All selectors compile to the one format which is accepted by the kernel. This format reads a 32 bit value from the supplied offset within the packet. The offset must be on a 32 bit boundary. The value read is bit wise anded with the supplied mask. The match succeeds if the result is equal to the supplied value. In C:
if ((*(u32*)((char*)packet+offset+permoff) & mask) == value)
match();
The "permoff" variable in this statement is calculated by the "offset" option that executed this filter list.
Here are some conventions which won't be repeated below for brevity:
at OFFSET | at nexthdr+OFFSET
Except where noted this can be appended to all selectors to
override the default position of the field in the packet. The
OFFSET is the offset within the packet where the field can
be found. If a 16 bit value is being compared the OFFSET
should be on a 16 bit boundary, and if a 32 bit value is being
compared it should be on a 32 bit boundary. The OFFSET is
given in decimal; prefix with 0x to enter it in hex. If
"nexthdr+" is present any temporary offset calculated by the
"offset" option is added to OFFSET. The current permanent
offset calculated by the "offset" option is unconditionally
added to OFFSET. It is unlikely you will want to specify
the "at" option with anything other than u32, u16 and u8
selectors.
IP6ADDR/PREFIX
This specifies a set of up to 4 32 bit masks and values that
will match a 128 bit IPv6 address. The combined values equal
the IPv6 address supplied, which may be in any IPv6 address
format. The combined masks are derived from the PREFIX
portion - it is a 128 bit word with the upper PREFIX bits
set to 1's, the rest are 0's. If /PREFIX is not given the
mask is all 1's. The IP address must be numeric.
IPADDR/PREFIX
This specifies a mask and value. The value is equal to the
IPv4 address supplied. The mask is derived from the PREFIX
portion - it is a 32 bit word with the upper PREFIX bits set
to 1's, the rest are 0's. If /PREFIX is not given the mask is
all 1's. The IP address must be numeric. For example,
192.168.10.0/24 would yield a value of c0a80a00 hex and a mask
of ffffff00 hex. A worked command using this appears after
these conventions.
MASK
This specifies a mask value the field will be bit wise anded
with before being compared to VALUE. It is given in hex.
VALUE
This specifies the value the field extracted from the packet
must equal, after being anded with the MASK. It is decimal,
unless prefixed with 0x, in which case it is hex. Ie 0x10 and
16 both mean the same thing.
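For example (an untested sketch): the source address of an IPv4 packet lives in the 32 bit word at offset 12 of the IP header, so "match ip src 192.168.10.0/24" should produce the same filter item as spelling out the value, mask and offset by hand:
# tc filter add dev eth0 parent 999:0 protocol ip prio 99 u32 \
    match u32 0xc0a80a00 0xffffff00 at 12 \
    flowid 999:3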
Here are the selectors that can follow a "match" or "sample" option:
icmp code VALUE MASK
Match the 8 bit code field in the icmp packet. This must
be in a hash table that is "link"ed to by a filter item which
contains an "offset" option that skips the IP header.
icmp type VALUE MASK
Match the 8 bit type field in the icmp packet. This must be
in a hash table that is "link"ed to by a filter item which
contains an "offset" option that skips the IP header.
ip df
Matches if the IPv4 packet has the "don't fragment" bit set.
May not be followed by an "at" option.
ip dport VALUE MASK
Matches the 16 bit destination port in a TCP or UDP IPv4 packet.
This only works if the ip header contains no options. Use the
"link" and "match tcp dst" or "match udp dst" options if you can
not be sure of that. An example appears at the end of this list.
ip dst IPADDR/PREFIX
Matches the destination IP address of an IPv4 packet.
ip firstfrag
Matches if this IPv4 packet is not fragmented, or is the
first fragment.
ip icmp_code VALUE MASK
Matches the 8 bit code field in an ICMP IPv4 packet. This only
works if the ip header contains no options. Use the "link"
and "match icmp code" options if you can not be sure of that.
ip icmp_type VALUE MASK
Matches the 8 bit type field in an ICMP IPv4 packet. This only
works if the ip header contains no options. Use the "link"
and "match icmp type" options if you can not be sure of that.
ip ihl VALUE MASK
Matches the 8 bit ip version + header length byte in the IPv4
header.
ip mf
Matches if there are more fragments of the same IPv4 packet
to follow this one. May not be followed by an "at" option.
ip nofrag
Matches if this is not a fragmented IPv4 packet. May not be
followed by an "at" option.
ip protocol VALUE MASK
Matches the 8 bit protocol byte in the IPv4 header. You
cannot use symbolic protocol names (eg "tcp" or "udp").
ip sport VALUE MASK
Matches the 16 bit source port in a TCP or UDP IPv4 packet.
This only works if the ip header contains no options. Use the
"link" and "match tcp src" or "match udp src" options if you
can not be sure of that.
ip src IPADDR/PREFIX
Matches the source IP address of an IPv4 packet.
ip tos VALUE MASK | ip precedence VALUE MASK
Matches the 8 bit TOS byte in the IPv4 header.
ip6 dport VALUE MASK
Matches the 16 bit destination port in a TCP or UDP IPv6 packet.
This only works if the ip header contains no options. Use the
"link" and "match tcp dst" or "match udp dst" options if you can
not be sure of that.
ip6 dst IP6ADDR/PREFIX
Matches the destination IP address of an IPv6 packet.
ip6 icmp_code VALUE MASK
Matches the 8 bit code field in an ICMP IPv6 packet. This only
works if the ip header contains no options. Use the "link" and
"match icmp code" options if you can not be sure of that.
ip6 icmp_type VALUE MASK
Matches the 8 bit type field in an ICMP IPv6 packet. This only
works if the ip header contains no options. Use the "link" and
"match icmp type" options if you can not be sure of that.
ip6 flowlabel VALUE MASK
Matches the 32 bit flowlabel in the IPv6 header.
ip6 priority VALUE MASK
Matches the 8 bit priority byte in the IPv6 header.
ip6 protocol VALUE MASK
Matches the 8 bit protocol byte in the IPv6 header. You
cannot use symbolic protocol names (eg "tcp" or "udp").
ip6 sport VALUE MASK
Matches the 16 bit source port in a TCP or UDP IPv6 packet.
This only works if the ip header contains no options. Use the
"link" and "match tcp src" or "match udp src" options if you
can not be sure of that.
ip6 src IP6ADDR/PREFIX
Matches the source IP address in an IPv6 packet.
tcp dst VALUE MASK
Match the 16 bit destination port in the tcp packet. This must
be in a hash table that is "link"ed to by a filter item which
contains an "offset" option that skips the IP header.
tcp src VALUE MASK
Match the 16 bit source port in the tcp packet. This must be
in a hash table that is "link"ed to by a filter item which
contains an "offset" option that skips the IP header.
u16 VALUE MASK
Match a 16 bit value in the packet. The offset defaults to 0,
which is usually not what you want, so append the "at" option
to give the correct value.
u32 VALUE MASK
Match a 32 bit value in the packet. The offset defaults to 0,
which is usually not what you want, so append the "at" option
to give the correct value.
u8 VALUE MASK
Match an 8 bit value in the packet. The offset defaults to 0,
which is usually not what you want, so append the "at" option
to give the correct value. An example appears at the end of
this list.
udp dst VALUE MASK
Match the 16 bit destination port in the udp packet. This must
be in a hash table that is "link"ed to by a filter item which
contains an "offset" option that skips the IP header.
udp src VALUE MASK
Match the 16 bit source port in the udp packet. This must be
in a hash table that is "link"ed to by a filter item which
contains an "offset" option that skips the IP header.