Line rate of a 10GbE NIC with minimum-size (64B) frames:
64B + 7B(Preamble) + 1B(SFD) + 12B(IFG) = 84B
10*10^9/84/8 = 14880952 pps
Line rate of a 10GbE NIC with maximum-size (1518B) frames:
1518B + 7B(Preamble) + 1B(SFD) + 12B(IFG) = 1538B
10*10^9/1538/8 = 812743 pps
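The two calculations above generalize to any frame size and link speed. A minimal Python sketch of the same arithmetic (the name `line_rate_pps` is illustrative and does not come from any library):

```python
# Per-frame overhead on the wire, in bytes: preamble + SFD + inter-frame gap.
PREAMBLE, SFD, IFG = 7, 1, 12


def line_rate_pps(frame_bytes: int, link_bps: int = 10 * 10**9) -> int:
    """Maximum packets per second for a given Ethernet frame size."""
    wire_bits = (frame_bytes + PREAMBLE + SFD + IFG) * 8
    # Round down: a partially transmitted frame does not count as a packet.
    return link_bps // wire_bits
```

Calling `line_rate_pps(64)` and `line_rate_pps(1518)` reproduces the two figures above; passing a different `link_bps` gives the equivalent numbers for other link speeds.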
# View NIC information
[root@localhost ~]# ethtool enp7s0f0
Settings for enp7s0f0:
...
Speed: 10000Mb/s # link speed
Duplex: Full
Port: FIBRE
PHYAD: 0
Transceiver: internal
Auto-negotiation: off
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000007 (7)
drv probe link
Link detected: yes
# View driver information
[root@localhost ~]# ethtool -i enp7s0f0
driver: ixgbe # driver name
version: 5.1.0-k-rh7.5 # driver version
firmware-version: 0x8000084b
expansion-rom-version:
bus-info: 0000:07:00.0 # PCI bus address
...
# View offload settings
[root@localhost ~]# ethtool -k enp7s0f0|grep offload
tcp-segmentation-offload: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
...
# View PCI information
[root@localhost ~]# lspci -vvvs 07:00.0
07:00.0 Ethernet controller: Intel Corporation Ethernet Connection X553 10 GbE SFP+ (rev 11)
...
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 16
Region 0: Memory at df800000 (64-bit, prefetchable) [size=2M]
Region 4: Memory at dfa04000 (64-bit, prefetchable) [size=16K]
Expansion ROM at dfc80000 [disabled] [size=512K]
...
# Supports up to 64 MSI-X interrupt vectors
Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=4 offset=00000000
PBA: BAR=4 offset=00002000
Capabilities: [a0] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
MaxPayload 128 bytes, MaxReadReq 256 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
# PCIe bandwidth; see https://en.wikipedia.org/wiki/PCI_Express
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
...
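The `LnkSta` line reports the negotiated transfer rate and lane count, from which the raw link bandwidth can be estimated. The sketch below accounts only for line-encoding overhead (8b/10b at 2.5 and 5 GT/s, 128b/130b at 8 GT/s and above) and ignores TLP/DLLP protocol overhead, so real throughput is lower. The function name and table are illustrative:

```python
# Line-encoding efficiency per PCIe transfer rate (GT/s per lane).
# 2.5 and 5 GT/s use 8b/10b encoding; 8 GT/s and above use 128b/130b.
ENCODING = {2.5: 8 / 10, 5.0: 8 / 10, 8.0: 128 / 130, 16.0: 128 / 130}


def pcie_gbps(gt_per_s: float, width: int) -> float:
    """Raw one-direction link bandwidth in Gbit/s, before protocol overhead."""
    return gt_per_s * ENCODING[gt_per_s] * width
```

For the values shown in the output above, `pcie_gbps(2.5, 1)` gives about 2 Gbit/s.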
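View all PCI devices: lspci -vvv
domain:bus:slot.func (16-bit domain number, 8-bit bus number, 5-bit device number, 3-bit function number)
View all NICs: lspci -vvv|grep Ethernet
View the NUMA node a NIC belongs to: cat /sys/class/net/enp7s0f0/device/numa_node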
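To make the domain:bus:slot.func layout concrete, here is a small Python sketch that splits an address such as `0000:07:00.0` into its fields. The helper `parse_bdf` is hypothetical, and treating a missing domain as 0 is an assumption that matches the short `07:00.0` form shown by lspci above:

```python
import re

# domain (4 hex digits, optional) : bus (2 hex) : slot (2 hex) . func (0-7)
BDF_RE = re.compile(
    r"^(?:([0-9a-fA-F]{4}):)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


def parse_bdf(addr: str) -> dict:
    """Split a PCI address into domain, bus, slot, and function numbers."""
    m = BDF_RE.match(addr)
    if m is None:
        raise ValueError(f"not a PCI address: {addr!r}")
    domain, bus, slot, func = m.groups()
    slot_n = int(slot, 16)
    if slot_n > 0x1F:  # the slot (device) number is only 5 bits wide
        raise ValueError(f"slot out of range: {slot}")
    return {
        "domain": int(domain or "0", 16),  # assumption: omitted domain is 0
        "bus": int(bus, 16),
        "slot": slot_n,
        "func": int(func),
    }
```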
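If the number of logical CPUs is at most 16, RSS alone is enough to spread the interrupt load (RSS supports at most 16 queues); otherwise RSS + RPS is needed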
# View interrupts
[root@localhost ~]# cat /proc/interrupts|egrep 'CPU|enp7s0f0'
CPU0 CPU1 CPU2 CPU3
78: 880 0 0 0 PCI-MSI-edge enp7s0f0-TxRx-0
79: 862 0 0 0 PCI-MSI-edge enp7s0f0-TxRx-1
80: 868 0 0 0 PCI-MSI-edge enp7s0f0-TxRx-2
81: 860 0 0 0 PCI-MSI-edge enp7s0f0-TxRx-3
82: 2 0 0 0 PCI-MSI-edge enp7s0f0
# View the IRQ CPU affinity (hexadecimal CPU bitmask)
[root@localhost ~]# cat /proc/irq/78/smp_affinity
1
# Modify the IRQ CPU affinity (hexadecimal CPU bitmask)
[root@localhost ~]# echo 1 > /proc/irq/78/smp_affinity
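The value written to `smp_affinity` is a hexadecimal bitmask in which bit N selects CPU N. A minimal Python sketch for building that string from a list of CPU numbers follows; `cpus_to_mask` is an illustrative helper, and the comma-separated 32-bit grouping for machines with more than 32 CPUs reflects the format the kernel prints:

```python
def cpus_to_mask(cpus) -> str:
    """Build a hex CPU bitmask string: bit N set means CPU N is selected."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    # Split into 32-bit groups, least significant first.
    groups = []
    while True:
        groups.append(mask & 0xFFFFFFFF)
        mask >>= 32
        if mask == 0:
            break
    groups.reverse()
    # The leading group is unpadded; the remaining groups are 8 hex digits.
    head = f"{groups[0]:x}"
    tail = [f"{g:08x}" for g in groups[1:]]
    return ",".join([head] + tail)
```

For example, pinning IRQ 78 to CPU 0 uses `cpus_to_mask([0])`, which is the `1` written above, while CPU 1 would need `2` and CPUs 0 to 3 together would need `f`.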
# View the hash indirection table
[root@localhost ~]# ethtool -x enp7s0f0
RX flow hash indirection table for enp7s0f0 with 4 RX ring(s):
0: 0 1 2 3 0 1 2 3
8: 0 1 2 3 0 1 2 3
16: 0 1 2 3 0 1 2 3
24: 0 1 2 3 0 1 2 3
32: 0 1 2 3 0 1 2 3
40: 0 1 2 3 0 1 2 3
48: 0 1 2 3 0 1 2 3
56: 0 1 2 3 0 1 2 3
64: 0 1 2 3 0 1 2 3
72: 0 1 2 3 0 1 2 3
80: 0 1 2 3 0 1 2 3
88: 0 1 2 3 0 1 2 3
96: 0 1 2 3 0 1 2 3
104: 0 1 2 3 0 1 2 3
112: 0 1 2 3 0 1 2 3
120: 0 1 2 3 0 1 2 3
...
# Modify the hash indirection table (equal N spreads flows evenly across the first N RX rings)
[root@localhost ~]# ethtool -X enp7s0f0 equal 16
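The table printed by `ethtool -x` above is consistent with a simple round-robin fill: entry i points at ring i modulo the number of rings. The sketch below reproduces that layout in Python, assuming `equal N` fills the table the same way; the function name is illustrative:

```python
from typing import List


def equal_indirection_table(table_size: int, n_queues: int) -> List[int]:
    """Spread table entries evenly: entry i maps to RX ring i % n_queues."""
    if n_queues < 1:
        raise ValueError("need at least one queue")
    return [i % n_queues for i in range(table_size)]


def lookup_ring(table: List[int], flow_hash: int) -> int:
    """Pick the RX ring for a flow: the hash indexes into the table."""
    return table[flow_hash % len(table)]
```

With `equal_indirection_table(128, 4)` each of the four rings owns 32 of the 128 entries, matching the rows of `0 1 2 3` in the output above.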
# View the hash input fields
[root@localhost ~]# ethtool -n enp7s0f0 rx-flow-hash udp4
UDP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA
# Modify the hash input fields (s: src IP, d: dst IP, f: src port, n: dst port)
[root@localhost ~]# ethtool -N enp7s0f0 rx-flow-hash udp4 sdfn
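Why the field selection matters: with only IP SA and IP DA as hash input, every UDP flow between the same two hosts feeds identical bytes to the hash and therefore lands on the same queue. The Python sketch below builds only the hash input, not the hardware hash itself; `hash_input` is a hypothetical helper and the field order is an assumption made for illustration:

```python
import ipaddress
import struct


def hash_input(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
               fields: str) -> bytes:
    """Concatenate the selected fields, using ethtool's s/d/f/n letters."""
    parts = []
    if "s" in fields:  # source IP address
        parts.append(ipaddress.IPv4Address(src_ip).packed)
    if "d" in fields:  # destination IP address
        parts.append(ipaddress.IPv4Address(dst_ip).packed)
    if "f" in fields:  # bytes 0-1 of the L4 header: source port
        parts.append(struct.pack("!H", src_port))
    if "n" in fields:  # bytes 2-3 of the L4 header: destination port
        parts.append(struct.pack("!H", dst_port))
    return b"".join(parts)
```

Two flows that differ only in source port produce the same input under `"sd"` and different inputs under `"sdfn"`, which is why adding the ports lets traffic between one pair of hosts spread across queues.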
# View queues
[root@localhost ~]# ls /sys/class/net/enp7s0f0/queues
rx-0 rx-1 rx-2 rx-3 tx-0 tx-1 tx-2 tx-3
# View the RPS CPU mask of an RX queue
[root@localhost ~]# cat /sys/class/net/enp7s0f0/queues/rx-0/rps_cpus
1
# Modify the RPS CPU mask of an RX queue
[root@localhost ~]# echo 1 > /sys/class/net/enp7s0f0/queues/rx-0/rps_cpus
# View network softirqs
[root@localhost ~]# cat /proc/softirqs|egrep 'CPU|TX|RX'
CPU0 CPU1 CPU2 CPU3
NET_TX: 954 1301 0 0
NET_RX: 87032 0 0 0
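RPS spreads the softirq work of one RX queue over the CPUs listed in `rps_cpus` by scaling the packet's flow hash onto that CPU list. The sketch below is a simplified model of that selection in Python: it ignores RFS and CPU online state, the function names are illustrative, and it assumes the mask uses zero-padded 32-bit groups as the kernel prints them:

```python
def mask_to_cpus(mask: str) -> list:
    """Expand a hex CPU bitmask (optionally comma-grouped) into CPU numbers."""
    value = int(mask.replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


def rps_pick_cpu(flow_hash: int, rps_cpus: str):
    """Choose the CPU that processes a packet, or None if RPS is disabled."""
    cpus = mask_to_cpus(rps_cpus)
    if not cpus:
        return None  # an all-zero mask leaves processing on the IRQ's CPU
    # Scale the 32-bit hash into [0, len(cpus)) without a modulo.
    index = ((flow_hash & 0xFFFFFFFF) * len(cpus)) >> 32
    return cpus[index]
```

With the mask `1` shown above only CPU 0 is eligible, which matches NET_RX counts accumulating on CPU0 alone; a mask of `f` would let the hash choose among CPUs 0 to 3.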
# View queues
[root@localhost ~]# ls /sys/class/net/enp7s0f0/queues
rx-0 rx-1 rx-2 rx-3 tx-0 tx-1 tx-2 tx-3
# View the XPS CPU mask of a TX queue
[root@localhost ~]# cat /sys/class/net/enp7s0f0/queues/tx-0/xps_cpus
1
# Modify the XPS CPU mask of a TX queue
[root@localhost ~]# echo 1 > /sys/class/net/enp7s0f0/queues/tx-0/xps_cpus
# View network softirqs
[root@localhost ~]# cat /proc/softirqs|egrep 'CPU|TX|RX'
CPU0 CPU1 CPU2 CPU3
NET_TX: 954 1301 0 0
NET_RX: 87032 0 0 0
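Both FD (Flow Director) and RSS apply to the receive direction, and FD takes priority over RSS. A typical use of FD is to make sure reply packets land on the same queue that sent the original packets.
RSS load-balances packets across queues by hashing the 5-tuple, but it cannot guarantee that reply packets land on the same queue. A symmetric hash (one whose value is unchanged when src and dst are swapped) partly solves this, but it breaks down on devices that perform NAT (load balancers, for example). FD solves the problem; see the article on MGW, Meituan-Dianping's high-performance layer-4 load balancer.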
# on/off means FD is supported; [fixed] means FD is not supported
[root@localhost ~]# ethtool -k enp7s0f0|grep ntuple
ntuple-filters: off
# Enable FD
[root@localhost ~]# ethtool -K enp7s0f0 ntuple on
# Disable FD
[root@localhost ~]# ethtool -K enp7s0f0 ntuple off
# Steer UDP flows whose destination IP is 192.168.0.1 to queue 0
[root@localhost ~]# ethtool -N enp7s0f0 flow-type udp4 dst-ip 192.168.0.1 action 0
# View the Rx/Tx ring buffer sizes
[root@localhost ~]# ethtool -g enp7s0f0
Ring parameters for enp7s0f0:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 512
RX Mini: 0
RX Jumbo: 0
TX: 512
# Modify the Rx/Tx ring buffer sizes
[root@localhost ~]# ethtool -G enp7s0f0 rx 4096 tx 4096
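A rough way to reason about ring size is to ask how long a ring can absorb traffic if the CPU stops draining it. The Python sketch below assumes one descriptor per packet and a single ring receiving the full packet rate, which is a worst-case simplification; the function name is illustrative:

```python
def ring_absorb_us(descriptors: int, pps: float) -> float:
    """Microseconds until a ring of this size fills at the given packet rate."""
    return descriptors / pps * 1e6
```

At the 64B line rate of 14880952 pps, a 512-entry ring fills in roughly 34 microseconds and a 4096-entry ring in roughly 275 microseconds, so the larger ring tolerates proportionally longer scheduling stalls at the cost of more memory and cache footprint.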
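We take the 82599 as an example. The 82599 has 128 hardware transmit queues (Tx FIFO) and 128 hardware receive queues (Rx FIFO); the number of queues actually used is determined mainly by DCB and RSS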
DCB (Data Center Bridging)
Packets are classified into one of several (up to eight) Traffic Classes (TCs). Each TC is associated with a single unique packet buffer. Packets that reside in a specific packet buffer are then routed to one of a set of Rx queues based on their TC value and other considerations such as RSS and virtualization.
RSS (Receive Side Scaling)
RSS assigns to each received packet an RSS index. Packets are routed to one of a set of Rx queues based on their RSS index and other considerations such as DCB and virtualization.
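As shown in the figure below, the hardware receive queue index has 7 bits (the high 3 bits are determined by DCB, the low 4 bits by RSS), so RSS supports at most 2^4 = 16 queues
Four cases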
| | No RSS | RSS |
|---|---|---|
| No DCB | Queue 0 is used for all packets | A set of 16 queues is allocated for RSS |
| DCB | A single queue is allocated per TC to a total of eight queues (if the number of TCs is eight), or to a total of four queues (if the number of TCs is four) | A packet is assigned to one of 128 queues (8 TCs x 16 RSS) or one of 64 queues (4 TCs x 16 RSS) |
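The four cases in the table, together with the 7-bit index split described earlier, can be written down as a short Python sketch. The bit layout follows the description in this text (TC in the high 3 bits, RSS index in the low 4 bits); the actual register layout in 4-TC mode may differ, and both function names are illustrative:

```python
RSS_BITS = 4  # low 4 bits of the queue index come from RSS


def rx_queue_count(num_tcs: int, rss: bool) -> int:
    """Queues in use; num_tcs is 0 when DCB is disabled, otherwise 4 or 8."""
    if num_tcs not in (0, 4, 8):
        raise ValueError("num_tcs must be 0, 4, or 8")
    per_tc = (1 << RSS_BITS) if rss else 1
    return max(num_tcs, 1) * per_tc


def rx_queue_index(tc: int, rss_index: int) -> int:
    """Combine a TC index (high 3 bits) and an RSS index (low 4 bits)."""
    if not 0 <= tc < 8 or not 0 <= rss_index < (1 << RSS_BITS):
        raise ValueError("tc must be 0-7 and rss_index 0-15")
    return (tc << RSS_BITS) | rss_index
```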
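Take the figure below as an example: DCB and RSS are both enabled, DCB has 4 TCs, and each TC maps to 16 queues. The upper row of 64 hardware queues serves the 4 TCs; within TC0, TC1, TC2, and TC3, RSS uses 8, 4, 4, and 8 hardware queues respectively. The lower row of 64 hardware queues is used by the other filters
Each packet passes through the filters. If one matches, the packet goes to that filter's queue; otherwise the RSS index and TC index are computed and combined into a queue index, and the packet goes to that queue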
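The receive path just described (filters first, then the TC and RSS indexes as a fallback) can be modeled in a few lines of Python. This is a simplified sketch, not the hardware's implementation: the `Filter` and `Packet` types are hypothetical, a filter here matches only on protocol and destination IP, and the TC and RSS indexes are taken as already computed:

```python
from typing import NamedTuple, Sequence


class Filter(NamedTuple):
    proto: str   # e.g. "udp4"
    dst_ip: str  # destination IP to match
    queue: int   # queue to steer matching packets to


class Packet(NamedTuple):
    proto: str
    dst_ip: str
    tc: int         # traffic class index from DCB (0-7)
    rss_index: int  # RSS index (0-15)


def select_rx_queue(pkt: Packet, filters: Sequence[Filter]) -> int:
    """Return the RX queue: the first matching filter wins, else TC + RSS."""
    for f in filters:
        if f.proto == pkt.proto and f.dst_ip == pkt.dst_ip:
            return f.queue
    # No filter matched: TC in the high 3 bits, RSS index in the low 4 bits.
    return (pkt.tc << 4) | pkt.rss_index
```

A filter equivalent to the earlier ethtool rule would be `Filter("udp4", "192.168.0.1", 0)`: UDP packets to that address go to queue 0 regardless of their hash, and everything else falls through to the combined index.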