This post provides basic steps on how to configure and set up basic parameters for the Mellanox ConnectX-5 100Gb/s adapter.
This post is basic and is meant for beginners. The procedure is very similar to the one for the ConnectX-4 adapter (in fact, it uses the same mlx5 driver).
Note: ConnectX-5 adapters can be used only with MLNX_OFED rel. 4.0 or later installed. In this document, we use driver version MLNX_OFED_LINUX-4.2-1.2.0.0-rhel7.4-x86_64.
Note: After installing this package, there is no need to re-install the MLNX_EN driver (which is for Ethernet devices only); the MLNX_OFED driver package includes drivers for both InfiniBand and Ethernet devices.
1. Setup
The basic setup consists of:
1. Two servers equipped with PCIe Gen3 x16 slots
2. Two Mellanox ConnectX-5 adapter cards
3. One 100Gb/s cable
In this specific setup, RHEL 7.4 was installed on the servers.
If you plan to run performance tests, we recommend that you tune the BIOS to high performance.
a. Disable Hyper Threading
b. Disable P-State
c. Disable C-State
d. Power - Configure power to run at maximum power for maximum performance.
e. CPU Frequency - maximum speed for maximum performance.
f. Memory Speed - maximum speed for maximum performance.
g. Enable NUMA
1. Install the latest MLNX_OFED (rel. 4.0 or later). Here we install version MLNX_OFED_LINUX-4.2-1.2.0.0-rhel7.4-x86_64. Please download the OFED driver from:
http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers
Extract the tgz file and install the driver according to the README. Note: if the installation fails, re-run it with the --force parameter.
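A minimal sketch of the install flow, assuming the tarball name matches the driver version used in this post:
# tar -xzf MLNX_OFED_LINUX-4.2-1.2.0.0-rhel7.4-x86_64.tgz
# cd MLNX_OFED_LINUX-4.2-1.2.0.0-rhel7.4-x86_64
# ./mlnxofedinstall --force
# reboot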
After reboot, you can check the Ethernet port info:
# ethtool -i enp37s0
2. Check that the adapters are "recognized" by running the lspci command:
# lspci | grep Mellanox
25:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
45:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
Note: In ConnectX-5, each port is identified by a unique number.
3. Change the link protocol to Ethernet using the MFT mlxconfig tool.
a. Start MFT.
# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Create devices
Unloading MST PCI module (unused) - Success
b. Extract the vendor_part_id parameter. Note: the ConnectX-5 device ID is 4119.
# ibv_devinfo | grep vendor_part_id
vendor_part_id: 4119
vendor_part_id: 4119
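As a cross-check, 4119 in decimal is 0x1017 in hex, which is the PCI device ID reported by lspci -n (vendor ID 15b3 is Mellanox); the output line below is illustrative:
# lspci -n | grep 15b3
25:00.0 0200: 15b3:1017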
c. Query the host for the ConnectX-5 adapter configuration:
# mlxconfig -d /dev/mst/mt4119_pciconf0 q
Device #1:
----------
Device type: ConnectX5
PCI device: /dev/mst/mt4119_pciconf0
Configurations: Current
...
LINK_TYPE_P1 1
LINK_TYPE_P2 1
....
Note that the LINK_TYPE_P1 and LINK_TYPE_P2 equal 1 (InfiniBand) by default.
d. Change the port type to Ethernet (LINK_TYPE = 2):
# mlxconfig -d /dev/mst/mt4119_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2
Device #1:
----------
Device type: ConnectX5
PCI device: /dev/mst/mt4119_pciconf0
Configurations: Current New
LINK_TYPE_P1 1 2
LINK_TYPE_P2 1 2
Apply new Configuration? (y/n) [n] : y
Applying... Done!
-I- Please reboot machine to load new configurations.
e. Reboot the server.
4. Configure IPs and MTUs on both servers.
For Server S1 (SUT):
# ifconfig enp37s0 1.1.1.1/24 up
# ifconfig enp37s0 mtu 9000
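For the second server, assign a different address on the same subnet (the 1.1.1.2 address and interface name below are assumptions for illustration), then verify connectivity:
# ifconfig enp37s0 1.1.1.2/24 up
# ifconfig enp37s0 mtu 9000
# ping 1.1.1.1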
5. After you reboot, check that the port type was changed to Ethernet for each port:
# ibdev2netdev
mlx5_0 port 1 ==> enp37s0 (Up)
mlx5_1 port 1 ==> enp69s0 (Up)
6. Make sure that you disable the firewall, iptables, SELinux, and any other security processes that might block the traffic.
# systemctl stop firewalld.service
# systemctl status firewalld.service
# systemctl stop iptables.service
# systemctl status iptables.service
Disable SELinux in the config file located at /etc/selinux/config.
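For example, you can switch SELinux to permissive mode immediately and persist the disabled state in the config file (a hedged sketch assuming the stock file format):
# setenforce 0
# sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
A reboot is required for the config-file change to take full effect.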
a. Set the system profile to throughput mode.
Command: mlnx_tune -p HIGH_THROUGHPUT
b. Set the Ethernet port MTU to 9000 on every Ethernet port.
Command: ifconfig
Example: ifconfig enp37s0 mtu 9000
You can check the value with the command:
# ifconfig enp37s0
mtu 9000 (the default is 1500)
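To make the MTU persistent across reboots on RHEL 7 (assuming the interface is managed by an ifcfg file), add an MTU line to its config:
# echo "MTU=9000" >> /etc/sysconfig/network-scripts/ifcfg-enp37s0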
c. Check which NUMA node the adapter is connected to (0 is NUMA node 0, 1 is NUMA node 1, -1 means NUMA is disabled). Normally this is node 0; if the Ethernet port is on NUMA node 1, change the IRQ affinity to that node for an improvement.
Command: set_irq_affinity_bynode.sh
Example: set_irq_affinity_bynode.sh 1 enp37s0
d. Cross-check how the Ethernet port information fits together:
I. Check the NUMA node of each port:
# cat /sys/class/net/enp37s0/device/numa_node
0
# cat /sys/class/net/enp69s0/device/numa_node
1
II. Check the Ethernet port PCI IDs:
# ethtool -i enp37s0
0000:25:00.0
# ethtool -i enp69s0
0000:45:00.0
So 25:00.0 <-> enp37s0 and 45:00.0 <-> enp69s0.
III. # mlnx_tune -r
So 25:00.0 <-> cpu cores [0,1,2,...,22,23] and 45:00.0 <-> cpu cores [24,25,...,46,47].
IV. # ibdev2netdev
mlx5_0 port 1 ==> enp37s0 (Up)
mlx5_1 port 1 ==> enp69s0 (Up)
So:
25:00.0 <-> enp37s0 <-> cpu cores [0,1,2,...,22,23] <-> mlx5_0 port 1
45:00.0 <-> enp69s0 <-> cpu cores [24,25,...,46,47] <-> mlx5_1 port 1
Or you can use the command "mst status -v" to check this info:
# mst status -v
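Given the mapping above, you can keep a test process on the adapter's local NUMA node; a hedged example using numactl for the 25:00.0 port (local to node 0):
# numactl --cpunodebind=0 --membind=0 iperf -s -P 8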
e. Run run_perftest_loopback to check the PCIe capability.
Command: run_perftest_loopback 0 1 ib_write_bw -d
Example: run_perftest_loopback 0 1 ib_write_bw -d mlx5_0 --report_gbit -F --output=bandwidth -x 0
# run_perftest_loopback 0 1 ib_write_bw -d mlx5_0 --report_gbit -F --output=bandwidth -x 0
101.174433
101.174433
# run_perftest_loopback 24 25 ib_write_bw -d mlx5_1 --report_gbit -F --output=bandwidth -x 0
101.130661
101.130661
# run_perftest_loopback 48 49 ib_write_bw -d mlx5_2 --report_gbit -F --output=bandwidth -x 0
101.201363
101.201363
# run_perftest_loopback 72 73 ib_write_bw -d mlx5_3 --report_gbit -F --output=bandwidth -x 0
101.151671
101.151671
f. Check the PCIe width and link speed (check the LnkSta line for the current values).
Command: lspci -vvv -s
Example: lspci -vvv -s 25:00.0 | grep Speed
g. Show the CPU working frequency.
Command: grep -E '^model name|^cpu MHz' /proc/cpuinfo
h. If the frequency is not at maximum, set the CPU frequency governor to performance.
Command: cpupower -c all frequency-set -g performance
i. Show the CPU working frequency again.
Command: grep -E '^model name|^cpu MHz' /proc/cpuinfo
j. Disable IRQ balancing.
Command: systemctl disable irqbalance
k. Start the automatic tuning utility.
Command: mlnx_affinity start
l. Get the IRQ numbers for the relevant port.
Command: cat /proc/interrupts | grep
Example: cat /proc/interrupts | grep enp37s0
m. Show the current IRQ affinity settings.
Command: show_irq_affinity.sh
Example: show_irq_affinity.sh enp37s0
If the result is "000ffff" or "ffffff", the affinity setting did not succeed.
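If a single IRQ needs to be pinned by hand, you can write a CPU mask to its smp_affinity file (the IRQ number below is a hypothetical placeholder; take real numbers from /proc/interrupts):
# echo f > /proc/irq/<irq_number>/smp_affinity    (mask f = CPUs 0-3)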
n. Set the PCIe read request size; only change the first digit and keep the rest the same.
Command: setpci -s <PCI ID> 68.w (query the current value)
setpci -s <PCI ID> 68.w=<value> (set a new value)
Example:
setpci -s 25:00.0 68.w
2930
setpci -s 25:00.0 68.w=5930
5930
setpci -s 45:00.0 68.w
2930
setpci -s 45:00.0 68.w=5930
5930
o. Check that the PCIe read request size change succeeded.
Command: lspci -s <PCI ID> -vvv | grep MaxReadReq
Example:
lspci -s 25:00.0 -vvv | grep MaxReadReq
MaxPayload 256 bytes, MaxReadReq 4096 bytes
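For reference, the first hex digit written above is the standard PCIe Device Control read-request encoding: 0 = 128B, 1 = 256B, 2 = 512B, 3 = 1024B, 4 = 2048B, 5 = 4096B. Rewriting 2930 as 5930 therefore raises MaxReadReq from 512 to 4096 bytes (the 0x68 register offset is specific to these adapters' capability layout).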
7. Start the performance test.
Command:
iperf -s -P 8 (on the server side)
iperf -c <server IP> -P 8 (on the client side)
To check for dropped packets, run the iperf test to measure throughput and, meanwhile, run the command "watch -n 1 'ifconfig enp37s0'".
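A hedged end-to-end example, using the S1 address configured earlier (1.1.1.1), the second server as client, and an assumed 60-second run:
On S1 (server): # iperf -s -P 8
On the client: # iperf -c 1.1.1.1 -P 8 -t 60
Meanwhile on S1: # watch -n 1 "ifconfig enp37s0"    (watch the RX dropped counter)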
Troubleshooting:
1. If MLNX_OFED rel. 4.0 or later is not used, the card will be identified as a ConnectX-4 adapter by default.
# ofed_info -s
MLNX_OFED_LINUX-3.4-2.0.0.0:
# lspci | grep Mellanox
81:00.0 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-4]
81:00.1 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-4]
To correct this, install MLNX_OFED rel. 4.0 or later.
# ofed_info -s
MLNX_OFED_LINUX-4.0-0.1.5.0:
# lspci | grep Mel
81:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
81:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
2. Make sure that you run the iperf process from the root "/" folder.
https://community.mellanox.com/docs/DOC-2386
During installation, the latest driver automatically checks the card's firmware version and upgrades it to the latest one included in the driver package. If you want to downgrade the card's firmware, refer to the command below:
# mlxfwmanager -u -i fw-ConnectX5-rel-16_21_2010-MCX515A-CCA_Ax-FlexBoot-3.5.305.bin -f
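To see which firmware is currently flashed before upgrading or downgrading, a simple check:
# mlxfwmanager --query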
The command "ethtool -s enp37s0 speed 100000 autoneg off" can restore the NIC speed to 100000 (if it does not work the first time, run it a few more times; there appears to be some buffering). However, if you first run "ethtool -s enp37s0 speed 100000 autoneg off" and then run "ethtool -s enp37s0 autoneg on", it returns: cannot advertise speed 100000.
Test steps:
Test pass/fail criteria:
Executed this way, the result did not match expectations: it took 15-20 s for ping packets to recover. If, right after the fourth step, you run the command "ip -s -s neigh flush all" or the command "arp -d ip_addr" on the client side, the ping recovers immediately. Vendor reply: this is a software-layer issue, unrelated to the hardware; your tests need to be run this way.
You can restore the firmware configuration with the mstconfig tool: mstconfig -d 25:00.0 r
Then reboot the server as prompted.
After the server boots into the OS, query all the parameters with mlxconfig -d <device> q and compare them against the other machines.
Command: mlxconfig -d 25:00.0 q
Root-cause check (virtual NICs can be added only while the VM is shut down):
In the BIOS: Hyper-Threading enabled, VT-d enabled, SR-IOV enabled, Virtual Machine (virtualization) enabled.
In the OS:
GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet intel_iommu=on"
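After editing /etc/default/grub, regenerate the GRUB configuration and reboot (the path below assumes a BIOS-boot RHEL 7 layout; UEFI systems use a different grub.cfg location):
# grub2-mkconfig -o /boot/grub2/grub.cfg
# reboot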
Steps:
Install DPDK (the config file paths below assume the classic make-based DPDK layout of this era):
# cd <DpdkInstallDir>
# export RTE_SDK=<DpdkInstallDir>
# export RTE_TARGET=x86_64-native-linuxapp-gcc
# vim config/common_base
CONFIG_RTE_LIBRTE_MLX5_PMD=y
# vim config/common_linuxapp
CONFIG_RTE_KNI_KMOD=n
CONFIG_RTE_LIBRTE_KNI=n
# make install T=x86_64-native-linuxapp-gcc -j
Install pktgen:
# rpm -ivh libpcap-devel-1.5.3-9.el7.x86_64.rpm
# cd <PktgenInstallDir>
# export RTE_SDK=<DpdkInstallDir>
# export RTE_TARGET=x86_64-native-linuxapp-gcc
# make
# cp
l2fwd/l3fwd test:
On the SUT side:
# export RTE_SDK=<DpdkInstallDir>
# export RTE_TARGET=x86_64-native-linuxapp-gcc
# echo 40960 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
# echo 40960 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
# mkdir -p /mnt/huge
# mount -t hugetlbfs hugetlb /mnt/huge
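To verify the hugepage reservation took effect, a quick check:
# grep -i huge /proc/meminfo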
# modprobe uio
# insmod x86_64-native-linuxapp-gcc/kmod/igb_uio.ko
# cd $RTE_SDK/examples/l2fwd (or $RTE_SDK/examples/l3fwd for the l3fwd test)
#make
#cd build
# ./l2fwd -l 0-8 -w 5e:00.0 -n 8 -- -q 8 -p 0x3 // l2fwd test; 5e:00.0 is the NIC PCI ID, -p 0x3 is the port bitmask
# ./l3fwd -l 1,2 -w 5e:00.0 -n 4 -- -P -p 0x3 --config="(0,0,1),(0,1,2)" --parse-ptype // l3fwd test
On the DAT (traffic generator) side:
# cd <PktgenInstallDir>
# ./pktgen -c 0x1ff -w 25:00.0 -n 4 -- -P -m "[1-2:3-4].0"
Copyright (c) <2010-2017>, Intel Corporation. All rights reserved. Powered by Intel® DPDK
EAL: Detected 80 lcore(s)
EAL: No free hugepages reported in hugepages-1048576kB
...
| Ports 0-0 of 1
Flags:Port : P--------------:0
Link State :
Pkts/s Max/Rx : 0/0 0/0
Max/Tx : 0/0 0/0
MBits/s Rx/Tx : 0/0 0/0
Broadcast : 0
Multicast : 0
64 Bytes : 0
65-127 : 0
128-255 : 0
256-511 : 0
512-1023 : 0
1024-1518 : 0
Runts/Jumbos : 0/0
Errors Rx/Tx : 0/0
Total Rx Pkts : 0
Tx Pkts : 0
Rx MBs : 0
Tx MBs : 0
ARP/ICMP Pkts : 0/0
:
Pattern Type : abcd...
Tx Count/% Rate : Forever /100%
PktSize/Tx Burst : 64 / 32
Src/Dest Port : 1234 / 5678
Pkt Type:VLAN ID : IPv4 / TCP:0001
Dst IP Address : 192.168.1.1
Src IP Address : 192.168.0.1/24
Dst MAC Address : 00:00:00:00:00:00
Src MAC Address : ec:0d:9a:c1:b3:d0
VendID/PCI Addr : 15b3:1017/5e:00.0
-- Pktgen Ver: 3.2.6 (DPDK 16.11.0) Powered by Intel® DPDK -------------------
For l2fwd:
Pktgen:/> set 0 dst mac xx:xx:xx:xx:xx:xx // set the SUT NIC MAC
Pktgen:/> set 0 dst ip x.x.x.x // set the SUT IP (the default can be kept)
Pktgen:/> set 0 src ip x.x.x.x/24 // set the DAT IP (the default can be kept)
Pktgen:/> set 0 size 512 // set the packet frame size: 64, 128, 256, 512, 1024, or 1518
Pktgen:/> start all // start the test; to stop it, use `stop all`
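Two other commonly used knobs, hedged against the Pktgen 3.2.x syntax shown in the banner above:
Pktgen:/> set 0 rate 50 // limit the transmit rate to 50% of line rate
Pktgen:/> set 0 count 1000000 // send a fixed number of packets instead of running forever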
P.S.: Performance is affected only when the IOMMU is enabled on the receiving side (the iperf -s side); if the IOMMU is enabled only on the sending side (iperf -c), performance is not affected.
Solution: on the host, first run mst start, then mlxconfig -d /dev/mst/m.... set NUM_VF_MSIX=24. After the host reboots, the number of interrupts available to the virtual machine increases.
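To confirm the effect inside the VM after the reboot, a simple check of the interrupt count (hedged; the grep pattern assumes the mlx5 driver name appears in /proc/interrupts):
# cat /proc/interrupts | grep mlx5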