Getting Started with ConnectX-5 100Gb/s Adapters for Linux

This post provides the basic steps to configure and set up basic parameters for the Mellanox ConnectX-5 100Gb/s adapter.

This post is basic and is meant for beginners. The procedure is very similar to the one for the ConnectX-4 adapter (in fact, it uses the same mlx5 driver).

 

Note: ConnectX-5 adapters can be used only with MLNX_OFED rel. 4.0 or later installed. In this document, we use driver version MLNX_OFED_LINUX-4.2-1.2.0.0-rhel7.4-x86_64.

Note: After installing this package, there is no need to install the MLNX_EN driver (which is for Ethernet devices only); the MLNX_OFED driver package includes drivers for both InfiniBand and Ethernet devices.

1. Setup

The basic setup consists of:

1. Two servers equipped with PCIe Gen3 x16 slots

2. Two Mellanox ConnectX-5 adapter cards

3. One 100Gb/s cable

In this specific setup, RHEL 7.4 was installed on the servers.

2. Prerequisites

If you plan to run performance tests, we recommend that you tune the BIOS to high performance:

a. Disable Hyper Threading

b. Disable P-State

c. Disable C-State

d. Power - Configure power to run at maximum power for maximum performance.

e. CPU Frequency - maximum speed for maximum performance.

f. Memory Speed - maximum speed for maximum performance.

g. Enable NUMA

 

Please refer to the document "Understanding BIOS Configuration for Performance Tuning" at https://community.mellanox.com/docs/DOC-2488.

 

3. Configuration

1. Install the latest MLNX_OFED (rel. 4.0 or later). Here we install version MLNX_OFED_LINUX-4.2-1.2.0.0-rhel7.4-x86_64. Please download the OFED driver from:

http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers


Extract the tgz file and install the driver according to the README. Note: if the installation fails, add the --force parameter.
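For example, a minimal install sequence might look like this (assuming the tgz was downloaded to the current directory and matches the version above):

# tar -xzf MLNX_OFED_LINUX-4.2-1.2.0.0-rhel7.4-x86_64.tgz

# cd MLNX_OFED_LINUX-4.2-1.2.0.0-rhel7.4-x86_64

# ./mlnxofedinstall          (add --force if the install fails)

# reboot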

 

After reboot, you can check the Ethernet port info:

 

# ethtool -i enp37s0

 

 

2. Check that the adapters are "recognized" by running the lspci command:

# lspci | grep Mellanox

25:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

45:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

 

Note: In ConnectX-5, each port is identified by a unique number.

 

3. Change the link protocol to Ethernet using the MFT mlxconfig tool.

 

a. Start MFT.

# mst start

Starting MST (Mellanox Software Tools) driver set

Loading MST PCI module - Success

Loading MST PCI configuration module - Success

Create devices

Unloading MST PCI module (unused) - Success

 

b. Extract the vendor_part_id parameter. Note: ConnectX-5's ID is 4119.

# ibv_devinfo  | grep vendor_part_id

  vendor_part_id: 4119

  vendor_part_id: 4119

c. Query the host for the ConnectX-5 adapter configuration:

# mlxconfig -d /dev/mst/mt4119_pciconf0 q

 

Device #1:

----------

 

Device type:   ConnectX5      

PCI device:     /dev/mst/mt4119_pciconf0

 

 

Configurations:         Current

...

         LINK_TYPE_P1   1      

        LINK_TYPE_P2    1

....

 

  Note that the LINK_TYPE_P1 and LINK_TYPE_P2 equal 1 (InfiniBand) by default.

 

d. Change the port type to Ethernet (LINK_TYPE = 2):

# mlxconfig -d /dev/mst/mt4119_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2

 

Device #1:

----------

 

Device type:   ConnectX5     

PCI device:     /dev/mst/mt4119_pciconf0

 

Configurations:         Current New

        LINK_TYPE_P1    1      2      

        LINK_TYPE_P2    1      2      

 

Apply new Configuration? (y/n) [n] : y

Applying... Done!

-I- Please reboot machine to load new configurations.

 

e. Reboot the server.

 

4. Configure IPs and MTUs on both servers.

 

For server S1 (SUT):

# ifconfig enp37s0 1.1.1.1/24 up

# ifconfig enp37s0 mtu 9000
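Configure server S2 in the same way, using another address on the same subnet (the address below is only an example):

# ifconfig enp37s0 1.1.1.2/24 up

# ifconfig enp37s0 mtu 9000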

 

5. After you reboot, check that the port type was changed to Ethernet on each port:

# ibdev2netdev

mlx5_0 port 1 ==> enp37s0  (Up)

mlx5_1 port 1 ==> enp69s0  (Up)

 

6. Make sure that you disable the firewall, iptables, SELinux, and any other security processes that might block the traffic.

# systemctl stop firewalld.service

# systemctl status firewalld.service

# systemctl stop iptables.service

# systemctl status iptables.service

 

Disable SELinux in the config file located at /etc/selinux/config.
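For example, a minimal change in /etc/selinux/config is:

SELINUX=disabled

A reboot is required for this to take effect; to stop enforcement immediately (until the next reboot) you can also run:

# setenforce 0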

 

4. OS-layer settings

a. Set your system to high-throughput mode.

Command: mlnx_tune -p HIGH_THROUGHPUT

b. Set the MTU to 9000 on every Ethernet port.

Command: ifconfig <interface> mtu 9000

Example: ifconfig enp37s0 mtu 9000

You can check the value with:

# ifconfig enp37s0

The MTU should now show 9000 (the default is 1500).

c. Check which NUMA node the adapter is connected to (0 means NUMA node 0, 1 means NUMA node 1, -1 means NUMA is disabled). Normally this is 0. If the Ethernet port is on NUMA node 1, change the IRQ affinity to that node to improve performance.

Command: set_irq_affinity_bynode.sh <node> <interface>

Example: set_irq_affinity_bynode.sh 1 enp37s0

d. Check how the Ethernet ports, PCI addresses, NUMA nodes, and CPU cores map to each other:

I. # cat /sys/class/net/enp37s0/device/numa_node

0

# cat /sys/class/net/enp69s0/device/numa_node

1

II. Check each Ethernet port's PCI address:

# ethtool -i enp37s0

0000:25:00.0

# ethtool -i enp69s0

0000:45:00.0

So 25:00.0 <-> enp37s0 and 45:00.0 <-> enp69s0.

III. # mlnx_tune -r

So 25:00.0 <-> CPU cores [0,1,2 … 22,23] and 45:00.0 <-> CPU cores [24,25 … 46,47].

IV. # ibdev2netdev

mlx5_0 port 1 ==> enp37s0 (Up)

mlx5_1 port 1 ==> enp69s0 (Up)

So:

25:00.0 <-> enp37s0 <-> CPU cores [0,1,2 … 22,23] <-> mlx5_0 port 1

45:00.0 <-> enp69s0 <-> CPU cores [24,25 … 46,47] <-> mlx5_1 port 1

 

Or you can use the command "mst status -v" to check this information:

# mst status -v

 

 

e. Run run_perftest_loopback to check the PCIe capability.

 

 

Command: run_perftest_loopback <core1> <core2> ib_write_bw -d <device> --report_gbit -F --output=bandwidth -x 0

Example: run_perftest_loopback 0 1 ib_write_bw -d mlx5_0 --report_gbit -F --output=bandwidth -x 0

 

# run_perftest_loopback 0 1 ib_write_bw -d mlx5_0 --report_gbit -F --output=bandwidth -x 0

101.174433

101.174433

# run_perftest_loopback 24 25 ib_write_bw -d mlx5_1 --report_gbit -F --output=bandwidth -x 0

101.130661

101.130661

# run_perftest_loopback 48 49 ib_write_bw -d mlx5_2 --report_gbit -F --output=bandwidth -x 0

101.201363

101.201363

# run_perftest_loopback 72 73 ib_write_bw -d mlx5_3 --report_gbit -F --output=bandwidth -x 0

101.151671

101.151671

If the value is < 100, the configuration is not OK and the throughput will not reach about 99 Gbit/s.

 

f. Check the PCIe width and link speed (check LnkSta for the current values).

Command: lspci -vvv -s <pci_id> | grep Speed

Example: lspci -vvv -s 25:00.0 | grep Speed
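For a ConnectX-5 in a Gen3 x16 slot, the output should look roughly like this (illustrative; the key point is that LnkSta reports Speed 8GT/s and Width x16):

LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported

LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-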

g. Show the CPU working frequency.

Command: grep -E '^model name|^cpu MHz' /proc/cpuinfo

h. If the frequency is not at maximum, set the governor to performance.

Command: cpupower -c all frequency-set -g performance

i. Show the CPU working frequency again.

Command: grep -E '^model name|^cpu MHz' /proc/cpuinfo

j. Disable IRQ balancing.

Command: systemctl disable irqbalance

k. Start the auto-tuning utility.

Command: mlnx_affinity start

l. Get the IRQ numbers for the relevant port.

Command: cat /proc/interrupts | grep <interface>

Example: cat /proc/interrupts | grep enp37s0

m. Show the current IRQ affinity settings.

Command: show_irq_affinity.sh <interface>

Example: show_irq_affinity.sh enp37s0

If the result is "000ffff" or "ffffff" (i.e. all cores), the affinity setting did not succeed.

n. Set the PCIe read-request (buffer) size; only change the first digit and keep the rest the same.

Command: setpci -s <pci_id> 68.w

         setpci -s <pci_id> 68.w=5xxx

Example:

setpci -s 25:00.0 68.w

2930

setpci -s 25:00.0 68.w=5930

5930

setpci -s 45:00.0 68.w

2930

setpci -s 45:00.0 68.w=5930

5930

o. Check that the PCIe read-request size change succeeded.

Command: lspci -s <pci_id> -vvv | grep MaxReadReq

Example:

lspci -s 25:00.0 -vvv | grep MaxReadReq

MaxPayload 256 bytes, MaxReadReq 4096 bytes

5. Start the iperf performance test.

Command:

iperf -s -P8                                          (on the server)

iperf -c <server_ip> -P8 -t 86400 -i 600 > log.txt    (on the client)
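For example, using the SUT address configured earlier (1.1.1.1) as the server:

On the SUT:    # iperf -s -P8

On the client: # iperf -c 1.1.1.1 -P8 -t 86400 -i 600 > log.txt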

 

 

To check for dropped packets, run the iperf throughput test and, at the same time, run the command: watch -n 1 "ifconfig enp37s0"

 

Troubleshooting

1. If MLNX_OFED rel. 4.0 or later is not used, the card is identified as a ConnectX-4 adapter by default.

# ofed_info -s

MLNX_OFED_LINUX-3.4-2.0.0.0:

 

# lspci | grep Mellanox

81:00.0 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-4]

81:00.1 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-4]

 

 

To correct this, install MLNX_OFED rel. 4.0 or later.

# ofed_info -s

MLNX_OFED_LINUX-4.0-0.1.5.0:

 

# lspci | grep Mel

81:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]

81:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]

 

2. Make sure that you run the iperf process from the root "/" folder.

 

 

References

  1. HowTo Configure SR-IOV for ConnectX-4/ConnectX-5 with KVM (Ethernet)

https://community.mellanox.com/docs/DOC-2386

 

Additional Notes

1. Downgrade FW:

During installation, the latest driver automatically checks and upgrades the card's firmware to the version bundled in the driver package. If you want to downgrade the card firmware, use the command below.

 

# mlxfwmanager -u -i fw-ConnectX5-rel-16_21_2010-MCX515A-CCA_Ax-FlexBoot-3.5.305.bin -f
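To see the firmware version currently running on the card (before or after the downgrade), you can query it with, for example:

# mlxfwmanager --query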

2. About the issue where the NIC speed drops to 1000 Mb/s and cannot recover to 100000 Mb/s:

The command ethtool -s enp37s0 speed 100000 autoneg off can restore the NIC speed to 100000 (if one run does not work, run it a few more times; there seems to be some buffering). However, if you first run ethtool -s enp37s0 speed 100000 autoneg off and then run ethtool -s enp37s0 autoneg on, the following message is returned: cannot advertise speed 100000.
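After forcing the speed, you can verify the current link speed with, for example:

# ethtool enp37s0 | grep Speed

Speed: 100000Mb/s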

 

3. MAC address setting test

Test steps:

  1. ifconfig ethx ip_addr netmask net_mask up (make sure the cable or fiber is connected).
  2. Connect the helper side and the port under test to the same switch.
  3. From helper side A, ping the port under test: ping ip_addr -c 100000 -i 1
  4. On the port under test, run ifconfig ethx hw ether MAC_RANDOM.  // MAC_RANDOM is a randomly chosen valid MAC address.
  5. Wait 3 s and observe the ping results on helper side A.
  6. Repeat steps 4-5 ten times.

Test pass/fail criteria:

In step 5, the ping should be interrupted briefly and recover within 3 s. If it has not recovered within 3 s, the test FAILs and the vendor should investigate.

When executed this way, the result did not match expectations: it took 15-20 s for the ping to recover. If, right after step 4, you run the command "ip -s -s neigh flush all" or "arp -d ip_addr" on the client side, the ping recovers immediately. Vendor reply: this is a software-layer issue, unrelated to hardware; the test should be run in this way. A scripted sketch of steps 4-5 is shown below.
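As an illustration only (not part of the original procedure), steps 4-5 could be scripted on the port under test roughly as follows; ethx is the interface name, and the MAC is generated in the locally administered range so that it is always a valid unicast address:

for i in $(seq 1 10); do
    MAC_RANDOM=$(printf '02:%02x:%02x:%02x:%02x:%02x' $((RANDOM%256)) $((RANDOM%256)) $((RANDOM%256)) $((RANDOM%256)) $((RANDOM%256)))
    ifconfig ethx hw ether $MAC_RANDOM    # step 4: apply a random valid MAC
    sleep 3                               # step 5: wait 3 s, then check the ping on helper side A
done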

4. How to load the default (standard) firmware configuration on the Mellanox 100G MCX515A-CCAT?

You can use the mstconfig tool to restore the default firmware configuration: mstconfig -d 25:00.0 r

Then reboot the server as prompted.

After the server boots back into the OS, use mlxconfig -d <device> q to query all the parameters and compare them against another machine.

Command: mlxconfig -d 25:00.0 q
5. The virtual machine fails to start after a virtual NIC is added.

Root-cause checklist (a virtual NIC can only be added while the VM is shut down):

In the BIOS: Hyper-Threading enabled, VT-d enabled, SR-IOV enabled, Virtual Machine (virtualization) enabled.

In the OS:

  1. vim /etc/default/grub, and append intel_iommu=on after GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet", i.e.:

GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet intel_iommu=on"

  2. Run the command: grub2-mkconfig -o /boot/grub2/grub.cfg

  3. Reboot, and the VM can start.
 
 
6. DPDK l2fwd/l3fwd environment setup and test steps

Steps:

  1. Connect the two machines directly with a cable.
  2. Tune the NIC and configure the IPs.
  3. Install DPDK on both the SUT and the DAT.
  4. Install pktgen on the DAT.
  5. Set up the l2fwd and l3fwd test environment.

 

 

Install DPDK:

# cd <DPDKInstallDir>

# export RTE_SDK=<DPDKInstallDir>

# export RTE_TARGET=x86_64-native-linuxapp-gcc

# vim $RTE_SDK/config/common_base

CONFIG_RTE_LIBRTE_MLX5_PMD=y

# vim $RTE_SDK/config/common_linuxapp

CONFIG_RTE_KNI_KMOD=n

CONFIG_RTE_LIBRTE_KNI=n

# make install -j T=x86_64-native-linuxapp-gcc

 

Install pktgen:

# rpm -ivh libpcap-devel-1.5.3-9.el7.x86_64.rpm

# cd <PktgenInstallDir>

# export RTE_SDK=<DPDKInstallDir>

# export RTE_TARGET=x86_64-native-linuxapp-gcc

# make

# cp <PktgenInstallDir>/Pktgen.lua <PktgenInstallDir>/app/app/x86_64-native-linuxapp-gcc/app

 

l2fwd/l3fwd test:

On the SUT:

# export RTE_SDK=<DPDKInstallDir>

# export RTE_TARGET=x86_64-native-linuxapp-gcc

# echo 40960 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

# echo 40960 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages

# mkdir -p /mnt/huge

# mount -t hugetlbfs hugetlb /mnt/huge

# modprobe uio

# insmod x86_64-native-linuxapp-gcc/kmod/igb_uio.ko

# cd $RTE_SDK/examples/l2fwd

# make

# cd build

# ./l2fwd -l 0-8 -w 5e:00.0 -n 8 -- -q 8 -p 0x3                 //for the l2fwd test; 5e:00.0 is the NIC PCI id

  ./l3fwd -l 1,2 -w 5e:00.0 -n 4 -- -P -p 0x3 --config="(0,0,1),(0,1,2)" --parse-ptype                 //for the l3fwd test
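Optionally, before launching l2fwd/l3fwd, you can confirm that the hugepages were actually reserved, for example:

# grep Huge /proc/meminfo

# cat /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages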

 

On the DAT:

# cd <PktgenInstallDir>/app/app/x86_64-native-linuxapp-gcc/app

# ./pktgen -c 0x1ff -w 25:00.0 -n 4 -- -P -m "[1-2:3-4].0"

Copyright (c) <2010-2017>, Intel Corporation. All rights reserved. Powered by Intel® DPDK

EAL: Detected 80 lcore(s)

EAL: No free hugepages reported in hugepages-1048576kB

……

……

| Ports 0-0 of 1  

 Copyright (c) <2010-2016>, Intel Corporation

  Flags:Port      :   P--------------:0

Link State        :           ----TotalRate----

Pkts/s Max/Rx     :                 0/0                   0/0

       Max/Tx     :                 0/0                   0/0

MBits/s Rx/Tx     :                 0/0                   0/0

Broadcast         :                   0

Multicast         :                   0

  64 Bytes        :                   0

  65-127          :                   0

  128-255         :                   0

  256-511         :                   0

  512-1023        :                   0

  1024-1518       :                   0

Runts/Jumbos      :                 0/0

Errors Rx/Tx      :                 0/0

Total Rx Pkts     :                   0

      Tx Pkts     :                   0

      Rx MBs      :                   0

      Tx MBs      :                   0

ARP/ICMP Pkts     :                 0/0

                  :

Pattern Type      :             abcd...

Tx Count/% Rate   :       Forever /100%

PktSize/Tx Burst  :           64 /   32

Src/Dest Port     :         1234 / 5678

Pkt Type:VLAN ID  :     IPv4 / TCP:0001

Dst  IP Address   :         192.168.1.1

Src  IP Address   :      192.168.0.1/24

Dst MAC Address   :   00:00:00:00:00:00

Src MAC Address   :   ec:0d:9a:c1:b3:d0

VendID/PCI Addr   :   15b3:1017/5e:00.0

 

-- Pktgen Ver: 3.2.6 (DPDK 16.11.0)  Powered by Intel® DPDK -------------------

For l2fwd:

Pktgen:/> set 0 dst mac xx:xx:xx:xx:xx:xx               //set the SUT NIC MAC

Pktgen:/> set 0 dst ip x.x.x.x                          //set the SUT IP (can be left at the default)

Pktgen:/> set 0 src ip x.x.x.x/24                       //set the DAT IP (can be left at the default)

Pktgen:/> set 0 size 512                                //set the packet frame size: 64, 128, 256, 512, 1024, 1518

Pktgen:/> start all                                     //start the test; to stop the test use `stop all`

 
 
  
7. Configure the grub config file so that bandwidth still meets the target with VT-d enabled

  1. Add iommu=pt to the grub config file.
  2. Add "intel_pstate=disable intel_idle.max_cstate=0" to the grub.cfg file.
  3. Probably only max_cstate=0 is actually needed.
  4. Broadcom has also seen this issue: with intel_iommu=on, bandwidth drops from about 23.5 G to about 12 G (roughly a 45% drop). With the IOMMU enabled, the NIC is forced to work on only one CPU core, and any core-pinning attempt fails, which is what causes the performance drop.

PS: Performance is only affected when the IOMMU is enabled on the receiving side (iperf -s); enabling the IOMMU only on the sending side (iperf -c) does not affect performance.
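For illustration, combining these with the intel_iommu=on setting from note 5 above, the resulting line in /etc/default/grub might look like this (followed by grub2-mkconfig -o /boot/grub2/grub.cfg and a reboot):

GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet intel_iommu=on iommu=pt intel_pstate=disable intel_idle.max_cstate=0"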

 

8. Use the command cat /proc/cmdline to check whether iommu=pt and the DPDK-related parameters were actually added to the kernel command line.

 

9. Improving TCP single-port bidirectional bandwidth for a VF: settings required on the VF/host/client sides

  1. Client side: disable VT-d in the BIOS; in the OS, do core pinning and the other optimizations.
  2. Host side: enable VT-d in the BIOS (otherwise virtual machines cannot be created); in the OS, set the MTU to the maximum (9000). Otherwise, if the host MTU is 1500 but the MTU inside the VM is set to 9000, pings will fail. When creating the VM, allocate half of the physical cores and about 100 GB of memory to it, otherwise performance will be very low. Do not do core pinning in the host OS, but do the following: a. systemctl status NetworkManager — if it is active, stop it, and also run systemctl stop iptables.service; b. getenforce should report disabled, and run tuned-adm profile balanced. Every time the MTU is changed, kill all netperf, netperf.sh, netserver and netserver.sh processes on both the client and the SUT.
  3. Check the number of CPUs in the VM: lscpu
  4. Check the memory in the VM: free -h
  5. Kill a process: killall -9 iperf
  6. The guest VM's CPU and memory must be set manually; by default it gets one core and a small memory allocation. Give the VM at least half of the physical machine's CPUs and about 100 GB of memory, otherwise performance will be very low.
  7. The NIC VF depends unconditionally on the PF. If changes have been made to the PF, VF-to-client communication can break: the link shows "yes" but pings still fail. In that case, reboot the VM so that the VF automatically picks up the PF changes and communication recovers.
  8. The VM is independent of its physical host from the kernel layer up to user space, so actions taken on the physical host (stop irqbalance, SMP core pinning, stop firewalld, stop NetworkManager, disable SELinux/getenforce, etc.) must all be repeated inside the VM.
  9. VF-related tests are not affected by intel_iommu=on. With the IOMMU enabled on the server and the VM created, the kernel inside the VM guest is still in its initial OS state; the client uses iperf to push traffic at the VM guest, and that iperf depends on the kernel inside the VM, so VF performance tests are not affected by intel_iommu=on. It is fine to keep intel_iommu=on on the server; after the VF tests are finished, simply restore the server kernel command line.
10. Inside the VM the device has only 8 IRQs, so ethtool -L <interface> combined 20 cannot be used.

Solution: on the host, first run mst start, then mlxconfig -d /dev/mst/m… set NUM_VF_MSIX=24. After the host is rebooted, the number of interrupts available in the VM increases.
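After the host reboot, you can check inside the VM whether more channels are available, for example (the interface name is illustrative):

# ethtool -l <vf_interface>

# ethtool -L <vf_interface> combined 20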

 

 

 
