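RDMA plays an important role in high-performance computing and in training large AI models.
The mainstream protocols that support RDMA are IB (InfiniBand), RoCEv1, RoCEv2, and iWARP.
Of these, RoCEv2 is the most widely used: it carries RDMA over UDP/IP, so it does not depend on expensive IB network equipment, it is routable, and its performance is close to that of native IB.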
Two virtual machines are used, with their test interfaces on the same layer-2 network (attached to the same bridge).
node-1 configuration
ifconfig ens192 10.0.0.1/24
node-2 configuration
ifconfig ens192 10.0.0.2/24
Ping node-2 from node-1
root@u20-test:~# ping 10.0.0.2
PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.
64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=0.277 ms
64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=0.284 ms
64 bytes from 10.0.0.2: icmp_seq=3 ttl=64 time=0.265 ms
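The ping works without any gateway because both addresses sit in the same /24 on the shared bridge. As a quick sanity check, here is a minimal sketch using Python's standard `ipaddress` module; the helper name `same_subnet` is made up for this illustration.

```python
import ipaddress


def same_subnet(a: str, b: str) -> bool:
    """Return True when two 'address/prefix' strings belong to the same IP network."""
    return ipaddress.ip_interface(a).network == ipaddress.ip_interface(b).network


# Addresses taken from the ifconfig commands above.
node1 = "10.0.0.1/24"
node2 = "10.0.0.2/24"
print(same_subnet(node1, node2))
```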
sudo apt install libibverbs1 ibverbs-utils librdmacm1 libibumad3 ibverbs-providers rdma-core rdmacm-utils perftest
root@u20-test:~# rdma link add 1 type rxe netdev ens192
root@u20-test:~#
root@u20-test:~# rdma link
link rocep11s0/1 state ACTIVE physical_state LINK_UP netdev ens192
root@u20-test:~#
root@u20-test:~# ibv_devices
device node GUID
------ ----------------
rocep11s0 020c29fffe3ed0e9
root@u20-test:~# ibv_devinfo -d rocep11s0
hca_id: rocep11s0
transport: InfiniBand (0)
fw_ver: 0.0.0
node_guid: 020c:29ff:fe3e:d0e9
sys_image_guid: 020c:29ff:fe3e:d0e9
vendor_id: 0x0000
vendor_part_id: 0
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
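The node GUID above has the shape of a modified EUI-64 built from the interface's MAC address: `ff:fe` is inserted in the middle and the universal/local bit of the first octet is flipped. Assuming the rxe device derives its GUID this way, the MAC of ens192 can be recovered from the GUID. The sketch below shows both directions; the MAC it yields is inferred from the GUID, not taken from the original capture, and the function names are hypothetical.

```python
def mac_to_eui64_guid(mac: str) -> str:
    """Expand a 48-bit MAC into a modified EUI-64 GUID (hex digits, no separators)."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected six colon-separated octets")
    octets[0] ^= 0x02  # flip the universal/local bit
    return (octets[:3] + b"\xff\xfe" + octets[3:]).hex()


def guid_to_mac(guid: str) -> str:
    """Recover the MAC from a modified EUI-64 GUID such as '020c:29ff:fe3e:d0e9'."""
    raw = bytearray.fromhex(guid.replace(":", ""))
    if len(raw) != 8 or raw[3:5] != b"\xff\xfe":
        raise ValueError("not a MAC-derived EUI-64 GUID")
    raw[0] ^= 0x02  # undo the universal/local bit flip
    mac = raw[:3] + raw[5:]
    return ":".join(f"{byte:02x}" for byte in mac)


print(guid_to_mac("020c:29ff:fe3e:d0e9"))
```

If the assumption holds, comparing the result with the MAC shown by `ip link show ens192` is a quick way to confirm which netdev an rxe device is bound to.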
root@u20-test:~# rdma link add rxeee type rxe netdev ens192
root@u20-test:~# ibv_devices
device node GUID
------ ----------------
rocep11s0 020c29fffe8ce06a
root@u20-test:~# ibv_devinfo
hca_id: rocep11s0
transport: InfiniBand (0)
fw_ver: 0.0.0
node_guid: 020c:29ff:fe8c:e06a
sys_image_guid: 020c:29ff:fe8c:e06a
vendor_id: 0x0000
vendor_part_id: 0
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
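Both devices report active_mtu 1024 (3) even though max_mtu is 4096 (5). A plausible explanation is that the active MTU is the largest IB MTU size that still fits inside the Ethernet MTU once room is left for headers. The sketch below illustrates that selection. It rests on two assumptions that are not in the output above: the netdev MTU is the usual 1500 bytes, and the header allowance is 80 bytes (a value chosen for illustration). The enum values 3 and 5 do come from the ibv_devinfo output.

```python
# IB MTU sizes in bytes mapped to their enum values, as printed by ibv_devinfo.
IB_MTU_ENUM = {256: 1, 512: 2, 1024: 3, 2048: 4, 4096: 5}

# Assumption: bytes set aside for protocol headers; not taken from the article.
ASSUMED_HEADER_ALLOWANCE = 80


def pick_ib_mtu(netdev_mtu: int, header_allowance: int = ASSUMED_HEADER_ALLOWANCE):
    """Return (size, enum) of the largest IB MTU that fits in the netdev MTU."""
    budget = netdev_mtu - header_allowance
    fitting = [size for size in IB_MTU_ENUM if size <= budget]
    size = max(fitting) if fitting else 256  # fall back to the smallest IB MTU
    return size, IB_MTU_ENUM[size]


print(pick_ib_mtu(1500))
```

Under these assumptions, raising the netdev MTU (for example to a jumbo-frame value) would be expected to raise active_mtu as well.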
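Start the server on node-2
The -R option tells the tool to establish the connection through rdma_cm (the RDMA connection manager).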
root@u20-test:~# ib_send_bw -d rocep11s0 -R
************************************
* Waiting for client to connect... *
************************************
node-1 connects to node-2
root@u20-test:~# ib_send_bw -d rocep11s0 10.0.0.2 -R
---------------------------------------------------------------------------------------
Send BW Test
Dual-port : OFF Device : rocep11s0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 1
Max inline data : 0[B]
rdma_cm QPs : ON
Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0012 PSN 0x184d5e
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:00:00:01
remote address: LID 0000 QPN 0x0012 PSN 0x9b467e
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:00:00:02
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 1000 30.08 28.94 0.000463
---------------------------------------------------------------------------------------
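The GID lines in this output look odd at first: the 16 bytes appear to be printed in decimal rather than hex, and together they form an IPv4-mapped IPv6 address (::ffff:a.b.c.d). That is how a RoCEv2 GID encodes the interface's IPv4 address. A minimal sketch that decodes the string back to an IP address follows; the function name is hypothetical, and the decimal-byte format is inferred from the `255:255:10` portion of the output.

```python
import ipaddress


def perftest_gid_to_ip(gid_text: str):
    """Decode a colon-separated GID printed as 16 decimal bytes into an IP address."""
    octets = bytes(int(part) for part in gid_text.split(":"))
    if len(octets) != 16:
        raise ValueError("a GID is exactly 16 bytes")
    addr = ipaddress.IPv6Address(octets)
    # ipv4_mapped is None unless the address has the ::ffff:a.b.c.d form.
    return addr.ipv4_mapped or addr


local_gid = "00:00:00:00:00:00:00:00:00:00:255:255:10:00:00:01"
print(perftest_gid_to_ip(local_gid))
```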
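Key information:
In the Base Transport Header (BTH), the destination QP 0x000001 identifies the management QP (QP1, the General Services Interface).
In the DETH, the Queue Key 0x80010000 is the special privileged value used for CM messages.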
IB Spec:
To prevent address spoofing attempts by user applications, the
source IP address and the port number shall be filled in by privileged
kernel mode. The passive side shall verify that the CM REQ Message
contains a privileged Q-key and its value is 0x80010000.
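In the CM header, Local QPN 0x000011 is the local Queue Pair Number for this connection; the peer must put this QPN in the BTH destination QP field of every message it sends to this side.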
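To make the fields above concrete, here is a minimal sketch that decodes the 12-byte BTH and the 8-byte DETH with Python's `struct` module. The sample bytes are built by hand for illustration (a UD Send Only packet to QP1 with the CM Q_Key); they are not copied from the capture, and the PSN and P_Key values in them are placeholders.

```python
import struct


def parse_bth(data: bytes) -> dict:
    """Decode the 12-byte Base Transport Header."""
    if len(data) < 12:
        raise ValueError("BTH needs 12 bytes")
    opcode, flags, pkey, qp_word, psn_word = struct.unpack("!BBHII", data[:12])
    return {
        "opcode": opcode,
        "solicited_event": bool(flags & 0x80),
        "pad_count": (flags >> 4) & 0x3,
        "pkey": pkey,
        "dest_qp": qp_word & 0xFFFFFF,   # low 24 bits of the word
        "ack_req": bool(psn_word >> 31),  # top bit of the PSN word
        "psn": psn_word & 0xFFFFFF,      # low 24 bits of the word
    }


def parse_deth(data: bytes) -> dict:
    """Decode the 8-byte Datagram Extended Transport Header."""
    if len(data) < 8:
        raise ValueError("DETH needs 8 bytes")
    qkey, qp_word = struct.unpack("!II", data[:8])
    return {"qkey": qkey, "src_qp": qp_word & 0xFFFFFF}


# Hand-built sample: opcode 0x64 (UD Send Only), P_Key 0xFFFF, dest QP 1, PSN 0.
sample = struct.pack("!BBHII", 0x64, 0x00, 0xFFFF, 0x000001, 0x000000)
sample += struct.pack("!II", 0x80010000, 0x000001)

bth = parse_bth(sample[:12])
deth = parse_deth(sample[12:20])
print(hex(bth["dest_qp"]), hex(deth["qkey"]))
```

In a real capture these headers follow the UDP header of the RoCEv2 packet, so the offsets into the frame would need to be adjusted accordingly.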
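Ready to Use (RTU)
Data is then sent to the peer's QPN: one Send Only and one ACK.
node-1 and node-2 each send a disconnect request, and each then replies to the other's request.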
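The CM exchange described above can be summarized by the attribute IDs carried in the CM MADs. The sketch below maps those IDs to message names and labels the sequence seen in this test (request, reply, ready to use, then a disconnect request and reply from each side). The exact ordering of the two disconnect pairs is an assumption made for illustration, and the helper name is hypothetical.

```python
# CM MAD attribute IDs and the messages they identify.
CM_ATTRIBUTE_NAMES = {
    0x0010: "REQ (connection request)",
    0x0011: "MRA (message receipt acknowledgement)",
    0x0012: "REJ (reject)",
    0x0013: "REP (reply to request)",
    0x0014: "RTU (ready to use)",
    0x0015: "DREQ (disconnect request)",
    0x0016: "DREP (disconnect reply)",
}


def describe_cm_flow(attribute_ids):
    """Translate a list of CM attribute IDs into readable message names."""
    return [CM_ATTRIBUTE_NAMES.get(a, f"unknown 0x{a:04x}") for a in attribute_ids]


# Assumed order for this test: setup, then both sides disconnect and reply.
observed = [0x0010, 0x0013, 0x0014, 0x0015, 0x0015, 0x0016, 0x0016]
for name in describe_cm_flow(observed):
    print(name)
```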
Start the server on node-2 without -R; by default the connection parameters are exchanged over a socket.
root@u20-test:~# ib_send_bw -d rocep11s0
************************************
* Waiting for client to connect... *
************************************
node-1 connects to node-2
root@u20-test:~# ib_send_bw -d rocep11s0 10.0.0.2
---------------------------------------------------------------------------------------
Send BW Test
Dual-port : OFF Device : rocep11s0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 1
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0011 PSN 0xe5662a
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:00:00:01
remote address: LID 0000 QPN 0x0011 PSN 0x4e7fad
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:00:00:02
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 1000 27.39 26.82 0.000429
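The MsgRate column can be cross-checked against the average bandwidth. Assuming "MB" in the table means 2^20 bytes (an assumption, but one that reproduces the printed figures for both runs), the message rate is simply bandwidth divided by message size. A minimal sketch:

```python
def msg_rate_mpps(bw_avg_mb_per_s: float, msg_bytes: int) -> float:
    """Convert average bandwidth (MB/s, with MB assumed to be 2**20 bytes) to Mpps."""
    messages_per_second = bw_avg_mb_per_s * 2**20 / msg_bytes
    return messages_per_second / 1e6


# Figures from the two result tables: rdma_cm run and socket run.
for bw in (28.94, 26.82):
    print(f"{bw} MB/s -> {msg_rate_mpps(bw, 65536):.6f} Mpps")
```

With 65536-byte messages this works out to roughly 16 messages per MB, which is why both runs show only a few hundred messages per second on Soft-RoCE inside a VM.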
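After the TCP three-way handshake there are a dozen or so packets whose purpose is unclear, followed by the connection setup request and response.
In the middle of the payload, 000011 is the local QPN and e5662a is the local PSN.
The first data packet carries the PSN e5662a that was specified during connection setup, and its QPN is 0x000011.
The Acknowledge is the response to Send Last and indicates that the preceding several dozen packets have been received.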
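The "several dozen packets" follow from the test parameters: a 65536-byte message sent with a 1024-byte MTU is split into 64 packets (one Send First, 62 Send Middle, one Send Last), with the PSN increasing by one per packet from the starting value. The sketch below reproduces that segmentation. It assumes the 24-bit PSN wraps modulo 2^24 and that the ACK in question covers one whole message; the function name is hypothetical.

```python
PSN_MASK = 0xFFFFFF  # PSNs are 24 bits wide


def segment_send(message_len: int, path_mtu: int, first_psn: int):
    """Return (packet kind, PSN) pairs for one RC Send message."""
    count = max(1, -(-message_len // path_mtu))  # ceiling division
    packets = []
    for i in range(count):
        if count == 1:
            kind = "SEND Only"
        elif i == 0:
            kind = "SEND First"
        elif i == count - 1:
            kind = "SEND Last"
        else:
            kind = "SEND Middle"
        packets.append((kind, (first_psn + i) & PSN_MASK))
    return packets


# Message size and MTU from the perftest output, PSN from the local address line.
packets = segment_send(65536, 1024, 0xE5662A)
print(len(packets), packets[0][0], hex(packets[0][1]), packets[-1][0], hex(packets[-1][1]))
```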
As in the CM case, teardown is a four-way exchange.
https://zhuanlan.zhihu.com/p/164908617
IB Specification Vol 1-Release-1.4-2020-04-07