This article summarizes the problems you may run into when using the DPDK driver with Mellanox NICs in a Kubernetes (k8s) environment, together with ways to solve them.
1. Do not mount the /sys directory into the pod
For NICs from other vendors, such as Intel's X710, the /sys directory must be mounted into the pod if you want to use a DPDK driver in k8s, because DPDK reads files under this directory during startup. Mellanox NICs are special, however: even when driven by DPDK they must remain bound to the kernel driver mlx5_core (the mlx5 PMD follows the bifurcated driver model).
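A quick way to confirm that binding is the driver symlink under the device's sysfs entry. The program below is only an illustrative sketch (not part of DPDK); the PCI address is the example one used later in this article.

/* Sketch: print which kernel driver a PCI device is bound to by reading
 * the driver symlink in sysfs. Expect mlx5_core for a Mellanox port. */
#include <stdio.h>
#include <unistd.h>
#include <libgen.h>

int main(void)
{
    const char *pci = "0000:00:09.0";   /* example PCI address */
    char link[128], target[256];
    ssize_t n;

    snprintf(link, sizeof(link), "/sys/bus/pci/devices/%s/driver", pci);
    n = readlink(link, target, sizeof(target) - 1);
    if (n < 0) {
        perror("readlink");             /* not bound to any kernel driver */
        return 1;
    }
    target[n] = '\0';
    printf("%s is bound to %s\n", pci, basename(target));
    return 0;
}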
If the host's /sys directory is mounted into the pod, you will get the following error.
net_mlx5: port 0 cannot get MAC address, is mlx5_en loaded? (errno: No such file or directory)
net_mlx5: probe of PCI device 0000:00:09.0 aborted after encountering an error: No such device
EAL: Requested device 0000:00:09.0 cannot be used
The reason is that the host's /sys/ overrides the contents of the pod's own /sys/, while the mlx5 driver needs to read directories such as /sys/devices/pci0000:00/0000:00:09.0/net/; once they are overridden, the probe fails. Let's walk through the code.
mlx5_pci_probe
  mlx5_dev_spawn
    /* Configure the first MAC address by default. */
    if (mlx5_get_mac(eth_dev, &mac.addr_bytes)) {
        DRV_LOG(ERR,
                "port %u cannot get MAC address, is mlx5_en"
                " loaded? (errno: %s)",
                eth_dev->data->port_id, strerror(rte_errno));
        err = ENODEV;
        goto error;
    }
    // If the host's /sys/ overrides the pod's /sys/ content, this is where things go wrong (exactly which line fails still needs further investigation)
int
mlx5_get_mac(struct rte_eth_dev *dev, uint8_t (*mac)[ETHER_ADDR_LEN])
{
    struct ifreq request;
    int ret;

    /* mlx5_ifreq() internally does roughly:
     *     sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
     *     mlx5_get_ifname(dev, &ifr->ifr_name);   // resolve the kernel ifname via sysfs
     *     ioctl(sock, req, ifr);
     */
    ret = mlx5_ifreq(dev, SIOCGIFHWADDR, &request);
    if (ret)
        return ret;
    memcpy(mac, request.ifr_hwaddr.sa_data, ETHER_ADDR_LEN);
    return 0;
}
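To see why hiding the pod's /sys breaks this: mlx5_ifreq() must first resolve the kernel interface name, and that name is looked up under the PCI device's sysfs directory (for example /sys/devices/pci0000:00/0000:00:09.0/net/, as mentioned above). The standalone sketch below, which is not the actual mlx5_get_ifname() code, shows this kind of lookup and how it fails once the directory is no longer visible.

/* Sketch: find the netdev name belonging to a PCI device by listing the
 * net/ subdirectory of its sysfs entry. */
#include <dirent.h>
#include <stdio.h>

static int get_ifname_from_sysfs(const char *pci_sysfs_dir, char *ifname, size_t len)
{
    char path[300];
    struct dirent *ent;
    DIR *dir;

    /* e.g. /sys/devices/pci0000:00/0000:00:09.0/net */
    snprintf(path, sizeof(path), "%s/net", pci_sysfs_dir);
    dir = opendir(path);
    if (dir == NULL)
        return -1;              /* what happens when the host /sys hides this directory */
    while ((ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] == '.')
            continue;           /* skip "." and ".." */
        snprintf(ifname, len, "%s", ent->d_name);
        closedir(dir);
        return 0;               /* found e.g. "eth1" */
    }
    closedir(dir);
    return -1;
}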
2. DPDK cannot find the Mellanox NIC during startup
Sometimes the following error appears: the Mellanox NIC cannot be found, and the message asks whether the kernel drivers are loaded. Outside of a k8s environment this hint is useful, because the NIC may indeed not be bound to mlx5_core.
EAL: PCI device 0000:00:06.0 on NUMA socket -1
EAL: Invalid NUMA socket, default to 0
EAL: probe driver: 15b3:101a net_mlx5
net_mlx5: no Verbs device matches PCI device 0000:00:06.0, are kernel drivers loaded?
EAL: Requested device 0000:00:06.0 cannot be used
In a k8s environment, however, there is another cause. The symptom: if the pod is started with privileged permissions, DPDK recognizes the NIC and starts successfully; without privileged permissions, it fails with the error above.
Let's first analyze why this error is reported. The DPDK code is as follows:
mlx5_pci_probe
    unsigned int n = 0;

    // call ibv_get_device_list() from libibverbs to obtain the list of ibv devices
    ibv_list = mlx5_glue->get_device_list(&ret);
    while (ret-- > 0) {
        ibv_match[n++] = ibv_list[ret];
    }
    // the warning below is the one we hit, so n is 0; n is only incremented while walking ret above, which means ret must also have been 0
    if (!n) {
        DRV_LOG(WARNING,
                "no Verbs device matches PCI device " PCI_PRI_FMT ","
                " are kernel drivers loaded?",
                pci_dev->addr.domain, pci_dev->addr.bus,
                pci_dev->addr.devid, pci_dev->addr.function);
        rte_errno = ENOENT;
        ret = -rte_errno;
    }
To understand why ret is 0, we have to look at the libibverbs source. It can be found in the OFED package, at the following path: MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu16.04-x86_64/src/MLNX_OFED_SRC-5.2-2.2.0.0/SOURCES/rdma-core-52mlnx1/libibverbs
LATEST_SYMVER_FUNC(ibv_get_device_list, 1_1, "IBVERBS_1.1",
                   struct ibv_device **,
                   int *num)
    ibverbs_get_device_list(&device_list);
        find_sysfs_devs(&sysfs_list);
            setup_sysfs_dev
                try_access_device(sysfs_dev)
                    struct stat cdev_stat;
                    char *devpath;
                    int ret;

                    // check whether the character device exists, e.g. /dev/infiniband/uverbs0
                    if (asprintf(&devpath, RDMA_CDEV_DIR"/%s", sysfs_dev->sysfs_name) < 0)
                        return ENOMEM;
                    ret = stat(devpath, &cdev_stat);
                    free(devpath);
                    return ret;
As the code above shows, ibverbs checks whether /dev/infiniband/uverbs0 exists (there is one uverbs device per NIC); if the file is missing, it concludes that no device was found.
If the pod has privileged permissions, it can access /dev/infiniband/uverbs0 and the other device nodes; without privileged permissions, the /dev/infiniband directory does not exist inside the pod at all.
// when the pod does not have privileged permissions
root@pod-dpdk:~/dpdk/x86_64-native-linuxapp-gcc/app# ls /dev
core fd full mqueue null ptmx pts random shm stderr stdin stdout termination-log tty urandom zero
// when the pod has privileged permissions, infiniband is visible
root@pod-dpdk:~/# ls /dev/
autofs infiniband mqueue sda2 tty12 tty28 tty43 tty59 ttyS16 ttyS31 vcs4 vfio
...
root@pod-dpdk:~# ls /dev/infiniband/
uverbs0
So the key is whether /dev/infiniband/uverbs0 exists inside the pod. If these /dev/infiniband files are missing from the pod, the error above is reported.
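If you want to verify this from inside the pod without running DPDK, a tiny standalone program that mirrors try_access_device() is enough (illustration only, assuming the usual uverbs0 name):

/* Sketch: check whether the uverbs character device libibverbs expects is visible. */
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;
    const char *devpath = "/dev/infiniband/uverbs0";    /* one uverbsN per NIC */

    if (stat(devpath, &st) != 0) {
        perror(devpath);    /* this is the case that makes ibv_get_device_list() return 0 devices */
        return 1;
    }
    printf("%s is present\n", devpath);
    return 0;
}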
So how do we get these files into the pod? There are two ways.
a. Give the pod privileged permissions. In this case /dev/ does not even need to be mounted into the pod, because a privileged pod can access these host files directly.
b. Use the k8s sriov-network-device-plugin. Pay attention to the version: only newer versions avoid this problem. For an mlx NIC it passes the required devices to the container through docker's --device mechanism. In the environment below, for example, /dev/infiniband/uverbs2 and related files are mounted into the container:
root# docker inspect 1dfe96c8eff4
[
    {
        "Id": "1dfe96c8eff4c8ede0d8eb4e480fec9f002f68c4da1bb5265580ee968c6d7502",
        "Created": "2021-04-12T03:24:22.598030845Z",
        ...
        "HostConfig": {
            ...
            "CapAdd": [
                "NET_RAW",
                "NET_ADMIN",
                "IPC_LOCK"
            ],
            "Privileged": false,
            "Devices": [
                {
                    "PathOnHost": "/dev/infiniband/ucm2",
                    "PathInContainer": "/dev/infiniband/ucm2",
                    "CgroupPermissions": "rwm"
                },
                {
                    "PathOnHost": "/dev/infiniband/issm2",
                    "PathInContainer": "/dev/infiniband/issm2",
                    "CgroupPermissions": "rwm"
                },
                {
                    "PathOnHost": "/dev/infiniband/umad2",
                    "PathInContainer": "/dev/infiniband/umad2",
                    "CgroupPermissions": "rwm"
                },
                {
                    "PathOnHost": "/dev/infiniband/uverbs2",
                    "PathInContainer": "/dev/infiniband/uverbs2",
                    "CgroupPermissions": "rwm"
                },
                {
                    "PathOnHost": "/dev/infiniband/rdma_cm",
                    "PathInContainer": "/dev/infiniband/rdma_cm",
                    "CgroupPermissions": "rwm"
                }
            ],
            ...
        },
The sriov-network-device-plugin code that mounts these host paths into the pod:
// NewRdmaSpec returns the RdmaSpec
func NewRdmaSpec(pciAddrs string) types.RdmaSpec {
	deviceSpec := make([]*pluginapi.DeviceSpec, 0)
	isSupportRdma := false
	rdmaResources := rdmamap.GetRdmaDevicesForPcidev(pciAddrs)
	if len(rdmaResources) > 0 {
		isSupportRdma = true
		for _, res := range rdmaResources {
			resRdmaDevices := rdmamap.GetRdmaCharDevices(res)
			for _, rdmaDevice := range resRdmaDevices {
				deviceSpec = append(deviceSpec, &pluginapi.DeviceSpec{
					HostPath:      rdmaDevice,
					ContainerPath: rdmaDevice,
					Permissions:   "rwm",
				})
			}
		}
	}
	return &rdmaSpec{isSupportRdma: isSupportRdma, deviceSpec: deviceSpec}
}
In the sriov plugin's log you can see that it passes several files under /dev/infiniband to the pod:
###/var/log/sriovdp/sriovdp.INFO
I0412 03:17:57.120886 5327 server.go:123] AllocateResponse send: &AllocateResponse{ContainerResponses:[]
*ContainerAllocateResponse{&ContainerAllocateResponse{Envs:map[string]string{PCIDEVICE_INTEL_COM_DP_SRIOV_MLX5: 0000:00:0a.0,},
Mounts:[]*Mount{},Devices:[]*DeviceSpec{&DeviceSpec{ContainerPath:/dev/infiniband/ucm2,HostPath:/dev/infiniband/ucm2,Permissions:rwm,},
&DeviceSpec{ContainerPath:/dev/infiniband/issm2,HostPath:/dev/infiniband/issm2,Permissions:rwm,},
&DeviceSpec{ContainerPath:/dev/infiniband/umad2,HostPath:/dev/infiniband/umad2,Permissions:rwm,},
&DeviceSpec{ContainerPath:/dev/infiniband/uverbs2,HostPath:/dev/infiniband/uverbs2,Permissions:rwm,},
&DeviceSpec{ContainerPath:/dev/infiniband/rdma_cm,HostPath:/dev/infiniband/rdma_cm,Permissions:rwm,},},
Annotations:map[string]string{},},},}
3. DPDK fails to start without privilege
For security reasons pods are usually not given privileged permissions, and then starting DPDK inside the pod fails with the following error:
root@pod-dpdk:~/dpdk/x86_64-native-linuxapp-gcc/app# ./l2fwd -cf -n4 -w 00:09.0 -- -p1
EAL: Detected 4 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: Cannot obtain physical addresses: No such file or directory. Only vfio will function.
EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
error allocating rte services array
Add --iova-mode=va to the DPDK startup parameters; with that, no privilege is required. As the log above shows, an unprivileged pod cannot obtain physical addresses, and running in VA IOVA mode removes the need for them.
root@pod-dpdk:~/dpdk/x86_64-native-linuxapp-gcc/app# ./l2fwd -cf -n4 -w 00:09.0 --iova-mode=va -- -p1
EAL: Detected 4 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
EAL: PCI device 0000:00:09.0 on NUMA socket -1
EAL: Invalid NUMA socket, default to 0
EAL: probe driver: 15b3:1016 net_mlx5
4. Reading NIC statistics fails
DPDK provides two functions for retrieving NIC counters: rte_eth_stats_get and rte_eth_xstats_get. The former returns a fixed set of basic counters, such as received/transmitted packets and bytes; the latter returns extended counters, which are specific to each NIC type.
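As a quick illustration of how the two APIs are called (dump_port_stats below is just a hypothetical helper, not part of DPDK):

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <rte_ethdev.h>

/* Sketch: print the basic and the extended counters of one port. */
static void dump_port_stats(uint16_t port_id)
{
    struct rte_eth_stats stats;
    struct rte_eth_xstat *xstats;
    struct rte_eth_xstat_name *names;
    int i, n;

    /* Basic counters: a fixed set (packets, bytes, errors, ...). */
    if (rte_eth_stats_get(port_id, &stats) == 0)
        printf("port %u: ipackets=%" PRIu64 " opackets=%" PRIu64 "\n",
               port_id, stats.ipackets, stats.opackets);

    /* Extended counters: query the count first, then fetch names and values. */
    n = rte_eth_xstats_get(port_id, NULL, 0);
    if (n <= 0)
        return;
    xstats = calloc(n, sizeof(*xstats));
    names = calloc(n, sizeof(*names));
    if (xstats != NULL && names != NULL &&
        rte_eth_xstats_get(port_id, xstats, n) == n &&
        rte_eth_xstats_get_names(port_id, names, n) == n) {
        for (i = 0; i < n; i++)
            printf("  %s: %" PRIu64 "\n", names[i].name, xstats[i].value);
    }
    free(xstats);
    free(names);
}

On mlx5 the extended counters are what end up in mlx5_xstats_get, which is where the problem described below appears.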
For Mellanox NICs, the mlx5 DPDK driver implements the corresponding callbacks:
rte_eth_stats_get -> stats_get -> mlx5_stats_get
rte_eth_xstats_get -> xstats_get -> mlx5_xstats_get
mlx5_stats_get works fine inside the pod, but mlx5_xstats_get runs into a problem.
mlx5_xstats_get -> mlx5_read_dev_counters: this function reads the files under the following path to obtain the counters.
root# ls /sys/devices/pci0000\:00/0000\:00\:09.0/infiniband/mlx5_0/ports/1/hw_counters/
duplicate_request out_of_buffer req_cqe_flush_error resp_cqe_flush_error rx_atomic_requests
implied_nak_seq_err out_of_sequence req_remote_access_errors resp_local_length_error rx_read_requests
lifespan packet_seq_err req_remote_invalid_request resp_remote_access_errors rx_write_requests
local_ack_timeout_err req_cqe_error resp_cqe_error rnr_nak_retry_err
Inside the pod, however, the same path contains no hw_counters directory. That directory is created on the host when the driver is loaded, and it is not visible inside the pod.
root@pod-dpdk:~/dpdk/x86_64-native-linuxapp-gcc/app# ls /sys/devices/pci0000\:00/0000\:00\:09.0/infiniband/mlx5_0/ports/1/
cap_mask gid_attrs gids lid lid_mask_count link_layer phys_state pkeys rate sm_lid sm_sl state
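mlx5_read_dev_counters essentially opens these hw_counters files and parses the number inside each one. The standalone sketch below (not the actual DPDK code) shows that access pattern; run inside the pod it fails precisely because the directory is missing.

/* Sketch: read one counter value from the hw_counters directory shown above. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *path =
        "/sys/devices/pci0000:00/0000:00:09.0/infiniband/mlx5_0/ports/1/hw_counters/out_of_buffer";
    char buf[64];
    FILE *f = fopen(path, "r");

    if (f == NULL) {
        perror(path);    /* inside the pod this fails because hw_counters does not exist */
        return 1;
    }
    if (fgets(buf, sizeof(buf), f) != NULL)
        printf("out_of_buffer = %llu\n", strtoull(buf, NULL, 10));
    fclose(f);
    return 0;
}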
Possible solutions:
a. Manually mount /sys/devices/pci0000:00/0000:00:09.0/infiniband/mlx5_0/ports/1/hw_counters/ into the pod.
b. Modify the sriov-network-device-plugin code so that it mounts the directory above automatically.
For reference, here is the spec of the pod-dpdk pod used in this article (this variant runs with privileged: true):
root# cat dpdk-mlx.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-dpdk
  annotations:
    k8s.v1.cni.cncf.io/networks: host-device1
spec:
  nodeName: node1
  containers:
  - name: appcntr3
    image: l2fwd:v3
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    securityContext:
      privileged: true
    resources:
      requests:
        memory: 100Mi
        hugepages-2Mi: 500Mi
        cpu: '3'
      limits:
        hugepages-2Mi: 500Mi
        cpu: '3'
        memory: 100Mi
    volumeMounts:
    - mountPath: /mnt/huge
      name: hugepage
      readOnly: False
    - mountPath: /var/run
      name: var
      readOnly: False
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
  - name: var
    hostPath:
      path: /var/run/