Summary of issues when using the DPDK driver with Mellanox NICs in k8s

This article summarizes the problems you may run into when using the DPDK driver with Mellanox NICs in a k8s environment, and how to solve them.

1. The /sys directory must not be mounted into the pod
For NICs from other vendors, such as Intel's X710, the /sys directory must be mounted if you want to use the DPDK driver in k8s, because DPDK reads files under this directory during startup. Mellanox NICs are special, however: even when using the DPDK driver they must remain bound to the kernel driver mlx5_core.

If the host's /sys directory is mounted into the pod, the following error is reported.

net_mlx5: port 0 cannot get MAC address, is mlx5_en loaded? (errno: No such file or directory)
net_mlx5: probe of PCI device 0000:00:09.0 aborted after encountering an error: No such device
EAL: Requested device 0000:00:09.0 cannot be used

The reason is that the host's /sys/ overrides the content of /sys/ inside the pod, while the mlx driver reads these directories, for example /sys/devices/pci0000:00/0000:00:09.0/net/; once they are overridden, the probe fails. The relevant code is analyzed below:

mlx5_pci_probe
    mlx5_dev_spawn
        /* Configure the first MAC address by default. */
        if (mlx5_get_mac(eth_dev, &mac.addr_bytes)) {
            DRV_LOG(ERR,
                "port %u cannot get MAC address, is mlx5_en"
                " loaded? (errno: %s)",
                eth_dev->data->port_id, strerror(rte_errno));
            err = ENODEV;
            goto error;
        }

//if the host's /sys/ overrides the pod's /sys/ contents, this call fails (exactly which line fails is still to be investigated)
int
mlx5_get_mac(struct rte_eth_dev *dev, uint8_t (*mac)[ETHER_ADDR_LEN])
{
    struct ifreq request;
    int ret;

    ret = mlx5_ifreq(dev, SIOCGIFHWADDR, &request);
        /* mlx5_ifreq() internally does roughly the following: */
        int sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
        mlx5_get_ifname(dev, &ifr->ifr_name);
        ioctl(sock, req, ifr);
                
    if (ret)
        return ret;
    memcpy(mac, request.ifr_hwaddr.sa_data, ETHER_ADDR_LEN);
    return 0;
}
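
To see concretely what the driver depends on, here is a small standalone sketch (not DPDK code) that mimics what mlx5_get_ifname/mlx5_get_mac rely on: it looks up the netdev name under the PCI device's sysfs net/ directory and then queries its MAC with SIOCGIFHWADDR. The PCI address 0000:00:09.0 is taken from the log above.

/* Standalone sketch (not DPDK code): find the netdev name under the PCI
 * device's sysfs net/ directory, then query its MAC with SIOCGIFHWADDR.
 * The PCI address matches the log above. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <dirent.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <net/if.h>

int main(void)
{
    const char *netdir = "/sys/devices/pci0000:00/0000:00:09.0/net";
    char ifname[IF_NAMESIZE] = "";

    /* Step 1: find the kernel netdev name under the PCI device's net/ dir.
     * Inside the pod this is exactly what is missing when the host /sys
     * hides the entry, so the MAC lookup below can never succeed. */
    DIR *dir = opendir(netdir);
    if (!dir) {
        perror("opendir(net/)");
        return 1;
    }
    struct dirent *de;
    while ((de = readdir(dir)) != NULL) {
        if (de->d_name[0] != '.') {
            snprintf(ifname, sizeof(ifname), "%s", de->d_name);
            break;
        }
    }
    closedir(dir);
    if (ifname[0] == '\0') {
        fprintf(stderr, "no netdev under %s\n", netdir);
        return 1;
    }

    /* Step 2: the same ioctl the driver issues to read the MAC address. */
    int sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    snprintf(ifr.ifr_name, sizeof(ifr.ifr_name), "%s", ifname);
    if (sock < 0 || ioctl(sock, SIOCGIFHWADDR, &ifr) < 0) {
        perror("SIOCGIFHWADDR");
        return 1;
    }
    unsigned char *mac = (unsigned char *)ifr.ifr_hwaddr.sa_data;
    printf("%s MAC %02x:%02x:%02x:%02x:%02x:%02x\n", ifname,
           mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
    close(sock);
    return 0;
}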

2. DPDK cannot find the Mellanox NIC during startup
Sometimes the following error appears: the Mellanox NIC is not found, and the message asks whether the kernel drivers are loaded. Outside of a k8s environment this hint is useful, because the NIC may indeed not be bound to mlx5_core.

EAL: PCI device 0000:00:06.0 on NUMA socket -1
EAL:   Invalid NUMA socket, default to 0
EAL:   probe driver: 15b3:101a net_mlx5
net_mlx5: no Verbs device matches PCI device 0000:00:06.0, are kernel drivers loaded?
EAL: Requested device 0000:00:06.0 cannot be used

In a k8s environment, however, there is another cause. The symptom is: if the pod is started with privileged permission, DPDK recognizes the NIC and starts successfully; without privileged permission it reports the error above.
Let's first look at why this error is printed. The DPDK code is as follows:

mlx5_pci_probe
    unsigned int n = 0;
    //call ibv_get_device_list() from libibverbs to get the ibv devices
    ibv_list = mlx5_glue->get_device_list(&ret);
    while (ret-- > 0) {
        ibv_match[n++] = ibv_list[ret];
    }
    //the error is reported here, meaning n is 0; n is assigned while iterating over ret above, so ret must also be 0
    if (!n) {
        DRV_LOG(WARNING,
            "no Verbs device matches PCI device " PCI_PRI_FMT ","
            " are kernel drivers loaded?",
            pci_dev->addr.domain, pci_dev->addr.bus,
            pci_dev->addr.devid, pci_dev->addr.function);
        rte_errno = ENOENT;
        ret = -rte_errno;
    }

To understand why ret is 0, we need to look at the libibverbs source code, which can be found in the OFED package at the following path: MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu16.04-x86_64/src/MLNX_OFED_SRC-5.2-2.2.0.0/SOURCES/rdma-core-52mlnx1/libibverbs

LATEST_SYMVER_FUNC(ibv_get_device_list, 1_1, "IBVERBS_1.1",
           struct ibv_device **,
           int *num)
    ibverbs_get_device_list(&device_list);
        find_sysfs_devs(&sysfs_list);
            setup_sysfs_dev
                try_access_device(sysfs_dev)
                    struct stat cdev_stat;
                    char *devpath;
                    int ret;
                    //check whether the file /dev/infiniband/uverbs0 exists
                    if (asprintf(&devpath, RDMA_CDEV_DIR"/%s", sysfs_dev->sysfs_name) < 0)
                        return ENOMEM;

                    ret = stat(devpath, &cdev_stat);
                    free(devpath);
                    return ret;

As the code above shows, ibverbs checks whether the file /dev/infiniband/uverbs0 exists (each NIC has its own uverbs device); if it does not exist, ibverbs concludes that no NIC was found.
If the pod has privileged permission, it can read /dev/infiniband/uverbs0 and the other device files; without privileged permission, the /dev/infiniband directory does not exist inside the pod at all.

//when the pod does not have privileged permission
root@pod-dpdk:~/dpdk/x86_64-native-linuxapp-gcc/app# ls /dev
core  fd  full  mqueue  null  ptmx  pts  random  shm  stderr  stdin  stdout  termination-log  tty  urandom  zero

//when the pod has privileged permission, infiniband is visible
root@pod-dpdk:~/# ls /dev/
autofs           infiniband        mqueue              sda2             tty12  tty28  tty43  tty59   ttyS16  ttyS31     vcs4   vfio
...

root@pod-dpdk:~# ls /dev/infiniband/
uverbs0

So the key is whether the pod has the file /dev/infiniband/uverbs0. If the /dev/infiniband files are missing inside the pod, the error above is reported.
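
The check can be reproduced outside of libibverbs. Below is a minimal standalone sketch (not the library code itself) that mimics try_access_device(): it simply stat()s the uverbs char device. The path /dev/infiniband/uverbs0 matches the single-port host shown above.

/* Minimal sketch mimicking libibverbs' try_access_device(): a verbs
 * device is only usable if its char device exists under /dev/infiniband. */
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;
    const char *devpath = "/dev/infiniband/uverbs0";

    if (stat(devpath, &st) != 0) {
        /* In a non-privileged pod without the device plugin this is the
         * failing branch -> "no Verbs device matches PCI device ..." */
        perror(devpath);
        return 1;
    }
    printf("%s exists, ibv_get_device_list() can report the device\n", devpath);
    return 0;
}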

So how can these files be made available inside the pod? There are two ways:
a. Give the pod privileged permission. With this approach /dev/ does not need to be mounted into the pod, because a privileged pod can access these host files directly.
b. Use the k8s sriov-network-device-plugin. Pay attention to the version: only newer versions avoid this problem. For mlx NICs it passes the required devices to the container through docker's --device mechanism. In the environment below, for example, /dev/infiniband/uverbs2 and the other device files are mounted into the container:

root# docker inspect 1dfe96c8eff4
[
    {
        "Id": "1dfe96c8eff4c8ede0d8eb4e480fec9f002f68c4da1bb5265580ee968c6d7502",
        "Created": "2021-04-12T03:24:22.598030845Z",
        ...
        "HostConfig": {
            ...
            "CapAdd": [
                "NET_RAW",
                "NET_ADMIN",
                "IPC_LOCK"
            ],

            "Privileged": false,

            "Devices": [
                {
                    "PathOnHost": "/dev/infiniband/ucm2",
                    "PathInContainer": "/dev/infiniband/ucm2",
                    "CgroupPermissions": "rwm"
                },
                {
                    "PathOnHost": "/dev/infiniband/issm2",
                    "PathInContainer": "/dev/infiniband/issm2",
                    "CgroupPermissions": "rwm"
                },
                {
                    "PathOnHost": "/dev/infiniband/umad2",
                    "PathInContainer": "/dev/infiniband/umad2",
                    "CgroupPermissions": "rwm"
                },
                {
                    "PathOnHost": "/dev/infiniband/uverbs2",
                    "PathInContainer": "/dev/infiniband/uverbs2",
                    "CgroupPermissions": "rwm"
                },
                {
                    "PathOnHost": "/dev/infiniband/rdma_cm",
                    "PathInContainer": "/dev/infiniband/rdma_cm",
                    "CgroupPermissions": "rwm"
                }
            ],
            ...
        },

The sriov plugin code that mounts the host device paths into the pod:

// NewRdmaSpec returns the RdmaSpec
func NewRdmaSpec(pciAddrs string) types.RdmaSpec {
        deviceSpec := make([]*pluginapi.DeviceSpec, 0)
        isSupportRdma := false
        rdmaResources := rdmamap.GetRdmaDevicesForPcidev(pciAddrs)
        if len(rdmaResources) > 0 {
                isSupportRdma = true
                for _, res := range rdmaResources {
                        resRdmaDevices := rdmamap.GetRdmaCharDevices(res)
                        for _, rdmaDevice := range resRdmaDevices {
                                deviceSpec = append(deviceSpec, &pluginapi.DeviceSpec{
                                        HostPath:      rdmaDevice,  
                                        ContainerPath: rdmaDevice,
                                        Permissions:   "rwm",
                                })
                        }
                }
        }

        return &rdmaSpec{isSupportRdma: isSupportRdma, deviceSpec: deviceSpec}
}

In the sriov plugin log you can see it passing several files under /dev/infiniband to the pod:

###/var/log/sriovdp/sriovdp.INFO
I0412 03:17:57.120886    5327 server.go:123] AllocateResponse send: &AllocateResponse{ContainerResponses:[]
*ContainerAllocateResponse{&ContainerAllocateResponse{Envs:map[string]string{PCIDEVICE_INTEL_COM_DP_SRIOV_MLX5: 0000:00:0a.0,},
Mounts:[]*Mount{},Devices:[]*DeviceSpec{&DeviceSpec{ContainerPath:/dev/infiniband/ucm2,HostPath:/dev/infiniband/ucm2,Permissions:rwm,},
&DeviceSpec{ContainerPath:/dev/infiniband/issm2,HostPath:/dev/infiniband/issm2,Permissions:rwm,},
&DeviceSpec{ContainerPath:/dev/infiniband/umad2,HostPath:/dev/infiniband/umad2,Permissions:rwm,},
&DeviceSpec{ContainerPath:/dev/infiniband/uverbs2,HostPath:/dev/infiniband/uverbs2,Permissions:rwm,},
&DeviceSpec{ContainerPath:/dev/infiniband/rdma_cm,HostPath:/dev/infiniband/rdma_cm,Permissions:rwm,},},
Annotations:map[string]string{},},},}

3. DPDK fails to start without privileges
For security reasons pods are usually not given privileged permission, and then starting DPDK inside the pod fails with the following error:

root@pod-dpdk:~/dpdk/x86_64-native-linuxapp-gcc/app# ./l2fwd -cf -n4  -w 00:09.0 -- -p1
EAL: Detected 4 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: Cannot obtain physical addresses: No such file or directory. Only vfio will function.
EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
error allocating rte services array

Adding --iova-mode=va to the DPDK startup parameters avoids the need for privileges:

root@pod-dpdk:~/dpdk/x86_64-native-linuxapp-gcc/app# ./l2fwd -cf -n4  -w 00:09.0 --iova-mode=va -- -p1
EAL: Detected 4 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
EAL: PCI device 0000:00:09.0 on NUMA socket -1
EAL:   Invalid NUMA socket, default to 0
EAL:   probe driver: 15b3:1016 net_mlx5
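
If the EAL arguments are built inside the application rather than passed on the command line, the same option can be injected there. The sketch below is only an illustration, assuming a DPDK version that accepts --iova-mode (as demonstrated above); the PCI address and channel count mirror the l2fwd invocation.

/* Sketch: forcing --iova-mode=va when the EAL argv is constructed in code.
 * Assumes a DPDK version that supports the --iova-mode option, as shown
 * in the l2fwd run above; error handling is kept minimal. */
#include <stdio.h>
#include <rte_eal.h>

int main(void)
{
    char *eal_args[] = {
        "l2fwd",            /* argv[0]: program name */
        "-n", "4",          /* number of memory channels */
        "-w", "00:09.0",    /* same PCI device as in the log above */
        "--iova-mode=va",   /* use virtual addresses, no physical addresses needed */
    };
    int eal_argc = sizeof(eal_args) / sizeof(eal_args[0]);

    if (rte_eal_init(eal_argc, eal_args) < 0) {
        fprintf(stderr, "rte_eal_init failed\n");
        return 1;
    }
    printf("EAL initialized in VA IOVA mode\n");
    return 0;
}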

4. Failure to read NIC statistics counters

DPDK provides two functions for retrieving NIC counters, rte_eth_stats_get and rte_eth_xstats_get. The former returns a small fixed set of counters, such as received/transmitted packet and byte counts; the latter returns the extended counters, which are specific to each NIC type.
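
As a reference, here is how these two APIs are typically called (a generic sketch, not taken from the mlx5 driver; it assumes the port has already been configured and started):

/* Sketch: dump both the basic and the extended counters of one port.
 * Uses only the standard ethdev APIs; assumes port_id is already started. */
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <rte_ethdev.h>

static void dump_port_counters(uint16_t port_id)
{
    /* Fixed counters: packets / bytes. */
    struct rte_eth_stats stats;
    if (rte_eth_stats_get(port_id, &stats) == 0)
        printf("rx %"PRIu64" pkts %"PRIu64" bytes, tx %"PRIu64" pkts %"PRIu64" bytes\n",
               stats.ipackets, stats.ibytes, stats.opackets, stats.obytes);

    /* Extended, driver-specific counters (this is where mlx5 reads hw_counters). */
    int n = rte_eth_xstats_get(port_id, NULL, 0);   /* query the count first */
    if (n <= 0)
        return;
    struct rte_eth_xstat *xstats = calloc(n, sizeof(*xstats));
    struct rte_eth_xstat_name *names = calloc(n, sizeof(*names));
    if (xstats && names &&
        rte_eth_xstats_get(port_id, xstats, n) == n &&
        rte_eth_xstats_get_names(port_id, names, n) == n) {
        for (int i = 0; i < n; i++)
            printf("%-40s %"PRIu64"\n", names[i].name, xstats[i].value);
    }
    free(xstats);
    free(names);
}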

For Mellanox NICs, the DPDK mlx5 driver provides the corresponding callbacks, as follows:

rte_eth_stats_get -> stats_get -> mlx5_stats_get 
rte_eth_xstats_get -> xstats_get -> mlx5_xstats_get

mlx5_stats_get reads the counters normally inside the pod, but mlx5_xstats_get runs into a problem:
mlx5_xstats_get -> mlx5_read_dev_counters reads the counters from files under the following path:

root# ls /sys/devices/pci0000\:00/0000\:00\:09.0/infiniband/mlx5_0/ports/1/hw_counters/
duplicate_request      out_of_buffer    req_cqe_flush_error         resp_cqe_flush_error       rx_atomic_requests
implied_nak_seq_err    out_of_sequence  req_remote_access_errors    resp_local_length_error    rx_read_requests
lifespan               packet_seq_err   req_remote_invalid_request  resp_remote_access_errors  rx_write_requests
local_ack_timeout_err  req_cqe_error    resp_cqe_error              rnr_nak_retry_err

Inside the pod, however, the hw_counters directory is missing from the same path, because it is created when the kernel driver is loaded and is not visible from inside the pod:

root@pod-dpdk:~/dpdk/x86_64-native-linuxapp-gcc/app# ls /sys/devices/pci0000\:00/0000\:00\:09.0/infiniband/mlx5_0/ports/1/
cap_mask  gid_attrs  gids  lid  lid_mask_count  link_layer  phys_state  pkeys  rate  sm_lid  sm_sl  state

Possible solutions:
a. Manually mount /sys/devices/pci0000:00/0000:00:09.0/infiniband/mlx5_0/ports/1/hw_counters/ into the pod.
b. Modify the sriov-network-device-plugin code so that it mounts the directory above automatically.

Test file

root# cat dpdk-mlx.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-dpdk
  annotations:
    k8s.v1.cni.cncf.io/networks: host-device1
spec:
  nodeName: node1
  containers:
  - name: appcntr3
    image: l2fwd:v3
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    securityContext:
      privileged: true
    resources:
      requests:
        memory: 100Mi
        hugepages-2Mi: 500Mi
        cpu: '3'
      limits:
        hugepages-2Mi: 500Mi
        cpu: '3'
        memory: 100Mi
    volumeMounts:
    - mountPath: /mnt/huge
      name: hugepage
      readOnly: False
    - mountPath: /var/run
      name: var
      readOnly: False
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
  - name: var
    hostPath:
      path: /var/run/

References

https://github.com/Amirzei/mlnx_docker_dpdk/blob/master/Dockerfile
https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/issues/123
https://github.com/DPDK/dpdk/commit/fc40db997323bb0e9b725a6e8d65eae95372446c
https://doc.dpdk.org/guides-18.08/linux_gsg/linux_drivers.html?highlight=bifurcation#bifurcated-driver
https://github.com/kubernetes/kubernetes/issues/60748
https://stackoverflow.com/questions/59290752/how-to-use-device-dev-video0-with-kubernetes

