AWS EKS: testing EKS custom networking and related errors

References

  • https://docs.aws.amazon.com/zh_cn/eks/latest/userguide/cni-custom-network.html
  • https://aws.github.io/aws-eks-best-practices/networking/vpc-cni/
  • https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md

Test environment

  • EKS cluster 1.23
  • VPC CNI plugin version 1.10.4-eksbuild.1
  • Node subnet CIDR 10.100.1.0/24; after custom networking, pod CIDR 10.120.1.0/24

Use cases for custom networking

  • The subnet of the primary network interface has a limited number of available IPv4 addresses
  • Pods may need a different subnet or security groups than the node's primary network interface

Caveats

  • After custom networking is enabled, the primary ENI no longer assigns IPs to pods, so the number of pods per instance may be reduced (see the sketch after this list)
  • IPv6 clusters cannot use (and do not need) custom networking; for address-exhaustion scenarios an IPv6 cluster is the recommended approach
  • The configured subnets and security groups must be in the same VPC as the node
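
A rough sketch of that first caveat: with custom networking the primary ENI is excluded from pod IPs, so the usual ENI-based max-pods estimate loses one ENI. The describe-instance-types fields below are real AWS CLI fields; the formula is the standard estimate and ignores prefix delegation.

# Estimate max pods once custom networking excludes the primary ENI:
#   normal:            maxPods = ENIs * (IPs per ENI - 1) + 2
#   custom networking: maxPods = (ENIs - 1) * (IPs per ENI - 1) + 2
INSTANCE_TYPE=m5.2xlarge
read -r ENIS IPS <<< "$(aws ec2 describe-instance-types \
  --instance-types "$INSTANCE_TYPE" \
  --query 'InstanceTypes[0].NetworkInfo.[MaximumNetworkInterfaces,Ipv4AddressesPerInterface]' \
  --output text)"
echo "estimated max pods: $(( (ENIS - 1) * (IPS - 1) + 2 ))"
# m5.2xlarge: 4 ENIs x 15 IPs -> (4-1)*(15-1)+2 = 44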

Configuring custom networking

Configuring custom networking is straightforward: enable the relevant environment variables on the aws-node daemonset

kubectl set env daemonset aws-node -n kube-system AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true
kubectl set env daemonset aws-node -n kube-system ENI_CONFIG_LABEL_DEF=topology.kubernetes.io/zone
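
To confirm both variables actually landed on the daemonset, kubectl can list the effective environment (a quick check; nothing here is cluster-specific):

kubectl set env daemonset aws-node -n kube-system --list | \
  grep -E 'AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG|ENI_CONFIG_LABEL_DEF'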

Create an ENIConfig CR in each of the two Availability Zones

$ cat cn-north-1a.yaml
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: cn-north-1a
spec:
  securityGroups:
    - sg-088b947d8817fdc2f
  subnet: subnet-04cb3b7313ca37a80
$ cat cn-north-1b.yaml
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: cn-north-1b
spec:
  securityGroups:
    - sg-088b947d8817fdc2f
  subnet: subnet-0cd19152401a80eb5

kubectl apply -f cn-north-1a.yaml
kubectl apply -f cn-north-1b.yaml
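
After applying, it is worth confirming that both custom resources exist under the CRD the VPC CNI registers:

kubectl get eniconfigs.crd.k8s.amazonaws.com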

aws-node startup logs

  • net/ipv4/conf/eth0/rp_filter is set to 2; the corresponding source is shown below

    For the parameter governing NodePort support, see https://github.com/aws/amazon-vpc-cni-k8s#aws_vpc_cni_node_port_support

    if n.nodePortSupportEnabled {
    	// If node port support is enabled, configure the kernel's reverse path filter check
    	// on eth0 for "loose" filtering. This is required because NodePorts are exposed on
    	// eth0. The kernel's RPF check happens after incoming packets to NodePorts are
    	// DNATted to the pod IP. For pods assigned to secondary ENIs, the routing table
    	// includes source-based routing. When the kernel does the RPF check, it looks up
    	// the route using the pod IP as the source. Thus, it finds the source-based route
    	// that leaves via the secondary ENI. In "strict" mode, the RPF check fails because
    	// the return path uses a different interface to the incoming packet. In "loose"
    	// mode, the check passes because some route was found.
    	const eth0RPFilter = "/proc/sys/net/ipv4/conf/eth0/rp_filter"
    	const rpFilterLoose = "2"
    	err := n.setProcSys(eth0RPFilter, rpFilterLoose)
    	if err != nil {
    		return errors.Wrapf(err, "failed to configure eth0 RPF check")
    	}
    }
    
  • net/ipv4/tcp_early_demux is set to 1

aws-node Installed /host/opt/cni/bin/aws-cni
aws-node Installed /host/opt/cni/bin/egress-v4-cni
aws-node time="2023-03-22T07:59:16Z" level=info msg="Starting IPAM daemon... "
aws-node time="2023-03-22T07:59:16Z" level=info msg="Checking for IPAM connectivity... "
aws-node time="2023-03-22T07:59:17Z" level=info msg="Copying config file... "
aws-node time="2023-03-22T07:59:17Z" level=info msg="Successfully copied CNI plugin binary and config file."
aws-vpc-cni-init time="2023-03-22T07:59:14Z" level=info msg="Copying CNI plugin binaries ..."
aws-vpc-cni-init Installed /host/opt/cni/bin/loopback
aws-vpc-cni-init Installed /host/opt/cni/bin/portmap
aws-vpc-cni-init Installed /host/opt/cni/bin/bandwidth
aws-vpc-cni-init time="2023-03-22T07:59:14Z" level=info msg="Copied all CNI plugin binaries to /host/opt/cni/bin"
aws-vpc-cni-init Installed /host/opt/cni/bin/host-local
aws-vpc-cni-init Installed /host/opt/cni/bin/aws-cni-support.sh
aws-vpc-cni-init time="2023-03-22T07:59:14Z" level=info msg="Found primaryMAC 02:67:d2:91:12:ae"
aws-vpc-cni-init time="2023-03-22T07:59:14Z" level=info msg="Found primaryIF eth0"
aws-vpc-cni-init time="2023-03-22T07:59:14Z" level=info msg="Updated net/ipv4/conf/eth0/rp_filter to 2\n"
aws-vpc-cni-init time="2023-03-22T07:59:14Z" level=info msg="Updated net/ipv4/tcp_early_demux to 1\n"
aws-vpc-cni-init time="2023-03-22T07:59:14Z" level=info msg="CNI init container done"

In the node's ipamd logs you can see the configuration derived from the aws-node environment variables; for example, "Custom networking enabled true" indicates that custom networking is on.

{"level":"info","ts":"2023-03-22T07:59:16.772Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2023-03-22T07:59:16.772Z","caller":"aws-k8s-agent/main.go:39","msg":"Testing communication with server"}
{"level":"info","ts":"2023-03-22T07:59:16.783Z","caller":"wait/wait.go:222","msg":"Successful communication with the Cluster! Cluster Version is: v1.23+. git version: v1.23.16-eks-48e63af. git tree state: clean. commit: e6332a8a3feb9e0fe3db851878f88cb73d49dd7a. platform: linux/amd64"}
{"level":"warn","ts":"2023-03-22T07:59:16.799Z","caller":"awssession/session.go:64","msg":"HTTP_TIMEOUT env is not set or set to less than 10 seconds, defaulting to httpTimeout to 10sec"}
{"level":"debug","ts":"2023-03-22T07:59:16.801Z","caller":"ipamd/ipamd.go:392","msg":"Discovered region: cn-north-1"}
{"level":"info","ts":"2023-03-22T07:59:16.801Z","caller":"ipamd/ipamd.go:392","msg":"Custom networking enabled true"}
{"level":"debug","ts":"2023-03-22T07:59:16.802Z","caller":"awsutils/awsutils.go:431","msg":"Found availability zone: cn-north-1a "}
{"level":"debug","ts":"2023-03-22T07:59:16.803Z","caller":"awsutils/awsutils.go:431","msg":"Discovered the instance primary IPv4 address: 10.100.0.81"}
{"level":"debug","ts":"2023-03-22T07:59:16.803Z","caller":"awsutils/awsutils.go:431","msg":"Found instance-id: i-093668d8490163140 "}
{"level":"debug","ts":"2023-03-22T07:59:16.804Z","caller":"awsutils/awsutils.go:431","msg":"Found instance-type: m5.2xlarge "}
{"level":"debug","ts":"2023-03-22T07:59:16.804Z","caller":"awsutils/awsutils.go:431","msg":"Found primary interface's MAC address: 02:67:d2:91:12:ae"}
{"level":"debug","ts":"2023-03-22T07:59:16.805Z","caller":"awsutils/awsutils.go:431","msg":"eni-036525abd58c37374 is the primary ENI of this instance"}
{"level":"debug","ts":"2023-03-22T07:59:16.805Z","caller":"awsutils/awsutils.go:431","msg":"Found subnet-id: subnet-0fa7c3d3729a3009b "}
{"level":"debug","ts":"2023-03-22T07:59:16.805Z","caller":"ipamd/ipamd.go:401","msg":"Using WARM_ENI_TARGET 1"}
{"level":"debug","ts":"2023-03-22T07:59:16.805Z","caller":"ipamd/ipamd.go:404","msg":"Using WARM_PREFIX_TARGET 1"}
{"level":"info","ts":"2023-03-22T07:59:16.806Z","caller":"ipamd/ipamd.go:422","msg":"Prefix Delegation enabled false"}
{"level":"debug","ts":"2023-03-22T07:59:16.806Z","caller":"ipamd/ipamd.go:427","msg":"Start node init"}
{"level":"debug","ts":"2023-03-22T07:59:16.806Z","caller":"ipamd/ipamd.go:459","msg":"Max ip per ENI 14 and max prefixes per ENI 0"}
{"level":"info","ts":"2023-03-22T07:59:16.806Z","caller":"awsutils/awsutils.go:1684","msg":"Will attempt to clean up AWS CNI leaked ENIs after waiting 2m1s."}

Once everything is created, check the pod IPs

  • The node and pod IPs are in different CIDR ranges, showing that custom networking has taken effect (a second check from the EC2 side follows the output below)
$ k get pod -A -o wide
NAMESPACE     NAME                         READY   STATUS    RESTARTS   AGE    IP             NODE                                          NOMINATED NODE   READINESS GATES
default       nginx-dep-6cb4756ddb-7sjgn   1/1     Running   0          119s   10.120.0.248   ip-10-100-0-215.cn-north-1.compute.internal              
default       nginx-dep-6cb4756ddb-8t8wz   1/1     Running   0          119s   10.120.0.196   ip-10-100-0-215.cn-north-1.compute.internal              
default       nginx-dep-6cb4756ddb-95rm5   1/1     Running   0          119s   10.120.0.60    ip-10-100-0-215.cn-north-1.compute.internal              
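
For a second confirmation from the EC2 side, the secondary ENIs that ipamd attaches should sit in the ENIConfig subnet rather than the node subnet. A sketch using the cn-north-1a subnet from the ENIConfig above:

aws ec2 describe-network-interfaces \
  --filters "Name=subnet-id,Values=subnet-04cb3b7313ca37a80" \
  --query 'NetworkInterfaces[].[NetworkInterfaceId,PrivateIpAddress,Attachment.InstanceId]' \
  --output table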

Errors and fixes

(1) Missing IAM permissions

The CNI plugin's role was not granted the managed policy AmazonEKS_CNI_Policy.

The aws-node logs show the following error:

Unauthorized operation: failed to call ec2:DescribeNetworkInterfaces due to missing permissions. Please refer https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/iam-policy.md to attach relevant policy to IAM role

In the node's ipamd logs, the key phrases are:

  • Unauthorized operation
  • not authorized to perform ec2:DescribeNetworkInterfaces
{"level":"debug","ts":"2023-03-22T07:50:48.809Z","caller":"awsutils/awsutils.go:1947","msg":"Sent pod event: eventType: Warning, reason: MissingIAMPermissions, message: Unauthorized operation: failed to call ec2:DescribeNetworkInterfaces due to missing permissions. Please refer https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/iam-policy.md to attach relevant policy to IAM role"}
{"level":"error","ts":"2023-03-22T07:50:48.809Z","caller":"ipamd/ipamd.go:475","msg":"Failed to call ec2:DescribeNetworkInterfaces for [eni-004de29dbc204f264 eni-0562c3edc5c3ebc75]: UnauthorizedOperation: You are not authorized to perform this operation.\n\tstatus code: 403, request id: 79aa8fa8-c7ed-4023-8f29-2b10d4abea7c"}
{"level":"error","ts":"2023-03-22T07:50:48.920Z","caller":"aws-k8s-agent/main.go:28","msg":"Initialization failure: ipamd init: failed to retrieve attached ENIs info: UnauthorizedOperation: You are not authorized to perform this operation.\n\tstatus code: 403, request id: 79aa8fa8-c7ed-4023-8f29-2b10d4abea7c"}

(2) Wrong security group in the ENIConfig

The ENIConfig was given an incorrect security group, sg-088b947d8817fdc2f.

Since securityGroups is optional in an ENIConfig, simply omitting it avoids this class of problem.

  • Key phrase: Failed to find IPAM entry under full key, trying CRI-migrated version
  • The usual suggestion is to upgrade the cluster's vpc-cni plugin to the latest version

This is quite misleading: it does not actually mean the CNI plugin needs upgrading; almost any custom-networking misconfiguration can surface this error.

{"level":"debug","ts":"2023-03-22T06:54:46.947Z","caller":"ipamd/rpc_handler.go:222","msg":"UnassignPodIPAddress: IP address pool staandbox aws-cni/c3631138652f4b26d6b52635e103880008f049a1a5e8dc016159d5659f9eb8bb/eth0"}
{"level":"debug","ts":"2023-03-22T06:54:46.947Z","caller":"ipamd/rpc_handler.go:222","msg":"UnassignPodIPAddress: Failed to find IPAMing CRI-migrated version"}
{"level":"warn","ts":"2023-03-22T06:54:46.947Z","caller":"ipamd/rpc_handler.go:222","msg":"UnassignPodIPAddress: Failed to find sandb1138652f4b26d6b52635e103880008f049a1a5e8dc016159d5659f9eb8bb/unknown"}
{"level":"info","ts":"2023-03-22T06:54:46.947Z","caller":"rpc/rpc.pb.go:731","msg":"Send DelNetworkReply: IPv4Addr , DeviceNumber: 0,od"}

(3) ENIConfig label environment variable not configured

If the ENIConfig CRs are named after Availability Zones but the following environment variable was never set on aws-node, ipamd will fail to find an ENIConfig:

kubectl set env daemonset aws-node -n kube-system ENI_CONFIG_LABEL_DEF=topology.kubernetes.io/zone

Pods are then stuck in ContainerCreating:

Warning  FailedCreatePodSandBox  0s    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "76d1xe56f8" network for pod "nginx-dep-6cb4756ddb-rr5hs": networkPlugin cni failed to set up pod "nginx-dep-6cb4756ddb-rr5hs_default" network: add cmd: failed to assign an IP address to container

The node's ipamd logs show the following errors:

  • ipamd keeps looking for an ENIConfig named default and never finds one
{"level":"error","ts":"2023-03-22T09:19:07.812Z","caller":"ipamd/ipamd.go:871","msg":"error while retrieving eniconfig: ENIConfig.crd.k8s.amazonaws.com \"default\" not found"}
{"level":"error","ts":"2023-03-22T09:19:07.812Z","caller":"ipamd/ipamd.go:845","msg":"Failed to get pod ENI config"}

The logic behind this error:

  • If the ENIConfigs are named after Availability Zones, e.g. cn-north-1a, the ENI_CONFIG_LABEL_DEF environment variable shown above must be set; ipamd then uses the node's zone label value as the ENIConfig name to look up

  • If the ENIConfigs use names other than Availability Zones, each node must be annotated individually with the commands below (a verification sketch follows them).

    kubectl annotate node ip-192-168-0-126.us-west-2.compute.internal k8s.amazonaws.com/eniConfig=EniConfigName1
    kubectl annotate node ip-192-168-0-92.us-west-2.compute.internal k8s.amazonaws.com/eniConfig=EniConfigName2
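
To see which name ipamd will resolve for a given node, check the zone label (what ENI_CONFIG_LABEL_DEF points at) and the k8s.amazonaws.com/eniConfig annotation; the node name below is the one from this test:

# Zone label used when ENI_CONFIG_LABEL_DEF=topology.kubernetes.io/zone
kubectl get node ip-10-100-0-215.cn-north-1.compute.internal -L topology.kubernetes.io/zone
# Per-node annotation, if one was set
kubectl get node ip-10-100-0-215.cn-north-1.compute.internal \
  -o jsonpath='{.metadata.annotations.k8s\.amazonaws\.com/eniConfig}{"\n"}'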
    

(4) Unable to set the rp_filter parameter

Here the aws-node pod starts, but pods remain stuck in ContainerCreating.

This error is also self-explanatory: ipamd lacks permission to modify the kernel parameter.

https://aws.github.io/aws-eks-best-practices/networking/vpc-cni/

Amazon VPC CNI has two components CNI binary and ipamd (aws-node) Daemonset. The CNI runs as a binary on a node and has access to node root file system, also has privileged access as it deals with iptables at the node level.

Normally, though, the CNI plugin does start in privileged mode and does have access to the root filesystem, so when this error appears something else is wrong:

  • Similar issues can be found on GitHub, e.g. the OS names the interface something other than eth0: https://github.com/aws/amazon-vpc-cni-k8s/issues/796
  • Or the container is not actually running in privileged mode (see the check after this list)
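
A quick way to check both possibilities (plain kubectl plus a look at the node; nothing version-specific):

# Dump the securityContext of the init and main containers of aws-node
kubectl get ds aws-node -n kube-system \
  -o jsonpath='{.spec.template.spec.initContainers[*].securityContext}{"\n"}{.spec.template.spec.containers[*].securityContext}{"\n"}'
# On the node itself: confirm the primary interface really is eth0
ip -o link show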

The node's ipamd logs:

{"level":"debug","ts":"2023-03-22T02:46:55.250Z","caller":"ipamd/ipamd.go:381","msg":"Setting RPF for primary interface: net/ipv4/conf/eth0/rp_filter"}
{"level":"error","ts":"2023-03-22T02:46:55.250Z","caller":"aws-k8s-agent/main.go:28","msg":"Initialization failure: ipamd init: failed to set up host network: failed to configure eth0 RPF check: open /proc/sys/net/ipv4/conf/eth0/rp_filter: read-only file system"}

Looking at the source, the parameter is written simply by opening the /proc file:

// setProcSys writes value to a /proc/sys entry, e.g.
// /proc/sys/net/ipv4/conf/eth0/rp_filter. If /proc/sys is mounted
// read-only in the container, the open call fails with EROFS, which is
// exactly the "read-only file system" error in the log above.
func (n *linuxNetwork) setProcSys(key, value string) error {
	f, err := n.openFile(key, os.O_WRONLY, 0644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.WriteString(value)
	if err != nil {
		return err
	}
	return nil
}

No other likely cause was identified in this case; the workaround is to restart the CNI plugin (delete the aws-node pod so initialization runs again).

(5) CIDR block not recognized by the EKS cluster

The official documentation notes that if an additional CIDR block is added to the VPC and a new node group is launched in it, the nodes may fail to join the cluster:

https://aws.amazon.com/cn/premiumsupport/knowledge-center/eks-multiple-cidr-ranges/

In some cases, Amazon EKS cannot communicate with nodes launched in subnets from CIDR blocks added to a VPC after the cluster was created. An updated range caused by adding CIDR blocks to an existing cluster can take as long as 5 hours to appear.

The node's kubelet logs make it look as if the node cannot reach the API server, but there is no actual connectivity problem.

It can take up to 5 hours for the newly added CIDR block to be recognized by the cluster.
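
While waiting, at least the VPC side can be confirmed: the new block should appear as an associated CIDR on the VPC (the VPC ID below is a placeholder):

aws ec2 describe-vpcs --vpc-ids vpc-0123456789abcdef0 \
  --query 'Vpcs[0].CidrBlockAssociationSet[].[CidrBlock,CidrBlockState.State]' \
  --output table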
