kubeadm cluster creation fails with "Unable to register node with API server"

Environment

  • containerd 1.6.4
  • Kubernetes 1.24.1 (the same failure also occurs on 1.23.5)

Error symptoms


	Unfortunately, an error has occurred:
		timed out waiting for the condition

	This error is likely caused by:
		- The kubelet is not running
		- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

	If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
		- 'systemctl status kubelet'
		- 'journalctl -xeu kubelet'

	Additionally, a control plane component may have crashed or exited when started by the container runtime.
	To troubleshoot, list all containers using your preferred container runtimes CLI.

	Here is one example how you may list all Kubernetes containers running in cri-o/containerd using crictl:
		- 'crictl --runtime-endpoint /run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
		Once you have found the failing container, you can inspect its logs with:
		- 'crictl --runtime-endpoint /run/containerd/containerd.sock logs CONTAINERID'

couldn't initialize a Kubernetes cluster
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runWaitControlPlanePhase

Log analysis

Following the hint, I inspected the logs with journalctl -xeu kubelet. The errors in the log fall into four main kinds:

  1. Node not found
"Error getting node" err="node \"k8s-master\" not found"
  2. Unable to register node with API server
  3. The pause image cannot be pulled
May 31 09:04:45 k8s-master kubelet[12906]: E0531 09:04:45.363423   12906 remote_runtime.go:198] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to get sandbox image \"k8s.gcr.io/pause:3.6\": failed to pull image \"k8s.gcr.io/pause:3.6\": failed to pull and unpack image \"k8s.gcr.io/pause:3.6\": failed to resolve reference \"k8s.gcr.io/pause:3.6\": failed to do request: Head \"https://k8s.gcr.io/v2/pause/manifests/3.6\": dial tcp 142.250.157.82:443: i/o timeout"
May 31 09:04:45 k8s-master kubelet[12906]: E0531 09:04:45.363556   12906 kuberuntime_sandbox.go:70] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to get sandbox image \"k8s.gcr.io/pause:3.6\": failed to pull image \"k8s.gcr.io/pause:3.6\": failed to pull and unpack image \"k8s.gcr.io/pause:3.6\": failed to resolve reference \"k8s.gcr.io/pause:3.6\": failed to do request: Head \"https://k8s.gcr.io/v2/pause/manifests/3.6\": dial tcp 142.250.157.82:443: i/o timeout" pod="kube-system/kube-controller-manager-k8s-master"
May 31 09:04:45 k8s-master kubelet[12906]: E0531 09:04:45.363628   12906 kuberuntime_manager.go:833] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to get sandbox image \"k8s.gcr.io/pause:3.6\": failed to pull image \"k8s.gcr.io/pause:3.6\": failed to pull and unpack image \"k8s.gcr.io/pause:3.6\": failed to resolve reference \"k8s.gcr.io/pause:3.6\": failed to do request: Head \"https://k8s.gcr.io/v2/pause/manifests/3.6\": dial tcp 142.250.157.82:443: i/o timeout" pod="kube-system/kube-controller-manager-k8s-master"
May 31 09:04:45 k8s-master kubelet[12906]: E0531 09:04:45.363775   12906 pod_workers.go:949] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-controller-manager-k8s-master_kube-system(a7773f029975563a22f260af603bc174)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"kube-controller-manager-k8s-master_kube-system(a7773f029975563a22f260af603bc174)\\\": rpc error: code = Unknown desc = failed to get sandbox image \\\"k8s.gcr.io/pause:3.6\\\": failed to pull image \\\"k8s.gcr.io/pause:3.6\\\": failed to pull and unpack image \\\"k8s.gcr.io/pause:3.6\\\": failed to resolve reference \\\"k8s.gcr.io/pause:3.6\\\": failed to do request: Head \\\"https://k8s.gcr.io/v2/pause/manifests/3.6\\\": dial tcp 142.250.157.82:443: i/o timeout\"" pod="kube-system/kube-controller-manager-k8s-master" podUID=a7773f029975563a22f260af603bc174
  4. The CNI network is not initialized
May 31 09:04:54 k8s-master kubelet[12906]: E0531 09:04:54.426474   12906 kubelet.go:2347] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"

Troubleshooting

  1. At first I assumed error 4 (the CNI message) was the culprit and spent a lot of time on it. I searched Baidu, Google, and the Kubernetes issue tracker without finding a matching report, so I suspected a bug in the new release that nobody had filed yet and switched to 1.23.5, but the same failure occurred.
  2. I then turned to error 3, the pause image pull failure. A commonly suggested fix is to add the pod-infra-container-image kubelet flag, but the error persisted even with it set (note that the kubelet must be restarted for the flag to take effect; see the sketch after the snippet below):
tee /etc/sysconfig/kubelet <<-EOF
KUBELET_EXTRA_ARGS="--pod-infra-container-image=registry.aliyuncs.com/google_containers/pause:3.6"
EOF
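The snippet above only writes the drop-in file; the kubelet reads it at startup, so nothing changes until it is restarted. A minimal sketch on a systemd host, assuming /etc/sysconfig/kubelet is the environment file your kubelet unit actually sources:

systemctl daemon-reload    # reload unit definitions (harmless if none changed)
systemctl restart kubelet  # restart so KUBELET_EXTRA_ARGS is picked up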

Even so, the k8s.gcr.io/pause:3.6 image was still being pulled. Checking the containerd logs alongside:

May 31 09:11:10 k8s-master containerd[9807]: time="2022-05-31T09:11:10.355757461+08:00" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:kube-apiserver-k8s-master,Uid:4209d27a0268bc5305037fe9024040af,Namespace:kube-system,Attempt:0,} failed, error" error="failed to get sandbox image \"k8s.gcr.io/pause:3.6\": failed to pull image \"k8s.gcr.io/pause:3.6\": failed to pull and unpack image \"k8s.gcr.io/pause:3.6\": failed to resolve reference \"k8s.gcr.io/pause:3.6\": failed to do request: Head \"https://k8s.gcr.io/v2/pause/manifests/3.6\": dial tcp 64.233.189.82:443: i/o timeout"
May 31 09:11:12 k8s-master containerd[9807]: time="2022-05-31T09:11:12.355948935+08:00" level=info msg="trying next host" error="failed to do request: Head \"https://k8s.gcr.io/v2/pause/manifests/3.6\": dial tcp 64.233.189.82:443: i/o timeout" host=k8s.gcr.io
May 31 09:11:12 k8s-master containerd[9807]: time="2022-05-31T09:11:12.361334669+08:00" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:kube-controller-manager-k8s-master,Uid:a7773f029975563a22f260af603bc174,Namespace:kube-system,Attempt:0,} failed, error" error="failed to get sandbox image \"k8s.gcr.io/pause:3.6\": failed to pull image \"k8s.gcr.io/pause:3.6\": failed to pull and unpack image \"k8s.gcr.io/pause:3.6\": failed to resolve reference \"k8s.gcr.io/pause:3.6\": failed to do request: Head \"https://k8s.gcr.io/v2/pause/manifests/3.6\": dial tcp 64.233.189.82:443: i/o timeout"
May 31 09:11:23 k8s-master containerd[9807]: time="2022-05-31T09:11:23.353642109+08:00" level=info msg="trying next host" error="failed to do request: Head \"https://k8s.gcr.io/v2/pause/manifests/3.6\": dial tcp 64.233.189.82:443: i/o timeout" host=k8s.gcr.io
May 31 09:11:23 k8s-master containerd[9807]: time="2022-05-31T09:11:23.357821141+08:00" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:etcd-k8s-master,Uid:267aa25988340cd5f9ebe7bf0bc5b507,Namespace:kube-system,Attempt:0,} failed, error" error="failed to get sandbox image \"k8s.gcr.io/pause:3.6\": failed to pull image \"k8s.gcr.io/pause:3.6\": failed to pull and unpack image \"k8s.gcr.io/pause:3.6\": failed to resolve reference \"k8s.gcr.io/pause:3.6\": failed to do request: Head \"https://k8s.gcr.io/v2/pause/manifests/3.6\": dial tcp 64.233.189.82:443: i/o timeout"
May 31 09:11:25 k8s-master containerd[9807]: time="2022-05-31T09:11:25.345054250+08:00" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:kube-apiserver-k8s-master,Uid:4209d27a0268bc5305037fe9024040af,Namespace:kube-system,Attempt:0,}"


The logs show that kube-apiserver, kube-controller-manager, and etcd never started because the pause image pull kept failing, which in turn produced errors 1 and 2. Listing local images with crictl img showed only registry.aliyuncs.com/google_containers/pause:3.7, so I manually pulled registry.aliyuncs.com/google_containers/pause:3.6, retagged it as k8s.gcr.io/pause:3.6, and reran the cluster creation command. This time it succeeded.

ctr -n k8s.io i pull registry.aliyuncs.com/google_containers/pause:3.6
ctr -n k8s.io i tag registry.aliyuncs.com/google_containers/pause:3.6 k8s.gcr.io/pause:3.6
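A more permanent alternative (my own suggestion, not the route taken above): containerd 1.6's CRI plugin reads its sandbox image from the sandbox_image key in /etc/containerd/config.toml, so pointing that key at the mirror survives image garbage collection. A sketch, assuming the file still contains the default k8s.gcr.io entry (containerd config default > /etc/containerd/config.toml generates one if it does not exist):

# rewrite the default sandbox image reference, then restart containerd
sudo sed -i 's#sandbox_image = "k8s.gcr.io/pause:3.6"#sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.6"#' /etc/containerd/config.toml
sudo systemctl restart containerd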

Summary

The root cause of this pitfall is that Google's image registry cannot be reached directly from mainland China. Although the init command included the --image-repository flag pointing at a local mirror, the flag is not fully effective: the sandbox (pause) image is configured in containerd itself, independently of kubeadm, so it still defaults to k8s.gcr.io. Hopefully a future release will close this gap.
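For reference, a minimal sketch of the kind of init invocation discussed here (the flag values reflect this environment; adjust as needed):

kubeadm init \
  --image-repository registry.aliyuncs.com/google_containers \
  --kubernetes-version v1.24.1 \
  --pod-network-cidr 10.244.0.0/16

Even with --image-repository set, the sandbox image is still resolved by containerd's own config, which is why the retag (or the sandbox_image edit above) remains necessary.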

Other pitfalls

  1. Mapping command-line arguments to their --config file equivalents
--pod-network-cidr=10.244.0.0/16 corresponds to the following (a combined sketch follows this item):
networking:
  podSubnet: 10.244.0.0/16

kubeadm config print init-defaults > kubeadm_config.yaml   # dump the defaults to see how flags map to config keys

Reference: K8S: convert "kubeadm init" command-line arguments to "--config" YAML-file equivalent
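Putting this together with the mirror workaround from earlier, a minimal config-file sketch (apiVersion v1beta3 matches kubeadm 1.23/1.24; the field values are carried over from this article and are otherwise assumptions):

cat <<EOF > kubeadm_config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
imageRepository: registry.aliyuncs.com/google_containers
networking:
  podSubnet: 10.244.0.0/16
EOF
kubeadm init --config kubeadm_config.yaml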

  2. Installing the pod network add-on flannel
  • Error message
    create flannel failed, Failed to find any valid interface to use: failed to get default interface: protocol not available
  • Version: 0.18.0
  • Fix
    Change the flannel image version in kube-flannel.yml to 0.17.0 (see the sketch below).
    Reference: https://github.com/flannel-io/flannel/issues/1573
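A sketch of that edit, assuming the manifest pins the v0.18.0 image tag (check the exact tags in your copy of kube-flannel.yml first):

sed -i 's/v0.18.0/v0.17.0/g' kube-flannel.yml   # downgrade the flannel image tag
kubectl apply -f kube-flannel.yml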
  3. Unable to connect to the server: x509: certificate signed by unknown authority
  • Cause
    The cluster was recreated after kubeadm reset without deleting the stale $HOME/.kube directory, which still held a kubeconfig signed by the old CA. Delete it, then re-run the standard post-init commands:
 rm -rf $HOME/.kube   # remove the kubeconfig signed by the pre-reset CA
 mkdir -p $HOME/.kube
 sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
 sudo chown $(id -u):$(id -g) $HOME/.kube/config
