Kubernetes: Problems Encountered in Production and Their Solutions

This document is continuously updated; follow it for new entries.

Calico issues

calico-node fails to start

Base versions: Kubernetes v1.29.0, Calico v3.27.0, CentOS 7.9, kernel 6.6.8

Symptom: Calico fails to start. The pods report STATUS Running, but READY stays at 0/1, so they are not actually functional.

NAME                                       READY   STATUS    RESTARTS      AGE
calico-kube-controllers-78d68c6746-cmqqg   0/1     Running   2 (49s ago)   3m11s
calico-node-769qn                          0/1     Running   0             41h
calico-node-fl4jc                          0/1     Running   0             2m57s
calico-node-gbv4p                          1/1     Running   0             2m42s
calico-node-psk2b                          0/1     Running   0             2m29s

Error log:

kubectl logs -n calico-system calico-node-769qn --tail=10

# Log output
Defaulted container "calico-node" out of: calico-node, flexvol-driver (init), install-cni (init)
2024-01-05 03:55:48.494 [WARNING][11840] felix/ipsets.go 346: Failed to resync with dataplane error=exit status 1 family="inet"
2024-01-05 03:55:48.527 [INFO][11840] felix/ipsets.go 337: Retrying after an ipsets update failure... family="inet"
2024-01-05 03:55:48.534 [ERROR][11840] felix/ipsets.go 599: Bad return code from 'ipset list'. error=exit status 1 family="inet" stderr="ipset v7.11: Kernel and userspace incompatible: settype hash:ip,port with revision 7 not supported by userspace.\n"
2024-01-05 03:55:48.534 [WARNING][11840] felix/ipsets.go 346: Failed to resync with dataplane error=exit status 1 family="inet"
2024-01-05 03:55:48.599 [INFO][11840] felix/ipsets.go 337: Retrying after an ipsets update failure... family="inet"
2024-01-05 03:55:48.606 [ERROR][11840] felix/ipsets.go 599: Bad return code from 'ipset list'. error=exit status 1 family="inet" stderr="ipset v7.11: Kernel and userspace incompatible: settype hash:ip,port with revision 7 not supported by userspace.\n"

Analysis:

The calico-node container fails while resyncing ipsets with the dataplane. The key message is "Kernel and userspace incompatible: settype hash:ip,port with revision 7 not supported by userspace": the running kernel (6.6.8) exposes a revision of the hash:ip,port set type that is newer than the ipset userspace tool bundled in the calico-node image (v7.11) can handle.

Options: downgrade the system kernel, or change the Calico version.

Fix:

  1. Kernel downgrade: with kernel 5.4.264, calico-node starts normally.
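A quick way to anticipate this mismatch is to compare the running kernel against the range observed to work (5.4.x here; the cutoff is taken from the test above, not from Calico documentation). A minimal sketch:

```shell
# Hedged sketch: warn when the running kernel is newer than the 5.4 series
# that was observed to work with the ipset v7.11 bundled in calico-node.
kernel_newer_than_5_4() {
  major="${1%%.*}"
  rest="${1#*.}"
  minor="${rest%%.*}"
  [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -gt 4 ]; }
}

if kernel_newer_than_5_4 "$(uname -r)"; then
  echo "kernel $(uname -r) is newer than 5.4; the ipset revision mismatch is possible"
fi
```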

calico-apiserver fails to start

Symptom

The calico-apiserver pods stay in Pending:

NAMESPACE          NAME                                                           READY   STATUS    RESTARTS   AGE
calico-apiserver   calico-apiserver-65fb845b45-t2vpj                              0/1     Pending   0          9m33s
calico-apiserver   calico-apiserver-65fb845b45-zb2q9                              0/1     Pending   0          9m33s

Troubleshooting

kubectl describe pod calico-apiserver-65fb845b45-t2vpj -n calico-apiserver
### key event in the Events section
Warning  FailedScheduling  4m47s (x2 over 10m)  default-scheduler  0/4 nodes are available: 4 node(s) had untolerated taint {node.kubernetes.io/network-unavailable: }. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling

The Pending state is a consequence of the calico-node failure above: until the CNI is ready, every node keeps the node.kubernetes.io/network-unavailable taint, so the calico-apiserver pods cannot be scheduled.
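The FailedScheduling message itself names the blocking taint. As a small illustration (the message string below is copied from the describe output), the key can be pulled out with sed:

```shell
# Illustration only: extract the untolerated taint key from a
# FailedScheduling event message.
msg='0/4 nodes are available: 4 node(s) had untolerated taint {node.kubernetes.io/network-unavailable: }.'
taint=$(printf '%s\n' "$msg" | sed -n 's/.*untolerated taint {\([^:}]*\).*/\1/p')
echo "blocking taint: $taint"
# prints: blocking taint: node.kubernetes.io/network-unavailable
```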

calico-node reports a BIRD error

Base info: Calico v3.27.0

Symptom: calico-node fails to join the network. The readiness check reports: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused

Check the pod status with kubectl describe pod calico-node-5pmzs -n calico-system:

Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused

Fix: delete the Calico resources and reinstall. The deletion may hang; be patient. If it stays stuck, consider deleting the namespace resources directly.

 kubectl delete -f tigera-operator.yaml
 kubectl delete -f custom-resources.yaml
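If a namespace refuses to disappear and sits in Terminating, it is usually held by finalizers. The sketch below (assuming jq is available; the namespace JSON is a hand-written stand-in, not real cluster output) shows the filter that clears them; applying it for real goes through the namespace's /finalize subresource:

```shell
# Hedged sketch: clear a namespace's finalizers so it can finish terminating.
command -v jq >/dev/null 2>&1 || { echo "jq not installed; skipping"; exit 0; }

ns_json='{"apiVersion":"v1","kind":"Namespace","metadata":{"name":"calico-system"},"spec":{"finalizers":["kubernetes"]}}'
echo "$ns_json" | jq '.spec.finalizers = []'

# To apply against a live cluster (destructive, use with care):
#   kubectl get ns calico-system -o json | jq '.spec.finalizers = []' \
#     | kubectl replace --raw /api/v1/namespaces/calico-system/finalize -f -
```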

Then add two lines to custom-resources.yaml to pin node IP autodetection to a specific interface:

nodeAddressAutodetectionV4:
  interface: ens.*

The complete file:

# This section includes base Calico installation configuration.
# For more information, see: https://docs.tigera.io/calico/latest/reference/installation/api#operator.tigera.io/v1.Installation
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Configures Calico networking.
  calicoNetwork:
    # Note: The ipPools section cannot be modified post-install.
    ipPools:
    - blockSize: 26
      cidr: 30.244.0.0/16
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()
    nodeAddressAutodetectionV4:
      interface: ens.*

---

# This section configures the Calico API server.
# For more information, see: https://docs.tigera.io/calico/latest/reference/installation/api#operator.tigera.io/v1.APIServer
apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
  name: default
spec: {}

After updating the configuration, recreate the Calico resources:

 kubectl create -f tigera-operator.yaml
 kubectl create -f custom-resources.yaml

Issues on master nodes

Cluster initialization fails on the master

If cluster initialization fails, reset the cluster first. The CRI socket (container runtime type) must be specified, otherwise kubeadm reset reports an error:

kubeadm reset --cri-socket unix:///var/run/cri-dockerd.sock

Note

The reset process does not clean your kubeconfig files and you must remove them manually.

Please, check the contents of the $HOME/.kube/config file.

To re-initialize, this config file must be deleted:

rm -rf $HOME/.kube

Issues on worker nodes

Worker node cannot join the cluster

Symptom: re-adding a worker node to the cluster after a previous failed join attempt.

[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
[kubelet-check] Initial timeout of 40s passed.
error execution phase kubelet-start: error uploading crisocket: Unauthorized
To see the stack trace of this error execute with --v=5 or higher

Fix: docker is the container runtime here, so the CRI socket must be specified in the commands. Reset the node, then join again. These commands are risky; proceed with care.

# Reset the node first
kubeadm reset --cri-socket unix:///var/run/cri-dockerd.sock
rm -rf $HOME/.kube

# Join the cluster again
kubeadm join 192.168.13.133:6443 --token qsq414.hrw44xwjoxt2l15l \
	--discovery-token-ca-cert-hash sha256:75ae4d4d07420b61a3d9143c847ff74caxxxxxxxxxxxxxxxx1656bef98 --cri-socket unix:///var/run/cri-dockerd.sock
### The following output indicates the node joined successfully
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.
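If the original join command has been lost, a fresh one can be printed on the control plane with kubeadm token create --print-join-command. The sha256 discovery hash itself is just the SHA-256 of the cluster CA's DER-encoded public key, so it can be recomputed from ca.crt; a sketch assuming openssl is available:

```shell
# Hedged sketch: recompute the --discovery-token-ca-cert-hash value from the
# cluster CA certificate (SHA-256 over the DER-encoded RSA public key).
ca_hash() {
  openssl x509 -pubkey -noout -in "$1" \
    | openssl rsa -pubin -outform der 2>/dev/null \
    | openssl dgst -sha256 -hex \
    | awk '{print $NF}'
}

# On a control-plane node:
#   echo "sha256:$(ca_hash /etc/kubernetes/pki/ca.crt)"
```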

Note:

Make sure the kubelet service is enabled at boot but not started; if kubelet is already running at init/join time, it will fail with a port-in-use error.
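A quick pre-flight sketch for this check (assumes the ss tool from iproute2; 10250 is kubelet's default port):

```shell
# Hedged sketch: verify the kubelet port is free before kubeadm init/join.
port_in_use() { ss -tln 2>/dev/null | grep -q ":$1 "; }

command -v ss >/dev/null 2>&1 || { echo "ss not found; skipping"; exit 0; }
if port_in_use 10250; then
  echo "port 10250 is busy; stop kubelet first: systemctl stop kubelet"
else
  echo "port 10250 is free"
fi
```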
