Base versions: Kubernetes v1.29.0, Calico v3.27.0, CentOS 7.9, kernel 6.6.8
Problem: Calico fails to start. The pods show STATUS Running, but READY stays at 0/1, so they are not actually serving.
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-78d68c6746-cmqqg 0/1 Running 2 (49s ago) 3m11s
calico-node-769qn 0/1 Running 0 41h
calico-node-fl4jc 0/1 Running 0 2m57s
calico-node-gbv4p 1/1 Running 0 2m42s
calico-node-psk2b 0/1 Running 0 2m29s
Error log:
kubectl logs -n calico-system calico-node-769qn --tail=10
# Log output:
Defaulted container "calico-node" out of: calico-node, flexvol-driver (init), install-cni (init)
2024-01-05 03:55:48.494 [WARNING][11840] felix/ipsets.go 346: Failed to resync with dataplane error=exit status 1 family="inet"
2024-01-05 03:55:48.527 [INFO][11840] felix/ipsets.go 337: Retrying after an ipsets update failure... family="inet"
2024-01-05 03:55:48.534 [ERROR][11840] felix/ipsets.go 599: Bad return code from 'ipset list'. error=exit status 1 family="inet" stderr="ipset v7.11: Kernel and userspace incompatible: settype hash:ip,port with revision 7 not supported by userspace.\n"
2024-01-05 03:55:48.534 [WARNING][11840] felix/ipsets.go 346: Failed to resync with dataplane error=exit status 1 family="inet"
2024-01-05 03:55:48.599 [INFO][11840] felix/ipsets.go 337: Retrying after an ipsets update failure... family="inet"
2024-01-05 03:55:48.606 [ERROR][11840] felix/ipsets.go 599: Bad return code from 'ipset list'. error=exit status 1 family="inet" stderr="ipset v7.11: Kernel and userspace incompatible: settype hash:ip,port with revision 7 not supported by userspace.\n"
Analysis:
The calico-node container fails to resync with the dataplane because of an ipset error. The key message is "Kernel and userspace incompatible: settype hash:ip,port with revision 7 not supported by userspace": kernel 6.6.8 reports an ipset set-type revision newer than the ipset v7.11 userspace tool bundled in the calico-node image can handle.
Resolution: downgrade the kernel, or switch to a Calico version whose bundled ipset matches the kernel.
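The mismatch can be confirmed straight from the felix log. A minimal sketch that pulls out the unsupported set-type revision number; the helper name is ours, and the pod name in the usage comment is taken from the output above:

```shell
#!/bin/sh
# Print the ipset set-type revision that the kernel advertised but the
# container's ipset userspace (v7.11 here) could not handle.
felix_ipset_revision() {
    # Reads felix log lines on stdin, prints the first unsupported revision.
    grep -oE 'revision [0-9]+ not supported' | grep -oE '[0-9]+' | head -n1
}

# Typical usage against the failing pod (requires cluster access):
# kubectl logs -n calico-system calico-node-769qn -c calico-node --tail=50 \
#     | felix_ipset_revision
```

If the printed revision is higher than what `ipset --help` inside the container lists for `hash:ip,port`, the kernel/userspace mismatch is confirmed.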
Problem
calico-apiserver stays in Pending
NAMESPACE NAME READY STATUS RESTARTS AGE
calico-apiserver calico-apiserver-65fb845b45-t2vpj 0/1 Pending 0 9m33s
calico-apiserver calico-apiserver-65fb845b45-zb2q9 0/1 Pending 0 9m33s
Troubleshooting
kubectl describe pod calico-apiserver-65fb845b45-t2vpj -n calico-apiserver
###
Warning FailedScheduling 4m47s (x2 over 10m) default-scheduler 0/4 nodes are available: 4 node(s) had untolerated taint {node.kubernetes.io/network-unavailable: }. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling
The scheduling failure above is caused by calico-node failing to start: until the CNI is ready, nodes keep the node.kubernetes.io/network-unavailable taint, so calico-apiserver stays Pending.
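To confirm the taint is what blocks scheduling, filter it out of the node description. A small sketch; the helper name is ours, the taint key comes from the scheduler event above:

```shell
#!/bin/sh
# Succeeds if 'kubectl describe node' output contains the network-unavailable
# taint that keeps calico-apiserver Pending.
has_network_unavailable_taint() {
    grep -q 'node.kubernetes.io/network-unavailable'
}

# Typical usage (requires cluster access):
# kubectl describe node <node-name> | has_network_unavailable_taint \
#     && echo "taint present: fix calico-node first"
```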
Base info: Calico v3.27.0
Problem: calico-node cannot join the network; BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl:
Check the pod status: kubectl describe pod calico-node-5pmzs -n calico-system
Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
Resolution: delete the Calico resources and reinstall. The delete may hang; wait patiently, and if it stays stuck, consider deleting the namespace resources directly.
kubectl delete -f tigera-operator.yaml
kubectl delete -f custom-resources.yaml
Add these two lines to the configuration to pin the interface used for node address autodetection:
  nodeAddressAutodetectionV4:
    interface: ens.*
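Before committing to `interface: ens.*`, it is worth checking which NICs on the node the regex would actually select. A rough sketch; it approximates Calico's matching by anchoring the pattern, and the interface names in the test are examples:

```shell
#!/bin/sh
# Print the interface names that would match the autodetection regex.
matching_interfaces() {
    # $1: the regex from nodeAddressAutodetectionV4.interface (e.g. 'ens.*')
    grep -E "^$1$"
}

# Typical usage on the node:
# ip -o link show | awk -F': ' '{print $2}' | matching_interfaces 'ens.*'
```

If this prints nothing, or more than one interface, adjust the regex before applying the manifest.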
# This section includes base Calico installation configuration.
# For more information, see: https://docs.tigera.io/calico/latest/reference/installation/api#operator.tigera.io/v1.Installation
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Configures Calico networking.
  calicoNetwork:
    # Note: The ipPools section cannot be modified post-install.
    ipPools:
    - blockSize: 26
      cidr: 30.244.0.0/16
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()
    nodeAddressAutodetectionV4:
      interface: ens.*
---
# This section configures the Calico API server.
# For more information, see: https://docs.tigera.io/calico/latest/reference/installation/api#operator.tigera.io/v1.APIServer
apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
  name: default
spec: {}
After updating the configuration, recreate the Calico resources:
kubectl create -f tigera-operator.yaml
kubectl create -f custom-resources.yaml
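After the recreate, all pods should eventually report 1/1 in the READY column. A small check that flags any pod still stuck at 0/1; the helper name is ours:

```shell
#!/bin/sh
# Reads 'kubectl get pods' output on stdin and prints the names of pods
# whose READY column shows fewer ready containers than expected (e.g. 0/1).
not_ready_pods() {
    awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Typical usage (requires cluster access):
# kubectl get pods -n calico-system | not_ready_pods
```

An empty result means every Calico pod is fully ready.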
When cluster initialization fails, reset the cluster first. The CRI socket must be specified, otherwise kubeadm reports an error (the container runtime type has to be given here).
kubeadm reset --cri-socket unix:///var/run/cri-dockerd.sock
Note
The reset process does not clean your kubeconfig files and you must remove them manually.
Please, check the contents of the $HOME/.kube/config file.
To re-initialize, delete this configuration directory first:
rm -rf $HOME/.kube
Problem: a worker node is re-joined to the cluster after a previous failed join attempt.
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
[kubelet-check] Initial timeout of 40s passed.
error execution phase kubelet-start: error uploading crisocket: Unauthorized
To see the stack trace of this error execute with --v=5 or higher
Resolution: Docker is the container runtime here, so the CRI socket must be specified in the command. Reset the node, then join it again. The reset command is destructive; proceed with caution.
# Reset the node
kubeadm reset --cri-socket unix:///var/run/cri-dockerd.sock
rm -rf $HOME/.kube
# Re-join the node
kubeadm join 192.168.13.133:6443 --token qsq414.hrw44xwjoxt2l15l \
--discovery-token-ca-cert-hash sha256:75ae4d4d07420b61a3d9143c847ff74caxxxxxxxxxxxxxxxx1656bef98 --cri-socket unix:///var/run/cri-dockerd.sock
### Output like the following means the node joined the cluster successfully
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.
Run 'kubectl get nodes' on the control-plane to see this node join the cluster.
Note:
Make sure the kubelet service is enabled at boot but not currently running; if kubelet is already running when init/join starts, it fails with a port-in-use error.
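The desired pre-join state is therefore "enabled but inactive". A sketch that checks both conditions; the helper name is ours, the systemctl subcommands are standard:

```shell
#!/bin/sh
# Verify kubelet is enabled for boot but currently not running, so that
# 'kubeadm init' / 'kubeadm join' can start it without a port conflict.
kubelet_state_ok() {
    # $1: output of 'systemctl is-enabled kubelet'
    # $2: output of 'systemctl is-active kubelet'
    [ "$1" = "enabled" ] && [ "$2" != "active" ]
}

# Typical usage on the node:
# kubelet_state_ok "$(systemctl is-enabled kubelet)" "$(systemctl is-active kubelet)" \
#     || echo "fix with: systemctl enable kubelet && systemctl stop kubelet"
```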