These notes record the problems encountered while upgrading Kubernetes from v1.7.6 to v1.11.2, together with the corresponding solutions.
Problem 1: Port 10255 on the master cannot be reached
Log:
[root@test-master-113 qinzhao]# kubectl logs heapster-1967565778-v5wkr -n kube-system heapster
I0918 10:37:05.893124 1 heapster.go:72] /heapster --source=kubernetes.summary_api:''
I0918 10:37:05.893235 1 heapster.go:73] Heapster version v1.4.1
I0918 10:37:05.893944 1 configs.go:61] Using Kubernetes client with master "https://10.0.0.1:443" and version v1
I0918 10:37:05.893993 1 configs.go:62] Using kubelet port 10255
I0918 10:37:05.896697 1 heapster.go:196] Starting with Metric Sink
I0918 10:37:05.911591 1 heapster.go:106] Starting heapster on port 8082
E0918 10:38:05.026522 1 summary.go:97] error while getting metrics summary from Kubelet test-master-113(10.39.0.113:10255): Get http://10.39.0.113:10255/stats/summary/: dial tcp 10.39.0.113:10255: getsockopt: connection refused
The cause is that the kubelet read-only port 10255 is not open on the master, which the netstat output confirms:
[root@test-master-113 qingcloud-csi]# netstat -tlnp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.1:10248 0.0.0.0:* LISTEN 19995/kubelet
tcp 0 0 127.0.0.1:40264 0.0.0.0:* LISTEN 19995/kubelet
tcp 0 0 127.0.0.1:10249 0.0.0.0:* LISTEN 32286/proxy
tcp 0 0 10.39.0.113:10250 0.0.0.0:* LISTEN 19995/kubelet
tcp 0 0 127.0.0.1:9099 0.0.0.0:* LISTEN 21392/calico-node
tcp 0 0 127.0.0.1:10251 0.0.0.0:* LISTEN 22716/scheduler
tcp 0 0 10.39.0.113:6443 0.0.0.0:* LISTEN 22529/apiserver
tcp 0 0 10.39.0.113:2379 0.0.0.0:* LISTEN 30095/etcd
tcp 0 0 127.0.0.1:2379 0.0.0.0:* LISTEN 30095/etcd
tcp 0 0 127.0.0.1:10252 0.0.0.0:* LISTEN 21364/controller-ma
tcp 0 0 10.39.0.113:2380 0.0.0.0:* LISTEN 30095/etcd
tcp 0 0 127.0.0.1:9101 0.0.0.0:* LISTEN 1654/node_exporter
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 1/systemd
tcp 0 0 0.0.0.0:179 0.0.0.0:* LISTEN 21507/bird
tcp 0 0 0.0.0.0:9120 0.0.0.0:* LISTEN 795/sshd
tcp6 0 0 :::9898 :::* LISTEN 31707/kube-discover
tcp6 0 0 :::9100 :::* LISTEN 1725/kube-rbac-prox
tcp6 0 0 :::111 :::* LISTEN 1/systemd
tcp6 0 0 :::10256 :::* LISTEN 32286/proxy
tcp6 0 0 :::8080 :::* LISTEN 22529/apiserver
tcp6 0 0 :::30004 :::* LISTEN 32286/proxy
tcp6 0 0 :::30900 :::* LISTEN 32286/proxy
tcp6 0 0 :::30902 :::* LISTEN 32286/proxy
tcp6 0 0 :::31543 :::* LISTEN 32286/proxy
tcp6 0 0 :::30903 :::* LISTEN 32286/proxy
tcp6 0 0 :::9120 :::* LISTEN 795/sshd
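The fix is to re-enable the kubelet read-only port. A minimal sketch, assuming the kubelet runs as a systemd service and the port was disabled via --read-only-port=0 or a KubeletConfiguration file (the flag name and restart steps are standard kubelet/systemd usage, not taken from these notes):
# Check how the kubelet is started and whether the read-only port is disabled
systemctl cat kubelet | grep -i read-only-port
# Add --read-only-port=10255 to the kubelet unit / drop-in
# (or set readOnlyPort: 10255 in the KubeletConfiguration file), then restart
systemctl daemon-reload
systemctl restart kubelet
# Verify that the endpoint heapster scrapes now answers
curl -s http://10.39.0.113:10255/stats/summary | head -n 5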
Problem 2: The qingcloud-csi storage plugin fails to install
Log:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 1m (x4 over 2m) kubelet, test-slave-117 Container image "csiplugin/csi-qingcloud:v0.2.0" already present on machine
Normal Created 1m (x4 over 2m) kubelet, test-slave-117 Created container
Warning Failed 1m (x4 over 2m) kubelet, test-slave-117 Error: failed to start container "csi-qingcloud": Error response from daemon: linux mounts: Path /var/lib/kubelet is mounted on / but it is not a shared mount.
Normal Pulled 1m (x2 over 2m) kubelet, test-slave-117 Container image "quay.io/k8scsi/driver-registrar:v0.2.0" already present on machine
Normal Created 1m (x2 over 2m) kubelet, test-slave-117 Created container
Normal Started 1m (x2 over 2m) kubelet, test-slave-117 Started container
Warning BackOff 1m (x7 over 2m) kubelet, test-slave-117 Back-off restarting failed container
Solution
[root@test-slave-117 ~]# systemctl cat docker | grep -i mount
MountFlags=slave
Change the docker parameter MountFlags=slave to MountFlags=shared and the problem goes away.
[root@test-slave-117 ~]# cat /usr/lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=network.target rhel-push-plugin.socket registries.service
Wants=docker-storage-setup.service
Requires=docker-cleanup.timer
[Service]
Type=notify
NotifyAccess=all
EnvironmentFile=-/run/containers/registries.conf
EnvironmentFile=-/etc/sysconfig/docker
EnvironmentFile=-/etc/sysconfig/docker-storage
EnvironmentFile=-/etc/sysconfig/docker-network
Environment=GOTRACEBACK=crash
Environment=DOCKER_HTTP_HOST_COMPAT=1
Environment=PATH=/usr/libexec/docker:/usr/bin:/usr/sbin
ExecStart=/usr/bin/dockerd-current \
--add-runtime docker-runc=/usr/libexec/docker/docker-runc-current \
--default-runtime=docker-runc \
--exec-opt native.cgroupdriver=systemd \
--userland-proxy-path=/usr/libexec/docker/docker-proxy-current \
--seccomp-profile=/etc/docker/seccomp.json \
$OPTIONS \
$DOCKER_NETWORK_OPTIONS \
$DOCKER_STORAGE_OPTIONS \
$ADD_REGISTRY \
$BLOCK_REGISTRY \
$INSECURE_REGISTRY \
$REGISTRIES
ExecReload=/bin/kill -s HUP $MAINPID
LimitNOFILE=1048576
LimitNPROC=1048576
LimitCORE=infinity
TimeoutStartSec=0
Restart=on-abnormal
MountFlags=shared ## change this parameter
KillMode=process
[Install]
WantedBy=multi-user.target
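A minimal sketch of applying the change from the shell (plain sed/systemd usage, not taken from these notes; adjust the path if MountFlags is set in a drop-in file instead):
sed -i 's/^MountFlags=slave/MountFlags=shared/' /usr/lib/systemd/system/docker.service
systemctl daemon-reload
systemctl restart docker
# Confirm the new value
systemctl cat docker | grep -i mountflags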
Related documentation and issues
https://kubernetes.io/docs/concepts/storage/volumes/#configuration
https://bugs.centos.org/view.php?id=14455
https://github.com/yunify/qingcloud-csi/blob/master/docs/static-provisioner-zh.md
Problem 3: With network isolation enabled for a namespace, services in that namespace cannot be reached through the kube-lb service, nor can they be pinged
Solution: turn off the network isolation for that namespace.
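If the isolation is implemented with Kubernetes NetworkPolicy objects (an assumption; these notes do not say how it is implemented), a minimal sketch for turning it off in one namespace, with <namespace> as a placeholder:
kubectl get networkpolicy -n <namespace>
kubectl delete networkpolicy --all -n <namespace>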
Problem 4: Certificate mismatch
Log:
[root@dev-master-105 ~]# journalctl -u kubelet -f
-- Logs begin at Fri 2018-09-14 02:49:34 CST. --
Sep 14 18:21:19 dev-master-105 kubelet[17816]: I0914 18:21:19.863620 17816 feature_gate.go:230] feature gates: &{map[RotateKubeletClientCertificate:true RotateKubeletServerCertificate:true]}
Sep 14 18:21:19 dev-master-105 kubelet[17816]: I0914 18:21:19.863678 17816 feature_gate.go:230] feature gates: &{map[RotateKubeletServerCertificate:true RotateKubeletClientCertificate:true]}
Sep 14 18:21:19 dev-master-105 kubelet[17816]: I0914 18:21:19.863841 17816 plugins.go:97] No cloud provider specified.
Sep 14 18:21:19 dev-master-105 kubelet[17816]: I0914 18:21:19.863854 17816 server.go:524] No cloud provider specified: "" from the config file: ""
Sep 14 18:21:19 dev-master-105 kubelet[17816]: I0914 18:21:19.863875 17816 bootstrap.go:56] Using bootstrap kubeconfig to generate TLS client cert, key and kubeconfig file
Sep 14 18:21:19 dev-master-105 kubelet[17816]: I0914 18:21:19.865699 17816 bootstrap.go:86] No valid private key and/or certificate found, reusing existing private key or creating a new one
Sep 14 18:21:19 dev-master-105 kubelet[17816]: F0914 18:21:19.889677 17816 server.go:262] failed to run Kubelet: cannot create certificate signing request: Post https://10.39.0.105:6443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests: x509: certificate signed by unknown authority
The fix is to compare the CA used by the kubelet's bootstrap kubeconfig with the certificates configured on kube-apiserver and make them consistent.
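A minimal sketch of the comparison, assuming the bootstrap kubeconfig is at /etc/kubernetes/bootstrap.kubeconfig and the cluster CA at /etc/kubernetes/ssl/ca.pem (both paths are placeholders, not taken from these notes):
# CA embedded in the kubelet bootstrap kubeconfig
grep certificate-authority-data /etc/kubernetes/bootstrap.kubeconfig | awk '{print $2}' | base64 -d | openssl x509 -noout -fingerprint
# CA used to sign the kube-apiserver certificates
openssl x509 -noout -fingerprint -in /etc/kubernetes/ssl/ca.pem
# The two fingerprints must match; if not, regenerate the bootstrap kubeconfig from the apiserver's CA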
Problem 5: Errors reported with RBAC mode enabled
Normal SandboxChanged 42s (x12 over 1m) kubelet, dev-master-105 Pod sandbox changed, it will be killed and re-created.
Warning FailedCreatePodSandBox 41s (x4 over 48s) kubelet, dev-master-105 (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "efb32cbd3303ed7761e84f6521566a2c6a38fce4cd289b0ff183a42002007f21" network for pod "kube-dns-58d745cd6d-gdgcp": NetworkPlugin cni failed to set up pod "kube-dns-58d745cd6d-gdgcp_kube-system" network: pods "kube-dns-58d745cd6d-gdgcp" is forbidden: User "system:serviceaccount:kube-system:default" cannot get pods in the namespace "kube-system"
Solution: bind system:serviceaccount:kube-system:default to the cluster-admin ClusterRole.
[root@master-47-35 prome]# cat default.yaml
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: default
  namespace: kube-system
subjects:
- kind: ServiceAccount
  name: default
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: ""
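An equivalent one-liner (standard kubectl usage, not taken from these notes) instead of writing the manifest above:
kubectl create clusterrolebinding default --clusterrole=cluster-admin --serviceaccount=kube-system:default
# or apply the manifest shown above
kubectl apply -f default.yaml
Note that binding cluster-admin to the default service account is very broad; a narrower ClusterRole that only allows reading pods would also clear this particular error.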
Labeling master and node roles on the nodes:
kubectl label node dev-master-105 node-role.kubernetes.io/master='master'
kubectl label node dev-slave-107 node-role.kubernetes.io/node='node'
[root@dev-master-105 ~]# kubectl get no
NAME STATUS ROLES AGE VERSION
dev-master-105 Ready master 304d v1.11.0-168+f47446a730ca03
dev-slave-107 Ready node 282d v1.11.0-168+f47446a730ca03
dev-slave-108 Ready node 308d v1.11.0-168+f47446a730ca03
dev-slave-110 Ready node 308d v1.11.0-168+f47446a730ca03
Problem 6: Prometheus reports errors
level=error ts=2018-09-16T23:33:54.76892054Z caller=main.go:211 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:178: Failed to list *v1.Service: Get https://10.0.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.0.0.1:443: getsockopt: no route to host"
level=error ts=2018-09-16T23:33:54.76892347Z caller=main.go:211 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:178: Failed to list *v1.Service: Get https://10.0.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.0.0.1:443: getsockopt: no route to host"
level=error ts=2018-09-16T23:33:54.768960604Z caller=main.go:211 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:177: Failed to list *v1.Endpoints: Get https://10.0.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 10.0.0.1:443: getsockopt: no route to host"
This was caused by the DNS service being down; restarting DNS resolves it.
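A minimal sketch of restarting kube-dns by deleting its pods so the Deployment recreates them (the k8s-app=kube-dns label is the conventional one and is an assumption here):
kubectl -n kube-system delete pod -l k8s-app=kube-dns
kubectl -n kube-system get pods -l k8s-app=kube-dns -w   # watch until Running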
Problem 7: kubectl logs pod-xxx -n namespace cannot show logs
[root@test-master-113 kube-proxy]# kubectl logs prometheus-2159841327-pgfsv -n kube-system prometheus
Error from server (Forbidden): Forbidden (user=kubernetes, verb=get, resource=nodes, subresource=proxy) ( pods/log prometheus-2159841327-pgfsv)
Solution: run the following command:
kubectl create clusterrolebinding kube-apiserver:kubelet-apis --clusterrole=system:kubelet-api-admin --user kubernetes
Problem 8: Listing pods with the new kubectl shows many Evicted pods
[root@dev-master-105 prometheus]# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
alertmanager-1589824335-hm74b 0/1 ContainerCannotRun 18 15h
calico-node-1tlfp 2/2 Running 2 90d
calico-node-d0dz4 2/2 Running 0 155d
calico-node-jjcgc 2/2 Running 2 155d
calico-node-ntlqk 2/2 Running 2 90d
calico-policy-controller-4081051551-4rc73 1/1 Running 1 146d
calico-policy-controller-4081051551-tgk92 0/1 Evicted 0 155d
dummy-3337601728-37zvf 0/1 Evicted 0 155d
dummy-3337601728-h95z2 0/1 Evicted 0 237d
dummy-3337601728-k7zlj 1/1 Running 1 15h
elasticsearch-logging-v1-3910393438-k3cp8 1/1 Running 1 155d
etcd-dev-master-105 1/1 Running 0 10h
filebeat-9k1ng 1/1 Running 1 46d
filebeat-cx4q3 1/1 Running 1 46d
filebeat-l4lhp 1/1 Running 3 46d
filebeat-xxs7f 1/1 Running 40 46d
fluentd-elk-jjnps 1/1 Running 1 37d
grafana-1688250711-jx42d 0/1 ContainerCannotRun 21 15h
grafana-1688250711-sk4p9 0/1 Evicted 0 155d
heapster-4291925761-47kt1 0/4 Evicted 0 155d
heapster-4291925761-r891j 3/4 CrashLoopBackOff 21 146d
informer 0/1 Error 0 254d
kube-apiserver-dev-master-105 1/1 Running 0 10h
kube-controller-manager-dev-master-105 1/1 Running 0 10h
kube-dns-2744242050-8j83t 0/3 CrashLoopBackOff 398 99d
kube-lb-bk738 1/1 Running 1 99d
kube-lb-kx0wj 1/1 Running 0 155d
kube-lb-qws2c 1/1 Running 1 155d
kube-lb-vjbp7 1/1 Running 1 155d
kube-proxy-54p8x 1/1 Running 1 90d
kube-proxy-9t76j 1/1 Running 1 90d
kube-proxy-nb1gl 1/1 Running 6 90d
kube-proxy-v27hm 1/1 Running 1 90d
kube-scheduler-dev-master-105 1/1 Running 0 10h
kubectl-dqjb8 1/1 Running 1 99d
kubectl-kbfvb 1/1 Running 1 155d
kubectl-ml3hb 0/1 ContainerCreating 0 155d
kubectl-x08fl 1/1 Running 1 155d
kubernetes-dashboard-1096786477-3xhb0 0/1 Evicted 0 155d
kubernetes-dashboard-1096786477-ngb71 0/1 CrashLoopBackOff 193 145d
prometheus-7343409-0wql3 0/2 Evicted 0 264d
prometheus-7343409-2l1sr 0/2 Evicted 0 265d
prometheus-7343409-401pp 0/2 Evicted 0 264d
prometheus-7343409-731gp 0/2 Evicted 0 266d
prometheus-7343409-bb6m9 0/2 Evicted 0 264d
prometheus-7343409-cbbzj 0/2 Evicted 0 277d
prometheus-7343409-g4j4z 0/2 Evicted 0 266d
prometheus-7343409-hwk57 0/2 Evicted 0 264d
prometheus-7343409-j4l2v 0/2 Evicted 0 264d
prometheus-7343409-lsw54 2/2 Running 2 155d
prometheus-7343409-mw3vv 0/2 Evicted 0 264d
prometheus-7343409-n4l7z 0/2 Evicted 0 266d
prometheus-7343409-sdmxd 0/2 Evicted 0 266d
prometheus-7343409-spc5r 0/2 Evicted 0 266d
qingcloud-volume-provisioner 1/1 Running 1 151d
Solution: simply delete them:
kubectl get pods -n kube-system |grep Evicted | awk '{print $1}' | xargs kubectl delete pod -nkube-system
Problem 9: node status cannot be viewed with the new calicoctl
Solution: download the calicoctl that matches the installed calico version and configure it accordingly.
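A minimal sketch, assuming calico v3.2.1 as used elsewhere in these notes and the usual projectcalico release layout (the download URL is an assumption, not from these notes):
curl -L -o /usr/local/bin/calicoctl https://github.com/projectcalico/calicoctl/releases/download/v3.2.1/calicoctl
chmod +x /usr/local/bin/calicoctl
calicoctl node status   # talks to the local BGP daemon, so run it as root on the node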
Problem 10: calico-node fails to start
kubectl describe pod <pod-id> shows the following:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Started 45s kubelet, dev-slave-110 Started container
Normal Pulled 45s kubelet, dev-slave-110 Container image "harbor.enncloud.cn/enncloud/node:v3.2.1" already present on machine
Normal Created 45s kubelet, dev-slave-110 Created container
Normal Started 45s kubelet, dev-slave-110 Started container
Normal Pulled 45s kubelet, dev-slave-110 Container image "harbor.enncloud.cn/enncloud/cni:v3.2.1" already present on machine
Normal Created 45s kubelet, dev-slave-110 Created container
Warning DNSConfigForming 43s (x4 over 46s) kubelet, dev-slave-110 Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 127.0.0.1 10.38.240.8 10.36.8.40
Warning Unhealthy 10s (x3 over 30s) kubelet, dev-slave-110 Readiness probe failed: calico/node is not ready: felix is not ready: Get http://localhost:9099/readiness: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
At the same time, the DNS service also fails to come up:
[root@dev-master-105 calico]# kubectl get pods -n kube-system -owide
kube-dns-58d745cd6d-gdgcp 0/3 ContainerCreating 3 14d dev-master-105
Solution:
delete the DNS pod,
or remove the extra nameserver entries from /etc/resolv.conf (the DNSConfigForming warning shows more nameservers than the limit allows).
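A minimal sketch of both options (the pod name is the one from the output above; which nameserver line to drop depends on the environment):
# Option 1: delete the stuck DNS pod so it is recreated
kubectl -n kube-system delete pod kube-dns-58d745cd6d-gdgcp
# Option 2: trim /etc/resolv.conf on the node to at most three nameservers
# (e.g. drop the extra local entry); newly created pod sandboxes pick up the trimmed file
vi /etc/resolv.conf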
Restarting the etcd service produces publish error: etcdserver: request timed out, possibly due to connection lost
Full log:
Sep 30 12:26:51 master-47-34 etcd[5342]: publish error: etcdserver: request timed out, possibly due to connection lost
Sep 30 12:26:58 master-47-34 etcd[5342]: publish error: etcdserver: request timed out
Sep 30 12:27:05 master-47-34 etcd[5342]: publish error: etcdserver: request timed out
Sep 30 12:27:12 master-47-34 etcd[5342]: publish error: etcdserver: request timed out
Sep 30 12:27:19 master-47-34 etcd[5342]: publish error: etcdserver: request timed out
Sep 30 12:27:26 master-47-34 etcd[5342]: publish error: etcdserver: request timed out
Sep 30 12:27:33 master-47-34 etcd[5342]: publish error: etcdserver: request timed out
Sep 30 12:27:40 master-47-34 etcd[5342]: publish error: etcdserver: request timed out
Sep 30 12:27:47 master-47-34 etcd[5342]: publish error: etcdserver: request timed out
Sep 30 12:27:54 master-47-34 etcd[5342]: publish error: etcdserver: request timed out
Sep 30 12:28:01 master-47-34 etcd[5342]: publish error: etcdserver: request timed out
Sep 30 12:28:08 master-47-34 etcd[5342]: publish error: etcdserver: request timed out
Troubleshooting approach
Check the log with journalctl -u etcd -f to see whether it is a permission problem, for example:
cannot open database at /opt/etcd/member/snap/db (open /opt/etcd/member/snap/db: permission denied)
If so, fix the ownership:
chown etcd:etcd /opt/etcd/member/snap/db
If etcd still fails to start, run systemctl status etcd to get the exact startup command and execute it by hand to see the real error.
[root@master-47-34 wal]# systemctl status etcd
● etcd.service - Etcd Server
Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since Sun 2018-09-30 12:36:06 CST; 11ms ago
Process: 10892 ExecStart=/usr/bin/etcd --name=etcd-47-34 --cert-file=/etc/kubernetes/ssl/etcd.pem --key-file=/etc/kubernetes/ssl/etcd-key.pem --peer-cert-file=/etc/kubernetes/ssl/etcd.pem --peer-key-file=/etc/kubernetes/ssl/etcd-key.pem --trusted-ca-file=/etc/kubernetes/ssl/ca.pem --peer-trusted-ca-file=/etc/kubernetes/ssl/ca.pem --initial-advertise-peer-urls=https://10.39.47.34:2380 --listen-peer-urls=https://10.39.47.34:2380 --listen-client-urls=https://10.39.47.34:2379,http://127.0.0.1:2379 --advertise-client-urls=https://10.39.47.34:2379 --initial-cluster-token=k8s-etcd-cluster --initial-cluster=etcd-47-34=https://10.39.47.34:2380,etcd-47-35=https://10.39.47.35:2380,etcd-47-36=https://10.39.47.36:2380 --eletion-timeout=3000 --initial-cluster-state=new --data-dir=/opt/etcd/ (code=exited, status=2)
Main PID: 10892 (code=exited, status=2)
Sep 30 12:36:06 master-47-34 systemd[1]: etcd.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Sep 30 12:36:06 master-47-34 systemd[1]: Failed to start Etcd Server.
Sep 30 12:36:06 master-47-34 systemd[1]: Unit etcd.service entered failed state.
Sep 30 12:36:06 master-47-34 systemd[1]: etcd.service failed.
Run the command manually:
/usr/bin/etcd --name=etcd-47-34 --cert-file=/etc/kubernetes/ssl/etcd.pem --key-file=/etc/kubernetes/ssl/etcd-key.pem --peer-cert-file=/etc/kubernetes/ssl/etcd.pem --peer-key-file=/etc/kubernetes/ssl/etcd-key.pem --trusted-ca-file=/etc/kubernetes/ssl/ca.pem --peer-trusted-ca-file=/etc/kubernetes/ssl/ca.pem --initial-advertise-peer-urls=https://10.39.47.34:2380 --listen-peer-urls=https://10.39.47.34:2380 --listen-client-urls=https://10.39.47.34:2379,http://127.0.0.1:2379 --advertise-client-urls=https://10.39.47.34:2379 --initial-cluster-token=k8s-etcd-cluster --initial-cluster=etcd-47-34=https://10.39.47.34:2380,etcd-47-35=https://10.39.47.35:2380,etcd-47-36=https://10.39.47.36:2380 --initial-cluster-state=new --data-dir=/opt/etcd/
Run it and see what the problem is; once it comes up cleanly, restart the etcd service with systemctl restart etcd.
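A quick health check after the restart, assuming an etcdctl v3 binary is available on the node (the certificate paths are the ones from the unit file above):
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.39.47.34:2379 \
  --cacert=/etc/kubernetes/ssl/ca.pem \
  --cert=/etc/kubernetes/ssl/etcd.pem \
  --key=/etc/kubernetes/ssl/etcd-key.pem \
  endpoint health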
Why CI/CD images could not be pulled
Symptom -> log
[2018/09/30 17:14:24][Warning] Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 127.0.0.1 10.38.240.8 10.36.8.40
[2018/09/30 17:14:26][Warning] Failed to pull image "harbor.enncloud.cn/paas/clone-repo:v2.2": [rpc error: code = Unknown desc = Error: image paas/clone-repo:v2.2 not found, rpc error: code = Unknown desc = Error: image paas/clone-repo:v2.2 not found]
[2018/09/30 17:14:26][Warning] Error: ErrImagePull
[2018/09/30 17:14:26][Warning] Error: ImagePullBackOff
Solution: delete the stale config symlink under /var/lib/kubelet.
[root@dev-slave-108 ~]# cd /var/lib/kubelet/
[root@dev-slave-108 kubelet]# ls -l
total 196
lrwxrwxrwx 1 root root 20 Dec 8 2017 config -> /root/.docker/config
lrwxrwxrwx 1 root root 25 Jan 8 2018 config.json -> /root/.docker/config.json
-rw-r--r-- 1 root root 40 Sep 15 08:57 cpu_manager_state
drwxr-xr-x 2 root root 4096 Sep 30 17:20 device-plugins
drwx------ 2 root root 4096 Sep 15 08:57 plugin-containers
drwxr-x--- 4 root root 4096 Sep 17 11:05 plugins
drwxr-x--- 51 root root 180224 Sep 30 17:20 pods
[root@dev-slave-108 kubelet]# rm -rf config
[root@dev-slave-108 kubelet]# ls -l
total 196
lrwxrwxrwx 1 root root 25 Jan 8 2018 config.json -> /root/.docker/config.json
-rw-r--r-- 1 root root 40 Sep 15 08:57 cpu_manager_state
drwxr-xr-x 2 root root 4096 Sep 30 17:20 device-plugins
drwx------ 2 root root 4096 Sep 15 08:57 plugin-containers
drwxr-x--- 4 root root 4096 Sep 17 11:05 plugins
drwxr-x--- 51 root root 180224 Sep 30 17:20 pods
The kubelet reports orphaned pods whose volume paths are still present on disk:
Oct 12 11:27:41 test-slave-114 kubelet[27200]: E1012 11:27:41.992855 27200 kubelet_volumes.go:140] Orphaned pod "32828c49-a053-11e8-a528-5254eec04736" found, but volume paths are still present on disk : There were a total of 3 errors similar to this. Turn up verbosity to see them.
Oct 12 11:27:43 test-slave-114 kubelet[27200]: E1012 11:27:43.988378 27200 kubelet_volumes.go:140] Orphaned pod "32828c49-a053-11e8-a528-5254eec04736" found, but volume paths are still present on disk : There were a total of 3 errors similar to this. Turn up verbosity to see them.
Oct 12 11:27:45 test-slave-114 kubelet[27200]: E1012 11:27:45.992073 27200 kubelet_volumes.go:140] Orphaned pod "32828c49-a053-11e8-a528-5254eec04736" found, but volume paths are still present on disk : There were a total of 3 errors similar to this. Turn up verbosity to see them.
Oct 12 11:27:48 test-slave-114 kubelet[27200]: E1012 11:27:48.119069 27200 kubelet_volumes.go:140] Orphaned pod "32828c49-a053-11e8-a528-5254eec04736" found, but volume paths are still present on disk : There were a total of 3 errors similar to this. Turn up verbosity to see them.
Oct 12 11:27:49 test-slave-114 kubelet[27200]: E1012 11:27:49.994550 27200 kubelet_volumes.go:140] Orphaned pod "32828c49-a053-11e8-a528-5254eec04736" found, but volume paths are still present on disk : There were a total of 3 errors similar to this. Turn up verbosity to see them.
Solution: manually delete that pod's leftover data on the host; if that is not enough, restart the kubelet.
For example, given the error:
Oct 12 12:54:36 test-slave-114 kubelet[68577]: E1012 12:54:36.359113 68577 kubelet_volumes.go:140] Orphaned pod "80a01854-c1fe-11e8-8f5d-5254eec04736" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.
[root@test-slave-114 pods]# ls -l
total 32
drwxr-x--- 5 root root 4096 Oct 8 13:42 01bdc09f-cabd-11e8-afbb-5254eec04736
drwxr-x--- 4 root root 4096 Sep 27 10:39 80a01854-c1fe-11e8-8f5d-5254eec04736
drwxr-x--- 5 root root 4096 Oct 9 10:11 89620113-cb68-11e8-afbb-5254eec04736
drwxr-x--- 5 root root 4096 Oct 12 11:12 9cf8d3b8-cdcc-11e8-afbb-5254eec04736
drwxr-x--- 5 root root 4096 Oct 8 16:40 c07b96d6-cad5-11e8-afbb-5254eec04736
drwxr-x--- 5 root root 4096 Oct 9 09:30 d71de50f-cb62-11e8-afbb-5254eec04736
drwxr-x--- 5 root root 4096 Oct 12 11:07 e8fe1a5f-cdcb-11e8-afbb-5254eec04736
drwxr-x--- 5 root root 4096 Oct 8 13:21 f800c26d-cab9-11e8-afbb-5254eec04736
[root@test-slave-114 pods]# rm -rf 80a01854-c1fe-11e8-8f5d-5254eec04736
[root@test-slave-114 pods]# ls -l
total 28
drwxr-x--- 5 root root 4096 Oct 8 13:42 01bdc09f-cabd-11e8-afbb-5254eec04736
drwxr-x--- 5 root root 4096 Oct 9 10:11 89620113-cb68-11e8-afbb-5254eec04736
drwxr-x--- 5 root root 4096 Oct 12 11:12 9cf8d3b8-cdcc-11e8-afbb-5254eec04736
drwxr-x--- 5 root root 4096 Oct 8 16:40 c07b96d6-cad5-11e8-afbb-5254eec04736
drwxr-x--- 5 root root 4096 Oct 9 09:30 d71de50f-cb62-11e8-afbb-5254eec04736
drwxr-x--- 5 root root 4096 Oct 12 11:07 e8fe1a5f-cdcb-11e8-afbb-5254eec04736
drwxr-x--- 5 root root 4096 Oct 8 13:21 f800c26d-cab9-11e8-afbb-5254eec04736
Checking the kubelet log again with journalctl -u kubelet -f, the error no longer appears.
Problem: kubelet: - exit status 1, rootInodeErr: cmd [find /var/lib/docker/overlay/fda10ef69b731eb6aa0dae3b0cb438a5d114ac1195ec45e6472ba26cd7f169b9 -xdev -printf .] failed. stderr: find
Oct 23 16:46:43 proxy-20-85 kubelet: - exit status 1, rootInodeErr: cmd [find /var/lib/docker/overlay/fda10ef69b731eb6aa0dae3b0cb438a5d114ac1195ec45e6472ba26cd7f169b9 -xdev -printf .] failed. stderr: find: ‘/var/lib/docker/overlay/fda10ef69b731eb6aa0dae3b0cb438a5d114ac1195ec45e6472ba26cd7f169b9’: No such file or directory
Oct 23 16:46:43 proxy-20-85 kubelet: ; err: exit status 1, extraDiskErr: du command failed on /var/lib/docker/containers/a073fea95af9ba1227bc2e91437fbc02f34a07dd3963042cd9d37a0f6086b365 with output stdout: , stderr: du: cannot access ‘/var/lib/docker/containers/a073fea95af9ba1227bc2e91437fbc02f34a07dd3963042cd9d37a0f6086b365’: No such file or directory
Oct 23 16:46:43 proxy-20-85 kubelet: - exit status 1
Oct 23 16:46:43 proxy-20-85 kubelet: - exit status 1, rootInodeErr: cmd [find /var/lib/docker/overlay/43887091ad63cbc9d83048b602d2968200e91dbe5465fe194470b3ecf1a8ee98 -xdev -printf .] failed. stderr: find: ‘/var/lib/docker/overlay/43887091ad63cbc9d83048b602d2968200e91dbe5465fe194470b3ecf1a8ee98’: No such file or directory
Oct 23 16:46:43 proxy-20-85 kubelet: ; err: exit status 1, extraDiskErr: du command failed on /var/lib/docker/containers/b7bc9aecb053743d5af528fd16ae814157c084c5be3f023f6ffd08022fa50969 with output stdout: , stderr: du: cannot access ‘/var/lib/docker/containers/b7bc9aecb053743d5af528fd16ae814157c084c5be3f023f6ffd08022fa50969’: No such file or directory
Oct 23 16:46:43 proxy-20-85 kubelet: - exit status 1
Oct 23 16:59:28 proxy-20-85 kubelet: ERROR:1023 16:59:28.852195 20189 kubelet_network.go:378] Failed to ensure marking rule for KUBE-MARK-DROP: error checking rule: exit status 4: iptables: Resource temporarily unavailable.
Oct 23 17:29:47 proxy-20-85 kubelet: ERROR:1023 17:29:47.538377 20189 kubelet_network.go:412] Failed to ensure marking rule for KUBE-MARK-MASQ: error checking rule: exit status 4: iptables: Resource temporarily unavailable.
Oct 23 18:24:40 proxy-20-85 kubelet: ERROR:1023 18:24:40.045560 20189 kubelet_network.go:412] Failed to ensure marking rule for KUBE-MARK-MASQ: error checking rule: exit status 4: iptables: Resource temporarily unavailable.
Related issues
Oct 24 16:05:15 node-23-40 kubelet[14309]: /usr/local/go/src/runtime/asm_amd64.s:2361
Oct 24 16:05:15 node-23-40 kubelet[14309]: E1024 16:05:15.438773 14309 runtime.go:66] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
Oct 24 16:05:15 node-23-40 kubelet[14309]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
Oct 24 16:05:15 node-23-40 kubelet[14309]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
Oct 24 16:05:15 node-23-40 kubelet[14309]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
Oct 24 16:05:15 node-23-40 kubelet[14309]: /usr/local/go/src/runtime/asm_amd64.s:573
Oct 24 16:05:15 node-23-40 kubelet[14309]: /usr/local/go/src/runtime/panic.go:502
Oct 24 16:05:15 node-23-40 kubelet[14309]: /usr/local/go/src/runtime/panic.go:63
Oct 24 16:05:15 node-23-40 kubelet[14309]: /usr/local/go/src/runtime/signal_unix.go:388
Oct 24 16:05:15 node-23-40 kubelet[14309]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/volume/flexvolume/plugin.go:85
Oct 24 16:05:15 node-23-40 kubelet[14309]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/volume/flexvolume/probe.go:130
Oct 24 16:05:15 node-23-40 kubelet[14309]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/volume/flexvolume/probe.go:88
Oct 24 16:05:15 node-23-40 kubelet[14309]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/volume/flexvolume/probe.go:88
Oct 24 16:05:15 node-23-40 kubelet[14309]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/volume/flexvolume/probe.go:77
Oct 24 16:05:15 node-23-40 kubelet[14309]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/volume/plugins.go:593
Oct 24 16:05:15 node-23-40 kubelet[14309]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/volume/plugins.go:540
Oct 24 16:05:15 node-23-40 kubelet[14309]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/volumemanager/cache/desired_state_of_world.go:192
This is a flex-volume problem; deleting the following directory fixes it:
/usr/libexec/kubernetes/kubelet-plugins/volume/exec/qingcloud~flex-volume/
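A minimal sketch (restarting the kubelet afterwards is an assumption, so that it re-probes the remaining volume plugins):
rm -rf /usr/libexec/kubernetes/kubelet-plugins/volume/exec/qingcloud~flex-volume/
systemctl restart kubelet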
NFS mount failed
Log:
[root@node-23-41 ~]# mount -t nfs 10.39.35.39:/mnt/cid_4c5ba11442cf_tanyanliao_nfs8 /var/lib/kubelet/pods/0d522e1b-d8ce-11e8-b14e-5254416bb222/volumes/kubernetes.io~nfs/tanyanliao-nfs8
mount: wrong fs type, bad option, bad superblock on 10.39.35.39:/mnt/cid_4c5ba11442cf_tanyanliao_nfs8,
missing codepage or helper program, or other error
(for several filesystems (e.g. nfs, cifs) you might
need a /sbin/mount.<type> helper program)
In some cases useful info is found in syslog - try
dmesg | tail or so.
Solution for the NFS mount error wrong fs type, bad option, bad superblock: install the NFS client utilities on the node.
apt-get install nfs-common
or
yum install nfs-utils
Problem: a ConfigMap mounted onto a path that is a file in the image
Log:
failed to open log file "/var/log/pods/7e4623a0-dda3-11e8-885a-5254c2cdf2fd/carrier-test-mask_0.log": open /var/log/pods/7e4623a0-dda3-11e8-885a-5254c2cdf2fd/carrier-test-mask_0.log: no such file or directory
Nov 01 15:05:32 slave-20-50 kubelet[5662]: ERROR:1101 15:05:32.221060 5662 rbd.go:415] rbd: failed to setup mount /var/lib/kubelet/pods/4901cdc2-d27e-11e8-885a-5254c2cdf2fd/volumes/kubernetes.io~rbd/ceres-study.datadir-redis01-2 rbd: image ceres-study.CID-516874818ed4.datadir-redis01-2 is locked by other nodes. carrier-test-mask-3950755848-wgt4h 0/1 rpc error: code = 2 desc = failed to start container "9a78c1dceaa3ff06f13f94bdd9e4de618605a7e9cf4d8e150ca1d51fd6eaf658": Error response from daemon: {"message":"oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:359: container init caused \\\"rootfs_linux.go:54: mounting \\\\\\\"/var/lib/kubelet/pods/c8e98e2b-dda4-11e8-885a-5254c2cdf2fd/volumes/kubernetes.io~configmap/configmap-volume-1/default.conf\\\\\\\" to rootfs \\\\\\\"/var/lib/docker/devicemapper/mnt/dc291cf0d7814a457fd5df2d4aeff54274a84d3265c706727d350f74691607da/rootfs\\\\\\\" at \\\\\\\"/etc/nginx/conf.d/default.conf/default.conf\\\\\\\" caused \\\\\\\"lstat /var/lib/docker/devicemapper/mnt/dc291cf0d7814a457fd5df2d4aeff54274a84d3265c706727d350f74691607da/rootfs/etc/nginx/conf.d/default.conf/default.conf: not a directory\\\\\\\"\\\"\"\n: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type"
The mount path treats /etc/nginx/conf.d/default.conf as a directory, but in the image it is a regular file, so the target /rootfs/etc/nginx/conf.d/default.conf/default.conf cannot be created ("not a directory"). Mount the ConfigMap file directly onto /etc/nginx/conf.d/default.conf (or mount it into the /etc/nginx/conf.d directory) instead.
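A minimal sketch of a correct mount using subPath, so that only the single file is placed over the file path inside the image (all names below are hypothetical, not taken from the failing deployment):
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx-conf-demo           # hypothetical pod for illustration
spec:
  containers:
  - name: nginx
    image: nginx
    volumeMounts:
    - name: conf
      mountPath: /etc/nginx/conf.d/default.conf   # mount onto the file itself, not a directory under it
      subPath: default.conf
  volumes:
  - name: conf
    configMap:
      name: nginx-conf            # hypothetical ConfigMap containing a default.conf key
EOF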