Kubernetes Upgrade Troubleshooting Notes

These notes record the problems encountered while upgrading Kubernetes from v1.7.6 to v1.11.2 and how each one was resolved.

Problem 1: Port 10255 on the master cannot be reached
Log

[root@test-master-113 qinzhao]# kubectl logs heapster-1967565778-v5wkr -n kube-system heapster 
I0918 10:37:05.893124       1 heapster.go:72] /heapster --source=kubernetes.summary_api:''
I0918 10:37:05.893235       1 heapster.go:73] Heapster version v1.4.1
I0918 10:37:05.893944       1 configs.go:61] Using Kubernetes client with master "https://10.0.0.1:443" and version v1
I0918 10:37:05.893993       1 configs.go:62] Using kubelet port 10255
I0918 10:37:05.896697       1 heapster.go:196] Starting with Metric Sink
I0918 10:37:05.911591       1 heapster.go:106] Starting heapster on port 8082
E0918 10:38:05.026522       1 summary.go:97] error while getting metrics summary from Kubelet test-master-113(10.39.0.113:10255): Get http://10.39.0.113:10255/stats/summary/: dial tcp 10.39.0.113:10255: getsockopt: connection refused

The cause: port 10255 (the kubelet read-only port) is not open on the master, as the netstat output below confirms.

[root@test-master-113 qingcloud-csi]# netstat -tlnp 
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.1:10248         0.0.0.0:*               LISTEN      19995/kubelet       
tcp        0      0 127.0.0.1:40264         0.0.0.0:*               LISTEN      19995/kubelet       
tcp        0      0 127.0.0.1:10249         0.0.0.0:*               LISTEN      32286/proxy         
tcp        0      0 10.39.0.113:10250       0.0.0.0:*               LISTEN      19995/kubelet       
tcp        0      0 127.0.0.1:9099          0.0.0.0:*               LISTEN      21392/calico-node   
tcp        0      0 127.0.0.1:10251         0.0.0.0:*               LISTEN      22716/scheduler     
tcp        0      0 10.39.0.113:6443        0.0.0.0:*               LISTEN      22529/apiserver     
tcp        0      0 10.39.0.113:2379        0.0.0.0:*               LISTEN      30095/etcd          
tcp        0      0 127.0.0.1:2379          0.0.0.0:*               LISTEN      30095/etcd          
tcp        0      0 127.0.0.1:10252         0.0.0.0:*               LISTEN      21364/controller-ma 
tcp        0      0 10.39.0.113:2380        0.0.0.0:*               LISTEN      30095/etcd          
tcp        0      0 127.0.0.1:9101          0.0.0.0:*               LISTEN      1654/node_exporter  
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      1/systemd           
tcp        0      0 0.0.0.0:179             0.0.0.0:*               LISTEN      21507/bird          
tcp        0      0 0.0.0.0:9120            0.0.0.0:*               LISTEN      795/sshd            
tcp6       0      0 :::9898                 :::*                    LISTEN      31707/kube-discover 
tcp6       0      0 :::9100                 :::*                    LISTEN      1725/kube-rbac-prox 
tcp6       0      0 :::111                  :::*                    LISTEN      1/systemd           
tcp6       0      0 :::10256                :::*                    LISTEN      32286/proxy         
tcp6       0      0 :::8080                 :::*                    LISTEN      22529/apiserver     
tcp6       0      0 :::30004                :::*                    LISTEN      32286/proxy         
tcp6       0      0 :::30900                :::*                    LISTEN      32286/proxy         
tcp6       0      0 :::30902                :::*                    LISTEN      32286/proxy         
tcp6       0      0 :::31543                :::*                    LISTEN      32286/proxy         
tcp6       0      0 :::30903                :::*                    LISTEN      32286/proxy         
tcp6       0      0 :::9120                 :::*                    LISTEN      795/sshd           
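
Nothing in the netstat output is listening on 10255, which suggests the read-only port was disabled (set to 0) in the kubelet configuration used after the upgrade. A minimal sketch of re-enabling it, assuming the kubelet runs under systemd and its startup flags can be edited (the flag/config file location varies by installation):

# add --read-only-port=10255 to the kubelet startup arguments (0 disables the port), then:
systemctl daemon-reload
systemctl restart kubelet
# verify that the summary endpoint heapster scrapes is reachable again
curl -s http://10.39.0.113:10255/stats/summary | head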

Problem 2: The qingcloud-csi storage plugin fails to install
Log

Type     Reason   Age              From                     Message
  ----     ------   ----             ----                     -------
  Normal   Pulled   1m (x4 over 2m)  kubelet, test-slave-117  Container image "csiplugin/csi-qingcloud:v0.2.0" already present on machine
  Normal   Created  1m (x4 over 2m)  kubelet, test-slave-117  Created container
  Warning  Failed   1m (x4 over 2m)  kubelet, test-slave-117  Error: failed to start container "csi-qingcloud": Error response from daemon: linux mounts: Path /var/lib/kubelet is mounted on / but it is not a shared mount.
  Normal   Pulled   1m (x2 over 2m)  kubelet, test-slave-117  Container image "quay.io/k8scsi/driver-registrar:v0.2.0" already present on machine
  Normal   Created  1m (x2 over 2m)  kubelet, test-slave-117  Created container
  Normal   Started  1m (x2 over 2m)  kubelet, test-slave-117  Started container
  Warning  BackOff  1m (x7 over 2m)  kubelet, test-slave-117  Back-off restarting failed container

Fix

[root@test-slave-117 ~]# systemctl cat docker | grep -i mount
MountFlags=slave

Change Docker's MountFlags=slave to MountFlags=shared and the plugin starts normally.

[root@test-slave-117 ~]# cat /usr/lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=network.target rhel-push-plugin.socket registries.service
Wants=docker-storage-setup.service
Requires=docker-cleanup.timer

[Service]
Type=notify
NotifyAccess=all
EnvironmentFile=-/run/containers/registries.conf
EnvironmentFile=-/etc/sysconfig/docker
EnvironmentFile=-/etc/sysconfig/docker-storage
EnvironmentFile=-/etc/sysconfig/docker-network
Environment=GOTRACEBACK=crash
Environment=DOCKER_HTTP_HOST_COMPAT=1
Environment=PATH=/usr/libexec/docker:/usr/bin:/usr/sbin
ExecStart=/usr/bin/dockerd-current \
          --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current \
          --default-runtime=docker-runc \
          --exec-opt native.cgroupdriver=systemd \
          --userland-proxy-path=/usr/libexec/docker/docker-proxy-current \
          --seccomp-profile=/etc/docker/seccomp.json \
          $OPTIONS \
          $DOCKER_NETWORK_OPTIONS \
          $DOCKER_STORAGE_OPTIONS \
          $ADD_REGISTRY \
          $BLOCK_REGISTRY \
          $INSECURE_REGISTRY \
	  $REGISTRIES
ExecReload=/bin/kill -s HUP $MAINPID
LimitNOFILE=1048576
LimitNPROC=1048576
LimitCORE=infinity
TimeoutStartSec=0
Restart=on-abnormal
MountFlags=shared ## change this parameter
KillMode=process

[Install]
WantedBy=multi-user.target
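
A minimal sketch of applying the change:

sed -i 's/^MountFlags=slave/MountFlags=shared/' /usr/lib/systemd/system/docker.service
systemctl daemon-reload
systemctl restart docker
# confirm the new value took effect
systemctl cat docker | grep -i mountflags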

Related documentation and issues

https://kubernetes.io/docs/concepts/storage/volumes/#configuration
https://bugs.centos.org/view.php?id=14455
https://github.com/yunify/qingcloud-csi/blob/master/docs/static-provisioner-zh.md


Problem 3: With network isolation enabled for a namespace, services in that namespace cannot be reached through the kube-lb service, and they cannot be pinged either
Fix: turn off network isolation for that namespace.
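
If the isolation is implemented with NetworkPolicy objects (as it is with Calico), a hedged way to find and remove the policy in the affected namespace (names below are placeholders):

kubectl -n <namespace> get networkpolicy
kubectl -n <namespace> delete networkpolicy <policy-name>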


Problem 4: Certificate mismatch
Log

[root@dev-master-105 ~]# journalctl -u kubelet -f 
-- Logs begin at Fri 2018-09-14 02:49:34 CST. --
Sep 14 18:21:19 dev-master-105 kubelet[17816]: I0914 18:21:19.863620   17816 feature_gate.go:230] feature gates: &{map[RotateKubeletClientCertificate:true RotateKubeletServerCertificate:true]}
Sep 14 18:21:19 dev-master-105 kubelet[17816]: I0914 18:21:19.863678   17816 feature_gate.go:230] feature gates: &{map[RotateKubeletServerCertificate:true RotateKubeletClientCertificate:true]}
Sep 14 18:21:19 dev-master-105 kubelet[17816]: I0914 18:21:19.863841   17816 plugins.go:97] No cloud provider specified.
Sep 14 18:21:19 dev-master-105 kubelet[17816]: I0914 18:21:19.863854   17816 server.go:524] No cloud provider specified: "" from the config file: ""
Sep 14 18:21:19 dev-master-105 kubelet[17816]: I0914 18:21:19.863875   17816 bootstrap.go:56] Using bootstrap kubeconfig to generate TLS client cert, key and kubeconfig file
Sep 14 18:21:19 dev-master-105 kubelet[17816]: I0914 18:21:19.865699   17816 bootstrap.go:86] No valid private key and/or certificate found, reusing existing private key or creating a new one
Sep 14 18:21:19 dev-master-105 kubelet[17816]: F0914 18:21:19.889677   17816 server.go:262] failed to run Kubelet: cannot create certificate signing request: Post https://10.39.0.105:6443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests: x509: certificate signed by unknown authority

Fix: compare the CA/certificates the kubelet uses against the ones specified for kube-apiserver and make them consistent.
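
A hedged way to make the comparison; the bootstrap kubeconfig path is an assumption and the CA path follows the /etc/kubernetes/ssl layout used elsewhere in these notes:

# CA referenced by the kubelet bootstrap kubeconfig
grep -E 'certificate-authority' /etc/kubernetes/bootstrap.kubeconfig
# fingerprint of the CA the cluster certificates were issued from
openssl x509 -noout -fingerprint -in /etc/kubernetes/ssl/ca.pem
# issuer of the certificate the apiserver actually presents on 6443
echo | openssl s_client -connect 10.39.0.105:6443 2>/dev/null | openssl x509 -noout -issuer -subject

If the CA in the bootstrap kubeconfig is not the one that issued the apiserver certificate, point the kubeconfig at the correct CA and restart kubelet.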


Problem 5: Errors reported with RBAC enabled

  Normal   SandboxChanged          42s (x12 over 1m)  kubelet, dev-master-105  Pod sandbox changed, it will be killed and re-created.
  Warning  FailedCreatePodSandBox  41s (x4 over 48s)  kubelet, dev-master-105  (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "efb32cbd3303ed7761e84f6521566a2c6a38fce4cd289b0ff183a42002007f21" network for pod "kube-dns-58d745cd6d-gdgcp": NetworkPlugin cni failed to set up pod "kube-dns-58d745cd6d-gdgcp_kube-system" network: pods "kube-dns-58d745cd6d-gdgcp" is forbidden: User "system:serviceaccount:kube-system:default" cannot get pods in the namespace "kube-system"

Fix: bind the system:serviceaccount:kube-system:default service account to the cluster-admin ClusterRole.

[root@master-47-35 prome]# cat default.yaml 
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: default
  namespace: kube-system
subjects:
- kind: ServiceAccount
  name: default
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
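
Apply and verify the binding:

kubectl apply -f default.yaml
kubectl get clusterrolebinding default -o yaml

Note that cluster-admin is the broadest possible role; a narrower ClusterRole would be safer, but this matches the quick fix used here.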


Labeling master and node roles on the nodes

kubectl label node dev-master-105 node-role.kubernetes.io/master='master'
kubectl label node dev-slave-107  node-role.kubernetes.io/node='node'
[root@dev-master-105 ~]# kubectl get no
NAME             STATUS    ROLES     AGE       VERSION
dev-master-105   Ready     master    304d      v1.11.0-168+f47446a730ca03
dev-slave-107    Ready     node      282d      v1.11.0-168+f47446a730ca03
dev-slave-108    Ready     node      308d      v1.11.0-168+f47446a730ca03
dev-slave-110    Ready     node      308d      v1.11.0-168+f47446a730ca03

Problem 6: Prometheus errors

level=error ts=2018-09-16T23:33:54.76892054Z caller=main.go:211 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:178: Failed to list *v1.Service: Get https://10.0.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.0.0.1:443: getsockopt: no route to host"
level=error ts=2018-09-16T23:33:54.76892347Z caller=main.go:211 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:178: Failed to list *v1.Service: Get https://10.0.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.0.0.1:443: getsockopt: no route to host"
level=error ts=2018-09-16T23:33:54.768960604Z caller=main.go:211 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:177: Failed to list *v1.Endpoints: Get https://10.0.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 10.0.0.1:443: getsockopt: no route to host"

This was caused by DNS being down.
Restarting DNS resolves the problem.
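
A minimal sketch of restarting DNS, assuming the standard k8s-app=kube-dns label on the kube-dns pods:

kubectl -n kube-system delete pod -l k8s-app=kube-dns
# the Deployment recreates the pods; check that they come back Running
kubectl -n kube-system get pods -l k8s-app=kube-dns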


Problem 7: kubectl logs pod-xxx -n namespace cannot fetch logs

[root@test-master-113 kube-proxy]# kubectl logs prometheus-2159841327-pgfsv -n kube-system prometheus 
Error from server (Forbidden): Forbidden (user=kubernetes, verb=get, resource=nodes, subresource=proxy) ( pods/log prometheus-2159841327-pgfsv)

Run the following command:

kubectl create clusterrolebinding kube-apiserver:kubelet-apis --clusterrole=system:kubelet-api-admin --user kubernetes

Problem 8: Listing pods with the new kubectl shows many Evicted pods

[root@dev-master-105 prometheus]# kubectl get  pods -n kube-system 
NAME                                        READY     STATUS               RESTARTS   AGE
alertmanager-1589824335-hm74b               0/1       ContainerCannotRun   18         15h
calico-node-1tlfp                           2/2       Running              2          90d
calico-node-d0dz4                           2/2       Running              0          155d
calico-node-jjcgc                           2/2       Running              2          155d
calico-node-ntlqk                           2/2       Running              2          90d
calico-policy-controller-4081051551-4rc73   1/1       Running              1          146d
calico-policy-controller-4081051551-tgk92   0/1       Evicted              0          155d
dummy-3337601728-37zvf                      0/1       Evicted              0          155d
dummy-3337601728-h95z2                      0/1       Evicted              0          237d
dummy-3337601728-k7zlj                      1/1       Running              1          15h
elasticsearch-logging-v1-3910393438-k3cp8   1/1       Running              1          155d
etcd-dev-master-105                         1/1       Running              0          10h
filebeat-9k1ng                              1/1       Running              1          46d
filebeat-cx4q3                              1/1       Running              1          46d
filebeat-l4lhp                              1/1       Running              3          46d
filebeat-xxs7f                              1/1       Running              40         46d
fluentd-elk-jjnps                           1/1       Running              1          37d
grafana-1688250711-jx42d                    0/1       ContainerCannotRun   21         15h
grafana-1688250711-sk4p9                    0/1       Evicted              0          155d
heapster-4291925761-47kt1                   0/4       Evicted              0          155d
heapster-4291925761-r891j                   3/4       CrashLoopBackOff     21         146d
informer                                    0/1       Error                0          254d
kube-apiserver-dev-master-105               1/1       Running              0          10h
kube-controller-manager-dev-master-105      1/1       Running              0          10h
kube-dns-2744242050-8j83t                   0/3       CrashLoopBackOff     398        99d
kube-lb-bk738                               1/1       Running              1          99d
kube-lb-kx0wj                               1/1       Running              0          155d
kube-lb-qws2c                               1/1       Running              1          155d
kube-lb-vjbp7                               1/1       Running              1          155d
kube-proxy-54p8x                            1/1       Running              1          90d
kube-proxy-9t76j                            1/1       Running              1          90d
kube-proxy-nb1gl                            1/1       Running              6          90d
kube-proxy-v27hm                            1/1       Running              1          90d
kube-scheduler-dev-master-105               1/1       Running              0          10h
kubectl-dqjb8                               1/1       Running              1          99d
kubectl-kbfvb                               1/1       Running              1          155d
kubectl-ml3hb                               0/1       ContainerCreating    0          155d
kubectl-x08fl                               1/1       Running              1          155d
kubernetes-dashboard-1096786477-3xhb0       0/1       Evicted              0          155d
kubernetes-dashboard-1096786477-ngb71       0/1       CrashLoopBackOff     193        145d
prometheus-7343409-0wql3                    0/2       Evicted              0          264d
prometheus-7343409-2l1sr                    0/2       Evicted              0          265d
prometheus-7343409-401pp                    0/2       Evicted              0          264d
prometheus-7343409-731gp                    0/2       Evicted              0          266d
prometheus-7343409-bb6m9                    0/2       Evicted              0          264d
prometheus-7343409-cbbzj                    0/2       Evicted              0          277d
prometheus-7343409-g4j4z                    0/2       Evicted              0          266d
prometheus-7343409-hwk57                    0/2       Evicted              0          264d
prometheus-7343409-j4l2v                    0/2       Evicted              0          264d
prometheus-7343409-lsw54                    2/2       Running              2          155d
prometheus-7343409-mw3vv                    0/2       Evicted              0          264d
prometheus-7343409-n4l7z                    0/2       Evicted              0          266d
prometheus-7343409-sdmxd                    0/2       Evicted              0          266d
prometheus-7343409-spc5r                    0/2       Evicted              0          266d
qingcloud-volume-provisioner                1/1       Running              1          151d

Fix: simply delete them.

kubectl get  pods -n kube-system |grep Evicted | awk '{print $1}' | xargs kubectl delete pod -nkube-system

Problem 9: Node status cannot be viewed with the new calicoctl

Fix: download the calicoctl release that matches the cluster's Calico version and run the corresponding command.
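
A sketch of doing that; the download URL and version are assumptions based on the Calico v3.2.1 images used elsewhere in these notes:

curl -L -o /usr/local/bin/calicoctl https://github.com/projectcalico/calicoctl/releases/download/v3.2.1/calicoctl
chmod +x /usr/local/bin/calicoctl
# node status reads local BGP/felix state, so run it as root on the node itself
calicoctl node status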


Problem 10
calico-node does not start
kubectl describe pod <pod-id> shows the following:

Events:
  Type     Reason            Age                From                    Message
  ----     ------            ----               ----                    -------
  Normal   Started           45s                kubelet, dev-slave-110  Started container
  Normal   Pulled            45s                kubelet, dev-slave-110  Container image "harbor.enncloud.cn/enncloud/node:v3.2.1" already present on machine
  Normal   Created           45s                kubelet, dev-slave-110  Created container
  Normal   Started           45s                kubelet, dev-slave-110  Started container
  Normal   Pulled            45s                kubelet, dev-slave-110  Container image "harbor.enncloud.cn/enncloud/cni:v3.2.1" already present on machine
  Normal   Created           45s                kubelet, dev-slave-110  Created container
  Warning  DNSConfigForming  43s (x4 over 46s)  kubelet, dev-slave-110  Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 127.0.0.1 10.38.240.8 10.36.8.40
  Warning  Unhealthy         10s (x3 over 30s)  kubelet, dev-slave-110  Readiness probe failed: calico/node is not ready: felix is not ready: Get http://localhost:9099/readiness: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

At the same time, the DNS service also fails to start:

[root@dev-master-105 calico]# kubectl get pods -n kube-system  -owide 

kube-dns-58d745cd6d-gdgcp                   0/3       ContainerCreating   3          14d                 dev-master-105   

Fix:
Delete the DNS pod,

or remove the extra nameserver entries from /etc/resolv.conf on the node (only the first three are applied, hence the DNSConfigForming warning).
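
For example (the pod name comes from the listing above; the resolv.conf edit is done directly on the node):

kubectl -n kube-system delete pod kube-dns-58d745cd6d-gdgcp
# and/or edit /etc/resolv.conf on the node so it lists at most three nameservers,
# e.g. drop the 127.0.0.1 entry flagged by the DNSConfigForming warning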


Restarting the etcd service reports: publish error: etcdserver: request timed out, possibly due to connection lost

Detailed log:

Sep 30 12:26:51 master-47-34 etcd[5342]: publish error: etcdserver: request timed out, possibly due to connection lost
Sep 30 12:26:58 master-47-34 etcd[5342]: publish error: etcdserver: request timed out
Sep 30 12:27:05 master-47-34 etcd[5342]: publish error: etcdserver: request timed out
Sep 30 12:27:12 master-47-34 etcd[5342]: publish error: etcdserver: request timed out
Sep 30 12:27:19 master-47-34 etcd[5342]: publish error: etcdserver: request timed out
Sep 30 12:27:26 master-47-34 etcd[5342]: publish error: etcdserver: request timed out
Sep 30 12:27:33 master-47-34 etcd[5342]: publish error: etcdserver: request timed out
Sep 30 12:27:40 master-47-34 etcd[5342]: publish error: etcdserver: request timed out
Sep 30 12:27:47 master-47-34 etcd[5342]: publish error: etcdserver: request timed out
Sep 30 12:27:54 master-47-34 etcd[5342]: publish error: etcdserver: request timed out
Sep 30 12:28:01 master-47-34 etcd[5342]: publish error: etcdserver: request timed out
Sep 30 12:28:08 master-47-34 etcd[5342]: publish error: etcdserver: request timed out

Troubleshooting approach
Check the log with journalctl -u etcd -f for permission problems such as:

cannot open database at /opt/etcd/member/snap/db (open /opt/etcd/member/snap/db: permission denied)

If there is one, fix the ownership:

chown etcd:etcd /opt/etcd/member/snap/db

If it still fails after that, run systemctl status etcd to get the full start command, then execute it by hand to see the actual error:

[root@master-47-34 wal]# systemctl status etcd
● etcd.service - Etcd Server
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Sun 2018-09-30 12:36:06 CST; 11ms ago
  Process: 10892 ExecStart=/usr/bin/etcd --name=etcd-47-34 --cert-file=/etc/kubernetes/ssl/etcd.pem --key-file=/etc/kubernetes/ssl/etcd-key.pem --peer-cert-file=/etc/kubernetes/ssl/etcd.pem --peer-key-file=/etc/kubernetes/ssl/etcd-key.pem --trusted-ca-file=/etc/kubernetes/ssl/ca.pem --peer-trusted-ca-file=/etc/kubernetes/ssl/ca.pem --initial-advertise-peer-urls=https://10.39.47.34:2380 --listen-peer-urls=https://10.39.47.34:2380 --listen-client-urls=https://10.39.47.34:2379,http://127.0.0.1:2379 --advertise-client-urls=https://10.39.47.34:2379 --initial-cluster-token=k8s-etcd-cluster --initial-cluster=etcd-47-34=https://10.39.47.34:2380,etcd-47-35=https://10.39.47.35:2380,etcd-47-36=https://10.39.47.36:2380 --eletion-timeout=3000 --initial-cluster-state=new --data-dir=/opt/etcd/ (code=exited, status=2)
 Main PID: 10892 (code=exited, status=2)

Sep 30 12:36:06 master-47-34 systemd[1]: etcd.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Sep 30 12:36:06 master-47-34 systemd[1]: Failed to start Etcd Server.
Sep 30 12:36:06 master-47-34 systemd[1]: Unit etcd.service entered failed state.
Sep 30 12:36:06 master-47-34 systemd[1]: etcd.service failed.

Run the start command manually. Note that the ExecStart above contains a misspelled flag, --eletion-timeout, which is likely what makes etcd exit with status 2 (INVALIDARGUMENT); the command below omits it:

/usr/bin/etcd --name=etcd-47-34 --cert-file=/etc/kubernetes/ssl/etcd.pem --key-file=/etc/kubernetes/ssl/etcd-key.pem --peer-cert-file=/etc/kubernetes/ssl/etcd.pem --peer-key-file=/etc/kubernetes/ssl/etcd-key.pem --trusted-ca-file=/etc/kubernetes/ssl/ca.pem --peer-trusted-ca-file=/etc/kubernetes/ssl/ca.pem --initial-advertise-peer-urls=https://10.39.47.34:2380 --listen-peer-urls=https://10.39.47.34:2380 --listen-client-urls=https://10.39.47.34:2379,http://127.0.0.1:2379 --advertise-client-urls=https://10.39.47.34:2379 --initial-cluster-token=k8s-etcd-cluster --initial-cluster=etcd-47-34=https://10.39.47.34:2380,etcd-47-35=https://10.39.47.35:2380,etcd-47-36=https://10.39.47.36:2380  --initial-cluster-state=new --data-dir=/opt/etcd/

Run it to see what the actual problem is; once it runs cleanly, restart the etcd service with systemctl restart etcd.
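
Once it starts cleanly, a hedged health check (etcd v3 API syntax; endpoints and certificate paths are taken from the unit file above):

ETCDCTL_API=3 etcdctl --endpoints=https://10.39.47.34:2379 \
  --cacert=/etc/kubernetes/ssl/ca.pem \
  --cert=/etc/kubernetes/ssl/etcd.pem \
  --key=/etc/kubernetes/ssl/etcd-key.pem \
  endpoint health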


Why CI/CD images fail to pull
Symptom ----> log

[2018/09/30 17:14:24][Warning] Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 127.0.0.1 10.38.240.8 10.36.8.40 
[2018/09/30 17:14:26][Warning] Failed to pull image "harbor.enncloud.cn/paas/clone-repo:v2.2": [rpc error: code = Unknown desc = Error: image paas/clone-repo:v2.2 not found, rpc error: code = Unknown desc = Error: image paas/clone-repo:v2.2 not found] 
[2018/09/30 17:14:26][Warning] Error: ErrImagePull 
[2018/09/30 17:14:26][Warning] Error: ImagePullBackOff

Fix: remove the stale config symlink under /var/lib/kubelet (the legacy Docker credential file), leaving config.json in place; kubelet then appears to use the correct registry credentials again.

[root@dev-slave-108 ~]# cd /var/lib/kubelet/
[root@dev-slave-108 kubelet]# ls -l 
total 196
lrwxrwxrwx  1 root root     20 Dec  8  2017 config -> /root/.docker/config
lrwxrwxrwx  1 root root     25 Jan  8  2018 config.json -> /root/.docker/config.json
-rw-r--r--  1 root root     40 Sep 15 08:57 cpu_manager_state
drwxr-xr-x  2 root root   4096 Sep 30 17:20 device-plugins
drwx------  2 root root   4096 Sep 15 08:57 plugin-containers
drwxr-x---  4 root root   4096 Sep 17 11:05 plugins
drwxr-x--- 51 root root 180224 Sep 30 17:20 pods
[root@dev-slave-108 kubelet]# rm -rf config
[root@dev-slave-108 kubelet]# ls -l 
total 196
lrwxrwxrwx  1 root root     25 Jan  8  2018 config.json -> /root/.docker/config.json
-rw-r--r--  1 root root     40 Sep 15 08:57 cpu_manager_state
drwxr-xr-x  2 root root   4096 Sep 30 17:20 device-plugins
drwx------  2 root root   4096 Sep 15 08:57 plugin-containers
drwxr-x---  4 root root   4096 Sep 17 11:05 plugins
drwxr-x--- 51 root root 180224 Sep 30 17:20 pods

Orphaned pods: volume paths are still present on disk
Log

Oct 12 11:27:41 test-slave-114 kubelet[27200]: E1012 11:27:41.992855   27200 kubelet_volumes.go:140] Orphaned pod "32828c49-a053-11e8-a528-5254eec04736" found, but volume paths are still present on disk : There were a total of 3 errors similar to this. Turn up verbosity to see them.
Oct 12 11:27:43 test-slave-114 kubelet[27200]: E1012 11:27:43.988378   27200 kubelet_volumes.go:140] Orphaned pod "32828c49-a053-11e8-a528-5254eec04736" found, but volume paths are still present on disk : There were a total of 3 errors similar to this. Turn up verbosity to see them.
Oct 12 11:27:45 test-slave-114 kubelet[27200]: E1012 11:27:45.992073   27200 kubelet_volumes.go:140] Orphaned pod "32828c49-a053-11e8-a528-5254eec04736" found, but volume paths are still present on disk : There were a total of 3 errors similar to this. Turn up verbosity to see them.
Oct 12 11:27:48 test-slave-114 kubelet[27200]: E1012 11:27:48.119069   27200 kubelet_volumes.go:140] Orphaned pod "32828c49-a053-11e8-a528-5254eec04736" found, but volume paths are still present on disk : There were a total of 3 errors similar to this. Turn up verbosity to see them.
Oct 12 11:27:49 test-slave-114 kubelet[27200]: E1012 11:27:49.994550   27200 kubelet_volumes.go:140] Orphaned pod "32828c49-a053-11e8-a528-5254eec04736" found, but volume paths are still present on disk : There were a total of 3 errors similar to this. Turn up verbosity to see them.

Fix: manually delete the corresponding pod directory on the host; if that is not enough, restart kubelet.
For example

Error

Oct 12 12:54:36 test-slave-114 kubelet[68577]: E1012 12:54:36.359113   68577 kubelet_volumes.go:140] Orphaned pod "80a01854-c1fe-11e8-8f5d-5254eec04736" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.
[root@test-slave-114 pods]# ls -l 
total 32
drwxr-x--- 5 root root 4096 Oct  8 13:42 01bdc09f-cabd-11e8-afbb-5254eec04736
drwxr-x--- 4 root root 4096 Sep 27 10:39 80a01854-c1fe-11e8-8f5d-5254eec04736
drwxr-x--- 5 root root 4096 Oct  9 10:11 89620113-cb68-11e8-afbb-5254eec04736
drwxr-x--- 5 root root 4096 Oct 12 11:12 9cf8d3b8-cdcc-11e8-afbb-5254eec04736
drwxr-x--- 5 root root 4096 Oct  8 16:40 c07b96d6-cad5-11e8-afbb-5254eec04736
drwxr-x--- 5 root root 4096 Oct  9 09:30 d71de50f-cb62-11e8-afbb-5254eec04736
drwxr-x--- 5 root root 4096 Oct 12 11:07 e8fe1a5f-cdcb-11e8-afbb-5254eec04736
drwxr-x--- 5 root root 4096 Oct  8 13:21 f800c26d-cab9-11e8-afbb-5254eec04736
[root@test-slave-114 pods]# rm -rf 80a01854-c1fe-11e8-8f5d-5254eec04736
[root@test-slave-114 pods]# ls -l 
total 28
drwxr-x--- 5 root root 4096 Oct  8 13:42 01bdc09f-cabd-11e8-afbb-5254eec04736
drwxr-x--- 5 root root 4096 Oct  9 10:11 89620113-cb68-11e8-afbb-5254eec04736
drwxr-x--- 5 root root 4096 Oct 12 11:12 9cf8d3b8-cdcc-11e8-afbb-5254eec04736
drwxr-x--- 5 root root 4096 Oct  8 16:40 c07b96d6-cad5-11e8-afbb-5254eec04736
drwxr-x--- 5 root root 4096 Oct  9 09:30 d71de50f-cb62-11e8-afbb-5254eec04736
drwxr-x--- 5 root root 4096 Oct 12 11:07 e8fe1a5f-cdcb-11e8-afbb-5254eec04736
drwxr-x--- 5 root root 4096 Oct  8 13:21 f800c26d-cab9-11e8-afbb-5254eec04736

Checking the kubelet log again with journalctl -u kubelet -f, the error no longer appears.


Problem: kubelet - exit status 1, rootInodeErr: cmd [find /var/lib/docker/overlay/fda10ef69b731eb6aa0dae3b0cb438a5d114ac1195ec45e6472ba26cd7f169b9 -xdev -printf .] failed. stderr: find

Oct 23 16:46:43 proxy-20-85 kubelet: - exit status 1, rootInodeErr: cmd [find /var/lib/docker/overlay/fda10ef69b731eb6aa0dae3b0cb438a5d114ac1195ec45e6472ba26cd7f169b9 -xdev -printf .] failed. stderr: find: ‘/var/lib/docker/overlay/fda10ef69b731eb6aa0dae3b0cb438a5d114ac1195ec45e6472ba26cd7f169b9’: No such file or directory
Oct 23 16:46:43 proxy-20-85 kubelet: ; err: exit status 1, extraDiskErr: du command failed on /var/lib/docker/containers/a073fea95af9ba1227bc2e91437fbc02f34a07dd3963042cd9d37a0f6086b365 with output stdout: , stderr: du: cannot access ‘/var/lib/docker/containers/a073fea95af9ba1227bc2e91437fbc02f34a07dd3963042cd9d37a0f6086b365’: No such file or directory
Oct 23 16:46:43 proxy-20-85 kubelet: - exit status 1
Oct 23 16:46:43 proxy-20-85 kubelet: - exit status 1, rootInodeErr: cmd [find /var/lib/docker/overlay/43887091ad63cbc9d83048b602d2968200e91dbe5465fe194470b3ecf1a8ee98 -xdev -printf .] failed. stderr: find: ‘/var/lib/docker/overlay/43887091ad63cbc9d83048b602d2968200e91dbe5465fe194470b3ecf1a8ee98’: No such file or directory
Oct 23 16:46:43 proxy-20-85 kubelet: ; err: exit status 1, extraDiskErr: du command failed on /var/lib/docker/containers/b7bc9aecb053743d5af528fd16ae814157c084c5be3f023f6ffd08022fa50969 with output stdout: , stderr: du: cannot access ‘/var/lib/docker/containers/b7bc9aecb053743d5af528fd16ae814157c084c5be3f023f6ffd08022fa50969’: No such file or directory
Oct 23 16:46:43 proxy-20-85 kubelet: - exit status 1
Oct 23 16:59:28 proxy-20-85 kubelet: ERROR:1023 16:59:28.852195   20189 kubelet_network.go:378] Failed to ensure marking rule for KUBE-MARK-DROP: error checking rule: exit status 4: iptables: Resource temporarily unavailable.
Oct 23 17:29:47 proxy-20-85 kubelet: ERROR:1023 17:29:47.538377   20189 kubelet_network.go:412] Failed to ensure marking rule for KUBE-MARK-MASQ: error checking rule: exit status 4: iptables: Resource temporarily unavailable.
Oct 23 18:24:40 proxy-20-85 kubelet: ERROR:1023 18:24:40.045560   20189 kubelet_network.go:412] Failed to ensure marking rule for KUBE-MARK-MASQ: error checking rule: exit status 4: iptables: Resource temporarily unavailable.

Related issues


Kubelet panic: invalid memory address or nil pointer dereference (flexvolume)
Log

Oct 24 16:05:15 node-23-40 kubelet[14309]: /usr/local/go/src/runtime/asm_amd64.s:2361
Oct 24 16:05:15 node-23-40 kubelet[14309]: E1024 16:05:15.438773   14309 runtime.go:66] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
Oct 24 16:05:15 node-23-40 kubelet[14309]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
Oct 24 16:05:15 node-23-40 kubelet[14309]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
Oct 24 16:05:15 node-23-40 kubelet[14309]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
Oct 24 16:05:15 node-23-40 kubelet[14309]: /usr/local/go/src/runtime/asm_amd64.s:573
Oct 24 16:05:15 node-23-40 kubelet[14309]: /usr/local/go/src/runtime/panic.go:502
Oct 24 16:05:15 node-23-40 kubelet[14309]: /usr/local/go/src/runtime/panic.go:63
Oct 24 16:05:15 node-23-40 kubelet[14309]: /usr/local/go/src/runtime/signal_unix.go:388
Oct 24 16:05:15 node-23-40 kubelet[14309]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/volume/flexvolume/plugin.go:85
Oct 24 16:05:15 node-23-40 kubelet[14309]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/volume/flexvolume/probe.go:130
Oct 24 16:05:15 node-23-40 kubelet[14309]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/volume/flexvolume/probe.go:88
Oct 24 16:05:15 node-23-40 kubelet[14309]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/volume/flexvolume/probe.go:88
Oct 24 16:05:15 node-23-40 kubelet[14309]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/volume/flexvolume/probe.go:77
Oct 24 16:05:15 node-23-40 kubelet[14309]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/volume/plugins.go:593
Oct 24 16:05:15 node-23-40 kubelet[14309]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/volume/plugins.go:540
Oct 24 16:05:15 node-23-40 kubelet[14309]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/volumemanager/cache/desired_state_of_world.go:192

This is a flex-volume problem; deleting the following directory fixes it:

/usr/libexec/kubernetes/kubelet-plugins/volume/exec/qingcloud~flex-volume/
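
For example, on the affected node (restart kubelet afterwards so it re-probes the volume plugins):

rm -rf /usr/libexec/kubernetes/kubelet-plugins/volume/exec/qingcloud~flex-volume/
systemctl restart kubelet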

nfs mount failed
Related log

[root@node-23-41 ~]# mount -t nfs 10.39.35.39:/mnt/cid_4c5ba11442cf_tanyanliao_nfs8 /var/lib/kubelet/pods/0d522e1b-d8ce-11e8-b14e-5254416bb222/volumes/kubernetes.io~nfs/tanyanliao-nfs8
mount: wrong fs type, bad option, bad superblock on 10.39.35.39:/mnt/cid_4c5ba11442cf_tanyanliao_nfs8,
       missing codepage or helper program, or other error
       (for several filesystems (e.g. nfs, cifs) you might
       need a /sbin/mount.<type> helper program)

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

Fix for the NFS mount error "wrong fs type, bad option, bad superblock": install the NFS client utilities.

apt-get install nfs-common

or

yum install nfs-utils

Problem: containers fail to start because of volume mount errors

Related log

failed to open log file "/var/log/pods/7e4623a0-dda3-11e8-885a-5254c2cdf2fd/carrier-test-mask_0.log": open /var/log/pods/7e4623a0-dda3-11e8-885a-5254c2cdf2fd/carrier-test-mask_0.log: no such file or directory
Nov 01 15:05:32 slave-20-50 kubelet[5662]: ERROR:1101 15:05:32.221060    5662 rbd.go:415] rbd: failed to setup mount /var/lib/kubelet/pods/4901cdc2-d27e-11e8-885a-5254c2cdf2fd/volumes/kubernetes.io~rbd/ceres-study.datadir-redis01-2 rbd: image ceres-study.CID-516874818ed4.datadir-redis01-2 is locked by other nodes.   carrier-test-mask-3950755848-wgt4h   0/1       rpc error: code = 2 desc = failed to start container "9a78c1dceaa3ff06f13f94bdd9e4de618605a7e9cf4d8e150ca1d51fd6eaf658": Error response from daemon: {"message":"oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:359: container init caused \\\"rootfs_linux.go:54: mounting \\\\\\\"/var/lib/kubelet/pods/c8e98e2b-dda4-11e8-885a-5254c2cdf2fd/volumes/kubernetes.io~configmap/configmap-volume-1/default.conf\\\\\\\" to rootfs \\\\\\\"/var/lib/docker/devicemapper/mnt/dc291cf0d7814a457fd5df2d4aeff54274a84d3265c706727d350f74691607da/rootfs\\\\\\\" at \\\\\\\"/etc/nginx/conf.d/default.conf/default.conf\\\\\\\" caused \\\\\\\"lstat /var/lib/docker/devicemapper/mnt/dc291cf0d7814a457fd5df2d4aeff54274a84d3265c706727d350f74691607da/rootfs/etc/nginx/conf.d/default.conf/default.conf: not a directory\\\\\\\"\\\"\"\n: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type"

The ConfigMap volume was mounted onto a path that is a file, not a directory, inside the image:

/rootfs/etc/nginx/conf.d/default.conf/default.conf: not a directory
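
The usual fix is to mount the single ConfigMap key with subPath so that mountPath is the file itself rather than a directory. A minimal sketch; the pod, image and ConfigMap name are illustrative, not taken from the failing workload:

apiVersion: v1
kind: Pod
metadata:
  name: nginx-subpath-example
spec:
  containers:
  - name: nginx
    image: nginx
    volumeMounts:
    - name: conf
      mountPath: /etc/nginx/conf.d/default.conf   # mountPath is the file, not a directory
      subPath: default.conf                        # pick the single key out of the ConfigMap
  volumes:
  - name: conf
    configMap:
      name: nginx-conf   # hypothetical ConfigMap that must contain the key default.conf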
