Troubleshooting restarts of a production project on K8S

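  1. Check whether the restart was caused by the process's own (JVM) memory limits
  2. Check whether the restart was caused by exceeding the K8S resource limits
  3. Check whether the restart was caused by exhausting the host's resources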
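
The three checks can be read as a small decision procedure. The sketch below is only an illustration: `RestartTriage` and `RestartCause` are hypothetical names, and the mapping assumes that K8S reports `OOMKilled` for a container that was killed for memory and `Evicted` for a pod removed under node pressure. A kernel OOM kill on a starved node can also surface as `OOMKilled`, so the result is a first guess rather than a verdict.

```java
// Hypothetical triage helper; the names and the mapping are illustrative assumptions.
enum RestartCause { JVM_LIMIT, CONTAINER_LIMIT, NODE_PRESSURE, UNKNOWN }

final class RestartTriage {
    private RestartTriage() {}

    /**
     * @param reason       the "Reason" shown by kubectl describe pod, e.g. "OOMKilled" or "Evicted"
     * @param jvmOomInLogs whether the application log contains java.lang.OutOfMemoryError
     */
    static RestartCause classify(String reason, boolean jvmOomInLogs) {
        if ("OOMKilled".equals(reason)) {
            return RestartCause.CONTAINER_LIMIT;   // check 2: container memory limit
        }
        if ("Evicted".equals(reason)) {
            return RestartCause.NODE_PRESSURE;     // check 3: host resources
        }
        if (jvmOomInLogs) {
            return RestartCause.JVM_LIMIT;         // check 1: JVM-level limit
        }
        return RestartCause.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify("OOMKilled", false));
    }
}
```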

Simulating an out-of-memory environment

Create a small demo


    // Java heap memory: allocate the requested number of bytes on the heap
    @GetMapping("test/bytes/{bytes}")
    public String testForAllocBytes(@PathVariable Integer bytes) {
        byte[] alloc = new byte[bytes];
        String msg = Strings.lenientFormat("alloc %s byte success", bytes);
        log.info(msg);
        return msg;
    }

    // Java direct (off-heap) memory: allocate the requested number of bytes outside the heap
    @GetMapping("test/bytes/direct/{bytes}")
    public String testForAllocDirectBytes(@PathVariable Integer bytes) {
        UnsafeUtils.getUnsafe().allocateMemory(bytes.longValue());
        String msg = Strings.lenientFormat("alloc %s direct byte success", bytes);
        log.info(msg);
        return msg;
    }
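
Note that the `Direct buffer memory` stack trace recorded later in this post goes through `ByteBuffer.allocateDirect`, not `Unsafe.allocateMemory`. Only the NIO path is counted against `-XX:MaxDirectMemorySize`; raw `Unsafe` allocations bypass that accounting. A minimal sketch of the NIO variant, written as a plain static method so it runs without Spring (the class name `DirectAllocSketch` is hypothetical), could look like this:

```java
// Hypothetical stand-alone variant of the direct-memory endpoint.
final class DirectAllocSketch {
    private DirectAllocSketch() {}

    // Allocates off-heap memory through NIO; this path is subject to -XX:MaxDirectMemorySize
    // and throws "OutOfMemoryError: Direct buffer memory" when the cap is exceeded.
    static String allocDirect(int bytes) {
        java.nio.ByteBuffer buffer = java.nio.ByteBuffer.allocateDirect(bytes);
        return String.format("alloc %d direct byte success", buffer.capacity());
    }

    public static void main(String[] args) {
        System.out.println(allocDirect(1024));
    }
}
```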

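Write a Dockerfile that caps the JVM's memory: 128 MB of heap, 128 MB of metaspace, and 128 MB of direct memory. In theory these caps add up to 384 MB (thread stacks, the code cache, and other native allocations come on top of that).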

FROM openjdk:8-jdk-alpine

ENV JAVA_OPTS="\
-Xmx128M \
-Xms128M \
-XX:+PrintGC \
-XX:+PrintGCDetails \
-XX:+PrintGCDateStamps  \
-XX:MetaspaceSize=128M \
-XX:MaxMetaspaceSize=128M \
-XX:MaxDirectMemorySize=128M \
-XX:+HeapDumpOnOutOfMemoryError \
-XX:HeapDumpPath=/data/logs/heapdump.%p.hprof"

RUN mkdir -p /data/logs

VOLUME /data/logs

EXPOSE 8080

COPY ./target/echo-demo-0.0.1-SNAPSHOT.jar /opt/echo-demo-0.0.1-SNAPSHOT.jar
ENTRYPOINT ["/bin/sh", "-c", "java -jar ${JAVA_OPTS} /opt/echo-demo-0.0.1-SNAPSHOT.jar"]
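
To make the arithmetic behind these flags explicit, the following sketch sums the three caps and compares the total with the two container limits used later in this post (300Mi and 150Mi). `MemoryBudget` is a hypothetical helper, and it deliberately ignores thread stacks, the code cache, and other native memory.

```java
// Hypothetical budget calculator; all values are in MiB.
final class MemoryBudget {
    private MemoryBudget() {}

    // Sums the configured JVM caps.
    static long sumMiB(long... capsMiB) {
        long total = 0;
        for (long cap : capsMiB) {
            total += cap;
        }
        return total;
    }

    // True when the summed caps fit inside the container memory limit.
    static boolean fitsWithin(long limitMiB, long... capsMiB) {
        return sumMiB(capsMiB) <= limitMiB;
    }

    public static void main(String[] args) {
        long heap = 128, metaspace = 128, direct = 128; // taken from JAVA_OPTS above
        System.out.println("sum of caps: " + sumMiB(heap, metaspace, direct) + " MiB");
        System.out.println("fits in 300Mi: " + fitsWithin(300, heap, metaspace, direct));
        System.out.println("fits in 150Mi: " + fitsWithin(150, heap, metaspace, direct));
    }
}
```

Neither limit covers the full 384 MiB. The caps are upper bounds rather than memory actually in use, which is why a pod can still run under a limit smaller than their sum until real usage crosses that limit.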

Next, define two K8S memory limits for the container: 300 MB and 150 MB

Memory limit of 300 MB


apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo-demo
  template:
    metadata:
      labels:
#        configure the pod label
        app: echo-demo
    spec:
      containers:
        - name: echo-demo
          image: registry.cn-hangzhou.aliyuncs.com/zxw_img/echo-demo:0.1
          resources:
            limits:
              cpu: 200m
              memory: 300Mi
              # switch to the line below for the 150Mi test
              #memory: 150Mi
            requests:
              cpu: 100m
              memory: 150Mi
          imagePullPolicy: Always
          ports:
            - containerPort: 8080

Request 150 MB of heap memory

root@ubuntu:~/data/echo-demo# curl http://echo-demo.com:31638/test/bytes/150000000

2022-10-16 06:04:39.774 ERROR 1 --- [nio-8080-exec-2] o.a.c.c.C.[.[.[/].[dispatcherServlet]    : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Handler dispatch failed; nested exception is java.lang.OutOfMemoryError: Java heap space] with root cause

java.lang.OutOfMemoryError: Java heap space
        at com.example.echodemo.EchoController.testForAllocBytes(EchoController.java:27) ~[classes!/:0.0.1-SNAPSHOT]

Request 150 MB of direct memory

root@ubuntu:~/data/echo-demo# curl http://echo-demo.com:31638/test/bytes/direct/150000000

2022-10-16 06:02:15.795 ERROR 1 --- [nio-8080-exec-1] o.a.c.c.C.[.[.[/].[dispatcherServlet]    : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Handler dispatch failed; nested exception is java.lang.OutOfMemoryError: Direct buffer memory] with root cause

java.lang.OutOfMemoryError: Direct buffer memory
        at java.nio.Bits.reserveMemory(Bits.java:666) ~[na:1.8.0_212]
        at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.8.0_212]
        at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[na:1.8.0_212]
        at com.example.echodemo.EchoController.testForAllocDirectBytes(EchoController.java:35) ~[classes!/:0.0.1-SNAPSHOT]

Result: the pod keeps running stably

root@ubuntu:~/data/echo-demo# kubectl get pod -A
NAMESPACE              NAME                                         READY   STATUS      RESTARTS   AGE
default                echo-demo-7d56689947-h69qz                   1/1     Running     0          3m50s

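This shows that when the container has enough memory, an OutOfMemoryError inside the JVM does not restart the pod: the error is thrown to the request thread and logged, and the process keeps running.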
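
The reason is that `OutOfMemoryError` is an ordinary `Throwable` raised in the thread that asked for the memory. A minimal sketch, assuming a HotSpot JVM (where an array of `Integer.MAX_VALUE` elements is rejected regardless of heap size), shows the thread continuing after the failed allocation. `OomSurvivalSketch` is a hypothetical name.

```java
// Hypothetical demo: the JVM keeps running after a failed allocation.
final class OomSurvivalSketch {
    private OomSurvivalSketch() {}

    // Returns true if the allocation failed with OutOfMemoryError and execution continued.
    static boolean survivesOom() {
        try {
            long[] huge = new long[Integer.MAX_VALUE]; // roughly 16 GiB, expected to fail
            return huge.length < 0;                    // only reached if the allocation succeeded
        } catch (OutOfMemoryError e) {
            return true;                               // the thread, and the process, are still alive
        }
    }

    public static void main(String[] args) {
        System.out.println("still running after OOM: " + survivesOom());
    }
}
```

In the demo application the servlet container plays the role of the `catch` block: it logs the error for that request and keeps serving.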

Memory limit of 150 MB


apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo-demo
  template:
    metadata:
      labels:
#        configure the pod label
        app: echo-demo
    spec:
      containers:
        - name: echo-demo
          image: registry.cn-hangzhou.aliyuncs.com/zxw_img/echo-demo:0.1
          resources:
            limits:
              cpu: 200m
              memory: 150Mi
            requests:
              cpu: 100m
              memory: 150Mi
          imagePullPolicy: Always
          ports:
            - containerPort: 8080

Request 30 MB of direct memory: the pod restarts

root@ubuntu:~/data/echo-demo# curl http://echo-demo.com:31638/test/bytes/direct/30000000

The container process exits with code 137
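
Exit codes above 128 conventionally mean "terminated by signal (code - 128)", so 137 corresponds to signal 9, SIGKILL, which is what the kernel's OOM killer sends. The decoding can be sketched as follows (`ExitCodes` is a hypothetical helper):

```java
// Hypothetical decoder for the 128 + signal exit-code convention.
final class ExitCodes {
    private ExitCodes() {}

    // Returns the signal number encoded in the exit code, or -1 if the code is a plain exit status.
    static int signalOf(int exitCode) {
        return (exitCode > 128 && exitCode < 256) ? exitCode - 128 : -1;
    }

    static String describe(int exitCode) {
        int signal = signalOf(exitCode);
        if (signal == 9) {
            return "killed by SIGKILL (signal 9)";
        }
        if (signal == 15) {
            return "terminated by SIGTERM (signal 15)";
        }
        if (signal > 0) {
            return "terminated by signal " + signal;
        }
        return "exited with status " + exitCode;
    }

    public static void main(String[] args) {
        System.out.println(describe(137));
    }
}
```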

The RESTARTS column shows that the pod has restarted once

root@ubuntu:~/data/echo-demo# kubectl get pod -A
NAMESPACE              NAME                                         READY   STATUS      RESTARTS   AGE
default                echo-demo-848db8b848-56tj6                   1/1     Running     1          80s
ingress-nginx          ingress-nginx-admission-create-v6kkb         0/1     Completed   0          19h
ingress-nginx          ingress-nginx-admission-patch-xcrt9          0/1     Completed   0          19h
ingress-nginx          ingress-nginx-controller-555bff4cb7-zkssr    1/1     Running     1          19h
kube-system            coredns-54d67798b7-l884n                     1/1     Running     2          25h
kube-system            etcd-minikube                                1/1     Running     2          25h
kube-system            kube-apiserver-minikube                      1/1     Running     2          25h
kube-system            kube-controller-manager-minikube             1/1     Running     2          25h
kube-system            kube-proxy-5ddrf                             1/1     Running     2          25h
kube-system            kube-scheduler-minikube                      1/1     Running     2          25h
kube-system            storage-provisioner                          1/1     Running     6          25h
kubernetes-dashboard   dashboard-metrics-scraper-5896898794-cjxm4   1/1     Running     2          25h
kubernetes-dashboard   kubernetes-dashboard-7d795556f4-5xz48        1/1     Running     3          25h

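The Last State, however, is Terminated with reason OOMKilled, which shows that the system forcibly kills and restarts the container when it exceeds its memory limit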

root@ubuntu:~/data/echo-demo# kubectl describe pod echo-demo-848db8b848-56tj6
Name:         echo-demo-848db8b848-56tj6
Namespace:    default
Priority:     0
Node:         minikube/192.168.49.2
Start Time:   Sat, 15 Oct 2022 23:09:39 -0700
Labels:       app=echo-demo
              pod-template-hash=848db8b848
Annotations:  kubectl.kubernetes.io/restartedAt: 2022-10-15T23:01:35-07:00
Status:       Running
IP:           172.17.0.6
IPs:
  IP:           172.17.0.6
Controlled By:  ReplicaSet/echo-demo-848db8b848
Containers:
  echo-demo:
    Container ID:   docker://55398fa11f456582e2cde7c79e5ff8f6134ac4910e1ef354c2a5a65fd5328854
    Image:          registry.cn-hangzhou.aliyuncs.com/zxw_img/echo-demo:0.1
    Image ID:       docker-pullable://registry.cn-hangzhou.aliyuncs.com/zxw_img/echo-demo@sha256:eb3736e24aa7f9323c73b4ecb75483455d6fc8feccbdb7c56f92508eb68b1a17
    Port:           8080/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Sat, 15 Oct 2022 23:10:41 -0700
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Sat, 15 Oct 2022 23:09:40 -0700
      Finished:     Sat, 15 Oct 2022 23:10:40 -0700
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     200m
      memory:  150Mi
    Requests:
      cpu:        100m
      memory:     150Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xjfjx (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-xjfjx:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-xjfjx
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age                From               Message
  ----    ------     ----               ----               -------
  Normal  Scheduled  100s               default-scheduler  Successfully assigned default/echo-demo-848db8b848-56tj6 to minikube
  Normal  Pulled     99s                kubelet            Successfully pulled image "registry.cn-hangzhou.aliyuncs.com/zxw_img/echo-demo:0.1" in 305.422048ms
  Normal  Pulling    38s (x2 over 99s)  kubelet            Pulling image "registry.cn-hangzhou.aliyuncs.com/zxw_img/echo-demo:0.1"
  Normal  Created    38s (x2 over 99s)  kubelet            Created container echo-demo
  Normal  Started    38s (x2 over 99s)  kubelet            Started container echo-demo
  Normal  Pulled     38s                kubelet            Successfully pulled image "registry.cn-hangzhou.aliyuncs.com/zxw_img/echo-demo:0.1" in 220.503441ms
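
When many pods need checking, the `Reason` and `Exit Code` under `Last State` can be pulled out of the describe text programmatically. The sketch below is a deliberately naive line scanner, not a real parser: `LastStateParser` is a hypothetical helper, and it simply returns the first matching key that appears after the section header.

```java
// Hypothetical scanner for "kubectl describe pod" text output.
final class LastStateParser {
    private LastStateParser() {}

    // Returns the value of the first "key:" line found after the "section:" line, or null if absent.
    static String field(String describeOutput, String section, String key) {
        boolean inSection = false;
        for (String line : describeOutput.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.startsWith(section + ":")) {
                inSection = true;
                continue;
            }
            if (inSection && trimmed.startsWith(key + ":")) {
                return trimmed.substring(key.length() + 1).trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String sample = "    Last State:     Terminated\n"
                + "      Reason:       OOMKilled\n"
                + "      Exit Code:    137\n";
        System.out.println(field(sample, "Last State", "Reason"));
        System.out.println(field(sample, "Last State", "Exit Code"));
    }
}
```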

Conclusion

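The restart was caused by exceeding the K8S resource limit.
1. Size the project's memory settings properly, so that the JVM caps (heap, metaspace, direct memory) plus other native overhead stay within the container's memory limit.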
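
One way to catch a mismatch early is to log the JVM's heap cap next to the container's memory limit at startup. The sketch below assumes the limit is exposed at the usual cgroup paths (`/sys/fs/cgroup/memory.max` for cgroup v2, `/sys/fs/cgroup/memory/memory.limit_in_bytes` for v1). `ContainerLimitCheck` is a hypothetical helper, and it compares only the heap cap, not metaspace or direct memory.

```java
// Hypothetical startup check comparing the heap cap with the cgroup memory limit.
final class ContainerLimitCheck {
    private ContainerLimitCheck() {}

    // Parses the content of a cgroup limit file; returns -1 for "max" (unlimited) or unparsable input.
    static long parseLimit(String content) {
        String s = content.trim();
        if (s.isEmpty() || s.equals("max")) {
            return -1L;
        }
        try {
            return Long.parseLong(s);
        } catch (NumberFormatException e) {
            return -1L;
        }
    }

    // Tries the cgroup v2 path first, then v1; returns -1 if no limit can be read.
    static long readLimitBytes() {
        String[] candidates = {
            "/sys/fs/cgroup/memory.max",
            "/sys/fs/cgroup/memory/memory.limit_in_bytes"
        };
        for (String path : candidates) {
            try {
                byte[] raw = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(path));
                long limit = parseLimit(new String(raw, java.nio.charset.StandardCharsets.UTF_8));
                if (limit > 0) {
                    return limit;
                }
            } catch (java.io.IOException | SecurityException e) {
                // file missing or unreadable: try the next candidate
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        long heapMax = Runtime.getRuntime().maxMemory();
        long limit = readLimitBytes();
        System.out.println("heap max bytes: " + heapMax);
        System.out.println("container limit bytes: " + (limit < 0 ? "unknown" : String.valueOf(limit)));
        if (limit > 0 && heapMax > limit) {
            System.out.println("warning: heap cap alone exceeds the container limit");
        }
    }
}
```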
