Troubleshooting--Kubernetes--k8s-argo-The node was low on resource: nodefs and Pod The node was low on resource: [Di

Situation

An Argo workflow run failed with the following errors:

STEP                    PODNAME                        DURATION  MESSAGE
 ● impute2-name-zbh8b                                            
 └-● module-step                                                 
   ├-✖ module-step(0)   impute2-name-zbh8b-180694592   2h        The node was low on resource: nodefs.
   ├-✖ module-step(1)   impute2-name-zbh8b-918762733   1s        Pod The node was low on resource: [DiskPressure].
   ├-✖ module-step(2)   impute2-name-zbh8b-247510878   10h       The node was low on resource: imagefs.
   ├-✖ module-step(3)   impute2-name-zbh8b-2864569179  1s        Pod The node was low on resource: [DiskPressure].
   ├-✖ module-step(4)   impute2-name-zbh8b-2999922964  1s        Pod The node was low on resource: [DiskPressure].
   ├-✖ module-step(5)   impute2-name-zbh8b-2395781585  1s        Pod The node was low on resource: [DiskPressure].
   ├-✖ module-step(6)   impute2-name-zbh8b-2529855442  2s        Pod The node was low on resource: [DiskPressure].
   ├-✖ module-step(7)   impute2-name-zbh8b-851946447   1s        Pod The node was low on resource: [DiskPressure].
   ├-✖ module-step(8)   impute2-name-zbh8b-2329509752  1s        Pod The node was low on resource: [DiskPressure].
   ├-✖ module-step(9)   impute2-name-zbh8b-383158853   2s        Pod The node was low on resource: [DiskPressure].
   ├-✖ module-step(10)  impute2-name-zbh8b-2500762977  1s        Pod The node was low on resource: [DiskPressure].
   ├-✖ module-step(11)  impute2-name-zbh8b-2031136740  2h        The node was low on resource: imagefs.
   └-◷ module-step(12)  impute2-name-zbh8b-4178127519  3m        Unschedulable: 0/35 nodes are available: 1 node(s) had disk pressure, 34 Insufficient cpu, 5 Insufficient memory.

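Cause

Grafana monitoring showed that disk pressure on the node where the pods were running had reached 100%.

Memory and other resources on the node were also fully used, so the pods were evicted and exited abnormally. Later retries failed within seconds because the node was still reporting DiskPressure.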


Solution

There are two angles to approach this from.

First, increase the resource requests and limits:

      container:
        image: 123.amazonaws.com.cn/module/impute2-module:beta
        imagePullPolicy: Always
        resources:
          limits:
            cpu: 30
            memory: 200Gi
          requests:
            cpu: 30
            memory: 200Gi
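
The errors above are all disk-related (nodefs, imagefs, DiskPressure), while the snippet only sets cpu and memory. A hedged sketch of the same container spec with Kubernetes' `ephemeral-storage` resource added is shown below. The 100Gi figure is an assumption for illustration and not a value from the original setup; it should be sized to the scratch space the task actually writes.

```yaml
      container:
        image: 123.amazonaws.com.cn/module/impute2-module:beta
        imagePullPolicy: Always
        resources:
          limits:
            cpu: 30
            memory: 200Gi
            # Assumed value: the pod is evicted if its local disk usage exceeds this
            ephemeral-storage: 100Gi
          requests:
            cpu: 30
            memory: 200Gi
            # Assumed value: the scheduler only places the pod on a node
            # with at least this much allocatable local storage
            ephemeral-storage: 100Gi
```

With a request in place, the scheduler accounts for local disk when placing the pod, instead of discovering the shortage only when the kubelet starts evicting.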

Second, split the task so that a single run does not process too much data at once.
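
One way to split the work in Argo is to fan a step out over a list of chunks with `withItems` and cap concurrency with `parallelism`. The following is a minimal sketch under stated assumptions: the chunk names (`chr1` to `chr3`), the template names, and the way the chunk is passed to the container through `args` are all hypothetical, since the original post does not show how the impute2 module receives its input.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: impute2-chunked-   # hypothetical name
spec:
  entrypoint: main
  parallelism: 2                   # assumed cap on pods running at the same time
  templates:
    - name: main
      steps:
        - - name: module-step
            template: impute2-chunk
            arguments:
              parameters:
                - name: chunk
                  value: "{{item}}"
            # Hypothetical chunk list: one pod is created per item
            withItems:
              - chr1
              - chr2
              - chr3
    - name: impute2-chunk
      inputs:
        parameters:
          - name: chunk
      container:
        image: 123.amazonaws.com.cn/module/impute2-module:beta
        imagePullPolicy: Always
        # Assumption: the module accepts the chunk name as its argument
        args: ["{{inputs.parameters.chunk}}"]
```

Each pod then handles a smaller slice of data, so its disk and memory footprint on any one node stays lower, and `parallelism` keeps too many chunks from landing on the cluster at once.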
