Diagnosing a PyTorch Shared Memory Problem on K8S

Background

Running PyTorch on K8S produced the following error:

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).

Diagnosis

The PyTorch README has this to say:

Please note that PyTorch uses shared memory to share data between processes, so if torch multiprocessing is used (e.g. for multithreaded data loaders) the default shared memory segment size that container runs with is not enough, and you should increase shared memory size either with --ipc=host or --shm-size command line options to nvidia-docker run.

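In other words, PyTorch's inter-process communication (IPC) relies on shared memory, so the shared memory segment has to be large enough.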

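Docker's default shared memory size is 64 MB, and it can be changed with docker run --shm-size. But what about K8S? According to the API documentation, K8S offers no direct way to set it, so a workaround is needed.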

Reference: issue
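To confirm that a container really is running with the 64 MB default mentioned above, a small Python check can be run inside it. This is a minimal sketch and not part of the original investigation: the helper name shm_usage_mib is hypothetical, and it relies only on the standard-library shutil.disk_usage, assuming shared memory is mounted at /dev/shm as it normally is on Linux.

```python
import shutil


def shm_usage_mib(path="/dev/shm"):
    """Return (total, used, free) for the filesystem at `path`, in MiB.

    `path` defaults to /dev/shm, where Linux containers normally mount
    their shared memory tmpfs.
    """
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return usage.total // mib, usage.used // mib, usage.free // mib


if __name__ == "__main__":
    total, used, free = shm_usage_mib()
    print(f"/dev/shm: total={total} MiB, used={used} MiB, free={free} MiB")
```

A total of about 64 MiB suggests the container is still on the Docker default. Running the same check after applying a fix shows whether the limit actually changed.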

The final solution:

      volumes:
      # Memory-backed (tmpfs) emptyDir volume
      - name: dshm
        emptyDir:
          medium: Memory
      containers:
        # Mount the volume over the container's /dev/shm
        - volumeMounts:
          - mountPath: /dev/shm
            name: dshm

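It turns out that emptyDir can also be backed by memory. Mounting such a volume at the container's /dev/shm directory effectively enlarges the container's shared memory. A creative trick, and a lesson learned.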
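The fragment above shows only the relevant fields. As an illustration of where they sit in a complete Pod object, here is a minimal Python sketch that builds the manifest as a dictionary and prints it as JSON, which kubectl also accepts. The function name build_pod_manifest, the pod name, and the image are placeholders rather than values from the original setup. The optional sizeLimit field on emptyDir is an addition to the original fix, included on the assumption that capping the volume is desirable.

```python
import json


def build_pod_manifest(name, image, shm_size_limit=None):
    """Build a Pod manifest whose /dev/shm is a memory-backed emptyDir.

    `shm_size_limit` is an optional Kubernetes quantity string such as
    "2Gi"; when given, it is set as the emptyDir `sizeLimit`.
    """
    empty_dir = {"medium": "Memory"}
    if shm_size_limit is not None:
        empty_dir["sizeLimit"] = shm_size_limit

    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Memory-backed volume, same name as in the fragment above
            "volumes": [{"name": "dshm", "emptyDir": empty_dir}],
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Mount the volume over the container's /dev/shm
                    "volumeMounts": [
                        {"name": "dshm", "mountPath": "/dev/shm"}
                    ],
                }
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder pod name and image, for illustration only
    manifest = build_pod_manifest("pytorch-train", "pytorch/pytorch", "2Gi")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl apply -f. Keep in mind that files written to a memory-backed emptyDir count against the memory limit of the container that writes them.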
