pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found

Running inside Docker fails with the following error:

Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pynvml/nvml.py", line 782, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/opt/conda/envs/rapids/lib/python3.8/ctypes/__init__.py", line 386, in __getattr__
    func = self.__getitem__(name)
  File "/opt/conda/envs/rapids/lib/python3.8/ctypes/__init__.py", line 391, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/dask_cuda/initialize.py", line 32, in _create_cuda_context
    distributed.comm.ucx.init_once()
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/comm/ucx.py", line 86, in init_once
    pre_existing_cuda_context = has_cuda_context()
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/diagnostics/nvml.py", line 91, in has_cuda_context
    running_processes = pynvml.nvmlDeviceGetComputeRunningProcesses_v2(handle)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pynvml/nvml.py", line 2191, in nvmlDeviceGetComputeRunningProcesses_v2
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pynvml/nvml.py", line 785, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found
2022-05-16 15:19:14,517 - distributed.preloading - INFO - Run preload setup click command: dask_cuda.initialize
2022-05-16 15:19:14,517 - distributed.worker - INFO -       Start worker at:    ws://10.233.68.22:39537/
2022-05-16 15:19:14,517 - distributed.worker - INFO -          Listening to:    ws://10.233.68.22:39537/
2022-05-16 15:19:14,517 - distributed.worker - INFO -          dashboard at:         10.233.68.22:35313
2022-05-16 15:19:14,517 - distributed.worker - INFO - Waiting to connect to: ws://launcher-svc-1245231:8786/
2022-05-16 15:19:14,517 - distributed.worker - INFO - -------------------------------------------------
2022-05-16 15:19:14,517 - distributed.worker - INFO -               Threads:                          1
2022-05-16 15:19:14,517 - distributed.worker - INFO -                Memory:                 400.00 GiB
2022-05-16 15:19:14,517 - distributed.worker - INFO -       Local Directory: /rapids/notebooks/dask-worker-space/worker-ave_m7tw
2022-05-16 15:19:14,517 - distributed.worker - INFO - Starting Worker plugin PreImport-0b003d61-7c5f-4530-bf6f-c95b93c83338
2022-05-16 15:19:14,517 - distributed.worker - INFO - Starting Worker plugin CPUAffinity-a1d437c7-bb5d-408e-a3e0-3120dd6c6a5f
2022-05-16 15:19:14,518 - distributed.worker - INFO - Starting Worker plugin RMMSetup-03e12d8b-4b23-4e0e-9b3c-a79b6b12e7ab
2022-05-16 15:19:14,974 - distributed.worker - INFO - -------------------------------------------------
2022-05-16 15:19:15,025 - distributed.worker - INFO -         Registered to: ws://launcher-svc-1245231:8786/
2022-05-16 15:19:15,025 - distributed.worker - INFO - -------------------------------------------------
2022-05-16 15:19:15,026 - distributed.core - INFO - Starting established connection
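
The AttributeError at the top of the traceback says that the libnvidia-ml.so.1 visible in the container does not export the symbol pynvml asked for. A minimal sketch for checking this directly with ctypes, without going through pynvml, might look like the following. The helper names `library_exports` and `nvml_exports` are hypothetical; the library name and the symbol come from the traceback above.

```python
import ctypes


def library_exports(lib, symbol):
    """Return True if the loaded library object exposes `symbol`."""
    try:
        getattr(lib, symbol)  # ctypes raises AttributeError for a missing symbol
        return True
    except AttributeError:
        return False


def nvml_exports(symbol, path="libnvidia-ml.so.1"):
    """Return True/False for `symbol`, or None if the library cannot be loaded.

    `path` defaults to the library name seen in the traceback (an assumption
    about where the driver library lives in your container).
    """
    try:
        lib = ctypes.CDLL(path)
    except OSError:
        return None  # no NVIDIA driver library visible in this environment
    return library_exports(lib, symbol)
```

Calling `nvml_exports("nvmlDeviceGetComputeRunningProcesses_v2")` inside the container should return False when this error applies, and None when the driver library is not mounted at all.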

Check the current CUDA version with nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 35%   33C    P8    18W / 220W |    552MiB /  7959MiB |     13%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
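
The "CUDA Version" in this header is the highest CUDA version the installed driver supports, not the version of an installed toolkit. To read both values in a script, a small sketch like this one can parse the header line. The function name `parse_nvidia_smi_header` and the regular expression are illustrative assumptions based on the output format shown above.

```python
import re

# Matches the header line of the nvidia-smi table shown above.
HEADER_RE = re.compile(
    r"Driver Version:\s*(?P<driver>[\d.]+)\s+CUDA Version:\s*(?P<cuda>[\d.]+)"
)


def parse_nvidia_smi_header(text):
    """Return (driver_version, cuda_version) as strings, or None if not found."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    return match.group("driver"), match.group("cuda")


sample = "| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |"
print(parse_nvidia_smi_header(sample))  # ('470.199.02', '11.4')
```

In practice the input text could come from `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout`.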

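Solution, based on the error message:
pynvml releases correspond to specific CUDA releases, so the options are to upgrade CUDA (in practice the NVIDIA driver, which provides libnvidia-ml.so.1) or to downgrade pynvml.
Start python3 and check the pynvml version: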

>>> import pynvml
>>> print(pynvml.__version__)
11.5.1

My guess was that pynvml 11.5.1 was too new for CUDA 11.4, so I downgraded pynvml directly with pip.

pip install pynvml==11.4.1

This resolved the problem.
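
The version reasoning used here can be sketched as a small comparison. It assumes, as this post does, that the major.minor part of a pynvml 11.x version mirrors the CUDA release it targets; that is an assumption, not a documented guarantee. The function names are hypothetical.

```python
def major_minor(version):
    """Turn '11.5.1' or '11.4' into a comparable (major, minor) tuple."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def pynvml_is_newer_than_cuda(pynvml_version, cuda_version):
    """True if pynvml targets a newer CUDA release than the driver reports.

    Assumption: pynvml's major.minor tracks the CUDA release number.
    """
    return major_minor(pynvml_version) > major_minor(cuda_version)


print(pynvml_is_newer_than_cuda("11.5.1", "11.4"))  # True  (the failing setup)
print(pynvml_is_newer_than_cuda("11.4.1", "11.4"))  # False (after the downgrade)
```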
