Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pynvml/nvml.py", line 782, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/opt/conda/envs/rapids/lib/python3.8/ctypes/__init__.py", line 386, in __getattr__
    func = self.__getitem__(name)
  File "/opt/conda/envs/rapids/lib/python3.8/ctypes/__init__.py", line 391, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/dask_cuda/initialize.py", line 32, in _create_cuda_context
    distributed.comm.ucx.init_once()
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/comm/ucx.py", line 86, in init_once
    pre_existing_cuda_context = has_cuda_context()
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/distributed/diagnostics/nvml.py", line 91, in has_cuda_context
    running_processes = pynvml.nvmlDeviceGetComputeRunningProcesses_v2(handle)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pynvml/nvml.py", line 2191, in nvmlDeviceGetComputeRunningProcesses_v2
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/pynvml/nvml.py", line 785, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found
2022-05-16 15:19:14,517 - distributed.preloading - INFO - Run preload setup click command: dask_cuda.initialize
2022-05-16 15:19:14,517 - distributed.worker - INFO - Start worker at: ws://10.233.68.22:39537/
2022-05-16 15:19:14,517 - distributed.worker - INFO - Listening to: ws://10.233.68.22:39537/
2022-05-16 15:19:14,517 - distributed.worker - INFO - dashboard at: 10.233.68.22:35313
2022-05-16 15:19:14,517 - distributed.worker - INFO - Waiting to connect to: ws://launcher-svc-1245231:8786/
2022-05-16 15:19:14,517 - distributed.worker - INFO - -------------------------------------------------
2022-05-16 15:19:14,517 - distributed.worker - INFO - Threads: 1
2022-05-16 15:19:14,517 - distributed.worker - INFO - Memory: 400.00 GiB
2022-05-16 15:19:14,517 - distributed.worker - INFO - Local Directory: /rapids/notebooks/dask-worker-space/worker-ave_m7tw
2022-05-16 15:19:14,517 - distributed.worker - INFO - Starting Worker plugin PreImport-0b003d61-7c5f-4530-bf6f-c95b93c83338
2022-05-16 15:19:14,517 - distributed.worker - INFO - Starting Worker plugin CPUAffinity-a1d437c7-bb5d-408e-a3e0-3120dd6c6a5f
2022-05-16 15:19:14,518 - distributed.worker - INFO - Starting Worker plugin RMMSetup-03e12d8b-4b23-4e0e-9b3c-a79b6b12e7ab
2022-05-16 15:19:14,974 - distributed.worker - INFO - -------------------------------------------------
2022-05-16 15:19:15,025 - distributed.worker - INFO - Registered to: ws://launcher-svc-1245231:8786/
2022-05-16 15:19:15,025 - distributed.worker - INFO - -------------------------------------------------
2022-05-16 15:19:15,026 - distributed.core - INFO - Starting established connection
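The first traceback points at the root cause: the driver's NVML library at /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 does not export the nvmlDeviceGetComputeRunningProcesses_v2 symbol that pynvml 11.5.1 tries to bind. This can be confirmed directly with ctypes, mirroring the getattr lookup that pynvml's _nvmlGetFunctionPointer performs (a minimal sketch, assuming the same library path as in the traceback):

import ctypes

# Load the driver's NVML library, as pynvml does internally.
nvml = ctypes.CDLL("/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1")

for name in ("nvmlDeviceGetComputeRunningProcesses",
             "nvmlDeviceGetComputeRunningProcesses_v2"):
    try:
        getattr(nvml, name)   # same lookup that raised AttributeError above
        print(f"{name}: exported")
    except AttributeError:
        print(f"{name}: missing from this driver's libnvidia-ml")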
Check the current driver and CUDA version with nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 35%   33C    P8    18W / 220W |    552MiB /  7959MiB |     13%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
Solution (following the hint in the error): the CUDA driver and the pynvml library have to match, so the fix is either to upgrade CUDA (i.e. the NVIDIA driver) or to downgrade pynvml.
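Before changing anything, the mismatch can be reproduced in isolation by repeating the NVML call that distributed/diagnostics/nvml.py makes (a minimal sketch, assuming at least one visible GPU):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    # Same call made by distributed.diagnostics.nvml.has_cuda_context()
    pynvml.nvmlDeviceGetComputeRunningProcesses_v2(handle)
    print("_v2 entry point works with this driver")
except pynvml.NVMLError as e:
    print(f"pynvml/driver mismatch: {e}")   # expected here: Function Not Found
pynvml.nvmlShutdown()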
Start python3 and check the pynvml version:
>>> import pynvml
>>> print(pynvml.__version__)
11.5.1
The likely cause is that this pynvml version is too new for the installed CUDA driver, so downgrade pynvml directly with pip:
pip install pynvml==11.4.1
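After the downgrade, the NVML query that previously failed can be re-checked before restarting the dask-cuda workers. This is a minimal sketch, assuming GPU 0 is visible; it uses the plain nvmlDeviceGetComputeRunningProcesses name and assumes the 11.4.1 binding resolves it against an entry point this driver actually exports:

import pynvml

pynvml.nvmlInit()
print(pynvml.__version__)                     # should now report 11.4.1
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
# The query that previously failed inside has_cuda_context(); it should
# now return a (possibly empty) list of compute processes.
procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
print(f"{len(procs)} compute process(es) on GPU 0")
pynvml.nvmlShutdown()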
Problem solved.