NCCL Troubleshooting (Part 1)

Official documentation: http://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/index.html#troubleshooting

========================================================================

5. Troubleshooting

Ensure you are familiar with the following known issues and useful debugging strategies.

5.1. Errors

NCCL calls may return a variety of return codes. Ensure that the return codes are always equal to ncclSuccess. If any call fails and returns a value different from ncclSuccess, setting NCCL_DEBUG to WARN will make NCCL print an explicit warning message before returning the error.
Errors are grouped into different categories.
  • ncclUnhandledCudaError and ncclSystemError indicate that a call to an external library failed.
  • ncclInvalidArgument and ncclInvalidUsage indicate that there was a programming error in the application using NCCL.
In either case, refer to the NCCL warning message to understand how to resolve the problem.
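In practice, it is convenient to wrap every NCCL call in a checking macro so that any return code other than ncclSuccess is reported immediately. Below is a minimal sketch in C; the NCCLCHECK macro name is an illustrative convention and not part of the NCCL API.

#include <stdio.h>
#include <stdlib.h>
#include <nccl.h>

/* Abort with a readable message whenever a NCCL call does not return
 * ncclSuccess. With NCCL_DEBUG=WARN set in the environment, NCCL itself
 * also prints a warning before returning the error code. */
#define NCCLCHECK(cmd) do {                                         \
  ncclResult_t res = (cmd);                                         \
  if (res != ncclSuccess) {                                         \
    fprintf(stderr, "NCCL error %s:%d '%s'\n",                      \
            __FILE__, __LINE__, ncclGetErrorString(res));           \
    exit(EXIT_FAILURE);                                             \
  }                                                                 \
} while (0)

/* Example use: NCCLCHECK(ncclCommInitRank(&comm, nranks, id, rank)); */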

5.2. Networking Issues

5.2.1. IP Network Interfaces

NCCL auto-detects which network interfaces to use for inter-node communication. If some interfaces are in the up state but are not able to communicate between nodes, NCCL may try to use them anyway and therefore fail during the init functions or even hang.
For more information about how to specify which interfaces to use, see the NCCL Knobs topic, particularly the NCCL_SOCKET_IFNAME knob.
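For example, NCCL can be restricted to a specific interface by exporting the knob before launching the job (eth0 below is only a placeholder for an interface name that is valid on your nodes):

export NCCL_SOCKET_IFNAME=eth0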

5.2.2. InfiniBand

Before running NCCL on InfiniBand, running low-level InfiniBand tests (and in particular the ib_write_bw test) can help verify which nodes are able to communicate properly.

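For example, with the perftest tools installed, a basic point-to-point bandwidth check between two nodes might look like the following sketch (the HCA name mlx5_0 and the server hostname are placeholders for your own setup):

ib_write_bw -d mlx5_0                      # on the first node (server side)
ib_write_bw -d mlx5_0 <server_hostname>    # on the second node (client side)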

5.3. Known Issues

Ensure you are familiar with the following known issues:

Sharing Data
In order to share data between ranks, NCCL may require shared system memory for IPC and pinned (page-locked) system memory resources. The operating system's limits on these resources may need to be increased accordingly. Please see your system's documentation for details. In particular, Docker® containers default to limited shared and pinned memory resources. When using NCCL inside a container, it is recommended that you increase these resources by adding the following options to the nvidia-docker run command line:

--shm-size=1g --ulimit memlock=-1
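A complete command might therefore look like the following sketch (the image name is a placeholder):

nvidia-docker run --shm-size=1g --ulimit memlock=-1 <your_image>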

Concurrency between NCCL and CUDA calls (NCCL up to 2.0.5 or CUDA 8)
NCCL uses CUDA kernels to perform inter-GPU communication. The NCCL kernels synchronize with each other; therefore, each kernel requires other kernels on other GPUs to also be executed in order to complete. The application should therefore make sure that nothing prevents the NCCL kernels from being executed concurrently on the different devices of a NCCL communicator.

For example, let's say you have a process managing multiple CUDA devices which also features a thread that calls CUDA functions asynchronously. In this case, CUDA calls could be executed between the enqueuing of two NCCL kernels. The CUDA call may wait for the first NCCL kernel to complete and prevent the second one from being launched, causing a deadlock since the first kernel will not complete until the second one is executed. To avoid this issue, one solution is to have a lock around the NCCL launch on multiple devices (around ncclGroupStart and ncclGroupEnd when using a single thread, around the NCCL launch when using multiple threads, using thread synchronization if necessary) and take this lock when calling CUDA from the asynchronous thread.
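The following is a minimal sketch in C of such a locking scheme, assuming one process drives ndev GPUs from a single launch thread (with communicator i bound to device i) while a separate thread issues unrelated asynchronous CUDA calls. The mutex and function names are illustrative, not part of NCCL, and error checking is omitted for brevity.

#include <pthread.h>
#include <cuda_runtime.h>
#include <nccl.h>

/* One lock shared by the launch thread and the asynchronous CUDA thread. */
static pthread_mutex_t launch_lock = PTHREAD_MUTEX_INITIALIZER;

/* Launch thread: hold the lock around the whole multi-device group launch so
 * no other CUDA call can slip in between the enqueuing of two NCCL kernels. */
void allreduce_all_devices(ncclComm_t* comms, cudaStream_t* streams,
                           float** sendbuff, float** recvbuff,
                           size_t count, int ndev) {
  pthread_mutex_lock(&launch_lock);
  ncclGroupStart();
  for (int i = 0; i < ndev; i++) {
    cudaSetDevice(i);  /* assumes communicator i was created on device i */
    ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  }
  ncclGroupEnd();
  pthread_mutex_unlock(&launch_lock);
}

/* Asynchronous thread: take the same lock before issuing CUDA calls so they
 * cannot block between the two NCCL kernel launches and cause the deadlock
 * described above. */
void async_cuda_call(void* dst, const void* src, size_t bytes,
                     cudaStream_t stream) {
  pthread_mutex_lock(&launch_lock);
  cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDefault, stream);
  pthread_mutex_unlock(&launch_lock);
}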

Starting with NCCL 2.1.0, this issue is no longer present when using CUDA 9, unless Cooperative Group Launch is disabled in the NCCL_LAUNCH_MODE=PARALLEL setting.


