The num_workers parameter of PyTorch's DataLoader

TL;DR

On Windows, it is recommended to set num_workers to 0; on Linux there is no such concern.

1 Problem description

In an earlier task on node representation learning over very large graphs, the PyG data-loading classes such as DataLoader and ClusterLoader take a num_workers parameter. Experimenting with this parameter's value on different operating systems produces different results:

  • Colab (Linux), num_workers set to 12
    A warning message says the num_workers setting is unreasonable, but execution continues (a minimal repro sketch follows this list):
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:477: UserWarning: This DataLoader will create 12 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary
  • Windows, num_workers set to 1
    An error is raised:
    [Screenshot: RuntimeError traceback raised by the DataLoader on Windows]
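
For concreteness, a minimal script that exercises the same code path; the TensorDataset and the sizes here are stand-ins I chose, not part of the original task:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(1000, dtype=torch.float32))

# The rationality check runs when iteration starts, so on a 4-logical-CPU
# Colab runtime num_workers=12 triggers the UserWarning quoted above.
loader = DataLoader(dataset, batch_size=100, num_workers=12)

if __name__ == "__main__":  # multiprocessing on Windows requires this guard
    for (batch,) in loader:
        pass
```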

I wanted to get to the bottom of why this error occurs.

2 Investigation

So I read the source, tracing up to the parent class in torch.utils.data.dataloader; everything concerning num_workers is defined in that file.

2.1 num_workers

The docstring contains a description of the num_workers parameter:
num_workers (int, optional): how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)
num_workers is the number of subprocesses created for data loading; when it is 0, the main process loads the data itself. As one would expect, the useful number of subprocesses is limited by the machine's CPU core and thread counts; setting num_workers to N does not mean N workers can actually run concurrently.
That is why the source defines the method check_worker_number_rationality(), which checks whether the num_workers setting is reasonable.

2.2 The warning message on Colab

The Colab warning message comes precisely from check_worker_number_rationality(). The comment at the top of its definition explains why the rationality check exists:

This function check whether the dataloader’s worker number is rational based on current system’s resource. Current rule is that if the number of workers this Dataloader will create is bigger than the number of logical cpus that is allowed to use, than we will pop up a warning to let user pay attention.
eg. If current system has 2 physical CPUs with 16 cores each. And each core support 2 threads, then the total logical cpus here is 2 * 16 * 2 = 64. Let’s say current DataLoader process can use half of them which is 32, then the rational max number of worker that initiated from this process is 32. Now, let’s say the created DataLoader has num_works = 40, which is bigger than 32. So the warning message is triggered to notify the user to lower the worker number if necessary.

The num_workers setting should not exceed max_num_worker_suggest, the number of logical CPUs available to the process; when it does, the warning message is emitted. The corresponding source:
[Screenshot: source of check_worker_number_rationality(), annotated with how max_num_worker_suggest is determined on Linux vs. Windows]
The annotations in that screenshot mark how max_num_worker_suggest is determined on each system:
Linux obtains it directly via len(os.sched_getaffinity(0)), while Windows falls back to os.cpu_count() (which returns the number of logical processors, not physical cores).
[Screenshot: the machine reports 8 logical processors]
The 8 there is the number of logical processors of the CPU.
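That per-platform logic can be sketched as follows; the function name and the fallback to 1 are my own, PyTorch keeps this inside check_worker_number_rationality():

```python
import os

def suggested_max_workers() -> int:
    # Linux: count only the CPUs this process is allowed to run on
    if hasattr(os, "sched_getaffinity"):
        try:
            return len(os.sched_getaffinity(0))
        except Exception:
            pass
    # Windows (no sched_getaffinity): logical processor count
    return os.cpu_count() or 1  # os.cpu_count() may return None

print(suggested_max_workers())  # prints 8 on the machine above
```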
At this point we know the Colab warning appeared because the num_workers we set exceeded max_num_worker_suggest, i.e. the machine's logical CPU count.
But in the earlier Windows experiment, num_workers did not exceed what the CPU supports; and even if it had, that would only produce a warning, not an error. So we keep looking for where the Windows error comes from.

2.3 The error on Windows

Following the Windows error message, a RuntimeError, we keep reading the source. The relevant spot:
[Screenshot: the source location that raises the RuntimeError]
A RuntimeError is raised when a worker process has not been killed within the allotted time.
This pins down where the error is raised, but not yet why the worker process fails.
Searching online turned up someone else's notes, "DataLoader windows平台下 多线程读数据报错" (DataLoader multi-worker data loading errors on Windows):

On Windows, a FileMapping object can only be released after every related process has closed it.
With multiprocessing enabled, the child process creates a FileMapping and the main process then opens it. Later, when the child process tries to release it, the reference count is non-zero because the parent still holds it, so it cannot be released at that time; and the current code provides no chance to close it again later when that becomes possible.

The GitHub developer discussion says the same thing:

The memory leak is caused by the difference in using FileMapping(mmap) on Windows. On Windows, FileMapping objects should be closed by all related processes and then it can be released. And there’s no other way to explicitly delete it.(Like shm_unlink)
When multiprocessing is on, the child process will create a FileMapping and then the main process will open it. After that, at some time, the child process will try to release it but it’s reference count is non-zero so it cannot be released at that time. But the current code does not provide a chance to let it close again when possible.
This PR targets #5590.
Current Progress:
The memory leak when num_worker=1 should be solved. However, further work has to be done for more workers.

In other words, on Windows you should currently set num_workers to 0 to make sure things run.
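
Putting that conclusion into practice, one pragmatic pattern is to pick num_workers by platform; the worker count of 4 for non-Windows systems below is an arbitrary placeholder, subject to the max_num_worker_suggest check discussed above:

```python
import sys
import torch
from torch.utils.data import DataLoader, TensorDataset

# Main-process loading on Windows; a few worker processes elsewhere.
NUM_WORKERS = 0 if sys.platform == "win32" else 4

dataset = TensorDataset(torch.randn(1000, 16))
loader = DataLoader(dataset, batch_size=32, num_workers=NUM_WORKERS)

if __name__ == "__main__":  # required when worker processes are spawned
    for (batch,) in loader:
        pass
```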
