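Joblib is a Python package that makes it easy to run code in parallel, greatly reducing the boilerplate needed for parallel computing. By calling a few of its functions we can parallelize target code and cut its running time. Here is a simple example:
1. First, we define a simple function single(a) that sleeps for 1 s and then prints the value of a: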
from joblib import Parallel, delayed
import time

def single(a):
    """Sleep for one second, then print the argument."""
    time.sleep(1)  # sleep for 1 s
    print(a)       # print a
2. We call single() 10 times in a for loop and record the elapsed time. As the output shows, the code takes about 10 s to run this way.
start = time.time()  # record the start time
for i in range(10):  # call single() 10 times
    single(i)
Time = time.time() - start  # compute the elapsed time
print(str(Time)+'s')
# Output #
0
1
2
3
4
5
6
7
8
9
10.0172278881073s
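3. Next, we use the Parallel class and the delayed function from joblib to parallelize the 10 calls to single(). Parallel creates a pool of worker processes and runs each item of the task list in them; here we set n_jobs=3, i.e. three worker processes. delayed is a simple trick for creating the tuple (function, args, kwargs); in this code it creates 10 tasks that call single() with the arguments 0 to 9. The code and its output are shown below. The running time is far shorter than with sequential execution. Because each task is indivisible, 10 one-second tasks on 3 workers need at least 4 rounds, so the ideal time is about 4 s rather than 10/3 ≈ 3.33 s; the measured time is larger still because of the overhead of starting and switching between processes. Note that with several workers the order of the printed lines is not guaranteed.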
start = time.time()  # record the start time
Parallel(n_jobs=3)(delayed(single)(i) for i in range(10))  # run in parallel
Time = time.time() - start  # compute the elapsed time
print(str(Time)+'s')
# Output #
0
1
2
3
4
5
6
7
8
9
4.833665370941162s
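To make the role of delayed concrete, the following minimal sketch inspects what it returns. The function add and its arguments are hypothetical names chosen for illustration; the only joblib call used is delayed.

```python
from joblib import delayed


def add(a, b=0):
    """Hypothetical example function (not from the article)."""
    return a + b


# delayed(add)(1, b=2) does not call add; it only packs the call
# into a (function, args, kwargs) tuple for Parallel to run later.
func, args, kwargs = delayed(add)(1, b=2)
print(args, kwargs)

# Running the packed call by hand gives the same result as add(1, b=2).
result = func(*args, **kwargs)
print(result)
```

Parallel consumes an iterable of such tuples, which is why the generator expression passed to it contains delayed(single)(i) rather than single(i).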
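In addition, when n_jobs is 1 the call is equivalent to the sequential for loop and still takes about 10 s; try it yourself if you are interested. You can also vary n_jobs and compare the resulting running times.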
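One way to run that comparison is sketched below. It is only an illustration: the helper names work, timed_run and ideal_time are hypothetical, the task sleeps for an assumed 0.1 s instead of 1 s to keep the run short, and the task returns its argument instead of printing it so that the results list can be checked.

```python
import math
import time

from joblib import Parallel, delayed

TASK_TIME = 0.1  # assumed task duration, shorter than the article's 1 s
N_TASKS = 10


def work(a):
    """Sleep briefly and return the argument instead of printing it."""
    time.sleep(TASK_TIME)
    return a


def timed_run(n_jobs):
    """Run N_TASKS tasks with the given n_jobs; return (results, seconds)."""
    start = time.time()
    results = Parallel(n_jobs=n_jobs)(delayed(work)(i) for i in range(N_TASKS))
    return results, time.time() - start


def ideal_time(n_jobs):
    """Lower bound on the time: indivisible tasks run in whole rounds."""
    return math.ceil(N_TASKS / n_jobs) * TASK_TIME


for n_jobs in (1, 2, 3):
    results, elapsed = timed_run(n_jobs)
    print(f"n_jobs={n_jobs}: ideal {ideal_time(n_jobs):.1f}s, measured {elapsed:.2f}s")
```

Parallel returns the results as a list in the same order as the input tasks, regardless of which worker finished first. The first parallel run also pays for starting the worker processes, so the measured times are only indicative.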
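4. Parallel takes many parameters, but in practice n_jobs and backend are the ones used most often. The definition of Parallel and the meaning of its parameters are given below: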
class joblib.Parallel(n_jobs=None, backend=None, verbose=0, timeout=None, pre_dispatch='2 * n_jobs',
                      batch_size='auto', temp_folder=None, max_nbytes='1M', mmap_mode='r', prefer=None, require=None)
Parameter descriptions (reference: https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html#joblib.Parallel):
n_jobs: the maximum number of concurrently running jobs, such as the number of Python worker processes when backend="multiprocessing" or the size of the thread pool when backend="threading". If -1, all CPUs are used. If 1, no parallel computing code is used at all, i.e. execution is sequential, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) CPUs are used, where n_cpus is the number of CPU cores; for example, n_jobs=-2 uses all CPUs but one. None is a marker for 'unset' and is interpreted as n_jobs=1 (sequential execution) unless the call is made under a parallel_backend context manager that sets another value for n_jobs.
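The n_jobs rule can be written out as a few lines of plain Python. This is a sketch of the rule as described above, not joblib's own code: effective_workers is a hypothetical helper, n_cpus is passed in explicitly, and both the clamp to at least one worker and the error for n_jobs=0 are assumptions of the sketch.

```python
def effective_workers(n_jobs, n_cpus):
    """Hypothetical helper mirroring the n_jobs rule described above."""
    if n_jobs is None:
        return 1  # 'unset' is treated as sequential execution
    if n_jobs == 0:
        # Assumption of this sketch: zero workers is rejected.
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, and so on.
        # Clamping to at least 1 worker is an assumption of this sketch.
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


for n_jobs in (None, 1, 3, -1, -2):
    print(n_jobs, "->", effective_workers(n_jobs, n_cpus=8))
```

To query the value on the actual machine, joblib also exposes effective_n_jobs.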
backend: the parallelization backend to use:
- "loky" (used by default) can induce some communication and memory overhead when exchanging input and output data with the worker Python processes.
- "multiprocessing" is the previous process-based backend, based on multiprocessing.Pool. It is less robust than loky.
- "threading" is a very low-overhead backend, but it suffers from the Python Global Interpreter Lock (GIL) if the called function relies a lot on Python objects. "threading" is mostly useful when the execution bottleneck is a compiled extension that explicitly releases the GIL (for instance a Cython loop wrapped in a "with nogil" block, or an expensive call to a library such as NumPy).
- Finally, you can register your own backends by calling register_parallel_backend, which lets you implement a backend of your liking.
It is not recommended to hard-code the backend name in a call to Parallel inside a library. Instead, set soft hints (prefer) or hard constraints (require), so that library users can change the backend from the outside using the parallel_backend context manager.
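The recommendation about soft hints can be illustrated with a minimal sketch. The names square and library_sum_of_squares are hypothetical and stand in for code inside some library; only Parallel, delayed and parallel_backend come from joblib.

```python
from joblib import Parallel, delayed, parallel_backend


def square(x):
    """Hypothetical task used only for illustration."""
    return x * x


def library_sum_of_squares(values):
    """Hypothetical library function: a soft hint, no hard-coded backend."""
    squares = Parallel(n_jobs=2, prefer="threads")(
        delayed(square)(v) for v in values
    )
    return sum(squares)


# Default call: the soft hint selects the thread-based backend.
total_default = library_sum_of_squares(range(10))

# The caller overrides the backend from the outside, without
# touching the library code.
with parallel_backend("loky", n_jobs=2):
    total_override = library_sum_of_squares(range(10))

print(total_default, total_override)
```

Had the library written backend="threading" instead of prefer="threads", the context manager could not have changed the backend.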
verbose: the verbosity level. If non-zero, progress messages are printed. Above 50, the output is sent to stdout. The frequency of the messages increases with the verbosity level. If it is more than 10, all iterations are reported.
timeout: timeout limit for each task to complete. If any task takes longer, a TimeOutError is raised. Only applied when n_jobs != 1.
pre_dispatch: the number of batches (of tasks) to be pre-dispatched. The default is '2*n_jobs'. When batch_size="auto" this is a reasonable default and the workers should never starve.
batch_size: the number of atomic tasks to dispatch at once to each worker. When individual evaluations are very fast, dispatching calls to workers can be slower than sequential computation because of the overhead. Batching fast computations together can mitigate this.
The 'auto' strategy keeps track of the time it takes for a batch to complete and dynamically adjusts the batch size, using a heuristic, to keep that time on the order of half a second. The initial batch size is 1.
batch_size="auto" with backend="threading" dispatches batches of a single task at a time, because the threading backend has very little overhead and a larger batch size has not been shown to bring any gain in that case.
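The effect of batching on very fast tasks can be explored with a small sketch like the one below. The names tiny and run, the task count of 2000 and the batch sizes tried are all arbitrary choices for illustration, and the timings depend heavily on the machine.

```python
import time

from joblib import Parallel, delayed


def tiny(x):
    """Hypothetical very fast task, where dispatch overhead dominates."""
    return x + 1


def run(batch_size, n=2000):
    """Run n tiny tasks on 2 workers; return (results, seconds)."""
    start = time.time()
    results = Parallel(n_jobs=2, batch_size=batch_size)(
        delayed(tiny)(i) for i in range(n)
    )
    return results, time.time() - start


for batch_size in (1, 200, "auto"):
    results, elapsed = run(batch_size)
    print(f"batch_size={batch_size!r}: {elapsed:.3f}s for {len(results)} tasks")
```

The first run also includes the cost of starting the worker processes, so it is worth repeating the loop before drawing conclusions from the numbers.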
temp_folder: folder to be used by the pool for memmapping large arrays, in order to share memory with worker processes. If None, the following are tried in order:
- a folder pointed to by the JOBLIB_TEMP_FOLDER environment variable,
- /dev/shm if the folder exists and is writable: this is a RAM disk filesystem available by default on modern Linux distributions,
- the default system temporary folder, which can be overridden with the TMP, TMPDIR or TEMP environment variables, typically /tmp under Unix operating systems.
Only active when backend="loky" or "multiprocessing".
max_nbytes: threshold on the size of arrays passed to the workers that triggers automated memory mapping in temp_folder. It can be an int in bytes or a human-readable string, e.g. '1M' for 1 megabyte. Use None to disable memmapping of large arrays. Only active when backend="loky" or "multiprocessing".
mmap_mode: memmapping mode for numpy arrays passed to workers. See the max_nbytes parameter documentation for more details.
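The interplay of max_nbytes and memory mapping can be observed with a short sketch, assuming NumPy is installed and the default process-based backend is in use. The helper is_memmap and the array sizes are choices made for this illustration.

```python
import numpy as np
from joblib import Parallel, delayed


def is_memmap(arr):
    """Report whether the worker received a memory-mapped array."""
    return isinstance(arr, np.memmap)


small = np.ones(100)        # 800 bytes, below the threshold
large = np.ones(1_000_000)  # about 8 MB, above the threshold

# max_nbytes='1M' is the default; it is spelled out here for clarity.
flags = Parallel(n_jobs=2, max_nbytes="1M")(
    delayed(is_memmap)(a) for a in (small, large)
)
print(flags)
```

Only the array above the threshold should arrive in the worker as a np.memmap; the small one is pickled and copied as usual.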
prefer: soft hint to choose the default backend if no specific backend was selected with the parallel_backend context manager. The default process-based backend is 'loky' and the default thread-based backend is 'threading'. Ignored if the backend parameter is specified.
require: hard constraint to select the backend. If set to 'sharedmem', the selected backend will be single-host and thread-based, even if the user asked for a non-thread-based backend with parallel_backend.
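A typical reason to set require='sharedmem' is a task that mutates an object living in the parent process. The sketch below follows that pattern; the names shared and collect are chosen for illustration.

```python
from joblib import Parallel, delayed

shared = set()


def collect(x):
    """Add x to a set that lives in the parent process."""
    shared.add(x)


# require='sharedmem' forces a thread-based backend, so every worker
# sees (and mutates) the same `shared` object.
Parallel(n_jobs=2, require="sharedmem")(delayed(collect)(i) for i in range(5))
print(sorted(shared))
```

With a process-based backend each worker would modify its own copy of the set, and the parent's set would stay empty.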