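What factors determine an optimal chunksize argument to methods like multiprocessing.Pool.map()? The .map() method seems to use an arbitrary heuristic for its default chunksize (explained below); what motivates that choice, and is there a more thoughtful approach based on some particular situation/setup?
Example - say that I am:
Passing an iterable to .map() that has ~15 million elements;
Working on a machine with 24 cores and using the default processes = os.cpu_count() within multiprocessing.Pool().
My naive thinking is to give each of the 24 workers an equally sized chunk, i.e. 15_000_000 / 24, or 625,000. Large chunks should reduce turnover/overhead while fully utilizing all workers. But it seems that this misses some potential downsides of giving large batches to each worker. Is this an incomplete picture, and what am I missing?
Part of my question stems from the default logic for if chunksize=None: both .map() and .starmap() call .map_async(), which looks like this: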
    def _map_async(self, func, iterable, mapper, chunksize=None,
                   callback=None, error_callback=None):
        # ... (materialize `iterable` to list if it's an iterator)
        if chunksize is None:
            chunksize, extra = divmod(len(iterable), len(self._pool) * 4)  # ????
            if extra:
                chunksize += 1
        if len(iterable) == 0:
            chunksize = 0
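What's the logic behind divmod(len(iterable), len(self._pool) * 4)? This implies that the chunksize will be closer to 15_000_000 / (24 * 4) == 156_250. What's the intention in multiplying len(self._pool) by 4?
This makes the resulting chunksize a factor of 4 smaller than my "naive logic" from above, which consists of just dividing the length of the iterable by the number of workers in pool._pool.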
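As a quick check of the numbers above, here is a minimal sketch that computes both values. The helper name default_chunksize is made up for illustration; it only mirrors the divmod lines from the snippet quoted above and is not part of the multiprocessing API.

```python
def default_chunksize(len_iterable, n_workers, factor=4):
    """Mirror the divmod-based default quoted above (hypothetical helper)."""
    chunksize, extra = divmod(len_iterable, n_workers * factor)
    if extra:
        chunksize += 1
    return chunksize


len_iterable = 15_000_000
n_workers = 24

naive = len_iterable // n_workers                     # one chunk per worker
default = default_chunksize(len_iterable, n_workers)  # about four chunks per worker

print(naive)    # 625000
print(default)  # 156250
```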
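Lastly, there is also this snippet from the Python docs on .imap() that further drives my curiosity:
The chunksize argument is the same as the one used by the map() method. For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.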
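Solution
Short Answer
Pool's chunksize-algorithm is a heuristic. It provides a simple solution for all imaginable problem scenarios you are trying to stuff into Pool's methods. As a consequence, it cannot be optimized for any specific scenario.
The algorithm arbitrarily divides the iterable into approximately four times more chunks than the naive approach. More chunks mean more overhead, but also increased scheduling flexibility. As this answer will show, this leads to a higher worker-utilization on average, but without the guarantee of a shorter overall computation time for every case.
"That's nice to know," you might think, "but how does knowing this help me with my concrete multiprocessing problems?" Well, it doesn't. The more honest short answer is "there is no short answer", "multiprocessing is complex" and "it depends". An observed symptom can have different roots, even for similar scenarios.
This answer tries to provide you with basic concepts that help you get a clearer picture of Pool's scheduling black box. It also tries to give you some basic tools for recognizing and avoiding potential cliffs as far as they are related to chunksize.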
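Table of Contents
Part I
1. Definitions
2. Parallelization Goals
3. Parallelization Scenarios
4. Risks of Chunksize > 1
5. Pool's Chunksize-Algorithm
6. Quantifying Algorithm Efficiency
6.1 Models
6.2 Parallel Schedule
6.3 Efficiencies
6.3.1 Absolute Distribution Efficiency (ADE)
6.3.2 Relative Distribution Efficiency (RDE)
7. Naive vs. Pool's Chunksize-Algorithm
8. Reality Check
9. Conclusion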
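It is necessary to clarify some important terms first.
1. Definitions
Chunk
A chunk here is a share of the iterable-argument specified in a pool-method call. How the chunksize gets calculated and what effects this can have is the topic of this answer.
Task
A task's physical representation in a worker-process in terms of data can be seen in the figure below.
The figure shows an example call to pool.map(), displayed along a line of code taken from the multiprocessing.pool.worker function, where a task read from the inqueue gets unpacked. worker is the underlying main-function in the MainThread of a pool-worker-process. The func-argument specified in the pool-method will only match the func-variable inside the worker-function for single-call methods like apply_async and for imap with chunksize=1. For the rest of the pool-methods with a chunksize-parameter, the processing-function func will be a mapper-function (mapstar or starmapstar). This function maps the user-specified func-parameter on every element of the transmitted chunk of the iterable (--> "map-tasks"). The time this takes also defines a task as a unit of work.
Taskel
While the usage of the word "task" for the whole processing of one chunk is matched by code within multiprocessing.pool, there is no indication of how a single call to the user-specified func, with one element of the chunk as argument(s), should be referred to. To avoid confusion emerging from naming conflicts (think of the maxtasksperchild-parameter of Pool's __init__-method), this answer will refer to the single units of work within a task as taskels.
A taskel (from task + element) is the smallest unit of work within a task. It is the single execution of the function specified with the func-parameter of a Pool-method, called with arguments obtained from a single element of the transmitted chunk. A task consists of chunksize taskels.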
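The relation between chunks, tasks and taskels can be sketched in a few lines. This is only an illustration under simplifying assumptions: split_into_tasks and square are hypothetical names, no processes are involved, and the real Pool additionally wraps each chunk together with a mapper-function before putting it on the queue.

```python
from itertools import islice


def split_into_tasks(iterable, chunksize):
    """Yield tuples of up to `chunksize` elements (hypothetical helper)."""
    it = iter(iterable)
    while True:
        chunk = tuple(islice(it, chunksize))
        if not chunk:
            return
        yield chunk


def square(x):
    return x * x


tasks = list(split_into_tasks(range(7), chunksize=3))
print(tasks)  # [(0, 1, 2), (3, 4, 5), (6,)]

# One task = one chunk; each call of `square` on one element is one taskel.
results = [[square(x) for x in chunk] for chunk in tasks]
print(results)  # [[0, 1, 4], [9, 16, 25], [36]]
```

Note that the last chunk can hold fewer than chunksize elements when the iterable length is not a multiple of chunksize.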
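Parallelization Overhead (PO)
PO consists of Python-internal overhead and overhead for inter-process communication (IPC). The per-task overhead within Python comes with the code needed for packaging and unpacking the tasks and their results. IPC-overhead comes with the necessary synchronization of threads and the copying of data between different address spaces (two copy steps needed: parent -> queue -> child). The amount of IPC-overhead is OS-, hardware- and data-size dependent, which makes generalizations about the impact difficult.
2. Parallelization Goals
When using multiprocessing, our overall goal (obviously) is to minimize the total processing time for all tasks. To reach this overall goal, our technical goal needs to be optimizing the utilization of hardware resources.
Some important sub-goals for achieving the technical goal are:
minimize parallelization overhead (most famously, but not alone: IPC)
high utilization across all cpu-cores
keeping memory usage limited to prevent the OS from excessive paging (thrashing)
First, the tasks need to be computationally heavy (intensive) enough to earn back the PO we have to pay for parallelization. The relevance of PO decreases with increasing absolute computation time per taskel. Or, to put it the other way around, the bigger the absolute computation time per taskel for your problem, the less relevant the need for reducing PO becomes. If your computation takes hours per taskel, the IPC overhead will be negligible in comparison. The primary concern here is to prevent idling worker processes after all tasks have been distributed. Keeping all cores loaded means we are parallelizing as much as possible.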
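3. Parallelization Scenarios
"What factors determine an optimal chunksize argument to methods like multiprocessing.Pool.map()?"
The major factor in question is how much computation time may vary across our single taskels. To name it, the choice of an optimal chunksize is determined by the...
Coefficient of Variation (CV) for computation times per taskel.
The two extreme scenarios on a scale, following from the extent of this variation, are:
All taskels need exactly the same computation time.
A taskel could take seconds or days to finish.
For better memorability, I will refer to these scenarios as:
Dense Scenario
Wide Scenario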
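To put a number on the two extremes, the CV can be computed as the standard deviation of the taskel durations divided by their mean. The sketch below uses made-up durations in seconds and the population standard deviation; both are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def coefficient_of_variation(durations):
    """Return population standard deviation divided by the mean."""
    return pstdev(durations) / mean(durations)


dense = [60, 60, 60, 60, 60, 60, 60, 60]          # identical taskel durations
wide = [60, 60, 86400, 60, 86400, 60, 60, 86400]  # minutes mixed with days

print(coefficient_of_variation(dense))  # 0.0
print(coefficient_of_variation(wide))   # greater than 1
```

A CV of 0 corresponds to the Dense Scenario; the larger the CV, the closer a problem sits to the Wide Scenario.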
Dense Scenario
In a Dense Scenario it would be desirable to distribute all taskels at once, to keep necessary IPC and context switching at a minimum. This means we want to create only as many chunks as there are worker processes. As already stated above, the weight of PO increases with shorter computation times per taskel.
For maximal throughput, we also want all worker processes busy until all tasks are processed (no idling workers). For this goal, the distributed chunks should be of equal size or close to it.
Wide Scenario
The prime example for a Wide Scenario would be an optimization problem, where results either converge quickly or computation can take hours, if not days. Usually it is not predictable what mixture of "light taskels" and "heavy taskels" a task will contain in such a case, hence it's not advisable to distribute too many taskels in a task-batch at once. Distributing fewer taskels at once than possible means increasing scheduling flexibility. This is needed here to reach our sub-goal of high utilization of all cores.
If Pool methods were, by default, totally optimized for the Dense Scenario, they would increasingly create suboptimal timings for every problem located closer to the Wide Scenario.
4. Risks of Chunksize > 1
Consider this simplified pseudo-code example of a Wide Scenario-iterable, which we want to pass into a pool-method:
    good_luck_iterable = [60, 60, 86400, 60, 86400, 60, 60, 84600]
Instead of the actual values, we pretend to see the needed computation time in seconds, for simplicity only 1 minute or 1 day.
We assume the pool has four worker processes (on four cores) and chunksize is set to 2. Because the order will be kept, the chunks sent to the workers will be these:
    [(60, 60), (86400, 60), (86400, 60), (60, 84600)]
Since we have enough workers and the computation time is high enough, we can say, that every worker process will get a chunk to work on in the first place. (This does not have to be the case for fast completing tasks). Further we can say, the whole processing will take about 86400+60 seconds, because that's the highest total computation time for a chunk in this artificial scenario and we distribute chunks only once.
Now consider this iterable, which has only one element switching its position compared to the previous iterable:
    bad_luck_iterable = [60, 60, 86400, 86400, 60, 60, 60, 84600]
...and the corresponding chunks:
    [(60, 60), (86400, 86400), (60, 60), (60, 84600)]
Just bad luck with the sorting of our iterable nearly doubled (86400+86400) our total processing time! The worker getting the vicious (86400, 86400)-chunk is blocking the second heavy taskel in its task from getting distributed to one of the idling workers already finished with their (60, 60)-chunks. We obviously would not risk such an unpleasant outcome if we set chunksize=1.
This is the risk of bigger chunksizes. With higher chunksizes we trade scheduling flexibility for less overhead and in cases like above, that's a bad deal.
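The totals quoted for the two iterables can be reproduced with a small overhead-free simulation. This is a sketch under the assumptions stated in the text (four workers, chunks handed out in order to whichever worker is free first); the function name makespan is made up for this example.

```python
import heapq


def makespan(durations, n_workers, chunksize):
    """Total time when chunks are handed, in order, to the next free worker.

    Overhead-free sketch: each chunk costs the sum of its taskel durations.
    """
    chunks = [durations[i:i + chunksize]
              for i in range(0, len(durations), chunksize)]
    free_at = [0] * n_workers           # time at which each worker is free
    heapq.heapify(free_at)
    for chunk in chunks:
        start = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, start + sum(chunk))
    return max(free_at)


good_luck_iterable = [60, 60, 86400, 60, 86400, 60, 60, 84600]
bad_luck_iterable = [60, 60, 86400, 86400, 60, 60, 60, 84600]

print(makespan(good_luck_iterable, n_workers=4, chunksize=2))  # 86460
print(makespan(bad_luck_iterable, n_workers=4, chunksize=2))   # 172800
print(makespan(bad_luck_iterable, n_workers=4, chunksize=1))   # 86400
```

With chunksize=1 the unlucky ordering no longer matters, because the two heavy taskels end up on different workers.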
As we will see in chapter 6, Quantifying Algorithm Efficiency, bigger chunksizes can also lead to suboptimal results for Dense Scenarios.
5. Pool's Chunksize-Algorithm
Below you will find a slightly modified version of the algorithm inside the source code. As you can see, I cut off the lower part and wrapped it into a function for calculating the chunksize argument externally. I also replaced 4 with a factor parameter and outsourced the len() calls.
    # mp_utils.py

    def calc_chunksize(n_workers, len_iterable, factor=4):
        """Calculate chunksize argument for Pool-methods.

        Resembles source-code within `multiprocessing.pool.Pool._map_async`.
        """
        chunksize, extra = divmod(len_iterable, n_workers * factor)
        if extra:
            chunksize += 1
        return chunksize
To ensure we are all on the same page, here's what divmod does:
divmod(x, y) is a builtin function which returns (x//y, x%y).
x // y is the floor division, returning the down rounded quotient from x / y, while
x % y is the modulo operation returning the remainder from x / y.
Hence e.g. divmod(10, 3) returns (3, 1).
Now when you look at chunksize, extra = divmod(len_iterable, n_workers * 4), you will notice n_workers here is the divisor y in x / y and multiplication by 4, without further adjustment through if extra: chunksize +=1 later on, leads to an initial chunksize at least four times smaller (for len_iterable >= n_workers * 4) than it would be otherwise.
For viewing the effect of multiplication by 4 on the intermediate chunksize result consider this function:
    def compare_chunksizes(len_iterable, n_workers=4):
        """Calculate naive chunksize, Pool's stage-1 chunksize and the chunksize
        for Pool's complete algorithm. Return chunksizes and the real factors by
        which naive chunksizes are bigger.
        """
        cs_naive = len_iterable // n_workers or 1  # naive approach
        cs_pool1 = len_iterable // (n_workers * 4) or 1  # incomplete pool algo.
        cs_pool2 = calc_chunksize(n_workers, len_iterable)

        real_factor_pool1 = cs_naive / cs_pool1
        real_factor_pool2 = cs_naive / cs_pool2

        return cs_naive, cs_pool1, cs_pool2, real_factor_pool1, real_factor_pool2
The function above calculates the naive chunksize (cs_naive) and the first-step chunksize of Pool's chunksize-algorithm (cs_pool1), as well as the chunksize for the complete Pool-algorithm (cs_pool2). Further, it calculates the real factors rf_pool1 = cs_naive / cs_pool1 and rf_pool2 = cs_naive / cs_pool2 (named real_factor_pool1 and real_factor_pool2 in the code), which tell us how many times the naively calculated chunksizes are bigger than Pool's internal version(s).
Below you see two figures created with output from this function. The left figure just shows the chunksizes for n_workers=4 up until an iterable length of 500. The right figure shows the values for rf_pool1. From iterable length 16 on, the real factor becomes >= 4 (for len_iterable >= n_workers * 4), and its maximum value is 7 for iterable lengths 28-31. That's a massive deviation from the original factor 4 the algorithm converges to for longer iterables. 'Longer' here is relative and depends on the number of specified workers.
Remember chunksize cs_pool1 still lacks the extra-adjustment with the remainder from divmod contained in cs_pool2 from the complete algorithm.
The algorithm goes on with:
    if extra:
        chunksize += 1
Now, in cases where there is a remainder (an extra from the divmod-operation), increasing the chunksize by 1 obviously cannot work out for every task. After all, if it did, there would not be a remainder to begin with.
As you can see in the figures below, the "extra-treatment" has the effect that the real factor rf_pool2 now converges towards 4 from below 4, and the deviation is somewhat smoother. Standard deviation for n_workers=4 and len_iterable=500 drops from 0.5233 for rf_pool1 to 0.4115 for rf_pool2.
Eventually, increasing chunksize by 1 has the effect, that the last task transmitted only has a size of len_iterable % chunksize or chunksize.
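The effect on the last task can be checked with a short sketch. chunk_lengths is a hypothetical helper written for this illustration; it repeats the divmod logic from above and then lists the length of every resulting chunk.

```python
def chunk_lengths(len_iterable, n_workers, factor=4):
    """Return the length of every chunk for the divmod-based chunksize."""
    chunksize, extra = divmod(len_iterable, n_workers * factor)
    if extra:
        chunksize += 1
    full, rest = divmod(len_iterable, chunksize)
    return [chunksize] * full + ([rest] if rest else [])


# 500 is not a multiple of 16: the last chunk is len_iterable % chunksize.
lengths = chunk_lengths(len_iterable=500, n_workers=4)
print(len(lengths), lengths[0], lengths[-1])  # 16 32 20

# 512 is a multiple of 16: the last chunk is a full chunk.
lengths = chunk_lengths(len_iterable=512, n_workers=4)
print(len(lengths), lengths[0], lengths[-1])  # 16 32 32
```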
The more interesting and, as we will see later, more consequential effect of the extra-treatment, however, can be observed in the number of generated chunks (n_chunks).
For long enough iterables, Pool's completed chunksize-algorithm (n_pool2 in the figure below) will stabilize the number of chunks at n_chunks == n_workers * 4.
In contrast, the naive algorithm (after an initial burp) keeps alternating between n_chunks == n_workers and n_chunks == n_workers + 1 as the length of the iterable grows.
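Both statements about n_chunks can be verified numerically. The following sketch uses two hypothetical helpers and n_workers=4; the range of lengths starting at 241 is an assumption chosen so that the iterable is already "long enough" for four workers.

```python
def n_chunks_pool(len_iterable, n_workers, factor=4):
    """Number of chunks produced by the divmod-based chunksize."""
    chunksize, extra = divmod(len_iterable, n_workers * factor)
    if extra:
        chunksize += 1
    return -(-len_iterable // chunksize)   # ceiling division


def n_chunks_naive(len_iterable, n_workers):
    """Number of chunks when aiming for one chunk per worker."""
    chunksize = len_iterable // n_workers or 1
    return -(-len_iterable // chunksize)   # ceiling division


n_workers = 4
lengths = range(241, 2001)
print(sorted({n_chunks_pool(n, n_workers) for n in lengths}))   # [16]
print(sorted({n_chunks_naive(n, n_workers) for n in lengths}))  # [4, 5]
```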
Below you will find two enhanced info-functions for Pool's and the naive chunksize-algorithm. The output of these functions will be needed in the next chapter.
    # mp_utils.py

    from collections import namedtuple


    Chunkinfo = namedtuple(
        'Chunkinfo', ['n_workers', 'len_iterable', 'n_chunks',
                      'chunksize', 'last_chunk']
    )

    def calc_chunksize_info(n_workers, len_iterable, factor=4):
        """Calculate chunksize numbers."""
        chunksize, extra = divmod(len_iterable, n_workers * factor)
        if extra:
            chunksize += 1
        # `+ (len_iterable % chunksize > 0)` exploits that `True == 1`
        n_chunks = len_iterable // chunksize + (len_iterable % chunksize > 0)
        # exploit `0 == False`
        last_chunk = len_iterable % chunksize or chunksize

        return Chunkinfo(
            n_workers, len_iterable, n_chunks, chunksize, last_chunk
        )
Don't be confused by the probably unexpected look of calc_naive_chunksize_info. The extra from divmod is not used for calculating the chunksize.
    def calc_naive_chunksize_info(n_workers, len_iterable):
        """Calculate naive chunksize numbers."""
        chunksize, extra = divmod(len_iterable, n_workers)
        if chunksize == 0:
            chunksize = 1
            n_chunks = extra
            last_chunk = chunksize
        else:
            n_chunks = len_iterable // chunksize + (len_iterable % chunksize > 0)
            last_chunk = len_iterable % chunksize or chunksize

        return Chunkinfo(
            n_workers, len_iterable, n_chunks, chunksize, last_chunk
        )
6. Quantifying Algorithm Efficiency
Now, after we have seen how the output of Pool's chunksize-algorithm looks different compared to output from the naive algorithm...
How to tell if Pool's approach actually improves something?
And what exactly could this something be?
As shown in the previous chapter, for longer iterables (a bigger number of taskels), Pool's chunksize-algorithm approximately divides the iterable into four times more chunks than the naive method. Smaller chunks mean more tasks and more tasks mean more Parallelization Overhead (PO), a cost which must be weighed against the benefit of increased scheduling-flexibility (recall "Risks of Chunksize>1").
For rather obvious reasons, Pool's basic chunksize-algorithm cannot weigh scheduling-flexibility against PO for us. IPC-overhead is OS-, hardware- and data-size dependent. The algorithm cannot know on what hardware we run our code, nor does it have a clue how long a taskel will take to finish. It's a heuristic providing basic functionality for all possible scenarios. This means it cannot be optimized for any scenario in particular. As mentioned before, PO also becomes increasingly less of a concern with increasing computation times per taskel (negative correlation).
When you recall the Parallelization Goals from chapter 2, one bullet-point was:
high utilization across all cpu-cores
The previously mentioned something that Pool's chunksize-algorithm can try to improve is the minimization of idling worker-processes or, in other words, the utilization of cpu-cores.
A recurring question on SO regarding multiprocessing.Pool is asked by people wondering about unused cores / idling worker-processes in situations where you would expect all worker-processes to be busy. While this can have many reasons, idling worker-processes towards the end of a computation are an observation we can often make, even with Dense Scenarios (equal computation times per taskel), in cases where the number of workers is not a divisor of the number of chunks (n_chunks % n_workers > 0).
The question now is:
How can we practically translate our understanding of chunksizes into something which enables us to explain observed worker-utilization, or even compare the efficiency of different algorithms in that regard?
6.1 Models
For gaining deeper insights here, we need a form of abstraction of parallel computations which simplifies the overly complex reality down to a manageable degree of complexity, while preserving significance within defined boundaries. Such an abstraction is called a model. An implementation of such a "Parallelization Model" (PM) generates worker-mapped meta-data (timestamps) as real computations would, if the data were to be collected. The model-generated meta-data allows predicting metrics of parallel computations under certain constraints.
One of two sub-models within the PM defined here is the Distribution Model (DM). The DM explains how atomic units of work (taskels) are distributed over parallel workers and time, when no factors other than the respective chunksize-algorithm, the number of workers, the input-iterable (number of taskels) and their computation duration are considered. This means no form of overhead is included.
For obtaining a complete PM, the DM is extended with an Overhead Model (OM), representing various forms of Parallelization Overhead (PO). Such a model needs to be calibrated for each node individually (hardware-, OS-dependencies). How many forms of overhead are represented in an OM is left open, and so multiple OMs with varying degrees of complexity can exist. Which level of accuracy the implemented OM needs is determined by the overall weight of PO for the specific computation. Shorter taskels lead to a higher weight of PO, which in turn requires a more precise OM if we were attempting to predict Parallelization Efficiencies (PE).
6.2 Parallel Schedule (PS)
The Parallel Schedule is a two-dimensional representation of the parallel computation, where the x-axis represents time and the y-axis represents a pool of parallel workers. The number of workers and the total computation time mark the extent of a rectangle, in which smaller rectangles are drawn. These smaller rectangles represent atomic units of work (taskels).
Below you find the visualization of a PS drawn with data from the DM of Pool's chunksize-algorithm for the Dense Scenario.
The x-axis is sectioned into equal units of time, where each unit stands for the computation time a taskel requires.
The y-axis is divided into the number of worker-processes the pool uses.
A taskel here is displayed as the smallest cyan-colored rectangle, put into a timeline (a schedule) of an anonymized worker-process.
A task is one or multiple taskels in a worker-timeline continuously highlighted with the same hue.
Idling time units are represented through red colored tiles.
The Parallel Schedule is partitioned into sections. The last section is the tail-section.
The names for the composed parts can be seen in the picture below.
In a complete PM including an OM, the Idling Share is not limited to the tail, but also comprises space between tasks and even between taskels.
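A text-only stand-in for such a Parallel Schedule can be produced with a few lines. This is a sketch of the overhead-free Dense Scenario only: parallel_schedule is a hypothetical helper, every taskel is assumed to take exactly one time unit, and chunks are handed out in order to the worker that becomes free first.

```python
def parallel_schedule(n_workers, len_iterable, chunksize):
    """Return one timeline string per worker for the Dense Scenario.

    Letters mark taskels (one letter per chunk), '.' marks idle time units.
    Overhead-free sketch: every taskel takes exactly one time unit.
    """
    timelines = [""] * n_workers
    n_chunks = -(-len_iterable // chunksize)          # ceiling division
    for i in range(n_chunks):
        size = min(chunksize, len_iterable - i * chunksize)
        # hand the chunk to the worker that becomes free first
        worker = min(range(n_workers), key=lambda w: len(timelines[w]))
        timelines[worker] += chr(ord("a") + i % 26) * size
    total = max(len(t) for t in timelines)
    return [t.ljust(total, ".") for t in timelines]


for row in parallel_schedule(n_workers=4, len_iterable=39, chunksize=3):
    print(row)
# aaaeeeiiimmm
# bbbfffjjj...
# cccgggkkk...
# dddhhhlll...
```

Each row is the timeline of one worker; the dots at the end of three rows are the Idling Share in the tail-section.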
6.3 Efficiencies
Note:
Since earlier versions of this answer, "Parallelization Efficiency (PE)" has been renamed to "Distribution Efficiency (DE)".
PE now refers to overhead-including efficiency.
The Models introduced above allow quantifying the rate of worker-utilization. We can distinguish:
Distribution Efficiency (DE) - calculated with help of a DM (or a simplified method for the Dense Scenario).
Parallelization Efficiency (PE) - either calculated with help of a calibrated PM (prediction) or calculated from meta-data of real computations.
It's important to note, that calculated efficiencies do not automatically correlate with faster overall computation for a given parallelization problem. Worker-utilization in this context only distinguishes between a worker having a started, yet unfinished taskel and a worker not having such an "open" taskel. That means, possible idling during the time span of a taskel is not registered.
All efficiencies mentioned above are basically obtained by calculating the quotient of the division Busy Share / Parallel Schedule. The difference between DE and PE comes with the Busy Share occupying a smaller portion of the overall Parallel Schedule for the overhead-extended PM.
From here on, this answer will only discuss a simple method to calculate DE for the Dense Scenario. This is sufficient for comparing different chunksize-algorithms, since...
... the DM is the part of the PM, which changes with different chunksize-algorithms employed.
... the Dense Scenario with equal computation durations per taskel depicts a "stable state", for which these time spans drop out of the equation. Any other scenario would just lead to random results since the ordering of taskels would matter.
6.3.1 Absolute Distribution Efficiency (ADE)
This basic efficiency can be calculated in general by dividing the Busy Share by the whole potential of the Parallel Schedule:
Absolute Distribution Efficiency (ADE) = Busy Share / Parallel Schedule
For the Dense Scenario, the simplified calculation-code looks like this:
    # mp_utils.py

    def calc_ade(n_workers, len_iterable, n_chunks, chunksize, last_chunk):
        """Calculate Absolute Distribution Efficiency (ADE).

        `len_iterable` is not used, but contained to keep a consistent signature
        with `calc_rde`.
        """
        if n_workers == 1:
            return 1

        potential = (
            ((n_chunks // n_workers + (n_chunks % n_workers > 1)) * chunksize)
            + (n_chunks % n_workers == 1) * last_chunk
        ) * n_workers

        n_full_chunks = n_chunks - (chunksize > last_chunk)
        taskels_in_regular_chunks = n_full_chunks * chunksize
        real = taskels_in_regular_chunks + (chunksize > last_chunk) * last_chunk
        ade = real / potential

        return ade
If there is no Idling Share, Busy Share will be equal to Parallel Schedule, hence we get an ADE of 100%. In our simplified model, this is a scenario where all available processes will be busy through the whole time needed for processing all tasks. In other words, the whole job gets effectively parallelized to 100 percent.
But why do I keep referring to DE as absolute DE here?
To comprehend that, we have to consider a possible case for the chunksize (cs) which ensures maximal scheduling flexibility (also, the number of Highlanders there can be. Coincidence?):
___________________________________~ ONE ~___________________________________
If we, for example, have four worker-processes and 37 taskels, there will be idling workers even with chunksize=1, just because n_workers=4 is not a divisor of 37. The remainder of dividing 37 / 4 is 1. This single remaining taskel will have to be processed by a sole worker, while the remaining three are idling.
Likewise, there will still be one idling worker with 39 taskels, as you can see pictured below.
When you compare the upper Parallel Schedule for chunksize=1 with the below version for chunksize=3, you will notice that the upper Parallel Schedule is smaller, the timeline on the x-axis shorter. It should become obvious now, how bigger chunksizes unexpectedly also can lead to increased overall computation times, even for Dense Scenarios.
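The numbers behind these examples can be reproduced with a small stand-alone sketch. dense_ade is a hypothetical helper that derives ADE from a simulated schedule instead of the closed-form code above; it assumes the overhead-free Dense Scenario with one time unit per taskel.

```python
def dense_ade(n_workers, len_iterable, chunksize):
    """ADE sketch for the Dense Scenario: busy time units / all time units."""
    finish = [0] * n_workers                       # busy time units per worker
    for start in range(0, len_iterable, chunksize):
        size = min(chunksize, len_iterable - start)
        finish[finish.index(min(finish))] += size  # next free worker
    return len_iterable / (n_workers * max(finish))


print(dense_ade(4, 37, chunksize=1))  # 0.925
print(dense_ade(4, 39, chunksize=1))  # 0.975
print(dense_ade(4, 39, chunksize=3))  # 0.8125
```

For 39 taskels the schedule is 10 time units long with chunksize=1 and 12 time units long with chunksize=3, which is where the drop from 0.975 to 0.8125 comes from.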
But why not just use the length of the x-axis for efficiency calculations?
Because the overhead is not contained in this model. It will be different for both chunksizes, hence the x-axis is not directly comparable. The overhead can still lead to a longer total computation time, as shown in case 2 from the figure below.
6.3.2 Relative Distribution Efficiency (RDE)
The ADE value does not tell us whether a better distribution of taskels is possible with chunksize set to 1. Better here still means a smaller Idling Share.
To get a DE value adjusted for the maximum possible DE, we have to divide the considered ADE by the ADE we get for chunksize=1.
Relative Distribution Efficiency (RDE) = ADE_cs_x / ADE_cs_1
Here is how this looks in code:
    # mp_utils.py

    def calc_rde(n_workers, len_iterable, n_chunks, chunksize, last_chunk):
        """Calculate Relative Distribution Efficiency (RDE)."""
        ade_cs1 = calc_ade(
            n_workers, len_iterable, n_chunks=len_iterable,
            chunksize=1, last_chunk=1
        )
        ade = calc_ade(n_workers, len_iterable, n_chunks, chunksize, last_chunk)
        rde = ade / ade_cs1

        return rde
RDE, as defined here, is in essence a tale about the tail of a Parallel Schedule. RDE is influenced by the maximum effective chunksize contained in the tail. (This tail can be of x-axis length chunksize or last_chunk.)
This has the consequence that RDE naturally converges to 100% (even) for all sorts of "tail-looks" as shown in the figure below.
A low RDE ...
is a strong hint for optimization potential.
naturally gets less likely for longer iterables, because the relative tail-portion of the overall Parallel Schedule shrinks.
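How the worst-case RDE improves with longer iterables can be illustrated with a rough sketch. All helper names are hypothetical, the model is the overhead-free Dense Scenario, and the three length windows are arbitrary choices for n_workers=4.

```python
def pool_chunksize(n_workers, len_iterable, factor=4):
    """Chunksize as computed by the divmod-based heuristic."""
    chunksize, extra = divmod(len_iterable, n_workers * factor)
    return chunksize + 1 if extra else chunksize


def dense_makespan(n_workers, len_iterable, chunksize):
    """Schedule length in taskel time units, overhead ignored."""
    finish = [0] * n_workers
    for start in range(0, len_iterable, chunksize):
        size = min(chunksize, len_iterable - start)
        finish[finish.index(min(finish))] += size   # next free worker
    return max(finish)


def dense_rde(n_workers, len_iterable, chunksize):
    """RDE sketch: ADE for `chunksize` divided by ADE for chunksize=1."""
    # Both ADEs share the same busy share, so their ratio reduces to the
    # inverse ratio of the schedule lengths.
    shortest = -(-len_iterable // n_workers)        # length for chunksize=1
    return shortest / dense_makespan(n_workers, len_iterable, chunksize)


n_workers = 4
worst = []
for lengths in (range(16, 160), range(160, 1600), range(1600, 16000)):
    worst.append(min(dense_rde(n_workers, n, pool_chunksize(n_workers, n))
                     for n in lengths))
    print(lengths.start, lengths.stop - 1, round(worst[-1], 4))
```

The printed minimum per window moves closer to 1 as the lengths grow, which matches the shrinking relative weight of the tail described above.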
Find Part II of this answer below.