Source: 23.1 Reduce-sum | Stan User's Guide
Reduce-sum is one of Stan's mechanisms for parallelizing computation.
Advantages (compared to the rectangular map):
1. More flexible argument interface, avoiding the packing and unpacking that is necessary with rectangular map.
2. Partitions data for parallelization automatically (this is done manually in the rectangular map).
3. Is easier to use.
In probabilistic modeling it is often necessary to compute the sum of many independent function evaluations: for example, given a function g that maps each input to a real value and inputs x_1, ..., x_N, we need to compute g(x_1) + g(x_2) + ... + g(x_N).
reduce_sum operates on a partial sum function f, which computes the sum only over the slice of x passed to it (the decomposition is sketched right after the list below).
The partitioning of x is not under the user's control. Stan provides two variants, which differ in how the partitioning is done:
1. reduce_sum: Automatically choose partial sums partitioning based on a dynamic scheduling algorithm.
2. reduce_sum_static: Compute the same sum as reduce_sum, but partition the input in the same way for given data set (in reduce_sum this partitioning might change depending on computer load).
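As a rough sketch (the slice boundaries e_1 < e_2 < ... < e_{K-1} and this notation are assumptions, not taken from the guide), both variants evaluate the total as a sum of partial sums over consecutive slices of x:

  sum_{i=1}^{N} g(x_i) = f(x[1:e_1]) + f(x[e_1+1:e_2]) + ... + f(x[e_{K-1}+1:N]),
  where f(x[a:b]) = g(x_a) + g(x_{a+1}) + ... + g(x_b).

The two variants differ only in how the slice boundaries are chosen.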
Function signature
reduce_sum and reduce_sum_static are used to evaluate the sum in parallel.
real reduce_sum(F f, T[] x, int grainsize, T1 s1, T2 s2, ...)
real reduce_sum_static(F f, T[] x, int grainsize, T1 s1, T2 s2, ...)
- f is the user-defined partial sum function;
+ real f(T[] x_slice, int start, int end, T1 s1, T2 s2, ...), where start and end are the indices in x of the first and last elements of the current slice
- x is the input array, which will be partitioned into slices;
- grainsize: the size of the partial sums
+ For reduce_sum, grainsize is a suggested partial sum size. A grainsize of 1 leaves the partitioning entirely up to the scheduler; this is the usual default choice.
+ For reduce_sum_static, grainsize specifies the maximal partial sum size; it must be chosen carefully.
- s1, s2, ...: the additional arguments that f needs; these are passed to every partial sum call in full (they are not sliced)
Call it in the model block: real sum = reduce_sum(f, x, grainsize, s1, s2, ...);
Example: Logistic regression
functions {
  real partial_sum(int[] y_slice,
                   int start, int end,
                   vector x,
                   vector beta) {
    // Note: y_slice is the sliced portion of the input, while x and beta are passed in full
    return bernoulli_logit_lpmf(y_slice | beta[1] + beta[2] * x[start:end]);
  }
}
data {
  int N;
  int y[N];
  vector[N] x;
}
parameters {
  vector[2] beta;
}
model {
  int grainsize = 1;
  beta ~ std_normal();
  target += reduce_sum(partial_sum, y,
                       grainsize,
                       x, beta);
}
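For comparison, the reduce_sum call above increments target by the same log-likelihood (up to floating-point summation order) as this serial model block, which uses no parallelism:

model {
  beta ~ std_normal();
  target += bernoulli_logit_lpmf(y | beta[1] + beta[2] * x);
}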
Choosing the grainsize
The choice of grainsize balances the overhead of creating many small tasks against creating fewer, larger tasks, which limits the potential parallelism.
In order to figure out an optimal grain size, if there are N terms and M cores, run a quick test model with grainsize set roughly to N / M. Record the time, cut the grainsize in half, and run the test again. Repeat this iteratively until the model runtime begins to increase. This is a suitable grainsize for the model because this ensures the calculations can be carried out with the most parallelism without losing too much efficiency.
It is important to repeat this process until performance gets worse.
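One convenient way to run this tuning loop, sketched below under the assumption that grainsize is moved out of the model block (this is not part of the original example), is to declare it as data so that different values can be tried without recompiling the model:

data {
  int N;
  int y[N];
  vector[N] x;
  int<lower=1> grainsize;  // start near N / (number of cores), then halve it between runs
}
model {
  beta ~ std_normal();
  target += reduce_sum(partial_sum, y,
                       grainsize,
                       x, beta);
}

The functions and parameters blocks stay exactly as in the example above.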