Trino Source Code Reading: the MultilevelSplitQueue Scheduling Mechanism

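Trino's query engine parallelizes execution in two ways:

Multi-machine parallelism: the logical operator tree of a query is split into multiple Fragments according to certain rules, and the Fragments are distributed to different machines to run.

Single-machine parallelism: each Fragment is further split into multiple pipelines, which are executed concurrently by multiple threads.

This article tries to describe how Trino schedules pipeline tasks in the single-machine case.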

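Overview of the basic flow

The figure below shows the rough flow Trino uses to schedule query tasks after they have been split. The basic unit of scheduling can be thought of as a PrioritizedSplitRunner.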

[Figure 1: overall scheduling flow]


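The main classes involved (mostly in package io.trino.execution.executor) are:

SqlTaskExecution: the entry class; its constructor takes a LocalExecutionPlan.

TaskExecutor: owns the worker threads and maintains a MultilevelSplitQueue, the queue of PrioritizedSplitRunners waiting to be scheduled.

TaskHandle: ties together the PrioritizedSplitRunners that come from the same SqlTaskExecution. When a single PrioritizedSplitRunner is scheduled, its priority affects the other PrioritizedSplitRunners related to it, and is affected by them in turn.

MultilevelSplitQueue: contains a multi-level priority queue and tracks the scheduled time and the priority score of each level.

PrioritizedSplitRunner: to get time-slice scheduling similar to that of an operating system, Trino abstracts a SplitRunner interface. Each scheduling step runs the interface's processFor method for a single time slice, and then a suitable SplitRunner is chosen again for the next slice.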

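Details

TaskExecutor

Corresponds to DriverScheduler in IoTDB.

It owns the worker threads and maintains a MultilevelSplitQueue, the queue of PrioritizedSplitRunners waiting to be scheduled.

To submit its work to the TaskExecutor, a SqlTaskExecution needs to:

Create a TaskHandle and register it with the TaskExecutor.

Submit the corresponding SplitRunners. The TaskExecutor wraps them in PrioritizedSplitRunners and adds them to the MultilevelSplitQueue, where they wait to be scheduled. (In fact only intermediateSplits are added to the queue immediately; for this article it is enough to consider intermediateSplits.)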

Core member variables

The core member variables are:

@GuardedBy("this")
private final List<TaskHandle> tasks;

/**
 * All splits registered with the task executor.
 */
@GuardedBy("this")
private final Set<PrioritizedSplitRunner> allSplits = new HashSet<>();

/**
 * Intermediate splits (i.e. splits that should not be queued).
 */
@GuardedBy("this")
private final Set<PrioritizedSplitRunner> intermediateSplits = new HashSet<>();

/**
 * Splits waiting for a runner thread.
 */
private final MultilevelSplitQueue waitingSplits;

/**
 * Splits running on a thread.
 */
private final Set<PrioritizedSplitRunner> runningSplits = newConcurrentHashSet();

/**
 * Splits blocked by the driver.
 */
private final Map<PrioritizedSplitRunner, Future<Void>> blockedSplits = new ConcurrentHashMap<>();

Core flow of a worker thread

See the figure below:

[Figure 2: core flow of a worker thread]


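Main flow:

Try to take a PrioritizedSplitRunner from waitingSplits (the MultilevelSplitQueue): waitingSplits.take()

Call PrioritizedSplitRunner#process, which runs one time slice internally.

If the split has finished, perform the corresponding finish handling.

Otherwise:

If the split is not blocked, push it straight back into waitingSplits (the MultilevelSplitQueue) to wait for its next turn. The priority information was updated while the split was being processed, so the split may land in a different level when it is pushed back.

If the split is blocked, register a callback and put the split back into the MultilevelSplitQueue once it is unblocked.

Source code of the run method: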

@Override
public void run()
{
    try (SetThreadName runnerName = new SetThreadName("SplitRunner-%s", runnerId)) {
        while (!closed && !Thread.currentThread().isInterrupted()) {
            // select next worker
            PrioritizedSplitRunner split;
            try {
                split = waitingSplits.take();
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }

            String threadId = split.getTaskHandle().getTaskId() + "-" + split.getSplitId();
            try (SetThreadName splitName = new SetThreadName(threadId)) {
                RunningSplitInfo splitInfo = new RunningSplitInfo(ticker.read(), threadId, Thread.currentThread(), split);
                runningSplitInfos.add(splitInfo);
                runningSplits.add(split);

                ListenableFuture<Void> blocked;
                try {
                    blocked = split.process();
                }
                finally {
                    runningSplitInfos.remove(splitInfo);
                    runningSplits.remove(split);
                }

                if (split.isFinished()) {
                    log.debug("%s is finished", split.getInfo());
                    splitFinished(split);
                }
                else {
                    if (blocked.isDone()) {
                        waitingSplits.offer(split);
                    }
                    else {
                        blockedSplits.put(split, blocked);
                        blocked.addListener(() -> {
                            blockedSplits.remove(split);
                            // reset the level priority to prevent previously-blocked splits from starving existing splits
                            split.resetLevelPriority();
                            waitingSplits.offer(split);
                        }, executor);
                    }
                }
            }
            catch (Throwable t) {
                // ignore random errors due to driver thread interruption
                if (!split.isDestroyed()) {
                    if (t instanceof TrinoException) {
                        TrinoException e = (TrinoException) t;
                        log.error(t, "Error processing %s: %s: %s", split.getInfo(), e.getErrorCode().getName(), e.getMessage());
                    }
                    else {
                        log.error(t, "Error processing %s", split.getInfo());
                    }
                }
                splitFinished(split);
            }
        }
    }
    finally {
        // unless we have been closed, we need to replace this thread
        if (!closed) {
            addRunnerThread();
        }
    }
}

TaskHandle

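When a SqlTaskExecution registers with the TaskExecutor, a TaskHandle is created. This TaskHandle is associated with all SplitRunners owned by that SqlTaskExecution.

The TaskHandle also holds a Priority object, which can be understood as a cache shared by the PrioritizedSplitRunners associated with it.

How the TaskHandle makes scheduling take all of its associated PrioritizedSplitRunners into account:

In PrioritizedSplitRunner#process, updating the Priority calls TaskHandle#addScheduledNanos.

TaskHandle#addScheduledNanos updates the Priority of the TaskHandle.

When a PrioritizedSplitRunner is taken out in MultilevelSplitQueue#take, it first checks whether its own Priority matches the Priority of its TaskHandle. If they differ, the accumulated running time of the TaskHandle has already moved it to another level, so all PrioritizedSplitRunners associated with it should move as well. In that case the dequeued PrioritizedSplitRunner is put back into the queue and another PrioritizedSplitRunner is selected.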
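The check described above can be illustrated with a minimal sketch. The classes Priority, Task and Split below are simplified stand-ins invented for this example, not Trino's real classes, and the rule "return true only when the task has reached a higher level" is an assumption about how the comparison is done.

```java
import java.util.concurrent.atomic.AtomicReference;

final class SharedPrioritySketch
{
    // Simplified stand-in for a priority: a level plus a score inside that level.
    static final class Priority
    {
        final int level;
        final long levelPriority;

        Priority(int level, long levelPriority)
        {
            this.level = level;
            this.levelPriority = levelPriority;
        }
    }

    // Stand-in for a task handle: one priority shared by every split of the task.
    static final class Task
    {
        final AtomicReference<Priority> priority = new AtomicReference<>(new Priority(0, 0));
    }

    // Stand-in for a prioritized split runner: keeps its own copy of the priority.
    static final class Split
    {
        final Task task;
        final AtomicReference<Priority> priority = new AtomicReference<>(new Priority(0, 0));

        Split(Task task)
        {
            this.task = task;
        }

        // Copies the task priority into the split and reports whether the level went up (assumed rule).
        boolean updateLevelPriority()
        {
            Priority newPriority = task.priority.get();
            Priority oldPriority = priority.getAndSet(newPriority);
            return newPriority.level > oldPriority.level;
        }
    }

    public static void main(String[] args)
    {
        Task task = new Task();
        Split split = new Split(task);
        // Another split of the same task ran long enough to push the task to level 1.
        task.priority.set(new Priority(1, 0));
        System.out.println("must be re-queued: " + split.updateLevelPriority());
    }
}
```

In this sketch a second call to updateLevelPriority returns false, because the split has already copied the task's priority. That is why a re-queued split can be returned normally the next time it is polled.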

PrioritizedSplitRunner

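The scheduling unit of MultilevelSplitQueue. It is associated with one TaskHandle and maintains a Priority object used to compute its scheduling priority.

The core method is process. We only need to focus on:

Line 12 of the listing below, where split.processFor(SPLIT_RUN_QUANTA) consumes one time slice.

Line 18, where priority.set(taskHandle.addScheduledNanos(quantaScheduledNanos)) updates the priority from the scheduled time after processFor has returned.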

public ListenableFuture<Void> process()
{
    try {
        long startNanos = ticker.read();
        start.compareAndSet(0, startNanos);
        lastReady.compareAndSet(0, startNanos);
        processCalls.incrementAndGet();

        waitNanos.getAndAdd(startNanos - lastReady.get());

        CpuTimer timer = new CpuTimer();
        ListenableFuture<Void> blocked = split.processFor(SPLIT_RUN_QUANTA);
        CpuTimer.CpuDuration elapsed = timer.elapsedTime();

        long quantaScheduledNanos = ticker.read() - startNanos;
        scheduledNanos.addAndGet(quantaScheduledNanos);

        priority.set(taskHandle.addScheduledNanos(quantaScheduledNanos));
        lastRun.set(ticker.read());

        if (blocked == NOT_BLOCKED) {
            unblockedQuantaWallTime.add(elapsed.getWall());
        }
        else {
            blockedQuantaWallTime.add(elapsed.getWall());
        }

        long quantaCpuNanos = elapsed.getCpu().roundTo(NANOSECONDS);
        cpuTimeNanos.addAndGet(quantaCpuNanos);

        globalCpuTimeMicros.update(quantaCpuNanos / 1000);
        globalScheduledTimeMicros.update(quantaScheduledNanos / 1000);

        return blocked;
    }
    catch (Throwable e) {
        finishedFuture.setException(e);
        throw e;
    }
}

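MultilevelSplitQueue

Core member variables

levelWaitingSplits: an array of 5 priority queues, one per level. On each take, a level is chosen by comparing the scheduled time each level has already consumed, and the highest-priority PrioritizedSplitRunner (the one with the smallest priority value) is taken from that level.

levelScheduledTime: an array of 5 accumulators holding the scheduled time already consumed by each level. After a PrioritizedSplitRunner finishes a process call, the scheduled time of its level is increased. This data is mainly used when polling, to decide which level's queue to take from.

levelTimeMultiplier: controls how CPU time is divided between levels. The tests use 2, which gives levels 0 to 4 a time ratio of 16:8:4:2:1.

levelMinPriority: an array of 5 priority scores giving the current minimum priority score of each level. The entry for a level is updated every time a split is successfully taken from that level.

LEVEL_THRESHOLD_SECONDS: the threshold of each level. For example, LEVEL_THRESHOLD_SECONDS[1] = 1s means that a task whose accumulated running time exceeds 1s is placed in level 1.

LEVEL_CONTRIBUTION_CAP: a protective value, the upper bound on the time charged for a single process call, so that a task is not charged too much because of some exceptional situation.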

static final int[] LEVEL_THRESHOLD_SECONDS = {0, 1, 10, 60, 300};
static final long LEVEL_CONTRIBUTION_CAP = SECONDS.toNanos(30);

@GuardedBy("lock")
private final PriorityQueue<PrioritizedSplitRunner>[] levelWaitingSplits;

private final AtomicLong[] levelScheduledTime;

private final double levelTimeMultiplier;

private final AtomicLong[] levelMinPriority;
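To make the role of LEVEL_THRESHOLD_SECONDS concrete, here is a minimal sketch of how accumulated scheduled time could be mapped to a level. The class name LevelSketch and the body of computeLevel are assumptions written for this article; only the threshold array is taken from the listing above, and the real implementation may differ in detail.

```java
import java.util.concurrent.TimeUnit;

final class LevelSketch
{
    // Thresholds copied from the listing above: level i starts at LEVEL_THRESHOLD_SECONDS[i] seconds.
    static final int[] LEVEL_THRESHOLD_SECONDS = {0, 1, 10, 60, 300};

    // Hypothetical helper: returns the highest level whose threshold has been reached.
    static int computeLevel(long scheduledNanos)
    {
        long scheduledSeconds = TimeUnit.NANOSECONDS.toSeconds(scheduledNanos);
        for (int level = 0; level < LEVEL_THRESHOLD_SECONDS.length - 1; level++) {
            if (scheduledSeconds < LEVEL_THRESHOLD_SECONDS[level + 1]) {
                return level;
            }
        }
        return LEVEL_THRESHOLD_SECONDS.length - 1;
    }

    public static void main(String[] args)
    {
        long[] sampleMillis = {0, 999, 1_000, 59_999, 60_000, 400_000};
        for (long millis : sampleMillis) {
            int level = computeLevel(TimeUnit.MILLISECONDS.toNanos(millis));
            System.out.println(millis + " ms -> level " + level);
        }
    }
}
```

Under these assumptions a task stays in level 0 for its first second of scheduled time, in level 1 until 10 seconds, and so on, and it can never move past level 4.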

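Core flow

offer

Adds a PrioritizedSplitRunner to the priority queue of the matching level. A newly initialized PrioritizedSplitRunner is in level 0 by default.

When the target level is empty, offer sets that level's scheduled time to its expected scheduled time. The likely reason Trino does this is:

Without the adjustment, an empty level can fall far behind. When a PrioritizedSplitRunner later does arrive at that level, the level would have a very high priority (levels are ranked by the time they have already been given), and the other levels could go unscheduled for a long time.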

/**
 * During periods of time when a level has no waiting splits, it will not accumulate
 * scheduled time and will fall behind relative to other levels.
 * <p>
 * This can cause temporary starvation for other levels when splits do reach the
 * previously-empty level.
 * <p>
 * To prevent this we set the scheduled time for levels which were empty to the expected
 * scheduled time.
 */
public void offer(PrioritizedSplitRunner split)
{
    checkArgument(split != null, "split is null");

    split.setReady();
    int level = split.getPriority().getLevel();
    lock.lock();
    try {
        if (levelWaitingSplits[level].isEmpty()) {
            // Accesses to levelScheduledTime are not synchronized, so we have a data race
            // here - our level time math will be off. However, the staleness is bounded by
            // the fact that only running splits that complete during this computation
            // can update the level time. Therefore, this is benign.
            long level0Time = getLevel0TargetTime();
            long levelExpectedTime = (long) (level0Time / Math.pow(levelTimeMultiplier, level));
            long delta = levelExpectedTime - levelScheduledTime[level].get();
            levelScheduledTime[level].addAndGet(delta);
        }

        levelWaitingSplits[level].offer(split);
        notEmpty.signal();
    }
    finally {
        lock.unlock();
    }
}
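The catch-up arithmetic inside offer can be tried out in isolation. The sketch below is illustrative only: the class EmptyLevelCatchUp, the helper expectedLevelTime and the sample numbers are made up for this example, while the formula is the one used in the listing above.

```java
final class EmptyLevelCatchUp
{
    // Same formula as in offer(): the level-0 target time scaled down by multiplier^level.
    static long expectedLevelTime(long level0TargetTime, double levelTimeMultiplier, int level)
    {
        return (long) (level0TargetTime / Math.pow(levelTimeMultiplier, level));
    }

    public static void main(String[] args)
    {
        long level0TargetMillis = 16_000; // hypothetical level-0 target time
        long level3ActualMillis = 500;    // hypothetical time level 3 accumulated before it became empty

        long expected = expectedLevelTime(level0TargetMillis, 2.0, 3);
        long delta = expected - level3ActualMillis;

        // With a multiplier of 2, level 3 is expected to have 16000 / 2^3 = 2000 ms.
        System.out.println("expected=" + expected + " ms, delta=" + delta + " ms");
    }
}
```

After delta is added, level 3 looks as if it had already received its fair share, so a split arriving there does not immediately outrank every other level.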

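take

The pollSplit() method picks a suitable PrioritizedSplitRunner.

If result.updateLevelPriority() returns true, the priority of the TaskHandle behind this PrioritizedSplitRunner differs from the priority of the PrioritizedSplitRunner itself. Trino considers that all PrioritizedSplitRunners associated with a TaskHandle should change level together, so the PrioritizedSplitRunner is put back into the queue to wait for its next turn.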

public PrioritizedSplitRunner take()
        throws InterruptedException
{
    while (true) {
        lock.lockInterruptibly();
        try {
            PrioritizedSplitRunner result;
            while ((result = pollSplit()) == null) {
                notEmpty.await();
            }

            if (result.updateLevelPriority()) {
                offer(result);
                continue;
            }

            int selectedLevel = result.getPriority().getLevel();
            levelMinPriority[selectedLevel].set(result.getPriority().getLevelPriority());
            selectedLevelCounters[selectedLevel].update(1);

            return result;
        }
        finally {
            lock.unlock();
        }
    }
}

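pollSplit

How the level whose priority queue is polled gets chosen:

The number of levels is fixed at 5, and Trino assumes a fixed expected distribution of CPU time between levels. The implementation uses powers of levelTimeMultiplier to define it. With levelTimeMultiplier = 2, for example, the expected CPU time ratio of levels 0 to 4 is 16:8:4:2:1.

A levelScheduledTime array records the time each level has already been scheduled. The selection logic is simple:

First compute the baseline time of level 0. Note that this baseline is computed by getLevel0TargetTime() and is not a constant.

From the per-level time ratio given by levelTimeMultiplier we then know the targetScheduledTime of every level.

Ratio = targetScheduledTime / the actual scheduled time levelScheduledTime[level].get(). The level with the largest ratio is the one that deviates most from the expected proportion, and we want to give it more time, so that level is selected.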

/**
 * Trino attempts to give each level a target amount of scheduled time, which is configurable
 * using levelTimeMultiplier.
 * <p>
 * This function selects the level that has the lowest ratio of actual to the target time
 * with the objective of minimizing deviation from the target scheduled time. From this level,
 * we pick the split with the lowest priority.
 */
@GuardedBy("lock")
private PrioritizedSplitRunner pollSplit()
{
    long targetScheduledTime = getLevel0TargetTime();
    double worstRatio = 1;
    int selectedLevel = -1;
    for (int level = 0; level < LEVEL_THRESHOLD_SECONDS.length; level++) {
        if (!levelWaitingSplits[level].isEmpty()) {
            long levelTime = levelScheduledTime[level].get();
            double ratio = levelTime == 0 ? 0 : targetScheduledTime / (1.0 * levelTime);
            if (selectedLevel == -1 || ratio > worstRatio) {
                worstRatio = ratio;
                selectedLevel = level;
            }
        }

        targetScheduledTime /= levelTimeMultiplier;
    }

    if (selectedLevel == -1) {
        return null;
    }

    PrioritizedSplitRunner result = levelWaitingSplits[selectedLevel].poll();
    checkState(result != null, "pollSplit cannot return null");

    return result;
}
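A small worked example may help to see which level wins. The sketch below pulls the selection loop out of pollSplit and runs it on plain arrays. The class LevelSelectionSketch, the boolean array standing in for "this level has waiting splits" and the sample numbers are assumptions made for illustration.

```java
final class LevelSelectionSketch
{
    // Returns the non-empty level with the largest target/actual ratio, or -1 if every level is empty.
    static int selectLevel(long[] levelScheduledTime, boolean[] hasWaitingSplits, long level0TargetTime, double levelTimeMultiplier)
    {
        long targetScheduledTime = level0TargetTime;
        double worstRatio = 1;
        int selectedLevel = -1;
        for (int level = 0; level < levelScheduledTime.length; level++) {
            if (hasWaitingSplits[level]) {
                long levelTime = levelScheduledTime[level];
                double ratio = levelTime == 0 ? 0 : targetScheduledTime / (1.0 * levelTime);
                if (selectedLevel == -1 || ratio > worstRatio) {
                    worstRatio = ratio;
                    selectedLevel = level;
                }
            }
            // The target of the next level is the current target divided by the multiplier.
            targetScheduledTime /= levelTimeMultiplier;
        }
        return selectedLevel;
    }

    public static void main(String[] args)
    {
        // Hypothetical scheduled times in ms; level 2 has only half of its 4000 ms target.
        long[] scheduled = {16_000, 8_000, 2_000, 2_000, 1_000};
        boolean[] waiting = {true, true, true, true, true};
        System.out.println("selected level: " + selectLevel(scheduled, waiting, 16_000, 2.0)); // prints 2
    }
}
```

With these numbers every level has a ratio of 1 except level 2, whose ratio is 4000 / 2000 = 2, so level 2 is selected. If level 2 had no waiting splits, the loop would fall back to level 0, the first non-empty level, because no later ratio exceeds 1.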

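getLevel0TargetTime

pollSplit uses this baseline time. Trino scales the scheduled time of every level to its level-0 equivalent using the level ratio, and takes the largest of those values.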

@GuardedBy("lock")
private long getLevel0TargetTime()
{
    long level0TargetTime = levelScheduledTime[0].get();
    double currentMultiplier = levelTimeMultiplier;

    for (int level = 0; level < LEVEL_THRESHOLD_SECONDS.length; level++) {
        currentMultiplier /= levelTimeMultiplier;
        long levelTime = levelScheduledTime[level].get();
        level0TargetTime = Math.max(level0TargetTime, (long) (levelTime / currentMultiplier));
    }

    return level0TargetTime;
}
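To see what "largest value after scaling" means in numbers, the following sketch repeats the loop above on a plain array. The class Level0TargetSketch and the sample values are hypothetical; the arithmetic follows the listing.

```java
final class Level0TargetSketch
{
    // Scales each level's scheduled time to its level-0 equivalent and returns the maximum.
    static long level0TargetTime(long[] levelScheduledTime, double levelTimeMultiplier)
    {
        long target = levelScheduledTime[0];
        double currentMultiplier = levelTimeMultiplier;
        for (int level = 0; level < levelScheduledTime.length; level++) {
            // After this division currentMultiplier equals 1 / multiplier^level.
            currentMultiplier /= levelTimeMultiplier;
            target = Math.max(target, (long) (levelScheduledTime[level] / currentMultiplier));
        }
        return target;
    }

    public static void main(String[] args)
    {
        // Hypothetical scheduled times in ms for levels 0 to 4.
        long[] scheduled = {10_000, 8_000, 2_000, 2_000, 1_000};
        // Level-0 equivalents with a multiplier of 2: 10000, 16000, 8000, 16000, 16000.
        System.out.println("level 0 target: " + level0TargetTime(scheduled, 2.0)); // prints 16000
    }
}
```

In this example level 0 has only run for 10000 ms, but levels 1, 3 and 4 have each consumed the equivalent of 16000 ms of level-0 time, so the baseline becomes 16000 ms and level 0 is the level that is behind.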

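updatePriority

After each processFor call, PrioritizedSplitRunner#process updates its own Priority and the Priority of its TaskHandle, which ends up calling this updatePriority method.

The main logic is:

Determine whether the accumulated scheduled time of the TaskHandle moves it to a higher level.

Update the scheduled time of each level, one level at a time.

Return the updated Priority.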

/**
 * Trino 'charges' the quanta run time to the task and the level it belongs to in
 * an effort to maintain the target thread utilization ratios between levels and to
 * maintain fairness within a level.
 * <p>
 * Consider an example split where a read hung for several minutes. This is either a bug
 * or a failing dependency. In either case we do not want to charge the task too much,
 * and we especially do not want to charge the level too much - i.e. cause other queries
 * in this level to starve.
 *
 * @return the new priority for the task
 */
public Priority updatePriority(Priority oldPriority, long quantaNanos, long scheduledNanos)
{
    int oldLevel = oldPriority.getLevel();
    int newLevel = computeLevel(scheduledNanos);

    long levelContribution = Math.min(quantaNanos, LEVEL_CONTRIBUTION_CAP);

    if (oldLevel == newLevel) {
        addLevelTime(oldLevel, levelContribution);
        return new Priority(oldLevel, oldPriority.getLevelPriority() + quantaNanos);
    }

    long remainingLevelContribution = levelContribution;
    long remainingTaskTime = quantaNanos;

    // a task normally slowly accrues scheduled time in a level and then moves to the next, but
    // if the split had a particularly long quanta, accrue time to each level as if it had run
    // in that level up to the level limit.
    for (int currentLevel = oldLevel; currentLevel < newLevel; currentLevel++) {
        long timeAccruedToLevel = Math.min(SECONDS.toNanos(LEVEL_THRESHOLD_SECONDS[currentLevel + 1] - LEVEL_THRESHOLD_SECONDS[currentLevel]), remainingLevelContribution);
        addLevelTime(currentLevel, timeAccruedToLevel);
        remainingLevelContribution -= timeAccruedToLevel;
        remainingTaskTime -= timeAccruedToLevel;
    }

    addLevelTime(newLevel, remainingLevelContribution);
    long newLevelMinPriority = getLevelMinPriority(newLevel, scheduledNanos);
    return new Priority(newLevel, newLevelMinPriority + remainingTaskTime);
}
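The level-by-level charging is easier to follow with numbers. The sketch below isolates only the part of updatePriority that decides how much time each level is charged. The class LevelChargeSketch and the method chargePerLevel are made up for this example, and the returned array stands in for the addLevelTime calls.

```java
import java.util.concurrent.TimeUnit;

final class LevelChargeSketch
{
    static final int[] LEVEL_THRESHOLD_SECONDS = {0, 1, 10, 60, 300};
    static final long LEVEL_CONTRIBUTION_CAP = TimeUnit.SECONDS.toNanos(30);

    // Returns the nanoseconds charged to each level for one quantum that moves a task from oldLevel to newLevel.
    static long[] chargePerLevel(int oldLevel, int newLevel, long quantaNanos)
    {
        long[] charged = new long[LEVEL_THRESHOLD_SECONDS.length];
        // At most LEVEL_CONTRIBUTION_CAP is charged to levels, however long the quantum was.
        long remaining = Math.min(quantaNanos, LEVEL_CONTRIBUTION_CAP);
        for (int level = oldLevel; level < newLevel; level++) {
            long levelSpan = TimeUnit.SECONDS.toNanos(LEVEL_THRESHOLD_SECONDS[level + 1] - LEVEL_THRESHOLD_SECONDS[level]);
            long accrued = Math.min(levelSpan, remaining);
            charged[level] += accrued;
            remaining -= accrued;
        }
        // Whatever is left goes to the level the task ends up in.
        charged[newLevel] += remaining;
        return charged;
    }

    public static void main(String[] args)
    {
        // A fresh task whose first quantum took 12 seconds ends up in level 2.
        long[] charged = chargePerLevel(0, 2, TimeUnit.SECONDS.toNanos(12));
        for (int level = 0; level < charged.length; level++) {
            System.out.println("level " + level + ": " + TimeUnit.NANOSECONDS.toSeconds(charged[level]) + " s");
        }
    }
}
```

For the 12 second quantum, level 0 is charged 1 s, level 1 is charged 9 s and level 2 the remaining 2 s. For a quantum that hung for 300 s, only the capped 30 s is spread out (1 s, 9 s and 20 s over levels 0 to 2), so levels 3 and 4 are charged nothing.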

References

Presto MultilevelSplitQueue discussion (Juejin)

StarRocks DriverQueue:

https://mp.weixin.qq.com/s/eS5tMjE5O5mBnCmzse0kmQ

https://github.com/StarRocks/starrocks/blob/main/be/src/exec/pipeline/pipeline_driver_queue.cpp
