ClickHouse源码笔记2:聚合流程的实现

上篇笔记讲到了聚合函数的实现并且带大家看了聚合函数是如何注册到ClickHouse之中的并被调用使用的。这篇笔记，笔者会续上上篇的内容，将剖析一把ClickHouse聚合流程的整体实现。
第二篇文章，我们来一起看看聚合流程的实现~~ 上车！

1.基础知识的梳理

ClickHouse的实现接口

Block类
前文我们聊到ClickHouse是一个列式存储数据库，在内存之中用IColumn接口来作为数据结构表示数据。 而Block则是这些列的集合，也就是说Block包含了一组列，而无数个Block就构成了我们通常理解的表了。
在ClickHouse进行查询之中，数据的最小处理单位是 Block 。由下面代码可以看到，Block就是由一组列以及列名对应列的偏移map组成的。

class Block
{
private:
    using Container = ColumnsWithTypeAndName;
    using IndexByName = std::map;

    Container data;
    IndexByName index_by_name;

这是一个很重要的类，实现的也并不复杂。Block类作为ClickHouse的核心，后续的工作都是基于Block类展开的。

抽象类IBlockInputStream
由名字可以看出，IBlockInputStream是一个实现接口。
这也同样是一个十分重要的接口，ClickHouse的调用模型就建立在IBlockInputStream接口之上。该接口最为核心的就是方法便是read函数，它返回一个被对应Stream处理过的Block。
想必看到这里应该明白了，ClickHouse就是通过IBlockInputStream实现的火山模型，每一个不同的Stream处理不同的查询逻辑，最后层层迭代，完成最终输出流就是用户需要的结果了。
IBlockInputStream类还有一个孪生兄弟IBlockoutputStream，顾名思义，需要进行写操作的时候就要用到它了。

class IBlockInputStream : public TypePromotion
{
    friend struct BlockStreamProfileInfo;

public:
    IBlockInputStream() { info.parent = this; }
    virtual ~IBlockInputStream() {}

    IBlockInputStream(const IBlockInputStream &) = delete;
    IBlockInputStream & operator=(const IBlockInputStream &) = delete;

    /// To output the data stream transformation tree (query execution plan).
    virtual String getName() const = 0;

    /** Get data structure of the stream in a form of "header" block (it is also called "sample block").
      * Header block contains column names, data types, columns of size 0. Constant columns must have corresponding values.
      * It is guaranteed that method "read" returns blocks of exactly that structure.
      */
    virtual Block getHeader() const = 0;

    virtual const BlockMissingValues & getMissingValues() const
    {
        static const BlockMissingValues none;
        return none;
    }

    /// If this stream generates data in order by some keys, return true.
    virtual bool isSortedOutput() const { return false; }

    /// In case of isSortedOutput, return corresponding SortDescription
    virtual const SortDescription & getSortDescription() const;

    /** Read next block.
      * If there are no more blocks, return an empty block (for which operator `bool` returns false).
      * NOTE: Only one thread can read from one instance of IBlockInputStream simultaneously.
      * This also applies for readPrefix, readSuffix.
      */
    Block read();

AggregatingBlockInputStream类
终于引出我们的主角了，AggregatingBlockInputStream类，作为上面IBlockInputStream的子类，也就是我们今天要重点分析的类。

class AggregatingBlockInputStream : public IBlockInputStream
{
public:
    /** keys are taken from the GROUP BY part of the query
      * Aggregate functions are searched everywhere in the expression.
      * Columns corresponding to keys and arguments of aggregate functions must already be computed.
      */
    AggregatingBlockInputStream(const BlockInputStreamPtr & input, const Aggregator::Params & params_, bool final_)
        : params(params_), aggregator(params), final(final_)
    {
        children.push_back(input);
    }

    String getName() const override { return "Aggregating"; }

    Block getHeader() const override;

protected:
    Block readImpl() override;

    Aggregator::Params params;
    Aggregator aggregator;
    bool final;

    bool executed = false;

    std::vector> temporary_inputs;

     /** From here we will get the completed blocks after the aggregation. */
    std::unique_ptr impl;
};

首先看它的构造方法，参数有：

BlockInputStreamPtr: 这个很好理解，就是它的子流，也就是实际产生数据的流，后续的聚合计算将会在子流返回的结果上展开。
params: 聚合参数，这个参数十分重要。它记录了那些key属于聚合，调用那些聚合参数等核心信息。并且aggregator也就是执行聚合的类，也是通过该参数构造的，它是Aggregator的内部类。
final: 指明该Stream是否是最终结果，还是要继续进行计算。

这里最为核心的就是AggregatingBlockInputStream类通过继承override对应的readImpl()的接口来实现对应的具体逻辑。AggregatingBlockInputStream类还有一个孪生兄弟：ParallelAggregatingBlockInputStream类，通过并行化来进一步加快聚合流程的执行效率。(通过笔者进行的测试，在简单查询聚合查询下，并行化能够提高近一倍的效率～～)

Aggregator::Params类
Aggregator::Params类是Aggregator的内部类。这个类是整个聚合过程之中最重要的类，查询解析优化后生成聚合查询的执行计划。 而对应的执行计划的参数都通过Aggregator::Params类来初始化，比如那些列要进行聚合，选取的聚合算子等等，并传递给对应的Aggregator来实现对应的聚合逻辑。

 struct Params
    {
        /// Data structure of source blocks.
        Block src_header;
        /// Data structure of intermediate blocks before merge.
        Block intermediate_header;

        /// What to count.
        const ColumnNumbers keys;
        const AggregateDescriptions aggregates;
        const size_t keys_size;
        const size_t aggregates_size;

        /// The settings of approximate calculation of GROUP BY.
        const bool overflow_row;    /// Do we need to put into AggregatedDataVariants::without_key aggregates for keys that are not in max_rows_to_group_by.
        const size_t max_rows_to_group_by;
        const OverflowMode group_by_overflow_mode;



        /// Settings to flush temporary data to the filesystem (external aggregation).
        const size_t max_bytes_before_external_group_by;        /// 0 - do not use external aggregation.

        /// Return empty result when aggregating without keys on empty set.
        bool empty_result_for_aggregation_by_empty_set;

        VolumePtr tmp_volume;

        /// Settings is used to determine cache size. No threads are created.
        size_t max_threads;

        const size_t min_free_disk_space;
        Params(
            const Block & src_header_,
            const ColumnNumbers & keys_, const AggregateDescriptions & aggregates_,
            bool overflow_row_, size_t max_rows_to_group_by_, OverflowMode group_by_overflow_mode_,
            size_t group_by_two_level_threshold_, size_t group_by_two_level_threshold_bytes_,
            size_t max_bytes_before_external_group_by_,
            bool empty_result_for_aggregation_by_empty_set_,
            VolumePtr tmp_volume_, size_t max_threads_,
            size_t min_free_disk_space_)
            : src_header(src_header_),
            keys(keys_), aggregates(aggregates_), keys_size(keys.size()), aggregates_size(aggregates.size()),
            overflow_row(overflow_row_), max_rows_to_group_by(max_rows_to_group_by_), group_by_overflow_mode(group_by_overflow_mode_),
            group_by_two_level_threshold(group_by_two_level_threshold_), group_by_two_level_threshold_bytes(group_by_two_level_threshold_bytes_),
            max_bytes_before_external_group_by(max_bytes_before_external_group_by_),
            empty_result_for_aggregation_by_empty_set(empty_result_for_aggregation_by_empty_set_),
            tmp_volume(tmp_volume_), max_threads(max_threads_),
            min_free_disk_space(min_free_disk_space_)
        {
        }

        /// Only parameters that matter during merge.
        Params(const Block & intermediate_header_,
            const ColumnNumbers & keys_, const AggregateDescriptions & aggregates_, bool overflow_row_, size_t max_threads_)
            : Params(Block(), keys_, aggregates_, overflow_row_, 0, OverflowMode::THROW, 0, 0, 0, false, nullptr, max_threads_, 0)
        {
            intermediate_header = intermediate_header_;
        }
    };

Aggregator类
顾名思义，这个是一个实际进行聚合工作展开的类。它最为核心的方法是下面两个函数：
- execute函数：将输入流的stream依照次序进行blcok迭代处理，将聚合的结果写入result之中。
- mergeAndConvertToBlocks函数：将聚合的结果转换为输入流，并通过输入流的read函数将结果继续返回给上一层。
  通过上面两个函数的调用，我们就可以完成被聚合的数据输入-》数据聚合 -》数据输出的流程。具体的细节笔者会在下一章详细的进行剖析。

class Aggregator
{
public:
    Aggregator(const Params & params_);

    /// Aggregate the source. Get the result in the form of one of the data structures.
    void execute(const BlockInputStreamPtr & stream, AggregatedDataVariants & result);

    using AggregateColumns = std::vector;
    using AggregateColumnsData = std::vector;
    using AggregateColumnsConstData = std::vector;
    using AggregateFunctionsPlainPtrs = std::vector;

    /// Process one block. Return false if the processing should be aborted (with group_by_overflow_mode = 'break').
    bool executeOnBlock(const Block & block, AggregatedDataVariants & result,
        ColumnRawPtrs & key_columns, AggregateColumns & aggregate_columns,    /// Passed to not create them anew for each block
        bool & no_more_keys);

    bool executeOnBlock(Columns columns, UInt64 num_rows, AggregatedDataVariants & result,
        ColumnRawPtrs & key_columns, AggregateColumns & aggregate_columns,    /// Passed to not create them anew for each block
        bool & no_more_keys);

    /** Convert the aggregation data structure into a block.
      * If overflow_row = true, then aggregates for rows that are not included in max_rows_to_group_by are put in the first block.
      *
      * If final = false, then ColumnAggregateFunction is created as the aggregation columns with the state of the calculations,
      *  which can then be combined with other states (for distributed query processing).
      * If final = true, then columns with ready values are created as aggregate columns.
      */
    BlocksList convertToBlocks(AggregatedDataVariants & data_variants, bool final, size_t max_threads) const;

    /** Merge several aggregation data structures and output the result as a block stream.
      */
    std::unique_ptr mergeAndConvertToBlocks(ManyAggregatedDataVariants & data_variants, bool final, size_t max_threads) const;
    ManyAggregatedDataVariants prepareVariantsToMerge(ManyAggregatedDataVariants & data_variants) const;

    /** Merge the stream of partially aggregated blocks into one data structure.
      * (Pre-aggregate several blocks that represent the result of independent aggregations from remote servers.)
      */
    void mergeStream(const BlockInputStreamPtr & stream, AggregatedDataVariants & result, size_t max_threads);

    using BucketToBlocks = std::map;
    /// Merge partially aggregated blocks separated to buckets into one data structure.
    void mergeBlocks(BucketToBlocks bucket_to_blocks, AggregatedDataVariants & result, size_t max_threads);

    /// Merge several partially aggregated blocks into one.
    /// Precondition: for all blocks block.info.is_overflows flag must be the same.
    /// (either all blocks are from overflow data or none blocks are).
    /// The resulting block has the same value of is_overflows flag.
    Block mergeBlocks(BlocksList & blocks, bool final);

     std::unique_ptr mergeAndConvertToBlocks(ManyAggregatedDataVariants & data_variants, bool final, size_t max_threads) const;

    using CancellationHook = std::function;

    /** Set a function that checks whether the current task can be aborted.
      */
    void setCancellationHook(const CancellationHook cancellation_hook);

    /// Get data structure of the result.
    Block getHeader(bool final) const;

2.聚合流程的实现

这里我们就从上文提到的Aggregator::execute(const BlockInputStreamPtr & stream, AggregatedDataVariants & result)函数作为起点来梳理一下ClickHouse的聚合实现：

void Aggregator::execute(const BlockInputStreamPtr & stream, AggregatedDataVariants & result)
{
    Stopwatch watch;

    size_t src_rows = 0;
    size_t src_bytes = 0;

    /// Read all the data
    while (Block block = stream->read())
    {
        if (isCancelled())
            return;

        src_rows += block.rows();
        src_bytes += block.bytes();

        if (!executeOnBlock(block, result, key_columns, aggregate_columns, no_more_keys))
            break;
    }

由上述代码可以看出，这里就是依次读取子节点流生成的Block，然后继续调用executeOnBlock方法来执行聚合流程处理每一个Block的聚合。接着我们按图索骥，继续看下去，这个函数比较长，我们拆分成几个部分，并且把无关紧要的代码先去掉：这部分主要完成的工作就是将param之中指定的key列与聚合列的指针作为参数提取出来，并且和聚合函数一起封装到AggregateFunctionInstructions的结构之中。

bool Aggregator::executeOnBlock(Columns columns, UInt64 num_rows, AggregatedDataVariants & result,
    ColumnRawPtrs & key_columns, AggregateColumns & aggregate_columns, bool & no_more_keys)
{
    /// `result` will destroy the states of aggregate functions in the destructor
    result.aggregator = this;

    /// How to perform the aggregation?
    if (result.empty())
    {
        result.init(method_chosen);
        result.keys_size = params.keys_size;
        result.key_sizes = key_sizes;
        LOG_TRACE(log, "Aggregation method: " << result.getMethodName());
    }

    for (size_t i = 0; i < params.aggregates_size; ++i)
        aggregate_columns[i].resize(params.aggregates[i].arguments.size());

    /** Constant columns are not supported directly during aggregation.
      * To make them work anyway, we materialize them.
      */
    Columns materialized_columns;

    /// Remember the columns we will work with
    for (size_t i = 0; i < params.keys_size; ++i)
    {
        materialized_columns.push_back(columns.at(params.keys[i])->convertToFullColumnIfConst());
        key_columns[i] = materialized_columns.back().get();

        if (!result.isLowCardinality())
        {
            auto column_no_lc = recursiveRemoveLowCardinality(key_columns[i]->getPtr());
            if (column_no_lc.get() != key_columns[i])
            {
                materialized_columns.emplace_back(std::move(column_no_lc));
                key_columns[i] = materialized_columns.back().get();
            }
        }
    }

    AggregateFunctionInstructions aggregate_functions_instructions(params.aggregates_size + 1);
    aggregate_functions_instructions[params.aggregates_size].that = nullptr;

    std::vector> nested_columns_holder;
    for (size_t i = 0; i < params.aggregates_size; ++i)
    {
        for (size_t j = 0; j < aggregate_columns[i].size(); ++j)
        {
            materialized_columns.push_back(columns.at(params.aggregates[i].arguments[j])->convertToFullColumnIfConst());
            aggregate_columns[i][j] = materialized_columns.back().get();

            auto column_no_lc = recursiveRemoveLowCardinality(aggregate_columns[i][j]->getPtr());
            if (column_no_lc.get() != aggregate_columns[i][j])
            {
                materialized_columns.emplace_back(std::move(column_no_lc));
                aggregate_columns[i][j] = materialized_columns.back().get();
            }
        }

        aggregate_functions_instructions[i].arguments = aggregate_columns[i].data();
        aggregate_functions_instructions[i].state_offset = offsets_of_aggregate_states[i];
        auto that = aggregate_functions[i];
        /// Unnest consecutive trailing -State combinators
        while (auto func = typeid_cast(that))
            that = func->getNestedFunction().get();
        aggregate_functions_instructions[i].that = that;
        aggregate_functions_instructions[i].func = that->getAddressOfAddFunction();

        if (auto func = typeid_cast(that))
        {
            /// Unnest consecutive -State combinators before -Array
            that = func->getNestedFunction().get();
            while (auto nested_func = typeid_cast(that))
                that = nested_func->getNestedFunction().get();
            auto [nested_columns, offsets] = checkAndGetNestedArrayOffset(aggregate_columns[i].data(), that->getArgumentTypes().size());
            nested_columns_holder.push_back(std::move(nested_columns));
            aggregate_functions_instructions[i].batch_arguments = nested_columns_holder.back().data();
            aggregate_functions_instructions[i].offsets = offsets;
        }
        else
            aggregate_functions_instructions[i].batch_arguments = aggregate_columns[i].data();

        aggregate_functions_instructions[i].batch_that = that;
    }

将需要准备的参数准备好了之后，后续就通过按部就班的调用executeImpl(*result.NAME, result.aggregates_pool, num_rows, key_columns, aggregate_functions_instructions.data(),
no_more_keys, overflow_row_ptr)聚合运算了。我们来看看它的实现，它是一个模板函数，内部通过调用了 executeImplBatch(method, state, aggregates_pool, rows, aggregate_instructions)来实现的，数据库都会通过Batch的形式，一次性提交一组需要操作的数据来减少虚函数调用的开销。

template 
void NO_INLINE Aggregator::executeImpl(
    Method & method,
    Arena * aggregates_pool,
    size_t rows,
    ColumnRawPtrs & key_columns,
    AggregateFunctionInstruction * aggregate_instructions,
    bool no_more_keys,
    AggregateDataPtr overflow_row) const
{
    typename Method::State state(key_columns, key_sizes, aggregation_state_cache);

    if (!no_more_keys)
        executeImplBatch(method, state, aggregates_pool, rows, aggregate_instructions);
    else
        executeImplCase(method, state, aggregates_pool, rows, aggregate_instructions, overflow_row);
}

那我们就继续看下去，executeImplBatch同样也是一个模板函数。

首先，它构造了一个AggregateDataPtr的数组places，这里是这就是后续我们实际聚合结果存放的地方。这个数据的长度也就是这个Batch的长度，也就是说，聚合结果的指针也作为一组列式的数据，参与到后续的聚合运算之中。
接下来，通过一个for循环，依次调用state.emplaceKey，计算每列聚合key的hash值，进行分类，并且将对应结果依次和places对应。
最后，通过一个for循环，调用聚合函数的addBatch方法，（这个函数我们在上一篇之中介绍过）。每个AggregateFunctionInstruction都有一个制定的places_offset和对应属于进行聚合计算的value列，这里通过一个for循环调用AddBatch，将places之中对应的数据指针和聚合value列进行聚合，最终形成所有的聚合计算的结果。

到这里，整个聚合计算的核心流程算是完成了，后续就是将result的结果通过上面的convertToBlock的方式转换为BlockStream流，继续返回给上层的调用方。

template 
void NO_INLINE Aggregator::executeImplBatch(
    Method & method,
    typename Method::State & state,
    Arena * aggregates_pool,
    size_t rows,
    AggregateFunctionInstruction * aggregate_instructions) const
{
    PODArray places(rows);

    /// For all rows.
    for (size_t i = 0; i < rows; ++i)
    {
        AggregateDataPtr aggregate_data = nullptr;

        auto emplace_result = state.emplaceKey(method.data, i, *aggregates_pool);

        /// If a new key is inserted, initialize the states of the aggregate functions, and possibly something related to the key.
        if (emplace_result.isInserted())
        {
            /// exception-safety - if you can not allocate memory or create states, then destructors will not be called.
            emplace_result.setMapped(nullptr);

            aggregate_data = aggregates_pool->alignedAlloc(total_size_of_aggregate_states, align_aggregate_states);
            createAggregateStates(aggregate_data);

            emplace_result.setMapped(aggregate_data);
        }
        else
            aggregate_data = emplace_result.getMapped();

        places[i] = aggregate_data;
        assert(places[i] != nullptr);
    }

    /// Add values to the aggregate functions.
    for (AggregateFunctionInstruction * inst = aggregate_instructions; inst->that; ++inst)
    {
        if (inst->offsets)
            inst->batch_that->addBatchArray(rows, places.data(), inst->state_offset, inst->batch_arguments, inst->offsets, aggregates_pool);
        else
            inst->batch_that->addBatch(rows, places.data(), inst->state_offset, inst->batch_arguments, aggregates_pool);
    }

3. 小结

好了，到这里也就把ClickHouse聚合流程的代码梳理完了。
除了聚合计算外，其他的物理执行操作符也是同样通过流的方式依次对接处理的，源码阅读的步骤也可以参照笔者的分析流程来参考。、
笔者是一个ClickHouse的初学者，对ClickHouse有兴趣的同学，欢迎多多指教，交流。

4. 参考资料

官方文档
ClickHouse源代码