Analyzing ClickHouse Query Execution

Let's start with three questions:

  1. Is a multi-shard aggregation in ClickHouse always slower than the single-node equivalent?
  2. If it is slower, where does the time go?
  3. Can it be optimized?

ClickHouse did not support the EXPLAIN statement before version 20.x, so this post covers two ways to inspect query execution: EXPLAIN and trace-level logs.
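If you are not sure which of the two applies to your server, checking the version first is the quickest way (a trivial check, not from the original setup):

SELECT version();
-- Per the note above, 20.x and later support EXPLAIN; older servers need the trace-log route instead.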

The analysis below uses a local table and its corresponding distributed table as examples. The cluster is configured with two shards, and the execution of the single-node (local-table) query is compared with the cluster (distributed-table) query.

Because the real SQL is confidential, it has been anonymized here; this will not be mentioned again.
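For context, the pair of tables referenced below might be set up roughly like this. This is only a sketch: the table names, the column list, the cluster name two_shards, and the sharding key are placeholders, not the real (anonymized) schema.

-- Local MergeTree table created on every shard (hypothetical columns).
CREATE TABLE database.merge_tree_table ON CLUSTER two_shards
(
    id            UInt64,
    ch_updatetime DateTime
)
ENGINE = MergeTree
ORDER BY id;

-- Distributed table that fans queries out to the local table on both shards.
CREATE TABLE database.distributed_table ON CLUSTER two_shards
AS database.merge_tree_table
ENGINE = Distributed(two_shards, database, merge_tree_table, rand());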

explain

This is the simplest method: just put explain in front of the SQL statement.

Querying the local table

explain 
	SELECT id,max(ch_updatetime) AS ch_updatetime_max 
	FROM database.merge_tree_table
	GROUP BY id

Output: [screenshot: EXPLAIN plan for the local-table query]

The actual run took 12.7 s, used 2.39 GB of memory, and processed about 11.67 million rows per second.

Querying the distributed table

explain 
	SELECT id,max(ch_updatetime) AS ch_updatetime_max 
	FROM database.distributed_table
	GROUP BY id

Output: [screenshot: EXPLAIN plan for the distributed-table query]

The actual run took 51 s, used roughly 5 GB of memory, and processed about 5.81 million rows per second.

Comparing the two plans, the main difference is the MergingAggregated step.

Is MergingAggregated really that expensive? If ClickHouse is a supercar, why would this aggregation run roughly four times slower on the cluster?


Trace-level logs

EXPLAIN is, after all, just a blueprint; what really matters are the operations actually executed. Here the trace-level logs of the second query, against the distributed table, are pulled out for analysis (they can be streamed straight to the client, as sketched below). The key lines are:
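One convenient way to capture these logs, assuming the query is run from clickhouse-client, is the send_logs_level setting:

-- Run in clickhouse-client before re-executing the distributed query.
SET send_logs_level = 'trace';

With this set, rerunning the distributed-table query prints server log lines, including those forwarded from the remote shard, directly in the client session instead of having to dig through log files on each node.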

[ch01] 2020.09.24 10:09:51.064623 [ 30751 ] {22de5c9b-2df4-4251-b0c7-0ec77dcb1f67} <Debug> executeQuery: (from [::1]:46550) SELECT id,max(ch_updatetime) AS ch_updatetime_max FROM database.merge_tree_table GROUP BY id
...
[ch02] 2020.09.24 10:09:51.069320 [ 71302 ] {5b1ffeea-9d02-4a9c-8cdf-94c5457533e1} <Trace> ContextAccess (default): Access granted: SELECT(id, ch_updatetime) ON database.merge_tree_table
[ch02] 2020.09.24 10:09:51.070980 [ 71302 ] {5b1ffeea-9d02-4a9c-8cdf-94c5457533e1} <Trace> database.merge_tree_table(SelectExecutor): Reading approx. 147587338 rows with 20 streams
[ch02] 2020.09.24 10:09:53.079298 [ 71254 ] {5b1ffeea-9d02-4a9c-8cdf-94c5457533e1} <Trace> Aggregator: Merging aggregated data
[ch02] 2020.09.24 10:09:53.158780 [ 71261 ] {5b1ffeea-9d02-4a9c-8cdf-94c5457533e1} <Debug> MemoryTracker: Current memory usage (for query): 9.01 GiB.
...
[ch01] 2020.09.24 10:09:53.390590 [ 37783 ] {22de5c9b-2df4-4251-b0c7-0ec77dcb1f67} <Trace> AggregatingTransform: Aggregated. 8397703 to 8384809 rows (from 128.14 MiB) in 2.323149366 sec. (3614792.5410664277 rows/sec., 55.16 MiB/sec.)
[ch01] 2020.09.24 10:09:53.390640 [ 37783 ] {22de5c9b-2df4-4251-b0c7-0ec77dcb1f67} <Trace> Aggregator: Merging aggregated data
[ch01] 2020.09.24 10:09:53.507340 [ 37783 ] {22de5c9b-2df4-4251-b0c7-0ec77dcb1f67} <Trace> MergingAggregatedTransform: Reading blocks of partially aggregated data.
[ch01] 2020.09.24 10:10:02.869300 [ 37772 ] {22de5c9b-2df4-4251-b0c7-0ec77dcb1f67} <Debug> MemoryTracker: Current memory usage (for query): 10.00 GiB.
[ch01] 2020.09.24 10:10:05.741863 [ 37772 ] {22de5c9b-2df4-4251-b0c7-0ec77dcb1f67} <Debug> MemoryTracker: Current memory usage (for query): 11.00 GiB.
...
[ch02] 2020.09.24 10:10:17.171094 [ 71302 ] {5b1ffeea-9d02-4a9c-8cdf-94c5457533e1} <Information> executeQuery: Read 147509056 rows, 2.20 GiB in 26.102098762 sec., 5651233 rows/sec., 86.23 MiB/sec.
[ch02] 2020.09.24 10:10:17.171222 [ 71302 ] {5b1ffeea-9d02-4a9c-8cdf-94c5457533e1} <Debug> MemoryTracker: Peak memory usage (for query): 9.18 GiB.
...
[ch01] 2020.09.24 10:10:24.096175 [ 15020 ] {22de5c9b-2df4-4251-b0c7-0ec77dcb1f67} <Trace> MergingAggregatedTransform: Read 512 blocks of partially aggregated data, total 289315641 rows.
[ch01] 2020.09.24 10:10:24.096175 [ 15020 ] {22de5c9b-2df4-4251-b0c7-0ec77dcb1f67} <Trace> Aggregator: Converted aggregated data to blocks. 283656911 rows, 4.23 GiB in 1.327399526 sec. (213693696.1660479 rows/sec., 3.18 GiB/sec.)
...

Reading the logs carefully confirms that MergingAggregatedTransform is indeed the expensive part: it runs from 10:09:53 to 10:10:17, roughly 24 seconds. The main cause of the slow execution has been found!

Note that the log says MergingAggregatedTransform, one "Transform" longer than the MergingAggregated step shown in the earlier EXPLAIN plan. Why is that? Read on.


Analyzing the ClickHouse source code

Without further ado, let's open the official MergingAggregatedTransform source and dig in. The full code is in the ClickHouse repository; only the main parts are analyzed here.

First, the includes:

#include <Processors/Transforms/MergingAggregatedTransform.h>
#include <Processors/Transforms/AggregatingTransform.h>

Sure enough, it includes AggregatingTransform.h, which explains why AggregatingTransform showed up in the trace log. More generally, each step in the EXPLAIN plan is carried out by a corresponding *Transform processor at execution time, so the MergingAggregated step in the plan becomes MergingAggregatedTransform in the pipeline.
Moving on :)

MergingAggregatedTransform::MergingAggregatedTransform(
    Block header_, AggregatingTransformParamsPtr params_, size_t max_threads_)
    : IAccumulatingTransform(std::move(header_), params_->getHeader())
    , params(std::move(params_)), max_threads(max_threads_)
{
}

The constructor takes the aggregation parameters, the maximum number of threads, and so on. Note IAccumulatingTransform(std::move(header_), params_->getHeader()): this is an accumulating transform, i.e. it first consumes all of its input and only then produces output, and std::move hands over ownership of the header instead of copying it, which is very cheap. Next, let's see how consumption works:

void MergingAggregatedTransform::consume(Chunk chunk)
{
    // Consumption has started; log it once.
    if (!consume_started)
    {
        consume_started = true;
        LOG_TRACE(log, "Reading blocks of partially aggregated data.");
    }

Compare this with the trace log above: the line "Reading blocks of partially aggregated data." appeared there, so this is where chunk consumption begins. (A quick note on terminology: a Chunk is essentially a Block's columns plus a row count, without the header metadata; here each consumed Chunk is rebuilt into a Block using the input header.)

    ...
    auto block = getInputPort().getHeader().cloneWithColumns(chunk.getColumns());
    block.info.is_overflows = agg_info->is_overflows;
    block.info.bucket_num = agg_info->bucket_num;

    bucket_to_blocks[agg_info->bucket_num].emplace_back(std::move(block));
}

Next it takes all the columns from the chunk, builds a new block from them, and drops that block into bucket_to_blocks keyed by its bucket number. That is the end of consuming this chunk. At this point the data from the remote shard has been handled, and the data on the local shard is processed next (assuming one of the two shards is local on ch01 and the other is on ch02). The local consume path is not repeated here.

Then comes the generate phase:

Chunk MergingAggregatedTransform::generate()
{
    if (!generate_started)
    {
        generate_started = true;
        LOG_TRACE(log, "Read {} blocks of partially aggregated data, total {} rows.", total_input_blocks, total_input_rows);

        /// Exception safety. Make iterator valid in case any method below throws.
        next_block = blocks.begin();

        /// TODO: this operation can be made async. Add async for IAccumulatingTransform.
        params->aggregator.mergeBlocks(std::move(bucket_to_blocks), data_variants, max_threads);
        blocks = params->aggregator.convertToBlocks(data_variants, params->final, max_threads);
        next_block = blocks.begin();
    }

Again a flag is checked to see whether generation has already started. On the first call it logs the trace line, merges the partially aggregated blocks (mergeBlocks), converts the merged state back into blocks (convertToBlocks), and then starts iterating over them:

    if (next_block == blocks.end())
        return {};

    auto block = std::move(*next_block);
    ++next_block;

    auto info = std::make_shared<AggregatedChunkInfo>();
    info->bucket_num = block.info.bucket_num;
    info->is_overflows = block.info.is_overflows;

    UInt64 num_rows = block.rows();
    Chunk chunk(block.getColumns(), num_rows);
    chunk.setChunkInfo(std::move(info));

    return chunk;
}

Each block is then converted back into a chunk and handed downstream. That concludes the analysis of MergingAggregatedTransform.

Now, back to the three questions from the beginning:

  1. Is a multi-shard aggregation in ClickHouse always slower than the single-node equivalent?
  2. If it is slower, where does the time go?
  3. Can it be optimized?

The test shows it is indeed slower, because merging the partial aggregates dominates the runtime; on a much larger cluster the cluster would naturally win. For details, see ByteDance's real-time analytics practice with ClickHouse.
So is there no way to optimize a small cluster like this one? There is: use a materialized view or a data-skipping index; a minimal sketch of the materialized-view approach follows.
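As a rough illustration of the materialized-view idea (again with placeholder names, and assuming the workload really is max(ch_updatetime) per id), each shard can maintain pre-aggregated states so that the distributed query only has to merge a much smaller result:

-- Hypothetical per-shard pre-aggregation of max(ch_updatetime) by id.
CREATE MATERIALIZED VIEW database.mv_updatetime_max ON CLUSTER two_shards
ENGINE = AggregatingMergeTree()
ORDER BY id
AS SELECT
    id,
    maxState(ch_updatetime) AS ch_updatetime_max_state
FROM database.merge_tree_table
GROUP BY id;

-- The query then merges small aggregate states instead of re-aggregating ~147 million raw rows.
SELECT id, maxMerge(ch_updatetime_max_state) AS ch_updatetime_max
FROM database.mv_updatetime_max
GROUP BY id;

In the two-shard setup a Distributed table over the view would still be needed to query both shards at once, but the data it has to merge is far smaller than the raw table.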
A deeper dive in the next post.
