Let's start with three questions:
1. On a small cluster, is querying a distributed table really slower than querying a local table?
2. If it is, which step of the execution eats the time?
3. Is there any way to optimize it?
EXPLAIN statements are not supported before version 20, so here are two ways to inspect the execution process: EXPLAIN and trace-level logs.
The analysis below uses a local table and a distributed table as examples. The cluster is configured with two shards, so we can trace both the single-node and the cluster execution paths.
Since the real SQL involves confidential data, it has been desensitized; this won't be mentioned again.
This method is the simplest: just prepend EXPLAIN to the SQL statement.
Querying the local table:
EXPLAIN
SELECT id, max(ch_updatetime) AS ch_updatetime_max
FROM database.merge_tree_table
GROUP BY id
Output:
Actual execution took 12.7 s, used 2.39 GB of memory, and processed 11.67 million rows per second.
Querying the distributed table:
EXPLAIN
SELECT id, max(ch_updatetime) AS ch_updatetime_max
FROM database.distributed_table
GROUP BY id
Output:
Actual execution took 51 s, used roughly 5 GB of memory, and processed 5.81 million rows per second.
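(Side note: if you don't capture timing and memory figures like these at the client, roughly the same numbers can be pulled from system.query_log afterwards. A minimal sketch; adjust the filter to locate your own query:

SELECT
    query_duration_ms,
    formatReadableSize(memory_usage) AS mem,
    read_rows
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT 1
)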
Comparing the two plans, the difference is mainly the MergingAggregated step.
Is MergingAggregated really that expensive? If ClickHouse is a supercar, why would aggregation make it a full 4x slower?
EXPLAIN is, after all, just the blueprint; what matters is what actually runs. So let's pull out the trace-level log of the second query, the one against the distributed table. The key lines are as follows:
[ch01] 2020.09.24 10:09:51.064623 [ 30751 ] {22de5c9b-2df4-4251-b0c7-0ec77dcb1f67} <Debug> executeQuery: (from [::1]:46550) SELECT id,max(ch_updatetime) AS ch_updatetime_max FROM database.distributed_table GROUP BY id
...
[ch02] 2020.09.24 10:09:51.069320 [ 71302 ] {5b1ffeea-9d02-4a9c-8cdf-94c5457533e1} <Trace> ContextAccess (default): Access granted: SELECT(id, ch_updatetime) ON database.merge_tree_table
[ch02] 2020.09.24 10:09:51.070980 [ 71302 ] {5b1ffeea-9d02-4a9c-8cdf-94c5457533e1} <Trace> database.merge_tree_table (SelectExecutor): Reading approx. 147587338 rows with 20 streams
[ch02] 2020.09.24 10:09:53.079298 [ 71254 ] {5b1ffeea-9d02-4a9c-8cdf-94c5457533e1} <Trace> Aggregator: Merging aggregated data
[ch02] 2020.09.24 10:09:53.158780 [ 71261 ] {5b1ffeea-9d02-4a9c-8cdf-94c5457533e1} <Debug> MemoryTracker: Current memory usage (for query): 9.01 GiB.
...
[ch01] 2020.09.24 10:09:53.390590 [ 37783 ] {22de5c9b-2df4-4251-b0c7-0ec77dcb1f67} <Trace> AggregatingTransform: Aggregated. 8397703 to 8384809 rows (from 128.14 MiB) in 2.323149366 sec. (3614792.5410664277 rows/sec., 55.16 MiB/sec.)
[ch01] 2020.09.24 10:09:53.390640 [ 37783 ] {22de5c9b-2df4-4251-b0c7-0ec77dcb1f67} <Trace> Aggregator: Merging aggregated data
[ch01] 2020.09.24 10:09:53.507340 [ 37783 ] {22de5c9b-2df4-4251-b0c7-0ec77dcb1f67} <Trace> MergingAggregatedTransform: Reading blocks of partially aggregated data.
[ch01] 2020.09.24 10:10:02.869300 [ 37772 ] {22de5c9b-2df4-4251-b0c7-0ec77dcb1f67} <Debug> MemoryTracker: Current memory usage (for query): 10.00 GiB.
[ch01] 2020.09.24 10:10:05.741863 [ 37772 ] {22de5c9b-2df4-4251-b0c7-0ec77dcb1f67} <Debug> MemoryTracker: Current memory usage (for query): 11.00 GiB.
...
[ch02] 2020.09.24 10:10:17.171094 [ 71302 ] {5b1ffeea-9d02-4a9c-8cdf-94c5457533e1} <Information> executeQuery: Read 147509056 rows, 2.20 GiB in 26.102098762 sec., 5651233 rows/sec., 86.23 MiB/sec.
[ch02] 2020.09.24 10:10:17.171222 [ 71302 ] {5b1ffeea-9d02-4a9c-8cdf-94c5457533e1} <Debug> MemoryTracker: Peak memory usage (for query): 9.18 GiB.
...
[ch01] 2020.09.24 10:10:24.096175 [ 15020 ] {22de5c9b-2df4-4251-b0c7-0ec77dcb1f67} <Trace> MergingAggregatedTransform: Read 512 blocks of partially aggregated data, total 289315641 rows.
[ch01] 2020.09.24 10:10:24.096175 [ 15020 ] {22de5c9b-2df4-4251-b0c7-0ec77dcb1f67} <Trace> Aggregator: Converted aggregated data to blocks. 283656911 rows, 4.23 GiB in 1.327399526 sec. (213693696.1660479 rows/sec., 3.18 GiB/sec.)
...
Combing through the log, MergingAggregatedTransform is indeed the hot spot: it starts reading partially aggregated blocks at 10:09:53 and only reports them all read at 10:10:24, about 30 seconds later. The main cause of the slow execution has been found!
Note that the log says MergingAggregatedTransform, one "Transform" longer than the MergingAggregated step in the EXPLAIN plan above. Why is that? Read on.
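(A quick aside before the source dive: EXPLAIN prints plan steps, while what actually runs are processors, whose class names typically carry the Transform suffix. If your version supports it, EXPLAIN PIPELINE prints the processor level, so names like AggregatingTransform show up directly. A minimal sketch:

EXPLAIN PIPELINE
SELECT id, max(ch_updatetime) AS ch_updatetime_max
FROM database.distributed_table
GROUP BY id
)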
Without further ado, let's pull up the official MergingAggregatedTransform source code and dig in. See the official repository for the full file; here we only analyze the main body.
First, the includes:
#include <Processors/Transforms/MergingAggregatedTransform.h>
#include <Processors/Transforms/AggregatingTransform.h>
Sure enough, AggregatingTransform.h is pulled in, so the AggregatingTransform we saw at execution time is no accident.
Let's keep going :)
MergingAggregatedTransform::MergingAggregatedTransform(
    Block header_, AggregatingTransformParamsPtr params_, size_t max_threads_)
    : IAccumulatingTransform(std::move(header_), params_->getHeader())
    , params(std::move(params_)), max_threads(max_threads_)
{
}
The constructor takes the aggregation parameters, the thread count, and so on. Note IAccumulatingTransform(std::move(header_), params_->getHeader()): this is an accumulating transform, and std::move hands over ownership of the header rather than copying it, which is very cheap. Next, let's look at how consumption is done:
void MergingAggregatedTransform::consume(Chunk chunk)
{
    /// Consumption has started: log it, once.
    if (!consume_started)
    {
        consume_started = true;
        LOG_TRACE(log, "Reading blocks of partially aggregated data.");
    }
Compare this with the trace log above: the line "Reading blocks of partially aggregated data." did appear there, so this is where chunk consumption begins. (A quick note on terms: a Chunk is essentially a Block stripped of its header metadata, just columns plus a row count; as the code below shows, each consumed Chunk is rebuilt into one Block using the input port's header.)
    ...
    auto block = getInputPort().getHeader().cloneWithColumns(chunk.getColumns());
    block.info.is_overflows = agg_info->is_overflows;
    block.info.bucket_num = agg_info->bucket_num;
    bucket_to_blocks[agg_info->bucket_num].emplace_back(std::move(block));
}
It takes all the columns from the chunk, wraps them into a new Block, and drops the block into bucket_to_blocks under its bucket number. That finishes consuming the chunk. By this point the remote shard's data has been processed, and the local shard's data goes through the same path (assuming one of the two shards lives locally on ch01 and the other on ch02); we won't repeat the local consumption here.
Then comes the generation phase:
Chunk MergingAggregatedTransform::generate()
{
    if (!generate_started)
    {
        generate_started = true;
        LOG_TRACE(log, "Read {} blocks of partially aggregated data, total {} rows.", total_input_blocks, total_input_rows);

        /// Exception safety. Make iterator valid in case any method below throws.
        next_block = blocks.begin();

        /// TODO: this operation can be made async. Add async for IAccumulatingTransform.
        params->aggregator.mergeBlocks(std::move(bucket_to_blocks), data_variants, max_threads);
        blocks = params->aggregator.convertToBlocks(data_variants, params->final, max_threads);
        next_block = blocks.begin();
    }
Again a flag is checked to see whether generation has started; on the first call it emits the trace-level log, merges all the buckets with mergeBlocks, converts the merged state back into blocks, and then iterates over them:
    if (next_block == blocks.end())
        return {};

    auto block = std::move(*next_block);
    ++next_block;

    auto info = std::make_shared<AggregatedChunkInfo>();
    info->bucket_num = block.info.bucket_num;
    info->is_overflows = block.info.is_overflows;

    UInt64 num_rows = block.rows();
    Chunk chunk(block.getColumns(), num_rows);
    chunk.setChunkInfo(std::move(info));
    return chunk;
}
Each call hands one Block back out as a Chunk. That wraps up the analysis of MergingAggregatedTransform.
Now, back to the questions from the beginning:
The tests show it is indeed slower, because merging the aggregated states is expensive; on a cluster of real scale, though, the cluster naturally wins — see ByteDance's real-time analytics practice with ClickHouse for details.
So is there nothing to be done for a small cluster like this? There is: use a materialized view or a data skipping index — a quick sketch below.
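Here is a minimal materialized-view sketch for exactly this query pattern. The view name is made up, but the State/Merge combinators and AggregatingMergeTree usage are standard:

CREATE MATERIALIZED VIEW database.mv_updatetime_max
ENGINE = AggregatingMergeTree() ORDER BY id
AS SELECT id, maxState(ch_updatetime) AS ch_updatetime_max_state
FROM database.merge_tree_table
GROUP BY id;

-- Reads the pre-aggregated states instead of rescanning ~147 million raw rows:
SELECT id, maxMerge(ch_updatetime_max_state) AS ch_updatetime_max
FROM database.mv_updatetime_max
GROUP BY id;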
We'll dig into those in the next installment.