原文地址:https://program-park.github.io/2022/04/01/clickhouse_11/
ClickHouse 底层提供了基于 RBO(Rule Based Optimization)的 SQL 优化实现,会对一些低效的查询语句自动进行优化,这些优化的方式实际上也是我们写高效查询的一些指导。
visits_v1.tar
和hits_v1.tar
上传到虚拟机,解压到 ClickHouse 数据路径下:// 解压到clickhouse数据路径
tar -xvf hits_v1.tar -C /var/lib/clickhouse
tar -xvf visits_v1.tar -C /var/lib/clickhouse
//修改所属用户
chown -R clickhouse:clickhouse /var/lib/clickhouse/data/datasets
chown -R clickhouse:clickhouse /var/lib/clickhouse/metadata/datasets
clickhouse-server
:systemctl restart clickhouse-server
clickhouse-client --query "SELECT COUNT(*) FROM datasets.hits_v1"
clickhouse-client --query "SELECT COUNT(*) FROM datasets.visits_v1"
官方的 tar 包,包含了建库、建表语句、数据内容,这种方式不需要手动建库、建表。hits_v1
表有 130 多个字段,880 多万条数据,visits_v1
表有 180 多个字段,160 多万条数据。
在调用 count 时,如果使用的是count()
或者count(*)
,且没有where
条件时,会直接使用system.tables
的total_row
。
例如:
EXPLAIN SELECT count() FROM datasets.hits_v1;
Union
Expression (Projection)
Expression (Before ORDER BY and SELECT)
MergingAggregated
ReadNothing (Optimized trivial count)
注意Optimized trivial count
,这是对 count 的优化。
如果 count 具体的列字段,则不会使用此项优化:
EXPLAIN SELECT count(CounterID) FROM datasets.hits_v1;
Union
Expression (Projection)
Expression (Before ORDER BY and SELECT)
Aggregating
Expression (Before GROUP BY)
ReadFromStorage (Read from MergeTree)
下面语句子查询中有两个重复的 id 字段,会被去重:
EXPLAIN SYNTAX SELECT
a.UserID,
b.VisitID,
a.URL,
b.UserID
FROM
hits_v1 AS a
LEFT JOIN (
SELECT
UserID,
UserID as HaHa,
VisitID
FROM
visits_v1) AS b
USING (UserID)
limit 3;
// 返回优化语句:
SELECT
UserID,
VisitID,
URL,
b.UserID
FROM
hits_v1 AS a
ALL LEFT JOIN
(
SELECT
UserID,
VisitID
FROM
visits_v1
) AS b
USING (UserID)
LIMIT 3
当 group by 有 having 子句,但是没有 with cube、with rollup 或者 with totals 修饰的时候,having 过滤会下推到 where 提前过滤
。谓词下推优化规则涉及到一个参数 enable_optimize_predicate_expression,默认是1,打开谓词下推,设置为 0 就会关闭谓词下推优化。例如下面的查询,HAVING name 变成了 WHERE name,在 group by 之前过滤:
EXPLAIN SYNTAX SELECT UserID FROM hits_v1 GROUP BY UserID HAVING UserID = '8585742290196126178';
// 返回优化语句
SELECT
UserID
FROM
hits_v1
WHERE
UserID = '8585742290196126178'
GROUP BY
UserID
子查询也支持谓词下推:
EXPLAIN SYNTAX
SELECT
*
FROM
(
SELECT
UserID
FROM
visits_v1
)
WHERE
UserID = '8585742290196126178'
// 返回优化后的语句
SELECT
UserID
FROM
(
SELECT
UserID
FROM
visits_v1
WHERE
UserID = '8585742290196126178'
)
WHERE
UserID = '8585742290196126178'
再来一个复杂例子:
EXPLAIN SYNTAX
SELECT * FROM (
SELECT
*
FROM
(
SELECT
UserID
FROM visits_v1)
UNION ALL
SELECT
*
FROM
(
SELECT
UserID
FROM visits_v1)
)
WHERE
UserID = '8585742290196126178'
// 返回优化后的语句
SELECT UserID
FROM
(
SELECT UserID
FROM
(
SELECT
UserID
FROM
visits_v1
WHERE
UserID = '8585742290196126178'
)
WHERE
UserID = '8585742290196126178'
UNION ALL
SELECT
UserID
FROM
(
SELECT
UserID
FROM
visits_v1
WHERE
UserID = '8585742290196126178'
)
WHERE
UserID = '8585742290196126178'
)
WHERE
UserID = '8585742290196126178'
聚合函数内的计算,会外推,例如:
EXPLAIN SYNTAX
SELECT
sum(UserID * 2)
FROM
visits_v1
// 返回优化后的语句
SELECT
sum(UserID) * 2
FROM
visits_v1
如果对聚合键,也就是 group by key 使用 min、max、any 聚合函数,则将函数消除,例如:
EXPLAIN SYNTAX
SELECT
sum(UserID * 2),
max(VisitID),
max(UserID)
FROM
visits_v1
GROUP BY
UserID
// 返回优化后的语句
SELECT
sum(UserID) * 2,
max(VisitID),
UserID
FROM
visits_v1
GROUP BY
UserID
例如下面的语句,重复的聚合键 id 字段会被去重:
EXPLAIN SYNTAX
SELECT
*
FROM
visits_v1
ORDER BY
UserID ASC,
UserID ASC,
VisitID ASC,
VisitID ASC
// 返回优化后的语句
select
···
FROM
visits_v1
ORDER BY
UserID ASC,
VisitID ASC
例如下面的语句,重复声明的 name 字段会被去重:
EXPLAIN SYNTAX
SELECT
*
FROM
visits_v1
LIMIT 3 BY
VisitID,
VisitID
LIMIT 10
// 返回优化后的语句
select
···
FROM
visits_v1
LIMIT 3 BY
VisitID
LIMIT 10
例如下面的语句,重复的关联键 id 字段会被去重:
EXPLAIN SYNTAX
SELECT
a.UserID,
a.UserID,
b.VisitID,
a.URL,
b.UserID
FROM
hits_v1 AS a
LEFT JOIN
visits_v1 AS b
USING (UserID, UserID)
// 返回优化后的语句
SELECT
UserID,
UserID,
VisitID,
URL,
b.UserID
FROM
hits_v1 AS a
ALL LEFT JOIN
visits_v1 AS b
USING (UserID)
如果子查询只返回一行数据,在被引用的时候用标量替换,例如下面语句中的 total_disk_usage 字段:
EXPLAIN SYNTAX
WITH
(
SELECT
sum(bytes)
FROM
system.parts
WHERE
active
) AS total_disk_usage
SELECT
(sum(bytes) / total_disk_usage) * 100 AS table_disk_usage,
table
FROM
system.parts
GROUP BY
table
ORDER BY
table_disk_usage DESC
LIMIT 10;
// 返回优化后的语句
WITH CAST(0, \'UInt64\') AS total_disk_usage
SELECT
(sum(bytes) / total_disk_usage) * 100 AS table_disk_usage,
table
FROM
system.parts
GROUP BY
table
ORDER BY
table_disk_usage DESC
LIMIT 10
如果开启了 optimize_if_chain_to_multiif 参数,三元运算符会被替换成 multiIf 函数,例如:
EXPLAIN SYNTAX
SELECT
number = 1 ? 'hello' : (number = 2 ? 'world' : 'clickhouse')
FROM
numbers(10)
settings optimize_if_chain_to_multiif = 1;
// 返回优化后的语句
SELECT
multiIf(number = 1, 'hello', number = 2, 'world', 'clickhouse')
FROM
numbers(10)
SETTINGS optimize_if_chain_to_multiif = 1;
【1】https://clickhouse.com/docs/zh/
【2】https://www.bilibili.com/video/BV1Yh411z7os?from=search&seid=4579023877699743987&spm_id_from=333.337.0.0
【3】https://clickhouse.com/docs/zh/