ClickHouse实时分析(十)- ClickHouse内置语法优化详解

目录

  • 1. 准备测试用表
  • 2. COUNT优化
  • 3. 消除子查询重复字段
  • 4. 谓词下推
  • 5. 聚合计算外推
  • 6. 聚合函数消除
  • 7. 删除重复的order by key
  • 8. 删除重复的limit by key
  • 9. 删除重复的 USING Key
  • 10. 标量替换
  • 11. 三元运算优化
  • 参考文献

原文地址:https://program-park.github.io/2022/04/01/clickhouse_11/

1. 准备测试用表

  ClickHouse 底层提供了基于 RBO(Rule Based Optimization)的 SQL 优化实现,会对一些低效的查询语句自动进行优化,这些优化的方式实际上也是我们写高效查询的一些指导。

  1. 上传官方数据集(具体参考https://clickhouse.com/docs/zh/getting-started/example-datasets/metrica/),将visits_v1.tarhits_v1.tar上传到虚拟机,解压到 ClickHouse 数据路径下:
    // 解压到clickhouse数据路径 
    tar -xvf hits_v1.tar -C /var/lib/clickhouse
    tar -xvf visits_v1.tar -C /var/lib/clickhouse
    
    //修改所属用户 
    chown -R clickhouse:clickhouse /var/lib/clickhouse/data/datasets
    chown -R clickhouse:clickhouse /var/lib/clickhouse/metadata/datasets
    
  2. 重启clickhouse-server
    systemctl restart clickhouse-server
    
  3. 执行查询:
    clickhouse-client --query "SELECT COUNT(*) FROM datasets.hits_v1"
    clickhouse-client --query "SELECT COUNT(*) FROM datasets.visits_v1" 
    

  官方的 tar 包,包含了建库、建表语句、数据内容,这种方式不需要手动建库、建表。hits_v1表有 130 多个字段,880 多万条数据,visits_v1表有 180 多个字段,160 多万条数据。

2. COUNT优化

  在调用 count 时,如果使用的是count()或者count(*),且没有where条件时,会直接使用system.tablestotal_row
  例如:

EXPLAIN SELECT count() FROM datasets.hits_v1;

Union
 Expression (Projection)
  Expression (Before ORDER BY and SELECT)
   MergingAggregated
    ReadNothing (Optimized trivial count)

  注意Optimized trivial count,这是对 count 的优化。
  如果 count 具体的列字段,则不会使用此项优化:

EXPLAIN SELECT count(CounterID) FROM datasets.hits_v1;

Union
 Expression (Projection)
  Expression (Before ORDER BY and SELECT)
   Aggregating
    Expression (Before GROUP BY)
     ReadFromStorage (Read from MergeTree)

3. 消除子查询重复字段

  下面语句子查询中有两个重复的 id 字段,会被去重:

EXPLAIN SYNTAX SELECT 
	a.UserID,
	b.VisitID,
	a.URL,
	b.UserID
FROM
	hits_v1 AS a 
LEFT JOIN ( 
	SELECT 
		UserID, 
		UserID as HaHa, 
		VisitID 
	FROM
		visits_v1) AS b 
USING (UserID)
limit 3;

// 返回优化语句:
SELECT 
	UserID,
	VisitID,
	URL,
	b.UserID
FROM
	hits_v1 AS a
ALL LEFT JOIN 
	(
	SELECT 
		UserID,
		VisitID
	FROM
		visits_v1
	) AS b
USING (UserID)
LIMIT 3

4. 谓词下推

  当 group by 有 having 子句,但是没有 with cube、with rollup 或者 with totals 修饰的时候,having 过滤会下推到 where 提前过滤。谓词下推优化规则涉及到一个参数 enable_optimize_predicate_expression,默认是1,打开谓词下推,设置为 0 就会关闭谓词下推优化。例如下面的查询,HAVING name 变成了 WHERE name,在 group by 之前过滤:

EXPLAIN SYNTAX SELECT UserID FROM hits_v1 GROUP BY UserID HAVING UserID = '8585742290196126178';

// 返回优化语句
SELECT
	UserID
FROM
	hits_v1
WHERE
	UserID = '8585742290196126178'
GROUP BY
	UserID

  子查询也支持谓词下推:

EXPLAIN SYNTAX
SELECT
	*
FROM 
	(
	SELECT
		UserID
	FROM
		visits_v1
	)
WHERE
	UserID = '8585742290196126178'

// 返回优化后的语句
SELECT
	UserID
FROM 
	(
	SELECT
		UserID
	FROM
		visits_v1
	WHERE
		UserID = '8585742290196126178'
	)
WHERE
	UserID = '8585742290196126178'

  再来一个复杂例子:

EXPLAIN SYNTAX 
SELECT * FROM (
	SELECT 
		* 
	FROM 
		(
		SELECT 
			UserID 
		FROM visits_v1) 
	UNION ALL 
	SELECT 
		* 
	FROM 
		(
		SELECT 
			UserID 
		FROM visits_v1)
)
WHERE
	UserID = '8585742290196126178'

// 返回优化后的语句
SELECT UserID
FROM 
(
	SELECT UserID
	FROM 
	(
		SELECT
			UserID
		FROM
			visits_v1
		WHERE
			UserID = '8585742290196126178'
	)
	WHERE
		UserID = '8585742290196126178'
	UNION ALL
	SELECT
		UserID
	FROM 
	(
		SELECT
			UserID
		FROM
			visits_v1
		WHERE
			UserID = '8585742290196126178'
	)
	WHERE
		UserID = '8585742290196126178'
)
WHERE
	UserID = '8585742290196126178'

5. 聚合计算外推

  聚合函数内的计算,会外推,例如:

EXPLAIN SYNTAX
SELECT
	sum(UserID * 2)
FROM
	visits_v1

// 返回优化后的语句
SELECT
	sum(UserID) * 2
FROM
	visits_v1

6. 聚合函数消除

  如果对聚合键,也就是 group by key 使用 min、max、any 聚合函数,则将函数消除,例如:

EXPLAIN SYNTAX
SELECT
	sum(UserID * 2),
	max(VisitID),
	max(UserID)
FROM
	visits_v1
GROUP BY
	UserID

// 返回优化后的语句
SELECT 
	sum(UserID) * 2,
	max(VisitID),
	UserID
FROM
	visits_v1
GROUP BY
	UserID

7. 删除重复的order by key

  例如下面的语句,重复的聚合键 id 字段会被去重:

EXPLAIN SYNTAX
SELECT
	*
FROM
	visits_v1
ORDER BY
	UserID ASC,
	UserID ASC,
	VisitID ASC,
	VisitID ASC

// 返回优化后的语句
select
	···
FROM
	visits_v1
ORDER BY 
	UserID ASC,
	VisitID ASC

8. 删除重复的limit by key

  例如下面的语句,重复声明的 name 字段会被去重:

EXPLAIN SYNTAX
SELECT
	*
FROM
	visits_v1
LIMIT 3 BY
	VisitID,
	VisitID
LIMIT 10

// 返回优化后的语句
select
	···
FROM
	visits_v1
LIMIT 3 BY
	VisitID
LIMIT 10

9. 删除重复的 USING Key

  例如下面的语句,重复的关联键 id 字段会被去重:

EXPLAIN SYNTAX
SELECT
	a.UserID,
	a.UserID,
	b.VisitID,
	a.URL,
	b.UserID
FROM
	hits_v1 AS a
LEFT JOIN
	visits_v1 AS b
USING (UserID, UserID)

// 返回优化后的语句
SELECT 
	UserID,
	UserID,
	VisitID,
	URL,
	b.UserID
FROM 
	hits_v1 AS a
ALL LEFT JOIN
	visits_v1 AS b
USING (UserID)

10. 标量替换

  如果子查询只返回一行数据,在被引用的时候用标量替换,例如下面语句中的 total_disk_usage 字段:

EXPLAIN SYNTAX
WITH 
(
	SELECT
		sum(bytes)
	FROM
		system.parts
	WHERE
		active
) AS total_disk_usage
SELECT
	(sum(bytes) / total_disk_usage) * 100 AS table_disk_usage,
	table
FROM
	system.parts
GROUP BY
	table
ORDER BY
	table_disk_usage DESC
LIMIT 10;

// 返回优化后的语句
WITH CAST(0, \'UInt64\') AS total_disk_usage
SELECT 
	(sum(bytes) / total_disk_usage) * 100 AS table_disk_usage,
	table
FROM
	system.parts
GROUP BY
	table
ORDER BY
	table_disk_usage DESC
LIMIT 10

11. 三元运算优化

  如果开启了 optimize_if_chain_to_multiif 参数,三元运算符会被替换成 multiIf 函数,例如:

EXPLAIN SYNTAX 
SELECT
	number = 1 ? 'hello' : (number = 2 ? 'world' : 'clickhouse') 
FROM
	numbers(10) 
settings optimize_if_chain_to_multiif = 1;

// 返回优化后的语句
SELECT
	multiIf(number = 1, 'hello', number = 2, 'world', 'clickhouse')
FROM
	numbers(10)
SETTINGS optimize_if_chain_to_multiif = 1;

参考文献

  【1】https://clickhouse.com/docs/zh/
  【2】https://www.bilibili.com/video/BV1Yh411z7os?from=search&seid=4579023877699743987&spm_id_from=333.337.0.0
  【3】https://clickhouse.com/docs/zh/

你可能感兴趣的:(ClickHouse,clickhouse,大数据,数据库)