[WITH expr_list|(subquery)]
SELECT [DISTINCT] expr_list
[FROM [db.]table | (subquery) | table_function] [FINAL]
[SAMPLE sample_coeff]
[ARRAY JOIN ...]
[GLOBAL] [ANY|ALL|ASOF] [INNER|LEFT|RIGHT|FULL|CROSS] [OUTER|SEMI|ANTI] JOIN (subquery)|table (ON <expr_list>)|(USING <column_list>)
[PREWHERE expr]
[WHERE expr]
[GROUP BY expr_list] [WITH TOTALS]
[HAVING expr]
[ORDER BY expr_list] [WITH FILL] [FROM expr] [TO expr] [STEP expr]
[LIMIT [offset_value, ]n BY columns]
[LIMIT [n, ]m] [WITH TIES]
[UNION ALL ...]
[INTO OUTFILE filename]
[FORMAT format]

As you can see, the syntax of ClickHouse's SELECT statement is very similar to that of standard SQL, including the following clauses (a combined example follows the list):
- SELECT: specifies the fields returned in the result
- DISTINCT: removes duplicate rows
- FROM: specifies the table, subquery, or table function to query
- JOIN: table joins, supporting inner and outer joins as well as left and right joins
- WHERE: filter conditions
- GROUP BY: grouping, used together with aggregate functions
- HAVING: filtering after grouping
- ORDER BY: sorting
- LIMIT: limits the number of returned rows
- UNION ALL: union; ClickHouse currently supports only UNION ALL, not UNION
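A minimal sketch that combines several of these clauses on the tutorial hits_v1 table used throughout this article (the column choices and the threshold are only illustrative):

-- Count hits per CounterID on one day, keep only counters with more than
-- 100 hits, and return the 10 busiest ones
select CounterID, count(1) as hits
from hits_v1
where EventDate = '2014-03-17'
group by CounterID
having hits > 100
order by hits desc
limit 10;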
ClickHouse's SELECT statement also has some usages of its own:
- WITH: defines variables to be used later in the query
- SAMPLE: data sampling, similar to the sample() function in the Pandas library
- PREWHERE: pre-filtering, used to improve query performance
- ARRAY JOIN: array join, used to expand array or nested fields so that one row becomes multiple rows
- LIMIT BY: groups the result, then takes the first n rows of each group
- INTO OUTFILE: exports table data to a file, with FORMAT specifying the file format
WITH clause examples:

-- Define a variable in the WITH clause and reference it by its alias in the query
with '2014-03-17' as dt \
select count(1) from hits_v1 where EventDate = dt;

-- Define an expression (a function call) in the WITH clause and reference it by its alias
with round(Duration / 60) as duration_minutes \
select StartDate, max(duration_minutes) as max_duration_minutes from visits_v1 \
group by StartDate, Duration \
order by max_duration_minutes desc \
limit 10;

-- Define a subquery in the WITH clause and reference it by its alias
-- The subquery may only return a single row
with ( \
    select sum(Duration) from visits_v1 \
) as total_duration \
select StartDate, sum(Duration) / total_duration as duration_percentage from visits_v1 \
group by StartDate, Duration \
limit 10;
The SAMPLE clause can be used to sample data from tables that use a MergeTree engine and declare a SAMPLE BY expression.
Table example:

CREATE TABLE tutorial.hits_v1 \
( ... ) ENGINE = MergeTree() \
PARTITION BY toYYYYMM(EventDate) \
ORDER BY (CounterID, EventDate, intHash32(UserID)) \
SAMPLE BY intHash32(UserID) \
SETTINGS index_granularity = 8192;

The hits_v1 table uses the MergeTree engine, and its SAMPLE BY expression converts UserID into a 32-bit hash value.
SAMPLE clause examples:
-- Sampling by ratio
-- Number of rows in the sample
select count(1) from hits_v1 sample 0.1
-- Sampled data; by default at most 10000 rows are returned
select CounterID, UserID, EventDate, EventTime from hits_v1 sample 0.1
-- Sampled data, limited to 10 rows
select CounterID, UserID, EventDate, EventTime from hits_v1 sample 0.1 limit 10

-- Sampling by row count
-- When the requested sample size is small, the sample contains 0 rows
select count(1) from hits_v1 sample 100
-- When the requested sample size exceeds the index granularity, the number of sampled rows is close to the requested size
select count(1) from hits_v1 sample 10000
-- Sampled data; by default at most 10000 rows are returned
select CounterID, UserID, EventDate, EventTime from hits_v1 sample 20000
-- Sampled data, limited to 10 rows
select CounterID, UserID, EventDate, EventTime from hits_v1 sample 20000 limit 10

-- Sampling by ratio with an offset, similar to sampling by ratio
select CounterID, UserID, EventDate, EventTime from hits_v1 sample 0.1 offset 0.3 limit 10
Sampling with the SAMPLE clause is idempotent and approximate:
- Idempotency: as long as the sampling condition does not change, two samplings return the same result (see the example after this list)
- Approximation: neither the sampling range nor the sampling result is guaranteed to be exact
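For example, running the same sampled aggregation twice should produce the same count both times, because the sample is drawn deterministically from the same intHash32(UserID) ranges (a sketch; the actual count depends on the data):

-- Both queries read the same deterministic subset, so the two counts match
select count(1) from hits_v1 sample 0.1;
select count(1) from hits_v1 sample 0.1;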
Only tables using a MergeTree engine can use the PREWHERE clause; PREWHERE can be seen as ClickHouse's optimization of the WHERE clause.
By default, ClickHouse automatically rewrites WHERE into PREWHERE:

-- When optimize_move_to_prewhere is 1, the automatic PREWHERE optimization is enabled
select name, value from system.settings where name like '%prewhere%'
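PREWHERE can also be written explicitly instead of relying on the automatic rewrite. A minimal sketch against hits_v1 (the filter column is chosen only for illustration): ClickHouse first reads the PREWHERE column, then reads the remaining columns only for the blocks where the condition holds.

-- Only JavaEnable is read for all granules; WatchID and EventDate are read
-- just for the granules that pass the condition
select WatchID, EventDate from hits_v1 prewhere JavaEnable = 1 limit 10;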
The ARRAY JOIN clause performs a join against Array or Nested columns, expanding one row containing an array into multiple rows.
ARRAY JOIN example with an Array column:
-- Without ARRAY JOIN
select WatchID, RefererCategories from hits_v1 where WatchID = 4944118417295196513
-- Result:
┌─────────────WatchID─┬─RefererCategories─┐
│ 4944118417295196513 │ [6,98,456,8586]   │
└─────────────────────┴───────────────────┘

-- With ARRAY JOIN
select WatchID, RefererCategories \
from hits_v1 \
array join RefererCategories \
where WatchID = 4944118417295196513;
-- Result:
┌─────────────WatchID─┬─RefererCategories─┐
│ 4944118417295196513 │                 6 │
│ 4944118417295196513 │                98 │
│ 4944118417295196513 │               456 │
│ 4944118417295196513 │              8586 │
└─────────────────────┴───────────────────┘
ARRAY JOIN example with a Nested column:

-- Without ARRAY JOIN
select WatchID, ParsedParams.Key1, ParsedParams.Key2 from hits_v1 where WatchID = 5024825574842900819
-- Result:
┌─────────────WatchID─┬─ParsedParams.Key1───────────┬─ParsedParams.Key2───┐
│ 5024825574842900819 │ ['gen_timestamp','Toolbar'] │ ['group','true" /'] │
└─────────────────────┴─────────────────────────────┴─────────────────────┘

-- With ARRAY JOIN
select WatchID, ParsedParams.Key1, ParsedParams.Key2 \
from hits_v1 \
array join ParsedParams \
where WatchID = 5024825574842900819;
-- Result:
┌─────────────WatchID─┬─ParsedParams.Key1─┬─ParsedParams.Key2─┐
│ 5024825574842900819 │ gen_timestamp     │ group             │
│ 5024825574842900819 │ Toolbar           │ true" /           │
└─────────────────────┴───────────────────┴───────────────────┘
ARRAY JOIN also supports a left-join form, LEFT ARRAY JOIN.
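A minimal sketch of LEFT ARRAY JOIN on hits_v1: unlike a plain ARRAY JOIN, rows whose array is empty are kept, with the array column filled by a default value instead of the row being dropped.

-- Rows with an empty RefererCategories array are kept, with 0 as the value
select WatchID, RefererCategories \
from hits_v1 \
left array join RefererCategories \
limit 10;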
The LIMIT BY clause is different from the LIMIT clause of standard SQL. LIMIT n BY expression first groups the SELECT result by expression and then takes the first n rows of each group, similar to a per-category leaderboard. Example:
CREATE TABLE limit_by(id Int, val Int) ENGINE = Memory;
INSERT INTO limit_by VALUES (1, 10), (1, 11), (1, 12), (2, 20), (2, 21), (2, 22), (3, 31);

-- Sort the rows
SELECT * FROM limit_by ORDER BY id, val
┌─id─┬─val─┐
│  1 │  10 │
│  1 │  11 │
│  1 │  12 │
│  2 │  20 │
│  2 │  21 │
│  2 │  22 │
│  3 │  31 │
└────┴─────┘

-- Group by id, then take the first 2 rows of each group
SELECT * FROM limit_by ORDER BY id, val LIMIT 2 BY id
┌─id─┬─val─┐
│  1 │  10 │
│  1 │  11 │
│  2 │  20 │
│  2 │  21 │
│  3 │  31 │
└────┴─────┘

-- Note the difference from LIMIT 2
SELECT * FROM limit_by ORDER BY id, val LIMIT 2
┌─id─┬─val─┐
│  1 │  10 │
│  1 │  11 │
└────┴─────┘
The LIMIT BY clause also supports an offset: LIMIT n OFFSET m BY expression.
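A small sketch using the limit_by table created above, skipping the first row of each id group and returning the next one:

-- Skip 1 row per id group, then return the following 1 row of each group
SELECT * FROM limit_by ORDER BY id, val LIMIT 1 OFFSET 1 BY id
-- Expected result: (1, 11) and (2, 21); the id = 3 group has only one row,
-- so nothing is left after skipping it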
ClickHouse's INTO OUTFILE clause is similar to MySQL's INTO OUTFILE clause.

-- Export to the current directory
-- The default format is TSV
-- Note: the file name must be enclosed in single quotes; double quotes cause the error "Expected string literal"
-- A file with the same name must not already exist in the directory, otherwise an error is raised
select WatchID, JavaEnable, EventDate from hits_v1 limit 10 into outfile 'test.tsv'

-- Set the format to CSV; the format name must be all uppercase
select WatchID, JavaEnable, EventDate from hits_v1 limit 10 into outfile 'out.csv' format CSV
Viewing the execution plan
The execution plan can be viewed in the following ways:
- View the execution plan directly:
  clickhouse-client --send_logs_level=trace <<< 'select * from tutorial.hits_v1' > /dev/null
- Save the following to a script file, for example chx.sh:
  #!/bin/bash
  clickhouse-client --send_logs_level=trace <<< "$1" > /dev/null
  and then view the execution plan with bash chx.sh "select * from tutorial.hits_v1".
- Alternatively, while executing SQL in clickhouse-client, run tail -f clickhouse-server.log to follow the clickhouse-server log.

Execution plan example
An example of the execution-plan log for select * from tutorial.hits_v1:
tutorial.hits_v1 (SelectExecutor): Key condition: unknown
tutorial.hits_v1 (SelectExecutor): MinMax index condition: unknown
tutorial.hits_v1 (SelectExecutor): Selected 9 parts by date, 9 parts by key, 1094 marks to read from 9 ranges
MergeTreeSelectProcessor: Reading 1 ranges from part 201403_1_6_1, approx. 1900000 rows starting from 0
MergeTreeSelectProcessor: Reading 1 ranges from part 201403_7_12_1, approx. 1900000 rows starting from 0
MergeTreeSelectProcessor: Reading 1 ranges from part 201403_13_18_1, approx. 1800000 rows starting from 0
MergeTreeSelectProcessor: Reading 1 ranges from part 201403_19_24_1, approx. 1800000 rows starting from 0
MergeTreeSelectProcessor: Reading 1 ranges from part 201403_25_25_0, approx. 303104 rows starting from 0
MergeTreeSelectProcessor: Reading 1 ranges from part 201403_26_26_0, approx. 303104 rows starting from 0
MergeTreeSelectProcessor: Reading 1 ranges from part 201403_27_27_0, approx. 303104 rows starting from 0
MergeTreeSelectProcessor: Reading 1 ranges from part 201403_28_28_0, approx. 303104 rows starting from 0
MergeTreeSelectProcessor: Reading 1 ranges from part 201403_29_29_0, approx. 278528 rows starting from 0
InterpreterSelectQuery: FetchColumns -> Complete
executeQuery: Read 8873898 rows, 7.88 GiB in 47.964 sec., 185011 rows/sec., 168.22 MiB/sec.
MemoryTracker: Peak memory usage (for query): 204.54 MiB.

Explanation:
- Key condition: unknown: the primary key index was not used.
- MinMax index condition: unknown: the partition (minmax) index was not used.
- Selected 9 parts by date, 9 parts by key, 1094 marks to read from 9 ranges: 9 parts were scanned, 1094 marks in total.
- Read 8873898 rows, 7.88 GiB in 47.964 sec., 185011 rows/sec., 168.22 MiB/sec.: 8873898 rows (7.88 GiB of data) were read in 47.964 seconds.
- Peak memory usage (for query): 204.54 MiB.: the peak memory usage of the query was 204.54 MiB.
Change the SQL so that it returns only one column, select WatchID from tutorial.hits_v1, and look at the execution plan again:

tutorial.hits_v1 (SelectExecutor): Key condition: unknown
tutorial.hits_v1 (SelectExecutor): MinMax index condition: unknown
tutorial.hits_v1 (SelectExecutor): Selected 9 parts by date, 9 parts by key, 1094 marks to read from 9 ranges
MergeTreeSelectProcessor: Reading 1 ranges from part 201403_1_6_1, approx. 1900000 rows starting from 0
MergeTreeSelectProcessor: Reading 1 ranges from part 201403_7_12_1, approx. 1900000 rows starting from 0
MergeTreeSelectProcessor: Reading 1 ranges from part 201403_13_18_1, approx. 1800000 rows starting from 0
MergeTreeSelectProcessor: Reading 1 ranges from part 201403_19_24_1, approx. 1800000 rows starting from 0
MergeTreeSelectProcessor: Reading 1 ranges from part 201403_25_25_0, approx. 303104 rows starting from 0
MergeTreeSelectProcessor: Reading 1 ranges from part 201403_26_26_0, approx. 303104 rows starting from 0
MergeTreeSelectProcessor: Reading 1 ranges from part 201403_27_27_0, approx. 303104 rows starting from 0
MergeTreeSelectProcessor: Reading 1 ranges from part 201403_28_28_0, approx. 303104 rows starting from 0
MergeTreeSelectProcessor: Reading 1 ranges from part 201403_29_29_0, approx. 278528 rows starting from 0
InterpreterSelectQuery: FetchColumns -> Complete
executeQuery: Read 8873898 rows, 67.70 MiB in 0.318 sec., 27892532 rows/sec., 212.80 MiB/sec.
MemoryTracker: Peak memory usage (for query): 12.32 MiB.

A few changes:
- Read 8873898 rows, 67.70 MiB in 0.318 sec., 27892532 rows/sec., 212.80 MiB/sec.: the amount of data read drops from 7.88 GiB to 67.70 MiB, and the time drops from 47.964 seconds to 0.318 seconds.
- Peak memory usage (for query): 12.32 MiB.: the peak memory usage drops from 204.54 MiB to 12.32 MiB.
Next, change the SQL to restrict the query by partition: select WatchID from tutorial.hits_v1 where EventDate = '2014-03-17'. The execution-plan log:

executeQuery: (from 127.0.0.1:48102) SELECT WatchID FROM tutorial.hits_v1 WHERE EventDate = '2014-03-17'
InterpreterSelectQuery: MergeTreeWhereOptimizer: condition "EventDate = '2014-03-17'" moved to PREWHERE
ContextAccess (default): Access granted: SELECT(WatchID, EventDate) ON tutorial.hits_v1
tutorial.hits_v1 (SelectExecutor): Key condition: (column 1 in [16146, 16146])
tutorial.hits_v1 (SelectExecutor): MinMax index condition: (column 0 in [16146, 16146])
tutorial.hits_v1 (SelectExecutor): Selected 9 parts by date, 9 parts by key, 833 marks to read from 72 ranges
MergeTreeSelectProcessor: Reading 14 ranges from part 201403_1_6_1, approx. 818656, up to 818656 rows starting from 0
MergeTreeSelectProcessor: Reading 14 ranges from part 201403_7_12_1, approx. 1621472, up to 1621472 rows starting from 0
MergeTreeSelectProcessor: Reading 15 ranges from part 201403_13_18_1, approx. 1349440, up to 1349440 rows starting from 0
MergeTreeSelectProcessor: Reading 14 ranges from part 201403_19_24_1, approx. 1619776, up to 1619776 rows starting from 0
MergeTreeSelectProcessor: Reading 4 ranges from part 201403_25_25_0, approx. 278528, up to 278528 rows starting from 0
MergeTreeSelectProcessor: Reading 3 ranges from part 201403_26_26_0, approx. 286720, up to 286720 rows starting from 0
MergeTreeSelectProcessor: Reading 2 ranges from part 201403_27_27_0, approx. 294912, up to 294912 rows starting from 0
MergeTreeSelectProcessor: Reading 3 ranges from part 201403_28_28_0, approx. 229376, up to 229376 rows starting from 0
MergeTreeSelectProcessor: Reading 3 ranges from part 201403_29_29_0, approx. 253952, up to 253952 rows starting from 0
InterpreterSelectQuery: FetchColumns -> Complete
executeQuery: Read 6735786 rows, 61.49 MiB in 0.097 sec., 69655325 rows/sec., 635.86 MiB/sec.
MemoryTracker: Peak memory usage (for query): 16.32 MiB.

Explanation:
- MergeTreeWhereOptimizer: condition "EventDate = '2014-03-17'" moved to PREWHERE: the automatic PREWHERE optimization kicked in.
- Key condition: (column 1 in [16146, 16146]): the primary key index was used.
- MinMax index condition: (column 0 in [16146, 16146]): the partition (minmax) index was used.
- Selected 9 parts by date, 9 parts by key, 833 marks to read from 72 ranges: the number of marks to read drops from 1094 to 833.
- Read 6735786 rows, 61.49 MiB in 0.097 sec., 69655325 rows/sec., 635.86 MiB/sec.: the amount of data read goes from 67.70 MiB to 61.49 MiB, and the time drops from 0.318 seconds to 0.097 seconds.