Querying Data in ClickHouse

1. Query syntax

[WITH expr_list|(subquery)]
SELECT [DISTINCT] expr_list
[FROM [db.]table | (subquery) | table_function] [FINAL]
[SAMPLE sample_coeff]
[ARRAY JOIN ...]
[GLOBAL] [ANY|ALL|ASOF] [INNER|LEFT|RIGHT|FULL|CROSS] [OUTER|SEMI|ANTI] JOIN (subquery)|table (ON <expr_list>)|(USING <column_list>)
[PREWHERE expr]
[WHERE expr]
[GROUP BY expr_list] [WITH TOTALS]
[HAVING expr]
[ORDER BY expr_list] [WITH FILL] [FROM expr] [TO expr] [STEP expr]
[LIMIT [offset_value, ]n BY columns]
[LIMIT [n, ]m] [WITH TIES]
[UNION ALL ...]
[INTO OUTFILE filename]
[FORMAT format]

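As the syntax shows, the ClickHouse SELECT statement closely resembles the SELECT statement of standard SQL. The familiar clauses include:

  • SELECT: specifies the columns (expressions) to return
  • DISTINCT: removes duplicate rows
  • FROM: specifies the table, subquery, or table function to query
  • JOIN: joins tables; inner, left/right/full outer, and cross joins are supported
  • WHERE: filter condition
  • GROUP BY: grouping, used together with aggregate functions
  • HAVING: filters after grouping
  • ORDER BY: sorting
  • LIMIT: limits the number of rows returned
  • UNION ALL: combines result sets; at the time of writing ClickHouse supported only UNION ALL and not plain UNION (newer releases also accept UNION DISTINCT)

The ClickHouse SELECT statement also has some clauses of its own:

  • WITH: defines expressions (variables) to be used in the query
  • SAMPLE: data sampling, loosely comparable to the sample() function in pandas
  • PREWHERE: pre-filtering that improves performance
  • ARRAY JOIN: expands an array or nested column, turning one row into several
  • LIMIT BY: groups rows and takes the first n rows from each group
  • INTO OUTFILE: exports the query result to a file, with FORMAT specifying the file format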

2. WITH clause

-- Define a constant in the WITH clause, then reference it by its alias in the query
with '2014-03-17' as dt \
select count(1) from hits_v1 where EventDate = dt;


-- Define an expression in the WITH clause, then reference it by its alias in the SELECT clause
with round(Duration / 60) as duration_minutes \
select StartDate, max(duration_minutes) as max_duration_minutes from visits_v1 \
group by StartDate \
order by max_duration_minutes desc \
limit 10;


-- Define a scalar subquery in the WITH clause, then reference it by its alias in the SELECT clause
-- The subquery must return exactly one row
with ( \
  select sum(Duration) from visits_v1 \
) as total_duration \
select StartDate, sum(Duration) / total_duration as duration_percentage from visits_v1 \
group by StartDate \
limit 10;

 

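3. SAMPLE clause

The SAMPLE clause can be used to sample data from tables that use an engine of the MergeTree family and were created with a SAMPLE BY expression.

Example table: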

CREATE TABLE tutorial.hits_v1
(
  ...
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID)
SETTINGS index_granularity = 8192;

The hits_v1 table uses the MergeTree engine, and its SAMPLE BY expression is intHash32(UserID), a 32-bit hash of UserID.


SAMPLE clause examples:

-- Sample by ratio
-- Number of rows in the sample
select count(1) from hits_v1 sample 0.1
-- Sampled rows; by default at most 10000 rows are displayed
select CounterID, UserID, EventDate, EventTime  from hits_v1 sample 0.1
-- Sampled rows, limited to 10
select CounterID, UserID, EventDate, EventTime  from hits_v1 sample 0.1 limit 10


-- Sample by approximate row count
-- When the requested row count is small, the sample may contain 0 rows
select count(1) from hits_v1 sample 100
-- When the requested row count exceeds the index granularity, the sample size is close to the requested count
select count(1) from hits_v1 sample 10000
-- Sampled rows; by default at most 10000 rows are displayed
select CounterID, UserID, EventDate, EventTime  from hits_v1 sample 20000
-- Sampled rows, limited to 10
select CounterID, UserID, EventDate, EventTime  from hits_v1 sample 20000 limit 10


-- Sample by ratio with an offset; works like sampling by ratio
select CounterID, UserID, EventDate, EventTime  from hits_v1 sample 0.1 offset 0.3 limit 10

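Sampling with the SAMPLE clause is deterministic and approximate:

  • Deterministic (idempotent): as long as the data and the sampling condition are unchanged, repeating the query returns the same sample
  • Approximate: neither the sampled range nor the sampled result is guaranteed to be exact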
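
Because a sample covers only part of the data, aggregates computed on it usually need to be scaled back up. The following is a minimal sketch against the hits_v1 table above; the multiplier 10 is simply 1 / 0.1, and _sample_factor is the virtual column ClickHouse exposes for sampled reads. The results are estimates, not exact counts.

```sql
-- Ratio sampling: the coefficient is known, so scale manually (1 / 0.1 = 10)
SELECT count() * 10 AS estimated_rows
FROM hits_v1
SAMPLE 0.1;

-- Row-count sampling: the effective ratio is chosen at run time,
-- so scale with the _sample_factor virtual column instead
SELECT sum(_sample_factor) AS estimated_rows
FROM hits_v1
SAMPLE 10000;
```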

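4. PREWHERE clause

Only tables with a MergeTree-family engine can use the PREWHERE clause. PREWHERE can be regarded as ClickHouse's optimization of the WHERE clause: the columns in the PREWHERE condition are read first, and the remaining columns are read only for the blocks in which the condition matches.

By default ClickHouse automatically moves suitable WHERE conditions to PREWHERE:

-- optimize_move_to_prewhere = 1 means the automatic PREWHERE optimization is enabled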
select name, value from system.settings where name like '%prewhere%'
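
PREWHERE can also be written explicitly, and the automatic rewrite can be switched off for a session, which is convenient when comparing the two behaviours in the trace log. A sketch using hits_v1 columns that appear elsewhere in this article; which condition goes into PREWHERE here is an illustrative choice, not a tuning recommendation:

```sql
-- Explicit PREWHERE: EventDate is read and filtered first;
-- WatchID and JavaEnable are then read only for the matching blocks
SELECT WatchID, JavaEnable
FROM hits_v1
PREWHERE EventDate = '2014-03-17'
WHERE JavaEnable = 1;

-- Disable the automatic WHERE -> PREWHERE rewrite for the current session
SET optimize_move_to_prewhere = 0;
```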

 

5. ARRAY JOIN clause

The ARRAY JOIN clause joins each row with its own Array or Nested column, expanding one row that holds an array into multiple rows.

ARRAY JOIN with an array:

-- Without ARRAY JOIN
select WatchID, RefererCategories from hits_v1 where WatchID = 4944118417295196513

-- Result:
┌─────────────WatchID─┬─RefererCategories─┐
│ 4944118417295196513 │ [6,98,456,8586]   │
└─────────────────────┴───────────────────┘


-- With ARRAY JOIN
select WatchID, RefererCategories \
from hits_v1 \
array join RefererCategories \
where WatchID = 4944118417295196513;

-- Result:
┌─────────────WatchID─┬─RefererCategories─┐
│ 4944118417295196513 │                 6 │
│ 4944118417295196513 │                98 │
│ 4944118417295196513 │               456 │
│ 4944118417295196513 │              8586 │
└─────────────────────┴───────────────────┘

ARRAY JOIN with a Nested column:

-- Without ARRAY JOIN
select WatchID, ParsedParams.Key1, ParsedParams.Key2 from hits_v1 where WatchID = 5024825574842900819

-- Result:
┌─────────────WatchID─┬─ParsedParams.Key1───────────┬─ParsedParams.Key2───┐
│ 5024825574842900819 │ ['gen_timestamp','Toolbar'] │ ['group','true" /'] │
└─────────────────────┴─────────────────────────────┴─────────────────────┘


-- With ARRAY JOIN
select WatchID, ParsedParams.Key1, ParsedParams.Key2 \
from hits_v1 \
array join ParsedParams \
where WatchID = 5024825574842900819;

-- Result:
┌─────────────WatchID─┬─ParsedParams.Key1─┬─ParsedParams.Key2─┐
│ 5024825574842900819 │ gen_timestamp     │ group             │
│ 5024825574842900819 │ Toolbar           │ true" /           │
└─────────────────────┴───────────────────┴───────────────────┘

ARRAY JOIN also has a left-join variant, LEFT ARRAY JOIN, which keeps rows whose array is empty.
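
The difference between the two forms shows up only for empty arrays. A small self-contained sketch; the table array_join_demo and its contents are made up for illustration:

```sql
CREATE TABLE array_join_demo (s String, arr Array(UInt8)) ENGINE = Memory;
INSERT INTO array_join_demo VALUES ('Hello', [1, 2]), ('World', [3, 4, 5]), ('Goodbye', []);

-- ARRAY JOIN: the 'Goodbye' row has an empty array and is dropped (5 rows)
SELECT s, arr FROM array_join_demo ARRAY JOIN arr;

-- LEFT ARRAY JOIN: the 'Goodbye' row is kept, with arr set to the
-- default value of the element type (0 for UInt8), giving 6 rows
SELECT s, arr FROM array_join_demo LEFT ARRAY JOIN arr;
```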

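6. LIMIT BY clause

The LIMIT BY clause is different from the LIMIT clause of standard SQL.

LIMIT n BY expression groups the SELECT result by expression and then returns the first n rows of each group, similar to a per-category top-n ranking.

Example: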

CREATE TABLE limit_by(id Int, val Int) ENGINE = Memory;
INSERT INTO limit_by VALUES (1, 10), (1, 11), (1, 12), (2, 20), (2, 21), (2, 22), (3, 31);
-- Sorted
SELECT * FROM limit_by ORDER BY id, val

┌─id─┬─val─┐
│  1 │  10 │
│  1 │  11 │
│  1 │  12 │
│  2 │  20 │
│  2 │  21 │
│  2 │  22 │
│  3 │  31 │
└────┴─────┘

-- Sort, then take the first 2 rows within each id group
SELECT * FROM limit_by ORDER BY id, val LIMIT 2 BY id

┌─id─┬─val─┐
│  1 │  10 │
│  1 │  11 │
│  2 │  20 │
│  2 │  21 │
│  3 │  31 │
└────┴─────┘


-- Note the difference from a plain LIMIT 2
SELECT * FROM limit_by ORDER BY id, val LIMIT 2

┌─id─┬─val─┐
│  1 │  10 │
│  1 │  11 │
└────┴─────┘

The LIMIT BY clause also supports an offset: LIMIT n OFFSET m BY expression
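
Applied to the limit_by table above, the offset form skips the first m rows of every group before taking n. The rows named in the comment follow from the data inserted earlier:

```sql
-- Skip the first row of each id group, then take one row per group
SELECT * FROM limit_by ORDER BY id, val LIMIT 1 OFFSET 1 BY id;
-- Returns (1, 11) and (2, 21); id = 3 has only one row,
-- so nothing is left for it after the offset
```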

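7. INTO OUTFILE clause

ClickHouse's INTO OUTFILE clause is similar to MySQL's INTO OUTFILE clause.

-- Write to the current directory
-- The default format is TSV
-- The file name must be enclosed in single quotes; double quotes cause the error: Expected string literal
-- A file with the same name must not already exist in the directory, otherwise an error is raised
select WatchID, JavaEnable, EventDate  from hits_v1 limit 10 into outfile 'test.tsv'

-- Set the format to CSV; the format name CSV must be written in upper case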
select WatchID, JavaEnable, EventDate  from hits_v1 limit 10 into outfile 'out.csv' format CSV
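
Any output format name can follow FORMAT. As a sketch, here is the same query written with two other built-in formats; the output file names are arbitrary:

```sql
-- CSV with a header row containing the column names
SELECT WatchID, JavaEnable, EventDate FROM hits_v1 LIMIT 10
INTO OUTFILE 'out_with_names.csv' FORMAT CSVWithNames;

-- One JSON object per line
SELECT WatchID, JavaEnable, EventDate FROM hits_v1 LIMIT 10
INTO OUTFILE 'out.jsonl' FORMAT JSONEachRow;
```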

 

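8. Execution plan

Viewing the execution plan

The execution plan can be viewed in the following ways, all of which rely on the server's trace-level log:

  1. View it directly:

    clickhouse-client --send_logs_level=trace <<< 'select * from tutorial.hits_v1' > /dev/null

  2. Save the following to a script file, for example chx.sh:
    #!/bin/bash
    
    clickhouse-client --send_logs_level=trace <<< "$1" > /dev/null
    

    Then run bash chx.sh "select * from tutorial.hits_v1" to view the execution plan.

  3. Alternatively, while running the SQL in clickhouse-client, follow the clickhouse-server log with tail -f clickhouse-server.log.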
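
Newer ClickHouse releases also provide an EXPLAIN statement, which prints the plan without executing the query or reading the logs. Whether it is available, and which options it accepts, depends on the server version, so treat the following as a sketch to try rather than something guaranteed on the version used in this article:

```sql
-- Show the query plan for the statement
EXPLAIN
SELECT WatchID FROM tutorial.hits_v1 WHERE EventDate = '2014-03-17';

-- Also show which indexes (partition, minmax, primary key) are used
-- and how many parts and granules each one leaves to read
EXPLAIN indexes = 1
SELECT WatchID FROM tutorial.hits_v1 WHERE EventDate = '2014-03-17';
```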

Execution plan examples

Example trace log for select * from tutorial.hits_v1:

 tutorial.hits_v1 (SelectExecutor): Key condition: unknown
 tutorial.hits_v1 (SelectExecutor): MinMax index condition: unknown
 tutorial.hits_v1 (SelectExecutor): Selected 9 parts by date, 9 parts by key, 1094 marks to read from 9 ranges
 MergeTreeSelectProcessor: Reading 1 ranges from part 201403_1_6_1, approx. 1900000 rows starting from 0
 MergeTreeSelectProcessor: Reading 1 ranges from part 201403_7_12_1, approx. 1900000 rows starting from 0
 MergeTreeSelectProcessor: Reading 1 ranges from part 201403_13_18_1, approx. 1800000 rows starting from 0
 MergeTreeSelectProcessor: Reading 1 ranges from part 201403_19_24_1, approx. 1800000 rows starting from 0
 MergeTreeSelectProcessor: Reading 1 ranges from part 201403_25_25_0, approx. 303104 rows starting from 0
 MergeTreeSelectProcessor: Reading 1 ranges from part 201403_26_26_0, approx. 303104 rows starting from 0
 MergeTreeSelectProcessor: Reading 1 ranges from part 201403_27_27_0, approx. 303104 rows starting from 0
 MergeTreeSelectProcessor: Reading 1 ranges from part 201403_28_28_0, approx. 303104 rows starting from 0
 MergeTreeSelectProcessor: Reading 1 ranges from part 201403_29_29_0, approx. 278528 rows starting from 0
 InterpreterSelectQuery: FetchColumns -> Complete
 executeQuery: Read 8873898 rows, 7.88 GiB in 47.964 sec., 185011 rows/sec., 168.22 MiB/sec.
 MemoryTracker: Peak memory usage (for query): 204.54 MiB.

Notes:

  • Key condition: unknown: the primary key index was not used.
  • MinMax index condition: unknown: the partition (MinMax) index was not used.
  • Selected 9 parts by date, 9 parts by key, 1094 marks to read from 9 ranges: 9 data parts were scanned (all of them in partition 201403), 1094 marks in total.
  • Read 8873898 rows, 7.88 GiB in 47.964 sec., 185011 rows/sec., 168.22 MiB/sec.: 8873898 rows were read, 7.88 GiB of data, in 47.964 seconds.
  • Peak memory usage (for query): 204.54 MiB.: peak memory usage was 204.54 MiB.

Change the SQL to return a single column, select WatchID from tutorial.hits_v1, and look at the execution plan again:

 tutorial.hits_v1 (SelectExecutor): Key condition: unknown
 tutorial.hits_v1 (SelectExecutor): MinMax index condition: unknown
 tutorial.hits_v1 (SelectExecutor): Selected 9 parts by date, 9 parts by key, 1094 marks to read from 9 ranges
 MergeTreeSelectProcessor: Reading 1 ranges from part 201403_1_6_1, approx. 1900000 rows starting from 0
 MergeTreeSelectProcessor: Reading 1 ranges from part 201403_7_12_1, approx. 1900000 rows starting from 0
 MergeTreeSelectProcessor: Reading 1 ranges from part 201403_13_18_1, approx. 1800000 rows starting from 0
 MergeTreeSelectProcessor: Reading 1 ranges from part 201403_19_24_1, approx. 1800000 rows starting from 0
 MergeTreeSelectProcessor: Reading 1 ranges from part 201403_25_25_0, approx. 303104 rows starting from 0
 MergeTreeSelectProcessor: Reading 1 ranges from part 201403_26_26_0, approx. 303104 rows starting from 0
 MergeTreeSelectProcessor: Reading 1 ranges from part 201403_27_27_0, approx. 303104 rows starting from 0
 MergeTreeSelectProcessor: Reading 1 ranges from part 201403_28_28_0, approx. 303104 rows starting from 0
 MergeTreeSelectProcessor: Reading 1 ranges from part 201403_29_29_0, approx. 278528 rows starting from 0
 InterpreterSelectQuery: FetchColumns -> Complete
 executeQuery: Read 8873898 rows, 67.70 MiB in 0.318 sec., 27892532 rows/sec., 212.80 MiB/sec.
 MemoryTracker: Peak memory usage (for query): 12.32 MiB.

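What changed:

  • Read 8873898 rows, 67.70 MiB in 0.318 sec., 27892532 rows/sec., 212.80 MiB/sec.: because ClickHouse stores data by column, only the WatchID column is read; the data read drops from 7.88 GiB to 67.70 MiB and the elapsed time from 47.964 seconds to 0.318 seconds.
  • Peak memory usage (for query): 12.32 MiB.: peak memory usage drops from 204.54 MiB to 12.32 MiB.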
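
How much each column contributes to the table size can be checked in the system.columns table. A sketch, assuming the table lives in the tutorial database as in the queries above; the sizes reported depend on the local data:

```sql
-- Ten largest columns of tutorial.hits_v1 by uncompressed size
SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE database = 'tutorial' AND table = 'hits_v1'
ORDER BY data_uncompressed_bytes DESC
LIMIT 10;
```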

Next, change the SQL to filter by date (the table is partitioned by toYYYYMM(EventDate)):

select WatchID from tutorial.hits_v1 where EventDate = '2014-03-17'

Trace log:

 executeQuery: (from 127.0.0.1:48102) SELECT WatchID FROM tutorial.hits_v1 WHERE EventDate = '2014-03-17'
 InterpreterSelectQuery: MergeTreeWhereOptimizer: condition "EventDate = '2014-03-17'" moved to PREWHERE
 ContextAccess (default): Access granted: SELECT(WatchID, EventDate) ON tutorial.hits_v1
 tutorial.hits_v1 (SelectExecutor): Key condition: (column 1 in [16146, 16146])
 tutorial.hits_v1 (SelectExecutor): MinMax index condition: (column 0 in [16146, 16146])
 tutorial.hits_v1 (SelectExecutor): Selected 9 parts by date, 9 parts by key, 833 marks to read from 72 ranges
 MergeTreeSelectProcessor: Reading 14 ranges from part 201403_1_6_1, approx. 818656, up to 818656 rows starting from 0
 MergeTreeSelectProcessor: Reading 14 ranges from part 201403_7_12_1, approx. 1621472, up to 1621472 rows starting from 0
 MergeTreeSelectProcessor: Reading 15 ranges from part 201403_13_18_1, approx. 1349440, up to 1349440 rows starting from 0
 MergeTreeSelectProcessor: Reading 14 ranges from part 201403_19_24_1, approx. 1619776, up to 1619776 rows starting from 0
 MergeTreeSelectProcessor: Reading 4 ranges from part 201403_25_25_0, approx. 278528, up to 278528 rows starting from 0
 MergeTreeSelectProcessor: Reading 3 ranges from part 201403_26_26_0, approx. 286720, up to 286720 rows starting from 0
 MergeTreeSelectProcessor: Reading 2 ranges from part 201403_27_27_0, approx. 294912, up to 294912 rows starting from 0
 MergeTreeSelectProcessor: Reading 3 ranges from part 201403_28_28_0, approx. 229376, up to 229376 rows starting from 0
 MergeTreeSelectProcessor: Reading 3 ranges from part 201403_29_29_0, approx. 253952, up to 253952 rows starting from 0
 InterpreterSelectQuery: FetchColumns -> Complete
 executeQuery: Read 6735786 rows, 61.49 MiB in 0.097 sec., 69655325 rows/sec., 635.86 MiB/sec.
 MemoryTracker: Peak memory usage (for query): 16.32 MiB.

Notes:

  • MergeTreeWhereOptimizer: condition "EventDate = '2014-03-17'" moved to PREWHERE: the PREWHERE optimization was applied.
  • Key condition: (column 1 in [16146, 16146]): the primary key index was used.
  • MinMax index condition: (column 0 in [16146, 16146]): the partition (MinMax) index was used.
  • Selected 9 parts by date, 9 parts by key, 833 marks to read from 72 ranges: the number of marks to read drops from 1094 to 833.
  • Read 6735786 rows, 61.49 MiB in 0.097 sec., 69655325 rows/sec., 635.86 MiB/sec.: rows read drop from 8873898 to 6735786, data read from 67.70 MiB to 61.49 MiB, and elapsed time from 0.318 seconds to 0.097 seconds.
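
The value 16146 in both conditions is the date literal converted to ClickHouse's internal Date representation, the number of days since 1970-01-01. This can be checked directly:

```sql
SELECT
    toUInt16(toDate('2014-03-17')) AS day_number,   -- 16146
    toDate(16146)                  AS back_to_date; -- 2014-03-17
```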
