官方文档
ClickHouse 中物化视图(Materialized View)是一种预先计算并缓存结果的视图,它存储在磁盘上并自动更新,典型的空间换时间思路。物化视图是一种优化技术,它可以加速查询操作,降低系统负载,并提高查询性能。
创建语法:
CREATE [MATERIALIZED] VIEW [IF NOT EXISTS] [db.]table_name [TO[db.]name] [ENGINE = engine] [POPULATE] AS SELECT ...
当你创建一个物化视图时,ClickHouse 会计算该视图的结果,并将结果存储在磁盘上。然后,当你查询该视图时,ClickHouse 会直接从磁盘上的结果中获取数据,而不需要重新计算。
物化视图可以基于一个或多个表创建,并可以使用 SQL 查询语句定义。它可以使用各种查询操作进行更新,例如 Insert、Update、Delete 。当数据源表发生更改时,物化视图会自动更新,以保持结果的一致性。
注意:使用物化视图,可以在查询性能和数据一致性之间进行权衡。物化视图可以提高查询性能,但会增加数据更新和维护的开销。
这边是以官方提供的数据来操作。example-datasets
创建数据库以及表,这里给出 sql,也可以去上面给的地址拿:
DROP DATABASE IF EXISTS git;
CREATE DATABASE git;
CREATE TABLE git.commits
(
hash String,
author LowCardinality(String),
time DateTime,
message String,
files_added UInt32,
files_deleted UInt32,
files_renamed UInt32,
files_modified UInt32,
lines_added UInt32,
lines_deleted UInt32,
hunks_added UInt32,
hunks_removed UInt32,
hunks_changed UInt32
) ENGINE = MergeTree ORDER BY time;
CREATE TABLE git.file_changes
(
change_type Enum('Add' = 1, 'Delete' = 2, 'Modify' = 3, 'Rename' = 4, 'Copy' = 5, 'Type' = 6),
path LowCardinality(String),
old_path LowCardinality(String),
file_extension LowCardinality(String),
lines_added UInt32,
lines_deleted UInt32,
hunks_added UInt32,
hunks_removed UInt32,
hunks_changed UInt32,
commit_hash String,
author LowCardinality(String),
time DateTime,
commit_message String,
commit_files_added UInt32,
commit_files_deleted UInt32,
commit_files_renamed UInt32,
commit_files_modified UInt32,
commit_lines_added UInt32,
commit_lines_deleted UInt32,
commit_hunks_added UInt32,
commit_hunks_removed UInt32,
commit_hunks_changed UInt32
) ENGINE = MergeTree ORDER BY time;
CREATE TABLE git.line_changes
(
sign Int8,
line_number_old UInt32,
line_number_new UInt32,
hunk_num UInt32,
hunk_start_line_number_old UInt32,
hunk_start_line_number_new UInt32,
hunk_lines_added UInt32,
hunk_lines_deleted UInt32,
hunk_context LowCardinality(String),
line LowCardinality(String),
indent UInt8,
line_type Enum('Empty' = 0, 'Comment' = 1, 'Punct' = 2, 'Code' = 3),
prev_commit_hash String,
prev_author LowCardinality(String),
prev_time DateTime,
file_change_type Enum('Add' = 1, 'Delete' = 2, 'Modify' = 3, 'Rename' = 4, 'Copy' = 5, 'Type' = 6),
path LowCardinality(String),
old_path LowCardinality(String),
file_extension LowCardinality(String),
file_lines_added UInt32,
file_lines_deleted UInt32,
file_hunks_added UInt32,
file_hunks_removed UInt32,
file_hunks_changed UInt32,
commit_hash String,
author LowCardinality(String),
time DateTime,
commit_message String,
commit_files_added UInt32,
commit_files_deleted UInt32,
commit_files_renamed UInt32,
commit_files_modified UInt32,
commit_lines_added UInt32,
commit_lines_deleted UInt32,
commit_hunks_added UInt32,
commit_hunks_removed UInt32,
commit_hunks_changed UInt32
) ENGINE = MergeTree ORDER BY time;
使用s3 函数,INSERT INTO SELECT 插入数据
INSERT INTO git.commits SELECT *
FROM s3('https://datasets-documentation.s3.amazonaws.com/github/commits/clickhouse/commits.tsv.xz', 'TSV', 'hash String,author LowCardinality(String), time DateTime, message String, files_added UInt32, files_deleted UInt32, files_renamed UInt32, files_modified UInt32, lines_added UInt32, lines_deleted UInt32, hunks_added UInt32, hunks_removed UInt32, hunks_changed UInt32');
INSERT INTO git.file_changes SELECT *
FROM s3('https://datasets-documentation.s3.amazonaws.com/github/commits/clickhouse/file_changes.tsv.xz', 'TSV', 'change_type Enum(\'Add\' = 1, \'Delete\' = 2, \'Modify\' = 3, \'Rename\' = 4, \'Copy\' = 5, \'Type\' = 6), path LowCardinality(String), old_path LowCardinality(String), file_extension LowCardinality(String), lines_added UInt32, lines_deleted UInt32, hunks_added UInt32, hunks_removed UInt32, hunks_changed UInt32, commit_hash String, author LowCardinality(String), time DateTime, commit_message String, commit_files_added UInt32, commit_files_deleted UInt32, commit_files_renamed UInt32, commit_files_modified UInt32, commit_lines_added UInt32, commit_lines_deleted UInt32, commit_hunks_added UInt32, commit_hunks_removed UInt32, commit_hunks_changed UInt32');
INSERT INTO git.line_changes SELECT *
FROM s3('https://datasets-documentation.s3.amazonaws.com/github/commits/clickhouse/line_changes.tsv.xz', 'TSV', 'sign Int8, line_number_old UInt32, line_number_new UInt32, hunk_num UInt32, hunk_start_line_number_old UInt32, hunk_start_line_number_new UInt32, hunk_lines_added UInt32,\n hunk_lines_deleted UInt32, hunk_context LowCardinality(String), line LowCardinality(String), indent UInt8, line_type Enum(\'Empty\' = 0, \'Comment\' = 1, \'Punct\' = 2, \'Code\' = 3), prev_commit_hash String, prev_author LowCardinality(String), prev_time DateTime, file_change_type Enum(\'Add\' = 1, \'Delete\' = 2, \'Modify\' = 3, \'Rename\' = 4, \'Copy\' = 5, \'Type\' = 6),\n path LowCardinality(String), old_path LowCardinality(String), file_extension LowCardinality(String), file_lines_added UInt32, file_lines_deleted UInt32, file_hunks_added UInt32, file_hunks_removed UInt32, file_hunks_changed UInt32, commit_hash String,\n author LowCardinality(String), time DateTime, commit_message String, commit_files_added UInt32, commit_files_deleted UInt32, commit_files_renamed UInt32, commit_files_modified UInt32, commit_lines_added UInt32, commit_lines_deleted UInt32, commit_hunks_added UInt32, commit_hunks_removed UInt32, commit_hunks_changed UInt32');
创建一个物化视图,查每个用户每天 commits 数量:
create materialized view git.commits_mv
engine SummingMergeTree
order by (dt, author)
as select
toDate(time) as dt, author, count() as n from git.commits group by dt, author order by dt asc;
SummingMergeTree 表引擎主要用于只关心聚合后的数据,而不关心明细数据的场景,它能够在合并分区的时候按照预先定义的条件聚合汇总数据,将同一分组下的多行数据汇总到一行,可以显著的 减少存储空间并加快数据查询的速度。
注意:这里创建物化视图,并没有数据。需要写入数据,后面会提到。至于为什么不用 POPULATE,因为在填充历史数据的期间, 新进入的这部分数据会被忽略掉, 所以如果对准确性要求非常高, 应慎用。
-- POPULATE 版
create materialized view git.commits_mv
engine SummingMergeTree
order by (dt, author)
POPULATE as select
toDate(time) as dt, author, count() as n from git.commits group by dt, author order by dt asc;
-- ClickHouse 官方并不推荐使用 populated,因为在创建视图过程中插入表中的数据并不会写入视图,会造成数据的丢失。
如果创建时,无使用 POPULATE 的话,通过 insert into 写入数据:
insert into git.commits_mv
select toDate(time) as dt, author, count() as n from git.commits group by dt, author order by dt asc;
如果无报错的话,此时应该是能看视图的数据的。也可以验证下,在源数据有新增的情况下,是否会更新到视图里:
-- 写一条数据看看是否会自动更新视图
insert into git.commits (hash, author, time, message, files_added, files_deleted, files_renamed, files_modified, lines_added, lines_deleted, hunks_added, hunks_removed, hunks_changed) values ('488610bd96415bdb8a718135676cxdf6a665829922', 'Nikita Taranov', '2022-11-30 18:22:24', 'impl (#43709)', 2, 0, 0, 3, 50, 31, 5, 1, 1);
结果:是会更新的。但是你多新增几条的话,commits_mv 视图里,并没有对其汇总?在使用物化视图(SummingMergeTree 引擎)的时候,也需要按照聚合查询来写 sql,因为虽然 SummingMergeTree
会自己预聚合,但是并不是实时的,具体执行聚合的时机并 不可控。
即查询的 sql 如下:
select dt, author, sum(n) from git.commits_mv group by dt ,author order by dt desc;