ClickHouse: Materialized Views

Official documentation

Contents

        • What is a materialized view
        • How a materialized view works
        • Usage example
        • Notes

What is a materialized view

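In ClickHouse, a materialized view is a view whose result is computed ahead of time, stored on disk, and kept up to date automatically as new data is inserted into the source table. It is a classic trade of space for time: queries read the precomputed result instead of scanning the raw data, so they run faster and put less load on the system.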

Creation syntax:

CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]table_name [TO [db.]name] [ENGINE = engine] [POPULATE] AS SELECT ...
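
The syntax above has two common forms: the view either owns an implicit inner table (ENGINE given on the view) or writes into an existing table named after TO. The example later in this article uses the first form. The following is a minimal sketch of the second form, assuming the git.commits table from the usage example already exists. The names git.commits_daily and git.commits_daily_mv are hypothetical and used only for illustration.

```sql
-- Hypothetical target table: holds the pre-aggregated rows.
CREATE TABLE IF NOT EXISTS git.commits_daily
(
    dt Date,
    author LowCardinality(String),
    n UInt64
) ENGINE = SummingMergeTree ORDER BY (dt, author);

-- Hypothetical view: forwards each inserted block of git.commits,
-- after aggregation, into git.commits_daily.
CREATE MATERIALIZED VIEW IF NOT EXISTS git.commits_daily_mv
TO git.commits_daily
AS SELECT
    toDate(time) AS dt,
    author,
    count() AS n
FROM git.commits
GROUP BY dt, author;
```

With the TO form the data lives in an ordinary table that you can query, back up, or alter directly. As far as I know, POPULATE cannot be combined with TO, so historical data has to be loaded with a separate INSERT INTO ... SELECT.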

How a materialized view works

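When you create a materialized view, ClickHouse stores its data on disk in a target table: either an implicit inner table or the table named after TO. Rows already in the source table are processed at creation time only if POPULATE is specified. When you later query the view, ClickHouse reads the stored result directly instead of recomputing it.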

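A materialized view is defined by a SQL SELECT statement, which may involve one or more tables. It behaves like an insert trigger: each block of rows inserted into the source table (the leftmost table in the SELECT) is passed through the view's query, and the result is written to the view. Only inserts are propagated. Updates and deletes on the source table (for example ALTER TABLE ... UPDATE or ALTER TABLE ... DELETE) do not change the rows already stored in the view, so the view stays consistent with the source only with respect to inserted data.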

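Note: a materialized view lets you trade off query performance against data consistency. It speeds up queries, but it adds overhead to every write and has to be maintained alongside the source table.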
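
To make the workflow concrete, here is a small sketch on scratch objects. The database mv_demo and everything in it are hypothetical names chosen for this illustration. It shows that a view created without POPULATE sees only rows inserted after it exists, and that a delete on the source table is not reflected in the view.

```sql
-- Hypothetical scratch database and table, for illustration only.
CREATE DATABASE IF NOT EXISTS mv_demo;

CREATE TABLE mv_demo.events
(
    id UInt32,
    category String
) ENGINE = MergeTree ORDER BY id;

-- This row exists before the view is created, so the view never sees it.
INSERT INTO mv_demo.events VALUES (1, 'a');

CREATE MATERIALIZED VIEW mv_demo.events_per_category
ENGINE = SummingMergeTree ORDER BY category
AS SELECT category, count() AS n
FROM mv_demo.events
GROUP BY category;

-- Only this insert, made after the view exists, passes through the view's SELECT.
INSERT INTO mv_demo.events VALUES (2, 'a'), (3, 'b');

-- Should count rows 2 and 3 only (one row per category).
SELECT category, sum(n) AS n
FROM mv_demo.events_per_category
GROUP BY category
ORDER BY category;

-- A mutation on the source table leaves the view's stored rows untouched:
-- rerunning the SELECT above afterwards should return the same counts.
ALTER TABLE mv_demo.events DELETE WHERE id = 2;

-- Clean up the scratch objects.
DROP DATABASE mv_demo;
```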

Usage example

This example uses the dataset provided in the official documentation: example-datasets

  • Create the database and tables. The SQL is given here; it is also available at the address above:

    DROP DATABASE IF EXISTS git;
    CREATE DATABASE git;
    
    CREATE TABLE git.commits
    (
        hash String,
        author LowCardinality(String),
        time DateTime,
        message String,
        files_added UInt32,
        files_deleted UInt32,
        files_renamed UInt32,
        files_modified UInt32,
        lines_added UInt32,
        lines_deleted UInt32,
        hunks_added UInt32,
        hunks_removed UInt32,
        hunks_changed UInt32
    ) ENGINE = MergeTree ORDER BY time;
    
    CREATE TABLE git.file_changes
    (
        change_type Enum('Add' = 1, 'Delete' = 2, 'Modify' = 3, 'Rename' = 4, 'Copy' = 5, 'Type' = 6),
        path LowCardinality(String),
        old_path LowCardinality(String),
        file_extension LowCardinality(String),
        lines_added UInt32,
        lines_deleted UInt32,
        hunks_added UInt32,
        hunks_removed UInt32,
        hunks_changed UInt32,
    
        commit_hash String,
        author LowCardinality(String),
        time DateTime,
        commit_message String,
        commit_files_added UInt32,
        commit_files_deleted UInt32,
        commit_files_renamed UInt32,
        commit_files_modified UInt32,
        commit_lines_added UInt32,
        commit_lines_deleted UInt32,
        commit_hunks_added UInt32,
        commit_hunks_removed UInt32,
        commit_hunks_changed UInt32
    ) ENGINE = MergeTree ORDER BY time;
    
    CREATE TABLE git.line_changes
    (
        sign Int8,
        line_number_old UInt32,
        line_number_new UInt32,
        hunk_num UInt32,
        hunk_start_line_number_old UInt32,
        hunk_start_line_number_new UInt32,
        hunk_lines_added UInt32,
        hunk_lines_deleted UInt32,
        hunk_context LowCardinality(String),
        line LowCardinality(String),
        indent UInt8,
        line_type Enum('Empty' = 0, 'Comment' = 1, 'Punct' = 2, 'Code' = 3),
    
        prev_commit_hash String,
        prev_author LowCardinality(String),
        prev_time DateTime,
    
        file_change_type Enum('Add' = 1, 'Delete' = 2, 'Modify' = 3, 'Rename' = 4, 'Copy' = 5, 'Type' = 6),
        path LowCardinality(String),
        old_path LowCardinality(String),
        file_extension LowCardinality(String),
        file_lines_added UInt32,
        file_lines_deleted UInt32,
        file_hunks_added UInt32,
        file_hunks_removed UInt32,
        file_hunks_changed UInt32,
    
        commit_hash String,
        author LowCardinality(String),
        time DateTime,
        commit_message String,
        commit_files_added UInt32,
        commit_files_deleted UInt32,
        commit_files_renamed UInt32,
        commit_files_modified UInt32,
        commit_lines_added UInt32,
        commit_lines_deleted UInt32,
        commit_hunks_added UInt32,
        commit_hunks_removed UInt32,
        commit_hunks_changed UInt32
    ) ENGINE = MergeTree ORDER BY time;
    
  • Load the data with the s3 function and INSERT INTO ... SELECT:

    INSERT INTO git.commits SELECT *
    FROM s3('https://datasets-documentation.s3.amazonaws.com/github/commits/clickhouse/commits.tsv.xz', 'TSV', 'hash String,author LowCardinality(String), time DateTime, message String, files_added UInt32, files_deleted UInt32, files_renamed UInt32, files_modified UInt32, lines_added UInt32, lines_deleted UInt32, hunks_added UInt32, hunks_removed UInt32, hunks_changed UInt32');
    
    INSERT INTO git.file_changes SELECT *
    FROM s3('https://datasets-documentation.s3.amazonaws.com/github/commits/clickhouse/file_changes.tsv.xz', 'TSV', 'change_type Enum(\'Add\' = 1, \'Delete\' = 2, \'Modify\' = 3, \'Rename\' = 4, \'Copy\' = 5, \'Type\' = 6), path LowCardinality(String), old_path LowCardinality(String), file_extension LowCardinality(String), lines_added UInt32, lines_deleted UInt32, hunks_added UInt32, hunks_removed UInt32, hunks_changed UInt32, commit_hash String, author LowCardinality(String), time DateTime, commit_message String, commit_files_added UInt32, commit_files_deleted UInt32, commit_files_renamed UInt32, commit_files_modified UInt32, commit_lines_added UInt32, commit_lines_deleted UInt32, commit_hunks_added UInt32, commit_hunks_removed UInt32, commit_hunks_changed UInt32');
    
    INSERT INTO git.line_changes SELECT *
    FROM s3('https://datasets-documentation.s3.amazonaws.com/github/commits/clickhouse/line_changes.tsv.xz', 'TSV', 'sign Int8, line_number_old UInt32, line_number_new UInt32, hunk_num UInt32, hunk_start_line_number_old UInt32, hunk_start_line_number_new UInt32, hunk_lines_added UInt32,\n    hunk_lines_deleted UInt32, hunk_context LowCardinality(String), line LowCardinality(String), indent UInt8, line_type Enum(\'Empty\' = 0, \'Comment\' = 1, \'Punct\' = 2, \'Code\' = 3), prev_commit_hash String, prev_author LowCardinality(String), prev_time DateTime, file_change_type Enum(\'Add\' = 1, \'Delete\' = 2, \'Modify\' = 3, \'Rename\' = 4, \'Copy\' = 5, \'Type\' = 6),\n    path LowCardinality(String), old_path LowCardinality(String), file_extension LowCardinality(String), file_lines_added UInt32, file_lines_deleted UInt32, file_hunks_added UInt32, file_hunks_removed UInt32, file_hunks_changed UInt32, commit_hash String,\n    author LowCardinality(String), time DateTime, commit_message String, commit_files_added UInt32, commit_files_deleted UInt32, commit_files_renamed UInt32, commit_files_modified UInt32, commit_lines_added UInt32, commit_lines_deleted UInt32, commit_hunks_added UInt32, commit_hunks_removed UInt32, commit_hunks_changed UInt32');
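
Before building anything on top of the loaded tables, it can help to confirm that the three inserts above actually brought data in. The following is a simple sanity check, not part of the official example. The column aliases are my own, and no expected counts are given because they depend on the dataset snapshot.

```sql
-- One summary row per table: row count and the time range covered.
SELECT 'commits' AS table_name, count() AS row_count, min(time) AS first_time, max(time) AS last_time
FROM git.commits
UNION ALL
SELECT 'file_changes', count(), min(time), max(time)
FROM git.file_changes
UNION ALL
SELECT 'line_changes', count(), min(time), max(time)
FROM git.line_changes;
```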
    
    
  • Create a materialized view that counts commits per author per day:

    create materialized view git.commits_mv
    engine SummingMergeTree
    order by (dt, author)
    as select
    toDate(time) as dt, author, count() as n from git.commits group by dt, author order by dt asc;
    

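    The SummingMergeTree table engine is meant for cases where you care only about aggregated data, not the detail rows. When parts are merged, it collapses all rows that share the same sorting key (here dt, author) into a single row and sums their numeric columns, which can significantly reduce storage and speed up queries.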

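    Note: the materialized view created here contains no data yet. The data has to be written separately, as described below. POPULATE is not used because rows inserted into the source table while the historical data is being filled in are ignored by the view, so use it with caution when accuracy matters.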

    -- POPULATE version
    create materialized view git.commits_mv
    engine SummingMergeTree
    order by (dt, author)
    POPULATE as select
    toDate(time) as dt, author, count() as n from git.commits group by dt, author order by dt asc;
    
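    -- The ClickHouse documentation does not recommend POPULATE: rows inserted into the source table while the view is being created are not written to the view, so data is lost.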
    
  • If the view was created without POPULATE, load the existing data with INSERT INTO:

    insert into git.commits_mv
    select toDate(time) as dt, author, count() as n from git.commits group by dt, author order by dt asc;
    

    If this runs without errors, the view should now contain data. You can also check that a new row in the source table is propagated to the view:

    -- Insert one row and see whether the view is updated automatically
    insert into git.commits (hash, author, time, message, files_added, files_deleted, files_renamed, files_modified, lines_added, lines_deleted, hunks_added, hunks_removed, hunks_changed) values ('488610bd96415bdb8a718135676cxdf6a665829922', 'Nikita Taranov', '2022-11-30 18:22:24', 'impl (#43709)', 2, 0, 0, 3, 50, 31, 5, 1, 1);
    

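    Result: the view is updated. However, if you insert several more rows, you may find that commits_mv has not summed them into a single row. When querying a materialized view backed by SummingMergeTree, you still need to write an aggregating query. SummingMergeTree does pre-aggregate, but not in real time: rows are combined only when parts are merged in the background, and the timing of those merges is not under your control.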

  • So the query should look like this:

    select dt, author, sum(n) from git.commits_mv group by dt ,author order by dt desc;
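
Besides aggregating in every query, there are two other ways to deal with rows that have not been merged yet. The sketch below reuses the date and author from the test insert earlier in this example. It assumes your ClickHouse version forwards FINAL and OPTIMIZE from the view to its inner table, which I believe is the usual behaviour but is worth checking on your own server.

```sql
-- Recommended form: aggregate explicitly, so unmerged parts do not matter.
SELECT dt, author, sum(n) AS commits
FROM git.commits_mv
WHERE dt = '2022-11-30' AND author = 'Nikita Taranov'
GROUP BY dt, author;

-- Alternative 1: FINAL applies the SummingMergeTree merge logic at read time.
-- It is usually slower than a plain read because the merge happens per query.
SELECT dt, author, n
FROM git.commits_mv FINAL
WHERE dt = '2022-11-30'
ORDER BY author;

-- Alternative 2: force a merge now. This rewrites parts and can be expensive
-- on large tables, so it suits testing better than routine production use.
OPTIMIZE TABLE git.commits_mv FINAL;
```

The explicit GROUP BY remains the safest choice, because it returns correct totals whether or not a merge has happened.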
    

Notes

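  • When creating a materialized view (MV), do not use the POPULATE keyword. Create the view first, then load the data into it.
  • When an MV uses an aggregating engine such as SummingMergeTree, still write aggregating queries against it, because the timing of the merges is not under your control.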
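
The first note leaves one gap open: rows that reach the source table between creating the view and running the backfill are counted twice, once by the view itself and once by the backfill. One possible way to narrow that gap is to bound the backfill with a cutoff. The following is a sketch under stated assumptions: the cutoff timestamp is hypothetical, and the approach is valid only if every row with time earlier than the cutoff was inserted before the view was created (that is, data arrives roughly in time order). Run it instead of, not in addition to, the unbounded INSERT shown in the example.

```sql
-- Step 1 (not repeated here): create git.commits_mv without POPULATE,
-- exactly as in the usage example.

-- Step 2: backfill only rows older than the cutoff.
-- '2022-11-30 00:00:00' is a hypothetical value; pick the moment the view was created.
INSERT INTO git.commits_mv
SELECT toDate(time) AS dt, author, count() AS n
FROM git.commits
WHERE time < toDateTime('2022-11-30 00:00:00')
GROUP BY dt, author;

-- Step 3: compare totals. The two numbers should match if nothing was
-- missed or double counted.
SELECT
    (SELECT count() FROM git.commits) AS source_rows,
    (SELECT sum(n) FROM git.commits_mv) AS view_rows;
```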
