ClickHouse Tech Sharing, Round 2 (English Handout)

Preface

Below is the handout for today's ClickHouse tech sharing session with my colleagues. Making slides was too much trouble, so I simply wrote it in Markdown and rendered it with the Markdown Preview Plus extension for Chrome, which works very well.

The full text follows. It is all condensed essentials, covering everything from two earlier articles, "An Introduction to Materialized Views with Application Examples in ClickHouse" and "ClickHouse Better Practices", plus some new material, such as:

  • Aggregate function combinator suffixes in ClickHouse
  • Read amplification in distributed JOIN/IN and the GLOBAL keyword
  • Workarounds for the lack of windowed analytical functions in ClickHouse SQL
    • Examples: ranking lists and YoY/MoM calculations
    • Plus the usage of ARRAY JOIN
  • The LowCardinality data type
  • MergeTree index structure

There was no time to cover arrays and higher-order functions; I will come back to them later.

Let's begin~


Advanced Usage & Better Practice of ClickHouse

Part I - Materialized View

Intro

  • Materialized view (MV): A copy (persistent storage) of query result set

  • MVs ≠ normal views, but ≈ tables

  • Trades space for time

  • Exists in various DBMSs (Oracle/SQL Server/PostgreSQL/...)

  • MV in ClickHouse = Precomputation + Incremental refreshing + Explicit data cache

  • Use case: offloading frequent, patterned aggregation queries

Engines

  • MaterializedView: used implicitly by the MV object itself

  • (Replicated)AggregatingMergeTree: Do auto aggregation upon insertion according to user-defined logic

  • Distributed: Just like distributed tables before

Creation

  • Best-selling merchandise at each point: PV / UV / first visit time / last visit time

[Image removed here because it contains business data]

CREATE MATERIALIZED VIEW IF NOT EXISTS dw.merchandise_point_pvuv_agg
ON CLUSTER sht_ck_cluster_1
ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/tables/{shard}/dw/merchandise_point_pvuv_agg','{replica}')
PARTITION BY ts_date
ORDER BY (ts_date,site_id,point_index,merchandise_id)
SETTINGS index_granularity = 8192
[POPULATE] AS SELECT
  ts_date,
  site_id,
  site_name,
  point_index,
  merchandise_id,
  merchandise_abbr,
  sumState(1) AS pv,
  uniqState(user_id) AS uv,
  maxState(ts_date_time) AS last_time,
  minState(ts_date_time) AS first_time
FROM ods.analytics_access_log
WHERE event_type = 'shtOpenGoodsDetail'
AND active_id = 0
AND site_id >= 0 AND merchandise_id >= 0 AND point_index >= 0
GROUP BY ts_date,site_id,site_name,point_index,merchandise_id,merchandise_abbr;
  • MVs can have partition keys, order (primary) keys and setting parameters (again, like tables)

  • The POPULATE keyword:

    • Without POPULATE = only the data inserted into the source table after the MV is created gets aggregated

    • With POPULATE = all historical data is aggregated while the MV is being created, but new data ingested during that period is ignored

  • sum/uniq/max/minState() ???

Under the Hood

Distributed MV

CREATE TABLE IF NOT EXISTS dw.merchandise_point_pvuv_agg_all
ON CLUSTER sht_ck_cluster_1
AS dw.merchandise_point_pvuv_agg
ENGINE = Distributed(sht_ck_cluster_1,dw,merchandise_point_pvuv_agg,rand());

Query

SELECT
  merchandise_id,
  merchandise_abbr,
  sumMerge(pv) AS pv,
  uniqMerge(uv) AS uv,
  maxMerge(last_time) AS last_time,
  minMerge(first_time) AS first_time,
  arrayStringConcat(groupUniqArray(site_name),'|') AS site_names
FROM dw.merchandise_point_pvuv_agg_all
WHERE ts_date = today()
AND site_id IN (10030,10031,10036,10037,10038)
AND point_index = 2
GROUP BY merchandise_id,merchandise_abbr
ORDER BY pv DESC LIMIT 10;

[Image removed here because it contains business data]

  • sum/uniq/max/minMerge() ???

Part II - Aggregate Function Combinators

-State

  • Does not return the aggregation result directly, but keeps an intermediate result (a "state") of the aggregation process

  • e.g. uniqState() keeps the hash table for cardinality approximation

  • Aggregate functions combined with -State will produce a column of type AggregateFunction(func,type)

  • AggregateFunction columns cannot be queried directly (see the small check below)

[Image removed here because it contains business data]
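A quick self-check of what a "state" looks like, using only built-in functions over the numbers() table: the raw state is an opaque binary blob, but finalizeAggregation() can evaluate it into the final value on the spot.

-- -State yields an intermediate state, not a number;
-- finalizeAggregation() turns that state into the final result
SELECT finalizeAggregation(uniqState(number)) AS distinct_cnt
FROM numbers(10);
-- distinct_cnt = 10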

-Merge

  • Aggregates the intermediate states and gives out the final value

  • A variant -MergeState aggregates intermediate states into a new, combined intermediate state; this is handy for rolling a fine-grained MV up into a coarser one (sketch below)
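A hedged sketch of that rollup idea, reusing the MV from Part I: the daily pv/uv states are re-aggregated into monthly states, which are themselves still AggregateFunction values and can be merged again later with -Merge.

SELECT
  toStartOfMonth(ts_date) AS month,
  merchandise_id,
  sumMergeState(pv) AS pv_month_state,
  uniqMergeState(uv) AS uv_month_state
FROM dw.merchandise_point_pvuv_agg
GROUP BY month,merchandise_id;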

-If

  • Conditional aggregation

  • Perform multi-condition processing within one statement

SELECT
  sumIf(quantity, merchandise_abbr LIKE '%苹果%') AS apple_quantity,
  countIf(toStartOfHour(ts_date_time) = '2020-06-09 20:00:00') AS eight_oclock_sub_order_num,
  maxIf(quantity * price, coupon_money > 0) AS couponed_max_gmv
FROM ods.ms_order_done
WHERE ts_date = toDate('2020-06-09');
┌─apple_quantity─┬─eight_oclock_sub_order_num─┬─couponed_max_gmv─┐
│           1365 │                      19979 │           318000 │
└────────────────┴────────────────────────────┴──────────────────┘

-Array

  • Array aggregation
SELECT avgArray([33, 44, 99, 110, 220]);
┌─avgArray([33, 44, 99, 110, 220])─┐
│                            101.2 │
└──────────────────────────────────┘

-ForEach

  • Aggregates array elements position by position (the i-th elements of all arrays are aggregated together)
SELECT sumForEach(arr)
FROM (
  SELECT 1 AS id, [3, 6, 12] AS arr
  UNION ALL
  SELECT 2 AS id, [7, 14, 7, 5] AS arr
);
┌─sumForEach(arr)─┐
│ [10,20,19,5]    │
└─────────────────┘

Part III - Using JOIN Correctly

Only consider 2-table equi-joins

Use IN When Possible

  • Prefer IN over JOIN when we only want to fetch data from the left table
SELECT sec_category_name,count()
FROM ods.analytics_access_log
WHERE ts_date = today() - 1
AND site_name like '长沙%'
AND merchandise_id IN (
  SELECT merchandise_id
  FROM ods.ms_order_done
  WHERE price > 10000
)
GROUP BY sec_category_name;

Put Small Table at Right

  • ClickHouse uses the hash-join algorithm whenever memory allows
  • The right table is always the build side (held in memory), while the left table is always the probe side (see the sketch below)

  • Falls back to a merge join on disk when memory runs out (not as efficient as a hash join)
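A minimal sketch of the rule (dim.site_info and its site_city column are hypothetical): the big fact table stays on the left as the probe side, while the small dimension table goes on the right as the in-memory build side.

SELECT l.user_id,r.site_city
FROM ods.analytics_access_log AS l          -- large table: probe side
INNER JOIN (
  SELECT site_id,site_city
  FROM dim.site_info                        -- small table: build side, held in memory
) AS r ON l.site_id = r.site_id
WHERE l.ts_date = today();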

No Predicate Pushdown

  • Predicate pushdown is a common query optimization approach. e.g. in MySQL:
SELECT l.col1,r.col2 FROM left_table l
INNER JOIN right_table r ON l.key = r.key
WHERE l.col3 > 123 AND r.col4 = '...';
  • The WHERE predicates are evaluated early, during the scan phase, reducing the amount of data entering the join phase

  • But ClickHouse's optimizer is fairly weak and does not support this, so we should manually push the predicates "inside" the subqueries

SELECT l.col1,r.col2 FROM (
  SELECT col1,key FROM left_table
  WHERE col3 > 123
) l INNER JOIN (
  SELECT col2,key FROM right_table
  WHERE col4 = '...'
) r ON l.key = r.key;

Distributed JOIN/IN with GLOBAL

  • When joining or doing IN on two distributed tables/MVs, the GLOBAL keyword is crucial
SELECT
  t1.merchandise_id,t1.merchandise_abbr,t1.pv,t1.uv,
  t2.total_quantity,t2.total_gmv
FROM (
  SELECT
    merchandise_id,merchandise_abbr,
    sumMerge(pv) AS pv,
    uniqMerge(uv) AS uv
  FROM dw.merchandise_point_pvuv_agg_all  -- Distributed
  WHERE ts_date = today()
  AND site_id IN (10030,10031,10036,10037,10038)
  AND point_index = 1
  GROUP BY merchandise_id,merchandise_abbr
) t1
GLOBAL LEFT JOIN (  -- GLOBAL
  SELECT
    merchandise_id,
    sumMerge(total_quantity) AS total_quantity,
    sumMerge(total_gmv) AS total_gmv
  FROM dw.merchandise_gmv_agg_all  -- Distributed
  WHERE ts_date = today()
  AND site_id IN (10030,10031,10036,10037,10038)
  GROUP BY merchandise_id
) t2
ON t1.merchandise_id = t2.merchandise_id;
  • Distributed joining without GLOBAL causes read amplification: the right table will be read M*N times (N² when both sides have the same N shards), which is very wasteful

  • Distributed joining with GLOBAL is fine: the right-side subquery runs only once, and its result is cached and shipped to every shard as an intermediate temporary table (the same applies to GLOBAL IN; see the sketch below)
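The same keyword applies to IN. A short sketch against the distributed MVs above:

SELECT
  merchandise_id,
  sumMerge(pv) AS pv
FROM dw.merchandise_point_pvuv_agg_all      -- Distributed
WHERE ts_date = today()
AND merchandise_id GLOBAL IN (              -- GLOBAL: subquery runs once, result shipped to all shards
  SELECT merchandise_id
  FROM dw.merchandise_gmv_agg_all           -- Distributed
  WHERE ts_date = today()
)
GROUP BY merchandise_id;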

ARRAY JOIN

  • Special. Not related to table joining, but arrays

  • Used to unfold an array in a single row into multiple rows, together with extra column(s)

  • Seems like LATERAL VIEW EXPLODE in Hive?

  • An example in the next section

Part IV - Alternative to Windowed Analytical Functions

Drawback

  • ClickHouse lacks basic windowed analytical functions, such as (in Hive):
row_number() OVER (PARTITION BY col1 ORDER BY col2)
rank() OVER (PARTITION BY col1 ORDER BY col2)
dense_rank() OVER (PARTITION BY col1 ORDER BY col2)
lag(col,num) OVER (PARTITION BY col1 ORDER BY col2)
lead(col,num) OVER (PARTITION BY col1 ORDER BY col2)
  • Any other way around?

arrayEnumerate*()

  • arrayEnumerate(): Returns index array [1, 2, 3, …, length(array)]
SELECT arrayEnumerate([99, 88, 77, 66, 88, 99, 88, 55]);
┌─arrayEnumerate([99, 88, 77, 66, 88, 99, 88, 55])─┐
│ [1,2,3,4,5,6,7,8]                                │
└──────────────────────────────────────────────────┘
  • arrayEnumerateDense(): Returns an array of the same size as the source array, indicating where each element first appears in the source array
SELECT arrayEnumerateDense([99, 88, 77, 66, 88, 99, 88, 55]);
┌─arrayEnumerateDense([99, 88, 77, 66, 88, 99, 88, 55])─┐
│ [1,2,3,4,2,1,2,5]                                     │
└───────────────────────────────────────────────────────┘
  • arrayEnumerateUniq(): Returns an array the same size as the source array, indicating for each element what its position is among elements with the same value
SELECT arrayEnumerateUniq([99, 88, 77, 66, 88, 99, 88, 55]);
┌─arrayEnumerateUniq([99, 88, 77, 66, 88, 99, 88, 55])─┐
│ [1,1,1,1,2,2,3,1]                                    │
└──────────────────────────────────────────────────────┘

Ranking List

  • When the array is ordered, arrayEnumerate() = row_number(), arrayEnumerateDense() = dense_rank()

  • Pay attention to the usage of ARRAY JOIN: it 'flattens' the arrays back into rows of human-readable columns

SELECT main_site_id,merchandise_id,gmv,row_number,dense_rank
FROM (
  SELECT main_site_id,
    groupArray(merchandise_id) AS merchandise_arr,
    groupArray(gmv) AS gmv_arr,
    arrayEnumerate(gmv_arr) AS gmv_row_number_arr,
    arrayEnumerateDense(gmv_arr) AS gmv_dense_rank_arr
  FROM (
    SELECT main_site_id,
      merchandise_id,
      sum(price * quantity) AS gmv
    FROM ods.ms_order_done
    WHERE ts_date = toDate('2020-06-01')
    GROUP BY main_site_id,merchandise_id
    ORDER BY gmv DESC
  )
  GROUP BY main_site_id
) ARRAY JOIN
  merchandise_arr AS merchandise_id,
  gmv_arr AS gmv,
  gmv_row_number_arr AS row_number,
  gmv_dense_rank_arr AS dense_rank
ORDER BY main_site_id ASC,row_number ASC;
┌─main_site_id─┬─merchandise_id─┬────gmv─┬─row_number─┬─dense_rank─┐
│          162 │         379263 │ 136740 │          1 │          1 │
│          162 │         360845 │  63600 │          2 │          2 │
│          162 │         400103 │  54110 │          3 │          3 │
│          162 │         404763 │  52440 │          4 │          4 │
│          162 │          93214 │  46230 │          5 │          5 │
│          162 │         304336 │  45770 │          6 │          6 │
│          162 │         392607 │  45540 │          7 │          7 │
│          162 │         182121 │  45088 │          8 │          8 │
│          162 │         383729 │  44550 │          9 │          9 │
│          162 │         404698 │  43750 │         10 │         10 │
│          162 │         102725 │  33284 │         11 │         11 │
│          162 │         404161 │  29700 │         12 │         12 │
│          162 │         391821 │  28160 │         13 │         13 │
│          162 │         339499 │  26069 │         14 │         14 │
│          162 │         404548 │  25600 │         15 │         15 │
│          162 │         167303 │  25520 │         16 │         16 │
│          162 │         209754 │  23940 │         17 │         17 │
│          162 │         317795 │  22950 │         18 │         18 │
│          162 │         404158 │  21780 │         19 │         19 │
│          162 │         326096 │  21540 │         20 │         20 │
│          162 │         404493 │  20950 │         21 │         21 │
│          162 │         389508 │  20790 │         22 │         22 │
│          162 │         301524 │  19900 │         23 │         23 │
│          162 │         404506 │  19900 │         24 │         23 │
│          162 │         404160 │  18130 │         25 │         24 │
........................
  • Use WHERE row_number <= N or dense_rank <= N to extract the top N within each group

neighbor()

  • neighbor() is effectively lag() and lead() combined into one function (a minimal example follows the signature below)
neighbor(column,offset[,default_value])
-- offset > 0 = lead
-- offset < 0 = lag
-- default_value is used when the offset is out of bound
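A minimal, self-contained illustration (when default_value is omitted, the type's default, 0 for numbers, is returned out of bounds):

SELECT
  number AS val,
  neighbor(val,1) AS next_val,    -- lead(val,1); returns 0 past the last row
  neighbor(val,-1) AS prev_val    -- lag(val,1); returns 0 before the first row
FROM numbers(5);

One caveat worth knowing: neighbor() only reaches rows inside the currently processed data block, so in real queries the input usually needs to be ordered and gathered into one block beforehand, as the next example does with numbers().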

Baseline (YoY/MoM)

  • “同比”, i.e. YoY (year-over-year) rate = (value[month, year] - value[month, year - 1]) / value[month, year - 1]

  • “环比”, i.e. MoM (month-over-month) rate = (value[month] - value[month - 1]) / value[month - 1]

  • Let's make up some fake data and try it out

WITH toDate('2019-01-01') AS start_date
SELECT
  toStartOfMonth(start_date + number * 32) AS dt,
  rand(number) AS val,
  neighbor(val,-12) AS prev_year_val,
  neighbor(val,-1) AS prev_month_val,
  if (prev_year_val = 0,-32768,round((val - prev_year_val) / prev_year_val, 4) * 100) AS yoy_percent,
  if (prev_month_val = 0,-32768,round((val - prev_month_val) / prev_month_val, 4) * 100) AS mom_percent
FROM numbers(18);
┌─────────dt─┬────────val─┬─prev_year_val─┬─prev_month_val─┬─yoy_percent─┬─────────mom_percent─┐
│ 2019-01-01 │  344308231 │             0 │              0 │      -32768 │              -32768 │
│ 2019-02-01 │ 2125630486 │             0 │      344308231 │      -32768 │              517.36 │
│ 2019-03-01 │  799858939 │             0 │     2125630486 │      -32768 │ -62.370000000000005 │
│ 2019-04-01 │ 1899653667 │             0 │      799858939 │      -32768 │               137.5 │
│ 2019-05-01 │ 3073278541 │             0 │     1899653667 │      -32768 │               61.78 │
│ 2019-06-01 │  882031881 │             0 │     3073278541 │      -32768 │               -71.3 │
│ 2019-07-01 │ 3888311917 │             0 │      882031881 │      -32768 │              340.84 │
│ 2019-08-01 │ 3791703268 │             0 │     3888311917 │      -32768 │               -2.48 │
│ 2019-09-01 │ 3472517572 │             0 │     3791703268 │      -32768 │               -8.42 │
│ 2019-10-01 │ 1010491656 │             0 │     3472517572 │      -32768 │  -70.89999999999999 │
│ 2019-11-01 │ 2841992923 │             0 │     1010491656 │      -32768 │              181.25 │
│ 2019-12-01 │ 1783039500 │             0 │     2841992923 │      -32768 │              -37.26 │
│ 2020-01-01 │ 2724427263 │     344308231 │     1783039500 │      691.28 │  52.800000000000004 │
│ 2020-02-01 │ 2472851287 │    2125630486 │     2724427263 │       16.33 │  -9.229999999999999 │
│ 2020-03-01 │ 1699617807 │     799858939 │     2472851287 │      112.49 │ -31.269999999999996 │
│ 2020-04-01 │  873033696 │    1899653667 │     1699617807 │      -54.04 │              -48.63 │
│ 2020-05-01 │ 3524933462 │    3073278541 │      873033696 │        14.7 │              303.76 │
│ 2020-06-01 │   85437434 │     882031881 │     3524933462 │      -90.31 │              -97.58 │
└────────────┴────────────┴───────────────┴────────────────┴─────────────┴─────────────────────┘

Part V - More on Data Types

Date/DateTime

  • Do not use String for Date/DateTime columns (the same rule applies to other types)

    • ClickHouse is strongly typed, no implicit conversions

    • All-String tables (common in Hive) do not work well with ClickHouse

  • Do not use integer timestamps instead of Date/DateTime

    • Date: stored as the number of days since 1970-01-01

    • DateTime: stored directly as a Unix timestamp (fast)

  • Very flexible date/time functions (a few examples below)
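A few of those functions, just to give a flavor (all built-in):

SELECT
  toDate('2020-06-09') AS d,
  toStartOfWeek(d) AS week_start,
  dateDiff('day',d,toDate('2020-06-19')) AS day_diff,
  formatDateTime(toDateTime('2020-06-09 20:00:00'),'%Y-%m-%d %H:%M') AS formatted;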

Nullable

  • ClickHouse columns do not allow NULL by default, but if you really want it...
merchandise_id Nullable(Int64)
  • But try to stay away from Nullable

    • Needs a separate file to store the NULL masks

    • Nullable columns cannot be indexed

  • The default value itself can stand for NULL (0 for Int, '' for String, etc.), or an explicit default can be defined when creating the table

merchandise_id Int64 DEFAULT -1

LowCardinality

  • ClickHouse applies dictionary coding to LowCardinality columns. Operating with such kind of data significantly increases performance of SELECT queries for many applications

  • LowCardinality is almost always applied to String columns with limited diversity (cardinality < 10000)

-- event_type in access logs is quite suitable
event_type LowCardinality(String)
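If the table already exists, the column can usually be converted in place; a hedged sketch (run with ON CLUSTER as appropriate):

ALTER TABLE ods.analytics_access_log ON CLUSTER sht_ck_cluster_1
MODIFY COLUMN event_type LowCardinality(String);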

Arrays & Higher-order Functions [TBD]

  • TBD...

Part VI - MergeTree Indices & Table Settings

Index Structure

  • Not B-Tree style, but rather like Kafka log indices (sparse)

  • .bin (data) and .mrk (mark) files for each column on disk

  • primary.idx stores the primary key values sampled once per index granularity (one entry per granule)
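A quick way to watch the sparse index at work: with send_logs_level set to 'trace' in clickhouse-client, the server reports how many parts and marks (granules) a query actually touches (the exact log wording may vary by version).

SET send_logs_level = 'trace';

SELECT count()
FROM ods.analytics_access_log
WHERE ts_date = today() AND site_id = 10030;
-- the trace log prints something like "Selected N parts by date, M parts by key,
-- K marks to read from R ranges", showing how much data the index pruned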

Index Settings

  • Include the columns that occur frequently as predicates (in the WHERE clause)

  • Date/DateTime columns come first (when partitioning with date/time)

  • Columns with very high cardinality are not suitable for the sparse index

  • Do not use too many columns, and do not change the index_granularity = 8192 setting unless there is a good reason

Table TTL

  • Determines the lifetime of rows, enabling automatic expiration of historical data

  • When creating a table

PARTITION BY ...
ORDER BY (...)
TTL ts_date + INTERVAL 6 MONTH
  • Or modify an existing table (only affects the data inserted after modification)
ALTER TABLE ods.analytics_access_log ON CLUSTER sht_ck_cluster_1
MODIFY TTL ts_date + INTERVAL 6 MONTH;
  • The settings parameter controlling how often merges that apply TTL are repeated
SETTINGS merge_with_ttl_timeout = 86400  -- 1 day

ZooKeeper

  • ClickHouse utilizes ZooKeeper as: Coordination service + Mini log service + Metadata storage

  • Quite heavy usage, so try to keep the ZooKeeper cluster happy, e.g. with the auto-purge settings below

autopurge.purgeInterval = 1
autopurge.snapRetainCount = 5
  • Also, replicated tables can store the headers of the data parts compactly using a single znode by defining:
SETTINGS use_minimalistic_part_header_in_zookeeper = 1

Review CREATE TABLE statement

CREATE TABLE IF NOT EXISTS ods.analytics_access_log
ON CLUSTER sht_ck_cluster_1 (
  ts_date Date,
  ts_date_time DateTime,
  user_id Int64,
  event_type String,
  column_type String,
  groupon_id Int64,
  site_id Int64,
  site_name String,
  -- ...
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/ods/analytics_access_log','{replica}')
PARTITION BY ts_date
ORDER BY (ts_date,toStartOfHour(ts_date_time),main_site_id,site_id,event_type,column_type)
TTL ts_date + INTERVAL 6 MONTH
SETTINGS index_granularity = 8192,
use_minimalistic_part_header_in_zookeeper = 1,
merge_with_ttl_timeout = 86400;

The End

Good night, everyone.
