09_ClickHouse,ReplacingMergeTree,案例,根据排序键去重,使用版本参数的去重(学习笔记)

7.ReplacingMergeTree
7.1.ReplacingMergeTree
7.2.案例
7.2.1.示例1:根据排序键去重
7.2.2.示例2:使用版本参数的去重

7.ReplacingMergeTree

7.1.ReplacingMergeTree

  1. 删除具有相同排序键值的重复项。
  2. 数据去重是在合并期间进行的。
  3. 后台的合并操作在未知的时间触发,因此用户无法对合并进行计划。
  4. 可使用OPTIMIZE语句运行计划外的合并,但OPTIMIZE是一个很重的操作。
  5. ReplacingMergeTree适合清除后台的重复数据以节省空间。

指定表引擎:
ENGINE = ReplacingMergeTree([ver])
参数:ver,版本列。版本列的类型为UInt*、Date或DateTime。可选参数。

合并的时候,ReplacingTree从所有相同主键的行中选择一行留下:如果ver未指定,选择最后一条。如果指定了ver列,选择ver值最大的版本。

7.2.案例

7.2.1.示例1:根据排序键去重

建表:

DROP TABLE replacingMergeTreeDemo;
CREATE TABLE replacingMergeTreeDemo
(
    UserID UInt32,
    CounterID UInt32,
    UserName String,
    EventDate Date
) ENGINE = ReplacingMergeTree()
ORDER BY (UserID, CounterID)
PRIMARY KEY (UserID);

执行效果:

xxxxx :) CREATE TABLE replacingMergeTreeDemo
:-] (
:-]     UserID UInt32,
:-]     CounterID UInt32,
:-]     UserName String,
:-]     EventDate Date
:-] ) ENGINE = ReplacingMergeTree()
:-] ORDER BY (UserID, CounterID)
:-] PRIMARY KEY (UserID);

CREATE TABLE replacingMergeTreeDemo
(
    `UserID` UInt32,
    `CounterID` UInt32,
    `UserName` String,
    `EventDate` Date
)
ENGINE = ReplacingMergeTree()
PRIMARY KEY UserID
ORDER BY (UserID, CounterID)

Ok.

0 rows in set. Elapsed: 0.055 sec. 

xxxxx :)

2 插入数据
分三批次插入数据,为了观察效果,下面的三段脚本依次按顺序分别执行,不要一起执行:
批次1:

insert into replacingMergeTreeDemo values(1,100,'xiaohe1','2020-04-21');
insert into replacingMergeTreeDemo values(2,100,'xiaohe2','2020-04-22');
insert into replacingMergeTreeDemo values(3,102,'xiaohe3','2020-04-23');

插入后的效果为:

xxxxx :) select * from replacingMergeTreeDemo;

SELECT *
FROM replacingMergeTreeDemo

┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│      3102 │ xiaohe3  │ 2020-04-23 │
└────────┴───────────┴──────────┴────────────┘
┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│      2100 │ xiaohe2  │ 2020-04-22 │
└────────┴───────────┴──────────┴────────────┘
┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│      1100 │ xiaohe1  │ 2020-04-21 │
└────────┴───────────┴──────────┴────────────┘

3 rows in set. Elapsed: 0.009 sec. 

xxxxx :)

批次2:

insert into replacingMergeTreeDemo values(3,103,'xiaohe3','2020-04-23');
insert into replacingMergeTreeDemo values(2,101,'xiaohe2','2020-04-24');
insert into replacingMergeTreeDemo values(1,100,'xiaohe1','2020-04-20');

插入后的效果为:

xxxxx :) select * from replacingMergeTreeDemo;

SELECT *
FROM replacingMergeTreeDemo

┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│      1100 │ xiaohe1  │ 2020-04-20 │
└────────┴───────────┴──────────┴────────────┘
┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│      1100 │ xiaohe1  │ 2020-04-21 │
│      2100 │ xiaohe2  │ 2020-04-22 │
│      2101 │ xiaohe2  │ 2020-04-24 │
│      3102 │ xiaohe3  │ 2020-04-23 │
│      3103 │ xiaohe3  │ 2020-04-23 │
└────────┴───────────┴──────────┴────────────┘

批次3:

insert into replacingMergeTreeDemo values(2,100,'xiaohe2','2020-04-23');

插入后的效果:

xxxxx :) select * from replacingMergeTreeDemo;
SELECT *
FROM replacingMergeTreeDemo

┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│      2100 │ xiaohe2  │ 2020-04-23 │
└────────┴───────────┴──────────┴────────────┘
┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│      1100 │ xiaohe1  │ 2020-04-20 │
└────────┴───────────┴──────────┴────────────┘
┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│      1100 │ xiaohe1  │ 2020-04-21 │
│      2100 │ xiaohe2  │ 2020-04-22 │
│      2101 │ xiaohe2  │ 2020-04-24 │
│      3102 │ xiaohe3  │ 2020-04-23 │
│      3103 │ xiaohe3  │ 2020-04-23 │
└────────┴───────────┴──────────┴────────────┘

7 rows in set. Elapsed: 0.007 sec. 

xxxxx :) 

如上所示,它并没有合并数据,这时候需要手工执行计划外合并。

这时候就要手工执行计划外合并:
执行命令:optimize table replacingMergeTreeDemo

xxxxx :) optimize table replacingMergeTreeDemo;

OPTIMIZE TABLE replacingMergeTreeDemo

Ok.

0 rows in set. Elapsed: 0.003 sec. 

xxxxx :) select * from replacingMergeTreeDemo;

SELECT *
FROM replacingMergeTreeDemo

┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│      1100 │ xiaohe1  │ 2020-04-20 │
│      2100 │ xiaohe2  │ 2020-04-23 │
│      2101 │ xiaohe2  │ 2020-04-24 │
│      3102 │ xiaohe3  │ 2020-04-23 │
│      3103 │ xiaohe3  │ 2020-04-23 │
└────────┴───────────┴──────────┴────────────┘

5 rows in set. Elapsed: 0.006 sec. 

xxxxx :)

从这个示例可以发现,ReplacingMergeTree是根据排序项对数据去重的,而不是根据主键。

7.2.2.示例2:使用版本参数的去重

在示例1中, 我们简单观察可以发现, 去重后,保留的记录是最后一条记录。
如果指定了版本,则保留版本列的值最大的记录。
1.建表
使用EventDate列作为版本字段。

DROP TABLE replacingMergeTreeDemo;
CREATE TABLE replacingMergeTreeDemo
(
    UserID UInt32,
    CounterID UInt32,
    UserName String,
    EventDate Date
) ENGINE = ReplacingMergeTree(EventDate)
ORDER BY (UserID, CounterID)
PRIMARY KEY (UserID);

2.插入数据
批次1:

insert into replacingMergeTreeDemo values(1,100,'xiaohe1','2020-06-21');
insert into replacingMergeTreeDemo values(2,100,'xiaohe2','2020-11-27');
insert into replacingMergeTreeDemo values(3,102,'xiaohe3','2020-04-23');

批次2:

insert into replacingMergeTreeDemo values(3,103,'xiaohe3','2020-04-23');
insert into replacingMergeTreeDemo values(2,101,'xiaohe2','2020-04-24');
insert into replacingMergeTreeDemo values(1,100,'xiaohe1','2020-04-20');

批次3:

insert into replacingMergeTreeDemo values(2,100,'xiaohe2','2020-04-23');

查看数据:

xxxxx :) select * from replacingMergeTreeDemo;

SELECT *
FROM replacingMergeTreeDemo

┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│      2100 │ xiaohe2  │ 2020-04-23 │
└────────┴───────────┴──────────┴────────────┘
┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│      1100 │ xiaohe1  │ 2020-04-20 │
└────────┴───────────┴──────────┴────────────┘
┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│      1100 │ xiaohe1  │ 2020-06-21 │
│      2100 │ xiaohe2  │ 2020-11-27 │
│      2101 │ xiaohe2  │ 2020-04-24 │
│      3102 │ xiaohe3  │ 2020-04-23 │
│      3103 │ xiaohe3  │ 2020-04-23 │
└────────┴───────────┴──────────┴────────────┘

7 rows in set. Elapsed: 0.009 sec. 

xxxxx :) optimize table replacingMergeTreeDemo;

OPTIMIZE TABLE replacingMergeTreeDemo

Ok.

0 rows in set. Elapsed: 0.002 sec. 

xxxxx :) select * from replacingMergeTreeDemo;

SELECT *
FROM replacingMergeTreeDemo

┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│      1100 │ xiaohe1  │ 2020-06-21 │
│      2100 │ xiaohe2  │ 2020-11-27 │
│      2101 │ xiaohe2  │ 2020-04-24 │
│      3102 │ xiaohe3  │ 2020-04-23 │
│      3103 │ xiaohe3  │ 2020-04-23 │
└────────┴───────────┴──────────┴────────────┘

5 rows in set. Elapsed: 0.005 sec. 

xxxxx :)

经过上面案例,可以看出,最终保留了如下数据:

2100 │ xiaohe2  │ 2020-11-27

从这个示例可以发现, ReplacingMergeTree根据排序项对数据去重的, 保留版本字段列对应的最大值的记录。

你可能感兴趣的:(#,ClickHouse)