7.ReplacingMergeTree
7.1.ReplacingMergeTree
7.2.案例
7.2.1.示例1:根据排序键去重
7.2.2.示例2:使用版本参数的去重
指定表引擎:
ENGINE = ReplacingMergeTree([ver])
参数:ver,版本列。版本列的类型为UInt*、Date或DateTime。可选参数。
合并的时候,ReplacingTree从所有相同主键的行中选择一行留下:如果ver未指定,选择最后一条。如果指定了ver列,选择ver值最大的版本。
建表:
DROP TABLE replacingMergeTreeDemo;
CREATE TABLE replacingMergeTreeDemo
(
UserID UInt32,
CounterID UInt32,
UserName String,
EventDate Date
) ENGINE = ReplacingMergeTree()
ORDER BY (UserID, CounterID)
PRIMARY KEY (UserID);
执行效果:
xxxxx :) CREATE TABLE replacingMergeTreeDemo
:-] (
:-] UserID UInt32,
:-] CounterID UInt32,
:-] UserName String,
:-] EventDate Date
:-] ) ENGINE = ReplacingMergeTree()
:-] ORDER BY (UserID, CounterID)
:-] PRIMARY KEY (UserID);
CREATE TABLE replacingMergeTreeDemo
(
`UserID` UInt32,
`CounterID` UInt32,
`UserName` String,
`EventDate` Date
)
ENGINE = ReplacingMergeTree()
PRIMARY KEY UserID
ORDER BY (UserID, CounterID)
Ok.
0 rows in set. Elapsed: 0.055 sec.
xxxxx :)
2 插入数据
分三批次插入数据,为了观察效果,下面的三段脚本依次按顺序分别执行,不要一起执行:
批次1:
insert into replacingMergeTreeDemo values(1,100,'xiaohe1','2020-04-21');
insert into replacingMergeTreeDemo values(2,100,'xiaohe2','2020-04-22');
insert into replacingMergeTreeDemo values(3,102,'xiaohe3','2020-04-23');
插入后的效果为:
xxxxx :) select * from replacingMergeTreeDemo;
SELECT *
FROM replacingMergeTreeDemo
┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│ 3 │ 102 │ xiaohe3 │ 2020-04-23 │
└────────┴───────────┴──────────┴────────────┘
┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│ 2 │ 100 │ xiaohe2 │ 2020-04-22 │
└────────┴───────────┴──────────┴────────────┘
┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│ 1 │ 100 │ xiaohe1 │ 2020-04-21 │
└────────┴───────────┴──────────┴────────────┘
3 rows in set. Elapsed: 0.009 sec.
xxxxx :)
批次2:
insert into replacingMergeTreeDemo values(3,103,'xiaohe3','2020-04-23');
insert into replacingMergeTreeDemo values(2,101,'xiaohe2','2020-04-24');
insert into replacingMergeTreeDemo values(1,100,'xiaohe1','2020-04-20');
插入后的效果为:
xxxxx :) select * from replacingMergeTreeDemo;
SELECT *
FROM replacingMergeTreeDemo
┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│ 1 │ 100 │ xiaohe1 │ 2020-04-20 │
└────────┴───────────┴──────────┴────────────┘
┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│ 1 │ 100 │ xiaohe1 │ 2020-04-21 │
│ 2 │ 100 │ xiaohe2 │ 2020-04-22 │
│ 2 │ 101 │ xiaohe2 │ 2020-04-24 │
│ 3 │ 102 │ xiaohe3 │ 2020-04-23 │
│ 3 │ 103 │ xiaohe3 │ 2020-04-23 │
└────────┴───────────┴──────────┴────────────┘
批次3:
insert into replacingMergeTreeDemo values(2,100,'xiaohe2','2020-04-23');
插入后的效果:
xxxxx :) select * from replacingMergeTreeDemo;
SELECT *
FROM replacingMergeTreeDemo
┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│ 2 │ 100 │ xiaohe2 │ 2020-04-23 │
└────────┴───────────┴──────────┴────────────┘
┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│ 1 │ 100 │ xiaohe1 │ 2020-04-20 │
└────────┴───────────┴──────────┴────────────┘
┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│ 1 │ 100 │ xiaohe1 │ 2020-04-21 │
│ 2 │ 100 │ xiaohe2 │ 2020-04-22 │
│ 2 │ 101 │ xiaohe2 │ 2020-04-24 │
│ 3 │ 102 │ xiaohe3 │ 2020-04-23 │
│ 3 │ 103 │ xiaohe3 │ 2020-04-23 │
└────────┴───────────┴──────────┴────────────┘
7 rows in set. Elapsed: 0.007 sec.
xxxxx :)
如上所示,它并没有合并数据,这时候需要手工执行计划外合并。
这时候就要手工执行计划外合并:
执行命令:optimize table replacingMergeTreeDemo
xxxxx :) optimize table replacingMergeTreeDemo;
OPTIMIZE TABLE replacingMergeTreeDemo
Ok.
0 rows in set. Elapsed: 0.003 sec.
xxxxx :) select * from replacingMergeTreeDemo;
SELECT *
FROM replacingMergeTreeDemo
┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│ 1 │ 100 │ xiaohe1 │ 2020-04-20 │
│ 2 │ 100 │ xiaohe2 │ 2020-04-23 │
│ 2 │ 101 │ xiaohe2 │ 2020-04-24 │
│ 3 │ 102 │ xiaohe3 │ 2020-04-23 │
│ 3 │ 103 │ xiaohe3 │ 2020-04-23 │
└────────┴───────────┴──────────┴────────────┘
5 rows in set. Elapsed: 0.006 sec.
xxxxx :)
从这个示例可以发现,ReplacingMergeTree是根据排序项对数据去重的,而不是根据主键。
在示例1中, 我们简单观察可以发现, 去重后,保留的记录是最后一条记录。
如果指定了版本,则保留版本列的值最大的记录。
1.建表
使用EventDate列作为版本字段。
DROP TABLE replacingMergeTreeDemo;
CREATE TABLE replacingMergeTreeDemo
(
UserID UInt32,
CounterID UInt32,
UserName String,
EventDate Date
) ENGINE = ReplacingMergeTree(EventDate)
ORDER BY (UserID, CounterID)
PRIMARY KEY (UserID);
2.插入数据
批次1:
insert into replacingMergeTreeDemo values(1,100,'xiaohe1','2020-06-21');
insert into replacingMergeTreeDemo values(2,100,'xiaohe2','2020-11-27');
insert into replacingMergeTreeDemo values(3,102,'xiaohe3','2020-04-23');
批次2:
insert into replacingMergeTreeDemo values(3,103,'xiaohe3','2020-04-23');
insert into replacingMergeTreeDemo values(2,101,'xiaohe2','2020-04-24');
insert into replacingMergeTreeDemo values(1,100,'xiaohe1','2020-04-20');
批次3:
insert into replacingMergeTreeDemo values(2,100,'xiaohe2','2020-04-23');
查看数据:
xxxxx :) select * from replacingMergeTreeDemo;
SELECT *
FROM replacingMergeTreeDemo
┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│ 2 │ 100 │ xiaohe2 │ 2020-04-23 │
└────────┴───────────┴──────────┴────────────┘
┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│ 1 │ 100 │ xiaohe1 │ 2020-04-20 │
└────────┴───────────┴──────────┴────────────┘
┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│ 1 │ 100 │ xiaohe1 │ 2020-06-21 │
│ 2 │ 100 │ xiaohe2 │ 2020-11-27 │
│ 2 │ 101 │ xiaohe2 │ 2020-04-24 │
│ 3 │ 102 │ xiaohe3 │ 2020-04-23 │
│ 3 │ 103 │ xiaohe3 │ 2020-04-23 │
└────────┴───────────┴──────────┴────────────┘
7 rows in set. Elapsed: 0.009 sec.
xxxxx :) optimize table replacingMergeTreeDemo;
OPTIMIZE TABLE replacingMergeTreeDemo
Ok.
0 rows in set. Elapsed: 0.002 sec.
xxxxx :) select * from replacingMergeTreeDemo;
SELECT *
FROM replacingMergeTreeDemo
┌─UserID─┬─CounterID─┬─UserName─┬──EventDate─┐
│ 1 │ 100 │ xiaohe1 │ 2020-06-21 │
│ 2 │ 100 │ xiaohe2 │ 2020-11-27 │
│ 2 │ 101 │ xiaohe2 │ 2020-04-24 │
│ 3 │ 102 │ xiaohe3 │ 2020-04-23 │
│ 3 │ 103 │ xiaohe3 │ 2020-04-23 │
└────────┴───────────┴──────────┴────────────┘
5 rows in set. Elapsed: 0.005 sec.
xxxxx :)
经过上面案例,可以看出,最终保留了如下数据:
│ 2 │ 100 │ xiaohe2 │ 2020-11-27 │
从这个示例可以发现, ReplacingMergeTree根据排序项对数据去重的, 保留版本字段列对应的最大值的记录。