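Keep in mind that before performing the operations below (which use the ReplicatedMergeTree table engine), the cluster configuration must have internal_replication=true and ZooKeeper must be configured.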
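Before creating any replicated table, it may help to confirm what the server actually sees. The queries below are a minimal sketch: they assume the cluster is named mcd_prod_cluster_1st (the name used in the ON CLUSTER statements later in this post) and that they are run from clickhouse-client on one of the nodes. They show the shard and replica layout and whether ZooKeeper is reachable; the internal_replication setting itself is best checked directly in the server configuration file.

```sql
-- Shard/replica layout of the cluster as seen by this server.
SELECT cluster, shard_num, replica_num, host_name, port
FROM system.clusters
WHERE cluster = 'mcd_prod_cluster_1st';

-- Lists the root znodes; this query raises an error if ZooKeeper
-- is not configured or cannot be reached.
SELECT name
FROM system.zookeeper
WHERE path = '/';
```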
CREATE TABLE IF NOT EXISTS bank (\
age UInt16, \
job String, \
marital String, \
education String, \
default String, \
housing String, \
loan String, \
contact String, \
month String, \
day_of_week String, \
duration UInt32, \
campaign UInt32, \
pdays UInt64, \
previous UInt8, \
poutcome String, \
empvar_rate Float64, \
cons_price_idx Float64, \
cons_conf_idx Float64, \
euribor3m Float64, \
nr_employed Float64 \
) ENGINE = MergeTree() \
PARTITION BY month \
ORDER BY (education, age) \
SETTINGS index_granularity = 8192;
Import the data:
# Insert data from a file
cat /root/clickhouse-packages/data/bank_data.csv | clickhouse-client --host=ckprd1 --port=9000 --database=default --query="INSERT INTO bank FORMAT CSVWithNames" --input_format_allow_errors_num=100000
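Because the import command sets --input_format_allow_errors_num, rows that fail to parse are skipped rather than aborting the load, so it seems worth checking how many rows actually arrived. A minimal sketch, assuming the bank table created above (the alias row_count is only illustrative):

```sql
-- Total number of rows that were loaded.
SELECT count() FROM bank;

-- Rows per partition key value, to spot months that are unexpectedly small.
SELECT month, count() AS row_count
FROM bank
GROUP BY month
ORDER BY month;
```

Comparing the total with the line count of the CSV file (minus the header line) indicates how many rows were dropped.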
There are a few points to note here:
1) Method 1: run the following code separately on each server in the cluster
CREATE TABLE IF NOT EXISTS bank_replica (\
age UInt16,\
job String,\
marital String,\
education String,\
default String,\
housing String,\
loan String,\
contact String,\
month String,\
day_of_week String,\
duration UInt32,\
campaign UInt32,\
pdays UInt64,\
previous UInt8,\
poutcome String,\
empvar_rate Float64,\
cons_price_idx Float64,\
cons_conf_idx Float64,\
euribor3m Float64,\
nr_employed Float64\
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{layer}-{shard}/bank_replica', '{replica}')\
PARTITION BY month\
ORDER BY (education, age) \
SETTINGS index_granularity = 8192;
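The ReplicatedMergeTree arguments use the parameters configured in the macros section of the server configuration ({layer}, {shard} and {replica}).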
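To see which values a given server will substitute into the ZooKeeper path and the replica name, the macros can be read back from a system table. This is a sketch, assuming a ClickHouse version that provides system.macros; the actual values depend on each node's configuration and are not shown in this post.

```sql
-- One row per macro defined on this server, e.g. layer, shard, replica.
SELECT macro, substitution
FROM system.macros;
```

Replicas of the same shard should report the same layer and shard values but different replica values; otherwise they would not share a ZooKeeper path and would not replicate to each other.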
2) Method 2: create the local table on every machine in the cluster with a single statement
-- Replicated Table
CREATE TABLE IF NOT EXISTS bank_replica ON CLUSTER mcd_prod_cluster_1st (\
age UInt16,\
job String,\
marital String,\
education String,\
default String,\
housing String,\
loan String,\
contact String,\
month String,\
day_of_week String,\
duration UInt32,\
campaign UInt32,\
pdays UInt64,\
previous UInt8,\
poutcome String,\
empvar_rate Float64,\
cons_price_idx Float64,\
cons_conf_idx Float64,\
euribor3m Float64,\
nr_employed Float64\
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{layer}-{shard}/bank_replica', '{replica}')\
PARTITION BY month\
ORDER BY (education, age) \
SETTINGS index_granularity = 8192;
Drop the table:
DROP TABLE bank_replica ON CLUSTER mcd_prod_cluster_1st
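To verify that ReplicatedMergeTree itself synchronizes and replicates data, write data into the table on one replica of a shard and check whether the other replicas end up with the same data: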
INSERT INTO bank_replica SELECT * FROM bank
Query results:
ckprd1 :) select count(1) from bank_replica;
SELECT count(1)
FROM bank_replica
┌─count(1)─┐
│    27595 │
└──────────┘
ckprd2 :) select count(1) from bank_replica;
SELECT count(1)
FROM bank_replica
┌─count(1)─┐
│    27595 │
└──────────┘
ckprd3 :) select count(1) from bank_replica;
SELECT count(1)
FROM bank_replica
┌─count(1)─┐
│        0 │
└──────────┘
ckprd4 :) select count(1) from bank_replica;
SELECT count(1)
FROM bank_replica
┌─count(1)─┐
│        0 │
└──────────┘
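The insert was executed on node ckprd1, and ckprd2 belongs to the same shard as ckprd1, so the two hold the same data. ckprd3 and ckprd4 belong to the other shard and therefore have no data.
Next, I will show how to use a distributed table to spread the data across the two shards.
As with the local table, the distributed table is created on every machine here; again, you can either create it machine by machine or create it on all nodes at once. Here is the statement that creates it everywhere in one go: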
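Besides comparing row counts, the replication state can be inspected directly. The query below is a sketch, assuming it is run on one of the replicas and that the table lives in the default database; the columns are taken from the system.replicas table.

```sql
-- Health of the replicated table on this server:
-- is_readonly = 1 usually means the ZooKeeper session is missing,
-- queue_size > 0 means there are still entries waiting to be processed.
SELECT database, table, is_readonly, queue_size,
       absolute_delay, total_replicas, active_replicas
FROM system.replicas
WHERE database = 'default' AND table = 'bank_replica';
```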
CREATE TABLE bank_dist ON CLUSTER mcd_prod_cluster_1st \
AS bank_replica \
ENGINE = Distributed(mcd_prod_cluster_1st, default, bank_replica, rand());
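The arguments of the Distributed table engine are, in order: the cluster name, the database name, the table name, and the sharding key that decides how rows are distributed.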
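The fourth argument does not have to be rand(). As an illustration only, the sketch below uses a hash of a column as the sharding key, so that rows with the same value always land on the same shard. The table name bank_dist_by_job and the choice of the job column are hypothetical and not part of the original setup.

```sql
-- Hypothetical variant: shard by a hash of job instead of randomly.
CREATE TABLE bank_dist_by_job ON CLUSTER mcd_prod_cluster_1st
AS bank_replica
ENGINE = Distributed(mcd_prod_cluster_1st, default, bank_replica, cityHash64(job));
```

With rand() the rows are spread roughly evenly; with a hash key the balance depends on how the values of the chosen column are distributed.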
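One key point about distributed tables: creating the distributed table on every machine serves a completely different purpose from creating the local table on every machine. Creating the distributed table on just one machine would also work; creating it on every machine simply means that distributed queries can be run from any server.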
Data can be written into the distributed table on any of the servers (before inserting, you can first delete the data in the bank_replica table):
INSERT INTO bank_dist SELECT * FROM bank;
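If the rows from the earlier replication test are still in bank_replica, they need to be cleared before the INSERT above is run, otherwise the counts will include them. One possible way, assuming a ClickHouse version in which TRUNCATE supports ON CLUSTER and replicated tables:

```sql
-- Run this BEFORE the INSERT INTO bank_dist statement.
-- Empties the local replicated table on every node of the cluster.
TRUNCATE TABLE IF EXISTS bank_replica ON CLUSTER mcd_prod_cluster_1st;
```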
Query results:
ckprd1 :) select count(1) from bank_replica;
SELECT count(1)
FROM bank_replica
┌─count(1)─┐
│    13832 │
└──────────┘
ckprd2 :) select count(1) from bank_replica;
SELECT count(1)
FROM bank_replica
┌─count(1)─┐
│    13832 │
└──────────┘
ckprd3 :) select count(1) from bank_replica;
SELECT count(1)
FROM bank_replica
┌─count(1)─┐
│    13763 │
└──────────┘
ckprd4 :) select count(1) from bank_replica;
SELECT count(1)
FROM bank_replica
┌─count(1)─┐
│    13763 │
└──────────┘
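The two shards hold 13832 and 13763 rows, which add up to the 27595 rows of the source table. The same check can be made through the distributed table itself; the sketch below assumes bank_dist exists on the server where the query is run.

```sql
-- Total across both shards; this should equal 13832 + 13763 = 27595.
SELECT count() FROM bank_dist;

-- Rows per shard, labelled by the host that answered for that shard.
SELECT hostName() AS host, count() AS row_count
FROM bank_dist
GROUP BY host
ORDER BY host;
```

Only one replica per shard answers a distributed query, so the second query is expected to return one row per shard; which replica appears depends on the load-balancing settings.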
1) One folder per partition
The on-disk directory layout of the bank_replica table under the default database is as follows:
[Image: directory structure of the bank_replica table]
View the partition information:
ckprd1 :) SELECT partition, name, active FROM system.parts WHERE table = 'bank_replica';
SELECT
partition,
name,
active
FROM system.parts
WHERE table = 'bank_replica'
┌─partition─┬─name───────────────────────────────────┬─active─┐
│ dec       │ 0dcda14aa879a9e9de8ce0a075dac042_3_3_0 │      1 │
│ mar       │ 534f9f773916a62e1ce21b79e23ba5e7_3_3_0 │      1 │
│ oct       │ 6d3d797dfda12e0b8c837064b52bacc8_3_3_0 │      1 │
│ jun       │ 9a34545ffc3f5f4b8bee104063c6dd61_3_3_0 │      1 │
│ nov       │ 9bbaa4d2c2df481f7661e7257563da2d_3_3_0 │      1 │
│ sep       │ a6ad89f019506d5c2359e353d73e033d_3_3_0 │      1 │
│ apr       │ b3478f1e0b48f24b48cfc42239749609_3_3_0 │      1 │
│ jul       │ e0f64a6601e692e6c7761024aa627976_3_3_0 │      1 │
│ aug       │ f3fd28549b760f6d22e03bab5d963e1b_3_3_0 │      1 │
│ may       │ f6139eb0d4e1e322ceebd4cc93d30326_3_3_0 │      1 │
└───────────┴────────────────────────────────────────┴────────┘
10 rows in set. Elapsed: 0.002 sec.
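The same system table can also summarize how many parts and rows each partition holds. A minimal sketch, assuming the table is in the default database (the aliases part_count and row_count are only illustrative):

```sql
-- Active parts and row totals per partition on this server.
SELECT partition,
       count()   AS part_count,
       sum(rows) AS row_count
FROM system.parts
WHERE database = 'default' AND table = 'bank_replica' AND active
GROUP BY partition
ORDER BY partition;
```

Summing row_count over all partitions should match the count(1) result for bank_replica on the same server.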