https://clickhouse.tech/docs/en/engines/table-engines/integrations/hdfs/
https://clickhouse.tech/docs/en/engines/table-engines/integrations/kafka/
https://clickhouse.tech/docs/en/interfaces/formats/#data_types-matching
CREATE TABLE company_employees.employees(
name String,
age String,
email String,
addr String
) ENGINE = HDFS('hdfs://{active_namenode}:8020/ods/statis_date=202{0..9}{0..9}{0..9}{0..9}{0..9}/statis_hour={0,1,2}{0..9}/*', 'ORC')
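Once created, the table can be queried like any other ClickHouse table; a quick sanity check (a minimal sketch, assuming the table above was created successfully) is:
SELECT count() FROM company_employees.employees;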
Example from the official docs:
Example
Suppose we have several files in TSV format with the following URIs on HDFS:
'hdfs://hdfs1:9000/some_dir/some_file_1'
'hdfs://hdfs1:9000/some_dir/some_file_2'
'hdfs://hdfs1:9000/some_dir/some_file_3'
'hdfs://hdfs1:9000/another_dir/some_file_1'
'hdfs://hdfs1:9000/another_dir/some_file_2'
'hdfs://hdfs1:9000/another_dir/some_file_3'
There are several ways to make a table consisting of all six files:
CREATE TABLE table_with_range (name String, value UInt32) ENGINE = HDFS('hdfs://hdfs1:9000/{some,another}_dir/some_file_{1..3}', 'TSV')
Another way:
CREATE TABLE table_with_question_mark (name String, value UInt32) ENGINE = HDFS('hdfs://hdfs1:9000/{some,another}_dir/some_file_?', 'TSV')
Table consists of all the files in both directories (all files should satisfy format and schema described in query):
CREATE TABLE table_with_asterisk (name String, value UInt32) ENGINE = HDFS('hdfs://hdfs1:9000/{some,another}_dir/*', 'TSV')
Commonly used Hive tables are almost always partitioned, so the path in the ClickHouse table definition has to cover all the partition directories.
The Hive table I built is partitioned by statis_date (day) and statis_hour (hour), so the corresponding path pattern is:
/ods/statis_date=202{0..9}{0..9}{0..9}{0..9}{0..9}/statis_hour={0,1,2}{0..9}/*
Write the path against the active NameNode, i.e. hdfs://{active_namenode}:8020. If you point at the HA cluster nameservice or a standby NameNode instead, no data comes back. There is probably a way to make the nameservice work in cluster mode, e.g. by supplying the cluster's core-site.xml or similar, but I haven't tried it.
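A path pattern can also be checked before baking it into a table definition by using the hdfs table function; a minimal sketch, assuming a hypothetical existing partition statis_date=20210601/statis_hour=12 and the same four-column schema:
SELECT count()
FROM hdfs('hdfs://{active_namenode}:8020/ods/statis_date=20210601/statis_hour=12/*', 'ORC', 'name String, age String, email String, addr String');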
CREATE TABLE service_flow.service_flow_increment_resource (
mac String,
rxTraffic Float64,
updateTime String
) ENGINE = Kafka SETTINGS kafka_broker_list = 'node1:9092',
kafka_topic_list = 'service-flow-increment',
kafka_group_name = 'kafka_flow_test',
kafka_format = 'JSONEachRow',
kafka_num_consumers = 1;
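With kafka_format = 'JSONEachRow', each Kafka message is expected to be one JSON object per row whose keys match the column names; a hypothetical message on the service-flow-increment topic would look like:
{"mac": "00:11:22:33:44:55", "rxTraffic": 1024.5, "updateTime": "2021-06-01 12:00:00"}
Note that a Kafka engine table is only a stream: rows SELECTed from it are consumed and gone, which is why the data is landed in a MergeTree table next.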
CREATE TABLE service_flow.service_flow_increment(
mac String,
rxTraffic Float64,
updateTime String
) ENGINE = MergeTree()
ORDER BY updateTime
SETTINGS index_granularity = 8192;
Inside the view you can apply a whole series of operations, so that the source data is transformed to fit your needs before being synced into the actual table; a sketch of such a transformation follows the pass-through view below.
CREATE MATERIALIZED VIEW service_flow.service_flow_increment_consumer TO service_flow.service_flow_increment
AS SELECT
mac,
rxTraffic,
updateTime
FROM service_flow.service_flow_increment_resource;
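If the pass-through view above were replaced (DROP it first), the same hop could also reshape the stream; a minimal sketch in which the unit conversion and the filter are assumptions, not part of the original pipeline:
CREATE MATERIALIZED VIEW service_flow.service_flow_increment_consumer TO service_flow.service_flow_increment
AS SELECT
mac,
rxTraffic / 1024 AS rxTraffic, -- hypothetical unit conversion, e.g. bytes to KiB
updateTime
FROM service_flow.service_flow_increment_resource
WHERE mac != ''; -- hypothetical filter dropping rows without a MAC
As soon as the materialized view exists, consumption starts in the background, and the result can be checked with SELECT count() FROM service_flow.service_flow_increment;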