The Hive connector does not use Hive's execution engine. Instead, it reads metadata from the Hive Metastore service and then reads and writes data on HDFS directly through an HDFS client. Consequently, SQL queries cannot be pushed down to Hive.
Create the catalog configuration file on every node:
[root@trino1 catalog]# pwd
/root/trino-server-367/etc/catalog
[root@trino1 catalog]#
[root@trino1 catalog]# cat hive.properties
connector.name=hive
hive.metastore.uri=thrift://192.168.23.91:9083
# Whether to enable writing to Hive non-managed (external) tables; defaults to false
hive.non-managed-table-writes-enabled=true
[root@trino1 catalog]#
For HDFS with HA (high availability), the Hadoop configuration files core-site.xml and hdfs-site.xml must be copied to all Trino nodes.
First, create a Hadoop configuration directory on every Trino node:
[root@trino1 ~]# mkdir trino-server-367/etc/hadoop
[root@trino1 ~]#
Then copy the Hadoop configuration files to every Trino node:
[root@hive1 ~]#
[root@hive1 ~]# scp hadoop-3.3.1/etc/hadoop/core-site.xml [email protected]:/root/trino-server-367/etc/hadoop
[root@hive1 ~]# scp hadoop-3.3.1/etc/hadoop/core-site.xml [email protected]:/root/trino-server-367/etc/hadoop
[root@hive1 ~]# scp hadoop-3.3.1/etc/hadoop/core-site.xml [email protected]:/root/trino-server-367/etc/hadoop
[root@hive1 ~]# scp hadoop-3.3.1/etc/hadoop/hdfs-site.xml [email protected]:/root/trino-server-367/etc/hadoop
[root@hive1 ~]# scp hadoop-3.3.1/etc/hadoop/hdfs-site.xml [email protected]:/root/trino-server-367/etc/hadoop
[root@hive1 ~]# scp hadoop-3.3.1/etc/hadoop/hdfs-site.xml [email protected]:/root/trino-server-367/etc/hadoop
[root@hive1 ~]#
Add the address mappings for the Hadoop servers to /etc/hosts on every Trino node:
192.168.23.91 hive1
192.168.23.92 hive2
192.168.23.93 hive3
Then add the following setting to etc/catalog/hive.properties on every Trino node:
hive.config.resources=/root/trino-server-367/etc/hadoop/core-site.xml,/root/trino-server-367/etc/hadoop/hdfs-site.xml
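Consolidating the settings above, hive.properties on each node would now read:
connector.name=hive
hive.metastore.uri=thrift://192.168.23.91:9083
hive.non-managed-table-writes-enabled=true
hive.config.resources=/root/trino-server-367/etc/hadoop/core-site.xml,/root/trino-server-367/etc/hadoop/hdfs-site.xml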
Restart Trino:
[root@trino1 catalog]#
[root@trino1 catalog]# cd /root/trino-server-367
[root@trino1 trino-server-367]#
[root@trino1 trino-server-367]# bin/launcher stop
Not running
[root@trino1 trino-server-367]#
[root@trino1 trino-server-367]# bin/launcher start
Started as 46215
[root@trino1 trino-server-367]#
trino> create schema hive.test_db with (location = 'hdfs://nnha/user/hive/warehouse/test_db.db');
CREATE SCHEMA
trino>
The location parameter is optional.
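If location is omitted, the Hive Metastore creates the schema directory under its default warehouse path. A minimal sketch (hypothetical schema name test_db2):
trino> create schema hive.test_db2;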
The data file format defaults to ORC and can be changed with the hive.storage-format parameter in the catalog properties file. The compression codec defaults to GZIP and can be changed with the hive.compression-codec parameter.
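For example, to have new tables written as Parquet with Snappy compression (illustrative values), add to etc/catalog/hive.properties:
hive.storage-format=PARQUET
hive.compression-codec=SNAPPY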
When creating an external table, the directory that external_location points to must already exist, as shown below:
trino>
trino> create table hive.test_db.test_external_tb(
-> user_id bigint,
-> user_name varchar,
-> birthday date,
-> country varchar
-> ) with (external_location = 'hdfs://nnha/user/hive/warehouse/test_db.db/test_external_tb');
CREATE TABLE
trino>
If the external_location parameter is omitted, a managed (internal) table is created instead.
Partition columns must be placed last in the table's column list:
trino>
trino> create table hive.test_db.test_partition_tb(
-> user_id bigint,
-> user_name varchar,
-> birthday date,
-> country varchar
-> ) with (partitioned_by = array['birthday', 'country']);
CREATE TABLE
trino>
Filtering on the partition columns like this narrows the range of data that has to be scanned, as shown below:
trino> select distinct user_id from hive.test_db.test_partition_tb where birthday = date '2022-02-09' and country = 'china';
The HDFS directory structure is as follows:
/user/hive/warehouse/test_db.db/external_partition_tb/birthday=2018-08-16/country=china/test2018.txt
/user/hive/warehouse/test_db.db/external_partition_tb/birthday=2019-08-16/country=japan/test2019.txt
The content of partition birthday=2018-08-16/country=china:
1,zhang_san,2018-08-16,china
The content of partition birthday=2019-08-16/country=japan:
2,li_si,2019-08-16,japan
Create the external partitioned table external_partition_tb in Trino:
trino>
trino> create table hive.test_db.external_partition_tb(
-> user_id varchar,
-> user_name varchar,
-> birthday varchar,
-> country varchar
-> ) with (
-> format = 'CSV',
-> csv_escape = '\',
-> csv_quote = '"',
-> csv_separator = ',',
-> external_location = 'hdfs://nnha/user/hive/warehouse/test_db.db/external_partition_tb',
-> partitioned_by = array['birthday', 'country']
-> );
CREATE TABLE
trino>
trino> select * from hive.test_db.external_partition_tb;
user_id | user_name | birthday | country
---------+-----------+----------+---------
(0 rows)
Query 20220209_113936_00038_jx84g, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0.30 [0 rows, 0B] [0 rows/s, 0B/s]
trino>
When creating a CSV-format table with the Hive connector, columns may only be of type varchar.
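Because every column is varchar, values that are logically numeric or dates have to be cast at query time; a minimal sketch against the table above:
trino> select cast(user_id as bigint) as user_id, user_name, cast(birthday as date) as birthday, country
-> from hive.test_db.external_partition_tb;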
Querying external_partition_tb returns no rows. This is because the Hive Metastore (HMS) is not yet aware of the partitions; we must either add a partition manually or run the automatic partition discovery procedure:
trino>
trino> use hive.test_db;
USE
trino:test_db>
trino:test_db> call system.create_empty_partition(
-> schema_name => 'test_db',
-> table_name => 'external_partition_tb',
-> partition_columns => array['birthday', 'country'],
-> partition_values => array['2018-08-16', 'china']
-> );
CALL
trino:test_db>
trino:test_db> select * from hive.test_db.external_partition_tb;
user_id | user_name | birthday | country
---------+-----------+------------+---------
1 | zhang_san | 2018-08-16 | china
(1 row)
Query 20220209_114018_00044_jx84g, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0.54 [1 rows, 32B] [1 rows/s, 60B/s]
trino:test_db>
trino:test_db>
trino:test_db> call system.sync_partition_metadata(
-> schema_name => 'test_db',
-> table_name => 'external_partition_tb',
-> mode => 'FULL',
-> case_sensitive => true
-> );
CALL
trino:test_db>
trino:test_db> select * from hive.test_db.external_partition_tb;
user_id | user_name | birthday | country
---------+-----------+------------+---------
1 | zhang_san | 2018-08-16 | china
2 | li_si | 2019-08-16 | japan
(2 rows)
Query 20220209_114728_00048_jx84g, FINISHED, 1 node
Splits: 3 total, 3 done (100.00%)
0.50 [2 rows, 57B] [3 rows/s, 114B/s]
trino:test_db>
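Besides FULL, which both registers partitions found on HDFS and drops partitions whose directories no longer exist, sync_partition_metadata also accepts the ADD and DROP modes. A minimal sketch that only registers newly appeared partitions (case_sensitive defaults to true and is omitted here):
trino:test_db> call system.sync_partition_metadata(
-> schema_name => 'test_db',
-> table_name => 'external_partition_tb',
-> mode => 'ADD'
-> );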
trino>
trino> insert into hive.test_db.test_external_tb(user_id, user_name, birthday, country) values(1, 'zhang_san', date '2018-08-16', 'china'), (2, 'li_si', date '2019-08-16', 'japan');
INSERT: 2 rows
Query 20220209_095722_00002_jx84g, FINISHED, 2 nodes
Splits: 5 total, 5 done (100.00%)
17.84 [0 rows, 0B] [0 rows/s, 0B/s]
trino>
trino> insert into hive.test_db.test_partition_tb select * from hive.test_db.test_external_tb;
INSERT: 2 rows
Query 20220209_095933_00003_jx84g, FINISHED, 2 nodes
Splits: 5 total, 5 done (100.00%)
6.99 [2 rows, 589B] [0 rows/s, 84B/s]
trino>
trino>
trino> create table hive.test_db.create_table_insert (
-> user_id,
-> user_name,
-> birthday,
-> country
-> ) as select * from hive.test_db.test_partition_tb;
CREATE TABLE: 2 rows
Query 20220209_103421_00011_jx84g, FINISHED, 2 nodes
Splits: 6 total, 6 done (100.00%)
2.42 [2 rows, 732B] [0 rows/s, 302B/s]
trino>
A table created with create table table_name(......) as select ...... copies the column structure but not the table properties; the new table can, however, declare table properties of its own, as sketched below.
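For example (hypothetical table name ctas_with_props; the property values are illustrative), a CTAS statement can declare its own format and partitioning in the with clause:
trino> create table hive.test_db.ctas_with_props
-> with (format = 'PARQUET', partitioned_by = array['birthday', 'country'])
-> as select * from hive.test_db.test_partition_tb;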
trino> select * from hive.test_db.test_partition_tb;
user_id | user_name | birthday | country
---------+-----------+------------+---------
1 | zhang_san | 2018-08-16 | china
2 | li_si | 2019-08-16 | japan
(2 rows)
Query 20220209_115202_00053_jx84g, FINISHED, 1 node
Splits: 3 total, 3 done (100.00%)
0.44 [2 rows, 732B] [4 rows/s, 1.62KB/s]
trino>
With the Hive connector, DELETE from a non-transactional table is only supported when the WHERE clause references partition columns alone, so that entire partitions are dropped:
trino> delete from hive.test_db.test_partition_tb where birthday = date '2019-08-16';
DELETE
Query 20220209_115234_00054_jx84g, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0.91 [0 rows, 0B] [0 rows/s, 0B/s]
trino>
trino> select * from hive.test_db.test_partition_tb;
user_id | user_name | birthday | country
---------+-----------+------------+---------
1 | zhang_san | 2018-08-16 | china
(1 row)
Query 20220209_115259_00055_jx84g, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0.49 [1 rows, 374B] [2 rows/s, 759B/s]
trino>
trino> select * from hive.test_db.external_partition_tb;
user_id | user_name | birthday | country
---------+-----------+------------+---------
2 | li_si | 2019-08-16 | japan
1 | zhang_san | 2018-08-16 | china
(2 rows)
Query 20220209_115403_00056_jx84g, FINISHED, 1 node
Splits: 3 total, 3 done (100.00%)
0.39 [2 rows, 57B] [5 rows/s, 147B/s]
trino>
trino> delete from hive.test_db.external_partition_tb where birthday = '2019-08-16';
DELETE
Query 20220209_115453_00059_jx84g, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0.42 [0 rows, 0B] [0 rows/s, 0B/s]
trino>
trino>
trino> select * from hive.test_db.external_partition_tb;
user_id | user_name | birthday | country
---------+-----------+------------+---------
1 | zhang_san | 2018-08-16 | china
(1 row)
Query 20220209_115649_00060_jx84g, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0.58 [1 rows, 32B] [1 rows/s, 55B/s]
trino>