Importing CSV Data into a Parquet-Format Dynamically Partitioned Table in Hive

Introduction

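This article describes how to import a csv or txt file into a Hive dynamically partitioned table stored in parquet format. The overall flow is: first load the csv file into an intermediate Hive table stored as textfile, then insert the data from that table into the parquet-format Hive table. A text file cannot be loaded into the parquet table directly, because LOAD DATA only moves the file and does not convert its format. The steps below use a csv file as the example.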

Preparation

Create the original table

CREATE EXTERNAL TABLE IF NOT EXISTS db.ods_table_hi (
receive_time STRING,
user_id INT, 
user_name STRING,
warn_time STRING
) PARTITIONED BY (dt STRING, hr STRING) STORED AS parquet TBLPROPERTIES (
'partition.time-extractor.timestamp-pattern'='$dt $hr:00:00',
'sink.partition-commit.delay'='0s',
'sink.partition-commit.policy.kind'='metastore,success-file',
'parquet.compression'='SNAPPY',
'sink.shuffle-by-partition.enable'='true'
);
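As a quick sanity check, assuming the DDL above ran without errors, the table definition can be inspected before any data is loaded. This is only a sketch; the output should list dt and hr as partition columns and show a Parquet SerDe and input/output format.

```sql
-- Inspect the definition of the table created above
-- (columns, partition columns, storage format, HDFS location).
DESCRIBE FORMATTED db.ods_table_hi;

-- No partitions are expected until the first insert.
SHOW PARTITIONS db.ods_table_hi;
```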

Prepare the csv data file

File path on the local machine: /data/input/data-20220405.csv

20220405150600,1,zhangsan,20220405150600
20220405150600,2,lisi,20220405150600
20220405150600,3,wangwu,20220405150600

Running the import

Step 1: Create the intermediate table

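Create an intermediate table stored as textfile, with the same columns as the original table and a comma as the field delimiter.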

CREATE TABLE IF NOT EXISTS db.ods_tmp_table_hi (
receive_time STRING,
user_id INT, 
user_name STRING,
warn_time STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '#'
MAP KEYS TERMINATED BY '$'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Step 2: Load the csv file into the Hive intermediate table

LOAD DATA LOCAL INPATH '/data/input/data-20220405.csv'
OVERWRITE INTO TABLE db.ods_tmp_table_hi;
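Before moving on, it can help to confirm that the load parsed the file as intended. The queries below are a minimal sketch, assuming the intermediate table lives in the db database as created in step 1 and that the sample file shown earlier was loaded.

```sql
-- The row count should match the number of lines in the csv file
-- (3 for the sample file above).
SELECT COUNT(*) FROM db.ods_tmp_table_hi;

-- A field that cannot be parsed as INT is read as NULL by the text SerDe,
-- so rows with a NULL user_id usually point to malformed input or a wrong delimiter.
SELECT * FROM db.ods_tmp_table_hi WHERE user_id IS NULL;

-- Look at a few rows to confirm the columns line up with the delimiter.
SELECT * FROM db.ods_tmp_table_hi LIMIT 10;
```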

Step 3: Insert the intermediate table's data into the original table

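-- Enable dynamic partitioning; nonstrict mode is required because neither dt nor hr is given a static value.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE db.ods_table_hi PARTITION (dt, hr)
SELECT *
,FROM_UNIXTIME(UNIX_TIMESTAMP(warn_time,'yyyyMMddHHmmss'),'yyyy-MM-dd') AS dt
,FROM_UNIXTIME(UNIX_TIMESTAMP(warn_time,'yyyyMMddHHmmss'),'HH') AS hr
FROM db.ods_tmp_table_hi;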
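To see which partitions the statement above writes to, the same two expressions can be run as a plain SELECT against the intermediate table. This is a sketch for inspection only and writes nothing. For a sample row with warn_time 20220405150600, the expressions should yield dt = 2022-04-05 and hr = 15.

```sql
-- Preview the partition values derived from warn_time.
SELECT
  warn_time,
  FROM_UNIXTIME(UNIX_TIMESTAMP(warn_time,'yyyyMMddHHmmss'),'yyyy-MM-dd') AS dt,
  FROM_UNIXTIME(UNIX_TIMESTAMP(warn_time,'yyyyMMddHHmmss'),'HH') AS hr
FROM db.ods_tmp_table_hi
LIMIT 10;
```

A warn_time value that does not match the yyyyMMddHHmmss pattern produces NULL for both expressions, and a row with a NULL dynamic partition value is written to Hive's default partition (__HIVE_DEFAULT_PARTITION__), so NULLs in this preview are worth fixing first.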

Step 4: Check the import result

Check the row count of the table and how the data is partitioned on HDFS.
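One possible way to run these checks from Hive itself is sketched below, assuming both tables are in the db database as in the earlier steps. With the sample file, a single partition dt=2022-04-05/hr=15 holding 3 rows would be expected.

```sql
-- List the partitions registered in the metastore.
SHOW PARTITIONS db.ods_table_hi;

-- Row count per partition in the parquet table.
SELECT dt, hr, COUNT(*) AS row_cnt
FROM db.ods_table_hi
GROUP BY dt, hr
ORDER BY dt, hr;

-- The total above should equal the row count of the intermediate table.
SELECT COUNT(*) FROM db.ods_tmp_table_hi;
```

The HDFS location of the table appears in the Location field of DESCRIBE FORMATTED db.ods_table_hi; the partition directories sit beneath it in the form dt=.../hr=....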
