Hive Data Import: Importing from Query Results


0. Introduction

This article introduces one way of importing data into Hive: importing from query results.

1. Importing from another table at table-creation time

This method creates a new table and, in the same statement, takes over the columns and data of an existing table. A typical use case is quickly extracting a subset of data for testing.

0: jdbc:hive2://xxx.hadoop.com:2181,xxx2.h> CREATE  TABLE if not exists testA(
. . . . . . . . . . . . . . . . . . . . . . .>  id string comment '',
. . . . . . . . . . . . . . . . . . . . . . .>  name  string COMMENT ''
. . . . . . . . . . . . . . . . . . . . . . .> )partitioned by (pdt int)
. . . . . . . . . . . . . . . . . . . . . . .> STORED AS PARQUET
. . . . . . . . . . . . . . . . . . . . . . .> TBLPROPERTIES ("parquet.compression"="SNAPPY");

0: jdbc:hive2://xxx.hadoop.com:2181,xxx2.h> create table testB as select id, name from testA;
INFO  : Compiling command(queryId=hive_20211123142151_32d40208-47c8-4f73-a9ec-de5aad51807c): create table testB as select id, name from testA
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:name, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20211123142151_32d40208-47c8-4f73-a9ec-de5aad51807c); Time taken: 0.414 seconds
INFO  : Executing command(queryId=hive_20211123142151_32d40208-47c8-4f73-a9ec-de5aad51807c): create table testB as select id, name from testA
INFO  : Query ID = hive_20211123142151_32d40208-47c8-4f73-a9ec-de5aad51807c
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Subscribed to counters: [] for queryId: hive_20211123142151_32d40208-47c8-4f73-a9ec-de5aad51807c
INFO  : Tez session hasn't been created yet. Opening session
INFO  : Dag name: create table testB as select id, nam...testA (Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id application_1637046596410_10638)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1            container     SUCCEEDED      0          0        0        0       0       0  
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 01/02  [==========================>>] 100%  ELAPSED TIME: 8.63 s     
----------------------------------------------------------------------------------------------
INFO  : Status: DAG finished successfully in 7.96 seconds
INFO  : 
INFO  : Query Execution Summary
INFO  : ----------------------------------------------------------------------------------------------
INFO  : OPERATION                            DURATION
INFO  : ----------------------------------------------------------------------------------------------
INFO  : Compile Query                           0.41s
INFO  : Prepare Plan                            9.99s
INFO  : Get Query Coordinator (AM)              0.00s
INFO  : Submit Plan                             0.42s
INFO  : Start DAG                               1.16s
INFO  : Run DAG                                 7.96s
INFO  : ----------------------------------------------------------------------------------------------
INFO  : 
INFO  : Task Execution Summary
INFO  : ----------------------------------------------------------------------------------------------
INFO  :   VERTICES      DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
INFO  : ----------------------------------------------------------------------------------------------
INFO  :      Map 1              0.00              0              0               0                0
INFO  :  Reducer 2           2306.00          4,180             89               0                0
INFO  : ----------------------------------------------------------------------------------------------

INFO  : OK
No rows affected (20.638 seconds)
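Note that a plain CTAS like the one above creates testB with the cluster's default file format (typically TextFile, unless `hive.default.fileformat` says otherwise), and Hive's CTAS cannot produce a partitioned table. If the new table should keep the Parquet/Snappy layout of testA, the storage clauses can be added to the CTAS itself; a sketch using the document's table names:

```sql
-- CTAS with an explicit storage format: testB's columns and types come
-- from the SELECT, while the clauses below set Parquet + Snappy.
CREATE TABLE testB
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY")
AS SELECT id, name FROM testA;
```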

2. Importing from one Hive table into another

Import data from table A into table B:

INSERT INTO TABLE testB select id, name from testA where id = '1';

2.1 Technique 1: appending data

Keyword: insert into table

This method is mostly used to append new data, for example in scheduled or real-time ingestion jobs, or when migrating historical data in multiple passes.

INSERT INTO TABLE testB select id, name from testA where id = '1';
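When migrating historical data in several passes, Hive's multi-insert form can populate several targets from a single scan of the source table. A sketch, where testC is a hypothetical second target with the same columns as testB:

```sql
-- One pass over testA feeds two target tables; testC is illustrative only.
FROM testA
INSERT INTO TABLE testB SELECT id, name WHERE id = '1'
INSERT INTO TABLE testC SELECT id, name WHERE id = '2';
```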

2.2 Technique 2: overwriting data

Keyword: insert overwrite table

This method suits daily tables and similar jobs that run on a fixed schedule and need to overwrite existing data.

INSERT OVERWRITE TABLE testB select id, name from testA where id = '1';

2.3 Technique 3: partitioned data

Keyword: insert overwrite table partition(xxx=xxx)

This method is used when the partition of the inserted data comes from a field in the query, optionally after some processing such as a modulo.

INSERT OVERWRITE TABLE testB partition(pdt=20210101) select id, name from testA where id = '1';
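The statement above fixes the partition value statically. When the partition value really should come from a query field, possibly transformed, dynamic partitioning handles it. A sketch, assuming testB is partitioned by pdt; the modulo is only an illustrative transformation:

```sql
-- Required before any dynamic-partition insert.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The last SELECT column feeds the pdt partition; here ids are
-- bucketed into 10 partitions by modulo.
INSERT OVERWRITE TABLE testB PARTITION (pdt)
SELECT id, name, cast(id AS int) % 10 AS pdt FROM testA;
```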

2.4 Technique 4: importing into compressed tables (important)

In practice most Hive tables are compressed, so external plain files cannot be loaded into them directly; a workaround is needed.

Key steps:

1. Create a temporary table.

2. Store the partition key as a regular column.

3. Insert from the temporary table into the final table, specifying the partition column.

A detailed example follows.

Suppose the original table is testA, partitioned by pdt and compressed with Snappy:

CREATE  TABLE if not exists testA(
 id string comment '',
 name  string COMMENT ''
)partitioned by (pdt int)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY");

The data to import may come from Excel, or may need to be generated; inserting it row by row with INSERT is far too slow, especially in bulk.

So we use a temporary table, testA_tmp, where the partition key pdt is stored as an ordinary column. Fields are comma-delimited and the table is a plain text file, which makes loading easy:

CREATE  TABLE if not exists testA_tmp(
 id string comment '',
 name  string COMMENT '',
 pdt int COMMENT ''
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Prepare a text file, tmp.txt, containing the data:

tmp.txt:
1,2,20210101
3,4,20210101

Upload it to HDFS and load it into the temporary table:

hadoop fs -put tmp.txt /dataTmp/
load data inpath '/dataTmp/tmp.txt' into table testA_tmp;
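If the file sits on the machine where the Hive client runs, the separate HDFS upload can be skipped with the LOCAL keyword; note that LOCAL copies the file, whereas the plain INPATH form moves it within HDFS. The local path below is hypothetical:

```sql
-- Copies the client-side file straight into the table's storage location.
LOAD DATA LOCAL INPATH '/home/user/tmp.txt' INTO TABLE testA_tmp;
```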

In the Hive CLI, enable dynamic partitioning and run the insert:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE testA PARTITION (pdt) SELECT * FROM testA_tmp;

And that's it: the data has been imported into the target table, partitions included.
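To confirm the result, the partitions and a few rows of the target table can be inspected; illustrative checks:

```sql
-- Each distinct pdt value from the staging data should appear here.
SHOW PARTITIONS testA;

-- Sample the newly written partition.
SELECT * FROM testA WHERE pdt = 20210101 LIMIT 10;
```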
