A Pitfall with Hive Multi-Table Inserts

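As is well known, Hive's multi-table insert avoids the wasted resources of reading the same dataset several times, and so improves performance. While using it, however, I ran into a small pitfall, so I'm writing it down here to share~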

1. Environment

Hadoop: 2.9.1
Hive: 1.2.2.6

2. Reproducing the Problem

2.1 Creating the tables
create table test_o (`id` int, `value` string);
create table test_i1 (`id` int, `value` string) partitioned by (`part` string);
create table test_i2 (`id` int, `value` string) partitioned by (`part` string);
2.2 Inserting test data
insert into table test_o values (1,"1"),(2,"2"),(3,"3"),(4,"4"),(5,"5"),(6,"6"),(7,"7"),(8,"8"),(9,"9"),(10,"10");
2.3 Results by combination

Template:

from test_o
insert {into/overwrite} table {test_i1/test_i2} partition(part="{INTO/OVERWRITE}")
    select id, value 
insert {into/overwrite} table {test_i2/test_i1} partition(part="{INTO/OVERWRITE}")
    select id, value;
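
For concreteness, here is the template filled in for one of the combinations tested below (into test_i1 plus overwrite test_i2). It uses only the tables from 2.1; the two count queries at the end are not part of the original test and are just one possible way to check each partition.

```sql
-- A single scan of test_o feeds both insert branches.
from test_o
insert into table test_i1 partition(part="INTO")
    select id, value
insert overwrite table test_i2 partition(part="OVERWRITE")
    select id, value;

-- Illustrative check: row count per partition in each target table.
select part, count(*) as cnt from test_i1 group by part;
select part, count(*) as cnt from test_i2 group by part;
```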

Result:

(The name in parentheses is the value of part used by that insert branch.)

Combination                                                     Result
into test_i1 (INTO) + into test_i2 (INTO)                       Correct
into test_i1 (INTO) + overwrite test_i2 (OVERWRITE)             Correct
overwrite test_i1 (OVERWRITE) + into test_i2 (INTO)             Correct
overwrite test_i1 (OVERWRITE) + overwrite test_i2 (OVERWRITE)   Correct
into test_i1 (INTO) + into test_i1 (OVERWRITE)                  Correct
overwrite test_i1 (INTO) + overwrite test_i1 (OVERWRITE)        Correct
into test_i1 (INTO) + overwrite test_i1 (OVERWRITE)             Wrong: both append
overwrite test_i1 (OVERWRITE) + into test_i1 (INTO)             Wrong: both append

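  • Inserting into different tables: the result is always correct;
  • Inserting into the same table using only into, or only overwrite: the result is always correct;
  • Inserting into the same table with overwrite and into mixed: both branches end up appending;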
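
The failing case can be reproduced with a sketch like the one below. It assumes test_i1 starts empty and that the statement is run twice, so that the second run has existing data to overwrite; the count query is an illustrative check, not part of the original test.

```sql
-- Mixed into/overwrite branches that target the same table.
-- Run this statement twice.
from test_o
insert into table test_i1 partition(part="INTO")
    select id, value
insert overwrite table test_i1 partition(part="OVERWRITE")
    select id, value;

-- Illustrative check: row count per partition.
select part, count(*) as cnt from test_i1 group by part;
```

If the overwrite branch worked as intended, the OVERWRITE partition would still hold only the 10 rows of test_o after the second run. With the behaviour described above, it is appended to instead, just like the INTO partition.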

3. Workarounds

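Workaround                                                              Pros                        Cons
Split the HQL into several single-table inserts                         Simple, easy to follow      Slower; the same data is read repeatedly
Drop the partitions to be overwritten, then run the multi-table insert  Almost no performance cost  Not very elegant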
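
Both workarounds might look roughly like this for the failing combination from section 2.3. This is a sketch under assumptions rather than a tested recipe: using insert into in both branches after the drop is my own choice, made so the statement no longer mixes the two keywords (the into + into case on the same table is listed as correct above). The two variants are alternatives and are not meant to be run one after the other.

```sql
-- Workaround 2 (sketch): empty the partition that should be overwritten,
-- then keep the single-scan multi-table insert.
alter table test_i1 drop if exists partition (part="OVERWRITE");

from test_o
insert into table test_i1 partition(part="INTO")
    select id, value
insert into table test_i1 partition(part="OVERWRITE")
    select id, value;

-- Workaround 1 (sketch): two independent statements, each scanning test_o.
insert into table test_i1 partition(part="INTO")
    select id, value from test_o;
insert overwrite table test_i1 partition(part="OVERWRITE")
    select id, value from test_o;
```

Note that the drop and the insert in the first variant are separate statements, so the partition is briefly absent between them. Dropping a partition removes its data for managed tables like the ones created here; for external tables the files would normally stay in place, so this variant would need extra care there.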

Hopefully a later version will fix this small problem~
