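This post covers three Hive virtual columns:
INPUT__FILE__NAME: the name (full path) of the input file
BLOCK__OFFSET__INSIDE__FILE: the offset of the data block within the file
ROW__OFFSET__INSIDE__BLOCK: the offset of the record within the data block
Note: the words in each name are separated by two underscores (__).
To query ROW__OFFSET__INSIDE__BLOCK, the following parameter must be set to true (its default is false).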
<property>
<name>hive.exec.rowoffset</name>
<value>false</value>
<description>Whether to provide the row offset virtual column</description>
</property>
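For a single session, the parameter can also be inspected and changed from the Hive CLI instead of editing the configuration file. A minimal sketch:

```sql
-- Print the current value of the parameter
set hive.exec.rowoffset;
-- Enable the row offset virtual column for this session only
set hive.exec.rowoffset=true;
```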
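These virtual columns are not returned by default (for example, by select *); they appear only when named explicitly in the SQL statement.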
create table t1(c1 string) stored as textfile;
load data local inpath '/etc/profile' overwrite into table t1;
set hive.cli.print.header=true;
set hive.exec.rowoffset=true;
select INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE,ROW__OFFSET__INSIDE__BLOCK from t1;
The result is shown below. Every record has a different block__offset__inside__file value, which grows from one record to the next, and ROW__OFFSET__INSIDE__BLOCK is 0 for every record.
hive> select INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE,ROW__OFFSET__INSIDE__BLOCK from t1;
OK
input__file__name block__offset__inside__file row__offset__inside__block
file:/user/hive/warehouse/test.db/t1/profile 0 0
file:/user/hive/warehouse/test.db/t1/profile 15 0
file:/user/hive/warehouse/test.db/t1/profile 16 0
file:/user/hive/warehouse/test.db/t1/profile 80 0
file:/user/hive/warehouse/test.db/t1/profile 122 0
file:/user/hive/warehouse/test.db/t1/profile 123 0
file:/user/hive/warehouse/test.db/t1/profile 191 0
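Virtual columns can also be used outside the select list, for instance in WHERE and GROUP BY clauses. The following is a minimal sketch against the t1 table above; the alias row_count is illustrative, and the offset 80 is taken from the output above.

```sql
-- Count the records that come from each underlying file
select INPUT__FILE__NAME, count(*) as row_count
from t1
group by INPUT__FILE__NAME;

-- Fetch the record that starts at offset 80 in the text file
select c1
from t1
where BLOCK__OFFSET__INSIDE__FILE = 80;
```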
create table t1_orc stored as orc as select * from t1;
select INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE,ROW__OFFSET__INSIDE__BLOCK from t1_orc;
The result is shown below. In this columnar storage format, many records share the same block__offset__inside__file value.
hive> select INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE,ROW__OFFSET__INSIDE__BLOCK from t1_orc;
OK
input__file__name block__offset__inside__file row__offset__inside__block
file:/user/hive/warehouse/test.db/t1_orc/000000_0 3 0
file:/user/hive/warehouse/test.db/t1_orc/000000_0 1480 0
file:/user/hive/warehouse/test.db/t1_orc/000000_0 1480 0
file:/user/hive/warehouse/test.db/t1_orc/000000_0 1480 0
file:/user/hive/warehouse/test.db/t1_orc/000000_0 1480 0
file:/user/hive/warehouse/test.db/t1_orc/000000_0 1480 0
file:/user/hive/warehouse/test.db/t1_orc/000000_0 1480 0
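One way to see how many records share each offset is to group by the virtual column. A minimal sketch, assuming the t1_orc table above; the aliases block_offset and row_count are illustrative names.

```sql
-- Count how many records report each block offset in the ORC file
select BLOCK__OFFSET__INSIDE__FILE as block_offset, count(*) as row_count
from t1_orc
group by BLOCK__OFFSET__INSIDE__FILE
order by block_offset;
```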
create table t1_parquet stored as parquet as select * from t1;
select INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE,ROW__OFFSET__INSIDE__BLOCK from t1_parquet;
The result is shown below. In this columnar storage format too, some records share the same block__offset__inside__file value.
hive> select INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE,ROW__OFFSET__INSIDE__BLOCK from t1_parquet;
OK
input__file__name block__offset__inside__file row__offset__inside__block
file:/user/hive/warehouse/test.db/t1_parquet/000000_0 32 0
file:/user/hive/warehouse/test.db/t1_parquet/000000_0 32 0
file:/user/hive/warehouse/test.db/t1_parquet/000000_0 65 0
file:/user/hive/warehouse/test.db/t1_parquet/000000_0 98 0
file:/user/hive/warehouse/test.db/t1_parquet/000000_0 131 0
file:/user/hive/warehouse/test.db/t1_parquet/000000_0 164 0
file:/user/hive/warehouse/test.db/t1_parquet/000000_0 197 0
file:/user/hive/warehouse/test.db/t1_parquet/000000_0 230 0
file:/user/hive/warehouse/test.db/t1_parquet/000000_0 263 0
file:/user/hive/warehouse/test.db/t1_parquet/000000_0 296 0
ROW__OFFSET__INSIDE__BLOCK is 0 in all three storage formats tested here (text, ORC, and Parquet), which is why it cannot be queried by default.
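Rather than scanning the output by eye, the observation can be checked with an aggregate over each table. A minimal sketch, assuming the three tables created above; the alias max_row_offset is illustrative. A maximum of 0 for a table means every record in it reports a row offset of 0.

```sql
-- The row offset virtual column must be enabled before it can be referenced
set hive.exec.rowoffset=true;

-- Largest row offset reported in each table
select max(ROW__OFFSET__INSIDE__BLOCK) as max_row_offset from t1;
select max(ROW__OFFSET__INSIDE__BLOCK) as max_row_offset from t1_orc;
select max(ROW__OFFSET__INSIDE__BLOCK) as max_row_offset from t1_parquet;
```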