In a previous article, I showed how to use Flink to consume data from Kafka and sink it to HDFS in Parquet format, bucketed by day and hour. Details: https://blog.csdn.net/liuxiao723846/article/details/107695737
This produced the following layout on HDFS:
/data/test/dt=2020-08-07
    hour=00
        part-0-0
        part-0-1
    hour=01
        part-0-0
        part-0-1
Next, we create a Hive external table over this data, partitioned by dt and hour.
1. Create the table:
CREATE EXTERNAL TABLE `test_table`(
`interfaceName` string,
`uri` string,
`reqMethod` string,
`status` int,
`isGzip` boolean,
`response` string,
`serverIp` string,
`reqCost` float,
`time` bigint,
`schema` string)
PARTITIONED BY (
`dt` string,
`hour` string)
STORED AS PARQUET
LOCATION '/data/test'
TBLPROPERTIES('parquet.compression'='SNAPPY');
1) When creating the table, you may hit this error: FAILED: ParseException line 16:0 missing EOF at 'LOCATION' near ')'
Fix (see https://stackoverflow.com/questions/22463444/hive-error-parseexception-missing-eof): put LOCATION before TBLPROPERTIES, as in the DDL above. The failing order is sketched below.
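For reference, this ordering is what triggers the parse error, since Hive's CREATE TABLE grammar expects TBLPROPERTIES to come after LOCATION (a minimal sketch with a single column; the table name is made up):

-- Fails with "missing EOF at 'LOCATION'": TBLPROPERTIES must come after LOCATION
CREATE EXTERNAL TABLE `bad_order_table`(
  `uri` string)
PARTITIONED BY (`dt` string)
STORED AS PARQUET
TBLPROPERTIES('parquet.compression'='SNAPPY')
LOCATION '/data/test';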
2) Hive data types:
Hive has both primitive and complex types. The common primitive types are listed below (a version-gated example follows the table):
Type      | Size / range                                                       | Available since
TINYINT   | 1 byte, -128 to 127                                                |
SMALLINT  | 2 bytes, -32,768 to 32,767                                         |
INT       | 4 bytes, -2,147,483,648 to 2,147,483,647                           |
BIGINT    | 8 bytes, -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807   |
BOOLEAN   |                                                                    |
FLOAT     | 4 bytes, single precision                                          |
DOUBLE    | 8 bytes, double precision                                          |
STRING    |                                                                    |
BINARY    |                                                                    | Hive 0.8.0
TIMESTAMP |                                                                    | Hive 0.8.0
DECIMAL   |                                                                    | Hive 0.11.0
CHAR      |                                                                    | Hive 0.13.0
VARCHAR   |                                                                    | Hive 0.12.0
DATE      |                                                                    | Hive 0.12.0
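The version column matters when you target an older Hive; a hedged sketch using the version-gated types (the table and column names here are made up for illustration):

CREATE TABLE type_demo (
  amount     DECIMAL(10,2),  -- Hive 0.11.0+
  code       VARCHAR(64),    -- Hive 0.12.0+
  flag       CHAR(1),        -- Hive 0.13.0+
  event_date DATE,           -- Hive 0.12.0+
  raw        BINARY,         -- Hive 0.8.0+
  created    TIMESTAMP       -- Hive 0.8.0+
);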
2. Querying:
1) Set the database and queue, then load the partitions (an explicit alternative to msck repair is sketched after the commands):
hive
use test;
set mapred.job.queue.name=root.test;
msck repair table test_table;
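msck repair table scans the table LOCATION (/data/test) and registers every dt=/hour= directory it finds as a partition. If that is slow on a large tree, or you want to add partitions one at a time as the Flink job rolls files, ALTER TABLE ... ADD PARTITION does the same thing explicitly; a sketch using the paths from the listing at the top:

ALTER TABLE test_table ADD IF NOT EXISTS
  PARTITION (dt='2020-08-07', hour='00') LOCATION '/data/test/dt=2020-08-07/hour=00'
  PARTITION (dt='2020-08-07', hour='01') LOCATION '/data/test/dt=2020-08-07/hour=01';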
2) When running queries, you may hit the following error:
java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.io.IntWritable
The root cause is a mismatch between the Hive column types and the types actually stored in the Parquet files: when creating a Parquet-backed Hive table, each column type must match the type of the corresponding field in the POJO that wrote the data. If we don't have access to that POJO, the only option is to inspect a Parquet file with a tool and read off its schema. An illustrative mismatch is sketched below.
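For example (a deliberately wrong declaration for illustration, not from the original job): `status` is stored as a Parquet int32, so Hive must declare it int. Declaring it bigint makes the Parquet serde hand an IntWritable to a long inspector, which throws exactly the exception above:

-- Wrong: Parquet stores `status` as int32, but the DDL says bigint;
-- reading the column fails with "Cannot inspect org.apache.hadoop.io.IntWritable"
`status` bigint,
-- Right: match the Parquet physical type
`status` int,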
3. The parquet-tools utility
GitHub: https://github.com/apache/parquet-mr/tree/master/parquet-tools?spm=5176.doc52798.2.6.H3s2kL
Community build: http://logservice-resource.oss-cn-shanghai.aliyuncs.com/tools/parquet-tools-1.6.0rc3-SNAPSHOT.jar?spm=5176.doc52798.2.7.H3s2kL&file=parquet-tools-1.6.0rc3-SNAPSHOT.jar
The GitHub source has to be built with mvn, so it is easier to grab a prebuilt jar from the community link above, or from my CSDN resources: https://download.csdn.net/download/liuxiao723846/12657908
1) View the schema:
$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar schema -d part-2-57 | head -n 30
message com.abc.test.stream.entity.MyBean {
required binary interfaceName (UTF8);
required binary reqMethod (UTF8);
required binary uri (UTF8);
required int32 status;
required boolean isGzip;
required binary response (UTF8);
required binary serverIp (UTF8);
required float reqCost;
required int64 time;
required binary schema (UTF8);
}
creator: parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra: parquet.avro.schema = {"type":"record","name":"MyBean","namespace":"com.abc.test.stream.entity","fields":[{"name":"interfaceName","type":"string"},{"name":"reqMethod","type":"string"},{"name":"uri","type":"string"},{"name":"status","type":"int"},{"name":"isGzip","type":"boolean"},{"name":"response","type":"string"},{"name":"serverIp","type":"string"},{"name":"reqCost","type":"float"},{"name":"time","type":"long"},{"name":"schema","type":"string"}]}
extra: writer.model.name = avro
...
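Reading that output against the DDL from step 1, the types line up according to the standard Parquet-to-Hive mapping (as far as I know, this is what Hive's Parquet serde uses):

-- Parquet type    -> Hive column type
-- binary (UTF8)   -> string   (interfaceName, reqMethod, uri, response, serverIp, schema)
-- int32           -> int      (status)
-- boolean         -> boolean  (isGzip)
-- float           -> float    (reqCost)
-- int64           -> bigint   (time)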
2) View the contents:
$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar head -n 2 part-2-57
...
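Besides schema and head, this version of the jar also ships meta (row-group and compression details) and cat (dump every record); subcommands can vary across builds, so verify against your jar:

$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar meta part-2-57
$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar cat part-2-57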