python json数据创建hive格式数据_hive仓库创建json格式表

在现有的数据资产管理系统中,系统支持text、orc等格式的表,其中text格式的表列可以由多种分割符进行分割,由于需要,仓库中需要存储json相关的数据,因此,系统需要支持json格式表创建,遇到的问题总结如下:

实现

首先就是引入JsonSerDe包的问题。

首先,系统自带的有相关的jar包。

另外,也有第三方开源的jar包。第三方

第三方做了相关的优化,下面有有介绍。

###hive官方文档

hive官方文档有相关json serde介绍

Stored as plain text file inJSON format.

The JsonSerDe forJSON files is available inHive 0.12 and later.

In some distributions, a reference to hive-hcatalog-core.jar is required.

ADD JAR /usr/lib/hive-hcatalog/lib/hive-hcatalog-core.jar;

CREATE TABLE my_table(a string, b bigint, ...)

ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'

STORED AS TEXTFILE;

The JsonSerDe was moved to Hive from HCatalog and before it was inhive-contrib project. It was added to the Hive distribution by HIVE-4895.

An Amazon SerDe is available at s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar forreleases prior to 0.12.0.

The JsonSerDe forJSON files is available inHive 0.12 and later.

Starting inHive 3.0.0, JsonSerDe is added to Hive Serde as "org.apache.hadoop.hive.serde2.JsonSerDe" (HIVE-19211).

CREATE TABLE my_table(a string, b bigint, ...)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'

STORED AS TEXTFILE;

Or STORED AS JSONFILE is supported starting inHive 4.0.0 (HIVE-19899), so you can create table as follows:

CREATE TABLE my_table(a string, b bigint, ...) STORED AS JSONFILE;

创建

在hive客户端执行以下

create EXTERNAL table employees_json1 (firstName string, lastname string, employeenumber int )

ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'

;

可能会遇到以下问题:

1、创建时Error:”java.lang.ClassNotFoundException Class org.apache.hive.hcatalog.data.JsonSerDe not found”

原因是缺少相关的包。

导入json数据到 test_data.json文件中

{ "firstName" : "asMike", "lastName" : "asdfpesky", "employeeNumber" : 40192 }

{ "firstName" : "asMike", "lastName" : "asdfesky", "employeeNumber" : 123192 }

{ "firstName" : "Mikasde", "lastName" : "Chepesky", "employeeNumber" : 192 }

然后上传数据到hdfs上

hadoop fs -put -f test_data.json /user/datacenter/test_data.json

插入数据到表中

load data inpath '/user/datacenter/test_data.json' overwrite into table employees_json1;

查询数据

select * from employees_json1

hive> select * from employees_json1;

OK

asMikeasdfpesky40192

asMikeasdfesky123192

MikasdeChepesky192

如果需要用到映射关系可以使用第三方开源jar,上面有所提及。

如:json文件中使用的是firstName,表里面的字段名称是name,则可以通过以下方式:

create EXTERNAL table employees_json2 (name string, lastname string, employeenumber int )

ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'

WITH SERDEPROPERTIES (

"mapping.name"="firstName"

);

同样,查询出来的结果是:

hive> select * from employees_json1;

OK

asMikeasdfpesky40192

asMikeasdfesky123192

MikasdeChepesky192

第三方的jar需要导入,有关导入jar遇到的坑,下次再写。

你可能感兴趣的:(python)