I recently had to deal with a dataset in JSON format. I actually wrote an earlier post on this topic:
hive加载json数据和解析json(一) (Loading and parsing JSON in Hive, part 1): https://blog.csdn.net/lsr40/article/details/79399166
That post introduced two ways of handling JSON:
Method 1: load the JSON directly through a SerDe
Method 2: extract the values from the JSON string with Hive's built-in functions
However, it focused mostly on the two built-in functions get_json_object() and json_tuple() and only mentioned the SerDe approach in passing. This time I loaded the JSON using the first method (a SerDe).
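For reference, method 2 boils down to these two built-in calls; here is a minimal sketch, where raw_json and its string column json_str are hypothetical names for a table holding one JSON document per row:
--get_json_object pulls out one value per call, using a simple JSONPath
select get_json_object(json_str, '$.key1') as key1,
       get_json_object(json_str, '$.map1') as map1_json --a nested object comes back as a JSON string
from raw_json;
--json_tuple extracts several top-level keys in a single pass (it cannot reach into nested objects)
select t.k1, t.k2, t.l
from raw_json
lateral view json_tuple(json_str, 'key1', 'key2', 'last') t as k1, k2, l;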
The data looks like this:
{
    "key1": "xxxx",
    "key2": "XXXX",
    "map1": {
        "M-a": "ma",
        "M-b": "mb",
        "M-c": "mc"
    },
    "last": "last1"
}
Here are the questions that came to mind when I first saw the data:
1. Do I need an extra jar, and if so, how do I add it?
2. All the other fields are plain strings, but map1 needs to be a Map or Struct type.
3. Can the irregular field names inside map1, such as M-a and M-b, be mapped to other names, e.g. m_a and m_b?
4. The JSON has no date field, so I need to generate one myself.
1. An extra JSON SerDe jar is needed. You can use the JsonSerDe that ships with Apache Hive (covered in the previous post), or a more powerful third-party JsonSerDe (recommended if the JSON structure is at all complex):
http://www.congiu.net/hive-json-serde/1.3.7/cdh5/json-serde-1.3.7-jar-with-dependencies.jar
2. The jar has to be added in two places.
First: the local Hive client.
Put it in the client's Hive lib directory so it is loaded when Hive starts, or run the add jar command manually in every session (a quick sketch of this follows this list).
Second: add the jar to the cluster by putting it under the corresponding path $HADOOP_HOME/share/hadoop/mapreduce/.
See: http://www.voidcn.com/article/p-dewklslz-bhu.html
Both steps are required; otherwise you will hit all kinds of class-not-found errors.
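For the per-session route, a minimal sketch (the /opt path is simply where I dropped the jar; adjust it to your environment):
--register the SerDe jar for the current Hive session only
add jar /opt/json-serde-1.3.7-jar-with-dependencies.jar;
--confirm the jar is registered in this session
list jars;
Copying the jar into the client's Hive lib directory and into $HADOOP_HOME/share/hadoop/mapreduce/ makes it available permanently, so the per-session add jar is no longer needed.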
For example, here is the kind of error you get when the jar is missing:
Task with the most failures(4):
-----
Task ID:
task_1559639870610_13878_m_000000
URL:
http://<master-node>:8088/taskdetails.jsp?jobid=job_1559639870610_13878&tipid=task_1559639870610_13878_m_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:455)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
... 17 more
Caused by: java.lang.RuntimeException: Map operator initialization failed
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:147)
... 22 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassNotFoundException: Class org.openx.data.jsonserde.JsonSerDe not found
at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:323)
at org.apache.hadoop.hive.ql.exec.MapOperator.setChildren(MapOperator.java:333)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:116)
... 22 more
Caused by: java.lang.ClassNotFoundException: Class org.openx.data.jsonserde.JsonSerDe not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2255)
at org.apache.hadoop.hive.ql.plan.PartitionDesc.getDeserializer(PartitionDesc.java:137)
at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:297)
... 24 more
Question 3 can only be solved with the recommended jar (its mapping serde properties let you rename fields)!
The table is created as follows:
--first add the jar we just downloaded, then create the table
add jar /opt/json-serde-1.3.7-jar-with-dependencies.jar;
create table json_test(
`key1` string,
`key2` string,
`map1` struct<
    `m_a`:string,
    `m_b`:string,
    `m_c`:string>,
`last` string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
"mapping.m_a" = "M-a",
"mapping.m_b" = "M-b",
"mapping.m_c" = "M-c"
)
STORED AS TEXTFILE;
--the map1 field could also be declared as map<string,string>; the access syntax then differs from struct
--a map value is read as map1['m_a'], a struct field as map1.m_a
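Once data is loaded, the struct fields are read with dot notation; a quick query sketch against the table above (the mapping properties expose the JSON keys M-a/M-b/M-c as m_a/m_b/m_c):
--dot notation for the struct; `last` is backticked just in case it collides with a keyword
select key1, map1.m_a, map1.m_b, map1.m_c, `last`
from json_test;
Had map1 been declared as map<string,string> instead, the same lookups would use the bracket syntax mentioned in the comment above.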
Question 4 turned out to be a real headache!
Here is what I did: I added a date partition column called pt_days.
add jar /opt/json-serde-1.3.7-jar-with-dependencies.jar;
create table json_test(
`key1` string,
`key2` string,
`map1` struct<
    `m_a`:string,
    `m_b`:string,
    `m_c`:string>,
`last` string)
PARTITIONED BY (
`pt_days` string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
"mapping.m_a" = "M-a",
"mapping.m_b" = "M-b",
"mapping.m_c" = "M-c"
)
STORED AS TEXTFILE;
After creating the table as above,
I ran: load data local inpath '<local path to the json file>' into table json_test partition (pt_days=20191001);
Loading the data worked fine, but querying it immediately threw a null pointer error:
Failed with exception nulljava.lang.NullPointerException
My guess is that when the SerDe parses the JSON it cannot find a pt_days field, hence the error. The requirement was urgent and I did not have much time to dig into the root cause properly, so I settled for a compromise.
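Just to give an idea of one possible shape such a compromise can take (this is only a sketch, and the table names below are made up): load the raw JSON into an unpartitioned SerDe table first, then copy it into an ordinary partitioned table while supplying the date explicitly.
--json_test_stage: unpartitioned table using the JsonSerDe; json_test_final: regular partitioned table with the same columns (both names are hypothetical)
insert overwrite table json_test_final partition (pt_days='20191001')
select key1, key2, map1, `last`
from json_test_stage;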
For the exact differences between Apache's JSON SerDe and the one I recommend, you can search online or read the following article:
0659-6.2.0-Hive处理JSON格式数据 (Hive handling JSON-format data): https://cloud.tencent.com/developer/article/1451308 (author: Fayson)