Using the Oozie Hive Action on CDH5: A Collection of Pitfalls

The original starting point was to build a web API that could submit Hive QL, so that I could add a lot of custom functionality at the API layer. So I started digging for information on RESTful interfaces for Hive and found that Hive ships a WebHCat service: you can submit Hive QL through it, have the output written to a specified HDFS directory, and it even has a callback feature that sends a message to a given URL when execution finishes. Reading the source code, the principle is pretty much the same as Oozie's: it launches a single-mapper job to manage the query, which spawns a child process that runs the `hive -e` command. The tragedy is that I could not find any way to set the memory of the YARN container for the child hive process, and the MapReduce jobs launched by the Hive QL actually ran in local mode, so as soon as the data volume grew the YARN container OOMed. On top of that, all status information is reported by regex-scraping whatever the child process prints to stderr, which is a really awkward design; the final query result goes to stdout, sometimes with WARN messages mixed in, and handing that stdout to a client for download just doesn't fly. So I gave up on WebHCat. That said, the RESTful API that HCat provides for executing DDL statements is actually quite nice: internally it uses the MetaStoreClient directly to run DDL against the metastore.
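For the record, a WebHCat query submission looks roughly like this (a sketch: the host is my own cluster, 50111 is WebHCat's default port, and the statusdir/callback values are made up):

    # host/statusdir/callback are illustrative; 50111 is the default WebHCat port
    curl -s -d execute='select * from test;' \
         -d statusdir='/tmp/webhcat.output' \
         -d callback='http://myapp/notify?jobid=$jobId' \
         'http://name84:50111/templeton/v1/hive?user.name=hdfs'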

Next I dug into the hive-jdbc code. Hive JDBC talks to the HiveServer2 service, but the Statement's query execution is blocking: the submission to the Thrift server itself is asynchronous, yet the implementation of execute() sits in a while loop polling the execution status and only exits the loop once the query has finished. I figured I could pull the relevant code out and send the requests myself: submitting a statement returns an OperationHandle that can be used to manage the job (cancel it, query its status), but the changes would be substantial, and the JDBC-style code seems usable only from the submitting thread. To read the status from another thread and expose an API so the frontend can show execution progress, I would have had to change, and change, and change.


So I went back to the WebHCat pitfalls to figure out how to fix the problem of Hive executing in local mode, which kicked off another long round of searching. On StackOverflow I dug up a post advising against using WebHCat to submit Hive queries: even though WebHCat works on much the same principle as Oozie, the recommendation was to submit through Oozie instead. So off I went to the Oozie web service API, and found that it really does provide an interface for running Hive QL directly:

https://oozie.apache.org/docs/4.1.0/WebServicesAPI.html#Proxy_Hive_Job_Submission

Proxy Hive Job Submission

You can submit a Workflow that contains a single Hive action without writing a workflow.xml. Any required Jars or other files must already exist in HDFS.

The following properties are required:

  • fs.default.name : The NameNode
  • mapred.job.tracker : The JobTracker
  • user.name : The username of the user submitting the job
  • oozie.hive.script : Contains the hive script you want to run (the actual script, not a file path)
  • oozie.libpath : A directory in HDFS that contains necessary Jars for your job
  • oozie.proxysubmission : Must be set to true

The following properties are optional:

  • oozie.hive.script.params.size : The number of parameters you'll be passing to Hive
  • oozie.hive.script.params.n : A parameter (variable definition for the script) in 'key=value' format, the 'n' should be an integer starting with 0 to indicate the parameter number
  • oozie.hive.options.size : The number of options you'll be passing to Hive
  • oozie.hive.options.n : An argument to pass to Hive, the 'n' should be an integer starting with 0 to indicate the option number

The oozie.hive.options.n parameters are sent directly to Hive without any modification unless they start with -D, in which case they are put into the <configuration> element of the action.

Request:

POST /oozie/v1/jobs?jobtype=hive
Content-Type: application/xml;charset=UTF-8

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:8020</value>
    </property>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:8021</value>
    </property>
    <property>
        <name>user.name</name>
        <value>rkanter</value>
    </property>
    <property>
        <name>oozie.hive.script</name>
        <value>
            CREATE EXTERNAL TABLE test (a INT) STORED AS TEXTFILE LOCATION '${INPUT}';
            INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM test;
        </value>
    </property>
    <property>
        <name>oozie.hive.script.params.size</name>
        <value>2</value>
    </property>
    <property>
        <name>oozie.hive.script.params.0</name>
        <value>OUTPUT=/user/rkanter/examples/output-data/hive</value>
    </property>
    <property>
        <name>oozie.hive.script.params.1</name>
        <value>INPUT=/user/rkanter/examples/input-data/table</value>
    </property>
    <property>
        <name>oozie.libpath</name>
        <value>hdfs://localhost:8020/user/rkanter/share/lib/hive</value>
    </property>
    <property>
        <name>oozie.proxysubmission</name>
        <value>true</value>
    </property>
</configuration>
Response:

HTTP/1.1 201 CREATED
Content-Type: application/json;charset=UTF-8
{
  id: "job-3"
}
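
Submitting such a payload with curl looks like this (the host is my own cluster; 11000 is Oozie's default port):

    # payload.xml holds the configuration XML shown above
    curl -s -X POST -H 'Content-Type: application/xml;charset=UTF-8' \
         --data-binary @payload.xml \
         'http://name84:11000/oozie/v1/jobs?jobtype=hive'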
 
  
So I created a test table `test` in Hive and threw three random rows into it.
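The setup was nothing fancier than something like this (the exact DDL below is my reconstruction, not the original):

    -- reconstruction of the test table setup
    CREATE TABLE test (a INT);
    INSERT INTO TABLE test VALUES (1), (2), (3);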
Then I sent a request exactly as the interface documentation describes. Damn it: `table not found: test`. Are you kidding me??? A look at the logs showed that Hive was not connecting to the existing metastore at all; it had created an embedded Derby metastore, which makes the missing table easy to understand. So I added one more property to the request's configuration:

    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://name85:9083</value>
    </property>
to specify the remote metastore to connect to. I resent the request and... holy crap, still table not found, and it was still connecting to the embedded Derby metastore! (At this moment I could feel my dream of building a web API on top of Hive starting to crumble.) So another long googling session began... Some articles said that an Oozie hive action needs a <job-xml> element pointing at a hive-site.xml file on HDFS, so that the Hive properties can propagate to the child process:
 
  

Syntax:


<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="[NODE-NAME]">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
               <delete path="[PATH]"/>
               ...
               <mkdir path="[PATH]"/>
               ...
            </prepare>
            <job-xml>[HIVE SETTINGS FILE]</job-xml>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <script>[HIVE-SCRIPT]</script>
            <param>[PARAM-VALUE]</param>
                ...
            <param>[PARAM-VALUE]</param>
            <file>[FILE-PATH]</file>
            ...
            <archive>[FILE-PATH]</archive>
            ...
        </hive>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>
So I wrote a workflow.xml in Hue that pointed a job-xml at a hive-site.xml file, and sure enough it found the remote metastore and executed successfully.
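The relevant fragment of that workflow looked something like this (the hive-site.xml path and script name here are illustrative, not my exact setup):

    <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>name84:8032</job-tracker>
        <name-node>hdfs://name84:8020</name-node>
        <job-xml>/user/hdfs/conf/hive-site.xml</job-xml>
        <script>test.hql</script>
    </hive>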
But damn it, this proxy-submission interface provides no web request parameter for specifying a job-xml. So the only way forward was to download the Oozie source code, see what is actually going on, and look for somewhere this property could be set. I had studied and used Oozie before (it is also what we use for job scheduling today), and as I remembered it, Oozie always needs a generated workflow.xml placed on HDFS with an app path pointing at it to run anything. BUT this interface needs no workflow.xml and no app path. Curious how that could work, I downloaded the Oozie source (Oozie 4.1, the version matching CDH 5.7) to take a closer look.
First, configure the Oozie server to allow remote debugging:
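(A typical way to do this for the Tomcat-based Oozie server that ships with CDH; the port number is an arbitrary choice:)

    # in oozie-env.sh: standard JDWP flags; 8000 is an arbitrary debug port
    export CATALINA_OPTS="$CATALINA_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8000"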
Stepping through the code with the remote debugger revealed what happens: although we pass in a bare HQL string, the interface internally generates a dummy.hive file for us, puts the HQL inside it, and then generates the workflow definition itself. The oozie.hive.script.params.n request parameters are used to substitute the variables in the HQL. The oozie.hive.options.n request parameters, if they start with -D, are added to the configuration of the hive action in the generated workflow definition; otherwise they are appended to the hive shell command line. And the file specified by job-xml only supplies defaults: properties set in configuration override those from the job-xml. So the oozie.hive.options.n parameters can be used to pass Hive properties. I added the following two parameters to the request to configure the metastore URI:

    <property>
        <name>oozie.hive.options.size</name>
        <value>1</value>
    </property>
    <property>
        <name>oozie.hive.options.0</name>
        <value>-Dhive.metastore.uris=thrift://name85:9083</value>
    </property>
I resubmitted the request and YES, it executed successfully! The final raw request payload looks like this:


    
        fs.default.name
        hdfs://name84:8020
    
    
        mapred.job.tracker
        name84:8032
    
    
        user.name
        hdfs
    
    
        oozie.hive.script
        
          
            INSERT OVERWRITE DIRECTORY '${OUTPUT}' select * from test
		          
		
    
    
        oozie.hive.script.params.size
        1
    
    
        oozie.hive.script.params.0
        OUTPUT=/tmp/output-data3/hive
    
   
        oozie.libpath
        hdfs://name84:8020/user/oozie/share/lib/lib_20170725165530/hive
    
    
        oozie.proxysubmission
        true
    
  
    oozie.hive.options.size
    1
  
  
    oozie.hive.options.0
    -Dhive.metastore.uris=thrift://name85:9083
  

Our logs are all in JSON format, so our tables are created with a JSON SerDe in the row format (which comes from a third-party jar). I switched to one of the production tables and ran a query through the Oozie interface; sure enough it failed, unable to find the serialization/deserialization classes from that third-party jar. So I dropped the jar into the hive action's sharelib directory (the directory the oozie.libpath above points at), resubmitted the request, and OK, it succeeded.
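Roughly like this (the jar name is illustrative, and depending on the setup a sharelib refresh or an Oozie restart may be needed before the change is picked up):

    # jar name is illustrative; the target is the oozie.libpath directory used above
    hdfs dfs -put json-serde.jar /user/oozie/share/lib/lib_20170725165530/hive/
    # ask the oozie server to rescan the sharelib
    oozie admin -oozie http://name84:11000/oozie -sharelibupdate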
Then I went on to try a custom UDF, and wouldn't you know it, stepped straight into another pit! The UDF is registered as a permanent function, so its jar is loaded automatically whenever the function is used, yet the query died with this error:
2017-07-26 12:30:20,067 FATAL [IPC Server handler 1 on 42314] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1501042569430_0004_m_000000_0 - exited : java.lang.RuntimeException: Failed to load plan: hdfs://name84:8020/tmp/hive/hdfs/aea8e323-daa9-494b-8d56-e6f6071c1da0/hive_2017-07-26_12-30-07_109_6584100257170331154-1/-mr-10004/28cb7c1f-f35f-4230-b8a5-ff80cb07eadc/map.xml: org.apache.hive.com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: org.joda.time.format.DateTimeFormatterBuilder$PaddedNumber
Serialization trace:
iParsers (org.joda.time.format.DateTimeFormatterBuilder$Composite)
iParser (org.joda.time.format.DateTimeFormatter)
isoFormatter (brickhouse.udf.json.InspectorHandle$PrimitiveHandle)
mapValHandle (brickhouse.udf.json.InspectorHandle$MapHandle)
inspHandle (brickhouse.udf.json.FromJsonUDF)
genericUDF (org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc)
colExprMap (org.apache.hadoop.hive.ql.exec.SelectOperator)
childOperators (org.apache.hadoop.hive.ql.exec.FilterOperator)
childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator)
aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork)
	at org.apache.hadoop.hive.ql.exec.Utilities.getBaseWork(Utilities.java:451)
	at org.apache.hadoop.hive.ql.exec.Utilities.getMapWork(Utilities.java:296)
	at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:268)
	at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:502)
	at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:477)
	at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:717)
	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.(MapTask.java:169)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hive.com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: org.joda.time.format.DateTimeFormatterBuilder$PaddedNumber
Serialization trace:
iParsers (org.joda.time.format.DateTimeFormatterBuilder$Composite)
iParser (org.joda.time.format.DateTimeFormatter)
isoFormatter (brickhouse.udf.json.InspectorHandle$PrimitiveHandle)
mapValHandle (brickhouse.udf.json.InspectorHandle$MapHandle)
inspHandle (brickhouse.udf.json.FromJsonUDF)
genericUDF (org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc)
colExprMap (org.apache.hadoop.hive.ql.exec.SelectOperator)
childOperators (org.apache.hadoop.hive.ql.exec.FilterOperator)
childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator)
aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:139)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:139)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:672)
	at org.apache.hadoop.hive.ql.exec.Utilities.deserializeObjectByKryo(Utilities.java:1070)
	at org.apache.hadoop.hive.ql.exec.Utilities.deserializePlan(Utilities.java:961)
	at org.apache.hadoop.hive.ql.exec.Utilities.deserializePlan(Utilities.java:975)
	at org.apache.hadoop.hive.ql.exec.Utilities.getBaseWork(Utilities.java:417)
	... 13 more
Caused by: java.lang.ArrayStoreException: org.joda.time.format.DateTimeFormatterBuilder$PaddedNumber
	at org.apache.hive.com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:367)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:276)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
	... 58 more

At first glance this looks like Kryo serialization/deserialization blowing up, so another round of googling followed... with no result. Then, looking at the ArrayStoreException, my gut said joda-time jar conflict. A quick check: Oozie depends on joda-time 2.1, Hive on 1.6, and my UDF jar bundled 2.9! Could that really be the cause? I decided to try unifying the joda-time versions first. hive-exec.jar actually packages the joda-time classes directly inside itself, so I removed the joda-time dependency from my UDF jar to align with Hive's version, then deleted joda-time-2.1.jar from the Oozie hive sharelib and restarted the Oozie service.
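In Maven terms the UDF-side change amounts to something like this (an assumption on my part that the UDF project is built with Maven):

    <!-- "provided" keeps joda-time out of the UDF jar, so the copy packaged
         inside hive-exec.jar is the only one on the task classpath -->
    <dependency>
        <groupId>joda-time</groupId>
        <artifactId>joda-time</artifactId>
        <version>1.6</version>
        <scope>provided</scope>
    </dependency>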
I resubmitted the request and, better than I had dared hope, it really did succeed.
Having climbed out of the pit and dusted the mud off, I can finally start designing my Hive execution engine. More pits will be documented here as I hit them...

