说明:
在大数据开发中,由于数据量非常的大,有时候需要将关系型数据库(Oracle/Postgresql/Mysql等)中的表导入到hdfs中进行分布式存储。本文就oozie定时调用sqoop从oracle库中增量导入数据至hive表涉及到的问题做了详细的讨论,仅供参考。
1/ 定义sqoop job
要实现增量更新,首先需要创建相应的sqoop job,sqoop每次完成action之后都会将最新的last_value自动保存到sqoop的metastore中,用于替换下一次调用job时的last_value值,从而实现增量更新。
sqoop job \
--meta-connect jdbc:hsqldb:hsql://10.45.6.34:16000/sqoop \
--create cust_add_job \
--import --connect jdbc:oracle:thin:@10.45.6.34:1521:wsmp \
--username 123 \
--password 123 -m 1 \
--table CUST \
--hive-import --hive-table cust \
--incremental append --check-column CUST_ID --last-value 0
2/ 定义workflow.xml
应该说oozie使用自带的sqoop/hive包,而sqoop和hive安装目录下的配置文件oozie是读不到的,所以需要在workflow.xml指定相关的配置。
<workflow-app xmlns="uri:oozie:workflow:0.2" name="sqoop-to-hive">
<start to="sqoop2hive"/>
<action name="sqoop2hive">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}job-tracker>
<name-node>${nameNode}name-node>
<configuration>
<property>
<name>sqoop.metastore.client.record.passwordname>
<value>truevalue>
property>
<property>
<name>hive.metastore.localname>
<value>truevalue>
property>
<property>
<name>hive.metastore.urisname>
<value>thrift://10.45.6.34:9083value>
property>
<property>
<name>javax.jdo.option.ConnectionURLname>
<value>jdbc:postgresql://10.45.40.174:5432/dvsp_hive?createDatabaseIfNotExist=truevalue>
property>
<property>
<name>javax.jdo.option.ConnectionDriverNamename>
<value>org.postgresql.Drivervalue>
property>
<property>
<name>javax.jdo.option.ConnectionUserNamename>
<value>hivevalue>
property>
<property>
<name>javax.jdo.option.ConnectionPasswordname>
<value>hivevalue>
property>
<property>
<name>hive.metastore.warehouse.dirname>
<value>/user/hive/warehousevalue>
property>
configuration>
<command>job --meta-connect jdbc:hsqldb:hsql://10.45.6.34:16000/sqoop --exec cust_add_jobcommand>
sqoop>
<ok to="end"/>
<error to="fail"/>
action>
<kill name="fail">
<message>Sqoop failed, error message[${wf:errorMessage(wf:lastErrorNode())}]message>
kill>
<end name="end"/>
workflow-app>
3/ 定义coordinator.xml
这里定义每6分钟执行一次sqoop2hive的增量更新。
<coordinator-app name="sqoop2hive-coord" frequency="${coord:minutes(6)}" start="${start}" end="${end}" timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
<action>
<workflow>
<app-path>${workflowAppUri}app-path>
<configuration>
<property>
<name>jobTrackername>
<value>${jobTracker}value>
property>
<property>
<name>nameNodename>
<value>${nameNode}value>
property>
configuration>
workflow>
action>
coordinator-app>
4/ 定义job.properties
oozie使用的是UTC时间,比北京时间慢8小时。其中oozie.use.system.libpath置为true是必需的。
nameNode=hdfs://10.45.6.34:9000
jobTracker=10.45.6.34:8032
oozie.libpath=${nameNode}/user/${user.name}/share/lib/lib_20161121154957/sqoop
oozie.use.system.libpath=true
start=2016-12-06T07:30Z
end=2016-12-06T08:00Z
oozie.coord.application.path=${nameNode}/user/${user.name}/apps/sqoop2hive_cor
workflowAppUri=${nameNode}/user/${user.name}/apps/sqoop2hive_cor
5/ 一些问题
(1)当oozie执行失败后,想在web端查看详细的错误信息,但是
Console URL:http://centos34:8088/proxy/application_1479784816688_0014/无法访问
可能的原因:没有配置hive-site.xml中的yarn.resourcemanager.webapp.address项。
也可能是IP地址和主机名没有对应:修改hosts文件,添加相应的IP地址和主机名;或者直接将host写成IP的形式。
(2)时间不同步问题
Error Code: JA018
Error Message: begin > end in range (begin, end): (1480312665887, 1480312663350)
该问题是集群的时间同步问题,由hadoop集群的各节点的时间不同步导致,可以使用ntp服务进行时间同步,具体的ntp时间同步方法请百度。
(3)单纯的通过sqoop将oracle库中的数据导入到hive,能够成功,但是通过oozie来执行这一过程却导致失败:
Error Code: 1
Error Message: Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1]
在web端查看HDFS,可以看到数据成功导入到hdfs,但并没有成功导入到hive,执行失败。相关的oozie.log:
2016-12-06 16:22:23,088 INFO ActionStartXCommand:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@:start:] Start action [0000001-161206155113885-oozie-root-W@:start:] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2016-12-06 16:22:23,089 INFO ActionStartXCommand:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@:start:] [***0000001-161206155113885-oozie-root-W@:start:***]Action status=DONE
2016-12-06 16:22:23,089 INFO ActionStartXCommand:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@:start:] [***0000001-161206155113885-oozie-root-W@:start:***]Action updated in DB!
2016-12-06 16:22:23,274 INFO ActionStartXCommand:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@sqoop2hive] Start action [0000001-161206155113885-oozie-root-W@sqoop2hive] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2016-12-06 16:22:23,611 WARN JobSubmitter:151 - SERVER[centos34] Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2016-12-06 16:22:23,629 WARN JobSubmitter:259 - SERVER[centos34] No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2016-12-06 16:22:24,697 INFO SqoopActionExecutor:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@sqoop2hive] checking action, hadoop job ID [job_1480907021185_0025] status [RUNNING]
2016-12-06 16:22:24,700 INFO ActionStartXCommand:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@sqoop2hive] [***0000001-161206155113885-oozie-root-W@sqoop2hive***]Action status=RUNNING
2016-12-06 16:22:24,700 INFO ActionStartXCommand:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@sqoop2hive] [***0000001-161206155113885-oozie-root-W@sqoop2hive***]Action updated in DB!
2016-12-06 16:22:31,198 INFO StatusTransitService$StatusTransitRunnable:541 - SERVER[centos34] USER[-] GROUP[-] Acquired lock for [org.apache.oozie.service.StatusTransitService]
2016-12-06 16:22:31,200 INFO StatusTransitService$StatusTransitRunnable:541 - SERVER[centos34] USER[-] GROUP[-] Running coordinator status service from last instance time = 2016-12-06T08:21Z
2016-12-06 16:22:31,207 INFO StatusTransitService$StatusTransitRunnable:541 - SERVER[centos34] USER[-] GROUP[-] Running bundle status service from last instance time = 2016-12-06T08:21Z
2016-12-06 16:22:31,210 INFO StatusTransitService$StatusTransitRunnable:541 - SERVER[centos34] USER[-] GROUP[-] Released lock for [org.apache.oozie.service.StatusTransitService]
2016-12-06 16:22:31,342 INFO PauseTransitService:541 - SERVER[centos34] USER[-] GROUP[-] Acquired lock for [org.apache.oozie.service.PauseTransitService]
2016-12-06 16:22:31,357 INFO PauseTransitService:541 - SERVER[centos34] USER[-] GROUP[-] Released lock for [org.apache.oozie.service.PauseTransitService]
2016-12-06 16:23:09,629 INFO CallbackServlet:541 - SERVER[centos34] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@sqoop2hive] callback for action [0000001-161206155113885-oozie-root-W@sqoop2hive]
2016-12-06 16:23:09,920 INFO SqoopActionExecutor:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@sqoop2hive] checking action, hadoop job ID [job_1480907021185_0025] status [RUNNING]
2016-12-06 16:23:13,860 INFO CallbackServlet:541 - SERVER[centos34] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@sqoop2hive] callback for action [0000001-161206155113885-oozie-root-W@sqoop2hive]
2016-12-06 16:23:14,494 INFO SqoopActionExecutor:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@sqoop2hive] action completed, external ID [job_1480907021185_0025]
2016-12-06 16:23:14,506 WARN SqoopActionExecutor:544 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@sqoop2hive] Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1]
2016-12-06 16:23:14,603 INFO ActionEndXCommand:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@sqoop2hive] ERROR is considered as FAILED for SLA
2016-12-06 16:23:14,687 INFO ActionStartXCommand:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@fail] Start action [0000001-161206155113885-oozie-root-W@fail] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2016-12-06 16:23:14,688 INFO ActionStartXCommand:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@fail] [***0000001-161206155113885-oozie-root-W@fail***]Action status=DONE
2016-12-06 16:23:14,688 INFO ActionStartXCommand:541 - SERVER[centos34] USER[root] GROUP[-] TOKEN[] APP[sqoop-to-hive] JOB[0000001-161206155113885-oozie-root-W] ACTION[0000001-161206155113885-oozie-root-W@fail] [***0000001-161206155113885-oozie-root-W@fail***]Action updated in DB!
解决方式:
这个问题让人很没办法,因为有效的log就只有一句:Error Message: Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1]。
先是查看了源码,然后是找遍了网上所有相关的问题及解决方法,但貌似都没能有效的搞定,唯一能够确定的就是这类错误应该是在hive命令执行过程中发生的。
因此每个人碰到这种错误的具体情况可能都不一样,只能对各个涉及到的参数进行试错,总能找到错误的地方。
最后找到本文产生这种错误的原因就是,所有url中的host请使用ip地址(如127.0.0.1),而不是主机名字,否则可能产生上面的问题。
6/ 一些其他说明:
(1)在workflow.xml中的各节点元素的顺序一定要同官方文档中对应,不要打乱,否则可能会报语法错误导致无法提交,比如:Error: E0701: XML schema error, cvc-complex-type.2.4.a: Invalid content was found starting with element ‘prepare’. One of ‘{“uri:oozie:sqoop-action:0.2”:configuration, “uri:oozie:sqoop-action:0.2”:command, “uri:oozie:sqoop-action:0.2”:arg}’ is expected.
(2)需要开启hive的metastore服务,否则会报错:java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
(3)需要开启sqoop的metastore服务,否则会报错:ERROR tool.JobTool: I/O error performing job operation: java.io.IOException: Exception creating SQL connection
(4)关于hive的配置问题,可以通过文件方式配置(job-xml元素指定hive-site.xml文件在hdfs上的位置),也可以configuration元素中以property子元素方式给定(本文就是这种方式)。如果两种方式都用上,貌似configuration里面的配置信息会覆盖外部配置文件的内容。
(5)关于sqoop job执行的时候,需要输入数据库的密码,需要在sqoop-site.xml中设置sqoop.metastore.client.record.password为true,同时还需要configuration元素中设置,否则可能执行失败。
完整代码下载地址:
参考文献:
http://ingest.tips/2014/11/27/how-to-oozie-sqoop-hive/