Oozie is a Java web application that runs in a Java servlet container (Tomcat) and uses a database to store the following:
(4) Oozie in CDH 5.7.0
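In CDH 5.7.0, the Oozie version is 4.1.0, and its metadata is stored in MySQL. For the Oozie properties in CDH 5.7.0, see the following link:

    yarn.nodemanager.resource.memory-mb = 2000
    yarn.scheduler.maximum-allocation-mb = 2000

Otherwise, an error like the following is reported when the workflow job runs: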
The specific steps are:
    sqoop metastore > /tmp/sqoop_metastore.log 2>&1 &

For the problem of Oozie being unable to run a Sqoop job, see the following link: http://www.lamborryan.com/oozie-sqoop-fail/
    last_value=$(sqoop job --show myjob_incremental_import | grep incremental.last.value | awk '{print $3}')
    sqoop job --delete myjob_incremental_import
    sqoop job \
      --meta-connect jdbc:hsqldb:hsql://cdh2:16000/sqoop \
      --create myjob_incremental_import \
      -- \
      import \
      --connect "jdbc:mysql://cdh1:3306/source?useSSL=false&user=root&password=mypassword" \
      --table sales_order \
      --columns "order_number, customer_number, product_code, order_date, entry_date, order_amount" \
      --hive-import \
      --hive-table rds.sales_order \
      --incremental append \
      --check-column order_number \
      --last-value $last_value

Here $last_value holds the value recorded by the previous ETL run.
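If `sqoop job --show` prints no matching line (for example, because the saved job does not exist yet), `last_value` is empty and `--last-value` is passed without an argument. The following is a minimal sketch of a guard, assuming the relevant line has the form `incremental.last.value = <n>`, which is what the `awk '{print $3}'` above implies. The helper name `extract_last_value` and the fallback of 0 are illustrative assumptions, not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: reads `sqoop job --show` output on stdin and prints
# the stored incremental.last.value, or a fallback when no such line exists.
extract_last_value() {
    fallback="${1:-0}"   # assumed default; pick a value that suits your check column
    value=$(grep -F 'incremental.last.value' | awk '{print $3}' | head -n 1)
    if [ -z "$value" ]; then
        echo "$fallback"
    else
        echo "$value"
    fi
}

# Usage with the saved job from the text (not executed here):
# last_value=$(sqoop job --show myjob_incremental_import | extract_last_value 0)
```

With a fallback of 0 and `--incremental append`, the first run would import every row whose `order_number` is greater than 0; whether that is the desired behavior depends on the source data.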
    <?xml version="1.0" encoding="UTF-8"?>
    <workflow-app xmlns="uri:oozie:workflow:0.1" name="regular_etl">
      <start to="fork-node"/>
      <fork name="fork-node">
        <path start="sqoop-customer"/>
        <path start="sqoop-product"/>
        <path start="sqoop-sales_order"/>
      </fork>

      <action name="sqoop-customer">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <arg>import</arg>
          <arg>--connect</arg>
          <arg>jdbc:mysql://cdh1:3306/source?useSSL=false</arg>
          <arg>--username</arg>
          <arg>root</arg>
          <arg>--password</arg>
          <arg>mypassword</arg>
          <arg>--table</arg>
          <arg>customer</arg>
          <arg>--hive-import</arg>
          <arg>--hive-table</arg>
          <arg>rds.customer</arg>
          <arg>--hive-overwrite</arg>
          <file>/tmp/hive-site.xml#hive-site.xml</file>
          <archive>/tmp/mysql-connector-java-5.1.38-bin.jar#mysql-connector-java-5.1.38-bin.jar</archive>
        </sqoop>
        <ok to="joining"/>
        <error to="fail"/>
      </action>

      <action name="sqoop-product">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <arg>import</arg>
          <arg>--connect</arg>
          <arg>jdbc:mysql://cdh1:3306/source?useSSL=false</arg>
          <arg>--username</arg>
          <arg>root</arg>
          <arg>--password</arg>
          <arg>mypassword</arg>
          <arg>--table</arg>
          <arg>product</arg>
          <arg>--hive-import</arg>
          <arg>--hive-table</arg>
          <arg>rds.product</arg>
          <arg>--hive-overwrite</arg>
          <file>/tmp/hive-site.xml#hive-site.xml</file>
          <archive>/tmp/mysql-connector-java-5.1.38-bin.jar#mysql-connector-java-5.1.38-bin.jar</archive>
        </sqoop>
        <ok to="joining"/>
        <error to="fail"/>
      </action>

      <action name="sqoop-sales_order">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <command>job --exec myjob_incremental_import --meta-connect jdbc:hsqldb:hsql://cdh2:16000/sqoop</command>
          <file>/tmp/hive-site.xml#hive-site.xml</file>
          <archive>/tmp/mysql-connector-java-5.1.38-bin.jar#mysql-connector-java-5.1.38-bin.jar</archive>
        </sqoop>
        <ok to="joining"/>
        <error to="fail"/>
      </action>

      <join name="joining" to="hive-node"/>

      <action name="hive-node">
        <hive xmlns="uri:oozie:hive-action:0.2">
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <job-xml>/tmp/hive-site.xml</job-xml>
          <script>/tmp/regular_etl.sql</script>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
      </action>

      <kill name="fail">
        <message>Sqoop failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
      </kill>
      <end name="end"/>
    </workflow-app>

Its DAG is shown in the figure below.
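Every `to` and `start` attribute in the workflow has to name a node that is actually defined, so a quick text-level check before deployment can catch a dangling transition. The following is a minimal sketch under that assumption; the function name `check_transitions` is hypothetical, and the check only compares attribute strings, so it is not a substitute for validating the file against the Oozie workflow schema.

```shell
#!/bin/sh
# Hypothetical helper: prints every transition target (to="..." or start="...")
# that does not match any name="..." attribute in the file, and returns 1 if
# at least one such target is found. Plain-text comparison only.
check_transitions() {
    wf="$1"
    # All values of name="..." attributes (node names plus the workflow-app name).
    names=$(grep -o ' name="[^"]*"' "$wf" | sed 's/.*name="\([^"]*\)"/\1/' | sort -u)
    # All values of to="..." and start="..." attributes.
    targets=$(grep -oE ' (to|start)="[^"]*"' "$wf" | sed 's/.*="\([^"]*\)"/\1/' | sort -u)
    status=0
    for t in $targets; do
        if ! printf '%s\n' "$names" | grep -qxF -- "$t"; then
            echo "dangling transition target: $t"
            status=1
        fi
    done
    return $status
}

# Usage (not executed here):
# check_transitions workflow.xml && echo "all transitions resolve"
```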
    hdfs dfs -put -f workflow.xml /user/root/
    hdfs dfs -put /etc/hive/conf.cloudera.hive/hive-site.xml /tmp/
    hdfs dfs -put /root/mysql-connector-java-5.1.38/mysql-connector-java-5.1.38-bin.jar /tmp/
    hdfs dfs -put /root/regular_etl.sql /tmp/

(7) Create the job properties file
    nameNode=hdfs://cdh2:8020
    jobTracker=cdh2:8032
    queueName=default
    oozie.use.system.libpath=true
    oozie.wf.application.path=${nameNode}/user/${user.name}

(8) Run the workflow
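Before submitting, it can help to confirm that the properties file defines the keys the workflow refers to (`nameNode`, `jobTracker`) along with `oozie.wf.application.path`. The following is a small sketch of such a check; the helper name `missing_keys` is an assumption for illustration and is not part of Oozie.

```shell
#!/bin/sh
# Hypothetical helper: given a properties file and a list of keys, prints each
# key that has no "key=value" line in the file and returns 1 if any is missing.
missing_keys() {
    file="$1"
    shift
    status=0
    for key in "$@"; do
        # Compare the text before the first '=' on each line with the key, literally.
        if ! cut -d= -f1 "$file" | grep -qxF -- "$key"; then
            echo "missing: $key"
            status=1
        fi
    done
    return $status
}

# Usage (not executed here):
# missing_keys /root/job.properties nameNode jobTracker oozie.wf.application.path
```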
    oozie job -oozie http://cdh2:11000/oozie -config /root/job.properties -run

At this point the running job is visible in the Oozie Web Console, as shown in the figure below.
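The job can also be followed from the command line. On submission the `oozie` client prints a line of the form `job: <job-id>`, and that ID can be passed to `oozie job -info`. The following is a minimal sketch for capturing the ID; the helper name `parse_job_id` is hypothetical, and the ID in the comment is made up purely for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads the output of `oozie job ... -run` on stdin and
# prints the job ID taken from the first line that starts with "job: ".
parse_job_id() {
    sed -n 's/^job: *//p' | head -n 1
}

# Usage (not executed here). A submission line looks like, for example,
#   job: 0000001-160711000000000-oozie-oozi-W
# job_id=$(oozie job -oozie http://cdh2:11000/oozie -config /root/job.properties -run | parse_job_id)
# oozie job -oozie http://cdh2:11000/oozie -info "$job_id"
```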
    nameNode=hdfs://cdh2:8020
    jobTracker=cdh2:8032
    queueName=default
    oozie.use.system.libpath=true
    oozie.coord.application.path=${nameNode}/user/${user.name}
    timezone=UTC
    start=2016-07-11T06:00Z
    end=2020-12-31T07:15Z
    workflowAppUri=${nameNode}/user/${user.name}

(2) Create the coordinator job configuration file
    <coordinator-app name="regular_etl-coord" frequency="${coord:days(1)}" start="${start}" end="${end}" timezone="${timezone}" xmlns="uri:oozie:coordinator:0.1">
      <action>
        <workflow>
          <app-path>${workflowAppUri}</app-path>
          <configuration>
            <property>
              <name>jobTracker</name>
              <value>${jobTracker}</value>
            </property>
            <property>
              <name>nameNode</name>
              <value>${nameNode}</value>
            </property>
            <property>
              <name>queueName</name>
              <value>${queueName}</value>
            </property>
          </configuration>
        </workflow>
      </action>
    </coordinator-app>

(3) Deploy the coordinator job
    hdfs dfs -put -f coordinator.xml /user/root/

(4) Run the coordinator job
    oozie job -oozie http://cdh2:11000/oozie -config /root/job-coord.properties -run

At this point the coordinator job waiting to run is visible in the Oozie Web Console with status PREP, as shown in the figure below.