Oozie Overview

[TOC]

Scheduling frameworks: Linux Crontab, Azkaban, Oozie, Zeus

A comparison of three task scheduling systems

Introduction

Oozie is a workflow scheduling system.

  • Workflows are scheduled as DAGs (directed acyclic graphs)
  • Scalable: Oozie launches each action as a MapReduce launcher job that consists of a map task only, with no reduce phase
  • Reliable: failed tasks are retried
  • Integrates the other workloads of the Hadoop ecosystem, such as MapReduce, Pig, Hive, Sqoop, and Spark

Main components

  • Tomcat (a servlet handles requests and displays jobs in the web UI)
  • A database (stores the jobs)
  • Bundle, Coordinator, and Workflow

Architecture diagram

(figure: Oozie architecture)

Three service modules

  • Oozie V3: a server-based Bundle engine. Wraps multiple coordinators, so that a group of coordinator jobs can be started, stopped, suspended, closed, or restarted together.
  • Oozie V2: a server-based Coordinator engine. Runs multiple workflows; structure: start -> workflows -> end.
  • Oozie V1: a server-based Workflow engine; structure: start -> mr -> pig -> fork -> mr/hive -> join -> end (a sketch follows this list).
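
Below is a minimal sketch of that V1 fork/join shape (the pig step is omitted; the action names and the elided action bodies are hypothetical placeholders, not taken from a real deployment):

<workflow-app xmlns="uri:oozie:workflow:0.5" name="fork-join-sketch">
    <start to="first-mr"/>
    <action name="first-mr">
        <map-reduce><!-- job config omitted in this sketch --></map-reduce>
        <ok to="parallel-fork"/>
        <error to="fail"/>
    </action>
    <!-- fork: both branches start concurrently -->
    <fork name="parallel-fork">
        <path start="parallel-mr"/>
        <path start="parallel-hive"/>
    </fork>
    <action name="parallel-mr">
        <map-reduce><!-- job config omitted in this sketch --></map-reduce>
        <ok to="parallel-join"/>
        <error to="fail"/>
    </action>
    <action name="parallel-hive">
        <hive xmlns="uri:oozie:hive-action:0.2"><!-- script config omitted --></hive>
        <ok to="parallel-join"/>
        <error to="fail"/>
    </action>
    <!-- join: waits until every forked branch has succeeded -->
    <join name="parallel-join" to="end"/>
    <kill name="fail">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>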

Workflow

(figures: workflow structure)

Coordinator

A pitfall I ran into:

Error: E0505 : E0505: App definition [hdfs://localhost:8020/tmp/oozie-app/coordinator/] does not exist

This message is misleading: it turned out the directory was not the problem; the coordinator.xml file was misnamed.


Preparation: unify the time zone

It is recommended to use GMT+0800 (China Standard Time).

On the server, run date -R. If it prints something like the line below, the machine is already on GMT+0800; otherwise set the time zone, usually to Beijing or Shanghai: ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime

Sat, 30 Sep 2017 10:26:58 +0800

Next, edit oozie-site.xml; if the following property is missing, add it:

<property>
    <name>oozie.processing.timezone</name>
    <value>GMT+0800</value>
</property>

This also makes the times shown in the web UI correct.

Examples

Spark Action

Workflow spark on yarn

For reference, see the official documentation on the Spark action.

Directory layout

├── ooziespark
│   ├── job.properties
│   ├── lib
│   │   └── spark-1.6.2-1.0-SNAPSHOT.jar
│   └── workflow.xml

workflow.xml

<workflow-app xmlns="uri:oozie:workflow:0.5" name="spark-wordcount-wf">
    <start to="spark-node"/>
    <action name="spark-node">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- clear the output directory so that reruns do not fail -->
            <prepare>
                <delete path="${nameNode}${outputdir}"/>
            </prepare>
            <master>${master}</master>
            <name>Spark-Wordcount</name>
            <class>WordCount</class>
            <jar>${nameNode}/user/LJK/ooziecoor/lib/spark-1.6.2-1.0-SNAPSHOT.jar</jar>
            <spark-opts>--driver-memory 512M --executor-memory 512M</spark-opts>
            <arg>${inputdir}</arg>
            <arg>${outputdir}</arg>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

job.properties

nameNode=hdfs://nn1:8020
jobTracker=rm:8050
master=yarn-cluster
queueName=default
inputdir=/user/LJK/hello-spark
outputdir=/user/LJK/output
oozie.use.system.libpath=true
oozie.wf.application.path=/user/LJK/ooziespark
#oozie.coord.application.path=${nameNode}/user/LJK/ooziespark
#start=2017-09-28T17:00+0800
#end=2017-09-30T17:00+0800
#workflowAppUri=${nameNode}/user/LJK/ooziespark/

Package the program and copy the jar into the app's lib directory. The test source code:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf()
//      .setJars(List("/Users/LJK/Documents/code/github/study-spark1.6.2/target/spark-1.6.2-1.0-SNAPSHOT.jar"))
//      .set("spark.yarn.historyServer.address", "rm:18080")
//      .set("spark.eventLog.enabled", "true")
//      .set("spark.eventLog.dir", "hdfs://nn1:8020/spark-history")
      .set("spark.testing.memory", "1073741824")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    rdd.saveAsTextFile(args(1))
    sc.stop()
  }
}

Upload the directory to HDFS: hdfs dfs -put ooziespark /user/LJK/
Note: job.properties does not need to be uploaded to HDFS, because the command below reads the local copy, not the one on HDFS.

Start the job with:
oozie job -oozie http://rm:11000/oozie -config /usr/local/share/applications/ooziespark/job.properties -run
or:
oozie job -config /usr/local/share/applications/ooziespark/job.properties -run
The short form works only if the environment variable OOZIE_URL is set; it is used as the default value for the -oozie option. See oozie help for details.

The job's progress can be viewed in the Oozie web UI.

Coordinator spark on yarn

A simple schedule: run WordCount every five minutes.

Directory layout

├── ooziecoor
│   ├── coordinator.xml
│   ├── job.properties
│   ├── lib
│   │   └── spark-1.6.2-1.0-SNAPSHOT.jar
│   └── workflow.xml

coordinator.xml

<coordinator-app name="spark-wordcount-coord" frequency="${coord:minutes(5)}"
                 start="${start}" end="${end}" timezone="GMT+0800"
                 xmlns="uri:oozie:coordinator:0.2">
    <action>
        <workflow>
            <app-path>${workflowAppUri}</app-path>
            <configuration>
                <property>
                    <name>jobTracker</name>
                    <value>${jobTracker}</value>
                </property>
                <property>
                    <name>nameNode</name>
                    <value>${nameNode}</value>
                </property>
                <property>
                    <name>queueName</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
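
A coordinator-app can also open with an optional controls block that bounds how its actions are materialized and run; the values below are illustrative, not part of the configuration above:

<controls>
    <!-- minutes a materialized action may wait before timing out -->
    <timeout>10</timeout>
    <!-- maximum number of actions allowed to run at the same time -->
    <concurrency>1</concurrency>
    <!-- order in which queued actions are executed -->
    <execution>FIFO</execution>
</controls>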

Modify the earlier job.properties to:

nameNode=hdfs://nn1:8020
jobTracker=rm:8050
master=yarn-cluster
queueName=default
inputdir=/user/LJK/hello-spark
outputdir=/user/LJK/output
oozie.use.system.libpath=true
#oozie.wf.application.path=/user/LJK/ooziespark
oozie.coord.application.path=${nameNode}/user/LJK/ooziecoor
start=2017-09-30T09:30+0800
end=2017-09-30T17:00+0800
workflowAppUri=${nameNode}/user/LJK/ooziecoor

The previous workflow.xml can be kept as-is, even without changing the jar location; but to keep each application self-contained, just point the jar path at this app's own lib directory.

Upload to HDFS and run:
oozie job -config /usr/local/share/applications/ooziecoor/job.properties -run

The job can be viewed in the web UI.

Bundle spark on yarn

Directory layout

├── ooziebundle
│   ├── bundle.xml
│   ├── coordinator.xml
│   ├── job.properties
│   ├── lib
│   │   └── spark-1.6.2-1.0-SNAPSHOT.jar
│   └── workflow.xml

Add bundle.xml:

<bundle-app name="spark-wordcount-bundle" xmlns="uri:oozie:bundle:0.2">
    <coordinator name="spark-wordcount-coord">
        <app-path>${nameNode}/user/LJK/ooziebundle/coordinator.xml</app-path>
        <configuration>
            <property>
                <name>start</name>
                <value>${start}</value>
            </property>
            <property>
                <name>end</name>
                <value>${end}</value>
            </property>
        </configuration>
    </coordinator>
</bundle-app>
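
A bundle-app can likewise start with an optional controls block; kick-off-time delays the submission of all its coordinators until the given time (the timestamp here is only illustrative):

<controls>
    <!-- the bundle's coordinators are not submitted before this time -->
    <kick-off-time>2017-09-30T09:30+0800</kick-off-time>
</controls>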

Modify job.properties:

nameNode=hdfs://nn1:8020
jobTracker=rm:8050
master=yarn-cluster
queueName=default
inputdir=/user/LJK/hello-spark
outputdir=/user/LJK/output
oozie.use.system.libpath=true
#oozie.wf.application.path=/user/LJK/ooziespark
#oozie.coord.application.path=${nameNode}/user/LJK/ooziecoor
oozie.bundle.application.path=${nameNode}/user/LJK/ooziebundle
start=2017-09-30T09:30+0800
end=2017-09-30T17:00+0800
workflowAppUri=${nameNode}/user/LJK/ooziebundle

Upload to HDFS and run:
oozie job -config /usr/local/share/applications/ooziebundle/job.properties -run

The job can be viewed in the web UI.

Java Action

Directory layout. The lib directory holds the individual dependency jars rather than a single fat jar, so its contents are not listed; you may choose to build one jar instead.

javaExample/
├── job.properties
├── lib
└── workflow.xml

Note: if you use the Spring Boot framework, you need to add an exclusion in the pom; otherwise the conflicting logging jars will make Oozie report an error.

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter</artifactId>
    <exclusions>
        <exclusion>
            <artifactId>spring-boot-starter-logging</artifactId>
            <groupId>org.springframework.boot</groupId>
        </exclusion>
    </exclusions>
</dependency>

workflow.xml

<workflow-app xmlns="uri:oozie:workflow:0.5" name="java-example-wf">
    <start to="java-node"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="java-node">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>com.sharing.App</main-class>
            <arg>hello</arg>
            <arg>springboot</arg>
        </java>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
job.properties

oozie.use.system.libpath=false
queueName=default
jobTracker=rm.ambari:8050
nameNode=hdfs://nn1.ambari:8020
oozie.wf.application.path=${nameNode}/user/LJK/javaExample

Java program source:

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class App {

    public static void main(String[] args) {
        SpringApplication.run(App.class,args);
        System.out.println(args[0] + " " + args[1]);
    }
}

Shell Action

Directory layout

shell
├── job.properties
└── workflow.xml

workflow.xml

<workflow-app xmlns="uri:oozie:workflow:0.5" name="shell-example-wf">
    <start to="shell-node"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="shell-node">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>echo</exec>
            <argument>hello shell</argument>
            <!-- capture stdout so later nodes can read it -->
            <capture-output/>
        </shell>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
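
Because the action above declares capture-output, any key=value lines the command prints to stdout are captured and can be read by later nodes through the wf:actionData EL function. A sketch of a decision node consuming such output, assuming a hypothetical key named result (the echo above does not actually emit one):

<decision name="check-result">
    <switch>
        <!-- read the captured stdout of shell-node as key=value pairs -->
        <case to="End">${wf:actionData('shell-node')['result'] eq 'ok'}</case>
        <default to="Kill"/>
    </switch>
</decision>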

job.properties

hue-id-w=50057
jobTracker=rm.ambari:8050
mapreduce.job.user.name=admin
nameNode=hdfs://nn1.ambari:8020
oozie.use.system.libpath=True
oozie.wf.application.path=hdfs://nn1.ambari:8020/user/LJK/shell
user.name=admin

Hive Action

Directory layout

hiveExample/
├── hive-site.xml
├── input
│   └── inputdata
├── job.properties
├── output
├── script.q
└── workflow.xml

Hive script: write a Hive script (the file name is up to you).
Contents of script.q:

DROP TABLE IF EXISTS test;
CREATE EXTERNAL TABLE test (a INT) STORED AS TEXTFILE LOCATION '${INPUT}';
INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM test;

workflow.xml

<workflow-app xmlns="uri:oozie:workflow:0.5" name="hive-example-wf">
    <start to="hive-node"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="hive-node">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- clear the output directory so reruns do not fail -->
            <prepare>
                <delete path="${nameNode}/user/LJK/hiveExample/output"/>
            </prepare>
            <job-xml>/user/LJK/hiveExample/hive-site.xml</job-xml>
            <script>script.q</script>
            <param>INPUT=/user/LJK/hiveExample/input</param>
            <param>OUTPUT=/user/LJK/hiveExample/output</param>
        </hive>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>

job.properties

hue-id-w=50059
jobTracker=rm.ambari:8050
mapreduce.job.user.name=admin
nameNode=hdfs://nn1.ambari:8020
oozie.use.system.libpath=True
oozie.wf.application.path=hdfs://nn1.ambari:8020/user/LJK/hiveExample
user.name=admin

Place a file (any name will do) under hdfs://nn1.ambari:8020/user/LJK/hiveExample/input.
Contents of inputdata:

1
2
3
4
6
7
8
9

After the job succeeds, the output directory contains a file 000000_0 whose contents match inputdata.

Hive2 Action

Almost identical to the Hive action; only workflow.xml needs to change:

<workflow-app xmlns="uri:oozie:workflow:0.5" name="hive2-example-wf">
    <start to="hive2-node"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="hive2-node">
        <hive2 xmlns="uri:oozie:hive2-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- clear the output directory so reruns do not fail -->
            <prepare>
                <delete path="${nameNode}/user/LJK/hiveExample/output"/>
            </prepare>
            <job-xml>/user/LJK/hiveExample/hive-site.xml</job-xml>
            <jdbc-url>jdbc:hive2://rm.ambari:10000/default</jdbc-url>
            <script>script.q</script>
            <param>INPUT=/user/LJK/hiveExample/input</param>
            <param>OUTPUT=/user/LJK/hiveExample/output</param>
        </hive2>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
