Azkaban Task Scheduling

Overview

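Azkaban is a batch workflow job scheduler created at LinkedIn. It runs a set of jobs and processes in a specified order within a workflow. Jobs are configured with simple key-value pairs, and the dependencies property declares the dependencies between them. The dependency graph must be acyclic; otherwise the workflow is considered invalid. Azkaban uses job configuration files to define the relationships between tasks and provides an easy-to-use web UI for maintaining and tracking workflows. Azkaban was designed with usability as a primary goal. It has been running at LinkedIn for several years and drives many of its Hadoop and data warehouse processes.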

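The better-known alternative is probably Apache Oozie, but configuring a workflow there means writing a large amount of XML, and its code base is complex enough to make custom development difficult. Another widely used scheduler is Airflow, but it is written in Python.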

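Reasons to choose Azkaban:

  • A clear, easy-to-use web UI
  • Job configuration files that quickly define tasks and the dependencies between them
  • A modular, pluggable plugin mechanism with native support for command, Java, Hive, Pig, and Hadoop job types
  • Written in Java with a clean code structure, which makes custom development easy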

Use Cases

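Real projects often look like this: a large daily job can be split into four smaller tasks A, B, C, and D. A and B do not depend on each other, C depends on the results of A and B, and D depends on the result of C. The usual approach is to open two terminals and run A and B at the same time, run C once both have finished, and finally run D. The whole process needs a person in the loop who watches each task's progress. Yet many of our jobs run in the middle of the night, driven by scripts scheduled with crontab. The process is in fact a directed acyclic graph (DAG): each subtask is a node of the larger job, execution starts from the nodes with no incoming edges, and any nodes with no path between them, such as A and B above, can run concurrently. In short, what we need is a workflow scheduler, and Azkaban is a scheduler that solves exactly this problem.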
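
The A, B, C, D scenario can be sketched with the standard tsort utility, which orders the nodes of a directed graph and fails when the graph contains a cycle. This is only an illustration of the DAG idea, not how Azkaban itself resolves dependencies; the edge list and the function name execution_order are made up for the example.

```shell
#!/usr/bin/env bash
# Dependency edges written as "prerequisite dependent" pairs:
# A and B feed C, and C feeds D.
edges='A C
B C
C D'

# Print one valid execution order for the given edges.
# tsort exits with a non-zero status when the graph contains a cycle.
execution_order() {
  printf '%s\n' "$1" | tsort
}

execution_order "$edges"

# A cyclic graph (C needs D and D needs C) is rejected.
if ! execution_order 'C D
D C' >/dev/null 2>&1; then
  echo "cycle detected: not a valid workflow"
fi
```

A and B may appear in either order in the output because nothing connects them, which is exactly why a scheduler is free to run them concurrently.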

Architecture

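Azkaban was implemented at LinkedIn to solve the problem of Hadoop job dependencies: jobs needed to run in order, from ETL jobs to data analytics products. It started as a single-server solution and, as the number of Hadoop users grew over the years, evolved into a more robust one. Azkaban has three components: a relational database (MySQL), AzkabanWebServer, and AzkabanExecutorServer. Azkaban stores its service state in MySQL, and both AzkabanWebServer and AzkabanExecutorServer need access to the MySQL database.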
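AzkabanWebServer is the main manager of Azkaban. It handles project management, authentication, scheduling, and execution monitoring, and it also serves the web user interface. Using Azkaban is easy: it uses *.job key-value property files to define the individual tasks in a workflow and the dependencies property to define the dependency chain between jobs. These job files and the associated code can be archived as a *.zip file and uploaded through the web server, either from the Azkaban UI or with curl.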
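
As a rough sketch of the curl route, the two helper functions below log in through the web server and upload a project archive, following the request shapes of Azkaban's Ajax API (action=login, then ajax=upload against /manager). The function names, the AZKABAN_URL default, and the project and file names in the usage note are assumptions for illustration; the target project is assumed to exist already.

```shell
#!/usr/bin/env bash
# Base URL of the web server; assumed value, adjust to your deployment.
AZKABAN_URL="${AZKABAN_URL:-http://CentOS:8081}"

# Log in and print the session id extracted from the JSON response.
azkaban_login() {
  local user="$1" password="$2"
  curl -s -X POST \
    --data-urlencode "action=login" \
    --data-urlencode "username=${user}" \
    --data-urlencode "password=${password}" \
    "${AZKABAN_URL}" |
    sed -n 's/.*"session\.id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Upload a zip archive into an existing project.
azkaban_upload() {
  local session_id="$1" project="$2" zip_file="$3"
  curl -s -X POST \
    --form "session.id=${session_id}" \
    --form "ajax=upload" \
    --form "file=@${zip_file};type=application/zip" \
    --form "project=${project}" \
    "${AZKABAN_URL}/manager"
}
```

A typical call sequence would be sid=$(azkaban_login azkaban azkaban) followed by azkaban_upload "$sid" my_project my_project.zip, where my_project is a hypothetical project name.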

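Earlier versions of Azkaban (before version 3.0) ran AzkabanWebServer and AzkabanExecutorServer in a single server. Since then the executor has been split out into its own server. There are several reasons for the split: the number of executors can be scaled easily, and executions can be recovered after a failure. The separation also means that upgrading Azkaban has little impact on users.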

How does AzkabanWebServer use the DB?

  • Project Management - The projects, the permissions on the projects as well as the uploaded files.
  • Executing Flow State - Keep track of executing flows and which Executor is running them.
  • Previous Flow/Jobs - Search through previous executions of jobs and flows as well as access their log files.
  • Scheduler - Keeps the state of the scheduled jobs.
  • SLA - Keeps all the SLA rules.

How does AzkabanExecutorServer use the DB?

  • Access the project - Retrieves project files from the db.
  • Executing Flows/Jobs - Retrieves and updates data for flows and jobs that are executing.
  • Logs - Stores the output logs for jobs and flows into the db.
  • Interflow dependency - If a flow is running on a different executor, it will take state from the DB.

Building

[root@CentOS ~]# yum install git
[root@CentOS ~]# git clone https://github.com/azkaban/azkaban.git
[root@CentOS ~]# cd azkaban/
[root@CentOS azkaban]# ./gradlew build installDist
...a long wait...
Starting a Gradle Daemon, 1 incompatible and 1 stopped Daemons could not be reused, use --status for details
Parallel execution with configuration on demand is an incubating feature.

> Task :azkaban-web-server:npm_install 
added 39 packages in 0.901s

> Task :azkaban-web-server:jsTest 


  addClass
    ✓ should add class into element
    ✓ should not add a class which already exists in element

  CronTransformation
    ✓ should transfer correctly

  ValidateQuartzStr
    ✓ validate Quartz String corretly

  momentJSTest
    ✓ momentJSTest
    ✓ momentTimezoneTest


  6 passing (11ms)



BUILD SUCCESSFUL in 9s
114 actionable tasks: 8 executed, 106 up-to-date

Installation

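Version 3.0 provides three modes: the standalone "solo-server" mode, the heavier "two server" mode, and the "multiple-executor" mode. Solo server mode uses an embedded H2 database, and the web server and executor server run in the same process; it suits testing or small-scale scheduling. Two server mode is meant for production: the backing database is MySQL, and the web server and executor server should be deployed on different hosts. Multiple executor mode is also typically used in production: the backing database is MySQL, and the web server and the executor servers should be deployed on different hosts.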

solo server mode

1. After the build, the package azkaban-solo-server-*.tar.gz is available in the build output of the Azkaban source tree. Extract it to /usr.

[root@CentOS azkaban]# tree azkaban-solo-server/build/distributions
azkaban-solo-server/build/distributions
├── azkaban-solo-server-3.81.0-1-g304593d.tar.gz
└── azkaban-solo-server-3.81.0-1-g304593d.zip
[root@CentOS azkaban]# tar -zxf azkaban-solo-server/build/distributions/azkaban-solo-server-3.81.0-1-g304593d.tar.gz -C /usr/
[root@CentOS azkaban]# cd /usr/
[root@CentOS usr]# mv azkaban-solo-server-3.81.0-1-g304593d azkaban-solo-server

[root@CentOS azkaban-solo-server]# ls -l
total 24
drwxr-xr-x. 3 root root 4096 Dec 17 16:11 bin        # startup scripts
drwxr-xr-x. 2 root root 4096 Dec 17 16:11 conf       # configuration directory
drwxr-xr-x. 2 root root 4096 Dec 17 16:11 lib        # jar dependencies needed at runtime
drwxr-xr-x. 3 root root 4096 Dec 17 16:11 plugins    # plugin installation directory
drwxr-xr-x. 2 root root 4096 Dec 17 16:11 sql        # SQL scripts
drwxr-xr-x. 6 root root 4096 Dec 17 16:11 web        # web server resources
[root@CentOS azkaban-solo-server]# tree conf/
conf/
├── azkaban.properties
├── azkaban-users.xml
└── global.properties

0 directories, 3 files
[root@CentOS azkaban-solo-server]# tree plugins/
plugins/
└── jobtypes
    └── commonprivate.properties

1 directory, 1 file

2. Edit the azkaban.properties configuration file

# Set the time zone
default.timezone.id=Asia/Shanghai
# Address of the Azkaban web server
azkaban.webserver.url=http://CentOS:8081
# Email settings
[email protected]
mail.host=smtp.qq.com
[email protected]
mail.password=nvwoyoudipkjgdee
# Recipients of job success and failure notification emails
[email protected]
[email protected]
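
One pitfall when annotating this file: Java .properties files treat # as a comment marker only at the start of a line, so text after a value on the same line becomes part of the value. The helper below is a small heuristic sketch (the function name find_inline_comments is made up) that lists lines of the form key=value #comment; values that legitimately contain " #" would be flagged as well.

```shell
#!/usr/bin/env bash
# List "key=value #comment" lines, where the trailing text would be read
# as part of the value. Prints matching lines prefixed with their line number
# and returns non-zero when nothing suspicious is found.
find_inline_comments() {
  grep -nE '^[[:space:]]*[^#![:space:]][^=]*=.*[[:space:]]#' "$1"
}

# Dry run against a throwaway sample file.
sample="$(mktemp)"
printf '%s\n' \
  '# time zone' \
  'default.timezone.id=Asia/Shanghai #time zone' \
  'mail.host=smtp.qq.com' > "$sample"
find_inline_comments "$sample"
rm -f "$sample"
```

Running it against conf/azkaban.properties before starting the server is a cheap way to catch a time zone or URL that silently picked up a trailing comment.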

3. Edit the commonprivate.properties configuration file

memCheck.enabled=false

4. Start the solo server and open CentOS:8081 in a browser

[root@CentOS azkaban-solo-server]# ./bin/start-solo.sh # start
[root@CentOS azkaban-solo-server]# jps
5638 AzkabanSingleServer
5679 Jps

5. Log in with username azkaban and password azkaban


The login accounts are stored in the azkaban-users.xml configuration file.
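
To see which accounts a given azkaban-users.xml defines, a one-line sed filter is usually enough. The sketch below runs it against a sample file; the sample entries are assumed to mirror the defaults shipped with Azkaban and may differ in your version, and the function name list_users is made up for the example.

```shell
#!/usr/bin/env bash
# Print the username attribute of every <user> element in an azkaban-users.xml file.
list_users() {
  sed -n 's/.*<user[^>]*username="\([^"]*\)".*/\1/p' "$1"
}

# Sample file for a dry run (assumed contents, see the note above).
sample="$(mktemp)"
cat > "$sample" <<'EOF'
<azkaban-users>
  <user groups="azkaban" password="azkaban" roles="admin" username="azkaban"/>
  <user password="metrics" roles="metrics" username="metrics"/>
  <role name="admin" permissions="ADMIN"/>
  <role name="metrics" permissions="METRICS"/>
</azkaban-users>
EOF

list_users "$sample"
```

Because passwords sit in this file in plain text, it is worth restricting its file permissions on a shared host.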

6. Stop the Azkaban service

[root@CentOS azkaban-solo-server]# ./bin/shutdown-solo.sh 
Killing solo-server. [pid: 5638], attempt: 1
shutdown succeeded

two server mode

  • Install MySQL and enable remote access to MySQL
[root@CentOS azkaban]# tree azkaban-db/build/sql/create-all-sql-3.81.0-1-g304593d.sql 
azkaban-db/build/sql/create-all-sql-3.81.0-1-g304593d.sql [error opening dir]

0 directories, 0 files
[root@CentOS azkaban]# mysql -uroot -proot
mysql> CREATE DATABASE IF NOT EXISTS azkaban;
mysql> USE azkaban;
mysql> source /root/azkaban/azkaban-db/build/sql/create-all-sql-3.81.0-1-g304593d.sql
mysql> show tables;
+-----------------------------+
| Tables_in_azkaban           |
+-----------------------------+
| QRTZ_BLOB_TRIGGERS          |
| QRTZ_CALENDARS              |
| QRTZ_CRON_TRIGGERS          |
| QRTZ_FIRED_TRIGGERS         |
| QRTZ_JOB_DETAILS            |
| QRTZ_LOCKS                  |
| QRTZ_PAUSED_TRIGGER_GRPS    |
| QRTZ_SCHEDULER_STATE        |
| QRTZ_SIMPLE_TRIGGERS        |
| QRTZ_SIMPROP_TRIGGERS       |
| QRTZ_TRIGGERS               |
| active_executing_flows      |
| active_sla                  |
| execution_dependencies      |
| execution_flows             |
| execution_jobs              |
| execution_logs              |
| executor_events             |
| executors                   |
| project_events              |
| project_files               |
| project_flow_files          |
| project_flows               |
| project_permissions         |
| project_properties          |
| project_versions            |
| projects                    |
| properties                  |
| ramp                        |
| ramp_dependency             |
| ramp_exceptional_flow_items |
| ramp_exceptional_job_items  |
| ramp_items                  |
| triggers                    |
+-----------------------------+
34 rows in set (0.00 sec)
[root@CentOS azkaban-web-server]# cat /etc/my.cnf 
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
user=mysql
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0

max_allowed_packet=1024M
[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
[root@CentOS azkaban-web-server]# service mysqld restart
Stopping mysqld:                                           [  OK  ]
Starting mysqld:                                           [  OK  ]
  • Install azkaban-exec-server

1. Extract azkaban-exec-server-*.tar.gz to /usr

[root@CentOS azkaban]# tree azkaban-exec-server/build/distributions/
azkaban-exec-server/build/distributions/
├── azkaban-exec-server-3.81.0-1-g304593d.tar.gz
└── azkaban-exec-server-3.81.0-1-g304593d.zip
[root@CentOS azkaban]# tar -zxf azkaban-exec-server/build/distributions/azkaban-exec-server-3.81.0-1-g304593d.tar.gz -C /usr/
[root@CentOS azkaban]# cd /usr/
[root@CentOS usr]# mv azkaban-exec-server-3.81.0-1-g304593d azkaban-exec-server
[root@CentOS azkaban-exec-server]# tree conf/ bin/ plugins/
conf/
├── azkaban.properties
├── global.properties
└── log4j.properties
bin/
├── internal
│   ├── internal-start-executor.sh
│   └── util.sh
├── shutdown-exec.sh
└── start-exec.sh
plugins/
└── jobtypes
    └── commonprivate.properties

2 directories, 8 files

2. Edit the azkaban.properties configuration file

default.timezone.id=Asia/Shanghai
azkaban.webserver.url=http://CentOS:8081
# Email settings
[email protected]
mail.host=smtp.qq.com
[email protected]
mail.password=nvwoyoudipkjgdee

[email protected]
[email protected]

database.type=mysql
mysql.port=3306
mysql.host=CentOS
mysql.database=azkaban
mysql.user=root
mysql.password=root
mysql.numconnections=100

3. Edit the commonprivate.properties configuration file

memCheck.enabled=false

4. Start the azkaban-exec-server

[root@CentOS azkaban-exec-server]# ./bin/start-exec.sh 
[root@CentOS azkaban-exec-server]# jps
6545 Jps
6511 AzkabanExecutorServer

5. Activate the current azkaban-exec-server

[root@CentOS azkaban-exec-server]# curl -G "localhost:$(<./executor.port)/executor?action=activate" && echo
{"status":"success"}
  • Install azkaban-web-server

1. After the build, the package azkaban-web-server-*.tar.gz is available in the build output of the Azkaban source tree. Extract it to /usr.

[root@CentOS azkaban]# tree azkaban-web-server/build/distributions/
azkaban-web-server/build/distributions/
├── azkaban-web-server-3.81.0-1-g304593d.tar.gz
└── azkaban-web-server-3.81.0-1-g304593d.zip
[root@CentOS azkaban]# tar -zxf azkaban-web-server/build/distributions/azkaban-web-server-3.81.0-1-g304593d.tar.gz -C /usr/
[root@CentOS usr]# mv azkaban-web-server-3.81.0-1-g304593d/ azkaban-web-server
[root@CentOS usr]# cd azkaban-web-server/
[root@CentOS azkaban-web-server]# tree  bin conf
bin
├── internal
│   ├── internal-start-web.sh
│   └── util.sh
├── shutdown-web.sh
└── start-web.sh
conf
├── azkaban.properties
├── azkaban-users.xml
├── global.properties
└── log4j.properties

2. Edit the azkaban.properties configuration file

# Set the time zone
default.timezone.id=Asia/Shanghai
# Address of the Azkaban web server
azkaban.webserver.url=http://CentOS:8081
# Email settings
[email protected]
mail.host=smtp.qq.com
[email protected]
mail.password=nvwoyoudipkjgdee
# Recipients of job success and failure notification emails
[email protected]
[email protected]
# MySQL settings
database.type=mysql
mysql.port=3306
mysql.host=CentOS
mysql.database=azkaban
mysql.user=root
mysql.password=root
mysql.numconnections=100

azkaban.executorselector.filters=StaticRemainingFlowSize,CpuStatus

3. Start azkaban-web-server

[root@CentOS azkaban-web-server]# ./bin/start-web.sh 
[root@CentOS azkaban-web-server]# jps
6598 Jps
6570 AzkabanWebServer
6511 AzkabanExecutorServer

4. Log in with username azkaban and password azkaban

The login accounts are stored in the azkaban-users.xml configuration file.

multiple executor mode

The setup is similar to two server mode; the only difference is that azkaban-exec-server is deployed on several hosts, so the steps are omitted here.

Flow 2.0 Basics

Basic Flow

  • flow20.project
azkaban-flow-version: 2.0
  • basic.flow
nodes:
  - name: basic
    type: command
    config:
      command: echo "This is an echoed text."

Package the files above into a .zip archive.
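
The packaging step can be scripted. The sketch below writes the two files of the basic flow into a directory and zips them so that both sit at the top level of the archive. The directory name basic-flow and the function names are made up for the example, and the zip tool is assumed to be installed.

```shell
#!/usr/bin/env bash
# Write flow20.project and basic.flow into a directory (default: ./basic-flow).
write_basic_flow() {
  local dir="${1:-basic-flow}"
  mkdir -p "$dir"
  printf 'azkaban-flow-version: 2.0\n' > "$dir/flow20.project"
  cat > "$dir/basic.flow" <<'EOF'
nodes:
  - name: basic
    type: command
    config:
      command: echo "This is an echoed text."
EOF
}

# Create <dir>.zip next to the directory, with the files at the archive root.
package_flow() {
  local dir="${1:-basic-flow}"
  command -v zip >/dev/null 2>&1 || { echo "zip not found" >&2; return 1; }
  (cd "$dir" && zip -q -r "../$(basename "$dir").zip" .)
}
```

Calling write_basic_flow followed by package_flow leaves basic-flow.zip in the current directory, ready to upload.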

Job Dependencies

  • flow20.project
azkaban-flow-version: 2.0
  • basic.flow
nodes:
  - name: logic_job
    type: noop
    # logic_job depends on jobA and jobB
    dependsOn:
      - jobA
      - jobB

  - name: jobA
    type: command
    config:
      command: echo "This is an echoed text."

  - name: jobB
    type: command
    config:
      command: pwd

Package the files above into a .zip archive.
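
Since every name listed under dependsOn has to match a node defined in the flow, a quick local check before uploading can catch typos. The awk sketch below handles only the simple, non-embedded layout shown in this section (one "- name:" line per node, dependsOn items on their own lines); it is not a YAML parser, and the function name check_depends_on is made up for the example.

```shell
#!/usr/bin/env bash
# Report dependsOn entries that do not match any node name in a .flow file.
# Returns non-zero when at least one undefined dependency is found.
check_depends_on() {
  awk '
    # "- name: X" starts a node definition.
    /^[ \t]*-[ \t]*name:[ \t]*/ {
      sub(/^[ \t]*-[ \t]*name:[ \t]*/, "")
      defined[$0] = 1
      in_deps = 0
      next
    }
    # "dependsOn:" opens a list of dependency names.
    /^[ \t]*dependsOn:[ \t]*$/ { in_deps = 1; next }
    # "- X" inside that list is a dependency name.
    in_deps && /^[ \t]*-[ \t]*[^:]+$/ {
      sub(/^[ \t]*-[ \t]*/, "")
      wanted[$0] = 1
      next
    }
    # Any other key closes the list; blank lines and comments do not.
    /^[ \t]*[^# \t-]/ { in_deps = 0 }
    END {
      for (d in wanted)
        if (!(d in defined)) { print "undefined dependency: " d; bad = 1 }
      exit bad
    }
  ' "$1"
}

# Usage: check_depends_on basic.flow
```

The check is deliberately narrow: embedded flows have their own node scopes, so a file that uses them needs a real YAML-aware tool.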

Flow Config

---
config:
  user.to.proxy: root
  failure.emails: [email protected]
  success.emails: [email protected]
  notify.emails: [email protected]
nodes:
  - name: jobC
    type: noop
    # jobC depends on jobA and jobB
    dependsOn:
      - jobA
      - jobB

  - name: jobA
    type: command
    config:
      command: echo "This is an echoed text."

  - name: jobB
    type: command
    config:
      command: pwd

Embedded Flows

---
config:
  user.to.proxy: root
  notify.emails: [email protected]
nodes:
  - name: logic_job
    type: noop
    dependsOn:
      - embedded_flow
      - jobD


  - name: embedded_flow
    type: flow
    config:
      name: jiangzz
    nodes:
      - name: jobA
        type: noop
        dependsOn:
          - jobB
          - jobC

      - name: jobB
        type: command
        config:
          command: pwd

      - name: jobC
        type: command
        config:
          command: echo ${name}

  - name: jobD
    type: command
    config:
      command: echo "this is jobD"

Package the files above into a .zip archive.

Java Code

  • flow20.project
azkaban-flow-version: 2.0
  • basic.flow
---
config:
  user.to.proxy: root
  failure.emails: [email protected]
  success.emails: [email protected]
nodes:
  - name: javaJob
    type: javaprocess
    config:
      classpath: ./flow2.0/libs/*
      java.class: com.baizhi.MyJavaJob
  • MyJavaJob.java
package com.baizhi;

public class MyJavaJob {
    public static void main(String[] args) {
        System.out.println("#################################");
        System.out.println("####  MyJavaJob class exec... ###");
        System.out.println("#################################");
    }
}

Passing Parameters

---
config:
  user.to.proxy: root
  failure.emails: [email protected]
  success.emails: [email protected]
  notify.emails: [email protected]
nodes:

  - name: calculate_time
    type: command
    config:
      name: wangwu
      command: echo "==========start=========="
      command.1: sh ./flow2.0/bin/generatetime.sh 
      command.2: echo "==========end=========="


  - name: execut_job
    type: command
    dependsOn:
      - calculate_time
    config:
      command: ls -l
      command.1: echo ${start}' ~ '${end}' path:'${path}

bin/generatetime.sh

#!/usr/bin/env bash
start=$(date '+%Y-%m-%d %H:%M:%S')
end=$(date -d +7day '+%Y-%m-%d %H:%M:%S')
path=$(pwd)/flow2.0/libs
echo '{"start":"'$start'","end":"'$end'","path":"'$path'"}' > $JOB_OUTPUT_PROP_FILE
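
To try this kind of script outside Azkaban, where JOB_OUTPUT_PROP_FILE is not set, one option is to point the variable at a temporary file and inspect what gets written. The sketch below wraps the same logic in a function (the name write_output_props is made up) and builds the JSON with printf; like the script above, it assumes GNU date for the -d option and values that contain no double quotes.

```shell
#!/usr/bin/env bash
# Write start/end/path as one JSON object to the file named by JOB_OUTPUT_PROP_FILE.
# Assumes GNU date (for -d) and values without double quotes.
write_output_props() {
  local start end path
  start=$(date '+%Y-%m-%d %H:%M:%S')
  end=$(date -d '+7 day' '+%Y-%m-%d %H:%M:%S')
  path="$(pwd)/flow2.0/libs"
  printf '{"start":"%s","end":"%s","path":"%s"}\n' \
    "$start" "$end" "$path" > "$JOB_OUTPUT_PROP_FILE"
}

# Dry run: Azkaban normally sets this variable, so use a temporary file here.
JOB_OUTPUT_PROP_FILE="$(mktemp)"
export JOB_OUTPUT_PROP_FILE
write_output_props
cat "$JOB_OUTPUT_PROP_FILE"
```

The keys written here (start, end, path) are the names that downstream jobs reference as ${start}, ${end}, and ${path}.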

Conditional Workflow

---
config:
  user.to.proxy: root
  notify.emails: [email protected]
nodes:
 - name: JobA
   type: command
   config:
     command: sh ./flow2.0/bin/exec.sh

 - name: JobB
   type: command
   dependsOn:
     - JobA
   config:
     command: echo "This is JobB."
   condition: ${JobA:param1} == 1

 - name: JobC
   type: command
   dependsOn:
     - JobA
   config:
     command: echo "This is JobC."
   condition: ${JobA:param1} == 2

 - name: JobD
   type: command
   dependsOn:
     - JobB
     - JobC
   config:
     command: pwd
   condition: one_success

bin/exec.sh

#!/usr/bin/env bash 

echo '{"param1": "1"}' > $JOB_OUTPUT_PROP_FILE

Reference: https://azkaban.readthedocs.io/en/latest/conditionalFlow.html

Kafka Event Trigger Plugin

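Currently, Azkaban supports launching flows by scheduling them or through the Ajax API. Both are limited, however, because jobs sometimes need to run automatically on demand. Event triggering is a new feature introduced in Azkaban. It defines a new paradigm for triggering flows: a flow is triggered when Kafka events arrive. With it, users define the events a flow depends on, and once all of those events are ready, the workflow is triggered.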

1. Edit azkaban-web-server/conf/azkaban.properties and add the following properties

azkaban.dependency.plugin.dir=plugins/dependency
azkaban.server.schedule.enable_quartz=true

2. Create azkaban.private.properties and the plugins directory

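[root@CentOS azkaban-web-server]# mkdir -p plugins/dependency/kafka
[root@CentOS azkaban-web-server]# touch plugins/dependency/kafka/dependency.properties
[root@CentOS azkaban-web-server]# touch conf/azkaban.private.properties
[root@CentOS azkaban-web-server]# tree plugins/
plugins/
└── dependency
    └── kafka
        └── dependency.properties

2 directories, 1 file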

[root@CentOS azkaban-web-server]# tree conf/
conf/
├── azkaban.private.properties
├── azkaban.properties
├── azkaban-users.xml
├── global.properties
└── log4j.properties

0 directories, 5 files

[root@CentOS ~]# cd azkaban
[root@CentOS azkaban]# cp az-flow-trigger-dependency-type/kafka-event-trigger/build/libs/kafka-event-trigger-3.81.0-1-g304593d-fat.jar /usr/azkaban-web-server/plugins/dependency/kafka/
[root@CentOS azkaban]# cd /usr/azkaban-web-server/
[root@CentOS azkaban-web-server]# tree plugins/ # this directory was created manually
plugins/
└── dependency
    └── kafka
        ├── dependency.properties
        └── kafka-event-trigger-3.81.0-1-g304593d-fat.jar

2 directories, 2 files

3. Edit dependency.properties

dependency.classpath=/usr/azkaban-web-server/plugins/dependency/kafka/kafka-event-trigger-3.81.0-1-g304593d-fat.jar
dependency.class=trigger.kafka.KafkaDependencyCheck
kafka.broker.url=CentOS:9092

4. Edit azkaban.private.properties

org.quartz.dataSource.quartzDS.user=root
org.quartz.dataSource.quartzDS.password=root
org.quartz.dataSource.quartzDS.driver = com.mysql.jdbc.Driver
org.quartz.dataSource.quartzDS.URL = jdbc:mysql://CentOS:3306/azkaban
org.quartz.threadPool.threadCount = 3
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.tablePrefix=QRTZ_
org.quartz.jobStore.dataSource=quartzDS
