Airflow has gradually become one of the most popular task-scheduling frameworks; being written in Python, it is also more flexible and configurable than Azkaban.
Airflow official site
from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    'owner': 'Airflow',
    'depends_on_past': False,
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG(
    'tutorial',
    default_args=default_args,
    start_date=datetime(2015, 12, 1),
    description='A simple tutorial DAG',
    schedule_interval='@daily',
    catchup=False)
Airflow ships with a variety of built-in operators, for example:
from airflow.operators.bash_operator import BashOperator

also_run_this = BashOperator(
    task_id='also_run_this',
    bash_command='echo "run_id={{ run_id }} | dag_run={{ dag_run }}"',
    dag=dag,
)
from airflow.contrib.operators.dingding_operator import DingdingOperator

text_msg_remind_none = DingdingOperator(
    task_id='text_msg_remind_none',
    dingding_conn_id='dingding_default',
    message_type='text',
    message='Airflow dingding text message remind none',
    at_mobiles=None,
    at_all=False,
    dag=dag,  # attach to the DAG defined above
)
from pprint import pprint

from airflow.operators.python_operator import PythonOperator

def print_context(ds, **kwargs):
    pprint(kwargs)
    print(ds)
    return 'Whatever you return gets printed in the logs'

# Use the op_args and op_kwargs arguments to pass additional arguments to the Python callable.
run_this = PythonOperator(
    task_id='print_the_context',
    provide_context=True,
    python_callable=print_context,
    op_kwargs={'random_base': 0.1},  # in the official tutorial this is float(i) / 10 inside a loop
    dag=dag,
)
As in the example, python_callable takes the Python callable itself, and op_kwargs / op_args pass keyword and positional arguments to it respectively, which is very convenient (see the sketch below).
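To round this out, here is a minimal sketch of the positional-argument style; the multiply callable and the values passed through op_args are made up for illustration:

def multiply(a, b):
    # positional arguments arrive in the order listed in op_args
    return a * b

run_multiply = PythonOperator(
    task_id='print_multiply',
    python_callable=multiply,
    op_args=[3, 4],  # invoked as multiply(3, 4)
    dag=dag,
)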
from airflow.models.baseoperator import BaseOperator
from airflow.utils.decorators import apply_defaults

class HelloOperator(BaseOperator):

    @apply_defaults
    def __init__(
            self,
            name: str,
            *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.name = name

    def execute(self, context):
        message = "Hello {}".format(self.name)
        print(message)
        return message
The process is not difficult; it follows the same template, with the main logic implemented in the execute method, and the custom operator is invoked exactly like the built-in ones, as shown below.
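For instance, wiring the custom operator into the DAG above would look roughly like this (the task_id and the name value are arbitrary):

hello_task = HelloOperator(
    task_id='say_hello',
    name='Airflow',  # the custom constructor argument defined above
    dag=dag,
)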
Hooks are also worth mentioning here; the official docs explain them as follows:
Hooks act as an interface to communicate with the external shared resources in a DAG. For example, multiple tasks in a DAG can require access to a MySQL database. Instead of creating a connection per task, you can retrieve a connection from the hook and utilize it. Hook also helps to avoid storing connection auth parameters in a DAG. See Managing Connections for how to create and manage connections.
Roughly: when multiple tasks need to work against the same MySQL database, each task would normally have to create its own database connection; with hooks, the tasks can retrieve and share a connection instead, and hooks also keep authentication and other connection parameters out of the DAG.
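As a rough sketch of the idea (using the Airflow 1.10 import path; the connection id mysql_default, the table name and the query are assumptions), a task can pull a connection from MySqlHook instead of building one itself:

from airflow.hooks.mysql_hook import MySqlHook

def query_mysql(**kwargs):
    # the hook looks up host/auth details from the Connection stored in Airflow,
    # so nothing sensitive is hard-coded in the DAG file
    hook = MySqlHook(mysql_conn_id='mysql_default')
    records = hook.get_records('SELECT COUNT(*) FROM some_table')
    print(records)

query_task = PythonOperator(
    task_id='query_mysql',
    python_callable=query_mysql,
    provide_context=True,
    dag=dag,
)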
Another advantage of Airflow is its REST API support, installed as a plugin (download, deploy, restart the webserver); through the REST API you can pass parameters to trigger DAG runs.
wget https://github.com/teamclairvoyant/airflow-rest-api-plugin/archive/master.zip
Common airflow CLI commands:
airflow backfill -s START_DATE -e END_DATE dag_id
airflow clear dag_id -t task_regex -s START_DATE -e END_DATE
airflow test dag_id task_id EXECUTION_DATE
airflow run dag_id task_id EXECUTION_DATE
airflow trigger_dag dag_id -r RUN_ID -e EXECUTION_DATE
airflow dags trigger --conf '{"conf1": "value1"}' example_parametrized_dag
A newly created DAG is off by default and has to be switched on before it can be run or scheduled. Task instances can be in one of the following states: success, running, failed, upstream_failed, skipped, up_for_retry, up_for_reschedule, queued, none (no status), or scheduled. A DAG run itself is either success, running, or failed.
Clicking Trigger DAG takes you to the trigger page.
start_date = "{% if dag_run.conf.start_date %} {{dag_run.conf.start_date}} {% else %} {{ds}} {% endif %}"
end_date = "{% if dag_run.conf.end_date %} {{dag_run.conf.end_date}} {% else %} {{tomorrow_ds}} {% endif %}"
If parameters are passed in from outside, they take precedence; otherwise {{ ds }} and {{ tomorrow_ds }} shift the lagging schedule period forward by one day, so the previous day's data can be processed normally; see the sketch below.
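A minimal sketch of feeding those templated strings into a task (the etl.py script and its flags are made up); because bash_command is a templated field, the Jinja expressions are rendered per run, so a dag_run.conf value wins when it is supplied:

run_job = BashOperator(
    task_id='run_job',
    bash_command='python etl.py --start {} --end {}'.format(start_date, end_date),
    dag=dag,
)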
Passing parameters when triggering is another point that is easy to get confused about, mainly because of the parameter format. After repeated testing: clicking Trigger directly runs the DAG without parameters, and parameters can also be supplied as JSON, but the keys must match the keys used in the script. For triggering via the REST API, the plugin's demo page is the place to experiment with how parameters are passed; it generates the request link, which you can simply use as a reference.