Airflow Celery Executor: Architecture and Task Execution Process

Data Engineering Pipelines play a vital role in managing the flow of business data for companies. Organizations spend a significant amount of money developing and managing Data Pipelines so that they can work with streamlined data in their daily operations. Many tools on the market help companies manage messaging and task distribution for scalability.

Apache Airflow is a workflow management platform that helps companies orchestrate their Data Pipeline tasks and save time. As data volume and the number of backend tasks grow, the need to scale out and make resources available to all tasks becomes a necessity. The Airflow Celery Executor makes it easier for developers to build a scalable application by distributing tasks across multiple machines. With Airflow Celery, companies can dispatch many tasks for execution without any lag.

The Apache Airflow Celery Executor is easy to use, since tasks are scheduled, monitored, and executed through Python scripts and DAGs. In this article, you will learn about the Airflow Celery Executor and how it helps in scaling out and distributing tasks. You will also read about its architecture and how it executes tasks.

Introduction to Apache Airflow

Apache Airflow is an open-source workflow management platform for programmatically authoring, scheduling, and monitoring workflows. In Airflow, workflows are DAGs (Directed Acyclic Graphs) of tasks and are widely used to schedule and manage Data Pipelines. Airflow is written in Python, and workflows are created as Python scripts, so developers can make use of the full Python library ecosystem when creating workflows.

The Airflow workflow engine executes complex Data Pipeline jobs with ease, ensures that each task executes on time and in the correct order, and makes sure every task gets the resources it requires. Airflow can run DAGs on a defined schedule or in response to external event triggers.
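As a minimal illustration, the sketch below shows what such a DAG script might look like; the DAG name, schedule, and bash commands are hypothetical placeholders:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# A hypothetical two-task pipeline: "extract" must finish before "load" starts.
with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # run once per day
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")
    extract >> load  # declare the execution order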

Key Features of Apache Airflow

Some of the main features of Apache Airflow are listed below.

  • Open Source: Airflow is free to use and has many active users and an engaged community. Its resources are readily available on the web.

  • Robust Integrations: Airflow comes with ready-to-use operators and many plug-and-play integrations that let users run tasks on Google Cloud Platform, Amazon AWS, Microsoft Azure, etc.

  • Easy to Use: Airflow is written in Python, which means users can create workflows with Python scripts and import libraries that make the job easier.

  • Interactive UI: Airflow also offers a web application that allows users to monitor the real-time status of running tasks and DAGs, and to schedule and manage their workflows with ease.


Introduction to the Airflow Celery Executor

Celery is a task queue that helps users scale and integrate with other languages. It comes with the tools and support you need to run such a system in production.

Executors in Airflow are the mechanism by which task instances get run. Airflow comes with various executors, but the most widely used is the Celery Executor, which scales out by distributing the workload to multiple Celery workers that can run on different machines.

CeleryExecutor works with a pool of workers to which it distributes tasks with the help of messages. The scheduler adds a message to the queue, and the Celery broker (such as RabbitMQ or Redis) delivers that message to a Celery worker, which executes the task. If the worker assigned to a job goes down due to any failure, Celery quickly adapts and assigns the task to another worker.


Airflow Celery Executor Setup

To set up the Airflow Celery Executor, you first need to set up a Celery backend using a message broker service such as RabbitMQ or Redis. After that, change the airflow.cfg file to point the executor parameter to CeleryExecutor and enter all the required configuration for it.
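As a sketch, the relevant portion of airflow.cfg might look like the example below; the broker and result-backend URLs are placeholders that depend on your own Redis/RabbitMQ and metadata database setup:

[core]
executor = CeleryExecutor

[celery]
# Placeholder URLs - point these at your own broker and database
broker_url = redis://localhost:6379/0
result_backend = db+postgresql://airflow:airflow@localhost/airflow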

The prerequisites for setting up the Airflow Celery Executor are listed below:

  • Airflow is installed on the local machine and properly configured.

  • The Airflow configuration is homogeneous across the cluster.

  • Before executing any Operators, the workers have their dependencies met in that context, i.e., the required Python libraries are importable.

  • The workers have access to the DAGs directory – DAGS_FOLDER.

To start an Airflow Celery worker, use the following command:

airflow celery worker

Once the service has started, it is ready to receive tasks. As soon as a task is routed its way, the worker will start its job.

To stop the worker, use the following command:

airflow celery stop

You can also use Flower, a GUI web application that lets users monitor the workers. To install the flower Python library along with Airflow's Celery support, run the following command in the terminal:

pip install 'apache-airflow[celery]'

You can start the Flower web server with the command below.

airflow celery flower

Airflow Celery Executor Architecture

The architecture of the Airflow Celery Executor consists of several components, listed below:

  • Workers: Execute the tasks assigned to them by the Celery queue.

  • Scheduler: Responsible for adding the necessary tasks to the queue.

  • Database: Contains all the information related to the status of tasks, DAGs, Variables, connections, etc.

  • Web Server: An HTTP server that provides access to DAG and task status information.

  • Celery: The queue mechanism.

  • Broker: The component of the Celery queue that stores commands for execution.

  • Result Backend: Stores the status of all completed commands.

Now that you have seen the different components of the Airflow Celery Executor architecture, let's look at how these components communicate with each other during the execution of a task.

  • Web server –> Workers: Fetches task execution logs.

  • Web server –> DAG files: Reveals the DAG structure.

  • Web server –> Database: Fetches the status of tasks.

  • Workers –> DAG files: Reveal the DAG structure and execute the tasks.

  • Workers –> Database: Workers get and store information about connection configuration, Variables, and XCOM.

  • Workers –> Celery's result backend: Saves the status of tasks.

  • Workers –> Celery's broker: Stores commands for execution.

  • Scheduler –> DAG files: Reveals the DAG structure and schedules the tasks.

  • Scheduler –> Database: Stores information about DAG runs and related tasks.

  • Scheduler –> Celery's result backend: Gets information about the status of completed tasks.

  • Scheduler –> Celery's broker: Puts the commands to be executed in the queue.


Task Execution Process of Airflow Celery

In this section, you will learn how tasks are executed by the Airflow Celery Executor. When a task begins to execute, two main processes are already running:

  • SchedulerProcess: Processes the tasks and runs with the help of CeleryExecutor.

  • WorkerProcess: Observes the queue, waiting for new tasks to appear. It also has WorkerChildProcess instances that wait for new tasks.

Two databases are also involved in the process while executing the task:

  • ResultBackend

  • QueueBroker

Along with these, two other processes are created during execution:

  • RawTaskProcess: The process that runs the user's task code.

  • LocalTaskJobProcess: Its logic is described by LocalTaskJob. Its job is to monitor the status of RawTaskProcess. New processes are started from TaskRunner.

How tasks are executed using the processes above, and how data and instructions flow between them, is described below:

  • SchedulerProcess processes the tasks and, when it finds a task that needs to be executed, sends it to the QueueBroker, adding the task to the queue.

  • Simultaneously, SchedulerProcess begins periodically querying the ResultBackend for the status of the task. The QueueBroker, once it becomes aware of the task, sends information about it to one WorkerProcess.

  • When WorkerProcess receives the information, it assigns the task to one WorkerChildProcess.

  • After receiving a task, WorkerChildProcess executes its handling function, execute_command(), which creates the new process LocalTaskJobProcess (a minimal sketch of this pattern follows this list).

  • The logic of LocalTaskJobProcess is described by LocalTaskJob. It starts the new process (RawTaskProcess) using TaskRunner.

  • Both RawTaskProcess and LocalTaskJobProcess are stopped when they finish their work.

  • WorkerChildProcess then notifies the main process, WorkerProcess, of the completion of the task and of its availability for subsequent tasks.

  • WorkerProcess then saves the status information back into the ResultBackend.

  • Now, whenever SchedulerProcess asks the ResultBackend for the status, it receives the status information of the task.
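For intuition, execute_command() follows the standard Celery worker pattern sketched below. This is an illustrative simplification under assumed placeholder URLs, not Airflow's actual source code:

from celery import Celery
import subprocess

# Placeholder broker/backend URLs; in a real deployment Airflow reads these
# from airflow.cfg.
app = Celery(
    "airflow_sketch",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/0",
)

@app.task
def execute_command(command):
    # Worker-side handler: run the received task command (conceptually an
    # "airflow tasks run ..." invocation) in a child process; this is where
    # LocalTaskJobProcess and, in turn, RawTaskProcess come to life.
    subprocess.check_call(command, close_fds=True)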

Conclusion

In this article, you learned about Apache Airflow and the Airflow Celery Executor, and how it helps companies achieve scalability. You also read about the architecture of the Airflow Celery Executor and its task execution process. Airflow Celery uses message brokers to receive tasks and distributes them across multiple machines for higher performance.
