How Spark executes a job on the cluster

In cluster mode, when a job is submitted for execution, it is sent to the driver (or master) node. The driver node creates a DAG for the job and decides which executor (or worker) nodes will run specific tasks.
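As a minimal sketch of this flow, consider the word-count job below (the application name and input path are placeholders): the transformations only record lineage, and it is the final action that makes the driver build the DAG, split it into stages, and schedule tasks on the executors.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // The driver process starts here; the master URL would normally
    // come from spark-submit rather than being hard-coded.
    val spark = SparkSession.builder()
      .appName("WordCount")
      .getOrCreate()
    val sc = spark.sparkContext

    // Transformations only build the lineage graph; no tasks run yet.
    val counts = sc.textFile("input.txt")   // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // The action triggers the driver to build the DAG, split it into
    // stages, and schedule the resulting tasks on the executors.
    counts.collect().foreach(println)

    spark.stop()
  }
}
```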

The driver then instructs the workers to execute their tasks and return the results to the driver when done. Before that happens, however, the driver prepares each task's closure: a set of variables and methods present on the driver that the worker needs to execute its task on the RDD.
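For example, in the sketch below (assuming the SparkContext `sc` from the previous example), the driver-side value `factor` is captured in the closure of the `map` transformation and serialized to every executor along with the task:

```scala
// factor lives on the driver; Spark serializes it into the task
// closure and ships a copy to each executor together with the task.
val factor = 3
val scaled = sc.parallelize(1 to 10).map(x => x * factor)
println(scaled.collect().mkString(", "))  // 3, 6, 9, ..., 30
```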

This set of variables and methods is inherently static within each executor's context; that is, each executor gets a copy of the variables and methods from the driver. If, while running the task, an executor alters these variables or overwrites the methods, it does so without affecting either the other executors' copies or the driver's own variables and methods. This can lead to unexpected behavior and runtime bugs that are sometimes very hard to track down.
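The classic illustration is a driver-side counter mutated inside an action (again assuming the SparkContext `sc` from above): each executor increments its own deserialized copy, so the driver's value never changes. When such cross-executor aggregation is actually needed, Spark provides accumulators for exactly this purpose.

```scala
// Each executor increments its own copy of counter from the
// deserialized closure; the driver's counter is never updated.
var counter = 0
sc.parallelize(1 to 100).foreach(x => counter += x)
println(counter)  // stays 0 on a cluster (local-mode behavior is undefined)

// An accumulator is the driver-visible way to aggregate such updates.
val acc = sc.longAccumulator("sum")
sc.parallelize(1 to 100).foreach(x => acc.add(x))
println(acc.value)  // 5050
```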
