[Spark Advanced] -- Revisiting Spark's High-Level Architecture

Almost the entire Spark ecosystem is built on top of Spark Core, so let's first look at the high-level architecture of Spark Core:

(Figure 1: High-level architecture of Spark Core)

Let's go through a few of the key concepts one by one.

1. Driver Programs
        A driver program is an application that uses Spark as a library. It provides the data processing code that Spark executes on the worker nodes. A driver program can launch one or more jobs on a Spark cluster.
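A minimal sketch of a driver program, written in Scala. The application name and the input path are purely illustrative:

```scala
import org.apache.spark.sql.SparkSession

// A driver program is an ordinary application that uses Spark as a library.
object WordCountDriver {
  def main(args: Array[String]): Unit = {
    // Connect to the cluster manager and acquire executors on the workers.
    val spark = SparkSession.builder()
      .appName("word-count-driver") // illustrative name
      .getOrCreate()

    // The processing code below is shipped to and executed on the executors.
    val lines = spark.sparkContext.textFile("hdfs:///data/input.txt") // hypothetical path
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Each action (here, collect) launches a job on the cluster.
    counts.collect().foreach(println)

    spark.stop()
  }
}
```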

2. Executors
       An executor is a JVM (Java Virtual Machine) process that Spark creates on each worker node for an application. It executes application code concurrently in multiple threads, and it can also cache data in memory or on disk.
       An executor has the same lifespan as the application for which it is created: when a Spark application terminates, all the executors created for it also terminate.
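A sketch of how executor resources are usually requested, as a fragment you might place in a driver program's setup code. The configuration values are illustrative, and the exact behaviour depends on the cluster manager:

```scala
import org.apache.spark.sql.SparkSession

// Request executor resources through standard Spark configuration keys.
val spark = SparkSession.builder()
  .appName("executor-config-demo")
  .config("spark.executor.instances", "4") // executor JVMs spread across the workers
  .config("spark.executor.cores", "2")     // task threads per executor
  .config("spark.executor.memory", "4g")   // heap shared by execution and caching
  .getOrCreate()

// Cached partitions live inside the executors and disappear when the
// application (and therefore its executors) terminates.
val df = spark.range(0, 1000000).cache()
df.count()
```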

3. Tasks
      A task is the smallest unit of work that Spark sends to an executor. It is executed by a thread in an executor on a worker node. Each task performs some computation to either return a result to the driver program or partition its output for a shuffle.
      Spark creates one task per data partition. An executor runs one or more tasks concurrently, and the degree of parallelism is determined by the number of partitions: more partitions mean more tasks processing data in parallel.
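A small illustration of the partition-to-task relationship, as it could be pasted into spark-shell (where a SparkSession named `spark` is predefined):

```scala
// One task per partition per stage: parallelism follows the number of partitions.
val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 8)
println(rdd.getNumPartitions)   // 8 partitions -> 8 tasks in this stage

// Repartitioning changes how many tasks later stages run (at the cost of a shuffle).
val wider = rdd.repartition(16)
println(wider.getNumPartitions) // 16
wider.map(_ * 2).count()        // the final stage of this job runs 16 tasks
```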
 

Application Execution


    This section briefly describes how data processing code is executed on a Spark cluster.

Terminology
    Let’s define a few terms first:
    Shuffle. A shuffle redistributes data among a cluster of nodes. It is an expensive operation because it involves moving data across a network. Note that a shuffle does not randomly redistribute data; it groups data elements into buckets based on some criteria. Each bucket forms a new partition.
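For example, groupByKey forces a shuffle because every value for a given key has to land in the same bucket (the snippet assumes a spark-shell session):

```scala
// groupByKey moves records across the network so that all values for a key
// end up in the same bucket; each bucket becomes a partition of the result.
val pairs   = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val grouped = pairs.groupByKey()     // shuffle happens here
grouped.collect().foreach(println)   // (a,CompactBuffer(1, 3)), (b,CompactBuffer(2))
```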

    Job. A job is a set of computations that Spark performs to return results to a driver program. Essentially, it is an execution of a data processing algorithm on a Spark cluster. An application can launch multiple jobs. Exactly how a job is executed is covered below.
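In code, transformations only record lineage; every action launches a job (again assuming spark-shell):

```scala
val nums = spark.sparkContext.parallelize(1 to 100)

// Transformations (filter, map, ...) launch no job; they only build up lineage.
val evens = nums.filter(_ % 2 == 0)

// Each action launches its own job, visible as Job 0, Job 1, ... in the Spark UI.
val total  = evens.count()    // first job
val sample = evens.take(10)   // second job
```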

    Stage. A stage is a collection of tasks. Spark splits a job into a DAG of stages. A stage may depend on another stage. For example, a job may be split into two stages, stage 0 and stage 1, where stage 1 cannot begin until stage 0 is completed. Spark groups tasks into stages using shuffle boundaries. Tasks that do not require a shuffle are grouped into the same stage. A task that requires its input data to be shuffled begins a new stage.
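A small example of a shuffle boundary splitting a job into two stages (assuming spark-shell):

```scala
// distinct needs a shuffle, so the job launched by count() is split into two
// stages: the tasks before the shuffle form one stage, and the tasks that
// read the shuffled data form the next.
val ids    = spark.sparkContext.parallelize(Seq(3, 1, 3, 2, 1), numSlices = 4)
val unique = ids.distinct()   // shuffle boundary: a new stage starts after this
unique.count()
```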

How an Application Works


    With the definitions out of the way, I can now describe how a Spark application processes data in parallel across a cluster of nodes. When a Spark application is run, Spark connects to a cluster manager and acquires executors on the worker nodes. As mentioned earlier, a Spark application submits a data processing algorithm as a job. Spark splits a job into a directed acyclic graph (DAG) of stages. It then schedules the execution of these stages on the executors using a low-level scheduler provided by a cluster manager. The executors run the tasks submitted by Spark in parallel.

    Every Spark application gets its own set of executors on the worker nodes. This design provides a few benefits. First, tasks from different applications are isolated from each other since they run in different JVM processes; a misbehaving task from one application cannot crash another Spark application. Second, scheduling of tasks becomes easier: Spark has to schedule the tasks belonging to only one application at a time and does not have to handle the complexities of scheduling tasks from multiple concurrently running applications.
    However, this design also has one disadvantage. Since applications run in separate JVM processes, they cannot easily share data. Even though they may be running on the same worker nodes, they cannot share data without writing it to disk. As previously mentioned, writing to and reading from disk are expensive operations, so applications that share data through disk will experience performance issues.
 

To summarize:

  1. A physical node can run one or more workers.
  2. A worker can host one or more executors.
  3. An executor owns a set of CPU cores and an amount of memory.
  4. Shuffle operations (which reorganize loosely structured data into data grouped by some rule) mark the boundaries between stages; a new stage begins at each shuffle.
  5. Each partition corresponds to one task.

 
