【Spark 11】Basic Spark Cluster Architecture and Terminology

The organization of this Spark series is not ideal: only now do I get to the basic architecture of a Spark cluster. The reason is that the earlier posts focused on trying out the Spark API in the Spark shell and on a rough first look at RDDs. That's fine; following the coarse-to-fine principle, I'll take it one step at a time and reorganize all the Spark-related posts at the end. For now these are just notes from the learning process.

 

Spark Cluster Overview

 

 
 

 

Here is an explanation of the diagram above:

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager or Mesos/YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks for the executors to run.

 

There are several useful things to note about this architecture:

  • Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
  • Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
  • Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.

The last point emphasizes that the machine running the Spark driver should sit in the same network environment as the Spark cluster, because the SparkContext instance in the driver has to send tasks to the executors on the different worker nodes and receive execution results back from them. In practice, in production environments the driver machine tends to be well provisioned, especially in CPU processing power.

 

 

Spark Terminology

Application: User program built on Spark. Consists of a driver program and executors on the cluster.
Application jar: A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries; these will be added at runtime.
Driver program: The process running the main() function of the application and creating the SparkContext.
Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
Deploy mode: Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.
Worker node: Any node that can run application code in the cluster.
Executor: A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Task: A unit of work that will be sent to one executor.
Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
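The "cluster manager" and "deploy mode" terms above map directly onto spark-submit flags. A hedged sketch of the two deploy modes (the class name, host, and jar path are placeholders, not from this post):

```shell
# Submit the application "uber jar" (without Spark/Hadoop libraries) to a
# standalone master; in "client" deploy mode the driver runs on this machine.
spark-submit \
  --class com.example.MyApp \
  --master spark://master-host:7077 \
  --deploy-mode client \
  my-app.jar

# The same application on YARN, with the driver launched inside the cluster
# ("cluster" deploy mode).
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode cluster \
  my-app.jar
```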

 

 

  • Application

An Application is a user program that creates a SparkContext instance; it consists of the driver program plus its executors on the cluster. spark-shell itself is an Application, because on startup it creates a SparkContext object named sc.

 

  • Job

A Job corresponds to a Spark action: each action, such as count or saveAsTextFile, spawns one Job instance, and that Job consists of the parallel computation of multiple tasks.
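Only actions trigger jobs; transformations are lazy and merely build up a plan. A plain-Python generator can stand in for an RDD to illustrate the same lazy-then-eager pattern (this is an analogy, not the Spark API):

```python
# Transformations (map, filter) only record what to compute; an action
# (count, sum, saveAsTextFile) actually triggers the computation.
evaluated = 0

def double(x):
    """Stand-in for a map function; counts how often it actually runs."""
    global evaluated
    evaluated += 1
    return x * 2

mapped = (double(x) for x in range(1, 11))  # "transformation": nothing runs yet
before_action = evaluated                   # still 0: no job has been triggered
total = sum(mapped)                         # "action": forces all 10 evaluations
```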

 

  • Cluster Manager

The Cluster Manager is an external service for managing cluster resources. Spark currently supports three: Standalone, YARN, and Mesos. Spark's built-in Standalone mode covers the resource-management needs of most pure Spark deployments; YARN or Mesos is generally only worth considering when the cluster also runs other computation frameworks.

 

  • Worker Node

A Worker Node is any node in the cluster that can run application code, comparable to a slave node in Hadoop.


  • Executor

An Executor is a worker process launched for an application on a worker node. It runs the application's tasks and is responsible for keeping data in memory or on disk. Note that in standalone mode each application normally gets at most one Executor per worker node (the default behavior at the time of writing), and within that Executor the application's tasks are processed concurrently on multiple threads.
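The "one process, many task threads" model can be sketched with an ordinary thread pool (an analogy in Python; in Spark these are threads inside one executor JVM):

```python
# Sketch: an executor is a single long-lived process that runs many small
# tasks concurrently on a pool of worker threads.
from concurrent.futures import ThreadPoolExecutor

def run_tasks(task_inputs, num_threads=4):
    """Run one squaring 'task' per input element on a shared thread pool."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map preserves input order even though tasks run concurrently
        return list(pool.map(lambda x: x * x, task_inputs))

results = run_tasks(range(1, 9))  # 8 "tasks" multiplexed onto 4 threads
```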

 

  • Task

A Task is a unit of work sent by the driver to an executor. Typically one task processes one split of the input data, and each split usually corresponds to one block (e.g. one HDFS block).
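So, as a rough rule of thumb, the task count for a file read is the number of input splits. A hedged back-of-the-envelope sketch (the helper name and 128 MB block size are illustrative assumptions):

```python
# Hypothetical helper: number of tasks for reading a file is roughly the
# number of input splits, and a split usually matches one HDFS block.
def num_tasks(file_size_bytes, block_size_bytes):
    # ceiling division: a partially filled last block still needs a task
    return -(-file_size_bytes // block_size_bytes)

MB = 1024 * 1024
tasks_for_300mb = num_tasks(300 * MB, 128 * MB)  # 300 MB / 128 MB blocks -> 3 tasks
```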

 

  • Stage

A Job is divided into sets of tasks, and each set is called a Stage, much like the map and reduce phases in MapReduce. Stage boundaries are determined as follows: a stage typically begins by reading external data or reading shuffled data, and it typically ends where a shuffle occurs (for example at a reduceByKey) or where the whole job ends, e.g. when results are written out to a storage system such as HDFS.
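The boundary rule above can be sketched as a toy function that walks a job's operator sequence and starts a new stage at every shuffle operator (operator names are illustrative; this is not the actual Spark scheduler code):

```python
# Toy stage splitter: a shuffle operator always begins a new stage.
SHUFFLE_OPS = {"reduceByKey", "groupByKey", "sortByKey", "repartition"}

def to_stages(ops):
    """Split a list of operator names into stages at shuffle boundaries."""
    stages = [[]]
    for op in ops:
        if op in SHUFFLE_OPS:
            stages.append([op])   # shuffle boundary: a new stage begins here
        else:
            stages[-1].append(op) # narrow op stays in the current stage
    return stages

job = ["textFile", "map", "filter", "reduceByKey", "map", "saveAsTextFile"]
stages = to_stages(job)  # two stages: before and after the reduceByKey shuffle
```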

 

 

 

I will add the ScalaDoc for Task, Executor, and Stage later.