- Spark components diagram
1.1 Explanation of the Spark components diagram
(1) The driver program contains a SparkContext object
SparkContext is the entry point for Spark functionality
It represents a connection to a Spark cluster (see the sketch below)
(2) The SparkContext object talks to the cluster manager
(3) The cluster manager talks to the worker nodes
A worker node can be thought of as roughly the counterpart of a YARN node manager
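A minimal PySpark sketch of point (1): the driver program creates the SparkContext, which then connects to the cluster. The app name and master URL are placeholder assumptions, not part of the original notes.

```python
# Minimal sketch, assuming a local test setup: the driver creates the
# SparkContext, which is the entry point and represents the connection
# to the cluster. "demo-app" and "local[2]" are placeholder values.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("demo-app").setMaster("local[2]")
sc = SparkContext(conf=conf)   # driver-side object; talks to the cluster manager

print(sc.version)              # the SparkContext is now usable as the entry point
sc.stop()
```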
- Spark explained
Spark applications run as independent sets of processes on a cluster
i.e., on a cluster, a Spark application runs as its own independent set of processes
Example
spark1: 3 executors
spark2: 3 executors
The executors of spark1 and the executors of spark2 are independent of each other
coordinated by the SparkContext object in your main program
i.e., the processes are coordinated by the SparkContext object in the main program
the main program is the driver program
Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (Spark's own standalone cluster manager, Mesos or YARN)
i.e., to run on a cluster, the SparkContext can connect to several kinds of cluster manager,
for example Spark's own standalone cluster manager, Mesos, or YARN (see the master-URL sketch below)
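A sketch of how the choice of cluster manager shows up in code: it is simply the master URL set on the SparkConf. The host names and ports below are hypothetical.

```python
from pyspark import SparkConf

# The master URL selects the cluster manager; hosts/ports are hypothetical.
standalone_conf = SparkConf().setMaster("spark://master-host:7077")  # Spark standalone
yarn_conf       = SparkConf().setMaster("yarn")                      # YARN (reads HADOOP_CONF_DIR)
mesos_conf      = SparkConf().setMaster("mesos://mesos-host:5050")   # Mesos
local_conf      = SparkConf().setMaster("local[*]")                  # local mode, for testing only
```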
Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application
Once connected, Spark acquires executors on the cluster's nodes; an executor is a process,
the application's computations run inside the executors,
and the application's data can also be stored in the executors
Example
The SparkContext connects to the cluster manager
and requests resources on two nodes, i.e. two executors
Steps for acquiring the resources:
the cluster manager first talks to the node managers (worker nodes),
the worker nodes then launch the corresponding executors,
and the executors register themselves with the SparkContext
Note: an executor has a cache in which it can store data,
and an executor can also run tasks (see the sketch below)
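A small sketch of the note above, assuming the SparkContext `sc` from the earlier sketch already exists: the executors both cache data and run the tasks.

```python
# Assumes an existing SparkContext `sc` (see the earlier sketch).
rdd = sc.parallelize(range(1_000_000), numSlices=4)   # 4 partitions -> 4 tasks per stage

cached = rdd.map(lambda x: x * 2).cache()   # cached partitions live in executor memory
print(cached.count())   # first action: tasks run on the executors and fill the cache
print(cached.sum())     # second action: tasks read the cached partitions instead of recomputing
```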
Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors
Next, the driver program sends your application code
(the JAR or Python files passed to the SparkContext) to the executors (see the sketch below)
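A sketch of shipping application code to the executors, assuming the existing `sc` and a hypothetical dependency file `extra_module.py`:

```python
# `extra_module.py` is a hypothetical dependency of the application.
sc.addPyFile("extra_module.py")   # the file is shipped to every executor

# The same thing can be done at submit time, e.g.:
#   spark-submit --py-files extra_module.py my_app.py
# (for Scala/Java applications the application JAR plays this role)
```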
Finally, SparkContext sends tasks to the executors to run
Finally, the SparkContext sends the tasks to the executors to run
Note
The reason the code is sent before the tasks:
once the code has been shipped, a task can be executed as soon as it arrives,
i.e. the shipped code is what the tasks actually run (see the sketch below)
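A sketch of that task flow, assuming `sc` exists: transformations are lazy, and tasks are only sent to the executors when an action is called.

```python
words = sc.parallelize(["spark", "driver", "executor", "task"])

upper = words.map(lambda w: w.upper())   # lazy: no tasks are sent yet
result = upper.collect()                 # action: the SparkContext now sends tasks to the executors
print(result)                            # ['SPARK', 'DRIVER', 'EXECUTOR', 'TASK']
```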
which allocate resources across applications (this clause refers back to the cluster managers)
i.e., it is the cluster manager that allocates resources across the different applications
Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads
Each application gets its own executor processes;
the executor processes stay alive for the whole lifetime of the application and run tasks in multiple threads (see the configuration sketch below)
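A configuration sketch for the executor/thread layout; the specific values are illustrative assumptions, not recommendations.

```python
from pyspark import SparkConf

conf = (SparkConf()
        .setAppName("demo-app")
        .set("spark.executor.instances", "2")   # two executor processes (as used on YARN)
        .set("spark.executor.cores", "3")       # each executor runs up to 3 tasks concurrently (threads)
        .set("spark.executor.memory", "2g"))    # memory per executor process
```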
This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs)
The benefit is that applications are isolated from each other, because different applications use different executor processes
and those executor processes do not interfere with one another:
on the scheduling side, each driver program schedules its own tasks;
on the executor side, tasks from different applications run in different JVMs
(an executor is a process running in its own JVM)
However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system
However, different Spark applications (SparkContext instances) cannot share data,
unless the data has been written to an external storage system (see the sketch below)
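A sketch of the only way two applications can share data: through external storage. The HDFS path below is a placeholder.

```python
# --- application A (SparkContext `sc`) ---
rdd_a = sc.parallelize([("a", 1), ("b", 2)])
rdd_a.saveAsTextFile("hdfs:///tmp/shared_output")    # write to an external storage system

# --- application B (a different driver with its own SparkContext, e.g. `sc2`) ---
# rdd_b = sc2.textFile("hdfs:///tmp/shared_output")  # read the data back in the other application
```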
Spark is agnostic to the underlying cluster manager
i.e., Spark does not care which cluster manager sits underneath it
As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN)
As long as Spark can acquire executor processes, and those processes can communicate with each other,
it is relatively easy to run Spark even on a cluster manager
that also supports other applications (e.g. Mesos/YARN)
- Spark summary
3.1 Spark
In Spark, only the driver program and the executors are processes (tasks run as threads inside the executors)
3.2 Hadoop (MapReduce)
Each map task and each reduce task is a separate process