Spark Study Notes: A First Look

1 Spark official site: http://spark.apache.org/
2 Version studied: 1.5.0

Spark architecture: reading the official documentation

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
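As in other distributed systems, a Spark application runs on the cluster as a set of independent processes. These processes are coordinated by the SparkContext object in the main program, and that main program is what the docs call the driver program.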

Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications.
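There are several kinds of cluster manager (Spark's own standalone manager, Mesos, or YARN), and their main job is to allocate resources to applications. The SparkContext has to connect to one of them before it can run the application on a cluster.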
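The choice of cluster manager shows up in the master URL handed to the SparkContext. The following is a minimal sketch, assuming the master URL conventions of Spark 1.5 (`local[K]`, `spark://HOST:PORT`, `mesos://HOST:PORT`, `yarn-client`). The helper names `master_url` and `make_context` are made up for these notes and are not part of Spark; only `SparkConf` and `SparkContext` come from PySpark.

```python
def master_url(manager, host=None, port=None, local_threads=2):
    """Build a master URL for one kind of cluster manager (hypothetical helper)."""
    if manager == "local":
        # No cluster manager at all: run everything in one local JVM.
        return "local[%d]" % local_threads
    if manager == "standalone":
        # Spark's own standalone manager; 7077 is its default port.
        return "spark://%s:%d" % (host, port or 7077)
    if manager == "mesos":
        # Mesos master; 5050 is its default port.
        return "mesos://%s:%d" % (host, port or 5050)
    if manager == "yarn":
        # In Spark 1.5 the YARN address is read from the Hadoop configuration,
        # so the master URL carries only the deploy mode.
        return "yarn-client"
    raise ValueError("unknown cluster manager: %r" % manager)


def make_context(app_name, master):
    """Create the SparkContext in the driver program (needs PySpark installed)."""
    from pyspark import SparkConf, SparkContext  # imported lazily on purpose

    conf = SparkConf().setAppName(app_name).setMaster(master)
    return SparkContext(conf=conf)
```

For a quick local experiment, something like `make_context("first-look", master_url("local"))` should be enough. The rest of the program stays the same when the master URL is switched to a real cluster.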

Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
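Once connected, Spark acquires executors on the cluster nodes, the SparkContext sends the application code to those executors, and finally it sends them tasks to run.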
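The split between driver and executors can be seen in a very small job. This is only a sketch: it assumes `sc` is an already created SparkContext, and the function names `sum_of_squares` and `run_job` are invented for illustration. `parallelize`, `mapPartitions` and `collect` are the PySpark RDD calls used.

```python
def sum_of_squares(numbers):
    """Task body: runs on an executor over one partition, yields one partial sum."""
    yield sum(n * n for n in numbers)


def run_job(sc, upper=100, partitions=4):
    """Driver side: describe the job; Spark turns it into one task per partition."""
    # The driver splits the data into partitions...
    rdd = sc.parallelize(range(1, upper + 1), partitions)
    # ...the executors run sum_of_squares on their partitions,
    # and collect() brings the partial results back to the driver.
    partial_sums = rdd.mapPartitions(sum_of_squares).collect()
    return sum(partial_sums)
```

The function passed to `mapPartitions` is the part that gets shipped to the executors, while `run_job` itself only ever runs in the driver.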

[Figure 1: image from the original post]

Notes:
1 Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
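Data cannot be shared directly between different SparkContexts (that is, different applications); it has to be written to an external storage system first.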

2 Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
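Spark does not depend on any particular cluster manager, so it also runs on managers that are shared with other applications, such as Mesos or YARN.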

3 The driver program must listen for and accept incoming connections from its executors throughout its lifetime (e.g., see spark.driver.port and spark.fileserver.port in the network config section). As such, the driver program must be network addressable from the worker nodes.
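The driver keeps listening for connections from its executors until the application finishes, so the worker nodes must be able to reach the driver over the network for that whole time.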

4 Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
Run the driver close to the cluster: the driver machine and the worker nodes should ideally sit on the same local area network.
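For note 3, the driver's address and ports can be pinned so that a firewall between the driver and the worker nodes can be opened for them. The sketch below assumes the Spark 1.5 property names `spark.driver.host`, `spark.driver.port` and `spark.fileserver.port`. The port numbers, the host name in the usage line, and the helper names `driver_network_settings` and `make_conf` are made up for illustration.

```python
def driver_network_settings(driver_host, driver_port=51000, fileserver_port=51100):
    """Return (key, value) pairs that fix the driver's address and ports.

    The default port numbers are arbitrary examples, not Spark defaults.
    """
    return [
        ("spark.driver.host", driver_host),
        ("spark.driver.port", str(driver_port)),
        ("spark.fileserver.port", str(fileserver_port)),
    ]


def make_conf(app_name, master, settings):
    """Build a SparkConf carrying the given settings (needs PySpark installed)."""
    from pyspark import SparkConf  # imported lazily on purpose

    return SparkConf().setAppName(app_name).setMaster(master).setAll(settings)
```

A call such as `make_conf("first-look", "spark://master-host:7077", driver_network_settings("driver-host"))` would then give a configuration whose driver ports are known in advance. Without these settings Spark picks random ports, which is harder to allow through a firewall.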
