Spark: Introduction to Spark

It has been 3 months since I took the cloud computing course. Here is my brief summary of Spark.


Why Spark? And what is Spark?


If we run iterative MapReduce jobs, the intermediate result is stored in HDFS after each MapReduce step. When the next MapReduce job starts, it must read that data back from HDFS. As we know from the computer's storage hierarchy, disks are the slowest medium. We can therefore look for an alternative to disk for storing the intermediate result, and global memory comes to mind.


Let's clarify one thing first: Spark is not a computation tool. Here is a good description of Spark[1]:


[Image 1]


So, in this view, Spark is a resource allocator. After we submit an application to Spark, the application starts a driver program. The driver program directs the application to completion: it coordinates the resources and controls the executors.


1/ The driver program requests worker nodes in the cluster;

2/ The driver program connects to the executors on the worker nodes it was granted and acquires those executors;

3/ The driver program then sends the code and data to the executors.


Because different executors can run different processes for the same application, the driver program uses its SparkContext to coordinate which executor runs which process.


[Image 2]



The program that implements the cluster is called the cluster manager. There are currently three cluster managers:


Standalone - this cluster manager is bundled with Spark and is the simplest one.
Apache Mesos - a very mature distributed operating system that can run many systems besides Spark.
Hadoop YARN - Hadoop's resource manager.


[Image 3]


How can Spark use memory for storage[2]?


Spark is built on a new data abstraction called the Resilient Distributed Dataset (RDD), which allows a distributed dataset to remain in memory on the nodes of a cluster across the stages of a computation. RDDs also store lineage information about the data: a record of all the operations that were performed to bring an RDD to its present state. (The data does not need to be replicated, because replication would waste memory.) This way, if a node in a Spark cluster fails, the in-memory data that was lost can be reloaded from the source (often the distributed file system), and the operations recorded in the lineage can be re-applied to bring the data back to its present state. Thus, data can remain in memory through multiple stages of transformation without spilling to disk, and applications can run many times faster than on traditional frameworks that rely on disk accesses between stages.
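The lineage idea can be illustrated with a small toy model in plain Python. This is only a sketch and not Spark's API: the class `ToyRDD`, its `lose_memory` method, and the `read_source` function are hypothetical names made up for this example. Each object remembers its parent and the operation that produced it, so a lost in-memory copy can be rebuilt by reloading the source and re-applying the recorded operations.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's class)."""
    source: Optional[Callable[[], List[int]]] = None        # how to reload the base data
    parent: Optional["ToyRDD"] = None                        # lineage: where this came from
    op: Optional[Callable[[List[int]], List[int]]] = None    # lineage: operation on the parent
    cache: Optional[List[int]] = None                        # in-memory copy, may be lost

    def map(self, f):
        # Record the operation; nothing is copied or replicated.
        return ToyRDD(parent=self, op=lambda xs: [f(x) for x in xs])

    def filter(self, pred):
        return ToyRDD(parent=self, op=lambda xs: [x for x in xs if pred(x)])

    def compute(self):
        # Rebuild from lineage only when the in-memory copy is missing.
        if self.cache is None:
            if self.parent is None:
                self.cache = self.source()
            else:
                self.cache = self.op(self.parent.compute())
        return self.cache

    def lose_memory(self):
        # Simulate a node failure: drop the cached data along the whole chain.
        self.cache = None
        if self.parent is not None:
            self.parent.lose_memory()


reads = {"count": 0}


def read_source():
    # Stands in for reading from the distributed file system.
    reads["count"] += 1
    return [1, 2, 3, 4, 5]


base = ToyRDD(source=read_source)
result = base.map(lambda x: x * 10).filter(lambda x: x > 20)

first = result.compute()     # computed once, then kept in memory
result.lose_memory()         # simulated failure
second = result.compute()    # rebuilt from the source plus the lineage
```

After the simulated failure, `second` equals `first`, and `reads["count"]` is 2 because the source had to be read again. Real Spark tracks lineage per partition and recomputes only the lost partitions; this sketch ignores partitioning entirely.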

To use Spark, you need to write a driver program to connect to a Spark cluster of workers. The driver defines one or more RDDs and invokes actions on them. The driver also tracks the RDDs’ lineage, which records the history of how this RDD is generated as a Directed Acyclic Graph (DAG). The workers are long-lived processes (running for the entire lifetime of an application) that can store RDD partitions in RAM across operations.


The SparkContext object can connect to several types of cluster managers that handle the scheduling of applications and tasks (Figure 2). The cluster manager isolates multiple Spark programs from each other: each application has its own driver and runs on isolated executors coordinated by the cluster manager. Currently, Spark supports applications written in Scala, Java, and Python.


A tip about writing Scala programs:


There are three kinds of code when using Spark:


[Image 4]


Only action statements actually trigger work on the data.
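To make this concrete, here is a minimal sketch in plain Python rather than Spark itself. The class `LazyDataset` and the helper `double` are hypothetical names invented for this example. The method names `map` and `collect` mirror the real RDD operations of the same names, where `map` is a transformation and `collect` is an action.

```python
class LazyDataset:
    """Toy dataset (hypothetical): transformations are recorded, actions run them."""

    def __init__(self, items, steps=()):
        self._items = list(items)
        self._steps = tuple(steps)   # recorded functions, not yet applied

    def map(self, f):
        # Transformation: only remember f, do not touch the data.
        return LazyDataset(self._items, self._steps + (f,))

    def collect(self):
        # Action: apply every recorded function to every item now.
        out = []
        for x in self._items:
            for f in self._steps:
                x = f(x)
            out.append(x)
        return out


calls = []


def double(x):
    calls.append(x)   # log each real invocation
    return x * 2


doubled = LazyDataset([1, 2, 3]).map(double)
calls_before_action = len(calls)   # map alone has not called double

result = doubled.collect()
calls_after_action = len(calls)    # collect called double once per item
```

Here `calls_before_action` is 0 and `calls_after_action` is 3: the `map` line only describes the work, and the `collect` line is what makes it happen.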


References:

[1] https://segmentfault.com/a/1190000003889102

[2] 15619 CMU Cloud Computing 
