【Spark 100】A Summary of Spark Questions

I plan to write 100 blog posts about Spark, and I am writing this 100th one first. It mainly records the questions I have thought of or run into while learning Spark, to serve as a checklist for studying Spark.

1. The dataset does not fit in memory

2. A cached dataset does not fit in memory

3. Spark local run mode

4. How does Spark control a program's allocation of resources (memory and cores)?

5. What does the spark.default.parallelism parameter do?

6. ExecutorRunner's fetchAndRunExecutor method launches a process; what process is it?

7. When submitting a Spark program, memory and the number of CPU cores are specified; how does Spark use these parameters to control the resources of the jobs it runs?

8. When running a Spark program on a YARN cluster with deploy-mode set to cluster, i.e. the Driver runs on a worker node, are the ApplicationMaster and the Driver on the same machine? Who controls the Driver?

9. How do the Driver and Executors communicate? For them to communicate, the Driver and Executors must be reachable from each other over the network (e.g. on the same local network).

10. Does a MapReduce-style program always have a shuffle?

11. The Driver launches a job; what is the flow by which the computed results are sent back to the Driver? That is, how do the results get returned to the Driver?

 

1. The dataset does not fit in memory

What happens if my dataset does not fit in memory?

Often each partition of data is small and does fit in memory, and these partitions are processed a few at a time. For very large partitions that do not fit in memory, Spark's built-in operators perform external operations on datasets.
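
As a side note, a minimal sketch of working with an input far larger than memory, assuming a hypothetical path /data/big.txt: raising the partition count keeps each individual partition small enough to fit, and Spark only processes a few partitions per executor at a time.

import org.apache.spark.{SparkConf, SparkContext}

object PartitionSizingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-sizing"))
    // Input file that may be far larger than the cluster's total memory (placeholder path).
    val lines = sc.textFile("/data/big.txt")
    // Split into more, and therefore smaller, partitions so each fits in memory.
    val repartitioned = lines.repartition(1000)
    println(repartitioned.count())
    sc.stop()
  }
}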

 

2. A cached dataset does not fit in memory

What happens when a cached dataset does not fit in memory?

Spark can either spill it to disk or recompute the partitions that don't fit in RAM each time they are requested. By default, it uses recomputation, but you can set a dataset's storage level to MEMORY_AND_DISK to avoid this.
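
A small sketch of the storage-level setting mentioned above, assuming a placeholder input path:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheSpillSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-spill"))
    val rdd = sc.textFile("/data/events.log")   // placeholder path

    // Default cache()/persist() is MEMORY_ONLY: partitions that don't fit
    // in RAM are dropped and recomputed the next time they are needed.
    // MEMORY_AND_DISK instead spills those partitions to local disk:
    rdd.persist(StorageLevel.MEMORY_AND_DISK)

    println(rdd.count())   // the first action materializes and caches the data
    sc.stop()
  }
}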

 

3. Spark local run mode

Note that you can also run Spark locally (possibly on multiple cores) without any special setup by just passing local[N] as the master URL, where N is the number of parallel threads you want.
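
For example, a minimal sketch; local[4] is an arbitrary choice of four worker threads, and local[*] would use all available cores:

import org.apache.spark.{SparkConf, SparkContext}

object LocalModeSketch {
  def main(args: Array[String]): Unit = {
    // local[4]: the Driver and Executor run in a single JVM with 4 worker threads.
    val conf = new SparkConf().setAppName("local-mode").setMaster("local[4]")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum())   // quick sanity check: prints 5050.0
    sc.stop()
  }
}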

 

4. How does Spark control a program's allocation of resources (memory and cores)?

Once a program specifies memory and cores, how does Spark's code enforce these resource limits?
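
As a starting point, a sketch of where these numbers enter the system (all values are placeholders): spark-submit flags such as --executor-memory and --total-executor-cores become configuration entries like spark.executor.memory and spark.cores.max, which the cluster manager consults when launching Executors.

import org.apache.spark.{SparkConf, SparkContext}

object ResourceRequestSketch {
  def main(args: Array[String]): Unit = {
    // Equivalent to: spark-submit --executor-memory 2g --total-executor-cores 8 ...
    val conf = new SparkConf()
      .setAppName("resource-request")
      .set("spark.executor.memory", "2g")   // heap size per Executor
      .set("spark.cores.max", "8")          // standalone mode: total cores for this application
    val sc = new SparkContext(conf)
    // ... job code ...
    sc.stop()
  }
}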

 

5. What does the spark.default.parallelism parameter do?
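
For reference, a sketch of the common case (placeholder value and path): spark.default.parallelism supplies the default number of partitions for operations such as reduceByKey, join and sc.parallelize when no explicit partition count is passed.

import org.apache.spark.{SparkConf, SparkContext}

object DefaultParallelismSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("default-parallelism")
      .set("spark.default.parallelism", "16")   // placeholder value
    val sc = new SparkContext(conf)

    val pairs = sc.textFile("/data/words.txt")   // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))

    // No partition count is passed, so reduceByKey falls back to
    // spark.default.parallelism when sizing its output partitions.
    val counts = pairs.reduceByKey(_ + _)
    println(counts.getNumPartitions)   // expected: 16 with this configuration
    sc.stop()
  }
}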

 

6. ExecutorRunner's fetchAndRunExecutor method launches a process; what process is it?

 

10. Does a MapReduce-style program always have a shuffle?

Not necessarily! For example, in rdd = sc.textFile(...); rdd.count there is no shuffle: each partition's count is computed locally and only the per-partition results are sent back to the Driver to be summed.
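
A minimal sketch contrasting the two cases (placeholder path): count needs no shuffle because each partition is counted where it is stored, while a key-based aggregation such as reduceByKey does add a shuffle stage.

import org.apache.spark.{SparkConf, SparkContext}

object ShuffleOrNotSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-or-not"))
    val rdd = sc.textFile("/data/input.txt")   // placeholder path

    // No shuffle: each partition is counted locally and only the
    // per-partition counts are sent back to the Driver to be summed.
    println(rdd.count())

    // Shuffle: occurrences of the same word must be moved onto the same
    // partition before they can be summed, so a shuffle stage is required.
    val wordCounts = rdd.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
    println(wordCounts.count())
    sc.stop()
  }
}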

 
