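Question 1:
After an RDD has been serialized with saveAsObjectFile, it is read back with SparkContext.objectFile. The type parameter must be supplied when deserializing, so that the deserializer knows the element type of the resulting objects.
For example:
Define a case class: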
case class Student(name: String, age: Int)
val s1 = Student("dax1n", 18)
val s2 = Student("mala", 28)
val rdd = sc.parallelize(Array(s1, s2))
rdd.saveAsObjectFile("obj") // writes the elements using Java serialization
sc.objectFile[Student]("obj").collect // the type parameter Student is required
Without the type parameter, sc.objectFile("obj").collect throws an exception.
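The round trip above can be sketched as a standalone program. This is a minimal sketch, not the only way to do it: the object name ObjectFileRoundTrip, the helper roundTrip, the local[2] master and the temporary path are all assumptions made for illustration. It also shows that declaring the expected type of the result is enough for the compiler to pick the type parameter.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Top-level case class so it can be found by name when the file is read back.
case class Student(name: String, age: Int)

// Hypothetical helper object, for illustration only.
object ObjectFileRoundTrip {
  // Saves two Student records under `path` and reads them back, sorted by name.
  def roundTrip(sc: SparkContext, path: String): Seq[Student] = {
    sc.parallelize(Seq(Student("dax1n", 18), Student("mala", 28)))
      .saveAsObjectFile(path)

    // Option 1: pass the element type explicitly.
    val explicit: RDD[Student] = sc.objectFile[Student](path)
    // Option 2: let the compiler infer it from the declared type of the value.
    val inferred: RDD[Student] = sc.objectFile(path)

    require(explicit.count() == inferred.count())
    inferred.collect().sortBy(_.name).toSeq
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("objectFile-sketch")
    val sc = new SparkContext(conf)
    try {
      // saveAsObjectFile fails if the target exists, so use a fresh path in a temp dir.
      val path = Files.createTempDirectory("objfile").resolve("obj").toString
      roundTrip(sc, path).foreach(println)
    } finally sc.stop()
  }
}
```

With neither a type argument nor an expected type, the compiler has nothing to infer the element type from, which is why the bare call in the notes fails at runtime.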
saveAsObjectFile(path) | Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile(). |
foreach(func) | Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems. Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details. In other words: changing any outside variable other than an Accumulator from inside foreach() may give undefined results; see the section on closures. |
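The note on foreach can be illustrated with a short sketch. It assumes Spark 2.0 or later for SparkContext.longAccumulator; the object name ForeachSideEffects and the helper sumWithAccumulator are made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, for illustration only.
object ForeachSideEffects {
  // Sums 1..100 through an accumulator, the supported way to aggregate inside foreach().
  def sumWithAccumulator(sc: SparkContext): Long = {
    val acc = sc.longAccumulator("sum")
    sc.parallelize(1 to 100, numSlices = 4).foreach(x => acc.add(x))
    acc.value.longValue()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("foreach-sketch")
    val sc = new SparkContext(conf)
    try {
      println(s"accumulator: ${sumWithAccumulator(sc)}")

      // Anti-pattern: the closure that updates `counter` is serialized and shipped
      // to the tasks, so what the driver sees afterwards is undefined and may
      // differ between local and cluster mode.
      var counter = 0
      sc.parallelize(1 to 100, numSlices = 4).foreach(x => counter += x)
      println(s"plain var (undefined behavior): $counter")
    } finally sc.stop()
  }
}
```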
Shuffle operations
Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.
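In other words: certain Spark operations trigger an event called the shuffle, which is Spark's mechanism for redistributing data so that it is grouped differently across partitions. It typically copies data between executors and machines, which makes it a complex and costly operation.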
To understand what happens during the shuffle we can consider the example of the reduceByKey operation. The reduceByKey operation generates a new RDD where all values for a single key are combined into a tuple - the key and the result of executing a reduce function against all values associated with that key. The challenge is that not all values for a single key necessarily reside on the same partition, or even the same machine, but they must be co-located to compute the result.
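In other words: to see what happens during a shuffle, take reduceByKey. It produces a new RDD of (key, value) pairs in which each value is the result of applying the reduce function to all values for that key. The difficulty is that the values for one key are not necessarily all on the same partition, or even on the same machine, yet they must be brought together to compute the result.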
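A minimal sketch of the reduceByKey behavior described above, counting occurrences per word. The object name ReduceByKeySketch, the helper countWords, and the sample input are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, for illustration only.
object ReduceByKeySketch {
  // Pairs each word with 1, then sums the 1s per word; the sum per key needs a shuffle.
  def countWords(sc: SparkContext, words: Seq[String]): Map[String, Int] =
    sc.parallelize(words, numSlices = 3)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("reduceByKey-sketch")
    val sc = new SparkContext(conf)
    try {
      countWords(sc, Seq("a", "b", "a", "c", "b", "a")).foreach(println)
    } finally sc.stop()
  }
}
```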
In Spark, data is generally not distributed across partitions to be in the necessary place for a specific operation. During computations, a single task will operate on a single partition - thus, to organize all the data for a single reduceByKey reduce task to execute, Spark needs to perform an all-to-all operation. It must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key - this is called the shuffle.
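One way to observe the all-to-all step is to look at which keys each partition holds before and after reduceByKey. The sketch below uses glom() for that; the object name ShuffleColocation and its helpers are assumptions for illustration, and the input is deliberately laid out so that both keys start on both partitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, for illustration only.
object ShuffleColocation {
  // Distinct keys held by each partition, in partition order.
  def keysPerPartition(rdd: RDD[(String, Int)]): Seq[Set[String]] =
    rdd.glom().collect().toSeq.map(part => part.map(_._1).toSet)

  // Key layout before and after reduceByKey for a small two-partition RDD.
  def layouts(sc: SparkContext): (Seq[Set[String]], Seq[Set[String]]) = {
    val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 1, "a" -> 1, "b" -> 1), numSlices = 2)
    (keysPerPartition(pairs), keysPerPartition(pairs.reduceByKey(_ + _)))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("colocation-sketch")
    val sc = new SparkContext(conf)
    try {
      val (before, after) = layouts(sc)
      println(s"before shuffle: $before") // both keys appear on both partitions
      println(s"after shuffle:  $after")  // each key lives on exactly one partition
    } finally sc.stop()
  }
}
```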
Although the set of elements in each partition of newly shuffled data will be deterministic, and so is the ordering of partitions themselves, the ordering of these elements is not. If one desires predictably ordered data following shuffle then it’s possible to use:
mapPartitions to sort each partition using, for example, .sorted
repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously repartitioning
sortBy to make a globally ordered RDD
Operations which can cause a shuffle include repartition operations like repartition and coalesce, ‘ByKey operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.
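The three ordering options can be sketched as follows. This is an illustration under assumptions: the object name OrderingAfterShuffle, the helper names, and the choice of HashPartitioner are not from the original notes.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, for illustration only.
object OrderingAfterShuffle {
  // Sorts the elements inside each partition; no data moves between partitions.
  def sortEachPartition(rdd: RDD[Int]): RDD[Int] =
    rdd.mapPartitions(it => it.toArray.sorted.iterator, preservesPartitioning = true)

  // Repartitions by key and sorts by key inside each new partition in one shuffle.
  def repartitionSorted(pairs: RDD[(Int, String)], numPartitions: Int): RDD[(Int, String)] =
    pairs.repartitionAndSortWithinPartitions(new HashPartitioner(numPartitions))

  // Produces a total order across the whole RDD.
  def globallySorted(rdd: RDD[Int]): RDD[Int] = rdd.sortBy(x => x)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ordering-sketch")
    val sc = new SparkContext(conf)
    try {
      val nums = sc.parallelize(Seq(4, 3, 2, 1), numSlices = 2)
      println(sortEachPartition(nums).glom().collect().map(_.mkString(",")).mkString(" | "))
      println(globallySorted(nums).collect().mkString(","))

      val pairs = sc.parallelize(Seq(5 -> "e", 2 -> "b", 4 -> "d", 1 -> "a", 3 -> "c"), 2)
      println(repartitionSorted(pairs, 2).glom().collect().map(_.mkString(",")).mkString(" | "))
    } finally sc.stop()
  }
}
```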
In other words: after a shuffle, the set of elements in each partition is deterministic and so is the order of the partitions, but the order of the elements inside a partition is not. For predictably ordered data after a shuffle, use one of the operations listed above.
The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark’s map and reduce operations.
Internally, results from individual map tasks are kept in memory until they can’t fit. Then, these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.
Certain shuffle operations can consume significant amounts of heap memory since they employ in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey create these structures on the map side, and 'ByKey operations generate these on the reduce side. When data does not fit in memory Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.
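As an illustration of a map-side structure, the sketch below uses aggregateByKey to compute a per-key average: each map task folds its values into a small (sum, count) pair per key before anything is transferred. The object name AggregateByKeySketch, the helper averages, and the sample data are assumptions for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, for illustration only.
object AggregateByKeySketch {
  // Average value per key, built from per-key (sum, count) accumulators.
  def averages(sc: SparkContext, scores: Seq[(String, Double)]): Map[String, Double] =
    sc.parallelize(scores, numSlices = 2)
      .aggregateByKey((0.0, 0))(
        // Map side: fold one value into this partition's (sum, count) for the key.
        (acc, value) => (acc._1 + value, acc._2 + 1),
        // Reduce side: merge the (sum, count) pairs that arrive from different partitions.
        (left, right) => (left._1 + right._1, left._2 + right._2))
      .mapValues { case (sum, count) => sum / count }
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("aggregateByKey-sketch")
    val sc = new SparkContext(conf)
    try {
      averages(sc, Seq("a" -> 1.0, "a" -> 3.0, "b" -> 4.0)).foreach(println)
    } finally sc.stop()
  }
}
```

Because only one small pair per key and partition crosses the network, this kind of aggregation usually moves less data than collecting every value for a key first, at the cost of the in-memory tables mentioned above.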
Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don’t need to be re-created if the lineage is re-computed. Garbage collection may happen only after a long period of time, if the application retains references to these RDDs or if GC does not kick in frequently. This means that long-running Spark jobs may consume a large amount of disk space. The temporary storage directory is specified by the spark.local.dir configuration parameter when configuring the Spark context.
Shuffle behavior can be tuned by adjusting a variety of configuration parameters. See the ‘Shuffle Behavior’ section within the Spark Configuration Guide.
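In other words: the shuffle is expensive because it involves disk I/O, data serialization, and network I/O. To organize the data, Spark generates a set of map tasks, plus a set of reduce tasks to aggregate it. The terms come from MapReduce and are not directly related to Spark's own map and reduce operations.
Internally, each map task keeps its results in memory until they no longer fit. The results are then sorted by target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.
Some shuffle operations consume a lot of heap memory because they use in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey build these structures on the map side, and 'ByKey operations build them on the reduce side. When the data does not fit in memory, Spark spills it to disk, which adds disk I/O and garbage-collection overhead.
The shuffle also produces many intermediate files on disk. As of Spark 1.3, these files are kept until the corresponding RDDs are no longer used and have been garbage collected, so they need not be re-created if the lineage is recomputed. Garbage collection may happen only after a long time if the application keeps references to these RDDs or if GC runs infrequently, so a long-running Spark job may use a lot of disk space. The temporary storage directory is set with the spark.local.dir property when configuring the SparkContext.
Shuffle behavior can be tuned through several configuration parameters, described in the 'Shuffle Behavior' section of the Spark Configuration Guide.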
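A minimal sketch of setting shuffle-related properties when building the SparkContext. The scratch path and the chosen values are placeholders, not recommendations; spark.shuffle.compress and spark.shuffle.file.buffer are two of the properties listed under 'Shuffle Behavior' in the configuration guide, and the object name ShuffleConfigSketch is made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object, for illustration only.
object ShuffleConfigSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("shuffle-config-sketch")
      // Scratch space for shuffle files and spills (placeholder path).
      .set("spark.local.dir", "/tmp/spark-scratch")
      // Compress map output files (placeholder value).
      .set("spark.shuffle.compress", "true")
      // In-memory buffer per shuffle file output stream (placeholder value).
      .set("spark.shuffle.file.buffer", "64k")

    val sc = new SparkContext(conf)
    try {
      println(sc.getConf.get("spark.local.dir"))
    } finally sc.stop()
  }
}
```

On a cluster, the scratch directory may instead be dictated by the cluster manager's environment, so the value set here is best treated as a default for local runs.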