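Many blog posts brush past this topic with a line like "this is a Spark feature, lazy evaluation" and leave it there, and then the posts copy the same few sentences from one another. So I wanted to pull the material together and explain it properly; I hope it helps anyone who needs it.
An RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark: an immutable, partitionable collection whose elements can be computed in parallel. So why "resilient distributed"? "Distributed" is easy to understand and I won't cover it here; but what does "resilient" mean?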
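6. Elastic scheduling: DAG and task scheduling are independent of resource management.
RDD transformations are further divided into operations with narrow dependencies and operations with wide dependencies: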
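val lines = spark.textFile("hdfs://...") // Line 1: defines an RDD (a collection of text lines) backed by an HDFS file
val errors = lines.filter(_.startsWith("ERROR")) // Line 2: derives a filtered RDD
errors.cache() // Line 3: asks for errors to be cached
// Note: in Scala syntax, the argument to filter is a closure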
// Count errors mentioning MySQL:
// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
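After the first action involving errors runs, Spark caches the partitions of errors in memory, which greatly speeds up later computations on it. Note that the base RDD, lines, is not cached. That is desirable, because the error messages may be only a small fraction of the original dataset (small enough to fit in memory).
Finally, to illustrate the model's fault tolerance, Figure 1 shows the lineage graph for the third query. A filter on the lines RDD produces errors; a further filter and a map produce a new RDD, and collect is called on that RDD. The Spark scheduler pipelines the latter two transformations and sends a set of tasks to the nodes holding the cached partitions of errors. In addition, if a partition of errors is lost, Spark rebuilds it by applying the filter to only the corresponding partition of lines.
Figure 1: Lineage graph for the third query in the example. (Boxes represent RDDs; arrows represent transformations.)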
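The caching behaviour described here can be mimicked in plain Python. This is only a toy model, not Spark's implementation: the `LazyDataset` class, its `compute_count` field, and the sample log lines are all made up for illustration. It shows that `cache()` computes nothing by itself, that the first action materializes and stores `errors`, and that the second action reuses the stored copy while `lines` is never cached.

```python
class LazyDataset:
    """Toy stand-in for an RDD: computed on demand, optionally cached."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument function returning a list
        self._cache_requested = False
        self._cached = None
        self.compute_count = 0         # times the data was really computed

    def cache(self):
        # Only marks the dataset for caching; nothing is computed here.
        self._cache_requested = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        self.compute_count += 1
        data = self._compute()
        if self._cache_requested:
            self._cached = data
        return data

    def filter(self, pred):
        # Transformation: returns a new lazy dataset, runs nothing yet.
        return LazyDataset(lambda: [x for x in self._materialize() if pred(x)])

    def count(self):
        # Action: forces computation.
        return len(self._materialize())


# Made-up sample log lines (an assumption for this sketch).
lines = LazyDataset(lambda: ["ERROR MySQL down", "INFO ok", "ERROR HDFS full"])
errors = lines.filter(lambda s: s.startswith("ERROR")).cache()

mysql_count = errors.filter(lambda s: "MySQL" in s).count()  # computes and caches errors
hdfs_count = errors.filter(lambda s: "HDFS" in s).count()    # reuses the cached errors
```

After both queries, `errors.compute_count` is 1 even though two actions ran, which is the speed-up the paragraph refers to.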
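The recovery step can be sketched in a few lines of plain Python. This is a simplified model under stated assumptions, not how Spark is implemented: the partition contents, the `errors_cache` dict, and the helper names are hypothetical. The point is that rebuilding a lost `errors` partition needs only the matching `lines` partition, because filter is a narrow dependency.

```python
# Parent partitions of `lines` (assumed sample data, one list per partition).
lines_partitions = [
    ["ERROR a", "INFO b"],
    ["INFO c", "ERROR d", "ERROR e"],
]


def is_error(line):
    return line.startswith("ERROR")


def build_errors_partition(i):
    # Lineage of errors partition i: a filter over lines partition i only.
    return [line for line in lines_partitions[i] if is_error(line)]


# Materialize and "cache" every errors partition.
errors_cache = {i: build_errors_partition(i) for i in range(len(lines_partitions))}

# Simulate losing one cached partition, for example because a node failed.
del errors_cache[1]


def get_errors_partition(i):
    # Recompute from lineage only when the cached copy is missing.
    if i not in errors_cache:
        errors_cache[i] = build_errors_partition(i)
    return errors_cache[i]


recovered = get_errors_partition(1)
```

Partition 0 is never touched during recovery; only the lost partition is recomputed.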
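Taken literally, it means that Spark does not compute any data until an action is invoked. (What is an action? I won't go into detail here. In one sentence: Spark operators come in two kinds, transformations and actions. Transformations are the familiar map, union, flatMap, groupByKey, join and so on, which return a new RDD rather than a result. Operators such as collect, count and reduce, which have to bring a result back, are actions. Roughly speaking, the number of action calls is the number of jobs submitted. For details see: "Spark Transform and Action operators: a list with examples".)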
>>> rdd_ = sc.parallelize([1,2,3,4,33,2,44,1,44,3,2,2,2,1,33,6],4)
>>> rdd_.getNumPartitions()
>>> rdd_transform = rdd_.filter(lambda x:x > 2).map(lambda x:(x,1)).reduceByKey(lambda x,y:x+y)
>>> rdd_transform
PythonRDD[104] at RDD at PythonRDD.scala:48
At this point no job shows up in the Spark UI: the chain contains no action, so nothing has actually been computed. This is lazy evaluation.
>>> rdd_transform.collect()
[(4, 1), (44, 2), (33, 2), (6, 1), (3, 2)]
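To make the session above concrete without a cluster, here is a minimal pure-Python sketch of the same idea. `ToyRDD` and its `jobs_submitted` counter are invented for this illustration and are not PySpark classes; the sketch ignores partitions entirely. Transformations only record what to do, and the recorded chain runs when `collect()` is called.

```python
class ToyRDD:
    jobs_submitted = 0  # stands in for the job list in the Spark UI

    def __init__(self, data, ops=()):
        self._data = data
        self._ops = tuple(ops)  # recorded transformations, not yet executed

    def filter(self, f):
        return ToyRDD(self._data, self._ops + (("filter", f),))

    def map(self, f):
        return ToyRDD(self._data, self._ops + (("map", f),))

    def reduceByKey(self, f):
        return ToyRDD(self._data, self._ops + (("reduceByKey", f),))

    def collect(self):
        # Action: only here is the recorded chain actually run.
        ToyRDD.jobs_submitted += 1
        items = list(self._data)
        for kind, f in self._ops:
            if kind == "filter":
                items = [x for x in items if f(x)]
            elif kind == "map":
                items = [f(x) for x in items]
            else:  # reduceByKey: merge the values of equal keys with f
                acc = {}
                for k, v in items:
                    acc[k] = f(acc[k], v) if k in acc else v
                items = list(acc.items())
        return items


rdd = ToyRDD([1, 2, 3, 4, 33, 2, 44, 1, 44, 3, 2, 2, 2, 1, 33, 6])
transformed = (rdd.filter(lambda x: x > 2)
                  .map(lambda x: (x, 1))
                  .reduceByKey(lambda x, y: x + y))
jobs_before = ToyRDD.jobs_submitted   # no job yet: nothing has run

result = transformed.collect()
jobs_after = ToyRDD.jobs_submitted    # exactly one job after the action
```

The pairs in `result` match the ones in the real session, although the ordering may differ because this sketch has no partitions.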
in high school, and your mom came in to ask you to do a chore (“fetch me some milk for tonight’s dinner”). Your response: say that you were going to do it, then keep right on doing what you were already doing. Sometimes your mom would come back in and say she didn’t need the chore done after all (“I substituted water instead”). Magic, work saved! Sometimes the laziest finish first.
Spark is the same. It waits until you’re done giving it operators, and only when you ask it to give you the final answer does it evaluate, and it always looks to limit how much work it has to do. Suppose you first ask Spark to filter a petabyte of data for something—say, find you all the point of sale records for the Chicago store—then next you ask for it to give you just the first result that comes back. This is a really common thing to do. Sometimes a data analyst just wants to see a typical record for the Chicago store. If Spark were to run things explicitly as you gave it instructions, it would load the entire file, then filter for all the Chicago records, then once it had all those, pick out just the first line for you. That’s a huge waste of time and resources. Spark will instead wait to see the full list of instructions, and understand the entire chain as a whole. If you only wanted the first line that matches the filter, then Spark will just find the first Chicago POS record, then it will emit that as the answer, and stop. It’s much easier than first filtering everything, then picking out only the first line.
Now, you could write your MapReduce jobs more intelligently to similarly avoid over-using resource, but it’s much more difficult to do that. Spark makes this happen automatically for you. Normally, software like Hive goes into contortions to avoid running too many MapReduce jobs, and programmers write very complex and hard-to-read code to force as much as possible into each Map and Reduce job. This makes development hard, and makes the code hard to maintain over time. By using Spark instead, you can write code that describes how you want to process data, not how you want the execution to run, and then Spark “does the right thing” on your behalf to run it as efficiently as possible. This is the same thing a good high-level programming language does: it raises the abstraction layer, letting the developer talk more powerfully and expressively, and does the work behind the scenes to ensure it runs as fast as possible.
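To paraphrase one of the cases: suppose the requirement is to load one day's order records. Without lazy evaluation, Spark would immediately load that day's orders into memory (spilling to disk if they don't fit). Then the requirement changes: you only want to see what a typical order looks like, so a single record is enough, and Spark returns one record from the data it has already loaded. That is very inefficient, because what you really wanted was just the first order record. If we treat the first request (get all the orders) and the second request (take one of them) as two steps chained together, and note that with more operations the chain becomes a DAG, then Spark can look at the whole chain and the final request before planning the work, and satisfy the request with as few resources as possible. That is the benefit of laziness: rather than executing each step immediately, Spark sees the big picture first, optimizes at the end, and only then runs the computation. MapReduce can achieve the same thing, but at a higher cost to the developer, who must write code to avoid the wasted work, whereas Spark does the optimization automatically.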
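The "filter, then take the first record" case can be imitated with Python generators. This is an analogy only, not Spark's mechanism: `read_orders`, the `stats` counter, and the rule for which record belongs to the Chicago store are all assumptions made up for the sketch. It shows how deferring work until the final request is known lets the scan stop at the first match.

```python
stats = {"records_read": 0}


def read_orders(n):
    # Stand-in for scanning a large file; counts how many records are touched.
    for i in range(n):
        stats["records_read"] += 1
        yield {"id": i, "store": "Chicago" if i % 5 == 3 else "Other"}


# "Transformation": a lazy filter. Building it reads nothing.
chicago = (r for r in read_orders(1_000_000) if r["store"] == "Chicago")
read_after_filter = stats["records_read"]

# "Action": asking for the first match pulls only as many records as needed.
first = next(chicago)
read_after_first = stats["records_read"]
```

An eager version would touch all one million records before returning the same answer; here the scan stops as soon as the first Chicago record is found.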
