Resilient Distributed Datasets (RDDs) can be understood on three levels:
The core of Spark's data storage is the Resilient Distributed Dataset (RDD). You can think of an RDD, roughly, as one big abstract array, except that the array is distributed across the cluster; logically, each slice of an RDD is called a Partition.
Physically, an RDD object is essentially a metadata structure that records the mapping between Blocks and Nodes, along with other metadata. An RDD is a collection of partitions; each Partition corresponds to one Block. A Block is kept in memory, and is spilled to disk when memory runs short.
As the figure below shows, there are two RDDs: RDD1 has three partitions, stored in memory on Node1, Node2, and Node3; RDD2 also has three partitions, with partitions p1 and p2 stored in memory on Node1 and Node2, and partition p3 stored on disk on Node3.
An RDD's data source can also reside on HDFS, in which case the data is partitioned according to HDFS's distribution strategy, and one HDFS Block corresponds to one Partition of the Spark RDD.
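A minimal spark-shell sketch (not from the original text; the element count, partition count, and storage level are chosen for illustration) that makes the partition and memory/disk behaviour concrete:

import org.apache.spark.storage.StorageLevel

// Create an RDD with 4 partitions requested explicitly.
val rdd = sc.parallelize(1 to 1000, 4)
println(rdd.getNumPartitions)              // => 4

// Ask Spark to keep the blocks in memory and spill to disk when memory is short,
// mirroring the memory/disk behaviour described above.
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()                                // an action materializes (and caches) the blocks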
(1) RDDs support two broad classes of basic operations, Transformations and Actions: a transformation derives a new RDD from an existing one, while an action returns a result to the driver program or writes data to storage.
(3) Lazy Execution: a transformation only records the lineage by which a new RDD is derived; no computation is performed until an action is triggered, as the short sketch below illustrates.
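A minimal spark-shell sketch (values are illustrative) of lazy execution: defining a transformation runs nothing; the action at the end triggers the whole lineage.

// map only records a transformation in the lineage; nothing is computed yet.
val numbers = sc.parallelize(1 to 10)
val doubled = numbers.map(_ * 2)

println(doubled.toDebugString)      // shows the lineage without running a job

// The action below is what actually launches a job and evaluates the map.
val total = doubled.reduce(_ + _)   // => 110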
(1) Code
[root@master ~]# spark-shell
17/09/06 03:36:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/09/06 03:36:39 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.1.180:4040
Spark context available as 'sc' (master = local[*], app id = local-1504683394043).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val rdd1=sc.parallelize(1 to 100,5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val rdd2=rdd1.map(_+1)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:26
scala> rdd2.take(2)
res0: Array[Int] = Array(2, 3)
scala> rdd2.count
res1: Long = 100
scala>
(2) Program notes
The message Spark context available as 'sc' means that spark-shell has already initialized the SparkContext class as the object sc by default, so sc can be used directly inside spark-shell.
(1) Code
scala> val listRdd=sc.parallelize(List(1,2,3),3)
listRdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> val squares=listRdd.map(x=>x*x)
squares: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[4] at map at <console>:26
scala> squares.take(3)
res3: Array[Int] = Array(1, 4, 9)
scala> val even=squares.filter(_%2==0)
even: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[5] at filter at <console>:28
scala> squares.first
res4: Int = 1
scala> even.first
res5: Int = 4
scala> val nums=sc.parallelize(1 to 3)
nums: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24
scala> val mapRdd=nums.flatMap(x=>1 to x)
mapRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[7] at flatMap at <console>:26
scala> mapRdd.count
res6: Long = 6
scala> mapRdd.take(6)
res7: Array[Int] = Array(1, 1, 2, 1, 2, 3)
scala>
(2) Program notes
map(x=>x*x) applies the function to every element and returns a new RDD with exactly one output element per input element; filter(_%2==0) keeps only the elements for which the predicate holds; flatMap(x=>1 to x) maps each element to zero or more elements and flattens the results into a single RDD, which is why mapRdd contains six elements. The contrast between map and flatMap is sketched below.
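A minimal sketch (same data as above) making the map/flatMap contrast explicit:

val nums = sc.parallelize(1 to 3)

// map is one-to-one: each element becomes exactly one output element (here, a small Range).
val nested = nums.map(x => 1 to x)        // 3 elements, each a collection
// flatMap flattens the per-element collections into a single RDD of Ints.
val flat   = nums.flatMap(x => 1 to x)    // 6 elements: 1, 1, 2, 1, 2, 3

println(nested.count())                   // => 3
println(flat.count())                     // => 6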
(1) Code
scala> val pets=sc.parallelize(List( ("cat",1),("dog",1),("cat",2) ))
pets: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[8] at parallelize at <console>:24
scala> val pets2=pets.reduceByKey(_+_)
pets2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[9] at reduceByKey at <console>:26
scala> pets2.count
res8: Long = 2
scala> pets2.take(2)
res10: Array[(String, Int)] = Array((dog,1), (cat,3))
scala> val pets3=pets.groupByKey()
pets3: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[10] at groupByKey at <console>:26
scala> pets3.count
res11: Long = 2
scala> pets3.take(2)
res12: Array[(String, Iterable[Int])] = Array((dog,CompactBuffer(1)), (cat,CompactBuffer(1, 2)))
scala> val pets4=pets.sortByKey()
pets4: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[11] at sortByKey at <console>:26
scala> pets4.take(3)
res14: Array[(String, Int)] = Array((cat,1), (cat,2), (dog,1))
scala>
(2) Program notes
reduceByKey(_+_) merges the multiple values associated with each key; the merge is performed locally on the map side first (an automatic combine) before the data is shuffled.
groupByKey() gathers all the values for each key into a single sequence, with no map-side merging.
sortByKey() sorts the records by key.
The practical difference between reduceByKey and groupByKey is sketched below.
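A minimal sketch (reusing the pets data above) showing that both approaches give the same per-key sums, while reduceByKey pre-aggregates inside each partition before the shuffle and groupByKey ships every value across the network:

val pairs = sc.parallelize(Seq(("cat", 1), ("dog", 1), ("cat", 2)))

// reduceByKey combines values per key inside each partition before shuffling.
val sums1 = pairs.reduceByKey(_ + _)               // (dog,1), (cat,3)

// groupByKey shuffles every (key, value) pair, and the summing happens afterwards.
val sums2 = pairs.groupByKey().mapValues(_.sum)    // (dog,1), (cat,3)

println(sums1.collect().mkString(", "))
println(sums2.collect().mkString(", "))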
WordCount is the "Hello World" of big data processing; let's look at how Spark implements it.
(1) Prepare the data
[root@master ~]# mkdir data
[root@master ~]# vi data/words
[root@master ~]# cat data/words
hi hello
how do you do?
hello, Spark!
hello, Scala!
[root@master ~]#
(2) Transformations
scala> val rdd=sc.textFile("file:///root/data/words")
rdd: org.apache.spark.rdd.RDD[String] = file:///root/data/words MapPartitionsRDD[3] at textFile at <console>:24
scala> val mapRdd=rdd.flatMap(_.split(" "))
mapRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at flatMap at <console>:26
scala> mapRdd.first
res2: String = hi
scala> val kvRdd=mapRdd.map(x=>(x,1))
kvRdd: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[5] at map at <console>:28
scala> kvRdd.first
res3: (String, Int) = (hi,1)
scala> kvRdd.take(2)
res4: Array[(String, Int)] = Array((hi,1), (hello,1))
scala> val rsRdd=kvRdd.reduceByKey(_+_)
rsRdd: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[6] at reduceByKey at <console>:30
scala> rsRdd.take(2)
res5: Array[(String, Int)] = Array((how,1), (do?,1))
scala> rsRdd.saveAsTextFile("file:///tmp/output")
scala>
Program notes:
rdd.flatMap(_.split(" ")) splits every element of the RDD (each line of the file) on spaces and produces a new RDD.
mapRdd.map(x=>(x,1)) turns every element x into a key-value pair (x,1) and produces a new RDD.
kvRdd.reduceByKey(_+_) merges the multiple values associated with each key; most importantly, it can merge locally first (on the map side), and the merge function is user-defined, here simply adding the values.
The same pipeline can also be written as a single chained expression, as sketched below.
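A minimal sketch (same input path as above) of the word count written as one chain; nothing executes until the action at the end:

// The whole pipeline as one chained expression; only the final action triggers a job.
val counts = sc.textFile("file:///root/data/words")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.collect().foreach(println)   // or counts.saveAsTextFile(...) to write the files out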
(3) View the results
[root@master ~]# ll /tmp/output
total 8
-rw-r--r-- 1 root root 48 Sep 6 03:51 part-00000
-rw-r--r-- 1 root root 33 Sep 6 03:51 part-00001
-rw-r--r-- 1 root root 0 Sep 6 03:51 _SUCCESS
[root@master ~]# cat /tmp/output/part-00000
(how,1)
(do?,1)
(hello,,2)
(hello,1)
(Spark!,1)
[root@master ~]# cat /tmp/output/part-00001
(you,1)
(Scala!,1)
(hi,1)
(do,1)
[root@master ~]#
The basic flow of Spark program design
1) Create a SparkContext object
Each Spark application has exactly one SparkContext object, which encapsulates the information about the Spark execution environment.
2) Create RDDs
RDDs can be created from Scala collections or from Hadoop datasets.
3) Apply transformations and actions to the RDDs
MapReduce provides only the map and reduce operations, whereas Spark offers a rich set of transformation and action functions.
4) Return the results
Save them to HDFS, or print them directly. A complete sketch that follows these four steps is given after this list.
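A minimal standalone sketch (the object name, master setting, and output path are illustrative, not taken from the text) that walks through the four steps above; inside spark-shell, step 1 has already been done for you as sc:

import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    // 1) Create the SparkContext: one per application, encapsulating the execution environment.
    val conf = new SparkConf().setAppName("WordCountApp").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // 2) Create an RDD, here from a local text file (sc.parallelize on a Scala collection also works).
    val lines = sc.textFile("file:///root/data/words")

    // 3) Apply transformations and an action.
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    // 4) Return the results: save them to a file system (HDFS or local) or print them.
    counts.saveAsTextFile("file:///tmp/output-app")   // illustrative output path
    counts.collect().foreach(println)

    sc.stop()
  }
}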