Topics:
1. The several ways to create an RDD;
2. Hands-on RDD creation;
3. The internals of RDD creation
========== The Several Ways to Create an RDD ==========
Why are there several ways to create an RDD?
Because Spark can compute over different storage media.
Does Spark depend on Hadoop?
Not inherently; they are only related when Spark runs on top of Hadoop and uses Hadoop as its data source.
If you only care about the computation itself, there is no need to learn Hadoop at all.
Spark can run on top of Hadoop, on other distributed file systems, or locally.
The first RDD represents the source of a Spark application's input data. Only after the first RDD has been created can you apply transformations to turn it into other RDDs and implement your algorithm.
There are basically three ways to create an RDD (in practice there are far more, easily hundreds):
1. From a collection in the program;
2. From the local file system;
3. From HDFS.
Other common ways include:
4. From a database (Oracle, MySQL);
5. From NoSQL stores (HBase);
6. From S3;
7. From a data stream.
These are the seven main ways of creating an RDD.
This lesson demonstrates the first three.
What is the practical value of creating an RDD from a collection? Testing!
What is the main use of creating an RDD from the local file system? A collection holds only a limited amount of data, whereas a file lets you test against a large volume of data.
Creating an RDD from HDFS is the most common way of creating RDDs in production (a short sketch follows below).
Hadoop + Spark is the most promising combination in the big data field.
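Method 3 (HDFS) is only exercised indirectly later in this lesson, so here is a minimal hedged sketch from spark-shell; the hdfs:// URI, host name, port, and path are placeholders for your own NameNode and file.
scala> // placeholder URI: replace Master:9000 and the path with your own NameNode address and file
scala> val lines = sc.textFile("hdfs://Master:9000/data/README.md")
scala> println("Number of lines in the file = " + lines.count())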
~~~ 1. Creating an RDD from a Collection ~~~
package com.dt.spark.cores

import org.apache.spark.{SparkContext, SparkConf}

/**
 * Created by 威 on 2016/2/6.
 */
object RDDBaseOnCollection {
  def main(args: Array[String]) {
    /**
     * 1. Create a Scala collection and turn it into an RDD
     */
    val conf = new SparkConf()              // create the SparkConf object
    conf.setAppName("RDDBaseOnCollection")  // set the application name, shown in the monitoring UI while the program runs
    conf.setMaster("local")                 // run locally; no Spark cluster installation is needed
    val sc = new SparkContext(conf)         // create the SparkContext, passing in the SparkConf instance to customize Spark's runtime parameters and configuration
    val numbers = 1 to 100
    val rdd = sc.parallelize(numbers)
    val sum = rdd.reduce(_ + _)             // 1+2=3, 3+3=6, 6+4=10, ...
    println("1+2+...+99+100=" + sum)
  }
}
Result:
16/02/06 14:18:01 INFO DAGScheduler: Job 0 finished: reduce at RDDBaseOnCollection.scala:21, took 1.192071 s
1+2+...+99+100=5050
16/02/06 14:18:01 INFO SparkContext: Invoking stop() from shutdown hook
This verifies that Spark can also be used as a single-machine data-processing program: you can use Spark on smart devices such as phones, tablets, and TVs, as well as on PCs and servers, so Spark can run on virtually any device.
On a single machine, multiple threads can be used to simulate distributed execution, as the sketch below illustrates.
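A minimal hedged sketch of the thread-based simulation: the master URL controls how many worker threads local mode uses (the value 4 below is just an example).
val conf = new SparkConf().setAppName("RDDBaseOnCollection")
// "local"    -> a single worker thread
// "local[4]" -> four worker threads simulating four cores (4 is only an example)
// "local[*]" -> as many worker threads as there are logical cores on the machine
conf.setMaster("local[*]")
val sc = new SparkContext(conf)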
In local mode a failed task simply fails: MAX_LOCAL_TASK_FAILURES is the maximum number of task failures allowed, and it is hard-coded to 1, so tasks are not re-executed on failure. The relevant source is SparkContext.createTaskScheduler:
private def createTaskScheduler(
    sc: SparkContext,
    master: String): (SchedulerBackend, TaskScheduler) = {
  import SparkMasterRegex._
  // When running locally, don't try to re-execute tasks on failure.
  val MAX_LOCAL_TASK_FAILURES = 1

  master match {
    case "local" =>
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalBackend(sc.getConf, scheduler, 1)
      scheduler.initialize(backend)
      (backend, scheduler)

    case LOCAL_N_REGEX(threads) =>
      def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
      // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
      val threadCount = if (threads == "*") localCpuCount else threads.toInt
      if (threadCount <= 0) {
        // ... (excerpt truncated; the remaining cases are omitted)
Running the same code on the cluster:
./spark-shell --master spark://Master:7077,Worker1:7077,Worker2:7077
scala> val numbers = 1 to 100
numbers: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)
scala> val rdd = sc.parallelize(numbers)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:29
scala> val sum = rdd.reduce(_+_)
16/02/06 14:42:06 INFO spark.SparkContext: Starting job: reduce at <console>:31
16/02/06 14:42:06 INFO scheduler.DAGScheduler: Got job 0 (reduce at <console>:31) with 24 output partitions
16/02/06 14:42:06 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (reduce at <console>:31)
16/02/06 14:42:06 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/02/06 14:42:06 INFO scheduler.DAGScheduler: Missing parents: List()
16/02/06 14:42:06 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (ParallelCollectionRDD[0] at parallelize at <console>:29), which has no missing parents
16/02/06 14:42:07 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1216.0 B, free 1216.0 B)
16/02/06 14:42:07 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 858.0 B, free 2.0 KB)
16/02/06 14:42:07 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.145.131:54442 (size: 858.0 B, free: 1247.2 MB)
16/02/06 14:42:07 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/02/06 14:42:07 INFO scheduler.DAGScheduler: Submitting 24 missing tasks from ResultStage 0 (ParallelCollectionRDD[0] at parallelize at <console>:29)
16/02/06 14:42:07 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 24 tasks
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, Worker2, partition 0,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, Master, partition 1,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, Worker1, partition 2,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, Worker2, partition 3,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, Master, partition 4,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, Worker1, partition 5,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, Worker2, partition 6,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, Master, partition 7,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, Worker1, partition 8,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 9.0 in stage 0.0 (TID 9, Worker2, partition 9,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 10.0 in stage 0.0 (TID 10, Master, partition 10,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 11.0 in stage 0.0 (TID 11, Worker1, partition 11,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 12.0 in stage 0.0 (TID 12, Worker2, partition 12,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 13.0 in stage 0.0 (TID 13, Master, partition 13,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 14.0 in stage 0.0 (TID 14, Worker1, partition 14,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 15.0 in stage 0.0 (TID 15, Worker2, partition 15,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 16.0 in stage 0.0 (TID 16, Master, partition 16,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 17.0 in stage 0.0 (TID 17, Worker1, partition 17,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 18.0 in stage 0.0 (TID 18, Worker2, partition 18,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 19.0 in stage 0.0 (TID 19, Master, partition 19,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 20.0 in stage 0.0 (TID 20, Worker1, partition 20,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 21.0 in stage 0.0 (TID 21, Worker2, partition 21,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 22.0 in stage 0.0 (TID 22, Master, partition 22,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:42:07 INFO scheduler.TaskSetManager: Starting task 23.0 in stage 0.0 (TID 23, Worker1, partition 23,PROCESS_LOCAL, 2135 bytes)
16/02/06 14:42:08 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on Master:57909 (size: 858.0 B, free: 511.1 MB)
16/02/06 14:42:08 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on Worker1:55725 (size: 858.0 B, free: 511.1 MB)
16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 2201 ms on Master (1/24)
16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 22.0 in stage 0.0 (TID 22) in 2325 ms on Master (2/24)
16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 16.0 in stage 0.0 (TID 16) in 2330 ms on Master (3/24)
16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 19.0 in stage 0.0 (TID 19) in 2332 ms on Master (4/24)
16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 2349 ms on Master (5/24)
16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 10.0 in stage 0.0 (TID 10) in 2349 ms on Master (6/24)
16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 2354 ms on Master (7/24)
16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 13.0 in stage 0.0 (TID 13) in 2347 ms on Master (8/24)
16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 14.0 in stage 0.0 (TID 14) in 2598 ms on Worker1 (9/24)
16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 2606 ms on Worker1 (10/24)
16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 23.0 in stage 0.0 (TID 23) in 2593 ms on Worker1 (11/24)
16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 17.0 in stage 0.0 (TID 17) in 2598 ms on Worker1 (12/24)
16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 20.0 in stage 0.0 (TID 20) in 2606 ms on Worker1 (13/24)
16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 8.0 in stage 0.0 (TID 8) in 2620 ms on Worker1 (14/24)
16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 11.0 in stage 0.0 (TID 11) in 2621 ms on Worker1 (15/24)
16/02/06 14:42:09 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 2629 ms on Worker1 (16/24)
16/02/06 14:42:09 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on Worker2:53946 (size: 858.0 B, free: 511.1 MB)
16/02/06 14:42:10 INFO scheduler.TaskSetManager: Finished task 21.0 in stage 0.0 (TID 21) in 3189 ms on Worker2 (17/24)
16/02/06 14:42:10 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 3206 ms on Worker2 (18/24)
16/02/06 14:42:10 INFO scheduler.TaskSetManager: Finished task 18.0 in stage 0.0 (TID 18) in 3196 ms on Worker2 (19/24)
16/02/06 14:42:10 INFO scheduler.TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 3212 ms on Worker2 (20/24)
16/02/06 14:42:10 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 3265 ms on Worker2 (21/24)
16/02/06 14:42:10 INFO scheduler.TaskSetManager: Finished task 15.0 in stage 0.0 (TID 15) in 3208 ms on Worker2 (22/24)
16/02/06 14:42:10 INFO scheduler.TaskSetManager: Finished task 12.0 in stage 0.0 (TID 12) in 3211 ms on Worker2 (23/24)
16/02/06 14:42:10 INFO scheduler.TaskSetManager: Finished task 9.0 in stage 0.0 (TID 9) in 3218 ms on Worker2 (24/24)
16/02/06 14:42:10 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at <console>:31) finished in 3.271 s
16/02/06 14:42:10 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/02/06 14:42:10 INFO scheduler.DAGScheduler: Job 0 finished: reduce at <console>:31, took 4.025308 s
sum: Int = 5050
24 cores were used here: there are 3 machines with 8 cores each, and when you do not specify how many cores to use, Spark takes them all, as the web UI shows. Spark will make maximal use of your cores, but left unconfigured it can also consume a great deal of memory.
There is only one Stage: reduce is an action, it does not produce a new RDD and involves no Shuffle, so there is just this single Stage, with 24 tasks in total.
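A minimal hedged sketch of confirming this from the shell: toDebugString prints the lineage (a single ParallelCollectionRDD, hence a single Stage), and partitions.length reports the 24 partitions behind the 24 tasks.
scala> rdd.toDebugString        // single-level lineage: only the ParallelCollectionRDD, so one Stage
scala> rdd.partitions.length    // 24 partitions here, one task per partition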
You can set the degree of parallelism explicitly via the second parameter of parallelize:
val rdd = sc.parallelize(numbers, 10)
val sum = rdd.reduce(_+_)
scala> val sum = rdd.reduce(_+_)
16/02/06 14:53:27 INFO spark.SparkContext: Starting job: reduce at <console>:31
16/02/06 14:53:27 INFO scheduler.DAGScheduler: Got job 1 (reduce at <console>:31) with 10 output partitions
16/02/06 14:53:27 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (reduce at <console>:31)
16/02/06 14:53:27 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/02/06 14:53:27 INFO scheduler.DAGScheduler: Missing parents: List()
16/02/06 14:53:27 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (ParallelCollectionRDD[1] at parallelize at <console>:29), which has no missing parents
16/02/06 14:53:27 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 1216.0 B, free 3.2 KB)
16/02/06 14:53:27 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 858.0 B, free 4.1 KB)
16/02/06 14:53:27 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.145.131:54442 (size: 858.0 B, free: 1247.2 MB)
16/02/06 14:53:27 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/02/06 14:53:27 INFO scheduler.DAGScheduler: Submitting 10 missing tasks from ResultStage 1 (ParallelCollectionRDD[1] at parallelize at <console>:29)
16/02/06 14:53:27 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 10 tasks
16/02/06 14:53:27 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 24, Master, partition 0,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:53:27 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 25, Worker1, partition 1,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:53:27 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 26, Worker2, partition 2,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:53:27 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 1.0 (TID 27, Master, partition 3,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:53:27 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 1.0 (TID 28, Worker1, partition 4,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:53:27 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 1.0 (TID 29, Worker2, partition 5,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:53:27 INFO scheduler.TaskSetManager: Starting task 6.0 in stage 1.0 (TID 30, Master, partition 6,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:53:27 INFO scheduler.TaskSetManager: Starting task 7.0 in stage 1.0 (TID 31, Worker1, partition 7,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:53:27 INFO scheduler.TaskSetManager: Starting task 8.0 in stage 1.0 (TID 32, Worker2, partition 8,PROCESS_LOCAL, 2078 bytes)
16/02/06 14:53:27 INFO scheduler.TaskSetManager: Starting task 9.0 in stage 1.0 (TID 33, Master, partition 9,PROCESS_LOCAL, 2135 bytes)
16/02/06 14:53:27 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on Worker1:55725 (size: 858.0 B, free: 511.1 MB)
16/02/06 14:53:27 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 25) in 113 ms on Worker1 (1/10)
16/02/06 14:53:27 INFO scheduler.TaskSetManager: Finished task 7.0 in stage 1.0 (TID 31) in 114 ms on Worker1 (2/10)
16/02/06 14:53:27 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 1.0 (TID 28) in 117 ms on Worker1 (3/10)
16/02/06 14:53:27 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on Master:57909 (size: 858.0 B, free: 511.1 MB)
16/02/06 14:53:27 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 1.0 (TID 27) in 185 ms on Master (4/10)
16/02/06 14:53:27 INFO scheduler.TaskSetManager: Finished task 9.0 in stage 1.0 (TID 33) in 186 ms on Master (5/10)
16/02/06 14:53:27 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 24) in 202 ms on Master (6/10)
16/02/06 14:53:27 INFO scheduler.TaskSetManager: Finished task 6.0 in stage 1.0 (TID 30) in 210 ms on Master (7/10)
16/02/06 14:53:28 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on Worker2:53946 (size: 858.0 B, free: 511.1 MB)
16/02/06 14:53:28 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 1.0 (TID 26) in 808 ms on Worker2 (8/10)
16/02/06 14:53:28 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 1.0 (TID 29) in 812 ms on Worker2 (9/10)
16/02/06 14:53:28 INFO scheduler.TaskSetManager: Finished task 8.0 in stage 1.0 (TID 32) in 814 ms on Worker2 (10/10)
16/02/06 14:53:28 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/02/06 14:53:28 INFO scheduler.DAGScheduler: ResultStage 1 (reduce at <console>:31) finished in 0.815 s
16/02/06 14:53:28 INFO scheduler.DAGScheduler: Job 1 finished: reduce at <console>:31, took 0.834312 s
Now there are only 10 parallel units (10 partitions, hence 10 tasks).
So how much parallelism should you actually configure in Spark?
As a rule of thumb, each core can carry 2 to 4 partitions; with 32 cores, somewhere between 64 and 128 partitions is reasonable. This guideline does not depend on the data size; it depends only on how much memory each task's partition consumes and how much CPU time each task needs.
By default, setting the parallelism to 2 to 4 times the number of cores is a good starting point.
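A minimal hedged sketch of applying that rule of thumb when building the SparkConf; the 32-core cluster and the value 64 are assumptions for illustration, and spark.default.parallelism only applies when an operation does not specify its own partition count.
val conf = new SparkConf()
  .setAppName("ParallelismExample")           // hypothetical application name
  .set("spark.default.parallelism", "64")     // ~2x the 32 cores assumed here; tune within 2x~4x of the core count
val sc = new SparkContext(conf)
val rdd = sc.parallelize(1 to 1000000)        // no explicit numSlices, so the default parallelism is used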
Let's look at how these partitions work under the hood.
In IntelliJ IDEA, press Ctrl+Shift+N to search for and open the class ParallelCollectionRDD:
private[spark] class ParallelCollectionRDD[T: ClassTag](
    sc: SparkContext,
    @transient private val data: Seq[T],
    numSlices: Int,
    locationPrefs: Map[Int, Seq[String]])
  extends RDD[T](sc, Nil) {
  // TODO: Right now, each split sends along its full data, even if later down the RDD chain it gets
  // cached. It might be worthwhile to write the data to a file in the DFS and read it in the split
  // instead.
  // UPDATE: A parallel collection can be checkpointed to HDFS, which achieves this goal.
numSlices is the degree of parallelism.
locationPrefs records the preferred locations: when 1 to 100 ran on the 3 machines just now, there were 24 slices, each with a concrete location, managed by the BlockManager.
The DAGScheduler decides at runtime where each task runs.
  override def getPartitions: Array[Partition] = {
    val slices = ParallelCollectionRDD.slice(data, numSlices).toArray
    slices.indices.map(i => new ParallelCollectionPartition(id, i, slices(i))).toArray
  }
slice turns the data into an array of slices: the collection is split into numSlices pieces, and each slice, wrapped in a ParallelCollectionPartition, is just a very basic data structure.
  override def getPreferredLocations(s: Partition): Seq[String] = {
    locationPrefs.getOrElse(s.index, Nil)
  }
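A minimal hedged sketch of seeing those slices from the shell: glom gathers each partition into an array, so you can inspect exactly how 1 to 100 is split (the 4 partitions here are just an example).
scala> val rdd = sc.parallelize(1 to 100, 4)
scala> rdd.glom().collect().foreach(slice => println(slice.mkString(",")))
// prints one contiguous range per partition, e.g. 1..25, 26..50, 51..75, 76..100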
~~~ 2. Creating an RDD from a Local File ~~~
package com.dt.spark.cores

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by 威 on 2016/2/6.
 */
object RDDBaseOnLocalFile {
  def main(args: Array[String]) {
    /**
     * 2. Read a local file and turn it into an RDD
     */
    val conf = new SparkConf()              // create the SparkConf object
    conf.setAppName("RDDBaseOnLocalFile")   // set the application name, shown in the monitoring UI while the program runs
    conf.setMaster("local")                 // run locally; no Spark cluster installation is needed
    val sc = new SparkContext(conf)         // create the SparkContext, passing in the SparkConf instance to customize Spark's runtime parameters and configuration
    val rdd = sc.textFile("F:/安装文件/操作系统/spark-1.6.0-bin-hadoop2.6/README.md")
    // compute the total length of all lines
    val linesLength = rdd.map { line => line.length }
    val sum = linesLength.reduce(_ + _)
    println("The total characters of file is =" + sum)
  }
}
Result:
16/02/06 15:19:43 INFO DAGScheduler: Job 0 finished: reduce at RDDBaseOnLocalFile.scala:23, took 2.369687 s
The total characters of file is =3264
16/02/06 15:19:43 INFO SparkContext: Invoking stop() from shutdown hook
Under the hood, sc.textFile creates a HadoopRDD; its getPartitions shows how the input file is split:
  override def getPartitions: Array[Partition] = {
    val jobConf = getJobConf()
    // add the credentials here as this can be called before SparkContext initialized
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
    val inputSplits = inputFormat.getSplits(jobConf, minPartitions) // logical splits, not physical copies of the data
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
      array(i) = new HadoopPartition(id, i, inputSplits(i))
    }
    array
  }
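A minimal hedged sketch of influencing that split count: textFile's optional second argument is the minimum number of partitions passed to getSplits (the value 4 is only an example; the path is the same HDFS file used below).
scala> val lines = sc.textFile("/historyserverforSpark/README.md", 4)   // ask for at least 4 input splits
scala> lines.partitions.length                                          // the actual number depends on file size and block size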
Then run it on the cluster:
scala> val lines = sc.textFile("/historyserverforSpark/README.md", 1)
16/02/06 15:34:02 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 212.8 KB, free 212.8 KB)
16/02/06 15:34:02 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 19.7 KB, free 232.4 KB)
16/02/06 15:34:02 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.145.131:54442 (size: 19.7 KB, free: 1247.2 MB)
16/02/06 15:34:02 INFO spark.SparkContext: Created broadcast 2 from textFile at <console>:27
lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:27
scala> val linesLength = lines.map{line=>line.length}
linesLength: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[4] at map at <console>:29
scala> val sum = linesLength.reduce(_+_)
16/02/06 15:35:07 INFO mapred.FileInputFormat: Total input paths to process : 1
16/02/06 15:35:08 INFO spark.SparkContext: Starting job: reduce at <console>:31
16/02/06 15:35:08 INFO scheduler.DAGScheduler: Got job 2 (reduce at <console>:31) with 1 output partitions
16/02/06 15:35:08 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 (reduce at <console>:31)
16/02/06 15:35:08 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/02/06 15:35:08 INFO scheduler.DAGScheduler: Missing parents: List()
16/02/06 15:35:08 INFO scheduler.DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[4] at map at <console>:29), which has no missing parents
16/02/06 15:35:08 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.3 KB, free 235.7 KB)
16/02/06 15:35:08 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1964.0 B, free 237.6 KB)
16/02/06 15:35:08 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.145.131:54442 (size: 1964.0 B, free: 1247.2 MB)
16/02/06 15:35:08 INFO spark.SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006
16/02/06 15:35:08 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[4] at map at <console>:29)
16/02/06 15:35:08 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
16/02/06 15:35:09 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 34, Worker1, partition 0,NODE_LOCAL, 2152 bytes)
16/02/06 15:35:09 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on Worker1:55725 (size: 1964.0 B, free: 511.1 MB)
16/02/06 15:35:09 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on Worker1:55725 (size: 19.7 KB, free: 511.1 MB)
16/02/06 15:35:12 INFO scheduler.DAGScheduler: ResultStage 2 (reduce at <console>:31) finished in 3.718 s
16/02/06 15:35:12 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 34) in 3716 ms on Worker1 (1/1)
16/02/06 15:35:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
16/02/06 15:35:12 INFO scheduler.DAGScheduler: Job 2 finished: reduce at <console>:31, took 4.366041 s
sum: Int = 3264
Middleware such as Tachyon can extract data from MySQL or Oracle to build an RDD;
you can also import MySQL or Oracle data into Hive.
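For creation method 4 (from a database), Spark also ships a JdbcRDD; here is a minimal hedged sketch against MySQL, where the JDBC URL, credentials, table, column names, and bounds are all placeholders.
import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

// Assumes the MySQL JDBC driver is on the classpath and a hypothetical users table with a numeric id column.
val jdbcRdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:mysql://dbhost:3306/test", "user", "password"),
  "SELECT id, name FROM users WHERE id >= ? AND id <= ?",   // the two ? are filled with each partition's bounds
  1, 1000, 3,                                               // lowerBound, upperBound, numPartitions
  rs => (rs.getInt("id"), rs.getString("name")))
println(jdbcRdd.count())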
Instructor Wang Jialin's contact card:
China's No. 1 Spark expert
Sina Weibo: http://weibo.com/ilovepains
WeChat official account: DT_Spark
Blog: http://blog.sina.com.cn/ilovepains
Mobile: 18610086859
QQ: 1740415547