1: Start the shell
MASTER=spark://feng02:7077 ./bin/spark-shell
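Note that the MASTER environment variable is case-sensitive in Spark 1.x; the lowercase master= that appears in the captured session below is silently ignored by the launch scripts. The equivalent, documented form uses the --master flag:

./bin/spark-shell --master spark://feng02:7077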
[jifeng@feng02 spark-1.2.0-bin-2.4.1]$ master=spark://feng02:7077 ./bin/spark-shell
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/02/27 16:33:58 INFO SecurityManager: Changing view acls to: jifeng
15/02/27 16:33:58 INFO SecurityManager: Changing modify acls to: jifeng
15/02/27 16:33:58 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jifeng); users with modify permissions: Set(jifeng)
15/02/27 16:33:58 INFO HttpServer: Starting HTTP Server
15/02/27 16:33:58 INFO Utils: Successfully started service 'HTTP class server' on port 34677.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
15/02/27 16:34:05 INFO SecurityManager: Changing view acls to: jifeng
15/02/27 16:34:05 INFO SecurityManager: Changing modify acls to: jifeng
15/02/27 16:34:05 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jifeng); users with modify permissions: Set(jifeng)
15/02/27 16:34:06 INFO Slf4jLogger: Slf4jLogger started
15/02/27 16:34:06 INFO Remoting: Starting remoting
15/02/27 16:34:06 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@feng02:50101]
15/02/27 16:34:06 INFO Utils: Successfully started service 'sparkDriver' on port 50101.
15/02/27 16:34:06 INFO SparkEnv: Registering MapOutputTracker
15/02/27 16:34:06 INFO SparkEnv: Registering BlockManagerMaster
15/02/27 16:34:06 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20150227163406-af47
15/02/27 16:34:06 INFO MemoryStore: MemoryStore started with capacity 267.3 MB
15/02/27 16:34:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/02/27 16:34:07 INFO HttpFileServer: HTTP File server directory is /tmp/spark-87b83dea-f916-4c1d-bd4f-f77cceca4b56
15/02/27 16:34:07 INFO HttpServer: Starting HTTP Server
15/02/27 16:34:07 INFO Utils: Successfully started service 'HTTP file server' on port 45673.
15/02/27 16:34:08 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/02/27 16:34:08 INFO SparkUI: Started SparkUI at http://feng02:4040
15/02/27 16:34:08 INFO Executor: Using REPL class URI: http://10.6.3.201:34677
15/02/27 16:34:08 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@feng02:50101/user/HeartbeatReceiver
15/02/27 16:34:09 INFO NettyBlockTransferService: Server created on 54438
15/02/27 16:34:09 INFO BlockManagerMaster: Trying to register BlockManager
15/02/27 16:34:09 INFO BlockManagerMasterActor: Registering block manager localhost:54438 with 267.3 MB RAM, BlockManagerId(<driver>, localhost, 54438)
15/02/27 16:34:09 INFO BlockManagerMaster: Registered BlockManager
15/02/27 16:34:09 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala>

2: Testing an RDD with people.txt
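The example below reads people.txt, which ships with the Spark distribution under examples/src/main/resources. For reference, the standard sample file contains three comma-separated records; the names match the query output later in this transcript:

Michael, 29
Andy, 30
Justin, 19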
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
import sqlContext.createSchemaRDD
// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface (see the sketch after this listing).
case class Person(name: String, age: Int)
// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
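The Product workaround mentioned in the comment above is not exercised in this transcript. A minimal sketch of what such a class could look like (the name WideRecord and its two fields are illustrative only; in practice this pattern only matters once a record exceeds 22 fields):

// Hypothetical example: a plain class implementing Product, usable in place of a
// case class once Scala 2.10's 22-field limit is hit. Only two fields are shown.
class WideRecord(val name: String, val age: Int) extends Product with Serializable {
  def canEqual(that: Any): Boolean = that.isInstanceOf[WideRecord]
  def productArity: Int = 2
  def productElement(n: Int): Any = n match {
    case 0 => name
    case 1 => age
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }
}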
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@30aceb27

scala> import sqlContext.createSchemaRDD
import sqlContext.createSchemaRDD

scala> case class Person(name: String, age: Int)
defined class Person

scala> val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
15/02/27 16:41:19 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=280248975
15/02/27 16:41:19 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 159.9 KB, free 267.1 MB)
15/02/27 16:41:19 INFO MemoryStore: ensureFreeSpace(22736) called with curMem=163705, maxMem=280248975
15/02/27 16:41:19 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 267.1 MB)
15/02/27 16:41:19 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:54438 (size: 22.2 KB, free: 267.2 MB)
15/02/27 16:41:19 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/02/27 16:41:19 INFO SparkContext: Created broadcast 0 from textFile at <console>:17
people: org.apache.spark.rdd.RDD[Person] = MappedRDD[3] at map at <console>:17

scala> people.registerTempTable("people")

scala> val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers: org.apache.spark.sql.SchemaRDD =
SchemaRDD[6] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
Project [name#0]
 Filter ((age#1 >= 13) && (age#1 <= 19))
  PhysicalRDD [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at ExistingRDD.scala:36

scala> teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
15/02/27 16:42:24 INFO FileInputFormat: Total input paths to process : 1
15/02/27 16:42:24 INFO SparkContext: Starting job: collect at <console>:18
15/02/27 16:42:24 INFO DAGScheduler: Got job 0 (collect at <console>:18) with 1 output partitions (allowLocal=false)
15/02/27 16:42:24 INFO DAGScheduler: Final stage: Stage 0(collect at <console>:18)
15/02/27 16:42:24 INFO DAGScheduler: Parents of final stage: List()
15/02/27 16:42:24 INFO DAGScheduler: Missing parents: List()
15/02/27 16:42:24 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[7] at map at <console>:18), which has no missing parents
15/02/27 16:42:24 INFO MemoryStore: ensureFreeSpace(6416) called with curMem=186441, maxMem=280248975
15/02/27 16:42:24 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 6.3 KB, free 267.1 MB)
15/02/27 16:42:24 INFO MemoryStore: ensureFreeSpace(4290) called with curMem=192857, maxMem=280248975
15/02/27 16:42:24 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.2 KB, free 267.1 MB)
15/02/27 16:42:24 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:54438 (size: 4.2 KB, free: 267.2 MB)
15/02/27 16:42:24 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/02/27 16:42:24 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:838
15/02/27 16:42:24 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[7] at map at <console>:18)
15/02/27 16:42:24 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/02/27 16:42:24 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1349 bytes)
15/02/27 16:42:24 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/02/27 16:42:24 INFO HadoopRDD: Input split: file:/home/jifeng/hadoop/spark-1.2.0-bin-2.4.1/examples/src/main/resources/people.txt:0+32
15/02/27 16:42:24 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/02/27 16:42:24 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/02/27 16:42:24 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/02/27 16:42:24 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/02/27 16:42:24 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/02/27 16:42:24 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1734 bytes result sent to driver
15/02/27 16:42:24 INFO DAGScheduler: Stage 0 (collect at <console>:18) finished in 0.248 s
15/02/27 16:42:24 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 237 ms on localhost (1/1)
15/02/27 16:42:24 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/02/27 16:42:24 INFO DAGScheduler: Job 0 finished: collect at <console>:18, took 0.348078 s
Name: Justin

scala> val teenagers = sqlContext.sql("SELECT name FROM people ")
15/02/27 17:06:45 INFO BlockManager: Removing broadcast 1
15/02/27 17:06:45 INFO BlockManager: Removing block broadcast_1_piece0
15/02/27 17:06:45 INFO MemoryStore: Block broadcast_1_piece0 of size 4290 dropped from memory (free 280056118)
15/02/27 17:06:45 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:54438 in memory (size: 4.2 KB, free: 267.2 MB)
15/02/27 17:06:45 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/02/27 17:06:45 INFO BlockManager: Removing block broadcast_1
15/02/27 17:06:45 INFO MemoryStore: Block broadcast_1 of size 6416 dropped from memory (free 280062534)
15/02/27 17:06:45 INFO ContextCleaner: Cleaned broadcast 1
teenagers: org.apache.spark.sql.SchemaRDD =
SchemaRDD[10] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
Project [name#0]
 PhysicalRDD [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at ExistingRDD.scala:36

scala> teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
15/02/27 17:06:50 INFO SparkContext: Starting job: collect at <console>:18
15/02/27 17:06:50 INFO DAGScheduler: Got job 1 (collect at <console>:18) with 1 output partitions (allowLocal=false)
15/02/27 17:06:50 INFO DAGScheduler: Final stage: Stage 1(collect at <console>:18)
15/02/27 17:06:50 INFO DAGScheduler: Parents of final stage: List()
15/02/27 17:06:50 INFO DAGScheduler: Missing parents: List()
15/02/27 17:06:50 INFO DAGScheduler: Submitting Stage 1 (MappedRDD[11] at map at <console>:18), which has no missing parents
15/02/27 17:06:50 INFO MemoryStore: ensureFreeSpace(5512) called with curMem=186441, maxMem=280248975
15/02/27 17:06:50 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 5.4 KB, free 267.1 MB)
15/02/27 17:06:50 INFO MemoryStore: ensureFreeSpace(3790) called with curMem=191953, maxMem=280248975
15/02/27 17:06:50 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 3.7 KB, free 267.1 MB)
15/02/27 17:06:50 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:54438 (size: 3.7 KB, free: 267.2 MB)
15/02/27 17:06:50 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
15/02/27 17:06:50 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:838
15/02/27 17:06:50 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MappedRDD[11] at map at <console>:18)
15/02/27 17:06:50 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/02/27 17:06:50 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1349 bytes)
15/02/27 17:06:50 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/02/27 17:06:50 INFO HadoopRDD: Input split: file:/home/jifeng/hadoop/spark-1.2.0-bin-2.4.1/examples/src/main/resources/people.txt:0+32
15/02/27 17:06:50 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1763 bytes result sent to driver
15/02/27 17:06:50 INFO DAGScheduler: Stage 1 (collect at <console>:18) finished in 0.018 s
15/02/27 17:06:50 INFO DAGScheduler: Job 1 finished: collect at <console>:18, took 0.032411 s
Name: Michael
Name: Andy
Name: Justin

scala> 15/02/27 17:06:50 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 19 ms on localhost (1/1)
15/02/27 17:06:50 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/02/27 17:36:19 INFO BlockManager: Removing broadcast 2
15/02/27 17:36:19 INFO BlockManager: Removing block broadcast_2
15/02/27 17:36:19 INFO MemoryStore: Block broadcast_2 of size 5512 dropped from memory (free 280058744)
15/02/27 17:36:19 INFO BlockManager: Removing block broadcast_2_piece0
15/02/27 17:36:19 INFO MemoryStore: Block broadcast_2_piece0 of size 3790 dropped from memory (free 280062534)
15/02/27 17:36:19 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:54438 in memory (size: 3.7 KB, free: 267.2 MB)
15/02/27 17:36:19 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
15/02/27 17:36:19 INFO ContextCleaner: Cleaned broadcast 2
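As a follow-up that is not part of the captured session, several columns of a result Row can also be combined by ordinal. A minimal sketch, assuming the people table registered above is still live in the same shell:

// Hypothetical follow-up: select both columns and format each Row by ordinal.
val namesAndAges = sqlContext.sql("SELECT name, age FROM people")
namesAndAges.map(t => s"${t(0)} is ${t(1)} years old").collect().foreach(println)
// Expected output, given the three records in people.txt:
// Michael is 29 years old
// Andy is 30 years old
// Justin is 19 years old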