1. Launch a Spark cluster with the spark-ec2 script that ships with Spark
From the Spark installation directory, run the following command:
$ ./ec2/spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --vpc-id=<vpc-id> --subnet-id=<subnet-id> launch <cluster-name>
, where
<keypair> is the name of your EC2 key pair (that you gave it when you created it),
<key-file> is the private key file for your key pair,
<num-slaves> is the number of slave nodes to launch (try 1 at first),
<vpc-id> is the name of your VPC,
<subnet-id> is the name of your subnet, and
<cluster-name> is the name to give to your cluster.
For example:
$ export AWS_SECRET_ACCESS_KEY=AaBbCcDdEeFGgHhIiJjKkLlMmNnOoPpQqRrSsTtU
$ export AWS_ACCESS_KEY_ID=ABCDEFG1234567890123
$ ./ec2/spark-ec2 --key-pair=spark_study --identity-file=/home/ubuntu/spark_study.pem --region=ap-northeast-1 --zone=ap-northeast-1a launch my-spark-cluster
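Once the cluster is up, the same script handles logging in to the master node (needed for steps 3 through 5) and, when you are finished, tearing the cluster down. Both are standard spark-ec2 subcommands; the key pair, region, and cluster name below simply reuse the values from the launch example:
$ ./ec2/spark-ec2 --key-pair=spark_study --identity-file=/home/ubuntu/spark_study.pem --region=ap-northeast-1 login my-spark-cluster
$ ./ec2/spark-ec2 --region=ap-northeast-1 destroy my-spark-cluster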
2. Write the Scala program locally
First create the following directory layout:
./src
./src/main
./src/main/scala
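All three directories can be created in one step; this is the standard sbt source layout:
$ mkdir -p ./src/main/scala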
Then create the file ./src/main/scala/SimpleApp.scala with the following contents:
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/home/ubuntu/spark-1.6.0-bin-hadoop2.6/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
We use sbt to build the Scala program into a jar, so sbt needs to be installed first.
Install sbt with the following commands:
echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 642AC823
sudo apt-get update
sudo apt-get install sbt
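You can optionally verify the installation; sbt about prints the sbt and Scala versions (the first invocation may take a while as sbt fetches its own dependencies):
sbt about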
Create the file ./simple.sbt with the following contents:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.3"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"
Then build the jar from the project root:
sbt package
The jar is written to ./target/scala-2.10/simple-project_2.10-1.0.jar (sbt combines the project name, the Scala binary version, and the version number); this is the file we submit in step 5.
3. Upload the jar to the Spark master node.
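For example, with scp from the project directory. The destination below matches the jar path used in step 5 and the master hostname is the one returned by the launch; spark-ec2 clusters are normally accessed as root, so adjust the user and paths to your own setup:
$ scp -i /home/ubuntu/spark_study.pem target/scala-2.10/simple-project_2.10-1.0.jar root@ec2-52-192-126-225.ap-northeast-1.compute.amazonaws.com:/home/ec2-user/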
4. Note the logFile path inside our program:
/home/ubuntu/spark-1.6.0-bin-hadoop2.6/README.md
On the Spark cluster, a bare path like this is resolved against HDFS by default, so we first have to create a text file at that location on HDFS.
First log in to the master node and change into the HDFS installation directory:
cd ~/ephemeral-hdfs
We then copy README.md into the
/home/ubuntu/spark-1.6.0-bin-hadoop2.6/
directory on HDFS, using the following commands:
bin/hadoop fs -mkdir /home/ubuntu/spark-1.6.0-bin-hadoop2.6/
bin/hadoop fs -put ~/README.md /home/ubuntu/spark-1.6.0-bin-hadoop2.6/
Then verify that the copy succeeded:
bin/hadoop fs -ls /home/ubuntu/spark-1.6.0-bin-hadoop2.6/
You should now see README.md listed; this is the copy that lives on HDFS.
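To double-check the contents as well, hadoop fs -cat prints an HDFS file to stdout (piping through head keeps the output short):
bin/hadoop fs -cat /home/ubuntu/spark-1.6.0-bin-hadoop2.6/README.md | head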
5. Submit the jar that was uploaded to the master node.
Change into the Spark installation directory and run the command below: --class names the application's main class, --master points at the standalone master on port 7077, and --deploy-mode client runs the driver on the submitting machine. The application's own output (the two line counts) appears near the end of the log.
./bin/spark-submit --class SimpleApp --master spark://ec2-52-192-126-225.ap-northeast-1.compute.amazonaws.com:7077 --deploy-mode client /home/ec2-user/simple-project_2.10-1.0.jar
16/02/07 12:55:31 INFO spark.SecurityManager: Changing view acls to: root
16/02/07 12:55:31 INFO spark.SecurityManager: Changing modify acls to: root
16/02/07 12:55:31 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/02/07 12:55:32 INFO util.Utils: Successfully started service 'sparkDriver' on port 36092.
16/02/07 12:55:32 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/02/07 12:55:32 INFO Remoting: Starting remoting
16/02/07 12:55:33 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:33336]
16/02/07 12:55:33 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 33336.
16/02/07 12:55:33 INFO spark.SparkEnv: Registering MapOutputTracker
16/02/07 12:55:33 INFO spark.SparkEnv: Registering BlockManagerMaster
16/02/07 12:55:33 INFO storage.DiskBlockManager: Created local directory at /mnt/spark/blockmgr-76f22bc5-a78a-4847-ab29-b7292f7a7cff
16/02/07 12:55:33 INFO storage.DiskBlockManager: Created local directory at /mnt2/spark/blockmgr-dd59300b-9590-41c3-b94a-d76a3c7fe8db
16/02/07 12:55:33 INFO storage.MemoryStore: MemoryStore started with capacity 511.5 MB
16/02/07 12:55:33 INFO spark.SparkEnv: Registering OutputCommitCoordinator
16/02/07 12:55:33 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/02/07 12:55:33 INFO server.AbstractConnector: Started [email protected]:4040
16/02/07 12:55:33 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
16/02/07 12:55:33 INFO ui.SparkUI: Started SparkUI at http://ec2-52-192-126-225.ap-northeast-1.compute.amazonaws.com:4040
16/02/07 12:55:33 INFO spark.HttpFileServer: HTTP File server directory is /mnt/spark/spark-6ce5d4ee-c69e-40d1-9053-5d613461f9bf/httpd-77ec3777-a82b-4ef3-9fca-5a4c790a5747
16/02/07 12:55:33 INFO spark.HttpServer: Starting HTTP Server
16/02/07 12:55:33 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/02/07 12:55:33 INFO server.AbstractConnector: Started [email protected]:56378
16/02/07 12:55:33 INFO util.Utils: Successfully started service 'HTTP file server' on port 56378.
16/02/07 12:55:33 INFO spark.SparkContext: Added JAR file:/home/ec2-user/simple-project_2.10-1.0.jar at http://172.31.12.26:56378/jars/simple-project_2.10-1.0.jar with timestamp 1454849733709
16/02/07 12:55:33 INFO client.AppClient$ClientEndpoint: Connecting to master spark://ec2-52-192-126-225.ap-northeast-1.compute.amazonaws.com:7077...
16/02/07 12:55:34 INFO cluster.SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20160207125534-0003
16/02/07 12:55:34 INFO client.AppClient$ClientEndpoint: Executor added: app-20160207125534-0003/0 on worker-20160207085403-172.31.2.135-39140 (172.31.2.135:39140) with 2 cores
16/02/07 12:55:34 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20160207125534-0003/0 on hostPort 172.31.2.135:39140 with 2 cores, 6.0 GB RAM
16/02/07 12:55:34 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 47128.
16/02/07 12:55:34 INFO netty.NettyBlockTransferService: Server created on 47128
16/02/07 12:55:34 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/02/07 12:55:34 INFO storage.BlockManagerMasterEndpoint: Registering block manager 172.31.12.26:47128 with 511.5 MB RAM, BlockManagerId(driver, 172.31.12.26, 47128)
16/02/07 12:55:34 INFO storage.BlockManagerMaster: Registered BlockManager
16/02/07 12:55:34 INFO client.AppClient$ClientEndpoint: Executor updated: app-20160207125534-0003/0 is now RUNNING
16/02/07 12:55:34 INFO cluster.SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
16/02/07 12:55:35 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 46.3 KB, free 46.3 KB)
16/02/07 12:55:35 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 4.4 KB, free 50.7 KB)
16/02/07 12:55:35 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.31.12.26:47128 (size: 4.4 KB, free: 511.5 MB)
16/02/07 12:55:35 INFO spark.SparkContext: Created broadcast 0 from textFile at SimpleApp.scala:11
16/02/07 12:55:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/02/07 12:55:35 WARN snappy.LoadSnappy: Snappy native library not loaded
16/02/07 12:55:35 INFO mapred.FileInputFormat: Total input paths to process : 1
16/02/07 12:55:35 INFO spark.SparkContext: Starting job: count at SimpleApp.scala:12
16/02/07 12:55:35 INFO scheduler.DAGScheduler: Got job 0 (count at SimpleApp.scala:12) with 2 output partitions
16/02/07 12:55:35 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (count at SimpleApp.scala:12)
16/02/07 12:55:35 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/02/07 12:55:35 INFO scheduler.DAGScheduler: Missing parents: List()
16/02/07 12:55:35 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12), which has no missing parents
16/02/07 12:55:35 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 53.9 KB)
16/02/07 12:55:35 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1891.0 B, free 55.7 KB)
16/02/07 12:55:35 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.31.12.26:47128 (size: 1891.0 B, free: 511.5 MB)
16/02/07 12:55:35 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/02/07 12:55:35 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12)
16/02/07 12:55:35 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/02/07 12:55:38 INFO cluster.SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-172-31-2-135.ap-northeast-1.compute.internal:38142) with ID 0
16/02/07 12:55:38 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, ip-172-31-2-135.ap-northeast-1.compute.internal, partition 0,NODE_LOCAL, 2286 bytes)
16/02/07 12:55:38 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, ip-172-31-2-135.ap-northeast-1.compute.internal, partition 1,NODE_LOCAL, 2286 bytes)
16/02/07 12:55:38 INFO storage.BlockManagerMasterEndpoint: Registering block manager ip-172-31-2-135.ap-northeast-1.compute.internal:46994 with 4.1 GB RAM, BlockManagerId(0, ip-172-31-2-135.ap-northeast-1.compute.internal, 46994)
16/02/07 12:55:38 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-172-31-2-135.ap-northeast-1.compute.internal:46994 (size: 1891.0 B, free: 4.1 GB)
16/02/07 12:55:39 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-172-31-2-135.ap-northeast-1.compute.internal:46994 (size: 4.4 KB, free: 4.1 GB)
16/02/07 12:55:39 INFO storage.BlockManagerInfo: Added rdd_1_0 in memory on ip-172-31-2-135.ap-northeast-1.compute.internal:46994 (size: 3.1 KB, free: 4.1 GB)
16/02/07 12:55:39 INFO storage.BlockManagerInfo: Added rdd_1_1 in memory on ip-172-31-2-135.ap-northeast-1.compute.internal:46994 (size: 2.9 KB, free: 4.1 GB)
16/02/07 12:55:39 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 1477 ms on ip-172-31-2-135.ap-northeast-1.compute.internal (1/2)
16/02/07 12:55:39 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1502 ms on ip-172-31-2-135.ap-northeast-1.compute.internal (2/2)
16/02/07 12:55:39 INFO scheduler.DAGScheduler: ResultStage 0 (count at SimpleApp.scala:12) finished in 3.923 s
16/02/07 12:55:39 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Job 0 finished: count at SimpleApp.scala:12, took 4.132609 s
16/02/07 12:55:39 INFO spark.SparkContext: Starting job: count at SimpleApp.scala:13
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Got job 1 (count at SimpleApp.scala:13) with 2 output partitions
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (count at SimpleApp.scala:13)
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Missing parents: List()
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13), which has no missing parents
16/02/07 12:55:39 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.1 KB, free 58.8 KB)
16/02/07 12:55:39 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1893.0 B, free 60.7 KB)
16/02/07 12:55:39 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 172.31.12.26:47128 (size: 1893.0 B, free: 511.5 MB)
16/02/07 12:55:39 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13)
16/02/07 12:55:39 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
16/02/07 12:55:39 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, ip-172-31-2-135.ap-northeast-1.compute.internal, partition 0,PROCESS_LOCAL, 2286 bytes)
16/02/07 12:55:39 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, ip-172-31-2-135.ap-northeast-1.compute.internal, partition 1,PROCESS_LOCAL, 2286 bytes)
16/02/07 12:55:39 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-2-135.ap-northeast-1.compute.internal:46994 (size: 1893.0 B, free: 4.1 GB)
16/02/07 12:55:39 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 65 ms on ip-172-31-2-135.ap-northeast-1.compute.internal (1/2)
16/02/07 12:55:39 INFO scheduler.DAGScheduler: ResultStage 1 (count at SimpleApp.scala:13) finished in 0.078 s
16/02/07 12:55:39 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 79 ms on ip-172-31-2-135.ap-northeast-1.compute.internal (2/2)
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Job 1 finished: count at SimpleApp.scala:13, took 0.100431 s
16/02/07 12:55:39 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
Lines with a: 32, Lines with b: 11
16/02/07 12:55:39 INFO spark.SparkContext: Invoking stop() from shutdown hook
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
16/02/07 12:55:39 INFO ui.SparkUI: Stopped Spark web UI at http://ec2-52-192-126-225.ap-northeast-1.compute.amazonaws.com:4040
16/02/07 12:55:39 INFO cluster.SparkDeploySchedulerBackend: Shutting down all executors
16/02/07 12:55:39 INFO cluster.SparkDeploySchedulerBackend: Asking each executor to shut down
16/02/07 12:55:39 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/02/07 12:55:39 INFO storage.MemoryStore: MemoryStore cleared
16/02/07 12:55:39 INFO storage.BlockManager: BlockManager stopped
16/02/07 12:55:39 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
16/02/07 12:55:39 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/02/07 12:55:39 INFO spark.SparkContext: Successfully stopped SparkContext
16/02/07 12:55:39 INFO util.ShutdownHookManager: Shutdown hook called
16/02/07 12:55:39 INFO util.ShutdownHookManager: Deleting directory /mnt/spark/spark-6ce5d4ee-c69e-40d1-9053-5d613461f9bf/httpd-77ec3777-a82b-4ef3-9fca-5a4c790a5747
16/02/07 12:55:39 INFO util.ShutdownHookManager: Deleting directory /mnt2/spark/spark-8d522cb1-beec-4d21-9a8c-e9c1cf635ea7
16/02/07 12:55:39 INFO util.ShutdownHookManager: Deleting directory /mnt/spark/spark-6ce5d4ee-c69e-40d1-9053-5d613461f9bf