1. The problem encountered
Because IDEA makes it so quick and convenient to run Scala programs, I had never gotten into the habit of submitting packaged jar jobs with spark-submit from the terminal. IDEA, however, can only execute in local mode, and none of the VM-parameter settings suggested in the many posts I found online would launch the job on the Spark cluster. With the experiment deadline looming, I set IDEA-based submission aside for now and switched to submitting the packaged jar jobs with spark-submit in the terminal.
2. An overview of spark-submit
Go to the $SPARK_HOME directory and run bin/spark-submit --help to print the usage help for the command.
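As a quick reference, an invocation follows the general pattern below; the angle-bracketed values are placeholders to fill in for your own application.

```bash
# General form of a spark-submit invocation; angle brackets mark placeholders.
# --class:  the application entry point, e.g. KMeansTest.SparkPi
# --master: one of the master URLs listed in the table below
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  <application-jar> \
  [application-arguments]
```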
| Master URL | Meaning |
| --- | --- |
| local | Run the Spark application locally with a single worker thread |
| local[K] | Run the Spark application locally with K worker threads |
| local[*] | Run the Spark application locally with as many worker threads as there are logical cores on the machine |
| spark://HOST:PORT | Connect to a Spark Standalone cluster and run the application on it |
| mesos://HOST:PORT | Connect to a Mesos cluster and run the application on it |
| yarn-client | Connect to a YARN cluster in client mode; the cluster is located via the HADOOP_CONF_DIR environment variable, and the driver runs on the client |
| yarn-cluster | Connect to a YARN cluster in cluster mode; the cluster is located via the HADOOP_CONF_DIR environment variable, and the driver runs inside the cluster |
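The master URL is handed to spark-submit through the --master flag (or hard-coded via SparkConf.setMaster, as the SparkKMeans program below does). For example, with host and port as placeholders:

```bash
# Passing different master URLs to spark-submit (values are examples)
bin/spark-submit --master local[4] ...             # 4 local worker threads
bin/spark-submit --master spark://master:7077 ...  # Standalone cluster
bin/spark-submit --master yarn-cluster ...         # YARN, driver inside the cluster
```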
3. Developing and packaging the application in IntelliJ IDEA
A: Create a new project
B: Write the code
Under the scala source directory, create a package named KMeansTest and add three objects to it (SparkPi, WordCount1, SparkKMeans):
```scala
// SparkPi code
package KMeansTest

import scala.math.random
import org.apache.spark._

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000 * slices
    val count = spark.parallelize(1 to n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
```
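Once packaged (step C below), SparkPi takes the number of slices as its only argument. A hypothetical submission, assuming the jar sits in the current directory:

```bash
# Run SparkPi with 4 slices on 2 local worker threads (values are examples)
bin/spark-submit --class KMeansTest.SparkPi --master local[2] KMeansTest.jar 4
```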
```scala
// WordCount1 code
package KMeansTest

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.SparkContext._

object WordCount1 {
  def main(args: Array[String]) {
    if (args.length == 0) {
      System.err.println("Usage: WordCount1 <file1>")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("WordCount1")
    val sc = new SparkContext(conf)
    sc.textFile(args(0)).flatMap(_.split(" ")).map(x => (x, 1)).reduceByKey(_ + _).take(10).foreach(println)
    sc.stop()
  }
}
```
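WordCount1 expects a single argument, the input file. A sketch of a submission, where the master URL and HDFS path are assumptions for this setup:

```bash
# Print the first 10 (word, count) pairs from an HDFS file
bin/spark-submit --class KMeansTest.WordCount1 --master spark://master:7077 \
  KMeansTest.jar hdfs://master:9000/user/hadoop/README.md
```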
```scala
// SparkKMeans code
package KMeansTest

import java.util.Random
import breeze.linalg.{Vector, DenseVector, squaredDistance}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object SparkKMeans {
  val R = 1000 // Scaling factor
  val rand = new Random(42)

  def parseVector(line: String): Vector[Double] = {
    DenseVector(line.split(' ').map(_.toDouble))
  }

  // Index of the center closest to point p
  def closestPoint(p: Vector[Double], centers: Array[Vector[Double]]): Int = {
    var bestIndex = 0
    var closest = Double.PositiveInfinity
    for (i <- 0 until centers.length) {
      val tempDist = squaredDistance(p, centers(i))
      if (tempDist < closest) {
        closest = tempDist
        bestIndex = i
      }
    }
    bestIndex
  }

  def main(args: Array[String]) {
    // The master URL is taken from args(0), so four arguments are required
    if (args.length < 4) {
      System.err.println("Usage: SparkKMeans <master> <file> <k> <convergeDist>")
      System.exit(1)
    }
    val sparkConf = new SparkConf().setAppName("SparkKMeans").setMaster(args(0))
    val sc = new SparkContext(sparkConf)
    val lines = sc.textFile(args(1))
    val data = lines.map(parseVector _).cache()
    val K = args(2).toInt
    val convergeDist = args(3).toDouble

    // Pick K initial centers at random
    val kPoints = data.takeSample(withReplacement = false, K, 42).toArray
    var tempDist = 1.0

    while (tempDist > convergeDist) {
      // Assign each point to its closest center, then average each cluster
      val closest = data.map(p => (closestPoint(p, kPoints), (p, 1)))
      val pointStats = closest.reduceByKey { case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2) }
      val newPoints = pointStats.map { pair => (pair._1, pair._2._1 * (1.0 / pair._2._2)) }.collectAsMap()

      // Total movement of the centers in this iteration
      tempDist = 0.0
      for (i <- 0 until K) {
        tempDist += squaredDistance(kPoints(i), newPoints(i))
      }
      for (newP <- newPoints) {
        kPoints(newP._1) = newP._2
      }
      println("Finished iteration (delta = " + tempDist + ")")
    }

    println("Final centers:")
    kPoints.foreach(println)
    sc.stop()
  }
}
```
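Note that SparkKMeans reads the master URL from args(0) via setMaster, so it is passed as the first program argument rather than through --master. A sketch with placeholder values:

```bash
# k = 8 clusters, convergence threshold 0.1; master URL and data path are examples
bin/spark-submit --class KMeansTest.SparkKMeans KMeansTest.jar \
  spark://master:7077 hdfs://master:9000/user/hadoop/kmeans_data.txt 8 0.1
```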
C: Build the jar package
After confirming the artifact settings with OK, package the project via Build -> Build Artifacts -> KMeansTest -> rebuild. Once compilation finishes, the package is placed in the out/artifacts/KMeansTest directory under the file name KMeansTest.jar.
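To verify that all three objects made it into the artifact, the jar's contents can be listed; a small check, assuming the default output path above:

```bash
# List the compiled classes inside the built artifact
jar tf out/artifacts/KMeansTest/KMeansTest.jar | grep KMeansTest
```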
D: Deploy the Spark application
Copy the generated KMeansTest.jar package into the Spark installation directory, then switch (as user hadoop) to that installation's bin directory to deploy and run the program.
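In shell terms, the deployment amounts to something like the following sketch (the paths are assumptions for this setup):

```bash
# Copy the built jar into the Spark installation and submit from there
cp out/artifacts/KMeansTest/KMeansTest.jar $SPARK_HOME/
cd $SPARK_HOME
# run the programs with bin/spark-submit as in the examples above
```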
4. Submitting and running the jar packages with spark-submit
References:
- spark-submit, the Spark 1.0.0 application deployment tool
- Developing Spark 1.0.0 applications with IntelliJ IDEA