一、所遇问题
由于在IDEA下可以方便快捷地运行scala程序,所以先前并没有在终端下使用spark-submit提交打包好的jar任务包的习惯,但是其只能在local模式下执行,在网上搜了好多帖子设置VM参数都不能启动spark集群,由于实验任务紧急只能暂时作罢IDEA下任务提交,继而改由终端下使用spark-submit提交打包好的jar任务。
二、spark-shell功能介绍
进入$SPARK_HOME目录,输入bin/spark-submit --help可以得到该命令的使用帮助。
Master URL | 含义 |
local | 使用1个worker线程在本地运行Spark应用程序 |
local[K] | 使用K个worker线程在本地运行Spark应用程序 |
local[*] | 使用所有剩余worker线程在本地运行Spark应用程序 |
spark://HOST:PORT | 连接到Spark Standalone集群,以便在该集群上运行Spark应用程序 |
mesos://HOST:PORT | 连接到Mesos集群,以便在该集群上运行Spark应用程序 |
yarn-client | 以client方式连接到YARN集群,集群的定位由环境变量HADOOP_CONF_DIR定义,该方式driver在client运行。 |
yarn-cluster | 以cluster方式连接到YARN集群,集群的定位由环境变量HADOOP_CONF_DIR定义,该方式driver也在集群中运行。 |
A:建立新项目(new project)
B:编写代码
在源代码scala目录下创建1个名为KMeansTest的package,并增加3个object(SparkPi、WordCoun、SparkKMeans):
//SparkPi代码
package KMeansTest
import scala.math.random
import org.apache.spark._
/** Computes an approximation to pi */
object SparkPi {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Spark Pi")
val spark = new SparkContext(conf)
val slices = if (args.length > 0) args(0).toInt else 2
val n = 100000 * slices
val count = spark.parallelize(1 to n, slices).map { i =>
val x = random * 2 - 1
val y = random * 2 - 1
if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)
spark.stop()
}
}
// WordCount1代码
package KMeansTest
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.SparkContext._
object WordCount1 {
def main(args: Array[String]) {
if (args.length == 0) {
System.err.println("Usage: WordCount1 ")
System.exit(1)
}
val conf = new SparkConf().setAppName("WordCount1")
val sc = new SparkContext(conf)
sc.textFile(args(0)).flatMap(_.split(" ")).map(x => (x, 1)).reduceByKey(_ + _).take(10).foreach(println)
sc.stop()
}
}
//SparkKMeans代码
package KMeansTest
import java.util.Random
import breeze.linalg.{Vector, DenseVector, squaredDistance}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
object SparkKMeans {
val R = 1000 // Scaling factor
val rand = new Random(42)
def parseVector(line: String): Vector[Double] = {
DenseVector(line.split(' ').map(_.toDouble))
}
def closestPoint(p: Vector[Double], centers: Array[Vector[Double]]): Int = {
var index = 0
var bestIndex = 0
var closest = Double.PositiveInfinity
for (i <- 0 until centers.length) {
val tempDist = squaredDistance(p, centers(i))
if (tempDist < closest) {
closest = tempDist
bestIndex = i
}
}
bestIndex
}
def main(args: Array[String]) {
if (args.length < 3) {
System.err.println("Usage: SparkKMeans ")
System.exit(1)
}
val sparkConf = new SparkConf().setAppName("SparkKMeans").setMaster(args(0))
val sc = new SparkContext(sparkConf)
val lines = sc.textFile(args(1))
val data = lines.map(parseVector _).cache()
val K = args(2).toInt
val convergeDist = args(3).toDouble
val kPoints = data.takeSample(withReplacement = false, K, 42).toArray
var tempDist = 1.0
while(tempDist > convergeDist) {
val closest = data.map (p => (closestPoint(p, kPoints), (p, 1)))
val pointStats = closest.reduceByKey{case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2)}
val newPoints = pointStats.map {pair =>
(pair._1, pair._2._1 * (1.0 / pair._2._2))}.collectAsMap()
tempDist = 0.0
for (i <- 0 until K) {
tempDist += squaredDistance(kPoints(i), newPoints(i))
}
for (newP <- newPoints) {
kPoints(newP._1) = newP._2
}
println("Finished iteration (delta = " + tempDist + ")")
}
println("Final centers:")
kPoints.foreach(println)
sc.stop()
}
}
按OK后, Build -> Build Artifacts -> KMeansTest -> rebuild进行打包,经过编译后,程序包放置在out/artifacts/KMeansTest目录下,文件名为KMeansTest.jar。
D:Spark应用程序部署
将生成的程序包KMeansTest.jar复制到spark安装目录下,切换到用户hadoop/bin目录下进行程序的部署。
四、spark-shell下进行jar程序包提交运行实验
参考资料:Spark1.0.0 应用程序部署工具spark-submit
使用IntelliJ IDEA开发Spark1.0.0应用程序