Spark 1.5.2: building a jar in Eclipse and submitting it to run on the cluster
Environment:
Windows 7
Ubuntu
Spark 1.5.2
1. WordCountSpark.scala code:
import org.apache.spark._
import SparkContext._

object WordCountSpark {
  def main(args: Array[String]) {
    if (args.length != 3) {
      println("usage is org.test.WordCount <master> <input> <output>")
      return
    }
    val sc = new SparkContext(args(0), "WordCount",
      System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_TEST_JAR")))
    val textFile = sc.textFile(args(1))
    val result = textFile.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    result.saveAsTextFile(args(2))
  }
}
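Before packaging, it can help to smoke-test the same logic in local mode. The sketch below is not from the original post: it assumes a hypothetical local input file and uses the SparkConf style (also shown in the second example further down) instead of the four-argument SparkContext constructor:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal local-mode smoke test for the word count logic.
// "data/sample.txt" is a placeholder path, not from the original post.
object WordCountLocalTest {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("WordCountLocalTest").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.textFile("data/sample.txt")
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)   // print to the console instead of saving to HDFS
    sc.stop()
  }
}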
2. submitJob.sh script:
#!/usr/bin/env bash
./spark-submit --name WordCountSpark \
  --class WordCountSpark \
  --master spark://219.219.220.149:7077 \
  --executor-memory 512M \
  --total-executor-cores 1 \
  WordCountSpark.jar local /input/* /output/201601262158

Everything after the jar path is handed to main as args(0..2): the master ("local"), the input path, and the output path. Note that because the program sets the master explicitly in the SparkContext constructor, args(0) should take precedence over the --master flag, so "local" here runs the job in local mode; pass the spark:// URL instead to run it on the cluster.
3. Build WordCountSpark.scala into a jar and upload it with rz to /home/hadoop/cloud/spark-1.5.2/bin;
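If you would rather build with sbt than export the jar from Eclipse, a minimal build.sbt along these lines should work (the project name, version, and Scala version are assumptions; Spark 1.5.2 is built against Scala 2.10):

name := "WordCountSpark"

version := "1.0"

scalaVersion := "2.10.4"

// The cluster provides Spark at runtime, so mark spark-core as "provided"
// to keep it out of the packaged jar.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" % "provided"

Running sbt package then writes the jar under target/scala-2.10/.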
The input data is placed under /input on HDFS:
4. Run:
hadoop@Master:~/cloud/spark-1.5.2/bin$ ./submitJob.sh
Run log:
The second version of the code works the same way, following the same steps:
import org.apache.spark._
import SparkContext._

object SparkWordCount {
  def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println("Usage: <file>")
      System.exit(1)
    }

    // Define the Spark runtime configuration.
    /* Configuration for a Spark application. Used to set various Spark
     * parameters as key-value pairs. Most of the time, you would create a
     * SparkConf object with `new SparkConf()`, which will load values from
     * any `spark.*` Java system properties set in your application as well.
     * In this case, parameters you set directly on the `SparkConf` object
     * take priority over system properties. */
    val conf = new SparkConf()
    conf.setAppName("SparkWordCount")

    // Define the Spark context.
    /* Main entry point for Spark functionality. A SparkContext represents the
     * connection to a Spark cluster, and can be used to create RDDs,
     * accumulators and broadcast variables on that cluster. Only one
     * SparkContext may be active per JVM. You must `stop()` the active
     * SparkContext before creating a new one. This limitation may eventually
     * be removed; see SPARK-2243 for more details.
     * @param config a Spark Config object describing the application
     * configuration. Any settings in this config overrides the default
     * configs as well as system properties. */
    val sc = new SparkContext(conf)

    // Reference the text file on HDFS (nothing is actually read yet),
    // constructing a MappedRDD.
    val rdd = sc.textFile(args(0))

    // If this line reports "value reduceByKey is not a member of
    // org.apache.spark.rdd.RDD[(String, Int)]", you need to add
    // import org.apache.spark.SparkContext._
    rdd.flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .map(x => (x._2, x._1))
      .sortByKey(false)
      .map(x => (x._2, x._1))
      .saveAsTextFile(args(1))

    sc.stop()
  }
}
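The map/sortByKey/map sequence above swaps key and value twice just to order words by count. RDD.sortBy, available in this Spark version, expresses the same thing directly; an equivalent sketch of the final pipeline:

// Same word count ordered by descending frequency, using sortBy
// in place of the swap / sortByKey / swap pattern.
rdd.flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
  .saveAsTextFile(args(1))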
Script:
#!/usr/bin/env bash
./spark-submit --name SparkWordCount \
  --class SparkWordCount \
  --master spark://219.219.220.149:7077 \
  --executor-memory 512M \
  --total-executor-cores 1 \
  SparkWordCount.jar /input/* /output/201601262211

This version reads the master from spark-submit's --master flag, so only the input and output paths are passed as program arguments.
Run:
hadoop@Master:~/cloud/spark-1.5.2/bin$ ./submitJob_SparkWordCount.sh