Spark Tutorial 2 (Running Code Locally and on a Cluster)

Running locally

  • 1. Create a new project and add the following Maven dependencies
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.10.6</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.3</version>
</dependency>
  • 2. Create the Scala class below. Note that System.setProperty("HADOOP_USER_NAME", "hdfs") must name a user with the required permissions on HDFS; then run it directly
import org.apache.spark.{SparkConf, SparkContext}

object TestWordCount {
  def main(args: Array[String]): Unit = {
    // Run as an HDFS user that can read the input and write the output path
    System.setProperty("HADOOP_USER_NAME", "hdfs")
    val time = System.currentTimeMillis()

    val inPath = "hdfs://t1:8020/user/admin/test_in/word.txt"
    val outPath = "hdfs://t1:8020/user/admin/test_out/"

    val conf = new SparkConf().setAppName("word_count").setMaster("local")
    val sc = new SparkContext(conf)
    // Split lines into words, count each word, sort by count, and save to HDFS
    sc.textFile(inPath).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _, 1).sortBy(_._2).saveAsTextFile(outPath)

    sc.stop()
    val cost = System.currentTimeMillis() - time
    println(s"cost $cost ms")
  }
}
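To see what the pipeline computes without touching HDFS, here is a minimal sketch that swaps the HDFS file for a hypothetical in-memory input (the two sample lines are assumptions, not the actual contents of word.txt):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountSample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word_count_sample").setMaster("local"))
    // Hypothetical input standing in for word.txt
    val lines = sc.parallelize(Seq("a b b", "b c"))
    // Same pipeline as above, collected to the driver instead of saved to HDFS
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _, 1).sortBy(_._2)
      .collect().foreach(println) // e.g. (a,1), (c,1), (b,3), in ascending count order
    sc.stop()
  }
}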

Running on the cluster

  • 1. Switch to the hdfs user, then start spark-shell (if you are using Scala)
su hdfs

/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/spark/bin/spark-shell
  • 2. Type the code you want to run directly into the shell and press Enter to see the result
sc.textFile("hdfs://t1:8020/user/admin/test_in/word.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _, 1).sortBy(_._2).saveAsTextFile("hdfs://t1:8020/user/admin/test_out/")

Here sc is the SparkContext that spark-shell provides out of the box.
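To check the result from the same shell, you can read the output directory back; this one-liner assumes the output path used above:

sc.textFile("hdfs://t1:8020/user/admin/test_out/").collect().foreach(println) // each element is a "(word,count)" line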
