Packaging a Spark Maven project and running it with spark-submit

  • Project directory name: countjpgs
  • pom.xml file (located in the project root directory)
  • countjpgs => src => main => scala => stubs => CountJPGs.scala
  • The weblogs data is stored on HDFS under /loudacre; it is a set of web log files containing various requests (see the quick check below).

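Before building anything, it is worth confirming that the input actually exists on HDFS. A minimal check, assuming the weblogs files were already uploaded under /loudacre/weblogs:

$ hdfs dfs -ls /loudacre/weblogs | head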

Contents of pom.xml:


<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.cloudera.training.dev1</groupId>
  <artifactId>countjpgs</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>
  <name>Count JPGs</name>

  <properties>
    <spark-assembly>/usr/lib/spark/lib/spark-assembly.jar</spark-assembly>
    <hadoop-mapreduce-client-common>/usr/lib/hadoop/client/hadoop-mapreduce-client-common.jar</hadoop-mapreduce-client-common>
    <hadoop-mapreduce-client-core>/usr/lib/hadoop/client/hadoop-mapreduce-client-core.jar</hadoop-mapreduce-client-core>
    <hadoop-common>/usr/lib/hadoop/client/hadoop-common.jar</hadoop-common>
    <avro>/usr/lib/hadoop/client/avro.jar</avro>
    <commons-lang>/usr/lib/hadoop/client/commons-lang.jar</commons-lang>
    <guava>/usr/lib/hadoop/client/guava.jar</guava>
    <slf4j-api>/usr/lib/hadoop/client/slf4j-api.jar</slf4j-api>
    <slf4j-log4j12>/usr/lib/hadoop/client/slf4j-log4j12.jar</slf4j-log4j12>
    <hadoop-annotations>/usr/lib/hadoop/client/hadoop-annotations.jar</hadoop-annotations>
  </properties>

  <repositories>
    <repository>
      <id>apache-repo</id>
      <name>Apache Repository</name>
      <url>https://repository.apache.org/content/repositories/releases</url>
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
    <repository>
      <id>cloudera-repo-releases</id>
      <url>https://repository.cloudera.com/artifactory/repo/</url>
    </repository>
  </repositories>

  <build>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <version>2.15.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.5.1</version>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
    </plugins>
  </build>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>2.10.5</version>
      <scope>system</scope>
      <systemPath>${spark-assembly}</systemPath>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>local</version>
      <scope>system</scope>
      <systemPath>${spark-assembly}</systemPath>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>local</version>
      <scope>system</scope>
      <systemPath>${hadoop-common}</systemPath>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-common</artifactId>
      <version>local</version>
      <scope>system</scope>
      <systemPath>${hadoop-mapreduce-client-common}</systemPath>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-annotations</artifactId>
      <version>local</version>
      <scope>system</scope>
      <systemPath>${hadoop-annotations}</systemPath>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>avro</artifactId>
      <version>local</version>
      <scope>system</scope>
      <systemPath>${avro}</systemPath>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>slf4j-log4j12</artifactId>
      <version>local</version>
      <scope>system</scope>
      <systemPath>${slf4j-log4j12}</systemPath>
    </dependency>
  </dependencies>
</project>
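Every dependency above uses scope system, so Maven compiles against the Spark and Hadoop jars already installed on the (Cloudera training) machine rather than downloading artifacts from a repository. If your jars live elsewhere, adjust the paths in the properties block; a quick sanity check that the two most important ones are present:

$ ls /usr/lib/spark/lib/spark-assembly.jar /usr/lib/hadoop/client/hadoop-common.jar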

Contents of CountJPGs.scala:

package stubs

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object CountJPGs {
  def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println("Usage: CountJPGs <logfile>")
      System.exit(1)
    }

    // Master, app name, etc. are supplied by spark-submit
    val sc = new SparkContext()

    // First argument is the input path, e.g. /loudacre/weblogs/*
    val logfile = args(0)
    val weblogs = sc.textFile(logfile)

    // Count the requests for .jpg files
    val weblogsJpg = weblogs.filter(_.contains(".jpg"))
    val weblogsJpgCount = weblogsJpg.count()
    println("JPG Count : " + weblogsJpgCount)

    sc.stop()
  }
}
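Before packaging, the same logic can be tried interactively in spark-shell (a quick sketch; spark-shell already provides the SparkContext as sc):

$ spark-shell
scala> val weblogs = sc.textFile("/loudacre/weblogs/*")
scala> weblogs.filter(_.contains(".jpg")).count()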

Change into the project root directory, countjpgs:

$ cd <path to project>/countjpgs

Package the application:

$ mvn package

After a successful build, the jar is generated in the target folder and is named after the project (here countjpgs-1.0.jar).

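A quick way to confirm the artifact; the file name follows the artifactId and version declared in pom.xml:

$ ls target/*.jar
target/countjpgs-1.0.jar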

Again, from the project root directory countjpgs:

$ cd <path to project>/countjpgs

Run the program with spark-submit:

$ spark-submit --class stubs.CountJPGs target/countjpgs-1.0.jar /loudacre/weblogs/*

Output: among the Spark log messages you should see the line produced by the program, of the form JPG Count : <n>.

 

Addendum: the command to submit the job to run on a YARN cluster:

$ spark-submit --class stubs.CountJPGs --master yarn-client --name 'Count JPGs' target/countjpgs-1.0.jar /loudacre/weblogs/*
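With --master yarn-client the driver runs on the local machine, so the JPG count is printed straight to your terminal. In Spark 1.x you could instead submit with --master yarn-cluster (a sketch under that assumption); the driver then runs inside the cluster and the println output ends up in the application logs rather than on your console:

$ spark-submit --class stubs.CountJPGs --master yarn-cluster --name 'Count JPGs' target/countjpgs-1.0.jar /loudacre/weblogs/*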

You can also create a properties file in the project root and reference it when calling spark-submit:

$ vim myspark.conf

Contents of the file:

spark.app.name My Spark App
spark.master yarn-client
spark.executor.memory 400M

Launch command:

$ spark-submit --properties-file myspark.conf --class stubs.CountJPGs target/countjpgs-1.0.jar /loudacre/weblogs/*
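If you prefer not to keep a properties file, the same settings can be passed directly as spark-submit flags (equivalent form, shown as a sketch):

$ spark-submit --master yarn-client --name 'My Spark App' --conf spark.executor.memory=400m --class stubs.CountJPGs target/countjpgs-1.0.jar /loudacre/weblogs/*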

You should then see these settings reflected in the YARN web UI.
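The same information is also available from the command line via the standard YARN client; the application id below is a placeholder to be copied from the list output:

$ yarn application -list
$ yarn logs -applicationId <application_id>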
