Running a Spark Streaming job on Azure Databricks

First we need to create a Scala project with Maven; the detailed steps are described at https://docs.scala-lang.org/tutorials/scala-with-maven.html

Then open this Maven project in IntelliJ IDEA. In the root directory there is a pom.xml file, to which we add the dependencies our project needs. Since we are building a Spark Streaming project, we must add the corresponding Spark packages. Note the role of the scope element: in my experiments, when it is set to provided, the dependency is not bundled into the with-dependencies jar that the package step produces.

    
        
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.scalatest</groupId>
            <artifactId>scalatest_2.11</artifactId>
            <version>2.2.4</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
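The ${spark.version} property referenced above must be defined in the pom's properties section. The exact version used in the original project is not shown, so the value below is only a sketch assuming a Spark 2.x build for Scala 2.11; adjust it to match your Databricks runtime.

    <properties>
        <!-- assumed version; align with the Spark version of your Databricks cluster -->
        <spark.version>2.3.0</spark.version>
    </properties>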

The repository section tells Maven where to look up the packages we depend on. It can point to a local folder that already holds the downloaded artifacts, or to a remote server. Some remote repositories require authentication, which is configured in settings.xml inside the .m2 folder under your user-profile directory.

    
        
    <repositories>
        <repository>
            <id>my-local-repo</id>
            <url>file://${basedir}/repo</url>
        </repository>
        <repository>
            <id>apache.snapshots</id>
            <name>Apache Development Snapshot Repository</name>
            <url>https://repository.apache.org/content/repositories/snapshots/</url>
            <releases>
                <enabled>false</enabled>
            </releases>
            <snapshots>
                <enabled>true</enabled>
            </snapshots>
        </repository>
        <repository>
            <id>xxx</id>
            <url>https://xxx.visualstudio.com/_packaging/xx/maven/v1</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>true</enabled>
            </snapshots>
        </repository>
    </repositories>
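For a repository that requires authentication, such as the visualstudio.com feed above, the credentials go into a server entry in settings.xml whose id matches the repository id in the pom. The snippet below is only a sketch: Azure DevOps Maven feeds typically authenticate with a personal access token, and the username and token here are placeholders.

    <settings>
        <servers>
            <server>
                <!-- the id must match the repository id declared in pom.xml -->
                <id>xxx</id>
                <!-- placeholder credentials: substitute your own account name and personal access token -->
                <username>xxx</username>
                <password>YOUR_PERSONAL_ACCESS_TOKEN</password>
            </server>
        </servers>
    </settings>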

Once the project is ready, we can add source files under the scala folder. Here the familiar NetworkWordCount example is used to write a simple program. Because a SparkContext already exists on Databricks and calling new SparkContext is not supported, we replace that call with SparkContext.getOrCreate. The following program has been verified to run on Azure Databricks.

package org.twenz

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

object NetworkWordCount extends App {
  val checkpointDirectory = "/usr/twenz/test"
  val sentences = Array(
    "This is a probe for test spark streaming",
    "The cow jumped over the moon",
    "An apple a day keeps the doctor away",
    "Four score and seven years ago",
    "Snow white and the seven dwarfs",
    "I am at two with nature")
  // Function to create and setup a new StreamingContext
  def functionToCreateContext(): StreamingContext = {
    val sparkConf = new SparkConf(true).setAppName("NetworkWordCount").set("spark.streaming.unpersist","true")
    val ssc = new StreamingContext(SparkContext.getOrCreate(sparkConf), Seconds(30))   // new context
    ssc.checkpoint(checkpointDirectory)   // set checkpoint directory
    ssc
  }

  val ssc = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
  val rdd = ssc.sparkContext.parallelize(sentences)
  val lines = new ConstantInputDStream(ssc, rdd) // ConstantInputDStream emits the same RDD every batch interval, replacing the socket source of the original example
  val words = lines.flatMap(_.split(" "))
  val pairs = words.map(word => (word, 1))
  val wordCounts = pairs.reduceByKey(_ + _)
  // Print the first ten elements of each RDD generated in this DStream to the console
  wordCounts.print()

  ssc.start() // Start the computation
  ssc.awaitTermination() // Wait for the computation to terminate
  ssc.stop()
}
In IntelliJ IDEA, open the Maven Projects tool window via View -> Tool Windows -> Maven Projects and run package under Lifecycle. This produces two jar files in the target directory, one with dependencies and one without. Next, create a job on the Databricks site, upload the jar with dependencies, and set the Main class to the entry point above, org.twenz.NetworkWordCount. Once the job is running, you can open its log to see the streaming job's output.
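The with-dependencies jar is not produced by default; the pom's build section is not shown above, but a common way to get it is the maven-assembly-plugin with the jar-with-dependencies descriptor, sketched here and bound to the package phase so that a single package run yields both jars:

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <!-- produces the extra *-jar-with-dependencies.jar in target -->
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>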
