Two ways for SparkStreaming to get a data source (listening on a port, and integrating Kafka)

Method 1: listen on a port. This approach requires first starting an nc -lk <port> service on the Linux host; SparkStreaming can then pull data from that port and process it in real time. The code is as follows:

import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[*]")
    val sc: SparkContext = new SparkContext(conf)
    // Batch interval of 5 seconds: each micro-batch collects 5 seconds of data
    val ssc: StreamingContext = new StreamingContext(sc, Seconds(5))
    // Receive text lines from the nc service listening on host "linux01", port 8888
    val dStream: ReceiverInputDStream[String] = ssc.socketTextStream("linux01", 8888)
    // Classic word count: split lines into words, pair each word with 1, then sum by key
    val dStream2: DStream[String] = dStream.flatMap(_.split(" "))
    val dStream3: DStream[(String, Int)] = dStream2.map((_, 1))
    val reduced: DStream[(String, Int)] = dStream3.reduceByKey(_ + _)
    reduced.print()
    // Start the streaming job and block until it is stopped
    ssc.start()
    ssc.awaitTermination()
  }
}
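
To test method 1, start the listener first (for example nc -lk 8888 on the host referred to above as linux01), then type space-separated words into that terminal; each 5-second batch prints its word counts to the console.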

Method 2: integrate SparkStreaming with Kafka. This approach requires a running Kafka cluster. The code is as follows:

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingKafkaWordCount {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[*]")
    val ssc: StreamingContext = new StreamingContext(conf, Seconds(5))
    // The default log level (INFO) prints far more than we need on the console, so raise it to WARN and show only the results
    ssc.sparkContext.setLogLevel("WARN")
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "test01:9092,test02:9092,test03:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "1",
      "auto.offset.reset" -> "earliest",
      "enable.auto.commit" -> (true: java.lang.Boolean)
    )

    val topics = Array("wordcount")
    /*
      KafkaUtils.createDirectStream parameters:
        ssc: StreamingContext,
        locationStrategy: LocationStrategy,
        consumerStrategy: ConsumerStrategy[K, V]
     */
    // Create the initial DStream by calling createDirectStream on the KafkaUtils object.
    // The direct approach uses Kafka's lower-level consumer API and reads from the partition leaders directly, which is more efficient.
    val value: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
    // Parameter 1: the StreamingContext
      ssc,
    // Parameter 2: the location strategy. PreferConsistent distributes the Kafka partitions evenly across the available executors;
    // if the executors run on the same machines as the Kafka brokers, PreferBrokers can be used instead for better data locality
      LocationStrategies.PreferConsistent,
    // Parameter 3: the consumer strategy. Subscribe takes two arguments: the Kafka topics to consume, and the Kafka configuration map
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
    )
    // A map is required here: printing the stream directly throws an "object not serializable" exception,
    // because the InputDStream above holds ConsumerRecord objects, which are not serializable, so extract the key or the value before printing
    val value1: DStream[String] = value.map(cr => {
      cr.value()
    })
    value1.print()
    ssc.start()
    ssc.awaitTermination()

  }
}
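
The Kafka example above only prints the raw message values. Since the job is meant to be a word count, the same flatMap / map / reduceByKey chain from method 1 can be applied to value1 before ssc.start(); a minimal sketch (the variable names here are illustrative):

    // Word-count sketch built on value1; replace value1.print() above with these lines
    val words: DStream[String] = value1.flatMap(_.split(" "))
    val pairs: DStream[(String, Int)] = words.map((_, 1))
    val wordCounts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)
    wordCounts.print()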
