Consuming Kafka data from specified offsets in Spark Streaming

I. Preface

When a Spark Streaming application exits unexpectedly, data keeps being pushed into Kafka, but because the job by default starts reading from the latest offset when it comes back up, the records produced in the meantime are skipped and effectively lost. To avoid losing data, we need to record the offset reached by each batch we consume, so that the next run can check that record and resume reading from the specified offsets.
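Sketched below, as a minimal illustration, is the general shape of that record-and-resume pattern: after every batch, persist the offset each partition has reached, and on startup read those values back to build the starting offsets. The OffsetStoreSketch object and its file-backed helpers are assumptions made for illustration only, not part of the original program; the full working Spark Streaming example follows in section III.

import java.io.{File, PrintWriter}
import scala.io.Source

// Illustrative helper (hypothetical): stores (topic, partition) -> next offset to read.
// In production the same idea is usually backed by ZooKeeper, HBase, or a database.
object OffsetStoreSketch {

  // Persist the offsets reached by the last processed batch.
  def saveOffsets(path: String, offsets: Map[(String, Int), Long]): Unit = {
    val writer = new PrintWriter(new File(path))
    try {
      offsets.foreach { case ((topic, partition), offset) =>
        writer.println(s"$topic,$partition,$offset")
      }
    } finally {
      writer.close()
    }
  }

  // Load the previously saved offsets; an empty map means there was no previous run.
  def loadOffsets(path: String): Map[(String, Int), Long] = {
    val file = new File(path)
    if (!file.exists()) {
      Map.empty
    } else {
      Source.fromFile(file).getLines().map { line =>
        val Array(topic, partition, offset) = line.split(",")
        (topic, partition.toInt) -> offset.toLong
      }.toMap
    }
  }
}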

II. Environment

kafka-0.9.0, spark-1.6.0, jdk-1.7, scala-2.10.5, IDEA 16

III. Implementation

1. Add the Spark and Kafka dependencies (pom.xml)
  
  
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.ngaa</groupId>
    <artifactId>test-my</artifactId>
    <version>1.0-SNAPSHOT</version>
    <inceptionYear>2008</inceptionYear>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
        <encoding>UTF-8</encoding>
        <scala.version>2.10.5</scala.version>
        <scala211.version>2.11.7</scala211.version>
        <jackson.version>2.3.0</jackson.version>
        <slf4j-version>1.7.20</slf4j-version>
        <spark.cdh.version>1.6.0-cdh5.8.0</spark.cdh.version>
        <spark.streaming.cdh.version>1.6.0-cdh5.8.0</spark.streaming.cdh.version>
        <kafka.spark.cdh.version>1.6.0-cdh5.8.0</kafka.spark.cdh.version>
        <hadoop.cdh.version>2.6.0-cdh5.8.0</hadoop.cdh.version>
        <httpclient.version>4.2.5</httpclient.version>
        <httpcore.version>4.2.5</httpcore.version>
        <fastjson.version>1.1.39</fastjson.version>
    </properties>

    <repositories>
        <repository>
            <id>scala-tools.org</id>
            <name>Scala-Tools Maven2 Repository</name>
            <url>http://scala-tools.org/repo-releases</url>
        </repository>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>

    <pluginRepositories>
        <pluginRepository>
            <id>scala-tools.org</id>
            <name>Scala-Tools Maven2 Repository</name>
            <url>http://scala-tools.org/repo-releases</url>
        </pluginRepository>
    </pluginRepositories>

    <dependencies>
        <!-- fastjson -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>${fastjson.version}</version>
        </dependency>
        <!-- httpclient -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>${httpclient.version}</version>
        </dependency>
        <!-- httpcore -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpcore</artifactId>
            <version>${httpcore.version}</version>
        </dependency>
        <!-- logging -->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>${slf4j-version}</version>
        </dependency>
        <!-- hadoop, with servlet classes excluded to avoid conflicts -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.cdh.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>javax.servlet</groupId>
                    <artifactId>*</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.cdh.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>javax.servlet</groupId>
                    <artifactId>*</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.cdh.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>javax.servlet</groupId>
                    <artifactId>*</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <!-- scala and jackson -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>${jackson.version}</version>
        </dependency>
        <!-- spark streaming and the kafka integration -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>${spark.streaming.cdh.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka_2.10</artifactId>
            <version>${kafka.spark.cdh.version}</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
        <!-- local spark-assembly jar, referenced with system scope -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-assembly_2.10</artifactId>
            <version>${spark.cdh.version}</version>
            <scope>system</scope>
            <systemPath>D:/crt_send_document/spark-assembly-1.6.0-cdh5.8.0-hadoop2.6.0-cdh5.8.0.jar</systemPath>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-yarn-server-web-proxy</artifactId>
            <version>2.6.0-cdh5.8.0</version>
        </dependency>
    </dependencies>

    <build>
        <finalName>test-my</finalName>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <scalaVersion>${scala.version}</scalaVersion>
                    <args>
                        <arg>-target:jvm-1.7</arg>
                    </args>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-eclipse-plugin</artifactId>
                <configuration>
                    <downloadSources>true</downloadSources>
                    <buildcommands>
                        <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
                    </buildcommands>
                    <additionalProjectnatures>
                        <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
                    </additionalProjectnatures>
                    <classpathContainers>
                        <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
                        <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
                    </classpathContainers>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

    <reporting>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <configuration>
                    <scalaVersion>${scala.version}</scalaVersion>
                </configuration>
            </plugin>
        </plugins>
    </reporting>
</project>

2. Create the test class
import kafka.common.TopicAndPartition  
import kafka.message.MessageAndMetadata  
import kafka.serializer.StringDecoder  
import org.apache.log4j.{Level, Logger}  
import org.apache.spark.{SparkConf, TaskContext}  
import org.apache.spark.streaming.dstream.InputDStream  
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}  
import org.apache.spark.streaming.{Seconds, StreamingContext}  
import org.slf4j.LoggerFactory  
  
/** 
  * Created by yangjf on 2016/12/18 
  * Update date: 
  * Time: 11:10 
  * Describe: read Kafka data from specified offsets 
  * Result of Test: 
  * Command: 
  * Email: [email protected] 
  */  
object ReadBySureOffsetTest {  
  val logger = LoggerFactory.getLogger(ReadBySureOffsetTest.getClass)  
  
  def main(args: Array[String]) {  
    // Set log levels for noisy packages
    Logger.getLogger("org.apache.kafka").setLevel(Level.ERROR)
    Logger.getLogger("org.apache.zookeeper").setLevel(Level.ERROR)
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    logger.info("Main program started: consuming Kafka from specified offsets")
    if (args.length < 1) {
      System.err.println("Your arguments were " + args.mkString(","))
      logger.info("Main program exiting: a checkpoint directory argument is required")
      System.exit(1)
    }
    // e.g. hdfs://hadoop1:8020/user/root/spark/checkpoint
    val Array(checkpointDirectory) = args
    logger.info("Checkpoint directory: " + checkpointDirectory)
    val ssc = StreamingContext.getOrCreate(checkpointDirectory,
      () => {
        createContext(checkpointDirectory)
      })
    logger.info("Starting the streaming context")
    ssc.start()  
    ssc.awaitTermination()  
  }  
  
  def createContext(checkpointDirectory: String): StreamingContext = {  
    // Configuration
    val brokers = "hadoop3:9092,hadoop4:9092"
    val topics = "20161218a"

    // Batch interval in seconds (5 by default in the original example; 8 is used here)
    val split_rdd_time = 8
    // Create the streaming context
    val sparkConf = new SparkConf()  
      .setAppName("SendSampleKafkaDataToApple").setMaster("local[2]")  
      .set("spark.app.id", "streaming_kafka")  
  
    val ssc = new StreamingContext(sparkConf, Seconds(split_rdd_time))  
  
    ssc.checkpoint(checkpointDirectory)  
  
    // Create a direct Kafka stream against the given brokers and topics
    val topicsSet: Set[String] = topics.split(",").toSet
    // Kafka configuration parameters
    val kafkaParams: Map[String, String] = Map[String, String](
      "metadata.broker.list" -> brokers,
      "group.id" -> "apple_sample",
      "serializer.class" -> "kafka.serializer.StringEncoder"
//      "auto.offset.reset" -> "largest"   // automatically reset to the latest offset (the default)
//      "auto.offset.reset" -> "earliest"  // automatically reset to the earliest offset
//      "auto.offset.reset" -> "none"      // throw an exception if no previous offset is found for the consumer group
    )
    /**
      * Read Kafka data starting from the specified offsets.
      * Note: the direct approach lets Spark Streaming read each record from Kafka exactly once,
      *       but end-to-end exactly-once still requires idempotent or transactional output.
      *       With the starting offsets specified, reading resumes from where the previous
      *       streaming run stopped.
      */
    val offsetList = List((topics, 0, 22753623L), (topics, 1, 327041L))                            // (topic, partition, starting offset)
    val fromOffsets = setFromOffsets(offsetList)                                                   // build the fromOffsets map
    val messageHandler = (mam: MessageAndMetadata[String, String]) => (mam.topic, mam.message())   // map each MessageAndMetadata to (topic, message)
    // Use KafkaUtils.createDirectStream to consume from the specified offsets; for details see
    // "http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$"
    val messages: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](ssc, kafkaParams, fromOffsets, messageHandler)  
  
    // Process the data
    messages.foreachRDD(mess => {
      // Get the offset ranges of this batch
      val offsetsList = mess.asInstanceOf[HasOffsetRanges].offsetRanges
      mess.foreachPartition(lines => {
        // Look up the offset range of the partition this task is processing
        val o: OffsetRange = offsetsList(TaskContext.get.partitionId)
        lines.foreach(line => {
          logger.info("++++++++++++++++++++++++++++++ offsets are recorded here +++++++++++++++++++++++++++++++++++++++")
          logger.info(s"${o.topic}  ${o.partition}  ${o.fromOffset}  ${o.untilOffset}")
          logger.info("+++++++++++++++++++++++++++++++ data is consumed here ++++++++++++++++++++++++++++++++++++++")
          logger.info("The kafka line is " + line)
        })
      })
    })
    ssc  
  }  
  
  // Build the map of TopicAndPartition -> starting offset
  def setFromOffsets(list: List[(String, Int, Long)]): Map[TopicAndPartition, Long] = {
    var fromOffsets: Map[TopicAndPartition, Long] = Map()
    for (offset <- list) {
      val tp = TopicAndPartition(offset._1, offset._2) // topic and partition id
      fromOffsets += (tp -> offset._3)                 // starting offset
    }  
    fromOffsets  
  }  
}  
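
In the example above the offsets of each batch are only written to the log. To actually resume after a restart, they also need to be persisted somewhere the driver can read them back, whether that is ZooKeeper, HBase, a database, or the simple file-backed OffsetStoreSketch sketched in the preface. A minimal sketch of that step, assuming the hypothetical OffsetStoreSketch helper (not part of the original code) and placed inside createContext after the stream has been created:

    // Sketch: once a batch has been processed, record on the driver where the next run should resume.
    messages.foreachRDD(rdd => {
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... normal per-batch processing of rdd goes here ...
      // untilOffset is the first offset that was NOT read, i.e. where the next run should start.
      val reached: Map[(String, Int), Long] =
        offsetRanges.map(o => (o.topic, o.partition) -> o.untilOffset).toMap
      OffsetStoreSketch.saveOffsets("/tmp/kafka_offsets.txt", reached)
    })

On the next start, loadOffsets could then be used to build the offsetList passed to setFromOffsets instead of hard-coding values such as 22753623L and 327041L.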

IV. References

1. Spark API: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$

2. Kafka configuration documentation: http://kafka.apache.org/documentation.html#configuration

3. Kafka SimpleConsumer example: https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example

4. Spark Streaming + Kafka integration guide (offset handling): http://spark.apache.org/docs/1.6.0/streaming-kafka-integration.html

5. Kafka consumer API (0.9.0): http://kafka.apache.org/090/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html

Source: https://blog.csdn.net/high2011/article/details/53706446
