Notes on connecting Spark Streaming to Kafka

Background

The project needed Spark Streaming to consume from Kafka. I assumed it would be simple, but it turned out to cause quite a bit of trouble.

Versions

Scala 2.10, Kafka 2.11.0-0.11.0.0, JDK 1.8

POM dependencies


    <modelVersion>4.0.0</modelVersion>

    ..............

    <repositories>
        <repository>
            <id>Hortonworks</id>
            <url>http://repo.hortonworks.com/content/repositories/releases/</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>0.11.0.0</version>
            <exclusions>
                <exclusion>
                    <groupId>com.fasterxml.jackson.core</groupId>
                    <artifactId>*</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.51</version>
        </dependency>

        ..............

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>2.1.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.10</artifactId>
            <version>2.1.0</version>
            <exclusions>
                <exclusion>
                    <groupId>org.apache.kafka</groupId>
                    <artifactId>kafka-clients</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>io.netty</groupId>
            <artifactId>netty-all</artifactId>
            <version>4.0.50.Final</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.6.6</version>
        </dependency>
    </dependencies>

Basic skeleton of the Spark program

        SparkConf conf = new SparkConf()
                .setMaster("local[2]")
                .setAppName("ActionConsumer")
                .set("spark.serializer", KryoSerializer.class.getCanonicalName())
                .registerKryoClasses(new Class[]{ConsumerRecord.class})
                .set("spark.kryoserializer.buffer.max", "512m");

        // Graceful shutdown: avoids duplicate consumption and data loss in Kafka
        // when YARN kills the job while a batch is still being processed
        conf.set("spark.streaming.stopGracefullyOnShutdown", "true");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaStreamingContext ssc = new JavaStreamingContext(sc, Durations.seconds(1));

        Set<String> topicsSet = Collections.singleton("test0");
        // Kafka parameters -- required; missing entries cause errors
        Map<String, Object> kafkaParams = new HashMap<String, Object>();
        kafkaParams.put("bootstrap.servers", KafkaConfig.sBOOTSTRAP_SERVER);
        kafkaParams.put("group.id", "group2");
        kafkaParams.put("key.deserializer", StringDeserializer.class.getCanonicalName());
        kafkaParams.put("value.deserializer", StringDeserializer.class.getCanonicalName());
        kafkaParams.put("enable.auto.commit", true);

        final JavaInputDStream<ConsumerRecord<String, String>> kafkaSteam =
                KafkaUtils.createDirectStream(
                        ssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(topicsSet, kafkaParams)
                );

        kafkaSteam.foreachRDD(new VoidFunction2<JavaRDD<ConsumerRecord<String, String>>,
                        Time>() {
            public void call(JavaRDD<ConsumerRecord<String, String>> consumerRecordJavaRDD,
                             Time time) throws Exception {
                if (consumerRecordJavaRDD.rdd().count() > 0) {
                    OffsetRange[] offsetRanges = ((HasOffsetRanges) consumerRecordJavaRDD
                            .rdd()).offsetRanges();

                    final long recordCount = consumerRecordJavaRDD.count();
                    List<ConsumerRecord<String, String>> records =
                            consumerRecordJavaRDD.take((int) recordCount);
                    for (ConsumerRecord<String, String> record : records) {
                        // Read the Kafka message and handle the business logic
                        JSONObject obj = JSON.parseObject(record.value().toString());
                        System.out.println(obj);
                    }

                    ((CanCommitOffsets) kafkaSteam.inputDStream()).commitAsync(offsetRanges);
                }
            }
        });

        ssc.start();
        ssc.awaitTermination();
        ssc.close();

Problems encountered

1) Class not found: org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RetrieveSparkAppConfig$

Solution: change the Spark dependencies in the pom to the 2.10 builds (the _2.10 artifacts shown in the pom above).

 

2) Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Solution: previously I connected to a remote Spark cluster from inside the program, i.e. passed a remote master URL to setMaster(). I changed the argument to local[2] instead; since the jar will be run on the server anyway, this makes little practical difference from running locally.
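A minimal sketch of the change; the remote master URL below is only an illustrative placeholder, not the address I actually used:

        // Before (illustrative): driver pointed at a remote standalone master
        // SparkConf conf = new SparkConf().setMaster("spark://remote-host:7077").setAppName("ActionConsumer");

        // After: run with two local worker threads; the jar is submitted on the
        // server later anyway, so this behaves much like the remote setup
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("ActionConsumer");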

 

3) NoSuchMethodError: javax.servlet.http.HttpServletResponse.getStatus()I

Solution: after changing the pom dependencies as shown above, this problem quietly went away; there was no need to add a separate javax.servlet (HttpServletRequest) dependency.

 

4) requirement failed: No output operations registered, so nothing to execute

Solution: the stream must have at least one output operation registered, such as print() or the foreachRDD call shown above. See the sketch below.
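A minimal sketch of registering an output operation, assuming the kafkaSteam stream from the skeleton above; just logging the size of each micro-batch is enough to satisfy this requirement:

        // Any output operation works; here each micro-batch's record count is printed
        kafkaSteam.foreachRDD(new VoidFunction2<JavaRDD<ConsumerRecord<String, String>>, Time>() {
            @Override
            public void call(JavaRDD<ConsumerRecord<String, String>> rdd, Time time) {
                System.out.println("batch at " + time + " contains " + rdd.count() + " records");
            }
        });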

 

5) JsonMappingException: Incompatible Jackson version: 2.7.8

Solution: a Jackson dependency conflict; copy the relevant parts of the pom dependencies above (repeated below).
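For reference, these are the Jackson-related pieces from the pom above: Jackson is excluded from kafka-clients and jackson-databind is pinned to 2.6.6.

        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>0.11.0.0</version>
            <exclusions>
                <exclusion>
                    <groupId>com.fasterxml.jackson.core</groupId>
                    <artifactId>*</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.6.6</version>
        </dependency>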

 

6) send RPC to ... java.lang.AbstractMethodError: org.apache.spark.network.protocol.MessageWithHeader.touc

Solution: configure things as I did above: the pom and the master setting.

 

7) NoSuchMethodError: io.netty.buffer.PooledByteBufAllocator.metric()Lio/netty/buffer/PooledByteBufAllo

Solution: a Netty dependency conflict; copy the relevant part of the pom dependencies above (repeated below).
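For reference, this is the Netty pin from the pom above:

        <dependency>
            <groupId>io.netty</groupId>
            <artifactId>netty-all</artifactId>
            <version>4.0.50.Final</version>
        </dependency>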

 

8) ClassNotFoundException: org.apache.spark.streaming.kafka010.KafkaRDDPartition

Solution: downgrade Scala to 2.10 (I was previously on 2.11). See the excerpt below.
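In the pom above this corresponds to using the _2.10 builds of the Spark artifacts (exclusions omitted here; see the full pom above):

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>2.1.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.10</artifactId>
            <version>2.1.0</version>
        </dependency>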

 

9) object not serializable (class: org.apache.kafka.clients.consumer.ConsumerRecord)

This error appears when trying to output the contents of the RDD.

Solution: as at the top of the program skeleton above, configure the Kryo serializer on the SparkConf:

        SparkConf conf = new SparkConf()
                ...
                .set("spark.serializer", KryoSerializer.class.getCanonicalName())
                .registerKryoClasses(new Class[]{ConsumerRecord.class})
                ...

 

10) Error: ...To avoid this, increase spark.kryoserializer.buffer.max value...

Solution: the Kryo buffer is too small; as at the top of the program skeleton above, raise the maximum Kryo buffer size on the SparkConf:

        SparkConf conf = new SparkConf()
                ......
                .set("spark.kryoserializer.buffer.max", "512m");

 

11) NoSuchMethodError: scala.collection.immutable.HashSet$.empty()Lscala/collection/immutable/HashSet;

Solution: downgrade Scala to 2.10 (I was previously on 2.11), the same as in problem 8.

 

12) The Spark Streaming map function never executes; a breakpoint inside it is never hit, even when data is arriving.

Solution: as in the program skeleton above, switch to foreachRDD to work on each RDD, use the RDD's count() to find how many records were fetched, and then take() that many records. A condensed sketch follows.
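A condensed sketch of the same pattern used in the skeleton above, assuming the kafkaSteam stream defined there:

        // Instead of kafkaSteam.map(...), pull the records out inside foreachRDD
        kafkaSteam.foreachRDD(new VoidFunction2<JavaRDD<ConsumerRecord<String, String>>, Time>() {
            @Override
            public void call(JavaRDD<ConsumerRecord<String, String>> rdd, Time time) {
                long count = rdd.count();                       // records fetched in this batch
                if (count > 0) {
                    for (ConsumerRecord<String, String> record : rdd.take((int) count)) {
                        System.out.println(record.value());     // process each record on the driver
                    }
                }
            }
        });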

 

13) Spark Streaming loses data

Solution: offsets were not being committed promptly; commit them as in this part of the code above:

        kafkaSteam.foreachRDD(new VoidFunction2<JavaRDD<ConsumerRecord<String, String>>,
                        Time>() {
            public void call(JavaRDD<ConsumerRecord<String, String>> consumerRecordJavaRDD,
                             Time time) throws Exception {
                if (consumerRecordJavaRDD.rdd().count() > 0) {
                    OffsetRange[] offsetRanges = ((HasOffsetRanges) consumerRecordJavaRDD
                            .rdd()).offsetRanges();

                    ......

                    ((CanCommitOffsets) kafkaSteam.inputDStream()).commitAsync(offsetRanges);
                }
            }
        });

At the same time, keep auto-commit enabled:

kafkaParams.put("enable.auto.commit", true);

After repeated tests I found that if data is sent to Kafka while the streaming job is stopped, the job picks up all the new data from where it left off once it is restarted, so the two settings above do not conflict.

Closing notes

Note from 2020-07-22: after everything compiled and ran fine with Scala 2.10, I switched the Scala version back to 2.11, rebuilt and reran, and to my surprise it now works without errors. Strange.
