Alibaba Cloud EMR: Consuming Kafka Data with Spark Streaming

First, a quick gripe about Alibaba Cloud: getting even this simple demo working was exasperating.

The Kafka offering in the message queue service had problems of its own, and the 3.30 upgrade shipped with no documentation on this topic either. Back to the point:

This article describes how to connect Spark Streaming on an Alibaba Cloud EMR cluster to Alibaba Cloud's Message Queue for Apache Kafka service.

1. Kafka configuration

    In the Message Queue for Apache Kafka console, create a topic and a consumer group ID.

    (1) For testing, the topic's public (Internet) endpoint is recommended.

    (2) Alibaba Cloud requires two group IDs, because the Spark Kafka 0-10 integration prepends spark-executor- to the configured group.id for the consumers it creates on the executors:

    one for the executors: spark-executor-CID-xxxx
    one for the driver: CID-xxxx
    the group ID used by the consumer in code: CID-xxxx
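
    For example, with the consumer ID used later in this demo, CID-real-log-test, you would create both CID-real-log-test (driver / code) and spark-executor-CID-real-log-test (executors) in the console.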

2. Configure the root certificate (JKS truststore) and the JAAS config file; in this example both are placed in the /kafka/my directory on the server.

The format of kafka_client_jaas.conf is as follows:

KafkaClient {
        com.aliyun.openservices.ons.sasl.client.OnsLoginModule required
        AccessKey="xxxx"
        SecretKey="xxxx";
};
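
If you cannot pass -Djava.security.auth.login.config on the command line (for example for a quick local test from the IDE), one option is to set the same system property in code before the consumer is created; the commented-out block in section 5 does exactly this. A minimal sketch:

    // set the JAAS config path programmatically (local testing only);
    // assumes the file exists at this path on the machine running the driver
    val jaasConf = "/kafka/my/kafka_client_jaas.conf"
    if (System.getProperty("java.security.auth.login.config") == null) {
      System.setProperty("java.security.auth.login.config", jaasConf)
    }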

3. Build the kafkaParams map

     (1) bootstrap.servers: prefix the original endpoint with SASL_SSL://

     (2) group.id: use the driver's consumer group ID, without the spark-executor- prefix (Spark adds that prefix itself for the executor consumers).
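
Putting the two settings together (the full map is built in section 5; the endpoint and group ID below are the ones used in this demo):

     // minimal sketch of the two kafkaParams entries discussed above
     val kafkaParams = Map[String, Object](
       "bootstrap.servers" -> "SASL_SSL://kafka-cn-internet.aliyun.com:8080",
       "group.id" -> "CID-real-log-test"  // driver group ID, no spark-executor- prefix
     )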

4. Running

4.1 Local mode: just run it directly.

4.2 Cluster mode

(1) Every node needs the JKS and conf files in the same directory.

    When copying the files across the EMR cluster, remember to switch to the hadoop account.

    Setting the permissions to 777 is a convenient (if permissive) choice.

  (2) Submission options: pass them with --conf, or configure them in spark-defaults.conf:

--conf spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/kafka/my/kafka_client_jaas.conf
--conf spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/kafka/my/kafka_client_jaas.conf

 Tip: echo $SPARK_CONF_DIR prints the path of the Spark conf directory.

A complete spark-submit example:

spark-submit --class com.sd.App --master yarn --deploy-mode client \
  --driver-memory 2g --num-executors 2 --executor-memory 1g --executor-cores 2 \
  --conf spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/kafka/my/kafka_client_jaas.conf \
  --conf spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/kafka/my/kafka_client_jaas.conf \
  ossref://gm-big-data/onlytest/com.sd.re_test-1.0-shaded.jar

5. Core code snippets

  // imports needed by the snippets below
  import org.apache.kafka.common.serialization.StringDeserializer
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka010.KafkaUtils
  import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
  import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

  private def construcKafkaParams: Map[String, Object] = {

    val jks = "/kafka/my/kafka.client.truststore.jks"
    val jaas_conf = "/kafka/my/kafka_client_jaas.conf"

    // check whether the JAAS config was passed via -Djava.security.auth.login.config
    val conf = System.getProperty("java.security.auth.login.config")
    CommonFun.devinPrintln("jaas_conf init", conf)
    // fallback for local runs: set the property in code if it was not passed on the command line
//    if (null == conf) {
//      System.setProperty("java.security.auth.login.config", jaas_conf)
//    }

    // Alibaba Cloud public endpoint, prefixed with SASL_SSL:// as described in section 3
    val kServer = "SASL_SSL://kafka-cn-internet.aliyun.com:8080"
    // driver group ID; Spark prepends spark-executor- for the executor consumers
    val groupID = "CID-real-log-test"

    Map[String, Object](
      "bootstrap.servers" -> kServer,
      "ssl.truststore.location" -> jks,
      "ssl.truststore.password" -> "KafkaOnsClient",
      "security.protocol" -> "SASL_SSL",
      "sasl.mechanism" -> "ONS",
      "auto.commit.interval.ms" -> "1000",
      "session.timeout.ms" -> "30000",
      "enable.auto.commit" -> (false: java.lang.Boolean),
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> groupID,
//      "client.id" -> groupID,
      "auto.offset.reset" -> "latest"
    )
  }


  def testKafka(): Unit = {

    // setMaster("local") is for local testing only; remove it for cluster mode,
    // because a master set on SparkConf overrides the --master flag of spark-submit
    val sparkConf = new SparkConf().setAppName("testkafka").setMaster("local")
    val ssc = new StreamingContext(sparkConf, Seconds(3))

    val topics = Set("alikafka-real-test")
    val kafkaParams = construcKafkaParams

    // direct stream: executors consume their assigned partitions themselves
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    val trans = stream.transform { rdd =>
      val logCount = rdd.count()
      CommonFun.devinPrintln(s" have got log count ${logCount}")
      rdd.map { x => x.toString }
    }

    trans.foreachRDD { rdd =>
      rdd.foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
    ssc.stop()
  }
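
One caveat: kafkaParams sets enable.auto.commit to false, but the demo never commits offsets, so the consumer group's position in Kafka does not advance between runs. A minimal sketch of committing offsets back to Kafka after each batch, using the 0-10 integration's CanCommitOffsets API:

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    stream.foreachRDD { rdd =>
      // offset ranges are only available on the RDD produced directly by the stream
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch here ...
      // commit asynchronously once the batch has been handled
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }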

6. pom.xml dependency reference (${scala.compat.version} and ${spark.version} are Maven properties defined elsewhere in the pom)


    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_${scala.compat.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_${scala.compat.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-10_${scala.compat.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>

7. A complete demo is available in my CSDN resources:

        https://download.csdn.net/download/shuaidan19920412/10326950


