Today's topic is a simple integration of Kafka and Spark Streaming; I found an example online and tried to implement it.
1. Environment: Spark 2.2.0, Scala 2.11.8, kafka_2.10-0.10.2.1, JDK 1.8
2. Here is my pom.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>make</groupId>
  <artifactId>Spark_code_hive</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>

  <properties>
    <scala.version>2.11.8</scala.version>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <spark.version>2.2.0</spark.version>
    <hadoop.version>2.9.1</hadoop.version>
    <kafka.version>0.10.2.1</kafka.version>
  </properties>

  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.4</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs</groupId>
      <artifactId>specs</artifactId>
      <version>1.2.5</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka_2.11</artifactId>
      <version>${kafka.version}</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>org.spark-project.hive</groupId>
      <artifactId>hive-jdbc</artifactId>
      <version>1.2.1.spark2</version>
    </dependency>
    <dependency>
      <groupId>com.databricks</groupId>
      <artifactId>spark-csv_2.11</artifactId>
      <version>1.5.0</version>
    </dependency>
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.27</version>
    </dependency>
    <dependency>
      <groupId>com.alibaba</groupId>
      <artifactId>fastjson</artifactId>
      <version>1.2.47</version>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
          <args>
            <arg>-target:jvm-1.5</arg>
          </args>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-eclipse-plugin</artifactId>
        <configuration>
          <downloadSources>true</downloadSources>
          <buildcommands>
            <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
          </buildcommands>
          <additionalProjectnatures>
            <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
          </additionalProjectnatures>
          <classpathContainers>
            <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
            <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
          </classpathContainers>
        </configuration>
      </plugin>
    </plugins>
  </build>

  <reporting>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
        </configuration>
      </plugin>
    </plugins>
  </reporting>
</project>
3. Create a configuration file, my.properties, as follows; it holds the Kafka broker and topic settings:
# kafka configs
kafka.bootstrap.servers=make.spark.com:9092,make.spark.com:9093,make.spark.com:9094
kafka.topic.source=spark-kafka-demo
kafka.topic.sink=spark-sink-test
kafka.group.id=spark_demo_gid1
4. Now the code itself.
4.1 First, create a utility class for reading the my.properties configuration file:
package Utils
import java.util.Properties
/**
* Utility class for reading Properties
* Created by make on 2017-08-08 18:39
*/
object PropertiesUtil {
/**
* Load the Properties object from the configuration file
* @author make
* @return java.util.Properties
*/
def getProperties() :Properties = {
val properties = new Properties()
//Read my.properties from the resources folder on the classpath and load it into a Properties object
val reader = getClass.getResourceAsStream("/my.properties")
properties.load(reader)
properties
}
/**
* Get the String value for the given key
* @author make
* @return String
*/
def getPropString(key : String) : String = {
getProperties().getProperty(key)
}
/**
* Get the Int value for the given key (other typed getters may be added here later)
* @author yore
* @return Int
*/
def getPropInt(key : String) : Int = {
getProperties().getProperty(key).toInt
}
/**
* Get the Boolean value for the given key
* @author make
* @return Boolean
*/
def getPropBoolean(key : String) : Boolean = {
getProperties().getProperty(key).toBoolean
}
}
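As a quick sanity check, here is a minimal usage sketch (the PropertiesUtilDemo object is my own illustration, not part of the original project) that reads a couple of the keys defined in my.properties above:

package Utils

/**
 * Minimal usage sketch for PropertiesUtil (illustrative only).
 */
object PropertiesUtilDemo extends App {
  // Read the broker list and the source topic defined in my.properties
  val brokers = PropertiesUtil.getPropString("kafka.bootstrap.servers")
  val sourceTopic = PropertiesUtil.getPropString("kafka.topic.source")
  println(s"brokers = $brokers, source topic = $sourceTopic")
}

Note that getProperties() re-reads the file on every call, so values that are needed repeatedly are best read once into a val.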
4.2 Next, create a KafkaSink class that instantiates the producer and sends data to the target Kafka topic:
package spark_stream
import java.util.concurrent.Future
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord, RecordMetadata}
/**
* A hand-rolled KafkaSink class that instantiates a producer and sends data to the corresponding Kafka topic.
* This is the key idea that allows us to work around running into NotSerializableExceptions.
* Created by make on 2018-08-08 18:50
*/
class KafkaSink[K,V](createProducer: () => KafkaProducer[K, V]) extends Serializable {
//Create the producer lazily, on first use
lazy val producer = createProducer()
/** Send a message: delegates to producer.send */
def send(topic : String, key : K, value : V) : Future[RecordMetadata] =
producer.send(new ProducerRecord[K,V](topic,key,value))
def send(topic : String, value : V) : Future[RecordMetadata] =
producer.send(new ProducerRecord[K,V](topic,value))
}
//Companion object providing convenient factory methods for KafkaSink
object KafkaSink {
import scala.collection.JavaConversions._
def apply[K, V](config: Map[String, Object]): KafkaSink[K, V] = {
val createProducerFunc = () => {
val producer = new KafkaProducer[K, V](config)
sys.addShutdownHook {
// Ensure that, on executor JVM shutdown, the Kafka producer sends
// any buffered messages to Kafka before shutting down.
producer.close()
}
producer
}
//Return a KafkaSink wrapping the producer factory
new KafkaSink(createProducerFunc)
}
def apply[K, V](config: java.util.Properties): KafkaSink[K, V] = apply(config.toMap)
}
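Before wiring it into the streaming job, the sink can be tried on its own. The sketch below (KafkaSinkDemo is my own illustration, not part of the original post) builds a KafkaSink from producer Properties and sends a single test message to the sink topic; the .get() call simply blocks until the broker acknowledges the record:

package spark_stream

import java.util.Properties

import Utils.PropertiesUtil
import org.apache.kafka.common.serialization.StringSerializer

/**
 * Minimal standalone sketch (illustrative only): send one message through KafkaSink.
 */
object KafkaSinkDemo extends App {
  val props = new Properties()
  props.setProperty("bootstrap.servers", PropertiesUtil.getPropString("kafka.bootstrap.servers"))
  props.setProperty("key.serializer", classOf[StringSerializer].getName)
  props.setProperty("value.serializer", classOf[StringSerializer].getName)

  // The producer itself is only created when send() is first called (lazy val)
  val sink = KafkaSink[String, String](props)
  sink.send(PropertiesUtil.getPropString("kafka.topic.sink"), "hello from KafkaSinkDemo").get()
}

The reason for the factory-function indirection is that KafkaProducer itself is not serializable. In the streaming job below the KafkaSink is broadcast, so only the lightweight createProducer function is shipped to the executors and the real producer is built there on first use, which is what avoids the NotSerializableException.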
4.3 Finally, create the main application class:
package spark_stream
import java.util.Properties
import Utils.PropertiesUtil
import com.alibaba.fastjson.{JSON, JSONObject}
import org.apache.commons.lang3.StringUtils
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies._
import org.apache.spark.streaming.kafka010.LocationStrategies._
object SparkKafkaDemo extends App {
// Create a default logger object
val LOG = org.slf4j.LoggerFactory.getLogger(SparkKafkaDemo.getClass)
/*if (args.length < 2) {
System.err.println(s"""
|Usage: DirectKafkaWordCount <brokers> <topics>
|  <brokers> is a list of one or more Kafka brokers
|  <topics> is a list of one or more kafka topics to consume from
|
""".stripMargin)
System.exit(1)
}*/
// Set the log level
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("org.apache.spark.sql").setLevel(Level.WARN)
val Array(brokers, topics, outTopic) = /*args*/ Array(
PropertiesUtil.getPropString("kafka.bootstrap.servers"),
PropertiesUtil.getPropString("kafka.topic.source"),
PropertiesUtil.getPropString("kafka.topic.sink")
)
// Create the streaming context
/* Option 1: build the StreamingContext directly from a SparkConf */
val sparkConf = new SparkConf().setMaster("local[2]")
.setAppName("spark-kafka-demo1")
val ssc = new StreamingContext(sparkConf, Milliseconds(1000))
/* Option 2: build it from a SparkSession */
/*val spark = SparkSession.builder()
.appName("spark-kafka-demo1")
.master("local[2]")
.getOrCreate()
// Import implicit conversions so Scala objects can be converted to DataFrames
import spark.implicits._
val ssc = new StreamingContext(spark.sparkContext,Seconds(1))*/
// Set the checkpoint directory
ssc.checkpoint("spark_demo_cp1")
// Create a direct Kafka stream with the brokers and topics
// Note: pass the topics as an Array; in my test a Set did not match the expected type
//var topicSet = topics.split(",")/*.toSet*/
val topicsArr: Array[String] = topics.split(",")
// Set the Kafka consumer properties
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> brokers,
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> PropertiesUtil.getPropString("kafka.group.id"),
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
/**
 * createStream is the method from the 0.8 Spark-Kafka integration package; it hands offset
 * management over to ZooKeeper.
 *
 * The 0.10 integration package uses createDirectStream, which manages offsets itself. With this
 * version zkCli no longer shows how far each partition has been consumed, whereas the old
 * version did. It is much faster than letting ZooKeeper manage offsets, but offset monitoring
 * is lost.
 * createDirectStream takes only 3 parameters and is the easiest to use, but by default it starts
 * reading from the latest offset on every start; set auto.offset.reset="earliest" to start from
 * the earliest offset instead. (A sketch of committing offsets back to Kafka manually follows
 * this listing.)
 *
 * Official docs: @see Spark Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)
 */
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topicsArr, kafkaParams)
)
/** Kafka sink */
// Set the producer config
val kafkaProducer: Broadcast[KafkaSink[String, String]] = {
val kafkaProducerConfig = {
val p = new Properties()
p.setProperty("bootstrap.servers", brokers)
p.setProperty("key.serializer", classOf[StringSerializer].getName)
p.setProperty("value.serializer", classOf[StringSerializer].getName)
p
}
LOG.info("kafka producer init done!")
// Broadcast the KafkaSink: pass in kafkaProducerConfig and let KafkaSink instantiate the producer lazily
ssc.sparkContext.broadcast(KafkaSink[String, String](kafkaProducerConfig))
}
var jsonObject = new JSONObject()
// Filter the incoming stream and apply the processing logic
stream.filter(record => {
// Drop records that are empty or cannot be parsed as JSON
try {
jsonObject = JSON.parseObject(record.value)
} catch {
case e: Exception => {
// Mark the record as invalid so the filter below rejects it
jsonObject = null
LOG.error("Exception while parsing the record as JSON!\t{}", e.getMessage)
}
}
// Keep the record only if its value is non-empty and it parsed to a non-null JSON object
StringUtils.isNotEmpty(record.value) && null != jsonObject
}).map(record => {
//Your own business logic would go here; since this is just a test, simply return a tuple
jsonObject = JSON.parseObject(record.value)
// Return a tuple: (current timestamp, the date field from the JSON, the relater name from the JSON)
(System.currentTimeMillis(),
jsonObject.getString("date_dt"),
jsonObject.getString("relater_name")
)
}).foreachRDD(rdd => {
if (!rdd.isEmpty()) {
rdd.foreach(kafkaTuple => {
//Send the data to Kafka on outTopic using KafkaSink's second send(topic, value) method
//Take the broadcast value and call send for every record
kafkaProducer.value.send(
outTopic,
kafkaTuple._1 + "\t" + kafkaTuple._2 + "\t" + kafkaTuple._3
)
//Also print the record to the console so it can be inspected
println(kafkaTuple._1 + "\t" + kafkaTuple._2 + "\t" + kafkaTuple._3)
})
}
})
// Start the StreamingContext
ssc.start()
//Keep waiting for data until the context is stopped
ssc.awaitTermination()
}
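One thing worth noting: the job sets enable.auto.commit to false and relies on the checkpoint directory for recovery, so offsets are never committed back to Kafka. If you would rather commit them yourself, the 0-10 integration guide referenced in the comment above exposes the offsets through HasOffsetRanges and CanCommitOffsets. The sketch below is my own adaptation, not part of the original post; note that the offset ranges must be read from the direct stream's own RDDs (before any filter/map transformations), which is why it operates on stream directly:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// Sketch (assumes the same `stream` as in SparkKafkaDemo above): commit the consumed
// offsets back to Kafka after each batch has been processed.
stream.foreachRDD { rdd =>
  // Only the RDDs produced directly by createDirectStream carry offset information
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch here ...
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}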
5. On the Kafka cluster, create the matching producer and consumer.
First create the two topics:
bin/kafka-topics.sh --create --zookeeper make.spark.com:2181/kafka_10 --topic spark-kafka-demo --partitions 3 --replication-factor 2
bin/kafka-topics.sh --create --zookeeper make.spark.com:2181/kafka_10 --partitions 3 --replication-factor 1 --topic spark-sink-test
Create a console producer that sends data to our program:
bin/kafka-console-producer.sh --broker-list make.spark.com:9092,make.spark.com:9093,make.spark.com:9094 --topic spark-kafka-demo
Create a console consumer that reads the data our program sends:
bin/kafka-console-consumer.sh --bootstrap-server make.spark.com:9092,make.spark.com:9093,make.spark.com:9094 --from-beginning --topic spark-sink-test
6. Start the producer and the Spark program, then type the test data below into the producer window:
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
Switching back to IDEA, the printed output shows up, which means our data has been sent onwards as well.
Switching to the consumer window, the data has arrived there too.
That completes a receive-and-forward Kafka -> Spark Streaming -> Kafka pipeline, and I learned quite a bit along the way.
Reference: the article this post is based on (linked in the original). Thanks to its author!