Development environment:
OS: Windows 10
IDE: Scala IDE for Eclipse
Build tool: Maven 3.6.0
JDK 1.8
Scala 2.11.11
Spark 2.4.3
spark-streaming-kafka-0-10_2.11 (the Kafka integration API provided by Spark Streaming)
Note 1: the trailing 2.11 is the Scala version.
Note 2: kafka-0-10 means Kafka 0.10 and later are supported.
Note 3: for how to use the integration API, see the official guide:
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
Kafka_2.11-2.2.1 (2.11 is the corresponding Scala version, 2.2.1 is the Kafka version)
Job runtime environment:
OS: Linux CentOS 7 (two machines: a master and a slave node)
master : 192.168.190.200
slave1 : 192.168.190.201
JDK 1.8
Scala 2.11.11
Spark 2.4.3
spark-streaming-kafka-0-10_2.11
ZooKeeper 3.4.14
Kafka_2.11-2.2.1
1. (Project 1) Feed simulated data into Kafka as key-value pairs:
1) Name-address data name_addr, input format (key: name, value: name\taddr\t0), for example:
bob bob shanghai#200000 0
amy amy beijing#100000 0
alice alice shanghai#200000 0
tom tom beijing#100000 0
lulu lulu hangzhou#310000 0
nick nick shanghai#200000 0
Note 1: the trailing 0 is the record type (0 = address, 1 = phone).
Note 2: \t is the tab separator between name, addr and type inside the value.
2) Name-phone data name_phone, input format (key: name, value: name\tphone\t1), for example:
bob bob 15700079421 1
amy amy 18700079458 1
alice alice 17730079427 1
tom tom 16700379451 1
lulu lulu 18800074423 1
nick nick 14400033426 1
Note 1: the trailing 1 is the record type (0 = address, 1 = phone).
Note 2: \t is the tab separator between name, phone and type inside the value; a small parsing sketch follows below.
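To make the value layout concrete, here is a minimal sketch (not part of the original projects) that splits one sample value from above into its three fields:

object ValueFormatDemo extends App {
  // a sample value from the name_addr stream: name \t addr \t type
  val value = "bob\tshanghai#200000\t0"
  val Array(name, addr, recordType) = value.split("\t")
  println(s"name=$name, addr=$addr, type=$recordType") // name=bob, addr=shanghai#200000, type=0
}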
2. (Project 2) Implement a Spark Streaming job that, at a 2 s batch interval, keeps pulling data from the corresponding Kafka topic and analyzes it (i.e. joins the two data sets above):
The join output, printed to the console, merges each person's name, address and phone, as follows:
姓名:tom,地址:beijing#100000,电话:16700379451
姓名:alice,地址:shanghai#200000,电话:17730079427
姓名:nick,地址:shanghai#200000,电话:14400033426
姓名:lulu,地址:hangzhou#310000,电话:18800074423
姓名:amy,地址:beijing#100000,电话:18700079458
姓名:bob,地址:shanghai#200000,电话:15700079421
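The streaming job produces this by a key-based inner join on the name field. As a minimal sketch of the same semantics on plain Scala collections (my addition, for illustration only):

object JoinSemanticsDemo extends App {
  // the same (key, value) pairs the streaming job builds from the two record types
  val nameAddr  = Seq("bob" -> "shanghai#200000", "amy" -> "beijing#100000")
  val namePhone = Seq("bob" -> "15700079421", "amy" -> "18700079458")
  // an inner join on the name key yields (name, (addr, phone))
  val joined = for {
    (n1, addr)  <- nameAddr
    (n2, phone) <- namePhone
    if n1 == n2
  } yield (n1, (addr, phone))
  joined.foreach { case (name, (addr, phone)) => println(s"姓名:$name,地址:$addr,电话:$phone") }
}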
The official Spark Streaming + Kafka integration guide (streaming-kafka-integration.html) offers two variants: 0-8 and 0-10.
This article uses spark-streaming-kafka-0-10 (streaming-kafka-0-10-integration.html),
which reads Kafka as a Direct Stream and has the following characteristics:
1) simple parallelism;
2) a 1:1 mapping between Kafka partitions and Spark partitions;
3) access to Kafka offsets and metadata (see the sketch below).
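As an illustration of point 3, the 0-10 integration exposes each batch's offset ranges through HasOffsetRanges. The following is a minimal sketch (assuming a stream created the same way as kafkaDirectStream in the consumer project later in this article) that simply prints them:

import org.apache.spark.streaming.kafka010.HasOffsetRanges

kafkaDirectStream.foreachRDD { rdd =>
  // the offset ranges are only available on the RDD produced directly by the Kafka source
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"topic=${o.topic} partition=${o.partition} from=${o.fromOffset} until=${o.untilOffset}")
  }
}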
1. (Producer) Project kafkaGenerator: generates simulated data and pushes it to Kafka
1) pom.xml: depends on the Kafka client library kafka_2.11
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com</groupId>
  <artifactId>kafkaGenerator</artifactId>
  <version>0.1</version>
  <dependencies>
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka_2.11</artifactId>
      <version>2.2.1</version>
      <exclusions>
        <exclusion><groupId>com.sun.jmx</groupId><artifactId>jmxri</artifactId></exclusion>
        <exclusion><groupId>com.sun.jdmk</groupId><artifactId>jmxtools</artifactId></exclusion>
        <exclusion><groupId>javax.jms</groupId><artifactId>jms</artifactId></exclusion>
        <exclusion><groupId>junit</groupId><artifactId>junit</artifactId></exclusion>
      </exclusions>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <!-- compile the Scala sources -->
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution><id>compile</id><phase>compile</phase><goals><goal>compile</goal></goals></execution>
          <execution><id>test-compile</id><phase>test-compile</phase><goals><goal>testCompile</goal></goals></execution>
          <execution><phase>process-resources</phase><goals><goal>compile</goal></goals></execution>
        </executions>
      </plugin>
      <!-- compile Java sources for JDK 1.8 -->
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration><source>1.8</source><target>1.8</target></configuration>
      </plugin>
      <!-- build a fat jar that bundles all dependencies -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.4</version>
        <configuration>
          <descriptorRefs><descriptorRef>jar-with-dependencies</descriptorRef></descriptorRefs>
        </configuration>
        <executions>
          <execution><id>assemble-all</id><phase>package</phase><goals><goal>single</goal></goals></execution>
        </executions>
      </plugin>
      <!-- record the main class in the jar manifest -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <addClasspath>true</addClasspath>
              <mainClass>sparkstreaming_action.kafka.generator.Producer</mainClass>
            </manifest>
          </archive>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <repositories>
    <repository>
      <id>alimaven</id>
      <name>aliyun maven</name>
      <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
      <releases><enabled>true</enabled></releases>
      <snapshots><enabled>false</enabled></snapshots>
    </repository>
  </repositories>
</project>
2) Producer.scala: the program entry point

package sparkstreaming_action.kafka.generator

import scala.util.Random
import java.util.Properties
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerRecord

object Producer extends App {
  // topic, read from the first command-line argument
  val topic = args(0)
  // broker list, read from the second command-line argument
  val brokers = args(1)
  // random generator (only used by the commented-out throttling below)
  val rnd = new Random()
  // producer configuration
  val props = new Properties()
  // bootstrap servers in host:port,host:port form; only used for the initial
  // connection, so the list does not have to contain every broker
  props.put("bootstrap.servers", brokers)
  // client id, so the brokers can trace requests back to this application
  props.put("client.id", "kafkaGenerator")
  // key/value serializer classes
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  // create the Kafka producer
  val producer = new KafkaProducer[String, String](props)
  // current time in milliseconds, used to measure throughput
  val t = System.currentTimeMillis()
  // simulated name -> address data
  val nameAddrs = Map("bob" -> "shanghai#200000", "amy" -> "beijing#100000",
    "alice" -> "shanghai#200000", "tom" -> "beijing#100000",
    "lulu" -> "hangzhou#310000", "nick" -> "shanghai#200000")
  // simulated name -> phone data
  val namePhones = Map("bob" -> "15700079421", "amy" -> "18700079458",
    "alice" -> "17730079427", "tom" -> "16700379451",
    "lulu" -> "18800074423", "nick" -> "14400033426")
  // produce the (name, addr, type:0) records
  for (nameAddr <- nameAddrs) {
    val data = new ProducerRecord[String, String](topic, nameAddr._1,
      s"${nameAddr._1}\t${nameAddr._2}\t0")
    producer.send(data) // asynchronous write to Kafka
    //if (rnd.nextInt(100) < 50) Thread.sleep(rnd.nextInt(10))
  }
  // produce the (name, phone, type:1) records
  for (namePhone <- namePhones) {
    val data = new ProducerRecord[String, String](topic, namePhone._1,
      s"${namePhone._1}\t${namePhone._2}\t1")
    producer.send(data) // asynchronous write to Kafka
    //if (rnd.nextInt(100) < 50) Thread.sleep(rnd.nextInt(10))
  }
  // rough number of records handed to the producer per second
  System.out.println("sent per second: "
    + (nameAddrs.size + namePhones.size) * 1000 / (System.currentTimeMillis() - t))
  producer.close()
}
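Note that producer.send is asynchronous, so the timing above only measures how quickly records are handed to the producer's buffer, not how quickly the brokers acknowledge them. A minimal stand-alone sketch (my addition; the broker address and topic used here are assumptions) of waiting for an acknowledgement via the Future returned by send:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object SyncSendDemo extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "master:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)
  // send() returns a Future of RecordMetadata; get() blocks until the broker acknowledges the write
  val metadata = producer.send(
    new ProducerRecord[String, String]("kafkaOperation", "bob", "bob\tshanghai#200000\t0")).get()
  println(s"wrote to ${metadata.topic()}-${metadata.partition()} at offset ${metadata.offset()}")
  producer.flush() // alternatively, flush() blocks until all buffered records have been sent
  producer.close()
}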
2. (Consumer) Project kafkaSparkStreaming: pulls the data from Kafka (topic: kafkaOperation), joins it and prints the result to the console
1) pom.xml: requires the spark-streaming-kafka-0-10_2.11 dependency
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com</groupId>
  <artifactId>kafkaSparkStreaming</artifactId>
  <version>0.1</version>
  <properties>
    <spark.version>2.4.3</spark.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>1.2.17</version>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
      <version>1.7.12</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <!-- compile the Scala sources -->
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution><id>compile</id><phase>compile</phase><goals><goal>compile</goal></goals></execution>
          <execution><id>test-compile</id><phase>test-compile</phase><goals><goal>testCompile</goal></goals></execution>
          <execution><phase>process-resources</phase><goals><goal>compile</goal></goals></execution>
        </executions>
      </plugin>
      <!-- compile Java sources for JDK 1.8 -->
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration><source>1.8</source><target>1.8</target></configuration>
      </plugin>
      <!-- build a fat jar that bundles all dependencies -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.4</version>
        <configuration>
          <descriptorRefs><descriptorRef>jar-with-dependencies</descriptorRef></descriptorRefs>
        </configuration>
        <executions>
          <execution><id>assemble-all</id><phase>package</phase><goals><goal>single</goal></goals></execution>
        </executions>
      </plugin>
      <!-- record the main class in the jar manifest -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <addClasspath>true</addClasspath>
              <mainClass>sparkstreaming_action.kafka.operation.KafkaOperation</mainClass>
            </manifest>
          </archive>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <repositories>
    <repository>
      <id>alimaven</id>
      <name>aliyun maven</name>
      <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
      <releases><enabled>true</enabled></releases>
      <snapshots><enabled>false</enabled></snapshots>
    </repository>
  </repositories>
</project>
2) KafkaOperation.scala: the program entry point

package sparkstreaming_action.kafka.operation

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies
import org.apache.spark.streaming.kafka010.ConsumerStrategies

object KafkaOperation extends App {
  // Spark configuration
  val sparkConf = new SparkConf()
    .setAppName("KafkaOperation")
    .setMaster("spark://master:7077")
    .set("spark.local.dir", "./tmp")
    .set("spark.streaming.kafka.maxRatePerPartition", "10")
  // spark.streaming.kafka.maxRatePerPartition caps the number of records read per Kafka
  // partition per second; with a 2 s batch interval each partition contributes at most
  // 20 records to a batch

  // streaming context with a 2 s batch interval
  val ssc = new StreamingContext(sparkConf, Seconds(2))
  // Kafka parameters for the direct (Direct Stream) connection to the brokers and topic
  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "master:9092,slave1:9092", // bootstrap broker list
    "key.deserializer" -> classOf[StringDeserializer], // key deserializer class
    "value.deserializer" -> classOf[StringDeserializer], // value deserializer class
    "group.id" -> "kafkaOperationGroup", // consumer group
    "auto.offset.reset" -> "latest", // start from the latest offset
    "enable.auto.commit" -> (false: java.lang.Boolean) // disable automatic offset commits
  )
  // create the direct Kafka DStream
  val kafkaDirectStream = KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](List("kafkaOperation"), kafkaParams)
  )
  // from the incoming Kafka records, build the name-address DStream
  val nameAddrStream = kafkaDirectStream.map(_.value).filter(record => {
    // split the record on tabs
    val tokens = record.split("\t")
    // keep only (name, addr, type:0) records
    tokens(2).toInt == 0
  }).map(record => {
    // split the record on tabs
    val tokens = record.split("\t")
    // return a (name, addr) pair
    (tokens(0), tokens(1))
  })
  // from the incoming Kafka records, build the name-phone DStream
  val namePhoneStream = kafkaDirectStream.map(_.value).filter(record => {
    // split the record on tabs
    val tokens = record.split("\t")
    // keep only (name, phone, type:1) records
    tokens(2).toInt == 1
  }).map(record => {
    // split the record on tabs
    val tokens = record.split("\t")
    // return a (name, phone) pair
    (tokens(0), tokens(1))
  })
  // join the two pair DStreams on the name key, producing (name, (addr, phone)),
  // and format one output line per person
  val nameAddrPhoneStream = nameAddrStream.join(namePhoneStream).map(record => {
    s"姓名:${record._1},地址:${record._2._1},电话:${record._2._2}"
  })
  // print the result (the first 10 elements of each batch by default)
  nameAddrPhoneStream.print()
  // start the computation
  ssc.start()
  ssc.awaitTermination()
}
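Because enable.auto.commit is false, this job never commits its offsets back to Kafka. If committed offsets were wanted (for example to resume from the last processed position after a restart), the 0-10 integration's commitAsync API could be used; a minimal sketch (my addition, not part of the original project):

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// commit each batch's offsets back to Kafka once the batch has been processed
kafkaDirectStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch here ...
  kafkaDirectStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}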
3) Spark job flow in this (consumer) project
[Figures: the Spark Streaming job's print output Job, and the three DAG scheduling stages (Stages) of that Job]
1. Build and package the projects
1. In the kafkaSparkStreaming project root, run mvn clean install,
then upload target/kafkaSparkStreaming-0.1-jar-with-dependencies.jar to the Linux cluster.
2. In the kafkaGenerator project root, run mvn clean install,
then upload target/kafkaGenerator-0.1-jar-with-dependencies.jar to the Linux cluster.
Run on every node:
2. Start ZooKeeper and Kafka
$ zkServer.sh start
$ kafka-server-start.sh -daemon /opt/kafka_2.11-2.2.1/config/server.properties
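The job subscribes to the topic kafkaOperation. If the brokers are not configured to auto-create topics, create it up front; a sketch (the partition and replication counts here are assumptions, not taken from the original setup):
$ kafka-topics.sh --create \
  --zookeeper master:2181 \
  --replication-factor 1 \
  --partitions 2 \
  --topic kafkaOperation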
Use terminal A (e.g. Windows PowerShell) to connect to the master node:
3. Start Spark
$ /opt/spark-2.4.3-bin-hadoop2.7/sbin/start-all.sh
4. Submit the Spark job, which keeps pulling data from Kafka (topic: kafkaOperation)
$ spark-submit \
--class sparkstreaming_action.kafka.operation.KafkaOperation \
--num-executors 2 \
--conf spark.default.parallelism=1000 \
kafkaSparkStreaming-0.1-jar-with-dependencies.jar
<When log blocks like the following appear, with a print Job executed every 2 s, the job has started successfully>
-------------------------------------------
Time: 1560391586000 ms
-------------------------------------------
-------------------------------------------
Time: 1560391588000 ms
-------------------------------------------
<Note: Kafka holds no data yet, so the printed blocks are empty>
Use terminal B (several PowerShell windows can be open at once) to connect to the master node:
5. Feed simulated data into Kafka (this can be repeated; watch the Streaming graphs in the Spark UI)
$ java -cp kafkaGenerator-0.1-jar-with-dependencies.jar \
sparkstreaming_action.kafka.generator.Producer \
kafkaOperation master:9092,slave1:9092
<On success, it prints the approximate number of records handed to Kafka per second:>
sent per second: 54
<At this point the streaming job in terminal A has pulled the Kafka data, joined it and printed the result:>
-------------------------------------------
Time: 1560392046000 ms
-------------------------------------------
姓名:tom,地址:beijing#100000,电话:16700379451
姓名:alice,地址:shanghai#200000,电话:17730079427
姓名:nick,地址:shanghai#200000,电话:14400033426
姓名:lulu,地址:hangzhou#310000,电话:18800074423
姓名:amy,地址:beijing#100000,电话:18700079458
姓名:bob,地址:shanghai#200000,电话:15700079421
-------------------------------------------
Time: 1560392048000 ms
-------------------------------------------
6. In terminal B, inspect the consumer offsets for topic kafkaOperation:
$ kafka-consumer-groups.sh \
--bootstrap-server master:9092 \
--describe --group kafkaOperationGroup
<The following description appears; LOG-END-OFFSET grows by one for every record written to Kafka. CURRENT-OFFSET and LAG show "-" because the job sets enable.auto.commit to false and never commits offsets itself>
TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID                                      HOST              CLIENT-ID
kafkaOperation  0          -               2412            -    consumer-1-3feb3110-f90c-4a55-b453-4270038e724b  /192.168.190.200  consumer-1
References:
1. 《Spark Streaming 实时流式大数据处理实战》, Chapter 5: Spark Streaming and Kafka
2. 再谈Spark Streaming Kafka反压 (a follow-up discussion of Spark Streaming Kafka backpressure)
3. Kafka configuration reference (Kafka配置说明)
4. streaming-programming-guide.html (official Spark Streaming programming guide)
5. streaming-kafka-0-10-integration.html (official 0-10 integration guide)
6. streaming-kafka-integration.html (official Spark Streaming + Kafka integration overview)