Kafka is a high-throughput, distributed publish-subscribe messaging system. Through Kafka, users can publish large volumes of messages and also subscribe to and consume those messages in real time.
Kafka can serve both online real-time processing and offline batch processing at the same time.
In a large ecosystem, Kafka can act as a data exchange hub: different kinds of distributed systems (relational databases, NoSQL databases, stream processing systems, batch processing systems, and so on) can all connect to Kafka, enabling real-time and efficient exchange of different types of data with the various Hadoop components.
For the detailed Kafka installation procedure, please refer to online tutorials; here we assume that Kafka has already been successfully installed under "/usr/local/kafka".
Note: the installation package downloaded for this course is kafka_2.11-0.10.2.0.tgz. The leading 2.11 is the version of Scala that this Kafka build supports, and the trailing 0.10.2.0 is Kafka's own version number.
Open a terminal and run the following commands to start the ZooKeeper service:
$ cd /usr/local/kafka
$ ./bin/zookeeper-server-start.sh config/zookeeper.properties
Do not close this terminal window: once it is closed, the ZooKeeper service stops.
Open a second terminal and run the following commands to start the Kafka service:
$ cd /usr/local/kafka
$ ./bin/kafka-server-start.sh config/server.properties
Again, do not close this terminal window: once it is closed, the Kafka service stops.
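Optionally, before moving on, you can check that Kafka and ZooKeeper are both working by creating the topic used later in this tutorial ("wordsender") and listing the existing topics. This is only a sanity check and assumes the default ZooKeeper address localhost:2181; open a third terminal and run:
$ cd /usr/local/kafka
$ ./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wordsender
$ ./bin/kafka-topics.sh --list --zookeeper localhost:2181
If "wordsender" appears in the output of the second command, the Kafka service is ready to use.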
To use Kafka as an input source for Spark, Spark needs an additional library (a jar file). You can test whether it is available by running the following import statement in spark-shell:
import org.apache.spark.streaming.kafka._
For Spark 2.3.0, using Kafka requires the spark-streaming-kafka-0-8_2.11 jar, which can be downloaded from: http://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-8_2.11/2.3.0
Copy the downloaded jar file into a kafka subdirectory of Spark's jars directory:
$ cd /usr/local/spark/jars
$ mkdir kafka
$ cd ~
$ cd 下载    # the "Downloads" directory
$ cp ./spark-streaming-kafka-0-8_2.11-2.3.0.jar /usr/local/spark/jars/kafka
Then copy all the jar files under the libs directory of the Kafka installation into "/usr/local/spark/jars/kafka" as well, by running the following commands in the terminal:
$ cd /usr/local/kafka/libs
$ cp ./* /usr/local/spark/jars/kafka
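As a quick, optional sanity check, you can now start spark-shell with these jars on the driver classpath and run the import statement shown earlier; this assumes Spark is installed under /usr/local/spark as in the rest of this tutorial:
$ cd /usr/local/spark
$ ./bin/spark-shell --driver-class-path /usr/local/spark/jars/*:/usr/local/spark/jars/kafka/*
Then, at the spark-shell prompt, run "import org.apache.spark.streaming.kafka._"; if no error is reported, the Kafka support library is correctly in place.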
Writing a Kafka word-count program
First write the KafkaWordProducer program. Run the following commands to create the code directory:
$ cd /usr/local/spark/mycode
$ mkdir kafka
$ cd kafka
$ mkdir -p src/main/scala
$ cd src/main/scala
$ vim KafkaWordProducer.scala
Enter the following code in KafkaWordProducer.scala:
package org.apache.spark.examples.streaming

import java.util.HashMap

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._

object KafkaWordProducer {
  def main(args: Array[String]) {
    if (args.length < 4) {
      System.err.println("Usage: KafkaWordProducer <metadataBrokerList> <topic> " +
        "<messagesPerSec> <wordsPerMessage>")
      System.exit(1)
    }
    val Array(brokers, topic, messagesPerSec, wordsPerMessage) = args

    // Kafka broker connection properties
    val props = new HashMap[String, Object]()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // Send some messages: every second, send messagesPerSec messages,
    // each consisting of wordsPerMessage random single-digit "words"
    while (true) {
      (1 to messagesPerSec.toInt).foreach { messageNum =>
        val str = (1 to wordsPerMessage.toInt).map(x =>
          scala.util.Random.nextInt(10).toString)
          .mkString(" ")
        print(str)
        println()
        val message = new ProducerRecord[String, String](topic, null, str)
        producer.send(message)
      }
      Thread.sleep(1000)
    }
  }
}
Next, in the same directory, create the code file KafkaWordCount.scala, which performs a word count over the words sent by KafkaWordProducer. Its contents are as follows:
package org.apache.spark.examples.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWordCount {
  def main(args: Array[String]) {
    StreamingExamples.setStreamingLogLevels()
    val sc = new SparkConf().setAppName("KafkaWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sc, Seconds(10))
    // Set the checkpoint directory. If it is kept on HDFS, write something like
    // ssc.checkpoint("/usr/hadoop/checkpoint") instead, and make sure Hadoop is running.
    ssc.checkpoint("file:///usr/local/spark/mycode/kafka/checkpoint")
    val zkQuorum = "localhost:2181"  // ZooKeeper server address
    val group = "1"  // consumer group of the topic; any name works, e.g. val group = "test-consumer-group"
    val topics = "wordsender"  // topic name(s)
    val numThreads = 1  // number of consumer threads per topic
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    val lineMap = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
    val lines = lineMap.map(_._2)
    val words = lines.flatMap(_.split(" "))
    val pair = words.map(x => (x, 1))
    // Window operation: every 10 seconds, count the data of the last 2 minutes as one window;
    // the final argument 2 is the number of reduce tasks.
    val wordCounts = pair.reduceByKeyAndWindow(_ + _, _ - _, Minutes(2), Seconds(10), 2)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Spark Streaming supports sliding-window operations, which let us run a computation over the data that falls within a sliding window. The RDDs that fall within the window are collected together and combined to produce a new RDD, which becomes one RDD of the windowed DStream.
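In KafkaWordCount above, the window length is Minutes(2), the slide interval is Seconds(10), and passing the inverse function _ - _ lets Spark update the previous window's result incrementally (adding the newly arrived batch and subtracting the batch that slid out of the window), which is why ssc.checkpoint(...) is required. As a minimal sketch (not part of the program files), the same counting could also be written without the inverse function, at the cost of recomputing the whole 2-minute window from scratch every 10 seconds:
// Sketch only: recompute each 2-minute window from scratch every 10 seconds
val wordCounts = pair.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Minutes(2), Seconds(10))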
Finally, still in the same directory, create the code file StreamingExamples.scala, which is used to configure log4j:
package org.apache.spark.examples.streaming

import org.apache.log4j.{Level, Logger}
import org.apache.spark.internal.Logging

/** Utility functions for Spark Streaming examples. */
object StreamingExamples extends Logging {
  /** Set reasonable logging levels for streaming if the user has not configured log4j. */
  def setStreamingLogLevels() {
    val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
    if (!log4jInitialized) {
      // We first log something to initialize Spark's default logging, then we override the logging level.
      logInfo("Setting log level to [WARN] for streaming example." +
        " To override add a custom log4j.properties to the classpath.")
      Logger.getRootLogger.setLevel(Level.WARN)
    }
  }
}
Create the simple.sbt file:
$ cd /usr/local/spark/mycode/kafka/
$ vim simple.sbt
Enter the following content in simple.sbt:
name:="Simple Project"
version:="1.0"
scalaVersion:="2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.3.0"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-8_2.11" % "2.3.0"
Open a terminal and run the following command to launch the "KafkaWordProducer" program, which generates a stream of words (each "word" is actually an integer):
$ cd /usr/local/spark
$ /usr/local/spark/bin/spark-submit --driver-class-path /usr/local/spark/jars/*:/usr/local/spark/jars/kafka/* --class "org.apache.spark.examples.streaming.KafkaWordProducer" /usr/local/spark/mycode/kafka/target/scala-2.11/simple-project_2.11-1.0.jar localhost:9092 wordsender 3 5
After running the command above, new words keep scrolling across the screen. Do not close this terminal window; leave it running so it keeps sending words. Open another terminal and run the following command to launch the KafkaWordCount program, which performs the word count:
$ cd /usr/local/spark
$ /usr/local/spark/bin/spark-submit --driver-class-path /usr/local/spark/jars/*:/usr/local/spark/jars/kafka/* --class "org.apache.spark.examples.streaming.KafkaWordCount" /usr/local/spark/mycode/kafka/target/scala-2.11/simple-project_2.11-1.0.jar
After running the command above, the word-count job starts, and the word counts for the current window are printed every 10 seconds.