[SparkStreaming] Running a Kafka + Spark Streaming Example on Windows 10

  • Environment
  • 1. Environment setup
    • 1. Install ZooKeeper
    • 2. Install Kafka
  • 2. Testing the Producer and Consumer with the Scala API
    • 1. Maven dependencies
    • 2. Producer
    • 3. Consumer
    • 4. Results
  • 3. Spark Streaming example
    • 1. Maven dependencies
    • 2. Code

Environment

OS: Windows 10
ZooKeeper: zookeeper-3.4.6
Kafka: kafka_2.11-1.1.0
Scala: scala-2.11.8
Java: jdk1.8.0_111
IntelliJ IDEA: 14.1.4

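1. Environment setup

1. Install ZooKeeper

1. Download the ZooKeeper package and extract it to a directory of your choice, for example D:\envpath\zookeeper-3.4.6.
2. In the conf folder, rename zoo_sample.cfg to zoo.cfg and edit its settings.

# Change this setting:
dataDir=D:/envpath/zookeeper-3.4.6/data
# Add this setting:
dataLogDir=D:/envpath/zookeeper-3.4.6/logs

3. Add the environment variable ZOOKEEPER_HOME=D:\envpath\zookeeper-3.4.6, then add %ZOOKEEPER_HOME%\bin to Path.
4. Start ZooKeeper. In cmd, enter: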

zkServer  

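2. Install Kafka

1. Download the Kafka package and extract it to a directory of your choice.
2. Edit server.properties in the config folder and change the log directory setting.

log.dirs=D:/envpath/kafka_2.11-1.1.0/logs

3. Change to the Kafka installation directory and start Kafka.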

.\bin\windows\kafka-server-start.bat .\config\server.properties
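Before moving on, it can help to confirm that both services are actually accepting connections. The following is a minimal sketch, not part of the original setup: PortCheck and isListening are hypothetical names, and the ports are assumptions based on the defaults (clientPort=2181 from zoo_sample.cfg, and 9092 as used for the broker later in this post). It only checks that a TCP connection can be opened, not that the service is healthy.

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

object PortCheck {
  /** Returns true if a TCP connection to host:port can be opened within timeoutMs. */
  def isListening(host: String, port: Int, timeoutMs: Int = 1000): Boolean = {
    val socket = new Socket()
    try {
      // connect throws on refusal or timeout; Try turns that into false.
      Try(socket.connect(new InetSocketAddress(host, port), timeoutMs)).isSuccess
    } finally {
      socket.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumed default ports: 2181 for ZooKeeper, 9092 for the Kafka broker.
    val services = Seq("ZooKeeper" -> 2181, "Kafka" -> 9092)
    for ((name, port) <- services) {
      val status = if (isListening("localhost", port)) "listening" else "not reachable"
      println(s"$name (localhost:$port): $status")
    }
  }
}
```

If either service shows as not reachable, check the cmd window where it was started for errors before running the examples below.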

2. Testing the Producer and Consumer with the Scala API

1. Maven dependencies

<properties>
    <kafka.version>1.1.0</kafka.version>
</properties>
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.11</artifactId>
    <version>${kafka.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>${kafka.version}</version>
</dependency>

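2. Producer

The producer generates the data. Its behavior is controlled by a set of properties configured in props, described below:

Parameter | Meaning
bootstrap.servers | List of broker addresses used to establish the initial connection to Kafka. Each entry has the form host:port; separate multiple entries with commas, e.g. kafka01:9092,kafka02:9092.
acks | Number of acknowledgments the producer requires before a send is considered complete. 0 means no acknowledgment is awaited. 1 means the acknowledgment of the partition leader is enough. all means every in-sync replica must acknowledge. The default in Kafka 1.1.0 is 1.
retries | Number of times a failed send is retried. Transient failures such as network errors trigger an automatic resend. This setting has no effect when acks is 0, because the producer cannot tell whether a resend is needed.
batch.size | Batch size in bytes. A request sent to a broker contains multiple batches, one per partition. A small batch size reduces throughput, and a value of 0 disables batching altogether. The default is 16384 (16 KB).
linger.ms | Linger time in milliseconds. Instead of sending a record immediately, the producer waits up to this long so that records can be sent together. The default is 0.
buffer.memory | Memory in bytes available for records waiting to be sent. When records are produced faster than the Kafka server can accept them, the buffer fills up; the producer then blocks for up to max.block.ms and throws an exception on timeout. The default is 33554432 (32 MB).
key.serializer | Serializer class for keys. There is no default, so this setting is required; omitting it causes an error.
value.serializer | Serializer class for values. There is no default, so this setting is required; omitting it causes an error.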
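To make the parameters above concrete, here is a minimal sketch that collects them into a java.util.Properties object. ProducerSettings and build are hypothetical names introduced only for illustration; the sketch uses only the JDK and does not connect to Kafka. The acks check reflects the values the producer accepts (0, 1, all, and -1 as an alias for all).

```scala
import java.util.Properties

object ProducerSettings {
  private val StringSerializer = "org.apache.kafka.common.serialization.StringSerializer"

  /** Builds producer properties; rejects acks values the producer would not accept. */
  def build(bootstrapServers: String, acks: String = "1"): Properties = {
    require(Set("0", "1", "all", "-1").contains(acks), s"unsupported acks value: $acks")
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers) // comma-separated host:port list
    props.put("acks", acks)
    props.put("retries", "0")                        // no automatic resend
    props.put("batch.size", "16384")                 // 16 KB per-partition batch
    props.put("linger.ms", "1")                      // wait up to 1 ms to fill a batch
    props.put("buffer.memory", "33554432")           // 32 MB send buffer
    props.put("key.serializer", StringSerializer)    // required, no default
    props.put("value.serializer", StringSerializer)  // required, no default
    props
  }

  def main(args: Array[String]): Unit = {
    val props = build("kafka01:9092,kafka02:9092", acks = "all")
    println(props.getProperty("acks")) // prints "all"
  }
}
```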

The full code is as follows:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.util.Random

object MessageProducer {
  val topic = "test-music-topic"

  def main(args: Array[String]): Unit = {
    // Producer settings; see the parameter table above.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("acks", "1")
    props.put("retries", "0")
    props.put("batch.size", "16384")
    props.put("linger.ms", "1")
    props.put("buffer.memory", "33554432")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // Sample data used to build random "user,music,operation,timestamp" messages.
    val users = Array("Tim", "Mary", "Jack", "Edward", "Milly", "Jackson")
    val musics = Array("Life is like a boat", "Lemon", "Rain", "Fish in the pool", "City of Starts", "Summer", "Planet")
    val operations = Array("like", "download", "store", "delete")

    val random = new Random()
    val num = 10
    try {
      // "0 to num" is inclusive, so this sends num + 1 = 11 messages with keys "0" to "10".
      for (i <- 0 to num) {
        val message = users(random.nextInt(users.length)) + "," +
          musics(random.nextInt(musics.length)) + "," +
          operations(random.nextInt(operations.length)) + "," +
          System.currentTimeMillis()
        producer.send(new ProducerRecord[String, String](topic, Integer.toString(i), message))
        println(message)
      }
    } finally {
      // close() waits for buffered records to be sent before releasing resources.
      producer.close()
    }
  }
}
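Each message produced above is a single comma-separated line of the form user,music,operation,timestamp. As a minimal sketch of how that format could be handled in a typed way, the following defines a case class with a formatter and a parser. UserAction, format and parse are hypothetical names, not part of the original code, and the sketch assumes that no field contains a comma, which holds for the sample data in the producer.

```scala
import scala.util.Try

object UserActionMessage {
  /** One producer message: who did what to which song, and when (epoch milliseconds). */
  final case class UserAction(user: String, music: String, operation: String, timestamp: Long)

  /** Renders an action in the same "user,music,operation,timestamp" layout the producer uses. */
  def format(action: UserAction): String =
    Seq(action.user, action.music, action.operation, action.timestamp.toString).mkString(",")

  /** Parses a message line; returns None if it does not have four fields or the timestamp is not a number. */
  def parse(line: String): Option[UserAction] =
    line.split(",", -1) match {
      case Array(user, music, operation, ts) =>
        Try(ts.trim.toLong).toOption.map(UserAction(user, music, operation, _))
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    println(parse("Tim,Lemon,like,1528804583071"))
    // prints "Some(UserAction(Tim,Lemon,like,1528804583071))"
    println(parse("not a valid message"))
    // prints "None"
  }
}
```

Returning an Option keeps malformed lines from throwing inside a consumer loop; callers can simply skip the None cases.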

3. Consumer

The consumer reads the data. Its key.deserializer and value.deserializer settings are required and must correspond to the producer's key.serializer and value.serializer. The full code is as follows.

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

object MessageConsumer {
  val topic = "test-music-topic"

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    // The deserializers must match the serializers used by the producer.
    // (request.required.acks is a producer-side setting and is not needed here.)
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("group.id", "something")
    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList(topic))
    // Poll forever; stop the program manually to end it.
    while (true) {
      // In the 1.1.0 client, poll takes a timeout in milliseconds.
      val records = consumer.poll(100)
      for (record <- records.asScala) {
        println(s"offset = ${record.offset()}, key = ${record.key()}, value = ${record.value()}")
      }
    }
  }
}
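The requirement that the consumer's deserializers correspond to the producer's serializers can be illustrated without Kafka. The sketch below is not the Kafka StringSerializer/StringDeserializer code; it only mimics their default behavior of encoding strings as UTF-8 bytes. SerdeRoundTrip, serialize and deserialize are hypothetical names, and the ISO-8859-1 decoder stands in for any mismatched deserializer.

```scala
import java.nio.charset.{Charset, StandardCharsets}

object SerdeRoundTrip {
  /** Producer side: turn a string into the bytes that go on the wire. */
  def serialize(value: String, charset: Charset = StandardCharsets.UTF_8): Array[Byte] =
    value.getBytes(charset)

  /** Consumer side: turn the received bytes back into a string. */
  def deserialize(bytes: Array[Byte], charset: Charset = StandardCharsets.UTF_8): String =
    new String(bytes, charset)

  def main(args: Array[String]): Unit = {
    val text = "café,Lemon,like,1528804583071"

    // Matching pair: the original value is recovered.
    println(deserialize(serialize(text)) == text) // prints "true"

    // Mismatched pair: the bytes are read with the wrong decoder and the value is corrupted.
    println(deserialize(serialize(text), StandardCharsets.ISO_8859_1) == text) // prints "false"
  }
}
```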

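4. Results

Start the consumer first. It prints its configuration and then goes quiet, because the producer has not produced any messages yet. Then run the producer, which sends 11 messages (keys 0 to 10) and shuts down automatically when it is done.
The producer's output is as follows: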

2018-06-12 19:56:22,699 INFO [org.apache.kafka.common.utils.AppInfoParser] - Kafka version : 1.1.0
2018-06-12 19:56:22,699 INFO [org.apache.kafka.common.utils.AppInfoParser] - Kafka commitId : fdcf75ea326b8e07
Tim,Lemon,like,1528804583071
2018-06-12 19:56:23,528 INFO [org.apache.kafka.clients.Metadata] - Cluster ID: u88DYmIoSJCSkoWG2EdXDQ
Jackson,City of Starts,like,1528804583552
Tim,Rain,delete,1528804583553
Mary,City of Starts,like,1528804583553
Jack,Lemon,like,1528804583556
Edward,Lemon,download,1528804583556
Tim,Lemon,download,1528804583558
Milly,Fish in the pool,like,1528804583558
Tim,Planet,download,1528804583559
Edward,Rain,like,1528804583559
Tim,Rain,like,1528804583559
2018-06-12 19:56:23,559 INFO [org.apache.kafka.clients.producer.KafkaProducer] - [Producer clientId=producer-1] Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms.

The consumer receives the messages as soon as the producer sends them. Its output is as follows.

2018-06-12 19:56:16,833 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator] - [Consumer clientId=consumer-1, groupId=something] Successfully joined group with generation 29
2018-06-12 19:56:16,833 INFO [org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] - [Consumer clientId=consumer-1, groupId=something] Setting newly assigned partitions [test-music-topic-0]
offset = 121, key = 0, value = Tim,Lemon,like,1528804583071
offset = 122, key = 1, value = Jackson,City of Starts,like,1528804583552
offset = 123, key = 2, value = Tim,Rain,delete,1528804583553
offset = 124, key = 3, value = Mary,City of Starts,like,1528804583553
offset = 125, key = 4, value = Jack,Lemon,like,1528804583556
offset = 126, key = 5, value = Edward,Lemon,download,1528804583556
offset = 127, key = 6, value = Tim,Lemon,download,1528804583558
offset = 128, key = 7, value = Milly,Fish in the pool,like,1528804583558
offset = 129, key = 8, value = Tim,Planet,download,1528804583559
offset = 130, key = 9, value = Edward,Rain,like,1528804583559
offset = 131, key = 10, value = Tim,Rain,like,1528804583559
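The last field of each message is the value of System.currentTimeMillis() at the moment the message was built. To compare it with the log timestamps above, it can be converted to a readable time, as in the following minimal sketch. TimestampFormat is a hypothetical name, and the Asia/Shanghai zone (UTC+8) is an assumption that happens to be consistent with the log lines shown in this post.

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

object TimestampFormat {
  // Same layout as the log lines in this post: date, time, comma, milliseconds.
  private val pattern = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS")

  /** Formats epoch milliseconds as local time in the given zone. */
  def format(epochMillis: Long, zone: ZoneId): String =
    pattern.format(Instant.ofEpochMilli(epochMillis).atZone(zone))

  def main(args: Array[String]): Unit = {
    // Timestamp of the first message in the producer output above.
    println(format(1528804583071L, ZoneId.of("Asia/Shanghai")))
    // prints "2018-06-12 19:56:23,071"
  }
}
```

The result falls between the two neighboring producer log lines (19:56:22,699 and 19:56:23,528), which is what one would expect for the first message.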

3. Spark Streaming example

1. Maven dependencies

<properties>
    <spark.version>2.2.0</spark.version>
</properties>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>

2. Code

The job receives the data generated by the producer and prints it. The code is as follows:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by DELL_PC on 2018/6/12.
 */
object UserActionStreaming {

  def main(args: Array[String]): Unit = {
    val group = "something"
    val topics = "test-music-topic"

    // Run locally with 3 threads and process the stream in 10-second batches.
    val conf = new SparkConf().setAppName("pvuv").setMaster("local[3]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("data/spark/checkpoint")

    // topics may hold several comma-separated topic names.
    val topicSets = topics.split(",").toSet
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> group,
      // Without a committed offset, start from the newest messages.
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topicSets, kafkaParams)
    )
    // Print each (key, value) pair; in local mode the executor output goes to this console.
    stream.map(record => (record.key, record.value)).foreachRDD(rdd => rdd.foreach(println))
    ssc.start()
    ssc.awaitTermination()
  }
}

Start UserActionStreaming first, then run the producer. UserActionStreaming receives and prints messages only after the producer has sent them. The producer's output is as follows.

2018-06-12 15:21:17,472 INFO [org.apache.kafka.clients.producer.ProducerConfig] - ProducerConfig values: 
    acks = 1
    batch.size = 16384
    bootstrap.servers = [localhost:9092]
    buffer.memory = 33554432
    client.id = 
    compression.type = none
    connections.max.idle.ms = 540000
    enable.idempotence = false
   ... ...
2018-06-12 15:21:18,033 INFO [org.apache.kafka.common.utils.AppInfoParser] - Kafka version : 1.1.0
2018-06-12 15:21:18,033 INFO [org.apache.kafka.common.utils.AppInfoParser] - Kafka commitId : fdcf75ea326b8e07
2018-06-12 15:21:18,629 INFO [org.apache.kafka.clients.Metadata] - Cluster ID: u88DYmIoSJCSkoWG2EdXDQ
Jack,Rain,delete,1528788078392
Milly,Planet,like,1528788078674
Jack,Lemon,delete,1528788078675
Mary,Rain,store,1528788078675
Edward,Planet,store,1528788078675
Milly,Fish in the pool,delete,1528788078675
Milly,City of Starts,download,1528788078676
Mary,Planet,delete,1528788078680
Milly,Fish in the pool,like,1528788078680
Jack,Summer,like,1528788078680
Jackson,Fish in the pool,store,1528788078680
2018-06-12 15:21:18,681 INFO [org.apache.kafka.clients.producer.KafkaProducer] - [Producer clientId=producer-1] Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms.

The output of UserActionStreaming is as follows.

2018-06-12 15:21:20,134 INFO [org.apache.kafka.common.utils.AppInfoParser] - Kafka version : 1.1.0
2018-06-12 15:21:20,134 INFO [org.apache.kafka.common.utils.AppInfoParser] - Kafka commitId : fdcf75ea326b8e07
2018-06-12 15:21:20,138 INFO [org.apache.spark.streaming.kafka010.CachedKafkaConsumer] - Initial fetch for spark-executor-something test-music-topic 0 55
2018-06-12 15:21:20,151 INFO [org.apache.kafka.clients.Metadata] - Cluster ID: u88DYmIoSJCSkoWG2EdXDQ
(0,Jack,Rain,delete,1528788078392)
(1,Milly,Planet,like,1528788078674)
(2,Jack,Lemon,delete,1528788078675)
(3,Mary,Rain,store,1528788078675)
(4,Edward,Planet,store,1528788078675)
(5,Milly,Fish in the pool,delete,1528788078675)
(6,Milly,City of Starts,download,1528788078676)
(7,Mary,Planet,delete,1528788078680)
(8,Milly,Fish in the pool,like,1528788078680)
(9,Jack,Summer,like,1528788078680)
(10,Jackson,Fish in the pool,store,1528788078680)
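The job above only prints what it receives. As a possible next step, the message values could be aggregated, for example by counting messages per operation. The sketch below shows that logic on a plain Scala collection so it runs without Spark; OperationCounts and countByOperation are hypothetical names, and this is an illustration rather than part of the original job. In a streaming job, the same idea would typically be expressed by mapping each value to an (operation, 1) pair and summing per key.

```scala
object OperationCounts {
  /** Counts messages per operation; lines without exactly four comma-separated fields are skipped. */
  def countByOperation(values: Seq[String]): Map[String, Int] =
    values
      .map(_.split(",", -1))
      .collect { case Array(_, _, operation, _) => operation }
      .groupBy(identity)
      .map { case (operation, hits) => operation -> hits.size }

  def main(args: Array[String]): Unit = {
    // A few values taken from the output above.
    val sample = Seq(
      "Jack,Rain,delete,1528788078392",
      "Milly,Planet,like,1528788078674",
      "Jack,Lemon,delete,1528788078675",
      "Mary,Rain,store,1528788078675"
    )
    countByOperation(sample).toSeq.sortBy(_._1).foreach {
      case (operation, count) => println(s"$operation: $count")
    }
  }
}
```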

