A Summary of Spark Streaming

1. Comparison of the Spark direct approach and the Receiver approach

consumer: the traditional consumer (the old approach) needs to connect to ZK; the new, more efficient approach does not connect to ZK, but it has to maintain the offsets itself.

consumer group: a consumer group can contain multiple consumers, and a message is not consumed more than once within the same group.

 

A DStream (discretized stream) is the most basic abstraction in Spark Streaming. A DStream does not store data itself, but it can be thought of as a distributed dataset: essentially a DStream is a continuous series of RDDs. It is a wrapper around RDDs and is used to process real-time data streams.

 

At every batch interval a DStream produces a micro-batch, which is submitted to the Spark engine for computation; each batch produces its own result.
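As a rough illustration of this micro-batch model (a sketch that is not part of the original notes), the job below prints the size of each 5-second batch; the socket source, host and port are made up for the example and can be fed with nc -lk 9999:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchIntervalDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("BatchIntervalDemo")
    // the 5-second batch interval: every 5 seconds the data received so far becomes one RDD
    val ssc = new StreamingContext(conf, Seconds(5))
    // hypothetical test source for the sketch
    val lines = ssc.socketTextStream("localhost", 9999)
    // each batch is handed to the Spark engine as an ordinary RDD and produces its own result
    lines.foreachRDD { (rdd, time) =>
      println(s"batch at $time contains ${rdd.count()} records")
    }
    ssc.start()
    ssc.awaitTermination()
  }
}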

 

There are two ways to integrate Spark Streaming with Kafka (the 0.8 integration supports both; the 0.10 integration supports only the direct approach).

The first is the Receiver-based approach: a receiver collects the data produced during a batch, and the computation runs afterwards. It uses the high-level consumer API (connects to ZK, updates offsets automatically, relies on a WAL for reliability), but it is relatively inefficient.

The second is the direct approach: each Spark Streaming task connects directly to a partition of the corresponding Kafka topic and reads the data record by record through an iterator, computing as it reads; each batch interval produces one batch result. This consumption mode does not connect to ZK; it talks straight to the brokers, but the offsets have to be maintained manually (they can be recorded in MySQL, Redis, or ZK).
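For contrast, a minimal sketch of the Receiver-based integration (the createStream API from spark-streaming-kafka-0-8) might look like the following; the topic, group id and ZooKeeper quorum are simply the same placeholders used in the direct example in the next section:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaReceiverWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaReceiverWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    // ZooKeeper quorum and group id: the high-level consumer API tracks the offsets in ZK itself
    val zkQuorum = "hdp20-01:2181,hdp20-02:2181,hdp20-03:2181"
    val group = "g001"
    // topic name -> number of receiver threads
    val topics = Map("wordcount" -> 1)
    // createStream returns (key, message) pairs; only the message is needed here
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topics).map(_._2)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}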

 

 

2. Spark Streaming direct integration with Kafka 0.8

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import kafka.utils.{ZKGroupTopicDirs, ZkUtils}
import org.I0Itec.zkclient.ZkClient
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaDirectWordCount {

  def main(args: Array[String]): Unit = {

    val group = "g001"
    val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaDirectWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    val topic = "wordcount"

    // Kafka broker list (the Spark Streaming tasks connect directly to the Kafka partitions
    // and consume with the lower-level API, which is more efficient)
    val brokerList = "hdp20-01:9092,hdp20-02:9092,hdp20-03:9092"

    // ZooKeeper quorum, used later when updating the consumed offsets
    val zkQuorum = "hdp20-01:2181,hdp20-02:2181,hdp20-03:2181"

    // the set of topic names used when creating the stream
    val topics: Set[String] = Set(topic)

    // ZKGroupTopicDirs describes the ZooKeeper directory used to store the offsets
    val topicDirs = new ZKGroupTopicDirs(group, topic)

    // the ZooKeeper path for this group/topic, e.g. "/g001/offsets/wordcount/"
    val zkTopicPath = s"${topicDirs.consumerOffsetDir}"

    // Kafka parameters
    val kafkaParams = Map(
      "metadata.broker.list" -> brokerList,
      "group.id" -> group,
      "auto.offset.reset" -> kafka.api.OffsetRequest.SmallestTimeString
    )

    // create a ZkClient against the ZooKeeper quorum; it is used to read and update the offsets
    val zkClient = new ZkClient(zkQuorum)

    // check whether the path has child nodes (child nodes exist if we have already saved
    // offsets for the individual partitions), e.g.:
    //   /g001/offsets/wordcount/0/10001
    //   /g001/offsets/wordcount/1/30001
    //   /g001/offsets/wordcount/2/10001
    // zkTopicPath -> /g001/offsets/wordcount/
    val children = zkClient.countChildren(zkTopicPath)

    var kafkaStream: InputDStream[(String, String)] = null

    // if ZooKeeper already holds offsets, they become the starting positions of kafkaStream
    var fromOffsets: Map[TopicAndPartition, Long] = Map()

    // offsets were saved before
    if (children > 0) {
      for (i <- 0 until children) {
        // e.g. /g001/offsets/wordcount/0/10001
        val partitionOffset = zkClient.readData[String](s"$zkTopicPath/${i}")
        // partition i of the topic, e.g. wordcount/0
        val tp = TopicAndPartition(topic, i)
        // add each partition's offset to fromOffsets, e.g. wordcount/0 -> 10001
        fromOffsets += (tp -> partitionOffset.toLong)
      }

      // transform every Kafka message into a (topic_name, message) tuple,
      // e.g. ("wordcount", "hello tom hello jerry")
      val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message())

      // create the direct DStream through KafkaUtils (fromOffsets makes consumption resume
      // from the offsets computed above)
      // type parameters: key type, value type, key decoder, value decoder, record type
      kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](ssc, kafkaParams, fromOffsets, messageHandler)
    } else {
      // nothing saved yet: start from the latest (largest) or earliest (smallest) offset,
      // depending on auto.offset.reset in kafkaParams
      kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
    }

    // offset ranges of the current batch
    var offsetRanges = Array[OffsetRange]()

    // the transform method exposes the RDD of the current batch; the underlying RDD is a
    // KafkaRDD, so its offset ranges can be captured here before any other transformation
    val messages: DStream[(String, Int)] = kafkaStream.transform { rdd =>
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.map(_._2).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // iterate over the RDDs of the DStream, one per batch
    messages.foreachRDD { rdd =>
      // operate on the RDD to trigger the action
      rdd.foreachPartition(partition =>
        partition.foreach(x => {
          println(x)
        })
      )

      // save each partition's end offset back to ZooKeeper so that the next batch
      // (or the next run of the job) resumes after the records that were just processed
      for (o <- offsetRanges) {
        val zkPath = s"${topicDirs.consumerOffsetDir}/${o.partition}"
        ZkUtils.updatePersistentPath(zkClient, zkPath, o.untilOffset.toString)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
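The offsets do not have to live in ZooKeeper. As a rough sketch of the Redis alternative mentioned above (the Jedis client is assumed, and the host, port and key layout are made up for the example), a helper like this could be called from inside foreachRDD instead of ZkUtils.updatePersistentPath:

import org.apache.spark.streaming.kafka.OffsetRange
import redis.clients.jedis.Jedis

object RedisOffsetStore {
  // hypothetical helper: adjust the host, port and key layout to your environment
  def saveOffsets(group: String, offsetRanges: Array[OffsetRange]): Unit = {
    val jedis = new Jedis("localhost", 6379)
    try {
      offsetRanges.foreach { o =>
        // hash key "<group>:<topic>", field = partition, value = the offset to resume from
        jedis.hset(s"$group:${o.topic}", o.partition.toString, o.untilOffset.toString)
      }
    } finally {
      jedis.close()
    }
  }
}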

 

5. Installing Kafka 0.10

Kafka cluster deployment. Key settings in config/server.properties on each broker (broker.id must be unique per broker):

broker.id=1

delete.topic.enable=true

log.dirs=/bigdata/kafka_2.11-0.10.2.1/data

zookeeper.connect=node-1.xiaoniu.com:2181,node-2.xiaoniu.com:2181,node-3.xiaoniu.com:2181

 

Start Kafka:

/bigdata/kafka_2.11-0.10.2.1/bin/kafka-server-start.sh -daemon /bigdata/kafka_2.11-0.10.2.1/config/server.properties

 

Stop Kafka:

/bigdata/kafka_2.11-0.10.2.1/bin/kafka-server-stop.sh

 

Create a topic:

/bigdata/kafka_2.11-0.10.2.1/bin/kafka-topics.sh --create --zookeeper node-1.xiaoniu.com:2181,node-2.xiaoniu.com:2181,node-3.xiaoniu.com:2181 --replication-factor 3 --partitions 3 --topic my-topic

 

 

List all topics:

/bigdata/kafka_2.11-0.10.2.1/bin/kafka-topics.sh --list --zookeeper node-1.xiaoniu.com:2181,node-2.xiaoniu.com:2181,node-3.xiaoniu.com:2181

 

Describe a topic:

/bigdata/kafka_2.11-0.10.2.1/bin/kafka-topics.sh --describe --zookeeper node-1.xiaoniu.com:2181,node-2.xiaoniu.com:2181,node-3.xiaoniu.com:2181 --topic my-topic

 

Start a console producer:

/bigdata/kafka_2.11-0.10.2.1/bin/kafka-console-producer.sh --broker-list node-1.xiaoniu.com:9092,node-2.xiaoniu.com:9092,node-3.xiaoniu.com:9092 --topic xiaoniu

 

Start a console consumer (the old consumer, which connects through ZooKeeper):

/bigdata/kafka_2.11-0.10.2.1/bin/kafka-console-consumer.sh --zookeeper node-1.xiaoniu.com:2181,node-2.xiaoniu.com:2181,node-3.xiaoniu.com:2181 --topic my-topic --from-beginning

 

# the new consumer connects to the broker (bootstrap server) addresses

/bigdata/kafka_2.11-0.10.2.1/bin/kafka-console-consumer.sh --bootstrap-server node-1.xiaoniu.com:9092,node-2.xiaoniu.com:9092,node-3.xiaoniu.com:9092 --topic xiaoniu --from-beginning

 

Kafka Connect:

https://kafka.apache.org/documentation/#connect

http://docs.confluent.io/2.0.0/connect/connect-jdbc/docs/index.html

 

Kafka Streams:

https://kafka.apache.org/documentation/streams

https://spark.apache.org/docs/1.6.1/streaming-kafka-integration.html

 

Kafka monitoring:

https://kafka.apache.org/documentation/#monitoring

https://github.com/quantifind/KafkaOffsetMonitor

https://github.com/yahoo/kafka-manager

 

Kafka ecosystem:

https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem

 

6. Spark Streaming direct integration with Kafka 0.10

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, ConsumerStrategies, HasOffsetRanges, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DirectStream {

  def main(args: Array[String]): Unit = {

    // create the SparkConf; remove .setMaster("local[2]") when submitting the job to a cluster
    val conf = new SparkConf().setAppName("DirectStream").setMaster("local[2]")
    // create a StreamingContext, which contains a SparkContext
    val streamingContext = new StreamingContext(conf, Seconds(5))

    // Kafka parameters
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "node-1.xiaoniu.com:9092,node-2.xiaoniu.com:9092,node-3.xiaoniu.com:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "test123",
      "auto.offset.reset" -> "earliest", // or "latest"
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val topics = Array("xiaoniu")

    // the consumed offsets are recorded in Kafka itself
    val stream = KafkaUtils.createDirectStream[String, String](
      streamingContext,
      // location strategy: distribute the partitions evenly across the available executors
      LocationStrategies.PreferConsistent,
      // consumer strategy: subscribe to a fixed collection of topics
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
    )

    // iterate over the RDDs (KafkaRDDs) of the DStream, one per batch interval
    stream.foreachRDD { rdd =>
      // get the offset ranges of this RDD
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // process the records
      rdd.foreach { line =>
        println(line.key() + " " + line.value())
      }
      // asynchronously commit the offsets back to Kafka,
      // some time later, after the outputs have completed
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
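If the batch needs to be transformed rather than just printed, the same pattern still applies: capture the offset ranges on the raw KafkaRDD first, then transform. The sketch below could replace the foreachRDD block above (it reuses stream, HasOffsetRanges and CanCommitOffsets from that code and is only meant as an illustration):

stream.foreachRDD { rdd =>
  // capture the offsets while the RDD is still the underlying KafkaRDD
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // word count over the record values of this batch
  val wordCounts = rdd.map(_.value()).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  wordCounts.collect().foreach(println)
  // commit only after the output of this batch has completed
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}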

 

7. Spark on YARN (cluster and client modes)

1. Official documentation

http://spark.apache.org/docs/latest/running-on-yarn.html

2. Configuration and installation

1. Install Hadoop: both the HDFS and YARN modules are required. HDFS is mandatory because Spark stores its jars on HDFS at runtime.

2. Install Spark: unpack the Spark distribution onto one server and edit the spark-env.sh configuration file; this Spark installation acts as a YARN client for submitting jobs:

export JAVA_HOME=/usr/local/jdk1.7.0_80
export HADOOP_CONF_DIR=/usr/local/hadoop-2.6.4/etc/hadoop

3. Start HDFS and YARN.

3. Run modes (cluster mode and client mode)

1. Cluster mode

./bin/spark-submit --class org.apache.spark.examples.SparkPi \

--master yarn \

--deploy-mode cluster \

--driver-memory 1g \

--executor-memory 1g \

--executor-cores 2 \

--queue default \

lib/spark-examples*.jar \

10

 

---------------------------------------------------------------------------------------------------------------------------------

./bin/spark-submit --class cn.edu360.spark.day1.WordCount \

--master yarn \

--deploy-mode cluster \

--driver-memory 1g \

--executor-memory 1g \

--executor-cores 2 \

--queue default \

/home/bigdata/hello-spark-1.0.jar \

hdfs://node-1.edu360.cn:9000/wc hdfs://node-1.edu360.cn:9000/out-yarn-1

 

 

2. Client mode

./bin/spark-submit --class org.apache.spark.examples.SparkPi \

--master yarn \

--deploy-mode client \

--driver-memory 1g \

--executor-memory 1g \

--executor-cores 2 \

--queue default \

lib/spark-examples*.jar \

10

 

spark-shell must be run in client mode:

./bin/spark-shell --master yarn --deploy-mode client

 

3. Differences between the two modes

Cluster mode: the Driver runs inside YARN, so the application's output cannot be displayed on the client. This mode is best suited to applications that save their final results to external storage (HDFS, Redis, MySQL, etc.) rather than printing to stdout; the client terminal only shows a brief status of the YARN job.

Client mode: the Driver runs on the client and the application's results are displayed there, so it is suited to applications whose results are printed as output (such as spark-shell).

 

4. How it works

Cluster mode:

The Spark Driver first starts as an ApplicationMaster inside the YARN cluster. Every job that the client submits to the ResourceManager is assigned a dedicated ApplicationMaster on one of the cluster's NodeManager nodes, and that ApplicationMaster manages the application for its entire lifecycle. The process in detail:

 

1. The client submits a request to the ResourceManager and uploads the jar to HDFS.

This involves four steps:

a) Connect to the RM.

b) Obtain metric, queue and resource information from the RM's ASM (ApplicationsManager).

c) Upload the application jar and the spark-assembly jar.

d) Set up the runtime environment and the container context (launch-container.sh and related scripts).

 

2. The ResourceManager requests resources from a NodeManager and creates the Spark ApplicationMaster (each SparkContext has one ApplicationMaster).

3. The NodeManager starts the ApplicationMaster, which registers with the ResourceManager's ASM.

4. The ApplicationMaster locates the jar file on HDFS and starts the SparkContext, the DAGScheduler and the YARN Cluster Scheduler.

5. The ApplicationMaster registers with the ResourceManager's ASM and requests container resources.

6. The ResourceManager notifies the NodeManagers to allocate containers, and reports about the containers arrive from the ASM (each container corresponds to one executor).

7. The Spark ApplicationMaster interacts directly with the containers (executors) to complete the distributed job.

 

Client mode:

In client mode the Driver runs on the client and obtains resources from the RM through an ApplicationMaster. The local Driver interacts with all of the executor containers and aggregates the final results. Closing the terminal kills the Spark application. In general, use this mode when the results only need to be returned to the terminal.

After the client-side Driver submits the application to YARN, YARN starts the ApplicationMaster and then the executors; both run inside containers (1 GB of memory per container by default). The ApplicationMaster is given driver-memory and each executor is given executor-memory. Because the Driver stays on the client, the program's results can be displayed there, and the Driver appears as a process named SparkSubmit.

 

8. Problems encountered

ERROR spark.SparkContext: Error initializing SparkContext.

org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.

        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)

        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)

        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)

        at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)

        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)

        at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)

        at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)

        at scala.Option.getOrElse(Option.scala:121)

        at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)

        at org.apache.spark.repl.Main$.createSparkSession(Main.scala:97)

        at $line3.$read$$iw$$iw.<init>(<console>:15)

        at $line3.$read$$iw.<init>(<console>:42)

        at $line3.$read.<init>(<console>:44)

        at $line3.$read$.<init>(<console>:48)

        at $line3.$read$.<clinit>(<console>)

        at $line3.$eval$.$print$lzycompute(<console>:7)

        at $line3.$eval$.$print(<console>:6)

        at $line3.$eval.$print()

        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

        at java.lang.reflect.Method.invoke(Method.java:498)

        at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)

        at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)

        at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)

        at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)

        at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)

        at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)

        at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)

        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)

        at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)

        at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)

        at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)

        at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)

        at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38)

        at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)

        at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)

        at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214)

        at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37)

        at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:98)

        at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)

        at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)

        at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)

        at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)

        at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)

        at org.apache.spark.repl.Main$.doMain(Main.scala:70)

        at org.apache.spark.repl.Main$.main(Main.scala:53)

        at org.apache.spark.repl.Main.main(Main.scala)

        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

        at java.lang.reflect.Method.invoke(Method.java:498)

        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)

        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)

        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)

        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)

        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

17/08/29 18:11:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!

17/08/29 18:11:51 WARN metrics.MetricsSystem: Stopping a MetricsSystem that is not running

org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.

  at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)

  at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)

  at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)

  at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)

  at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)

  at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)

  at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)

  at scala.Option.getOrElse(Option.scala:121)

  at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)

  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:97)

  ... 47 elided

 

Solution: add the following properties to yarn-site.xml:

 

<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>

<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>

 
