2020年寒假学习进度第六天

今天主要进行了spark实验六的学习,Spark Streaming 编程初级实践

Flume 是非常流行的日志采集系统,可以作为 Spark Streaming 的高级数据源。请把 Flume Source 设置为 netcat 类型,从终端上不断给 Flume Source 发送各种消息,Flume 把消息汇集到 Sink,这里把 Sink 类型设置为 avro,由 Sink 把消息推送给 Spark Streaming,由自己编写的 Spark Streaming 应用程序对消息进行处理
⑴配置 Flume 数据源
1.cd /usr/local/flume
2.cd conf
3.vim flume-to-spark.conf
 
flume-to-spark.conf 文件中写入如下内容:
#flume-to-spark.conf: A single-node Flume configuration
 # Name the components on this agent
 a1.sources = r1
 a1.sinks = k1
 a1.channels = c1
 # Describe/configure the source
 a1.sources.r1.type = netcat
 a1.sources.r1.bind = localhost
 a1.sources.r1.port = 33333
# Describe the sink
 a1.sinks.k1.type = avro
 a1.sinks.k1.hostname = localhost
 a1.sinks.k1.port =44444
 # Use a channel which buffers events in memory
 a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000000
 a1.channels.c1.transactionCapacity = 1000000
 # Bind the source and sink to the channel
 a1.sources.r1.channels = c1
 a1.sinks.k1.channel = c1

  

(2)Spark 的准备工作
  1.cd /usr/local/spark
  2. ./bin/spark-shell
 
启动成功后,在 spark-shell 中执行下面 import 语句:
1.scala> import org.apache.spark.streaming.flume._
 
你可以看到,马上会报错,因为找不到相关的 jar 包。所以,现在我们就需要下载spark-streaming-flume_2.11-2.1.0.jar,其中2.11表示对应的Scala版本号,2.1.0表示Spark版本号。现在请在 Linux 系统中,打开一个火狐浏览器,打开下方的网址http://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume_2.11/2.1.0,里面有提供 spark-streaming-flume_2.11-2.1.0.jar 文件的下载。
 
随后:
1.cd /usr/local/spark/jars
2.mkdir flume
3.cd ~
4.cd 下载
5.cp ./spark-streaming-flume_2.11-2.1.0.jar
6./usr/local/spark/jars/flume
7.cd /usr/local/flume/lib
8.ls
9.cp ./* /usr/local/spark/jars/flume
 
⑶编写 Spark 程序使用 Flume 数据源
1.cd /usr/local/spark/mycode
2.mkdir flume
3.cd flume
4.mkdir -p src/main/scala
5.cd src/main/scala
6.vim FlumeEventCount.scala
package org.apache.spark.examples.streaming
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.flume._
import org.apache.spark.util.IntParam
object FlumeEventCount {
 def main(args: Array[String]) {
 if (args.length < 2) {
 System.err.println(
 "Usage: FlumeEventCount  ")
 System.exit(1)
 }
StreamingExamples.setStreamingLogLevels()
 val Array(host, IntParam(port)) = args
 val batchInterval = Milliseconds(2000)
 // Create the context and set the batch size
 val sparkConf = new SparkConf().setAppName("FlumeEventCount").setMaster("local
[2]")
 val ssc = new StreamingContext(sparkConf, batchInterval)
 // Create a flume stream
 val stream = FlumeUtils.createStream(ssc, host, port, StorageLevel.MEMORY_ONLY_S
ER_2)
 // Print out the count of events received from this server in each batch
 stream.count().map(cnt => "Received " + cnt + " flume events." ).print()
 ssc.start()
 ssc.awaitTermination()
 } }

  

然后再使用 vim 编辑器新建 StreamingExamples.scala 文件,输入如下代码,用于控
制日志输出格式:
package org.apache.spark.examples.streaming
import org.apache.log4j.{Level, Logger}
import org.apache.spark.internal.Logging
object StreamingExamples extends Logging {
 /** Set reasonable logging levels for streaming if the user has not configured log4
j. */
 def setStreamingLogLevels() {
 val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
 if (!log4jInitialized) {
 // We first log something to initialize Spark's default logging, then we overri
de the
 // logging level.
 logInfo("Setting log level to [WARN] for streaming example." +
 " To override add a custom log4j.properties to the classpath.")
 Logger.getRootLogger.setLevel(Level.WARN)
 }
 }
}

  

然后,请执行下面命令新建一个 simple.sbt 文件:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-streaming-flume_2.11" % "2.1.0"

  进行打包:

 1.cd /usr/local/spark/mycode/flume/
 2./usr/local/sbt/sbt package#执行这一步之前需要先安装 sbt 插件
 
⑷测试程序效果
关闭之前打开的所有终端。首先,请新建第 1 个 Linux 终端,启动 Spark Streaming 应用程序,命令如下:
 1.cd /usr/local/spark
 2. ./bin/spark-submit --driver-class-path /usr/local/spark/jars/*:/usr/local/spark/jars/flume/* --class "org.apache.spark.examples.streaming.FlumeEventCount" /usr/local/spark/mycode/flume/target/scala-2.11/simple-project_2.11-1.0.jar localhost 44444
 
通过上面命令,我们为应用程序提供 host 和 port 两个参数的值分别为 localhost 和44444,程序会对 localhost 的 44444 端口进行监听,Milliseconds(2000)设置了时间间隔为2 秒,所以,该程序每隔 2 秒就会从指定的端口中获取由 Flume Sink 发给该端口的消息,然后进行处理,对消息进行统计,打印出“Received 0 flume events.”这样的信息。执行该命令后,屏幕上会显示程序运行的相关信息,并会每隔 2 秒钟刷新一次信息,大量信息中会包含如下重要信息:
-------------------------------------------
Time: 1488029430000 ms
-------------------------------------------
Received 0 flume events.
因为目前 Flume 还没有启动,没有给 FlumeEventCount 发送任何消息,所以 FlumeEvents 的数量是 0。第 1 个终端不要关闭,让它一直处于监听状态。现在,我们可以再另外新建第 2 个终端,在这个新的终端中启动 Flume Agent,命令如下:
 1.cd /usr/local/flume
 2.bin/flume-ng agent --conf ./conf --conf-file ./conf/flume-to-spark.conf --name a1 -Dflume.root.logger=INFO,console
启动 agent 以后,该 agent 就会一直监听 localhost 的 33333 端口,这样,我们下面就可以通过“telnet localhost 33333”命令向 Flume Source 发送消息。第 2 个终端也不要关闭,让它一直处于监听状态。
 
请另外新建第 3 个终端,执行如下命令:
1.telnet localhost 33333
 
执行该命令以后,就可以在这个窗口里面随便敲入若干个字符和若干个回车,这些消息都会被 Flume 监听到,Flume 把消息采集到以后汇集到 Sink,然后由 Sink 发送给 Spark的 FlumeEventCount 程序进行处理。然后,你就可以在运行 FlumeEventCount 的前面那个终端窗口内看到类似如下的统计结果:
-------------------------------------------
Time: 1488029430000 ms
-------------------------------------------
Received 0 flume events.
#这里省略了其他屏幕信息
-------------------------------------------
Time: 1488029432000 ms
-------------------------------------------
Received 8 flume events.
#这里省略了其他屏幕信息
-------------------------------------------
Time: 1488029434000 ms
-------------------------------------------
Received 21 flume events.
 
 
 
以上是关于实验的一些操作步骤,最后经过实验出现了一个问题,启动Flume之后,就应该能接受到Received 21 flume events这个提示,但是我做实验启动flume之后,依然是0,不知原因,此问题还未解决。

你可能感兴趣的:(2020年寒假学习进度第六天)