Real-time analysis: building a Flume-Kafka pipeline and writing the results to MySQL

Setting up this pipeline is fairly involved, so if anything here falls short, corrections are welcome. A simplified diagram of the real-time analysis architecture is attached below to help follow along.

[Figure 1: simplified real-time analysis architecture]

The offline analysis architecture is attached as well:

[Figure 2: offline analysis architecture]

Real-time pipeline setup steps:

1. In a command prompt (Win+R), go to the directory holding the prepared SocketTest.java and run javac SocketTest.java to produce SocketTest.class (delete the package declaration from SocketTest.java before compiling).

Copy the .class file to the chosen directory on Linux; here it is /home/cat/soft/ST.

Also put the prepared access.20120104.log (a crawler log with roughly 3 million records) into ST, and create an empty data.log in the same directory.

2. Under one user, run java SocketTest access.20120104.log data.log  (data.log is the file that records the data streamed out of access.20120104.log).

In another terminal, run tail -F data.log; the data from access.20120104.log should start appearing in data.log.

Seeing that output confirms the link works.
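SocketTest.java itself is not reproduced in this post. Conceptually it replays the large crawler log into data.log line by line, so that tail -F (and later the Flume exec source) sees a continuously growing file. A minimal Scala sketch of that idea follows; the class name and the pacing delay are my assumptions, not the original program:

import java.io.{FileWriter, PrintWriter}
import scala.io.Source

// Replay the source log into the target log with a short pause per line,
// so that "tail -F data.log" sees a live, growing stream.
object LogReplaySketch {
  def main(args: Array[String]): Unit = {
    val Array(src, dst) = args                            // e.g. access.20120104.log data.log
    val out = new PrintWriter(new FileWriter(dst, true))  // append to data.log
    try {
      for (line <- Source.fromFile(src).getLines()) {
        out.println(line)
        out.flush()              // make each line visible to tail -F immediately
        Thread.sleep(1)          // assumed pacing; tune to control the replay rate
      }
    } finally {
      out.close()
    }
  }
}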

3. Now set up the Flume-to-Kafka agent configuration file (tokafka.conf).

1. Exec source reference: http://flume.apache.org/FlumeUserGuide.html#exec-source


a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/cat/soft/ST/data.log
a1.sources.r1.channels = c1

2. Kafka sink reference, from the Flume 1.6 User Guide: http://flume.apache.org/releases/content/1.6.0/FlumeUserGuide

agent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.k1.topic = mytopic
agent.sinks.k1.brokerList = lion:9092
agent.sinks.k1.batchSize = 20
agent.sinks.k1.requiredAcks = 1

Complete configuration (tokafka.conf):

agent.sources = r1
agent.channels = c1
agent.sinks = k1

# For each one of the sources, the type is defined
agent.sources.r1.type = exec
agent.sources.r1.command = tail -F /home/cat/soft/ST/data.log

# The channel can be defined as follows.
agent.sources.r1.channels = c1

# Each sink's type must be defined

agent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.k1.topic = mytopic
agent.sinks.k1.brokerList = lion:9092
agent.sinks.k1.batchSize = 20
agent.sinks.k1.requiredAcks = 1

#Specify the channel the sink should use
agent.sinks.k1.channel = c1

# Each channel's type is defined.
agent.channels.c1.type = memory

# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.c1.capacity = 100

The configuration is now complete and can be tested:

(1) On one machine, cd soft/kafka/ and start ZooKeeper:  zookeeper-server-start.sh config/zookeeper.properties   (the Kafka broker itself must also be running: kafka-server-start.sh config/server.properties, as in the startup list at the end of this post)
(2) On another machine, cd soft/kafka/ and start a console consumer on the topic:  kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic mytopic --from-beginning

(3) On another machine, cd soft/flume/conf; the Flume agent is started from there (this is how you check whether Kafka actually receives the data Flume ships).

(4) Start Flume (for the test):

flume-ng agent --conf /home/cat/soft/flume/conf/ --conf-file tokafka.conf --name agent -Dflume.root.logger=INFO,console

The Spark Streaming job itself is run in IntelliJ IDEA 2017.2 x64.

First add the Spark-Kafka integration dependency to the project's pom.xml (https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10_2.11/2.4.0); the project also needs the matching spark-core and spark-streaming dependencies:

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>


package Test1227

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
object StreamingKafka {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("nwc").setMaster("local[2]")
    val streamingContext = new StreamingContext(conf, Seconds(30))
    streamingContext.sparkContext.setLogLevel("ERROR")
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "lion:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "use_a_separate_group_id_for_each_stream",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (true: java.lang.Boolean)
    )

    val topics = Array("mytopic")
    val stream = KafkaUtils.createDirectStream[String, String](
      streamingContext,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    // Write each 30-second batch to text files; saveAsTextFiles creates one
    // directory per batch, named D://123//StreamingKafka-<batch time in ms>.
    stream.map(record => record.value()).saveAsTextFiles("D://123//StreamingKafka")

    streamingContext.start()
    streamingContext.awaitTermination()
    streamingContext.stop()
  }

}

 

(5) Finally, on Linux, run java SocketTest access.20120104.log data.log to start feeding data.

The results are written to local files by saveAsTextFiles, or with .print they are shown directly on the console.
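For a quick console check instead of writing files, the DStream built above can simply be printed each batch; a two-line sketch reusing the stream value defined in the code above:

    // Print the first few records of every batch to the console instead of saving text files
    stream.map(record => record.value()).print()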

(6) Getting the results into MySQL

Put DBUtil.java and db.properties into the IDEA project:

[Figure 3: DBUtil.java and db.properties placed in the IDEA project]
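DBUtil.java and db.properties themselves are not shown in the post. As a rough guide, here is a minimal Scala equivalent of what DBUtil.getConnection() presumably does; the property keys and the example values in the comments are assumptions:

import java.sql.{Connection, DriverManager}
import java.util.Properties

// Read JDBC settings from db.properties on the classpath and open a connection.
object DBUtilSketch {
  private val props = new Properties()
  props.load(getClass.getClassLoader.getResourceAsStream("db.properties"))

  def getConnection(): Connection = {
    Class.forName(props.getProperty("driver"))   // e.g. com.mysql.jdbc.Driver
    DriverManager.getConnection(
      props.getProperty("url"),                  // e.g. jdbc:mysql://localhost:3306/bd1802
      props.getProperty("username"),
      props.getProperty("password"))
  }
}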

Here is the code that writes the results to MySQL:

package Test1227

import DButil.DBUtil
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.log4j.PropertyConfigurator
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
object StreamingKafka {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("nwc").setMaster("local[2]")
    val streamingContext = new StreamingContext(conf, Seconds(20))
    //streamingContext.sparkContext.setLogLevel("ERROR")
    PropertyConfigurator.configure("log4j.properties.template")
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "lion:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "use_a_separate_group_id_for_each_stream",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (true: java.lang.Boolean)
    )

    val topics = Array("mytopic")
    val stream = KafkaUtils.createDirectStream[String, String](
      streamingContext,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )
//    stream.map(record =>  record.value().toString).saveAsTextFiles("D://123//StreamingKafka")
    // Take the first whitespace-separated field of each record, count its occurrences
    // within the 20-second batch, and insert the counts into MySQL.
    // Note: counts are per batch; each batch appends new rows to the mytopic table.
    stream.map(record => record.value())
      .map(_.split(" ")(0))
      .map((_, 1))
      .reduceByKey(_ + _)
      .repartition(1)
      .foreachRDD { rdd =>
        rdd.foreachPartition { partitionOfRecords =>
          // One connection and one prepared statement per partition,
          // reused for every record and closed when the partition is done.
          val connection = DBUtil.getConnection()
          val sql = "INSERT INTO mytopic VALUES (?,?)"
          val stat = connection.prepareStatement(sql)
          try {
            partitionOfRecords.foreach { record =>
              stat.setString(1, record._1)
              stat.setInt(2, record._2)
              val n = stat.executeUpdate()
              // if (n == 1) println("inserted into mysql")
            }
          } finally {
            stat.close()
            connection.close()
          }
        }
      }
    streamingContext.start()
    streamingContext.awaitTermination()
    streamingContext.stop()
  }

}

 

 

Open a terminal and proceed as follows.

First log in to MySQL and create the target table; it needs two columns to match INSERT INTO mytopic VALUES (?,?), a string key and an integer count.

[Figures 4-5: logging in to MySQL and creating the mytopic table]
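The two screenshots are not preserved in this text version. In the post the table was created directly in the mysql client (database bd1802); as a reference, a minimal sketch that creates an equivalent table through the same DBUtil connection, where the column names and sizes are my assumptions and only the two-column shape is required:

import DButil.DBUtil

// Create a two-column table matching "INSERT INTO mytopic VALUES (?,?)".
object CreateTableSketch {
  def main(args: Array[String]): Unit = {
    val conn = DBUtil.getConnection()
    conn.createStatement().executeUpdate(
      "CREATE TABLE IF NOT EXISTS mytopic (item VARCHAR(100), cnt INT)")
    conn.close()
  }
}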

Everything is ready to run:

Start ZooKeeper:
    zookeeper-server-start.sh ~/soft/kafka/config/zookeeper.properties
Start the Kafka broker:
    kafka-server-start.sh ~/soft/kafka/config/server.properties
Create the topic (this step is optional):
    kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic mytopic
Start SocketTest:
    java SocketTest access.20120104.log sparktest/data.log
Start Flume:
    flume-ng agent --conf ~/soft/flume/conf --conf-file ~/soft/flume/conf/tokafka.conf --name agent -Dflume.root.logger=INFO,console

Now check the results:

use bd1802;

select * from mytopic;
