Flume + Kafka + Spark Streaming + HDFS

When deploying Flume across servers, keep the company's security policy in mind: if something does not work, the problem may not be in your configuration, so check with the operations team.
The business requirements are:
1. Flume is deployed on a server outside the cluster, and data is collected across servers.
2. The collected data is consumed in real time through Kafka + Spark Streaming.
3. The results are saved to HDFS.

Flume configuration on Server A:
flume_kafka_source.conf

a1.sources = r1
a1.channels = c1
a1.sinks =s1


# Source configuration
a1.sources.r1.type=exec
a1.sources.r1.command=tail -F /home/diamonds/ztx/test/kafka.log
a1.sources.r1.channels=c1

# Channel configuration
a1.channels.c1.type=memory
a1.channels.c1.capacity=20000
a1.channels.c1.transactionCapacity=1000


# Describe the sink
a1.sinks.s1.type = avro
# Public IP of Server B
a1.sinks.s1.hostname = 118.31.75.XXX
a1.sinks.s1.port = 33333


# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.s1.channel = c1
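
Before starting the agents, it helps to confirm that Server A can actually reach the avro port on Server B; this is where the security-policy issue mentioned at the top usually shows up. A tiny, purely illustrative Scala check (telnet or nc does the same job); replace the XXX in the address with the real value:

import java.net.{InetSocketAddress, Socket}

// Attempts a TCP connection to the avro source port opened by the Server B agent.
object PortCheck {
  def main(args: Array[String]): Unit = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress("118.31.75.XXX", 33333), 5000) // 5 s timeout
      println("port reachable")
    } catch {
      case e: Exception => println(s"port not reachable: ${e.getMessage}")
    } finally {
      socket.close()
    }
  }
}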

Flume configuration on Server B:
flume_kafka_sink.conf

# Agent name and the names of its source, channel and sink
a1.sources = r1
a1.sinks = s1
a1.channels = c1

# Define the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 33333

# Define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 10000


# Kafka sink
a1.sinks.s1.type= org.apache.flume.sink.kafka.KafkaSink

# Kafka broker addresses and ports
a1.sinks.s1.brokerList=sparkmaster:9092,datanode1:9092,datanode2:9092

# Kafka topic
a1.sinks.s1.topic=realtime

# Serializer
a1.sinks.s1.serializer.class=kafka.serializer.StringEncoder
a1.sinks.s1.channel=c1
a1.sources.r1.channels=c1

Maven dependencies (pom.xml):


    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <scala.version>2.11.8</scala.version>
        <hadoop.version>2.7.4</hadoop.version>
        <spark.version>2.0.2</spark.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-flume_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

        <!-- Spark Streaming core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <!-- Spark Streaming + Kafka 0.8 integration -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <!-- Logging -->
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.14</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass></mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

Kafka + Spark Streaming code:

package cn.test.spark

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils

object SparkStreamingKafkaDirect {
  // Kafka brokers in the production environment
  val ks = "sparkmaster:9092,datanode1:9092,datanode2:9092"

  def main(args: Array[String]): Unit = {
    // Create the SparkConf
    // Cluster mode:
    // val sparkConf: SparkConf = new SparkConf().setAppName("SparkStreamingKafkaDirect")
    // Local mode:
    val sparkConf: SparkConf = new SparkConf().setAppName("SparkStreamingKafkaDirect").setMaster("local[3]")
    // Create the SparkContext
    val sparkContext = new SparkContext(sparkConf)
    // Only log warnings and above
    sparkContext.setLogLevel("WARN")
    // Create the StreamingContext; Seconds(...) is the batch interval
    val streamingContext = new StreamingContext(sparkContext, Seconds(6))
    // Checkpoint directory, used to save the topic offsets
    streamingContext.checkpoint("./spark_kafka01")
    // Kafka parameters
    val kafkaParams = Map("metadata.broker.list" -> ks, "group.id" -> "ztx01")
    // Topics to consume
    val topics = Set("realtime")
    // Read from Kafka with the direct (receiver-less) approach
    val dstream: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](streamingContext, kafkaParams, topics)
    // Keep only the message value of each (key, value) pair
    val line: DStream[String] = dstream.map(_._2)

    // Save the results to HDFS
    // line.saveAsTextFiles("/user/hdfs/source/log/kafkaSs/qwer")

    // Print a sample of each batch
    line.print()

    // Start the streaming computation
    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
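
The commented-out saveAsTextFiles call writes every partition of every batch as a separate file, which quickly fills HDFS with small files (see note 5 at the end). A minimal alternative sketch, not part of the original code: coalesce each non-empty batch before writing, using the same line DStream and an illustrative output path. It would go in place of the commented-out call, before streamingContext.start():

    // Sketch only: write each micro-batch to HDFS with fewer part files.
    // The output path is illustrative; adjust it to your own directory layout.
    line.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        rdd.coalesce(1) // merge partitions so each batch produces a single part file
          .saveAsTextFile(s"/user/hdfs/source/log/kafkaSs/qwer-${time.milliseconds}")
      }
    }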

Kafka is already installed on Server B, so bring up the Kafka side first:

1. Create the topic (the command below only describes an existing topic; to create it, replace --describe with --create and add --partitions and --replication-factor):
./kafka-topics --describe --zookeeper 10.10.0.27:2181,10.10.0.8:2181,10.10.0.127:2181 --topic realtime
2. Start a console consumer to watch the topic:
./kafka-console-consumer --zookeeper 10.10.0.27:2181,10.10.0.8:2181,10.10.0.127:2181 --from-beginning --topic realtime
3. Start the Flume agent on Server B:
bin/flume-ng agent -c conf -f conf/flume_kafka_sink.conf --name a1 -Dflume.root.logger=DEBUG,console
4. Then start the agent on Server A:
bin/flume-ng agent -c conf -f conf/flume_kafka_source.conf --name a1 -Dflume.root.logger=DEBUG,console
(To run an agent in the background, prefix the command with nohup and append &.)
5. Submit the Spark job:
./spark2-submit \
  --class cn.test.spark.SparkStreamingKafkaDirect \
  --master yarn \
  --executor-memory 1g \
  --total-executor-cores 2 \
  /opt/cloudera/parcels/SPARK2/bin/sparkstreaming-kafka.jar
(Make sure the Scala and Spark versions used in IDEA are compatible with the Spark version on the cluster; otherwise the job will fail with errors. You can check the cluster's Spark version with spark-shell.)
6. Create a .log file for testing.
7. Create a shell script that writes data into that .log file (a minimal Scala alternative is sketched below).
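
A minimal Scala alternative to that shell script, purely for illustration (the log path is taken from the Server A configuration; everything else is an assumption):

import java.io.{FileWriter, PrintWriter}

// Appends one test line per second to the file tailed by the exec source on Server A.
object LogGenerator {
  def main(args: Array[String]): Unit = {
    val path = "/home/diamonds/ztx/test/kafka.log" // file tailed by flume_kafka_source.conf
    var i = 0
    while (true) {
      val out = new PrintWriter(new FileWriter(path, true)) // open in append mode
      out.println(s"test-event $i ${System.currentTimeMillis()}")
      out.close()
      i += 1
      Thread.sleep(1000)
    }
  }
}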

PS:

  1. netstat -ntlp  // list all TCP ports currently listening
  2. Pay attention to what is being monitored (the contents of a single file, or changes to the files in a directory).
  3. Double-check the name of the topic you create.
  4. Double-check the address Flume uses to reach Kafka (whether it is mapped, and whether it is the public or the private address).
  5. When saving to HDFS, tune the YARN/Spark configuration so that fewer small files are generated.
