Spark Streaming + Flume + Kafka

This is a consolidated note. We have already covered separately how Spark Streaming reads data from Flume and how it reads data from Kafka.

Now we will wire the pieces together: log data is sent to Flume, Flume collects it and forwards it to Kafka in real time, and Spark Streaming then reads the data from Kafka and processes it.

Flume conveniently provides a log4j appender, so a log4j.properties configuration alone is enough to ship the log output produced by a program to Flume. For example:

GenerateLog.java

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.log4j.Logger;

public class GenerateLog {
    private static final Logger logger = Logger.getLogger(GenerateLog.class.getName());

    public static void main(String[] args) throws InterruptedException {
        List<String> list = new ArrayList<>();
        list.add("a");
        list.add("b");
        list.add("c");
        int len = list.size();
        Random random = new Random();
        // Emit a random "x,y" record every 2 seconds; the flume appender ships each record to Flume.
        while (true) {
            Thread.sleep(2000);
            String s1 = list.get(random.nextInt(len));
            String s2 = list.get(random.nextInt(len));
            logger.info(s1 + "," + s2);
        }
    }
}

log4j.properties

log4j.rootLogger=INFO,stdout,flume

log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c] [%p] - %m%n

log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname=master
log4j.appender.flume.Port=9449

The Java program (GenerateLog.java) picks up log4j.properties and, through the flume appender defined there, sends every log record to Flume (hostname master, port 9449). Note that the Flume log4j client library (Maven artifact org.apache.flume.flume-ng-clients:flume-ng-log4jappender) must be on the program's classpath for the Log4jAppender class to be found.

When the Flume agent on the master host receives the log data, it forwards it to Kafka according to the following configuration file (a sketch of the command to start the agent follows the config):

agent1.sources=avro-source
agent1.channels=memory-channel
agent1.sinks=kafka-sink

#define source
agent1.sources.avro-source.type=avro
agent1.sources.avro-source.bind=master
agent1.sources.avro-source.port=9449

#define channel
agent1.channels.memory-channel.type=memory

#define sink
agent1.sinks.kafka-sink.type=org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.kafka-sink.kafka.topic=test
agent1.sinks.kafka-sink.kafka.bootstrap.servers=master:9092,slave1:9092,slave2:9092

agent1.sources.avro-source.channels=memory-channel
agent1.sinks.kafka-sink.channel=memory-channel
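
Once the configuration is saved, the agent can be started with the flume-ng command. This is a minimal sketch: the file name flume-kafka.conf and the FLUME_HOME path are assumptions (substitute your own locations), and the Kafka topic test is assumed to already exist or be auto-created.

flume-ng agent \
  --name agent1 \
  --conf $FLUME_HOME/conf \
  --conf-file $FLUME_HOME/conf/flume-kafka.conf \
  -Dflume.root.logger=INFO,console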

Once the data reaches Kafka, the Spark Streaming program can read it and process it with the following code:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object KafkaDirectWordCount {
  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaDirectWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")
    ssc.checkpoint(".")

    // Kafka consumer configuration
    val kafkaParams = Map(
      "bootstrap.servers" -> "master:9092",
      "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "group.id" -> "kafkatest",
      "enable.auto.commit" -> "false"
    )
    val topics = Set("test")
    val consumerStrategies = ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
    // Direct stream: each record's value is one log line such as "a,b"
    val kafkaDStream = KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent, consumerStrategies)

    val res = kafkaDStream
      .map(record => record.value())
      .flatMap(_.split(","))
      .map(x => (x, 1))
      .reduceByKey(_ + _)
    res.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
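
To build and run this consumer, the project needs the Spark Streaming core and the Kafka 0-10 integration on its classpath. A minimal build.sbt sketch, assuming an sbt build; the version 2.4.8 is only an example and should match the Spark and Scala versions of your cluster:

// build.sbt (sketch) -- the versions below are examples; use your cluster's Spark version
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "2.4.8",
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.4.8"
)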

With that, an end-to-end log processing pipeline is in place: log4j -> Flume -> Kafka -> Spark Streaming.

 
