Flume + Kafka + Trident Storm + HBase: A Hands-On Project


Copyright notice: reproduction prohibited; violations will be pursued.
Tags (space-separated): Storm project

Written by Vin

1. Project Overview

Project name: real-time website traffic statistics built on Storm
Project requirements: use Storm to analyze the website access logs produced by the business system and compute various PV (page view) metrics in real time, including:

PV of each individual URL
PV by external referrer site
PV by search keyword

Project architecture:
[Figure 1: project architecture]
This article is meant to record the key configuration points for later reference, so everything is set up the simple way: access logs are generated by code to simulate Nginx log output, and a single Flume tier monitors that log file.

2. Data Simulation

2.1 Data simulation and environment setup

1. Generating logs
Sample log lines:

132.46.30.61 - - [1476285399264] "GET /list.php HTTP/1.1" 200 0 "-" "Mozilla/5.0 (Linux; Android 4.2.1; Galaxy Nexus Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19" "-"
215.168.214.201 - - [1476285965677] "GET /edit.php HTTP/1.1" 200 0 "http://www.google.cn/search?q=spark mllib" "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" "-"

Based on the sample lines above, a Scala program generates one log entry every second. The code (file NginxLogGenerator.scala) is as follows:

package org.project.storm.study

import scala.collection.immutable.IndexedSeq
import scala.util.Random

/**
 * Nginx access-log generator: emits one simulated log line per second.
 * Created by hp-pc on 2016/10/16.
 */
object NginxLogGenerator {
  /** user_agent **/
  val userAgents: Map[Double, String] = Map(
    0.0 -> "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)",
    0.1 -> "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)",
    0.2 -> "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727)",
    0.3 -> "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    0.4 -> "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    0.5 -> "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    0.6 -> "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    0.7 -> "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0_3 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11B511 Safari/9537.53",
    0.8 -> "Mozilla/5.0 (Linux; Android 4.2.1; Galaxy Nexus Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19",
    0.9 -> "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    1.0 -> " "
  )

  /**IP**/
  val ipSliceList = List(10,28,29,30,43,46,53,61,72,89,96,132,156,122,167,
    143,187,168,190,201,202,214,215,222)

  /** url **/
  val urlPathList = List(
    "login.php","view.php","list.php","upload.php","admin/login.php","edit.php","index.html"
  )

  /** http_refer **/
  val httpRefers = List(
    "http://www.baidu.com/s?wd=%s",
    "http://www.google.cn/search?q=%s",
    "http://www.sogou.com/web?query=%s",
    "http://www.yahoo.com/s?p=%s",
    "http://cn.bing.com/search?q=%s"
  )

  /** search_keyword **/
  val searchKeywords = List(
    "spark",
    "hadoop",
    "yarn",
    "hive",
    "mapreduce",
    "spark mllib",
    "spark sql",
    " IPphoenix",
    "hbase"
  )

  val random = new Random()

  /** ip **/
  def sampleIp(): String ={
    val ipEles: IndexedSeq[Int] = (1 to 4).map{
      case i =>
        val ipEle: Int = ipSliceList(random.nextInt(ipSliceList.length))
        //println(ipEle)
        ipEle
    }

    ipEles.iterator.mkString(".")
  }

  /**
   * url
   * @return
   */
  def sampleUrl(): String ={
    urlPathList(random.nextInt(urlPathList.length))
  }

  /**
   * user_agent: round a random double to one decimal place so it maps to a key of the userAgents map
   * @return
   */
  def sampleUserAgent(): String ={
    val distUppon = random.nextDouble()
    userAgents("%#.1f".format(distUppon).toDouble)
  }


  /** http_refer: about 80% of lines get "-" (no referrer); the rest get a random search-engine referrer with a random keyword **/
  def sampleRefer()={
    val fra = random.nextDouble()
    if(fra > 0.2)
      "-"

    else {
      val referStr = httpRefers(random.nextInt(httpRefers.length))
      val queryStr = searchKeywords(random.nextInt(searchKeywords.length))

      referStr.format(queryStr)
    }
  }



  def sampleOneLog() ={
    val time = System.currentTimeMillis()
    val query_log = "%s - - [%s] \"GET /%s HTTP/1.1\" 200 0 \"%s\" \"%s\" \"-\"".format(
      sampleIp(),
      time,
      sampleUrl(),
      sampleRefer(),
      sampleUserAgent()
    )
    query_log
  }


  def main(args: Array[String]) {
    while(true){
      println(sampleOneLog())

      Thread.sleep(1000)
    }
  }

}

Example run:
[Figure: sample generator output]
2. Simulating the Nginx server
Create a new directory on Linux: mkdir ~/project_workspace
Copy NginxLogGenerator.scala into the newly created directory.
Write a shell script named generator_log.sh to compile and run the Scala file:

#!/bin/bash

SCALAC='/usr/bin/scalac'
SCALA='/usr/bin/scala'

# compile; class files land under the package directory (org/project/storm/study) below the current directory
$SCALAC NginxLogGenerator.scala

# run the generator by its fully qualified name and append the output to nginx.log
# (replace /path/to/classes with the directory containing the compiled classes, e.g. the current directory)
$SCALA -classpath /path/to/classes org.project.storm.study.NginxLogGenerator >> nginx.log

Run sh generator_log.sh and an nginx.log file is produced in that directory.
You can watch it with tail -f nginx.log; stop the generator with CTRL + C, or get its PID with jps and run kill -9 <pid>.
Sample output:
[Figure 2: tail of the generated nginx.log]

2.2 Flume setup and configuration

On the vin01 machine, install and configure Flume, then create a new file Storm_project.conf with the following contents:

#exec source - memory channel - kafka sink/hdfs sink
a1.sources = r1
a1.sinks = kafka_sink hdfs_sink
a1.channels = c1 c2

a1.sources.r1.type = exec
# note: configuring this path as ~/... did not work; use the absolute path
a1.sources.r1.command = tail -F /home/vin/project_workspace/nginx.log


# kafka_sink
a1.sinks.kafka_sink.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.kafka_sink.topic = nginxlog
a1.sinks.kafka_sink.brokerList = vin01:9092
a1.sinks.kafka_sink.requiredAcks = 1
a1.sinks.kafka_sink.batchSize = 20
a1.sinks.kafka_sink.channel = c1

# hdfs_sink
a1.sinks.hdfs_sink.type = hdfs
a1.sinks.hdfs_sink.hdfs.path = /flume/events/%y-%m-%d
a1.sinks.hdfs_sink.hdfs.filePrefix = nginx_log-
a1.sinks.hdfs_sink.hdfs.fileType = DataStream
a1.sinks.hdfs_sink.hdfs.useLocalTimeStamp = true
a1.sinks.hdfs_sink.hdfs.round = true
a1.sinks.hdfs_sink.hdfs.roundValue = 10
a1.sinks.hdfs_sink.hdfs.roundUnit = minute
a1.sinks.hdfs_sink.hdfs.rollInterval = 0
a1.sinks.hdfs_sink.hdfs.rollSize = 102400
a1.sinks.hdfs_sink.hdfs.rollCount = 0


# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.kafka_sink.channel = c1
a1.sinks.hdfs_sink.channel = c2
a1.sources.r1.selector.type = replicating

2.3 Create the nginxlog topic in Kafka and test it

First make sure ZooKeeper is running: zkServer.sh status
Start the Kafka cluster:

[vin@vin01 kafka_2.10-0.8.2.1]$ bin/kafka-server-start.sh config/server.properties &
In production, start it in the background instead:
$ nohup bin/kafka-server-start.sh config/server.properties > /dev/null 2>&1 &

Create the topic:
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic nginxlog
List topics:
bin/kafka-topics.sh --list --zookeeper localhost:2181
Test the Kafka cluster:
Start a producer: [vin@vin01 kafka_2.10-0.8.2.1]$ bin/kafka-console-producer.sh --broker-list vin01:9092 --topic nginxlog
Start a consumer: [vin@vin01 kafka_2.10-0.8.2.1]$ bin/kafka-console-consumer.sh --zookeeper vin01:2181 --topic nginxlog --from-beginning
Send messages from the producer window and check whether the consumer window receives them.
Describe a topic: $ bin/kafka-topics.sh --describe --zookeeper vin01:2181 --topic nginxlog

2.4 Start the Flume agent to monitor the log

[vin@vin01 flume-1.5.0-cdh5.3.6-bin]$ bin/flume-ng agent -n a1 -c conf/ --conf-file conf/Storm_project.conf -Dflume.root.logger=INFO,console

At this point, open the Kafka consumer to check that log data is arriving, and also check whether log files are being written to HDFS.
Kafka consumer window:
[Figure 3: Kafka consumer output]
HDFS directory:
[Figure 4: log files under /flume/events on HDFS]

Note: a two-tier Flume setup

Tier-1 agent configuration

# use an avro type file as a source
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# configure the source
a2.sources.r2.channels = c2
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F  /home/vin/project_workspace/nginx.log

# Describe the sink
a2.sinks.k2.type = avro
a2.sinks.k2.channel = c2
a2.sinks.k2.hostname = 192.168.73.6
a2.sinks.k2.port = 44446


# configure the channel
a2.channels.c2.type = memory
a2.channels.c2.capacity = 10000
a2.channels.c2.transactionCapacity = 1000

# Bind the source and sink to the channel

Tier-2 agent configuration

#exec source - memory channel - kafka sink/hdfs sink
a1.sources = r1
a1.sinks =hdfs_sink kafka_sink
a1.channels = c1 c2

a1.sources.r1.type = avro
a1.sources.r1.channels = c1 c2
a1.sources.r1.bind = 192.168.73.6
a1.sources.r1.port = 44446

# kafka_sink
a1.sinks.kafka_sink.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.kafka_sink.topic = nginxlog
a1.sinks.kafka_sink.brokerList = vin01:9092,vin02:9092,vin03:9092
a1.sinks.kafka_sink.requiredAcks = 1
a1.sinks.kafka_sink.batchSize = 20


# hdfs_sink
a1.sinks.hdfs_sink.type = hdfs
a1.sinks.hdfs_sink.hdfs.path =hdfs://192.168.73.6:8020/flume/events/%y-%m-%d
a1.sinks.hdfs_sink.hdfs.filePrefix = nginx_log-
a1.sinks.hdfs_sink.hdfs.fileType = DataStream
a1.sinks.hdfs_sink.hdfs.useLocalTimeStamp = true
a1.sinks.hdfs_sink.hdfs.round = true
a1.sinks.hdfs_sink.hdfs.roundValue = 10
a1.sinks.hdfs_sink.hdfs.roundUnit = minute
a1.sinks.hdfs_sink.hdfs.rollInterval = 0
a1.sinks.hdfs_sink.hdfs.rollSize = 102400
a1.sinks.hdfs_sink.hdfs.rollCount = 0


# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel

a1.sinks.kafka_sink.channel = c2
a1.sinks.hdfs_sink.channel = c1

2.5 Data source

In the log directory that Flume monitors, the generator_log.sh script keeps producing data continuously.

2.6 Summary

The steps above simulate an Nginx server that continuously produces data; Flume collects the data and writes it to HDFS for offline analysis and to Kafka for real-time analysis. The following steps analyze these logs with Storm.

3. Data Analysis

Data analysis flow:
[Figure 5: data analysis flow diagram]

Develop the Storm code in IDEA.

Main class design: NginxLogAnaly.java

  • 1. Architecture design

The core of Storm is the Topology; here we use Trident Storm.
First sketch out the overall structure of the class, including building the topology and setting up the Config; the code is as follows:

public class NginxLogAnaly {
    public static void main(String[] args) {

        // construct the Trident Topology operation pipeline
        TridentTopology topology = new TridentTopology();
        //.......
        //.......
        //.......
        //.......
        Config conf = new Config();

        if (args == null || args.length <= 0){
            // run in the local IDE environment
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("nginxlogAnaly",conf,topology.build());
        }else{
            try {
                conf.setNumWorkers(4);
                conf.setDebug(true);
                // cap the number of tuples in flight in this topology
                conf.setMaxSpoutPending(200);
                StormSubmitter.submitTopology(args[0],conf,topology.build());
            } catch (AlreadyAliveException e) {
                e.printStackTrace();
            } catch (InvalidTopologyException e) {
                e.printStackTrace();
            }
        }

    }
}
  • 2. Building the Kafka spout

Two spout classes can be found in the trident storm-kafka jar:
[Figure 6: spout classes in the storm-kafka jar]
TransactionalTridentKafkaSpout and OpaqueTridentKafkaSpout; OpaqueTridentKafkaSpout is the fault-tolerant one, and the difference between them lies in how transactions are controlled.
Now create the spout:

 OpaqueTridentKafkaSpout tridentKafkaSpout =  new OpaqueTridentKafkaSpout(???);

Having declared the spout, we still need to fill it in: what goes in place of the ???. Looking at the OpaqueTridentKafkaSpout source, it takes a config object:
[Figure 7: OpaqueTridentKafkaSpout constructor]
The config object carries information such as which Kafka cluster and which topic to read; look at its source:
[Figure 8: TridentKafkaConfig constructors]
The first constructor fills in the clientId for us, i.e. a client id is generated internally. The BrokerHosts argument actually points at our ZooKeeper cluster (Storm acts as a consumer here), and topic is the Kafka topic to consume. So define these two objects:

        BrokerHosts hosts = new ZkHosts("vin01:2181");
        String topic = "nginxlog";

Then specify whether to consume from the beginning, and the deserialization scheme (usually parsed as a plain string):

        config.forceFromStart = false;
        config.scheme = new SchemeAsMultiScheme(new StringScheme());

With that, the whole spout is constructed:

        // construct the Trident Kafka spout
        BrokerHosts hosts = new ZkHosts("vin01:2181");
        String topic = "nginxlog";
        TridentKafkaConfig config = new TridentKafkaConfig(hosts,topic);
        config.forceFromStart = false;
        config.scheme = new SchemeAsMultiScheme(new StringScheme());
        //TransactionalTridentKafkaSpout
        OpaqueTridentKafkaSpout tridentKafkaSpout =
                new OpaqueTridentKafkaSpout(config);

        Stream stream  = topology.newStream(SPOUT_ID,tridentKafkaSpout)
                // {"str":"xxxxxx"}
                .each(new Fields("str"),new PrintFilter())

Test the program at this point via the last line, the each() call. The first argument of each() declares which fields the following operation can read; the spout emits tuples of the form str: xxx, so we write a custom PrintFilter to print them. Its code is as follows:

package com.vin.bigdata.storm.trident01;

import backtype.storm.tuple.Fields;
import storm.trident.operation.Filter;
import storm.trident.operation.TridentOperationContext;
import storm.trident.tuple.TridentTuple;

import java.util.List;
import java.util.Map;

/**
 * Debug filter that prints every tuple it sees.
 * Created by ad on 2016/9/25.
 */
public class PrintFilter implements Filter {


    /**
     * Filtering logic applied to each Tuple: print its fields and values, and keep everything.
     * @param tuple
     * @return
     */
    @Override
    public boolean isKeep(TridentTuple tuple) {

        Fields fields = tuple.getFields();

        List values = tuple.getValues();

        System.err.println(fields + "--->" + values);

        return true;
    }

    @Override
    public void prepare(Map conf, TridentOperationContext context) {

    }

    @Override
    public void cleanup() {

    }
} 
  

Start the Kafka cluster, run the topology locally, start Flume, and start the data-generating script. Local test output:
[Figure 9: PrintFilter output from the local run]
The Kafka spout is built and working.
  • 3. Extract the fields from the log line (using each): define a function, LogParserFunction, to parse them (it parses the original key-value pair, produces new key-value pairs, and appends the new pairs after the original one):

Note:
[Figure 10]

 Stream stream  = topology.newStream(SPOUT_ID,tridentKafkaSpout)
                // {"str":"xxxxxx"}
                //.each(new Fields("str"),new PrintFilter())
                // random repartitioning
            //.shuffle()
                //.global()  // global repartitioning
                .batchGlobal() // batch-global repartitioning: tuples in the same batch go to one partition, different batches go to different partitions
            .each(new Fields("str"),new LogParserFunction(),new Fields(
                           // ip,time,url,httpRefer,keyWord,userAgent
                    IP,
                    TIME,
                    URL,
                    HTTPREFER,
                    KEYWORD,
                    USERAGENT))
                // run LogParserFunction with a parallelism of 4
            .parallelismHint(4)
  • 4. Define the LogParserFunction class to extract the needed fields; the code is as follows:

package com.vin.bigdata.storm.webstatictis;

import backtype.storm.tuple.Values;
import storm.trident.operation.Function;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.TridentOperationContext;
import storm.trident.tuple.TridentTuple;

import java.util.Map;

/**
 * Log-parsing Function:
 * parses one log record and extracts ip, time, url, httpRefer, keyword, userAgent.
 * Created by ad on 2016/10/15.
 */
public class LogParserFunction implements Function {

    private int partitionIndex ;
    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {

        String logStr = tuple.getStringByField("str");
        //132.46.30.61 - - [1476285399264] "GET /list.php HTTP/1.1" 200 0 "-" "Mozilla/5.0 (Linux; Android 4.2.1; Galaxy Nexus Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19" "-"
        //215.168.214.201 - - [1476285965677] "GET /edit.php HTTP/1.1" 200 0 "http://www.google.cn/search?q=spark mllib&b=c" "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" "-"

        // split on spaces
        // (a regular-expression match would also work)

        if(logStr != null){
            // split on spaces
            String[] logInfosByBlank = logStr.split(" ");
            String ip = logInfosByBlank[0];
            String timeStr = logInfosByBlank[3];
            String time = timeStr.substring(1,timeStr.length() - 1);

            String url = logInfosByBlank[5];

            // split on double quotes
            String[] logInfosByQuotes = logStr.split("\"");

            String httpReferStr= logInfosByQuotes[3];
            String httpRefer = "";
            String keyWord = "";
            String[] httpReferInfos = httpReferStr.split("/");
            // a real referrer like http://host/path?query splits into at least 4 parts; "-" does not
            if(httpReferInfos.length >= 4){
                httpRefer = httpReferInfos[2];  // www.google.cn

                String params = httpReferInfos[3]; // search?q=spark mllib&b=c

                String parameters = params.split("\\?")[1]; // q=spark mllib&b=c

                keyWord = parameters.split("&")[0].split("=")[1]; // spark mllib

            }

            String userAgent = logInfosByQuotes[5];

           // System.out.println("partition index: " + this.partitionIndex + " --> ip=" + ip + " time=" + time + " url=" + url + " httpRefer=" +
           // httpRefer + " keyWord=" + keyWord + " userAgent=" + userAgent);
            collector.emit(new Values(ip,time,url,httpRefer,keyWord,userAgent));
        }



    }

    @Override
    public void prepare(Map conf, TridentOperationContext context) {
        // get this task's partition index
       this.partitionIndex =  context.getPartitionIndex();

    }

    @Override
    public void cleanup() {

    }
}

The keys of the fields it finally emits are ip, time, url, httpRefer, keyWord and userAgent, so six output fields are declared here to receive them:
[Figure 11: the six output field names]
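For reference, these six field names are just string constants defined on NginxLogAnaly (the full class is in the appendix at the end of this article):

    private static final String IP = "ip";
    private static final String TIME = "time";
    private static final String URL = "url";
    private static final String HTTPREFER = "httpRefer";
    private static final String KEYWORD = "keyWord";
    private static final String USERAGENT = "userAgent";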
Then call each again with the print filter to test:

  .each(new Fields("str"),new LogParserFunction(),new Fields(
                           // ip,time,url,httpRefer,keyWord,userAgent
                    IP,
                    TIME,
                    URL,
                    HTTPREFER,
                    KEYWORD,
                    USERAGENT))
                // run LogParserFunction with a parallelism of 4
            .parallelismHint(4)

        .each(new Fields("str",IP,
                TIME,
                URL,
                HTTPREFER,
                KEYWORD,
                USERAGENT),new PrintFilter())
        ;

Local run output:
[Figure 12: parsed fields appended after the original tuple]
The extracted key-value pairs are appended after the original key-value pair.

  • 5. Counting PV (using the USERAGENT PV as the example)
    What we want here is PV broken down by browser type and version, and this is where the project method (projection) comes in: it drops the key-value pairs in the Tuple that are not needed downstream. So call project with the fields we want to keep as its argument:

// 1. PV by browser type and version, per minute
        TridentState tridentState = stream.project(new Fields(TIME,USERAGENT))
        // parse USERAGENT to get the browser type and version
                .each(new Fields(USERAGENT),new BrowerInfoGetFunction(),
                        new Fields(BROWSERNAME,BROWSERVERSION))

                .each(new Fields(TIME),new TimeParserFunction("yyyy-MM-dd HH:mm"),
                        new Fields(YYYYMMDDHHMM))

Here BrowerInfoGetFunction parses the userAgent and extracts the browser type (BROWSERNAME) and browser version (BROWSERVERSION). Its code is as follows:

package com.vin.bigdata.storm.webstatictis;

import backtype.storm.tuple.Values;
import com.vin.bigdata.storm.util.UserAgentUtil;

import storm.trident.operation.Function;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.TridentOperationContext;
import storm.trident.tuple.TridentTuple;

import java.util.Map;

/**
 * Parses the userAgent to get the browser type and version.
 * Created by ad on 2016/10/15.
 */
public class BrowerInfoGetFunction implements Function {

    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
        String userAgent = tuple.getStringByField("userAgent");

        String browserName = "";
        String browserVersion = "";
        if(userAgent != null && !userAgent.trim().equals("")) {

            UserAgentUtil.UserAgentInfo userAgentInfo =
                    UserAgentUtil.analyticUserAgent(userAgent);

            if(userAgentInfo != null){
                browserName = userAgentInfo.getBrowserName();
                browserVersion = userAgentInfo.getBrowserVersion();
            }

        }

        if(!browserName.equals("") && !browserVersion.equals("")){
            collector.emit(new Values(browserName,browserVersion));
        }
    }

    @Override
    public void prepare(Map conf, TridentOperationContext context) {

    }

    @Override
    public void cleanup() {

    }
}

The parsing relies on a user-agent utility class; its code is as follows:

package com.vin.bigdata.storm.util;

import java.io.IOException;

import cz.mallat.uasparser.OnlineUpdater;
import cz.mallat.uasparser.UASparser;

/**
 * Utility class for parsing browser user-agent strings; internally it just calls the uasparser library.
 *
 * @author ibeifeng
 */
public class UserAgentUtil {
    static UASparser uasParser = null;

    // static block: initialize the uasParser object
    static {
        try {
            uasParser = new UASparser(OnlineUpdater.getVendoredInputStream());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Parses a browser user-agent string and returns a UserAgentInfo object.
     * Returns null if the user agent is empty, and also returns null if parsing fails.
     *
     * @param userAgent the user-agent string to parse
     * @return the parsed info, or null
     */
    public static UserAgentInfo analyticUserAgent(String userAgent) {
        UserAgentInfo result = null;
        if (!(userAgent == null || userAgent.trim().isEmpty())) {
            // userAgent is non-null and not all whitespace
            try {
                cz.mallat.uasparser.UserAgentInfo info = null;
                info = uasParser.parse(userAgent);
                result = new UserAgentInfo();
                result.setBrowserName(info.getUaFamily());
                result.setBrowserVersion(info.getBrowserVersionInfo());
                result.setOsName(info.getOsFamily());
                result.setOsVersion(info.getOsName());
            } catch (IOException e) {
                // on error, return null
                result = null;
            }
        }
        return result;
    }

    /**
     * Model object holding the parsed browser information.
     *
     * @author 肖斌
     */
    public static class UserAgentInfo {
        private String browserName;    // browser name
        private String browserVersion; // browser version
        private String osName;         // operating system name
        private String osVersion;      // operating system version

        public String getBrowserName() {
            return browserName;
        }

        public void setBrowserName(String browserName) {
            this.browserName = browserName;
        }

        public String getBrowserVersion() {
            return browserVersion;
        }

        public void setBrowserVersion(String browserVersion) {
            this.browserVersion = browserVersion;
        }

        public String getOsName() {
            return osName;
        }

        public void setOsName(String osName) {
            this.osName = osName;
        }

        public String getOsVersion() {
            return osVersion;
        }

        public void setOsVersion(String osVersion) {
            this.osVersion = osVersion;
        }

        @Override
        public String toString() {
            return "UserAgentInfo [browserName=" + browserName + ", browserVersion=" + browserVersion
                    + ", osName=" + osName + ", osVersion=" + osVersion + "]";
        }
    }
}

The code of the time-parsing class TimeParserFunction is as follows:

package com.vin.bigdata.storm.webstatictis;

import backtype.storm.tuple.Values;
import storm.trident.operation.Function;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.TridentOperationContext;
import storm.trident.tuple.TridentTuple;

import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Map;

/**
 * Parses a timestamp into a time string truncated to the given time unit.
 * Created by ad on 2016/10/15.
 */
public class TimeParserFunction implements Function {
    private String timeFormat ;
    public TimeParserFunction(String timeFormat) {
        this.timeFormat = timeFormat;
    }

    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
        String timestamp = tuple.getStringByField("time");

        Date date = new Date();
        date.setTime(Long.valueOf(timestamp));

        DateFormat df = new SimpleDateFormat(timeFormat);
        String timeStr = df.format(date);

        collector.emit(new Values(timeStr));
    }

    @Override
    public void prepare(Map conf, TridentOperationContext context) {

    }

    @Override
    public void cleanup() {

    }
}

Now test with the print filter:

 .each(new Fields(TIME,USERAGENT,BROWSERNAME,BROWSERVERSION,YYYYMMDDHHMM),new PrintFilter())

Local test output:
[Figure: tuples with browser name, version and the minute-level time string appended]
After extracting these fields, group and aggregate them, then test again with the print filter; the code is as follows:

  // 1. PV by browser type and version, per minute
        TridentState tridentState = stream.project(new Fields(TIME,USERAGENT))
                // parse USERAGENT to get the browser type and version
                .each(new Fields(USERAGENT),new BrowerInfoGetFunction(),
                        new Fields(BROWSERNAME,BROWSERVERSION))

                .each(new Fields(TIME),new TimeParserFunction("yyyy-MM-dd HH:mm"),
                        new Fields(YYYYMMDDHHMM))

                //.each(new Fields(TIME,USERAGENT,BROWSERNAME,BROWSERVERSION,YYYYMMDDHHMM),new PrintFilter())

        .groupBy(new Fields(YYYYMMDDHHMM,BROWSERNAME,BROWSERVERSION))
        .persistentAggregate(
               new MemoryMapState.Factory(),
                new Count(),
                new Fields("pvOfBrowserByMinutes")
        )
                ;

        tridentState.newValuesStream().each(
                new Fields(YYYYMMDDHHMM,BROWSERNAME,BROWSERVERSION,"pvOfBrowserByMinutes"),
                new PrintFilter());

Local test output:
[Figure 13: per-minute PV for each browser name and version]
What gets printed is the PV of each browser version per minute; here the results are stored in memory (MemoryMapState).

  • 6. Storing the results in HBase
    First map the generated results into HBase through the HBase state API:
 HBaseMapState.Options options =new HBaseMapState.Options();
        options.columnFamily = "cf";
        options.tableName = "nginxLogAnaly";
        options.qualifier = "pvOfBrowserByMinutes"; // column qualifier
        // HBase concatenates the groupBy fields to form the rowkey

        StateFactory hbaseStateFactory = HBaseMapState.opaque(options);

The HBase configuration file (hbase-site.xml) needs to be copied into the project's resources in IDEA.
After creating the table in HBase (e.g. create 'nginxLogAnaly', 'cf' in the HBase shell), replace the in-memory state in the code with the HBase state:
[Figure 14: persistentAggregate switched from MemoryMapState to the HBase state factory]
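The change amounts to passing hbaseStateFactory instead of new MemoryMapState.Factory() to persistentAggregate (the full class in the appendix shows it in context):

        .groupBy(new Fields(BROWSERNAME,BROWSERVERSION,YYYYMMDDHHMM))
        .persistentAggregate(
                // new MemoryMapState.Factory(),   // previous in-memory state
                hbaseStateFactory,                 // HBase-backed state
                new Count(),
                new Fields("pvOfBrowserByMinutes")
        );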
Local run output:
[Figure 15: rows in the nginxLogAnaly table, with rowkeys formed by concatenating the groupBy fields]
You can see that the rowkey is simply the groupBy fields concatenated together, which is not query-friendly. This can be optimized: before the groupBy, concatenate the fields into the format we want by calling each once more with a custom concatenation function that emits the assembled string, as sketched below.
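A minimal sketch of that idea, assuming a hypothetical RowKeyBuildFunction (it is not part of the original code); it emits a single rowKey field such as 2016-10-16 10:05_Firefox_4.0.1, which then becomes the only groupBy field:

package com.vin.bigdata.storm.webstatictis;

import backtype.storm.tuple.Values;
import storm.trident.operation.Function;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.TridentOperationContext;
import storm.trident.tuple.TridentTuple;

import java.util.Map;

/**
 * Hypothetical helper: builds a readable rowkey of the form time_browserName_browserVersion.
 */
public class RowKeyBuildFunction implements Function {

    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
        // concatenate the minute-level time, browser name and browser version into one rowkey string
        String rowKey = tuple.getStringByField("yyyyMMddHHmm") + "_"
                + tuple.getStringByField("browserName") + "_"
                + tuple.getStringByField("browserVersion");
        collector.emit(new Values(rowKey));
    }

    @Override
    public void prepare(Map conf, TridentOperationContext context) {
    }

    @Override
    public void cleanup() {
    }
}

It would be wired in just before the groupBy, for example:

                .each(new Fields(YYYYMMDDHHMM,BROWSERNAME,BROWSERVERSION),
                        new RowKeyBuildFunction(), new Fields("rowKey"))
                .groupBy(new Fields("rowKey"))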

  • Appendix: full NginxLogAnaly.java code

package com.vin.bigdata.storm.webstatictis;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.StormSubmitter;
import backtype.storm.generated.AlreadyAliveException;
import backtype.storm.generated.InvalidTopologyException;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.tuple.Fields;
import com.vin.bigdata.storm.trident01.PrintFilter;
import org.apache.storm.hbase.trident.state.HBaseMapState;
import storm.kafka.BrokerHosts;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;
import storm.kafka.trident.OpaqueTridentKafkaSpout;
import storm.kafka.trident.TridentKafkaConfig;
import storm.trident.Stream;
import storm.trident.TridentState;
import storm.trident.TridentTopology;
import storm.trident.operation.builtin.Count;
import storm.trident.state.StateFactory;
import storm.trident.testing.MemoryMapState;


/**
 * Website traffic statistics implemented with Trident.
 *
 * Data source: Kafka
 *
 * 1. PV by IP
 * 2. PV by URL
 * 3. PV by search engine
 * 4. PV by search keyword
 * 5. PV by browser type and version
 * 6. PV by device type
 * Created by ad on 2016/10/15.
 */
public class NginxLogAnaly {

    private static final String SPOUT_ID = "tridentKafkaSpout";

    private static final String IP = "ip";
    private static final String TIME = "time";
    private  static  final  String URL = "url";
    private static final String HTTPREFER = "httpRefer";
    private static final String  KEYWORD = "keyWord";
    private static final String USERAGENT = "userAgent";

    private static final String BROWSERNAME = "browserName";
    private static final String BROWSERVERSION = "browserVersion";

    private static final String YYYYMMDDHHMM = "yyyyMMddHHmm";

    public static void main(String[] args) {

        // construct the Trident Topology operation pipeline
        TridentTopology topology = new TridentTopology();

        // build the stream

        // construct the Trident Kafka spout
        BrokerHosts hosts = new ZkHosts("vin01:2181");
        String topic = "nginxlog";
        TridentKafkaConfig config = new TridentKafkaConfig(hosts,topic);
        config.forceFromStart = false;
        config.scheme = new SchemeAsMultiScheme(new StringScheme());
        //TransactionalTridentKafkaSpout
        OpaqueTridentKafkaSpout tridentKafkaSpout =
                new OpaqueTridentKafkaSpout(config);

        Stream stream  = topology.newStream(SPOUT_ID,tridentKafkaSpout)
                // {"str":"xxxxxx"}
                //.each(new Fields("str"),new PrintFilter())
                // random repartitioning
            //.shuffle()
                //.global()  // global repartitioning
                .batchGlobal() // batch-global repartitioning: tuples in the same batch go to one partition, different batches go to different partitions
            .each(new Fields("str"),new LogParserFunction(),new Fields(
                           // ip,time,url,httpRefer,keyWord,userAgent
                    IP,
                    TIME,
                    URL,
                    HTTPREFER,
                    KEYWORD,
                    USERAGENT))
                // run LogParserFunction with a parallelism of 4
            .parallelismHint(4)

        /*.each(new Fields("str",IP,
                TIME,
                URL,
                HTTPREFER,
                KEYWORD,
                USERAGENT),new PrintFilter())*/
        ;
        HBaseMapState.Options options =new HBaseMapState.Options();
        options.columnFamily = "cf";
        options.tableName = "nginxLogAnaly";
        options.qualifier = "pvOfBrowserByMinutes"; // column qualifier
        // HBase concatenates the groupBy fields to form the rowkey

        StateFactory hbaseStateFactory = HBaseMapState.opaque(options);

        // 1. PV by browser type and version, per minute
        TridentState tridentState = stream.project(new Fields(TIME,USERAGENT))
                // parse USERAGENT to get the browser type and version
                .each(new Fields(USERAGENT),new BrowerInfoGetFunction(),
                        new Fields(BROWSERNAME,BROWSERVERSION))

                .each(new Fields(TIME),new TimeParserFunction("yyyy-MM-dd HH:mm"),
                        new Fields(YYYYMMDDHHMM))

                //.each(new Fields(TIME,USERAGENT,BROWSERNAME,BROWSERVERSION,YYYYMMDDHHMM),new PrintFilter())

        .groupBy(new Fields(BROWSERNAME,BROWSERVERSION,YYYYMMDDHHMM))
        .persistentAggregate(
              // new MemoryMapState.Factory(),
                hbaseStateFactory,
                new Count(),
                new Fields("pvOfBrowserByMinutes")
        )
                ;

        tridentState.newValuesStream().each(
                new Fields(YYYYMMDDHHMM,BROWSERNAME,BROWSERVERSION,"pvOfBrowserByMinutes"),
                new PrintFilter());


        // 2. PV by URL (not implemented here; see the sketch after this class)
        //stream.each()




        Config conf = new Config();

        if (args == null || args.length <= 0){
            // run in the local IDE environment
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("nginxlogAnaly",conf,topology.build());
        }else{
            try {
                conf.setNumWorkers(4);
                conf.setDebug(true);
                // cap the number of tuples in flight in this topology
                conf.setMaxSpoutPending(200);
                StormSubmitter.submitTopology(args[0],conf,topology.build());
            } catch (AlreadyAliveException e) {
                e.printStackTrace();
            } catch (InvalidTopologyException e) {
                e.printStackTrace();
            }
        }

    }
}
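The remaining metrics listed in the class comment (PV by URL, search engine, keyword, device type) follow the same pattern as the browser-PV branch. As a rough sketch (not part of the original code; the output field name pvOfUrlByMinutes is illustrative), the per-URL PV per minute could be computed from the same stream inside main, reusing TimeParserFunction and an in-memory state:

        // 2. PV per URL, per minute (sketch)
        TridentState urlPvState = stream.project(new Fields(TIME, URL))
                .each(new Fields(TIME), new TimeParserFunction("yyyy-MM-dd HH:mm"),
                        new Fields(YYYYMMDDHHMM))
                .groupBy(new Fields(YYYYMMDDHHMM, URL))
                .persistentAggregate(
                        new MemoryMapState.Factory(),
                        new Count(),
                        new Fields("pvOfUrlByMinutes"));

        urlPvState.newValuesStream().each(
                new Fields(YYYYMMDDHHMM, URL, "pvOfUrlByMinutes"),
                new PrintFilter());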

If you like my articles, follow the WeChat official account DTSpider.

[Figure 16]
