Flume

1. Flume Overview

1.1 What Flume Is

Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting massive volumes of log data. It is built on a streaming architecture and is flexible and simple.

[Figure 1]

1.2 Flume Basic Architecture

The Flume component architecture is shown in the figure below.

[Figure 2]

1.2.1 Agent

An Agent is a JVM process that moves data from a source to a destination in the form of events.

An Agent is made up of three main components: Source, Channel, and Sink.

1.2.2 Source

The Source is the component that receives data into the Flume Agent. It can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, taildir, sequence generator, syslog, http, and legacy.

1.2.3 Sink

The Sink continuously polls the Channel for events, removes them in batches, and writes those batches to a storage or indexing system, or forwards them to another Flume Agent.

Sink destinations include hdfs, logger, avro, thrift, ipc, file, HBase, solr, and custom sinks.

1.2.4 Channel

The Channel is a buffer between Source and Sink, which allows the two to run at different rates. Channels are thread-safe and can handle writes from several Sources and reads from several Sinks concurrently.

Flume ships with two Channels: Memory Channel and File Channel.

Memory Channel is an in-memory queue, suitable when data loss is acceptable. When data loss matters, Memory Channel should not be used, because a process crash, machine failure, or restart loses whatever it holds.

File Channel writes all events to disk, so no data is lost if the process exits or the machine goes down.

1.2.5 Event

The Event is Flume's basic unit of data transfer; data travels from source to destination as Events. An Event consists of a Header and a Body: the Header holds the event's attributes as key-value pairs, and the Body holds the payload itself as a byte array.

[Figure 3]

2. Flume Installation and Deployment

Extract the archive and rename the directory:

[root@kb129 install]# tar -xvf ./apache-flume-1.9.0-bin.tar.gz -C ../soft/

Copy the template configuration file and edit it:

[root@kb129 conf]# cp flume-env.sh.template flume-env.sh

[root@kb129 conf]# vim ./flume-env.sh

22 export JAVA_HOME=/opt/soft/jdk180

25 export JAVA_OPTS="-Xms2000m -Xmx2000m -Dcom.sun.management.jmxremote"

Delete guava-11.0.2.jar from the lib directory so Flume is compatible with Hadoop 3.1.3.

Find the guava jar that ships with Flume and remove it:

[root@kb129 lib]# find ./ -name guava*

./guava-11.0.2.jar

[root@kb129 lib]# rm -rf ./guava-11.0.2.jar

Copy the newer guava jar from the Hadoop installation into Flume:

[root@kb129 lib]# pwd

/opt/soft/hadoop313/share/hadoop/hdfs/lib

[root@kb129 lib]# cp ./guava-27.0-jre.jar /opt/soft/flume190/lib/
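The jar swap above can be scripted. The sketch below reproduces the steps against throwaway temp directories (FLUME_LIB and HADOOP_LIB stand in for /opt/soft/flume190/lib and the Hadoop hdfs/lib directory), so it is safe to run anywhere:

```shell
# Stand-ins for the real installation paths, so the sketch is side-effect free.
FLUME_LIB=$(mktemp -d)
HADOOP_LIB=$(mktemp -d)

# Simulate the two installations.
touch "$FLUME_LIB/guava-11.0.2.jar"
touch "$HADOOP_LIB/guava-27.0-jre.jar"

# Remove whatever guava version Flume shipped with...
find "$FLUME_LIB" -name 'guava-*.jar' -delete

# ...and copy in the version Hadoop 3.1.3 uses.
cp "$HADOOP_LIB"/guava-27.0-jre.jar "$FLUME_LIB"/
```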

Install tools:

[root@kb129 conf]# yum install -y net-tools

[root@kb129 conf]# yum install -y nc              # install netcat

[root@kb129 conf]# yum install -y telnet-server   # install the telnet server

[root@kb129 conf]# yum install -y telnet.*        # install the telnet client

Test the tools

Start a listener on a port:

[root@kb129 conf]# nc -lk 7777

Connect to the server:

[root@kb129 conf]# telnet localhost 7777

[Figure 4]

Check whether the port is in use:

[root@kb129 conf]# netstat -lnp | grep 7777

tcp        0      0 0.0.0.0:7777            0.0.0.0:*               LISTEN      9264/nc            

tcp6       0      0 :::7777                 :::*                  LISTEN      9264/nc 

2.2 Flume Getting-Started Examples

2.2.1 Monitoring Port Data (Official Example)

1) Requirement:

Use Flume to listen on a port, collect the data arriving on it, and print it to the console.

2) Analysis: this example uses port 7777.

[Figure 5]

3) Steps:

Create a myconf2 directory and write the monitoring configuration file:

[root@kb129 lib]# cd ../conf/myconf2/

[root@kb129 myconf2]# vim ./netcat-logger.conf

a1.sources=r1

a1.channels=c1

a1.sinks=k1


# Describe/configure the source

a1.sources.r1.type=netcat

a1.sources.r1.bind=localhost

a1.sources.r1.port=7777


# Use a channel which buffers events in memory

a1.channels.c1.type=memory


# Describe the sink

a1.sinks.k1.type=logger


# Bind the source and sink to the channel

a1.sources.r1.channels=c1

a1.sinks.k1.channel=c1

[Figure 6]

Start the agent (name a1, conf directory, config file; print INFO logs to the console):

[root@kb129 flume190]# ./bin/flume-ng agent --name a1 --conf ./conf/ --conf-file ./conf/myconf2/netcat-logger.conf -Dflume.root.logger=INFO,console

Parameter notes:

        --conf/-c: the directory holding the Flume configuration files

        --name/-n: names the agent a1

        --conf-file/-f: the configuration file read for this run, here conf/myconf2/netcat-logger.conf

        -Dflume.root.logger=INFO,console: -D overrides the flume.root.logger property at runtime and sets the console log level to INFO. Available levels are debug, info, warn, and error.

[root@kb129 conf]# telnet localhost 7777

Type some input; the Flume console shows the data as it is received.

[Figure 7]

2.2.2 Tailing a Single Appended File in Real Time

1) Requirement: use Flume to monitor a single file.

2) Analysis:

3) Steps:

[root@kb129 myconf2]# vim ./filelogger.conf

a2.sources=r1
a2.channels=c1
a2.sinks=k1

# Describe/configure the source
a2.sources.r1.type=exec
a2.sources.r1.command=tail -f /opt/tmp/flumelog.log

# Use a channel which buffers events in memory
a2.channels.c1.type=memory
a2.channels.c1.capacity=1000
a2.channels.c1.transactionCapacity=100

# Describe the sink
a2.sinks.k1.type=logger

# Bind the source and sink to the channel
a2.sources.r1.channels=c1
a2.sinks.k1.channel=c1

Start the Flume agent:

[root@kb129 flume190]# ./bin/flume-ng agent -n a2 -c ./conf/ -f ./conf/myconf2/filelogger.conf -Dflume.root.logger=INFO,console
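To see the exec source in action, something has to keep appending to the watched file. A minimal generator sketch (LOG is a temp stand-in here; in the real setup you would append to /opt/tmp/flumelog.log):

```shell
# Append ten timestamped lines, simulating an application writing its log;
# each appended line should show up on the Flume console via tail -f.
LOG=$(mktemp)
for i in $(seq 1 10); do
  echo "$(date '+%F %T') test line $i" >> "$LOG"
done
```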

[Figure 8]

2.2.3 Tailing a Single Appended File to HDFS

1) Requirement: use Flume to monitor a single appended file and upload its contents to HDFS.

2) Analysis

[Figure 9]

3) Steps:

[root@kb129 myconf2]# vim ./file-flume-hdfs.conf

a3.sources=r1

a3.sinks=k1

a3.channels=c1


# Describe/configure the source

a3.sources.r1.type=exec

a3.sources.r1.command=tail -f /opt/tmp/flumelog.log


# Use a channel which buffers events in memory

a3.channels.c1.type=memory

a3.channels.c1.capacity=1000

a3.channels.c1.transactionCapacity=100


# Describe the sink

a3.sinks.k1.type=hdfs

a3.sinks.k1.hdfs.fileType=DataStream

a3.sinks.k1.hdfs.filePrefix=flumetohdfs

a3.sinks.k1.hdfs.fileSuffix=.txt

a3.sinks.k1.hdfs.path=hdfs://kb129:9000/kb23flume/


# Bind the source and sink to the channel

a3.sources.r1.channels=c1

a3.sinks.k1.channel=c1
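Left on its defaults, the HDFS sink rolls a new file roughly every 30 seconds / 1024 bytes / 10 events, which scatters many tiny files across HDFS. The roll triggers are commonly tuned; the values below are illustrative, not part of the original setup:

```properties
# Roll a new file every 60 s or at ~128 MB, whichever comes first;
# 0 disables the event-count trigger.
a3.sinks.k1.hdfs.rollInterval=60
a3.sinks.k1.hdfs.rollSize=134217728
a3.sinks.k1.hdfs.rollCount=0
```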

[Figure 10]

Start monitoring:

[root@kb129 flume190]# ./bin/flume-ng agent -n a3 -c ./conf/ -f ./conf/myconf2/file-flume-hdfs.conf -Dflume.root.logger=INFO,console

[Figure 11]

2.2.4 Tailing a Single Appended File with Multiple Outputs

1) Requirement: use Flume to monitor a single appended file and send its contents both to HDFS and to a local logger.

2) Steps:

[root@kb129 myconf2]# vim ./file-flume-hdfslogger.conf

a4.sources=r1

a4.channels=c1 c2

a4.sinks=k1 k2


# Describe/configure the source

a4.sources.r1.type=exec

a4.sources.r1.command=tail -f /opt/tmp/flumelog.log


# Use a channel which buffers events in memory

a4.channels.c1.type=memory

a4.channels.c2.type=memory

a4.channels.c1.capacity=1000

a4.channels.c1.transactionCapacity=100


# Describe the sink

a4.sinks.k1.type=logger

a4.sinks.k2.type=hdfs

a4.sinks.k2.hdfs.fileType=DataStream

a4.sinks.k2.hdfs.filePrefix=flumetohdfs

a4.sinks.k2.hdfs.fileSuffix=.txt

a4.sinks.k2.hdfs.path=hdfs://kb129:9000/kb23flume1/


# Bind the source and sink to the channel

a4.sources.r1.channels=c1 c2

a4.sinks.k1.channel=c1

a4.sinks.k2.channel=c2
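With one source feeding two channels, Flume uses the replicating channel selector by default, so every event is copied into both c1 and c2 (note that c2 is left on its default capacities here, unlike c1). Writing the default out makes the intent explicit:

```properties
# Default behaviour, stated explicitly: copy each event to every listed channel.
a4.sources.r1.selector.type=replicating
```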

Start the Flume agent:

[root@kb129 flume190]# ./bin/flume-ng agent -n a4 -c ./conf/ -f ./conf/myconf2/file-flume-hdfslogger.conf -Dflume.root.logger=INFO,console

Append to the monitored file; the captured log lines appear both on the console and in HDFS.

2.2.5 Monitoring Port Data in Real Time, Output to HDFS

1) Requirement: use Flume to listen on a port, collect the data arriving on it, and write it to HDFS.

2) Steps:

[root@kb129 myconf2]# vim ./demo.conf

a5.sources=r1

a5.sinks=k1

a5.channels=c1


# Describe/configure the source

a5.sources.r1.type=netcat

a5.sources.r1.bind=localhost

a5.sources.r1.port=7777


# Use a channel which buffers events in memory

a5.channels.c1.type=memory

a5.channels.c1.capacity=1000

a5.channels.c1.transactionCapacity=100


# Describe the sink

a5.sinks.k1.type=hdfs

a5.sinks.k1.hdfs.fileType=DataStream

a5.sinks.k1.hdfs.filePrefix=flumetohdfs

a5.sinks.k1.hdfs.fileSuffix=.txt

a5.sinks.k1.hdfs.path=hdfs://kb129:9000/kb23flume2/


# Bind the source and sink to the channel

a5.sources.r1.channels=c1

a5.sinks.k1.channel=c1

Start listening:

[root@kb129 flume190]# ./bin/flume-ng agent -n a5 -c ./conf/ -f ./conf/myconf2/demo.conf -Dflume.root.logger=INFO,console

Open a client connection to the port, send some data, and check the result in HDFS.

2.2.6 Monitoring Specified Files in a Directory (spooldir, regex, resumable reads, sink to Kafka)

An Exec source suits monitoring a file that is appended to in real time, but it cannot resume from a breakpoint. A Spooldir source suits ingesting new, completed files, but not files that are still being appended to. A Taildir source can watch multiple files being appended to in real time and supports resuming from a breakpoint. In short: an exec source reads the output of an external command, a spooldir source handles files that already exist, and a taildir source handles log files that are still being written.
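For reference, a Taildir source that could replace an exec source might be sketched like this (the agent name a6 and the paths are hypothetical):

```properties
a6.sources.r1.type=TAILDIR
# JSON file where Flume records each watched file's read offset;
# this is what makes resume-after-restart possible.
a6.sources.r1.positionFile=/opt/kb23/taildir_position.json
a6.sources.r1.filegroups=f1
a6.sources.r1.filegroups.f1=/opt/tmp/flumelog.log
```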

(1) Reading files, with a memory channel

[root@kb129 myconf2]# vim events-flume-logger.conf

events.sources=eventsSource

events.channels=eventsChannel

events.sinks=eventsSink


# Describe/configure the source

events.sources.eventsSource.type=spooldir

events.sources.eventsSource.spoolDir=/opt/kb23/flumelogfile/events

# deserializer type: LINE

events.sources.eventsSource.deserializer=LINE

# maximum line length for the deserializer

events.sources.eventsSource.deserializer.maxLineLength=32000

events.sources.eventsSource.includePattern=events.csv


# Use a channel which buffers events in memory

events.channels.eventsChannel.type=memory


# Describe the sink

events.sinks.eventsSink.type=logger


# Bind the source and sink to the channel

events.sources.eventsSource.channels=eventsChannel

events.sinks.eventsSink.channel=eventsChannel

(2) Reading files with a regex filter, using a file channel

1. File channel: writes event data to files on disk, which suits scenarios that need durable storage. Because the data is on disk, a file channel can keep reading unprocessed events after a restart; the trade-off is lower throughput, since both writes and reads go through disk I/O.

2. Memory channel: keeps event data in memory, which suits performance-sensitive scenarios, since memory access is faster than disk access. Nothing is persisted, so unprocessed events are lost on restart; use it only when some data loss is acceptable.

[root@kb129 myconf2]# vim events-flume-logger.conf

events.sources=eventsSource

events.channels=eventsChannel

events.sinks=eventsSink


# Describe/configure the source

events.sources.eventsSource.type=spooldir

events.sources.eventsSource.spoolDir=/opt/kb23/flumelogfile/events

# deserializer type: LINE

events.sources.eventsSource.deserializer=LINE

# maximum line length for the deserializer

events.sources.eventsSource.deserializer.maxLineLength=32000

events.sources.eventsSource.includePattern=events_[0-9]{4}-[0-9]{2}-[0-9]{2}.csv

# regex filter interceptor

events.sources.eventsSource.interceptors=head_filter

events.sources.eventsSource.interceptors.head_filter.type=regex_filter

events.sources.eventsSource.interceptors.head_filter.regex=^event_id*

events.sources.eventsSource.interceptors.head_filter.excludeEvents=true


# Use a channel which buffers events in memory

events.channels.eventsChannel.type=file

events.channels.eventsChannel.checkpointDir=/opt/kb23/checkpoint/events

events.channels.eventsChannel.dataDirs=/opt/kb23/data/events


# Describe the sink

events.sinks.eventsSink.type=logger


# Bind the source and sink to the channel

events.sources.eventsSource.channels=eventsChannel

events.sinks.eventsSink.channel=eventsChannel
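Both regexes above are easy to sanity-check outside Flume; grep approximates the Java regex engine closely enough here. Note that ^event_id* literally means "event_i followed by zero or more d's", which still catches a CSV header row beginning with event_id:

```shell
# Does a filename pass includePattern? (anchors added here for strictness)
matches() { echo "$1" | grep -Eq '^events_[0-9]{4}-[0-9]{2}-[0-9]{2}\.csv$'; }

# Would the regex_filter (excludeEvents=true) drop this line as a header row?
is_header() { echo "$1" | grep -q '^event_id*'; }
```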

(3) Pointing the sink at a Kafka topic

[root@kb129 myconf2]# vim ./events-flume-kafka.conf

events.sources=eventsSource
events.channels=eventsChannel
events.sinks=eventsSink

# Describe/configure the source
events.sources.eventsSource.type=spooldir
events.sources.eventsSource.spoolDir=/opt/kb23/flumelogfile/events
# deserializer type: LINE
events.sources.eventsSource.deserializer=LINE
# maximum line length for the deserializer
events.sources.eventsSource.deserializer.maxLineLength=32000
events.sources.eventsSource.includePattern=events_[0-9]{4}-[0-9]{2}-[0-9]{2}.csv
# regex filter interceptor
events.sources.eventsSource.interceptors=head_filter
events.sources.eventsSource.interceptors.head_filter.type=regex_filter
events.sources.eventsSource.interceptors.head_filter.regex=^event_id*
events.sources.eventsSource.interceptors.head_filter.excludeEvents=true

# Use a channel which buffers events in memory
events.channels.eventsChannel.type=file
events.channels.eventsChannel.checkpointDir=/opt/kb23/checkpoint/events
events.channels.eventsChannel.dataDirs=/opt/kb23/data/events

# Describe the sink
events.sinks.eventsSink.type=org.apache.flume.sink.kafka.KafkaSink
events.sinks.eventsSink.topic=events
events.sinks.eventsSink.brokerList=192.168.142.129:9092
events.sinks.eventsSink.batchSize=640

# Bind the source and sink to the channel
events.sources.eventsSource.channels=eventsChannel
events.sinks.eventsSink.channel=eventsChannel
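The topic and brokerList properties still work in Flume 1.9 but are legacy names; the current KafkaSink property names (same broker and topic, shown for comparison) would be:

```properties
events.sinks.eventsSink.kafka.topic=events
events.sinks.eventsSink.kafka.bootstrap.servers=192.168.142.129:9092
events.sinks.eventsSink.kafka.flumeBatchSize=640
```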

2.2.7 Custom Interceptor

1) Requirement: use Flume to collect local server logs and, based on log type, route different kinds of logs to different analysis systems.

2) Analysis

In real development, a single server may produce many types of logs, and different types may need to go to different analysis systems. This calls for the Multiplexing structure among Flume's topologies: the multiplexing selector routes each event to a Channel based on the value of a key in the event's Header, so we need a custom Interceptor that writes a different Header value for each type of event.

In this example, port data simulates the logs: lines starting with "hello", lines starting with "hi", and everything else stand for different log types. We need a custom interceptor that tells them apart and routes each type to its own Channel (analysis system).

[Figure 12]

[Figure 13]

Key components:

1) ChannelSelector

The ChannelSelector decides which Channel an Event will be sent to. There are two types: Replicating and Multiplexing.

A ReplicatingSelector sends a copy of each Event to every Channel, while a Multiplexing selector routes different Events to different Channels according to the configured rules.

2) SinkProcessor

There are three SinkProcessor types: DefaultSinkProcessor, LoadBalancingSinkProcessor, and FailoverSinkProcessor.

DefaultSinkProcessor serves a single Sink; LoadBalancingSinkProcessor and FailoverSinkProcessor serve a Sink Group and provide load balancing and failover, respectively.
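As a sketch, a failover sink group would be wired like this (the agent and sink names a1/k1/k2 are hypothetical):

```properties
a1.sinkgroups=g1
a1.sinkgroups.g1.sinks=k1 k2
a1.sinkgroups.g1.processor.type=failover
# Higher priority is preferred; k2 only takes over when k1 fails.
a1.sinkgroups.g1.processor.priority.k1=10
a1.sinkgroups.g1.processor.priority.k2=5
a1.sinkgroups.g1.processor.maxpenalty=10000
```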

3) Steps

(1) Create a Maven project and add the following dependency.


<dependency>
    <groupId>org.apache.flume</groupId>
    <artifactId>flume-ng-core</artifactId>
    <version>1.9.0</version>
</dependency>

(2) Define a class that implements the Interceptor interface. Once written, package it and copy the jar into Flume's lib directory.

package nj.zb.kb23;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class InterceptorDemo implements Interceptor {
    // Reused buffer for the batch intercept() overload
    private ArrayList<Event> opList;

    @Override
    public void initialize() {
        opList = new ArrayList<>();
    }

    @Override
    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        String body = new String(event.getBody());
        // Tag each event so the multiplexing channel selector can route it
        if (body.startsWith("hello")) {
            headers.put("type", "hello");
        } else if (body.startsWith("hi")) {
            headers.put("type", "hi");
        } else {
            headers.put("type", "other");
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        opList.clear();
        for (Event event : events) {
            opList.add(intercept(event));
        }
        return opList;
    }

    @Override
    public void close() {
        opList.clear();
        opList = null;
    }

    // Flume instantiates interceptors through their Builder
    public static class Builder implements Interceptor.Builder {

        @Override
        public Interceptor build() {
            return new InterceptorDemo();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
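The routing decision the interceptor makes (prefix of the body determines the type header, which the multiplexing selector maps to a channel) can be restated and checked in shell:

```shell
# Shell restatement of InterceptorDemo's routing rule, for a quick check;
# cases are tried in order, mirroring the if / else-if chain in the Java code.
route() {
  case "$1" in
    hello*) echo helloChannel ;;
    hi*)    echo hiChannel ;;
    *)      echo otherChannel ;;
  esac
}
```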

(3) Edit the Flume configuration file

[root@kb129 myconf2]# vim ./netcat-myinterceptor.conf

myinterceptor.sources=s1

myinterceptor.channels=helloChannel hiChannel otherChannel

myinterceptor.sinks=helloSink hiSink otherSink


# Describe/configure the source

myinterceptor.sources.s1.type=netcat

myinterceptor.sources.s1.bind=localhost

myinterceptor.sources.s1.port=7777

myinterceptor.sources.s1.interceptors=myinterceptors

myinterceptor.sources.s1.interceptors.myinterceptors.type=nj.zb.kb23.InterceptorDemo$Builder

myinterceptor.sources.s1.selector.type=multiplexing

myinterceptor.sources.s1.selector.mapping.hello=helloChannel

myinterceptor.sources.s1.selector.mapping.hi=hiChannel

myinterceptor.sources.s1.selector.mapping.other=otherChannel

myinterceptor.sources.s1.selector.header=type


# Use a channel which buffers events in memory

myinterceptor.channels.helloChannel.type=memory

myinterceptor.channels.hiChannel.type=memory

myinterceptor.channels.otherChannel.type=memory


# Describe the sink

myinterceptor.sinks.helloSink.type=hdfs

myinterceptor.sinks.helloSink.hdfs.fileType=DataStream

myinterceptor.sinks.helloSink.hdfs.filePrefix=hellocontent

myinterceptor.sinks.helloSink.hdfs.fileSuffix=.txt

myinterceptor.sinks.helloSink.hdfs.path=hdfs://kb129:9000/kb23hello/


#myinterceptor.sinks.hiSink.type=org.apache.flume.sink.kafka.KafkaSink

#myinterceptor.sinks.hiSink.kafka.topic=hitopic

#myinterceptor.sinks.hiSink.kafka.bootstrap.servers=192.168.142.129:9092

#myinterceptor.sinks.hiSink.kafka.producer.acks=1

#myinterceptor.sinks.hiSink.kafka.flumeBatchSize=640

myinterceptor.sinks.hiSink.type=org.apache.flume.sink.kafka.KafkaSink

myinterceptor.sinks.hiSink.topic=hitopic

myinterceptor.sinks.hiSink.brokerList=192.168.142.129:9092


myinterceptor.sinks.otherSink.type=logger


# Bind the source and sink to the channel

myinterceptor.sources.s1.channels=helloChannel hiChannel otherChannel

myinterceptor.sinks.helloSink.channel=helloChannel

myinterceptor.sinks.hiSink.channel=hiChannel

myinterceptor.sinks.otherSink.channel=otherChannel

3. Flume Transactions

[Figure 14]
