冰芷若涵

flume

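Flume overview
1. What Flume is
  1. What Flume is

Flume is a distributed, reliable, and highly available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources into a centralized data store.

Flume is currently a top-level Apache project.

1. 1. System requirements

Flume needs a Java runtime: Java 1.6 or later is required, and Java 1.7 is recommended.

1. Downloading and installing Flume
  1. Downloading Flume:

The Flume distribution can be downloaded from the Apache website.

Note that Flume comes in two lines, 0.9.x and 1.x, which are not compatible with each other. These notes cover the newer 1.x line, also known as flume-ng.

1. 1. Installing Flume:

Unpack the downloaded archive into a directory of your choice.

tar -zxvf apache-flume-1.6.0-bin.tar.gz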

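Concepts, model, and features of Flume
1. Key concepts in Flume
  1. Flume Event:

A Flume event is defined as a unit of data flow that has a byte payload and an optional set of string attributes. (It has two parts, headers and body; the HTTP Source's JSON handler, for example, writes an event as a JSON object with "headers" and "body" fields.)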
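As a rough mental model only, an event can be pictured as a small record with string headers and a byte body. The sketch below is plain Python, not Flume's API; the class name Event and the header "host" are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Illustrative stand-in for a Flume event (not the real Flume class)."""
    body: bytes                                             # the payload, always raw bytes
    headers: Dict[str, str] = field(default_factory=dict)   # optional string attributes


# The body is bytes, so text has to be encoded explicitly.
event = Event(body="hello flume".encode("utf-8"), headers={"host": "web01"})
print(event.headers["host"], len(event.body))
```

Keeping the body as bytes is the point of the sketch: headers are what selectors and interceptors look at, while the body is carried along opaquely.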

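1. 1. Flume Agent:

A Flume agent is a process that hosts the components through which events flow from an external source to the next destination. It contains Sources, Channels, and Sinks.

1. 1. Source

The data source. It consumes events delivered to it by an external source, which sends them in a format that the particular Flume Source recognizes.

1. 1. Channel

The data channel. It is a passive store that holds events until a Flume Sink consumes them.

1. 1. Sink

The data sink. It stands for the place where data leaves the agent: it sends Flume events on to the configured external destination.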

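1. Flume flow model
  1. Flume flow model

1. Features of Flume
  1. Complex flows

Flume lets users build multi-hop flows to the final destination, and also supports fan-out flows (one to many), fan-in flows (many to one), failover, and failure handling.

1. 1. Reliability

Events are delivered transactionally, which keeps the data reliable. (Flume processes events in batches; if any event in a batch fails, Flume processes the whole batch again.)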
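The all-or-nothing batch behaviour can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Flume's transaction API: ToyChannel, deliver_batch, and the flaky callback are hypothetical names.

```python
from collections import deque


class ToyChannel:
    """Hypothetical in-memory channel with all-or-nothing batch delivery."""

    def __init__(self, events):
        self.queue = deque(events)

    def deliver_batch(self, batch_size, deliver):
        # Take up to batch_size events inside one "transaction".
        count = min(batch_size, len(self.queue))
        batch = [self.queue.popleft() for _ in range(count)]
        try:
            for event in batch:
                deliver(event)
        except Exception:
            # Roll back: return the whole batch, in order, for a later retry.
            self.queue.extendleft(reversed(batch))
            return False
        return True  # commit: the batch has left the channel


def flaky(event):
    # Simulated destination that rejects one particular event.
    if event == "e2":
        raise IOError("destination unavailable")


channel = ToyChannel(["e1", "e2", "e3"])
ok = channel.deliver_batch(3, flaky)
print(ok, list(channel.queue))
```

Note that in this model "e1" had already been handed over before the failure, so a retry delivers it again. Reprocessing a whole batch therefore means events may arrive more than once.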

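1. 1. Recoverability

A channel can be backed by memory or by files. A memory channel is faster but not recoverable, whereas a file channel is slower but recoverable.

Getting-started example
1. Writing the configuration file
  1. Writing the configuration file

An agent is configured through a configuration file, so that comes first.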

# example.conf: a single-node Flume configuration

# Name the components of agent a1

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the Source

a1.sources.r1.type = netcat

a1.sources.r1.bind = 0.0.0.0

a1.sources.r1.port = 44444

# Describe the Sink

a1.sinks.k1.type = logger

# Describe the memory Channel

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the Source and the Sink to the Channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

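Notes:

(1) One configuration file can define several agents, and one agent can contain several Sources, Sinks, and Channels.

(2) A Source can be bound to several channels, but a Sink can be bound to only one.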
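The binding rules in the notes can be checked mechanically before an agent is started. The following is a minimal sketch in plain Python, not a Flume tool: check_bindings is a hypothetical helper, and it only understands simple key = value lines with # comments.

```python
def check_bindings(conf_text, agent="a1"):
    """Return a list of problems found in an agent's channel bindings."""
    props = {}
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()

    channels = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for sink in props.get(f"{agent}.sinks", "").split():
        bound = props.get(f"{agent}.sinks.{sink}.channel", "").split()
        if len(bound) != 1:
            problems.append(f"sink {sink} must be bound to exactly one channel")
        elif bound[0] not in channels:
            problems.append(f"sink {sink} uses undeclared channel {bound[0]}")
    for source in props.get(f"{agent}.sources", "").split():
        bound = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not bound:
            problems.append(f"source {source} is not bound to any channel")
        for channel in bound:
            if channel not in channels:
                problems.append(f"source {source} uses undeclared channel {channel}")
    return problems


good = """
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""
bad = good.replace("a1.sinks.k1.channel = c1", "a1.sinks.k1.channel = c1 c2")
print(check_bindings(good), check_bindings(bad))
```

The asymmetry in the property names mirrors the rule itself: a source has a plural channels property, a sink a singular channel.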

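1. Starting the agent with the flume-ng tool
  1. Starting the agent with the flume-ng tool

Run from the Flume installation directory, with example.conf saved in its conf directory:

$ bin/flume-ng agent --conf conf --conf-file conf/example.conf --name a1 -Dflume.root.logger=INFO,console

1. Sending data
  1. Sending data

From Windows, connect with telnet to port 44444 on the machine running Flume and send some data:

telnet ip port

# Press Ctrl+] and then Enter to get an input screen; whatever is typed there shows up on the Flume console

Flume does indeed collect the data that was sent.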

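Sources in detail
1. Avro Source
  1. Avro Source overview

Listens on an Avro port and receives event streams from external Avro clients.

With Avro Source you can build multi-hop, fan-out, and fan-in flows.

It can also receive log data sent by the Avro client that ships with Flume.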

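1. 1. Avro Source properties

!channels –

!type – The component type name, must be "avro"

!bind – Hostname or IP address to listen on

!port – Port to listen on

threads – Maximum number of worker threads

selector.type

selector.*

interceptors – Space-separated list of interceptors

interceptors.*

compression-type none Compression type, either "none" or "deflate". It must match the compression type of the Avro sink or client that sends to this source

ssl false Whether to enable SSL encryption. If enabled, a "keystore" and a "keystore-password" must also be configured.

keystore – Path to the Java keystore file used for SSL.

keystore-password – Password for the Java keystore used for SSL.

keystore-type JKS Keystore type, either "JKS" or "PKCS12".

exclude-protocols SSLv3 Space-separated list of SSL/TLS protocols to exclude. SSLv3 is always excluded in addition to the protocols specified.

ipFilter false Set to true to enable IP filtering for netty

ipFilterRules – Expression defining the netty IP filtering rules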

1. 1. Example

Write the configuration file:

# Name the components of agent a1

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the Source

a1.sources.r1.type = avro

a1.sources.r1.bind = 0.0.0.0

a1.sources.r1.port = 33333

# Describe the Sink

a1.sinks.k1.type = logger

# Describe the memory Channel

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the Source and the Sink to the Channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Start Flume:

./flume-ng agent --conf ../conf --conf-file ../conf/template2.conf --name a1 -Dflume.root.logger=INFO,console

Use the Avro client that ships with Flume to send log data to the given host and port:

./flume-ng avro-client --conf ../conf --host 0.0.0.0 --port 33333 --filename ../mydata/log1.txt

The agent does indeed collect the log.

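1. Exec Source
  1. Exec Source overview

Runs a command and uses its output as the event source.

1. 1. Exec Source properties

!channels –

!type – The component type name, must be "exec"

!command – The command to execute

shell – A shell invocation used to run the command. e.g. /bin/sh -c. Required only for commands relying on shell features like wildcards, back ticks, pipes etc.

restartThrottle 10000 Time in milliseconds to wait before attempting to restart the command

restart false Whether to restart the command if it dies

logStdErr false Whether the command's stderr should be logged

batchSize 20 Maximum number of lines sent to the channel at a time

batchTimeout 3000 Time to wait, if the buffer has not filled up, before the data is sent on anyway

selector.type replicating or multiplexing

selector.* Depends on the selector.type value

interceptors – Space-separated list of interceptors

interceptors.*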

1. 1. Example

Write the configuration file:

# Name the components of agent a1

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the Source

a1.sources.r1.type = exec

a1.sources.r1.command = ping 192.168.242.102

# Describe the Sink

a1.sinks.k1.type = logger

# Describe the memory Channel

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the Source and the Sink to the Channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Start Flume:

./flume-ng agent --conf ../conf --conf-file ../conf/template2.conf --name a1 -Dflume.root.logger=INFO,console

In the same way, using a tail command as the command collects the lines that are subsequently appended to a log file.

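1. Spooling Directory Source
  1. Spooling Directory Source overview

This Source lets you ingest data by placing files into a "spooling" directory. It watches that directory and parses new files as they appear. The event parsing logic is pluggable. Once a file has been read completely into the channel, it is renamed or, optionally, deleted.

Note that a file must not be modified after it has been placed in the spooling directory; if it is, Flume reports an error. File names must not be reused either: if a file with a name that was already used is placed in the directory, Flume also reports an error.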
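One way to respect both rules is to finish writing the file somewhere else and then move it into the spooling directory under a name that is never reused. The sketch below is an assumption-laden illustration in plain Python: drop_into_spool and staging_dir are hypothetical names, and the staging directory is assumed to sit on the same filesystem as the spooling directory so that the rename is atomic.

```python
import os
import shutil
import uuid


def drop_into_spool(src_path, spool_dir, staging_dir):
    """Copy src_path into spool_dir without exposing a half-written file."""
    base = os.path.basename(src_path)
    # A random prefix keeps repeated uploads of e.g. "app.log" from colliding.
    unique_name = f"{uuid.uuid4().hex}-{base}"
    staged = os.path.join(staging_dir, unique_name)
    shutil.copyfile(src_path, staged)   # the slow copy happens outside the spool dir
    final = os.path.join(spool_dir, unique_name)
    os.rename(staged, final)            # appears in the spool dir in one step
    return final
```

Because the file only becomes visible in the spooling directory once it is complete, the Source never sees it change, and the unique name avoids the duplicate-name error.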

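1. 1. Spooling Directory Source properties

!channels –

!type – The component type name, must be "spooldir"

!spoolDir – The directory to read files from, i.e. the spooling directory

fileSuffix .COMPLETED Suffix appended to files that have been completely ingested

deletePolicy never Whether to delete completed files; must be "never" or "immediate"

fileHeader false Whether to add a header storing the absolute path of the file.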

fileHeaderKey file Header key to use when appending absolute path filename to event header.

basenameHeader false Whether to add a header storing the basename of the file.

basenameHeaderKey basename Header Key to use when appending basename of file to event header.

ignorePattern ^$ Regular expression specifying which files to ignore

trackerDir .flumespool Directory to store metadata related to processing of files. If this path is not an absolute path, then it is interpreted as relative to the spoolDir.

consumeOrder Order in which the files in the spooling directory are consumed: oldest, youngest, or random.

maxBackoff 4000 The maximum time (in millis) to wait between consecutive attempts to write to the channel(s) if the channel is full. The source will start at a low backoff and increase it exponentially each time the channel throws a ChannelException, upto the value specified by this parameter.

batchSize 100 Granularity at which to batch transfer to the channel

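inputCharset UTF-8 Character set used when reading the input files.

decodeErrorPolicy FAIL What to do when an undecodable character is found in an input file. FAIL: throw an exception and fail to parse the file. REPLACE: replace the unparseable character with the "replacement character", typically Unicode U+FFFD. IGNORE: drop the unparseable character sequence.

deserializer LINE The deserializer used to parse the file into events. By default each line becomes one event. The class specified must implement EventDeserializer.Builder.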

deserializer.* Varies per event deserializer.

bufferMaxLines – (Obsolete) This option is now ignored.

bufferMaxLineLength 5000 (Deprecated) Maximum length of a line in the commit buffer. Use deserializer.maxLineLength instead.

selector.type replicating replicating or multiplexing

selector.* Depends on the selector.type value

interceptors – Space-separated list of interceptors

interceptors.*

1. 1. Example

Write the configuration file:

# Name the components of agent a1

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the Source

a1.sources.r1.type = spooldir

a1.sources.r1.spoolDir=/home/park/work/apache-flume-1.6.0-bin/mydata

# Describe the Sink

a1.sinks.k1.type = logger

# Describe the memory Channel

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the Source and the Sink to the Channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Start Flume:

./flume-ng agent --conf ../conf --conf-file ../conf/template4.conf --name a1 -Dflume.root.logger=INFO,console

Copy a file into the configured directory: Flume picks it up and treats each line of the file as one log event.

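1. NetCat Source
  1. NetCat Source overview

A NetCat Source listens on a given port and turns each line of text it receives into an event.

1. 1. NetCat Source properties

!channels –

!type – The component type name, must be "netcat"

!bind – Hostname or IP address to bind to.

!port – Port number to bind to

max-line-length 512 Maximum length of a line, in bytes

ack-every-event true Whether to respond with "OK" for every event received

selector.type

selector.*

interceptors –

interceptors.*

1. 1. Example

See the getting-started example.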

1. Sequence Generator Source

1. 1. Sequence Generator Source overview

A simple sequence generator that continuously produces events whose value is a counter starting at 0 and incremented by 1.

It is mainly useful for testing.

1. 1. Sequence Generator Source properties

!channels –

!type – The component type name, must be "seq"

selector.type

selector.*

interceptors –

interceptors.*

batchSize

1. 1. Example

Write the configuration file:

# Name the components of agent a1

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the Source

a1.sources.r1.type = seq

# Describe the Sink

a1.sinks.k1.type = logger

# Describe the memory Channel

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the Source and the Sink to the Channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Start Flume:

./flume-ng agent --conf ../conf --conf-file ../conf/template4.conf --name a1 -Dflume.root.logger=INFO,console

The generated events are printed to the log.

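1. HTTP Source
  1. HTTP Source overview

HTTP Source accepts Flume events through HTTP GET and POST requests. GET should be used for experimentation only.

The Source needs a pluggable "handler" that converts requests into events. The handler must implement the HTTPSourceHandler interface; it takes an HttpServletRequest and returns a list of Flume events.

All events produced from one HTTP request are committed to the channel in a single transaction, which allows better efficiency on channels such as the file channel.

If the handler throws an exception, the Source returns HTTP status 400.

If the channel is full and the Source cannot append events to it, the Source returns HTTP status 503 (temporarily unavailable).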

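1. 1. HTTP Source properties

!type The component type name, must be "http"

!port – The port to listen on

bind 0.0.0.0 Hostname or IP address to listen on

handler org.apache.flume.source.http.JSONHandler The handler class, which must implement the HTTPSourceHandler interface

handler.* – Configuration parameters for the handler

selector.type

selector.*

interceptors –

interceptors.*

enableSSL false Set to true to enable SSL. Note that HTTP Source does not support SSLv3.

excludeProtocols SSLv3 Space-separated list of SSL/TLS protocols to exclude. SSLv3 is always excluded.

keystore Location of the keystore file.

keystorePassword Password for the keystore.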

1. 1. Example

Write the configuration file:

# Name the components of agent a1

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the Source

a1.sources.r1.type = http

# Port numbers cannot exceed 65535; 6666 matches the curl command used in this example

a1.sources.r1.port = 6666

# Describe the Sink

a1.sinks.k1.type = logger

# Describe the memory Channel

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the Source and the Sink to the Channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Start Flume:

./flume-ng agent --conf ../conf --conf-file ../conf/template6.conf --name a1 -Dflume.root.logger=INFO,console

Send an HTTP request to that port from the command line:

curl -X POST -d '[{ "headers" :{"a" : "a1","b" : "b1"},"body" : "hello~http~flume~"}]' http://0.0.0.0:6666

Flume collects the event.

1. 1. Commonly used handlers

JSONHandler

Handles events in JSON format and supports the UTF-8, UTF-16, and UTF-32 character sets. The handler accepts an array of events and converts them to Flume events using the encoding specified in the request.

If no encoding is specified, UTF-8 is assumed.

The JSON format is as follows:

[{

"headers" : {

"timestamp" : "434324343",

"host" : "random_host.example.com"

},

"body" : "random_body"

},

{

"headers" : {

"namenode" : "namenode.example.com",

"datanode" : "random_datanode.example.com"

},

"body" : "really_random_body"

}]

To set the charset, the request must have its content type specified as application/json;charset=UTF-8 (replace UTF-8 with UTF-16 or UTF-32 as required).

One way to create an event in the format expected by this handler is to use JSONEvent provided in the Flume SDK and use Google Gson to create the JSON string using the Gson#toJson(Object, Type) method, with a type token such as:

Type type = new TypeToken<List<JSONEvent>>() {}.getType();
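For clients that are not written in Java, the same request can be assembled with any JSON library. Here is a minimal sketch using only the Python standard library; build_request is a hypothetical helper, and the URL http://localhost:6666 is an assumption that should be replaced by the host and port of your own HTTP Source.

```python
import json
import urllib.request


def build_request(url, events, charset="UTF-8"):
    """Build a POST request carrying (headers, body) pairs in JSONHandler format."""
    payload = json.dumps(
        [{"headers": dict(headers), "body": body} for headers, body in events]
    ).encode(charset)
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        # The charset named here has to match the encoding of the payload.
        headers={"Content-Type": f"application/json;charset={charset}"},
    )


req = build_request(
    "http://localhost:6666",
    [({"a": "a1", "b": "b1"}, "hello~http~flume~")],
)
# urllib.request.urlopen(req) would send it to a running agent.
print(req.get_method(), req.data.decode("UTF-8"))
```

The payload is always a JSON array, even for a single event, and every event carries both a "headers" object and a "body" string.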

BlobHandler

BlobHandler is a handler that turns the file data uploaded in a request into an event.

Properties (those marked with ! are required):

!handler – The FQCN of this class: org.apache.flume.sink.solr.morphline.BlobHandler

handler.maxBlobLength 100000000 The maximum number of bytes to read and buffer for a given request

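1. Custom source
  1. Custom source overview

A custom source is your own implementation of the Source interface. The custom source's class and its dependencies must be on Flume's classpath when the agent starts.

1. 1. Custom source properties

!channels –

!type – The component type name, must be the fully qualified class name of your custom source

selector.type

selector.*

interceptors –

interceptors.*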

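Sinks in detail
1. Logger Sink
  1. Logger Sink overview

Logs events at INFO level. It is typically used for debugging.

1. 1. Logger Sink properties

!channel –

!type – The component type name, needs to be logger

maxBytesToLog 16 Maximum number of bytes of the Event body to log

A log4j configuration file must be present in the directory given by the --conf parameter.

Log4j settings can also be passed on the command line at startup with -Dflume.root.logger=INFO,console.

1. 1. Example

See the getting-started example.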

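1. File Roll Sink
  1. File Roll Sink overview

Stores events on the local file system. At a fixed interval it starts a new file, so each file holds the log data collected during one interval.

1. 1. File Roll Sink properties

!channel –

!type – The component type name, must be "file_roll"

!sink.directory – The directory where files are stored

sink.rollInterval 30 Roll the file every 30 seconds, i.e. start a new file for the data of each 30-second interval. Setting this to 0 disables rolling, so all events are written to a single file.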

sink.serializer TEXT Other possible options include avro_event or the FQCN of an implementation of EventSerializer.Builder interface.

batchSize 100

1. 1. Example

Write the configuration file:

# Name the components of agent a1

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the Source

a1.sources.r1.type = http

a1.sources.r1.port = 6666

# Describe the Sink

a1.sinks.k1.type = file_roll

# The property is sink.directory; an earlier version of this file used a1.sinks.k1.dirctory and produced no output

a1.sinks.k1.sink.directory=/home/park/work/apache-flume-1.6.0-bin/mysink

# Describe the memory Channel

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the Source and the Sink to the Channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Start Flume:

./flume-ng agent --conf ../conf --conf-file ../conf/template7.conf --name a1 -Dflume.root.logger=INFO,console

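1. Avro Sink
  1. Avro Sink overview

Avro Sink is the building block for multi-hop flows, fan-out flows (one to many), and fan-in flows (many to one).

1. 1. Avro Sink properties

!channel –

!type – The component type name, needs to be avro.

!hostname – The hostname or IP address of the next-hop Avro Source to connect to.

!port – The port the next-hop Avro Source listens on.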

batch-size 100 number of event to batch together for send.

connect-timeout 20000 Amount of time (ms) to allow for the first (handshake) request.

request-timeout 20000 Amount of time (ms) to allow for requests after the first.

reset-connection-interval none Amount of time (s) before the connection to the next hop is reset. This will force the Avro Sink to reconnect to the next hop. This will allow the sink to connect to hosts behind a hardware load-balancer when new hosts are added without having to restart the agent.

compression-type none This can be “none” or “deflate”. The compression-type must match the compression-type of matching AvroSource

compression-level 6 The level of compression to compress event. 0 = no compression and 1-9 is compression. The higher the number the more compression

ssl false Set to true to enable SSL for this AvroSink. When configuring SSL, you can optionally set a “truststore”, “truststore-password”, “truststore-type”, and specify whether to “trust-all-certs”.

trust-all-certs false If this is set to true, SSL server certificates for remote servers (Avro Sources) will not be checked. This should NOT be used in production because it makes it easier for an attacker to execute a man-in-the-middle attack and “listen in” on the encrypted connection.

truststore – The path to a custom Java truststore file. Flume uses the certificate authority information in this file to determine whether the remote Avro Source’s SSL authentication credentials should be trusted. If not specified, the default Java JSSE certificate authority files (typically “jssecacerts” or “cacerts” in the Oracle JRE) will be used.

truststore-password – The password for the specified truststore.

truststore-type JKS The type of the Java truststore. This can be “JKS” or other supported Java truststore type.

exclude-protocols SSLv3 Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified.

maxIoWorkers 2 * the number of available processors in the machine The maximum number of I/O w

1. 1. Example 1: multi-hop flow

We use three machines, h1, h2, and h3, for this experiment:

h3:

Configuration file:

# Name the agent's components

a1.sources=r1

a1.sinks=k1

a1.channels=c1

# Describe/configure the Source

a1.sources.r1.type=avro

a1.sources.r1.bind=0.0.0.0

a1.sources.r1.port=9988

# Describe the Sink

a1.sinks.k1.type=logger

# Describe the memory Channel

a1.channels.c1.type=memory

a1.channels.c1.capacity=1000

a1.channels.c1.transactionCapacity=1000

# Bind the Source and the Sink to the Channel

a1.sources.r1.channels=c1

a1.sinks.k1.channel=c1

Start Flume:

./flume-ng agent --conf ../conf --conf-file ../conf/template8.conf --name a1 -Dflume.root.logger=INFO,console

h2:

Configuration file:

# Name the agent's components

a1.sources=r1

a1.sinks=k1

a1.channels=c1

# Describe/configure the Source

a1.sources.r1.type=avro

a1.sources.r1.bind = 0.0.0.0

a1.sources.r1.port=9988

# Describe the Sink

a1.sinks.k1.type=avro

a1.sinks.k1.hostname=192.168.242.139

a1.sinks.k1.port=9988

# Describe the memory Channel

a1.channels.c1.type=memory

a1.channels.c1.capacity=1000

a1.channels.c1.transactionCapacity=1000

# Bind the Source and the Sink to the Channel

a1.sources.r1.channels=c1

a1.sinks.k1.channel=c1

Start Flume:

./flume-ng agent --conf ../conf --conf-file ../conf/template8.conf --name a1 -Dflume.root.logger=INFO,console

h1:

Configuration file:

# Name the agent's components

a1.sources=r1

a1.sinks=k1

a1.channels=c1

# Describe/configure the Source

a1.sources.r1.type=http

a1.sources.r1.port=8888

# Describe the Sink

a1.sinks.k1.type=avro

a1.sinks.k1.hostname=192.168.242.138

a1.sinks.k1.port=9988

# Describe the memory Channel

a1.channels.c1.type=memory

a1.channels.c1.capacity=1000

a1.channels.c1.transactionCapacity=1000

# Bind the Source and the Sink to the Channel

a1.sources.r1.channels=c1

a1.sinks.k1.channel=c1

Start Flume:

./flume-ng agent --conf ../conf --conf-file ../conf/template8.conf --name a1 -Dflume.root.logger=INFO,console

Send an HTTP request to h1:

curl -X POST -d '[{ "headers" :{"a" : "a1","b" : "b1"},"body" : "hello~http~flume~"}]' http://192.168.242.133:8888

After a few seconds the message arrives at h3, the last hop of the chain (h2 only relays it).

1. 1. Example 2: fan-out flow, replicating

h2 and h3:

Configuration file:

# Name the agent's components

a1.sources=r1

a1.sinks=k1

a1.channels=c1

# Describe/configure the Source

a1.sources.r1.type=avro

a1.sources.r1.bind=0.0.0.0

a1.sources.r1.port=9988

# Describe the Sink

a1.sinks.k1.type=logger

# Describe the memory Channel

a1.channels.c1.type=memory

a1.channels.c1.capacity=1000

a1.channels.c1.transactionCapacity=1000

# Bind the Source and the Sink to the Channel

a1.sources.r1.channels=c1

a1.sinks.k1.channel=c1

Start Flume:

./flume-ng agent --conf ../conf --conf-file ../conf/template8.conf --name a1 -Dflume.root.logger=INFO,console

h1:

Configuration file:

# Name the agent's components

a1.sources=r1

a1.sinks=k1 k2

a1.channels=c1 c2

# Describe/configure the Source

a1.sources.r1.type=http

a1.sources.r1.port=8888

# Describe the Sinks

a1.sinks.k1.type=avro

a1.sinks.k1.hostname=192.168.242.138

a1.sinks.k1.port=9988

a1.sinks.k2.type=avro

a1.sinks.k2.hostname=192.168.242.135

a1.sinks.k2.port=9988

# Describe the memory Channels

a1.channels.c1.type=memory

a1.channels.c1.capacity=1000

a1.channels.c1.transactionCapacity=1000

a1.channels.c2.type=memory

a1.channels.c2.capacity=1000

a1.channels.c2.transactionCapacity=1000

# Bind the Source and the Sinks to the Channels

a1.sources.r1.channels=c1 c2

a1.sinks.k1.channel=c1

a1.sinks.k2.channel=c2

Start Flume:

./flume-ng agent --conf ../conf --conf-file ../conf/template8.conf --name a1 -Dflume.root.logger=INFO,console

1. 1. Example 3: fan-out flow, multiplexing (routing)

h2 and h3:

Configuration file:

# Name the agent's components

a1.sources=r1

a1.sinks=k1

a1.channels=c1

# Describe/configure the Source

a1.sources.r1.type=avro

a1.sources.r1.bind=0.0.0.0

a1.sources.r1.port=9988

# Describe the Sink

a1.sinks.k1.type=logger

# Describe the memory Channel

a1.channels.c1.type=memory

a1.channels.c1.capacity=1000

a1.channels.c1.transactionCapacity=1000

# Bind the Source and the Sink to the Channel

a1.sources.r1.channels=c1

a1.sinks.k1.channel=c1

Start Flume:

./flume-ng agent --conf ../conf --conf-file ../conf/template8.conf --name a1 -Dflume.root.logger=INFO,console

h1:

Configuration file:

# Name the agent's components

a1.sources=r1

a1.sinks=k1 k2

a1.channels=c1 c2

# Describe/configure the Source

a1.sources.r1.type=http

a1.sources.r1.port=8888

a1.sources.r1.selector.type=multiplexing

a1.sources.r1.selector.header=gender

a1.sources.r1.selector.mapping.male=c1

a1.sources.r1.selector.mapping.female=c2

a1.sources.r1.selector.default=c1

# Describe the Sinks

a1.sinks.k1.type=avro

a1.sinks.k1.hostname=192.168.242.138

a1.sinks.k1.port=9988

a1.sinks.k2.type=avro

a1.sinks.k2.hostname=192.168.242.135

a1.sinks.k2.port=9988

# Describe the memory Channels

a1.channels.c1.type=memory

a1.channels.c1.capacity=1000

a1.channels.c1.transactionCapacity=1000

a1.channels.c2.type=memory

a1.channels.c2.capacity=1000

a1.channels.c2.transactionCapacity=1000

# Bind the Source and the Sinks to the Channels

a1.sources.r1.channels=c1 c2

a1.sinks.k1.channel=c1

a1.sinks.k2.channel=c2

Start Flume:

./flume-ng agent --conf ../conf --conf-file ../conf/template8.conf --name a1 -Dflume.root.logger=INFO,console

Send HTTP requests to test it: events are routed according to the value of their gender header, so the routing works.

1. 1. Example 4: fan-in flow

m3:

Write the configuration file:

# Name the agent's components

a1.sources=r1

a1.sinks=k1

a1.channels=c1

# Describe/configure the Source

a1.sources.r1.type=avro

a1.sources.r1.bind=0.0.0.0

a1.sources.r1.port=4141

# Describe the Sink

a1.sinks.k1.type=logger

# Describe the memory Channel

a1.channels.c1.type=memory

a1.channels.c1.capacity=1000

a1.channels.c1.transactionCapacity=1000

# Bind the Source and the Sink to the Channel

a1.sources.r1.channels=c1

a1.sinks.k1.channel=c1

Start Flume:

./flume-ng agent --conf ../conf --conf-file ../conf/template.conf --name a1 -Dflume.root.logger=INFO,console

m1 and m2:

Write the configuration file:

# Name the agent's components

a1.sources=r1

a1.sinks=k1

a1.channels=c1

# Describe/configure the Source

a1.sources.r1.type=http

a1.sources.r1.port=8888

# Describe the Sink

a1.sinks.k1.type=avro

a1.sinks.k1.hostname=192.168.242.135

a1.sinks.k1.port=4141

# Describe the memory Channel

a1.channels.c1.type=memory

a1.channels.c1.capacity=1000

a1.channels.c1.transactionCapacity=1000

# Bind the Source and the Sink to the Channel

a1.sources.r1.channels=c1

a1.sinks.k1.channel=c1

Start Flume:

./flume-ng agent --conf ../conf --conf-file ../conf/template9.conf --name a1 -Dflume.root.logger=INFO,console

On m1, send an HTTP request with curl. Because the default handler is JSONHandler, the data must be in the JSON format that handler expects:

[root@localhost conf]# curl -XPOST -d '[{ "headers" :{"flag" : "c"},"body" : "idoall.org_body"}]' http://0.0.0.0:8888

On m2, send an HTTP request with curl in the same JSON format:

[root@localhost conf]# curl -XPOST -d '[{ "headers" :{"flag" : "c"},"body" : "idoall.org_body"}]' http://0.0.0.0:8888

m3 receives both messages correctly.

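1. HDFS Sink
  1. HDFS Sink overview

HDFS Sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text files and sequence files, with compression available for both. Files can be rolled based on elapsed time, data size, or number of events.

It also buckets/partitions data by attributes such as the timestamp or the machine. The HDFS directory path may contain escape sequences, which the HDFS Sink replaces to generate the directory/file name in which the events are stored.

Using HDFS Sink requires Hadoop to be installed, so that Flume can use the jars provided by Hadoop to communicate with HDFS.

Note that the Hadoop version must support the sync() call.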

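1. 1. HDFS Sink properties

!channel –

!type – The component type name, must be "hdfs"

!hdfs.path – HDFS directory path (e.g. hdfs://namenode/flume/webdata/)

hdfs.filePrefix FlumeData Name prefix for the files Flume creates in the HDFS directory

hdfs.fileSuffix – Suffix to append to the file name (e.g. .avro; note: the period is not added automatically)

hdfs.inUsePrefix – Prefix used for files that Flume is still writing to

hdfs.inUseSuffix .tmp Suffix used for files that Flume is still writing to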

hdfs.rollInterval 30 Number of seconds to wait before rolling current file (0 = never roll based on time interval)

hdfs.rollSize 1024 File size to trigger roll, in bytes (0: never roll based on file size)

hdfs.rollCount 10 Number of events written to file before it rolled (0 = never roll based on number of events)

hdfs.idleTimeout 0 Timeout after which inactive files get closed (0 = disable automatic closing of idle files)

hdfs.batchSize 100 number of events written to file before it is flushed to HDFS

hdfs.codeC – Compression codec. one of following : gzip, bzip2, lzo, lzop, snappy

hdfs.fileType SequenceFile File format: currently SequenceFile, DataStream or CompressedStream (1)DataStream will not compress output file and please don’t set codeC (2)CompressedStream requires set hdfs.codeC with an available codeC

hdfs.maxOpenFiles 5000 Allow only this number of open files. If this number is exceeded, the oldest file is closed.

hdfs.minBlockReplicas – Specify minimum number of replicas per HDFS block. If not specified, it comes from the default Hadoop config in the classpath.

hdfs.writeFormat – Format for sequence file records. One of “Text” or “Writable” (the default).

hdfs.callTimeout 10000 Number of milliseconds allowed for HDFS operations, such as open, write, flush, close. This number should be increased if many HDFS timeout operations are occurring.

hdfs.threadsPoolSize 10 Number of threads per HDFS sink for HDFS IO ops (open, write, etc.)

hdfs.rollTimerPoolSize 1 Number of threads per HDFS sink for scheduling timed file rolling

hdfs.kerberosPrincipal – Kerberos user principal for accessing secure HDFS

hdfs.kerberosKeytab – Kerberos keytab for accessing secure HDFS

hdfs.proxyUser

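hdfs.round false Whether the timestamp should be rounded down (if true, this affects all time-based escape sequences except %t)

hdfs.roundValue 1 The timestamp is rounded down to the highest multiple of this value, in the unit given by hdfs.roundUnit

hdfs.roundUnit The unit of the round-down value: second, minute, or hour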

hdfs.timeZone Local Time Name of the timezone that should be used for resolving the directory path, e.g. America/Los_Angeles.

hdfs.useLocalTimeStamp false Use the local time (instead of the timestamp from the event header) while replacing the escape sequences.

hdfs.closeTries 0 Number of times the sink must try renaming a file, after initiating a close attempt. If set to 1, this sink will not re-try a failed rename (due to, for example, NameNode or DataNode failure), and may leave the file in an open state with a .tmp extension. If set to 0, the sink will try to rename the file until the file is eventually renamed (there is no limit on the number of times it would try). The file may still remain open if the close call fails but the data will be intact and in this case, the file will be closed only after a Flume restart.

hdfs.retryInterval 180 Time in seconds between consecutive attempts to close a file. Each close call costs multiple RPC round-trips to the Namenode, so setting this too low can cause a lot of load on the name node. If set to 0 or less, the sink will not attempt to close the file if the first attempt fails, and may leave the file open or with a ”.tmp” extension.

serializer TEXT Other possible options include avro_event or the fully-qualified class name of an implementation of the EventSerializer.Builder interface.
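The effect of hdfs.round, hdfs.roundValue, and hdfs.roundUnit on the generated path is easiest to see with numbers. The sketch below reimplements the rounding in plain Python as an illustration only; round_down is a hypothetical function, and Python's strftime codes merely stand in for Flume's escape sequences, which are not identical.

```python
from datetime import datetime


def round_down(ts, round_value, round_unit):
    """Round ts down to a multiple of round_value in the given unit."""
    if round_unit == "second":
        return ts.replace(second=ts.second - ts.second % round_value, microsecond=0)
    if round_unit == "minute":
        return ts.replace(minute=ts.minute - ts.minute % round_value,
                          second=0, microsecond=0)
    if round_unit == "hour":
        return ts.replace(hour=ts.hour - ts.hour % round_value,
                          minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit: {round_unit}")


# Equivalent in spirit to round = true, roundValue = 10, roundUnit = minute.
event_time = datetime(2012, 6, 12, 11, 54, 34)
bucket = round_down(event_time, 10, "minute").strftime("/flume/events/%Y-%m-%d/%H%M/%S")
print(bucket)  # /flume/events/2012-06-12/1150/00
```

With rounding enabled, all events from a ten-minute window land in the same directory instead of producing one directory per second.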

1. 1. Example

Write the configuration file:

# Name the agent's components

a1.sources=r1

a1.sinks=k1

a1.channels=c1

# Describe/configure the Source

a1.sources.r1.type=http

a1.sources.r1.port=8888

# Describe the Sink

a1.sinks.k1.type=hdfs

a1.sinks.k1.hdfs.path=hdfs://0.0.0.0:9000/ppp

# Write plain text; the default file type is SequenceFile

a1.sinks.k1.hdfs.fileType=DataStream

# Describe the memory Channel

a1.channels.c1.type=memory

a1.channels.c1.capacity=1000

a1.channels.c1.transactionCapacity=1000

# Bind the Source and the Sink to the Channel

a1.sources.r1.channels=c1

a1.sinks.k1.channel=c1

Start Flume:

./flume-ng agent --conf ../conf --conf-file ../conf/template9.conf --name a1 -Dflume.root.logger=INFO,console

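1. Hive Sink
  1. Hive Sink overview

This sink streams events containing delimited text or JSON data directly into a Hive table or partition.

Events are written using Hive transactions. As soon as a set of events is committed to Hive, it becomes visible to Hive queries.

The partitions Flume writes to can either be created in advance or be created by Flume when they are missing.

Fields from the data Flume receives are mapped to the columns of the Hive table.

This sink is a preview feature and is not recommended for production use.

1. 1. Hive Sink properties

!channel –

!type – The component type name, must be "hive"

!hive.metastore – Hive metastore URI (e.g. thrift://a.b.com:9083)

!hive.database – Hive database name

!hive.table – Hive table name

hive.partition – Comma-separated list of partition values identifying the partition to write to. May contain escape sequences.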

hive.txnsPerBatchAsk 100 Hive grants a batch of transactions instead of single transactions to streaming clients like Flume. This setting configures the number of desired transactions per Transaction Batch. Data from all transactions in a single batch end up in a single file. Flume will write a maximum of batchSize events in each transaction in the batch. This setting in conjunction with batchSize provides control over the size of each file. Note that eventually Hive will transparently compact these files into larger files.

heartBeatInterval 240 (In seconds) Interval between consecutive heartbeats sent to Hive to keep unused transactions from expiring. Set this value to 0 to disable heartbeats.

autoCreatePartitions true Flume will automatically create the necessary Hive partitions to stream to

batchSize 15000 Max number of events written to Hive in a single Hive transaction

maxOpenConnections 500 Allow only this number of open connections. If this number is exceeded, the least recently used connection is closed.

callTimeout 10000 (In milliseconds) Timeout for Hive & HDFS I/O operations, such as openTxn, write, commit, abort.

serializer Serializer is responsible for parsing out field from the event and mapping them to columns in the hive table. Choice of serializer depends upon the format of the data in the event. Supported serializers: DELIMITED and JSON

roundUnit minute The unit of the round down value - second, minute or hour.

roundValue 1 Rounded down to the highest multiple of this (in the unit configured using hive.roundUnit), less than current time

timeZone Local Time Name of the timezone that should be used for resolving the escape sequences in partition, e.g. America/Los_Angeles.

useLocalTimeStamp false Use the local time (instead of the timestamp from the event header) while replacing the escape sequences.

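1. Custom Sink
  1. Custom Sink overview

A custom sink is your own implementation of the Sink interface.

The custom sink's class and its dependencies must be on Flume's classpath before the agent starts.

1. 1. Custom Sink properties

type – The component type name, must be the fully qualified class name of your Sink implementation.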

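Selector
1. Selector overview
  1. Selector overview

A selector works in one of two modes: replicating or multiplexing (routing).

1. Replicating mode
  1. Replicating selector properties

selector.type replicating The component type name; replicating is the default

selector.optional – Channels to be marked as optional

1. 1. Replicating selector example

See Example 2 (fan-out flow, replicating) under Avro Sink.

1. Multiplexing (routing) mode
  1. Multiplexing (routing) selector properties

selector.type The component type name, must be "multiplexing"

selector.header Name of the event header to inspect

selector.default –

selector.mapping.* –

For example: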

a1.sources = r1

a1.channels = c1 c2 c3 c4

a1.sources.r1.selector.type = multiplexing

a1.sources.r1.selector.header = state

a1.sources.r1.selector.mapping.CZ = c1

a1.sources.r1.selector.mapping.US = c2 c3

a1.sources.r1.selector.default = c4
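What this configuration means for individual events can be simulated in a few lines. The following is a plain-Python sketch of the routing rule, not Flume code; select_channels is a hypothetical function, and the mapping mirrors the CZ/US example above.

```python
def select_channels(headers, header_name, mapping, default):
    """Return the channels an event is routed to, based on one header value."""
    value = headers.get(header_name)
    # Unmapped values, and events without the header, go to the default channels.
    return mapping.get(value, default)


mapping = {"CZ": ["c1"], "US": ["c2", "c3"]}
default = ["c4"]

print(select_channels({"state": "CZ"}, "state", mapping, default))  # ['c1']
print(select_channels({"state": "US"}, "state", mapping, default))  # ['c2', 'c3']
print(select_channels({"state": "DE"}, "state", mapping, default))  # ['c4']
print(select_channels({}, "state", mapping, default))               # ['c4']
```

One header value may map to several channels, as with US here, in which case the event is written to each of them.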

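1. 1. Multiplexing (routing) selector example

See Example 3 (fan-out flow, multiplexing) under Avro Sink.

Interceptors
1. Interceptors overview
  1. Interceptors overview

Flume can modify or drop events in flight. This is done with interceptors.

An interceptor implements the org.apache.flume.interceptor.Interceptor interface.

An interceptor can modify or drop events based on any criteria chosen by the developer of the interceptor.

Interceptors follow the chain-of-responsibility pattern: several interceptors can be chained and are applied in the configured order.

The list of events returned by one interceptor is passed to the next interceptor in the chain.

To drop an event, an interceptor simply leaves it out of the list it returns.

To drop all events, it returns an empty list.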
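The chain behaviour described above can be modelled with ordinary functions that take a list of events and return a list of events. This is a plain-Python sketch, not the Java Interceptor interface; run_chain, tag_source, and drop_debug are hypothetical names, and events are represented as simple dictionaries.

```python
def run_chain(interceptors, events):
    """Pass the event list through each interceptor in the configured order."""
    for intercept in interceptors:
        events = intercept(events)  # the output of one is the input of the next
    return events


def tag_source(events):
    # Modify: return copies that carry an extra header.
    return [{"headers": {**e["headers"], "source": "web"}, "body": e["body"]}
            for e in events]


def drop_debug(events):
    # Drop: leave unwanted events out of the returned list.
    return [e for e in events if not e["body"].startswith("DEBUG")]


events = [
    {"headers": {}, "body": "DEBUG cache warm"},
    {"headers": {}, "body": "ERROR disk full"},
]
result = run_chain([tag_source, drop_debug], events)
print(result)
```

Order matters: an interceptor later in the chain only sees the events, and the headers, that earlier interceptors left behind.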

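1. Timestamp Interceptor
  1. Timestamp Interceptor overview

This interceptor inserts into the event headers the time, in milliseconds, at which it processes the event.

The header is named timestamp, and its value is the processing timestamp.

If a timestamp header is already present, it is overwritten by default; it is kept only when preserveExisting is set to true.

1. 1. Timestamp Interceptor properties

!type – The component type name, must be timestamp or the fully qualified class name of a custom class

preserveExisting false Whether to keep a timestamp that already exists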

1. 1. Example

Configuration file:

# Name the components of agent a1

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the Source

a1.sources.r1.type = http

a1.sources.r1.port = 44444

a1.sources.r1.interceptors = i1

a1.sources.r1.interceptors.i1.type = timestamp

# Describe the Sink

a1.sinks.k1.type = logger

# Describe the memory Channel

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the Source and the Sink to the Channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Start Flume:

./flume-ng agent --conf ../conf --conf-file ../conf/flume4.conf --name a1 -Dflume.root.logger=INFO,console

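1. Host Interceptor
  1. Host Interceptor overview

This interceptor inserts the hostname or IP address of the host the agent is running on.

The header is named host, or whatever name is configured.

Its value is the hostname or the IP address, depending on the configuration.

1. 1. Host Interceptor properties

!type – The component type name, must be host

preserveExisting false Whether to keep a host header that already exists

useIP true If true, use the IP address; if false, use the hostname

hostHeader host The header name to use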

1. 1. Example

Configuration file:

# Name the components of agent a1

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the Source

a1.sources.r1.type = http

a1.sources.r1.port = 44444

a1.sources.r1.interceptors = i1 i2

a1.sources.r1.interceptors.i1.type = timestamp

# The IP added is that of the machine where the interceptor runs

a1.sources.r1.interceptors.i2.type = host

# Describe the Sink

a1.sinks.k1.type = logger

# Describe the memory Channel

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the Source and the Sink to the Channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Start Flume:

./flume-ng agent --conf ../conf --conf-file ../conf/flume4.conf --name a1 -Dflume.root.logger=INFO,console

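1. Static Interceptor
  1. Static Interceptor overview

This interceptor lets the user add a static header with a static value to all events.

The current implementation does not allow specifying several headers at once.

To add several static headers, chain several Static Interceptors.

1. 1. Static Interceptor properties

!type – The component type name, must be static

preserveExisting true Whether to keep the configured header if it already exists

key key Name of the header to add

value value Value of the header to add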

1. 1. Example

Configuration file:

# Name the components of agent a1

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the Source

a1.sources.r1.type = http

a1.sources.r1.port = 44444

a1.sources.r1.interceptors = i1 i2 i3

a1.sources.r1.interceptors.i1.type = timestamp

# The IP added is that of the machine where the interceptor runs

a1.sources.r1.interceptors.i2.type = host

a1.sources.r1.interceptors.i3.type = static

a1.sources.r1.interceptors.i3.key = country

a1.sources.r1.interceptors.i3.value = China

# Describe the Sink

a1.sinks.k1.type = logger

# Describe the memory Channel

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the Source and the Sink to the Channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Start Flume:

./flume-ng agent --conf ../conf --conf-file ../conf/flume4.conf --name a1 -Dflume.root.logger=INFO,console

The headers of each event now also contain country=China.

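1. UUID Interceptor
  1. UUID Interceptor overview

This interceptor adds a universally unique identifier (a UUID) to the headers of every event.

1. 1. UUID Interceptor properties

!type – The component type name, must be org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder

headerName id Name of the header

preserveExisting true Whether to keep the header if it already exists

prefix "" String prefix prepended to each generated UUID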
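How headerName, preserveExisting, and prefix interact can be sketched with Python's standard uuid module. This illustrates the property semantics only and is not the interceptor's source; uuid_interceptor is a hypothetical function whose defaults copy the table above.

```python
import uuid


def uuid_interceptor(headers, header_name="id", preserve_existing=True, prefix=""):
    """Return headers with a UUID set under header_name."""
    if preserve_existing and header_name in headers:
        return headers  # an existing identifier is left untouched
    tagged = dict(headers)
    tagged[header_name] = prefix + str(uuid.uuid4())
    return tagged


fresh = uuid_interceptor({"country": "China"}, prefix="log-")
kept = uuid_interceptor({"id": "already-set"})
print(fresh["id"], kept["id"])
```

Because preserveExisting defaults to true, an identifier assigned by an upstream agent survives later hops, which lets the same log entry be recognized along the whole flow.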

1. 1. 案例

配置文件

＃命名Agent a1的组件

a1.sources = r1

a1.sinks = k1

a1.channels = c1

＃描述/配置Source

a1.sources.r1.type = http

a1.sources.r1.port = 44444

a1.sources.r1.interceptors = i1 i2 i3 i4

a1.sources.r1.interceptors.i1.type = timestamp

#ip是拦截者所在机器的ip

a1.sources.r1.interceptors.i2.type = host

a1.sources.r1.interceptors.i3.type = static

a1.sources.r1.interceptors.i3.key = country

a1.sources.r1.interceptors.i3.value = China

a1.sources.r1.interceptors.i4.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder

# Describe the Sink

a1.sinks.k1.type = logger

# Describe the memory Channel

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the Source and Sink to the Channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Start Flume

./flume-ng agent --conf ../conf --conf-file ../conf/flume4.conf --name a1 -Dflume.root.logger=INFO,console

Purpose: to give each log record a unique identifier.
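How headerName, preserveExisting and prefix interact can be sketched in Python. This is an illustrative model, not the interceptor's source: the event is assumed to be a dict, and `uuid_interceptor` is a hypothetical name.

```python
import uuid


def uuid_interceptor(event, header_name="id", preserve_existing=True, prefix=""):
    """Hypothetical model: stamp an event dict with a random UUID header."""
    headers = event["headers"]
    # An existing id is kept when preserve_existing is True.
    if preserve_existing and header_name in headers:
        return event
    headers[header_name] = prefix + str(uuid.uuid4())
    return event


event = {"headers": {}, "body": b"one log line"}
uuid_interceptor(event, prefix="log-")
print(event["headers"]["id"])
```

Keeping an existing id matters in multi-hop flows: if a downstream agent also runs the interceptor, the record keeps the identifier it received at the first hop.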

1. Search and Replace Interceptor
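  1. Search and Replace Interceptor overview

This interceptor provides simple string-based search and replace using regular expressions. It can modify the content of the event body.

1. 1. Search and Replace Interceptor properties

type – Type name; must be search_replace

searchPattern – Regular expression to search for and replace

replaceString – Replacement string

charset UTF-8 Character set of the event body; defaults to UTF-8

1. 1. Example

Configuration file

# Name the components of Agent a1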

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the Source

a1.sources.r1.type = http

a1.sources.r1.port = 44444

a1.sources.r1.interceptors = i1 i2 i3 i4 i5

a1.sources.r1.interceptors.i1.type = timestamp

# The host interceptor records the IP of the machine it runs on

a1.sources.r1.interceptors.i2.type = host

a1.sources.r1.interceptors.i3.type = static

a1.sources.r1.interceptors.i3.key = country

a1.sources.r1.interceptors.i3.value = China

a1.sources.r1.interceptors.i4.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder

a1.sources.r1.interceptors.i5.type = search_replace

# Replace every digit with *

a1.sources.r1.interceptors.i5.searchPattern = [0-9]

a1.sources.r1.interceptors.i5.replaceString = *

# Describe the Sink

a1.sinks.k1.type = logger

# Describe the memory Channel

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the Source and Sink to the Channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Start Flume

./flume-ng agent --conf ../conf --conf-file ../conf/flume4.conf --name a1 -Dflume.root.logger=INFO,console
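The effect of interceptor i5 on an event body can be approximated with Python's `re` module. This is a sketch, assuming the event is a dict and using a hypothetical function name; Flume itself applies Java regular expressions, whose replacement syntax for back-references differs from Python's.

```python
import re


def search_replace(event, search_pattern, replace_string, charset="utf-8"):
    """Hypothetical model: regex search-and-replace over the event body."""
    text = event["body"].decode(charset)
    event["body"] = re.sub(search_pattern, replace_string, text).encode(charset)
    return event


# Same pattern and replacement as the configuration: every digit becomes *.
event = {"headers": {}, "body": b"order 12345 shipped"}
search_replace(event, "[0-9]", "*")
print(event["body"])
```

Only the body changes; the headers added by the earlier interceptors in the chain are left as they are.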

1. Regex Filtering Interceptor
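  1. Regex Filtering Interceptor overview

This interceptor filters events by interpreting the event body as text and matching it against a configured regular expression. The regular expression can be used either to include or to exclude events.

1. 1. Regex Filtering Interceptor properties

!type – Type name; must be regex_filter

regex ".*" Regular expression to match against events

excludeEvents false If true, matching events are excluded; if false, only matching events are included.

1. 1. Example

Configuration file

# Name the components of Agent a1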

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the Source

a1.sources.r1.type = http

a1.sources.r1.port = 44444

a1.sources.r1.interceptors = i1 i2 i3 i4 i5 i6

a1.sources.r1.interceptors.i1.type = timestamp

# The host interceptor records the IP of the machine it runs on

a1.sources.r1.interceptors.i2.type = host

a1.sources.r1.interceptors.i3.type = static

a1.sources.r1.interceptors.i3.key = country

a1.sources.r1.interceptors.i3.value = China

a1.sources.r1.interceptors.i4.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder

a1.sources.r1.interceptors.i5.type = search_replace

# Replace every digit with *

a1.sources.r1.interceptors.i5.searchPattern = [0-9]

a1.sources.r1.interceptors.i5.replaceString = *

a1.sources.r1.interceptors.i6.type = regex_filter

# Discard every event whose body starts with a

a1.sources.r1.interceptors.i6.regex = ^a.*

a1.sources.r1.interceptors.i6.excludeEvents = true

# Describe the Sink

a1.sinks.k1.type = logger

# Describe the memory Channel

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the Source and Sink to the Channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Start Flume

./flume-ng agent --conf ../conf --conf-file ../conf/flume4.conf --name a1 -Dflume.root.logger=INFO,console
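The include/exclude switch is easy to get backwards, so here is a small Python model of it. It is a sketch under assumptions: events are dicts, `regex_filter` is a hypothetical name, and the pattern is applied as a search over the decoded body.

```python
import re


def regex_filter(events, regex=".*", exclude_events=False, charset="utf-8"):
    """Hypothetical model: keep or drop events depending on a body match."""
    pattern = re.compile(regex)
    kept = []
    for event in events:
        matched = pattern.search(event["body"].decode(charset)) is not None
        # excludeEvents=false keeps matches; excludeEvents=true drops them.
        if matched != exclude_events:
            kept.append(event)
    return kept


events = [{"headers": {}, "body": b"apple"}, {"headers": {}, "body": b"banana"}]
# Same settings as interceptor i6: drop bodies that start with "a".
print([e["body"] for e in regex_filter(events, "^a.*", exclude_events=True)])
```

With excludeEvents left at its default of false, the same pattern would do the opposite and let only the bodies starting with "a" through.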

1. Regex Extractor Interceptor
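  1. Regex Extractor Interceptor overview

This interceptor matches events against the specified regular expression and adds the matched groups to the event as headers. It also supports pluggable serializers that format the matched groups before they are added as headers.

1. 1. Regex Extractor Interceptor properties

!type – Type name; must be regex_extractor

!regex – Regular expression to match against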

!serializers – Space-separated list of serializers for mapping matches to header names and serializing their values. (See example below) Flume provides built-in support for the following serializers: org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer

serializers.<s1>.type default Must be default (org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer), org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer, or the FQCN of a custom class that implements org.apache.flume.interceptor.RegexExtractorInterceptorSerializer

serializers.<s1>.name –

serializers.* – Serializer-specific properties

1. 1. Example

Covered later, in the project walkthrough.
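Until then, the group-to-header mapping can be illustrated with a short Python model. Everything here is an assumption made for illustration: the event dict, the function name `regex_extractor`, the sample body, the pattern and the header names. Only pass-through behaviour is modelled, where each matched group is stored unchanged.

```python
import re


def regex_extractor(event, regex, names, charset="utf-8"):
    """Hypothetical model: copy regex groups into headers, one name per group."""
    match = re.search(regex, event["body"].decode(charset))
    if match:
        # Group i (1-based) is stored under the i-th configured name.
        for name, value in zip(names, match.groups()):
            event["headers"][name] = value
    return event


event = {"headers": {}, "body": b"2024-01-01 ERROR disk full"}
regex_extractor(event, r"(\d{4}-\d{2}-\d{2}) (\w+)", ["date", "level"])
print(event["headers"])
```

If the pattern does not match, the event passes through with its headers unchanged.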

Processor
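1. Overview
  1. Overview

A Sink Group lets users combine multiple Sinks into a single entity.

A Flume Sink Processor can switch among the Sinks in a group to balance load across them, or fail over to another Sink when one fails.

sinks – Space-separated list of the Sinks in the group

processor.type default Type name; must be default, failover, or load_balance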

1. Default Sink Processor
  1. Default Sink Processor

The Default Sink Processor accepts only a single Sink.

Users are not required to create a processor (sink group) for a single Sink.

1. Failover Sink Processor
  1. Failover Sink Processor

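The Failover Sink Processor maintains a prioritized list of sinks, guaranteeing that events are processed as long as at least one sink is available.

Failover works by assigning a failed sink a cool-down period; once that period has elapsed, the sink is tried again.

Sinks can be given a priority; the larger the number, the higher the priority.

If a sink fails to send an event, the sink with the next highest priority tries to send it.

If no priority is specified, the order is determined by the order in which the sinks are configured: sinks configured earlier rank above those configured later.

To configure failover, set up a sink group processor and assign a priority to every sink.

Priorities must be unique.

The maxpenalty property can also be set to cap the back-off time of a failed sink.

1. 1. Properties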

sinks – Space-separated list of sinks that are participating in the group

processor.type default The component type name, needs to be failover

processor.priority.<sinkName> – Priority value. <sinkName> must be one of the sink instances associated with the current sink group. A Sink with a higher priority value is activated earlier; a larger absolute value indicates higher priority.

processor.maxpenalty 30000 The maximum backoff period for the failed Sink (in millis)

1. 1. Example

Configuration file on h1

# Name the Agent's components

a1.sources=r1

a1.sinks=k1 k2

a1.channels=c1

# Describe/configure the Source

a1.sources.r1.type=http

a1.sources.r1.port=44444

# Describe the Sinks and the failover sink group

a1.sinkgroups = g1

a1.sinkgroups.g1.sinks = k1 k2

a1.sinkgroups.g1.processor.type = failover

a1.sinkgroups.g1.processor.priority.k1=5

a1.sinkgroups.g1.processor.priority.k2=10

a1.sinks.k1.type =avro

a1.sinks.k1.hostname =hadoop02

a1.sinks.k1.port =44444

a1.sinks.k2.type =avro

a1.sinks.k2.hostname =hadoop03

a1.sinks.k2.port =44444

# Describe the memory Channel

a1.channels.c1.type=memory

a1.channels.c1.capacity=1000

a1.channels.c1.transactionCapacity=1000

# Bind the Source and Sinks to the Channel

a1.sources.r1.channels=c1

a1.sinks.k1.channel=c1

a1.sinks.k2.channel=c1

Configuration file on h2 and h3

# Name the Agent's components

a1.sources=r1

a1.sinks=k1

a1.channels=c1

# Describe/configure the Source

a1.sources.r1.type=avro

a1.sources.r1.bind = 0.0.0.0

a1.sources.r1.port=44444

# Describe the Sink

a1.sinks.k1.type=logger

# Describe the memory Channel

a1.channels.c1.type=memory

a1.channels.c1.capacity=1000

a1.channels.c1.transactionCapacity=1000

# Bind the Source and Sink to the Channel

a1.sources.r1.channels=c1

a1.sinks.k1.channel=c1

Start Flume

./flume-ng agent --conf ../conf --conf-file ../conf/flume5.conf --name a1 -Dflume.root.logger=INFO,console

Send data from h1

curl -XPOST -d '[{ "headers" :{"flag" : "c"},"body" : "idoall.org_body"}]' http://0.0.0.0:44444
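The selection rule behind this example (k2 has priority 10, k1 has priority 5) can be modelled as a toy Python class. This is a sketch under assumptions, not Flume's implementation: the class and method names are hypothetical, and the doubling cool-down that starts at one second is an illustrative choice, capped by a value that plays the role of maxpenalty.

```python
class FailoverSketch:
    """Hypothetical model of priority-based failover with a capped cool-down."""

    def __init__(self, priorities, max_penalty_ms=30000, base_penalty_ms=1000):
        self.priorities = dict(priorities)      # sink name -> priority
        self.max_penalty_ms = max_penalty_ms    # plays the role of maxpenalty
        self.base_penalty_ms = base_penalty_ms  # assumed starting cool-down
        self.failures = {}                      # sink name -> consecutive failures
        self.retry_at = {}                      # sink name -> earliest retry time

    def pick(self, now_ms):
        """Return the highest-priority sink that is not cooling down, or None."""
        ready = [s for s in self.priorities if self.retry_at.get(s, 0) <= now_ms]
        return max(ready, key=self.priorities.get) if ready else None

    def report_failure(self, sink, now_ms):
        """Put a failed sink into cool-down; repeated failures wait longer."""
        n = self.failures.get(sink, 0) + 1
        self.failures[sink] = n
        penalty = min(self.max_penalty_ms, self.base_penalty_ms * 2 ** (n - 1))
        self.retry_at[sink] = now_ms + penalty

    def report_success(self, sink):
        """A successful send clears the sink's failure history."""
        self.failures.pop(sink, None)
        self.retry_at.pop(sink, None)


processor = FailoverSketch({"k1": 5, "k2": 10})
print(processor.pick(now_ms=0))            # k2 is preferred while healthy
processor.report_failure("k2", now_ms=0)
print(processor.pick(now_ms=500))          # k1 takes over during the cool-down
```

In the real setup the same effect can be observed by stopping the agent on hadoop03: events sent to h1 should then show up in the logger output on hadoop02.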

1. Load balancing Sink Processor
  1. Load balancing Sink Processor

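The Load balancing Sink Processor provides the ability to balance load across multiple sinks.

It maintains an indexed list of active sinks.

It supports round-robin and random selection; round-robin is the default, and the choice is configurable.

A custom selection mechanism can also be supplied through a class that inherits from AbstractSinkSelector.

1. 1. Properties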

!processor.sinks – Space-separated list of sinks that are participating in the group

!processor.type default The component type name, needs to be load_balance

processor.backoff false Should failed sinks be backed off exponentially.

processor.selector round_robin Selection mechanism. Must be either round_robin, random or FQCN of custom class that inherits from AbstractSinkSelector

processor.selector.maxTimeOut 30000 Used by backoff selectors to limit exponential backoff (in milliseconds)

1. 1. Example

Configuration file on h1

# Name the Agent's components

a1.sources=r1

a1.sinks=k1 k2

a1.channels=c1

# Describe/configure the Source

a1.sources.r1.type=http

a1.sources.r1.port=44444

# Describe the Sinks and the load-balancing sink group

a1.sinkgroups = g1

a1.sinkgroups.g1.sinks = k1 k2

a1.sinkgroups.g1.processor.type = load_balance

a1.sinkgroups.g1.processor.selector = random

a1.sinks.k1.type =avro

a1.sinks.k1.hostname =hadoop02

a1.sinks.k1.port =44444

a1.sinks.k2.type =avro

a1.sinks.k2.hostname =hadoop03

a1.sinks.k2.port =44444

# Describe the memory Channel

a1.channels.c1.type=memory

a1.channels.c1.capacity=1000

a1.channels.c1.transactionCapacity=1000

# Bind the Source and Sinks to the Channel

a1.sources.r1.channels=c1

a1.sinks.k1.channel=c1

a1.sinks.k2.channel=c1

Configuration file on h2 and h3

# Name the Agent's components

a1.sources=r1

a1.sinks=k1

a1.channels=c1

# Describe/configure the Source

a1.sources.r1.type=avro

a1.sources.r1.bind = 0.0.0.0

a1.sources.r1.port=44444

# Describe the Sink

a1.sinks.k1.type=logger

# Describe the memory Channel

a1.channels.c1.type=memory

a1.channels.c1.capacity=1000

a1.channels.c1.transactionCapacity=1000

# Bind the Source and Sink to the Channel

a1.sources.r1.channels=c1

a1.sinks.k1.channel=c1

Start Flume

./flume-ng agent --conf ../conf --conf-file ../conf/flume6.conf --name a1 -Dflume.root.logger=INFO,console

Send data from h1

curl -XPOST -d '[{ "headers" :{"flag" : "c"},"body" : "idoall.org_body"}]' http://0.0.0.0:44444
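The difference between the round_robin and random selectors can be sketched in Python. This is a simplified model with hypothetical names; it tries the sinks in the selected order until one accepts the batch, and it leaves out back-off of failed sinks entirely.

```python
import random


class LoadBalanceSketch:
    """Hypothetical model of round-robin or random sink selection."""

    def __init__(self, sinks, selector="round_robin", seed=None):
        self.sinks = list(sinks)
        self.selector = selector
        self.next_index = 0
        self.rng = random.Random(seed)

    def order(self):
        """Return the order in which sinks are tried for the next batch."""
        if self.selector == "random":
            return self.rng.sample(self.sinks, len(self.sinks))
        start = self.next_index
        self.next_index = (start + 1) % len(self.sinks)
        return self.sinks[start:] + self.sinks[:start]

    def deliver(self, batch, send):
        """Try sinks in order; send(sink, batch) returns True on success."""
        for sink in self.order():
            if send(sink, batch):
                return sink
        raise RuntimeError("all sinks failed for this batch")


balancer = LoadBalanceSketch(["k1", "k2"])
always_ok = lambda sink, batch: True
print([balancer.deliver(n, always_ok) for n in range(4)])
```

With round-robin the batches alternate between the two sinks; with selector set to random, as in the configuration above, the split is only even on average.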

Channel
1. Memory Channel
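  1. Memory Channel overview

Memory Channel: events are stored in an in-memory queue with a configurable maximum size.

It is well suited to flows that need high throughput and can tolerate losing the staged data if the agent fails.

1. 1. Memory Channel properties

!type – Type name; must be memory

capacity 100 Maximum number of events stored in the channel

transactionCapacity 100 Maximum number of events per transaction

keep-alive 3 Timeout in seconds for adding or removing an event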

byteCapacityBufferPercentage 20 Defines the percent of buffer between byteCapacity and the estimated total size of all events in the channel, to account for data in headers. See below.

byteCapacity see description Maximum total bytes of memory allowed as a sum of all events in this channel. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line). Note that if you have multiple memory channels on a single JVM, and they happen to hold the same physical events (i.e. if you are using a replicating channel selector from a single source) then those event sizes may be double-counted for channel byteCapacity purposes. Setting this value to 0 will cause this value to fall back to a hard internal limit of about 200 GB.

1. 1. Example

See the getting-started example.
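How capacity and transactionCapacity limit a memory channel can be sketched with a bounded queue in Python. This is a simplified model with hypothetical names: it rejects a batch immediately when there is no room, whereas the keep-alive property above describes a timeout for add and remove operations.

```python
from collections import deque


class MemoryChannelSketch:
    """Hypothetical model of a bounded in-memory event queue."""

    def __init__(self, capacity=100, transaction_capacity=100):
        self.capacity = capacity
        self.transaction_capacity = transaction_capacity
        self.queue = deque()

    def put_batch(self, events):
        """Commit a batch atomically: every event is queued, or none is."""
        events = list(events)
        if len(events) > self.transaction_capacity:
            raise ValueError("batch exceeds transactionCapacity")
        if len(self.queue) + len(events) > self.capacity:
            raise OverflowError("channel is full; the source should retry later")
        self.queue.extend(events)

    def take_batch(self, max_events):
        """Remove and return up to max_events events for a sink."""
        n = min(max_events, self.transaction_capacity, len(self.queue))
        return [self.queue.popleft() for _ in range(n)]


channel = MemoryChannelSketch(capacity=1000, transaction_capacity=100)
channel.put_batch(["event-%d" % i for i in range(100)])
print(len(channel.take_batch(10)), len(channel.queue))
```

Nothing in the queue survives a process restart, which is the data-loss trade-off mentioned in the overview.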

1. JDBC Channel
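  1. JDBC Channel overview

Events are persisted in a reliable database. Currently the embedded Derby database is supported. Use this channel when recoverability is very important.

1. File Channel
  1. File Channel overview

Throughput is lower than with the Memory Channel, but data is not lost even if the agent fails.

1. 1. File Channel properties

!type – Type name; must be file

checkpointDir ~/.flume/file-channel/checkpoint Directory where the checkpoint file is stored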

useDualCheckpoints false Backup the checkpoint. If this is set to true, backupCheckpointDir must be set

backupCheckpointDir – The directory where the checkpoint is backed up to. This directory must not be the same as the data directories or the checkpoint directory

dataDirs ~/.flume/file-channel/data Comma-separated list of directories for storing log files. Using multiple directories on separate disks can improve file channel performance.

transactionCapacity 10000 The maximum size of transaction supported by the channel

checkpointInterval 30000 Amount of time (in millis) between checkpoints

maxFileSize 2146435071 Maximum size (in bytes) of a single log file

minimumRequiredSpace 524288000 Minimum Required free space (in bytes). To avoid data corruption, File Channel stops accepting take/put requests when free space drops below this value

capacity 1000000 Maximum capacity of the channel

keep-alive 3 Amount of time (in sec) to wait for a put operation

use-log-replay-v1 false Expert: Use old replay logic

use-fast-replay false Expert: Replay without using queue

checkpointOnClose true Controls if a checkpoint is created when the channel is closed. Creating a checkpoint on close speeds up subsequent startup of the file channel by avoiding replay.

encryption.activeKey – Key name used to encrypt new data

encryption.cipherProvider – Cipher provider type, supported types: AESCTRNOPADDING

encryption.keyProvider – Key provider type, supported types: JCEKSFILE

encryption.keyProvider.keyStoreFile – Path to the keystore file

encryption.keyProvider.keyStorePasswordFile – Path to the keystore password file

encryption.keyProvider.keys – List of all keys (e.g. history of the activeKey setting)

encryption.keyProvider.keys.*.passwordFile – Path to the optional key password file

1. Spillable Memory Channel
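  1. Spillable Memory Channel overview

Spillable Memory Channel: a memory channel that overflows to disk.

Events are stored in an in-memory queue and on disk. The in-memory queue is the primary store and the disk holds the overflow.

The disk store is managed by an embedded File Channel. When the in-memory queue is full, further incoming events are stored in the file channel. This channel suits flows that want the high throughput of a memory channel during normal operation and the larger capacity of a file channel during peaks, trading some throughput for resilience.

If the Agent crashes, only the events stored on the file system can be recovered. This channel is experimental and not recommended for production use.

1. 1. Spillable Memory Channel properties

!type – Type name; must be SPILLABLEMEMORY

memoryCapacity 10000 Maximum number of events stored in the memory queue. Set to 0 to disable the memory queue.

overflowCapacity 100000000 Maximum number of events stored on disk. Set to 0 to disable disk overflow.

overflowTimeout 3 Number of seconds to wait before enabling disk overflow when memory fills up.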

byteCapacityBufferPercentage 20 Defines the percent of buffer between byteCapacity and the estimated total size of all events in the channel, to account for data in headers. See below.

byteCapacity see description Maximum bytes of memory allowed as a sum of all events in the memory queue. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line). Note that if you have multiple memory channels on a single JVM, and they happen to hold the same physical events (i.e. if you are using a replicating channel selector from a single source) then those event sizes may be double-counted for channel byteCapacity purposes. Setting this value to 0 will cause this value to fall back to a hard internal limit of about 200 GB.

avgEventSize 500 Estimated average size of events, in bytes, going into the channel

see file channel Any file channel property with the exception of ‘keep-alive’ and ‘capacity’ can be used. The keep-alive of file channel is managed by Spillable Memory Channel. Use ‘overflowCapacity’ to set the File channel’s capacity.
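The spill-over idea can be modelled with two queues in Python. This is a toy sketch with hypothetical names: a second in-memory deque stands in for the embedded File Channel, and the ordering rule shown is a simplification chosen to keep first-in, first-out order, not a description of Flume's internal bookkeeping.

```python
from collections import deque


class SpillableChannelSketch:
    """Hypothetical model: a bounded memory queue that overflows to 'disk'."""

    def __init__(self, memory_capacity=10000, overflow_capacity=100000000):
        self.memory_capacity = memory_capacity
        self.overflow_capacity = overflow_capacity
        self.memory = deque()
        self.overflow = deque()  # stands in for the embedded File Channel

    def put(self, event):
        # Use memory only while nothing is spilled, so FIFO order is preserved.
        if not self.overflow and len(self.memory) < self.memory_capacity:
            self.memory.append(event)
        elif len(self.overflow) < self.overflow_capacity:
            self.overflow.append(event)
        else:
            raise OverflowError("memory queue and overflow are both full")

    def take(self):
        # In this model, events in memory are always older than spilled ones.
        if self.memory:
            return self.memory.popleft()
        if self.overflow:
            return self.overflow.popleft()
        return None

    def crash(self):
        """Simulate an agent crash: memory is lost, spilled events survive."""
        self.memory.clear()
        return list(self.overflow)


channel = SpillableChannelSketch(memory_capacity=2, overflow_capacity=10)
for name in ["a", "b", "c"]:
    channel.put(name)
print(list(channel.memory), list(channel.overflow))
```

Setting memory_capacity to 0 in this model sends every event to the overflow queue, and setting overflow_capacity to 0 leaves a plain memory queue, mirroring the two "set to 0" notes in the property list.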

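1. Custom Channel
  1. Custom Channel overview

A custom channel is your own implementation of the Channel interface.

The custom Channel class and its dependencies must be placed on the agent's classpath before Flume starts.

1. 1. Custom Channel properties

type – Fully qualified class name (FQCN) of your Channel implementation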
