奔跑的乌班

Flume Study Notes

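1.Flume Overview

1.1.What Flume Is

1.1.1.What Flume Is

Flume is a distributed, reliable, and highly available system for efficiently collecting, aggregating, and moving large volumes of log data from many different sources into a centralized data store.
Flume is currently a top-level Apache project.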

1.1.2.System Requirements

Flume needs a Java runtime environment: Java 1.6 or later is required, and Java 1.7 is recommended.

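1.2.Downloading and Installing Flume

1.2.1.Downloading Flume:

The Flume package can be downloaded from the Apache website.
Note that Flume comes in two lines, 0.9.x and 1.x, which are not compatible with each other. These notes cover the newer 1.x line, also known as flume-ng.

1.2.2.Installing Flume:

Unpack the downloaded package into the directory of your choice.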
tar -zxvf apache-flume-1.6.0-bin.tar.gz

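2.Flume Concepts, Model, and Features

2.1.Key Flume Concepts

2.1.1.Flume Event:

A Flume event is defined as a unit of data flow that has a byte payload (the body) and an optional set of string attributes (the headers). When events are sent through the HTTP source, for example, each one is written as a JSON object with a headers part and a body part.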
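As a rough mental model (a Python sketch, not Flume's Java API), an event can be pictured as a byte-array body plus an optional string-to-string header map. The class name Event and the sample header value below are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Illustrative stand-in for a Flume event: raw bytes plus string headers."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


# The body is an opaque byte payload; the headers are optional metadata.
event = Event(body="hello flume".encode("utf-8"), headers={"host": "web01"})
print(event.headers["host"], len(event.body))
```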

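2.1.2.Flume Agent:

A Flume agent is a process that hosts the components through which events flow from an external source to the next destination. It contains sources, channels, and sinks.

2.1.3.Source

The data source. It consumes events delivered to it by an external source, which sends events to the Flume source in a format that the source recognizes.

2.1.4.Channel

The data channel. It is a passive store that holds events until they are consumed by a Flume sink.

2.1.5.Sink

The data destination. It represents where the data ends up outside the agent and sends Flume events to the configured external target.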

2.2.Flume Data Flow Model

2.2.1.Flume Data Flow Model

Figure 1

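2.3.Flume Features

2.3.1.Complex Flows

Flume lets users build multi-hop flows in which events travel through several agents before reaching their final destination. It also supports fan-out flows (one to many), fan-in flows (many to one), failover, and failure handling.

2.3.2.Reliability

Events are delivered transactionally, which is what makes delivery reliable. (Flume processes events one batch at a time; if any event in a batch fails, Flume processes the whole batch again.)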
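The all-or-nothing batch behaviour can be pictured with a small Python model. This is only a sketch of the idea, not Flume's transaction API: the function name drain_batch, the deque used as a channel, and the flaky delivery function are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Deque, List


def drain_batch(channel: Deque[bytes], deliver: Callable[[bytes], None],
                batch_size: int) -> bool:
    """Take up to batch_size events and deliver them as one all-or-nothing unit.

    Returns True if the batch was committed, False if it was rolled back.
    """
    batch: List[bytes] = []
    while channel and len(batch) < batch_size:
        batch.append(channel.popleft())
    try:
        for event in batch:
            deliver(event)
    except Exception:
        # Roll back: put the whole batch back at the head, in original order.
        channel.extendleft(reversed(batch))
        return False
    return True  # commit: the events are gone from the channel


channel = deque([b"e1", b"e2", b"e3"])
delivered = []


def flaky(event: bytes) -> None:
    """Hypothetical downstream that fails on the second event."""
    if event == b"e2":
        raise IOError("downstream unavailable")
    delivered.append(event)


committed = drain_batch(channel, flaky, batch_size=3)
print(committed, list(channel))
```

In this model the first event has already reached the downstream when the batch fails, so a retry delivers it a second time. Reprocessing a whole batch trades possible duplicates for not losing events.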

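2.3.3.Recoverability

A channel can be backed by memory or by files. A memory channel is faster but its contents cannot be recovered after a failure; a file channel is slower but recoverable.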

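3.Getting Started Example

3.1.Writing the Configuration File

3.1.1.Writing the Configuration File

An agent is configured through a configuration file.
# example.conf: a single-node Flume configuration
# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Describe the in-memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Notes:
(1) One configuration file can define several agents, and one agent can contain several sources, sinks, and channels.
(2) A source can be bound to several channels, but a sink can be bound to only one channel.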

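3.2.Starting the Agent with the Flume Launcher

3.2.1.Starting the Agent with the Flume Launcher

From Flume's bin directory:
$ ./flume-ng agent --conf ../conf --conf-file ../conf/example.conf --name a1 -Dflume.root.logger=INFO,console

3.3.Sending Data

3.3.1.Sending Data

From Windows, use telnet to connect to port 44444 on the machine running Flume and send some data.
Flume does collect what is sent.
telnet ip port
# Press Ctrl+] and then Enter; whatever you type afterwards shows up on the Flume console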
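If telnet is not available, the same test can be done with a few lines of Python. This is a minimal sketch using only the standard socket module; the helper name send_line is made up here, and it assumes the agent from example.conf is reachable at the host and port you pass in.

```python
import socket


def send_line(host: str, port: int, text: str, timeout: float = 5.0) -> str:
    """Send one newline-terminated line and return the server's reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(text.encode("utf-8") + b"\n")
        return conn.recv(1024).decode("utf-8").strip()


# Example call, assuming the agent listens locally on port 44444:
# print(send_line("127.0.0.1", 44444, "hello flume"))
```

The netcat source acknowledges each event with OK by default (see ack-every-event in the NetCat Source properties), which is the reply this helper returns.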

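4.Sources in Detail

4.1.Avro Source

4.1.1. Avro Source Overview

Listens on an Avro port and receives event streams from external Avro clients.
The Avro source is what makes multi-hop, fan-out, and fan-in flows possible.
It can also receive log data sent by the Avro client that ships with Flume.

4.1.2. Avro Source Properties

Properties marked with ! are required. Each line gives the property name, its default value (– means no default), and a description.
!channels –
!type – The component type name; must be avro
!bind – Hostname or IP address to listen on
!port – Port to listen on
threads – Maximum number of worker threads
selector.type
selector.*
interceptors – Space-separated list of interceptors
interceptors.*
compression-type none Compression type, either "none" or "deflate". It must match the compression type used by the sending Avro sink or client.
ssl false Whether to enable SSL encryption. If enabled, a "keystore" and a "keystore-password" must also be configured.
keystore – Path to the Java keystore file used for SSL.
keystore-password – Password for the Java keystore used for SSL.
keystore-type JKS Keystore type, either "JKS" or "PKCS12".
exclude-protocols SSLv3 Space-separated list of SSL/TLS protocols to exclude. SSLv3 is always excluded, in addition to the protocols specified.
ipFilter false Set to true to enable IP filtering for netty
ipFilterRules – Expression defining the netty IP filter rules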

4.1.3. Example

Configuration file:
# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 33333

# Describe the sink
a1.sinks.k1.type = logger
# Describe the in-memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start Flume:
./flume-ng agent --conf ../conf --conf-file ../conf/template2.conf --name a1 -Dflume.root.logger=INFO,console

Use the Avro client that ships with Flume to send a log file to the given host and port:
./flume-ng avro-client --conf ../conf --host 0.0.0.0 --port 33333 --filename ../mydata/log1.txt

The log lines are collected by the agent.

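4.2.Exec Source

4.2.1. Exec Source Overview

Uses the output produced by a command as the event source.

4.2.2. Exec Source Properties

!channels –
!type – The component type name; must be exec
!command – The command to execute
shell – A shell invocation used to run the command, e.g. /bin/sh -c. Required only for commands relying on shell features like wildcards, back ticks, pipes etc.
restartThrottle 10000 Time in milliseconds to wait before attempting to restart the command
restart false Whether to restart the command if it dies
logStdErr false Whether the command's stderr should be logged
batchSize 20 Maximum number of lines to read and send to the channel at a time
batchTimeout 3000 Time in milliseconds to wait before pushing data downstream if the buffer has not filled up
selector.type replicating replicating or multiplexing
selector.* Depends on the selector.type value
interceptors – Space-separated list of interceptors
interceptors.*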

4.2.3. Example

Configuration file:
# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = ping 192.168.242.102

# Describe the sink
a1.sinks.k1.type = logger
# Describe the in-memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start Flume:
./flume-ng agent --conf ../conf --conf-file ../conf/template2.conf --name a1 -Dflume.root.logger=INFO,console

With a tail command as the source command, Flume can collect the lines that are subsequently appended to a log file.

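4.3.Spooling Directory Source

4.3.1. Spooling Directory Source Overview

This source lets you ingest data by placing files into a "spooling" directory. The source watches the directory and parses events out of new files as they appear. The event parsing logic is pluggable. Once a file has been fully read into the channel, it is renamed or, optionally, deleted.
Note that files placed in the spooling directory must not be modified afterwards; if one is, Flume reports an error. File names must not be reused either: if a file with a name that was used before is placed in the directory, Flume reports an error.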
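One way to respect both rules is to finish writing a file somewhere else and then move it into the spooling directory under a name that is never reused. The Python sketch below illustrates that idea; the helper name drop_into_spool and the timestamp-prefix naming scheme are assumptions, not something Flume prescribes.

```python
import os
import tempfile
import time


def drop_into_spool(spool_dir: str, name: str, data: bytes) -> str:
    """Write data outside the spool directory, then move it in atomically."""
    # A nanosecond timestamp prefix keeps names unique (illustrative scheme).
    final_path = os.path.join(spool_dir, f"{time.time_ns()}-{name}")
    # Stage next to the spool directory so the rename stays on one filesystem.
    staging_dir = os.path.dirname(os.path.abspath(spool_dir))
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)
    # The file appears in the spool directory only once it is complete.
    os.rename(tmp_path, final_path)
    return final_path
```

Because the file is complete before it becomes visible in the directory, the source never sees it change after pickup.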

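4.3.2. Spooling Directory Source Properties

!channels –
!type – The component type name; must be spooldir
!spoolDir – The directory from which to read files, i.e. the spooling directory
fileSuffix .COMPLETED Suffix appended to completely ingested files
deletePolicy never When to delete completed files: never or immediate
fileHeader false Whether to add a header storing the absolute path of the file.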
fileHeaderKey file Header key to use when appending absolute path filename to event header.
basenameHeader false Whether to add a header storing the basename of the file.
basenameHeaderKey basename Header Key to use when appending basename of file to event header.
ignorePattern ^$ Regular expression specifying which files to ignore
trackerDir .flumespool Directory to store metadata related to processing of files. If this path is not an absolute path, then it is interpreted as relative to the spoolDir.
consumeOrder oldest Order in which files in the spooling directory are consumed: oldest, youngest, or random.
maxBackoff 4000 The maximum time (in millis) to wait between consecutive attempts to write to the channel(s) if the channel is full. The source will start at a low backoff and increase it exponentially each time the channel throws a ChannelException, upto the value specified by this parameter.
batchSize 100 Granularity at which to batch transfer to the channel
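inputCharset UTF-8 Character set used when reading the input files.
decodeErrorPolicy FAIL What to do when an undecodable character is found in the input file. FAIL: throw an exception and fail to parse the file. REPLACE: replace the unparseable character with the "replacement character", typically Unicode U+FFFD. IGNORE: drop the unparseable character sequence.
deserializer LINE The deserializer used to parse the file into events. By default each line becomes one event. The class specified must implement EventDeserializer.Builder.
deserializer.* Varies per event deserializer.
bufferMaxLines – (Obsolete) This option is now ignored.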
bufferMaxLineLength 5000 (Deprecated) Maximum length of a line in the commit buffer. Use deserializer.maxLineLength instead.
selector.type replicating replicating or multiplexing
selector.* Depends on the selector.type value
interceptors – Space-separated list of interceptors
interceptors.*

4.3.3. Example

Configuration file:
# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/park/work/apache-flume-1.6.0-bin/mydata

# Describe the sink
a1.sinks.k1.type = logger
# Describe the in-memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start Flume:
./flume-ng agent --conf ../conf --conf-file ../conf/template4.conf --name a1 -Dflume.root.logger=INFO,console

Copy a file into the configured directory. Flume picks the file up and treats each line in it as one log event.

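4.4.NetCat Source

4.4.1. NetCat Source Overview

A NetCat source listens on a given port and turns each line of data it receives into an event.

4.4.2. NetCat Source Properties

!channels –
!type – The component type name; must be netcat
!bind – Hostname or IP address to bind to
!port – Port number to bind to
max-line-length 512 Maximum line length per event body, in bytes
ack-every-event true Whether to respond with "OK" for every event received
selector.type
selector.*
interceptors –
interceptors.*

4.4.3. Example

See the getting started example.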

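4.5.Sequence Generator Source

4.5.1. Sequence Generator Source Overview

A simple sequence generator that continuously produces events whose value is a counter starting at 0 and incremented by 1 each time.
It is mainly used for testing.

4.5.2. Sequence Generator Source Properties

!channels –
!type – The component type name; must be seq
selector.type
selector.*
interceptors –
interceptors.*
batchSize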

4.5.3. Example

Configuration file:
# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = seq

# Describe the sink
a1.sinks.k1.type = logger
# Describe the in-memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start Flume:
./flume-ng agent --conf ../conf --conf-file ../conf/template4.conf --name a1 -Dflume.root.logger=INFO,console

The generated events are printed to the log.

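4.6.HTTP Source

4.6.1. HTTP Source Overview

The HTTP source accepts HTTP GET and POST requests as Flume events. GET should be used for experimentation only.
The source needs a pluggable "handler" that converts requests into events. The handler must implement the HTTPSourceHandler interface; it takes an HttpServletRequest and returns a list of Flume events.
All events obtained from one HTTP request are committed to the channel in a single transaction, which improves efficiency on channels such as the file channel.
If the handler throws an exception, the source returns HTTP status 400.
If the channel is full and no more events can be added to it, the source returns HTTP status 503 (temporarily unavailable).

4.6.2. HTTP Source Properties

!type The component type name; must be http
!port – Port to listen on
bind 0.0.0.0 Hostname or IP address to listen on
handler org.apache.flume.source.http.JSONHandler The handler class; it must implement the HTTPSourceHandler interface
handler.* – Configuration parameters for the handler
selector.type
selector.*
interceptors –
interceptors.*
enableSSL false Set to true to enable SSL. Note that the HTTP source does not support SSLv3.
excludeProtocols SSLv3 Space-separated list of SSL/TLS protocols to exclude. SSLv3 is always excluded.
keystore Location of the keystore file.
keystorePassword Password for the keystore.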

4.6.3. Example

Configuration file:
# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = http
a1.sources.r1.port = 6666

# Describe the sink
a1.sinks.k1.type = logger
# Describe the in-memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start Flume:
./flume-ng agent --conf ../conf --conf-file ../conf/template6.conf --name a1 -Dflume.root.logger=INFO,console

Send an HTTP request to that port:
curl -X POST -d '[{ "headers" :{"a" : "a1","b" : "b1"},"body" : "hello_httpflume~"}]' http://0.0.0.0:6666
Flume collects the event.

4.6.4. Commonly Used Handlers

JSONHandler
Handles events in JSON format and supports the UTF-8, UTF-16, and UTF-32 character sets. The handler accepts an array of events and converts them to Flume events using the encoding specified in the request.
If no encoding is specified, UTF-8 is assumed.
The JSON format is as follows:
[{
  "headers" : {
    "timestamp" : "434324343",
    "host" : "random_host.example.com"
  },
  "body" : "random_body"
},
{
  "headers" : {
    "namenode" : "namenode.example.com",
    "datanode" : "random_datanode.example.com"
  },
  "body" : "really_random_body"
}]
To set the charset, the request must have its content type specified as application/json;charset=UTF-8 (replace UTF-8 with UTF-16 or UTF-32 as required).
One way to create an event in the format expected by this handler is to use JSONEvent provided in the Flume SDK and use Google Gson to create the JSON string using the Gson#toJson(Object, Type) method. The type token to pass as the second argument for a list of events can be created with:
Type type = new TypeToken<List<JSONEvent>>() {}.getType();
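For a client that is not written in Java, the same payload can be put together by hand. The following Python sketch builds a POST request in the JSONHandler array format using only the standard library; the helper name build_request and the URL are assumptions for illustration, and the request is only constructed here, not sent.

```python
import json
import urllib.request
from typing import Dict, List, Tuple


def build_request(url: str,
                  events: List[Tuple[Dict[str, str], str]]) -> urllib.request.Request:
    """Build a POST request carrying events in the JSONHandler array format."""
    payload = [{"headers": headers, "body": body} for headers, body in events]
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json;charset=UTF-8"},
        method="POST",
    )


request = build_request(
    "http://127.0.0.1:6666",  # assumed address of a running HTTP source
    [({"a": "a1", "b": "b1"}, "hello_httpflume~")],
)
# To actually send it: urllib.request.urlopen(request)
print(request.data.decode("utf-8"))
```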

BlobHandler
BlobHandler is a handler that turns the content of a file uploaded in a request into an event.
Properties (those marked with ! are required):
!handler – The FQCN of this class: org.apache.flume.sink.solr.morphline.BlobHandler
handler.maxBlobLength 100000000 The maximum number of bytes to read and buffer for a given request

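4.7.Custom Source

4.7.1. Custom Source Overview

A custom source is your own implementation of the Source interface. The custom source's class and its dependencies must be on Flume's classpath when the agent starts.

4.7.2. Custom Source Properties

!channels –
!type – The component type name; must be the fully qualified class name of your custom source
selector.type
selector.*
interceptors –
interceptors.*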

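5.Sinks in Detail

5.1.Logger Sink

5.1.1.Logger Sink Overview

Logs events at INFO level. Typically used for debugging.

5.1.2.Logger Sink Properties

!channel –
!type – The component type name, needs to be logger
maxBytesToLog 16 Maximum number of bytes of the Event body to log

The directory given by --conf must contain a log4j configuration file.
The log4j settings can also be supplied on the command line at startup with -Dflume.root.logger=INFO,console.

5.1.3.Example

See the getting started example.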

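5.2.File Roll Sink

5.2.1.File Roll Sink Overview

Stores events on the local file system, starting a new file at a fixed interval so that each file holds the events collected during that interval.

5.2.2.File Roll Sink Properties

!channel –
!type – The component type name; must be file_roll
!sink.directory – The directory where files are stored
sink.rollInterval 30 Roll the file every 30 seconds, i.e. the events collected in each 30-second window go into a separate file. Setting it to 0 disables rolling, so all events are written to a single file.
sink.serializer TEXT Other possible options include avro_event or the FQCN of an implementation of EventSerializer.Builder interface.
batchSize 100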

5.2.3.Example

Configuration file:
# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = http
a1.sources.r1.port = 6666

# Describe the sink
a1.sinks.k1.type = file_roll
# The property name is sink.directory; an earlier version of this file used a1.sinks.k1.dirctory, which produced no output
a1.sinks.k1.sink.directory = /home/park/work/apache-flume-1.6.0-bin/mysink

# Describe the in-memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start Flume:
./flume-ng agent --conf ../conf --conf-file ../conf/template7.conf --name a1 -Dflume.root.logger=INFO,console

5.3.Avro Sink

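5.3.1.Avro Sink Overview

The Avro sink is the basis for building multi-hop flows, fan-out flows (one to many), and fan-in flows (many to one).

5.3.2.Avro Sink Properties

!channel –
!type – The component type name, needs to be avro.
!hostname – The hostname or IP address of the next-hop Avro source to connect to.
!port – The port of the next-hop Avro source to connect to.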
batch-size 100 number of event to batch together for send.
connect-timeout 20000 Amount of time (ms) to allow for the first (handshake) request.
request-timeout 20000 Amount of time (ms) to allow for requests after the first.
reset-connection-interval none Amount of time (s) before the connection to the next hop is reset. This will force the Avro Sink to reconnect to the next hop. This will allow the sink to connect to hosts behind a hardware load-balancer when new hosts are added without having to restart the agent.
compression-type none This can be “none” or “deflate”. The compression-type must match the compression-type of matching AvroSource
compression-level 6 The level of compression to compress event. 0 = no compression and 1-9 is compression. The higher the number the more compression
ssl false Set to true to enable SSL for this AvroSink. When configuring SSL, you can optionally set a “truststore”, “truststore-password”, “truststore-type”, and specify whether to “trust-all-certs”.
trust-all-certs false If this is set to true, SSL server certificates for remote servers (Avro Sources) will not be checked. This should NOT be used in production because it makes it easier for an attacker to execute a man-in-the-middle attack and “listen in” on the encrypted connection.
truststore – The path to a custom Java truststore file. Flume uses the certificate authority information in this file to determine whether the remote Avro Source’s SSL authentication credentials should be trusted. If not specified, the default Java JSSE certificate authority files (typically “jssecacerts” or “cacerts” in the Oracle JRE) will be used.
truststore-password – The password for the specified truststore.
truststore-type JKS The type of the Java truststore. This can be “JKS” or other supported Java truststore type.
exclude-protocols SSLv3 Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified.
maxIoWorkers 2 * the number of available processors in the machine The maximum number of I/O w

5.3.3.Example 1: Multi-hop Flow

We use three machines, h1, h2, and h3, for this experiment:
h3:
Configuration file:
# Name the agent's components
a1.sources=r1
a1.sinks=k1
a1.channels=c1

# Describe/configure the source
a1.sources.r1.type=avro
a1.sources.r1.bind=0.0.0.0
a1.sources.r1.port=9988

# Describe the sink
a1.sinks.k1.type=logger

# Describe the in-memory channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=1000

# Bind the source and the sink to the channel
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1

Start Flume:

./flume-ng agent --conf ../conf --conf-file ../conf/template8.conf --name a1 -Dflume.root.logger=INFO,console

h2:
Configuration file:
# Name the agent's components
a1.sources=r1
a1.sinks=k1
a1.channels=c1

# Describe/configure the source
a1.sources.r1.type=avro
a1.sources.r1.bind=0.0.0.0
a1.sources.r1.port=9988

# Describe the sink
a1.sinks.k1.type=avro
a1.sinks.k1.hostname=192.168.242.139
a1.sinks.k1.port=9988

# Describe the in-memory channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=1000

# Bind the source and the sink to the channel
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1

Start Flume:
./flume-ng agent --conf ../conf --conf-file ../conf/template8.conf --name a1 -Dflume.root.logger=INFO,console

h1:
Configuration file:
# Name the agent's components
a1.sources=r1
a1.sinks=k1
a1.channels=c1

# Describe/configure the source
a1.sources.r1.type=http
a1.sources.r1.port=8888

# Describe the sink
a1.sinks.k1.type=avro
a1.sinks.k1.hostname=192.168.242.138
a1.sinks.k1.port=9988

# Describe the in-memory channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=1000

# Bind the source and the sink to the channel
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1

Start Flume:

./flume-ng agent --conf ../conf --conf-file ../conf/template8.conf --name a1 -Dflume.root.logger=INFO,console

Send an HTTP request to h1:
curl -X POST -d '[{ "headers" :{"a" : "a1","b" : "b1"},"body" : "hello_httpflume~"}]' http://192.168.242.133:8888

After a few seconds the message arrives at h3, the last hop in the chain (h1 forwards to h2, and h2 forwards to h3).

5.3.4.Example 2: Fan-out Flow, Replicating

h2 and h3:
Configuration file:
# Name the agent's components
a1.sources=r1
a1.sinks=k1
a1.channels=c1

# Describe/configure the source
a1.sources.r1.type=avro
a1.sources.r1.bind=0.0.0.0
a1.sources.r1.port=9988

# Describe the sink
a1.sinks.k1.type=logger

# Describe the in-memory channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=1000

# Bind the source and the sink to the channel
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1

Start Flume:
./flume-ng agent --conf ../conf --conf-file ../conf/template8.conf --name a1 -Dflume.root.logger=INFO,console

h1:
Configuration file:
# Name the agent's components
a1.sources=r1
a1.sinks=k1 k2
a1.channels=c1 c2

# Describe/configure the source
a1.sources.r1.type=http
a1.sources.r1.port=8888

# Describe the sinks
a1.sinks.k1.type=avro
a1.sinks.k1.hostname=192.168.242.138
a1.sinks.k1.port=9988
a1.sinks.k2.type=avro
a1.sinks.k2.hostname=192.168.242.135
a1.sinks.k2.port=9988

# Describe the in-memory channels
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=1000
a1.channels.c2.type=memory
a1.channels.c2.capacity=1000
a1.channels.c2.transactionCapacity=1000

# Bind the source and the sinks to the channels
a1.sources.r1.channels=c1 c2
a1.sinks.k1.channel=c1
a1.sinks.k2.channel=c2

Start Flume:
./flume-ng agent --conf ../conf --conf-file ../conf/template8.conf --name a1 -Dflume.root.logger=INFO,console

5.3.5.Example 3: Fan-out Flow, Multiplexing (Routing)

h2 and h3:
Configuration file:
# Name the agent's components
a1.sources=r1
a1.sinks=k1
a1.channels=c1

# Describe/configure the source
a1.sources.r1.type=avro
a1.sources.r1.bind=0.0.0.0
a1.sources.r1.port=9988

# Describe the sink
a1.sinks.k1.type=logger

# Describe the in-memory channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=1000

# Bind the source and the sink to the channel
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
Start Flume:
./flume-ng agent --conf ../conf --conf-file ../conf/template8.conf --name a1 -Dflume.root.logger=INFO,console

h1:
Configuration file:
# Name the agent's components
a1.sources=r1
a1.sinks=k1 k2
a1.channels=c1 c2

# Describe/configure the source
a1.sources.r1.type=http
a1.sources.r1.port=8888
a1.sources.r1.selector.type=multiplexing
a1.sources.r1.selector.header=gender
a1.sources.r1.selector.mapping.male=c1
a1.sources.r1.selector.mapping.female=c2
a1.sources.r1.selector.default=c1

# Describe the sinks
a1.sinks.k1.type=avro
a1.sinks.k1.hostname=192.168.242.138
a1.sinks.k1.port=9988
a1.sinks.k2.type=avro
a1.sinks.k2.hostname=192.168.242.135
a1.sinks.k2.port=9988

# Describe the in-memory channels
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=1000
a1.channels.c2.type=memory
a1.channels.c2.capacity=1000
a1.channels.c2.transactionCapacity=1000

# Bind the source and the sinks to the channels
a1.sources.r1.channels=c1 c2
a1.sinks.k1.channel=c1
a1.sinks.k2.channel=c2
Start Flume:
./flume-ng agent --conf ../conf --conf-file ../conf/template8.conf --name a1 -Dflume.root.logger=INFO,console

Send HTTP requests to test it. Events are routed to different agents according to the value of their gender header.

5.3.6.Example 4: Fan-in Flow

m3:
Configuration file:
# Name the agent's components
a1.sources=r1
a1.sinks=k1
a1.channels=c1

# Describe/configure the source
a1.sources.r1.type=avro
a1.sources.r1.bind=0.0.0.0
a1.sources.r1.port=4141

# Describe the sink
a1.sinks.k1.type=logger

# Describe the in-memory channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=1000

# Bind the source and the sink to the channel
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
Start Flume:
./flume-ng agent --conf ../conf --conf-file ../conf/template.conf --name a1 -Dflume.root.logger=INFO,console

m1 and m2:
Configuration file:
# Name the agent's components
a1.sources=r1
a1.sinks=k1
a1.channels=c1

# Describe/configure the source
a1.sources.r1.type=http
a1.sources.r1.port=8888
# Describe the sink
a1.sinks.k1.type=avro
a1.sinks.k1.hostname=192.168.242.135
a1.sinks.k1.port=4141

# Describe the in-memory channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=1000
# Bind the source and the sink to the channel
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
Start Flume:
./flume-ng agent --conf ../conf --conf-file ../conf/template9.conf --name a1 -Dflume.root.logger=INFO,console

On m1, send an HTTP request with curl. Because the default handler is JSONHandler, the data must be in the JSON format it expects:
[root@localhost conf]# curl -XPOST -d '[{ "headers" :{"flag" : "c"},"body" : "idoall.org_body"}]' http://0.0.0.0:8888

On m2, send an HTTP request with curl in the same JSON format:
[root@localhost conf]# curl -XPOST -d '[{ "headers" :{"flag" : "c"},"body" : "idoall.org_body"}]' http://0.0.0.0:8888

m3 receives both messages correctly.

5.4.HDFS Sink

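5.4.1.HDFS Sink Overview

The HDFS sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text files and sequence files, with compression available for both. Files can be rolled based on elapsed time, data size, or number of events.
It also buckets/partitions data by attributes such as the timestamp or the machine where the event originated. The HDFS directory path may contain formatting escape sequences, which the HDFS sink replaces to generate the directory/file name used to store the events.
Using the HDFS sink requires Hadoop to be installed so that Flume can use the Hadoop jars to communicate with HDFS.
Note that the Hadoop version in use must support the sync() call.

5.4.2.HDFS Sink Properties

!channel –
!type – The component type name; must be hdfs
!hdfs.path – HDFS directory path (e.g. hdfs://namenode/flume/webdata/)
hdfs.filePrefix FlumeData Name prefix for files created by Flume in the HDFS directory
hdfs.fileSuffix – Suffix to append to the file name (e.g. .avro; note that the period is not added automatically)
hdfs.inUsePrefix – Prefix used for files that Flume is actively writing to
hdfs.inUseSuffix .tmp Suffix used for files that Flume is actively writing to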
hdfs.rollInterval 30 Number of seconds to wait before rolling current file (0 = never roll based on time interval)
hdfs.rollSize 1024 File size to trigger roll, in bytes (0: never roll based on file size)
hdfs.rollCount 10 Number of events written to file before it rolled (0 = never roll based on number of events)
hdfs.idleTimeout 0 Timeout after which inactive files get closed (0 = disable automatic closing of idle files)
hdfs.batchSize 100 number of events written to file before it is flushed to HDFS
hdfs.codeC – Compression codec. one of following : gzip, bzip2, lzo, lzop, snappy
hdfs.fileType SequenceFile File format: currently SequenceFile, DataStream or CompressedStream (1)DataStream will not compress output file and please don’t set codeC (2)CompressedStream requires set hdfs.codeC with an available codeC
hdfs.maxOpenFiles 5000 Allow only this number of open files. If this number is exceeded, the oldest file is closed.
hdfs.minBlockReplicas – Specify minimum number of replicas per HDFS block. If not specified, it comes from the default Hadoop config in the classpath.
hdfs.writeFormat – Format for sequence file records. One of “Text” or “Writable” (the default).
hdfs.callTimeout 10000 Number of milliseconds allowed for HDFS operations, such as open, write, flush, close. This number should be increased if many HDFS timeout operations are occurring.
hdfs.threadsPoolSize 10 Number of threads per HDFS sink for HDFS IO ops (open, write, etc.)
hdfs.rollTimerPoolSize 1 Number of threads per HDFS sink for scheduling timed file rolling
hdfs.kerberosPrincipal – Kerberos user principal for accessing secure HDFS
hdfs.kerberosKeytab – Kerberos keytab for accessing secure HDFS
hdfs.proxyUser
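hdfs.round false Whether the timestamp should be rounded down (if true, affects all time-based escape sequences except %t)
hdfs.roundValue 1 Rounded down to the highest multiple of this (in the unit configured using hdfs.roundUnit), less than current time.
hdfs.roundUnit second The unit of the round down value: second, minute or hour.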
hdfs.timeZone Local Time Name of the timezone that should be used for resolving the directory path, e.g. America/Los_Angeles.
hdfs.useLocalTimeStamp false Use the local time (instead of the timestamp from the event header) while replacing the escape sequences.
hdfs.closeTries 0 Number of times the sink must try renaming a file, after initiating a close attempt. If set to 1, this sink will not re-try a failed rename (due to, for example, NameNode or DataNode failure), and may leave the file in an open state with a .tmp extension. If set to 0, the sink will try to rename the file until the file is eventually renamed (there is no limit on the number of times it would try). The file may still remain open if the close call fails but the data will be intact and in this case, the file will be closed only after a Flume restart.
hdfs.retryInterval 180 Time in seconds between consecutive attempts to close a file. Each close call costs multiple RPC round-trips to the Namenode, so setting this too low can cause a lot of load on the name node. If set to 0 or less, the sink will not attempt to close the file if the first attempt fails, and may leave the file open or with a ”.tmp” extension.
serializer TEXT Other possible options include avro_event or the fully-qualified class name of an implementation of the EventSerializer.Builder interface.
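The effect of hdfs.round, hdfs.roundValue, and hdfs.roundUnit on a time-based path is easiest to see with numbers. The Python sketch below models the rounding described above; it is not Flume code, the function name round_down is made up, and the /flume/events/... path layout is only an assumed example.

```python
from datetime import datetime


def round_down(ts: datetime, round_value: int, round_unit: str) -> datetime:
    """Round ts down to a multiple of round_value in the given unit."""
    if round_unit == "second":
        return ts.replace(second=ts.second - ts.second % round_value,
                          microsecond=0)
    if round_unit == "minute":
        return ts.replace(minute=ts.minute - ts.minute % round_value,
                          second=0, microsecond=0)
    if round_unit == "hour":
        return ts.replace(hour=ts.hour - ts.hour % round_value,
                          minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit: {round_unit}")


# Models hdfs.round = true, hdfs.roundValue = 10, hdfs.roundUnit = minute
ts = datetime(2012, 6, 12, 11, 54, 34)
bucket = round_down(ts, 10, "minute")
print(bucket.strftime("/flume/events/%Y-%m-%d/%H%M/%S"))
# prints /flume/events/2012-06-12/1150/00
```

With these settings every event stamped between 11:50:00 and 11:59:59 lands in the same directory, which keeps the number of directories and small files down.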

5.4.3.Example

Configuration file:
# Name the agent's components
a1.sources=r1
a1.sinks=k1
a1.channels=c1

# Describe/configure the source
a1.sources.r1.type=http
a1.sources.r1.port=8888

# Describe the sink
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=hdfs://0.0.0.0:9000/ppp
# Write plain text; the default file type is SequenceFile
a1.sinks.k1.hdfs.fileType=DataStream

# Describe the in-memory channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=1000

# Bind the source and the sink to the channel
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1

Start Flume:
./flume-ng agent --conf ../conf --conf-file ../conf/template9.conf --name a1 -Dflume.root.logger=INFO,console

5.5.Hive Sink

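5.5.1.Hive Sink Overview

This sink streams events containing delimited text or JSON data directly into a Hive table or partition.
Events are written using Hive transactions. As soon as a set of events is committed to Hive, it becomes visible to Hive queries.
The partitions Flume writes to can either be created in advance or be created by Flume when they are missing.
Fields from the incoming event data are mapped to the corresponding columns of the Hive table.
This is a preview feature and is not recommended for production use.

5.5.2.Hive Sink Properties

!channel –
!type – The component type name; must be hive
!hive.metastore – Hive metastore URI (e.g. thrift://a.b.com:9083)
!hive.database – Hive database name
!hive.table – Hive table name
hive.partition – Comma-separated list of partition values identifying the partition to write to. May contain escape sequences.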
hive.txnsPerBatchAsk 100 Hive grants a batch of transactions instead of single transactions to streaming clients like Flume. This setting configures the number of desired transactions per Transaction Batch. Data from all transactions in a single batch end up in a single file. Flume will write a maximum of batchSize events in each transaction in the batch. This setting in conjunction with batchSize provides control over the size of each file. Note that eventually Hive will transparently compact these files into larger files.
heartBeatInterval 240 (In seconds) Interval between consecutive heartbeats sent to Hive to keep unused transactions from expiring. Set this value to 0 to disable heartbeats.
autoCreatePartitions true Flume will automatically create the necessary Hive partitions to stream to
batchSize 15000 Max number of events written to Hive in a single Hive transaction
maxOpenConnections 500 Allow only this number of open connections. If this number is exceeded, the least recently used connection is closed.
callTimeout 10000 (In milliseconds) Timeout for Hive & HDFS I/O operations, such as openTxn, write, commit, abort.
serializer Serializer is responsible for parsing out field from the event and mapping them to columns in the hive table. Choice of serializer depends upon the format of the data in the event. Supported serializers: DELIMITED and JSON
roundUnit minute The unit of the round down value - second, minute or hour.
roundValue 1 Rounded down to the highest multiple of this (in the unit configured using hive.roundUnit), less than current time
timeZone Local Time Name of the timezone that should be used for resolving the escape sequences in partition, e.g. America/Los_Angeles.
useLocalTimeStamp false Use the local time (instead of the timestamp from the event header) while replacing the escape sequences.

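5.6.Custom Sink

5.6.1.Custom Sink Overview

A custom sink is your own implementation of the Sink interface.
The custom sink's class and its dependencies must be placed on Flume's classpath before the agent starts.

5.6.2.Custom Sink Properties

type – The component type name; must be the fully qualified class name of your Sink implementation.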

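6.Selector

6.1.Selector Overview

6.1.1.Selector Overview

A channel selector works in one of two modes: replicating or multiplexing (routing).

6.2.Replicating Mode

6.2.1.Replicating Selector Properties

selector.type replicating The component type name; replicating is the default
selector.optional – Set of channels to be marked as optional

6.2.2.Replicating Selector Example

See the Avro sink example in 5.3.4.

6.3.Multiplexing (Routing) Mode

6.3.1.Multiplexing (Routing) Selector Properties

selector.type The component type name; must be multiplexing
selector.header Name of the header to inspect
selector.default –
selector.mapping.* –

Example: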
a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2 c3
a1.sources.r1.selector.default = c4
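How the example above routes an event can be modelled in a few lines of Python. This is a simplified sketch of the lookup only (optional channels are left out), and the function name select_channels is an assumption, not a Flume API.

```python
from typing import Dict, List


def select_channels(headers: Dict[str, str], header: str,
                    mapping: Dict[str, List[str]],
                    default: List[str]) -> List[str]:
    """Return the channels an event is routed to by a multiplexing selector."""
    value = headers.get(header)
    # A missing header or an unmapped value falls back to the default channels.
    return mapping.get(value, default)


mapping = {"CZ": ["c1"], "US": ["c2", "c3"]}
print(select_channels({"state": "US"}, "state", mapping, ["c4"]))  # ['c2', 'c3']
print(select_channels({"state": "FR"}, "state", mapping, ["c4"]))  # ['c4']
print(select_channels({}, "state", mapping, ["c4"]))               # ['c4']
```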

6.3.2.Multiplexing (Routing) Selector Example

See the Avro sink example in 5.3.5.

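7.Interceptors

7.1.Interceptors Overview

7.1.1.Interceptors Overview

Flume can modify or drop events in flight. This is done with interceptors.
An interceptor must implement the org.apache.flume.interceptor.Interceptor interface.
An interceptor can modify or drop events based on any criteria chosen by its developer.
Interceptors follow the chain-of-responsibility pattern: several interceptors can be chained and run in the configured order.
The list of events returned by one interceptor is passed to the next interceptor in the chain.
To drop an event, an interceptor simply leaves it out of the list it returns.
To drop all events, it returns an empty list.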
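The chaining rule can be sketched in Python as a list of functions, each taking and returning a list of events. This only models the behaviour described above, not Flume's Interceptor interface; the names add_timestamp, drop_empty, and run_chain and the tuple representation of an event are assumptions.

```python
import time
from typing import Callable, Dict, List, Tuple

Event = Tuple[Dict[str, str], bytes]  # (headers, body)
Interceptor = Callable[[List[Event]], List[Event]]


def add_timestamp(events: List[Event]) -> List[Event]:
    """Add a millisecond timestamp header to every event."""
    now = str(int(time.time() * 1000))
    return [({**headers, "timestamp": now}, body) for headers, body in events]


def drop_empty(events: List[Event]) -> List[Event]:
    """Drop events by simply leaving them out of the returned list."""
    return [(headers, body) for headers, body in events if body]


def run_chain(chain: List[Interceptor], events: List[Event]) -> List[Event]:
    """Feed each interceptor's output into the next one, in configured order."""
    for interceptor in chain:
        events = interceptor(events)
    return events


result = run_chain([add_timestamp, drop_empty], [({}, b"keep me"), ({}, b"")])
print(len(result), sorted(result[0][0]))
```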

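7.2.Timestamp Interceptor

7.2.1.Timestamp Interceptor Overview

This interceptor inserts into the event headers the time, in milliseconds, at which it processes the event.
The header is named timestamp and its value is the processing time.
If a timestamp header is already present, it is kept only when preserveExisting is set to true; with the default (false) it is overwritten.

7.2.2.Timestamp Interceptor Properties

!type – The component type name; must be timestamp or the fully qualified class name of a custom class
preserveExisting false Whether to keep the timestamp if one is already present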

7.2.3.Example

Configuration file:
# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = http
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

# Describe the sink
a1.sinks.k1.type = logger

# Describe the in-memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start Flume:
./flume-ng agent --conf ../conf --conf-file ../conf/flume4.conf --name a1 -Dflume.root.logger=INFO,console

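7.3. Host Interceptor

7.3.1.Host Interceptor Overview

This interceptor inserts the hostname or IP address of the host the agent is running on.
The header is named host, or whatever name is configured.
The value is the hostname or the IP address, depending on the configuration.

7.3.2.Host Interceptor Properties

!type – The component type name; must be host
preserveExisting false Whether to keep the host header if it is already present
useIP true Use the IP address if true, the hostname if false
hostHeader host Name of the header to add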

7.3.3.Example

Configuration file:
# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = http
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
# The IP added is that of the machine where the interceptor runs
a1.sources.r1.interceptors.i2.type = host

# Describe the sink
a1.sinks.k1.type = logger

# Describe the in-memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start Flume:
./flume-ng agent --conf ../conf --conf-file ../conf/flume4.conf --name a1 -Dflume.root.logger=INFO,console

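7.4. Static Interceptor

7.4.1.Static Interceptor Overview

This interceptor lets the user add a static header with a static value to all events.
The current implementation does not allow several headers to be specified at once.
To add several static headers, chain several static interceptors.

7.4.2.Static Interceptor Properties

!type – The component type name; must be static
preserveExisting true Whether to keep the configured header if it is already present
key key Name of the header to add
value value Value of the header to add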

7.4.3.Example

Configuration file:
# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = http
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1 i2 i3
a1.sources.r1.interceptors.i1.type = timestamp
# The IP added is that of the machine where the interceptor runs
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i3.type = static
a1.sources.r1.interceptors.i3.key = country
a1.sources.r1.interceptors.i3.value = China

# Describe the sink
a1.sinks.k1.type = logger

# Describe the in-memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start Flume:
./flume-ng agent --conf ../conf --conf-file ../conf/flume4.conf --name a1 -Dflume.root.logger=INFO,console
The event headers now also contain country=China.

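7.5.UUID Interceptor

7.5.1.UUID Interceptor Overview

This interceptor adds a universally unique identifier (a UUID) to the headers of every event.

7.5.2.UUID Interceptor Properties

!type – The component type name; must be org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
headerName id Name of the header
preserveExisting true Whether to keep the header if it is already present
prefix "" String prefix prepended to each generated UUID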

7.5.3.Example

Configuration file:
# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = http
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1 i2 i3 i4
a1.sources.r1.interceptors.i1.type = timestamp
# The IP added is that of the machine where the interceptor runs
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i3.type = static
a1.sources.r1.interceptors.i3.key = country
a1.sources.r1.interceptors.i3.value = China
a1.sources.r1.interceptors.i4.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder

# Describe the sink
a1.sinks.k1.type = logger

# Describe the in-memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

7.6.Search and Replace Interceptor

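7.6.1.Search and Replace Interceptor Overview

This interceptor provides simple string-based search-and-replace using regular expressions. It can modify the content of the event body.

7.6.2.Search and Replace Interceptor Properties

type – Component type name; must be search_replace
searchPattern – Regular expression to search for and replace
replaceString – Replacement string
charset UTF-8 Character set of the event body; defaults to UTF-8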
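
To see what a searchPattern / replaceString pair does to an event body, the following Python sketch applies the same idea with the standard re module. Flume itself uses Java regular expressions, so this only approximates the interceptor for simple patterns such as [0-9]; the function name search_replace and the sample body are assumptions for illustration.

```python
import re


def search_replace(body, search_pattern, replace_string, charset="utf-8"):
    """Decode the body, replace every match of search_pattern, and re-encode it."""
    text = body.decode(charset)
    return re.sub(search_pattern, replace_string, text).encode(charset)


# Mask every digit, like searchPattern = [0-9] with replaceString = *
masked = search_replace(b"order 42 shipped to room 7", r"[0-9]", "*")
print(masked)
```

Only the body changes; the headers of the event are untouched.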

7.6.3.Example

Configuration file
# Name the components of Agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the Source
a1.sources.r1.type = http
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1 i2 i3 i4 i5
a1.sources.r1.interceptors.i1.type = timestamp
# The host interceptor adds the IP of the machine the interceptor runs on
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i3.type = static
a1.sources.r1.interceptors.i3.key = country
a1.sources.r1.interceptors.i3.value = China
a1.sources.r1.interceptors.i4.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
a1.sources.r1.interceptors.i5.type = search_replace
# Replace every digit with *
a1.sources.r1.interceptors.i5.searchPattern = [0-9]
a1.sources.r1.interceptors.i5.replaceString = *

# Describe the Sink
a1.sinks.k1.type = logger

# Describe the in-memory Channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the Source and Sink to the Channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

7.7.Regex Filtering Interceptor

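7.7.1.Regex Filtering Interceptor Overview

This interceptor filters events by interpreting the event body as text and matching it against a configured regular expression. The expression can be used either to include or to exclude events.

7.7.2.Regex Filtering Interceptor Properties

!type – Component type name; must be regex_filter
regex ".*" Regular expression to match against events
excludeEvents false If true, events that match the regex are dropped; if false, only events that match are kept.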
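
The two modes of excludeEvents are easy to mix up, so here is a minimal Python model of the filter. It is an illustration under the assumption that the regex is searched for anywhere in the body; the function name regex_filter and the sample bodies are invented for the example.

```python
import re


def regex_filter(bodies, regex=".*", exclude_events=False):
    """Keep or drop bodies depending on whether they match regex."""
    pattern = re.compile(regex)
    kept = []
    for body in bodies:
        matched = pattern.search(body) is not None
        # exclude_events=False keeps matches; exclude_events=True keeps non-matches
        if matched != exclude_events:
            kept.append(body)
    return kept


bodies = ["apple", "banana", "avocado", "cherry"]
print(regex_filter(bodies, r"^a.*", exclude_events=True))
print(regex_filter(bodies, r"^a.*", exclude_events=False))
```

With excludeEvents = true and the pattern ^a.*, bodies that start with a are discarded, which is the setting the example configuration in this section uses.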

7.7.3.Example

Configuration file
# Name the components of Agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the Source
a1.sources.r1.type = http
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1 i2 i3 i4 i5 i6
a1.sources.r1.interceptors.i1.type = timestamp
# The host interceptor adds the IP of the machine the interceptor runs on
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i3.type = static
a1.sources.r1.interceptors.i3.key = country
a1.sources.r1.interceptors.i3.value = China
a1.sources.r1.interceptors.i4.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
a1.sources.r1.interceptors.i5.type = search_replace
# Replace every digit with *
a1.sources.r1.interceptors.i5.searchPattern = [0-9]
a1.sources.r1.interceptors.i5.replaceString = *
a1.sources.r1.interceptors.i6.type = regex_filter
# Drop every event whose body starts with a
a1.sources.r1.interceptors.i6.regex = ^a.*
a1.sources.r1.interceptors.i6.excludeEvents = true

# Describe the Sink
a1.sinks.k1.type = logger

# Describe the in-memory Channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the Source and Sink to the Channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

7.8.Regex Extractor Interceptor

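7.8.1.Regex Extractor Interceptor Overview

This interceptor matches the event body against the specified regular expression and adds the matched groups to the event as headers. It also supports pluggable serializers that format each matched group before it is added as a header.

7.8.2.Regex Extractor Interceptor Properties

!type – Component type name; must be regex_extractor
!regex – Regular expression to match against events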
!serializers – Space-separated list of serializers for mapping matches to header names and serializing their values. (See example below) Flume provides built-in support for the following serializers: org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
serializers.<s1>.type default Must be default (org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer), org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer, or the FQCN of a custom class that implements org.apache.flume.interceptor.RegexExtractorInterceptorSerializer
serializers.<s1>.name –
serializers.* – Serializer-specific properties

7.8.3.Example

To be covered later, in the project section.
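
Until then, the group-to-header mapping can be pictured with a short Python sketch. It models only the pass-through case, where each captured group is stored unchanged under the name of the serializer in the same position; the function name regex_extractor, the sample body and the header names one, two and three are assumptions for illustration, and Python's re module stands in for Java's regex engine.

```python
import re


def regex_extractor(body, regex, header_names, headers=None):
    """Copy the capture groups of the first match into headers, in order."""
    result = dict(headers or {})
    match = re.search(regex, body)
    if match:
        # Group 1 goes to the first name, group 2 to the second, and so on.
        for name, value in zip(header_names, match.groups()):
            if value is not None:
                result[name] = value
    return result


extracted = regex_extractor("1:2:3.4foobar5", r"(\d):(\d):(\d)", ["one", "two", "three"])
print(extracted)
```

If the body does not match the expression at all, the headers are returned unchanged.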

8.Processor

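8.1.Overview

8.1.1.Overview

A Sink Group lets the user combine several Sinks into a single entity.
A Flume Sink Processor can switch among the Sinks in the group to balance load across them, or to fail over to another Sink when one fails.
sinks – Space-separated list of the Sinks in the group
processor.type default Component type name; must be default, failover or load_balance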

8.2.Default Sink Processor

8.2.1.Default Sink Processor

The Default Sink Processor accepts only a single Sink.
Users are not required to create a processor (sink group) for a single Sink.

8.3.Failover Sink Processor

8.3.1.Failover Sink Processor

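The Failover Sink Processor maintains a prioritized list of sinks and guarantees that events are processed as long as at least one sink is available.
Failure handling works by assigning a failed sink a cool-down period; once the period has elapsed, the sink is tried again.
Each sink can be given a priority; the larger the number, the higher the priority.
If a sink fails while sending an event, the sink with the next-highest priority tries to send it.
If no priority is specified, priority follows the order in which the sinks are listed in the configuration: sinks configured earlier rank above sinks configured later.
To configure it, set the sink group's processor type to failover and assign a priority to every sink in the group.
Priorities must be unique.
In addition, the maxpenalty property sets an upper bound on the backoff period of a failed sink.

8.3.2.Properties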

sinks – Space-separated list of sinks that are participating in the group
processor.type default The component type name, needs to be failover
processor.priority.<sinkName> – Priority value. <sinkName> must be one of the sink instances associated with the current sink group. A sink with a higher priority value is activated earlier; a larger absolute value indicates higher priority.
processor.maxpenalty 30000 The maximum backoff period for the failed Sink (in millis)
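
The selection rule (highest priority among the sinks that are currently usable) can be sketched as below. This is a simplified Python model, not Flume's implementation: it ignores the timing of the cool-down period and simply takes the set of failed sinks as an argument. The function name pick_sink is an assumption; the priorities match the example that follows.

```python
def pick_sink(priorities, failed=()):
    """Return the usable sink with the highest priority, or None if all are down."""
    live = {name: prio for name, prio in priorities.items() if name not in failed}
    if not live:
        return None
    return max(live, key=live.get)


priorities = {"k1": 5, "k2": 10}
print(pick_sink(priorities))                  # k2 is preferred while it is healthy
print(pick_sink(priorities, failed={"k2"}))   # k1 takes over when k2 is down
```

Once the failed sink's cool-down has elapsed it would leave the failed set again, and events would return to the higher-priority sink.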

8.3.3.Example

Configuration file on h1
# Name the Agent's components
a1.sources=r1
a1.sinks=k1 k2
a1.channels=c1

# Describe/configure the Source
a1.sources.r1.type=http
a1.sources.r1.port=44444

# Describe the Sinks (failover group)
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1=5
a1.sinkgroups.g1.processor.priority.k2=10

a1.sinks.k1.type =avro
a1.sinks.k1.hostname =hadoop02
a1.sinks.k1.port =44444

a1.sinks.k2.type =avro
a1.sinks.k2.hostname =hadoop03
a1.sinks.k2.port =44444

# Describe the in-memory Channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=1000

# Bind the Source and Sinks to the Channel
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
a1.sinks.k2.channel=c1

Configuration file on h2 and h3
# Name the Agent's components
a1.sources=r1
a1.sinks=k1
a1.channels=c1

# Describe/configure the Source
a1.sources.r1.type=avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port=44444

# Describe the Sink
a1.sinks.k1.type=logger

# Describe the in-memory Channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=1000

# Bind the Source and Sink to the Channel
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1

Start Flume
./flume-ng agent --conf …/conf --conf-file …/conf/flume5.conf --name a1 -Dflume.root.logger=INFO,console

Send data on h1
curl -XPOST -d '[{"headers":{"flag":"c"},"body":"idoall.org_body"}]' http://0.0.0.0:44444

8.4.Load balancing Sink Processor

8.4.1.Load balancing Sink Processor

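The Load balancing Sink Processor distributes load across multiple sinks.
It maintains an indexed list of active sinks.
It supports round-robin and random selection; round-robin is the default, and the mechanism can be chosen through configuration.
A custom selection mechanism can also be supplied by writing a class that extends AbstractSinkSelector.

8.4.2.Properties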

!processor.sinks – Space-separated list of sinks that are participating in the group
!processor.type default The component type name, needs to be load_balance
processor.backoff false Should failed sinks be backed off exponentially.
processor.selector round_robin Selection mechanism. Must be either round_robin, random or FQCN of custom class that inherits from AbstractSinkSelector
processor.selector.maxTimeOut 30000 Used by backoff selectors to limit exponential backoff (in milliseconds)
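
The difference between the round_robin and random selectors can be pictured with a small Python model. It is only a sketch of the idea (try the sinks in the selector's order and use the first one that accepts the event), not Flume's code, and it leaves out exponential backoff; the class SinkSelector, the function deliver and the is_up callback are assumptions introduced for the example.

```python
import random


class SinkSelector:
    """Decide in which order the sinks of a group are tried for each event."""

    def __init__(self, sinks, selector="round_robin", rng=None):
        self.sinks = list(sinks)
        self.selector = selector
        self.rng = rng or random.Random()
        self.next_index = 0

    def candidates(self):
        """Return the sinks in the order they would be tried for one event."""
        if self.selector == "random":
            order = self.sinks[:]
            self.rng.shuffle(order)
            return order
        # round_robin: start one position later on every call
        start = self.next_index
        self.next_index = (self.next_index + 1) % len(self.sinks)
        return self.sinks[start:] + self.sinks[:start]


def deliver(selector, is_up):
    """Send one event: return the first candidate sink that is up, else None."""
    for sink in selector.candidates():
        if is_up(sink):
            return sink
    return None


rr = SinkSelector(["k1", "k2"])
print([deliver(rr, lambda sink: True) for _ in range(4)])
```

When one sink is down, the loop in deliver falls through to the next candidate, so events keep flowing to the remaining sink.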

8.4.3.Example

Configuration file on h1
# Name the Agent's components
a1.sources=r1
a1.sinks=k1 k2
a1.channels=c1

# Describe/configure the Source
a1.sources.r1.type=http
a1.sources.r1.port=44444

# Describe the Sinks (load-balancing group)
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = random

a1.sinks.k1.type =avro
a1.sinks.k1.hostname =hadoop02
a1.sinks.k1.port =44444

a1.sinks.k2.type =avro
a1.sinks.k2.hostname =hadoop03
a1.sinks.k2.port =44444

# Describe the in-memory Channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=1000

# Bind the Source and Sinks to the Channel
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
a1.sinks.k2.channel=c1

Configuration file on h2 and h3
# Name the Agent's components
a1.sources=r1
a1.sinks=k1
a1.channels=c1

# Describe/configure the Source
a1.sources.r1.type=avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port=44444

# Describe the Sink
a1.sinks.k1.type=logger

# Describe the in-memory Channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=1000

# Bind the Source and Sink to the Channel
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1

Start Flume
./flume-ng agent --conf …/conf --conf-file …/conf/flume6.conf --name a1 -Dflume.root.logger=INFO,console

Send data on h1
curl -XPOST -d '[{"headers":{"flag":"c"},"body":"idoall.org_body"}]' http://0.0.0.0:44444

9.Channel

9.1.Memory Channel

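9.1.1.Memory Channel Overview

Memory Channel: events are stored in an in-memory queue with a configurable maximum size.
It is well suited to flows that need high throughput and can afford to lose the buffered events if the agent fails.

9.1.2.Memory Channel Properties

!type – Component type name; must be memory
capacity 100 Maximum number of events stored in the channel
transactionCapacity 100 Maximum number of events per transaction
keep-alive 3 Timeout, in seconds, for adding or removing an event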
byteCapacityBufferPercentage 20 Defines the percent of buffer between byteCapacity and the estimated total size of all events in the channel, to account for data in headers. See below.
byteCapacity see description Maximum total bytes of memory allowed as a sum of all events in this channel. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line). Note that if you have multiple memory channels on a single JVM, and they happen to hold the same physical events (i.e. if you are using a replicating channel selector from a single source) then those event sizes may be double-counted for channel byteCapacity purposes. Setting this value to 0 will cause this value to fall back to a hard internal limit of about 200 GB.
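
Read together, byteCapacity and byteCapacityBufferPercentage suggest a simple calculation: the default capacity is 80% of the maximum heap, and the buffer percentage is then set aside for headers. The Python sketch below works through that arithmetic as I understand the two descriptions; it is not taken from Flume's source, and the function names and the heap size are assumptions chosen to give round numbers.

```python
def default_byte_capacity(max_heap_bytes):
    """Default byteCapacity: 80% of the JVM's maximum heap (-Xmx)."""
    return max_heap_bytes * 80 // 100


def usable_body_bytes(byte_capacity, buffer_percentage=20):
    """Bytes left for event bodies after reserving the header buffer."""
    return byte_capacity * (100 - buffer_percentage) // 100


heap = 1_000_000_000  # hypothetical heap size, chosen for round numbers
capacity = default_byte_capacity(heap)
usable = usable_body_bytes(capacity)
print(capacity, usable)
```

Under these assumptions roughly 64% of the heap is available for event bodies with the default settings, which is worth keeping in mind when sizing -Xmx for an agent with a large memory channel.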

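9.1.3.Example

See the getting-started example.

9.2.JDBC Channel

9.2.1.JDBC Channel Overview

Events are persisted in a reliable database. Currently only embedded Derby is supported. Use this channel when recoverability is very important.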

9.3.File Channel

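9.3.1.File Channel Overview

Throughput is lower than with the Memory Channel, but events are not lost even if the agent process fails.

9.3.2.File Channel Properties

!type – Component type name; must be file
checkpointDir ~/.flume/file-channel/checkpoint Directory where the checkpoint file is stored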
useDualCheckpoints false Backup the checkpoint. If this is set to true, backupCheckpointDir must be set
backupCheckpointDir – The directory where the checkpoint is backed up to. This directory must not be the same as the data directories or the checkpoint directory
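dataDirs ~/.flume/file-channel/data Comma-separated list of directories for storing the log files. Using multiple directories on separate disks can improve File Channel performance.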
transactionCapacity 10000 The maximum size of transaction supported by the channel
checkpointInterval 30000 Amount of time (in millis) between checkpoints
maxFileSize 2146435071 Maximum size (in bytes) of a single log file
minimumRequiredSpace 524288000 Minimum Required free space (in bytes). To avoid data corruption, File Channel stops accepting take/put requests when free space drops below this value
capacity 1000000 Maximum capacity of the channel
keep-alive 3 Amount of time (in sec) to wait for a put operation
use-log-replay-v1 false Expert: Use old replay logic
use-fast-replay false Expert: Replay without using queue
checkpointOnClose true Controls if a checkpoint is created when the channel is closed. Creating a checkpoint on close speeds up subsequent startup of the file channel by avoiding replay.
encryption.activeKey – Key name used to encrypt new data
encryption.cipherProvider – Cipher provider type, supported types: AESCTRNOPADDING
encryption.keyProvider – Key provider type, supported types: JCEKSFILE
encryption.keyProvider.keyStoreFile – Path to the keystore file
encryption.keyProvider.keyStorePasswordFile – Path to the keystore password file
encryption.keyProvider.keys – List of all keys (e.g. history of the activeKey setting)
encryption.keyProvider.keys.*.passwordFile – Path to the optional key password file

9.4.Spillable Memory Channel

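9.4.1.Spillable Memory Channel Overview

Spillable Memory Channel: a memory channel that spills over to disk.
Events are stored in an in-memory queue and on disk; the in-memory queue is the primary store and the disk holds the overflow.
The disk store is managed by an embedded File Channel. When the in-memory queue is full, further events are stored in the file channel. This channel suits flows that want Memory Channel throughput during normal operation and File Channel capacity during peaks, for better tolerance of load spikes; while spilling, throughput drops in exchange for that tolerance.
If the Agent crashes, only the events stored on the file system can be recovered. This channel is experimental and is not recommended for production use.

9.4.2.Spillable Memory Channel Properties

!type – Component type name; must be SPILLABLEMEMORY
memoryCapacity 10000 Maximum number of events stored in the in-memory queue. Set to 0 to disable the in-memory queue.
overflowCapacity 100000000 Maximum number of events stored on disk. Set to 0 to disable disk overflow.
overflowTimeout 3 Number of seconds to wait before enabling disk overflow when memory fills up.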
byteCapacityBufferPercentage 20 Defines the percent of buffer between byteCapacity and the estimated total size of all events in the channel, to account for data in headers. See below.
byteCapacity see description Maximum bytes of memory allowed as a sum of all events in the memory queue. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line). Note that if you have multiple memory channels on a single JVM, and they happen to hold the same physical events (i.e. if you are using a replicating channel selector from a single source) then those event sizes may be double-counted for channel byteCapacity purposes. Setting this value to 0 will cause this value to fall back to a hard internal limit of about 200 GB.
avgEventSize 500 Estimated average size of events, in bytes, going into the channel
see file channel Any file channel property with the exception of ‘keep-alive’ and ‘capacity’ can be used. The keep-alive of file channel is managed by Spillable Memory Channel. Use ‘overflowCapacity’ to set the File channel’s capacity.
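
The interplay of memoryCapacity and overflowCapacity, and the recovery behaviour after a crash, can be pictured with a toy model. The Python sketch below is a deliberately simplified illustration of the description above (memory first, then disk, and only the disk part survives a crash); it ignores transactions, overflowTimeout and byte limits, and the class and method names are invented for the example.

```python
from collections import deque


class SpillableModel:
    """Toy model: a bounded in-memory queue with a bounded on-disk overflow."""

    def __init__(self, memory_capacity=10000, overflow_capacity=100000000):
        self.memory_capacity = memory_capacity
        self.overflow_capacity = overflow_capacity
        self.memory = deque()
        self.disk = []  # stands in for the embedded File Channel

    def put(self, event):
        """Store the event; return where it went, or None if the channel is full."""
        if len(self.memory) < self.memory_capacity:
            self.memory.append(event)
            return "memory"
        if len(self.disk) < self.overflow_capacity:
            self.disk.append(event)
            return "disk"
        return None

    def crash_and_recover(self):
        """Simulate an agent crash: memory is lost, events on disk survive."""
        self.memory.clear()
        return list(self.disk)


channel = SpillableModel(memory_capacity=2, overflow_capacity=1)
placements = [channel.put(e) for e in ["e1", "e2", "e3", "e4"]]
print(placements)
print(channel.crash_and_recover())
```

Setting memory_capacity to 0 in this model sends every event to disk, and setting overflow_capacity to 0 turns it into a plain bounded memory queue, mirroring the two "set to 0" notes in the property table.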

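9.5.Custom Channel

9.5.1.Custom Channel Overview

A custom channel is your own implementation of the Channel interface.
The custom Channel class and its dependencies must be placed on the agent's classpath before Flume starts.

9.5.2.Custom Channel Properties

type - Fully qualified class name of your Channel implementation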
