Big Data: Flume Study Notes

Flume is a system in the big-data ecosystem for collecting data in a distributed fashion.

Open-sourced by Cloudera;
 a distributed, reliable, and highly available system for collecting massive volumes of log data;
 data sources are customizable and extensible;
 data storage systems are customizable and extensible.

 Middleware: it hides the heterogeneity between data sources and data storage systems.

Flume is used to collect unstructured data.

Event
Client
Agent
  1. Source  2. Channel  3. Sink
Other components: Interceptor, Channel Selector, Sink Processor

An Event is the basic unit of data transfer in Flume. Flume moves data from its origin to its final destination in the form of events. An Event consists of an optional set of headers and a byte array carrying the payload.

A Client is an entity that wraps logs into events and sends them to one or more agents. An Agent contains sources, channels, sinks, and other components; agents are the basic building blocks of a Flume flow. A Source receives events and puts them onto a channel in batches.

Different types of Source (a quick sketch of the first kind follows the list):
1. Sources that integrate with existing systems: Syslog, Netcat
2. Sources that generate events from a command: Exec
3. Sources that watch a directory for file changes: Spooling Directory Source, Taildir Source
4. IPC Sources for agent-to-agent communication: Avro, Thrift
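As an illustration of the first category, here is a minimal sketch of a Netcat source paired with a logger sink for smoke testing; the agent and component names (a1, r1, c1, k1) are placeholders of my choosing:

a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Netcat source: listens on a TCP port and turns each received line into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1
# Logger sink: prints events to the agent log, handy for verifying the flow
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
a1.channels.c1.type = memory

With the agent running, test lines can be sent with, e.g., nc localhost 44444.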


A Source must be associated with at least one channel (and may feed several, as sketched below).
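When a source lists more than one channel, the default (replicating) channel selector copies every event into each of them. A minimal sketch with placeholder names:

a1.sources = r1
a1.channels = c1 c2
# Fan out: every event from r1 is replicated into both c1 and c2
a1.sources.r1.channels = c1 c2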

Automatically watch a directory and send new files to HDFS:

LogAgent.sources = mysource
LogAgent.channels = mychannel
LogAgent.sinks = mysink
# Spooling Directory source: watches the directory for new files
LogAgent.sources.mysource.type = spooldir
LogAgent.sources.mysource.channels = mychannel
LogAgent.sources.mysource.spoolDir = /home/vijay/tmp/logs
# HDFS sink: writes events into date-partitioned directories
LogAgent.sinks.mysink.channel = mychannel
LogAgent.sinks.mysink.type = hdfs
LogAgent.sinks.mysink.hdfs.path = hdfs://master:9000/data/logs/%Y_%m_%d/
LogAgent.sinks.mysink.hdfs.batchSize = 1000
# rollSize = 0 disables size-based rolling; roll every 10000 events instead
LogAgent.sinks.mysink.hdfs.rollSize = 0
LogAgent.sinks.mysink.hdfs.rollCount = 10000
# use the agent's local time to resolve %Y_%m_%d in the path
LogAgent.sinks.mysink.hdfs.useLocalTimeStamp = true
LogAgent.sinks.mysink.hdfs.filePrefix = events
# DataStream writes raw text instead of SequenceFile format
LogAgent.sinks.mysink.hdfs.fileType = DataStream
# Memory channel: buffers up to 10000 events, 100 per transaction
LogAgent.channels.mychannel.type = memory
LogAgent.channels.mychannel.capacity = 10000
LogAgent.channels.mychannel.transactionCapacity = 100
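To run this agent, save the configuration under the conf directory and start it with flume-ng; the file name logagent.properties below is my assumption, any name works:

bin/flume-ng agent --conf conf --conf-file conf/logagent.properties --name LogAgent -Dflume.root.logger=INFO,console

Once a file dropped into the spoolDir has been fully ingested, the Spooling Directory source renames it with a .COMPLETED suffix by default.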

Listen on a port for remote events (Avro) and send them to HDFS:

AvroAgent.sources = mysource
AvroAgent.channels = mychannel
AvroAgent.sinks = mysink
# Avro source: listens for events sent over Avro RPC (e.g. from another agent)
AvroAgent.sources.mysource.type = avro
AvroAgent.sources.mysource.bind = 192.168.42.128
AvroAgent.sources.mysource.port = 56720
AvroAgent.sources.mysource.channels = mychannel
AvroAgent.sinks.mysink.channel = mychannel
AvroAgent.sinks.mysink.type = hdfs
AvroAgent.sinks.mysink.hdfs.path = hdfs://master:9000/data/logs/%Y_%m_%d/
AvroAgent.sinks.mysink.hdfs.batchSize = 1000
AvroAgent.sinks.mysink.hdfs.rollSize = 0
AvroAgent.sinks.mysink.hdfs.rollCount = 10000
AvroAgent.sinks.mysink.hdfs.useLocalTimeStamp = true
AvroAgent.channels.mychannel.type = memory
AvroAgent.channels.mychannel.capacity = 10000
AvroAgent.channels.mychannel.transactionCapacity = 100
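With this agent running, the avro-client tool bundled with Flume can send a local file to the Avro source for testing; each line of the file becomes one event (the path /tmp/test.log is an assumption):

bin/flume-ng avro-client --conf conf -H 192.168.42.128 -p 56720 -F /tmp/test.log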

Watch a directory and send its contents to Kafka:

KafkaAgent.sources = mysource
KafkaAgent.channels = c1
KafkaAgent.sinks = k1
KafkaAgent.sources.mysource.type = spooldir
KafkaAgent.sources.mysource.channels = c1
KafkaAgent.sources.mysource.spoolDir = /tmp/logs
# Kafka sink: publishes events to the flume-data topic
KafkaAgent.sinks.k1.channel = c1
KafkaAgent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
KafkaAgent.sinks.k1.kafka.topic = flume-data
KafkaAgent.sinks.k1.kafka.bootstrap.servers = localhost:9092
KafkaAgent.sinks.k1.kafka.flumeBatchSize = 20
# producer.* settings are passed straight through to the Kafka producer
KafkaAgent.sinks.k1.kafka.producer.acks = 1
KafkaAgent.sinks.k1.kafka.producer.linger.ms = 1
KafkaAgent.sinks.k1.kafka.producer.compression.type = snappy
KafkaAgent.channels.c1.type = memory
KafkaAgent.channels.c1.capacity = 10000
KafkaAgent.channels.c1.transactionCapacity = 100

Create the configuration file above in the conf directory, for example as kafkaagent.properties.

Command to start the agent: bin/flume-ng agent --conf conf --conf-file conf/kafkaagent.properties --name KafkaAgent -Dflume.root.logger=DEBUG,console
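Once the agent is running and files are placed in /tmp/logs, you can confirm that events arrive in Kafka with the console consumer that ships with Kafka (run from the Kafka installation directory; the broker address matches the sink configuration above):

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic flume-data --from-beginning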
