Flume is used to collect log data; the following usage scenarios are documented here:
Scenario 1: use an avro source, a memory channel, and a logger sink to print the collected logs to standard output. This is suitable for testing.
Scenario 2: use an avro source, a kafka channel, and an hdfs sink to store the logs on HDFS with the "Flume Event" Avro Event Serializer. Each record in the resulting .avro files contains two fields, headers and body: headers carries the Flume event header information, and body is a bytes field holding the actual payload. The schema extracted from a .avro file produced this way is as follows:
{"type":"record","name":"Event","fields":[{"name":"headers","type":{"type":"map","values":"string"}},{"name":"body","type":"bytes"}]}
Scenario 3: use an avro source, a kafka channel, and an hdfs sink to store the logs with the Avro Event Serializer, so that the saved .avro files can be read directly for statistics and computation. Depending on which serializer class is chosen, there are two possible configurations:
The first uses the org.apache.flume.serialization.AvroEventSerializer$Builder class provided by CDH to serialize events onto HDFS. This serializer requires the Flume event headers to contain "flume.avro.schema.literal" or "flume.avro.schema.url" to specify the avro schema of the data; otherwise the data cannot be saved and Flume reports an error. When using this configuration note:
(1) The serializer on the sink side is the one provided by the CDK. The source code is on GitHub and can be compiled yourself (my build failed), or the jar can be downloaded directly: https://repository.cloudera.com/content/repositories/releases/com/cloudera/cdk/cdk-flume-avro-event-serializer/0.9.2/
The second uses the org.apache.flume.sink.hdfs.AvroEventSerializer$Builder class shipped with Apache Flume to write the data to HDFS. This serializer does not require the avro schema in the event headers, but the avro schema URL must be specified on the sink.
For either configuration to work, note the following point:
(1) When sending non-String data with the avro client (for example ordinary object data), it must be converted to a byte array before it can be sent. Note that a Java object must not be turned into bytes directly; instead the Avro API must be used to encode it against the given schema (see the serializeAvro method in scenario 3). Otherwise the avro files stored on HDFS are unusable, because Avro cannot decode them.
Scenario 4: use a Spooling Directory Source, a memory channel, and an hdfs sink to watch for generated avro files, effectively uploading avro files from the source server to the target server. Note the following:
(1) The avro files to be uploaded should be fully prepared in a directory other than the spooling directory and only then moved into the spooling directory; otherwise the agent will fail.
(2) Once a file in the spooling directory has been read by Flume, any further change to it has no effect.
(3) Every file placed into the spooling directory must have a unique name; otherwise errors are raised.
The configuration and related Java code for each scenario are detailed below.
Scenario 1 in detail:
This is the configuration and demo from the official documentation; it lets a new user quickly see what Flume does.
The configuration is as follows:
a1.channels = c1
a1.sources = r1
a1.sinks = k1
a1.channels.c1.type = memory
a1.sources.r1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 41414
a1.sinks.k1.channel = c1
a1.sinks.k1.type = logger
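With this configuration saved, the agent can be started in the usual way, e.g. flume-ng agent --conf conf --conf-file <your-conf-file> --name a1 -Dflume.root.logger=INFO,console, so that the logger sink output shows up on the console.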
The client that sends the avro events is shown below:
package com.learn.flume;
import com.learn.model.UserModel;
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.reflect.ReflectDatumWriter;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Properties;
public class SendAvroClient {
public static void main(String[] args) throws IOException {
RpcClientFacade client = new RpcClientFacade();
// Initialize client with the remote Flume agent's host and port
client.init("127.0.0.1", 41414);
// Send 10 events to the remote Flume agent. That agent should be
// configured to listen with an AvroSource.
String sampleData = "china";
for (int i = 0; i < 10; i++) {
client.sendStringDataToFlume(sampleData);
}
client.cleanUp();
}
}
class RpcClientFacade {
private RpcClient client;
private String hostname;
private int port;
private static Properties p= new Properties();
static {
p.put("client.type","default");
p.put("hosts","h1");
p.put("hosts.h1","127.0.0.1:41414");
p.put("batch-size",100);
p.put("connect-timeout",20000);
p.put("request-timeout",20000);
}
public void init(String hostname, int port) {
// Setup the RPC connection
this.hostname = hostname;
this.port = port;
this.client = RpcClientFactory.getInstance(p);
if (this.client == null) {
System.out.println("init client fail");
}
// this.client = RpcClientFactory.getInstance(hostname, port);
// Use the following method to create a thrift client (instead of the above line):
// this.client = RpcClientFactory.getThriftInstance(hostname, port);
}
public void sendStringDataToFlume(String data) {
// Create a Flume event that encapsulates the data
Event event = EventBuilder.withBody(data, Charset.forName("UTF-8"));
//event.getHeaders().put("kkkk","aaaaa");
// Send the event
try {
client.append(event);
} catch (EventDeliveryException e) {
// clean up and recreate the client
client.close();
client = null;
client = RpcClientFactory.getDefaultInstance(hostname, port);
// Use the following method to create a thrift client (instead of the above line):
// this.client = RpcClientFactory.getThriftInstance(hostname, port);
e.printStackTrace();
}
}
public void cleanUp() {
// Close the RPC connection
client.close();
}
}
Here a Netty-based client sends the Flume messages, but the payload has to be a String, because the message content ultimately travels through Flume as bytes.
Scenario 2 in detail: scenario 2 uses the "Flume Event" Avro Event Serializer, so the data saved on HDFS contains the Flume event headers in addition to the message body itself.
Its Flume configuration is shown below:
a1.channels = kafka_channel
a1.sources = r1
a1.sinks = k1
a1.channels.kafka_channel.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kafka_channel.kafka.bootstrap.servers=localhost:9092
a1.channels.kafka_channel.kafka.topic=test
a1.channels.kafka_channel.group.id=flume_group
a1.sources.r1.channels = kafka_channel
a1.sources.r1.type = avro
# add an interceptor
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=timestamp
a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 41414
##************ k1 uses the "Flume Event" Avro Event Serializer: start *************************#
a1.sinks.k1.channel = kafka_channel
a1.sinks.k1.type = hdfs
# use the default partition granularity
##a1.sinks.k1.hdfs.path=hdfs://localhost:9000/flume_data
# partition by day; this requires a timestamp in the event header, i.e. the timestamp interceptor added to the source
a1.sinks.k1.hdfs.path=hdfs://localhost:9000/flume_data/year=%Y/moth=%m/day=%d
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.fileSuffix=.avro
# use "_" as the in-use file prefix, because Hadoop MapReduce ignores files whose names start with "_"
a1.sinks.k1.hdfs.inUsePrefix=_
a1.sinks.k1.serializer=avro_event
##************ k1 uses the "Flume Event" Avro Event Serializer: end *************************#
The avro client code:
package com.learn.flume;
import com.learn.model.UserModel;
import com.learn.utils.ByteArrayUtils;
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.reflect.ReflectDatumWriter;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Properties;
public class SendAvroClient {
public static void main(String[] args) throws IOException {
RpcClientFacade client = new RpcClientFacade();
// Initialize client with the remote Flume agent's host and port
client.init("127.0.0.1", 41414);
for (int i = 0; i < 10; i++) {
UserModel userModel = new UserModel();
userModel.setAddress("hangzhou");
userModel.setAge(26);
userModel.setJob("it");
userModel.setName("shenjin");
client.sendObjectDataToFlume(userModel);
}
client.cleanUp();
}
}
class RpcClientFacade {
private RpcClient client;
private String hostname;
private int port;
private static Properties p = new Properties();
static {
p.put("client.type", "default");
p.put("hosts", "h1");
p.put("hosts.h1", "127.0.0.1:41414");
p.put("batch-size", 100);
p.put("connect-timeout", 20000);
p.put("request-timeout", 20000);
}
public void init(String hostname, int port) {
// Setup the RPC connection
this.hostname = hostname;
this.port = port;
this.client = RpcClientFactory.getInstance(p);
if (this.client == null) {
System.out.println("init client fail");
}
}
public void sendStringDataToFlume(String data) {
// Create a Flume event that encapsulates the data
Event event = EventBuilder.withBody(data, Charset.forName("UTF-8"));
// Send the event
try {
client.append(event);
} catch (EventDeliveryException e) {
// clean up and recreate the client
client.close();
client = null;
client = RpcClientFactory.getDefaultInstance(hostname, port);
e.printStackTrace();
}
}
public void sendObjectDataToFlume(Object data) throws IOException {
Event event = EventBuilder.withBody(ByteArrayUtils.objectToBytes(data).get());
// Send the event
try {
client.append(event);
} catch (EventDeliveryException e) {
// clean up and recreate the client
client.close();
client = null;
client = RpcClientFactory.getDefaultInstance(hostname, port);
// Use the following method to create a thrift client (instead of the above line):
// this.client = RpcClientFactory.getThriftInstance(hostname, port);
e.printStackTrace();
}
}
public void cleanUp() {
// Close the RPC connection
client.close();
}
}
This client is a modification of the one from scenario 1. The difference is that it provides sendObjectDataToFlume(Object data) for sending Java objects rather than plain String messages, which requires converting the Object to a byte array before calling the client's send method. A utility class for converting between Java objects and byte arrays was therefore written, as follows:
package com.learn.utils;
import java.io.*;
import java.util.Optional;
public class ByteArrayUtils {
/**
* Serialize a Java object to a byte array using Java serialization.
* @param obj the object to serialize
* @param <T> the type of the object
* @return an Optional wrapping the byte array, empty if serialization failed
*/
public static <T> Optional<byte[]> objectToBytes(T obj) {
byte[] bytes = null;
ByteArrayOutputStream out = new ByteArrayOutputStream();
ObjectOutputStream sOut;
try {
sOut = new ObjectOutputStream(out);
sOut.writeObject(obj);
sOut.flush();
bytes= out.toByteArray();
} catch (IOException e) {
e.printStackTrace();
}
return Optional.ofNullable(bytes);
}
/**
* Deserialize a byte array back into a Java object.
* @param bytes the serialized bytes
* @param <T> the expected type of the object
* @return an Optional wrapping the object, empty if deserialization failed
*/
public static <T> Optional<T> bytesToObject(byte[] bytes) {
T t = null;
ByteArrayInputStream in = new ByteArrayInputStream(bytes);
ObjectInputStream sIn;
try {
sIn = new ObjectInputStream(in);
t = (T)sIn.readObject();
} catch (Exception e) {
e.printStackTrace();
}
return Optional.ofNullable(t);
}
}
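For reference, a minimal round trip with these helpers might look like the following sketch (UserModel is the model class shown further below):
UserModel user = new UserModel();
user.setName("shenjin");
user.setAge(26);
// Java-serialize the object to bytes, then restore it again
byte[] bytes = ByteArrayUtils.objectToBytes(user).get();
UserModel restored = ByteArrayUtils.<UserModel>bytesToObject(bytes).get();
System.out.println(restored);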
With this in place the messages are sent successfully and stored on HDFS in avro format. The Avro API is then used to read the file contents and deserialize them back into objects:
public static void read2() throws IOException {
// the schema is already stored in the file header, so no schema needs to be supplied for deserialization
Configuration configuration = new Configuration();
String hdfsURI = "hdfs://localhost:9000/";
String hdfsFileURL = "flume_data/year=2018/moth=02/day=07/FlumeData.1517970870974.avro";
FileContext fileContext = FileContext.getFileContext(URI.create(hdfsURI), configuration);
//FileSystem hdfs = FileSystem.get(URI.create(hdfsURI), configuration);
//FSDataInputStream inputStream = hdfs.open(new Path(hdfsURI+hdfsFileURL));
AvroFSInput avroFSInput = new AvroFSInput(fileContext,new Path(hdfsURI+hdfsFileURL));
DatumReader<GenericRecord> genericRecordDatumReader = new GenericDatumReader<>();
DataFileReader<GenericRecord> fileReader = new DataFileReader<>(avroFSInput, genericRecordDatumReader);
// the schema information can be obtained from the file being read
Schema schema = fileReader.getSchema();
System.out.println("Get Schema Info:"+schema);
GenericRecord genericUser = null;
while (fileReader.hasNext()) {
// pass the reuse object into next() to reduce object creation and GC overhead
genericUser = fileReader.next(genericUser);
byte[] o = ((ByteBuffer)genericUser.get("body")).array();
UserModel userModel = ByteArrayUtils.<UserModel>bytesToObject(o).get();
System.out.println(userModel);
System.out.println(userModel.age);
}
fileReader.close();
}
This retrieves the stored information, though it feels rather cumbersome. The model class used is:
package com.learn.model;
import java.io.Serializable;
public class UserModel implements Serializable{
public String name ;
public Integer age;
private String address;
private String job;
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public Integer getAge() {
return age;
}
public void setAge(Integer age) {
this.age = age;
}
public String getAddress() {
return address;
}
public void setAddress(String address) {
this.address = address;
}
public String getJob() {
return job;
}
public void setJob(String job) {
this.job = job;
}
@Override
public String toString() {
return "UserModel{" +
"name='" + name + '\'' +
", age=" + age +
", address='" + address + '\'' +
", job='" + job + '\'' +
'}';
}
}
Scenario 3 in detail (the important one): in my view scenario 3 is the most suitable for production use. The servers send the information to be logged as objects over avro RPC to Flume, which stores it on HDFS. Unlike scenario 2, the stored data contains only the message content itself, so it can be read directly by a compute engine.
The Flume configuration is as follows:
Option 1: use the serializer class not shipped with Apache Flume, i.e. org.apache.flume.serialization.AvroEventSerializer$Builder, configured as follows:
a1.channels = kafka_channel
a1.sources = r1
a1.sinks = k2
a1.channels.kafka_channel.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kafka_channel.kafka.bootstrap.servers=localhost:9092
a1.channels.kafka_channel.kafka.topic=test
a1.channels.kafka_channel.group.id=flume_group
a1.sources.r1.channels = kafka_channel
a1.sources.r1.type = avro
# add interceptors
a1.sources.r1.interceptors=i1 i2
a1.sources.r1.interceptors.i1.type=timestamp
# the avro schema url must be specified
a1.sources.r1.interceptors.i2.type=static
a1.sources.r1.interceptors.i2.key=flume.avro.schema.url
a1.sources.r1.interceptors.i2.value=hdfs://localhost:9000/flume_schema/UserModel.avsc
a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 41414
##************ k2 uses the Avro Event Serializer: start *************************#
a1.sinks.k2.channel = kafka_channel
a1.sinks.k2.type = hdfs
# partition by day; this requires a timestamp in the event header, i.e. the timestamp interceptor added to the source
a1.sinks.k2.hdfs.path=hdfs://localhost:9000/flume_data/year=%Y/moth=%m/day=%d
a1.sinks.k2.hdfs.fileType=DataStream
a1.sinks.k2.hdfs.writeFormat=Text
a1.sinks.k2.hdfs.fileSuffix=.avro
# use "_" as the in-use file prefix, because Hadoop MapReduce ignores files whose names start with "_"
a1.sinks.k2.hdfs.inUsePrefix=_
a1.sinks.k2.serializer=org.apache.flume.serialization.AvroEventSerializer$Builder
a1.sinks.k2.hdfs.rollSize = 134217728
a1.sinks.k2.hdfs.rollInterval=30
##************ k2 uses the Avro Event Serializer: end *************************#
Note the following points about this configuration:
(1) Two interceptors are added to the avro source. One is a timestamp interceptor, which determines the directory layout of the files stored on HDFS; the other is a static interceptor, which supplies the avro schema of the current messages, a parameter the sink serializer requires (a client-side alternative is sketched after these notes). Both interceptors only add entries to the Flume event headers and do not affect the message body.
(2) Note the serializer class specified on the hdfs sink:
a1.sinks.k2.serializer=org.apache.flume.serialization.AvroEventSerializer$Builder
This is not the class shipped with Flume itself; a separate jar has to be obtained and dropped into the lib directory of the Flume installation. There are two ways to get the jar:
One is to build it from source: https://github.com/cloudera/cdk
The other is to download the pre-built jar: https://repository.cloudera.com/content/repositories/releases/com/cloudera/cdk/cdk-flume-avro-event-serializer/0.9.2/
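As an alternative to the static interceptor, the client could attach the schema header itself when building the event, since this serializer only requires that flume.avro.schema.literal or flume.avro.schema.url be present in the event headers. A minimal sketch, reusing the serializeAvro helper from the client code shown later:
// Hypothetical client-side alternative: set the schema URL header directly on the event
// instead of relying on the static interceptor configured on the source.
Event event = EventBuilder.withBody(serializeAvro(userModel, schema));
event.getHeaders().put("flume.avro.schema.url", "hdfs://localhost:9000/flume_schema/UserModel.avsc");
try {
client.append(event);
} catch (EventDeliveryException e) {
e.printStackTrace();
}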
Option 2: use the serializer class provided by Apache Flume, i.e. org.apache.flume.sink.hdfs.AvroEventSerializer$Builder, configured as follows:
a1.channels = kafka_channel
a1.sources = r1
a1.sinks = k2
a1.channels.kafka_channel.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kafka_channel.kafka.bootstrap.servers=localhost:9092
a1.channels.kafka_channel.kafka.topic=test
a1.channels.kafka_channel.group.id=flume_group
a1.sources.r1.channels = kafka_channel
a1.sources.r1.type = avro
# add an interceptor
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=timestamp
a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 41414
##************ k2 uses the Avro Event Serializer: start *************************#
a1.sinks.k2.channel = kafka_channel
a1.sinks.k2.type = hdfs
# partition by day; this requires a timestamp in the event header, i.e. the timestamp interceptor added to the source
a1.sinks.k2.hdfs.path=hdfs://localhost:9000/flume_data/year=%Y/moth=%m/day=%d
a1.sinks.k2.hdfs.fileType=DataStream
a1.sinks.k2.hdfs.writeFormat=Text
a1.sinks.k2.hdfs.fileSuffix=.avro
# use "_" as the in-use file prefix, because Hadoop MapReduce ignores files whose names start with "_"
a1.sinks.k2.hdfs.inUsePrefix=_
a1.sinks.k2.serializer=org.apache.flume.sink.hdfs.AvroEventSerializer$Builder
a1.sinks.k2.serializer.schemaURL = hdfs://localhost:9000/flume_schema/UserModel.avsc
a1.sinks.k2.hdfs.rollSize = 134217728
a1.sinks.k2.hdfs.rollInterval=30
##************ k2 uses the Avro Event Serializer: end *************************#
This serializer class ships with Apache Flume, so no extra jar is needed.
The avro client code is as follows:
package com.learn.flume;
import com.google.gson.JsonObject;
import com.learn.model.UserModel;
import com.learn.utils.ByteArrayUtils;
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.reflect.ReflectDatumWriter;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.NettyAvroRpcClient;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.event.JSONEvent;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Properties;
public class MyApp {
public static void main(String[] args) throws IOException {
MyRpcClientFacade client = new MyRpcClientFacade();
// Initialize client with the remote Flume agent's host and port
client.init("127.0.0.1", 41414);
// Send 10 events to the remote Flume agent. That agent should be
// configured to listen with an AvroSource.
for (int i = 0; i < 10; i++) {
UserModel userModel = new UserModel();
userModel.setAddress("hangzhou");
userModel.setAge(26);
userModel.setJob("it");
userModel.setName("shenjin");
client.sendObjectDataToFlume(userModel);
}
client.cleanUp();
}
}
class MyRpcClientFacade {
private RpcClient client;
private String hostname;
private int port;
private static Properties p = new Properties();
static {
p.put("client.type", "default");
p.put("hosts", "h1");
p.put("hosts.h1", "127.0.0.1:41414");
p.put("batch-size", 100);
p.put("connect-timeout", 20000);
p.put("request-timeout", 20000);
}
public void init(String hostname, int port) {
// Setup the RPC connection
this.hostname = hostname;
this.port = port;
this.client = RpcClientFactory.getInstance(p);
if (this.client == null) {
System.out.println("init client fail");
}
}
public void sendStringDataToFlume(String data) {
// Create a Flume event that encapsulates the data
Event event = EventBuilder.withBody(data, Charset.forName("UTF-8"));
event.getHeaders().put("kkkk", "aaaaa");
// Send the event
try {
client.append(event);
} catch (EventDeliveryException e) {
// clean up and recreate the client
client.close();
client = null;
client = RpcClientFactory.getDefaultInstance(hostname, port);
e.printStackTrace();
}
}
public void sendObjectDataToFlume(Object data) throws IOException {
InputStream inputStream = ClassLoader.getSystemResourceAsStream("schema/UserModel.avsc");
Schema schema = new Schema.Parser().parse(inputStream);
Event event = EventBuilder.withBody(serializeAvro(data, schema));
try {
client.append(event);
} catch (EventDeliveryException e) {
// clean up and recreate the client
client.close();
client = null;
client = RpcClientFactory.getDefaultInstance(hostname, port);
e.printStackTrace();
}
}
public void cleanUp() {
// Close the RPC connection
client.close();
}
/**
* Use the Avro encoder to produce the byte array; otherwise the data stored on HDFS cannot be read back.
* @param datum
* @param schema
* @return
* @throws IOException
*/
public byte[] serializeAvro(Object datum, Schema schema) throws IOException {
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
// encode the object with Avro's binary encoder against the given schema
ReflectDatumWriter<Object> writer = new ReflectDatumWriter<>(schema);
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(outputStream, null);
writer.write(datum, encoder);
encoder.flush();
return outputStream.toByteArray();
}
}
The biggest difference from scenario 2 is how the Java object is turned into a byte array. Scenario 2 used ordinary Java serialization, whereas here the bytes must be produced with Avro's encoder and the schema before being sent over RPC; otherwise, even though the data lands on HDFS, it cannot be parsed as avro.
The schema for the model class is:
{"namespace": "com.learn.avro",
"type": "record",
"name": "UserModel",
"fields": [
{"name": "name", "type": "string"},
{"name": "address", "type": ["string", "null"]},
{"name": "age", "type": "int"},
{"name": "job", "type": "string"}
]
}
It should be uploaded to HDFS.
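One way to do that from Java is sketched below, using the Hadoop FileSystem API (an equivalent hdfs dfs -put would do just as well; the local file path is a placeholder):
// copy the local UserModel.avsc onto HDFS so the sink/interceptor can reference it
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000/"), conf);
fs.copyFromLocalFile(new Path("UserModel.avsc"), new Path("/flume_schema/UserModel.avsc"));
fs.close();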
Once data has been written to HDFS, the following code reads it back:
public static void read1() throws IOException {
// the schema is already stored in the file header, so no schema needs to be supplied for deserialization
Configuration configuration = new Configuration();
String hdfsURI = "hdfs://localhost:9000/";
String hdfsFileURL = "flume_data/year=2018/moth=02/day=09/FlumeData.1518169360741.avro";
FileContext fileContext = FileContext.getFileContext(URI.create(hdfsURI), configuration);
//FileSystem hdfs = FileSystem.get(URI.create(hdfsURI), configuration);
//FSDataInputStream inputStream = hdfs.open(new Path(hdfsURI+hdfsFileURL));
AvroFSInput avroFSInput = new AvroFSInput(fileContext,new Path(hdfsURI+hdfsFileURL));
DatumReader<GenericRecord> genericRecordDatumReader = new GenericDatumReader<>();
DataFileReader<GenericRecord> fileReader = new DataFileReader<>(avroFSInput, genericRecordDatumReader);
// the schema information can be obtained from the file being read
Schema schema = fileReader.getSchema();
System.out.println("Get Schema Info:"+schema);
GenericRecord genericUser = null;
while (fileReader.hasNext()) {
// pass the reuse object into next() to reduce object creation and GC overhead
genericUser = fileReader.next(genericUser);
System.out.println(genericUser);
}
fileReader.close();
}
Scenario 4 in detail: this scenario uses Flume to upload avro-format files to HDFS so they can be queried later.
The Flume configuration is as follows:
# memory channel called ch1 on agent1
agent1.channels.ch1.type = memory
# source
agent1.sources.spooldir-source1.channels = ch1
agent1.sources.spooldir-source1.type = spooldir
agent1.sources.spooldir-source1.spoolDir=/home/shenjin/data/avro/
agent1.sources.spooldir-source1.basenameHeader = true
agent1.sources.spooldir-source1.deserializer = AVRO
agent1.sources.spooldir-source1.deserializer.schemaType = LITERAL
# sink
agent1.sinks.hdfs-sink1.channel = ch1
agent1.sinks.hdfs-sink1.type = hdfs
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://localhost:9000/flume_data
agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink1.hdfs.fileSuffix = .avro
agent1.sinks.hdfs-sink1.serializer = org.apache.flume.serialization.AvroEventSerializer$Builder
agent1.sinks.hdfs-sink1.hdfs.filePrefix = event
agent1.sinks.hdfs-sink1.hdfs.rollSize = 0
agent1.sinks.hdfs-sink1.hdfs.rollCount = 0
# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = spooldir-source1
agent1.sinks = hdfs-sink1
Things to note about this configuration:
(1) The source type is spooldir, which watches a directory for newly added files; AVRO is used as its deserializer (a sketch of producing such files follows these notes).
(2) The serializer class specified on the sink side is not the one bundled with Apache Flume; see scenario 3 for where to download the jar.
(3) Setting hdfs.rollSize = 0 and hdfs.rollCount = 0 ensures that a file uploaded to HDFS is not split into multiple files.
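To produce the .avro files this source expects, a local writer along the following lines could be used (a sketch only; the staging path and file name are placeholders, and the finished file should be prepared outside the spooling directory and then moved into /home/shenjin/data/avro/ with a unique name, as noted in the scenario overview):
// write an Avro data file locally with the UserModel schema, then move it into the spooling directory
Schema schema = new Schema.Parser().parse(new File("UserModel.avsc"));
DataFileWriter<UserModel> writer = new DataFileWriter<>(new ReflectDatumWriter<UserModel>(schema));
writer.create(schema, new File("/tmp/avro_staging/users-001.avro"));
UserModel user = new UserModel();
user.setName("shenjin");
user.setAge(26);
writer.append(user);
writer.close();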