Flume 使用场景记录

Flume用来收集日志信息,这里记录以下使用场景:


场景一:使用 avro source、memory channel、logger sink,将收集到的日志打印到标准输出,适合测试。


场景二:使用 avro source、kafka channel、hdfs sink,将日志以 "Flume Event" Avro Event Serializer 的形式保存在 hdfs 上。这种方式生成的 .avro 文件中,每一条记录都包含 headers 和 body 两部分:headers 是 flume event 的头部信息,body 是一个 bytes 类型的数组,即真正需要的数据。读取这种方式生成的 .avro 文件,提取出的 schema 如下:

{"type":"record","name":"Event","fields":[{"name":"headers","type":{"type":"map","values":"string"}},{"name":"body","type":"bytes"}]}


场景三:使用 avro source、kafka channel、hdfs sink,将日志以 Avro Event Serializer 的形式保存,即希望保存后的 .avro 文件能够被直接读取用来做统计计算。根据所选序列化类的不同,有两种配置方法:

    第一种:使用 Cloudera CDK 提供的 org.apache.flume.serialization.AvroEventSerializer$Builder 序列化类将消息序列化到 hdfs 上。这个序列化器要求在 flume event 的 headers 中包含 "flume.avro.schema.literal" 或 "flume.avro.schema.url" 来指定数据所使用的 avro schema,否则数据将无法保存,flume 会报错。使用这种配置需要注意:

  (1) sink 端的序列化器由 cdk 提供,GitHub 上有源码,可以自己下载编译(我编译失败了),也可以直接下载 jar:https://repository.cloudera.com/content/repositories/releases/com/cloudera/cdk/cdk-flume-avro-event-serializer/0.9.2/


第二种:使用 apache flume 自带的 org.apache.flume.sink.hdfs.AvroEventSerializer$Builder 序列化类保存数据到 hdfs 上。这个序列化类不要求在 event headers 中包含 avro schema 信息,但需要在 sink 端通过 serializer.schemaURL 指定 avro schema 的地址。

要让这两种配置生效,还需要注意以下一点:

    (1)使用 avro client 发送数据时,非 String 数据(例如常用的对象数据)需要先转化成字节数组才能发送。这里需要注意的是不能把 java 对象直接(用 java 序列化)转化为字节数组,而应该使用 avro 提供的 api 按照给定的 schema 进行转化,否则从 hdfs 读取 avro 格式的文件时将无法使用,因为 avro 无法对其解码(decode)。


场景四:使用 Spooling Directory Source、memory channel、hdfs sink 监控生成的 avro 文件,相当于把 avro 文件从源服务器上传到目标服务器,但需要注意以下几点:

    (1)应该把要上传的 avro 文件在非 Spooling Directory 目录下整理完成后,再移动到 Spooling Directory,否则 agent 将会出现错误。

    (2)当 Spooling Directory 目录下的文件已经被 flume 读取后,对它的任何更改都是无效的。

    (3)放到 Spooling Directory 目录下的所有文件命名要唯一,否则会引发错误。



下面详细记录每种场景下的配置及相关的java 代码:

场景一详解:

   这种方式是官网提供的配置及demo,它能让使用者快速感受到flume的作用。

    配置如下:

a1.channels = c1
a1.sources = r1
a1.sinks = k1

a1.channels.c1.type = memory

a1.sources.r1.channels = c1
a1.sources.r1.type = avro

a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 41414

a1.sinks.k1.channel = c1
a1.sinks.k1.type = logger

 发送avro事件的客户端如下:

package com.learn.flume;

import com.learn.model.UserModel;
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.reflect.ReflectDatumWriter;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Properties;

public class SendAvroClient {
    public static void main(String[] args) throws IOException {
        RpcClientFacade client = new RpcClientFacade();
        // Initialize client with the remote Flume agent's host and port
        client.init("127.0.0.1", 41414);

        // Send 10 events to the remote Flume agent. That agent should be
        // configured to listen with an AvroSource.
        String sampleData = "china";
        for (int i = 0; i < 10; i++) {

            client.sendStringDataToFlume(sampleData);
        }
        
        client.cleanUp();
    }
}

class RpcClientFacade {
    private RpcClient client;
    private String hostname;
    private int port;
    private static Properties p= new Properties();

    static {
        p.put("client.type","default");
        p.put("hosts","h1");
        p.put("hosts.h1","127.0.0.1:41414");
        p.put("batch-size",100);
        p.put("connect-timeout",20000);
        p.put("request-timeout",20000);
    }

    public void init(String hostname, int port) {
        // Setup the RPC connection
        this.hostname = hostname;
        this.port = port;
        this.client = RpcClientFactory.getInstance(p);
        if (this.client == null) {
            System.out.println("init client fail");
        }
       // this.client = RpcClientFactory.getInstance(hostname, port);
        // Use the following method to create a thrift client (instead of the above line):
        // this.client = RpcClientFactory.getThriftInstance(hostname, port);
    }

    public void sendStringDataToFlume(String data) {
        // Create a Flume Event with the given string body
        Event event = EventBuilder.withBody(data, Charset.forName("UTF-8"));
        //event.getHeaders().put("kkkk","aaaaa");
        // Send the event
        try {
            client.append(event);
        } catch (EventDeliveryException e) {
            // clean up and recreate the client
            client.close();
            client = null;
            client =  RpcClientFactory.getDefaultInstance(hostname, port);
            // Use the following method to create a thrift client (instead of the above line):
            // this.client = RpcClientFactory.getThriftInstance(hostname, port);
            e.printStackTrace();
        }
    }

    public void cleanUp() {
        // Close the RPC connection
        client.close();
    }
}

这里利用 Netty 客户端(NettyAvroRpcClient)发送 flume 事件,示例中发送的消息体是 String 类型,它最终以字节的形式在 Flume 中传递。


场景二详解:场景二使用 "Flume Event" Avro Event Serializer 作为序列化器,最终保存在 hdfs 上的数据除了数据本身内容外,还包括 Flume event 的 headers 部分。

下面是其在Flume中的配置:

a1.channels = kafka_channel
a1.sources = r1
a1.sinks = k1

a1.channels.kafka_channel.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kafka_channel.kafka.bootstrap.servers=localhost:9092
a1.channels.kafka_channel.kafka.topic=test
a1.channels.kafka_channel.group.id=flume_group

a1.sources.r1.channels = kafka_channel
a1.sources.r1.type = avro
#添加拦截器
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=timestamp

a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 41414

##************k1 使用 "Flume Event" Avro Event Serializer start*************************#
a1.sinks.k1.channel = kafka_channel
a1.sinks.k1.type = hdfs
#选用默认的分区粒度
##a1.sinks.k1.hdfs.path=hdfs://localhost:9000/flume_data
#选择天作为分区粒度,这里需要启用 header 中的 timestamp,即给 source 添加 timestamp 拦截器
a1.sinks.k1.hdfs.path=hdfs://localhost:9000/flume_data/year=%Y/moth=%m/day=%d
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.fileSuffix=.avro
#set in use file prefix is "_",because hadoop mapreduce will ignore those files start with "_" prefix.
a1.sinks.k1.hdfs.inUsePrefix=_
a1.sinks.k1.serializer=avro_event
##************k1 使用 "Flume Event" Avro Event Serializer end*************************#

avro 客户端代码:

package com.learn.flume;

import com.learn.model.UserModel;
import com.learn.utils.ByteArrayUtils;
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.reflect.ReflectDatumWriter;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Properties;

public class SendAvroClient {
    public static void main(String[] args) throws IOException {
        RpcClientFacade client = new RpcClientFacade();
        // Initialize client with the remote Flume agent's host and port
        client.init("127.0.0.1", 41414);

        for (int i = 0; i < 10; i++) {
            UserModel userModel = new UserModel();
            userModel.setAddress("hangzhou");
            userModel.setAge(26);
            userModel.setJob("it");
            userModel.setName("shenjin");
            client.sendObjectDataToFlume(userModel);
        }

        client.cleanUp();
    }
}

class RpcClientFacade {
    private RpcClient client;
    private String hostname;
    private int port;

    private static Properties p = new Properties();

    static {
        p.put("client.type", "default");
        p.put("hosts", "h1");
        p.put("hosts.h1", "127.0.0.1:41414");
        p.put("batch-size", 100);
        p.put("connect-timeout", 20000);
        p.put("request-timeout", 20000);
    }

    public void init(String hostname, int port) {
        // Setup the RPC connection
        this.hostname = hostname;
        this.port = port;

        this.client = RpcClientFactory.getInstance(p);
        if (this.client == null) {
            System.out.println("init client fail");
        }
    }

    public void sendStringDataToFlume(String data) {
        // Create a Flume Event with the given string body
        Event event = EventBuilder.withBody(data, Charset.forName("UTF-8"));
        // Send the event
        try {
            client.append(event);
        } catch (EventDeliveryException e) {
            // clean up and recreate the client
            client.close();
            client = null;
            client = RpcClientFactory.getDefaultInstance(hostname, port);
            e.printStackTrace();
        }
    }

    public void sendObjectDataToFlume(Object data) throws IOException {


        Event event = EventBuilder.withBody(ByteArrayUtils.objectToBytes(data).get());
        // Send the event
        try {
            client.append(event);
        } catch (EventDeliveryException e) {
            // clean up and recreate the client
            client.close();
            client = null;
            client = RpcClientFactory.getDefaultInstance(hostname, port);
            // Use the following method to create a thrift client (instead of the above line):
            // this.client = RpcClientFactory.getThriftInstance(hostname, port);
            e.printStackTrace();
        }
    }
     public void cleanUp() {
        // Close the RPC connection
        client.close();
    }
}

这个客户端是在场景一的基础上改动的,不同的是提供了 sendObjectDataToFlume(Object data) 方法用来发送 java 对象,而不是简单的 String 类型消息。这就需要先把 Object 转换为字节数组,再调用客户端提供的方法发送消息,于是编写了 java 对象与字节数组之间相互转换的工具类,如下:

package com.learn.utils;

import java.io.*;
import java.util.Optional;

public class ByteArrayUtils {

    /**
     * java 对象转字节数组
     * @param obj
     * @param <T> 对象类型
     * @return
     */
    public static <T> Optional<byte[]> objectToBytes(T obj){
        byte[] bytes = null;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ObjectOutputStream sOut;
        try {
            sOut = new ObjectOutputStream(out);
            sOut.writeObject(obj);
            sOut.flush();
            bytes= out.toByteArray();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return Optional.ofNullable(bytes);
    }

    /**
     * 字节数组转 java 对象
     * @param bytes
     * @param <T> 目标对象类型
     * @return
     */
    public static <T> Optional<T> bytesToObject(byte[] bytes) {
        T t = null;
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        ObjectInputStream sIn;
        try {
            sIn = new ObjectInputStream(in);
            t = (T)sIn.readObject();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return Optional.ofNullable(t);

    }
}

这样消息能够顺利发送并以avro格式存储在Hdfs,然后使用avro提供的api读取文件的内容,并且反序列化为对象:

public static void read2() throws IOException {
        //数据的schema已经存在文件的头部,反序列化时不需要指定schema
        Configuration configuration = new Configuration();
        String hdfsURI = "hdfs://localhost:9000/";
        String hdfsFileURL = "flume_data/year=2018/moth=02/day=07/FlumeData.1517970870974.avro";
        FileContext fileContext = FileContext.getFileContext(URI.create(hdfsURI), configuration);
        //FileSystem hdfs = FileSystem.get(URI.create(hdfsURI), configuration);
        //FSDataInputStream inputStream = hdfs.open(new Path(hdfsURI+hdfsFileURL));
        AvroFSInput avroFSInput  = new AvroFSInput(fileContext,new Path(hdfsURI+hdfsFileURL));
        DatumReader<GenericRecord> genericRecordDatumReader = new GenericDatumReader<>();
        DataFileReader<GenericRecord> fileReader = new DataFileReader<>(avroFSInput, genericRecordDatumReader);
        //可以从读取的文件中获取schema信息
        Schema schema = fileReader.getSchema();
        System.out.println("Get Schema Info:"+schema);
        GenericRecord genericUser = null;
        while (fileReader.hasNext()) {
            //这里将需要读取的对象传入next方法中,用来减小创建对象的开销和垃圾回收
            genericUser = fileReader.next(genericUser);
            byte[] o = ((ByteBuffer)genericUser.get("body")).array();
            UserModel userModel = ByteArrayUtils.<UserModel>bytesToObject(o).get();
            System.out.println(userModel);
            System.out.println(userModel.age);
        }
        fileReader.close();
    }

这样就可以获取存储的信息,但是感觉很麻烦。使用的model如下:

package com.learn.model;

import java.io.Serializable;

public class UserModel implements Serializable{

    public String name ;

    public Integer age;

    private String address;

    private String job;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public Integer getAge() {
        return age;
    }

    public void setAge(Integer age) {
        this.age = age;
    }

    public String getAddress() {
        return address;
    }

    public void setAddress(String address) {
        this.address = address;
    }

    public String getJob() {
        return job;
    }

    public void setJob(String job) {
        this.job = job;
    }

    @Override
    public String toString() {
        return "UserModel{" +
                "name='" + name + '\'' +
                ", age=" + age +
                ", address='" + address + '\'' +
                ", job='" + job + '\'' +
                '}';
    }
}


场景三详解(重点):个人认为场景三比较适合在生产环境使用:由服务器把需要记录的信息以对象的形式通过 avro RPC 发送到 flume,然后存储在 hdfs 上;相对于场景二,保存的数据只包含消息内容本身,可以直接被计算引擎读取计算。

Flume 配置如下:

方式一:使用非 apache flume 自带的序列化类,即 Cloudera CDK 提供的 org.apache.flume.serialization.AvroEventSerializer$Builder 类做序列化,配置如下:

a1.channels = kafka_channel
a1.sources = r1
a1.sinks = k2

a1.channels.kafka_channel.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kafka_channel.kafka.bootstrap.servers=localhost:9092
a1.channels.kafka_channel.kafka.topic=test
a1.channels.kafka_channel.group.id=flume_group

a1.sources.r1.channels = kafka_channel
a1.sources.r1.type = avro
#添加拦截器
a1.sources.r1.interceptors=i1 i2
a1.sources.r1.interceptors.i1.type=timestamp
#必须指定avro schema url
a1.sources.r1.interceptors.i2.type=static
a1.sources.r1.interceptors.i2.key=flume.avro.schema.url
a1.sources.r1.interceptors.i2.value=hdfs://localhost:9000/flume_schema/UserModel.avsc

a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 41414

##************k2 使用  Avro Event Serializer start*************************#
a1.sinks.k2.channel = kafka_channel
a1.sinks.k2.type = hdfs
#选择天作为分区粒度,这里需要启用 header 中的 timestamp,即给 source 添加 timestamp 拦截器
a1.sinks.k2.hdfs.path=hdfs://localhost:9000/flume_data/year=%Y/moth=%m/day=%d
a1.sinks.k2.hdfs.fileType=DataStream
a1.sinks.k2.hdfs.writeFormat=Text
a1.sinks.k2.hdfs.fileSuffix=.avro
#set in use file prefix is "_",because hadoop mapreduce will ignore those files start with "_" prefix.
a1.sinks.k2.hdfs.inUsePrefix=_
a1.sinks.k2.serializer=org.apache.flume.serialization.AvroEventSerializer$Builder
a1.sinks.k2.hdfs.rollSize = 134217728
a1.sinks.k2.hdfs.rollInterval=30
##************k2 使用  Avro Event Serializer end*************************#

这个配置中需要注意的有以下几点:

    (1)在 avro source 中添加了两个拦截器:一个是 timestamp 类型,用来决定文件在 hdfs 上存储的目录结构;另一个是 static 拦截器,用来指定当前消息对应的 avro schema 地址,这是 sink 端序列化器需要的参数(也可以在客户端直接把该值写入 event headers,见下文示例)。这两个拦截器只是把信息加到 flume event 的 headers 中,并不影响消息体。

(2)在 hdfs sink 中需要注意指定的序列化类

a1.sinks.k2.serializer=org.apache.flume.serialization.AvroEventSerializer$Builder

并不是 Flume 自带的那个类(注意包名是 org.apache.flume.serialization,而不是自带的 org.apache.flume.sink.hdfs),需要额外下载 jar 包,下载完成后把它放到 flume 安装目录的 lib 目录下。jar 包的获取方式有两种:

一是源码自己编译:https://github.com/cloudera/cdk

二是下载已经编译好的jar:https://repository.cloudera.com/content/repositories/releases/com/cloudera/cdk/cdk-flume-avro-event-serializer/0.9.2/
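
除了在 source 端用 static 拦截器注入 schema 地址外,也可以在发送端直接把 "flume.avro.schema.url" 写入 event headers。下面是一个示意方法(假设加在下文的 MyRpcClientFacade 类中,复用其中的 client 字段和 serializeAvro 方法,schema 路径沿用上面配置中的值,方法名为假设):

    //示意:客户端直接在 event headers 中携带 schema 地址,替代 source 端的 static 拦截器
    public void sendObjectDataWithSchemaHeader(Object data) throws IOException {
        InputStream inputStream = ClassLoader.getSystemResourceAsStream("schema/UserModel.avsc");
        Schema schema = new Schema.Parser().parse(inputStream);

        Event event = EventBuilder.withBody(serializeAvro(data, schema));
        //CDK 的 AvroEventSerializer 会从这个 header 中读取 schema 地址
        event.getHeaders().put("flume.avro.schema.url",
                "hdfs://localhost:9000/flume_schema/UserModel.avsc");
        try {
            client.append(event);
        } catch (EventDeliveryException e) {
            e.printStackTrace();
        }
    }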


方式二:使用apache flume提供的序列化类,即使用org.apache.flume.sink.hdfs.AvroEventSerializer$Builder类序列化,配置如下:

a1.channels = kafka_channel
a1.sources = r1
a1.sinks = k2

a1.channels.kafka_channel.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kafka_channel.kafka.bootstrap.servers=localhost:9092
a1.channels.kafka_channel.kafka.topic=test
a1.channels.kafka_channel.group.id=flume_group

a1.sources.r1.channels = kafka_channel
a1.sources.r1.type = avro
#添加拦截器
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=timestamp

a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 41414

##************k2 使用  Avro Event Serializer start*************************#
a1.sinks.k2.channel = kafka_channel
a1.sinks.k2.type = hdfs
#选择天作为分区粒度,这里需要启用 header 中的 timestamp,即给 source 添加 timestamp 拦截器
a1.sinks.k2.hdfs.path=hdfs://localhost:9000/flume_data/year=%Y/moth=%m/day=%d
a1.sinks.k2.hdfs.fileType=DataStream
a1.sinks.k2.hdfs.writeFormat=Text
a1.sinks.k2.hdfs.fileSuffix=.avro
#set in use file prefix is "_",because hadoop mapreduce will ignore those files start with "_" prefix.
a1.sinks.k2.hdfs.inUsePrefix=_
a1.sinks.k2.serializer=org.apache.flume.sink.hdfs.AvroEventSerializer$Builder
a1.sinks.k2.serializer.schemaURL = hdfs://localhost:9000/flume_schema/UserModel.avsc
a1.sinks.k2.hdfs.rollSize = 134217728
a1.sinks.k2.hdfs.rollInterval=30
##************k2 使用  Avro Event Serializer end*************************#

这个序列化类是apache flume提供的,不需要额外的jar包。


Avro 客户端的代码如下:

package com.learn.flume;

import com.google.gson.JsonObject;
import com.learn.model.UserModel;
import com.learn.utils.ByteArrayUtils;
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.reflect.ReflectDatumWriter;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.NettyAvroRpcClient;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.event.JSONEvent;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Properties;

public class MyApp {
    public static void main(String[] args) throws IOException {
        MyRpcClientFacade client = new MyRpcClientFacade();
        // Initialize client with the remote Flume agent's host and port
        client.init("127.0.0.1", 41414);


        // Send 10 events to the remote Flume agent. That agent should be
        // configured to listen with an AvroSource.
        for (int i = 0; i < 10; i++) {
            UserModel userModel = new UserModel();
            userModel.setAddress("hangzhou");
            userModel.setAge(26);
            userModel.setJob("it");
            userModel.setName("shenjin");
            client.sendObjectDataToFlume(userModel);
        }

        client.cleanUp();
    }
}

class MyRpcClientFacade {
    private RpcClient client;
    private String hostname;
    private int port;

    private static Properties p = new Properties();

    static {
        p.put("client.type", "default");
        p.put("hosts", "h1");
        p.put("hosts.h1", "127.0.0.1:41414");
        p.put("batch-size", 100);
        p.put("connect-timeout", 20000);
        p.put("request-timeout", 20000);
    }

    public void init(String hostname, int port) {
        // Setup the RPC connection
        this.hostname = hostname;
        this.port = port;

        this.client = RpcClientFactory.getInstance(p);
        if (this.client == null) {
            System.out.println("init client fail");
        }
    }

    public void sendStringDataToFlume(String data) {
        // Create a Flume Event with the given string body
        Event event = EventBuilder.withBody(data, Charset.forName("UTF-8"));
        event.getHeaders().put("kkkk", "aaaaa");
        // Send the event
        try {
            client.append(event);
        } catch (EventDeliveryException e) {
            // clean up and recreate the client
            client.close();
            client = null;
            client = RpcClientFactory.getDefaultInstance(hostname, port);
            e.printStackTrace();
        }
    }

    public void sendObjectDataToFlume(Object data) throws IOException {
        InputStream inputStream = ClassLoader.getSystemResourceAsStream("schema/UserModel.avsc");
        Schema schema = new Schema.Parser().parse(inputStream);

        Event event = EventBuilder.withBody(serializeAvro(data, schema));
        try {
            client.append(event);
        } catch (EventDeliveryException e) {
            // clean up and recreate the client
            client.close();
            client = null;
            client = RpcClientFactory.getDefaultInstance(hostname, port);
            e.printStackTrace();
        }
    }


    public void cleanUp() {
        // Close the RPC connection
        client.close();
    }

    /**
     * 使用 avro 提供的编码器按 schema 生成字节数组,否则存到 hdfs 上的数据将无法被读取
     * @param datum
     * @param schema
     * @return
     * @throws IOException
     */
    public byte[] serializeAvro(Object datum, Schema schema) throws IOException {
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        ReflectDatumWriter<Object> writer = new ReflectDatumWriter<>(schema);
        BinaryEncoder binaryEncoder = EncoderFactory.get().binaryEncoder(outputStream, null);
        outputStream.reset();
        writer.write(datum, binaryEncoder);
        binaryEncoder.flush();
        return outputStream.toByteArray();
    }

}
 
  

这里与场景二最大的不同是 java 对象转字节数组的方式:场景二使用的是常规的 java 序列化,而这里需要结合 avro 提供的编码器和 schema 来生成对象的字节数组,然后通过 RPC 发送;否则即使数据存储到了 hdfs 上,也无法用 avro 的方式解析。
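
为了验证这种编码方式,可以在本地用同一份 schema 把 serializeAvro 得到的字节数组再解码回 GenericRecord;如果字节数组不是用 avro 编码器按 schema 生成的,这一步就会解码失败。下面是一个示意方法(非项目中已有代码,仅作草稿,用到 org.apache.avro.generic 和 org.apache.avro.io 包下的类):

    /**
     * 示意:用与编码时相同的 schema 将 avro 二进制字节数组解码为 GenericRecord,
     * 可用来在发送前验证 serializeAvro 的输出能否被 avro 正常解码
     */
    public static GenericRecord deserializeAvro(byte[] bytes, Schema schema) throws IOException {
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
        return reader.read(null, decoder);
    }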

对应类的schema如下:

{"namespace": "com.learn.avro",
 "type": "record",
 "name": "UserModel",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "address",  "type": ["string", "null"]},
     {"name": "age", "type": "int"},
     {"name": "job", "type": "string"}
 ]
}

这个 schema 文件(UserModel.avsc)需要先上传到 hdfs 上,路径要与配置中的 flume.avro.schema.url / serializer.schemaURL 保持一致。
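
上传可以直接用 hdfs dfs -put 命令完成,也可以用 Hadoop 的 Java API。下面是一个示意(本地路径 /tmp/UserModel.avsc 为假设值,目标路径沿用配置中 flume_schema 目录下的地址):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;

public class UploadSchema {
    public static void main(String[] args) throws IOException {
        Configuration configuration = new Configuration();
        //把本地的 UserModel.avsc 拷贝到配置中引用的 hdfs 目录下
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000/"), configuration)) {
            fs.copyFromLocalFile(new Path("/tmp/UserModel.avsc"),
                    new Path("hdfs://localhost:9000/flume_schema/UserModel.avsc"));
        }
    }
}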

数据保存到hdfs上后,使用下面的代码来读取数据:

public static void read1() throws IOException {
        //数据的schema已经存在文件的头部,反序列化时不需要指定schema
        Configuration configuration = new Configuration();
        String hdfsURI = "hdfs://localhost:9000/";
        String hdfsFileURL = "flume_data/year=2018/moth=02/day=09/FlumeData.1518169360741.avro";
        FileContext fileContext = FileContext.getFileContext(URI.create(hdfsURI), configuration);
        //FileSystem hdfs = FileSystem.get(URI.create(hdfsURI), configuration);
        //FSDataInputStream inputStream = hdfs.open(new Path(hdfsURI+hdfsFileURL));
        AvroFSInput avroFSInput  = new AvroFSInput(fileContext,new Path(hdfsURI+hdfsFileURL));
        DatumReader<GenericRecord> genericRecordDatumReader = new GenericDatumReader<>();
        DataFileReader<GenericRecord> fileReader = new DataFileReader<>(avroFSInput, genericRecordDatumReader);
        //可以从读取的文件中获取schema信息
        Schema schema = fileReader.getSchema();
        System.out.println("Get Schema Info:"+schema);
        GenericRecord genericUser = null;
        while (fileReader.hasNext()) {
            //这里将需要读取的对象传入next方法中,用来减小创建对象的开销和垃圾回收
            genericUser = fileReader.next(genericUser);
            System.out.println(genericUser);
        }
        fileReader.close();
    }


场景四详解:这个场景解决的问题是使用Flume 上传avro 格式的文件到hdfs,以便后续的查询等工作。

其中Flume 的配置如下:

# memory channel called ch1 on agent1
agent1.channels.ch1.type = memory

# source
agent1.sources.spooldir-source1.channels = ch1
agent1.sources.spooldir-source1.type = spooldir
agent1.sources.spooldir-source1.spoolDir=/home/shenjin/data/avro/
agent1.sources.spooldir-source1.basenameHeader = true
agent1.sources.spooldir-source1.deserializer = AVRO
agent1.sources.spooldir-source1.deserializer.schemaType = LITERAL

# sink
agent1.sinks.hdfs-sink1.channel = ch1
agent1.sinks.hdfs-sink1.type = hdfs

agent1.sinks.hdfs-sink1.hdfs.path = hdfs://localhost:9000/flume_data
agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink1.hdfs.fileSuffix = .avro
agent1.sinks.hdfs-sink1.serializer =  org.apache.flume.serialization.AvroEventSerializer$Builder

agent1.sinks.hdfs-sink1.hdfs.filePrefix = event
agent1.sinks.hdfs-sink1.hdfs.rollSize = 0
agent1.sinks.hdfs-sink1.hdfs.rollCount = 0

# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = spooldir-source1
agent1.sinks = hdfs-sink1

这个配置需要注意的地方有:

    (1)source 的 type 为 spooldir,用来监控某个目录下的新增文件,并使用 AVRO 反序列化器(deserializer = AVRO)解析其中的 avro 内容。

    (2)sink 端 serializer 指定的序列化类不是 apache flume 自带的,jar 包的下载方式请参考场景三。

    (3)hdfs.rollSize = 0、hdfs.rollCount = 0 的配置保证文件上传到 hdfs 时不会被切割成多个文件。
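
结合上面的注意事项,生成 avro 文件时可以先写到 spoolDir 之外的临时目录,写完后再整体移动进去,并保证文件名唯一。下面是一个示意(基于前文的 UserModel schema,临时目录和文件名为假设值,spoolDir 沿用上面配置中的路径):

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

public class WriteAvroToSpoolDir {
    public static void main(String[] args) throws IOException {
        InputStream in = ClassLoader.getSystemResourceAsStream("schema/UserModel.avsc");
        Schema schema = new Schema.Parser().parse(in);

        //先写到临时目录,避免 Spooling Directory Source 读到未写完的文件
        File tmpFile = new File("/home/shenjin/data/tmp/user_20180209.avro");
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, tmpFile);
            GenericRecord record = new GenericData.Record(schema);
            record.put("name", "shenjin");
            record.put("address", "hangzhou");
            record.put("age", 26);
            record.put("job", "it");
            writer.append(record);
        }

        //写完后再移动到 spoolDir,文件名需要唯一
        //(假设临时目录与 spoolDir 在同一文件系统上,ATOMIC_MOVE 才能生效)
        Files.move(tmpFile.toPath(),
                new File("/home/shenjin/data/avro/user_20180209.avro").toPath(),
                StandardCopyOption.ATOMIC_MOVE);
    }
}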


