The previous posts analyzed the Source and MemoryChannel components of Flume; this one looks at the third major component, the Sink. A Sink pulls data from a Channel and delivers it to the next Flume agent or to a destination store such as HDFS.
To analyze the Sink, first look at the definition of the Sink interface:
public interface Sink extends LifecycleAware, NamedComponent {
  /**
   * Sets the channel the sink will consume from.
   */
  public void setChannel(Channel channel);

  /**
   * Returns the channel this sink is connected to.
   */
  public Channel getChannel();

  /**
   * Requests the sink attempt to consume data from its attached channel.
   * This method should consume from the channel within the scope of a
   * transaction: on successful delivery the transaction should be
   * committed, on failure it should be rolled back.
   * Returns READY if one or more events were successfully delivered,
   * or BACKOFF if no data could be retrieved from the channel.
   * Throws EventDeliveryException on any kind of failure delivering
   * data to the next-hop destination.
   */
  public Status process() throws EventDeliveryException;

  public static enum Status {
    READY, BACKOFF
  }
}
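To make the READY/BACKOFF contract concrete, here is a minimal, self-contained sketch. This is not Flume code: the queue stands in for a Channel and the enum mirrors Sink.Status, so the transaction handling is omitted. It only illustrates when process() should report each status.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy illustration of the Sink.process() contract; ToySink and its
// queue-backed "channel" are stand-ins, not Flume classes.
public class ToySink {
  public enum Status { READY, BACKOFF } // mirrors Sink.Status

  private final Queue<String> channel; // stands in for a Flume Channel

  public ToySink(Queue<String> channel) {
    this.channel = channel;
  }

  // READY if at least one event was delivered, BACKOFF if the
  // channel had nothing to take.
  public Status process() {
    String event = channel.poll();
    if (event == null) {
      return Status.BACKOFF;
    }
    System.out.println("delivered: " + event); // "deliver" to the next hop
    return Status.READY;
  }

  public static void main(String[] args) {
    Queue<String> q = new ArrayDeque<>();
    q.add("e1");
    ToySink sink = new ToySink(q);
    System.out.println(sink.process()); // READY — one event delivered
    System.out.println(sink.process()); // BACKOFF — channel is empty
  }
}
```

As the next sections show, the caller (SinkRunner's polling loop) reacts to BACKOFF by sleeping before calling process() again.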
As analyzed earlier, the entry point of a Flume process is the Application class, which starts the Channel, Sink, and Source components in turn. Following the startup code, a Sink component is started through its SinkRunner's start method; this happens when a MonitorRunnable thread invokes lifecycleAware.start(). The Sink-related implementation of that method looks like this:
@Override
public void start() {
  SinkProcessor policy = getPolicy(); // obtain the configured SinkProcessor
  policy.start();
  runner = new PollingRunner();
  runner.policy = policy;
  runner.counterGroup = counterGroup;
  runner.shouldStop = new AtomicBoolean(); // atomically-updated flag, defaults to false
  runnerThread = new Thread(runner);
  runnerThread.setName("SinkRunner-PollingRunner-" +
      policy.getClass().getSimpleName());
  runnerThread.start(); // start the polling thread
  lifecycleState = LifecycleState.START;
}
The start method first obtains a SinkProcessor and starts it, then creates a PollingRunner, sets its fields (policy, counterGroup, shouldStop), wraps it in a thread, and starts that thread, which invokes the runner's run method:
@Override
public void run() {
  logger.debug("Polling sink runner starting");
  while (!shouldStop.get()) {
    try {
      if (policy.process().equals(Sink.Status.BACKOFF)) {
        counterGroup.incrementAndGet("runner.backoffs");
        Thread.sleep(Math.min(
            counterGroup.incrementAndGet("runner.backoffs.consecutive")
                * backoffSleepIncrement, maxBackoffSleep));
      } else {
        counterGroup.set("runner.backoffs.consecutive", 0L);
      }
    } catch (InterruptedException e) {
      ......
    }
  }
  logger.debug("Polling runner exiting. Metrics:{}", counterGroup);
}
}
The run method loops (until shouldStop is set to true), calling the SinkProcessor's process method to do the actual work; whenever process returns BACKOFF, the runner sleeps for an interval that grows with each consecutive backoff, up to a maximum.
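The sleep computation in that loop is simple linear backoff with a cap. A small sketch of the arithmetic (the 1000 ms increment and 5000 ms cap mirror what I believe are SinkRunner's default backoffSleepIncrement and maxBackoffSleep values; check your version's source to confirm):

```java
public class BackoffSleep {
  // Assumed SinkRunner defaults (verify against your Flume version).
  static final long BACKOFF_SLEEP_INCREMENT = 1000; // ms
  static final long MAX_BACKOFF_SLEEP = 5000;       // ms

  // Same formula as the polling loop: linear growth, capped at the maximum.
  static long sleepFor(long consecutiveBackoffs) {
    return Math.min(consecutiveBackoffs * BACKOFF_SLEEP_INCREMENT, MAX_BACKOFF_SLEEP);
  }

  public static void main(String[] args) {
    for (long n = 1; n <= 7; n++) {
      System.out.println(n + " consecutive backoffs -> sleep " + sleepFor(n) + " ms");
    }
  }
}
```

So an idle sink sleeps 1 s, then 2 s, 3 s, 4 s, and from the fifth consecutive backoff on stays at 5 s, until a READY result resets the counter to zero.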
SinkProcessor is the sink processor. So how do the SinkRunner and the SinkProcessor differ? The SinkRunner's job is simply to run the sink (it is the entry point when a sink starts, just as a Source has its SourceRunner), while the SinkProcessor decides which sink should pull events from its channel.
Why is a SinkProcessor needed?
Flume can aggregate sinks into sink groups, and each group can contain one or more sinks. If a sink is not assigned to a group, it is treated as the only member of a group of its own. Flume instantiates one SinkRunner per sink group to run that group, as shown in the Sink component framework diagram:
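As a concrete example, a sink group g1 containing two sinks k1 and k2 with a load-balancing processor can be configured like this (the agent and component names a1, g1, k1, k2 are placeholders; the sinkgroups property keys follow the Flume user guide):

```properties
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = round_robin
```

With this configuration the single SinkRunner for g1 drives a load-balancing SinkProcessor, which alternates between k1 and k2 on each process() call.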
2. DefaultSinkProcessor is Flume's default sink processor class; it accepts only a single sink and passes the result of process through without any extra handling (compared with the first kind).
If no sink group is configured, the default is the process method of DefaultSinkProcessor. Since no extra handling is needed, the code is very simple: it directly calls the sink's process method (that is, the concrete sink defined in the configuration; for a sink writing to HDFS, that is the HDFS sink):
@Override
public Status process() throws EventDeliveryException {
return sink.process();
}
The process method of HDFSEventSink is the core code of the Sink component; it implements the sink's transactional handling of events. Every concrete sink must implement its own process method (the 1.7 release ships with a number of built-in sinks). In HDFSEventSink.java:
/**
 * Pull events out of the channel and send them to HDFS. Take at most
 * batchSize events per transaction. Find the bucket for each event,
 * ensure the file is open, and serialize the data into the file on HDFS.
 * This method is not thread safe.
 */
public Status process() throws EventDeliveryException {
// get the channel this sink is attached to
Channel channel = getChannel();
Transaction transaction = channel.getTransaction(); // get or create the channel transaction
List<BucketWriter> writers = Lists.newArrayList();
transaction.begin(); // begin the transaction
try {
int txnEventCount = 0;
// take up to batchSize events from the channel
for (txnEventCount = 0; txnEventCount < batchSize; txnEventCount++) {
Event event = channel.take();
if (event == null) {
break;
}
// reconstruct the path name by substituting placeholders (%y, %H, ...)
String realPath = BucketPath.escapeString(filePath, event.getHeaders(),
timeZone, needRounding, roundUnit, roundValue, useLocalTime);
String realName = BucketPath.escapeString(fileName, event.getHeaders(),
timeZone, needRounding, roundUnit, roundValue, useLocalTime);
String lookupPath = realPath + DIRECTORY_DELIMITER + realName;
LOG.debug("realPath:"+realPath+" ; realName: "+realName);
LOG.debug("lookupPath: "+lookupPath);
/* filePath config:   hdfs.path = /user/portal/tmp/syx/flume-events/%y-%m-%d/%H%M
 * filePrefix config: hdfs.filePrefix = events
 *
 * With debug output added, this yields:
 * realPath: /user/portal/tmp/syx/flume-events/16-12-17/2110 ; realName: events
 * lookupPath: /user/portal/tmp/syx/flume-events/16-12-17/2110/events
 */
BucketWriter bucketWriter;
HDFSWriter hdfsWriter = null;
// Callback to remove the reference to the bucket writer from the
// sfWriters map so that all buffers used by the HDFS file
// handles are garbage collected.
WriterCallback closeCallback = new WriterCallback() {
@Override
public void run(String bucketPath) {
LOG.info("Writer callback called.");
synchronized (sfWritersLock) {
sfWriters.remove(bucketPath); // remove the mapping for bucketPath from sfWriters
}
}
};
synchronized (sfWritersLock) {
bucketWriter = sfWriters.get(lookupPath);
// we haven't seen this file yet, so open it and cache the handle
if (bucketWriter == null) {
// choose the HDFSWriter matching fileType: SequenceFile, DataStream or CompressedStream
hdfsWriter = writerFactory.getWriter(fileType);
// initializeBucketWriter does what its name says: initialize a BucketWriter
bucketWriter = initializeBucketWriter(realPath, realName,
lookupPath, hdfsWriter, closeCallback);
sfWriters.put(lookupPath, bucketWriter);
}
}
// track the buckets getting written in this transaction
if (!writers.contains(bucketWriter)) {
writers.add(bucketWriter);
}
// write the data to HDFS
try {
bucketWriter.append(event);
} catch (BucketClosedException ex) {
LOG.info("Bucket was closed while trying to append, " +
"reinitializing bucket and writing event.");
hdfsWriter = writerFactory.getWriter(fileType);
bucketWriter = initializeBucketWriter(realPath, realName,
lookupPath, hdfsWriter, closeCallback); // create a new BucketWriter from the given parameters
synchronized (sfWritersLock) {
sfWriters.put(lookupPath, bucketWriter);
}
bucketWriter.append(event);
}
}
if (txnEventCount == 0) {
sinkCounter.incrementBatchEmptyCount();
} else if (txnEventCount == batchSize) {
sinkCounter.incrementBatchCompleteCount();
} else {
sinkCounter.incrementBatchUnderflowCount();
}
// flush all pending buckets before committing the transaction
for (BucketWriter bucketWriter : writers) {
bucketWriter.flush();
}
transaction.commit(); // commit the transaction
if (txnEventCount < 1) {
return Status.BACKOFF;
} else {
sinkCounter.addToEventDrainSuccessCount(txnEventCount);
return Status.READY;
}
} catch (IOException eIO) {
transaction.rollback(); // an exception occurred: roll back the transaction
LOG.warn("HDFS IO error", eIO);
return Status.BACKOFF;
} catch (Throwable th) {
transaction.rollback();
LOG.error("process failed", th);
if (th instanceof Error) {
throw (Error) th;
} else {
throw new EventDeliveryException(th);
}
} finally {
transaction.close(); // close the transaction
}
}
The main job of this process method is transactional handling of events: begin a transaction, take events from the corresponding Channel, and write them to HDFS. Writing opens (or creates) a temporary file (by default with a .tmp suffix) in the directory derived from the configured path, appends the events to it, and, once a time or size threshold is reached, renames the file to its final name; the transaction is then committed, or rolled back if an exception occurred, and finally closed. All file operations in this flow go through HDFS file APIs (open, mkdir, rename, and so on). Writing to HDFS also involves details such as file compression and Hadoop replica handling, which are not covered here.
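The open-write-rename lifecycle described above can be sketched against a local filesystem. This is an illustration only: java.nio.file stands in for the HDFS FileSystem API, and the .tmp suffix mimics the in-use suffix BucketWriter applies while a file is still being written; TmpRenameSketch and writeAndRoll are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TmpRenameSketch {
  // Write events to <name>.tmp, then rename to <name> on close — the same
  // pattern the HDFS sink uses so downstream readers never see a file that
  // is still being appended to.
  static Path writeAndRoll(Path dir, String name, Iterable<String> events) throws IOException {
    Files.createDirectories(dir);          // like FileSystem.mkdirs on HDFS
    Path tmp = dir.resolve(name + ".tmp"); // in-use file, default suffix .tmp
    Path fin = dir.resolve(name);
    try (var out = Files.newBufferedWriter(tmp)) {
      for (String e : events) {
        out.write(e);
        out.newLine();
      }
    }
    Files.move(tmp, fin);                  // like FileSystem.rename on roll
    return fin;
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("flume-sketch");
    Path f = writeAndRoll(dir, "events", java.util.List.of("e1", "e2"));
    System.out.println(Files.readAllLines(f)); // [e1, e2]
  }
}
```

In the real sink the roll is triggered by hdfs.rollInterval, hdfs.rollSize, or hdfs.rollCount rather than by closing a writer explicitly, but the tmp-then-rename shape is the same.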
This concludes the overview of how a Sink writes to HDFS.