Flink 1.9 Series - StreamingFileSink vs BucketingSink

After working through the following two articles, we are ready to start writing our own Flink project code:
1. Flink 1.9 Series - Building the CDH Version from Source
2. Flink 1.9 Series - Flink on Yarn Configuration

1. Flink Project Code Structure

Before we start, let's skim the official documentation (Flink 1.9 doc). In the programming-model section there is a simple Flink demo, much like the WordCount example shipped with the Flink source code. From that demo we can see that a Flink project can roughly be split into two parts (a minimal sketch follows the list):

  1. source
  2. sink
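
For orientation, here is a minimal sketch of that source-to-sink shape. The socket source and print sink are placeholders purely for illustration, not anything specific to this series:

import org.apache.flink.streaming.api.scala._

object MinimalJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    env.socketTextStream("localhost", 9999)   // 1. source
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .keyBy(0)
      .sum(1)
      .print()                                // 2. sink

    env.execute("minimal source -> sink job")
  }
}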

The StreamingFileSink and BucketingSink we cover this time both belong to the sink side, and together they form one of its pillars. Why call two classes a single pillar? Because, historically, BucketingSink is the ancestor of StreamingFileSink, while StreamingFileSink is more like a child that is still growing up: it has plenty of issues, but a bright future.

Or perhaps you have run into the following error and have no idea how to fix it:

java.lang.UnsupportedOperationException: Recoverable writers on Hadoop are only supported for HDFS and for Hadoop version 2.7 or newer
	at org.apache.flink.runtime.fs.hdfs.HadoopRecoverableWriter.<init>(HadoopRecoverableWriter.java:57)
	at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.createRecoverableWriter(HadoopFileSystem.java:202)
	at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.createRecoverableWriter(SafetyNetWrapperFileSystem.java:69)
	at org.apache.flink.streaming.api.functions.sink.filesystem.Buckets.<init>(Buckets.java:112)
	at org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink$RowFormatBuilder.createBuckets(StreamingFileSink.java:242)
	at org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink.initializeState(StreamingFileSink.java:327)
	at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
	at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:281)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:878)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:392)
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
	at java.lang.Thread.run(Thread.java:748)

OK, let's get down to business.

2. BucketingSink

Let's first look at a usage demo:

import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink

val bucketingsink = new BucketingSink[(String, String)](basePath)
bucketingsink.setBucketer(new KeyBucket())              // custom Bucketer: picks the bucket directory per element
bucketingsink.setWriter(new Tuple_2Writer())            // custom Writer: serializes each element into the part file
bucketingsink.setBatchSize(1024 * 1024 * 20)            // roll the part file once it reaches 20 MB
bucketingsink.setBatchRolloverInterval(20 * 60 * 1000)  // ... or once it is 20 minutes old
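
KeyBucket and Tuple_2Writer above are custom classes, and their exact implementation does not matter here. As a rough idea, a minimal key-based Bucketer could look like the following sketch (the one-directory-per-key rule is illustrative, not necessarily what KeyBucket really does):

import org.apache.flink.streaming.connectors.fs.Clock
import org.apache.flink.streaming.connectors.fs.bucketing.Bucketer
import org.apache.hadoop.fs.Path

// Puts every element into a bucket directory named after its key (the first tuple field).
class KeyBucket extends Bucketer[(String, String)] {
  override def getBucketPath(clock: Clock, basePath: Path, element: (String, String)): Path =
    new Path(basePath, element._1)
}

For the writer, the built-in StringWriter (which writes toString() of each element followed by a newline) is a perfectly good starting point if you don't need a custom format.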

Usage is straightforward. Now let's take a quick look at the BucketingSink source code.
The class lives in the package org.apache.flink.streaming.connectors.fs.bucketing, so it is part of the connectors module. It starts with a long class-level Javadoc describing the class, its type parameter and the related classes:

/**
 * Sink that emits its input elements to {@link FileSystem} files within
 * buckets. This is integrated with the checkpointing mechanism to provide exactly once semantics.
 *
 * When creating the sink a {@code basePath} must be specified. The base directory contains
 * one directory for every bucket. The bucket directories themselves contain several part files,
 * one for each parallel subtask of the sink. These part files contain the actual output data.
 *
 * The sink uses a {@link Bucketer} to determine in which bucket directory each element should
 * be written to inside the base directory. The {@code Bucketer} can, for example, use time or
 * a property of the element to determine the bucket directory. The default {@code Bucketer} is a
 * {@link DateTimeBucketer} which will create one new bucket every hour. You can specify
 * a custom {@code Bucketer} using {@link #setBucketer(Bucketer)}. For example, use the
 * {@link BasePathBucketer} if you don't want to have buckets but still want to write part-files
 * in a fault-tolerant way.
 *
 * The filenames of the part files contain the part prefix, the parallel subtask index of the sink
 * and a rolling counter. For example the file {@code "part-1-17"} contains the data from
 * {@code subtask 1} of the sink and is the {@code 17th} bucket created by that subtask. Per default
 * the part prefix is {@code "part"} but this can be configured using {@link #setPartPrefix(String)}.
 * When a part file becomes bigger than the user-specified batch size or when the part file becomes older
 * than the user-specified roll over interval the current part file is closed, the part counter is increased
 * and a new part file is created. The batch size defaults to {@code 384MB}, this can be configured
 * using {@link #setBatchSize(long)}. The roll over interval defaults to {@code Long.MAX_VALUE} and
 * this can be configured using {@link #setBatchRolloverInterval(long)}.
 *
 * In some scenarios, the open buckets are required to change based on time. In these cases, the sink
 * needs to determine when a bucket has become inactive, in order to flush and close the part file.
 * To support this there are two configurable settings:
 *
 *   1. the frequency to check for inactive buckets, configured by {@link #setInactiveBucketCheckInterval(long)}, and
 *   2. the minimum amount of time a bucket has to not receive any data before it is considered inactive,
 *      configured by {@link #setInactiveBucketThreshold(long)}
 *
 * Both of these parameters default to {@code 60,000 ms}, or {@code 1 min}.
 *
 * Part files can be in one of three states: {@code in-progress}, {@code pending} or {@code finished}.
 * The reason for this is how the sink works together with the checkpointing mechanism to provide exactly-once
 * semantics and fault-tolerance. The part file that is currently being written to is {@code in-progress}. Once
 * a part file is closed for writing it becomes {@code pending}. When a checkpoint is successful the currently
 * pending files will be moved to {@code finished}.
 *
 * If case of a failure, and in order to guarantee exactly-once semantics, the sink should roll back to the state it
 * had when that last successful checkpoint occurred. To this end, when restoring, the restored files in {@code pending}
 * state are transferred into the {@code finished} state while any {@code in-progress} files are rolled back, so that
 * they do not contain data that arrived after the checkpoint from which we restore. If the {@code FileSystem} supports
 * the {@code truncate()} method this will be used to reset the file back to its previous state. If not, a special
 * file with the same name as the part file and the suffix {@code ".valid-length"} will be created that contains the
 * length up to which the file contains valid data. When reading the file, it must be ensured that it is only read up
 * to that point. The prefixes and suffixes for the different file states and valid-length files can be configured
 * using the adequate setter method, e.g. {@link #setPendingSuffix(String)}.
 *
 * NOTE:
 *
 *   1. If checkpointing is not enabled the pending files will never be moved to the finished state. In that case,
 *      the pending suffix/prefix can be set to {@code ""} to make the sink work in a non-fault-tolerant way but
 *      still provide output without prefixes and suffixes.
 *   2. The part files are written using an instance of {@link Writer}. By default, a
 *      {@link StringWriter} is used, which writes the result of {@code toString()} for
 *      every element, separated by newlines. You can configure the writer using the
 *      {@link #setWriter(Writer)}. For example, {@link SequenceFileWriter}
 *      can be used to write Hadoop {@code SequenceFiles}.
 *   3. {@link #closePartFilesByTime(long)} closes buckets that have not been written to for
 *      {@code inactiveBucketThreshold} or if they are older than {@code batchRolloverInterval}.
 *
 * Example:
 *
 * {@code
 *     new BucketingSink<Tuple2<IntWritable, Text>>(outPath)
 *         .setWriter(new SequenceFileWriter<IntWritable, Text>())
 *         .setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"))
 * }
 *
 * This will create a sink that writes to {@code SequenceFiles} and rolls every minute.
 *
 * @see DateTimeBucketer
 * @see StringWriter
 * @see SequenceFileWriter
 *
 * @param <T> Type of the elements emitted by this sink
 *
 * @deprecated Please use the
 * {@link org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink StreamingFileSink}
 * instead.
 */
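
Before moving on, one practical note: the inactive-bucket settings described in the Javadoc can be wired into the earlier demo like this (the interval values are arbitrary, purely for illustration):

// Close buckets that have stopped receiving data, so their part files get finalized.
bucketingsink.setInactiveBucketCheckInterval(60 * 1000)   // check for inactive buckets every minute
bucketingsink.setInactiveBucketThreshold(5 * 60 * 1000)   // a bucket counts as inactive after 5 idle minutes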

Please pay particular attention to these two passages of the comment:

If case of a failure, and in order to guarantee exactly-once semantics, the sink should roll back to the state it
 * had when that last successful checkpoint occurred. To this end, when restoring, the restored files in {@code pending}
 * state are transferred into the {@code finished} state while any {@code in-progress} files are rolled back, so that
 * they do not contain data that arrived after the checkpoint from which we restore. If the {@code FileSystem} supports
 * the {@code truncate()} method this will be used to reset the file back to its previous state. If not, a special
 * file with the same name as the part file and the suffix {@code ".valid-length"} will be created that contains the
 * length up to which the file contains valid data. When reading the file, it must be ensured that it is only read up
 * to that point. The prefixes and suffixes for the different file states and valid-length files can be configured
 * using the adequate setter method, e.g. {@link #setPendingSuffix(String)}.

* @deprecated Please use the
 * {@link org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink StreamingFileSink}
 * instead.

Roughly speaking: to guarantee exactly-once semantics, the sink must be able to roll back to the state of the last successful checkpoint. If the target file system supports truncate(), Flink truncates the previously written part file back to the state it had at that checkpoint. If it does not, Flink instead creates a companion file with the same name plus a suffix, which records the length up to which the file contains valid data. This is very important, and it is also where the main difference between BucketingSink and StreamingFileSink lies.

The second passage shows that the class has been deprecated in newer Flink versions and replaced by StreamingFileSink; this is the moment BucketingSink's child steps into the spotlight. (I haven't dug into whether StreamingFileSink first appeared in Flink 1.6, Flink 1.7 or some other release.) Since the official docs recommend it, let's talk about StreamingFileSink next.

3. StreamingFileSink

Likewise, let's look at StreamingFileSink's source code and how to use it.
First, a simple demo:

import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink

val streamingFileSink = StreamingFileSink
    .forRowFormat(new Path(basePath), new Tuple2Encoder())        // custom Encoder: how each row is written
    .withBucketAssigner(new KeyBucketAssigner())                  // custom BucketAssigner: which bucket a row goes to
    // .withRollingPolicy(DefaultRollingPolicy.create().build())  // optional: when to roll part files
    .build()
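
Again, Tuple2Encoder and KeyBucketAssigner are custom classes. A sketch of what they might look like follows (the names match the demo, but the logic is an assumption, not the actual project code):

import java.io.OutputStream
import java.nio.charset.StandardCharsets

import org.apache.flink.api.common.serialization.Encoder
import org.apache.flink.core.io.SimpleVersionedSerializer
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer

// Routes each element to the bucket (sub-directory) named after its key.
class KeyBucketAssigner extends BucketAssigner[(String, String), String] {
  override def getBucketId(element: (String, String), context: BucketAssigner.Context): String =
    element._1

  override def getSerializer(): SimpleVersionedSerializer[String] =
    SimpleVersionedStringSerializer.INSTANCE
}

// Row-format encoder: writes "key,value" followed by a newline.
class Tuple2Encoder extends Encoder[(String, String)] {
  override def encode(element: (String, String), stream: OutputStream): Unit =
    stream.write(s"${element._1},${element._2}\n".getBytes(StandardCharsets.UTF_8))
}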

Usage is just as simple. Here we focus on a few methods:

		@Override
		Buckets<IN, BucketID> createBuckets(int subtaskIndex) throws IOException {
			return new Buckets<>(
					basePath,
					bucketAssigner,
					bucketFactory,
					new RowWisePartWriter.Factory<>(encoder),
					rollingPolicy,
					subtaskIndex);
		}

It creates the directory and file hierarchy using either a custom or the default bucket assigner. Next, let's look at the Buckets class that this method instantiates:

Buckets(
			final Path basePath,
			final BucketAssigner<IN, BucketID> bucketAssigner,
			final BucketFactory<IN, BucketID> bucketFactory,
			final PartFileWriter.PartFileFactory<IN, BucketID> partFileWriterFactory,
			final RollingPolicy<IN, BucketID> rollingPolicy,
			final int subtaskIndex) throws IOException {

		this.basePath = Preconditions.checkNotNull(basePath);
		this.bucketAssigner = Preconditions.checkNotNull(bucketAssigner);
		this.bucketFactory = Preconditions.checkNotNull(bucketFactory);
		this.partFileWriterFactory = Preconditions.checkNotNull(partFileWriterFactory);
		this.rollingPolicy = Preconditions.checkNotNull(rollingPolicy);
		this.subtaskIndex = subtaskIndex;

		this.activeBuckets = new HashMap<>();
		this.bucketerContext = new Buckets.BucketerContext();

		try {
			this.fsWriter = FileSystem.get(basePath.toUri()).createRecoverableWriter();
		} catch (IOException e) {
			LOG.error("Unable to create filesystem for path: {}", basePath);
			throw e;
		}

		this.bucketStateSerializer = new BucketStateSerializer<>(
				fsWriter.getResumeRecoverableSerializer(),
				fsWriter.getCommitRecoverableSerializer(),
				bucketAssigner.getSerializer()
		);

		this.maxPartCounter = 0L;
	}

Pay close attention to this line of code:

this.fsWriter = FileSystem.get(basePath.toUri()).createRecoverableWriter();

It creates the writer for the target file system, in particular for HDFS. Let's go one level deeper:

    @Override
	public RecoverableWriter createRecoverableWriter() throws IOException {
		// This writer is only supported on a subset of file systems, and on
		// specific versions. We check these schemes and versions eagerly for better error
		// messages in the constructor of the writer.
		return new HadoopRecoverableWriter(fs);
	}

// Constructor of org.apache.flink.runtime.fs.hdfs.HadoopRecoverableWriter:
public HadoopRecoverableWriter(org.apache.hadoop.fs.FileSystem fs) {
		this.fs = checkNotNull(fs);

		// This writer is only supported on a subset of file systems, and on
		// specific versions. We check these schemes and versions eagerly for
		// better error messages.
		if (!"hdfs".equalsIgnoreCase(fs.getScheme()) || !HadoopUtils.isMinHadoopVersion(2, 7)) {
			throw new UnsupportedOperationException(
					"Recoverable writers on Hadoop are only supported for HDFS and for Hadoop version 2.7 or newer");
		}
	}

See the problem? When StreamingFileSink writes to HDFS, it requires Hadoop 2.7 or newer, yet many widely deployed distributions, Cloudera CDH 5.x included, still ship Hadoop 2.6. So if your Hadoop version is below 2.7, my advice is to stick with BucketingSink; it won't let you down, it is the ancestor after all!
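
If you need to support both situations in one code base, a hedged approach is to pick the sink from the Hadoop client version on the classpath. This sketch assumes the custom classes from the demos above plus a stream: DataStream[(String, String)] and a basePath already defined:

import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.connectors.fs.StringWriter
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink
import org.apache.hadoop.util.VersionInfo

// e.g. "2.6.0-cdh5.16.2" -> major = 2, minor = 6
val Array(major, minor) = VersionInfo.getVersion.split("\\.").take(2).map(_.toInt)

if (major > 2 || (major == 2 && minor >= 7)) {
  // Hadoop >= 2.7: StreamingFileSink's recoverable writer works on HDFS.
  stream.addSink(
    StreamingFileSink
      .forRowFormat(new Path(basePath), new Tuple2Encoder())
      .withBucketAssigner(new KeyBucketAssigner())
      .build())
} else {
  // Older Hadoop (e.g. CDH 5 / Hadoop 2.6): fall back to the deprecated BucketingSink.
  stream.addSink(
    new BucketingSink[(String, String)](basePath)
      .setBucketer(new KeyBucket())
      .setWriter(new StringWriter[(String, String)]()))
}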
