Writing to HDFS from Flink with BucketingSink

     For writing to HDFS from Flink, the two commonly used sinks are BucketingSink and StreamingFileSink.

     BucketingSink will eventually be replaced by StreamingFileSink, but its feature set is still quite complete.

     StreamingFileSink supports features that BucketingSink does not, such as writing to S3 and writing in Parquet format.
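     For comparison, here is a minimal StreamingFileSink sketch (my assumptions, not from the job below: Flink 1.10+, a row-encoded String stream, and illustrative path and rolling thresholds):

import org.apache.flink.api.common.serialization.SimpleStringEncoder
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy

// Row-encoded sink: writes one String record per line, bucketed by date-time by default.
val streamingFileSink: StreamingFileSink[String] = StreamingFileSink
  .forRowFormat(new Path("hdfs:///tmp/flinkhdfs-sfs"), new SimpleStringEncoder[String]("UTF-8"))
  .withRollingPolicy(
    DefaultRollingPolicy.builder()
      .withMaxPartSize(1024 * 100)           // roll at ~100 KB, matching the BucketingSink example below
      .withRolloverInterval(60 * 1000)       // roll at least once per minute
      .withInactivityInterval(3 * 60 * 1000) // close part files idle for 3 minutes
      .build())
  .build()
// stream.addSink(streamingFileSink)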

1 Code example:

import java.time.ZoneId
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.{CheckpointingMode, TimeCharacteristic}
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.connectors.fs.{SequenceFileWriter, StringWriter}
import org.apache.flink.streaming.connectors.fs.bucketing.{BucketingSink, DateTimeBucketer}
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

object BucketingHdfsSink {

  def main(args: Array[String]): Unit = {
    val params: ParameterTool = ParameterTool.fromArgs(args)

    // set up execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // make parameters available in the web interface
    env.getConfig.setGlobalJobParameters(params)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.enableCheckpointing(10 * 1000, CheckpointingMode.EXACTLY_ONCE) // checkpoint every 10 s; pending part files are finalized when a checkpoint completes

    val properties = new Properties()
    properties.setProperty("bootstrap.servers", "192.168.139.13:9092,192.168.139.11:9092,192.168.139.12:9092")
    properties.setProperty("group.id", "sink-group")
   
    val stream = env
      .addSource(new FlinkKafkaConsumer[String]("sink-topic", new SimpleStringSchema(), properties))

    // Element type written to the files: String. Tuple types such as
    // Tuple2[IntWritable, Text] are also supported.
    val bucketingSink = new BucketingSink[String]("/tmp/flinkhdfs")

    // Bucketing: decides which subdirectory (bucket) a record is written to.

    // DateTimeBucketer can only derive the bucket path from time; it cannot route
    // records to output paths based on their content. For that, extend
    // BasePathBucketer, as in the DayBasePathBucketer example below (it additionally
    // needs imports of java.io.File, org.apache.hadoop.fs.Path and
    // org.apache.flink.streaming.connectors.fs.Clock):
    //
    // /**
    //  * A custom per-day bucketer that derives the output path from the record
    //  * itself, giving behavior similar to DateTimeBucketer.
    //  */
    // class DayBasePathBucketer extends BasePathBucketer[String] {
    //
    //   /**
    //     * Return the bucket path: a per-day subdirectory under the base path.
    //     * @param clock    the sink's clock
    //     * @param basePath the sink's base path
    //     * @param element  the record being written
    //     * @return the bucket directory for this record
    //     */
    //   override def getBucketPath(clock: Clock, basePath: Path, element: String): Path = {
    //     val day = element.substring(1, 9) // yyyyMMdd taken from the record
    //     new Path(basePath + File.separator + day)
    //   }
    // }

    // Built-in bucketer: DateTimeBucketer.
    // Default format yyyy-MM-dd--HH buckets by hour; yyyy-MM-dd buckets by day;
    // yyyy-MM-dd--HHmm buckets by minute.
    bucketingSink.setBucketer(new DateTimeBucketer[String]("yyyy-MM-dd"))
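    // Note (assuming the Flink 1.7+ constructor overload, hence the java.time.ZoneId
    // import above): DateTimeBucketer also accepts a ZoneId to control the timezone
    // the bucket path is computed in, e.g.:
    // bucketingSink.setBucketer(new DateTimeBucketer[String]("yyyy-MM-dd", ZoneId.of("Asia/Shanghai")))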
    // Writer: StringWriter is the default; SequenceFileWriter is also available
    // for writing Hadoop SequenceFiles.
    // bucketingSink.setWriter(new StringWriter())
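    // Sketch of the SequenceFileWriter alternative (not wired into this job; it
    // assumes a Tuple2[IntWritable, Text] stream and imports of
    // org.apache.flink.api.java.tuple.Tuple2 and org.apache.hadoop.io.{IntWritable, Text}):
    // val seqSink = new BucketingSink[Tuple2[IntWritable, Text]]("/tmp/flinkseq")
    // seqSink.setWriter(new SequenceFileWriter[IntWritable, Text]())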

    // Important: rolling settings. A new part file is started when EITHER the size
    // limit or the rollover interval is reached.
    bucketingSink.setBatchSize(1024 * 100) // roll at 100 KB
    bucketingSink.setBatchRolloverInterval(1 * 60 * 1000) // roll every minute (a new file is only created when data arrives)
   
    // Inactive-bucket threshold: close a part file once it has received no writes
    // for this long.
    bucketingSink.setInactiveBucketThreshold(3 * 60 * 1000L)
    // How often to check for inactive buckets.
    bucketingSink.setInactiveBucketCheckInterval(30 * 1000L)

 

    // invoke() is called once per record.

    // shouldRoll() checks whether the current part file has reached the batch size
    // or the rollover time; if so, a new part file is created with the in-progress
    // suffix.

    /************** onProcessingTime() ****************/
    /* Fired by a timer every inactiveBucketCheckInterval; see closePartFilesByTime():
       Check 1: if (now - last modification time) exceeds inactiveBucketThreshold,
                close the part file via closeCurrentPartFile().
       Check 2: if (now - creation time) exceeds batchRolloverInterval,
                close the part file via closeCurrentPartFile().
    */
    // public void onProcessingTime(long timestamp) throws Exception {
    //   long currentProcessingTime = processingTimeService.getCurrentProcessingTime();
    //   closePartFilesByTime(currentProcessingTime);
    //   processingTimeService.registerTimer(currentProcessingTime + inactiveBucketCheckInterval, this);
    // }


    // Optional: customize the prefixes used for each part-file state
    // (defaults: in-progress and pending files are prefixed with "_",
    // part files with "part").
    // bucketingSink.setInProgressPrefix("inProcessPre")
    // bucketingSink.setPendingPrefix("pendingpre")
    // bucketingSink.setPartPrefix("partPre")
    stream.addSink(bucketingSink)
    env.execute()

    // closeCurrentPartFile() logic:
    // 1. Output is flushed periodically to a part file carrying the in-progress
    //    suffix. When the file hits a time limit (rollover interval or inactivity
    //    threshold), the writer is closed, buffers are flushed, and the file is
    //    renamed with the pending suffix. The suffix is only removed once the next
    //    checkpoint completes.
    // Note: if the job crashes before the checkpoint completes, the in-progress and
    // pending data written since the last checkpoint is discarded on restore and the
    // source replays from the previous offsets; when a checkpoint completes,
    // notifyCheckpointComplete() strips the pending suffix.

  }
}
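The example needs the flink-connector-filesystem module (for BucketingSink) and flink-connector-kafka (for FlinkKafkaConsumer). A sketch of the sbt dependencies, where the Flink version and Scala binary version are my assumptions:

libraryDependencies ++= Seq(
  "org.apache.flink" %% "flink-connector-filesystem" % "1.10.0",
  "org.apache.flink" %% "flink-connector-kafka" % "1.10.0"
)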

2 Run example:

2.1 Listing the bucket directories

[hadoop@cdh01 ~]$ hadoop fs -ls /tmp/flinkhdfs
Found 17 items
drwxr-xr-x   - hadoop supergroup          0 2020-04-14 11:39 /tmp/flinkhdfs/2020-04-14
drwxr-xr-x   - hadoop supergroup          0 2020-04-14 10:57 /tmp/flinkhdfs/2020-04-14--1055
drwxr-xr-x   - hadoop supergroup          0 2020-04-14 10:57 /tmp/flinkhdfs/2020-04-14--1056
drwxr-xr-x   - hadoop supergroup          0 2020-04-14 10:58 /tmp/flinkhdfs/2020-04-14--1057
drwxr-xr-x   - hadoop supergroup          0 2020-04-14 10:59 /tmp/flinkhdfs/2020-04-14--1058
drwxr-xr-x   - hadoop supergroup          0 2020-04-14 11:00 /tmp/flinkhdfs/2020-04-14--1059
drwxr-xr-x   - hadoop supergroup          0 2020-04-14 11:01 /tmp/flinkhdfs/2020-04-14--1100
drwxr-xr-x   - hadoop supergroup          0 2020-04-14 11:02 /tmp/flinkhdfs/2020-04-14--1101
drwxr-xr-x   - hadoop supergroup          0 2020-04-14 11:03 /tmp/flinkhdfs/2020-04-14--1102
drwxr-xr-x   - hadoop supergroup          0 2020-04-14 11:04 /tmp/flinkhdfs/2020-04-14--1103
drwxr-xr-x   - hadoop supergroup          0 2020-04-14 11:05 /tmp/flinkhdfs/2020-04-14--1104
drwxr-xr-x   - hadoop supergroup          0 2020-04-14 11:06 /tmp/flinkhdfs/2020-04-14--1105
drwxr-xr-x   - hadoop supergroup          0 2020-04-14 11:07 /tmp/flinkhdfs/2020-04-14--1106
drwxr-xr-x   - hadoop supergroup          0 2020-04-14 11:08 /tmp/flinkhdfs/2020-04-14--1107
drwxr-xr-x   - hadoop supergroup          0 2020-04-14 11:09 /tmp/flinkhdfs/2020-04-14--1108
drwxr-xr-x   - hadoop supergroup          0 2020-04-14 11:09 /tmp/flinkhdfs/2020-04-14--1109
drwxr-xr-x   - hadoop supergroup          0 2020-04-14 11:10 /tmp/flinkhdfs/2020-04-14--1110

2.2 Listing the files inside a bucket (the suffix changes over the file's lifecycle: in-progress --> pending --> suffix removed)

[hadoop@cdh01 ~]$ hadoop fs -ls /tmp/flinkhdfs/2020-04-14
Found 28 items
-rw-r--r--   1 hadoop supergroup     102418 2020-04-14 11:12 /tmp/flinkhdfs/2020-04-14/part-0-0
-rw-r--r--   1 hadoop supergroup      49736 2020-04-14 11:13 /tmp/flinkhdfs/2020-04-14/part-0-1
-rw-r--r--   1 hadoop supergroup      49551 2020-04-14 11:22 /tmp/flinkhdfs/2020-04-14/part-0-10
-rw-r--r--   1 hadoop supergroup      49551 2020-04-14 11:23 /tmp/flinkhdfs/2020-04-14/part-0-11
-rw-r--r--   1 hadoop supergroup      49551 2020-04-14 11:24 /tmp/flinkhdfs/2020-04-14/part-0-12
-rw-r--r--   1 hadoop supergroup      49551 2020-04-14 11:25 /tmp/flinkhdfs/2020-04-14/part-0-13
-rw-r--r--   1 hadoop supergroup      49551 2020-04-14 11:26 /tmp/flinkhdfs/2020-04-14/part-0-14
-rw-r--r--   1 hadoop supergroup      49551 2020-04-14 11:27 /tmp/flinkhdfs/2020-04-14/part-0-15
-rw-r--r--   1 hadoop supergroup      49551 2020-04-14 11:28 /tmp/flinkhdfs/2020-04-14/part-0-16
-rw-r--r--   1 hadoop supergroup      49551 2020-04-14 11:29 /tmp/flinkhdfs/2020-04-14/part-0-17
-rw-r--r--   1 hadoop supergroup      49551 2020-04-14 11:30 /tmp/flinkhdfs/2020-04-14/part-0-18
-rw-r--r--   1 hadoop supergroup      49883 2020-04-14 11:31 /tmp/flinkhdfs/2020-04-14/part-0-19
-rw-r--r--   1 hadoop supergroup      49966 2020-04-14 11:14 /tmp/flinkhdfs/2020-04-14/part-0-2
-rw-r--r--   1 hadoop supergroup      49551 2020-04-14 11:32 /tmp/flinkhdfs/2020-04-14/part-0-20
-rw-r--r--   1 hadoop supergroup      49551 2020-04-14 11:33 /tmp/flinkhdfs/2020-04-14/part-0-21
-rw-r--r--   1 hadoop supergroup      49551 2020-04-14 11:34 /tmp/flinkhdfs/2020-04-14/part-0-22
-rw-r--r--   1 hadoop supergroup      49551 2020-04-14 11:35 /tmp/flinkhdfs/2020-04-14/part-0-23
-rw-r--r--   1 hadoop supergroup      49551 2020-04-14 11:36 /tmp/flinkhdfs/2020-04-14/part-0-24
-rw-r--r--   1 hadoop supergroup      49551 2020-04-14 11:37 /tmp/flinkhdfs/2020-04-14/part-0-25
-rw-r--r--   1 hadoop supergroup      49551 2020-04-14 11:38 /tmp/flinkhdfs/2020-04-14/part-0-26
-rw-r--r--   1 hadoop supergroup      49468 2020-04-14 11:39 /tmp/flinkhdfs/2020-04-14/part-0-27
-rw-r--r--   1 hadoop supergroup      49551 2020-04-14 11:15 /tmp/flinkhdfs/2020-04-14/part-0-3
-rw-r--r--   1 hadoop supergroup      49551 2020-04-14 11:16 /tmp/flinkhdfs/2020-04-14/part-0-4
-rw-r--r--   1 hadoop supergroup      49551 2020-04-14 11:17 /tmp/flinkhdfs/2020-04-14/part-0-5
-rw-r--r--   1 hadoop supergroup      49551 2020-04-14 11:18 /tmp/flinkhdfs/2020-04-14/part-0-6
-rw-r--r--   1 hadoop supergroup      49551 2020-04-14 11:19 /tmp/flinkhdfs/2020-04-14/part-0-7
-rw-r--r--   1 hadoop supergroup      49883 2020-04-14 11:20 /tmp/flinkhdfs/2020-04-14/part-0-8
-rw-r--r--   1 hadoop supergroup      49551 2020-04-14 11:21 /tmp/flinkhdfs/2020-04-14/part-0-9
