4. Apache Hudi: Reading Binlog with Spark and Writing It to Hudi

1. Data Preparation

Use Canal to send the MySQL binlog data to Kafka.

2. Writing the Program

1. Consume the binlog data from Kafka

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val kafkaParams = Map[String, String](
  "bootstrap.servers" -> "xxx.xxx.xxx.xxx:9092",
  "auto.offset.reset" -> "latest",
  "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "group.id" -> "test"
)
val topics = Array("topic_xxx")

val conf = new SparkConf()
  .setAppName("Demo1")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// one micro-batch every 60 seconds
val ssc = new StreamingContext(conf, Seconds(60))
val stream = KafkaUtils.createDirectStream[String, String](ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams))

2. Parse and filter the data

The binlog records in our Kafka topic have the following format:

V_BinlogName|#|mysql-bin.001205|*|V_StartPos|#|111404|*|V_EndPos|#|113489|*|V_logOffset|#|112140|*|V_ExecuteTime|#|2018-10-22 12:00:33|*|V_StartTime|#|2018-10-22 12:00:33|*|V_EndTime|#|2018-10-22 12:00:33|*|EventType|#|INSERT|*|SchemaId|#|45|*|Schema|#|uc|*|TableName|#|staff_aa_sync|*|id|#|000000|*|employee_date|#|2018-10-16 00:00:00|*|employee_id|#|000000|*|update_time|#|2018-10-22 12:00:33|*|dimission_time|#||*|employee_status|#|1

The string needs to be parsed into key-value pairs. Since the binlogs of all tables that need real-time synchronization are written to a single topic, we also have to filter on TableName before storing, keeping only the tables we want.

import com.alibaba.fastjson.JSON
import com.alibaba.fastjson.serializer.SerializeConfig
import scala.collection.JavaConversions.mapAsJavaMap

val lineStream = stream.map(v => {
  // split the record into "key|#|value" fields
  val lineArray = v.value().split("\\|\\*\\|")
  var lineMap: Map[String, String] = Map()
  lineArray.foreach(i => {
    val fieldArray = i.split("\\|#\\|")
    if (fieldArray.length == 2) {
      lineMap += (fieldArray(0) -> fieldArray(1))
    }
  })
  // keep the table name for filtering and serialize the whole record as JSON
  (lineMap.getOrElse("TableName", ""), JSON.toJSONString(mapAsJavaMap(lineMap), new SerializeConfig(true)))
}).filter(v => "table_aaa".equals(v._1)).map(_._2)
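For the sample record above, the map stage produces a tuple roughly like the following (values abbreviated; note that this particular record's TableName is staff_aa_sync, so the table_aaa filter would drop it, and the filter value should be whichever table you actually want to sync):

// Tuple produced for the sample record (illustrative, abbreviated):
//   _1 = "staff_aa_sync"
//   _2 = {"V_BinlogName":"mysql-bin.001205","EventType":"INSERT","Schema":"uc",
//         "TableName":"staff_aa_sync","id":"000000", ...}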

Hudi requires a DataFrame when saving data, so I convert the parsed record into JSON and then use Spark's API to turn the JSON directly into a DataFrame. The benefit is that the schema can be inferred dynamically from the JSON (with our previous solution, a column newly added to a MySQL table could not be synchronized to Hive in time).

import com.uber.hoodie.DataSourceWriteOptions._
import com.uber.hoodie.config.HoodieWriteConfig.TABLE_NAME
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .config(conf)
  .getOrCreate()

// tableName is the target Hudi/Hive table and basePath is its root directory on HDFS (defined elsewhere)
lineStream.foreachRDD(rdd => {
  // infer the schema dynamically from the JSON records
  val df = spark.read.json(rdd)
  df.write.format("com.uber.hoodie")
    .option("hoodie.upsert.shuffle.parallelism", "1")
    .option(HIVE_URL_OPT_KEY, "jdbc:hive2://xxx.xxx.xxx.xxx:10000")
    .option(HIVE_USER_OPT_KEY, "aaaa")
    .option(HIVE_PASS_OPT_KEY, "123456")
    .option(HIVE_DATABASE_OPT_KEY, "test_dc")
    .option(HIVE_SYNC_ENABLED_OPT_KEY, true)
    .option(HIVE_TABLE_OPT_KEY, tableName)
    .option(HIVE_PARTITION_FIELDS_OPT_KEY, "partiton")
    .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "com.xxx.dc.MyPartitionValueExtractor")
    .option(PRECOMBINE_FIELD_OPT_KEY, "id")
    .option(INSERT_DROP_DUPS_OPT_KEY, "true")
    .option(RECORDKEY_FIELD_OPT_KEY, "id")
    .option(PARTITIONPATH_FIELD_OPT_KEY, "partitionPath")
    .option(TABLE_NAME, tableName)
    .mode(SaveMode.Append)
    .save(basePath)
})

// start the streaming job
ssc.start()
ssc.awaitTermination()

Problem 1: To keep the storage directory consistent with Hive partitioned tables, I define Hudi's partition path in the xxx=xxx format, which means an extra column has to be added in the parsing stage above:

lineMap += ("partitionPath" -> "partiton=20191210")
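The date here is hard-coded for demonstration. In practice the partition value would normally be derived from a field of the record itself; below is a minimal sketch, assuming the record's update_time field and a yyyyMMdd layout (both are assumptions for illustration, not part of the original pipeline):

// Hypothetical: derive the partition value from the record's update_time field
// instead of hard-coding it; "update_time" and the yyyyMMdd layout are assumptions.
val updateTime = lineMap.getOrElse("update_time", "") // e.g. "2018-10-22 12:00:33"
val day =
  if (updateTime.length >= 10) updateTime.substring(0, 10).replace("-", "")
  else "unknown"
lineMap += ("partitionPath" -> s"partiton=$day") // e.g. "partiton=20181022"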

Problem 2: By default, Hudi's Hive sync parses partition directories in the yyyy/MM/dd format, so the custom partition directory above causes an error. A custom PartitionValueExtractor therefore has to be written and configured through HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY:

import java.util.Arrays;
import java.util.List;

import com.uber.hoodie.hive.PartitionValueExtractor;

public class MyPartitionValueExtractor implements PartitionValueExtractor {

    @Override
    public List<String> extractPartitionValuesInPath(String partitionPath) {
        // the partition path is expected to look like "partiton=20191210"
        String[] splits = partitionPath.split("/");
        String split = splits[0];
        if (split.indexOf("=") == -1) {
            throw new IllegalArgumentException("Partition path " + partitionPath + " is not in the form xxx=xxxx ");
        }
        // return only the value part, e.g. "20191210"
        return Arrays.asList(split.split("=")[1]);
    }
}
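A quick sanity check of the extractor (illustrative usage only):

// For a partition directory such as "partiton=20191210", the extractor returns a
// single value, "20191210", which Hive sync uses for the "partiton" partition column.
val extractor = new MyPartitionValueExtractor()
println(extractor.extractPartitionValuesInPath("partiton=20191210")) // prints [20191210]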

3. Query the Hudi table in Hive

Because Hudi uses a custom InputFormat, the Hudi jar needs to be uploaded to the cluster and made available to Hive:

(screenshot: hudi jar package)

After that, the Hudi table can be queried in Hive.
