We use Canal to send the MySQL binlog data to Kafka.
// Kafka consumer settings for the binlog topic
val kafkaParams = Map[String, String](
  "bootstrap.servers" -> "xxx.xxx.xxx.xxx:9092",
  "auto.offset.reset" -> "latest",
  "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "group.id" -> "test"
)
val topics = Array("topic_xxx")

val conf = new SparkConf()
  .setAppName("Demo1")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// 60-second micro-batches
val ssc = new StreamingContext(conf, Seconds(60))

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams))
The binlog records in our Kafka topic look like this:
V_BinlogName|#|mysql-bin.001205|*|V_StartPos|#|111404|*|V_EndPos|#|113489|*|V_logOffset|#|112140|*|V_ExecuteTime|#|2018-10-22 12:00:33|*|V_StartTime|#|2018-10-22 12:00:33|*|V_EndTime|#|2018-10-22 12:00:33|*|EventType|#|INSERT|*|SchemaId|#|45|*|Schema|#|uc|*|TableName|#|staff_aa_sync|*|id|#|000000|*|employee_date|#|2018-10-16 00:00:00|*|employee_id|#|000000|*|update_time|#|2018-10-22 12:00:33|*|dimission_time|#||*|employee_status|#|1
The string has to be parsed into key-value pairs. Since the binlogs of all the tables we sync in real time go into a single topic, we also need to filter out the tables we want by TableName before writing the data out.
val lineStream = stream.map(v => {
  // fields are separated by |*|; the key and value inside a field are separated by |#|
  val lineArray = v.value().split("\\|\\*\\|")
  var lineMap: Map[String, String] = Map()
  lineArray.foreach(i => {
    val fieldArray = i.split("\\|#\\|")
    if (fieldArray.length == 2) {
      lineMap += (fieldArray(0) -> fieldArray(1))
    }
  })
  // emit (table name, record serialized to a JSON string with fastjson)
  (lineMap.getOrElse("TableName", ""), JSON.toJSONString(mapAsJavaMap(lineMap), new SerializeConfig(true)))
}).filter(v => "table_aaa".equals(v._1)).map(_._2) // keep only the table we want, then drop the key
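As a quick sanity check, the same parsing step can be exercised outside Spark. The ParseCheck object and parseLine helper below are hypothetical (not part of the actual job), but they mirror the map logic above:
// Minimal standalone sketch of the delimiter handling (hypothetical helper, not the real job)
object ParseCheck {
  def parseLine(line: String): (String, Map[String, String]) = {
    val lineMap = line.split("\\|\\*\\|")
      .map(_.split("\\|#\\|"))
      .collect { case Array(k, v) => k -> v }
      .toMap
    (lineMap.getOrElse("TableName", ""), lineMap)
  }

  def main(args: Array[String]): Unit = {
    val sample = "EventType|#|INSERT|*|Schema|#|uc|*|TableName|#|staff_aa_sync|*|id|#|000000"
    val (table, fields) = parseLine(sample)
    println(table)               // staff_aa_sync
    println(fields("EventType")) // INSERT
  }
}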
Hudi needs a DataFrame when saving data, so I convert the parsed records to JSON and then use Spark's API to turn the JSON directly into a DataFrame. The advantage is that the schema is inferred from the JSON dynamically (with our previous solution, columns newly added to a MySQL table could not be propagated to Hive in time).
val spark = SparkSession.builder()
  .config(conf)
  .getOrCreate()

lineStream.foreachRDD(rdd => {
  // infer the schema from the JSON strings of each micro-batch
  val df = spark.read.json(rdd)
  df.write.format("com.uber.hoodie")
    .option("hoodie.upsert.shuffle.parallelism", "1")
    .option(HIVE_URL_OPT_KEY, "jdbc:hive2://xxx.xxx.xxx.xxx:10000")
    .option(HIVE_USER_OPT_KEY, "aaaa")
    .option(HIVE_PASS_OPT_KEY, "123456")
    .option(HIVE_DATABASE_OPT_KEY, "test_dc")
    .option(HIVE_SYNC_ENABLED_OPT_KEY, true)
    .option(HIVE_TABLE_OPT_KEY, tableName)
    .option(HIVE_PARTITION_FIELDS_OPT_KEY, "partiton")
    .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "com.xxx.dc.MyPartitionValueExtractor")
    .option(PRECOMBINE_FIELD_OPT_KEY, "id")
    .option(INSERT_DROP_DUPS_OPT_KEY, "true")
    .option(RECORDKEY_FIELD_OPT_KEY, "id")
    .option(PARTITIONPATH_FIELD_OPT_KEY, "partitionPath")
    .option(TABLE_NAME, tableName)
    .mode(SaveMode.Append)
    .save(basePath) // tableName and basePath are defined elsewhere
})
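The snippet above only defines the pipeline; as with any Spark Streaming job, the context still has to be started and kept running (not shown in the original snippet):
// Start the streaming job and block the driver until it is stopped
ssc.start()
ssc.awaitTermination()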
Problem 1: to keep the storage layout consistent with the directory layout of Hive partitioned tables, I define Hudi's partitionPath in the xxx=xxx format, so an extra column has to be added in the parsing stage above:
lineMap += ("partitionPath" -> "partiton=20191210")
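The value is hardcoded here for illustration; in practice it would more likely be derived from one of the parsed fields. A rough sketch, assuming (my assumption, not from the original code) that the partition should follow the record's ExecuteTime:
// Derive the partition value from ExecuteTime ("yyyy-MM-dd HH:mm:ss", as in the sample record)
val day = lineMap.getOrElse("ExecuteTime", "").take(10).replace("-", "") // e.g. "20181022"
lineMap += ("partitionPath" -> s"partiton=$day")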
Problem 2: Hudi's default extractor parses partition directories in the yyyy/MM/dd form, so the custom partition directory above makes it throw an error. A custom extractor therefore has to be written and configured through HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY:
import java.util.Arrays;
import java.util.List;

// PartitionValueExtractor sits in com.uber.hoodie.hive in this Hudi version (adjust for yours)
import com.uber.hoodie.hive.PartitionValueExtractor;

public class MyPartitionValueExtractor implements PartitionValueExtractor {
    @Override
    public List<String> extractPartitionValuesInPath(String partitionPath) {
        // expect a single-level partition directory such as "partiton=20191210"
        String[] splits = partitionPath.split("/");
        String split = splits[0];
        if (split.indexOf("=") == -1) {
            throw new IllegalArgumentException("Partition path " + partitionPath + " is not in the form xxx=xxxx ");
        }
        return Arrays.asList(split.split("=")[1]); // return only the value part, e.g. "20191210"
    }
}
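Just to illustrate the expected behavior (a direct call to the method, not how Hudi's Hive sync actually invokes it):
// Calling the extractor directly on a sample partition directory name
val extractor = new MyPartitionValueExtractor()
println(extractor.extractPartitionValuesInPath("partiton=20191210")) // prints [20191210]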
Because Hudi uses a custom InputFormat, the Hudi jar also needs to be uploaded to the cluster so that Hive can load it.
After that, the Hudi table can be queried from Hive.