How spark-binlog works

A few days ago I released a new project called spark-binlog. Before this project, if you wanted to incrementally sync data from MySQL, you had to build a really big pipeline. It looks like the following picture:

[Figure: A really big pipeline]

You have to use a tool like Canal to extract the binlog from MySQL and send it to Kafka, then build a streaming application with Flink/Spark/Storm to consume it from Kafka again. Because the real goal is to sync the MySQL table through its binlog rather than to sync the binlog itself, you also need an upsertable storage such as HBase or Kudu. Kudu is great because you can query it directly; with HBase you have to wrap it with Phoenix or export the data from HBase to an HDFS table so it can be queried by Spark/Presto/Hive. This pipeline is really hard to maintain and time-consuming.
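To make the cost concrete, here is a rough sketch of just the streaming stage of that pipeline: a Spark Structured Streaming job that reads the Canal events from Kafka and then has to upsert every micro-batch by hand. The broker address, topic name, checkpoint path, and the parsing/upsert details are placeholders, not an exact setup.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object BinlogKafkaConsumer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("binlog-kafka-consumer")
      .getOrCreate()

    // Read the binlog events that Canal has pushed into Kafka.
    // Broker address and topic name are placeholders.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka1:9092")
      .option("subscribe", "mysql-binlog")
      .load()

    // Canal ships the row changes as JSON in the Kafka value field.
    val events = raw.selectExpr("CAST(value AS STRING) AS json")

    // Every micro-batch still has to be upserted into the external store
    // (Kudu, HBase, ...) by hand -- this is the painful part of the pipeline.
    val query = events.writeStream
      .option("checkpointLocation", "/tmp/checkpoints/mysql-binlog")
      .foreachBatch { (batch: DataFrame, _: Long) =>
        // parse the JSON and apply inserts/updates/deletes to Kudu/HBase here
        batch.show(truncate = false)
      }
      .start()

    query.awaitTermination()
  }
}
```

And this is only the consumer; Canal, Kafka, and the upsertable store each have to be deployed and operated on top of it.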

What we want is to read the MySQL binlog directly and write it to a storage that supports upserts. It looks like this:

[Figure: A simple pipeline]
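With spark-binlog, the binlog becomes just another streaming source and the upsertable storage just another sink. The sketch below shows the rough shape of such a job; the data source format name, the option keys, the Delta table path, and the `id` key column are all assumptions for illustration, so check the spark-binlog README for the exact API.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.{DataFrame, SparkSession}

object BinlogToDelta {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("binlog-to-delta")
      .getOrCreate()

    // Read the MySQL binlog as a regular Spark streaming source.
    // The format name and the option keys below are assumptions;
    // the real spark-binlog API may spell them differently.
    val changes = spark.readStream
      .format("org.apache.spark.sql.mlsql.sources.MLSQLBinLogDataSource")
      .option("host", "127.0.0.1")
      .option("port", "3306")
      .option("userName", "root")
      .option("password", "****")
      .option("databaseNamePattern", "mydb")
      .option("tableNamePattern", "orders")
      .load()

    // Upsert every micro-batch into a Delta table. The table path and the
    // `id` key column are placeholders; a real job would derive them from
    // the binlog schema.
    val query = changes.writeStream
      .option("checkpointLocation", "/tmp/checkpoints/orders")
      .foreachBatch { (batch: DataFrame, _: Long) =>
        DeltaTable.forPath(spark, "/data/delta/orders").as("t")
          .merge(batch.as("s"), "t.id = s.id")
          .whenMatched().updateAll()
          .whenNotMatched().insertAll()
          .execute()
      }
      .start()

    query.awaitTermination()
  }
}
```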

To build such a simple pipeline, two requirements must be met:
