In real-world streaming development, capturing database change logs in real time accounts for a large share of the work. The usual approach is to ingest the binlog with canal or maxwell, but neither tool makes on-demand ingestion easy, i.e. ingesting by database or by individual table. Their scope is fixed in configuration files before canal or maxwell starts, and changing it later means editing the configuration and restarting, which is awkward for production workloads. Kafka Connect, combined with open-source connector JARs, can ingest on demand according to business needs: if you need table a in database test, you simply start a test.a connector through the REST API.
Another benefit of Kafka Connect is that every table gets its own topic, which makes picking up data changes in a streaming job very convenient: whichever table you need, just consume the corresponding topic. These conveniences are what make Kafka Connect worthwhile; otherwise there would be little reason to use it.
Official documentation:
http://kafka.apache.org/documentation/#connect
Before using Kafka Connect, a few words about how it works. Kafka Connect is part of Kafka itself, i.e. it is a built-in Kafka feature, but it runs as a separate process with its own port; you start it with the standalone or distributed script.
How do you use it once it is started? Kafka Connect is built around the concept of plugins. For example, to ingest the MySQL binlog you need a MySQL connector JAR; to ingest from Oracle you need a JDBC connector JAR. Drop the JAR into the configured plugin directory, then start the corresponding ingestion task through the REST API.
1. Kafka Connect configuration
Before starting Kafka Connect, make sure your Kafka brokers are already running, then start the Connect worker process with connect-distributed.sh or connect-standalone.sh; it listens on port 8083. The difference between the two scripts is obvious from their names: one runs a single standalone worker, the other runs in distributed mode. I am testing distributed mode here. Go into the config directory and edit config/connect-distributed.properties; a few settings deserve attention:
1) All workers in a distributed Connect cluster must share the same group.id to mark them as one group; the default is group.id=connect-cluster.
2) Kafka Connect uses three internal topics. It is recommended to create them manually, because Kafka Connect has specific requirements for them (a sketch of the creation commands follows this list). The requirements are:
config.storage.topic (default connect-configs) - topic to use for storing connector and task configurations; note that this should be a single partition, highly replicated, compacted topic. You may need to manually create the topic to ensure the correct configuration as auto created topics may have multiple partitions or be automatically configured for deletion rather than compaction
offset.storage.topic (default connect-offsets) - topic to use for storing offsets; this topic should have many partitions, be replicated, and be configured for compaction
status.storage.topic (default connect-status) - topic to use for storing statuses; this topic can have multiple partitions, and should be replicated and configured for compaction
3) plugin.path=/data/kafka/connector — create a dedicated directory to hold the Kafka Connect plugin JARs.
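As a minimal sketch of the manual creation (the broker address is the one from this test environment; the replication factor and the partition counts of 25 for offsets and 5 for status are assumptions based on common defaults, so adjust them to your cluster), the three internal topics can be created like this:
bin/kafka-topics.sh --bootstrap-server 10.203.3.91:9092 --create --topic connect-configs --partitions 1 --replication-factor 3 --config cleanup.policy=compact
bin/kafka-topics.sh --bootstrap-server 10.203.3.91:9092 --create --topic connect-offsets --partitions 25 --replication-factor 3 --config cleanup.policy=compact
bin/kafka-topics.sh --bootstrap-server 10.203.3.91:9092 --create --topic connect-status --partitions 5 --replication-factor 3 --config cleanup.policy=compact
Note that connect-configs must stay a single-partition, compacted topic, as required above.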
2. Starting Kafka Connect
Once the configuration file has been modified, start Kafka Connect in distributed mode:
bin/connect-distributed.sh -daemon config/connect-distributed.properties
After startup, confirm the port is listening with lsof -i:8083.
3. Testing Kafka Connect
As mentioned earlier, Kafka Connect is operated entirely through its REST API, so every operation is just an API call. I use Postman for testing here; equivalent curl calls are sketched after the list below.
The relevant API endpoints are:
GET /connectors - return a list of active connectors
POST /connectors - create a new connector; the request body should be a JSON object containing a string name field and an object config field with the connector configuration parameters
GET /connectors/{name} - get information about a specific connector
GET /connectors/{name}/config - get the configuration parameters for a specific connector
PUT /connectors/{name}/config - update the configuration parameters for a specific connector
GET /connectors/{name}/status - get current status of the connector, including if it is running, failed, paused, etc., which worker it is assigned to, error information if it has failed, and the state of all its tasks
GET /connectors/{name}/tasks - get a list of tasks currently running for a connector
GET /connectors/{name}/tasks/{taskid}/status - get current status of the task, including if it is running, failed, paused, etc., which worker it is assigned to, and error information if it has failed
PUT /connectors/{name}/pause - pause the connector and its tasks, which stops message processing until the connector is resumed
PUT /connectors/{name}/resume - resume a paused connector (or do nothing if the connector is not paused)
POST /connectors/{name}/restart - restart a connector (typically because it has failed)
POST /connectors/{name}/tasks/{taskId}/restart - restart an individual task (typically because it has failed)
DELETE /connectors/{name} - delete a connector, halting all tasks and deleting its configuration
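If you prefer the command line to Postman, the same calls can be made with curl. A minimal sketch against the worker used in this post (test-connector is the connector name created in section 4 below):
curl -s http://10.203.0.43:8083/connectors
curl -s http://10.203.0.43:8083/connectors/test-connector/status
curl -s -X PUT http://10.203.0.43:8083/connectors/test-connector/pause
curl -s -X DELETE http://10.203.0.43:8083/connectors/test-connector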
Send a request to the base URL http://10.203.0.43:8083 and you get:
{
"version": "2.4.1",
"commit": "c57222ae8cd7866b",
"kafka_cluster_id": "wFh4bgsGQeSDicHf1xpT6Q"
}
This indicates the Kafka Connect service has started successfully. The next step is to ingest change logs with the appropriate connector JAR, according to business needs.
Here I use MySQL binlog ingestion as the example, so we first need a Kafka Connect MySQL connector JAR. I use the connector developed by the Debezium project (https://debezium.io); deploying it just means placing it in the Kafka Connect plugin directory. The JAR download itself is omitted here; a rough sketch of the deployment step follows.
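As a rough sketch (the archive name below is a placeholder; use whatever Debezium MySQL connector release you actually downloaded), unpack the plugin into plugin.path, stop and restart the Connect worker so it picks up the new plugin, and verify the plugin is visible through the REST API:
tar -xzf debezium-connector-mysql-<version>-plugin.tar.gz -C /data/kafka/connector/
bin/connect-distributed.sh -daemon config/connect-distributed.properties
curl -s http://10.203.0.43:8083/connector-plugins
io.debezium.connector.mysql.MySqlConnector should appear in the returned plugin list.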
4. MySQL binlog ingestion
1) Prepare the MySQL connection details, such as the user and password and the databases or tables to capture, and build them into a JSON payload like this:
{ "name": "test-connector", "config": { "connector.class": "io.debezium.connector.mysql.MySqlConnector", "tasks.max": "1", "database.hostname": "10.203.0.93", "database.port": "3306", "database.user": "admin", "database.password": "internal", "database.server.id": "1", "database.server.name": "test", "database.whitelist": "test.*", "database.history.kafka.bootstrap.servers": "10.203.3.91:9092,10.203.0.44:9092,10.203.0.43:9092", "database.history.kafka.topic": "dbhistory.test" } }
2) In Postman, POST the JSON above to http://10.203.0.43:8083/connectors to create the task (an equivalent curl call is sketched below).
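Assuming the JSON payload has been saved to a file (the file name register-test-connector.json here is just an illustration), the equivalent curl call would be:
curl -s -X POST -H "Content-Type: application/json" --data @register-test-connector.json http://10.203.0.43:8083/connectors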
3) Check the task:
http://10.203.0.43:8083/connectors/test-connector
The response is:
{
"name": "test-connector",
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"database.user": "admin",
"database.server.id": "1",
"tasks.max": "1",
"database.hostname": "10.203.0.93",
"database.password": "internal",
"database.history.kafka.bootstrap.servers": "10.203.3.91:9092,10.203.0.44:9092,10.203.0.43:9092",
"database.history.kafka.topic": "dbhistory.test",
"name": "test-connector",
"database.server.name": "test",
"database.whitelist": "test.*",
"database.port": "3306"
},
"tasks": [
{
"connector": "test-connector",
"task": 0
}
],
"type": "source"
}
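To confirm the connector and its task are actually running rather than merely registered, query the status endpoint listed earlier; a healthy connector should report a RUNNING state for both the connector and its task:
curl -s http://10.203.0.43:8083/connectors/test-connector/status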
This shows the task has started. Next, execute DDL or DML against the database or its tables, and consume the resulting messages from Kafka.
The default naming rule is: DDL events go to a topic named after database.server.name, while DML events for each table go to a topic named database.server.name.database.table. You can see the corresponding topics with the kafka-topics script, for example:
bin/kafka-topics.sh --bootstrap-server 10.203.3.91:9092 --list
__consumer_offsets
connect-configs
connect-offsets
connect-status
connect-test
dbhistory.test
test
test.jlwang
test.test.jlwang
test.test.jlwang1
The actual change messages delivered to Kafka are not shown here; inspect them with the console consumer script.
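For example, a minimal sketch of consuming the change events of table jlwang in database test (topic test.test.jlwang, following the naming rule above):
bin/kafka-console-consumer.sh --bootstrap-server 10.203.3.91:9092 --topic test.test.jlwang --from-beginning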
The above is only a basic walkthrough; what I really want to discuss is why this matters. Because ingestion can be done on demand, it is more fine-grained than canal/maxwell: in streaming development we can synchronize table by table, submitting a new task directly through the REST API, with no restarts and no extra configuration. The benefits are obvious when building heterogeneous database synchronization or a real-time data warehouse.