Kafka Connect writing to HBase and HDFS

This article implements it in a plugin-based way.
Kafka version: 1.1.1
Installed via HDP (Hortonworks Data Platform).

I. Preparing the Kafka connector plugin packages

1. The kafka-connect-hdfs plugin package

Note: the HDFS plugin package can be downloaded directly from the Confluent community hub and has no license restrictions.

Plugin download page: https://www.confluent.io/hub/confluentinc/kafka-connect-hdfs
Plugin version: 5.3.2
Plugin package name: confluentinc-kafka-connect-hdfs-5.3.2.zip

2. The kafka-connect-hbase plugin package

Note: the Confluent community hub does not provide an HBase plugin package; the plugin used here has no license restrictions.

The HBase connector plugin is built from source into a jar.
Source repository: https://github.com/nishutayal/kafka-connect-hbase
Clone it locally: git clone https://github.com/nishutayal/kafka-connect-hbase.git
Build command: mvn clean package
The build produces the plugin jar: target/hbase-sink.jar

3. Upload the plugin packages to the Kafka installation directory

① Go to the Kafka installation directory and create a plugins directory.



② Under the plugins directory, create a kafka-connect-hbase directory and upload the compiled plugin jar hbase-sink.jar.



③ Upload the connect-hdfs plugin package confluentinc-kafka-connect-hdfs-5.3.2.zip into the plugins directory and unzip it there.

Rename it: mv confluentinc-kafka-connect-hdfs-5.3.2 kafka-connect-hdfs
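The three steps above boil down to the following commands (a sketch: the Kafka installation path is taken from the plugin.path configured later in this article, and the path to the kafka-connect-hbase source checkout is a placeholder for wherever you cloned and built it):

# ① create the plugins directory under the Kafka installation
cd /usr/hdp/3.0.1.0-187/kafka
mkdir -p plugins/kafka-connect-hbase
# ② copy the compiled HBase sink jar into its plugin directory
cp ~/kafka-connect-hbase/target/hbase-sink.jar plugins/kafka-connect-hbase/
# ③ unzip the HDFS connector package and rename the extracted directory
cd plugins
unzip confluentinc-kafka-connect-hdfs-5.3.2.zip
mv confluentinc-kafka-connect-hdfs-5.3.2 kafka-connect-hdfs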


II. Installing and deploying kafka-connect-hdfs

Switch to the kafka user.

Standalone deployment

1. Create a topic for messages to be written to

bin/kafka-topics.sh --create --zookeeper hanyun-2:2181 --replication-factor 1 --partitions 3 --topic hdfstopic

2. Configure the worker (conf/connect-standalone.properties)

bootstrap.servers=hy-2:6667
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter 
key.converter.schemas.enable=false
value.converter.schemas.enable=false
internal.key.converter=org.apache.kafka.connect.storage.StringConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
plugin.path= /usr/hdp/3.0.1.0-187/kafka/plugins
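With value.converter.schemas.enable=false, the JsonConverter accepts plain JSON records as-is. If it were set to true, each record would have to carry the converter's schema/payload envelope, roughly like this (a sketch for illustration; the field names are made up):

{"id":1,"name":"test1"}
# versus, with schemas.enable=true:
{"schema":{"type":"struct","fields":[{"field":"id","type":"int32"},{"field":"name","type":"string"}]},"payload":{"id":1,"name":"test1"}}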

3. Configure the connector file
name: the connector name
topics: the Kafka topic that messages are written to
topics.dir: the parent HDFS path for the writes, e.g. /xiehh/hdfstopic/partition=0/…
(the second path level is the topic name, the third is the partition)
logs.dir: the HDFS log directory

[kafka@hy-2 kafka]$ vim plugins/kafka-connect-hdfs/etc/quickstart-hdfs.properties
    name=hdfs-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    #format.class=io.confluent.connect.hdfs.string.StringFormat
    format.class=io.confluent.connect.hdfs.json.JsonFormat
    tasks.max=1
    topics=hdfstopic
    hdfs.url=hdfs://hanyun-1:8020
    flush.size=3
    #HDFS output directory configuration
    topics.dir=xiehh
    logs.dir=logs

4. Start the connector

[kafka@hy-2 kafka]$ nohup bin/connect-standalone.sh conf/connect-standalone.properties plugins/kafka-connect-hdfs/etc/quickstart-hdfs.properties &
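Once the worker is up, it also exposes the Connect REST API (port 8083 by default), which can be used to confirm that the plugins under plugin.path were picked up and that the connector is running:

# list the connector plugins discovered under plugin.path
curl http://hy-2:8083/connector-plugins
# list the running connectors (should include hdfs-sink)
curl http://hy-2:8083/connectors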

5. Messages written to the Kafka topic are saved to the HDFS directory.
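For example, a few JSON records can be produced with the console producer (a sketch: the broker address matches bootstrap.servers above, and the payload fields are made up). With flush.size=3, a file is committed to HDFS after every three records:

bin/kafka-console-producer.sh --broker-list hy-2:6667 --topic hdfstopic
>{"id":1,"name":"test1"}
>{"id":2,"name":"test2"}
>{"id":3,"name":"test3"}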

Distributed deployment

Create the Kafka topic for messages to be written to:
bin/kafka-topics.sh --create --zookeeper hanyun-2:2181 --replication-factor 1 --partitions 3 --topic hdfstopic
The topic is created here with multiple partitions, so messages can be hashed by key and written automatically to the corresponding partition, as in the example below.
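A keyed record can be produced like this (a sketch; parse.key and key.separator are standard console-producer properties, and the keys and payloads are made up):

printf 'key1:{"id":1,"name":"test1"}\nkey2:{"id":2,"name":"test2"}\n' | \
  bin/kafka-console-producer.sh --broker-list hy-2:6667 --topic hdfstopic \
  --property parse.key=true --property key.separator=: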
Perform the following steps on every server.
Notes:

  1. group.id=connect-cluster # a unique name for the cluster, used to form the Connect cluster group; note that this must not conflict with any consumer group ID.

  2. offset.storage.topic=connect-offsets, config.storage.topic=connect-configs, and status.storage.topic=connect-status # it is recommended to create these three topics manually, with the following commands:

bin/kafka-topics.sh --create --zookeeper hanyun-2:2181 --topic connect-configs --replication-factor 3 --partitions 1 --config cleanup.policy=compact

bin/kafka-topics.sh --create --zookeeper hanyun-2:2181 --topic connect-offsets --replication-factor 3 --partitions 3 --config cleanup.policy=compact

bin/kafka-topics.sh --create --zookeeper hanyun-2:2181 --topic connect-status --replication-factor 3 --partitions 3 --config cleanup.policy=compact

config.storage.topic (default connect-configs): the topic used to store connector and task configurations; note that this should be a single-partition, highly replicated, compacted topic. You may need to create the topic manually to ensure the correct configuration, since auto-created topics may have multiple partitions or be set for deletion rather than compaction.
offset.storage.topic (default connect-offsets): the topic used to store offsets; this topic should have multiple partitions, be replicated, and be configured for compaction.
status.storage.topic (default connect-status): the topic used to store status; this topic can have multiple partitions and should be replicated and configured for compaction.

  3. rest.host.name=hanyun-3
    rest.port=8083
    Set rest.host.name to each server's own hostname.

1. Configure the worker

[kafka@hy-2 conf]$ vim connect-distributed.properties

bootstrap.servers=hy-2:6667
group.id=connect-cluster
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter 
key.converter.schemas.enable=false
value.converter.schemas.enable=false
internal.key.converter=org.apache.kafka.connect.storage.StringConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false

offset.storage.topic=connect-offsets
offset.storage.replication.factor=1
#offset.storage.partitions=25

config.storage.topic=connect-configs
config.storage.replication.factor=1

status.storage.topic=connect-status
status.storage.replication.factor=1
#status.storage.partitions=5

rest.host.name=hy-3
rest.port=8083

plugin.path= /usr/hdp/3.0.1.0-187/kafka/plugins

2. Start the worker on each server

[kafka@hy-2 kafka]$ nohup bin/connect-distributed.sh conf/connect-distributed.properties &

3. In distributed mode the connector is created through the REST API (standalone mode can load and start a connector from the command line, as shown in the standalone section above; distributed mode supports only the REST API, not the command line). A cluster can host multiple connectors.

[kafka@hanyun-2 kafka]$ curl -X PUT -H "Content-Type: application/json" --data '{"name":"hdfs-sink","connector.class":"io.confluent.connect.hdfs.HdfsSinkConnector","format.class":"io.confluent.connect.hdfs.json.JsonFormat","tasks.max":"2","topics":"hdfstopic","hdfs.url":"hdfs://hanyun-1:8020","flush.size":"3","topics.dir":"xiehh","logs.dir":"logs"}' hanyun-2:8083/connectors/hdfs-sink/config
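The same REST API can then be used to manage the connector, for example:

# view the status of the connector and its tasks
curl hanyun-2:8083/connectors/hdfs-sink/status
# list all connectors running in the cluster
curl hanyun-2:8083/connectors
# delete the connector
curl -X DELETE hanyun-2:8083/connectors/hdfs-sink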

The log excerpts below show the two servers after startup.
Server 1: assigned partitions 1 and 0

[2020-02-22 10:43:52,236] INFO Started recovery for topic partition hdfstopic-1 (io.confluent.connect.hdfs.TopicPartitionWriter:255)
[2020-02-22 10:43:52,245] INFO Finished recovery for topic partition hdfstopic-1 (io.confluent.connect.hdfs.TopicPartitionWriter:274)
[2020-02-22 10:43:52,246] INFO Started recovery for topic partition hdfstopic-0 (io.confluent.connect.hdfs.TopicPartitionWriter:255)
[2020-02-22 10:43:52,249] INFO Finished recovery for topic partition hdfstopic-0 (io.confluent.connect.hdfs.TopicPartitionWriter:274)
[2020-02-22 10:45:41,905] INFO Cluster ID: KioCbJz1TUicFq6THN3UHg (org.apache.kafka.clients.Metadata:265)

Server 2: assigned partition 2

[2020-02-22 10:43:46,173] INFO WorkerSinkTask{id=hdfs-sink-1} Sink task finished initialization and start (org.apache.kafka.connect.runtime.WorkerSinkTask:295)
[2020-02-22 10:43:46,177] INFO Cluster ID: KioCbJz1TUicFq6THN3UHg (org.apache.kafka.clients.Metadata:265)
[2020-02-22 10:43:46,178] INFO [Consumer clientId=consumer-4, groupId=connect-hdfs-sink] Discovered group coordinator hanyun-3:6667 (id: 2147482644 rack: null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:605)
[2020-02-22 10:43:46,178] INFO [Consumer clientId=consumer-4, groupId=connect-hdfs-sink] Revoking previously assigned partitions [] (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:411)
[2020-02-22 10:43:46,179] INFO [Consumer clientId=consumer-4, groupId=connect-hdfs-sink] (Re-)joining group (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:442)
[2020-02-22 10:43:52,192] INFO [Consumer clientId=consumer-4, groupId=connect-hdfs-sink] Successfully joined group with generation 1 (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:409)
[2020-02-22 10:43:52,194] INFO [Consumer clientId=consumer-4, groupId=connect-hdfs-sink] Setting newly assigned partitions [hdfstopic-2] (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:256)
[2020-02-22 10:43:52,198] INFO [Consumer clientId=consumer-4, groupId=connect-hdfs-sink] Resetting offset for partition hdfstopic-2 to offset 0. (org.apache.kafka.clients.consumer.internals.Fetcher:561)
[2020-02-22 10:43:52,211] INFO Started recovery for topic partition hdfstopic-2 (io.confluent.connect.hdfs.TopicPartitionWriter:255)
[2020-02-22 10:43:52,220] INFO Finished recovery for topic partition hdfstopic-2 (io.confluent.connect.hdfs.TopicPartitionWriter:274)
[2020-02-22 10:45:53,787] INFO Cluster ID: KioCbJz1TUicFq6THN3UHg (org.apache.kafka.clients.Metadata:265)
[2020-02-22 10:45:54,152] INFO Cluster ID: KioCbJz1TUicFq6THN3UHg (org.apache.kafka.clients.Metadata:265)

4. Write messages and verify
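The output can be checked directly on HDFS; with topics.dir=xiehh, files land under /xiehh/<topic>/partition=<n>/ (a sketch, run as a user with HDFS access; file names follow the connector's <topic>+<partition>+<startOffset>+<endOffset> pattern):

hdfs dfs -ls /xiehh/hdfstopic
hdfs dfs -ls /xiehh/hdfstopic/partition=0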


III. Installing and deploying kafka-connect-hbase

The steps are the same as for the hdfs-connect deployment.

The hbase-connector configuration is as follows:

name=kafka-cdc-hbase
connector.class=io.svectors.hbase.sink.HBaseSinkConnector
tasks.max=1
topics=hbasetopic
zookeeper.quorum=hanyun-2:2181
#event.parser.class=io.svectors.hbase.parser.AvroEventParser
event.parser.class=io.svectors.hbase.parser.JsonEventParser
hbase.table.name=hbase-products
hbase.hbase-products.rowkey.columns=firstName,lastName
hbase.hbase-products.rowkey.delimiter=|
#hbase.hbase-products.family=c,d
#hbase.hbase-products.c.columns=c1,c2
#hbase.hbase-products.d.columns=d1,d2
hbase.hbase-products.family=products,cf
hbase.hbase-products.products.columns=lastName
hbase.hbase-products.cf.columns=age,weightInKgs

Parameter descriptions:

event.parser.class=io.svectors.hbase.parser.JsonEventParser     // use the JSON event parser
hbase.table.name=hbase-products    // the HBase table name
hbase.hbase-products.rowkey.columns=firstName,lastName    // the message fields that make up the row key
hbase.hbase-products.rowkey.delimiter=|        // the delimiter used to join the row key parts
#hbase.hbase-products.family=c,d               // configure the column families
#hbase.hbase-products.c.columns=c1,c2          // configure the columns of each family
#hbase.hbase-products.d.columns=d1,d2
hbase.hbase-products.family=products,cf
hbase.hbase-products.products.columns=lastName
hbase.hbase-products.cf.columns=age,weightInKgs

For standalone deployment, start the connector by loading it from the command line:

[kafka@hy-3 kafka]# nohup bin/connect-standalone.sh conf/connect-standalone.properties plugins/kafka-connect-hbase/hbase-sink-connector.properties &
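With this configuration, firstName and lastName joined by | form the row key, lastName goes to the products column family, and age and weightInKgs go to the cf family. A minimal end-to-end check could look like this (a sketch: it assumes the hbase-products table is created up front with both column families, and the field values are made up):

# create the target table in the HBase shell
echo "create 'hbase-products', 'products', 'cf'" | hbase shell
# produce a JSON record to the source topic
echo '{"firstName":"John","lastName":"Smith","age":30,"weightInKgs":70.5}' | \
  bin/kafka-console-producer.sh --broker-list hy-2:6667 --topic hbasetopic
# the sink should write a row with key John|Smith
echo "scan 'hbase-products'" | hbase shell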
