Using Kafka Connect to Capture MySQL Data and Sync It to Elasticsearch - Liu Yu

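  • I. Install ZooKeeper
    • 1. Extract the ZooKeeper tarball
    • 2. Create the directories ZooKeeper uses
    • 3. Edit the ZooKeeper configuration file
    • 4. Add the unique ZooKeeper ID
    • 5. Start ZooKeeper
  • II. Install Kafka
    • 1. Extract
    • 2. Edit the configuration file
    • 3. Start Kafka in the background
  • III. Install Elasticsearch
    • 1. Extract Elasticsearch
    • 2. Edit the configuration file
    • 3. Create the data and logs directories
    • 4. Create a user to run Elasticsearch
    • 5. Start Elasticsearch
    • 6. Troubleshooting
      • 6.1 Error 1
      • 6.2 Error 2
      • 6.3 Error 3
      • 6.4 Error 4
  • IV. Configure Kafka Connect to sync MySQL data to Elasticsearch
    • 1. Preparation
      • 1.1 Required JARs
      • 1.2 Create the database table to sync
      • 1.3 Configure the MySQL-to-Kafka connector in Kafka's config directory
      • 1.4 Configure the Kafka-to-Elasticsearch connector in Kafka's config directory
    • 2. Run Connect
    • 3. The Connector API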

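Author: Liu Yu (刘宇)
CSDN blog: https://blog.csdn.net/liuyu973971883
Parts of this post draw on other sources; if anything infringes, please contact me and I will remove it. Corrections are welcome, thank you.

Prerequisite: a Java runtime must be installed. I use JDK 1.8 here; the installation is not covered.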

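I. Install ZooKeeper

This sets up a ZooKeeper cluster; repeat the steps below on every node.

1. Extract the ZooKeeper tarball

cd /software
tar -xzvf zookeeper-3.4.14.tar.gz

2. Create the directories ZooKeeper uses

# Enter the extracted ZooKeeper directory
cd /software/zookeeper-3.4.14
# Create the directory for ZooKeeper snapshots
mkdir dataDir
# Create the directory for ZooKeeper transaction logs
mkdir dataDirLog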

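3. Edit the ZooKeeper configuration file

  • Copy the sample configuration file
# Enter the conf directory under the ZooKeeper directory
cd /software/zookeeper-3.4.14/conf
# Copy the sample configuration
cp zoo_sample.cfg zoo.cfg
  • Edit the configuration file
# Edit zoo.cfg
vi zoo.cfg

Add or update the following entries

# Paths of the directories created above
dataDir=/software/zookeeper-3.4.14/dataDir
dataLogDir=/software/zookeeper-3.4.14/dataDirLog
# The ZooKeeper cluster: one server.N line per ZooKeeper node
server.1=192.168.40.101:2888:3888
server.2=192.168.40.102:2888:3888
server.3=192.168.40.103:2888:3888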

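4. Add the unique ZooKeeper ID

  • Enter the snapshot directory created above
cd /software/zookeeper-3.4.14/dataDir
  • Create the myid file. The number is the N of this node's server.N line in zoo.cfg, and every ZooKeeper node needs its own myid file
echo "1" > myid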
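
Since every node needs a different myid, it can help to look the number up in zoo.cfg instead of typing it by hand. The sketch below is one way to do that. myid_for_ip is a hypothetical helper (it is not part of ZooKeeper), and the demo runs against a throwaway copy of the server list in a temp directory, so nothing real is modified.

```shell
# myid_for_ip CFG IP: print the N of the "server.N=IP:port:port" line matching IP.
# Hypothetical helper for illustration; not shipped with ZooKeeper.
myid_for_ip() {
  awk -F= -v ip="$2" '
    $1 ~ /^server\.[0-9]+$/ {
      split($2, parts, ":")          # parts[1] is the host part of the entry
      if (parts[1] == ip) {
        sub(/^server\./, "", $1)     # keep only the number after "server."
        print $1
      }
    }' "$1"
}

# Demo on a temporary copy of the server list from step 3.
demo_dir="$(mktemp -d)"
cat > "$demo_dir/zoo.cfg" <<'EOF'
dataDir=/software/zookeeper-3.4.14/dataDir
server.1=192.168.40.101:2888:3888
server.2=192.168.40.102:2888:3888
server.3=192.168.40.103:2888:3888
EOF

# Pretend we are on 192.168.40.102 and write its myid file.
myid_for_ip "$demo_dir/zoo.cfg" 192.168.40.102 > "$demo_dir/myid"
cat "$demo_dir/myid"
```

On a real node you would point the helper at /software/zookeeper-3.4.14/conf/zoo.cfg, pass that node's own IP, and redirect the output to /software/zookeeper-3.4.14/dataDir/myid.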

5. Start ZooKeeper

# Enter the bin directory under the ZooKeeper directory
cd /software/zookeeper-3.4.14/bin
# Start ZooKeeper
./zkServer.sh start

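II. Install Kafka

This sets up a single Kafka broker, installed on the Linux host 192.168.40.103.

1. Extract

# Enter the software directory
cd /software
# Extract
tar -xzvf kafka_2.11-2.2.1.tgz
# Rename the directory
mv kafka_2.11-2.2.1 kafka

2. Edit the configuration file

  • Enter Kafka's config directory and edit the configuration file
cd /software/kafka/config
vi server.properties
  • The default configuration file, annotated (based on this post: https://www.cnblogs.com/toutou/p/linux_install_kafka.html)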
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# see kafka.server.KafkaConfig for additional details and defaults

############################# Server Basics #############################

# A broker is one deployed Kafka instance. Every broker in a Kafka cluster needs a broker.id,
# which must be a unique integer.
broker.id=0

############################# Socket Server Settings #############################

# The address the socket server listens on. It will get the value returned from 
# java.net.InetAddress.getCanonicalHostName() if not configured.
#   FORMAT:
#     listeners = security_protocol://host_name:port
#   EXAMPLE:
#     listeners = PLAINTEXT://your.host.name:9092
#listeners=PLAINTEXT://:9092

# Hostname and port the broker will advertise to producers and consumers. If not set, 
# it uses the value for "listeners" if configured.  Otherwise, it will use the value
# returned from java.net.InetAddress.getCanonicalHostName().
#advertised.listeners=PLAINTEXT://your.host.name:9092

# The number of threads handling network requests
# (default: 3)
num.network.threads=3

# The number of threads doing disk I/O
# (default: 8)
num.io.threads=8

# The send buffer (SO_SNDBUF) used by the socket server
# (default: 100 KB)
socket.send.buffer.bytes=102400

# The receive buffer (SO_RCVBUF) used by the socket server
# (default: 100 KB)
socket.receive.buffer.bytes=102400

# The maximum size of a request that the socket server will accept (protection against OOM)
# (this is the maximum size of a single request; default: 100 MB)
socket.request.max.bytes=104857600

############################# Log Basics (the data section; Kafka calls its data "log") #############################

# A comma separated list of directories under which to store log files
# (these directories hold the messages Kafka receives, not application logs)
log.dirs=/home/uplooking/data/kafka

# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
# (default: 1. A partition is one shard of a topic's data: the topic is split into
# several parts that are stored separately.)
num.partitions=1

# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs located in RAID array.
num.recovery.threads.per.data.dir=1

############################# Internal Topic Settings  #############################
# The replication factor for the group metadata internal topics "__consumer_offsets" and "__transaction_state"
# For anything other than development testing, a value greater than 1 is recommended to ensure availability such as 3.
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1


############################# Log Flush Policy #############################

# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
# There are a few important trade-offs here:
#    1. Durability: Unflushed data may be lost if you are not using replication.
#    2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
#    3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to excessive seeks.
# The settings below allow one to configure the flush policy to flush data after a period of time or
# every N messages (or both). This can be done globally and overridden on a per-topic basis.
# (The flush policy can be based on message count or on elapsed time, but not on data size.
# Either or both can be set; both sample values below are commented out in this file.)

# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=10000

# The maximum amount of time a message can sit in a log before we force a flush
#log.flush.interval.ms=1000

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.

# The minimum age of a log file to be eligible for deletion
# (time-based policy; default: 168 hours = 7 days)
log.retention.hours=168

# A size-based retention policy for logs. Segments are pruned from the log as long as the remaining
# segments don't drop below log.retention.bytes.
# (size-based policy; the sample value is 1 GB)
#log.retention.bytes=1073741824

# The maximum size of a log segment file. When this size is reached a new log segment will be created.
# (1 GB)
log.segment.bytes=1073741824

# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies
# (300000 ms = 5 minutes)
log.retention.check.interval.ms=300000

############################# Zookeeper #############################

# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=localhost:2181

# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000

############################# Group Coordinator Settings #############################

# The following configuration specifies the time, in milliseconds, that the GroupCoordinator will delay the initial consumer rebalance.
# The rebalance will be further delayed by the value of group.initial.rebalance.delay.ms as new members join the group, up to a maximum of max.poll.interval.ms.
# The default value for this is 3 seconds.
# We override this to 0 here as it makes for a better out-of-the-box experience for development and testing.
# However, in production environments the default value of 3 seconds is more suitable as this will help to avoid unnecessary, and potentially expensive, rebalances during application startup.
group.initial.rebalance.delay.ms=0
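  • Change the following entries in the configuration file
broker.id=1
# Listener for a Kafka cluster on the internal network; it tells clients which host name and port to connect to. If clients reach the broker over an external network, set advertised.listeners as well.
listeners=PLAINTEXT://192.168.40.103:9092
# Addresses of the ZooKeeper cluster
zookeeper.connect=192.168.40.101:2181,192.168.40.102:2181,192.168.40.103:2181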

3. Start Kafka in the background

# Enter the Kafka directory
cd /software/kafka
# Start in the background
nohup bin/kafka-server-start.sh config/server.properties &
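III. Install Elasticsearch

This sets up a single Elasticsearch node, installed on the Linux host 192.168.40.103.

1. Extract Elasticsearch

# Enter the software directory
cd /software
# Extract
tar -zxvf elasticsearch-5.6.8.tar.gz

2. Edit the configuration file

  • Edit the configuration file
vi /software/elasticsearch-5.6.8/config/elasticsearch.yml
  • Change the following settings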
cluster.name: my-application
node.name: node-1
path.data: /software/elasticsearch-5.6.8/data
path.logs: /software/elasticsearch-5.6.8/logs
network.host: 0.0.0.0
http.port: 9200

3. Create the data and logs directories

# Enter the Elasticsearch directory
cd /software/elasticsearch-5.6.8
# Create the data directory
mkdir data
# Create the logs directory
mkdir logs

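4. Create a user to run Elasticsearch

Elasticsearch refuses to start as root, so we create a dedicated user and group to run it.

  • Create the user and group
# Create the group
groupadd elsearch
# Create the user and add it to the group
useradd -r -g elsearch elsearch
passwd elsearch
  • Give that user and group ownership of the Elasticsearch directory
chown -R elsearch:elsearch /software/elasticsearch-5.6.8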

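5. Start Elasticsearch

  • Start
# Switch to the user created above
su elsearch
# Enter the bin directory under the Elasticsearch directory
cd /software/elasticsearch-5.6.8/bin
# Start in the background
nohup ./elasticsearch &
# Watch the nohup log for errors; checks such as the thread limit usually fail on the first start
tail -f nohup.out
  • Check that it started
curl http://192.168.40.103:9200
# A response like the following means it started successfully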
{
  "name" : "node-1",
  "cluster_name" : "my-application",
  "cluster_uuid" : "2UlrJ43PQDKbrqvcTG9IyA",
  "version" : {
    "number" : "5.6.8",
    "build_hash" : "688ecce",
    "build_date" : "2018-02-16T16:46:30.010Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.1"
  },
  "tagline" : "You Know, for Search"
}

6. Troubleshooting

6.1 Error 1

  • max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]

The limit on the number of open files is too low. Switch to the root user and edit /etc/security/limits.conf.

  • Add the following lines, then reboot Linux
*                soft    nofile          65536
*                hard    nofile          65536

6.2 Error 2

  • max number of threads [3818] for user [es] is too low, increase to at least [4096]

The limit on the number of threads is too low. Switch to the root user and edit /etc/security/limits.conf.

  • Add the following lines, then reboot Linux
*                soft    nproc           4096
*                hard    nproc           4096

6.3 Error 3

  • max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

The number of VMAs (virtual memory areas) a single process may own is too low. Switch to the root user and edit /etc/sysctl.conf.

  • Add the following setting
vm.max_map_count=262144
  • Apply it immediately
sysctl -p
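
Before restarting Elasticsearch it can be handy to confirm that the three limits from errors 1 to 3 are really in effect. The sketch below compares the current values with the minimums quoted in the error messages. check_limit is a hypothetical helper written for this post, and the script should be run as the elsearch user, since ulimit reports the limits of the current user. If your shell does not support ulimit -u, that line simply reports UNKNOWN.

```shell
# check_limit NAME CURRENT REQUIRED: classify a value against the required minimum.
# Hypothetical helper for illustration.
check_limit() {
  name="$1"; current="$2"; required="$3"
  case "$current" in
    unlimited)
      echo "OK      $name=unlimited" ;;
    ''|*[!0-9]*)
      # Empty or non-numeric: the value could not be read on this system.
      echo "UNKNOWN $name (could not read the value)" ;;
    *)
      if [ "$current" -ge "$required" ]; then
        echo "OK      $name=$current"
      else
        echo "LOW     $name=$current, needs at least $required"
      fi
      ;;
  esac
}

# Open files (error 1), user processes/threads (error 2), VMAs per process (error 3).
check_limit nofile "$(ulimit -n 2>/dev/null || echo unknown)" 65536
check_limit nproc "$(ulimit -u 2>/dev/null || echo unknown)" 4096
check_limit vm.max_map_count "$(cat /proc/sys/vm/max_map_count 2>/dev/null || echo unknown)" 262144
```

Any line starting with LOW points back to the matching fix above.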

6.4 Error 4

  • system call filters failed to install; check the logs and fix your configuration or disable system call filters at your own risk

CentOS 6 does not support seccomp, while Elasticsearch 5.6.8 sets bootstrap.system_call_filter to true by default and checks for it at startup. The check fails, and Elasticsearch refuses to start.

  • Fix:
    Set bootstrap.system_call_filter to false in elasticsearch.yml, placing it below the Memory section, then restart Elasticsearch
bootstrap.memory_lock: false
bootstrap.system_call_filter: false

IV. Configure Kafka Connect to sync MySQL data to Elasticsearch

1. Preparation

1.1 Required JARs

  • kafka-connect-jdbc-4.1.1
  • mysql-connector-java-5.1.40.jar
  • kafka-connect-elasticsearch-5.4.1.jar
  • commons-codec-1.11.jar, commons-logging-1.2.jar, httpclient-4.5.12.jar, httpcore-4.4.13.jar
  • common-utils-5.4.1.jar
  • httpasyncclient-4.1.3.jar
  • httpcore-nio-4.4.6.jar
  • jest-6.3.1.jar, jest-common-6.3.1.jar
  • gson-2.8.5.jar
  • slf4j-api-1.7.26.jar

1.2 Create the database table to sync

create database test1;
use test1;
create table user(id int PRIMARY KEY AUTO_INCREMENT,username varchar(50),password varchar(50));

1.3 Configure the MySQL-to-Kafka connector in Kafka's config directory

  • Create mysql-test1.properties
# Connector name
name=mysql_test1
# Connector class
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
# Maximum number of tasks
tasks.max=1
# MySQL JDBC URL
connection.url=jdbc:mysql://192.168.40.102:3306/test1?user=root&password=root&useUnicode=true&characterEncoding=utf-8&useSSL=false&serverTimezone=GMT&autoReconnect=true
# Change-detection mode: incrementing, timestamp, or timestamp+incrementing
mode=incrementing
# Column watched for new rows
incrementing.column.name=id
# Topic prefix
topic.prefix=mysql_test1_
# Poll the table every 10 seconds
poll.interval.ms=10000
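
One thing to keep in mind: incrementing mode only picks up newly inserted rows, because it looks for ids larger than the last one it saw. If updates to existing rows should be captured too, timestamp+incrementing mode is the usual alternative. The sketch below writes such a variant of the file to a temp directory. It assumes a timestamp column named updated_at, which the table in this post does not have; the column name, the connector name mysql_test1_ts, and the file name are all made up for illustration. It is meant as a replacement for the configuration above, not something to run alongside it.

```shell
out_dir="$(mktemp -d)"

# Write a timestamp+incrementing variant of the source connector configuration.
cat > "$out_dir/mysql-test1-ts.properties" <<'EOF'
# Hypothetical connector name for this variant
name=mysql_test1_ts
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:mysql://192.168.40.102:3306/test1?user=root&password=root&useSSL=false
# New rows are detected by id, changed rows by the timestamp column
mode=timestamp+incrementing
incrementing.column.name=id
# Assumed column; it could be added with something like:
#   ALTER TABLE user ADD COLUMN updated_at TIMESTAMP NOT NULL
#     DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;
timestamp.column.name=updated_at
topic.prefix=mysql_test1_
poll.interval.ms=10000
EOF

# Show the settings that differ from the incrementing-only configuration.
grep -E '^(mode|timestamp\.column\.name)=' "$out_dir/mysql-test1-ts.properties"
```

Note that with key.ignore=true on the Elasticsearch side, every change event is written as a new document rather than overwriting the earlier one, so this variant is only a starting point.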

1.4 Configure the Kafka-to-Elasticsearch connector in Kafka's config directory

  • Create es-mysql-test1.properties
# Connector name
name=es_mysql_test1
# Connector class
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
# Maximum number of tasks
tasks.max=1
# Topic name, normally the topic prefix plus the table name
topics=mysql_test1_user
# true: the ID of each record written to Elasticsearch is the Kafka topic name + partition id + offset
key.ignore=true
# Elasticsearch address
connection.url=http://192.168.40.103:9200
# Elasticsearch index type
type.name=test1_user

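2. Run Connect

This uses Connect in standalone mode.

# Enter Kafka's bin directory
cd /software/kafka/bin
# Start Connect in the background, passing the worker configuration followed by the two connector configuration files
nohup ./connect-standalone.sh ../config/connect-standalone.properties ../config/es-mysql-test1.properties ../config/mysql-test1.properties &
# Check the nohup log for errors, or query the status of each connector through the Connector API
tail -f nohup.out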
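
With both connectors running, the whole pipeline can be checked end to end: insert a row into MySQL, wait a bit longer than poll.interval.ms, then search Elasticsearch. The sketch below does that with the mysql command-line client and curl. It assumes the mysql client is installed on the machine you run it from, that the sink writes to an index named after the topic (mysql_test1_user), and it uses a made-up sample row. By default it only prints the commands (dry run); set DRY_RUN=0 to execute them.

```shell
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them

# run: print the command in dry-run mode (shown without shell quoting), otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

verify_sync() {
  # 1. Insert a sample row (hypothetical values) into the table from section 1.2.
  run mysql -h 192.168.40.102 -uroot -proot test1 \
      -e "INSERT INTO user(username, password) VALUES ('alice', 'secret');"
  # 2. Wait longer than poll.interval.ms (10 s) so the source connector picks it up.
  run sleep 15
  # 3. Search the index; assumed to be named after the topic.
  run curl -s "http://192.168.40.103:9200/mysql_test1_user/_search?pretty"
}

verify_sync
```

If the search result contains the inserted row, MySQL, Kafka Connect and Elasticsearch are wired together correctly.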

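3. The Connector API

# List the installed connector plugins (replace ip with the address of the Connect host)
curl -X GET http://ip:8083/connector-plugins
GET /connectors - returns the names of all running connectors.
POST /connectors - creates a new connector. The request body must be JSON with a name field and a config field: name is the connector's name, and config is a JSON object holding the connector's configuration.
GET /connectors/{name} - returns information about the given connector.
GET /connectors/{name}/config - returns the configuration of the given connector.
PUT /connectors/{name}/config - updates the configuration of the given connector.
GET /connectors/{name}/status - returns the state of the given connector, including whether it is running, paused, or failed; if an error occurred, the error details are listed as well.
GET /connectors/{name}/tasks - returns the tasks currently running for the given connector.
GET /connectors/{name}/tasks/{taskid}/status - returns the status of one task of the given connector.
PUT /connectors/{name}/pause - pauses the connector and its tasks; data processing stops until the connector is resumed.
PUT /connectors/{name}/resume - resumes a paused connector.
POST /connectors/{name}/restart - restarts a connector, most often used after the connector has failed.
POST /connectors/{name}/tasks/{taskId}/restart - restarts a single task, usually because it has failed.
DELETE /connectors/{name} - deletes a connector, stopping all of its tasks and removing its configuration.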
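
To make the endpoint list above a little more concrete, here is a small sketch of shell wrappers around curl for the two connectors from this post. The function names are my own, and the base URL assumes Connect runs on 192.168.40.103 with the REST port 8083 used in the curl example above; adjust it to your host. Running the script as is only prints the status URLs; the functions that call curl are meant to be invoked by hand.

```shell
CONNECT_URL="http://192.168.40.103:8083"   # assumed Connect host; the example above just says "ip"

# Build the full URL for an API path such as "connectors/mysql_test1/status".
connect_endpoint() {
  echo "$CONNECT_URL/$1"
}

# GET /connectors/{name}/status
connector_status() {
  curl -s "$(connect_endpoint "connectors/$1/status")"
}

# POST /connectors/{name}/restart
restart_connector() {
  curl -s -X POST "$(connect_endpoint "connectors/$1/restart")"
}

# PUT /connectors/{name}/pause and PUT /connectors/{name}/resume
pause_connector() {
  curl -s -X PUT "$(connect_endpoint "connectors/$1/pause")"
}
resume_connector() {
  curl -s -X PUT "$(connect_endpoint "connectors/$1/resume")"
}

# Print the status URLs of the two connectors configured in this post.
for name in mysql_test1 es_mysql_test1; do
  connect_endpoint "connectors/$name/status"
done
```

For example, connector_status es_mysql_test1 returns the JSON status of the Elasticsearch sink, and restart_connector mysql_test1 restarts the MySQL source after a failure.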
