Data Warehouse Ingestion Project [04 Ingestion Module: Installing Zookeeper, Kafka and Flume, plus a Few Small Kafka Source/Sink/Channel Examples]

Table of Contents

    • 1 Zookeeper Installation
      • (1) Cluster Plan
      • (2) Installation Steps
      • (3) Modify the Configuration Files
      • (4) Zookeeper Cluster Start/Stop Script
    • 2 Kafka Cluster Installation
      • (1) Cluster Plan
      • (2) Installation Steps
      • (3) Kafka Cluster Start/Stop Script
      • (4) Common Kafka Commands
      • (5) Kafka Project Experience
        • (a) Producer Stress Test
        • (b) Consumer Stress Test
        • (c) Calculating the Number of Kafka Machines
        • (d) Calculating the Number of Kafka Partitions
    • 3 Flume Installation
      • (1) Cluster Plan
      • (2) Installation Steps
      • (3) Flume Component Selection
        • (a) Source
        • (b) Channel
      • (4) Kafka Source Example
      • (5) Kafka Sink Example
      • (6) Kafka Sink (sending data to multiple topics)
      • (7) Kafka Channel Example 1 -- common
      • (8) Kafka Channel Example 2 -- no source
      • (9) Kafka Channel Example 3 -- no sink

1 Zookeeper Installation

(1) Cluster Plan

Service      Server hadoop101    Server hadoop102    Server hadoop103
Zookeeper    Zookeeper           Zookeeper           Zookeeper

(2) Installation Steps

# Prepare the installation package, upload it, and extract it
tar -zxvf apache-zookeeper-3.5.7-bin.tar.gz -C /opt/module/
# Rename the directory
mv apache-zookeeper-3.5.7-bin/ zookeeper-3.5.7
# Add the environment variables (in /etc/profile.d/my_env.sh, the file distributed later)
#ZOOKEEPER_HOME
export ZOOKEEPER_HOME=/opt/module/zookeeper-3.5.7
export PATH=$PATH:$ZOOKEEPER_HOME/bin
# If typing zkServer and pressing Tab completes the command, the environment is set up correctly

(3) Modify the Configuration Files

cd /opt/module/zookeeper-3.5.7/conf/
# Rename the sample configuration
mv zoo_sample.cfg zoo.cfg
# Create the zkData directory
mkdir /opt/module/zookeeper-3.5.7/zkData
# Create a myid file in the zkData directory
touch /opt/module/zookeeper-3.5.7/zkData/myid
# Put the number 1 in the file (use 2 on hadoop102 and 3 on hadoop103)
1
-------------------------------------------------
# Edit the configuration file
vim zoo.cfg
# Change dataDir
dataDir=/opt/module/zookeeper-3.5.7/zkData
# Configure how the Zookeeper nodes communicate with each other; append at the end of the file
server.1=hadoop101:2888:3888
server.2=hadoop102:2888:3888
server.3=hadoop103:2888:3888
--------------------------------------------------
# Distribute zookeeper to the other nodes
xsync zookeeper-3.5.7/
# On hadoop102, edit the myid file under /opt/module/zookeeper-3.5.7/zkData/
# and change the number to 2; do the same on hadoop103 with 3
# Distribute the environment variables
scp /etc/profile.d/my_env.sh root@hadoop102:/etc/profile.d/
scp /etc/profile.d/my_env.sh root@hadoop103:/etc/profile.d/
# Log out and reconnect so the environment variables take effect
# Start zookeeper
# Run the following commands on hadoop101, hadoop102 and hadoop103
zkServer.sh start
zkServer.sh status
zkServer.sh stop

(4) Zookeeper Cluster Start/Stop Script

vim zk.sh
------------------------------------------------------------
# Add the following content
#!/bin/bash
if [ $# -lt 1 ]
then
    echo "USAGE: zk.sh {start|stop|status}"
    exit
fi
case $1 in
"start"){
	for i in hadoop101 hadoop102 hadoop103
	do
        echo ---------- starting zookeeper on $i ------------
		ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh start"
	done
};;
"stop"){
	for i in hadoop101 hadoop102 hadoop103
	do
        echo ---------- stopping zookeeper on $i ------------
		ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh stop"
	done
};;
"status"){
	for i in hadoop101 hadoop102 hadoop103
	do
        echo ---------- checking zookeeper status on $i ------------
		ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh status"
	done
};;
*)
    echo "USAGE: zk.sh {start|stop|status}"
    exit
;;
esac
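
A quick usage sketch, mirroring what is done for the kafka script below (assuming zk.sh sits in a directory on PATH, e.g. ~/bin):

# make the script executable
chmod u+x zk.sh
# start the whole zookeeper cluster, check its status, then stop it
zk.sh start
zk.sh status
zk.sh stop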

2 Kafka Cluster Installation

(1) Cluster Plan

Service    Server hadoop101    Server hadoop102    Server hadoop103
Kafka      Kafka               Kafka               Kafka

(2) Installation Steps

# Prepare the installation package, upload it, and extract it
tar -zxvf kafka_2.11-2.4.1.tgz -C /opt/module/
# Create a data directory under the kafka directory
[hike@hadoop101 kafka_2.11-2.4.1]$ mkdir datas
# Edit server.properties under the config/ directory
vim config/server.properties
-----------------------------
# broker.id must be unique: 1 on hadoop101, 2 on hadoop102, and so on
broker.id=1
log.dirs=/opt/module/kafka_2.11-2.4.1/datas
zookeeper.connect=hadoop101:2181,hadoop102:2181,hadoop103:2181/kafka
-----------------------------
# Distribute kafka to the other nodes
xsync kafka_2.11-2.4.1/
# Configure environment variables
sudo vim /etc/profile.d/my_env.sh 
#KAFKA_HOME
export KAFKA_HOME=/opt/module/kafka_2.11-2.4.1
export PATH=$PATH:$KAFKA_HOME/bin
# Distribute the environment variables
sudo ~/bin/xsync /etc/profile.d/my_env.sh
# Change broker.id on hadoop102 and hadoop103 (2 and 3 respectively)
# Start kafka on hadoop101, hadoop102 and hadoop103; zookeeper must be running before kafka starts
kafka-server-start.sh -daemon /opt/module/kafka_2.11-2.4.1/config/server.properties 
# Open the zookeeper client and inspect the kafka metadata stored there
zkCli.sh
ls /
ls /kafka
ls /kafka/brokers
ls /kafka/brokers/ids
get /kafka/brokers/ids/1
get /kafka/controller
# Stop kafka
kafka-server-stop.sh
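
While the brokers are running, a quick sanity check besides the zookeeper inspection above is jps (it ships with the JDK used in this setup):

# run on each of hadoop101/102/103; a Kafka process should appear
# next to QuorumPeerMain (the zookeeper server)
jps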

(3) Kafka Cluster Start/Stop Script

vim kafka.sh
------------------------------------------------------------
# Add the following content
#!/bin/bash
if [ $# -lt 1 ]
then
    echo "USAGE: zk.sh {start|stop|status}"
    exit
fi
case $1 in
"start"){
	for i in hadoop101 hadoop102 hadoop103
	do
        echo ---------- starting kafka on $i ------------
		ssh $i "/opt/module/kafka_2.11-2.4.1/bin/kafka-server-start.sh -daemon /opt/module/kafka_2.11-2.4.1/config/server.properties"
	done
};;
"stop"){
	for i in hadoop101 hadoop102 hadoop103
	do
        echo ---------- stopping kafka on $i ------------
		ssh $i "/opt/module/kafka_2.11-2.4.1/bin/kafka-server-stop.sh"
	done
};;
*)
    echo "USAGE: kafka.sh {start|stop}"
    exit
;;
esac
------------------------------------------------------------------
# Make the script executable
chmod u+x kafka.sh
# Note: stop kafka before stopping zookeeper (and start zookeeper before starting kafka)
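
With both scripts in place, the day-to-day start/stop order looks like this (zk.sh is the zookeeper script from section 1):

# bring the stack up: zookeeper first, then kafka
zk.sh start
kafka.sh start
# take it down in the reverse order: kafka first, then zookeeper
kafka.sh stop
zk.sh stop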

(4) Common Kafka Commands

# Create, delete, alter and list topics
# List
kafka-topics.sh --list --bootstrap-server hadoop101:9092
# Create
kafka-topics.sh --create --topic hello --bootstrap-server hadoop101:9092 --partitions 2 --replication-factor 3
# Describe (detailed information)
kafka-topics.sh --describe --bootstrap-server hadoop101:9092 --topic hello
# Delete
kafka-topics.sh --delete --bootstrap-server hadoop101:9092 --topic hello
# Alter (the partition count can only be increased)
kafka-topics.sh --alter --bootstrap-server hadoop101:9092 --topic hello --partitions 6
# Produce and consume messages
# On hadoop101, start a console producer
kafka-console-producer.sh --topic hello --broker-list hadoop101:9092
# In another session on hadoop101, start a console consumer
kafka-console-consumer.sh --topic hello --bootstrap-server hadoop101:9092
# Type some messages in the producer and watch them arrive in the consumer
# Consume from the beginning of the topic
kafka-console-consumer.sh --topic hello --bootstrap-server hadoop101:9092 --from-beginning
# Consume as a member of a consumer group
kafka-console-consumer.sh --topic hello --bootstrap-server hadoop101:9092 --group g1
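
One more command that pairs well with the group consumer above (not in the original list, but part of the same Kafka CLI): inspecting a consumer group's offsets and lag.

# list all consumer groups known to the cluster
kafka-consumer-groups.sh --bootstrap-server hadoop101:9092 --list
# show partition offsets and lag for the g1 group created above
kafka-consumer-groups.sh --bootstrap-server hadoop101:9092 --describe --group g1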

(5) Kafka Project Experience

(a) Producer Stress Test

Use Kafka's own bundled scripts to stress-test the cluster. During the test you can see where the bottleneck is (CPU, memory, or network IO); it is usually network IO that saturates first.

kafka-consumer-perf-test.sh

kafka-producer-perf-test.sh

# Both scripts are under /opt/module/kafka_2.11-2.4.1/bin.
# Write to the test topic: 100-byte records, 100000 records in total, no throughput limit
# (write as fast as possible), against the given kafka cluster.
# The test topic should have only one partition, because the goal is to measure
# single-partition write throughput; if write throughput later becomes a problem,
# adding partitions scales it roughly linearly.
kafka-producer-perf-test.sh  --topic test --record-size 100 --num-records 100000 --throughput -1 --producer-props bootstrap.servers=hadoop101:9092,hadoop102:9092,hadoop103:9092
# Test results
WARN [Producer clientId=producer-1] Error while fetching metadata with correlation id 1 : {test=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
100000 records sent, 62189.054726 records/sec (5.93 MB/sec), 273.35 ms avg latency, 543.00 ms max latency, 314 ms 50th, 440 ms 95th, 447 ms 99th, 449 ms 99.9th.
# Throughput is 5.93 MB/sec, the average write latency is 273.35 ms, and the maximum latency is 543.00 ms.
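
As a quick sanity check on those numbers (my own back-of-the-envelope arithmetic, not part of the tool's output): 62189 records/sec at 100 bytes each is about 6.22 MB/s decimal, which comes out to 5.93 when divided by 1024*1024, matching the reported figure.

# 62189.054726 rec/s * 100 bytes per record, expressed in MiB/s
echo "scale=2; 62189.054726 * 100 / 1024 / 1024" | bc   # prints 5.93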

(b) Consumer Stress Test

For the consumer test: if none of the four resources (IO, CPU, memory, network) can be increased, consider adding partitions to improve throughput.

kafka-consumer-perf-test.sh --broker-list hadoop101:9092,hadoop102:9092,hadoop103:9092 --topic test --fetch-size 10000 --messages 500000 --threads 1
# Parameter description
--broker-list   the Kafka broker connection string
--topic         the topic name
--fetch-size    the amount of data fetched per request
--messages      the total number of messages to consume
--threads       the number of consumer threads
# Test results
start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec, rebalance.time.ms, fetch.time.ms, fetch.MB.sec, fetch.nMsg.sec
2022-06-09 14:31:17:344, 2022-06-09 14:31:22:899, 47.6837, 8.5839, 500000, 90009.0009, 1654756277762, -1654756272207, -0.0000, -0.0003

(c) Calculating the Number of Kafka Machines

Number of Kafka machines (rule of thumb) = 2 x (peak production speed x replication factor / 100) + 1, with the peak production speed in MB/s.

Once you know the peak production speed and have chosen a replication factor, you can estimate how many Kafka machines to deploy.

For example, suppose the peak production speed is 50 MB/s and the replication factor is 2.

Number of Kafka machines = 2 x (50 x 2 / 100) + 1 = 3
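
The same arithmetic as a shell one-liner, just re-checking the worked example (bash integer arithmetic is enough here):

# 2 * (peak 50 MB/s * 2 replicas / 100) + 1
echo $(( 2 * (50 * 2 / 100) + 1 ))   # prints 3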

(d) Calculating the Number of Kafka Partitions

Create a topic with only 1 partition.

Measure the producer throughput and the consumer throughput of this topic.

Suppose they are Tp and Tc respectively, in MB/s.

Let the total target throughput be Tt; then the number of partitions = Tt / min(Tp, Tc).

For example: producer throughput = 20 MB/s, consumer throughput = 50 MB/s, target throughput = 100 MB/s;

Number of partitions = 100 / 20 = 5

https://blog.csdn.net/weixin_42641909/article/details/89294698

The number of partitions is generally set to 3-10.
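
Putting (a), (b) and (d) together, a rough measurement workflow might look like this (the topic name test and the record counts are just the values used above; adjust them to your own data):

# 1. create a single-partition topic so per-partition throughput can be measured
kafka-topics.sh --create --topic test --bootstrap-server hadoop101:9092 --partitions 1 --replication-factor 3
# 2. measure producer throughput Tp (the MB/sec figure in the output)
kafka-producer-perf-test.sh --topic test --record-size 100 --num-records 100000 --throughput -1 --producer-props bootstrap.servers=hadoop101:9092,hadoop102:9092,hadoop103:9092
# 3. measure consumer throughput Tc (the MB.sec column of the output)
kafka-consumer-perf-test.sh --broker-list hadoop101:9092,hadoop102:9092,hadoop103:9092 --topic test --fetch-size 10000 --messages 100000 --threads 1
# 4. partitions = target throughput Tt / min(Tp, Tc)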

3 Flume Installation

(1) Cluster Plan

Server hadoop101          Server hadoop102    Server hadoop103
Flume (log collection)    Flume               Flume

(2) Installation Steps

# Prepare the installation package, upload it, and extract it
tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /opt/module/
mv apache-flume-1.9.0-bin/ flume-1.9.0
cd flume-1.9.0/lib/
# Resolve a jar conflict by removing the bundled guava jar
rm -rf guava-11.0.2.jar 
# Configure environment variables
sudo vim /etc/profile.d/my_env.sh 

#FLUME_HOME
export FLUME_HOME=/opt/module/flume-1.9.0
export PATH=$PATH:$FLUME_HOME/bin

# Distribute the environment variables, then log out and reconnect

(3) Flume Component Selection

(a) Source

Advantages of Taildir Source over Exec Source and Spooling Directory Source:

TailDir Source: supports resuming from a saved position (breakpoint resume) and multiple directories. In Flume 1.6 and earlier you had to write a custom Source that recorded each file's read position to get breakpoint resume.

Exec Source: can collect data in real time, but data is lost if Flume is not running or the shell command fails.

Spooling Directory Source: monitors a directory and supports breakpoint resume.

(b) Channel

Use the Kafka Channel, which removes the need for a Sink and improves efficiency. The Kafka Channel stores its data in Kafka, so the data is persisted on disk (see examples (7)-(9) below).

(4) Kafka Source Example

Official documentation: https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html

Reads data from kafka.

kafkaSource – Memory Channel – Logger Sink

a1.sources = r1
a1.channels = c1 
a1.sinks = k1

a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = hadoop101:9092,hadoop102:9092,hadoop103:9092
a1.sources.r1.kafka.topics = hello
a1.sources.r1.kafka.consumer.group.id = flume
a1.sources.r1.useFlumeEventFormat = false

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

# Run the following
cd /opt/module/flume-1.9.0/
mkdir jobs
cd jobs/
vim kafkasource.conf
# Paste the config above into the file, save and quit, then run
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/kafkasource.conf -n a1 -Dflume.root.logger=INFO,console
# In another terminal, start a console producer
kafka-console-producer.sh --topic hello --broker-list hadoop101:9092
# Type some messages and check that the agent's console (logger sink) prints them

(5) Kafka Sink Example

Writes data into kafka.

netcat Source – Memory Channel – kafka Sink

a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = hadoop101:9092,hadoop102:9092,hadoop103:9092
a1.sinks.k1.kafka.topic = hello
a1.sinks.k1.kafka.producer.acks = -1
a1.sinks.k1.useFlumeEventFormat = false

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# Save the config above as $FLUME_HOME/jobs/kafkasink.conf, then start the agent in terminal 0
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/kafkasink.conf -n a1 -Dflume.root.logger=INFO,console
# In a second terminal (terminal 1), run
nc localhost 6666
# In a third terminal (terminal 2), start a console consumer
kafka-console-consumer.sh --topic hello --bootstrap-server hadoop101:9092
# Type some messages in terminal 1 and check that terminal 2 consumes them

(6) Kafka Sink (sending data to multiple topics)

Routes messages to different topics depending on a value in the event header. The Kafka Sink sends an event to the topic named in its topic header when one is present, overriding the configured kafka.topic, so it is the interceptor that decides the destination. The HostInterceptor configured below only adds a host header, so for real multi-topic routing you would swap in a custom interceptor that sets the topic header; a test-run sketch follows the config.

netcat source – interceptor – replicating channel selector – memory channel – kafka sink

a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.HostInterceptor$Builder

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = hadoop101:9092,hadoop102:9092,hadoop103:9092
a1.sinks.k1.kafka.topic = other
a1.sinks.k1.kafka.producer.acks = -1
a1.sinks.k1.useFlumeEventFormat = false

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
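
The original stops at the config; a test run would look much like example (5). The conf file name kafkasink-multitopic.conf below is my own placeholder, and the consumer watches the default topic other from the config (a custom interceptor would steer selected events elsewhere):

# terminal 0: start the agent (conf file name assumed, not from the original)
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/kafkasink-multitopic.conf -n a1 -Dflume.root.logger=INFO,console
# terminal 1: send a few test lines
nc localhost 6666
# terminal 2: events without a topic header override land in the default topic "other"
kafka-console-consumer.sh --topic other --bootstrap-server hadoop101:9092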

(7) Kafka Channel Example 1 – common

Uses the kafka channel as an ordinary channel.

netcat source – kafka channel – logger sink

a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666

a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = hadoop101:9092,hadoop102:9092,hadoop103:9092
a1.channels.c1.kafka.topic = hello
a1.channels.c1.parseAsFlumeEvent = false

a1.sinks.k1.type = logger

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# Save the config above as $FLUME_HOME/jobs/kafkachannel.conf
vim kafkachannel.conf
# Start the agent
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/kafkachannel.conf -n a1 -Dflume.root.logger=INFO,console
# In another terminal, send some messages; the agent's logger sink should print them
nc localhost 6666
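
Because parseAsFlumeEvent = false, the channel stores each event body in the hello topic as plain text, so the same data can also be checked directly with a console consumer (the command already used in the earlier examples):

kafka-console-consumer.sh --topic hello --bootstrap-server hadoop101:9092 --from-beginning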

(8) Kafka Channel Example 2 – no source

Reads data from kafka.

kafka channel – logger sink

a1.channels = c1
a1.sinks = k1

a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = hadoop101:9092,hadoop102:9092,hadoop103:9092
a1.channels.c1.kafka.topic = hello
a1.channels.c1.parseAsFlumeEvent = false

a1.sinks.k1.type = logger

a1.sinks.k1.channel = c1
# Save the config above as $FLUME_HOME/jobs/kafkachannel-nosource.conf
vim kafkachannel-nosource.conf
# Start the agent
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/kafkachannel-nosource.conf -n a1 -Dflume.root.logger=INFO,console
# In another terminal, start a console producer and type some messages;
# the agent's logger sink should print them
kafka-console-producer.sh --broker-list hadoop101:9092 --topic hello

(9) Kafka Channel Example 3 – no sink

Writes data into kafka.

netcat source – kafka channel

a1.sources = r1
a1.channels = c1

a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666

a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = hadoop101:9092,hadoop102:9092,hadoop103:9092
a1.channels.c1.kafka.topic = hello
a1.channels.c1.parseAsFlumeEvent = false

a1.sources.r1.channels = c1
# Save the config above as $FLUME_HOME/jobs/kafkachannel-nosink.conf
vim kafkachannel-nosink.conf
# Start the agent
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/kafkachannel-nosink.conf -n a1 -Dflume.root.logger=INFO,console
# In another terminal, send some messages
nc localhost 6666
# In a third terminal, consume from the hello topic to verify the data arrived
kafka-console-consumer.sh --topic hello --bootstrap-server hadoop101:9092
