| | Server hadoop101 | Server hadoop102 | Server hadoop103 |
|---|---|---|---|
| Zookeeper | Zookeeper | Zookeeper | Zookeeper |
# Prepare the installation package, upload it, and extract it
tar -zxvf apache-zookeeper-3.5.7-bin.tar.gz -C /opt/module/
# Rename the directory
mv apache-zookeeper-3.5.7-bin/ zookeeper-3.5.7
# Add the environment variables (to /etc/profile.d/my_env.sh)
#ZOOKEEPER_HOME
export ZOOKEEPER_HOME=/opt/module/zookeeper-3.5.7
export PATH=$PATH:$ZOOKEEPER_HOME/bin
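# Reload the profile so the new variables take effect in the current shell (this assumes they were added to /etc/profile.d/my_env.sh, the file distributed later in this guide); disconnecting and reconnecting works as well
source /etc/profile.d/my_env.sh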
# Typing zkServer and pressing Tab should auto-complete the command, which confirms the environment variables are configured correctly
cd /opt/module/zookeeper-3.5.7/conf/
# Rename the sample configuration file
mv zoo_sample.cfg zoo.cfg
# Create the zkData directory
mkdir /opt/module/zookeeper-3.5.7/zkData
# Create the myid file in the zkData directory
touch myid
# Put the number 1 in the file (hadoop102 uses 2, hadoop103 uses 3)
1
-------------------------------------------------
# Edit the configuration file
vim zoo.cfg
# Change dataDir to
dataDir=/opt/module/zookeeper-3.5.7/zkData
# Configure the cluster servers; append the following at the end of the file (port 2888 is used for follower-to-leader communication, 3888 for leader election)
server.1=hadoop101:2888:3888
server.2=hadoop102:2888:3888
server.3=hadoop103:2888:3888
--------------------------------------------------
# Distribute zookeeper to the other nodes
xsync zookeeper-3.5.7/
# On hadoop102, edit the myid file under /opt/module/zookeeper-3.5.7/zkData/
# and change the number to 2; likewise set it to 3 on hadoop103
# Distribute the environment variables
scp /etc/profile.d/my_env.sh root@hadoop102:/etc/profile.d/
scp /etc/profile.d/my_env.sh root@hadoop103:/etc/profile.d/
# Disconnect and reconnect so the environment variables take effect
# Start zookeeper
# Run the following commands on hadoop101, hadoop102, and hadoop103
zkServer.sh start
zkServer.sh status
zkServer.sh stop
# Write a cluster control script for zookeeper
vim zk.sh
------------------------------------------------------------
# Add the following content
#!/bin/bash
if [ $# -lt 1 ]
then
echo "USAGE: zk.sh {start|stop|status}"
exit
fi
case $1 in
"start"){
for i in hadoop101 hadoop102 hadoop103
do
echo ---------- zookeeper $i start ------------
ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh start"
done
};;
"stop"){
for i in hadoop101 hadoop102 hadoop103
do
echo ---------- zookeeper $i stop ------------
ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh stop"
done
};;
"status"){
for i in hadoop101 hadoop102 hadoop103
do
echo ---------- zookeeper $i status ------------
ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh status"
done
};;
*)
echo "USAGE: zk.sh {start|stop|status}"
exit
;;
esac
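# The script needs execute permission before it can be run (a sketch mirroring the chmod step used for the kafka script below); run it from the directory it lives in, or put it somewhere on the PATH
chmod u+x zk.sh
./zk.sh start
./zk.sh status
./zk.sh stop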
| | Server hadoop101 | Server hadoop102 | Server hadoop103 |
|---|---|---|---|
| Kafka | Kafka | Kafka | Kafka |
# Prepare the installation package, upload it, and extract it
tar -zxvf kafka_2.11-2.4.1.tgz -C /opt/module/
# Create a directory for log data under the kafka directory
[hike@hadoop101 kafka_2.11-2.4.1]$ mkdir datas
# Edit the configuration file server.properties under the config/ directory
vim server.properties
-----------------------------
broker.id=1   # use 2 on hadoop102 and 3 on hadoop103
log.dirs=/opt/module/kafka_2.11-2.4.1/datas
zookeeper.connect=hadoop101:2181,hadoop102:2181,hadoop103:2181/kafka
-----------------------------
# Distribute kafka
xsync kafka_2.11-2.4.1/
# Configure the environment variables
sudo vim /etc/profile.d/my_env.sh
#KAFKA_HOME
export KAFKA_HOME=/opt/module/kafka_2.11-2.4.1
export PATH=$PATH:$KAFKA_HOME/bin
# Distribute the environment variables
sudo ~/bin/xsync /etc/profile.d/my_env.sh
# Modify broker.id on hadoop102 and hadoop103 as noted above
# Start kafka on hadoop101, hadoop102, and hadoop103; note that zookeeper must be started before kafka
kafka-server-start.sh -daemon /opt/module/kafka_2.11-2.4.1/config/server.properties
# Open the zookeeper client to inspect the kafka metadata
zkCli.sh
ls /
ls /kafka
ls /kafka/brokers
ls /kafka/brokers/ids
get /kafka/brokers/ids/1
get /kafka/controller
# Stop kafka
kafka-server-stop.sh
# Write a cluster control script for kafka
vim kafka
------------------------------------------------------------
# Add the following content
#!/bin/bash
if [ $# -lt 1 ]
then
echo "USAGE: zk.sh {start|stop|status}"
exit
fi
case $1 in
"start"){
for i in hadoop101 hadoop102 hadoop103
do
echo ---------- kafka $i start ------------
ssh $i "/opt/module/kafka_2.11-2.4.1/bin/kafka-server-start.sh -daemon /opt/module/kafka_2.11-2.4.1/config/server.properties"
done
};;
"stop"){
for i in hadoop101 hadoop102 hadoop103
do
echo ---------- kafka $i stop ------------
ssh $i "/opt/module/kafka_2.11-2.4.1/bin/kafka-server-stop.sh"
done
};;
*)
echo "USAGE: kafka.sh {start|stop}"
exit
;;
esac
------------------------------------------------------------------
# Make the script executable
chmod u+x kafka
# Note: stop kafka before stopping zookeeper
# Topic create/delete/alter/list operations
# List topics
kafka-topics.sh --list --bootstrap-server hadoop101:9092
# Create a topic
kafka-topics.sh --create --topic hello --bootstrap-server hadoop101:9092 --partitions 2 --replication-factor 3
# Describe a topic
kafka-topics.sh --describe --bootstrap-server hadoop101:9092 --topic hello
# Delete a topic
kafka-topics.sh --delete --bootstrap-server hadoop101:9092 --topic hello
# Alter a topic (increase the partition count)
kafka-topics.sh --alter --bootstrap-server hadoop101:9092 --topic hello --partitions 6
# Produce and consume messages
# On hadoop101, run
kafka-console-producer.sh --topic hello --broker-list hadoop101:9092
# In another session on hadoop101, run
kafka-console-consumer.sh --topic hello --bootstrap-server hadoop101:9092
# Type some messages in the producer
# Consume from the beginning
kafka-console-consumer.sh --topic hello --bootstrap-server hadoop101:9092 --from-beginning
# Consume as part of a consumer group
kafka-console-consumer.sh --topic hello --bootstrap-server hadoop101:9092 --group g1
Use the performance-test scripts shipped with Kafka to stress-test the cluster. During the test you can see where the bottleneck appears (CPU, memory, or network IO); usually the network IO hits its limit first.
kafka-consumer-perf-test.sh
kafka-producer-perf-test.sh
# Both scripts are in the Kafka bin/ directory (/opt/module/kafka_2.11-2.4.1/bin here).
# Write to the test topic: each message is 100 bytes, 100000 messages in total, with no throughput limit (write as fast as possible), against the given kafka cluster.
# The test topic should have a single partition so that single-partition write performance is measured; if write throughput later proves insufficient, partitions can be added, and write throughput scales roughly in proportion to the partition count.
kafka-producer-perf-test.sh --topic test --record-size 100 --num-records 100000 --throughput -1 --producer-props bootstrap.servers=hadoop101:9092,hadoop102:9092,hadoop103:9092
# Test result
WARN [Producer clientId=producer-1] Error while fetching metadata with correlation id 1 : {test=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
100000 records sent, 62189.054726 records/sec (5.93 MB/sec), 273.35 ms avg latency, 543.00 ms max latency, 314 ms 50th, 440 ms 95th, 447 ms 99th, 449 ms 99.9th.
# Throughput is 5.93 MB/sec, the average write latency is 273.35 ms, and the maximum latency is 543.00 ms.
Consumer test: if none of the four resources (IO, CPU, memory, network) can be improved further, consider increasing the partition count to improve performance.
kafka-consumer-perf-test.sh --broker-list hadoop101:9092,hadoop102:9092,hadoop103:9092 --topic test --fetch-size 10000 --messages 500000 --threads 1
# Parameter description
--broker-list    broker connection info for the kafka cluster
--topic          topic name
--fetch-size     amount of data fetched per request
--messages       total number of messages to consume
--threads        number of consumer threads
# Test result
start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec, rebalance.time.ms, fetch.time.ms, fetch.MB.sec, fetch.nMsg.sec
2022-06-09 14:31:17:344, 2022-06-09 14:31:22:899, 47.6837, 8.5839, 500000, 90009.0009, 1654756277762, -1654756272207, -0.0000, -0.0003
Number of Kafka machines (rule of thumb) = 2 × (peak production speed in MB/s × replication factor / 100) + 1
Measure the peak production speed first; combined with the chosen replication factor, this estimates how many Kafka machines to deploy.
For example, with a peak production speed of 50 MB/s and a replication factor of 2:
Number of Kafka machines = 2 × (50 × 2 / 100) + 1 = 3
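A quick shell check of the rule of thumb, using only the sample numbers above (50 MB/s and 2 replicas are example values, not measurements):
peak=50; replicas=2                          # peak production speed in MB/s, replication factor
echo $(( 2 * peak * replicas / 100 + 1 ))    # prints 3 -> three Kafka machines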
Create a topic with a single partition,
then measure that topic's producer throughput and consumer throughput.
Suppose the measured values are Tp and Tc, in MB/s.
If the total target throughput is Tt, then the number of partitions = Tt / min(Tp, Tc).
For example: producer throughput = 20 MB/s, consumer throughput = 50 MB/s, target throughput = 100 MB/s;
number of partitions = 100 / 20 = 5
https://blog.csdn.net/weixin_42641909/article/details/89294698
The partition count is usually set to 3-10.
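The same estimate can be checked in the shell, again using only the example values above:
Tp=20; Tc=50; Tt=100               # producer, consumer, and target throughput in MB/s
min=$(( Tp < Tc ? Tp : Tc ))       # the slower of producer and consumer
echo $(( Tt / min ))               # prints 5 -> five partitions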
| | Server hadoop101 | Server hadoop102 | Server hadoop103 |
|---|---|---|---|
| Flume (log collection) | Flume | Flume | |
# Prepare the installation package, upload it, and extract it
tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /opt/module/
mv apache-flume-1.9.0-bin/ flume-1.9.0
cd flume-1.9.0/lib/
# Remove the old guava jar to resolve the conflict with Hadoop's guava
rm -rf guava-11.0.2.jar
# Configure the environment variables
sudo vim /etc/profile.d/my_env.sh
#FLUME_HOME
export FLUME_HOME=/opt/module/flume-1.9.0
export PATH=$PATH:$FLUME_HOME/bin
# Distribute the environment variables, then disconnect and reconnect
Advantages of Taildir Source over Exec Source and Spooling Directory Source:
TailDir Source: supports resuming from a recorded read position and monitoring multiple directories. Before Flume 1.6 you had to write a custom Source that recorded each file's read offset to get resumable reads. (A minimal config sketch follows this list.)
Exec Source: collects data in real time, but data is lost if Flume is not running or the shell command fails.
Spooling Directory Source: monitors a directory and supports resuming.
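For reference, a minimal Taildir Source snippet (the positionFile and filegroup paths are hypothetical placeholders, not paths used elsewhere in this guide); wire it to a channel the same way as the examples below:
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/module/flume-1.9.0/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/module/applog/log/app.*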
Using a Kafka Channel removes the need for a Sink and improves efficiency. A Kafka Channel stores its data in Kafka, so the data is persisted on disk (see the Kafka Channel examples further below).
Official documentation: https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html
Read data from kafka
Kafka Source – Memory Channel – Logger Sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = hadoop101:9092,hadoop102:9092,hadoop103:9092
a1.sources.r1.kafka.topics = hello
a1.sources.r1.kafka.consumer.group.id = flume
a1.sources.r1.useFlumeEventFormat = false
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# Run the following
cd /opt/module/flume-1.9.0/
mkdir jobs
cd jobs/
vim kafkasource.conf
# Paste the configuration above into the file, save and exit, then run
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/kafkasource.conf -n a1 -Dflume.root.logger=INFO,console
# In another terminal, start a producer
kafka-console-producer.sh --topic hello --broker-list hadoop101:9092
# Type some messages and check that the Flume agent (logger sink) receives them
Write data to kafka
Netcat Source – Memory Channel – Kafka Sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = hadoop101:9092,hadoop102:9092,hadoop103:9092
a1.sinks.k1.kafka.topic = hello
a1.sinks.k1.kafka.producer.acks = -1
a1.sinks.k1.useFlumeEventFormat = false
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# Save the configuration above as $FLUME_HOME/jobs/kafkasink.conf, then start the agent in one terminal
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/kafkasink.conf -n a1 -Dflume.root.logger=INFO,console
# In a second terminal, run
nc localhost 6666
# In a third terminal, start a consumer
kafka-console-consumer.sh --topic hello --bootstrap-server hadoop101:9092
# Type some messages in the second terminal (nc) and check whether the consumer in the third terminal receives them
Send messages to different topics depending on a value in the event header
netcat source – interceptor – replicating channel selector – memory channel – kafka sink
(The Kafka Sink publishes an event to the topic named in its topic header when present, otherwise to the configured kafka.topic; see the note and sketch after this config.)
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = hadoop101:9092,hadoop102:9092,hadoop103:9092
a1.sinks.k1.kafka.topic = other
a1.sinks.k1.kafka.producer.acks = -1
a1.sinks.k1.useFlumeEventFormat = false
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
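The interceptor is what decides the destination topic here: the Kafka Sink sends each event to the topic named in its topic header, falling back to kafka.topic (other, above) when the header is absent. The HostInterceptor above only adds a host header; the interceptor that actually sets the topic header is not shown in these notes, so as a stand-in illustration the same agent could use Flume's built-in Static Interceptor to stamp every event with a fixed topic header (the topic name hello is only an example):
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = topic
a1.sources.r1.interceptors.i1.value = hello
In practice a custom interceptor would inspect the event body and set a different topic header value per event, which is what produces the per-topic routing described above.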
Use the kafka channel as an ordinary channel
netcat source – kafka channel – logger sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = hadoop101:9092,hadoop102:9092,hadoop103:9092
a1.channels.c1.kafka.topic = hello
a1.channels.c1.parseAsFlumeEvent = false
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
vim kafkachannel.conf
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/kafkachannel.conf -n a1 -Dflume.root.logger=INFO,console
# Send some messages
nc localhost 6666
Read data from kafka
kafka channel – logger sink
a1.channels = c1
a1.sinks = k1
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = hadoop101:9092,hadoop102:9092,hadoop103:9092
a1.channels.c1.kafka.topic = hello
a1.channels.c1.parseAsFlumeEvent = false
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
vim kafkachannel-nosource.conf
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/kafkachannel-nosource.conf -n a1 -Dflume.root.logger=INFO,console
kafka-console-producer.sh --broker-list hadoop101:9092 --topic hello
# Type some messages in the producer
Write data to kafka
netcat source – kafka channel
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = hadoop101:9092,hadoop102:9092,hadoop103:9092
a1.channels.c1.kafka.topic = hello
a1.channels.c1.parseAsFlumeEvent = false
a1.sources.r1.channels = c1
vim kafkachannel-nosink.conf
flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/kafkachannel-nosink.conf -n a1 -Dflume.root.logger=INFO,console
nc localhost 6666
kafka-console-consumer.sh --topic hello --bootstrap-server hadoop101:9092