[TIMEOUT_CLEAN_QUEUE]broker busy, start flow control for a while, period in queue: 1008ms, size of queue: 0
org.apache.rocketmq.remoting.exception.RemotingTooMuchRequestException: sendDefaultImpl call timeout
Most MQ messages are produced and sent normally, but more than 200 sends fail with the exceptions above every day.
2 Namesvr
2 Master
2 Slave
SYNC_MASTER, ASYNC_FLUSH
4C8G Linux
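The MQ JVM heap is healthy and garbage collection is normal, with no Full GC.
CPU, memory, and disk IO all look fairly normal.
broker.log shows ReadSocketService and WriteSocketService being periodically and frequently restarted by AcceptSocketService.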
2019-12-18 03:15:59 INFO AcceptSocketService - Try to start service thread:ReadSocketService started:false lastThread:null
2019-12-18 03:15:59 INFO AcceptSocketService - Try to start service thread:WriteSocketService started:false lastThread:null
2019-12-18 03:15:59 INFO WriteSocketService - makestop thread WriteSocketService
2019-12-18 03:15:59 INFO WriteSocketService - makestop thread ReadSocketService
2019-12-18 03:16:24 INFO ReadSocketService - makestop thread ReadSocketService
2019-12-18 03:16:24 INFO ReadSocketService - makestop thread WriteSocketService
2019-12-18 03:16:24 INFO AcceptSocketService - Try to start service thread:ReadSocketService started:false lastThread:null
2019-12-18 03:16:24 INFO AcceptSocketService - Try to start service thread:WriteSocketService started:false lastThread:null
2019-12-18 03:16:24 INFO WriteSocketService - makestop thread WriteSocketService
2019-12-18 03:16:24 INFO WriteSocketService - makestop thread ReadSocketService
2019-12-18 03:16:25 INFO BrokerControllerScheduledThread1 - dispatch behind commit log 0 bytes
2019-12-18 03:16:25 INFO BrokerControllerScheduledThread1 - Slave fall behind master: 0 bytes
2019-12-18 03:16:25 INFO brokerOutApi_thread_1 - register broker[0]to name server rocketmq-nameserver-a:9876 OK
2019-12-18 03:16:25 INFO brokerOutApi_thread_2 - register broker[0]to name server rocketmq-nameserver-b:9876 OK
2019-12-18 03:16:49 INFO ReadSocketService - makestop thread ReadSocketService
2019-12-18 03:16:49 INFO ReadSocketService - makestop thread WriteSocketService
2019-12-18 03:16:49 INFO AcceptSocketService - Try to start service thread:ReadSocketService started:false lastThread:null
2019-12-18 03:16:49 INFO AcceptSocketService - Try to start service thread:WriteSocketService started:false lastThread:null
The MQ log store.log shows that the master-slave connection expires frequently:
2019-12-30 13:01:51 WARN HAClient - HAClient, housekeeping, found this connection[10.0.74.167:10912] expired, 1577682111345
2019-12-30 13:01:51 WARN HAClient - HAClient, master not response some time, so close connection
2019-12-30 13:02:16 INFO HAClient - HAClient, processReadEvent read socket < 0
2019-12-30 13:02:16 WARN HAClient - HAClient, housekeeping, found this connection[10.0.74.167:10912] expired, 1577682136362
2019-12-30 13:02:16 WARN HAClient - HAClient, master not response some time, so close connection
2019-12-30 13:02:41 INFO HAClient - HAClient, processReadEvent read socket < 0
2019-12-30 13:02:41 WARN HAClient - HAClient, housekeeping, found this connection[10.0.74.167:10912] expired, 1577682161382
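These classes are the service classes of master-slave HA, which is based on long-lived TCP connections. I suspected that the system network configuration was causing frequent TCP disconnects, so I checked the network configuration.
First, check the firewall and confirm that it is disabled: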
[root@rocketmq-slave-c bin]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:firewalld(1)
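The broker IP configuration and similar settings are all correct; otherwise it would not be only a small fraction of sends that fail.
I suspected a network problem: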
cat /proc/net/sockstat
sockets: used 141
TCP: inuse 35 orphan 0 tw 85 alloc 37 mem 10
UDP: inuse 0 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
There are quite a few connections in TIME_WAIT (tw 85), so I went on to check the network-related kernel parameters:
#redisserver
#vm.overcommit_ratio = 10
#vm.overcommit_memory = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
##Maximum size of a single shared memory segment: physical memory in GB * 1024*1024*1024 - 1. The value below is for 32 GB of physical memory
#kernel.shmmax = 34359738368
###Total number of shared memory pages that can be used
#kernel.shmall = 4294967296
net.ipv4.neigh.default.gc_stale_time = 120
net.ipv4.conf.default.arp_announce = 2
net.ipv4.conf.all.arp_announce = 2
net.ipv4.conf.lo.arp_announce = 2
###Memory resource usage settings
net.core.wmem_default = 8388608
net.core.rmem_default = 8388608
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 65536 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_mem = 8388608 8388608 8388608
##DDoS mitigation: TCP connection establishment settings
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_synack_retries = 1
net.ipv4.tcp_syn_retries = 1
net.ipv4.tcp_max_syn_backlog = 262144
##Mitigation for excessive TIME_WAIT: TCP connection teardown settings
net.ipv4.tcp_max_tw_buckets = 10000
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_fin_timeout = 5
net.ipv4.ip_local_port_range = 4000 65000
###TCP keepalive settings
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 5
###Other TCP tuning
net.core.somaxconn = 22144
net.core.netdev_max_backlog = 262144
net.ipv4.tcp_max_orphans = 3276800
net.ipv4.tcp_sack = 1
net.ipv4.tcp_window_scaling = 1
###Number of open files
fs.file-max=655350
vm.swappiness=10
kernel.sysrq = 1
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
kernel.numa_balancing = 0
kernel.shmmax = 68719476736
###Avoid amplification attacks
net.ipv4.icmp_echo_ignore_broadcasts=1
###Enable protection against bogus ICMP error messages
net.ipv4.icmp_ignore_bogus_error_responses=1
###Enable SYN flood protection
net.ipv4.tcp_syncookies=1
Judging from these parameters, no setting would cause frequent disconnects between master and slave, which is odd.
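RocketMQ master-slave mechanism
1. The slave actively opens a TCP connection to the master, then every 5s sends the master the maximum offset of its commitLog file in order to pull messages that have not been synchronized yet;
2. The master opens a listening port and listens for data sent by the slave. It parses the offset received from the slave, looks up the messages that have not been synchronized, and returns them to the slave;
3. After the slave receives the messages from the master, it writes the batch to its commitLog file, updates its commitLog pull offset, and then continues to pull unsynchronized messages from the master.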
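The three steps above can be sketched as a small framing example. This is a minimal sketch, not the RocketMQ implementation: the class and method names (HaFrameSketch, encodeReport, encodeTransfer, applyTransfer) are hypothetical, and the layout (an 8-byte offset from the slave, and an 8-byte start offset plus a 4-byte body size followed by the body from the master) reflects my reading of the 4.x HAService source, so treat it as an assumption and verify it against your version.

```java
import java.nio.ByteBuffer;

public class HaFrameSketch {
    static final int REPORT_SIZE = 8;     // slave -> master: max commitLog offset (assumed layout)
    static final int HEADER_SIZE = 8 + 4; // master -> slave: start offset + body size (assumed layout)

    /** Slave side: encode the offset report sent to the master (step 1). */
    static ByteBuffer encodeReport(long slaveMaxOffset) {
        ByteBuffer buf = ByteBuffer.allocate(REPORT_SIZE);
        buf.putLong(slaveMaxOffset);
        buf.flip();
        return buf;
    }

    /** Master side: encode a transfer frame (step 2); an empty body acts as a heartbeat. */
    static ByteBuffer encodeTransfer(long startOffset, byte[] body) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_SIZE + body.length);
        buf.putLong(startOffset).putInt(body.length).put(body);
        buf.flip();
        return buf;
    }

    /** Slave side: consume a frame and return the next offset to report (step 3). */
    static long applyTransfer(ByteBuffer frame, long slaveMaxOffset) {
        long startOffset = frame.getLong();
        int bodySize = frame.getInt();
        if (startOffset != slaveMaxOffset) {
            throw new IllegalStateException(
                "master offset " + startOffset + " != slave offset " + slaveMaxOffset);
        }
        // A real slave would append these bytes to its commitLog; here they are skipped.
        frame.position(frame.position() + bodySize);
        return slaveMaxOffset + bodySize;
    }

    public static void main(String[] args) {
        long offset = 1024L;
        ByteBuffer report = encodeReport(offset);
        ByteBuffer frame = encodeTransfer(report.getLong(), new byte[] {1, 2, 3});
        offset = applyTransfer(frame, offset);
        System.out.println("next offset to report: " + offset);
    }
}
```

In this sketch a heartbeat is simply a transfer frame with a zero-length body, which leaves the slave's offset unchanged.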
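Slave -> Master process
The HA logic can be roughly divided into two processes: the slave reporting its offset, and the master sending unsynchronized messages to the slave.
The logic by which the slave reports its offset to the master lives in the HAClient class. HAClient extends ServiceThread, so it is a thread service class: after the Broker starts, it spawns a thread that periodically reports the slave's offset to the master.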
org.apache.rocketmq.store.ha.HAService.HAClient#run:
public void run() {
    log.info(this.getServiceName() + " service started");
    while (!this.isStopped()) {
        try {
            // Actively connect to the master and obtain the socketChannel
            if (this.connectMaster()) {
                if (this.isTimeToReportOffset()) {
                    // Report the current offset to the master
                    boolean result = this.reportSlaveMaxOffset(this.currentReportedOffset);
                    if (!result) {
                        this.closeMaster();
                    }
                }
                // Wait up to one second for readable data
                this.selector.select(1000);
                // Handle the data sent by the master
                boolean ok = this.processReadEvent();
                if (!ok) {
                    this.closeMaster();
                }
                // ......
            } else {
                this.waitForRunning(1000 * 5);
            }
        } catch (Exception e) {
            log.warn(this.getServiceName() + " service has exception. ", e);
            this.waitForRunning(1000 * 5);
        }
    }
    log.info(this.getServiceName() + " service end");
}
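From the above: the slave actively connects to the master and, after 5s, sends its current sync progress; on receiving it, the master sends sync data to the slave; in the end the master closes the connection on its own initiative because the slave's connection has expired.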
2018-10-22 15:44:07 WARN HAClient - HAClient, housekeeping, found this connection[10.38.33.22:10912] expired, 1540194247979
2018-10-22 15:44:07 WARN HAClient - HAClient, master not response some time, so close connection
2018-10-22 15:44:33 INFO HAClient - HAClient, processReadEvent read socket < 0
From the above, the slave also concludes that its connection to the master has timed out, and closes the connection.
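At this point things were very confusing: master and slave both appear to close the connection on their own initiative! Reading the code, the master's sync logic is clear and matches its logs. The slave's sync code and its logs do not match. The core of the slave's sync code compares lastWriteTimestamp with the current time to decide that the sync connection to the master has expired and that the master is not responding. lastWriteTimestamp is modified when a message is received from the master, when the connection is created, and when the connection is closed: closing resets it to 0, the other cases set it to the current time. The log shows a huge difference between lastWriteTimestamp and the current time, 1540194247979, which means lastWriteTimestamp was actually 0, i.e. the connection had already been closed when the check ran. So it is basically certain that the master closed the connection, and a tcpdump capture later confirmed this. The log message deserves a complaint here: the slave's connection was closed by its peer, yet the log says the master is not responding, when in fact the master sent 5 sync messages in total before closing.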
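Having confirmed that the master closes the connection, the next question is why the slave does not tell the master its progress. The logs were of no further help. The code contains a while(true) fragment, so I wondered whether it was an infinite loop, but jstack showed that it was not. Finally, by adding log statements by hand, I found that the slave does receive every sync message the master sends and then resets lastWriteTimestamp to the current time; by coincidence, every time isTimeToReportOffset checks whether the sync progress should be sent, the result is false. This is because the master-to-slave sync interval and the slave-to-master report interval both default to 5s: after handling the master's sync message the slave resets lastWriteTimestamp to the current time, so the report condition is never met, which produces the behavior above.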
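The starvation described above can be reproduced with a toy timeline. This is a simplified model under stated assumptions, not RocketMQ code: the names (HaHeartbeatSimulation, countSlaveReports, POLL_MS) are hypothetical, the clock is idealized with no jitter (so the comparison is inclusive here and the real check may differ in detail), the 1s poll mirrors the select(1000) call in run(), and the 20s window is an assumed stand-in for the master's expiry timeout.

```java
public class HaHeartbeatSimulation {
    static final long POLL_MS = 1000; // mirrors selector.select(1000) in HAClient#run

    /**
     * Counts how many offset reports the slave sends within durationMs when the
     * master sends data every masterIntervalMs and the slave reports only after
     * slaveIntervalMs without any activity on the connection.
     */
    static int countSlaveReports(long masterIntervalMs, long slaveIntervalMs, long durationMs) {
        long now = 0;
        long lastWriteTimestamp = 0;            // slave-side timestamp, reset on any activity
        long nextHeartbeat = masterIntervalMs;  // arrival time of the next master message
        int reports = 0;
        while (now < durationMs) {
            // Top of the loop: the isTimeToReportOffset-style check.
            if (now - lastWriteTimestamp >= slaveIntervalMs) {
                reports++;
                lastWriteTimestamp = now;
            }
            // Block until data arrives or the poll times out, whichever is first.
            now = Math.min(now + POLL_MS, nextHeartbeat);
            if (now == nextHeartbeat) {
                // Receiving the master's message resets the timestamp.
                lastWriteTimestamp = now;
                nextHeartbeat += masterIntervalMs;
            }
        }
        return reports;
    }

    public static void main(String[] args) {
        System.out.println("master 5s, slave 5s: " + countSlaveReports(5000, 5000, 20000));
        System.out.println("master 5s, slave 4s: " + countSlaveReports(5000, 4000, 20000));
    }
}
```

With equal intervals the master's message always arrives just as the slave's timer would have fired, so the slave never reports within the window; once the slave's interval is shorter than the master's, reports get through.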
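Frequent disconnects obviously hurt sync efficiency. One solution is to make the master's sync interval longer than the slave's, for example 3s for the slave and 5s for the master.
Modify the heartbeat configuration of the slave broker: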
./mqadmin updateBrokerConfig -b rocketmq-slave-a:10911 -c rocketmq-cluster -n 'rocketmq-nameserver-a:9876;rocketmq-nameserver-b:9876' -k haSendHeartbeatInterval -v 4000
Running the command fails with the following error:
[root@rocketmq-slave-c bin]# sh mqadmin updateBrokerConfig -b rocketmq-slave-c:10911 -c rocketmq-cluster -n 'rocketmq-nameserver-a:9876;rocketmq-nameserver-b:9876' -k haSendHeartbeatInterval -v 4000
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
org.apache.rocketmq.tools.command.SubCommandException: UpdateBrokerConfigSubCommand command failed
at org.apache.rocketmq.tools.command.broker.UpdateBrokerConfigSubCommand.execute(UpdateBrokerConfigSubCommand.java:105)
at org.apache.rocketmq.tools.command.MQAdminStartup.main0(MQAdminStartup.java:135)
at org.apache.rocketmq.tools.command.MQAdminStartup.main(MQAdminStartup.java:86)
Caused by: org.apache.rocketmq.acl.common.AclException: [10015:signature-failed] unable to calculate a request signature. error=[10015:signature-failed] unable to calculate a request signature. error=Algorithm HmacSHA1 not available
at org.apache.rocketmq.acl.common.AclSigner.signAndBase64Encode(AclSigner.java:84)
at org.apache.rocketmq.acl.common.AclSigner.calSignature(AclSigner.java:73)
at org.apache.rocketmq.acl.common.AclSigner.calSignature(AclSigner.java:68)
at org.apache.rocketmq.acl.common.AclUtils.calSignature(AclUtils.java:58)
at org.apache.rocketmq.acl.common.AclClientRPCHook.doBeforeRequest(AclClientRPCHook.java:44)
at org.apache.rocketmq.remoting.netty.NettyRemotingAbstract.doBeforeRpcHooks(NettyRemotingAbstract.java:172)
at org.apache.rocketmq.remoting.netty.NettyRemotingClient.invokeSync(NettyRemotingClient.java:370)
at org.apache.rocketmq.client.impl.MQClientAPIImpl.updateBrokerConfig(MQClientAPIImpl.java:1144)
at org.apache.rocketmq.tools.admin.DefaultMQAdminExtImpl.updateBrokerConfig(DefaultMQAdminExtImpl.java:166)
at org.apache.rocketmq.tools.admin.DefaultMQAdminExt.updateBrokerConfig(DefaultMQAdminExt.java:148)
at org.apache.rocketmq.tools.command.broker.UpdateBrokerConfigSubCommand.execute(UpdateBrokerConfigSubCommand.java:81)
... 2 more
Caused by: org.apache.rocketmq.acl.common.AclException: [10015:signature-failed] unable to calculate a request signature. error=Algorithm HmacSHA1 not available
at org.apache.rocketmq.acl.common.AclSigner.sign(AclSigner.java:63)
at org.apache.rocketmq.acl.common.AclSigner.signAndBase64Encode(AclSigner.java:79)
... 12 more
Caused by: java.security.NoSuchAlgorithmException: Algorithm HmacSHA1 not available
at javax.crypto.Mac.getInstance(Mac.java:181)
at org.apache.rocketmq.acl.common.AclSigner.sign(AclSigner.java:57)
... 13 more
Solution:
vim /home/H/rocketmq-all-4.4.0/distribution/target/apache-rocketmq/bin/tools.sh
Append the absolute path of the JDK's ext directory after ${JAVA_HOME}/jre/lib/ext:
JAVA_OPT="${JAVA_OPT} -server -Xms1g -Xmx1g -Xmn256m -XX:PermSize=128m -XX:MaxPermSize=128m"
JAVA_OPT="${JAVA_OPT} -Djava.ext.dirs=${BASE_DIR}/lib:${JAVA_HOME}/jre/lib/ext:/usr/java/jdk1.8.0_201-amd64/jre/lib/ext"
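To check whether a given java.ext.dirs setting lets the JVM find the algorithm, a small probe can help. On Java 8, HmacSHA1 comes from the SunJCE provider, whose jar lives in jre/lib/ext, so overriding java.ext.dirs with a path that does not resolve to that directory (for example when JAVA_HOME is not set in the shell) is a likely cause of the error above. The class and method names below are hypothetical:

```java
import java.security.NoSuchAlgorithmException;
import javax.crypto.Mac;

public class MacAlgorithmCheck {
    /** Returns true if some installed security provider offers the given Mac algorithm. */
    static boolean isMacAvailable(String algorithm) {
        try {
            Mac.getInstance(algorithm);
            return true;
        } catch (NoSuchAlgorithmException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Print the effective extension directories alongside the probe result.
        System.out.println("java.ext.dirs = " + System.getProperty("java.ext.dirs"));
        System.out.println("HmacSHA1 available: " + isMacAvailable("HmacSHA1"));
    }
}
```

Running it on Java 8 with the same -Djava.ext.dirs value that tools.sh builds shows whether the provider is visible before retrying mqadmin.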
Run it again; this time it succeeds.
[root@rocketmq-slave-c bin]# sh mqadmin updateBrokerConfig -b rocketmq-slave-c:10911 -c rocketmq-cluster -n 'rocketmq-nameserver-a:9876;rocketmq-nameserver-b:9876' -k haSendHeartbeatInterval -v 4000
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
update broker config success, rocketmq-slave-c:10911
MQ no longer produces the exceptions.
References:
https://cloud.tencent.com/developer/article/1512981
https://www.jianshu.com/p/f8a7ae95bc58
http://zjykzk.github.io/post/cs/rocketmq/slave-sync-from-master-disconnect/