RocketMQ Troubleshooting: RemotingTooMuchRequestException

RocketMQ Troubleshooting

Reported Exceptions

  1. [TIMEOUT_CLEAN_QUEUE]broker busy, start flow control for a while, period in queue: 1008ms, size of queue: 0

  2. org.apache.rocketmq.remoting.exception.RemotingTooMuchRequestException: sendDefaultImpl call timeout
    Most messages are produced and sent normally, but more than 200 sends fail with these exceptions every day.

Environment

2 NameServers
2 Masters
2 Slaves
SYNC_MASTER, ASYNC_FLUSH
4 cores / 8 GB RAM, Linux

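Monitoring Checks

The broker's JVM heap is normal and garbage collection is normal, with no Full GC.
CPU, memory, and disk IO all look fairly normal.

Log Inspection

broker.log shows AcceptSocketService periodically and frequently starting ReadSocketService and WriteSocketService.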


2019-12-18 03:15:59 INFO AcceptSocketService - Try to start service thread:ReadSocketService started:false lastThread:null
2019-12-18 03:15:59 INFO AcceptSocketService - Try to start service thread:WriteSocketService started:false lastThread:null
2019-12-18 03:15:59 INFO WriteSocketService - makestop thread WriteSocketService
2019-12-18 03:15:59 INFO WriteSocketService - makestop thread ReadSocketService
2019-12-18 03:16:24 INFO ReadSocketService - makestop thread ReadSocketService
2019-12-18 03:16:24 INFO ReadSocketService - makestop thread WriteSocketService

2019-12-18 03:16:24 INFO AcceptSocketService - Try to start service thread:ReadSocketService started:false lastThread:null
2019-12-18 03:16:24 INFO AcceptSocketService - Try to start service thread:WriteSocketService started:false lastThread:null
2019-12-18 03:16:24 INFO WriteSocketService - makestop thread WriteSocketService
2019-12-18 03:16:24 INFO WriteSocketService - makestop thread ReadSocketService
2019-12-18 03:16:25 INFO BrokerControllerScheduledThread1 - dispatch behind commit log 0 bytes
2019-12-18 03:16:25 INFO BrokerControllerScheduledThread1 - Slave fall behind master: 0 bytes
2019-12-18 03:16:25 INFO brokerOutApi_thread_1 - register broker[0]to name server rocketmq-nameserver-a:9876 OK
2019-12-18 03:16:25 INFO brokerOutApi_thread_2 - register broker[0]to name server rocketmq-nameserver-b:9876 OK
2019-12-18 03:16:49 INFO ReadSocketService - makestop thread ReadSocketService
2019-12-18 03:16:49 INFO ReadSocketService - makestop thread WriteSocketService
2019-12-18 03:16:49 INFO AcceptSocketService - Try to start service thread:ReadSocketService started:false lastThread:null
2019-12-18 03:16:49 INFO AcceptSocketService - Try to start service thread:WriteSocketService started:false lastThread:null

store.log shows that the master-slave connection expires frequently:

2019-12-30 13:01:51 WARN HAClient - HAClient, housekeeping, found this connection[10.0.74.167:10912] expired, 1577682111345
2019-12-30 13:01:51 WARN HAClient - HAClient, master not response some time, so close connection
2019-12-30 13:02:16 INFO HAClient - HAClient, processReadEvent read socket < 0
2019-12-30 13:02:16 WARN HAClient - HAClient, housekeeping, found this connection[10.0.74.167:10912] expired, 1577682136362
2019-12-30 13:02:16 WARN HAClient - HAClient, master not response some time, so close connection
2019-12-30 13:02:41 INFO HAClient - HAClient, processReadEvent read socket < 0
2019-12-30 13:02:41 WARN HAClient - HAClient, housekeeping, found this connection[10.0.74.167:10912] expired, 1577682161382
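One quick way to see how regular these expirations are is to compute the gap between consecutive housekeeping warnings. The following is only a minimal sketch: the class and method names are made up for illustration, and it assumes every matching line starts with a yyyy-MM-dd HH:mm:ss timestamp, as in the excerpt above.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: measures the spacing of HAClient "expired" warnings in store.log. */
class ExpiryIntervals {
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Returns the gaps, in seconds, between consecutive housekeeping "expired" lines. */
    static List<Long> secondsBetweenExpirations(List<String> lines) {
        List<Long> gaps = new ArrayList<>();
        LocalDateTime previous = null;
        for (String line : lines) {
            // Only the housekeeping warnings mark an expired connection.
            if (line.length() < 19 || !line.contains("housekeeping, found this connection")) {
                continue;
            }
            // The first 19 characters hold the timestamp (assumed log layout).
            LocalDateTime current = LocalDateTime.parse(line.substring(0, 19), TS);
            if (previous != null) {
                gaps.add(Duration.between(previous, current).getSeconds());
            }
            previous = current;
        }
        return gaps;
    }
}
```

Fed the three warnings in the excerpt, this returns gaps of 25 and 25 seconds. Such a steady cycle suggests a timer-driven disconnect rather than random network loss.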

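These classes are service classes for master-slave HA, which runs over long-lived TCP connections. Suspecting that the system network configuration was causing frequent TCP disconnects, I checked the network configuration.
First, check the firewall and confirm that it is disabled: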

[root@rocketmq-slave-c bin]# systemctl status firewalld            
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:firewalld(1)

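The broker IP configuration is also correct; otherwise it would not be only a small fraction of sends that fail.

Still suspecting a network problem: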

cat /proc/net/sockstat 
sockets: used 141
TCP: inuse 35 orphan 0 tw 85 alloc 37 mem 10
UDP: inuse 0 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
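The TCP line of this output is a flat list of name/value pairs, where tw is the number of sockets in TIME_WAIT. As a small illustration of how it can be read programmatically, here is a sketch with a hypothetical helper class (not part of any RocketMQ tooling):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: parses one protocol line of /proc/net/sockstat. */
class SockStat {
    /** Turns e.g. "TCP: inuse 35 orphan 0 tw 85 alloc 37 mem 10" into name -> value pairs. */
    static Map<String, Long> parseLine(String line) {
        Map<String, Long> fields = new LinkedHashMap<>();
        // Drop the "TCP:" style prefix, then read alternating name/value tokens.
        String[] tokens = line.substring(line.indexOf(':') + 1).trim().split("\\s+");
        for (int i = 0; i + 1 < tokens.length; i += 2) {
            fields.put(tokens[i], Long.parseLong(tokens[i + 1]));
        }
        return fields;
    }
}
```

For the TCP line shown above, the parsed map holds inuse = 35 and tw = 85.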

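There are quite a few sockets in TIME_WAIT (tw 85), so the next step is to check the network-related kernel parameters: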

#redisserver 
#vm.overcommit_ratio = 10
#vm.overcommit_memory = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
## Maximum size of a single shared memory segment: physical memory in GB * 1024*1024*1024 - 1; the value below is for 32 GB of physical memory
#kernel.shmmax = 34359738368
### Total number of shared memory pages that can be used
#kernel.shmall = 4294967296
net.ipv4.neigh.default.gc_stale_time = 120
net.ipv4.conf.default.arp_announce = 2
net.ipv4.conf.all.arp_announce = 2
net.ipv4.conf.lo.arp_announce = 2
### Memory resource settings
net.core.wmem_default = 8388608
net.core.rmem_default = 8388608
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 65536 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_mem = 8388608 8388608 8388608
## DDoS mitigation: TCP connection establishment settings
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_synack_retries = 1
net.ipv4.tcp_syn_retries = 1
net.ipv4.tcp_max_syn_backlog = 262144
## Mitigation for high TIME_WAIT counts: TCP connection teardown settings
net.ipv4.tcp_max_tw_buckets = 10000
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_fin_timeout = 5
net.ipv4.ip_local_port_range = 4000 65000
### TCP keepalive settings
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 5
### Other TCP tuning
net.core.somaxconn = 22144
net.core.netdev_max_backlog = 262144
net.ipv4.tcp_max_orphans = 3276800
net.ipv4.tcp_sack = 1
net.ipv4.tcp_window_scaling = 1
### File handle limit
fs.file-max=655350
vm.swappiness=10
kernel.sysrq = 1
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
kernel.numa_balancing = 0
kernel.shmmax = 68719476736
### Avoid amplification attacks
net.ipv4.icmp_echo_ignore_broadcasts=1
### Enable protection against bogus ICMP error messages
net.ipv4.icmp_ignore_bogus_error_responses=1
### Enable SYN flood protection
net.ipv4.tcp_syncookies=1

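Judging from these parameters, nothing in the configuration would cause frequent disconnects between master and slave, which is puzzling.

Code Analysis

RocketMQ master-slave mechanism

1. The slave actively opens a TCP connection to the master, then every 5 s sends the maximum offset of its commitLog file to the master in order to pull messages that have not yet been synchronized;
2. The master opens a listening port and waits for data from the slave; it parses the offset the slave sends, looks up the unsynchronized messages, and returns them to the slave;
3. After the slave (the HA client) receives the messages from the master, it writes the batch to its commitLog file, updates its commitLog pull offset, and continues pulling unsynchronized messages from the master.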
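The three steps above can be sketched as a tiny framing exercise. This is an illustrative model only, not RocketMQ's code: the class and method names are hypothetical, and the wire layout (an 8-byte offset report from the slave, and a transfer frame with an 8-byte start offset plus a 4-byte body length from the master) is an assumption made for the example.

```java
import java.nio.ByteBuffer;

/** Illustrative model of the offset-report / transfer exchange; layout is assumed. */
class HaFrames {
    /** Step 1: the slave reports the max offset of its commitLog (assumed 8 bytes). */
    static ByteBuffer encodeOffsetReport(long slaveMaxOffset) {
        ByteBuffer buf = ByteBuffer.allocate(8);
        buf.putLong(slaveMaxOffset);
        buf.flip();
        return buf;
    }

    /** Step 2: the master answers with [start offset][body length][body]. */
    static ByteBuffer encodeTransfer(long startOffset, byte[] body) {
        ByteBuffer buf = ByteBuffer.allocate(8 + 4 + body.length);
        buf.putLong(startOffset);
        buf.putInt(body.length);
        buf.put(body);
        buf.flip();
        return buf;
    }

    /** Step 3: the slave checks continuity, "appends" the body, and returns its new max offset. */
    static long applyTransfer(ByteBuffer frame, long slaveMaxOffset) {
        long startOffset = frame.getLong();
        int bodySize = frame.getInt();
        if (startOffset != slaveMaxOffset) {
            throw new IllegalStateException("master offset " + startOffset
                    + " does not continue slave offset " + slaveMaxOffset);
        }
        byte[] body = new byte[bodySize];
        frame.get(body); // a real slave would append these bytes to its commitLog
        return slaveMaxOffset + bodySize;
    }
}
```

In this model a frame with an empty body acts as a heartbeat: it leaves the slave's offset unchanged but still counts as traffic from the master.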

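Slave -> Master process

The HA logic splits roughly into two processes: the slave reporting its offset, and the master sending unsynchronized messages to the slave.
The slave's offset reporting lives in the HAClient class. HAClient extends ServiceThread, i.e. it is a service thread: after the Broker starts, it starts a thread that periodically reports the slave's offset to the master.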
org.apache.rocketmq.store.ha.HAService.HAClient#run:

public void run() {
  log.info(this.getServiceName() + " service started");

  while (!this.isStopped()) {
    try {
      // Connect to the master and obtain the socketChannel
      if (this.connectMaster()) {
        if (this.isTimeToReportOffset()) {
          // Report the slave's offset to the master
          boolean result = this.reportSlaveMaxOffset(this.currentReportedOffset);
          if (!result) {
            this.closeMaster();
          }
        }
        // Poll once per second
        this.selector.select(1000);

        // Handle the data sent by the master
        boolean ok = this.processReadEvent();
        if (!ok) {
          this.closeMaster();
        }

        // ......

      } else {
        this.waitForRunning(1000 * 5);
      }
    } catch (Exception e) {
      log.warn(this.getServiceName() + " service has exception. ", e);
      this.waitForRunning(1000 * 5);
    }
  }

  log.info(this.getServiceName() + " service end");
}

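From the above, the slave actively connects to the master and, after 5 s, sends its current sync progress; the master then sends sync data to the slave; finally, the master closes the connection because the slave's connection has expired.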

2018-10-22 15:44:07 WARN HAClient - HAClient, housekeeping, found this connection[10.38.33.22:10912] expired, 1540194247979
2018-10-22 15:44:07 WARN HAClient - HAClient, master not response some time, so close connection
2018-10-22 15:44:33 INFO HAClient - HAClient, processReadEvent read socket < 0

From this log, the slave also finds that its connection to the master has timed out, and closes it.

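At this point things were very confusing: both master and slave appear to close the connection on their own initiative. In the code, the master's sync logic is clear and matches its log. The slave's log, however, does not match its code. The slave's core sync code decides that the connection has expired, and that the master is not responding, by comparing lastWriteTimestamp with the current time. lastWriteTimestamp is updated when a message is received from the master, when the connection is created, and when the connection is closed: closing resets it to 0, and the other two set it to the current time.

In the log, the difference between lastWriteTimestamp and the current time is enormous: 1540194247979. That value is simply the current time in epoch milliseconds, so lastWriteTimestamp must have been 0, which means the connection had already been closed. It is therefore almost certain that the master closed the connection first, and a tcpdump capture later confirmed this. The log message deserves some criticism here: the slave had its connection closed by the master, yet it reports that the master did not respond, when in fact the master sent sync data 5 times before closing.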

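With the master confirmed as the side that closes the connection, the next question is why the slave never reported its progress to the master. The logs were no further help. The code contains a while(true) block, so I wondered whether it was stuck in an infinite loop, but jstack showed that it was not. Finally, manually added logging showed that the slave received every sync message the master sent and each time reset lastWriteTimestamp to the current time; by coincidence, every time isTimeToReportOffset checked whether a progress report was due, it returned false. The reason is that the master-to-slave sync interval and the slave-to-master report interval both default to 5 s. Because the slave resets lastWriteTimestamp after handling each sync message from the master, the report condition is never satisfied, which produces the behavior above.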
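The coincidence is easier to see in a small simulation. The sketch below is a deliberately simplified, single-threaded model and not RocketMQ's code: the class name, the one-second tick, the idle link (the master sends only heartbeats), the 20 s housekeeping timeout, and the tie-breaking between the two timers are all assumptions chosen to reproduce the behavior described above.

```java
/** Simplified model of the heartbeat timing race; all timing rules are assumptions. */
class HaHeartbeatRace {

    /**
     * @return the time in ms at which the master drops the connection,
     *         or -1 if the connection survives the simulated period
     */
    static long simulate(long slaveReportIntervalMs, long masterHeartbeatIntervalMs,
                         long masterHousekeepingMs, long durationMs) {
        long slaveLastWrite = 0; // slave: set on connect, on report, and on every packet read
        long masterLastSend = 0; // master: time of its last heartbeat
        long masterLastRead = 0; // master: time it last received an offset report

        // The slave loop wakes up about once per second (selector.select(1000)).
        for (long now = 1000; now <= durationMs; now += 1000) {
            // Slave: report only if strictly more than the interval has passed
            // since lastWriteTimestamp (assumed form of isTimeToReportOffset).
            if (now - slaveLastWrite > slaveReportIntervalMs) {
                slaveLastWrite = now;
                masterLastRead = now;
            }
            // Master: on an idle link, send a heartbeat whenever one is due.
            if (now - masterLastSend >= masterHeartbeatIntervalMs) {
                masterLastSend = now;
                slaveLastWrite = now; // the slave reads it and resets its timestamp
            }
            // Master housekeeping: drop a slave that has been silent for too long.
            if (now - masterLastRead > masterHousekeepingMs) {
                return now;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("slave 5s / master 5s: " + simulate(5000, 5000, 20000, 60000));
        System.out.println("slave 4s / master 5s: " + simulate(4000, 5000, 20000, 60000));
    }
}
```

With both intervals at 5 s, each heartbeat resets the slave's timestamp just before its report would become due, so the slave never reports and the model's master drops the connection at 21000 ms. With the slave interval at 4 s, the report fires before the next heartbeat arrives and the call returns -1.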

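Solution

Frequent disconnects clearly hurt sync efficiency. The fix is to make the master's sync interval longer than the slave's report interval, for example 3 s for the slave and 5 s for the master.

Change the heartbeat setting on the slave broker (here to 4000 ms, below the master's default of 5 s):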

./mqadmin updateBrokerConfig  -b rocketmq-slave-a:10911 -c rocketmq-cluster -n 'rocketmq-nameserver-a:9876;rocketmq-nameserver-b:9876' -k haSendHeartbeatInterval -v 4000

Running the command fails with the following error:

[root@rocketmq-slave-c bin]# sh mqadmin updateBrokerConfig  -b rocketmq-slave-c:10911 -c rocketmq-cluster -n 'rocketmq-nameserver-a:9876;rocketmq-nameserver-b:9876' -k haSendHeartbeatInterval -v 4000  
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
org.apache.rocketmq.tools.command.SubCommandException: UpdateBrokerConfigSubCommand command failed
        at org.apache.rocketmq.tools.command.broker.UpdateBrokerConfigSubCommand.execute(UpdateBrokerConfigSubCommand.java:105)
        at org.apache.rocketmq.tools.command.MQAdminStartup.main0(MQAdminStartup.java:135)
        at org.apache.rocketmq.tools.command.MQAdminStartup.main(MQAdminStartup.java:86)
Caused by: org.apache.rocketmq.acl.common.AclException: [10015:signature-failed] unable to calculate a request signature. error=[10015:signature-failed] unable to calculate a request signature. error=Algorithm HmacSHA1 not available
        at org.apache.rocketmq.acl.common.AclSigner.signAndBase64Encode(AclSigner.java:84)
        at org.apache.rocketmq.acl.common.AclSigner.calSignature(AclSigner.java:73)
        at org.apache.rocketmq.acl.common.AclSigner.calSignature(AclSigner.java:68)
        at org.apache.rocketmq.acl.common.AclUtils.calSignature(AclUtils.java:58)
        at org.apache.rocketmq.acl.common.AclClientRPCHook.doBeforeRequest(AclClientRPCHook.java:44)
        at org.apache.rocketmq.remoting.netty.NettyRemotingAbstract.doBeforeRpcHooks(NettyRemotingAbstract.java:172)
        at org.apache.rocketmq.remoting.netty.NettyRemotingClient.invokeSync(NettyRemotingClient.java:370)
        at org.apache.rocketmq.client.impl.MQClientAPIImpl.updateBrokerConfig(MQClientAPIImpl.java:1144)
        at org.apache.rocketmq.tools.admin.DefaultMQAdminExtImpl.updateBrokerConfig(DefaultMQAdminExtImpl.java:166)
        at org.apache.rocketmq.tools.admin.DefaultMQAdminExt.updateBrokerConfig(DefaultMQAdminExt.java:148)
        at org.apache.rocketmq.tools.command.broker.UpdateBrokerConfigSubCommand.execute(UpdateBrokerConfigSubCommand.java:81)
        ... 2 more
Caused by: org.apache.rocketmq.acl.common.AclException: [10015:signature-failed] unable to calculate a request signature. error=Algorithm HmacSHA1 not available
        at org.apache.rocketmq.acl.common.AclSigner.sign(AclSigner.java:63)
        at org.apache.rocketmq.acl.common.AclSigner.signAndBase64Encode(AclSigner.java:79)
        ... 12 more
Caused by: java.security.NoSuchAlgorithmException: Algorithm HmacSHA1 not available
        at javax.crypto.Mac.getInstance(Mac.java:181)
        at org.apache.rocketmq.acl.common.AclSigner.sign(AclSigner.java:57)
        ... 13 more
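The root cause at the bottom of the trace is that Mac.getInstance cannot find an HmacSHA1 provider. A small standalone probe, run with the same JVM options as tools.sh, can confirm whether the algorithm is visible. This is a minimal sketch: the class name, method names, and sample inputs are hypothetical, and only standard javax.crypto and java.util calls are used.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical probe: checks whether the running JVM can provide a given MAC algorithm. */
class HmacProbe {
    /** Returns true if some installed security provider offers the algorithm. */
    static boolean isAvailable(String algorithm) {
        try {
            Mac.getInstance(algorithm);
            return true;
        } catch (NoSuchAlgorithmException e) {
            return false;
        }
    }

    /** Signs data with HmacSHA1 and Base64-encodes the result. */
    static String sign(String data, String key) {
        try {
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(key.getBytes(StandardCharsets.UTF_8), "HmacSHA1"));
            return Base64.getEncoder()
                    .encodeToString(mac.doFinal(data.getBytes(StandardCharsets.UTF_8)));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA1 signing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("HmacSHA1 available: " + isAvailable("HmacSHA1"));
    }
}
```

If isAvailable("HmacSHA1") returns false under a particular -Djava.ext.dirs setting, the mqadmin failure above is to be expected with that setting.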

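Fix:

vim /home/H/rocketmq-all-4.4.0/distribution/target/apache-rocketmq/bin/tools.sh
After ${JAVA_HOME}/jre/lib/ext, append the absolute path of the JDK's jre/lib/ext directory to -Djava.ext.dirs. On JDK 8 the HmacSHA1 implementation is supplied by the SunJCE provider jar in that directory, so if ${JAVA_HOME} does not resolve to the JDK in use, the algorithm cannot be found.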

JAVA_OPT="${JAVA_OPT} -server -Xms1g -Xmx1g -Xmn256m -XX:PermSize=128m -XX:MaxPermSize=128m"
JAVA_OPT="${JAVA_OPT} -Djava.ext.dirs=${BASE_DIR}/lib:${JAVA_HOME}/jre/lib/ext:/usr/java/jdk1.8.0_201-amd64/jre/lib/ext"

Running the mqadmin command again now succeeds:

[root@rocketmq-slave-c bin]# sh mqadmin updateBrokerConfig  -b rocketmq-slave-c:10911 -c rocketmq-cluster -n 'rocketmq-nameserver-a:9876;rocketmq-nameserver-b:9876' -k haSendHeartbeatInterval -v 4000 
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
update broker config success, rocketmq-slave-c:10911

The MQ no longer produces the exceptions.

References:

https://cloud.tencent.com/developer/article/1512981

https://www.jianshu.com/p/f8a7ae95bc58

http://zjykzk.github.io/post/cs/rocketmq/slave-sync-from-master-disconnect/
