Two-node RabbitMQ active/standby cluster
Node | Ports | Version |
---|---|---|
master | 5672 (listener port) / 15672 (web port) / 25672 (cluster communication port) | rabbitmq-server-3.7.0/erlang-19.3.6.4 |
slave | 5672 (listener port) / 15672 (web port) / 25672 (cluster communication port) | rabbitmq-server-3.7.0/erlang-19.3.6.4 |
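1. The master node's disk filled up and the host crashed. After the disk was cleaned up and expanded, the host was booted again, but the rabbitmq service was stopped.
2. After the rabbitmq service was restarted, it was still faulty, and connections to rabbitmq failed with errors.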
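In the rabbitmq web console, the home page reported: Virtual host xxxx experienced an error on node rabbit@rbtnode1 and may be inaccessible. Several vhosts could not be accessed on the first rabbitmq node.
The rabbitmq log shows two kinds of errors. One is user connections being refused because the vhost is down; the other is Unable to recover vhost, meaning the vhost data could not be recovered. The second error points directly at A6B2YT8CK302DQET37CULUA99/recovery.dets under the rabbitmq data path. In that directory, recovery.dets is 0 kB in size, so the file is corrupt.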
[error] <0.1173.0> Unable to recover vhost <<"prod_XXXX">> data. Reason {badmatch,{error,{{{badmatch,{error,{not_a_dets_file,"/XXX/XXX/XXX/rabbitmq/mnesia/rabbit@rbtnode1/msg_stores/vhosts/A6B2YT8CK302DQET37CULUA99/recovery.dets"}}},[{rabbit_recovery_terms,open_table,1,[{file,"src/rabbit_recovery_terms.erl"},{line,191}]},{rabbit_recovery_terms,init,1,[{file,"src/rabbit_recovery_terms.erl"},{line,171}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,328}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]},{child,undefined,rabbit_recovery_terms,{rabbit_recovery_terms,start_link,[<<"prod_bk_nodeman">>]},transient,30000,worker,[rabbit_recovery_terms]}}}}
Error on AMQP connection <0.2915.0> (86.5.XX.XX:44194 -> 86.5.XX.XX:5672, vhost: 'none',
user: 'bk_XXX', state: opening), channel 0:
{handshake_error,opening,
{amqp_error,internal_error,
"access to vhost 'prod_XXX' refused for user 'bk_XXX': vhost 'prod_XXXX' is down",
'connection.open'}}
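When several vhosts are affected, the damaged files can be pulled out of the log text instead of being read off by eye. The following is a minimal sketch, assuming the log lines have the same shape as the ones shown above; the function name `find_corrupt_dets` and both regular expressions are illustrative, not part of RabbitMQ.

```python
import re

# Matches the file path reported inside a not_a_dets_file error term.
DETS_ERROR = re.compile(r'not_a_dets_file,"([^"]+)"')
# Matches the vhost name in an "Unable to recover vhost" log line.
VHOST_NAME = re.compile(r'Unable to recover vhost <<"([^"]+)">>')


def find_corrupt_dets(log_text):
    """Return (vhost, path) pairs for every not_a_dets_file error in log_text.

    The vhost is None when the line does not carry an
    "Unable to recover vhost" prefix.
    """
    results = []
    for line in log_text.splitlines():
        path = DETS_ERROR.search(line)
        if path is None:
            continue
        vhost = VHOST_NAME.search(line)
        results.append((vhost.group(1) if vhost else None, path.group(1)))
    return results
```

Pass the contents of the node's log file to `find_corrupt_dets` and each returned pair names one vhost and the recovery.dets path reported for it.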
In RabbitMQ versions starting with 3.7.0 all messages data is combined in the msg_stores/vhosts directory and stored in a subdirectory per vhost. Each vhost directory is named with a hash and contains a .vhost file with the vhost name, so a specific vhost's message set can be backed up separately.
In RabbitMQ versions prior to 3.7.0 messages are stored in several directories under the node data directory: queues, msg_store_persistent and msg_store_transient. Also there is a recovery.dets file which contains recovery metadata if the node was stopped gracefully.
https://www.rabbitmq.com/backup.html
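According to the official description, recovery.dets normally holds rabbitmq's recovery metadata, which is written when the node stops gracefully. Because the host crashed unexpectedly when its disk filled up, the file was never written properly and was left corrupt.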
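The directory layout quoted above (one hashed subdirectory per vhost under msg_stores/vhosts, each with a .vhost file holding the vhost name) makes it possible to check every vhost on the node at once. The following is a rough sketch, assuming that a zero-byte recovery.dets is the symptom of interest, as it was in this incident; a non-empty file can still be corrupt and would not be caught. The function name `find_empty_recovery_files` is hypothetical.

```python
from pathlib import Path


def find_empty_recovery_files(node_data_dir):
    """Map vhost name -> path of each zero-byte recovery.dets on this node.

    node_data_dir is the node's data directory, i.e. the directory that
    contains msg_stores/vhosts.
    """
    damaged = {}
    vhosts_root = Path(node_data_dir) / "msg_stores" / "vhosts"
    for dets in sorted(vhosts_root.glob("*/recovery.dets")):
        if dets.stat().st_size != 0:
            continue
        try:
            # The .vhost file contains the human-readable vhost name.
            name = (dets.parent / ".vhost").read_text().strip()
        except (OSError, UnicodeDecodeError):
            # Fall back to the hashed directory name if .vhost is unreadable.
            name = dets.parent.name
        damaged[name] = dets
    return damaged
```

Run it against the node data directory (the elided `/XXX/XXX/XXX/rabbitmq/mnesia/rabbit@rbtnode1` path in the log above) to get the affected vhost names alongside their file paths.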
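1) From the log, identify the damaged file of every affected vhost, go to the rabbitmq data path, and delete or mv each recovery.dets file. Then restart the rabbitmq service. The vhost connections recover, which resolves the problem, and recovery.dets is rewritten with fresh metadata.
2) Delete the affected vhost and create it again. Note that deleting a vhost also removes its queues and messages.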
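For option 1), moving the files aside rather than deleting them keeps a copy in case it is needed later. The following is a minimal sketch, assuming the list of damaged recovery.dets paths has already been collected from the log; the function name `move_aside` and the `.bak-<timestamp>` suffix are illustrative choices. It defaults to a dry run, and the rabbitmq service still has to be restarted afterwards as described in option 1).

```python
import shutil
import time
from pathlib import Path


def move_aside(paths, dry_run=True):
    """Rename each file to <name>.bak-<timestamp> and return the (source, target) pairs.

    With dry_run=True nothing is touched; the planned moves are only returned.
    """
    stamp = time.strftime("%Y%m%d%H%M%S")
    moves = []
    for source in map(Path, paths):
        target = source.with_name(f"{source.name}.bak-{stamp}")
        moves.append((source, target))
        # Skip files that are already gone so the function can be re-run safely.
        if not dry_run and source.exists():
            shutil.move(str(source), str(target))
    return moves
```

Call `move_aside(paths)` first to review the planned renames, then `move_aside(paths, dry_run=False)` to apply them.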