Failure case: frequent primary switchover in a MongoDB replica set

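I picked up a case: mongod version 2.4, three nodes forming a highly available replica set, and the primary kept switching over. The error log looked like this: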

Mon Mar  7 14:29:15.379 [initandlisten] pthread_create failed: errno:11 Resource temporarily unavailable
Mon Mar  7 14:29:15.379 [initandlisten] can't create new thread, closing connection
Mon Mar  7 14:29:15.381 [initandlisten] connection accepted from 10.10.14.71:41282 #1801013 (973 connections now open)
Mon Mar  7 14:29:15.381 [initandlisten] pthread_create failed: errno:11 Resource temporarily unavailable
Mon Mar  7 14:29:15.381 [initandlisten] can't create new thread, closing connection
Mon Mar  7 14:29:15.383 [initandlisten] connection accepted from 10.10.14.71:41270 #1801014 (973 connections now open)
Mon Mar  7 14:29:15.383 [initandlisten] pthread_create failed: errno:11 Resource temporarily unavailable
Mon Mar  7 14:29:15.383 [initandlisten] can't create new thread, closing connection
Mon Mar  7 14:29:15.385 [initandlisten] connection accepted from 10.10.14.71:41283 #1801015 (973 connections now open)
Mon Mar  7 14:29:15.385 [initandlisten] pthread_create failed: errno:11 Resource temporarily unavailable
Mon Mar  7 14:29:15.385 [initandlisten] can't create new thread, closing connection
Mon Mar  7 14:29:19.689 [rsHealthPoll] replSet health poll task caught an exception: boost::thread_resource_errorreplSet info 10.10.223.207:27017 is down (or slow to respond): boost::thread_resource_error
Mon Mar  7 14:29:19.689 [rsHealthPoll] replSet member 10.10.223.207:27017 is now in state DOWN
Mon Mar  7 14:29:21.949 [rsHealthPoll] replSet health poll task caught an exception: boost::thread_resource_errorreplSet info 10.10.14.71:27017 is down (or slow to respond): boost::thread_resource_error
Mon Mar  7 14:29:21.949 [rsHealthPoll] replSet member 10.10.14.71:27017 is now in state DOWN

The mongod process stayed up the whole time, and later monitoring of the internal network quality showed no problems either.

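A closer look at the error log showed that the connection count never got above 1000 (it sat at 973), even though the configured connection limit was 20,000. The problem was finally traced to a CentOS setting: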
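The plateau in the connection count can also be confirmed mechanically rather than by eye. The following is a minimal sketch, assuming the log keeps the 2.4-style "(N connections now open)" suffix seen above; the function names are made up for illustration.

```python
import re

# Matches the tail of a mongod 2.4 "connection accepted" line, e.g.
# "... #1801013 (973 connections now open)"
OPEN_CONNS = re.compile(r"\((\d+) connections? now open\)")


def peak_open_connections(log_lines):
    """Return the highest 'connections now open' count seen, or None."""
    counts = []
    for line in log_lines:
        match = OPEN_CONNS.search(line)
        if match:
            counts.append(int(match.group(1)))
    return max(counts) if counts else None


def count_thread_failures(log_lines):
    """Count lines where mongod failed to create a thread for a connection."""
    return sum(1 for line in log_lines if "pthread_create failed" in line)
```

Feeding it the lines of the mongod log (for example `open(path).readlines()`) gives the peak connection count and the number of thread-creation failures. A peak that sits just below a round number such as 1024 while failures keep accumulating points at an OS limit rather than at the mongod connection setting.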

# cat /etc/security/limits.d/90-nproc.conf
# Default limit for number of user's processes to prevent
# accidental fork bombs.
# See rhbz #432903 for reasoning.


*          soft    nproc     1024
root       soft    nproc     unlimited


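Non-root accounts are limited to 1024, and the mongod process happened to be started by a non-root user. On Linux every thread counts against the user's nproc limit, and mongod 2.4 creates one thread per connection, so once the connections plus mongod's other threads approached 1024, pthread_create failed with errno 11. The replica set health poll could not get a thread either (boost::thread_resource_error), so members were marked DOWN and the primary switched. After raising the value, the problem was resolved.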
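To check how close a running mongod is to this limit, the effective values can be read from /proc instead of from the config file, which matters because a process keeps the limits it was started with. This is a minimal sketch, assuming a Linux /proc filesystem; the helper names are illustrative, not part of any MongoDB tooling.

```python
import re


def parse_max_processes(limits_text):
    """Extract the soft 'Max processes' limit from /proc/<pid>/limits text.

    Returns an int, or None when the limit is 'unlimited' or the row is missing.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max processes"):
            soft = line[len("Max processes"):].split()[0]
            return None if soft == "unlimited" else int(soft)
    return None


def parse_thread_count(status_text):
    """Extract the 'Threads:' value from /proc/<pid>/status text."""
    match = re.search(r"^Threads:\s+(\d+)", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def nproc_headroom(pid):
    """Return (thread_count, soft_nproc_limit) for a running process."""
    with open(f"/proc/{pid}/limits") as f:
        limit = parse_max_processes(f.read())
    with open(f"/proc/{pid}/status") as f:
        threads = parse_thread_count(f.read())
    return threads, limit
```

Calling `nproc_headroom(pid)` with the mongod PID returns its thread count and the soft limit it is actually running under. The thread count is only a lower bound on usage, since nproc is counted across all processes and threads owned by the same user.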
