MongoDB Sharded Cluster Fault: Handling a Member Stuck in RECOVERING (Incident Notes)




shard1:RECOVERING> rs.status();


        "set" : "shard1",

        "date" : ISODate("2017-03-03T03:08:50.882Z"),

        "myState" : 3,

        "members" : [


                        "_id" : 0,

                        "name" : "",

                        "health" : 1,

                        "state" : 1,

                        "stateStr" : "PRIMARY",

                        "uptime" : 69310,

                        "optime" : Timestamp(1488510526, 3),

                        "optimeDate" : ISODate("2017-03-03T03:08:46Z"),

                        "lastHeartbeat" : ISODate("2017-03-03T03:08:50.416Z"),

                        "lastHeartbeatRecv" : ISODate("2017-03-03T03:08:49.706Z"),

                        "pingMs" : 0,

                        "electionTime" : Timestamp(1479454146, 1),

                        "electionDate" : ISODate("2016-11-18T07:29:06Z"),

                        "configVersion" : 1



                        "_id" : 1,

                        "name" : "",

                        "health" : 1,

                        "state" : 3,

                        "stateStr" : "RECOVERING",

                        "uptime" : 69311,

                        "optime" : Timestamp(1471072341, 1),

                        "optimeDate" : ISODate("2016-08-13T07:12:21Z"),

                        "configVersion" : 1,

                        "self" : true



                        "_id" : 2,

                        "name" : "",

                        "health" : 1,

                        "state" : 7,

                        "stateStr" : "ARBITER",

                        "uptime" : 69310,

                        "lastHeartbeat" : ISODate("2017-03-03T03:08:50.412Z"),

                        "lastHeartbeatRecv" : ISODate("2017-03-03T03:08:50.322Z"),

                        "pingMs" : 0,

                        "configVersion" : 1



        "ok" : 1







2. Analyzing replSet error RS102 in the mongod error log


[mongodb@mongodb_m2 ~]$ ps -eaf|grep 27017

mongodb  24630     1  0 Mar02 ?        00:03:41 /usr/local/mongodb-linux-x86_64-3.0.3/bin/mongod --shardsvr --replSet shard1 --port 27017 --dbpath /data/mongodb/shard27017 --oplogSize 2048 --logpath /data/mongodb/logs/shard_m1s1_27017.log --logappend --fork

mongodb  39309 30937  0 10:35 pts/0    00:00:00 grep 27017

[mongodb@mongodb_m2 ~]$



more /data/mongodb/logs/shard_m1s1_27017.log

2017-03-03T09:44:59.070+0800 I REPL     [ReplicationExecutor] syncing from:

2017-03-03T09:44:59.071+0800 W REPL     [rsBackgroundSync] we are too stale to use as a sync source

2017-03-03T09:44:59.071+0800 I REPL     [ReplicationExecutor] could not find member to sync from

2017-03-03T09:44:59.071+0800 I REPL     [rsBackgroundSync] replSet error RS102 too stale to catch up

2017-03-03T09:44:59.071+0800 I REPL     [rsBackgroundSync] replSet our last optime : Aug 13 15:12:21 57aec855:1

2017-03-03T09:44:59.071+0800 I REPL     [rsBackgroundSync] replSet oldest available is Feb  7 14:13:10 58996576:1

2017-03-03T09:44:59.071+0800 I REPL     [rsBackgroundSync] replSet See

2017-03-03T09:45:18.914+0800 I NETWORK  [conn6420] end connection (3 connections now open)

2017-03-03T09:45:18.915+0800 I NETWORK  [initandlisten] connection accepted from #6423 (4 connections now open)

2017-03-03T09:45:20.195+0800 I NETWORK  [conn6421] end connection (3 connections now open)

2017-03-03T09:45:20.196+0800 I NETWORK  [initandlisten] connection accepted from #6424 (4 connections now open)



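The line "replSet oldest available is Feb  7 14:13:10 58996576:1" shows that the oldest oplog entry still available on the sync source dates from February 7, whereas this member's last applied operation ("our last optime") dates from August 13 of the previous year. Sync stopped at that point, and the member has since fallen out of the source's oplog window, so it can no longer catch up on its own and we have to resync it manually. That is what the log suggests on the surface; for the detailed replication information we still need to check from the mongo shell.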
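The two optimes in the RS102 log lines are printed as hexadecimal seconds plus an increment. The sketch below decodes them and applies the staleness comparison described above. The helper names `parseLogOptime` and `isTooStale` are hypothetical and exist only for illustration; the comparison uses whole seconds only, which is a simplification.

```javascript
// Parse the "<hex seconds>:<increment>" optime printed in the RS102 log lines.
// (Hypothetical helper, not part of MongoDB.)
function parseLogOptime(text) {
  const [hexSeconds, increment] = text.trim().split(":");
  return {
    seconds: parseInt(hexSeconds, 16),
    increment: parseInt(increment, 10),
  };
}

// A member is too stale when its last applied entry is older than the oldest
// entry still held in the sync source's oplog (seconds-only comparison).
function isTooStale(ourLast, oldestAvailable) {
  return ourLast.seconds < oldestAvailable.seconds;
}

const ourLast = parseLogOptime("57aec855:1");          // "our last optime"
const oldestAvailable = parseLogOptime("58996576:1");  // "oldest available"

console.log(new Date(ourLast.seconds * 1000).toISOString());
// 2016-08-13T07:12:21.000Z
console.log(new Date(oldestAvailable.seconds * 1000).toISOString());
// 2017-02-07T06:13:10.000Z
console.log(isTooStale(ourLast, oldestAvailable));
// true
```

The decoded value 1471072341 matches the `Timestamp(1471072341, 1)` reported for the RECOVERING member in `rs.status()`.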




shard1:RECOVERING>  db.printReplicationInfo();

configured oplog size:   2048.003890991211MB

log length start to end: 11028041secs (3063.34hrs)

oplog first event time:  Thu Apr 07 2016 23:51:40 GMT+0800 (CST)

oplog last event time:   Sat Aug 13 2016 15:12:21 GMT+0800 (CST)

now:                     Fri Mar 03 2017 10:37:25 GMT+0800 (CST)                    






shard1:PRIMARY>  db.printReplicationInfo();

configured oplog size:   2048.003890991211MB

log length start to end: 2059878secs (572.19hrs)

oplog first event time:  Tue Feb 07 2017 14:31:13 GMT+0800 (CST)

oplog last event time:   Fri Mar 03 2017 10:42:31 GMT+0800 (CST)

now:                     Fri Mar 03 2017 10:42:32 GMT+0800 (CST)                    
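Putting the two `db.printReplicationInfo()` outputs side by side makes the gap concrete. The following sketch recomputes each oplog window and the distance between the stale member's last event and the primary's first event. The timestamps are transcribed by hand from the outputs above into ISO 8601 form (GMT+0800), and the function names are hypothetical.

```javascript
// Oplog boundaries transcribed from the two printReplicationInfo() outputs.
const staleMember = {
  first: new Date("2016-04-07T23:51:40+08:00"),
  last: new Date("2016-08-13T15:12:21+08:00"),
};
const primary = {
  first: new Date("2017-02-07T14:31:13+08:00"),
  last: new Date("2017-03-03T10:42:31+08:00"),
};

// Length of an oplog in seconds ("log length start to end").
function oplogWindowSeconds(oplog) {
  return (oplog.last - oplog.first) / 1000;
}

// Seconds between the member's last applied event and the oldest event the
// source still has. A positive value means entries the member needs are gone.
function gapSeconds(member, source) {
  return (source.first - member.last) / 1000;
}

console.log(oplogWindowSeconds(staleMember)); // 11028041
console.log(oplogWindowSeconds(primary));     // 2059878
console.log(gapSeconds(staleMember, primary)); // 15376732
```

The first two values reproduce the "log length start to end" figures in the outputs. The positive gap of roughly 178 days is why the member cannot catch up from the oplog alone.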








The log line "2017-03-03T09:44:59.071+0800 I REPL     [rsBackgroundSync] replSet See" points to the resync documentation, which describes the following ways to resync a member:

(1) Automatically Sync a Member


         During initial sync, mongod will remove the content of the dbPath.


You can also force a mongod that is already a member of the set to perform an initial sync by restarting the instance without the content of the dbPath as follows:

         Stop the member’s mongod instance. To ensure a clean shutdown, use the db.shutdownServer() method from the mongo shell or on Linux systems, the mongod --shutdown option.

         Delete all data and sub-directories from the member’s data directory. By removing the data dbPath, MongoDB will perform a complete resync. Consider making a backup first.


(2) Sync by Copying Data Files from Another Member


This approach “seeds” a new or stale member using the data files from an existing member of the replica set. The data files must be sufficiently recent to allow the new member to catch up with the oplog. Otherwise the member would need to perform an initial sync.

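(2.1) Copy the Data Files: stop the secondary, then copy the data files from the seed server (in this case the primary). When copying, make sure the local database is included. mongodump cannot be used for this copy; only a snapshot backup of the data files is allowed.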

(2.2) Sync the Member: start the mongod instance, which then begins applying the oplog.
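The "sufficiently recent" condition for method (2) can be read as a simple comparison: the copied data files are only usable if their last applied operation is still inside the source's oplog. A minimal sketch under that reading follows. `chooseResyncMethod` and its parameters are hypothetical names, and the comparison works on whole seconds only.

```javascript
// Decide between the two documented resync approaches.
// copyLastOptimeSec: last applied optime (epoch seconds) in the copied data
//                    files, or null if there is no copy to seed from.
// sourceOldestOplogSec: oldest oplog entry (epoch seconds) on the sync source.
function chooseResyncMethod(copyLastOptimeSec, sourceOldestOplogSec) {
  if (copyLastOptimeSec === null) {
    return "initial sync";
  }
  // The copy can catch up only if the oplog still reaches back to it.
  return copyLastOptimeSec >= sourceOldestOplogSec
    ? "seed from copied data files"
    : "initial sync";
}

// The stale member's own files (last optime 1471072341) against the oldest
// entry available on the source (0x58996576 = 1486447990):
console.log(chooseResyncMethod(1471072341, 1486447990));
// initial sync

// A snapshot taken at the primary's optime shown in rs.status() (1488510526):
console.log(chooseResyncMethod(1488510526, 1486447990));
// seed from copied data files
```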







(1) First shut down the mongod server

shard1:RECOVERING> db.shutdownServer();

2017-03-03T11:10:34.536+0800 I NETWORK  DBClientCursor::init call() failed

server should be down...

2017-03-03T11:10:34.539+0800 I NETWORK  trying reconnect to localhost:27017 ( failed

2017-03-03T11:10:34.539+0800 W NETWORK  Failed to connect to, reason: errno:111 Connection refused

2017-03-03T11:10:34.539+0800 I NETWORK  reconnect localhost:27017 ( failed failed couldn't connect to server localhost:27017 (, connection attempt failed

2017-03-03T11:10:34.543+0800 I NETWORK  trying reconnect to localhost:27017 ( failed

2017-03-03T11:10:34.543+0800 W NETWORK  Failed to connect to, reason: errno:111 Connection refused

2017-03-03T11:10:34.543+0800 I NETWORK  reconnect localhost:27017 ( failed failed couldn't connect to server localhost:27017 (, connection attempt failed




[mongodb@mongodb_m2 shard27017]$ mv /data/mongodb/shard27017 /data/mongodb/shard27017_bak

[mongodb@mongodb_m2 shard27017]$ mkdir /data/mongodb/shard27017

[mongodb@mongodb_m2 shard27017]$ /usr/local/mongodb-linux-x86_64-3.0.3/bin/mongod --shardsvr --replSet shard1 --port 27017 --dbpath /data/mongodb/shard27017 --oplogSize 2048 --logpath /data/mongodb/logs/shard_m1s1_27017.log --logappend --fork

about to fork child process, waiting until server is ready for connections.

forked process: 44687

child process started successfully, parent exiting

[mongodb@mongodb_m2 shard27017]$





shard1:STARTUP2> rs.status();


        "set" : "shard1",

        "date" : ISODate("2017-03-03T03:19:43.367Z"),

        "myState" : 5,

        "syncingTo" : "",

        "members" : [


                        "_id" : 0,

                        "name" : "",

                        "health" : 1,

                        "state" : 1,

                        "stateStr" : "PRIMARY",

                        "uptime" : 85,

                        "optime" : Timestamp(1488511178, 8),

                        "optimeDate" : ISODate("2017-03-03T03:19:38Z"),

                        "lastHeartbeat" : ISODate("2017-03-03T03:19:41.796Z"),

                        "lastHeartbeatRecv" : ISODate("2017-03-03T03:19:41.796Z"),

                        "pingMs" : 0,

                        "electionTime" : Timestamp(1479454146, 1),

                        "electionDate" : ISODate("2016-11-18T07:29:06Z"),

                        "configVersion" : 1



                        "_id" : 1,

                        "name" : "",

                        "health" : 1,

                        "state" : 5,

                        "stateStr" : "STARTUP2",

                        "uptime" : 141,

                        "optime" : Timestamp(0, 0),

                        "optimeDate" : ISODate("1970-01-01T00:00:00Z"),

                        "syncingTo" : "",

                        "configVersion" : 1,

                        "self" : true



                        "_id" : 2,

                        "name" : "",

                        "health" : 1,

                        "state" : 7,

                        "stateStr" : "ARBITER",

                        "uptime" : 85,

                        "lastHeartbeat" : ISODate("2017-03-03T03:19:41.796Z"),

                        "lastHeartbeatRecv" : ISODate("2017-03-03T03:19:41.796Z"),

                        "pingMs" : 0,

                        "configVersion" : 1



        "ok" : 1





[mongodb@mongodb_m2 mongodb]$  /usr/local/mongodb-linux-x86_64-3.0.3/bin/mongo localhost:27017/admin

MongoDB shell version: 3.0.3

connecting to: localhost:27017/admin

Server has startup warnings:

2017-03-03T11:18:16.884+0800 I CONTROL  [initandlisten]

2017-03-03T11:18:16.884+0800 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.

2017-03-03T11:18:16.884+0800 I CONTROL  [initandlisten] **        We suggest setting it to 'never'

2017-03-03T11:18:16.885+0800 I CONTROL  [initandlisten]

2017-03-03T11:18:16.885+0800 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.

2017-03-03T11:18:16.885+0800 I CONTROL  [initandlisten] **        We suggest setting it to 'never'

2017-03-03T11:18:16.885+0800 I CONTROL  [initandlisten]

shard1:PRIMARY> rs.status();


        "set" : "shard1",

        "date" : ISODate("2017-03-03T03:31:34.528Z"),

        "myState" : 1,

        "members" : [


                        "_id" : 0,

                        "name" : "",

                        "health" : 1,

                        "state" : 2,

                        "stateStr" : "SECONDARY",

                        "uptime" : 797,

                        "optime" : Timestamp(1488511889, 2),

                        "optimeDate" : ISODate("2017-03-03T03:31:29Z"),

                        "lastHeartbeat" : ISODate("2017-03-03T03:31:32.612Z"),

                        "lastHeartbeatRecv" : ISODate("2017-03-03T03:31:33.347Z"),

                        "pingMs" : 0,

                        "syncingTo" : "",

                        "configVersion" : 1



                        "_id" : 1,

                        "name" : "",

                        "health" : 1,

                        "state" : 1,

                        "stateStr" : "PRIMARY",

                        "uptime" : 852,

                        "optime" : Timestamp(1488511889, 2),

                        "optimeDate" : ISODate("2017-03-03T03:31:29Z"),

                        "electionTime" : Timestamp(1488511825, 1),

                        "electionDate" : ISODate("2017-03-03T03:30:25Z"),

                        "configVersion" : 1,

                        "self" : true



                        "_id" : 2,

                        "name" : "",

                        "health" : 1,

                        "state" : 7,

                        "stateStr" : "ARBITER",

                        "uptime" : 797,

                        "lastHeartbeat" : ISODate("2017-03-03T03:31:32.612Z"),

                        "lastHeartbeatRecv" : ISODate("2017-03-03T03:31:33.347Z"),

                        "pingMs" : 0,

                        "configVersion" : 1



        "ok" : 1

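Across the three `rs.status()` outputs, the resynced member moves from RECOVERING (state 3) through STARTUP2 (state 5) to a data-bearing state. The sketch below shows how that check could be automated against an `rs.status()`-shaped object. The function names are hypothetical, and the sample objects are trimmed to the fields the check needs.

```javascript
// State codes as they appear in the rs.status() outputs above.
const STATE_NAMES = {
  1: "PRIMARY",
  2: "SECONDARY",
  3: "RECOVERING",
  5: "STARTUP2",
  7: "ARBITER",
};

// Return the state name of the member the shell is connected to ("self").
function selfState(status) {
  const self = status.members.find((m) => m.self === true);
  if (!self) {
    return "UNKNOWN";
  }
  return self.stateStr || STATE_NAMES[self.state] || "UNKNOWN";
}

// The resync is finished once the member is PRIMARY or SECONDARY again.
function resyncFinished(status) {
  const state = selfState(status);
  return state === "PRIMARY" || state === "SECONDARY";
}

// Trimmed versions of the outputs taken during and after the initial sync.
const duringSync = {
  set: "shard1",
  members: [{ _id: 0, state: 1 }, { _id: 1, state: 5, self: true }, { _id: 2, state: 7 }],
};
const afterSync = {
  set: "shard1",
  members: [{ _id: 0, state: 2 }, { _id: 1, state: 1, self: true }, { _id: 2, state: 7 }],
};

console.log(selfState(duringSync), resyncFinished(duringSync)); // STARTUP2 false
console.log(selfState(afterSync), resyncFinished(afterSync));   // PRIMARY true
```

In the mongo shell, the same functions could be applied to the live result of `rs.status()` instead of the sample objects.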







> use local


> db.createCollection("", {"capped" : true, "size" : 23 * 1024 * 1024 * 1024})

> db.runCommand( { create: "", capped: true, size: (23 * 1024 * 1024 * 1024) } )
