After the Primary applies a write, each Secondary performs a sequence of steps to keep its data in sync:
1: Check the oplog.rs collection in its own local database and find the most recent timestamp.
2: Query the Primary's local.oplog.rs collection for records newer than that timestamp.
3: Insert those records into its own oplog.rs collection and apply the operations they describe.
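These three steps can be condensed into a minimal sketch (plain Python; the lists stand in for the real capped oplog.rs collections, and all names are illustrative, not MongoDB's API):

```python
# Toy simulation of the secondary catch-up steps above.
# Each oplog is a list of (timestamp, operation) tuples.

def catch_up(secondary_oplog, primary_oplog, apply_op):
    # 1: find the latest timestamp in the local oplog.rs
    last_ts = secondary_oplog[-1][0] if secondary_oplog else 0
    # 2: find primary oplog entries newer than that timestamp
    newer = [entry for entry in primary_oplog if entry[0] > last_ts]
    # 3: append them to the local oplog and apply each operation
    for ts, op in newer:
        secondary_oplog.append((ts, op))
        apply_op(op)
    return len(newer)

applied = []
primary = [(1, "insert a"), (2, "insert b"), (3, "update b")]
secondary = [(1, "insert a")]
n = catch_up(secondary, primary, applied.append)
print(n, applied)  # 2 ["insert b", "update b"]
```

After the call the secondary's oplog matches the primary's, which is exactly the invariant the real protocol maintains.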
gechongrepl:PRIMARY> rs.status()
{
    "set" : "gechongrepl",
    "date" : ISODate("2015-07-02T02:38:15Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 6,
            "name" : "192.168.91.144:27017",
            "health" : 1,
            "state" : 7,
            "stateStr" : "ARBITER",
            "uptime" : 1678,
            "lastHeartbeat" : ISODate("2015-07-02T02:38:14Z"),
            "lastHeartbeatRecv" : ISODate("2015-07-02T02:38:14Z"),
            "pingMs" : 1
        },
        {
            "_id" : 10,
            "name" : "192.168.91.135:27017",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 1678,
            "optime" : Timestamp(1435803750, 1),
            "optimeDate" : ISODate("2015-07-02T02:22:30Z"),
            "lastHeartbeat" : ISODate("2015-07-02T02:38:14Z"),
            "lastHeartbeatRecv" : ISODate("2015-07-02T02:38:13Z"),
            "pingMs" : 1,
            "syncingTo" : "192.168.91.148:27017"
        },
        {
            "_id" : 11,
            "name" : "192.168.91.148:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 1698,
            "optime" : Timestamp(1435803750, 1),
            "optimeDate" : ISODate("2015-07-02T02:22:30Z"),
            "electionTime" : Timestamp(1435803023, 1),
            "electionDate" : ISODate("2015-07-02T02:10:23Z"),
            "self" : true
        },
        {
            "_id" : 12,
            "name" : "192.168.91.134:27017",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 1655,
            "optime" : Timestamp(1435803750, 1),
            "optimeDate" : ISODate("2015-07-02T02:22:30Z"),
            "lastHeartbeat" : ISODate("2015-07-02T02:38:14Z"),
            "lastHeartbeatRecv" : ISODate("2015-07-02T02:38:14Z"),
            "pingMs" : 1,
            "syncingTo" : "192.168.91.135:27017"
        }
    ],
    "ok" : 1
}
myState: 1 means this instance is the primary
state: 1 = primary; 2 = secondary; 7 = arbiter
uptime: how long the member has been up
lastHeartbeat: when this instance last successfully received a heartbeat from the remote member
pingMs: round-trip time of a packet from this instance to the remote member
optime: read from the oplog.rs collection; the time of this instance's most recent change
MongoDB uses lastHeartbeat to drive automatic failover.
Every two seconds each mongod instance sends a heartbeat to the other members, and the health field returned by rs.status() reflects each member's state. If the primary becomes unavailable, the secondaries in the replica set trigger an election and choose a new primary. When there are multiple secondaries, the one with the most recent oplog timestamp, or the one with higher priority, is elected primary. (Note: if a secondary is down long enough that the primary's capped oplog wraps around and overwrites entries the secondary still needs, that secondary must be resynced manually.)
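The selection rule described above (higher priority wins, ties broken by the freshest oplog) can be sketched roughly as follows. This is a hypothetical model for illustration only; MongoDB's real election protocol also involves voting rounds and vetoes:

```python
# Hypothetical model of the election rule: among electable members
# (priority > 0), pick the highest priority; break ties with the
# most recent oplog timestamp.

def elect(candidates):
    # candidates: dicts with "name", "priority", "optime" (illustrative fields)
    electable = [c for c in candidates if c["priority"] > 0]
    if not electable:
        return None
    return max(electable, key=lambda c: (c["priority"], c["optime"]))

members = [
    {"name": "135:27017", "priority": 1, "optime": 1435803750},
    {"name": "134:27017", "priority": 1, "optime": 1435803700},
    {"name": "144:27017", "priority": 0, "optime": 0},  # arbiter: never primary
]
print(elect(members)["name"])  # 135:27017 -- freshest oplog among equal priorities
```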
1: Pull the oplog from the sync source. For example, a Secondary pulls the Primary's oplog.
2: Write the pulled entries into its own oplog. For example, the Secondary writes the entries it pulled from the Primary into its own oplog.
3: Request the position at which the next batch of oplog entries should start. For example, the Secondary asks the Primary how far replication has progressed.
1: The Primary inserts a document.
2: At the same time, the operation is written to the Primary's oplog with a timestamp.
3: When db.runCommand({getlasterror:1, w:2}) is issued on the Primary, the Primary has finished its own write and waits for another non-arbiter node to replicate the data.
4: A Secondary queries the Primary's oplog and pulls the new entries.
5: The Secondary applies those entries in timestamp order.
6: The Secondary then requests oplog entries newer than its own latest timestamp.
7: The Primary updates the replication timestamp it tracks for that Secondary.
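Taken together, steps 1-7 mean a write with w:2 is acknowledged only once at least two nodes hold it. A minimal in-memory model (illustrative names, not a driver API):

```python
# In-memory model of a write with {getlasterror:1, w:2}: the write is
# acknowledged only once it exists in the oplogs of w nodes (primary included).

class Node:
    def __init__(self, name):
        self.name, self.oplog = name, []

def write(primary, secondaries, doc, ts, w=2):
    primary.oplog.append((ts, doc))      # steps 1-2: apply + log on primary
    acks = 1                             # the primary itself counts
    for s in secondaries:
        if acks >= w:                    # step 3: getlasterror already satisfied
            break
        last = s.oplog[-1][0] if s.oplog else 0
        s.oplog.extend(e for e in primary.oplog if e[0] > last)  # steps 4-5
        acks += 1                        # steps 6-7: primary records the ack
    return acks >= w

p, s1, s2 = Node("p"), Node("s1"), Node("s2")
print(write(p, [s1, s2], {"x": 1}, ts=1))  # True: p and s1 hold the write
print(len(s2.oplog))                        # 0 -- w=2 was met before s2 synced
```

Note that s2 is still behind when the write is acknowledged: w:2 guarantees two copies, not full-set replication.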
1: Initial sync happens for a newly added node, and for a node whose needed oplog entries have already been overwritten.
2: Read the latest oplog time from the source node and mark it as start.
3: Clone all data from the source node to the target node.
4: Build indexes on the target node.
5: Read the sync source's latest oplog time again and mark it as minValid.
6: On the target node, apply the oplog from start to minValid (this replays the operations that arrived during the clone but have not yet been applied; it is essentially an oplog replay that brings the node to final consistency).
7: The node becomes a normal member.
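The start/minValid bookkeeping in steps 2-6 reduces to the toy model below (illustrative names; real initial sync replays the window against the cloned databases and, as step 4 says, also builds indexes, which is omitted here):

```python
# Toy model of initial sync: clone the data, then replay the oplog
# window (start, minValid] that the source accumulated during the clone.

class Source:
    def __init__(self):
        self.data, self.oplog = {}, []
    def write(self, ts, key, value):
        self.data[key] = value
        self.oplog.append((ts, key, value))
    def oplog_time(self):
        return self.oplog[-1][0] if self.oplog else 0

def initial_sync(src, writes_during_clone):
    start = src.oplog_time()                 # step 2: mark "start"
    target = dict(src.data)                  # step 3: clone all data
    for ts, k, v in writes_during_clone:     # source keeps taking writes...
        src.write(ts, k, v)
    min_valid = src.oplog_time()             # step 5: mark "minValid"
    for ts, k, v in src.oplog:               # step 6: replay (start, minValid]
        if start < ts <= min_valid:
            target[k] = v
    return target                            # step 7: now a consistent member

src = Source()
src.write(1, "a", 1)
copy = initial_sync(src, writes_during_clone=[(2, "b", 2), (3, "a", 9)])
print(copy == src.data)  # True -- the replay closed the gap
```

Without step 6 the clone would be stale ({"a": 1}); the replay is what restores consistency.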
This is performed on the new node.
Initial Sync

Initial sync copies all the data from one member of the replica set to another member. A member uses initial sync when the member has no data, such as when the member is new, or when the member has data but is missing a history of the set's replication.

When you perform an initial sync, MongoDB:

1: Clones all databases. To clone, the mongod queries every collection in each source database and inserts all data into its own copies of these collections. At this time, _id indexes are also built. The clone process only copies valid data, omitting invalid documents.

2: Applies all changes to the data set. Using the oplog from the source, the mongod updates its data set to reflect the current state of the replica set.

3: Builds all indexes on all collections (except _id indexes, which were already completed).

When the mongod finishes building all index builds, the member can transition to a normal state, i.e. secondary.
When MongoDB performs initial sync, it may copy from the primary or from a secondary; it follows a nearest-first rule and picks the closest member (by ping time) as the sync source.
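The "nearest by ping" rule mirrors getMemberToSyncTo in the source quoted below, which, stripped of its veto, slaveDelay, and staleness handling, reduces to something like:

```python
# Simplified sync-source choice: among healthy members that are ahead
# of us, take the one with the lowest ping time. (Illustrative model;
# the real code also checks vetoes, slaveDelay, and oplog staleness.)

def choose_sync_source(me, members):
    candidates = [m for m in members
                  if m["health"] == 1 and m["optime"] > me["optime"]]
    return min(candidates, key=lambda m: m["ping_ms"], default=None)

me = {"optime": 100}
members = [
    {"name": "148:27017", "health": 1, "optime": 200, "ping_ms": 5},
    {"name": "135:27017", "health": 1, "optime": 200, "ping_ms": 1},
    {"name": "134:27017", "health": 0, "optime": 200, "ping_ms": 0},
]
print(choose_sync_source(me, members)["name"])  # 135:27017 -- lowest healthy ping
```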
You can also specify explicitly which member to sync from:
db.adminCommand( { replSetSyncFrom: "[hostname]:[port]" } )
or
rs.syncFrom("[hostname]:[port]")
Source code for initial sync: http://dl.mongodb.org/dl/src/
C:\Users\John\Desktop\mongodb-src-r2.6.3\src\mongo\db\repl\rs_initialsync.cpp
rs_initialsync.cpp
/**
 * Copyright (C) 2008 10gen Inc.
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU Affero General Public License, version 3,
 * as published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Affero General Public License for more details.
 *
 * You should have received a copy of the GNU Affero General Public License
 * along with this program. If not, see <http://www.gnu.org/licenses/>.
 *
 * As a special exception, the copyright holders give permission to link the
 * code of portions of this program with the OpenSSL library under certain
 * conditions as described in each individual source file and distribute
 * linked combinations including the program with the OpenSSL library. You
 * must comply with the GNU Affero General Public License in all respects for
 * all of the code used other than as permitted herein. If you modify file(s)
 * with this exception, you may extend this exception to your version of the
 * file(s), but you are not obligated to do so. If you do not wish to do so,
 * delete this exception statement from your version. If you delete this
 * exception statement from all source files in the program, then also delete
 * it in the license file.
 */

#include "mongo/pch.h"
#include "mongo/db/repl/rs.h"
#include "mongo/db/auth/authorization_manager.h"
#include "mongo/db/auth/authorization_manager_global.h"
#include "mongo/db/client.h"
#include "mongo/db/cloner.h"
#include "mongo/db/dbhelpers.h"
#include "mongo/db/repl/bgsync.h"
#include "mongo/db/repl/oplog.h"
#include "mongo/db/repl/oplogreader.h"
#include "mongo/bson/optime.h"
#include "mongo/db/repl/replication_server_status.h"  // replSettings
#include "mongo/db/repl/rs_sync.h"
#include "mongo/util/mongoutils/str.h"

namespace mongo {

    using namespace mongoutils;
    using namespace bson;

    void dropAllDatabasesExceptLocal();

    // add try/catch with sleep
    void isyncassert(const string& msg, bool expr) {
        if( !expr ) {
            string m = str::stream() << "initial sync " << msg;
            theReplSet->sethbmsg(m, 0);
            uasserted(13404, m);
        }
    }

    void ReplSetImpl::syncDoInitialSync() {
        static const int maxFailedAttempts = 10;
        createOplog();
        int failedAttempts = 0;
        while ( failedAttempts < maxFailedAttempts ) {
            try {
                _syncDoInitialSync();
                break;
            }
            catch(DBException& e) {
                failedAttempts++;
                str::stream msg;
                msg << "initial sync exception: ";
                msg << e.toString() << " "
                    << (maxFailedAttempts - failedAttempts) << " attempts remaining";
                sethbmsg(msg, 0);
                sleepsecs(30);
            }
        }
        fassert( 16233, failedAttempts < maxFailedAttempts);
    }

    bool ReplSetImpl::_syncDoInitialSync_clone(Cloner& cloner, const char *master,
                                               const list<string>& dbs, bool dataPass) {
        for( list<string>::const_iterator i = dbs.begin(); i != dbs.end(); i++ ) {
            string db = *i;
            if( db == "local" )
                continue;

            if ( dataPass )
                sethbmsg( str::stream() << "initial sync cloning db: " << db , 0);
            else
                sethbmsg( str::stream() << "initial sync cloning indexes for : " << db , 0);

            Client::WriteContext ctx(db);

            string err;
            int errCode;
            CloneOptions options;
            options.fromDB = db;
            options.logForRepl = false;
            options.slaveOk = true;
            options.useReplAuth = true;
            options.snapshot = false;
            options.mayYield = true;
            options.mayBeInterrupted = false;
            options.syncData = dataPass;
            options.syncIndexes = ! dataPass;

            if (!cloner.go(ctx.ctx(), master, options, NULL, err, &errCode)) {
                sethbmsg(str::stream() << "initial sync: error while "
                                       << (dataPass ? "cloning " : "indexing ") << db
                                       << ". " << (err.empty() ? "" : err + ". ")
                                       << "sleeping 5 minutes" ,0);
                return false;
            }
        }
        return true;
    }

    void _logOpObjRS(const BSONObj& op);

    static void emptyOplog() {
        Client::WriteContext ctx(rsoplog);
        Collection* collection = ctx.ctx().db()->getCollection(rsoplog);

        // temp
        if( collection->numRecords() == 0 )
            return; // already empty, ok.

        LOG(1) << "replSet empty oplog" << rsLog;
        collection->details()->emptyCappedCollection(rsoplog);
    }

    bool Member::syncable() const {
        bool buildIndexes = theReplSet ? theReplSet->buildIndexes() : true;
        return hbinfo().up() && (config().buildIndexes || !buildIndexes) && state().readable();
    }

    const Member* ReplSetImpl::getMemberToSyncTo() {
        lock lk(this);

        // if we have a target we've requested to sync from, use it
        if (_forceSyncTarget) {
            Member* target = _forceSyncTarget;
            _forceSyncTarget = 0;
            sethbmsg( str::stream() << "syncing to: " << target->fullName() << " by request", 0);
            return target;
        }

        const Member* primary = box.getPrimary();

        // wait for 2N pings before choosing a sync target
        if (_cfg) {
            int needMorePings = config().members.size()*2 - HeartbeatInfo::numPings;

            if (needMorePings > 0) {
                OCCASIONALLY log() << "waiting for " << needMorePings
                                   << " pings from other members before syncing" << endl;
                return NULL;
            }

            // If we are only allowed to sync from the primary, return that
            if (!_cfg->chainingAllowed()) {
                // Returns NULL if we cannot reach the primary
                return primary;
            }
        }

        // find the member with the lowest ping time that has more data than me

        // Find primary's oplog time. Reject sync candidates that are more than
        // maxSyncSourceLagSecs seconds behind.
        OpTime primaryOpTime;
        if (primary)
            primaryOpTime = primary->hbinfo().opTime;
        else
            // choose a time that will exclude no candidates, since we don't see a primary
            primaryOpTime = OpTime(maxSyncSourceLagSecs, 0);

        if (primaryOpTime.getSecs() < static_cast<unsigned int>(maxSyncSourceLagSecs)) {
            // erh - I think this means there was just a new election
            // and we don't yet know the new primary's optime
            primaryOpTime = OpTime(maxSyncSourceLagSecs, 0);
        }

        OpTime oldestSyncOpTime(primaryOpTime.getSecs() - maxSyncSourceLagSecs, 0);

        Member *closest = 0;
        time_t now = 0;

        // Make two attempts. The first attempt, we ignore those nodes with
        // slave delay higher than our own. The second attempt includes such
        // nodes, in case those are the only ones we can reach.
        // This loop attempts to set 'closest'.
        for (int attempts = 0; attempts < 2; ++attempts) {
            for (Member *m = _members.head(); m; m = m->next()) {
                if (!m->syncable())
                    continue;

                if (m->state() == MemberState::RS_SECONDARY) {
                    // only consider secondaries that are ahead of where we are
                    if (m->hbinfo().opTime <= lastOpTimeWritten)
                        continue;
                    // omit secondaries that are excessively behind, on the first attempt at least.
                    if (attempts == 0 &&
                        m->hbinfo().opTime < oldestSyncOpTime)
                        continue;
                }

                // omit nodes that are more latent than anything we've already considered
                if (closest &&
                    (m->hbinfo().ping > closest->hbinfo().ping))
                    continue;

                if (attempts == 0 &&
                    (myConfig().slaveDelay < m->config().slaveDelay || m->config().hidden)) {
                    continue; // skip this one in the first attempt
                }

                map<string,time_t>::iterator vetoed = _veto.find(m->fullName());
                if (vetoed != _veto.end()) {
                    // Do some veto housekeeping
                    if (now == 0) {
                        now = time(0);
                    }

                    // if this was on the veto list, check if it was vetoed in the last "while".
                    // if it was, skip.
                    if (vetoed->second >= now) {
                        if (time(0) % 5 == 0) {
                            log() << "replSet not trying to sync from " << (*vetoed).first
                                  << ", it is vetoed for " << ((*vetoed).second - now)
                                  << " more seconds" << rsLog;
                        }
                        continue;
                    }
                    _veto.erase(vetoed);
                    // fall through, this is a valid candidate now
                }
                // This candidate has passed all tests; set 'closest'
                closest = m;
            }
            if (closest) break; // no need for second attempt
        }

        if (!closest) {
            return NULL;
        }

        sethbmsg( str::stream() << "syncing to: " << closest->fullName(), 0);

        return closest;
    }

    void ReplSetImpl::veto(const string& host, const unsigned secs) {
        lock lk(this);
        _veto[host] = time(0)+secs;
    }

    /**
     * Replays the sync target's oplog from lastOp to the latest op on the sync target.
     *
     * @param syncer either initial sync (can reclone missing docs) or "normal" sync (no recloning)
     * @param r      the oplog reader
     * @param source the sync target
     * @param lastOp the op to start syncing at. replset::InitialSync writes this and then moves to
     *               the queue. replset::SyncTail does not write this, it moves directly to the
     *               queue.
     * @param minValid populated by this function. The most recent op on the sync target's oplog,
     *                 this function syncs to this value (inclusive)
     * @return if applying the oplog succeeded
     */
    bool ReplSetImpl::_syncDoInitialSync_applyToHead( replset::SyncTail& syncer, OplogReader* r,
                                                      const Member* source, const BSONObj& lastOp ,
                                                      BSONObj& minValid ) {
        /* our cloned copy will be strange until we apply oplog events that occurred
           through the process. we note that time point here. */

        try {
            // It may have been a long time since we last used this connection to
            // query the oplog, depending on the size of the databases we needed to clone.
            // A common problem is that TCP keepalives are set too infrequent, and thus
            // our connection here is terminated by a firewall due to inactivity.
            // Solution is to increase the TCP keepalive frequency.
            minValid = r->getLastOp(rsoplog);
        } catch ( SocketException & ) {
            log() << "connection lost to " << source->h().toString()
                  << "; is your tcp keepalive interval set appropriately?";
            if( !r->connect(source->h().toString()) ) {
                sethbmsg( str::stream() << "initial sync couldn't connect to "
                                        << source->h().toString() , 0);
                throw;
            }
            // retry
            minValid = r->getLastOp(rsoplog);
        }

        isyncassert( "getLastOp is empty ", !minValid.isEmpty() );

        OpTime mvoptime = minValid["ts"]._opTime();
        verify( !mvoptime.isNull() );

        OpTime startingTS = lastOp["ts"]._opTime();
        verify( mvoptime >= startingTS );

        // apply startingTS..mvoptime portion of the oplog
        {
            try {
                minValid = syncer.oplogApplication(lastOp, minValid);
            }
            catch (const DBException&) {
                log() << "replSet initial sync failed during oplog application phase" << rsLog;

                emptyOplog(); // otherwise we'll be up!

                lastOpTimeWritten = OpTime();
                lastH = 0;

                log() << "replSet cleaning up [1]" << rsLog;
                {
                    Client::WriteContext cx( "local." );
                    cx.ctx().db()->flushFiles(true);
                }
                log() << "replSet cleaning up [2]" << rsLog;

                log() << "replSet initial sync failed will try again" << endl;

                sleepsecs(5);
                return false;
            }
        }

        return true;
    }

    /**
     * Do the initial sync for this member. There are several steps to this process:
     *
     *     0. Add _initialSyncFlag to minValid to tell us to restart initial sync if we
     *        crash in the middle of this procedure
     *     1. Record start time.
     *     2. Clone.
     *     3. Set minValid1 to sync target's latest op time.
     *     4. Apply ops from start to minValid1, fetching missing docs as needed.
     *     5. Set minValid2 to sync target's latest op time.
     *     6. Apply ops from minValid1 to minValid2.
     *     7. Build indexes.
     *     8. Set minValid3 to sync target's latest op time.
     *     9. Apply ops from minValid2 to minValid3.
     *    10. Clean up minValid and remove _initialSyncFlag field
     *
     * At that point, initial sync is finished. Note that the oplog from the sync target is applied
     * three times: step 4, 6, and 8. 4 may involve refetching, 6 should not.
     * By the end of 6, this member should have consistent data. 8 is "cosmetic," it is only to
     * get this member closer to the latest op time before it can transition to secondary state.
     */
    void ReplSetImpl::_syncDoInitialSync() {
        replset::InitialSync init(replset::BackgroundSync::get());
        replset::SyncTail tail(replset::BackgroundSync::get());
        sethbmsg("initial sync pending",0);

        // if this is the first node, it may have already become primary
        if ( box.getState().primary() ) {
            sethbmsg("I'm already primary, no need for initial sync",0);
            return;
        }

        const Member *source = getMemberToSyncTo();
        if (!source) {
            sethbmsg("initial sync need a member to be primary or secondary to do our initial sync", 0);
            sleepsecs(15);
            return;
        }

        string sourceHostname = source->h().toString();
        init.setHostname(sourceHostname);
        OplogReader r;
        if( !r.connect(sourceHostname) ) {
            sethbmsg( str::stream() << "initial sync couldn't connect to "
                                    << source->h().toString() , 0);
            sleepsecs(15);
            return;
        }

        BSONObj lastOp = r.getLastOp(rsoplog);
        if( lastOp.isEmpty() ) {
            sethbmsg("initial sync couldn't read remote oplog", 0);
            sleepsecs(15);
            return;
        }

        // written by applyToHead calls
        BSONObj minValid;

        if (replSettings.fastsync) {
            log() << "fastsync: skipping database clone" << rsLog;

            // prime oplog
            init.oplogApplication(lastOp, lastOp);
            return;
        }
        else {
            // Add field to minvalid document to tell us to restart initial sync if we crash
            theReplSet->setInitialSyncFlag();

            sethbmsg("initial sync drop all databases", 0);
            dropAllDatabasesExceptLocal();

            sethbmsg("initial sync clone all databases", 0);

            list<string> dbs = r.conn()->getDatabaseNames();

            Cloner cloner;
            if (!_syncDoInitialSync_clone(cloner, sourceHostname.c_str(), dbs, true)) {
                veto(source->fullName(), 600);
                sleepsecs(300);
                return;
            }

            sethbmsg("initial sync data copy, starting syncup",0);

            log() << "oplog sync 1 of 3" << endl;
            if ( ! _syncDoInitialSync_applyToHead( init, &r , source , lastOp , minValid ) ) {
                return;
            }

            lastOp = minValid;

            // Now we sync to the latest op on the sync target _again_, as we may have recloned ops
            // that were "from the future" compared with minValid. During this second application,
            // nothing should need to be recloned.
            log() << "oplog sync 2 of 3" << endl;
            if (!_syncDoInitialSync_applyToHead(tail, &r , source , lastOp , minValid)) {
                return;
            }
            // data should now be consistent

            lastOp = minValid;

            sethbmsg("initial sync building indexes",0);
            if (!_syncDoInitialSync_clone(cloner, sourceHostname.c_str(), dbs, false)) {
                veto(source->fullName(), 600);
                sleepsecs(300);
                return;
            }
        }

        log() << "oplog sync 3 of 3" << endl;
        if (!_syncDoInitialSync_applyToHead(tail, &r, source, lastOp, minValid)) {
            return;
        }

        // ---------

        Status status = getGlobalAuthorizationManager()->initialize();
        if (!status.isOK()) {
            warning() << "Failed to reinitialize auth data after initial sync. " << status;
            return;
        }

        sethbmsg("initial sync finishing up",0);

        verify( !box.getState().primary() ); // wouldn't make sense if we were.

        {
            Client::WriteContext cx( "local." );
            cx.ctx().db()->flushFiles(true);
            try {
                log() << "replSet set minValid=" << minValid["ts"]._opTime().toString() << rsLog;
            }
            catch(...) { }

            // Initial sync is now complete. Flag this by setting minValid to the last thing
            // we synced.
            theReplSet->setMinValid(minValid);

            // Clear the initial sync flag.
            theReplSet->clearInitialSyncFlag();

            cx.ctx().db()->flushFiles(true);
        }
        {
            boost::unique_lock<boost::mutex> lock(theReplSet->initialSyncMutex);
            theReplSet->initialSyncRequested = false;
        }

        // If we just cloned & there were no ops applied, we still want the primary to know where
        // we're up to
        replset::BackgroundSync::notify();

        changeState(MemberState::RS_RECOVERING);
        sethbmsg("initial sync done",0);
    }
}