A postmortem on data loss in a MongoDB cluster. I had set up a PSA (one primary, one secondary, one arbiter) MongoDB replica set, and while testing high availability QA reported that data was being lost.
In QA's own words: kill the secondary -> insert data -> kill the primary -> start the original primary -> start the original secondary -> data lost. The catch was that she could reproduce it 100% of the time, and easily, while I could not reproduce it at all.
Spoiler first: what QA actually demonstrated to me was: kill the secondary -> insert data -> kill the primary -> start the original secondary -> start the original primary -> data lost.
The reason is simple. An arbiter can never be promoted to primary, so once both data-bearing nodes are down, whichever one starts first (here the original secondary) gets elected primary, and the original primary, started later, rejoins as a secondary. Replication only flows from the primary to secondaries, so the write that existed only on the original primary is rolled back and lost.
Simulating a PSA (one primary, one secondary, one arbiter) MongoDB replica set locally
MongoDB version: mongodb-community 6.0.4
| Primary | Secondary | Arbiter |
| --- | --- | --- |
| localhost:27017 | localhost:37017 | localhost:47017 |
37017.conf and 47017.conf are identical to 27017.conf, with every 27017 replaced by 37017 / 47017 (see the sed sketch after the config).
systemLog:
  destination: file
  path: /tmp/log/mongodb/27017.log
  logAppend: true
storage:
  dbPath: /tmp/27017
net:
  port: 27017
  bindIp: 127.0.0.1, ::1
  ipv6: true
replication:
  replSetName: rs0
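To avoid editing the other two files by hand, they can be generated from 27017.conf, for example (a simple substitution sketch; any equivalent approach works):
# generate 37017.conf and 47017.conf from 27017.conf
sed 's/27017/37017/g' 27017.conf > 37017.conf
sed 's/27017/47017/g' 27017.conf > 47017.conf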
# Create the data and log directories for all three instances
mkdir -p /tmp/27017 /tmp/37017 /tmp/47017
mkdir -p /tmp/log/mongodb
# Start the three instances
mongod -f 27017.conf
mongod -f 37017.conf
mongod -f 47017.conf
First connect to 27017: mongosh --port 27017
rs.initiate({_id: "rs0", members: [{ _id: 0 , host: "localhost:27017" }]}) // after initiation this instance becomes the primary
rs.add("localhost:37017") // add the secondary
db.adminCommand({ // set the cluster-wide default read/write concern (required before rs.addArb() on MongoDB 5.3+)
  "setDefaultRWConcern" : 1,
  "defaultWriteConcern" : {
    "w" : 1
  },
  "defaultReadConcern" : { "level" : "local" }
})
rs.addArb("localhost:47017") // add the arbiter
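A quick sanity check (not part of the original walkthrough, but it illustrates the root cause above): the arbiter is added with priority 0 and arbiterOnly: true, which is why it can never be elected primary and holds no data:
// list each member's election-related settings
rs.conf().members.map(m => ({ host: m.host, priority: m.priority, arbiterOnly: m.arbiterOnly }))
// expected: localhost:47017 shows { priority: 0, arbiterOnly: true }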
Kill the secondary -> insert data -> kill the primary -> start the original primary -> start the original secondary -> data lost (not reproduced)
Kill the secondary, then connect to the cluster with mongosh mongodb://localhost:27017,localhost:37017,localhost:47017 and check the cluster state.
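("Kill" here simply means terminating the mongod process; a minimal sketch, assuming each instance was started from its own config file as above:)
pkill -f '37017.conf'      # SIGTERM, graceful shutdown
pkill -9 -f '37017.conf'   # or SIGKILL, to simulate a hard crash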
rs.status().members;
[
{
_id: 0,
name: 'localhost:27017',
health: 1,
state: 1,
stateStr: 'PRIMARY',
uptime: 2176,
optime: { ts: Timestamp({ t: 1681018937, i: 1 }), t: Long("1") },
optimeDate: ISODate("2023-04-09T05:42:17.000Z"),
lastAppliedWallTime: ISODate("2023-04-09T05:42:17.943Z"),
lastDurableWallTime: ISODate("2023-04-09T05:42:17.943Z"),
syncSourceHost: '',
syncSourceId: -1,
infoMessage: '',
electionTime: Timestamp({ t: 1681016897, i: 1 }),
electionDate: ISODate("2023-04-09T05:08:17.000Z"),
configVersion: 4,
configTerm: 1,
self: true,
lastHeartbeatMessage: ''
},
{
_id: 1,
name: 'localhost:37017',
health: 0,
state: 8,
stateStr: '(not reachable/healthy)',
uptime: 0,
optime: { ts: Timestamp({ t: 0, i: 0 }), t: Long("-1") },
optimeDurable: { ts: Timestamp({ t: 0, i: 0 }), t: Long("-1") },
optimeDate: ISODate("1970-01-01T00:00:00.000Z"),
optimeDurableDate: ISODate("1970-01-01T00:00:00.000Z"),
lastAppliedWallTime: ISODate("2023-04-09T05:37:17.829Z"),
lastDurableWallTime: ISODate("2023-04-09T05:37:17.829Z"),
lastHeartbeat: ISODate("2023-04-09T05:42:18.141Z"),
lastHeartbeatRecv: ISODate("2023-04-09T05:37:23.310Z"),
pingMs: Long("0"),
lastHeartbeatMessage: 'Error connecting to localhost:37017 (::1:37017) :: caused by :: Connection refused',
syncSourceHost: '',
syncSourceId: -1,
infoMessage: '',
configVersion: 4,
configTerm: 1
},
{
_id: 2,
name: 'localhost:47017',
health: 1,
state: 7,
stateStr: 'ARBITER',
uptime: 781,
lastHeartbeat: ISODate("2023-04-09T05:42:17.620Z"),
lastHeartbeatRecv: ISODate("2023-04-09T05:42:17.619Z"),
pingMs: Long("0"),
lastHeartbeatMessage: '',
syncSourceHost: '',
syncSourceId: -1,
infoMessage: '',
configVersion: 4,
configTerm: 1
}
]
Insert data
db.col.insert({"name": "张三"})
{
acknowledged: true,
insertedIds: { '0': ObjectId("643252839561b614ad68d8b9") }
}
Kill the primary as well, then confirm only the arbiter is left running: ps -ef | grep mongod
501 22005 57723 0 1:38下午 ttys001 0:03.94 mongosh mongodb://localhost:27017,localhost:37017,localhost:47017/?serverSelectionTimeou TERM_PROGRAM=Apple_Terminal SHELL=/bin/zsh
0 30435 99065 0 1:53下午 ttys007 0:00.01 grep mongod
0 11804 9349 0 1:21下午 ttys008 0:18.77 bin/mongod -f conf/47017.conf
Start the original primary and then the original secondary, in that order, and confirm the cluster state
rs.status().members;
[
{
_id: 0,
name: 'localhost:27017',
health: 1,
state: 1,
stateStr: 'PRIMARY',
uptime: 40,
optime: { ts: Timestamp({ t: 1681019814, i: 1 }), t: Long("3") },
optimeDate: ISODate("2023-04-09T05:56:54.000Z"),
lastAppliedWallTime: ISODate("2023-04-09T05:56:54.970Z"),
lastDurableWallTime: ISODate("2023-04-09T05:56:54.970Z"),
syncSourceHost: '',
syncSourceId: -1,
infoMessage: '',
electionTime: Timestamp({ t: 1681019794, i: 1 }),
electionDate: ISODate("2023-04-09T05:56:34.000Z"),
configVersion: 4,
configTerm: 3,
self: true,
lastHeartbeatMessage: ''
},
{
_id: 1,
name: 'localhost:37017',
health: 1,
state: 2,
stateStr: 'SECONDARY',
uptime: 17,
optime: { ts: Timestamp({ t: 1681019814, i: 1 }), t: Long("3") },
optimeDurable: { ts: Timestamp({ t: 1681019814, i: 1 }), t: Long("3") },
optimeDate: ISODate("2023-04-09T05:56:54.000Z"),
optimeDurableDate: ISODate("2023-04-09T05:56:54.000Z"),
lastAppliedWallTime: ISODate("2023-04-09T05:56:54.970Z"),
lastDurableWallTime: ISODate("2023-04-09T05:56:54.970Z"),
lastHeartbeat: ISODate("2023-04-09T05:57:01.702Z"),
lastHeartbeatRecv: ISODate("2023-04-09T05:57:02.630Z"),
pingMs: Long("0"),
lastHeartbeatMessage: '',
syncSourceHost: 'localhost:27017',
syncSourceId: 0,
infoMessage: '',
configVersion: 4,
configTerm: 3
},
{
_id: 2,
name: 'localhost:47017',
health: 1,
state: 7,
stateStr: 'ARBITER',
uptime: 38,
lastHeartbeat: ISODate("2023-04-09T05:57:02.981Z"),
lastHeartbeatRecv: ISODate("2023-04-09T05:57:02.986Z"),
pingMs: Long("0"),
lastHeartbeatMessage: '',
syncSourceHost: '',
syncSourceId: -1,
infoMessage: '',
configVersion: 4,
configTerm: 3
}
]
Query the data: the document written earlier is still there
db.col.findOne();
{ _id: ObjectId("643252839561b614ad68d8b9"), name: '张三' }
Kill the secondary -> insert data -> kill the primary -> start the original secondary -> start the original primary -> data lost (reproduced)
Kill the secondary, then connect to the cluster with mongosh mongodb://localhost:27017,localhost:37017,localhost:47017 and check the cluster state.
rs.status().members;
[
{
_id: 0,
name: 'localhost:27017',
health: 1,
state: 1,
stateStr: 'PRIMARY',
uptime: 192,
optime: { ts: Timestamp({ t: 1681019975, i: 1 }), t: Long("3") },
optimeDate: ISODate("2023-04-09T05:59:35.000Z"),
lastAppliedWallTime: ISODate("2023-04-09T05:59:35.034Z"),
lastDurableWallTime: ISODate("2023-04-09T05:59:35.034Z"),
syncSourceHost: '',
syncSourceId: -1,
infoMessage: '',
electionTime: Timestamp({ t: 1681019794, i: 1 }),
electionDate: ISODate("2023-04-09T05:56:34.000Z"),
configVersion: 4,
configTerm: 3,
self: true,
lastHeartbeatMessage: ''
},
{
_id: 1,
name: 'localhost:37017',
health: 0,
state: 8,
stateStr: '(not reachable/healthy)',
uptime: 0,
optime: { ts: Timestamp({ t: 0, i: 0 }), t: Long("-1") },
optimeDurable: { ts: Timestamp({ t: 0, i: 0 }), t: Long("-1") },
optimeDate: ISODate("1970-01-01T00:00:00.000Z"),
optimeDurableDate: ISODate("1970-01-01T00:00:00.000Z"),
lastAppliedWallTime: ISODate("2023-04-09T05:59:25.029Z"),
lastDurableWallTime: ISODate("2023-04-09T05:59:25.029Z"),
lastHeartbeat: ISODate("2023-04-09T05:59:33.869Z"),
lastHeartbeatRecv: ISODate("2023-04-09T05:59:32.804Z"),
pingMs: Long("0"),
lastHeartbeatMessage: 'Error connecting to localhost:37017 (::1:37017) :: caused by :: Connection refused',
syncSourceHost: '',
syncSourceId: -1,
infoMessage: '',
configVersion: 4,
configTerm: 3
},
{
_id: 2,
name: 'localhost:47017',
health: 1,
state: 7,
stateStr: 'ARBITER',
uptime: 190,
lastHeartbeat: ISODate("2023-04-09T05:59:35.145Z"),
lastHeartbeatRecv: ISODate("2023-04-09T05:59:35.145Z"),
pingMs: Long("0"),
lastHeartbeatMessage: '',
syncSourceHost: '',
syncSourceId: -1,
infoMessage: '',
configVersion: 4,
configTerm: 3
}
]
Insert data
db.col.insert({"name": "李四"})
{
acknowledged: true,
insertedIds: { '0': ObjectId("643254789561b614ad68d8ba") }
}
Kill the primary as well, then confirm only the arbiter is left running: ps -ef | grep mongod
501 22005 57723 0 1:38下午 ttys001 0:04.96 mongosh mongodb://localhost:27017,localhost:37017,localhost:47017/?serverSelectionTimeou TERM_PROGRAM=Apple_Terminal SHELL=/bin/zsh
0 35004 94552 0 2:01下午 ttys002 0:00.00 grep mongod
0 11804 9349 0 1:21下午 ttys008 0:23.09 bin/mongod -f conf/47017.conf
This time start the original secondary first and check the cluster state: together with the arbiter it holds a voting majority (2 of 3), so it gets elected primary
rs.status().members;
[
{
_id: 0,
name: 'localhost:27017',
health: 0,
state: 8,
stateStr: '(not reachable/healthy)',
uptime: 0,
optime: { ts: Timestamp({ t: 0, i: 0 }), t: Long("-1") },
optimeDurable: { ts: Timestamp({ t: 0, i: 0 }), t: Long("-1") },
optimeDate: ISODate("1970-01-01T00:00:00.000Z"),
optimeDurableDate: ISODate("1970-01-01T00:00:00.000Z"),
lastAppliedWallTime: ISODate("1970-01-01T00:00:00.000Z"),
lastDurableWallTime: ISODate("1970-01-01T00:00:00.000Z"),
lastHeartbeat: ISODate("2023-04-09T06:03:41.628Z"),
lastHeartbeatRecv: ISODate("1970-01-01T00:00:00.000Z"),
pingMs: Long("0"),
lastHeartbeatMessage: 'Error connecting to localhost:27017 (::1:27017) :: caused by :: Connection refused',
syncSourceHost: '',
syncSourceId: -1,
infoMessage: '',
configVersion: -1,
configTerm: -1
},
{
_id: 1,
name: 'localhost:37017',
health: 1,
state: 1,
stateStr: 'PRIMARY',
uptime: 32,
optime: { ts: Timestamp({ t: 1681020221, i: 1 }), t: Long("4") },
optimeDate: ISODate("2023-04-09T06:03:41.000Z"),
lastAppliedWallTime: ISODate("2023-04-09T06:03:41.569Z"),
lastDurableWallTime: ISODate("2023-04-09T06:03:41.569Z"),
syncSourceHost: '',
syncSourceId: -1,
infoMessage: '',
electionTime: Timestamp({ t: 1681020201, i: 1 }),
electionDate: ISODate("2023-04-09T06:03:21.000Z"),
configVersion: 4,
configTerm: 4,
self: true,
lastHeartbeatMessage: ''
},
{
_id: 2,
name: 'localhost:47017',
health: 1,
state: 7,
stateStr: 'ARBITER',
uptime: 31,
lastHeartbeat: ISODate("2023-04-09T06:03:41.581Z"),
lastHeartbeatRecv: ISODate("2023-04-09T06:03:41.594Z"),
pingMs: Long("0"),
lastHeartbeatMessage: '',
syncSourceHost: '',
syncSourceId: -1,
infoMessage: '',
configVersion: 4,
configTerm: 4
}
]
Query the data. Because the secondary was already down when the document was inserted, it never replicated that write, so the data is not there
db.col.findOne({"name": "李四"})
null
Now start the original primary and check the cluster state again: the original primary has rejoined as a secondary
[
{
_id: 0,
name: 'localhost:27017',
health: 1,
state: 2,
stateStr: 'SECONDARY',
uptime: 5,
optime: { ts: Timestamp({ t: 1681020431, i: 1 }), t: Long("4") },
optimeDurable: { ts: Timestamp({ t: 1681020431, i: 1 }), t: Long("4") },
optimeDate: ISODate("2023-04-09T06:07:11.000Z"),
optimeDurableDate: ISODate("2023-04-09T06:07:11.000Z"),
lastAppliedWallTime: ISODate("2023-04-09T06:07:11.642Z"),
lastDurableWallTime: ISODate("2023-04-09T06:07:11.642Z"),
lastHeartbeat: ISODate("2023-04-09T06:07:18.324Z"),
lastHeartbeatRecv: ISODate("2023-04-09T06:07:18.583Z"),
pingMs: Long("0"),
lastHeartbeatMessage: '',
syncSourceHost: 'localhost:37017',
syncSourceId: 1,
infoMessage: '',
configVersion: 4,
configTerm: 4
},
{
_id: 1,
name: 'localhost:37017',
health: 1,
state: 1,
stateStr: 'PRIMARY',
uptime: 251,
optime: { ts: Timestamp({ t: 1681020431, i: 1 }), t: Long("4") },
optimeDate: ISODate("2023-04-09T06:07:11.000Z"),
lastAppliedWallTime: ISODate("2023-04-09T06:07:11.642Z"),
lastDurableWallTime: ISODate("2023-04-09T06:07:11.642Z"),
syncSourceHost: '',
syncSourceId: -1,
infoMessage: '',
electionTime: Timestamp({ t: 1681020201, i: 1 }),
electionDate: ISODate("2023-04-09T06:03:21.000Z"),
configVersion: 4,
configTerm: 4,
self: true,
lastHeartbeatMessage: ''
},
{
_id: 2,
name: 'localhost:47017',
health: 1,
state: 7,
stateStr: 'ARBITER',
uptime: 250,
lastHeartbeat: ISODate("2023-04-09T06:07:19.821Z"),
lastHeartbeatRecv: ISODate("2023-04-09T06:07:19.821Z"),
pingMs: Long("0"),
lastHeartbeatMessage: '',
syncSourceHost: '',
syncSourceId: -1,
infoMessage: '',
configVersion: 4,
configTerm: 4
}
]
Connect to the original primary with mongosh mongodb://localhost:27017 and query it: the data that had been written to it while it was primary is gone as well
rs.secondaryOk() // allow reads on a secondary
db.col.findOne({"name": "李四"})
null
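The document is not necessarily gone from disk entirely. By default (createRollbackDataFiles: true) MongoDB writes rolled-back documents to BSON files under the old primary's dbPath; a hedged sketch of where to look on this setup (test.col is assumed, since the insert went to the default test database):
ls /tmp/27017/rollback/
bsondump /tmp/27017/rollback/test.col/removed.*.bson   # bsondump comes from mongodb-database-tools
From the replica set's point of view, though, the acknowledged write is lost.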
When I first heard that the MongoDB cluster was losing data I found it hard to believe: a non-in-memory database like this normally relies on a WAL (journal) for durability.
So I went to the official docs and consulted FAQ: MongoDB Storage:
Starting in version 3.6, MongoDB configures WiredTiger to create checkpoints (i.e. write the snapshot data to disk) at intervals of 60 seconds
The WiredTiger journal persists all data modifications between checkpoints. If MongoDB exits between checkpoints, it uses the journal to replay all data modified since the last checkpoint
At every 100 milliseconds (See storage.journal.commitIntervalMs)
In short, MongoDB checkpoints data to disk every 60 s by default; changes between checkpoints are protected by the journal, which is replayed on startup, and the journal itself is flushed every 100 ms by default.
So scenario one can indeed lose data in theory, but it is hardly something you can reproduce by hand: the insert and the kill -9 of the primary would have to land within the same journal-commit window (at most 100 ms, 50 ms on average), and I do not believe that kind of hand speed can be achieved manually, let alone reproduced 100% of the time.
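One last hedged note, not from the original test: the default write concern we set at the beginning was w: 1, and that is exactly what lets a lone primary acknowledge a write that is later rolled back. With a majority write concern, the insert in the reproduced scenario would not have been acknowledged at all, since only the primary and the arbiter were up, for example:
// sketch only: with the secondary down, a PSA set cannot satisfy w: "majority",
// so this insert times out instead of returning an acknowledgement that can later be rolled back
// ("王五" is just a hypothetical document)
db.col.insertOne(
  { name: "王五" },
  { writeConcern: { w: "majority", j: true, wtimeout: 5000 } }
)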