The problem I ran into was as follows:
2023-08-17 20:24:21.566 CST [1556001] LOG: database system was interrupted; last known up at 2023-08-17 20:21:41 CST
2023-08-17 20:24:21.770 CST [1556001] LOG: restored log file "00000009.history" from archive
cp: cannot stat '/home/postgres/pgarch/0000000A.history': No such file or directory
2023-08-17 20:24:21.771 CST [1556001] LOG: entering standby mode
2023-08-17 20:24:21.772 CST [1556001] LOG: restored log file "00000009.history" from archive
cp: cannot stat '/home/postgres/pgarch/000000090000010200000066': No such file or directory
2023-08-17 20:24:21.784 CST [1556001] LOG: restored log file "000000080000010200000066" from archive
2023-08-17 20:24:21.851 CST [1556001] FATAL: requested timeline 9 is not a child of this server's history
2023-08-17 20:24:21.851 CST [1556001] DETAIL: Latest checkpoint is at 102/66000060 on timeline 8, but in the history of the requested timeline, the server forked off from that timeline at 102/580000A0.
2023-08-17 20:24:21.851 CST [1555991] LOG: startup process (PID 1556001) exited with exit code 1
2023-08-17 20:24:21.851 CST [1555991] LOG: aborting startup due to startup process failure
2023-08-17 20:24:21.851 CST [1555991] LOG: database system is shut down
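The FATAL message comes down to comparing two LSNs: the latest checkpoint of the data directory (102/66000060 on timeline 8) and the point where timeline 9 forked off timeline 8 (102/580000A0). A minimal sketch of that comparison follows. The function names are made up for illustration, and the "at or before the fork point" rule is my reading of the message rather than PostgreSQL's actual code:

```shell
# Convert an LSN such as "102/580000A0" into a single integer.
lsn_to_int() {
    hi=${1%/*}   # part before the slash: high 32 bits, hexadecimal
    lo=${1#*/}   # part after the slash: low 32 bits, hexadecimal
    echo $(( (0x$hi << 32) + 0x$lo ))
}

# Succeeds when the checkpoint ($1) is at or before the fork point ($2),
# i.e. when the data directory could still be a parent of the new timeline.
checkpoint_before_fork() {
    [ "$(lsn_to_int "$1")" -le "$(lsn_to_int "$2")" ]
}

# Values taken from the log above.
if checkpoint_before_fork 102/66000060 102/580000A0; then
    echo "checkpoint precedes the fork: timeline 9 can be followed"
else
    echo "checkpoint is past the fork: timeline 9 is not a child of this history"
fi
```

With the values from the log the second branch is taken: the checkpoint on timeline 8 lies after the fork point, so this data directory cannot follow timeline 9.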
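The cause of the above was that the repmgr cluster had ended up with two primaries.
It started when I set shared_preload_libraries = 'pg_stat_statements' on host db206 and tried to restart. The server would not start, because I had not set up the pg_stat_statements extension beforehand; as the log below shows, the library file itself could not be found.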
[postgres@db206 data]$ vi postgresql.conf
[postgres@db206 data]$ pg_ctl restart
waiting for server to shut down...... done
server stopped
waiting for server to start....2023-08-17 18:11:53.086 CST [6497] FATAL: could not access file "pg_stat_statements": No such file or directory
2023-08-17 18:11:53.086 CST [6497] LOG: database system is shut down
stopped waiting
pg_ctl: could not start server
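A failed restart like this one can be avoided by checking that the library file exists before touching a running primary. The helper below is a hypothetical sketch, not part of PostgreSQL or repmgr. It assumes a Linux install where loadable modules are named <library>.so:

```shell
# Hypothetical helper: verify that every library intended for
# shared_preload_libraries exists in the given library directory.
# $1 = library directory, remaining arguments = library names
check_preload_libs() {
    libdir=$1
    shift
    missing=0
    for lib in "$@"; do
        if [ ! -f "$libdir/$lib.so" ]; then
            echo "missing: $libdir/$lib.so" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (pg_config --pkglibdir prints the directory PostgreSQL loads modules from):
# check_preload_libs "$(pg_config --pkglibdir)" pg_stat_statements && pg_ctl restart
```

In a repmgr cluster with automatic failover, a primary that fails to come back within the monitoring window may be replaced by its standby, which is what happened next.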
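At this point I edited postgresql.conf again and removed shared_preload_libraries = 'pg_stat_statements'. The database then started and I tried to create the extension, but by then the standby had already taken over as primary.
I then remembered that I should have set shared_preload_libraries = 'pg_stat_statements' on db223 first (that is, added it on the standby before the primary).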
[postgres@db223 ~]$ vi pg14/data/postgresql.conf
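Now I discovered that there were two primaries (I do not yet know why). The timelines also differed: the new primary was on timeline 9 and the old primary on timeline 8.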
[postgres@db206 data]$ repmgr -f ~/repmgr/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+----------------------+----------+----------+----------+----------+------------------------------------------------------------------------
1 | db223 | standby | ! running as primary | | default | 100 | 9 | host=db223 dbname=repmgr user=repmgr password=repmgr connect_timeout=2
2 | db206 | primary | * running | | default | 100 | 8 | host=db206 dbname=repmgr user=repmgr password=repmgr connect_timeout=2
WARNING: following issues were detected
- node "db223" (ID: 1) is registered as standby but running as primary
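Independently of repmgr's metadata, each node can be asked directly whether it is in recovery: a standby answers t to SELECT pg_is_in_recovery(), a primary answers f. The function below is a small sketch of a dual-primary check built on that. Its name and input format are my own invention:

```shell
# Reads lines of the form "<node> <t|f>" on stdin, where the second field is
# the output of  psql -Atc 'SELECT pg_is_in_recovery()'  for that node.
# Prints every node that is not in recovery and fails if there is more than one.
check_single_primary() {
    primaries=0
    while read -r node in_recovery; do
        if [ "$in_recovery" = "f" ]; then
            echo "primary: $node"
            primaries=$((primaries + 1))
        fi
    done
    [ "$primaries" -le 1 ]
}

# Usage against the two nodes of this cluster:
# for h in db206 db223; do
#     echo "$h $(psql -h "$h" -U repmgr -d repmgr -Atc 'SELECT pg_is_in_recovery()')"
# done | check_single_primary || echo "more than one primary is running"
```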
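I tried to re-register db206 as a standby, starting with standby unregister, but the connection was refused: the database had to be started first.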
[postgres@db206 data]$ repmgr -f /home/postgres/repmgr/repmgr.conf standby unregister
INFO: connecting to local standby
ERROR: connection to database failed
DETAIL:
connection to server at "db206" (172.20.101.206), port 5432 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
DETAIL: attempted to connect using:
user=repmgr password=repmgr connect_timeout=2 dbname=repmgr host=db206 fallback_application_name=repmgr options=-csearch_path=
After starting it I ran the command again, and this time it complained that the node is not a standby.
[postgres@db206 data]$ repmgr -f /home/postgres/repmgr/repmgr.conf standby unregister
INFO: connecting to local standby
INFO: connecting to primary database
ERROR: node 2 is not a standby server
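I then tried to unregister it as a primary instead, but was told that node db223 still has this node as its upstream. The hint says to use "repmgr standby follow" to make sure such nodes are following the current primary.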
[postgres@db206 data]$ repmgr -f /home/postgres/repmgr/repmgr.conf primary unregister
ERROR: 1 other node still has this node as its upstream node
HINT: ensure these nodes are following the current primary with "repmgr standby follow"
DETAIL: the affected node(s) are:
db223 (ID: 1)
Next I tried to rejoin db223 to the cluster and found that this cannot be done on a running node.
[postgres@db223 ~]$ repmgr -f ~/repmgr/repmgr.conf node rejoin -d 'host=db206 port=5432 user=repmgr dbname=repmgr password=repmgr'
ERROR: database is still running in state "in production"
HINT: "repmgr node rejoin" cannot be executed on a running node
After stopping the database I ran it again. This time there was no error, only a notice that pg_rewind is required.
[postgres@db223 ~]$ repmgr -f ~/repmgr/repmgr.conf node rejoin -d 'host=db206 port=5432 user=repmgr dbname=repmgr password=repmgr' -F
NOTICE: rejoin target is node "db206" (ID: 2)
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2
HINT: provide --force-rewind
I restarted db223 and found that it still came up as a primary, which was very frustrating.
pg_ctl start
[postgres@db223 ~]$ repmgr -f ~/repmgr/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
1 | db223 | primary | * running | | default | 100 | 9 | host=db223 dbname=repmgr user=repmgr password=repmgr connect_timeout=2
2 | db206 | primary | ! running | | default | 100 | 8 | host=db206 dbname=repmgr user=repmgr password=repmgr connect_timeout=2WARNING: following issues were detected
- node "db206" (ID: 2) is running but the repmgr node record is inactive
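Would adding the pg_rewind step fix it? It did not: pg_rewind could not read a timeline 9 WAL file. I do not know why it wanted timeline 9; my guess is that the node was still joining as a primary.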
[postgres@db223 ~]$ repmgr -f ~/repmgr/repmgr.conf node rejoin -d 'host=db206 port=5432 user=repmgr dbname=repmgr password=repmgr' --force-rewind
NOTICE: rejoin target is node "db206" (ID: 2)
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "/home/postgres/pg14/bin/pg_rewind -D '/home/postgres/pg14/data' --source-server='host=db206 dbname=repmgr user=repmgr password=repmgr connect_timeout=2'"
ERROR: pg_rewind execution failed
DETAIL: pg_rewind: servers diverged at WAL location 102/580000A0 on timeline 8
pg_rewind: error: could not open file "/home/postgres/pg14/data/pg_wal/000000090000010200000058": No such file or directory
pg_rewind: fatal: could not find previous WAL record at 102/580000A0
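The file name in the pg_rewind error is not arbitrary. It is the WAL segment on db223's own timeline (9) that contains the divergence point 102/580000A0, which pg_rewind needs to read from the node being rewound. The sketch below shows how such a name is derived, assuming the default 16 MB wal_segment_size. The function name is made up for illustration:

```shell
# Build the WAL segment file name for a timeline ID (decimal) and an LSN,
# assuming the default 16 MB segment size.
# $1 = timeline ID, $2 = LSN such as 102/580000A0
wal_segment_name() {
    tli=$1
    hi=$(( 0x${2%/*} ))           # high 32 bits of the LSN
    seg=$(( 0x${2#*/} >> 24 ))    # 16 MB segment number within the low 32 bits
    printf '%08X%08X%08X\n' "$tli" "$hi" "$seg"
}

wal_segment_name 9 102/580000A0   # prints 000000090000010200000058
```

That segment was no longer in db223's pg_wal, presumably because it had already been recycled. pg_rewind in PostgreSQL 13 and later has a -c/--restore-target-wal option that fetches missing segments with restore_command. Whether the segment was still in the archive, and whether it could have been used here, is something I did not verify.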
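The last resort was to wipe the data directory and re-clone. What I deleted was db223's timeline 9 data. The clone completed, but pg_ctl start still failed.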
[postgres@db223 data]$ rm -rf *
[postgres@db223 data]$ ll
total 0
[postgres@db223 data]$ repmgr -h db206 -U repmgr -d repmgr -f /home/postgres/repmgr/repmgr.conf standby clone
NOTICE: destination directory "/home/postgres/pg14/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=db206 user=repmgr dbname=repmgr
DETAIL: current installation size is 12 GB
INFO: replication slot usage not requested; no replication slot will be set up for this standby
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: checking and correcting permissions on existing directory "/home/postgres/pg14/data"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
/home/postgres/pg14/bin/pg_basebackup -l "repmgr base backup" -D /home/postgres/pg14/data -h db206 -p 5432 -U repmgr -X stream
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: pg_ctl -D /home/postgres/pg14/data start
HINT: after starting the server, you need to re-register this standby with "repmgr standby register --force" to update the existing node record
[postgres@db223 data]$ ^C
[postgres@db223 data]$ pg_ctl start
waiting for server to start....2023-08-17 19:48:33.265 CST [1532642] LOG: redirecting log output to logging collector process
2023-08-17 19:48:33.265 CST [1532642] HINT: Future log output will appear in directory "log".
stopped waiting
pg_ctl: could not start server
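The log showed exactly the errors quoted at the beginning of this post: the node still wanted to follow timeline 9, but the primary db206 has no timeline 9 (it is on timeline 8). Frustrating again.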
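I then looked at db223's parameters, wondering whether it was reading from the wrong archive path, and noticed recovery_target_timeline, the parameter that controls which timeline recovery follows: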
archive_mode = on
archive_command = 'scp %p [email protected]:/home/postgres/pgarch/%f'
archive_cleanup_command = 'pg_archivecleanup /home/postgres/pgarch %r'
restore_command = 'cp /home/postgres/pgarch/%f %p'
recovery_target_timeline = 'latest'
After changing it to recovery_target_timeline = 'current', db223 started successfully.
[postgres@db206 ~]$ repmgr -f ~/repmgr/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
1 | db223 | standby | running | db206 | default | 100 | 8 | host=db223 dbname=repmgr user=repmgr password=repmgr connect_timeout=2
2 | db206 | primary | * running | | default | 100 | 8 | host=db206 dbname=repmgr user=repmgr password=repmgr connect_timeout=2
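A likely explanation for why this worked: with recovery_target_timeline = 'latest' (the default since PostgreSQL 12) a standby follows the newest timeline it can find through restore_command. The startup log shows 00000009.history being restored from /home/postgres/pgarch, so the fresh clone of db206 (timeline 8) tried to follow timeline 9. With 'current' it stays on the timeline of the base backup. The sketch below lists the highest timeline that has a history file in an archive directory. The function name is hypothetical:

```shell
# Print the highest timeline ID (in decimal) that has a .history file in the
# archive directory $1, or 0 if there is none.
latest_archived_timeline() {
    latest=0
    for f in "$1"/*.history; do
        [ -e "$f" ] || continue            # glob matched nothing
        name=${f##*/}                      # e.g. 00000009.history
        tli=$(( 0x${name%.history} ))      # hexadecimal timeline ID to decimal
        if [ "$tli" -gt "$latest" ]; then
            latest=$tli
        fi
    done
    echo "$latest"
}

# Usage: compare with the Timeline column that repmgr shows for the primary.
# latest_archived_timeline /home/postgres/pgarch
```

If the archive reports a higher timeline than the current primary, a standby using 'latest' will try to follow it. Note also that a standby left on 'current' will not follow a new timeline after a later failover, so this setting may need revisiting.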
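Summary:
1. I do not yet know why the cluster ended up with two primaries; I still need to reproduce it.
2. I am considering adding a witness node (I am not sure whether it can prevent dual primaries); to be investigated.
3. I know that changing recovery_target_timeline fixed the problem but not why; I will study it when I have time.
4. After changing recovery_target_timeline to current on the standby, does it need to be changed back to latest? (Personally I think not.)
5. I skimmed the blog post below, in which the fix apparently went very smoothly (?):
"repmgr 集群双主问题处理" (handling a repmgr cluster dual-primary problem), 瀚高PG实验室 blog on CSDN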