pg主备切换

1. switchover正常主从切换

1.1 正常切换

官网:https://repmgr.org/docs/current/repmgr-standby-switchover.html
repmgr -f /postgresql/pg13/repmgr.conf cluster show
repmgr -f /postgresql/pg13/repmgr.conf standby switchover --siblings-follow --dry-run --force-rewind
repmgr -f /postgresql/pg13/repmgr.conf standby switchover --siblings-follow --force-rewind
repmgr -f /postgresql/pg13/repmgr.conf cluster show
注意:切换操作只能在standby 库上执行

-- 可以debug打印详细的切换过程
repmgr -f /postgresql/pg13/repmgr.conf standby switchover --siblings-follow --force-rewind --log-level DEBUG --verbose


案例1:
[pgsql@pg1 ~]$ repmgr -f /postgresql/pg13/repmgr.conf cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                            
----+---------+---------+-----------+----------+----------+----------+----------+-------------------------------------------------------------------------------
 1  | pg1     | primary | * running |          | default  | 100      | 1        | host=172.72.10.2 user=repmgr password=repmgr dbname=repmgr connect_timeout=2 
 2  | pg2     | standby |   running | pg1      | default  | 100      | 1        | host=172.72.10.3  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 3  | pg3     | standby |   running | pg1      | default  | 100      | 1        | host=172.72.10.5  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 6  | wintess | witness | * running | pg1      | default  | 0        | n/a      | host=172.72.10.7  user=repmgr password=repmgr dbname=repmgr connect_timeout=2



[pgsql@pg2 ~]$ repmgr -f /postgresql/pg13/repmgr.conf standby switchover --siblings-follow --dry-run --force-rewind
NOTICE: checking switchover on node "pg2" (ID: 2) in --dry-run mode
INFO: prerequisites for using pg_rewind are met
INFO: SSH connection to host "172.72.10.2" succeeded
INFO: able to execute "repmgr" on remote host "172.72.10.2"
INFO: all sibling nodes are reachable via SSH
INFO: 3 walsenders required, 10 available
INFO: demotion candidate is able to make replication connection to promotion candidate
INFO: 0 pending archive files
INFO: replication lag on this standby is 0 seconds
NOTICE: local node "pg2" (ID: 2) would be promoted to primary; current primary "pg1" (ID: 1) would be demoted to standby
INFO: following shutdown command would be run on node "pg1":
  "/postgresql/pg13/bin/pg_ctl  -D '/postgresql/pgdata' -W -m fast stop"
INFO: parameter "shutdown_check_timeout" is set to 60 seconds
INFO: prerequisites for executing STANDBY SWITCHOVER are met



[pgsql@pg2 ~]$ repmgr -f /postgresql/pg13/repmgr.conf standby switchover --siblings-follow --force-rewind
NOTICE: executing switchover on node "pg2" (ID: 2)
NOTICE: local node "pg2" (ID: 2) will be promoted to primary; current primary "pg1" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "pg1" (ID: 1)
NOTICE: issuing CHECKPOINT on node "pg1" (ID: 1) 
DETAIL: executing server command "/postgresql/pg13/bin/pg_ctl  -D '/postgresql/pgdata' -W -m fast stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/F000028
NOTICE: promoting standby to primary
DETAIL: promoting server "pg2" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "pg2" (ID: 2) was successfully promoted to primary
NOTICE: issuing CHECKPOINT on node "pg2" (ID: 2) 
ERROR: unable to execute CHECKPOINT
INFO: local node 1 can attach to rejoin target node 2
DETAIL: local node's recovery point: 0/F000028; rejoin target node's fork point: 0/F0000A0
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=172.72.10.2 user=repmgr password=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/postgresql/pg13/bin/pg_ctl  -w -D '/postgresql/pgdata' start"
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
NOTICE: node  "pg2" (ID: 2) promoted to primary, node "pg1" (ID: 1) demoted to standby
NOTICE: executing STANDBY FOLLOW on 2 of 2 siblings
INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
NOTICE: switchover was successful
DETAIL: node "pg2" is now primary and node "pg1" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully



[pgsql@pg1 ~]$ repmgr -f /postgresql/pg13/repmgr.conf cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                            
----+---------+---------+-----------+----------+----------+----------+----------+-------------------------------------------------------------------------------
 1  | pg1     | standby |   running | pg2      | default  | 100      | 1        | host=172.72.10.2 user=repmgr password=repmgr dbname=repmgr connect_timeout=2 
 2  | pg2     | primary | * running |          | default  | 100      | 2        | host=172.72.10.3  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 3  | pg3     | standby |   running | pg2      | default  | 100      | 2        | host=172.72.10.5  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 6  | wintess | witness | * running | pg2      | default  | 0        | n/a      | host=172.72.10.7  user=repmgr password=repmgr dbname=repmgr connect_timeout=2



repmgr=# \x 
Expanded display is on.
repmgr=# select * from nodes;
-[ RECORD 1 ]----+------------------------------------------------------------------------------
node_id          | 2
upstream_node_id | 
active           | t
node_name        | pg2
type             | primary
location         | default
priority         | 100
conninfo         | host=172.72.10.3  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
repluser         | repmgr
slot_name        | 
config_file      | /postgresql/pg13/repmgr.conf
-[ RECORD 2 ]----+------------------------------------------------------------------------------
node_id          | 1
upstream_node_id | 2
active           | t
node_name        | pg1
type             | standby
location         | default
priority         | 100
conninfo         | host=172.72.10.2 user=repmgr password=repmgr dbname=repmgr connect_timeout=2
repluser         | repmgr
slot_name        | 
config_file      | /postgresql/pg13/repmgr.conf
-[ RECORD 3 ]----+------------------------------------------------------------------------------
node_id          | 3
upstream_node_id | 2
active           | t
node_name        | pg3
type             | standby
location         | default
priority         | 100
conninfo         | host=172.72.10.5  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
repluser         | repmgr
slot_name        | 
config_file      | /postgresql/pg13/repmgr.conf
-[ RECORD 4 ]----+------------------------------------------------------------------------------
node_id          | 6
upstream_node_id | 2
active           | t
node_name        | wintess
type             | witness
location         | default
priority         | 0
conninfo         | host=172.72.10.7  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
repluser         | repmgr
slot_name        | 
config_file      | /postgresql/pg13/repmgr.conf


1.2 切换故障

以下所有操作,均在备库执行
PS:若切换失败,则需要重新克隆从库
1)  启动主库
pg_ctl start

2) 重新克隆从库
repmgr -h 172.72.10.2 -U repmgr -d repmgr -f /postgresql/pg13/repmgr.conf standby clone --force 
pg_ctl  start

3) 重新注册
repmgr -f /postgresql/pg13/repmgr.conf standby register --force
repmgr -f  /postgresql/pg13/repmgr.conf  cluster show


4)  把新备库切换为主库
repmgr -f  /postgresql/pg13/repmgr.conf cluster show
repmgr -f  /postgresql/pg13/repmgr.conf standby switchover --siblings-follow --dry-run --force-rewind
-- 可以debug打印详细的切换过程
repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --force-rewind --log-level DEBUG --verbose
repmgr -f  /postgresql/pg13/repmgr.conf cluster show


2 failover异常手工主从切换

2.1 主库出现故障,然后直接将从库提升为主库

1)主库宕机
pg_ctl -m fast stop

2)查询集群状态
[pgsql@pg1 ~]$ repmgr -f /postgresql/pg13/repmgr.conf cluster show
 ID | Name    | Role    | Status        | Upstream | Location | Priority | Timeline | Connection string                                                            
----+---------+---------+---------------+----------+----------+----------+----------+-------------------------------------------------------------------------------
 1  | pg1     | standby |   running     | ? pg2    | default  | 100      | 2        | host=172.72.10.2 user=repmgr password=repmgr dbname=repmgr connect_timeout=2 
 2  | pg2     | primary | ? unreachable | ?        | default  | 100      |          | host=172.72.10.3  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 3  | pg3     | standby |   running     | ? pg2    | default  | 100      | 2        | host=172.72.10.5  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 6  | wintess | witness | * running     | ? pg2    | default  | 0        | n/a      | host=172.72.10.7  user=repmgr password=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "pg1" (ID: 1)'s upstream node "pg2" (ID: 2)
  - unable to determine if node "pg1" (ID: 1) is attached to its upstream node "pg2" (ID: 2)
  - unable to connect to node "pg2" (ID: 2)
  - node "pg2" (ID: 2) is registered as an active primary but is unreachable
  - unable to connect to node "pg3" (ID: 3)'s upstream node "pg2" (ID: 2)
  - unable to determine if node "pg3" (ID: 3) is attached to its upstream node "pg2" (ID: 2)
  - unable to connect to node "wintess" (ID: 6)'s upstream node "pg2" (ID: 2)

HINT: execute with --verbose option to see connection error messages


3) pg3 提升为主库(pg3执行)
repmgr -f /pg13/pg13/repmgr.conf --siblings-follow standby promote


4)查看集群状态
[pgsql@pg1 ~]$ repmgr -f /postgresql/pg13/repmgr.conf cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                            
----+---------+---------+-----------+----------+----------+----------+----------+-------------------------------------------------------------------------------
 1  | pg1     | standby |   running | pg3      | default  | 100      | 2        | host=172.72.10.2 user=repmgr password=repmgr dbname=repmgr connect_timeout=2 
 2  | pg2     | primary | - failed  | ?        | default  | 100      |          | host=172.72.10.3  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 3  | pg3     | primary | * running |          | default  | 100      | 3        | host=172.72.10.5  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 6  | wintess | witness | * running | pg3      | default  | 0        | n/a      | host=172.72.10.7  user=repmgr password=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "pg2" (ID: 2)

HINT: execute with --verbose option to see connection error messages
PS:pg2损坏,pg3升为新主


5) 因pg2损坏,现需修复,作为从库加入集群
pg_ctl stop
repmgr -h 172.72.10.5 -U repmgr -d repmgr -f /postgresql/pg13/repmgr.conf standby clone --force

repmgr -f /postgresql/pg13/repmgr.conf node rejoin -d 'host=172.72.10.5  user=repmgr dbname=repmgr connect_timeout=2' --force-rewind --dry-run --verbose
repmgr -f /postgresql/pg13/repmgr.conf node rejoin -d 'host=172.72.10.5  user=repmgr dbname=repmgr connect_timeout=2' --force-rewind --verbose

pg_ctl status
pg_ctl: server is running (PID: 4356)
/postgresql/pg13/bin/postgres "-D" "/postgresql/pgdata"

6) 查看集群状态
[pgsql@pg1 ~]$ repmgr -f /postgresql/pg13/repmgr.conf cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                            
----+---------+---------+-----------+----------+----------+----------+----------+-------------------------------------------------------------------------------
 1  | pg1     | primary | * running |          | default  | 100      | 4        | host=172.72.10.2 user=repmgr password=repmgr dbname=repmgr connect_timeout=2 
 2  | pg2     | standby |   running | pg1      | default  | 100      | 4        | host=172.72.10.3  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 3  | pg3     | standby |   running | pg1      | default  | 100      | 4        | host=172.72.10.5  user=repmgr password=repmgr dbname=repmgr connect_timeout=2
 6  | wintess | witness | * running | pg1      | default  | 0        | n/a      | host=172.72.10.7  user=repmgr password=repmgr dbname=repmgr connect_timeout=2

2.2 配置自动failover

在这里插入代码片

你可能感兴趣的:(数据库)