repmgr -f /etc/repmgr.conf primary register
repmgr -f /etc/repmgr.conf standby register
repmgr -f /etc/repmgr.conf primary unregister -F --node-id=2
repmgr -f /etc/repmgr.conf standby unregister
Check before cloning (dry run)
repmgr -h 10.79.21.29 -U repmgr -d repmgr -f /etc/repmgr.conf standby clone --dry-run
Actual run
$repmgr -h 10.79.21.30 -U repmgr -d repmgr -f /etc/repmgr.conf standby clone
NOTICE: destination directory "/home/storage/pgsql/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=10.79.21.30 user=repmgr dbname=repmgr
DETAIL: current installation size is 115 MB
INFO: replication slot usage not requested; no replication slot will be set up for this standby
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: checking and correcting permissions on existing directory "/home/storage/pgsql/data"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
/usr/local/pgsql/bin/pg_basebackup -l "repmgr base backup" -D /home/storage/pgsql/data -h 10.79.21.30 -p 5432 -U repmgr -X stream
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: /usr/local/pgsql/bin/pg_ctl -D /home/storage/pgsql/data -l /home/storage/pgsql/data/server.log start
HINT: after starting the server, you need to register this standby with "repmgr standby register"
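The hints at the end of the clone output describe the remaining steps: start PostgreSQL, then register the standby. As a rough sketch, the whole sequence could be chained in one shell function. The function name clone_start_register and the REPMGR / PG_CTL override variables are hypothetical; the commands and paths are the ones shown in this post, and the log file location is assumed to be server.log inside the data directory.

```shell
#!/bin/sh
# Hypothetical overrides so the binaries can be swapped out (for example in tests).
REPMGR="${REPMGR:-repmgr}"
PG_CTL="${PG_CTL:-/usr/local/pgsql/bin/pg_ctl}"

# Usage: clone_start_register SOURCE_HOST REPMGR_CONF DATA_DIR
# Each step runs only if the previous one succeeded.
clone_start_register() {
    source_host="$1"; conf="$2"; datadir="$3"
    # 1. Dry run: check that cloning would work without copying anything.
    "$REPMGR" -h "$source_host" -U repmgr -d repmgr -f "$conf" standby clone --dry-run &&
    # 2. Real clone via pg_basebackup.
    "$REPMGR" -h "$source_host" -U repmgr -d repmgr -f "$conf" standby clone &&
    # 3. Start the freshly cloned standby.
    "$PG_CTL" -D "$datadir" -l "$datadir/server.log" start &&
    # 4. Register it in the repmgr metadata.
    "$REPMGR" -f "$conf" standby register
}
```

For the environment in this post it would be called as: clone_start_register 10.79.21.30 /etc/repmgr.conf /home/storage/pgsql/data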
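If the primary server fails or needs to be removed from the replication cluster, a new primary must be designated so that the cluster keeps running. This is done with repmgr standby promote, which promotes the standby on the current server to primary.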
$repmgr -f /etc/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 3 | host=10.79.21.30 port=5432 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | standby | running | node1 | default | 100 | 3 | host=10.79.21.29 port=5432 user=repmgr dbname=repmgr connect_timeout=2
pg_ctl -D /home/storage/pgsql/data -l /home/storage/pgsql/data/server.log stop
$repmgr -f /etc/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+---------------+----------+----------+----------+----------+------------------------------------------------------------------------
1 | node1 | primary | ? unreachable | ? | default | 100 | | host=10.79.21.30 port=5432 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | standby | running | ? node1 | default | 100 | 3 | host=10.79.21.29 port=5432 user=repmgr dbname=repmgr connect_timeout=2
WARNING: following issues were detected
- unable to connect to node "node1" (ID: 1)
- node "node1" (ID: 1) is registered as an active primary but is unreachable
- unable to connect to node "node2" (ID: 2)'s upstream node "node1" (ID: 1)
- unable to determine if node "node2" (ID: 2) is attached to its upstream node "node1" (ID: 1)
HINT: execute with --verbose option to see connection error messages
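The output above shows the situation that calls for a promotion: a node is still registered as primary but is no longer running. As a minimal sketch, that condition could be detected in a script by filtering the table printed by cluster show. The function name unhealthy_primaries is hypothetical, and the parsing assumes the column layout shown in this post (Name, Role, and Status as the second, third, and fourth columns).

```shell
#!/bin/sh
# Reads `repmgr cluster show` table output on stdin and prints the name of
# every node whose role is "primary" but whose status is not "running".
unhealthy_primaries() {
    awk -F'|' '
        NF >= 4 {
            name = $2; role = $3; status = $4
            # Trim surrounding blanks from each column.
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", role)
            gsub(/^[ \t]+|[ \t]+$/, "", status)
            if (role == "primary" && status !~ /running/) print name
        }'
}
```

A possible use is: repmgr -f /etc/repmgr.conf cluster show | unhealthy_primaries. The header row is skipped because its Role column is not "primary", and the separator row is skipped because it contains no "|" characters.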
repmgr -f /etc/repmgr.conf standby promote --log-level=debug --verbose
To see detailed log output, add --log-level=debug --verbose
$repmgr -f /etc/repmgr.conf standby promote --log-level=debug --verbose
NOTICE: using provided configuration file "/etc/repmgr.conf"
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.29 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: set_config():
SET synchronous_commit TO 'local'
INFO: connected to standby, checking its state
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: get_node_record():
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n WHERE n.node_id = 2
INFO: searching for primary node
DEBUG: get_primary_connection():
SELECT node_id, conninfo, CASE WHEN type = 'primary' THEN 1 ELSE 2 END AS type_priority FROM repmgr.nodes WHERE active IS TRUE AND type != 'witness' ORDER BY active DESC, type_priority, priority, node_id
INFO: checking if node 1 is primary
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.30 port=5432 fallback_application_name=repmgr options=-csearch_path="
ERROR: connection to database failed
DETAIL:
could not connect to server: Connection refused
Is the server running on host "10.79.21.30" and accepting
TCP/IP connections on port 5432?
DETAIL: attempted to connect using:
user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.30 port=5432 fallback_application_name=repmgr options=-csearch_path=
INFO: checking if node 2 is primary
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.29 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: set_config():
SET synchronous_commit TO 'local'
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: get_node_replication_stats():
SELECT pg_catalog.current_setting('max_wal_senders')::INT AS max_wal_senders, (SELECT pg_catalog.count(*) FROM pg_catalog.pg_stat_replication) AS attached_wal_receivers, current_setting('max_replication_slots')::INT AS max_replication_slots, (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE slot_type='physical') AS total_replication_slots, (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE active IS TRUE AND slot_type='physical') AS active_replication_slots, (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE active IS FALSE AND slot_type='physical') AS inactive_replication_slots, pg_catalog.pg_is_in_recovery() AS in_recovery
DEBUG: get_active_sibling_node_records():
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n WHERE n.upstream_node_id = 1 AND n.node_id != 2 AND n.active IS TRUE ORDER BY n.node_id
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
DEBUG: get_node_record():
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n WHERE n.node_id = 2
NOTICE: promoting standby to primary
DETAIL: promoting server "node2" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: standby promoted to primary after 1 second(s)
DEBUG: setting node 2 as primary and marking existing primary as failed
DEBUG: begin_transaction()
DEBUG: commit_transaction()
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node2" (ID: 2) was successfully promoted to primary
DEBUG: _create_event(): event is "standby_promote" for node 2
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: _create_event():
INSERT INTO repmgr.events ( node_id, event, successful, details ) VALUES ($1, $2, $3, $4) RETURNING event_timestamp
DEBUG: _create_event(): Event timestamp is "2023-11-15 19:31:25.636843+08"
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
$repmgr -f /etc/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
1 | node1 | primary | - failed | ? | default | 100 | | host=10.79.21.30 port=5432 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | primary | * running | | default | 100 | 4 | host=10.79.21.29 port=5432 user=repmgr dbname=repmgr connect_timeout=2
WARNING: following issues were detected
- unable to connect to node "node1" (ID: 1)
HINT: execute with --verbose option to see connection error messages
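Scenario
After the existing primary of a replication cluster has failed or been removed, repmgr standby follow can be used to make an "orphaned" standby follow the new primary and catch up to its current state.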
repmgr -f /etc/repmgr.conf standby follow
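In some cases a standby needs to be promoted in a planned manner, for example when maintenance must be performed on the primary; the repmgr standby switchover command supports this kind of switch.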
repmgr standby switchover
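It differs from other repmgr operations in that it also performs actions on other servers (the demotion candidate and, optionally, any other servers that are to follow the new primary). This means passwordless SSH access is required from the server where the command is executed to those servers.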
repmgr -f /etc/repmgr.conf cluster show
$repmgr -f /etc/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 1 | host=10.79.21.30 port=5432 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | standby | running | node1 | default | 100 | 1 | host=10.79.21.29 port=5432 user=repmgr dbname=repmgr connect_timeout=2
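The success of a switchover depends on repmgr being able to shut down the current primary quickly and cleanly.
Ensure the promotion candidate has enough free walsenders available (PostgreSQL setting max_wal_senders) and, if replication slots are in use, at least one free slot for the demotion candidate (PostgreSQL setting max_replication_slots).
Ensure passwordless SSH connections can be made from the promotion candidate (the standby) to the demotion candidate (the current primary). If --siblings-follow is used, also ensure passwordless SSH connections can be made from the promotion candidate to all nodes attached to the demotion candidate (including the witness server, if one is in use).
Double-check which commands will be used to stop, start, and restart the current primary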
repmgr -f /etc/repmgr.conf node service --list-actions --action=stop
repmgr -f /etc/repmgr.conf node service --list-actions --action=start
repmgr -f /etc/repmgr.conf node service --list-actions --action=restart
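The passwordless SSH requirement can be verified by hand before attempting a switchover. The sketch below mirrors the ssh invocation that repmgr itself logs in debug mode (batch mode, quiet, 10 second connect timeout, running /bin/true on the remote host). The function names and the SSH_BIN override variable are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Hypothetical override so the check can be exercised without a network.
SSH_BIN="${SSH_BIN:-ssh}"

# Succeeds only if a non-interactive SSH login to host $1 works.
check_ssh() {
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    "$SSH_BIN" -o BatchMode=yes -q -o ConnectTimeout=10 "$1" /bin/true 2>/dev/null
}

# Checks every host given as an argument; returns non-zero if any check fails.
check_all_ssh() {
    failed=0
    for host in "$@"; do
        if check_ssh "$host"; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return "$failed"
}
```

Run on the promotion candidate, for example: check_all_ssh 10.79.21.30. With --siblings-follow, the other attached nodes would be listed as additional arguments.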
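Pre-execution check
Run repmgr standby switchover with the --dry-run option first. This performs all necessary checks, reports success or failure, and stops before running the first actual command (shutting down the current primary).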
repmgr standby switchover -f /etc/repmgr.conf --dry-run --verbose --log-level=debug
$repmgr standby switchover -f /etc/repmgr.conf --dry-run --verbose --log-level=debug
NOTICE: using provided configuration file "/etc/repmgr.conf"
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.29 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: set_config():
SET synchronous_commit TO 'local'
DEBUG: get_node_record():
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n WHERE n.node_id = 2
NOTICE: checking switchover on node "node2" (ID: 2) in --dry-run mode
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: searching for primary node
DEBUG: get_primary_connection():
SELECT node_id, conninfo, CASE WHEN type = 'primary' THEN 1 ELSE 2 END AS type_priority FROM repmgr.nodes WHERE active IS TRUE AND type != 'witness' ORDER BY active DESC, type_priority, priority, node_id
INFO: checking if node 1 is primary
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.30 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: set_config():
SET synchronous_commit TO 'local'
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: current primary node is 1
DEBUG: get_node_record():
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n WHERE n.node_id = 1
DEBUG: remote node name is "node1"
DEBUG: test_ssh_connection(): executing ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /bin/true 2>/dev/null
INFO: SSH connection to host "10.79.21.30" succeeded
DEBUG: remote_command():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /usr/local/pgsql/bin/repmgr -f /etc/repmgr.conf -L debug --version >/dev/null 2>&1 && echo "1" || echo "0"
DEBUG: remote_command(): output returned was:
1
DEBUG: remote_command():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /usr/local/pgsql/bin/repmgr -f /etc/repmgr.conf -L debug --version 2>/dev/null
DEBUG: remote_command(): output returned was:
repmgr 5.3.3
DEBUG: "repmgr" version on "10.79.21.30" is 50303
DEBUG: remote_command():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 test -f /etc/repmgr.conf && echo 1 || echo 0
DEBUG: remote_command(): output returned was:
1
DEBUG: remote_command():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /usr/local/pgsql/bin/repmgr -f /etc/repmgr.conf -L debug node check --data-directory-config --optformat -LINFO 2>/dev/null
DEBUG: remote_command(): output returned was:
--configured-data-directory=OK
INFO: able to execute "repmgr" on remote host "10.79.21.30"
DEBUG: remote_command():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /usr/local/pgsql/bin/repmgr -f /etc/repmgr.conf -L debug node check --replication-config-owner --optformat -LINFO 2>/dev/null
DEBUG: remote_command(): output returned was:
--replication-config-owner=OK
DEBUG: get_node_replication_stats():
SELECT pg_catalog.current_setting('max_wal_senders')::INT AS max_wal_senders, (SELECT pg_catalog.count(*) FROM pg_catalog.pg_stat_replication) AS attached_wal_receivers, current_setting('max_replication_slots')::INT AS max_replication_slots, (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE slot_type='physical') AS total_replication_slots, (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE active IS TRUE AND slot_type='physical') AS active_replication_slots, (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE active IS FALSE AND slot_type='physical') AS inactive_replication_slots, pg_catalog.pg_is_in_recovery() AS in_recovery
DEBUG: get_active_sibling_node_records():
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n WHERE n.upstream_node_id = 1 AND n.node_id != 2 AND n.active IS TRUE ORDER BY n.node_id
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
INFO: 1 walsenders required, 10 available
DEBUG: remote_command():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /usr/local/pgsql/bin/repmgr -f /etc/repmgr.conf -L debug node check --remote-node-id=2 --replication-connection
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.30 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: remote_command(): output returned was:
--connection=OK
INFO: demotion candidate is able to make replication connection to promotion candidate
DEBUG: guc_set():
SELECT true FROM pg_catalog.pg_settings WHERE name = 'archive_mode' AND setting != 'off'
DEBUG: remote_command():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /usr/local/pgsql/bin/repmgr -f /etc/repmgr.conf -L debug node check --terse -LERROR --archive-ready --optformat
DEBUG: remote_command(): output returned was:
--status=OK --files=0
INFO: 0 pending archive files
DEBUG: get_replication_lag_seconds():
SELECT CASE WHEN (pg_catalog.pg_last_wal_receive_lsn() = pg_catalog.pg_last_wal_replay_lsn()) THEN 0 ELSE EXTRACT(epoch FROM (pg_catalog.clock_timestamp() - pg_catalog.pg_last_xact_replay_timestamp()))::INT END AS lag_seconds
DEBUG: lag is 0
INFO: replication lag on this standby is 0 seconds
DEBUG: get_all_node_records():
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n ORDER BY n.node_id
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
NOTICE: attempting to pause repmgrd on 2 nodes
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.30 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: set_config():
SET synchronous_commit TO 'local'
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.29 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: set_config():
SET synchronous_commit TO 'local'
NOTICE: local node "node2" (ID: 2) would be promoted to primary; current primary "node1" (ID: 1) would be demoted to standby
DEBUG: remote_command():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /usr/local/pgsql/bin/repmgr -f /etc/repmgr.conf -L debug node service --terse -LERROR --list-actions --action=stop
DEBUG: remote_command(): output returned was:
/usr/local/pgsql/bin/pg_ctl -D /home/storage/pgsql/data -l /home/storage/pgsql/data/server.log stop
INFO: following shutdown command would be run on node "node1":
"/usr/local/pgsql/bin/pg_ctl -D /home/storage/pgsql/data -l /home/storage/pgsql/data/server.log stop"
INFO: parameter "shutdown_check_timeout" is set to 60 seconds
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
INFO: prerequisites for executing STANDBY SWITCHOVER are met
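Because the dry run stops before anything is changed, it can serve as a gate for the real switchover. The following is a minimal sketch under the assumption that repmgr exits with status 0 when the dry run finds no problems; the function name safe_switchover and the REPMGR override variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical override so the repmgr binary can be swapped out (for example in tests).
REPMGR="${REPMGR:-repmgr}"

# Usage: safe_switchover REPMGR_CONF
# Runs the real switchover only if the dry run succeeds.
safe_switchover() {
    conf="$1"
    if "$REPMGR" -f "$conf" standby switchover --dry-run; then
        "$REPMGR" -f "$conf" standby switchover
    else
        echo "dry run reported problems; switchover not attempted" >&2
        return 1
    fi
}
```

It would be run on the standby that is to become the new primary, for example: safe_switchover /etc/repmgr.conf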
repmgr -f /etc/repmgr.conf standby switchover
$repmgr -f /etc/repmgr.conf standby switchover
NOTICE: executing switchover on node "node2" (ID: 2)
NOTICE: attempting to pause repmgrd on 2 nodes
NOTICE: local node "node2" (ID: 2) will be promoted to primary; current primary "node1" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "node1" (ID: 1)
NOTICE: issuing CHECKPOINT on node "node1" (ID: 1)
DETAIL: executing server command "/usr/local/pgsql/bin/pg_ctl -D '/home/storage/pgsql/data' -W -m fast stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/10000028
NOTICE: promoting standby to primary
DETAIL: promoting server "node2" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node2" (ID: 2) was successfully promoted to primary
NOTICE: node "node2" (ID: 2) promoted to primary, node "node1" (ID: 1) demoted to standby
NOTICE: switchover was successful
DETAIL: node "node2" is now primary and node "node1" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully
$repmgr -f /etc/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
1 | node1 | standby | running | node2 | default | 100 | 1 | host=10.79.21.30 port=5432 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | primary | * running | | default | 100 | 2 | host=10.79.21.29 port=5432 user=repmgr dbname=repmgr connect_timeout=2
Cause: the pg_bindir parameter is not set.
Solution: add the pg_bindir parameter to the configuration file.
$repmgr -f /etc/repmgr.conf standby switchover
NOTICE: executing switchover on node "node2" (ID: 2)
ERROR: unable to execute "repmgr" on "10.79.21.30"
HINT: check "pg_bindir" is set to the correct path in "repmgr.conf"; current value: (not set)
repmgr -f /etc/repmgr.conf standby switchover
NOTICE: executing switchover on node "node2" (ID: 2)
NOTICE: attempting to pause repmgrd on 2 nodes
NOTICE: local node "node2" (ID: 2) will be promoted to primary; current primary "node1" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "node1" (ID: 1)
NOTICE: issuing CHECKPOINT on node "node1" (ID: 1)
DETAIL: executing server command "pg_ctl -D /home/storage/pgsql/data -l /home/storage/pgsql/data/server.log stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
...
INFO: checking for primary shutdown; 60 of 60 attempts ("shutdown_check_timeout")
ERROR: shutdown of the primary server could not be confirmed
HINT: check the primary server status before performing any further actions
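Solution:
Change the service command parameters (service_start_command, service_stop_command, and so on) to use absolute paths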
service_start_command='/usr/local/pgsql/bin/pg_ctl -D /home/storage/pgsql/data -l /home/storage/pgsql/data/server.log start'
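To catch this problem before a switchover, the configuration file can be scanned for service commands that do not start with an absolute path. This is only a sketch: the function name relative_service_commands is hypothetical, and it assumes each setting sits on a single line in the form name='command ...' as in the example above.

```shell
#!/bin/sh
# Prints the name of every service_*_command setting in the repmgr.conf file $1
# whose command does not begin with "/" (that is, relies on PATH on the remote host).
relative_service_commands() {
    awk '
        /^[ \t]*service_[a-z_]+_command[ \t]*=/ {
            key = $0
            sub(/[ \t]*=.*/, "", key)        # keep only the text before the first "="
            sub(/^[ \t]+/, "", key)          # drop leading blanks
            val = substr($0, index($0, "=") + 1)
            gsub(/^[ \t\047"]+/, "", val)    # drop leading blanks and quotes (\047 is a single quote)
            if (val !~ /^\//) print key
        }' "$1"
}
```

Running relative_service_commands /etc/repmgr.conf on each node prints nothing when all service commands use absolute paths.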