[PG] PostgreSQL High Availability: Common repmgr Commands

Registering / unregistering nodes (register / unregister)

repmgr -f /etc/repmgr.conf primary register
repmgr -f /etc/repmgr.conf standby register

repmgr -f /etc/repmgr.conf primary unregister -F --node-id=2
repmgr -f /etc/repmgr.conf standby unregister

Cloning the primary (repmgr standby clone)

Check before cloning

 repmgr -h 10.79.21.29 -U repmgr -d repmgr -f /etc/repmgr.conf standby clone --dry-run

Actual run

$repmgr -h 10.79.21.30 -U repmgr -d repmgr -f /etc/repmgr.conf standby clone
NOTICE: destination directory "/home/storage/pgsql/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=10.79.21.30 user=repmgr dbname=repmgr
DETAIL: current installation size is 115 MB
INFO: replication slot usage not requested;  no replication slot will be set up for this standby
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: checking and correcting permissions on existing directory "/home/storage/pgsql/data"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
  /usr/local/pgsql/bin/pg_basebackup -l "repmgr base backup"  -D /home/storage/pgsql/data -h 10.79.21.30 -p 5432 -U repmgr -X stream
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: /usr/local/pgsql/bin/pg_ctl  -D /home/storage/pgsql/data -l /home/storage/pgsql/data/server.log start
HINT: after starting the server, you need to register this standby with "repmgr standby register"
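
The steps above (dry run, clone, start, register) can be chained in a small script. The following is only a sketch: the helper names run and standby_setup are hypothetical, the paths are the ones that appear in the output above, and DRY_RUN defaults to 1 so that the commands are printed rather than executed.

```shell
# Print the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: clone from the primary, start the standby, register it.
# $1 = primary host, $2 = repmgr.conf path, $3 = data directory
standby_setup() {
  primary_host=$1
  conf=${2:-/etc/repmgr.conf}
  data_dir=${3:-/home/storage/pgsql/data}

  # Each step runs only if the previous one succeeded.
  run repmgr -h "$primary_host" -U repmgr -d repmgr -f "$conf" standby clone --dry-run &&
  run repmgr -h "$primary_host" -U repmgr -d repmgr -f "$conf" standby clone &&
  run /usr/local/pgsql/bin/pg_ctl -D "$data_dir" -l "$data_dir/server.log" start &&
  run repmgr -f "$conf" standby register
}

# Example (prints the four commands without running them):
#   standby_setup 10.79.21.30
# To really execute them:
#   DRY_RUN=0 standby_setup 10.79.21.30
```

Chaining with && means a failed dry run stops the sequence before anything is written to the data directory.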

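Promoting a standby to primary (repmgr standby promote)

Scenario

If the primary server fails or needs to be removed from the replication cluster, a new primary must be designated so that the cluster keeps operating normally. This can be done with repmgr standby promote, which promotes the standby on the current server to primary.

Check the status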

$repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 3        | host=10.79.21.30 port=5432 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 3        | host=10.79.21.29 port=5432 user=repmgr dbname=repmgr connect_timeout=2

Stop the primary

 pg_ctl  -D /home/storage/pgsql/data -l /home/storage/pgsql/data/server.log stop

Check the status again (on the standby node)

$repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status        | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+---------------+----------+----------+----------+----------+------------------------------------------------------------------------
 1  | node1 | primary | ? unreachable | ?        | default  | 100      |          | host=10.79.21.30 port=5432 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running     | ? node1  | default  | 100      | 3        | host=10.79.21.29 port=5432 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "node1" (ID: 1)
  - node "node1" (ID: 1) is registered as an active primary but is unreachable
  - unable to connect to node "node2" (ID: 2)'s upstream node "node1" (ID: 1)
  - unable to determine if node "node2" (ID: 2) is attached to its upstream node "node1" (ID: 1)

HINT: execute with --verbose option to see connection error messages
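
Before promoting, a script may want to confirm that no running primary is left. The sketch below parses the table printed by repmgr cluster show, assuming the column layout shown above (ID, Name, Role, Status first); the function names cluster_nodes and running_primary are hypothetical.

```shell
# Read `repmgr cluster show` table output on stdin and print
# "name role status" for each node row.
cluster_nodes() {
  awk -F'|' '
    # Node rows are the only lines whose first column is a bare number.
    $1 ~ /^ *[0-9]+ *$/ {
      name = $2; role = $3; status = $4
      gsub(/^[ \t]+|[ \t]+$/, "", name)
      gsub(/^[ \t]+|[ \t]+$/, "", role)
      # Drop the leading marker (*, ?, -, !) in front of the status word.
      gsub(/^[ \t*?!-]+|[ \t]+$/, "", status)
      print name, role, status
    }'
}

# Print the name of every node that is both "primary" and "running".
running_primary() {
  cluster_nodes | awk '$2 == "primary" && $3 == "running" { print $1 }'
}

# Example:
#   repmgr -f /etc/repmgr.conf cluster show | running_primary
```

With the output above, where node1 is unreachable, running_primary prints nothing, which is the situation in which a promotion is appropriate.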

Promote the standby (run on the standby node)

repmgr -f /etc/repmgr.conf standby promote --log-level=debug --verbose

For detailed log output, add --log-level=debug --verbose

$repmgr -f /etc/repmgr.conf standby promote --log-level=debug --verbose
NOTICE: using provided configuration file "/etc/repmgr.conf"
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.29 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: set_config():
  SET synchronous_commit TO 'local'
INFO: connected to standby, checking its state
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: get_node_record():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached   FROM repmgr.nodes n  WHERE n.node_id = 2
INFO: searching for primary node
DEBUG: get_primary_connection():
  SELECT node_id, conninfo,          CASE WHEN type = 'primary' THEN 1 ELSE 2 END AS type_priority	   FROM repmgr.nodes    WHERE active IS TRUE      AND type != 'witness' ORDER BY active DESC, type_priority, priority, node_id
INFO: checking if node 1 is primary
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.30 port=5432 fallback_application_name=repmgr options=-csearch_path="
ERROR: connection to database failed
DETAIL:
could not connect to server: Connection refused
	Is the server running on host "10.79.21.30" and accepting
	TCP/IP connections on port 5432?

DETAIL: attempted to connect using:
  user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.30 port=5432 fallback_application_name=repmgr options=-csearch_path=
INFO: checking if node 2 is primary
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.29 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: get_node_replication_stats():
 SELECT pg_catalog.current_setting('max_wal_senders')::INT AS max_wal_senders,         (SELECT pg_catalog.count(*) FROM pg_catalog.pg_stat_replication) AS attached_wal_receivers,         current_setting('max_replication_slots')::INT AS max_replication_slots,         (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE slot_type='physical') AS total_replication_slots,         (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE active IS TRUE AND slot_type='physical')  AS active_replication_slots,         (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE active IS FALSE AND slot_type='physical') AS inactive_replication_slots,         pg_catalog.pg_is_in_recovery() AS in_recovery
DEBUG: get_active_sibling_node_records():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached     FROM repmgr.nodes n    WHERE n.upstream_node_id = 1      AND n.node_id != 2      AND n.active IS TRUE ORDER BY n.node_id
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
DEBUG: get_node_record():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached   FROM repmgr.nodes n  WHERE n.node_id = 2
NOTICE: promoting standby to primary
DETAIL: promoting server "node2" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: standby promoted to primary after 1 second(s)
DEBUG: setting node 2 as primary and marking existing primary as failed
DEBUG: begin_transaction()
DEBUG: commit_transaction()
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node2" (ID: 2) was successfully promoted to primary
DEBUG: _create_event(): event is "standby_promote" for node 2
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: _create_event():
   INSERT INTO repmgr.events (              node_id,              event,              successful,              details             )       VALUES ($1, $2, $3, $4)    RETURNING event_timestamp
DEBUG: _create_event(): Event timestamp is "2023-11-15 19:31:25.636843+08"
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking

Check the status

$repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
 1  | node1 | primary | - failed  | ?        | default  | 100      |          | host=10.79.21.30 port=5432 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 4        | host=10.79.21.29 port=5432 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "node1" (ID: 1)

HINT: execute with --verbose option to see connection error messages

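Attaching an orphaned standby to the new primary (repmgr standby follow)

Scenario

After the existing primary of a replication cluster has failed or been removed, repmgr standby follow can be used to make an "orphaned" standby follow the new primary and catch up to its current state.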

repmgr -f /etc/repmgr.conf standby follow

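Planned switchover (repmgr standby switchover)

Scenario

In some cases a standby needs to be promoted in a planned way, for example when maintenance must be performed on the primary; the repmgr standby switchover command supports this.

repmgr standby switchover differs from other repmgr operations in that it also performs actions on other servers (the demotion candidate and, optionally, any other servers that are to follow the new primary). This means passwordless SSH access is required from the server where the command is executed to those servers.

Check the status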

repmgr -f /etc/repmgr.conf cluster show

$repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        | host=10.79.21.30 port=5432 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 1        | host=10.79.21.29 port=5432 user=repmgr dbname=repmgr connect_timeout=2

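Preparing for the switchover

The success of a switchover depends on repmgr being able to shut down the current primary quickly and cleanly.

Make sure the promotion candidate has enough free walsenders available (PostgreSQL setting max_wal_senders) and, if replication slots are in use, at least one free slot for the demotion candidate (PostgreSQL setting max_replication_slots).

Make sure passwordless SSH connections can be made from the promotion candidate (standby) to the demotion candidate (current primary). If --siblings-follow is used, make sure passwordless SSH connections can also be made from the promotion candidate to all nodes attached to the demotion candidate (including the witness server, if one is in use).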
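
The passwordless SSH requirement can be checked by hand before the switchover. The sketch below uses the same ssh options that appear in repmgr's debug output later in this post; the function name check_passwordless_ssh is hypothetical, and it should be run on the promotion candidate as the same OS user that runs repmgr.

```shell
# Try a non-interactive SSH login to each host given as an argument.
# Prints OK/FAIL per host and returns non-zero if any host failed.
check_passwordless_ssh() {
  rc=0
  for host in "$@"; do
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    if ssh -o BatchMode=yes -q -o ConnectTimeout=10 "$host" /bin/true </dev/null 2>/dev/null; then
      echo "OK   $host"
    else
      echo "FAIL $host"
      rc=1
    fi
  done
  return "$rc"
}

# Example, run on the promotion candidate (node2):
#   check_passwordless_ssh 10.79.21.30
```

When --siblings-follow is used, the other attached nodes can be passed as additional arguments.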

Double-check which commands will be used to stop/start/restart the current primary

repmgr -f /etc/repmgr.conf node service --list-actions --action=stop
repmgr -f /etc/repmgr.conf node service --list-actions --action=start
repmgr -f /etc/repmgr.conf node service --list-actions --action=restart

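Pre-execution check

Run repmgr standby switchover with the --dry-run option as a pre-execution check; this performs all necessary checks and reports success or failure, and stops just before the first actual command (shutting down the current primary) would be run.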

repmgr standby switchover -f /etc/repmgr.conf --dry-run --verbose --log-level=debug

$repmgr standby switchover -f /etc/repmgr.conf --dry-run --verbose --log-level=debug
NOTICE: using provided configuration file "/etc/repmgr.conf"
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.29 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: get_node_record():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached   FROM repmgr.nodes n  WHERE n.node_id = 2
NOTICE: checking switchover on node "node2" (ID: 2) in --dry-run mode
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: searching for primary node
DEBUG: get_primary_connection():
  SELECT node_id, conninfo,          CASE WHEN type = 'primary' THEN 1 ELSE 2 END AS type_priority	   FROM repmgr.nodes    WHERE active IS TRUE      AND type != 'witness' ORDER BY active DESC, type_priority, priority, node_id
INFO: checking if node 1 is primary
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.30 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: current primary node is 1
DEBUG: get_node_record():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached   FROM repmgr.nodes n  WHERE n.node_id = 1
DEBUG: remote node name is "node1"
DEBUG: test_ssh_connection(): executing ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /bin/true 2>/dev/null
INFO: SSH connection to host "10.79.21.30" succeeded
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /usr/local/pgsql/bin/repmgr -f /etc/repmgr.conf  -L debug --version >/dev/null 2>&1 && echo "1" || echo "0"
DEBUG: remote_command(): output returned was:
1

DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /usr/local/pgsql/bin/repmgr -f /etc/repmgr.conf  -L debug --version 2>/dev/null
DEBUG: remote_command(): output returned was:
repmgr 5.3.3

DEBUG: "repmgr" version on "10.79.21.30" is 50303
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 test -f /etc/repmgr.conf && echo 1 || echo 0
DEBUG: remote_command(): output returned was:
1

DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /usr/local/pgsql/bin/repmgr -f /etc/repmgr.conf  -L debug node check --data-directory-config --optformat -LINFO 2>/dev/null
DEBUG: remote_command(): output returned was:
--configured-data-directory=OK

INFO: able to execute "repmgr" on remote host "10.79.21.30"
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /usr/local/pgsql/bin/repmgr -f /etc/repmgr.conf  -L debug node check --replication-config-owner --optformat -LINFO 2>/dev/null
DEBUG: remote_command(): output returned was:
--replication-config-owner=OK

DEBUG: get_node_replication_stats():
 SELECT pg_catalog.current_setting('max_wal_senders')::INT AS max_wal_senders,         (SELECT pg_catalog.count(*) FROM pg_catalog.pg_stat_replication) AS attached_wal_receivers,         current_setting('max_replication_slots')::INT AS max_replication_slots,         (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE slot_type='physical') AS total_replication_slots,         (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE active IS TRUE AND slot_type='physical')  AS active_replication_slots,         (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE active IS FALSE AND slot_type='physical') AS inactive_replication_slots,         pg_catalog.pg_is_in_recovery() AS in_recovery
DEBUG: get_active_sibling_node_records():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached     FROM repmgr.nodes n    WHERE n.upstream_node_id = 1      AND n.node_id != 2      AND n.active IS TRUE ORDER BY n.node_id
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
INFO: 1 walsenders required, 10 available
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /usr/local/pgsql/bin/repmgr -f /etc/repmgr.conf  -L debug node check --remote-node-id=2 --replication-connection
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.30 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: remote_command(): output returned was:
--connection=OK

INFO: demotion candidate is able to make replication connection to promotion candidate
DEBUG: guc_set():
SELECT true FROM pg_catalog.pg_settings  WHERE name = 'archive_mode' AND setting != 'off'
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /usr/local/pgsql/bin/repmgr -f /etc/repmgr.conf  -L debug node check --terse -LERROR --archive-ready --optformat
DEBUG: remote_command(): output returned was:
--status=OK --files=0

INFO: 0 pending archive files
DEBUG: get_replication_lag_seconds():
 SELECT CASE WHEN (pg_catalog.pg_last_wal_receive_lsn() = pg_catalog.pg_last_wal_replay_lsn())           THEN 0         ELSE EXTRACT(epoch FROM (pg_catalog.clock_timestamp() - pg_catalog.pg_last_xact_replay_timestamp()))::INT           END         AS lag_seconds
DEBUG: lag is 0
INFO: replication lag on this standby is 0 seconds
DEBUG: get_all_node_records():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached     FROM repmgr.nodes n ORDER BY n.node_id
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
NOTICE: attempting to pause repmgrd on 2 nodes
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.30 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=10.79.21.29 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: set_config():
  SET synchronous_commit TO 'local'
NOTICE: local node "node2" (ID: 2) would be promoted to primary; current primary "node1" (ID: 1) would be demoted to standby
DEBUG: remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 10.79.21.30 /usr/local/pgsql/bin/repmgr -f /etc/repmgr.conf  -L debug node service --terse -LERROR --list-actions --action=stop
DEBUG: remote_command(): output returned was:
/usr/local/pgsql/bin/pg_ctl  -D /home/storage/pgsql/data -l /home/storage/pgsql/data/server.log stop

INFO: following shutdown command would be run on node "node1":
  "/usr/local/pgsql/bin/pg_ctl  -D /home/storage/pgsql/data -l /home/storage/pgsql/data/server.log stop"
INFO: parameter "shutdown_check_timeout" is set to 60 seconds
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
INFO: prerequisites for executing STANDBY SWITCHOVER are met
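
The dry run and the real switchover can be tied together so that the real command is attempted only when the checks pass. This is a sketch under the assumption that repmgr exits with a non-zero status when the --dry-run checks fail; the function name switchover_with_precheck is hypothetical.

```shell
# Run the --dry-run checks first; perform the switchover only if they pass.
# $1 = path to repmgr.conf (defaults to /etc/repmgr.conf)
switchover_with_precheck() {
  conf=${1:-/etc/repmgr.conf}

  if ! repmgr -f "$conf" standby switchover --dry-run; then
    echo "dry run failed; not attempting the switchover" >&2
    return 1
  fi

  repmgr -f "$conf" standby switchover
}

# Example, run on the promotion candidate:
#   switchover_with_precheck /etc/repmgr.conf
```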

Execute the switchover

repmgr -f /etc/repmgr.conf standby switchover 

$repmgr -f /etc/repmgr.conf standby switchover
NOTICE: executing switchover on node "node2" (ID: 2)
NOTICE: attempting to pause repmgrd on 2 nodes
NOTICE: local node "node2" (ID: 2) will be promoted to primary; current primary "node1" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "node1" (ID: 1)
NOTICE: issuing CHECKPOINT on node "node1" (ID: 1)
DETAIL: executing server command "/usr/local/pgsql/bin/pg_ctl  -D '/home/storage/pgsql/data' -W -m fast stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/10000028
NOTICE: promoting standby to primary
DETAIL: promoting server "node2" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node2" (ID: 2) was successfully promoted to primary
NOTICE: node "node2" (ID: 2) promoted to primary, node "node1" (ID: 1) demoted to standby
NOTICE: switchover was successful
DETAIL: node "node2" is now primary and node "node1" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully

Check the status again: the switchover succeeded

$repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | 100      | 1        | host=10.79.21.30 port=5432 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 2        | host=10.79.21.29 port=5432 user=repmgr dbname=repmgr connect_timeout=2

Error 1: ERROR: unable to execute "repmgr" on "10.79.21.30"

Cause: the pg_bindir parameter is not set

Fix: add the pg_bindir parameter to the configuration file

$repmgr -f /etc/repmgr.conf standby switchover
NOTICE: executing switchover on node "node2" (ID: 2)
ERROR: unable to execute "repmgr" on "10.79.21.30"
HINT: check "pg_bindir" is set to the correct path in "repmgr.conf"; current value: (not set)
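
Whether pg_bindir is set can be checked before attempting the switchover. The sketch below reads a key='value' or key=value line from repmgr.conf; the helper names conf_value and check_pg_bindir are hypothetical, and trailing comments on the same line are not handled. In this post's environment the expected value would be /usr/local/pgsql/bin, the directory seen in the log output.

```shell
# Print the value of parameter $1 from config file $2 (last occurrence wins).
# Commented-out lines are ignored; surrounding single quotes are stripped.
conf_value() {
  sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2" |
    tail -n 1 |
    sed "s/^'\(.*\)'[[:space:]]*\$/\1/"
}

# Return 0 only if pg_bindir is set and points to an existing directory.
check_pg_bindir() {
  conf=${1:-/etc/repmgr.conf}
  bindir=$(conf_value pg_bindir "$conf")

  if [ -z "$bindir" ]; then
    echo "pg_bindir is not set in $conf"
    return 1
  fi
  if [ ! -d "$bindir" ]; then
    echo "pg_bindir points to a missing directory: $bindir"
    return 1
  fi
  echo "pg_bindir=$bindir"
}

# Example:
#   check_pg_bindir /etc/repmgr.conf
```

The check only covers the local node; the same setting is needed in repmgr.conf on the other node as well.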

Error 2: ERROR: shutdown of the primary server could not be confirmed

repmgr -f /etc/repmgr.conf standby switchover
NOTICE: executing switchover on node "node2" (ID: 2)
NOTICE: attempting to pause repmgrd on 2 nodes
NOTICE: local node "node2" (ID: 2) will be promoted to primary; current primary "node1" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "node1" (ID: 1)
NOTICE: issuing CHECKPOINT on node "node1" (ID: 1)
DETAIL: executing server command "pg_ctl  -D /home/storage/pgsql/data -l /home/storage/pgsql/data/server.log stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
...
INFO: checking for primary shutdown; 60 of 60 attempts ("shutdown_check_timeout")
ERROR: shutdown of the primary server could not be confirmed
HINT: check the primary server status before performing any further actions

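Fix:

Use absolute paths in the service command parameters in repmgr.conf. In the output above the stop command is invoked as a bare pg_ctl, so presumably the same applies to service_stop_command. For example: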

service_start_command='/usr/local/pgsql/bin/pg_ctl  -D /home/storage/pgsql/data -l /home/storage/pgsql/data/server.log start'
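
To spot service commands that still rely on PATH, the configuration file can be scanned for service_*_command values that do not start with "/". This is a rough heuristic sketch with a hypothetical function name; it also flags commands that begin with a wrapper such as sudo, which may be intentional.

```shell
# Print every service_*_command line in the given repmgr.conf whose command
# does not begin with an absolute path. Returns 0 if any such line was found,
# 1 if none were found.
relative_service_commands() {
  conf=${1:-/etc/repmgr.conf}
  # After "=", skip an optional opening quote; the next character must not be "/".
  grep -E "^[[:space:]]*service_[a-z]+_command[[:space:]]*=[[:space:]]*'?[^/'[:space:]]" "$conf"
}

# Example:
#   if relative_service_commands /etc/repmgr.conf; then
#     echo "the commands listed above do not use absolute paths"
#   fi
```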
