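First, install PostgreSQL on every node. The installation steps are covered in the previous article, "PostgreSQL streaming replication configuration". Do not initialize the database on the standby nodes yet.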
On every node, map the IP addresses to host names: edit /etc/hosts and add entries for node1, node2 and node3.
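The mapping step can be sketched as a small helper that appends only the missing entries. The addresses are placeholders from the 192.0.2.0/24 documentation range and the function names are made up for this example; substitute the real addresses of your nodes.

```shell
# Print /etc/hosts entries for the three nodes.
# The addresses are placeholders (documentation range 192.0.2.0/24),
# not values from the original setup.
hosts_entries() {
    printf '%s %s\n' \
        192.0.2.11 node1 \
        192.0.2.12 node2 \
        192.0.2.13 node3
}

# Append only the entries whose host name is not already in the given file.
add_hosts_entries() {
    hosts_file=$1
    hosts_entries | while read -r ip name; do
        # -w matches whole words, so "node1" does not match "node10".
        if ! grep -qw "$name" "$hosts_file"; then
            printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
        fi
    done
}
```

Run as root on each node, for example: add_hosts_entries /etc/hosts. Running it twice does not duplicate entries.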
On every node, as root, set the password of the postgres user to postgres
[root@node1 ~]# passwd postgres
Changing password for user postgres.
New password:
BAD PASSWORD: The password contains the user name in some form
Retype new password:
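Set up passwordless SSH login between all nodes. Taking node1 as the example, run the following in order
su postgres # switch to the postgres user
ssh-keygen # generate a private/public key pair
ssh-copy-id node1 # copy the public key to node1
ssh-copy-id node2 # copy the public key to node2
ssh-copy-id node3 # copy the public key to node3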
Edit the SSH server configuration file /etc/ssh/sshd_config and enable public key authentication
PubkeyAuthentication yes
Switch to root and restart the SSH service
systemctl restart sshd
Switch back to the postgres user and test the remote connections
ssh node1
ssh node2
ssh node3
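The manual test above can also be scripted. The following is a minimal sketch (the function name is made up for this example) that relies on the standard OpenSSH options BatchMode and ConnectTimeout, so a node that would prompt for a password is reported as a failure instead of hanging.

```shell
# Check that the current user can reach each node over SSH without a password.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_passwordless_ssh() {
    failed=0
    for node in "$@"; do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true </dev/null; then
            echo "$node: ok"
        else
            echo "$node: FAILED"
            failed=1
        fi
    done
    # Non-zero exit status if any node could not be reached.
    return "$failed"
}
```

Run it as the postgres user on each node, for example: check_passwordless_ssh node1 node2 node3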
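Common problems with passwordless SSH:
Check the ownership and permissions of the directory that holds the keys. It must belong to the postgres user, and sshd rejects keys if the permissions are too open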
chown -R postgres:postgres /var/lib/pgsql
chmod 700 /var/lib/pgsql/.ssh
chmod 600 /var/lib/pgsql/.ssh/authorized_keys
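To confirm the result, the modes can be read back with stat. This is a sketch that assumes GNU stat (the -c option), as shipped on CentOS/RHEL; the function name is made up for this example.

```shell
# Report whether .ssh and authorized_keys under a home directory have the
# modes set above (700 and 600). Requires GNU stat for the -c option.
check_ssh_perms() {
    ssh_dir=$1/.ssh
    rc=0
    dir_mode=$(stat -c '%a' "$ssh_dir") || return 2
    key_mode=$(stat -c '%a' "$ssh_dir/authorized_keys") || return 2
    if [ "$dir_mode" != 700 ]; then
        echo "$ssh_dir: mode $dir_mode, expected 700"
        rc=1
    fi
    if [ "$key_mode" != 600 ]; then
        echo "$ssh_dir/authorized_keys: mode $key_mode, expected 600"
        rc=1
    fi
    return "$rc"
}
```

For the layout used here: check_ssh_perms /var/lib/pgsql prints nothing and returns 0 when both modes are correct.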
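If the logs show no errors, yet connections through the sshd started at boot still ask for a password while a manually started sshd allows passwordless login, the cause is that the .ssh directory lacks the SELinux label ssh_home_t. Reset it with the following command: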
restorecon -r -vv /var/lib/pgsql/.ssh
Install repmgr on every node. Its version must match the PostgreSQL version. Run the following commands in order
curl https://dl.2ndquadrant.com/default/release/get/14/rpm | sudo bash
yum install repmgr14
yum install -y rsync
On the primary node, set the following parameters in postgresql.conf
listen_addresses = '*'
max_wal_senders = 10
max_replication_slots = 10
wal_level = replica
hot_standby = on
archive_mode = always
archive_command = '/bin/true'
shared_preload_libraries = 'repmgr'
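In pg_hba.conf, allow the repmgr user to connect to the repmgr database and to open replication connections (each host entry also needs an address column matching your network)
local all postgres peer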
local replication repmgr trust
host replication repmgr trust
host replication repmgr trust
local repmgr repmgr trust
host repmgr repmgr trust
host repmgr repmgr trust
After completing the configuration above, start the PostgreSQL service
systemctl start postgresql-14
Edit the repmgr configuration file /etc/repmgr/14/repmgr.conf
conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr'
promote_command='/usr/pgsql-14/bin/repmgr standby promote -f /etc/repmgr/14/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-14/bin/repmgr standby follow -f /etc/repmgr/14/repmgr.conf --log-to-file --upstream-node-id=%n'
service_start_command = 'sudo systemctl start postgresql-14'
service_stop_command = 'sudo systemctl stop postgresql-14'
service_restart_command = 'sudo systemctl restart postgresql-14'
service_reload_command = 'sudo systemctl reload postgresql-14'
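Because the file differs between nodes only in the node identity and the host in conninfo, it can be generated. The sketch below is an illustration, not the original file: node_id, node_name, data_directory and failover are not shown in the listing above. repmgr requires the first three, and repmgrd needs failover='automatic' to promote a standby on its own. The data directory is the one reported later in this walkthrough, and the function name is made up for this example.

```shell
# Print a repmgr.conf for one node.
# Usage: render_repmgr_conf NODE_ID NODE_NAME
render_repmgr_conf() {
    node_id=$1
    node_name=$2
    cat <<EOF
# node identity (not in the listing above; required by repmgr)
node_id=$node_id
node_name='$node_name'
conninfo='host=$node_name user=repmgr dbname=repmgr connect_timeout=2 password=repmgr'
data_directory='/var/lib/pgsql/14/data'
# needed by repmgrd for automatic failover (not in the listing above)
failover='automatic'
promote_command='/usr/pgsql-14/bin/repmgr standby promote -f /etc/repmgr/14/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-14/bin/repmgr standby follow -f /etc/repmgr/14/repmgr.conf --log-to-file --upstream-node-id=%n'
service_start_command='sudo systemctl start postgresql-14'
service_stop_command='sudo systemctl stop postgresql-14'
service_restart_command='sudo systemctl restart postgresql-14'
service_reload_command='sudo systemctl reload postgresql-14'
EOF
}
```

For example, on node1: render_repmgr_conf 1 node1 > /etc/repmgr/14/repmgr.conf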
Switch to the postgres user and create the repmgr user and the repmgr database
su - postgres
createuser -s repmgr
createdb repmgr -O repmgr
Connect to the database as the postgres user and set the password of the repmgr user to repmgr
[root@localhost ~]# su - postgres
-bash-4.2$ psql
psql (14.5)
Type "help" for help.
postgres=# ALTER USER repmgr ENCRYPTED PASSWORD 'repmgr';
postgres=# exit
-bash-4.2$ exit
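As the postgres user, register the primary node with the cluster (PATH is not configured on this machine, so the commands are run from the directory that contains the binaries)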
cd /usr/pgsql-14/bin/
./repmgr -f /etc/repmgr/14/repmgr.conf primary register
-bash-4.2$ ./repmgr cluster show -f /etc/repmgr/14/repmgr.conf
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
1 | node1 | primary | * running | | default | 100 | 1 | host=node1 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr
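Do not initialize the database on a standby after installing it. If it has already been initialized, empty its data directory.
Verify that the standby can connect to the primary, taking node3 as the example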
[root@node3 ~]# su - postgres
Last login: Tue Nov  8 09:23:06 CST 2022 from node3 on pts/1
-bash-4.2$ psql 'host=node1 user=repmgr dbname=repmgr connect_timeout=2'
Password for user repmgr:
psql (14.5)
Type "help" for help.
Edit the repmgr configuration file on the standby: /etc/repmgr/14/repmgr.conf
conninfo='host=node3 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr'
promote_command='/usr/pgsql-14/bin/repmgr standby promote -f /etc/repmgr/14/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-14/bin/repmgr standby follow -f /etc/repmgr/14/repmgr.conf --log-to-file --upstream-node-id=%n'
service_start_command = 'sudo systemctl start postgresql-14'
service_stop_command = 'sudo systemctl stop postgresql-14'
service_restart_command = 'sudo systemctl restart postgresql-14'
service_reload_command = 'sudo systemctl reload postgresql-14'
As the postgres user on the standby, test cloning from the primary with --dry-run
cd /usr/pgsql-14/bin/
./repmgr -h node1 -U repmgr -d repmgr -f /etc/repmgr/14/repmgr.conf standby clone --dry-run
-bash-4.2$ cd /usr/pgsql-14/bin/
-bash-4.2$ ./repmgr -h node1 -U repmgr -d repmgr -f /etc/repmgr/14/repmgr.conf standby clone --dry-run
NOTICE: destination directory "/var/lib/pgsql/14/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=node1 user=repmgr dbname=repmgr
ERROR: connection to database failed
connection to server at "node1" (, port 5432 failed: fe_sendauth: no password supplied
Fix for the error above: create a .pgpass file in the postgres user's home directory and put the password in it
cd ~
touch .pgpass
vi ~/.pgpass
chmod 0600 ~/.pgpass
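The contents of the file are not shown above. As a sketch, each line of a .pgpass file has the form hostname:port:database:username:password, where * matches any value and the database name replication matches streaming replication connections. The entries below assume port 5432 and the repmgr/repmgr credentials created earlier; the function name is made up for this example.

```shell
# Write .pgpass entries for the repmgr user into the given home directory.
# Line format: hostname:port:database:username:password
write_pgpass() {
    home_dir=$1
    pgpass="$home_dir/.pgpass"
    # One entry for the repmgr database, one for replication connections.
    printf '%s\n' \
        '*:5432:repmgr:repmgr:repmgr' \
        '*:5432:replication:repmgr:repmgr' > "$pgpass"
    # libpq ignores the file if group or others can access it.
    chmod 0600 "$pgpass"
}
```

Run it as the postgres user, for example: write_pgpass "$HOME". Note that it overwrites an existing .pgpass file.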
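Also set the path of the password file in the repmgr configuration file (the passfile parameter). Then rerun the dry run and, once it passes, perform the actual clone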
cd /usr/pgsql-14/bin/
./repmgr -h node1 -U repmgr -d repmgr -f /etc/repmgr/14/repmgr.conf standby clone --dry-run
./repmgr -h node1 -U repmgr -d repmgr -f /etc/repmgr/14/repmgr.conf standby clone
When the clone has finished, start the PostgreSQL service
sudo systemctl start postgresql-14
Once the database is running, register the node with the cluster as a standby
./repmgr -f /etc/repmgr/14/repmgr.conf standby register
Repeat the procedure above to register node2 and node3. After both registrations succeed, check the cluster status
-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
1 | node1 | primary | * running | | default | 100 | 1 | host=node1 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr
2 | node2 | standby | running | node1 | default | 100 | 1 | host=node2 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr
3 | node3 | standby | running | node1 | default | 100 | 1 | host=node3 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr
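Scripts often need to know which node is currently the primary. The following sketch parses the table printed by cluster show: columns are separated by |, the third column is the role and the fourth the status, where * running marks the active primary. The function name is made up for this example.

```shell
# Read "repmgr cluster show" output on standard input and print the name of
# the node whose role is "primary" and whose status is "* running".
current_primary() {
    awk -F'|' '
        {
            name = $2; role = $3; status = $4
            # Trim the padding around each column.
            gsub(/^ +| +$/, "", name)
            gsub(/^ +| +$/, "", role)
            gsub(/^ +| +$/, "", status)
            if (role == "primary" && status == "* running") print name
        }'
}
```

For example: ./repmgr -f /etc/repmgr/14/repmgr.conf cluster show | current_primary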
At this point you can insert or delete data on the primary node1 and check that the changes appear on the standbys
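One way to run such a check is sketched below: write a marker row on the primary, wait briefly, and look for it on a standby. The table name repl_probe and the function name are placeholders invented for this example, and the passwords are assumed to come from ~/.pgpass.

```shell
# Write a marker row on the primary and confirm it shows up on a standby.
# Usage: replication_smoke_test PRIMARY_HOST STANDBY_HOST
replication_smoke_test() {
    primary=$1
    standby=$2
    marker="probe-$(date +%s)-$$"
    # repl_probe is a throwaway table used only by this check.
    psql -h "$primary" -U repmgr -d repmgr -q \
        -c "CREATE TABLE IF NOT EXISTS repl_probe (marker text)" \
        -c "INSERT INTO repl_probe VALUES ('$marker')" >/dev/null || return 1
    # Streaming replication is asynchronous by default, so allow a moment.
    sleep 1
    found=$(psql -h "$standby" -U repmgr -d repmgr -At \
        -c "SELECT count(*) FROM repl_probe WHERE marker = '$marker'") || return 1
    # Succeed only if exactly one matching row reached the standby.
    [ "$found" = 1 ]
}
```

For example: replication_smoke_test node1 node3 && echo "replication ok"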
Start the repmgrd daemon on each of the three nodes
./repmgrd -f /etc/repmgr/14/repmgr.conf
-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf node status
Node "node3":
PostgreSQL version: 14.5
Total data size: 34 MB
Conninfo: host=node3 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr
Role: standby
WAL archiving: enabled
Archive command: /bin/true
WALs pending archiving: 0 pending files
Replication connections: 0 (of maximal 10)
Replication slots: 0 physical (of maximal 10; 0 missing)
Upstream node: node1 (ID: 1)
Replication lag: 0 seconds
Last received LSN: 0/D001078
Last replayed LSN: 0/D001078
-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf node check
Node "node3":
Server role: OK (node is standby)
Replication lag: OK (0 seconds)
WAL archiving: OK (0 pending archive ready files)
Upstream connection: OK (node "node3" (ID: 3) is attached to expected upstream node "node1" (ID: 1))
Downstream servers: OK (this node has no downstream nodes)
Replication slots: OK (node has no physical replication slots)
Missing physical replication slots: OK (node has no missing physical replication slots)
Configured data directory: OK (configured "data_directory" is "/var/lib/pgsql/14/data")
-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
1 | node1 | primary | * running | | default | 100 | 1 | host=node1 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr
2 | node2 | standby | running | node1 | default | 100 | 1 | host=node2 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr
3 | node3 | standby | running | node1 | default | 100 | 1 | host=node3 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr
-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
1 | node1 | primary | * running | | running | 2133 | no | n/a
2 | node2 | standby | running | node1 | running | 2088 | no | 1 second(s) ago
3 | node3 | standby | running | node1 | running | 2002 | no | 1 second(s) ago
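repmgrd can be paused from any node, which is typically done for database maintenance. While it is paused the cluster is static: stopping the primary does not trigger an automatic failover.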
-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf service pause
NOTICE: node 1 (node1) paused
NOTICE: node 2 (node2) paused
NOTICE: node 3 (node3) paused
Checking the service status of the nodes now shows yes in the Paused? column
-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
1 | node1 | primary | * running | | running | 2133 | yes | n/a
2 | node2 | standby | running | node1 | running | 2088 | yes | 0 second(s) ago
3 | node3 | standby | running | node1 | running | 2002 | yes | 1 second(s) ago
When maintenance is finished, resume repmgrd with service unpause
./repmgr -f /etc/repmgr/14/repmgr.conf service unpause
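Forgetting to unpause leaves the cluster without automatic failover, so it can help to wrap maintenance in a helper that always unpauses afterwards. This is a sketch; the variables REPMGR and REPMGR_CONF and the function name are introduced for this example and default to the paths used in this walkthrough.

```shell
REPMGR=${REPMGR:-/usr/pgsql-14/bin/repmgr}
REPMGR_CONF=${REPMGR_CONF:-/etc/repmgr/14/repmgr.conf}

# Run a maintenance command with repmgrd paused, then unpause whether or not
# the command succeeded. Returns the exit status of the maintenance command.
with_repmgrd_paused() {
    "$REPMGR" -f "$REPMGR_CONF" service pause || return 1
    "$@" && rc=0 || rc=$?
    "$REPMGR" -f "$REPMGR_CONF" service unpause || return 1
    return "$rc"
}
```

For example, as the postgres user: with_repmgrd_paused sudo systemctl restart postgresql-14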
Stop the PostgreSQL service on the primary node1 to simulate a database failure
sudo systemctl stop postgresql-14
After the primary has stopped, check the service status from the standby node2. The primary now shows as unreachable, and the cluster is trying to reconnect to it
-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
1 | node1 | primary | ? unreachable | ? | n/a | n/a | n/a | n/a
2 | node2 | standby | running | ? node1 | running | 2088 | no | 51 second(s) ago
3 | node3 | standby | running | ? node1 | running | 2002 | no | 50 second(s) ago
WARNING: following issues were detected
- unable to connect to node "node1" (ID: 1)
- node "node1" (ID: 1) is registered as an active primary but is unreachable
- unable to connect to node "node2" (ID: 2)'s upstream node "node1" (ID: 1)
- unable to determine if node "node2" (ID: 2) is attached to its upstream node "node1" (ID: 1)
- unable to connect to node "node3" (ID: 3)'s upstream node "node1" (ID: 1)
- unable to determine if node "node3" (ID: 3) is attached to its upstream node "node1" (ID: 1)
HINT: execute with --verbose option to see connection error messages
Check the status again after a while. The old primary now shows as failed, and the cluster has elected node2 as the new primary, which completes the reconfiguration
-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
1 | node1 | primary | - failed | ? | n/a | n/a | n/a | n/a
2 | node2 | primary | * running | | running | 2088 | no | n/a
3 | node3 | standby | running | node2 | running | 2002 | no | 0 second(s) ago
WARNING: following issues were detected
- unable to connect to node "node1" (ID: 1)
HINT: execute with --verbose option to see connection error messages
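Once node1 has been repaired and restarted, its status changes to ! running and it is no longer managed by the cluster. Use the unregister and register commands below to remove it from the cluster and register it again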
-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
1 | node1 | primary | ! running | | running | 2133 | no | n/a
2 | node2 | primary | * running | | running | 2088 | no | n/a
3 | node3 | standby | running | node2 | running | 2002 | no | 1 second(s) ago
WARNING: following issues were detected
- node "node1" (ID: 1) is running but the repmgr node record is inactive
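The three states seen during the failover (? unreachable, - failed and ! running) can be picked out of the status table automatically, for instance for alerting. The sketch below prints every node whose status is anything other than running or * running; the function name is made up for this example.

```shell
# Read "repmgr service status" (or "cluster show") output on standard input
# and print "name: status" for every node that is not plainly running.
unhealthy_nodes() {
    awk -F'|' '
        NF >= 4 {                      # skip WARNING/HINT lines without columns
            name = $2; status = $4
            gsub(/^ +| +$/, "", name)
            gsub(/^ +| +$/, "", status)
            if (name == "Name") next   # header row
            if (status != "running" && status != "* running")
                print name ": " status
        }'
}
```

For example: ./repmgr -f /etc/repmgr/14/repmgr.conf service status | unhealthy_nodes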
Remove a primary node from the cluster
./repmgr -f /etc/repmgr/14/repmgr.conf primary unregister --node-id=1
Remove a standby node from the cluster
./repmgr standby unregister -f /etc/repmgr/14/repmgr.conf --node-id=3
Add a standby node to the cluster: after the node has been configured, run the following on that node
./repmgr -f /etc/repmgr/14/repmgr.conf standby register
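Tip: when the primary goes down, you can unregister it, empty its data directory, and then repeat the standby setup procedure so that the former primary rejoins the cluster as a standby.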
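Put together, the tip corresponds roughly to the command sequence below. The sketch only prints the commands so they can be reviewed before anything is run. It reuses the paths and commands from this walkthrough; the function name is made up, and emptying the data directory with rm -rf is one possible way to do that step, not something prescribed above.

```shell
# Print (do not run) the commands for turning a failed primary into a standby.
# Usage: rejoin_plan FAILED_NODE_ID NEW_PRIMARY_HOST
rejoin_plan() {
    failed_id=$1
    new_primary=$2
    conf=/etc/repmgr/14/repmgr.conf
    bin=/usr/pgsql-14/bin
    data=/var/lib/pgsql/14/data
    cat <<EOF
# on $new_primary, as postgres: remove the stale primary record
$bin/repmgr -f $conf primary unregister --node-id=$failed_id
# on the failed node: stop PostgreSQL and empty the data directory
sudo systemctl stop postgresql-14
rm -rf $data/*
# on the failed node, as postgres: clone from the new primary and register
$bin/repmgr -h $new_primary -U repmgr -d repmgr -f $conf standby clone
sudo systemctl start postgresql-14
$bin/repmgr -f $conf standby register
EOF
}
```

For the failover shown above: rejoin_plan 1 node2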