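First, install PostgreSQL on every node. The installation steps are covered in the previous article, PostgreSQL Streaming Replication Configuration. Do not initialize the database on the standby nodes yet.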
On every node, map the IP addresses to host aliases by adding the following entries to /etc/hosts:
192.168.86.134 node1
192.168.86.137 node2
192.168.86.138 node3
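Because the same mapping has to be added on every node, it may help to make the step repeatable. The following is a minimal sketch, not part of any tool: the function name add_host_entry is hypothetical, and it assumes an alias counts as already mapped if it appears on any non-comment line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append "ip name" to a hosts-style file unless the
# alias is already mapped on some non-comment line.
add_host_entry() {
    local file="$1" ip="$2" name="$3"
    touch "$file" || return 1
    if ! awk -v n="$name" '
            /^[ \t]*#/ { next }                                   # skip comment lines
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }  # alias fields follow the IP
            END { exit !found }                                   # exit 0 only if the alias exists
        ' "$file"; then
        printf '%s %s\n' "$ip" "$name" >> "$file"
    fi
}

# Usage (as root, on every node):
#   add_host_entry /etc/hosts 192.168.86.134 node1
#   add_host_entry /etc/hosts 192.168.86.137 node2
#   add_host_entry /etc/hosts 192.168.86.138 node3
```

Running the helper twice with the same arguments leaves the file unchanged, so it can be rerun safely on every node.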
Run visudo and add the following line to the file:
postgres ALL=(ALL) NOPASSWD:ALL
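On every node, as root, set the password of the postgres user to postgres:
[root@node1 ~]# passwd postgres
Changing password for user postgres.
New password:
BAD PASSWORD: The password contains the user name in some form
Retype new password:
passwd: all authentication tokens updated successfully.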
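Configure passwordless SSH login between all nodes. Taking node1 as an example, run the following commands in order:
su postgres # switch to the postgres user
ssh-keygen # generate a public/private key pair
ssh-copy-id node1 # install the public key on node1
ssh-copy-id node2 # install the public key on node2
ssh-copy-id node3 # install the public key on node3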
Edit the SSH configuration file /etc/ssh/sshd_config and enable public-key authentication:
PubkeyAuthentication yes
Switch to root and restart the SSH service:
systemctl restart sshd
Switch back to the postgres user and test the remote connections:
ssh node1
ssh node2
ssh node3
If every connection succeeds without prompting for a password, the configuration works.
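The manual check can also be scripted. The sketch below relies on OpenSSH's BatchMode option, which makes ssh fail instead of prompting for a password; the function name check_passwordless_ssh is illustrative, and the 5-second timeout is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report, for each given host, whether key-based login
# works without any interactive prompt.
check_passwordless_ssh() {
    local node failed=0
    for node in "$@"; do
        # BatchMode=yes disables password prompts, so a missing key is an error.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true; then
            echo "$node: ok"
        else
            echo "$node: FAILED"
            failed=1
        fi
    done
    return "$failed"
}

# Usage (as the postgres user):
#   check_passwordless_ssh node1 node2 node3
```

The function returns a non-zero status if any node fails, which makes it usable as a gate in a larger setup script.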
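Common problems when configuring passwordless SSH:
Check the ownership and permissions of the directory that holds the keys. It must belong to the postgres user, and sshd ignores the keys when the permissions are too open: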
chown -R postgres:postgres /var/lib/pgsql
chmod 700 /var/lib/pgsql/.ssh
chmod 600 /var/lib/pgsql/.ssh/authorized_keys
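To verify the result rather than just apply it, a small check along the following lines can be used. This is a sketch that assumes GNU stat (the -c option), as shipped with CentOS; the function name check_ssh_perms is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that <home>/.ssh is mode 700 and that
# <home>/.ssh/authorized_keys is mode 600.
check_ssh_perms() {
    local home_dir="$1" bad=0 dir_mode file_mode
    # stat -c '%a' prints the octal permission bits (GNU coreutils).
    dir_mode=$(stat -c '%a' "$home_dir/.ssh") || return 2
    file_mode=$(stat -c '%a' "$home_dir/.ssh/authorized_keys") || return 2
    if [ "$dir_mode" != "700" ]; then
        echo ".ssh has mode $dir_mode, expected 700"
        bad=1
    fi
    if [ "$file_mode" != "600" ]; then
        echo "authorized_keys has mode $file_mode, expected 600"
        bad=1
    fi
    return "$bad"
}

# Usage:
#   check_ssh_perms /var/lib/pgsql
```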
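If the logs show nothing wrong, yet connections to the sshd started at boot still prompt for a password while a manually started sshd allows passwordless login, the cause is that the .ssh directory lacks the SELinux ssh_home_t label. Reset the label with: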
restorecon -r -vv /var/lib/pgsql/.ssh
Install repmgr on every node. Its version must match the PostgreSQL version. Run the following commands in order:
curl https://dl.2ndquadrant.com/default/release/get/14/rpm | sudo bash
yum install repmgr14
yum install -y rsync
Edit /var/lib/pgsql/14/data/postgresql.conf and set the following parameters:
listen_addresses = '*'
max_wal_senders = 10
max_replication_slots = 10
wal_level = replica # 'hot_standby' is a legacy value that PostgreSQL still accepts and maps to replica
hot_standby = on
archive_mode = always
archive_command = '/bin/true'
shared_preload_libraries = 'repmgr'
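Editing these parameters by hand on three nodes is error-prone, so the step can be scripted. The following is a rough sketch: set_conf is a hypothetical helper, it only recognizes uncommented assignments, and it appends the setting when none is found. Commented-out defaults are left untouched.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set "key = value" in a postgresql.conf-style file.
# An existing uncommented assignment is replaced; otherwise one is appended.
# Note: awk interprets backslash escapes in the value passed with -v.
set_conf() {
    local file="$1" key="$2" value="$3" tmp
    tmp=$(mktemp) || return 1
    awk -v k="$key" -v v="$value" '
        # Active assignment of this key: emit the new setting once, drop duplicates.
        $0 ~ ("^[ \t]*" k "[ \t]*=") { if (!done) print k " = " v; done = 1; next }
        { print }
        # No active assignment was found: append one at the end.
        END { if (!done) print k " = " v }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps the file owner and mode
    local rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Usage (as the postgres user):
#   conf=/var/lib/pgsql/14/data/postgresql.conf
#   set_conf "$conf" listen_addresses "'*'"
#   set_conf "$conf" max_wal_senders 10
#   set_conf "$conf" shared_preload_libraries "'repmgr'"
```

Changing shared_preload_libraries only takes effect after the PostgreSQL service is restarted.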
Edit /var/lib/pgsql/14/data/pg_hba.conf to configure client authentication, appending the following at the end of the file:
local all postgres peer
local replication repmgr trust
host replication repmgr 127.0.0.1/32 trust
host replication repmgr 192.168.86.0/24 trust
local repmgr repmgr trust
host repmgr repmgr 127.0.0.1/32 trust
host repmgr repmgr 192.168.86.0/24 trust
After completing the configuration above, start the PostgreSQL service:
systemctl start postgresql-14
Edit the repmgr configuration file /etc/repmgr/14/repmgr.conf and append the following:
node_id=1
node_name='node1'
conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr'
data_directory='/var/lib/pgsql/14/data'
failover=automatic
promote_command='/usr/pgsql-14/bin/repmgr standby promote -f /etc/repmgr/14/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-14/bin/repmgr standby follow -f /etc/repmgr/14/repmgr.conf --log-to-file --upstream-node-id=%n'
service_start_command = 'sudo systemctl start postgresql-14'
service_stop_command = 'sudo systemctl stop postgresql-14'
service_restart_command = 'sudo systemctl restart postgresql-14'
service_reload_command = 'sudo systemctl reload postgresql-14'
repmgrd_pid_file='/tmp/repmgrd.pid'
log_file='/tmp/repmgrd.log'
priority=100
Switch to the postgres user, then create the repmgr user and the repmgr database:
su - postgres
createuser -s repmgr
createdb repmgr -O repmgr
Connect to the database as the postgres user and set the password of the repmgr user to repmgr:
[root@localhost ~]# su - postgres
-bash-4.2$ psql
psql (14.5)
Type "help" for help.
postgres=# ALTER USER repmgr ENCRYPTED PASSWORD 'repmgr';
ALTER ROLE
postgres=# exit
-bash-4.2$ exit
logout
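As the postgres user, register the primary node with the cluster. (PATH is not configured on this machine, so the commands are run from the binary directory.)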
cd /usr/pgsql-14/bin/
./repmgr -f /etc/repmgr/14/repmgr.conf primary register
Once registration completes, the primary is fully configured. You can check the result:
-bash-4.2$ ./repmgr cluster show -f /etc/repmgr/14/repmgr.conf
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 1 | host=node1 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr
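Do not initialize the database on a standby after installing it. If it has already been initialized, empty the corresponding data directory.
Test that the standby can connect to the primary, taking node3 as an example:
[root@node3 ~]# su - postgres
Last login: Tue Nov  8 09:23:06 CST 2022 from node3 on pts/1
-bash-4.2$ psql 'host=node1 user=repmgr dbname=repmgr connect_timeout=2'
Password for user repmgr:
psql (14.5)
Type "help" for help.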
repmgr=#
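Edit the repmgr configuration file /etc/repmgr/14/repmgr.conf on the standby and append the following. The node_id and node_name values must be unique for each node:
node_id=3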
node_name='node3'
conninfo='host=node3 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr'
data_directory='/var/lib/pgsql/14/data'
failover=automatic
promote_command='/usr/pgsql-14/bin/repmgr standby promote -f /etc/repmgr/14/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-14/bin/repmgr standby follow -f /etc/repmgr/14/repmgr.conf --log-to-file --upstream-node-id=%n'
service_start_command = 'sudo systemctl start postgresql-14'
service_stop_command = 'sudo systemctl stop postgresql-14'
service_restart_command = 'sudo systemctl restart postgresql-14'
service_reload_command = 'sudo systemctl reload postgresql-14'
repmgrd_pid_file='/tmp/repmgrd.pid'
log_file='/tmp/repmgrd.log'
priority=100
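The per-node files differ only in node_id, node_name, and the host in conninfo, so they can be generated from one template. The sketch below assumes, as in this article, that the node name doubles as the host alias; render_repmgr_conf is a hypothetical helper, and the settings are copied from the configuration shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the repmgr.conf settings for one node.
# Assumes the node name is also its resolvable host alias (node1, node2, ...).
render_repmgr_conf() {
    local node_id="$1" node_name="$2"
    cat <<EOF
node_id=${node_id}
node_name='${node_name}'
conninfo='host=${node_name} user=repmgr dbname=repmgr connect_timeout=2 password=repmgr'
data_directory='/var/lib/pgsql/14/data'
failover=automatic
promote_command='/usr/pgsql-14/bin/repmgr standby promote -f /etc/repmgr/14/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-14/bin/repmgr standby follow -f /etc/repmgr/14/repmgr.conf --log-to-file --upstream-node-id=%n'
service_start_command = 'sudo systemctl start postgresql-14'
service_stop_command = 'sudo systemctl stop postgresql-14'
service_restart_command = 'sudo systemctl restart postgresql-14'
service_reload_command = 'sudo systemctl reload postgresql-14'
repmgrd_pid_file='/tmp/repmgrd.pid'
log_file='/tmp/repmgrd.log'
priority=100
EOF
}

# Usage (as root, on node2):
#   render_repmgr_conf 2 node2 >> /etc/repmgr/14/repmgr.conf
```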
Attempt to clone the server with the --dry-run option:
cd /usr/pgsql-14/bin/
./repmgr -h node1 -U repmgr -d repmgr -f /etc/repmgr/14/repmgr.conf standby clone --dry-run
In practice this step failed, with an error saying that no password was supplied:
-bash-4.2$ cd /usr/pgsql-14/bin/
-bash-4.2$ ./repmgr -h node1 -U repmgr -d repmgr -f /etc/repmgr/14/repmgr.conf standby clone --dry-run
NOTICE: destination directory "/var/lib/pgsql/14/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=node1 user=repmgr dbname=repmgr
ERROR: connection to database failed
DETAIL:
connection to server at "node1" (192.168.86.134), port 5432 failed: fe_sendauth: no password supplied
Fix: create a .pgpass file in the postgres user's home directory to hold the password:
cd ~
touch .pgpass
vi ~/.pgpass
chmod 0600 ~/.pgpass
Write the following into the new file:
#hostname:port:database:username:password
node1:5432:repmgr:repmgr:repmgr
Also add the path of the password file to the repmgr configuration file:
passfile='/var/lib/pgsql/.pgpass'
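The .pgpass steps can be combined into one repeatable helper. This is a sketch with a hypothetical function name, add_pgpass_entry. It sets mode 0600 before writing, because libpq ignores a password file whose permissions allow group or world access.

```shell
#!/usr/bin/env bash
# Hypothetical helper: make sure a .pgpass-style file exists with mode 0600
# and contains the given "hostname:port:database:username:password" line.
add_pgpass_entry() {
    local file="$1" entry="$2"
    touch "$file" || return 1
    chmod 0600 "$file" || return 1
    # -x matches the whole line and -F treats the entry as a literal string.
    grep -qxF -- "$entry" "$file" || printf '%s\n' "$entry" >> "$file"
}

# Usage (as the postgres user, on each standby):
#   add_pgpass_entry ~/.pgpass 'node1:5432:repmgr:repmgr:repmgr'
```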
Try again:
cd /usr/pgsql-14/bin/
./repmgr -h node1 -U repmgr -d repmgr -f /etc/repmgr/14/repmgr.conf standby clone --dry-run
If the dry run reports no errors, run the actual clone:
./repmgr -h node1 -U repmgr -d repmgr -f /etc/repmgr/14/repmgr.conf standby clone
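The dry run and the real clone can be chained so that the clone only starts when the dry run succeeds. The sketch below uses the same repmgr arguments as above; the function name clone_standby and the REPMGR and CONF variables are illustrative conveniences, not repmgr features.

```shell
#!/usr/bin/env bash
# Paths used in this article; both can be overridden from the environment.
REPMGR="${REPMGR:-/usr/pgsql-14/bin/repmgr}"
CONF="${CONF:-/etc/repmgr/14/repmgr.conf}"

# Hypothetical helper: clone from the given primary host, but only after
# "standby clone --dry-run" has succeeded.
clone_standby() {
    local primary="$1"
    if ! "$REPMGR" -h "$primary" -U repmgr -d repmgr -f "$CONF" standby clone --dry-run; then
        echo "dry run failed; not cloning" >&2
        return 1
    fi
    "$REPMGR" -h "$primary" -U repmgr -d repmgr -f "$CONF" standby clone
}

# Usage (as the postgres user, on the standby):
#   clone_standby node1
```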
After the clone completes, start the PostgreSQL service:
sudo systemctl start postgresql-14
Once the database is running, register the node with the cluster as a standby:
./repmgr -f /etc/repmgr/14/repmgr.conf standby register
Follow the procedure above to register both node2 and node3, then check the cluster status:
-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 1 | host=node1 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr
2 | node2 | standby | running | node1 | default | 100 | 1 | host=node2 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr
3 | node3 | standby | running | node1 | default | 100 | 1 | host=node3 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr
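At this point you can insert or delete data on the primary node1 and check that the changes appear on the standbys.
Start the repmgrd daemon on each of the three nodes: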
./repmgrd -f /etc/repmgr/14/repmgr.conf
-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf node status
Node "node3":
PostgreSQL version: 14.5
Total data size: 34 MB
Conninfo: host=node3 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr
Role: standby
WAL archiving: enabled
Archive command: /bin/true
WALs pending archiving: 0 pending files
Replication connections: 0 (of maximal 10)
Replication slots: 0 physical (of maximal 10; 0 missing)
Upstream node: node1 (ID: 1)
Replication lag: 0 seconds
Last received LSN: 0/D001078
Last replayed LSN: 0/D001078
-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf node check
Node "node3":
Server role: OK (node is standby)
Replication lag: OK (0 seconds)
WAL archiving: OK (0 pending archive ready files)
Upstream connection: OK (node "node3" (ID: 3) is attached to expected upstream node "node1" (ID: 1))
Downstream servers: OK (this node has no downstream nodes)
Replication slots: OK (node has no physical replication slots)
Missing physical replication slots: OK (node has no missing physical replication slots)
Configured data directory: OK (configured "data_directory" is "/var/lib/pgsql/14/data")
-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 1 | host=node1 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr
2 | node2 | standby | running | node1 | default | 100 | 1 | host=node2 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr
3 | node3 | standby | running | node1 | default | 100 | 1 | host=node3 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr
-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+------+---------+--------------------
1 | node1 | primary | * running | | running | 2133 | no | n/a
2 | node2 | standby | running | node1 | running | 2088 | no | 1 second(s) ago
3 | node3 | standby | running | node1 | running | 2002 | no | 1 second(s) ago
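The repmgrd service can be paused from any node, typically for database maintenance. While it is paused the cluster is static: stopping the primary does not trigger an automatic failover.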
-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf service pause
NOTICE: node 1 (node1) paused
NOTICE: node 2 (node2) paused
NOTICE: node 3 (node3) paused
Checking the service status now shows yes in the Paused? column for every node:
-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+------+---------+--------------------
1 | node1 | primary | * running | | running | 2133 | yes | n/a
2 | node2 | standby | running | node1 | running | 2088 | yes | 0 second(s) ago
3 | node3 | standby | running | node1 | running | 2002 | yes | 1 second(s) ago
Use the following command to resume the service:
./repmgr -f /etc/repmgr/14/repmgr.conf service unpause
Stop the PostgreSQL service on the primary node node1 to simulate a database failure:
sudo systemctl stop postgresql-14
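After the primary stops, check the cluster service status from the standby node2. The primary is now reported as unreachable, and the cluster is trying to reconnect to it: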
-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+-------+---------+---------------+----------+---------+------+---------+--------------------
1 | node1 | primary | ? unreachable | ? | n/a | n/a | n/a | n/a
2 | node2 | standby | running | ? node1 | running | 2088 | no | 51 second(s) ago
3 | node3 | standby | running | ? node1 | running | 2002 | no | 50 second(s) ago
WARNING: following issues were detected
- unable to connect to node "node1" (ID: 1)
- node "node1" (ID: 1) is registered as an active primary but is unreachable
- unable to connect to node "node2" (ID: 2)'s upstream node "node1" (ID: 1)
- unable to determine if node "node2" (ID: 2) is attached to its upstream node "node1" (ID: 1)
- unable to connect to node "node3" (ID: 3)'s upstream node "node1" (ID: 1)
- unable to determine if node "node3" (ID: 3) is attached to its upstream node "node1" (ID: 1)
HINT: execute with --verbose option to see connection error messages
Check the cluster status again after a while. The old primary is now reported as failed, and the cluster has elected node2 as the new primary, completing the reconfiguration:
-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+------+---------+--------------------
1 | node1 | primary | - failed | ? | n/a | n/a | n/a | n/a
2 | node2 | primary | * running | | running | 2088 | no | n/a
3 | node3 | standby | running | node2 | running | 2002 | no | 0 second(s) ago
WARNING: following issues were detected
- unable to connect to node "node1" (ID: 1)
HINT: execute with --verbose option to see connection error messages
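If the service on node1 is repaired and restarted, its status changes to ! running and it is no longer managed by the cluster. Use the unregister and register commands described below to remove it from the cluster and register it again: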
-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+------+---------+--------------------
1 | node1 | primary | ! running | | running | 2133 | no | n/a
2 | node2 | primary | * running | | running | 2088 | no | n/a
3 | node3 | standby | running | node2 | running | 2002 | no | 1 second(s) ago
WARNING: following issues were detected
- node "node1" (ID: 1) is running but the repmgr node record is inactive
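The status tables above can also be checked mechanically, for example from a monitoring script. The sketch below parses the human-readable table exactly as printed in this article, so it is only as stable as that layout; list_unhealthy_nodes is a hypothetical helper that treats any status other than "running" or "* running" as unhealthy.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "repmgr service status" (or "cluster show") table
# output on stdin and print "name: status" for every node that is not healthy.
list_unhealthy_nodes() {
    awk -F'|' '
        # Data rows are the ones whose first column is a numeric node ID.
        $1 ~ /^[ \t]*[0-9]+[ \t]*$/ {
            name = $2
            status = $4
            gsub(/^[ \t]+|[ \t]+$/, "", name)     # trim surrounding whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", status)
            if (status != "running" && status != "* running")
                print name ": " status
        }
    '
}

# Usage:
#   /usr/pgsql-14/bin/repmgr -f /etc/repmgr/14/repmgr.conf service status | list_unhealthy_nodes
```

Fed the last table above, the helper prints a single line for node1 with the status ! running.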
Remove a primary node from the cluster:
./repmgr -f /etc/repmgr/14/repmgr.conf primary unregister --node-id=1
Remove a standby node from the cluster:
./repmgr standby unregister -f /etc/repmgr/14/repmgr.conf --node-id=3
Add a standby node to the cluster. After configuring the new node, run the following on it:
./repmgr -f /etc/repmgr/14/repmgr.conf standby register
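Tip: after the primary goes down, you can unregister it, empty its data directory, and then repeat the standby setup procedure to register it again, turning the former primary into a standby.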