[Application] Configuring a PostgreSQL Cluster with Repmgr

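  • Configuring Repmgr
    • Prerequisites
    • Installing repmgr
    • Primary configuration (node1)
    • Standby configuration (node2/node3)
    • Starting the repmgrd daemon
  • Managing the Repmgr cluster
    • node commands
    • cluster commands
    • service commands
    • Pausing and resuming the cluster
    • Handling cluster failures
    • Registering and unregistering nodes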

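Configuring Repmgr

Prerequisites

First install PostgreSQL on every node; the installation steps are covered in the previous article, PostgreSQL Streaming Replication Configuration. Do not initialize the database on the standbys yet.

On every node, map the IP addresses to host aliases by adding the following entries to /etc/hosts: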

192.168.86.134 node1
192.168.86.137 node2
192.168.86.138 node3
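
Because the same three entries have to be added on every node, the step can be scripted. The following is only a minimal sketch: the helper name add_host_entry is hypothetical, and it assumes a grep that supports -w. It appends a mapping only when the alias is not already present, so running it twice does not duplicate lines.

```shell
# Append an "IP name" mapping to a hosts file only if the name is not mapped yet.
# add_host_entry is a hypothetical helper, not part of any tool used in this article.
# The third argument defaults to /etc/hosts; pass another path to try it safely.
add_host_entry() {
    ip="$1"; name="$2"; hosts_file="${3:-/etc/hosts}"
    # -w matches the alias as a whole word, so "node1" does not match "node10"
    if ! grep -qw -- "$name" "$hosts_file" 2>/dev/null; then
        printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
    fi
}

# Usage (as root, on every node):
#   add_host_entry 192.168.86.134 node1
#   add_host_entry 192.168.86.137 node2
#   add_host_entry 192.168.86.138 node3
```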

Run visudo and add the following line to the file:

postgres ALL=(ALL) NOPASSWD:ALL

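On every node, as root, set the postgres user's password to postgres:

[root@node1 ~]# passwd postgres
Changing password for user postgres.
New password:
BAD PASSWORD: The password contains the user name in some form
Retype new password:
passwd: all authentication tokens updated successfully.

Set up passwordless SSH between all nodes. Taking node1 as an example, run the following in order: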

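su postgres # switch to the postgres user
ssh-keygen # generate a private/public key pair
ssh-copy-id node1 # copy the public key to node1
ssh-copy-id node2 # copy the public key to node2
ssh-copy-id node3 # copy the public key to node3

Edit the SSH configuration file /etc/ssh/sshd_config to enable public key authentication: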

PubkeyAuthentication yes

Switch to root and restart the SSH service:

systemctl restart sshd

Switch back to the postgres user and test the remote connections:

ssh node1
ssh node2
ssh node3

If each connection succeeds without asking for a password, the setup is working.
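
The manual test can also be run non-interactively. The sketch below is an illustration rather than part of the original procedure: check_passwordless_ssh is a hypothetical helper, and it relies on the OpenSSH options BatchMode and ConnectTimeout. With BatchMode=yes, ssh fails instead of prompting, so a host that would still ask for a password (or whose host key has not been accepted yet) is reported as FAILED.

```shell
# Try a non-interactive SSH login to each host given as an argument.
# check_passwordless_ssh is a hypothetical helper name.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_passwordless_ssh() {
    failed=0
    for host in "$@"; do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
            echo "OK: $host"
        else
            echo "FAILED: $host"
            failed=1
        fi
    done
    # Non-zero exit status if at least one host could not be reached without a prompt
    return "$failed"
}

# Usage (as the postgres user, on every node):
#   check_passwordless_ssh node1 node2 node3
```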

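Common problems when setting up passwordless SSH:

  • Check the ownership and permissions of the directory that holds the keys: it must belong to the postgres user, and ssh enforces strict permissions on it.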

    chown -R postgres:postgres /var/lib/pgsql
    chmod 700 /var/lib/pgsql/.ssh
    chmod 600 /var/lib/pgsql/.ssh/authorized_keys
    
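  • If the logs show no problems, yet connections to the sshd started at boot still ask for a password while a manually started sshd allows passwordless login, the .ssh directory is missing the ssh_home_t SELinux label. Reset it with the following command: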

    restorecon -r -vv /var/lib/pgsql/.ssh
    

Installing repmgr

Install repmgr on every node; its version must match the PostgreSQL version. Run the following commands in order:

curl https://dl.2ndquadrant.com/default/release/get/14/rpm | sudo bash
yum install repmgr14
yum install -y rsync

Primary configuration (node1)

Edit the configuration file /var/lib/pgsql/14/data/postgresql.conf and set the following parameters:

listen_addresses = '*'

max_wal_senders = 10

max_replication_slots = 10

wal_level = replica

hot_standby = on

archive_mode = always

archive_command = '/bin/true'

shared_preload_libraries = 'repmgr'

Edit the configuration file /var/lib/pgsql/14/data/pg_hba.conf to configure database access, appending the following at the end:

local   all           postgres                                  peer
local   replication   repmgr                                    trust
host    replication   repmgr            127.0.0.1/32            trust
host    replication   repmgr            192.168.86.0/24         trust
local   repmgr        repmgr                                    trust
host    repmgr        repmgr            127.0.0.1/32            trust
host    repmgr        repmgr            192.168.86.0/24         trust

After completing the configuration above, start the PostgreSQL service:

systemctl start postgresql-14
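
Once the service is up, it can be worth confirming that the server actually picked up the values from postgresql.conf. The following is a small sketch, assuming it is run as the postgres user on node1 so that psql can connect locally; show_settings is a hypothetical helper that simply issues SHOW for each parameter name it is given.

```shell
# Print "name = value" for each setting, as reported by the running server.
# show_settings is a hypothetical helper name.
show_settings() {
    for name in "$@"; do
        # -A: unaligned output, -t: tuples only, -c: run a single command
        value=$(psql -At -c "SHOW $name") || return 1
        echo "$name = $value"
    done
}

# Usage (as the postgres user on node1):
#   show_settings listen_addresses max_wal_senders max_replication_slots \
#       wal_level hot_standby archive_mode archive_command shared_preload_libraries
```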

Edit the repmgr configuration file /etc/repmgr/14/repmgr.conf and append the following at the end:

node_id=1

node_name='node1'

conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr'

data_directory='/var/lib/pgsql/14/data'

failover=automatic

promote_command='/usr/pgsql-14/bin/repmgr standby promote -f /etc/repmgr/14/repmgr.conf --log-to-file'

follow_command='/usr/pgsql-14/bin/repmgr standby follow -f /etc/repmgr/14/repmgr.conf --log-to-file --upstream-node-id=%n'

service_start_command  = 'sudo systemctl start postgresql-14'

service_stop_command    = 'sudo systemctl stop postgresql-14'

service_restart_command = 'sudo systemctl restart postgresql-14'

service_reload_command  = 'sudo systemctl reload postgresql-14'

repmgrd_pid_file='/tmp/repmgrd.pid'

log_file='/tmp/repmgrd.log'

priority=100

Switch to the postgres user, then create the repmgr user and the repmgr database:

su - postgres
createuser -s repmgr
createdb repmgr -O repmgr

Connect to the database as the postgres user and set the repmgr user's password to repmgr:

[root@localhost ~]# su - postgres
-bash-4.2$ psql
psql (14.5)
Type "help" for help.

postgres=# ALTER USER repmgr ENCRYPTED PASSWORD 'repmgr';
ALTER ROLE
postgres=# exit
-bash-4.2$ exit
logout

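As the postgres user, register the primary node with the cluster (PATH is not configured on this machine, so the commands are run from the directory that contains the binaries):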

cd /usr/pgsql-14/bin/
./repmgr -f /etc/repmgr/14/repmgr.conf primary register

Once registration completes, the primary is fully configured. Check the result:

-bash-4.2$ ./repmgr cluster show -f /etc/repmgr/14/repmgr.conf
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        | host=node1 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr

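Standby configuration (node2/node3)

Do not initialize the database on a standby after installing it. If it has already been initialized, empty the corresponding data directory.

Check that the standby can connect to the primary, taking node3 as an example: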

[root@node3 ~]# su - postgres
Last login: Tue Nov  8 09:23:06 CST 2022 from node3 on pts/1
-bash-4.2$ psql 'host=node1 user=repmgr dbname=repmgr connect_timeout=2'
Password for user repmgr:
psql (14.5)
Type "help" for help.

repmgr=#

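Edit the repmgr configuration file /etc/repmgr/14/repmgr.conf and append the following at the end (node3 shown here; node_id must be unique for each node):

node_id=3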

node_name='node3'

conninfo='host=node3 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr'

data_directory='/var/lib/pgsql/14/data'

failover=automatic

promote_command='/usr/pgsql-14/bin/repmgr standby promote -f /etc/repmgr/14/repmgr.conf --log-to-file'

follow_command='/usr/pgsql-14/bin/repmgr standby follow -f /etc/repmgr/14/repmgr.conf --log-to-file --upstream-node-id=%n'

service_start_command  = 'sudo systemctl start postgresql-14'

service_stop_command    = 'sudo systemctl stop postgresql-14'

service_restart_command = 'sudo systemctl restart postgresql-14'

service_reload_command  = 'sudo systemctl reload postgresql-14'

repmgrd_pid_file='/tmp/repmgrd.pid'

log_file='/tmp/repmgrd.log'

priority=100
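
Across the three nodes, only node_id, node_name and the host in conninfo differ; everything else is identical to the settings listed above. As a sketch of how the per-node file could be generated instead of edited by hand, the hypothetical helper render_repmgr_conf below prints the same settings for a given ID and name. It assumes, as in this article, that the node name doubles as the host alias.

```shell
# Print the repmgr.conf settings used in this article for one node.
# render_repmgr_conf is a hypothetical helper; $1 = node_id, $2 = node_name.
# Assumption: the node name is also the host alias from /etc/hosts.
render_repmgr_conf() {
    node_id="$1"; node_name="$2"
    cat <<EOF
node_id=${node_id}
node_name='${node_name}'
conninfo='host=${node_name} user=repmgr dbname=repmgr connect_timeout=2 password=repmgr'
data_directory='/var/lib/pgsql/14/data'
failover=automatic
promote_command='/usr/pgsql-14/bin/repmgr standby promote -f /etc/repmgr/14/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-14/bin/repmgr standby follow -f /etc/repmgr/14/repmgr.conf --log-to-file --upstream-node-id=%n'
service_start_command='sudo systemctl start postgresql-14'
service_stop_command='sudo systemctl stop postgresql-14'
service_restart_command='sudo systemctl restart postgresql-14'
service_reload_command='sudo systemctl reload postgresql-14'
repmgrd_pid_file='/tmp/repmgrd.pid'
log_file='/tmp/repmgrd.log'
priority=100
EOF
}

# Usage (on node2; use "3 node3" on node3):
#   render_repmgr_conf 2 node2 | sudo tee -a /etc/repmgr/14/repmgr.conf
```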

Try cloning the primary with the --dry-run option:

cd /usr/pgsql-14/bin/
./repmgr -h node1 -U repmgr -d repmgr -f /etc/repmgr/14/repmgr.conf standby clone --dry-run

In actual testing this step failed with an error saying that no password was supplied:

-bash-4.2$ cd /usr/pgsql-14/bin/
-bash-4.2$ ./repmgr -h node1 -U repmgr -d repmgr -f /etc/repmgr/14/repmgr.conf standby clone --dry-run
NOTICE: destination directory "/var/lib/pgsql/14/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=node1 user=repmgr dbname=repmgr
ERROR: connection to database failed
DETAIL:
connection to server at "node1" (192.168.86.134), port 5432 failed: fe_sendauth: no password supplied

Fix: create a .pgpass file in the postgres user's home directory to supply the password:

cd ~
touch .pgpass
vi ~/.pgpass
chmod 0600 ~/.pgpass

Write the following into the new file:

#hostname:port:database:username:password
node1:5432:repmgr:repmgr:repmgr
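
The create, edit and chmod steps can be combined into one non-interactive step, which is convenient when repeating them on node2 and node3. This is only a sketch: write_pgpass is a hypothetical helper, it writes exactly the two lines shown above, and it overwrites any existing file at the target path. The 0600 mode matters because libpq ignores a password file that is readable by group or others.

```shell
# Create a .pgpass file with the entry used in this article and owner-only permissions.
# write_pgpass is a hypothetical helper; $1 defaults to ~/.pgpass.
# Note: an existing file at that path is overwritten.
write_pgpass() {
    pgpass_file="${1:-$HOME/.pgpass}"
    (
        umask 077   # the file is created readable and writable by the owner only
        printf '%s\n' \
            '#hostname:port:database:username:password' \
            'node1:5432:repmgr:repmgr:repmgr' > "$pgpass_file"
    )
    chmod 0600 "$pgpass_file"   # also tighten the mode if the file already existed
}

# Usage (as the postgres user on each standby):
#   write_pgpass
```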

Also add the path of the password file to the repmgr configuration file:

passfile='/var/lib/pgsql/.pgpass'

Try the dry run again:

cd /usr/pgsql-14/bin/
./repmgr -h node1 -U repmgr -d repmgr -f /etc/repmgr/14/repmgr.conf standby clone --dry-run

If the dry run reports no errors, perform the actual clone:

./repmgr -h node1 -U repmgr -d repmgr -f /etc/repmgr/14/repmgr.conf standby clone

After the clone completes, start the PostgreSQL service:

sudo systemctl start postgresql-14

Once the database is running, register the node with the cluster as a standby:

./repmgr -f /etc/repmgr/14/repmgr.conf standby register

Repeat the steps above to register both node2 and node3, then check the cluster status:

-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        | host=node1 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr
 2  | node2 | standby |   running | node1    | default  | 100      | 1        | host=node2 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr
 3  | node3 | standby |   running | node1    | default  | 100      | 1        | host=node3 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr

At this point you can insert or delete data on the primary node1 and check that the changes appear on the standbys.
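
One way to run such a check from the command line is sketched below. Everything specific here is an assumption made for illustration: replication_smoke_test is a hypothetical helper, repl_test is a throwaway table created in the repmgr database, and the repmgr user is assumed to be able to connect to both hosts without a prompt (for example via pg_hba.conf trust rules or .pgpass entries). The one-second pause is an arbitrary allowance for the standby to replay the change.

```shell
# Write a marker row on the primary and read it back from a standby.
# replication_smoke_test and the repl_test table are hypothetical, for illustration only.
# $1 = primary host, $2 = standby host
replication_smoke_test() {
    primary="$1"; standby="$2"; marker="repl-test-$$"
    psql -h "$primary" -U repmgr -d repmgr \
        -c "CREATE TABLE IF NOT EXISTS repl_test (note text)" \
        -c "INSERT INTO repl_test VALUES ('$marker')" >/dev/null || return 1
    sleep 1   # give streaming replication a moment to catch up
    found=$(psql -h "$standby" -U repmgr -d repmgr -At \
        -c "SELECT count(*) FROM repl_test WHERE note = '$marker'") || return 1
    # Succeeds only if exactly the one marker row is visible on the standby
    [ "$found" = "1" ]
}

# Usage (as the postgres user):
#   replication_smoke_test node1 node2 && echo "node2 is replicating"
#   replication_smoke_test node1 node3 && echo "node3 is replicating"
# Afterwards, drop the test table on the primary:
#   psql -h node1 -U repmgr -d repmgr -c "DROP TABLE repl_test"
```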

Starting the repmgrd daemon

Start the repmgrd daemon on each of the three nodes:

./repmgrd -f /etc/repmgr/14/repmgr.conf

Managing the Repmgr cluster

node commands

-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf node status
Node "node3":
        PostgreSQL version: 14.5
        Total data size: 34 MB
        Conninfo: host=node3 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr
        Role: standby
        WAL archiving: enabled
        Archive command: /bin/true
        WALs pending archiving: 0 pending files
        Replication connections: 0 (of maximal 10)
        Replication slots: 0 physical (of maximal 10; 0 missing)
        Upstream node: node1 (ID: 1)
        Replication lag: 0 seconds
        Last received LSN: 0/D001078
        Last replayed LSN: 0/D001078

-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf node check
Node "node3":
        Server role: OK (node is standby)
        Replication lag: OK (0 seconds)
        WAL archiving: OK (0 pending archive ready files)
        Upstream connection: OK (node "node3" (ID: 3) is attached to expected upstream node "node1" (ID: 1))
        Downstream servers: OK (this node has no downstream nodes)
        Replication slots: OK (node has no physical replication slots)
        Missing physical replication slots: OK (node has no missing physical replication slots)
        Configured data directory: OK (configured "data_directory" is "/var/lib/pgsql/14/data")

cluster commands

-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        | host=node1 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr
 2  | node2 | standby |   running | node1    | default  | 100      | 1        | host=node2 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr
 3  | node3 | standby |   running | node1    | default  | 100      | 1        | host=node3 user=repmgr dbname=repmgr connect_timeout=2 password=repmgr

service commands

-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf service status
 ID | Name  | Role    | Status    | Upstream | repmgrd | PID  | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+------+---------+--------------------
 1  | node1 | primary | * running |          | running | 2133 | no      | n/a
 2  | node2 | standby |   running | node1    | running | 2088 | no      | 1 second(s) ago
 3  | node3 | standby |   running | node1    | running | 2002 | no      | 1 second(s) ago

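Pausing and resuming the cluster

repmgrd can be paused from any node, which is typically done for database maintenance. While it is paused, the cluster stays as it is: stopping the primary does not trigger an automatic failover.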

-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf service pause
NOTICE: node 1 (node1) paused
NOTICE: node 2 (node2) paused
NOTICE: node 3 (node3) paused

Checking the service status now shows yes in the Paused? column for every node:

-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf service status
 ID | Name  | Role    | Status    | Upstream | repmgrd | PID  | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+------+---------+--------------------
 1  | node1 | primary | * running |          | running | 2133 | yes     | n/a
 2  | node2 | standby |   running | node1    | running | 2088 | yes     | 0 second(s) ago
 3  | node3 | standby |   running | node1    | running | 2002 | yes     | 1 second(s) ago

Use the following command to unpause:

./repmgr -f /etc/repmgr/14/repmgr.conf service unpause

Handling cluster failures

Stop the PostgreSQL service on the primary node1 to simulate a database failure:

sudo systemctl stop postgresql-14

After stopping the primary, check the service status from the standby node2. The primary is now shown as unreachable, and the cluster is trying to reconnect to it:

-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf service status
 ID | Name  | Role    | Status        | Upstream | repmgrd | PID  | Paused? | Upstream last seen
----+-------+---------+---------------+----------+---------+------+---------+--------------------
 1  | node1 | primary | ? unreachable | ?        | n/a     | n/a  | n/a     | n/a
 2  | node2 | standby |   running     | ? node1  | running | 2088 | no      | 51 second(s) ago
 3  | node3 | standby |   running     | ? node1  | running | 2002 | no      | 50 second(s) ago

WARNING: following issues were detected
  - unable to  connect to node "node1" (ID: 1)
  - node "node1" (ID: 1) is registered as an active primary but is unreachable
  - unable to connect to node "node2" (ID: 2)'s upstream node "node1" (ID: 1)
  - unable to determine if node "node2" (ID: 2) is attached to its upstream node "node1" (ID: 1)
  - unable to connect to node "node3" (ID: 3)'s upstream node "node1" (ID: 1)
  - unable to determine if node "node3" (ID: 3) is attached to its upstream node "node1" (ID: 1)

HINT: execute with --verbose option to see connection error messages

Check the status again after a while. The old primary is now marked as failed, and the cluster has elected node2 as the new primary and reconfigured itself:

-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf service status
 ID | Name  | Role    | Status    | Upstream | repmgrd | PID  | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+------+---------+--------------------
 1  | node1 | primary | - failed  | ?        | n/a     | n/a  | n/a     | n/a
 2  | node2 | primary | * running |          | running | 2088 | no      | n/a
 3  | node3 | standby |   running | node2    | running | 2002 | no      | 0 second(s) ago

WARNING: following issues were detected
  - unable to  connect to node "node1" (ID: 1)

HINT: execute with --verbose option to see connection error messages
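
When watching a failover like this, it can help to pull out just the name of the node that is currently the active primary. The sketch below parses the human-readable table shown above, so it depends on that column layout and is not a stable interface; current_primary is a hypothetical helper that reads the output of service status (or cluster show) on standard input.

```shell
# Print the name of every node whose Role is "primary" and whose Status is "* running".
# current_primary is a hypothetical helper; it reads the repmgr status table on stdin.
current_primary() {
    awk -F'|' '
        NF >= 4 {
            name = $2; role = $3; status = $4
            # trim the padding around each column
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", role)
            gsub(/^[ \t]+|[ \t]+$/, "", status)
            if (role == "primary" && status == "* running") print name
        }
    '
}

# Usage (from /usr/pgsql-14/bin as the postgres user):
#   ./repmgr -f /etc/repmgr/14/repmgr.conf service status | current_primary
```

Rows for a failed or inactive former primary (status "- failed" or "! running") are skipped, because only the row marked with "*" is reported.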

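Once node1 has been repaired and restarted, its status changes to ! running: the server is up but is no longer part of the managed cluster. Use the registration and unregistration commands described in the next section to remove it from the cluster and register it again.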

-bash-4.2$ ./repmgr -f /etc/repmgr/14/repmgr.conf service status
 ID | Name  | Role    | Status    | Upstream | repmgrd | PID  | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+------+---------+--------------------
 1  | node1 | primary | ! running |          | running | 2133 | no      | n/a
 2  | node2 | primary | * running |          | running | 2088 | no      | n/a
 3  | node3 | standby |   running | node2    | running | 2002 | no      | 1 second(s) ago

WARNING: following issues were detected
  - node "node1" (ID: 1) is running but the repmgr node record is inactive

Registering and unregistering nodes

Unregister a primary node from the cluster:

./repmgr -f /etc/repmgr/14/repmgr.conf primary unregister  --node-id=1

Unregister a standby node from the cluster:

./repmgr standby unregister -f /etc/repmgr/14/repmgr.conf --node-id=3

Add a standby node to the cluster: after configuring the new node, run the following on that node:

./repmgr -f /etc/repmgr/14/repmgr.conf standby register

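Tip: after the primary goes down, you can unregister it, clear the data directory on that node, and then repeat the standby configuration steps to register it again, which turns the former primary into a standby.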
