PostgreSQL , 流复制 , 主从 , 逻辑订阅
PostgreSQL 自从2010年推出的9.0版本开始,支持流式物理复制,用户可以通过流式复制,构建只读备库(主备物理复制,块级别一致)。流式物理复制可以做到极低的延迟(通常在1毫秒以内)。
2011年推出的9.1版本,支持同步复制,当时只支持一个同步流复制备节点(例如配置了3个备,只有一个是同步模式的,其他都是异步模式)。
在同步复制模式下,当用户提交事务时,需要等待这笔事务的WAL日志复制到同步流复制备节点,才会返回提交成功的ACK给客户端。
同步模式下,可以确保数据的0丢失。(只要客户端收到了事务提交成功的ACK,这笔事务的WAL就有两份。)
2012年推出的9.2版本,支持级联流复制。意思是备库还可以再连备库。
级联复制特别适合跨机房的使用,例如主库在A机房,备库在B机房,但是B机房需要建立多个备库时,那么B机房只需要建立一个直连主库的备库,其他的备库可以通过B机房的这个备库级联产生。从而减少网络开销。
2012年推出的9.2版本,除了支持级联复制,还支持虚拟备库,什么是虚拟备库呢?就是只有WAL,没有数据文件的备库。
通过虚拟备库,可以流式的接收WAL,进行实时的流式WAL归档。提高备份或归档的实时性。
2014年推出的9.4版本,在WAL中增加了逻辑复制需要的基础信息,通过插件,可以实现逻辑复制。
逻辑复制可以做到对主库的部分复制,例如表级复制,而不是整个集群的块级一致复制。
逻辑复制的备库不仅仅是只读的,也可以执行写操作。
2016年推出的9.6版本,PG的流式复制,通过复制WAL达到同步的目的,因此同步级别也和WAL有关。通过synchronous_commit参数,可以配置事务的同步级别。
1、on, 表示本地WAL fsync,同步standby WAL fsync。即两份持久化的WAL。
2、remote_apply, 表示本地WAL fsync,同步standby WAL 已恢复。这个带来的RT最高。
3、remote_write, 表示本地WAL fsync,同步standby WAL 异步write完成。一份持久化,备库的WAL可能还在OS CACHE中。
4、local, 表示本地WAL fsync。
5、off, 表示本地WAL写到wal buffer中即返回客户端事务提交成功的ACK,为异步提交(数据库CRASH可能导致事务丢失,但不会导致数据库不一致)。
RT影响,从低到高如下:
off, local, remote_write, on, remote_apply。
2017年推出的10版本,pg_basebackup, pg_receivewal支持流式压缩备份WAL。
2017年推出的10版本,支持quorum based的同步流复制,例如有3个备节点,你可以告诉主库,这个事务需要3份WAL副本,那么主库需要等待至少2个备节点已将WAL同步过去的反馈,才会将事务提交成功的ACK返回给客户端。
quorum based同步流复制,结合raft协议,可以实现零数据丢失的高可用、高可靠架构。
2017年推出的10版本,内置了逻辑订阅的功能。
2017年推出的10版本,通过逻辑订阅的功能,可以实现多Master架构。
以一主两从,quorum based 同步(一副本)为例,介绍PG 10的主从最佳实践。
三台机器(假设主机有64G内存),同一局域网,相互网络互通(至少数据库监听端口应该互通)。
Linux CentOS 7.x x64
HOSTA : 监听端口1921
HOSTB : 监听端口1921
HOSTC : 监听端口1921
(本文使用一台物理机来模拟,IP为127.0.0.1,端口分别为2921,2922,2923,读者请根据实际环境修改)
请参考
《PostgreSQL on Linux 最佳部署手册》
mkdir /disk1/digoal/pgdata2921
chown digoal /disk1/digoal/pgdata2921
initdb -D /disk1/digoal/pgdata2921 -E UTF8 --locale=C -U postgres
cd /disk1/digoal/pgdata2921
vi postgresql.conf
listen_addresses = '0.0.0.0'
port = 2921
max_connections = 1000
unix_socket_directories = '.'
tcp_keepalives_idle = 60
tcp_keepalives_interval = 10
tcp_keepalives_count = 10
shared_buffers = 8GB
maintenance_work_mem = 1GB
dynamic_shared_memory_type = posix
vacuum_cost_delay = 0
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 500
bgwriter_lru_multiplier = 5.0
bgwriter_flush_after = 0
effective_io_concurrency = 0
max_worker_processes = 16
backend_flush_after = 0
wal_level = replica
fsync = on
synchronous_commit = remote_write
full_page_writes = on
wal_buffers = 128MB
wal_writer_delay = 10ms
wal_writer_flush_after = 0
checkpoint_timeout = 30min
max_wal_size = 16GB
min_wal_size = 8GB
checkpoint_completion_target = 0.5
checkpoint_flush_after = 0
max_wal_senders = 10
wal_keep_segments = 1024
synchronous_standby_names = 'ANY 1 (*)'
hot_standby = on
max_standby_archive_delay = 300s
max_standby_streaming_delay = 300s
wal_receiver_status_interval = 1s
hot_standby_feedback = off
log_destination = 'csvlog'
logging_collector = on
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = verbose
log_timezone = 'PRC'
log_autovacuum_min_duration = 0
autovacuum_vacuum_scale_factor = 0.1
autovacuum_analyze_scale_factor = 0.1
autovacuum_freeze_max_age = 1000000000
autovacuum_multixact_freeze_max_age = 1200000000
autovacuum_vacuum_cost_delay = 0
autovacuum_vacuum_cost_limit = 0
vacuum_freeze_table_age = 800000000
vacuum_multixact_freeze_table_age = 800000000
datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
default_text_search_config = 'pg_catalog.english'
cd /disk1/digoal/pgdata2921
vi pg_hba.conf
# "local" is for Unix domain socket connections only
local all all trust
# IPv4 local connections:
host all all 127.0.0.1/32 trust
# IPv6 local connections:
host all all ::1/128 trust
# Allow replication connections from localhost, by a user with the
# replication privilege.
local replication all trust
host replication all 127.0.0.1/32 trust
host replication all ::1/128 trust
host replication all 0.0.0.0/0 md5
cd /disk1/digoal/pgdata2921
vi recovery.done
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'host=127.0.0.1 port=2921 user=rep password=pwd'
# recovery_min_apply_delay = 0 #延迟多少分钟应用,用户可以配置延迟的备库,例如给一点误操作的缓冲时间。在备库不会这么早被应用。
pg_ctl start -D /disk1/digoal/pgdata2921
psql -h 127.0.0.1 -p 2921
psql (10beta1)
Type "help" for help.
postgres=# set synchronous_commit =off;
SET
postgres=# create role rep login replication encrypted password 'pwd';
CREATE ROLE
mkdir /disk1/digoal/pgdata2922
chown digoal /disk1/digoal/pgdata2922
chmod 700 /disk1/digoal/pgdata2922
export PGPASSWORD="pwd"
pg_basebackup -D /disk1/digoal/pgdata2922 -F p -X stream -h 127.0.0.1 -p 2921 -U rep
cd /disk1/digoal/pgdata2922
vi postgresql.conf
port = 2922
cd /disk1/digoal/pgdata2922
mv recovery.done recovery.conf
pg_ctl start -D /disk1/digoal/pgdata2922
mkdir /disk1/digoal/pgdata2923
chown digoal /disk1/digoal/pgdata2923
chmod 700 /disk1/digoal/pgdata2923
export PGPASSWORD="pwd"
pg_basebackup -D /disk1/digoal/pgdata2923 -F p -X stream -h 127.0.0.1 -p 2921 -U rep
cd /disk1/digoal/pgdata2923
vi postgresql.conf
port = 2923
cd /disk1/digoal/pgdata2923
mv recovery.done recovery.conf
pg_ctl start -D /disk1/digoal/pgdata2923
psql -h 127.0.0.1 -p 2921
psql (10beta1)
Type "help" for help.
postgres=# \x
Expanded display is on.
postgres=# select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) as sent_delay,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn)) as write_delay,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)) as flush_delay,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) as replay_delay,
*
from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
sent_delay | 0 bytes
write_delay | 0 bytes
flush_delay | 0 bytes
replay_delay | 0 bytes
pid | 11962
usesysid | 16384
usename | rep
application_name | walreceiver
client_addr | 127.0.0.1
client_hostname |
client_port | 63083
backend_start | 2017-07-11 17:15:31.231492+08
backend_xmin |
state | streaming
sent_lsn | 1/C0000060
write_lsn | 1/C0000060
flush_lsn | 1/C0000060
replay_lsn | 1/C0000060
write_lag |
flush_lag |
replay_lag |
sync_priority | 1
sync_state | quorum
-[ RECORD 2 ]----+------------------------------
sent_delay | 0 bytes
write_delay | 0 bytes
flush_delay | 0 bytes
replay_delay | 0 bytes
pid | 11350
usesysid | 16384
usename | rep
application_name | walreceiver
client_addr | 127.0.0.1
client_hostname |
client_port | 63077
backend_start | 2017-07-11 17:15:13.818043+08
backend_xmin |
state | streaming
sent_lsn | 1/C0000060
write_lsn | 1/C0000060
flush_lsn | 1/C0000060
replay_lsn | 1/C0000060
write_lag |
flush_lag |
replay_lag |
sync_priority | 1
sync_state | quorum
psql -h 127.0.0.1 -p 2922
-- 查看当前WAL应用是否暂停了
postgres=# select pg_is_wal_replay_paused();
pg_is_wal_replay_paused
-------------------------
f
(1 row)
-- 查看WAL接收到的位点
postgres=# select pg_last_wal_receive_lsn();
pg_last_wal_receive_lsn
-------------------------
1/C0000060
(1 row)
-- 查看WAL的应用位点
postgres=# select pg_last_wal_replay_lsn();
pg_last_wal_replay_lsn
------------------------
1/C0000060
(1 row)
-- 查看wal receiver的统计信息
postgres=# \x
Expanded display is on.
postgres=# select * from pg_stat_get_wal_receiver();
-[ RECORD 1 ]---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
pid | 11349
status | streaming
receive_start_lsn | 1/C0000000
receive_start_tli | 1
received_lsn | 1/C0000060
received_tli | 1
last_msg_send_time | 2017-07-11 17:23:14.372327+08
last_msg_receipt_time | 2017-07-11 17:23:14.372361+08
latest_end_lsn | 1/C0000060
latest_end_time | 2017-07-11 17:15:13.819553+08
slot_name |
conninfo | user=rep password=******** dbname=replication host=127.0.0.1 port=2921 fallback_application_name=walreceiver sslmode=disable sslcompression=1 target_session_attrs=any
postgres=# select pg_wal_replay_pause();
-[ RECORD 1 ]-------+-
pg_wal_replay_pause |
-- 暂停WAL的应用,例如要做一些排错时
postgres=# select pg_is_wal_replay_paused();
-[ RECORD 1 ]-----------+--
pg_is_wal_replay_paused | t
postgres=# select pg_wal_replay_resume();
-[ RECORD 1 ]--------+-
pg_wal_replay_resume |
-- 继续应用WAL
postgres=# select pg_is_wal_replay_paused();
-[ RECORD 1 ]-----------+--
pg_is_wal_replay_paused | f
1、如果要防止主库删除备库还没有接收的WAL文件,有两种方法。
使用slot,或者配置足够大的wal keep。
但是这两种方法都有一定的风险或问题,例如当备库挂了,或者备库不再使用了,而用户忘记删除对应的SLOT时。可能导致主库WAL无限膨胀。
而wal keep则会导致主库的WAL预留足够的个数,占用一定空间。
相关参数
主 postgresql.conf
# max_replication_slots = 10
# wal_keep_segments = 1024
备 recovery.conf
# primary_slot_name = ''
2、如果不想通过以上方法预防备库需要的WAL已被删除,那么可以配置主库的归档,同时备库需要能获取到已归档的WAL文件。
相关参数
主 postgresql.conf
#archive_mode = off # enables archiving; off, on, or always
# (change requires restart)
#archive_command = '' # command to use to archive a logfile segment
# placeholders: %p = path of file to archive
# %f = file name only
# e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
备 recovery.conf
# restore_command = '' # e.g. 'cp /mnt/server/archivedir/%f %p'
3、保护好recovery.conf文件中的密码,因为配置的是明文。
4、主备之间的带宽请足够大,否则可能导致主备延迟。
连接主库进行TPC-B的压测
pgbench -i -s 100 -h 127.0.0.1 -p 2921 -U postgres
pgbench -n -r -P 1 -h 127.0.0.1 -p 2921 -U postgres -c 32 -j 32 -T 120
transaction type:
scaling factor: 100
query mode: simple
number of clients: 32
number of threads: 32
duration: 120 s
number of transactions actually processed: 1326066
latency average = 2.896 ms
latency stddev = 2.030 ms
tps = 11050.199659 (including connections establishing)
tps = 11051.140876 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.002 \set aid random(1, 100000 * :scale)
0.001 \set bid random(1, 1 * :scale)
0.001 \set tid random(1, 10 * :scale)
0.000 \set delta random(-5000, 5000)
0.043 BEGIN;
0.154 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.112 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.159 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
0.423 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.092 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
1.910 END;
观察主备的延迟
psql -h 127.0.0.1 -p 2921
psql (10beta1)
Type "help" for help.
postgres=# \x
Expanded display is on.
postgres=# select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) as sent_delay,
postgres-# pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn)) as write_delay,
postgres-# pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)) as flush_delay,
postgres-# pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) as replay_delay,
postgres-# *
postgres-# from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
sent_delay | 4024 bytes
write_delay | 4024 bytes
flush_delay | 9080 bytes
replay_delay | 13 kB
pid | 11962
usesysid | 16384
usename | rep
application_name | walreceiver
client_addr | 127.0.0.1
client_hostname |
client_port | 63083
backend_start | 2017-07-11 17:15:31.231492+08
backend_xmin |
state | streaming
sent_lsn | 2/1C61E398
write_lsn | 2/1C61E398
flush_lsn | 2/1C61CFD8
replay_lsn | 2/1C61BEF8
write_lag | 00:00:00.000129
flush_lag | 00:00:00.001106
replay_lag | 00:00:00.001626
sync_priority | 1
sync_state | quorum
-[ RECORD 2 ]----+------------------------------
sent_delay | 0 bytes
write_delay | 4024 bytes
flush_delay | 9080 bytes
replay_delay | 12 kB
pid | 11350
usesysid | 16384
usename | rep
application_name | walreceiver
client_addr | 127.0.0.1
client_hostname |
client_port | 63077
backend_start | 2017-07-11 17:15:13.818043+08
backend_xmin |
state | streaming
sent_lsn | 2/1C61F350
write_lsn | 2/1C61E398
flush_lsn | 2/1C61CFD8
replay_lsn | 2/1C61C388
write_lag | 00:00:00.000542
flush_lag | 00:00:00.001582
replay_lag | 00:00:00.001952
sync_priority | 1
sync_state | quorum
postgres=# \watch 1
《PostgreSQL 逻辑订阅 - 给业务架构带来了什么希望?》
《PostgreSQL 10.0 preview 逻辑复制 - 原理与最佳实践》
《使用PostgreSQL逻辑订阅实现multi-master》