利用上一篇提到的pg_rman来做备份恢复测试,相对而言它的命令更简单,在紧急恢复时更方便。
create table test1017(a int);
insert into test1017 values(1);
insert into test1017 values(2);
insert into test1017 values(3);
select * from test1017;
a
---
1
2
3
(3 rows)
pg_rman backup -b full -s -C -Z --keep-data-days=10 --keep-arclog-files=15 --keep-arclog-days=10 --keep-srvlog-files=10 --keep-srvlog-days=15 -h 127.0.0.1 -p 5432 -U postgres -d users
输出:
INFO: copying database files
INFO: copying archived WAL files
INFO: copying server log files
INFO: backup complete
INFO: Please execute 'pg_rman validate' to verify the files are correctly copied.
INFO: start deleting old archived WAL files from ARCLOG_PATH (keep files = 15, keep days = 10)
INFO: the threshold timestamp calculated by keep days is "2019-10-07 00:00:00"
INFO: start deleting old server files from SRVLOG_PATH (keep files = 10, keep days = 15)
INFO: the threshold timestamp calculated by keep days is "2019-10-02 00:00:00"
INFO: start deleting old backup (keep after = 2019-10-07 00:00:00)
INFO: does not include the backup just taken
INFO: backup "2019-10-17 15:00:01" should be kept
DETAIL: This is taken after "2019-10-07 00:00:00".
INFO: backup "2019-10-17 14:59:44" should be kept
DETAIL: This is taken after "2019-10-07 00:00:00".
pg_rman validate;
输出:
INFO: validate: "2019-10-16 06:44:17" backup, archive log files and server log files by CRC
INFO: backup "2019-10-16 06:44:17" is valid
insert into test1017 values(444);
insert into test1017 values(555);
users=# select * from test1017;
a
-----
1
2
3
444
555
(5 rows)
pg_rman backup -b incremental -s -C -Z --keep-data-days=10 --keep-arclog-files=15 --keep-arclog-days=10 --keep-srvlog-files=10 --keep-srvlog-days=15 -h 127.0.0.1 -p 5432 -U postgres -d users
输出:
INFO: copying database files
INFO: copying archived WAL files
INFO: copying server log files
INFO: backup complete
INFO: Please execute 'pg_rman validate' to verify the files are correctly copied.
INFO: start deleting old archived WAL files from ARCLOG_PATH (keep files = 15, keep days = 10)
INFO: the threshold timestamp calculated by keep days is "2019-10-07 00:00:00"
INFO: start deleting old server files from SRVLOG_PATH (keep files = 10, keep days = 15)
INFO: the threshold timestamp calculated by keep days is "2019-10-02 00:00:00"
INFO: start deleting old backup (keep after = 2019-10-07 00:00:00)
INFO: does not include the backup just taken
INFO: backup "2019-10-17 15:15:25" should be kept
DETAIL: This is taken after "2019-10-07 00:00:00".
INFO: backup "2019-10-17 15:00:01" should be kept
DETAIL: This is taken after "2019-10-07 00:00:00".
INFO: backup "2019-10-17 14:59:44" should be kept
DETAIL: This is taken after "2019-10-07 00:00:00".
pg_rman validate;
INFO: validate: "2019-10-17 15:20:50" backup, archive log files and server log files by CRC
INFO: backup "2019-10-17 15:20:50" is valid
务必确认备份状态是ok的,万一是corrupt的后面恢复不了就悲剧了
-bash-4.2$ pg_rman show
=====================================================================
StartTime EndTime Mode Size TLI Status
=====================================================================
2019-10-17 15:20:50 2019-10-17 15:20:52 INCR 66kB 1 OK <-- 测试的增备
2019-10-17 15:15:25 2019-10-17 15:15:28 FULL 4570kB 1 OK <-- 测试的全备
2019-10-17 15:00:01 2019-10-17 15:00:02 ARCH 31kB 1 OK
2019-10-17 14:59:44 2019-10-17 14:59:47 FULL 6210kB 1 OK
查看备份文件内容
-bash-4.2$ cd /data/PRD/pgbackup/
-bash-4.2$ ls
20191017 backup pg_rman.ini pg_rman_log system_identifier test.file timeline_history
-bash-4.2$ cd 20191017
-bash-4.2$ ls
145944 150001 151525 152050
-bash-4.2$ cd 151525 <--- 对应备份开始时间2019-10-17 15:15:25
-bash-4.2$ ls
arclog backup.ini database file_arclog.txt file_database.txt file_srvlog.txt mkdirs.sh srvlog
-bash-4.2$ cd database/
-bash-4.2$ ls
backup_label pg_dynshmem pg_multixact pg_stat PG_VERSION postmaster.opts
base pg_hba.conf pg_notify pg_stat_tmp pg_wal postmaster.pid
current_logfiles pg_ident.conf pg_replslot pg_subtrans pg_xact repmgr
global pg_log pg_serial pg_tblspc postgresql.auto.conf
pg_commit_ts pg_logical pg_snapshots pg_twophase postgresql.conf
从输出信息及备份文件可以看出,这种备份方式也是copy文件
模拟误删除os文件导致整个数据目录丢失,此时希望恢复到最近时间点(前提是增备之后的都归档没丢),故障前test1017有5条数据,恢复后也应该5条
关闭pg服务
pg_ctl stop -m fast
挪走$PGDATA目录模拟删除(不要真的删,万一真恢复不回来就要重装了...)
-bash-4.2$ pwd
/data/PRD/postgres/10.7/pg5432
-bash-4.2$ mv data data.bak
-bash-4.2$ ls
data.bak pg_archlog pgbackup
再起pg,肯定挂了
-bash-4.2$ postgres -D $PGDATA &
[1] 5259
-bash-4.2$ postgres: could not access directory "/data/PRD/postgres/10.7/pg5432/data": No such file or directory
Run initdb or pg_basebackup to initialize a PostgreSQL data directory.
[1]+ Exit 2 postgres -D $PGDATA
开始恢复数据
mkdir data # postgres用户创建数据目录并授权
chmod 0700 data
cp -R pg_archlog pg_archlog_bak #备份下归档目录
使用pg_rman restore还原
注意pg_rman会自动生成recovery.conf文件,不需手动创建
-bash-4.2$ pg_rman restore
输出:
WARNING: pg_controldata file "/data/PRD/postgres/10.7/pg5432/data/global/pg_control" does not exist
INFO: the recovery target timeline ID is not given
INFO: use timeline ID of latest full backup as recovery target: 1
INFO: calculating timeline branches to be used to recovery target point
INFO: searching latest full backup which can be used as restore start point
INFO: found the full backup can be used as base in recovery: "2019-10-17 15:15:25"
INFO: copying online WAL files and server log files
INFO: clearing restore destination
INFO: validate: "2019-10-17 15:15:25" backup, archive log files and server log files by SIZE
INFO: backup "2019-10-17 15:15:25" is valid
INFO: restoring database files from the full mode backup "2019-10-17 15:15:25"
INFO: searching incremental backup to be restored
INFO: validate: "2019-10-17 15:20:50" backup, archive log files and server log files by SIZE
INFO: backup "2019-10-17 15:20:50" is valid
INFO: restoring database files from the incremental mode backup "2019-10-17 15:20:50"
INFO: searching backup which contained archived WAL files to be restored
INFO: backup "2019-10-17 15:20:50" is valid
INFO: restoring WAL files from backup "2019-10-17 15:20:50"
INFO: restoring online WAL files and server log files
INFO: generating recovery.conf
INFO: restore complete
HINT: Recovery will start automatically when the PostgreSQL server is started.
看到是还原到了2019-10-17 15:20:50,当pg服务启动时会自动进行Recovery
看一下pg_rman自动生成的recovery.conf文件
-bash-4.2$ cat recovery.done
# recovery.conf generated by pg_rman 1.3.8
restore_command = 'cp /data/PRD/postgres/10.7/pg5432/pg_archlog/%f %p'
recovery_target_timeline = '1'
可以看到数据文件已经回来了
-bash-4.2$ cd data
-bash-4.2$ ls
backup_label pg_dynshmem pg_multixact pg_stat PG_VERSION postmaster.opts
base pg_hba.conf pg_notify pg_stat_tmp pg_wal recovery.conf
current_logfiles pg_ident.conf pg_replslot pg_subtrans pg_xact repmgr
global pg_log pg_serial pg_tblspc postgresql.auto.conf
pg_commit_ts pg_logical pg_snapshots pg_twophase postgresql.conf
启动pg服务
-bash-4.2$ postgres -D $PGDATA &
[1] 5702
-bash-4.2$ 2019-10-17 15:32:23 CST::@:[5702]: LOG: listening on IPv4 address "0.0.0.0", port 5432
2019-10-17 15:32:23 CST::@:[5702]: LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2019-10-17 15:32:23 CST::@:[5702]: LOG: redirecting log output to logging collector process
2019-10-17 15:32:23 CST::@:[5702]: HINT: Future log output will appear in directory "/data/PRD/postgres/10.7/pg5432/data/pg_log".
查看日志
-bash-4.2$ tail -100f postgresql-17.log
2019-10-17 15:32:23 CST::@:[5704]: LOG: database system was interrupted; last known up at 2019-10-17 15:20:50 CST
2019-10-17 15:32:23 CST::@:[5704]: LOG: starting archive recovery <-- 开始归档恢复
2019-10-17 15:32:23 CST::@:[5704]: LOG: restored log file "000000010000000000000007" from archive
2019-10-17 15:32:24 CST::@:[5704]: LOG: redo starts at 0/7000028
2019-10-17 15:32:24 CST::@:[5704]: LOG: consistent recovery state reached at 0/70000F8
2019-10-17 15:32:24 CST::@:[5702]: LOG: database system is ready to accept read only connections
2019-10-17 15:32:24 CST::@:[5704]: LOG: restored log file "000000010000000000000008" from archive
cp: cannot stat ‘/data/PRD/postgres/10.7/pg5432/pg_archlog/000000010000000000000009’: No such file or directory
2019-10-17 15:32:24 CST::@:[5704]: LOG: redo done at 0/8000088
2019-10-17 15:32:24 CST::@:[5704]: LOG: last completed transaction was at log time 2019-10-17 15:20:52.167637+08
2019-10-17 15:32:24 CST::@:[5704]: LOG: restored log file "000000010000000000000008" from archive
cp: cannot stat ‘/data/PRD/postgres/10.7/pg5432/pg_archlog/00000002.history’: No such file or directory
2019-10-17 15:32:25 CST::@:[5704]: LOG: selected new timeline ID: 2 <-- 产生新的时间线
2019-10-17 15:32:25 CST::@:[5704]: LOG: archive recovery complete <-- 归档恢复完成
cp: cannot stat ‘/data/PRD/postgres/10.7/pg5432/pg_archlog/00000001.history’: No such file or directory
2019-10-17 15:32:25 CST::@:[5708]: LOG: checkpoint starting: end-of-recovery immediate wait
2019-10-17 15:32:25 CST::@:[5708]: LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.023 s; sync files=0, longest=0.000 s, average=0.000 s; distance=32768 kB, estimate=32768 kB
2019-10-17 15:32:25 CST::@:[5702]: LOG: database system is ready to accept connections
检查测试表,5条数据,ok
users=# select * from test1017;
a
-----
1
2
3
444
555
(5 rows)
这是工作中更常用的一种恢复,一般用于误删除/更改表后恢复到误操作之前,找回数据。若要恢复到最后一次备份之后的时间点,也需要保证归档没丢。
./pg_rman_backup_db.sh
这里有一个巨坑
此时有两种解决方法:
-bash-4.2$ pg_rman show
=====================================================================
StartTime EndTime Mode Size TLI Status
=====================================================================
2019-10-17 16:11:36 2019-10-17 16:11:39 FULL 4616kB 5 OK <----- 注意此时时间线是5,因为之前还测过几次别的恢复
2019-10-17 16:04:24 2019-10-17 16:04:26 ARCH 207kB 4 OK
2019-10-17 15:20:50 2019-10-17 15:20:52 INCR 66kB 1 OK
2019-10-17 15:15:25 2019-10-17 15:15:28 FULL 4570kB 1 OK
2019-10-17 15:00:01 2019-10-17 15:00:02 ARCH 31kB 1 OK
2019-10-17 14:59:44 2019-10-17 14:59:47 FULL 6210kB 1 OK
users=# insert into test1017 values(1017);
INSERT 0 1
users=# select now();
now
-------------------------------
2019-10-17 16:12:57.859887+08
(1 row)
users=# select * from test1017;
a
------
1
2
3
444
555
1017
(6 rows)
保险起见做个Checkpoint再切下归档
users=# checkpoint;
CHECKPOINT
users=#
users=# select pg_switch_wal();
pg_switch_wal
---------------
0/B000308
(1 row)
users=# truncate table test1017;
TRUNCATE TABLE
users=#
users=# select now();
now
-------------------------------
2019-10-17 16:17:24.705675+08
(1 row)
users=# select * from test1017;
a
---
(0 rows)
根据最新数据插入时间(2019-10-17 16:12:57)及truncate时间(2019-10-17 16:17:24),恢复到2019-10-17 16:15:00应该能找回被truncate前数据。注意这个时间不是很准,最好再往前推几分钟,避免恢复后还是找不到被误操作的数据。
pg_ctl stop -m fast
挪走$PGDATA目录
cd /data/PRD/postgres/10.7/pg5432
mv data data.bak
-bash-4.2$ pg_rman restore --recovery-target-time "2019-10-17 16:15:00"
输出:
WARNING: pg_controldata file "/data/PRD/postgres/10.7/pg5432/data/global/pg_control" does not exist
INFO: the recovery target timeline ID is not given
INFO: use timeline ID of latest full backup as recovery target: 5
INFO: calculating timeline branches to be used to recovery target point
INFO: searching latest full backup which can be used as restore start point
INFO: found the full backup can be used as base in recovery: "2019-10-17 16:11:36"
INFO: copying online WAL files and server log files
INFO: clearing restore destination
INFO: validate: "2019-10-17 16:11:36" backup, archive log files and server log files by SIZE
INFO: backup "2019-10-17 16:11:36" is valid
INFO: restoring database files from the full mode backup "2019-10-17 16:11:36"
INFO: searching incremental backup to be restored
INFO: searching backup which contained archived WAL files to be restored
INFO: backup "2019-10-17 16:11:36" is valid
INFO: restoring WAL files from backup "2019-10-17 16:11:36"
INFO: restoring online WAL files and server log files
INFO: generating recovery.conf
INFO: restore complete
HINT: Recovery will start automatically when the PostgreSQL server is started.
-bash-4.2$ postgres -D $PGDATA &
[1] 11868
-bash-4.2$ 2019-10-17 16:19:53 CST::@:[11868]: LOG: listening on IPv4 address "0.0.0.0", port 5432
2019-10-17 16:19:53 CST::@:[11868]: LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2019-10-17 16:19:53 CST::@:[11868]: LOG: redirecting log output to logging collector process
2019-10-17 16:19:53 CST::@:[11868]: HINT: Future log output will appear in directory "/data/PRD/postgres/10.7/pg5432/data/pg_log".
查看日志
-bash-4.2$ tail -100f postgresql-17.log
2019-10-17 16:19:53 CST::@:[11870]: LOG: database system was interrupted; last known up at 2019-10-17 16:11:37 CST
2019-10-17 16:19:54 CST::@:[11870]: LOG: restored log file "00000005.history" from archive
2019-10-17 16:19:54 CST::@:[11870]: LOG: starting point-in-time recovery to 2019-10-17 16:15:00+08
2019-10-17 16:19:54 CST::@:[11870]: LOG: restored log file "00000005.history" from archive
2019-10-17 16:19:54 CST::@:[11870]: LOG: restored log file "00000005000000000000000A" from archive
2019-10-17 16:19:54 CST::@:[11870]: LOG: redo starts at 0/A000028
2019-10-17 16:19:54 CST::@:[11870]: LOG: consistent recovery state reached at 0/A000130
2019-10-17 16:19:54 CST::@:[11868]: LOG: database system is ready to accept read only connections
2019-10-17 16:19:54 CST::@:[11870]: LOG: restored log file "00000005000000000000000B" from archive
2019-10-17 16:19:54 CST::@:[11870]: LOG: restored log file "00000005000000000000000C" from archive
2019-10-17 16:19:54 CST::@:[11870]: LOG: recovery stopping before commit of transaction 578, time 2019-10-17 16:17:18.160569+08
2019-10-17 16:19:54 CST::@:[11870]: LOG: recovery has paused <-- 注意
2019-10-17 16:19:54 CST::@:[11870]: HINT: Execute pg_wal_replay_resume() to continue. <-- 注意
2019-10-17 16:20:25 CST:[local]:[unknown]@[unknown]:[11911]: LOG: connection received: host=[local]
2019-10-17 16:20:25 CST:[local]:postgres@users:[11911]: LOG: connection authorized: user=postgres database=users
注意日志里以下信息
2019-10-17 16:19:54 CST::@:[11870]: LOG: recovery has paused
2019-10-17 16:19:54 CST::@:[11870]: HINT: Execute pg_wal_replay_resume() to continue.
这与recovery.conf文件的recovery_target_action参数设置有关(默认为pause),后面文章会再提到这个参数。
需要先执行pg_wal_replay_resume(),否则在数据库查询语句会被卡住
users=# select * from test1017;
^CCancel request sent
ERROR: canceling statement due to user request
日志里会有阻塞信息
2019-10-17 16:20:25 CST:[local]:[unknown]@[unknown]:[11911]: LOG: connection received: host=[local]
2019-10-17 16:20:25 CST:[local]:postgres@users:[11911]: LOG: connection authorized: user=postgres database=users
2019-10-17 16:21:12 CST:[local]:postgres@users:[11911]: LOG: process 11911 still waiting for AccessShareLock on relation 16416 of database 16385 after 1000.120 ms at character 15
2019-10-17 16:21:12 CST:[local]:postgres@users:[11911]: DETAIL: Process holding the lock: 11870. Wait queue: 11911.
2019-10-17 16:21:12 CST:[local]:postgres@users:[11911]: STATEMENT: select * from test1017;
2019-10-17 16:21:26 CST:[local]:[unknown]@[unknown]:[11977]: LOG: connection received: host=[local]
users=# select pg_wal_replay_resume();
pg_wal_replay_resume
----------------------
(1 row)
users=# select * from test1017;
a
------
1
2
3
444
555
1017
(6 rows)
成功啦~