pg_rewind is a very useful tool for PostgreSQL DBAs building an efficient HA cluster.
https://github.com/vmware/pg_rewind
Here I want to verify an interesting point. Sync replication is quite different from async replication. With async replication there may be data loss (transactions already committed on the master but not yet shipped to the standby), which can leave the standby inconsistent and may be unacceptable for some applications. In the async situation, recovery should therefore start from the master.
Sync replication, on the other hand, loses no data. Consider the following algorithm, taken from a presentation by the PostgreSQL sync replication development team:
Here, T1, T2, … represent points on the timeline:
T1. Master issues a transaction (xid: TX1) commit
T2. Master flushes TX1's WAL records to disk
T3. Master sends TX1's WAL records to the standby, and the standby flushes the received WAL records to disk
(Slide notes: WAL segments are formed from the records received on the standby server; WAL entries are sent before returning from commit.)
T4. Standby returns an acknowledgement to the master
T5. Master commits transaction TX1 in memory and flushes to the db disk storage
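The commit sequence above maps onto PostgreSQL's synchronous replication settings. A minimal sketch of the master's configuration follows; the standby name 'standby1' is an assumption (it must match the application_name the standby uses when connecting):

```
# postgresql.conf on the master -- a sketch, names are assumptions
wal_level = hot_standby                  # generate WAL that a hot standby can replay
max_wal_senders = 3                      # allow replication connections
synchronous_standby_names = 'standby1'   # wait for this standby at commit time
synchronous_commit = on                  # T5 happens only after the standby flushes TX1's WAL (T3-T4)
wal_log_hints = on                       # needed if pg_rewind is used later (or enable data checksums)
```

With synchronous_commit = on, the commit at T1 does not return to the client until the standby reports the WAL flushed to disk, which is exactly the T3/T4 wait in the algorithm.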
The worst-case scenario for this algorithm is a master crash after T2 (between T2 and T3), before the WAL records of TX1 have been sent to the standby. However, on the master the data in memory and on db disk storage has not been committed, since the crash happened before any ack was received from the standby; end users and/or customers have no idea whether transaction TX1 succeeded or not.
The standby machine now promotes to become the new master. Transaction TX1 either leaves no trace or is not committed in the WAL of the new master (promoted from the standby), so no rollback is needed. *Note: from the application viewpoint, the client should resubmit TX1 after a timeout period; this can be implemented either at the application-server level or in middleware.
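The resubmit-after-timeout idea can be sketched as a small client-side retry loop. Everything here is illustrative: submit_tx stands in for the real client call (e.g. psql against the cluster's virtual IP) and is simulated to fail twice, as if the master crashed mid-commit and the standby was then promoted:

```shell
#!/bin/sh
# Client-side retry sketch: resubmit TX1 until it commits or we give up.
attempt=0
submit_tx() {
  # Stand-in for the real submission, e.g.:
  #   psql -h db-vip -c "INSERT INTO orders VALUES (...)"
  # Simulated here: the first two attempts fail (master crash + failover).
  attempt=$((attempt + 1))
  [ "$attempt" -ge 3 ]
}

max_retries=5
result="gave up"
i=1
while [ "$i" -le "$max_retries" ]; do
  if submit_tx; then
    result="committed on attempt $i"
    break
  fi
  echo "attempt $i failed, waiting before resubmitting TX1"
  # sleep "$RETRY_TIMEOUT"   # back off in a real client before retrying
  i=$((i + 1))
done
echo "$result"
```

Because TX1 is idempotent from the cluster's point of view (it either committed everywhere or nowhere), resubmitting after a timeout is safe; the retry logic can live in the application server or in middleware, as noted above.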
As long as the old master uses pg_rewind to sync from the new master, TX1 never happened (TX1 exists only in the old master's old WAL log).
From the PostgreSQL HA cluster's viewpoint, TX1 never committed. From the end user's viewpoint, TX1 never committed.
For a database research analyst, this is an interesting topic. TX1 is only considered committed if both the master and the standby are operating normally, in which case both machines have the WAL record on disk.
If the standby machine fails, then only the master is running and the system is no longer an HA cluster; a WAL record written on the master is considered committed.
Finally, if both master and standby fail after T2 (between T2 and T3), then recover from the master; TX1 will be considered committed.
The scenario above assumes there is only one sync replication standby machine; other standby machines should be attached through cascading async replication for performance reasons. General operating guidelines for this HA configuration:
1. If both the master and the standby are operating normally, a transaction is only considered committed once both machines have its WAL record on disk.
2. If either machine fails, master or standby, the surviving machine's WAL is the 'correct' one. The failed machine should recover by using pg_rewind to sync with the surviving machine.
3. If both master and standby fail simultaneously, then recover from the master. The standby machine should use pg_rewind to recover and sync with the master.
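Guideline 2 can be sketched as a failover runbook. Hostnames, paths, and connection details below are assumptions, not part of the original post; the commands themselves are the standard pg_ctl/pg_rewind invocations:

```shell
# On the surviving standby: promote it to be the new master.
pg_ctl promote -D /var/lib/pgsql/data

# On the failed old master, once the machine is back: make sure the old
# postmaster is cleanly shut down (pg_rewind requires this), then rewind
# the data directory against the new master.
pg_ctl stop -m fast -D /var/lib/pgsql/data
pg_rewind --target-pgdata=/var/lib/pgsql/data \
          --source-server='host=new-master port=5432 user=postgres dbname=postgres'

# Point the rewound node at the new master (recovery.conf / primary_conninfo),
# then start it up as the new synchronous standby.
pg_ctl start -D /var/lib/pgsql/data
```

Note that pg_rewind only works if wal_log_hints = on (or data checksums) was enabled on the cluster before the failure, and the target cluster must have been shut down cleanly before the rewind.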
Postgresql HA cluster-Sync Rep+pg_rewind - part 2
http://my.oschina.net/u/2399919/blog/471459