The pg_rewind improvement in PostgreSQL 9.6

The pg_rewind tool was first introduced in PostgreSQL 9.5. This is best used in a situation when a standby becomes a master, then the old master can be reattached to the new master as a standby without restoring it from the new base backup.

The tool examines the timeline histories of the new master (the old standby) and the old master to determine the point where they diverged, and it expects to find Write-Ahead Logs (WAL) in the old master cluster’s pg_xlog directory reaching all the way back to the point of divergence.

For pg_rewind, DBAs need to have the following:

  1. wal_log_hints = on or PostgreSQL cluster with data checksum.

DBAs can also have data checksums and wal_log_hints, both enabled in their PostgreSQL cluster.

For more information on wal_log_hints and enabling data checksum, the following links are useful:

https://www.postgresql.org/docs/9.5/static/runtime-config-wal.html#GUC-WAL-LOG-HINTS

https://www.postgresql.org/docs/9.5/static/app-initdb.html#APP-INITDB-DATA-CHECKSUMS

  1. full_page_writes should be enabled, which is by default enabled in Postgres. For more information on this parameter, see the following link:

https://www.postgresql.org/docs/9.5/static/runtime-config-wal.html#GUC-FULL-PAGE-WRITES

In Postgres 9.6, Alexander Korotkov has introduced a new feature in the pg_rewind tool. Details of the commit are given below:

 commit e50cda78404d6400b1326a996a4fabb144871151
 Author: Teodor Sigaev <teodor@sigaev.ru>
 Date:   Tue Dec 1 18:56:44 2015 +0300
 
      Use pg_rewind when old master timeline was switched
 
      Allow pg_rewind to work when old master timeline was switched. Now
      user can return promoted standby to old master.
  
      old master timeline history becomes a global variable. Index
      in old master timeline history is used in function interfaces instead of
      specifying TLI directly. Thus, SimpleXLogPageRead() can easily start
      reading XLOGs from next timeline when current timeline ends.
  
      Author: Alexander Korotkov
      Review: Michael Paquier

With the new feature, a DBA/user can now reattach the standby that got promoted due to the following reasons:

  1. User error:

By mistake, the user promoted the standby using pg_ctl promote option or by creating a trigger

  1. Failover script has some logic problem, which resulted in promotion of standby to Master
  2. A user is testing failover scenario. (In failover testing, the standby needs to be promoted and needs to bring back to a state where standby can be safely attached to PostgreSQL master)
  3. Bug in failover management tool (a typical scenario is due to a network blip and the failover management tool promoted the standby)

There could be other possible scenarios where this can be useful.

Let’s try with PostgreSQL 9.5 to see the previous behavior of pg_rewind in a use case, where the DBA wants to reattach the standby to the primary server after promoting it.

For this test, we have two PostgreSQL clusters: node1 and node2. Node1 (port: 54445) is a master/primary. Node2 (port: 54446) is a standby which streams data from master node1 (port 54445)
Below is streaming replication status:

bash-postgres-9.5 $ psql -p 54445 -x -c "select * from pg_stat_replication;"
-[ RECORD 1 ]----+---------------------------------
pid              | 4482
usesysid         | 10
usename          | enterprisedb
application_name | walreceiver
client_addr      | 127.0.0.1
client_hostname  | 
client_port      | 46780
backend_start    | 21-DEC-16 18:15:08.442887 +00:00
backend_xmin     | 
state            | streaming
sent_location    | 0/8000060
write_location   | 0/8000060
flush_location   | 0/8000060
replay_location  | 0/8000060
sync_priority    | 0
sync_state       | async


bash-postgres-9.5 $  psql -p 54446 -c "SELECT pg_is_in_recovery()"
 pg_is_in_recovery 
-------------------
 t
(1 row)

As you can see the PostgreSQL instance running on port 54446 is a standby (i.e it’s in recovery mode) and it’s streaming data from node1 (port 54445).

Now, let’s promote node2 and check the status of replication on node1.

bash-postgres-9.5 $  touch /tmp/test.trigger
bash-postgres-9.5 $  psql -p 54446 -c "SELECT pg_is_in_recovery()"
 pg_is_in_recovery 
-------------------
 f
(1 row)

bash-postgres-9.5 $  psql -p 54445 -x -c "select * from pg_stat_replication;"
(0 rows)

The above snapshot shows that node2(port 54446) is now out of recovery mode. And replication status on node1(port 54445) shows that node2 is not streaming data from node1.

Now, let’s try to use pg_rewind to reattach node2 (port 54446) as standby to node1 (port 54445) to verify behavior in version 9.5.

bash-postgres-9.5 $ pg_ctl -D /var/lib/ppas/node2/ stop
bash-postgres-9.5 $  pg_rewind -D /var/lib/ppas/node2  --source-server="port=54445 host=127.0.0.1" 
could not find common ancestor of the source and target cluster's timelines
Failure, exiting

The above messages show that the user cannot reattach a promoted standby to the old master, which is a limitation of pg_rewind in 9.5, and it’s an expected behavior of pg_rewind 9.5.

We can now try the same scenario with PostgreSQL 9.6 to check the pg_rewind improvement.

bash-postgres-9.6 $ pg_ctl --version
pg_ctl (EnterpriseDB) 9.6.1.4

bash-postgres-9.6$ psql -p 54445 -x -c "select * from pg_stat_replication;"
-[ RECORD 1 ]----+---------------------------------
pid              | 4838
usesysid         | 10
usename          | enterprisedb
application_name | walreceiver
client_addr      | 127.0.0.1
client_hostname  | 
client_port      | 46790
backend_start    | 21-DEC-16 18:35:03.216873 +00:00
backend_xmin     | 
state            | streaming
sent_location    | 0/4000060
write_location   | 0/4000060
flush_location   | 0/4000060
replay_location  | 0/4000060
sync_priority    | 0
sync_state       | async

bash-postgres-9.6$ psql -p 54446 -x -c "select pg_is_in_recovery();"
-[ RECORD 1 ]-----+--
pg_is_in_recovery | t

The above snapshot shows the streaming replication between node1 (port 54445) to node2 (port 54446).

Now, let’s promote node2. Below is a snapshot of promoting node2:

bash-postgres-9.6$ touch /tmp/test
bash-postgres-9.6$ psql -p 54446 -x -c "select pg_is_in_recovery();"
-[ RECORD 1 ]-----+--
pg_is_in_recovery | f


bash-postgres-9.6$ psql -p 54446 -x -c "select * from pg_stat_replication;"
(0 rows)

The above snapshot shows that node2 (port 54446) is out of recovery mode and there is no replication between node1(port 54445) and node2(port 54446).

Let’s try to use pg_rewind to re-attach node2 as standby with node1.

bash-postgres-9.6$ pg_ctl -D /var/lib/ppas/node2/ stop
waiting for server to shut down.... done
server stopped
bash-postgres-9.6$ pg_rewind -D /var/lib/ppas/node2 --source-server="port=54445 host=127.0.0.1" 
servers diverged at WAL position 0/4000140 on timeline 1
rewinding from last common checkpoint at 0/4000098 on timeline 1
Done!

The above snapshot shows that we were able to pg_rewind node2 (port 54446) to the last common checkpoint of node1(port 54445).

We can now start node2 and check the status of replication. Below is a snapshot of that activity:

bash-postgres-9.6$ pg_ctl -D /var/lib/ppas/node2/ start
server starting
-bash-4.1$ 2016-12-21 18:55:15 UTC LOG:  redirecting log output to logging collector process
2016-12-21 18:55:15 UTC HINT:  Future log output will appear in directory "pg_log".

bash-postgres-9.6$ psql -p 54445 -x -c "select * from pg_stat_replication;"
-[ RECORD 1 ]----+---------------------------------
pid              | 5019
usesysid         | 10
usename          | enterprisedb
application_name | walreceiver
client_addr      | 127.0.0.1
client_hostname  | 
client_port      | 46794
backend_start    | 21-DEC-16 18:55:15.635726 +00:00
backend_xmin     | 
state            | streaming
sent_location    | 0/4012BA0
write_location   | 0/4012BA0
flush_location   | 0/4012BA0
replay_location  | 0/4012BA0
sync_priority    | 0
sync_state       | async

bash-postgres-9.6$ psql -p 54446 -x -c "select pg_is_in_recovery();"
-[ RECORD 1 ]-----+--
pg_is_in_recovery | t