An Introduction of Logical Replication setup and maintenance.

As of 2020/12/23, I’m based on PostgreSQL 13.

Setup

A basic setup is creating a publication on a primary DB and creating a subscription on a replica DB. Some preparations are required before creating publications and subscriptions.

DB parameters

Publisher (Primary) DB

Subscriber (Replica) DB

DB role

PostgreSQL doc for role configuration of logical replication

Publisher DB

To create a publication

To connect from a subscriber

I recommended to create a dedicated user to connect a publisher from a subscriber. If you plan to do DB maintenance by use logical replication, e.g. replicating DB by logical replication and switching to a replica, you may want to block traffic to a primary DB before switching to the replica. In such case, you can realize that by disabling DB users except the one used for logical replication.

Subscriber DB

To create a subscription, the user must be a superuser. (On RDS, rds_superuser instead)

DB Schema

Setup Example on RDS

Publisher DB

CREATE ROLE replication_test LOGIN IN ROLE rds_superuser PASSWORD 'secret'; 

-- connect by replication_test
CREATE PUBLICATION pub_test FOR ALL TABLES;

Subscriber DB

-- connect by a user granted rds_superuser
CREATE SUBSCRIPTION sub
CONNECTION 'host=primarydb port=5432 user=replication_test dbname=test_db password=secret'
PUBLICATION pub_test;

Note: A role should have privileges to update replication destination tables.

Maintaining logical replication

DDL

When you execute DDL, in most cases you need to execute it on the replica as well as the primary.

When you create a table and add it to a publication or you use CREATE PUBLICATION ... FOR ALL TABLES, you need to execute ALTER SUBSCRIPTION … REFRESH PUBLICATION on the replica to start replication on the new table.

Monitoring

Publisher DB

Subscriber DB

Multiplexing replication connections

Replicating all tables by a publication, i.e. CREATE PUBLICATION … FOR ALL TABLES , is a good start point for the first trial. However, if the primary DB has many updates that the replication worker can’t handle by single process, you need to split publications with corresponding subscriptions to parallelize replay of updates.
Note that on logical replication, WALs are sent to the replica when a transaction is committed. It means if there are transactions that has many updates, they will cause large WALs to replay on the replica and could be a cause of replication delay increase.

If you have 2 tables that have many updates, they should be assigned different publications. e.g. given tbl_1 and tbl_2 have many updates, you should create 2 publications like CREATE PUBLICATION pub_1 FOR tbl_1 and CREATE PUBLICATION pub_2 FOR tbl_2.

If a table has significant updates, partitioning the table and assigning different publications for partition tables can be an option.

Trouble shooting

A subscription cannot be dropped when a publication is not reachable

disable subscription and disassociate publication, then you can drop a subscription

test_db=> drop subscription sub_test ;
ERROR:  could not drop the replication slot "sub_test" on publisher
DETAIL:  The error was: ERROR:  replication slot "sub_test" does not exist

test_db=> alter subscription sub_test disable;
ALTER SUBSCRIPTION
test_db=> alter subscription sub_test set (slot_name = NONE);
ALTER SUBSCRIPTION
test_db=> drop subscription sub_test ;
DROP SUBSCRIPTION

Table sync worker failed in the initial data copy due to statement timeout

Error example

2021-01-18 09:07:20 UTC::@:[18426]:ERROR:  could not receive data from WAL stream: ERROR:  canceling statement due to statement timeout
2021-01-18 09:07:20 UTC::@:[18426]:CONTEXT:  COPY test, line 2182054
2021-01-18 09:07:20 UTC::@:[7831]:LOG:  background worker "logical replication worker" (PID 18426) exited with exit code 1

A: increase or disable (set 0) statement_timeout. COPY statement is used for the initial data copy and it takes long time for a large table.

WAL sender process terminated due to replication timeout

Error example

2021-01-25 02:17:23 UTC:54.243.203.218(31507):replication_test@testdb:[45580]:LOG:  terminating walsender process due to replication timeout
2021-01-25 02:17:23 UTC:54.243.203.218(31507):replication_test@testdb:[45580]:CONTEXT:  slot "sub_test", output plugin "pgoutput", in the change callback, associated LSN 57406/F6EB8570

A: increase wal_sender_timeout. similar issue

Table cannot be dropped on the subscriber when it has not synchronized yet

A table of subscriber cannot be dropped even if the table of publisher was already deleted when table synchronization hasn’t completed yet. Publication refresh (ALTER SUBSCRIPTION ... REFRESH PUBLICATION) is required to delete the table of subscriber in this case after the table was deleted on the publisher.

test_db=> drop table test;
ERROR:  could not drop relation mapping for subscription "sub_test"
DETAIL:  Table synchronization for relation "test" is in progress and is in state "i".
HINT:  Use ALTER SUBSCRIPTION ... ENABLE to enable subscription if not already enabled or use DROP SUBSCRIPTION ... to drop the subscription.
test_db=> ALTER SUBSCRIPTION sub_test REFRESH PUBLICATION;
ALTER SUBSCRIPTION
test_db=> drop table test;
DROP TABLE

Observations

Replication apply worker is likely to be a bottleneck.

It’s better to split publications and subscriptions to parallelize replication apply for tables that have heavy traffic.

Logical replication slots must be dropped before in-place major version upgrade of a RDS instance

------------------------------------------------------------------
Upgrade could not be run on Tue Jan 19 03:06:06 2021
------------------------------------------------------------------
The instance could not be upgraded from 11.1.R1 to 12.4.R1 because of following reasons. Please take appropriate action on databases that have usages incompatible with requested major engine version upgrade and try again.
- Following usages in database 'testdb' need to be corrected before upgrade:
-- The instance could not be upgraded because one or more databases have logical replication slots. Please drop all logical replication slots and try again.

References