How to upgrade PostgreSQL with logical replication

The goal of this guide is to upgrade a PostgreSQL Database to a newer version by means of logical replication. Aptible uses pglogical to create logical replicas.

The main benefit of using this method is that the replica can be created beforehand and will stay up-to-date with the source Database until it's time to cut over to the new Database. This allows for upgrades to be performed with minimal downtime.

Preparation

Test the schema

If data is being transferred to a Database running a different PostgreSQL version than the original, first check that the schema can be restored on the desired version by following the How to test a PostgreSQL Database's schema on a new version guide.

Test the upgrade

Testing the schema should catch a number of issues but it's also recommended to test the upgrade before performing it in production. The easiest way to do this is to restore the latest backup of the Database and performing the upgrade against the restored Database. The restored Database should have the same container size as the production Database.

Example:

aptible backup:restore 1234 --handle upgrade-test --container-size 4096

Tools

Install the Aptible CLI and PostgreSQL Client Tools. This guide uses the psql client tool.

Configuration

Collect information on the Database you'd like to upgrade and store it in the following environment variables for use later in the guide:

  • SOURCE_HANDLE - The handle (i.e. name) of the Database.
  • ENVIRONMENT - The handle of the environment the Database belongs to.

Example:

SOURCE_HANDLE='source-db'
ENVIRONMENT='test-environment'

Collect information on the replica and store it in in the following environment variables:

  • REPLICA_HANDLE - The handle (i.e. name) for the Database.
  • REPLICA_VERSION - The desired PostgreSQL version. Run aptible db:versions to see a full list of options.
  • REPLICA_CONTAINER_SIZE (Optional) - The size of the replica's container in MB. Having more memory and CPU available speeds up the initialization process, up to a certain point. See the Database Scaling documentation for a full list of supported container sizes.

Example:

REPLICA_HANDLE='upgrade-test'
REPLICA_VERSION='14'
REPLICA_CONTAINER_SIZE=4096

Tunnel into the source Database

In a separate terminal, create a Database Tunnel to the source Database using the Aptible CLI.

aptible db:tunnel "$SOURCE_HANDLE" --environment "$ENVIRONMENT"

The tunnel will block the current terminal until it's stopped. Collect the tunnel's full URL, which is printed by aptible db:tunnel, and store it in the SOURCE_URL environment variable in the original terminal.

Example:

SOURCE_URL='postgresql://aptible:[email protected]:5432/db'

Check for existing pglogical nodes

Each PostgreSQL database on the server can only have a single pglogical node. If there's already an existing node, the replica will fail setup. The following script will check for existing pglogical nodes.

psql "$SOURCE_URL" --tuples-only --no-align --command \
  'SELECT datname FROM pg_database WHERE datistemplate IS FALSE' |

while IFS= read -r db; do
  psql "$SOURCE_URL" -v ON_ERROR_STOP=1 << EOF &> /dev/null
    \connect "$db"
    SELECT pglogical.pglogical_node_info();
EOF

  if [ $? -eq 0 ]; then
    echo "pglogical node found on $db"
  fi
done

If the command does not report any nodes, no action is necessary. If it does, either replication will have to be set up manually, instead of using aptible db:replicate --logical, or the node will have to be dropped.

Note that if logical replication was previously attempted, but failed, then the node could be left behind from the previous attempt. See the Cleanup section and follow the instructions for cleaning up the source Database.

Check for tables without a primary key

Logical replication requires that rows be uniquely identifiable in order to function properly. This is most easily accomplished by ensuring that all tables have a primary key.

The following script will iterate over all PostgreSQL databases on the Database server and list tables that do not have a primary key:

psql "$SOURCE_URL" --tuples-only --no-align --command \
  'SELECT datname FROM pg_database WHERE datistemplate IS FALSE' |

while IFS= read -r db; do
  echo "Database: $db"
  psql "$SOURCE_URL" << EOF
    \connect "$db";

    SELECT tab.table_schema, tab.table_name
    FROM information_schema.tables tab
    LEFT JOIN information_schema.table_constraints tco
              ON tab.table_schema = tco.table_schema
              AND tab.table_name = tco.table_name
              AND tco.constraint_type = 'PRIMARY KEY'
    WHERE tab.table_type = 'BASE TABLE'
          AND tab.table_schema NOT IN ('pg_catalog', 'information_schema', 'pglogical')
          AND tco.constraint_name IS NULL
    ORDER BY table_schema, table_name;
EOF
done

If all of the databases return (0 rows) then no action is necessary.

Example output:

Database: db
You are now connected to database "db" as user "aptible".
 table_schema | table_name
--------------+------------
(0 rows)

Database: postgres
You are now connected to database "postgres" as user "aptible".
 table_schema | table_name
--------------+------------
(0 rows)

If any tables come back without a primary key, one can be added to an existing column or a new column with ALTER TABLE.

Create the replica

The upgraded replica can be created ahead of the actual upgrade as it will stay up-to-date with the source Database.

aptible db:replicate "$SOURCE_HANDLE" "$REPLICA_HANDLE" \
  --logical \
  --version "$REPLICA_VERSION" \
  --environment "$ENVIRONMENT" \
  --container-size "${REPLICA_CONTAINER_SIZE:-4096}"

If the command raises errors, review the operation logs output by the command for an explanation as to why the error occurred. In order to attempt logical replication after the issue(s) have been addressed, the source Database will need to be cleaned up. See the Cleanup section and follow the instructions for cleaning up the source Database. The broken replica also needs to be deprovisioned in order to free up its handle to be used by the new replica:

aptible db:deprovision "$REPLICA_HANDLE" --environment "$ENVIRONMENT"

If the operation is successful, then the replica has been successfully set up. All that remains is for it to finish initializing (i.e. pulling all existing data) then it will be ready to be cut over to.

πŸ“˜

Note

pglogical will copy the source Database's structure at the time the subscription is created. However, subsequent changes to the Database structure, a.k.a. Data Definition Language (DDL) commands, are not included in logical replication. These commands need to be applied to the replica as well as the source Database to ensure that changes to the data are properly replicated.

pglogical provides a convenient replicate_ddl_command function that, when run on the source Database, applies a DDL command to the source Database then queues the statement to be applied to the replica. For example, to add a column to a table:

SELECT pglogical.replicate_ddl_command('ALTER TABLE public.foo ADD COLUMN bar TEXT;');

Execution

Tunnel into the replica

In a separate terminal, create a Database Tunnel to the replica using the Aptible CLI.

aptible db:tunnel "$REPLICA_HANDLE" --environment "$ENVIRONMENT"

The tunnel will block the current terminal until it's stopped. Collect the tunnel's full URL, which is printed by aptible db:tunnel, and store it in the REPLICA_URL environment variable in the original terminal.

Example:

REPLICA_URL='postgresql://aptible:[email protected]:5432/db'

Wait for initialization to complete

While replicas are usually created very quickly, it can take some time to pull all of the data from the source Database depending on its disk footprint. The replica can be queried to see what tables still need to be initialized.

SELECT * FROM pglogical.local_sync_status WHERE NOT sync_status = 'r';

If any rows are returned, the replica is still initializing. This query can be used in a short script to test and wait for initialization to complete on all databases on the replica:

psql "$REPLICA_URL" --tuples-only --no-align --command \
  'SELECT datname FROM pg_database WHERE datistemplate IS FALSE' |

while IFS= read -r db; do
  while psql "$REPLICA_URL" --tuples-only --quiet << EOF | grep -E '.+'; do
    \connect "$db"
    SELECT * FROM pglogical.local_sync_status WHERE NOT sync_status = 'r';
EOF
    sleep 3
  done
done

There is a known issue with pglogical in which, during replica initialization, replication may pause until the next time the source Database is written to. For production Databases this usually isn't an issue since it's being actively used but for Databases that aren't used much, like Databases that may have been restored to test logical replication, this issue can arise.

The following script works similar to the one above but it also creates a table, writes to it, then drops the table in order to ensure that initialization continues even if the source Database is idle:

psql "$REPLICA_URL" --tuples-only --no-align --command \
  'SELECT datname FROM pg_database WHERE datistemplate IS FALSE' |

while IFS= read -r db; do
  while psql "$REPLICA_URL" --tuples-only --quiet << EOF | grep -E '.+'; do
    \connect "$db"
    SELECT * FROM pglogical.local_sync_status WHERE NOT sync_status = 'r';
EOF
    psql "$SOURCE_URL" -v ON_ERROR_STOP=1 --quiet << EOF
      \connect "$db"
      CREATE TABLE _aptible_logical_sync (col INT);
      INSERT INTO _aptible_logical_sync VALUES (1);
      DROP TABLE _aptible_logical_sync;
EOF

    sleep 3
  done
done

Once the query returns zero rows from the replica, or one of the scripts completes, the replica has finished initializing which means it's ready to be cut over to.

Speeding Up Initialization

Each index on a table adds overhead to inserting rows so the more indexes a table has, the longer it will take to be copied over. This can cause large Databases or those with many indexes to take much longer to initialize. If the initialization process appears to be going slowly, all of the indexes (except for primary keys) can be disabled:

psql "$REPLICA_URL" --tuples-only --no-align --command \
  'SELECT datname FROM pg_database WHERE datistemplate IS FALSE' |

while IFS= read -r db; do
  echo "Database: $db"
  psql "$REPLICA_URL" << EOF
    \connect "$db"

    UPDATE pg_index SET indisready = FALSE WHERE indexrelid IN (
      SELECT idx.indexrelid FROM pg_index idx
      INNER JOIN pg_class cls ON idx.indexrelid = cls.oid
      INNER JOIN pg_namespace nsp ON cls.relnamespace = nsp.oid
      WHERE nsp.nspname !~ '^pg_'
      AND nsp.nspname NOT IN ('information_schema', 'pglogical')
      AND idx.indisprimary IS FALSE
    );
EOF
done

# Reload in order to restart the current COPY operation without indexes
aptible db:reload "$REPLICA_HANDLE" --environment "$ENVIRONMENT"

After the replica has been initialized, the indexes will need to be rebuilt. This can still take some time for large tables but is much faster than the indexes being evaluated each time a row is inserted:

psql "$REPLICA_URL" --tuples-only --no-align --command \
  'SELECT datname FROM pg_database WHERE datistemplate IS FALSE' |

while IFS= read -r db; do
  echo "Database: $db"
  psql "$REPLICA_URL" --tuples-only --no-align --quiet << EOF |
    \connect "$db"

    SELECT CONCAT('"', nsp.nspname, '"."', cls.relname, '"') FROM pg_index idx
      INNER JOIN pg_class cls ON idx.indexrelid = cls.oid
      INNER JOIN pg_namespace nsp ON cls.relnamespace = nsp.oid
      WHERE nsp.nspname !~ '^pg_'
      AND nsp.nspname NOT IN ('information_schema', 'pglogical')
      AND idx.indisprimary IS FALSE
      AND idx.indisready IS FALSE;
EOF

  while IFS= read -r index; do
    echo "Reindexing: $index"
    psql "$REPLICA_URL" --quiet << EOF
      \connect "$db"

      REINDEX INDEX CONCURRENTLY $index;
EOF
  done
done

If any indexes have issues reindexing CONCURRENTLY this keyword can be removed but note that, when not indexing concurrently, the table the index belongs to will be locked which will prevent writes while indexing.

Enable synchronous replication

Enabling synchronous replication ensures that all data that's written to the source Database is also written to the replica:

psql "$SOURCE_URL" << EOF
  ALTER SYSTEM SET synchronous_standby_names=aptible_subscription;
  SELECT pg_reload_conf();
EOF

Scale Services down

This step is optional. Scaling all Services that use the source Database to zero containers ensures that they can’t write to the Database during the cutover. This will result in some downtime in exchange for preventing replication conflicts that can result from Services writing to both the source and replica Databases at the same time.

It's usually easiest to prepare a script that scales all Services down and another that scales them back up to their current values once the upgrade has been complete. Current container counts can be found in the Aptible Dashboard or by running APTIBLE_OUTPUT_FORMAT=json aptible apps.

Example scale command:

aptible apps:scale --app my-app cmd --container-count 0

Update all Apps to use the replica

Assuming Database's Credentials are provided to Apps via the App's Configuration, this can be done relatively easily using the aptible config:set command. This step is also usually easiest to commplete by preparing a script that updates all relevant Apps.

Example config command:

aptible config:set --app my-app DB_URL='postgresql://user:[email protected]:5432/db'

Sync sequences

Ensure that the sequences on the replica are up-to-date with the source Database:

psql "$SOURCE_URL" --tuples-only --no-align --command \
  'SELECT datname FROM pg_database WHERE datistemplate IS FALSE' |

while IFS= read -r db; do
  psql "$SOURCE_URL" << EOF
    \connect "$db"
    SELECT pglogical.synchronize_sequence( seqoid ) FROM pglogical.sequence_state;
EOF
done

Stop replication

Now that all the Apps have been updated to use the new replica, there is no need to replicate changes from the source Database.

Drop the pglogical subscriptions, nodes, and extension from the replica:

psql "$REPLICA_URL" --tuples-only --no-align --command \
  'SELECT datname FROM pg_database WHERE datistemplate IS FALSE' |

while IFS= read -r db; do
  psql "$REPLICA_URL" << EOF
    \connect "$db"
    SELECT pglogical.drop_subscription('aptible_subscription');
    SELECT pglogical.drop_node('aptible_subscriber');
    DROP EXTENSION pglogical;
EOF
done

Clear synchronous_standby_names on the source Database:

psql "$SOURCE_URL" << EOF
  ALTER SYSTEM RESET synchronous_standby_names;
  SELECT pg_reload_conf();
EOF

Scale Services up

Scale any services that were scaled down to zero containers back to their original number of container. If a script was created to do this, now is the time to run it.

Example scale command:

aptible apps:scale --app my-app cmd --container-count 2

Once all of the services have come back up, the upgrade is complete!

Cleanup

Vacuum and Analyze

Vacuuming the target Database after upgrading reclaims space occupied by dead tuples and analyzing the tables collects information on the table's contents in order to improve query performance.

psql "$REPLICA_URL" --tuples-only --no-align --command \
  'SELECT datname FROM pg_database WHERE datistemplate IS FALSE' |

while IFS= read -r db; do
  psql "$REPLICA_URL" << EOF
    \connect "$db"

    VACUUM ANALYZE;
EOF
done

Source Database

🚧

Caution

If you're cleaning up from a failed replication attempt and you're not sure if pglogical was being used previously, check with other members of your organization before performing cleanup as this may break existing pglogical subscribers.

Drop the pglogical replication slots (if they exist), nodes, and extension:

psql "$SOURCE_URL" --tuples-only --no-align --command \
  'SELECT datname FROM pg_database WHERE datistemplate IS FALSE' |

while IFS= read -r db; do
  psql "$SOURCE_URL" << EOF
    \connect "$db"

    SELECT pg_drop_replication_slot((
      SELECT pglogical.pglogical_gen_slot_name(
        '$db',
        'aptible_publisher',
        'aptible_subscription'
      )
    ));

    \set ON_ERROR_STOP 1

    SELECT pglogical.drop_node('aptible_publisher');
    DROP EXTENSION pglogical;
EOF
done

If the script above raises errors about replication slots being active, then replication was not stopped properly. Ensure that the instructions in the Stop replication section has been completed.

Reset max_worker_processes

aptible db:replicate --logical may have increased the max_worker_processes on the replica to ensure that it has enough to support replication. Now that replication has ben terminated, the setting can be set back to the default by running the following command:

psql "$REPLICA_URL" --command "ALTER SYSTEM RESET max_worker_processes;"

See How Logical Replication Works in the command documentation for more details.

Unlink the Databases

Aptible retains a link between replicas and their source Database that prevents the source Database from being deleted before the replica. In order to deprovision the source Database once you've switched over to the replica, contact Aptible Support with the name of the replica (i.e. $REPLICA_HANDLE) and request that it be unlinked from its source Database.

Deprovision

Once the original Database is no longer necessary, it should be deprovisioned or it will continue to incur costs. Note that this will purge all automated Backups. If you'd like to retain the Backups, contact Aptible Support to update them.

aptible db:deprovision "$SOURCE_HANDLE" --environment "$SOURCE_ENVIRONMENT"