Concurrency

The reason why pgcopydb has been developed is mostly to allow two aspects that are not possible to achieve directly with pg_dump and pg_restore, and that requires just enough fiddling around that not many scripts have been made available to automate around.

Notes about concurrency

The pgcopydb too implements many operations concurrently to one another, by ways of using the fork() system call. This means that pgcopydb creates sub-processes that each handle a part of the work.

The process tree then looks like the following:

$ pgcopydb clone --follow --table-jobs 4 --index-jobs 4 --large-objects-jobs 4
 + pgcopydb clone worker
    + pgcopydb copy supervisor [ --table-jobs 4 ]
      - pgcopydb copy queue worker
      - pgcopydb copy worker
      - pgcopydb copy worker
      - pgcopydb copy worker
      - pgcopydb copy worker

    + pgcopydb blob metadata worker [ --large-objects-jobs 4 ]
      - pgcopydb blob data worker
      - pgcopydb blob data worker
      - pgcopydb blob data worker
      - pgcopydb blob data worker

    + pgcopydb index supervisor [ --index-jobs 4 ]
      - pgcopydb index/constraints worker
      - pgcopydb index/constraints worker
      - pgcopydb index/constraints worker
      - pgcopydb index/constraints worker

    + pgcopydb vacuum supervisor [ --table-jobs 4 ]
      - pgcopydb vacuum analyze worker
      - pgcopydb vacuum analyze worker
      - pgcopydb vacuum analyze worker
      - pgcopydb vacuum analyze worker

    + pgcopydb sequences reset worker

 + pgcopydb follow worker [ --follow ]
   - pgcopydb stream receive
   - pgcopydb stream apply

Observe that when using pgcopydb clone --follow --table-jobs 4 --index-jobs 4 --large-objects-jobs 4, pgcopydb creates 26 sub-processes.

The 26 total is counted from:

1 clone worker + 1 copy supervisor + 1 copy queue worker + 4 copy workers + 1 blob metadata worker + 4 blob data workers + 1 index supervisor + 4 index workers + 1 vacuum supervisor + 4 vacuum workers + 1 sequence reset worker

that’s 1 + 1 + 1 + 4 + 1 + 4 + 1 + 4 + 1 + 4 + 1 = 23

1 follow worker + 1 stream receive + 1 stream apply

that’s 1 + 1 + 1 = 3

At the end, it is 23 + 3 = 26 total

Here is a description of the process tree:

When starting with the TABLE DATA copying step, then pgcopydb creates as many sub-processes as specified by the --table-jobs command line option (or the environment variable PGCOPYDB_TABLE_JOBS), and an extra process is created to send the table to the queue and to handle TRUNCATE commands for COPY-partitioned tables.

A single sub-process is created by pgcopydb to copy the Postgres Large Objects (BLOBs) metadata found on the source database to the target database, and as many as --large-objects-jobs processes are started to copy the large object data.

To drive the index and constraint build on the target database, pgcopydb creates as many sub-processes as specified by the --index-jobs command line option (or the environment variable PGCOPYDB_INDEX_JOBS).

It is possible with Postgres to create several indexes for the same table in parallel, for that, the client just needs to open a separate database connection for each index and run each CREATE INDEX command in its own connection, at the same time. In pgcopydb, this is implemented by running one sub-process per index to create.

The --index-jobs option is global for the whole operation, so that it’s easier to setup to the count of available CPU cores on the target Postgres instance. Usually, a given CREATE INDEX command uses 100% of a single core.

To drive the VACUUM ANALYZE workload on the target database, pgcopydb creates as many sub-processes as specified by the --table-jobs command line option.

To reset sequences in parallel to COPYing the table data, pgcopydb creates a single dedicated sub-process.

When using the --follow option then another sub-process leader is created to handle the two Change Data Capture processes.

One process implements pgcopydb stream receive to fetch changes using logical decoding and store them in the SQLite Change Data Capture output database.

Another process implements pgcopydb stream apply to transform those changes inline (writing parameterised statements to the CDC replay database) and apply them to the target Postgres instance. This process loops over querying the pgcopydb sentinel table until the apply mode has been enabled, and then applies the replayed transactions.

The receive and apply processes coordinate their shutdown over a small lifecycle pipe (modelled on PostgreSQL’s postmaster death-watch); see The receive→apply lifecycle pipe for the design.

For each table, build all indexes concurrently

pgcopydb takes the extra step and makes sure to create all your indexes in parallel to one-another, going the extra mile when it comes to indexes that are associated with a constraint.

Postgres introduced the configuration parameter synchronize_seqscans in version 8.3, eons ago. It is on by default and allows the following behavior:

From the PostgreSQL documentation

This allows sequential scans of large tables to synchronize with each other, so that concurrent scans read the same block at about the same time and hence share the I/O workload.

The other aspect that pg_dump and pg_restore are not very smart about is how they deal with the indexes that are used to support constraints, in particular unique constraints and primary keys.

Those indexes are exported using the ALTER TABLE command directly. This is fine because the command creates both the constraint and the underlying index, so the schema in the end is constructed as expected.

That said, those ALTER TABLE ... ADD CONSTRAINT commands require a level of locking that prevents any concurrency. As we can read on the docs for ALTER TABLE:

From the PostgreSQL documentation

Although most forms of ADD table_constraint require an ACCESS EXCLUSIVE lock, ADD FOREIGN KEY requires only a SHARE ROW EXCLUSIVE lock. Note that ADD FOREIGN KEY also acquires a SHARE ROW EXCLUSIVE lock on the referenced table, in addition to the lock on the table on which the constraint is declared.

The trick is then to first issue a CREATE UNIQUE INDEX statement and when the index has been built then issue a second command in the form of ALTER TABLE ... ADD CONSTRAINT ... PRIMARY KEY USING INDEX ..., as in the following example which is taken from the logs of an actual pgcopydb run:

52:06 68898 INFO  COPY "demo"."tracking";
52:06 68899 INFO  COPY "demo"."client";
52:06 68899 INFO  Creating 2 indexes for table "demo"."client"
52:06 68906 INFO  CREATE UNIQUE INDEX client_pkey ON demo.client USING btree (client);
52:06 68907 INFO  CREATE UNIQUE INDEX client_pid_key ON demo.client USING btree (pid);
52:06 68898 INFO  Creating 1 indexes for table "demo"."tracking"
52:06 68908 INFO  CREATE UNIQUE INDEX tracking_pkey ON demo.tracking USING btree (client, ts);
52:06 68907 INFO  ALTER TABLE "demo"."client" ADD CONSTRAINT "client_pid_key" UNIQUE USING INDEX "client_pid_key";
52:06 68906 INFO  ALTER TABLE "demo"."client" ADD CONSTRAINT "client_pkey" PRIMARY KEY USING INDEX "client_pkey";
52:06 68908 INFO  ALTER TABLE "demo"."tracking" ADD CONSTRAINT "tracking_pkey" PRIMARY KEY USING INDEX "tracking_pkey";

This trick is worth a lot of performance gains on its own, as has been discovered and experienced and appreciated by pgloader users already.

Same-table Concurrency

For some databases, it just so happens that most of the database size on-disk is contained within a single giant table, or within a short list of giant tables. When this happens, the concurrency model that is implemented with --table-jobs still allocates a single process to COPY all the data from the source table.

Same-table concurrency allows pgcopydb to use more than one process at the same time to process a single source table. The data is then logically partitioned (on the fly) and split between processes:

To fetch the data from the source database, the COPY processes then use SELECT queries like in the following example:
COPY (SELECT * FROM source.table WHERE id BETWEEN      1 AND 123456)
COPY (SELECT * FROM source.table WHERE id BETWEEN 123457 AND 234567)
COPY (SELECT * FROM source.table WHERE id BETWEEN 234568 AND 345678)
...
This is only possible when the source.table has at least one column of an integer type (int2, int4, and int8 are supported) and with a UNIQUE or PRIMARY KEY constraint. We must make sure that any given row is selected only once overall to avoid introducing duplicates on the target database.

When a table is missing such a primary key column of an integer data type, pgcopydb then automatically resorts to using CTID based comparisons. See Postgres documentation section about System Columns for more information about Postgres CTIDs.

The COPY processes then use the SELECT queries like in the following example:
COPY (SELECT * FROM source.table WHERE ctid >= '(0,0)'::tid and ctid < '(5925,0)'::tid)
COPY (SELECT * FROM source.table WHERE ctid >= '(5925,0)'::tid and ctid < '(11850,0)'::tid)
COPY (SELECT * FROM source.table WHERE ctid >= '(11850,0)'::tid and ctid < '(17775,0)'::tid)
COPY (SELECT * FROM source.table WHERE ctid >= '(17775,0)'::tid and ctid < '(23698,0)'::tid)
COPY (SELECT * FROM source.table WHERE ctid >= '(23698,0)'::tid)
To decide if a table COPY processing should be split, the command line option split-tables-larger-than is used, or the environment variable PGCOPYDB_SPLIT_TABLES_LARGER_THAN.

The expected value is either a plain number of bytes, or a pretty-printed number of bytes such as 250 GB.

When using this option, then tables that have at least this amount of data and also a candidate key for the COPY partitioning are then distributed among a number of COPY processes.

The number of COPY processes is computed by dividing the table size by the threshold set with the split option. For example, if the threshold is 250 GB then a 400 GB table is going to be distributed among 2 COPY processes.

The command pgcopydb list table-parts may be used to list the COPY partitioning that pgcopydb computes given a source table and a threshold.

Significant differences when using same-table COPY concurrency

When same-table concurrency happens for a source table, some operations are not implemented as they would have been without same-table concurrency. Specifically:

TRUNCATE and COPY FREEZE Postgres optimization

When using a single COPY process, it’s then possible to TRUNCATE the target table in the same transaction as the COPY command, as in the following synthetic example:
BEGIN;
TRUNCATE table ONLY;
COPY table FROM stdin WITH (FREEZE);
COMMIT
This technique allows Postgres to implement several optimizations, doing work during the COPY that would otherwise need to happen later when executing the first queries on the table.

When using same-table concurrency then we have several transactions happening concurrently on the target system that are copying data from the source table. This means that we have to TRUNCATE separately and the FREEZE option can not be used.
CREATE INDEX and VACUUM

Even when same-table COPY concurrency is enabled, creating the indexes on the target system only happens after the whole data set has been copied over. This means that only when the last process is done with the COPYing, then this process will take care of the indexes and the vacuum analyze operation.

Same-table COPY concurrency performance limitations

Finally, it might be that same-table concurrency is not effective at all in some use cases. Here is a list of limitations to have in mind when selecting to use this feature:

Network Bandwidth

The most common performance bottleneck relevant to database migrations is the network bandwidth. When the bandwidth is saturated (used in full) then same-table concurrency will provide no performance benefits.

Disks IOPS

The second most common performance bottleneck relevant to database migrations is disks IOPS and, in the Cloud, burst capacity. When the disk bandwidth is used in full, then same-table concurrency will provide no performance benefits.

Source database system uses read IOPS, target database system uses both read and write IOPS (copying the data writes to disk, creating the indexes both read table data from disk and then write index data to disk).

On-disk data organisation

When using a single COPY process, the target system may fill-in the Postgres table in a clustered way, using each disk page in full before opening the next one, in a sequential fashion.

When using same-table COPY concurrency, then the target Postgres system needs to handle concurrent writes to the same table, resulting in a possibly less effective disk usage.

How that may impact your application performance is to be tested.

synchronize_seqscans

Postgres implemented this option back in version 8.3. The option is now documented in the Version and Platform Compatibility section.

The documentation reads:

This allows sequential scans of large tables to synchronize with each other, so that concurrent scans read the same block at about the same time and hence share the I/O workload.

The impact on performance when having concurrent COPY processes reading the same source table at the same time is to be assessed. At the moment there is no option in pgcopydb to SET synchronize_seqscans TO off when using same-table COPY concurrency.

Use your usual Postgres configuration editing for testing.

Cloning all databases: the `--all-databases` option

The --all-databases option to pgcopydb clone allows cloning every non-template user database from a source Postgres instance to a target instance in a single invocation. This is the equivalent of pg_dumpall for the data phase.

When --all-databases is active, pgcopydb coordinates the copy in three phases that run sequentially, while keeping the --table-jobs and --index-jobs worker pools global — shared across all databases, not allocated per-database.

Phase I — snapshot export and pre-data (parallel across databases)

A long-lived snapshot holder subprocess is forked first. It connects to each source database, exports one REPEATABLE READ snapshot per database, writes the snapshot IDs into the instance catalog, and then blocks on a pipe until Phase III completes. This keeps all per-database snapshot transactions open so that Phase I and II workers can call SET TRANSACTION SNAPSHOT to import them for consistent reads.

Once all snapshots are ready, a pre-data supervisor is forked. It starts --table-jobs pre-data workers that each dequeue a database name from a shared queue, then run the full pre-data pipeline for that database: schema fetch from the source, pg_dump --pre-data, and pg_restore --pre-data on the target. Roles are copied once at the instance level before any of this begins.

The process tree for this phase looks like the following:

$ pgcopydb clone --all-databases --table-jobs 4 --index-jobs 4
 + pgcopydb snapshot holder           (blocks until Phase III done)
 + pgcopydb pre-data supervisor
     - pgcopydb pre-data worker       (processes db1, then db4, …)
     - pgcopydb pre-data worker       (processes db2, then db5, …)
     - pgcopydb pre-data worker       (processes db3, then db6, …)
     - pgcopydb pre-data worker

Phase II — global COPY, indexes, and vacuum (single shared pool)

After all pre-data work is complete, the COPY phase begins. All tables from all databases are collected and sorted by size (largest first), then enqueued into a single global COPY queue. The --table-jobs COPY workers and --index-jobs index workers are each created once and serve the entire workload regardless of which database each table belongs to.

This is the key resource-sharing optimisation: the largest tables get started first across database boundaries, keeping all workers busy even when the database sizes are very uneven.

Each COPY worker maintains a small connection cache keyed by database name so that connections are reused when consecutive queue entries belong to the same database. Index and vacuum workers open their own per-database connections on demand.

When --all-databases is active, each DDL and DML progress log line is prefixed with the database name so that interleaved output from workers handling different databases stays distinguishable:

pagila: CREATE UNIQUE INDEX actor_pkey ON actor USING btree (actor_id)
f1db:   TRUNCATE ONLY "public"."constructorstandings"
f1db:   COPY "public"."constructorstandings"
chinook: VACUUM ANALYZE "public"."Track"

The process tree for this phase looks like the following:

+ pgcopydb global copy supervisor
    - pgcopydb global copy queue filler  (enqueues all tables, sorted by size)
    - pgcopydb global copy worker        (--table-jobs workers, any database)
    - pgcopydb global copy worker
    - pgcopydb global copy worker
    - pgcopydb global copy worker
+ pgcopydb global index supervisor
    - pgcopydb index/constraints worker  (--index-jobs workers, any database)
    - pgcopydb index/constraints worker
    - pgcopydb index/constraints worker
    - pgcopydb index/constraints worker
+ pgcopydb global vacuum supervisor
    - pgcopydb vacuum analyze worker     (--table-jobs workers, any database)
    - pgcopydb vacuum analyze worker
    - pgcopydb vacuum analyze worker
    - pgcopydb vacuum analyze worker

Phase III — post-data (sequential, one database at a time)

After the global COPY phase completes, the remaining post-data work runs sequentially: for each database, a subprocess is forked that calls pg_restore --post-data (triggers, rules, comments, and any objects not already created by the index workers) and resets sequences to match the source values captured during Phase I. The databases are processed one at a time in the order they were discovered.

Once all post-data work is done, the parent closes the done-pipe to the snapshot holder, which then commits all its held transactions and exits.

Consolidated summary

At the end of an --all-databases run, pgcopydb prints a consolidated summary that aggregates timing statistics across all databases: COPY throughput, index build time, constraint time, VACUUM time, and total wall-clock duration. Individual per-database summaries are also printed before the consolidated one.