Resuming Operations (snapshots)
An important aspect of the pgcopydb design, detailed in the documentation section Notes about concurrency, is its use of many concurrent worker processes to implement parallelism.
Even when using multiple worker processes, pgcopydb operations must remain consistent: it is essential to guarantee that the same source schema and data set is used by every single worker process throughout the operations.
Consistency with multiple Postgres sessions is achieved thanks to Postgres' ability to export and import snapshots. As per the Postgres docs about Snapshot Synchronization Functions:
From the PostgreSQL documentation
PostgreSQL allows database sessions to synchronize their snapshots. A snapshot determines which data is visible to the transaction that is using the snapshot. Synchronized snapshots are necessary when two or more sessions need to see identical content in the database. If two sessions just start their transactions independently, there is always a possibility that some third transaction commits between the executions of the two START TRANSACTION commands, so that one session sees the effects of that transaction and the other does not.
To solve this problem, PostgreSQL allows a transaction to export the snapshot it is using. As long as the exporting transaction remains open, other transactions can import its snapshot, and thereby be guaranteed that they see exactly the same view of the database that the first transaction sees. But note that any database changes made by any one of these transactions remain invisible to the other transactions, as is usual for changes made by uncommitted transactions. So the transactions are synchronized with respect to pre-existing data, but act normally for changes they make themselves.
Snapshots are exported with the pg_export_snapshot function, shown in Table 9.94, and imported with the SET TRANSACTION command.
Using these Postgres APIs allows pgcopydb to implement consistent operations even when using multiple worker processes.
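For illustration, here is a minimal sketch of that flow using two psql sessions; the snapshot identifier shown is illustrative:

-- session 1: export a snapshot, and keep this transaction open
BEGIN;
SELECT pg_export_snapshot();
--  pg_export_snapshot
-- ---------------------
--  00000003-00000048-1

-- session 2: import that snapshot before running any other query;
-- SET TRANSACTION SNAPSHOT requires REPEATABLE READ (or SERIALIZABLE)
BEGIN ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-00000048-1';
-- both sessions now see exactly the same pre-existing data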
Bypassing consistency issues
If you can ensure that no writes happen on the source database for the whole duration of the pgcopydb operations, meaning no schema changes (DDL) and no data changes (DML), then consistency issues cannot happen: the database is static for our purposes, typically within a maintenance window setup where the applications are disconnected from the source database service.
Note that pgcopydb offers the --not-consistent option, which allows bypassing all the complexity of sharing a snapshot throughout the operations. In particular, resuming operations after a crash, or implementing multi-step operations, is made easier when bypassing consistency aspects altogether.
When you are able to work within a maintenance window where the database is isolated from any application traffic, consider using --not-consistent.
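For instance, a maintenance-window run could look like the following sketch, where the connection strings are placeholders:

$ pgcopydb clone --not-consistent \
         --source "postgres://user@source.host/dbname" \
         --target "postgres://user@target.host/dbname"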
Consistency and concurrency: Postgres snapshots
As seen above, Postgres offers different APIs to export and import a snapshot:
- The function pg_export_snapshot() exports the current snapshot.
- The SQL command SET TRANSACTION SNAPSHOT imports the given snapshot.
- The replication protocol command CREATE_REPLICATION_SLOT allows exporting its snapshot.
Exporting a Postgres snapshot can be done either at replication slot creation time, or from a non-replication connection using the SQL function pg_export_snapshot(). This is an either/or situation: Postgres does not allow mixing these two approaches.
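As a sketch, the replication protocol variant can be exercised from psql over a replication connection; the slot name, output plugin, and returned values shown here are illustrative:

$ psql "dbname=pagila replication=database" \
       -c "CREATE_REPLICATION_SLOT pgcopydb LOGICAL test_decoding EXPORT_SNAPSHOT;"
 slot_name | consistent_point |    snapshot_name    | output_plugin
-----------+------------------+---------------------+---------------
 pgcopydb  | 0/1A2B3C4        | 00000003-00000048-1 | test_decoding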
Also remember that a single snapshot must be used throughout pgcopydb operations, both for the initial COPY of the schema and data and for the Change Data Capture aspects, in order to achieve consistency (no data loss, no duplicates in the data change stream).
To be able to implement multiple worker processes in pgcopydb and have a consistent view of the same database (schema, data) in every single process, pgcopydb needs to first export a single common snapshot and then have every worker process import that snapshot when connecting to the source database.
When implementing Change Data Capture thanks to the --follow option, it is also required that there is no gap between the initial snapshot and the first change received, and that no changes are sent that were already part of the initial copy. Postgres offers that guarantee via the snapshot exporting facility of the CREATE_REPLICATION_SLOT replication command.
As a result, when the --follow option is used, the Postgres API requires the pgcopydb snapshot command to also create the pgcopydb replication slot, and it then exports the replication slot's snapshot.
When the --follow option is not used, the pgcopydb snapshot command simply connects to the Postgres source database using the normal query protocol and runs the command select pg_export_snapshot() to grab a snapshot that can be shared by all the worker processes.
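As a sketch, both invocations look like the following; the connection string is a placeholder, and the plugin and slot names are illustrative:

# without --follow: a normal connection and select pg_export_snapshot()
$ pgcopydb snapshot --source "postgres:///pagila"

# with --follow: also creates the replication slot and exports its snapshot
$ pgcopydb snapshot --follow --plugin test_decoding --slot-name pgcopydb \
          --source "postgres:///pagila"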
Moreover, the Postgres API for exporting a snapshot has the following limitation:
From the PostgreSQL docs
The snapshot is available for import only until the end of the transaction that exported it.
This means that the pgcopydb snapshot command must be kept running for the whole duration of the initial copy operations. The replication client only uses the replication slot to ensure consistency, so when only the follow worker processes are running, holding on to the snapshot is no longer required.
Resumability of pgcopydb operations
The ability to resume operations when using pgcopydb covers three quite different contexts. Depending on the context, and on when the previous operation was interrupted, running the same pgcopydb command again with the --resume option might just work, or might error out because the situation does not allow consistently resuming the interrupted operation.
Bypassing consistency issues
When using the --resume --not-consistent options, there is no restriction around snapshot re-use when trying to resume interrupted operations.
Consistent copy of the data
When using pgcopydb clone --resume, the snapshot used in the previous attempt is re-used. For Postgres to be able to import that snapshot again, the transaction that exported the snapshot must still be running on the source database system.
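One way to verify that the exporting transaction is still alive on the source is to query pg_stat_activity; a minimal sketch:

-- list backends that exported a snapshot and are still in a transaction
SELECT pid, state, backend_start, query
  FROM pg_stat_activity
 WHERE query LIKE '%pg_export_snapshot%'
   AND state = 'idle in transaction';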
- Single pgcopydb command

  When using pgcopydb clone, the snapshot-holding process is part of that single process tree, and any interruption of this command (signal, C-c, crash) also terminates the snapshot-holding sub-process: the snapshot is then lost.

- Separate pgcopydb snapshot command

  That’s why the pgcopydb snapshot command can be used separately. The main pgcopydb clone command then re-uses the snapshot automatically, and the separate snapshot process can be left holding the snapshot even in case of an interruption of the pgcopydb clone command.

- External snapshot

  It is also possible to use another command or software to export and maintain the snapshot that pgcopydb uses, and then use pgcopydb clone --snapshot ... to share that snapshot with pgcopydb, as sketched below.
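Putting these together, a resumable workflow could look like the following sketch; the connection strings and the snapshot identifier are illustrative:

# terminal 1: export and hold the snapshot for the whole operation
$ pgcopydb snapshot --source "postgres:///pagila"
00000003-00000048-1

# terminal 2: pgcopydb clone re-uses that snapshot automatically, and can
# be resumed after an interruption as long as terminal 1 keeps running
$ pgcopydb clone --resume --source "postgres:///pagila" --target "postgres:///plop"

# alternatively, share an externally exported snapshot explicitly
$ pgcopydb clone --snapshot "00000003-00000048-1" \
          --source "postgres:///pagila" --target "postgres:///plop"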
Consistent copy of the data with CDC
When using Change Data Capture with the --follow option, resuming operations consistently requires the following:

- The initial COPY of the data still needs access to the exported snapshot. Even when the snapshot has been exported with a replication protocol command, Postgres requires the session that exported it to be kept open here.
- The logical replication on the client side is not concerned with the snapshot operations; those are done server-side when creating the replication slot. From there on, all the client has to do is consume from the replication slot.
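For illustration only, a generic client such as pg_recvlogical shows that same property: it merely attaches to an existing slot and consumes changes, with no snapshot handling on the client side (pgcopydb implements its own replication client; the slot name here is illustrative):

$ pg_recvlogical --dbname pagila --slot pgcopydb --start -f -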
Snapshot and catalogs (cache invalidation)
The source catalog table setup registers information about the current pgcopydb command. The information is checked at start-up in order to avoid re-using data in a different context.
The registered information, shown below, also contains the snapshot information. In case of a mismatch, consider using --resume --not-consistent when that’s relevant to your operations.
Here’s how to inspect the current setup information that pgcopydb maintains in its local catalog cache:
$ sqlite3 /tmp/pgcopydb/schema/source.db
sqlite> .mode line
sqlite> select * from setup;
id = 1
source_pg_uri = postgres:///pagila
target_pg_uri = postgres:///plop
snapshot = 00000003-00000048-1
split_tables_larger_than = 0
filters = {"type":"SOURCE_FILTER_TYPE_NONE"}
plugin =
slot_name =
The source and target connection strings only contain the Postgres server's hostname, port, database name, and connecting role name. In particular, authentication credentials are not stored in the catalogs.