pgsync
Sync data from one Postgres database to another (like `pg_dump`/`pg_restore`). Designed for:
- speed - tables are transferred in parallel
- security - built-in methods to prevent sensitive data from ever leaving the server
- flexibility - gracefully handles schema differences, like missing columns and extra columns
- convenience - sync partial tables, groups of tables, and related records
🍊 Battle-tested at Instacart
Installation
pgsync is a command line tool. To install, run:
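pgsync is distributed as a Ruby gem (see the Dependencies section), so installation is via RubyGems:

```sh
gem install pgsync
```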
This will give you the `pgsync` command. If installation fails, you may need to install dependencies.
You can also install it with Homebrew:
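The Homebrew install is presumably the standard formula:

```sh
brew install pgsync
```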
Setup
In your project directory, run:
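The init command (referenced later in the Integrations section):

```sh
pgsync --init
```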
This creates `.pgsync.yml` for you to customize. We recommend checking this into your version control (assuming it doesn't contain sensitive information). `pgsync` commands can be run from this directory or any subdirectory.
How to Use
First, make sure your schema is set up in both databases. We recommend using a schema migration tool for this, but pgsync also provides a few convenience methods. Once that’s done, you’re ready to sync data.
Sync tables
Sync specific tables
Works with wildcards as well
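The three forms above would look something like (table names are placeholders):

```sh
# all tables
pgsync
# specific tables
pgsync table1,table2
# wildcards (quoted so the shell doesn't expand them)
pgsync "table*"
```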
Sync specific rows (existing rows are overwritten)

```sh
pgsync products "where store_id = 1"
```

You can also preserve existing rows

```sh
pgsync products "where store_id = 1" --preserve
```

Or truncate them

```sh
pgsync products "where store_id = 1" --truncate
```
Tables
Exclude specific tables

```sh
pgsync --exclude table1,table2
```

Add to `.pgsync.yml` to exclude by default

```yml
exclude:
  - table1
  - table2
```
Sync tables from all schemas or specific schemas (by default, only the search path is synced)

```sh
pgsync --all-schemas
```

or

```sh
pgsync --schemas public,other
```

or

```sh
pgsync public.table1,other.table2
```
Groups
Define groups in `.pgsync.yml`:

```yml
groups:
  group1:
    - table1
    - table2
```
And run:
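Groups are run by name, so for the `group1` example above:

```sh
pgsync group1
```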
Variables
You can also use groups to sync a specific record and associated records in other tables.
To get product `123` with its reviews, last 10 coupons, and store, use:

```yml
groups:
  product:
    products: "where id = {1}"
    reviews: "where product_id = {1}"
    coupons: "where product_id = {1} order by created_at desc limit 10"
    stores: "where id in (select store_id from products where id = {1})"
```
And run:
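The group argument that fills `{1}` is presumably passed with a colon:

```sh
pgsync product:123
```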
Schema
Sync the schema before the data (this wipes out existing data)
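Using the `--schema-first` flag shown below for specific tables:

```sh
pgsync --schema-first
```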
Specify tables

```sh
pgsync table1,table2 --schema-first
```
Sync the schema without data (this wipes out existing data)
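By analogy with `--schema-first`, this is presumably:

```sh
pgsync --schema-only
```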
pgsync does not try to sync Postgres extensions.
Sensitive Data
Prevent sensitive data like email addresses from leaving the remote server.
Define rules in `.pgsync.yml`:

```yml
data_rules:
  email: unique_email
  last_name: random_letter
  birthday: random_date
  users.auth_token:
    value: secret
  visits_count:
    statement: "(RANDOM() * 10)::int"
  encrypted_*: null
```
`last_name` matches all columns named `last_name`, and `users.last_name` matches only the `users` table. Wildcards are supported, and the first matching rule is applied.
Options for replacement are:
- `unique_email`
- `unique_phone`
- `unique_secret`
- `random_letter`
- `random_int`
- `random_date`
- `random_time`
- `random_ip`
- `value`
- `statement`
- `null`
- `untouched`
Rules starting with `unique_` require the table to have a single column primary key. `unique_phone` requires a numeric primary key.
Foreign Keys
Foreign keys can make it difficult to sync data. Three options are:
- Defer constraints (recommended)
- Manually specify the order of tables
- Disable foreign key triggers, which can silently break referential integrity (not recommended)
To defer constraints, use:

```sh
pgsync --defer-constraints
```
To manually specify the order of tables, use `--jobs 1` so tables are synced one-at-a-time.

```sh
pgsync table1,table2,table3 --jobs 1
```
To disable foreign key triggers and potentially break referential integrity, use:

```sh
pgsync --disable-integrity
```
This requires superuser privileges on the `to` database. If syncing to (not from) Amazon RDS, use the `rds_superuser` role. If syncing to (not from) Heroku, there doesn't appear to be a way to disable integrity.
Triggers
Disable user triggers with:

```sh
pgsync --disable-user-triggers
```
Sequences
Skip syncing sequences with:
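The flag for this is likely:

```sh
pgsync --no-sequences
```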
Append-Only Tables
For extremely large, append-only tables, sync in batches.

```sh
pgsync large_table --in-batches
```
Note: This requires the table to have a numeric, increasing primary key. The script will resume where it left off when run again, making it great for backfills.
Connection Security
Always make sure your connection is secure when connecting to a database over a network you don't fully trust. Your best option is to connect over SSH or a VPN. Another option is to use `sslmode=verify-full`. If you don't do this, your database credentials can be compromised.
Safety
To keep you from accidentally overwriting production, the destination is limited to `localhost` or `127.0.0.1` by default.

To use another host, add `to_safe: true` to your `.pgsync.yml`.
Multiple Databases
To use with multiple databases, run:
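Presumably via the init command with a database label that matches the generated filename (the `db2` name below is illustrative):

```sh
pgsync --init db2
```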
This creates `.pgsync-db2.yml` for you to edit. Specify a database in commands with:
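For a config named `.pgsync-db2.yml`, that would be:

```sh
pgsync --db db2
```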
Integrations
Django
If you run `pgsync --init` in a Django project, migrations will be excluded in `.pgsync.yml`.

```yml
exclude:
  - django_migrations
```
Heroku
If you run `pgsync --init` in a Heroku project, the `from` database will be set in `.pgsync.yml`.

```yml
from: $(heroku config:get DATABASE_URL)?sslmode=require
```
Laravel
If you run `pgsync --init` in a Laravel project, migrations will be excluded in `.pgsync.yml`.
Rails
If you run `pgsync --init` in a Rails project, Active Record metadata and schema migrations will be excluded in `.pgsync.yml`.

```yml
exclude:
  - ar_internal_metadata
  - schema_migrations
```
Debugging
To view the SQL that’s run, use:
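The debug flag is presumably:

```sh
pgsync --debug
```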
Other Commands
Help
Version
List tables
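The corresponding commands are likely the conventional ones:

```sh
pgsync --help
pgsync --version
pgsync --list
```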
Scripts
Use groups when possible to take advantage of parallelism.
For Ruby scripts, you may need to do:

```ruby
Bundler.with_unbundled_env do
  system "pgsync ..."
end
```
Docker
Get the Docker image with:

```sh
docker pull ankane/pgsync
alias pgsync="docker run -ti ankane/pgsync"
```

This will give you the `pgsync` command.
Dependencies
If installation fails, your system may be missing Ruby or libpq.
On Mac, run:
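Given that the missing pieces are Ruby and libpq, and macOS ships Ruby, the Mac step is presumably:

```sh
brew install libpq
```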
On Ubuntu, run:
```sh
sudo apt-get install ruby-dev libpq-dev build-essential
```
Upgrading
Run:
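For a RubyGems install, upgrading is just reinstalling the gem:

```sh
gem install pgsync
```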
To use master, run:

```sh
gem install specific_install
gem specific_install https://github.com/ankane/pgsync.git
```
With Homebrew, run:
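Presumably the standard Homebrew upgrade:

```sh
brew upgrade pgsync
```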
With Docker, run:
```sh
docker pull ankane/pgsync
```
Related Projects
Also check out:
- Dexter - The automatic indexer for Postgres
- PgHero - A performance dashboard for Postgres
- pgslice - Postgres partitioning as easy as pie
Thanks
Inspired by heroku-pg-transfer.
History
View the changelog
Contributing
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features
To get started with development:
```sh
git clone https://github.com/ankane/pgsync.git
cd pgsync
bundle install
createdb pgsync_test1
createdb pgsync_test2
createdb pgsync_test3
bundle exec rake test
```