pg_squeeze: A PostgreSQL extension for automatic bloat cleanup

PostgreSQL extension that removes unused space from a table and optionally sorts tuples according to a particular index (as if the CLUSTER command was executed concurrently with regular reads / writes). In fact, we try to replace the pg_repack extension.

While providing very similar functionality, pg_squeeze takes a different approach as it:

  1. Implements the functionality purely on server side.
  2. Utilizes recent improvements of PostgreSQL database server.

While (1) makes both configuration and use simpler (compared to pg_repack, which uses both server and client side code), it also allows for a rather smooth implementation of unattended processing using background workers.

As for (2), one important difference (besides the use of background workers) is that we use logical decoding instead of triggers to capture concurrent changes.

INSTALL

Install PostgreSQL before proceeding. Make sure the pg_config binary is available; it is typically included in the -dev and -devel packages.

git clone https://github.com/cybertec-postgresql/pg_squeeze.git
cd pg_squeeze
make
make install

Add these to postgresql.conf:

wal_level = logical
max_replication_slots = 1 # ... or add 1 to the current value.
shared_preload_libraries = 'pg_squeeze' # ... or add the library to the existing ones.

Restart the cluster, and invoke:

CREATE EXTENSION pg_squeeze;
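
You can quickly verify that the prerequisites took effect:

SHOW wal_level;                 -- should return 'logical'
SHOW shared_preload_libraries;  -- should include 'pg_squeeze'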

Note: when upgrading a database cluster with pg_squeeze installed (either using pg_dumpall/restore or pg_upgrade), make sure that the new cluster has pg_squeeze in shared_preload_libraries before you upgrade, otherwise the upgrade will fail.

Register table for regular processing

First, make sure that your table has an identity index - this is necessary to process changes other transactions might do while pg_squeeze is doing its work. If the replica identity of the table is DEFAULT or FULL, the primary key constraint provides the identity index. If your table has no primary key, you need to set the identity index explicitly using the ALTER TABLE ... REPLICA IDENTITY USING INDEX ... command.
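
For example, assuming a unique index foo_some_unique_idx exists on table public.foo (illustrative names):

ALTER TABLE public.foo REPLICA IDENTITY USING INDEX foo_some_unique_idx;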

To make the pg_squeeze extension aware of the table, you need to insert a record into squeeze.tables table. Once added, statistics of the table are checked periodically. Whenever the table meets criteria to be "squeezed", a "task" is added to a queue. The tasks are processed sequentially, in the order they were created.

The simplest "registration" looks like:

INSERT INTO squeeze.tables (tabschema, tabname, schedule)
VALUES ('public', 'foo', ('{30}', '{22}', NULL, NULL, '{3, 5}'));

Additional columns can be specified optionally, for example:

INSERT INTO squeeze.tables (
    tabschema,
    tabname,
    schedule,
    free_space_extra,
    vacuum_max_age,
    max_retry
)
VALUES (
    'public',
    'bar',
    ('{30}', '{22}', NULL, NULL, '{3, 5}'),
    30,
    '2 hours',
    2
);

The following is the definition of the schedule type used in the table metadata.

CREATE TYPE schedule AS (  
    minutes       minute[],  
    hours         hour[],  
    days_of_month dom[],  
    months        month[],  
    days_of_week  dow[]  
);  

Here, minutes (0-59) and hours (0-23) specify the time of the check within a day, while days_of_month (1-31), months (1-12) and days_of_week (0-7, where both 0 and 7 stand for Sunday) determine the day of the check.
The check is performed if minute, hour and month all match the current timestamp, while NULL value means any minute, hour and month respectively. As for days_of_month and days_of_week, at least one of these needs to match the current timestamp, or both need to be NULL for the check to take place.
For example, the entries above specify that table public.bar should be checked every Wednesday and Friday at 22:30.
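
As another illustrative example, the following schedule would trigger the check every day at 02:00, since both days_of_month and days_of_week are NULL (the table name is hypothetical):

INSERT INTO squeeze.tables (tabschema, tabname, schedule)
VALUES ('public', 'baz', ('{0}', '{2}', NULL, NULL, NULL));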

squeeze.tables is the only table a user should modify. If you want to change anything else, make sure you perfectly understand what you are doing.

Ad-hoc processing for any table

It's also possible to squeeze tables manually without registering (i.e. inserting the corresponding record into squeeze.tables), and without prior checking of the actual bloat.

Function signature:

squeeze.squeeze_table(
    tabschema name,
    tabname name,
    clustering_index name,
    rel_tablespace name,
    ind_tablespaces name[]
)

Sample execution:

SELECT squeeze.squeeze_table('public', 'pgbench_accounts');
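
If you also want the rows sorted by a particular index, pass the clustering_index argument as well; a sketch, assuming the trailing arguments default to NULL (as the two-argument call above suggests) and using the pgbench primary key index as an illustration:

SELECT squeeze.squeeze_table('public', 'pgbench_accounts',
                             'pgbench_accounts_pkey');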

Note that the function is not transactional: it only starts a background worker, tells it which table it should process and exits. Rollback of the transaction the function was called in does not revert the changes done by the worker.

Enable / disable table processing

To enable processing of bloated tables, run this statement as superuser:

SELECT squeeze.start_worker();

The function starts a background worker (scheduler worker) that periodically checks which of the registered tables should be checked for bloat, and creates a task for each. Another worker (squeeze worker) is launched whenever a task exists for a particular database.

If the scheduler worker is already running for the current database, the function does not report any error but the new worker will exit immediately.

If the workers are running for the current database, you can use the following statement to stop them:

SELECT squeeze.stop_worker();
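
Since both kinds of workers are regular background workers, they should be visible in pg_stat_activity; treat this query as a sketch, as the exact backend_type strings depend on the pg_squeeze version:

SELECT pid, backend_type, datname
FROM pg_stat_activity
WHERE backend_type ILIKE '%squeeze%';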

Only the functions mentioned in this documentation are considered user interface. If you want to call any other one, make sure you perfectly understand what you're doing.

If you want the background workers to start automatically during startup of the whole PostgreSQL cluster, add entries like these to the postgresql.conf file:

squeeze.worker_autostart = 'my_database your_database'
squeeze.worker_role = postgres

Next time you start the cluster, two or more workers (i.e. one scheduler worker and one or more squeeze workers) will be launched for my_database, and the same for your_database. If you take this approach, note that any worker will either refuse to start or will stop without doing any work if either:

  1. The pg_squeeze extension does not exist in the database, or
  2. The squeeze.worker_role parameter specifies a role which does not have superuser privileges.

The functions/configuration variables explained above use the singular form of the word worker, although there are actually two kinds of workers. This is because only one worker existed in previous versions of pg_squeeze, which ensured both scheduling and execution of the tasks. This implementation change did not seem worth forcing all users to adjust their configuration files during upgrade.

Control the impact on other backends

Although the table being squeezed is available for both read and write operations by other transactions most of the time, an exclusive lock is needed to finalize the processing. If pg_squeeze occasionally seems to block access to tables too much, consider setting the squeeze.max_xlock_time GUC parameter. For example:

SET squeeze.max_xlock_time TO 100;

This tells pg_squeeze that the exclusive lock shouldn't be held for more than 0.1 second (100 milliseconds). If more time is needed for the final stage, pg_squeeze releases the exclusive lock, processes changes committed by other transactions in between, and tries the final stage again. An error is reported if the lock duration is exceeded a few more times. If that happens, you should either increase the setting or schedule processing of the problematic table for a different time of day, when the write activity is lower.
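
If you want the limit to persist across sessions and restarts, it can also be set cluster-wide (a sketch; requires superuser):

ALTER SYSTEM SET squeeze.max_xlock_time = 100;
SELECT pg_reload_conf();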

Running multiple workers per database

If you think that a single squeeze worker does not cope with the load, consider setting the squeeze.workers_per_database configuration variable to a value higher than 1. Then the pg_squeeze extension will be able to process multiple tables at a time - one table per squeeze worker. However, be aware that this setting affects all databases in which you actively use the pg_squeeze extension. The total number of all the squeeze workers in the cluster (including the "scheduler workers") cannot exceed the in-core configuration variable max_worker_processes.
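
For example, in postgresql.conf:

squeeze.workers_per_database = 2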

Monitoring

Unregister table

If a particular table should no longer be subject to periodic squeezes, simply delete the corresponding row from the squeeze.tables table.
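
For example (the table name is illustrative):

DELETE FROM squeeze.tables
WHERE tabschema = 'public' AND tabname = 'foo';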

It's also good practice to unregister a table that you're going to drop, although the background worker does unregister non-existent tables periodically.

Upgrade

Make sure PostgreSQL and pg_config are installed, see the INSTALL section.

make                               # Compile the newer version.
pg_ctl -D /path/to/cluster stop    # Stop the cluster.
make install
pg_ctl -D /path/to/cluster start   # Start the cluster.

Connect to each database containing pg_squeeze and run this command:

ALTER EXTENSION pg_squeeze UPDATE;

Upgrade from 1.2.x

As there's no straightforward way to migrate the scheduling information (see the notes on the schedule column of the squeeze.tables table) automatically, and as the schedule column must not contain NULL values, the upgrade deletes the contents of the squeeze.tables table. Please export the table contents to a file before you perform the upgrade and configure the checks of those tables again as soon as the upgrade is done.
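
For example, the contents can be exported from psql before the upgrade (the output path is illustrative):

\copy squeeze.tables TO '/tmp/squeeze_tables_backup.csv' WITH (FORMAT csv, HEADER)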

Concurrency

  1. The extension does not prevent other transactions from altering the table at certain stages of the processing. If a "disruptive command" (i.e. ALTER TABLE, VACUUM FULL, CLUSTER or TRUNCATE) manages to commit before the squeeze could finish, the squeeze_table() function aborts and all changes done to the table are rolled back. The max_retry column of the squeeze.tables table determines how many times the squeeze worker will retry. Besides that, changing the schedule might help you avoid disruptions.
  2. Like pg_repack, pg_squeeze also changes the visibility of rows and thus allows for the MVCC-unsafe behavior described in the first paragraph of the mvcc-caveats section of the PostgreSQL documentation.

Disk Space Requirements

Performing a full-table squeeze requires free disk space about twice as large as the target table and its indexes. For example, if the total size of the tables and indexes to be squeezed is 1GB, an additional 2GB of disk space is required.
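
A rough estimate of the required free space can be obtained beforehand; pg_total_relation_size includes the indexes and TOAST data (the table name is illustrative):

SELECT pg_size_pretty(2 * pg_total_relation_size('public.foo'));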