Using SQLAlchemy to create and manage PostgreSQL Materialized Views by Jeff Widman (original) (raw)
This is part of a series on combining PostgreSQL Materialized Views, Recursive CTEs, and SQLAlchemy:
- Using SQLAlchemy to create and manage PostgreSQL materialized views (this post)
- Traversing a tree stored as an adjacency list using a Recursive CTE built in SQLAlchemy
- Working example code on GitHub
Overview of materialized views in SQL:
If you’re not familiar with SQL views, they’re basically a temporary table whose contents are the output of another SQL query. Normally the table generated by the view only lasts for the duration of a single containing query, but a materialized view stores (materializes) the results so they can be used in later queries.
Materialized views offer a number of benefits:
- Fast and simple retrieval. Retrieving data from a materialized view is just like writing queries against a normal table:
| | SELECT calculated_result FROM my_mat_view WHERE id = 4 |
| ------------------------------------------------------------- | - Unlike normal views, materialized views (at least in PostgreSQL) support traditional database indexes for even faster retrieval:
| | CREATE UNIQUE INDEX ON my_mat_view (id) |
| -------------------------------------------- | - Simple data maintenance. Recalculating all the data is a simple:
| | REFRESH MATERIALIZED VIEW my_mat_view |
| ------------------------------------------ | - Zero-downtime refresh. PostgreSQL 9.4 onward supports refreshing a materialized view concurrently in the background without impacting other queries:
| | REFRESH MATERIALIZED VIEW CONCURRENTLY my_mat_view |
| ------------------------------------------------------- | - Doesn’t add complexity to your infrastructure because it’s built into your database and requires no additional services.
Drawbacks of materialized views:
- Any data you want to store in a materialized view has be the result of another SQL query–you can’t generate the data in your app and then store it in the materialized view.
- Like any cache, data can go stale. You need to remember to manually issue a REFRESH command to your database. Either your app can update the materialized view after it inserts/updates the source data, or if that causes too many updates, just run a cron job that refreshes the materialized view on a regular schedule. You can also use a database trigger, although you’ll need to create a custom plpgsql function.
- You can’t update a single row–only the entire view.
Overall, materialized views in PostgreSQL are a very convenient way to add caching to many applications.
Ways to teach SQLAlchemy about a materialized view:
In order to write ORM queries against my materialized views, I needed to tell SQLAlchemy about the materialized view. Querying a materialized view is very similar to a normal SQL table, so the easiest way to do this is manually create the materialized view in your database. Then, within your app, create a normal SQLAlchemy Table mapping that corresponds to the columns in your materialized view.
While this makes it possible to query the materialized view, it also means that you’ll hit problems when creating/dropping all your tables because SQLAlchemy sees the materialized view as a normal SQL table. While uncommon in production, it’s very common thing to do in a test suite or your local dev instance. Additionally, using a traditional SQLAlchemy table mapping makes it difficult to refresh the materialized view from within your app. I already manage all my database tables and metadata with SQLAlchemy, and wanted the convenience of fully managing my materialized views as well.
Adding custom DDL commands to SQLAlchemy for materialized views:
SQLAlchemy doesn’t have a built-in command for creating views, let alone materialized views. But it is possible to subclass DDLElement and create your own custom DDL commands. After several iterations (and help from Mike Beyer), here’s what I ended up with:
12345678910111213141516171819202122232425262728293031323334353637 | # materialized_view_factory.pyfrom sqlalchemy.ext import compilerfrom sqlalchemy.schema import DDLElementfrom app import db # I'm using Flask-SQLAlchemyclass CreateMaterializedView(DDLElement): def __init__(self, name, selectable): self.name = name self.selectable = selectable@compiler.compiles(CreateMaterializedView)def compile(element, compiler, **kw): # Could use "CREATE OR REPLACE MATERIALIZED VIEW..." # but I'd rather have noisy errors return 'CREATE MATERIALIZED VIEW %s AS %s' % ( element.name, compiler.sql_compiler.process(element.selectable, literal_binds=True))def create_mat_view(name, selectable, metadata=db.metadata): _mt = db.MetaData() # temp metadata just for initial Table object creation t = db.Table(name, _mt) # the actual mat view class is bound to db.metadata for c in selectable.c: t.append_column(db.Column(c.name, c.type, primary_key=c.primary_key)) db.event.listen( metadata, 'after_create', CreateMaterializedView(name, selectable) ) db.event.listen( metadata, 'before_drop', db.DDL('DROP MATERIALIZED VIEW IF EXISTS ' + name) ) return t |
---|
Security Note: Using string interpolation to assemble SQL commands always makes me nervous about accidentally creating a SQL injection vulnerability. None of these commands include user-input, so we’re fine. However, be careful not to accidentally re-use this code somewhere else where you are accepting user input.
Creating a materialized view using SQLAlchemy:
This lets us create a materialized view by passing a SELECT statement generated with SQLAlchemy core to our custom create_mat_view() function.
Here’s an example of how you’d call it–this calculates the total number of reviews and average review rating for each GearItem:
| | class GearItemMV(db.Model): __table__ = create_mat_view("gear_item_mv", db.select( [GearItem.id.label('id'), db.func.count(GearReview.id).label('review_count'), db.func.avg(GearReview.rating).label('review_rating'),] ).select_from(db.join(GearItem, GearReview, isouter=True) ).group_by(GearItem.id)) | | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
The above code will emit the following DDL statement to the database:
| | CREATE MATERIALIZED VEW AS SELECT gear_item.id AS id, count(gear_review.id) AS review_count, avg(gear_review.rating) AS review_ratingFROM gear_item LEFT OUTER JOIN gear_review ON gear_item.id = gear_review.gear_item_id GROUP BY gear_item.id | | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
Sidenote: I’m using the factory pattern to create my Flask app, so I can only pass a db.select() query to create_mat_view(). Trying to use the normal db.session.query() from the SQLAlchemy ORM won’t work because it creates a circular dependency. The Flask-SQLAlchemy sessionmaker() requires an app context, but the db object (and table mappings) are processed by Python before the app is instantiated. So when SQLAlchemy tries to setup the table mappings, calling db.session() calls the Flask-SQLAlchemy sessionmaker() which throws an exception because there’s no app context. Using db.select() from SQLAlchemy Core avoids this issue because it doesn’t create a session. Within the selectable itself, it’s fine to use ORM constructs such as object references like GearReview. This works because SQLAlchemy doesn’t have to create a session in order to resolve what a GearReview is.
Querying a materialized view using SQLAlchemy:
Since our recipe handles both creating the materialized view and mapping it to an ORM object, querying is very simple:
| | data = db.session.query(GearItemMV).get(row_id) | | ---------------------------------------------------- |
Alternatively, if you’re using Flask-SQLAlchemy: GearItemMV.query.get(row_id)
Additionally, this example has a 1:1 mapping between the GearItem objects and the pre-calculated data about them stored in GearItemMV, which makes for convenient queries like this:
| | gear_item_review_count = db.session.query(GearItemMV).get(gear_item_id).review_count | | ---------------------------------------------------------------------------------------------- |
While convenient, this results in brittle code because we have direct queries against the materialized view littered throughout our codebase. A better way is to map the materialized view data as attributes on the original objects, decoupling the attribute call from the underlying implementation of how the attribute is calculated/cached. This avoids having to change a bunch of controller/view code if we switch caching mechanisms.
Mapping the cached result as an attribute on the original object:
SQLAlchemy supports several ways to map the materialized view output onto the original object.
Initially I tried using a column_property():
| | GearItem.review_count = db.column_property(db.select([GearItemMV.review_count] ).where(GearItem.id==GearItemMV.id)) | | ----------------------------------------------------------------------------------------------------------------------------- |
Because column_properties are evaluated at import time, this has to be defined after the GearItemMV definition, it cannot be declared inline with the GearItem model definition.
There were three main problems with this approach:
- Under the covers, column_property() uses correlated subqueries. Correlated subqueries are slow because the query planner can’t optimize them–it effectively has to run a complete subquery for every row returned by the parent query. So unlike traditional set-based SQL where retrieving additional rows is O(log n), the work required to retrieve column_properties is O(n*m) where n is number of objects and m is the number of column_properties on the objects.
For example, lets say my GearItem class has two column properties review_count and review_rating, and I want to retrieve thirty objects from the database. Effectively the database is processing the outer query to return thirty items, plus an additional query for each correlated subquery for each item. You won’t notice this in the SQL query logs because only one query is sent to the database, but if you actually run EXPLAIN on the query, you’ll see that effectively the one query ballooned into doing the work of 61 queries! - SQLAlchemy can’t take advantage of the pseudo-caching ability of the session since this query doesn’t use session.query.get(). Normally, once an object has been loaded within a session, querying it again using db.session.query(object).get(pk_id) will retrieve the existing object from the SQLAlchemy session without re-querying the database. This only happens using the special get() function which first checks the session’s identity map for the primary key.
- Column properties are calculated as soon as an object is loaded. So whenever I load a GearItem, even if I never access the review data, I’m still triggering the expensive correlated subquery.
After a little more research, I switched to mapping the attributes onto the parent object using a SQLAlchemy @hybrid_property. Like a normal python property, it’s only evaluated on access, plus it provides additional benefits.
Here was my initial attempt:
| | # within GearItem model definition@hybrid_propertydef review_count(self): # if GearItem object doesn't have ID, it was just created and hasn't been # flushed to the database yet, so it won't have corresponding entry in # materialized view if self.id: gi_mv = GearItemMV.query.get(self.id) # GearItemMV definition uses a LEFT JOIN so if self.id exists, yet # GearItemMV(self.id) doesn't exist, than we know the materialized view # needs refreshing if gi_mv is not None: return gi_mv.review_count | | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
It works fine for querying individual items. It also caches the GearItemMV object in the session, so if I retrieve review_count and then later retrieve review_rating it will only emit one database call.
However, it doesn’t work for querying at the class level (eg, db.session.query(GearItem).filter(GearItem.review_count < 4)). The SQLAlchemy docs suggest handling this using hybrid_property.expression(), but my attempts felt needlessly overcomplicated and inelegant.
After some more thought, I realized that due to the 1:1 relationship between GearItem rows and GearItemMV rows, I could let the ORM handle the mapping as if it were a traditional foreign-key based relationship. This matched how I was mentally thinking about the original_table:materialized_view relationship, so it should make it easier to write queries.
The only problem is that materialized views can’t have actual foreign keys in the database.
It turns out that SQLAlchemy lets us define table-to-table relationships that effectively function as foreign-key relationships but only within in SQLAlchemy. They are never actually emitted to the database as foreign key constraints:
| | # within GearItem model definitionmat_view = db.relationship('GearItemMV', backref='original', uselist=False, # makes it a one-to-one relationship primaryjoin='GearItem.id==GearItemMV.id', foreign_keys='GearItemMV.id') | | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
This results in a much cleaner @hybrid_property:
| | # within GearItem model definition@hybrid_propertydef count_gear_items(self): if self.mat_view is not None: # if None, mat_view needs refreshing return self.mat_view.count_gear_items | | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
Refreshing a materialized view using SQLAlchemy:
Refreshing a materialized view is straightforward:
| | REFRESH MATERIALIZED VIEW mat_view_name | | -------------------------------------------- |
Unfortunately, this locks the materialized view until the refresh completes. PostgreSQL 9.4 added CONCURRENTLY to let reads continue at the expense of a longer refresh time:
| | REFRESH MATERIALIZED VIEW CONCURRENTLY mat_view_name | | --------------------------------------------------------- |
My refresh function:
| | def refresh_mat_view(name, concurrently): # since session.execute() bypasses autoflush, must manually flush in order # to include newly-created/modified objects in the refresh db.session.flush() _con = 'CONCURRENTLY ' if concurrently else '' db.session.execute('REFRESH MATERIALIZED VIEW ' + _con + name) | | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
For convenience, I turned this into a class method on my materialized views:
| | class MaterializedView(db.Model): __abstract__ = True @classmethod def refresh(cls, concurrently=True): '''Refreshes the current materialized view''' refresh_mat_view(cls.__table__.fullname, concurrently) | | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
All my materialized views inherit from this MaterializedView instead of the normal declarative base class. So now I can call GearItemMV.refresh() to refresh that specific view.
Similarly, I also added a function for refreshing all views:
| | def refresh_all_mat_views(concurrently=True): '''Refreshes all materialized views. Currently, views are refreshed in non-deterministic order, so view definitions can't depend on each other.''' # TODO use `include='materialized'` once https://bitbucket.org/zzzeek/sqlalchemy/issues/3588/ lands mat_views = db.inspect(db.engine).get_view_names() for v in mat_views: refresh_mat_view(v, concurrently) | | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
You’ll need to tweak this if your materialized views depend on one another, or if your database includes normal (non-materialized) views. Alternatively, you can refresh all your materialized views using a custom plpgsql function.
If you’re following this blog post step-by-step and try these refresh functions, PostgreSQL will complain that refreshing concurrently requires a unique index.
Indexing a materialized view using SQLAlchemy:
Initially I tried creating an index using the normal SQLAlchemy command:
| | Index(GearItemMV.id).create() | | -------------------------------- |
However, that threw an exception because SQLAlchemy tries to create the index as part of the db.metadata creation process, but our materialized view recipe doesn’t actually create the materialized view until after the db.metadata creation finishes. So we need to tell SQLAlchemy to delay creating indexes on materialized views until after the materialized view is created:
| | # this is located within the create_mat_view() function# see full example code at end of blog post for details@db.event.listens_for(metadata, 'after_create')def create_indexes(target, connection, **kw): for idx in t.indexes: idx.create(connection) | | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
Note that rather than creating a custom event to trigger when the materialized view is created, I instead hooked into the after_create event for db.metadata. This is the same event that my create materialized view command hooks onto, so theoretically the create index could happen before the materialized view is created. However, I asked on the SQLAlchemy mailing list, and Mike Bayer said the events are executed in deterministic order based on when they were registered as event listeners. Because the create index call is registered after the create materialized view call, everything should work fine.
A few other notes regarding indexes for materialized views:
- IF NOT EXISTS can’t be included in the CREATE INDEX command for materialized views.
- Creating an index uses a SHARE lock, so it doesn’t lockout reads, only writes. Since materialized views are by definition read-only, it’s pointless to create an index using CONCURRENTLY (which uses a SHARE UPDATE EXCLUSIVE lock that allows writes).
- Using ALTER TABLE ADD INDEX syntax isn’t supported for materialized views since they aren’t tables.
Putting it all together:
I created a simple GitHub repo with working examples of the complete code. There’s both a standalone SQLAlchemy example, and a Flask-based example.
Adding a Recursive CTE:
If you enjoyed this, you might like my next blog post where I walk through how to incorporate a Recursive CTE into your materialized view.