Continuous Database Integration: A DataOps Practice (original) (raw)

Part of building a system, of compiling and testing it, is building the database (if it changed). This is true for a database being accessed by one system, by one hundred system, or one thousand. This article overviews the process of continuous database integration (CDI).This article is organized into the following topics:

  1. Definitions
  2. Why continuous database integration is important
  3. Why continuous database integration is different
  4. How continuous database integration works
  5. Why continuous database integration is hard
  6. Continuous database integration in context
  7. Related resources

1. Definitions

Let’s begin with a few definitions:

2. Why Continuous Database Integration is Important

There are several reasons why CDI is critical to your success:

  1. CI is a key quality technique. CI has become the norm for software development teams. Because databases are critical aspects of the solutions that these teams are building, databases must also be included in the build process.
  2. Databases should be trustworthy. Yet they seldom are. In fact, all assets within your organization should be trustworthy, otherwise they’re not really assets. Users of databases should be able to trust that the data is valid, that the functionality encapsulated within the database works as expected. CDI, when implemented and maintained properly, enables you to provide trustworthy databases.
  3. Databases are shared resources. A significant challenge is that databases are shared across many systems, including applications, other data sources, documentation, tests (we hope), and other tooling. This is depicted in Figure 1. Due to this high coupling, it is imperative that we automate the quality validation of these critical assets.
  4. You no longer have any choice. The environment in which you operate evolves quickly, and as a result you must be able to evolve your assets, including your databases, to reflect these changes. CDI enables you to evolve your databases safely.

Figure 1. Databases are shared resources.

3. Why Continuous Database Integration is Different

CDI is different than CI because database testing is more complex than testing code. The issues with this include:

  1. Persistent state increases testing complexity
  2. Testing risk increases the closer you get to production
  3. Database tests are time consuming

3.1 Persistent State Increases Testing Complexity

A test should put the system into a known state, run, check for the expected results, then return the system back to the state before the test ran. If it doesn’t return the system to the original state we say that there is a side effect. Side effects are problematic in the best of circumstances, but with databases it can be very problematic due to persistent state (data) within the database. Consider the following scenarios:

Due to persistent state any side effects of database tests have the potential to affect both the current test run as well as future test runs. Luckily there are techniques to counteract this problem:

Figure 2. Pseudo-code for a database test as a transaction.

Begin Transaction RunTest

Put database in known state

Run the test

Check the results of the test against the expected results

Rollback Transaction RunTest

Log results of test

3.2 Testing Risk Increases the Closer You Get to Production

A common development practice is to have sandboxes, different environments in which your assets may be developed and tested. This concept is overviewed in Figure 3. The idea is that developers (or Agile data engineers ) work in their sandboxes, developing and testing to the best of their ability. When a change is ready for the next level of testing it is promoted up a level, in this case to a team integration sandbox where a more sophisticated test suite is run. When the test run(s) are successful in a given sandbox the changes are promoted to the next level where an even more sophisticated (and usually more expensive) test suite is run. The process repeats until changes are deployed into production. When this process is fully automated we refer to this as continuous deployment.

Figure 3. Sandboxes within your technical environment (click to expand).

The sandbox strategy reveals several important considerations for your CDI strategy:

  1. CDI should occur in all sandboxes except production. This is for two reasons, first running tests takes time and resources and will have a negative impact on performance. With the exception of targeting testing to explore production problems, you typically don’t want to test in production. Second, as mentioned early, you don’t rebuild/restore a production database except when there has been a catastrophic failure. As you will see below, rebuild/restore is an important aspect of the CDI process.
  2. CDI on a developer sandbox must be fast. Developers and Agile data engineers are highly paid professionals. As a result you are motivated to be efficient with their time, and that includes ensuring that the automated test suite running in that sandbox is fast. In the next section we explore this in detail.

3.3 Database Tests Are Time Consuming

Database access takes time. The more database tests you have, the longer your test suite takes to run. Where an automated test running against application code may run on the order of milliseconds a simple database test may run in tenths of a second and a complex test in seconds or more. This starts to add up quickly if your automated test suite includes hundreds or even thousands of database tests.

There are several strategies that you can follow to increase the speed of your database tests. In order of most effective to least effective, they are:

  1. Only run a subset of the tests. A common strategy is for developers to run only a handful of detailed tests within their environments. These tests typically focus on the aspects of what they are currently working on, and once that functionality is deemed finished for the present time frame the corresponding tests for it are promoted to one or more test suites running in other sandboxes, as per Figure 3.
  2. Run against a subset of the data. As the size of a database grows, or more accurately as the number of rows in database structures increases, the slower it generally becomes to access the data. For example, a test that validates the total of a numeric column in a table will run faster when 100 rows are being totalled compared with totally 10,000 rows.
  3. Run against a database mock. A mock is a simulation of an asset, such as a subsystem, a database, or an external service. Mocks are created to reduce the cost of testing, where the cost may be money, time, or both. Figure 4 depicts the process of database testing when using database mocks. Mocks usually prove valuable in most situations with the exception of databases. Database mocking almost always proves to be a bad idea in practice because it is very difficult to fully mock a database, the effort and cost to do so almost always exceeds the benefits, and many of the most important classes of testing (in particular performance and security) aren’t viable when run against mocks.

Figure 4. The process of database testing using database mocks.

4. How Continuous Database Integration Works

Figure 5 overviews the steps of CDI process. The steps of this process mirror those of the normal CI process, albeit with data-oriented terminology in some cases.

Figure 5. The process of continuous database integration.

The steps of the CDI process are:

  1. Ensure build readiness. This is a quick check that validates that all of the tools and environments required to build the asset, in this case the database, are currently available. This includes, but is not limited to, your configuration management (CM) environment, the build tools, the test environment, testing tools, and other validation tools.
  2. Obtain current source. All of the source code, source data, test code, test data, configuration files, and any other component required to build the database should be checked out of your CM environment.
  3. Build or restore the test database. You want to put your test database into a known state at the start of your test run. You can do this either by restoring the database from a back up or by recreating the database and running your database creation scripts to create the database from scratch.
  4. Perform static schema analysis. Static schema analysis is the database version of static code analysis. The idea is that you inspect the database schema, via automated tooling, validating that it conforms to defined standards. Database schema analysis will validate that your schema, both code and structure, conforms to your organizational database design conventions (e.g. naming conventions). This step is optional, although highly recommended, as tools may not yet exist for your technology platform (so start building them, and better yet initiate an open-source initiatives).
  5. Run automated regression tests. Your automated test suite(s) are run against the database.
  6. Perform dynamic analysis. Dynamic analysis tools validate that a system, in this case a database, performs under common stresses. Dynamic analysis tools often look for security threats – SQL injection, broad user privileges, (lack of) audit tracking, missing security patches, poor password control, and more. As with static analysis this step is also optional but highly recommended.
  7. Report results. The results of your efforts should be logged so that you may act on them. The results will likely be moved into your data warehouse so as to take advantage of your reporting infrastructure.

The CDI process runs on every sandbox, although as discussed earlier there is often a different strategy on the development sandbox due to performance considerations.

5. Why Continuous Database Integration is Hard

There are several reasons why CDI proves to be more difficult than standard CI in practice:

  1. Database tests may have side effects. We discussed this in detail earlier in the persistent state increases testing complexity section. We also discussed potential techniques to overcome this problem.
  2. Incompatible mindset. Agility requires a greater focus on working in an evolutionary manner than do serial ways of working (WoW). As a result this motivates agilists to focus on producing quality assets because without high quality you cannot easily evolve those assets. It also motivates them to automate as much of their work as possible. These are new concepts for data professionals, most of whom are still new to the agile mindset and WoW. This problem can be overcome through training, coaching, non-solo work such as pairing or mobbing, and more importantly adopting an agile WoW (mindset shifts tend to occur as the result of changing your WoW).
  3. Lack of skills. Where agilists strive to be generalizing specialists with a wide range of skills, traditionalists tend to be specialized with narrower but often deeper skills. As a result data professionals coming from a traditional background are very likely to have very deep skills and knowledge within the data space, but may not have the testing background (yet) required to be successful working in an agile manner. Similarly they may not have the development skills, yet, required to automate those tests or to produce the deployment scripts required to support database evolution. Like the previous problem, this one can be overcome through training, coaching, non-solo work, and through evolving your WoW.
  4. Poor tooling. For a long time the data community has suffered from a lack of agile tooling. This has been slowly getting better over time, there are some particularly good tools in the DataVault 2 space as well as some good Agile data engineer tools in general, but we still have a long way to go. There are also some very interesting open source tools available to you.

6. Continuous Database Integration in Context

The following table summarizes the trade-offs associated with continuous database integration and provides advice for when (not) to adopt it.

Advantages Automates much of the drudgery around the overall build process Increases consistency and predictability of database evolution work Brings database evolution up to common software engineering practice Enables greater visibility into the work and work products of database evolution via automated logging of results
Disadvantages Requires investment in automation of the database development infrastructure, potentially including new tooling Requires investment in the creation of automated tests May require training and coaching of data engineers in agile development techniques
When to Adopt This Practice Minimally, you will need to have your database assets under configuration management control. Additionally, you will want to have automated testing and validation in place, or have at least started to put it in place, for a database otherwise there’s nothing of impact to invoke.