Introduction to DataOps: Bringing Databases Into DevOps

DataOps is the streamlined combination of data development and data operations. Data development, better known as data engineering, comprises the activities involved in engineering and evolving the data aspects of your technical solutions. Data operations comprises the activities to operate, support, and govern the data aspects of your technical solutions.

This article works through the following topics:

  1. Why DataOps?
  2. The DataOps Lifecycle
  3. Critical DataOps Practices
  4. The DataOps Pipeline
  5. DataOps and DevOps
  6. Isn’t DataOps Really Data DevOps?

1. Why DataOps?

There are several reasons why DataOps is critical to your organization:

  1. Data is mission-critical. Quality data is required in different forms, at different times, and in different ways across your organization to enable data-informed decision making.
  2. Data is an innate component of your systems. Just as the rest of your information technology (IT) group has adopted continuous ways of working (WoW), so must data professionals. Just as data is part of your systems, DataOps is part of DevOps.
  3. The increasing rate of change. The accelerating rate of change, and the corresponding increase in customer demand, requires a greater level of responsiveness. An effective DataOps strategy shortens the feedback cycle from the time that you recognize a need for new or improved data to the time that you deliver it.
  4. Increasing complexity. The increasing complexity of our technical environment requires greater automation and more responsive WoW.
  5. Increasing need for regulatory compliance. The increased importance of data within society motivates greater regulation of data and its usage. Countries all around the world are enacting privacy, artificial intelligence (AI), and other forms of data regulation. The depth and breadth of these evolving regulations require greater and more integrated automation throughout the entire data lifecycle.

2. The DataOps Lifecycle

Figure 1 depicts the DataOps process loop. It is shown as an infinite Möbius loop to indicate that DataOps is a continuous initiative that will last for the life of your organizational data. Data development is shown on the left-hand portion of Figure 1 and comprises activities to envision, implement, validate, and integrate your data assets. Data operations is shown on the right-hand portion of Figure 1 and comprises activities to deploy, operate and support, and provide feedback to development.

Figure 1. The DataOps continuous process loop.

3. Critical DataOps Practices

Figure 1 above indicates what I believe to be key practices supporting DataOps. These practices are:

  1. Agile data architecture. Data architecture is the foundation of a data strategy that supports your organization’s goals and priorities. Agile data architecture does so in a collaborative and evolutionary (iterative and incremental) manner.
  2. Agile data modeling. Data modeling is the act of exploring data-oriented structures. Agile data modeling is data modeling done in an evolutionary and collaborative manner.
  3. Agile database engineering. The work required to technically implement data assets, including data stores, data tooling, data transmission, and other components.
  4. Automated database regression testing. The validation of a data asset, in particular a database, in an automated manner. This is achieved through the creation of a test suite, comprised of automated tests, that is invoked via a testing tool.
  5. Continuous database deployment (CDD). Continuous deployment (CD) is the automatic deployment of a solution once it has passed any requisite quality criteria. CDD is CD of a data store.
  6. Continuous database integration (CDI). Continuous integration (CI) is the automatic invocation of the build process for an asset. CDI is CI of a data store.
  7. Data lineage. Data lineage is the act of fully tracing a data element through the processing steps performed on it from source to destination.
  8. Data repair. Fixing data quality problems at the actual source, such as an existing production database.
  9. Data security. Data security is the practice of protecting digital information from unauthorized access, corruption or theft throughout its entire lifecycle.
  10. Database refactoring. A database refactoring is a simple change to a database schema that improves its design while retaining both its behavioral and informational semantics.
  11. Lean data governance. The goal of data governance is to ensure the quality, availability, integrity, security, and usability of data within an organization. A lean data governance approach promotes a healthy, collaborative relationship between data professionals and the teams that they’re supporting.
  12. Manual data testing. The validation of a data asset in a non-automated manner.
  13. Operational data quality assurance. The ongoing monitoring and verification that operational data and its supporting infrastructure meet or exceed the quality of service (QoS) requirements set for them.
  14. Test data generation. Tooling and procedures to generate artificial data for testing purposes.
  15. Thin/vertical slicing. The organization of deliverables into “thin vertical slices” of consumable value that may be deployed into production quickly. These slices are completely implemented – the analysis, design, programming, and testing are complete – and offer real business value.

Yes, there are many more data-oriented practices that you will adopt to make DataOps work in practice, but the ones listed above are the critical ones in my opinion.
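To make one of these practices concrete, here is a minimal sketch of test data generation in Python. The customer fields, the name list, and the function name generate_customers are all hypothetical, chosen only for illustration. The one design point it shows is seeding the generator so that the artificial data is repeatable from one test run to the next.

```python
import random

def generate_customers(count, seed=0):
    """Return `count` artificial customer rows; the same seed yields the same rows."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    first_names = ["Ada", "Grace", "Alan", "Edsger"]  # hypothetical sample values
    rows = []
    for customer_id in range(1, count + 1):
        rows.append({
            "id": customer_id,                            # sequential, so ids are unique
            "name": rng.choice(first_names),
            "balance_cents": rng.randint(0, 1_000_000),   # inclusive bounds
        })
    return rows

sample = generate_customers(3)
```

Repeatable data matters for automated regression testing: a test that fails should fail the same way when it is rerun.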

4. The DataOps Pipeline

Your DataOps pipeline is the combination of technologies that you use to automate the activities of the DataOps lifecycle. Figure 2 indicates the types of activities that are commonly automated. Note that this automation may not be 100%; for example, you may not have automated (yet) all of your database tests. Also note that Figure 2 calls out categories of work that should be automated, such as CDI and CDD, but it does not call out specific tooling to do so. A quick Internet search will soon reveal many tools for each category.
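The gating idea behind such a pipeline can be sketched in a few lines of Python. This is an illustration only, not a real tool: the stage names and the run_pipeline function are hypothetical, and each stage is reduced to a callable that reports success or failure. It shows that deployment is reached only when every earlier stage has passed.

```python
def run_pipeline(stages):
    """Run (name, step) pairs in order and stop at the first step that fails.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, step in stages:
        if not step():
            break  # a failed quality gate blocks every later stage
        completed.append(name)
    return completed

# Hypothetical stages: every gate passes, so deployment happens.
passing = run_pipeline([
    ("integrate", lambda: True),
    ("regression_tests", lambda: True),
    ("deploy", lambda: True),
])

# The regression tests fail, so the deploy stage never runs.
failing = run_pipeline([
    ("integrate", lambda: True),
    ("regression_tests", lambda: False),
    ("deploy", lambda: True),
])
```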

Figure 2. Automation throughout the DataOps pipeline.

Notice how Figure 1 and Figure 2 differ. Figure 1 focuses on high-level activities, in particular those that reflect agile data ways of working (WoW). Figure 2 reflects activities that are typically automated, some of which reflect traditional WoW (although hopefully performed in an agile manner) and some of which are clearly “new” agile WoW. Techniques such as database refactoring don’t appear in Figure 2 because very little of this technique is part of your DataOps pipeline. The portions of the work that would be included in your pipeline are the scripts to deploy a refactoring when it is first implemented and the scripts to remove the old schema (if any) and scaffolding once the transition period has ended. Although both are very important, they are a very small part of the overall refactoring work.
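The two kinds of scripts can be illustrated with a small sketch, assuming a "rename table" refactoring run against an in-memory SQLite database from Python. The table names cust and customer, the script names, and the helper function are hypothetical. The view acts as the scaffolding: it keeps the old name readable during the transition period. SQLite views are read-only, so a real transition that must also support writes through the old name would need more than this.

```python
import sqlite3

# Script 1: deploy the refactoring. Rename the table and leave a view
# behind under the old name as transition scaffolding.
DEPLOY_SCRIPT = """
ALTER TABLE cust RENAME TO customer;
CREATE VIEW cust AS SELECT id, name FROM customer;
"""

# Script 2: run once the transition period has ended. Remove the scaffolding.
RETIRE_SCRIPT = """
DROP VIEW cust;
"""

def table_and_view_names(conn):
    """List the tables and views currently defined, in alphabetical order."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type IN ('table', 'view') ORDER BY name"
    )
    return [name for (name,) in rows]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cust (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
INSERT INTO cust (id, name) VALUES (1, 'Ada');
""")

conn.executescript(DEPLOY_SCRIPT)
during_transition = table_and_view_names(conn)
old_name_rows = conn.execute("SELECT name FROM cust").fetchall()      # via the view
new_name_rows = conn.execute("SELECT name FROM customer").fetchall()  # via the table

conn.executescript(RETIRE_SCRIPT)
after_transition = table_and_view_names(conn)
```

Everything between the two scripts, such as finding and updating the code that still uses the old name, is the bulk of the refactoring effort and sits outside the pipeline.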

5. DataOps and DevOps

A fundamental question is how DataOps and DevOps relate to one another. Here is how:

  1. DataOps is a critical subset of DevOps. Just as data is a critical aspect of your systems, data ways of working (WoW) are critical aspects of your overall WoW.
  2. Many of the DataOps development practices are more complex. It isn’t simply a matter of taking the names of common DevOps practices, such as automated regression testing and continuous integration, and sticking the word database into them. Because databases persist data, the engineering practices for them are more complicated than the corresponding non-database practices. For example, automated database regression testing must ensure that tests put the database back into the pre-test state (e.g. reset the data values), otherwise there could be side effects that derail other tests.
  3. Most of the Ops side of DevOps is data operations. Just saying.

6. Isn’t DataOps Really Data DevOps?

Yes, DataOps should be more accurately called Data DevOps. DataOps is a much sleeker marketing term, and marketing tends to win out over accuracy in practice. Although I have used the terms “Data DevOps” and “Database DevOps” since around 2018, I’ve decided to abandon them in favour of DataOps.