About data lineage

Data lineage is a visual map that tracks the entire lifecycle of your data. It shows you where your data comes from (the origin), where it travels (the destinations), and all the changes or transformations that happen along the way.

You can view this complete map of your data's journey directly in the Google Cloud console for assets created in products such as Knowledge Catalog (formerly Dataplex Universal Catalog), BigQuery (including external tables created for Iceberg REST Catalog), and Vertex AI. Because workflows often span multiple regions, Knowledge Catalog supports multi-region lineage, which provides a unified view of your data's journey across the global Google Cloud ecosystem. Advanced users can also retrieve this information by using the Data Lineage API.

Why you need data lineage

Modern companies move and change large amounts of data constantly, for example by transforming raw customer purchases into reports, dashboards, and machine learning models. This complexity creates critical challenges for your team.

Data lineage solves these problems by providing a clear, visual, and documented journey of your data. This lets you quickly understand data sources, trace errors, assess the impact of changes, and maintain compliance.

How data lineage works

The data lineage workflow includes the following steps:

  1. Data sources and ingestion: lineage information from your data sources initiates the entire process. For more information, see Lineage sources.
    • Google Cloud services: when the Data Lineage API is enabled, supported services such as BigQuery and Dataflow automatically report lineage events whenever data is moved or transformed.
    • Custom sources: for any systems not automatically supported by Google Cloud integrations, you can use the Data Lineage API to manually record lineage information. We recommend importing events formatted according to the OpenLineage standard.
  2. Lineage platform: this central platform ingests, models, and stores all lineage data. For more information, see Lineage information model and granularity.
    • Data Lineage API: this API acts as the single entry point for all incoming lineage information. It uses a hierarchical data model consisting of three core concepts: process, run, and event.
    • Processing and storage: the platform processes incoming data and stores it in reliable, query-optimized databases.
  3. User experience: you can interact with the stored lineage information in two primary ways:
    • Visual exploration: in the Google Cloud console, a frontend service fetches and renders the lineage data as an interactive graph or list. This view is supported for Knowledge Catalog, BigQuery, Lakehouse (for Iceberg REST Catalog tables), the physical layer (Cloud Storage), and Vertex AI (for models and datasets, through pipelines; and for feature store views and feature groups). It is ideal for visually exploring your data's journey. For more information, see Lineage views in the Google Cloud console.
    • Programmatic access: using an API client, you can directly communicate with the Data Lineage API to automate lineage management. This lets you write lineage information from custom sources. It also lets you read and query the stored lineage data for use in other applications or for building custom reports.
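The process, run, and event hierarchy described in step 2 can be pictured with a small local model. The following is a minimal sketch, not the Data Lineage API client: the class names, field names, and asset names are illustrative assumptions chosen to mirror the three concepts, with each event carrying source-to-target links.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Link:
    source: str  # name of the input asset (hypothetical value below)
    target: str  # name of the output asset (hypothetical value below)


@dataclass
class Event:
    start_time: datetime
    links: list[Link] = field(default_factory=list)


@dataclass
class Run:
    display_name: str
    events: list[Event] = field(default_factory=list)


@dataclass
class Process:
    display_name: str
    runs: list[Run] = field(default_factory=list)

    def edges(self) -> set[tuple[str, str]]:
        """Collect every (source, target) pair recorded under this process."""
        return {
            (link.source, link.target)
            for run in self.runs
            for event in run.events
            for link in event.links
        }


# Hypothetical nightly job: one process, one run, one event with one link.
event = Event(
    start_time=datetime(2024, 1, 1, tzinfo=timezone.utc),
    links=[Link("bigquery:demo.raw.purchases", "bigquery:demo.reports.daily_sales")],
)
process = Process("nightly-sales-rollup", runs=[Run("2024-01-01", events=[event])])
print(process.edges())
```

Nesting events under runs and runs under processes keeps each execution's links separate, while a graph view only needs the flattened set of edges.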

Which API should I use for data lineage?

To perform immediate, single-level lookups, use the SearchLinks API. To build a complete lineage graph or perform deep impact analysis (up to 100 levels), use the SearchLineageStreaming API.

Depending on your use case, select the most appropriate method:

Feature   | SearchLinks                                 | SearchLineageStreaming
Depth     | 1 level (immediate neighbors)               | Up to 100 levels
Execution | Synchronous                                 | Real-time streaming
Use case  | Simple lookups of direct sources or targets | Building a complete lineage graph or performing impact analysis
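The difference between a single-level lookup and a multi-level traversal can be illustrated on an in-memory graph. This sketch does not call SearchLinks or SearchLineageStreaming; the `LINKS` mapping, the asset names, and both function names are hypothetical, and only the 100-level limit comes from the comparison above.

```python
from collections import deque

# Hypothetical link store: source asset -> direct downstream targets.
LINKS = {
    "raw.purchases": ["staging.purchases"],
    "staging.purchases": ["reports.daily_sales", "ml.features"],
    "ml.features": ["ml.churn_model"],
}

MAX_DEPTH = 100  # depth limit stated for SearchLineageStreaming


def direct_targets(asset):
    """Single-level lookup: immediate downstream neighbors only."""
    return list(LINKS.get(asset, []))


def downstream(asset, max_depth=MAX_DEPTH):
    """Multi-level lookup: breadth-first walk yielding (asset, depth) pairs."""
    seen = {asset}
    queue = deque([(asset, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for target in direct_targets(current):
            if target not in seen:
                seen.add(target)
                yield target, depth + 1
                queue.append((target, depth + 1))


print(direct_targets("raw.purchases"))    # ['staging.purchases']
print(list(downstream("raw.purchases")))  # four assets, depths 1 to 3
```

Yielding results as they are discovered loosely mirrors the streaming execution model: a caller can start rendering or analyzing the graph before the traversal finishes.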

Identify direction

What data sources are supported for data lineage?

You can populate lineage information in Knowledge Catalog in the following ways:

BigQuery

When you enable data lineage in your BigQuery project, Knowledge Catalog automatically records lineage information for the following:

BigQuery copy, query, and load jobs are represented as processes.

To view the process details, on the lineage graph, click the Process details icon.

Each process contains the BigQuery job_id in the attributes list for the most recent BigQuery job.

Other services

Data lineage supports integration with the following Google Cloud services:

Data lineage for custom data sources

You can use the Data Lineage API to manually record lineage information for any data source that integrated systems don't support.

Knowledge Catalog can create lineage graphs for manually recorded lineage if you use a fullyQualifiedName that matches the fully qualified names of existing Knowledge Catalog entries. If you want to record lineage for a custom data source, you must first create a custom entry.
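Because the graph depends on names matching existing entries, it can help to check names before recording lineage. The following is a minimal sketch of such a pre-flight check. The function name, the set of known entries, and every name string are hypothetical, and in practice the known names would come from your catalog rather than a hard-coded set.

```python
def unmatched_names(links, known_entries):
    """Return fully qualified names used in links that match no known entry."""
    used = {name for link in links for name in link}
    return sorted(used - set(known_entries))


# Hypothetical fully qualified names of entries that already exist.
KNOWN_ENTRIES = {
    "custom:erp.orders",
    "bigquery:demo-project.sales.orders",
}

# Hypothetical (source, target) pairs to be recorded as lineage links.
links = [
    ("custom:erp.orders", "bigquery:demo-project.sales.orders"),
    ("custom:erp.invoices", "bigquery:demo-project.sales.orders"),
]

print(unmatched_names(links, KNOWN_ENTRIES))  # ['custom:erp.invoices']
```

In this example, a custom entry for `custom:erp.invoices` would need to exist before its link could appear in the graph.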

Each process for a custom data source can contain a sql key in the attributes list. The value of this key is used to render a code highlight in the details panel of the Data lineage graph. The SQL statement is displayed as it was provided. You are responsible for filtering out sensitive information. The key name sql is case-sensitive.
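Because the SQL statement is displayed exactly as provided, one option is to redact literals before attaching it. This sketch assumes the attributes can be represented as a key-value mapping. The helper names and the redaction rule (replace every single-quoted string literal) are illustrative choices, not a complete sanitizer, and numeric literals or identifiers may still need separate handling.

```python
import re

# Matches single-quoted SQL string literals, including doubled '' escapes.
_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")


def redact_literals(sql: str) -> str:
    """Replace every single-quoted literal with a placeholder."""
    return _STRING_LITERAL.sub("'<redacted>'", sql)


def process_attributes(sql: str) -> dict:
    """Build an attributes mapping; the key must be exactly lowercase "sql"."""
    return {"sql": redact_literals(sql)}


attrs = process_attributes(
    "SELECT id FROM crm.customers WHERE email = 'ana@example.com'"
)
print(attrs["sql"])
# SELECT id FROM crm.customers WHERE email = '<redacted>'
```

A key such as `SQL` or `Sql` would not be recognized for the code highlight, because the key name is case-sensitive.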

OpenLineage

If you already use OpenLineage to collect lineage information from other data sources, you can import OpenLineage events into Knowledge Catalog and view these events in the Google Cloud console. For more information, see Integrate with OpenLineage.
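For orientation, an OpenLineage run event is a JSON document that names a job, a run, and the input and output datasets. The sketch below builds a minimal event using only the standard library. The function name, producer URL, namespaces, and dataset names are hypothetical, the schemaURL version is only an example, and the naming that Knowledge Catalog expects on import is covered in the integration guide rather than here.

```python
import json
import uuid
from datetime import datetime, timezone


def openlineage_run_event(job_namespace, job_name, inputs, outputs):
    """Build a minimal OpenLineage RunEvent as a plain dict."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        # Hypothetical producer; use a URI that identifies your own integration.
        "producer": "https://example.com/hypothetical-producer",
        # Example spec version; use the version your producer targets.
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/$defs/RunEvent",
    }


event = openlineage_run_event(
    "example-namespace",
    "nightly-sales-rollup",
    inputs=[("example-source", "raw.purchases")],
    outputs=[("example-source", "reports.daily_sales")],
)
print(json.dumps(event, indent=2))
```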

Automated data lineage tracking

When you enable the Data Lineage API, Google Cloud systems that support data lineage start reporting their data movement. Each integrated system can submit lineage information for a different range of data sources.

Control lineage ingestion

To manage costs and governance policies, you can turn lineage generation on or off for specific Google Cloud services. You can configure this ingestion centrally at the organization, folder, and project levels. During preview, this feature supports configuring lineage ingestion only for Managed Service for Apache Spark.

Knowledge Catalog evaluates the resource hierarchy (project, then folders, then organization) to determine the effective configuration. The first configuration explicitly set at any level in this upward traversal takes effect.
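The upward traversal can be sketched as a first-match lookup. This is an illustration of the rule described above, not the service's implementation: the resource names, the boolean representation of a setting, and the `None` result for "nothing set anywhere" are all assumptions.

```python
def effective_config(ancestry, configs):
    """Return the first explicitly set value, walking from the project upward.

    `ancestry` lists resource names from the project up to the organization.
    Returns None when no level sets a value.
    """
    for resource in ancestry:
        if resource in configs:
            return configs[resource]
    return None


# Hypothetical settings: True = ingestion enabled, False = disabled.
# Resources absent from the mapping have no explicit configuration.
CONFIGS = {
    "organizations/example-org": True,
    "folders/analytics": False,
}

dev_ancestry = ["projects/dev", "folders/analytics", "organizations/example-org"]
prod_ancestry = ["projects/prod", "organizations/example-org"]

print(effective_config(dev_ancestry, CONFIGS))   # False (folder setting wins)
print(effective_config(prod_ancestry, CONFIGS))  # True (inherited from organization)
```

Because the lookup stops at the first explicit setting, a project-level configuration always overrides anything set on its folders or organization.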

For example, consider an organization test-org with the following Managed Service for Apache Spark lineage configurations:

In this scenario, the following settings are applicable:

Controlling lineage data generation helps you manage costs and governance policies. For example, you can disable lineage collection for development projects or high-volume workloads that don't require lineage tracking.

For information on how to configure and control lineage ingestion, see Control lineage ingestion for a service.

Multi-region data lineage

Data lineage is an inherently regionalized service. Lineage metadata, including links, processes, and events, is securely recorded and isolated within the specific geographical location where the underlying data transformation or asset modification occurs.

As modern enterprise data architectures scale, pipeline workflows frequently cross project and regional boundaries. For example, a BigQuery transformation pipeline running in us-central1 might read a source table in us-east1 and output aggregated metrics into a Cloud Storage bucket located in europe-west1.

To establish a comprehensive, end-to-end view of your data's lifecycle across these independent geographical spaces, use a multi-region lineage search method.

For more information, see About multi-region lineage search.
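Conceptually, a multi-region search has to consult each regional store and stitch the results together. The sketch below shows that idea on in-memory data only. It does not represent how the multi-region search method is implemented, and the per-region stores, asset names, and function names are hypothetical.

```python
from collections import deque

# Hypothetical per-region link stores: each region holds only the
# (source, target) links that were recorded in that location.
REGIONAL_LINKS = {
    "us-east1": [("orders_raw", "orders_staged")],
    "us-central1": [("orders_staged", "metrics")],
    "europe-west1": [("metrics", "metrics_export")],
}


def search_region(region, asset):
    """Single-region lookup: links in one region that start at `asset`."""
    return [link for link in REGIONAL_LINKS.get(region, []) if link[0] == asset]


def multi_region_downstream(asset, regions):
    """Fan out to every region at each hop and merge the discovered links."""
    found = []
    seen = {asset}
    frontier = deque([asset])
    while frontier:
        current = frontier.popleft()
        for region in regions:
            for source, target in search_region(region, current):
                found.append((region, source, target))
                if target not in seen:
                    seen.add(target)
                    frontier.append(target)
    return found


for region, source, target in multi_region_downstream("orders_raw", list(REGIONAL_LINKS)):
    print(f"{region}: {source} -> {target}")
```

A search limited to a single region would see only one of these three hops, which is why an end-to-end view requires querying across locations.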

Limitations

Data lineage has the following limitations:

Column-level lineage limitations

Column-level lineage has the following additional limitations:

Pricing

What's next