About data lineage

Data lineage is a visual map that tracks the entire lifecycle of your data. It shows you where your data comes from (the origin), where it travels (the destinations), and all the changes or transformations that happen along the way.

You can view this complete map of your data's journey directly in the Google Cloud console for assets created in products such as Knowledge Catalog (formerly Dataplex Universal Catalog), BigQuery (including external tables created for Iceberg REST Catalog), and Vertex AI. Because workflows often span multiple regions, Knowledge Catalog supports multi-region lineage, which provides a unified view of your data's journey across the global Google Cloud ecosystem. Advanced users can also retrieve this information by using the Data Lineage API.

Why you need data lineage

Modern companies constantly move and change large amounts of data, for example by transforming raw customer purchases into reports, dashboards, and machine learning models. This complexity creates critical challenges for your team:

Trust and verification: Data users often struggle to confirm that the reports and numbers they see are accurate and come from a trusted source.
Troubleshooting: When an error appears in a final report, data teams might find it difficult and time-consuming to trace the issue back through every step to its root cause.
Change management: Before changing or deleting a piece of data (like a column in a table), teams need to know every single downstream report or model that relies on it to avoid breaking critical systems.
Compliance: Leaders need visibility into how sensitive data (like customer or financial information) is used across the organization to meet regulatory requirements.

Data lineage solves these problems by providing a clear, visual, and documented journey of your data. This lets you quickly understand data sources, trace errors, assess the impact of changes, and maintain compliance.

How data lineage works

The data lineage workflow includes the following steps:

Data sources and ingestion: lineage information from your data sources initiates the entire process. For more information, see Lineage sources.
- Google Cloud services: when the Data Lineage API is enabled, supported services such as BigQuery and Dataflow automatically report lineage events whenever data is moved or transformed.
- Custom sources: for any systems not automatically supported by Google Cloud integrations, you can use the Data Lineage API to manually record lineage information. We recommend importing events formatted according to the OpenLineage standard.
Lineage platform: this central platform ingests, models, and stores all lineage data. For more information, see Lineage information model and granularity.
- Data Lineage API: this API acts as the single entry point for all incoming lineage information. It uses a hierarchical data model consisting of three core concepts: process, run, and event.
- Processing and storage: the platform processes incoming data and stores it in reliable, query-optimized databases.
User experience: you can interact with the stored lineage information in two primary ways:
- Visual exploration: in the Google Cloud console, a frontend service fetches and renders the lineage data as an interactive graph or list. This view is supported for Knowledge Catalog, BigQuery, Lakehouse (for Iceberg REST Catalog tables), the physical layer (Cloud Storage), and Vertex AI (for models and datasets, through pipelines, and for feature store views and feature groups). It is ideal for visually exploring your data's journey. For more information, see Lineage views in the Google Cloud console.
- Programmatic access: using an API client, you can directly communicate with the Data Lineage API to automate lineage management. This lets you write lineage information from custom sources. It also lets you read and query the stored lineage data for use in other applications or for building custom reports.
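The process, run, and event hierarchy mentioned above can be pictured with a small in-memory model. This is only an illustrative sketch, not the API's actual resource schema: the class names, fields, and the links() helper are assumptions chosen for readability.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Link:
    """One edge of the lineage graph: data flowed from source to target."""
    source: str  # fully qualified name of the upstream asset
    target: str  # fully qualified name of the downstream asset


@dataclass
class Event:
    """A point in time at which data moved between assets."""
    links: list[Link] = field(default_factory=list)


@dataclass
class Run:
    """One execution of a process; it groups the events it produced."""
    name: str
    events: list[Event] = field(default_factory=list)


@dataclass
class Process:
    """The definition of a data transformation, such as a recurring query."""
    name: str
    runs: list[Run] = field(default_factory=list)

    def links(self) -> list[Link]:
        """Flatten the process -> run -> event hierarchy into graph edges."""
        return [
            link
            for run in self.runs
            for event in run.events
            for link in event.links
        ]


# Hypothetical asset names, used only for illustration.
nightly = Process(
    name="nightly-sales-rollup",
    runs=[
        Run(
            name="run-001",
            events=[
                Event(
                    links=[
                        Link(
                            source="bigquery:my-project.sales.raw_orders",
                            target="bigquery:my-project.sales.daily_totals",
                        )
                    ]
                )
            ],
        )
    ],
)

for edge in nightly.links():
    print(f"{edge.source} -> {edge.target}")
```

Flattening the hierarchy into links is what lets a graph view draw edges between assets while still attributing each edge to the run and process that produced it.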

Which API should I use for data lineage?

To perform immediate, single-level lookups, use the SearchLinks API. To build a complete lineage graph or perform deep impact analysis (up to 100 levels), use the SearchLineageStreaming API.

Depending on your use case, select the most appropriate method:

Feature	SearchLinks	SearchLineageStreaming
Depth	1 level (immediate neighbors)	Up to 100 levels
Execution	Synchronous	Real-time streaming
Use case	Simple lookups of direct sources or targets	Building a complete lineage graph or performing impact analysis

Identify direction

Upstream (Origins):
- In SearchLinks, set the target field to your asset's FQN.
- In SearchLineageStreaming, set direction to UPSTREAM.
Downstream (Destinations):
- In SearchLinks, set the source field to your asset's FQN.
- In SearchLineageStreaming, set direction to DOWNSTREAM.
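The selection rules above (method by depth, then field or parameter by direction) can be condensed into a small helper. This sketch doesn't call the Data Lineage API. It only returns which method to use and which setting to populate, and the LookupPlan type, the plan_lookup function, and the example asset name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LookupPlan:
    method: str   # "SearchLinks" or "SearchLineageStreaming"
    setting: str  # request field or parameter to set
    value: str    # value to assign to that setting


def plan_lookup(fqn: str, direction: str, depth: int = 1) -> LookupPlan:
    """Pick the lineage search method and the setting that encodes direction.

    depth == 1 maps to SearchLinks (immediate neighbors only); anything
    deeper, up to 100 levels, maps to SearchLineageStreaming.
    """
    direction = direction.upper()
    if direction not in ("UPSTREAM", "DOWNSTREAM"):
        raise ValueError("direction must be UPSTREAM or DOWNSTREAM")
    if not 1 <= depth <= 100:
        raise ValueError("depth must be between 1 and 100")

    if depth == 1:
        # Upstream origins are links whose target is the asset;
        # downstream destinations are links whose source is the asset.
        field_name = "target" if direction == "UPSTREAM" else "source"
        return LookupPlan("SearchLinks", field_name, fqn)

    return LookupPlan("SearchLineageStreaming", "direction", direction)


# Hypothetical fully qualified name, used only for illustration.
print(plan_lookup("bigquery:my-project.sales.orders", "upstream"))
print(plan_lookup("bigquery:my-project.sales.orders", "downstream", depth=5))
```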

What data sources are supported for data lineage?

You can populate lineage information in Knowledge Catalog in the following ways:

Automatically from integrated Google Cloud services
Manually, by using the Data Lineage API for custom sources
By importing events from OpenLineage

BigQuery

When you enable data lineage in your BigQuery project, Knowledge Catalog automatically records lineage information for the following:

New tables created as a result of the following BigQuery jobs:
- Copy jobs
- Load jobs that use a Cloud Storage URI
- Query jobs that use the following data definition language (DDL) in GoogleSQL:
  * CREATE TABLE
  * CREATE TEMP TABLE
  * CREATE TABLE AS SELECT
  * CREATE TABLE COPY
  * CREATE TABLE CLONE
  * CREATE TABLE FUNCTION
  * CREATE TABLE LIKE
  * CREATE VIEW
  * CREATE MATERIALIZED VIEW
Existing tables when you use the following data manipulation language (DML) statements in GoogleSQL:
- SELECT in relation to any of the following table types:
  * BigQuery Views
  * BigQuery Materialized Views
  * BigQuery External Tables
- INSERT SELECT
- MERGE
- UPDATE
- DELETE

BigQuery copy, query, and load jobs are represented as processes.

To view the process details, on the lineage graph, click the Process details icon.

Each process contains the BigQuery job_id in the attributes list for the most recent BigQuery job.

Other services

Data lineage supports integration with the following Google Cloud services:

Cloud Data Fusion
Dataflow
Lakehouse for Iceberg REST Catalog tables
Lakehouse Iceberg REST catalog tables with Lakehouse runtime catalog, Apache Iceberg REST Catalog in Lakehouse runtime catalog, or custom Iceberg Catalog for BigQuery in Lakehouse runtime catalog for Managed Service for Apache Spark (1.10 and 1.9).
Looker (Google Cloud core) (Preview)
Managed Service for Apache Airflow
Managed Service for Apache Spark: Apache Hive clusters
Managed Service for Apache Spark: Apache Spark clusters
Managed Service for Apache Spark: serverless deployment
Vertex AI Feature Store
Vertex AI Pipelines

Data lineage for custom data sources

You can use the Data Lineage API to manually record lineage information for any data source that integrated systems don't support.

Knowledge Catalog can create lineage graphs for manually recorded lineage if you use a fullyQualifiedName that matches the fully qualified names of existing Knowledge Catalog entries. If you want to record lineage for a custom data source, you must first create a custom entry.

Each process for a custom data source can contain a sql key in the attributes list. The value of this key is used to render a code highlight in the details panel of the Data lineage graph. The SQL statement is displayed as it was provided. You are responsible for filtering out sensitive information. The key name sql is case-sensitive.
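Because the SQL statement is displayed exactly as provided, one option is to scrub it before you attach it to the process. The following sketch builds a process payload with a lowercase sql attribute and replaces single-quoted string literals with a placeholder. The redaction rule, the function names, and the origin name are assumptions for illustration, and a real pipeline might need stricter filtering.

```python
import re

# Matches single-quoted SQL string literals, including backslash escapes.
_STRING_LITERAL = re.compile(r"'(?:[^'\\]|\\.)*'")


def redact_sql(sql: str) -> str:
    """Replace every single-quoted literal so values such as emails aren't shown."""
    return _STRING_LITERAL.sub("'<redacted>'", sql)


def build_process(display_name: str, sql: str, origin_name: str) -> dict:
    """Build a process body for a custom source.

    The attribute key must be the lowercase string "sql", because the key
    name is case-sensitive. origin_name is a hypothetical identifier for
    the system that reports the lineage.
    """
    return {
        "displayName": display_name,
        "attributes": {"sql": redact_sql(sql)},
        "origin": {"sourceType": "CUSTOM", "name": origin_name},
    }


process = build_process(
    display_name="customer-export",
    sql="INSERT INTO exports SELECT * FROM customers WHERE email = 'a@example.com'",
    origin_name="my-custom-etl",
)
print(process["attributes"]["sql"])
```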

OpenLineage

If you already use OpenLineage to collect lineage information from other data sources, you can import OpenLineage events into Knowledge Catalog and view these events in the Google Cloud console. For more information, see Integrate with OpenLineage.
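For orientation, an OpenLineage run event is a JSON document that names a job, a run, and the input and output datasets. The sketch below assembles a minimal event as a Python dictionary. The helper name, the producer URL, the default schema URL, and the dataset names are assumptions for illustration; check the OpenLineage specification version you target and the integration guide before sending events.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(
    job_namespace: str,
    job_name: str,
    inputs: list[tuple[str, str]],
    outputs: list[tuple[str, str]],
    event_type: str = "COMPLETE",
    producer: str = "https://example.com/my-pipeline",  # hypothetical producer
    schema_url: str = "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",  # assumed version
) -> dict:
    """Assemble a minimal OpenLineage-style run event.

    inputs and outputs are (namespace, name) pairs identifying datasets.
    """

    def dataset(ref: tuple[str, str]) -> dict:
        namespace, name = ref
        return {"namespace": namespace, "name": name}

    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [dataset(ref) for ref in inputs],
        "outputs": [dataset(ref) for ref in outputs],
        "producer": producer,
        "schemaURL": schema_url,
    }


# Hypothetical job and dataset names.
event = build_run_event(
    job_namespace="my-scheduler",
    job_name="daily_totals",
    inputs=[("bigquery", "my-project.sales.raw_orders")],
    outputs=[("bigquery", "my-project.sales.daily_totals")],
)
print(json.dumps(event, indent=2))
```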

Automated data lineage tracking

When you enable the Data Lineage API, Google Cloud systems that support data lineage start reporting their data movement. Each integrated system can submit lineage information for a different range of data sources.

Control lineage ingestion

To manage costs and governance policies, you can turn lineage generation on or off for specific Google Cloud services. You can configure this ingestion centrally at the organization, folder, and project levels. During preview, this feature supports configuring lineage ingestion only for Managed Service for Apache Spark.

Knowledge Catalog evaluates the resource hierarchy (project, then folders, then organization) to determine the effective configuration. The first configuration explicitly set at any level in this upward traversal takes effect.

If you set a configuration at the project level, Knowledge Catalog uses it.
If no configuration is set at the project level, Knowledge Catalog uses the configuration from the nearest parent folder with an explicit configuration.
If no configuration is set at the project or folder level, Knowledge Catalog uses the organization-level configuration.
If no configuration is set at any of these levels, Knowledge Catalog uses the system default for the integration. The default for lineage enablement configuration can be Enabled or Disabled. For Managed Service for Apache Spark, lineage ingestion is Enabled by default where the Data Lineage API is active.

For example, consider an organization test-org with the following Managed Service for Apache Spark lineage configurations:

Organization test-org: Enabled
- Folder folder-a: Disabled
  * Project project-a: No configuration set
- Folder folder-b: Enabled
  * Project project-b: Disabled

In this scenario, the following settings are applicable:

For project-a, lineage ingestion is Disabled. Knowledge Catalog starts evaluating from project-a, finds no configuration, moves up to folder-a, and applies the Disabled configuration from folder-a.
For project-b, lineage ingestion is Disabled. Knowledge Catalog starts evaluating from project-b and applies its Disabled configuration, overriding settings at folder-b and test-org.
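The upward traversal described above amounts to a "first explicit setting wins" lookup. The following sketch reproduces the test-org scenario in plain Python. The function and variable names are hypothetical, and the code only models the evaluation order; it doesn't read real configuration.

```python
def effective_setting(
    ancestry: list[str],
    explicit: dict[str, bool],
    system_default: bool,
) -> bool:
    """Return the lineage ingestion setting that applies to a project.

    ancestry lists the resource hierarchy from the project upward:
    the project first, then its folders, then the organization.
    explicit holds only the resources with an explicit configuration.
    """
    for resource in ancestry:
        if resource in explicit:
            # The first explicit configuration found on the way up wins.
            return explicit[resource]
    # Nothing is set at any level, so use the integration's default.
    return system_default


# Explicit configurations from the example (True = Enabled, False = Disabled).
explicit = {
    "test-org": True,
    "folder-a": False,
    "folder-b": True,
    "project-b": False,
}

project_a = effective_setting(["project-a", "folder-a", "test-org"], explicit, True)
project_b = effective_setting(["project-b", "folder-b", "test-org"], explicit, True)
print("project-a:", "Enabled" if project_a else "Disabled")
print("project-b:", "Enabled" if project_b else "Disabled")
```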

Controlling ingestion this way is useful when, for example, you want to disable lineage collection for development projects or high-volume workloads that don't require lineage tracking.

For information on how to configure and control lineage ingestion, see Control lineage ingestion for a service.

Multi-region data lineage

Data lineage is an inherently regionalized service. Lineage metadata, including links, processes, and events, is securely recorded and isolated within the specific geographical location where the underlying data transformation or asset modification occurs.

As modern enterprise data architectures scale, pipeline workflows frequently cross project and regional boundaries. For example, a BigQuery transformation pipeline running in us-central1 might read a source table in us-east1 and output aggregated metrics into a Cloud Storage bucket located in europe-west1.

To establish a comprehensive, end-to-end view of your data's lifecycle across these independent geographical spaces, use a multi-region lineage search method.

For more information, see About multi-region lineage search.

Limitations

Data lineage has the following limitations:

All lineage information is retained in the system for only 30 days.
Lineage information persists after you delete the related data source. For example, if you delete a BigQuery table, you can still view its lineage through the API and the console for up to 30 days.
Data lineage doesn't automatically record direct lineage information for BigQuery routines. If a routine is used in a query, data lineage records lineage between the tables that the routine reads as dependencies of tables that the query writes.

Column-level lineage limitations

Column-level lineage has the following additional limitations:

Column-level lineage isn't collected for BigQuery load jobs or for routines.
Upstream column-level lineage isn't collected for external tables.
Column-level lineage isn't collected if a job creates more than 1,500 column-level links. In these cases, only table-level lineage is collected.
Column-level lineage support is limited to top-level columns in BigQuery tables. Nested fields within complex types (like STRUCT or JSON) are not supported.
The search functionality with the field parameter only operates on links that explicitly define column-to-column relationships. It doesn't return results or traverse links that are only defined at the table level. There is no support for searching between table-level links and column-level links (e.g., finding all columns related to a table-level link, or vice-versa). The API will only return links where both source and target specify a field.
Support for partitioned tables is limited, because partitioning columns like _PARTITIONDATE and _PARTITIONTIME aren't recognized in the lineage graph.
Console limitations:
- The lineage graph traversal is limited to a depth of 20 levels and 10,000 links in each direction.

Pricing

Knowledge Catalog uses the premium processing SKU to charge for data lineage. For more information, see Pricing.
To separate data lineage charges from other charges in the Knowledge Catalog premium processing SKU, on the Cloud Billing report, use the label goog-dataplex-workload-type with the value LINEAGE.
If you call the Data Lineage API with an Origin sourceType value other than CUSTOM, you incur additional costs.
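If you analyze billing data outside the Cloud Billing report, the same label can be used to isolate lineage charges. The sketch below sums cost over rows that carry the label goog-dataplex-workload-type with the value LINEAGE. It assumes rows shaped like exported billing records, each with a cost number and a labels list of key and value pairs. The sample rows are made up, and the exact export layout for this label is an assumption you should verify against your own data.

```python
LINEAGE_LABEL_KEY = "goog-dataplex-workload-type"
LINEAGE_LABEL_VALUE = "LINEAGE"


def lineage_cost(rows: list[dict]) -> float:
    """Sum the cost of rows labeled as data lineage workload."""
    total = 0.0
    for row in rows:
        # Turn the list of {"key": ..., "value": ...} pairs into a dict.
        labels = {item["key"]: item["value"] for item in row.get("labels", [])}
        if labels.get(LINEAGE_LABEL_KEY) == LINEAGE_LABEL_VALUE:
            total += row["cost"]
    return total


# Made-up rows for illustration only.
rows = [
    {"cost": 1.5, "labels": [{"key": LINEAGE_LABEL_KEY, "value": "LINEAGE"}]},
    {"cost": 4.0, "labels": [{"key": LINEAGE_LABEL_KEY, "value": "OTHER"}]},
    {"cost": 2.25, "labels": [{"key": LINEAGE_LABEL_KEY, "value": "LINEAGE"}]},
    {"cost": 9.0, "labels": []},
]
print(lineage_cost(rows))
```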

What's next

Learn how to track data lineage for BigQuery table copy and query jobs.
Learn how to use data lineage with Google Cloud systems.
Learn about lineage views in the Google Cloud console.
Explore the Data Lineage API.
For administrative information, see Lineage considerations and data lineage audit logging.