Discover data

This guide explains how to enable and use Dataplex Discovery. Discovery scans and extracts metadata from data in a data lake and registers it to Dataproc Metastore, BigQuery, and Data Catalog for analysis, search, and exploration.

Overview

For each Dataplex asset with Discovery enabled, Dataplex does the following:

For unstructured data, such as images and videos, Dataplex Discovery automatically detects and registers groups of files that share a media type as filesets. For example, if gs://images/group1 contains GIF images, and gs://images/group2 contains JPEG images, Dataplex Discovery detects and registers two filesets. For structured data, such as Avro, Discovery detects files only if they are located in folders that contain the same data format and schema.
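
The fileset grouping in the example above can be illustrated with a short sketch. It is not Dataplex's actual detection logic: it assumes a fileset corresponds to one folder whose files all share a single media type, and it guesses media types from file extensions. The function name group_filesets and the object names under each folder are hypothetical.

```python
import mimetypes
import posixpath
from collections import defaultdict


def group_filesets(object_uris):
    """Map each folder to its media type if all files in it share one type."""
    types_by_folder = defaultdict(set)
    for uri in object_uris:
        # Guess the media type from the file extension only (an assumption).
        media_type, _ = mimetypes.guess_type(posixpath.basename(uri))
        types_by_folder[posixpath.dirname(uri)].add(media_type)
    # Keep folders that contain exactly one known media type.
    return {
        folder: next(iter(types))
        for folder, types in types_by_folder.items()
        if len(types) == 1 and None not in types
    }


uris = [
    "gs://images/group1/a.gif",
    "gs://images/group1/b.gif",
    "gs://images/group2/c.jpeg",
    "gs://images/group2/d.jpg",
]
filesets = group_filesets(uris)
for folder, media_type in sorted(filesets.items()):
    print(folder, media_type)
```

In this sketch, a folder that mixes media types is left out of the result.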

The discovered tables and filesets are registered in Data Catalog for search and discovery. The tables appear in Dataproc Metastore as Hive-style tables, and in BigQuery as external tables, so that data is automatically made available for analysis.

Discovery supports the following structured and semi-structured data formats:

Discovery supports the following compression format for structured and semi-structured data:

Discovery configuration

Discovery is enabled by default when you create a new zone or asset. You can disable Discovery at the zone or asset level.

When you create a zone or an asset, you can choose to inherit Discovery settings at the zone level, or override Discovery settings at the asset level.
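
The inherit-or-override choice can be modeled with a small sketch. This is an illustration only: the DiscoverySettings class, its fields, and the effective_settings helper are hypothetical, and the sketch assumes an asset-level override replaces the zone-level settings as a whole, which may not match how Dataplex merges individual options.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DiscoverySettings:
    """Hypothetical container for Discovery options."""
    enabled: bool = True  # Discovery is enabled by default for new zones and assets.
    include_patterns: List[str] = field(default_factory=list)
    exclude_patterns: List[str] = field(default_factory=list)


def effective_settings(zone: DiscoverySettings,
                       asset_override: Optional[DiscoverySettings] = None) -> DiscoverySettings:
    """Return the asset-level override if one is set; otherwise inherit from the zone."""
    return asset_override if asset_override is not None else zone


zone_settings = DiscoverySettings(exclude_patterns=["tmp/*"])
inherited = effective_settings(zone_settings)
overridden = effective_settings(zone_settings, DiscoverySettings(enabled=False))
print(inherited.enabled, overridden.enabled)
```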

The following are the Discovery configuration options available at the zone and asset levels:

When you create a data zone in your Dataplex lake, Dataplex creates a BigQuery dataset in the project containing the lake. Dataplex publishes tables into that dataset for tables discovered in the Cloud Storage buckets added to the data zone as assets. The dataset is referred to as the metadata publishing dataset corresponding to the zone.

Each Dataplex data zone maps to a dataset in BigQuery or a database in Dataproc Metastore, where metadata information is automatically made available.

You can edit auto-discovered metadata, such as table name or schema, using the Dataplex metadata API.

View discovered tables and filesets

You can search for discovered tables and filesets in the Dataplex Search view in the Google Cloud console.

For more accurate search results, use Dataplex-specific filters, such as lake and data zone names. The top 50 items per facet are displayed on the filters list. You can find any additional items using the search box.

Each entry contains detailed technical and operational metadata.

From the entry details page, you can query the table in BigQuery and view corresponding Dataproc Metastore registration details.

If a Cloud Storage table can be published into BigQuery as an external table, then you can see the following in its entry details view:

The Dataplex metadata entries are directly visible and searchable in Data Catalog. To learn more, see the Data Catalog Search reference.

All discovered entries can be viewed through the Dataplex metadata API.
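
As a sketch of what listing entries through the metadata API might look like over REST, the following builds a list-entities request URL for one zone. The endpoint shape and the view query parameter reflect one reading of the metadata API reference and should be verified there. The helper name list_entities_url and all project, location, lake, and zone IDs are hypothetical. The sketch builds the URL only; it does not authenticate or send a request.

```python
from urllib.parse import urlencode

API_ROOT = "https://dataplex.googleapis.com/v1"  # assumed REST base URL


def list_entities_url(project, location, lake, zone, view="TABLES"):
    """Build the URL for listing the entities (tables or filesets) in a zone."""
    parent = f"projects/{project}/locations/{location}/lakes/{lake}/zones/{zone}"
    return f"{API_ROOT}/{parent}/entities?{urlencode({'view': view})}"


url = list_entities_url("my-project", "us-central1", "my-lake", "my-zone")
print(url)
```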

Discovery actions

Discovery raises the following administrator actions whenever data-related issues are detected during scans.

Invalid data format

Actions include the following:

Incompatible schema

Actions include the following:

Invalid partition definition

Actions include the following:

Missing data

Actions include the following:

Resolve Discovery actions

Data with actions is checked by subsequent Discovery scans. When the issue triggering the action is fixed, the action is resolved automatically by the next scheduled Discovery scan.

Other Discovery actions

In addition to the preceding Discovery actions, there are three other types of actions related to resource status and security policy propagations in Dataplex.

These types of actions are auto-resolved when the underlying resource or security configuration issues are corrected.

FAQ

What should I do if the schema inferred by Discovery is incorrect?

If the inferred schema is different from what is expected for a given table, you can override the inferred schema by updating metadata using the metadata API. Make sure to set userManaged to true so that your edit is not overwritten in subsequent Discovery scans.
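
A schema override of this kind could be prepared as follows. This is a minimal sketch, assuming the entity is handled as a JSON-style dictionary. The field names (schema, userManaged, fields, etag) reflect one reading of the metadata API's Entity resource and should be checked against the API reference. The resource name, the etag value, and the helper name build_schema_override are hypothetical. The sketch only builds the request body and does not call the API.

```python
import json


def build_schema_override(entity, fields):
    """Return a copy of `entity` with its schema replaced and marked user-managed."""
    updated = dict(entity)  # shallow copy; the original entity is left unchanged
    updated["schema"] = {
        "userManaged": True,  # tells Discovery not to overwrite this schema
        "fields": fields,
    }
    return updated


# Hypothetical entity, shaped like a response from the metadata API.
entity = {
    "name": "projects/my-project/locations/us-central1/lakes/my-lake/zones/my-zone/entities/orders",
    "etag": "example-etag",
    "schema": {
        "userManaged": False,
        "fields": [{"name": "id", "type": "STRING", "mode": "NULLABLE"}],
    },
}

body = build_schema_override(
    entity,
    fields=[{"name": "id", "type": "INT64", "mode": "REQUIRED"}],
)
print(json.dumps(body, indent=2))
```

The resulting body would then be sent in an update request for the entity, using whichever client or HTTP library you already use.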

How do I exclude files from a Discovery scan?

By default, Discovery excludes certain types of files from scanning, including the following:

You can specify additional include or exclude patterns by using the Discovery configuration at the zone or asset level, or by using the metadata API.
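
To reason about how a set of include and exclude patterns would affect a scan, a local dry run can help. The sketch below uses Python's fnmatch module as a stand-in for glob matching; Dataplex's exact pattern semantics may differ, so treat the result as an approximation. The function name select_objects, the sample paths, and the patterns are hypothetical.

```python
from fnmatch import fnmatchcase


def select_objects(paths, include_patterns=(), exclude_patterns=()):
    """Keep paths that match an include pattern (if any are given) and no exclude pattern."""
    selected = []
    for path in paths:
        included = not include_patterns or any(
            fnmatchcase(path, pattern) for pattern in include_patterns
        )
        excluded = any(fnmatchcase(path, pattern) for pattern in exclude_patterns)
        if included and not excluded:
            selected.append(path)
    return selected


paths = [
    "sales/2024/data.parquet",
    "sales/2024/_SUCCESS",
    "tmp/scratch.csv",
    "sales/2023/data.parquet",
]
kept = select_objects(
    paths,
    include_patterns=["sales/*"],
    exclude_patterns=["*/_SUCCESS"],
)
print(kept)
```

Note that in fnmatch the * wildcard also matches path separators, so sales/* covers nested folders in this sketch.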

What should I do if the table grouping detected by Discovery is too granular?

If Discovery registers tables at a more granular level than the table root path (for example, each individual partition is registered as a table), there could be several reasons:

You can resolve this issue in either of the following ways:

After you take one of the corrective steps, in the next Discovery scan, the following occurs:

How do I specify table names?

You can specify table names by using the metadata API.

What happens if I create tables manually in Dataproc Metastore or BigQuery?

When Discovery is enabled for a given asset, you don't need to manually register entries in Dataproc Metastore or BigQuery.

You can switch off Dataplex Discovery and manually define the table name, schema, and partition definitions. Alternatively, you can do the following:

  1. Create a table by specifying only the required information, such as table root path.
  2. Use Dataplex Discovery to populate the rest of the metadata, such as schema and partition definitions.
  3. Keep the metadata up-to-date.

What should I do if my table is not showing up in BigQuery?

While all Dataplex metadata is centrally registered in the metadata API, only Cloud Storage tables that are compatible with BigQuery are published to BigQuery as external tables. As part of table entry details in the metadata API, you can find a BigQuery compatibility marker that indicates which entities are published to BigQuery and why.
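
Reading the compatibility marker from an entity might look like the following sketch. It assumes the entity is available as a JSON-style dictionary and that the marker lives under compatibility.bigquery with compatible and reason fields; this reflects one reading of the Entity resource and should be verified against the metadata API reference. The helper name bigquery_status, the sample entities, and the reason text are hypothetical.

```python
def bigquery_status(entity):
    """Return (compatible, reason) from an entity's BigQuery compatibility marker."""
    marker = entity.get("compatibility", {}).get("bigquery", {})
    return bool(marker.get("compatible", False)), marker.get("reason", "")


# Hypothetical entities, shaped like responses from the metadata API.
published = {"id": "orders", "compatibility": {"bigquery": {"compatible": True}}}
unpublished = {
    "id": "events",
    "compatibility": {
        "bigquery": {"compatible": False, "reason": "example reason text"}
    },
}

for entity in (published, unpublished):
    compatible, reason = bigquery_status(entity)
    print(entity["id"], compatible, reason)
```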

Limitations

What's next