Query Bigtable external tables (original) (raw)

This document describes how to use BigQuery to query data stored in aBigtable external table. For information on how to query data directly from Bigtable, see GoogleSQL for Bigtable overview.

Bigtable is Google's sparsely populated NoSQL database that can scale to billions of rows, thousands of columns, and petabytes of data. For information on the Bigtable data model, see Storage model.

Query permanent external tables

Before you begin, you or an administrator in your organization must create an external table for you to use. For details and required permissions, see Create a BigQuery external table.

Required roles

To query Bigtable external tables, ensure you have the following roles.

BigQuery Data Viewer (roles/bigquery.dataViewer)
BigQuery User (roles/bigquery.user)
Bigtable Reader (roles/bigtable.reader)

Depending on your permissions, you can grant these roles to yourself or ask your administrator to grant them to you. For more information about granting roles, seeViewing the grantable roles on resources.

To see the exact BigQuery permissions that are required to query external tables, expand the Required permissions section:

You might also be able to get these permissions with custom rolesor other predefined roles.

Query the table

You can run a query against a permanent external Bigtable table exactly as if it were astandard BigQuery table, subject to the limitationson external data sources. For more information, see Run interactive and batch queries.

Query temporary external tables

Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.

To query an external data source without creating a permanent table, you provide a table definition for the temporary table, and then use that table definition in a command or call to query the temporary table. You can provide the table definition in any of the following ways:

A table definition file
An inline schema definition
A JSON schema file

The table definition file or supplied schema is used to create the temporary external table, and the query runs against the temporary external table.

When you use a temporary external table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others.

Using a temporary external table instead of a permanent external table has some limitations, including the following:

You must have the Bigtable Admin (roles/bigtable.admin) role.
This approach does not let you use the Google Cloud console to infer the schema of the Bigtable table and automatically create the table definition. You must create the table definition yourself.

Required roles

To query Bigtable temporary external tables, ensure you have the following roles:

BigQuery Data Viewer (roles/bigquery.dataViewer)
BigQuery User (roles/bigquery.user)
Bigtable Admin (roles/bigtable.admin)

To see the exact BigQuery permissions that are required to query external tables, expand the Required permissions section:

You might also be able to get these permissions with custom rolesor other predefined roles.

Create and query the table

To query Bigtable data using a temporary external table, you:

Create a table definition file
Submit both a query and a table definition file

Creating and querying a temporary external table is supported by the bq command-line tool and the API.

bq

To query a temporary table using a table definition file, enter thebq query command with the --external_table_definition flag.

(Optional) Supply the --location flag and set the value to yourlocation.

bq --location=LOCATION query
--use_legacy_sql=false
--external_table_definition=TABLE::DEFINITION_FILE
'QUERY'

Replace the following:

LOCATION: the name of your location. The --location flag is optional.
TABLE: the name of the temporary table you're creating.
DEFINITION_FILE: the path to thetable definition fileon your local machine.
QUERY: the query you're submitting to the temporary table.

For example, the following command creates and queries a temporary table named follows using a table definition file named follows_def.

bq query \
--use_legacy_sql=false \
--external_table_definition=follows::/tmp/follows_def \
'SELECT
  COUNT(rowkey)
 FROM
   follows'

API

Create a query. See Querying data for information about creating a query job.
(Optional) Specify your location in the location property in thejobReference section of the job resource.
Specify the external data source properties by setting theExternalDataConfiguration for the table resource.

Java

Before trying this sample, follow the Java setup instructions in theBigQuery quickstart using client libraries. For more information, see theBigQuery Java API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, seeSet up authentication for client libraries.

Performance considerations

The performance of queries against Bigtable external data sources depends on three factors:

The number of rows
The amount of data read
The extent of parallelization

BigQuery tries to read as little data as possible by only reading the column families that are referenced in the query. The extent of parallelization depends on how many nodes you have in your Bigtable cluster and how many splits you have for your table.

Note that Bigtable auto-merges splits based on load. If your table is not being read frequently, there will be fewer splits over time and a gradual degradation in query performance. For more information, see How BigQuery optimizes your data over time.

Compute

When you query your Bigtable data from BigQuery, you have the following options for compute:

Cluster nodes, which is the default.
Data Boost (Preview), a serverless compute option that lets you isolate your analytics traffic without impacting the application-serving traffic that your clusters' nodes are handling.

To use Data Boost, you or your administrator must create a definition file that specifies a Data Boost app profile in the Bigtable URI. For more information, see Create a Bigtable external table.

If you don't use Data Boost, be aware that querying Bigtable from BigQuery consumes Bigtable CPU cycles. CPU consumption by BigQuery when using provisioned nodes for compute might affect latency and throughput for other concurrent requests such as live user traffic serving. For example, high CPU usage on Bigtable affects long-tail queries and increases latency at the 99th percentile.

As a result, you should monitor Bigtable CPU usage to verify that you're within the recommended bounds as noted on the Bigtable monitoring dashboard in the Google Cloud console. Increasing the number of nodes for your instance lets you handle both BigQuery traffic and traffic from other concurrent requests.

Query filters

You can add query filters when querying an external table to reduce BigQuery resource usage.

Row key filter

Queries with a row key equality filter only read that specific row. For example, in GoogleSQL syntax:

SELECT COUNT(follows.column.name) FROM dataset.table WHERE rowkey = "alice";

Range filters such as rowkey > '1' and rowkey < '8' are also supported, but only when rowkey is read as a string with the readRowkeyAsString option.

Filter by column family and qualifier

You can also select a specific column family or a specific qualifier within a column family. To filter by column family, select the column family name, and the result includes only the selected column family. In the following example, user_info represents a column family:

SELECT
  rowkey AS user_id,
  user_info
FROM
  `project.dataset.table`;

To filter by a specific qualifier, you must first declare them in "columns" in the external table definition:

CREATE OR REPLACE EXTERNAL TABLE project.dataset.table OPTIONS ( format = 'CLOUD_BIGTABLE', uris = ['https://googleapis.com/bigtable/projects/…/instances/…/tables/…'], bigtable_options = '''{ "columnFamilies": [ { "familyId": "user_info", "columns": [ { "qualifierString": "name" }, { "qualifierString": "email" }, { "qualifierString": "registered_at" } ] }, { "familyId": "session_data" } ], "readRowkeyAsString": true, "timestampSuffix": "_ts" }''' );

After the external table is created, use a SELECT statement to query a specific qualifier. This ensures that BigQuery pushes down the filter to Bigtable and only loads the specified qualifiers when running a SELECT statement from BigQuery, not the entire column family's data. This reduces BigQuery resource consumption.

SELECT
  rowkey AS user_id,
  user_info.email.`cell`[SAFE_OFFSET(0)].value as email
FROM
  `project.dataset.table`;