Query Bigtable external tables (original) (raw)

This document describes how to use BigQuery to query data stored in aBigtable external table. For information on how to query data directly from Bigtable, see GoogleSQL for Bigtable overview.

Bigtable is Google's sparsely populated NoSQL database that can scale to billions of rows, thousands of columns, and petabytes of data. For information on the Bigtable data model, see Storage model.

Query permanent external tables

Before you begin, you or an administrator in your organization must create an external table for you to use. For details and required permissions, see Create a BigQuery external table.

Required roles

To query Bigtable external tables, ensure you have the following roles.

Depending on your permissions, you can grant these roles to yourself or ask your administrator to grant them to you. For more information about granting roles, seeViewing the grantable roles on resources.

To see the exact BigQuery permissions that are required to query external tables, expand the Required permissions section:

You might also be able to get these permissions with custom rolesor other predefined roles.

Query the table

You can run a query against a permanent external Bigtable table exactly as if it were astandard BigQuery table, subject to the limitationson external data sources. For more information, see Run interactive and batch queries.

Query temporary external tables

Querying an external data source using a temporary table is useful for one-time, ad-hoc queries over external data, or for extract, transform, and load (ETL) processes.

To query an external data source without creating a permanent table, you provide a table definition for the temporary table, and then use that table definition in a command or call to query the temporary table. You can provide the table definition in any of the following ways:

The table definition file or supplied schema is used to create the temporary external table, and the query runs against the temporary external table.

When you use a temporary external table, you do not create a table in one of your BigQuery datasets. Because the table is not permanently stored in a dataset, it cannot be shared with others.

Using a temporary external table instead of a permanent external table has some limitations, including the following:

Required roles

To query Bigtable temporary external tables, ensure you have the following roles:

Depending on your permissions, you can grant these roles to yourself or ask your administrator to grant them to you. For more information about granting roles, seeViewing the grantable roles on resources.

To see the exact BigQuery permissions that are required to query external tables, expand the Required permissions section:

You might also be able to get these permissions with custom rolesor other predefined roles.

Create and query the table

To query Bigtable data using a temporary external table, you:

Creating and querying a temporary external table is supported by the bq command-line tool and the API.

bq

To query a temporary table using a table definition file, enter thebq query command with the --external_table_definition flag.

(Optional) Supply the --location flag and set the value to yourlocation.

bq --location=LOCATION query
--use_legacy_sql=false
--external_table_definition=TABLE::DEFINITION_FILE
'QUERY'

Replace the following:

For example, the following command creates and queries a temporary table named follows using a table definition file named follows_def.

bq query \
--use_legacy_sql=false \
--external_table_definition=follows::/tmp/follows_def \
'SELECT
  COUNT(rowkey)
 FROM
   follows'

API

Java

Before trying this sample, follow the Java setup instructions in theBigQuery quickstart using client libraries. For more information, see theBigQuery Java API reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, seeSet up authentication for client libraries.

Performance considerations

The performance of queries against Bigtable external data sources depends on three factors:

BigQuery tries to read as little data as possible by only reading the column families that are referenced in the query. The extent of parallelization depends on how many nodes you have in your Bigtable cluster and how many splits you have for your table.

Note that Bigtable auto-merges splits based on load. If your table is not being read frequently, there will be fewer splits over time and a gradual degradation in query performance. For more information, see How BigQuery optimizes your data over time.

Compute

When you query your Bigtable data from BigQuery, you have the following options for compute:

To use Data Boost, you or your administrator must create a definition file that specifies a Data Boost app profile in the Bigtable URI. For more information, see Create a Bigtable external table.

If you don't use Data Boost, be aware that querying Bigtable from BigQuery consumes Bigtable CPU cycles. CPU consumption by BigQuery when using provisioned nodes for compute might affect latency and throughput for other concurrent requests such as live user traffic serving. For example, high CPU usage on Bigtable affects long-tail queries and increases latency at the 99th percentile.

As a result, you should monitor Bigtable CPU usage to verify that you're within the recommended bounds as noted on the Bigtable monitoring dashboard in the Google Cloud console. Increasing the number of nodes for your instance lets you handle both BigQuery traffic and traffic from other concurrent requests.

Query filters

You can add query filters when querying an external table to reduce BigQuery resource usage.

Row key filter

Queries with a row key equality filter only read that specific row. For example, in GoogleSQL syntax:

SELECT COUNT(follows.column.name) FROM dataset.table WHERE rowkey = "alice";

Range filters such as rowkey > '1' and rowkey < '8' are also supported, but only when rowkey is read as a string with the readRowkeyAsString option.

Filter by column family and qualifier

You can also select a specific column family or a specific qualifier within a column family. To filter by column family, select the column family name, and the result includes only the selected column family. In the following example, user_info represents a column family:

SELECT
  rowkey AS user_id,
  user_info
FROM
  `project.dataset.table`;

To filter by a specific qualifier, you must first declare them in "columns" in the external table definition:

CREATE OR REPLACE EXTERNAL TABLE project.dataset.table OPTIONS ( format = 'CLOUD_BIGTABLE', uris = ['https://googleapis.com/bigtable/projects/…/instances/…/tables/…'], bigtable_options = '''{ "columnFamilies": [ { "familyId": "user_info", "columns": [ { "qualifierString": "name" }, { "qualifierString": "email" }, { "qualifierString": "registered_at" } ] }, { "familyId": "session_data" } ], "readRowkeyAsString": true, "timestampSuffix": "_ts" }''' );

After the external table is created, use a SELECT statement to query a specific qualifier. This ensures that BigQuery pushes down the filter to Bigtable and only loads the specified qualifiers when running a SELECT statement from BigQuery, not the entire column family's data. This reduces BigQuery resource consumption.

SELECT
  rowkey AS user_id,
  user_info.email.`cell`[SAFE_OFFSET(0)].value as email
FROM
  `project.dataset.table`;