Work with remote functions (original) (raw)
Stay organized with collections Save and categorize content based on your preferences.
A BigQuery remote function allows you to implement your function in other languages than SQL and Javascript or with the libraries or services which are not allowed in BigQueryuser-defined functions.
Overview
A BigQuery remote function lets you incorporate GoogleSQL functionality with software outside of BigQuery by providing a direct integration withCloud Run functions andCloud Run. With BigQuery remote functions, you can deploy your functions in Cloud Run functions or Cloud Run implemented with any supported language, and then invoke them from GoogleSQL queries.
Workflow
- Create the HTTP endpoint in Cloud Run functions or Cloud Run.
- Create a remote function in BigQuery.
- Create a connection of type
CLOUD_RESOURCE
. - Create a remote function.
- Create a connection of type
- Use the remote function in a query just like any other user-defined functions.
Limitations
- Remote functions only support one of the followingdata types as argument type or return type:
- Boolean
- Bytes
- Numeric
- String
- Date
- Datetime
- Time
- Timestamp
- JSON
Remote functions do not supportARRAY
,STRUCT
,INTERVAL
, orGEOGRAPHY
types.
- You cannot create table-valued remote functions.
- You cannot use remote functions when creating materialized views.
- The return value of a remote function is always assumed to be non-deterministic so the result of a query calling a remote function is not cached.
- You might see repeated requests with the same data to your endpoint, even after successful responses, due to transient network errors or BigQuery internal errors.
- When a remote function evaluation is skipped for some rows due to short-circuiting, for example, in conditional expressionsor a MERGE statementwith
WHEN [NOT] MATCHED
, batching is not used with the remote function. In this case, thecalls
field in the HTTP request bodyhas exactly one element. - If the dataset associated with the remote function is replicated to a destination region through cross-region dataset replication, the remote function can only be queried in the region that it was created in.
Create an endpoint
To create a remote function that can implement business logic, you must create an HTTP endpoint by using either Cloud Run functions or Cloud Run. The endpoint must be able to process a batch of rows in a single HTTP POST request and return the results for the batch as an HTTP response.
If you are creating the remote function by using BigQuery DataFrames, you don't have to manually create the HTTP endpoint; the service does that for you automatically.
See theCloud Run functions tutorial and otherCloud Run functions documentation on how to write, deploy, test and maintain a Cloud Run function.
See theCloud Run quick startand other Cloud Run documentation on how to write, deploy, test and maintain a Cloud Run service.
It's recommended that you keep the default authentication instead of allowing unauthenticated invocation of your Cloud Run function or Cloud Run service.
Input format
BigQuery sends HTTP POST requests with JSON body in the following format:
Field name | Description | Field type |
---|---|---|
requestId | Id of the request. Unique over multiple requests sent to this endpoint in a GoogleSQL query. | Always provided. String. |
caller | Job full resource name for the GoogleSQL query calling the remote function. | Always provided. String. |
sessionUser | Email of the user executing the GoogleSQL query. | Always provided. String. |
userDefinedContext | The user defined context that was used when creating the remote function in BigQuery. | Optional. A JSON object with key-value pairs. |
calls | A batch of input data. | Always provided. A JSON array.Each element itself is a JSON array, which is a JSON encoded argument list of one remote function call. |
An example of a request:
{
"requestId": "124ab1c",
"caller": "//bigquery.googleapis.com/projects/myproject/jobs/myproject:US.bquxjob_5b4c112c_17961fafeaf",
"sessionUser": "test-user@test-company.com",
"userDefinedContext": {
"key1": "value1",
"key2": "v2"
},
"calls": [
[null, 1, "", "abc"],
["abc", "9007199254740993", null, null]
]
}
Output format
BigQuery expects the endpoint should return a HTTP response in the following format, otherwise BigQuery can't consume it and will fail the query calling the remote function.
Field name | Description | Value Range |
---|---|---|
replies | A batch of return values. | Required for a successful response. A JSON array.Each element corresponds to a JSON encoded return value of the external function.Size of the array must match the size of the JSON array of calls in the HTTP request. For example, if the JSON array in calls has 4 elements, this JSON array needs to have 4 elements as well. |
errorMessage | Error message when the HTTP response code other than 200 is returned. For non-retryable errors, we return this as part of the BigQuery job's error message to the user. | Optional. String. Size should be less than 1KB. |
An example of a successful response:
{
"replies": [
1,
0
]
}
An example of a failed response:
{
"errorMessage": "Received but not expected that the argument 0 be null".
}
HTTP response code
Your endpoint should return the HTTP response code 200 for a successful response. When BigQuery receives any other value, BigQuery considers the response as a failure, and retries when the HTTP response code is 408, 429, 500, 503 or 504 until some internal limit.
JSON encoding of SQL data type
JSON encoding in HTTP request/response followsthe existing BigQuery JSON encodingfor TO_JSON_STRING function.
Sample Cloud Run function code
The following sample Python code implements adding all the integer arguments of the remote function. It handles a request with the arguments for batched invocations and returns all the result in a response.
import functions_framework
from flask import jsonify
# Max INT64 value encoded as a number in JSON by TO_JSON_STRING. Larger values are encoded as
# strings.
# See https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions#json_encodings
_MAX_LOSSLESS=9007199254740992
@functions_framework.http
def batch_add(request):
try:
return_value = []
request_json = request.get_json()
calls = request_json['calls']
for call in calls:
return_value.append(sum([int(x) if isinstance(x, str) else x for x in call if x is not None]))
replies = [str(x) if x > _MAX_LOSSLESS or x < -_MAX_LOSSLESS else x for x in return_value]
return_json = jsonify( { "replies": replies } )
return return_json
except Exception as e:
return jsonify( { "errorMessage": str(e) } ), 400
Assuming that the function is deployed in the project my_gcf_project
in regionus-east1
as the function name remote_add
, it can be accessed via the endpoint https://us-east1-my_gcf_project.cloudfunctions.net/remote_add
.
Sample Cloud Run code
The following sample Python code implements a web service, which can be built and deployed to Cloud Run for the same functionality.
import os
from flask import Flask, request, jsonify
# Max INT64 value encoded as a number in JSON by TO_JSON_STRING. Larger values are encoded as
# strings.
# See https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions#json_encodings
_MAX_LOSSLESS=9007199254740992
app = Flask(__name__)
@app.route("/", methods=['POST'])
def batch_add():
try:
return_value = []
request_json = request.get_json()
calls = request_json['calls']
for call in calls:
return_value.append(sum([int(x) if isinstance(x, str) else x for x in call if x is not None]))
replies = [str(x) if x > _MAX_LOSSLESS or x < -_MAX_LOSSLESS else x for x in return_value]
return jsonify( { "replies" : replies } )
except Exception as e:
return jsonify( { "errorMessage": str(e) } ), 400
if __name__ == "__main__":
app.run(debug=True, host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
See the guide on how to build and deploy the code.
Assuming that the Cloud Run service is deployed in the projectmy_gcf_project
in region us-east1
as the service name remote_add
, it can be accessed via the endpointhttps://remote_add-<project_id_hash>-ue.a.run.app
.
Create a remote function
BigQuery uses a CLOUD_RESOURCE
connection to interact with your Cloud Run function. In order to create a remote function, you must create a CLOUD_RESOURCE
connection. If you are creating the remote function by using BigQuery DataFramesand you have been granted the Project IAM Admin (roles/resourcemanager.projectIamAdmin
) role, then you don't have to manually create the connection and grant it access; the service does that for you automatically.
Create a connection
You must have a Cloud resource connection to connect to Cloud Run function and Cloud Run.
Select one of the following options:
Console
- Go to the BigQuery page.
Go to BigQuery - In the Explorer pane, click Add data.
The Add data dialog opens. - In the Filter By pane, in the Data Source Type section, select Business Applications.
Alternatively, in the Search for data sources field, you can enterVertex AI
. - In the Featured data sources section, click Vertex AI.
- Click the Vertex AI Models: BigQuery Federation solution card.
- In the Connection type list, selectVertex AI remote models, remote functions and BigLake (Cloud Resource).
- In the Connection ID field, enter a name for your connection.
- Click Create connection.
- Click Go to connection.
- In the Connection info pane, copy the service account ID for use in a later step.
bq
- In a command-line environment, create a connection:
bq mk --connection --location=REGION --project_id=PROJECT_ID \
--connection_type=CLOUD_RESOURCE CONNECTION_ID
The--project_id
parameter overrides the default project.
Replace the following:REGION
: yourconnection regionPROJECT_ID
: your Google Cloud project IDCONNECTION_ID
: an ID for your connection
When you create a connection resource, BigQuery creates a unique system service account and associates it with the connection.
Troubleshooting: If you get the following connection error,update the Google Cloud SDK:
Flags parsing error: flag --connection_type=CLOUD_RESOURCE: value should be one of...
- Retrieve and copy the service account ID for use in a later step:
bq show --connection PROJECT_ID.REGION.CONNECTION_ID
The output is similar to the following:
name properties
1234.REGION.CONNECTION_ID {"serviceAccountId": "connection-1234-9u56h9@gcp-sa-bigquery-condel.iam.gserviceaccount.com"}
Terraform
Use thegoogle_bigquery_connectionresource.
To authenticate to BigQuery, set up Application Default Credentials. For more information, seeSet up authentication for client libraries.
The following example creates a Cloud resource connection namedmy_cloud_resource_connection
in the US
region:
To apply your Terraform configuration in a Google Cloud project, complete the steps in the following sections.
Prepare Cloud Shell
- Launch Cloud Shell.
- Set the default Google Cloud project where you want to apply your Terraform configurations.
You only need to run this command once per project, and you can run it in any directory.
export GOOGLE_CLOUD_PROJECT=PROJECT_ID
Environment variables are overridden if you set explicit values in the Terraform configuration file.
Prepare the directory
Each Terraform configuration file must have its own directory (also called a root module).
- In Cloud Shell, create a directory and a new file within that directory. The filename must have the
.tf
extension—for examplemain.tf
. In this tutorial, the file is referred to asmain.tf
.
mkdir DIRECTORY && cd DIRECTORY && touch main.tf - If you are following a tutorial, you can copy the sample code in each section or step.
Copy the sample code into the newly createdmain.tf
.
Optionally, copy the code from GitHub. This is recommended when the Terraform snippet is part of an end-to-end solution. - Review and modify the sample parameters to apply to your environment.
- Save your changes.
- Initialize Terraform. You only need to do this once per directory.
terraform init
Optionally, to use the latest Google provider version, include the-upgrade
option:
terraform init -upgrade
Apply the changes
- Review the configuration and verify that the resources that Terraform is going to create or update match your expectations:
terraform plan
Make corrections to the configuration as necessary. - Apply the Terraform configuration by running the following command and entering
yes
at the prompt:
terraform apply
Wait until Terraform displays the "Apply complete!" message. - Open your Google Cloud project to view the results. In the Google Cloud console, navigate to your resources in the UI to make sure that Terraform has created or updated them.
Set up access
You must give the new connection read-only access to your Cloud Run function or Cloud Run service. It is not recommended to allow unauthenticated invocation for your Cloud Run function or Cloud Run service.
To grant roles, follow these steps:
- Go to the IAM & Admin page.
Go to IAM & Admin - Click Add.
The Add principals dialog opens. - In the New principals field, enter the service account ID that you copied earlier.
- In the Select a role field, select one of the following options:
- If you are using a 1st-gen Cloud Run function, choose Cloud Function, and then select Cloud Function Invoker role.
- If you are using a 2nd-gen Cloud Run function, choose Cloud Run, and then select Cloud Run Invoker role.
- If you are using a Cloud Run service, choose Cloud Run, and then select Cloud Run Invoker role.
- Click Save.
Create a remote function
To create a remote function:
SQL
Run the followingCREATE FUNCTIONstatement in BigQuery:
- In the Google Cloud console, go to the BigQuery page.
Go to BigQuery - In the query editor, enter the following statement:
CREATE FUNCTIONPROJECT_ID.DATASET_ID
.remote_add(x INT64, y INT64) RETURNS INT64
REMOTE WITH CONNECTIONPROJECT_ID.LOCATION.CONNECTION_NAME
OPTIONS (
endpoint = 'ENDPOINT_URL'
)
Replace the following:DATASET_ID
: the ID of your BigQuery dataset.ENDPOINT_URL
: the URL of your Cloud Run function or Cloud Run remote function endpoint.
- Click Run.
For more information about how to run queries, see Run an interactive query.
BigQuery DataFrames
- Enable the required APIs and make sure you have been granted the required roles, as described in the Requirements section ofRemote functions.
- Use theremote_function decorator:
import bigframes.pandas as bpd
# Set BigQuery DataFrames options
bpd.options.bigquery.project = your_gcp_project_id
bpd.options.bigquery.location = "US"
# BigQuery DataFrames gives you the ability to turn your custom scalar
# functions into a BigQuery remote function. It requires the GCP project to
# be set up appropriately and the user having sufficient privileges to use
# them. One can find more details about the usage and the requirements via
# `help` command.
help(bpd.remote_function)
# Read a table and inspect the column of interest.
df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
df["body_mass_g"].head(10)
# Define a custom function, and specify the intent to turn it into a remote
# function. It requires a BigQuery connection. If the connection is not
# already created, BigQuery DataFrames will attempt to create one assuming
# the necessary APIs and IAM permissions are setup in the project. In our
# examples we will be letting the default connection `bigframes-default-connection`
# be used. We will also set `reuse=False` to make sure we don't
# step over someone else creating remote function in the same project from
# the exact same source code at the same time. Let's try a `pandas`-like use
# case in which we want to apply a user defined scalar function to every
# value in a `Series`, more specifically bucketize the `body_mass_g` value
# of the penguins, which is a real number, into a category, which is a
# string.
@bpd.remote_function(
reuse=False,
cloud_function_service_account="default",
)
def get_bucket(num: float) -> str:
if not num:
return "NA"
boundary = 4000
return "at_or_above_4000" if num >= boundary else "below_4000"
# Then we can apply the remote function on the `Series` of interest via
# `apply` API and store the result in a new column in the DataFrame.
df = df.assign(body_mass_bucket=df["body_mass_g"].apply(get_bucket))
# This will add a new column `body_mass_bucket` in the DataFrame. You can
# preview the original value and the bucketized value side by side.
df[["body_mass_g", "body_mass_bucket"]].head(10)
# The above operation was possible by doing all the computation on the
# cloud. For that, there is a google cloud function deployed by serializing
# the user code, and a BigQuery remote function created to call the cloud
# function via the latter's http endpoint on the data in the DataFrame.
# The BigQuery remote function created to support the BigQuery DataFrames
# remote function can be located via a property `bigframes_remote_function`
# set in the remote function object.
print(f"Created BQ remote function: {get_bucket.bigframes_remote_function}")
# The cloud function can be located via another property
# `bigframes_cloud_function` set in the remote function object.
print(f"Created cloud function: {get_bucket.bigframes_cloud_function}")
# Warning: The deployed cloud function may be visible to other users with
# sufficient privilege in the project, so the user should be careful about
# having any sensitive data in the code that will be deployed as a remote
# function.
# Let's continue trying other potential use cases of remote functions. Let's
# say we consider the `species`, `island` and `sex` of the penguins
# sensitive information and want to redact that by replacing with their hash
# code instead. Let's define another scalar custom function and decorate it
# as a remote function. The custom function in this example has external
# package dependency, which can be specified via `packages` parameter.
@bpd.remote_function(
reuse=False,
packages=["cryptography"],
cloud_function_service_account="default",
)
def get_hash(input: str) -> str:
from cryptography.fernet import Fernet
# handle missing value
if input is None:
input = ""
key = Fernet.generate_key()
f = Fernet(key)
return f.encrypt(input.encode()).decode()
# We can use this remote function in another `pandas`-like API `map` that
# can be applied on a DataFrame
df_redacted = df[["species", "island", "sex"]].map(get_hash)
df_redacted.head(10)
You need to have the permission bigquery.routines.create
on the dataset where you create the remote function, and the bigquery.connections.delegate
permission (available from the BigQuery Connection Admin role) on the connection that is used by the remote function.
Providing user defined context
You can specify user_defined_context
in OPTIONS
as a form of key-value pairs, which will be part of every HTTP request to the endpoint. With user defined context, you can create multiple remote functions but re-use a single endpoint, that provides different behaviors based on the context passed to it.
The following examples create two remote functions to encrypt and decryptBYTES
data using the same endpoint.
CREATE FUNCTION `PROJECT_ID.DATASET_ID`.encrypt(x BYTES)
RETURNS BYTES
REMOTE WITH CONNECTION `PROJECT_ID.LOCATION.CONNECTION_NAME`
OPTIONS (
endpoint = 'ENDPOINT_URL',
user_defined_context = [("mode", "encryption")]
)
CREATE FUNCTION `PROJECT_ID.DATASET_ID`.decrypt(x BYTES)
RETURNS BYTES
REMOTE WITH CONNECTION `PROJECT_ID.LOCATION.CONNECTION_NAME`
OPTIONS (
endpoint = 'ENDPOINT_URL',
user_defined_context = [("mode", "decryption")]
)
Limiting number of rows in a batch request
You can specify max_batching_rows
in OPTIONS
as the maximum number of rows in each HTTP request, to avoidCloud Run functions timeout. If you specify max_batching_rows
, BigQuery determines the number of rows in a batch up to the max_batching_rows
limit. If not specified, BigQuery determines the number of rows to batch automatically.
Use a remote function in a query
Make sure you havegranted the permission on your Cloud Run function, so that it is accessible to BigQuery's service account associated with the connection of the remote function.
You also need to have the permission bigquery.routines.get
on the dataset where the remote function is, and the bigquery.connections.use
permission, which you can get through the BigQuery Connection User
role, on the connection which is used by the remote function.
You can use a remote function in a query just like auser defined function.
For example, you can use the remote_add
function in the example query:
SELECT
val,
`PROJECT_ID.DATASET_ID`.remote_add(val, 2)
FROM
UNNEST([NULL,2,3,5,8]) AS val;
This example produces the following output:
+------+-----+
| val | f0_ |
+------+-----+
| NULL | 2 |
| 2 | 4 |
| 3 | 5 |
| 5 | 7 |
| 8 | 10 |
+------+-----+
Supported regions
There are two types of locations in BigQuery:
- A region is a specific geographic place, such as London.
- A multi-region is a large geographic area, such as the United States, that contains two or more geographic places.
Single regions
In a BigQuery single region dataset, you can only create a remote function that uses a Cloud Run function deployed in the same region. For example:
- A remote function in BigQuery single region
us-east4
can only use a Cloud Run function inus-east4
.
So for single regions, remote functions are only supported in regions that support both Cloud Run functions and BigQuery.
Multi-regions
In a BigQuery multi-region (US
, EU
) dataset, you can only create a remote function that uses a Cloud Run function deployed in a region within the same large geographic area (US, EU). For example:
- A remote function in BigQuery
US
multi-region can only use a Cloud Run function deployed in any single region in the US geographic area, such asus-central1
,us-east4
,us-west2
, etc. - A remote function in BigQuery
EU
multi-region can only use a Cloud Run function deployed in any single region inmember states of the European Union, such aseurope-north1
,europe-west3
, etc.
For more information about BigQuery regions and multi-regions, see the Dataset Locations page. For more information about Cloud Run functions regions, see theCloud Run functions Locations page.
Connections
For either a single-region location or multi-region location, you can only create a remote function in the same location as the connection you use. For example, to create a remote function in the US
multi-region, use a connection located in the US
multi-region.
Pricing
- Standard BigQuery pricing applies.
- In addition, costs may be incurred for Cloud Run functions and Cloud Run by using this feature. Please review theCloud Run functions andCloud Run pricing pages for details.
Using VPC Service Controls
VPC Service Controls is a Google Cloud feature that allows you to set up a secure perimeter to guard against data exfiltration. To use VPC Service Controls with remote functions for additional security, or to use endpoints with internal traffic
ingress settings, follow theVPC Service Controls guide to:
- Create a service perimeter.
- Add the BigQuery project of the query using the remote function into the perimeter.
- Add the endpoint project into the perimeter and set
Cloud Functions API
orCloud Run API
in the restricted services based on your endpoint type. For more details, see Cloud Run functions VPC Service Controls and Cloud Run VPC Service Controls.
Best practices for remote functions
- Prefilter your input: If your input can be easily filtered down before being passed to a remote function, your query will likely be faster and cheaper.
- Keep your Cloud Run function scalable. Scalability is a function ofminimum instances,maximum instances, andconcurrency.
- Where possible, use the default value for your Cloud Run function's maximum number of instances.
- Note that there is no default limit for 1st gen HTTP Cloud Run functions. To avoid unbounded scaling events with 1st gen HTTP Cloud Run functions while testing or in production, we recommendsetting a limit, for example, 3000.
- Follow other Cloud Run function tips for better performance. Remote function queries interacting with a high latency Cloud Run function might fail due to timeout.
- Implement your endpoint to return a right HTTP response code and payload for a failed response.
- To minimize retries from BigQuery, use HTTP response codes other than 408, 429, 500, 503 and 504 for a failed response, and make sure to catch all exceptions in your function code. Otherwise, the HTTP service framework may automatically return 500 for any uncaught exceptions. You might still see retried HTTP requests when BigQuery retries a failed data partition or query.
- Your endpoint should return a JSON payload in the defined format for a failed response. Even not strictly required, it helps BigQuery distinguish whether the failed response is from your function implementation or the infrastructure of Cloud Run functions/Cloud Run. For the latter, BigQuery may retry with a different internal limit.
Quotas
For information about remote function quotas, see Quotas and limits.