External functions best practices | Snowflake Documentation (original) (raw)

This topic documents best practices that will improve efficiency and prevent unexpected results that could occur if a remote service is not designed to be compatible with Snowflake.

You can find additional best practices in the following documents:

Use a remote service’s batch API if available

Some remote services offer both batch mode and single-row mode. If the queries that use an external function are expected to send multiple rows, then Snowflake recommends using the batch mode of the remote service to improve performance.

This rule does not necessarily apply if:

Process one row at a time

To minimize networking overhead, Snowflake typically batches rows to send to remote services. The number of batches and the size of each batch can vary.

In addition, the order of batches can vary, and the order of rows within a batch can vary. Even if the query contains an ORDER BY clause, the ORDER BY is usually applied after the external function(s) have been called.

Because batch size and row order are not guaranteed, writing a function that returns a value for a row that depends upon any other row in this batch or previous batches can produce non-deterministic results.

Snowflake strongly recommends that the remote service process each row independently. The return value for each input row should depend on only that input row, not on other input rows. (Currently, external functions do not support window functions, for example.)

Note also that because batch size is not guaranteed, counting batches is not meaningful.

See also Ensure your external function is stateless.

Do not assume that the remote service is passed each row exactly once

If Snowflake calls a remote service, and the remote service receives the request and returns a result, but Snowflake does not receive the result due to a temporary network problem, Snowflake might repeat the request. If Snowflake retries, the remote service might see the same row twice (or more).

This can cause unexpected effects. For example, because the remote service might get called more than once for the same value, a remote service that assigns unique IDs might have gaps in the sequence of those IDs. In some cases, such effects can be reduced by tracking the batch ID in the sf-external-function-query-batch-id field of the request header to determine whether a particular batch of rows has been processed previously. When Snowflake retries a request for a specific batch, Snowflake uses the same batch ID as it used earlier for the same batch.

Snowflake retries when it receives the following errors:

Requests are retried until a total retry timeout is reached. The total retry timeout is not user-configurable. Snowflake might adjust this limit in the future.

When the total retry timeout is reached without a successful retry, the query fails.

If your external function call times out when the remote service is working, and all the elements between Snowflake and the remote service seem to be working, you can try a smaller batch size to see if that reduces the timeout errors.

To learn how to set the maximum batch size, see CREATE EXTERNAL FUNCTION.

Ensure your external function is stateless

In general, an external function (including the remote service) should avoid storing state information, both:

If the remote service changes state information and then uses that information to affect future outputs, the function might return different values than expected.

For example, consider a simple remote service that contains an internal counter and returns the number of rows received since the remote service first started. If there is a temporary network problem, and Snowflake repeats a request with the same data, the remote service will count the re-sent rows twice (or more).

For an example involving external state, see Avoid side-effects.

In the rare cases where a function is not stateless, the documentation for callers should say clearly that the function is not stateless, and the function should be marked volatile.

If a remote service handles requestsasynchronously, then the remote service author must write the remote service to store and manage some state temporarily. For example, the remote service must store the HTTP POST request’s batch ID so that if an HTTP GET is received with the same batch ID, the remote service can return HTTP code 202 when the specified batch is still being processed.

Note that a query can be aborted for various reasons, which means that there is no guarantee that a final GET will arrive after the remote service has finished generating a result. Remote services that store state for asynchronous requests should eventually time out and clean up that internal state. The optimal timeout might change in the future, but currently Snowflake recommends preserving information about asynchronous requests for at least 10 minutes and preferably 12 hours before deleting it.

Avoid side-effects

An external function (including the remote service) should avoid side effects, such as changing external state (information stored outside the remote service).

For example, if the remote service reports out-of-range values to a government agency, that is a side effect.

Side-effects can be useful, but the side-effects of calling an external function are not always predictable. For example, suppose that you call a remote service that analyzes an anonymized health record and returns a diagnosis. Suppose also that if the diagnosis is that the patient has a contagious disease, then the diagnosis is reported to an agency that keeps count of the number of cases of that disease. This is a useful side effect. However, it is vulnerable to problems such as:

There are also situations in which rows could be undercounted rather than overcounted.

In the very rare cases where a function has side effects, the documentation for callers should say clearly what the side effects are, and the function should be marked volatile.

Categorize your function as volatile or immutable

Functions can be categorized as volatile or immutable. (The CREATE EXTERNAL FUNCTIONstatement allows the user to specify whether the function is volatile or immutable.)

For an external function to be considered immutable, it should meet the following criteria:

If a function meets these two criteria, then Snowflake can use certain types of optimizations to reduce the number of rows or batches sent to the remote service. (These optimizations might evolve over time, and are not described in detail here.)

Snowflake cannot detect or enforce immutability, or factors that affect immutability (for example, side effects). The writer of a remote service should document whether the remote service meets the criteria to be labeled immutable. If a remote service has side effects, then the external function that calls that remote service should be marked volatile, even if the function call returns the same output value for the same input value. If you are not certain that a remote service is immutable, then any external function that calls that remote service should be labeled volatile.

Account for timeout errors

An external function call involves Snowflake, a remote service, a proxy service, and potentially other elements in the chain. None of these elements know how long a particular function call should take, so none know exactly when to stop waiting and return a timeout error. Each step might have its own independent timeout. For more information about timeouts and retries, see Account for timeout errors and retries.

Minimize latency

To minimize latency and improve performance of external function calls, Snowflake recommends doing the following when practical:

Develop and test external functions one step at a time

Snowflake recommends that you test without Snowflake before testing with Snowflake.

During the early stages of developing an external function, use the cloud platform proxy service console (e.g. the Amazon API Gateway console) and remote service development console (e.g. the AWS Lambda console) to help develop and test the proxy service and remote service.

For example, if you have developed a Lambda function, you might want to test it extensively through the Lambda console before testing it by calling it from Snowflake.

Testing through the proxy service console and remote service console usually has the following advantages:

Of course, after you’ve tested the remote service and the proxy service as much as you can without Snowflake, you should test them with Snowflake. The advantages of testing with Snowflake include:

Consider the following test cases:

Make your remote service asynchronous

If you are writing a remote service, and if your remote service might not return results within the expected timeout, then consider making your remote serviceasynchronous. For details, see Asynchronous vs. Synchronous remote services.

Ensure that arguments to the external function correspond to arguments parsed by the remote service

When passing arguments to or from an external function, ensure that the data types are appropriate. If the value sent can’t fit into the data type being received, the value might be truncated or corrupted, or the remote service call might fail.

For example, because some Snowflake SQL numeric data types can store larger values than commonly-used JavaScript data types, de-serializing large numbers from JSON is particularly sensitive in JavaScript.

If you change the number, data types, or order of the arguments to the remote service, remember to make the corresponding changes to the external function. Currently, the ALTER FUNCTION command does not have an option to change parameters, so you must drop and re-create the external function to change the arguments.