Health checks overview

Google Cloud offers configurable health checks for Google Cloud load balancer backends, Cloud Service Mesh backends, and application-based autohealing for managed instance groups. This document covers key health checking concepts.

Unless otherwise noted, Google Cloud health checks are implemented by dedicated software tasks that connect to backends according to parameters specified in a health check resource. Each connection attempt is called a _probe_. Google Cloud records the success or failure of each probe.

Based on a configurable number of sequential successful or failed probes, an overall health state is computed for each backend. Backends that respond successfully for the configured number of times are considered _healthy_. Backends that fail to respond successfully for a separately configurable number of times are _unhealthy_.

The overall health state of each backend determines eligibility to receive new requests or connections. You can configure the criteria that define a successful probe. This is discussed in detail in the section How health checks work.

Health checks implemented by dedicated software tasks use special routes that aren't defined in your Virtual Private Cloud (VPC) network. For more information, see Paths for health checks.

Health check categories, protocols, and ports

Health checks have a category and a protocol. The two categories are _health checks_ and _legacy health checks_. Their supported protocols are as follows:

The protocol and port determine how health check probes are done. For example, a health check can use the HTTP protocol on TCP port 80, or it can use the TCP protocol for a named port in an instance group.

You cannot convert a legacy health check to a health check, and you cannot convert a health check to a legacy health check.

Select a health check

Health checks must be compatible with the type of load balancer (or Cloud Service Mesh) and the backend types. The factors to consider when you select a health check are as follows:

The next section describes valid health check selections for each type of load balancer and backend.

Load balancer guide

This table shows the supported health check category and scope for each load balancer type.

| Load balancer | Health check category and scope |
| --- | --- |
| Global external Application Load Balancer; Classic Application Load Balancer *; Global external proxy Network Load Balancer; Classic proxy Network Load Balancer; Cross-region internal Application Load Balancer; Cross-region internal proxy Network Load Balancer | Health check (global) |
| Regional external Application Load Balancer; Regional internal Application Load Balancer; Regional internal proxy Network Load Balancer; Regional external proxy Network Load Balancer | Health check (regional) |
| Regional external passthrough Network Load Balancer | Backend service-based load balancer: Health check (regional). Target pool-based load balancer: Legacy health check (global with the HTTP protocol). |
| Internal passthrough Network Load Balancer | Health check (global or regional) |

*For external Application Load Balancers, legacy health checks are not recommended but are sometimes supported, depending on the load balancer mode.

| Load balancer mode | Legacy health checks supported |
| --- | --- |
| Global external Application Load Balancer; Classic Application Load Balancer | Yes, if both of the following are true: the backends are instance groups, and the backend VMs serve traffic that uses the HTTP or HTTPS protocol. |
| Regional external Application Load Balancer | No |

Additional usage notes

Health checking with Cloud Service Mesh

Note the following differences in behavior when you're using health checks with Cloud Service Mesh.

How health checks work

The following sections describe how health checks work.

Probes

When you create a health check or a legacy health check, you specify the following flags or accept their default values. Each health check or legacy health check that you create is implemented by multiple probes. These flags control how frequently _each_ probe evaluates instances in instance groups or endpoints in zonal NEGs.

A health check's settings cannot be configured on a per-backend basis. Health checks are associated with an entire backend service. For a target pool-based regional external passthrough Network Load Balancer, a legacy HTTP health check is associated with the entire target pool. Thus, the parameters for the probe are the same for all backends referenced by a given backend service or target pool.

| Configuration flag | Purpose | Default value |
| --- | --- | --- |
| Check interval `check-interval` | The check interval is the amount of time from the start of one probe _issued by one prober_ to the start of the next probe _issued by the same prober_. Units are seconds. | 5s (5 seconds) |
| Timeout `timeout` | The timeout is the amount of time that Google Cloud waits for a response to a probe. Its value must be less than or equal to the check interval. Units are seconds. | 5s (5 seconds) |
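
The constraint between the two flags can be sketched as a small validation helper. This is only an illustration: the `ProbeTiming` class and its field names are hypothetical and are not part of any Google Cloud API, and the requirement that both values be positive is an assumption added for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeTiming:
    """Hypothetical container for the two timing flags (not a Google Cloud API)."""

    check_interval_sec: int = 5  # documented default: 5 seconds
    timeout_sec: int = 5         # documented default: 5 seconds

    def __post_init__(self) -> None:
        # Assumption for this sketch: both values are positive numbers of seconds.
        if self.check_interval_sec <= 0 or self.timeout_sec <= 0:
            raise ValueError("check interval and timeout must be positive")
        # Documented rule: the timeout must be less than or equal to the interval.
        if self.timeout_sec > self.check_interval_sec:
            raise ValueError("timeout must be less than or equal to the check interval")
```

For example, `ProbeTiming(30, 5)` is accepted, while `ProbeTiming(5, 10)` raises `ValueError`.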

Probe IP ranges and firewall rules

For health checks to work, you must create ingress allow firewall rules so that traffic from Google Cloud probers can connect to your backends. For instructions, see Create required firewall rules.

The following table shows the source IP ranges to allow for each load balancer:

| Product | Health check probe source IP ranges |
| --- | --- |
| Global external Application Load Balancer; Global external proxy Network Load Balancer | For IPv4 traffic to the backends: `35.191.0.0/16`, `130.211.0.0/22`. For IPv6 traffic to the backends: `2600:2d00:1:1::/64`, `2600:2d00:1:b029::/64`. |
| Regional external Application Load Balancer (1, 2); Cross-region internal Application Load Balancer (1); Regional internal Application Load Balancer (1, 2); Regional external proxy Network Load Balancer (1, 2); Regional internal proxy Network Load Balancer (1, 2); Cross-region internal proxy Network Load Balancer (1) | For IPv4 traffic to the backends: `35.191.0.0/16`, `130.211.0.0/22`. For IPv6 traffic to the backends: `2600:2d00:1:b029::/64`. |
| Classic proxy Network Load Balancer; Classic Application Load Balancer; Cloud Service Mesh, except for internet NEG backends and hybrid NEG backends | For IPv4 traffic to the backends: `35.191.0.0/16`, `130.211.0.0/22`. |
| Regional external passthrough Network Load Balancer (3) | For IPv4 traffic to the backends: `35.191.0.0/16`, `209.85.152.0/22`, `209.85.204.0/22`. For IPv6 traffic to the backends: `2600:1901:8001::/48`. |
| Internal passthrough Network Load Balancer | For IPv4 traffic to the backends: `35.191.0.0/16`, `130.211.0.0/22`. For IPv6 traffic to the backends: `2600:2d00:1:b029::/64`. |
| Cloud Service Mesh with internet NEG backends and hybrid NEG backends | IP addresses of the VMs running the Envoy software. For a sample configuration, see the Cloud Service Mesh documentation. |

(1) Allowing traffic from Google's health check probe ranges isn't required for hybrid NEGs. However, if you're using a combination of hybrid and zonal NEGs in a single backend service, you need to allow traffic from the Google health check probe ranges for the zonal NEGs.

(2) For regional internet NEGs, health checks are optional. Traffic from load balancers using regional internet NEGs originates from the proxy-only subnet and is then NAT-translated (by using Cloud NAT) to either manually or automatically allocated NAT IP addresses. This traffic includes both health check probes and user requests from the load balancer to the backends. For details, see Regional NEGs: Use a Cloud NAT gateway.

(3) Target pool-based regional external passthrough Network Load Balancers support only IPv4 traffic and might proxy health checks through the metadata server. In this case, health check packet sources match the IP address of the metadata server: `169.254.169.254`. You don't have to create firewall rules to permit traffic from the metadata server. Packets from the metadata server are always allowed.
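
When reading backend access logs, it can help to test whether a source address falls inside the probe ranges listed above. The following is a minimal sketch that uses the ranges given for an internal passthrough Network Load Balancer; the names `PROBE_RANGES` and `is_probe_source` are hypothetical, and the ranges would need to be swapped for the row that matches your load balancer.

```python
import ipaddress

# Ranges copied from the table above (internal passthrough Network Load Balancer row).
PROBE_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("35.191.0.0/16", "130.211.0.0/22", "2600:2d00:1:b029::/64")
]


def is_probe_source(address: str) -> bool:
    """Return True if the address lies inside one of the listed probe ranges."""
    ip = ipaddress.ip_address(address)
    # Compare only against networks of the same IP version.
    return any(ip.version == net.version and ip in net for net in PROBE_RANGES)
```

With these ranges, `is_probe_source("35.191.10.1")` is true and `is_probe_source("10.0.0.1")` is false.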

Security considerations for probe IP ranges

Consider the following information when planning health checks and the necessary firewall rules:

Importance of firewall rules

Google Cloud requires that you create the necessary ingress allow firewall rules to permit traffic from probers to your backends:

If you don't have ingress allow firewall rules that permit the health check, the implied deny ingress rule blocks inbound traffic. When probers can't contact your backends, the load balancer considers your backends to be unhealthy.

Multiple probes and frequency

Google Cloud sends health check probes from multiple redundant systems called probers. Probers use specific source IP ranges. Google Cloud does not rely on just one prober to implement a health check—multiple probers simultaneously evaluate the instances in instance group backends or the endpoints in zonal NEG backends. If one prober fails, Google Cloud continues to track backend health states.

The interval and timeout settings that you configure for a health check are applied to each prober. For a given backend, software access logs and `tcpdump` show more frequent probes than your configured settings.

This is expected behavior, and you cannot configure the number of probers that Google Cloud uses for health checks. However, you can estimate the effect of multiple simultaneous probes by considering the following factors.
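
As a rough illustration of such an estimate, the sketch below multiplies a guessed number of probers by the per-prober probe rate. The prober count is purely hypothetical, because Google Cloud doesn't expose it, and the `forwarding_rule_count` multiplier applies only to passthrough Network Load Balancers, which receive probes on each forwarding rule's IP address.

```python
def estimated_probes_per_second(
    check_interval_sec: float,
    prober_count: int,
    forwarding_rule_count: int = 1,
) -> float:
    """Estimate the probe rate that one backend observes.

    prober_count is a guess (the real number isn't configurable or published).
    forwarding_rule_count matters only for passthrough Network Load Balancers.
    """
    if check_interval_sec <= 0:
        raise ValueError("check interval must be positive")
    # Each prober issues one probe per interval, per probed destination address.
    return prober_count * forwarding_rule_count / check_interval_sec
```

Under the assumption of three probers and a 5-second interval, a backend would see roughly 0.6 probes per second instead of the 0.2 that the interval alone suggests.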

Destination for probe packets

The following table shows the network interface and destination IP addresses to which health check probers send packets, depending on the type of load balancer.

For regional external passthrough Network Load Balancers and internal passthrough Network Load Balancers, the application must bind to the load balancer's IP address (or any IP address 0.0.0.0).

| Load balancer | Destination network interface | Destination IP address |
| --- | --- | --- |
| Global external Application Load Balancer; Global external proxy Network Load Balancer | For instance group backends, the primary network interface (`nic0`). For zonal NEG backends with `GCE_VM_IP_PORT` endpoints, the network interface in the VPC network that's associated with the NEG. For zonal NEG backends with `NON_GCP_PRIVATE_IP_PORT` endpoints, the endpoint must represent an interface of an on-premises resource that's reachable by way of a route in the VPC network associated with the NEG and in the region that contains the NEG. | For instance group backends, the primary internal IPv4 or IPv6 address associated with the primary network interface (`nic0`) of each instance. For zonal NEG backends with `GCE_VM_IP_PORT` endpoints, the IP address of the endpoint: either a primary internal IPv4 or IPv6 address of the network interface or an internal IPv4 or IPv6 address from an alias IP range of the network interface. For zonal NEG backends with `NON_GCP_PRIVATE_IP_PORT` endpoints, the IP address of the endpoint. |
| Classic Application Load Balancer; Regional external Application Load Balancer; Cross-region internal Application Load Balancer; Regional internal Application Load Balancer; Classic proxy Network Load Balancer; Regional external proxy Network Load Balancer; Cross-region internal proxy Network Load Balancer; Regional internal proxy Network Load Balancer; Cloud Service Mesh | For instance group backends, the primary network interface (`nic0`). For zonal NEG backends with `GCE_VM_IP_PORT` endpoints, the network interface in the VPC network that's associated with the NEG. For zonal NEG backends with `NON_GCP_PRIVATE_IP_PORT` endpoints, the endpoint must represent an interface of an on-premises resource that's reachable by way of a route in the VPC network associated with the NEG and in the region that contains the NEG. | For instance group backends, the primary internal IPv4 address associated with the primary network interface (`nic0`) of each instance. For zonal NEG backends with `GCE_VM_IP_PORT` endpoints, the IP address of the endpoint: either a primary internal IPv4 address of the network interface or an internal IPv4 address from an alias IP range of the network interface. For zonal NEG backends with `NON_GCP_PRIVATE_IP_PORT` endpoints, the IP address of the endpoint. |
| Regional external passthrough Network Load Balancer | Primary network interface (`nic0`). | The IP address of the external forwarding rule. If multiple forwarding rules point to the same backend service (for target pool-based regional external passthrough Network Load Balancers, the same target pool), Google Cloud sends probes to each forwarding rule's IP address. This can result in an increase in the number of probes. |
| Internal passthrough Network Load Balancer | For both instance group backends and zonal NEG backends with `GCE_VM_IP` endpoints, the network interface used depends on how the backend service is configured. For details, see Backend services and network interfaces. | The IP address of the internal forwarding rule. If multiple forwarding rules point to the same backend service, Google Cloud sends probes to each forwarding rule's IP address. This can result in an increase in the number of probes. |

Success criteria for HTTP, HTTPS, and HTTP/2

HTTP, HTTPS, and HTTP/2 health checks always require an HTTP 200 (OK) response code to be received before the health check timeout. All other HTTP response codes, including redirect response codes like 301 and 302, are considered unhealthy.

In addition to requiring an HTTP 200 (OK) response code, you can:

The following table lists valid combinations of request path and response flags that are available for HTTP, HTTPS, and HTTP/2 health checks.

| Configuration flags | Prober behavior | Success criteria |
| --- | --- | --- |
| Neither `--request-path` nor `--response` specified | The prober uses `/` as the request path. | HTTP 200 (OK) response code only. |
| Both `--request-path` and `--response` specified | The prober uses the configured request path. | HTTP 200 (OK) response code and up to the first 1,024 ASCII characters of the HTTP response body must match the expected response string. |
| Only `--response` specified | The prober uses `/` as the request path. | HTTP 200 (OK) response code and up to the first 1,024 ASCII characters of the HTTP response body must match the expected response string. |
| Only `--request-path` specified | The prober uses the configured request path. | HTTP 200 (OK) response code only. |
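
The success criteria in the table can be sketched as a small predicate. This is an illustration rather than the prober's actual logic: the function name is hypothetical, the timeout is ignored, and "match" is assumed to mean that the first 1,024 characters of the body begin with the expected string, which the text above doesn't spell out.

```python
from typing import Optional

MAX_BODY_CHARS = 1024  # only the first 1,024 characters of the body are considered


def http_probe_succeeded(
    status_code: int,
    body: str,
    expected_response: Optional[str] = None,
) -> bool:
    """Evaluate one HTTP, HTTPS, or HTTP/2 probe result (timeout not modeled)."""
    # Any status other than 200, including redirects such as 301 and 302, fails.
    if status_code != 200:
        return False
    # Without an expected response string, HTTP 200 (OK) alone is sufficient.
    if expected_response is None:
        return True
    # Assumption: a prefix comparison against the truncated body.
    return body[:MAX_BODY_CHARS].startswith(expected_response)
```

For example, a `302` redirect to a working page still counts as a failed probe, so the configured request path should return `200` directly.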

Success criteria for SSL and TCP

TCP and SSL health checks have the following base success criteria:

The following table lists valid combinations of request and response flags that are available for TCP and SSL health checks. Both request and response flags must consist only of single-byte, printable ASCII characters, each string being no more than 1,024 characters long.

| Configuration flags | Prober behavior | Success criteria |
| --- | --- | --- |
| Neither `--request` nor `--response` specified | The prober doesn't send any request string. | Base success criteria only. |
| Both `--request` and `--response` specified | The prober sends the configured request string. | Base success criteria and the response string received by the prober must exactly match the expected response string. |
| Only `--response` specified | The prober doesn't send any request string. | Base success criteria and the response string received by the prober must exactly match the expected response string. |
| Only `--request` specified | The prober sends the configured request string. | Base success criteria only (the response string is not checked). |
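
The flag constraints and the table's success criteria can be sketched as follows. Both function names are hypothetical. The sketch assumes that "printable ASCII" means the code points 0x20 through 0x7E, and it takes the outcome of the base success criteria as an input instead of modeling the connection itself.

```python
from typing import Optional

MAX_FLAG_CHARS = 1024  # documented limit for the request and response strings


def is_valid_probe_string(value: str) -> bool:
    """Check the documented limits for a --request or --response value."""
    # Assumption: printable ASCII is the range 0x20 (space) through 0x7E (~).
    return len(value) <= MAX_FLAG_CHARS and all(0x20 <= ord(ch) <= 0x7E for ch in value)


def tcp_probe_succeeded(
    base_criteria_met: bool,
    received: str,
    expected_response: Optional[str] = None,
) -> bool:
    """Evaluate one TCP or SSL probe according to the table above."""
    if not base_criteria_met:
        return False
    # Without --response, whatever the backend sends back is not checked.
    if expected_response is None:
        return True
    # With --response, the received string must match exactly.
    return received == expected_response
```

For example, with `--request` set to `PING` and `--response` set to `PONG`, a backend that answers `PONG!` fails the probe because the match is exact.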

Success criteria for gRPC

gRPC health checks are used only with gRPC applications, Google Cloud load balancers, and Cloud Service Mesh. Google Cloud supports two types of gRPC health checks:

If you are using gRPC health checks (with or without TLS), make sure that the gRPC service sends the RPC response with the status OK and the status field set to SERVING or NOT_SERVING accordingly.

For more information, see the following:

Success criteria for legacy health checks

If the response received by the legacy health check probe is HTTP 200 OK, the probe is considered successful. All other HTTP response codes, including a redirect (301, 302), are considered unhealthy.

Health state

Google Cloud uses the following healthy and unhealthy threshold configuration flags to determine the overall health state of each backend to which traffic is load balanced.

| Configuration flag | Purpose | Default value |
| --- | --- | --- |
| Healthy threshold `healthy-threshold` | The healthy threshold specifies the number of sequential successful probe results for a _previously unhealthy backend_ to be considered healthy. Previously unhealthy backends can become healthy if they are able to meet the healthy threshold again. Google Cloud considers backends to be healthy after this healthy threshold has been met. Healthy backends are eligible to receive new connections. Newly added backends might be considered healthy after a single successful probe. | A threshold of 2 probes. |
| Unhealthy threshold `unhealthy-threshold` | The unhealthy threshold specifies the number of sequential failed probe results for a _previously healthy backend_ to be considered unhealthy. Google Cloud considers backends to be unhealthy when the unhealthy threshold has been met. Unhealthy backends are not eligible to receive new connections; however, existing connections are *not* immediately terminated. Instead, the connection remains open until a timeout occurs or until traffic is dropped. | A threshold of 2 probes. |
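
The way the two thresholds drive state transitions can be sketched as a small state machine. This models the view of a single prober only; the `BackendHealth` class is hypothetical, the initial state of unhealthy is an assumption, and the special case for newly added backends is not modeled.

```python
class BackendHealth:
    """Track one backend's health state from a sequence of probe results."""

    def __init__(self, healthy_threshold: int = 2, unhealthy_threshold: int = 2,
                 healthy: bool = False) -> None:
        self.healthy_threshold = healthy_threshold      # default: 2 probes
        self.unhealthy_threshold = unhealthy_threshold  # default: 2 probes
        self.healthy = healthy  # assumption: start as unhealthy
        self._successes = 0     # current run of sequential successes
        self._failures = 0      # current run of sequential failures

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if probe_ok:
            self._successes += 1
            self._failures = 0  # a success breaks any run of failures
            if not self.healthy and self._successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0  # a failure breaks any run of successes
            if self.healthy and self._failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy
```

With the default thresholds, a single failed probe between successes doesn't change the state, because the run of failures is reset by the next success before it reaches the unhealthy threshold.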

The specific behavior when all backends are unhealthy differs depending on the type of load balancer that you're using:

| Load balancer | Behavior when all backends are unhealthy |
| --- | --- |
| Classic Application Load Balancer | Returns an HTTP `502` status code to clients when all backends are unhealthy. |
| Global external Application Load Balancer; Cross-region internal Application Load Balancer; Regional external Application Load Balancer; Regional internal Application Load Balancer | Returns an HTTP `503` status code to clients when all backends are unhealthy. |
| Proxy Network Load Balancers | Terminates new client TCP connections when all backends are unhealthy. |
| Internal passthrough Network Load Balancer; Backend service-based regional external passthrough Network Load Balancers | Distributes new connections according to failover configuration, backend weight, and backend health. For details, see Traffic distribution for internal passthrough Network Load Balancers and Traffic distribution for backend service-based regional external passthrough Network Load Balancers. |
| Target pool-based regional external passthrough Network Load Balancers | Distributes traffic to all backend VMs as a last resort when all backends are unhealthy. |

Additional notes

The following sections include some more notes about using health checks on Google Cloud.

Certificates and health checks

Google Cloud health check probers don't perform certificate validation, even for protocols that require that your backends use certificates (SSL, HTTPS, and HTTP/2)—for example:

Health checks that use any protocol, but not legacy health checks, allow you to set a proxy header by using the --proxy-header flag.

Health checks that use HTTP, HTTPS, or HTTP/2 protocols and legacy health checks allow you to specify an HTTP Host header by using the --host flag.

If you're using any custom request headers, note that the load balancer adds these headers only to the client requests, not to the health check probes. If your backend requires a specific header for authorization that is missing from the health check packet, the health check might fail.

Example health check

Suppose you set up an HTTP health check with a 30-second check interval, a 5-second timeout, and healthy and unhealthy thresholds of 2 probes each.

With these settings, the health check behaves as follows:

  1. Multiple redundant systems are simultaneously configured with the health check parameters. Interval and timeout settings are applied to each system. For more information, see Multiple probes and frequency.
  2. Each health check prober does the following:
    1. Initiates an HTTP connection from one of the source IP addresses to the backend instance every 30 seconds.
    2. Waits up to five seconds for an HTTP 200 (OK) status code (the success criteria for HTTP, HTTPS, and HTTP/2 protocols).
  3. A backend is considered unhealthy when at least one health check probe system does the following:
    1. Does not receive an HTTP 200 (OK) response code for two consecutive probes. For example, the connection might be refused, or there might be a connection or socket timeout.
    2. Receives two consecutive responses that don't match the protocol-specific success criteria.
  4. A backend is considered healthy when at least one health check probe system receives two consecutive responses that match the protocol-specific success criteria.

In this example, each prober initiates a connection every 30 seconds. Thirty seconds elapse between the starts of a prober's connection attempts, whether or not a connection timed out. In other words, the timeout must always be less than or equal to the interval, and the timeout never increases the interval.

In this example, each prober's timing looks like the following, in seconds:

  1. t=0: Start probe A.
  2. t=5: Stop probe A.
  3. t=30: Start probe B.
  4. t=35: Stop probe B.
  5. t=60: Start probe C.
  6. t=65: Stop probe C.
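
The timeline above can be reproduced with a short sketch. The `probe_schedule` function is hypothetical; it treats each probe's stop time as the timeout deadline, which is the latest moment a prober waits before moving on.

```python
from typing import List, Tuple


def probe_schedule(check_interval_sec: int, timeout_sec: int,
                   count: int) -> List[Tuple[int, int]]:
    """Return (start, latest stop) times in seconds for one prober's probes."""
    if timeout_sec > check_interval_sec:
        raise ValueError("timeout must be less than or equal to the check interval")
    # Probes start one interval apart; the timeout never delays the next start.
    return [
        (i * check_interval_sec, i * check_interval_sec + timeout_sec)
        for i in range(count)
    ]


# One prober with a 30-second interval and a 5-second timeout (probes A, B, C).
print(probe_schedule(30, 5, 3))  # [(0, 5), (30, 35), (60, 65)]
```

Because several probers run this schedule independently, the backend sees overlapping copies of the same pattern.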

What's next