Live migration process during maintenance events

During a planned maintenance event for the underlying hardware of a virtual machine (VM) instance or bare metal instance, the host server is unavailable. To keep an instance running during a host event, Compute Engine performs a live migration of the instance to another host server in the same zone. For more information about host events, see About host events.

Live migration lets Google Cloud perform maintenance without interrupting a workload, rebooting an instance, or modifying any of the instance's properties, such as IP addresses, metadata, block storage data, application state, or network settings.

Live migration keeps instances running during the following situations:

Compute Engine performs a live migration only for VMs that have the host maintenance policy set to migrate. For information about how to change the host maintenance policy, see Set VM host maintenance policy.

Live migration process and Local SSD disks

Compute Engine can live migrate instances with Local SSD disks attached (excluding Z3 instances with more than 18 TiB of attached Titanium SSD). Compute Engine moves the compute instances along with their Local SSD data to a new host server in advance of any planned maintenance.

Limitations

Live migration is not supported for the following VM types:

How does the live migration process work?

When a VM is scheduled to live migrate, Compute Engine provides a notification so that you can prepare your workloads and applications for the disruption. During live migration, Google Cloud keeps the disruption time to a minimum, typically much less than 1 second. If a VM is not set to live migrate, Compute Engine terminates the VM during host maintenance. VMs that are set to terminate during a host event stop and (optionally) restart.
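As an illustration of how a workload might react to the notification, the following minimal sketch reads the maintenance-event key from the metadata server and maps its value to an action. The action names (prepare_for_migration and so on) and the function names are hypothetical and chosen for this example. The request only succeeds from inside a Compute Engine instance, so the fetch function is injectable for testing elsewhere.

```python
import urllib.request

# Metadata server path for the maintenance-event key.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/maintenance-event"
)


def fetch_maintenance_event(timeout=5.0):
    """Read the current maintenance-event value from the metadata server.

    Only reachable from inside a Compute Engine instance.
    """
    request = urllib.request.Request(
        METADATA_URL, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def classify_event(value):
    """Map a maintenance-event value to a hypothetical workload action."""
    if value == "MIGRATE_ON_HOST_MAINTENANCE":
        return "prepare_for_migration"  # e.g. flush buffers, pause batch work
    if value == "TERMINATE_ON_HOST_MAINTENANCE":
        return "prepare_for_shutdown"   # e.g. checkpoint and drain connections
    return "no_action"                  # "NONE" or any unrecognized value


def check_once(fetch=fetch_maintenance_event):
    """Fetch the event once and return the action the workload should take."""
    return classify_event(fetch())
```

Outside of Compute Engine, you can exercise the logic with a stand-in fetcher, for example check_once(fetch=lambda: "NONE").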

When Google Cloud migrates a running VM from one host to another, it moves the complete state of the VM from the source to the destination in a way that is transparent to the guest OS and anything communicating with it. There are many components involved in making this work seamlessly, but the high-level steps are shown in the following illustration:

Figure: Migrating a VM and each of its resources to a new host system without requiring the guest operating system to restart.

Live migration components

The process begins with a notification that a VM needs to be moved from its current host machine. The notification might start with a file change indicating that a new BIOS version is available, a hardware operation scheduling maintenance, or an automatic signal from an impending hardware failure.

Google Cloud's cluster management software constantly watches for these events and schedules them based on policies that control the data centers, such as capacity utilization rates and the number of VMs that a single customer can migrate at once.
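To make the idea of a per-customer migration limit concrete, here is a toy sketch of a scheduler that starts pending migrations only while a customer stays under a cap. It is not Google Cloud's scheduler. The function name, the data shapes, and the single-limit policy are all assumptions made for illustration.

```python
from collections import Counter


def schedule_migrations(pending, in_flight, per_customer_limit):
    """Decide which pending migrations may start now (hypothetical policy).

    pending: list of (vm_name, customer) tuples in priority order.
    in_flight: list of customer names, one entry per running migration.
    per_customer_limit: maximum simultaneous migrations per customer.
    Returns (started, deferred) as two lists of VM names.
    """
    active = Counter(in_flight)  # running migrations per customer
    started, deferred = [], []
    for vm_name, customer in pending:
        if active[customer] < per_customer_limit:
            active[customer] += 1
            started.append(vm_name)
        else:
            deferred.append(vm_name)  # retry in a later scheduling pass
    return started, deferred
```

A real scheduler would weigh further signals that the text mentions, such as capacity utilization. This sketch models only the concurrency cap.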

After a VM is selected for migration, Google Cloud provides a notification to the guest that a migration is happening soon. After a waiting period, a target host is selected and the host is asked to set up a new, empty "target" VM to receive the migrating "source" VM. Authentication is used to establish a connection between the source and the target.

There are three stages involved in the VM's migration:

  1. Source brownout. The VM is still executing on the source, while most state is sent from the source to the target. For example, Google Cloud copies all the guest memory to the target, while tracking the pages that have been changed on the source. The time spent in source brownout is a function of the size of the guest memory and the rate at which pages are being changed.
  2. Blackout. A very brief moment when the VM is not running anywhere. The source VM is paused and all the remaining state required to begin running the VM on the target is sent. The VM enters the blackout stage when sending state changes during the source brownout stage reaches a point of diminishing returns. An algorithm balances the number of bytes of memory being sent against the rate at which the guest VM is making changes.
    During blackout events, the system clock appears to jump forward, up to 5 seconds. If a blackout event exceeds 5 seconds, Google Cloud stops and synchronizes the clock using a daemon that is included as part of the VM guest packages.
  3. Target brownout. The VM executes on the target VM. The source VM is present and might provide support for the target VM. For example, until the network fabric has caught up with the new location of the target VM, the source VM provides forwarding services for packets to and from the target VM.
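The relationship between memory size, page-change rate, and the switch from source brownout to blackout can be sketched with a generic iterative pre-copy simulation. This is a textbook-style model, not the algorithm Google Cloud uses. The parameter names, the blackout_pages threshold, and the min_gain cutoff are assumptions for illustration.

```python
def simulate_precopy(memory_pages, dirty_rate, send_rate,
                     blackout_pages=5_000, min_gain=0.1, max_rounds=30):
    """Simulate source brownout rounds; return (rounds, pages_left_for_blackout).

    memory_pages: total guest memory pages to copy.
    dirty_rate: pages the guest changes per second (assumed constant).
    send_rate: pages the source can send per second (must be > 0).
    blackout_pages: remaining set small enough to send while the VM is paused.
    min_gain: minimum fractional shrink per round before giving up on pre-copy.
    """
    if memory_pages <= 0:
        return 0, 0
    remaining = memory_pages  # the first round sends all guest memory
    rounds = 0
    while rounds < max_rounds:
        rounds += 1
        # Sending `remaining` pages takes remaining / send_rate seconds,
        # during which the guest dirties this many pages.
        dirtied = min(memory_pages, remaining * dirty_rate // send_rate)
        gain = (remaining - dirtied) / remaining
        remaining = dirtied
        # Enter blackout once the leftover is small, or once another round
        # would no longer shrink it meaningfully (diminishing returns).
        if remaining == 0 or remaining <= blackout_pages or gain < min_gain:
            break
    return rounds, remaining
```

Under these assumptions, simulate_precopy(1_000_000, 50_000, 500_000) returns (3, 1000): three rounds leave a small set for a short blackout. With a guest that changes pages almost as fast as they are sent, simulate_precopy(1_000_000, 480_000, 500_000) returns (1, 960000), which illustrates why a high page-change rate leaves more state to transfer during blackout.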

Finally, the migration is complete and the system deletes the source VM. You can see that the migration took place in the Cloud Logging logs for your VM.

Live migration of sole-tenant VMs

As your workload runs, you might want to move VMs to a different sole-tenant node or node group. If you move a VM to a group of nodes, Compute Engine determines which node to place it on. For information about sole-tenancy, see Sole-tenancy overview.

To move sole-tenant VMs to a different node or node group, you can manually initiate a live migration. You can also manually initiate a live migration to move a VM on a multi-tenant host into a sole-tenant node. For more information, see Manually live migrate VMs.

What's next