
Introducing Neuron Runtime 2.x (libnrt.so)#


What are we changing?#

Starting with the Neuron 1.16.0 release, Neuron Runtime 1.x (neuron-rtd) is entering maintenance mode and is being replaced by Neuron Runtime 2.x, a shared library named libnrt.so. For more information on Runtime 1.x, see 10/27/2021 - Neuron Runtime 1.x (neuron-rtd) enters maintenance mode.

Upgrading to libnrt.so simplifies the Neuron installation and upgrade process, introduces new capabilities for allocating NeuronCores to applications, streamlines container creation, and deprecates tools that are no longer needed.

This document describes the capabilities of Neuron Runtime 2.x in detail, provides the information needed for a successful installation or upgrade, and explains how to migrate Neuron applications from Neuron Runtime 1.x (included in releases before Neuron 1.16.0) to Neuron Runtime 2.x (included in Neuron 1.16.0 and newer releases).

Why are we making this change?#

Before Neuron 1.16.0, Neuron Runtime was delivered as a daemon (neuron-rtd) that communicated with Neuron framework extensions through a gRPC interface. neuron-rtd was packaged as an rpm or debian package (aws-neuron-runtime) and required a separate installation step.

Starting with Neuron 1.16.0, Neuron Runtime 2.x is delivered as a shared library (libnrt.so) that is directly linked into the Neuron framework extensions. libnrt.so is packaged and installed as part of the Neuron framework extensions (e.g. TensorFlow Neuron, PyTorch Neuron, or MXNet Neuron) and does not require a separate installation step. Installing Neuron Runtime as part of the framework extensions simplifies installation and improves the user experience. In addition, because libnrt.so is directly linked to the framework extensions, communication between the Neuron Runtime and Neuron frameworks is faster, since the gRPC interface overhead is eliminated.

For more information see How will this change affect the Neuron SDK? and Migrate your application to Neuron Runtime 2.x (libnrt.so).

How will this change affect the Neuron SDK?#

Neuron Driver#

Use the latest Neuron Driver. For a successful installation of, or upgrade to, Neuron 1.16.0 or newer, you must install or upgrade to Neuron Driver (aws-neuron-dkms) version 2.1.5.0 or newer. Neuron applications using Neuron 1.16.0 will fail if they do not detect Neuron Driver version 2.1.5.0 or newer. For installation and upgrade instructions, see install-guide-index.

To see details of Neuron component versions please see Release Content.
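
To confirm which driver version is installed before upgrading, you can query the package manager and the loaded kernel module (package-manager commands differ between Debian/Ubuntu and RPM-based distributions, and the kernel module installed by aws-neuron-dkms is typically named neuron; adjust as needed):

# Debian/Ubuntu: installed aws-neuron-dkms package version
dpkg -l aws-neuron-dkms

# Amazon Linux / RPM-based: installed aws-neuron-dkms package version
rpm -q aws-neuron-dkms

# Version of the neuron kernel module currently loaded
modinfo neuron | grep ^version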

Important

For successful installation or update to Neuron 1.16.0 and newer from previous releases:

Neuron Runtime#

Neuron framework extensions#

Starting from Neuron 1.16.0, Neuron framework extensions (TensorFlow Neuron, PyTorch Neuron, or MXNet Neuron) are packaged together with libnrt.so. Installing Neuron Driver (aws-neuron-dkms) version 2.1.5.0 or newer is required for proper operation. The neuron-rtd daemon installed by previous releases no longer works starting with Neuron 1.16.0.

To see details of Neuron component versions see Release Content.
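
Because libnrt.so now ships inside the framework package, you can verify that the installed framework extension bundles it by searching the Python environment (a minimal check; the exact location of libnrt.so inside site-packages may vary between framework packages and versions):

# Locate the libnrt.so bundled with the installed Neuron framework extension
find "$(python -c 'import sysconfig; print(sysconfig.get_paths()["purelib"])')" -name 'libnrt.so*'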

TensorFlow model server#

Starting from Neuron 1.16.0, the TensorFlow Neuron model server is packaged together with libnrt.so and expects aws-neuron-dkms version 2.1.5.0 or newer for proper operation.

Note

The TensorFlow Neuron model server included in Neuron 1.16.0 runs from the directory in which it was installed and will not run properly if copied to a different location, due to its dependency on libnrt.so.
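
One way to see why relocating the binary breaks it is to inspect how the dynamic linker resolves libnrt.so. The sketch below assumes the model server binary is installed as tensorflow_model_server_neuron on your PATH; adjust the name to match your installation:

# Show how libnrt.so is resolved relative to the install location;
# after copying the binary elsewhere this lookup would no longer succeed.
ldd "$(which tensorflow_model_server_neuron)" | grep libnrt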

Neuron tools#

How will this change affect me?#

Neuron installation and upgrade#

As explained in “How will this change affect the Neuron SDK?”, starting from Neuron 1.16.0, libnrt.so requires the latest Neuron Driver (aws-neuron-dkms). In addition, it is no longer necessary to install aws-neuron-runtime. To install Neuron or to upgrade to the latest Neuron version, follow the installation and upgrade instructions below:
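
A condensed upgrade sketch based on the steps in this document, assuming an Ubuntu instance with the Neuron package repositories already configured (use the yum equivalents on Amazon Linux, substitute the framework package that matches your workload, and see install-guide-index for the authoritative commands):

# 1. Stop and remove the Runtime 1.x daemon; it is no longer needed
sudo systemctl stop neuron-rtd
sudo apt remove -y aws-neuron-runtime

# 2. Install or upgrade the Neuron Driver required by libnrt.so
sudo apt-get update
sudo apt-get install -y aws-neuron-dkms

# 3. Upgrade the framework extension; libnrt.so is installed with it
pip install --upgrade torch-neuron   # or tensorflow-neuron, depending on your framework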

Migrate your application to Neuron Runtime 2.x (libnrt.so)#

For a successful migration from previous releases of your application to Neuron 1.16.0 or newer, make sure you perform the following:

  1. Prerequisite
    Read “How will this change affect the Neuron SDK?”.
  2. Make sure you are not using Neuron Runtime 1.x (aws-neuron-runtime)
    • Remove any code that installs aws-neuron-runtime from any CI/CD scripts.
    • Stop neuron-rtd by running sudo systemctl stop neuron-rtd
    • Uninstall neuron-rtd by running sudo apt remove aws-neuron-runtime or sudo yum remove aws-neuron-runtime
  3. Upgrade to your Neuron Framework of choice:
  4. If you have code that starts and/or stops neuron-rtd
    Remove any code that starts or stops neuron-rtd from any CI/CD scripts.
  5. Application running multiple neuron-rtd
    If your application runs multiple processes and requires running multiple neuron-rtd daemons:
    • Remove the code that runs multiple neuron-rtd daemons.
    • Instead of allocating Neuron devices to neuron-rtd through configuration files, use the NEURON_RT_VISIBLE_CORES or NEURON_RT_NUM_CORES environment variables to allocate NeuronCores (a minimal launch sketch appears after this list). See NeuronX Runtime Configuration for details.
      If your application uses NEURONCORE_GROUP_SIZES, see the next item.
      Note
      The NEURON_RT_VISIBLE_CORES and NEURON_RT_NUM_CORES environment variables enable you to allocate NeuronCores to an application. Allocating at the NeuronCore level gives finer-grained control than allocating whole Neuron devices, because each Neuron device contains multiple NeuronCores.
  6. Application running multiple processes using NEURONCORE_GROUP_SIZES
    • Consider using NEURON_RT_VISIBLE_CORES or NEURON_RT_NUM_CORES environment variables instead of NEURONCORE_GROUP_SIZES, which is being deprecated.
      See NeuronX Runtime Configuration for details.
    • If you are using TensorFlow Neuron (tensorflow-neuron (TF2.x)) and you are replacing NEURONCORE_GROUP_SIZES=AxB which enables auto multicore replication, see the new API TensorFlow 2.x (tensorflow-neuron) Auto Multicore Replication (Beta) for usage and documentation.
    • The behavior of your application will remain the same as before if you do not set NEURON_RT_VISIBLE_CORES and do not set NEURON_RT_NUM_CORES.
    • If you are considering migrating to NEURON_RT_VISIBLE_CORES or NEURON_RT_NUM_CORES:
      * NEURON_RT_VISIBLE_CORES takes precedence over NEURON_RT_NUM_CORES.
      * If you are migrating to NEURON_RT_VISIBLE_CORES:
      > * For TensorFlow applications or PyTorch applications make sure that NEURONCORE_GROUP_SIZES is unset, or that NEURONCORE_GROUP_SIZES allocates the same or smaller number of NeuronCores as allocated by NEURON_RT_VISIBLE_CORES.
      > * For MXNet applications, setting NEURONCORE_GROUP_SIZES and NEURON_RT_VISIBLE_CORES environment variables at the same time is not supported. Use NEURON_RT_VISIBLE_CORES only.
      > * See NeuronX Runtime Configuration for more details on how to use NEURON_RT_VISIBLE_CORES.
      * If you are migrating to NEURON_RT_NUM_CORES:
      > * Make sure that NEURONCORE_GROUP_SIZES is unset.
      > * See NeuronX Runtime Configuration for more details on how to use NEURON_RT_NUM_CORES.
  7. Application running multiple processes accessing the same NeuronCore
    If your application accesses the same NeuronCore from multiple processes, this is no longer possible with libnrt.so. Instead, modify your application to access the same NeuronCore from multiple threads.
    Note
    Optimal performance of multi-model execution is achieved when each NeuronCore executes a single model.
  8. Neuron Tools
  9. Containers
    If your application is running within a container and previously executed neuron-rtd inside that container, rebuild the container so that it no longer includes or installs aws-neuron-runtime. See neuron-containers and containers-migration-to-runtime2 for details.
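
As referenced in steps 5 and 6 above, here is a minimal launch sketch that replaces per-daemon device configuration with per-process NeuronCore allocation (infer.py is a hypothetical inference script and the core indices are examples):

# Give each worker process its own NeuronCore(s) via NEURON_RT_VISIBLE_CORES
NEURON_RT_VISIBLE_CORES=0 python infer.py &
NEURON_RT_VISIBLE_CORES=1 python infer.py &
NEURON_RT_VISIBLE_CORES=2-3 python infer.py &   # a process may also take a range
wait

# Alternatively, request a number of cores and let the runtime pick them
# NEURON_RT_NUM_CORES=2 python infer.py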

Troubleshooting#

Application fails to start#

Description#

Starting with the Neuron 1.16.0 release, Neuron Runtime (libnrt.so) requires Neuron Driver 2.0 or greater (aws-neuron-dkms). Neuron Runtime requires the Neuron Driver (aws-neuron-dkms package) to access Neuron devices.

If aws-neuron-dkms is not installed, the application will fail with an error message on the console and syslog similar to the following:

NRT:nrt_init Unable to determine Neuron Driver version. Please check aws-neuron-dkms package is installed.

If an old aws-neuron-dkms is installed, the application will fail with an error message on the console and syslog similar to the following:

NRT:nrt_init This runtime requires Neuron Driver version 2.0 or greater. Please upgrade aws-neuron-dkms package.

Solution#

Follow the installation steps in install-guide-index to install aws-neuron-dkms.
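
For example, on a host where the Neuron package repositories are already configured (see install-guide-index for repository setup):

# Debian/Ubuntu
sudo apt-get update && sudo apt-get install -y aws-neuron-dkms

# Amazon Linux / RPM-based
sudo yum install -y aws-neuron-dkms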

Application fails to start although I installed the latest aws-neuron-dkms#

Description#

Starting from the Neuron 1.16.0 release, Neuron Runtime (libnrt.so) requires Neuron Driver 2.0 or greater (aws-neuron-dkms). If an old aws-neuron-dkms is installed, the application will fail. You may install aws-neuron-dkms and still see the application fail, because the aws-neuron-dkms installation itself failed while a neuron-rtd daemon was still running.

Solution#
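
A recovery sequence consistent with the description above: stop and remove the daemon that blocked the driver installation, then reinstall the driver (apt shown; use the yum equivalents on RPM-based systems):

sudo systemctl stop neuron-rtd
sudo apt remove -y aws-neuron-runtime
sudo apt-get install --reinstall aws-neuron-dkms

# Confirm the loaded module is now version 2.x
modinfo neuron | grep ^version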

Application unexpected behavior when upgrading to release Neuron 1.16.0 or newer#

Description#

When upgrading to release Neuron 1.16.0 or newer from previous releases, the OS may include two different versions of Neuron Runtime: the libnrt.so shared library and the neuron-rtd daemon. This can happen if the user did not stop the neuron-rtd daemon or did not uninstall the existing Neuron version before the upgrade. In this case the user application may behave unexpectedly.

Solution#

If the OS includes two different versions of Neuron Runtime (the libnrt.so shared library and the neuron-rtd daemon):
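
To check whether both runtimes are present on the host, one quick inspection (package and service names as used throughout this document):

# Is the Runtime 1.x daemon still installed or running?
systemctl is-active neuron-rtd
dpkg -l aws-neuron-runtime 2>/dev/null || rpm -q aws-neuron-runtime

If the Runtime 1.x daemon is found, stop and remove it as described in the migration steps above.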

Application unexpected behavior when downgrading to releases before Neuron 1.16.0 (from Neuron 1.16.0 or newer)#

Description#

When upgrading to Neuron 1.16.0 or newer from previous releases and then downgrading back to a release before Neuron 1.16.0, the OS may include two different versions of Neuron Runtime: the libnrt.so shared library and the neuron-rtd daemon. This can happen if the user did not uninstall the existing Neuron version before the upgrade or downgrade. In this case the user application may behave unexpectedly.

Solution#

If the OS includes two different versions of Neuron Runtime (the libnrt.so shared library and the neuron-rtd daemon):

Neuron Core is in use#

Description#

A NeuronCore cannot be shared between two applications. Once an application has started using a NeuronCore, any other application trying to use the same NeuronCore will fail during runtime initialization with the following message in the console and in syslog:

ERROR NRT:nrt_allocate_neuron_cores NeuronCore(s) not available - Requested:nc1-nc1 Available:0

Solution#

Terminate the process that is using the NeuronCore, then launch the application again.
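
To identify the process holding the NeuronCore, one option (assuming standard Linux tooling and that the Neuron devices appear as /dev/neuron0, /dev/neuron1, and so on) is:

# List processes that have the first Neuron device open; repeat per device
sudo lsof /dev/neuron0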

Frequently Asked Questions (FAQ)#

Do I need to recompile my model to run it with Neuron Runtime 2.x (libnrt.so)?#

No.

Do I need to change my application launch command?#

No.

Can libnrt.so and neuron-rtd co-exist in the same environment?#

Although we recommend upgrading to the latest Neuron release, we understand that for a transition period you may continue using neuron-rtd with old releases. If you are using a Neuron framework (PyTorch, TensorFlow, or MXNet) from releases before Neuron 1.16.0:

Warning

Executing models using a Neuron framework (PyTorch, TensorFlow, or MXNet) from Neuron 1.16.0 and newer in an environment where neuron-rtd is running may cause undefined behavior. Make sure to stop neuron-rtd before executing models using a Neuron framework (PyTorch, TensorFlow, or MXNet) from Neuron 1.16.0 and newer.
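
Before launching a model with a Neuron 1.16.0 or newer framework on such a host, a quick guard like the following avoids the undefined behavior described in the warning:

# Stop the legacy Runtime 1.x daemon if it is still active
if systemctl is-active --quiet neuron-rtd; then
    sudo systemctl stop neuron-rtd
fi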

Are there Neuron framework versions that will not support Neuron Runtime 2.x (libnrt.so)?#

All supported PyTorch Neuron and TensorFlow Neuron framework extensions, as well as the MXNet Neuron 1.8.0 framework extension, support Neuron Runtime 2.x.

Neuron MXNet 1.5.1 does not support Neuron Runtime 2.x (libnrt.so) and has now entered maintenance mode. See 10/27/2021 - Neuron support for Apache MXNet 1.5 enters maintenance mode for details.

This document is relevant for: Inf1