
NxD Inference API Reference#

NeuronX Distributed (NxD) Inference (neuronx-distributed-inference) is an open-source, PyTorch-based inference library that simplifies deep learning model deployment on AWS Inferentia and Trainium instances. NxD Inference includes a model hub and modules that you can reference to implement your own models on Neuron.

This API guide describes API and configuration functions and parameters that you can use when you directly interact with the NxD Inference library.

Note

NxD Inference also supports integration with vLLM. When you use vLLM, you can use the override_neuron_config attribute to override defaults using the NeuronConfig parameters described in this API guide. For more information about vLLM integration, see vLLM User Guide for NxD Inference.


Configuration#

NxD Inference defines configuration objects that enable you to control how a model is compiled and used for inference. When you compile a model, its configuration is serialized to a JSON file in the compiled checkpoint, so you can distribute the compiled checkpoint to additional Neuron instances without needing to compile on each instance.

NxD Inference supports loading HuggingFace model checkpoints and configurations. When you run a model from a HuggingFace checkpoint, NxD Inference loads the model configuration from the model’s PretrainedConfig.

NeuronConfig#

NeuronConfig contains compile-time configuration options for inference on Neuron.

Initialization#

Pass the NeuronConfig attributes as keyword args.
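For example, a minimal sketch (the import path and the specific attribute names shown, such as tp_degree, batch_size, and seq_len, are illustrative; check the Attributes list below for the set supported by your version):

from neuronx_distributed_inference.models.config import NeuronConfig

# A minimal sketch: compile-time options are passed as keyword args.
neuron_config = NeuronConfig(
    tp_degree=32,   # tensor parallelism degree
    batch_size=1,
    seq_len=2048,
)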

Functions#

Attributes#

InferenceConfig#

InferenceConfig contains a NeuronConfig and model configuration attributes.

Initialization#

You can pass attributes through keyword args, or provide a load_config hook that is called during initialization to load the configuration attributes.

InferenceConfig is compatible with HuggingFace transformers. To use a model from HuggingFace transformers, you can populate an InferenceConfig with the attributes from the model’s PretrainedConfig, which is stored in config.json in the model checkpoint.

from neuronx_distributed_inference.models.llama import (
    LlamaInferenceConfig,
    LlamaNeuronConfig
)
from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config

model_path = "/home/ubuntu/models/Meta-Llama-3.1-8B"

neuron_config = LlamaNeuronConfig()
config = LlamaInferenceConfig(
    neuron_config,
    load_config=load_pretrained_config(model_path),
)

Attributes#

An InferenceConfig includes neuron_config and any other attributes that you set during initialization.

InferenceConfig also supports an attribute map, which lets you configure additional names or aliases for attributes. When you get or set an attribute by an alias, you retrieve or modify the value of the original attribute. When you initialize an InferenceConfig from a HuggingFace PretrainedConfig, it automatically inherits the attribute map from that PretrainedConfig.
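As a purely illustrative sketch (the alias name and the way the map is assigned here are assumptions, not the library's documented setup), an alias reads and writes through to the original attribute:

# Hypothetical alias: "n_heads" maps to the real attribute "num_attention_heads".
config.attribute_map = {"n_heads": "num_attention_heads"}

config.num_attention_heads = 32
print(config.n_heads)              # reads through the alias -> 32

config.n_heads = 64                # writes through the alias
print(config.num_attention_heads)  # -> 64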

Functions#

MoENeuronConfig#

A NeuronConfig subclass for mixture-of-experts (MoE) models. This config includes attributes specific to MoE models. MoE model configurations, such as DbrxNeuronConfig, are subclasses of MoENeuronConfig.

Initialization#

Pass the attributes as keyword args.

Functions#

Attributes#

FusedSpecNeuronConfig#

A configuration for a model that uses fused speculation, which is a speculative decoding feature where the target and draft models are compiled into a combined model to improve performance. For more information, see Fused Speculation.

Attributes#

Generation#

HuggingFaceGenerationAdapter#

NxD Inference supports running inference with the HuggingFace generate API. To use HuggingFace-style generation, create a HuggingFaceGenerationAdapter that wraps a Neuron application model. Then, you can call generate on the adapted model.

generation_model = HuggingFaceGenerationAdapter(neuron_model)
outputs = generation_model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    generation_config=generation_config
)
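The inputs and generation_config in this example can come from standard HuggingFace tooling. A hedged sketch of that setup, and of decoding the token ids returned by generate (assuming model_path points to the original HuggingFace checkpoint and neuron_model is a compiled and loaded Neuron application model):

from transformers import AutoTokenizer, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained(model_path)
inputs = tokenizer("Hello, my name is", return_tensors="pt")

generation_config = GenerationConfig.from_pretrained(model_path)
generation_config.max_new_tokens = 64

# After calling generate() as shown above, decode the returned token ids.
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))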

Models#

NxD Inference provides a model hub with production-ready models. You can use these existing models to run inference, or use them as reference implementations when you develop your own models on Neuron. All models inherit from base classes that provide a basic set of functionality common to all models.

NeuronApplicationBase#

NeuronApplicationBase is the base class for all application models, including NeuronBaseForCausalLM. NeuronApplicationBase provides functions to compile and load models. This class extends torch.nn.Module. Application models are the entry point to running inference with NxD Inference. You can extend this class to define new application models that implement use cases in addition to causal LM.
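A hedged end-to-end sketch of that compile-and-load flow (the NeuronLlamaForCausalLM class name, its import path, and the exact compile/load signatures are assumptions based on the Llama model in the model hub; paths are placeholders):

from neuronx_distributed_inference.models.llama import NeuronLlamaForCausalLM

# A sketch, assuming `model_path` and `config` are the checkpoint path and
# LlamaInferenceConfig built in the configuration example above.
compiled_model_path = "/home/ubuntu/compiled/Meta-Llama-3.1-8B"

neuron_model = NeuronLlamaForCausalLM(model_path, config)
neuron_model.compile(compiled_model_path)  # trace the model and save compiled artifacts
neuron_model.load(compiled_model_path)     # load the compiled checkpoint onto Neuron devices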

Attributes#

Functions#

NeuronBaseForCausalLM#

NeuronBaseForCausalLM is the base application class that you use to generate text with causal language models. This class extends NeuronApplicationBase. You can extend this class to run text generation in custom models.

Attributes#

Functions#

NeuronBaseModel#

NeuronBaseModel is the base class for all models. This class extends torch.nn.Module. In instances of NeuronBaseModel, you define the modules, such as attention, MLP, and decoder layers, that make up a model. You can extend this class to define custom decoder models.

Attributes#

Functions#

ModelWrapper#

Wraps a model to prepare it for compilation. Neuron applications, such as NeuronBaseForCausalLM, use this class to prepare a model for compilation. ModelWrapper defines the inputs to use when tracing the model during compilation.

To define a custom model with additional model inputs, you can extend ModelWrapper and override the input_generator function, which defines the inputs for tracing.
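A hedged sketch of that pattern (the input_generator signature and its return type are assumptions; only the override pattern itself comes from the description above):

import torch

class MyModelWrapper(ModelWrapper):
    def input_generator(self):
        # Start from the inputs the base wrapper would trace with (assumed to
        # be a list of tuples of tensors, one tuple per traced bucket) ...
        base_inputs = super().input_generator()
        # ... and append one extra input tensor per bucket, for example a
        # per-request scaling factor consumed by the custom model's forward().
        inputs = []
        for tensors in base_inputs:
            extra = torch.zeros(self.neuron_config.batch_size, dtype=torch.float32)
            inputs.append((*tensors, extra))
        return inputs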

Functions#