
Onboarding models to run on NxD Inference#

This guide covers how to onboard a model to get it to run on NxD Inference for the first time. To learn more about how to optimize a model on Neuron, see the NxD Inference Features Configuration Guide.


Overview#

This guide demonstrates how to adapt an existing PyTorch model to run on Neuron with the NeuronX Distributed (NxD) Inference library. At a high level, you will do the following:

  1. Define configuration classes. NxD Inference models include a NeuronConfig, which defines Neuron-specific configuration parameters, and an InferenceConfig, which defines model configuration parameters. When you adapt a model that works with HuggingFace, InferenceConfig is analogous to PretrainedConfig.
  2. Define model classes. When you define model classes, you replace linear layers with parallel layers that are optimized for distributed inference on Neuron. NxD Inference also provides modules for attention, KV cache management, and more, which you can use to write model classes that work with Neuron. Model classes are compiled to run effectively on Neuron.
  3. Define application heads. Application heads orchestrate passing inputs to the correct compiled model. Application heads also provide the interface to compile and load the model.
  4. Convert weights to a supported format. NxD Inference supports safetensors and pickle formats.

1. Define a NeuronConfig class#

Define a Neuron configuration class, which extends NeuronConfig. NeuronConfig includes Neuron-specific configuration parameters. In the config class for your model, you can define any additional Neuron-specific configuration parameters that your model requires.

from neuronx_distributed_inference.models.config import NeuronConfig

class NeuronLlamaConfig(NeuronConfig):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Set any additional Neuron-specific args/defaults here.
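If your model needs extra Neuron-specific options, you can accept them as keyword arguments and store them on the config before calling the base initializer. The following is a minimal sketch; the attribute name is hypothetical:

class NeuronLlamaConfig(NeuronConfig):
    def __init__(self, use_custom_rope: bool = False, **kwargs):
        # Hypothetical Neuron-specific flag, shown for illustration only.
        self.use_custom_rope = use_custom_rope
        super().__init__(**kwargs)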

2. Define an InferenceConfig class#

Define an inference configuration class, which extends InferenceConfig. InferenceConfig includes model parameters, such as those from a HuggingFace PretrainedConfig (like LlamaConfig). When users initialize your config, they can provide required attributes directly, or they can populate the config from a HuggingFace PretrainedConfig. You can also override get_required_attributes to enforce that certain attributes are present.

from typing import List, Type

from neuronx_distributed_inference.models.config import InferenceConfig, NeuronConfig

class LlamaInferenceConfig(InferenceConfig):
    def get_required_attributes(self) -> List[str]:
        return [
            "hidden_size",
            "num_attention_heads",
            "num_hidden_layers",
            "num_key_value_heads",
            "pad_token_id",
            "vocab_size",
            "max_position_embeddings",
            "rope_theta",
            "rms_norm_eps",
            "hidden_act",
        ]

    @classmethod
    def get_neuron_config_cls(cls) -> Type[NeuronConfig]:
        return NeuronLlamaConfig
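With both classes defined, a config is typically built from a NeuronConfig plus the model attributes, which you can pass directly or load from a HuggingFace checkpoint. The following sketch assumes the load_pretrained_config helper from neuronx_distributed_inference.utils.hf_adapter; the model path is illustrative:

from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config

model_path = "/home/ubuntu/models/Llama-3.1-8B"  # illustrative path

neuron_config = NeuronLlamaConfig(tp_degree=32, batch_size=1)
config = LlamaInferenceConfig(
    neuron_config,
    # Populates hidden_size, vocab_size, and the other required attributes
    # from the config.json in model_path.
    load_config=load_pretrained_config(model_path),
)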

3. Define a Neuron model#

Define a Neuron model. This class is a subclass of NeuronBaseModel, which is a PyTorch module.

  1. In this class, you provide implementations for setup_attr_for_model(self, config) and init_model(self, config).
    1. In setup_attr_for_model, set values for the following attributes. You can set these attributes from values in config and config.neuron_config.
      1. self.on_device_sampling
      2. self.tp_degree
      3. self.hidden_size
      4. self.num_attention_heads
      5. self.num_key_value_heads
      6. self.max_batch_size
      7. self.buckets
    2. In init_model, initialize the modules that make up the model.
      1. For attention modules, extend NeuronAttentionBase, which provides a group query attention (GQA) implementation adapted to Neuron.
      2. Replace linear layers (such as in attention and MLP) with Neuron parallel layers (RowParallelLinear and ColumnParallelLinear).
        1. For more information about RowParallelLinear and ColumnParallelLinear layers, see Tensor Parallelism Overview.
      3. Replace embeddings with Neuron parallel embeddings (ParallelEmbedding).
      4. Replace any other modules that require Neuron-specific implementations.

Note: This example demonstrates a simplified version of NeuronLlamaModel from the NxD Inference model hub.

from typing import Optional, Tuple

import torch
from torch import nn
from transformers.activations import ACT2FN

from neuronx_distributed.parallel_layers import parallel_state
from neuronx_distributed.parallel_layers.layers import (
    ColumnParallelLinear,
    ParallelEmbedding,
    RowParallelLinear,
)

from neuronx_distributed_inference.models.config import InferenceConfig
from neuronx_distributed_inference.models.model_base import NeuronBaseModel
from neuronx_distributed_inference.modules.attention.attention_base import NeuronAttentionBase
from neuronx_distributed_inference.modules.attention.utils import RotaryEmbedding
from neuronx_distributed_inference.modules.custom_calls import CustomRMSNorm

class NeuronLlamaMLP(nn.Module):
    """
    Replaces the linear layers (gate_proj, up_proj, and down_proj) with
    column and row parallel layers.
    """

    def __init__(self, config: InferenceConfig):
        super().__init__()
        self.config = config
        self.neuron_config = config.neuron_config
        self.tp_degree = config.neuron_config.tp_degree
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size
        self.act_fn = ACT2FN[config.hidden_act]

        self.gate_proj = ColumnParallelLinear(
            self.hidden_size,
            self.intermediate_size,
            bias=False,
            gather_output=False,
            dtype=config.neuron_config.torch_dtype,
            pad=True,
        )
        self.up_proj = ColumnParallelLinear(
            self.hidden_size,
            self.intermediate_size,
            bias=False,
            gather_output=False,
            dtype=config.neuron_config.torch_dtype,
            pad=True,
        )
        self.down_proj = RowParallelLinear(
            self.intermediate_size,
            self.hidden_size,
            bias=False,
            input_is_parallel=True,
            dtype=config.neuron_config.torch_dtype,
            pad=True,
        )

    def forward(self, x):
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))

class NeuronLlamaAttention(NeuronAttentionBase):
    """
    Compared with LlamaAttention, this class:
    1. replaces q_proj, k_proj, and v_proj with column parallel layers
    2. replaces o_proj with a row parallel layer
    3. updates self.num_head to be self.num_head / tp_degree
    4. updates self.num_key_value_heads to be self.num_key_value_heads / tp_degree
    5. updates the forward() method to adjust to the changes from self.num_head
    """

    def __init__(self, config: InferenceConfig):
        super().__init__()

        self.config = config
        self.neuron_config = config.neuron_config
        self.hidden_size = config.hidden_size
        self.num_attention_heads = config.num_attention_heads
        self.num_key_value_heads = config.num_key_value_heads
        self.head_dim = self.hidden_size // self.num_attention_heads
        self.max_position_embeddings = config.max_position_embeddings
        self.rope_theta = config.rope_theta
        self.padding_side = config.neuron_config.padding_side
        self.torch_dtype = config.neuron_config.torch_dtype

        self.tp_degree = parallel_state.get_tensor_model_parallel_size()

        self.fused_qkv = config.neuron_config.fused_qkv
        self.clip_qkv = None

        self.init_gqa_properties()
        self.init_rope()

    def init_rope(self):
        self.rotary_emb = RotaryEmbedding(
            self.head_dim,
            max_position_embeddings=self.max_position_embeddings,
            base=self.rope_theta,
        )

class NeuronLlamaDecoderLayer(nn.Module):
    """
    Replaces the attention and MLP modules with the NxD versions.
    """

    def __init__(self, config: InferenceConfig):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.self_attn = NeuronLlamaAttention(config)
        self.mlp = NeuronLlamaMLP(config)
        self.input_layernorm = CustomRMSNorm(
            config.hidden_size,
            eps=config.rms_norm_eps,
        )
        self.post_attention_layernorm = CustomRMSNorm(
            config.hidden_size,
            eps=config.rms_norm_eps,
        )

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        **kwargs,
    ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)

        # Self Attention
        attn_outs = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_value,
            **kwargs,
        )

        hidden_states, present_key_value = attn_outs
        hidden_states = residual + hidden_states

        # Fully Connected
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states

        return (hidden_states, present_key_value)

class NeuronLlamaModel(NeuronBaseModel):
    """
    The Neuron version of LlamaModel.
    """

    def setup_attr_for_model(self, config: InferenceConfig):
        # Needed for init_inference_optimization()
        self.on_device_sampling = config.neuron_config.on_device_sampling_config is not None
        self.tp_degree = config.neuron_config.tp_degree
        self.hidden_size = config.hidden_size
        self.num_attention_heads = config.num_attention_heads
        self.num_key_value_heads = config.num_key_value_heads
        self.max_batch_size = config.neuron_config.max_batch_size
        self.buckets = config.neuron_config.buckets

    def init_model(self, config: InferenceConfig):
        self.padding_idx = config.pad_token_id
        self.vocab_size = config.vocab_size

        self.embed_tokens = ParallelEmbedding(
            config.vocab_size,
            config.hidden_size,
            self.padding_idx,
            dtype=config.neuron_config.torch_dtype,
            shard_across_embedding=True,
            # We shard across the embedding dimension because this stops XLA from
            # introducing rank-specific constant parameters into the HLO. We could
            # shard across vocab, but that would require non-SPMD parallel_model_trace.
            pad=True,
        )
        self.lm_head = ColumnParallelLinear(
            config.hidden_size,
            config.vocab_size,
            bias=False,
            pad=True,
        )

        self.layers = nn.ModuleList(
            [NeuronLlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)]
        )
        self.norm = CustomRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

4. Define an application/task head#

Define an application/task head. Applications include causal LM, classification, and so on. This class extends a task-specific Neuron application head class (such as NeuronBaseForCausalLM) or the general NeuronApplicationHead class.

  1. In this class, you provide a value for _model_cls, which is the Neuron model class you defined.
  2. You can also override any other functions as needed for your model, such as get_compiler_args(self) or convert_hf_to_neuron_state_dict(model_state_dict, neuron_config).

Note: This example demonstrates a simplified version of NeuronLlamaForCausalLM from the NxD Inference model hub.

from neuronx_distributed_inference.models.model_base import NeuronBaseForCausalLM

class NeuronLlamaForCausalLM(NeuronBaseForCausalLM):
    _model_cls = NeuronLlamaModel

    @classmethod
    def get_config_cls(cls):
        return LlamaInferenceConfig
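The application head provides the interface to compile and load the model. A typical flow looks like the following sketch, assuming your class inherits compile() and load() from NeuronBaseForCausalLM; the paths are illustrative:

model_path = "/home/ubuntu/models/Llama-3.1-8B"                  # HuggingFace-style checkpoint
compiled_model_path = "/home/ubuntu/traced_models/Llama-3.1-8B"  # where compiled artifacts go

model = NeuronLlamaForCausalLM(model_path, config)
model.compile(compiled_model_path)  # trace and compile the model for Neuron
model.load(compiled_model_path)     # load the compiled artifacts onto Neuron devices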

NxD Inference offers asynchronous runtime support (async_mode) as an alternative execution method that runs NEFFs in parallel with CPU logic. To evaluate whether your task can use async_mode, answer the following questions:

  1. Does your task repeatedly execute a model for a single user request? If not, async_mode won’t offer any benefits.
    • Example: The autoregressive loops used in LLMs repeatedly execute a model for a given prompt, which can benefit from async mode.
  2. Does the output of one execution get passed to the next execution without manipulation? If not, async_mode is incompatible.
    • NOTE: It might be possible to address this by moving some manipulation logic into the NEFF.
    • Example: For LLMs that use on-device sampling, the token generated as output is passed directly as input to the next step of the autoregressive loop. Without on-device sampling, the sampling logic relies on logits as output, which is a data-dependent compute pattern that is incompatible with async mode.
  3. Is there sufficient CPU logic that is independent of the previous outputs? If not, async_mode likely won’t offer major benefits.
    • Example: In production workloads, this is typically server overhead (scheduling, logging, and so on), but it could also be pre/post-processing steps in the model execution pipeline.

Based on the answers above, set async_mode accordingly and configure it so that it works correctly with your application.
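If your workload fits this pattern, async execution is enabled through the Neuron config. The following is a hedged sketch, assuming async_mode is a NeuronConfig flag, that pairs it with on-device sampling so each step’s output token feeds directly into the next step:

from neuronx_distributed_inference.models.config import OnDeviceSamplingConfig

neuron_config = NeuronLlamaConfig(
    tp_degree=32,
    async_mode=True,  # overlap NEFF execution with independent CPU logic
    # On-device sampling avoids data-dependent CPU logic between steps,
    # which async execution requires.
    on_device_sampling_config=OnDeviceSamplingConfig(),
)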

5. Convert weights to a supported format#

NxD Inference supports weights stored in the model path in the following formats:

Format        Sharded   File name
Safetensors   No        model.safetensors
Safetensors   Yes       model.safetensors.index.json
Pickle        No        pytorch_model.bin
Pickle        Yes       pytorch_model.bin.index.json

If your weights are in another format, you must convert them to one of these formats before you can compile and load the model on Neuron.
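For example, a single unsharded pickle checkpoint can be rewritten as safetensors with the safetensors library. A minimal sketch; the file names follow the table above:

import torch
from safetensors.torch import save_file

# Load the pickle-format weights onto CPU.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")

# Clone to break shared storage, which the safetensors format rejects,
# and make each tensor contiguous.
state_dict = {k: v.clone().contiguous() for k, v in state_dict.items()}

save_file(state_dict, "model.safetensors")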

Integrating Onboarded Model with vLLM#

After you onboard a model to NxD Inference using the steps outlined in this guide, you can follow these steps to run that model through vLLM.

  1. Register your application/task head class in vLLM within _NEURON_SUPPORTED_MODELS (see vllm/model_executor/model_loader/neuronx_distributed.py).
  2. Use a local directory that contains the model weights and a config.json file compatible with your model’s InferenceConfig class as model_name_or_path within vLLM.
  3. Pass in any custom NeuronConfig attributes by using the override_neuron_config attribute when you initialize the vLLM engine.
  4. Run offline/online inference to get the model working with vLLM.
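Putting these steps together, offline inference might look like the following sketch, assuming a vLLM build with Neuron support (device="neuron"); the override contents and paths are illustrative:

from vllm import LLM, SamplingParams

llm = LLM(
    model="/home/ubuntu/models/Llama-3.1-8B",  # local dir with weights and config.json
    device="neuron",
    tensor_parallel_size=32,
    override_neuron_config={"enable_bucketing": False},  # illustrative override
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(top_k=1))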

Evaluating Models on Neuron#

NxD Inference provides tools that you can use to evaluate the accuracy and performance of the models that you onboard to Neuron.

Logit Matching#

The logit matching evaluation tool verifies that output logits are within certain tolerances of expected logits. With this evaluation tool, NxD Inference runs generation on the Neuron device. Then, it compares the output logits against expected logits, which you can provide or generate with the HuggingFace model on CPU.

During logit validation, if the output tokens diverge, then this process runs generation on Neuron again, using the tokens up to the point where it diverged. This process is performed repeatedly each time the output diverges, until the entire output matches. This process uses greedy sampling to choose the most likely token at each index.

Once all tokens match, this process compares the logits generated on Neuron with the expected logits. If all logits are within expected tolerances, this accuracy check passes. Divergence difference tolerance is used to compare the logits at the token that diverges. Absolute and relative tolerance are used to compare the values of the logits for the top k highest scoring tokens. For best results, use a lower relative tolerance for smaller k values, and a higher relative tolerance for larger k values. A top k of None means to compare logits for all possible tokens at each index.

Logit matching uses a set of default tolerances, which you can customize.


Example (check_accuracy_logits API)#

from neuronx_distributed_inference.utils.accuracy import check_accuracy_logits

# Init Neuron model, HuggingFace tokenizer, and HuggingFace generation config.

check_accuracy_logits(
    model,
    tokenizer,
    generation_config,
)
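You can customize tolerances through keyword arguments. The parameter names in the following sketch are assumptions based on recent NxD Inference versions; check the check_accuracy_logits signature in your installed version:

check_accuracy_logits(
    model,
    tokenizer,
    generation_config,
    divergence_difference_tol=0.001,  # assumed name: tolerance at the divergent token
    # Assumed name and shape: maps top-k values to (relative, absolute) tolerances,
    # with looser relative tolerances for larger k, as described above.
    tol_map={5: (0.02, 1e-5), 1000: (0.05, 1e-5)},
)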

Token Matching#

The token matching evaluation tool verifies that output tokens match expected tokens. With this evaluation tool, NxD Inference runs generation on the Neuron device. Then, it compares the output against expected tokens, which you can provide or generate with the HuggingFace model on CPU. If all tokens match, this accuracy check passes.

Example (check_accuracy API)#

from neuronx_distributed_inference.utils.accuracy import check_accuracy

# Init Neuron model, HuggingFace tokenizer, and HuggingFace generation config.

check_accuracy(
    model,
    tokenizer,
    generation_config,
)

Benchmarking#

NxD Inference provides a benchmarking tool that evaluates the latency and throughput of a Neuron model and its sub-models (context encoding, token generation, etc.).

Example (benchmark_sampling API)#

from neuronx_distributed_inference.utils.benchmark import benchmark_sampling

# Init Neuron model and HuggingFace generation config.

benchmark_sampling(model, generation_config)

Example benchmarking result#

{ "e2e_model": { "latency_ms_p50": 28890.24031162262, "latency_ms_p90": 28977.734088897705, "latency_ms_p95": 28983.17071199417, "latency_ms_p99": 29032.21325159073, "latency_ms_p100": 29044.473886489868, "latency_ms_avg": 28879.499554634094, "throughput": 283.66142510545984 }, "context_encoding_model": { "latency_ms_p50": 705.0175666809082, "latency_ms_p90": 705.3698301315308, "latency_ms_p95": 705.6618571281433, "latency_ms_p99": 705.8443236351013, "latency_ms_p100": 705.8899402618408, "latency_ms_avg": 705.0377488136292, "throughput": 5809.618005408024 }, "token_generation_model": { "latency_ms_p50": 27.20165252685547, "latency_ms_p90": 27.295589447021484, "latency_ms_p95": 27.324914932250977, "latency_ms_p99": 27.655515670776367, "latency_ms_p100": 32.74345397949219, "latency_ms_avg": 27.19622969277793, "throughput": 147.22298324644066 } }

Profiling Models#

Neuron provides a profiling tool, neuron-profile, which you can use to analyze the performance of a compiled Neuron model. For more information, see Neuron Profile User Guide.

Evaluating Models with the Inference Demo Script#

NxD Inference provides an inference_demo console script, which you can run from the environment where you install neuronx_distributed_inference.

Note: Before you can use a custom model with the inference_demo, you must add it to the MODEL_TYPES dictionary in inference_demo.py.

The script provides arguments that you can use to configure these evaluation tools.

Debugging Models on Neuron#

When you debug models on Neuron, you can enable debug logging to view information about inputs and outputs of the NeuronBaseForCausalLM forward function, which calls the NeuronBaseModel’s forward function.

import logging

logging.getLogger().setLevel(logging.DEBUG)

Because the forward function of NeuronBaseModel is compiled, you cannot use log/print statements to debug code that is called from this forward function (or any other compiled code).

Debugging Neuron modeling code on CPU isn’t yet supported.

Writing Tests on Neuron#

NxD Inference provides tools to help you write unit and integration tests that validate your model works as expected. For more information, see Testing modeling code with NxD Inference.