GitHub - NVIDIA/Audio2Face-3D-Training-Framework: Audio2Face-3D Training Framework for creating custom neural networks that generate realistic facial animations from audio input

Audio2Face-3D

Audio2Face-3D generates high-fidelity facial animations from an audio source. It produces detailed, realistic articulation, with precise motion for the skin, jaw, tongue, and eyes, to achieve accurate lip-sync and lifelike character expression, including emotions.

Audio2Face-3D Training Framework is the core tool for training high-fidelity facial animation models within the Audio2Face-3D ecosystem. It supports both NVIDIA's prebuilt models and custom models tailored to specific characters, languages, or artistic styles. Training these models requires extensive datasets of synchronized facial animation and corresponding audio, which the framework is designed to leverage efficiently.

Audio2Face and Training Framework

Prerequisites

System Requirements

  1. Operating System: Linux or WSL2 (Ubuntu 22.04 recommended)
  2. Storage: ~1 GB of free space for framework artifacts and the example dataset
  3. Hardware: CUDA-compatible GPU with at least 6 GB VRAM
  4. NVIDIA Driver: Use the following supported range:
    • Linux: 575.57 - 579.x
    • Windows/WSL2: 576.57 - 579.x
    • Check your current version: nvidia-smi
  5. Docker: Required for running the framework
  6. NVIDIA Docker: Required for GPU acceleration
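
The driver ranges above can be checked programmatically. The following is a minimal sketch, not part of the framework: the names `SUPPORTED`, `parse_version`, `driver_supported`, and `query_gpus` are hypothetical, and it assumes that "579.x" means any release in the 579 branch. It relies only on the standard `nvidia-smi --query-gpu` options.

```python
import shutil
import subprocess

# Ranges copied from the list above: (minimum version, highest allowed major branch).
# Assumption: "579.x" means any driver whose major version is at most 579.
SUPPORTED = {
    "linux": ((575, 57), 579),
    "wsl2": ((576, 57), 579),
}


def parse_version(text):
    """Turn a string such as '575.57.08' into (575, 57, 8) for tuple comparison."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_supported(version, platform="linux"):
    """Return True if the driver version lies inside the range listed for the platform."""
    minimum, max_major = SUPPORTED[platform]
    parsed = parse_version(version)
    return parsed[:2] >= minimum and parsed[0] <= max_major


def query_gpus():
    """Return [(driver_version, vram_mib), ...], or [] when nvidia-smi is not on PATH."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    gpus = []
    for line in out.splitlines():
        if line.strip():
            version, vram = (field.strip() for field in line.split(","))
            gpus.append((version, int(vram)))
    return gpus
```

Calling `query_gpus()` on the training machine and passing each reported version to `driver_supported` gives a quick pass/fail check; the reported memory is in MiB and can be compared against the 6 GB VRAM requirement.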

Quick Start

This quick start guide provides a comprehensive walkthrough of the Audio2Face-3D Training Framework.

Using a sample dataset available from Hugging Face, you will learn the complete end-to-end workflow, from initial setup to testing a newly trained model.

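In this guide, you will clone the repository, set up a workspace, configure the environment, download the example dataset, build the Docker container, run an example training, and validate the resulting model.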

Note: If you are not familiar with Linux and are working on a Windows system, please refer to the Detailed Setup Under Windows (WSL2 / Ubuntu) section in the Training Framework page.

1. Clone Repository

Clone the Audio2Face-3D Training Framework repository:

Create audio2face directory and navigate to it

mkdir -p ~/audio2face && cd ~/audio2face

Clone the repository

git clone https://github.com/NVIDIA/Audio2Face-3D-Training-Framework.git

2. Setup Workspace

Create new directories to hold datasets and training files:

Create datasets and workspace directories

mkdir -p ~/audio2face/datasets
mkdir -p ~/audio2face/workspace

3. Configure Environment

Navigate to the repository directory

cd ~/audio2face/Audio2Face-3D-Training-Framework

Copy environment file template

cp .env.example .env

Edit the .env file with your actual paths (use absolute paths):

A2F_DATASETS_ROOT="/home/<username>/audio2face/datasets"
A2F_WORKSPACE_ROOT="/home/<username>/audio2face/workspace"
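
Because the paths must be absolute, a small check before building the container can catch a mistyped `.env` early. This is a hedged sketch rather than anything the framework provides: `parse_env` and `env_problems` are hypothetical helpers, and the parser only handles simple `KEY="VALUE"` lines like the two shown above.

```python
from pathlib import PurePosixPath

# The two variables shown above; the real .env.example may define more.
REQUIRED_KEYS = ("A2F_DATASETS_ROOT", "A2F_WORKSPACE_ROOT")


def parse_env(text):
    """Parse simple KEY="VALUE" lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def env_problems(text):
    """Return a list of problems found; an empty list means the file looks fine."""
    values = parse_env(text)
    problems = []
    for key in REQUIRED_KEYS:
        value = values.get(key, "")
        if not value:
            problems.append(f"{key} is missing or empty")
        elif not PurePosixPath(value).is_absolute():
            problems.append(f"{key} is not an absolute path: {value}")
    return problems
```

For example, `env_problems(open(".env").read())` run from the repository directory returns an empty list when both variables are set to absolute paths. A value such as `~/audio2face/datasets` is reported, since `~` is not expanded inside the file.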

4. Download Example Dataset

We provide the Audio2Face-3D Example Dataset as part of this framework.

  1. Download the dataset:
    • You can download the Claire dataset from: Claire Dataset on Hugging Face
    • It needs to be placed under the A2F_DATASETS_ROOT directory as defined in the environment
    • Authentication: You may need to authenticate with Hugging Face to access the dataset:
      * Using Tokens: Hugging Face Tokens
      * Using SSH Key: Hugging Face SSH Keys
    • Clone the dataset using the following commands:

Navigate to the datasets directory

cd ~/audio2face/datasets

Make sure git LFS is installed

sudo apt-get install -y git-lfs
git lfs install

Clone Claire dataset in the datasets directory using https

git clone https://huggingface.co/datasets/nvidia/Audio2Face-3D-Dataset-v1.0.0-claire

Or alternatively clone Claire dataset in the datasets directory using SSH

git clone git@hf.co:datasets/nvidia/Audio2Face-3D-Dataset-v1.0.0-claire

  2. Verify the dataset structure:
    • After download, your dataset directory should look like this:
/home/<username>/audio2face/datasets/
└── Audio2Face-3D-Dataset-v1.0.0-claire/
    ├── data/
    │   └── claire/
    │       ├── audio/
    │       ├── cache/
    │       └── ...
    ├── docs/
    └── ...
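
The layout above can be verified with a short script instead of by eye. This is a minimal sketch under stated assumptions: `missing_dataset_dirs` is a hypothetical helper, and it checks only the sub-directories named explicitly in the tree, not the elided entries.

```python
from pathlib import Path

# Sub-directories named in the tree above; the elided entries ("...") are not checked.
EXPECTED_DIRS = ("data/claire/audio", "data/claire/cache", "docs")


def missing_dataset_dirs(dataset_root):
    """Return the expected sub-directories that are absent under dataset_root."""
    root = Path(dataset_root)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]
```

Called with the path to the cloned `Audio2Face-3D-Dataset-v1.0.0-claire` directory, it returns an empty list when all three directories are present.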

5. Setup Permissions and Build Docker

Navigate to the repository directory

cd ~/audio2face/Audio2Face-3D-Training-Framework

Add executable permissions

chmod +x docker/*.sh

Build Docker container

./docker/build_docker.sh

Note: In the next steps, all python run_*.py commands automatically execute inside Docker containers with pre-configured dependencies.

6. Run Example Training

Python Note: On Ubuntu, the python command may be named python3. If python is not found, the shell will warn you and suggest the correct command for your installation.

Step 1: Preprocess the Dataset

Run preprocessing with example config

python run_preproc.py example-diffusion claire

Once this process is completed, the log prints the Preproc Run Name Full, for example 250909_135508_example.

This name is important for future steps. It needs to be added to the config_train.py file located in the configs/example-diffusion directory. In this file, you need to locate the following section:

PREPROC_RUN_NAME_FULL = {
    "claire": "XXXXXX_XXXXXX_example",
}

The value needs to be updated with the name that was provided in the shell log from the preproc script. In the example above, it would be updated as follows:

PREPROC_RUN_NAME_FULL = {
    "claire": "250909_135508_example",
}

Note: A new sub-directory is also created in the workspace/output_preproc directory containing the artifacts of the preproc process.
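
If the shell log has scrolled away, the run name can often be recovered from the output directory itself. The sketch below is not part of the framework: `latest_run_name` is a hypothetical helper, and it assumes that the newest sub-directory of workspace/output_preproc is named after the run, which this guide does not state explicitly for preprocessing.

```python
from pathlib import Path


def latest_run_name(output_dir):
    """Return the name of the most recently modified sub-directory, or None if there is none."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    subdirs = [p for p in root.iterdir() if p.is_dir()]
    if not subdirs:
        return None
    # Pick by modification time rather than by name, so no naming scheme is assumed.
    return max(subdirs, key=lambda p: p.stat().st_mtime).name
```

For example, `latest_run_name("/home/<username>/audio2face/workspace/output_preproc")`, with the path adjusted to your A2F_WORKSPACE_ROOT, returns a candidate value to compare against the log before editing config_train.py.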

Step 2: Train

Run training example

python run_train.py example-diffusion

Note: The training process can take some time (between 30 and 40 minutes depending on your hardware). The training log provides guidance on how much time is needed to complete the training.

Once training completes, a new sub-directory is created in the workspace/output_train directory, and its name is printed in the shell log.

You can use this name as <TRAINING_RUN_NAME_FULL> in the next step.

Step 3: Deploy

Run the deploy example

python run_deploy.py example-diffusion

This process creates a new sub-directory in the workspace/output_deploy directory. The name of that directory will be reflected in the shell log.

This new directory contains all the files required to use the trained model for inference.

7. Model Validation and Testing

Once training is complete, validate your custom model using one of the following methods:

**Option 1: Python Inference:** Generate animations in .npy format or Maya cache (.mc) format using the built-in inference engine:

python run_inference.py example-diffusion

**Option 2: Maya-ACE Integration:** Deploy and test your model in a visual production environment using Maya and the Maya-ACE plugin.

The Maya-ACE plugin enables real-time visualization of animation inference. It lets you see a model's output directly on a character within the Autodesk Maya 3D environment, providing immediate visual feedback for testing and validation.

Citation

If you use Audio2Face-3D Training Framework in your research, please cite:

@misc{nvidia2025audio2face3d,
  title={Audio2Face-3D: Audio-driven Realistic Facial Animation For Digital Avatars},
  author={Chaeyeon Chung and Ilya Fedorov and Michael Huang and Aleksey Karmanov and Dmitry Korobchenko and Roger Ribera and Yeongho Seol},
  year={2025},
  eprint={2508.16401},
  archivePrefix={arXiv},
  primaryClass={cs.GR},
  url={https://arxiv.org/abs/2508.16401},
  note={Authors listed in alphabetical order}
}