AAR: Official Implementation of Efficient Autoregressive Audio Modeling via Next-Scale Prediction
Updates
- (2024.08.24) Demo released; tokenizers for other datasets will be available in two weeks.
- (2024.08.22) Added SAT and AAR code; a demo will be released soon.
- (2024.08.20) Repo created. Code and checkpoints will be released this week.
Installation
- Install all required packages via
pip3 install -r requirements.txt
Dataset
We download AudioSet from https://research.google.com/audioset/ and organize it as follows:
AudioSet
├── audioset_unbalanced_train_mp3
├── unbalanced_train_segments.csv
└── audioset_eval_raw_mp3
Scale-level audio tokenizer (SAT)
We are currently training a large-scale SAT for music, audio, and speech. We expect the checkpoints to be ready and released in September.
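The idea behind a scale-level tokenizer can be illustrated with a toy residual quantizer: each scale's codebook quantizes whatever the coarser scales left unexplained, so reconstruction error shrinks scale by scale. This is a minimal NumPy sketch that ignores the per-scale down/upsampling of the real SAT; all names here are illustrative, not from the repo:

```python
import numpy as np

def quantize(x, codebook):
    """Nearest-neighbour lookup: snap each row of x to its closest code."""
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook[d.argmin(1)]

def sat_encode(latent, codebooks):
    """Toy scale-level tokenization: each codebook quantizes the residual
    left by the coarser scales, so the running reconstruction improves."""
    residual = latent.copy()
    recon = np.zeros_like(latent)
    errors = []
    for cb in codebooks:
        q = quantize(residual, cb)
        recon, residual = recon + q, residual - q
        errors.append(float((residual ** 2).mean()))
    return recon, errors

rng = np.random.default_rng(0)
latent = rng.normal(size=(8, 4))
# Each codebook keeps a zero code, so a scale can leave a residual untouched;
# this guarantees the error never increases from one scale to the next.
codebooks = [np.vstack([np.zeros((1, 4)), rng.normal(size=(15, 4))])
             for _ in range(4)]
recon, errors = sat_encode(latent, codebooks)
```

Because every codebook contains the zero vector, the chosen code is always at least as close to the residual as zero is, so `errors` is non-increasing across scales.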
Training
python3 train_SAT_mpi.py --config config/train/SAT.yaml --train_dir /path/to/audioset_unbalanced_train_mp3 --train_csv /path/to/csv --batch_size $bs --gpus $gpus --output_dir /path/to/save/ckpt --use_prefetcher True --resume latest
Inference
python3 inference_SAT.py --config config/inference/SAT.yaml --resume /path/to/ckpt.pth --test_dir /path/to/audioset_eval_raw_mp3 --batch_size $bs
Pre-trained model
We provide AudioSet pre-trained SAT checkpoints as follows:
| model | # Scale | # Tokens | latent_dim | FAD | HF weights 🤗 |
|---|---|---|---|---|---|
| SAT | 16 | 455 | 64 | 1.09 | SAT.pth |
| SAT | 16 | 455 | 128 | 1.40 | SAT.pth |
Acoustic autoregressive modeling (AAR)
Training
python3 train_AAR_mpi.py --config config/train/AAR.yaml --train_dir /path/to/audioset_unbalanced_train_mp3 --train_csv /path/to/csv --batch_size $bs --gpus $gpus --output_dir /path/to/save/ckpt --use_prefetcher True --resume latest --vqvae_pretrained_path /path/to/vae/ckpt --latent_dim $latent_dim
Inference
python3 inference_AAR.py --config config/inference/AAR.yaml --aar_pretrained_path /path/to/aar.pth --vqvae_pretrained_path /path/to/vqvae.pth --test_dir /path/to/audioset_eval_raw_mp3 --batch_size $bs --output_dir /path/to/save
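For intuition, AAR decoding can be sketched as next-scale prediction: each step emits an entire scale's tokens in one forward pass, conditioned on all coarser scales, rather than generating one token at a time. A toy sketch with a stand-in predictor (`dummy_predict` is hypothetical; in the real pipeline this role is played by the trained transformer):

```python
import numpy as np

rng = np.random.default_rng(0)

def next_scale_generate(predict, scales):
    """Emit tokens scale by scale: each call to `predict` produces all
    tokens of one scale, conditioned on every coarser scale so far."""
    generated = []  # one token array per scale, coarse to fine
    for num_tokens in scales:
        context = (np.concatenate(generated) if generated
                   else np.empty(0, dtype=int))
        tokens = predict(context, num_tokens)  # one forward pass per scale
        generated.append(tokens)
    return np.concatenate(generated)

def dummy_predict(context, n):
    """Stand-in for the trained AAR model: random token ids."""
    return rng.integers(0, 455, size=n)  # 455 = token budget in the tables

# Four scales with 1, 2, 4, and 8 tokens: 4 steps instead of 15.
seq = next_scale_generate(dummy_predict, scales=[1, 2, 4, 8])
```

The saving is in step count: token-by-token decoding of this sequence would take 15 autoregressive steps, while next-scale prediction takes one step per scale.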
Pre-trained model
We provide the AudioSet pre-trained AAR checkpoint as follows:
| model | # Scale | # Tokens | latent_dim | FAD | HF weights 🤗 |
|---|---|---|---|---|---|
| SAT | 16 | 455 | 128 | 1.40 | SAT.pth |
| AAR | 16 | 455 | 128 | 6.01 | AAR.pth |
Citation
@misc{qiu2024efficient,
title={Efficient Autoregressive Audio Modeling via Next-Scale Prediction},
author={Kai Qiu and Xiang Li and Hao Chen and Jie Sun and Jinglu Wang and Zhe Lin and Marios Savvides and Bhiksha Raj},
year={2024},
eprint={2408.09027},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
