TANGOFLUX: Super Fast and Faithful Text to Audio Generation with Flow Matching and CLAP-Ranked Preference Optimization

Abstract

We introduce TANGOFLUX, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU. A key challenge in aligning TTA models lies in the difficulty of creating preference pairs, as TTA lacks structured mechanisms such as the verifiable rewards or gold-standard answers available for Large Language Models (LLMs). To address this, we propose CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. We demonstrate that the audio preference dataset generated using CRPO outperforms existing alternatives. With this framework, TANGOFLUX achieves state-of-the-art performance across both objective and subjective benchmarks. We open source all code and models to support further research in TTA generation.
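The idea of ranking generated candidates to obtain preference pairs can be illustrated with a minimal sketch. It assumes that, for each prompt, several candidates are scored against the text and the highest- and lowest-scoring ones form the (chosen, rejected) pair. The names `build_preference_pair` and `toy_score` are hypothetical; `toy_score` is a word-overlap stand-in for a CLAP text-audio similarity score, and the candidates are short strings standing in for audio clips. This is not the released TANGOFLUX code.

```python
from typing import Callable, List, Tuple


def build_preference_pair(
    prompt: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
) -> Tuple[str, str]:
    """Return (chosen, rejected): the highest- and lowest-scoring candidates."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    # Sort ascending by score so the last item is the best match.
    ranked = sorted(candidates, key=lambda c: score_fn(prompt, c))
    return ranked[-1], ranked[0]


def toy_score(prompt: str, candidate: str) -> float:
    """Hypothetical stand-in for a CLAP similarity score (word overlap)."""
    prompt_words = set(prompt.lower().split())
    candidate_words = set(candidate.lower().split())
    return len(prompt_words & candidate_words) / max(len(prompt_words), 1)


prompt = "dog barking near a river"
# Strings stand in for generated audio clips in this toy example.
candidates = [
    "dog barking",
    "river flowing",
    "dog barking near a river",
    "car engine",
]
chosen, rejected = build_preference_pair(prompt, candidates, toy_score)
print("chosen:", chosen)
print("rejected:", rejected)
```

In an iterative setup, pairs built this way would be collected over many prompts, used for a round of preference optimization, and then regenerated from the updated model.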

Salient Features

🚀 TANGOFLUX can generate up to 30 seconds of 44.1kHz stereo audio in about 3.7 seconds on a single A40 GPU.

Comparative Samples

Text Description	Stable Audio Open	TANGO 2	AudioLDM2	AudioBox	TANGOFLUX (Ours)
Melodic human whistling harmonizing with natural birdsong	[audio]	[audio]	[audio]	[audio]	[audio]
A basketball bounces rhythmically on a court, shoes squeak against the floor, and a referee’s whistle cuts through the air.	[audio]	[audio]	[audio]	[audio]	[audio]
Dripping water echoes sharply, a distant growl reverberates through the cavern, and soft scraping metal suggests something lurking unseen.	[audio]	[audio]	[audio]	[audio]	[audio]
A train conductor blows a sharp whistle, metal wheels screech on the rails, and passengers murmur while settling into their seats	[audio]	[audio]	[audio]	[audio]	[audio]
A pile of coins spills onto a wooden table with a metallic clatter, followed by the hushed murmur of a tavern crowd and the creak of a swinging door.	[audio]	[audio]	[audio]	[audio]	[audio]
The deep growl of an alligator ripples through the swamp as reeds sway with a soft rustle and a turtle splashes into the murky water.	[audio]	[audio]	[audio]	[audio]	[audio]

Three general trends can be observed in these examples, in agreement with the human evaluators. Compared with the other models, the outputs of TANGOFLUX show (i) a more audible presence of the described events, (ii) better reproduction of the event order, and (iii) higher audio quality.

Resources

1. We share our code on GitHub to open source the training and evaluation of the audio generation model and make comparisons easier.

2. We have released our model checkpoints on HuggingFace for reproducibility.

Acknowledgement

Powered by Stability AI
This website is created based on https://github.com/AudioLDM/AudioLDM.github.io