Riffusion’s AI generates music from text using visual sonograms

After generating the sonogram image, Riffusion uses Torchaudio to convert it into audio for playback.

A sonogram represents time, frequency, and amplitude in a two-dimensional image. Credit: Riffusion
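As a rough illustration of how those three quantities end up in one image, the sketch below builds a magnitude spectrogram with NumPy: columns are time frames, rows are frequency bins, and each cell's value is amplitude. The window size, hop length, sample rate, and the `magnitude_spectrogram` helper are assumptions chosen for the example, not values taken from Riffusion.

```python
import numpy as np

# Illustrative settings, not Riffusion's actual configuration.
SAMPLE_RATE = 8_000
N_FFT = 512
HOP = 128


def magnitude_spectrogram(signal: np.ndarray) -> np.ndarray:
    """Return a (frequency, time) array of amplitudes via a short-time Fourier transform."""
    window = np.hanning(N_FFT)
    starts = range(0, len(signal) - N_FFT + 1, HOP)
    # Slice the signal into overlapping frames and taper each one.
    frames = np.stack([signal[s:s + N_FFT] * window for s in starts])
    # One FFT per frame; transpose so rows are frequencies and columns are time.
    return np.abs(np.fft.rfft(frames, axis=1)).T


# One second of a 440 Hz tone as test input.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
tone = np.sin(2 * np.pi * 440.0 * t)

spec = magnitude_spectrogram(tone)
freqs = np.fft.rfftfreq(N_FFT, d=1 / SAMPLE_RATE)

# The brightest row of the "image" sits at the frequency bin nearest the tone.
peak_hz = freqs[spec.mean(axis=1).argmax()]
```

Rendered as a grayscale picture, this array would show a single horizontal stripe near 440 Hz, which is the kind of image a model like Riffusion learns to draw.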

"This is the v1.5 Stable Diffusion model with no modifications, just fine-tuned on images of spectrograms paired with text," write Riffusion's creators on its explanation page. "It can generate infinite variations of a prompt by varying the seed. All the same web UIs and techniques like img2img, inpainting, negative prompts, and interpolation work out of the box."

Visitors to the Riffusion website can experiment with the AI model thanks to an interactive web app that generates interpolated sonograms (smoothly stitched together for uninterrupted playback) in real time while visualizing the spectrogram continuously on the left side of the page.
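The article does not say how the app interpolates between clips. One common approach with diffusion models is spherical linear interpolation (slerp) between two noise latents, sketched below under that assumption. The `slerp` helper and the random latents are hypothetical stand-ins; the `(4, 64, 64)` shape matches Stable Diffusion's latent size for a 512×512 image.

```python
import numpy as np


def slerp(t: float, start: np.ndarray, end: np.ndarray) -> np.ndarray:
    """Spherical linear interpolation between two latents (hypothetical helper)."""
    start_unit = start / np.linalg.norm(start)
    end_unit = end / np.linalg.norm(end)
    dot = np.clip(np.sum(start_unit * end_unit), -1.0, 1.0)
    if abs(dot) > 0.9995:
        # Nearly parallel vectors: plain linear interpolation is numerically safer.
        return (1 - t) * start + t * end
    theta = np.arccos(dot)
    return (np.sin((1 - t) * theta) * start + np.sin(t * theta) * end) / np.sin(theta)


# Two random latents standing in for two different seeds.
rng = np.random.default_rng(seed=0)
latent_a = rng.standard_normal((4, 64, 64))
latent_b = rng.standard_normal((4, 64, 64))

# Five evenly spaced latents from A to B; each would be denoised into one sonogram.
steps = [slerp(t, latent_a, latent_b) for t in np.linspace(0.0, 1.0, 5)]
```

Slerp keeps the intermediate latents at roughly the same magnitude as the endpoints, whereas a straight linear blend of two random latents shrinks toward the origin at the midpoint, which tends to produce washed-out results.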

A screenshot of the Riffusion website, which lets you type in prompts and hear the resulting sonograms. Credit: Riffusion

It can fuse styles, too. For example, typing in "smooth tropical dance jazz" blends elements of several genres into a novel result, which encourages experimentation.

Of course, Riffusion is not the first AI-powered music generator. Earlier this year, Harmonai released Dance Diffusion, a generative music model. OpenAI's Jukebox, announced in 2020, also generates new music with a neural network. And websites like Soundraw create music non-stop on the fly.

Compared to those more streamlined AI music efforts, Riffusion feels more like the hobby project it is. The music it generates ranges from interesting to unintelligible, but it remains a notable application of latent diffusion technology that manipulates audio in a visual space.

The Riffusion model checkpoint and code are available on GitHub.