Towards Generating Ambisonics Using Audio-visual Cue for Virtual Reality

Immersive Spatial Audio Reproduction for VR/AR Using Room Acoustic Modelling from 360° Images

2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), 2019

Recent progress in Virtual Reality (VR) and Augmented Reality (AR) allows us to experience various VR/AR applications in our daily lives. To maximise the user's sense of immersion in VR/AR environments, plausible spatial audio reproduction synchronised with visual information is essential. In this paper, we propose a simple and efficient system to estimate room acoustics for plausible reproduction of spatial audio using 360° cameras for VR/AR applications. A pair of 360° images is used for room geometry and acoustic property estimation. A simplified 3D geometric model of the scene is estimated by depth estimation from the captured images and semantic labelling using a convolutional neural network (CNN). The real environment's acoustics are characterised by frequency-dependent acoustic predictions of the scene. Spatially synchronised audio is reproduced based on the estimated geometric and acoustic properties of the scene. The reconstructed scenes are rendered with synthesised spatial audio as VR/AR content. The results of the estimated room geometry and simulated spatial audio are evaluated against actual measurements and audio computed from ground-truth Room Impulse Responses (RIRs) recorded in the rooms.
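To make the frequency-dependent prediction step concrete, the sketch below estimates reverberation time from an estimated room model with Sabine's formula, RT60 = 0.161·V / Σ Sᵢαᵢ. The absorption table, surface list and function names are illustrative assumptions of ours, not the paper's implementation.

```python
# A minimal sketch, assuming per-label absorption coefficients; the values
# below are placeholders, not measured data.
ABSORPTION = {
    "wall_painted": {125: 0.01, 500: 0.02, 2000: 0.04},
    "carpet":       {125: 0.05, 500: 0.30, 2000: 0.60},
    "window_glass": {125: 0.35, 500: 0.18, 2000: 0.07},
}

def rt60(volume_m3, surfaces):
    """surfaces: list of (semantic_label, area_m2) from the estimated geometry."""
    times = {}
    for band in (125, 500, 2000):  # octave bands, Hz
        absorption_area = sum(area * ABSORPTION[label][band]
                              for label, area in surfaces)
        times[band] = 0.161 * volume_m3 / absorption_area
    return times

# Example: a 5 m x 4 m x 3 m room whose surfaces were labelled by the CNN.
print(rt60(60.0, [("wall_painted", 66.0), ("carpet", 20.0), ("window_glass", 8.0)]))
```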

Spatial audio in 360° videos

Proceedings of the 13th ACM Multimedia Systems Conference

Immersive technologies are rapidly gaining traction across a variety of application domains. 360° video is one such technology, which can be captured with an omnidirectional multi-camera arrangement. With a Virtual Reality (VR) Head Mounted Display (HMD), users have the freedom to look in any direction they wish within the scene. While there is a plethora of work focused on modeling visual attention (VA) in VR, little research has considered the influence of the audio modality on VA in VR. It is well known that audio plays an important role in VR experiences. With high-quality spatial audio, listeners can experience sound arriving from all directions. One such technique, Ambisonics or 3D audio, provides a full 360° soundscape.
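As a concrete illustration of how a mono source becomes part of such a soundscape, the following sketch encodes a signal into first-order Ambisonics (AmbiX channel order, SN3D normalisation); the function name is ours.

```python
import numpy as np

def encode_foa(mono, azimuth_deg, elevation_deg):
    """Encode a mono signal at a given direction into first-order B-format."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    w = mono * 1.0                      # ACN 0 (W): omnidirectional
    y = mono * np.sin(az) * np.cos(el)  # ACN 1 (Y): left-right
    z = mono * np.sin(el)               # ACN 2 (Z): up-down
    x = mono * np.cos(az) * np.cos(el)  # ACN 3 (X): front-back
    return np.stack([w, y, z, x])

tone = np.sin(2 * np.pi * 440 * np.arange(48000) / 48000)
bformat = encode_foa(tone, azimuth_deg=90, elevation_deg=0)  # source hard left
```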

Novel approaches to production and post-production of immersive VR/360 audio-visual experiences

2017

Key aspirations for VR and 360° audio-visual productions are immersion and presence. Yet these are impossible to achieve without close spatial synchronisation between the field of vision and the soundscape. Audio has been recognised by many VR/360 practitioners as the most difficult aspect to control within the user experience. In an attempt to overcome this, there is much current research into object-based and scene-based audio paradigms that deploy complex rendering and spatialisation techniques via game engines such as Unity and proprietary software. There has been some success with these approaches, most notably BBC R&D's landmark VR experience, The Turning Forest. However, many VR/360 productions may not require such complex workflows in order to render an engaging user experience, particularly if content delivery is linear. Scene-based audio using Ambisonic principles coupled with simple head-tracking technology offers a more manageable workflow that is ideal for sho...
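For a sense of how such simple head tracking interacts with scene-based audio, here is a minimal sketch (our illustration, not any production's toolchain): yaw-rotating a first-order AmbiX stream is just a 2D rotation of its Y and X channels, applied before the binaural decode so the soundscape stays world-locked as the listener turns.

```python
import numpy as np

def rotate_foa_yaw(wyzx, angle_deg):
    """Rotate a first-order AmbiX field (W, Y, Z, X) about the vertical axis.
    For head tracking, pass the negative of the listener's head yaw."""
    w, y, z, x = wyzx
    c, s = np.cos(np.radians(angle_deg)), np.sin(np.radians(angle_deg))
    return np.stack([w, c * y + s * x, z, c * x - s * y])

bformat = np.random.randn(4, 48000)          # stand-in first-order stream
world_locked = rotate_foa_yaw(bformat, -30)  # listener's head yawed +30 degrees
```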

Scene-Based Audio and Higher Order Ambisonics: A technology overview and application to Next-Generation Audio, VR and 360° Video

EBU Technical Review Paper, 2019

Scene-Based Audio (SBA) is a set of technologies for 3D audio based on Higher Order Ambisonics (HOA). HOA is a technology that allows for accurate capture, efficient delivery, and compelling reproduction of 3D audio sound fields on any device, such as headphones, arbitrary loudspeaker configurations, or soundbars. We introduce SBA and describe the workflows for production, transport and reproduction of 3D audio using HOA. The efficient transport of HOA is made possible by state-of-the-art compression technologies contained in the MPEG-H Audio standard. We discuss how SBA and HOA can be used to successfully implement Next Generation Audio systems and to deliver any combination of TV, VR, and 360° video experiences using a single audio workflow.
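Two pieces of HOA bookkeeping underpin such workflows and are easy to state in code: an order-N scene needs (N+1)² channels, and the ACN convention indexes the spherical-harmonic component of degree n and order m as ACN = n² + n + m. A small sketch (the helper names are ours):

```python
def hoa_channels(order):
    """Channel count of an order-N Ambisonics scene."""
    return (order + 1) ** 2

def acn_index(n, m):
    """Ambisonic Channel Number of the harmonic of degree n, order m."""
    assert -n <= m <= n
    return n * n + n + m

print(hoa_channels(1), hoa_channels(3))                    # 4 and 16 channels
print(acn_index(1, -1), acn_index(1, 0), acn_index(1, 1))  # 1, 2, 3 (Y, Z, X)
```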

Panoptic Reconstruction of Immersive Virtual Soundscapes Using Human-Scale Panoramic Imagery with Visual Recognition

Proceedings of the 26th International Conference on Auditory Display (ICAD 2021), 2021

This work, situated at Rensselaer's Collaborative-Research Augmented Immersive Virtual Environment Laboratory (CRAIVE-Lab), uses panoramic image datasets for spatial audio display. A system is developed for the room-centered immersive virtual reality facility to analyze panoramic images on a segment-by-segment basis, using pre-trained neural network models for semantic segmentation and object detection, thereby generating audio objects with respective spatial locations. These audio objects are then mapped to a series of synthetic and recorded audio datasets and populated within a spatial audio environment as virtual sound sources. The resulting audiovisual outcomes are displayed using the facility's human-scale panoramic display, as well as its 128-channel loudspeaker array for wave field synthesis (WFS). Performance evaluation indicates effectiveness for real-time enhancements, with potential for large-scale expansion and rapid deployment in dynamic immersive virtual enviro...
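The geometric core of placing a detected object as a virtual source is a direction estimate from its image position; a minimal sketch for an equirectangular panorama (an assumed projection, not the CRAIVE-Lab code) is:

```python
def pixel_to_direction(cx, cy, width, height):
    """Map a bounding-box centre in an equirectangular panorama to a direction."""
    azimuth = (cx / width) * 360.0 - 180.0    # -180 deg (left edge) .. +180 deg
    elevation = 90.0 - (cy / height) * 180.0  # +90 deg (top) .. -90 deg (bottom)
    return azimuth, elevation

# Object detected right of centre in a 4096 x 2048 panorama:
print(pixel_to_direction(3072, 1024, 4096, 2048))  # (90.0, 0.0)
```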

Beyond Mono to Binaural: Generating Binaural Audio from Mono Audio with Depth and Cross Modal Attention

arXiv: Computer Vision and Pattern Recognition, 2021

Binaural audio gives the listener an immersive experience and can enhance augmented and virtual reality. However, recording binaural audio requires a specialized setup: a dummy human head with microphones in the left and right ears. Such a recording rig is difficult to build and set up, so mono audio has become the preferred choice in common devices. To obtain the same impact as binaural audio, recent efforts have been directed towards lifting mono audio to binaural audio conditioned on the visual input from the scene. Such approaches have not used an important cue for the task: the distance of the different sound-producing objects from the microphones. In this work, we argue that the depth map of the scene can act as a proxy for inducing distance information about the objects in the scene for the task of audio binauralization. We propose a novel encoder-decoder architecture with a hierarchical attention mechanism to encode image, depth and audio features jointly. We design the network on top of state-of-the-art transformer networks for image and depth representation. We show empirically that the proposed method comfortably outperforms state-of-the-art methods on two challenging public datasets, FAIR-Play and MUSIC-Stereo. We also demonstrate with qualitative results that the method is able to focus on the right information required for the task. Project details are available at https://krantiparida.github.io/projects/bmonobinaural.html
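Methods in this line of work (FAIR-Play and its successors) typically predict the difference signal between the two ears from the mono mix; the reconstruction step, sketched below with stand-in data, is simple arithmetic, while the network that predicts the difference is elided.

```python
import numpy as np

def reconstruct_binaural(mono_mix, predicted_difference):
    """Given m = left + right and a predicted d = left - right, recover both ears."""
    left = (mono_mix + predicted_difference) / 2.0
    right = (mono_mix - predicted_difference) / 2.0
    return np.stack([left, right])

m = np.random.randn(48000)   # stand-in for the recorded mono mix
d_hat = np.zeros_like(m)     # stand-in for the network's prediction
stereo = reconstruct_binaural(m, d_hat)
```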

Perceptual audio rendering of complex virtual environments

ACM Transactions on Graphics, 2004

Figure: Left, an overview of a test virtual environment containing 174 sound sources; all vehicles are moving. Mid-left, the magenta dots indicate the locations of the sound sources, while the red sphere represents the listener; note that the train and the river are extended sources modeled by collections of point sources. Mid-right, ray paths from the sources to the listener; paths in red correspond to perceptually masked sound sources. Right, the blue boxes are clusters of sound sources, with the representative of each cluster shown in grey. The combination of auditory culling and spatial clustering allows such complex audio-visual scenes to be rendered in real time.
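The clustering half of that idea reduces to grouping the surviving point sources by position and replacing each group with one representative; a coarse grid-hash sketch (a stand-in for the paper's perceptually driven clustering) follows.

```python
import numpy as np

def cluster_sources(positions, gains, cell=5.0):
    """Merge nearby sources into representatives at gain-weighted centroids."""
    clusters = {}
    for p, g in zip(positions, gains):
        key = tuple(np.floor(p / cell).astype(int))  # coarse spatial hash
        pos_sum, gain_sum = clusters.get(key, (np.zeros(3), 0.0))
        clusters[key] = (pos_sum + g * p, gain_sum + g)
    return [(pos_sum / gain_sum, gain_sum) for pos_sum, gain_sum in clusters.values()]

positions = np.random.uniform(-50, 50, size=(174, 3))  # 174 sources, as in the figure
gains = np.random.uniform(0.1, 1.0, size=174)
print(len(cluster_sources(positions, gains)), "representatives instead of 174 sources")
```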

Depth Infused Binaural Audio Generation using Hierarchical Cross-Modal Attention

2021

Binaural audio gives the listener the feeling of being in the recording place and enhances the immersive experience when coupled with AR/VR. The problem with binaural audio recording is that it requires a specialized setup that cannot be fabricated within handheld devices, unlike traditional mono audio, which can be recorded with a single microphone. To overcome this drawback, prior works have tried to lift mono recorded audio to binaural audio as a post-processing step, conditioned on the visual input. But all prior approaches missed another piece of information that is most important for the task: the distance of the different sound-producing objects from the recording setup. In this work, we argue that the depth map of the scene can act as a proxy for encoding distance information about objects in the scene, and we show that adding depth features along with image features improves performance both qualitatively and quantitatively. We propose a novel encoder-dec...

Ambisonic Sound Design for Theatre with Virtual Reality Demonstration - A Case Study

This paper discusses ambisonic sound design for a theatrical production of King Lear. Sound, and its use in theatre, has taken a back seat in recent years compared to the development of other theatre technologies such as lighting, projection and automation. Spatial audio implementations in theatre give the sound designer and the artistic team much greater scope for creativity, along with improvements in source separation and intelligibility due to spatial unmasking. A 360-degree video was also recorded, with first- and third-order ambisonic binaural reproductions of the sound design stitched onto the video to create a virtual reality experience. The project was successful, whilst highlighting some practical and perceptual limitations of spatial audio for theatre.

A comparison of different surround sound recording and reproduction techniques based on the use of a 32 capsules microphone array, including the influence of panoramic video

This paper provides a comparison between the operational results obtained when reproducing a three-dimensional sound field by means of traditional 1st-order Ambisonics and when employing, for the first time, the virtual microphone technique 3DVMS. Audio and video were recorded at the same time, employing 32-capsule spherical microphone arrays and a panoramic video capture system of our own design. In both cases, a matrix of FIR filters was employed for deriving either the four standard B-format components (Ambisonics) or 32 highly directive virtual microphones pointing in the same directions as the 32 loudspeakers (3DVMS). A pool of test subjects was employed for comparative listening tests, evaluating some standard psychoacoustical parameters. Furthermore, the same tests were repeated with and without the accompanying panoramic video.
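The filter-matrix step described above has a direct expression: each output channel, whether a B-format component or a virtual microphone, is the sum of the 32 capsule signals convolved with their respective FIR filters. A sketch with stand-in filters (real ones would come from array calibration):

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_filter_matrix(capsules, h):
    """capsules: [32, samples]; h: [n_out, 32, taps] -> [n_out, samples + taps - 1]."""
    n_out, _, taps = h.shape
    out = np.zeros((n_out, capsules.shape[1] + taps - 1))
    for j in range(n_out):
        for i in range(capsules.shape[0]):
            out[j] += fftconvolve(capsules[i], h[j, i])
    return out

capsules = np.random.randn(32, 48000)   # stand-in capsule recordings
h = np.random.randn(4, 32, 256) * 0.01  # stand-in filters deriving B-format
bformat = apply_filter_matrix(capsules, h)
```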