Multichannel video/audio acquisition for immersive conferencing
Related papers
The media industry is currently being pulled in the often-opposing directions of increased realism (high resolution, stereoscopic, large screen) and personalisation (selection and control of content, availability on many devices). A capture, production and delivery system capable of supporting both these trends is being developed by a consortium of European organisations in the EU-funded FascinatE project. This paper reports on the latest developments and presents results obtained from a test shoot at a UK Premier League football match. These include the use of imagery from broadcast cameras to add detail to key areas of the panoramic scene, and the automated generation of spatial audio to match the selected view. The paper explains how a 3D laser scan of the scene can help register the cameras and microphones into a common reference frame.
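The abstract does not spell out the registration math, but aligning cameras and microphones into a scan-derived common reference frame typically reduces to estimating a rigid transform from matched 3D points. Below is a minimal sketch of that step using the Kabsch algorithm, with made-up correspondences; this is an illustration of the general technique, not FascinatE's actual pipeline.

```python
import numpy as np

def rigid_transform(src, dst):
    """Estimate rotation R and translation t with dst ~ R @ src + t
    (Kabsch algorithm) from paired 3D points of shape (N, 3)."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)       # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t

# Hypothetical landmarks seen in a camera's local frame, matched against
# the same landmarks in the laser-scan coordinate frame.
cam_pts = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.5],
                    [0.0, 1.0, 3.0], [1.0, 1.0, 4.0]])
theta = np.radians(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
scan_pts = cam_pts @ R_true.T + np.array([5.0, 2.0, 0.0])

R, t = rigid_transform(cam_pts, scan_pts)
print(np.allclose(R, R_true), np.round(t, 3))   # True [5. 2. 0.]
```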
Selectable Directional Audio for Multiple Telepresence in Immersive Intelligent Environments
The general focus of this paper concerns the development of telepresence within intelligent immersive environments. The overall aim is the development of a system that combines multiple audio and video feeds from geographically dispersed people into a single environment view, where sound appears to be linked to the appropriate visual source on a panoramic viewer based on the gaze of the user. More specifically, this paper describes a novel directional audio system for telepresence which seeks to reproduce sound sources (conversations) in a panoramic viewer in their correct spatial positions, to increase the realism associated with telepresence applications such as online meetings. The intention of this work is that external attendees to an online meeting would be able to move their head to focus on the video and audio stream from a particular person or group, so as to decrease the audio from all other streams (i.e. speakers) to a background level. The main contribution of this paper is a methodology that captures and reproduces these spatial audio and video relationships. In support of this we have created a multiple-camera recording scheme to emulate the behavior of a panoramic camera, or array of cameras, at such a meeting, which uses the chroma-key photographic effect to integrate all streams into a common panoramic video image, thereby creating a common shared virtual space. While this emulation is only implemented as an experiment, it opens the opportunity to create telepresence systems with selectable real-time video and audio streaming using multiple camera arrays. Finally, we report on the results of an evaluation of our spatial audio scheme that demonstrates that the techniques both work and improve the users’ experience, by comparing a traditional omnidirectional audio scheme against selectable directional binaural audio scenarios.
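The paper's binaural rendering details are not reproduced here; the sketch below only illustrates the selectable-directional idea the abstract describes: sources near the gaze direction keep full gain, all others drop to a background level, and each stream is panned by its azimuth relative to the gaze. Constant-power panning stands in for true binaural (HRTF) processing, and all parameter values are assumptions.

```python
import numpy as np

def gaze_gains(gaze_az, source_az, focus_width=30.0, bg_level=0.2):
    """Full gain for sources within focus_width degrees of the gaze
    direction; attenuated 'background' gain for all the others."""
    diff = np.abs((np.asarray(source_az) - gaze_az + 180) % 360 - 180)
    return np.where(diff <= focus_width, 1.0, bg_level)

def mix_stereo(streams, source_az, gaze_az):
    """streams: mono float arrays of equal length; source_az in degrees,
    0 = straight ahead, positive = to the user's right."""
    gains = gaze_gains(gaze_az, source_az)
    out = np.zeros((len(streams[0]), 2))
    for sig, az, g in zip(streams, source_az, gains):
        rel = (az - gaze_az + 180) % 360 - 180   # wrap to [-180, 180)
        pan = np.radians(np.clip(rel, -90, 90) / 90 * 45 + 45)
        out[:, 0] += g * np.cos(pan) * sig       # left, constant power
        out[:, 1] += g * np.sin(pan) * sig       # right
    return out
```

In a full implementation the per-source gains would feed an HRTF convolution per stream rather than plain cosine panning, but the gaze-dependent selection logic stays the same.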
Journal of Signal Processing Systems, 2013
Emerging multi-modal signal processing applications require a sustained effort on the part of the developer to realize and deploy an application. A rapid prototyping platform will reduce the effort, cost, and time required to develop and deploy an application. In this paper, a rapid prototyping platform is developed for realizing multi-modal signal processing applications that involve real-time interfacing of multi-modal signals at both the input and the output. The platform allows the designer to simulate various applications and produce the final product only after complete testing has been done. A portable intelligent meeting capture system that can be rapidly deployed in smart meeting rooms is implemented on this platform. The setup consists of a microphone array which computes the two-dimensional direction of arrival (DOA). The azimuth and elevation angles are computed using signal processing algorithms such as GCC-PHAT and MUSIC, implemented on a real-time operating system (RTOS). The DOAs are communicated to a wireless networked camera which steers in real time towards the active speaker. The rapidly prototyped system is evaluated in real meetings in terms of average error deviation in the DOA. The accuracy of the results indicates that further miniaturization of the system is feasible. The possibilities of using this platform for developing multi-modal signal processing applications in general are also described.
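GCC-PHAT, named in the abstract, estimates the time difference of arrival between two microphones from the phase of a whitened cross-spectrum; the azimuth then follows from the far-field relation az = arcsin(c·τ/d). Below is a minimal single-pair sketch of the method; the paper's full array processing and RTOS implementation are not reproduced, and the microphone spacing is an assumed value.

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    """Delay of y relative to x in seconds (positive if y lags x),
    via GCC-PHAT: the cross-spectrum is whitened so that only phase,
    i.e. pure delay information, remains."""
    n = len(x) + len(y)
    X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
    R = np.conj(X) * Y
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n)
    max_shift = n // 2 if max_tau is None else int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# Hypothetical two-mic pair: 10 cm spacing, speed of sound 343 m/s.
fs, d, c = 16000, 0.10, 343.0
rng = np.random.default_rng(0)
sig = rng.standard_normal(fs)              # 1 s of noise as the source
mic1, mic2 = sig, np.roll(sig, 2)          # mic2 lags by 2 samples
tau = gcc_phat(mic1, mic2, fs, max_tau=d / c)
az = np.degrees(np.arcsin(np.clip(c * tau / d, -1.0, 1.0)))
print(f"tau = {tau * 1e6:.0f} us, azimuth = {az:.1f} deg")  # ~125 us, ~25 deg
```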
Multiple narrow-baseline system for immersive teleconferencing
International Symposium on VIPromCom Video/Image Processing and Multimedia Communications, 2002
An important aim of immersive teleconferencing systems is to create realistic 3D virtual views of remote conferees. Hence, systems should be able to deal with hand gestures as well as occluded areas in reference images required in derived views. The quality of such derived views is dependent not only on the analysis and synthesis process but also the multiview camera setup. Often the popular convergent wide-baseline stereo approach aspires to achieve too much through a single camera pair: maximum information and reliable disparity maps. We identify how this dichotomy leads to problems in the analysis and synthesis process, often leading to a restrictive system-specific solution. We then define a new approach, a multiple narrow-baseline setup, designed to overcome the limitations of the wide-baseline setup, being modular, both in terms of system requirements as well as algorithmically, and scalable, with respect to the number of conferees.
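The wide- versus narrow-baseline trade-off follows directly from the pinhole disparity relation d = f·B/Z. A back-of-the-envelope sketch with assumed numbers illustrates the dichotomy the abstract describes:

```python
# Disparity of a fronto-parallel point: d = f * B / Z (pixels), with
# focal length f in pixels, baseline B and depth Z in metres.
f_px = 800.0                    # assumed focal length
Z = 1.0                         # conferee roughly 1 m from the cameras

for B in (0.05, 0.30):          # narrow vs wide baseline (assumed)
    d = f_px * B / Z
    # Depth sensitivity to a 1-pixel disparity error: dZ = Z**2 / (f*B)
    dZ = Z ** 2 / (f_px * B)
    print(f"B={B:.2f} m: disparity {d:.0f} px, "
          f"depth error per pixel {1000 * dZ:.1f} mm")
# B=0.05 m: disparity 40 px,  depth error per pixel 25.0 mm
# B=0.30 m: disparity 240 px, depth error per pixel 4.2 mm
```

Narrow baselines give small disparity ranges (easy, reliable matching but coarse depth); wide baselines give precise depth but large search ranges and occlusions, which is the tension the multiple narrow-baseline setup is designed to sidestep.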
Shared interactive video for teleconferencing
Proceedings of the eleventh ACM international conference on Multimedia - MULTIMEDIA '03, 2003
We present a system that allows remote and local participants to control devices in a meeting environment using mouse or pen based gestures "through" video windows. Unlike state-of-the-art device control interfaces that require interaction with text commands, buttons, or other artificial symbols, our approach allows users to interact with devices through live video of the environment. This naturally extends our video supported pan/tilt/zoom (PTZ) camera control system, by allowing gestures in video windows to control not only PTZ cameras, but also other devices visible in video images. For example, an authorized meeting participant can show a presentation on a screen by dragging the file on a personal laptop and dropping it on the video image of the presentation screen. This paper presents the system architecture, implementation tradeoffs, and various meeting control scenarios.
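The paper defines its own architecture; as a toy illustration of the core idea only, gesture routing reduces to hit-testing the gesture's image coordinates against known device regions in the video frame. The region layout and names below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DeviceRegion:
    """A controllable device's bounding box in the video image (pixels)."""
    name: str
    x0: int
    y0: int
    x1: int
    y1: int

    def contains(self, x: int, y: int) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1

# Hypothetical layout of one meeting-room camera view.
REGIONS = [
    DeviceRegion("presentation_screen", 200, 40, 520, 260),
    DeviceRegion("room_light", 560, 10, 600, 60),
]

def on_drop(x: int, y: int, payload: str) -> str:
    """Route a drag-and-drop gesture in the video window to the device
    whose image region the drop landed on."""
    for region in REGIONS:
        if region.contains(x, y):
            return f"send {payload!r} to {region.name}"
    return "no device at drop point"

print(on_drop(360, 150, "slides.ppt"))   # -> send 'slides.ppt' to presentation_screen
```

With a PTZ camera, the regions would additionally have to be remapped whenever the camera pose changes, which is part of what the full system handles.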
A televiewing system for multiple simultaneous customized perspectives and resolutions
ITSC 2001. 2001 IEEE Intelligent Transportation Systems. Proceedings (Cat. No.01TH8585), 2001
Recent innovations in real-time machine vision, distributed computing, software architectures, and high-speed communication are expanding the available technology for intelligent system development. These technologies allow the realization of intelligent systems that provide the capabilities for a user to experience events from remote locations in an interactive way. In this paper we describe research aimed at the realization of a powerful televiewing system applied to the traffic incident detection and monitoring needs of today's highways. Sensor clusters utilizing both rectilinear and omni-directional cameras will provide an interactive, real-time, multi-resolution televiewing interface to emergency response crews. Ultimately, this system will have a direct impact on reducing incident related highway congestion by improving the quality of information to which emergency personnel have access.
Immersive 3-D Video Conferencing: Challenges, Concepts, and Implementations
In this paper, a next generation 3-D video conferencing system is presented that provides immersive telepresence and natural representation of all participants in a shared virtual meeting space. The system is based on the principle of a shared virtual table environment which guarantees correct eye contact and gesture reproduction and enhances the quality of human-centered communication. The virtual environment is modeled in MPEG-4, which also allows the seamless integration of explicit 3-D head models for a low-bandwidth connection to mobile users. In this case, facial expression and motion information is transmitted instead of video streams, resulting in bit-rates of a few kbit/s per participant. Besides low bit-rates, the model-based approach enables new possibilities for image enhancements like digital make-up, digital dressing, or modification of scene lighting.
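The quoted bit-rate of a few kbit/s can be sanity-checked from first principles: transmitting a quantized set of MPEG-4 facial animation parameters per frame costs orders of magnitude less than pixel data. The arithmetic below is illustrative; all the counts are assumptions, not figures from the paper.

```python
# Rough bit-rate estimate for model-based transmission: send facial
# animation parameters instead of pixels. All numbers are assumptions.
fps            = 25   # parameter updates per second
active_params  = 20   # subset of MPEG-4 facial animation parameters in use
bits_per_param = 6    # quantized, after prediction/entropy coding

rate = fps * active_params * bits_per_param
print(f"{rate} bit/s = {rate / 1000:.1f} kbit/s")   # 3.0 kbit/s
```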
The Coliseum immersive teleconferencing system
2002
We describe Coliseum, a desktop system for immersive teleconferencing. Five cameras attached to a desktop LCD monitor are directed at a participant. View synthesis methods produce arbitrary-perspective renderings of the participant from these video streams, and transmit them to other participants. Combining these renderings in a shared synthetic environment gives the illusion of having remote participants interacting in a common space.
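Coliseum's actual renderer is an image-based view-synthesis pipeline that is not reproduced here; the sketch below only illustrates one ingredient commonly used in such systems, computing per-camera blending weights from the angular proximity of each real camera to the requested virtual viewpoint. Directions and the falloff parameter are assumptions.

```python
import numpy as np

def blend_weights(virtual_dir, camera_dirs, sigma_deg=15.0):
    """Per-camera blending weights from the angular distance between
    each camera's viewing direction and the virtual viewpoint."""
    cams = np.asarray(camera_dirs, dtype=float)
    cams /= np.linalg.norm(cams, axis=1, keepdims=True)
    v = np.asarray(virtual_dir, dtype=float)
    v /= np.linalg.norm(v)
    ang = np.degrees(np.arccos(np.clip(cams @ v, -1.0, 1.0)))
    w = np.exp(-(ang / sigma_deg) ** 2)      # nearby cameras dominate
    return w / w.sum()

# Five desktop cameras fanned around the user; virtual view slightly
# to the right of centre (unit vectors are illustrative).
cams = [(np.sin(np.radians(a)), 0.0, np.cos(np.radians(a)))
        for a in (-40, -20, 0, 20, 40)]
print(np.round(blend_weights((0.26, 0.0, 0.97), cams), 3))
```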
Human-centric control of video functions and underlying resources in 3D tele-immersive systems
ACM SIGMultimedia Records, 2011
3D tele-immersion (3DTI) has the potential of enabling virtual-reality-like interaction among remote people with real-time 3D video. However, today's 3DTI systems still suffer from various performance issues, limiting their broader deployment, due to the enormous demand on temporal (computing) and spatial (networking) resources. Past research focused on system-centric approaches for technical optimization, without taking human users into the loop. We argue that human factors (including user preferences, semantics, limitations, etc.) are an important and integral part of the cyber-physical 3DTI systems, and should not be neglected. This thesis proposes a novel, comprehensive, human-centric framework for improving the qualities of 3DTI throughout its video function pipeline. We make three major contributions at different phases of the pipeline. At the sending side, we develop an intra-stream data adaptation scheme that reduces level-of-details within each stream without users being aware of it. This human-centric approach exploits limitations of human vision, and excludes details that are imperceptible. It effectively alleviates the data load for computation-intensive operations, thus improving the temporal efficiency of the systems. Yet even with intra-stream data reduced, spatial efficiency is still a problem due to the multi-stream/multi-site nature of 3DTI collaboration. We thus develop an inter-stream data adaptation scheme at the networking phase to reduce the number of streams with minimal disruption to the visual quality. This human-centric approach prioritizes streams based on user views and excludes less important streams from transmission. It considerably reduces the data load for networking, and thus enhances the spatial resource efficiency. The above two approaches (level-of-details reduction within a video stream and view-based differentiation among streams) work seamlessly together to bring both temporal and spatial resource demands under control, and are shown to improve various qualities of the systems. Finally, at the receiving side, we take a holistic approach to study the "quality" concept in 3DTI environments. Our human-centric quality framework focuses on the Quality-of-Experience (QoE) concept that models the user's perceptions, emotions, performances, etc. It investigates how the traditional Quality-of-Service (QoS) impacts QoE, and reveals how QoS should be improved for the best user experience. This thesis essentially demonstrates the importance of bringing human-awareness into the design, execution, and evaluation of the complex resource-constrained 3DTI environments.
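The inter-stream adaptation idea, prioritizing streams by the user's current view and dropping the least relevant ones, can be sketched as a simple ranking. The scoring, directions, and budget below are assumptions for illustration, not the thesis's actual scheme.

```python
import numpy as np

def select_streams(user_view, stream_dirs, budget):
    """Rank 3D video streams by how well their capture direction aligns
    with the user's view direction; keep only the top `budget` streams."""
    v = np.asarray(user_view, dtype=float)
    v /= np.linalg.norm(v)
    scores = []
    for i, d in enumerate(stream_dirs):
        d = np.asarray(d, dtype=float)
        d /= np.linalg.norm(d)
        scores.append((float(v @ d), i))     # cosine similarity
    keep = [i for _, i in sorted(scores, reverse=True)[:budget]]
    return sorted(keep)

# Hypothetical: four capture directions, user looking down +x,
# network budget of two streams.
dirs = [(1, 0, 0), (0, 1, 0), (-1, 0, 0), (0.7, 0.7, 0)]
print(select_streams((1, 0, 0), dirs, budget=2))   # -> [0, 3]
```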
Virtual space teleconferencing using a sea of cameras
1994
A new approach to telepresence is presented in which a multitude of stationary cameras are used to acquire both photometric and depth information. A virtual environment is constructed by displaying the acquired data from the remote site in accordance with the head position and orientation of a local participant. Shown are preliminary results of a depth image of a human subject calculated from 11 closely spaced video camera positions. A user wearing a head-mounted display walks around this 3D data that has been inserted into a 3D model of a simple room. Future systems based on this approach may exhibit more natural and intuitive interaction among participants than current 2D teleconferencing systems.
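Depth images like the ones described here can be turned into renderable 3D points by standard pinhole back-projection. A minimal sketch, assuming the camera intrinsics fx, fy, cx, cy are known from calibration:

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth image (metres) into camera-frame 3D points
    with the pinhole model: X = (u-cx)*Z/fx, Y = (v-cy)*Z/fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    Z = depth
    X = (u - cx) * Z / fx
    Y = (v - cy) * Z / fy
    pts = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)
    return pts[depth.reshape(-1) > 0]        # drop invalid (zero) pixels
```

The resulting point cloud can then be placed into the room model and re-rendered from the tracked head pose of the local participant, which is the display step the abstract describes.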