
Details of Grant 

EPSRC Reference: EP/V03538X/1
Title: Immersive Audio-Visual 3D Scene Reproduction Using a Single 360 Camera
Principal Investigator: KIM, Dr H
Other Investigators:
Researcher Co-Investigators:
Project Partners: Audioscenic; BBC; University of Surrey
Department: Sch of Electronics and Computer Sci
Organisation: University of Southampton
Scheme: New Investigator Award
Starts: 01 September 2021
Ends: 29 February 2024
Value (£): 267,460
EPSRC Research Topic Classifications:
Image & Vision Computing; Music & Acoustic Technology
EPSRC Industrial Sector Classifications:
Creative Industries
Related Grants:
Panel History:
Panel Date: 27 Jan 2021
Panel Name: EPSRC ICT Prioritisation Panel January 2021
Outcome: Announced
Summary on Grant Application Form
The COVID-19 pandemic has changed the way we live and created high demand for remote communication and remote experiences. Many organisations have had to set up remote working systems with video conferencing platforms. However, current video conferencing systems do not meet basic requirements for remote collaboration because they lack eye contact, gaze awareness and spatial audio synchronisation. Reproducing a real space as an audio-visual 3D model lets users experience real-time interaction in real environments remotely, so it can be used in a wide range of applications such as healthcare, teleconferencing, education and entertainment. The goal of this project is to develop a simple and practical solution for estimating the geometrical structure and acoustic properties of general scenes, so that spatial audio can be adapted to the environment and the listener's location, giving an immersive rendering of the scene and improving the user experience.

Existing 3D scene reproduction systems have two problems. (i) Audio and vision systems have been researched separately. Computer vision research has mainly focused on improving the visual side of scene reconstruction. In an immersive display, such as a VR system, users do not perceive the experience as "realistic" if the sound does not match the visual cues. Audio research, for its part, has relied on audio sensors alone to measure acoustic properties, without considering how visual sensors could complement them. (ii) Current capture and recording systems for 3D scene reproduction require a set-up that is too invasive, and a process that is too specialised, for users to deploy in their own private spaces. A LiDAR sensor is expensive and requires a long scanning time. Perspective images require a large number of photos to cover the whole scene.

The objective of this research is to develop an end-to-end audio-visual 3D scene reproduction pipeline using a single shot from a consumer 360 (panoramic) camera. To make the system easily accessible to ordinary users in their own private spaces, the back-end should include an automatic solution based on computer vision and artificial intelligence algorithms. A deep neural network (DNN) jointly trained for semantic scene reconstruction and acoustic property prediction for the captured environments will be developed. This process includes inference for regions that are not visible from the camera. Impulse Responses (IRs), which characterise the acoustic attributes of an environment, make it possible to reproduce the acoustics of the space with any sound source. They also make it possible to extract the original (dry) sound by removing the acoustic effects from a recorded sound, so that the source can be re-rendered in new environments with different acoustic effects. A simple and efficient method to estimate acoustic IRs from a single captured 360 photo will be investigated.
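The role of an impulse response described above can be illustrated with a minimal sketch. It is not the project's method: the IR here is a hand-made toy (a direct path plus two reflections) rather than one estimated from a 360 photo, and the function names `apply_room` and `recover_dry` and the regularisation constant `eps` are assumptions made for illustration. The sketch shows that convolving a dry signal with an IR yields the "wet" room sound, and that a regularised frequency-domain division can approximately undo it when the IR is known.

```python
import numpy as np


def apply_room(dry, ir):
    """Render a dry signal in a room by convolving it with the room's IR."""
    return np.convolve(dry, ir)


def recover_dry(wet, ir, n_dry, eps=1e-8):
    """Estimate the dry signal by regularised frequency-domain deconvolution.

    eps is a hypothetical regularisation constant that avoids division by
    near-zero frequency bins of the IR.
    """
    n = len(wet)
    wet_f = np.fft.rfft(wet, n)
    ir_f = np.fft.rfft(ir, n)  # IR zero-padded to the length of the wet signal
    dry_f = wet_f * np.conj(ir_f) / (np.abs(ir_f) ** 2 + eps)
    return np.fft.irfft(dry_f, n)[:n_dry]


# Toy data (assumed, not measured): random "dry" source and a sparse IR.
rng = np.random.default_rng(0)
dry = rng.standard_normal(256)
ir = np.zeros(64)
ir[0], ir[10], ir[40] = 1.0, 0.5, 0.25  # direct sound plus two reflections

wet = apply_room(dry, ir)             # sound as heard in the room
est = recover_dry(wet, ir, len(dry))  # approximate dry source
```

With a measured or predicted IR, the division is much less well conditioned than in this toy case, which is why the regularisation term matters in practice.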

This semantic scene data will be used to provide an immersive audio-visual experience to users. Two types of display scenario will be considered: a personalised display system, such as a VR headset with headphones, and a communal display system (e.g., a TV or projector) with loudspeakers. Real-time 3D human pose tracking using a single 360 camera will be developed so that the 3D audio-visual scene can be rendered accurately at the users' locations. Delivering binaural sound to listeners through loudspeakers is a challenging task. Audio beam-forming techniques for multiple loudspeakers, aligned with human pose tracking, will be investigated in collaboration with the project partners in audio processing.
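As a rough illustration of how a tracked listener position could drive loudspeaker beam-forming, the sketch below computes per-loudspeaker delays for basic delay-and-sum steering. The summary does not say which beam-forming technique the project uses, so delay-and-sum is an assumption chosen for simplicity; the array geometry, the listener position and the function name `steering_delays` are likewise hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def steering_delays(speakers, listener):
    """Delays (seconds) that make all loudspeaker signals arrive together.

    speakers: (N, 3) array of loudspeaker positions in metres.
    listener: (3,) tracked listener position in metres.
    The farthest loudspeaker gets zero delay; nearer ones are delayed more.
    """
    dist = np.linalg.norm(speakers - listener, axis=1)
    return (dist.max() - dist) / SPEED_OF_SOUND


# Hypothetical three-loudspeaker line array and a listener 2 m in front of it.
speakers = np.array([[-0.5, 0.0, 0.0],
                     [0.0, 0.0, 0.0],
                     [0.5, 0.0, 0.0]])
listener = np.array([0.0, 2.0, 0.0])

delays = steering_delays(speakers, listener)
# Arrival time at the listener is propagation time plus the applied delay.
arrival = np.linalg.norm(speakers - listener, axis=1) / SPEED_OF_SOUND + delays
```

Re-running `steering_delays` whenever the pose tracker reports a new position keeps the focus point on the listener. Delivering separate left-ear and right-ear signals would need more than this, for example crosstalk cancellation.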

The resulting system would have a significant impact on innovation in VR and multimedia systems, and would open up new and interesting applications for their deployment. This award should provide the foundation for the PI to establish and lead a group with a unique research direction that is aligned with national priorities and addresses a major long-term research challenge.

Key Findings
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Potential use in non-academic contexts
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Description: This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Date Materialised
Sectors submitted by the Researcher
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Project URL:  
Further Information:  
Organisation Website: http://www.soton.ac.uk