Details of Grant

EPSRC Reference:

EP/S016260/1

Title:

ROSSINI: Reconstructing 3D structure from single images: a perceptual reconstruction approach

Principal Investigator:

Schofield, Professor AJ

Other Investigators:

Researcher Co-Investigators:

Project Partners:

CrossWing

Double Negative Ltd

Microsoft

Department:

College of Health and Life Sciences

Organisation:

Aston University

Scheme:

Standard Research

Starts:

01 April 2019

Ends:

30 June 2023

Value (£):

409,881

EPSRC Research Topic Classifications:

Image & Vision Computing

Vision & Senses - ICT appl.

EPSRC Industrial Sector Classifications:

Information Technologies

Related Grants:

EP/S016368/1

EP/S016317/1

Panel History:

Panel Date	Panel Name	Outcome
04 Sep 2018	EPSRC ICT Prioritisation Panel September 2018	Announced

Summary on Grant Application Form

Consumers enjoy the immersive experience of 3D content in cinema, TV and virtual reality (VR), but it is expensive to produce. Filming a 3D movie requires two cameras to simulate the two eyes of the viewer. A common but expensive alternative is to film a single view, then use video artists to create the left and right eyes' views in post-production. What if a computer could automatically produce a 3D model (and binocular images) from 2D content: 'lifting images into 3D'? This is the overarching aim of this project. Lifting into 3D has multiple uses, such as route planning for robots, obstacle avoidance for autonomous vehicles, alongside applications in VR and cinema.

Estimating 3D structure from a 2D image is difficult because in principle, the image could have been created from an infinite number of 3D scenes. Identifying which of these possible worlds is correct is very hard, yet humans interpret 2D images as 3D scenes all the time. We do this every time we look at a photograph, watch TV or gaze into the distance, where binocular depth cues are weak. Although we make some errors in judging distances, our ability to quickly understand the layout of any scene enables us to navigate through and interact with any environment.

Computer scientists have built machine vision systems for lifting to 3D by incorporating scene constraints. A popular technique is to train a deep neural network with a collection of 2D images and associated 3D range data. However, to be successful, this approach requires a very large dataset, which can be expensive to acquire. Furthermore, performance is only as good as the dataset is complete: if the system encounters a type of scene or geometry that does not conform to the training dataset, it will fail. Most methods have been trained for specific situations - e.g. indoor, or street scenes - and these systems are typically less effective for rural scenes and less flexible and robust than humans. Finally, such systems provide a single reconstructed output, without any measure of uncertainty. The user must assume that the 3D reconstruction is correct, which will be a costly assumption in many cases.

Computer systems are designed and evaluated based upon their accuracy with respect to the real world. However, the ultimate goal of lifting into 3D is not perfect accuracy - rather it is to deliver a 3D representation that provides a useful and compelling visual experience for a human observer, or to guide a robot whilst avoiding obstacles. Importantly, humans are expert at interacting with 3D environments, even though our perception can deviate substantially from true metric depth. This suggests that human-like representations are both achievable and sufficient, in any and all environments.

ROSSINI will develop a new machine vision system for 3D reconstruction that is more flexible and robust than previous methods. Focussing on static images, we will identify key structural features that are important to humans. We will combine neural networks with computer vision methods to form human-like descriptions of scenes and 3D scene models. Our aims are to (i) produce 3D representations that look correct to humans even if they are not strictly geometrically correct (ii) do so for all types of scene and (iii) express the uncertainty inherent in each reconstruction. To this end we will collect data on human interpretation of images and incorporate this information into our network. Our novel training method will learn from humans and existing ground truth datasets; the training algorithm selecting the most useful human tasks (i.e. judge depth within a particular image) to maximise learning. Importantly, the inclusion of human perceptual data should reduce the overall quantity of training data required, while mitigating the risk of over-reliance on a specific dataset. Moreover, when fully trained, our system will produce 3D reconstructions alongside information about the reliability of the depth estimates.

Key Findings

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Potential use in non-academic contexts

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Impacts

Description	This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Summary
Date Materialised

Sectors submitted by the Researcher

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Project URL:

Further Information:

Organisation Website:

http://www.aston.ac.uk