Details of Grant 

EPSRC Reference: EP/S016260/1
Title: ROSSINI: Reconstructing 3D structure from single images: a perceptual reconstruction approach
Principal Investigator: Schofield, Professor AJ
Other Investigators:
Researcher Co-Investigators:
Project Partners:
CrossWing
Double Negative Ltd
Microsoft
Department: College of Health and Life Sciences
Organisation: Aston University
Scheme: Standard Research
Starts: 01 April 2019
Ends: 30 June 2023
Value (£): 409,881
EPSRC Research Topic Classifications:
Image & Vision Computing
Vision & Senses - ICT appl.
EPSRC Industrial Sector Classifications:
Information Technologies
Related Grants:
EP/S016368/1
EP/S016317/1
Panel History:
Panel Date: 04 Sep 2018
Panel Name: EPSRC ICT Prioritisation Panel September 2018
Outcome: Announced
Summary on Grant Application Form
Consumers enjoy the immersive experience of 3D content in cinema, TV and virtual reality (VR), but it is expensive to produce. Filming a 3D movie requires two cameras to simulate the two eyes of the viewer. A common but expensive alternative is to film a single view, then use video artists to create the left and right eyes' views in post-production. What if a computer could automatically produce a 3D model (and binocular images) from 2D content: 'lifting images into 3D'? This is the overarching aim of this project. Lifting into 3D has multiple uses, including route planning for robots and obstacle avoidance for autonomous vehicles, alongside applications in VR and cinema.

Estimating 3D structure from a 2D image is difficult because, in principle, the image could have been created from an infinite number of 3D scenes. Identifying which of these possible worlds is correct is very hard, yet humans interpret 2D images as 3D scenes all the time. We do this every time we look at a photograph, watch TV or gaze into the distance, where binocular depth cues are weak. Although we make some errors in judging distances, our ability to quickly understand the layout of any scene enables us to navigate through and interact with any environment.

Computer scientists have built machine vision systems for lifting to 3D by incorporating scene constraints. A popular technique is to train a deep neural network with a collection of 2D images and associated 3D range data. However, to be successful, this approach requires a very large dataset, which can be expensive to acquire. Furthermore, performance is only as good as the dataset is complete: if the system encounters a type of scene or geometry that does not conform to the training dataset, it will fail. Most methods have been trained for specific situations (e.g. indoor or street scenes); these systems are typically less effective for rural scenes, and less flexible and robust than humans. Finally, such systems provide a single reconstructed output, without any measure of uncertainty. The user must assume that the 3D reconstruction is correct, which will be a costly assumption in many cases.

Computer systems are designed and evaluated based upon their accuracy with respect to the real world. However, the ultimate goal of lifting into 3D is not perfect accuracy - rather it is to deliver a 3D representation that provides a useful and compelling visual experience for a human observer, or to guide a robot whilst avoiding obstacles. Importantly, humans are expert at interacting with 3D environments, even though our perception can deviate substantially from true metric depth. This suggests that human-like representations are both achievable and sufficient, in any and all environments.

ROSSINI will develop a new machine vision system for 3D reconstruction that is more flexible and robust than previous methods. Focussing on static images, we will identify key structural features that are important to humans. We will combine neural networks with computer vision methods to form human-like descriptions of scenes and 3D scene models. Our aims are to (i) produce 3D representations that look correct to humans even if they are not strictly geometrically correct, (ii) do so for all types of scene, and (iii) express the uncertainty inherent in each reconstruction. To this end, we will collect data on human interpretation of images and incorporate this information into our network. Our novel training method will learn from both humans and existing ground truth datasets, with the training algorithm selecting the most useful human tasks (e.g. judging depth within a particular image) to maximise learning. Importantly, the inclusion of human perceptual data should reduce the overall quantity of training data required, while mitigating the risk of over-reliance on a specific dataset. Moreover, when fully trained, our system will produce 3D reconstructions alongside information about the reliability of the depth estimates.
Key Findings
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Potential use in non-academic contexts
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Description: This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Date Materialised
Sectors submitted by the Researcher
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Project URL:  
Further Information:  
Organisation Website: http://www.aston.ac.uk