
Details of Grant 

EPSRC Reference: EP/T028572/1
Title: Visual AI: An Open World Interpretable Visual Transformer
Principal Investigator: Zisserman, Professor A
Other Investigators:
Vedaldi, Professor A
Noble, Professor A
Bilen, Dr H
Damen, Professor D
Researcher Co-Investigators:
Project Partners:
BBC
Continental Teves AG & Co. oHG
Intelligent Ultrasound
Nielson
Plexalis Ltd
Samsung Electronics UK Ltd
Department: Engineering Science
Organisation: University of Oxford
Scheme: Programme Grants
Starts: 01 December 2020
Ends: 30 November 2025
Value (£): 5,912,097
EPSRC Research Topic Classifications:
Artificial Intelligence
Computational Linguistics
Image & Vision Computing
EPSRC Industrial Sector Classifications:
Healthcare
Creative Industries
Information Technologies
Transport Systems and Vehicles
Related Grants:
Panel History:
Panel Date: 25 Feb 2020
Panel Name: Programme Grants Interview Panel (MIQS) - February 2020
Outcome: Announced
Summary on Grant Application Form
With the advent of deep learning and the availability of big data, it is now possible to train machine learning algorithms for a multitude of visual tasks, such as tagging personal image collections in the cloud, recognising faces, and 3D shape scanning with phones. However, each of these tasks currently requires training a neural network on a very large image dataset specifically collected and labelled for that task. The resulting networks are good experts for the target task, but they only understand the 'closed world' experienced during training and can 'say' nothing useful about other content. They cannot be applied to other tasks without retraining, and they cannot explain their decisions or recognise their limitations. Furthermore, current visual algorithms are usually 'single modal': they 'close their ears' to the other modalities (audio, text) that may be readily available.

The core objective of the Programme is to develop the next generation of audio-visual algorithms that do not have these limitations. We will carry out fundamental research to develop a Visual Transformer capable of visual analysis with the flexibility and interpretability of a human visual system, and aided by the other 'senses' - audio and text. It will be able to learn continually from raw data streams without requiring the traditional 'strong supervision' of a new dataset for each new task, and to deliver and distill semantic and geometric information over a multitude of data types (for example, videos with audio, very large scale image and video datasets, and medical images with text records).

The Visual Transformer will be a key component of next-generation AI, able to address multiple downstream audio-visual tasks, overcoming the current limitations of computer vision systems, and enabling new and far-reaching applications.

A second objective addresses transfer and translation. We seek impact in a variety of other academic disciplines and industries which today greatly under-utilise the power of the latest computer vision ideas. We will target these disciplines to enable them to leapfrog from what they use (or do not use) today, which is dominated by manual review and highly interactive frame-by-frame analysis, to a new era where automated visual analytics of very large datasets becomes the norm. In short, our goal is to ensure that the newly developed methods are used by industry and academic researchers in other areas, and turned into products for societal and economic benefit. To this end, open-source software, datasets, and demonstrators will be disseminated on the project website.

The ubiquity of digital images and videos means that every UK citizen may potentially benefit from the Programme research in different ways. One example is smart audio-visual glasses that can pay attention to a person talking by using their lip movements to mask out other ambient sounds. A second is an app that can answer visual questions (or retrieve matches) for text queries over large-scale audio-visual collections, such as a person's entire personal video collection. A third is AI-guided medical screening that can aid a minimally trained healthcare professional in performing medical scans.

Key Findings
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Potential use in non-academic contexts
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Description: This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Date Materialised
Sectors submitted by the Researcher
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Project URL:  
Further Information:  
Organisation Website: http://www.ox.ac.uk