Computer vision, the technology that allows machines to understand the content of images automatically, is fuelling a revolution in digital image processing. For example, it is now possible to use computers to search billions of images and millions of hours of video on the Internet for particular content (Google Goggles), interpret gestures and body motions to play games (Microsoft Kinect), automatically focus cameras on faces, or build smart cameras that can monitor hazardous industrial equipment around the clock.
If not for their scale, these tasks would appear trivial to a human. However, vision is exceptionally challenging computationally, to the extent that more than half of our brain is dedicated to this function alone. Since this complexity cannot be met by hand-crafting software, vision architectures are nowadays learned automatically from millions of example images, leveraging advanced machine learning and optimisation technologies. Despite terrific recent successes, however, machine vision still pales in comparison to vision in humans. Probably the most disappointing restriction is that these systems can address only a single task at a time, such as deciding whether a particular image contains, say, a person. Recognising a different concept, for example a dog, or addressing a different task, for example outlining rather than recognising the person, requires learning a new system from scratch, wasting time and effort.
My research idea is to transform existing architectures into repositories of 'visual knowledge' that can be reused and extended incrementally to address multiple tasks and domains, greatly improving the efficiency, scalability, and flexibility of the technology. The key scientific challenge is to understand how visual information is encoded in state-of-the-art vision systems. Since these systems are learned automatically rather than hand-crafted, it is currently unclear what information they capture and how it is represented. An in-depth investigation will explicate this formally and quantitatively, and will provide the basis for sharing and integrating visual knowledge across a growing number of concepts and tasks, including ones not addressed by the initial design of the system. At the same time, identifying fine-grained information will allow a system to obtain a more detailed, comprehensive, and meaningful understanding of the content of images.
The potential for impact is huge as the proposed research will enhance core computer vision technology that already powers countless applications. For example, computers will be able to search images by matching more detailed queries expressed using a far richer visual vocabulary; software will be extensible to new domains and tasks with minimal effort; and computer vision systems will be able to explain in explicit, intuitive terms how they understand images.
The research outcomes will be evaluated rigorously on international benchmark data and protocols. Research results will be made available to a wide technical audience by distributing open-source software implementing the new technology. The project is also likely to have a strong academic impact, consolidating the leadership of the UK in computer vision, a strategic competitive area in the digital economy.