This project is a joint effort between 12 leading Chinese universities and several invited key partners in the UK and US.
The Internet and other large-scale databases are a significant source of what may be termed "visual media": images, videos, 3D shape models, and so on. Internet text searches usually produce useful results. However, it can be much more difficult to find visual media, e.g. videos with specific content, or images similar to a picture in one's mind's eye. This is partly because most image search is based on text inputs, and partly because pictures are difficult to classify. It is easy for humans to "know" what an image contains, but image understanding by computer involves many tricky tasks: splitting an image into separate objects, and analysing their colour, their shape, and many other attributes. Better solutions to the search of visual media would enable many applications in addition to search itself, and we will also look at one of them: the re-use of existing visual media when creating new visual media.
This project has four main goals.
The first is to investigate new approaches to structural analysis of visual media. This will include devising methods to find salient information (for example, what is the main object? what is irrelevant background? how is this object composed of parts?), and methods which process the information on different scales (small details may be just as important as overall shape, for example). The aim is to come up with hierarchical descriptions of the important information in visual media.
The second is to find efficient new approaches to comparing, classifying and searching visual media, based on the above hierarchical descriptions. We will also look at how sketches can give users a much more powerful means than text of describing what they want to find when searching.
The third area to be considered is editing and resynthesis of visual media. Structural analysis will provide more meaningful ways to select parts of an image than just, for example, all parts of the scene with a certain colour. In turn, this will simplify the process of editing visual media. Users will be able to apply consistent editing to scene elements with similar meaning (e.g. the user controls bending of one finger, and the computer applies a similar bend to the rest of the fingers of a hand, despite minor shape differences). More powerful search will also allow elements to be rapidly retrieved from visual media databases or the Internet to be combined into new scenes, or to be included within existing images, with suitable adjustment for different lighting, etc. When video is processed, further considerations will be needed to ensure results are consistent over time, and smoothly vary as time progresses; the vast amounts of data involved in video processing make this a challenging problem.
The final area of work concerns the use of machine learning techniques to assist with all of the previous goals. The aim here is to automatically learn to recognize complex patterns, permitting software to make intelligent decisions based on visual data. Ultimately, a careful balance must be struck in which the user is firmly in control of the creative process, but the computer makes it easy for the user to produce the desired results.