Details of Grant

EPSRC Reference:

EP/C010035/1

Title:

Extracting the Science from Scientific Publications

Principal Investigator:

Copestake, Professor A

Other Investigators:

Murray-Rust, Professor P

Parker, Professor M

Teufel, Professor S

Researcher Co-Investigators:

Project Partners:

International Union of Crystallography

Nature Publishing Group

Royal Society of Chemistry Publishing

Department:

Computer Science and Technology

Organisation:

University of Cambridge

Scheme:

Standard Research (Pre-FEC)

Starts:

01 November 2005

Ends:

30 April 2010

Value (£):

676,835

EPSRC Research Topic Classifications:

Artificial Intelligence

Comput./Corpus Linguistics

Intelligent & Expert Systems

EPSRC Industrial Sector Classifications:

Chemicals

Related Grants:

Panel History:

Summary on Grant Application Form

Many tools exist for processing natural languages, such as English, but there is no single perfect system. Different approaches have different strengths and weaknesses. For instance, some very fast processors are designed to make decisions about part of speech: e.g., that 'fly' in the sentence 'You'll have to fly' is a verb rather than a noun. Other processors can do much more: e.g., they realise that 'you' will be doing the flying, and may be able to decide whether 'fly' is meant literally or idiomatically (in context). But such 'deep' systems are much slower at processing text and far more complex to build than the simpler 'shallow' systems. Therefore researchers in natural language processing try to combine multiple systems in different ways, in particular so that deep systems are only used on text that is identified as interesting by a shallow system. However, progress has been hindered by the lack of a common interface between systems.
We are developing a formal language which captures some aspects of the meaning of natural language in a way that allows contributions from different processors to be combined. The combined systems can be used to extract knowledge from text for later machine use, or to give human browsers information about the structure of texts and their interconnections. In this project, we will use this approach to analyse research papers in Chemistry, so that aspects of their meaning can be extracted and used in the Semantic Web. For example, we can obtain information about how particular compounds are synthesised and represent this so that researchers can look up the information more easily. We are also trying to automatically discover information about the meaning of terms used in Chemistry. For instance, our system might discover that 'an alkaloid is a type of azacycle' from the phrase 'the concise synthesis of naturally occurring alkaloids and other complex polycyclic azacycles'.
We will also analyse text structure so that we can tell whether an author is agreeing with a previous publication or criticising it.
These tools will be combined in a complete system for use by working chemists who will give us feedback on the results. We are collaborating with major publishers who are allowing us to experiment with papers in their collections. We expect to use a GRID of parallel computers to process tens of thousands of papers in order to build a substantial knowledge base. At the end of the project, we will investigate the extension of this work to other sciences. However, the general approach will have wide application to extraction of information from many types of text.
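
As a rough illustration of the term-discovery example in the summary (inferring that an alkaloid is a type of azacycle), the sketch below applies a simple 'X and other Y' surface pattern in Python. It is not the project's method, which relies on a formal semantic representation shared between processors; the function names, the stop-word list and the naive plural stripping are all assumptions made for this example.

```python
import re

# Hypothetical stop list: words assumed to end a simple noun phrase.
NP_BOUNDARY = {"of", "in", "from", "with", "by", "for", "to",
               "that", "which", "is", "are", "was", "were", "and", "or"}

# "<word> and other <word> <word> ..." (also matches "or other").
PATTERN = re.compile(
    r"(\w[\w-]*)\s+(?:and|or)\s+other\s+([\w-]+(?:\s+[\w-]+)*)",
    re.IGNORECASE,
)


def singular(word):
    """Naive singularisation (assumption): drop one trailing 's'."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def extract_is_a(text):
    """Return (narrower term, broader term) pairs found in the text."""
    pairs = []
    for match in PATTERN.finditer(text):
        narrower = match.group(1).lower()
        # Collect the words after "other" up to an assumed phrase boundary.
        phrase = []
        for token in match.group(2).split():
            if token.lower() in NP_BOUNDARY:
                break
            phrase.append(token.lower())
        if phrase:
            # Treat the last word of the phrase as its head noun.
            pairs.append((singular(narrower), singular(phrase[-1])))
    return pairs


PHRASE = ("the concise synthesis of naturally occurring alkaloids "
          "and other complex polycyclic azacycles")
print(extract_is_a(PHRASE))  # [('alkaloid', 'azacycle')]
```

A pattern like this is a 'shallow' technique in the sense used above: it is fast but knows nothing about syntax, so it would mis-handle many real sentences that a deeper analysis could resolve.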

Key Findings

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Potential use in non-academic contexts

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Impacts

Description:

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Summary:

Date Materialised:

Sectors submitted by the Researcher

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Project URL:

Further Information:

Organisation Website:

http://www.cam.ac.uk