
Details of Grant 

EPSRC Reference: EP/C010035/1
Title: Extracting the Science from Scientific Publications
Principal Investigator: Copestake, Professor A
Other Investigators:
Murray-Rust, Professor P; Parker, Professor M; Teufel, Professor S
Researcher Co-Investigators:
Project Partners:
International Union of Crystallography; Nature Publishing Group; Royal Society of Chemistry Publishing
Department: Computer Science and Technology
Organisation: University of Cambridge
Scheme: Standard Research (Pre-FEC)
Starts: 01 November 2005
Ends: 30 April 2010
Value (£): 676,835
EPSRC Research Topic Classifications:
Artificial Intelligence
Comput./Corpus Linguistics
Intelligent & Expert Systems
EPSRC Industrial Sector Classifications:
Chemicals
Related Grants:
Panel History:  
Summary on Grant Application Form
Many tools exist for processing natural languages such as English, but there is no single perfect system: different approaches have different strengths and weaknesses. For instance, some very fast processors are designed to make decisions about part of speech, e.g., that 'fly' in the sentence 'You'll have to fly' is a verb rather than a noun. Other processors can do much more: e.g., they recognise that 'you' will be doing the flying, and may be able to decide from context whether 'fly' is meant literally or idiomatically. But such 'deep' systems are much slower at processing text and far more complex to build than the simpler 'shallow' systems. Researchers in natural language processing therefore try to combine multiple systems in different ways, in particular so that deep systems are only used on text that a shallow system has identified as interesting. However, progress has been hindered by the lack of a common interface between systems.
We are developing a formal language which captures some aspects of the meaning of natural language in a way that allows contributions from different processors to be combined. The combined systems can be used to extract knowledge from text for later machine use, or to give human browsers information about the structure of texts and their interconnections.
In this project, we will use this approach to analyse research papers in Chemistry, so that aspects of their meaning can be extracted and used in the Semantic Web. For example, we can obtain information about how particular compounds are synthesised and represent it so that researchers can look up the information more easily. We are also trying to discover information about the meaning of terms used in Chemistry automatically. For instance, our system might discover that 'an alkaloid is a type of azacycle' from the phrase 'the concise synthesis of naturally occurring alkaloids and other complex polycyclic azacycles'.
We will also analyse text structure so that we can tell whether an author is agreeing with a previous publication or criticising it.
These tools will be combined in a complete system for use by working chemists, who will give us feedback on the results. We are collaborating with major publishers who are allowing us to experiment with papers in their collections. We expect to use a GRID of parallel computers to process tens of thousands of papers in order to build a substantial knowledge base. At the end of the project, we will investigate the extension of this work to other sciences. More broadly, the general approach will have wide application to the extraction of information from many types of text.
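The term-discovery example in the summary (inferring that an alkaloid is a type of azacycle from a phrase of the form "X and other Y") can be illustrated with a minimal sketch. This is not the project's method, which the summary does not specify; it is a toy pattern matcher under stated assumptions. The names `AND_OTHER`, `singular` and `extract_is_a` are hypothetical, the input is assumed to be a single noun phrase so that the last word is the head noun, and plurals are assumed to end in a plain "s".

```python
import re

# Illustrative pattern: "<hyponym> and other <modifiers> <hypernym>".
# Assumption: the input is one noun phrase, so the hypernym's head noun
# is simply the last word. Running text would need a tagger or parser.
AND_OTHER = re.compile(r"(\w+) and other (?:\w+ )*(\w+)$")


def singular(noun: str) -> str:
    """Naive singularisation: strip one trailing 's' (assumption for this sketch)."""
    return noun[:-1] if noun.endswith("s") else noun


def extract_is_a(phrase: str):
    """Return (hyponym, hypernym) if the phrase matches the pattern, else None."""
    match = AND_OTHER.search(phrase.strip())
    if match is None:
        return None
    return singular(match.group(1)), singular(match.group(2))


phrase = ("the concise synthesis of naturally occurring alkaloids "
          "and other complex polycyclic azacycles")
print(extract_is_a(phrase))  # ('alkaloid', 'azacycle')
```

A surface pattern like this corresponds to the 'shallow' end of the spectrum described in the summary: it is fast but easily fooled, which is one reason to combine it with deeper analysis.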
Key Findings
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Potential use in non-academic contexts
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Impacts
Description: This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Summary
Date Materialised
Sectors submitted by the Researcher
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Project URL:  
Further Information:  
Organisation Website: http://www.cam.ac.uk