Details of Grant

EPSRC Reference:

GR/J53508/01

Title:

MAPPING BETWEEN CORPUS ANNOTATION SCHEMES

Principal Investigator:

Atwell, Professor ES

Other Investigators:

Souter, Dr DC

Researcher Co-Investigators:

Project Partners:

Department:

Sch of Computing

Organisation:

University of Leeds

Scheme:

Standard Research (Pre-FEC)

Starts:

20 December 1993

Ends:

19 June 1997

Value (£):

175,739

EPSRC Research Topic Classifications:

Human Communication in ICT

EPSRC Industrial Sector Classifications:

Related Grants:

Panel History:

Summary on Grant Application Form

To investigate and develop methods of automatically mapping between the grammatical annotation schemes of the most widely known corpora, thus assessing their differences and improving their reusability. Annotating a single corpus with the different schemes allows for comparisons, and will provide a rich test-bed for automatic parsers.

Progress: Our primary objectives for the end of the first year of the project were to obtain corpora, automatic annotators and relevant documentation for the corpora we wished to include in our mapping suite of programs, and to derive mappings between the tagged annotations of those corpora (the parsed annotations were scheduled to be included in the second year). Our objectives shifted as it became apparent that it was more efficient to work with the tagged and parsed components of a corpus at the same time, where both exist. To this end we have nearly completed both tagging and parsing, by hand, the Spoken English Corpus (SEC) according to the International Corpus of English (ICE) annotation guidelines. Expertise in the ICE schemes was attained through collaboration with the TOSCA project at Nijmegen University in the Netherlands.

Concurrently we have developed a program able to extract the correspondences from parallel tagging schemes such as the one created by adding ICE annotation to SEC. The automatically extracted list of correspondences constitutes a mapping. We have been using this list in a series of experiments in which the mapping programs re-label the text of one corpus with the annotation scheme of the other, and we have been evaluating the results. We have also obtained several other corpora, along with some automatic annotation programs and various manuals and related documentation. We have investigated mappings between other pairs of corpus annotation schemes using these resources.

Our publications include:

Atwell, Eric, John Hughes and Clive Souter. 1994. AMALGAM: Automatic Mapping Among Lexico-Grammatical Annotation Models. In Judith Klavans and Philip Resnik, editors, Proceedings of The Balancing Act - Combining Symbolic and Statistical Approaches to Language, Workshop in conjunction with the 32nd Annual Meeting of the Association for Computational Linguistics. New Mexico State University, Las Cruces, New Mexico, USA, 27th-30th June.

Hughes, John, Clive Souter and Eric Atwell. 1995. Automatic Extraction of Tagset Mappings from Parallel-Annotated Corpora. To appear in the proceedings of From Text to Tags: Issues in Multilingual Language Analysis, SIGDAT Workshop in conjunction with the European Chapter of the Association for Computational Linguistics. Dublin, Ireland.

Project Team: Eric Atwell, Seb Haigh, John Hughes, Clive Souter, Tim Willis.
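
The summary above describes extracting correspondences from a corpus that carries two parallel tag annotations, and then using the resulting list to re-label text. The project's actual program is not described in this record, so the following is only a minimal sketch of one plausible approach: count how often each source tag co-occurs with each target tag on the same token, and keep the most frequent target. The function names (extract_mapping, relabel), the "UNMAPPED" placeholder and the tag labels in the usage note are all hypothetical and are not taken from the SEC or ICE tagsets.

```python
from collections import Counter, defaultdict


def extract_mapping(parallel_tokens):
    """Derive a source-tag -> target-tag mapping from parallel annotation.

    parallel_tokens is an iterable of (word, source_tag, target_tag) triples,
    i.e. one token annotated under two schemes. This is an assumed input
    format, not the format used by the project.
    """
    counts = defaultdict(Counter)
    for _word, source_tag, target_tag in parallel_tokens:
        counts[source_tag][target_tag] += 1
    # Keep the most frequently co-occurring target tag for each source tag.
    return {src: targets.most_common(1)[0][0] for src, targets in counts.items()}


def relabel(tagged_tokens, mapping, unknown="UNMAPPED"):
    """Re-label (word, source_tag) pairs with target-scheme tags.

    Source tags never seen during extraction get the placeholder `unknown`.
    """
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_tokens]
```

For example, with made-up tags, the triples ("the", "A_DET", "B_ART"), ("dog", "A_NOUN", "B_N"), ("run", "A_NOUN", "B_N"), ("run", "A_NOUN", "B_V") and ("barks", "A_VERB", "B_V") would map A_NOUN to B_N, because B_N is its most frequent counterpart. A one-to-one majority mapping of this kind ignores context, so it cannot resolve cases where one source tag genuinely corresponds to several target tags; that limitation is a property of this sketch, not a finding of the project.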

Key Findings

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Potential use in non-academic contexts

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Impacts

Description: This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Summary:
Date Materialised:

Sectors submitted by the Researcher

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Project URL:

Further Information:

Organisation Website:

http://www.leeds.ac.uk