Details of Grant

EPSRC Reference:

EP/I034750/1

Title:

Non-Parametric Models of Phrase-based Machine Translation

Principal Investigator:

Cohn, Dr TA

Other Investigators:

Researcher Co-Investigators:

Project Partners:

Department:

Computer Science

Organisation:

University of Sheffield

Scheme:

First Grant - Revised 2009

Starts:

01 March 2012

Ends:

31 August 2013

Value (£):

101,251

EPSRC Research Topic Classifications:

Artificial Intelligence	Comput./Corpus Linguistics
Human Communication in ICT	Statistics & Appl. Probability

EPSRC Industrial Sector Classifications:

No relevance to Underpinning Sectors

Related Grants:

Panel History:

Panel Date	Panel Name	Outcome
15 Mar 2011	EPSRC ICT Responsive Mode - Mar 2011	Announced

Summary on Grant Application Form

Machine Translation is the automatic process of translating human language text from one language into another using a computer. Statistical Machine Translation (SMT) is a data-driven approach to machine translation which uses machine learning techniques to learn how to translate directly from large collections of sentences and their translations. SMT has seen a surge in popularity in the last decade, and has now matured into an invaluable means for data access and communication, as evidenced by the many successful commercial SMT systems, e.g., Google Translate and Microsoft Translator. These technologies are beginning to have a substantial impact on individuals, businesses and governments by enabling communication with foreign language speakers and access to the growing amounts of foreign language data. Consequently, automatic translation is rapidly becoming a key technology across all levels of the community, and improvements in its quality have the potential to further increase its impact.
Although current systems can produce good translations for some language pairs, such as French-English, their performance is markedly worse for many others, e.g., Chinese-English and Basque-Spanish. There are two key reasons for this: language similarity and data availability. Predominant SMT approaches do not model the structure of the language, instead assuming that translation can be performed largely at the level of single words or small groups of words. For this reason they are unable to describe the large changes in word order which are commonly required when translating between dissimilar languages. For example, Japanese sentences typically follow a subject-object-verb order, while English sentences are subject-verb-object. In order to translate between Japanese and English, the positions of the verb and object phrase must be reversed.
The other reason for SMT under-performing is data availability. Current techniques require very large collections of sentences and their translations in order to learn a good translation model (hundreds of thousands or millions of sentence pairs), and performance suffers when considerably less data is available. For official languages of the leading countries in the West and Asia this type of data is often plentiful; however, for the remaining majority of the world's languages and dialects, translation data is exceedingly rare. This problem is exacerbated for languages with large vocabularies, for instance Finnish, in which each word can convey a wide range of syntactic and semantic information including the gender, syntactic case, tense, number and aspect.
This project will tackle both of these issues, focusing primarily on the second issue of generalising from small training sets. The novelty of our approach will also help to make inroads into the first issue of dealing with word-order differences between structurally dissimilar languages. The project will develop a translation model which reframes phrase-based translation in a novel way by using much simpler translation units than phrase-based models, primarily single words and their translations, while also modelling correlations between the translations used in a sentence. This will allow the model to implicitly describe arbitrarily large translation units, and thus the approach is more general than current phrase-based translation. Our approach also confers a number of further benefits. Most notably, it will make better use of training data by compiling denser and more reliable statistics, and thus will generalise more accurately from small training sets. In addition, our approach will support a richer set of translation fragments than current phrase-based models, including gapping phrase-pairs which can describe syntactic divergences between structurally dissimilar languages. These benefits should lead to improvements in translation accuracy across a wide range of language pairs.

Key Findings

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Potential use in non-academic contexts

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Impacts

Description	This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Summary
Date Materialised

Sectors submitted by the Researcher

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Project URL:

Further Information:

Organisation Website:

http://www.shef.ac.uk