
Details of Grant 

EPSRC Reference: EP/J002526/1
Title: Deep architectures for statistical speech synthesis
Principal Investigator: Yamagishi, Dr J
Other Investigators:
Researcher Co-Investigators:
Project Partners:
Department: Centre for Speech Technology Research
Organisation: University of Edinburgh
Scheme: Career Acceleration Fellowship
Starts: 01 November 2011 Ends: 31 October 2016 Value (£): 741,163
EPSRC Research Topic Classifications:
Human Communication in ICT
EPSRC Industrial Sector Classifications:
Healthcare
Related Grants:
Panel History:
Panel Date: 28 Jun 2011
Panel Name: Fellowships 2011 Interviews Panel F (ICT)
Outcome: Announced
Summary on Grant Application Form
Speech synthesis is the conversion of written text into speech output. Applications range from telephone dialogue systems to computer games and clinical applications. Current speech synthesis systems offer only a very limited range of different voices, because creating each new voice is complex and expensive.

Unfortunately, that is a big problem for many interesting applications, including one we are focusing on in this proposal: assistive communication aids for people with vocal problems due to Motor Neurone Disease and other conditions. At the moment, these people are forced to use devices with inappropriate voices, very often in the wrong accent and sometimes even of the wrong sex! This is a disincentive for them to communicate, even with their own family, since they do not "own" the voice and it does not reflect their identity. The voice is an integral part of identity, and we are creating the technology to allow people to communicate in their own voice, when their natural speech has become hard to understand or they can no longer speak at all.

The technology we will develop has a lot of other applications too: it will enable a speech synthesiser to adjust not only the speaker identity but many other properties too. For example, adjusting speaking effort will simulate what human talkers do in noisy conditions to make their speech more intelligible. Our starting point is a technique we have pioneered, called speaker adaptation.

Speaker adaptation has proven highly successful in enabling the flexible transformation of the characteristics of a text-to-speech synthesis system, based on a small amount of recorded speech. It can be used to change the characteristics of the speech to a different speaker or speaking style. However, current methods do not use any deep knowledge about speech and do not generalise across similar situations. This makes them considerably less natural and flexible than human speech production, in which speech is controlled by talkers based simply on prior experience. For instance, we effortlessly adapt our speech in noisy environments, compared with quiet environments, in order to increase intelligibility. The adaptation techniques that we have pioneered are completely automatic, but they do not enable this prior knowledge to be incorporated in a straightforward way.
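The core idea behind this family of adaptation methods can be illustrated with a toy sketch (not the project's actual system): a single affine transform of the average-voice model's mean vectors is estimated from a small amount of target-speaker data and then applied to every state in the model, including states never observed during adaptation. All names, dimensions, and data below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "average voice" model: one mean vector per acoustic state.
n_states, dim = 200, 5
avg_means = rng.normal(size=(n_states, dim))

# Hypothetical (unknown) speaker transform to recover: mu' = A @ mu + b.
A_true = np.eye(dim) + 0.1 * rng.normal(size=(dim, dim))
b_true = rng.normal(size=dim)

# "Adaptation data": target-speaker means observed for only a few states.
adapt_idx = rng.choice(n_states, size=30, replace=False)
target_means = avg_means[adapt_idx] @ A_true.T + b_true

# Estimate the affine transform by least squares on the adaptation subset.
X = np.hstack([avg_means[adapt_idx], np.ones((len(adapt_idx), 1))])
W, *_ = np.linalg.lstsq(X, target_means, rcond=None)
A_hat, b_hat = W[:dim].T, W[dim]

# Apply the single shared transform to every state in the model.
adapted = avg_means @ A_hat.T + b_hat

# Because one transform is shared, states never seen in the adaptation
# data are moved towards the target speaker too.
err = np.abs(adapted - (avg_means @ A_true.T + b_true)).max()
print("max error:", err)
```

The key property, mirrored in this sketch, is that a compact transform learned from a small sample repositions the whole model, which is why only a few minutes of recorded speech can suffice.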

In some preliminary work, we have developed a model which includes information about the movement of the speech articulators: the tongue, lips and so on. Then, using our knowledge of how humans alter their speech production in the presence of noise (hyper- & hypo-articulation), we have demonstrated that it is possible to improve the intelligibility of synthetic speech in noise.

The current proposal is to extend and generalise this preliminary work, in order to integrate many other types of knowledge about human speech into this model. We will develop a new model which allows us to include more information about how speech is produced, as well as information about how it is perceived and how external factors, such as background noise, affect speech.

One important application of this technology is to create personalised speech synthesis for people with disordered speech (caused by Motor Neurone Disease, for example). Current technology for creating voices does not work for these people, because their speech is usually already disordered. Our technique can actually correct this, and produce speech which sounds like the person, but is more intelligible than their current natural speech. We have already produced a proof-of-concept system demonstrating that this works. The current proposal will make the technology available and affordable to a wide range of people.

Key Findings / Potential Use in Non-Academic Contexts / Impacts / Sectors Submitted by the Researcher:
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Project URL:  
Further Information:  
Organisation Website: http://www.ed.ac.uk