Details of Grant

EPSRC Reference:

EP/R044732/1

Title:

Exact scalable inference for coalescent processes

Principal Investigator:

Koskela, Dr J

Other Investigators:

Researcher Co-Investigators:

Project Partners:

Technical University Berlin

University of Oxford

Department:

Statistics

Organisation:

University of Warwick

Scheme:

New Investigator Award

Starts:

01 August 2018

Ends:

31 July 2021

Value (£):

99,492

EPSRC Research Topic Classifications:

Statistics & Appl. Probability

EPSRC Industrial Sector Classifications:

No relevance to Underpinning Sectors

Related Grants:

Panel History:

Panel Date	Panel Name	Outcome
06 Jun 2018	EPSRC Mathematical Sciences Prioritisation Panel June 2018	Announced

Summary on Grant Application Form

Modern genetic data sets are vast, both in terms of the number of sequenced individuals and the length of the sequenced DNA segments. Patterns within these data sets carry information about the biological and demographic histories of the population, which cannot usually be observed directly.

The central tool connecting observed patterns to predictions and inference is the Kingman coalescent: a random tree that provides a model for the unobserved ancestry of the sampled DNA sequences. Since the ancestry is unobserved, inferences are made by averaging over all possible ancestries.

In simple cases the average over ancestries can be calculated analytically, but in most biologically relevant scenarios the average has to be approximated. This is usually done by simulating an ensemble of possible ancestral trees, and treating the ensemble average as an approximation of the true, unknown average. The quality of the approximation depends on the degree to which the ensemble is representative of the set of all possible ancestries.

Ensuring that an ensemble is both representative, and not infeasibly large, is a challenging problem. Existing methods for producing ensembles split into two categories: importance sampling (IS), and Markov chain Monte Carlo (MCMC), of which the latter is typically more flexible and easier to implement. Both are known to scale poorly with the size and complexity of the data set. This proposal seeks to improve the scalability of state of the art MCMC methods in three related ways:

1. Much work has been done to characterise optimal IS algorithms, which have been observed to perform roughly as well as naive implementations of the more flexible MCMC. Preliminary results for this project show that optimality results for IS can also be used to characterise optimal MCMC algorithms, but this has never been done. This work will investigate and thoroughly benchmark the performance of the resulting, optimised MCMC algorithms.

2. The practical utility of MCMC algorithms has improved dramatically through so-called optimal scaling results, which provide a guide for how to tweak the algorithm as the data set grows. However, these typically apply only to settings in which the distribution being simulated consists of independent, real-valued components. In genetics, the distributions of interests consist of trees, and is hence much more complicated. This project will investigate extensions of optimal scaling results to tree-valued settings using recently developed machinery of optimal scaling via Dirichlet forms, which are a natural way to analyse tree-valued algorithms.

3. A recently published algorithm called msprime uses a novel data structure, called a sparse tree, to improve the speed and memory consumption of naive coalescent simulation by many orders of magnitude. This does not immediately translate to improved inference algorithms, because naive simulation typically results in ensembles that are poor representations of the true average. The sparse tree structure cannot be directly inserted into an MCMC algorithm, but preliminary work has identified several ways in which MCMC can be modified to use data structures resembling sparse trees. This project will implement and benchmark all of the resulting algorithms to determine which of these ways is the most effective.

The end result of these three streams will be a highly optimised, flexible, open source algorithm for inference in genetics. It will have unprecedented performance on large data sets due to a combination of mathematical optimisation (objectives 1 and 2) and optimisation of the underlying data structure (objective 3). MCMC algorithms also provide automatic, rigorous uncertainty quantification for their estimates, which many state-of-the-art competitors are not able to provide. This makes MCMC particularly well suited to e.g. clinical practice, where understanding uncertainties is crucial for medical outcomes.

Key Findings

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Potential use in non-academic contexts

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Impacts

Description	This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Summary
Date Materialised

Sectors submitted by the Researcher

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Project URL:

Further Information:

Organisation Website:

http://www.warwick.ac.uk