EPSRC Reference: 
EP/R044732/1 
Title: 
Exact scalable inference for coalescent processes 
Principal Investigator: 
Koskela, Dr J 
Other Investigators: 

Researcher CoInvestigators: 

Project Partners: 

Department: 
Statistics 
Organisation: 
University of Warwick 
Scheme: 
New Investigator Award 
Starts: 
01 August 2018 
Ends: 
31 July 2020 
Value (£): 
99,492

EPSRC Research Topic Classifications: 
Statistics & Appl. Probability 


EPSRC Industrial Sector Classifications: 
No relevance to Underpinning Sectors 


Related Grants: 

Panel History: 

Summary on Grant Application Form 
Modern genetic data sets are vast, both in terms of the number of sequenced individuals and the length of the sequenced DNA segments. Patterns within these data sets carry information about the biological and demographic histories of the population, which cannot usually be observed directly.
The central tool connecting observed patterns to predictions and inference is the Kingman coalescent: a random tree that provides a model for the unobserved ancestry of the sampled DNA sequences. Since the ancestry is unobserved, inferences are made by averaging over all possible ancestries.
In simple cases the average over ancestries can be calculated analytically, but in most biologically relevant scenarios the average has to be approximated. This is usually done by simulating an ensemble of possible ancestral trees, and treating the ensemble average as an approximation of the true, unknown average. The quality of the approximation depends on the degree to which the ensemble is representative of the set of all possible ancestries.
Ensuring that an ensemble is both representative, and not infeasibly large, is a challenging problem. Existing methods for producing ensembles split into two categories: importance sampling (IS), and Markov chain Monte Carlo (MCMC), of which the latter is typically more flexible and easier to implement. Both are known to scale poorly with the size and complexity of the data set. This proposal seeks to improve the scalability of state of the art MCMC methods in three related ways:
1. Much work has been done to characterise optimal IS algorithms, which have been observed to perform roughly as well as naive implementations of the more flexible MCMC. Preliminary results for this project show that optimality results for IS can also be used to characterise optimal MCMC algorithms, but this has never been done. This work will investigate and thoroughly benchmark the performance of the resulting, optimised MCMC algorithms.
2. The practical utility of MCMC algorithms has improved dramatically through socalled optimal scaling results, which provide a guide for how to tweak the algorithm as the data set grows. However, these typically apply only to settings in which the distribution being simulated consists of independent, realvalued components. In genetics, the distributions of interests consist of trees, and is hence much more complicated. This project will investigate extensions of optimal scaling results to treevalued settings using recently developed machinery of optimal scaling via Dirichlet forms, which are a natural way to analyse treevalued algorithms.
3. A recently published algorithm called msprime uses a novel data structure, called a sparse tree, to improve the speed and memory consumption of naive coalescent simulation by many orders of magnitude. This does not immediately translate to improved inference algorithms, because naive simulation typically results in ensembles that are poor representations of the true average. The sparse tree structure cannot be directly inserted into an MCMC algorithm, but preliminary work has identified several ways in which MCMC can be modified to use data structures resembling sparse trees. This project will implement and benchmark all of the resulting algorithms to determine which of these ways is the most effective.
The end result of these three streams will be a highly optimised, flexible, open source algorithm for inference in genetics. It will have unprecedented performance on large data sets due to a combination of mathematical optimisation (objectives 1 and 2) and optimisation of the underlying data structure (objective 3). MCMC algorithms also provide automatic, rigorous uncertainty quantification for their estimates, which many stateoftheart competitors are not able to provide. This makes MCMC particularly well suited to e.g. clinical practice, where understanding uncertainties is crucial for medical outcomes.

Key Findings 
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Potential use in nonacademic contexts 
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Impacts 
Description 
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk 
Summary 

Date Materialised 


Sectors submitted by the Researcher 
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Project URL: 

Further Information: 

Organisation Website: 
http://www.warwick.ac.uk 