EPSRC Reference: 
EP/R013381/1 
Title: 
Statistical and Computational Challenges in High-dimensional Data Analysis 
Principal Investigator: 
Shah, Dr RD 
Other Investigators: 

Researcher Co-Investigators: 

Project Partners: 

Department: 
Pure Maths and Mathematical Statistics 
Organisation: 
University of Cambridge 
Scheme: 
First Grant - Revised 2009 
Starts: 
01 June 2018 
Ends: 
31 March 2021 
Value (£): 
100,220

EPSRC Research Topic Classifications: 
Statistics & Appl. Probability 


EPSRC Industrial Sector Classifications: 

Related Grants: 

Panel History: 

Summary on Grant Application Form 
We are living in an age of information: scientists, businesses and governments are collecting datasets of unprecedented size and complexity at an ever-increasing rate, with the hope of using statistics to discover patterns and help inform decisions that will shape the future of our society. Typically, datasets consist of observations (e.g. patients) on which a number of variables have been measured (e.g. height, weight). Whilst modern datasets can have many observations, the trend today is towards datasets with a very large number of variables. This is particularly true in genomics, where scientific advances have allowed researchers to collect detailed genetic information on patients amounting to thousands or even hundreds of thousands of variables. More generally, automated data collection has given rise to so-called high-dimensional datasets across a variety of disciplines. For example, in healthcare analytics, aspects of a patient's history can give rise to datasets with a huge number of variables indicating what combinations of drugs were prescribed at particular times.
The field of high-dimensional statistics is a response to the challenges posed by these sorts of datasets, which often render infeasible more traditional approaches designed for settings with only a handful of carefully chosen variables. Whilst much progress has been made, there remain several challenges, and this proposal will address some key outstanding methodological problems. Our methods will be applicable in a wide variety of settings, but two areas of application we will explore in collaboration are genomics and healthcare analytics. Our proposal consists of three projects, which are described below.
Often, along with the variables measured on a number of observations, we have an outcome or response of interest whose relationship with the variables we wish to learn from the data. In many cases, this relationship can be complex and depend on interactions between several groups of variables. Searching for combinations of variables which only together contribute to the response presents a serious computational challenge, as the number of subsets of variables to search through grows quickly with the size of the subset. Even examining interacting pairs of variables can be computationally infeasible when the number of variables is in the tens of thousands. A key contribution of our research will be to develop new methods that can scale efficiently to capture high-order interactions in high-dimensional data.
Uncertainty quantification for high-dimensional data, for instance producing p-values quantifying the significance of variables in determining the response, is crucial in order to avoid deriving false conclusions from data. However, research on this important topic is still in its infancy, with many existing approaches often highly unstable in practical settings. Our proposal will develop new robust and computationally efficient methods for p-value construction and other forms of uncertainty quantification for a variety of models.
In some settings we do not have a distinguished response but rather would like to understand relationships between the variables themselves. Graphical models provide a useful way to model such dependencies, but the available methods are often not scalable to the size of datasets now faced by many practitioners. We will use new computational techniques to develop randomised algorithms that avoid explicitly assessing each pair of variables to determine their relationship but can still deliver estimates of the strongest dependencies. The method will have broad applicability; for example, with biological data it can help to learn the network of dependencies governing the underlying biological processes.
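To give a rough sense of the scale involved, the short sketch below counts the candidate interactions of a given order among p variables, which is the binomial coefficient "p choose k". The value p = 20,000 and the function name n_interactions are illustrative assumptions, not figures or methods taken from the proposal. At that size there are already about 2 x 10^8 candidate pairs, and each extra order of interaction multiplies the count by several thousand.

```python
from math import comb


def n_interactions(p: int, order: int) -> int:
    """Count the distinct subsets of `order` variables among `p` variables."""
    return comb(p, order)


# Hypothetical dataset size chosen only for illustration.
P_VARIABLES = 20_000

for order in (2, 3, 4):
    # An exhaustive search would have to examine every one of these subsets.
    print(f"order {order}: {n_interactions(P_VARIABLES, order):,} candidate interactions")
```

This counts the candidates only; the cost of evaluating each one against the response comes on top of it.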

Key Findings 
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Potential use in non-academic contexts 
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Impacts 
Description 
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk 
Summary 

Date Materialised 


Sectors submitted by the Researcher 
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Project URL: 

Further Information: 

Organisation Website: 
http://www.cam.ac.uk 