Details of Grant

EPSRC Reference:

EP/F02682X/1

Title:

Pattern matching algorithms for massive datasets

Principal Investigator:

Clifford, Dr R

Other Investigators:

Researcher Co-Investigators:

Project Partners:

Department:

Computer Science

Organisation:

University of Bristol

Scheme:

First Grant Scheme

Starts:

18 February 2008

Ends:

17 August 2011

Value (£):

278,545

EPSRC Research Topic Classifications:

Fundamentals of Computing

EPSRC Industrial Sector Classifications:

No relevance to Underpinning Sectors

Related Grants:

Panel History:

Panel Date	Panel Name	Outcome
18 Oct 2007	ICT Prioritisation Panel (Technology)	Announced

Summary on Grant Application Form

This project aims to provide the tools necessary for pattern matchingin massive datasets in the 21st century. Pattern matching problemsare pervasive and it is therefore hard to overstate theirimportance. The hugely successful new field of high throughputcomputational genetics, which is the lifeblood of pharmaceuticalindustries, is founded on the ability to perform approximate stringmatching accurately and quickly. Perhaps more mundanely but no lesssignificant economically, linear time exact matching algorithms arenow taken for granted as basic tools in every text editor and wordprocessor used today.Despite the success that pattern matching algorithms continue toenjoy, new problems in urgent need of a solution arise continually.These revolve around data processing applications where the datasetsare massive, subject to error or ambiguity and where processing isrequired online or in real-time. For example, the problem of exactmatching has well known optimal solutions both for online search andwhen the data to be queried can be indexed beforehand. However,unlike exact matching, the problem of finding the fastest algorithmsfor approximate matching has still not been resolved under almost anymeasure of similarity. Where the data is of a non-standard form, forexample consisting of numerical rather than symbolic information, evenless is currently known about how to search or index the informationefficiently.Another vital difference between the old and new settings is notsimply in the quantity of data available but also the ways in which ithas to be processed. The public genome sequencing projects, forexample, have produced 100s of gigabytes of sequence and related metadata. However these datasets are relatively straightforward to handlecompared to the processing of information passing through Internetrouters and over telephone wires every day or stored in the World WideWeb. In this situation it is not sufficient simply that an algorithmsruns fast. Ideally it should also require considerably less space thanthe input, update at least as quickly as the new data are arriving andstill be able to perform complex queries on the whole dataset.This project will directly address these two separate but interrelatedchallenges. Real-time and online matching algorithms will be developedto handle situations where vast amounts of data are streaming past atvery high rates. The project will also consider new forms ofapproximation and present fast algorithmic solutions that will allowdatasets that result from modern applications and industries to besearched for approximate matches without the need to rely onheuristics. Finally as part of the work on improved methods forapproximate matching, this project will develop faster and smallerindexes that will for the first time allow approximate matching ontruly massive datasets to become feasible in practice.

Key Findings

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Potential use in non-academic contexts

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Impacts

Description	This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Summary
Date Materialised

Sectors submitted by the Researcher

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Project URL:

Further Information:

Organisation Website:

http://www.bris.ac.uk