Details of Grant

EPSRC Reference:

EP/J017108/1

Title:

Approximate Matching of Sequences.

Principal Investigator:

Crochemore, Professor M

Other Investigators:

Researcher Co-Investigators:

Project Partners:

Department:

Computer Science

Organisation:

Kings College London

Scheme:

Standard Research

Starts:

01 June 2012

Ends:

30 November 2013

Value (£):

156,691

EPSRC Research Topic Classifications:

Fundamentals of Computing

EPSRC Industrial Sector Classifications:

No relevance to Underpinning Sectors

Related Grants:

Panel History:

Panel Date	Panel Name	Outcome
01 Feb 2012	EPSRC ICT Responsive Mode - Feb 2012	Announced

Summary on Grant Application Form

One of the most ancient, fundamental and still active research field in theoretical computer science concerns Pattern Matching and its many approaches. The classic pattern matching problem consists in recognizing a pattern P as "matching" or "not matching" within a given text T and of listing all the occurrence positions, if any. It is

straightforward that the text may be whatever kind of input data and the pattern is usually a smaller piece of the same kind of data. One instance of the problem is to suppose that the text is provided before the pattern.

Then the text can be preprocessed to build an index data structure that accelerates dramatically further searches.

There exist several such data structures. This branch of pattern matching is usually called Text Indexing.

Its natural application scenario is when the text does not change or changes slowly with respect to the great amount of pattern search queries, e.g. a sequenced genome, a digital library or a biometric database.

The most important variant of Text Indexing concerns "approximate matching" of the pattern with text segments.

It is called Approximate Pattern Matching. Approximate means that an occurrence of the pattern within the text is considered a match even if it presents some errors, but if the distance (resp. similarity) between the pattern and the text segment is under (resp. over) a certain threshold.

Approximate variants of text indexing fit well with scenarios where a mistyping errors may occurs, the data sampling process is prone to introduce errors, the data source is not constant over the time as for biometric data, or when we are just looking for data similarity, e.g. for genes mutations in DNA sequences.

Pattern matching plays a fundamental role in many research and application fields, like information retrieval, data mining, bioinformatics, molecular biology functional and structural analysis, molecular pharmacology, computational musicology, network security, biometric identification, and many others. Among these, bioinformatics and computational molecular biology are the domains that mostly and directly benefit from any enhancement of pattern matching theoretical knowledge and solutions.

While straight text indexing counts many efficient solutions, its approximate versions have not found optimal solutions yet. Several solutions exploit the property that an approximate match is usually composed of some shorter and closer

matches, i.e. exact matches or approximate matches with a smaller number of errors. This approach does not provide any breakthrough towards an optimal solution.

Our approach is to design data structures and their related algorithms to get an efficient solution to the problem. We have explored and experimented several solutions that provided remarkable potentialities. They need to be developed further and tailored to some applications. Implementation for external memory and for distributed calculus shall be

investigated. As a by-product some theoretical results related to the distance between sequences should be achieved. The complexity of many classical result will be reviewed to highlight their dependence from the maximal-repetition-with-error parameter. All this is expected to provide a substantial step forward in the pattern matching research field.

The technological breakthrough proposed in this research allows to dramatically decrease the computational cost of detecting local similarities within a large family of biological sequences and of performing some approximate pattern matching related tasks. Furthermore, our designed algorithms will extend the range of applications of approximate pattern matching based solutions as they will allow to accomplish more complex investigations on sequences with the same computational resource either at the level of personal computer scale solutions, and of big research centres.

Key Findings

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Potential use in non-academic contexts

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Impacts

Description	This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Summary
Date Materialised

Sectors submitted by the Researcher

This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk

Project URL:

Further Information:

Organisation Website: