EPSRC logo

Details of Grant 

EPSRC Reference: EP/G010587/1
Title: Tolerating faults in interconnection networks for parallel computing
Principal Investigator: Stewart, Professor IA
Other Investigators:
Researcher Co-Investigators:
Project Partners:
Department: Computer Science
Organisation: Durham, University of
Scheme: Standard Research
Starts: 01 February 2009 Ends: 31 January 2012 Value (£): 275,210
EPSRC Research Topic Classifications:
Fundamentals of Computing Parallel Computing
EPSRC Industrial Sector Classifications:
No relevance to Underpinning Sectors
Related Grants:
Panel History:
Panel DatePanel NameOutcome
17 Jul 2008 ICT Prioritisation Panel (July 2008) Deferred
08 Sep 2008 ICT Prioritisation Panel (September 08) Announced
Summary on Grant Application Form
In distributed-memory multiprocessors, the dominant factor inhibiting faster global computations is inter-processor communication. Communication is dependent upon the topology of the interconnection network, the routing mechanism, the flow control policy , and the method of switching. We are concerned with issues relating to the topology of the interconnection network. The choice of how we connect processors in a distributed-memory multiprocessor is a fundamental design decision. There are numerous, often conflicting, considerations to bear in mind. For instance, we would like our interconnection network to be symmetric (to make programming and analysis easier), have small diameter (to lessen message-passing latency), be recursively decomposable (to aid scalability), be highly connected (to improve fault-tolerance and reliability), be regular of low degree (to lessen communication overheads and design complexity), support rapid and easy inter-processor communication, support the simulation of other machines based on other topologies, and so on. These properties all give rise to improved computational performance. However, there does not exist an interconnection network that is optimal on all counts and trade-offs have to be made. A multitude of interconnection networks have been proposed with each of these networks having some good (topological) properties and some not so good. When building distributed-memory multiprocessors with massive numbers of processors, some capacity for fault-tolerance is required, for one would still wish the machine to be operative under (a limited number of) processor or link faults. As to what one requires in terms of fault-tolerance depends upon the context, but the minimal requirement is that the (non-faulty portion of the) interconnection network should remain connected. However, usually more is required. Other important properties relevant to parallel computing include Hamiltonicity properties, for the existence of Hamiltonian cycles in networks is of crucial importance, given the ubiquity of such cycles as data structures in many distributed algorithms (they are primarily used to facilitate message-passing). Not only is the existence of Hamiltonian cycles of great importance but also the existence of Hamiltonian paths, and more generally the existence of cycles and paths of different lengths. The existence of Hamiltonian (or, at least, long) paths is extremely useful as we regularly need to simulate linear-array computations in distributed-memory multiprocessors; having a long path allows us to cater for such simulations where there are many different array-lengths involved in the simulations. In addition, given the ubiquity of cycle-based computations and algorithms in parallel computation, not only is the simulation of linear-array-based computations important but so is the simulation of cycle-based computations (of varying lengths).The research in this proposal is all about the toleration of faults in interconnection networks. There are three threads to the research: the study of the existence of paths and cycles (of varying lengths) in various interconnection networks under conditional fault assumptions (that is, asumptions on the distributions of the faults in the network); the study of fault-tolerance in Optical Transpose Interconnect System (OTIS) networks; and the distributed construction of embedded structures within a faulty network.
Key Findings
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Potential use in non-academic contexts
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Impacts
Description This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Summary
Date Materialised
Sectors submitted by the Researcher
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
Project URL:  
Further Information:  
Organisation Website: