EPSRC Reference: |
EP/G010587/1 |
Title: |
Tolerating faults in interconnection networks for parallel computing |
Principal Investigator: |
Stewart, Professor IA |
Other Investigators: |
|
Researcher Co-Investigators: |
|
Project Partners: |
|
Department: |
Computer Science |
Organisation: |
Durham, University of |
Scheme: |
Standard Research |
Starts: |
01 February 2009 |
Ends: |
31 January 2012 |
Value (£): |
275,210
|
EPSRC Research Topic Classifications: |
Fundamentals of Computing |
Parallel Computing |
|
EPSRC Industrial Sector Classifications: |
No relevance to Underpinning Sectors |
|
|
Related Grants: |
|
Panel History: |
|
Summary on Grant Application Form |
In distributed-memory multiprocessors, the dominant factor inhibiting faster global computations is inter-processor communication. Communication is dependent upon the topology of the interconnection network, the routing mechanism, the flow control policy , and the method of switching. We are concerned with issues relating to the topology of the interconnection network. The choice of how we connect processors in a distributed-memory multiprocessor is a fundamental design decision. There are numerous, often conflicting, considerations to bear in mind. For instance, we would like our interconnection network to be symmetric (to make programming and analysis easier), have small diameter (to lessen message-passing latency), be recursively decomposable (to aid scalability), be highly connected (to improve fault-tolerance and reliability), be regular of low degree (to lessen communication overheads and design complexity), support rapid and easy inter-processor communication, support the simulation of other machines based on other topologies, and so on. These properties all give rise to improved computational performance. However, there does not exist an interconnection network that is optimal on all counts and trade-offs have to be made. A multitude of interconnection networks have been proposed with each of these networks having some good (topological) properties and some not so good. When building distributed-memory multiprocessors with massive numbers of processors, some capacity for fault-tolerance is required, for one would still wish the machine to be operative under (a limited number of) processor or link faults. As to what one requires in terms of fault-tolerance depends upon the context, but the minimal requirement is that the (non-faulty portion of the) interconnection network should remain connected. However, usually more is required. Other important properties relevant to parallel computing include Hamiltonicity properties, for the existence of Hamiltonian cycles in networks is of crucial importance, given the ubiquity of such cycles as data structures in many distributed algorithms (they are primarily used to facilitate message-passing). Not only is the existence of Hamiltonian cycles of great importance but also the existence of Hamiltonian paths, and more generally the existence of cycles and paths of different lengths. The existence of Hamiltonian (or, at least, long) paths is extremely useful as we regularly need to simulate linear-array computations in distributed-memory multiprocessors; having a long path allows us to cater for such simulations where there are many different array-lengths involved in the simulations. In addition, given the ubiquity of cycle-based computations and algorithms in parallel computation, not only is the simulation of linear-array-based computations important but so is the simulation of cycle-based computations (of varying lengths).The research in this proposal is all about the toleration of faults in interconnection networks. There are three threads to the research: the study of the existence of paths and cycles (of varying lengths) in various interconnection networks under conditional fault assumptions (that is, asumptions on the distributions of the faults in the network); the study of fault-tolerance in Optical Transpose Interconnect System (OTIS) networks; and the distributed construction of embedded structures within a faulty network.
|
Key Findings |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Potential use in non-academic contexts |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Impacts |
Description |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk |
Summary |
|
Date Materialised |
|
|
Sectors submitted by the Researcher |
This information can now be found on Gateway to Research (GtR) http://gtr.rcuk.ac.uk
|
Project URL: |
|
Further Information: |
|
Organisation Website: |
|