TY - GEN
T1 - Dependable parallel computing with agents based on a task graph model
AU - Chabridon, Sophie
AU - Gelenbe, Erol
N1 - Publisher Copyright: © 1995 IEEE.
PY - 1995/1/1
Y1 - 1995/1/1
AB - We discuss a novel technique for improving the dependability of parallel programs executing on an MIMD shared-memory architecture. The idea is to empower certain tasks of each application program to carry out failure detection and to reschedule the execution of tasks that are considered to have failed. The technique is based on a task graph representation of the parallel program, in which communications between tasks are deliberately isolated at the end of each task. We propose and evaluate several algorithms that detect failures and restart failed tasks. A discrete-event simulator is used to evaluate the performance of a specific parallel application, the Fast Fourier Transform, under failures when our detection and restart algorithms are in use.
KW - Dependability
KW - Discrete-event simulation
KW - Parallel computing
KW - Software-based failure detection and restart
UR - https://www.scopus.com/pages/publications/84891470357
U2 - 10.1109/EMPDP.1995.389188
DO - 10.1109/EMPDP.1995.389188
M3 - Conference contribution
AN - SCOPUS:84891470357
T3 - Proceedings - Euromicro Workshop on Parallel and Distributed Processing
SP - 350
EP - 357
BT - Proceedings - Euromicro Workshop on Parallel and Distributed Processing
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 1995 Euromicro Workshop on Parallel and Distributed Processing
Y2 - 25 January 1995 through 27 January 1995
ER -