Partial differential equations preconditioner resilient to soft and hard faults

  • Francesco Rizzi
  • Karla Morris
  • Khachik Sargsyan
  • Paul Mycek
  • Cosmin Safta
  • Olivier Le Maitre
  • Omar Knio
  • Bert Debusschere

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

We present a domain-decomposition-based preconditioner for the solution of partial differential equations (PDEs) that is resilient to both soft and hard faults. The algorithm proceeds in four steps: first, the computational domain is split into overlapping subdomains; second, the target PDE is solved on each subdomain for sampled values of the current local boundary conditions; third, the subdomain solution samples are collected and fed into a regression step to build maps between the subdomains' boundary conditions; finally, the intersection of these maps yields the updated state at the subdomain boundaries. This reformulation allows us to recast the problem as a set of independent tasks. The implementation relies on an asynchronous server-client framework, in which one or more reliable servers hold the data while the clients ask for tasks and execute them. This framework provides resiliency to hard faults: if a client crashes, it stops asking for work, and the servers simply distribute the work among the clients that remain alive. Erroneous subdomain solves (e.g. due to soft faults) appear as corrupted data, which is either rejected if it causes a task to fail, or is filtered out during the regression stage through a suitable noise model. Three types of faults are modeled: hard faults, in which nodes (or clients) crash; soft faults occurring during the communication of tasks between server and clients; and soft faults occurring during task execution. We demonstrate the resiliency of the approach for a 2D elliptic PDE, and explore the effect of the faults at various failure rates.
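The four steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and makes several simplifying assumptions: a 1D toy problem (u'' = 0 on [0, 1] with Dirichlet data) stands in for the 2D elliptic PDE, there are only two overlapping subdomains, the server-client layer is omitted, and a Theil-Sen median-slope fit stands in for the regression noise model, which the abstract does not specify. All function names and parameters (`solve_subdomain`, `robust_line`, `resilient_interface_values`, `corrupt_index`) are hypothetical.

```python
import itertools
import numpy as np


def solve_subdomain(x_left, x_right, u_left, u_right, n=21):
    """Finite-difference solve of u'' = 0 on [x_left, x_right] with Dirichlet data."""
    x = np.linspace(x_left, x_right, n)
    A = np.zeros((n, n))
    rhs = np.zeros(n)
    A[0, 0] = A[-1, -1] = 1.0          # boundary rows enforce the Dirichlet values
    rhs[0], rhs[-1] = u_left, u_right
    for i in range(1, n - 1):          # interior rows: second-difference stencil
        A[i, i - 1], A[i, i], A[i, i + 1] = 1.0, -2.0, 1.0
    return x, np.linalg.solve(A, rhs)


def robust_line(x, y):
    """Theil-Sen fit (assumed stand-in for the noise model): median pairwise slope."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in itertools.combinations(range(len(x)), 2)]
    slope = np.median(slopes)
    return slope, np.median(y - slope * x)


def resilient_interface_values(a=0.0, b=1.0, x1=0.4, x2=0.6, corrupt_index=3):
    """Return (u(x1), u(x2)) for subdomains [0, x2] and [x1, 1] of the toy problem."""
    samples = np.linspace(-1.0, 2.0, 9)  # sampled boundary values
    # Subdomain 1 = [0, x2]: sample its right boundary value, record u at x1.
    out1 = np.array([np.interp(x1, *solve_subdomain(0.0, x2, a, g)) for g in samples])
    # Subdomain 2 = [x1, 1]: sample its left boundary value, record u at x2.
    out2 = np.array([np.interp(x2, *solve_subdomain(x1, 1.0, h, b)) for h in samples])
    out1[corrupt_index] += 50.0          # emulate one soft fault in a subdomain solve
    s1, c1 = robust_line(samples, out1)  # map: u(x1) = s1 * u(x2) + c1
    s2, c2 = robust_line(samples, out2)  # map: u(x2) = s2 * u(x1) + c2
    # Intersection of the two maps: solve the 2x2 linear system for (u(x1), u(x2)).
    M = np.array([[1.0, -s1], [-s2, 1.0]])
    u1, u2 = np.linalg.solve(M, np.array([c1, c2]))
    return u1, u2


if __name__ == "__main__":
    print(resilient_interface_values())
```

With the default arguments the exact solution of the toy problem is u(x) = x, so the recovered interface values should be close to 0.4 and 0.6 even though one sample was corrupted. Each subdomain solve in the list comprehensions is independent of the others, which is the property that would let a server hand them out as separate tasks.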

Original language: English
Title of host publication: Proceedings - 2015 IEEE International Conference on Cluster Computing, CLUSTER 2015
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 552-562
Number of pages: 11
ISBN (Electronic): 9781467365987
DOIs
Publication status: Published - 26 Oct 2015
Externally published: Yes
Event: IEEE International Conference on Cluster Computing, CLUSTER 2015 - Chicago, United States
Duration: 8 Sept 2015 to 11 Sept 2015

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
Volume: 2015-October
ISSN (Print): 1552-5244

Conference

Conference: IEEE International Conference on Cluster Computing, CLUSTER 2015
Country/Territory: United States
City: Chicago
Period: 8/09/15 to 11/09/15

Keywords

  • Client-server systems
  • Distributed computing
  • Fault tolerance
  • Fault tolerant systems
  • High performance computing
  • Message passing
  • Parallel algorithms
  • Parallel programming
  • Partial differential equations
  • Resilience
  • Scientific computing
  • Software engineering
  • Supercomputers
