Transparent high-speed network checkpoint/Restart in MPI

Julien Adam, Sameer Shende, Jean Baptiste Besnard, Marc Pérache, Julien Jaeger, Allen D. Malony, Patrick Carribault

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Fault-tolerance has always been an important topic when it comes to running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing units. Moreover, the larger the jobs are, the more computing hours are wasted by a crash. In this paper, we describe the work done in our MPI runtime to enable a transparent checkpointing mechanism. Unlike the MPI 4.0 User-Level Failure Mitigation (ULFM) interface, our work targets solely Checkpoint/Restart (C/R) and ignores wider features such as resiliency. We show how existing transparent checkpointing methods can be practically applied to MPI implementations given sufficient collaboration from the MPI runtime. Our C/R technique is then measured on MPI benchmarks such as IMB and Lulesh relying on an Infiniband high-speed network, demonstrating that the chosen approach is sufficiently general and that performance is mostly preserved. We argue that enabling fault-tolerance without any modification inside target MPI applications is possible, and show how it could be the first step toward more integrated resiliency combined with failure mitigation like ULFM.

Original language: English
Title of host publication: EuroMPI 2018 - Proceedings of the 25th European MPI Users' Group Meeting
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450364928
Publication status: Published - 23 Sept 2018
Externally published: Yes
Event: 25th European MPI Users' Group Meeting, EuroMPI 2018 - Barcelona, Spain
Duration: 23 Sept 2018 → 26 Sept 2018

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 25th European MPI Users' Group Meeting, EuroMPI 2018
Country/Territory: Spain
City: Barcelona
Period: 23/09/18 → 26/09/18

Keywords

  • Checkpoint-Restart
  • DMTCP
  • Fault-Tolerance
  • Infiniband
