Unified model for assessing checkpointing protocols at extreme-scale

  • George Bosilca
  • Aurélien Bouteiller
  • Elisabeth Brunet
  • Franck Cappello
  • Jack Dongarra
  • Amina Guermouche
  • Thomas Herault
  • Yves Robert
  • Frédéric Vivien
  • Dounia Zaidouni

  • University of Tennessee
  • University of Illinois at Urbana-Champaign
  • Ecole Normale Supérieure de Lyon

Research output: Contribution to journal › Article › Peer-reviewed

Abstract

In this paper, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them, and compare the expected efficiency of the fault tolerant protocols, for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available high performance computing platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation.

Original language: English
Pages (from-to): 2772-2791
Number of pages: 20
Journal: Concurrency and Computation: Practice and Experience
Volume: 26
Issue number: 17
DOIs
Status: Published - 10 Dec 2014
