Why Globally Re-shuffle? Revisiting Data Shuffling in Large Scale Deep Learning

  • Truong Thao Nguyen
  • Francois Trahay
  • Jens Domke
  • Aleksandr Drozd
  • Emil Vatai
  • Jianwei Liao
  • Mohamed Wahib
  • Balazs Gerofi
  • National Institute of Advanced Industrial Science and Technology
  • RIKEN Center for Computational Science
  • Tokyo Institute of Technology
  • Amigawa Gk
  • Southwest University

Research output: Chapter in book, report, anthology, or collection; Conference contribution; Peer-reviewed

Abstract

Stochastic gradient descent (SGD) is the most prevalent algorithm for training Deep Neural Networks (DNNs). SGD iterates over the input dataset in each training epoch, processing data samples in a random-access fashion. Because this puts enormous pressure on the I/O subsystem, the most common approach to distributed SGD in HPC environments is to replicate the entire dataset to node-local SSDs. However, due to rapidly growing dataset sizes, this approach has become increasingly infeasible. Surprisingly, the questions of why and to what extent random access is required have received little attention in the literature from an empirical standpoint. In this paper, we revisit data shuffling in DL workloads to investigate the viability of partitioning the dataset among workers and performing only a partial distributed exchange of samples in each training epoch. Through extensive experiments on up to 2,048 GPUs of ABCI and 4,096 compute nodes of Fugaku, we demonstrate that, in practice, the validation accuracy of global shuffling can be maintained when the partial distributed exchange is carefully tuned. We provide a solution implemented in PyTorch that enables users to control the proposed data exchange scheme.
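To make the idea of a partial distributed exchange concrete, the following is a minimal single-process sketch in plain Python. It is not the authors' PyTorch implementation: the function names `partition_dataset` and `partial_exchange`, the `fraction` parameter, and the pool-and-redeal exchange pattern are all assumptions chosen for illustration. Each simulated worker gives up a fixed fraction of its local samples, receives the same number back from a shared pool, and then shuffles locally.

```python
import random


def partition_dataset(num_samples, num_workers, seed=0):
    """Shuffle sample indices once and split them evenly among workers.

    Hypothetical helper: stands in for the initial partitioning step.
    """
    rng = random.Random(seed)
    indices = list(range(num_samples))
    rng.shuffle(indices)
    # Worker w takes every num_workers-th index, starting at offset w.
    return [indices[w::num_workers] for w in range(num_workers)]


def partial_exchange(partitions, fraction, rng):
    """Exchange a fraction of each worker's samples, then shuffle locally.

    Assumed scheme (not taken from the paper): every worker contributes
    int(fraction * local_size) randomly chosen samples to a shared pool,
    the pool is shuffled, and each worker receives back as many samples
    as it contributed, so partition sizes stay constant.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")

    kept, pool, counts = [], [], []
    for local in partitions:
        local = list(local)          # copy so the input is not mutated
        rng.shuffle(local)           # random choice of samples to send
        k = int(fraction * len(local))  # rounded down
        counts.append(k)
        pool.extend(local[:k])
        kept.append(local[k:])

    rng.shuffle(pool)                # simulate the distributed exchange

    new_partitions, offset = [], 0
    for local_kept, k in zip(kept, counts):
        received = pool[offset:offset + k]
        offset += k
        merged = local_kept + received
        rng.shuffle(merged)          # local shuffle for the next epoch
        new_partitions.append(merged)
    return new_partitions
```

In this sketch, `fraction=0.0` reduces to purely local shuffling, while `fraction=1.0` pools and redistributes every sample, which behaves like a global shuffle. Calling `partial_exchange` once per epoch with an intermediate value corresponds to the tunable middle ground the abstract describes.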

Original language: English
Title of host publication: Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1085-1096
Number of pages: 12
ISBN (Electronic): 9781665481069
Publication status: Published - 1 Jan 2022
Event: 36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022 - Virtual, Online, France
Duration: 30 May 2022 to 3 Jun 2022

Publication series

Name: Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022

Conference

Conference: 36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022
Country/Territory: France
City: Virtual, Online
Period: 30/05/22 to 3/06/22
