Theoretical limits of pipeline parallel optimization and application to distributed deep learning

Research output: Contribution to journalConference articlepeer-review

Abstract

We investigate the theoretical limits of pipeline parallel learning of deep learning architectures, a distributed setup in which the computation is distributed per layer instead of per example. For smooth convex and non-convex objective functions, we provide matching lower and upper complexity bounds and show that a naive pipeline parallelization of Nesterov's accelerated gradient descent is optimal. For non-smooth convex functions, we provide a novel algorithm coined Pipeline Parallel Random Smoothing (PPRS) that is within a d1/4 multiplicative factor of the optimal convergence rate, where d is the underlying dimension. While the convergence rate still obeys a slow e-2 convergence rate, the depth-dependent part is accelerated, resulting in a near-linear speed-up and convergence time that only slightly depends on the depth of the deep learning architecture. Finally, we perform an empirical analysis of the non-smooth non-convex case and show that, for difficult and highly non-smooth problems, PPRS outperforms more traditional optimization algorithms such as gradient descent and Nesterov's accelerated gradient descent for problems where the sample size is limited, such as few-shot or adversarial learning.

Original languageEnglish
JournalAdvances in Neural Information Processing Systems
Volume32
Publication statusPublished - 1 Jan 2019
Externally publishedYes
Event33rd Annual Conference on Neural Information Processing Systems, NeurIPS 2019 - Vancouver, Canada
Duration: 8 Dec 201914 Dec 2019

Fingerprint

Dive into the research topics of 'Theoretical limits of pipeline parallel optimization and application to distributed deep learning'. Together they form a unique fingerprint.

Cite this