TY - GEN
T1 - AKSEL: Fast Byzantine SGD
T2 - 24th International Conference on Principles of Distributed Systems, OPODIS 2020
AU - Boussetta, Amine
AU - El-Mhamdi, El Mahdi
AU - Guerraoui, Rachid
AU - Maurer, Alexandre
AU - Rouault, Sébastien
N1 - Publisher Copyright:
© Amine Boussetta, El-Mahdi El-Mhamdi, Rachid Guerraoui, Alexandre Maurer, and Sébastien Rouault; licensed under Creative Commons License CC-BY. 24th International Conference on Principles of Distributed Systems (OPODIS 2020).
PY - 2021/1/1
Y1 - 2021/1/1
AB - Modern machine learning architectures distinguish servers and workers. Typically, a d-dimensional model is hosted by a server and trained by n workers, using a distributed stochastic gradient descent (SGD) optimization scheme. At each SGD step, the goal is to estimate the gradient of a cost function. The simplest way to do this is to average the gradients estimated by the workers. However, averaging is not resilient to even a single Byzantine failure of a worker. Many alternative gradient aggregation rules (GARs) have recently been proposed to tolerate a maximum number f of Byzantine workers. These GARs differ according to (1) their time complexity, (2) the maximal number of Byzantine workers despite which convergence can still be ensured (breakdown point), and (3) their accuracy, which can be captured by (3.1) their angular error, namely the angle with the true gradient, as well as (3.2) their ability to aggregate full gradients. In particular, many do not aggregate full gradients, for they operate on each dimension separately, which results in a coordinate-wise blended gradient, leading to low accuracy in practical situations where the number (s) of workers that are actually Byzantine in an execution is small (s << f).
KW - Byzantine failures
KW - Machine learning
KW - Stochastic gradient descent
U2 - 10.4230/LIPIcs.OPODIS.2020.8
DO - 10.4230/LIPIcs.OPODIS.2020.8
M3 - Conference contribution
AN - SCOPUS:85101726840
T3 - Leibniz International Proceedings in Informatics, LIPIcs
BT - 24th International Conference on Principles of Distributed Systems, OPODIS 2020
A2 - Bramas, Quentin
A2 - Oshman, Rotem
A2 - Romano, Paolo
PB - Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing
Y2 - 14 December 2020 through 16 December 2020
ER -