TY - GEN
T1 - Nonasymptotic Analysis of Stochastic Gradient Descent with the Richardson-Romberg Extrapolation
AU - Sheshukova, Marina
AU - Belomestny, Denis
AU - Durmus, Alain
AU - Moulines, Eric
AU - Naumov, Alexey
AU - Samsonov, Sergey
N1 - Publisher Copyright:
© 2025 13th International Conference on Learning Representations, ICLR 2025. All rights reserved.
PY - 2025/1/1
Y1 - 2025/1/1
N2 - We address the problem of solving strongly convex and smooth minimization problems using the stochastic gradient descent (SGD) algorithm with a constant step size. Previous works suggested combining the Polyak-Ruppert averaging procedure with the Richardson-Romberg extrapolation to reduce the asymptotic bias of SGD at the expense of a mild increase in the variance. We significantly extend previous results by providing an expansion of the mean-squared error of the resulting estimator with respect to the number of iterations n. We show that the root mean-squared error can be decomposed into the sum of two terms: a leading one of order O(n^{-1/2}) with explicit dependence on a minimax-optimal asymptotic covariance matrix, and a second-order term of order O(n^{-3/4}), where the exponent 3/4 is the best known. We also extend this result to higher-order moment bounds. Our analysis relies on the properties of the SGD iterates viewed as a time-homogeneous Markov chain. In particular, we establish that this chain is geometrically ergodic with respect to a suitably defined weighted Wasserstein semimetric.
AB - We address the problem of solving strongly convex and smooth minimization problems using the stochastic gradient descent (SGD) algorithm with a constant step size. Previous works suggested combining the Polyak-Ruppert averaging procedure with the Richardson-Romberg extrapolation to reduce the asymptotic bias of SGD at the expense of a mild increase in the variance. We significantly extend previous results by providing an expansion of the mean-squared error of the resulting estimator with respect to the number of iterations n. We show that the root mean-squared error can be decomposed into the sum of two terms: a leading one of order O(n^{-1/2}) with explicit dependence on a minimax-optimal asymptotic covariance matrix, and a second-order term of order O(n^{-3/4}), where the exponent 3/4 is the best known. We also extend this result to higher-order moment bounds. Our analysis relies on the properties of the SGD iterates viewed as a time-homogeneous Markov chain. In particular, we establish that this chain is geometrically ergodic with respect to a suitably defined weighted Wasserstein semimetric.
UR - https://www.scopus.com/pages/publications/105010219451
M3 - Conference contribution
AN - SCOPUS:105010219451
T3 - 13th International Conference on Learning Representations, ICLR 2025
SP - 64100
EP - 64130
BT - 13th International Conference on Learning Representations, ICLR 2025
PB - International Conference on Learning Representations, ICLR
T2 - 13th International Conference on Learning Representations, ICLR 2025
Y2 - 24 April 2025 through 28 April 2025
ER -