Robustness analysis of non-convex stochastic gradient descent using biased expectations

Abstract
This work proposes a novel analysis of stochastic gradient descent (SGD) for non-convex and smooth optimization. Our analysis sheds light on the impact of the probability distribution of the gradient noise on the convergence rate of the norm of the gradient. In the case of sub-Gaussian and centered noise, we prove that, with probability 1 − δ, the number of iterations needed to reach a precision ε for the squared gradient norm is O(ε^(-2) ln(1/δ)). In the case of centered and integrable heavy-tailed noise, we show that, while the expectation of the iterates may be infinite, the squared gradient norm still converges with probability 1 − δ in O(ε^(-p) δ^(-q)) iterations, where p, q > 2. This result shows that heavy-tailed gradient noise slows down the convergence of SGD without preventing it, proving that SGD is robust to gradient noise with unbounded variance, a setting of interest for deep learning. In addition, it indicates that choosing a step size proportional to T^(-1/β), where β is the tail parameter of the noise and T is the number of iterations, leads to the best convergence rates. Both results are simple corollaries of a unified analysis using the novel concept of biased expectations, a simple and intuitive mathematical tool for obtaining concentration inequalities. Using this concept, we propose a new quantity to measure the amount of noise added to the gradient, and discuss its value in multiple scenarios.
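The step-size rule above can be illustrated with a minimal sketch. Everything here is an illustrative assumption rather than the paper's construction: the test function, the symmetric Pareto noise model, the starting point, and the horizon T are all chosen for demonstration only.

```python
import math
import random

def grad_f(x):
    # Gradient of the smooth non-convex test function f(x) = x^2 + 3*sin(x)^2
    # (illustrative choice, not from the paper).
    return 2.0 * x + 3.0 * math.sin(2.0 * x)

def heavy_tailed_noise(beta, rng):
    # Symmetric Pareto noise with tail index beta: centered by symmetry,
    # and with infinite variance whenever beta <= 2.
    return rng.choice([-1.0, 1.0]) * rng.paretovariate(beta)

def sgd_min_sq_grad(T, beta, x0=3.0, seed=0):
    # Run T steps of SGD with a step size proportional to T^(-1/beta),
    # the scaling suggested in the abstract, and return the smallest
    # squared gradient norm observed along the trajectory.
    rng = random.Random(seed)
    step = T ** (-1.0 / beta)
    x, best = x0, grad_f(x0) ** 2
    for _ in range(T):
        g = grad_f(x)
        best = min(best, g * g)
        x -= step * (g + heavy_tailed_noise(beta, rng))
    return best

print(sgd_min_sq_grad(T=20000, beta=1.5))
```

With β = 1.5 the noise has a finite mean but infinite variance, matching the unbounded-variance regime the abstract describes; the minimum squared gradient norm along the trajectory is the quantity whose convergence the paper's bounds control.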
| Original language | English |
|---|---|
| Journal | Advances in Neural Information Processing Systems |
| Volume | 2020-December |
| Publication status | Published - 1 Jan 2020 |
| Externally published | Yes |
| Event | 34th Conference on Neural Information Processing Systems, NeurIPS 2020, Virtual/Online, 6 Dec 2020 – 12 Dec 2020 |