The Heavy-Tail Phenomenon in SGD
April 20, 2023 | Babbio 2nd FL Room 219 & Zoom | 5:00 - 6:00 PM
In recent years, various notions of capacity and complexity have been proposed for characterizing the generalization properties of stochastic gradient descent (SGD) in deep learning. Some of the popular notions that correlate well with the performance on unseen data are (i) the flatness of the local minimum found by SGD, which is related to the eigenvalues of the Hessian, (ii) the ratio of the step size to the batch size, which essentially controls the magnitude of the stochastic gradient noise, and (iii) the tail-index, which measures the heaviness of the tails of the network weights at convergence. In this paper, we argue that these three seemingly unrelated perspectives on generalization are deeply linked to each other. We claim that, depending on the structure of the Hessian of the loss at the minimum and the choices of the algorithm parameters, the distribution of the SGD iterates will converge to a heavy-tailed stationary distribution. We rigorously prove this claim in the setting of quadratic optimization: we show that even in a simple linear regression problem with independent and identically distributed data whose distribution has finite moments of all orders, the iterates can be heavy-tailed with infinite variance. We further characterize the behavior of the tails with respect to the algorithm parameters, the dimension, and the curvature. We then translate our results into insights about the behavior of SGD in deep learning. We support our theory with experiments conducted on synthetic data and on fully connected and convolutional neural networks. This is based on joint work with Mert Gurbuzbalaban and Umut Simsekli.
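The linear regression claim in the abstract can be illustrated with a small simulation. The sketch below is not taken from the paper or the talk. It assumes a one-dimensional model y = a * x_star + eps with Gaussian input a ~ N(0, sigma^2), Gaussian noise eps, and batch size 1, so that the SGD error e_k = x_k - x_star follows e_{k+1} = (1 - eta * a^2) * e_k + eta * a * eps. Under these assumptions, E[(1 - eta * a^2)^2] = 1 - 2 * eta * sigma^2 + 3 * eta^2 * sigma^4, which is at least 1 once eta * sigma^2 >= 2/3, so the stationary variance cannot be finite in that regime. The function names, step sizes, and sample sizes are illustrative choices only.

```python
import numpy as np


def second_moment_factor(eta, sigma):
    """E[(1 - eta * a**2)**2] for a ~ N(0, sigma**2).

    Uses E[a^2] = sigma^2 and E[a^4] = 3 * sigma^4. A value >= 1 means the
    second moment of the error cannot settle to a finite stationary value.
    """
    return 1.0 - 2.0 * eta * sigma**2 + 3.0 * eta**2 * sigma**4


def simulate_sgd_errors(eta, sigma=1.0, noise_std=1.0,
                        n_chains=20000, n_steps=500, seed=0):
    """Run many independent 1-D SGD chains (batch size 1) and return the
    final errors x_k - x_star, one per chain."""
    rng = np.random.default_rng(seed)
    err = np.zeros(n_chains)
    for _ in range(n_steps):
        a = rng.normal(0.0, sigma, size=n_chains)        # fresh input sample
        eps = rng.normal(0.0, noise_std, size=n_chains)  # fresh label noise
        # SGD step on 0.5 * (a * x - y)**2, written in terms of the error.
        err = (1.0 - eta * a**2) * err + eta * a * eps
    return err


def tail_ratio(samples, q=99.0):
    """Ratio of the q-th percentile to the median of |samples|.

    A crude heaviness indicator (an assumption of this sketch, not a
    tail-index estimator): heavier tails give a larger ratio.
    """
    abs_s = np.abs(samples)
    return np.percentile(abs_s, q) / np.median(abs_s)


if __name__ == "__main__":
    for eta in (0.1, 1.0):
        errors = simulate_sgd_errors(eta)
        print(f"eta={eta}: second-moment factor={second_moment_factor(eta, 1.0):.2f}, "
              f"tail ratio={tail_ratio(errors):.1f}")
```

With sigma = 1, the smaller step size keeps the second-moment factor below 1, while the larger one pushes it above 1. The tail ratio of the simulated errors should be markedly larger in the second case, even though the data are Gaussian and have finite moments of all orders.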
Lingjiong Zhu received his BA from the University of Cambridge in 2008 and his PhD from New York University in 2013. He worked at Morgan Stanley and the University of Minnesota before joining the faculty at Florida State University in 2015, where he is currently an Associate Professor. His research interests include applied probability, data science, financial engineering, and operations research. His work has been published in many leading outlets, including Annals of Applied Probability, Finance and Stochastics, ICML, INFORMS Journal on Computing, Journal of Machine Learning Research, NeurIPS, Production and Operations Management, SIAM Journal on Financial Mathematics, and Operations Research. His research has been supported by three NSF grants and a Simons Collaboration Grant. He received the Kurt O. Friedrichs Prize for an outstanding dissertation from the Courant Institute, New York University, in 2013 and the Developing Scholar Award from Florida State University in 2022.