Unraveling the mysteries of stochastic gradient descent on deep networks

We show that stochastic gradient descent (SGD) performs variational inference: it minimizes an average potential over the posterior distribution on weights with an entropic regularization term. For deep networks, this potential differs from the original loss used to compute gradients because the mini-batch gradient noise is highly non-isotropic. In this case, the most likely trajectories of SGD are not Brownian motion near critical points; rather, they are closed loops in the weight space. Joint work with Stefano Soatto.
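The non-isotropy of mini-batch gradient noise that the abstract refers to can be illustrated numerically. The following is a hypothetical toy sketch (not code from the talk): for a least-squares loss with anisotropic features, it estimates the mini-batch gradient noise covariance from per-example gradients and shows that its eigenvalues span orders of magnitude, i.e. the noise is far from isotropic.

```python
import numpy as np

# Toy illustration (assumed example, not from the talk): estimate the
# covariance of mini-batch gradient noise for a least-squares loss and
# check that it is highly non-isotropic.
rng = np.random.default_rng(0)
n, d = 1000, 5
# Features with very different scales per coordinate -> anisotropy
X = rng.normal(size=(n, d)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = rng.normal(size=d)  # current weights

# Per-example gradients of 0.5 * (x.w - y)^2 with respect to w
residual = X @ w - y                       # shape (n,)
per_example_grads = residual[:, None] * X  # shape (n, d)

# Mini-batch gradient noise covariance scales as (1/b) * cov of
# per-example gradients, for batch size b
b = 32
C = np.cov(per_example_grads, rowvar=False) / b
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]
print("noise covariance eigenvalues:", eigvals)
print("anisotropy ratio (max/min):", eigvals[0] / eigvals[-1])
```

The eigenvalue spread grows with the spread of feature scales; in deep networks the analogous covariance is shaped by the architecture and data, which is what makes the effective potential differ from the training loss.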

This presentation is part of Minisymposium “MS50 - Analysis, Optimization, and Applications of Machine Learning in Imaging (3 parts)”,
organized by Michael Moeller (University of Siegen) and Gitta Kutyniok (Technische Universität Berlin).

Pratik Chaudhari (University of California, Los Angeles)