We show that stochastic gradient descent (SGD) performs variational inference: it minimizes an average potential over the posterior distribution on the weights, together with an entropic regularization term. For deep networks, this potential differs from the original loss used to compute gradients, because the mini-batch gradient noise is highly non-isotropic. In this case, the most likely trajectories of SGD near critical points are not Brownian motion; they are closed loops in weight space. Joint work with Stefano Soatto.
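The claim hinges on the mini-batch gradient noise being non-isotropic. As a minimal sketch (a toy least-squares setup, not the talk's deep-network analysis; all variable names are illustrative), one can estimate the covariance of size-1 mini-batch gradients around the full gradient and check how far its eigenvalues are from being equal:

```python
import numpy as np

# Toy illustration: estimate the covariance of mini-batch gradient
# noise for a least-squares loss and check whether it is isotropic.
rng = np.random.default_rng(0)
n, d = 200, 5
# Anisotropic feature scales induce anisotropic gradient noise.
X = rng.normal(size=(n, d)) * np.array([1.0, 2.0, 0.5, 3.0, 0.1])
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = rng.normal(size=d)  # current weight vector

# Per-sample gradients of 0.5*(x_i @ w - y_i)^2: g_i = (x_i @ w - y_i) * x_i
resid = X @ w - y
G = resid[:, None] * X          # shape (n, d), one gradient per sample
full_grad = G.mean(axis=0)

# Covariance of a size-1 mini-batch gradient around the full gradient.
C = (G - full_grad).T @ (G - full_grad) / n
eig = np.linalg.eigvalsh(C)
print("eigenvalue spread (max/min):", eig.max() / eig.min())
```

For isotropic noise the ratio would be close to 1; here the anisotropic feature scales make it orders of magnitude larger, which is the regime the abstract refers to.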
This presentation is part of Minisymposium “MS50 - Analysis, Optimization, and Applications of Machine Learning in Imaging (3 parts)”,
organized by Michael Moeller (University of Siegen) and Gitta Kutyniok (Technische Universität Berlin).