Modeling and Learning Deep Representations, in Theory and in Practice

We establish a connection between nonconvex optimization of the kind used in deep learning and nonlinear partial differential equations (PDEs). We interpret empirically successful relaxation techniques, motivated by statistical physics, for training deep neural networks as solutions of a viscous Hamilton-Jacobi (HJ) PDE. The underlying stochastic control interpretation allows us to prove that these techniques perform better than stochastic gradient descent (SGD). Moreover, we derive this PDE from a stochastic homogenization problem, which reveals connections to algorithms for distributed training of deep networks such as Elastic-SGD. Our analysis provides insight into the geometry of the energy landscape and suggests new algorithms based on the non-viscous Hamilton-Jacobi PDE that can effectively tackle the high dimensionality of modern neural networks. Joint work with Pratik Chaudhari, Adam Oberman, Stanley Osher and Guillaume Carlier. Preprint: arXiv:1704.04932.
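For readers who want the concrete form of the equations referred to above, the following is a minimal sketch following the notation of the cited preprint (arXiv:1704.04932), where u is the relaxed loss, f the original training loss, and beta^{-1} a viscosity/temperature parameter; these symbols come from that preprint, not from this abstract:

\[
\partial_t u \;=\; -\tfrac{1}{2}\,\lvert \nabla u \rvert^2 \;+\; \tfrac{\beta^{-1}}{2}\,\Delta u,
\qquad u(x,0) = f(x),
\]
with the non-viscous Hamilton-Jacobi equation recovered in the limit \(\beta^{-1}\to 0\):
\[
\partial_t u \;=\; -\tfrac{1}{2}\,\lvert \nabla u \rvert^2,
\qquad u(x,0) = f(x).
\]

In the preprint, the non-viscous equation admits an explicit Hopf-Lax (inf-convolution) solution, which is part of what makes the resulting algorithms tractable in high dimensions.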

This presentation is part of Minisymposium "MS70 - Innovative Challenging Applications in Imaging Sciences (2 parts)",
organized by: Roberto Mecca (University of Bologna and University of Cambridge), Giulia Scalet (Dept. Civil Engineering and Architecture, University of Pavia), Federica Sciacchitano (Dept. Mathematics, University of Genoa).

Authors:
Stefano Soatto (University of California, Los Angeles)
Pratik Chaudhari (University of California, Los Angeles)
Keywords:
deep learning, machine learning, neural networks, nonlinear optimization, partial differential equation models, stochastic gradient descent