Video of the Talk: https://youtu.be/Q9R6qm5iv1k
To participate enter with the following Zoom-Link:
Meeting-ID: 913 8324 6249
In machine learning, and especially in deep learning, one algorithm and its many variants are used almost universally for training large, non-linear models: stochastic gradient descent (SGD).
Applying an SGD method to minimize an objective gives rise to a discrete-time process of estimated parameter values. While the mathematical description is fairly simple, the behavior of the algorithm generally is not. To better understand the dynamics of the estimated values, it is reasonable to approximate the discrete-time process with the solution of a differential equation. The resulting gradient flow equation describes the mean evolution of the SGD process very well. However, it does not account for the noise inherent in the SGD method.
For example, it does not distinguish between different mini-batch sizes, or between an infinite stream of fresh data and a finite sample of data. To rectify this, one can add a noise term to the gradient flow equation, turning it into a so-called stochastic differential equation. A solution to the resulting equation is called a diffusion approximation to SGD.
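For orientation, the three objects mentioned above can be written down in one common form from the literature. The notation here is an assumption and not necessarily that of the talk: $\eta$ is the learning rate, $f$ the objective, $\nabla f_{\gamma_k}$ an unbiased mini-batch estimate of $\nabla f$, $\Sigma(\theta)$ the covariance of that estimate, and $W_t$ a Brownian motion. Continuous time $t$ is matched to the iteration count via $t = k\eta$.

```latex
% SGD iteration (discrete time)
\theta_{k+1} = \theta_k - \eta \, \nabla f_{\gamma_k}(\theta_k),
\qquad \mathbb{E}\big[\nabla f_{\gamma_k}(\theta)\big] = \nabla f(\theta)

% Gradient flow (deterministic; contains neither the noise nor the batch size)
\frac{d\Theta_t}{dt} = -\nabla f(\Theta_t), \qquad \Theta_0 = \theta_0

% First-order diffusion approximation (stochastic differential equation)
d\Theta_t = -\nabla f(\Theta_t)\, dt
          + \sqrt{\eta}\, \Sigma(\Theta_t)^{1/2}\, dW_t,
\qquad \theta_k \approx \Theta_{k\eta}
```

The mini-batch size enters only through $\Sigma$, which is why the gradient flow equation cannot see it while the diffusion approximation can.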
In this talk, we describe how to explicitly calculate and compare the errors of gradient flow and the so-called first-order diffusion approximation. Further, we show that one can find an even better, second-order diffusion approximation. Finally, some applications of diffusion approximations are explored.
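The difference between gradient flow and a first-order diffusion approximation can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the method of the talk: the toy objective f(theta) = 0.5 * E[(theta - X)^2] with X ~ N(mu, sigma^2), the function names, and all parameter values are hypothetical choices for illustration. In this model the mini-batch gradient has variance sigma^2 / batch, so the diffusion approximation is taken as dTheta = -(Theta - mu) dt + sqrt(eta * sigma^2 / batch) dW and simulated with the Euler-Maruyama scheme.

```python
import math
import random
import statistics


def sgd_path(theta0, eta, batch, steps, mu=0.0, sigma=1.0, rng=None):
    """Mini-batch SGD on the toy objective f(theta) = 0.5 * E[(theta - X)^2]."""
    rng = rng or random.Random()
    theta = theta0
    for _ in range(steps):
        # Mini-batch mean of fresh samples; the stochastic gradient is theta - xbar.
        xbar = sum(rng.gauss(mu, sigma) for _ in range(batch)) / batch
        theta -= eta * (theta - xbar)
    return theta


def gradient_flow(theta0, t, mu=0.0):
    """Exact solution of dTheta/dt = -(Theta - mu); deterministic, no noise."""
    return mu + (theta0 - mu) * math.exp(-t)


def diffusion_path(theta0, eta, batch, steps, mu=0.0, sigma=1.0,
                   substeps=5, rng=None):
    """Euler-Maruyama for dTheta = -(Theta - mu) dt + sqrt(eta*sigma^2/batch) dW."""
    rng = rng or random.Random()
    dt = eta / substeps  # finer than one SGD step (time t = k * eta)
    noise = math.sqrt(eta * sigma ** 2 / batch)
    theta = theta0
    for _ in range(steps * substeps):
        theta += -(theta - mu) * dt + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return theta


def end_variances(runs=1000, eta=0.1, batch=4, steps=100, seed=0):
    """Variance of the final iterate over many runs, for SGD and the diffusion."""
    rng = random.Random(seed)
    sgd = [sgd_path(1.0, eta, batch, steps, rng=rng) for _ in range(runs)]
    sde = [diffusion_path(1.0, eta, batch, steps, rng=rng) for _ in range(runs)]
    return statistics.pvariance(sgd), statistics.pvariance(sde)


if __name__ == "__main__":
    v_sgd, v_sde = end_variances()
    print("gradient flow at t = 10:", gradient_flow(1.0, 10.0))
    print("variance of SGD end values:      ", v_sgd)
    print("variance of diffusion end values:", v_sde)
```

In this toy setting the gradient flow solution converges to a single point and so predicts zero spread, whereas the SGD iterates keep fluctuating around the minimizer. The diffusion has stationary variance eta * sigma^2 / (2 * batch), which tracks the spread of SGD for small eta and changes with the batch size.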