New insights and perspectives on the natural gradient method
by
James Martens
2020
Abstract
Natural gradient descent is an optimization method traditionally motivated
from the perspective of information geometry, and works well for many
applications as an alternative to stochastic gradient descent. In this paper we
critically analyze this method and its properties, and show how it can be
viewed as a type of 2nd-order optimization method, with the Fisher information
matrix acting as a substitute for the Hessian. In many important cases, the
Fisher information matrix is shown to be equivalent to the Generalized
Gauss-Newton matrix, which not only approximates the Hessian but also has certain
properties that favor its use over the Hessian. This perspective turns out to
have significant implications for the design of a practical and robust natural
gradient optimizer, as it motivates the use of techniques like trust regions
and Tikhonov regularization. Additionally, we make a series of contributions to
the understanding of natural gradient and 2nd-order methods, including: a
thorough analysis of the convergence speed of stochastic natural gradient
descent (and more general stochastic 2nd-order methods) as applied to convex
quadratics, a critical examination of the oft-used "empirical" approximation of
the Fisher matrix, and an analysis of the (approximate) parameterization
invariance property possessed by natural gradient methods (which we show also
holds for certain other curvature matrices, but notably not the Hessian).
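
As a rough illustration of the update the abstract describes, the sketch below takes natural gradient steps with Tikhonov damping, i.e. it moves the parameters along (F + damping * I)^{-1} g, where F is the Fisher information matrix and g is the loss gradient. Everything concrete here is an assumption made for the example and is not taken from the paper: the two-parameter logistic regression model, the toy dataset, the function names, and the step size and damping values. For this particular model the Fisher matrix happens to coincide with the Hessian of the loss, so the step behaves like a damped Newton step.

```python
import math

# Hypothetical toy data: each input is (bias, feature); labels are 0 or 1.
# The labels are deliberately not linearly separable, so the optimum is finite.
XS = [(1.0, -2.0), (1.0, -1.0), (1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
YS = [0.0, 0.0, 1.0, 0.0, 1.0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_nll(theta, xs, ys):
    """Average negative log-likelihood of a Bernoulli (logistic) model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(theta[0] * x[0] + theta[1] * x[1])
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(xs)


def grad_and_fisher(theta, xs, ys):
    """Loss gradient g and Fisher matrix F, both averaged over the inputs."""
    n = len(xs)
    g = [0.0, 0.0]
    F = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in zip(xs, ys):
        p = sigmoid(theta[0] * x[0] + theta[1] * x[1])
        w = p * (1.0 - p)  # variance of the model's predictive distribution
        for i in range(2):
            g[i] += (p - y) * x[i] / n
            for j in range(2):
                # Expectation over the model's own predictions, not the labels.
                F[i][j] += w * x[i] * x[j] / n
    return g, F


def solve_2x2(A, b):
    """Solve A d = b for a 2x2 matrix A using the closed-form inverse."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


def natural_gradient_step(theta, xs, ys, step_size=1.0, damping=1e-3):
    """One update: theta <- theta - step_size * (F + damping * I)^{-1} g."""
    g, F = grad_and_fisher(theta, xs, ys)
    F_damped = [[F[0][0] + damping, F[0][1]],
                [F[1][0], F[1][1] + damping]]
    d = solve_2x2(F_damped, g)
    return [theta[0] - step_size * d[0], theta[1] - step_size * d[1]]


theta = [0.0, 0.0]
for _ in range(20):
    theta = natural_gradient_step(theta, XS, YS)
print("final parameters:", theta)
print("final mean NLL:", mean_nll(theta, XS, YS))
```

The damping term keeps the linear solve well conditioned when F is nearly singular, which is one reason the abstract ties the 2nd-order view to Tikhonov regularization. Note that F is computed from the model's predicted probabilities rather than from the observed labels; swapping in the labels would give the "empirical" approximation that the abstract singles out for critical examination.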
arXiv:1412.1193v11