Approximation by neural networks
We follow these notes.
The topic of approximation by neural networks is vast. The following overview is not comprehensive and skips many important considerations and valuable contributions.
If you have important suggestions for additions, please email me.
The presentation of the Universal Approximation Theorem is based on this article, which aims to give an elementary introduction using only a first course in real analysis.
The first proof of the Universal Approximation Theorem was given by Cybenko in 1989 for activation functions which tend to 0 at negative infinity and to 1 at positive infinity. It uses much heavier functional analysis and focuses on neural networks with a single hidden layer.
Leshno, Lin, Pinkus and Schocken demonstrated in 1993 that any non-polynomial activation function can be used for neural networks with the universal approximation property, also in the multi-layer case. The proof is based on approximating monomials in x by derivatives (d/dr)^k sigma (rx) at r=0 with difference quotients. With this, any polynomial can be approximated, and then any continuous function by the Stone-Weierstrass theorem. If sigma is not smooth, the argument requires an additional approximation by mollification. Of course, polynomial activation functions lead to neural networks which represent polynomials of fixed degree (depending on sigma and the network depth), so they cannot be used for neural networks.
An interesting proof of universal approximation for ReLU activation and deep narrow networks was given by Boris Hanin in 2019, based on the observation that such networks allow us to take the maximum of linear functions. This allows approximation of convex functions, and general functions by the difference of two convex functions.
The Universal Approximation Theorem at most implies that neural networks are a reasonable model, but does not suggest that they are superior to other methods (e.g. polynomials, splines, wavelets, Fourier expansions, ...). In 1993, Barron constructed a function class which is so large that every linear method of approximation (linear in the parameters, not the spatial variable!) suffers from the curse of dimensionality, while neural networks with a single hidden layer can approximate all functions in this class with a dimension-independent rate.
Barron constructed the function class using Fourier transforms. The result is universal for all sigmoidal activation functions (i.e. functions which tend to 1 and 0 respectively at positive and negative infinity), and really bases its properties on the Heaviside function. A precise analysis for ReLU activation was given in terms of another integral transform, the Radon transform, by Ongie, Willet, Soudry and Srebro in 2020.
The function space corresponding to infinitely wide ReLU networks with a single hidden layer has been introduced several times independently and is referred to as F_1, Barron space, Radon BV, and the variation space of the ReLU dictionary by different authors. An introduction with some interesting pointwise properties can be found in this article and the references cited, while some interesting properties of the function space were explored by Siegel and Xu in a series of articles, also for some more general activation functions.
The phenomenon of depth separation describes a vast difference in behavior between neural networks with L and L' > L layers.
A particularly striking example of depth separation was given by Maiorov and Pinkus in 1998, who prove that there exists an activation function sigma: R -> R which is analytic, bounded, monotone increasing, such that a neural network with two hidden layers and a fixed number of parameters can approximate any target function to arbitrary accuracy on a compact set. Nothing like this can be done with a single hidden layer. We can think of this neural network as a kind of space-filling 'curve' in the space of continuous functions. The proof involves a deep result, the Kolmogorov-Arnold representation theorem (but is fairly straight-forward if you take that for granted).
In 2015, Telgarsky constructed classification problems for n=2^k data points in one dimension on which a neural network with 2L layers and width 2 achieves perfect classification/zero error, whereas every neural network with L layers needs approximately 2^(k/L) neurons per layer to achiever error <1/6. The proof is based on the fact that sawtooth functions can be approximated much better by compositions than by linear combinations of ReLUs.
Eldan and Shamir demonstrated in 2016 that there exists a function (indexed by dimension) which can be approximated well by a neural network with three hidden layers and a number of neurons which grows polynomially in the dimension d, but requires exponentially many neurons when approximated by a neural network with a single hidden layer. The proof that the function cannot be approximated is based on the isometry property of the Fourier transform.
A depth separation of a different flavor was constructed by E and Wojtowytsch in 2020. There it is proved that even infinitely wide neural networks with a single hidden layer (but bounded coefficients) cannot represent functions with a curved 'singular set' (set of points where the function fails to be differentiable). Such functions can, on the other hand, easily be represented by the composition of two infinitely wide neural networks f_2 o f_1, e.g. for f_1(x) = ||x||_2 and f_2(z) = |1-z|.
There are several topics concerning neural network approximation which would work well for a literature review/article summary/research project. This is by no means an exhaustive list, but can be taken as a starting point.
Continuous depth ResNets ('neural ODEs')
Approximation of functions in smoothness classes (Hölder or Sobolev spaces)
Approximation of function on lower-dimensional manifolds. In this context, we speak of beating the curse of dimensionality if only the manifold dimension matters, not that of the ambient space (which is typically significantly higher)
Function spaces designed for neural networks
Structure-preserving approximation. For example, when approximating a function which is invariant under the action of a group on the data domain, can it be approximated by networks which are also invariant? Or, if the function is convex, can it be approximated by neural networks which are convex? How does the theory apply to convolutional networks?
How does the activation function impact the functions which can be approximated by neural networks?
Other neural network architectures (e.g. RNNs and LSTMs for sequence data)